From 0d3c928ff2bb4a761049177fbcf8aa4944b77403 Mon Sep 17 00:00:00 2001
From: biondizzle <biondizzle@gmail.com>
Date: Sun, 17 May 2026 13:32:41 +0000
Subject: [PATCH] Update CURRENT_BUG.md: full status through Bug 14, vLLM
 integration status, architecture docs

---
 CURRENT_BUG.md | 214 ++++++++++++++++++++++++++-----------------------
 1 file changed, 115 insertions(+), 99 deletions(-)

diff --git a/CURRENT_BUG.md b/CURRENT_BUG.md
index 454bbeab..941689a4 100644
--- a/CURRENT_BUG.md
+++ b/CURRENT_BUG.md
@@ -1,18 +1,17 @@
 # Current Bug: CuTeDSLMoERunner — Status & Debug History
 
-## Current Status (May 17, 2026 09:01 UTC)
+## Current Status (May 17, 2026 13:30 UTC)
 
-**Bug 8 fixed. Ready for vLLM container test.**
+**vLLM container runs, cudagraph capture succeeds, but model output is garbage (BOS token repeat).**
 
 - ✅ `layertest.py` — 0.988 cosine
 - ✅ `cudagraph_test.py` — capture + replay works
-- ✅ `test_warmup_gs.py` — warmup gs computation works (test script has a pre-existing NameError in safety margin section, not a runner bug)
-- ❌ vLLM server — not yet tested with these fixes
+- ✅ Container builds, loads weights, warmup gs computed (no L2 gs=0)
+- ✅ Cudagraph capture completes (51 sizes, ~15 min)
+- ✅ Server accepts requests, generates tokens
+- ❌ Model output is `<｜begin▁of▁sentence｜>` token repeated — garbage logits
 
-**Fixed in this round:**
-- Bug 8: Global→local expert ID remapping (was causing CUDA_ERROR_ASSERT)
-- Removed `.cpu()` sync from `run()` — `_token_indices` now on GPU, cudagraph-safe
-- Added `_needs_token_refill` flag to handle CuTeDSL JIT GPU memory corruption after first GEMM call
+**Current theory:** Scale assembly layout mismatch between the fixed 128-row-per-expert approach and what the GEMM actually expects. The latest fix pads `slot_hidden` to `num_experts * 128` rows and passes `padded_expert_offsets=[0, 128, 256, ...]` to the GEMM. A build to test this is in progress on the B200.
 
 ---
 
@@ -28,12 +27,6 @@
 1. Scatter x_sf rows into 128-aligned positions in a padded buffer (GPU-only, no CPU sync)
 2. Per-expert: copy 128 rows from padded buffer, `pad_and_swizzle_single()` each expert's block independently, then concatenate
 
-**Key insight from `torch_scaled_grouped_mm.py` line ~1115:** The kernel computes padded offsets internally when `consistent_token_padding=False`:
-```python
-padded_size = round_up(offs[expert_idx] - offs[expert_idx-1], pad_granularity)  # 128
-```
-So the kernel knows each expert's scale data is in a 128-row block.
-
 ### Bug 2: `searchsorted(right=False)` — Wrong Expert Assignment
 
 **Symptom:** Scale data in wrong positions after scatter.
@@ -45,28 +38,19 @@ So the kernel knows each expert's scale data is in a 128-row block.
 expert_assign = torch.searchsorted(expert_offsets[1:], row_indices, right=True)
 ```
 
-**Verified:** Row 4 → expert 1 (correct), rows 0-3 → expert 0 (correct).
-
 ### Bug 3: CuTeDSL `cute.compile` GPU Memory Corruption — CRITICAL
 
 **Symptom:** `_token_indices` was all zeros, making every token map to token 0.
 
-**Root cause:** CuTeDSL's `cute.compile` (JIT compilation) corrupts GPU memory. Tensors allocated on GPU before or during JIT compilation get zeroed. Pre-existing tensors allocated before the JIT survive. This is a bug in the CuTeDSL library.
+**Root cause:** CuTeDSL's `cute.compile` (JIT compilation) corrupts GPU memory. Tensors allocated on GPU before or during JIT compilation get zeroed.
 
-**Impact:** `_token_indices` (int32 on GPU) was zeroed, causing `hidden_states[sorted_token_ids]` to return `hidden_states[0]` for all 8 slots. Every expert saw the same input.
-
-**Fix:** Allocate `_token_indices` on CPU, keep it there. In `run()` and `compute_activation_global_scales()`, index with `sort_idx.cpu()` then move result to GPU:
-```python
-sorted_token_ids = token_indices[sort_idx.cpu()].to(device)
-```
-
-**Warning:** This introduces a CPU-GPU sync (`.cpu()`) which may interfere with cudagraph capture. Needs verification.
+**Fix:** Allocate `_token_indices` with `_fill_token_indices()`, which builds the tensor on CPU and copies it to GPU. Added a `_needs_token_refill` flag to handle GEMM JIT corruption on the first call.
 
 ### Bug 4: `expert_offsets` With Leading 0
 
 **Symptom:** GEMM produced wrong output with correct scale data.
 
-**Root cause:** The runner passed `expert_offsets[:num_experts + 1]` = `[0, 4, 8, 8]` (4 elements with leading 0) but the kernel expects `compute_expert_offsets([4, 4, 0], 3)` = `[4, 8, 8]` (3 elements, cumulative sum without leading 0).
+**Root cause:** The runner passed `expert_offsets[:num_experts + 1]` = `[0, 4, 8, 8]` (4 elements with leading 0) but the kernel expects `[4, 8, 8]` (cumulative sum without leading 0).
 
 **Fix:** Pass `expert_offsets[1:num_experts + 1]` to the GEMM.
 
@@ -74,88 +58,97 @@ sorted_token_ids = token_indices[sort_idx.cpu()].to(device)
 
 **Symptom:** Block scales all saturate at float8 max (448), producing garbage quantization.
 
-**Root cause:** The checkpoint's `input_scale` (~0.000286) is a calibration value computed from a different input magnitude (amax ≈ 0.77) than what runtime produces (amax ≈ 8.17). Too-small gs → x/gs has values up to ~13000 → block_amax/6 ≈ 2174 → overflows float8_e4m3fn max of 448 → saturated block scales → garbage.
+**Root cause:** The checkpoint's `input_scale` (~0.000286) is a calibration value computed from a different input magnitude (amax ≈ 0.77) than what runtime produces (amax ≈ 8.17). Too-small gs → block scale overflow → garbage.
 
-**Fix:** `compute_activation_global_scales()` warmup method that runs `quantize_to_nvfp4` (dynamic gs with `.max()`) before cudagraph capture to get the exact gs values for L1 and L2.
+**Fix:** `compute_activation_global_scales()` warmup method that runs `quantize_to_nvfp4` (dynamic gs with `.max()`) before cudagraph capture.
 
 ### Bug 6: L1 and L2 Need Separate Activation Global Scales
 
 **Symptom:** L2 output was garbage even with correct L1 gs.
 
-**Root cause:** After SiLU(gate)*up, the activation has amax ~286. The L1 gs (from input amax ~8) is 30x too small for L2, causing even worse block scale saturation.
+**Root cause:** After SiLU(gate)*up, the activation has amax ~286. The L1 gs is 30x too small for L2.
 
-**Fix:** `compute_activation_global_scales()` computes L1 gs from the input, runs the L1 GEMM, then computes L2 gs from the actual L1 output (after SiLU*up).
+**Fix:** `compute_activation_global_scales()` computes L2 gs from the actual L1 output (after SiLU*up).
 
 ### Bug 7: L1 and L2 Need Separate Padded Scale Buffers
 
 **Symptom:** IndexError when quantizing L2 activation — K_sf differs between L1 (448) and L2 (192).
 
-**Root cause:** `padded_x_sf_buf` was allocated with L1's K_sf (448). When L2's x_sf has K_sf=192, the buffer size mismatch causes issues.
-
 **Fix:** Separate `_padded_x_sf_buf_l1` and `_padded_x_sf_buf_l2`, plus separate `_per_expert_scale_bufs_l1` and `_per_expert_scale_bufs_l2`.
 
 ### Bug 8: Global→Local Expert ID Mismatch — CUDA_ERROR_ASSERT
 
-**Symptom:** `IndexKernel.cu:111` assertion `-sizes[i] <= index && index < sizes[i]` failed, cascading into CUDA_ERROR_ASSERT (710) across all workers. vLLM server crash on first inference.
+**Symptom:** `IndexKernel.cu:111` assertion failed, cascading into CUDA_ERROR_ASSERT (710) across all workers.
 
-**Root cause:** With expert parallelism (EP=8), `topk_ids` contains **global** expert IDs (0-255), but `CuTeDSLMoERunner` treated them as **local** IDs (0-31). Each rank only owns 32 experts (`num_experts=32`), so tokens assigned to experts 32-255 produced:
-1. Wrong `expert_offsets` computation (tokens matched no local expert → zero counts for many experts)
-2. Out-of-bounds scatter indices in `_assemble_scales_cudagraph_safe` (`dst_rows` exceeded `padded_x_sf` buffer size)
-3. CUDA device-side assert → all subsequent CUDA calls fail with error 710
+**Root cause:** With EP=8, `topk_ids` contains global expert IDs (0-255), but `CuTeDSLMoERunner` treated them as local IDs (0-31 nominally, or 0-47 given the 48 local experts per rank seen in logs).
 
-The layertest never hit this because it uses local expert IDs directly (no EP).
-
-**Fix:** 
-1. Added `experts_start_idx` param to `CuTeDSLMoERunner`
-2. In `run()`: remap global→local via `local_ids = topk_ids - experts_start_idx`, mask non-local tokens with zero weight, clamp IDs to valid range
-3. Pass `experts_start_idx` from `deepseek_v4.py` (which already stores it from EP setup)
+**Fix:** Added `experts_start_idx` param; in `run()`, remap global→local and mask non-local tokens.
 
 ### Bug 8b: `.cpu()` Sync Breaking Cudagraph Compatibility
 
-**Symptom:** `sort_idx.cpu()` in `run()` — a CPU-GPU synchronization point that cudagraph cannot capture.
+**Fix:** Moved `_token_indices` to GPU, added `_fill_token_indices()` (CPU→GPU copy), `_needs_token_refill` for GEMM JIT.
 
-**Root cause:** `_token_indices` was kept on CPU to avoid CuTeDSL JIT GPU memory corruption (Bug 3). But cudagraph requires all ops to be GPU-only.
+### Bug 9: `padded_x_sf` Buffer Too Small — Index Out of Bounds
 
-**Fix:**
-1. Moved `_token_indices` to GPU
-2. Added `_fill_token_indices()` method to refill the tensor after potential corruption
-3. Added `_needs_token_refill` flag — set after `_ensure_stacked()` (weight JIT), checked/cleared after first `run()` call (GEMM JIT). After both JITs have fired, the tensor is stable.
+**Symptom:** `IndexKernel.cu:111` OOB in scale assembly scatter. `dst_rows` exceeded buffer size.
+
+**Root cause:** Buffer was sized for `num_experts * 128` rows, but scatter positions were computed from the actual token distribution (not a fixed 128 per expert). With 8192 tokens and top_k=6, `dst_rows` could exceed the 6144-row buffer (48 experts × 128 rows).
+
+**Fix (attempted):** Sized buffer for `max_num_tokens * top_k` rows. Later reverted to `num_experts * 128` with fixed 128-row-per-expert scatter layout.
+
+### Bug 10: Wrong `top_k` and `max_num_tokens` Defaults
+
+**Symptom:** `_token_indices` max=6143 instead of 8191 (built with top_k=8, actual top_k=6).
+
+**Root cause:** `CuTeDSLMoERunner` defaulted to `max_num_tokens=8192, top_k=8`, but vLLM uses top_k=6. `deepseek_v4.py` didn't pass these values.
+
+**Fix:** Pass `max_num_tokens` and `top_k` from `deepseek_v4.py` to the runner constructor.
+
+### Bug 11: Full-Buffer Swizzle Produced Wrong GEMM Input
+
+**Symptom:** L2 gs=0.0 on EP5/EP7 during warmup. Model produced BOS token.
+
+**Root cause:** The Blackwell 32_4_4 swizzle was applied to the entire padded buffer at once, but the GEMM expects per-expert swizzled blocks. The combined swizzle layout doesn't match `expert_offsets` indexing.
+
+**Fix (in progress):** Reverted to per-expert swizzle with fixed 128-row slots.
+
+### Bug 12: `torch.full()` During Cudagraph Capture
+
+**Symptom:** `cudaErrorStreamCaptureUnsupported` on all 8 workers during cudagraph capture.
+
+**Root cause:** `torch.full()` in `run()` allocates a new tensor during stream capture, which CUDA doesn't allow.
+
+**Fix:** Pre-allocate `_l1_gsa_buf` and `_l2_gsa_buf` and use `.fill_()` instead of `torch.full()`. Also pre-allocate `_output_buf` and `_row_indices_buf`.
+
+### Bug 13: Warmup Passed Global Expert IDs Instead of Local
+
+**Symptom:** L2 gs=0.0 during warmup on non-zero EP ranks (observed on EP5/EP7); EP0 was unaffected.
+
+**Root cause:** `_warmup_activation_global_scales()` passed global IDs (e.g. 336+) to `compute_activation_global_scales()`, which matches against `expert_id_range` (0..47). No tokens matched → zero L1 GEMM output → L2 gs=0.
+
+**Fix:** Pass local expert IDs (0..num_experts-1) in warmup.
+
+### Bug 14 (CURRENT): GEMM Scale Layout Mismatch — 128-Row Fixed vs Variable
+
+**Symptom:** Model generates BOS token repeatedly. Tokens are produced but logits are garbage.
+
+**Root cause:** Scale assembly places data at fixed `e*128` offsets (128 rows per expert). But the GEMM reads `scale_a` according to `expert_offsets` (real token counts, e.g. expert 0 = 500 rows). When expert 0 has 500 tokens, GEMM reads `scale_a[0:500]` but only rows 0-127 have valid scale data. Rows 128-499 are zeros → GEMM produces zeros for those tokens → garbage output.
+
+**Fix (in progress):** Pad `slot_hidden` to `num_experts * 128` rows (128 per expert) and pass `padded_expert_offsets=[0, 128, 256, ...]` to the GEMM, so the GEMM processes exactly 128 rows per expert. Output for padding rows is discarded by `scatter_add`. This uses the pre-allocated `_padded_hidden_buf`, `_padded_activated_buf`, and `_padded_expert_offsets_buf`.
 
 ---
 
-## Debug Methodology — How We Got Here
+## vLLM Integration Status
 
-### Step 1: Identified the CuTeDSL kernel works (layertest = 0.988)
-
-The layertest uses `moe_pipeline.run_nvfp4_moe` with `quantize_to_nvfp4` (dynamic gs) and `assemble_scales_2d_side` (per-expert split). This is the reference implementation.
-
-### Step 2: Wrote test_runner_vs_pipeline.py
-
-Compared `runner.run()` vs `run_nvfp4_moe()` with same weights and inputs. Found runner produces all zeros.
-
-### Step 3: Wrote test_scale_assembly.py
-
-Compared `_assemble_scales_cudagraph_safe` vs `assemble_scales_2d_side`. Found data mismatch (global vs per-expert swizzle).
-
-### Step 4: Fixed scale assembly
-
-Rewrote `_assemble_scales_cudagraph_safe` with scatter + per-expert swizzle. Scale data now matches reference.
-
-### Step 5: Found GEMM still produces zeros with correct scales
-
-Isolated the issue: GEMM with the exact same inputs gives cosine 1.0, but runner gives 0.18. The problem was `expert_offsets` format (leading 0).
-
-### Step 6: Fixed expert_offsets, found token_indices corruption
-
-After fixing expert_offsets, cosine improved to 0.35. Traced to `_token_indices` being all zeros (CuTeDSL GPU corruption).
-
-### Step 7: Found and fixed the GPU corruption
-
-Moved `_token_indices` to CPU. Cosine jumped to 0.46 with default gs, 0.97 with warmup gs.
-
-### Step 8: Wrote test_warmup_gs.py
-
-Verified warmup gs computation, tested safety margins, tested different inputs. Found 1.0x safety (no margin) gives best results.
+| Component | Status | Notes |
+|-----------|--------|-------|
+| Weight loading | ✅ | Direct NVFP4 path, no BF16 round-trip |
+| Weight stacking | ✅ | `make_b_k_major` + `assemble_scales_3d_side` |
+| Global→local ID remap | ✅ | `experts_start_idx`, mask non-local tokens |
+| Warmup gs computation | ✅ | Per-layer, local expert IDs, L1+L2 gs |
+| Scale assembly | ⚠️ | 128-row fixed layout, pending GEMM alignment fix |
+| Cudagraph capture | ✅ | No dynamic allocations, no CPU syncs |
+| Model output | ❌ | Garbage (BOS repeat); suspected scale/GEMM layout mismatch (Bug 14) |
 
 ---
 
@@ -165,33 +158,56 @@ Verified warmup gs computation, tested safety margins, tested different inputs.
 |------|---------|
 | `tests/layertest.py` | Reference: moe_pipeline with dynamic gs, 3 experts, layer 0. Must pass (≥0.98 cosine). |
 | `tests/cudagraph_test.py` | CuTeDSLMoERunner cudagraph capture + replay. Must pass. |
-| `tests/test_runner_vs_pipeline.py` | Compare runner.run() vs moe_pipeline. With correct gs should be ≥0.97. |
-| `tests/test_scale_assembly.py` | Compare cudagraph-safe vs reference scale assembly. Data must match. |
-| `tests/test_warmup_gs.py` | Warmup gs computation, safety margin sweep, different input test. |
-| `tests/test_scale_debug.py` | Byte-level scale debug (can be cleaned up). |
+| `tests/test_warmup_gs.py` | Warmup gs computation, safety margin sweep. |
+| `tests/test_runner_vs_pipeline.py` | Compare runner.run() vs moe_pipeline. |
+| `tests/test_scale_assembly.py` | Compare cudagraph-safe vs reference scale assembly. |
 
 **Run order after any code change:**
 1. `python3 tests/layertest.py` — must pass
 2. `python3 tests/cudagraph_test.py` — must pass
-3. `python3 tests/test_warmup_gs.py` — should show ≥0.97 cosine
 
 ---
 
-## Files Modified
+## Key Architecture: CuTeDSL NVFP4 MoE
 
-| File | Changes |
-|------|---------|
-| `vllm/nvfp4_cutedsl.py` | All 7 bug fixes, `compute_activation_global_scales()` warmup, CPU token_indices |
-| `vllm/patches/deepseek_v4.py` | Removed checkpoint `input_scale` → activation global_scale mapping |
+### Data Flow
+```
+hidden_states (BF16) ──→ global→local remap ──→ sort by expert
+    │
+    ├── L1 (gate+up)
+    │   quantize_activation_nvfp4 → x_fp4, x_sf
+    │   _assemble_scales_cudagraph_safe → scale_a (swizzled)
+    │   run_nvfp4_grouped_gemm → l1_out (BF16)
+    │
+    ├── SiLU(gate) * up → activated
+    │
+    ├── L2 (down)
+    │   quantize_activation_nvfp4 → l2_x_fp4, l2_x_sf
+    │   _assemble_scales_cudagraph_safe → scale_a (swizzled)
+    │   run_nvfp4_grouped_gemm → l2_out (BF16)
+    │
+    └── scatter_add → y (BF16)
+```
+
+### Cudagraph Constraints
+- No `.item()`, `.cpu()`, `.tolist()` — zero CPU-GPU syncs
+- No `torch.zeros/ones/full/empty/arange` during capture — pre-allocate everything
+- No dynamic shapes — `num_tokens` equals the captured budget
+- Per-expert Python loops are OK (fixed `num_experts`, unrolled at capture time)
+- `pad_and_swizzle_single` is OK on pre-padded 128×4-aligned buffers (no internal allocation)
+
+### EP Configuration (DeepSeek-V4-Pro on 8×B200)
+- 256 total experts, top_k=6
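+- EP=8 → nominally 32 local experts per rank, though logs show 48 per rank (not yet reconciled with the 256-expert total)
+- `experts_start_idx` = rank * 32 (0, 32, 64, ..., 224) under the nominal split; re-check against the 48-per-rank logs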
+- `max_num_tokens` from `scheduler_config.max_num_batched_tokens`
 
 ---
 
-## Next Steps for vLLM Integration
+## Repo Info
 
-1. **Add warmup call in `deepseek_v4.py`:** After `finalize_weights()`, call `runner.compute_activation_global_scales()` with a sample input (e.g., 1 token of random data). This must happen before cudagraph capture.
-
-2. **Verify cudagraph compatibility:** The `sort_idx.cpu()` call in `run()` is a CPU-GPU sync. Cudagraph may not support this. If it doesn't, need to find a way to keep `_token_indices` on GPU while avoiding the CuTeDSL corruption.
-
-3. **Test the vLLM container:** Spin up the server and test with a simple prompt. The output should be mostly correct (0.97 cosine ≈ near-perfect output).
-
-4. **Optimize warmup:** The current warmup runs a full forward pass (L1 + L2 GEMM). This is slow (~minutes due to JIT). Consider caching the gs values or computing them more efficiently.
+- **Kernel:** `sweetapi.com/biondizzle/nvfp4-megamoe-kernel` (master)
+- **Local:** `~/dev/nvfp4-megamoe-kernel/`
+- **B200:** `/root/nvfp4-megamoe-kernel/`
+- **Model:** `/root/nvidia-meeting/DeepSeek-V4-Pro-NVFP4` (read-only)
+- **Never edit on B200 directly** — edit locally → commit → push → pull on B200