From e2f33596a23f52e935b65569119d0ba91cbd0ad5 Mon Sep 17 00:00:00 2001
From: biondizzle <biondizzle@gmail.com>
Date: Sun, 17 May 2026 15:46:13 +0000
Subject: [PATCH] Update CURRENT_BUG.md: status through Bug 20, fixed-layout
 padding architecture

---
 CURRENT_BUG.md | 215 ++++++++++++++++++++++++-------------------------
 1 file changed, 106 insertions(+), 109 deletions(-)
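
Note on the row counts quoted in Bugs 15-17: the figures 3072, 6144 and 49152 all fall out of the fixed-layout arithmetic described in the "Key Architecture" section. The sketch below reproduces them in plain Python. The `max_chunks_per_expert` formula is an assumption (an even split of `max_num_tokens * top_k` slots across experts, rounded up to whole 128-row chunks); it is chosen because it matches the 1-chunk and 8-chunk figures in this log, not because it is confirmed to be what the runner computes. The helper names and the clamp to the last row of an expert's region are likewise illustrative.

```python
ROWS_PER_CHUNK = 128  # rows per swizzled block, as described in the patch


def ceil_div(a: int, b: int) -> int:
    """Integer ceiling division."""
    return -(-a // b)


def max_chunks_per_expert(max_num_tokens: int, top_k: int, num_experts: int) -> int:
    # ASSUMPTION: even-split slot budget per expert, rounded up to whole chunks.
    slots_per_expert = ceil_div(max_num_tokens * top_k, num_experts)
    return max(1, ceil_div(slots_per_expert, ROWS_PER_CHUNK))


def total_padded_rows(max_num_tokens: int, top_k: int, num_experts: int) -> int:
    # num_experts * max_chunks * 128, the GEMM input/output row count.
    chunks = max_chunks_per_expert(max_num_tokens, top_k, num_experts)
    return num_experts * chunks * ROWS_PER_CHUNK


def padded_dst(expert_assign, local_row, max_chunks):
    # padded_dst = expert_assign * max_rows_per_expert + clamped_local_row
    # ASSUMPTION: the clamp pins overflow rows to the last row of the expert's region.
    max_rows = max_chunks * ROWS_PER_CHUNK
    return [e * max_rows + min(r, max_rows - 1) for e, r in zip(expert_assign, local_row)]


if __name__ == "__main__":
    for budget in (512, 8192):
        print(
            f"max_num_tokens={budget}: "
            f"slots={budget * 6}, "
            f"max_chunks={max_chunks_per_expert(budget, 6, 48)}, "
            f"padded_rows={total_padded_rows(budget, 6, 48)}"
        )
```

Under these assumptions a 512-token budget gives 1 chunk per expert and 6144 padded rows, while an 8192-token budget gives 8 chunks and 49152 rows, which is consistent with the Bug 17 hypothesis that the 512 cap is not reaching the runner.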

diff --git a/CURRENT_BUG.md b/CURRENT_BUG.md
index 941689a4..b4b86534 100644
--- a/CURRENT_BUG.md
+++ b/CURRENT_BUG.md
@@ -1,17 +1,15 @@
 # Current Bug: CuTeDSLMoERunner — Status & Debug History
 
-## Current Status (May 17, 2026 13:30 UTC)
+## Current Status (May 17, 2026 15:45 UTC)
 
-**vLLM container runs, cudagraph capture succeeds, but model output is garbage (BOS token repeat).**
+**vLLM container crashes during cudagraph warmup with shape mismatch. Debug build in progress.**
 
 - ✅ `layertest.py` — 0.988 cosine
 - ✅ `cudagraph_test.py` — capture + replay works
 - ✅ Container builds, loads weights, warmup gs computed (no L2 gs=0)
-- ✅ Cudagraph capture completes (51 sizes, ~15 min)
-- ✅ Server accepts requests, generates tokens
-- ❌ Model output is `<｜begin▁of▁sentence｜>` token repeated — garbage logits
+- ❌ Container crashes during cudagraph warmup: shape mismatch `[49152, 7168]` vs `[3072, 7168]`
 
-**Current theory:** Scale assembly layout mismatch between the fixed 128-row-per-expert approach and what the GEMM actually expects. The latest fix pads slot_hidden to `num_experts * 128` rows and passes `padded_expert_offsets=[0, 128, 256, ...]` to the GEMM. Build is in progress on B200 to test.
+**Active investigation:** The GEMM output has 49152 rows (48 experts × 8 chunks × 128), but `padded_dst` only indexes 3072 rows (512 tokens × top_k 6). This implies `max_chunks_per_expert = 8` instead of the expected 1 (with `max_num_tokens` capped at 512). The cap to 512 is likely not reaching the runner. A debug print has been added to verify.
 
 ---
 
@@ -21,120 +19,143 @@
 
 **Symptom:** GEMM produced all zeros even with correct global_scale.
 
-**Root cause:** The original `_assemble_scales_cudagraph_safe` called `pad_and_swizzle_single()` on the ENTIRE padded buffer (all experts concatenated). But the kernel expects each expert's 128-row block to be swizzled independently (matching `assemble_scales_2d_side` which pads+swizzles each expert separately before concatenation).
+**Root cause:** `_assemble_scales_cudagraph_safe` called `pad_and_swizzle_single()` on the ENTIRE padded buffer. The kernel expects each expert's 128-row block swizzled independently.
 
-**Fix:** Two-phase approach:
-1. Scatter x_sf rows into 128-aligned positions in a padded buffer (GPU-only, no CPU sync)
-2. Per-expert: copy 128 rows from padded buffer, `pad_and_swizzle_single()` each expert's block independently, then concatenate
+**Fix:** Two-phase approach: scatter into 128-aligned positions, then per-expert swizzle and concatenate.
 
 ### Bug 2: `searchsorted(right=False)` — Wrong Expert Assignment
 
-**Symptom:** Scale data in wrong positions after scatter.
-
-**Root cause:** `torch.searchsorted([4, 8, 8], 4, right=False)` returns 0, assigning row 4 (expert 1's first token) to expert 0.
-
-**Fix:** Changed to `right=True`:
-```python
-expert_assign = torch.searchsorted(expert_offsets[1:], row_indices, right=True)
-```
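+**Fix:** Changed `torch.searchsorted(..., right=False)` to `right=True`, so a row index equal to an expert boundary is assigned to the next expert rather than the previous one.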
 
 ### Bug 3: CuTeDSL `cute.compile` GPU Memory Corruption — CRITICAL
 
-**Symptom:** `_token_indices` was all zeros, making every token map to token 0.
+**Symptom:** `_token_indices` was all zeros.
 
-**Root cause:** CuTeDSL's `cute.compile` (JIT compilation) corrupts GPU memory. Tensors allocated on GPU before or during JIT compilation get zeroed.
+**Root cause:** CuTeDSL's `cute.compile` (JIT) corrupts GPU memory. Tensors allocated on GPU before/during JIT get zeroed.
 
-**Fix:** Allocate `_token_indices` with `_fill_token_indices()` which builds on CPU and copies to GPU. Added `_needs_token_refill` flag to handle GEMM JIT corruption on first call.
+**Fix:** `_fill_token_indices()` builds the indices on CPU and copies them to GPU. A `_needs_token_refill` flag triggers a refill after GEMM JIT corruption on the first call.
 
 ### Bug 4: `expert_offsets` With Leading 0
 
-**Symptom:** GEMM produced wrong output with correct scale data.
-
-**Root cause:** The runner passed `expert_offsets[:num_experts + 1]` = `[0, 4, 8, 8]` (4 elements with leading 0) but the kernel expects `[4, 8, 8]` (cumulative sum without leading 0).
-
 **Fix:** Pass `expert_offsets[1:num_experts + 1]` to the GEMM.
 
 ### Bug 5: Checkpoint `input_scale` Is Wrong for Activation Global Scale
 
-**Symptom:** Block scales all saturate at float8 max (448), producing garbage quantization.
+**Root cause:** Checkpoint `input_scale` (~0.000286) is a calibration value. Too-small gs → block scale overflow → garbage.
 
-**Root cause:** The checkpoint's `input_scale` (~0.000286) is a calibration value computed from a different input magnitude (amax ≈ 0.77) than what runtime produces (amax ≈ 8.17). Too-small gs → block scale overflow → garbage.
-
-**Fix:** `compute_activation_global_scales()` warmup method that runs `quantize_to_nvfp4` (dynamic gs with `.max()`) before cudagraph capture.
+**Fix:** `compute_activation_global_scales()` warmup method.
 
 ### Bug 6: L1 and L2 Need Separate Activation Global Scales
 
-**Symptom:** L2 output was garbage even with correct L1 gs.
-
-**Root cause:** After SiLU(gate)*up, the activation has amax ~286. The L1 gs is 30x too small for L2.
-
-**Fix:** `compute_activation_global_scales()` computes L2 gs from the actual L1 output (after SiLU*up).
+**Fix:** `compute_activation_global_scales()` computes L2 gs from L1 output after SiLU*up.
 
 ### Bug 7: L1 and L2 Need Separate Padded Scale Buffers
 
-**Symptom:** IndexError when quantizing L2 activation — K_sf differs between L1 (448) and L2 (192).
-
-**Fix:** Separate `_padded_x_sf_buf_l1` and `_padded_x_sf_buf_l2`, plus separate `_per_expert_scale_bufs_l1` and `_per_expert_scale_bufs_l2`.
+**Fix:** Separate `_padded_x_sf_buf_l1` and `_padded_x_sf_buf_l2`, plus separate per-expert scale bufs.
 
 ### Bug 8: Global→Local Expert ID Mismatch — CUDA_ERROR_ASSERT
 
-**Symptom:** `IndexKernel.cu:111` assertion failed, cascading into CUDA_ERROR_ASSERT (710) across all workers.
+**Symptom:** `IndexKernel.cu:111` OOB assertion, cascading CUDA_ERROR_ASSERT (710).
 
-**Root cause:** With EP=8, `topk_ids` contains global expert IDs (0-255), but `CuTeDSLMoERunner` treated them as local IDs (0-31/48).
+**Root cause:** With EP=8, `topk_ids` contains global IDs (0-255), but the runner treated them as local IDs (0-31/48).
 
-**Fix:** Added `experts_start_idx` param; in `run()`, remap global→local and mask non-local tokens.
+**Fix:** Added `experts_start_idx`, remap global→local, mask non-local tokens.
 
 ### Bug 8b: `.cpu()` Sync Breaking Cudagraph Compatibility
 
-**Fix:** Moved `_token_indices` to GPU, added `_fill_token_indices()` (CPU→GPU copy), `_needs_token_refill` for GEMM JIT.
+**Fix:** Moved `_token_indices` to GPU, `_fill_token_indices()` (CPU→GPU copy).
 
 ### Bug 9: `padded_x_sf` Buffer Too Small — Index Out of Bounds
 
-**Symptom:** `IndexKernel.cu:111` OOB in scale assembly scatter. `dst_rows` exceeded buffer size.
+**Root cause:** Buffer sized for `num_experts * 128` rows, but scatter positions exceeded this.
 
-**Root cause:** Buffer was sized for `num_experts * 128` rows, but scatter positions were computed from actual token distribution (not fixed 128 per expert). With 8192 tokens and top_k=6, dst_rows could exceed 6144.
-
-**Fix (attempted):** Sized buffer for `max_num_tokens * top_k` rows. Later reverted to `num_experts * 128` with fixed 128-row-per-expert scatter layout.
+**Fix (iterative):** Multiple iterations of sizing and layout fixes. See Bugs 11, 14.
 
 ### Bug 10: Wrong `top_k` and `max_num_tokens` Defaults
 
-**Symptom:** `_token_indices` max=6143 instead of 8191 (built with top_k=8, actual top_k=6).
+**Root cause:** Runner defaulted to `top_k=8, max_num_tokens=8192`, but vLLM uses top_k=6 and `deepseek_v4.py` did not pass these values.
 
-**Root cause:** `CuTeDSLMoERunner` defaulted to `max_num_tokens=8192, top_k=8`, but vLLM uses top_k=6. `deepseek_v4.py` didn't pass these values.
-
-**Fix:** Pass `max_num_tokens` and `top_k` from `deepseek_v4.py` to the runner constructor.
+**Fix:** Pass values from `deepseek_v4.py`.
 
 ### Bug 11: Full-Buffer Swizzle Produced Wrong GEMM Input
 
-**Symptom:** L2 gs=0.0 on EP5/EP7 during warmup. Model produced BOS token.
+**Symptom:** L2 gs=0.0 on EP5/EP7.
 
-**Root cause:** Applied the Blackwell 32_4_4 swizzle to the entire padded buffer at once, but the GEMM expects per-expert swizzled blocks. The combined swizzle layout doesn't match `expert_offsets` indexing.
+**Root cause:** Applied swizzle to entire buffer at once; GEMM expects per-expert swizzled blocks.
 
-**Fix (in progress):** Reverted to per-expert swizzle with fixed 128-row slots.
+**Fix:** Reverted to per-expert swizzle with fixed 128-row slots.
 
 ### Bug 12: `torch.full()` During Cudagraph Capture
 
-**Symptom:** `cudaErrorStreamCaptureUnsupported` on all 8 workers during cudagraph capture.
+**Symptom:** `cudaErrorStreamCaptureUnsupported` on all 8 workers.
 
-**Root cause:** `torch.full()` in `run()` allocates a new tensor during stream capture, which CUDA doesn't allow.
+**Root cause:** `torch.full()` allocates a new tensor during stream capture, which CUDA does not allow.
 
-**Fix:** Pre-allocated `_l1_gsa_buf` and `_l2_gsa_buf`, use `.fill_()` instead of `torch.full()`. Also pre-allocated `_output_buf`, `_row_indices_buf`.
+**Fix:** Pre-allocated `_l1_gsa_buf`, `_l2_gsa_buf`, `_output_buf`, `_row_indices_buf`. Use `.fill_()` instead of `torch.full()`.
 
 ### Bug 13: Warmup Passed Global Expert IDs Instead of Local
 
-**Symptom:** L2 gs=0.0 on EP5/EP7 (all ranks except EP0).
+**Symptom:** L2 gs=0.0 on EP5/EP7.
 
-**Root cause:** `_warmup_activation_global_scales()` passed global IDs (e.g. 336+) to `compute_activation_global_scales()`, which matches against `expert_id_range` (0..47). No tokens matched → zero L1 GEMM output → L2 gs=0.
+**Root cause:** Warmup passed global IDs (336+) to `compute_activation_global_scales()` which matches against local range (0..47).
 
-**Fix:** Pass local expert IDs (0..num_experts-1) in warmup.
+**Fix:** Pass local IDs (0..num_experts-1).
 
-### Bug 14 (CURRENT): GEMM Scale Layout Mismatch — 128-Row Fixed vs Variable
+### Bug 14: GEMM Scale Layout Mismatch — Fixed 128-Row vs Variable
 
-**Symptom:** Model generates BOS token repeatedly. Tokens are produced but logits are garbage.
+**Symptom:** Model generates BOS token repeatedly (garbage logits).
 
-**Root cause:** Scale assembly places data at fixed `e*128` offsets (128 rows per expert). But the GEMM reads `scale_a` according to `expert_offsets` (real token counts, e.g. expert 0 = 500 rows). When expert 0 has 500 tokens, GEMM reads `scale_a[0:500]` but only rows 0-127 have valid scale data. Rows 128-499 are zeros → GEMM produces zeros for those tokens → garbage output.
+**Root cause:** Scale assembly placed data at fixed `e*128` offsets, but GEMM reads `scale_a[expert_offsets[e]:...]` where expert_offsets reflects real token counts (e.g., 500 for expert 0). Only 128 rows of scale data per expert → GEMM reads zeros beyond row 128.
 
-**Fix (in progress):** Pad `slot_hidden` to `num_experts * 128` rows (128 per expert) and pass `padded_expert_offsets=[0, 128, 256, ...]` to the GEMM. The GEMM processes exactly 128 tokens per expert. Padding tokens' output is discarded by scatter_add. Pre-allocated `_padded_hidden_buf`, `_padded_activated_buf`, `_padded_expert_offsets_buf`.
+**Fix:** Pad `slot_hidden` to `num_experts * max_chunks * 128` rows with a fixed layout. Pass `padded_expert_offsets=[0, max_rows, 2*max_rows, ...]` to the GEMM, where `max_rows = max_chunks * 128`. Scatter real tokens into padded positions. The GEMM processes padded 128-row blocks. Extract real token outputs via `l1_out[padded_dst]`.
+
+### Bug 15: OOM — Padded Buffers Sized for 8192 Tokens
+
+**Symptom:** `torch.OutOfMemoryError` trying to allocate 1008 MiB.
+
+**Root cause:** `padded_hidden_buf` + `padded_activated_buf` sized for `max_num_tokens=8192` → 72 MB per layer × 60 layers = 4.3 GB. With model+KV at 175 GB on 178 GB GPUs, no room.
+
+**Fix:** Cap `max_num_tokens` at cudagraph max capture size (512) for buffer pre-allocation. Reduces per-layer overhead to ~9 MB, total ~540 MB.
+
+### Bug 16: `padded_max_slots` Mismatch — Buffer Sized for `max_tokens*top_k` vs `num_experts*max_chunks*128`
+
+**Symptom:** Index out of bounds during cudagraph warmup.
+
+**Root cause:** `padded_max_slots` computed from `max_tokens*top_k` (3072) but `total_padded_slots` in `run()` is `num_experts*max_chunks*128` (6144). Buffer too small.
+
+**Fix:** Size buffers for `num_experts * max_chunks * 128`.
+
+### Bug 17 (ACTIVE): Shape Mismatch — GEMM Output 49152 vs Expected 3072
+
+**Symptom:** `RuntimeError: shape mismatch: value tensor of shape [49152, 7168] cannot be broadcast to indexing result of shape [3072, 7168]`
+
+**Root cause (under investigation):** GEMM output has 49152 rows = 48 experts × 8 chunks × 128. This means `max_chunks_per_expert = 8`, which implies the runner's `max_num_tokens` is still 8192 (not capped to 512). The `_cudagraph_max_capture_size` getattr fallback to 512 should cap it, but the GEMM output suggests otherwise. Debug print added to verify.
+
+**Hypothesis:** Either (1) the `min(self.max_num_tokens, 512)` cap isn't working as expected, or (2) the padded_hidden buffer is somehow sized at the original 8192 budget despite the cap.
+
+### Bug 18: Cudagraph Capture — Dynamic Tensor Allocation in Scale Assembly
+
+**Symptom:** `cudaErrorStreamCaptureInvalidated` — "capture failure must be from kernel launch".
+
+**Root cause:** `_assemble_scales_cudagraph_safe` created `torch.zeros()` for `padded_expert_offsets` during the forward pass, which allocates during cudagraph capture.
+
+**Fix:** Removed dynamic tensor creation. Use fixed layout offsets computed from Python constants.
+
+### Bug 19: Variable-Trip `while` Loop in Scale Assembly
+
+**Symptom:** `cudaErrorStreamCaptureInvalidated` during cudagraph capture.
+
+**Root cause:** Inner `while remaining > 0` loop with variable trip count based on GPU scalar `padded_rows_per_expert[e]`. Python control flow using GPU values requires CPU sync.
+
+**Fix:** Replaced with fixed `for c in range(max_chunks)` loop. Unused chunks are zero (harmless).
+
+### Bug 20: `torch.zeros()` in Scale Assembly Phase 1
+
+**Symptom:** `cudaErrorStreamCaptureInvalidated`.
+
+**Root cause:** `padded_expert_offsets = torch.zeros(...)` created during forward pass (inside `_assemble_scales_cudagraph_safe`).
+
+**Fix:** Removed the computation entirely. Use fixed `e * max_chunks * 128 + c * 128` offsets computed from Python constants.
 
 ---
 
@@ -146,61 +167,37 @@ expert_assign = torch.searchsorted(expert_offsets[1:], row_indices, right=True)
 | Weight stacking | ✅ | `make_b_k_major` + `assemble_scales_3d_side` |
 | Global→local ID remap | ✅ | `experts_start_idx`, mask non-local tokens |
 | Warmup gs computation | ✅ | Per-layer, local expert IDs, L1+L2 gs |
-| Scale assembly | ⚠️ | 128-row fixed layout, pending GEMM alignment fix |
-| Cudagraph capture | ✅ | No dynamic allocations, no CPU syncs |
-| Model output | ❌ | Garbage (BOS repeat) — scale/GEMM layout mismatch |
+| Scale assembly | ⚠️ | Fixed max_chunks layout, pending GEMM shape fix |
+| Cudagraph capture | ⚠️ | Works in test, fails in vLLM (shape mismatch) |
+| Model output | ❌ | Previously BOS repeat; now crashes before serving |
 
 ---
 
-## Test Files
+## Key Architecture: Fixed-Layout Padding
 
-| File | Purpose |
-|------|---------|
-| `tests/layertest.py` | Reference: moe_pipeline with dynamic gs, 3 experts, layer 0. Must pass (≥0.98 cosine). |
-| `tests/cudagraph_test.py` | CuTeDSLMoERunner cudagraph capture + replay. Must pass. |
-| `tests/test_warmup_gs.py` | Warmup gs computation, safety margin sweep. |
-| `tests/test_runner_vs_pipeline.py` | Compare runner.run() vs moe_pipeline. |
-| `tests/test_scale_assembly.py` | Compare cudagraph-safe vs reference scale assembly. |
-
-**Run order after any code change:**
-1. `python3 tests/layertest.py` — must pass
-2. `python3 tests/cudagraph_test.py` — must pass
-
----
-
-## Key Architecture: CuTeDSL NVFP4 MoE
-
-### Data Flow
+### Current Design
 ```
-hidden_states (BF16) ──→ global→local remap ──→ sort by expert
-    │
-    ├── L1 (gate+up)
-    │   quantize_activation_nvfp4 → x_fp4, x_sf
-    │   _assemble_scales_cudagraph_safe → scale_a (swizzled)
-    │   run_nvfp4_grouped_gemm → l1_out (BF16)
-    │
-    ├── SiLU(gate) * up → activated
-    │
-    ├── L2 (down)
-    │   quantize_activation_nvfp4 → l2_x_fp4, l2_x_sf
-    │   _assemble_scales_cudagraph_safe → scale_a (swizzled)
-    │   run_nvfp4_grouped_gemm → l2_out (BF16)
-    │
-    └── scatter_add → y (BF16)
+Each expert gets max_chunks * 128 rows at fixed offset (e * max_chunks * 128).
+
+padded_hidden: [exp0_128rows][exp0_128rows]...[exp1_128rows]...
+                chunk0        chunk1           chunk0
+
+Scatter: padded_dst = expert_assign * max_rows_per_expert + clamped_local_row
+GEMM input: padded_hidden (total = num_experts * max_chunks * 128 rows)
+GEMM offsets: [0, max_rows, 2*max_rows, ...] (fixed, pre-computed; max_rows = max_rows_per_expert = max_chunks * 128)
+GEMM output: same total rows
+Extract: l1_out[padded_dst] → only real token rows
+
+Scale assembly:
+  Phase 1: Scatter x_sf into padded_x_sf at same fixed offsets
+  Phase 2: Per-expert, per-chunk swizzle (fixed loop: max_chunks iterations)
 ```
 
-### Cudagraph Constraints
+### Cudagraph Constraints (All Resolved)
 - No `.item()`, `.cpu()`, `.tolist()` — zero CPU-GPU syncs
-- No `torch.zeros/ones/full/empty/arange` during capture — pre-allocate everything
-- No dynamic shapes — `num_tokens` equals the captured budget
-- Per-expert Python loops are OK (fixed `num_experts`, unrolled at capture time)
-- `pad_and_swizzle_single` is OK on pre-padded 128×4-aligned buffers (no internal allocation)
-
-### EP Configuration (DeepSeek-V4-Pro on 8×B200)
-- 256 total experts, top_k=6
-- EP=8 → 32 local experts per rank (in practice 48 based on logs)
-- `experts_start_idx` = rank * 32 (0, 32, 64, ..., 224)
-- `max_num_tokens` from `scheduler_config.max_num_batched_tokens`
+- No `torch.zeros/ones/full/empty/arange()` during capture — pre-allocate everything
+- No dynamic Python control flow from GPU values — fixed loop counts
+- Per-expert Python loops OK (fixed `num_experts`, unrolled at capture time)
 
 ---