From ddffb7d8df6d4b8cb86ab83aee897e77c5355184 Mon Sep 17 00:00:00 2001
From: biondizzle
Date: Sun, 17 May 2026 07:53:58 +0000
Subject: [PATCH] =?UTF-8?q?docs:=20current=20bug=20analysis=20=E2=80=94=20?=
 =?UTF-8?q?scale=5Fa=20layout=20vs=20expert=5Foffsets=20mismatch?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

---
 CURRENT_BUG.md | 94 ++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 94 insertions(+)
 create mode 100644 CURRENT_BUG.md

diff --git a/CURRENT_BUG.md b/CURRENT_BUG.md
new file mode 100644
index 00000000..82414962
--- /dev/null
+++ b/CURRENT_BUG.md
@@ -0,0 +1,94 @@
+# Current Bug: CuTeDSLMoERunner produces wrong output
+
+## Status
+- ✅ `layertest.py` — 0.988 cosine (moe_pipeline with dynamic gs + assemble_scales_2d_side)
+- ✅ `cudagraph_test.py` — capture + replay succeeds
+- ✅ `test_scale_assembly.py` — per-expert scale data matches reference
+- ❌ `test_runner_vs_pipeline.py` — runner gives 0.18 cosine vs pipeline (should be ~0.99)
+
+## Root Cause: scale_a layout doesn't match expert_offsets
+
+The kernel uses `expert_offsets` to index into the 2D-side scale_a tensor. After swizzle, the data layout is: each expert's 128-row block is swizzled independently, then concatenated. The kernel's TMA load undoes the swizzle when reading.
+
+### How the pipeline works (0.988 cosine):
+
+1. `expert_offsets = compute_expert_offsets([4, 4, 0], 3)` → `tensor([4, 8, 8])`
+2. `assemble_scales_2d_side([x_sf[0:4], x_sf[4:8]])` → 256 rows (only experts WITH tokens)
+   - Expert 0's data: rows 0-127 (4 real + 124 zero, swizzled)
+   - Expert 1's data: rows 128-255 (4 real + 124 zero, swizzled)
+   - Expert 2 has 0 tokens → NOT included
+3. Kernel reads `scale_a[0:4]` for expert 0, `scale_a[4:8]` for expert 1
+4. After TMA un-swizzle, slot index `m` maps correctly to the original row `m` data
+5. `scale_a.shape[0] = 256` → kernel knows total padded tokens = 256
+
+**Key insight:** The pipeline only includes experts with tokens in scale_a. The kernel's slot-based expert_offsets ([4, 8, 8]) correctly indexes into the 256-row scale_a because the TMA un-swizzle maps slot index → original row → correct data.
+
+### How the runner currently works (0.18 cosine):
+
+1. `expert_offsets` = GPU-computed `[0, 4, 8, 8]` (with leading 0 for cumsum)
+2. `_assemble_scales_cudagraph_safe` produces 384 rows (ALL experts × 128, including expert 2 with 0 tokens)
+   - Expert 0's data: rows 0-127
+   - Expert 1's data: rows 128-255
+   - Expert 2's data: rows 256-383 (all zeros, swizzled)
+3. Kernel reads `scale_a[0:4]` for expert 0, `scale_a[4:8]` for expert 1
+4. After TMA un-swizzle, slot indices 0-3 map to expert 0's first 4 original rows → correct
+5. But slot indices 4-7 map to... the 5th-8th rows of expert 0's swizzled block, NOT expert 1's data
+6. `scale_a.shape[0] = 384` → kernel thinks there are 384 padded token slots, but expert_offsets says 8
+
+**The mismatch:** The kernel interprets slot index `m=4` as row 4 of the ENTIRE scale_a tensor. After un-swizzle, row 4 is in expert 0's 128-row block. But the pipeline's scale_a has row 4 in expert 1's block (because expert 1 starts at row 128 in a 256-row tensor, but the kernel's TMA remaps slot 4 → the 5th original row which IS expert 1's 1st row in the pipeline layout).
+
+Wait — re-reading: in the pipeline, scale_a has 256 rows. Slot 4 → row 4 of scale_a. After un-swizzle, this is the 5th original row. In the pipeline, original rows 0-3 are expert 0's data, rows 4-7 are expert 1's data (padded to 128 each before swizzle). So the un-swizzle of row 4 gives the 1st original row of expert 1. That's correct.
+
+In the runner, scale_a has 384 rows. Slot 4 → row 4 of scale_a. After un-swizzle, this is the 5th original row. Original rows 0-3 are expert 0's data. Row 4 is a ZERO row (expert 0 had only 4 tokens, rows 4-127 are zero-padded). So the un-swizzle of row 4 gives a zero → expert 1 gets no valid scale data.
+
+**So the issue IS the 128-row padding.** When expert 0 has 4 tokens and we pad to 128 rows, slot indices 4-127 map to zeros for expert 0. The kernel needs slot indices to be contiguous per expert (0-3 for expert 0, 4-7 for expert 1), but the scale_a has 128 rows per expert block, not 4.
+
+### Why does the pipeline work?
+
+Because `pad_and_swizzle_single` on a 4-row tensor pads to 128 rows, swizzles the 128-row block, and the kernel's TMA read with slot index 4 reads from the 5th position in the swizzled 128-row block. After un-swizzle, position 4 maps back to... the 5th original row, which is a zero-padded row. Wait, this should also be wrong then.
+
+Unless the kernel's 2D-side scale access uses `m % 128` or similar per-block indexing. Let me check the kernel's scale_a read pattern.
+
+Actually — re-reading the pipeline more carefully: `assemble_scales_2d_side([x_sf[0:4], x_sf[4:8]])`. This creates a scale_a where:
+- Expert 0's 4 rows are padded to 128, swizzled → 128 swizzled rows
+- Expert 1's 4 rows are padded to 128, swizzled → 128 swizzled rows
+- Concatenated → 256 rows
+
+The kernel gets `expert_offsets = [4, 8, 8]`. For expert 0 (slots 0-3), it reads scale_a positions 0-3. For expert 1 (slots 4-7), it reads scale_a positions 4-7.
+
+In the swizzled layout, positions 0-3 are in the first 128-row block (expert 0). Positions 4-7 are ALSO in the first 128-row block. But expert 1's data is in the second 128-row block (positions 128-255).
+
+So the kernel reads positions 4-7 for expert 1, but those positions contain expert 0's zero-padded/swizzled data, not expert 1's data. This should be wrong...
+
+BUT THE PIPELINE GIVES 0.988 COSINE. So either:
+1. The kernel DOES use per-expert offsets into scale_a (reading from position `expert_padded_offset + local_slot`), or
+2. The swizzle + TMA read remaps the indices in a way I'm not understanding
+
+Need to check the kernel's actual scale_a access pattern in the C++ code.
+
+## Fix Options
+
+### Option A: Padded expert_offsets
+Change expert_offsets from slot-based to 128-row-aligned:
+- Instead of `[0, 4, 8, 8]`, pass `[0, 128, 256, 384]`
+- The kernel reads scale_a[0:128] for expert 0, scale_a[128:256] for expert 1, etc.
+- Problem: the kernel would produce 128 output rows per expert instead of the actual token count
+- This breaks the output shape
+
+### Option B: Only include experts with tokens in scale_a
+- Same as pipeline: only pad+swizzle experts that have tokens
+- Requires knowing which experts have tokens → .tolist() or .item() → breaks cudagraph
+- Could use a fixed expert set (all experts always included, but with zero rows for empty experts)
+- The pipeline's assemble_scales_2d_side with 0 rows produces no output for that expert
+
+### Option C: Understand the kernel's actual indexing and match it
+- Read the kernel's scale_a access code
+- Figure out exactly how slot indices map to scale_a positions
+- Build the layout the kernel expects
+
+### Option D: Skip scale_a assembly entirely, pass raw scales
+- The kernel might accept raw (un-swizzled) scales via a different path
+- Or we could use the 3D-side layout for activation scales too
+
+## Approach
+Going with **Option C** — understand the kernel's indexing, then match it.
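
As a cheap check before reading the C++ code, the two indexing hypotheses from "Why does the pipeline work?" can be compared in a toy model. This is only a sketch under stated assumptions: the swizzle is treated as the identity (the notes say the TMA load undoes it), scale rows are replaced by `(expert, local_row)` tags, and `build_layout`, `read_flat`, and `read_block_relative` are hypothetical helper names, not functions from the runner or the kernel.

```python
BLOCK = 128  # rows per padded expert block, per the notes in CURRENT_BUG.md


def build_layout(token_counts, include_empty):
    """Model scale_a as a list of row tags.

    Real rows are tagged (expert, local_row); zero padding is None.
    Swizzling is modeled as the identity (assumption, see lead-in).
    """
    rows = []
    for expert, count in enumerate(token_counts):
        if count == 0 and not include_empty:
            continue  # pipeline-style: empty experts contribute no block
        rows += [(expert, r) for r in range(count)]
        rows += [None] * (BLOCK - count)  # pad the block to 128 rows
    return rows


def slot_ranges(token_counts):
    """Yield (first_slot, count) per expert from a running sum of counts."""
    start = 0
    for count in token_counts:
        yield start, count
        start += count


def read_flat(layout, token_counts):
    """Hypothesis: global slot m reads row m of the whole tensor."""
    return [layout[start + i]
            for start, count in slot_ranges(token_counts)
            for i in range(count)]


def read_block_relative(layout, token_counts, include_empty):
    """Hypothesis: each expert reads row (block_start + local_slot)."""
    out, block = [], 0
    for _, count in slot_ranges(token_counts):
        out += [layout[block * BLOCK + i] for i in range(count)]
        if count > 0 or include_empty:
            block += 1  # advance only past blocks that exist in the layout
    return out


counts = [4, 4, 0]
pipeline = build_layout(counts, include_empty=False)  # 256 rows
runner = build_layout(counts, include_empty=True)     # 384 rows
expected = [(e, r) for e, c in enumerate(counts) for r in range(c)]

flat_pipeline = read_flat(pipeline, counts)
flat_runner = read_flat(runner, counts)
rel_pipeline = read_block_relative(pipeline, counts, include_empty=False)
rel_runner = read_block_relative(runner, counts, include_empty=True)

print("flat == expected:", flat_pipeline == expected, flat_runner == expected)
print("block-relative == expected:", rel_pipeline == expected, rel_runner == expected)
print("first 256 rows identical:", runner[:len(pipeline)] == pipeline)
```

In this toy model the flat reading returns zero padding for expert 1 on both layouts, while the block-relative reading recovers every row on both. The first 256 rows of the two layouts are also identical for token counts `[4, 4, 0]`, which suggests (under these assumptions only) that the 0.988 vs 0.18 gap may come from something other than the row contents, such as `scale_a.shape[0]` or the leading 0 in `expert_offsets`. The kernel source remains the deciding check.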