docs: current bug analysis — scale_a layout vs expert_offsets mismatch
CURRENT_BUG.md
# Current Bug: CuTeDSLMoERunner produces wrong output

## Status

- ✅ `layertest.py`: 0.988 cosine (moe_pipeline with dynamic gs + assemble_scales_2d_side)
- ✅ `cudagraph_test.py`: capture + replay succeeds
- ✅ `test_scale_assembly.py`: per-expert scale data matches reference
- ❌ `test_runner_vs_pipeline.py`: runner gives 0.18 cosine vs pipeline (should be ~0.99)

## Root Cause: scale_a layout doesn't match expert_offsets

The kernel uses `expert_offsets` to index into the 2D-side scale_a tensor. The swizzled layout is built per expert: each expert's 128-row block is swizzled independently, and the blocks are then concatenated. The kernel's TMA load undoes the swizzle when reading.

### How the pipeline works (0.988 cosine):

1. `expert_offsets = compute_expert_offsets([4, 4, 0], 3)` → `tensor([4, 8, 8])`
2. `assemble_scales_2d_side([x_sf[0:4], x_sf[4:8]])` → 256 rows (only experts WITH tokens)
   - Expert 0's data: rows 0-127 (4 real + 124 zero, swizzled)
   - Expert 1's data: rows 128-255 (4 real + 124 zero, swizzled)
   - Expert 2 has 0 tokens → NOT included
3. Kernel reads `scale_a[0:4]` for expert 0, `scale_a[4:8]` for expert 1
4. After TMA un-swizzle, slot index `m` maps correctly to the original row `m` data
5. `scale_a.shape[0] = 256` → kernel knows total padded tokens = 256

**Key insight:** The pipeline only includes experts with tokens in scale_a. The kernel's slot-based expert_offsets (`[4, 8, 8]`) correctly indexes into the 256-row scale_a because the TMA un-swizzle maps slot index → original row → correct data.

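The offset and row-count arithmetic in the steps above can be sketched in plain Python. This is a minimal illustration, not the project's code: `cumulative_offsets` and `padded_rows` are hypothetical helpers standing in for what `compute_expert_offsets` and `assemble_scales_2d_side` are described as producing, and rounding each non-empty expert up to a multiple of 128 rows is an assumption based on the 4 → 128 padding shown above.

```python
from itertools import accumulate

BLOCK_ROWS = 128  # per-expert padding granularity described above


def cumulative_offsets(tokens_per_expert):
    """Slot-based end offset of each expert (inclusive cumsum of token counts)."""
    return list(accumulate(tokens_per_expert))


def padded_rows(tokens_per_expert, block=BLOCK_ROWS):
    """Total rows when each expert is rounded up to a multiple of `block`.

    Assumption: an expert with zero tokens contributes zero rows,
    matching the pipeline behaviour described above.
    """
    return sum(-(-n // block) * block for n in tokens_per_expert)


counts = [4, 4, 0]
offsets = cumulative_offsets(counts)   # slot-based offsets, no leading 0
total_rows = padded_rows(counts)       # rows in the assembled scale tensor
print(offsets, total_rows)
```

For `counts = [4, 4, 0]` this reproduces the two numbers the steps rely on: offsets `[4, 8, 8]` and a 256-row scale tensor.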
### How the runner currently works (0.18 cosine):

1. `expert_offsets` = GPU-computed `[0, 4, 8, 8]` (cumsum with a leading 0)
2. `_assemble_scales_cudagraph_safe` produces 384 rows (ALL experts × 128, including expert 2 with 0 tokens)
   - Expert 0's data: rows 0-127
   - Expert 1's data: rows 128-255
   - Expert 2's data: rows 256-383 (all zeros, swizzled)
3. Kernel reads `scale_a[0:4]` for expert 0, `scale_a[4:8]` for expert 1
4. After TMA un-swizzle, slot indices 0-3 map to expert 0's first 4 original rows → correct
5. But slot indices 4-7 map to the 5th-8th rows of expert 0's swizzled block, NOT expert 1's data
6. `scale_a.shape[0] = 384` → kernel thinks there are 384 padded token slots, but expert_offsets says 8

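To compare the two layouts side by side, the block placement described in the two step lists can be written out as a small sketch. `block_ranges` is a hypothetical helper, not a function from the runner or the pipeline; it ignores the swizzle and assumes every expert has at most 128 tokens.

```python
BLOCK_ROWS = 128


def block_ranges(tokens_per_expert, include_empty, block=BLOCK_ROWS):
    """Return ({expert: (first_row, end_row_exclusive)}, total_rows).

    include_empty=False mimics the pipeline (empty experts are skipped);
    include_empty=True mimics the runner (every expert gets a block).
    Assumes each expert has at most `block` tokens.
    """
    ranges = {}
    row = 0
    for expert, n in enumerate(tokens_per_expert):
        if n == 0 and not include_empty:
            continue  # pipeline-style: no block for an empty expert
        ranges[expert] = (row, row + block)
        row += block
    return ranges, row


counts = [4, 4, 0]
pipeline_ranges, pipeline_rows = block_ranges(counts, include_empty=False)
runner_ranges, runner_rows = block_ranges(counts, include_empty=True)
print(pipeline_ranges, pipeline_rows)
print(runner_ranges, runner_rows)
```

For this example the blocks of experts 0 and 1 start at the same rows in both layouts; they differ only in the trailing all-zero block and therefore in `scale_a.shape[0]`. With an empty expert in the middle (for example `[4, 0, 4]`), the block start of the last expert would differ as well.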
**The suspected mismatch:** The kernel interprets slot index `m = 4` as row 4 of the ENTIRE scale_a tensor. After un-swizzle, row 4 lies inside expert 0's 128-row block.

First reading of the pipeline: scale_a has 256 rows. Slot 4 → row 4 of scale_a, which after un-swizzle is the 5th original row. If original rows 0-3 hold expert 0's data and rows 4-7 hold expert 1's data, the un-swizzle of row 4 gives the 1st original row of expert 1, which is correct. (This reading treats the real rows as contiguous across experts, which conflicts with each expert being padded to 128 rows before the swizzle; the next section revisits it.)

In the runner, scale_a has 384 rows. Slot 4 → row 4 of scale_a, which after un-swizzle is the 5th original row. Original rows 0-3 are expert 0's data; rows 4-127 are zero padding because expert 0 has only 4 tokens. So the un-swizzle of row 4 gives a zero → expert 1 gets no valid scale data.

**So the issue appears to be the 128-row padding.** When expert 0 has 4 tokens and is padded to 128 rows, slot indices 4-127 map to expert 0's zero rows. The kernel needs slot indices to be contiguous per expert (0-3 for expert 0, 4-7 for expert 1), but scale_a has 128 rows per expert block, not 4.

### Why does the pipeline work?

The same argument applies to the pipeline. `pad_and_swizzle_single` pads a 4-row tensor to 128 rows and swizzles the 128-row block. A TMA read at slot index 4 reads the 5th position of that swizzled block, and after un-swizzle position 4 maps back to the 5th original row, which is a zero-padded row. By this reasoning the pipeline should be wrong too.

That would not hold if the kernel's 2D-side scale access uses `m % 128` or similar per-block indexing, so the kernel's scale_a read pattern needs to be checked.

Re-reading the pipeline more carefully: `assemble_scales_2d_side([x_sf[0:4], x_sf[4:8]])` creates a scale_a where:

- Expert 0's 4 rows are padded to 128, swizzled → 128 swizzled rows
- Expert 1's 4 rows are padded to 128, swizzled → 128 swizzled rows
- Concatenated → 256 rows

The kernel gets `expert_offsets = [4, 8, 8]`. For expert 0 (slots 0-3), it reads scale_a positions 0-3. For expert 1 (slots 4-7), it reads scale_a positions 4-7.

In the swizzled layout, positions 0-3 are in the first 128-row block (expert 0). Positions 4-7 are ALSO in the first 128-row block, but expert 1's data is in the second 128-row block (positions 128-255).

So the kernel reads positions 4-7 for expert 1, but those positions contain expert 0's zero-padded, swizzled data, not expert 1's data. This should be wrong.

BUT THE PIPELINE GIVES 0.988 COSINE. So either:

1. The kernel DOES use per-expert offsets into scale_a (reading from position `expert_padded_offset + local_slot`), or
2. The swizzle + TMA read remaps the indices in a way that is not yet understood

Next step: check the kernel's actual scale_a access pattern in the C++ code.

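The two candidate explanations can be made concrete as index mappings. The sketch below is only a model of the hypotheses, not the kernel's code: `flat_row` and `per_expert_row` are hypothetical names, the swizzle is treated as a permutation inside each 128-row block that the TMA load undoes (an assumption taken from the description above), and the layout is the pipeline's, where only non-empty experts get a block.

```python
BLOCK_ROWS = 128


def flat_row(slot):
    """Hypothesis: the kernel uses the global slot index as the scale_a row."""
    return slot


def per_expert_row(slot, end_offsets, block=BLOCK_ROWS):
    """Hypothesis: row = expert_padded_offset + local_slot.

    `end_offsets` are slot-based cumulative end offsets such as [4, 8, 8].
    Assumes only non-empty experts have a block and each has <= `block` tokens.
    """
    start = 0
    padded_offset = 0
    for end in end_offsets:
        if start <= slot < end:
            return padded_offset + (slot - start)
        if end > start:  # a non-empty expert occupies one padded block
            padded_offset += block
        start = end
    raise IndexError(f"slot {slot} is past the last expert")


end_offsets = [4, 8, 8]
for slot in range(8):
    print(slot, flat_row(slot), per_expert_row(slot, end_offsets))
```

Under the flat model, slots 4-7 land on rows 4-7, which are expert 0's padding. Under the per-expert model they land on rows 128-131, the first rows of expert 1's block. Which of the two the kernel actually implements is exactly what still has to be confirmed from its source.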
## Fix Options

### Option A: Padded expert_offsets

Change expert_offsets from slot-based to 128-row-aligned:

- Instead of `[0, 4, 8, 8]`, pass `[0, 128, 256, 384]`
- The kernel reads `scale_a[0:128]` for expert 0, `scale_a[128:256]` for expert 1, etc.
- Problem: the kernel would produce 128 output rows per expert instead of the actual token count
- This breaks the output shape

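The shape problem with Option A follows directly from the offsets. As a rough sketch, assuming the kernel sizes each expert's output from consecutive offset differences (an assumption, not something confirmed from the kernel source), with `padded_offsets` and `rows_per_expert` as hypothetical helper names:

```python
BLOCK_ROWS = 128


def padded_offsets(num_experts, block=BLOCK_ROWS):
    """128-row-aligned offsets with a leading 0, e.g. [0, 128, 256, 384]."""
    return [i * block for i in range(num_experts + 1)]


def rows_per_expert(offsets):
    """Rows each expert spans, taken as differences of consecutive offsets."""
    return [hi - lo for lo, hi in zip(offsets, offsets[1:])]


slot_offsets = [0, 4, 8, 8]
aligned_offsets = padded_offsets(3)
print(rows_per_expert(slot_offsets))     # actual token counts per expert
print(rows_per_expert(aligned_offsets))  # counts implied by aligned offsets
```

The slot-based offsets imply 4, 4, and 0 rows, while the aligned offsets imply 128 rows for every expert, including the empty one.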
### Option B: Only include experts with tokens in scale_a

- Same as pipeline: only pad+swizzle experts that have tokens
- Requires knowing which experts have tokens → `.tolist()` or `.item()` → breaks cudagraph
- Could use a fixed expert set (all experts always included, but with zero rows for empty experts)
- The pipeline's `assemble_scales_2d_side` with 0 rows produces no output for that expert

### Option C: Understand the kernel's actual indexing and match it

- Read the kernel's scale_a access code
- Figure out exactly how slot indices map to scale_a positions
- Build the layout the kernel expects

### Option D: Skip scale_a assembly entirely, pass raw scales

- The kernel might accept raw (un-swizzled) scales via a different path
- Or we could use the 3D-side layout for activation scales too

## Approach

Going with **Option C**: understand the kernel's indexing, then match it.