nvfp4-megamoe-kernel/CURRENT_BUG.md

Current Bug: CuTeDSLMoERunner produces wrong output

Status

  • layertest.py — 0.988 cosine (moe_pipeline with dynamic gs + assemble_scales_2d_side)
  • cudagraph_test.py — capture + replay succeeds
  • test_scale_assembly.py — per-expert scale data matches reference
  • test_runner_vs_pipeline.py — runner gives 0.18 cosine vs pipeline (should be ~0.99)

Root Cause: scale_a layout doesn't match expert_offsets

The kernel uses expert_offsets to index into the 2D-side scale_a tensor. After swizzle, the data layout is: each expert's 128-row block is swizzled independently, then concatenated. The kernel's TMA load undoes the swizzle when reading.

How the pipeline works (0.988 cosine):

  1. expert_offsets = compute_expert_offsets([4, 4, 0], 3) → tensor([4, 8, 8])
  2. assemble_scales_2d_side([x_sf[0:4], x_sf[4:8]]) → 256 rows (only experts WITH tokens)
    • Expert 0's data: rows 0-127 (4 real + 124 zero, swizzled)
    • Expert 1's data: rows 128-255 (4 real + 124 zero, swizzled)
    • Expert 2 has 0 tokens → NOT included
  3. Kernel reads scale_a[0:4] for expert 0, scale_a[4:8] for expert 1
  4. After TMA un-swizzle, slot index m maps correctly to the original row m data
  5. scale_a.shape[0] = 256 → kernel knows total padded tokens = 256
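
The offset and row-count arithmetic in the steps above can be sketched in plain Python. This is a minimal sketch, not the repository's code: `inclusive_offsets` and `padded_rows_nonempty` are hypothetical stand-ins for `compute_expert_offsets` and for the row count that `assemble_scales_2d_side` produces, and rounding counts above 128 up to the next multiple of 128 is an assumption (the note only covers counts below 128).

```python
from itertools import accumulate

BLOCK_ROWS = 128  # per-expert padding granularity described in this note


def inclusive_offsets(tokens_per_expert):
    """Hypothetical stand-in for compute_expert_offsets: inclusive running sum."""
    return list(accumulate(tokens_per_expert))


def padded_rows_nonempty(tokens_per_expert, block_rows=BLOCK_ROWS):
    """Rows in scale_a when only experts WITH tokens are padded and concatenated."""
    total = 0
    for count in tokens_per_expert:
        if count > 0:
            # Assumption: round each non-empty expert up to whole blocks.
            total += -(-count // block_rows) * block_rows
    return total


counts = [4, 4, 0]
print(inclusive_offsets(counts))     # [4, 8, 8]
print(padded_rows_nonempty(counts))  # 256
```

Expert 2 contributes nothing to the row count here, which is why the pipeline's scale_a stops at 256 rows.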

Key insight: The pipeline only includes experts with tokens in scale_a. The kernel's slot-based expert_offsets ([4, 8, 8]) correctly indexes into the 256-row scale_a because the TMA un-swizzle maps slot index → original row → correct data.

How the runner currently works (0.18 cosine):

  1. expert_offsets = GPU-computed [0, 4, 8, 8] (with leading 0 for cumsum)
  2. _assemble_scales_cudagraph_safe produces 384 rows (ALL experts × 128, including expert 2 with 0 tokens)
    • Expert 0's data: rows 0-127
    • Expert 1's data: rows 128-255
    • Expert 2's data: rows 256-383 (all zeros, swizzled)
  3. Kernel reads scale_a[0:4] for expert 0, scale_a[4:8] for expert 1
  4. After TMA un-swizzle, slot indices 0-3 map to expert 0's first 4 original rows → correct
  5. But slot indices 4-7 map to... the 5th-8th rows of expert 0's swizzled block, NOT expert 1's data
  6. scale_a.shape[0] = 384 → kernel thinks there are 384 padded token slots, but expert_offsets says 8
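
For comparison, the runner's numbers can be reproduced the same way. Again a sketch under stated assumptions: `exclusive_offsets` and `padded_rows_all_experts` are hypothetical names, and "every expert gets one 128-row block even with zero tokens" is taken from the description of `_assemble_scales_cudagraph_safe` above, not from its source.

```python
from itertools import accumulate

BLOCK_ROWS = 128  # per-expert padding granularity described in this note


def exclusive_offsets(tokens_per_expert):
    """Running sum with a leading 0, as the runner's GPU-computed offsets are described."""
    return [0] + list(accumulate(tokens_per_expert))


def padded_rows_all_experts(tokens_per_expert, block_rows=BLOCK_ROWS):
    """Rows in scale_a when EVERY expert gets at least one block, empty or not."""
    total = 0
    for count in tokens_per_expert:
        blocks = max(1, -(-count // block_rows))  # empty experts still take one block
        total += blocks * block_rows
    return total


counts = [4, 4, 0]
offsets = exclusive_offsets(counts)
print(offsets)                          # [0, 4, 8, 8]
print(padded_rows_all_experts(counts))  # 384
print(offsets[-1])                      # 8 real token slots
```

The last two values are the disagreement from step 6: 384 rows in scale_a against 8 slots in expert_offsets.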

The mismatch (first attempt): The kernel interprets slot index m=4 as row 4 of the ENTIRE scale_a tensor. In the runner, row 4 after un-swizzle lies in expert 0's 128-row block, so it does not reach expert 1's data. The working assumption for the pipeline is that, although expert 1's block starts at row 128 of the 256-row tensor, the kernel's TMA remaps slot 4 → the 5th original row, and that row IS expert 1's 1st row in the pipeline layout.

Wait — re-reading: in the pipeline, scale_a has 256 rows. Slot 4 → row 4 of scale_a. After un-swizzle, this is the 5th original row. In the pipeline, original rows 0-3 are expert 0's data, rows 4-7 are expert 1's data (padded to 128 each before swizzle). So the un-swizzle of row 4 gives the 1st original row of expert 1. That's correct.

In the runner, scale_a has 384 rows. Slot 4 → row 4 of scale_a. After un-swizzle, this is the 5th original row. Original rows 0-3 are expert 0's data. Row 4 is a ZERO row (expert 0 had only 4 tokens, rows 4-127 are zero-padded). So the un-swizzle of row 4 gives a zero → expert 1 gets no valid scale data.

So the issue IS the 128-row padding. When expert 0 has 4 tokens and we pad to 128 rows, slot indices 4-127 map to zeros for expert 0. The kernel needs slot indices to be contiguous per expert (0-3 for expert 0, 4-7 for expert 1), but the scale_a has 128 rows per expert block, not 4.

Why does the pipeline work?

Because pad_and_swizzle_single on a 4-row tensor pads to 128 rows, swizzles the 128-row block, and the kernel's TMA read with slot index 4 reads from the 5th position in the swizzled 128-row block. After un-swizzle, position 4 maps back to... the 5th original row, which is a zero-padded row. Wait, this should also be wrong then.

Unless the kernel's 2D-side scale access uses m % 128 or similar per-block indexing. Let me check the kernel's scale_a read pattern.

Actually — re-reading the pipeline more carefully: assemble_scales_2d_side([x_sf[0:4], x_sf[4:8]]). This creates a scale_a where:

  • Expert 0's 4 rows are padded to 128, swizzled → 128 swizzled rows
  • Expert 1's 4 rows are padded to 128, swizzled → 128 swizzled rows
  • Concatenated → 256 rows

The kernel gets expert_offsets = [4, 8, 8]. For expert 0 (slots 0-3), it reads scale_a positions 0-3. For expert 1 (slots 4-7), it reads scale_a positions 4-7.

In the swizzled layout, positions 0-3 are in the first 128-row block (expert 0). Positions 4-7 are ALSO in the first 128-row block. But expert 1's data is in the second 128-row block (positions 128-255).

So the kernel reads positions 4-7 for expert 1, but those positions contain expert 0's zero-padded/swizzled data, not expert 1's data. This should be wrong...
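
The block arithmetic behind this concern is small enough to write down. The sketch below assumes that the swizzle only permutes rows within a 128-row block (as stated in the root-cause section), so it can be ignored when asking which block a row belongs to; `locate` is a hypothetical helper, not a kernel function.

```python
BLOCK_ROWS = 128  # per-expert padding granularity described in this note


def locate(row, block_rows=BLOCK_ROWS):
    """Return (block_index, row_within_block) for a global scale_a row."""
    return divmod(row, block_rows)


for row in (0, 3, 4, 7, 128):
    print(row, locate(row))
# 0 (0, 0)
# 3 (0, 3)
# 4 (0, 4)
# 7 (0, 7)
# 128 (1, 0)
```

Under that assumption, rows 4-7 stay in block 0 whether scale_a has 256 or 384 rows, and the first row of the second block is row 128.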

BUT THE PIPELINE GIVES 0.988 COSINE. So either:

  1. The kernel DOES use per-expert offsets into scale_a (reading from position expert_padded_offset + local_slot), or
  2. The swizzle + TMA read remaps the indices in a way I'm not understanding

Need to check the kernel's actual scale_a access pattern in the C++ code.
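
Until the kernel source is checked, the two possibilities can at least be compared on a toy model. The sketch below is not the kernel's behavior: it leaves the swizzle out entirely, tags each row with (expert, local_row) instead of real scale values, assumes each expert has at most 128 tokens, and uses hypothetical helper names throughout.

```python
BLOCK_ROWS = 128  # per-expert padding granularity described in this note


def build_padded_scales(tokens_per_expert, include_empty):
    """Toy un-swizzled scale_a: (expert, local_row) for real rows, None for padding."""
    rows = []
    for expert, count in enumerate(tokens_per_expert):
        if count == 0 and not include_empty:
            continue  # pipeline-style: experts without tokens add no block
        block = [(expert, r) for r in range(count)]
        block += [None] * (BLOCK_ROWS - count)  # zero padding up to one block
        rows.extend(block)
    return rows


def read_direct(scale_rows, slot):
    """Hypothesis: the global slot index is used directly as the row index."""
    return scale_rows[slot]


def read_per_block(scale_rows, block_index, local_slot):
    """Hypothesis: row = block_index * BLOCK_ROWS + local_slot."""
    return scale_rows[block_index * BLOCK_ROWS + local_slot]


counts = [4, 4, 0]
pipeline = build_padded_scales(counts, include_empty=False)  # 256 rows
runner = build_padded_scales(counts, include_empty=True)     # 384 rows

# Slot 4 is expert 1's first token.
print(read_direct(pipeline, 4))        # None
print(read_direct(runner, 4))          # None
print(read_per_block(pipeline, 1, 0))  # (1, 0)
print(read_per_block(runner, 1, 0))    # (1, 0)
```

In this model direct slot indexing hits padding in both layouts, so it cannot explain why the pipeline reaches 0.988 cosine. Whatever the model leaves out (per-expert offsets or the swizzle remap) is where the real answer has to be.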

Fix Options

Option A: Padded expert_offsets

Change expert_offsets from slot-based to 128-row-aligned:

  • Instead of [0, 4, 8, 8], pass [0, 128, 256, 384]
  • The kernel reads scale_a[0:128] for expert 0, scale_a[128:256] for expert 1, etc.
  • Problem: the kernel would produce 128 output rows per expert instead of the actual token count
  • This breaks the output shape
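
The shape problem with Option A follows from simple differences of consecutive offsets. A small sketch, assuming the kernel derives each expert's row count from adjacent entries of expert_offsets (`rows_per_expert` is a hypothetical helper):

```python
def rows_per_expert(boundaries):
    """Per-expert row counts implied by a list of offsets with a leading 0."""
    return [hi - lo for lo, hi in zip(boundaries, boundaries[1:])]


print(rows_per_expert([0, 4, 8, 8]))         # [4, 4, 0]
print(rows_per_expert([0, 128, 256, 384]))   # [128, 128, 128]
```

With the padded offsets every expert, including the empty one, would be treated as owning 128 rows, for 384 output rows in total instead of 8.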

Option B: Only include experts with tokens in scale_a

  • Same as pipeline: only pad+swizzle experts that have tokens
  • Requires knowing which experts have tokens → .tolist() or .item() → GPU-to-host sync, which breaks cudagraph capture
  • Could use a fixed expert set (all experts always included, but with zero rows for empty experts)
  • The pipeline's assemble_scales_2d_side with 0 rows produces no output for that expert

Option C: Understand the kernel's actual indexing and match it

  • Read the kernel's scale_a access code
  • Figure out exactly how slot indices map to scale_a positions
  • Build the layout the kernel expects

Option D: Skip scale_a assembly entirely, pass raw scales

  • The kernel might accept raw (un-swizzled) scales via a different path
  • Or we could use the 3D-side layout for activation scales too

Approach

Going with Option C — understand the kernel's indexing, then match it.