From ddffb7d8df6d4b8cb86ab83aee897e77c5355184 Mon Sep 17 00:00:00 2001
From: biondizzle <biondizzle@gmail.com>
Date: Sun, 17 May 2026 07:53:58 +0000
Subject: [PATCH] =?UTF-8?q?docs:=20current=20bug=20analysis=20=E2=80=94=20?=
 =?UTF-8?q?scale=5Fa=20layout=20vs=20expert=5Foffsets=20mismatch?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

---
 CURRENT_BUG.md | 94 ++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 94 insertions(+)
 create mode 100644 CURRENT_BUG.md

diff --git a/CURRENT_BUG.md b/CURRENT_BUG.md
new file mode 100644
index 00000000..82414962
--- /dev/null
+++ b/CURRENT_BUG.md
@@ -0,0 +1,94 @@
+# Current Bug: CuTeDSLMoERunner produces wrong output
+
+## Status
+- ✅ `layertest.py` — 0.988 cosine (moe_pipeline with dynamic gs + assemble_scales_2d_side)
+- ✅ `cudagraph_test.py` — capture + replay succeeds
+- ✅ `test_scale_assembly.py` — per-expert scale data matches reference
+- ❌ `test_runner_vs_pipeline.py` — runner gives 0.18 cosine vs pipeline (should be ~0.99)
+
+## Root Cause: scale_a layout doesn't match expert_offsets
+
+The kernel uses `expert_offsets` to index into the 2D-side `scale_a` tensor. Each expert's rows are padded to a 128-row block, each block is swizzled independently, and the blocks are then concatenated. The kernel's TMA load undoes the swizzle when reading.
+
+### How the pipeline works (0.988 cosine):
+
+1. `expert_offsets = compute_expert_offsets([4, 4, 0], 3)` → `tensor([4, 8, 8])`
+2. `assemble_scales_2d_side([x_sf[0:4], x_sf[4:8]])` → 256 rows (only experts WITH tokens)
+   - Expert 0's data: rows 0-127 (4 real + 124 zero, swizzled)
+   - Expert 1's data: rows 128-255 (4 real + 124 zero, swizzled)
+   - Expert 2 has 0 tokens → NOT included
+3. Kernel reads `scale_a[0:4]` for expert 0, `scale_a[4:8]` for expert 1
+4. After TMA un-swizzle, slot index `m` maps correctly to the original row `m` data
+5. `scale_a.shape[0] = 256` → kernel knows total padded tokens = 256
+
+**Key insight:** The pipeline only includes experts with tokens in `scale_a`. The kernel's slot-based `expert_offsets` (`[4, 8, 8]`) correctly indexes into the 256-row `scale_a` because the TMA un-swizzle maps slot index → original row → correct data.
+
+### How the runner currently works (0.18 cosine):
+
+1. `expert_offsets` = GPU-computed `[0, 4, 8, 8]` (with leading 0 for cumsum)
+2. `_assemble_scales_cudagraph_safe` produces 384 rows (ALL experts × 128, including expert 2 with 0 tokens)
+   - Expert 0's data: rows 0-127
+   - Expert 1's data: rows 128-255
+   - Expert 2's data: rows 256-383 (all zeros, swizzled)
+3. Kernel reads `scale_a[0:4]` for expert 0, `scale_a[4:8]` for expert 1
+4. After TMA un-swizzle, slot indices 0-3 map to expert 0's first 4 original rows → correct
+5. But slot indices 4-7 map to... the 5th-8th rows of expert 0's swizzled block, NOT expert 1's data
+6. `scale_a.shape[0] = 384` → kernel thinks there are 384 padded token slots, but expert_offsets says 8
+
+**The mismatch:** The kernel interprets slot index `m=4` as row 4 of the ENTIRE `scale_a` tensor. In the runner's 384-row tensor, row 4 un-swizzles to a row inside expert 0's 128-row block. In the pipeline's 256-row tensor, expert 1's block starts at row 128, yet slot 4 appears to resolve to the 5th original row, which IS expert 1's 1st row in the pipeline layout.
+
+Wait — re-reading: in the pipeline, scale_a has 256 rows. Slot 4 → row 4 of scale_a. After un-swizzle, this is the 5th original row. In the pipeline, original rows 0-3 are expert 0's data, rows 4-7 are expert 1's data (padded to 128 each before swizzle). So the un-swizzle of row 4 gives the 1st original row of expert 1. That's correct.
+
+In the runner, scale_a has 384 rows. Slot 4 → row 4 of scale_a. After un-swizzle, this is the 5th original row. Original rows 0-3 are expert 0's data. Row 4 is a ZERO row (expert 0 had only 4 tokens, rows 4-127 are zero-padded). So the un-swizzle of row 4 gives a zero → expert 1 gets no valid scale data.
+
+**So the issue IS the 128-row padding.** When expert 0 has 4 tokens and we pad to 128 rows, slot indices 4-127 map to zeros for expert 0. The kernel needs slot indices to be contiguous per expert (0-3 for expert 0, 4-7 for expert 1), but the scale_a has 128 rows per expert block, not 4.
+
+### Why does the pipeline work?
+
+Because `pad_and_swizzle_single` on a 4-row tensor pads to 128 rows, swizzles the 128-row block, and the kernel's TMA read with slot index 4 reads from the 5th position in the swizzled 128-row block. After un-swizzle, position 4 maps back to... the 5th original row, which is a zero-padded row. Wait, this should also be wrong then.
+
+Unless the kernel's 2D-side scale access uses `m % 128` or similar per-block indexing. Let me check the kernel's scale_a read pattern.
+
+Actually — re-reading the pipeline more carefully: `assemble_scales_2d_side([x_sf[0:4], x_sf[4:8]])`. This creates a scale_a where:
+- Expert 0's 4 rows are padded to 128, swizzled → 128 swizzled rows
+- Expert 1's 4 rows are padded to 128, swizzled → 128 swizzled rows
+- Concatenated → 256 rows
+
+The kernel gets `expert_offsets = [4, 8, 8]`. For expert 0 (slots 0-3), it reads scale_a positions 0-3. For expert 1 (slots 4-7), it reads scale_a positions 4-7.
+
+In the swizzled layout, positions 0-3 are in the first 128-row block (expert 0). Positions 4-7 are ALSO in the first 128-row block. But expert 1's data is in the second 128-row block (positions 128-255).
+
+So the kernel reads positions 4-7 for expert 1, but those positions contain expert 0's zero-padded/swizzled data, not expert 1's data. This should be wrong...
+
+BUT THE PIPELINE GIVES 0.988 COSINE. So either:
+1. The kernel DOES use per-expert offsets into scale_a (reading from position `expert_padded_offset + local_slot`), or
+2. The swizzle + TMA read remaps the indices in a way I'm not understanding
+
+Need to check the kernel's actual scale_a access pattern in the C++ code.
+
+## Fix Options
+
+### Option A: Padded expert_offsets
+Change `expert_offsets` from slot-based to 128-row-aligned:
+- Instead of `[0, 4, 8, 8]`, pass `[0, 128, 256, 384]`
+- The kernel reads `scale_a[0:128]` for expert 0, `scale_a[128:256]` for expert 1, etc.
+- Problem: the kernel would produce 128 output rows per expert instead of the actual token count
+- This breaks the output shape
+
+### Option B: Only include experts with tokens in scale_a
+- Same as pipeline: only pad+swizzle experts that have tokens
+- Requires knowing which experts have tokens → `.tolist()` or `.item()` → forces a host sync, which breaks CUDA graph capture
+- Could use a fixed expert set (all experts always included, but with zero rows for empty experts)
+- The pipeline's `assemble_scales_2d_side` with 0 rows produces no output for that expert
+
+### Option C: Understand the kernel's actual indexing and match it
+- Read the kernel's scale_a access code
+- Figure out exactly how slot indices map to scale_a positions
+- Build the layout the kernel expects
+
+### Option D: Skip scale_a assembly entirely, pass raw scales
+- The kernel might accept raw (un-swizzled) scales via a different path
+- Or we could use the 3D-side layout for activation scales too
+
+## Approach
+Going with **Option C** — understand the kernel's indexing, then match it.