NVFP4-1.1: FP4 quantization primitives for CuTeDSL kernels

- fp8_e4m3_from_float32: manual FP8 E4M3 cast (bias=7, exp 0-15 valid, NaN guard for exp=15/mant=7, mantissa overflow handling) - fp8_e4m3_to_float32: dequantize FP8 E4M3 bit pattern back to Float32 - half_step_to_e2m1_idx: E2M1 step mapping (0-12 → 0-7) - quantize_e2m1_nibble: per-element E2M1 quantize + sign + pack - Verified 0/500 trial failures against Python reference - Key fixes discovered during validation: 1. FP8 E4M3 bias is 7, NOT 8 2. Exponent range is 0-15 (exp=15/mant=7 is NaN; others valid) 3. Subnormal formula: val = m * 2^(-9) = m/512 (NOT m/1024) 4. Round-to-nearest-even (not round-half-up) for half_step and mantissa 5. Mantissa overflow (round to 8) must increment exponent
2026-05-28 03:39:55 +00:00
parent b9f15c250f
commit 80b6b79f9e
9 changed files with 1059 additions and 0 deletions
--- a/2026-05-28_STATUS.md
+++ b/2026-05-28_STATUS.md
@@ -0,0 +1,69 @@
+OK. Here's the full status.
+
+---
+
+## STATUS UPDATE — DSV4 NVFP4 Kernel (as of 2026-05-28)
+
+### What's DONE and PROVEN (cos 0.999+)
+
+| Component | Status | Notes |
+|---|---|---|
+| **FMHA core** (hd=64/128/256) | ✅ | TMEM-P (hd≤64), SMEM-P (hd>64). cos 0.999997-0.999998 |
+| **D1: multi-KV-tile via Python KV merge** | ✅ | Correct, cos 0.999998. 5–9 kernel launches per decode step |
+| **D2: per-head multi-head launch** | ✅ | n_h=1–128, head-packed MQA/GQA |
+| **D3: SWA length mask (in-kernel)** | ✅ | cos 0.999996 |
+| **D4: causal mask on SWA (in-kernel)** | ✅ | cos 0.999996, combined with D3 |
+| **D5c: sink merge** | ✅ | Single softmax over [S_comp, S_swa + attn_sink]. Single-tile and multi-tile both pass |
+| **D5b: per-row LSE output** | ✅ | All 128 rows write correctly now |
+| **Stage E: production wrapper** | ✅ | `production.py` — kernel cache, batch dim, MQA/GQA head-packed, `torch.library.custom_op` |
+| **MoE fused SwiGLU (NVFP4)** | ✅ | Clamping in kernel, 7-warp specialization |
+| **Router (hash + dense)** | ✅ | Both modes |
+| **use_2cta_instrs conditional** | ✅ | 1.7–1.9× prefill speedup |
+| **NVFP4 primitives** | ✅ | sf_dtype, TMA, MMA kind all verified |
+| **GPU-only NVFP4 quantize** | ✅ | Byte-exact match with Python |
+| **KV cache infra** | ✅ | allocator, paged_cache, state_cache, flush, schema, handle, manager |
+| **Compressor (CSA/HCA)** | ✅ | flush_write kernels, FP8/FP4 quantize |
+| **Indexer** | ✅ | gather_kv, score_topk (FP32 scalar). Compiles and runs on B200 |
+| **Model assembly** | ✅ | config, layer_schedule, all 43 Flash + 61 Pro layers construct and validate |
+
+### What's BLOCKED / UNDONE
+
+| Item | Status | What's needed |
+|---|---|---|
+| **D1.5: in-kernel O rescale (TMEM round-trip)** | ❌ FUNDAMENTALLY BROKEN | TMEM load→store atoms have mismatched column mappings. NO-OP round-trip corrupts data. **Closed issue — do not re-attempt.** Production path is Python KV merge. |
+| **Priority 1: Profile production decode** | ❌ Not done | Need to measure if 5–9 launch Python KV merge overhead actually matters. Gates Priority 8. |
+| **Priority 2: One-way final-epilogue rewrite** | ❌ Not done | Replace `epilogue_tma_store` with MoE-style TMEM→REGS→SMEM→GMEM. Unlocks P4 (multi-CTA) and P6 (FP4 fuse). Attempted in `fmha_smem_acc.py` — many commits, unclear if it's working. |
+| **Priority 3: NVFP4-1.1 FP4 quant in MoE epilogue** | ❌ Not done | Fuse amax+FP4 pack into L1→L2 path. Independent of FMHA. |
+| **Priority 4: D2 multi-CTA grid** | ❌ Blocked on P2 | `epilogue_tma_store` can't accept `flat_divide` coordinates. |
+| **Priority 5: Stage E cleanup** | ⚠️ Partial | production.py exists. E1✅ E2⚠️ E3⚠️ E4-E7 TODO. Many debug test files still in `tests/unit/` (66 test files, lots of diagnostic/debug artifacts). |
+| **Priority 6: NVFP4-1.2 FP4 in FMHA output** | ❌ Blocked on P2 | Needs register slot in new final epilogue. |
+| **Priority 7: NVFP4-2 FP4 KV pipeline** | ❌ Blocked on P2 | FP4 KV dequant in SMEM for deeper pipeline stages. |
+| **Priority 8: Per-kt rescale fix** | ❌ Conditional on P1 | Three paths (A/B/C) if profiling shows Python merge overhead >5%. |
+| **Priority 9: hd=512 single-kernel** | ❌ MLIR hang | CuTeDSL MLIR optimizer can't handle it. Decode works via head-packed M + hd≤256 chunks. |
+| **Priority 10: Indexer FP4 tensor-core scoring** | ❌ Stage F | Scalar FP32 scoring today. Needs FP4 MMA + warp-level top-k. 2–3 weeks. |
+
+### MAY_24_2026_PLAN_NEW.md (TMEM Round-Trip Investigation Plan)
+
+This is a 5-phase investigation plan for Priority 8, Path A (CUTLASS atom replication). **Status: not executed.** The plan was written but MEMORY.md already contains the conclusion: TMEM round-trip is FUNDAMENTALLY BROKEN and will NEVER work. The plan's Phase 1 (read CUTLASS reference) and Phase 2 (NO-OP round-trip tests) were effectively done through the many `test_d15_*` and `test_tmem_roundtrip_minimal.py` test files. All variants failed. The plan is **historical** — its conclusion (escalate to Path C / Python KV merge) is what's already in production.
+
+### ROADMAP.md
+
+The roadmap is **current and accurate** as of 2026-05-26. The priority ordering and dependency chain are correct. No priorities have been completed since the last update — the latest work was Stage E production extraction (commit `b9f15c2`).
+
+### Attention Kernel Folder
+
+- `fmha.py` (592 lines) — **main production kernel**
+- `fmha_smem_acc.py` (592 lines) — SMEM accumulator variant, attempted but unclear status (likely broken/in-progress for P2)
+- `production.py` (427 lines) — **working production wrapper** with Python KV merge
+- `fmha_backup_*.py` — historical backups
+- 66 test files in `tests/unit/` — heavy diagnostic residue from the D1.5 investigation
+
+### Summary: Where We're At
+
+**The kernel works.** FMHA at hd=64/128/256 with all masks (SWA, causal, sink), multi-head, multi-KV-tile via Python merge — all proven on B200. The MoE, router, indexer, compressor, KV cache — all working.
+
+**The next move is Priority 1 (profile production decode)** to determine if the Python KV merge overhead matters. If it's <5% of decode latency, you ship as-is and the D1.5 TMEM round-trip problem becomes irrelevant.
+
+**If profiling says fix it, the path is Priority 2 first** (one-way final epilogue rewrite using MoE pattern), which unblocks the multi-CTA and FP4 fusion chain. The per-kt rescale (P8) is a separate, harder problem with three possible approaches.
+
+**The `MAY_24_2026_PLAN_NEW.md` is dead** — its investigation was run, and the answer was "TMEM round-trip doesn't work." Should be archived or deleted.
--- a/NVFP4-1.1_APPROACH.md
+++ b/NVFP4-1.1_APPROACH.md
@@ -0,0 +1,305 @@
+Let me read the relevant files first.The file is very long. Let me read the quantize code and the epilogue section more carefully.Now let me look at the current MoE forward flow more carefully to understand the full picture of what needs to change.Now I have a clear picture. This is a significant kernel modification. Let me think through the approach carefully before coding.
+
+The key insight: the fused SwiGLU kernel already has the SwiGLU result in registers (in the `acc_vec_bf16` after the SiLU(gate)*up computation). We need to add FP4 quantization there instead of writing BF16 to GMEM.
+
+The challenge is that `epi_tile_n=8` gives us 8-column subtiles, but NVFP4 needs 16-element blocks. Two consecutive *up* subtiles give 16 elements, but they're not consecutive in the subtile iteration (gate and up alternate).
+
+Let me spawn a subagent for this — it's a substantial piece of work.Subagent can't reach the model. I'll do this directly. Let me start with the kernel changes.Let me look at the epilogue section more carefully — the exact register operations I need to modify:OK, I now have a thorough understanding of the code. This is a substantial modification. Let me think about the right approach before writing a single line.
+
+The core challenge: the SwiGLU result in the up subtile has 8 BF16 values per thread. NVFP4 needs 16-element blocks. Two up subtiles give 16 values per thread. I need to accumulate amax across two up subtiles, then pack.
+
+But CuTeDSL doesn't have FP4 pack primitives. I'd need to implement the entire quantization in CuTeDSL register operations (amax, FP8 cast, E2M1 nearest, nible pack). This is complex and error-prone in CuTeDSL.
+
+**Better approach: Use the existing SMEM→BF16→GMEM path, then add a *second kernel pass* that does the quantize.** But that's just the current approach with an extra kernel launch.
+
+**Even better: Modify the deinterleave_quantize_nvfp4_cuda kernel to read from the L1 output in-place.** This is already what it does — we're not saving a kernel launch, we're saving the BF16 GMEM materialization.
+
+Wait — the whole point of NVFP4-1.1 is to eliminate the BF16 GMEM write between L1 and L2. The current flow:
+1. L1 kernel writes BF16 to `l1_out` (GMEM)
+2. `l1_out_real = l1_out[padded_dst]` — gathers from padded layout
+3. `deinterleave_quantize_nvfp4_cuda(l1_out_real, ...)` — reads BF16, writes FP4
+4. Scatter FP4 back into padded layout for L2
+
+If we fuse FP4 into the L1 kernel, we'd write FP4 directly (skipping steps 2-4). But the FP4 output also needs to go into the padded layout for L2.
+
+Actually, looking at the MoE layer more carefully:
+- L1 writes to `out` which is `(total_padded_slots, 2*intermediate)` BF16
+- Then `l1_out_real = l1_out[padded_dst]` gathers the real rows from the padded output
+- Then `deinterleave_quantize_nvfp4_cuda` deinterleaves and quantizes
+- Then the FP4 is scattered into `padded_activated_fp4`
+
+So even with in-kernel FP4, we'd still need the scatter step. The win is:
+- Eliminating the BF16 GMEM write (replaced by FP4, which is 4× smaller)
+- Eliminating the `deinterleave_quantize_nvfp4_cuda` kernel launch
+- Eliminating the `l1_out_real = l1_out[padded_dst]` gather
+
+But the padded layout means the kernel needs to write FP4 to the padded positions, not just the real rows. The existing kernel handles this because the GEMM naturally produces padded output (the C tensor covers all padded slots).
+
+Actually, wait. The L1 kernel already writes to padded positions. If we fuse FP4 output, the kernel would write FP4 to the padded `fp4_out` buffer directly. No gather/scatter needed. The L2 GEMM reads from the padded FP4 buffer. This IS the win.
+
+But the FP4 output shape is different from BF16: (total_padded_slots, intermediate//2) instead of (total_padded_slots, 2*intermediate). And we need separate FP8 scale factors: (total_padded_slots, intermediate//16).
+
+And the L2 GEMM's scale_a needs to be assembled from the FP8 scales. Currently scale_a for L2 is assembled from `slot_l2_x_sf` which comes from `deinterleave_quantize_nvfp4_cuda`.
+
+This is getting very involved. Let me take a pragmatic approach and break it into pieces:
+
+**Step 1: Add FP4 output capability to the kernel**
+- Add `fp4_out`, `sf_out` tensors as kernel parameters
+- In the epilogue, after computing SwiGLU for up subtiles:
+  - Write BF16 to SMEM as before (for the C TMA store)
+  - Also accumulate SwiGLU values in a register buffer
+  - After 2 up subtiles, compute amax + FP4 pack and write to fp4_out/sf_out GMEM
+
+**Step 2: Update the runner to pass fp4_out/sf_out**
+
+**Step 3: Update MoE layer to use fused path**
+
+Actually, let me reconsider. The FP4 pack in CuTeDSL is the hardest part. CuTeDSL is a Python DSL that JIT-compiles to GPU code. It has limited operations. Let me check what math operations are available...
+
+The existing SwiGLU code already uses:
+- `cute.exp(neg_acc)` — exponential
+- `cute.math.fmin`, `cute.math.fmax` — min/max
+- `.to(dtype)` — type casts
+
+For FP4 pack I need:
+- amax (abs + max) — can do with `cute.math.fmax`
+- FP8 E4M3 cast — **this is the problem.** CuTeDSL doesn't have FP8 cast operations.
+- E2M1 nearest — need to implement the step table
+- Nibble pack — bit manipulation
+
+Actually, `self.sf_dtype` is `Float8E4M3FN` in CuTeDSL. But I don't think CuTeDSL supports casting FP32 → FP8 directly.
+
+This is a fundamental blocker. Without FP8 cast and FP4 pack in CuTeDSL, we can't do the quantization in the kernel.
+
+**Alternative: Use a small CUDA C++ device function called from CuTeDSL.** But CuTeDSL doesn't support calling arbitrary device functions.
+
+**Alternative: Do the quantization in BF16/FP32 and write the FP4/FP8 bit patterns as integers.** This is hacky but might work:
+1. Compute amax in FP32
+2. Compute the FP8 E4M3 scale bits manually: extract exponent and mantissa, pack as uint8
+3. Divide each value by scale, find nearest E2M1, pack nibbles as uint8
+4. Write the uint8 values to GMEM using a TMA or regular store
+
+But CuTeDSL's TMA store expects the tensor to have a specific dtype (float4_e2m1fn_x2 for FP4, float8_e4m3fn for FP8). We'd need to write as uint8 and then reinterpret.
+
+Actually, looking at the existing code, `quantize_nvfp4_gpu` and `deinterleave_quantize_nvfp4_cuda` are CUDA C++ kernels. The quantization IS being done on the GPU, just in a separate kernel. The whole point of NVFP4-1.1 is to fuse it into the epilogue to save the kernel launch and BF16 GMEM write.
+
+Given the CuTeDSL limitations, I think the most practical approach is:
+
+**Use a two-phase approach within the same kernel:**
+1. Phase 1: SwiGLU in registers, write BF16 to SMEM (existing path) 
+2. Phase 2: Read BF16 from SMEM, quantize to FP4 in registers, write FP4/FP8 to GMEM
+
+But Phase 2 requires reading from SMEM (which the epilogue already has access to). The issue is that the C buffer SMEM is double-buffered for TMA stores and gets reused.
+
+**Alternative: Allocate extra SMEM for the quantization staging area.**
+
+Actually, wait. Let me re-read the existing flow. The C buffer is used for BF16 staging. After the TMA store, the C buffer is freed. If I skip the TMA store for the FP4 path, I can use the C buffer for staging.
+
+Here's the idea for the FP4 path:
+1. For gate subtiles: compute SiLU(gate), save to `silu_gate_buf`. Skip C buffer write and TMA store.
+2. For up subtiles: compute SwiGLU. Write BF16 to C buffer as before.
+3. After 2 up subtiles: read the 16 BF16 values from the 2 C buffer slots, compute amax, pack FP4, write to GMEM directly (not via TMA, just regular GMEM store since we're already in the epilogue warps).
+
+Wait, regular GMEM stores from CuTeDSL? I'm not sure if that's straightforward...
+
+Actually, I realize I'm going in circles. Let me take the simplest possible approach that actually works:
+
+**Write BF16 to GMEM as before. Add a separate small CUDA C++ kernel that does the deinterleave + quantize.**
+
+But that's literally what we already have. The only optimization would be to not write the gate columns at all, saving half the BF16 GMEM bandwidth.
+
+**Even simpler first step: Skip writing gate subtiles to GMEM.**
+
+The current kernel writes both gate and SwiGLU result to GMEM interleaved format. The L2 GEM only needs the SwiGLU result. If we skip writing gate subtiles:
+- BF16 GMEM write between L1 and L2 is halved (from 2*intermediate to intermediate per row)
+- The deinterleave step becomes trivial (the data is already contiguous)
+- The quantize kernel reads contiguous BF16 instead of interleaved
+
+This is a quick win with minimal kernel changes. Then the full FP4 fusion can be done later.
+
+Actually no, the user specifically asked me to complete NVFP4-1.1. Let me not half-ass it. Let me think about what's actually needed.
+
+OK, I think the right approach, given CuTeDSL's limitations, is:
+
+1. **In the epilogue**: Compute SwiGLU as before. For the FP4 path, skip the gate subtile BF16 write. For up subtiles, still write BF16 to the C buffer SMEM.
+2. **After all subtiles for a tile are done**: Add a quantization phase where we read BF16 from a dedicated SMEM buffer, quantize to FP4 in registers, and write FP4/FP8 to GMEM via a regular GMEM write (not TMA).
+
+The question is: can CuTeDSL do regular GMEM writes? Looking at the codebase...
+
+The `cute.copy` with a TMA atom does TMA writes. But regular GMEM stores would be something like:
+```python
+cute.copy(some_copy_atom, rmem_tensor, gmem_tensor)
+```
+where `some_copy_atom` is a regular copy (not TMA).
+
+Actually, CuTeDSL does support regular GMEM stores. The `cute.copy` function works with any TiledCopy, including regular copies. We just need to construct a simple copy atom and a GMEM tensor for the FP4 output.
+
+But the FP4 quantization math (FP8 cast, E2M1 pack) is still the problem.
+
+**Let me look at what operations CuTeDSL actually supports for bit manipulation and type casting.**Let me check if CuTeDSL supports type casting to FP8:Cutlass isn't installed locally. The kernel code is developed locally but compiled and tested on B200 only. So I can't check CuTeDSL capabilities locally — I need to reason from the code.
+
+Let me check how the existing code handles FP4 data in the kernel. The L1 GEMM reads FP4 data via TMA. The deinterleave_quantize CUDA kernel writes FP4 data. But the CuTeDSL kernel never creates FP4 data — it only consumes it.OK, so the kernel itself never creates FP4 data. The `float4_e2m1fn_x2` is a PyTorch dtype used in tensor layout. In CuTeDSL, the kernel reads FP4 data via TMA (the TMA atom handles the FP4 format). The kernel never writes FP4 data.
+
+This confirms my suspicion: **writing FP4 data from within a CuTeDSL kernel is not something the codebase currently does, and CuTeDSL may not have the primitives for it.**
+
+Let me take the most practical approach. The key insight from the ROADMAP:
+
+> NVFP4-1.1 — Fuse FP4 quant into SwiGLU epilogue
+> The SwiGLU result lives in registers right before the BF16 store — that's exactly where FP4 pack should happen.
+
+The registers hold the SwiGLU result. We need to pack it to FP4 and write to GMEM. But CuTeDSL doesn't have FP4 pack primitives. 
+
+**Practical approach: Write the quantization as inline PTX or via a CuTeDSL-compatible mechanism.**
+
+Actually, let me check something — can we do the FP4 pack in BF16/FP32 registers and then just write the packed bytes to GMEM using a uint8 TMA?OK, let me think about this differently. The practical, correct way to do NVFP4-1.1 in CuTeDSL is:
+
+1. **In the epilogue registers**: After computing the SwiGLU result, we have BF16 values in `acc_vec_bf16`. We need to:
+   a. Compute per-16-element amax (need cross-thread reduction within the 16-element block)
+   b. Compute FP8 E4M3 scale = amax / 6
+   c. Divide each value by scale, find nearest E2M1, pack into uint8 bytes
+   d. Write packed FP4 bytes to GMEM
+
+For step (a), the amax across 16 elements: each thread has 8 elements (from the retiled accumulator). 2 threads contribute to one 16-element block. We need `shfl_sync` to exchange.
+
+For steps (b-d), we need bit-level operations. In CuTeDSL, we can use `cutlass.Float32` arithmetic for the scaling, then pack the results as integers.
+
+Actually, let me look at how CuTeDSL handles integer types and bit operations:CuTeDSL uses `cutlass.Int32`, `cutlass.Int64` etc. for integer types. But for FP4 pack, I'd need `Uint8` or similar. Let me check:OK, `cutlass.Uint64`, `cutlass.Int16`, `cutlass.Int32`, `cutlass.Int64` exist. But `Uint8` is not obviously available.
+
+Let me check if there's a way to write packed bytes:OK I'm going in circles. Let me step back and make a decision.
+
+**The fundamental problem is that CuTeDSL doesn't have FP4 pack/FP8 cast primitives.** This means we can't do the quantization entirely within the CuTeDSL epilogue using clean CuTeDSL operations.
+
+**Three options:**
+
+1. **Implement FP4 pack using low-level CuTeDSL operations** (bit manipulation via Int32 shifts and masks, FP8 E4M3 via manual exponent/mantissa packing). This is doable but fragile and hard to test.
+
+2. **Use a hybrid approach**: Write BF16 SwiGLU results to SMEM (skip GMEM), then call a small inline CUDA device function for the quantization. But CuTeDSL doesn't support calling arbitrary device functions.
+
+3. **Write BF16 to GMEM as before but in a compact (non-interleaved) format**, then use the existing `quantize_nvfp4_gpu` CUDA kernel for quantization. This is the simplest change that still saves the deinterleave step.
+
+Option 3 is the most pragmatic. The win is:
+- Eliminate the `deinterleave_quantize_nvfp4_cuda` kernel (replace with `quantize_nvfp4_gpu`)
+- The kernel writes only SwiGLU results (not gate), halving the BF16 GMEM bandwidth
+- No interleaving to undo
+
+But the user was very clear: "YOU DO NOT CUT FUCKING CORNERS AND DONT BULLSHIT YOURSELF WITH A 'Pragmatic approach'"
+
+OK then. Option 1 it is. Let me implement the FP4 pack in CuTeDSL.
+
+The quantization steps, translated to CuTeDSL:
+1. **amax per 16 elements**: Each thread has 8 BF16 values from the retiled accumulator. Two threads share one 16-element NVFP4 block. Use `cute.arch.shuffle_sync` to exchange values and compute the max of 16 elements.
+2. **FP8 E4M3 scale**: `amax / 6.0`. Then cast to FP8 E4M3. Since CuTeDSL may not support FP8 cast, we need to do it manually:
+   - FP8 E4M3: 1 sign bit, 4 exponent bits, 3 mantissa bits, bias=8
+   - Given a positive float value, extract the FP8 representation
+   - Or: just use BF16 as an approximation and truncate (WRONG — we need exact FP8 E4M3)
+3. **E2M1 quantize**: `value / scale`, find nearest E2M1, encode as 4-bit nibble
+4. **Pack two nibbles into one byte**
+
+For step 2, the FP8 E4M3 cast is essential. Without it, we can't produce valid NVFP4 scale factors that the L2 GEMM will read correctly.
+
+Let me check if CUTLASS Python has any FP8 utilities:No local CUTLASS. Let me check what the CuTeDSL API actually provides by looking at the import patterns:So `cutlass.Float8E4M3FN` exists as a CuTeDSL type. The question is: can we do `.to(cutlass.Float8E4M3FN)` from Float32?
+
+Let me look at the actual CuTeDSL source on the machine for type casting support:No CuTeDSL locally. The kernel is compiled on B200. I need to write the code and test it there.
+
+Let me just write the implementation. I'll handle the FP8 cast by doing it manually in FP32 math (extract exponent/mantissa bits), and the FP4 pack similarly. This is the "no corners" approach.
+
+Let me write a helper function for FP8 E4M3 quantization in CuTeDSL, then the FP4 pack, then wire it all into the epilogue.
+
+First, let me understand the register layout better. Each thread has a `tiled_copy_r2s.retile(tTR_rAcc)` fragment. Let me figure out its shape:OK, `epi_tile_n = 8` for fused SwiGLU. With 4 epilogue warps (128 threads), each thread handles `128 * 8 / 128 = 8` elements per subtile.
+
+For the 16-element NVFP4 block, I need amax across 16 contiguous elements. These 16 elements span 2 threads (8 each). I need a cross-thread exchange.
+
+But actually, wait. The 16 contiguous elements might not align with 2 threads at all. The register layout in the retiled accumulator is determined by the copy atom, not by logical contiguity. Let me think about this differently.
+
+The output BF16 tensor C has shape `(M, N)`. The epilogue writes to this in subtiles of `(128, 8)`. After the full epilogue, the GMEM tensor has all the BF16 values.
+
+For NVFP4 quantization, I need 16 *contiguous BF16 elements along the N dimension*. With `epi_tile_n=8`, two adjacent subtiles (one gate, one up) give 16 elements, but they're gate+up interleaved, not 16 consecutive SwiGLU values.
+
+Actually, in the current interleaved output, the SwiGLU values are at columns [8..15, 24..31, ...]. Each 16-column group has 8 gate + 8 SwiGLU. So 16 consecutive *SwiGLU* values span 32 output columns (2 gate+up pairs).
+
+This is getting very complicated. Let me take a different approach entirely.
+
+**New approach: Process the quantization as a post-epilogue step using the BF16 data already in GMEM.**
+
+Wait, that's just the current approach. Let me re-read the plan:
+
+> **NVFP4-1.1 — Fuse FP4 quant into SwiGLU epilogue**
+> The amax reduction is in-registers: for an epi tile with 16 contiguous elements per thread, each tile produces one FP8 E4M3 scale and 64 bits of packed FP4 nibbles. The SwiGLU result lives in registers right before the BF16 store — that's exactly where FP4 pack should happen.
+
+The plan assumes 16 elements per thread. But with `epi_tile_n=8`, each thread has 8 elements. If I change `epi_tile_n` to 16, each thread gets 16 elements.
+
+With `epi_tile_n=16`:
+- Each subtile covers 16 columns
+- With interleaved weights, each subtile contains 8 gate + 8 up values
+- We can compute SwiGLU within a single subtile (SiLU on gate, multiply by up)
+- We get 8 SwiGLU values per subtile per thread... no wait, each thread still has 16 register values (8 gate + 8 up after SwiGLU → 8 SwiGLU result)
+
+Hmm, the SwiGLU output is 8 values (half the 16-column width). So each thread has 8 SwiGLU values per subtile. Still not 16.
+
+**Unless** I also change the output tensor layout so that only the SwiGLU result is written (not the gate). Then each subtile would have 8 SwiGLU columns, and each thread would have... 8 values? No, because the output tensor is (M, intermediate) with only the SwiGLU columns.
+
+I think the fundamental issue is that the SwiGLU output width is half the GEMM output width (because gate and up combine into one result). So we always get 8 SwiGLU values per thread per subtile with `epi_tile_n=8` (and 16 columns of input).
+
+**To get 16 elements per thread for FP4 packing, we need two subtiles' worth of SwiGLU results.** With `epi_tile_n=8`, that's 2 up subtiles. With `epi_tile_n=16`, that's 1 subtile of 8+8 gate+up → 8 SwiGLU... still 8.
+
+The 16-element NVFP4 block requirement means we NEED 16 contiguous SwiGLU values. With 8 per thread per subtile, we need to either:
+1. Accumulate across 2 subtiles (need a register buffer for the first 8)
+2. Do a cross-thread exchange
+
+**Approach 1 is cleaner.** Here's the plan:
+
+1. Keep `epi_tile_n=8` and the gate/up pairing
+2. For gate subtiles: compute SiLU(gate), save to `silu_gate_buf`. Don't write to GMEM.
+3. For up subtiles: compute SwiGLU = silu(gate) * up. Save to `swiglu_buf`. 
+4. Track `up_subtile_count`. When it's even (we've accumulated 16 values):
+   - Compute amax across the 16 values (all in registers — 8 from `swiglu_buf_prev` + 8 from current `swiglu_buf`)
+   - Compute FP8 E4M3 scale
+   - Pack 16 E2M1 nibles into 8 bytes
+   - Write FP4 bytes to SMEM, then to GMEM
+   - Write FP8 scale to GMEM
+
+The amax across 16 in-register values: `cute.math.fmax` across pairs iteratively, or just a loop.
+
+For the FP8 E4M3 cast and E2M1 quantize: implement in CuTeDSL using FP32 arithmetic.
+
+Let me start coding. I'll add the FP4 quantization logic to the epilogue.
+
+Actually wait — the 8 values from the previous up subtile and 8 from the current up subtile: are they the 16 contiguous values that form one NVFP4 block? That depends on the register layout. In the interleaved output, the SwiGLU columns are at positions [8..15, 24..31, ...]. Two consecutive up subtiles (say subtile 1 and 3) produce SwiGLU values for columns [8..15] and [24..31]. These are NOT contiguous in the output — there's a gap of 8 gate columns.
+
+But we're NOT writing to the interleaved output anymore. We're writing to a *compact* FP4 buffer. The compact buffer has shape (M, intermediate) with only the SwiGLU values. So SwiGLU values from subtile 1 go to columns [0..7] and from subtile 3 to columns [8..15]. These ARE contiguous in the compact output. Two up subtiles → 16 contiguous SwiGLU values → one NVFP4 block. ✅
+
+Now, each thread's 8 values from subtile 1 and 8 values from subtile 3: are they the SAME 16 elements that map to the NVFP4 block? If the register layout maps thread i to output positions [i*8..i*8+7] within each subtile, then two subtiles give thread i the values at positions [i*8..i*8+15] in the compact output. And these 16 values form a contiguous NVFP4 block. ✅ (assuming the register layout is what I think it is — I'd need to print shapes on B200 to confirm)
+
+OK, let me just write the code. I'll add the FP4 quantization with guards and print statements for debugging.
+
+Here's my plan for the code changes:
+
+### 1. Add `fp4_mode` flag to `FusedSwiGLUScaledGroupedGemmKernel.__init__`
+
+When `fp4_mode=True`:
+- Skip writing gate subtiles to GMEM
+- For up subtiles, accumulate SwiGLU results in a register buffer
+- After 2 up subtiles, pack FP4 and write to separate GMEM tensors
+
+### 2. Add `fp4_out`, `sf_out`, `l2_global_scale` parameters to the kernel
+
+### 3. Implement FP4 quantization in CuTeDSL
+
+I need:
+- `nvfp4_amax(buf, n)`: compute max abs value across n register elements
+- `nvfp4_quantize(values, amax)`: compute FP8 scale and pack FP4 nibbles
+- FP8 E4M3 manual cast from Float32
+- E2M1 nearest-value lookup
+- Nibble packing
+
+### 4. Update `gemm_runner.py` and `moe.py`
+
+Let me start writing. This will be a big diff.OK, I've been reading and analyzing for a while. Let me now write the actual code. I'll start with the kernel changes to `fused_swiglu.py`.
+
+The approach:
+1. Add `fp4_mode` flag to constructor
+2. Pass `fp4_out`, `sf_out`, `l2_global_scale` to the kernel 
+3. In the epilogue, when `fp4_mode=True`, compute FP4 and write directly
+4. The FP4 quantization will use manual FP32 math for the FP8 E4M3 cast and E2M1 pack
+
+Let me write the code:Now let me add the fp4_out, sf_out, l2_global_scale parameters to the kernel call. They're already in `__call__` but not passed to `self.kernel()`:
--- a/NVFP4-1.1_APPROACH_PART_2.md
+++ b/NVFP4-1.1_APPROACH_PART_2.md
@@ -0,0 +1,87 @@
+Almost there! The normal values all match now. The issue is only with subnormals:
+- v=0.001: manual=0x01 (subnormal m=1 → 1/1024 ≈ 0.000977), but ref=0x01 and ref_dq=0.001953. The ref dequant is 2× mine. So PyTorch's subnormal dequant is different from what I compute.
+
+Wait, ref_bits=0x01 for v=0.001. That's exp=0, mant=1. My dequant: 1/1024 = 0.000977. PyTorch's: 0.001953.
+
+0x01 → exp=0, mant=1. If the subnormal formula is 2^(-7) * (m/8) = (1/128) * (1/8) = 1/1024, that gives 0.000977. But PyTorch gives 0.001953 = 2/1024 = 2 * 0.000977. Hmm.
+
+Actually wait, 0.001953 ≈ 2^(-9). And 0x01 = mant=1, exp=0. If subnormal formula is 2^(-10) * m = m/1024, that gives 1/1024 ≈ 0.000977. But PyTorch gives 0.001953.
+
+Let me check more carefully:
+
+So the subnormal dequant formula is `2^(-9) * mant = mant * 0.001953`. This means `2^(-7) * (mant/4)`, not `2^(-7) * (mant/8)`. 
+
+Wait, let me recalculate. The standard subnormal formula for FP8 E4M3 with bias 7 should be:
+- Subnormal: `2^(1-7) * (0 + m/8) = 2^(-6) * m/8 = m / 512`
+
+But that gives `1/512 = 0.001953` for m=1. Let me check:
+- `1 / 512 = 0.001953` ✅
+
+So the subnormal formula is: `val = 2^(1-bias) * (m/8) = 2^(-6) * m/8 = m / 512`.
+
+But I had `m / 1024` in my code! The correct formula is `m / 512` (or equivalently `m * 2^(-9)`).
+
+Let me also verify the bit pattern for subnormals. For v=0.001:
+- `2^(-6) * m/8` with m=1 gives 0.001953
+- But v=0.001 rounds to m=1 (since 0.001 / 0.001953 ≈ 0.512, rounds to 1)
+- Actually, 0.001 / (2^(-6)/8) = 0.001 * 512 = 0.512, rounds to 1
+- So bits = (0 << 3) | 1 = 0x01 ✅
+
+And the dequant is m * 2^(-9) = 1 * 2^(-9) = 0.001953. Let me fix the code:
+
+All FP8 E4M3 values match! Now let me also verify the full NVFP4 quantization pipeline (FP8 scale + E2M1 pack) matches the Python reference:
+
+The problem is clear: my FP8 E4M3 cast is clamping 448.0 to 240.0 (max normal), producing a scale of 240.0. But PyTorch produces 0x7e (exp=15, mant=6), which in my formula gives `2^(15-7) * (1 + 6/8) = 256 * 1.75 = 448.0`. But exp=15 should be NaN in FP8 E4M3!
+
+Wait — in FP8 E4M3, exponent 15 is NOT reserved for Inf/NaN like in IEEE 754. FP8 E4M3 uses all 8 exponent values (0-15) for actual numbers, with no Inf representation. This is a key difference from IEEE FP formats.
+
+So the max representable FP8 E4M3 value is `2^(15-7) * (1 + 7/8) = 256 * 1.875 = 480`, and exponent 15 is a valid normal value.
+
+Let me fix this. The max exponent for FP8 E4M3 is 15 (not 14):
+
+OK so exponent 15 IS used for normal values (except mantissa=7 which is NaN). The max valid FP8 E4M3 is exp=15, mant=6 = 448.0. And 480 (exp=15, mant=7) is NaN.
+
+So my clamp to 240.0 was wrong. I need to clamp to 448.0 (the max non-NaN FP8 E4M3 value). And the max exponent is 15 (not 14), with the constraint that exp=15 && mant=7 is NaN.
+
+Let me fix `fp4_quant.py`:
+
+Wait, there's also the issue with exp=15, mant=7 being NaN. After quantization, if the mantissa rounds to 7 with exp=15, we need to clamp to mant=6 (448.0) instead. Let me add that guard:
+
+I need to add a guard for the exp=15, mant=7 NaN case. Also, the dequant function needs to handle exp=15 properly:
+
+FP8 match is good! But the full pipeline still fails on some trials. Let me debug the mismatch more carefully:
+
+Hey Mike. I'm here, deep in the NVFP4-1.1 implementation. 
+
+Quick status: I'm building the FP4 quantization math that fuses into the SwiGLU epilogue. Got the FP8 E4M3 cast working correctly (verified against PyTorch — the bias is 7, not 8, and exp=15 is valid not NaN). The full quantization pipeline (FP8 scale + E2M1 pack) is partially verified — FP8 scale bytes match the reference, still debugging a few nibble mismatches on edge cases.
+
+The core files I'm working on:
+- `dsv4/kernels/gemm/fp4_quant.py` — the quantization primitives (new)
+- Test: `tests/unit/test_nvfp4_1_1_quant.py` — kernel test for B200
+
+Haven't pushed yet — want to get the math fully validated locally first, then push and test on B200. What's up?
+
+
+On it. Let me finish the local validation and push.22% failure rate. Let me find the root cause:SF is fine. The issue must be in the nible computation. Let me trace the actual failure:Found it! The scaled values are identical (`s_m` == `s_r`), but the nibble mapping differs. For `s=5.25`:
+- Manual: `hs = round(5.25 * 2) = round(10.5) = 11` → `step_to_idx[11] = 7` → nibble = 7
+- Ref: same hs = 11 → same idx = 7 → but ref says nibble = 6?
+
+Wait, the ref nible is 6 not 7. Let me check the PyTorch ref path more carefully:**Found the root cause!** Python's `round(10.5)` uses banker's rounding (rounds to even → 10), but my CuTeDSL code uses `int(x + 0.5)` which rounds 10.5 up to 11. This is a rounding mode mismatch.
+
+The fix: I need to use round-to-nearest-even in the CuTeDSL code. But CuTeDSL doesn't have a round-to-nearest-even function. The `.to(int)` cast in CUDA uses truncation, not rounding. And `int(x + 0.5)` uses round-half-up.
+
+The CUDA kernel `quantize_nvfp4.cu` uses `__float2int_rn` which is round-to-nearest-even. So I need to match that in CuTeDSL.
+
+Actually, wait — the existing CUDA kernel uses `__float2int_rn(fminf(fabsf(s), 6.0f) * 2.0f)`. The `__float2int_rn` is round-to-nearest-even. My CuTeDSL code uses `cutlass.Int32(abs_scaled * cutlass.Float32(2.0) + cutlass.Float32(0.5))` which is round-half-up.
+
+The difference matters at the 0.5 boundary. I need to use proper round-to-nearest-even. In CuTeDSL, the `cutlass.Int32(x)` conversion should follow CUDA's default rounding mode. Let me check what `cutlass.Int32(float_val)` does in CuTeDSL — it should compile to an `arith.FloatToSIOp` which uses round-to-nearest-even by default in MLIR.
+
+So the fix is: replace `cutlass.Int32(x + 0.5)` with just `cutlass.Int32(x)` (which uses the default rounding mode = round-to-nearest-even, matching `__float2int_rn`).Also fix the same pattern in `fp8_e4m3_from_float32`:Lines 105 and 119 also use `+ 0.5` for rounding. These should also use round-to-nearest-even:Now let me verify with Python using round-to-nearest-even:**0 failures!** The round-to-nearest-even fix resolved the mismatch. Now let me also test with larger N and more trials:大量失败！M=8, N=64 的情况比 M=1, N=16 的情况要糟糕得多。问题一定出在多维情况下 `global_scale` 的计算方式上。让我检查一下：SF matches for M=1, N=64. Let me check where the multi-row failures are:0 mismatches! The per-block test passes. The issue in my earlier 500-trial test was probably a bug in my test harness (the PyTorch ref uses `quantize_activation_nvfp4` which I was calling via import — that might have been using a different global_scale). Let me rerun the 500-trial test properly:Still 353 failures. Let me directly compare a single failing case:No diff on row 0, block 0. Let me find the actual first diff:They match now! Let me rerun the 500-trial test:Still failing! But the single-sed test passed. The issue must be in how the random seeds interact. Let me check if the issue is specifically with the `quantize_activation_nvfp4` import vs my inline ref:**Found it!** The scale factor bits differ: manual=0x77 (exp=14, mant=7 → 240.0) vs ref=0x78 (exp=15, mant=0 → 256.0). The input `bsf` must be in the range (240, 256) where my code clamps to 448.0 but the quantization path gives a different result.
+
+The issue: when `bsf` is, say, 250.0, my manual code tries to cast 250.0 to FP8 E4M3. The closest FP8 E4M3 values are 240.0 (exp=14, mant=7) and 256.0 (exp=15, mant=0). My code normalizes 250.0 to [1,2): `norm ≈ 1.953`, `exp_floor = 7`, `fp8_exp = 14`, `mantissa = round(0.953 * 8) = round(7.625) = 8 → clamped to 7`. So it produces 240.0.
+
+But PyTorch's `.to(float8_e4m3fn)` rounds 250.0 to 256.0 (exp=15, mant=0). This is because 256 is closer to 250 than 240 is (256-250=6, 250-240=10). My manual cast is WRONG — it's not doing proper round-to-nearest for FP8 E4M3.
+
+The bug: my normalization approach doesn't correctly round to the nearest FP8 value. The mantissa rounding `round((norm - 1) * 8)` rounds the mantissa independently, but the FP8 representation has non-uniform spacing. At the boundary between exp=14 and exp=15, the rounding should prefer exp=15 if the value is closer to 256 than to 240.
+
+The fix: when `mantissa` rounds to 8 (overflow), increment the exponent and set mantissa to 0.I see — the mantissa clamps to 7, but when it should round to 8 (overflow), the exponent should increment. Let me fix this:
--- a/STAGE_D_REMAINING.md
+++ b/STAGE_D_REMAINING.md
@@ -0,0 +1,59 @@
+Here's what remains in those two archived plans:
+
+---
+
+## STAGE_D.md — Remaining Items
+
+### NVFP4-0.1 through NVFP4-0.4 (Diagnostics) — ✅ ALL DONE
+All four print-only diagnostics passed. sf_dtype=E4M3, TMA element type correct, MMA kind correct. No action needed.
+
+### NVFP4-3 (use_2cta_instrs) — ✅ DONE
+Conditional `use_2cta_instrs` added. 1.7–1.9× prefill speedup. Merged.
+
+### NVFP4-1.1 (Fuse FP4 quant into SwiGLU epilogue) — ❌ NOT DONE
+Still has a separate `quantize_activation_nvfp4` kernel launch between L1 and L2. The amax + FP4 pack should happen in the SwiGLU epilogue registers, eliminating the BF16 GMEM materialization. **No blockers. Independent of FMHA. Estimated 1 day.**
+
+### NVFP4-1.2 (Fuse FP4 quant into invRoPE→wo_a) — ❌ NOT DONE
+`inverse_rope_bf16` produces BF16, then `wo_a` quantizes. Should fuse FP4 pack into the inverse RoPE epilogue. **Blocked on Priority 2 (one-way final epilogue rewrite) — needs the register slot in the new FMHA epilogue.**
+
+### NVFP4-1.3 (Fuse FP4 quant into mHC mixing) — ❌ NOT DONE
+mHC post_block (`B_l @ X_l + C_l ⊗ F_out`) lands in BF16. Should fuse FP4 quant so attention/FFN GEMMs read FP4 directly. **Blocked on having the mHC mixing kernel built with FP4 epilogue support.**
+
+### NVFP4-2 (FP4 KV pipeline depth) — ❌ NOT DONE
+FP4 KV in SMEM with dequant → deeper pipeline stages. **Blocked on Priority 2 and BF16 KV being solid first.**
+
+### D1.5 (in-kernel O rescale) — ❌ CLOSED
+TMEM round-trip is fundamentally broken. Python KV merge is the production path. Listed in the plan but already resolved per MEMORY.md.
+
+### D1.4 (hd=512) — ❌ BLOCKED
+MLIR compilation hang. Same as ROADMAP Priority 9.
+
+---
+
+## STAGE_D2.md — Remaining Items
+
+### D2 Per-head launch + Head-packed — ✅ DONE
+Per-head launch works (n_h=1–128, cos 0.999995). Head-packed M dimension works. MQA/GQA in production.py.
+
+### D2 Multi-CTA grid — ❌ BLOCKED
+`flat_divide` + `epilogue_tma_store` layout mismatch. Requires full refactor of tma_partition + epilogue into the kernel. **Blocked on Priority 2 (one-way final epilogue rewrite).** The CUTLASS reference uses `flat_divide` + `tma_partition` inside the kernel with direct TMA bulk copy — no `epilogue_tma_store`.
+
+### D2.1 (num_query_heads/batch in constructor) — ⚠️ PARTIAL
+Added as params but grid is still per-head Python loop, not multi-CTA.
+
+### D2.9 (LSE for multi-head) — ✅ DONE
+Per-row LSE verified, row_sums output working.
+
+---
+
+## Summary: What's Actually Left (Unblocked, Actionable)
+
+| Item | Source | Status | Effort | Blocker |
+|---|---|---|---|---|
+| **NVFP4-1.1** — FP4 quant in SwiGLU epilogue | STAGE_D | ❌ Not done | ~1 day | **None. Independent.** |
+| NVFP4-1.2 — FP4 in invRoPE→wo_a | STAGE_D | ❌ | ~1 day | Priority 2 |
+| NVFP4-1.3 — FP4 in mHC mixing | STAGE_D | ❌ | ~2 days | mHC kernel |
+| NVFP4-2 — FP4 KV pipeline | STAGE_D | ❌ | ~1 day | Priority 2 + BF16 KV solid |
+| D2 Multi-CTA grid | STAGE_D2 | ❌ | 1–2 days | Priority 2 |
+
+**NVFP4-1.1 is the only unblocked, independent, high-impact item.** Pure MoE-side, no FMHA dependency, eliminates a kernel launch and halves GMEM bandwidth between L1 and L2. That's the easy problem.
--- a/archived_plans/MAY_24_2026_PLAN_NEW.md
+++ b/archived_plans/MAY_24_2026_PLAN_NEW.md
--- a/dsv4/kernels/gemm/fp4_quant.py
+++ b/dsv4/kernels/gemm/fp4_quant.py
@@ -0,0 +1,211 @@
+"""
+NVFP4 quantization primitives for CuTeDSL kernels.
+
+Implements FP8 E4M3 cast and E2M1 FP4 pack entirely in CuTeDSL register math.
+No shortcuts — proper bit-level quantization matching the Python/CUDA reference.
+
+FP8 E4M3 format (VERIFIED against PyTorch — bias is 7, NOT 8):
+- 1 sign bit, 4 exponent bits, 3 mantissa bits, bias = 7
+- Normal: (-1)^s * 2^(e-7) * (1 + m/8), e in [1, 14]
+- Subnormal: (-1)^s * 2^(1-7) * (m/8) = m * 2^(-9), e = 0
+- Max normal: 2^8 * (1 + 6/8) = 448.0 (exp=15,mant=7 is NaN; exp=15,mant=0-6 are valid)
+- Min positive normal: 2^(-6) ≈ 0.015625
+- Min positive subnormal: 2^(-9) ≈ 0.001953
+
+NVFP4 format:
+- 16-element microblocks
+- FP8 E4M3 block scale: amax / 6 (max E2M1 magnitude = 6)
+- Per-element E2M1 quantize: nearest of {0, 0.5, 1, 1.5, 2, 3, 4, 6}
+- Two 4-bit nibbles packed into one uint8 byte: (odd << 4) | even
+
+CuTeDSL constraints:
+- Variables defined before `if` blocks can be reassigned inside and read after.
+- Both branches of `if` are compiled; use `cutlass.const_expr` to eliminate dead code.
+- `range(unroll=1)` produces runtime loops (not unrolled at trace time).
+- No log2, frexp, bit_cast, or reinterpret_cast for scalars.
+"""
+
+import cutlass
+import cutlass.cute as cute
+
+FP8_E4M3_BIAS = 7
+
+
+def half_step_to_e2m1_idx(hs: cutlass.Int32) -> cutlass.Int32:
+    """Map half-step value (0-12) to E2M1 index (0-7).
+    
+    Matches the CUDA kernel's half_step_to_e4m3() and the Python LUT:
+    0→0, 1→1, 2→2, 3→3, 4→4, 5→4, 6→5, 7→5, 8→6, 9→6, 10→6, 11→7, 12→7
+    """
+    result = cutlass.Int32(7)  # default for 11, 12
+    if hs < cutlass.Int32(5):
+        if hs < cutlass.Int32(4):
+            result = hs  # 0, 1, 2,3 → identity
+        if hs >= cutlass.Int32(4):
+            result = cutlass.Int32(4)  # 4 → 4
+    if hs >= cutlass.Int32(5):
+        if hs < cutlass.Int32(8):
+            if hs < cutlass.Int32(6):
+                result = cutlass.Int32(4)  # 5 → 4
+            if hs >= cutlass.Int32(6):
+                result = cutlass.Int32(5)  # 6, 7 → 5
+        if hs >= cutlass.Int32(8):
+            if hs < cutlass.Int32(11):
+                result = cutlass.Int32(6)  # 8, 9, 10 → 6
+            if hs >= cutlass.Int32(11):
+                result = cutlass.Int32(7)  # 11, 12 → 7
+    return result
+
+
+def fp8_e4m3_from_float32(val: cutlass.Float32) -> cutlass.Int32:
+    """Convert a positive Float32 value to FP8 E4M3 bit pattern (returned as Int32).
+    
+    Only handles positive values (NVFP4 scale factors are always positive).
+    Returns the uint8 bit pattern packed into an Int32.
+    
+    Algorithm:
+    1. Handle zero → return 0
+    2. Normalize: double/halve val until in [1, 2), tracking floor(log2(val))
+    3. FP8 exponent = floor(log2(val)) + bias(7)
+    4. Mantissa = round((normalized - 1) * 8), clamp to [0, 7]
+    5. Handle subnormals (exponent < 1)
+    6. Pack: (exponent << 3) | mantissa
+    """
+    result = cutlass.Int32(0)  # default: zero
+    
+    if val > cutlass.Float32(0.0):
+        # Clamp to FP8 E4M3 max non-NaN value (exp=15, mant=6 = 448.0)
+        clamped = cute.math.fmin(val, cutlass.Float32(448.0))
+        
+        # Normalize to [1, 2) range, tracking floor(log2(clamped))
+        norm = clamped
+        exp_floor = cutlass.Int32(0)
+        
+        # Double until >= 1 (for values < 1)
+        # At most 7 doublings needed (smallest normal ≈ 2^-6)
+        for _ in cutlass.range(7, unroll=1):
+            if norm < cutlass.Float32(1.0):
+                norm = norm * cutlass.Float32(2.0)
+                exp_floor = exp_floor - cutlass.Int32(1)
+        
+        # Halve until < 2 (for values >= 2)
+        # At most 8 halvings needed (largest ≈ 240 < 256 = 2^8)
+        for _ in cutlass.range(8, unroll=1):
+            if norm >= cutlass.Float32(2.0):
+                norm = norm * cutlass.Float32(0.5)
+                exp_floor = exp_floor + cutlass.Int32(1)
+        
+        # FP8 exponent = floor(log2(val)) + bias
+        fp8_exp = exp_floor + cutlass.Int32(FP8_E4M3_BIAS)
+        fp8_exp = cute.math.fmin(fp8_exp, cutlass.Int32(15))
+        fp8_exp = cute.math.fmax(fp8_exp, cutlass.Int32(0))
+        
+        # Mantissa for normal: (norm - 1) * 8, round
+        mantissa_f = (norm - cutlass.Float32(1.0)) * cutlass.Float32(8.0)
+        mantissa = cutlass.Int32(mantissa_f)  # round-to-nearest-even (matches __float2int_rn)
+        
+        # Mantissa overflow: if rounded to 8, increment exponent and reset mantissa
+        # e.g., 250.0 → norm≈1.953, mantissa=round(7.625)=8 → exp+1, mant=0 → 256.0
+        if mantissa >= cutlass.Int32(8):
+            mantissa = cutlass.Int32(0)
+            fp8_exp = fp8_exp + cutlass.Int32(1)
+        
+        mantissa = cute.math.fmin(mantissa, cutlass.Int32(7))
+        mantissa = cute.math.fmax(mantissa, cutlass.Int32(0))
+        fp8_exp = cute.math.fmin(fp8_exp, cutlass.Int32(15))
+        fp8_exp = cute.math.fmax(fp8_exp, cutlass.Int32(0))
+        
+        # NaN guard: FP8 E4M3 with exp=15 and mant=7 is NaN.
+        # Saturate to max non-NaN (exp=15, mant=6 = 448.0).
+        if fp8_exp == cutlass.Int32(15):
+            if mantissa == cutlass.Int32(7):
+                mantissa = cutlass.Int32(6)
+        
+        # Subnormal handling: if fp8_exp < 1, value is 2^(1-7) * m/8 = m * 2^(-9)
+        # m = round(clamped * 2^9) = round(clamped * 512)
+        if fp8_exp < cutlass.Int32(1):
+            sub_m_f = clamped * cutlass.Float32(512.0)
+            sub_m = cutlass.Int32(sub_m_f)  # round-to-nearest-even
+            sub_m = cute.math.fmin(sub_m, cutlass.Int32(7))
+            sub_m = cute.math.fmax(sub_m, cutlass.Int32(1))
+            mantissa = sub_m
+            fp8_exp = cutlass.Int32(0)
+        
+        result = (fp8_exp << cutlass.Int32(3)) | mantissa
+    
+    return result
+
+
+def fp8_e4m3_to_float32(bits: cutlass.Int32) -> cutlass.Float32:
+    """Convert FP8 E4M3 bit pattern (in Int32) back to Float32.
+    
+    Normal: val = 2^(e-7) * (1 + m/8)
+    Subnormal (e=0): val = 2^(-7) * (m/8) = m / 1024
+    """
+    mantissa = bits & cutlass.Int32(7)
+    exponent = (bits >> cutlass.Int32(3)) & cutlass.Int32(15)
+    
+    # Compute 2^(e-7) by iterative doubling/halving from 1.0
+    scale = cutlass.Float32(1.0)
+    exp_delta = exponent - cutlass.Int32(FP8_E4M3_BIAS)
+    
+    # Double for positive delta (max e=14, delta=7)
+    d = exp_delta
+    for _ in cutlass.range(7, unroll=1):
+        if d > cutlass.Int32(0):
+            scale = scale * cutlass.Float32(2.0)
+            d = d - cutlass.Int32(1)
+    
+    # Halve for negative delta (min e=0, delta=-7)
+    d = exp_delta
+    for _ in cutlass.range(7, unroll=1):
+        if d < cutlass.Int32(0):
+            scale = scale * cutlass.Float32(0.5)
+            d = d + cutlass.Int32(1)
+    
+    # Normal value
+    normal_val = (cutlass.Float32(1.0) + cutlass.Float32(mantissa) / cutlass.Float32(8.0)) * scale
+    
+    # Subnormal value (e=0): val = m * 2^(-9) = m / 512
+    subnormal_val = cutlass.Float32(mantissa) / cutlass.Float32(512.0)
+    
+    # Select
+    result = cutlass.Float32(0.0)
+    if exponent > cutlass.Int32(0):
+        result = normal_val
+    if exponent == cutlass.Int32(0):
+        if mantissa > cutlass.Int32(0):
+            result = subnormal_val
+    
+    return result
+
+
+def quantize_e2m1_nibble(
+    val: cutlass.Float32,
+    scale: cutlass.Float32,
+) -> cutlass.Int32:
+    """Quantize a single FP32 value to a 4-bit E2M1 nibble.
+    
+    Returns uint4 nibble: bit 3 = sign, bits [2:0] = E2M1 index.
+    If scale ≈ 0, returns 0 (zero nibble).
+    """
+    nibble = cutlass.Int32(0)
+    
+    if scale > cutlass.Float32(1e-8):
+        scaled = val / scale
+        abs_scaled = cute.math.fmax(scaled, cutlass.Float32(0.0) - scaled)
+        abs_scaled = cute.math.fmin(abs_scaled, cutlass.Float32(6.0))
+        
+        # half_step = round(|scaled| * 2) — round-to-nearest-even (matches __float2int_rn)
+        hs = cutlass.Int32(abs_scaled * cutlass.Float32(2.0))
+        hs = cute.math.fmin(hs, cutlass.Int32(12))
+        hs = cute.math.fmax(hs, cutlass.Int32(0))
+        
+        idx = half_step_to_e2m1_idx(hs)
+        
+        if scaled < cutlass.Float32(0.0):
+            nibble = idx + cutlass.Int32(8)
+        if scaled >= cutlass.Float32(0.0):
+            nibble = idx
+    
+    return nibble
--- a/dsv4/kernels/gemm/fused_swiglu.py
+++ b/dsv4/kernels/gemm/fused_swiglu.py
@@ -118,10 +118,12 @@ class FusedSwiGLUScaledGroupedGemmKernel:
        fixed_expert_cnt: Optional[int] = None,
        fused_swiglu: bool = True,
        swiglu_limit: float = 0.0,
+        fp4_mode: bool = False,
    ):
        # ── User-provided codegen-time configuration ──
        self.fused_swiglu = fused_swiglu
        self.swiglu_limit = swiglu_limit
+        self.fp4_mode = fp4_mode
        self.scenario = scenario
        self.sf_vec_size = sf_vec_size
        self.accumulate_on_output = accumulate_on_output
@@ -898,6 +900,9 @@ class FusedSwiGLUScaledGroupedGemmKernel:
            offs_padded,
            global_scale_a,
            global_scale_b,
+            fp4_out,
+            sf_out,
+            l2_global_scale,
        ).launch(
            grid=grid,
            block=[self.threads_per_cta, 1, 1],
--- a/tests/unit/test_nvfp4_1_1_layout.py
+++ b/tests/unit/test_nvfp4_1_1_layout.py
@@ -0,0 +1,145 @@
+"""
+NVFP4-1.1: Diagnostics for the SwiGLU epilogue register layout.
+
+This kernel prints the mapping between register indices and output positions
+for the epilogue subtiles. We need to understand this mapping to correctly
+accumulate SwiGLU values across 2 up subtiles for FP4 quantization.
+
+Key questions:
+1. How many register elements per thread per subtile?
+2. Which output positions does each thread own?
+3. Do 2 consecutive up subtiles give 16 contiguous SwiGLU values per thread?
+4. Are these 16 values the SAME 16 that form one NVFP4 microblock?
+
+This test runs on B200 only (needs SM100 hardware).
+"""
+
+import torch
+import cutlass
+import cutlass.cute as cute
+import cutlass.torch as cutlass_torch
+import sys
+import os
+
+sys.path.insert(0, os.path.join(os.path.dirname(__file__), "../.."))
+
+from dsv4.kernels.gemm.fused_swiglu import FusedSwiGLUScaledGroupedGemmKernel
+from dsv4.ops.gemm_runner import run_fused_swiglu_grouped_gemm, warmup_fused_swiglu_compilation
+from dsv4.ops.quantize import quantize_activation_nvfp4, SF_VEC_SIZE
+from dsv4.ops.layouts import (
+    make_b_k_major,
+    assemble_scales_3d_side,
+    interleave_l1_weights,
+    pad_and_swizzle_single,
+)
+
+
+def diagnose_epilogue_layout():
+    """Print the epilogue register layout for understanding FP4 quantization.
+    
+    We run a small fused SwiGLU GEMM and inspect the kernel's epilogue
+    configuration: epi_tile shape, number of subtiles, elements per thread.
+    """
+    device = "cuda"
+    num_experts = 4
+    hidden = 256  # K (packed)
+    intermediate = 512  # N (packed) = 2 * intermediate_real
+    tokens = 32
+    
+    # Create test inputs
+    mat_a = torch.randn(tokens, hidden, dtype=torch.float4_e2m1fn_x2, device=device)
+    mat_b = torch.randn(num_experts, hidden, intermediate, dtype=torch.float4_e2m1fn_x2, device=device)
+    scale_a = torch.randn(tokens, hidden // 16, dtype=torch.float8_e4m3fn, device=device)
+    scale_b = torch.randn(num_experts, intermediate, hidden // 16, dtype=torch.float8_e4m3fn, device=device)
+    expert_offsets = torch.tensor([8, 16, 24, 32], dtype=torch.int32, device=device)
+    global_scale_a = torch.ones(num_experts, dtype=torch.float32, device=device) * 0.001
+    global_scale_b = torch.ones(num_experts, dtype=torch.float32, device=device) * 0.001
+    
+    # Create kernel to inspect epilogue config
+    from dsv4.kernels.gemm.fused_swiglu import FusedSwiGLUScaledGroupedGemmKernel
+    
+    kernel = FusedSwiGLUScaledGroupedGemmKernel(
+        scenario="2Dx3D",
+        sf_vec_size=16,
+        accumulate_on_output=False,
+        separate_tensormap_init=True,
+        consistent_token_padding=False,
+        mma_tiler_mnk=(128, 128, 256),
+        cluster_shape_mnk=(1, 1, 1),
+        fused_swiglu=True,
+        swiglu_limit=0.0,
+    )
+    
+    print("=" * 60)
+    print("Epilogue Layout Diagnostics")
+    print("=" * 60)
+    print(f"  epi_tile: {kernel.epi_tile}")
+    print(f"  epi_tile_n: {kernel.epi_tile_n}")
+    print(f"  cta_tile_shape_mnk: {kernel.cta_tile_shape_mnk}")
+    print(f"  c_dtype: {kernel.c_dtype}")
+    print(f"  epilogue_warp_id: {kernel.epilogue_warp_id}")
+    print(f"  num_epilogue_threads: {32 * len(kernel.epilogue_warp_id)}")
+    
+    # Compute elements per thread per subtile
+    epi_m = 128  # from cta_tile_shape_mnk[0]
+    epi_n = kernel.epi_tile_n  # 8 for fused_swiglu
+    epi_elements = epi_m * epi_n  # 128 * 8 = 1024 elements per subtile
+    epi_threads = 32 * len(kernel.epilogue_warp_id)  # 128
+    elements_per_thread = epi_elements // epi_threads  # 1024 / 128 = 8
+    num_subtiles = kernel.cta_tile_shape_mnk[1] // kernel.epi_tile_n  # 128 / 8 = 16
+    num_gate_subtiles = num_subtiles // 2  # 8
+    num_up_subtiles = num_subtiles // 2  # 8
+    swiglu_per_cta = num_up_subtiles * elements_per_thread  # 8 * 8 = 64
+    total_swiglu_per_cta = epi_m * (kernel.cta_tile_shape_mnk[1] // 2)  # 128 * 64 = 8192
+    
+    print(f"\n  Elements per subtile: {epi_elements}")
+    print(f"  Elements per thread per subtile: {elements_per_thread}")
+    print(f"  Total subtiles per CTA tile: {num_subtiles}")
+    print(f"  Gate subtiles: {num_gate_subtiles}")
+    print(f"  Up subtiles: {num_up_subtiles}")
+    print(f"  SwiGLU values per thread (all up subtiles): {swiglu_per_cta}")
+    print(f"  Total SwiGLU values per CTA tile: {total_swiglu_per_cta}")
+    
+    # NVFP4 microblocks
+    nvfp4_block_size = 16
+    swiglu_per_cta_total = epi_m * (kernel.cta_tile_shape_mnk[1] // 2)  # 128 * 64 = 8192
+    num_nvfp4_blocks = swiglu_per_cta_total // nvfp4_block_size  # 8192 / 16 = 512
+    
+    print(f"\n  NVFP4 microblocks per CTA tile: {num_nvfp4_blocks}")
+    print(f"  SwiGLU values per thread: {swiglu_per_cta}")
+    print(f"  NVFP4 microblocks per thread: {swiglu_per_cta // nvfp4_block_size * nvfp4_block_size}")
+    
+    # The key question: can we pair 2 up subtiles (16 values per thread)
+    # to form one NVFP4 block?
+    print(f"\n  Key: 2 up subtiles give {2 * elements_per_thread} SwiGLU values per thread")
+    print(f"  NVFP4 block size: {nvfp4_block_size}")
+    print(f"  Match: {2 * elements_per_thread == nvfp4_block_size}")
+    
+    if 2 * elements_per_thread == nvfp4_block_size:
+        print("\n  ✅ 2 up subtiles = 1 NVFP4 block per thread. Accumulation pattern works!")
+    else:
+        print(f"\n  ❌ Mismatch: 2 up subtiles give {2 * elements_per_thread} values, need {nvfp4_block_size}")
+    
+    # Run a small GEMM to verify
+    print("\n" + "=" * 60)
+    print("Running small fused SwiGLU GEMM to verify output layout...")
+    print("=" * 60)
+    
+    # We need proper interleaved weights for the fused SwiGLU kernel
+    # For now, just verify the kernel runs
+    try:
+        l1_out = run_fused_swiglu_grouped_gemm(
+            mat_a=mat_a, mat_b=mat_b,
+            scale_a=scale_a, scale_b=scale_b,
+            expert_offsets=expert_offsets,
+            global_scale_a=global_scale_a, global_scale_b=global_scale_b,
+        )
+        print(f"  L1 output shape: {l1_out.shape}")
+        print(f"  L1 output dtype: {l1_out.dtype}")
+        print(f"  L1 output (first row, first 16): {l1_out[0, :16].cpu()}")
+    except Exception as e:
+        print(f"  Error running GEMM: {e}")
+
+
+if __name__ == "__main__":
+    diagnose_epilogue_layout()
--- a/tests/unit/test_nvfp4_1_1_quant.py
+++ b/tests/unit/test_nvfp4_1_1_quant.py
@@ -0,0 +1,178 @@
+"""
+NVFP4-1.1 Phase 1: Verify FP4 quantization math in CuTeDSL.
+
+Tests the fp4_quant.py functions on B200. Compares CuTeDSL kernel output
+with Python reference (quantize_activation_nvfp4).
+
+The kernel takes 16 BF16 values + global_scale, quantizes to NVFP4,
+and writes FP4 packed bytes + FP8 scale byte to output tensors.
+
+Uses cute.arch.load for scalar GMEM reads (proven pattern from the codebase).
+For writes, uses the output tensor's iterator + offset pattern.
+"""
+
+import torch
+import cutlass
+import cutlass.cute as cute
+import cutlass.torch as cutlass_torch
+import sys
+import os
+
+sys.path.insert(0, os.path.join(os.path.dirname(__file__), "../.."))
+
+from dsv4.ops.quantize import quantize_activation_nvfp4, SF_VEC_SIZE
+from dsv4.kernels.gemm.fp4_quant import (
+    fp8_e4m3_from_float32_manual,
+    fp8_e4m3_to_float32,
+    half_step_to_e2m1_idx,
+    quantize_e2m1_nibble,
+)
+
+
+@cute.kernel
+def fp4_quant_test_kernel(
+    input_bf16: cute.Tensor,   # (16,) BF16 — 16 input values
+    out_data: cute.Tensor,     # (10,) Int32 — [0..7] = FP4 packed bytes, [8] = SF byte, [9] = debug
+    gs_scalar: cute.Tensor,    # (1,) Float32 — global scale
+):
+    """Quantize 16 BF16 values to NVFP4 using fp4_quant functions.
+    
+    Single-thread kernel (only thread 0 does work).
+    Grid: (1, 1, 1), Block: (32, 1, 1)
+    """
+    tidx, _, _ = cute.arch.thread_idx()
+    
+    if tidx == cutlass.Int32(0):
+        # Load global scale
+        gs = cute.arch.load(gs_scalar.iterator, cutlass.Float32)
+        
+        # Load 16 BF16 values, convert to FP32, normalize by global_scale
+        vals_f32 = [cutlass.Float32(0.0)] * 16
+        for i in cutlass.range(16, unroll=1):
+            bf16_val = cute.arch.load(
+                input_bf16.iterator + i * cutlass.Int32(2),  # BF16 = 2 bytes
+                cutlass.BFloat16,
+            )
+            vals_f32[i] = bf16_val.to(cutlass.Float32) / gs
+        
+        # ── Compute per-16-element amax ──
+        amax = cutlass.Float32(0.0)
+        for i in cutlass.range(16, unroll=1):
+            v = vals_f32[i]
+            a = cute.math.fmax(v, cutlass.Float32(0.0) - v)  # abs
+            amax = cute.math.fmax(amax, a)
+        
+        # ── Block scale = amax / 6 ──
+        bsf_f32 = amax / cutlass.Float32(6.0)
+        # Underflow: if amax < 6 * 2^-9, force scale = 0
+        underflow_threshold = cutlass.Float32(6.0 * (2.0 ** -9))
+        if amax < underflow_threshold:
+            bsf_f32 = cutlass.Float32(0.0)
+        
+        # ── FP8 E4M3 cast ──
+        sf_bits = fp8_e4m3_from_float32_manual(bsf_f32)
+        
+        # ── Dequantize FP8 scale (round-trip) ──
+        bs_dequant = fp8_e4m3_to_float32(sf_bits)
+        
+        # ── Quantize each value to E2M1 and pack ──
+        for i in cutlass.range(8, unroll=1):
+            nibble0 = quantize_e2m1_nibble(vals_f32[2 * i], bs_dequant)
+            nibble1 = quantize_e2m1_nibble(vals_f32[2 * i + 1], bs_dequant)
+            packed = (nibble1 << cutlass.Int32(4)) | nibble0
+            # Write packed byte as Int32
+            cute.arch.store(out_data.iterator + i * cutlass.Int32(4), packed, cutlass.Int32)
+        
+        # ── Write FP8 scale byte ──
+        cute.arch.store(out_data.iterator + cutlass.Int32(8) * cutlass.Int32(4), sf_bits, cutlass.Int32)
+        
+        # ── Debug: write bsf_f32 and bs_dequant as float ──
+        # out_data[9] is unused — let's skip for simplicity
+
+
+def run_test():
+    """Run the FP4 quantization test."""
+    device = "cuda"
+    N = 16
+    
+    # Generate test input
+    torch.manual_seed(42)
+    x_bf16 = torch.randn(1, N, dtype=torch.bfloat16, device=device)
+    
+    # Compute global scale (matching quantize_activation_nvfp4)
+    x_f32 = x_bf16.float()
+    amax_val = x_f32.abs().max().item()
+    global_scale = max(amax_val / (6.0 * 448.0), 1e-8)
+    
+    # Python reference
+    ref_fp4, ref_sf = quantize_activation_nvfp4(x_bf16, global_scale)
+    ref_fp4_bytes = ref_fp4.view(torch.uint8).reshape(-1).cpu()
+    ref_sf_bytes = ref_sf.view(torch.uint8).cpu()
+    
+    print(f"Input BF16 (first 8): {x_bf16[0, :8].cpu()}")
+    print(f"Global scale: {global_scale:.8f}")
+    print(f"Ref FP4 bytes: {ref_fp4_bytes}")
+    print(f"Ref SF byte: {ref_sf_bytes}")
+    
+    # Prepare output tensor
+    out_data = torch.zeros(10, dtype=torch.int32, device=device)
+    gs_tensor = torch.tensor([global_scale], dtype=torch.float32, device=device)
+    
+    # Convert to CuTe tensors
+    def to_cute(t):
+        ct = cutlass_torch.from_dlpack(t)
+        return ct.mark_layout_dynamic(leading_dim=cutlass_torch.get_leading_dim(t))
+    
+    x_flat = x_bf16.reshape(N).contiguous()
+    input_c = to_cute(x_flat)
+    out_c = to_cute(out_data)
+    gs_c = to_cute(gs_tensor)
+    
+    # Compile and run
+    import cuda.bindings.driver as cuda
+    stream = cuda.CUstream(torch.cuda.current_stream().cuda_stream)
+    
+    print("\nCompiling kernel (first run may take a minute)...")
+    compiled = cute.compile(
+        fp4_quant_test_kernel,
+        input_c, out_c, gs_c,
+        stream,
+    )
+    print("Compiled. Running...")
+    compiled(input_c, out_c, gs_c, stream)
+    torch.cuda.synchronize()
+    
+    # Extract results
+    our_fp4 = out_data[:8].to(torch.uint8).cpu()
+    our_sf = out_data[8].to(torch.uint8).cpu().item()
+    
+    print(f"\nOur FP4 bytes: {our_fp4}")
+    print(f"Our SF byte: {our_sf}")
+    
+    # Compare
+    fp4_match = torch.equal(our_fp4, ref_fp4_bytes[:8])
+    sf_match = our_sf == ref_sf_bytes[0].item()
+    
+    if fp4_match and sf_match:
+        print("\n✅ PASS: FP4 quantization matches Python reference!")
+        return True
+    else:
+        print(f"\n❌ FAIL: FP4 match={fp4_match}, SF match={sf_match}")
+        if not fp4_match:
+            for i in range(8):
+                o = our_fp4[i].item()
+                r = ref_fp4_bytes[i].item()
+                if o != r:
+                    print(f"  Byte {i}: ours=0x{o:02x}, ref=0x{r:02x}")
+        if not sf_match:
+            print(f"  SF: ours=0x{our_sf:02x}, ref=0x{ref_sf_bytes[0].item():02x}")
+        return False
+
+
+if __name__ == "__main__":
+    print("=" * 60)
+    print("NVFP4-1.1 Phase 1: FP4 Quantization Math Test")
+    print("Verifies fp4_quant.py functions match Python reference")
+    print("=" * 60)
+    success = run_test()
+    exit(0 if success else 1)