From 57d4cb714f2500fe861ef7dd6aa46906fbff39d2 Mon Sep 17 00:00:00 2001
From: biondizzle <biondizzle@gmail.com>
Date: Wed, 20 May 2026 03:30:35 +0000
Subject: [PATCH] docs: rewrite README.md with current project state

- Document all 5 correctness bug fixes
- Document fused SwiGLU epilogue progress (Step 1 PASS, Step 2 blocked)
- Document CuTeDSL runtime conditional limitation
- List remaining steps (amax shuffles, NVFP4 quantize, FP4/SF TMA stores)
- Document weight interleave and register layout
- Capture key lessons learned
- Update file structure and test inventory
---
 README.md | 303 +++++++++++++++++++++++++++++-------------------------
 1 file changed, 161 insertions(+), 142 deletions(-)

diff --git a/README.md b/README.md
index 371ff616..98fcfb35 100644
--- a/README.md
+++ b/README.md
@@ -1,180 +1,199 @@
 # NVFP4 MegaMoE Kernel
 
-Full NVFP4 inference pipeline for DeepSeek-V4 on NVIDIA Blackwell (SM100). The entire model — MoE experts, shared experts, attention projections, and attention compute — runs in native NVFP4 with zero dequantization overhead.
+Native NVFP4 inference stack for DeepSeek-V4 on NVIDIA Blackwell (SM100). CuTeDSL kernels cover the entire model (MoE experts, shared experts, attention projections) and run in native NVFP4 with zero dequantization overhead.
 
-## ⚠️ READ THIS FIRST — THE #1 RULE
+## ⚠️ THE #1 RULE
 
-**YOU MUST BUILD YOUR OWN KERNELS. ALL OF THEM. DO NOT PATCH vLLM.**
+**WE OWN ALL OUR KERNELS. WE DO NOT PATCH vLLM.**
 
-Mike was right — we need our own kernels. Not just for the NVFP4 GEMMs, but for the **ENTIRE attention pipeline**. The current approach of patching individual vLLM functions is a house of cards. Every patch leads to another crash, every workaround reveals three more broken things. FlashMLA, fp8_ds_mla, the fused C++ kernels, the Triton compressor, the indexer — they're all deeply coupled. You cannot swap one piece and expect the rest to work.
-
-**THE ONLY PATH FORWARD:**
-1. Build CuTeDSL kernels for EVERYTHING — attention, KV cache, RoPE, the whole stack
-2. Test each kernel standalone on the B200 venv BEFORE touching the container
-3. Wire them together into a proper vLLM attention backend
-4. THEN and ONLY THEN test in the container
-
-**DO NOT:**
-- ❌ Try to patch vLLM's FlashMLA code to "work" on Blackwell
-- ❌ Use pure PyTorch as a "temporary workaround" — it produces garbage
-- ❌ Skip the KV cache write and hope for the best
-- ❌ Assume you can mix our kernels with vLLM's existing attention backend
-- ❌ Touch the container until ALL kernels pass standalone tests
-
-**DO:**
-- ✅ Build CuTeDSL kernels in `cutedsl/`
-- ✅ Test each one in `tests/` on the B200 venv
-- ✅ Compare against BF16 reference (cosine >= 0.98 or it's broken)
-- ✅ Wire them into a proper attention backend class
-- ✅ Only test in the container once everything passes standalone
+vLLM's internal kernels (FlashMLA, fp8_ds_mla, fused compressor, Triton indexer) are deeply coupled. You cannot swap one piece and expect the rest to work. We build our own CuTeDSL kernels, test them standalone, then wire them into vLLM as an attention backend.
 
 ---
 
-## What This Is
+## Repository Layout
 
-A native NVFP4 inference stack for DeepSeek-V4:
+**This repo (`nvfp4-megamoe-kernel`):** The kernel library — CuTeDSL kernels, bridge layer, standalone tests.
 
-**MoE Experts** — CuTeDSL ScaledGroupedGemmKernel ✅:
-```
-BF16 input → quantize to NVFP4
-  L1 GEMM: NVFP4 × NVFP4 → BF16 (gate + up)
-  SiLU(gate) * up → BF16 (only nonlinear — can't avoid BF16 here)
-  Re-quantize to NVFP4
-  L2 GEMM: NVFP4 × NVFP4 → BF16 (down_proj)
-  Scatter with routing weights → BF16 output
-```
+**vLLM fork (`vllm-deepseekv4-nvfp4`):** The vLLM integration — model definition, weight loading, attention backend. Lives at `/root/dsv4-nvfp4-workspace/vllm` on the B200.
 
-**Attention Projections** — CuTeDSL NVFP4 GEMM ✅:
-- `q_a_proj`, `q_b_proj`, `kv_proj`, `wo_b_proj` — native NVFP4, cosine 0.995 vs BF16
-- `wo_a` — BF16 BMM (o_a_proj weights are BF16 in checkpoint)
-- All verified with `tests/test_full_layer_b200.py`
+**Workspace (`/root/dsv4-nvfp4-workspace`):**
+- `kernel/` — clone of this repo
+- `vllm/` — clone of the vLLM fork
+- `FUSED_EPILOGUE_PLAN.md` — fused SwiGLU epilogue plan
+- `FUSED_EPILOGUE_STATUS.md` — current status
 
-**Shared Experts** — CuTeDSL NVFP4 GEMM ✅:
-- `gate_up_proj`, `down_proj` — native NVFP4, cosine 0.990 vs BF16
+---
 
-**Attention Pipeline** — ✅ Verified standalone, 🔧 vLLM integration blocked by NaN:
-- KV cache write (RoPE → fp8 quant → paged cache) — cosine 0.999
-- KV cache read (paged cache → fp8 dequant → BF16) — cosine 0.999
-- Decode attention (1 query vs N cached KVs) — cosine 0.9998
-- Full pipeline (inv RoPE + o_a BMM + o_b) — cosine 0.996–0.999
-- All 5 layer types (C128A, C4A, SWA) — cosine ≥0.996
+## What We Have
 
-## Architecture: DeepSeek-V4-Pro
+### ✅ CuTeDSL NVFP4 Grouped GEMM (the building block)
 
-**MegaMoE (384 experts, top-6) with CSA + HCA + mHC:**
+`ScaledGroupedGemmKernel` in `cutedsl/kernel/moe/torch_scaled_grouped_mm.py` — a production-grade NVFP4 grouped GEMM kernel:
+- 2D×3D scenario: A(M,K) × B(E,K,N) → C(M,N)
+- Block-scaled: per-16-element FP8 scales on both A and B sides
+- Global scales (per-expert) for full dynamic range
+- Persistent scheduler, TMA pipelining, SMEM swizzle
+- CUDAGraph-safe (workspace pre-allocated, no runtime allocations)
 
-- **CSA (Compress Ratio 4)**: Compressed Sparse Attention — KV compressed 4x with overlap (coff=2). Indexer finds per-layer top-k.
-- **HCA (Compress Ratio 128)**: Heavily Compressed Attention — KV compressed 128x. Top-k indices pre-computed during metadata build.
-- **mHC**: Manifold-Constrained Hyper-Connections — replaces standard residual connections. Learned mixing with Sinkhorn normalization.
-- **SWA**: Sliding Window Attention — local window (compress_ratio=0, last layer only)
+### ✅ Bridge Layer (`cutedsl/bridge.py`)
 
-**Compress Ratios (from config.json):**
-```
-Layer 0: 128 (HCA)   Layer 1: 128 (HCA)   Layer 2: 4 (CSA)   Layer 3: 128 (HCA)
-Layer 4: 4 (CSA)     ...alternating 4/128...                  Layer 60: 0 (SWA)
-```
+- `quantize_to_nvfp4()` — BF16 → NVFP4 with global scale
+- `quantize_activation_nvfp4()` — cudagraph-safe quantize (pre-computed gs)
+- `quantize_weight_to_nvfp4()` — weight quantization (along K dim)
+- `interleave_l1_weights()` — gate/up interleave at granularity 8 BF16
+- `make_b_k_major()` — B tensor stride conversion
+- `assemble_scales_2d_side()` / `assemble_scales_3d_side()` — scale assembly + swizzle
+- `warmup_compilation()` — eager JIT compilation before first forward pass
+- `run_nvfp4_grouped_gemm()` — the main entry point
 
-**Expert intermediate size: 3072** (NOT 18432 — that's 6×3072 for top-6)
+### ✅ MoE Runner (`cutedsl/runner.py`)
 
-**DeepGEMM MegaMoE**: DeepSeek's persistent grouped GEMM for MoE uses TMA tensormap updates per expert with variable block_m (16-192) based on expected tokens per expert. Our CuTeDSL runner uses `run_nvfp4_grouped_gemm` (simpler, but proven correct in standalone tests).
+`CuTeDSLMoERunner` — runs the MoE forward pass:
+1. Quantize input BF16 → NVFP4 (using the pre-computed global scale, gs)
+2. L1 GEMM: NVFP4 × NVFP4 → BF16 (gate+up fused)
+3. SiLU(gate) * up → BF16 (PyTorch, not yet fused)
+4. Re-quantize BF16 → NVFP4
+5. L2 GEMM: NVFP4 × NVFP4 → BF16 (down_proj)
+6. Scatter with routing weights
 
-## Current Status
+### ✅ NVFP4 Linear (`cutedsl/nvfp4_linear.py`)
 
-### ✅ Verified (B200 venv, real weights, zero NaN)
+`CuTeDSLNvfp4Linear` — single-expert NVFP4 GEMM for shared experts and attention projections.
 
-| Component | Test | Cosine vs BF16 |
-|-----------|------|----------------|
-| CuTeDSL NVFP4 Linear (q_a, kv, q_b, wo_b) | `test_full_layer_b200.py` | 0.994+ |
-| CuTeDSL NVFP4 MoE (L1 gate+up, SiLU, L2 down) | `layertest.py` | 0.988 |
-| FP8 KV quantize/dequant | `test_kv_cache_b200.py` | 0.9997 |
-| Paged KV cache read/write | `test_kv_cache_b200.py` | 1.0 |
-| CSA sparse attention (cr=4) | `test_sparse_attn_b200.py` | works, no NaN |
-| HCA sparse attention (cr=128) | `test_sparse_attn_b200.py` | works, no NaN |
-| Full attention pipeline (all layer types) | `test_v4_attention_b200.py` | 0.981–0.995 |
-| KV cache write + decode attention | `test_decode_attention_b200.py` | 0.9998 |
-| Decode vs prefill consistency (5 layers) | `test_decode_vs_prefill_b200.py` | 0.996–0.999 |
-| E2E 61-layer model (shared experts) | `test_e2e_decode_b200.py` | healthy logits |
-| MoE runner (grouped GEMM, 16 experts) | `test_moe_runner_nan_b200.py` | no NaN, all sizes |
-| Full layer (attention + MoE) | `test_full_layer_nan_b200.py` | no NaN |
-| Multi-layer chain (3 layers) | `test_full_layer_nan_b200.py` | no NaN |
+### 🔧 Fused SwiGLU Kernel (in progress)
 
-### ❌ Container — NaN in vLLM compiled execution
+`fused_swiglu_grouped_mm.py` — extends `ScaledGroupedGemmKernel` with a fused SiLU epilogue:
+- **Step 1 DONE:** SiLU in registers validated (0.034% error vs PyTorch)
+- **Step 2 BLOCKED:** Gate/up pairing blocked by CuTeDSL type system (see below)
 
-The container produces empty/garbage output. Debug logs show NaN in `hidden_states` from the first forward pass. **The NaN is NOT from our kernels** — it comes from vLLM's compiled execution infrastructure (see CURRENT_BUG.md for full investigation).
+---
 
-Most likely sources:
-1. `attn_gemm_parallel_execute` — fused parallel GEMM (NOT our CuTeDSL kernel)
-2. `fused_q_kv_rmsnorm` — CUDA kernel that may produce NaN on Blackwell
-3. Weight packing during model loading
-4. `torch.compile` + cudagraph interaction with CuTeDSL buffers
+## Correctness Bugs Fixed (May 20, 2026)
 
-### ❌ Does NOT Work
+All 5 bugs fixed, committed, pushed:
 
-- **NVFP4 Q×K^T GEMM** — cosine 0.86, too lossy for attention scores. Keep attention in BF16.
-- **Patching vLLM's FlashMLA path** — house of cards. Don't do it.
+| Bug | Issue | Fix |
+|-----|-------|-----|
+| 1 | `_needs_token_refill` myth — cute.compile doesn't corrupt GPU memory | Removed hack, added `warmup_compilation()`, pre-allocated workspace per cache entry |
+| 2 | Dequantize→requantize supposedly lossy | Verified 100% byte-identical round-trip. Deprecated `prepare_weights_from_dequantized` |
+| 3 | `clamp(min=1e-8)` on zero blocks gives nonzero FP8 scale | Detect zero blocks, force FP8 scale to exact 0 |
+| 4 | Underflow blocks (amax < 6×2⁻⁹) get nonzero FP4 from div-by-tiny-number | Detect underflow blocks, zero x_norm before division |
+| 5 | Expert counting materializes 18M bool tensor | `torch.bincount` replaces O(n×E) comparison |
 
-## Test Files
+---
 
-| Test | What it does | Status |
-|------|-------------|--------|
-| `tests/test_full_layer_b200.py` | All NVFP4 projections vs BF16 | ✅ 0.994+ |
-| `tests/layertest.py` | MoE layer test | ✅ 0.988 |
-| `tests/cudagraph_test.py` | CUDAGraph compatibility | ✅ PASS |
-| `tests/test_v4_attention_b200.py` | All 3 layer types (SWA, C128A, C4A) | ✅ 0.981-0.995 |
-| `tests/test_kv_cache_b200.py` | FP8/NVFP4 KV cache + paged cache | ✅ 0.9997 |
-| `tests/test_sparse_attn_b200.py` | CSA/HCA sparse + SWA merged | ✅ works |
-| `tests/test_decode_attention_b200.py` | Prefill + decode with KV cache | ✅ 0.9998 |
-| `tests/test_decode_vs_prefill_b200.py` | Decode vs prefill consistency | ✅ 0.996-0.999 |
-| `tests/test_e2e_decode_b200.py` | 61-layer E2E (shared experts) | ✅ healthy logits |
-| `tests/test_moe_nan_b200.py` | Single expert NaN check | ✅ no NaN |
-| `tests/test_moe_runner_nan_b200.py` | MoE grouped GEMM NaN check | ✅ no NaN |
-| `tests/test_full_layer_nan_b200.py` | Full layer + multi-layer NaN check | ✅ no NaN |
+## Fused SwiGLU Epilogue — Current State
 
-## Project Structure
+### The Goal
+
+Fuse SiLU(gate)*up + NVFP4 quantization into the L1 GEMM epilogue. Expected impact:
+- Eliminates ~580MB BF16 write to GMEM
+- Eliminates ~290MB BF16 read back
+- Eliminates 3 kernel launches + 12 quantize ops
+- Expected **~30-40% latency reduction** for the MoE block
+
+### Step 1: SiLU in Registers — ✅ VALIDATED
+
+`cute.exp` and element-wise FP32 ops work correctly on CuTe register tensors in the epilogue. SiLU(x) = x / (1+exp(-x)) produces 0.034% relative error vs PyTorch.
+
+### Step 2: Gate/Up Pairing — ❌ BLOCKED BY CUTEDSL TYPE SYSTEM
+
+**The problem:** CuTeDSL compiles ALL subtile iterations into one kernel. A runtime conditional (`if is_gate_subtile`) fails in every place we tried it:
+- Register tensor assignment → `DSLRuntimeError` (type structure mismatch)
+- TMA store skipping → corrupted output
+- Mask blending on register tensors → wrong results
+
+CuTeDSL requires that ALL code paths produce tensors with the same structure. Even though both branches produce the same tensor type, the compiler can't unify them when the branch condition is a runtime value.
+
+### What's Needed for Step 2
+
+**Option A: Paired subtile iteration.** Instead of iterating subtiles [0,1,2,3] and branching on each, iterate as gate/up pairs [(0,2), (1,3)]. For each pair, load both the gate and up accumulators, compute SiLU(gate)*up, and store the result. No runtime conditionals are needed because every iteration does the same thing. Requires restructuring the epilogue loop.
+
+**Option B: const_expr debug flag.** Compile a separate kernel with `debug_silu_bf16=True` that writes post-SiLU BF16 to a (M, intermediate) side tensor. Validate, then add NVFP4 quantize + FP4/SF TMA stores. The production kernel (flag=False) skips the BF16 write.
+
+**Option C: Separate post-GEMM SiLU kernel.** A small CUDA kernel that reads BF16 L1 output, applies SiLU(gate)*up, writes result. Adds one kernel launch but avoids the CuTeDSL type system constraint entirely.
+
+### Remaining Steps (after gate/up pairing)
+
+| Step | What | Status |
+|------|------|--------|
+| 3 | Per-16-element amax via warp shuffles | Not started |
+| 4 | FP8 E4M3 scale + E2M1 round + nibble pack | Not started |
+| 5 | FP4 TMA store to padded L2 buffer | Not started |
+| 6 | FP8 SF TMA store through blockscaled layout | Not started |
+
+### Weight Interleave
+
+Gate/up weights must be interleaved at granularity 8 BF16 (4 FP4) for the fused epilogue. `interleave_l1_weights()` in `cutedsl/bridge.py` implements this. The pure-PyTorch invariant test passes; the kernel-level test is blocked by the same subtile iteration issue.
+
+### Register Layout (from DeepGEMM)
+
+After `SM100_TMEM_LOAD_16dp256b1x`, register fragment has gate/up paired:
+- (values[0], values[2]), (values[1], values[3])
+- (values[4], values[6]), (values[5], values[7])
+
+Our CuTeDSL kernel uses `tiled_copy_r2s.retile()`, which may produce a different register layout. This still needs to be verified against the debug BF16 output.
+
+---
+
+## DeepSeek-V4 Architecture Notes
+
+**NOT MLA.** DeepSeek-V4 uses:
+- **CSA** (Compressed Sparse Attention, cr=4): KV compressed 4x, indexer finds top-k
+- **HCA** (Heavily Compressed Attention, cr=128): KV compressed 128x, pre-computed indices
+- **SWA**: Standard sliding window (window=128, last layer only)
+- **mHC**: Manifold-Constrained Hyper-Connections — replaces residual connections
+- **384 experts, top-6, intermediate=3072**
+
+Compress ratios by layer: layers 0-1 = 128 (HCA), then alternating 4 (CSA) / 128 (HCA) from layer 2; layer 60 = 0 (SWA).
+
+---
+
+## File Structure
 
 ```
-nvfp4-megamoe-kernel/
-├── cutedsl/                          # CuTeDSL kernel + bridge layer
-│   ├── bridge.py                     # Tensor layout conversion, quantization, kernel launch
-│   ├── nvfp4_linear.py              # CuTeDSLNvfp4Linear — NVFP4 GEMM runner
-│   ├── runner.py                     # CuTeDSLMoERunner — grouped GEMM MoE
-│   ├── blackwell_attention.py        # KV cache + attention (standalone, works)
-│   ├── csa_attention.py             # CSA/HCA attention (BF16 SDPA)
-│   ├── custom_ops.py                # torch.autograd wrappers
-│   └── kernel/moe/                   # NVIDIA's ScaledGroupedGemmKernel
-├── vllm/                             # vLLM integration
-│   ├── nvfp4_cutedsl.py             # CuTeDSLMoERunner (vLLM wrapper)
-│   ├── cutedsl_quant_method.py      # CuTeDSLNvfp4LinearMethod
-│   └── patches/
-│       ├── deepseek_v4_attention.py # Attention patch (Blackwell dispatch)
-│       ├── deepseek_compressor.py   # Compressor patch (skip fused kernel on Blackwell)
-│       ├── patch_kv_cache_utils.py  # KV cache page size fix
-│       ├── patch_swa_cache.py       # SWA cache alignment fix
-│       └── layers/
-│           ├── csa_attention.py     # BF16 SDPA + KV cache (our Blackwell path)
-│           └── deepseek_compressor.py # Skip fused kernel on Blackwell
-├── tests/                            # Standalone tests (run on B200 venv)
-├── Dockerfile                        # Container build
-├── README.md                         # This file
-└── CURRENT_BUG.md                    # Current bug investigation
+cutedsl/
+├── bridge.py                          # Quantization, layout, kernel launch
+├── nvfp4_linear.py                    # Single-expert NVFP4 GEMM runner
+├── runner.py                          # MoE grouped GEMM runner
+├── blackwell_attention.py             # KV cache + attention (standalone)
+├── csa_attention.py                   # CSA/HCA attention
+├── custom_ops.py                      # torch.autograd wrappers
+├── moe_pipeline.py                    # Standalone test pipeline (deprecated path)
+└── kernel/moe/
+    ├── torch_scaled_grouped_mm.py     # ScaledGroupedGemmKernel (the GEMM)
+    └── fused_swiglu_grouped_mm.py     # FusedSwiGLUScaledGroupedGemmKernel (WiP)
+
+tests/
+├── test_fused_step1.py               # SiLU validation (PASS)
+├── test_fp4_roundtrip.py             # Checkpoint byte match (PASS)
+├── test_interleave_gemm.py           # Weight interleave GEMM test (BLOCKED)
+├── layertest.py                      # MoE layer test (PASS, 0.988 cosine)
+├── cudagraph_test.py                  # CUDAGraph test (PASS)
+├── test_full_layer_b200.py           # All NVFP4 projections (PASS, 0.994+)
+├── test_v4_attention_b200.py         # All 3 attention types (PASS)
+├── test_kv_cache_b200.py             # KV cache (PASS, 0.9997)
+├── test_sparse_attn_b200.py          # CSA/HCA (PASS)
+├── test_decode_attention_b200.py     # Prefill+decode (PASS, 0.9998)
+└── ...
 ```
 
-## Plan
+---
 
-### Phase 1: MoE Kernel ✅ DONE
-### Phase 2: NVFP4 Linear Kernels ✅ DONE
-### Phase 3: Attention Pipeline ✅ DONE (standalone, all tests pass)
-### Phase 4: vLLM Integration 🔧 IN PROGRESS — blocked by NaN from vLLM infrastructure
+## Key Lessons (Things We Fucked Up)
 
-**Current blocker:** NaN in the vLLM container's compiled execution. Our kernels produce zero NaN standalone. The NaN comes from vLLM's `attn_gemm_parallel_execute` or `fused_q_kv_rmsnorm` CUDA kernels, weight packing, or torch.compile interaction.
+1. **⛔ NEVER assume CuTeDSL GPU tensors survive JIT compilation.** `cute.compile` zeroes GPU memory. Keep index/mapping tensors on CPU. Always verify with `.cpu().tolist()` after JIT.
 
-**Next:**
-1. Install vllm in B200 venv, test the exact parallel GEMM path
-2. Test with torch.compile disabled in the container
-3. Add NaN checks inside the parallel GEMM wrapper
-4. If the parallel GEMM is the source, replace it with our CuTeDSL kernels (path of least resistance)
+2. **⛔ NEVER nuke working code without understanding why it exists.** The cudagraph-safe functions exist because vLLM REQUIRES cudagraph.
 
-### Phase 5: Production
-- End-to-end benchmarking
-- Optimize tile sizes
-- Clean up
+3. **⛔ NEVER fabricate facts from MEMORY.md.** Verify what "works" means before citing it.
+
+4. **⛔ NEVER quantize a padded buffer and slice the output.** Quantize compact data, scatter into padded layout.
+
+5. **⛔ Silent weight drops are deadly.** vLLM's `if name not in params_dict: continue` skips weights with no warning. Replace with hard RuntimeError.
+
+6. **⛔ NVFP4 is NOT suitable for attention Q×K^T.** Per-element dot products are too sensitive: the NVFP4 Q×K^T GEMM reached only cosine 0.86, too lossy for attention scores. Keep attention in BF16.
+
+7. **⛔ NEVER touch drivers, kernels, firmware, or system packages on the B200.** The cluster costs millions. Always confirm with Mike.
+
+8. **⛔ CuTeDSL runtime conditionals on register tensors are broken.** Can't branch on runtime values when the branch affects tensor structure. Use const_expr flags or restructure the loop.