From 8b7fa0c91e65729921fee46bb7c555741e02eb84 Mon Sep 17 00:00:00 2001
From: biondizzle <biondizzle@gmail.com>
Date: Thu, 14 May 2026 12:48:08 +0000
Subject: [PATCH] add README: pipeline diagram, file map, data formats, known
 issues

---
 README.md | 268 +++++++++++++++++++++++++++++-------------------------
 1 file changed, 146 insertions(+), 122 deletions(-)

diff --git a/README.md b/README.md
index 8fee00e2..93524a59 100644
--- a/README.md
+++ b/README.md
@@ -1,158 +1,182 @@
-# NVFP4 Mega MoE Kernel — CUTLASS Native Blackwell Implementation
+# nvfp4-megamoe-kernel
 
-Native NVFP4 block-scaled GEMM kernel for DeepSeek-V4-Pro on NVIDIA B200 (Blackwell SM100).
+Native NVFP4 block-scaled MoE kernel for DeepSeek-V4-Pro on NVIDIA Blackwell (SM100).
 
-## What This Does
+Replaces the broken `fp8_nvfp4_mega_moe` kernel from DeepGEMM with a working CUTLASS-based implementation that emits real `SM100_MMA_MXF4_SS` tensor core instructions.
 
-Replaces the broken `fp8_nvfp4_mega_moe` DeepGEMM kernel with a working CUTLASS-based implementation that uses **native Blackwell tensor core instructions** (`SM100_MMA_MXF4_SS`) for E2M1 × E2M1 matrix multiplication with UE4M3 block scaling.
+---
 
 ## Architecture
 
-DeepSeek-V4-Pro MoE layer (per rank, expert parallel):
-- **L1 (gate_up_proj):** HIDDEN=7168 → 2×INTERMEDIATE=6144
-- **L2 (down_proj):** INTERMEDIATE=3072 → HIDDEN=7168
-- 256 experts total, 32-48 per rank (depends on EP config), top-6 routing
-- NVFP4 quantization: packed E2M1 (int8, 2 FP4 per byte) + UE4M3 block16 scales
+DeepSeek-V4-Pro is a 256-expert MoE model with expert parallelism across 8 ranks (B200 GPUs). Each rank handles 32 experts. For each token, the router picks the top-6 experts.
 
-## Components
+### The MoE Forward Pass
 
-### `cutlass_nvfp4_gemm/` — The CUTLASS Extension
-
-| File | Purpose |
-|------|---------|
-| `cutlass_nvfp4_gemm.cu` | CUTLASS GEMM + scale factor remap kernel |
-| `pytorch_binding.cpp` | PyTorch C++ extension binding |
-| `kernel.py` | Python wrapper (`cutlass_nvfp4_gemm`, `cutlass_grouped_nvfp4_gemm`) |
-| `setup.py` | Build configuration (SM100a target) |
-
-### `nvfp4_mega_moe.py` — Main Entry Point
-
-Called by the patched `deepseek_v4.py`. Dispatches to CUTLASS when `MEGA_MOE_USE_CUTLASS=1`.
-
-### `weight_transform.py` — Weight Transformation
-
-Converts raw NVFP4 checkpoint weights into the format expected by the kernel:
-- Folds global scales (float32) into block scales (UE4M3)
-- Interleaves L1 gate_up weights for 2CTA UMMA
-
-### `symm_buffer.py` — Symmetric Buffer
-
-Stub for NVLink cross-rank all-reduce. Matches the DeepGEMM API expected by vLLM's deepseek_v4.py.
-
-## Critical Quirks & Pitfalls
-
-### 1. Scale Factor Layout (THE BIG ONE)
-
-CUTLASS's `Sm1xxBlockScaledConfig` expects scale factors in an **interleaved layout**, NOT simple row-major. The layout is defined by:
-
-```cpp
-SfAtom = Shape<Shape<32,4>, Shape<16,4>>
-       with Stride<Stride<16,4>, Stride<0,1>>
-layout_SFA = tile_to_shape(SfAtom{}, make_shape(M,K), Step<_2,_1>{})
+```
+Input hidden states (BF16)
+        │
+        ▼
+┌─────────────────┐
+│  Shared Experts │  ← BYPASSED (returning zeros — FlashInfer TF32 GEMM crashes)
+│  (FlashInfer    │
+│   CUTLASS)      │
+└─────────────────┘
+        │
+        ▼
+  Staging Kernel (vLLM built-in)
+  BF16 → packed E2M1 (int8) + UE4M3 block-16 scales (uint32)
+  Writes to SymmBuffer.x / SymmBuffer.x_sf
+        │
+        ▼
+  Router (vLLM built-in)
+  Writes topk_ids / topk_weights to SymmBuffer
+        │
+        ▼
+┌─────────────────────────────────────────┐
+│          nvfp4_mega_moe_full            │  ← nvfp4_mega_moe.py
+│                                         │
+│  1. Read staged activation from buffer  │
+│  2. L1 GEMM: gate_up_proj              │  ← CUTLASS NVFP4 block-scaled
+│     E2M1 × E2M1 + UE4M3 scales         │    SM100_MMA_MXF4_SS PTX
+│     → BF16 output (6144-wide)          │
+│  3. SiLU(gate) * up  (activation)      │
+│  4. stage_activation: BF16 → FP4       │  ← simple absmax quantize (needs work)
+│  5. L2 GEMM: down_proj                 │  ← CUTLASS NVFP4 block-scaled
+│     E2M1 × E2M1 + UE4M3 scales         │    SM100_MMA_MXF4_SS PTX
+│     → BF16 output (7168-wide)          │
+│  6. Write to output tensor              │
+└─────────────────────────────────────────┘
 ```
 
-If you pass row-major scales directly, TMA loads read garbage addresses → **NaN output** → downstream CUDA illegal memory access.
+### vLLM Startup Sequence (how our code plugs in)
 
-**Fix:** GPU-side remap kernel using `cute::idx2crd()` to convert CUTLASS layout indices to (row, k_group) coordinates, then index into row-major source.
+```
+1. vLLM engine init
+   └─ ModelOptNvFp4Config selected (NVFP4 quantization scheme)
+   └─ FlashInferCutlassNvFp4LinearKernel for linear layers
 
-### 2. CUTLASS Version Matters
+2. Model construction
+   └─ DeepseekV4ForCausalLM → DeepseekV4MoE → DeepseekV4DecoderLayer
+       Each layer has: attention + MoE block
+       MoE block has: shared experts + 256 routed experts
 
-TileLang's bundled CUTLASS is **too old** — missing `float_e2m1_t`, `float_ue4m3_t`, block-scaled types. You need the latest from GitHub:
-```bash
-git clone --depth 1 https://github.com/NVIDIA/cutlass.git /root/cutlass
+3. Weight loading
+   └─ 95 safetensor shards loaded
+   └─ weight, weight_scale, weight_scale_2 loaded per linear
+
+4. process_weights_after_loading  ← THIS IS WHERE WE HOOK IN
+   └─ ModelOptNvFp4LinearMethod swizzles/pads weights for CUTLASS
+   └─ finalize_mega_moe_weights()
+       └─ weight_transform.py: transform_nvfp4_weights_for_mega_moe()
+           • Folds weight_scale_2 (global scale) into weight_scale (block scale)
+           • UE4M3 block-16 scales: 4 values packed per uint32
+           • Interleaves L1 (gate_up) weights for 2CTA UMMA
+           • Returns ((l1_w, l1_sf), (l2_w, l2_sf)) per rank
+
+5. SymmBuffer allocation
+   └─ symm_buffer.py: get_symm_buffer_for_nvfp4 mega_moe()
+       • Pre-allocates GPU buffers for:
+         - x: int8 packed E2M1 activations
+         - x_sf: uint32 packed UE4M3 activation scales
+         - topk_idx: int32 expert indices
+         - topk_weights: float32 routing weights
+         - buffer: BF16 all-reduce buffer
+
+6. Profile run (warmup)
+   └─ First forward pass to allocate KV cache, etc.
+   └─ This is where the CUTLASS GEMM first executes
+
+7. Ready to serve
 ```
 
-Key files only in the latest version:
-- `include/cutlass/float_subbyte.h` — `float_e2m1_t` and `float_ue4m3_t`
-- `include/cutlass/detail/sm100_blockscaled_layout.hpp` — SFA/SFB layout computation
-- `examples/72_blackwell_narrow_precision_gemm/72b_nvfp4_nvfp4_gemm.cu` — reference implementation
+---
 
-### 3. nixl_ep Breaks CUDA 13 Images
+## File Map
 
-The `vllm/vllm-openai:nightly` image ships `nixl_ep` compiled against CUDA 12, but the image is CUDA 13. At import time it tries to `dlopen("libcudart.so.12")` → crash. Remove it:
-```dockerfile
-RUN pip uninstall -y nixl-ep; rm -rf /usr/local/lib/python3.12/dist-packages/nixl_ep
+```
+nvfp4_megamoe_kernel/
+├── __init__.py              # Public API exports
+├── nvfp4_mega_moe.py       # Main kernel: nvfp4_mega_moe_full, nvfp4_mega_moe_l1/l2, stage_activation
+├── weight_transform.py     # Weight prep: fold global scale, pack UE4M3, interleave L1
+├── symm_buffer.py          # GPU buffer allocation for MoE dispatch
+│
+└── cutlass_nvfp4_gemm/     # CUTLASS CUDA extension (the actual hardware kernel)
+    ├── cutlass_nvfp4_gemm.cu    # CUDA: CUTLASS GEMM + SF remap kernel
+    ├── pytorch_binding.cpp      # PyTorch C++ binding (_C.forward)
+    ├── kernel.py                # Python: cutlass_grouped_nvfp4_gemm (per-expert loop)
+    ├── sf_layout.py             # CUTLASS SF interleaved layout math
+    ├── setup.py                 # Build config (nvcc, CUTLASS include paths)
+    ├── build.sh                 # Build script
+    ├── test_gemm.py             # Standalone test
+    └── README.md
 ```
 
-### 4. Fabric Manager Required for B200
+### What each file does (in call order)
 
-B200 NVLink clusters need `nvidia-fabricmanager` running before CUDA runtime can init. Without it:
-- `nvidia-smi` works (kernel module)
-- `cudaGetDeviceCount()` segfaults (userspace driver)
-- Error 802: "system not yet initialized"
+| File | When it runs | What it does |
+|------|-------------|--------------|
+| `weight_transform.py` | Once at startup (weight loading) | Takes raw NVFP4 checkpoint weights, folds global scales into block scales, packs UE4M3 into uint32, interleaves L1 gate_up weights. Output: `((l1_w, l1_sf), (l2_w, l2_sf))` |
+| `symm_buffer.py` | Once at startup (buffer alloc) | Pre-allocates GPU tensors for activations, scales, routing data, and all-reduce. These persist across forward passes. |
+| `nvfp4_mega_moe.py` | Every forward pass | Orchestrates the MoE: reads from symm buffer → L1 GEMM → activation → re-quantize → L2 GEMM → output. Contains `stage_activation` (BF16→FP4 quantize for L1→L2). |
+| `cutlass_nvfp4_gemm/kernel.py` | Every forward pass (called by nvfp4_mega_moe) | Per-expert loop: gather tokens for each expert, call CUTLASS GEMM, scatter results with routing weights. |
+| `cutlass_nvfp4_gemm/cutlass_nvfp4_gemm.cu` | Every forward pass (CUDA kernel) | The actual CUTLASS kernel: native NVFP4 block-scaled GEMM + GPU-side scale factor remap (row-major → CUTLASS interleaved layout). |
+| `cutlass_nvfp4_gemm/sf_layout.py` | Build time / reference | Documents the CUTLASS SfAtom layout. Currently unused at runtime (remap is in CUDA). |
+
+---
+
+## Data Formats
+
+### Weights
+- **Packed E2M1** (`int8`): 2 FP4 values per byte. Shape: `(E_per_rank, N, K//2)`, K-major layout.
+- **UE4M3 block scales** (`float8_e4m3fn`): 1 scale per 16 FP4 values (group_size=16). Shape: `(E_per_rank, N, K//16)`.
+
+### Activations (after staging kernel)
+- **Packed E2M1** (`int8`): Shape: `(num_tokens, K//2)`.
+- **UE4M3 scales** (`uint32`): 4 UE4M3 values packed per uint32. Shape: `(num_tokens, K//64)`.
+
+### GEMM dimensions (DeepSeek-V4-Pro)
+- **L1 (gate_up_proj):** M×6144×7168 (per expert)
+- **L2 (down_proj):** M×7168×3072 (per expert)
+- 48 experts per rank (256 total / 8 ranks), top-6 routing
+
+---
+
+## Build & Deploy (B200)
 
 ```bash
-systemctl enable nvidia-fabricmanager
-systemctl start nvidia-fabricmanager
+# On B200 host — CUTLASS must be cloned and mounted
+cd /root/nvidia-meeting/deepseek-v4-quant/
+
+# Rebuild container (CUTLASS is host-mounted at /root/cutlass)
+KERNEL_CACHE_BUSTER=$(date +%s) docker compose build --no-cache
+docker compose up -d
 ```
 
-### 5. Docker GPU Access
+The CUTLASS extension builds inside the container during `pip install` of the nvfp4-megamoe-kernel package. It needs:
+- CUDA 13.0 toolkit (in the vllm/vllm-openai:nightly image)
+- CUTLASS headers at `/root/cutlass/include/`
+- CCCL headers at `/usr/local/cuda-13.0/targets/x86_64-linux/include/cccl/`
+- Device with SM100 compute capability (B200)
 
-Must use `deploy.resources.reservations.devices` in docker-compose, NOT `runtime: nvidia` in daemon.json:
-```yaml
-deploy:
-  resources:
-    reservations:
-      devices:
-        - driver: nvidia
-          count: all
-          capabilities: [gpu]
-```
+---
 
-### 6. PyTorch Extension API (nightly vLLM)
+## Known Issues
 
-- Use `c10::cuda::getCurrentCUDAStream()` not `at::cuda::getCurrentCUDAStream()`
-- Use `torch::kBFloat16` not `at::kBF16`
-- `CUDAExtension` uses `include_dirs` not `extra_include_paths`
-- `python3` not `python` in the image
+1. **Shared experts bypassed** — FlashInfer/DeepGEMM TF32 GEMM crashes the vLLM worker. Currently returning zeros for shared expert output. This produces garbage text.
 
-### 7. CCCL Headers
+2. **MoE dispatch is slow** — `cutlass_grouped_nvfp4_gemm` uses a Python loop over 48 experts with per-token scatter/gather. Needs a proper grouped GEMM or at least CUDA-side dispatch.
 
-CUTLASS 3.x depends on libcu++ (CCCL). Found at:
-```
-/usr/local/cuda-13.0/targets/x86_64-linux/include/cccl/
-```
+3. **stage_activation is approximate** — Simple per-token absmax quantization for L1→L2 re-quant. Should use proper E2M1 quantization matching vLLM's staging kernel.
 
-### 8. No Mixing CUDA Versions
+4. **Scale factor remap adds overhead** — GPU kernel remaps row-major → CUTLASS interleaved layout every GEMM call. Should pre-compute during weight transform.
 
-**Hard rule.** If something needs CUDA 12 in a CUDA 13 image, remove the thing that needs CUDA 12. Never symlink `libcudart.so.13 → libcudart.so.12`.
-
-## Building
-
-On the B200 server, inside the vLLM Docker container:
-
-```bash
-cd /root/nvfp4-megamoe-kernel/src/nvfp4_megamoe_kernel/cutlass_nvfp4_gemm
-TORCH_CUDA_ARCH_LIST=10.0 python3 setup.py build_ext --inplace
-```
-
-Requires:
-- CUTLASS at `/root/cutlass/include`
-- CCCL at `/usr/local/cuda-13.0/targets/x86_64-linux/include/cccl/`
+---
 
 ## Environment Variables
 
 | Variable | Default | Description |
 |----------|---------|-------------|
-| `MEGA_MOE_STATIC` | `0` | Set to `1` to bypass kernel (return zeros) |
-| `MEGA_MOE_USE_CUTLASS` | `1` | Use CUTLASS native NVFP4 kernel |
-| `MEGA_MOE_DEBUG` | `0` | Enable debug prints |
-| `VLLM_USE_NIXL` | `0` | Disable NIXL (broken in nightly) |
-
-## Current Status (May 14, 2026)
-
-- ✅ CUTLASS NVFP4 GEMM compiles and loads
-- ✅ Scale factor remap works (no NaN)
-- ✅ vLLM server starts with native kernel
-- ✅ L1 and L2 CUTLASS kernels execute
-- ⚠️ Output is garbage — shared experts are bypassed (zeros)
-- ⚠️ FlashInfer/DeepGEMM TF32 GEMM (shared experts) crashes workers
-- ⚠️ MoE dispatch is slow (Python per-expert loop)
-
-## Next Steps
-
-1. Fix shared experts crash (FlashInfer TF32 GEMM illegal memory access)
-2. Verify numerical correctness of SF remap (compare against dequantize+BF16 reference)
-3. Optimize MoE dispatch (batched/grouped GEMM)
-4. Replace simple `stage_activation` with proper E2M1 quantization
-5. Re-enable shared experts once FlashInfer crash is fixed
+| `MEGA_MOE_STATIC` | 0 | Set to 1 to skip MoE kernel entirely (return zeros) |
+| `MEGA_MOE_DEBUG` | 0 | Set to 1 for verbose logging |
+| `MEGA_MOE_USE_CUTLASS` | 1 | Use CUTLASS path (always 1 now, TileLang removed) |
+| `SKIP_ATTENTION` | 0 | Skip attention layers (debug) |