From 1dfe5ffd05f5d175dce71b1fb35190534237bdfa Mon Sep 17 00:00:00 2001
From: biondizzle <biondizzle@gmail.com>
Date: Thu, 14 May 2026 11:23:32 +0000
Subject: [PATCH] Add comprehensive README documenting quirks, pitfalls, and
 setup

---
 README.md | 178 +++++++++++++++++++++++++++++++++++++++++++++---------
 1 file changed, 148 insertions(+), 30 deletions(-)

diff --git a/README.md b/README.md
index 7b819617..8fee00e2 100644
--- a/README.md
+++ b/README.md
@@ -1,40 +1,158 @@
-# NVFP4 Mega MoE Kernel — Mojo Rewrite
+# NVFP4 Mega MoE Kernel — CUTLASS Native Blackwell Implementation
 
-Rewrite of the DeepGEMM `fp8_nvfp4_mega_moe` kernel in Mojo.
+Native NVFP4 block-scaled GEMM kernel for DeepSeek-V4-Pro on NVIDIA B200 (Blackwell SM100).
 
-## Why Mojo?
-- Python-like syntax, C-level performance
-- Direct GPU programming without PTX inline asm
-- Safer than CUDA C++ (ownership, borrowing)
-- Better ergonomics for complex kernel development
+## What This Does
+
+Replaces the broken `fp8_nvfp4_mega_moe` DeepGEMM kernel with a working CUTLASS-based implementation that uses **native Blackwell tensor core instructions** (`SM100_MMA_MXF4_SS`) for E2M1 × E2M1 matrix multiplication with UE4M3 block scaling.
 
 ## Architecture
 
-The kernel performs NVFP4 (E2M1 + UE4M3 block16 scales) matrix multiply
-for MoE (Mixture of Experts) with expert parallelism across NVLink.
+DeepSeek-V4-Pro MoE layer (per rank, expert parallel):
+- **L1 (gate_up_proj):** HIDDEN=7168 → 2×INTERMEDIATE=6144
+- **L2 (down_proj):** INTERMEDIATE=3072 → HIDDEN=7168
+- 256 experts total, 32-48 per rank (depends on EP config), top-6 routing
+- NVFP4 quantization: packed E2M1 (int8, 2 FP4 per byte) + UE4M3 block16 scales
 
-### Key operations:
-1. **Staging** — quantize BF16 activation to FP4 (E2M1) with UE8M0 scales
-2. **TMA load** — load packed FP4 weights and UE4M3 scales from global memory
-3. **UMMA** — `mxf4nvf4` matrix multiply with block scaling
-4. **Epilogue** — quantize L1 output (BF16 → FP4 + UE4M3 scales for L2)
-5. **NVLink sync** — cross-rank barrier and buffer management
+## Components
 
-### NVFP4 specifics (vs MXFP4):
-- group_size=16 (UE4M3 block scales), not group_size=32 (UE8M0)
-- 2 SF K-columns per BLOCK_K (128/16/4=2), not 1
-- Weights are E2M1 packed int8 (2 values per byte)
-- `mxf4nvf4` UMMA instruction with `scale_vec::4X`
+### `cutlass_nvfp4_gemm/` — The CUTLASS Extension
 
-## Structure
+| File | Purpose |
+|------|---------|
+| `cutlass_nvfp4_gemm.cu` | CUTLASS GEMM + scale factor remap kernel |
+| `pytorch_binding.cpp` | PyTorch C++ extension binding |
+| `kernel.py` | Python wrapper (`cutlass_nvfp4_gemm`, `cutlass_grouped_nvfp4_gemm`) |
+| `setup.py` | Build configuration (SM100a target) |
+
+### `nvfp4_mega_moe.py` — Main Entry Point
+
+Called by the patched `deepseek_v4.py`. Dispatches to CUTLASS when `MEGA_MOE_USE_CUTLASS=1`.
+
+### `weight_transform.py` — Weight Transformation
+
+Converts raw NVFP4 checkpoint weights into the format expected by the kernel:
+- Folds global scales (float32) into block scales (UE4M3)
+- Interleaves L1 gate_up weights for 2CTA UMMA
+
+### `symm_buffer.py` — Symmetric Buffer
+
+Stub for NVLink cross-rank all-reduce. Matches the DeepGEMM API expected by vLLM's deepseek_v4.py.
+
+## Critical Quirks & Pitfalls
+
+### 1. Scale Factor Layout (THE BIG ONE)
+
+CUTLASS's `Sm1xxBlockScaledConfig` expects scale factors in an **interleaved layout**, NOT simple row-major. The layout is defined by:
+
+```cpp
+SfAtom = Shape<Shape<32,4>, Shape<16,4>>
+       with Stride<Stride<16,4>, Stride<0,1>>
+layout_SFA = tile_to_shape(SfAtom{}, make_shape(M,K), Step<_2,_1>{})
 ```
-src/
-  mega_moe.mojo    — main kernel entry point
-  staging.mojo     — activation quantization (BF16 → FP4)
-  tma.mojo         — TMA descriptor creation and copy
-  umma.mojo        — UMMA descriptor and MMA operations
-  epilogue.mojo    — output quantization and TMA store
-  barrier.mojo     — NVLink cluster sync and symm buffer
-  layout.mojo      — weight transformation and SF layout
-  utils.mojo       — math helpers, UE4M3 packing
+
+If you pass row-major scales directly, TMA loads read garbage addresses → **NaN output** → downstream CUDA illegal memory access.
+
+**Fix:** GPU-side remap kernel using `cute::idx2crd()` to convert CUTLASS layout indices to (row, k_group) coordinates, then index into row-major source.
+
+### 2. CUTLASS Version Matters
+
+TileLang's bundled CUTLASS is **too old** — missing `float_e2m1_t`, `float_ue4m3_t`, block-scaled types. You need the latest from GitHub:
+```bash
+git clone --depth 1 https://github.com/NVIDIA/cutlass.git /root/cutlass
 ```
+
+Key files only in the latest version:
+- `include/cutlass/float_subbyte.h` — `float_e2m1_t` and `float_ue4m3_t`
+- `include/cutlass/detail/sm100_blockscaled_layout.hpp` — SFA/SFB layout computation
+- `examples/72_blackwell_narrow_precision_gemm/72b_nvfp4_nvfp4_gemm.cu` — reference implementation
+
+### 3. nixl_ep Breaks CUDA 13 Images
+
+The `vllm/vllm-openai:nightly` image ships `nixl_ep` compiled against CUDA 12, but the image is CUDA 13. At import time it tries to `dlopen("libcudart.so.12")` → crash. Remove it:
+```dockerfile
+RUN pip uninstall -y nixl-ep; rm -rf /usr/local/lib/python3.12/dist-packages/nixl_ep
+```
+
+### 4. Fabric Manager Required for B200
+
+B200 NVLink clusters need `nvidia-fabricmanager` running before CUDA runtime can init. Without it:
+- `nvidia-smi` works (kernel module)
+- `cudaGetDeviceCount()` segfaults (userspace driver)
+- Error 802: "system not yet initialized"
+
+```bash
+systemctl enable nvidia-fabricmanager
+systemctl start nvidia-fabricmanager
+```
+
+### 5. Docker GPU Access
+
+Must use `deploy.resources.reservations.devices` in docker-compose, NOT `runtime: nvidia` in daemon.json:
+```yaml
+deploy:
+  resources:
+    reservations:
+      devices:
+        - driver: nvidia
+          count: all
+          capabilities: [gpu]
+```
+
+### 6. PyTorch Extension API (nightly vLLM)
+
+- Use `c10::cuda::getCurrentCUDAStream()` not `at::cuda::getCurrentCUDAStream()`
+- Use `torch::kBFloat16` not `at::kBF16`
+- `CUDAExtension` uses `include_dirs` not `extra_include_paths`
+- `python3` not `python` in the image
+
+### 7. CCCL Headers
+
+CUTLASS 3.x depends on libcu++ (CCCL). Found at:
+```
+/usr/local/cuda-13.0/targets/x86_64-linux/include/cccl/
+```
+
+### 8. No Mixing CUDA Versions
+
+**Hard rule.** If something needs CUDA 12 in a CUDA 13 image, remove the thing that needs CUDA 12. Never symlink `libcudart.so.13 → libcudart.so.12`.
+
+## Building
+
+On the B200 server, inside the vLLM Docker container:
+
+```bash
+cd /root/nvfp4-megamoe-kernel/src/nvfp4_megamoe_kernel/cutlass_nvfp4_gemm
+TORCH_CUDA_ARCH_LIST=10.0 python3 setup.py build_ext --inplace
+```
+
+Requires:
+- CUTLASS at `/root/cutlass/include`
+- CCCL at `/usr/local/cuda-13.0/targets/x86_64-linux/include/cccl/`
+
+## Environment Variables
+
+| Variable | Default | Description |
+|----------|---------|-------------|
+| `MEGA_MOE_STATIC` | `0` | Set to `1` to bypass kernel (return zeros) |
+| `MEGA_MOE_USE_CUTLASS` | `1` | Use CUTLASS native NVFP4 kernel |
+| `MEGA_MOE_DEBUG` | `0` | Enable debug prints |
+| `VLLM_USE_NIXL` | `0` | Disable NIXL (broken in nightly) |
+
+## Current Status (May 14, 2026)
+
+- ✅ CUTLASS NVFP4 GEMM compiles and loads
+- ✅ Scale factor remap works (no NaN)
+- ✅ vLLM server starts with native kernel
+- ✅ L1 and L2 CUTLASS kernels execute
+- ⚠️ Output is garbage — shared experts are bypassed (zeros)
+- ⚠️ FlashInfer/DeepGEMM TF32 GEMM (shared experts) crashes workers
+- ⚠️ MoE dispatch is slow (Python per-expert loop)
+
+## Next Steps
+
+1. Fix shared experts crash (FlashInfer TF32 GEMM illegal memory access)
+2. Verify numerical correctness of SF remap (compare against dequantize+BF16 reference)
+3. Optimize MoE dispatch (batched/grouped GEMM)
+4. Replace simple `stage_activation` with proper E2M1 quantization
+5. Re-enable shared experts once FlashInfer crash is fixed