docs: update README — vLLM cudagraph inference running, output quality in progress
76
README.md
@@ -26,6 +26,15 @@ BF16 input → quantize to NVFP4
Both GEMM types use `float4_e2m1fn_x2` for weights, `float8_e4m3fn` for block scales, `float32` for global scales. BF16 is used only for SiLU activation, the final MoE scatter, and the compressor — the minimum possible.
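
To make the three dtypes concrete, here is a minimal pure-Python sketch of how one packed `float4_e2m1fn_x2` byte could be decoded. The E2M1 value table is standard; the nibble order (low nibble first) and the dequantization formula (value × block scale × global scale) are assumptions of this sketch, and the helper names are hypothetical rather than taken from this repository.

```python
from typing import Tuple

# Magnitudes representable by a 4-bit E2M1 value
# (1 sign bit, 2 exponent bits, 1 mantissa bit).
E2M1_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)


def decode_e2m1(code: int) -> float:
    """Decode one 4-bit code: bit 3 is the sign, bits 0-2 index the table."""
    magnitude = E2M1_MAGNITUDES[code & 0x7]
    return -magnitude if code & 0x8 else magnitude


def unpack_byte(byte: int) -> Tuple[float, float]:
    """Split one packed byte into two values (low nibble first: an assumption)."""
    return decode_e2m1(byte & 0xF), decode_e2m1(byte >> 4)


def dequantize(byte: int, block_scale: float, global_scale: float) -> Tuple[float, float]:
    """Apply the per-block scale and the global scale (assumed to multiply)."""
    lo, hi = unpack_byte(byte)
    return lo * block_scale * global_scale, hi * block_scale * global_scale


pair = dequantize(0xF2, block_scale=0.5, global_scale=2.0)
```

In the real pipeline the block scale would be a `float8_e4m3fn` value shared by a block of elements and the global scale a single `float32`; plain Python floats stand in for both here.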

## Current Status: vLLM Inference Running 🎉

**vLLM serves DeepSeek-V4-Pro NVFP4 with cudagraph enabled.** The model loads, cudagraph captures successfully, and inference runs. Output quality is still being fixed (the model currently emits garbage tokens), but this is the first time the entire pipeline (model loading, kernel compilation, cudagraph capture, and inference) runs end to end.

**Test Results:**
- `tests/layertest.py`: cosine 0.988 vs BF16 reference ✅
- `tests/cudagraph_test.py`: capture + replay PASS ✅
- vLLM inference: running with cudagraph, output quality in progress

## How We Got Here

### The C++ CUTLASS Kernel Was Broken
@@ -100,25 +109,36 @@ padded_scales.zero_()
- VRAM cost is negligible (~350KB for activation intermediates across all MoE layers)
- vLLM already does this everywhere (attention, FFN, etc.)

#### 3. Dynamic Output Allocation
#### 3. Kernel Compilation in the Forward Path

`torch.zeros(num_tokens, ...)` inside the forward pass creates a new tensor each call. In cudagraph, new allocations are recorded and replayed; this works, but only if `num_tokens` is fixed (which it is, since vLLM captures at fixed token budgets).

`cute.compile()` is a host-side JIT operation that generates PTX and compiles a CUDA kernel. It cannot be called inside cudagraph capture. The fix: **compile once during warmup, cache the compiled kernel, then only invoke `compiled()` on subsequent calls**.
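
The compile-once pattern can be sketched in plain Python. This is an illustration only: `CompiledKernelCache` and `fake_compile` are hypothetical names, and `fake_compile` stands in for the real `cute.compile()` call, whose arguments are not shown in this README.

```python
from typing import Any, Callable, Dict, Hashable


class CompiledKernelCache:
    """Compile on first use (warmup), then reuse the compiled callable."""

    def __init__(self, compile_fn: Callable[..., Callable[..., Any]]) -> None:
        self._compile_fn = compile_fn  # stand-in for the host-side JIT step
        self._cache: Dict[Hashable, Callable[..., Any]] = {}
        self.compile_count = 0

    def get(self, key: Hashable, *compile_args: Any) -> Callable[..., Any]:
        compiled = self._cache.get(key)
        if compiled is None:
            # Host-side compilation: must only happen outside graph capture.
            compiled = self._compile_fn(*compile_args)
            self._cache[key] = compiled
            self.compile_count += 1
        return compiled


def fake_compile(scale: int) -> Callable[[int], int]:
    """Hypothetical stand-in for an expensive JIT compile."""
    return lambda x: x * scale


cache = CompiledKernelCache(fake_compile)
warmup_kernel = cache.get(("gemm", 2), 2)  # warmup call: compiles
replay_kernel = cache.get(("gemm", 2), 2)  # later calls: cache hit, no compile
```

The cache key should capture everything the compiled kernel is specialized on, so that a cache hit never returns a kernel built for a different configuration.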

#### Test Harness
The compiled kernel uses `separate_tensormap_init=True`, which handles TMA descriptor re-initialization for new tensor data. We create new `mark_layout_dynamic` CuTe tensor views for each forward call, and the compiled kernel accepts them.

`tests/cudagraph_test.py` validates cudagraph compatibility by:
1. Creating a runner with dummy weights
2. Running a warmup forward pass (triggers kernel compilation)
3. Attempting `torch.cuda.graph(g)` capture on the forward pass
4. If capture fails, patching `torch.cuda.synchronize`, `.item()`, `.tolist()`, `.cpu()` to detect exactly which syncs are happening
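
Step 4 of the list above can be sketched as a small context manager that wraps the suspect methods and records who called them. This is not the code in `tests/cudagraph_test.py`; `record_calls` and `FakeTensor` are hypothetical, and the stand-in class is used so the sketch runs without torch or a GPU.

```python
import contextlib
import functools
import traceback
from typing import Iterator, List, Tuple


@contextlib.contextmanager
def record_calls(owner: object, names: List[str]) -> Iterator[List[Tuple[str, str]]]:
    """Temporarily wrap owner.<name> so each call is logged with its caller."""
    log: List[Tuple[str, str]] = []
    originals = {name: getattr(owner, name) for name in names}

    def make_wrapper(name, fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            # The innermost two frames are [caller, wrapper]; keep the caller.
            caller = traceback.extract_stack(limit=2)[0]
            log.append((name, f"{caller.name}:{caller.lineno}"))
            return fn(*args, **kwargs)
        return wrapper

    for name, fn in originals.items():
        setattr(owner, name, make_wrapper(name, fn))
    try:
        yield log
    finally:
        # Always restore the original methods, even if the body raised.
        for name, fn in originals.items():
            setattr(owner, name, fn)


class FakeTensor:
    """Stand-in for a GPU tensor."""

    def __init__(self, value: int) -> None:
        self.value = value

    def item(self) -> int:
        return self.value


def forward(t: FakeTensor) -> int:
    return t.item() + 1  # would be a hidden host sync on a real tensor


with record_calls(FakeTensor, ["item"]) as sync_log:
    result = forward(FakeTensor(41))
```

Each log entry names the wrapped method and the function and line that called it, which is the information needed to locate a sync inside the forward path.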

**Critical lesson:** Caching the compiled kernel across different tensor allocations initially produced wrong results (cosine 0.5 instead of 0.99). Caching itself was not the problem: our bridge had other bugs (wrong `make_b_k_major` stride check, `quantize_weight_to_nvfp4` packing N instead of K). Once those were fixed, cached compilation worked correctly.

Run on the B200:
```bash
cd /root/nvfp4-megamoe-kernel
source tests/.venv/bin/activate
python3 tests/cudagraph_test.py
```
#### 4. Weight Quantization: K is the Packed Dimension

`quantize_weight_to_nvfp4` packs K (dim 0), whereas `quantize_to_nvfp4` packs the last dimension. For a weight matrix `(K, N)`:
- K=7168 is the packed dimension (7168 → 3584 `float4_e2m1fn_x2` elements)
- N=6144 stays as-is
- Block scales are computed along K blocks: `(K//16, N)` not `(K//2, N//16)`
- The nibble packing uses `[:, ::2, :]` and `[:, 1::2, :]` (along the K block dim)

Confusing the two quantization functions produces wrong tensor shapes that crash or produce garbage.
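
The shape arithmetic above can be checked with a small pure-Python sketch. It is a 2D simplification of the real function (which indexes a 3D block view), the helper names are hypothetical, and the choice of low nibble for even K rows is an assumption.

```python
from typing import List, Tuple

BLOCK = 16  # elements per block scale along K, from the (K//16, N) shape


def k_packed_shapes(k: int, n: int) -> Tuple[Tuple[int, int], Tuple[int, int]]:
    """Shapes for a (K, N) weight packed along K: (packed data, block scales)."""
    if k % BLOCK:
        raise ValueError("K must be a multiple of the block size")
    return (k // 2, n), (k // BLOCK, n)


def pack_along_k(codes: List[List[int]]) -> List[List[int]]:
    """Pack 4-bit codes of shape (K, N) into bytes of shape (K//2, N).

    Rows 0, 2, 4, ... fill the low nibble and rows 1, 3, 5, ... the high
    nibble. The nibble order is an assumption of this sketch.
    """
    even, odd = codes[::2], codes[1::2]
    return [
        [(lo & 0xF) | ((hi & 0xF) << 4) for lo, hi in zip(row_even, row_odd)]
        for row_even, row_odd in zip(even, odd)
    ]


packed_shape, scale_shape = k_packed_shapes(7168, 6144)
tiny = pack_along_k([[1, 2], [3, 4], [5, 6], [7, 8]])  # K=4, N=2
```

Note that N is untouched in both returned shapes; only K shrinks, by a factor of 2 for the packed data and 16 for the block scales.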

#### 5. B Tensor K-Major Layout

The CuTeDSL kernel expects B tensors in K-major memory layout (K elements contiguous in memory). `torch.stack` produces N-major layout. The fix is a **double-permute trick**: transpose, make contiguous, transpose back. Same shape, different strides.

```python
# Double-permute: (E,K,N) → (E,N,K) → contiguous → (E,K,N)
# Same shape, but K-contiguous memory layout
return b_tensor.permute(0, 2, 1).contiguous().permute(0, 2, 1)
```

A single permute changes the tensor SHAPE (swapping K and N), which breaks everything downstream.
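
The effect on strides can be traced without a GPU. The sketch below models only shape and stride bookkeeping (in elements) the way `permute` and `contiguous` treat them; the helper names are hypothetical and the sizes are toy values.

```python
from typing import Sequence, Tuple

Layout = Tuple[Tuple[int, ...], Tuple[int, ...]]  # (shape, strides in elements)


def contiguous(shape: Sequence[int]) -> Layout:
    """Row-major layout: the last dimension has stride 1."""
    strides, step = [], 1
    for size in reversed(shape):
        strides.append(step)
        step *= size
    return tuple(shape), tuple(reversed(strides))


def permute(layout: Layout, dims: Sequence[int]) -> Layout:
    """Reorder dimensions without moving data: shape and strides move together."""
    shape, strides = layout
    return tuple(shape[d] for d in dims), tuple(strides[d] for d in dims)


E, K, N = 2, 4, 3
stacked = contiguous((E, K, N))            # N-major: N has stride 1
single = permute(stacked, (0, 2, 1))       # shape becomes (E, N, K)
k_major = permute(contiguous(single[0]), (0, 2, 1))  # shape (E, K, N) again
```

Here `stacked` and `k_major` share the shape `(E, K, N)`, but the stride-1 dimension moves from N to K, which is the property the kernel needs.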

### Key Lessons

1. **A wrong reference is worse than no reference**: the 0.2 cosine against a broken BF16 dequant sent us chasing SF remap bugs for weeks
@@ -127,6 +147,8 @@ python3 tests/cudagraph_test.py
4. **Separate the GEMM from the pipeline**: our `layertest.py` runs without vLLM, Docker, or tensor parallelism. It caught the kernel bug that vLLM's integration layers masked.
5. **Don't dequant what's already quantized**: if the kernel expects NVFP4 and the checkpoint is NVFP4, leave it alone. No BF16 round-trips.
6. **GPU scalar slicing is a silent cudagraph killer**: no descriptive error or warning, just `cudaErrorStreamCaptureInvalidated` with no pointer to the cause. The test harness catches it.
7. **Weight vs activation quantization are different**: K-packed (weights) vs last-dim-packed (activations). Mixing them up produces wrong shapes and garbage output.
8. **Double-permute for memory layout changes**: a single permute changes shape, a double permute changes layout. The kernel cares about both.

## Project Structure
@@ -147,7 +169,7 @@ nvfp4-megamoe-kernel/
│ └── deepseek_v4_attention.py # Attention patch (NVFP4 native)
├── tests/
│ ├── cudagraph_test.py # CUDAGraph compatibility test (✅ PASS)
│ ├── layertest.py # Layer 0 comparison: CuTeDSL vs BF16 (✅ cosine 0.989)
│ ├── layertest.py # Layer 0 comparison: CuTeDSL vs BF16 (✅ cosine 0.988)
│ ├── test_cutedsl.py # Small standalone CuTeDSL test (✅ cosine 0.991)
│ ├── test_uniform_fp4.py # Uniform data GEMM test
│ ├── test_b_layout.py # B matrix column layout test
@@ -161,20 +183,22 @@ Handles all tensor layout conversion from our pipeline to what the CuTeDSL kerne

| Function | What it does |
|----------|--------------|
| `quantize_activation_nvfp4()` | BF16 → float4_e2m1fn_x2 + float8_e4m3fn block scales (cudagraph-safe, no `.max()` sync) |
| `quantize_weight_to_nvfp4()` | Same, but for weight matrices with K as the packed dimension |
| `quantize_to_nvfp4()` | BF16 → float4_e2m1fn_x2 + float8_e4m3fn block scales (NOT cudagraph-safe: uses `.max()`) |
| `quantize_activation_nvfp4()` | Same, but cudagraph-safe (fixed global_scale, no `.max()`) |
| `quantize_weight_to_nvfp4()` | Same, but K is the packed dimension (different block scale shape) |
| `assemble_scales_2d_side()` | Pad and swizzle activation scale factors (2Dx3D A side) |
| `assemble_scales_3d_side()` | Pad and swizzle weight scale factors (2Dx3D B side) |
| `make_b_k_major()` | Convert B tensor from N-major to K-major strides (required by kernel) |
| `run_nvfp4_grouped_gemm()` | Full kernel launch (compile + run, cudagraph-safe) |
| `make_b_k_major()` | Convert B tensor from N-major to K-major strides (double-permute trick) |
| `run_nvfp4_grouped_gemm()` | Kernel launch with cached compilation (cudagraph-safe) |
## Running Tests

On the B200:

On the B200 (host venv, no container):

```bash
cd /root/nvfp4-megamoe-kernel
source tests/.venv/bin/activate
export CUDA_TOOLKIT_PATH=/usr/local/cuda

# CUDAGraph compatibility test
python3 tests/cudagraph_test.py
```
@@ -202,7 +226,7 @@ python3 tests/layertest.py
### Phase 1: Kernel ✅ DONE
- CuTeDSL ScaledGroupedGemmKernel works with NVFP4
- Bridge layer handles all tensor layout conversion
- Full MoE pipeline (L1→SiLU→L2→scatter) produces cosine 0.989 vs BF16
- Full MoE pipeline (L1→SiLU→L2→scatter) produces cosine 0.988 vs BF16

### Phase 2: vLLM Integration ✅ DONE
- CuTeDSLMoERunner wires CuTeDSL kernel into vLLM
@@ -222,14 +246,22 @@ python3 tests/layertest.py
- Test harness validates capture + replay
- VRAM overhead: ~350KB (negligible)
- Compute overhead: padding sends all-zero rows through the GEMM (memory-bound, effectively free)
- Kernel compilation cached: `cute.compile()` once during warmup, `compiled()` on forward calls

### Phase 3: Optimization
### Phase 3: Output Quality 🔧 IN PROGRESS
- vLLM serves the model with cudagraph, but output is garbage tokens
- Layer 0 cosine is 0.988 in isolation, so the GEMM math is correct
- Investigating: weight loading path in vLLM, tensor parallelism handling, scale normalization
- Need to compare vLLM's weight pipeline vs layertest's direct path

### Phase 4: Optimization
- Replace wo_a FP8 conversion with native NVFP4 GEMM (eliminate last dequant)
- Fix compressor weight_loader so it stays NVFP4 native
- Explore larger tile sizes for better occupancy
- Profile end-to-end inference on full model
- Proper TMA descriptor management for kernel caching across different tensor pools

### Phase 4: Production
### Phase 5: Production
- Clean up old C++ kernel code (tagged `the-last-of-cutlass`)
- Add proper error handling and logging
- Benchmark vs BF16 baseline