README.md

# NVFP4 MegaMoE Kernel

Full NVFP4 inference pipeline for DeepSeek-V4 on NVIDIA Blackwell (SM100). The entire model — MoE experts, shared experts, and attention projections — runs in native NVFP4 with zero dequantization overhead.

## What This Is

A native NVFP4 inference stack for DeepSeek-V4:

**MoE Experts** — CuTeDSL ScaledGroupedGemmKernel (our work):
```
BF16 input → quantize to NVFP4
  L1 GEMM: NVFP4 × NVFP4 → BF16 (gate + up)
  SiLU(gate) * up → BF16 (only nonlinear — can't avoid BF16 here)
  Re-quantize → NVFP4
  L2 GEMM: NVFP4 × NVFP4 → BF16 (down_proj)
  Scatter with routing weights → BF16 output
```

**Attention Projections** — FlashInferCutlassNvFp4LinearKernel (vLLM built-in):
- `wq_b`, `wo_b`, `fused_wqa_wkv` — native NVFP4, no conversion
- `wo_a` — NVFP4→FP8 for `fp8_einsum` (only attention weight that needs conversion)
- Compressor — BF16 (weight_loader stacking issue, small matmul)

**Shared Experts** — FlashInferCutlassNvFp4LinearKernel (vLLM built-in):
- `gate_up_proj`, `down_proj` — native NVFP4

Both GEMM types use `float4_e2m1fn_x2` for weights, `float8_e4m3fn` for block scales, `float32` for global scales. BF16 is used only for SiLU activation, the final MoE scatter, and the compressor — the minimum possible.

## How We Got Here

### The C++ CUTLASS Kernel Was Broken

The original kernel was a C++ `.cu` file using CUTLASS's C++ API directly. It passed all the simple tests (uniform data → exact output, SF remap verifier → 0 errors) but produced **cosine 0.05** with real random data. After weeks of debugging the SF remap (8+ iterations, all producing the same 0.2 cosine against a wrong reference), we discovered:

1. **The BF16 reference comparison was wrong** — our Python dequantization didn't match CUTLASS's internal FP4 handling. A wrong reference is worse than no reference. We chased ghosts through 8+ SF remap rewrites because the 0.2 cosine was never about the remap.

2. **The C++ CUTLASS kernel misinterpreted FP4 data** — even with SF remap verified correct (0 byte errors), the GEMM produced garbage with non-uniform data. The issue was in how CUTLASS's C++ API handles FP4 packing/tiling internally — something we couldn't easily debug or fix.

3. **The checkpoint `input_scale` was a red herring** — we tried using the checkpoint's calibration scale as the activation normalization scale. It saturated all block scales to 448.0 (max float8). The `input_scale` is a calibration constant for alpha computation, not a normalization scale.

### The CuTeDSL Kernel Works

NVIDIA's CuTeDSL approach (Python-based CUTLASS kernels compiled via MLML → PTX) is what the CUTLASS team recommends for Blackwell. Their official MoE scaled grouped GEMM example (`torch_scaled_grouped_mm.py`) supports NVFP4 out of the box. We adapted it.

**Results with real DeepSeek-V4 layer 0 weights:**
- L1 GEMM alone: cosine 0.995
- Full MoE pipeline (L1→SiLU→L2→scatter): cosine 0.989
- Weight loading: **0% loss** — direct uint8→float4_e2m1fn_x2 view-cast, bit-identical to checkpoint
- Activation quantization: ~1.1% cosine loss (dynamic BF16→NVFP4 — inherent to the format, unavoidable)
- GEMM kernel: 0% loss (CuTeDSL is correct)

The 0.989 cosine is entirely from activation quantization. The weights are bit-identical to the checkpoint — no BF16 round-trip, no precision loss.

### The Dequant→Requant Anti-Pattern

Early versions dequantized all NVFP4 weights to BF16, then let vLLM's `FlashInferCutlassNvFp4LinearKernel` requantize them back to NVFP4 at inference time. This:
- Wasted 5 minutes on load doing NVFP4→BF16 conversion
- Lost precision on the double round-trip
- Caused vLLM to hang — the NVFP4 attention kernel expects native NVFP4 weights, not BF16 weights with an NVFP4 quant_method attached

The fix: **keep everything in NVFP4**. The checkpoint stores NVFP4. The kernels consume NVFP4. No conversion needed.

### CUDAGraph Compatibility

vLLM uses CUDA graphs to eliminate kernel launch overhead in the decode path. CUDA graphs record the entire forward pass once, then replay it — but they require **fixed tensor shapes, fixed memory addresses, and zero CPU-GPU syncs**.

Our original runner was not cudagraph-safe. We had to fix several classes of issues:

#### 1. CPU↔CUDA Tensor Copies

`torch.tensor([0,1,...], device=x.device)` creates the tensor on **CPU first**, then copies to CUDA. This copy is forbidden during graph capture. The fix: **cache tensors per device** on first use, outside the graph.

```python
# BAD — CPU→CUDA copy inside graph
step_to_idx = torch.tensor([0,1,2,3,4,4,5,5,6,6,6,7,7], device=x.device)

# GOOD — cached on first use, reused in graph
step_to_idx = _get_step_to_idx_lut(x.device)  # returns cached CUDA tensor
```

Similarly, `torch.zeros` and `torch.rand` don't support `float4_e2m1fn_x2` or `float8_e4m3fn` dtypes. The fix: create as `uint8` or `float16`, then `.view()` or `.to()` the target dtype.

#### 2. GPU Scalar Slicing

`buf[:gpu_scalar, :]` requires the runtime to query the GPU scalar's value to determine the output shape. This triggers an implicit CPU-GPU sync, which invalidates the graph. The fix: **always use full pre-allocated buffers**. Extra rows are zeros that contribute nothing to the computation.

```python
# BAD — GPU scalar as slice index (implicit sync)
total_padded_rows = padded_expert_offsets[-1]  # GPU scalar
padded_scales = buf[:total_padded_rows, :padded_cols]  # sync!

# GOOD — full pre-allocated buffer, zero out before use
padded_scales = self._padded_scales_buf  # always max size
padded_scales.zero_()
```

**Design decision:** Padding to max size wastes a few rows of compute on zero data, but:
- The extra rows are zeros → zero GEMM output → no accuracy impact
- GEMMs are memory-bandwidth bound → multiplying zeros is nearly free
- VRAM cost is negligible (~350KB for activation intermediates across all MoE layers)
- vLLM already does this everywhere (attention, FFN, etc.)

#### 3. Dynamic Output Allocation

`torch.zeros(num_tokens, ...)` inside the forward pass creates a new tensor each call. In cudagraph, new allocations are recorded and replayed — this works, but only if `num_tokens` is fixed (which it is, since vLLM captures at fixed token budgets).

#### Test Harness

`tests/cudagraph_test.py` validates cudagraph compatibility by:
1. Creating a runner with dummy weights
2. Running a warmup forward pass (triggers kernel compilation)
3. Attempting `torch.cuda.graph(g)` capture on the forward pass
4. If capture fails, patching `torch.cuda.synchronize`, `.item()`, `.tolist()`, `.cpu()` to detect exactly which syncs are happening

Run on the B200:
```bash
cd /root/nvfp4-megamoe-kernel
source tests/.venv/bin/activate
python3 tests/cudagraph_test.py
```

### Key Lessons

1. **A wrong reference is worse than no reference** — the 0.2 cosine against a broken BF16 dequant sent us chasing SF remap bugs for weeks
2. **The C++ CUTLASS API is a footgun for FP4** — CuTeDSL handles tensor layouts, tiling, and SF construction correctly by construction
3. **Test with real data early** — uniform tests pass even with broken kernels; random data reveals real bugs
4. **Separate the GEMM from the pipeline** — our `layertest.py` runs without vLLM, Docker, or tensor parallelism. It caught the kernel bug that vLLM's integration layers masked.
5. **Don't dequant what's already quantized** — if the kernel expects NVFP4 and the checkpoint is NVFP4, leave it alone. No BF16 round-trips.
6. **GPU scalar slicing is a silent cudagraph killer** — no error, no warning, just `cudaErrorStreamCaptureInvalidated` with no pointer to the cause. The test harness catches it.

## Project Structure

```
nvfp4-megamoe-kernel/
├── cutedsl/                          # CuTeDSL kernel + bridge layer
│   ├── bridge.py                     # Tensor layout conversion, quantization, kernel launch
│   ├── moe_pipeline.py              # Full MoE pipeline (L1→SiLU→L2→scatter)
│   └── kernel/moe/                   # NVIDIA's ScaledGroupedGemmKernel (untouched)
│       ├── torch_scaled_grouped_mm.py   # The working kernel (3900 lines)
│       ├── moe_utils.py
│       moe_persistent_scheduler.py
│       └── moe_sched_extension.py
├── vllm/                             # vLLM integration
│   ├── nvfp4_cutedsl.py             # CuTeDSLMoERunner — cudagraph-safe MoE kernel interface
│   └── patches/
│       ├── deepseek_v4.py           # DeepSeek-V4 model patch (NVFP4 native)
│       └── deepseek_v4_attention.py # Attention patch (NVFP4 native)
├── tests/
│   ├── cudagraph_test.py            # CUDAGraph compatibility test (✅ PASS)
│   ├── layertest.py                 # Layer 0 comparison: CuTeDSL vs BF16 (✅ cosine 0.989)
│   ├── test_cutedsl.py              # Small standalone CuTeDSL test (✅ cosine 0.991)
│   ├── test_uniform_fp4.py          # Uniform data GEMM test
│   ├── test_b_layout.py             # B matrix column layout test
│   └── test_quick_rand.py           # Quick random GEMM sanity check
└── reference/                        # Reference files for study
```

## The Bridge Layer (`cutedsl/bridge.py`)

Handles all tensor layout conversion from our pipeline to what the CuTeDSL kernel expects:

| Function | What it does |
|----------|--------------|
| `quantize_activation_nvfp4()` | BF16 → float4_e2m1fn_x2 + float8_e4m3fn block scales (cudagraph-safe, no `.max()` sync) |
| `quantize_weight_to_nvfp4()` | Same, but for weight matrices with K as the packed dimension |
| `assemble_scales_2d_side()` | Pad and swizzle activation scale factors (2Dx3D A side) |
| `assemble_scales_3d_side()` | Pad and swizzle weight scale factors (2Dx3D B side) |
| `make_b_k_major()` | Convert B tensor from N-major to K-major strides (required by kernel) |
| `run_nvfp4_grouped_gemm()` | Full kernel launch (compile + run, cudagraph-safe) |

## Running Tests

On the B200:

```bash
cd /root/nvfp4-megamoe-kernel
source tests/.venv/bin/activate

# CUDAGraph compatibility test
python3 tests/cudagraph_test.py

# Small standalone test
python3 tests/test_cutedsl.py

# Full layer 0 comparison with real weights
python3 tests/layertest.py
```

## NVFP4 Coverage

| Component | Format | Kernel | Conversion? |
|-----------|--------|--------|-------------|
| MoE experts (L1+L2) | NVFP4 native | CuTeDSL ScaledGroupedGemm | No — direct uint8→float4 view-cast |
| Shared experts | NVFP4 native | FlashInferCutlassNvFp4 | No — stays native |
| wq_b, wo_b, fused_wqa_wkv | NVFP4 native | FlashInferCutlassNvFp4 | No — stays native |
| wo_a | NVFP4 → FP8 | fp8_einsum | Yes — fp8_einsum requires FP8 |
| Compressor | NVFP4 → BF16 | torch.mm | Yes — weight_loader stacking issue |
| KV cache | FP8 | FlashInfer MLA | N/A — FP8 is optimal for KV cache |

## Plan

### Phase 1: Kernel ✅ DONE
- CuTeDSL ScaledGroupedGemmKernel works with NVFP4
- Bridge layer handles all tensor layout conversion
- Full MoE pipeline (L1→SiLU→L2→scatter) produces cosine 0.989 vs BF16

### Phase 2: vLLM Integration ✅ DONE
- CuTeDSLMoERunner wires CuTeDSL kernel into vLLM
- Weight loading: checkpoint uint8 → float4_e2m1fn_x2 view-cast (bit-preserving)
- Block scales (float8_e4m3fn) and global scales (float32) pass through directly
- L1 dual global scale handling: normalize to max(gate_gs, up_gs), fold ratio into block scales
- Attention projections stay native NVFP4 (FlashInferCutlassNvFp4LinearKernel)
- CuTeDSL kernel warmup during model load (prevents RPC timeout)
- Removed all debug prints and env var gates from vLLM serving path

### Phase 2.5: CUDAGraph Compatibility ✅ DONE
- CuTeDSLMoERunner is fully cudagraph-safe
- Zero CPU-GPU syncs, zero dynamic shapes, zero GPU scalar slicing
- All intermediate buffers pre-allocated at max_num_tokens * top_k
- `quantize_activation_nvfp4` uses cached LUT (no CPU→CUDA copy)
- `torch.zeros/rand` for float4/float8 → uint8→view or float16→cast
- Test harness validates capture + replay
- VRAM overhead: ~350KB (negligible)
- Compute overhead: zero rows through GEMM on padding (memory-bound, free)

### Phase 3: Optimization
- Replace wo_a FP8 conversion with native NVFP4 GEMM (eliminate last dequant)
- Fix compressor weight_loader so it stays NVFP4 native
- Explore larger tile sizes for better occupancy
- Profile end-to-end inference on full model

### Phase 4: Production
- Clean up old C++ kernel code (tagged `the-last-of-cutlass`)
- Add proper error handling and logging
- Benchmark vs BF16 baseline
-												docs: rewrite README, nuke DEBUG_LOG, add vLLM integration stub

README.md: full rewrite explaining how we got here, project structure,
plan, and key lessons learned from the C++ CUTLASS disaster.

Removed:
- DEBUG_LOG.md (old debug timeline, no longer relevant)
- REWRITE_PLAN.md (plan is now in README)
- test_gemm.py (C++ extension test)

Added:
- vllm/nvfp4_cutedsl.py: CuTeDSLMoERunner class for vLLM integration
  - Replaces nvfp4_mega_moe_full + SymmBuffer with CuTeDSL kernel
  - Handles slot-based routing, L1→SiLU→L2→scatter
  - prepare_weights_from_dequantized() for weight prep

Tagged the-last-of-cutlass on the old C++ kernel state.

											
										
										
											2026-05-16 03:33:16 +00:00
+								# NVFP4 MegaMoE Kernel
-												Initial: TileLang NVFP4 mega_moe kernel package

- nvfp4_mega_moe_full: drop-in replacement for deep_gemm.mega.fp8_nvfp4_mega_moe
- transform_nvfp4_weights_for_mega_moe: weight transformation (tested)
- SymmBuffer + get_symm_buffer_for_nvfp4_mega_moe: API-matching stubs
- MEGA_MOE_STATIC=1 support for pipeline testing
- pyproject.toml for pip install

											
										
										
											2026-05-13 15:44:51 +00:00
-												docs: update README with full NVFP4 coverage, dequant anti-pattern, v2 status

- Added NVFP4 coverage table (what's native, what's converted, why)
- Documented the dequant→requant anti-pattern that caused vLLM hangs
- Updated plan: Phase 2 done, Phase 3 targets remaining conversions
- Removed stale REWRITE_PLAN reference
- Updated project structure (nvfp4_cutedsl.py, removed old refs)

											
										
										
											2026-05-16 05:43:33 +00:00
+								Full NVFP4 inference pipeline for DeepSeek-V4 on NVIDIA Blackwell (SM100). The entire model — MoE experts, shared experts, and attention projections — runs in native NVFP4 with zero dequantization overhead.
-												Initial: TileLang NVFP4 mega_moe kernel package

- nvfp4_mega_moe_full: drop-in replacement for deep_gemm.mega.fp8_nvfp4_mega_moe
- transform_nvfp4_weights_for_mega_moe: weight transformation (tested)
- SymmBuffer + get_symm_buffer_for_nvfp4_mega_moe: API-matching stubs
- MEGA_MOE_STATIC=1 support for pipeline testing
- pyproject.toml for pip install

											
										
										
											2026-05-13 15:44:51 +00:00
-												docs: rewrite README, nuke DEBUG_LOG, add vLLM integration stub

README.md: full rewrite explaining how we got here, project structure,
plan, and key lessons learned from the C++ CUTLASS disaster.

Removed:
- DEBUG_LOG.md (old debug timeline, no longer relevant)
- REWRITE_PLAN.md (plan is now in README)
- test_gemm.py (C++ extension test)

Added:
- vllm/nvfp4_cutedsl.py: CuTeDSLMoERunner class for vLLM integration
  - Replaces nvfp4_mega_moe_full + SymmBuffer with CuTeDSL kernel
  - Handles slot-based routing, L1→SiLU→L2→scatter
  - prepare_weights_from_dequantized() for weight prep

Tagged the-last-of-cutlass on the old C++ kernel state.

											
										
										
											2026-05-16 03:33:16 +00:00
+								## What This Is
-												feat: CUTLASS NVFP4 mega_moe kernel — slot-based L1/L2, source-first SF remap

Major changes from initial TileLang prototype:

Kernel:
- CUTLASS NVFP4 block-scaled GEMM (SM100 Blackwell, OpClassBlockScaledTensorOp)
- Slot-based dispatch: L1 GEMM → SiLU+Mul per-slot → L2 GEMM → index_add scatter
- 1D slot_expert_ids passed to both L1 and L2 (no 2D topk_ids rebuild)
- slot_token gathered in cutlass_grouped_nvfp4_gemm when provided

SF Remap (source-first):
- Iterates logical (m, k_sf) source grid, uses layout_sf(make_coord(m, k_sf))
  for CUTLASS dest index — no idx2crd/flatten coordinate extraction
- 2D kernel launch: dim3 block(32,8), grid over (K_sf, MN)
- Uses cute::cosize() for physical allocation size (not cute::size)
- SFA: (MN, K_sf) row-major; SFB: (K_sf, MN) row-major (col-major)

Weight transform:
- UE4M3 unpack with bit reinterpret (not value cast)
- Global scale folding (weight_scale_2) for gate/up split
- clamp(0,448) → float8_e4m3fn, transpose (N,K)→(K,N) for CUTLASS

No prepack cache:
- SFB remapped per-call inside CUTLASS (~µs, not the bottleneck)
- See README for why prepack cache must never return (OOM, CUDA graphs,
  M-dependent layout, cross-layer collisions)

Stage activation:
- Nearest-neighbor E2M1 quantization (no clamp, no uniform steps)
- Per-tensor global scale → alpha for L2 GEMM

Bug fixes:
- _fold_global_scale: removed broken logical_widths branch
- unpack_ue4m3_u32: int32 for CUDA bitwise, view not to, ND support
- Correct expert param mapping for NVFP4 checkpoint
- SiLU applied per-slot (not after summing expert paths)

											
										
										
											2026-05-15 11:38:18 +00:00
-												docs: update README with full NVFP4 coverage, dequant anti-pattern, v2 status

- Added NVFP4 coverage table (what's native, what's converted, why)
- Documented the dequant→requant anti-pattern that caused vLLM hangs
- Updated plan: Phase 2 done, Phase 3 targets remaining conversions
- Removed stale REWRITE_PLAN reference
- Updated project structure (nvfp4_cutedsl.py, removed old refs)

											
										
										
											2026-05-16 05:43:33 +00:00
+								A native NVFP4 inference stack for DeepSeek-V4:
-												feat: CUTLASS NVFP4 mega_moe kernel — slot-based L1/L2, source-first SF remap

Major changes from initial TileLang prototype:

Kernel:
- CUTLASS NVFP4 block-scaled GEMM (SM100 Blackwell, OpClassBlockScaledTensorOp)
- Slot-based dispatch: L1 GEMM → SiLU+Mul per-slot → L2 GEMM → index_add scatter
- 1D slot_expert_ids passed to both L1 and L2 (no 2D topk_ids rebuild)
- slot_token gathered in cutlass_grouped_nvfp4_gemm when provided

SF Remap (source-first):
- Iterates logical (m, k_sf) source grid, uses layout_sf(make_coord(m, k_sf))
  for CUTLASS dest index — no idx2crd/flatten coordinate extraction
- 2D kernel launch: dim3 block(32,8), grid over (K_sf, MN)
- Uses cute::cosize() for physical allocation size (not cute::size)
- SFA: (MN, K_sf) row-major; SFB: (K_sf, MN) row-major (col-major)

Weight transform:
- UE4M3 unpack with bit reinterpret (not value cast)
- Global scale folding (weight_scale_2) for gate/up split
- clamp(0,448) → float8_e4m3fn, transpose (N,K)→(K,N) for CUTLASS

No prepack cache:
- SFB remapped per-call inside CUTLASS (~µs, not the bottleneck)
- See README for why prepack cache must never return (OOM, CUDA graphs,
  M-dependent layout, cross-layer collisions)

Stage activation:
- Nearest-neighbor E2M1 quantization (no clamp, no uniform steps)
- Per-tensor global scale → alpha for L2 GEMM

Bug fixes:
- _fold_global_scale: removed broken logical_widths branch
- unpack_ue4m3_u32: int32 for CUDA bitwise, view not to, ND support
- Correct expert param mapping for NVFP4 checkpoint
- SiLU applied per-slot (not after summing expert paths)

											
										
										
											2026-05-15 11:38:18 +00:00
-												docs: update README with full NVFP4 coverage, dequant anti-pattern, v2 status

- Added NVFP4 coverage table (what's native, what's converted, why)
- Documented the dequant→requant anti-pattern that caused vLLM hangs
- Updated plan: Phase 2 done, Phase 3 targets remaining conversions
- Removed stale REWRITE_PLAN reference
- Updated project structure (nvfp4_cutedsl.py, removed old refs)

											
										
										
											2026-05-16 05:43:33 +00:00
+								**MoE Experts** — CuTeDSL ScaledGroupedGemmKernel (our work):
-												feat: CUTLASS NVFP4 mega_moe kernel — slot-based L1/L2, source-first SF remap

Major changes from initial TileLang prototype:

Kernel:
- CUTLASS NVFP4 block-scaled GEMM (SM100 Blackwell, OpClassBlockScaledTensorOp)
- Slot-based dispatch: L1 GEMM → SiLU+Mul per-slot → L2 GEMM → index_add scatter
- 1D slot_expert_ids passed to both L1 and L2 (no 2D topk_ids rebuild)
- slot_token gathered in cutlass_grouped_nvfp4_gemm when provided

SF Remap (source-first):
- Iterates logical (m, k_sf) source grid, uses layout_sf(make_coord(m, k_sf))
  for CUTLASS dest index — no idx2crd/flatten coordinate extraction
- 2D kernel launch: dim3 block(32,8), grid over (K_sf, MN)
- Uses cute::cosize() for physical allocation size (not cute::size)
- SFA: (MN, K_sf) row-major; SFB: (K_sf, MN) row-major (col-major)

Weight transform:
- UE4M3 unpack with bit reinterpret (not value cast)
- Global scale folding (weight_scale_2) for gate/up split
- clamp(0,448) → float8_e4m3fn, transpose (N,K)→(K,N) for CUTLASS

No prepack cache:
- SFB remapped per-call inside CUTLASS (~µs, not the bottleneck)
- See README for why prepack cache must never return (OOM, CUDA graphs,
  M-dependent layout, cross-layer collisions)

Stage activation:
- Nearest-neighbor E2M1 quantization (no clamp, no uniform steps)
- Per-tensor global scale → alpha for L2 GEMM

Bug fixes:
- _fold_global_scale: removed broken logical_widths branch
- unpack_ue4m3_u32: int32 for CUDA bitwise, view not to, ND support
- Correct expert param mapping for NVFP4 checkpoint
- SiLU applied per-slot (not after summing expert paths)

											
										
										
											2026-05-15 11:38:18 +00:00
+								```
-												docs: rewrite README, nuke DEBUG_LOG, add vLLM integration stub

README.md: full rewrite explaining how we got here, project structure,
plan, and key lessons learned from the C++ CUTLASS disaster.

Removed:
- DEBUG_LOG.md (old debug timeline, no longer relevant)
- REWRITE_PLAN.md (plan is now in README)
- test_gemm.py (C++ extension test)

Added:
- vllm/nvfp4_cutedsl.py: CuTeDSLMoERunner class for vLLM integration
  - Replaces nvfp4_mega_moe_full + SymmBuffer with CuTeDSL kernel
  - Handles slot-based routing, L1→SiLU→L2→scatter
  - prepare_weights_from_dequantized() for weight prep

Tagged the-last-of-cutlass on the old C++ kernel state.

											
										
										
											2026-05-16 03:33:16 +00:00
+								BF16 input → quantize to NVFP4
 								  L1 GEMM: NVFP4 × NVFP4 → BF16 (gate + up)
 								  SiLU(gate) * up → BF16 (only nonlinear — can't avoid BF16 here)
 								  Re-quantize → NVFP4
 								  L2 GEMM: NVFP4 × NVFP4 → BF16 (down_proj)
 								  Scatter with routing weights → BF16 output
-												feat: CUTLASS NVFP4 mega_moe kernel — slot-based L1/L2, source-first SF remap

Major changes from initial TileLang prototype:

Kernel:
- CUTLASS NVFP4 block-scaled GEMM (SM100 Blackwell, OpClassBlockScaledTensorOp)
- Slot-based dispatch: L1 GEMM → SiLU+Mul per-slot → L2 GEMM → index_add scatter
- 1D slot_expert_ids passed to both L1 and L2 (no 2D topk_ids rebuild)
- slot_token gathered in cutlass_grouped_nvfp4_gemm when provided

SF Remap (source-first):
- Iterates logical (m, k_sf) source grid, uses layout_sf(make_coord(m, k_sf))
  for CUTLASS dest index — no idx2crd/flatten coordinate extraction
- 2D kernel launch: dim3 block(32,8), grid over (K_sf, MN)
- Uses cute::cosize() for physical allocation size (not cute::size)
- SFA: (MN, K_sf) row-major; SFB: (K_sf, MN) row-major (col-major)

Weight transform:
- UE4M3 unpack with bit reinterpret (not value cast)
- Global scale folding (weight_scale_2) for gate/up split
- clamp(0,448) → float8_e4m3fn, transpose (N,K)→(K,N) for CUTLASS

No prepack cache:
- SFB remapped per-call inside CUTLASS (~µs, not the bottleneck)
- See README for why prepack cache must never return (OOM, CUDA graphs,
  M-dependent layout, cross-layer collisions)

Stage activation:
- Nearest-neighbor E2M1 quantization (no clamp, no uniform steps)
- Per-tensor global scale → alpha for L2 GEMM

Bug fixes:
- _fold_global_scale: removed broken logical_widths branch
- unpack_ue4m3_u32: int32 for CUDA bitwise, view not to, ND support
- Correct expert param mapping for NVFP4 checkpoint
- SiLU applied per-slot (not after summing expert paths)

											
										
										
											2026-05-15 11:38:18 +00:00
+								```
-												docs: update README with full NVFP4 coverage, dequant anti-pattern, v2 status

- Added NVFP4 coverage table (what's native, what's converted, why)
- Documented the dequant→requant anti-pattern that caused vLLM hangs
- Updated plan: Phase 2 done, Phase 3 targets remaining conversions
- Removed stale REWRITE_PLAN reference
- Updated project structure (nvfp4_cutedsl.py, removed old refs)

											
										
										
											2026-05-16 05:43:33 +00:00
+								**Attention Projections** — FlashInferCutlassNvFp4LinearKernel (vLLM built-in):
 								- `wq_b`, `wo_b`, `fused_wqa_wkv` — native NVFP4, no conversion
 								- `wo_a` — NVFP4→FP8 for `fp8_einsum` (only attention weight that needs conversion)
 								- Compressor — BF16 (weight_loader stacking issue, small matmul)
 								**Shared Experts** — FlashInferCutlassNvFp4LinearKernel (vLLM built-in):
 								- `gate_up_proj`, `down_proj` — native NVFP4
 								Both GEMM types use `float4_e2m1fn_x2` for weights, `float8_e4m3fn` for block scales, `float32` for global scales. BF16 is used only for SiLU activation, the final MoE scatter, and the compressor — the minimum possible.
-												feat: CUTLASS NVFP4 mega_moe kernel — slot-based L1/L2, source-first SF remap

Major changes from initial TileLang prototype:

Kernel:
- CUTLASS NVFP4 block-scaled GEMM (SM100 Blackwell, OpClassBlockScaledTensorOp)
- Slot-based dispatch: L1 GEMM → SiLU+Mul per-slot → L2 GEMM → index_add scatter
- 1D slot_expert_ids passed to both L1 and L2 (no 2D topk_ids rebuild)
- slot_token gathered in cutlass_grouped_nvfp4_gemm when provided

SF Remap (source-first):
- Iterates logical (m, k_sf) source grid, uses layout_sf(make_coord(m, k_sf))
  for CUTLASS dest index — no idx2crd/flatten coordinate extraction
- 2D kernel launch: dim3 block(32,8), grid over (K_sf, MN)
- Uses cute::cosize() for physical allocation size (not cute::size)
- SFA: (MN, K_sf) row-major; SFB: (K_sf, MN) row-major (col-major)

Weight transform:
- UE4M3 unpack with bit reinterpret (not value cast)
- Global scale folding (weight_scale_2) for gate/up split
- clamp(0,448) → float8_e4m3fn, transpose (N,K)→(K,N) for CUTLASS

No prepack cache:
- SFB remapped per-call inside CUTLASS (~µs, not the bottleneck)
- See README for why prepack cache must never return (OOM, CUDA graphs,
  M-dependent layout, cross-layer collisions)

Stage activation:
- Nearest-neighbor E2M1 quantization (no clamp, no uniform steps)
- Per-tensor global scale → alpha for L2 GEMM

Bug fixes:
- _fold_global_scale: removed broken logical_widths branch
- unpack_ue4m3_u32: int32 for CUDA bitwise, view not to, ND support
- Correct expert param mapping for NVFP4 checkpoint
- SiLU applied per-slot (not after summing expert paths)

											
										
										
											2026-05-15 11:38:18 +00:00
-												docs: rewrite README, nuke DEBUG_LOG, add vLLM integration stub

README.md: full rewrite explaining how we got here, project structure,
plan, and key lessons learned from the C++ CUTLASS disaster.

Removed:
- DEBUG_LOG.md (old debug timeline, no longer relevant)
- REWRITE_PLAN.md (plan is now in README)
- test_gemm.py (C++ extension test)

Added:
- vllm/nvfp4_cutedsl.py: CuTeDSLMoERunner class for vLLM integration
  - Replaces nvfp4_mega_moe_full + SymmBuffer with CuTeDSL kernel
  - Handles slot-based routing, L1→SiLU→L2→scatter
  - prepare_weights_from_dequantized() for weight prep

Tagged the-last-of-cutlass on the old C++ kernel state.

											
										
										
											2026-05-16 03:33:16 +00:00
+								## How We Got Here
-												feat: CUTLASS NVFP4 mega_moe kernel — slot-based L1/L2, source-first SF remap

Major changes from initial TileLang prototype:

Kernel:
- CUTLASS NVFP4 block-scaled GEMM (SM100 Blackwell, OpClassBlockScaledTensorOp)
- Slot-based dispatch: L1 GEMM → SiLU+Mul per-slot → L2 GEMM → index_add scatter
- 1D slot_expert_ids passed to both L1 and L2 (no 2D topk_ids rebuild)
- slot_token gathered in cutlass_grouped_nvfp4_gemm when provided

SF Remap (source-first):
- Iterates logical (m, k_sf) source grid, uses layout_sf(make_coord(m, k_sf))
  for CUTLASS dest index — no idx2crd/flatten coordinate extraction
- 2D kernel launch: dim3 block(32,8), grid over (K_sf, MN)
- Uses cute::cosize() for physical allocation size (not cute::size)
- SFA: (MN, K_sf) row-major; SFB: (K_sf, MN) row-major (col-major)

Weight transform:
- UE4M3 unpack with bit reinterpret (not value cast)
- Global scale folding (weight_scale_2) for gate/up split
- clamp(0,448) → float8_e4m3fn, transpose (N,K)→(K,N) for CUTLASS

No prepack cache:
- SFB remapped per-call inside CUTLASS (~µs, not the bottleneck)
- See README for why prepack cache must never return (OOM, CUDA graphs,
  M-dependent layout, cross-layer collisions)

Stage activation:
- Nearest-neighbor E2M1 quantization (no clamp, no uniform steps)
- Per-tensor global scale → alpha for L2 GEMM

Bug fixes:
- _fold_global_scale: removed broken logical_widths branch
- unpack_ue4m3_u32: int32 for CUDA bitwise, view not to, ND support
- Correct expert param mapping for NVFP4 checkpoint
- SiLU applied per-slot (not after summing expert paths)

											
										
										
											2026-05-15 11:38:18 +00:00
-												docs: rewrite README, nuke DEBUG_LOG, add vLLM integration stub

README.md: full rewrite explaining how we got here, project structure,
plan, and key lessons learned from the C++ CUTLASS disaster.

Removed:
- DEBUG_LOG.md (old debug timeline, no longer relevant)
- REWRITE_PLAN.md (plan is now in README)
- test_gemm.py (C++ extension test)

Added:
- vllm/nvfp4_cutedsl.py: CuTeDSLMoERunner class for vLLM integration
  - Replaces nvfp4_mega_moe_full + SymmBuffer with CuTeDSL kernel
  - Handles slot-based routing, L1→SiLU→L2→scatter
  - prepare_weights_from_dequantized() for weight prep

Tagged the-last-of-cutlass on the old C++ kernel state.

											
										
										
											2026-05-16 03:33:16 +00:00
+								### The C++ CUTLASS Kernel Was Broken
-												feat: CUTLASS NVFP4 mega_moe kernel — slot-based L1/L2, source-first SF remap

Major changes from initial TileLang prototype:

Kernel:
- CUTLASS NVFP4 block-scaled GEMM (SM100 Blackwell, OpClassBlockScaledTensorOp)
- Slot-based dispatch: L1 GEMM → SiLU+Mul per-slot → L2 GEMM → index_add scatter
- 1D slot_expert_ids passed to both L1 and L2 (no 2D topk_ids rebuild)
- slot_token gathered in cutlass_grouped_nvfp4_gemm when provided

SF Remap (source-first):
- Iterates logical (m, k_sf) source grid, uses layout_sf(make_coord(m, k_sf))
  for CUTLASS dest index — no idx2crd/flatten coordinate extraction
- 2D kernel launch: dim3 block(32,8), grid over (K_sf, MN)
- Uses cute::cosize() for physical allocation size (not cute::size)
- SFA: (MN, K_sf) row-major; SFB: (K_sf, MN) row-major (col-major)

Weight transform:
- UE4M3 unpack with bit reinterpret (not value cast)
- Global scale folding (weight_scale_2) for gate/up split
- clamp(0,448) → float8_e4m3fn, transpose (N,K)→(K,N) for CUTLASS

No prepack cache:
- SFB remapped per-call inside CUTLASS (~µs, not the bottleneck)
- See README for why prepack cache must never return (OOM, CUDA graphs,
  M-dependent layout, cross-layer collisions)

Stage activation:
- Nearest-neighbor E2M1 quantization (no clamp, no uniform steps)
- Per-tensor global scale → alpha for L2 GEMM

Bug fixes:
- _fold_global_scale: removed broken logical_widths branch
- unpack_ue4m3_u32: int32 for CUDA bitwise, view not to, ND support
- Correct expert param mapping for NVFP4 checkpoint
- SiLU applied per-slot (not after summing expert paths)

											
										
										
											2026-05-15 11:38:18 +00:00
-												docs: rewrite README, nuke DEBUG_LOG, add vLLM integration stub

README.md: full rewrite explaining how we got here, project structure,
plan, and key lessons learned from the C++ CUTLASS disaster.

Removed:
- DEBUG_LOG.md (old debug timeline, no longer relevant)
- REWRITE_PLAN.md (plan is now in README)
- test_gemm.py (C++ extension test)

Added:
- vllm/nvfp4_cutedsl.py: CuTeDSLMoERunner class for vLLM integration
  - Replaces nvfp4_mega_moe_full + SymmBuffer with CuTeDSL kernel
  - Handles slot-based routing, L1→SiLU→L2→scatter
  - prepare_weights_from_dequantized() for weight prep

Tagged the-last-of-cutlass on the old C++ kernel state.

											
										
										
											2026-05-16 03:33:16 +00:00
+								The original kernel was a C++ `.cu` file using CUTLASS's C++ API directly. It passed all the simple tests (uniform data → exact output, SF remap verifier → 0 errors) but produced **cosine 0.05** with real random data. After weeks of debugging the SF remap (8+ iterations, all producing the same 0.2 cosine against a wrong reference), we discovered:
-												feat: CUTLASS NVFP4 mega_moe kernel — slot-based L1/L2, source-first SF remap

Major changes from initial TileLang prototype:

Kernel:
- CUTLASS NVFP4 block-scaled GEMM (SM100 Blackwell, OpClassBlockScaledTensorOp)
- Slot-based dispatch: L1 GEMM → SiLU+Mul per-slot → L2 GEMM → index_add scatter
- 1D slot_expert_ids passed to both L1 and L2 (no 2D topk_ids rebuild)
- slot_token gathered in cutlass_grouped_nvfp4_gemm when provided

SF Remap (source-first):
- Iterates logical (m, k_sf) source grid, uses layout_sf(make_coord(m, k_sf))
  for CUTLASS dest index — no idx2crd/flatten coordinate extraction
- 2D kernel launch: dim3 block(32,8), grid over (K_sf, MN)
- Uses cute::cosize() for physical allocation size (not cute::size)
- SFA: (MN, K_sf) row-major; SFB: (K_sf, MN) row-major (col-major)

Weight transform:
- UE4M3 unpack with bit reinterpret (not value cast)
- Global scale folding (weight_scale_2) for gate/up split
- clamp(0,448) → float8_e4m3fn, transpose (N,K)→(K,N) for CUTLASS

No prepack cache:
- SFB remapped per-call inside CUTLASS (~µs, not the bottleneck)
- See README for why prepack cache must never return (OOM, CUDA graphs,
  M-dependent layout, cross-layer collisions)

Stage activation:
- Nearest-neighbor E2M1 quantization (no clamp, no uniform steps)
- Per-tensor global scale → alpha for L2 GEMM

Bug fixes:
- _fold_global_scale: removed broken logical_widths branch
- unpack_ue4m3_u32: int32 for CUDA bitwise, view not to, ND support
- Correct expert param mapping for NVFP4 checkpoint
- SiLU applied per-slot (not after summing expert paths)

											
										
										
											2026-05-15 11:38:18 +00:00
-												docs: rewrite README, nuke DEBUG_LOG, add vLLM integration stub

README.md: full rewrite explaining how we got here, project structure,
plan, and key lessons learned from the C++ CUTLASS disaster.

Removed:
- DEBUG_LOG.md (old debug timeline, no longer relevant)
- REWRITE_PLAN.md (plan is now in README)
- test_gemm.py (C++ extension test)

Added:
- vllm/nvfp4_cutedsl.py: CuTeDSLMoERunner class for vLLM integration
  - Replaces nvfp4_mega_moe_full + SymmBuffer with CuTeDSL kernel
  - Handles slot-based routing, L1→SiLU→L2→scatter
  - prepare_weights_from_dequantized() for weight prep

Tagged the-last-of-cutlass on the old C++ kernel state.

											
										
										
											2026-05-16 03:33:16 +00:00
+. **The BF16 reference comparison was wrong** — our Python dequantization didn't match CUTLASS's internal FP4 handling. A wrong reference is worse than no reference. We chased ghosts through 8+ SF remap rewrites because the 0.2 cosine was never about the remap.
-												feat: CUTLASS NVFP4 mega_moe kernel — slot-based L1/L2, source-first SF remap

Major changes from initial TileLang prototype:

Kernel:
- CUTLASS NVFP4 block-scaled GEMM (SM100 Blackwell, OpClassBlockScaledTensorOp)
- Slot-based dispatch: L1 GEMM → SiLU+Mul per-slot → L2 GEMM → index_add scatter
- 1D slot_expert_ids passed to both L1 and L2 (no 2D topk_ids rebuild)
- slot_token gathered in cutlass_grouped_nvfp4_gemm when provided

SF Remap (source-first):
- Iterates logical (m, k_sf) source grid, uses layout_sf(make_coord(m, k_sf))
  for CUTLASS dest index — no idx2crd/flatten coordinate extraction
- 2D kernel launch: dim3 block(32,8), grid over (K_sf, MN)
- Uses cute::cosize() for physical allocation size (not cute::size)
- SFA: (MN, K_sf) row-major; SFB: (K_sf, MN) row-major (col-major)

Weight transform:
- UE4M3 unpack with bit reinterpret (not value cast)
- Global scale folding (weight_scale_2) for gate/up split
- clamp(0,448) → float8_e4m3fn, transpose (N,K)→(K,N) for CUTLASS

No prepack cache:
- SFB remapped per-call inside CUTLASS (~µs, not the bottleneck)
- See README for why prepack cache must never return (OOM, CUDA graphs,
  M-dependent layout, cross-layer collisions)

Stage activation:
- Nearest-neighbor E2M1 quantization (no clamp, no uniform steps)
- Per-tensor global scale → alpha for L2 GEMM

Bug fixes:
- _fold_global_scale: removed broken logical_widths branch
- unpack_ue4m3_u32: int32 for CUDA bitwise, view not to, ND support
- Correct expert param mapping for NVFP4 checkpoint
- SiLU applied per-slot (not after summing expert paths)

											
										
										
											2026-05-15 11:38:18 +00:00
-												docs: rewrite README, nuke DEBUG_LOG, add vLLM integration stub

README.md: full rewrite explaining how we got here, project structure,
plan, and key lessons learned from the C++ CUTLASS disaster.

Removed:
- DEBUG_LOG.md (old debug timeline, no longer relevant)
- REWRITE_PLAN.md (plan is now in README)
- test_gemm.py (C++ extension test)

Added:
- vllm/nvfp4_cutedsl.py: CuTeDSLMoERunner class for vLLM integration
  - Replaces nvfp4_mega_moe_full + SymmBuffer with CuTeDSL kernel
  - Handles slot-based routing, L1→SiLU→L2→scatter
  - prepare_weights_from_dequantized() for weight prep

Tagged the-last-of-cutlass on the old C++ kernel state.

											
										
										
											2026-05-16 03:33:16 +00:00
+. **The C++ CUTLASS kernel misinterpreted FP4 data** — even with SF remap verified correct (0 byte errors), the GEMM produced garbage with non-uniform data. The issue was in how CUTLASS's C++ API handles FP4 packing/tiling internally — something we couldn't easily debug or fix.
-												feat: CUTLASS NVFP4 mega_moe kernel — slot-based L1/L2, source-first SF remap

Major changes from initial TileLang prototype:

Kernel:
- CUTLASS NVFP4 block-scaled GEMM (SM100 Blackwell, OpClassBlockScaledTensorOp)
- Slot-based dispatch: L1 GEMM → SiLU+Mul per-slot → L2 GEMM → index_add scatter
- 1D slot_expert_ids passed to both L1 and L2 (no 2D topk_ids rebuild)
- slot_token gathered in cutlass_grouped_nvfp4_gemm when provided

SF Remap (source-first):
- Iterates logical (m, k_sf) source grid, uses layout_sf(make_coord(m, k_sf))
  for CUTLASS dest index — no idx2crd/flatten coordinate extraction
- 2D kernel launch: dim3 block(32,8), grid over (K_sf, MN)
- Uses cute::cosize() for physical allocation size (not cute::size)
- SFA: (MN, K_sf) row-major; SFB: (K_sf, MN) row-major (col-major)

Weight transform:
- UE4M3 unpack with bit reinterpret (not value cast)
- Global scale folding (weight_scale_2) for gate/up split
- clamp(0,448) → float8_e4m3fn, transpose (N,K)→(K,N) for CUTLASS

No prepack cache:
- SFB remapped per-call inside CUTLASS (~µs, not the bottleneck)
- See README for why prepack cache must never return (OOM, CUDA graphs,
  M-dependent layout, cross-layer collisions)

Stage activation:
- Nearest-neighbor E2M1 quantization (no clamp, no uniform steps)
- Per-tensor global scale → alpha for L2 GEMM

Bug fixes:
- _fold_global_scale: removed broken logical_widths branch
- unpack_ue4m3_u32: int32 for CUDA bitwise, view not to, ND support
- Correct expert param mapping for NVFP4 checkpoint
- SiLU applied per-slot (not after summing expert paths)

											
										
										
											2026-05-15 11:38:18 +00:00
-												docs: rewrite README, nuke DEBUG_LOG, add vLLM integration stub

README.md: full rewrite explaining how we got here, project structure,
plan, and key lessons learned from the C++ CUTLASS disaster.

Removed:
- DEBUG_LOG.md (old debug timeline, no longer relevant)
- REWRITE_PLAN.md (plan is now in README)
- test_gemm.py (C++ extension test)

Added:
- vllm/nvfp4_cutedsl.py: CuTeDSLMoERunner class for vLLM integration
  - Replaces nvfp4_mega_moe_full + SymmBuffer with CuTeDSL kernel
  - Handles slot-based routing, L1→SiLU→L2→scatter
  - prepare_weights_from_dequantized() for weight prep

Tagged the-last-of-cutlass on the old C++ kernel state.

											
										
										
											2026-05-16 03:33:16 +00:00
+. **The checkpoint `input_scale` was a red herring** — we tried using the checkpoint's calibration scale as the activation normalization scale. It saturated all block scales to 448.0 (max float8). The `input_scale` is a calibration constant for alpha computation, not a normalization scale.
-												feat: CUTLASS NVFP4 mega_moe kernel — slot-based L1/L2, source-first SF remap

Major changes from initial TileLang prototype:

Kernel:
- CUTLASS NVFP4 block-scaled GEMM (SM100 Blackwell, OpClassBlockScaledTensorOp)
- Slot-based dispatch: L1 GEMM → SiLU+Mul per-slot → L2 GEMM → index_add scatter
- 1D slot_expert_ids passed to both L1 and L2 (no 2D topk_ids rebuild)
- slot_token gathered in cutlass_grouped_nvfp4_gemm when provided

SF Remap (source-first):
- Iterates logical (m, k_sf) source grid, uses layout_sf(make_coord(m, k_sf))
  for CUTLASS dest index — no idx2crd/flatten coordinate extraction
- 2D kernel launch: dim3 block(32,8), grid over (K_sf, MN)
- Uses cute::cosize() for physical allocation size (not cute::size)
- SFA: (MN, K_sf) row-major; SFB: (K_sf, MN) row-major (col-major)

Weight transform:
- UE4M3 unpack with bit reinterpret (not value cast)
- Global scale folding (weight_scale_2) for gate/up split
- clamp(0,448) → float8_e4m3fn, transpose (N,K)→(K,N) for CUTLASS

No prepack cache:
- SFB remapped per-call inside CUTLASS (~µs, not the bottleneck)
- See README for why prepack cache must never return (OOM, CUDA graphs,
  M-dependent layout, cross-layer collisions)

Stage activation:
- Nearest-neighbor E2M1 quantization (no clamp, no uniform steps)
- Per-tensor global scale → alpha for L2 GEMM

Bug fixes:
- _fold_global_scale: removed broken logical_widths branch
- unpack_ue4m3_u32: int32 for CUDA bitwise, view not to, ND support
- Correct expert param mapping for NVFP4 checkpoint
- SiLU applied per-slot (not after summing expert paths)

											
										
										
											2026-05-15 11:38:18 +00:00
-												docs: rewrite README, nuke DEBUG_LOG, add vLLM integration stub

README.md: full rewrite explaining how we got here, project structure,
plan, and key lessons learned from the C++ CUTLASS disaster.

Removed:
- DEBUG_LOG.md (old debug timeline, no longer relevant)
- REWRITE_PLAN.md (plan is now in README)
- test_gemm.py (C++ extension test)

Added:
- vllm/nvfp4_cutedsl.py: CuTeDSLMoERunner class for vLLM integration
  - Replaces nvfp4_mega_moe_full + SymmBuffer with CuTeDSL kernel
  - Handles slot-based routing, L1→SiLU→L2→scatter
  - prepare_weights_from_dequantized() for weight prep

Tagged the-last-of-cutlass on the old C++ kernel state.

											
										
										
											2026-05-16 03:33:16 +00:00
+								### The CuTeDSL Kernel Works
-												feat: CUTLASS NVFP4 mega_moe kernel — slot-based L1/L2, source-first SF remap

Major changes from initial TileLang prototype:

Kernel:
- CUTLASS NVFP4 block-scaled GEMM (SM100 Blackwell, OpClassBlockScaledTensorOp)
- Slot-based dispatch: L1 GEMM → SiLU+Mul per-slot → L2 GEMM → index_add scatter
- 1D slot_expert_ids passed to both L1 and L2 (no 2D topk_ids rebuild)
- slot_token gathered in cutlass_grouped_nvfp4_gemm when provided

SF Remap (source-first):
- Iterates logical (m, k_sf) source grid, uses layout_sf(make_coord(m, k_sf))
  for CUTLASS dest index — no idx2crd/flatten coordinate extraction
- 2D kernel launch: dim3 block(32,8), grid over (K_sf, MN)
- Uses cute::cosize() for physical allocation size (not cute::size)
- SFA: (MN, K_sf) row-major; SFB: (K_sf, MN) row-major (col-major)

Weight transform:
- UE4M3 unpack with bit reinterpret (not value cast)
- Global scale folding (weight_scale_2) for gate/up split
- clamp(0,448) → float8_e4m3fn, transpose (N,K)→(K,N) for CUTLASS

No prepack cache:
- SFB remapped per-call inside CUTLASS (~µs, not the bottleneck)
- See README for why prepack cache must never return (OOM, CUDA graphs,
  M-dependent layout, cross-layer collisions)

Stage activation:
- Nearest-neighbor E2M1 quantization (no clamp, no uniform steps)
- Per-tensor global scale → alpha for L2 GEMM

Bug fixes:
- _fold_global_scale: removed broken logical_widths branch
- unpack_ue4m3_u32: int32 for CUDA bitwise, view not to, ND support
- Correct expert param mapping for NVFP4 checkpoint
- SiLU applied per-slot (not after summing expert paths)

											
										
										
											2026-05-15 11:38:18 +00:00
-												docs: update README with cudagraph compatibility work and decisions

											
										
										
											2026-05-16 18:55:47 +00:00
+								NVIDIA's CuTeDSL approach (Python-based CUTLASS kernels compiled via MLML → PTX) is what the CUTLASS team recommends for Blackwell. Their official MoE scaled grouped GEMM example (`torch_scaled_grouped_mm.py`) supports NVFP4 out of the box. We adapted it.
-												feat: CUTLASS NVFP4 mega_moe kernel — slot-based L1/L2, source-first SF remap

Major changes from initial TileLang prototype:

Kernel:
- CUTLASS NVFP4 block-scaled GEMM (SM100 Blackwell, OpClassBlockScaledTensorOp)
- Slot-based dispatch: L1 GEMM → SiLU+Mul per-slot → L2 GEMM → index_add scatter
- 1D slot_expert_ids passed to both L1 and L2 (no 2D topk_ids rebuild)
- slot_token gathered in cutlass_grouped_nvfp4_gemm when provided

SF Remap (source-first):
- Iterates logical (m, k_sf) source grid, uses layout_sf(make_coord(m, k_sf))
  for CUTLASS dest index — no idx2crd/flatten coordinate extraction
- 2D kernel launch: dim3 block(32,8), grid over (K_sf, MN)
- Uses cute::cosize() for physical allocation size (not cute::size)
- SFA: (MN, K_sf) row-major; SFB: (K_sf, MN) row-major (col-major)

Weight transform:
- UE4M3 unpack with bit reinterpret (not value cast)
- Global scale folding (weight_scale_2) for gate/up split
- clamp(0,448) → float8_e4m3fn, transpose (N,K)→(K,N) for CUTLASS

No prepack cache:
- SFB remapped per-call inside CUTLASS (~µs, not the bottleneck)
- See README for why prepack cache must never return (OOM, CUDA graphs,
  M-dependent layout, cross-layer collisions)

Stage activation:
- Nearest-neighbor E2M1 quantization (no clamp, no uniform steps)
- Per-tensor global scale → alpha for L2 GEMM

Bug fixes:
- _fold_global_scale: removed broken logical_widths branch
- unpack_ue4m3_u32: int32 for CUDA bitwise, view not to, ND support
- Correct expert param mapping for NVFP4 checkpoint
- SiLU applied per-slot (not after summing expert paths)

											
										
										
											2026-05-15 11:38:18 +00:00
-												docs: rewrite README, nuke DEBUG_LOG, add vLLM integration stub

README.md: full rewrite explaining how we got here, project structure,
plan, and key lessons learned from the C++ CUTLASS disaster.

Removed:
- DEBUG_LOG.md (old debug timeline, no longer relevant)
- REWRITE_PLAN.md (plan is now in README)
- test_gemm.py (C++ extension test)

Added:
- vllm/nvfp4_cutedsl.py: CuTeDSLMoERunner class for vLLM integration
  - Replaces nvfp4_mega_moe_full + SymmBuffer with CuTeDSL kernel
  - Handles slot-based routing, L1→SiLU→L2→scatter
  - prepare_weights_from_dequantized() for weight prep

Tagged the-last-of-cutlass on the old C++ kernel state.

											
										
										
											2026-05-16 03:33:16 +00:00
+								**Results with real DeepSeek-V4 layer 0 weights:**
 								- L1 GEMM alone: cosine 0.995
 								- Full MoE pipeline (L1→SiLU→L2→scatter): cosine 0.989
-												feat: clean Dockerfile, docker-compose, import fixes for CuTeDSL build

Dockerfile:
- Removed: C++ CUTLASS extension build, TileLang install, CUTLASS clone
- Added: nvidia-cutlass-dsl==4.5.0 install, cutedsl/ copy
- Copy nvfp4_cutedsl.py to vllm models dir
- Verify step checks cutlass import

docker-compose.yml:
- Removed stale env vars (MEGA_MOE_DEBUG, MEGA_MOE_STATIC, etc.)

deepseek_v4.py:
- Fix import: vllm.nvfp4_cutedsl → vllm.model_executor.models.nvfp4_cutedsl

README.md:
- Updated results: 0% weight loss confirmed (bit-identical view-cast)
- 1.1% cosine loss is entirely from activation quantization

											
										
										
											2026-05-16 03:50:07 +00:00
+								- Weight loading: **0% loss** — direct uint8→float4_e2m1fn_x2 view-cast, bit-identical to checkpoint
 								- Activation quantization: ~1.1% cosine loss (dynamic BF16→NVFP4 — inherent to the format, unavoidable)
 								- GEMM kernel: 0% loss (CuTeDSL is correct)
-												feat: CUTLASS NVFP4 mega_moe kernel — slot-based L1/L2, source-first SF remap

Major changes from initial TileLang prototype:

Kernel:
- CUTLASS NVFP4 block-scaled GEMM (SM100 Blackwell, OpClassBlockScaledTensorOp)
- Slot-based dispatch: L1 GEMM → SiLU+Mul per-slot → L2 GEMM → index_add scatter
- 1D slot_expert_ids passed to both L1 and L2 (no 2D topk_ids rebuild)
- slot_token gathered in cutlass_grouped_nvfp4_gemm when provided

SF Remap (source-first):
- Iterates logical (m, k_sf) source grid, uses layout_sf(make_coord(m, k_sf))
  for CUTLASS dest index — no idx2crd/flatten coordinate extraction
- 2D kernel launch: dim3 block(32,8), grid over (K_sf, MN)
- Uses cute::cosize() for physical allocation size (not cute::size)
- SFA: (MN, K_sf) row-major; SFB: (K_sf, MN) row-major (col-major)

Weight transform:
- UE4M3 unpack with bit reinterpret (not value cast)
- Global scale folding (weight_scale_2) for gate/up split
- clamp(0,448) → float8_e4m3fn, transpose (N,K)→(K,N) for CUTLASS

No prepack cache:
- SFB remapped per-call inside CUTLASS (~µs, not the bottleneck)
- See README for why prepack cache must never return (OOM, CUDA graphs,
  M-dependent layout, cross-layer collisions)

Stage activation:
- Nearest-neighbor E2M1 quantization (no clamp, no uniform steps)
- Per-tensor global scale → alpha for L2 GEMM

Bug fixes:
- _fold_global_scale: removed broken logical_widths branch
- unpack_ue4m3_u32: int32 for CUDA bitwise, view not to, ND support
- Correct expert param mapping for NVFP4 checkpoint
- SiLU applied per-slot (not after summing expert paths)

											
										
										
											2026-05-15 11:38:18 +00:00
-												feat: clean Dockerfile, docker-compose, import fixes for CuTeDSL build

Dockerfile:
- Removed: C++ CUTLASS extension build, TileLang install, CUTLASS clone
- Added: nvidia-cutlass-dsl==4.5.0 install, cutedsl/ copy
- Copy nvfp4_cutedsl.py to vllm models dir
- Verify step checks cutlass import

docker-compose.yml:
- Removed stale env vars (MEGA_MOE_DEBUG, MEGA_MOE_STATIC, etc.)

deepseek_v4.py:
- Fix import: vllm.nvfp4_cutedsl → vllm.model_executor.models.nvfp4_cutedsl

README.md:
- Updated results: 0% weight loss confirmed (bit-identical view-cast)
- 1.1% cosine loss is entirely from activation quantization

											
										
										
											2026-05-16 03:50:07 +00:00
+								The 0.989 cosine is entirely from activation quantization. The weights are bit-identical to the checkpoint — no BF16 round-trip, no precision loss.
-												feat: CUTLASS NVFP4 mega_moe kernel — slot-based L1/L2, source-first SF remap

Major changes from initial TileLang prototype:

Kernel:
- CUTLASS NVFP4 block-scaled GEMM (SM100 Blackwell, OpClassBlockScaledTensorOp)
- Slot-based dispatch: L1 GEMM → SiLU+Mul per-slot → L2 GEMM → index_add scatter
- 1D slot_expert_ids passed to both L1 and L2 (no 2D topk_ids rebuild)
- slot_token gathered in cutlass_grouped_nvfp4_gemm when provided

SF Remap (source-first):
- Iterates logical (m, k_sf) source grid, uses layout_sf(make_coord(m, k_sf))
  for CUTLASS dest index — no idx2crd/flatten coordinate extraction
- 2D kernel launch: dim3 block(32,8), grid over (K_sf, MN)
- Uses cute::cosize() for physical allocation size (not cute::size)
- SFA: (MN, K_sf) row-major; SFB: (K_sf, MN) row-major (col-major)

Weight transform:
- UE4M3 unpack with bit reinterpret (not value cast)
- Global scale folding (weight_scale_2) for gate/up split
- clamp(0,448) → float8_e4m3fn, transpose (N,K)→(K,N) for CUTLASS

No prepack cache:
- SFB remapped per-call inside CUTLASS (~µs, not the bottleneck)
- See README for why prepack cache must never return (OOM, CUDA graphs,
  M-dependent layout, cross-layer collisions)

Stage activation:
- Nearest-neighbor E2M1 quantization (no clamp, no uniform steps)
- Per-tensor global scale → alpha for L2 GEMM

Bug fixes:
- _fold_global_scale: removed broken logical_widths branch
- unpack_ue4m3_u32: int32 for CUDA bitwise, view not to, ND support
- Correct expert param mapping for NVFP4 checkpoint
- SiLU applied per-slot (not after summing expert paths)

											
										
										
											2026-05-15 11:38:18 +00:00
-												docs: update README with full NVFP4 coverage, dequant anti-pattern, v2 status

- Added NVFP4 coverage table (what's native, what's converted, why)
- Documented the dequant→requant anti-pattern that caused vLLM hangs
- Updated plan: Phase 2 done, Phase 3 targets remaining conversions
- Removed stale REWRITE_PLAN reference
- Updated project structure (nvfp4_cutedsl.py, removed old refs)

											
										
										
											2026-05-16 05:43:33 +00:00
+								### The Dequant→Requant Anti-Pattern
 								Early versions dequantized all NVFP4 weights to BF16, then let vLLM's `FlashInferCutlassNvFp4LinearKernel` requantize them back to NVFP4 at inference time. This:
 								- Wasted 5 minutes on load doing NVFP4→BF16 conversion
 								- Lost precision on the double round-trip
 								- Caused vLLM to hang — the NVFP4 attention kernel expects native NVFP4 weights, not BF16 weights with an NVFP4 quant_method attached
 								The fix: **keep everything in NVFP4**. The checkpoint stores NVFP4. The kernels consume NVFP4. No conversion needed.
-												docs: update README with cudagraph compatibility work and decisions

											
										
										
											2026-05-16 18:55:47 +00:00
+								### CUDAGraph Compatibility
 								vLLM uses CUDA graphs to eliminate kernel launch overhead in the decode path. CUDA graphs record the entire forward pass once, then replay it — but they require **fixed tensor shapes, fixed memory addresses, and zero CPU-GPU syncs**.
 								Our original runner was not cudagraph-safe. We had to fix several classes of issues:
 								#### 1. CPU↔CUDA Tensor Copies
 								`torch.tensor([0,1,...], device=x.device)` creates the tensor on **CPU first**, then copies to CUDA. This copy is forbidden during graph capture. The fix: **cache tensors per device** on first use, outside the graph.
 								```python
 								# BAD — CPU→CUDA copy inside graph
 								step_to_idx = torch.tensor([0,1,2,3,4,4,5,5,6,6,6,7,7], device=x.device)
 								# GOOD — cached on first use, reused in graph
 								step_to_idx = _get_step_to_idx_lut(x.device)  # returns cached CUDA tensor
 								```
 								Similarly, `torch.zeros` and `torch.rand` don't support `float4_e2m1fn_x2` or `float8_e4m3fn` dtypes. The fix: create as `uint8` or `float16`, then `.view()` or `.to()` the target dtype.
 								#### 2. GPU Scalar Slicing
 								`buf[:gpu_scalar, :]` requires the runtime to query the GPU scalar's value to determine the output shape. This triggers an implicit CPU-GPU sync, which invalidates the graph. The fix: **always use full pre-allocated buffers**. Extra rows are zeros that contribute nothing to the computation.
 								```python
 								# BAD — GPU scalar as slice index (implicit sync)
 								total_padded_rows = padded_expert_offsets[-1]  # GPU scalar
 								padded_scales = buf[:total_padded_rows, :padded_cols]  # sync!
 								# GOOD — full pre-allocated buffer, zero out before use
 								padded_scales = self._padded_scales_buf  # always max size
 								padded_scales.zero_()
 								```
 								**Design decision:** Padding to max size wastes a few rows of compute on zero data, but:
 								- The extra rows are zeros → zero GEMM output → no accuracy impact
 								- GEMMs are memory-bandwidth bound → multiplying zeros is nearly free
 								- VRAM cost is negligible (~350KB for activation intermediates across all MoE layers)
 								- vLLM already does this everywhere (attention, FFN, etc.)
 								#### 3. Dynamic Output Allocation
 								`torch.zeros(num_tokens, ...)` inside the forward pass creates a new tensor each call. In cudagraph, new allocations are recorded and replayed — this works, but only if `num_tokens` is fixed (which it is, since vLLM captures at fixed token budgets).
 								#### Test Harness
 								`tests/cudagraph_test.py` validates cudagraph compatibility by:
 . Creating a runner with dummy weights
 . Running a warmup forward pass (triggers kernel compilation)
 . Attempting `torch.cuda.graph(g)` capture on the forward pass
 . If capture fails, patching `torch.cuda.synchronize`, `.item()`, `.tolist()`, `.cpu()` to detect exactly which syncs are happening
 								Run on the B200:
 								```bash
 								cd /root/nvfp4-megamoe-kernel
 								source tests/.venv/bin/activate
 								python3 tests/cudagraph_test.py
 								```
-												docs: rewrite README, nuke DEBUG_LOG, add vLLM integration stub

README.md: full rewrite explaining how we got here, project structure,
plan, and key lessons learned from the C++ CUTLASS disaster.

Removed:
- DEBUG_LOG.md (old debug timeline, no longer relevant)
- REWRITE_PLAN.md (plan is now in README)
- test_gemm.py (C++ extension test)

Added:
- vllm/nvfp4_cutedsl.py: CuTeDSLMoERunner class for vLLM integration
  - Replaces nvfp4_mega_moe_full + SymmBuffer with CuTeDSL kernel
  - Handles slot-based routing, L1→SiLU→L2→scatter
  - prepare_weights_from_dequantized() for weight prep

Tagged the-last-of-cutlass on the old C++ kernel state.

											
										
										
											2026-05-16 03:33:16 +00:00
+								### Key Lessons
-												feat: CUTLASS NVFP4 mega_moe kernel — slot-based L1/L2, source-first SF remap

Major changes from initial TileLang prototype:

Kernel:
- CUTLASS NVFP4 block-scaled GEMM (SM100 Blackwell, OpClassBlockScaledTensorOp)
- Slot-based dispatch: L1 GEMM → SiLU+Mul per-slot → L2 GEMM → index_add scatter
- 1D slot_expert_ids passed to both L1 and L2 (no 2D topk_ids rebuild)
- slot_token gathered in cutlass_grouped_nvfp4_gemm when provided

SF Remap (source-first):
- Iterates logical (m, k_sf) source grid, uses layout_sf(make_coord(m, k_sf))
  for CUTLASS dest index — no idx2crd/flatten coordinate extraction
- 2D kernel launch: dim3 block(32,8), grid over (K_sf, MN)
- Uses cute::cosize() for physical allocation size (not cute::size)
- SFA: (MN, K_sf) row-major; SFB: (K_sf, MN) row-major (col-major)

Weight transform:
- UE4M3 unpack with bit reinterpret (not value cast)
- Global scale folding (weight_scale_2) for gate/up split
- clamp(0,448) → float8_e4m3fn, transpose (N,K)→(K,N) for CUTLASS

No prepack cache:
- SFB remapped per-call inside CUTLASS (~µs, not the bottleneck)
- See README for why prepack cache must never return (OOM, CUDA graphs,
  M-dependent layout, cross-layer collisions)

Stage activation:
- Nearest-neighbor E2M1 quantization (no clamp, no uniform steps)
- Per-tensor global scale → alpha for L2 GEMM

Bug fixes:
- _fold_global_scale: removed broken logical_widths branch
- unpack_ue4m3_u32: int32 for CUDA bitwise, view not to, ND support
- Correct expert param mapping for NVFP4 checkpoint
- SiLU applied per-slot (not after summing expert paths)

											
										
										
											2026-05-15 11:38:18 +00:00
-												docs: rewrite README, nuke DEBUG_LOG, add vLLM integration stub

README.md: full rewrite explaining how we got here, project structure,
plan, and key lessons learned from the C++ CUTLASS disaster.

Removed:
- DEBUG_LOG.md (old debug timeline, no longer relevant)
- REWRITE_PLAN.md (plan is now in README)
- test_gemm.py (C++ extension test)

Added:
- vllm/nvfp4_cutedsl.py: CuTeDSLMoERunner class for vLLM integration
  - Replaces nvfp4_mega_moe_full + SymmBuffer with CuTeDSL kernel
  - Handles slot-based routing, L1→SiLU→L2→scatter
  - prepare_weights_from_dequantized() for weight prep

Tagged the-last-of-cutlass on the old C++ kernel state.

											
										
										
											2026-05-16 03:33:16 +00:00
+. **A wrong reference is worse than no reference** — the 0.2 cosine against a broken BF16 dequant sent us chasing SF remap bugs for weeks
 . **The C++ CUTLASS API is a footgun for FP4** — CuTeDSL handles tensor layouts, tiling, and SF construction correctly by construction
 . **Test with real data early** — uniform tests pass even with broken kernels; random data reveals real bugs
 . **Separate the GEMM from the pipeline** — our `layertest.py` runs without vLLM, Docker, or tensor parallelism. It caught the kernel bug that vLLM's integration layers masked.
-												docs: update README with full NVFP4 coverage, dequant anti-pattern, v2 status

- Added NVFP4 coverage table (what's native, what's converted, why)
- Documented the dequant→requant anti-pattern that caused vLLM hangs
- Updated plan: Phase 2 done, Phase 3 targets remaining conversions
- Removed stale REWRITE_PLAN reference
- Updated project structure (nvfp4_cutedsl.py, removed old refs)

											
										
										
											2026-05-16 05:43:33 +00:00
+. **Don't dequant what's already quantized** — if the kernel expects NVFP4 and the checkpoint is NVFP4, leave it alone. No BF16 round-trips.
-												docs: update README with cudagraph compatibility work and decisions

											
										
										
											2026-05-16 18:55:47 +00:00
+. **GPU scalar slicing is a silent cudagraph killer** — no error, no warning, just `cudaErrorStreamCaptureInvalidated` with no pointer to the cause. The test harness catches it.
-												feat: CUTLASS NVFP4 mega_moe kernel — slot-based L1/L2, source-first SF remap

Major changes from initial TileLang prototype:

Kernel:
- CUTLASS NVFP4 block-scaled GEMM (SM100 Blackwell, OpClassBlockScaledTensorOp)
- Slot-based dispatch: L1 GEMM → SiLU+Mul per-slot → L2 GEMM → index_add scatter
- 1D slot_expert_ids passed to both L1 and L2 (no 2D topk_ids rebuild)
- slot_token gathered in cutlass_grouped_nvfp4_gemm when provided

SF Remap (source-first):
- Iterates logical (m, k_sf) source grid, uses layout_sf(make_coord(m, k_sf))
  for CUTLASS dest index — no idx2crd/flatten coordinate extraction
- 2D kernel launch: dim3 block(32,8), grid over (K_sf, MN)
- Uses cute::cosize() for physical allocation size (not cute::size)
- SFA: (MN, K_sf) row-major; SFB: (K_sf, MN) row-major (col-major)

Weight transform:
- UE4M3 unpack with bit reinterpret (not value cast)
- Global scale folding (weight_scale_2) for gate/up split
- clamp(0,448) → float8_e4m3fn, transpose (N,K)→(K,N) for CUTLASS

No prepack cache:
- SFB remapped per-call inside CUTLASS (~µs, not the bottleneck)
- See README for why prepack cache must never return (OOM, CUDA graphs,
  M-dependent layout, cross-layer collisions)

Stage activation:
- Nearest-neighbor E2M1 quantization (no clamp, no uniform steps)
- Per-tensor global scale → alpha for L2 GEMM

Bug fixes:
- _fold_global_scale: removed broken logical_widths branch
- unpack_ue4m3_u32: int32 for CUDA bitwise, view not to, ND support
- Correct expert param mapping for NVFP4 checkpoint
- SiLU applied per-slot (not after summing expert paths)

											
										
										
											2026-05-15 11:38:18 +00:00
-												docs: rewrite README, nuke DEBUG_LOG, add vLLM integration stub

README.md: full rewrite explaining how we got here, project structure,
plan, and key lessons learned from the C++ CUTLASS disaster.

Removed:
- DEBUG_LOG.md (old debug timeline, no longer relevant)
- REWRITE_PLAN.md (plan is now in README)
- test_gemm.py (C++ extension test)

Added:
- vllm/nvfp4_cutedsl.py: CuTeDSLMoERunner class for vLLM integration
  - Replaces nvfp4_mega_moe_full + SymmBuffer with CuTeDSL kernel
  - Handles slot-based routing, L1→SiLU→L2→scatter
  - prepare_weights_from_dequantized() for weight prep

Tagged the-last-of-cutlass on the old C++ kernel state.

											
										
										
											2026-05-16 03:33:16 +00:00
+								## Project Structure
-												feat: CUTLASS NVFP4 mega_moe kernel — slot-based L1/L2, source-first SF remap

Major changes from initial TileLang prototype:

Kernel:
- CUTLASS NVFP4 block-scaled GEMM (SM100 Blackwell, OpClassBlockScaledTensorOp)
- Slot-based dispatch: L1 GEMM → SiLU+Mul per-slot → L2 GEMM → index_add scatter
- 1D slot_expert_ids passed to both L1 and L2 (no 2D topk_ids rebuild)
- slot_token gathered in cutlass_grouped_nvfp4_gemm when provided

SF Remap (source-first):
- Iterates logical (m, k_sf) source grid, uses layout_sf(make_coord(m, k_sf))
  for CUTLASS dest index — no idx2crd/flatten coordinate extraction
- 2D kernel launch: dim3 block(32,8), grid over (K_sf, MN)
- Uses cute::cosize() for physical allocation size (not cute::size)
- SFA: (MN, K_sf) row-major; SFB: (K_sf, MN) row-major (col-major)

Weight transform:
- UE4M3 unpack with bit reinterpret (not value cast)
- Global scale folding (weight_scale_2) for gate/up split
- clamp(0,448) → float8_e4m3fn, transpose (N,K)→(K,N) for CUTLASS

No prepack cache:
- SFB remapped per-call inside CUTLASS (~µs, not the bottleneck)
- See README for why prepack cache must never return (OOM, CUDA graphs,
  M-dependent layout, cross-layer collisions)

Stage activation:
- Nearest-neighbor E2M1 quantization (no clamp, no uniform steps)
- Per-tensor global scale → alpha for L2 GEMM

Bug fixes:
- _fold_global_scale: removed broken logical_widths branch
- unpack_ue4m3_u32: int32 for CUDA bitwise, view not to, ND support
- Correct expert param mapping for NVFP4 checkpoint
- SiLU applied per-slot (not after summing expert paths)

											
										
										
											2026-05-15 11:38:18 +00:00
 								```
-												docs: rewrite README, nuke DEBUG_LOG, add vLLM integration stub

README.md: full rewrite explaining how we got here, project structure,
plan, and key lessons learned from the C++ CUTLASS disaster.

Removed:
- DEBUG_LOG.md (old debug timeline, no longer relevant)
- REWRITE_PLAN.md (plan is now in README)
- test_gemm.py (C++ extension test)

Added:
- vllm/nvfp4_cutedsl.py: CuTeDSLMoERunner class for vLLM integration
  - Replaces nvfp4_mega_moe_full + SymmBuffer with CuTeDSL kernel
  - Handles slot-based routing, L1→SiLU→L2→scatter
  - prepare_weights_from_dequantized() for weight prep

Tagged the-last-of-cutlass on the old C++ kernel state.

											
										
										
											2026-05-16 03:33:16 +00:00
+								nvfp4-megamoe-kernel/
 								├── cutedsl/                          # CuTeDSL kernel + bridge layer
 								│   ├── bridge.py                     # Tensor layout conversion, quantization, kernel launch
 								│   ├── moe_pipeline.py              # Full MoE pipeline (L1→SiLU→L2→scatter)
 								│   └── kernel/moe/                   # NVIDIA's ScaledGroupedGemmKernel (untouched)
 								│       ├── torch_scaled_grouped_mm.py   # The working kernel (3900 lines)
 								│       ├── moe_utils.py
-												docs: update README with full NVFP4 coverage, dequant anti-pattern, v2 status

- Added NVFP4 coverage table (what's native, what's converted, why)
- Documented the dequant→requant anti-pattern that caused vLLM hangs
- Updated plan: Phase 2 done, Phase 3 targets remaining conversions
- Removed stale REWRITE_PLAN reference
- Updated project structure (nvfp4_cutedsl.py, removed old refs)

											
										
										
											2026-05-16 05:43:33 +00:00
+								│       moe_persistent_scheduler.py
-												docs: rewrite README, nuke DEBUG_LOG, add vLLM integration stub

README.md: full rewrite explaining how we got here, project structure,
plan, and key lessons learned from the C++ CUTLASS disaster.

Removed:
- DEBUG_LOG.md (old debug timeline, no longer relevant)
- REWRITE_PLAN.md (plan is now in README)
- test_gemm.py (C++ extension test)

Added:
- vllm/nvfp4_cutedsl.py: CuTeDSLMoERunner class for vLLM integration
  - Replaces nvfp4_mega_moe_full + SymmBuffer with CuTeDSL kernel
  - Handles slot-based routing, L1→SiLU→L2→scatter
  - prepare_weights_from_dequantized() for weight prep

Tagged the-last-of-cutlass on the old C++ kernel state.

											
										
										
											2026-05-16 03:33:16 +00:00
+								│       └── moe_sched_extension.py
 								├── vllm/                             # vLLM integration
-												docs: update README with cudagraph compatibility work and decisions

											
										
										
											2026-05-16 18:55:47 +00:00
+								│   ├── nvfp4_cutedsl.py             # CuTeDSLMoERunner — cudagraph-safe MoE kernel interface
-												docs: rewrite README, nuke DEBUG_LOG, add vLLM integration stub

README.md: full rewrite explaining how we got here, project structure,
plan, and key lessons learned from the C++ CUTLASS disaster.

Removed:
- DEBUG_LOG.md (old debug timeline, no longer relevant)
- REWRITE_PLAN.md (plan is now in README)
- test_gemm.py (C++ extension test)

Added:
- vllm/nvfp4_cutedsl.py: CuTeDSLMoERunner class for vLLM integration
  - Replaces nvfp4_mega_moe_full + SymmBuffer with CuTeDSL kernel
  - Handles slot-based routing, L1→SiLU→L2→scatter
  - prepare_weights_from_dequantized() for weight prep

Tagged the-last-of-cutlass on the old C++ kernel state.

											
										
										
											2026-05-16 03:33:16 +00:00
+								│   └── patches/
-												docs: update README with full NVFP4 coverage, dequant anti-pattern, v2 status

- Added NVFP4 coverage table (what's native, what's converted, why)
- Documented the dequant→requant anti-pattern that caused vLLM hangs
- Updated plan: Phase 2 done, Phase 3 targets remaining conversions
- Removed stale REWRITE_PLAN reference
- Updated project structure (nvfp4_cutedsl.py, removed old refs)

											
										
										
											2026-05-16 05:43:33 +00:00
+								│       ├── deepseek_v4.py           # DeepSeek-V4 model patch (NVFP4 native)
 								│       └── deepseek_v4_attention.py # Attention patch (NVFP4 native)
-												docs: rewrite README, nuke DEBUG_LOG, add vLLM integration stub

README.md: full rewrite explaining how we got here, project structure,
plan, and key lessons learned from the C++ CUTLASS disaster.

Removed:
- DEBUG_LOG.md (old debug timeline, no longer relevant)
- REWRITE_PLAN.md (plan is now in README)
- test_gemm.py (C++ extension test)

Added:
- vllm/nvfp4_cutedsl.py: CuTeDSLMoERunner class for vLLM integration
  - Replaces nvfp4_mega_moe_full + SymmBuffer with CuTeDSL kernel
  - Handles slot-based routing, L1→SiLU→L2→scatter
  - prepare_weights_from_dequantized() for weight prep

Tagged the-last-of-cutlass on the old C++ kernel state.

											
										
										
											2026-05-16 03:33:16 +00:00
+								├── tests/
-												docs: update README with cudagraph compatibility work and decisions

											
										
										
											2026-05-16 18:55:47 +00:00
+								│   ├── cudagraph_test.py            # CUDAGraph compatibility test (✅ PASS)
-												docs: rewrite README, nuke DEBUG_LOG, add vLLM integration stub

README.md: full rewrite explaining how we got here, project structure,
plan, and key lessons learned from the C++ CUTLASS disaster.

Removed:
- DEBUG_LOG.md (old debug timeline, no longer relevant)
- REWRITE_PLAN.md (plan is now in README)
- test_gemm.py (C++ extension test)

Added:
- vllm/nvfp4_cutedsl.py: CuTeDSLMoERunner class for vLLM integration
  - Replaces nvfp4_mega_moe_full + SymmBuffer with CuTeDSL kernel
  - Handles slot-based routing, L1→SiLU→L2→scatter
  - prepare_weights_from_dequantized() for weight prep

Tagged the-last-of-cutlass on the old C++ kernel state.

											
										
										
											2026-05-16 03:33:16 +00:00
+								│   ├── layertest.py                 # Layer 0 comparison: CuTeDSL vs BF16 (✅ cosine 0.989)
 								│   ├── test_cutedsl.py              # Small standalone CuTeDSL test (✅ cosine 0.991)
 								│   ├── test_uniform_fp4.py          # Uniform data GEMM test
 								│   ├── test_b_layout.py             # B matrix column layout test
 								│   └── test_quick_rand.py           # Quick random GEMM sanity check
-												docs: update README with full NVFP4 coverage, dequant anti-pattern, v2 status

- Added NVFP4 coverage table (what's native, what's converted, why)
- Documented the dequant→requant anti-pattern that caused vLLM hangs
- Updated plan: Phase 2 done, Phase 3 targets remaining conversions
- Removed stale REWRITE_PLAN reference
- Updated project structure (nvfp4_cutedsl.py, removed old refs)

											
										
										
											2026-05-16 05:43:33 +00:00
+								└── reference/                        # Reference files for study
-												Initial: TileLang NVFP4 mega_moe kernel package

- nvfp4_mega_moe_full: drop-in replacement for deep_gemm.mega.fp8_nvfp4_mega_moe
- transform_nvfp4_weights_for_mega_moe: weight transformation (tested)
- SymmBuffer + get_symm_buffer_for_nvfp4_mega_moe: API-matching stubs
- MEGA_MOE_STATIC=1 support for pipeline testing
- pyproject.toml for pip install

											
										
										
											2026-05-13 15:44:51 +00:00
+								```
-												feat: CUTLASS NVFP4 mega_moe kernel — slot-based L1/L2, source-first SF remap

Major changes from initial TileLang prototype:

Kernel:
- CUTLASS NVFP4 block-scaled GEMM (SM100 Blackwell, OpClassBlockScaledTensorOp)
- Slot-based dispatch: L1 GEMM → SiLU+Mul per-slot → L2 GEMM → index_add scatter
- 1D slot_expert_ids passed to both L1 and L2 (no 2D topk_ids rebuild)
- slot_token gathered in cutlass_grouped_nvfp4_gemm when provided

SF Remap (source-first):
- Iterates logical (m, k_sf) source grid, uses layout_sf(make_coord(m, k_sf))
  for CUTLASS dest index — no idx2crd/flatten coordinate extraction
- 2D kernel launch: dim3 block(32,8), grid over (K_sf, MN)
- Uses cute::cosize() for physical allocation size (not cute::size)
- SFA: (MN, K_sf) row-major; SFB: (K_sf, MN) row-major (col-major)

Weight transform:
- UE4M3 unpack with bit reinterpret (not value cast)
- Global scale folding (weight_scale_2) for gate/up split
- clamp(0,448) → float8_e4m3fn, transpose (N,K)→(K,N) for CUTLASS

No prepack cache:
- SFB remapped per-call inside CUTLASS (~µs, not the bottleneck)
- See README for why prepack cache must never return (OOM, CUDA graphs,
  M-dependent layout, cross-layer collisions)

Stage activation:
- Nearest-neighbor E2M1 quantization (no clamp, no uniform steps)
- Per-tensor global scale → alpha for L2 GEMM

Bug fixes:
- _fold_global_scale: removed broken logical_widths branch
- unpack_ue4m3_u32: int32 for CUDA bitwise, view not to, ND support
- Correct expert param mapping for NVFP4 checkpoint
- SiLU applied per-slot (not after summing expert paths)

											
										
										
											2026-05-15 11:38:18 +00:00
-												docs: rewrite README, nuke DEBUG_LOG, add vLLM integration stub

README.md: full rewrite explaining how we got here, project structure,
plan, and key lessons learned from the C++ CUTLASS disaster.

Removed:
- DEBUG_LOG.md (old debug timeline, no longer relevant)
- REWRITE_PLAN.md (plan is now in README)
- test_gemm.py (C++ extension test)

Added:
- vllm/nvfp4_cutedsl.py: CuTeDSLMoERunner class for vLLM integration
  - Replaces nvfp4_mega_moe_full + SymmBuffer with CuTeDSL kernel
  - Handles slot-based routing, L1→SiLU→L2→scatter
  - prepare_weights_from_dequantized() for weight prep

Tagged the-last-of-cutlass on the old C++ kernel state.

											
										
										
											2026-05-16 03:33:16 +00:00
+								## The Bridge Layer (`cutedsl/bridge.py`)
-												feat: CUTLASS NVFP4 mega_moe kernel — slot-based L1/L2, source-first SF remap

Major changes from initial TileLang prototype:

Kernel:
- CUTLASS NVFP4 block-scaled GEMM (SM100 Blackwell, OpClassBlockScaledTensorOp)
- Slot-based dispatch: L1 GEMM → SiLU+Mul per-slot → L2 GEMM → index_add scatter
- 1D slot_expert_ids passed to both L1 and L2 (no 2D topk_ids rebuild)
- slot_token gathered in cutlass_grouped_nvfp4_gemm when provided

SF Remap (source-first):
- Iterates logical (m, k_sf) source grid, uses layout_sf(make_coord(m, k_sf))
  for CUTLASS dest index — no idx2crd/flatten coordinate extraction
- 2D kernel launch: dim3 block(32,8), grid over (K_sf, MN)
- Uses cute::cosize() for physical allocation size (not cute::size)
- SFA: (MN, K_sf) row-major; SFB: (K_sf, MN) row-major (col-major)

Weight transform:
- UE4M3 unpack with bit reinterpret (not value cast)
- Global scale folding (weight_scale_2) for gate/up split
- clamp(0,448) → float8_e4m3fn, transpose (N,K)→(K,N) for CUTLASS

No prepack cache:
- SFB remapped per-call inside CUTLASS (~µs, not the bottleneck)
- See README for why prepack cache must never return (OOM, CUDA graphs,
  M-dependent layout, cross-layer collisions)

Stage activation:
- Nearest-neighbor E2M1 quantization (no clamp, no uniform steps)
- Per-tensor global scale → alpha for L2 GEMM

Bug fixes:
- _fold_global_scale: removed broken logical_widths branch
- unpack_ue4m3_u32: int32 for CUDA bitwise, view not to, ND support
- Correct expert param mapping for NVFP4 checkpoint
- SiLU applied per-slot (not after summing expert paths)

											
										
										
											2026-05-15 11:38:18 +00:00
-												docs: rewrite README, nuke DEBUG_LOG, add vLLM integration stub

README.md: full rewrite explaining how we got here, project structure,
plan, and key lessons learned from the C++ CUTLASS disaster.

Removed:
- DEBUG_LOG.md (old debug timeline, no longer relevant)
- REWRITE_PLAN.md (plan is now in README)
- test_gemm.py (C++ extension test)

Added:
- vllm/nvfp4_cutedsl.py: CuTeDSLMoERunner class for vLLM integration
  - Replaces nvfp4_mega_moe_full + SymmBuffer with CuTeDSL kernel
  - Handles slot-based routing, L1→SiLU→L2→scatter
  - prepare_weights_from_dequantized() for weight prep

Tagged the-last-of-cutlass on the old C++ kernel state.

											
										
										
											2026-05-16 03:33:16 +00:00
+								Handles all tensor layout conversion from our pipeline to what the CuTeDSL kernel expects:
-												feat: CUTLASS NVFP4 mega_moe kernel — slot-based L1/L2, source-first SF remap

Major changes from initial TileLang prototype:

Kernel:
- CUTLASS NVFP4 block-scaled GEMM (SM100 Blackwell, OpClassBlockScaledTensorOp)
- Slot-based dispatch: L1 GEMM → SiLU+Mul per-slot → L2 GEMM → index_add scatter
- 1D slot_expert_ids passed to both L1 and L2 (no 2D topk_ids rebuild)
- slot_token gathered in cutlass_grouped_nvfp4_gemm when provided

SF Remap (source-first):
- Iterates logical (m, k_sf) source grid, uses layout_sf(make_coord(m, k_sf))
  for CUTLASS dest index — no idx2crd/flatten coordinate extraction
- 2D kernel launch: dim3 block(32,8), grid over (K_sf, MN)
- Uses cute::cosize() for physical allocation size (not cute::size)
- SFA: (MN, K_sf) row-major; SFB: (K_sf, MN) row-major (col-major)

Weight transform:
- UE4M3 unpack with bit reinterpret (not value cast)
- Global scale folding (weight_scale_2) for gate/up split
- clamp(0,448) → float8_e4m3fn, transpose (N,K)→(K,N) for CUTLASS

No prepack cache:
- SFB remapped per-call inside CUTLASS (~µs, not the bottleneck)
- See README for why prepack cache must never return (OOM, CUDA graphs,
  M-dependent layout, cross-layer collisions)

Stage activation:
- Nearest-neighbor E2M1 quantization (no clamp, no uniform steps)
- Per-tensor global scale → alpha for L2 GEMM

Bug fixes:
- _fold_global_scale: removed broken logical_widths branch
- unpack_ue4m3_u32: int32 for CUDA bitwise, view not to, ND support
- Correct expert param mapping for NVFP4 checkpoint
- SiLU applied per-slot (not after summing expert paths)

											
										
										
											2026-05-15 11:38:18 +00:00
-												docs: rewrite README, nuke DEBUG_LOG, add vLLM integration stub

README.md: full rewrite explaining how we got here, project structure,
plan, and key lessons learned from the C++ CUTLASS disaster.

Removed:
- DEBUG_LOG.md (old debug timeline, no longer relevant)
- REWRITE_PLAN.md (plan is now in README)
- test_gemm.py (C++ extension test)

Added:
- vllm/nvfp4_cutedsl.py: CuTeDSLMoERunner class for vLLM integration
  - Replaces nvfp4_mega_moe_full + SymmBuffer with CuTeDSL kernel
  - Handles slot-based routing, L1→SiLU→L2→scatter
  - prepare_weights_from_dequantized() for weight prep

Tagged the-last-of-cutlass on the old C++ kernel state.

											
										
										
											2026-05-16 03:33:16 +00:00
+								| Function | What it does |
 								|----------|--------------|
-												docs: update README with cudagraph compatibility work and decisions

											
										
										
											2026-05-16 18:55:47 +00:00
+								| `quantize_activation_nvfp4()` | BF16 → float4_e2m1fn_x2 + float8_e4m3fn block scales (cudagraph-safe, no `.max()` sync) |
-												docs: rewrite README, nuke DEBUG_LOG, add vLLM integration stub

README.md: full rewrite explaining how we got here, project structure,
plan, and key lessons learned from the C++ CUTLASS disaster.

Removed:
- DEBUG_LOG.md (old debug timeline, no longer relevant)
- REWRITE_PLAN.md (plan is now in README)
- test_gemm.py (C++ extension test)

Added:
- vllm/nvfp4_cutedsl.py: CuTeDSLMoERunner class for vLLM integration
  - Replaces nvfp4_mega_moe_full + SymmBuffer with CuTeDSL kernel
  - Handles slot-based routing, L1→SiLU→L2→scatter
  - prepare_weights_from_dequantized() for weight prep

Tagged the-last-of-cutlass on the old C++ kernel state.

											
										
										
											2026-05-16 03:33:16 +00:00
+								| `quantize_weight_to_nvfp4()` | Same, but for weight matrices with K as the packed dimension |
 								| `assemble_scales_2d_side()` | Pad and swizzle activation scale factors (2Dx3D A side) |
 								| `assemble_scales_3d_side()` | Pad and swizzle weight scale factors (2Dx3D B side) |
 								| `make_b_k_major()` | Convert B tensor from N-major to K-major strides (required by kernel) |
-												docs: update README with cudagraph compatibility work and decisions

											
										
										
											2026-05-16 18:55:47 +00:00
+								| `run_nvfp4_grouped_gemm()` | Full kernel launch (compile + run, cudagraph-safe) |
-												feat: CUTLASS NVFP4 mega_moe kernel — slot-based L1/L2, source-first SF remap

Major changes from initial TileLang prototype:

Kernel:
- CUTLASS NVFP4 block-scaled GEMM (SM100 Blackwell, OpClassBlockScaledTensorOp)
- Slot-based dispatch: L1 GEMM → SiLU+Mul per-slot → L2 GEMM → index_add scatter
- 1D slot_expert_ids passed to both L1 and L2 (no 2D topk_ids rebuild)
- slot_token gathered in cutlass_grouped_nvfp4_gemm when provided

SF Remap (source-first):
- Iterates logical (m, k_sf) source grid, uses layout_sf(make_coord(m, k_sf))
  for CUTLASS dest index — no idx2crd/flatten coordinate extraction
- 2D kernel launch: dim3 block(32,8), grid over (K_sf, MN)
- Uses cute::cosize() for physical allocation size (not cute::size)
- SFA: (MN, K_sf) row-major; SFB: (K_sf, MN) row-major (col-major)

Weight transform:
- UE4M3 unpack with bit reinterpret (not value cast)
- Global scale folding (weight_scale_2) for gate/up split
- clamp(0,448) → float8_e4m3fn, transpose (N,K)→(K,N) for CUTLASS

No prepack cache:
- SFB remapped per-call inside CUTLASS (~µs, not the bottleneck)
- See README for why prepack cache must never return (OOM, CUDA graphs,
  M-dependent layout, cross-layer collisions)

Stage activation:
- Nearest-neighbor E2M1 quantization (no clamp, no uniform steps)
- Per-tensor global scale → alpha for L2 GEMM

Bug fixes:
- _fold_global_scale: removed broken logical_widths branch
- unpack_ue4m3_u32: int32 for CUDA bitwise, view not to, ND support
- Correct expert param mapping for NVFP4 checkpoint
- SiLU applied per-slot (not after summing expert paths)

											
										
										
											2026-05-15 11:38:18 +00:00
-												docs: rewrite README, nuke DEBUG_LOG, add vLLM integration stub

README.md: full rewrite explaining how we got here, project structure,
plan, and key lessons learned from the C++ CUTLASS disaster.

Removed:
- DEBUG_LOG.md (old debug timeline, no longer relevant)
- REWRITE_PLAN.md (plan is now in README)
- test_gemm.py (C++ extension test)

Added:
- vllm/nvfp4_cutedsl.py: CuTeDSLMoERunner class for vLLM integration
  - Replaces nvfp4_mega_moe_full + SymmBuffer with CuTeDSL kernel
  - Handles slot-based routing, L1→SiLU→L2→scatter
  - prepare_weights_from_dequantized() for weight prep

Tagged the-last-of-cutlass on the old C++ kernel state.

											
										
										
											2026-05-16 03:33:16 +00:00
+								## Running Tests
-												feat: CUTLASS NVFP4 mega_moe kernel — slot-based L1/L2, source-first SF remap

Major changes from initial TileLang prototype:

Kernel:
- CUTLASS NVFP4 block-scaled GEMM (SM100 Blackwell, OpClassBlockScaledTensorOp)
- Slot-based dispatch: L1 GEMM → SiLU+Mul per-slot → L2 GEMM → index_add scatter
- 1D slot_expert_ids passed to both L1 and L2 (no 2D topk_ids rebuild)
- slot_token gathered in cutlass_grouped_nvfp4_gemm when provided

SF Remap (source-first):
- Iterates logical (m, k_sf) source grid, uses layout_sf(make_coord(m, k_sf))
  for CUTLASS dest index — no idx2crd/flatten coordinate extraction
- 2D kernel launch: dim3 block(32,8), grid over (K_sf, MN)
- Uses cute::cosize() for physical allocation size (not cute::size)
- SFA: (MN, K_sf) row-major; SFB: (K_sf, MN) row-major (col-major)

Weight transform:
- UE4M3 unpack with bit reinterpret (not value cast)
- Global scale folding (weight_scale_2) for gate/up split
- clamp(0,448) → float8_e4m3fn, transpose (N,K)→(K,N) for CUTLASS

No prepack cache:
- SFB remapped per-call inside CUTLASS (~µs, not the bottleneck)
- See README for why prepack cache must never return (OOM, CUDA graphs,
  M-dependent layout, cross-layer collisions)

Stage activation:
- Nearest-neighbor E2M1 quantization (no clamp, no uniform steps)
- Per-tensor global scale → alpha for L2 GEMM

Bug fixes:
- _fold_global_scale: removed broken logical_widths branch
- unpack_ue4m3_u32: int32 for CUDA bitwise, view not to, ND support
- Correct expert param mapping for NVFP4 checkpoint
- SiLU applied per-slot (not after summing expert paths)

											
										
										
											2026-05-15 11:38:18 +00:00
-												docs: rewrite README, nuke DEBUG_LOG, add vLLM integration stub

README.md: full rewrite explaining how we got here, project structure,
plan, and key lessons learned from the C++ CUTLASS disaster.

Removed:
- DEBUG_LOG.md (old debug timeline, no longer relevant)
- REWRITE_PLAN.md (plan is now in README)
- test_gemm.py (C++ extension test)

Added:
- vllm/nvfp4_cutedsl.py: CuTeDSLMoERunner class for vLLM integration
  - Replaces nvfp4_mega_moe_full + SymmBuffer with CuTeDSL kernel
  - Handles slot-based routing, L1→SiLU→L2→scatter
  - prepare_weights_from_dequantized() for weight prep

Tagged the-last-of-cutlass on the old C++ kernel state.

											
										
										
											2026-05-16 03:33:16 +00:00
+								On the B200:
-												feat: CUTLASS NVFP4 mega_moe kernel — slot-based L1/L2, source-first SF remap

Major changes from initial TileLang prototype:

Kernel:
- CUTLASS NVFP4 block-scaled GEMM (SM100 Blackwell, OpClassBlockScaledTensorOp)
- Slot-based dispatch: L1 GEMM → SiLU+Mul per-slot → L2 GEMM → index_add scatter
- 1D slot_expert_ids passed to both L1 and L2 (no 2D topk_ids rebuild)
- slot_token gathered in cutlass_grouped_nvfp4_gemm when provided

SF Remap (source-first):
- Iterates logical (m, k_sf) source grid, uses layout_sf(make_coord(m, k_sf))
  for CUTLASS dest index — no idx2crd/flatten coordinate extraction
- 2D kernel launch: dim3 block(32,8), grid over (K_sf, MN)
- Uses cute::cosize() for physical allocation size (not cute::size)
- SFA: (MN, K_sf) row-major; SFB: (K_sf, MN) row-major (col-major)

Weight transform:
- UE4M3 unpack with bit reinterpret (not value cast)
- Global scale folding (weight_scale_2) for gate/up split
- clamp(0,448) → float8_e4m3fn, transpose (N,K)→(K,N) for CUTLASS

No prepack cache:
- SFB remapped per-call inside CUTLASS (~µs, not the bottleneck)
- See README for why prepack cache must never return (OOM, CUDA graphs,
  M-dependent layout, cross-layer collisions)

Stage activation:
- Nearest-neighbor E2M1 quantization (no clamp, no uniform steps)
- Per-tensor global scale → alpha for L2 GEMM

Bug fixes:
- _fold_global_scale: removed broken logical_widths branch
- unpack_ue4m3_u32: int32 for CUDA bitwise, view not to, ND support
- Correct expert param mapping for NVFP4 checkpoint
- SiLU applied per-slot (not after summing expert paths)

											
										
										
											2026-05-15 11:38:18 +00:00
 								```bash
-												docs: update README with cudagraph compatibility work and decisions

											
										
										
											2026-05-16 18:55:47 +00:00
+								cd /root/nvfp4-megamoe-kernel
 								source tests/.venv/bin/activate
 								# CUDAGraph compatibility test
 								python3 tests/cudagraph_test.py
-												feat: CUTLASS NVFP4 mega_moe kernel — slot-based L1/L2, source-first SF remap

Major changes from initial TileLang prototype:

Kernel:
- CUTLASS NVFP4 block-scaled GEMM (SM100 Blackwell, OpClassBlockScaledTensorOp)
- Slot-based dispatch: L1 GEMM → SiLU+Mul per-slot → L2 GEMM → index_add scatter
- 1D slot_expert_ids passed to both L1 and L2 (no 2D topk_ids rebuild)
- slot_token gathered in cutlass_grouped_nvfp4_gemm when provided

SF Remap (source-first):
- Iterates logical (m, k_sf) source grid, uses layout_sf(make_coord(m, k_sf))
  for CUTLASS dest index — no idx2crd/flatten coordinate extraction
- 2D kernel launch: dim3 block(32,8), grid over (K_sf, MN)
- Uses cute::cosize() for physical allocation size (not cute::size)
- SFA: (MN, K_sf) row-major; SFB: (K_sf, MN) row-major (col-major)

Weight transform:
- UE4M3 unpack with bit reinterpret (not value cast)
- Global scale folding (weight_scale_2) for gate/up split
- clamp(0,448) → float8_e4m3fn, transpose (N,K)→(K,N) for CUTLASS

No prepack cache:
- SFB remapped per-call inside CUTLASS (~µs, not the bottleneck)
- See README for why prepack cache must never return (OOM, CUDA graphs,
  M-dependent layout, cross-layer collisions)

Stage activation:
- Nearest-neighbor E2M1 quantization (no clamp, no uniform steps)
- Per-tensor global scale → alpha for L2 GEMM

Bug fixes:
- _fold_global_scale: removed broken logical_widths branch
- unpack_ue4m3_u32: int32 for CUDA bitwise, view not to, ND support
- Correct expert param mapping for NVFP4 checkpoint
- SiLU applied per-slot (not after summing expert paths)

											
										
										
											2026-05-15 11:38:18 +00:00
-												docs: rewrite README, nuke DEBUG_LOG, add vLLM integration stub

README.md: full rewrite explaining how we got here, project structure,
plan, and key lessons learned from the C++ CUTLASS disaster.

Removed:
- DEBUG_LOG.md (old debug timeline, no longer relevant)
- REWRITE_PLAN.md (plan is now in README)
- test_gemm.py (C++ extension test)

Added:
- vllm/nvfp4_cutedsl.py: CuTeDSLMoERunner class for vLLM integration
  - Replaces nvfp4_mega_moe_full + SymmBuffer with CuTeDSL kernel
  - Handles slot-based routing, L1→SiLU→L2→scatter
  - prepare_weights_from_dequantized() for weight prep

Tagged the-last-of-cutlass on the old C++ kernel state.

											
										
										
											2026-05-16 03:33:16 +00:00
+								# Small standalone test
-												docs: update README with cudagraph compatibility work and decisions

											
										
										
											2026-05-16 18:55:47 +00:00
+								python3 tests/test_cutedsl.py
-												feat: CUTLASS NVFP4 mega_moe kernel — slot-based L1/L2, source-first SF remap

Major changes from initial TileLang prototype:

Kernel:
- CUTLASS NVFP4 block-scaled GEMM (SM100 Blackwell, OpClassBlockScaledTensorOp)
- Slot-based dispatch: L1 GEMM → SiLU+Mul per-slot → L2 GEMM → index_add scatter
- 1D slot_expert_ids passed to both L1 and L2 (no 2D topk_ids rebuild)
- slot_token gathered in cutlass_grouped_nvfp4_gemm when provided

SF Remap (source-first):
- Iterates logical (m, k_sf) source grid, uses layout_sf(make_coord(m, k_sf))
  for CUTLASS dest index — no idx2crd/flatten coordinate extraction
- 2D kernel launch: dim3 block(32,8), grid over (K_sf, MN)
- Uses cute::cosize() for physical allocation size (not cute::size)
- SFA: (MN, K_sf) row-major; SFB: (K_sf, MN) row-major (col-major)

Weight transform:
- UE4M3 unpack with bit reinterpret (not value cast)
- Global scale folding (weight_scale_2) for gate/up split
- clamp(0,448) → float8_e4m3fn, transpose (N,K)→(K,N) for CUTLASS

No prepack cache:
- SFB remapped per-call inside CUTLASS (~µs, not the bottleneck)
- See README for why prepack cache must never return (OOM, CUDA graphs,
  M-dependent layout, cross-layer collisions)

Stage activation:
- Nearest-neighbor E2M1 quantization (no clamp, no uniform steps)
- Per-tensor global scale → alpha for L2 GEMM

Bug fixes:
- _fold_global_scale: removed broken logical_widths branch
- unpack_ue4m3_u32: int32 for CUDA bitwise, view not to, ND support
- Correct expert param mapping for NVFP4 checkpoint
- SiLU applied per-slot (not after summing expert paths)

											
										
										
											2026-05-15 11:38:18 +00:00
-												docs: rewrite README, nuke DEBUG_LOG, add vLLM integration stub

README.md: full rewrite explaining how we got here, project structure,
plan, and key lessons learned from the C++ CUTLASS disaster.

Removed:
- DEBUG_LOG.md (old debug timeline, no longer relevant)
- REWRITE_PLAN.md (plan is now in README)
- test_gemm.py (C++ extension test)

Added:
- vllm/nvfp4_cutedsl.py: CuTeDSLMoERunner class for vLLM integration
  - Replaces nvfp4_mega_moe_full + SymmBuffer with CuTeDSL kernel
  - Handles slot-based routing, L1→SiLU→L2→scatter
  - prepare_weights_from_dequantized() for weight prep

Tagged the-last-of-cutlass on the old C++ kernel state.

											
										
										
											2026-05-16 03:33:16 +00:00
+								# Full layer 0 comparison with real weights
-												docs: update README with cudagraph compatibility work and decisions

											
										
										
											2026-05-16 18:55:47 +00:00
+								python3 tests/layertest.py
-												docs: rewrite README, nuke DEBUG_LOG, add vLLM integration stub

README.md: full rewrite explaining how we got here, project structure,
plan, and key lessons learned from the C++ CUTLASS disaster.

Removed:
- DEBUG_LOG.md (old debug timeline, no longer relevant)
- REWRITE_PLAN.md (plan is now in README)
- test_gemm.py (C++ extension test)

Added:
- vllm/nvfp4_cutedsl.py: CuTeDSLMoERunner class for vLLM integration
  - Replaces nvfp4_mega_moe_full + SymmBuffer with CuTeDSL kernel
  - Handles slot-based routing, L1→SiLU→L2→scatter
  - prepare_weights_from_dequantized() for weight prep

Tagged the-last-of-cutlass on the old C++ kernel state.

											
										
										
											2026-05-16 03:33:16 +00:00
+								```
-												feat: CUTLASS NVFP4 mega_moe kernel — slot-based L1/L2, source-first SF remap

Major changes from initial TileLang prototype:

Kernel:
- CUTLASS NVFP4 block-scaled GEMM (SM100 Blackwell, OpClassBlockScaledTensorOp)
- Slot-based dispatch: L1 GEMM → SiLU+Mul per-slot → L2 GEMM → index_add scatter
- 1D slot_expert_ids passed to both L1 and L2 (no 2D topk_ids rebuild)
- slot_token gathered in cutlass_grouped_nvfp4_gemm when provided

SF Remap (source-first):
- Iterates logical (m, k_sf) source grid, uses layout_sf(make_coord(m, k_sf))
  for CUTLASS dest index — no idx2crd/flatten coordinate extraction
- 2D kernel launch: dim3 block(32,8), grid over (K_sf, MN)
- Uses cute::cosize() for physical allocation size (not cute::size)
- SFA: (MN, K_sf) row-major; SFB: (K_sf, MN) row-major (col-major)

Weight transform:
- UE4M3 unpack with bit reinterpret (not value cast)
- Global scale folding (weight_scale_2) for gate/up split
- clamp(0,448) → float8_e4m3fn, transpose (N,K)→(K,N) for CUTLASS

No prepack cache:
- SFB remapped per-call inside CUTLASS (~µs, not the bottleneck)
- See README for why prepack cache must never return (OOM, CUDA graphs,
  M-dependent layout, cross-layer collisions)

Stage activation:
- Nearest-neighbor E2M1 quantization (no clamp, no uniform steps)
- Per-tensor global scale → alpha for L2 GEMM

Bug fixes:
- _fold_global_scale: removed broken logical_widths branch
- unpack_ue4m3_u32: int32 for CUDA bitwise, view not to, ND support
- Correct expert param mapping for NVFP4 checkpoint
- SiLU applied per-slot (not after summing expert paths)

											
										
										
											2026-05-15 11:38:18 +00:00
-												docs: update README with full NVFP4 coverage, dequant anti-pattern, v2 status

- Added NVFP4 coverage table (what's native, what's converted, why)
- Documented the dequant→requant anti-pattern that caused vLLM hangs
- Updated plan: Phase 2 done, Phase 3 targets remaining conversions
- Removed stale REWRITE_PLAN reference
- Updated project structure (nvfp4_cutedsl.py, removed old refs)

											
										
										
											2026-05-16 05:43:33 +00:00
+								## NVFP4 Coverage
 								| Component | Format | Kernel | Conversion? |
 								|-----------|--------|--------|-------------|
 								| MoE experts (L1+L2) | NVFP4 native | CuTeDSL ScaledGroupedGemm | No — direct uint8→float4 view-cast |
 								| Shared experts | NVFP4 native | FlashInferCutlassNvFp4 | No — stays native |
 								| wq_b, wo_b, fused_wqa_wkv | NVFP4 native | FlashInferCutlassNvFp4 | No — stays native |
 								| wo_a | NVFP4 → FP8 | fp8_einsum | Yes — fp8_einsum requires FP8 |
 								| Compressor | NVFP4 → BF16 | torch.mm | Yes — weight_loader stacking issue |
 								| KV cache | FP8 | FlashInfer MLA | N/A — FP8 is optimal for KV cache |
-												docs: rewrite README, nuke DEBUG_LOG, add vLLM integration stub

README.md: full rewrite explaining how we got here, project structure,
plan, and key lessons learned from the C++ CUTLASS disaster.

Removed:
- DEBUG_LOG.md (old debug timeline, no longer relevant)
- REWRITE_PLAN.md (plan is now in README)
- test_gemm.py (C++ extension test)

Added:
- vllm/nvfp4_cutedsl.py: CuTeDSLMoERunner class for vLLM integration
  - Replaces nvfp4_mega_moe_full + SymmBuffer with CuTeDSL kernel
  - Handles slot-based routing, L1→SiLU→L2→scatter
  - prepare_weights_from_dequantized() for weight prep

Tagged the-last-of-cutlass on the old C++ kernel state.

											
										
										
											2026-05-16 03:33:16 +00:00
+								## Plan
 								### Phase 1: Kernel ✅ DONE
 								- CuTeDSL ScaledGroupedGemmKernel works with NVFP4
 								- Bridge layer handles all tensor layout conversion
 								- Full MoE pipeline (L1→SiLU→L2→scatter) produces cosine 0.989 vs BF16
-												docs: update README with full NVFP4 coverage, dequant anti-pattern, v2 status

- Added NVFP4 coverage table (what's native, what's converted, why)
- Documented the dequant→requant anti-pattern that caused vLLM hangs
- Updated plan: Phase 2 done, Phase 3 targets remaining conversions
- Removed stale REWRITE_PLAN reference
- Updated project structure (nvfp4_cutedsl.py, removed old refs)

											
										
										
											2026-05-16 05:43:33 +00:00
+								### Phase 2: vLLM Integration ✅ DONE
 								- CuTeDSLMoERunner wires CuTeDSL kernel into vLLM
 								- Weight loading: checkpoint uint8 → float4_e2m1fn_x2 view-cast (bit-preserving)
 								- Block scales (float8_e4m3fn) and global scales (float32) pass through directly
-												feat: clean Dockerfile, docker-compose, import fixes for CuTeDSL build

Dockerfile:
- Removed: C++ CUTLASS extension build, TileLang install, CUTLASS clone
- Added: nvidia-cutlass-dsl==4.5.0 install, cutedsl/ copy
- Copy nvfp4_cutedsl.py to vllm models dir
- Verify step checks cutlass import

docker-compose.yml:
- Removed stale env vars (MEGA_MOE_DEBUG, MEGA_MOE_STATIC, etc.)

deepseek_v4.py:
- Fix import: vllm.nvfp4_cutedsl → vllm.model_executor.models.nvfp4_cutedsl

README.md:
- Updated results: 0% weight loss confirmed (bit-identical view-cast)
- 1.1% cosine loss is entirely from activation quantization

											
										
										
											2026-05-16 03:50:07 +00:00
+								- L1 dual global scale handling: normalize to max(gate_gs, up_gs), fold ratio into block scales
-												docs: update README with full NVFP4 coverage, dequant anti-pattern, v2 status

- Added NVFP4 coverage table (what's native, what's converted, why)
- Documented the dequant→requant anti-pattern that caused vLLM hangs
- Updated plan: Phase 2 done, Phase 3 targets remaining conversions
- Removed stale REWRITE_PLAN reference
- Updated project structure (nvfp4_cutedsl.py, removed old refs)

											
										
										
											2026-05-16 05:43:33 +00:00
+								- Attention projections stay native NVFP4 (FlashInferCutlassNvFp4LinearKernel)
 								- CuTeDSL kernel warmup during model load (prevents RPC timeout)
 								- Removed all debug prints and env var gates from vLLM serving path
-												docs: rewrite README, nuke DEBUG_LOG, add vLLM integration stub

README.md: full rewrite explaining how we got here, project structure,
plan, and key lessons learned from the C++ CUTLASS disaster.

Removed:
- DEBUG_LOG.md (old debug timeline, no longer relevant)
- REWRITE_PLAN.md (plan is now in README)
- test_gemm.py (C++ extension test)

Added:
- vllm/nvfp4_cutedsl.py: CuTeDSLMoERunner class for vLLM integration
  - Replaces nvfp4_mega_moe_full + SymmBuffer with CuTeDSL kernel
  - Handles slot-based routing, L1→SiLU→L2→scatter
  - prepare_weights_from_dequantized() for weight prep

Tagged the-last-of-cutlass on the old C++ kernel state.

											
										
										
											2026-05-16 03:33:16 +00:00
-												docs: update README with cudagraph compatibility work and decisions

											
										
										
											2026-05-16 18:55:47 +00:00
+								### Phase 2.5: CUDAGraph Compatibility ✅ DONE
 								- CuTeDSLMoERunner is fully cudagraph-safe
 								- Zero CPU-GPU syncs, zero dynamic shapes, zero GPU scalar slicing
 								- All intermediate buffers pre-allocated at max_num_tokens * top_k
 								- `quantize_activation_nvfp4` uses cached LUT (no CPU→CUDA copy)
 								- `torch.zeros/rand` for float4/float8 → uint8→view or float16→cast
 								- Test harness validates capture + replay
 								- VRAM overhead: ~350KB (negligible)
 								- Compute overhead: zero rows through GEMM on padding (memory-bound, free)
-												docs: rewrite README, nuke DEBUG_LOG, add vLLM integration stub

README.md: full rewrite explaining how we got here, project structure,
plan, and key lessons learned from the C++ CUTLASS disaster.

Removed:
- DEBUG_LOG.md (old debug timeline, no longer relevant)
- REWRITE_PLAN.md (plan is now in README)
- test_gemm.py (C++ extension test)

Added:
- vllm/nvfp4_cutedsl.py: CuTeDSLMoERunner class for vLLM integration
  - Replaces nvfp4_mega_moe_full + SymmBuffer with CuTeDSL kernel
  - Handles slot-based routing, L1→SiLU→L2→scatter
  - prepare_weights_from_dequantized() for weight prep

Tagged the-last-of-cutlass on the old C++ kernel state.

											
										
										
											2026-05-16 03:33:16 +00:00
+								### Phase 3: Optimization
-												docs: update README with full NVFP4 coverage, dequant anti-pattern, v2 status

- Added NVFP4 coverage table (what's native, what's converted, why)
- Documented the dequant→requant anti-pattern that caused vLLM hangs
- Updated plan: Phase 2 done, Phase 3 targets remaining conversions
- Removed stale REWRITE_PLAN reference
- Updated project structure (nvfp4_cutedsl.py, removed old refs)

											
										
										
											2026-05-16 05:43:33 +00:00
+								- Replace wo_a FP8 conversion with native NVFP4 GEMM (eliminate last dequant)
 								- Fix compressor weight_loader so it stays NVFP4 native
-												docs: rewrite README, nuke DEBUG_LOG, add vLLM integration stub

README.md: full rewrite explaining how we got here, project structure,
plan, and key lessons learned from the C++ CUTLASS disaster.

Removed:
- DEBUG_LOG.md (old debug timeline, no longer relevant)
- REWRITE_PLAN.md (plan is now in README)
- test_gemm.py (C++ extension test)

Added:
- vllm/nvfp4_cutedsl.py: CuTeDSLMoERunner class for vLLM integration
  - Replaces nvfp4_mega_moe_full + SymmBuffer with CuTeDSL kernel
  - Handles slot-based routing, L1→SiLU→L2→scatter
  - prepare_weights_from_dequantized() for weight prep

Tagged the-last-of-cutlass on the old C++ kernel state.

											
										
										
											2026-05-16 03:33:16 +00:00
+								- Explore larger tile sizes for better occupancy
 								- Profile end-to-end inference on full model
 								### Phase 4: Production
-												docs: update README with full NVFP4 coverage, dequant anti-pattern, v2 status

- Added NVFP4 coverage table (what's native, what's converted, why)
- Documented the dequant→requant anti-pattern that caused vLLM hangs
- Updated plan: Phase 2 done, Phase 3 targets remaining conversions
- Removed stale REWRITE_PLAN reference
- Updated project structure (nvfp4_cutedsl.py, removed old refs)

											
										
										
											2026-05-16 05:43:33 +00:00
+								- Clean up old C++ kernel code (tagged `the-last-of-cutlass`)
-												docs: rewrite README, nuke DEBUG_LOG, add vLLM integration stub

README.md: full rewrite explaining how we got here, project structure,
plan, and key lessons learned from the C++ CUTLASS disaster.

Removed:
- DEBUG_LOG.md (old debug timeline, no longer relevant)
- REWRITE_PLAN.md (plan is now in README)
- test_gemm.py (C++ extension test)

Added:
- vllm/nvfp4_cutedsl.py: CuTeDSLMoERunner class for vLLM integration
  - Replaces nvfp4_mega_moe_full + SymmBuffer with CuTeDSL kernel
  - Handles slot-based routing, L1→SiLU→L2→scatter
  - prepare_weights_from_dequantized() for weight prep

Tagged the-last-of-cutlass on the old C++ kernel state.

											
										
										
											2026-05-16 03:33:16 +00:00
+								- Add proper error handling and logging
 								- Benchmark vs BF16 baseline