b3451c74f8
Update README and CURRENT_BUG.md with current state
...
- README: updated NVFP4 coverage table, status, and plan
- CURRENT_BUG.md: full debugging journey, what works, what's next
- Both reflect decision to build our own CuTeDSL kernels
2026-05-18 20:05:03 +00:00
af087e655e
docs: update README — vLLM cudagraph inference running, output quality in progress
2026-05-16 21:40:59 +00:00
f7e29fdf1e
docs: update README with cudagraph compatibility work and decisions
2026-05-16 18:55:47 +00:00
e5370140cb
docs: update README with full NVFP4 coverage, dequant anti-pattern, v2 status
...
- Added NVFP4 coverage table (what's native, what's converted, why)
- Documented the dequant→requant anti-pattern that caused vLLM hangs
- Updated plan: Phase 2 done, Phase 3 targets remaining conversions
- Removed stale REWRITE_PLAN reference
- Updated project structure (nvfp4_cutedsl.py, removed old refs)
2026-05-16 05:43:33 +00:00
b04bff7e8b
feat: clean Dockerfile, docker-compose, import fixes for CuTeDSL build
...
Dockerfile:
- Removed: C++ CUTLASS extension build, TileLang install, CUTLASS clone
- Added: nvidia-cutlass-dsl==4.5.0 install, cutedsl/ copy
- Copy nvfp4_cutedsl.py to vllm models dir
- Verify step checks cutlass import
docker-compose.yml:
- Removed stale env vars (MEGA_MOE_DEBUG, MEGA_MOE_STATIC, etc.)
deepseek_v4.py:
- Fix import: vllm.nvfp4_cutedsl → vllm.model_executor.models.nvfp4_cutedsl
README.md:
- Updated results: 0% weight loss confirmed (bit-identical view-cast)
- 1.1% cosine loss is entirely from activation quantization
2026-05-16 03:50:07 +00:00
3ec9c3074b
docs: rewrite README, nuke DEBUG_LOG, add vLLM integration stub
...
README.md: full rewrite explaining how we got here, project structure,
plan, and key lessons learned from the C++ CUTLASS disaster.
Removed:
- DEBUG_LOG.md (old debug timeline, no longer relevant)
- REWRITE_PLAN.md (plan is now in README)
- test_gemm.py (C++ extension test)
Added:
- vllm/nvfp4_cutedsl.py: CuTeDSLMoERunner class for vLLM integration
- Replaces nvfp4_mega_moe_full + SymmBuffer with CuTeDSL kernel
- Handles slot-based routing, L1→SiLU→L2→scatter
- prepare_weights_from_dequantized() for weight prep
Tagged the-last-of-cutlass on the old C++ kernel state.
2026-05-16 03:33:16 +00:00
9908fd64d9
feat: CUTLASS NVFP4 mega_moe kernel — slot-based L1/L2, source-first SF remap
...
Major changes from initial TileLang prototype:
Kernel:
- CUTLASS NVFP4 block-scaled GEMM (SM100 Blackwell, OpClassBlockScaledTensorOp)
- Slot-based dispatch: L1 GEMM → SiLU+Mul per-slot → L2 GEMM → index_add scatter
- 1D slot_expert_ids passed to both L1 and L2 (no 2D topk_ids rebuild)
- slot_token gathered in cutlass_grouped_nvfp4_gemm when provided
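A minimal pure-Python sketch of the slot-based flow described above (L1 → SiLU+Mul per slot → L2 → scatter-add), for illustration only. The helper names (`slot_moe`, `matvec`), the `slot_weight` routing weights, and the assumption that the first half of the L1 output is the gate and the second half is the up projection are all hypothetical; the real kernel runs these steps as grouped NVFP4 GEMMs on the GPU.

```python
import math

def silu(x):
    return x / (1.0 + math.exp(-x))

def matvec(w, x):
    # w: list of rows, x: vector
    return [sum(a * b for a, b in zip(row, x)) for row in w]

def slot_moe(tokens, slot_token, slot_expert_ids, slot_weight, w1, w2):
    """One slot per (token, expert) pair; all ids are flat 1D lists."""
    hidden = len(tokens[0])
    out = [[0.0] * hidden for _ in tokens]
    for s, (t, e) in enumerate(zip(slot_token, slot_expert_ids)):
        h = matvec(w1[e], tokens[t])            # L1 GEMM (gate rows, then up rows: assumed)
        half = len(h) // 2
        act = [silu(g) * u for g, u in zip(h[:half], h[half:])]  # SiLU+Mul, per slot
        y = matvec(w2[e], act)                  # L2 GEMM
        for i, v in enumerate(y):               # index_add-style scatter back to token t
            out[t][i] += slot_weight[s] * v
    return out

# One token routed to two experts -> two slots.
tokens = [[1.0, 0.0]]
slot_token = [0, 0]
slot_expert_ids = [0, 1]
slot_weight = [1.0, 1.0]                        # hypothetical routing weights
w1 = [[[1.0, 0.0], [1.0, 0.0]],                 # expert 0: gate = 1, up = 1
      [[-1.0, 0.0], [1.0, 0.0]]]                # expert 1: gate = -1, up = 1
w2 = [[[1.0], [0.0]], [[1.0], [0.0]]]

out = slot_moe(tokens, slot_token, slot_expert_ids, slot_weight, w1, w2)

# Applying SiLU after summing the expert paths gives a different answer.
per_slot = silu(1.0) * 1.0 + silu(-1.0) * 1.0
summed_first = silu(1.0 + -1.0) * (1.0 + 1.0)
print(out[0][0], per_slot, summed_first)
```

Because SiLU is nonlinear, the activation has to be applied to each slot's own gate/up pair before the scatter; summing first collapses the two expert paths into one.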
SF Remap (source-first):
- Iterates logical (m, k_sf) source grid, uses layout_sf(make_coord(m, k_sf))
for CUTLASS dest index — no idx2crd/flatten coordinate extraction
- 2D kernel launch: dim3 block(32,8), grid over (K_sf, MN)
- Uses cute::cosize() for physical allocation size (not cute::size)
- SFA: (MN, K_sf) row-major; SFB: (K_sf, MN) row-major (i.e. col-major as (MN, K_sf))
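To illustrate why the allocation uses cosize rather than size: for a layout with padding or gaps, the number of logical elements (size) is smaller than the largest physical offset plus one (cosize). The sketch below models a layout as a plain shape/stride pair; the shape and strides are made up for illustration and are not the actual CUTLASS scale-factor layout.

```python
from itertools import product

def layout_index(coord, stride):
    # Physical offset of a logical coordinate.
    return sum(c * s for c, s in zip(coord, stride))

def size(shape):
    # Number of logical elements.
    n = 1
    for extent in shape:
        n *= extent
    return n

def cosize(shape, stride):
    # Largest reachable offset + 1 (assumes non-negative strides).
    return layout_index([extent - 1 for extent in shape], stride) + 1

# Hypothetical padded layout: 4 rows, 3 columns, column stride padded to 8.
shape, stride = (4, 3), (1, 8)

max_offset = max(layout_index(c, stride)
                 for c in product(*(range(e) for e in shape)))

print(size(shape), cosize(shape, stride), max_offset)
```

A buffer of `size(shape)` elements would be too small here: the last logical element lands at offset 19, so 20 elements must be allocated.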
Weight transform:
- UE4M3 unpack with bit reinterpret (not value cast)
- Global scale folding (weight_scale_2) for gate/up split
- clamp(0,448) → float8_e4m3fn, transpose (N,K)→(K,N) for CUTLASS
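The difference between a bit reinterpret and a value cast can be seen with a small pure-Python decoder for the E4M3 bit pattern. The function names and the least-significant-byte-first unpack order are assumptions made for this sketch, not taken from the repository; UE4M3 scale factors are unsigned, so the sign bit is expected to be zero.

```python
def decode_e4m3(byte):
    """Decode one float8_e4m3fn bit pattern (0..255) to a Python float."""
    sign = -1.0 if byte & 0x80 else 1.0
    exp = (byte >> 3) & 0xF
    mant = byte & 0x7
    if exp == 0xF and mant == 0x7:
        return float("nan")                      # e4m3fn has NaN but no infinities
    if exp == 0:
        return sign * (mant / 8.0) * 2.0 ** -6   # subnormal
    return sign * (1.0 + mant / 8.0) * 2.0 ** (exp - 7)

def unpack_u32(word):
    """Split a 32-bit word into 4 bytes, least-significant first (assumed order)."""
    return [(word >> (8 * i)) & 0xFF for i in range(4)]

word = 0x7E403830
codes = unpack_u32(word)                         # [0x30, 0x38, 0x40, 0x7E]

reinterpreted = [decode_e4m3(b) for b in codes]  # treat each byte as E4M3 bits
value_cast = [float(b) for b in codes]           # treat each byte as an integer

print(reinterpreted)  # [0.5, 1.0, 2.0, 448.0]
print(value_cast)     # [48.0, 56.0, 64.0, 126.0]
```

The largest finite E4M3 value is 448 (bit pattern `0x7E`), which is where the clamp bound in the weight transform comes from.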
No prepack cache:
- SFB remapped per-call inside CUTLASS (~µs, not the bottleneck)
- See README for why prepack cache must never return (OOM, CUDA graphs,
M-dependent layout, cross-layer collisions)
Stage activation:
- Nearest-neighbor E2M1 quantization (no clamp, no uniform steps)
- Per-tensor global scale → alpha for L2 GEMM
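A sketch of nearest-neighbor E2M1 quantization in pure Python, assuming the input has already been divided by its scale factors. E2M1 can represent only eight magnitudes, and they are not evenly spaced, so the quantizer picks the closest grid point instead of rounding to a uniform step. The function name is hypothetical, and ties here resolve to the smaller magnitude, which is an assumption of this sketch rather than a documented behavior of the kernel.

```python
# The eight non-negative E2M1 values, indexed by their 3-bit code.
E2M1_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)

def quantize_e2m1(x):
    """Return (4-bit code, dequantized value) for the nearest E2M1 value."""
    mag = abs(x)
    # min() returns the first minimum, so ties go to the smaller magnitude.
    idx = min(range(8), key=lambda i: abs(E2M1_MAGNITUDES[i] - mag))
    sign_bit = 0x8 if x < 0 else 0x0
    value = -E2M1_MAGNITUDES[idx] if x < 0 else E2M1_MAGNITUDES[idx]
    return sign_bit | idx, value

print(quantize_e2m1(2.4))    # (4, 2.0)
print(quantize_e2m1(-5.2))   # (15, -6.0)
print(quantize_e2m1(100.0))  # (7, 6.0)
```

No explicit clamp is needed: any magnitude beyond the grid is already closest to 6.0, the largest representable value.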
Bug fixes:
- _fold_global_scale: removed broken logical_widths branch
- unpack_ue4m3_u32: use int32 for CUDA bitwise ops, .view() not .to(), N-D input support
- Correct expert param mapping for NVFP4 checkpoint
- SiLU applied per-slot (not after summing expert paths)
2026-05-15 11:38:18 +00:00
c2b752c2fe
Initial: TileLang NVFP4 mega_moe kernel package
...
- nvfp4_mega_moe_full: drop-in replacement for deep_gemm.mega.fp8_nvfp4_mega_moe
- transform_nvfp4_weights_for_mega_moe: weight transformation (tested)
- SymmBuffer + get_symm_buffer_for_nvfp4_mega_moe: API-matching stubs
- MEGA_MOE_STATIC=1 support for pipeline testing
- pyproject.toml for pip install
2026-05-13 15:44:51 +00:00