nvfp4-megamoe-kernel

Author	SHA1	Message	Date
biondizzle	b465579a02	cleanup: nuke all debug prints and env var gates from vLLM patch Removed: - [WT-LOAD] weight loader debug (MEGA_MOE_DEBUG gate) - [NVFP4 DEBUG] shape logging in _run_mega_moe - [NVFP4_DEBUG] post-load expert weight counting - [NVFP4] post-load sync + CUDA OK print (NVFP4_DEBUG_SYNC gate) - [POST-LOAD] all-zero param tensor scanning - [LOGITS] top-k printing + Paris probe - SKIP_ATTENTION env var gate for skipping attention - Unused total_fp8/total_bf16 variables Debugging belongs in layertest.py, not in the vLLM serving path. These prints polluted logs, bloated context windows, and slowed loading.	2026-05-16 04:10:42 +00:00
biondizzle	6d17988b51	fix: L1 gate/up split — intermediate_size is per-projection, not fused intermediate_size=3072 is the size of gate OR up, not gate+up. Split L1 output at intermediate_size, not intermediate_size//2. gate = l1_out[:, :3072], up = l1_out[:, 3072:]	2026-05-16 04:04:40 +00:00
biondizzle	37aa0cbeab	debug: add try/except with shape logging to _run_mega_moe	2026-05-16 04:02:01 +00:00
biondizzle	b04bff7e8b	feat: clean Dockerfile, docker-compose, import fixes for CuTeDSL build Dockerfile: - Removed: C++ CUTLASS extension build, TileLang install, CUTLASS clone - Added: nvidia-cutlass-dsl==4.5.0 install, cutedsl/ copy - Copy nvfp4_cutedsl.py to vllm models dir - Verify step checks cutlass import docker-compose.yml: - Removed stale env vars (MEGA_MOE_DEBUG, MEGA_MOE_STATIC, etc.) deepseek_v4.py: - Fix import: vllm.nvfp4_cutedsl → vllm.model_executor.models.nvfp4_cutedsl README.md: - Updated results: 0% weight loss confirmed (bit-identical view-cast) - 1.1% cosine loss is entirely from activation quantization	2026-05-16 03:50:07 +00:00
biondizzle	a0ff8a3278	fix: transpose checkpoint block scales (N,K_sf)→(K_sf,N) for bridge The bridge's assemble_scales_3d_side expects (K_sf, N) input and transposes to (N, K_sf) internally before swizzling. The checkpoint stores scales as (N, K_sf). Without this transpose, the kernel was reading completely wrong scale data — cosine dropped to 0.713. Also fixed dual global scale normalization: after transpose, gate/up are along dim 1 (columns), not dim 0 (rows).	2026-05-16 03:43:30 +00:00
biondizzle	389453fbf4	feat: direct NVFP4 path — no BF16 round-trip on weights finalize_weights() now view-casts checkpoint uint8 → float4_e2m1fn_x2 directly. Block scales (float8_e4m3fn) and global scales (float32) pass through unchanged. Zero precision loss on the weights themselves. L1 dual global scale handling: gate and up have different global scales. Normalize to max(gate_gs, up_gs) and fold the ratio into block scales via float32 (one multiply + float8 round-trip on the RATIO only — much better than dequantizing the entire weight matrix). layertest.py: updated to test direct path. Expect cosine improvement from 0.989 → 0.995+ (matching the L1-only result).	2026-05-16 03:41:23 +00:00
biondizzle	8fd9579127	feat: vLLM integration — replace C++ kernel with CuTeDSL deepseek_v4.py changes: - finalize_weights(): dequantize checkpoint → BF16 → re-quantize to float4_e2m1fn_x2 via CuTeDSLMoERunner (replaces transform_nvfp4_weights_for_mega_moe) - _run_mega_moe(): calls CuTeDSLMoERunner.run() (replaces nvfp4_mega_moe_full) - Removed get_symm_buffer() and SymmBuffer (CuTeDSL manages its own workspace) - Removed _transformed_l1_weights / _transformed_l2_weights - Added _cutedsl_runner class variable - Weight loader unchanged (checkpoint loading is the same) vllm/nvfp4_cutedsl.py: - CuTeDSLMoERunner class handles the full pipeline - prepare_weights_from_dequantized() for weight prep - run() does L1→SiLU→L2→scatter with NVFP4-native GEMMs	2026-05-16 03:36:12 +00:00
biondizzle	3ec9c3074b	docs: rewrite README, nuke DEBUG_LOG, add vLLM integration stub README.md: full rewrite explaining how we got here, project structure, plan, and key lessons learned from the C++ CUTLASS disaster. Removed: - DEBUG_LOG.md (old debug timeline, no longer relevant) - REWRITE_PLAN.md (plan is now in README) - test_gemm.py (C++ extension test) Added: - vllm/nvfp4_cutedsl.py: CuTeDSLMoERunner class for vLLM integration - Replaces nvfp4_mega_moe_full + SymmBuffer with CuTeDSL kernel - Handles slot-based routing, L1→SiLU→L2→scatter - prepare_weights_from_dequantized() for weight prep Tagged the-last-of-cutlass on the old C++ kernel state.	2026-05-16 03:33:16 +00:00
biondizzle	294e9f98f2	cleanup: rename _ue8m0_to_float32 → _block_scale_to_float32, remove dead code - Renamed misleading _ue8m0_to_float32 to _block_scale_to_float32 (our checkpoint uses float8_e4m3fn, NOT E8M0) - Removed dead is_scale_e8m0 property (never referenced) - Removed dead _block_scale_to_float32 copy in MegaMoEExperts class - Cleaned up stale E8M0/UE8M0/shift-by-23 comments - Simplified E8M0 assertion to ValueError (not assert False) - Updated DeepseekV4FP8Config docstring for NVFP4	2026-05-16 01:55:56 +00:00
biondizzle	79b9becf9c	revert: don't use checkpoint input_scale for activation normalization Using checkpoint input_scale as the normalization scale saturates FP4 values (all block scales = 448). The input_scale is a calibration constant, NOT the amax/(6*448) normalization scale. Reverted to dynamic amax/(6*448) for activation quantization. The correct use of checkpoint input_scale is still under investigation. Preserved: _w13_input_scale and _w2_input_scale in finalize_weights for future use once we understand the correct alpha contract.	2026-05-16 00:12:41 +00:00
biondizzle	a7eae10ef4	fix: use checkpoint input_scale for activation quantization Critical fix: the checkpoint's input_scale was used during weight calibration but we were computing dynamic scale from data (amax/2688). This was 13x off from the checkpoint value. Changes: - stage_activation() accepts optional input_global_scale parameter - nvfp4_mega_moe_full() accepts l1_input_scale and l2_input_scale - vLLM patch preserves w13/w2_input_scale in finalize_weights - L1 activation uses checkpoint w13_input_scale for quantization - L2 activation uses checkpoint w2_input_scale for quantization - alpha = input_scale * weight_scale_2 (correct calibration contract)	2026-05-15 23:57:08 +00:00
biondizzle	fd59222fc0	fix: stop folding global scale into float8 block scales The fold block_sf (float8) * global_sf (float32) -> float8 loses ~25% precision. Product of ~56-448 block_sf * ~4.65e-05 global_sf lands in float8 low-precision zone where step size is 25%. This makes model output garbage despite finite values. Fix: keep block scales as original float8, return global scales separately as float32 per-expert vectors. Apply global scale as per-expert GEMM alpha in cutlass_grouped_nvfp4_gemm (already iterates per-expert). For L1 with separate gate/up global scales, use gate_gs as alpha and apply up_correction ratio to the up half post-GEMM. weight_transform.py: no more _fold_global_scale, returns (w, sf, global_sf) nvfp4_mega_moe.py: per-expert alpha = activation_gs * weight_gs kernel.py: per_expert_alpha parameter in grouped GEMM deepseek_v4.py: updated type hints and comments	2026-05-15 12:42:53 +00:00
biondizzle	9908fd64d9	feat: CUTLASS NVFP4 mega_moe kernel — slot-based L1/L2, source-first SF remap Major changes from initial TileLang prototype: Kernel: - CUTLASS NVFP4 block-scaled GEMM (SM100 Blackwell, OpClassBlockScaledTensorOp) - Slot-based dispatch: L1 GEMM → SiLU+Mul per-slot → L2 GEMM → index_add scatter - 1D slot_expert_ids passed to both L1 and L2 (no 2D topk_ids rebuild) - slot_token gathered in cutlass_grouped_nvfp4_gemm when provided SF Remap (source-first): - Iterates logical (m, k_sf) source grid, uses layout_sf(make_coord(m, k_sf)) for CUTLASS dest index — no idx2crd/flatten coordinate extraction - 2D kernel launch: dim3 block(32,8), grid over (K_sf, MN) - Uses cute::cosize() for physical allocation size (not cute::size) - SFA: (MN, K_sf) row-major; SFB: (K_sf, MN) row-major (col-major) Weight transform: - UE4M3 unpack with bit reinterpret (not value cast) - Global scale folding (weight_scale_2) for gate/up split - clamp(0,448) → float8_e4m3fn, transpose (N,K)→(K,N) for CUTLASS No prepack cache: - SFB remapped per-call inside CUTLASS (~µs, not the bottleneck) - See README for why prepack cache must never return (OOM, CUDA graphs, M-dependent layout, cross-layer collisions) Stage activation: - Nearest-neighbor E2M1 quantization (no clamp, no uniform steps) - Per-tensor global scale → alpha for L2 GEMM Bug fixes: - _fold_global_scale: removed broken logical_widths branch - unpack_ue4m3_u32: int32 for CUDA bitwise, view not to, ND support - Correct expert param mapping for NVFP4 checkpoint - SiLU applied per-slot (not after summing expert paths)	2026-05-15 11:38:18 +00:00
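
The precision loss described in fd59222fc0 (folding a float32 global scale into float8 block scales) can be reproduced with plain arithmetic. The sketch below is not code from this repository: `round_to_e4m3` and `fold_error` are hypothetical pure-Python helpers, written under the assumption that float8_e4m3fn has 3 mantissa bits, a smallest normal value of 2**-6, and a largest finite value of 448. The inputs are the block scale and global scale magnitudes quoted in the commit message.

```python
import math

E4M3_MAX = 448.0     # largest finite float8_e4m3fn value
E4M3_MIN_EXP = -6    # exponent of the smallest normal value (2**-6)
E4M3_MANT_BITS = 3   # explicit mantissa bits


def round_to_e4m3(x: float) -> float:
    """Round x to the nearest float8_e4m3fn value (hypothetical helper, saturating)."""
    if x == 0.0:
        return 0.0
    sign = -1.0 if x < 0 else 1.0
    a = min(abs(x), E4M3_MAX)
    # frexp gives a = m * 2**ex with m in [0.5, 1), so floor(log2(a)) == ex - 1.
    _, ex = math.frexp(a)
    # Below 2**-6 the format is subnormal: the exponent is pinned, so the
    # absolute step stays at 2**-9 no matter how small the value gets.
    e = max(ex - 1, E4M3_MIN_EXP)
    step = 2.0 ** (e - E4M3_MANT_BITS)
    # Python's round() is round-half-to-even, matching the usual float rounding mode.
    return sign * min(round(a / step) * step, E4M3_MAX)


def fold_error(block_sf: float, global_sf: float) -> float:
    """Relative error introduced by storing block_sf * global_sf as float8_e4m3fn."""
    exact = block_sf * global_sf
    folded = round_to_e4m3(exact)
    return abs(folded - exact) / exact


if __name__ == "__main__":
    for block_sf in (56.0, 448.0):
        print(block_sf, fold_error(block_sf, 4.65e-05))
```

With a block scale of 56 the folded product falls in the subnormal range, where the only nearby representable values are multiples of 2**-9, and the relative error comes out near 25%. That is consistent with the figure in the commit message and with the decision to keep block scales in their original float8 form and apply the global scale separately as a float32 alpha.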

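The gate/up split fixed in 6d17988b51 comes down to where the fused L1 output is cut. A minimal sketch, using plain Python lists in place of tensors and a toy `intermediate_size` of 3 instead of 3072: `split_gate_up` is a hypothetical name, and the layout (gate columns first, then up columns) is taken from the commit message.

```python
def split_gate_up(l1_out, intermediate_size):
    """Split each fused L1 output row into its gate and up halves.

    intermediate_size is the width of gate OR up, so every fused row
    must be 2 * intermediate_size wide.
    """
    gate, up = [], []
    for row in l1_out:
        if len(row) != 2 * intermediate_size:
            raise ValueError(
                f"expected row width {2 * intermediate_size}, got {len(row)}"
            )
        gate.append(row[:intermediate_size])   # l1_out[:, :intermediate_size]
        up.append(row[intermediate_size:])     # l1_out[:, intermediate_size:]
    return gate, up


# Toy fused output: one token, gate = [0, 1, 2], up = [10, 11, 12].
intermediate_size = 3
l1_out = [[0, 1, 2, 10, 11, 12]]

gate, up = split_gate_up(l1_out, intermediate_size)

# The pre-fix behaviour cut at intermediate_size // 2, which leaves most of
# the gate columns in the "up" half.
bad_cut = intermediate_size // 2
bad_gate = [row[:bad_cut] for row in l1_out]
bad_up = [row[bad_cut:] for row in l1_out]
```

Cutting at `intermediate_size // 2` yields halves of unequal width, which is the kind of shape mismatch the shape logging in 37aa0cbeab was added to surface.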

163 Commits