Commit Graph

163 Commits

Author SHA1 Message Date
b465579a02 cleanup: nuke all debug prints and env var gates from vLLM patch
Removed:
- [WT-LOAD] weight loader debug (MEGA_MOE_DEBUG gate)
- [NVFP4 DEBUG] shape logging in _run_mega_moe
- [NVFP4_DEBUG] post-load expert weight counting
- [NVFP4] post-load sync + CUDA OK print (NVFP4_DEBUG_SYNC gate)
- [POST-LOAD] all-zero param tensor scanning
- [LOGITS] top-k printing + Paris probe
- SKIP_ATTENTION env var gate for skipping attention
- Unused total_fp8/total_bf16 variables

Debugging belongs in layertest.py, not in the vLLM serving path.
These prints polluted logs, bloated context windows, and slowed loading.
2026-05-16 04:10:42 +00:00
6d17988b51 fix: L1 gate/up split — intermediate_size is per-projection, not fused
intermediate_size=3072 is the size of gate OR up, not gate+up.
Split L1 output at intermediate_size, not intermediate_size//2.
gate = l1_out[:, :3072], up = l1_out[:, 3072:]
2026-05-16 04:04:40 +00:00
37aa0cbeab debug: add try/except with shape logging to _run_mega_moe 2026-05-16 04:02:01 +00:00
b04bff7e8b feat: clean Dockerfile, docker-compose, import fixes for CuTeDSL build
Dockerfile:
- Removed: C++ CUTLASS extension build, TileLang install, CUTLASS clone
- Added: nvidia-cutlass-dsl==4.5.0 install, cutedsl/ copy
- Copy nvfp4_cutedsl.py to vllm models dir
- Verify step checks cutlass import

docker-compose.yml:
- Removed stale env vars (MEGA_MOE_DEBUG, MEGA_MOE_STATIC, etc.)

deepseek_v4.py:
- Fix import: vllm.nvfp4_cutedsl → vllm.model_executor.models.nvfp4_cutedsl

README.md:
- Updated results: 0% weight loss confirmed (bit-identical view-cast)
- 1.1% cosine loss is entirely from activation quantization
2026-05-16 03:50:07 +00:00
a0ff8a3278 fix: transpose checkpoint block scales (N,K_sf)→(K_sf,N) for bridge
The bridge's assemble_scales_3d_side expects (K_sf, N) input and
transposes to (N, K_sf) internally before swizzling. The checkpoint
stores scales as (N, K_sf). Without this transpose, the kernel was
reading completely wrong scale data — cosine dropped to 0.713.

Also fixed dual global scale normalization: after transpose, gate/up
are along dim 1 (columns), not dim 0 (rows).
2026-05-16 03:43:30 +00:00
389453fbf4 feat: direct NVFP4 path — no BF16 round-trip on weights
finalize_weights() now view-casts checkpoint uint8 → float4_e2m1fn_x2
directly. Block scales (float8_e4m3fn) and global scales (float32)
pass through unchanged. Zero precision loss on the weights themselves.

L1 dual global scale handling: gate and up have different global scales.
Normalize to max(gate_gs, up_gs) and fold the ratio into block scales
via float32 (one multiply + float8 round-trip on the RATIO only —
much better than dequantizing the entire weight matrix).

layertest.py: updated to test direct path. Expect cosine improvement
from 0.989 → 0.995+ (matching the L1-only result).
2026-05-16 03:41:23 +00:00
8fd9579127 feat: vLLM integration — replace C++ kernel with CuTeDSL
deepseek_v4.py changes:
- finalize_weights(): dequantize checkpoint → BF16 → re-quantize to
  float4_e2m1fn_x2 via CuTeDSLMoERunner (replaces transform_nvfp4_weights_for_mega_moe)
- _run_mega_moe(): calls CuTeDSLMoERunner.run() (replaces nvfp4_mega_moe_full)
- Removed get_symm_buffer() and SymmBuffer (CuTeDSL manages its own workspace)
- Removed _transformed_l1_weights / _transformed_l2_weights
- Added _cutedsl_runner class variable
- Weight loader unchanged (checkpoint loading is the same)

vllm/nvfp4_cutedsl.py:
- CuTeDSLMoERunner class handles the full pipeline
- prepare_weights_from_dequantized() for weight prep
- run() does L1→SiLU→L2→scatter with NVFP4-native GEMMs
2026-05-16 03:36:12 +00:00
3ec9c3074b docs: rewrite README, nuke DEBUG_LOG, add vLLM integration stub
README.md: full rewrite explaining how we got here, project structure,
plan, and key lessons learned from the C++ CUTLASS disaster.

Removed:
- DEBUG_LOG.md (old debug timeline, no longer relevant)
- REWRITE_PLAN.md (plan is now in README)
- test_gemm.py (C++ extension test)

Added:
- vllm/nvfp4_cutedsl.py: CuTeDSLMoERunner class for vLLM integration
  - Replaces nvfp4_mega_moe_full + SymmBuffer with CuTeDSL kernel
  - Handles slot-based routing, L1→SiLU→L2→scatter
  - prepare_weights_from_dequantized() for weight prep

Tagged the-last-of-cutlass on the old C++ kernel state.
2026-05-16 03:33:16 +00:00
294e9f98f2 cleanup: rename _ue8m0_to_float32 → _block_scale_to_float32, remove dead code
- Renamed misleading _ue8m0_to_float32 to _block_scale_to_float32
  (our checkpoint uses float8_e4m3fn, NOT E8M0)
- Removed dead is_scale_e8m0 property (never referenced)
- Removed dead _block_scale_to_float32 copy in MegaMoEExperts class
- Cleaned up stale E8M0/UE8M0/shift-by-23 comments
- Simplified E8M0 assertion to ValueError (not assert False)
- Updated DeepseekV4FP8Config docstring for NVFP4
2026-05-16 01:55:56 +00:00
79b9becf9c revert: don't use checkpoint input_scale for activation normalization
Using checkpoint input_scale as the normalization scale saturates
FP4 values (all block scales = 448). The input_scale is a calibration
constant, NOT the amax/(6*448) normalization scale.

Reverted to dynamic amax/(6*448) for activation quantization.
The correct use of checkpoint input_scale is still under investigation.

Preserved: _w13_input_scale and _w2_input_scale in finalize_weights
for future use once we understand the correct alpha contract.
2026-05-16 00:12:41 +00:00
a7eae10ef4 fix: use checkpoint input_scale for activation quantization
Critical fix: the checkpoint's input_scale was used during weight
calibration but we were computing dynamic scale from data (amax/2688).
This was 13x off from the checkpoint value.

Changes:
- stage_activation() accepts optional input_global_scale parameter
- nvfp4_mega_moe_full() accepts l1_input_scale and l2_input_scale
- vLLM patch preserves w13/w2_input_scale in finalize_weights
- L1 activation uses checkpoint w13_input_scale for quantization
- L2 activation uses checkpoint w2_input_scale for quantization
- alpha = input_scale * weight_scale_2 (correct calibration contract)
2026-05-15 23:57:08 +00:00
fd59222fc0 fix: stop folding global scale into float8 block scales
The fold block_sf (float8) * global_sf (float32) -> float8 loses ~25% precision.
Product of ~56-448 block_sf * ~4.65e-05 global_sf lands in float8 low-precision
zone where step size is 25%. This makes model output garbage despite finite values.

Fix: keep block scales as original float8, return global scales separately as
float32 per-expert vectors. Apply global scale as per-expert GEMM alpha in
cutlass_grouped_nvfp4_gemm (already iterates per-expert). For L1 with separate
gate/up global scales, use gate_gs as alpha and apply up_correction ratio to
the up half post-GEMM.

weight_transform.py: no more _fold_global_scale, returns (w, sf, global_sf)
nvfp4_mega_moe.py: per-expert alpha = activation_gs * weight_gs
kernel.py: per_expert_alpha parameter in grouped GEMM
deepseek_v4.py: updated type hints and comments
2026-05-15 12:42:53 +00:00
9908fd64d9 feat: CUTLASS NVFP4 mega_moe kernel — slot-based L1/L2, source-first SF remap
Major changes from initial TileLang prototype:

Kernel:
- CUTLASS NVFP4 block-scaled GEMM (SM100 Blackwell, OpClassBlockScaledTensorOp)
- Slot-based dispatch: L1 GEMM → SiLU+Mul per-slot → L2 GEMM → index_add scatter
- 1D slot_expert_ids passed to both L1 and L2 (no 2D topk_ids rebuild)
- slot_token gathered in cutlass_grouped_nvfp4_gemm when provided

SF Remap (source-first):
- Iterates logical (m, k_sf) source grid, uses layout_sf(make_coord(m, k_sf))
  for CUTLASS dest index — no idx2crd/flatten coordinate extraction
- 2D kernel launch: dim3 block(32,8), grid over (K_sf, MN)
- Uses cute::cosize() for physical allocation size (not cute::size)
- SFA: (MN, K_sf) row-major; SFB: (K_sf, MN) row-major (col-major)

Weight transform:
- UE4M3 unpack with bit reinterpret (not value cast)
- Global scale folding (weight_scale_2) for gate/up split
- clamp(0,448) → float8_e4m3fn, transpose (N,K)→(K,N) for CUTLASS

No prepack cache:
- SFB remapped per-call inside CUTLASS (~µs, not the bottleneck)
- See README for why prepack cache must never return (OOM, CUDA graphs,
  M-dependent layout, cross-layer collisions)

Stage activation:
- Nearest-neighbor E2M1 quantization (no clamp, no uniform steps)
- Per-tensor global scale → alpha for L2 GEMM

Bug fixes:
- _fold_global_scale: removed broken logical_widths branch
- unpack_ue4m3_u32: int32 for CUDA bitwise, view not to, ND support
- Correct expert param mapping for NVFP4 checkpoint
- SiLU applied per-slot (not after summing expert paths)
2026-05-15 11:38:18 +00:00