Commit Graph

85 Commits

Author SHA1 Message Date
c289c44920 Fix BF16 wo_a: per-group BMM instead of flat linear
The BF16 wo_a path was calling self.wo_a(o_inv.reshape(num_tokens, -1))
which flattens across groups: (num_tokens, n_local_heads*head_dim)=(tokens, 8192).
But wo_a is a BMM with in_features=n_heads*head_dim/n_groups=4096.

The FP8 path handles this via einsum 'bhr,hdr->bhd' with per-group shapes.
The BF16 path now does the same: reshape o_inv to per-group format,
do torch.bmm, then reshape output and handle TP all-gather manually.
2026-05-19 04:10:02 +00:00
6f9a400ae0 Fix hc_head mapping: checkpoint uses hc_head.hc_fn, model params are flat hc_head_fn
- Removed hc_head prefix mapping (checkpoint already has model.hc_head.*)
- Fixed substr: hc_head.hc_fn→hc_head_fn (not hc_head.fn→hc_head_fn)
- The model has self.hc_head_fn as flat params, not inside a sub-module
2026-05-19 03:58:25 +00:00
4cf5b8b751 Fix compressor path: attn.mla_attn.compressor (not attn.compressor)
The compressor is inside mla_attn, not directly on the attention wrapper.
Debug output confirmed: layers.0.attn.mla_attn.compressor.fused_wkv_wgate.*
2026-05-19 03:47:26 +00:00
fece06f746 Add unit tests for NVFP4 weight mapper and inverse RoPE BF16 2026-05-19 03:22:00 +00:00
b856ee9315 Clean up debug scripts 2026-05-19 02:47:29 +00:00
8fe5546bb3 Fix debug script 2026-05-19 02:43:17 +00:00
788f0aa65a Add step-by-step debug for wo_a 2026-05-19 02:43:05 +00:00
77e4970d93 Add debug script for wo_a quantization 2026-05-19 02:40:43 +00:00
80122b850b Add debug script for wo_a 2026-05-19 02:39:55 +00:00
ae233ab648 Fix test: cos_sin_cache on CUDA device 2026-05-19 02:37:50 +00:00
882d4996ff Replace DeepGEMM fp8_einsum with CuTeDSL NVFP4 for wo_a (o_proj)
The B200 container crashes in DeepGEMM's fp8_einsum (t.dim() == N assertion
in layout.hpp:39) when processing wo_a (o-projection first half) in the
attention layer. The crash is caused by scale tensor dimension mismatch
for the SM100 recipe (1, 1, 128).

Instead of fighting DeepGEMM, replace the entire wo_a path with our own
CuTeDSL NVFP4 kernel:

1. inverse_rope_bf16() — Python implementation of inverse RoPE
   (replaces fused_inv_rope_fp8_quant CUDA kernel)
2. CuTeDSLNvfp4WoA — NVFP4 grouped linear for wo_a using
   ScaledGroupedGemm with n_local_groups=8 groups
3. wo_a weight quantized to NVFP4 instead of FP8 (native NVFP4,
   no conversion to another quantization)

Changes:
- cutedsl/inverse_rope.py: BF16 inverse RoPE (conjugate rotation)
- cutedsl/wo_a_grouped_linear.py: CuTeDSL NVFP4 grouped GEMM for wo_a
- vllm/patches/deepseek_v4_attention.py: Use NVFP4 path when runner
  is initialized, keep DeepGEMM fallback
- vllm/patches/deepseek_v4.py: Init NVFP4 runner instead of FP8 quant
- tests/test_wo_a.py: Unit test for inverse RoPE + wo_a GEMM
2026-05-19 02:36:30 +00:00
00fe63b56f Fix compile test: add warmup for activation global scales 2026-05-19 01:57:16 +00:00
bba3bca4d3 Add torch.compile + custom op integration test 2026-05-19 01:56:46 +00:00
35fab6cff3 Replace autograd.Function with torch.library.custom_op for Dynamo compat
Dynamo (torch.compile fullgraph) cannot trace through CuTeDSL internals
(cute.compile, JIT, etc.). The autograd.Function approach was unreliable
with fullgraph mode — Dynamo would still try to trace through it.

Fix: torch.library.custom_op makes Dynamo treat our GEMM as an opaque
black box. No reimplementing the kernel — just route through the existing
runner via a registry pattern:
  - Runners registered in global dict with integer IDs
  - Custom op takes (tensors, runner_id, shape_hint) -> tensor
  - Dynamo calls fake impl for shape inference, never touches the runner
  - At execution time, real impl looks up runner and calls _run_impl

Changes:
  - New: cutedsl/custom_ops.py (custom op definitions + registry)
  - New: tests/test_custom_op.py (local unit tests, no GPU needed)
  - Removed: _Nvfp4LinearApply, _MoEApply (autograd.Function classes)
  - Updated: nvfp4_linear.py, runner.py, cutedsl.py, nvfp4_cutedsl.py
    to use custom ops instead of autograd.Function
  - Updated: cutedsl_quant_method.py to use custom op + registry
2026-05-19 01:54:48 +00:00
f74447bfd0 Proper NVFP4 integration: quantized compressor/indexer + mapper fixes
Weight mapper fixes:
- Reorder substr renames: compressor renames first, then .self_attn.compressor.
  → .attn.mla_attn.compressor., then indexer renames (so indexer keys end up
  under mla_attn after the compressor rename already fired)
- Add compressor param renames: kv_proj→wkv, gate_proj→wgate, kv_norm→norm,
  position_bias→ape (checkpoint uses NVFP4 naming, model uses internal names)
- Add indexer param renames: q_b_proj→wq_b, kv_proj→compressor.wkv,
  gate_proj→compressor.wgate, kv_norm→k_norm, position_bias→compressor.ape,
  weights_proj stays (structural: compressor.indexer → indexer.compressor)
- Remove broken suffix renames (already fixed in prior commit)

Model architecture fixes:
- Patch deepseek_compressor.py to pass quant_config (was None, but NVFP4
  checkpoint has quantized compressor weights with input_scale/weight_scale)
- Patch deepseek_v4_attention.py indexer: weights_proj now uses quant_config
  (was None, but checkpoint has quantized weights)
- Add indexer.compressor.fused_wkv_wgate stacking in load_weights

Infrastructure:
- Add deepseek_compressor.py to Dockerfile
- Force MoE backend to flashinfer_cutedsl (was auto-selecting FLASHINFER_TRTLLM)
- Update unit test to 50 cases (compressor + indexer + quantization scales)
2026-05-18 23:20:13 +00:00
17496b2615 Fix NVFP4 weights mapper: add prefix mappings, fix substr order
- Add orig_to_new_prefix mappings (layers→model.layers, embed_tokens→model.embed_tokens, etc.)
  AutoWeightsLoader strips the model. prefix before the mapper runs, so these are required
- Move .self_attn.compressor. → .attn.mla_attn.compressor. before .self_attn. → .attn.
  in substr_renames so compressor keys get the mla_attn prefix before the general rename
- Remove suffix renames (head.weight→lm_head.weight, embed.weight→embed_tokens.weight)
  that were causing double-mapping since the NVFP4 checkpoint already uses lm_head/embed_tokens
- Add unit test: tests/test_nvfp4_mapper.py (39 cases, no vLLM/CUDA needed)
2026-05-18 23:03:34 +00:00
6ce6a47be9 Add NVFP4 linear runner + attention projection test
- CuTeDSLNvfp4Linear: generic single-GEMM runner for any NVFP4 projection
- test_attention.py: tests q_a_proj, q_b_proj, kv_proj, o_b_proj vs BF16
- Same pad+swizzle pattern as shared expert, but no SiLU/fusion
2026-05-18 20:14:03 +00:00
f07643791e Fix hidden_size: shared expert uses 7168, not HC_DIM 28672 2026-05-18 20:10:32 +00:00
c1aa4af123 Shared expert: dedicated CuTeDSL runner with proper scale assembly
- CuTeDSLSharedExpertRunner: num_groups=1 GEMM, no scatter/routing
- _assemble_scales_single_group: pad to 128 rows + Blackwell swizzle
- All buffers pre-allocated for cudagraph compatibility
- Updated test to use dedicated runner instead of MoE runner hack
2026-05-18 20:08:34 +00:00
e8b289e30d WIP: CuTeDSL shared expert kernel
Dedicated runner (shared_expert_pipeline.py) and test (test_shared_expert.py).
Tried reusing MoE runner with 1 expert — fails because MoE runner assumes
hidden_size != HC_DIM for scatter. Need dedicated runner with correct
scale assembly. Will continue tomorrow.
2026-05-18 20:02:19 +00:00
bedcfc4dab Pipeline test: use max_num_tokens=8192 matching vLLM 2026-05-17 23:04:44 +00:00
c45364b3a8 Add MoE scale ratio output 2026-05-17 22:58:27 +00:00
bf99ad49ec Print both MoE and residual cosine 2026-05-17 22:56:56 +00:00
8637020487 Fix multi-layer test: add residual connections 2026-05-17 22:55:40 +00:00
11dce13afe Add multi-layer pipeline test to check error accumulation 2026-05-17 22:53:28 +00:00
72628fb689 Full pipeline test: runner vs BF16 reference 2026-05-17 21:29:16 +00:00
2796bd81e8 Fix: scatter FP4 as uint8 (float4 doesn't support index_put) 2026-05-17 21:28:04 +00:00
364f8372bb Fix FP4 buffer shapes: D//2 for packed dimensions 2026-05-17 21:26:46 +00:00
5e4d674736 Test fix: quantize slot_hidden, scatter FP4, pass slot_x_sf 2026-05-17 21:25:58 +00:00
4d0b6d889d Set runner weights before _ensure_stacked 2026-05-17 21:22:50 +00:00
b7acac5e4e Call _ensure_stacked() before using runner buffers 2026-05-17 21:22:30 +00:00
1acf01fc1a Fix token_indices: repeat each token ID top_k times, not arange 2026-05-17 21:22:11 +00:00
a478ca4746 Debug: trace runner logic step by step, test L1 GEMM 2026-05-17 21:21:45 +00:00
a100bd11c1 Simplify pipeline test: BF16 ref + bridge ref + full runner 2026-05-17 21:20:41 +00:00
6eade5e7f8 Fix: gs values are floats not tensors 2026-05-17 21:19:47 +00:00
b05a38a9bd Test stages 1-2 first: sort + L1 GEMM 2026-05-17 21:19:23 +00:00
9728604ea1 Pipeline test: stage-by-stage with BF16 reference comparison 2026-05-17 21:19:17 +00:00
7fff5fd39b Fix: correct intermediate_size=3072, weight key prefix, dequantize shapes 2026-05-17 21:18:20 +00:00
4ef345773d Rewrite pipeline test: load real weights, step-by-step vs BF16 reference 2026-05-17 21:17:18 +00:00
b43541afdd Fix test path setup 2026-05-17 21:00:00 +00:00
490ddfa294 Pipeline test: use synthetic weights at 256x512 (JIT at 7168x18432 hangs for hours) 2026-05-17 20:58:06 +00:00
c1bb551446 Fix weight loading: skip already-loaded experts correctly 2026-05-17 18:15:51 +00:00
955d7533f2 Use system Python for pipeline test (CuTeDSL in system site-packages) 2026-05-17 18:13:42 +00:00
925e390b93 Fix import: use direct import from vllm/ subdirectory 2026-05-17 18:12:53 +00:00
cd6144b832 Fix imports: all functions are in cutedsl.bridge, not separate modules 2026-05-17 18:11:03 +00:00
5e63a0d8a3 Rewrite pipeline test: use raw checkpoint weights, compare runner vs dynamic-gs reference 2026-05-17 18:10:05 +00:00
e51eafe288 Rewrite pipeline test: compare runner vs reference with real weights, step-by-step 2026-05-17 18:08:33 +00:00
e38d60a6e8 Add pipeline test with real model weights, add swiglu_limit to reference moe_pipeline 2026-05-17 18:07:44 +00:00
87a223f1ac Update CURRENT_BUG.md: current status, outstanding garbage output issue, hypotheses 2026-05-17 16:52:40 +00:00
33e28100ee test: use runner's built-in warmup method 2026-05-17 08:24:27 +00:00