Commit Graph

116 Commits

Author SHA1 Message Date
be8566a443 Add decode vs prefill consistency test 2026-05-19 16:00:33 +00:00
2ddd3d0702 Test with all 61 layers (shared experts only) 2026-05-19 15:55:41 +00:00
842e6e1381 Fix view→reshape for non-contiguous tensor 2026-05-19 15:54:40 +00:00
f0f8d8211b Add e2e decode test (3 layers: C128A, C4A, SWA) 2026-05-19 15:53:29 +00:00
6ceb05327f Add blackwell_attention module and comprehensive test 2026-05-19 15:30:29 +00:00
85c74e5932 Fix attention for decode (1 query vs N cached KVs) 2026-05-19 15:28:52 +00:00
85099c7e75 Fix fp8 amax in decode test 2026-05-19 15:28:17 +00:00
c66b0b88c0 Add decode attention pipeline test — reproduces KV cache bug 2026-05-19 15:27:55 +00:00
8e6721917e Fix syntax in RoPE KV test 2026-05-19 10:31:07 +00:00
cbf440f75a Add RoPE KV test 2026-05-19 10:28:15 +00:00
dd7f2627e8 Add full model forward test (WIP), sparse attention test passes 2026-05-19 09:04:19 +00:00
9781953509 Add CSA/HCA sparse attention kernel test 2026-05-19 09:02:12 +00:00
d60673864a Fix kv_ref transpose in KV cache test 2026-05-19 08:58:46 +00:00
c1099d76d2 Add KV cache kernel test - fp8 quantize/dequant, paged cache, CSA/HCA compression 2026-05-19 08:57:31 +00:00
c54ddbdae1 Fix NVFP4 attention: slice output to actual N after 128-padding 2026-05-19 08:55:31 +00:00
42285b6c24 Add CuTeDSL NVFP4 attention kernel test - Q×K^T GEMM 2026-05-19 08:54:59 +00:00
9465929e6e Add DeepSeek-V4 CSA/HCA attention pipeline test (not MLA) 2026-05-19 08:51:16 +00:00
d08a457829 Fix cos_sin cache shape in NVFP4 attention test 2026-05-19 08:38:55 +00:00
7dd8871e84 Add NVFP4 attention test - quantize Q and K for Q×K^T GEMM 2026-05-19 08:38:25 +00:00
3de75c4e37 Add CSA/HCA attention kernel (PyTorch SDPA, Blackwell-safe)
Replaces vLLM's broken FlashMLA sparse attention which doesn't work on
SM100 (Blackwell). Uses torch.nn.functional.scaled_dot_product_attention
which works on all GPUs.

Architecture:
- CSA (C128A): Batched sparse gather + SDPA on top-k positions
- HCA (C4A): Same with compressed KV + per-layer indexer
- SWA: Sliding window attention
- Full reference: standard SDPA for testing without compression

Also adds test_csa_attention_b200.py to verify the full attention path.
2026-05-19 07:58:10 +00:00
65f48be38c Add attention path test: pinpoint FlashMLA failure 2026-05-19 07:54:01 +00:00
04ad6409e5 Rewrite test: diagnose whether warmup gs matters at inference time 2026-05-19 07:49:41 +00:00
496848e158 Fix ffn_hc.scale key name 2026-05-19 07:48:09 +00:00
5a4e355d3a Add model forward test: reproduce vLLM empty output outside container 2026-05-19 07:47:48 +00:00
87453a53b0 Fix checkpoint keys: attn_hc.*, compressor.*, q_a_proj/q_b_proj/kv_proj 2026-05-19 07:17:37 +00:00
f97762cc9f Fix full layer test: use correct checkpoint key names
Checkpoint uses q_a_proj/q_b_proj/kv_proj/q_a_norm — NOT the vLLM
fused names (fused_wqa_wkv, wq_b, q_norm).
2026-05-19 07:16:33 +00:00
cc48a5715e Add full layer 0 B200 test: CuTeDSL vs BF16 reference
Tests each attention/FFN projection individually against BF16 dequantized
reference, then runs full layer forward. Identifies exactly where garbage
enters the pipeline.

Key finding: checkpoint uses different names than vLLM:
- q_a_proj, q_b_proj, kv_proj (not fused_wqa_wkv)
- q_a_norm (not q_norm)
- compressor.* (C4A layers only)
- sinks (attn_sink)
2026-05-19 07:14:58 +00:00
199efe0871 Fix dims: o_groups=16, o_lora_rank=1024 from config 2026-05-19 06:37:25 +00:00
b4fee70151 Fix device mismatch in test 2026-05-19 06:36:22 +00:00
6b4b9774d1 Add B200 test: prove O-projection root cause + validate fix 2026-05-19 06:32:54 +00:00
77baca668e Patch attention forward: BF16 inv RoPE + BMM wo_a + NVFP4 wo_b
The original attention forward uses fused_inv_rope_fp8_quant +
deepseek_v4_fp8_einsum which requires wo_a to have FP8 weights
and weight_scale_inv. Our checkpoint has wo_a in BF16, so the
original path crashes (produces empty output).

Replace O projection with:
1. _apply_inv_rope_bf16: pure PyTorch inverse RoPE (no FP8)
2. BMM grouped linear for wo_a (BF16)
3. NVFP4 wo_b via CuTeDSL

Also fixes activation global scale bug from previous commit:
- input_global_scale_inv IS the activation gs, don't re-invert
- w13_input_scale_orig (after undoing convert) IS the MoE gs

Test: tests/test_o_projection.py validates inv RoPE roundtrip
and wo_a BMM correctness.
2026-05-19 06:30:18 +00:00
c289c44920 Fix BF16 wo_a: per-group BMM instead of flat linear
The BF16 wo_a path was calling self.wo_a(o_inv.reshape(num_tokens, -1))
which flattens across groups: (num_tokens, n_local_heads*head_dim)=(tokens, 8192).
But wo_a is a BMM with in_features=n_heads*head_dim/n_groups=4096.

The FP8 path handles this via einsum 'bhr,hdr->bhd' with per-group shapes.
The BF16 path now does the same: reshape o_inv to per-group format,
do torch.bmm, then reshape output and handle TP all-gather manually.
2026-05-19 04:10:02 +00:00
6f9a400ae0 Fix hc_head mapping: checkpoint uses hc_head.hc_fn, model params are flat hc_head_fn
- Removed hc_head prefix mapping (checkpoint already has model.hc_head.*)
- Fixed substr: hc_head.hc_fn→hc_head_fn (not hc_head.fn→hc_head_fn)
- The model has self.hc_head_fn as flat params, not inside a sub-module
2026-05-19 03:58:25 +00:00
4cf5b8b751 Fix compressor path: attn.mla_attn.compressor (not attn.compressor)
The compressor is inside mla_attn, not directly on the attention wrapper.
Debug output confirmed: layers.0.attn.mla_attn.compressor.fused_wkv_wgate.*
2026-05-19 03:47:26 +00:00
fece06f746 Add unit tests for NVFP4 weight mapper and inverse RoPE BF16 2026-05-19 03:22:00 +00:00
b856ee9315 Clean up debug scripts 2026-05-19 02:47:29 +00:00
8fe5546bb3 Fix debug script 2026-05-19 02:43:17 +00:00
788f0aa65a Add step-by-step debug for wo_a 2026-05-19 02:43:05 +00:00
77e4970d93 Add debug script for wo_a quantization 2026-05-19 02:40:43 +00:00
80122b850b Add debug script for wo_a 2026-05-19 02:39:55 +00:00
ae233ab648 Fix test: cos_sin_cache on CUDA device 2026-05-19 02:37:50 +00:00
882d4996ff Replace DeepGEMM fp8_einsum with CuTeDSL NVFP4 for wo_a (o_proj)
The B200 container crashes in DeepGEMM's fp8_einsum (t.dim() == N assertion
in layout.hpp:39) when processing wo_a (o-projection first half) in the
attention layer. The crash is caused by scale tensor dimension mismatch
for the SM100 recipe (1, 1, 128).

Instead of fighting DeepGEMM, replace the entire wo_a path with our own
CuTeDSL NVFP4 kernel:

1. inverse_rope_bf16() — Python implementation of inverse RoPE
   (replaces fused_inv_rope_fp8_quant CUDA kernel)
2. CuTeDSLNvfp4WoA — NVFP4 grouped linear for wo_a using
   ScaledGroupedGemm with n_local_groups=8 groups
3. wo_a weight quantized to NVFP4 instead of FP8 (native NVFP4,
   no conversion to another quantization)

Changes:
- cutedsl/inverse_rope.py: BF16 inverse RoPE (conjugate rotation)
- cutedsl/wo_a_grouped_linear.py: CuTeDSL NVFP4 grouped GEMM for wo_a
- vllm/patches/deepseek_v4_attention.py: Use NVFP4 path when runner
  is initialized, keep DeepGEMM fallback
- vllm/patches/deepseek_v4.py: Init NVFP4 runner instead of FP8 quant
- tests/test_wo_a.py: Unit test for inverse RoPE + wo_a GEMM
2026-05-19 02:36:30 +00:00
00fe63b56f Fix compile test: add warmup for activation global scales 2026-05-19 01:57:16 +00:00
bba3bca4d3 Add torch.compile + custom op integration test 2026-05-19 01:56:46 +00:00
35fab6cff3 Replace autograd.Function with torch.library.custom_op for Dynamo compat
Dynamo (torch.compile fullgraph) cannot trace through CuTeDSL internals
(cute.compile, JIT, etc.). The autograd.Function approach was unreliable
with fullgraph mode — Dynamo would still try to trace through it.

Fix: torch.library.custom_op makes Dynamo treat our GEMM as an opaque
black box. No reimplementing the kernel — just route through the existing
runner via a registry pattern:
  - Runners registered in global dict with integer IDs
  - Custom op takes (tensors, runner_id, shape_hint) -> tensor
  - Dynamo calls fake impl for shape inference, never touches the runner
  - At execution time, real impl looks up runner and calls _run_impl

Changes:
  - New: cutedsl/custom_ops.py (custom op definitions + registry)
  - New: tests/test_custom_op.py (local unit tests, no GPU needed)
  - Removed: _Nvfp4LinearApply, _MoEApply (autograd.Function classes)
  - Updated: nvfp4_linear.py, runner.py, cutedsl.py, nvfp4_cutedsl.py
    to use custom ops instead of autograd.Function
  - Updated: cutedsl_quant_method.py to use custom op + registry
2026-05-19 01:54:48 +00:00
f74447bfd0 Proper NVFP4 integration: quantized compressor/indexer + mapper fixes
Weight mapper fixes:
- Reorder substr renames: compressor renames first, then .self_attn.compressor.
  → .attn.mla_attn.compressor., then indexer renames (so indexer keys end up
  under mla_attn after the compressor rename already fired)
- Add compressor param renames: kv_proj→wkv, gate_proj→wgate, kv_norm→norm,
  position_bias→ape (checkpoint uses NVFP4 naming, model uses internal names)
- Add indexer param renames: q_b_proj→wq_b, kv_proj→compressor.wkv,
  gate_proj→compressor.wgate, kv_norm→k_norm, position_bias→compressor.ape,
  weights_proj stays (structural: compressor.indexer → indexer.compressor)
- Remove broken suffix renames (already fixed in prior commit)

Model architecture fixes:
- Patch deepseek_compressor.py to pass quant_config (was None, but NVFP4
  checkpoint has quantized compressor weights with input_scale/weight_scale)
- Patch deepseek_v4_attention.py indexer: weights_proj now uses quant_config
  (was None, but checkpoint has quantized weights)
- Add indexer.compressor.fused_wkv_wgate stacking in load_weights

Infrastructure:
- Add deepseek_compressor.py to Dockerfile
- Force MoE backend to flashinfer_cutedsl (was auto-selecting FLASHINFER_TRTLLM)
- Update unit test to 50 cases (compressor + indexer + quantization scales)
2026-05-18 23:20:13 +00:00
17496b2615 Fix NVFP4 weights mapper: add prefix mappings, fix substr order
- Add orig_to_new_prefix mappings (layers→model.layers, embed_tokens→model.embed_tokens, etc.)
  AutoWeightsLoader strips the model. prefix before the mapper runs, so these are required
- Move .self_attn.compressor. → .attn.mla_attn.compressor. before .self_attn. → .attn.
  in substr_renames so compressor keys get the mla_attn prefix before the general rename
- Remove suffix renames (head.weight→lm_head.weight, embed.weight→embed_tokens.weight)
  that were causing double-mapping since the NVFP4 checkpoint already uses lm_head/embed_tokens
- Add unit test: tests/test_nvfp4_mapper.py (39 cases, no vLLM/CUDA needed)
2026-05-18 23:03:34 +00:00
6ce6a47be9 Add NVFP4 linear runner + attention projection test
- CuTeDSLNvfp4Linear: generic single-GEMM runner for any NVFP4 projection
- test_attention.py: tests q_a_proj, q_b_proj, kv_proj, o_b_proj vs BF16
- Same pad+swizzle pattern as shared expert, but no SiLU/fusion
2026-05-18 20:14:03 +00:00
f07643791e Fix hidden_size: shared expert uses 7168, not HC_DIM 28672 2026-05-18 20:10:32 +00:00
c1aa4af123 Shared expert: dedicated CuTeDSL runner with proper scale assembly
- CuTeDSLSharedExpertRunner: num_groups=1 GEMM, no scatter/routing
- _assemble_scales_single_group: pad to 128 rows + Blackwell swizzle
- All buffers pre-allocated for cudagraph compatibility
- Updated test to use dedicated runner instead of MoE runner hack
2026-05-18 20:08:34 +00:00