Commit Graph

74 Commits

Author SHA1 Message Date
00fe63b56f Fix compile test: add warmup for activation global scales 2026-05-19 01:57:16 +00:00
bba3bca4d3 Add torch.compile + custom op integration test 2026-05-19 01:56:46 +00:00
35fab6cff3 Replace autograd.Function with torch.library.custom_op for Dynamo compat
Dynamo (torch.compile fullgraph) cannot trace through CuTeDSL internals
(cute.compile, JIT, etc.). The autograd.Function approach was unreliable
with fullgraph mode — Dynamo would still try to trace through it.

Fix: torch.library.custom_op makes Dynamo treat our GEMM as an opaque
black box. No reimplementing the kernel — just route through the existing
runner via a registry pattern:
  - Runners registered in global dict with integer IDs
  - Custom op takes (tensors, runner_id, shape_hint) -> tensor
  - Dynamo calls fake impl for shape inference, never touches the runner
  - At execution time, real impl looks up runner and calls _run_impl

Changes:
  - New: cutedsl/custom_ops.py (custom op definitions + registry)
  - New: tests/test_custom_op.py (local unit tests, no GPU needed)
  - Removed: _Nvfp4LinearApply, _MoEApply (autograd.Function classes)
  - Updated: nvfp4_linear.py, runner.py, cutedsl.py, nvfp4_cutedsl.py
    to use custom ops instead of autograd.Function
  - Updated: cutedsl_quant_method.py to use custom op + registry
2026-05-19 01:54:48 +00:00
f74447bfd0 Proper NVFP4 integration: quantized compressor/indexer + mapper fixes
Weight mapper fixes:
- Reorder substr renames: compressor renames first, then .self_attn.compressor.
  → .attn.mla_attn.compressor., then indexer renames (so indexer keys end up
  under mla_attn after the compressor rename already fired)
- Add compressor param renames: kv_proj→wkv, gate_proj→wgate, kv_norm→norm,
  position_bias→ape (checkpoint uses NVFP4 naming, model uses internal names)
- Add indexer param renames: q_b_proj→wq_b, kv_proj→compressor.wkv,
  gate_proj→compressor.wgate, kv_norm→k_norm, position_bias→compressor.ape,
  weights_proj stays (structural: compressor.indexer → indexer.compressor)
- Remove broken suffix renames (already fixed in prior commit)

Model architecture fixes:
- Patch deepseek_compressor.py to pass quant_config (was None, but NVFP4
  checkpoint has quantized compressor weights with input_scale/weight_scale)
- Patch deepseek_v4_attention.py indexer: weights_proj now uses quant_config
  (was None, but checkpoint has quantized weights)
- Add indexer.compressor.fused_wkv_wgate stacking in load_weights

Infrastructure:
- Add deepseek_compressor.py to Dockerfile
- Force MoE backend to flashinfer_cutedsl (was auto-selecting FLASHINFER_TRTLLM)
- Update unit test to 50 cases (compressor + indexer + quantization scales)
2026-05-18 23:20:13 +00:00
17496b2615 Fix NVFP4 weights mapper: add prefix mappings, fix substr order
- Add orig_to_new_prefix mappings (layers→model.layers, embed_tokens→model.embed_tokens, etc.)
  AutoWeightsLoader strips the model. prefix before the mapper runs, so these are required
- Move .self_attn.compressor. → .attn.mla_attn.compressor. before .self_attn. → .attn.
  in substr_renames so compressor keys get the mla_attn prefix before the general rename
- Remove suffix renames (head.weight→lm_head.weight, embed.weight→embed_tokens.weight)
  that were causing double-mapping since the NVFP4 checkpoint already uses lm_head/embed_tokens
- Add unit test: tests/test_nvfp4_mapper.py (39 cases, no vLLM/CUDA needed)
2026-05-18 23:03:34 +00:00
6ce6a47be9 Add NVFP4 linear runner + attention projection test
- CuTeDSLNvfp4Linear: generic single-GEMM runner for any NVFP4 projection
- test_attention.py: tests q_a_proj, q_b_proj, kv_proj, o_b_proj vs BF16
- Same pad+swizzle pattern as shared expert, but no SiLU/fusion
2026-05-18 20:14:03 +00:00
f07643791e Fix hidden_size: shared expert uses 7168, not HC_DIM 28672 2026-05-18 20:10:32 +00:00
c1aa4af123 Shared expert: dedicated CuTeDSL runner with proper scale assembly
- CuTeDSLSharedExpertRunner: num_groups=1 GEMM, no scatter/routing
- _assemble_scales_single_group: pad to 128 rows + Blackwell swizzle
- All buffers pre-allocated for cudagraph compatibility
- Updated test to use dedicated runner instead of MoE runner hack
2026-05-18 20:08:34 +00:00
e8b289e30d WIP: CuTeDSL shared expert kernel
Dedicated runner (shared_expert_pipeline.py) and test (test_shared_expert.py).
Tried reusing MoE runner with 1 expert — fails because MoE runner assumes
hidden_size != HC_DIM for scatter. Need dedicated runner with correct
scale assembly. Will continue tomorrow.
2026-05-18 20:02:19 +00:00
bedcfc4dab Pipeline test: use max_num_tokens=8192 matching vLLM 2026-05-17 23:04:44 +00:00
c45364b3a8 Add MoE scale ratio output 2026-05-17 22:58:27 +00:00
bf99ad49ec Print both MoE and residual cosine 2026-05-17 22:56:56 +00:00
8637020487 Fix multi-layer test: add residual connections 2026-05-17 22:55:40 +00:00
11dce13afe Add multi-layer pipeline test to check error accumulation 2026-05-17 22:53:28 +00:00
72628fb689 Full pipeline test: runner vs BF16 reference 2026-05-17 21:29:16 +00:00
2796bd81e8 Fix: scatter FP4 as uint8 (float4 doesn't support index_put) 2026-05-17 21:28:04 +00:00
364f8372bb Fix FP4 buffer shapes: D//2 for packed dimensions 2026-05-17 21:26:46 +00:00
5e4d674736 Test fix: quantize slot_hidden, scatter FP4, pass slot_x_sf 2026-05-17 21:25:58 +00:00
4d0b6d889d Set runner weights before _ensure_stacked 2026-05-17 21:22:50 +00:00
b7acac5e4e Call _ensure_stacked() before using runner buffers 2026-05-17 21:22:30 +00:00
1acf01fc1a Fix token_indices: repeat each token ID top_k times, not arange 2026-05-17 21:22:11 +00:00
a478ca4746 Debug: trace runner logic step by step, test L1 GEMM 2026-05-17 21:21:45 +00:00
a100bd11c1 Simplify pipeline test: BF16 ref + bridge ref + full runner 2026-05-17 21:20:41 +00:00
6eade5e7f8 Fix: gs values are floats not tensors 2026-05-17 21:19:47 +00:00
b05a38a9bd Test stages 1-2 first: sort + L1 GEMM 2026-05-17 21:19:23 +00:00
9728604ea1 Pipeline test: stage-by-stage with BF16 reference comparison 2026-05-17 21:19:17 +00:00
7fff5fd39b Fix: correct intermediate_size=3072, weight key prefix, dequantize shapes 2026-05-17 21:18:20 +00:00
4ef345773d Rewrite pipeline test: load real weights, step-by-step vs BF16 reference 2026-05-17 21:17:18 +00:00
b43541afdd Fix test path setup 2026-05-17 21:00:00 +00:00
490ddfa294 Pipeline test: use synthetic weights at 256x512 (JIT at 7168x18432 hangs for hours) 2026-05-17 20:58:06 +00:00
c1bb551446 Fix weight loading: skip already-loaded experts correctly 2026-05-17 18:15:51 +00:00
955d7533f2 Use system Python for pipeline test (CuTeDSL in system site-packages) 2026-05-17 18:13:42 +00:00
925e390b93 Fix import: use direct import from vllm/ subdirectory 2026-05-17 18:12:53 +00:00
cd6144b832 Fix imports: all functions are in cutedsl.bridge, not separate modules 2026-05-17 18:11:03 +00:00
5e63a0d8a3 Rewrite pipeline test: use raw checkpoint weights, compare runner vs dynamic-gs reference 2026-05-17 18:10:05 +00:00
e51eafe288 Rewrite pipeline test: compare runner vs reference with real weights, step-by-step 2026-05-17 18:08:33 +00:00
e38d60a6e8 Add pipeline test with real model weights, add swiglu_limit to reference moe_pipeline 2026-05-17 18:07:44 +00:00
87a223f1ac Update CURRENT_BUG.md: current status, outstanding garbage output issue, hypotheses 2026-05-17 16:52:40 +00:00
33e28100ee test: use runner's built-in warmup method 2026-05-17 08:24:27 +00:00
8c9a51e006 fix: call _ensure_stacked in warmup test 2026-05-17 08:07:09 +00:00
5ba77e355f test: warmup gs computation with safety margin sweep 2026-05-17 08:06:27 +00:00
37fecb588f fix: separate L1/L2 scale buffers (different K_sf), fix assembly calls 2026-05-17 07:43:05 +00:00
8dadd9a723 test: scale assembly debug 2026-05-17 07:37:47 +00:00
7b95e76723 test: runner vs pipeline comparison + scale assembly comparison 2026-05-17 07:33:20 +00:00
cc75a55bd9 restore: new bridge/moe_pipeline/layertest 2026-05-16 19:55:19 +00:00
0c878b3a9e temp: restore old layertest+bridge for cosine comparison 2026-05-16 19:54:04 +00:00
d15c43294b fix: test L2 weight N dim should be hidden_size, not hidden_size//2 2026-05-16 19:07:36 +00:00
28788c6f55 fix: L1 weight N dimension is 2*intermediate (gate+up), not intermediate
float4_e2m1fn_x2 packs 2 values per byte along K, not N.
The GEMM output N dimension is the logical N from mat_b.shape[2],
not 2x packed. Previous n_dim*2 was wrong — it accidentally worked
in the test because intermediate_size*2 == 2*intermediate_size.
Real model with N=9216 exposed the bug.
2026-05-16 19:07:08 +00:00
54c470e535 fix: use float16->float8 cast for rand_sf (torch.rand doesn't support float8) 2026-05-16 18:13:14 +00:00
f2de95c526 fix: use randint for float4 dummy weights in cudagraph test 2026-05-16 18:08:45 +00:00