Commit Graph

388 Commits

Author SHA1 Message Date
5a4e355d3a Add model forward test: reproduce vLLM empty output outside container 2026-05-19 07:47:48 +00:00
f5ce728ef2 Fix OOM: add --max-model-len=876544 + revert CPU dummy weight
The CPU dummy weight broke torch.mm(compressor.weight.T) which expects
GPU tensors. Instead, reduce max_model_len to fit KV cache within
available memory (876544 instead of 1048576).
2026-05-19 07:35:43 +00:00
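Note: a rough sanity check of the new limit, as a sketch only. It assumes KV cache memory scales linearly with max_model_len, and that the 10.28 GiB / 9.36 GiB figures quoted in commit 79a41d9197 were measured at the old length of 1048576. Neither assumption is stated in the commit messages.

```python
# Rough estimate only: assumes KV cache size is proportional to max_model_len.
OLD_MAX_LEN = 1_048_576
NEW_MAX_LEN = 876_544
KV_NEEDED_GIB = 10.28     # requirement reported in commit 79a41d9197
KV_AVAILABLE_GIB = 9.36   # free memory reported in the same commit


def scaled_kv_gib(needed_gib: float, old_len: int, new_len: int) -> float:
    """Scale a KV cache requirement linearly with the context length."""
    return needed_gib * new_len / old_len


estimate = scaled_kv_gib(KV_NEEDED_GIB, OLD_MAX_LEN, NEW_MAX_LEN)
fits = estimate < KV_AVAILABLE_GIB
print(f"length ratio: {NEW_MAX_LEN / OLD_MAX_LEN:.4f}")
print(f"estimated KV cache: {estimate:.2f} GiB, fits: {fits}")
```

Under these assumptions the reduced length leaves some headroom below the available memory; the real allocator may round block counts differently.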
79a41d9197 Save ~5-8 GiB GPU VRAM: move dummy weight to CPU
The CuTeDSL kernel never reads layer.weight — it uses the runner's
pre-processed fp4/sf/gs tensors. The dummy BF16 weight exists only for
vLLM model introspection. Moving it to CPU saves massive VRAM:
- q_b_proj alone: 65536*1536*2 = 192 MiB on GPU → ~0 MiB
- All layers combined: ~5-8 GiB saved

This should fix the KV cache OOM (needed 10.28 GiB, had 9.36 GiB).
2026-05-19 07:29:38 +00:00
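Note: the q_b_proj figure above can be reproduced with a few lines of arithmetic. The helper name `bf16_weight_mib` is made up for this illustration; only the shape 65536 x 1536 and the 2-byte BF16 element size come from the commit message.

```python
BYTES_PER_BF16 = 2
MIB = 1024 ** 2


def bf16_weight_mib(out_features: int, in_features: int) -> float:
    """Size in MiB of a dense BF16 weight of shape (out_features, in_features)."""
    return out_features * in_features * BYTES_PER_BF16 / MIB


q_b_proj_mib = bf16_weight_mib(65536, 1536)
print(f"q_b_proj dummy weight: {q_b_proj_mib:.0f} MiB")
```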
cebc586014 Fix OOM: use 1-token warmup sample + free immediately
8 tokens * 7168 hidden * ~40 NVFP4 layers is only ~2.3M elements in total (~4.4 MiB if held in BF16).
But the dummy weight param (out_features * in_features * 2 bytes BF16) was
the real killer — each layer allocated a BF16 dummy of its full weight shape.
With 1 token the warmup still gets a valid gs, and empty_cache frees the
sample tensor before KV cache allocation.
2026-05-19 07:28:57 +00:00
5122cadc94 Update CURRENT_BUG.md: root cause found + fix committed 2026-05-19 07:21:30 +00:00
6e6f95dfa8 FIX: Use warmup-based activation global scale in CuTeDSL linear kernel
The checkpoint's input_scale is a calibration-time value that doesn't
match what quantize_activation_nvfp4 expects at runtime. Using it as
the activation global scale produces garbage output (empty EOS tokens).

The fix: run a warmup forward pass with sample data and compute the
activation global scale from the actual activation distribution, exactly
like our standalone test does (which passes with cosine >= 0.994).

This is the root cause of the vLLM server returning empty content.
2026-05-19 07:21:07 +00:00
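Note: the "cosine >= 0.994" criterion mentioned above can be sketched as a plain cosine-similarity check between a reference output and the quantized kernel's output. The vectors below are made-up illustration data, not values from the standalone test, and `cosine_similarity` is a hypothetical helper.

```python
import math

COSINE_THRESHOLD = 0.994  # pass criterion quoted in the commit message


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up data: a BF16-style reference and a slightly perturbed output.
reference = [0.5, -1.25, 2.0, 0.75]
quantized = [0.52, -1.2, 1.95, 0.8]

score = cosine_similarity(reference, quantized)
passed = score >= COSINE_THRESHOLD
print(f"cosine = {score:.5f}, passed = {passed}")
```

An output with the wrong global scale would typically fall far below the threshold, which is what makes this a convenient smoke test.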
0a7769972f Fix garbled shared_expert_pipeline.py: imports/class were merged 2026-05-19 07:18:10 +00:00
87453a53b0 Fix checkpoint keys: attn_hc.*, compressor.*, q_a_proj/q_b_proj/kv_proj 2026-05-19 07:17:37 +00:00
f97762cc9f Fix full layer test: use correct checkpoint key names
Checkpoint uses q_a_proj/q_b_proj/kv_proj/q_a_norm — NOT the vLLM
fused names (fused_wqa_wkv, wq_b, q_norm).
2026-05-19 07:16:33 +00:00
cc48a5715e Add full layer 0 B200 test: CuTeDSL vs BF16 reference
Tests each attention/FFN projection individually against BF16 dequantized
reference, then runs full layer forward. Identifies exactly where garbage
enters the pipeline.

Key finding: checkpoint uses different names than vLLM:
- q_a_proj, q_b_proj, kv_proj (not fused_wqa_wkv)
- q_a_norm (not q_norm)
- compressor.* (C4A layers only)
- sinks (attn_sink)
2026-05-19 07:14:58 +00:00
dbaa3d6fe6 Update CURRENT_BUG.md and README with current state
Empty output still happening. Documented what's been tried, what works
standalone, what we don't know, and the plan to bypass vLLM's kernel
selection entirely by calling our runners directly.
2026-05-19 07:05:45 +00:00
62abf41b03 Revert deepseek_v4_attention.py to ffc2264 — don't nuke existing patches
The file at ffc2264 already had our BF16 wo_a path (_apply_inv_rope_bf16 +
BMM + all-gather) with FP8 fallback. I was replacing it from the wrong
vllm source, losing all prior work. Restored to the known-good version.
2026-05-19 06:52:40 +00:00
4c2effa2be Fix attention patch: source from v0.21.0 stable, not local clone
The local vllm clone has different imports (breakable_cudagraph) that
don't exist in the Docker image. Now sourced from v0.21.0 tag.
2026-05-19 06:44:59 +00:00
284b6a5d57 Fix attention patch: use original vllm imports, only patch forward method
Previous version copied the entire file from our local vllm clone which
had imports (breakable_cudagraph) missing from the Docker image's vllm.
Now we start from the Docker image's original file and only patch the
DeepseekV4MultiHeadLatentAttentionWrapper.forward method.
2026-05-19 06:40:58 +00:00
199efe0871 Fix dims: o_groups=16, o_lora_rank=1024 from config 2026-05-19 06:37:25 +00:00
b4fee70151 Fix device mismatch in test 2026-05-19 06:36:22 +00:00
6b4b9774d1 Add B200 test: prove O-projection root cause + validate fix 2026-05-19 06:32:54 +00:00
77baca668e Patch attention forward: BF16 inv RoPE + BMM wo_a + NVFP4 wo_b
The original attention forward uses fused_inv_rope_fp8_quant +
deepseek_v4_fp8_einsum which requires wo_a to have FP8 weights
and weight_scale_inv. Our checkpoint has wo_a in BF16, so the
original path fails and the server produces empty output.

Replace O projection with:
1. _apply_inv_rope_bf16: pure PyTorch inverse RoPE (no FP8)
2. BMM grouped linear for wo_a (BF16)
3. NVFP4 wo_b via CuTeDSL

Also fixes activation global scale bug from previous commit:
- input_global_scale_inv IS the activation gs, don't re-invert
- w13_input_scale_orig (after undoing convert) IS the MoE gs

Test: tests/test_o_projection.py validates inv RoPE roundtrip
and wo_a BMM correctness.
2026-05-19 06:30:18 +00:00
ffc2264c41 Fix activation global scale: don't double-invert input_global_scale_inv
The activation global scale = amax / (6.0 * 448.0). Both the linear
kernel and MoE kernel were taking 1.0 / (value that's already the
correct gs), inverting it and producing garbage quantization.

Linear kernel: input_global_scale_inv IS the gs, so use it directly.
MoE kernel: w13_input_scale_orig (after undoing convert inversion) IS
the gs, so use it directly.
2026-05-19 06:03:08 +00:00
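Note: a minimal numeric sketch of the double-inversion bug, using the formula from the message above. The constants 6.0 and 448.0 are the largest magnitudes of FP4 E2M1 and FP8 E4M3; the amax value is made up for illustration.

```python
FP4_MAX = 6.0    # largest magnitude representable in FP4 (E2M1)
FP8_MAX = 448.0  # largest magnitude representable in FP8 (E4M3)


def activation_global_scale(amax: float) -> float:
    """Global scale as stated in the commit: amax / (6.0 * 448.0)."""
    return amax / (FP4_MAX * FP8_MAX)


amax = 13.44  # illustrative activation absolute maximum
gs = activation_global_scale(amax)

# The bug: the stored value was already gs, but the kernels took 1.0 / value.
double_inverted = 1.0 / gs

print(f"correct gs:      {gs:.6f}")
print(f"double-inverted: {double_inverted:.1f}")
print(f"error factor:    {double_inverted / gs:.0f}x")
```

Because the inverted value is off by a factor of 1/gs^2, every quantized activation lands far outside the representable range.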
918342feeb MHC: replace monolithic layers/mhc.py with pure PyTorch
The nightly vLLM image puts ALL MHC code in layers/mhc.py (not kernels/mhc/).
It imports tilelang at top level and JIT-compiles kernels.

Replace the entire file with pure PyTorch implementations using
direct_register_custom_op for mhc_pre, mhc_post, mhc_fused_post_pre,
and hc_head_fused_kernel. No tilelang dependency at all.

Also removes the separate mhc_torch_ops.py and kernels/mhc/ patches
which don't apply to the nightly image layout.
2026-05-19 05:41:55 +00:00
dfd9c10ae9 Fix MHC import: don't import .torch from layers/mhc.py
The layers/mhc.py was trying to import kernels.mhc.torch which
failed because our __init__.py was breaking the package. Instead,
just import our mhc_torch_ops which has everything we need.

Also fix __init__.py to explicitly import mhc_pre_torch and
mhc_post_torch from .torch instead of using import *.
2026-05-19 05:36:35 +00:00
e404e18efb Also replace layers/mhc.py CustomOp dispatch
The original layers/mhc.py forward_cuda calls
torch.ops.vllm.mhc_pre_tilelang which triggers TileLang JIT.
Replace with our torch implementations in forward_cuda.
This is what the CustomOp dispatch routes through.
2026-05-19 05:31:05 +00:00
5e6d459145 Fix MHC custom op registration
Previous approach used @CustomOp.register which doesn't create
torch.ops.vllm.mhc_pre. The model code calls torch.ops.vllm.mhc_pre()
directly, which requires direct_register_custom_op.

Use direct_register_custom_op to register mhc_pre, mhc_post,
mhc_fused_post_pre, and hc_head_fused_kernel as PyTorch custom ops
with torch (eager) implementations.

Patch kernels/mhc/__init__.py to import from both .torch (original)
and .mhc_torch_ops (our replacements), skipping tilelang import.
2026-05-19 05:19:48 +00:00
9ff1679064 Replace MHC TileLang kernels with pure PyTorch
TileLang kernels (mhc_pre_big_fuse_tilelang, mhc_fused_tilelang) don't
work correctly on Blackwell SM100 and cause empty model output.

Replace with pure PyTorch implementations:
- mhc_pre_torch: Sinkhorn-normalized HC residual mixing
- mhc_post_torch: HC post block (einsum residual + post layer mix)
- mhc_fused_post_pre_torch: Fused post+pre (composition of above)
- hc_head_fused_torch: RMS norm + linear + sigmoid + weighted sum

Patch both layers/mhc.py (CustomOp dispatch) and kernels/mhc/__init__.py
(no tilelang import). Also remove tilelang from pyproject.toml deps.
2026-05-19 05:07:41 +00:00
5c770c68ca Keep MoE scale tensors: framework warmup needs them
The framework's deep_gemm_warmup calls get_fused_moe_quant_config
which accesses w13_input_scale etc. Setting them to None caused
TypeError: float / NoneType. Keep scales (small tensors) and only
free the large weight tensors.
2026-05-19 04:50:31 +00:00
e0f385ac45 Fix workspace_shapes: output dim is hidden_dim, not K*2
K comes from hidden_states.size(-1) which is the full BF16 dimension
(7168), not the packed weight dimension. K*2=14336 is wrong.
The MoE output is always hidden_dim (7168).
2026-05-19 04:42:22 +00:00
cfd8ec741d Debug: add shape mismatch logging in MoE apply 2026-05-19 04:35:58 +00:00
ffc1a5c6a8 Fix workspace_shapes: remove wrong assertion, compute output dim from K
The framework may pass K in different forms (packed or unpacked).
Use max(K*2, hidden_dim) to handle both cases.
2026-05-19 04:28:04 +00:00
f023b3b2c6 Fix: wrap dummy MoE weights in nn.Parameter
PyTorch requires a registered parameter slot to hold an nn.Parameter or None.
A plain tensor from torch.empty can't be assigned to such a slot.
2026-05-19 04:21:35 +00:00
b06dcb40dc Fix MoE w1=None crash: keep shape-preserving dummy weights on CPU
The modular kernel framework reads w1.shape[0] in its outer apply()
before delegating to our expert impl. Setting layer.w13_weight = None
caused AttributeError. Replace with shape-preserving CPU dummy tensors
to free GPU memory while keeping shape metadata accessible.
2026-05-19 04:17:10 +00:00
c289c44920 Fix BF16 wo_a: per-group BMM instead of flat linear
The BF16 wo_a path was calling self.wo_a(o_inv.reshape(num_tokens, -1))
which flattens across groups: (num_tokens, n_local_heads*head_dim)=(tokens, 8192).
But wo_a is a BMM with in_features=n_heads*head_dim/n_groups=4096.

The FP8 path handles this via einsum 'bhr,hdr->bhd' with per-group shapes.
The BF16 path now does the same: reshape o_inv to per-group format,
do torch.bmm, then reshape output and handle TP all-gather manually.
2026-05-19 04:10:02 +00:00
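Note: a pure-Python sketch of the per-group contraction 'bhr,hdr->bhd' on tiny shapes. It assumes each group's inputs are a contiguous slice of the flat row; `split_groups` and `grouped_matmul` are hypothetical helpers, not functions from the patch.

```python
def split_groups(flat_row, n_groups):
    """Reshape one flat row of width G*R into G contiguous slices of width R."""
    width = len(flat_row) // n_groups
    return [flat_row[g * width:(g + 1) * width] for g in range(n_groups)]


def grouped_matmul(x, w):
    """out[b][h][d] = sum_r x[b][h][r] * w[h][d][r], i.e. 'bhr,hdr->bhd'."""
    return [
        [
            [sum(xr * wr for xr, wr in zip(x_bh, w_row)) for w_row in w[h]]
            for h, x_bh in enumerate(x_b)
        ]
        for x_b in x
    ]


# Shapes from the commit: a flat width of 8192 with 4096 inputs per group
# means two local groups, so a single flat linear cannot be correct.
n_local_groups = 8192 // 4096

# Tiny stand-in: 1 token, 2 groups, 2 inputs and 1 output per group.
flat = [[1.0, 2.0, 3.0, 4.0]]
x = [split_groups(row, n_local_groups) for row in flat]
w = [
    [[1.0, 0.0]],  # group 0 weight: selects its first input
    [[0.0, 1.0]],  # group 1 weight: selects its second input
]
out = grouped_matmul(x, w)
print(out)
```

Each group only ever sees its own slice of the input, which is the property the flat linear call was violating.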
6f9a400ae0 Fix hc_head mapping: checkpoint uses hc_head.hc_fn, model params are flat hc_head_fn
- Removed hc_head prefix mapping (checkpoint already has model.hc_head.*)
- Fixed substr: hc_head.hc_fn→hc_head_fn (not hc_head.fn→hc_head_fn)
- The model has self.hc_head_fn as flat params, not inside a sub-module
2026-05-19 03:58:25 +00:00
909a2710e4 Fix double lm_head mapping: NVFP4 checkpoint already uses correct names
The checkpoint has lm_head.weight and model.embed_tokens.weight
already — the suffix mappings head.weight→lm_head.weight and
embed.weight→embed_tokens.weight were incorrectly applying to keys
that already had the right prefix, producing lm_lm_head.weight.
2026-05-19 03:54:14 +00:00
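Note: the double-prefix failure is easy to reproduce with a naive suffix rename. `rename_suffix` below is a hypothetical stand-in for the mapper, not vLLM code; only the key names come from the commit message.

```python
def rename_suffix(key: str, suffix_map: dict) -> str:
    """Replace the first matching suffix of key, without checking the prefix."""
    for old, new in suffix_map.items():
        if key.endswith(old):
            return key[: -len(old)] + new
    return key


buggy_map = {"head.weight": "lm_head.weight"}

# A key that needs the rename is handled correctly...
renamed = rename_suffix("head.weight", buggy_map)

# ...but a key that already has the right name also ends with "head.weight".
doubled = rename_suffix("lm_head.weight", buggy_map)

# With the mapping dropped, the already-correct key passes through untouched.
untouched = rename_suffix("lm_head.weight", {})

print(renamed, doubled, untouched)
```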
4cf5b8b751 Fix compressor path: attn.mla_attn.compressor (not attn.compressor)
The compressor is inside mla_attn, not directly on the attention wrapper.
Debug output confirmed: layers.0.attn.mla_attn.compressor.fused_wkv_wgate.*
2026-05-19 03:47:26 +00:00
9d41419e9f Debug: print compressor params to diagnose KeyError 2026-05-19 03:44:40 +00:00
db5192fe41 Patch from Docker image's vLLM (0.20.2rc1) instead of newer upstream
The nightly Docker image uses an older vLLM that doesn't have
NormGateLinear, breakable_cudagraph, etc. Patching the Docker
image's own files ensures compatibility.

- deepseek_v4.py: Patches from Docker image + NVFP4 mapper + wo_a BF16
- deepseek_v4_attention.py: Patches from Docker image + inv rope BF16
  + weights_proj quant + removed QuantFP8/GroupShape imports
2026-05-19 03:35:15 +00:00
df5a496f5d Fix: make eager_break_during_capture import conditional for older vLLM 2026-05-19 03:29:05 +00:00
4ed91b81d0 Fix inverse RoPE formula: swap signs on cross terms 2026-05-19 03:22:10 +00:00
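Note: a sketch of the sign convention for one rotated pair. It assumes the forward RoPE rotates each (x1, x2) pair by an angle theta; the inverse is the rotation by -theta, which keeps the cosine terms and flips the sign of the sine cross terms. How elements are paired in the real kernel is not shown in this log.

```python
import math


def rope_pair(x1: float, x2: float, theta: float):
    """Forward rotation of one pair by theta."""
    c, s = math.cos(theta), math.sin(theta)
    return x1 * c - x2 * s, x1 * s + x2 * c


def inv_rope_pair(y1: float, y2: float, theta: float):
    """Inverse rotation: same cos terms, cross (sin) terms with swapped signs."""
    c, s = math.cos(theta), math.sin(theta)
    return y1 * c + y2 * s, -y1 * s + y2 * c


x = (0.3, -1.7)   # illustrative pair
theta = 0.9       # illustrative angle
y = rope_pair(x[0], x[1], theta)
x_back = inv_rope_pair(y[0], y[1], theta)
roundtrip_error = max(abs(a - b) for a, b in zip(x, x_back))
print(f"roundtrip error: {roundtrip_error:.2e}")
```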
fece06f746 Add unit tests for NVFP4 weight mapper and inverse RoPE BF16 2026-05-19 03:22:00 +00:00
b0b5113467 Fix weight mapper: compressor → attn.compressor (not mla_attn), quant weights_proj
- The compressor is on attn.compressor (not attn.mla_attn.compressor)
- weights_proj in indexer is NVFP4-quantized in our checkpoint
2026-05-19 03:20:41 +00:00
396a83ea56 Clean vLLM integration: use official paths, BF16 wo_a, proper weight mapper
- deepseek_v4.py: Fresh upstream copy with minimal NVFP4 changes
  - wo_a uses quant_config=None (BF16 in NVFP4 checkpoint, no scales)
  - Added _make_deepseek_v4_nvfp4_weights_mapper() using official WeightsMapper API
  - Handles: self_attn→attn, mlp→ffn, gate_proj→w1, compressor renames, etc.
  - Mapper selected by quant_config.get_name() == 'modelopt_fp4'

- deepseek_v4_attention.py: Fresh upstream copy with minimal NVFP4 changes
  - Removed _wo_a_act_quant and custom CuTeDSL wo_a runner
  - Added _apply_inv_rope_bf16() helper (inverse RoPE in BF16)
  - Detects BF16 wo_a (no weight_scale_inv) and uses BF16 path
  - FP8 einsum path kept as fallback for SM90 checkpoints
  - BF16 path: inverse RoPE → wo_a() → wo_b() (standard linear methods)
2026-05-19 03:13:38 +00:00
b856ee9315 Clean up debug scripts 2026-05-19 02:47:29 +00:00
05cdde1676 Fix wo_a: scatter each group's data at correct offset in padded buffer
The grouped GEMM expects each group's tokens at their own offset range:
- Group 0: rows [0, padded_T)
- Group 1: rows [padded_T, 2*padded_T)
- etc.

Previously we wrote all groups' data contiguously starting at row 0,
so group 1+ would read zeros from the padding area. Now we scatter
each group's quantized activation at the correct offset.

Also:
- Size buffer for total_max_rows = padded_max * n_groups
- Use assemble_scales_2d_side for multi-group scale assembly
- Extract output per-group at correct offsets
2026-05-19 02:45:57 +00:00
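Note: the offset layout described above, sketched with labelled rows instead of tensors. `scatter_groups` and `gather_groups` are hypothetical helpers; the only detail taken from the commit is that group g owns rows [g * padded_T, (g + 1) * padded_T).

```python
def scatter_groups(group_rows, padded_t, pad_value=None):
    """Place group g's rows at offset g * padded_t in one padded buffer."""
    buffer = [pad_value] * (padded_t * len(group_rows))
    for g, rows in enumerate(group_rows):
        start = g * padded_t
        buffer[start:start + len(rows)] = rows
    return buffer


def gather_groups(buffer, padded_t, n_groups, n_tokens):
    """Read each group's first n_tokens rows back from its own offset range."""
    return [
        buffer[g * padded_t:g * padded_t + n_tokens] for g in range(n_groups)
    ]


# 2 groups, 3 real tokens each, padded to 4 rows per group.
groups = [["g0t0", "g0t1", "g0t2"], ["g1t0", "g1t1", "g1t2"]]
buf = scatter_groups(groups, padded_t=4)
recovered = gather_groups(buf, padded_t=4, n_groups=2, n_tokens=3)
print(buf)
```

Writing both groups contiguously from row 0 would instead leave group 1's range holding only the tail of the data plus padding.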
8fe5546bb3 Fix debug script 2026-05-19 02:43:17 +00:00
788f0aa65a Add step-by-step debug for wo_a 2026-05-19 02:43:05 +00:00
5f5b997fc3 Fix wo_a: permute to groups-first layout for grouped GEMM
The grouped GEMM expects mat_a to be laid out contiguously per group:
[all tokens for group0, all tokens for group1, ...]
A simple reshape of (T, G, D) → (T*G, D) gives interleaved layout
which is wrong. Fix: permute to (G, T, D) before flattening.
Same fix for output: permute (G, T, R) → (T, G, R).
2026-05-19 02:41:32 +00:00
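Note: the difference between the two flattenings, shown on labelled rows. This is a layout illustration only; T and G are tiny made-up sizes and each string stands for one row of width D.

```python
T, G = 3, 2  # tokens, groups (illustrative sizes)

# x[t][g] is the row for token t and group g, i.e. a (T, G, D) tensor.
x = [[f"t{t}g{g}" for g in range(G)] for t in range(T)]

# Plain reshape (T, G, D) -> (T*G, D): groups are interleaved per token.
interleaved = [x[t][g] for t in range(T) for g in range(G)]

# Permute to (G, T, D) first, then flatten: each group is contiguous.
groups_first = [x[t][g] for g in range(G) for t in range(T)]

# Output side: undo the permutation, (G, T, R) -> (T, G, R).
restored = [[groups_first[g * T + t] for g in range(G)] for t in range(T)]

print(interleaved)
print(groups_first)
```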
77e4970d93 Add debug script for wo_a quantization 2026-05-19 02:40:43 +00:00
80122b850b Add debug script for wo_a 2026-05-19 02:39:55 +00:00
ae233ab648 Fix test: cos_sin_cache on CUDA device 2026-05-19 02:37:50 +00:00
882d4996ff Replace DeepGEMM fp8_einsum with CuTeDSL NVFP4 for wo_a (o_proj)
The B200 container crashes in DeepGEMM's fp8_einsum (t.dim() == N assertion
in layout.hpp:39) when processing wo_a (o-projection first half) in the
attention layer. The crash is caused by scale tensor dimension mismatch
for the SM100 recipe (1, 1, 128).

Instead of fighting DeepGEMM, replace the entire wo_a path with our own
CuTeDSL NVFP4 kernel:

1. inverse_rope_bf16() — Python implementation of inverse RoPE
   (replaces fused_inv_rope_fp8_quant CUDA kernel)
2. CuTeDSLNvfp4WoA — NVFP4 grouped linear for wo_a using
   ScaledGroupedGemm with n_local_groups=8 groups
3. wo_a weight quantized to NVFP4 instead of FP8 (native NVFP4,
   no conversion to another quantization)

Changes:
- cutedsl/inverse_rope.py: BF16 inverse RoPE (conjugate rotation)
- cutedsl/wo_a_grouped_linear.py: CuTeDSL NVFP4 grouped GEMM for wo_a
- vllm/patches/deepseek_v4_attention.py: Use NVFP4 path when runner
  is initialized, keep DeepGEMM fallback
- vllm/patches/deepseek_v4.py: Init NVFP4 runner instead of FP8 quant
- tests/test_wo_a.py: Unit test for inverse RoPE + wo_a GEMM
2026-05-19 02:36:30 +00:00