Commit Graph

479 Commits

Author SHA1 Message Date
9cdf79fd9c wip: fused SwiGLU kernel scaffold + bridge interleave + plan
- fused_swiglu_grouped_mm.py: copy-paste of torch_scaled_grouped_mm.py with
  class rename and fused_swiglu/swiglu_limit params added
- bridge.py: added interleave_l1_weights, deinterleave_l1_weights,
  warmup_fused_swiglu_compilation
- Pure-PyTorch interleave invariant passes (A@cat vs deinterleave(A@interleave))
- Standalone GEMM interleave test fails due to kernel-internal N-tiling
  layout (expected, skipping per plan)
- FUSED_EPILOGUE_PLAN.md updated with register layout, amax shuffle plan,
  4-step implementation strategy
2026-05-20 03:04:38 +00:00
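
The interleave invariant named in the commit above (A@cat vs deinterleave(A@interleave)) can be illustrated with a minimal pure-Python sketch. It assumes the gate and up weights are interleaved column by column at granularity 1; the commit does not state the actual granularity used in bridge.py, and every helper name below is hypothetical.

```python
def matmul(a, b):
    """Plain nested-list matrix product, a: [M, K], b: [K, N]."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def cat_columns(gate, up):
    """Reference layout: all gate columns, then all up columns."""
    return [g + u for g, u in zip(gate, up)]


def interleave_columns(gate, up):
    """Assumed fused layout: g0, u0, g1, u1, ... within each row."""
    return [[v for pair in zip(g, u) for v in pair] for g, u in zip(gate, up)]


def deinterleave_columns(m):
    """Undo the interleave on an output matrix: even columns, then odd columns."""
    return [row[0::2] + row[1::2] for row in m]


a = [[1, 2], [3, 4]]
gate = [[1, 0], [0, 1]]
up = [[2, 3], [4, 5]]

reference = matmul(a, cat_columns(gate, up))
fused = deinterleave_columns(matmul(a, interleave_columns(gate, up)))
assert reference == fused
```

The check holds because permuting the columns of the weight matrix only permutes the columns of the product; a kernel that tiles N internally may need a coarser interleave granularity than this sketch uses.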
2f8b26c176 chore: remove unused _expert_id_range after bincount migration 2026-05-20 02:17:44 +00:00
7e2adb7e85 perf: replace expert counting O(n*E) comparison with torch.bincount O(n)
Bug #5 fix: (sorted_ids.unsqueeze(1) == expert_id_range.unsqueeze(0)).sum(dim=0)
materializes a (num_slots × num_experts) bool tensor every forward — 48K × 384 = 18M
elements. torch.bincount(sorted_ids, minlength=num_experts) gives the same result
in O(n) with no intermediate allocation. ~200× less work.

Also removes the now-unused _expert_id_range buffer.
2026-05-20 02:17:23 +00:00
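
The difference between the two counting strategies in the commit above can be sketched in pure Python. This is an illustration of the idea only, not the repository's torch code; the function names are hypothetical.

```python
def count_by_comparison(sorted_ids, num_experts):
    """O(n * E): build an n x E boolean matrix, then sum each column."""
    matrix = [[slot == e for e in range(num_experts)] for slot in sorted_ids]
    return [sum(row[e] for row in matrix) for e in range(num_experts)]


def count_by_bincount(sorted_ids, num_experts):
    """O(n): a single pass, analogous to
    torch.bincount(sorted_ids, minlength=num_experts)."""
    counts = [0] * num_experts
    for slot in sorted_ids:
        counts[slot] += 1
    return counts


sorted_ids = [0, 0, 1, 3, 3, 3]
assert count_by_comparison(sorted_ids, 4) == count_by_bincount(sorted_ids, 4)
```

Both functions return the same per-expert counts; only the second avoids the intermediate n x E matrix.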
d59b10e170 fix: zero out x_norm for underflow blocks before division in NVFP4 quantization
Bug #4 fix: When a block has amax > 0 but amax/6 underflows to 0 in
FP8 (amax < 6*2^-9 ≈ 0.0117), the block scale is 0, but the division
x / clamp(0, 1e-8) inflates x into nonzero FP4 buckets (up to ±6.0).
This produces semantically wrong FP4 even though dequant gives 0 (6*0=0).

Root cause: we only detected truly-zero blocks (amax == 0) but not
underflow blocks (0 < amax < FP8_threshold). The fix:

1. Detect both zero and underflow blocks: block_amax < 6 * 2^-9
2. Zero out x_reshaped for these blocks BEFORE division
3. Force FP8 scale to 0 for these blocks

This ensures x_scaled = 0 → FP4 nibbles = 0 → dequant = 0.
Verified: bug scenario now produces nibble=0, scale=0.
Checkpoint byte match remains 100%.
2026-05-20 02:16:49 +00:00
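
The three-step fix in the commit above can be sketched for a single block as follows. This is a simplified pure-Python illustration: it uses the 6 * 2^-9 threshold from the commit message directly and does not model the FP8 rounding of the scale or the FP4 rounding of the values. The function name is hypothetical.

```python
FP4_MAX = 6.0                               # largest FP4 magnitude, per the commit message
UNDERFLOW_THRESHOLD = FP4_MAX * 2.0 ** -9   # 6 * 2^-9 = 0.01171875


def scale_block(block):
    """Return (scale, scaled_values) for one non-empty block, before FP4 rounding."""
    amax = max(abs(v) for v in block)
    if amax < UNDERFLOW_THRESHOLD:
        # Covers both truly-zero blocks and underflow blocks:
        # zero the values before any division and force the scale to 0.
        return 0.0, [0.0] * len(block)
    scale = amax / FP4_MAX
    return scale, [v / scale for v in block]


# An underflow block: amax = 0.002 is nonzero but below the threshold.
assert scale_block([0.001, -0.002]) == (0.0, [0.0, 0.0])
```

Without the early return, dividing 0.002 by a clamped scale of 1e-8 would yield a value far above 6.0, which is the inflation into nonzero FP4 buckets that the commit describes.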
c8fa87fac7 fix: detect zero blocks in NVFP4 quantization, force FP4+FP8 to exact zero
Bug #3 fix: The clamp(min=1e-8) on block_amax prevented NaN from 0/0
but allowed truly-zero blocks to get a nonzero FP8 scale (5e-12 from
underflow). While the kernel produces 0 * 0 = 0 (no NaN), the nonzero
scale is semantically wrong and could interact badly with future kernels.

Fix: detect zero blocks explicitly (block_amax == 0), clamp only for
safe division, then force FP8 scale to exact zero for zero blocks via
torch.where. The FP4 nibbles are already zero (0 / anything = 0).

Verified: checkpoint byte match remains 100%, zero blocks produce
exact-zero dequantization, no NaN propagation.

Applies to all three quantization functions:
- quantize_to_nvfp4 (activation with computed gs)
- quantize_activation_nvfp4 (activation with pre-computed gs)
- quantize_weight_to_nvfp4 (weight quantization)
2026-05-20 02:14:50 +00:00
3c6b5a0522 chore: deprecate prepare_weights_from_dequantized and prepare_weights_direct
Verified that our NVFP4 packing convention (odd<<4|even, round-half-to-even)
matches the DeepSeek-V4 checkpoint exactly: 100% byte-identical round-trip
across all tested experts. The dequantize->requantize path is lossless in
practice but wasteful. Marked both prepare_weights_from_dequantized and
prepare_weights_direct as deprecated in favor of prepare_weights_from_stacked
which loads checkpoint FP4 bytes directly via .view().

Also added test_fp4_roundtrip.py for future reference.
2026-05-20 02:11:40 +00:00
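
The packing convention named in the commit above (odd<<4|even) can be sketched as a round trip over 4-bit codes. This is a minimal pure-Python illustration with hypothetical function names; it covers only the byte layout, not the round-half-to-even quantization step.

```python
def pack_nibbles(codes):
    """Pack 4-bit codes two per byte: odd-index code in the high nibble,
    even-index code in the low nibble."""
    if len(codes) % 2:
        raise ValueError("need an even number of 4-bit codes")
    if any(not 0 <= c <= 0xF for c in codes):
        raise ValueError("codes must fit in 4 bits")
    return bytes((codes[i + 1] << 4) | codes[i] for i in range(0, len(codes), 2))


def unpack_nibbles(packed):
    """Inverse of pack_nibbles."""
    codes = []
    for byte in packed:
        codes.append(byte & 0xF)  # even-index element (low nibble)
        codes.append(byte >> 4)   # odd-index element (high nibble)
    return codes


codes = [0x1, 0x2, 0xF, 0x0]
assert unpack_nibbles(pack_nibbles(codes)) == codes
```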
3181f74c86 fix: correct scale factor dimensions in warmup (K_sf = ceil_div(K_packed,8) not ceil_div(K_packed,16))
K_packed = original_K // 2. The scale factor dimension is
K_sf = ceil_div(original_K, 16) = ceil_div(K_packed * 2, 16) = ceil_div(K_packed, 8).
The previous code used ceil_div(K_packed, 16) which was wrong.
2026-05-20 02:08:26 +00:00
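
The dimension arithmetic in the commit above can be checked with a few lines of Python. The helper names are hypothetical, and the example K_packed value is arbitrary rather than taken from the model.

```python
def ceil_div(a, b):
    """Integer ceiling division for positive b."""
    return -(-a // b)


def scale_factor_cols(k_packed, block_size=16):
    """One scale factor per block_size original elements;
    two FP4 elements are packed per byte, so original_K = 2 * K_packed."""
    original_k = k_packed * 2
    return ceil_div(original_k, block_size)


k_packed = 3584                                   # arbitrary example value
assert scale_factor_cols(k_packed) == ceil_div(k_packed, 8)
assert ceil_div(k_packed, 16) != scale_factor_cols(k_packed)  # the old, wrong formula
```

The old formula yields half as many scale-factor columns as needed, which is why the warmup shapes were wrong.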
cc6b094450 fix: root-cause JIT memory corruption myth, add eager warmup, remove _needs_token_refill
Bug #1 fix: The _needs_token_refill workaround was a band-aid over a
misdiagnosis. cute.compile does NOT corrupt GPU memory (verified on B200).
The original corruption was from a different bug (likely OOB write or
weight loading issue).

Changes:
- bridge.py: Add warmup_compilation() for eager JIT before runtime buffers
  exist. Pre-allocate workspace per cache entry (no torch.full in hot path).
  Cache stores {compiled, workspace, workspace_size} instead of just compiled.
  CuTe tensor wrappers re-created per call (cheap metadata, avoids stale refs).
- runner.py: Remove _needs_token_refill hack. Add eager warmup call in
  _ensure_stacked() for both L1 and L2 GEMM shapes.
- nvfp4_linear.py: Add eager warmup in finalize_weights() for single GEMM.

The warmup approach ensures cute.compile runs exactly once per shape during
model init, before any forward pass. This is deterministic and eliminates
any possible interaction between JIT and runtime GPU memory.
2026-05-20 02:08:01 +00:00
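
The warmup and cache layout described in the commit above can be sketched as follows. This is a pure-Python illustration under stated assumptions: `compile_fn` stands in for the real JIT call, a `bytearray` stands in for the pre-allocated GPU workspace, and the class and method names are hypothetical rather than the ones in bridge.py.

```python
class KernelCache:
    """Per-shape cache holding {compiled, workspace, workspace_size}."""

    def __init__(self, compile_fn, workspace_size_fn):
        self._compile_fn = compile_fn
        self._workspace_size_fn = workspace_size_fn
        self._entries = {}

    def warmup(self, shape):
        """Compile and allocate once per shape, ideally during model init."""
        if shape not in self._entries:
            size = self._workspace_size_fn(shape)
            self._entries[shape] = {
                "compiled": self._compile_fn(shape),
                "workspace": bytearray(size),  # allocated here, not in the hot path
                "workspace_size": size,
            }
        return self._entries[shape]

    def run(self, shape, *args):
        """Hot path: no compilation and no allocation; KeyError if warmup was skipped."""
        entry = self._entries[shape]
        return entry["compiled"](entry["workspace"], *args)


compile_calls = []


def fake_compile(shape):
    """Stand-in compiler that records each call and returns a callable."""
    compile_calls.append(shape)
    return lambda workspace, x: (shape, len(workspace), x)


cache = KernelCache(fake_compile, lambda shape: shape[0] * shape[1])
cache.warmup((4, 8))
cache.warmup((4, 8))  # second call reuses the cached entry
assert compile_calls == [(4, 8)]
```

Raising on a missing entry in `run` makes a skipped warmup fail loudly instead of triggering a compile in the middle of a forward pass.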
039a9e27d6 fix: handle 3D swa_indices and correct kv_bf16 expand dims 2026-05-20 01:36:27 +00:00
b3f6f260ce feat: add native CuTeDSL SWA decode attention kernel stub + batched SDPA fallback 2026-05-20 01:28:05 +00:00
268dc251c1 fix: replace _allocate_buffers with _ensure_buffer_size for dynamic sizing 2026-05-20 00:02:10 +00:00
09669dded4 fix: dynamic buffer sizing in nvfp4_linear for varying token counts 2026-05-19 23:59:55 +00:00
02b9c1ac20 nuke vllm because it keeps confusing people 2026-05-19 23:04:36 +00:00
02b57071be Update README.md and CURRENT_BUG.md: eliminate stale issues, document NaN investigation, clarify our kernels are clean 2026-05-19 20:22:10 +00:00
7070fadf72 Add full layer NaN test (attention + MoE, multi-layer chain) 2026-05-19 18:36:49 +00:00
152b0749df Use 16 experts for MoE runner test (fits in memory) 2026-05-19 18:35:40 +00:00
daa59a7c75 Add MoE runner NaN test (grouped GEMM with real weights) 2026-05-19 18:34:56 +00:00
9308634e65 Fix intermediate size: 3072 not 18432 2026-05-19 18:34:12 +00:00
2b91bb1b71 Rewrite MoE NaN test: per-expert format, activation quantization, grouped GEMM 2026-05-19 18:33:57 +00:00
8904d409f8 Fix MoE weight key names, add fallback 2026-05-19 18:32:49 +00:00
e45ceb2226 Add MoE NaN reproduction test, update CURRENT_BUG.md with NaN tracing and test plan 2026-05-19 18:32:14 +00:00
22ec43e685 Add input NaN debug to trace where NaN starts 2026-05-19 18:15:53 +00:00
b86d0d2dee Add prefill inputs NaN debug 2026-05-19 18:04:18 +00:00
45a2d8851d Add prefill attention value debug check 2026-05-19 17:55:35 +00:00
1589b79137 Use module-level Blackwell flag in compressor (works during torch.compile) 2026-05-19 17:37:26 +00:00
658b12cb3d CRITICAL FIX: Remove double Q normalization and fix RoPE sin slice 2026-05-19 17:27:33 +00:00
facc6509e7 Fix imports in vLLM codepaths test 2026-05-19 17:26:50 +00:00
835e1a0590 Fix f-string syntax 2026-05-19 17:26:40 +00:00
9c30168202 Add test for exact vLLM codepaths (fused_qnorm, kv_write, decode) 2026-05-19 17:26:10 +00:00
8f80991fdf CRITICAL FIX: Properly dequantize fp8 KV in decode using per-token inv_scale 2026-05-19 17:08:58 +00:00
d67d8613af FIX: Use vLLM's decode_swa_indices for correct paged KV cache access during decode 2026-05-19 16:55:44 +00:00
3b204c4772 Fix UnboundLocalError: move num_decode_tokens before debug print 2026-05-19 16:43:28 +00:00
30890b621d CRITICAL FIX: Skip compressor fused attention kernel on Blackwell — it bypasses our attention path 2026-05-19 16:35:07 +00:00
b8e2cf61ad Add debug logging to Blackwell attention path 2026-05-19 16:31:55 +00:00
d7f686bcfc Fix wrapper attribute access: kv_cache, attn_sink, max_model_len via mla_attn 2026-05-19 16:19:28 +00:00
114da83090 Add CSA/HCA decode + prefill attention to Blackwell path 2026-05-19 16:06:24 +00:00
2cc1910c45 Fix N for C128A (need 128 tokens) 2026-05-19 16:04:53 +00:00
cea453cbab Fix compressor key name 2026-05-19 16:04:38 +00:00
04f2b2d8d4 Add CSA sparse attention test (compressed KV gather + SWA merge) 2026-05-19 16:04:19 +00:00
4c6464e7e0 Update CURRENT_BUG: KV cache pipeline verified, all tests passing 2026-05-19 16:01:10 +00:00
be8566a443 Add decode vs prefill consistency test 2026-05-19 16:00:33 +00:00
2ddd3d0702 Test with all 61 layers (shared experts only) 2026-05-19 15:55:41 +00:00
842e6e1381 Fix view→reshape for non-contiguous tensor 2026-05-19 15:54:40 +00:00
f0f8d8211b Add e2e decode test (3 layers: C128A, C4A, SWA) 2026-05-19 15:53:29 +00:00
255913fba4 Vectorize paged KV cache read/write, kill container 2026-05-19 15:48:16 +00:00
8b2cb41160 Fix KV cache: write to paged cache, handle uint8→fp8 conversion, fix RoPE bug 2026-05-19 15:34:09 +00:00
6ceb05327f Add blackwell_attention module and comprehensive test 2026-05-19 15:30:29 +00:00
85c74e5932 Fix attention for decode (1 query vs N cached KVs) 2026-05-19 15:28:52 +00:00
85099c7e75 Fix fp8 amax in decode test 2026-05-19 15:28:17 +00:00
c66b0b88c0 Add decode attention pipeline test — reproduces KV cache bug 2026-05-19 15:27:55 +00:00