Commit Graph

490 Commits

Author SHA1 Message Date
06bf4f482d README: comprehensive update with current kernel status 2026-05-20 04:42:57 +00:00
a30d9eb523 Update README with final kernel status 2026-05-20 04:39:57 +00:00
04eca7c6da Custom CUDA kernel for de-interleave plus NVFP4 quantize 2026-05-20 04:39:47 +00:00
061d5692a9 Remove debug print statements from pipeline 2026-05-20 04:20:46 +00:00
aa8563c626 Fused SwiGLU epilogue with granularity-8 weight interleave
- Fix interleave_l1_weights: remove //2 bug (g=granularity_bf16 for N-axis)
- Apply L1 weight+SF interleave in runner._ensure_stacked() and moe_pipeline
- De-interleave L1 GEMM output before gate/up split
- Fused SwiGLU kernel: epi_tile=(128,8) for subtile-level pairing
  - Even subtiles = gate: SiLU in FP32 registers, save to register buffer
  - Odd subtiles = up: silu(gate)*up from buffer
  - Both branches produce same BF16 tensor type (CuTeDSL constraint)
- run_nvfp4_moe_fused() pipeline: fused L1 + PyTorch L2
- Runner: fused_swiglu=True option for CuTeDSLMoERunner
- Layertest: both fused and non-fused paths PASS (cosine 0.988)
- README.md updated with current status and lessons learned
2026-05-20 04:13:52 +00:00
57d4cb714f docs: rewrite README.md with current project state
- Document all 5 correctness bug fixes
- Document fused SwiGLU epilogue progress (Step 1 PASS, Step 2 blocked)
- Document CuTeDSL runtime conditional limitation
- List remaining steps (amax shuffles, NVFP4 quantize, FP4/SF TMA stores)
- Document weight interleave and register layout
- Capture key lessons learned
- Update file structure and test inventory
2026-05-20 03:30:35 +00:00
6c04155167 wip: Step 2 gate/up pairing — SiLU validated, runtime conditionals blocked by CuTeDSL
SiLU in registers: PASS (0.034% error, Step 1 stable)
Gate/up subtile detection: blocked by CuTeDSL type system

CuTeDSL compiles the kernel for ALL subtile iterations at once.
Runtime conditionals (if is_gate_subtile) that affect:
- Register tensor assignment → DSLRuntimeError (type structure mismatch)
- TMA store skipping → corrupted output
- Mask blending → wrong results

Path forward: use const_expr debug flag for the BF16 side output,
or process gate/up in a separate post-GEMM kernel.
2026-05-20 03:26:20 +00:00
9f0c1b8c5d wip: Step 1 SiLU validation complete, Step 2 gate/up pairing planning
Step 1 VALIDATED:
- cute.exp works on register tensors in the epilogue
- SiLU (x / (1+exp(-x))) produces correct results
- Relative error vs PyTorch: 0.034%, max abs: 0.0625 (BF16 precision)

Step 2 (gate/up pairing) approach:
- Register-level pairing requires understanding acc_vec layout from tiled_copy_r2s
- DeepGEMM pattern: (values[0], values[2]) pairs for tcgen05.ld
- CuTeDSL retile may produce different layout than direct PTX loads
- SMEM-level SiLU is a valid intermediate: avoids GMEM round-trip while
  working in logical (M, N) coordinate space
- Non-interleaved weights + SMEM SiLU is simplest starting point
2026-05-20 03:16:34 +00:00
b84f2f7bf9 fix: cutlass.Float32 not cutlass.float32_t in fused epilogue
Step 1 SiLU validation: PASS
- cute.exp works on register tensors
- SiLU (x / (1+exp(-x))) in registers matches PyTorch reference
- Relative error: 0.034%, Max abs error: 0.0625 (BF16 precision limit)
2026-05-20 03:12:23 +00:00
08992b818d wip: add run_fused_swiglu_grouped_gemm bridge + step1 test 2026-05-20 03:10:56 +00:00
9c43c69a4c wip: fused SwiGLU Stage 1 - SiLU in registers (full acc_vec)
Stage 1 of the fused epilogue: applies SiLU (x * sigmoid(x)) to the
full accumulator register tensor before writing BF16 to C.

This validates that cute.exp and element-wise FP32 operations work
on CuTe register tensors in the epilogue. The gate/up pairing is
not yet implemented (Stage 2).

The fused_swiglu flag is const_expr(0) by default, so the standard
epilogue path is unchanged unless the flag is enabled.
2026-05-20 03:07:02 +00:00
2f053f674e wip: fused SwiGLU kernel scaffold + bridge interleave + plan
- fused_swiglu_grouped_mm.py: copypaste of torch_scaled_grouped_mm.py with
  class rename and fused_swiglu/swiglu_limit params added
- bridge.py: added interleave_l1_weights, deinterleave_l1_weights,
  warmup_fused_swiglu_compilation
- Pure-PyTorch interleave invariant passes (A@cat vs deinterleave(A@interleave))
- Standalone GEMM interleave test fails due to kernel-internal N-tiling
  layout (expected, skipping per plan)
- FUSED_EPILOGUE_PLAN.md updated with register layout, amax shuffle plan,
  4-step implementation strategy
2026-05-20 03:04:38 +00:00
4f178d6e9c chore: remove unused _expert_id_range after bincount migration 2026-05-20 02:17:44 +00:00
84a2f6d441 perf: replace expert counting O(n*E) comparison with torch.bincount O(n)
Bug #5 fix: (sorted_ids.unsqueeze(1) == expert_id_range.unsqueeze(0)).sum(dim=0)
materializes a (num_slots × num_experts) bool tensor every forward — 48K × 384 = 18M
elements. torch.bincount(sorted_ids, minlength=num_experts) gives the same result
in O(n) with no intermediate allocation. ~200× less work.

Also removes the now-unused _expert_id_range buffer.
2026-05-20 02:17:23 +00:00
4882d8553c fix: zero out x_norm for underflow blocks before division in NVFP4 quantization
Bug #4 fix: When a block has amax > 0 but amax/6 underflows to 0 in
FP8 (amax < 6*2^-9 ≈ 0.0117), the block scale is 0, but the division
x / clamp(0, 1e-8) inflates x into nonzero FP4 buckets (up to ±6.0).
This produces semantically wrong FP4 even though dequant gives 0 (6*0=0).

Root cause: we only detected truly-zero blocks (amax == 0) but not
underflow blocks (0 < amax < FP8_threshold). The fix:

1. Detect both zero and underflow blocks: block_amax < 6 * 2^-9
2. Zero out x_reshaped for these blocks BEFORE division
3. Force FP8 scale to 0 for these blocks

This ensures x_scaled = 0 → FP4 nibbles = 0 → dequant = 0.
Verified: bug scenario now produces nibble=0, scale=0.
Checkpoint byte match remains 100%.
2026-05-20 02:16:49 +00:00
e653712598 fix: detect zero blocks in NVFP4 quantization, force FP4+FP8 to exact zero
Bug #3 fix: The clamp(min=1e-8) on block_amax prevented NaN from 0/0
but allowed truly-zero blocks to get a nonzero FP8 scale (5e-12 from
underflow). While the kernel produces 0 * 0 = 0 (no NaN), the nonzero
scale is semantically wrong and could interact badly with future kernels.

Fix: detect zero blocks explicitly (block_amax == 0), clamp only for
safe division, then force FP8 scale to exact zero for zero blocks via
torch.where. The FP4 nibbles are already zero (0 / anything = 0).

Verified: checkpoint byte match remains 100%, zero blocks produce
exact-zero dequantization, no NaN propagation.

Applies to all three quantization functions:
- quantize_to_nvfp4 (activation with computed gs)
- quantize_activation_nvfp4 (activation with pre-computed gs)
- quantize_weight_to_nvfp4 (weight quantization)
2026-05-20 02:14:50 +00:00
1857bdedc3 chore: deprecate prepare_weights_from_dequantized and prepare_weights_direct
Verified that our NVFP4 packing convention (odd<<4|even, round-half-to-even)
matches the DeepSeek-V4 checkpoint exactly: 100% byte-identical round-trip
across all tested experts. The dequantize->requantize path is lossless in
practice but wasteful. Marked both prepare_weights_from_dequantized and
prepare_weights_direct as deprecated in favor of prepare_weights_from_stacked
which loads checkpoint FP4 bytes directly via .view().

Also added test_fp4_roundtrip.py for future reference.
2026-05-20 02:11:40 +00:00
ef398006a7 fix: correct scale factor dimensions in warmup (K_sf = ceil_div(K_packed,8) not ceil_div(K_packed,16))
K_packed = original_K // 2. The scale factor dimension is
K_sf = ceil_div(original_K, 16) = ceil_div(K_packed * 2, 16) = ceil_div(K_packed, 8).
The previous code used ceil_div(K_packed, 16) which was wrong.
2026-05-20 02:08:26 +00:00
8f1a20562f fix: root-cause JIT memory corruption myth, add eager warmup, remove _needs_token_refill
Bug #1 fix: The _needs_token_refill workaround was a band-aid over a
misdiagnosis. cute.compile does NOT corrupt GPU memory (verified on B200).
The original corruption was from a different bug (likely OOB write or
weight loading issue).

Changes:
- bridge.py: Add warmup_compilation() for eager JIT before runtime buffers
  exist. Pre-allocate workspace per cache entry (no torch.full in hot path).
  Cache stores {compiled, workspace, workspace_size} instead of just compiled.
  CuTe tensor wrappers re-created per call (cheap metadata, avoids stale refs).
- runner.py: Remove _needs_token_refill hack. Add eager warmup call in
  _ensure_stacked() for both L1 and L2 GEMM shapes.
- nvfp4_linear.py: Add eager warmup in finalize_weights() for single GEMM.

The warmup approach ensures cute.compile runs exactly once per shape during
model init, before any forward pass. This is deterministic and eliminates
any possible interaction between JIT and runtime GPU memory.
2026-05-20 02:08:01 +00:00
6ec0afc318 fix: handle 3D swa_indices and correct kv_bf16 expand dims 2026-05-20 01:36:27 +00:00
aa593361e7 feat: add native CuTeDSL SWA decode attention kernel stub + batched SDPA fallback 2026-05-20 01:28:05 +00:00
3599b44c0f fix: replace _allocate_buffers with _ensure_buffer_size for dynamic sizing 2026-05-20 00:02:10 +00:00
1d5e70adfb fix: dynamic buffer sizing in nvfp4_linear for varying token counts 2026-05-19 23:59:55 +00:00
1901bf585e nuke vllm because this keep confusing people 2026-05-19 23:04:36 +00:00
5fb70b4cd2 Update README.md and CURRENT_BUG.md: eliminate stale issues, document NaN investigation, clarify our kernels are clean 2026-05-19 20:22:10 +00:00
2e6559402c Add full layer NaN test (attention + MoE, multi-layer chain) 2026-05-19 18:36:49 +00:00
cca145e35c Use 16 experts for MoE runner test (fits in memory) 2026-05-19 18:35:40 +00:00
7893e7514d Add MoE runner NaN test (grouped GEMM with real weights) 2026-05-19 18:34:56 +00:00
7b432da754 Fix intermediate size: 3072 not 18432 2026-05-19 18:34:12 +00:00
293f14a179 Rewrite MoE NaN test: per-expert format, activation quantization, grouped GEMM 2026-05-19 18:33:57 +00:00
62f2395e30 Fix MoE weight key names, add fallback 2026-05-19 18:32:49 +00:00
9455466648 Add MoE NaN reproduction test, update CURRENT_BUG.md with NaN tracing and test plan 2026-05-19 18:32:14 +00:00
0316cec6fb Add input NaN debug to trace where NaN starts 2026-05-19 18:15:53 +00:00
4c45d73b82 Add prefill inputs NaN debug 2026-05-19 18:04:18 +00:00
0773c9608c Add prefill attention value debug check 2026-05-19 17:55:35 +00:00
4f02113aa0 Use module-level Blackwell flag in compressor (works during torch.compile) 2026-05-19 17:37:26 +00:00
8cf6ac3e8c CRITICAL FIX: Remove double Q normalization and fix RoPE sin slice 2026-05-19 17:27:33 +00:00
a94ad73c64 Fix imports in vLLM codepaths test 2026-05-19 17:26:50 +00:00
f3f9674810 Fix f-string syntax 2026-05-19 17:26:40 +00:00
6cc2312e61 Add test for exact vLLM codepaths (fused_qnorm, kv_write, decode) 2026-05-19 17:26:10 +00:00
aade8593f7 CRITICAL FIX: Properly dequantize fp8 KV in decode using per-token inv_scale 2026-05-19 17:08:58 +00:00
2f811bc8bd FIX: Use vLLM's decode_swa_indices for correct paged KV cache access during decode 2026-05-19 16:55:44 +00:00
da6fa2f1d6 Fix UnboundLocalError: move num_decode_tokens before debug print 2026-05-19 16:43:28 +00:00
76fff5fc8b CRITICAL FIX: Skip compressor fused attention kernel on Blackwell — it bypasses our attention path 2026-05-19 16:35:07 +00:00
0554332352 Add debug logging to Blackwell attention path 2026-05-19 16:31:55 +00:00
f9a09df81a Fix wrapper attribute access: kv_cache, attn_sink, max_model_len via mla_attn 2026-05-19 16:19:28 +00:00
b95e934703 Add CSA/HCA decode + prefill attention to Blackwell path 2026-05-19 16:06:24 +00:00
abff942edd Fix N for C128A (need 128 tokens) 2026-05-19 16:04:53 +00:00
49c2e088d4 Fix compressor key name 2026-05-19 16:04:38 +00:00
7d89ede9f9 Add CSA sparse attention test (compressed KV gather + SWA merge) 2026-05-19 16:04:19 +00:00