Commit Graph

147 Commits

Author SHA1 Message Date
6a352cf6da TMEM alias analysis: (128,16) PV broken, (128,128) PV with zero-pad works. Root cause: PV A-fragment layout differs from QK C-fragment layout for (128,16) PV, causing TMEM column mismatch. Using (128,128) PV as workaround. 2026-05-21 09:10:12 +00:00
73cb3a3277 Debugging TMEM alias for (128,16) PV: zero output confirmed, PV reads from wrong TMEM columns. Need to align softmax P write with PV A-fragment layout. 2026-05-21 09:00:42 +00:00
768a696373 Stage B N-tiling: (128,16) PV MMA compiles and runs, cosine 0.36 (TMEM alias mismatch bug). FMHA head_dim=64 passes. Debugging TMEM layout alignment. 2026-05-21 08:45:49 +00:00
f3a7dc1598 FOOTGUN #0: num_tma_load_bytes MUST include V bytes. Fix v27, v29, comment all. Update README. 2026-05-21 07:13:14 +00:00
d873284e84 v29: FIX DEADLOCK - add V bytes to num_tma_load_bytes. V=I(128,128) cosine 1.0 2026-05-21 07:08:29 +00:00
6ec308dd50 v29 (padded V, deadlocks), v30 (diag copy, works) — debugging epilogue deadlock with (128,128) PV 2026-05-21 06:40:27 +00:00
3fdb9f008b v28 attempt: PV MMA (128,64) - cosine 0.004, debugging 2026-05-21 05:41:44 +00:00
1f6fb856ea more stuff 2026-05-21 05:08:57 +00:00
1314a67135 Stage B progress: PV works for square (128,128), broken for (128,64)
- Bug 1 (V MN-major): Fix applied
- Bug 2 (softmax packing): Confirmed correct (V=I test: cosine 1.0)
- Bug 3 (ACCUMULATE): Fix applied (first PV must overwrite, not accumulate)
- Bug 4 (CURRENT): PV MMA broken for non-square output
  - (128,128) PV with random V: cosine 0.999999 
  - (128,64) PV with MN-major V: cosine ~0.01 
  - Softmax packing, layout aliasing, pipeline ordering all verified correct
  - Root cause unknown — likely epilogue/V layout/MMA tiler issue

Added test_pv_diag.py (V=I and random V, 128x128 output — PASS)
Added test_layout_compare.py (TMEM layout inspection)
Added test_inspect_types.py (TMEM pointer arithmetic verification)
Updated test_mma_si_pv.py with head_dim param, pv_mma_tiler_mn fix, ACCUMULATE fix
Updated READMEs with current state
2026-05-21 04:40:28 +00:00
2cbb0369fd Stage B: pipeline deadlock fixed, V MN-major applied, PV output garbage
Pipeline deadlock fixed:
- No cta_layout_vmnk on mma_si PipelineUmmaAsync
- TMA warp excluded from tmem.wait_for_alloc
- PipelineTmaStore (not TmaStorePipeline)

Bug 1 (V MN-major): fix applied
- PV MMA uses v_major=OperandMajorMode.MN
- V shaped (64,128) strides(1,64) via as_strided

Bug 2 (softmax packing): C-fragment composition store applied
- FP32 to BF16 packing works
- St32x32bOp uses Float32 (not BFloat16)

Bug 3 (PV garbage): investigating
- PV MMA cosine ~0.01 against reference
- Suspected TMEM layout mismatch between softmax P store and PV A-fragment read

Test results:
- test_mma_si_only: cosine 0.999999 PASS
- test_mma_si_pv: cosine 0.01 FAIL (pipeline works, PV output wrong)
2026-05-21 04:10:07 +00:00
1eca10e39f Stage B: C-fragment vs A-fragment TMEM layout mismatch diagnosed
Key finding: C-fragment and A-fragment use different physical TMEM address
mappings. St32x32bOp with C-fragment writes to C-layout addresses, but PV MMA
reads from A-layout addresses. Forward FMHA recast validated FP16 only, not BF16.

Working: FP32 ld/st roundtrip, BF16 elemwise, BF16 recast ld S0->st S1 (all cos 0.999999)
Broken: C-frag st + A-frag read (NaN), A-frag store + PV MMA (cos -0.02)
Next: Fix register data flow (128 FP16/thread load vs 64 BF16/thread store mismatch)
2026-05-21 00:12:47 +00:00
e678afcde0 Stage B: two MMAs + identity softmax — crash fixed, softmax output still wrong
Key fixes:
- PipelineUmmaAsync consumer group: 32*4=128 threads (not 4 warps)
- TMEM offsets computed from find_tmem_tensor_col_offset (not hardcoded)
- P fragment from p_tmem_s.outer + make_fragment_A (matching fmha.py)
- V SMEM aliasing via recast_ptr

Status:
- Stage A: cosine 0.999999 
- Stage B: runs without crash, identity softmax cosine -0.02 
- Diagnostics: TMEM layout inspection, bisection results
2026-05-20 20:26:25 +00:00
dd7af0cd8a feat: GPU-native SWA + sparse decode attention kernels (CuTeDSL)
- native_swa_decode.py: BlackwellSWADecodeKernel
  - CTA mapping: 1 CTA per (decode_token, q_head_group)
  - Online softmax with KV tile streaming (16 tokens/tile)
  - Pre-dequantized bf16 KV (fp8 dequant on host - MLIR cvt_fpext
    requires 32-bit aligned vector, no scalar fp8->bf16 support)
  - Cosine 0.9999+ vs PyTorch batched SDPA reference
  - Fallback _fallback_batched_sdp when CuTeDSL unavailable

- native_sparse_decode.py: BlackwellSparseDecodeKernel
  - Combined SWA + compressed KV in single attention pass
  - Supports CSA (cr=4) and HCA (cr=128) layers
  - Sink weight merge on host side
  - Cosine 0.9999+ vs combined SDPA reference

- fp8_bf16.py: Documents MLIR limitation (cvt_fpext requires
  vector<4xf8>, no scalar support). Pre-dequant is the workaround.

- vLLM wiring (attention.py):
  - SWA-only layers: native_swa_decode_attention
  - CSA/HCA layers: native_sparse_decode_attention with topk + attn_sink
  - csa_attention.py updated to use native kernels

- Tests: test_decode_pipeline.py, test_sparse_decode.py both passing
2026-05-20 05:46:15 +00:00
d775d1075d Fused SwiGLU epilogue with granularity-8 weight interleave
- Fix interleave_l1_weights: remove //2 bug (g=granularity_bf16 for N-axis)
- Apply L1 weight+SF interleave in runner._ensure_stacked() and moe_pipeline
- De-interleave L1 GEMM output before gate/up split
- Fused SwiGLU kernel: epi_tile=(128,8) for subtile-level pairing
  - Even subtiles = gate: SiLU in FP32 registers, save to register buffer
  - Odd subtiles = up: silu(gate)*up from buffer
  - Both branches produce same BF16 tensor type (CuTeDSL constraint)
- run_nvfp4_moe_fused() pipeline: fused L1 + PyTorch L2
- Runner: fused_swiglu=True option for CuTeDSLMoERunner
- Layertest: both fused and non-fused paths PASS (cosine 0.988)
- README.md updated with current status and lessons learned
2026-05-20 04:13:52 +00:00
b1778eedf8 wip: Step 2 gate/up pairing — SiLU validated, runtime conditionals blocked by CuTeDSL
SiLU in registers: PASS (0.034% error, Step 1 stable)
Gate/up subtile detection: blocked by CuTeDSL type system

CuTeDSL compiles the kernel for ALL subtile iterations at once.
Runtime conditionals (if is_gate_subtile) that affect:
- Register tensor assignment → DSLRuntimeError (type structure mismatch)
- TMA store skipping → corrupted output
- Mask blending → wrong results

Path forward: use const_expr debug flag for the BF16 side output,
or process gate/up in a separate post-GEMM kernel.
2026-05-20 03:26:20 +00:00
ed89e678be wip: add run_fused_swiglu_grouped_gemm bridge + step1 test 2026-05-20 03:10:56 +00:00
9cdf79fd9c wip: fused SwiGLU kernel scaffold + bridge interleave + plan
- fused_swiglu_grouped_mm.py: copypaste of torch_scaled_grouped_mm.py with
  class rename and fused_swiglu/swiglu_limit params added
- bridge.py: added interleave_l1_weights, deinterleave_l1_weights,
  warmup_fused_swiglu_compilation
- Pure-PyTorch interleave invariant passes (A@cat vs deinterleave(A@interleave))
- Standalone GEMM interleave test fails due to kernel-internal N-tiling
  layout (expected, skipping per plan)
- FUSED_EPILOGUE_PLAN.md updated with register layout, amax shuffle plan,
  4-step implementation strategy
2026-05-20 03:04:38 +00:00
3c6b5a0522 chore: deprecate prepare_weights_from_dequantized and prepare_weights_direct
Verified that our NVFP4 packing convention (odd<<4|even, round-half-to-even)
matches the DeepSeek-V4 checkpoint exactly: 100% byte-identical round-trip
across all tested experts. The dequantize->requantize path is lossless in
practice but wasteful. Marked both prepare_weights_from_dequantized and
prepare_weights_direct as deprecated in favor of prepare_weights_from_stacked
which loads checkpoint FP4 bytes directly via .view().

Also added test_fp4_roundtrip.py for future reference.
2026-05-20 02:11:40 +00:00
7070fadf72 Add full layer NaN test (attention + MoE, multi-layer chain) 2026-05-19 18:36:49 +00:00
152b0749df Use 16 experts for MoE runner test (fits in memory) 2026-05-19 18:35:40 +00:00
daa59a7c75 Add MoE runner NaN test (grouped GEMM with real weights) 2026-05-19 18:34:56 +00:00
9308634e65 Fix intermediate size: 3072 not 18432 2026-05-19 18:34:12 +00:00
2b91bb1b71 Rewrite MoE NaN test: per-expert format, activation quantization, grouped GEMM 2026-05-19 18:33:57 +00:00
8904d409f8 Fix MoE weight key names, add fallback 2026-05-19 18:32:49 +00:00
e45ceb2226 Add MoE NaN reproduction test, update CURRENT_BUG.md with NaN tracing and test plan 2026-05-19 18:32:14 +00:00
facc6509e7 Fix imports in vLLM codepaths test 2026-05-19 17:26:50 +00:00
835e1a0590 Fix f-string syntax 2026-05-19 17:26:40 +00:00
9c30168202 Add test for exact vLLM codepaths (fused_qnorm, kv_write, decode) 2026-05-19 17:26:10 +00:00
2cc1910c45 Fix N for C128A (need 128 tokens) 2026-05-19 16:04:53 +00:00
cea453cbab Fix compressor key name 2026-05-19 16:04:38 +00:00
04f2b2d8d4 Add CSA sparse attention test (compressed KV gather + SWA merge) 2026-05-19 16:04:19 +00:00
be8566a443 Add decode vs prefill consistency test 2026-05-19 16:00:33 +00:00
2ddd3d0702 Test with all 61 layers (shared experts only) 2026-05-19 15:55:41 +00:00
842e6e1381 Fix view→reshape for non-contiguous tensor 2026-05-19 15:54:40 +00:00
f0f8d8211b Add e2e decode test (3 layers: C128A, C4A, SWA) 2026-05-19 15:53:29 +00:00
6ceb05327f Add blackwell_attention module and comprehensive test 2026-05-19 15:30:29 +00:00
85c74e5932 Fix attention for decode (1 query vs N cached KVs) 2026-05-19 15:28:52 +00:00
85099c7e75 Fix fp8 amax in decode test 2026-05-19 15:28:17 +00:00
c66b0b88c0 Add decode attention pipeline test — reproduces KV cache bug 2026-05-19 15:27:55 +00:00
8e6721917e Fix syntax in RoPE KV test 2026-05-19 10:31:07 +00:00
cbf440f75a Add RoPE KV test 2026-05-19 10:28:15 +00:00
dd7f2627e8 Add full model forward test (WIP), sparse attention test passes 2026-05-19 09:04:19 +00:00
9781953509 Add CSA/HCA sparse attention kernel test 2026-05-19 09:02:12 +00:00
d60673864a Fix kv_ref transpose in KV cache test 2026-05-19 08:58:46 +00:00
c1099d76d2 Add KV cache kernel test - fp8 quantize/dequant, paged cache, CSA/HCA compression 2026-05-19 08:57:31 +00:00
c54ddbdae1 Fix NVFP4 attention: slice output to actual N after 128-padding 2026-05-19 08:55:31 +00:00
42285b6c24 Add CuTeDSL NVFP4 attention kernel test - Q×K^T GEMM 2026-05-19 08:54:59 +00:00
9465929e6e Add DeepSeek-V4 CSA/HCA attention pipeline test (not MLA) 2026-05-19 08:51:16 +00:00
d08a457829 Fix cos_sin cache shape in NVFP4 attention test 2026-05-19 08:38:55 +00:00
7dd8871e84 Add NVFP4 attention test - quantize Q and K for Q×K^T GEMM 2026-05-19 08:38:25 +00:00