96e7210db7
Debugging TMEM alias for (128,16) PV: zero output confirmed, PV reads from wrong TMEM columns. Need to align softmax P write with PV A-fragment layout.
2026-05-21 09:00:42 +00:00
ad3f63033d
Stage B N-tiling: (128,16) PV MMA compiles and runs, cosine 0.36 (TMEM alias mismatch bug). FMHA head_dim=64 passes. Debugging TMEM layout alignment.
2026-05-21 08:45:49 +00:00
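Note on the cosine figures quoted throughout this log: a minimal sketch of the metric, assuming it is plain cosine similarity over flattened outputs. The helper name `cosine` and the zero-norm guard are illustrative, not taken from the test scripts. The guard matters for the "zero output" case, where the ratio is otherwise undefined.

```python
import math

def cosine(a, b):
    """Cosine similarity of two flat sequences; returns 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # An all-zero kernel output has no direction; report 0.0 instead of dividing by zero.
        return 0.0
    return dot / (norm_a * norm_b)
```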
5e37ea56e4
FOOTGUN #0: num_tma_load_bytes MUST include V bytes. Fix v27 and v29, comment all. Update README.
2026-05-21 07:13:14 +00:00
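Note on the byte count in the commit above: a minimal sketch of the arithmetic, assuming the Q, K and V tiles all signal the same barrier and are bf16. The tile shapes and the helper `tma_tile_bytes` are hypothetical; only the name `num_tma_load_bytes` comes from the commit.

```python
BF16_BYTES = 2  # assumed element width

def tma_tile_bytes(rows, cols, elem_bytes=BF16_BYTES):
    """Bytes moved by one TMA load of a (rows, cols) tile."""
    return rows * cols * elem_bytes

# Hypothetical tile shapes; the real kernel's tiles may differ.
q_bytes = tma_tile_bytes(128, 64)
k_bytes = tma_tile_bytes(128, 64)
v_bytes = tma_tile_bytes(128, 64)

# The barrier is told how many bytes to expect. If V is loaded but not counted,
# the expected total disagrees with what actually arrives; the commits above
# report that mismatch as a deadlock.
num_tma_load_bytes_without_v = q_bytes + k_bytes
num_tma_load_bytes = q_bytes + k_bytes + v_bytes
```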
dd8d872bec
v29: FIX DEADLOCK - add V bytes to num_tma_load_bytes. V=I(128,128) cosine 1.0
2026-05-21 07:08:29 +00:00
f1c4ee0e4d
v29 (padded V, deadlocks), v30 (diag copy, works) — debugging epilogue deadlock with (128,128) PV
2026-05-21 06:40:27 +00:00
15c987244f
v28 attempt: PV MMA (128,64) - cosine 0.004, debugging
2026-05-21 05:41:44 +00:00
c20518332e
more stuff
2026-05-21 05:08:57 +00:00
0dc6fe4a7d
Stage B progress: PV works for square (128,128), broken for (128,64)
...
- Bug 1 (V MN-major): Fix applied
- Bug 2 (softmax packing): Confirmed correct (V=I test: cosine 1.0)
- Bug 3 (ACCUMULATE): Fix applied (first PV must overwrite, not accumulate)
- Bug 4 (CURRENT): PV MMA broken for non-square output
- (128,128) PV with random V: cosine 0.999999 ✅
- (128,64) PV with MN-major V: cosine ~0.01 ❌
- Softmax packing, layout aliasing, pipeline ordering all verified correct
- Root cause unknown — likely epilogue/V layout/MMA tiler issue
Added test_pv_diag.py (V=I and random V, 128x128 output — PASS)
Added test_layout_compare.py (TMEM layout inspection)
Added test_inspect_types.py (TMEM pointer arithmetic verification)
Updated test_mma_si_pv.py with head_dim param, pv_mma_tiler_mn fix, ACCUMULATE fix
Updated READMEs with current state
2026-05-21 04:40:28 +00:00
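Note on Bug 3 (ACCUMULATE) in the commit above: a minimal sketch of why the first PV MMA has to overwrite. The functions `mma` and `pv_over_kv_tiles` are hypothetical stand-ins written in plain Python, not the CuTeDSL calls; the accumulator is assumed to hold stale data before the first tile.

```python
def mma(a, b, acc, accumulate):
    """Return a @ b, added to acc only when accumulate is True."""
    m, k, n = len(a), len(b), len(b[0])
    out = [[0.0] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            prod = sum(a[i][p] * b[p][j] for p in range(k))
            out[i][j] = prod + (acc[i][j] if accumulate else 0.0)
    return out

def pv_over_kv_tiles(p_tiles, v_tiles, stale_acc):
    """Sum P_t @ V_t over tiles; the first tile ignores whatever acc held before."""
    acc = stale_acc
    for t, (p, v) in enumerate(zip(p_tiles, v_tiles)):
        acc = mma(p, v, acc, accumulate=(t > 0))
    return acc
```

With `accumulate=True` on the first tile, the stale contents would be added into the result.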
7a8945eb76
Stage B: pipeline deadlock fixed, V MN-major applied, PV output garbage
...
Pipeline deadlock fixed:
- No cta_layout_vmnk on mma_si PipelineUmmaAsync
- TMA warp excluded from tmem.wait_for_alloc
- PipelineTmaStore (not TmaStorePipeline)
Bug 1 (V MN-major): fix applied
- PV MMA uses v_major=OperandMajorMode.MN
- V shaped (64,128) with strides (1,64) via as_strided
Bug 2 (softmax packing): C-fragment composition store applied
- FP32 to BF16 packing works
- St32x32bOp uses Float32 (not BFloat16)
Bug 3 (PV garbage): investigating
- PV MMA cosine ~0.01 against reference
- Suspected TMEM layout mismatch between softmax P store and PV A-fragment read
Test results:
- test_mma_si_only: cosine 0.999999 PASS
- test_mma_si_pv: cosine 0.01 FAIL (pipeline works, PV output wrong)
2026-05-21 04:10:07 +00:00
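Note on the V layout in the commit above: a minimal sketch of what a (64,128) view with strides (1,64) addresses, using plain index arithmetic instead of `as_strided`. It assumes the underlying buffer is a row-major (128,64) matrix, in which case the strided view is its transpose. The helper `strided_view` is hypothetical.

```python
def strided_view(buf, shape, strides):
    """Materialize a 2-D view of a flat buffer: element (i, j) lives at i*s0 + j*s1."""
    rows, cols = shape
    s0, s1 = strides
    return [[buf[i * s0 + j * s1] for j in range(cols)] for i in range(rows)]

buf = list(range(128 * 64))  # stand-in for V's storage

v_row_major = strided_view(buf, (128, 64), (64, 1))  # assumed original layout
v_mn_major = strided_view(buf, (64, 128), (1, 64))   # shape and strides from the commit

# Same memory, no copy: the first mode of the (64,128) view is the contiguous one.
is_transpose = all(
    v_mn_major[i][j] == v_row_major[j][i] for i in range(64) for j in range(128)
)
```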
467ade37b2
Stage B: C-fragment vs A-fragment TMEM layout mismatch diagnosed
...
Key finding: C-fragment and A-fragment use different physical TMEM address
mappings. St32x32bOp with C-fragment writes to C-layout addresses, but PV MMA
reads from A-layout addresses. Forward FMHA recast validated FP16 only, not BF16.
Working: FP32 ld/st roundtrip, BF16 elemwise, BF16 recast ld S0->st S1 (all cos 0.999999)
Broken: C-frag st + A-frag read (NaN), A-frag store + PV MMA (cos -0.02)
Next: Fix register data flow (128 FP16/thread load vs 64 BF16/thread store mismatch)
2026-05-21 00:12:47 +00:00
97656a5cd1
Stage B: two MMAs + identity softmax — crash fixed, softmax output still wrong
...
Key fixes:
- PipelineUmmaAsync consumer group: 32*4=128 threads (not 4 warps)
- TMEM offsets computed from find_tmem_tensor_col_offset (not hardcoded)
- P fragment from p_tmem_s.outer + make_fragment_A (matching fmha.py)
- V SMEM aliasing via recast_ptr
Status:
- Stage A: cosine 0.999999 ✅
- Stage B: runs without crash, identity softmax cosine -0.02 ❌
- Diagnostics: TMEM layout inspection, bisection results
2026-05-20 20:26:25 +00:00
bbba289bd8
feat: GPU-native SWA + sparse decode attention kernels (CuTeDSL)
...
- native_swa_decode.py: BlackwellSWADecodeKernel
- CTA mapping: 1 CTA per (decode_token, q_head_group)
- Online softmax with KV tile streaming (16 tokens/tile)
- Pre-dequantized bf16 KV (fp8 dequant on host - MLIR cvt_fpext
requires 32-bit aligned vector, no scalar fp8->bf16 support)
- Cosine 0.9999+ vs PyTorch batched SDPA reference
- Fallback _fallback_batched_sdp when CuTeDSL unavailable
- native_sparse_decode.py: BlackwellSparseDecodeKernel
- Combined SWA + compressed KV in single attention pass
- Supports CSA (cr=4) and HCA (cr=128) layers
- Sink weight merge on host side
- Cosine 0.9999+ vs combined SDPA reference
- fp8_bf16.py: Documents MLIR limitation (cvt_fpext requires
vector<4xf8>, no scalar support). Pre-dequant is the workaround.
- vLLM wiring (attention.py):
- SWA-only layers: native_swa_decode_attention
- CSA/HCA layers: native_sparse_decode_attention with topk + attn_sink
- csa_attention.py updated to use native kernels
- Tests: test_decode_pipeline.py, test_sparse_decode.py both passing
2026-05-20 05:46:15 +00:00
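Note on "online softmax with KV tile streaming" in the commit above: a minimal single-query sketch in plain Python, assuming the standard running-max and running-sum formulation. The function name and argument layout are hypothetical; only the 16-token tile size comes from the commit.

```python
import math

def online_softmax_attention(scores, values, tile=16):
    """One query: scores[t] is q.k_t, values[t] is the value vector of token t."""
    running_max = -math.inf
    running_sum = 0.0
    acc = [0.0] * len(values[0])
    for start in range(0, len(scores), tile):
        s_tile = scores[start:start + tile]
        v_tile = values[start:start + tile]
        new_max = max(running_max, max(s_tile))
        # Rescale everything accumulated so far to the new maximum.
        scale = math.exp(running_max - new_max)
        running_sum *= scale
        acc = [a * scale for a in acc]
        for s, v in zip(s_tile, v_tile):
            w = math.exp(s - new_max)
            running_sum += w
            acc = [a + w * x for a, x in zip(acc, v)]
        running_max = new_max
    return [a / running_sum for a in acc]
```

The result should match a softmax over all scores at once, which is what makes streaming KV in tiles possible without holding every score.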
aa8563c626
Fused SwiGLU epilogue with granularity-8 weight interleave
...
- Fix interleave_l1_weights: remove //2 bug (g=granularity_bf16 for N-axis)
- Apply L1 weight+SF interleave in runner._ensure_stacked() and moe_pipeline
- De-interleave L1 GEMM output before gate/up split
- Fused SwiGLU kernel: epi_tile=(128,8) for subtile-level pairing
- Even subtiles = gate: SiLU in FP32 registers, save to register buffer
- Odd subtiles = up: silu(gate)*up from buffer
- Both branches produce same BF16 tensor type (CuTeDSL constraint)
- run_nvfp4_moe_fused() pipeline: fused L1 + PyTorch L2
- Runner: fused_swiglu=True option for CuTeDSLMoERunner
- Layertest: both fused and non-fused paths PASS (cosine 0.988)
- README.md updated with current status and lessons learned
2026-05-20 04:13:52 +00:00
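Note on the subtile pairing in the commit above: a minimal sketch of the even/odd scheme on one output row, assuming the interleaved column order is [8 gate | 8 up | 8 gate | 8 up | ...]. The function `fused_swiglu_row` is hypothetical and runs on the host in plain Python; the real epilogue works on registers.

```python
import math

G = 8  # subtile width, matching epi_tile=(128, 8)

def silu(x):
    return x / (1.0 + math.exp(-x))

def fused_swiglu_row(interleaved_row, g=G):
    """Consume alternating gate/up subtiles and emit silu(gate) * up."""
    out = []
    gate_buf = None
    for idx, start in enumerate(range(0, len(interleaved_row), g)):
        subtile = interleaved_row[start:start + g]
        if idx % 2 == 0:
            # Even subtile = gate: apply SiLU and park the result in a buffer.
            gate_buf = [silu(x) for x in subtile]
        else:
            # Odd subtile = up: multiply with the buffered silu(gate).
            out.extend(s * u for s, u in zip(gate_buf, subtile))
    return out
```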
6c04155167
wip: Step 2 gate/up pairing — SiLU validated, runtime conditionals blocked by CuTeDSL
...
SiLU in registers: PASS (0.034% error, Step 1 stable)
Gate/up subtile detection: blocked by CuTeDSL type system
CuTeDSL compiles the kernel for ALL subtile iterations at once.
Runtime conditionals (if is_gate_subtile) that affect:
- Register tensor assignment → DSLRuntimeError (type structure mismatch)
- TMA store skipping → corrupted output
- Mask blending → wrong results
Path forward: use const_expr debug flag for the BF16 side output,
or process gate/up in a separate post-GEMM kernel.
2026-05-20 03:26:20 +00:00
08992b818d
wip: add run_fused_swiglu_grouped_gemm bridge + step1 test
2026-05-20 03:10:56 +00:00
2f053f674e
wip: fused SwiGLU kernel scaffold + bridge interleave + plan
...
- fused_swiglu_grouped_mm.py: copy of torch_scaled_grouped_mm.py with
class rename and fused_swiglu/swiglu_limit params added
- bridge.py: added interleave_l1_weights, deinterleave_l1_weights,
warmup_fused_swiglu_compilation
- Pure-PyTorch interleave invariant passes (A@cat vs deinterleave(A@interleave))
- Standalone GEMM interleave test fails due to kernel-internal N-tiling
layout (expected, skipping per plan)
- FUSED_EPILOGUE_PLAN.md updated with register layout, amax shuffle plan,
4-step implementation strategy
2026-05-20 03:04:38 +00:00
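Note on the interleave invariant in the commit above (A@cat vs deinterleave(A@interleave)): a minimal plain-Python sketch, assuming weights are stored one output row per entry and interleaved in blocks of 8 rows. The function names below are hypothetical and do not reproduce the signatures in bridge.py.

```python
def interleave_rows_sketch(gate_rows, up_rows, g=8):
    """[g gate rows, g up rows, g gate rows, g up rows, ...]."""
    out = []
    for start in range(0, len(gate_rows), g):
        out.extend(gate_rows[start:start + g])
        out.extend(up_rows[start:start + g])
    return out

def deinterleave_cols_sketch(cols, g=8):
    """Undo the interleave on a GEMM output row: returns all gate cols, then all up cols."""
    gate, up = [], []
    for idx, start in enumerate(range(0, len(cols), g)):
        (gate if idx % 2 == 0 else up).extend(cols[start:start + g])
    return gate + up

def matvec(w_rows, a):
    """One activation row times W^T: one output per weight row."""
    return [sum(w * x for w, x in zip(row, a)) for row in w_rows]
```

Since interleaving only permutes output rows, the GEMM on interleaved weights followed by de-interleaving should equal the GEMM on the concatenated weights.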
1857bdedc3
chore: deprecate prepare_weights_from_dequantized and prepare_weights_direct
...
Verified that our NVFP4 packing convention (odd<<4|even, round-half-to-even)
matches the DeepSeek-V4 checkpoint exactly: 100% byte-identical round-trip
across all tested experts. The dequantize->requantize path is lossless in
practice but wasteful. Marked both prepare_weights_from_dequantized and
prepare_weights_direct as deprecated in favor of prepare_weights_from_stacked
which loads checkpoint FP4 bytes directly via .view().
Also added test_fp4_roundtrip.py for future reference.
2026-05-20 02:11:40 +00:00
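Note on the packing convention in the commit above (odd<<4|even, round-half-to-even): a minimal sketch, assuming the FP4 format is E2M1 with magnitudes 0, 0.5, 1, 1.5, 2, 3, 4, 6 and that ties go to the code with an even low bit. Block scale factors are left out, and all function names are hypothetical.

```python
E2M1_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]  # magnitudes for codes 0..7

def quantize_e2m1(x):
    """Nearest E2M1 code (bit 3 = sign), ties broken toward the even code."""
    sign = 8 if x < 0 else 0
    mag = min(abs(x), 6.0)  # saturate at the largest magnitude
    best = 0
    for code in range(1, 8):
        d_best = abs(mag - E2M1_VALUES[best])
        d_code = abs(mag - E2M1_VALUES[code])
        if d_code < d_best or (d_code == d_best and code % 2 == 0):
            best = code
    return sign | best

def pack_fp4(codes):
    """Two codes per byte: odd-index element in the high nibble, even-index in the low."""
    return bytes((codes[i + 1] << 4) | codes[i] for i in range(0, len(codes), 2))

def unpack_fp4(packed):
    codes = []
    for byte in packed:
        codes.append(byte & 0xF)  # even-index element
        codes.append(byte >> 4)   # odd-index element
    return codes
```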
2e6559402c
Add full layer NaN test (attention + MoE, multi-layer chain)
2026-05-19 18:36:49 +00:00
cca145e35c
Use 16 experts for MoE runner test (fits in memory)
2026-05-19 18:35:40 +00:00
7893e7514d
Add MoE runner NaN test (grouped GEMM with real weights)
2026-05-19 18:34:56 +00:00
7b432da754
Fix intermediate size: 3072 not 18432
2026-05-19 18:34:12 +00:00
293f14a179
Rewrite MoE NaN test: per-expert format, activation quantization, grouped GEMM
2026-05-19 18:33:57 +00:00
62f2395e30
Fix MoE weight key names, add fallback
2026-05-19 18:32:49 +00:00
9455466648
Add MoE NaN reproduction test, update CURRENT_BUG.md with NaN tracing and test plan
2026-05-19 18:32:14 +00:00
a94ad73c64
Fix imports in vLLM codepaths test
2026-05-19 17:26:50 +00:00
f3f9674810
Fix f-string syntax
2026-05-19 17:26:40 +00:00
6cc2312e61
Add test for exact vLLM codepaths (fused_qnorm, kv_write, decode)
2026-05-19 17:26:10 +00:00
abff942edd
Fix N for C128A (need 128 tokens)
2026-05-19 16:04:53 +00:00
49c2e088d4
Fix compressor key name
2026-05-19 16:04:38 +00:00
7d89ede9f9
Add CSA sparse attention test (compressed KV gather + SWA merge)
2026-05-19 16:04:19 +00:00
696a890df7
Add decode vs prefill consistency test
2026-05-19 16:00:33 +00:00
359654f08e
Test with all 61 layers (shared experts only)
2026-05-19 15:55:41 +00:00
3e6041d752
Fix view→reshape for non-contiguous tensor
2026-05-19 15:54:40 +00:00
ff9f373633
Add e2e decode test (3 layers: C128A, C4A, SWA)
2026-05-19 15:53:29 +00:00
0023fee706
Add blackwell_attention module and comprehensive test
2026-05-19 15:30:29 +00:00
142a4a1ad4
Fix attention for decode (1 query vs N cached KVs)
2026-05-19 15:28:52 +00:00
4b85605edf
Fix fp8 amax in decode test
2026-05-19 15:28:17 +00:00
4f23055450
Add decode attention pipeline test — reproduces KV cache bug
2026-05-19 15:27:55 +00:00
8e6721917e
Fix syntax in RoPE KV test
2026-05-19 10:31:07 +00:00
cbf440f75a
Add RoPE KV test
2026-05-19 10:28:15 +00:00
dd7f2627e8
Add full model forward test (WIP), sparse attention test passes
2026-05-19 09:04:19 +00:00
9781953509
Add CSA/HCA sparse attention kernel test
2026-05-19 09:02:12 +00:00
d60673864a
Fix kv_ref transpose in KV cache test
2026-05-19 08:58:46 +00:00
c1099d76d2
Add KV cache kernel test - fp8 quantize/dequant, paged cache, CSA/HCA compression
2026-05-19 08:57:31 +00:00
c54ddbdae1
Fix NVFP4 attention: slice output to actual N after 128-padding
2026-05-19 08:55:31 +00:00
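Note on the slicing fix in the commit above: a minimal sketch, assuming the key dimension is zero-padded up to a multiple of 128 before the Q×K^T GEMM and the result is cut back to the real N afterwards. `round_up` and `qk_scores_padded` are hypothetical helpers in plain Python.

```python
TILE_N = 128  # padding granularity named in the commit

def round_up(n, tile=TILE_N):
    return ((n + tile - 1) // tile) * tile

def qk_scores_padded(q, keys, tile=TILE_N):
    """Compute q.k for every padded key row, then keep only the first N results."""
    n = len(keys)
    pad_rows = [[0.0] * len(q) for _ in range(round_up(n, tile) - n)]
    padded = keys + pad_rows
    scores = [sum(a * b for a, b in zip(q, k)) for k in padded]
    # Without this slice the caller sees round_up(N) scores, the tail being padding.
    return scores[:n]
```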
42285b6c24
Add CuTeDSL NVFP4 attention kernel test - Q×K^T GEMM
2026-05-19 08:54:59 +00:00
9465929e6e
Add DeepSeek-V4 CSA/HCA attention pipeline test (not MLA)
2026-05-19 08:51:16 +00:00
d08a457829
Fix cos_sin cache shape in NVFP4 attention test
2026-05-19 08:38:55 +00:00
7dd8871e84
Add NVFP4 attention test - quantize Q and K for Q×K^T GEMM
2026-05-19 08:38:25 +00:00
3de75c4e37
Add CSA/HCA attention kernel (PyTorch SDPA, Blackwell-safe)
...
Replaces vLLM's broken FlashMLA sparse attention which doesn't work on
SM100 (Blackwell). Uses torch.nn.functional.scaled_dot_product_attention
which works on all GPUs.
Architecture:
- CSA (C128A): Batched sparse gather + SDPA on top-k positions
- HCA (C4A): Same with compressed KV + per-layer indexer
- SWA: Sliding window attention
- Full reference: standard SDPA for testing without compression
Also adds test_csa_attention_b200.py to verify the full attention path.
2026-05-19 07:58:10 +00:00
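Note on "sparse gather + SDPA on top-k positions" in the commit above: a minimal single-query sketch in plain Python. It swaps torch.nn.functional.scaled_dot_product_attention for a hand-written softmax so it runs without PyTorch, and it assumes the top-k indices are already given. The function name is hypothetical.

```python
import math

def sparse_topk_attention(q, keys, values, topk_idx):
    """Gather the selected KV rows, then run ordinary scaled dot-product attention on them."""
    scale = 1.0 / math.sqrt(len(q))
    k_sel = [keys[i] for i in topk_idx]
    v_sel = [values[i] for i in topk_idx]
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in k_sel]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    z = sum(weights)
    dim = len(v_sel[0])
    return [sum(w * v[d] for w, v in zip(weights, v_sel)) / z for d in range(dim)]
```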