Commit Graph

164 Commits

Author SHA1 Message Date
a983a8fb41 WIP: TMEM vector for per-row row_sum (not yet working)
Key finding: the root cause is that each epilogue thread owns MULTIPLE rows
in the QK C-fragment, so scalar row_max/row_sum are wrong (global across
all rows, not per-row). The V=ones diagnostic confirmed: all 128 threads
use the same row_sum (from row 114).

Tried: TMEM vector store+load of row_sum (composition(tStS, (128,2))).
This is a no-op because both write and read use the SAME QK partition
with a scalar row_sum. The vector approach only helps when different
partitions are used for write vs read, or when per-row values are stored.

Next steps:
1. Need PER-ROW row_max and row_sum, not per-thread scalar
2. The CUTLASS FMHA works because each thread owns exactly 1 row
3. Options: restructure thread layout, or compute per-row values differently
4. The vector must store ALL 128 per-row values, then read per-row in C9
2026-05-21 18:45:30 +00:00
331d9e95f3 WIP: Stage C softmax - partial progress
Key finding: cute.size(v, mode=[0]) in @cute.jit produces wrong code.
Hardcoding s_k=128 (matching Stage B) fixes the base pipeline.

Current status: kernel produces non-zero output but softmax math is still wrong.
Applied fixes: pv_done_bar, acc_scale with scale, fastmath=True
Need to debug row_sum computation and C9 normalization.
2026-05-21 18:04:21 +00:00
84cd636ba9 Stage C fixes: pv_done_bar sync, acc_scale with scale, fastmath=True
- Add pv_done_bar (barrier_id=4): MMA signals PV complete, epilogue
  waits before O rescale (C6) and final normalization (C9)
- Fix acc_scale: exp2(scale * (old_max - new_max)) includes the
  scale_softmax_log2 factor matching CUTLASS FMHA reference
- fastmath=True for both exp2 calls (P computation + rescale)
- No *0.5 (our scalar row_sum pattern initializes (0,0) not (sum,sum))
2026-05-21 17:58:04 +00:00
52b46a2dee Stage C: add validation harness with real softmax reference (C1) 2026-05-21 17:49:26 +00:00
3fb3c925af Restructure: cutedsl/ -> dsv4/ with proper layering
- Split bridge.py -> ops/quantize.py, ops/layouts.py, ops/gemm_runner.py
- Renamed classes: CuTeDSLNvfp4Linear -> Nvfp4Linear, etc.
- Moved kernel code to dsv4/kernels/ (gemm, attention, compressor, decode, cuda)
- Moved PyTorch bridges to dsv4/ops/
- Moved nn.Module layers to dsv4layers/
- Moved reference implementations to dsv4/reference/
- Moved vendored CUTLASS code to vendored/
- Archived ~190 debug tests to tests/archive/
- Kept ~15 canonical tests in tests/unit/
- Updated all import paths
- Added stubs for future components (model/, cache/, loader/)
- Updated pyproject.toml: dsv4-inference package name
2026-05-21 17:30:44 +00:00
99e143dd0e Fix: add scale_softmax_log2, use O TMEM rescale for C9 normalization
- scale_softmax_log2 was missing from _setup (patch artifact)
- C9 normalization: load O from TMEM, multiply by 1/row_sum, store back
  instead of trying to capture runtime value in const_expr lambda
- Then use standard epilogue_tma_store with identity transform
2026-05-21 17:15:15 +00:00
df04ba40ee Stage C: online softmax kernel (WIP) - test_fmha_v3_softmax.py
- C1: Real softmax reference (torch.softmax, not identity)
- C2: Per-thread row_max/row_sum registers
- C3: QK scale folded (1/sqrt(d) * log2(e))
- C4: Row max via .reduce(MAX)
- C5: Rescale factor (exp2(old_max - new_max))
- C6: O rescale in TMEM (correction_rescale pattern)
- C7: Real exp2 for P computation
- C8: Row sum via packed f32x2 reduction
- C9: Final normalization (1/row_sum in epilogue)
- Dynamic s_k for V FMHA reconstruction
- fastmath=False for correctness first
2026-05-21 17:10:58 +00:00
2030d41e41 Fix TMEM overlap in test_pv64_with_softmax.py too — cosine 0.999999
Same P/O overlap bug: O at col 64 overlapped P at [32,96).
Same fixes: O at col 128, FMHA V reconstruction, power-of-2 TMEM alloc.
2026-05-21 15:32:49 +00:00
0f4f69907e STAGE B BUG 4b FIXED: TMEM P/O overlap + FMHA V reconstruction
Root cause: PV output O started at TMEM column 64 (from find_tmem_tensor_col_offset),
overlapping with P at columns [32,96). PV MMA reading P while writing O to overlapping
columns corrupted the A operand mid-computation.

For (128,128) PV, O started at 128 (no overlap) so it worked by accident.
For (128,64) PV, O started at 64, overlapping P [32,96) -> NaN/garbage.

Fix: Place O at column 128 (after both S [0,128) and P [32,96)).
Also added FMHA-style V reconstruction: logical (HEAD_DIM, s_k, 1) stride (1, hd, hd*s_k)
instead of passing DLPack V directly to TMA.

test_fmha_v3.py: (128,64) PV with random V -> cosine 0.999999 PASS
2026-05-21 15:30:24 +00:00
4564758466 Stage B Bug 4b debugging: P/A alias proven working, V layout issue for (128,64) PV
Key findings:
- P/A alias WORKS: PV reads non-zero P from TMEM at offset 32 (proven by no-softmax test)
- V mode bug: V=(128,64) only loads 64 K-values, PV needs 128. Output = sum(S[:,:64]) = 0.67 cosine
- FMHA-style V reconstruction (hd,n,1) stride (1,hd) gives NaN for (128,64) PV
- K-major V (64,128) contiguous gives NaN for (128,64) PV
- Square (128,128) PV works with ALL V approaches (cosine 0.999999)
- Non-square PV consistently broken regardless of V layout

Test files:
- test_128_128_fmha_v.py: (128,128) with FMHA V - PASS
- test_pv64_fmha_v.py: (128,64) with FMHA V - NaN
- test_pv64_kmajor_v.py: (128,64) with K-major V - NaN
- test_pv64_with_softmax.py: (128,64) with original V - 0.67
- test_pv64_no_softmax.py: proves P/A alias works
- test_fmha_v3.py: full pipeline with QK C-fragment composition store
2026-05-21 15:20:14 +00:00
81d5d8d04c FMHA v3: KV-tile interleaving pipeline - QK works, Bug 4b blocks PV 2026-05-21 12:52:29 +00:00
73e03cfa6d Stage B: PV(128,64) test + v2 pipeline fixes
- test_pv64.py: (128,64) PV with separate V SMEM, single ab pipeline
  Result: cosine 0.669848 — data path works but P layout mismatch
  Softmax writes P via QK C-fragment layout, PV reads via PV A-fragment layout
  These differ for non-(128,128) PV — Bug 1 from README

- test_fmha_v2_fixed.py: KV-tile interleaved pipeline with fixes
  Fix 1: per-pipeline tx_count (Q vs KV separate byte counts)
  Fix 2: NamedBarrier for softmax-done signal (replaces double-acquire deadlock)
  Fix 3: Separate SMEM for V (no recast_ptr overlap with K)
  Still produces zeros — needs P layout fix (same root cause as test_pv64)
2026-05-21 11:49:06 +00:00
61b23efbcf stuff and stuff 2026-05-21 10:50:30 +00:00
d72f854efb FMHA v1: pv_mma_tiler=(128,64,128) works with V=I, fails with real V (SMEM layout bug) 2026-05-21 10:47:46 +00:00
dbb240adc9 Root cause FOUND: V SMEM only holds 1 K-tile (2048 BF16), but PV MMA iterates 8 K-phases. For non-(128,128) PV, most K-phases read wrong V data. Zero-padded V works because V is (128,128) covering all 8 K-phases. FMHA interleaves QK+PV per KV-tile to avoid this. 2026-05-21 09:56:54 +00:00
d4934371d0 Key finding: PV A-fragment layout is IDENTICAL for (128,128)/(128,32)/(128,16) PV. Bug is NOT TMEM alias. cta_tile_shape_mnk wrong for non-(128,128) PV. V SMEM and O C-fragment sizes look correct. Debugging V/epilogue paths. 2026-05-21 09:44:22 +00:00
422af26024 Update README: Bug 4 status, (128,16) PV zero output, (128,128) PV zero-pad workaround (cosine 1.0) 2026-05-21 09:20:09 +00:00
781684dd89 TMEM alias analysis: (128,16) PV broken, (128,128) PV with zero-pad works. Root cause: PV A-fragment layout differs from QK C-fragment layout for (128,16) PV, causing TMEM column mismatch. Using (128,128) PV as workaround. 2026-05-21 09:10:12 +00:00
96e7210db7 Debugging TMEM alias for (128,16) PV: zero output confirmed, PV reads from wrong TMEM columns. Need to align softmax P write with PV A-fragment layout. 2026-05-21 09:00:42 +00:00
ad3f63033d Stage B N-tiling: (128,16) PV MMA compiles and runs, cosine 0.36 (TMEM alias mismatch bug). FMHA head_dim=64 passes. Debugging TMEM layout alignment. 2026-05-21 08:45:49 +00:00
5e37ea56e4 FOOTGUN #0: num_tma_load_bytes MUST include V bytes. Fix v27, v29, comment all. Update README. 2026-05-21 07:13:14 +00:00
dd8d872bec v29: FIX DEADLOCK - add V bytes to num_tma_load_bytes. V=I(128,128) cosine 1.0 2026-05-21 07:08:29 +00:00
f1c4ee0e4d v29 (padded V, deadlocks), v30 (diag copy, works) — debugging epilogue deadlock with (128,128) PV 2026-05-21 06:40:27 +00:00
15c987244f v28 attempt: PV MMA (128,64) - cosine 0.004, debugging 2026-05-21 05:41:44 +00:00
c20518332e more stuff 2026-05-21 05:08:57 +00:00
0dc6fe4a7d Stage B progress: PV works for square (128,128), broken for (128,64)
- Bug 1 (V MN-major): Fix applied
- Bug 2 (softmax packing): Confirmed correct (V=I test: cosine 1.0)
- Bug 3 (ACCUMULATE): Fix applied (first PV must overwrite, not accumulate)
- Bug 4 (CURRENT): PV MMA broken for non-square output
  - (128,128) PV with random V: cosine 0.999999 
  - (128,64) PV with MN-major V: cosine ~0.01 
  - Softmax packing, layout aliasing, pipeline ordering all verified correct
  - Root cause unknown — likely epilogue/V layout/MMA tiler issue

Added test_pv_diag.py (V=I and random V, 128x128 output — PASS)
Added test_layout_compare.py (TMEM layout inspection)
Added test_inspect_types.py (TMEM pointer arithmetic verification)
Updated test_mma_si_pv.py with head_dim param, pv_mma_tiler_mn fix, ACCUMULATE fix
Updated READMEs with current state
2026-05-21 04:40:28 +00:00
7a8945eb76 Stage B: pipeline deadlock fixed, V MN-major applied, PV output garbage
Pipeline deadlock fixed:
- No cta_layout_vmnk on mma_si PipelineUmmaAsync
- TMA warp excluded from tmem.wait_for_alloc
- PipelineTmaStore (not TmaStorePipeline)

Bug 1 (V MN-major): fix applied
- PV MMA uses v_major=OperandMajorMode.MN
- V shaped (64,128) strides(1,64) via as_strided

Bug 2 (softmax packing): C-fragment composition store applied
- FP32 to BF16 packing works
- St32x32bOp uses Float32 (not BFloat16)

Bug 3 (PV garbage): investigating
- PV MMA cosine ~0.01 against reference
- Suspected TMEM layout mismatch between softmax P store and PV A-fragment read

Test results:
- test_mma_si_only: cosine 0.999999 PASS
- test_mma_si_pv: cosine 0.01 FAIL (pipeline works, PV output wrong)
2026-05-21 04:10:07 +00:00
467ade37b2 Stage B: C-fragment vs A-fragment TMEM layout mismatch diagnosed
Key finding: C-fragment and A-fragment use different physical TMEM address
mappings. St32x32bOp with C-fragment writes to C-layout addresses, but PV MMA
reads from A-layout addresses. Forward FMHA recast validated FP16 only, not BF16.

Working: FP32 ld/st roundtrip, BF16 elemwise, BF16 recast ld S0->st S1 (all cos 0.999999)
Broken: C-frag st + A-frag read (NaN), A-frag store + PV MMA (cos -0.02)
Next: Fix register data flow (128 FP16/thread load vs 64 BF16/thread store mismatch)
2026-05-21 00:12:47 +00:00
97656a5cd1 Stage B: two MMAs + identity softmax — crash fixed, softmax output still wrong
Key fixes:
- PipelineUmmaAsync consumer group: 32*4=128 threads (not 4 warps)
- TMEM offsets computed from find_tmem_tensor_col_offset (not hardcoded)
- P fragment from p_tmem_s.outer + make_fragment_A (matching fmha.py)
- V SMEM aliasing via recast_ptr

Status:
- Stage A: cosine 0.999999 
- Stage B: runs without crash, identity softmax cosine -0.02 
- Diagnostics: TMEM layout inspection, bisection results
2026-05-20 20:26:25 +00:00
bbba289bd8 feat: GPU-native SWA + sparse decode attention kernels (CuTeDSL)
- native_swa_decode.py: BlackwellSWADecodeKernel
  - CTA mapping: 1 CTA per (decode_token, q_head_group)
  - Online softmax with KV tile streaming (16 tokens/tile)
  - Pre-dequantized bf16 KV (fp8 dequant on host - MLIR cvt_fpext
    requires 32-bit aligned vector, no scalar fp8->bf16 support)
  - Cosine 0.9999+ vs PyTorch batched SDPA reference
  - Fallback _fallback_batched_sdp when CuTeDSL unavailable

- native_sparse_decode.py: BlackwellSparseDecodeKernel
  - Combined SWA + compressed KV in single attention pass
  - Supports CSA (cr=4) and HCA (cr=128) layers
  - Sink weight merge on host side
  - Cosine 0.9999+ vs combined SDPA reference

- fp8_bf16.py: Documents MLIR limitation (cvt_fpext requires
  vector<4xf8>, no scalar support). Pre-dequant is the workaround.

- vLLM wiring (attention.py):
  - SWA-only layers: native_swa_decode_attention
  - CSA/HCA layers: native_sparse_decode_attention with topk + attn_sink
  - csa_attention.py updated to use native kernels

- Tests: test_decode_pipeline.py, test_sparse_decode.py both passing
2026-05-20 05:46:15 +00:00
aa8563c626 Fused SwiGLU epilogue with granularity-8 weight interleave
- Fix interleave_l1_weights: remove //2 bug (g=granularity_bf16 for N-axis)
- Apply L1 weight+SF interleave in runner._ensure_stacked() and moe_pipeline
- De-interleave L1 GEMM output before gate/up split
- Fused SwiGLU kernel: epi_tile=(128,8) for subtile-level pairing
  - Even subtiles = gate: SiLU in FP32 registers, save to register buffer
  - Odd subtiles = up: silu(gate)*up from buffer
  - Both branches produce same BF16 tensor type (CuTeDSL constraint)
- run_nvfp4_moe_fused() pipeline: fused L1 + PyTorch L2
- Runner: fused_swiglu=True option for CuTeDSLMoERunner
- Layertest: both fused and non-fused paths PASS (cosine 0.988)
- README.md updated with current status and lessons learned
2026-05-20 04:13:52 +00:00
6c04155167 wip: Step 2 gate/up pairing — SiLU validated, runtime conditionals blocked by CuTeDSL
SiLU in registers: PASS (0.034% error, Step 1 stable)
Gate/up subtile detection: blocked by CuTeDSL type system

CuTeDSL compiles the kernel for ALL subtile iterations at once.
Runtime conditionals (if is_gate_subtile) that affect:
- Register tensor assignment → DSLRuntimeError (type structure mismatch)
- TMA store skipping → corrupted output
- Mask blending → wrong results

Path forward: use const_expr debug flag for the BF16 side output,
or process gate/up in a separate post-GEMM kernel.
2026-05-20 03:26:20 +00:00
08992b818d wip: add run_fused_swiglu_grouped_gemm bridge + step1 test 2026-05-20 03:10:56 +00:00
2f053f674e wip: fused SwiGLU kernel scaffold + bridge interleave + plan
- fused_swiglu_grouped_mm.py: copypaste of torch_scaled_grouped_mm.py with
  class rename and fused_swiglu/swiglu_limit params added
- bridge.py: added interleave_l1_weights, deinterleave_l1_weights,
  warmup_fused_swiglu_compilation
- Pure-PyTorch interleave invariant passes (A@cat vs deinterleave(A@interleave))
- Standalone GEMM interleave test fails due to kernel-internal N-tiling
  layout (expected, skipping per plan)
- FUSED_EPILOGUE_PLAN.md updated with register layout, amax shuffle plan,
  4-step implementation strategy
2026-05-20 03:04:38 +00:00
1857bdedc3 chore: deprecate prepare_weights_from_dequantized and prepare_weights_direct
Verified that our NVFP4 packing convention (odd<<4|even, round-half-to-even)
matches the DeepSeek-V4 checkpoint exactly: 100% byte-identical round-trip
across all tested experts. The dequantize->requantize path is lossless in
practice but wasteful. Marked both prepare_weights_from_dequantized and
prepare_weights_direct as deprecated in favor of prepare_weights_from_stacked
which loads checkpoint FP4 bytes directly via .view().

Also added test_fp4_roundtrip.py for future reference.
2026-05-20 02:11:40 +00:00
2e6559402c Add full layer NaN test (attention + MoE, multi-layer chain) 2026-05-19 18:36:49 +00:00
cca145e35c Use 16 experts for MoE runner test (fits in memory) 2026-05-19 18:35:40 +00:00
7893e7514d Add MoE runner NaN test (grouped GEMM with real weights) 2026-05-19 18:34:56 +00:00
7b432da754 Fix intermediate size: 3072 not 18432 2026-05-19 18:34:12 +00:00
293f14a179 Rewrite MoE NaN test: per-expert format, activation quantization, grouped GEMM 2026-05-19 18:33:57 +00:00
62f2395e30 Fix MoE weight key names, add fallback 2026-05-19 18:32:49 +00:00
9455466648 Add MoE NaN reproduction test, update CURRENT_BUG.md with NaN tracing and test plan 2026-05-19 18:32:14 +00:00
a94ad73c64 Fix imports in vLLM codepaths test 2026-05-19 17:26:50 +00:00
f3f9674810 Fix f-string syntax 2026-05-19 17:26:40 +00:00
6cc2312e61 Add test for exact vLLM codepaths (fused_qnorm, kv_write, decode) 2026-05-19 17:26:10 +00:00
abff942edd Fix N for C128A (need 128 tokens) 2026-05-19 16:04:53 +00:00
49c2e088d4 Fix compressor key name 2026-05-19 16:04:38 +00:00
7d89ede9f9 Add CSA sparse attention test (compressed KV gather + SWA merge) 2026-05-19 16:04:19 +00:00
696a890df7 Add decode vs prefill consistency test 2026-05-19 16:00:33 +00:00
359654f08e Test with all 61 layers (shared experts only) 2026-05-19 15:55:41 +00:00