eebf33b97d
test: clean minimal nvvm.inline_ptx test
2026-05-28 04:45:21 +00:00
882d48588b
test: debug nvvm.inline_ptx with CUTLASS_LOG_LEVEL=DEBUG
2026-05-28 04:44:35 +00:00
3ffb3b807a
test: minimal nvvm.inline_ptx isolation test
2026-05-28 04:43:18 +00:00
e33c48e44c
NVFP4-1.1: Use nvvm.inline_ptx instead of llvm.inline_asm for f32→i32
...
llvm.inline_asm fails with 'LLVM ERROR: unsupported operation' in the CuTeDSL
lowering pipeline. Switch to nvvm.inline_ptx, which is native to the NVVM
dialect and lowers correctly.
- f32_to_i32_rni: cvt.rni.s32.f32 via nvvm.inline_ptx
- f32_to_i32_rz: cvt.rzi.s32.f32 via nvvm.inline_ptx
- f32_to_i32_rmi: cvt.rmi.s32.f32 via nvvm.inline_ptx
2026-05-28 04:42:33 +00:00
74dba6ab9d
auto: pre-test commit
2026-05-28 04:40:20 +00:00
1cbb3cf752
NVFP4-1.1: Replace threshold rounding with inline PTX cvt.rni/rz/rmi
...
- Add f32_to_i32_rni (cvt.rni.s32.f32) for round-to-nearest-even
- Add f32_to_i32_rz (cvt.rzi.s32.f32) for round-toward-zero
- Add f32_to_i32_rmi (cvt.rmi.s32.f32) for round-toward-minus-infinity
- Replace round_rne_u0_8 and abs_scaled_to_e2m1_idx threshold hacks
  with proper PTX hardware rounding in fp8_e4m3_from_float32
- quantize_e2m1_nibble now uses f32_to_i32_rni + LUT logic for half_step
- Add test_ptx_convert.py for inline PTX conversion verification
- This is the CORRECT approach per NVFP4-1.1_INLINE_PTX_APPROACH.md option 1
2026-05-28 04:40:17 +00:00
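
For reference, the three cvt rounding modes named in this commit behave as follows on finite, in-range inputs. This is a host-side Python sketch of the semantics only, not the CuTeDSL helpers themselves: the `ref_` function names are hypothetical, and the saturation and NaN behaviour of the real instruction is not modeled.

```python
import math


def ref_f32_to_i32_rni(x: float) -> int:
    # cvt.rni: round to nearest integer, ties to even.
    # Python's built-in round() applies the same tie rule to floats.
    return round(x)


def ref_f32_to_i32_rz(x: float) -> int:
    # cvt.rzi: round toward zero (truncate the fractional part).
    return math.trunc(x)


def ref_f32_to_i32_rmi(x: float) -> int:
    # cvt.rmi: round toward minus infinity (floor).
    return math.floor(x)
```

The modes only disagree on negative or half-way inputs, for example -2.5 or 5.5, so a reference like this is mainly useful for checking those cases.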
2777ebfe8e
NVFP4-1.1: ultra-minimal test — Float32 comparison + Int32 select
2026-05-28 04:35:06 +00:00
2087eaef49
NVFP4-1.1: minimal threshold rounding test
2026-05-28 04:33:38 +00:00
1828a71cde
NVFP4-1.1: test kernel uses Float32 input (avoids BF16 scalar load issue)
2026-05-28 04:32:08 +00:00
d2aa93aad7
NVFP4-1.1: fix Int32 clamping — use comparisons instead of fmin/fmax (float-only ops)
2026-05-28 04:30:06 +00:00
accc66741d
NVFP4-1.1: update test kernel with threshold rounding API
2026-05-28 04:27:29 +00:00
dabcc415a8
NVFP4-1.1: threshold rounding for float-to-int — avoids CuTeDSL limitation
...
All float-to-int conversions replaced with threshold comparisons:
- round_rne_u0_8: mantissa rounding via Float32 comparisons → Int32 constants
- abs_scaled_to_e2m1_idx: direct |scaled| → E2M1 index (no half_step needed)
- Verified 0/500 trial failures against Python reference
Key thresholds (RNE boundaries):
- 0.25, 0.75, 1.25, 1.75, 2.75, 3.75, 5.25, each compared with > or >= so that ties break as in RNE
- Fixed: 2.75 must use >= (not >) to match round(5.5)=6 RNE
2026-05-28 04:26:40 +00:00
acf46c494c
NVFP4-1.1: update approach doc and fp4_quant with CuTeDSL API fixes
2026-05-28 04:09:58 +00:00
f3a2b37d70
NVFP4-1.1: document CuTeDSL float-to-int limitation, revise approach to compact SwiGLU output
2026-05-28 04:06:27 +00:00
c3d5a7b82f
NVFP4-1.1: try .to(Int32) for float-to-int conversion
2026-05-28 04:02:45 +00:00
dc35d29811
NVFP4-1.1: fix cute.arch.store signature - store(ptr, val) not store(ptr, val, dtype)
2026-05-28 04:01:38 +00:00
a05a76bb6b
NVFP4-1.1: add Int32 cast diagnostic test
2026-05-28 03:59:01 +00:00
e565ebce91
NVFP4-1.1: replace cute.math.fmin with cute.arch.fmin (correct API)
2026-05-28 03:55:54 +00:00
20d5ddfa3d
NVFP4-1.1: fix indentation for @cute.jit decorators
2026-05-28 03:52:46 +00:00
f6f59d34cb
NVFP4-1.1: add @cute.jit decorator to fp4_quant functions for CuTeDSL if-block support
2026-05-28 03:50:11 +00:00
0ecb98daee
auto: pre-test commit
2026-05-28 03:49:03 +00:00
6f94925491
NVFP4-1.1: fix cute.math.fmax -> cute.arch.fmax (correct CuTeDSL API)
2026-05-28 03:48:51 +00:00
60790564f0
NVFP4-1.1: fix test - two-pass kernel, cute.arch.store confirmed on B200
2026-05-28 03:46:45 +00:00
ca9f920414
auto: pre-test commit
2026-05-28 03:42:39 +00:00
a41de129cb
NVFP4-1.1: fix test kernel - use cute.copy instead of cute.arch.store
2026-05-28 03:42:24 +00:00
3a78bdf570
NVFP4-1.1: add CuTeDSL kernel test for FP4 quantization
2026-05-28 03:40:54 +00:00
80b6b79f9e
NVFP4-1.1: FP4 quantization primitives for CuTeDSL kernels
...
- fp8_e4m3_from_float32: manual FP8 E4M3 cast (bias=7, exp 0-15 valid,
  NaN guard for exp=15/mant=7, mantissa overflow handling)
- fp8_e4m3_to_float32: dequantize FP8 E4M3 bit pattern back to Float32
- half_step_to_e2m1_idx: E2M1 step mapping (0-12 → 0-7)
- quantize_e2m1_nibble: per-element E2M1 quantize + sign + pack
- Verified 0/500 trial failures against Python reference
- Key fixes discovered during validation:
  1. FP8 E4M3 bias is 7, NOT 8
  2. Exponent range is 0-15 (exp=15/mant=7 is NaN; others valid)
  3. Subnormal formula: val = m * 2^(-9) = m/512 (NOT m/1024)
  4. Round-to-nearest-even (not round-half-up) for half_step and mantissa
  5. Mantissa overflow (round to 8) must increment exponent
2026-05-28 03:39:55 +00:00
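
The format facts listed in this commit (bias 7, exponent field 0-15, NaN only at exp=15/mant=7, subnormal value m/512) can be sketched as a plain-Python decoder. This is an illustrative host-side reference under those stated rules, not the repository's fp8_e4m3_to_float32; the name `ref_fp8_e4m3_to_float` is hypothetical.

```python
import math


def ref_fp8_e4m3_to_float(bits: int) -> float:
    """Decode an 8-bit E4M3 pattern (1 sign, 4 exponent, 3 mantissa bits)."""
    sign = -1.0 if (bits >> 7) & 1 else 1.0
    exp = (bits >> 3) & 0xF
    mant = bits & 0x7
    if exp == 15 and mant == 7:
        # The only NaN encoding; other exp=15 patterns are ordinary values.
        return math.nan
    if exp == 0:
        # Subnormal: (mant / 8) * 2^(1 - 7) = mant * 2^-9 = mant / 512.
        return sign * mant * 2.0 ** -9
    # Normal: implicit leading 1, exponent bias 7.
    return sign * (1.0 + mant / 8.0) * 2.0 ** (exp - 7)
```

Under these rules the largest finite magnitude is the pattern with exp=15 and mant=6, which decodes to 1.75 * 2^8.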
b9f15c250f
Stage E: head-packed MQA/GQA, batch dim, custom_op, integration API
...
- production.py: head-packed M dimension for MQA/GQA (q_per_kv*T rows
  in single launch per KV group, eliminating redundant K/V TMA loads)
- production.py: batch dimension support (outer Python loop)
- production.py: warmup_attention_kernels() for pre-compilation
- production.py: dsv4_attention_per_head() for exact per-head sink bias
- __init__.py: sparse_fmha_with_swa, dense_fmha_with_swa, swa_only_fmha
  integration functions bridging AttentionSubBlock → production FMHA
- custom_ops.py: dsv4::sparse_fmha_with_swa custom_op registration
- test_production.py: comprehensive tests (MHA/MQA/GQA, head-packed vs
  per-head parity, multi-segment KV, SWA+causal+sink, batch, edge cases)
2026-05-27 15:15:03 +00:00
2412a5431b
MQA/GQA: batch Q heads into kernel batch dim, shared K/V per KV group
2026-05-27 08:31:23 +00:00
06a895ff99
Clean test suite for production attention (1/2/4 segments, multi-head)
2026-05-27 07:12:02 +00:00
778d9d4f4f
Compile with row_sums tensor so kernel writes per-row row_sums
2026-05-27 07:10:00 +00:00
0736a04d9b
Fix KV merge: use NORMALIZED O (O_unnorm/row_sum) with LSE
2026-05-27 07:07:51 +00:00
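
The merge this fix refers to can be illustrated with the standard log-sum-exp combination of two KV segments, where each segment contributes its normalized output (O_unnorm/row_sum) and its LSE. The sketch below is a scalar, single-row Python illustration and not the code in production.py; `attend_segment` and `merge_segments` are hypothetical names.

```python
import math


def attend_segment(scores, values):
    """Softmax-weighted average over one KV segment (scalar values, one row).

    Returns (normalized output, log-sum-exp of the scores).
    """
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    row_sum = sum(weights)
    o_unnorm = sum(w * v for w, v in zip(weights, values))
    return o_unnorm / row_sum, m + math.log(row_sum)


def merge_segments(o_a, lse_a, o_b, lse_b):
    """Combine two normalized segment outputs using their LSE values."""
    m = max(lse_a, lse_b)
    lse = m + math.log(math.exp(lse_a - m) + math.exp(lse_b - m))
    # Each segment is weighted by its share of the total softmax mass.
    o = math.exp(lse_a - lse) * o_a + math.exp(lse_b - lse) * o_b
    return o, lse
```

Merging the two halves of a score row this way reproduces attention over the whole row, which is why the inputs need to be normalized per segment first.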
06e7f7ab48
Debug: print LSE values for 2-segment merge
2026-05-27 07:04:39 +00:00
8f8d14c300
Match tensor slicing exactly to test_d1_kv_merge (2D slices, 3D unsqueeze)
2026-05-27 06:58:28 +00:00
6ee61717c0
Match tensor shapes from working test_d1_kv_merge
2026-05-27 06:56:04 +00:00
3a25c7feff
Test multi-KV merge (2 segments) separately from multi-head
2026-05-27 06:54:16 +00:00
36a6f07a7e
Fix: unsqueeze k/v when dim==2
2026-05-27 06:52:43 +00:00
fc4172937c
Clean production wrapper: always normalize=False + KV merge
2026-05-27 06:51:14 +00:00
8f87109f86
Single-segment: use normalize=False + per-row normalization from row_sums
2026-05-27 06:48:56 +00:00
fe55bf23a0
Split single-segment (normalized) and multi-segment (KV merge) paths
2026-05-27 06:46:30 +00:00
e45b94c01b
Test: compare both normalized and un-normalized reference
2026-05-27 06:44:37 +00:00
b70ab2a6ee
Return o_accum directly (un-normalized merge result)
2026-05-27 06:42:58 +00:00
6111db571c
Match working test: don't pass row_sums to kernel
2026-05-27 06:41:44 +00:00
312ac52d15
Normalize O_accum by exp(lse) before returning
2026-05-27 06:39:36 +00:00
ddc701af9b
Use exact merge formula from working test_d1_kv_merge.py
2026-05-27 06:38:04 +00:00
8321ccf9c1
Fix production KV merge: use normalized O for log-sum-exp merge
2026-05-27 06:36:24 +00:00
98c93c1cd8
Stage E: production attention wrapper + Python KV merge, clean fmha_smem_acc
2026-05-27 06:34:10 +00:00
51e456df44
Slice MMA tile coords from tOgO for TMA copy
2026-05-27 05:39:42 +00:00
1caa737b09
Move sC_flat_staged creation before const_expr guard
2026-05-27 05:38:39 +00:00
3c9dbc0c5d
Staged sC_flat with (128, pv_n_tile//2, 2) to match TMA atom
2026-05-27 05:37:05 +00:00