|
|
0ecb98daee
|
auto: pre-test commit
|
2026-05-28 03:49:03 +00:00 |
|
|
|
6f94925491
|
NVFP4-1.1: fix cute.math.fmax -> cute.arch.fmax (correct CuTeDSL API)
|
2026-05-28 03:48:51 +00:00 |
|
|
|
60790564f0
|
NVFP4-1.1: fix test - two-pass kernel, cute.arch.store confirmed on B200
|
2026-05-28 03:46:45 +00:00 |
|
|
|
ca9f920414
|
auto: pre-test commit
|
2026-05-28 03:42:39 +00:00 |
|
|
|
a41de129cb
|
NVFP4-1.1: fix test kernel - use cute.copy instead of cute.arch.store
|
2026-05-28 03:42:24 +00:00 |
|
|
|
3a78bdf570
|
NVFP4-1.1: add CuTeDSL kernel test for FP4 quantization
|
2026-05-28 03:40:54 +00:00 |
|
|
|
80b6b79f9e
|
NVFP4-1.1: FP4 quantization primitives for CuTeDSL kernels
- fp8_e4m3_from_float32: manual FP8 E4M3 cast (bias=7, exp 0-15 valid,
NaN guard for exp=15/mant=7, mantissa overflow handling)
- fp8_e4m3_to_float32: dequantize FP8 E4M3 bit pattern back to Float32
- half_step_to_e2m1_idx: E2M1 step mapping (0-12 → 0-7)
- quantize_e2m1_nibble: per-element E2M1 quantize + sign + pack
- Verified 0/500 trial failures against Python reference
- Key fixes discovered during validation:
1. FP8 E4M3 bias is 7, NOT 8
2. Exponent range is 0-15 (exp=15/mant=7 is NaN; others valid)
3. Subnormal formula: val = m * 2^(-9) = m/512 (NOT m/1024)
4. Round-to-nearest-even (not round-half-up) for half_step and mantissa
5. Mantissa overflow (round to 8) must increment exponent
|
2026-05-28 03:39:55 +00:00 |
|
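The FP8 E4M3 rules listed in commit 80b6b79f9e (bias 7, exponents 0-15 valid, exp=15/mant=7 is NaN, subnormals equal to m/512, round-to-nearest-even, mantissa overflow carrying into the exponent) can be sketched as a plain-Python reference. This is not the CuTeDSL code from the commit: the function names `fp8_e4m3_to_float_ref` and `fp8_e4m3_from_float_ref` are hypothetical, and saturating out-of-range inputs to 448 is an assumption the commit message does not state.

```python
import math

def fp8_e4m3_to_float_ref(bits: int) -> float:
    """Decode an 8-bit E4M3 pattern (1 sign, 4 exponent, 3 mantissa bits)."""
    sign = -1.0 if (bits >> 7) & 1 else 1.0
    exp = (bits >> 3) & 0xF
    mant = bits & 0x7
    if exp == 15 and mant == 7:
        return math.nan                      # the only NaN encoding
    if exp == 0:
        return sign * mant * 2.0 ** -9       # subnormal: m / 512
    return sign * (1 + mant / 8) * 2.0 ** (exp - 7)   # bias is 7

def fp8_e4m3_from_float_ref(x: float) -> int:
    """Encode a float as E4M3 with round-to-nearest-even."""
    if math.isnan(x):
        return 0x7F
    sign = 0x80 if math.copysign(1.0, x) < 0 else 0
    a = min(abs(x), 448.0)                   # assumption: saturate to max finite
    if a < 2.0 ** -6:
        # Subnormal range. Python's round() is round-half-to-even.
        # A result of 8 is the bit pattern exp=1, mant=0, as required.
        return sign | round(a * 512)
    frac, p = math.frexp(a)                  # a = frac * 2**p, 0.5 <= frac < 1
    e = p - 1                                # unbiased exponent
    mant = round((2 * frac - 1) * 8)         # 3-bit mantissa, ties to even
    if mant == 8:                            # mantissa overflow
        mant = 0
        e += 1
    return sign | ((e + 7) << 3) | mant
```

A midpoint such as 1.0625 (halfway between 1.0 and 1.125) encodes to 0x38, the neighbour with the even mantissa, and 1.96875 overflows the mantissa and encodes as 2.0 (0x40).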
|
|
b9f15c250f
|
Stage E: head-packed MQA/GQA, batch dim, custom_op, integration API
- production.py: head-packed M dimension for MQA/GQA (q_per_kv*T rows
in single launch per KV group, eliminating redundant K/V TMA loads)
- production.py: batch dimension support (outer Python loop)
- production.py: warmup_attention_kernels() for pre-compilation
- production.py: dsv4_attention_per_head() for exact per-head sink bias
- __init__.py: sparse_fmha_with_swa, dense_fmha_with_swa, swa_only_fmha
integration functions bridging AttentionSubBlock → production FMHA
- custom_ops.py: dsv4::sparse_fmha_with_swa custom_op registration
- test_production.py: comprehensive tests (MHA/MQA/GQA, head-packed vs
per-head parity, multi-segment KV, SWA+causal+sink, batch, edge cases)
|
2026-05-27 15:15:03 +00:00 |
|
|
|
2412a5431b
|
MQA/GQA: batch Q heads into kernel batch dim, shared K/V per KV group
|
2026-05-27 08:31:23 +00:00 |
|
|
|
06a895ff99
|
Clean test suite for production attention (1/2/4 segments, multi-head)
|
2026-05-27 07:12:02 +00:00 |
|
|
|
778d9d4f4f
|
Compile with row_sums tensor so kernel writes per-row row_sums
|
2026-05-27 07:10:00 +00:00 |
|
|
|
0736a04d9b
|
Fix KV merge: use NORMALIZED O (O_unnorm/row_sum) with LSE
|
2026-05-27 07:07:51 +00:00 |
|
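The fix in commit 0736a04d9b (merge the normalized O, i.e. O_unnorm/row_sum, using each segment's LSE) corresponds to the standard log-sum-exp merge of attention outputs over disjoint KV segments. The sketch below is a pure-Python illustration for a single query row, not the code in production.py; the helper names `attention_row_ref` and `merge_segments_ref` are hypothetical.

```python
import math

def attention_row_ref(scores, values):
    """Softmax attention for one query row over one KV segment.

    Returns the normalized output (O_unnorm / row_sum) and the row's LSE.
    """
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    row_sum = sum(weights)
    dim = len(values[0])
    o_unnorm = [sum(w * v[d] for w, v in zip(weights, values))
                for d in range(dim)]
    o = [x / row_sum for x in o_unnorm]
    lse = m + math.log(row_sum)
    return o, lse

def merge_segments_ref(parts):
    """Merge (normalized_o, lse) pairs computed on disjoint KV segments."""
    lse_max = max(lse for _, lse in parts)
    total = sum(math.exp(lse - lse_max) for _, lse in parts)
    lse_all = lse_max + math.log(total)      # LSE over all segments
    dim = len(parts[0][0])
    merged = [0.0] * dim
    for o, lse in parts:
        w = math.exp(lse - lse_all)          # this segment's share of the softmax mass
        for d in range(dim):
            merged[d] += w * o[d]
    return merged, lse_all
```

Weighting each normalized segment output by exp(lse_i - lse_all) reproduces the single-pass result. Applying the same weights to un-normalized outputs would count each row_sum twice, which is consistent with the bug this commit describes.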
|
|
06e7f7ab48
|
Debug: print LSE values for 2-segment merge
|
2026-05-27 07:04:39 +00:00 |
|
|
|
8f8d14c300
|
Match tensor slicing exactly to test_d1_kv_merge (2D slices, 3D unsqueeze)
|
2026-05-27 06:58:28 +00:00 |
|
|
|
6ee61717c0
|
Match tensor shapes from working test_d1_kv_merge
|
2026-05-27 06:56:04 +00:00 |
|
|
|
3a25c7feff
|
Test multi-KV merge (2 segments) separately from multi-head
|
2026-05-27 06:54:16 +00:00 |
|
|
|
36a6f07a7e
|
Fix: unsqueeze k/v when dim==2
|
2026-05-27 06:52:43 +00:00 |
|
|
|
fc4172937c
|
Clean production wrapper: always normalize=False + KV merge
|
2026-05-27 06:51:14 +00:00 |
|
|
|
8f87109f86
|
Single-segment: use normalize=False + per-row normalization from row_sums
|
2026-05-27 06:48:56 +00:00 |
|
|
|
fe55bf23a0
|
Split single-segment (normalized) and multi-segment (KV merge) paths
|
2026-05-27 06:46:30 +00:00 |
|
|
|
e45b94c01b
|
Test: compare both normalized and un-normalized reference
|
2026-05-27 06:44:37 +00:00 |
|
|
|
b70ab2a6ee
|
Return o_accum directly (un-normalized merge result)
|
2026-05-27 06:42:58 +00:00 |
|
|
|
6111db571c
|
Match working test: don't pass row_sums to kernel
|
2026-05-27 06:41:44 +00:00 |
|
|
|
312ac52d15
|
Normalize O_accum by exp(lse) before returning
|
2026-05-27 06:39:36 +00:00 |
|
|
|
ddc701af9b
|
Use exact merge formula from working test_d1_kv_merge.py
|
2026-05-27 06:38:04 +00:00 |
|
|
|
8321ccf9c1
|
Fix production KV merge: use normalized O for log-sum-exp merge
|
2026-05-27 06:36:24 +00:00 |
|
|
|
98c93c1cd8
|
Stage E: production attention wrapper + Python KV merge, clean fmha_smem_acc
|
2026-05-27 06:34:10 +00:00 |
|
|
|
51e456df44
|
Slice MMA tile coords from tOgO for TMA copy
|
2026-05-27 05:39:42 +00:00 |
|
|
|
1caa737b09
|
Move sC_flat_staged creation before const_expr guard
|
2026-05-27 05:38:39 +00:00 |
|
|
|
3c9dbc0c5d
|
Staged sC_flat with (128, pv_n_tile//2, 2) to match TMA atom
|
2026-05-27 05:37:05 +00:00 |
|
|
|
de2028b106
|
Split sC_flat into staged layout to match TMA atom decomposition
|
2026-05-27 05:35:56 +00:00 |
|
|
|
a0e9f7534b
|
Use tCgC_epi (transformed) for GMEM side of TMA partition
|
2026-05-27 05:34:40 +00:00 |
|
|
|
b02e103ac0
|
Add c_simple GMEM tensor (non-dynamic) for SMEM accumulator TMA store
|
2026-05-27 05:33:30 +00:00 |
|
|
|
2438826eee
|
Use tma_partition with group_modes on both sC_flat and gO
|
2026-05-27 05:31:47 +00:00 |
|
|
|
603f52de78
|
Fix gO creation: use slice_(pv_mma_tiler) like fmha.py
|
2026-05-27 05:30:50 +00:00 |
|
|
|
b39d7f1a14
|
Try cute.copy(tma_c, sC_flat, gO) directly
|
2026-05-27 05:29:51 +00:00 |
|
|
|
2af767a90c
|
Try full tensor TMA copy without slicing
|
2026-05-27 05:28:43 +00:00 |
|
|
|
7d14a2f764
|
sC_flat with simple (128, pv_n_tile) layout for full epi_tile coverage
|
2026-05-27 05:27:51 +00:00 |
|
|
|
6fb0e6a417
|
Use sC_flat (non-swizzled epi_s layout) for TMA store from SMEM accumulator
|
2026-05-27 05:26:50 +00:00 |
|
|
|
4a2a06f9e1
|
Fix gO slice: use separate Int32(0) instead of tuple
|
2026-05-27 05:25:33 +00:00 |
|
|
|
bf36979a8d
|
Use CUTLASS FMHA reference pattern for sC->GMEM TMA store (flat_divide + tma_partition)
|
2026-05-27 05:24:39 +00:00 |
|
|
|
97bc6d8d2f
|
Add c_direct GMEM tensor for direct writes in SMEM accumulator path
|
2026-05-27 05:15:47 +00:00 |
|
|
|
3d349b497b
|
SMEM accumulator: direct GMEM write from sO_acc (bypass TMA for multi-kt)
|
2026-05-27 05:14:31 +00:00 |
|
|
|
7d1e0a605d
|
Different coordinate dims for bSG_sC (2D) and bSG_gC (3D)
|
2026-05-27 05:13:38 +00:00 |
|
|
|
75b272c5f2
|
2D coordinate for bSG_sC TMA copy
|
2026-05-27 05:12:58 +00:00 |
|
|
|
72dff90165
|
3D coordinate for bSG_sC/gC TMA copy
|
2026-05-27 05:12:11 +00:00 |
|
|
|
b8b6e8cc0b
|
Slice bSG_gC MMA tile coords for TMA copy
|
2026-05-27 05:11:26 +00:00 |
|
|
|
754740d5e5
|
Try bSG_sC[(None, 0)] for TMA copy coordinate
|
2026-05-27 05:10:40 +00:00 |
|
|
|
23a2b49daf
|
Add SMEM accumulator for n_kv_tiles>1: O load from TMEM, accumulate in sO_acc, TMA store from sC
|
2026-05-27 05:09:54 +00:00 |
|
|
|
a858ed1c14
|
Fix test: normalize=False for un-normalized O comparison
|
2026-05-27 05:06:52 +00:00 |
|