e74c84458c
Clean up E2M1 dequant: use LUT approach (consultant recommendation)
...
Both indexer files now use a constexpr LUT matching Python's
E2M1_MAGNITUDES = [0, 0.5, 1, 1.5, 2, 3, 4, 6].
This is cleaner and more auditable than bit-manipulation.
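
A minimal host-side sketch of the LUT approach described above, not the code from either indexer file. The table values are the ones listed in this message. The names `kE2M1Magnitudes` and `dequant_e2m1_lut` are hypothetical, and treating bit 3 of the nibble as the sign bit is an assumption based on the usual E2M1 layout.

```cpp
#include <cstdint>

// Magnitudes for the 3-bit E2M1 code (2 exponent bits, 1 mantissa bit),
// in the same order as the Python E2M1_MAGNITUDES list quoted above.
constexpr float kE2M1Magnitudes[8] = {0.0f, 0.5f, 1.0f, 1.5f,
                                      2.0f, 3.0f, 4.0f, 6.0f};

// Hypothetical helper: decode one FP4 nibble and apply a per-block scale.
// Assumption: bit 3 is the sign bit, bits 0..2 index the magnitude table.
inline float dequant_e2m1_lut(std::uint8_t nibble, float scale) {
    const float mag = kE2M1Magnitudes[nibble & 0x07];
    const float signed_mag = (nibble & 0x08) ? -mag : mag;
    return signed_mag * scale;
}
```

With the table spelled out, each of the eight codes can be checked by eye against the Python list, which is the auditability argument made above.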
2026-05-28 16:17:47 +00:00
79ef87f9a9
FIX: E2M1 FP4 dequantization bug in indexer_score_topk.cu
...
The dequant_fp4_scalar function was treating the magnitude bits as
a raw integer (0-7) instead of the E2M1 floating-point format:
Old (WRONG): val = (int)(nibble & 0x07) * scale
New (CORRECT): proper E2M1 decode with exponent + mantissa
E2M1 encoding (bias=1):
exp=0 subnormal: 0b000=0, 0b001=0.5
exp=1: 0b010=1, 0b011=1.5
exp=2: 0b100=2, 0b101=3
exp=3: 0b110=4, 0b111=6
Bug found by outside consultant. Affects indexer top-k selection
correctness — wrong FP4 key decoding would select wrong CSA blocks.
Fixed in both:
- dsv4/kernels/indexer/indexer_score_topk.cu
- dsv4/kernels/cuda/indexer_score_topk.cu
2026-05-28 16:16:24 +00:00
44c4bade5f
Rewrite fmha_sm100_tc.cuh with working N=16 PV sub-tile approach
...
Production FMHA kernel template for Blackwell SM100:
- FmhaSm100Kernel<HD>::launch(q, k, v, o, s_k, scale, stream)
- QK: SS MMA N=128, one K-tile at a time
- PV: SS MMA N=16 sub-tiles (HD/16 calls per K-tile)
- Epilogue: TMEM → regs → BF16 → GMEM
- ~25KB SMEM for all HD values
- All HD=16/64/128/256 pass with cos 0.999997+
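
For readers unfamiliar with the pass criterion, the sketch below assumes "cos" means cosine similarity between the kernel output and a reference output, both flattened to 1-D. The function name `cosine_similarity` is hypothetical and this is not the test harness used in the repository.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Cosine similarity of two flattened outputs, accumulated in double.
// Compares only the common prefix if the lengths differ, and returns 0
// when either vector has zero norm.
inline double cosine_similarity(const std::vector<float>& a,
                                const std::vector<float>& b) {
    const std::size_t n = a.size() < b.size() ? a.size() : b.size();
    double dot = 0.0, norm_a = 0.0, norm_b = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
        dot += static_cast<double>(a[i]) * b[i];
        norm_a += static_cast<double>(a[i]) * a[i];
        norm_b += static_cast<double>(b[i]) * b[i];
    }
    if (norm_a == 0.0 || norm_b == 0.0) return 0.0;
    return dot / (std::sqrt(norm_a) * std::sqrt(norm_b));
}
```

A value near 1.0 means the two outputs point in almost the same direction. The metric ignores a uniform scale error, so it is usually paired with an absolute or relative error check.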
2026-05-28 16:04:11 +00:00
a18d9c1584
Update CURRENT_ISSUE: ALL HD=16/64/128/256 PASS cos 0.999997+
...
Documented Layout D N=64 bug and N=16 sub-tile workaround.
2026-05-28 16:03:05 +00:00
01319d7247
auto: pre-test commit
2026-05-28 15:59:22 +00:00
43516ed4ec
auto: pre-test commit
2026-05-28 15:55:59 +00:00
1ec3e1ed2c
auto: pre-test commit
2026-05-28 15:55:18 +00:00
babff1f402
auto: pre-test commit
2026-05-28 15:54:05 +00:00
2b007d2008
auto: pre-test commit
2026-05-28 15:53:39 +00:00
84b997881f
auto: pre-test commit
2026-05-28 15:53:04 +00:00
6e5401df3b
auto: pre-test commit
2026-05-28 15:51:55 +00:00
102174fade
auto: pre-test commit
2026-05-28 15:50:52 +00:00
2dcfc0089f
auto: pre-test commit
2026-05-28 15:49:47 +00:00
1cdb90462f
auto: pre-test commit
2026-05-28 15:48:15 +00:00
80fd612132
auto: pre-test commit
2026-05-28 15:47:58 +00:00
9583cbc67a
auto: pre-test commit
2026-05-28 15:46:53 +00:00
1b86860c19
auto: pre-test commit
2026-05-28 15:46:16 +00:00
66cc117e11
auto: pre-test commit
2026-05-28 15:44:45 +00:00
2b32b51882
Update CURRENT_ISSUE with final session status
2026-05-28 15:22:32 +00:00
6249989cf6
Clean up HD=64 test, V layout verified correct
2026-05-28 15:21:33 +00:00
e1daad6955
Verify V SMEM values vs GMEM for HD=64
2026-05-28 15:19:31 +00:00
bafd26707b
FMHA HD=64 with BLOCK_MN_B=16, 4 N-tiles per K-tile
2026-05-28 15:17:40 +00:00
6896d1aebb
Update CURRENT_ISSUE: HD=16 done, HD=64 in progress
2026-05-28 15:16:19 +00:00
6b9b06647a
Clean up HD=64 debug prints, keep register-math PV check
2026-05-28 15:15:22 +00:00
5c9d471162
Add register-math PV reference for HD=64 debug
2026-05-28 15:13:47 +00:00
43e9efbc2b
Fix string literal
2026-05-28 15:12:20 +00:00
906be7ce50
Add filtered cosine (exclude near-zero)
2026-05-28 15:11:14 +00:00
40c83c769a
Fix: remove ×2 QK scale correction (MMA scale is 1.0, not 0.5)
2026-05-28 15:09:57 +00:00
6ea7356fdd
Debug: print P values for HD=64
2026-05-28 15:07:55 +00:00
4b052f22a5
Fix: opt into >48KB shared memory for HD=64
2026-05-28 15:06:37 +00:00
7becbfc07e
Fix: printf after var declarations
2026-05-28 15:03:25 +00:00
2d44f8e356
Debug: check if HD=64 kernel starts
2026-05-28 15:02:00 +00:00
46e4d07c71
Test PV SS MMA with B=(64,16) BLOCK_MN=64
2026-05-28 14:58:10 +00:00
465e089a2b
Add launch error check for HD=64
2026-05-28 14:56:07 +00:00
2fd64c464d
FMHA HD=64 with BLOCK_MN_B=64 for V, proper output dimensions
2026-05-28 14:54:10 +00:00
15ecc1f616
Full FMHA HD=64 with PV SS MMA (SMEM-P)
2026-05-28 14:52:29 +00:00
5b2e690936
Milestone: Full FMHA HD=16 with PV SS MMA (SMEM-P) — cosine 0.9997
2026-05-28 14:50:43 +00:00
78026839b7
Fix V canonical layout: swap g_mn/g_k indices (d=MN, lr=K)
2026-05-28 14:49:17 +00:00
9a3b43c42b
Fix reference to also use uniform P
2026-05-28 14:47:10 +00:00
75bdcbf728
Debug: override P with uniform 1/128
2026-05-28 14:46:21 +00:00
af93c283c7
Enable all 8 PV K-tiles
2026-05-28 14:45:13 +00:00
6f5be8a4e4
Debug: print P values
2026-05-28 14:44:09 +00:00
3d15f5bb21
Debug: 1 PV K-tile
2026-05-28 14:43:01 +00:00
284a06ddf1
FMHA v5: clean rewrite with QK + softmax + PV SS per K-tile
2026-05-28 14:42:13 +00:00
342193e0b4
Fix tb scope
2026-05-28 14:40:55 +00:00
a6f7ef7c45
Add softmax read from TMEM
2026-05-28 14:40:35 +00:00
38b0ff0bf8
Add QK GEMM to minimal PV test
2026-05-28 14:39:51 +00:00
e9f8f9e6e3
Minimal PV with s_p_vals in SMEM
2026-05-28 14:38:58 +00:00
97ebb964a2
Move s_p_vals to dynamic SMEM
2026-05-28 14:38:03 +00:00
d2387dd858
Full FMHA v4: per-K-tile P fill into reusable (128,16) buffer
2026-05-28 14:37:11 +00:00