Commit Graph

5 Commits

Author SHA1 Message Date
8f35b75164 D2: comprehensive head-packed test (n_h=1, 64, 128, hd=64, 128) 2026-05-25 17:16:05 +00:00
efdedab399 fix tests: use 3D tensors (M, hd, 1) matching kernel local_tile expectations 2026-05-25 16:54:56 +00:00
a4499f5aa8 fix tests: pad Q to 128 rows (M tile size) for all configs 2026-05-25 16:53:17 +00:00
af136eee27 fix: use CUstream instead of cuStream(0) 2026-05-25 16:51:52 +00:00
4826fa6afb D2: add num_query_heads/batch_size params + head-packed test
- FmhaKernel.__init__: add num_query_heads=1, batch_size=1
- Grid: (ceil_div(n_h*T, 128), 1, batch) for multi-CTA
- Test: head-packed multi-head (Q reshaped to (n_h*T, hd))
- n_h=1 regression, n_h=128 Pro decode, n_h=64 Flash, hd=128
2026-05-25 16:50:49 +00:00
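
Note on the grid in 4826fa6afb: the launch-grid formula and the 128-row padding from a4499f5aa8 can be sketched in plain Python. This is a minimal illustration, not the kernel's actual code. The helper names (`head_packed_grid`, `padded_rows`, `packed_row_index`) are hypothetical, and the head-major row order (row = head * T + token) is an assumption, since the commit messages only say that Q is reshaped to (n_h*T, hd).

```python
M_TILE = 128  # M tile size, per the commit messages


def ceil_div(a: int, b: int) -> int:
    """Integer ceiling division for non-negative a and positive b."""
    return (a + b - 1) // b


def head_packed_grid(num_query_heads: int, seq_len: int, batch_size: int,
                     m_tile: int = M_TILE) -> tuple:
    """Grid (ceil_div(n_h*T, 128), 1, batch) as written in commit 4826fa6afb."""
    packed_rows = num_query_heads * seq_len  # heads folded into the row dimension
    return (ceil_div(packed_rows, m_tile), 1, batch_size)


def padded_rows(rows: int, m_tile: int = M_TILE) -> int:
    """Row count after padding Q up to a multiple of the M tile."""
    return ceil_div(rows, m_tile) * m_tile


def packed_row_index(head: int, token: int, seq_len: int) -> int:
    """Row of (head, token) in packed Q. ASSUMPTION: head-major packing."""
    return head * seq_len + token


if __name__ == "__main__":
    # n_h and T values chosen for illustration only
    for n_h, T in [(1, 1), (64, 1), (128, 1), (128, 4)]:
        rows = n_h * T
        print(n_h, T, head_packed_grid(n_h, T, batch_size=1), padded_rows(rows))
```

With one query token (T=1), n_h=64 and n_h=128 both fit in a single 128-row tile, so the x dimension of the grid is 1; the n_h=64 case fills only half of the tile, which is where padding to 128 rows matters.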