nvfp4-megamoe-kernel/dsv4/kernels
biondizzle dbe2ecbd41 D2: add num_query_heads/batch_size params + batch grid dimension
- Head-packed approach: Q has shape (n_h*T, hd, 1); the kernel treats each row independently
- Grid: (1, 1, batch); the M dimension is handled by head packing
- n_h=128, T=1 → M=128: one MMA tile, all heads in a single CTA
- Tested: cosine similarity 0.999995 for both n_h=1 and n_h=128
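
The index arithmetic behind the head-packed layout can be sketched in plain Python. This is only an illustration under stated assumptions, not the kernel's actual code: the helper names (`packed_rows`, `pack_row`, `unpack_row`, `launch_grid`, `cosine_similarity`) are hypothetical, the head-major row ordering is an assumption, and the 128-row tile height is inferred from the "M=128, one MMA tile" note above.

```python
import math

# Assumed MMA tile height along M, inferred from "M=128, one MMA tile".
MMA_TILE_M = 128


def packed_rows(num_query_heads, num_tokens):
    """M extent of the packed Q: one row per (head, token) pair."""
    return num_query_heads * num_tokens


def pack_row(head, token, num_tokens):
    """Map (head, token) to a packed row index (head-major order is an assumption)."""
    return head * num_tokens + token


def unpack_row(row, num_tokens):
    """Inverse of pack_row: recover (head, token) from a packed row index."""
    return divmod(row, num_tokens)


def launch_grid(num_query_heads, num_tokens, batch_size):
    """Return a (1, 1, batch) grid, assuming all packed rows fit in one tile."""
    m = packed_rows(num_query_heads, num_tokens)
    if m > MMA_TILE_M:
        # This sketch covers only the single-tile case described above.
        raise ValueError(f"M={m} exceeds one MMA tile ({MMA_TILE_M} rows)")
    return (1, 1, batch_size)


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length flat sequences of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)
```

With n_h=128 and T=1, `packed_rows(128, 1)` gives M=128 and `launch_grid(128, 1, batch_size)` yields `(1, 1, batch_size)`, so each batch element maps to one CTA that covers every head. A check in the spirit of the reported test would compare the flattened kernel output against a reference implementation with `cosine_similarity`.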
2026-05-25 17:15:08 +00:00