- FmhaKernel.__init__: add num_query_heads=1, batch_size=1
- Grid: (ceil_div(n_h*T, 128), 1, batch) for multi-CTA
- Test: head-packed multi-head (Q reshaped to (n_h*T, hd))
- n_h=1 regression, n_h=128 Pro decode, n_h=64 Flash, hd=128
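
For reference, a minimal sketch of the grid arithmetic described above. It is not the kernel code: `launch_grid`, `packed_row`, and `tile_m` are hypothetical names, the head-major row order (row = head * T + token) is an assumption about how Q is packed, and the T and batch values in the loop are illustrative only.

```python
def ceil_div(a: int, b: int) -> int:
    """Integer division rounded up."""
    return (a + b - 1) // b


def launch_grid(num_query_heads: int, seq_len: int,
                batch_size: int = 1, tile_m: int = 128) -> tuple:
    """Grid as stated in the change list: (ceil_div(n_h*T, 128), 1, batch).

    tile_m=128 is the row-tile size taken from that formula; one CTA is
    assumed to cover tile_m rows of the head-packed Q of shape (n_h*T, hd).
    """
    return (ceil_div(num_query_heads * seq_len, tile_m), 1, batch_size)


def packed_row(head: int, token: int, seq_len: int) -> int:
    """Row of (head, token) in the packed Q, assuming head-major packing."""
    return head * seq_len + token


# Illustrative shapes only; T and batch are not taken from the tests.
for n_h, T, batch in [(1, 1, 1), (64, 1, 1), (128, 1, 1), (128, 4, 2)]:
    print(f"n_h={n_h} T={T} batch={batch} -> grid={launch_grid(n_h, T, batch)}")
```

With the defaults num_query_heads=1 and batch_size=1, the grid reduces to (ceil_div(T, 128), 1, 1). More than one CTA along x is launched only once n_h*T exceeds 128.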