nvfp4-megamoe-kernel

Files

biondizzle dbe2ecbd41 D2: add num_query_heads/batch_size params + batch grid dimension

- Head-packed approach: Q is (n_h*T, hd, 1), kernel treats each row independently
- Grid: (1, 1, batch) — M dimension handled by head packing
- n_h=128, T=1 → M=128, one MMA tile, all heads in single CTA
- Tested: cosine similarity 0.999995 for both n_h=1 and n_h=128
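
The shape arithmetic in the bullets above can be sketched in plain Python. This is an illustration only, not the kernel code from this commit: the function names, the `head_dim=64` value in the usage lines, and the 128-row MMA tile height are assumptions (the commit only states that M=128 fits in one MMA tile), and `cosine_similarity` is assumed to be what "cos" refers to.

```python
import math


def packed_launch_shape(num_query_heads, seq_len, head_dim, batch_size, tile_m=128):
    """Return (q_shape, grid, m_tiles) for a head-packed launch.

    Hypothetical helper: heads and tokens are folded into the M (row)
    dimension, so each row of Q is processed independently.
    tile_m=128 is an assumed MMA tile height, not taken from the source.
    """
    m = num_query_heads * seq_len        # packed M dimension: n_h * T
    q_shape = (m, head_dim, 1)           # Q layout as described: (n_h*T, hd, 1)
    grid = (1, 1, batch_size)            # only the batch index varies across CTAs
    m_tiles = math.ceil(m / tile_m)      # MMA tiles needed to cover M
    return q_shape, grid, m_tiles


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Decode case from the commit message: n_h=128, T=1, so M=128 and one tile.
q_shape, grid, m_tiles = packed_launch_shape(
    num_query_heads=128, seq_len=1, head_dim=64, batch_size=4
)
print(q_shape, grid, m_tiles)
```

With these assumptions, a single CTA per batch element covers all 128 heads, which matches the "one MMA tile, all heads in single CTA" note. A check like the one in the commit would compare the flattened kernel output against a reference output using `cosine_similarity`.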

2026-05-25 17:15:08 +00:00

cache

Flush compressor: schema fix, prepare_forward, flush_write kernels, state rotation

2026-05-22 00:25:47 +00:00

kernels

D2: add num_query_heads/batch_size params + batch grid dimension

2026-05-25 17:15:08 +00:00

layers

NVFP4-1.1 integration: GPU-only quantize kernel + MoE pipeline wiring

2026-05-25 16:19:07 +00:00

loader

Restructure: cutedsl/ -> dsv4/ with proper layering