nvfp4-megamoe-kernel/tests
biondizzle 409838ace2 refactor: per-sub-tile TMA loads with padded GMEM allocations
- Q, K, V all loaded per (128,16) sub-tile via TMA
- Q GMEM padded to (128, HD) to satisfy TMA tile requirements
- Simpler SMEM layout: only (128,16) staging buffers are needed
- Updated test with padded allocations
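
The padding and sub-tile arithmetic described above can be sketched on the host side as follows. This is an illustration only, not the code from this commit: the helper names `padded_shape` and `sub_tile_origins` are hypothetical, and it assumes that padding means rounding the row count up to a multiple of 128 and that the head dimension (HD) is a multiple of 16.

```python
# Hypothetical host-side helpers; not taken from the repository.
TILE_ROWS = 128  # sub-tile rows, per the commit message
TILE_COLS = 16   # sub-tile columns, per the commit message


def padded_shape(rows: int, head_dim: int) -> tuple[int, int]:
    """Return the (rows, head_dim) shape after padding rows to a tile multiple.

    Assumption: the padded allocation rounds rows up to a multiple of 128
    so that every (128, 16) sub-tile lies fully inside the buffer.
    """
    if rows <= 0 or head_dim <= 0:
        raise ValueError("rows and head_dim must be positive")
    if head_dim % TILE_COLS != 0:
        raise ValueError("head_dim is assumed to be a multiple of 16")
    padded_rows = -(-rows // TILE_ROWS) * TILE_ROWS  # integer ceiling division
    return padded_rows, head_dim


def sub_tile_origins(rows: int, head_dim: int) -> list[tuple[int, int]]:
    """List the (row, col) origin of each (128, 16) sub-tile in the padded buffer."""
    padded_rows, hd = padded_shape(rows, head_dim)
    return [
        (r, c)
        for r in range(0, padded_rows, TILE_ROWS)
        for c in range(0, hd, TILE_COLS)
    ]
```

Under these assumptions, a single-row Q with HD = 64 is padded to (128, 64) and is covered by four sub-tile loads.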
2026-05-29 04:41:03 +00:00