- Q, K, V all loaded per (128,16) sub-tile via TMA
- Q GMEM padded to (128, HD) to satisfy TMA tile requirements
- Simpler SMEM layout: only (128,16) staging buffers needed
- Updated test with padded allocations
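For reference, the padded allocation and sub-tile count described above can be sketched on the host side roughly as follows. This is only an illustration, not the code in this change: the names `padded_rows`, `sub_tiles`, and `pad_q` are hypothetical, and the sketch assumes a row-major `float` Q, zero fill for the padding rows, and rounding up to a multiple of 128 rows.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical constants mirroring the (128,16) sub-tile shape.
constexpr std::size_t kTileRows = 128;  // rows per TMA tile
constexpr std::size_t kTileCols = 16;   // columns per sub-tile

// Round a row count up to a whole number of 128-row tiles (assumption).
constexpr std::size_t padded_rows(std::size_t rows) {
    return (rows + kTileRows - 1) / kTileRows * kTileRows;
}

// Number of (128,16) sub-tiles needed to cover head_dim columns.
constexpr std::size_t sub_tiles(std::size_t head_dim) {
    return (head_dim + kTileCols - 1) / kTileCols;
}

// Copy a row-major (rows, head_dim) Q into a zero-filled padded buffer.
// Zero fill and float storage are assumptions made for this sketch.
std::vector<float> pad_q(const std::vector<float>& q,
                         std::size_t rows, std::size_t head_dim) {
    assert(q.size() >= rows * head_dim);
    std::vector<float> out(padded_rows(rows) * head_dim, 0.0f);
    // Row stride is unchanged, so the valid rows are one contiguous copy.
    for (std::size_t i = 0; i < rows * head_dim; ++i) {
        out[i] = q[i];
    }
    return out;
}
```

With HD = 64, for example, each 128-row tile would be covered by four (128,16) sub-tile loads, and a test would allocate `padded_rows(rows) * HD` elements for Q rather than `rows * HD`.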