nvfp4-megamoe-kernel

biondizzle/nvfp4-megamoe-kernel

Commit Graph

Author	SHA1	Message	Date

biondizzle

e747742598

P7: Document TMEM column layout, add multi-row softmax test

docs/p7_tmem_column_layout.md: Verified that tcgen05.ld 32x32b.x8 is
the correct instruction for multi-row softmax. Each call reads 8 KV
positions for 32 rows. No instruction change needed from single-row.
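
The "8 KV positions per call" access pattern can be illustrated on the host with a row-wise softmax that visits each row eight columns at a time. This is only a sketch: the commit does not say the kernel uses an online (running max/sum) softmax, so that choice, the `CHUNK` constant, and the function name `softmax_rows_chunked` are assumptions made for illustration.

```python
import math

CHUNK = 8  # columns consumed per step, mirroring the ".x8" in the commit (assumption)


def softmax_rows_chunked(scores):
    """Row-wise softmax that walks each row CHUNK columns at a time.

    `scores` is a list of rows (lists of finite floats). A running max and
    running sum are kept per row, so no row has to be read in one piece
    before normalisation.
    """
    out = []
    for row in scores:
        running_max = -math.inf
        running_sum = 0.0
        for start in range(0, len(row), CHUNK):
            chunk = row[start:start + CHUNK]
            new_max = max(running_max, max(chunk))
            # Rescale the sum accumulated so far to the new max, then add this chunk.
            running_sum *= math.exp(running_max - new_max)
            running_sum += sum(math.exp(x - new_max) for x in chunk)
            running_max = new_max
        out.append([math.exp(x - running_max) / running_sum for x in row])
    return out
```

With 32 rows and N columns, the inner loop runs ceil(N / 8) times per row, which is the host-side analogue of issuing one 8-column load per step for a 32-row group.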

test_p7_multi_row_softmax.py: Tests T=1,4,32,64,128 at various HD and N.
Gate: cos >= 0.999996.
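
A gate of this kind is usually a cosine-similarity check between the kernel output and a reference. The sketch below shows one plausible form in plain Python; the helper names `cosine_similarity` and `passes_gate` are hypothetical and are not taken from test_p7_multi_row_softmax.py, which may flatten tensors and compute the metric differently.

```python
import math

COS_GATE = 0.999996  # threshold quoted in the commit message


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, flattened vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def passes_gate(actual, expected, gate=COS_GATE):
    """True when `actual` matches `expected` at or above the gate."""
    return cosine_similarity(actual, expected) >= gate
```

Cosine similarity ignores a uniform scale error, so a check like this is often paired with an absolute or relative tolerance when magnitude matters.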

2026-05-30 17:17:54 +00:00

biondizzle

6421f7c3f3

P4 RESOLVED: TMA hang was GMEM misalignment, not descriptor/driver issue

Evidence: TMA loads succeed with 128B-aligned GMEM on all descriptor configs.
The bit-21 workaround was NOT needed. The 'misaligned address' crashes were
caused by passing non-128B-aligned GMEM pointers to cp.async.bulk.tensor.
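
The 128-byte requirement described above can be checked on the host before a pointer is handed to the kernel. The following is a minimal sketch using plain integers as addresses; `TMA_GMEM_ALIGN`, `is_aligned`, and `align_up` are illustrative names, not identifiers from the repository.

```python
TMA_GMEM_ALIGN = 128  # byte alignment named in the commit message


def is_aligned(address, alignment=TMA_GMEM_ALIGN):
    """True when `address` is a multiple of `alignment` bytes."""
    return address % alignment == 0


def align_up(address, alignment=TMA_GMEM_ALIGN):
    """Smallest multiple of `alignment` that is >= `address`."""
    return (address + alignment - 1) // alignment * alignment
```

A typical use would be asserting `is_aligned(ptr)` on each device pointer right before launch, or over-allocating by `alignment - 1` bytes and passing `align_up(base)` when an allocator or a tensor slice offset does not guarantee the alignment.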

Added docs/p4_tma_hang_resolution.md with root cause and fix.
Cleaned up stale P4 test files.

2026-05-30 08:42:18 +00:00

biondizzle

a40c05f3f2

archive: TMA driver-API files + CUDA 13 TMA discovery notes

Key findings documented in docs/cuda13_tma_notes.md:
- CUDA 13 globalStrides are in BYTES not elements (root cause of desc creation failures)
- BFLOAT16 data type available in CUDA 13
- Driver API descriptors create OK but cp.async.bulk.tensor hangs on driver 13.0 + toolkit 13.2
- CuTeDSL tma_partition works (production path)

Archived (not deleted):
- fmha_tma_driver_api.cuh, fmha_6warp_tma_driver_api.cuh, test_fmha_tma_driver_api.cu
- These will work once driver matches toolkit version
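
The bytes-versus-elements finding for globalStrides can be made concrete with a small stride calculation. This sketch assumes a densely packed tensor whose dimensions are listed innermost first, and assumes the descriptor takes one stride per dimension except the innermost (rank - 1 entries); `global_strides_bytes` is a hypothetical helper, not code from the archived files.

```python
def global_strides_bytes(shape, elem_size):
    """Byte strides for dimensions 1..rank-1 of a packed tensor.

    `shape` lists extents innermost dimension first; `elem_size` is the
    element size in bytes. The innermost stride is implicit (one element),
    so the result has len(shape) - 1 entries.
    """
    strides = []
    stride = elem_size
    for extent in shape[:-1]:
        stride *= extent  # bytes needed to step once in the next-outer dimension
        strides.append(stride)
    return strides
```

For a 128 x 64 BFLOAT16 tile (2 bytes per element), this gives a stride of 256 bytes for the outer dimension, whereas counting in elements would give 128, which is the kind of mismatch the note identifies as the cause of descriptor creation failures.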

2026-05-29 06:52:39 +00:00

3 Commits