Evidence: TMA loads succeed with 128B-aligned GMEM on all descriptor configs.
The bit-21 workaround was NOT needed. The 'misaligned address' crashes were
caused by passing non-128B-aligned GMEM pointers to cp.async.bulk.tensor.
Added docs/p4_tma_hang_resolution.md with root cause and fix.
Cleaned up stale P4 test files.
Key findings documented in docs/cuda13_tma_notes.md:
- CUDA 13 globalStrides are in BYTES not elements (root cause of desc creation failures)
- BFLOAT16 data type available in CUDA 13
- Driver API descriptors create OK but cp.async.bulk.tensor hangs on driver 13.0 + toolkit 13.2
- CuTeDSL tma_partition works (production path)
Archived (not deleted):
- fmha_tma_driver_api.cuh, fmha_6warp_tma_driver_api.cuh, test_fmha_tma_driver_api.cu
- These will work once driver matches toolkit version