Evidence: TMA loads succeed with 128B-aligned GMEM on all descriptor configs. The bit-21 workaround was NOT needed. The 'misaligned address' crashes were caused by passing non-128B-aligned GMEM pointers to cp.async.bulk.tensor. Added docs/p4_tma_hang_resolution.md with root cause and fix. Cleaned up stale P4 test files.
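
A minimal host-side sketch of the kind of guard that would catch this class of bug before launch. The helper names (`is_tma_aligned`, `require_tma_aligned`, `align_up`) and the constant `kTmaGmemAlign` are hypothetical and not part of this change; the 128-byte value is taken from the finding above, not from the CUDA documentation.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumption: 128 B is the GMEM alignment observed to be required in the
// finding above. Adjust if the descriptor configuration differs.
constexpr std::size_t kTmaGmemAlign = 128;

// Hypothetical helper: true if `ptr` is a multiple of `align` bytes.
inline bool is_tma_aligned(const void* ptr,
                           std::size_t align = kTmaGmemAlign) {
    return reinterpret_cast<std::uintptr_t>(ptr) % align == 0;
}

// Hypothetical helper: fail fast on the host (debug builds) instead of
// hitting a 'misaligned address' error on the device.
inline void require_tma_aligned(const void* ptr) {
    assert(is_tma_aligned(ptr) && "GMEM pointer is not 128B-aligned");
}

// Hypothetical helper: round a byte offset up to the next multiple of
// `align`. `align` must be a power of two.
constexpr std::size_t align_up(std::size_t offset,
                               std::size_t align = kTmaGmemAlign) {
    return (offset + align - 1) & ~(align - 1);
}
```

Calling `require_tma_aligned(base + align_up(byte_offset))` on each sub-buffer pointer before it is used for a TMA load would flag pointers carved out of a larger allocation at unaligned offsets, which is one common way such pointers arise.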