4 softmax warps (0-3), 4 correction warps (4-7), 1 MMA warp (8), 1 TMA warp (9). 320 threads total.
Softmax: QK→softmax, write P, write row metadata to TMEM vector.
Correction: read vector via QK partition, rescale O (C6), normalize O (C9).

Compiles successfully but hits CUDA_ERROR_ILLEGAL_ADDRESS at runtime.
Likely: the vector TMEM offsets or the correction TMEM access layout is wrong.

Key files:
- tests/unit/test_fmha_v3_correction.py (new correction architecture)
- tests/unit/test_fmha_v3_softmax.py (working n=128, cosine 0.993)
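For checking the correction path against known-good numbers, a host-side sketch of the warp layout and of the rescale/normalize math may help. This is plain Python, not the kernel code. It assumes the "row metadata" is a running row max and running row sum, as in the usual online-softmax formulation; the notes do not confirm that. All names here (ROLES, role_of, total_threads, attention_row_online, attention_row_reference) are hypothetical, and nothing about TMEM offsets or layout is modeled.

```python
import math

WARP_SIZE = 32
# Warp ranges taken from the notes; role names are labels for this sketch only.
ROLES = {
    "softmax": range(0, 4),
    "correction": range(4, 8),
    "mma": range(8, 9),
    "tma": range(9, 10),
}


def role_of(warp_idx):
    """Return the role name assigned to a warp index."""
    for name, warps in ROLES.items():
        if warp_idx in warps:
            return name
    raise ValueError(f"warp {warp_idx} has no role")


def total_threads():
    """Threads per CTA implied by the warp layout."""
    return WARP_SIZE * sum(len(warps) for warps in ROLES.values())


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attention_row_online(q, keys, values, block=2):
    """One query row, KV processed in blocks with online softmax.

    ASSUMPTION: row metadata = (running max m, running sum l).
    """
    m = -math.inf                      # running row max
    l = 0.0                            # running row sum
    o = [0.0] * len(values[0])         # unnormalized output accumulator
    for start in range(0, len(keys), block):
        # Softmax role: scores, new max, unnormalized probabilities P.
        s = [dot(q, k) for k in keys[start:start + block]]
        m_new = max(m, max(s))
        p = [math.exp(x - m_new) for x in s]
        # Correction role: rescale the old accumulator to the new max.
        scale = math.exp(m - m_new)    # 0.0 on the first block
        o = [scale * x for x in o]
        # MMA role: accumulate P @ V.
        for pj, v in zip(p, values[start:start + block]):
            o = [x + pj * vi for x, vi in zip(o, v)]
        l = scale * l + sum(p)
        m = m_new
    # Correction role: final normalization by the row sum.
    return [x / l for x in o]


def attention_row_reference(q, keys, values):
    """Plain softmax(q K^T) V for one row, no blocking."""
    s = [dot(q, k) for k in keys]
    mx = max(s)
    p = [math.exp(x - mx) for x in s]
    z = sum(p)
    return [sum(pj * v[d] for pj, v in zip(p, values)) / z
            for d in range(len(values[0]))]
```

total_threads() returns 320, matching the count in the notes (10 warps of 32 threads). Comparing attention_row_online against attention_row_reference on small inputs gives a numerical baseline for the rescale and normalize steps, separate from any TMEM addressing problem.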