- Changed P store from FP32 QK C-fragment layout to BF16 PV A-fragment layout
  - rP_bf16_reg stores directly to TMEM using tOrP0 layout
  - Ensures softmax writes P to same TMEM columns that PV GEMM reads
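
A rough illustration of why the store layout matters. This is a toy model, not the kernel's code. It assumes a TMEM column is a 32-bit cell, so one FP32 value fills a column while two BF16 values pack into one. The helper names (`fp32_column`, `bf16_column`, `mismatched_elements`) are hypothetical and are not CUTLASS/CuTe APIs; the actual `tOrP0` layout is whatever the kernel defines.

```python
# Toy model only: one TMEM row is treated as a sequence of 32-bit columns.
# Assumption: FP32 uses one column per element, BF16 packs two per column.

def fp32_column(elem_idx: int) -> int:
    """Column holding element `elem_idx` when the row is stored as FP32."""
    return elem_idx


def bf16_column(elem_idx: int) -> tuple[int, int]:
    """(column, half) holding element `elem_idx` when stored as packed BF16."""
    return elem_idx // 2, elem_idx % 2


def mismatched_elements(n: int) -> list[int]:
    """Elements whose FP32-layout column differs from their BF16-layout column."""
    return [i for i in range(n) if fp32_column(i) != bf16_column(i)[0]]


if __name__ == "__main__":
    # Under this model, every element except the first lands in a different column.
    print(mismatched_elements(8))
```

Under these assumptions, a P tile written in the FP32 layout would sit in columns that a BF16 reader does not look at, which is the mismatch this change avoids by storing in the reader's layout from the start.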