|
|
7a8945eb76
|
Stage B: pipeline deadlock fixed, V MN-major applied, PV output garbage
Pipeline deadlock fixed:
- No cta_layout_vmnk on mma_si PipelineUmmaAsync
- TMA warp excluded from tmem.wait_for_alloc
- PipelineTmaStore (not TmaStorePipeline)
Bug 1 (V MN-major): fix applied
- PV MMA uses v_major=OperandMajorMode.MN
- V shaped (64,128) strides(1,64) via as_strided
Bug 2 (softmax packing): C-fragment composition store applied
- FP32 to BF16 packing works
- St32x32bOp uses Float32 (not BFloat16)
Bug 3 (PV garbage): investigating
- PV MMA cosine ~0.01 against reference
- Suspected TMEM layout mismatch between softmax P store and PV A-fragment read
Test results:
- test_mma_si_only: cosine 0.999999 PASS
- test_mma_si_pv: cosine 0.01 FAIL (pipeline works, PV output wrong)
|
2026-05-21 04:10:07 +00:00 |
|