nvfp4-megamoe-kernel/tests/test_softmax_only.py at 50e9b5da81477e000e4770e94de9d175204f5ff0

Files

biondizzle 7a8945eb76 Stage B: pipeline deadlock fixed, V MN-major applied, PV output garbage

Pipeline deadlock fixed:
- No cta_layout_vmnk on mma_si PipelineUmmaAsync
- TMA warp excluded from tmem.wait_for_alloc
- PipelineTmaStore (not TmaStorePipeline)

Bug 1 (V MN-major): fix applied
- PV MMA uses v_major=OperandMajorMode.MN
- V shaped (64,128) strides(1,64) via as_strided

Bug 2 (softmax packing): C-fragment composition store applied
- FP32 to BF16 packing works
- St32x32bOp uses Float32 (not BFloat16)

Bug 3 (PV garbage): investigating
- PV MMA cosine ~0.01 against reference
- Suspected TMEM layout mismatch between softmax P store and PV A-fragment read

Test results:
- test_mma_si_only: cosine 0.999999 PASS
- test_mma_si_pv: cosine 0.01 FAIL (pipeline works, PV output wrong)

2026-05-21 04:10:07 +00:00

16 KiB

Raw Blame History

View Raw

16 KiB Raw Blame History

16 KiB

Raw Blame History