Files
nvfp4-megamoe-kernel/tests/unit
biondizzle dd364b6d4d 10-warp idle test: no crash but cosine 0.29 (6-warp gives 0.999999)
Adding 4 idle warps (4-7) to 320-thread CTA:
- No crash, no deadlock (idle warps just pass)
- But output is garbage: cosine 0.29 vs 0.999999

Same softmax+MMA code, same TMEM layout, same barriers.
Only difference: mma_warp_id=8 (was 4), threads_per_cta=320 (was 192)
and 4 idle warps 4-7.

Something in the pipeline/barrier system assumes the old 6-warp topology.
Need to identify which component uses threads_per_cta or warp_idx
in a way that breaks with more warps.
2026-05-21 22:07:53 +00:00
..