nvfp4-megamoe-kernel/tests
biondizzle 9d96c2fbbf CRITICAL FIX: FP32 RoPE cache + FP32 arithmetic for inverse RoPE round-trip
A BF16 cos/sin cache breaks the cos²+sin²=1 identity (the sum can be 0.996
in BF16). This causes ~3% error per RoPE→inverse RoPE round-trip, which
accumulates across 61 layers into garbage output. An FP32 cache with FP32
arithmetic gives a round-trip that is exact to FP32 precision (diff < 1e-7).
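
The effect of cache precision can be reproduced in isolation with a small sketch. This is not the kernel's code: `to_bf16`, `to_fp32`, and `round_trip_error` are hypothetical helpers, BF16 is emulated by rounding FP32 bit patterns, the angles are arbitrary, and the rotation arithmetic runs in Python floats (FP64). It therefore shows only the error contributed by the quantized cos/sin values, not by BF16 arithmetic.

```python
import math
import struct


def to_fp32(x: float) -> float:
    """Round a Python float (FP64) to the nearest FP32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def to_bf16(x: float) -> float:
    """Emulate BF16: keep the top 16 bits of the FP32 pattern (round to nearest even)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits = (bits + 0x7FFF + ((bits >> 16) & 1)) & 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def round_trip_error(quantize, angles, x=(1.0, 0.5)):
    """Worst absolute error of RoPE followed by inverse RoPE on one (x0, x1) pair."""
    worst = 0.0
    for theta in angles:
        # Cached values, stored at the precision under test.
        c = quantize(math.cos(theta))
        s = quantize(math.sin(theta))
        # Forward rotation.
        y0 = x[0] * c - x[1] * s
        y1 = x[0] * s + x[1] * c
        # Inverse rotation (transpose of the forward matrix).
        z0 = y0 * c + y1 * s
        z1 = -y0 * s + y1 * c
        # Algebraically z = (c*c + s*s) * x, so any deviation from 1 shows up here.
        worst = max(worst, abs(z0 - x[0]), abs(z1 - x[1]))
    return worst


angles = [k * 0.01 for k in range(2000)]  # arbitrary sample angles (assumption)
bf16_err = round_trip_error(to_bf16, angles)
fp32_err = round_trip_error(to_fp32, angles)
print(f"BF16 cache round-trip error: {bf16_err:.3e}")
print(f"FP32 cache round-trip error: {fp32_err:.3e}")
```

Because the round trip scales the input by cos²+sin², the printed error is a direct measure of how far the cached values drift from the identity at each precision.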

Also fixes: MoE expert loop indentation (was only running last expert).
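
The class of bug described here can be illustrated with a toy example. This is a hypothetical sketch, not the repository's code: `moe_buggy`, `moe_fixed`, the lambda experts, and the weights are made up for illustration. When the accumulation line is dedented out of the loop, it runs once after the loop finishes and sees only the last expert's output.

```python
def moe_buggy(x, experts, weights):
    out = 0.0
    for expert, w in zip(experts, weights):
        y = expert(x)
    # Dedented by mistake: runs once, after the loop, with the last (w, y) only.
    out += w * y
    return out


def moe_fixed(x, experts, weights):
    out = 0.0
    for expert, w in zip(experts, weights):
        y = expert(x)
        # Inside the loop: every expert contributes its weighted output.
        out += w * y
    return out


experts = [lambda v: v + 1.0, lambda v: 2.0 * v, lambda v: v * v]
weights = [0.5, 0.25, 0.25]

print(moe_buggy(3.0, experts, weights))  # 2.25 (last expert only)
print(moe_fixed(3.0, experts, weights))  # 5.75 (all three experts)
```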
2026-05-31 09:14:59 +00:00