SiLU in registers: PASS (0.034% error, Step 1 stable)
Gate/up subtile detection: blocked by CuTeDSL type system

CuTeDSL compiles the kernel for ALL subtile iterations at once.
Runtime conditionals (if is_gate_subtile) that affect:
- Register tensor assignment → DSLRuntimeError (type structure mismatch)
- TMA store skipping → corrupted output
- Mask blending → wrong results

Path forward: use const_expr debug flag for the BF16 side output, or
process gate/up in a separate post-GEMM kernel.
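
For the second option (a separate post-GEMM pass), a host-side reference can be useful for checking whatever kernel ends up doing the gate/up combine. The sketch below is plain Python, not CuTeDSL. It assumes, hypothetically, that each GEMM output row alternates gate and up blocks of width subtile_n ([gate_0 | up_0 | gate_1 | up_1 | ...]); the real subtile layout is not stated here, and the names swiglu_post_gemm and subtile_n are illustrative only.

```python
import math


def silu(x: float) -> float:
    """SiLU (swish): x * sigmoid(x), written to avoid exp() overflow."""
    if x >= 0.0:
        return x / (1.0 + math.exp(-x))
    e = math.exp(x)  # x < 0, so e is in (0, 1) and cannot overflow
    return x * e / (1.0 + e)


def swiglu_post_gemm(row, subtile_n):
    """Combine interleaved gate/up blocks of one GEMM output row.

    ASSUMED layout (hypothetical): [gate_0 | up_0 | gate_1 | up_1 | ...],
    each block subtile_n elements wide. Returns silu(gate) * up, which is
    half the length of the input row.
    """
    if subtile_n <= 0 or len(row) % (2 * subtile_n) != 0:
        raise ValueError("row length must be a positive multiple of 2 * subtile_n")
    out = []
    for base in range(0, len(row), 2 * subtile_n):
        gate = row[base : base + subtile_n]
        up = row[base + subtile_n : base + 2 * subtile_n]
        out.extend(silu(g) * u for g, u in zip(gate, up))
    return out


# Two gate/up pairs with subtile_n = 2.
row = [0.0, 1.0, 2.0, 3.0, -1.0, 0.5, 4.0, 5.0]
result = swiglu_post_gemm(row, subtile_n=2)
```

Because the gate/up decision here depends only on the block index, the post-GEMM pass never needs a per-subtile runtime branch inside the GEMM epilogue, which is the part reported as blocked above.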