SiLU in registers: PASS (0.034% error, Step 1 stable)
Gate/up subtile detection: blocked by CuTeDSL type system
CuTeDSL compiles the kernel for ALL subtile iterations at once.
Runtime conditionals (if is_gate_subtile) that affect:
- Register tensor assignment → DSLRuntimeError (type structure mismatch)
- TMA store skipping → corrupted output
- Mask blending → wrong results
Path forward: use const_expr debug flag for the BF16 side output,
or process gate/up in a separate post-GEMM kernel.