Key finding: cute.size(v, mode=[0]) in @cute.jit produces wrong code. Hardcoding s_k=128 (matching Stage B) fixes the base pipeline. Current status: kernel produces non-zero output but softmax math is still wrong. Applied fixes: pv_done_bar, acc_scale with scale, fastmath=True Need to debug row_sum computation and C9 normalization.