The checkpoint stores input_scale per projection — the pre-computed
activation normalization factor. Using 1/2688 was wrong for most layers
(e.g. down_proj input_scale=0.031 vs 1/2688=0.000372 — 83x off).
This caused under-quantized activations and garbage output.
First call: cute.compile() with real tensors (warmup).
Subsequent calls: just invoke compiled() with new CuTe views.
No cute.compile() in the forward path = cudagraph-safe.
The CuTeDSL kernel's TMA descriptors are bound to the
compilation-time tensor addresses. Caching the compiled kernel
and reusing it with different tensor allocations produces wrong
memory access patterns (cosine 0.5 instead of 0.99).
Fresh compilation is proven correct (cosine 0.989). We can
optimize later with proper TMA descriptor reinitialization.
quantize_to_nvfp4() only packs the last dimension, but for weight
matrices (K, N), K is the packed dimension. The weight quantizer
reshapes (k_blocks, block_size, N) and computes block scales along
the K block dimension. This was accidentally replaced with a simple
delegation to quantize_to_nvfp4, producing wrong tensor shapes.
The kernel's TMA descriptors are sized from compilation-time shapes.
Dummy 256x256 caused wrong memory access for real 3584x6144 data.
Now compiles with actual runtime tensors on first use, cached by
(num_experts, K, N). Compilation happens once during warmup.
Forward call remains cudagraph-safe.
The compiled kernel's TMA descriptors are sized based on compilation
shapes. Using dummy 256x256 shapes caused wrong memory access patterns
for the real 3584x6144 data. Now uses actual K_packed and N_packed
from the runtime tensors.
float4_e2m1fn_x2 packs 2 values per byte along K, not N.
The GEMM output N dimension is the logical N from mat_b.shape[2],
not 2x packed. Previous n_dim*2 was wrong — it accidentally worked
in the test because intermediate_size*2 == 2*intermediate_size.
Real model with N=9216 exposed the bug.
torch.tensor() creates on CPU then copies to CUDA, which is forbidden
during cudagraph capture. new_tensor() creates directly on the
source tensor's device.
- Removed all [:total_slots] dynamic slicing with GPU scalars
- slot_hidden gathers from hidden_states directly using sorted_token_ids
- scatter_add uses full sorted_token_ids (padding slots have zero weight)
- _assemble_scales_cudagraph_safe returns 2D via padded_scales.shape[0]
- Fixed padded_scales_buf allocation via float16->float8 cast
- GEMM output size: n_dim * 2 for float4_e2m1fn_x2 packed format