Bug #5 fix: (sorted_ids.unsqueeze(1) == expert_id_range.unsqueeze(0)).sum(dim=0)
materializes a (num_slots × num_experts) bool tensor on every forward pass:
48K × 384 ≈ 18M elements. torch.bincount(sorted_ids, minlength=num_experts)
gives the same per-expert counts in O(n) without the intermediate comparison
tensor. ~200× less work.
Also removes the now-unused _expert_id_range buffer.
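For reference, the equivalence of the two counting strategies can be sketched in pure Python (no torch needed). The function names and the sample ids below are illustrative assumptions, not code from the repo; the second function only mirrors what bincount with minlength is expected to return.

```python
def counts_via_comparison(sorted_ids, num_experts):
    """Old approach: build a (num_slots x num_experts) boolean matrix, then sum columns."""
    matrix = [[slot_id == e for e in range(num_experts)] for slot_id in sorted_ids]
    return [sum(row[e] for row in matrix) for e in range(num_experts)]


def counts_via_bincount(sorted_ids, num_experts):
    """New approach: a single pass over the ids, one counter per expert."""
    counts = [0] * num_experts
    for slot_id in sorted_ids:
        counts[slot_id] += 1
    return counts


sorted_ids = [0, 0, 1, 3, 3, 3]  # hypothetical routing result, 6 slots
num_experts = 5                  # experts 2 and 4 receive no tokens

old_counts = counts_via_comparison(sorted_ids, num_experts)
new_counts = counts_via_bincount(sorted_ids, num_experts)
```

Both return one count per expert, including zeros for experts with no tokens, which is why minlength=num_experts matters in the torch version.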

Bug #4 fix: When a block has amax > 0 but amax/6 underflows to 0 in
FP8 (amax < 6*2^-9 ≈ 0.0117), the block scale is 0, but the division
x / clamp(0, min=1e-8), i.e. a division by 1e-8, inflates x into nonzero
FP4 buckets (up to ±6.0). This produces semantically wrong FP4 nibbles
even though dequant still gives 0 (6*0=0).
Root cause: we only detected truly-zero blocks (amax == 0) but not
underflow blocks (0 < amax < FP8_threshold). The fix:
1. Detect both zero and underflow blocks: block_amax < 6 * 2^-9
2. Zero out x_reshaped for these blocks BEFORE division
3. Force FP8 scale to 0 for these blocks
This ensures x_scaled = 0 → FP4 nibbles = 0 → dequant = 0.
Verified: bug scenario now produces nibble=0, scale=0.
Checkpoint byte match remains 100%.
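A minimal pure-Python sketch of the failure and the fix, under stated assumptions: fake_fp8 models only underflow to zero (not mantissa rounding), nearest_fp4 ignores tie-breaking, and all function names are hypothetical stand-ins for the real quantization code.

```python
FP4_LEVELS = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]  # E2M1 magnitudes
FP4_MAX = 6.0
FP8_MIN_SUBNORMAL = 2.0 ** -9                      # smallest positive E4M3 value
UNDERFLOW_THRESHOLD = FP4_MAX * FP8_MIN_SUBNORMAL  # 6 * 2^-9 = 0.01171875


def fake_fp8(scale):
    """Simplified FP8 cast: models underflow-to-zero only."""
    return 0.0 if scale < FP8_MIN_SUBNORMAL else scale


def nearest_fp4(value):
    """Snap to the nearest FP4 magnitude (saturating at 6.0), keeping the sign."""
    mag = min(FP4_LEVELS, key=lambda level: abs(level - abs(value)))
    return -mag if value < 0 else mag


def quantize_block_old(block):
    """Pre-fix behavior: the clamp avoids division by zero but inflates x."""
    amax = max(abs(v) for v in block)
    scale = fake_fp8(amax / FP4_MAX)
    divisor = max(scale, 1e-8)
    return [nearest_fp4(v / divisor) for v in block], scale


def quantize_block_fixed(block):
    """Post-fix behavior: zero and underflow blocks get zero codes and zero scale."""
    amax = max(abs(v) for v in block)
    if amax < UNDERFLOW_THRESHOLD:  # covers amax == 0 and 0 < amax < 6 * 2^-9
        return [0.0] * len(block), 0.0
    scale = fake_fp8(amax / FP4_MAX)
    return [nearest_fp4(v / scale) for v in block], scale


underflow_block = [0.005, -0.002, 0.0, 0.001]  # amax = 0.005 < 0.0117
old_codes, old_scale = quantize_block_old(underflow_block)
new_codes, new_scale = quantize_block_fixed(underflow_block)
normal_codes, normal_scale = quantize_block_fixed([3.0, -6.0, 1.5, 0.0])
```

In the old path the scale is 0 yet the codes saturate at ±6.0; in the fixed path both codes and scale are exactly zero, while a normal block is unaffected.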

Bug #3 fix: The clamp(min=1e-8) on block_amax prevented NaN from 0/0
but allowed truly-zero blocks to get a nonzero FP8 scale (5e-12 from
underflow). While the kernel produces 0 * 0 = 0 (no NaN), the nonzero
scale is semantically wrong and could interact badly with future kernels.
Fix: detect zero blocks explicitly (block_amax == 0), clamp only for
safe division, then force FP8 scale to exact zero for zero blocks via
torch.where. The FP4 nibbles are already zero (0 / anything = 0).
Verified: checkpoint byte match remains 100%, zero blocks produce
exact-zero dequantization, no NaN propagation.
Applies to all three quantization functions:
- quantize_to_nvfp4 (activation with computed gs)
- quantize_activation_nvfp4 (activation with pre-computed gs)
- quantize_weight_to_nvfp4 (weight quantization)

Verified that our NVFP4 packing convention (odd<<4|even, round-half-to-even)
matches the DeepSeek-V4 checkpoint exactly: 100% byte-identical round-trip
across all tested experts. The dequantize->requantize path is lossless in
practice but wasteful. Marked both prepare_weights_from_dequantized and
prepare_weights_direct as deprecated in favor of prepare_weights_from_stacked,
which loads checkpoint FP4 bytes directly via .view().
Also added test_fp4_roundtrip.py for future reference.
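The odd<<4|even byte layout can be illustrated with a small stdlib-only sketch. The helper names and the sample codes are assumptions for illustration and are not taken from test_fp4_roundtrip.py.

```python
def pack_nibbles(nibbles):
    """Pack 4-bit codes two per byte: odd index -> high nibble, even index -> low nibble."""
    if len(nibbles) % 2:
        raise ValueError("need an even number of nibbles")
    return bytes((nibbles[i + 1] << 4) | nibbles[i] for i in range(0, len(nibbles), 2))


def unpack_nibbles(packed):
    """Inverse of pack_nibbles: recover the original code order."""
    out = []
    for byte in packed:
        out.append(byte & 0x0F)  # even-index element lives in the low nibble
        out.append(byte >> 4)    # odd-index element lives in the high nibble
    return out


codes = [0x1, 0xA, 0x7, 0x0, 0xF, 0x3]  # hypothetical FP4 codes
packed = pack_nibbles(codes)
restored = unpack_nibbles(packed)
```

A byte-identical round trip in this sketch means pack_nibbles(unpack_nibbles(packed)) == packed, which is the same property the checkpoint comparison checks at scale.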

Bug #1 fix: The _needs_token_refill workaround was a band-aid over a
misdiagnosis. cute.compile does NOT corrupt GPU memory (verified on B200).
The original corruption was from a different bug (likely OOB write or
weight loading issue).
Changes:
- bridge.py: Add warmup_compilation() for eager JIT before runtime buffers
  exist. Pre-allocate workspace per cache entry (no torch.full in hot path).
  Cache stores {compiled, workspace, workspace_size} instead of just compiled.
  CuTe tensor wrappers re-created per call (cheap metadata, avoids stale refs).
- runner.py: Remove _needs_token_refill hack. Add eager warmup call in
  _ensure_stacked() for both L1 and L2 GEMM shapes.
- nvfp4_linear.py: Add eager warmup in finalize_weights() for single GEMM.
The warmup approach ensures cute.compile runs exactly once per shape during
model init, before any forward pass. This is deterministic and eliminates
any possible interaction between JIT and runtime GPU memory.
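
The compile-once-per-shape pattern can be sketched as follows. This is a stdlib-only illustration: GemmCache, fake_compile, and the bytearray workspace are hypothetical stand-ins, not the actual bridge.py code or the cute.compile API.

```python
class GemmCache:
    """Caches one compiled kernel plus a pre-allocated workspace per GEMM shape."""

    def __init__(self, compile_fn):
        self._compile_fn = compile_fn  # stand-in for the JIT compile step
        self._entries = {}
        self.compile_calls = 0

    def warmup(self, shape, workspace_size):
        """Compile eagerly at init time; repeated calls for a shape are no-ops."""
        if shape not in self._entries:
            self.compile_calls += 1
            self._entries[shape] = {
                "compiled": self._compile_fn(shape),
                "workspace": bytearray(workspace_size),  # allocated once, reused
                "workspace_size": workspace_size,
            }
        return self._entries[shape]

    def run(self, shape, a, b):
        """Hot path: no compilation and no allocation, only a cache lookup."""
        entry = self._entries[shape]  # KeyError if warmup was skipped
        return entry["compiled"](a, b, entry["workspace"])


def fake_compile(shape):
    """Hypothetical compile step returning a callable 'kernel'."""
    return lambda a, b, workspace: a * b


cache = GemmCache(fake_compile)
for gemm_shape in [(128, 256), (256, 128)]:  # e.g. the L1 and L2 GEMM shapes
    cache.warmup(gemm_shape, workspace_size=1024)
cache.warmup((128, 256), workspace_size=1024)  # second warmup: cache hit
result = cache.run((128, 256), 3, 4)
```

Raising on a missing entry in run() is a deliberate choice in this sketch: it makes a skipped warmup fail loudly instead of silently compiling during a forward pass.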