Tests the NVIDIA reference kernel with our quantization pipeline:
1. Quantize BF16 → NVFP4 (our stage_activation logic)
2. Pad and swizzle scale factors (to_blocked)
3. Run ScaledGroupedGemmKernel (2Dx3D scenario)
4. Compare against BF16 matmul reference
Also adds cutedsl/moe.py module for the future pipeline integration.
Using checkpoint input_scale as the normalization scale saturates
FP4 values (all block scales = 448). The input_scale is a calibration
constant, NOT the amax/(6*448) normalization scale.
Reverted to dynamic amax/(6*448) for activation quantization.
The correct use of checkpoint input_scale is still under investigation.
Preserved: _w13_input_scale and _w2_input_scale in finalize_weights
for future use once we understand the correct alpha contract.
Critical fix: the checkpoint's input_scale was used during weight
calibration but we were computing dynamic scale from data (amax/2688).
This was 13x off from the checkpoint value.
Changes:
- stage_activation() accepts optional input_global_scale parameter
- nvfp4_mega_moe_full() accepts l1_input_scale and l2_input_scale
- vLLM patch preserves w13/w2_input_scale in finalize_weights
- L1 activation uses checkpoint w13_input_scale for quantization
- L2 activation uses checkpoint w2_input_scale for quantization
- alpha = input_scale * weight_scale_2 (correct calibration contract)
Checks that forward remap wrote the correct bytes by comparing
src[mn*stride_mn + k_sf*stride_ksf] against dst[layout_sf(make_coord(mn, k_sf*16, 0))].
Prints error count for SFA and SFB on first GEMM call.
- Allocation: cute::size(cute::filter_zeros(layout)) matches CUTLASS examples
- Kernel: layout_sf(make_coord(mn, k_sf*16, 0)) — no branching on LayoutRank
- Avoids silent fallthrough that wrote dst[0] for all threads
CuTe maps compatible flat coordinates into the natural hierarchical
coordinate before applying strides. No manual decomposition needed.
k_elem = k_sf * 16 (logical K element, not compact SF index).
SFA: src_stride_mn=K_sf, src_stride_ksf=1 (row-major M, K_sf)
SFB: src_stride_mn=1, src_stride_ksf=N (row-major K_sf, N after transpose)
Removes ambiguity about physical memory layout. The source indexing
now uses mn*src_stride_mn + k_sf*src_stride_ksf which works for
any contiguous or transposed layout.
SFB scales arrive as (K_sf, N) row-major after transpose+contiguous
in weight_transform.py. The col_major_src flag correctly describes
this. Don't assume both sources are (MN, K_sf).
- Iterate over source indices (MN * K_sf) instead of dst indices
- Use layout_sf forward mapping: layout_sf(make_coord(mn, k_sf*16))
- No more idx2crd reverse extraction or stride-0 ambiguity
- Cleaner, less error-prone, blog-compatible
- First flattened group IS M/N (not K as previously assumed)
- mn = f0 + 32*f1 + 128*f2
- k_sf = f4 + 4*f5 (f3 is stride-0 inner K, ignored)
- The atom stride-0 dimension (f3) maps to offset 0, not a meaningful
K sub-index. The actual k_sf comes from f4 (sub_k) + f5*4 (tile_k)
- Original code had group assignment right but k_sf extraction wrong
Based on veitner bearblog analysis of CUTLASS SF layout:
- Shape is ((32,4,K_tiles), (SFVecSize,4,M_tiles)) for SFA
- get<0..2> covers K dimension, get<3..5> covers M dimension
- k_sf = K_element_index / SFVecSize
The comment explicitly warned about this: allocation uses cosize (physical
size including tile padding) but the iteration bound used size (logical size).
This meant padding positions in the CUTLASS SF layout were never written,
leaving them as zero instead of their actual SF values. With uniform data
(all-ones), all SF values are the same so the bug was invisible. With
random data, different SF values are needed at different positions and
the missing writes corrupt the result.
The fold block_sf (float8) * global_sf (float32) -> float8 loses ~25% precision.
Product of ~56-448 block_sf * ~4.65e-05 global_sf lands in float8 low-precision
zone where step size is 25%. This makes model output garbage despite finite values.
Fix: keep block scales as original float8, return global scales separately as
float32 per-expert vectors. Apply global scale as per-expert GEMM alpha in
cutlass_grouped_nvfp4_gemm (already iterates per-expert). For L1 with separate
gate/up global scales, use gate_gs as alpha and apply up_correction ratio to
the up half post-GEMM.
weight_transform.py: no more _fold_global_scale, returns (w, sf, global_sf)
nvfp4_mega_moe.py: per-expert alpha = activation_gs * weight_gs
kernel.py: per_expert_alpha parameter in grouped GEMM
deepseek_v4.py: updated type hints and comments
cute::crd2idx requires hierarchical coordinates matching the layout's
nested shape, which we don't have from flat (m, k_sf). Reverted to
idx2crd dest-first approach. The real bug was cute::size vs
cute::cosize for allocation, not the remap direction.
CuTe Layout objects with hierarchical shapes can't be called directly
with flat (m, k_sf). Use cute::crd2idx(make_coord(m, k_sf), layout_sf)
to convert logical coordinates to physical indices.