Commit Graph

74 Commits

Author SHA1 Message Date
09ff5c5b98 feat: full NVFP4 MoE pipeline (L1→SiLU→L2→scatter)
cutedsl/moe_pipeline.py: complete pipeline
  - stage_activation: BF16 → NVFP4 (keeps data in FP4)
  - L1 GEMM: NVFP4 × NVFP4 → BF16 (gate+up)
  - SiLU(gate) * up: BF16 (only nonlinear, can't avoid)
  - Re-quantize: BF16 → NVFP4 (back to native)
  - L2 GEMM: NVFP4 × NVFP4 → BF16 (down_proj)
  - Scatter with routing weights → BF16 output

layertest.py: now tests the FULL MoE pipeline against BF16 reference.

NVFP4-native: both GEMMs use float4_e2m1fn_x2 for A and B,
float8_e4m3fn for block scales, float32 for global scales.
BF16 only for SiLU activation and final scatter.
2026-05-16 03:22:43 +00:00
0359215ab4 fix: compare kernel vs BF16 in slot-major layout 2026-05-16 03:18:41 +00:00
ed18638a3c fix: slot-major token layout for grouped GEMM
Tokens must be laid out as [expert0_tokens | expert1_tokens | ...]
for the 2Dx3D grouped GEMM. Each expert gets its own contiguous
block of tokens. Scale factors split by expert offsets.
2026-05-16 03:17:19 +00:00
5385de3142 fix: layertest tests L1 GEMM only with correct output size
L1 produces (tokens, 6144) gate+up, not (tokens, 7168) hidden.
Compare against BF16 L1 reference only.
2026-05-16 03:15:29 +00:00
0cdcc4144a refactor: add cutedsl/bridge.py, rewrite layertest to use it
bridge.py: clean API for CuTeDSL kernel
- quantize_to_nvfp4 / quantize_weight_to_nvfp4
- assemble_scales_2d_side / assemble_scales_3d_side
- make_b_k_major (stride conversion)
- compute_expert_offsets
- run_nvfp4_grouped_gemm (full kernel launch)

layertest.py: now uses bridge layer, tests with real
DeepSeek-V4 layer 0 weights (7168 hidden, 6144 intermediate).

The bridge code will be reused by the vLLM integration layer.
2026-05-16 03:13:54 +00:00
2ef71dc21a fix: B tensor K-major strides, scale_b axis swap
Two fixes:
1. B tensor: permute(0,2,1).contiguous().permute(0,2,1) gives K-major
   stride (16384,1,128) matching reference
2. scale_b: transpose to (N, K_sf) before swizzling — reference uses
   (intermediate, hidden//16) not (hidden//16, intermediate)
2026-05-16 03:04:31 +00:00
6294b84213 fix: B tensor must be K-major (transpose last 2 dims)
Reference shows B stride=(16384,1,128) — K is stride-1 (K-major).
Our stack produces N-major stride=(16384,128,1). Added .T.contiguous().
2026-05-16 03:03:00 +00:00
7c882fe2e0 fix: correct weight quantization for CuTeDSL kernel
Weight K dimension (hidden) must be the packed dimension, not N.
Block scales computed along K dim. FP4 packing along K.
2026-05-16 02:58:55 +00:00
ca28f1335d refactor: copy CuTeDSL kernel into repo with local imports
Copied from CUTLASS examples (no more runtime dependency on
/root/cutlass/examples/). Fixed all imports to use cutedsl.kernel.*
instead of blackwell.kernel.*.

Structure:
  cutedsl/__init__.py
  cutedsl/kernel/__init__.py
  cutedsl/kernel/moe/  (the MoE scaled grouped GEMM)
  cutedsl/kernel/blockscaled_gemm/  (dense blockscaled GEMM)

test_cutedsl.py updated to import from our local copy.
2026-05-16 02:57:54 +00:00
a3aa2d201e fix: clarify import path setup for CuTeDSL 2026-05-16 02:55:25 +00:00
f951d284e7 test: add CuTeDSL NVFP4 GEMM test using reference ScaledGroupedGemmKernel
Tests the NVIDIA reference kernel with our quantization pipeline:
1. Quantize BF16 → NVFP4 (our stage_activation logic)
2. Pad and swizzle scale factors (to_blocked)
3. Run ScaledGroupedGemmKernel (2Dx3D scenario)
4. Compare against BF16 matmul reference

Also adds cutedsl/moe.py module for the future pipeline integration.
2026-05-16 02:55:04 +00:00
a2ea836c74 docs: add CuTeDSL rewrite plan + reference files
The C++ CUTLASS kernel is fundamentally broken (cosine 0.05 with real
data). Switching to NVIDIA's CuTeDSL approach based on their official
MoE scaled grouped GEMM example.

Reference files copied:
- moe_torch_scaled_grouped_mm.py (3900 lines — our new kernel)
- moe_utils.py, moe_persistent_scheduler.py, moe_sched_extension.py
- grouped_blockscaled_gemm.py, dense_blockscaled_gemm_persistent.py
- blockscaled_layout.py
2026-05-16 02:41:51 +00:00
c4a262bd54 test: streamline layertest — kernel vs BF16 ref only, exit on fail
Removed original checkpoint loading (already verified 0.997 cosine).
Test now: load NVFP4 → dequant BF16 ref → run kernel → compare.
Exits with code 1 if cosine < 0.99.
2026-05-16 02:29:41 +00:00
de9b50cbe7 fix: use setup.py install for CUTLASS extension build 2026-05-16 02:21:17 +00:00
882bff8fb7 fix: also build CUTLASS C++ extension in run_test.sh 2026-05-16 02:19:40 +00:00
55d9a24bf6 fix: handle model. prefix normalization in checkpoint keys 2026-05-16 02:18:52 +00:00
bdf9f31ae2 fix: checkpoint keys don't have 'model.' prefix 2026-05-16 02:17:13 +00:00
ea5ee7c1f7 fix: remove prefix_filter from layer tensor loading 2026-05-16 02:15:55 +00:00
303b6a8993 cleanup: move useful tests to tests/, nuke stale debug tests
Kept (moved to tests/):
- test_uniform_fp4.py — proves GEMM math (72.0 = 1.5² × K)
- test_b_layout.py — proves B matrix column layout
- test_quick_rand.py — quick GEMM sanity check

Removed (stale SF remap debug artifacts):
- test_forward_map.py, test_gemm_sweep.py, test_m1_gemm.py
- test_minimal_gemm.py, test_rand_gemm.py, test_sf_check.py
- test_sf_remap.py, test_sf_signed.py, test_sf_layout_diag.cu
2026-05-16 02:14:37 +00:00
2114bd11be test: add standalone layer 0 comparison test (no vLLM, no Docker)
tests/layertest.py:
- Loads layer 0 expert weights from both original (MXFP4) and NVFP4 checkpoints
- Dequantizes both to BF16 for reference comparison
- Runs MoE forward pass in pure BF16 (no kernel)
- Runs same forward pass through our NVFP4 CUTLASS kernel
- Compares cosine similarity: kernel vs BF16 reference

tests/run_test.sh:
- Creates venv, installs deps, builds kernel from source, runs test

Isolates our kernel completely from vLLM's weight loading, tensor
parallelism, and MoE routing. If cosine ≈ 1.0, bug is in vLLM. If
cosine ≈ 0, bug is in our kernel pipeline.
2026-05-16 02:13:18 +00:00
294e9f98f2 cleanup: rename _ue8m0_to_float32 → _block_scale_to_float32, remove dead code
- Renamed misleading _ue8m0_to_float32 to _block_scale_to_float32
  (our checkpoint uses float8_e4m3fn, NOT E8M0)
- Removed dead is_scale_e8m0 property (never referenced)
- Removed dead _block_scale_to_float32 copy in MegaMoEExperts class
- Cleaned up stale E8M0/UE8M0/shift-by-23 comments
- Simplified E8M0 assertion to ValueError (not assert False)
- Updated DeepseekV4FP8Config docstring for NVFP4
2026-05-16 01:55:56 +00:00
4a624879ca docs: update DEBUG_LOG — input_scale red herring, current state, next steps 2026-05-16 01:15:49 +00:00
79b9becf9c revert: don't use checkpoint input_scale for activation normalization
Using checkpoint input_scale as the normalization scale saturates
FP4 values (all block scales = 448). The input_scale is a calibration
constant, NOT the amax/(6*448) normalization scale.

Reverted to dynamic amax/(6*448) for activation quantization.
The correct use of checkpoint input_scale is still under investigation.

Preserved: _w13_input_scale and _w2_input_scale in finalize_weights
for future use once we understand the correct alpha contract.
2026-05-16 00:12:41 +00:00
a7eae10ef4 fix: use checkpoint input_scale for activation quantization
Critical fix: the checkpoint's input_scale was used during weight
calibration but we were computing dynamic scale from data (amax/2688).
This was 13x off from the checkpoint value.

Changes:
- stage_activation() accepts optional input_global_scale parameter
- nvfp4_mega_moe_full() accepts l1_input_scale and l2_input_scale
- vLLM patch preserves w13/w2_input_scale in finalize_weights
- L1 activation uses checkpoint w13_input_scale for quantization
- L2 activation uses checkpoint w2_input_scale for quantization
- alpha = input_scale * weight_scale_2 (correct calibration contract)
2026-05-15 23:57:08 +00:00
af50e98fe9 test: B layout test with N=128 K=256 2026-05-15 23:52:22 +00:00
efd7a2c56d test: B matrix weight layout verification via one-hot A 2026-05-15 23:52:00 +00:00
bb5a1ba4c8 cleanup: remove unused slot_token from nvfp4_moe_l2
L2 input is already slot-major, so slot_token was accepted but never
passed to the GEMM. Made it explicit by removing the parameter.
2026-05-15 23:50:39 +00:00
887360281e docs: major update — SF remap verified correct, BF16 ref is the red herring
Key finding: the 0.2 cosine was always a wrong reference, not a wrong GEMM.
Proof: uniform FP4+SF produces mathematically exact output, and the
roundtrip SF verifier passes with 0 errors. Do NOT re-investigate SF remap.
2026-05-15 23:38:34 +00:00
eb26d291cb test: uniform FP4 + uniform SF sanity check 2026-05-15 23:36:08 +00:00
1f09b51168 test: check SF signed vs unsigned interpretation 2026-05-15 23:35:06 +00:00
4f857d5f99 docs: major DEBUG_LOG update — forward mapping, verifier, full debug timeline 2026-05-15 23:02:30 +00:00
aa209ddd21 debug: add SF remap roundtrip verifier
Checks that forward remap wrote the correct bytes by comparing
src[mn*stride_mn + k_sf*stride_ksf] against dst[layout_sf(make_coord(mn, k_sf*16, 0))].
Prints error count for SFA and SFB on first GEMM call.
2026-05-15 22:59:44 +00:00
6626b75a2f fix: use filter_zeros for SF allocation + no-branch forward mapping
- Allocation: cute::size(cute::filter_zeros(layout)) matches CUTLASS examples
- Kernel: layout_sf(make_coord(mn, k_sf*16, 0)) — no branching on LayoutRank
- Avoids silent fallthrough that wrote dst[0] for all threads
2026-05-15 22:58:51 +00:00
6fc8fa61e0 fix: use flat logical coordinate layout_sf(make_coord(mn, k_elem, 0))
CuTe maps compatible flat coordinates into the natural hierarchical
coordinate before applying strides. No manual decomposition needed.
k_elem = k_sf * 16 (logical K element, not compact SF index).
2026-05-15 22:53:57 +00:00
a48717ccf5 fix: remove duplicate dst_idx declaration 2026-05-15 22:31:05 +00:00
5ff1b9e401 fix: use hierarchical coordinates for layout_sf forward mapping
Flat make_coord(mn, k*16) doesn't decompose into the nested atom shape.
Must manually decompose:
  mn -> (m0, m1, mt) where m0=mn%32, m1=(mn/32)%4, mt=mn/128
  k_sf -> (k0, k1, kt) where k0=0 (stride-0), k1=k_sf%4, kt=k_sf/4
2026-05-15 22:11:14 +00:00
3b4a7b591f test: verify forward mapping with prepack vs live SFB 2026-05-15 22:09:56 +00:00
a1fd4d6233 revert: back to layout_sf(make_coord(...)) — crd2idx was unnecessary 2026-05-15 21:55:00 +00:00
ea678ece64 fix: remove duplicate template declaration 2026-05-15 21:54:10 +00:00
59dad8e2fb fix: use crd2idx instead of layout operator() for SF forward mapping 2026-05-15 21:52:02 +00:00
a09d8e477e fix: remove static_assert in constexpr else (build fix) 2026-05-15 21:27:27 +00:00
7285331395 fix: replace col_major_src with explicit source strides
SFA: src_stride_mn=K_sf, src_stride_ksf=1 (row-major M, K_sf)
SFB: src_stride_mn=1, src_stride_ksf=N (row-major K_sf, N after transpose)

Removes ambiguity about physical memory layout. The source indexing
now uses mn*src_stride_mn + k_sf*src_stride_ksf which works for
any contiguous or transposed layout.
2026-05-15 21:23:21 +00:00
f6fd549800 fix: restore col_major_src handling for SFB source layout
SFB scales arrive as (K_sf, N) row-major after transpose+contiguous
in weight_transform.py. The col_major_src flag correctly describes
this. Don't assume both sources are (MN, K_sf).
2026-05-15 21:19:58 +00:00
63e67e1025 fix: rewrite SF remap as forward mapping (source→dst)
- Iterate over source indices (MN * K_sf) instead of dst indices
- Use layout_sf forward mapping: layout_sf(make_coord(mn, k_sf*16))
- No more idx2crd reverse extraction or stride-0 ambiguity
- Cleaner, less error-prone, blog-compatible
2026-05-15 20:51:30 +00:00
30b6c89424 fix: correct SF remap coordinate extraction
- First flattened group IS M/N (not K as previously assumed)
- mn = f0 + 32*f1 + 128*f2
- k_sf = f4 + 4*f5 (f3 is stride-0 inner K, ignored)
- The atom stride-0 dimension (f3) maps to offset 0, not a meaningful
  K sub-index. The actual k_sf comes from f4 (sub_k) + f5*4 (tile_k)
- Original code had group assignment right but k_sf extraction wrong
2026-05-15 20:44:46 +00:00
ff5a0843dc fix: divide K element index by SFVecSize to get k_sf
Based on veitner bearblog analysis of CUTLASS SF layout:
- Shape is ((32,4,K_tiles), (SFVecSize,4,M_tiles)) for SFA
- get<0..2> covers K dimension, get<3..5> covers M dimension
- k_sf = K_element_index / SFVecSize
2026-05-15 20:17:24 +00:00
a09b9b53a3 cleanup: remove printf and diag function from CUDA kernel (build fix) 2026-05-15 20:11:40 +00:00
e7c3341317 docs: update DEBUG_LOG with M/K swap root cause 2026-05-15 20:03:20 +00:00
deb6b3231a debug: swap M/K in SF remap + add printf diagnostics 2026-05-15 20:01:47 +00:00
22f0457ccf test: isolate SFA vs SFB remap bug 2026-05-15 19:59:39 +00:00