719 Commits

Author SHA1 Message Date
2ef71dc21a fix: B tensor K-major strides, scale_b axis swap
Two fixes:
1. B tensor: permute(0,2,1).contiguous().permute(0,2,1) gives K-major
   stride (16384,1,128) matching reference
2. scale_b: transpose to (N, K_sf) before swizzling — reference uses
   (intermediate, hidden//16) not (hidden//16, intermediate)
2026-05-16 03:04:31 +00:00
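The stride change described in 2ef71dc21a can be checked without a GPU. The sketch below is a pure-Python model of stride bookkeeping, not the repository's code. The sizes (4 experts, K = N = 128) are assumptions, chosen only so that the batch stride comes out as 16384.

```python
def contiguous_strides(shape):
    """Row-major (C-contiguous) strides, in elements, for a shape."""
    strides, step = [], 1
    for dim in reversed(shape):
        strides.append(step)
        step *= dim
    return tuple(reversed(strides))


def permute(shape, strides, order):
    """Reorder dimensions as a view: no data moves, strides follow the dims."""
    return tuple(shape[i] for i in order), tuple(strides[i] for i in order)


def k_major_view(shape):
    """Model permute(0,2,1).contiguous().permute(0,2,1) on a contiguous tensor."""
    shape_t, _ = permute(shape, contiguous_strides(shape), (0, 2, 1))
    # .contiguous() rewrites the data, so the permuted shape gets fresh
    # row-major strides.
    strides_t = contiguous_strides(shape_t)
    # Permuting back restores the logical shape but keeps the new memory order.
    return permute(shape_t, strides_t, (0, 2, 1))


# Assumed sizes: B has logical shape (E, K, N) with E = 4, K = 128, N = 128.
E, K, N = 4, 128, 128
print(contiguous_strides((E, K, N)))  # (16384, 128, 1): N is stride-1
print(k_major_view((E, K, N)))        # ((4, 128, 128), (16384, 1, 128)): K is stride-1
```

The logical shape is unchanged; only the memory order differs, which is why the fix is invisible to code that inspects shapes alone.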
6294b84213 fix: B tensor must be K-major (transpose last 2 dims)
Reference shows B stride=(16384,1,128) — K is stride-1 (K-major).
Our stack produces N-major stride=(16384,128,1). Added .T.contiguous().
2026-05-16 03:03:00 +00:00
7c882fe2e0 fix: correct weight quantization for CuTeDSL kernel
Weight K dimension (hidden) must be the packed dimension, not N.
Block scales computed along K dim. FP4 packing along K.
2026-05-16 02:58:55 +00:00
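A minimal sketch of what "FP4 packing along K" in 7c882fe2e0 means: two adjacent K elements share one byte, so a row of K codes becomes K // 2 bytes. The function names are hypothetical, and the nibble order (even K index in the low nibble) is an assumption, not something the commit states.

```python
def pack_fp4_along_k(codes_row):
    """Pack a row of 4-bit codes (length K, K even) into K // 2 bytes."""
    if len(codes_row) % 2:
        raise ValueError("K must be even to pack two FP4 codes per byte")
    packed = []
    for lo, hi in zip(codes_row[0::2], codes_row[1::2]):
        # Assumption: even-K element in the low nibble, odd-K in the high nibble.
        packed.append((lo & 0xF) | ((hi & 0xF) << 4))
    return packed


def unpack_fp4_along_k(packed_row):
    """Inverse of pack_fp4_along_k under the same nibble-order assumption."""
    codes = []
    for byte in packed_row:
        codes.extend((byte & 0xF, byte >> 4))
    return codes


row = [0x1, 0x2, 0x3, 0x4]        # four 4-bit codes along K
packed = pack_fp4_along_k(row)    # two bytes
print([hex(b) for b in packed])   # ['0x21', '0x43']
```

Packing along N instead would pair elements from different output columns in one byte, which is the mismatch the commit describes.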
ca28f1335d refactor: copy CuTeDSL kernel into repo with local imports
Copied from CUTLASS examples (no more runtime dependency on
/root/cutlass/examples/). Fixed all imports to use cutedsl.kernel.*
instead of blackwell.kernel.*.

Structure:
  cutedsl/__init__.py
  cutedsl/kernel/__init__.py
  cutedsl/kernel/moe/  (the MoE scaled grouped GEMM)
  cutedsl/kernel/blockscaled_gemm/  (dense blockscaled GEMM)

test_cutedsl.py updated to import from our local copy.
2026-05-16 02:57:54 +00:00
a3aa2d201e fix: clarify import path setup for CuTeDSL 2026-05-16 02:55:25 +00:00
f951d284e7 test: add CuTeDSL NVFP4 GEMM test using reference ScaledGroupedGemmKernel
Tests the NVIDIA reference kernel with our quantization pipeline:
1. Quantize BF16 → NVFP4 (our stage_activation logic)
2. Pad and swizzle scale factors (to_blocked)
3. Run ScaledGroupedGemmKernel (2Dx3D scenario)
4. Compare against BF16 matmul reference

Also adds cutedsl/moe.py module for the future pipeline integration.
2026-05-16 02:55:04 +00:00
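Step 1 of the pipeline in f951d284e7 can be illustrated with a simplified block quantizer. This is a sketch under stated assumptions, not the repository's stage_activation logic: it uses one scale per block (NVFP4 blocks hold 16 elements), keeps that scale as a plain Python float rather than rounding it to float8_e4m3fn, and omits the global scale.

```python
# Magnitudes representable in FP4 E2M1.
FP4_VALUES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)


def quantize_block(block):
    """Quantize one block to FP4 magnitudes with a shared scale.

    Returns (quantized_values, scale). The scale maps the block's largest
    magnitude onto 6.0, the largest E2M1 value.
    """
    amax = max(abs(v) for v in block)
    scale = amax / 6.0 if amax > 0 else 1.0
    quantized = []
    for v in block:
        # Round the scaled magnitude to the nearest representable FP4 value.
        mag = min(FP4_VALUES, key=lambda f: abs(abs(v) / scale - f))
        quantized.append(-mag if v < 0 else mag)
    return quantized, scale


def dequantize_block(quantized, scale):
    """Reconstruct approximate values, as a BF16 reference path would."""
    return [q * scale for q in quantized]


block = [12.0, -6.0, 3.0, 0.0] * 4   # 16 elements, exactly representable
q, scale = quantize_block(block)
print(scale, q[:4])                  # 2.0 [6.0, -3.0, 1.5, 0.0]
```

Comparing `dequantize_block` output against the original values gives the quantization error that a cosine check against a BF16 matmul has to tolerate.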
a2ea836c74 docs: add CuTeDSL rewrite plan + reference files
The C++ CUTLASS kernel is fundamentally broken (cosine 0.05 with real
data). Switching to NVIDIA's CuTeDSL approach based on their official
MoE scaled grouped GEMM example.

Reference files copied:
- moe_torch_scaled_grouped_mm.py (3900 lines — our new kernel)
- moe_utils.py, moe_persistent_scheduler.py, moe_sched_extension.py
- grouped_blockscaled_gemm.py, dense_blockscaled_gemm_persistent.py
- blockscaled_layout.py
2026-05-16 02:41:51 +00:00
c4a262bd54 test: streamline layertest — kernel vs BF16 ref only, exit on fail
Removed original checkpoint loading (already verified 0.997 cosine).
Test now: load NVFP4 → dequant BF16 ref → run kernel → compare.
Exits with code 1 if cosine < 0.99.
2026-05-16 02:29:41 +00:00
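The pass/fail rule in c4a262bd54 (exit with code 1 if cosine < 0.99) can be sketched as follows. The function names are hypothetical and the inputs are flat lists of floats; the real test presumably compares tensors.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def check(kernel_out, reference, threshold=0.99):
    """Return a process exit code: 0 on pass, 1 on fail."""
    cos = cosine_similarity(kernel_out, reference)
    print(f"cosine = {cos:.4f}")
    return 0 if cos >= threshold else 1
```

A test script would end with `sys.exit(check(kernel_out, reference))` so that a shell wrapper such as run_test.sh can stop on the first failure.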
de9b50cbe7 fix: use setup.py install for CUTLASS extension build 2026-05-16 02:21:17 +00:00
882bff8fb7 fix: also build CUTLASS C++ extension in run_test.sh 2026-05-16 02:19:40 +00:00
55d9a24bf6 fix: handle model. prefix normalization in checkpoint keys 2026-05-16 02:18:52 +00:00
bdf9f31ae2 fix: checkpoint keys don't have 'model.' prefix 2026-05-16 02:17:13 +00:00
ea5ee7c1f7 fix: remove prefix_filter from layer tensor loading 2026-05-16 02:15:55 +00:00
303b6a8993 cleanup: move useful tests to tests/, nuke stale debug tests
Kept (moved to tests/):
- test_uniform_fp4.py — proves GEMM math (72.0 = 1.5² × K)
- test_b_layout.py — proves B matrix column layout
- test_quick_rand.py — quick GEMM sanity check

Removed (stale SF remap debug artifacts):
- test_forward_map.py, test_gemm_sweep.py, test_m1_gemm.py
- test_minimal_gemm.py, test_rand_gemm.py, test_sf_check.py
- test_sf_remap.py, test_sf_signed.py, test_sf_layout_diag.cu
2026-05-16 02:14:37 +00:00
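The figure quoted for test_uniform_fp4.py (72.0 = 1.5² × K) is a closed-form expectation: with every A and B element equal to 1.5, each output element is K identical products. The sketch below assumes K = 32 and unit scale factors, which is one combination consistent with 72.0; the commit does not state either value.

```python
def dot_uniform(a_val, b_val, k, sf_a=1.0, sf_b=1.0):
    """One GEMM output element when all A, B, and scale values are uniform."""
    return sum((a_val * sf_a) * (b_val * sf_b) for _ in range(k))


print(dot_uniform(1.5, 1.5, 32))  # 72.0
```

Because 1.5 is exactly representable in FP4 E2M1, any deviation from the closed-form value points at the data path rather than at rounding.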
2114bd11be test: add standalone layer 0 comparison test (no vLLM, no Docker)
tests/layertest.py:
- Loads layer 0 expert weights from both original (MXFP4) and NVFP4 checkpoints
- Dequantizes both to BF16 for reference comparison
- Runs MoE forward pass in pure BF16 (no kernel)
- Runs same forward pass through our NVFP4 CUTLASS kernel
- Compares cosine similarity: kernel vs BF16 reference

tests/run_test.sh:
- Creates venv, installs deps, builds kernel from source, runs test

Isolates our kernel completely from vLLM's weight loading, tensor
parallelism, and MoE routing. If cosine ≈ 1.0, bug is in vLLM. If
cosine ≈ 0, bug is in our kernel pipeline.
2026-05-16 02:13:18 +00:00
294e9f98f2 cleanup: rename _ue8m0_to_float32 → _block_scale_to_float32, remove dead code
- Renamed misleading _ue8m0_to_float32 to _block_scale_to_float32
  (our checkpoint uses float8_e4m3fn, NOT E8M0)
- Removed dead is_scale_e8m0 property (never referenced)
- Removed dead _block_scale_to_float32 copy in MegaMoEExperts class
- Cleaned up stale E8M0/UE8M0/shift-by-23 comments
- Simplified E8M0 assertion to ValueError (not assert False)
- Updated DeepseekV4FP8Config docstring for NVFP4
2026-05-16 01:55:56 +00:00
4a624879ca docs: update DEBUG_LOG — input_scale red herring, current state, next steps 2026-05-16 01:15:49 +00:00
79b9becf9c revert: don't use checkpoint input_scale for activation normalization
Using checkpoint input_scale as the normalization scale saturates
FP4 values (all block scales = 448). The input_scale is a calibration
constant, NOT the amax/(6*448) normalization scale.

Reverted to dynamic amax/(6*448) for activation quantization.
The correct use of checkpoint input_scale is still under investigation.

Preserved: _w13_input_scale and _w2_input_scale in finalize_weights
for future use once we understand the correct alpha contract.
2026-05-16 00:12:41 +00:00
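The saturation described in 79b9becf9c follows from the arithmetic of the two-level scale. The sketch below is a simplified model with hypothetical function names: it clamps the block scale at 448 (the float8_e4m3fn maximum) but does not round it to E4M3. The factor of 13 is taken from the earlier commit message, and the direction of the error (global scale too small) is an assumption.

```python
FP4_MAX = 6.0      # largest E2M1 magnitude
E4M3_MAX = 448.0   # largest finite float8_e4m3fn value


def dynamic_global_scale(amax):
    """Per-tensor normalization scale: amax / (6 * 448) = amax / 2688."""
    return amax / (FP4_MAX * E4M3_MAX)


def block_scale(block_amax, global_scale):
    """Block scale before E4M3 rounding, clamped to the E4M3 range."""
    return min(block_amax / FP4_MAX / global_scale, E4M3_MAX)


tensor_amax = 10.0
gs = dynamic_global_scale(tensor_amax)

# With the dynamic scale, a block at half the tensor amax uses half the range.
print(block_scale(5.0, gs))        # about 224

# With a global scale 13x too small, even small blocks hit the clamp.
print(block_scale(5.0, gs / 13))   # 448.0
print(block_scale(1.0, gs / 13))   # 448.0
```

Once every block scale sits at the clamp, the per-block information is gone and the FP4 values themselves saturate, which matches the symptom reported in the commit.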
a7eae10ef4 fix: use checkpoint input_scale for activation quantization
Critical fix: the checkpoint's input_scale was used during weight
calibration but we were computing a dynamic scale from data (amax/2688, i.e. amax/(6*448)).
This was 13x off from the checkpoint value.

Changes:
- stage_activation() accepts optional input_global_scale parameter
- nvfp4_mega_moe_full() accepts l1_input_scale and l2_input_scale
- vLLM patch preserves w13/w2_input_scale in finalize_weights
- L1 activation uses checkpoint w13_input_scale for quantization
- L2 activation uses checkpoint w2_input_scale for quantization
- alpha = input_scale * weight_scale_2 (correct calibration contract)
2026-05-15 23:57:08 +00:00
af50e98fe9 test: B layout test with N=128 K=256 2026-05-15 23:52:22 +00:00
efd7a2c56d test: B matrix weight layout verification via one-hot A 2026-05-15 23:52:00 +00:00
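The one-hot technique named in efd7a2c56d works because a one-hot row of A selects exactly one row of B. The sketch below is a plain-Python illustration with small, hypothetical dimensions and a coordinate-encoding fill pattern; it is not the repository's test.

```python
def matmul(a, b):
    """Naive (M, K) x (K, N) matrix product on nested lists."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def one_hot_rows(k_dim):
    """A with M = K rows, where row r is one-hot at column r."""
    return [[1.0 if c == r else 0.0 for c in range(k_dim)] for r in range(k_dim)]


K, N = 4, 3
# Each B element encodes its own (k, n) coordinate as 100 * k + n.
b = [[100.0 * k + n for n in range(N)] for k in range(K)]
c = matmul(one_hot_rows(K), b)
print(c[2])  # [200.0, 201.0, 202.0]: row 2 of C is row 2 of B
```

If a kernel reads B with the wrong layout, the values that show up in C decode directly to the coordinates it actually read.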
bb5a1ba4c8 cleanup: remove unused slot_token from nvfp4_moe_l2
L2 input is already slot-major, so slot_token was accepted but never
passed to the GEMM. Made it explicit by removing the parameter.
2026-05-15 23:50:39 +00:00
887360281e docs: major update — SF remap verified correct, BF16 ref is the red herring
Key finding: the 0.2 cosine always came from a wrong reference, not a wrong GEMM.
Proof: uniform FP4+SF produces mathematically exact output, and the
roundtrip SF verifier passes with 0 errors. Do NOT re-investigate SF remap.
2026-05-15 23:38:34 +00:00
eb26d291cb test: uniform FP4 + uniform SF sanity check 2026-05-15 23:36:08 +00:00
1f09b51168 test: check SF signed vs unsigned interpretation 2026-05-15 23:35:06 +00:00
4f857d5f99 docs: major DEBUG_LOG update — forward mapping, verifier, full debug timeline 2026-05-15 23:02:30 +00:00
aa209ddd21 debug: add SF remap roundtrip verifier
Checks that forward remap wrote the correct bytes by comparing
src[mn*stride_mn + k_sf*stride_ksf] against dst[layout_sf(make_coord(mn, k_sf*16, 0))].
Prints error count for SFA and SFB on first GEMM call.
2026-05-15 22:59:44 +00:00
6626b75a2f fix: use filter_zeros for SF allocation + no-branch forward mapping
- Allocation: cute::size(cute::filter_zeros(layout)) matches CUTLASS examples
- Kernel: layout_sf(make_coord(mn, k_sf*16, 0)) — no branching on LayoutRank
- Avoids silent fallthrough that wrote dst[0] for all threads
2026-05-15 22:58:51 +00:00
6fc8fa61e0 fix: use flat logical coordinate layout_sf(make_coord(mn, k_elem, 0))
CuTe maps compatible flat coordinates into the natural hierarchical
coordinate before applying strides. No manual decomposition needed.
k_elem = k_sf * 16 (logical K element, not compact SF index).
2026-05-15 22:53:57 +00:00
a48717ccf5 fix: remove duplicate dst_idx declaration 2026-05-15 22:31:05 +00:00
5ff1b9e401 fix: use hierarchical coordinates for layout_sf forward mapping
Flat make_coord(mn, k*16) doesn't decompose into the nested atom shape.
Must manually decompose:
  mn -> (m0, m1, mt) where m0=mn%32, m1=(mn/32)%4, mt=mn/128
  k_sf -> (k0, k1, kt) where k0=0 (stride-0), k1=k_sf%4, kt=k_sf/4
2026-05-15 22:11:14 +00:00
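The decomposition in 5ff1b9e401 can be modeled on the host to cross-check a device-side remap. The sketch below is an assumption-heavy model: the atom strides (16 for m0, 4 for m1, 1 for k1), the 512-entry tile, and the K-fastest tile order are not stated in the commit. They follow the commonly described 128 x 4 block-scale atom, and the function name is hypothetical.

```python
def sf_offset(mn, k_sf, num_k_sf):
    """Destination offset of scale factor (mn, k_sf) in the blocked layout.

    num_k_sf is the number of scale factors along K (K // 16 for NVFP4).
    """
    # Decomposition quoted in the commit message.
    m0, m1, mt = mn % 32, (mn // 32) % 4, mn // 128
    k1, kt = k_sf % 4, k_sf // 4
    k_tiles = (num_k_sf + 3) // 4
    # Assumed strides inside one tile: m0 -> 16, m1 -> 4, k1 -> 1.
    # Assumed tile order: K tiles vary fastest, then MN tiles.
    return (mt * k_tiles + kt) * 512 + m0 * 16 + m1 * 4 + k1


print(sf_offset(1, 0, 8))    # 16
print(sf_offset(32, 0, 8))   # 4
print(sf_offset(0, 1, 8))    # 1
print(sf_offset(0, 4, 8))    # 512
```

Under these assumptions the mapping is a bijection onto a dense range whenever MN is a multiple of 128 and K_sf is a multiple of 4, which gives a cheap host-side sanity check.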
3b4a7b591f test: verify forward mapping with prepack vs live SFB 2026-05-15 22:09:56 +00:00
a1fd4d6233 revert: back to layout_sf(make_coord(...)) — crd2idx was unnecessary 2026-05-15 21:55:00 +00:00
ea678ece64 fix: remove duplicate template declaration 2026-05-15 21:54:10 +00:00
59dad8e2fb fix: use crd2idx instead of layout operator() for SF forward mapping 2026-05-15 21:52:02 +00:00
a09d8e477e fix: remove static_assert in constexpr else (build fix) 2026-05-15 21:27:27 +00:00
7285331395 fix: replace col_major_src with explicit source strides
SFA: src_stride_mn=K_sf, src_stride_ksf=1 (row-major M, K_sf)
SFB: src_stride_mn=1, src_stride_ksf=N (row-major K_sf, N after transpose)

Removes ambiguity about physical memory layout. The source indexing
now uses mn*src_stride_mn + k_sf*src_stride_ksf which works for
any contiguous or transposed layout.
2026-05-15 21:23:21 +00:00
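The explicit-stride indexing in 7285331395 can be sketched in a few lines. The stride values are the ones quoted in the commit message; the function name and the tiny test sizes are hypothetical.

```python
def src_index(mn, k_sf, src_stride_mn, src_stride_ksf):
    """Flat source index of scale factor (mn, k_sf) for any strided layout."""
    return mn * src_stride_mn + k_sf * src_stride_ksf


M, N, K_SF = 3, 5, 2

# SFA stored as (M, K_sf) row-major: stride_mn = K_sf, stride_ksf = 1.
sfa = [(m, k) for m in range(M) for k in range(K_SF)]
print(sfa[src_index(2, 1, K_SF, 1)])  # (2, 1)

# SFB stored as (K_sf, N) row-major: stride_mn = 1, stride_ksf = N.
sfb = [(n, k) for k in range(K_SF) for n in range(N)]
print(sfb[src_index(4, 1, 1, N)])     # (4, 1)
```

The same function serves both operands, so the remap no longer needs a layout flag to decide which dimension is contiguous.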
f6fd549800 fix: restore col_major_src handling for SFB source layout
SFB scales arrive as (K_sf, N) row-major after transpose+contiguous
in weight_transform.py. The col_major_src flag correctly describes
this. Don't assume both sources are (MN, K_sf).
2026-05-15 21:19:58 +00:00
63e67e1025 fix: rewrite SF remap as forward mapping (source→dst)
- Iterate over source indices (MN * K_sf) instead of dst indices
- Use layout_sf forward mapping: layout_sf(make_coord(mn, k_sf*16))
- No more idx2crd reverse extraction or stride-0 ambiguity
- Cleaner, less error-prone, blog-compatible
2026-05-15 20:51:30 +00:00
30b6c89424 fix: correct SF remap coordinate extraction
- First flattened group IS M/N (not K as previously assumed)
- mn = f0 + 32*f1 + 128*f2
- k_sf = f4 + 4*f5 (f3 is stride-0 inner K, ignored)
- The atom stride-0 dimension (f3) maps to offset 0, not a meaningful
  K sub-index. The actual k_sf comes from f4 (sub_k) + f5*4 (tile_k)
- Original code had group assignment right but k_sf extraction wrong
2026-05-15 20:44:46 +00:00
ff5a0843dc fix: divide K element index by SFVecSize to get k_sf
Based on veitner bearblog analysis of CUTLASS SF layout:
- Shape is ((32,4,K_tiles), (SFVecSize,4,M_tiles)) for SFA
- get<0..2> covers K dimension, get<3..5> covers M dimension
- k_sf = K_element_index / SFVecSize
2026-05-15 20:17:24 +00:00
a09b9b53a3 cleanup: remove printf and diag function from CUDA kernel (build fix) 2026-05-15 20:11:40 +00:00
e7c3341317 docs: update DEBUG_LOG with M/K swap root cause 2026-05-15 20:03:20 +00:00
deb6b3231a debug: swap M/K in SF remap + add printf diagnostics 2026-05-15 20:01:47 +00:00
22f0457ccf test: isolate SFA vs SFB remap bug 2026-05-15 19:59:39 +00:00
9eaf6d07e8 test: quick random test 2026-05-15 19:58:57 +00:00
fa7b394571 docs: update DEBUG_LOG with root cause (size→cosize) and full debug timeline 2026-05-15 18:56:09 +00:00
c3841983a0 fix: SF remap uses cute::cosize() instead of cute::size()
The comment explicitly warned about this: allocation uses cosize (physical
size including tile padding) but the iteration bound used size (logical size).
This meant padding positions in the CUTLASS SF layout were never written,
leaving them as zero instead of their actual SF values. With uniform data
(all-ones), all SF values are the same so the bug was invisible. With
random data, different SF values are needed at different positions and
the missing writes corrupt the result.
2026-05-15 18:52:23 +00:00
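The distinction behind c3841983a0 can be shown on a toy layout. This is a generic illustration of size versus cosize for a shape/stride pair with non-negative strides, not a model of the actual SF layout; the shape and stride below are made up.

```python
from itertools import product


def layout_offsets(shape, stride):
    """Every memory offset the layout can produce."""
    return [
        sum(c * s for c, s in zip(coord, stride))
        for coord in product(*(range(d) for d in shape))
    ]


def size(shape):
    """Number of logical coordinates (product of the shape)."""
    n = 1
    for d in shape:
        n *= d
    return n


def cosize(shape, stride):
    """Extent of the physical range touched: largest offset plus one."""
    return max(layout_offsets(shape, stride)) + 1


shape, stride = (4, 3), (1, 8)   # padded: columns are 8 apart, only 4 used
print(size(shape))               # 12
print(cosize(shape, stride))     # 20
```

A loop over destination indices bounded by `size` stops at 12 here and never reaches offsets 12 to 19, even though some of them hold real data; a buffer allocated with `cosize` covers all 20.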
67dcfa83f5 test: random data at small dims + alpha sweep 2026-05-15 18:51:52 +00:00
60f7f60818 test: ultra-minimal GEMM with all-ones 2026-05-15 18:51:31 +00:00