Commit Graph

57 Commits

Author SHA1 Message Date
ea5ee7c1f7 fix: remove prefix_filter from layer tensor loading 2026-05-16 02:15:55 +00:00
303b6a8993 cleanup: move useful tests to tests/, nuke stale debug tests
Kept (moved to tests/):
- test_uniform_fp4.py — proves GEMM math (72.0 = 1.5² × K)
- test_b_layout.py — proves B matrix column layout
- test_quick_rand.py — quick GEMM sanity check

Removed (stale SF remap debug artifacts):
- test_forward_map.py, test_gemm_sweep.py, test_m1_gemm.py
- test_minimal_gemm.py, test_rand_gemm.py, test_sf_check.py
- test_sf_remap.py, test_sf_signed.py, test_sf_layout_diag.cu
2026-05-16 02:14:37 +00:00
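The identity quoted for test_uniform_fp4.py (72.0 = 1.5² × K) implies K = 32. The sketch below reproduces that arithmetic in plain Python. It is not the test itself: it assumes unit scale factors and unit alpha, and the sizes M and N are arbitrary.

```python
K = 32          # assumed reduction length, since 72.0 / 1.5**2 == 32
M, N = 2, 3     # arbitrary illustrative output shape

# 1.5 is exactly representable in FP4 (E2M1), so quantization adds no error.
A = [[1.5] * K for _ in range(M)]   # M x K, every element 1.5
B = [[1.5] * N for _ in range(K)]   # K x N, every element 1.5

# Plain triple-loop GEMM with unit scale factors (an assumption of this sketch).
C = [[sum(A[i][k] * B[k][j] for k in range(K)) for j in range(N)]
     for i in range(M)]

expected = 1.5 ** 2 * K
assert all(c == expected for row in C for c in row)
print(expected)  # 72.0
```

Because every product is 2.25 and every sum has exactly K terms, any deviation from 72.0 in a real kernel run would point at scale handling or layout rather than at floating-point rounding.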
2114bd11be test: add standalone layer 0 comparison test (no vLLM, no Docker)
tests/layertest.py:
- Loads layer 0 expert weights from both original (MXFP4) and NVFP4 checkpoints
- Dequantizes both to BF16 for reference comparison
- Runs MoE forward pass in pure BF16 (no kernel)
- Runs same forward pass through our NVFP4 CUTLASS kernel
- Compares cosine similarity: kernel vs BF16 reference

tests/run_test.sh:
- Creates venv, installs deps, builds kernel from source, runs test

Isolates our kernel completely from vLLM's weight loading, tensor
parallelism, and MoE routing. If cosine ≈ 1.0, the bug is in vLLM. If
cosine ≈ 0, the bug is in our kernel pipeline.
2026-05-16 02:13:18 +00:00
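The decision rule in the commit above (cosine ≈ 1.0 versus cosine ≈ 0) can be illustrated with a small pure-Python cosine similarity. This is only a sketch: the vectors are made-up stand-ins for the BF16 reference and kernel outputs, and the zero-norm convention is a choice of this example, not something taken from tests/layertest.py.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention for this sketch: undefined cosine reported as 0
    return dot / (norm_a * norm_b)

reference = [1.0, -2.0, 3.0, 0.5]         # stand-in for the BF16 reference output
good_kernel = [1.01, -1.98, 3.02, 0.49]   # same direction plus small quantization noise
bad_kernel = [2.0, 1.0, 0.0, 0.0]         # unrelated output, orthogonal to the reference

good = cosine_similarity(reference, good_kernel)
bad = cosine_similarity(reference, bad_kernel)
assert good > 0.999       # close to 1.0: kernel agrees with the reference
assert abs(bad) < 1e-12   # close to 0: kernel output is unrelated
```

Cosine similarity ignores overall magnitude, so a pure scaling error (for example a wrong alpha) would still score close to 1.0 and needs a separate magnitude check.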
294e9f98f2 cleanup: rename _ue8m0_to_float32 → _block_scale_to_float32, remove dead code
- Renamed misleading _ue8m0_to_float32 to _block_scale_to_float32
  (our checkpoint uses float8_e4m3fn, NOT E8M0)
- Removed dead is_scale_e8m0 property (never referenced)
- Removed dead _block_scale_to_float32 copy in MegaMoEExperts class
- Cleaned up stale E8M0/UE8M0/shift-by-23 comments
- Replaced the E8M0 assert False with a raised ValueError
- Updated DeepseekV4FP8Config docstring for NVFP4
2026-05-16 01:55:56 +00:00
4a624879ca docs: update DEBUG_LOG — input_scale red herring, current state, next steps 2026-05-16 01:15:49 +00:00
79b9becf9c revert: don't use checkpoint input_scale for activation normalization
Using the checkpoint input_scale as the normalization scale saturates
the FP4 quantization (all block scales = 448, the float8_e4m3fn maximum).
The input_scale is a calibration constant, NOT the amax/(6*448)
normalization scale.

Reverted to dynamic amax/(6*448) for activation quantization.
The correct use of checkpoint input_scale is still under investigation.

Preserved: _w13_input_scale and _w2_input_scale in finalize_weights
for future use once we understand the correct alpha contract.
2026-05-16 00:12:41 +00:00
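The saturation described in the revert above can be sketched numerically. The example assumes the common NVFP4 convention in which a per-block scale is block_amax / (6 * global_scale) and is clipped to the float8_e4m3fn maximum of 448. The block maxima are invented, and the "13 times smaller" scale is a hypothetical stand-in for a mismatched calibration constant, not the actual checkpoint value.

```python
FP4_MAX = 6.0      # largest magnitude representable in FP4 (E2M1)
E4M3_MAX = 448.0   # largest finite float8_e4m3fn value

def block_scales(block_amaxes, global_scale):
    """Per-block scale factors (before E4M3 rounding), clipped to the E4M3 range."""
    return [min(a / (FP4_MAX * global_scale), E4M3_MAX) for a in block_amaxes]

block_amaxes = [16.0, 18.0, 19.0, 20.0]   # made-up per-block absolute maxima
amax = max(block_amaxes)

# Dynamic normalization: amax / (6 * 448) = amax / 2688.
dynamic_scale = amax / (FP4_MAX * E4M3_MAX)
dynamic = block_scales(block_amaxes, dynamic_scale)

# Hypothetical mismatched scale, 13x smaller than the dynamic one.
mismatched = block_scales(block_amaxes, dynamic_scale / 13)

assert abs(max(dynamic) - E4M3_MAX) < 1e-9   # only the largest block reaches 448
assert min(dynamic) < 400                    # smaller blocks keep distinct scales
assert all(s == E4M3_MAX for s in mismatched)  # every block clips to 448
```

Once every block scale clips to the same value, the per-block information is lost and the FP4 payload itself has to absorb the dynamic range, which it cannot.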
a7eae10ef4 fix: use checkpoint input_scale for activation quantization
Critical fix: the checkpoint's input_scale was used during weight
calibration, but we were computing a dynamic scale from the data
(amax/2688, i.e. amax/(6*448)). The dynamic scale was 13x off from the
checkpoint value.

Changes:
- stage_activation() accepts optional input_global_scale parameter
- nvfp4_mega_moe_full() accepts l1_input_scale and l2_input_scale
- vLLM patch preserves w13/w2_input_scale in finalize_weights
- L1 activation uses checkpoint w13_input_scale for quantization
- L2 activation uses checkpoint w2_input_scale for quantization
- alpha = input_scale * weight_scale_2 (correct calibration contract)
2026-05-15 23:57:08 +00:00
af50e98fe9 test: B layout test with N=128 K=256 2026-05-15 23:52:22 +00:00
efd7a2c56d test: B matrix weight layout verification via one-hot A 2026-05-15 23:52:00 +00:00
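The idea behind a one-hot A test can be sketched as follows: if row 0 of A is zero except for a single 1 at position k, the output row equals row k of B, so any layout mix-up in B shows up as a wrong row. The sizes and the value encoding below are illustrative assumptions (the follow-up commit uses N=128, K=256), and the GEMM is a plain Python loop rather than the kernel.

```python
N, K = 4, 6   # small illustrative sizes

# Give every (k, n) entry of B a distinct value so a misplaced element is visible.
B = [[100 * k + n for n in range(N)] for k in range(K)]

def gemm_row(a_row, b):
    """Multiply a single row vector (length K) by a K x N matrix."""
    return [sum(a_row[k] * b[k][n] for k in range(len(a_row)))
            for n in range(len(b[0]))]

for k_hot in range(K):
    a_row = [1 if k == k_hot else 0 for k in range(K)]
    out = gemm_row(a_row, B)
    # A one-hot row selects exactly row k_hot of B.
    assert out == B[k_hot]
```

With a real kernel, an output that matches a different row or a column of B would indicate which axis the layout has swapped.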
bb5a1ba4c8 cleanup: remove unused slot_token from nvfp4_moe_l2
L2 input is already slot-major, so slot_token was accepted but never
passed to the GEMM. Made it explicit by removing the parameter.
2026-05-15 23:50:39 +00:00
887360281e docs: major update — SF remap verified correct, BF16 ref is the red herring
Key finding: the 0.2 cosine was always a wrong reference, not a wrong GEMM.
Proof: uniform FP4+SF produces mathematically exact output, and the
roundtrip SF verifier passes with 0 errors. Do NOT re-investigate SF remap.
2026-05-15 23:38:34 +00:00
eb26d291cb test: uniform FP4 + uniform SF sanity check 2026-05-15 23:36:08 +00:00
1f09b51168 test: check SF signed vs unsigned interpretation 2026-05-15 23:35:06 +00:00
4f857d5f99 docs: major DEBUG_LOG update — forward mapping, verifier, full debug timeline 2026-05-15 23:02:30 +00:00
aa209ddd21 debug: add SF remap roundtrip verifier
Checks that forward remap wrote the correct bytes by comparing
src[mn*stride_mn + k_sf*stride_ksf] against dst[layout_sf(make_coord(mn, k_sf*16, 0))].
Prints error count for SFA and SFB on first GEMM call.
2026-05-15 22:59:44 +00:00
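A roundtrip verifier of the kind described above can be sketched in Python. The comparison mirrors the commit text (source index mn*stride_mn + k_sf*stride_ksf against the destination index produced by the layout at logical K element k_sf*16). The function toy_layout_sf is a hypothetical stand-in for the CuTe layout_sf and does not reproduce the CUTLASS scale-factor layout.

```python
def toy_layout_sf(mn, k_elem, num_mn):
    """Hypothetical destination layout (k_sf-major, then mn). Not the CUTLASS layout."""
    k_sf = k_elem // 16
    return k_sf * num_mn + mn

def forward_remap(src, num_mn, num_ksf, stride_mn, stride_ksf):
    """Write every source scale factor to its destination offset."""
    dst = [0] * (num_mn * num_ksf)
    for mn in range(num_mn):
        for k_sf in range(num_ksf):
            dst[toy_layout_sf(mn, k_sf * 16, num_mn)] = src[mn * stride_mn + k_sf * stride_ksf]
    return dst

def count_remap_errors(src, dst, num_mn, num_ksf, stride_mn, stride_ksf):
    """Re-read every (mn, k_sf) pair and count bytes that did not survive the remap."""
    errors = 0
    for mn in range(num_mn):
        for k_sf in range(num_ksf):
            expected = src[mn * stride_mn + k_sf * stride_ksf]
            actual = dst[toy_layout_sf(mn, k_sf * 16, num_mn)]
            if expected != actual:
                errors += 1
    return errors

num_mn, num_ksf = 4, 3
src = list(range(10, 10 + num_mn * num_ksf))   # row-major (mn, k_sf) source
dst = forward_remap(src, num_mn, num_ksf, stride_mn=num_ksf, stride_ksf=1)
clean_errors = count_remap_errors(src, dst, num_mn, num_ksf, num_ksf, 1)

dst[5] ^= 0xFF                                 # corrupt one destination byte
corrupt_errors = count_remap_errors(src, dst, num_mn, num_ksf, num_ksf, 1)
assert (clean_errors, corrupt_errors) == (0, 1)
```

A verifier like this only proves that the remap is self-consistent with the layout function. It cannot tell whether the layout function matches what the GEMM expects.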
6626b75a2f fix: use filter_zeros for SF allocation + no-branch forward mapping
- Allocation: cute::size(cute::filter_zeros(layout)) matches CUTLASS examples
- Kernel: layout_sf(make_coord(mn, k_sf*16, 0)) — no branching on LayoutRank
- Avoids silent fallthrough that wrote dst[0] for all threads
2026-05-15 22:58:51 +00:00
6fc8fa61e0 fix: use flat logical coordinate layout_sf(make_coord(mn, k_elem, 0))
CuTe maps compatible flat coordinates into the natural hierarchical
coordinate before applying strides. No manual decomposition needed.
k_elem = k_sf * 16 (logical K element, not compact SF index).
2026-05-15 22:53:57 +00:00
a48717ccf5 fix: remove duplicate dst_idx declaration 2026-05-15 22:31:05 +00:00
5ff1b9e401 fix: use hierarchical coordinates for layout_sf forward mapping
Flat make_coord(mn, k*16) doesn't decompose into the nested atom shape.
Must manually decompose:
  mn -> (m0, m1, mt) where m0=mn%32, m1=(mn/32)%4, mt=mn/128
  k_sf -> (k0, k1, kt) where k0=0 (stride-0), k1=k_sf%4, kt=k_sf/4
2026-05-15 22:11:14 +00:00
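The manual decomposition listed in this commit can be written out and checked for invertibility. The sketch uses only the formulas quoted in the log (the decomposition here, and the recomposition mn = f0 + 32*f1 + 128*f2, k_sf = f4 + 4*f5 from commit 30b6c89424). The function names are made up for illustration, and the later commit 6fc8fa61e0 reports that CuTe performs this decomposition itself when given a flat coordinate.

```python
def decompose_mn(mn):
    """Split a flat M/N index into (m0, m1, mt) = (mn % 32, (mn / 32) % 4, mn / 128)."""
    return mn % 32, (mn // 32) % 4, mn // 128

def decompose_ksf(k_sf):
    """Split a scale-factor K index into (k0, k1, kt); k0 is the stride-0 mode, always 0."""
    return 0, k_sf % 4, k_sf // 4

def recompose(m0, m1, mt, k1, kt):
    """Inverse mapping: mn = m0 + 32*m1 + 128*mt, k_sf = k1 + 4*kt."""
    return m0 + 32 * m1 + 128 * mt, k1 + 4 * kt

# Every (mn, k_sf) pair in a small range must survive the round trip.
for mn in range(512):
    for k_sf in range(16):
        m0, m1, mt = decompose_mn(mn)
        _, k1, kt = decompose_ksf(k_sf)
        assert recompose(m0, m1, mt, k1, kt) == (mn, k_sf)

print(decompose_mn(200))   # (8, 2, 1)
print(decompose_ksf(9))    # (0, 1, 2)
```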
3b4a7b591f test: verify forward mapping with prepack vs live SFB 2026-05-15 22:09:56 +00:00
a1fd4d6233 revert: back to layout_sf(make_coord(...)) — crd2idx was unnecessary 2026-05-15 21:55:00 +00:00
ea678ece64 fix: remove duplicate template declaration 2026-05-15 21:54:10 +00:00
59dad8e2fb fix: use crd2idx instead of layout operator() for SF forward mapping 2026-05-15 21:52:02 +00:00
a09d8e477e fix: remove static_assert in constexpr else (build fix) 2026-05-15 21:27:27 +00:00
7285331395 fix: replace col_major_src with explicit source strides
SFA: src_stride_mn=K_sf, src_stride_ksf=1 (row-major M, K_sf)
SFB: src_stride_mn=1, src_stride_ksf=N (row-major K_sf, N after transpose)

Removes ambiguity about physical memory layout. The source indexing
now uses mn*src_stride_mn + k_sf*src_stride_ksf which works for
any contiguous or transposed layout.
2026-05-15 21:23:21 +00:00
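The two stride pairs in this commit can be checked against small row-major buffers. The sketch below uses made-up sizes and tagged tuples in place of real scale-factor bytes. Only the indexing formula mn*src_stride_mn + k_sf*src_stride_ksf and the stride values come from the commit message.

```python
def src_index(mn, k_sf, stride_mn, stride_ksf):
    """Flat source offset for logical coordinate (mn, k_sf)."""
    return mn * stride_mn + k_sf * stride_ksf

M, N, K_sf = 3, 5, 4   # illustrative sizes

# SFA is stored as (M, K_sf) row-major: stride_mn = K_sf, stride_ksf = 1.
sfa_flat = [("A", m, k) for m in range(M) for k in range(K_sf)]

# SFB is stored as (K_sf, N) row-major after the transpose: stride_mn = 1, stride_ksf = N.
sfb_flat = [("B", n, k) for k in range(K_sf) for n in range(N)]

for m in range(M):
    for k in range(K_sf):
        assert sfa_flat[src_index(m, k, K_sf, 1)] == ("A", m, k)

for n in range(N):
    for k in range(K_sf):
        assert sfb_flat[src_index(n, k, 1, N)] == ("B", n, k)
```

Passing the strides explicitly means the same indexing code serves both operands, with no flag that has to be interpreted per call site.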
f6fd549800 fix: restore col_major_src handling for SFB source layout
SFB scales arrive as (K_sf, N) row-major after transpose+contiguous
in weight_transform.py. The col_major_src flag correctly describes
this. Don't assume both sources are (MN, K_sf).
2026-05-15 21:19:58 +00:00
63e67e1025 fix: rewrite SF remap as forward mapping (source→dst)
- Iterate over source indices (MN * K_sf) instead of dst indices
- Use layout_sf forward mapping: layout_sf(make_coord(mn, k_sf*16))
- No more idx2crd reverse extraction or stride-0 ambiguity
- Cleaner, less error-prone, blog-compatible
2026-05-15 20:51:30 +00:00
30b6c89424 fix: correct SF remap coordinate extraction
- First flattened group IS M/N (not K as previously assumed)
- mn = f0 + 32*f1 + 128*f2
- k_sf = f4 + 4*f5 (f3 is stride-0 inner K, ignored)
- The atom stride-0 dimension (f3) maps to offset 0, not a meaningful
  K sub-index. The actual k_sf comes from f4 (sub_k) + f5*4 (tile_k)
- Original code had group assignment right but k_sf extraction wrong
2026-05-15 20:44:46 +00:00
ff5a0843dc fix: divide K element index by SFVecSize to get k_sf
Based on veitner bearblog analysis of CUTLASS SF layout:
- Shape is ((32,4,K_tiles), (SFVecSize,4,M_tiles)) for SFA
- get<0..2> covers K dimension, get<3..5> covers M dimension
- k_sf = K_element_index / SFVecSize
2026-05-15 20:17:24 +00:00
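The division in the last bullet can be shown with a short example. It assumes SFVecSize = 16, which matches the k_elem = k_sf * 16 relation used in later commits. The function name is made up for illustration.

```python
SF_VEC_SIZE = 16   # assumed: one scale factor per 16 consecutive K elements

def k_sf_of(k_elem):
    """Scale-factor index covering logical K element k_elem."""
    return k_elem // SF_VEC_SIZE

# Elements 0..15 share scale factor 0, elements 16..31 share scale factor 1, and so on.
assert [k_sf_of(k) for k in (0, 15, 16, 31, 32)] == [0, 0, 1, 1, 2]

# The forward direction used later: the first K element of a scale-factor block.
assert all(k_sf_of(k_sf * SF_VEC_SIZE) == k_sf for k_sf in range(64))
```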
a09b9b53a3 cleanup: remove printf and diag function from CUDA kernel (build fix) 2026-05-15 20:11:40 +00:00
e7c3341317 docs: update DEBUG_LOG with M/K swap root cause 2026-05-15 20:03:20 +00:00
deb6b3231a debug: swap M/K in SF remap + add printf diagnostics 2026-05-15 20:01:47 +00:00
22f0457ccf test: isolate SFA vs SFB remap bug 2026-05-15 19:59:39 +00:00
9eaf6d07e8 test: quick random test 2026-05-15 19:58:57 +00:00
fa7b394571 docs: update DEBUG_LOG with root cause (size→cosize) and full debug timeline 2026-05-15 18:56:09 +00:00
c3841983a0 fix: SF remap uses cute::cosize() instead of cute::size()
The comment explicitly warned about this: allocation uses cosize (physical
size including tile padding) but the iteration bound used size (logical size).
This meant padding positions in the CUTLASS SF layout were never written,
leaving them as zero instead of their actual SF values. With uniform data
(all-ones), all SF values are the same so the bug was invisible. With
random data, different SF values are needed at different positions and
the missing writes corrupt the result.
2026-05-15 18:52:23 +00:00
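The difference between a layout's size and its cosize can be illustrated with a small strided layout. The sketch below models a layout as a (shape, stride) pair in plain Python. The helper names and the example layout are assumptions for illustration and are far simpler than the CUTLASS scale-factor layout.

```python
from itertools import product

def layout_offsets(shape, stride):
    """All physical offsets the layout can produce, one per logical coordinate."""
    return [sum(c * s for c, s in zip(coord, stride))
            for coord in product(*(range(n) for n in shape))]

def layout_size(shape):
    """Number of logical coordinates (product of the shape)."""
    total = 1
    for n in shape:
        total *= n
    return total

def layout_cosize(shape, stride):
    """Extent of the physical range: largest offset plus one."""
    return max(layout_offsets(shape, stride)) + 1

shape, stride = (4, 3), (1, 8)   # 4 contiguous elements per column, columns 8 apart
size = layout_size(shape)
cosize = layout_cosize(shape, stride)

# A loop over destination offsets bounded by size stops before the last column.
missed = sorted(o for o in layout_offsets(shape, stride) if o >= size)
print(size, cosize, missed)   # 12 20 [16, 17, 18, 19]
```

With uniform data the unwritten offsets would hold the same value as every other offset if they were written, which is consistent with the commit's observation that the bug was invisible in the all-ones test.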
67dcfa83f5 test: random data at small dims + alpha sweep 2026-05-15 18:51:52 +00:00
60f7f60818 test: ultra-minimal GEMM with all-ones 2026-05-15 18:51:31 +00:00
363dd893f0 test: dimension sweep to isolate GEMM bug 2026-05-15 18:51:09 +00:00
fee5a97ebb fix: cosine_similarity dim for M>0 2026-05-15 18:50:45 +00:00
f9330a1777 test: standalone M=1 GEMM test with deterministic data 2026-05-15 18:47:26 +00:00
1b63a46168 docs: update DEBUG_LOG with cosine≈0 finding + new hypotheses 2026-05-15 18:35:00 +00:00
773967452f debug: fix gs scalar conversion + add traceback 2026-05-15 18:27:44 +00:00
df916b87eb debug: fix gs.item() for multi-element tensor 2026-05-15 18:09:41 +00:00
755f9ad567 debug: fix per_expert_alpha ref + clean up BF16 reference scaling 2026-05-15 17:55:11 +00:00
de8acc7965 debug: dump raw GEMM inputs + first 8 output values 2026-05-15 17:02:40 +00:00
9159cb6bb3 docs: add debug log — current state, hypotheses, fixes 2026-05-15 15:48:57 +00:00
2fd55a94c6 fix: weight reshape bug + igs double-count in BF16 reference 2026-05-15 15:46:16 +00:00
c421a668f3 debug: BF16 reference GEMM + cosine comparison for L1 2026-05-15 14:16:24 +00:00
995589ac8a debug: add FP4 quantization round-trip diagnostic 2026-05-15 13:41:09 +00:00