Commit Graph

42 Commits

Author SHA1 Message Date
a8bd962452 fix: SF remap — iterate dest indices, extract logical (m, k_sf) from nested coord
The forward-map approach (src -> layout_sf(m, k)) failed because CuTe's
layout operator requires coordinates matching the nested shape rank, and
passing flat (int, int) to a ((32,4),K) shape triggers Mismatched Ranks.

New approach: iterate over CUTLASS dest indices, use idx2crd to get the
hierarchical coordinate, flatten it, then extract logical (m, k_sf) by
interpreting the flattened sub-coordinates correctly:
  flat[0..2] = (inner_M, sub_M, tile_M) -> m = tile_M*128 + inner_M*4 + sub_M
  flat[3..5] = (inner_K, sub_K, tile_K) -> k_sf = tile_K*4 + sub_K
  (inner_K is within one SF group — same byte, so ignored for k_sf)

Previous bug: get<0> and get<1> of flatten gave (inner_M, sub_M) — both
M sub-indices. K information was never extracted, so only k_group=0 worked.

Dest buffer is zero-initialized so padding slots (where m >= MN or
k_sf >= K_sf) stay zero.
2026-05-14 15:01:47 +00:00
395cc31883 fix: use layout_sf(m, k_elem) instead of make_coord for nested shapes
make_coord(m, k_elem) produces rank-2 coord, but tile_to_shape creates
nested shapes like ((32,4,tiles_m), (16,4,tiles_k)) which expect
matching nested coords. layout_sf(m, k) operator handles hierarchical
projection automatically.
2026-05-14 14:57:13 +00:00
d90967d6e9 fix: SF remap — element-space K coords + zero-init dest buffer
Two fixes:
1. CuTe layout uses element-space K, not group-space. k_group=3 with
   SFVecSize=16 maps to k_elem=48 in the layout, not k=3.
   Added SFVecSize param to remap kernel, multiply k_sf * SFVecSize
   before passing to layout_sf().

2. Zero-init CUTLASS dest buffer before remap. The layout pads to
   tile boundaries (128x64), so dest is larger than M*K_sf. Unmapped
   padding slots reading garbage causes sporadic wrong results.
   Also fixed grid size to use source count (M*K_sf), not dest size.
2026-05-14 14:54:18 +00:00
5968ebad9f fix: SF remap was using idx2crd+flatten which gives atom sub-indices, not logical (m,k)
The remap kernel iterated over CUTLASS linear indices and tried to
reverse-map with idx2crd + flatten. But flatten() on the nested CuTe
coordinate (from tile_to_shape(SfAtom{}, ...)) gives atom-level
sub-indices, not logical (m, k). This caused all K-groups > 0 in SFA
to map to m*K_sf+0, losing K-group information entirely.

Proof: setting SFA[0,0]=2.0 changed row 0, but SFA[0,3]=2.0 produced
zero change. Only K-group 0 was being read.

Fix: iterate over SOURCE indices (row-major m, k) and use the CuTe
layout forward: layout_sf(make_coord(m, k)) -> CUTLASS dst index.
This is the correct forward direction that CuTe handles natively.

Constant-scale test (all SF=1.0) gave cosine=1.0, confirming the FP4
data path is correct. The bug was purely in the SF remap.
2026-05-14 14:51:02 +00:00
cf796e37cf debug: add weight_scale_2 shape/value logging in weight transform 2026-05-14 14:19:35 +00:00
879adc324d fix: _fold_global_scale — remove broken logical_widths branch
The logical_widths branch took expert 0 and 1's global scales and
applied them to ALL experts. For L1 with logical_widths=[3072,3072],
every expert got expert-0's scale on its gate half and expert-1's
scale on its up half. All other experts' global scales were discarded.

The else branch correctly broadcasts each expert's own (E,1) global
scale across (E, N, K//16). Removed the dead logical_widths code.
2026-05-14 14:17:44 +00:00
ef9cd023a9 fix: unpack_ue4m3_u32 — uint32 lacks CUDA bitwise ops, use int32
PyTorch doesn't implement bitwise_and/shift for UInt32 on CUDA.
Cast to int32 first, then extract bytes, then uint8 → view float8.
2026-05-14 13:44:42 +00:00
1c39e21d87 fix: remove broken L1 weight interleave
The interleave assumed gate/up were pre-interleaved in groups of 16
and that we needed 2CTA UMMA layout. Both wrong:
1. vLLM w13_weight is plain concat [gate; up] along output dim
2. Our CUTLASS kernel uses ClusterShape 1x1x1, not 2CTA

The interleave was shuffling weights into nonsense, making L1 GEMM
compute the wrong thing, and chunk(2) would split wrong halves.
2026-05-14 13:05:45 +00:00
80495c0cd6 docs: clarify SF layout remap is in CUDA, not sf_layout.py
sf_layout.py was a no-op (return sf) but the actual remap happens
in remap_sf_to_cutlass_kernel in cutlass_nvfp4_gemm.cu. Updated
sf_layout.py to pure reference docs so nobody gets confused again.
2026-05-14 13:04:31 +00:00
16f91ff0e1 fix: rewrite stage_activation with proper E2M1 quantization
Three bugs fixed:
1. clamp(0,15) was destroying sign bits — E2M1 is sign-magnitude 4-bit
   nibbles, not unsigned. Half the activation was zeroed.
2. Scale stored block_max but divided by block_max/6, so stored scale was
   6× too large. Now correctly stores block_max/6 (the actual dequant factor).
3. Uniform 0.5 step doesn't match E2M1 values {0,0.5,1,1.5,2,3,4,6}.
   Now snaps to nearest E2M1 representable magnitude.

New _quantize_to_e2m1 helper handles all three correctly:
- Sign-magnitude 4-bit nibble packing (bit3=sign, bits2:0=mag index)
- Correct block scale (block_max / 6.0)
- Nearest-neighbor to actual E2M1 values
2026-05-14 13:02:10 +00:00
3bcc0ac057 fix: unpack_ue4m3_u32 was value-casting instead of bit-reinterpreting
Byte 0x3F was becoming float8(63.0) instead of the float8 whose bit
pattern IS 0x3F (~0.984). Pack uses .view() (correct), unpack used
.to() (wrong) — they were not inverses. This corrupted every activation
scale fed to the L1 GEMM while weight scales were fine.
2026-05-14 12:59:20 +00:00
d3f35c9465 cleanup: remove abandoned TileLang and Mojo files
- Deleted: layout.mojo, mega_moe.mojo, quantize.mojo (Mojo attempt)
- Deleted: nvfp4_blockscaled_gemm.py, staging.py, nvfp4_mega_moe.py (TileLang top-level)
- Deleted: tilelang_nvfp4_gemm.py, tilelang_kernels.py, nvfp4_dequant.py (TileLang package)
- Deleted: src/weight_transform.py (duplicate of package version)
- Fixed nvfp4_mega_moe.py: inlined unpack_ue4m3_u32, removed TileLang fallback imports
- Fixed weight_transform.py: renamed function, removed TileLang alias, updated docs
- Fixed __init__.py: removed TileLang alias, updated docstring
- CUTLASS is the only kernel path now
2026-05-14 12:44:47 +00:00
802c4ee12c Revert stage_activation to simple quantize (staging kernel API incompatible with L1 output dims) 2026-05-14 12:14:01 +00:00
69e0174792 Fix stage_activation: use Triton staging kernel instead of broken simple quantize 2026-05-14 12:01:34 +00:00
c016e66e23 Add CUDA sync + NaN/Inf check after each expert GEMM in grouped kernel 2026-05-14 11:27:58 +00:00
904fc37ad8 Fix: use idx2crd instead of get_coord for CuTe layout coordinate lookup 2026-05-14 10:50:26 +00:00
494d30b6ab Fix: use CuTe get_coord for proper scale factor remap to CUTLASS interleaved layout 2026-05-14 10:48:58 +00:00
869151d211 Fix kernel.py: remove broken expand on scale factors (was expanding sf to weight size) 2026-05-14 10:36:16 +00:00
84becfac93 Test: pass scales directly to CUTLASS (no remap) to diagnose layout issue 2026-05-14 10:23:02 +00:00
a272bc49b0 Fix: torch::kBFloat16 2026-05-14 10:21:10 +00:00
3f62e49e6e Fix PyTorch API: use c10::cuda and at::kBF16 2026-05-14 10:20:00 +00:00
2ee4e26772 Fix: remove compile-time SM100 guard from pytorch binding, use runtime check instead 2026-05-14 10:18:36 +00:00
540e68593f Add scale factor remap kernel: remap simple row-major SFs to CUTLASS interleaved layout 2026-05-14 10:05:38 +00:00
2998c889e7 Implement simple FP4 quantization for L1→L2 re-quant step (no vLLM fp4_utils dependency) 2026-05-14 09:50:52 +00:00
98913c9b1a Fix stage_activation: use Triton staging kernel from vLLM patch instead of fp4_utils 2026-05-14 09:38:50 +00:00
25cbc85afe Replace kernel.py with thin wrapper around pre-compiled _C extension 2026-05-14 09:25:56 +00:00
33e5d67326 Add CUTLASS_CHECK macro 2026-05-13 23:28:03 +00:00
b7c5cba407 Fix device_memory include path 2026-05-13 23:27:06 +00:00
3299d22ad6 Fix type casts and includes for CUTLASS NVFP4 GEMM 2026-05-13 23:26:18 +00:00
1eb9c43217 Rewrite CUTLASS kernel based on NVIDIA example 72b (nv_float4_t, CollectiveBuilder, OpClassBlockScaledTensorOp) 2026-05-13 23:25:20 +00:00
8a9af441dc Fix includes: use cutlass/float_subbyte.h (has float_e2m1_t and float_ue4m3_t), point to latest CUTLASS 2026-05-13 23:23:01 +00:00
d789f5e3e0 Add CCCL include path for CUTLASS 3.x 2026-05-13 23:18:26 +00:00
12588047fd Fix setup.py: use include_dirs and extra_compile_args (correct PyTorch extension API) 2026-05-13 23:17:30 +00:00
1b1c3a42fe Fix setup.py source paths 2026-05-13 23:14:05 +00:00
f375c80bfe feat: CUTLASS NVFP4 block-scaled GEMM kernel (native SM100 Blackwell)
- Native NVFP4 block-scaled MMA using CUTLASS MainloopSm100TmaUmmaWarpSpecializedBlockScaled
- Invokes mxf8f6f4.block_scale tensor core instructions (tcgen05.mma)
- E2M1 (packed int8) + UE4M3 (float8_e4m3fn) block-16 scales → BF16 output
- No dequantization: hardware block-scaled MMA avoids costly dequantize+BF16 path
- PyTorch CUDA extension with CollectiveBuilder auto-deduction
- Grouped expert GEMM for MoE dispatch (32 experts/rank, top-6 routing)
- Integrated into nvfp4_mega_moe.py as primary path with TileLang fallback
- Standalone C API (cutlass_nvfp4_gemm.cu) for direct B200 compilation
- Build script, setup.py, and test script for B200 deployment

Files:
  cutlass_nvfp4_gemm/ — Kernel source, PyTorch binding, build/test scripts
  nvfp4_mega_moe.py — Updated to use CUTLASS kernel when available
2026-05-13 23:11:15 +00:00
56c7880296 Native NVFP4 TileLang kernel: tcgen05 block-scaled MMA 2026-05-13 23:02:06 +00:00
bf13665dbe Implement TileLang NVFP4 mega_moe L1/L2 kernels
- nvfp4_mega_moe_l1: L1 GEMM (gate_up_proj) with FP4 dequant → BF16 GEMM
- nvfp4_mega_moe_l2: L2 GEMM (down_proj) with FP4 dequant → BF16 GEMM
- nvfp4_dequant.py: E2M1 packed → BF16 with UE4M3 block16 scales
- tilelang_kernels.py: Grouped expert GEMM with TileLang-compiled BF16 GEMM
- Full pipeline: L1 GEMM → SiLU+Mul → re-quantize → L2 GEMM → output
- MEGA_MOE_STATIC=1 bypass still works for pipeline testing

Current approach: dequantize FP4→BF16 then run BF16 GEMM via TileLang T.gemm
(auto-lowers to tcgen05 on Blackwell). Will be upgraded to native FP4
block-scaled MMA (tcgen05.mma kind::mxf8f6f4.block_scale) once TileLang
adds E2M1+UE4M3 support.
2026-05-13 22:36:58 +00:00
ebc0ab0cac Fix: keep scales as float8_e4m3fn, don't pack to uint32 (min_all_cuda unsupported) 2026-05-13 21:54:39 +00:00
94233c4dd3 Fix __init__.py: remove private imports 2026-05-13 21:43:47 +00:00
1a452ffabd Fix weight_transform signature to match nightly vLLM finalize_weights call 2026-05-13 21:41:43 +00:00
47ca5631d8 Fix __init__.py: only import from package modules 2026-05-13 16:08:37 +00:00
c2b752c2fe Initial: TileLang NVFP4 mega_moe kernel package
- nvfp4_mega_moe_full: drop-in replacement for deep_gemm.mega.fp8_nvfp4_mega_moe
- transform_nvfp4_weights_for_mega_moe: weight transformation (tested)
- SymmBuffer + get_symm_buffer_for_nvfp4_mega_moe: API-matching stubs
- MEGA_MOE_STATIC=1 support for pipeline testing
- pyproject.toml for pip install
2026-05-13 15:44:51 +00:00