cf796e37cf
debug: add weight_scale_2 shape/value logging in weight transform
2026-05-14 14:19:35 +00:00
879adc324d
fix: _fold_global_scale — remove broken logical_widths branch
...
The logical_widths branch took expert 0 and 1's global scales and
applied them to ALL experts. For L1 with logical_widths=[3072,3072],
every expert got expert-0's scale on its gate half and expert-1's
scale on its up half. All other experts' global scales were discarded.
The else branch correctly broadcasts each expert's own (E,1) global
scale across (E, N, K//16). Removed the dead logical_widths code.
2026-05-14 14:17:44 +00:00
ef9cd023a9
fix: unpack_ue4m3_u32 — uint32 lacks CUDA bitwise ops, use int32
...
PyTorch doesn't implement bitwise_and/shift for UInt32 on CUDA.
Cast to int32 first, then extract bytes, then uint8 → view float8.
2026-05-14 13:44:42 +00:00
1c39e21d87
fix: remove broken L1 weight interleave
...
The interleave assumed gate/up were pre-interleaved in groups of 16
and that we needed 2CTA UMMA layout. Both wrong:
1. vLLM w13_weight is plain concat [gate; up] along output dim
2. Our CUTLASS kernel uses ClusterShape 1x1x1, not 2CTA
The interleave was shuffling weights into nonsense, making L1 GEMM
compute the wrong thing, and chunk(2) would split wrong halves.
2026-05-14 13:05:45 +00:00
80495c0cd6
docs: clarify SF layout remap is in CUDA, not sf_layout.py
...
sf_layout.py was a no-op (return sf) but the actual remap happens
in remap_sf_to_cutlass_kernel in cutlass_nvfp4_gemm.cu. Updated
sf_layout.py to pure reference docs so nobody gets confused again.
2026-05-14 13:04:31 +00:00
16f91ff0e1
fix: rewrite stage_activation with proper E2M1 quantization
...
Three bugs fixed:
1. clamp(0,15) was destroying sign bits — E2M1 is sign-magnitude 4-bit
nibbles, not unsigned. Half the activation was zeroed.
2. Scale stored block_max but divided by block_max/6, so stored scale was
6× too large. Now correctly stores block_max/6 (the actual dequant factor).
3. Uniform 0.5 step doesn't match E2M1 values {0,0.5,1,1.5,2,3,4,6}.
Now snaps to nearest E2M1 representable magnitude.
New _quantize_to_e2m1 helper handles all three correctly:
- Sign-magnitude 4-bit nibble packing (bit3=sign, bits2:0=mag index)
- Correct block scale (block_max / 6.0)
- Nearest-neighbor to actual E2M1 values
2026-05-14 13:02:10 +00:00
3bcc0ac057
fix: unpack_ue4m3_u32 was value-casting instead of bit-reinterpreting
...
Byte 0x3F was becoming float8(63.0) instead of the float8 whose bit
pattern IS 0x3F (~0.984). Pack uses .view() (correct), unpack used
.to() (wrong) — they were not inverses. This corrupted every activation
scale fed to the L1 GEMM while weight scales were fine.
2026-05-14 12:59:20 +00:00
8b7fa0c91e
add README: pipeline diagram, file map, data formats, known issues
2026-05-14 12:48:08 +00:00
d3f35c9465
cleanup: remove abandoned TileLang and Mojo files
...
- Deleted: layout.mojo, mega_moe.mojo, quantize.mojo (Mojo attempt)
- Deleted: nvfp4_blockscaled_gemm.py, staging.py, nvfp4_mega_moe.py (TileLang top-level)
- Deleted: tilelang_nvfp4_gemm.py, tilelang_kernels.py, nvfp4_dequant.py (TileLang package)
- Deleted: src/weight_transform.py (duplicate of package version)
- Fixed nvfp4_mega_moe.py: inlined unpack_ue4m3_u32, removed TileLang fallback imports
- Fixed weight_transform.py: renamed function, removed TileLang alias, updated docs
- Fixed __init__.py: removed TileLang alias, updated docstring
- CUTLASS is the only kernel path now
2026-05-14 12:44:47 +00:00
802c4ee12c
Revert stage_activation to simple quantize (staging kernel API incompatible with L1 output dims)
2026-05-14 12:14:01 +00:00
69e0174792
Fix stage_activation: use Triton staging kernel instead of broken simple quantize
2026-05-14 12:01:34 +00:00
c016e66e23
Add CUDA sync + NaN/Inf check after each expert GEMM in grouped kernel
2026-05-14 11:27:58 +00:00
1dfe5ffd05
Add comprehensive README documenting quirks, pitfalls, and setup
2026-05-14 11:23:32 +00:00
904fc37ad8
Fix: use idx2crd instead of get_coord for CuTe layout coordinate lookup
2026-05-14 10:50:26 +00:00
494d30b6ab
Fix: use CuTe get_coord for proper scale factor remap to CUTLASS interleaved layout
2026-05-14 10:48:58 +00:00
869151d211
Fix kernel.py: remove broken expand on scale factors (was expanding sf to weight size)
2026-05-14 10:36:16 +00:00
84becfac93
Test: pass scales directly to CUTLASS (no remap) to diagnose layout issue
2026-05-14 10:23:02 +00:00
a272bc49b0
Fix: torch::kBFloat16
2026-05-14 10:21:10 +00:00
3f62e49e6e
Fix PyTorch API: use c10::cuda and at::kBF16
2026-05-14 10:20:00 +00:00
2ee4e26772
Fix: remove compile-time SM100 guard from pytorch binding, use runtime check instead
2026-05-14 10:18:36 +00:00
540e68593f
Add scale factor remap kernel: remap simple row-major SFs to CUTLASS interleaved layout
2026-05-14 10:05:38 +00:00
2998c889e7
Implement simple FP4 quantization for L1→L2 re-quant step (no vLLM fp4_utils dependency)
2026-05-14 09:50:52 +00:00
98913c9b1a
Fix stage_activation: use Triton staging kernel from vLLM patch instead of fp4_utils
2026-05-14 09:38:50 +00:00
25cbc85afe
Replace kernel.py with thin wrapper around pre-compiled _C extension
2026-05-14 09:25:56 +00:00
33e5d67326
Add CUTLASS_CHECK macro
2026-05-13 23:28:03 +00:00
b7c5cba407
Fix device_memory include path
2026-05-13 23:27:06 +00:00
3299d22ad6
Fix type casts and includes for CUTLASS NVFP4 GEMM
2026-05-13 23:26:18 +00:00
1eb9c43217
Rewrite CUTLASS kernel based on NVIDIA example 72b (nv_float4_t, CollectiveBuilder, OpClassBlockScaledTensorOp)
2026-05-13 23:25:20 +00:00
8a9af441dc
Fix includes: use cutlass/float_subbyte.h (has float_e2m1_t and float_ue4m3_t), point to latest CUTLASS
2026-05-13 23:23:01 +00:00
d789f5e3e0
Add CCCL include path for CUTLASS 3.x
2026-05-13 23:18:26 +00:00
12588047fd
Fix setup.py: use include_dirs and extra_compile_args (correct PyTorch extension API)
2026-05-13 23:17:30 +00:00
1b1c3a42fe
Fix setup.py source paths
2026-05-13 23:14:05 +00:00
f375c80bfe
feat: CUTLASS NVFP4 block-scaled GEMM kernel (native SM100 Blackwell)
...
- Native NVFP4 block-scaled MMA using CUTLASS MainloopSm100TmaUmmaWarpSpecializedBlockScaled
- Invokes mxf8f6f4.block_scale tensor core instructions (tcgen05.mma)
- E2M1 (packed int8) + UE4M3 (float8_e4m3fn) block-16 scales → BF16 output
- No dequantization: hardware block-scaled MMA avoids costly dequantize+BF16 path
- PyTorch CUDA extension with CollectiveBuilder auto-deduction
- Grouped expert GEMM for MoE dispatch (32 experts/rank, top-6 routing)
- Integrated into nvfp4_mega_moe.py as primary path with TileLang fallback
- Standalone C API (cutlass_nvfp4_gemm.cu) for direct B200 compilation
- Build script, setup.py, and test script for B200 deployment
Files:
cutlass_nvfp4_gemm/ — Kernel source, PyTorch binding, build/test scripts
nvfp4_mega_moe.py — Updated to use CUTLASS kernel when available
2026-05-13 23:11:15 +00:00
56c7880296
Native NVFP4 TileLang kernel: tcgen05 block-scaled MMA
2026-05-13 23:02:06 +00:00
bf13665dbe
Implement TileLang NVFP4 mega_moe L1/L2 kernels
...
- nvfp4_mega_moe_l1: L1 GEMM (gate_up_proj) with FP4 dequant → BF16 GEMM
- nvfp4_mega_moe_l2: L2 GEMM (down_proj) with FP4 dequant → BF16 GEMM
- nvfp4_dequant.py: E2M1 packed → BF16 with UE4M3 block16 scales
- tilelang_kernels.py: Grouped expert GEMM with TileLang-compiled BF16 GEMM
- Full pipeline: L1 GEMM → SiLU+Mul → re-quantize → L2 GEMM → output
- MEGA_MOE_STATIC=1 bypass still works for pipeline testing
Current approach: dequantize FP4→BF16 then run BF16 GEMM via TileLang T.gemm
(auto-lowers to tcgen05 on Blackwell). Will be upgraded to native FP4
block-scaled MMA (tcgen05.mma kind::mxf8f6f4.block_scale) once TileLang
adds E2M1+UE4M3 support.
2026-05-13 22:36:58 +00:00
ebc0ab0cac
Fix: keep scales as float8_e4m3fn, don't pack to uint32 (min_all_cuda unsupported)
2026-05-13 21:54:39 +00:00
94233c4dd3
Fix __init__.py: remove private imports
2026-05-13 21:43:47 +00:00
1a452ffabd
Fix weight_transform signature to match nightly vLLM finalize_weights call
2026-05-13 21:41:43 +00:00
47ca5631d8
Fix __init__.py: only import from package modules
2026-05-13 16:08:37 +00:00
c2b752c2fe
Initial: TileLang NVFP4 mega_moe kernel package
...
- nvfp4_mega_moe_full: drop-in replacement for deep_gemm.mega.fp8_nvfp4_mega_moe
- transform_nvfp4_weights_for_mega_moe: weight transformation (tested)
- SymmBuffer + get_symm_buffer_for_nvfp4_mega_moe: API-matching stubs
- MEGA_MOE_STATIC=1 support for pipeline testing
- pyproject.toml for pip install
2026-05-13 15:44:51 +00:00