Commit Graph

33 Commits

Author SHA1 Message Date
8b7fa0c91e add README: pipeline diagram, file map, data formats, known issues 2026-05-14 12:48:08 +00:00
d3f35c9465 cleanup: remove abandoned TileLang and Mojo files
- Deleted: layout.mojo, mega_moe.mojo, quantize.mojo (Mojo attempt)
- Deleted: nvfp4_blockscaled_gemm.py, staging.py, nvfp4_mega_moe.py (TileLang top-level)
- Deleted: tilelang_nvfp4_gemm.py, tilelang_kernels.py, nvfp4_dequant.py (TileLang package)
- Deleted: src/weight_transform.py (duplicate of package version)
- Fixed nvfp4_mega_moe.py: inlined unpack_ue4m3_u32, removed TileLang fallback imports
- Fixed weight_transform.py: renamed function, removed TileLang alias, updated docs
- Fixed __init__.py: removed TileLang alias, updated docstring
- CUTLASS is the only kernel path now
2026-05-14 12:44:47 +00:00
802c4ee12c Revert stage_activation to simple quantize (staging kernel API incompatible with L1 output dims) 2026-05-14 12:14:01 +00:00
69e0174792 Fix stage_activation: use Triton staging kernel instead of broken simple quantize 2026-05-14 12:01:34 +00:00
c016e66e23 Add CUDA sync + NaN/Inf check after each expert GEMM in grouped kernel 2026-05-14 11:27:58 +00:00
1dfe5ffd05 Add comprehensive README documenting quirks, pitfalls, and setup 2026-05-14 11:23:32 +00:00
904fc37ad8 Fix: use idx2crd instead of get_coord for CuTe layout coordinate lookup 2026-05-14 10:50:26 +00:00
494d30b6ab Fix: use CuTe get_coord for proper scale factor remap to CUTLASS interleaved layout 2026-05-14 10:48:58 +00:00
869151d211 Fix kernel.py: remove broken expand on scale factors (was expanding sf to weight size) 2026-05-14 10:36:16 +00:00
84becfac93 Test: pass scales directly to CUTLASS (no remap) to diagnose layout issue 2026-05-14 10:23:02 +00:00
a272bc49b0 Fix: torch::kBFloat16 2026-05-14 10:21:10 +00:00
3f62e49e6e Fix PyTorch API: use c10::cuda and at::kBF16 2026-05-14 10:20:00 +00:00
2ee4e26772 Fix: remove compile-time SM100 guard from pytorch binding, use runtime check instead 2026-05-14 10:18:36 +00:00
540e68593f Add scale factor remap kernel: remap simple row-major SFs to CUTLASS interleaved layout 2026-05-14 10:05:38 +00:00
2998c889e7 Implement simple FP4 quantization for L1→L2 re-quant step (no vLLM fp4_utils dependency) 2026-05-14 09:50:52 +00:00
98913c9b1a Fix stage_activation: use Triton staging kernel from vLLM patch instead of fp4_utils 2026-05-14 09:38:50 +00:00
25cbc85afe Replace kernel.py with thin wrapper around pre-compiled _C extension 2026-05-14 09:25:56 +00:00
33e5d67326 Add CUTLASS_CHECK macro 2026-05-13 23:28:03 +00:00
b7c5cba407 Fix device_memory include path 2026-05-13 23:27:06 +00:00
3299d22ad6 Fix type casts and includes for CUTLASS NVFP4 GEMM 2026-05-13 23:26:18 +00:00
1eb9c43217 Rewrite CUTLASS kernel based on NVIDIA example 72b (nv_float4_t, CollectiveBuilder, OpClassBlockScaledTensorOp) 2026-05-13 23:25:20 +00:00
8a9af441dc Fix includes: use cutlass/float_subbyte.h (has float_e2m1_t and float_ue4m3_t), point to latest CUTLASS 2026-05-13 23:23:01 +00:00
d789f5e3e0 Add CCCL include path for CUTLASS 3.x 2026-05-13 23:18:26 +00:00
12588047fd Fix setup.py: use include_dirs and extra_compile_args (correct PyTorch extension API) 2026-05-13 23:17:30 +00:00
1b1c3a42fe Fix setup.py source paths 2026-05-13 23:14:05 +00:00
f375c80bfe feat: CUTLASS NVFP4 block-scaled GEMM kernel (native SM100 Blackwell)
- Native NVFP4 block-scaled MMA using CUTLASS MainloopSm100TmaUmmaWarpSpecializedBlockScaled
- Invokes mxf8f6f4.block_scale tensor core instructions (tcgen05.mma)
- E2M1 (packed int8) + UE4M3 (float8_e4m3fn) block-16 scales → BF16 output
- No dequantization: hardware block-scaled MMA avoids costly dequantize+BF16 path
- PyTorch CUDA extension with CollectiveBuilder auto-deduction
- Grouped expert GEMM for MoE dispatch (32 experts/rank, top-6 routing)
- Integrated into nvfp4_mega_moe.py as primary path with TileLang fallback
- Standalone C API (cutlass_nvfp4_gemm.cu) for direct B200 compilation
- Build script, setup.py, and test script for B200 deployment

Files:
  cutlass_nvfp4_gemm/ — Kernel source, PyTorch binding, build/test scripts
  nvfp4_mega_moe.py — Updated to use CUTLASS kernel when available
2026-05-13 23:11:15 +00:00
56c7880296 Native NVFP4 TileLang kernel: tcgen05 block-scaled MMA 2026-05-13 23:02:06 +00:00
bf13665dbe Implement TileLang NVFP4 mega_moe L1/L2 kernels
- nvfp4_mega_moe_l1: L1 GEMM (gate_up_proj) with FP4 dequant → BF16 GEMM
- nvfp4_mega_moe_l2: L2 GEMM (down_proj) with FP4 dequant → BF16 GEMM
- nvfp4_dequant.py: E2M1 packed → BF16 with UE4M3 block16 scales
- tilelang_kernels.py: Grouped expert GEMM with TileLang-compiled BF16 GEMM
- Full pipeline: L1 GEMM → SiLU+Mul → re-quantize → L2 GEMM → output
- MEGA_MOE_STATIC=1 bypass still works for pipeline testing

Current approach: dequantize FP4→BF16 then run BF16 GEMM via TileLang T.gemm
(auto-lowers to tcgen05 on Blackwell). Will be upgraded to native FP4
block-scaled MMA (tcgen05.mma kind::mxf8f6f4.block_scale) once TileLang
adds E2M1+UE4M3 support.
2026-05-13 22:36:58 +00:00
ebc0ab0cac Fix: keep scales as float8_e4m3fn, don't pack to uint32 (min_all_cuda unsupported) 2026-05-13 21:54:39 +00:00
94233c4dd3 Fix __init__.py: remove private imports 2026-05-13 21:43:47 +00:00
1a452ffabd Fix weight_transform signature to match nightly vLLM finalize_weights call 2026-05-13 21:41:43 +00:00
47ca5631d8 Fix __init__.py: only import from package modules 2026-05-13 16:08:37 +00:00
c2b752c2fe Initial: TileLang NVFP4 mega_moe kernel package
- nvfp4_mega_moe_full: drop-in replacement for deep_gemm.mega.fp8_nvfp4_mega_moe
- transform_nvfp4_weights_for_mega_moe: weight transformation (tested)
- SymmBuffer + get_symm_buffer_for_nvfp4_mega_moe: API-matching stubs
- MEGA_MOE_STATIC=1 support for pipeline testing
- pyproject.toml for pip install
2026-05-13 15:44:51 +00:00