nvfp4-megamoe-kernel

Author	SHA1	Message	Date
biondizzle	8b7fa0c91e	add README: pipeline diagram, file map, data formats, known issues	2026-05-14 12:48:08 +00:00
biondizzle	d3f35c9465	cleanup: remove abandoned TileLang and Mojo files - Deleted: layout.mojo, mega_moe.mojo, quantize.mojo (Mojo attempt) - Deleted: nvfp4_blockscaled_gemm.py, staging.py, nvfp4_mega_moe.py (TileLang top-level) - Deleted: tilelang_nvfp4_gemm.py, tilelang_kernels.py, nvfp4_dequant.py (TileLang package) - Deleted: src/weight_transform.py (duplicate of package version) - Fixed nvfp4_mega_moe.py: inlined unpack_ue4m3_u32, removed TileLang fallback imports - Fixed weight_transform.py: renamed function, removed TileLang alias, updated docs - Fixed __init__.py: removed TileLang alias, updated docstring - CUTLASS is the only kernel path now	2026-05-14 12:44:47 +00:00
biondizzle	802c4ee12c	Revert stage_activation to simple quantize (staging kernel API incompatible with L1 output dims)	2026-05-14 12:14:01 +00:00
biondizzle	69e0174792	Fix stage_activation: use Triton staging kernel instead of broken simple quantize	2026-05-14 12:01:34 +00:00
biondizzle	c016e66e23	Add CUDA sync + NaN/Inf check after each expert GEMM in grouped kernel	2026-05-14 11:27:58 +00:00
biondizzle	1dfe5ffd05	Add comprehensive README documenting quirks, pitfalls, and setup	2026-05-14 11:23:32 +00:00
biondizzle	904fc37ad8	Fix: use idx2crd instead of get_coord for CuTe layout coordinate lookup	2026-05-14 10:50:26 +00:00
biondizzle	494d30b6ab	Fix: use CuTe get_coord for proper scale factor remap to CUTLASS interleaved layout	2026-05-14 10:48:58 +00:00
biondizzle	869151d211	Fix kernel.py: remove broken expand on scale factors (was expanding sf to weight size)	2026-05-14 10:36:16 +00:00
biondizzle	84becfac93	Test: pass scales directly to CUTLASS (no remap) to diagnose layout issue	2026-05-14 10:23:02 +00:00
biondizzle	a272bc49b0	Fix: torch::kBFloat16	2026-05-14 10:21:10 +00:00
biondizzle	3f62e49e6e	Fix PyTorch API: use c10::cuda and at::kBF16	2026-05-14 10:20:00 +00:00
biondizzle	2ee4e26772	Fix: remove compile-time SM100 guard from pytorch binding, use runtime check instead	2026-05-14 10:18:36 +00:00
biondizzle	540e68593f	Add scale factor remap kernel: remap simple row-major SFs to CUTLASS interleaved layout	2026-05-14 10:05:38 +00:00
biondizzle	2998c889e7	Implement simple FP4 quantization for L1→L2 re-quant step (no vLLM fp4_utils dependency)	2026-05-14 09:50:52 +00:00
biondizzle	98913c9b1a	Fix stage_activation: use Triton staging kernel from vLLM patch instead of fp4_utils	2026-05-14 09:38:50 +00:00
biondizzle	25cbc85afe	Replace kernel.py with thin wrapper around pre-compiled _C extension	2026-05-14 09:25:56 +00:00
biondizzle	33e5d67326	Add CUTLASS_CHECK macro	2026-05-13 23:28:03 +00:00
biondizzle	b7c5cba407	Fix device_memory include path	2026-05-13 23:27:06 +00:00
biondizzle	3299d22ad6	Fix type casts and includes for CUTLASS NVFP4 GEMM	2026-05-13 23:26:18 +00:00
biondizzle	1eb9c43217	Rewrite CUTLASS kernel based on NVIDIA example 72b (nv_float4_t, CollectiveBuilder, OpClassBlockScaledTensorOp)	2026-05-13 23:25:20 +00:00
biondizzle	8a9af441dc	Fix includes: use cutlass/float_subbyte.h (has float_e2m1_t and float_ue4m3_t), point to latest CUTLASS	2026-05-13 23:23:01 +00:00
biondizzle	d789f5e3e0	Add CCCL include path for CUTLASS 3.x	2026-05-13 23:18:26 +00:00
biondizzle	12588047fd	Fix setup.py: use include_dirs and extra_compile_args (correct PyTorch extension API)	2026-05-13 23:17:30 +00:00
biondizzle	1b1c3a42fe	Fix setup.py source paths	2026-05-13 23:14:05 +00:00
biondizzle	f375c80bfe	feat: CUTLASS NVFP4 block-scaled GEMM kernel (native SM100 Blackwell) - Native NVFP4 block-scaled MMA using CUTLASS MainloopSm100TmaUmmaWarpSpecializedBlockScaled - Invokes mxf8f6f4.block_scale tensor core instructions (tcgen05.mma) - E2M1 (packed int8) + UE4M3 (float8_e4m3fn) block-16 scales → BF16 output - No dequantization: hardware block-scaled MMA avoids costly dequantize+BF16 path - PyTorch CUDA extension with CollectiveBuilder auto-deduction - Grouped expert GEMM for MoE dispatch (32 experts/rank, top-6 routing) - Integrated into nvfp4_mega_moe.py as primary path with TileLang fallback - Standalone C API (cutlass_nvfp4_gemm.cu) for direct B200 compilation - Build script, setup.py, and test script for B200 deployment Files: cutlass_nvfp4_gemm/ — Kernel source, PyTorch binding, build/test scripts nvfp4_mega_moe.py — Updated to use CUTLASS kernel when available	2026-05-13 23:11:15 +00:00
biondizzle	56c7880296	Native NVFP4 TileLang kernel: tcgen05 block-scaled MMA	2026-05-13 23:02:06 +00:00
biondizzle	bf13665dbe	Implement TileLang NVFP4 mega_moe L1/L2 kernels - nvfp4_mega_moe_l1: L1 GEMM (gate_up_proj) with FP4 dequant → BF16 GEMM - nvfp4_mega_moe_l2: L2 GEMM (down_proj) with FP4 dequant → BF16 GEMM - nvfp4_dequant.py: E2M1 packed → BF16 with UE4M3 block16 scales - tilelang_kernels.py: Grouped expert GEMM with TileLang-compiled BF16 GEMM - Full pipeline: L1 GEMM → SiLU+Mul → re-quantize → L2 GEMM → output - MEGA_MOE_STATIC=1 bypass still works for pipeline testing Current approach: dequantize FP4→BF16 then run BF16 GEMM via TileLang T.gemm (auto-lowers to tcgen05 on Blackwell). Will be upgraded to native FP4 block-scaled MMA (tcgen05.mma kind::mxf8f6f4.block_scale) once TileLang adds E2M1+UE4M3 support.	2026-05-13 22:36:58 +00:00
biondizzle	ebc0ab0cac	Fix: keep scales as float8_e4m3fn, don't pack to uint32 (min_all_cuda unsupported)	2026-05-13 21:54:39 +00:00
biondizzle	94233c4dd3	Fix __init__.py: remove private imports	2026-05-13 21:43:47 +00:00
biondizzle	1a452ffabd	Fix weight_transform signature to match nightly vLLM finalize_weights call	2026-05-13 21:41:43 +00:00
biondizzle	47ca5631d8	Fix __init__.py: only import from package modules	2026-05-13 16:08:37 +00:00
biondizzle	c2b752c2fe	Initial: TileLang NVFP4 mega_moe kernel package - nvfp4_mega_moe_full: drop-in replacement for deep_gemm.mega.fp8_nvfp4_mega_moe - transform_nvfp4_weights_for_mega_moe: weight transformation (tested) - SymmBuffer + get_symm_buffer_for_nvfp4_mega_moe: API-matching stubs - MEGA_MOE_STATIC=1 support for pipeline testing - pyproject.toml for pip install	2026-05-13 15:44:51 +00:00

33 Commits