|
|
8b7fa0c91e
|
add README: pipeline diagram, file map, data formats, known issues
|
2026-05-14 12:48:08 +00:00 |
|
|
|
d3f35c9465
|
cleanup: remove abandoned TileLang and Mojo files
- Deleted: layout.mojo, mega_moe.mojo, quantize.mojo (Mojo attempt)
- Deleted: nvfp4_blockscaled_gemm.py, staging.py, nvfp4_mega_moe.py (TileLang top-level)
- Deleted: tilelang_nvfp4_gemm.py, tilelang_kernels.py, nvfp4_dequant.py (TileLang package)
- Deleted: src/weight_transform.py (duplicate of package version)
- Fixed nvfp4_mega_moe.py: inlined unpack_ue4m3_u32, removed TileLang fallback imports
- Fixed weight_transform.py: renamed function, removed TileLang alias, updated docs
- Fixed __init__.py: removed TileLang alias, updated docstring
- CUTLASS is the only kernel path now
|
2026-05-14 12:44:47 +00:00 |
|
|
|
802c4ee12c
|
Revert stage_activation to simple quantize (staging kernel API incompatible with L1 output dims)
|
2026-05-14 12:14:01 +00:00 |
|
|
|
69e0174792
|
Fix stage_activation: use Triton staging kernel instead of broken simple quantize
|
2026-05-14 12:01:34 +00:00 |
|
|
|
c016e66e23
|
Add CUDA sync + NaN/Inf check after each expert GEMM in grouped kernel
|
2026-05-14 11:27:58 +00:00 |
|
|
|
1dfe5ffd05
|
Add comprehensive README documenting quirks, pitfalls, and setup
|
2026-05-14 11:23:32 +00:00 |
|
|
|
904fc37ad8
|
Fix: use idx2crd instead of get_coord for CuTe layout coordinate lookup
|
2026-05-14 10:50:26 +00:00 |
|
|
|
494d30b6ab
|
Fix: use CuTe get_coord for proper scale factor remap to CUTLASS interleaved layout
|
2026-05-14 10:48:58 +00:00 |
|
|
|
869151d211
|
Fix kernel.py: remove broken expand on scale factors (was expanding sf to weight size)
|
2026-05-14 10:36:16 +00:00 |
|
|
|
84becfac93
|
Test: pass scales directly to CUTLASS (no remap) to diagnose layout issue
|
2026-05-14 10:23:02 +00:00 |
|
|
|
a272bc49b0
|
Fix: torch::kBFloat16
|
2026-05-14 10:21:10 +00:00 |
|
|
|
3f62e49e6e
|
Fix PyTorch API: use c10::cuda and at::kBF16
|
2026-05-14 10:20:00 +00:00 |
|
|
|
2ee4e26772
|
Fix: remove compile-time SM100 guard from pytorch binding, use runtime check instead
|
2026-05-14 10:18:36 +00:00 |
|
|
|
540e68593f
|
Add scale factor remap kernel: remap simple row-major SFs to CUTLASS interleaved layout
|
2026-05-14 10:05:38 +00:00 |
|
|
|
2998c889e7
|
Implement simple FP4 quantization for L1→L2 re-quant step (no vLLM fp4_utils dependency)
|
2026-05-14 09:50:52 +00:00 |
|
|
|
98913c9b1a
|
Fix stage_activation: use Triton staging kernel from vLLM patch instead of fp4_utils
|
2026-05-14 09:38:50 +00:00 |
|
|
|
25cbc85afe
|
Replace kernel.py with thin wrapper around pre-compiled _C extension
|
2026-05-14 09:25:56 +00:00 |
|
|
|
33e5d67326
|
Add CUTLASS_CHECK macro
|
2026-05-13 23:28:03 +00:00 |
|
|
|
b7c5cba407
|
Fix device_memory include path
|
2026-05-13 23:27:06 +00:00 |
|
|
|
3299d22ad6
|
Fix type casts and includes for CUTLASS NVFP4 GEMM
|
2026-05-13 23:26:18 +00:00 |
|
|
|
1eb9c43217
|
Rewrite CUTLASS kernel based on NVIDIA example 72b (nv_float4_t, CollectiveBuilder, OpClassBlockScaledTensorOp)
|
2026-05-13 23:25:20 +00:00 |
|
|
|
8a9af441dc
|
Fix includes: use cutlass/float_subbyte.h (has float_e2m1_t and float_ue4m3_t), point to latest CUTLASS
|
2026-05-13 23:23:01 +00:00 |
|
|
|
d789f5e3e0
|
Add CCCL include path for CUTLASS 3.x
|
2026-05-13 23:18:26 +00:00 |
|
|
|
12588047fd
|
Fix setup.py: use include_dirs and extra_compile_args (correct PyTorch extension API)
|
2026-05-13 23:17:30 +00:00 |
|
|
|
1b1c3a42fe
|
Fix setup.py source paths
|
2026-05-13 23:14:05 +00:00 |
|
|
|
f375c80bfe
|
feat: CUTLASS NVFP4 block-scaled GEMM kernel (native SM100 Blackwell)
- Native NVFP4 block-scaled MMA using CUTLASS MainloopSm100TmaUmmaWarpSpecializedBlockScaled
- Invokes mxf8f6f4.block_scale tensor core instructions (tcgen05.mma)
- E2M1 (packed int8) + UE4M3 (float8_e4m3fn) block-16 scales → BF16 output
- No dequantization: hardware block-scaled MMA avoids costly dequantize+BF16 path
- PyTorch CUDA extension with CollectiveBuilder auto-deduction
- Grouped expert GEMM for MoE dispatch (32 experts/rank, top-6 routing)
- Integrated into nvfp4_mega_moe.py as primary path with TileLang fallback
- Standalone C API (cutlass_nvfp4_gemm.cu) for direct B200 compilation
- Build script, setup.py, and test script for B200 deployment
Files:
cutlass_nvfp4_gemm/ — Kernel source, PyTorch binding, build/test scripts
nvfp4_mega_moe.py — Updated to use CUTLASS kernel when available
|
2026-05-13 23:11:15 +00:00 |
|
|
|
56c7880296
|
Native NVFP4 TileLang kernel: tcgen05 block-scaled MMA
|
2026-05-13 23:02:06 +00:00 |
|
|
|
bf13665dbe
|
Implement TileLang NVFP4 mega_moe L1/L2 kernels
- nvfp4_mega_moe_l1: L1 GEMM (gate_up_proj) with FP4 dequant → BF16 GEMM
- nvfp4_mega_moe_l2: L2 GEMM (down_proj) with FP4 dequant → BF16 GEMM
- nvfp4_dequant.py: E2M1 packed → BF16 with UE4M3 block16 scales
- tilelang_kernels.py: Grouped expert GEMM with TileLang-compiled BF16 GEMM
- Full pipeline: L1 GEMM → SiLU+Mul → re-quantize → L2 GEMM → output
- MEGA_MOE_STATIC=1 bypass still works for pipeline testing
Current approach: dequantize FP4→BF16 then run BF16 GEMM via TileLang T.gemm
(auto-lowers to tcgen05 on Blackwell). Will be upgraded to native FP4
block-scaled MMA (tcgen05.mma kind::mxf8f6f4.block_scale) once TileLang
adds E2M1+UE4M3 support.
|
2026-05-13 22:36:58 +00:00 |
|
|
|
ebc0ab0cac
|
Fix: keep scales as float8_e4m3fn, don't pack to uint32 (min_all_cuda unsupported)
|
2026-05-13 21:54:39 +00:00 |
|
|
|
94233c4dd3
|
Fix __init__.py: remove private imports
|
2026-05-13 21:43:47 +00:00 |
|
|
|
1a452ffabd
|
Fix weight_transform signature to match nightly vLLM finalize_weights call
|
2026-05-13 21:41:43 +00:00 |
|
|
|
47ca5631d8
|
Fix __init__.py: only import from package modules
|
2026-05-13 16:08:37 +00:00 |
|
|
|
c2b752c2fe
|
Initial: TileLang NVFP4 mega_moe kernel package
- nvfp4_mega_moe_full: drop-in replacement for deep_gemm.mega.fp8_nvfp4_mega_moe
- transform_nvfp4_weights_for_mega_moe: weight transformation (tested)
- SymmBuffer + get_symm_buffer_for_nvfp4_mega_moe: API-matching stubs
- MEGA_MOE_STATIC=1 support for pipeline testing
- pyproject.toml for pip install
|
2026-05-13 15:44:51 +00:00 |
|