nvfp4-megamoe-kernel

Author	SHA1	Message	Date
biondizzle	3fb3c925af	Restructure: cutedsl/ -> dsv4/ with proper layering - Split bridge.py -> ops/quantize.py, ops/layouts.py, ops/gemm_runner.py - Renamed classes: CuTeDSLNvfp4Linear -> Nvfp4Linear, etc. - Moved kernel code to dsv4/kernels/ (gemm, attention, compressor, decode, cuda) - Moved PyTorch bridges to dsv4/ops/ - Moved nn.Module layers to dsv4/layers/ - Moved reference implementations to dsv4/reference/ - Moved vendored CUTLASS code to vendored/ - Archived ~190 debug tests to tests/archive/ - Kept ~15 canonical tests in tests/unit/ - Updated all import paths - Added stubs for future components (model/, cache/, loader/) - Updated pyproject.toml: dsv4-inference package name	2026-05-21 17:30:44 +00:00
biondizzle	1d5e70adfb	fix: dynamic buffer sizing in nvfp4_linear for varying token counts	2026-05-19 23:59:55 +00:00
biondizzle	9ff1679064	Replace MHC TileLang kernels with pure PyTorch. TileLang kernels (mhc_pre_big_fuse_tilelang, mhc_fused_tilelang) don't work correctly on Blackwell SM100 and cause empty model output. Replace with pure PyTorch implementations: - mhc_pre_torch: Sinkhorn-normalized HC residual mixing - mhc_post_torch: HC post block (einsum residual + post layer mix) - mhc_fused_post_pre_torch: Fused post+pre (composition of above) - hc_head_fused_torch: RMS norm + linear + sigmoid + weighted sum. Patch both layers/mhc.py (CustomOp dispatch) and kernels/mhc/__init__.py (no tilelang import). Also remove tilelang from pyproject.toml deps.	2026-05-19 05:07:41 +00:00
biondizzle	c2b752c2fe	Initial: TileLang NVFP4 mega_moe kernel package - nvfp4_mega_moe_full: drop-in replacement for deep_gemm.mega.fp8_nvfp4_mega_moe - transform_nvfp4_weights_for_mega_moe: weight transformation (tested) - SymmBuffer + get_symm_buffer_for_nvfp4_mega_moe: API-matching stubs - MEGA_MOE_STATIC=1 support for pipeline testing - pyproject.toml for pip install	2026-05-13 15:44:51 +00:00

4 Commits