nvfp4-megamoe-kernel

Author	SHA1	Message	Date
biondizzle	48386e34ad	Fix torch.compile: use custom autograd Function instead of @torch.compiler.disable torch.compile fullgraph mode can't handle @torch.compiler.disable (skips the function and refuses to compile). Custom autograd Functions are treated as opaque ops by torch.compile — they execute eagerly without the compiler trying to trace into CuTeDSL internals (JIT, Path.cwd, etc).	2026-05-18 21:38:28 +00:00
biondizzle	85e1cd3b69	Fix torch.compile crash: @torch.compiler.disable on all CuTeDSL run() CuTeDSL internals (Path.cwd, threading, JIT) are incompatible with torch.dynamo tracing. Marking run() as compiler-disabled makes the runners opaque to torch.compile — they execute eagerly while the rest of the model gets compiled.	2026-05-18 21:07:35 +00:00
biondizzle	a94011ec92	Fix torch.compile crash: remove threading.Lock from LUT cache path The _NVFP4_STEP_LUT_LOCK caused 'Unsupported context manager' under torch.compile/cudagraph. LUT is now pre-populated during warmup so the fast path (cache hit) never hits a lock. Also removed all init/warmup debug prints from CuTeDSL kernels.	2026-05-18 20:54:55 +00:00
biondizzle	450793311c	Wire CuTeDSL kernels into vLLM: replace all BF16 dequant with native NVFP4 - CuTeDSLNvfp4Method: custom quant method that creates CuTeDSL runners during process_weights_after_loading, then swaps to CuTeDSLNvfp4LinearMethod for forward dispatch - Attention projections (fused_wqa_wkv, wq_b, wo_b) now route through CuTeDSLNvfp4Linear (cosine 0.992-0.996 vs BF16 reference) - Shared expert now uses CuTeDSLSharedExpertRunner (cosine 0.992 vs BF16) with monkey-patched forward for fused L1+SiLU+L2 pipeline - Deleted all BF16 dequant code (_dequant_nvfp4_to_bf16, _post_quant_fix, input_scale fixes) - Deleted _post_quant_fix hook from utils.py - Fixed SwiGLU clamp: gate clamped BEFORE SiLU (matching SiluAndMulWithClamp) - Cleaned up all debug prints - Updated Dockerfile with new kernel files	2026-05-18 20:27:42 +00:00
biondizzle	70f50a1ec6	Fix scale assembly: use correctly-sized temp buffer for swizzle	2026-05-18 20:09:50 +00:00
biondizzle	97bdd604e9	Fix scale assembly: reshape swizzled output to 2D	2026-05-18 20:09:19 +00:00
biondizzle	c1aa4af123	Shared expert: dedicated CuTeDSL runner with proper scale assembly - CuTeDSLSharedExpertRunner: num_groups=1 GEMM, no scatter/routing - _assemble_scales_single_group: pad to 128 rows + Blackwell swizzle - All buffers pre-allocated for cudagraph compatibility - Updated test to use dedicated runner instead of MoE runner hack	2026-05-18 20:08:34 +00:00
biondizzle	e8b289e30d	WIP: CuTeDSL shared expert kernel Dedicated runner (shared_expert_pipeline.py) and test (test_shared_expert.py). Tried reusing MoE runner with 1 expert — fails because MoE runner assumes hidden_size != HC_DIM for scatter. Need dedicated runner with correct scale assembly. Will continue tomorrow.	2026-05-18 20:02:19 +00:00

8 Commits