nvfp4-megamoe-kernel

Author	SHA1	Message	Date
biondizzle	f5ce728ef2	Fix OOM: add --max-model-len=876544 + revert CPU dummy weight The CPU dummy weight broke torch.mm(compressor.weight.T) which expects GPU tensors. Instead, reduce max_model_len to fit KV cache within available memory (876544 instead of 1048576).	2026-05-19 07:35:43 +00:00
biondizzle	79a41d9197	Save ~5-8 GiB GPU VRAM: move dummy weight to CPU The CuTeDSL kernel never reads layer.weight; it uses the runner's pre-processed fp4/sf/gs tensors. The dummy BF16 weight exists only for vLLM model introspection. Moving it to CPU saves massive VRAM: - q_b_proj alone: 65536*1536*2 bytes = 192 MiB on GPU → ~0 MiB - All layers combined: ~5-8 GiB saved This should fix the KV cache OOM (needed 10.28 GiB, had 9.36 GiB).	2026-05-19 07:29:38 +00:00
biondizzle	cebc586014	Fix OOM: use 1-token warmup sample + free immediately 8 tokens * 7168 hidden * ~40 NVFP4 layers = ~2.3 MiB per layer * 40 = 92 MiB But the dummy weight param (out_features * in_features * 2 bytes BF16) was the real killer — each layer allocated a BF16 dummy of its full weight shape. With 1 token the warmup still gets a valid gs, and empty_cache frees the sample tensor before KV cache allocation.	2026-05-19 07:28:57 +00:00
biondizzle	6e6f95dfa8	FIX: Use warmup-based activation global scale in CuTeDSL linear kernel The checkpoint's input_scale is a calibration-time value that doesn't match what quantize_activation_nvfp4 expects at runtime. Using it as the activation global scale produces garbage output (empty EOS tokens). The fix: run a warmup forward pass with sample data and compute the activation global scale from the actual activation distribution, exactly like our standalone test does (which passes with cosine >= 0.994). This is the root cause of the vLLM server returning empty content.	2026-05-19 07:21:07 +00:00
biondizzle	ffc2264c41	Fix activation global scale: don't double-invert input_global_scale_inv The activation global scale = amax / (6.0 * 448.0). Both the linear kernel and MoE kernel were taking 1.0 / (value that's already the correct gs), inverting it and producing garbage quantization. Linear kernel: input_global_scale_inv IS the gs, so use it directly. MoE kernel: w13_input_scale_orig (after undoing convert inversion) IS the gs, so use it directly.	2026-05-19 06:03:08 +00:00
biondizzle	35fab6cff3	Replace autograd.Function with torch.library.custom_op for Dynamo compat Dynamo (torch.compile fullgraph) cannot trace through CuTeDSL internals (cute.compile, JIT, etc.). The autograd.Function approach was unreliable with fullgraph mode — Dynamo would still try to trace through it. Fix: torch.library.custom_op makes Dynamo treat our GEMM as an opaque black box. No reimplementing the kernel — just route through the existing runner via a registry pattern: - Runners registered in global dict with integer IDs - Custom op takes (tensors, runner_id, shape_hint) -> tensor - Dynamo calls fake impl for shape inference, never touches the runner - At execution time, real impl looks up runner and calls _run_impl Changes: - New: cutedsl/custom_ops.py (custom op definitions + registry) - New: tests/test_custom_op.py (local unit tests, no GPU needed) - Removed: _Nvfp4LinearApply, _MoEApply (autograd.Function classes) - Updated: nvfp4_linear.py, runner.py, cutedsl.py, nvfp4_cutedsl.py to use custom ops instead of autograd.Function - Updated: cutedsl_quant_method.py to use custom op + registry	2026-05-19 01:54:48 +00:00
biondizzle	98153002c0	Go back to torch.library.custom_op with correct GEMM impl allow_in_graph doesn't work — Dynamo can't create proxies for Python objects (the runner). The custom op approach requires only tensor args. This time the GEMM impl correctly: - Uses quantize_activation_nvfp4 for activation quantization - Pads x_fp4 via uint8 + view(float4) for torch.zeros compat - Assembles A-side scales with pad + swizzle - Uses int32 expert_offsets (CuTeDSL requirement) - Passes runner's pre-assembled mat_b, scale_b, gsb tensors	2026-05-19 01:24:41 +00:00
biondizzle	02c500bbb1	Switch to allow_in_graph for Dynamo opacity instead of custom op The custom op approach required reimplementing the GEMM (wrong scale assembly, wrong tensor formats, cudaErrorIllegalAddress). Instead, use torch.autograd.Function + torch._dynamo.allow_in_graph which tells Dynamo to treat the function as an opaque kernel call, while still using the runner's battle-tested _run_impl for the actual GEMM. allow_in_graph is the proper way to register opaque ops for Dynamo without reimplementing the computation.	2026-05-19 01:20:07 +00:00
biondizzle	581d87f9a6	Remove warmup forward from process_weights_after_loading The warmup custom op call hit cudaErrorIllegalAddress because our custom op GEMM implementation doesn't match the runner's call convention. Skip warmup for now — MoE kernel warmup handles CuTeDSL JIT cleanup.	2026-05-19 01:18:54 +00:00
biondizzle	5d49849156	Fix: torch.zeros doesn't support Float4_e2m1fn_x2 dtype Allocate as uint8 then view as float4_e2m1fn_x2 for padding buffer.	2026-05-19 01:15:24 +00:00
biondizzle	e1fcfc4f01	Add CuTeDSL warmup + CUDA sync after JIT compilation CuTeDSL cute.compile corrupts GPU memory. Add warmup forward + torch.cuda.synchronize() + health check after finalize_weights, matching the MoE runner pattern.	2026-05-19 01:11:44 +00:00
biondizzle	1d9c0f996c	Fix expert_offsets dtype: CuTeDSL expects int32 not int64 The DSLRuntimeError 'prev_off is Int32...update to Int64 inside if' was caused by passing int64 expert_offsets when the kernel expects int32.	2026-05-19 01:05:20 +00:00
biondizzle	b81200f427	Fix CuTeDSL NVFP4 linear: correct scale assembly in custom op - pad_and_swizzle_single takes 1 arg (2D tensor), not 4 - Inline the scale assembly logic: pad x_sf → swizzle → unsqueeze for 1 group - Remove unused CuTeDSLNvfp4Linear import from custom op impl	2026-05-19 01:01:42 +00:00
biondizzle	e0eb436914	Fix custom_op registration: use as decorator with proper type hints	2026-05-19 00:54:30 +00:00
biondizzle	c609e9ba3c	Use torch.library.custom_op for CuTeDSL NVFP4 linear GEMM Dynamo in fullgraph mode traces through torch.autograd.Function, hitting CuTeDSL JIT internals (Path.cwd) and crashing. Registering as a custom op makes it opaque to Dynamo — tracing calls the fake impl, real impl only runs during inference. Custom op: cutedsl::nvfp4_gemm(x, mat_b, scale_b, global_scale_b, in_features, out_features, activation_global_scale) -> Tensor Store finalized weight tensors on the layer (from runner._mat_b etc.) instead of the runner object, since custom ops can only accept tensors.	2026-05-19 00:50:43 +00:00
biondizzle	c043a11bcc	Register CuTeDSL as proper NvFp4LinearKernel for NVFP4 linear layers - Create CuTeDSLNvFp4LinearKernel extending NvFp4LinearKernel base class - Register it via init_nvfp4_linear_kernel() selection mechanism (inserted at top of _POSSIBLE_NVFP4_KERNELS, before FlashInfer) - process_weights_after_loading: uint8→FP4, permute, create CuTeDSL runner - apply_weights: route through CuTeDSL GEMM - Update Dockerfile: copy kernel + registration script - Fix attention: always use forward() for quantized compressor/indexer layers (dtype check was fragile after kernel swaps weights to dummy BF16)	2026-05-19 00:44:44 +00:00
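
The sizing and scaling arithmetic quoted in commits 79a41d9197, cebc586014 and ffc2264c41 can be checked with a few lines of plain Python. This is a minimal sketch, not code from the repository: the helper names are hypothetical, and the split of the q_b_proj shape into 65536 and 1536 (and which of the two is out_features) is an assumption read from the commit message. The constants 6.0 and 448.0 are taken directly from the commit text and correspond to the largest finite FP4 E2M1 and FP8 E4M3 values.

```python
# Hypothetical helpers that reproduce the arithmetic in the commit messages.
BF16_BYTES = 2          # bytes per BF16 element
MIB = 1024 ** 2         # bytes per MiB


def dummy_weight_mib(out_features: int, in_features: int) -> float:
    """Size of a full-shape BF16 dummy weight: out * in * 2 bytes, in MiB."""
    return out_features * in_features * BF16_BYTES / MIB


def activation_global_scale(amax: float) -> float:
    """Global scale as stated in commit ffc2264c41: amax / (6.0 * 448.0)."""
    return amax / (6.0 * 448.0)


# Assumed q_b_proj shape (65536 x 1536), matching the 192 MiB figure.
q_b_proj_mib = dummy_weight_mib(65536, 1536)
print(f"q_b_proj dummy weight: {q_b_proj_mib} MiB")

# An activation whose absolute maximum is 6.0 * 448.0 gets a scale of 1.0.
print(f"global scale for amax=2688.0: {activation_global_scale(2688.0)}")
```

Because the scale is already amax / (6.0 * 448.0), taking its reciprocal a second time, as described in ffc2264c41, would yield a value far from the intended one.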

16 Commits
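
The registry pattern described in commit 35fab6cff3 (runners stored in a global dict under integer IDs, a fake implementation used only for shape inference, and a real implementation that looks the runner up at execution time) can be sketched without PyTorch. Everything below is a hypothetical stand-in: `ToyRunner`, `register_runner`, `fake_impl` and `real_impl` are illustrative names, not the contents of cutedsl/custom_ops.py, and the real code would register these functions through torch.library.custom_op and operate on tensors.

```python
from itertools import count

# Global registry: integer ID -> runner object (hypothetical layout).
_RUNNERS = {}
_ids = count()


def register_runner(runner) -> int:
    """Store a runner and return the integer ID that the op receives."""
    runner_id = next(_ids)
    _RUNNERS[runner_id] = runner
    return runner_id


def fake_impl(x_shape: tuple, out_features: int) -> tuple:
    """Shape inference only: computes the output shape, never touches a runner."""
    return (*x_shape[:-1], out_features)


def real_impl(x: list, runner_id: int) -> list:
    """Execution path: look up the runner by ID and delegate to it."""
    return _RUNNERS[runner_id](x)


class ToyRunner:
    """Stand-in for a kernel runner; it just scales its input."""

    def __init__(self, scale: float):
        self.scale = scale

    def __call__(self, x: list) -> list:
        return [v * self.scale for v in x]


rid = register_runner(ToyRunner(2.0))
out = real_impl([1.0, 2.0], rid)
shape = fake_impl((4, 7168), 1536)
print(rid, out, shape)
```

Passing an integer ID instead of the runner object keeps the op's arguments to plain values, which is the constraint commit 98153002c0 cites as the reason allow_in_graph did not work.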