nvfp4-megamoe-kernel

Author	SHA1	Message	Date
biondizzle	35fab6cff3	Replace autograd.Function with torch.library.custom_op for Dynamo compat. Dynamo (torch.compile fullgraph) cannot trace through CuTeDSL internals (cute.compile, JIT, etc.). The autograd.Function approach was unreliable with fullgraph mode: Dynamo would still try to trace through it. Fix: torch.library.custom_op makes Dynamo treat our GEMM as an opaque black box. The kernel is not reimplemented; calls are routed through the existing runner via a registry pattern: (1) runners are registered in a global dict with integer IDs; (2) the custom op takes (tensors, runner_id, shape_hint) -> tensor; (3) Dynamo calls the fake impl for shape inference and never touches the runner; (4) at execution time, the real impl looks up the runner and calls _run_impl. Changes: New: cutedsl/custom_ops.py (custom op definitions + registry); New: tests/test_custom_op.py (local unit tests, no GPU needed); Removed: _Nvfp4LinearApply, _MoEApply (autograd.Function classes); Updated: nvfp4_linear.py, runner.py, cutedsl.py, nvfp4_cutedsl.py to use custom ops instead of autograd.Function; Updated: cutedsl_quant_method.py to use custom op + registry.	2026-05-19 01:54:48 +00:00
biondizzle	98153002c0	Go back to torch.library.custom_op with correct GEMM impl. allow_in_graph doesn't work: Dynamo can't create proxies for Python objects (the runner). The custom op approach requires only tensor args. This time the GEMM impl correctly: uses quantize_activation_nvfp4 for activation quantization; pads x_fp4 via uint8 + view(float4) for torch.zeros compat; assembles A-side scales with pad + swizzle; uses int32 expert_offsets (CuTeDSL requirement); passes the runner's pre-assembled mat_b, scale_b, gsb tensors.	2026-05-19 01:24:41 +00:00
biondizzle	02c500bbb1	Switch to allow_in_graph for Dynamo opacity instead of custom op. The custom op approach required reimplementing the GEMM (wrong scale assembly, wrong tensor formats, cudaErrorIllegalAddress). Instead, use torch.autograd.Function + torch._dynamo.allow_in_graph, which tells Dynamo to treat the function as an opaque kernel call while still using the runner's battle-tested _run_impl for the actual GEMM. allow_in_graph is the proper way to register opaque ops for Dynamo without reimplementing the computation.	2026-05-19 01:20:07 +00:00
biondizzle	581d87f9a6	Remove warmup forward from process_weights_after_loading. The warmup custom op call hit cudaErrorIllegalAddress because our custom op GEMM implementation doesn't match the runner's call convention. Skip warmup for now; MoE kernel warmup handles CuTeDSL JIT cleanup.	2026-05-19 01:18:54 +00:00
biondizzle	5d49849156	Fix: torch.zeros doesn't support Float4_e2m1fn_x2 dtype. Allocate as uint8, then view as float4_e2m1fn_x2 for the padding buffer.	2026-05-19 01:15:24 +00:00
biondizzle	e1fcfc4f01	Add CuTeDSL warmup + CUDA sync after JIT compilation. CuTeDSL cute.compile corrupts GPU memory. Add warmup forward + torch.cuda.synchronize() + health check after finalize_weights, matching the MoE runner pattern.	2026-05-19 01:11:44 +00:00
biondizzle	1d9c0f996c	Fix expert_offsets dtype: CuTeDSL expects int32, not int64. The DSLRuntimeError 'prev_off is Int32...update to Int64 inside if' was caused by passing int64 expert_offsets when the kernel expects int32.	2026-05-19 01:05:20 +00:00
biondizzle	b81200f427	Fix CuTeDSL NVFP4 linear: correct scale assembly in custom op. pad_and_swizzle_single takes 1 arg (2D tensor), not 4; inline the scale assembly logic: pad x_sf → swizzle → unsqueeze for 1 group; remove unused CuTeDSLNvfp4Linear import from custom op impl.	2026-05-19 01:01:42 +00:00
biondizzle	e0eb436914	Fix custom_op registration: use as decorator with proper type hints	2026-05-19 00:54:30 +00:00
biondizzle	c609e9ba3c	Use torch.library.custom_op for CuTeDSL NVFP4 linear GEMM. Dynamo in fullgraph mode traces through torch.autograd.Function, hitting CuTeDSL JIT internals (Path.cwd) and crashing. Registering as a custom op makes it opaque to Dynamo: tracing calls the fake impl, and the real impl only runs during inference. Custom op: cutedsl::nvfp4_gemm(x, mat_b, scale_b, global_scale_b, in_features, out_features, activation_global_scale) -> Tensor. Store finalized weight tensors on the layer (from runner._mat_b etc.) instead of the runner object, since custom ops can only accept tensors.	2026-05-19 00:50:43 +00:00
biondizzle	c043a11bcc	Register CuTeDSL as proper NvFp4LinearKernel for NVFP4 linear layers. Create CuTeDSLNvFp4LinearKernel extending NvFp4LinearKernel base class; register it via init_nvfp4_linear_kernel() selection mechanism (inserted at top of _POSSIBLE_NVFP4_KERNELS, before FlashInfer); process_weights_after_loading: uint8→FP4, permute, create CuTeDSL runner; apply_weights: route through CuTeDSL GEMM; update Dockerfile: copy kernel + registration script; fix attention: always use forward() for quantized compressor/indexer layers (dtype check was fragile after kernel swaps weights to dummy BF16).	2026-05-19 00:44:44 +00:00
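
The registry pattern described in commit 35fab6cff3 (runners kept in a global dict under integer IDs, with the op receiving only the ID) can be sketched in plain Python. This is an illustration under assumptions, not the code in cutedsl/custom_ops.py: `register_runner`, `get_runner`, `dispatch`, and `EchoRunner` are hypothetical names, and only `_run_impl` comes from the commit message.

```python
import itertools
from typing import Any, Dict

# Hypothetical module-level registry: integer ID -> runner object.
_RUNNERS: Dict[int, Any] = {}
_next_id = itertools.count()


def register_runner(runner: Any) -> int:
    """Store a runner and return the integer ID that stands in for it."""
    runner_id = next(_next_id)
    _RUNNERS[runner_id] = runner
    return runner_id


def get_runner(runner_id: int) -> Any:
    """Look up a runner at execution time; fail loudly on an unknown ID."""
    try:
        return _RUNNERS[runner_id]
    except KeyError:
        raise KeyError(f"no runner registered under id {runner_id}") from None


def dispatch(x: Any, runner_id: int) -> Any:
    """Stand-in for a real op impl: resolve the ID, then call _run_impl."""
    return get_runner(runner_id)._run_impl(x)


class EchoRunner:
    """Toy runner used only for illustration; it scales a list of numbers."""

    def __init__(self, scale: float) -> None:
        self.scale = scale

    def _run_impl(self, x):
        return [v * self.scale for v in x]


rid = register_runner(EchoRunner(2.0))
result = dispatch([1.0, 2.5], rid)
```

Passing an integer instead of the runner keeps the call signature free of arbitrary Python objects, which is the constraint the commits above describe for custom ops.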

11 Commits