Commit Graph

11 Commits

Author SHA1 Message Date
35fab6cff3 Replace autograd.Function with torch.library.custom_op for Dynamo compat
Dynamo (torch.compile fullgraph) cannot trace through CuTeDSL internals
(cute.compile, JIT, etc.). The autograd.Function approach was unreliable
with fullgraph mode — Dynamo would still try to trace through it.

Fix: torch.library.custom_op makes Dynamo treat our GEMM as an opaque
black box. No reimplementing the kernel — just route through the existing
runner via a registry pattern:
  - Runners registered in global dict with integer IDs
  - Custom op takes (tensors, runner_id, shape_hint) -> tensor
  - Dynamo calls fake impl for shape inference, never touches the runner
  - At execution time, real impl looks up runner and calls _run_impl

Changes:
  - New: cutedsl/custom_ops.py (custom op definitions + registry)
  - New: tests/test_custom_op.py (local unit tests, no GPU needed)
  - Removed: _Nvfp4LinearApply, _MoEApply (autograd.Function classes)
  - Updated: nvfp4_linear.py, runner.py, cutedsl.py, nvfp4_cutedsl.py
    to use custom ops instead of autograd.Function
  - Updated: cutedsl_quant_method.py to use custom op + registry
2026-05-19 01:54:48 +00:00
98153002c0 Go back to torch.library.custom_op with correct GEMM impl
allow_in_graph doesn't work — Dynamo can't create proxies for Python
objects (the runner). The custom op approach requires only tensor args.

This time the GEMM impl correctly:
- Uses quantize_activation_nvfp4 for activation quantization
- Pads x_fp4 via uint8 + view(float4) for torch.zeros compat
- Assembles A-side scales with pad + swizzle
- Uses int32 expert_offsets (CuTeDSL requirement)
- Passes runner's pre-assembled mat_b, scale_b, gsb tensors
2026-05-19 01:24:41 +00:00
02c500bbb1 Switch to allow_in_graph for Dynamo opacity instead of custom op
The custom op approach required reimplementing the GEMM (wrong scale
assembly, wrong tensor formats, cudaErrorIllegalAddress). Instead,
use torch.autograd.Function + torch._dynamo.allow_in_graph which
tells Dynamo to treat the function as an opaque kernel call, while
still using the runner's battle-tested _run_impl for the actual GEMM.

allow_in_graph is the proper way to register opaque ops for Dynamo
without reimplementing the computation.
2026-05-19 01:20:07 +00:00
581d87f9a6 Remove warmup forward from process_weights_after_loading
The warmup custom op call hit cudaErrorIllegalAddress because our
custom op GEMM implementation doesn't match the runner's call convention.
Skip warmup for now — MoE kernel warmup handles CuTeDSL JIT cleanup.
2026-05-19 01:18:54 +00:00
5d49849156 Fix: torch.zeros doesn't support Float4_e2m1fn_x2 dtype
Allocate as uint8 then view as float4_e2m1fn_x2 for padding buffer.
2026-05-19 01:15:24 +00:00
e1fcfc4f01 Add CuTeDSL warmup + CUDA sync after JIT compilation
CuTeDSL cute.compile corrupts GPU memory. Add warmup forward +
torch.cuda.synchronize() + health check after finalize_weights,
matching the MoE runner pattern.
2026-05-19 01:11:44 +00:00
1d9c0f996c Fix expert_offsets dtype: CuTeDSL expects int32 not int64
The DSLRuntimeError 'prev_off is Int32...update to Int64 inside if' was
caused by passing int64 expert_offsets when the kernel expects int32.
2026-05-19 01:05:20 +00:00
b81200f427 Fix CuTeDSL NVFP4 linear: correct scale assembly in custom op
- pad_and_swizzle_single takes 1 arg (2D tensor), not 4
- Inline the scale assembly logic: pad x_sf → swizzle → unsqueeze for 1 group
- Remove unused CuTeDSLNvfp4Linear import from custom op impl
2026-05-19 01:01:42 +00:00
e0eb436914 Fix custom_op registration: use as decorator with proper type hints 2026-05-19 00:54:30 +00:00
c609e9ba3c Use torch.library.custom_op for CuTeDSL NVFP4 linear GEMM
Dynamo in fullgraph mode traces through torch.autograd.Function, hitting
CuTeDSL JIT internals (Path.cwd) and crashing. Registering as a custom op
makes it opaque to Dynamo — tracing calls the fake impl, real impl only
runs during inference.

Custom op: cutedsl::nvfp4_gemm(x, mat_b, scale_b, global_scale_b,
    in_features, out_features, activation_global_scale) -> Tensor

Store finalized weight tensors on the layer (from runner._mat_b etc.)
instead of the runner object, since custom ops can only accept tensors.
2026-05-19 00:50:43 +00:00
c043a11bcc Register CuTeDSL as proper NvFp4LinearKernel for NVFP4 linear layers
- Create CuTeDSLNvFp4LinearKernel extending NvFp4LinearKernel base class
- Register it via init_nvfp4_linear_kernel() selection mechanism
  (inserted at top of _POSSIBLE_NVFP4_KERNELS, before FlashInfer)
- process_weights_after_loading: uint8→FP4, permute, create CuTeDSL runner
- apply_weights: route through CuTeDSL GEMM
- Update Dockerfile: copy kernel + registration script
- Fix attention: always use forward() for quantized compressor/indexer
  layers (dtype check was fragile after kernel swaps weights to dummy BF16)
2026-05-19 00:44:44 +00:00