nvfp4-megamoe-kernel

Files

biondizzle a434545d12 Blackwell swizzle CUDA kernel for CUDA graph capture

Python view operations (reshape, transpose, permute) are not
graph-capturable — they cause cudaErrorStreamCaptureUnsupported.

Added:
- dsv4/kernels/cuda/blackwell_swizzle.cu: custom CUDA kernel for 32_4_4 swizzle
- to_blocked(): detects graph capture, uses CUDA kernel instead of Python views
- MoE _assemble_scales_cudagraph_safe: same treatment
- Shared expert _assemble_scales_single_group: same treatment
- Linear _assemble_scales_single_group: same treatment
- Pre-allocated swizzled output buffers for all layers (avoids torch.empty_like)

The CUDA kernel writes to a pre-allocated buffer — no per-step allocations.
Eager path unchanged (still uses fast Python view operations).

2026-06-04 03:03:02 +00:00
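
The index arithmetic behind the swizzle in the commit above can be sketched in plain Python. This is a minimal illustration, not the contents of blackwell_swizzle.cu: it assumes "32_4_4" refers to the 128x4 scale-factor tile layout commonly used for Blackwell block-scaled GEMMs, where each tile is viewed as (4, 32, 4), its first two axes are transposed, and the result is flattened to 512 contiguous entries. The names `swizzled_offset` and `swizzle_scales` are hypothetical and do not come from the repository.

```python
def ceil_div(a: int, b: int) -> int:
    """Integer division rounding up."""
    return (a + b - 1) // b


def swizzled_offset(row: int, col: int, n_col_blocks: int) -> int:
    """Map (row, col) of a row-major scale matrix to its flat swizzled offset.

    Assumed layout: 128x4 tiles stored tile by tile (row-major over tiles);
    inside a tile, the entry at (row_in, col_in) lands at
    (row_in % 32) * 16 + (row_in // 32) * 4 + col_in.
    """
    row_block, row_in = divmod(row, 128)
    col_block, col_in = divmod(col, 4)
    tile = row_block * n_col_blocks + col_block
    return tile * 512 + (row_in % 32) * 16 + (row_in // 32) * 4 + col_in


def swizzle_scales(scales, rows: int, cols: int, out=None):
    """Scatter a flat row-major scale list into the swizzled layout.

    `out` plays the role of a pre-allocated buffer: when supplied it is
    zeroed and reused, so the call itself allocates nothing new.
    Padding entries (rows/cols rounded up to 128/4) stay zero.
    """
    n_row_blocks = ceil_div(rows, 128)
    n_col_blocks = ceil_div(cols, 4)
    size = n_row_blocks * n_col_blocks * 512
    if out is None:
        out = [0] * size
    else:
        assert len(out) == size, "pre-allocated buffer has the wrong size"
        for i in range(size):
            out[i] = 0
    for r in range(rows):
        for c in range(cols):
            out[swizzled_offset(r, c, n_col_blocks)] = scales[r * cols + c]
    return out
```

Under the same assumption, a CUDA kernel could assign one thread per source element, compute the same offset, and write directly into the pre-allocated output buffer, which would leave nothing to allocate or reshape on the host while a graph is being captured.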

__init__.py

Restructure: cutedsl/ -> dsv4/ with proper layering

2026-05-21 17:30:44 +00:00

dense.py

Restructure: cutedsl/ -> dsv4/ with proper layering

2026-05-21 17:30:44 +00:00

fp4_quant.py

NVFP4-1.1: Mark fp4_quant.py as toolchain-blocked, clean up test files

2026-05-28 04:59:01 +00:00

fused_swiglu.py

fix: use cute.where() directly for clamp in fused SwiGLU