nvfp4-megamoe-kernel

Files

biondizzle cf2b7ab7ec feat: NVFP4 gate projection for router (replaces BF16 cuBLAS)

The dense router now uses NVFP4 GEMM via Nvfp4Linear for the gate
projection when NVFP4 scales are available in the checkpoint. This
replaces the BF16 cuBLAS GEMM with Blackwell SM100 tensor-core
NVFP4 acceleration.

Changes:
- dsv4/layers/router.py: add gate_lin (Nvfp4Linear) alongside W_gate
  fallback. New load_nvfp4_gate() method.
- dsv4/kernels/router/dense_router_decode.py: add
  dense_router_dispatch_nvfp4() using Nvfp4Linear + activation_topk
- dsv4/kernels/router/__init__.py: export new function
- single_shot_inference.py: load NVFP4 gate weights when available,
  fall back to BF16 when not
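
The fallback behaviour described in the changes above can be sketched roughly as follows. This is a minimal pure-Python illustration, not the repository's code: `RouterSketch`, `matvec`, `topk`, and the checkpoint keys `gate.weight_packed` and `gate.weight_scale` are hypothetical names, and the "NVFP4" path is simulated by scaling plain floats rather than calling a real `Nvfp4Linear` GEMM.

```python
from typing import Callable, Dict, List, Optional, Sequence, Tuple


def matvec(weights: Sequence[Sequence[float]], x: Sequence[float]) -> List[float]:
    """Dense gate projection: one logit per expert (row of `weights`)."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def topk(scores: Sequence[float], k: int) -> List[Tuple[int, float]]:
    """Return the k highest-scoring (expert_index, score) pairs."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [(i, scores[i]) for i in order[:k]]


class RouterSketch:
    """Hypothetical stand-in for the router layer (assumption, not the repo class)."""

    def __init__(self, w_gate: List[List[float]]) -> None:
        self.w_gate = w_gate  # BF16 fallback weights, always present
        # Set only when quantized gate weights were found in the checkpoint.
        self.gate_lin: Optional[Callable[[Sequence[float]], List[float]]] = None

    def load_nvfp4_gate(self, checkpoint: Dict[str, list]) -> bool:
        """Enable the quantized path if both weights and scales are present."""
        packed = checkpoint.get("gate.weight_packed")  # hypothetical key
        scales = checkpoint.get("gate.weight_scale")   # hypothetical key
        if packed is None or scales is None:
            return False  # keep using the BF16 fallback
        # Stand-in for a quantized GEMM: apply one scale per expert row up front.
        dequant = [[q * s for q in row] for row, s in zip(packed, scales)]
        self.gate_lin = lambda x: matvec(dequant, x)
        return True

    def route(self, x: Sequence[float], k: int) -> Tuple[str, List[Tuple[int, float]]]:
        """Pick the gate path, then select the top-k experts."""
        if self.gate_lin is not None:
            return "nvfp4", topk(self.gate_lin(x), k)
        return "bf16", topk(matvec(self.w_gate, x), k)


router = RouterSketch([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
print(router.route([2.0, 1.0], k=2))  # BF16 path: no quantized gate loaded yet
router.load_nvfp4_gate({
    "gate.weight_packed": [[2, 0], [0, 2], [2, 2]],
    "gate.weight_scale": [0.5, 0.5, 0.5],
})
print(router.route([2.0, 1.0], k=2))  # quantized path after loading
```

Keeping the BF16 weights on the object and switching on whether `gate_lin` is set means that a checkpoint without NVFP4 scales takes the original path unchanged.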

2026-06-01 05:58:56 +00:00

cache

E1: Wire LayerCacheHandle gather methods + CUDA gather kernels

2026-05-30 21:09:21 +00:00

kernels

feat: NVFP4 gate projection for router (replaces BF16 cuBLAS)

2026-06-01 05:58:56 +00:00

layers

feat: NVFP4 gate projection for router (replaces BF16 cuBLAS)

2026-06-01 05:58:56 +00:00

loader

Restructure: cutedsl/ -> dsv4/ with proper layering