nvfp4-megamoe-kernel

Files

biondizzle 31ebe4f2db Wire NVFP4 fused router kernel into e2e single-shot pipeline

- Add dense_router_dispatch_nvfp4_fused() in dense_router_decode.py:
  single-kernel NVFP4 blockscaled GEMM + fused router epilogue
- Router.load_nvfp4_fused_gate(): stores raw NVFP4 tensors for fused path
- Router._run_dense_impl() dispatch priority: fused > 2-kernel > BF16
- single_shot_inference.py: loads raw NVFP4 gate weights for fused kernel
  instead of building Nvfp4Linear (which was the 2-kernel path)
- Fix selection sort bug in nvfp4_fused_router_kernel.py: pass 0 was
  missing the t_s/t_i/t_a temp save before the swap, leaving those vars undefined
- Export dense_router_dispatch_nvfp4_fused from __init__.py
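
The dispatch priority named above (fused > 2-kernel > BF16) can be sketched as a simple fallback chain. This is a minimal illustration, not the repository's code: `RouterSketch`, its attribute names, and `select_path()` are hypothetical, and the real `Router._run_dense_impl()` may decide differently.

```python
from typing import Optional


class RouterSketch:
    """Hypothetical stand-in for Router; it only models which path is chosen."""

    def __init__(self) -> None:
        # Each slot is None until the matching weights are loaded.
        # All three attribute names are assumptions made for this sketch.
        self.fused_gate: Optional[object] = None    # raw NVFP4 tensors, fused path
        self.nvfp4_linear: Optional[object] = None  # 2-kernel NVFP4 path
        self.bf16_weight: Optional[object] = None   # BF16 fallback

    def select_path(self) -> str:
        """Return the highest-priority path whose weights are loaded."""
        if self.fused_gate is not None:
            return "fused"
        if self.nvfp4_linear is not None:
            return "2-kernel"
        if self.bf16_weight is not None:
            return "bf16"
        raise RuntimeError("no gate weights loaded")
```

Under this sketch, loading the raw NVFP4 gate tensors is enough to switch a router to the fused path even when the 2-kernel or BF16 weights are also present.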

2026-06-01 09:47:48 +00:00

__init__.py

Restructure: cutedsl/ -> dsv4/ with proper layering

2026-05-21 17:30:44 +00:00

attention.py

E2/E3: compressor bridge, indexer bridge, flush pipeline wiring

2026-05-30 21:16:54 +00:00

embedding.py

Restructure: cutedsl/ -> dsv4/ with proper layering

2026-05-21 17:30:44 +00:00

ffn.py

Layer dispatch: config, schedule, attention/FFN sub-blocks, TransformerLayer