nvfp4-megamoe-kernel

Files

biondizzle 5e09be08af Fix non-contiguous tensor in quantize_nvfp4_gpu_fused (T>1 prefill)

The intermediate tensor from fused SwiGLU deinterleave is a column slice
(non-contiguous). When T>1, quantize_nvfp4_gpu_fused receives this and
the CUDA kernel crashes with 'input must be contiguous'.

Fix: add an is_contiguous() check and a .contiguous() call in
quantize_nvfp4_gpu_fused and in SharedExpert._run_l2. This addresses the
root cause rather than working around it: CUDA kernels legitimately
require contiguous memory.

2026-06-03 07:56:19 +00:00
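
The failure mode described in the commit message can be reproduced on CPU with plain PyTorch. The sketch below is illustrative only and is not the repository's code: the helper name `ensure_contiguous`, the shapes `T` and `H`, and the slicing pattern `fused[:, :H]` are assumptions chosen to mimic a column slice of a fused SwiGLU output. It also shows one plausible reason the problem only appears for T>1: PyTorch ignores the stride of size-1 dimensions when deciding contiguity, so the same slice of a single-row tensor still reports as contiguous.

```python
import torch


def ensure_contiguous(x: torch.Tensor) -> torch.Tensor:
    """Hypothetical guard: return x as-is if contiguous, else a contiguous copy."""
    return x if x.is_contiguous() else x.contiguous()


# Assumed shapes for illustration: T tokens, fused projection of width 2*H.
T, H = 4, 8
fused = torch.randn(T, 2 * H)

# Column slice: a view that keeps the parent's row stride of 2*H.
gate = fused[:, :H]
print(gate.shape, gate.stride(), gate.is_contiguous())

# The guard copies the view into densely packed memory with the same values.
fixed = ensure_contiguous(gate)
print(fixed.stride(), fixed.is_contiguous())

# With a single row (T == 1) the same slice is reported as contiguous,
# because the stride of a size-1 dimension does not affect the check.
fused_single = torch.randn(1, 2 * H)
gate_single = fused_single[:, :H]
print(gate_single.is_contiguous())
```

The guard only copies when a copy is needed, so already-contiguous inputs pass through without extra memory traffic.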

_archive

Cleanup Step 2: Archive Lineage P code, fix broken imports

2026-06-02 19:27:07 +00:00

cache

Cleanup Step 2: Archive Lineage P code, fix broken imports

2026-06-02 19:27:07 +00:00

kernels

Wire prefill FMHA into production.py and single_shot

2026-06-03 03:49:57 +00:00

layers

Fix non-contiguous tensor in quantize_nvfp4_gpu_fused (T>1 prefill)