nvfp4-megamoe-kernel

Files

biondizzle 80bb27f5bf CUDA graph: Fix gsa broadcast — contiguous for prefill, reshape for decode

The stride-0 expand view of gsa_gpu caused an illegal memory access
in the quantize_nvfp4_from_buffer kernel. The CUDA kernel may not handle
stride-0 tensors correctly.

Fix:
- M=1 decode (graph-captured): reshape the scalar to (1,), which needs no allocation
- M>1 prefill (not graph-captured): expand + contiguous, where the allocation is acceptable
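
A rough illustration of the failure mode and the fix, written as a pure-Python toy model rather than the repository's actual code. The `View` class, `naive_kernel_read`, and `broadcast_gsa` are hypothetical names introduced here; the assumption (not confirmed by the commit) is that the kernel indexes its input as if it were densely packed, so a stride-0 view backed by a single element is read out of bounds.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class View:
    """Toy 1-D tensor view: a shared buffer plus a shape and a stride."""
    buffer: List[float]
    shape: Tuple[int]
    stride: Tuple[int]

    def expand(self, m):
        # Broadcast a 1-element view to length m without copying (stride 0).
        assert self.shape == (1,)
        return View(self.buffer, (m,), (0,))

    def contiguous(self):
        # Materialize the logical elements into a fresh, densely packed buffer.
        data = [self.buffer[i * self.stride[0]] for i in range(self.shape[0])]
        return View(data, self.shape, (1,))


def naive_kernel_read(view):
    """Hypothetical kernel that ignores strides and assumes dense storage."""
    return [view.buffer[i] for i in range(view.shape[0])]


def broadcast_gsa(scalar, m):
    """Sketch of the fix: shape (1,) for M=1, dense copy for M>1."""
    base = View([scalar], (1,), (1,))  # stands in for reshaping the scalar to (1,)
    if m == 1:
        return base  # decode path: nothing further is allocated
    return base.expand(m).contiguous()  # prefill path: a copy is acceptable


if __name__ == "__main__":
    gsa = View([0.5], (1,), (1,))
    try:
        naive_kernel_read(gsa.expand(4))  # stride-0 view, one backing element
    except IndexError:
        print("out-of-bounds read on stride-0 view")
    print(naive_kernel_read(broadcast_gsa(0.5, 4)))
```

In this model the stride-0 view reports length 4 while its buffer holds one element, which is the mismatch a stride-unaware kernel would trip over. The dense copy removes the mismatch at the cost of an allocation, which is why it is kept off the graph-captured decode path.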

2026-06-03 18:08:18 +00:00

_archive

Cleanup Step 2: Archive Lineage P code, fix broken imports

2026-06-02 19:27:07 +00:00

cache

Cleanup Step 2: Archive Lineage P code, fix broken imports

2026-06-02 19:27:07 +00:00

decode

CUDA graph: Fix remaining sync violations from B200 detector run 2

2026-06-03 17:20:34 +00:00

kernels

Fix compressor: do not add positional bias to KV content