nvfp4-megamoe-kernel

Files

biondizzle a9ea30353c CUDA graph: Fix sync violations (Category 1-2)

1. mhc.py: Remove .item() from post_block (122 syncs/step eliminated)
   - X_next.abs().max().item() forced a host sync in EVERY layer's post_block
   - Diagnostics moved to caller (outside graph region)

2. linear.py: Pre-allocate _scale_a_buf in _ensure_buffer_size
   - _assemble_scales_single_group now uses pre-allocated buffer
   - Eliminates per-call torch.zeros() allocation (graph capture killer)

3. shared_expert.py: Same fix — use pre-allocated padded_x_sf_buf
   - _assemble_scales_single_group no longer allocates

4. quantize.py: Remove .contiguous() from gsa expand
   - expand() creates a stride-0 view, which the CUDA kernel reads correctly
   - No allocation on the hot path

5. Add CUDA_GRAPH_SYNC_INVENTORY.md with full violation catalog
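
The three patterns behind fixes 1-4 can be sketched as follows. This is a minimal illustration, not the repository's code: the class ScaleAssembler and the functions post_block_stat and expand_gsa are hypothetical names, and the sketch runs on CPU so it only demonstrates the allocation and sync behaviour, not graph capture itself.

```python
import torch


class ScaleAssembler:
    """Hypothetical stand-in for the pre-allocated scale buffer (fixes 2-3)."""

    def __init__(self):
        self._scale_buf = None  # allocated outside the graph-captured region

    def ensure_buffer_size(self, rows, cols, device="cpu"):
        # Allocate (or grow) once, before capture; never on the hot path.
        buf = self._scale_buf
        if buf is None or buf.shape[0] < rows or buf.shape[1] < cols:
            self._scale_buf = torch.zeros(rows, cols, device=device)

    def assemble(self, scales):
        # Write into a view of the existing buffer instead of torch.zeros().
        rows, cols = scales.shape
        out = self._scale_buf[:rows, :cols]
        out.copy_(scales)
        return out


def post_block_stat(x_next):
    # Fix 1: return a 0-dim tensor; calling .item() here would force a
    # device-to-host sync on every layer.
    return x_next.abs().max()


def expand_gsa(gsa, n):
    # Fix 4: expand() returns a stride-0 view over the same storage;
    # adding .contiguous() would materialize a new n-element tensor.
    return gsa.expand(n)


assembler = ScaleAssembler()
assembler.ensure_buffer_size(4, 8)
out = assembler.assemble(torch.ones(2, 3))

stat = post_block_stat(torch.tensor([[-3.0, 2.0], [1.0, 0.5]]))
peak = stat.item()  # diagnostics read by the caller, outside the graph region

gsa = torch.tensor([0.5])
view = expand_gsa(gsa, 8)
```

In this sketch the caller decides when to pay for the host sync, which matches the commit's choice to move diagnostics outside the graph region.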

2026-06-03 16:37:20 +00:00

_archive

Cleanup Step 2: Archive Lineage P code, fix broken imports

2026-06-02 19:27:07 +00:00

cache

Cleanup Step 2: Archive Lineage P code, fix broken imports

2026-06-02 19:27:07 +00:00

kernels

Fix compressor: do not add positional bias to KV content

2026-06-03 15:52:00 +00:00

layers

CUDA graph: Fix sync violations (Category 1-2)