nvfp4-megamoe-kernel

Files

biondizzle ae26f6b83c Fix dense router BF16 dispatch: use torch.matmul instead of F.linear

- F.linear(x, W) computes x @ W.T which caused shape mismatch when
  W_gate was pre-transposed to [E, H]
- Use torch.matmul(x, W_gate) instead — computes x @ W directly, no
  transpose needed, no FP32 conversion, fully graph-capturable
- W_gate stays as [H, E] (original checkpoint shape)
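
The shape convention behind this fix can be sketched as below. This is a minimal illustration, not the repository's router code: the sizes `T`, `H`, `E` and the random tensors are hypothetical, and only the documented behavior of `torch.matmul` and `torch.nn.functional.linear` is assumed. `F.linear(x, weight)` expects `weight` laid out as `[out_features, in_features]` and computes `x @ weight.T`, so a gate weight stored as `[H, E]` can be multiplied directly with `torch.matmul`, while passing it to `F.linear` unchanged raises a shape error.

```python
import torch
import torch.nn.functional as F

# Hypothetical sizes for illustration: tokens, hidden dim, number of experts.
T, H, E = 4, 8, 3

x = torch.randn(T, H, dtype=torch.bfloat16)       # activations, [T, H]
W_gate = torch.randn(H, E, dtype=torch.bfloat16)  # gate weight, [H, E]

# Direct product: [T, H] @ [H, E] -> [T, E]; no transpose or dtype cast.
logits = torch.matmul(x, W_gate)

# F.linear computes x @ weight.T, so it needs weight as [E, H].
# Passing a transposed view of W_gate yields the same [T, E] shape.
logits_linear = F.linear(x, W_gate.t())

# Passing the [H, E] tensor as-is makes F.linear attempt
# [T, H] @ [E, H], which is a shape mismatch.
try:
    F.linear(x, W_gate)
    linear_failed = False
except RuntimeError:
    linear_failed = True

print(logits.shape, logits_linear.shape, linear_failed)
```

Whether a given call site should use `torch.matmul` or `F.linear` depends only on which layout the weight is stored in; the sketch does not model graph capture or the kernel's actual dispatch path.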

2026-06-04 05:58:24 +00:00

attention

Wire prefill FMHA into production.py and single_shot

2026-06-03 03:49:57 +00:00

cache

Cleanup Step 2: Archive Lineage P code, fix broken imports

2026-06-02 19:27:07 +00:00

compressor

Cleanup Step 2: Archive Lineage P code, fix broken imports

2026-06-02 19:27:07 +00:00

cuda

Add c10/cuda/CUDAStream.h include for getCurrentCUDAStream