Files
nvfp4-megamoe-kernel/dsv4/kernels
biondizzle e46b615873 Fix dense router BF16 dispatch for CUDA graph capture
- Run GEMM in BF16 (not FP32) during graph capture — Blackwell tensor cores
  handle BF16 natively; FP32 GEMM triggers cudaErrorStreamCaptureUnsupported
- Pre-transpose W_gate to [E, H] at load time — avoids .T view during capture
- Convert only logits output to FP32 for sqrt(softplus) numerical stability
- This fixes the graph capture failure at layer 0 Graph B
2026-06-04 05:50:13 +00:00
..