nvfp4-megamoe-kernel

biondizzle/nvfp4-megamoe-kernel

Commit Graph

Author	SHA1	Message	Date
biondizzle	1e6adf5e01	P3: wire 6-warp multi-head FMHA decode fast path into production.py	2026-05-30 08:12:23 +00:00

biondizzle

1e6adf5e01

P3: wire 6-warp multi-head FMHA decode fast path into production.py

- fmha_multihead_launch.cu: PyTorch launch wrapper for fmha_6warp_multihead_kernel
  (c10::BFloat16 boundary, uint16_t bf16_t inside kernel, zero-cost casts)
- fmha_multihead_op.py: torch.utils.cpp_extension JIT loader + custom_op registration
  (dsv4::fmha_multihead_decode for torch.compile)
- production.py: fast path dispatch for T=1, n_segments==1, hd in {64,128,256}
  Falls through to CuTeDSL slow path for multi-segment/prefill
- test_p3_fast_decode.py: integration test (MHA/MQA/GQA, cosine similarity >= 0.999998)
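
The dispatch condition and the accuracy threshold listed above can be sketched in plain Python. This is a minimal illustration and not the code in production.py or test_p3_fast_decode.py: the names `use_fast_decode_path`, `FAST_PATH_HEAD_DIMS`, `cosine_similarity` and `COSINE_THRESHOLD` are hypothetical, and only the conditions (T=1, n_segments==1, hd in {64,128,256}) and the 0.999998 threshold come from the commit message.

```python
import math

# Head dimensions the commit message lists for the fast path.
FAST_PATH_HEAD_DIMS = frozenset({64, 128, 256})

# Threshold quoted in the commit message for the integration test.
COSINE_THRESHOLD = 0.999998


def use_fast_decode_path(T: int, n_segments: int, head_dim: int) -> bool:
    """Hypothetical predicate: True only for single-token, single-segment
    decode with a supported head dimension. Anything else (multi-segment,
    prefill) would fall through to the slow path."""
    return T == 1 and n_segments == 1 and head_dim in FAST_PATH_HEAD_DIMS


def cosine_similarity(a, b) -> float:
    """Cosine similarity of two equal-length, non-zero sequences of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    print(use_fast_decode_path(T=1, n_segments=1, head_dim=128))   # decode
    print(use_fast_decode_path(T=16, n_segments=1, head_dim=128))  # prefill
    out, ref = [0.5, 1.0, -2.0], [0.5, 1.0, -2.0]
    print(cosine_similarity(out, ref) >= COSINE_THRESHOLD)
```

Keeping the predicate separate from the launch code makes the fall-through rule easy to test without a GPU.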

Architecture:
  Grid: dim3(1, n_h, batch_size) — one CTA per (head, batch)
  MQA: k_head_stride=0 so all Q heads share same K/V
  Single kernel launch, zero cudaDeviceSynchronize on hot path
  Normalized output for single-segment decode
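
The grid layout and the stride-0 trick above can be modelled on the host in Python. This is a sketch under stated assumptions, not the kernel's indexing code: `cta_coords`, `k_base_offset` and `k_batch_stride` are hypothetical names, and the offset formula (batch index times batch stride plus head index times head stride) is an assumed layout. Only the grid shape dim3(1, n_h, batch_size) and k_head_stride=0 for MQA come from the commit message.

```python
def cta_coords(n_h: int, batch_size: int):
    """Enumerate (blockIdx.x, blockIdx.y, blockIdx.z) for a grid of
    dim3(1, n_h, batch_size): one CTA per (head, batch) pair."""
    return [(0, h, b) for b in range(batch_size) for h in range(n_h)]


def k_base_offset(head: int, batch: int,
                  k_head_stride: int, k_batch_stride: int) -> int:
    """Assumed element offset of the K block read by one CTA.
    With k_head_stride == 0 the head index drops out, so every
    Q head in a batch entry reads the same K (and likewise V)."""
    return batch * k_batch_stride + head * k_head_stride


if __name__ == "__main__":
    n_h, batch_size = 8, 2
    print(len(cta_coords(n_h, batch_size)))  # number of CTAs launched

    # MQA: stride 0, all heads in batch 1 map to one offset.
    mqa = {k_base_offset(h, 1, 0, 4096) for h in range(n_h)}
    # MHA: non-zero head stride, each head gets its own offset.
    mha = {k_base_offset(h, 1, 512, 4096) for h in range(n_h)}
    print(mqa, len(mha))
```

Using a zero stride lets one kernel serve both MHA and MQA without a separate code path; the stride values 512 and 4096 here are illustrative only.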

2026-05-30 08:12:23 +00:00

1 Commit