nvfp4-megamoe-kernel/.gitignore at master

biondizzle/nvfp4-megamoe-kernel

biondizzle a9d5e09f4c B1: mixed FP8/BF16 decode FMHA integration

- New: fmha_mixed_fp8_decode.cuh (Blackwell FP8 tensor-core FMHA kernel)
- New: fmha_mixed_fp8_capi.cu (C ABI launcher)
- New: fmha_mixed_fp8_op.py (Python ctypes/nvcc bridge)
- New: fp8_attention_io.cu (Q quantize + mixed KV gather kernels)
- New: fmha_umma_desc.cuh additions (f8f6f4 UMMA + idesc helpers)
- Modified: production.py (dsv4_attention_mixed_fp8_decode API)
- Modified: single_shot_inference.py (B1 gather + FMHA path)
- Modified: __init__.py (export mixed FP8 API)
- New: docs/B1_MIXED_FP8_FMHA.md, FINAL_STRETCH.md

noPE KV stays FP8_E4M3 + per-row scale, RoPE stays BF16.
No global FP8->BF16 KV staging before FMHA.
Decode-only (T==1), specialized HD=512/NOPE=448/ROPE=64.
CUDA compile/runtime validation pending on B200.

2026-06-02 22:53:14 +00:00

5 lines

58 B

Plaintext

__pycache__/
*.pyc
*.egg-info/
nvfp4-megamoe-kernel-*.zip