75288bd12f
Wire prefill FMHA into production.py and single_shot
...
- Add dsv4_attention_mixed_fp8_prefill to production.py
- _run_production_fmha_mixed now dispatches to prefill kernel for T>1
- Remove decode-only T==1 restriction
- Update FINAL_STRETCH.md: prefill marked DONE, batched prefill TODO noted
2026-06-03 03:49:57 +00:00
3e3b352e7e
Update FINAL_STRETCH.md: B1 and B2 marked DONE with test results and bug fixes
2026-06-03 01:50:21 +00:00
a9d5e09f4c
B1: mixed FP8/BF16 decode FMHA integration
...
- New: fmha_mixed_fp8_decode.cuh (Blackwell FP8 tensor-core FMHA kernel)
- New: fmha_mixed_fp8_capi.cu (C ABI launcher)
- New: fmha_mixed_fp8_op.py (Python ctypes/nvcc bridge)
- New: fp8_attention_io.cu (Q quantize + mixed KV gather kernels)
- New: fmha_umma_desc.cuh additions (f8f6f4 UMMA + idesc helpers)
- Modified: production.py (dsv4_attention_mixed_fp8_decode API)
- Modified: single_shot_inference.py (B1 gather + FMHA path)
- Modified: __init__.py (export mixed FP8 API)
- New: docs/B1_MIXED_FP8_FMHA.md, FINAL_STRETCH.md
noPE KV stays FP8_E4M3 + per-row scale, RoPE stays BF16.
No global FP8->BF16 KV staging before FMHA.
Decode-only (T==1), specialized HD=512/NOPE=448/ROPE=64.
CUDA compile/runtime validation pending on B200.
2026-06-02 22:53:14 +00:00
2eb4f0886e
things
2026-06-02 22:31:13 +00:00
9ba6476d3f
auto: pre-test commit
2026-06-02 21:39:01 +00:00
0b6ca0df80
P5 integration + B3 q_a_norm fused + gsa scalar fix
...
P5: Wire up fused mHC pre_block + RMSNorm + NVFP4 quantize kernel
- Replaces: pre_block bmm + rmsnorm (4+ launches) + quantize (2 launches)
- With: 2 kernel launches (mhc_rmsnorm_amax_gsa + mhc_rmsnorm_quantize_nvfp4)
- Both attn and ffn mHC paths now use P5 fused kernel
- Savings: ~5 launches/site × 2 sites × 61 layers = 610 launches/token
B3: Fused rmsnorm+quant for q_a_norm → q_b path
- q_a output → rmsnorm_quantize_nvfp4 → QuantizedActivation → q_b.run_from_quantized
- Eliminates BF16 round-trip between q_a_norm and q_b GEMM
- Saves: ~6 kernel launches per layer (rmsnorm 4+ + quantize 2 vs fused 2)
gsa scalar fix in Nvfp4Linear.run_from_quantized:
- CuTeDSL NVFP4 GEMM expects global_scale_a as per-expert scalar (shape (1,))
- Per-row gsa from fused kernels must be reduced to scalar (max) for M>1
- For M=1 decode: already scalar, no reduction needed
- Fixes potential correctness issue at prefill (M>1) when using fused paths
Cleanup: Remove --ab-compare flag and A/B comparison code (replaced by P5)
2026-06-02 21:20:34 +00:00
ac4eedc444
auto: pre-test commit
2026-06-02 20:16:43 +00:00
ecd48ab65e
A1: Add explicit stop set for DSV4 turn-end tokens
...
Previously only stopped on tokenizer.eos_token_id. DSV4 uses special
turn-end tokens (<|end_of_sentence|>, USER_TOKEN=128803) that indicate
the assistant turn is complete. Missing these caused decode to continue
past the model's natural stopping point, producing degenerate output.
Also increased diagnostic logging (every step for first 20 steps) to
catch turn-end token emissions.
2026-06-02 19:59:52 +00:00