nvfp4-megamoe-kernel

Files

biondizzle b9243fe40a B2: FP8 tensor-core indexer scoring + weighted ReLU + top-k

- New kernel: dsv4/kernels/cuda/indexer_fp8_score_topk.cu
  - Native Blackwell FP8 GEMM via tcgen05.mma.kind::f8f6f4
  - Q (n_ih=64, ihd=128) quantized BF16→FP8, K consumed directly as FP8_E4M3
  - TMEM read using 16x256b.x1 (4 warps in parallel, pattern proven in B1 FMHA)
  - On-the-fly: dequant (q_scale*k_scale) → ReLU → weighted sum → top-k
  - No global BF16 staging of indexer keys, no FP32 einsum on CUDA cores
  - Per-thread register heap top-k (same algorithm as indexer_score_topk.cu)
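
The scoring pipeline listed above (dequant, ReLU, weighted sum, top-k) can be sketched in plain Python as a CPU reference. This is a minimal sketch, not the kernel: the function name `indexer_score_topk`, the argument layout, and the scale granularity (one `q_scale` for all heads, one `k_scale` per key) are assumptions made for illustration, and `heapq` stands in for the per-thread register heap.

```python
import heapq

def indexer_score_topk(q, k, weights, q_scale, k_scales, top_k):
    """Reference scoring for one decode token (hypothetical layout).

    q        : [n_heads][dim] quantized query values, held as floats
    k        : [n_keys][dim]  quantized key values, held as floats
    weights  : [n_heads]      per-head weights for the weighted sum
    q_scale  : float          assumed single dequant scale for Q
    k_scales : [n_keys]       assumed per-key dequant scale for K
    """
    heap = []  # min-heap of (score, key_index), at most top_k entries
    for t, k_row in enumerate(k):
        scale = q_scale * k_scales[t]  # dequant factor applied after the dot product
        score = 0.0
        for h, q_row in enumerate(q):
            logit = scale * sum(a * b for a, b in zip(q_row, k_row))
            score += weights[h] * max(logit, 0.0)  # ReLU, then weighted sum over heads
        if len(heap) < top_k:
            heapq.heappush(heap, (score, t))
        elif score > heap[0][0]:
            heapq.heapreplace(heap, (score, t))  # evict the current minimum
    # Highest score first; ties broken by lower key index.
    return [t for _, t in sorted(heap, key=lambda e: (-e[0], e[1]))]
```

A reference like this is mainly useful for checking kernel output on small inputs, since the bounded heap never materializes the full score vector, which mirrors the "no staging" property described above.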

- Modified: single_shot_inference.py
  - Indexer.forward() now takes kv_cache directly (not comp_idx_kv BF16)
  - Consumes FP8 indexer keys from cache without BF16 dequantization
  - Dispatches to B2 FP8 kernel for T=1, n_ih=64, ihd=128 (production decode)
  - FP32 einsum fallback retained only for T>1 (prefill)
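
The dispatch rule described above can be sketched as a small shape check. The function name `select_indexer_path` and the returned labels are hypothetical, and raising on an unsupported T=1 shape is an assumption; the commit only states that the FP8 kernel handles T=1, n_ih=64, ihd=128 and that the FP32 einsum fallback is kept for T>1.

```python
def select_indexer_path(T, n_ih, ihd):
    """Pick the scoring path for a call shape (hypothetical helper)."""
    if T == 1 and n_ih == 64 and ihd == 128:
        return "fp8_tensor_core"  # production decode shape
    if T > 1:
        return "fp32_einsum"      # prefill fallback
    # Assumption: any other T=1 shape has no supported path.
    raise ValueError(f"unsupported indexer shape: T={T}, n_ih={n_ih}, ihd={ihd}")
```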

- Removed 'Intentional first-pass limits' section from B1 doc
  (those limits ARE the correct production design, not shortcuts)

2026-06-02 23:18:54 +00:00

B1_MIXED_FP8_FMHA.md

B2: FP8 tensor-core indexer scoring + weighted ReLU + top-k

2026-06-02 23:18:54 +00:00

cuda13_tma_notes.md

archive: TMA driver-API files + CUDA 13 TMA discovery notes

2026-05-29 06:52:39 +00:00

p4_tma_hang_resolution.md

P4 RESOLVED: TMA hang was GMEM misalignment, not descriptor/driver issue

2026-05-30 08:42:18 +00:00

p7_tmem_column_layout.md

P7: Document TMEM column layout, add multi-row softmax test

2026-05-30 17:17:54 +00:00

PERFORMANCE_AUDIT.md

Cleanup Step 1: Move root-level files to proper directories

2026-06-02 19:24:39 +00:00