Commit Graph

7 Commits

Author SHA1 Message Date
8df5de5477 Update B1 docs with test results and bug fix 2026-06-03 01:50:59 +00:00
b9243fe40a B2: FP8 tensor-core indexer scoring + weighted ReLU + top-k
- New kernel: dsv4/kernels/cuda/indexer_fp8_score_topk.cu
  - Native Blackwell FP8 GEMM via tcgen05.mma.kind::f8f6f4
  - Q (n_ih=64, ihd=128) quantized BF16→FP8, K consumed directly as FP8_E4M3
  - TMEM read using 16x256b.x1 (4-warps parallel, proven from B1 FMHA)
  - On-the-fly: dequant (q_scale*k_scale) → ReLU → weighted sum → top-k
  - No global BF16 staging of indexer keys, no FP32 einsum on CUDA cores
  - Per-thread register heap top-k (same algorithm as indexer_score_topk.cu)

- Modified: single_shot_inference.py
  - Indexer.forward() now takes kv_cache directly (not comp_idx_kv BF16)
  - Consumes FP8 indexer keys from cache without BF16 dequantization
  - Dispatches to B2 FP8 kernel for T=1, n_ih=64, ihd=128 (production decode)
  - FP32 einsum fallback retained only for T>1 (prefill)

- Removed 'Intentional first-pass limits' section from B1 doc
  (those limits ARE the correct production design, not shortcuts)
2026-06-02 23:18:54 +00:00
a9d5e09f4c B1: mixed FP8/BF16 decode FMHA integration
- New: fmha_mixed_fp8_decode.cuh (Blackwell FP8 tensor-core FMHA kernel)
- New: fmha_mixed_fp8_capi.cu (C ABI launcher)
- New: fmha_mixed_fp8_op.py (Python ctypes/nvcc bridge)
- New: fp8_attention_io.cu (Q quantize + mixed KV gather kernels)
- New: fmha_umma_desc.cuh additions (f8f6f4 UMMA + idesc helpers)
- Modified: production.py (dsv4_attention_mixed_fp8_decode API)
- Modified: single_shot_inference.py (B1 gather + FMHA path)
- Modified: __init__.py (export mixed FP8 API)
- New: docs/B1_MIXED_FP8_FMHA.md, FINAL_STRETCH.md

noPE KV stays FP8_E4M3 + per-row scale, RoPE stays BF16.
No global FP8->BF16 KV staging before FMHA.
Decode-only (T==1), specialized HD=512/NOPE=448/ROPE=64.
CUDA compile/runtime validation pending on B200.
2026-06-02 22:53:14 +00:00
8de47e26ce Cleanup Step 1: Move root-level files to proper directories
- Move test_*.py → tests/integration/
- Move probe_*.py, dump_*.py → helpers/
- Move PERFORMANCE_AUDIT.md → docs/
- Move single_shot_PYTORCH_REFERENCE.py → dsv4/reference/
- Fix 3 import references in test_layer_comparison, test_mhc_comparison, test_compressor_position_bias
- Add helpers/import_closure.py (dead-code detection tool)
2026-06-02 19:24:39 +00:00
e747742598 P7: Document TMEM column layout, add multi-row softmax test
docs/p7_tmem_column_layout.md: Verified that tcgen05.ld 32x32b.x8 is
the correct instruction for multi-row softmax. Each call reads 8 KV
positions for 32 rows. No instruction change needed from single-row.

test_p7_multi_row_softmax.py: Tests T=1,4,32,64,128 at various HD and N.
Gate: cos >= 0.999996.
2026-05-30 17:17:54 +00:00
6421f7c3f3 P4 RESOLVED: TMA hang was GMEM misalignment, not descriptor/driver issue
Evidence: TMA loads succeed with 128B-aligned GMEM on all descriptor configs.
The bit-21 workaround was NOT needed. The 'misaligned address' crashes were
caused by passing non-128B-aligned GMEM pointers to cp.async.bulk.tensor.

Added docs/p4_tma_hang_resolution.md with root cause and fix.
Cleaned up stale P4 test files.
2026-05-30 08:42:18 +00:00
a40c05f3f2 archive: TMA driver-API files + CUDA 13 TMA discovery notes
Key findings documented in docs/cuda13_tma_notes.md:
- CUDA 13 globalStrides are in BYTES not elements (root cause of desc creation failures)
- BFLOAT16 data type available in CUDA 13
- Driver API descriptors create OK but cp.async.bulk.tensor hangs on driver 13.0 + toolkit 13.2
- CuTeDSL tma_partition works (production path)

Archived (not deleted):
- fmha_tma_driver_api.cuh, fmha_6warp_tma_driver_api.cuh, test_fmha_tma_driver_api.cu
- These will work once driver matches toolkit version
2026-05-29 06:52:39 +00:00