nvfp4-megamoe-kernel

Author	SHA1	Message	Date
biondizzle	99b6de316b	Fix prefill kernel: add missing tb base in PV TMEM read, fix ACCUMULATE for per-row PV Two critical fixes: 1. prefill_read_pv_all_subs: was missing 'tb' base in TMEM read address 2. PV MMA ACCUMULATE: use pv_kt == 0 (not kv_tile==0 && pv_kt==0 && n_sub==0) so each query row's PV starts fresh instead of accumulating into previous row's result	2026-06-03 02:59:19 +00:00
biondizzle	9034f67b0f	Fix prefill kernel: read ALL n_sub PV results (was only n_sub=0) Critical bug: prefill_read_pv_row only read n_sub=0 (16 out of 512 HD dims). Replaced with prefill_read_pv_all_subs that iterates over all 32 n_sub groups. Also fixed TMEM row-group/warp mapping for rows 32-127.	2026-06-03 02:54:59 +00:00
biondizzle	a4ef6c3454	Add B1 mixed FP8 prefill FMHA kernel (T>1 support) New files: - fmha_mixed_fp8_prefill.cuh: kernel supporting T=1..128 - Sub-batch processing (T_BATCH=32) to fit in 232KB SMEM - Multi-row QK TMEM read using tcgen05.ld.32x32b.x8 - Per-row online softmax - Per-row PV MMA (correctness first; batched PV is TODO) - Attention sink support - fmha_mixed_fp8_prefill_capi.cu: C API bridge - fmha_mixed_fp8_prefill_op.py: Python ctypes loader - test_b1_mixed_fp8_prefill.py: unit test (T=1..32, N=128..4096) Also: fix production FMHA layer test (BF16 fallback for o_a_proj, router gate BF16 quantize path, missing DEVICE constant)	2026-06-03 02:50:27 +00:00
biondizzle	1f757151ef	Fix router gate BF16 quantize path for production FMHA test	2026-06-03 02:47:47 +00:00
biondizzle	07168357cc	Fix o_a_proj weight loading: add BF16 fallback for grouped linear	2026-06-03 02:38:00 +00:00
biondizzle	27d8d80a40	Fix missing DEVICE constant in production FMHA test	2026-06-03 02:31:11 +00:00
biondizzle	26a817c2f2	Fix production FMHA layer test: compare raw FMHA vs SDPA on production gathered KV Phase 1: Run full pipeline to populate KV caches with real model weights. Phase 2: For each layer, gather KV in mixed FP8/BF16 format, run both production FMHA and PyTorch SDPA, compare cosine similarity. Uses random Q (not model-generated) to isolate FMHA kernel accuracy from upstream pipeline issues.	2026-06-03 02:26:37 +00:00
biondizzle	ba67e055f7	Add production FMHA layer comparison test Test loads real model weights, runs attention forward for layers 0-4, compares production B1 mixed FP8 FMHA output vs PyTorch SDPA reference. This will reveal the FMHA cosine degradation (was 0.679 at L1) with real data patterns, not just synthetic random data. Production values: HD=512, NOPE=448, ROPE=64, H=128, 8 GPUs.	2026-06-03 02:22:23 +00:00
biondizzle	af58f2c5b2	Add B1 weight/format verification at L0 in single_shot v-b1-b2-done-20260603	2026-06-03 01:52:55 +00:00
biondizzle	8df5de5477	Update B1 docs with test results and bug fix	2026-06-03 01:50:59 +00:00
biondizzle	3e3b352e7e	Update FINAL_STRETCH.md: B1 and B2 marked DONE with test results and bug fixes	2026-06-03 01:50:21 +00:00
biondizzle	84a02f8995	Remove debug test files, keep production B1/B2 unit tests	2026-06-03 01:49:39 +00:00
biondizzle	6fa9ad7852	B2 indexer: adopt TMEM warp-to-row mapping fix Key insight: tcgen05.ld.32x32b.x8 maps warp 0 to rows 0-31 and warp 1 to rows 32-63 from the SAME TMEM address. The hardware routes row slices based on warp position in the warpgroup. Fix approach (from external LLM review): - Warps 0-1 both read from tb + col_base (same address) - Each warp writes partial scores to its own sWarpScores partition - After __syncthreads(), merge both partitions for final 64-head scores - No race conditions, no cross-warp accumulation bugs	2026-06-03 01:42:38 +00:00
biondizzle	6c92ff91f3	B2 indexer: temporary heads 0-31 only while figuring out TMEM row 32-63 layout	2026-06-03 01:12:10 +00:00
biondizzle	7732c93f62	Fix B2 indexer: use 16x256b.x1 TMEM read with TMEM_COLS=512 Revert to 16x256b.x1 approach (reads 64 rows from single column). Previous hang was likely due to TMEM_COLS=128 (too small). With TMEM_COLS=512, the full 128-row MMA output fits in TMEM. Lane i reads rows 4i..4i+3. Lanes 0-15 cover rows 0-63. 4 warps (0-3) each process 32 columns, computing weighted ReLU scores.	2026-06-03 01:08:48 +00:00
biondizzle	a75a9843af	Fix B2 indexer: add sLogits scratch buffer to SMEM layout	2026-06-03 00:59:06 +00:00
biondizzle	cc7b17fdaa	Fix B2 indexer: use 2-warps for TMEM read (P7 row-slice model) ROOT CAUSE: The TMEM read for rows 32-63 was wrong. The 32x32b.x8 instruction reads 32 rows per warp. Per P7 docs, warp 0 sees rows 0-31 and warp 1 sees rows 32-63 from the SAME TMEM address. There is no TMEM offset for different row groups — the row-to-lane mapping depends on the warp ID. Fix: warp 0 reads heads 0-31, warp 1 reads heads 32-63 from tb + col_base. Cross-warp reduce via SMEM to compute full 64-head weighted ReLU scores.	2026-06-03 00:55:27 +00:00
biondizzle	8d0a02ca67	B2 TMEM debug: try stride=SK_TILE/8=16 for row group 32-63	2026-06-03 00:52:32 +00:00
biondizzle	fdf702470c	Add B2 TMEM read debug kernel and test	2026-06-03 00:50:52 +00:00
biondizzle	f1cf4c0215	Add B2 QK debug test with w_h=1 for simple comparison	2026-06-03 00:46:48 +00:00
biondizzle	d36dbba01c	Fix B2 indexer: increase TMEM_COLS to 512 for full 128-row MMA output The MMA produces 128 rows × 128 cols = 4 row-groups × 128 TMEM cols = 512 total. Even though we only read rows 0-63, the MMA writes all 128 rows. TMEM_COLS must match the MMA output size, not just the read size.	2026-06-03 00:45:15 +00:00
biondizzle	797345dfe9	Add B2 score debug test	2026-06-03 00:43:44 +00:00
biondizzle	afb82b9c89	Fix B2 indexer: replace broken 16x256b TMEM read with proven 32x32b.x8 ROOT CAUSES: 1. tcgen05.ld.16x256b.x1 was hanging — either invalid instruction or unaligned 2. TMEM_COLS=128 was too small for 64-row MMA output (needs 256 for 2 row-groups) 3. TMEM row-group addressing: rows 32-63 are at offset SK_TILE (128) in TMEM Fixes: - Use tcgen05.ld.32x32b.x8 (proven in B1 FMHA) instead of 16x256b.x1 - Increase TMEM_COLS from 128 to 256 - Read both row-groups (0-31 and 32-63) per 8-column chunk - Each lane handles head i (from row-group 0) and head 32+i (from row-group 1) - Warp-level reduce sums contributions from all 64 heads per column	2026-06-03 00:39:49 +00:00
biondizzle	99e50fcb58	Add B2 minimal debug test to find hang point	2026-06-03 00:35:48 +00:00
biondizzle	e21bd14408	Fix B1 test LSE reference shape handling	2026-06-03 00:25:53 +00:00
biondizzle	4fe7f9dc37	Fix B1 FMHA: swap V matrix canonical layout args (dd, kk) not (kk, dd) ROOT CAUSE: canon_idx_bf16_16x16(kk, dd) was swapping the outer/inner group structure compared to the working TMA-loaded V layout in the multitile kernel. Working layout: (lr/8)128 + (dd/8)64 + (dd%8)8 + (lr%8) B1 with (kk,dd): (dd/8)128 + (kk/8)64 + (kk%8)8 + (dd%8) <- WRONG B1 with (dd,kk): (kk/8)128 + (dd/8)64 + (dd%8)*8 + (kk%8) <- CORRECT This caused the V matrix to be loaded into SMEM with transposed group structure, producing garbage output (cos=0.158 vs BF16 reference).	2026-06-03 00:24:20 +00:00
biondizzle	29a95a3db6	Add B1 QK vs PV isolation test	2026-06-03 00:23:35 +00:00
biondizzle	c322e3f301	Add B1 FMHA debug test for cosine failure investigation	2026-06-03 00:22:00 +00:00
biondizzle	5447d1d1dc	Add comprehensive B2 FP8 indexer unit test	2026-06-03 00:21:29 +00:00
biondizzle	38eecb28d8	Add comprehensive B1 mixed FP8 FMHA unit test	2026-06-03 00:20:07 +00:00
biondizzle	f2063c0588	B1: minimal debug test for mixed FP8 FMHA (1 head, N=128)	2026-06-03 00:09:36 +00:00
biondizzle	0cea0b33ff	B1 test: fix BF16 reference to use PyTorch SDPA	2026-06-03 00:07:38 +00:00
biondizzle	a51d19a7fc	B1: add mixed FP8 FMHA cosine verification test (HD=512, N=128-2048)	2026-06-03 00:06:25 +00:00
biondizzle	b9243fe40a	B2: FP8 tensor-core indexer scoring + weighted ReLU + top-k - New kernel: dsv4/kernels/cuda/indexer_fp8_score_topk.cu - Native Blackwell FP8 GEMM via tcgen05.mma.kind::f8f6f4 - Q (n_ih=64, ihd=128) quantized BF16→FP8, K consumed directly as FP8_E4M3 - TMEM read using 16x256b.x1 (4-warps parallel, proven from B1 FMHA) - On-the-fly: dequant (q_scale*k_scale) → ReLU → weighted sum → top-k - No global BF16 staging of indexer keys, no FP32 einsum on CUDA cores - Per-thread register heap top-k (same algorithm as indexer_score_topk.cu) - Modified: single_shot_inference.py - Indexer.forward() now takes kv_cache directly (not comp_idx_kv BF16) - Consumes FP8 indexer keys from cache without BF16 dequantization - Dispatches to B2 FP8 kernel for T=1, n_ih=64, ihd=128 (production decode) - FP32 einsum fallback retained only for T>1 (prefill) - Removed 'Intentional first-pass limits' section from B1 doc (those limits ARE the correct production design, not shortcuts)	2026-06-02 23:18:54 +00:00
biondizzle	a9d5e09f4c	B1: mixed FP8/BF16 decode FMHA integration - New: fmha_mixed_fp8_decode.cuh (Blackwell FP8 tensor-core FMHA kernel) - New: fmha_mixed_fp8_capi.cu (C ABI launcher) - New: fmha_mixed_fp8_op.py (Python ctypes/nvcc bridge) - New: fp8_attention_io.cu (Q quantize + mixed KV gather kernels) - New: fmha_umma_desc.cuh additions (f8f6f4 UMMA + idesc helpers) - Modified: production.py (dsv4_attention_mixed_fp8_decode API) - Modified: single_shot_inference.py (B1 gather + FMHA path) - Modified: __init__.py (export mixed FP8 API) - New: docs/B1_MIXED_FP8_FMHA.md, FINAL_STRETCH.md noPE KV stays FP8_E4M3 + per-row scale, RoPE stays BF16. No global FP8->BF16 KV staging before FMHA. Decode-only (T==1), specialized HD=512/NOPE=448/ROPE=64. CUDA compile/runtime validation pending on B200.	2026-06-02 22:53:14 +00:00
biondizzle	2eb4f0886e	things pre-b1	2026-06-02 22:31:13 +00:00
biondizzle	9d4a014fad	Fix NameError: dequantize_nvfp4 not in scope in forward_attention The B3 fused q_a_norm path used dequantize_nvfp4 but it was only imported in forward_layer, not forward_attention. Added local import.	2026-06-02 21:52:29 +00:00
biondizzle	9ba6476d3f	auto: pre-test commit	2026-06-02 21:39:01 +00:00
biondizzle	845227c06c	Fix stale lock file in CUDA loader — prevents infinite spin on crash recovery torch.utils.cpp_extension.load creates a 'lock' file in the build directory during compilation. If the compiling process is killed (OOM, timeout, user interrupt), the lock file is never removed and subsequent processes spin forever polling it (clock_nanosleep(100ms) → stat(lock) → repeat). Fix: _cleanup_stale_lock() removes lock files older than 10 minutes before any compilation attempt. This is the correct threshold — CUDA kernel compilation should never take more than a few minutes, so a 10-minute-old lock is guaranteed stale.	2026-06-02 21:34:58 +00:00
biondizzle	0b6ca0df80	P5 integration + B3 q_a_norm fused + gsa scalar fix P5: Wire up fused mHC pre_block + RMSNorm + NVFP4 quantize kernel - Replaces: pre_block bmm + rmsnorm (4+ launches) + quantize (2 launches) - With: 2 kernel launches (mhc_rmsnorm_amax_gsa + mhc_rmsnorm_quantize_nvfp4) - Both attn and ffn mHC paths now use P5 fused kernel - Savings: ~5 launches/site × 2 sites × 61 layers = 610 launches/token B3: Fused rmsnorm+quant for q_a_norm → q_b path - q_a output → rmsnorm_quantize_nvfp4 → QuantizedActivation → q_b.run_from_quantized - Eliminates BF16 round-trip between q_a_norm and q_b GEMM - Saves: ~6 kernel launches per layer (rmsnorm 4+ + quantize 2 vs fused 2) gsa scalar fix in Nvfp4Linear.run_from_quantized: - CuTeDSL NVFP4 GEMM expects global_scale_a as per-expert scalar (shape (1,)) - Per-row gsa from fused kernels must be reduced to scalar (max) for M>1 - For M=1 decode: already scalar, no reduction needed - Fixes potential correctness issue at prefill (M>1) when using fused paths Cleanup: Remove --ab-compare flag and A/B comparison code (replaced by P5)	2026-06-02 21:20:34 +00:00
biondizzle	7e42b5e090	A1: Add ◇ (think_start) priming after Assistant token DSV4 is a reasoning model. The standard prompt format is: BOS <\|User\|> prompt <\|Assistant\|> ◇ Without the ◇ priming, the model is out-of-distribution — it expects to be inside a thinking block but never received the sentinel. This causes degenerate output from step 0 (France instead of Paris, looping on newlines/repeated tokens). With ◇, the model will: 1. Generate thinking content (reasoning) 2. Emit ◇ (think_end=128822) to close the thinking block 3. Produce the actual answer 4. Emit EOS (token 1) This matches the pattern described in the Kimi K2 accuracy blog: https://vllm.ai/blog/2025-10-28-kimi-k2-accuracy — malformed prompt formatting is the #1 cause of degenerate output in chat-tuned reasoning models.	2026-06-02 20:23:47 +00:00
biondizzle	ac4eedc444	auto: pre-test commit	2026-06-02 20:16:43 +00:00
biondizzle	ecd48ab65e	A1: Add explicit stop set for DSV4 turn-end tokens Previously only stopped on tokenizer.eos_token_id. DSV4 uses special turn-end tokens (<\|end_of_sentence\|>, USER_TOKEN=128803) that indicate the assistant turn is complete. Missing these caused decode to continue past the model's natural stopping point, producing degenerate output. Also increased diagnostic logging (every step for first 20 steps) to catch turn-end token emissions.	2026-06-02 19:59:52 +00:00
biondizzle	35dbb8d12b	Cleanup Part 2: Fix docs, stale references, dead code - Update README.md package structure to match actual file tree - Remove references to nonexistent fmha.py, fmha_smem_acc, kernels/decode/ - Document live attention path: production.py → fmha_multitile_op → capi.cu → .cuh - Add _archive/ section - Fix loader.py docstring: fused_amax_quantize_nvfp4 → quantize_nvfp4_from_buffer - Remove preload_all() (dead, referenced nonexistent compressor_reduce_quant.cu)	2026-06-02 19:27:28 +00:00
biondizzle	f3b551956d	Cleanup Step 2: Archive Lineage P code, fix broken imports - Move dead dsv4/ modules to dsv4/_archive/ (52 files) - model/{dsv4,mtp,layer,layer_schedule} - layers/{embedding,attention,ffn,norm} (kept linear,mhc,router,moe,shared_expert,grouped_linear - live) - cache/, kernels/cache/, kernels/indexer/{csa_indexer,score_topk,compute_valid_lens} - kernels/router/{nvfp4_fused_router,dense_router_decode_kernel,dense_router_prefill} - ops/{topk,topk_select,rope,router}, loader/{hf_checkpoint,layout_convert} - reference/{attention,compressor,csa_attention,moe_pipeline} - kernels/compressor/{compress_tail,csa_hca} - Restore dsv4/ops/{router,custom_ops}.py (needed by live layers) - Fix dsv4/kernels/{indexer,compressor,attention}/__init__.py (removed broken imports) - Remove preload_all() from loader.py (dead, referenced nonexistent .cu file) - Fix loader.py docstring (fused_amax_quantize_nvfp4 → quantize_nvfp4_from_buffer) - Move broken tests to tests/e2e_archive/ - test_fused_router, production_values_test, e2e/{one_layer,model_construction,csa_hca} - vLLM has 0 imports of dsv4 (Step 0 confirmed)	2026-06-02 19:27:07 +00:00
biondizzle	8de47e26ce	Cleanup Step 1: Move root-level files to proper directories - Move test_.py → tests/integration/ - Move probe_.py, dump_*.py → helpers/ - Move PERFORMANCE_AUDIT.md → docs/ - Move single_shot_PYTORCH_REFERENCE.py → dsv4/reference/ - Fix 3 import references in test_layer_comparison, test_mhc_comparison, test_compressor_position_bias - Add helpers/import_closure.py (dead-code detection tool)	2026-06-02 19:24:39 +00:00
biondizzle	b111525af4	Fix indexer documentation and safety issues 1. score_topk.py: Fix docstring — K^IComp[s] is shared (MQA), not per-head K^IComp[s,h] Matches the .cu kernel and production Indexer.forward() einsum. 2. score_topk.py: Add WARNING about valid_lens broadcast being wrong for batched prefill 3. csa_indexer.py: Replace random weights with RuntimeError — CSAIndexer has no checkpoint loading. Production uses the Indexer class in single_shot_inference.py. 4. csa_indexer.py: Document RoPE assumption — indexer queries/keys have no RoPE. NEEDS VERIFICATION against HF reference.	2026-06-02 19:08:40 +00:00
biondizzle	d770111cb1	Remove stale duplicate .cu files from indexer/ subfolder The CUDA loader (dsv4/kernels/cuda/loader.py) resolves all .cu files relative to dsv4/kernels/cuda/. The indexer/ subfolder copies were never loaded — they were dead code that could silently diverge from the canonical copies in cuda/.	2026-06-02 18:49:40 +00:00
biondizzle	eb5ef93bf1	Add A/B comparison mode for P4 fused vs unfused RMSNorm+quantize - Added --ab-compare flag to run both fused and unfused paths for first 3 layers - Compares x_normed, gsa values, FP4 data, and GEMM outputs (q_a, kv) - Added --no-fused-rmsnorm to disable P4 and use unfused path - This will help diagnose the correctness regression introduced by P4	2026-06-02 18:49:30 +00:00
biondizzle	b8bab01a55	Update PERFORMANCE_AUDIT.md — P4 done, P5 kernel done (pending integration)	2026-06-02 18:26:01 +00:00

1 2 3 4 5 ...

2272 Commits