nvfp4-megamoe-kernel

Author	SHA1	Message	Date
biondizzle	c77b83fffc	Add degeneration test 1: chat-template token-ID diff	2026-06-03 08:17:09 +00:00
biondizzle	c5a131c358	more doc clean up again	2026-06-03 08:14:07 +00:00
biondizzle	019a3a34b7	Clean up L0 B1 verify noise (gate on VERBOSE), update FINAL_STRETCH.md Batched prefill + T>128 chunking now complete. All dangling items in FINAL_STRETCH.md are marked done.	2026-06-03 08:12:54 +00:00
biondizzle	5e09be08af	Fix non-contiguous tensor in quantize_nvfp4_gpu_fused (T>1 prefill) The intermediate tensor from fused SwiGLU deinterleave is a column slice (non-contiguous). When T>1, quantize_nvfp4_gpu_fused receives this and the CUDA kernel crashes with 'input must be contiguous'. Fix: add is_contiguous() check + .contiguous() in quantize_nvfp4_gpu_fused and in SharedExpert._run_l2. This is the root cause, not a workaround — CUDA kernels legitimately require contiguous memory.	2026-06-03 07:56:19 +00:00
biondizzle	60309ef124	Batched prefill: replace T=1 token-by-token with chunked T≤128 batch processing - Process prefill tokens in chunks of up to 128 (FMHA T≤128 constraint) - Each chunk goes through ALL 61 layers before the next chunk - KV cache append_swa, compressor, indexer all already support T>1 - FMHA dispatches to dsv4_attention_mixed_fp8_prefill for T>1 - For T>128: splits into multiple launches automatically - mHC, Router, MoE, Nvfp4Linear all handle M>1 natively - Eliminates ~N_prefill * 61 per-token overhead from the old loop	2026-06-03 07:39:37 +00:00
biondizzle	0bf276f8c9	more doc cleanup	2026-06-03 07:37:13 +00:00
biondizzle	d463ac8512	doc cleanup	2026-06-03 07:34:12 +00:00
biondizzle	7450ebc67a	CORRECTNESS_BACKLOG.md: comprehensive production pipeline verification results — all tested and confirmed findings from PART A diagnostics	2026-06-03 07:31:01 +00:00
biondizzle	9dbfac9dfa	PART A: verify kv_norm_w loaded correctly	2026-06-03 07:03:39 +00:00
biondizzle	a682c6adf4	PART A: add raw compressor output diagnostic	2026-06-03 06:56:56 +00:00
biondizzle	f2c1b3afd5	PART A: fix KV diagnostics — compute q_a before indexer, add Q_heads magnitude check	2026-06-03 06:33:51 +00:00
biondizzle	86e59c16c5	PART A: add KV gather diagnostics at blowup layer	2026-06-03 06:25:35 +00:00
biondizzle	262f844e2e	PART A: add detailed blowup diagnostics — capture mHC intermediate values when |X| > 1e6	2026-06-03 06:10:33 +00:00
biondizzle	6459fbca9a	fix: import forward_attention	2026-06-03 05:41:33 +00:00
biondizzle	91dfac34d8	PART A: simplified to production-only diagnostics — track per-layer |X| during prefill and decode, detect blowup early	2026-06-03 05:33:22 +00:00
biondizzle	d99503732d	fix: add BF16 gate weight fallback for dense routers (missing from test)	2026-06-03 05:22:47 +00:00
biondizzle	801bfc9a83	add router mode debug print	2026-06-03 05:15:52 +00:00
biondizzle	b385ecc05e	PART A: decode diagnostics test — production vs reference per-layer X comparison at decode step	2026-06-03 05:06:40 +00:00
biondizzle	d518fcb82a	test: correct sink bias reference — denominator-only, no V contribution	2026-06-03 04:57:37 +00:00
biondizzle	9574a9dc2e	test: add sink bias to reference SDPA in decode FMHA comparison	2026-06-03 04:53:55 +00:00
biondizzle	9a9b347b2b	test: add per-head magnitude ratio diagnostics to decode FMHA test	2026-06-03 04:50:23 +00:00
biondizzle	f5fa20c581	fix: syntax error — missing closing paren in indexer.forward call	2026-06-03 04:46:41 +00:00
biondizzle	693975ec92	fix: device mismatches in decode FMHA test — dec_pos must be on per-layer GPU	2026-06-03 04:46:24 +00:00
biondizzle	e1d96c509d	test: decode FMHA layer comparison — checks FMHA accuracy during decode step	2026-06-03 04:39:12 +00:00
biondizzle	1ebe7f0dde	Add PART_A_NEXT_SESSION.md: clues for decode degeneration debugging	2026-06-03 04:34:28 +00:00
biondizzle	d8306be3f2	Fix PART A test: proper FP8 quantization and MQA reference	2026-06-03 04:20:36 +00:00
biondizzle	4126909dfb	Simplify PART A test: compressor + FMHA at production scale	2026-06-03 04:18:13 +00:00
biondizzle	8c54cfa748	Fix KVCache init in PART A test	2026-06-03 04:15:41 +00:00
biondizzle	04cf8ca848	Add PART A diagnostic tests: compressor + KV cache + FMHA at production scale	2026-06-03 04:13:53 +00:00
biondizzle	75288bd12f	Wire prefill FMHA into production.py and single_shot - Add dsv4_attention_mixed_fp8_prefill to production.py - _run_production_fmha_mixed now dispatches to prefill kernel for T>1 - Remove decode-only T==1 restriction - Update FINAL_STRETCH.md: prefill marked DONE, batched prefill TODO noted	2026-06-03 03:49:57 +00:00
biondizzle	5417f65b08	CRITICAL FIX: Add T-dimension strides to prefill FMHA kernel The kernel was using head strides for the T (query row) dimension, which happened to work for T=1 (qr=0 always) but was wrong for T>1. For (B,H,T,NOPE) layout: - Head stride = T*NOPE, but T stride = NOPE - Scale head stride = T, but T stride = 1 - RoPE head stride = T*ROPE, but T stride = ROPE Added q_nope_t_stride, q_scale_t_stride, q_rope_t_stride to params struct, C API, and Python wrapper.	2026-06-03 03:48:17 +00:00
biondizzle	dd1cbe1faa	Fix smem size for prefill debug test	2026-06-03 03:47:01 +00:00
biondizzle	09384a637a	Fix constexpr issues in prefill debug test	2026-06-03 03:46:29 +00:00
biondizzle	d3dc8cf901	Add prefill T=2 debug CUDA test with intermediate value printing	2026-06-03 03:46:14 +00:00
biondizzle	223c22488f	Simplify prefill PV read: use decode kernel's exact pattern Replace complex n_sub-iterating read with the same HD/8 iteration pattern as the proven decode kernel. Extract from lane qr%32 instead of always lane 0. For qr>=32, use warp 1; for qr>=64, add TMEM offset. This should fix the row 1 accuracy issue (was cos=0.94 vs decode).	2026-06-03 03:22:49 +00:00
biondizzle	2bf5e74e61	Add prefill debug test: compare T=1 decode vs prefill kernel step by step	2026-06-03 03:05:25 +00:00
biondizzle	eb69c3bfb9	CRITICAL FIX: add missing tb base in QK TMEM read address prefill_read_qk_rows was reading from address 0 (sg_off + n * 8) instead of tb + sg_off + n * 8. This caused garbage QK values, explaining the 0.928 cosine for T=1 and NaN for T>1.	2026-06-03 03:00:57 +00:00
biondizzle	99b6de316b	Fix prefill kernel: add missing tb base in PV TMEM read, fix ACCUMULATE for per-row PV Two critical fixes: 1. prefill_read_pv_all_subs: was missing 'tb' base in TMEM read address 2. PV MMA ACCUMULATE: use pv_kt == 0 (not kv_tile==0 && pv_kt==0 && n_sub==0) so each query row's PV starts fresh instead of accumulating into previous row's result	2026-06-03 02:59:19 +00:00
biondizzle	9034f67b0f	Fix prefill kernel: read ALL n_sub PV results (was only n_sub=0) Critical bug: prefill_read_pv_row only read n_sub=0 (16 out of 512 HD dims). Replaced with prefill_read_pv_all_subs that iterates over all 32 n_sub groups. Also fixed TMEM row-group/warp mapping for rows 32-127.	2026-06-03 02:54:59 +00:00
biondizzle	a4ef6c3454	Add B1 mixed FP8 prefill FMHA kernel (T>1 support) New files: - fmha_mixed_fp8_prefill.cuh: kernel supporting T=1..128 - Sub-batch processing (T_BATCH=32) to fit in 232KB SMEM - Multi-row QK TMEM read using tcgen05.ld.32x32b.x8 - Per-row online softmax - Per-row PV MMA (correctness first; batched PV is TODO) - Attention sink support - fmha_mixed_fp8_prefill_capi.cu: C API bridge - fmha_mixed_fp8_prefill_op.py: Python ctypes loader - test_b1_mixed_fp8_prefill.py: unit test (T=1..32, N=128..4096) Also: fix production FMHA layer test (BF16 fallback for o_a_proj, router gate BF16 quantize path, missing DEVICE constant)	2026-06-03 02:50:27 +00:00
biondizzle	1f757151ef	Fix router gate BF16 quantize path for production FMHA test	2026-06-03 02:47:47 +00:00
biondizzle	07168357cc	Fix o_a_proj weight loading: add BF16 fallback for grouped linear	2026-06-03 02:38:00 +00:00
biondizzle	27d8d80a40	Fix missing DEVICE constant in production FMHA test	2026-06-03 02:31:11 +00:00
biondizzle	26a817c2f2	Fix production FMHA layer test: compare raw FMHA vs SDPA on production gathered KV Phase 1: Run full pipeline to populate KV caches with real model weights. Phase 2: For each layer, gather KV in mixed FP8/BF16 format, run both production FMHA and PyTorch SDPA, compare cosine similarity. Uses random Q (not model-generated) to isolate FMHA kernel accuracy from upstream pipeline issues.	2026-06-03 02:26:37 +00:00
biondizzle	ba67e055f7	Add production FMHA layer comparison test Test loads real model weights, runs attention forward for layers 0-4, compares production B1 mixed FP8 FMHA output vs PyTorch SDPA reference. This will reveal the FMHA cosine degradation (was 0.679 at L1) with real data patterns, not just synthetic random data. Production values: HD=512, NOPE=448, ROPE=64, H=128, 8 GPUs.	2026-06-03 02:22:23 +00:00
biondizzle	af58f2c5b2	Add B1 weight/format verification at L0 in single_shot v-b1-b2-done-20260603	2026-06-03 01:52:55 +00:00
biondizzle	8df5de5477	Update B1 docs with test results and bug fix	2026-06-03 01:50:59 +00:00
biondizzle	3e3b352e7e	Update FINAL_STRETCH.md: B1 and B2 marked DONE with test results and bug fixes	2026-06-03 01:50:21 +00:00
biondizzle	84a02f8995	Remove debug test files, keep production B1/B2 unit tests	2026-06-03 01:49:39 +00:00
biondizzle	6fa9ad7852	B2 indexer: adopt TMEM warp-to-row mapping fix Key insight: tcgen05.ld.32x32b.x8 maps warp 0 to rows 0-31 and warp 1 to rows 32-63 from the SAME TMEM address. The hardware routes row slices based on warp position in the warpgroup. Fix approach (from external LLM review): - Warps 0-1 both read from tb + col_base (same address) - Each warp writes partial scores to its own sWarpScores partition - After __syncthreads(), merge both partitions for final 64-head scores - No race conditions, no cross-warp accumulation bugs	2026-06-03 01:42:38 +00:00
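
The stride bug described in commit 5417f65b08 can be illustrated with plain index arithmetic. The sketch below is not the kernel code: `strides_bhtd` and `offset` are hypothetical helpers, and it assumes a contiguous (B, H, T, NOPE) tensor with the production values H=128 and NOPE=448 quoted in commit ba67e055f7, and T=2 as in the debug test.

```python
def strides_bhtd(H, T, D):
    """Element strides of a contiguous (B, H, T, D) tensor (hypothetical helper)."""
    return {"b": H * T * D, "h": T * D, "t": D, "d": 1}


def offset(b, h, t, d, s):
    """Flat element offset of index (b, h, t, d) given strides s."""
    return b * s["b"] + h * s["h"] + t * s["t"] + d * s["d"]


H, T, NOPE = 128, 2, 448
s = strides_bhtd(H, T, NOPE)

# Correct addressing: query row t=1 of head 0 sits NOPE elements after row 0.
correct = offset(0, 0, 1, 0, s)

# Buggy addressing: the head stride (T * NOPE) is reused for the T dimension,
# so query row 1 of head 0 aliases the start of head 1.
buggy = 1 * s["h"]

# With t=0 (the only row when T=1) both schemes give the same offset,
# which is why the bug stayed hidden in decode.
same_at_t0 = offset(0, 0, 0, 0, s) == 0 * s["h"]

print(correct, buggy, same_at_t0)
```

The same reasoning applies to the scale and RoPE tensors, which is why the commit adds a separate T stride for each of them.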


2309 Commits
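
The chunked prefill from commit 60309ef124 splits the prompt into pieces of at most 128 tokens, and each piece passes through all 61 layers before the next one starts. A minimal sketch of that splitting follows. `prefill_chunks` and `layer_calls` are hypothetical names, not functions from the repository, and the call counts only model the "one call per layer per chunk" structure described in the commit message.

```python
MAX_T = 128     # per-launch query limit stated in the commit message
N_LAYERS = 61   # layer count stated in the commit message


def prefill_chunks(n_tokens, max_t=MAX_T):
    """Yield half-open (start, end) ranges covering n_tokens in chunks of at most max_t."""
    for start in range(0, n_tokens, max_t):
        yield start, min(start + max_t, n_tokens)


def layer_calls(n_tokens, max_t):
    """Number of per-layer invocations when each chunk runs through every layer."""
    return sum(1 for _ in prefill_chunks(n_tokens, max_t)) * N_LAYERS


chunks = list(prefill_chunks(300))
old_calls = layer_calls(300, max_t=1)      # token-by-token loop (T=1)
new_calls = layer_calls(300, max_t=MAX_T)  # chunked loop (T<=128)

print(chunks)
print(old_calls, new_calls)
```

For a 300-token prompt this gives three chunks, the last one shorter than 128 tokens, so any downstream code must accept a variable T per chunk.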