nvfp4-megamoe-kernel

Author	SHA1	Message	Date
biondizzle	90ac38cde0	Add CUDA graph stream management test	2026-06-06 08:14:29 +00:00
biondizzle	26042e3f01	Add minimal CUDA graph multi-GPU test to isolate zero-output bug	2026-06-06 08:13:18 +00:00
biondizzle	df05289d6f	CUDA graph: Fix remaining sync violations from B200 detector run 2. (1) grouped_linear.py: Remove conditional host read of GPU tensor; 'if group_offsets[0] != 0' reads GPU value on host → sync. Fix: unconditionally update offsets every call (GPU-only multiply). (2) test_cuda_graph_readiness.py: Use pinned CPU buffers for token transfer; dec_tid_buf[0] = python_int → CPU→GPU sync. Fix: write to pinned CPU buffer, then copy_ (async, graph-capturable). (3) Add dsv4/decode/cuda_graph_decoder.py (skeleton).	2026-06-03 17:20:34 +00:00
biondizzle	caac8ae108	Fix syntax error: 'is not not None' -> 'is not None'	2026-06-03 16:34:33 +00:00
biondizzle	ba68212fa7	Add CUDA graph readiness detector (Section A of GETTING_CUDAGRAPH_READY.md) - Grep for Section B sync patterns in hot path files - Method 1: run decode forward with torch.cuda.set_sync_debug_mode('error') - Method 2: attempt CUDA graph capture of L0 decode step - Full model load + prefill + warmup before detection - Results saved to /tmp/cuda_graph_readiness_results.json	2026-06-03 16:34:15 +00:00
biondizzle	3c06fd5591	Test 2: fix topk tensor shape (flatten before iterating)	2026-06-03 08:47:32 +00:00
biondizzle	29d6986dd4	Test 2: fix quantize_to_nvfp4 import	2026-06-03 08:21:39 +00:00
biondizzle	60b9bbd470	Test 2: fix import - use mHCLayer from dsv4.layers.mhc, fixed prompt encoding	2026-06-03 08:20:21 +00:00
biondizzle	2a42686e8e	Test 1 v2: diff hand-rolled vs official DSV4 encoding	2026-06-03 08:18:56 +00:00
biondizzle	11c2d5fe53	Add degeneration test 2: falsify mHC residual growth root cause	2026-06-03 08:18:01 +00:00
biondizzle	c77b83fffc	Add degeneration test 1: chat-template token-ID diff	2026-06-03 08:17:09 +00:00
biondizzle	9dbfac9dfa	PART A: verify kv_norm_w loaded correctly	2026-06-03 07:03:39 +00:00
biondizzle	a682c6adf4	PART A: add raw compressor output diagnostic	2026-06-03 06:56:56 +00:00
biondizzle	f2c1b3afd5	PART A: fix KV diagnostics — compute q_a before indexer, add Q_heads magnitude check	2026-06-03 06:33:51 +00:00
biondizzle	86e59c16c5	PART A: add KV gather diagnostics at blowup layer	2026-06-03 06:25:35 +00:00
biondizzle	262f844e2e	PART A: add detailed blowup diagnostics — capture mHC intermediate values when |X| > 1e6	2026-06-03 06:10:33 +00:00
biondizzle	6459fbca9a	fix: import forward_attention	2026-06-03 05:41:33 +00:00
biondizzle	91dfac34d8	PART A: simplified to production-only diagnostics — track per-layer |X| during prefill and decode, detect blowup early	2026-06-03 05:33:22 +00:00
biondizzle	d99503732d	fix: add BF16 gate weight fallback for dense routers (missing from test)	2026-06-03 05:22:47 +00:00
biondizzle	801bfc9a83	add router mode debug print	2026-06-03 05:15:52 +00:00
biondizzle	b385ecc05e	PART A: decode diagnostics test — production vs reference per-layer X comparison at decode step	2026-06-03 05:06:40 +00:00
biondizzle	d518fcb82a	test: correct sink bias reference — denominator-only, no V contribution	2026-06-03 04:57:37 +00:00
biondizzle	9574a9dc2e	test: add sink bias to reference SDPA in decode FMHA comparison	2026-06-03 04:53:55 +00:00
biondizzle	9a9b347b2b	test: add per-head magnitude ratio diagnostics to decode FMHA test	2026-06-03 04:50:23 +00:00
biondizzle	f5fa20c581	fix: syntax error — missing closing paren in indexer.forward call	2026-06-03 04:46:41 +00:00
biondizzle	693975ec92	fix: device mismatches in decode FMHA test — dec_pos must be on per-layer GPU	2026-06-03 04:46:24 +00:00
biondizzle	e1d96c509d	test: decode FMHA layer comparison — checks FMHA accuracy during decode step	2026-06-03 04:39:12 +00:00
biondizzle	d8306be3f2	Fix PART A test: proper FP8 quantization and MQA reference	2026-06-03 04:20:36 +00:00
biondizzle	4126909dfb	Simplify PART A test: compressor + FMHA at production scale	2026-06-03 04:18:13 +00:00
biondizzle	8c54cfa748	Fix KVCache init in PART A test	2026-06-03 04:15:41 +00:00
biondizzle	04cf8ca848	Add PART A diagnostic tests: compressor + KV cache + FMHA at production scale	2026-06-03 04:13:53 +00:00
biondizzle	dd1cbe1faa	Fix smem size for prefill debug test	2026-06-03 03:47:01 +00:00
biondizzle	09384a637a	Fix constexpr issues in prefill debug test	2026-06-03 03:46:29 +00:00
biondizzle	d3dc8cf901	Add prefill T=2 debug CUDA test with intermediate value printing	2026-06-03 03:46:14 +00:00
biondizzle	2bf5e74e61	Add prefill debug test: compare T=1 decode vs prefill kernel step by step	2026-06-03 03:05:25 +00:00
biondizzle	a4ef6c3454	Add B1 mixed FP8 prefill FMHA kernel (T>1 support). New files: - fmha_mixed_fp8_prefill.cuh: kernel supporting T=1..128 - Sub-batch processing (T_BATCH=32) to fit in 232KB SMEM - Multi-row QK TMEM read using tcgen05.ld.32x32b.x8 - Per-row online softmax - Per-row PV MMA (correctness first; batched PV is TODO) - Attention sink support - fmha_mixed_fp8_prefill_capi.cu: C API bridge - fmha_mixed_fp8_prefill_op.py: Python ctypes loader - test_b1_mixed_fp8_prefill.py: unit test (T=1..32, N=128..4096). Also: fix production FMHA layer test (BF16 fallback for o_a_proj, router gate BF16 quantize path, missing DEVICE constant).	2026-06-03 02:50:27 +00:00
biondizzle	1f757151ef	Fix router gate BF16 quantize path for production FMHA test	2026-06-03 02:47:47 +00:00
biondizzle	07168357cc	Fix o_a_proj weight loading: add BF16 fallback for grouped linear	2026-06-03 02:38:00 +00:00
biondizzle	27d8d80a40	Fix missing DEVICE constant in production FMHA test	2026-06-03 02:31:11 +00:00
biondizzle	26a817c2f2	Fix production FMHA layer test: compare raw FMHA vs SDPA on production gathered KV. Phase 1: Run full pipeline to populate KV caches with real model weights. Phase 2: For each layer, gather KV in mixed FP8/BF16 format, run both production FMHA and PyTorch SDPA, compare cosine similarity. Uses random Q (not model-generated) to isolate FMHA kernel accuracy from upstream pipeline issues.	2026-06-03 02:26:37 +00:00
biondizzle	ba67e055f7	Add production FMHA layer comparison test. Test loads real model weights, runs attention forward for layers 0-4, compares production B1 mixed FP8 FMHA output vs PyTorch SDPA reference. This will reveal the FMHA cosine degradation (was 0.679 at L1) with real data patterns, not just synthetic random data. Production values: HD=512, NOPE=448, ROPE=64, H=128, 8 GPUs.	2026-06-03 02:22:23 +00:00
biondizzle	84a02f8995	Remove debug test files, keep production B1/B2 unit tests	2026-06-03 01:49:39 +00:00
biondizzle	fdf702470c	Add B2 TMEM read debug kernel and test	2026-06-03 00:50:52 +00:00
biondizzle	f1cf4c0215	Add B2 QK debug test with w_h=1 for simple comparison	2026-06-03 00:46:48 +00:00
biondizzle	797345dfe9	Add B2 score debug test	2026-06-03 00:43:44 +00:00
biondizzle	99e50fcb58	Add B2 minimal debug test to find hang point	2026-06-03 00:35:48 +00:00
biondizzle	e21bd14408	Fix B1 test LSE reference shape handling	2026-06-03 00:25:53 +00:00
biondizzle	29a95a3db6	Add B1 QK vs PV isolation test	2026-06-03 00:23:35 +00:00
biondizzle	c322e3f301	Add B1 FMHA debug test for cosine failure investigation	2026-06-03 00:22:00 +00:00
biondizzle	5447d1d1dc	Add comprehensive B2 FP8 indexer unit test	2026-06-03 00:21:29 +00:00

1124 Commits