nvfp4-megamoe-kernel

Author	SHA1	Message	Date
biondizzle	2e6559402c	Add full layer NaN test (attention + MoE, multi-layer chain)	2026-05-19 18:36:49 +00:00
biondizzle	cca145e35c	Use 16 experts for MoE runner test (fits in memory)	2026-05-19 18:35:40 +00:00
biondizzle	7893e7514d	Add MoE runner NaN test (grouped GEMM with real weights)	2026-05-19 18:34:56 +00:00
biondizzle	7b432da754	Fix intermediate size: 3072 not 18432	2026-05-19 18:34:12 +00:00
biondizzle	293f14a179	Rewrite MoE NaN test: per-expert format, activation quantization, grouped GEMM	2026-05-19 18:33:57 +00:00
biondizzle	62f2395e30	Fix MoE weight key names, add fallback	2026-05-19 18:32:49 +00:00
biondizzle	9455466648	Add MoE NaN reproduction test, update CURRENT_BUG.md with NaN tracing and test plan	2026-05-19 18:32:14 +00:00
biondizzle	0316cec6fb	Add input NaN debug to trace where NaN starts	2026-05-19 18:15:53 +00:00
biondizzle	4c45d73b82	Add prefill inputs NaN debug	2026-05-19 18:04:18 +00:00
biondizzle	0773c9608c	Add prefill attention value debug check	2026-05-19 17:55:35 +00:00
biondizzle	4f02113aa0	Use module-level Blackwell flag in compressor (works during torch.compile)	2026-05-19 17:37:26 +00:00
biondizzle	8cf6ac3e8c	CRITICAL FIX: Remove double Q normalization and fix RoPE sin slice	2026-05-19 17:27:33 +00:00
biondizzle	a94ad73c64	Fix imports in vLLM codepaths test	2026-05-19 17:26:50 +00:00
biondizzle	f3f9674810	Fix f-string syntax	2026-05-19 17:26:40 +00:00
biondizzle	6cc2312e61	Add test for exact vLLM codepaths (fused_qnorm, kv_write, decode)	2026-05-19 17:26:10 +00:00
biondizzle	aade8593f7	CRITICAL FIX: Properly dequantize fp8 KV in decode using per-token inv_scale	2026-05-19 17:08:58 +00:00
biondizzle	2f811bc8bd	FIX: Use vLLM's decode_swa_indices for correct paged KV cache access during decode	2026-05-19 16:55:44 +00:00
biondizzle	da6fa2f1d6	Fix UnboundLocalError: move num_decode_tokens before debug print	2026-05-19 16:43:28 +00:00
biondizzle	76fff5fc8b	CRITICAL FIX: Skip compressor fused attention kernel on Blackwell — it bypasses our attention path	2026-05-19 16:35:07 +00:00
biondizzle	0554332352	Add debug logging to Blackwell attention path	2026-05-19 16:31:55 +00:00
biondizzle	f9a09df81a	Fix wrapper attribute access: kv_cache, attn_sink, max_model_len via mla_attn	2026-05-19 16:19:28 +00:00
biondizzle	b95e934703	Add CSA/HCA decode + prefill attention to Blackwell path	2026-05-19 16:06:24 +00:00
biondizzle	abff942edd	Fix N for C128A (need 128 tokens)	2026-05-19 16:04:53 +00:00
biondizzle	49c2e088d4	Fix compressor key name	2026-05-19 16:04:38 +00:00
biondizzle	7d89ede9f9	Add CSA sparse attention test (compressed KV gather + SWA merge)	2026-05-19 16:04:19 +00:00
biondizzle	51a7a89c5c	Update CURRENT_BUG: KV cache pipeline verified, all tests passing	2026-05-19 16:01:10 +00:00
biondizzle	696a890df7	Add decode vs prefill consistency test	2026-05-19 16:00:33 +00:00
biondizzle	359654f08e	Test with all 61 layers (shared experts only)	2026-05-19 15:55:41 +00:00
biondizzle	3e6041d752	Fix view→reshape for non-contiguous tensor	2026-05-19 15:54:40 +00:00
biondizzle	ff9f373633	Add e2e decode test (3 layers: C128A, C4A, SWA)	2026-05-19 15:53:29 +00:00
biondizzle	a5870fa05c	Vectorize paged KV cache read/write, kill container	2026-05-19 15:48:16 +00:00
biondizzle	9e428b83c7	Fix KV cache: write to paged cache, handle uint8→fp8 conversion, fix RoPE bug	2026-05-19 15:34:09 +00:00
biondizzle	0023fee706	Add blackwell_attention module and comprehensive test	2026-05-19 15:30:29 +00:00
biondizzle	142a4a1ad4	Fix attention for decode (1 query vs N cached KVs)	2026-05-19 15:28:52 +00:00
biondizzle	4b85605edf	Fix fp8 amax in decode test	2026-05-19 15:28:17 +00:00
biondizzle	4f23055450	Add decode attention pipeline test — reproduces KV cache bug	2026-05-19 15:27:55 +00:00
biondizzle	31b9cfbdbd	Update README and CURRENT_BUG: BUILD YOUR OWN KERNELS. Stop patching vLLM.	2026-05-19 15:19:55 +00:00
biondizzle	dca8bfc3a8	Fix _apply_rope_kv: use inline RoPE instead of 3D apply_gptj_rope	2026-05-19 10:36:21 +00:00
biondizzle	8e6721917e	Fix syntax in RoPE KV test	2026-05-19 10:31:07 +00:00
biondizzle	cbf440f75a	Add RoPE KV test	2026-05-19 10:28:15 +00:00
biondizzle	a5fabbdf66	Apply RoPE to KV in Blackwell attention path - fix NaN output	2026-05-19 10:27:15 +00:00
biondizzle	7e97551fd3	Fix: use self.scale instead of self.softmax_scale in Blackwell attention path	2026-05-19 10:04:46 +00:00
biondizzle	39310c357d	Patch compressor cache for Blackwell (no FlashMLA alignment) - fixes 91 missing layers	2026-05-19 09:52:23 +00:00
biondizzle	d9cd8fa165	Add debug patch to print layer name mismatch	2026-05-19 09:45:10 +00:00
biondizzle	9a0b015aac	Reduce max_model_len to 256	2026-05-19 09:37:38 +00:00
biondizzle	de1fb839f0	Patch SWA and Indexer cache specs for Blackwell (no FlashMLA alignment)	2026-05-19 09:29:57 +00:00
biondizzle	ea771ff70b	Reduce max_model_len to 512 for initial container test	2026-05-19 09:23:10 +00:00
biondizzle	bcfbd1e25b	Reduce max_model_len to 32768 (876544 requires 204 GiB KV cache)	2026-05-19 09:13:33 +00:00
biondizzle	e91421f06e	Fix KV cache page size patch: separate groups for large SWA pages	2026-05-19 09:05:14 +00:00
biondizzle	dd7f2627e8	Add full model forward test (WIP), sparse attention test passes	2026-05-19 09:04:19 +00:00

465 Commits