nvfp4-megamoe-kernel

Author	SHA1	Message	Date
biondizzle	22a2fc563e	cleanup: remove diagnostic test file	2026-05-25 16:25:05 +00:00
biondizzle	a064b99d3d	fix test 4: use silu(gate)+swiglu interleaved (matching fused kernel output)	2026-05-25 16:24:04 +00:00
biondizzle	e76ea36337	fix test: use proper global_scale from quantize_to_nvfp4 for larger shape test	2026-05-25 16:23:00 +00:00
biondizzle	5508f29625	add GPU quantize diagnostic test	2026-05-25 16:20:29 +00:00
biondizzle	c2e3d15633	NVFP4-1.1 integration: GPU-only quantize kernel + MoE pipeline wiring - Add quantize_nvfp4.cu: BF16→FP4 GPU kernel (no CPU sync, warp shuffle amax) - Add quantize_nvfp4_gpu() bridge in ops/quantize.py - Fix deinterleave_quantize kernel path (dsv4/ops/kernels → dsv4/kernels/cuda) - Wire GPU quantize into Nvfp4MoE._run_impl(): - L1 input: quantize_nvfp4_gpu (replaces quantize_activation_nvfp4) - Fused SwiGLU L2: deinterleave_quantize_nvfp4_cuda (single kernel) - Non-fused L2: quantize_nvfp4_gpu - Add test_nvfp4_gpu_quantize.py for both kernels	2026-05-25 16:19:07 +00:00
biondizzle	6504f091ca	NVFP4-1.1 Step 3: post-SWiGLU quantization test suite (all PASS) - Standalone kernel cos 0.979 (128x512) - Post-SwiGLU quantization cos 0.976 (vs Python 0.995) - Larger shape cos 0.979 (512x4096) - FP8 scale match 100% across all tests - GPU kernel replaces CPU-GPU sync quantize path - Ready for integration into MoE pipeline	2026-05-25 09:08:01 +00:00
biondizzle	5e8347836f	NVFP4-1.1: working BF16→FP4 quantize kernel (cos 0.979) - Standalone CuTeDSL kernel using cute.arch.load/store - 1 CTA per row, 32 threads/CTA - BF16 load via Uint16 bitcast - FP8 E4M3 scale output (100% match) - FP4 packed nibble output (cos 0.979 vs Python ref) - Uses absf + arithmetic max/min (CuTeDSL ternary limitation) - Step 2 of SwiGLU FP4 fusion pipeline	2026-05-25 08:58:19 +00:00
biondizzle	52d11d7f92	NVFP4-1.1: standalone BF16→FP4 quantize kernel (WIP) + dequantize verification	2026-05-25 03:23:44 +00:00
biondizzle	1f310defa0	fix: quantize_activation_nvfp4 returns 2 values, not 3	2026-05-25 03:17:13 +00:00
biondizzle	6dac3bcaf0	NVFP4-1.1: add FP4 quantize round-trip test (step 1 of kernel fusion)	2026-05-25 03:15:40 +00:00
biondizzle	eb46e4d15e	NVFP4-0.2-0.4: add FP4 primitives diagnostic test	2026-05-25 03:07:53 +00:00
biondizzle	29ad36934d	cleanup: remove D2 diagnostic/experimental files, keep working codebase clean	2026-05-25 02:40:12 +00:00
biondizzle	d5b69ac122	D2: simpler shape diagnostic using CuTe from Python (no kernel needed)	2026-05-25 02:36:41 +00:00
biondizzle	684e9a85fe	fix: use utils.sm100 instead of sm100 in diagnostic	2026-05-25 02:34:25 +00:00
biondizzle	7599801f57	D2: add flat_divide shape diagnostic kernel for multi-CTA grid	2026-05-25 02:33:15 +00:00
biondizzle	6cc151097e	Revert D2 multi-CTA attempts - keeping per-head launch approach (works correctly)	2026-05-25 01:08:38 +00:00
biondizzle	4c79e5533e	D2: add multi-CTA grid with block_idx_y for Q/O head indexing	2026-05-24 23:27:38 +00:00
biondizzle	a5271821a8	D2: add scale test (more heads, larger hd)	2026-05-24 22:49:44 +00:00
biondizzle	d563c93fc5	D2: add per-head launch test	2026-05-24 22:48:22 +00:00
biondizzle	9b476d87f9	fix: compare un-normalized O against un-normalized reference	2026-05-24 22:44:11 +00:00
biondizzle	db353ec35a	D2: add simple n_h=1 regression test	2026-05-24 22:39:25 +00:00
biondizzle	4418e04a28	D1: revert per-row LSE to sfw_idx=0 for now (debugging D2 regression)	2026-05-24 22:28:11 +00:00
biondizzle	2cc66bff68	D2: add initial multi-head test file	2026-05-24 22:26:10 +00:00
biondizzle	49e66fb6e4	D1: corrected KV merge test with proper normalized output formula	2026-05-24 22:24:27 +00:00
biondizzle	c47f648617	fix lse verify	2026-05-24 22:23:08 +00:00
biondizzle	3577e09603	D1: add LSE verification test	2026-05-24 22:22:31 +00:00
biondizzle	674c5b9c18	D1: fix per-row LSE output + add KV merge test v2 with per-row LSE	2026-05-24 22:21:51 +00:00
biondizzle	c33185ca0a	D1: add rescale diagnostic	2026-05-24 22:18:12 +00:00
biondizzle	02edff5ac7	D1: add KV merge test using log-sum-exp (avoids TMEM round-trip)	2026-05-24 22:17:24 +00:00
biondizzle	35a3c04e8e	fix debug test	2026-05-24 22:04:51 +00:00
biondizzle	a391aa1fd3	D1: add rescale debug test	2026-05-24 22:04:20 +00:00
biondizzle	f1aab1bfc1	D1: add multi-KV-tile O rescale test (s_k=256,384,512)	2026-05-24 22:00:42 +00:00
biondizzle	c11ac38ceb	D1.4: Remove --opt-level 0 from hd512 test (use default opt level)	2026-05-24 16:42:01 +00:00
biondizzle	b14d88f37f	D1.4: Fix merge test - use use_smem_p=False for hd=256 kernel (SMEM budget)	2026-05-24 16:36:48 +00:00
biondizzle	e6c9e6c0d0	D1.4: Add external k_sub merge test for hd=512 (avoids slow in-kernel k_sub compilation)	2026-05-24 16:31:06 +00:00
biondizzle	13fcf16b14	D1.4: Use --opt-level 0 only (ptxas -j not supported, MLIR is the bottleneck)	2026-05-24 15:43:17 +00:00
biondizzle	b4da412b30	D1.4: Use options string for compile flags (--ptxas-options -j64 --opt-level 0)	2026-05-24 15:40:39 +00:00
biondizzle	4f69dffc93	D1.4: Add PtxasOptions -j64 + OptLevel(0) for faster hd=512 compilation	2026-05-24 15:36:35 +00:00
biondizzle	331ddb29b7	D1.4: Fix regression test for un-normalized O output (D5a)	2026-05-24 15:13:16 +00:00
biondizzle	449a6e7ede	Fix: add cutlass import to test_d1_qk512	2026-05-24 14:20:32 +00:00
biondizzle	ce267909ad	Fix: add cpasync import to test_d1_qk512	2026-05-24 14:20:01 +00:00
biondizzle	625837fd44	D1.4: Add hd=512 QK-only and standalone test for compilation debugging	2026-05-24 14:19:26 +00:00
biondizzle	592873b560	D1.4: Reduce pv_n_tile to 128 for hd=512 to fit SMEM budget (192KB)	2026-05-24 08:07:32 +00:00
biondizzle	787d0160a1	D1: Full test with TMEM-P at hd=64,128,256,512	2026-05-24 04:07:40 +00:00
biondizzle	24b9ebfba9	D1: SMEM-P test at hd=128	2026-05-24 03:48:37 +00:00
biondizzle	0f50933f69	D1: Fix SMEM-P (coordinate store), LSE (FP32), add TMEM-P-only test	2026-05-24 03:27:14 +00:00
biondizzle	f645f3994a	D1: LSE diagnostic at various hd	2026-05-24 03:23:16 +00:00
biondizzle	c042fcf6c7	D1: Add diagnostic test (TMEM-P vs SMEM-P at various hd)	2026-05-24 03:22:23 +00:00
biondizzle	1c5d6475e5	D1 test: compare un-norm O + norm using ref row_sum + LSE verification	2026-05-24 03:21:01 +00:00
biondizzle	93e7fe97f7	D1.5: Always output un-normalized O + LSE (epilogue_tma_store only, no TMEM round-trip normalize)	2026-05-24 03:18:38 +00:00


267 Commits