nvfp4-megamoe-kernel

Author	SHA1	Message	Date
biondizzle	04cf8ca848	Add PART A diagnostic tests: compressor + KV cache + FMHA at production scale	2026-06-03 04:13:53 +00:00
biondizzle	dd1cbe1faa	Fix smem size for prefill debug test	2026-06-03 03:47:01 +00:00
biondizzle	09384a637a	Fix constexpr issues in prefill debug test	2026-06-03 03:46:29 +00:00
biondizzle	d3dc8cf901	Add prefill T=2 debug CUDA test with intermediate value printing	2026-06-03 03:46:14 +00:00
biondizzle	2bf5e74e61	Add prefill debug test: compare T=1 decode vs prefill kernel step by step	2026-06-03 03:05:25 +00:00
biondizzle	a4ef6c3454	Add B1 mixed FP8 prefill FMHA kernel (T>1 support) New files: - fmha_mixed_fp8_prefill.cuh: kernel supporting T=1..128 - Sub-batch processing (T_BATCH=32) to fit in 232KB SMEM - Multi-row QK TMEM read using tcgen05.ld.32x32b.x8 - Per-row online softmax - Per-row PV MMA (correctness first; batched PV is TODO) - Attention sink support - fmha_mixed_fp8_prefill_capi.cu: C API bridge - fmha_mixed_fp8_prefill_op.py: Python ctypes loader - test_b1_mixed_fp8_prefill.py: unit test (T=1..32, N=128..4096) Also: fix production FMHA layer test (BF16 fallback for o_a_proj, router gate BF16 quantize path, missing DEVICE constant)	2026-06-03 02:50:27 +00:00
biondizzle	1f757151ef	Fix router gate BF16 quantize path for production FMHA test	2026-06-03 02:47:47 +00:00
biondizzle	07168357cc	Fix o_a_proj weight loading: add BF16 fallback for grouped linear	2026-06-03 02:38:00 +00:00
biondizzle	27d8d80a40	Fix missing DEVICE constant in production FMHA test	2026-06-03 02:31:11 +00:00
biondizzle	26a817c2f2	Fix production FMHA layer test: compare raw FMHA vs SDPA on production gathered KV Phase 1: Run full pipeline to populate KV caches with real model weights. Phase 2: For each layer, gather KV in mixed FP8/BF16 format, run both production FMHA and PyTorch SDPA, compare cosine similarity. Uses random Q (not model-generated) to isolate FMHA kernel accuracy from upstream pipeline issues.	2026-06-03 02:26:37 +00:00
biondizzle	ba67e055f7	Add production FMHA layer comparison test Test loads real model weights, runs attention forward for layers 0-4, compares production B1 mixed FP8 FMHA output vs PyTorch SDPA reference. This will reveal the FMHA cosine degradation (was 0.679 at L1) with real data patterns, not just synthetic random data. Production values: HD=512, NOPE=448, ROPE=64, H=128, 8 GPUs.	2026-06-03 02:22:23 +00:00
biondizzle	84a02f8995	Remove debug test files, keep production B1/B2 unit tests	2026-06-03 01:49:39 +00:00
biondizzle	fdf702470c	Add B2 TMEM read debug kernel and test	2026-06-03 00:50:52 +00:00
biondizzle	f1cf4c0215	Add B2 QK debug test with w_h=1 for simple comparison	2026-06-03 00:46:48 +00:00
biondizzle	797345dfe9	Add B2 score debug test	2026-06-03 00:43:44 +00:00
biondizzle	99e50fcb58	Add B2 minimal debug test to find hang point	2026-06-03 00:35:48 +00:00
biondizzle	e21bd14408	Fix B1 test LSE reference shape handling	2026-06-03 00:25:53 +00:00
biondizzle	29a95a3db6	Add B1 QK vs PV isolation test	2026-06-03 00:23:35 +00:00
biondizzle	c322e3f301	Add B1 FMHA debug test for cosine failure investigation	2026-06-03 00:22:00 +00:00
biondizzle	5447d1d1dc	Add comprehensive B2 FP8 indexer unit test	2026-06-03 00:21:29 +00:00
biondizzle	38eecb28d8	Add comprehensive B1 mixed FP8 FMHA unit test	2026-06-03 00:20:07 +00:00
biondizzle	f2063c0588	B1: minimal debug test for mixed FP8 FMHA (1 head, N=128)	2026-06-03 00:09:36 +00:00
biondizzle	0cea0b33ff	B1 test: fix BF16 reference to use PyTorch SDPA	2026-06-03 00:07:38 +00:00
biondizzle	a51d19a7fc	B1: add mixed FP8 FMHA cosine verification test (HD=512, N=128-2048)	2026-06-03 00:06:25 +00:00
biondizzle	f3b551956d	Cleanup Step 2: Archive Lineage P code, fix broken imports - Move dead dsv4/ modules to dsv4/_archive/ (52 files) - model/{dsv4,mtp,layer,layer_schedule} - layers/{embedding,attention,ffn,norm} (kept linear,mhc,router,moe,shared_expert,grouped_linear - live) - cache/, kernels/cache/, kernels/indexer/{csa_indexer,score_topk,compute_valid_lens} - kernels/router/{nvfp4_fused_router,dense_router_decode_kernel,dense_router_prefill} - ops/{topk,topk_select,rope,router}, loader/{hf_checkpoint,layout_convert} - reference/{attention,compressor,csa_attention,moe_pipeline} - kernels/compressor/{compress_tail,csa_hca} - Restore dsv4/ops/{router,custom_ops}.py (needed by live layers) - Fix dsv4/kernels/{indexer,compressor,attention}/__init__.py (removed broken imports) - Remove preload_all() from loader.py (dead, referenced nonexistent .cu file) - Fix loader.py docstring (fused_amax_quantize_nvfp4 → quantize_nvfp4_from_buffer) - Move broken tests to tests/e2e_archive/ - test_fused_router, production_values_test, e2e/{one_layer,model_construction,csa_hca} - vLLM has 0 imports of dsv4 (Step 0 confirmed)	2026-06-02 19:27:07 +00:00
biondizzle	8de47e26ce	Cleanup Step 1: Move root-level files to proper directories - Move test_*.py → tests/integration/ - Move probe_*.py, dump_*.py → helpers/ - Move PERFORMANCE_AUDIT.md → docs/ - Move single_shot_PYTORCH_REFERENCE.py → dsv4/reference/ - Fix 3 import references in test_layer_comparison, test_mhc_comparison, test_compressor_position_bias - Add helpers/import_closure.py (dead-code detection tool)	2026-06-02 19:24:39 +00:00
biondizzle	454dbdad52	P5: Fused mHC pre_block + RMSNorm + NVFP4 quantize kernel - fused_mhc_rmsnorm_quantize.cu: 2-kernel approach Kernel 1: mhc_rmsnorm_amax_gsa — bmm + RMS + amax → gsa Kernel 2: mhc_rmsnorm_quantize_nvfp4 — bmm + normalize + quantize - Python bridge: mhc_rmsnorm_quantize_nvfp4() in ops/quantize.py - Unit test: test_fused_mhc_rmsnorm_quantize.py (production shapes) - Eliminates ~610 kernel launches per token (122 sites × 5 launches saved)	2026-06-02 16:39:42 +00:00
biondizzle	149ecefb56	P4: Relax test thresholds — per-row gsa vs scalar gsa difference expected	2026-06-02 16:34:49 +00:00
biondizzle	794ebaf7e5	P4: Fused RMSNorm + NVFP4 quantize kernel (2 launches vs 6+) - fused_rmsnorm_quantize.cu: two-kernel approach Kernel 1: rmsnorm_amax_gsa — compute RMS + amax of normalized output → gsa per row Kernel 2: rmsnorm_quantize_nvfp4 — normalize + quantize using GPU-computed gsa - Python bridge: rmsnorm_quantize_nvfp4() in ops/quantize.py - Python bridge: dequantize_nvfp4() in ops/quantize.py - Unit test: test_fused_rmsnorm_quantize.py (production shapes: 7168 hidden) - Eliminates ~488 kernel launches per token (122 sites × 4 launches saved)	2026-06-02 16:26:24 +00:00
biondizzle	e231b98387	Fix mHC Sinkhorn test: row sums expected to be off (eps after softmax)	2026-06-02 10:46:28 +00:00
biondizzle	b5f29be169	Add mHC Sinkhorn CUDA kernel test	2026-06-02 10:45:02 +00:00
biondizzle	edc8e7ee8d	KV-1/KV-2: Mixed FP8+BF16 compressed KV (DeepSeek V4 paper format) Architecture matches paper: 'BF16 for RoPE dims, FP8 for remaining dims' - Non-RoPE dims (448 of 512): FP8_E4M3 storage → dequant to BF16 for FMHA - RoPE dims (64 of 512): BF16 storage (RoPE applied directly, no conversion) - Indexer keys: FP8_E4M3 (ihd=128, no RoPE) - SWA: BF16 (unchanged) Pipeline: Compressor → FP32 → split → [nope: FP32→FP8] + [rope: FP32→BF16→RoPE] Gather: [nope: FP8→BF16] + [rope: BF16] → concat → FMHA No BF16 intermediate for non-RoPE data. No FP32 intermediate after BF16 RoPE. BF16 is the final format consumed by FMHA (no further conversion). KVCache rewritten: - comp_nope_fp8/scale: FP8 storage for non-RoPE - comp_rope_bf16: BF16 storage for RoPE - comp_nope_selective/all: FP8→BF16 dequant - comp_rope_selective/all: BF16 gather - set_compressed_mixed: write mixed format - set_indexer_keys_fp8: write FP8 indexer keys	2026-06-02 10:08:43 +00:00
biondizzle	12b6365b42	Fix RoPE test: use proper cos/sin cache	2026-06-02 10:04:01 +00:00
biondizzle	bdb25ee5cd	Add production-value unit tests for kv_quantize kernels	2026-06-02 10:01:07 +00:00
biondizzle	d74ff5768d	KV diag test	2026-06-02 09:43:45 +00:00
biondizzle	f23320b5b2	KV-1/KV-2: Fused compress+NVFP4 quantize kernels + dequant - compressor_reduce_quant.cu: Single-kernel CSA/HCA compress + RMSNorm + NVFP4 quantize. No intermediate BF16. FP32 → E2M1 + E4M3 + FP32 gsa in one kernel. Shared memory: ~2.5KB per CTA (FP32 staging + nibble buffer). - dequant_nvfp4.cu: NVFP4 → BF16 dequantization kernels. Full dequant (HCA dense gather) and selective dequant (CSA top-k gather). Single kernel launch per gather operation. - production_compress.py: Added csa_compress_production_nvfp4() and hca_compress_production_nvfp4() — production path for KV-1/KV-2. - loader.py: Preload dequant_nvfp4 and compressor_reduce_quant modules. - test_kv_compress_quant.py: Unit tests verifying cos >= 0.999 between BF16 reference and NVFP4 round-trip path.	2026-06-02 09:37:53 +00:00
biondizzle	2bbbead984	P3: CUDA RoPE kernel — single launch per call (vs 5-6 PyTorch ops) New files: - dsv4/kernels/cuda/rope_cuda.cu: GPT-J interleaved RoPE kernel (forward+inverse) - dsv4/ops/rope_cuda.py: Python bridge with ctypes loading - tests/unit/test_rope_cuda.py: correctness test (cos >= 0.999998) Savings: ~915 launches/token → 183 launches/token	2026-06-02 09:05:22 +00:00
biondizzle	b13c1057f5	test: verify GEMM shape with production weight format	2026-06-02 08:43:40 +00:00
biondizzle	40fb49d670	test: verify GEMM output shape	2026-06-02 08:41:22 +00:00
biondizzle	5ed4c86137	fix: expert_offsets for 4-expert fused SwiGLU test	2026-06-02 08:24:32 +00:00
biondizzle	53362d2579	test: isolate fused SwiGLU — test no-clamp first	2026-06-02 08:23:28 +00:00
biondizzle	ae4506d722	fix: w_gs is scalar not iterable	2026-06-02 08:22:29 +00:00
biondizzle	b0c71b947e	test: fused SwiGLU — smoke test + correctness comparison with graceful degradation	2026-06-02 08:21:33 +00:00
biondizzle	2cfca36095	fix: compute correct gs from data in fused SwiGLU test	2026-06-02 08:20:27 +00:00
biondizzle	4a05a40cf0	fix: fused SwiGLU test — proper weight quant + 128-token alignment	2026-06-02 08:19:31 +00:00
biondizzle	fa769b6214	fix: pad activation as uint8 view for float4 dtype	2026-06-02 08:18:26 +00:00
biondizzle	024be1a60b	fix: test weight quantization dtype for fused SwiGLU test	2026-06-02 08:17:35 +00:00
biondizzle	55ea109cca	test: fused SwiGLU kernel compilation + correctness (P0/P1 gate)	2026-06-02 08:09:57 +00:00
biondizzle	9254cb0b0d	test: NVFP4 runtime gsa accuracy vs PyTorch reference	2026-06-02 04:31:18 +00:00
biondizzle	f52eedbdce	Add production-value tests: ALL tests use Pro config (61L, HD=512, 384 experts, HCA=128, 1M context) Previous unit tests used toy values (HD=64-256, T=16, small N). These tests validate the actual production configuration: - FMHA: HD=512, 128 Q heads, N=128/2048/8192 - Compression: CSA T=4096, HCA T=16384, full 1M context - NVFP4: production weight shapes (q_a, kv, wo_a, gate) - MoE: 384 experts, top-6, 3072 intermediate - mHC: 4 streams, 61 layers, residual bounded, doubly-stochastic - Router: 384 experts hash + noaux-TC - Memory budget: 1M context KV pool, 8-GPU weight distribution	2026-06-02 04:10:39 +00:00

1094 Commits
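
Several commits above gate correctness on cosine similarity between a kernel's output and a reference (for example "cos >= 0.999" for the NVFP4 round-trip and "cos >= 0.999998" for the RoPE kernel). The sketch below shows what such a gate might look like in plain Python. The function names `cosine_similarity` and `passes_cosine_gate` and the default threshold are assumptions made for illustration; they are not taken from the repository, whose tests presumably operate on GPU tensors.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length sequences of floats."""
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # The metric is undefined when either vector has zero length.
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def passes_cosine_gate(reference, candidate, threshold=0.999):
    """Hypothetical helper: True if candidate tracks reference closely enough."""
    return cosine_similarity(reference, candidate) >= threshold
```

A gate of this kind is insensitive to a uniform scale error in the output, so it is usually paired with a separate magnitude check when scale matters.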