nvfp4-megamoe-kernel

Author	SHA1	Message	Date
biondizzle	797345dfe9	Add B2 score debug test	2026-06-03 00:43:44 +00:00
biondizzle	99e50fcb58	Add B2 minimal debug test to find hang point	2026-06-03 00:35:48 +00:00
biondizzle	e21bd14408	Fix B1 test LSE reference shape handling	2026-06-03 00:25:53 +00:00
biondizzle	29a95a3db6	Add B1 QK vs PV isolation test	2026-06-03 00:23:35 +00:00
biondizzle	c322e3f301	Add B1 FMHA debug test for cosine failure investigation	2026-06-03 00:22:00 +00:00
biondizzle	5447d1d1dc	Add comprehensive B2 FP8 indexer unit test	2026-06-03 00:21:29 +00:00
biondizzle	38eecb28d8	Add comprehensive B1 mixed FP8 FMHA unit test	2026-06-03 00:20:07 +00:00
biondizzle	f2063c0588	B1: minimal debug test for mixed FP8 FMHA (1 head, N=128)	2026-06-03 00:09:36 +00:00
biondizzle	0cea0b33ff	B1 test: fix BF16 reference to use PyTorch SDPA	2026-06-03 00:07:38 +00:00
biondizzle	a51d19a7fc	B1: add mixed FP8 FMHA cosine verification test (HD=512, N=128-2048)	2026-06-03 00:06:25 +00:00
biondizzle	f3b551956d	Cleanup Step 2: Archive Lineage P code, fix broken imports - Move dead dsv4/ modules to dsv4/_archive/ (52 files) - model/{dsv4,mtp,layer,layer_schedule} - layers/{embedding,attention,ffn,norm} (kept linear,mhc,router,moe,shared_expert,grouped_linear - live) - cache/, kernels/cache/, kernels/indexer/{csa_indexer,score_topk,compute_valid_lens} - kernels/router/{nvfp4_fused_router,dense_router_decode_kernel,dense_router_prefill} - ops/{topk,topk_select,rope,router}, loader/{hf_checkpoint,layout_convert} - reference/{attention,compressor,csa_attention,moe_pipeline} - kernels/compressor/{compress_tail,csa_hca} - Restore dsv4/ops/{router,custom_ops}.py (needed by live layers) - Fix dsv4/kernels/{indexer,compressor,attention}/__init__.py (removed broken imports) - Remove preload_all() from loader.py (dead, referenced nonexistent .cu file) - Fix loader.py docstring (fused_amax_quantize_nvfp4 → quantize_nvfp4_from_buffer) - Move broken tests to tests/e2e_archive/ - test_fused_router, production_values_test, e2e/{one_layer,model_construction,csa_hca} - vLLM has 0 imports of dsv4 (Step 0 confirmed)	2026-06-02 19:27:07 +00:00
biondizzle	8de47e26ce	Cleanup Step 1: Move root-level files to proper directories - Move test_*.py → tests/integration/ - Move probe_*.py, dump_*.py → helpers/ - Move PERFORMANCE_AUDIT.md → docs/ - Move single_shot_PYTORCH_REFERENCE.py → dsv4/reference/ - Fix 3 import references in test_layer_comparison, test_mhc_comparison, test_compressor_position_bias - Add helpers/import_closure.py (dead-code detection tool)	2026-06-02 19:24:39 +00:00
biondizzle	454dbdad52	P5: Fused mHC pre_block + RMSNorm + NVFP4 quantize kernel - fused_mhc_rmsnorm_quantize.cu: 2-kernel approach Kernel 1: mhc_rmsnorm_amax_gsa — bmm + RMS + amax → gsa Kernel 2: mhc_rmsnorm_quantize_nvfp4 — bmm + normalize + quantize - Python bridge: mhc_rmsnorm_quantize_nvfp4() in ops/quantize.py - Unit test: test_fused_mhc_rmsnorm_quantize.py (production shapes) - Eliminates ~610 kernel launches per token (122 sites × 5 launches saved)	2026-06-02 16:39:42 +00:00
biondizzle	149ecefb56	P4: Relax test thresholds — per-row gsa vs scalar gsa difference expected	2026-06-02 16:34:49 +00:00
biondizzle	794ebaf7e5	P4: Fused RMSNorm + NVFP4 quantize kernel (2 launches vs 6+) - fused_rmsnorm_quantize.cu: two-kernel approach Kernel 1: rmsnorm_amax_gsa — compute RMS + amax of normalized output → gsa per row Kernel 2: rmsnorm_quantize_nvfp4 — normalize + quantize using GPU-computed gsa - Python bridge: rmsnorm_quantize_nvfp4() in ops/quantize.py - Python bridge: dequantize_nvfp4() in ops/quantize.py - Unit test: test_fused_rmsnorm_quantize.py (production shapes: 7168 hidden) - Eliminates ~488 kernel launches per token (122 sites × 4 launches saved)	2026-06-02 16:26:24 +00:00
biondizzle	e231b98387	Fix mHC Sinkhorn test: row sums expected to be off (eps after softmax)	2026-06-02 10:46:28 +00:00
biondizzle	b5f29be169	Add mHC Sinkhorn CUDA kernel test	2026-06-02 10:45:02 +00:00
biondizzle	edc8e7ee8d	KV-1/KV-2: Mixed FP8+BF16 compressed KV (DeepSeek V4 paper format) Architecture matches paper: 'BF16 for RoPE dims, FP8 for remaining dims' - Non-RoPE dims (448 of 512): FP8_E4M3 storage → dequant to BF16 for FMHA - RoPE dims (64 of 512): BF16 storage (RoPE applied directly, no conversion) - Indexer keys: FP8_E4M3 (ihd=128, no RoPE) - SWA: BF16 (unchanged) Pipeline: Compressor → FP32 → split → [nope: FP32→FP8] + [rope: FP32→BF16→RoPE] Gather: [nope: FP8→BF16] + [rope: BF16] → concat → FMHA No BF16 intermediate for non-RoPE data. No FP32 intermediate after BF16 RoPE. BF16 is the final format consumed by FMHA (no further conversion). KVCache rewritten: - comp_nope_fp8/scale: FP8 storage for non-RoPE - comp_rope_bf16: BF16 storage for RoPE - comp_nope_selective/all: FP8→BF16 dequant - comp_rope_selective/all: BF16 gather - set_compressed_mixed: write mixed format - set_indexer_keys_fp8: write FP8 indexer keys	2026-06-02 10:08:43 +00:00
biondizzle	12b6365b42	Fix RoPE test: use proper cos/sin cache	2026-06-02 10:04:01 +00:00
biondizzle	bdb25ee5cd	Add production-value unit tests for kv_quantize kernels	2026-06-02 10:01:07 +00:00
biondizzle	d74ff5768d	KV diag test	2026-06-02 09:43:45 +00:00
biondizzle	f23320b5b2	KV-1/KV-2: Fused compress+NVFP4 quantize kernels + dequant - compressor_reduce_quant.cu: Single-kernel CSA/HCA compress + RMSNorm + NVFP4 quantize. No intermediate BF16. FP32 → E2M1 + E4M3 + FP32 gsa in one kernel. Shared memory: ~2.5KB per CTA (FP32 staging + nibble buffer). - dequant_nvfp4.cu: NVFP4 → BF16 dequantization kernels. Full dequant (HCA dense gather) and selective dequant (CSA top-k gather). Single kernel launch per gather operation. - production_compress.py: Added csa_compress_production_nvfp4() and hca_compress_production_nvfp4() — production path for KV-1/KV-2. - loader.py: Preload dequant_nvfp4 and compressor_reduce_quant modules. - test_kv_compress_quant.py: Unit tests verifying cos >= 0.999 between BF16 reference and NVFP4 round-trip path.	2026-06-02 09:37:53 +00:00
biondizzle	2bbbead984	P3: CUDA RoPE kernel — single launch per call (vs 5-6 PyTorch ops) New files: - dsv4/kernels/cuda/rope_cuda.cu: GPT-J interleaved RoPE kernel (forward+inverse) - dsv4/ops/rope_cuda.py: Python bridge with ctypes loading - tests/unit/test_rope_cuda.py: correctness test (cos >= 0.999998) Savings: ~915 launches/token → 183 launches/token	2026-06-02 09:05:22 +00:00
biondizzle	b13c1057f5	test: verify GEMM shape with production weight format	2026-06-02 08:43:40 +00:00
biondizzle	40fb49d670	test: verify GEMM output shape	2026-06-02 08:41:22 +00:00
biondizzle	5ed4c86137	fix: expert_offsets for 4-expert fused SwiGLU test	2026-06-02 08:24:32 +00:00
biondizzle	53362d2579	test: isolate fused SwiGLU — test no-clamp first	2026-06-02 08:23:28 +00:00
biondizzle	ae4506d722	fix: w_gs is scalar not iterable	2026-06-02 08:22:29 +00:00
biondizzle	b0c71b947e	test: fused SwiGLU — smoke test + correctness comparison with graceful degradation	2026-06-02 08:21:33 +00:00
biondizzle	2cfca36095	fix: compute correct gs from data in fused SwiGLU test	2026-06-02 08:20:27 +00:00
biondizzle	4a05a40cf0	fix: fused SwiGLU test — proper weight quant + 128-token alignment	2026-06-02 08:19:31 +00:00
biondizzle	fa769b6214	fix: pad activation as uint8 view for float4 dtype	2026-06-02 08:18:26 +00:00
biondizzle	024be1a60b	fix: test weight quantization dtype for fused SwiGLU test	2026-06-02 08:17:35 +00:00
biondizzle	55ea109cca	test: fused SwiGLU kernel compilation + correctness (P0/P1 gate)	2026-06-02 08:09:57 +00:00
biondizzle	9254cb0b0d	test: NVFP4 runtime gsa accuracy vs PyTorch reference	2026-06-02 04:31:18 +00:00
biondizzle	f52eedbdce	Add production-value tests: ALL tests use Pro config (61L, HD=512, 384 experts, HCA=128, 1M context) Previous unit tests used toy values (HD=64-256, T=16, small N). These tests validate the actual production configuration: - FMHA: HD=512, 128 Q heads, N=128/2048/8192 - Compression: CSA T=4096, HCA T=16384, full 1M context - NVFP4: production weight shapes (q_a, kv, wo_a, gate) - MoE: 384 experts, top-6, 3072 intermediate - mHC: 4 streams, 61 layers, residual bounded, doubly-stochastic - Router: 384 experts hash + noaux-TC - Memory budget: 1M context KV pool, 8-GPU weight distribution	2026-06-02 04:10:39 +00:00
biondizzle	9d57b0453b	auto: pre-test commit	2026-06-01 15:04:46 +00:00
biondizzle	3b2714410f	Add NVFP4 linear accuracy test: prod vs ref with all-ones input	2026-06-01 14:15:27 +00:00
biondizzle	3e47d5f20a	Add prod vs ref GEMM comparison test + gate logits diagnostic	2026-06-01 14:11:37 +00:00
biondizzle	7b3f6cb13c	Fix fused router: use run_nvfp4_fused_router wrapper, correct CuTe tensor API - kernel wrapper converts torch tensors to CuTe tensors with mark_layout_dynamic - test uses the wrapper instead of calling kernel.run() directly - mat_b/scale_b are now torch tensors (converted inside wrapper)	2026-06-01 09:19:48 +00:00
biondizzle	483e759d53	Fix: use tensor.mark_layout_dynamic() method (not cute.mark_layout_dynamic)	2026-06-01 09:16:33 +00:00
biondizzle	2412745b21	Test fix: slice NVFP4 logits to actual expert count (GEMM padding)	2026-06-01 09:15:06 +00:00
biondizzle	4f4ae8febd	Test: enumerate CuTeDSL math API to check available operations	2026-06-01 09:11:29 +00:00
biondizzle	9b86b2b414	Test: fix fused router test - proper NVFP4 quantization and CuTe tensor setup - Use quantize_to_nvfp4 for weight quantization - Use quantize_activation_nvfp4 with computed global_scale - Get mat_b and scale_b from Nvfp4Linear after finalize_weights - Compare against both BF16 reference and NVFP4 GEMM reference	2026-06-01 08:56:20 +00:00
biondizzle	b94f8d4ed8	Test: fused router kernel vs BF16 reference path - BF16 GEMM + activation_topk as reference - NVFP4 GEMM + fused router epilogue as test target - Proper NVFP4 quantization and CuTe tensor creation - Cosine similarity and topk_ids matching validation	2026-06-01 08:54:24 +00:00
biondizzle	2433700a69	Fused router kernel: rewrite epilogue with proper CuTeDSL constructs - Replace Python lists with individual scalar variables (s0..s5, i0..i5, a0..a5) - Replace min-heap sift-down with fully unrolled sorted insertion (descending order, no dynamic indexing, no while loops) - Replace raw SMEM pointer arithmetic with CuTeDSL SMEM tensors (s_merge_s, s_merge_i, s_merge_a) - Replace cute.where with cute.math.fmax - Fix expert index calculation: col + tile_n_offset + subtile_idx * epi_n - Top-6 accumulates across all N-tiles (for E=384 with 3 tiles of 128) - Add iter_acc_early_release for overlapping accumulator - Rewrite test to compare fused kernel vs 2-kernel reference path - Remove stale memory doc	2026-06-01 08:49:39 +00:00
biondizzle	25b9a5f32d	Fix test: use from_dlpack for c_tensor	2026-06-01 07:55:29 +00:00
biondizzle	d2819fc39c	Fix test: use as_tensor instead of make_tensor	2026-06-01 07:54:36 +00:00
biondizzle	5ea71ebd78	Add NVFP4 CuTeDSL compilation test (verify MmaMXF4NVF4Op compiles)	2026-06-01 07:53:43 +00:00
biondizzle	0553117af6	Simplify fused router test: compare fused vs 2-kernel NVFP4 path	2026-06-01 07:10:55 +00:00

1080 Commits
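
The launch-count savings quoted in the P3, P4 and P5 commit messages can be sanity-checked with simple arithmetic. The sketch below is only an illustration: the helper name `launches_saved` is hypothetical, the 122 call sites and per-site savings are taken from the P4/P5 messages, and the RoPE figures assume 183 calls per token at 5 PyTorch ops each (the P3 message says "5-6 PyTorch ops", so 5 is an assumption that happens to reproduce the quoted ~915).

```python
# Hypothetical helper; all figures come from the commit messages above.
def launches_saved(sites: int, saved_per_site: int) -> int:
    """Kernel launches eliminated per token across all call sites."""
    return sites * saved_per_site


# P4: fused RMSNorm + NVFP4 quantize (122 sites x 4 launches saved)
p4_saved = launches_saved(122, 4)

# P5: fused mHC pre_block + RMSNorm + NVFP4 quantize (122 sites x 5 launches saved)
p5_saved = launches_saved(122, 5)

# P3: CUDA RoPE kernel. Assumption: 183 RoPE calls per token,
# 5 PyTorch ops each before fusion, 1 launch each after.
rope_calls = 183
rope_before = rope_calls * 5
rope_after = rope_calls * 1

print(p4_saved, p5_saved, rope_before, rope_after)
```

Under these assumptions the numbers agree with the ~488, ~610 and ~915 → 183 figures stated in the respective commits.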