nvfp4-megamoe-kernel

Author	SHA1	Message	Date
biondizzle	2bbbead984	P3: CUDA RoPE kernel — single launch per call (vs 5-6 PyTorch ops) New files: - dsv4/kernels/cuda/rope_cuda.cu: GPT-J interleaved RoPE kernel (forward+inverse) - dsv4/ops/rope_cuda.py: Python bridge with ctypes loading - tests/unit/test_rope_cuda.py: correctness test (cos >= 0.999998) Savings: ~915 launches/token → 183 launches/token	2026-06-02 09:05:22 +00:00
biondizzle	b13c1057f5	test: verify GEMM shape with production weight format	2026-06-02 08:43:40 +00:00
biondizzle	40fb49d670	test: verify GEMM output shape	2026-06-02 08:41:22 +00:00
biondizzle	5ed4c86137	fix: expert_offsets for 4-expert fused SwiGLU test	2026-06-02 08:24:32 +00:00
biondizzle	53362d2579	test: isolate fused SwiGLU — test no-clamp first	2026-06-02 08:23:28 +00:00
biondizzle	ae4506d722	fix: w_gs is scalar not iterable	2026-06-02 08:22:29 +00:00
biondizzle	b0c71b947e	test: fused SwiGLU — smoke test + correctness comparison with graceful degradation	2026-06-02 08:21:33 +00:00
biondizzle	2cfca36095	fix: compute correct gs from data in fused SwiGLU test	2026-06-02 08:20:27 +00:00
biondizzle	4a05a40cf0	fix: fused SwiGLU test — proper weight quant + 128-token alignment	2026-06-02 08:19:31 +00:00
biondizzle	fa769b6214	fix: pad activation as uint8 view for float4 dtype	2026-06-02 08:18:26 +00:00
biondizzle	024be1a60b	fix: test weight quantization dtype for fused SwiGLU test	2026-06-02 08:17:35 +00:00
biondizzle	55ea109cca	test: fused SwiGLU kernel compilation + correctness (P0/P1 gate)	2026-06-02 08:09:57 +00:00
biondizzle	9254cb0b0d	test: NVFP4 runtime gsa accuracy vs PyTorch reference	2026-06-02 04:31:18 +00:00
biondizzle	f52eedbdce	Add production-value tests: ALL tests use Pro config (61L, HD=512, 384 experts, HCA=128, 1M context) Previous unit tests used toy values (HD=64-256, T=16, small N). These tests validate the actual production configuration: - FMHA: HD=512, 128 Q heads, N=128/2048/8192 - Compression: CSA T=4096, HCA T=16384, full 1M context - NVFP4: production weight shapes (q_a, kv, wo_a, gate) - MoE: 384 experts, top-6, 3072 intermediate - mHC: 4 streams, 61 layers, residual bounded, doubly-stochastic - Router: 384 experts hash + noaux-TC - Memory budget: 1M context KV pool, 8-GPU weight distribution	2026-06-02 04:10:39 +00:00
biondizzle	9d57b0453b	auto: pre-test commit	2026-06-01 15:04:46 +00:00
biondizzle	3b2714410f	Add NVFP4 linear accuracy test: prod vs ref with all-ones input	2026-06-01 14:15:27 +00:00
biondizzle	3e47d5f20a	Add prod vs ref GEMM comparison test + gate logits diagnostic	2026-06-01 14:11:37 +00:00
biondizzle	7b3f6cb13c	Fix fused router: use run_nvfp4_fused_router wrapper, correct CuTe tensor API - kernel wrapper converts torch tensors to CuTe tensors with mark_layout_dynamic - test uses the wrapper instead of calling kernel.run() directly - mat_b/scale_b are now torch tensors (converted inside wrapper)	2026-06-01 09:19:48 +00:00
biondizzle	483e759d53	Fix: use tensor.mark_layout_dynamic() method (not cute.mark_layout_dynamic)	2026-06-01 09:16:33 +00:00
biondizzle	2412745b21	Test fix: slice NVFP4 logits to actual expert count (GEMM padding)	2026-06-01 09:15:06 +00:00
biondizzle	4f4ae8febd	Test: enumerate CuTeDSL math API to check available operations	2026-06-01 09:11:29 +00:00
biondizzle	9b86b2b414	Test: fix fused router test - proper NVFP4 quantization and CuTe tensor setup - Use quantize_to_nvfp4 for weight quantization - Use quantize_activation_nvfp4 with computed global_scale - Get mat_b and scale_b from Nvfp4Linear after finalize_weights - Compare against both BF16 reference and NVFP4 GEMM reference	2026-06-01 08:56:20 +00:00
biondizzle	b94f8d4ed8	Test: fused router kernel vs BF16 reference path - BF16 GEMM + activation_topk as reference - NVFP4 GEMM + fused router epilogue as test target - Proper NVFP4 quantization and CuTe tensor creation - Cosine similarity and topk_ids matching validation	2026-06-01 08:54:24 +00:00
biondizzle	2433700a69	Fused router kernel: rewrite epilogue with proper CuTeDSL constructs - Replace Python lists with individual scalar variables (s0..s5, i0..i5, a0..a5) - Replace min-heap sift-down with fully unrolled sorted insertion (descending order, no dynamic indexing, no while loops) - Replace raw SMEM pointer arithmetic with CuTeDSL SMEM tensors (s_merge_s, s_merge_i, s_merge_a) - Replace cute.where with cute.math.fmax - Fix expert index calculation: col + tile_n_offset + subtile_idx * epi_n - Top-6 accumulates across all N-tiles (for E=384 with 3 tiles of 128) - Add iter_acc_early_release for overlapping accumulator - Rewrite test to compare fused kernel vs 2-kernel reference path - Remove stale memory doc	2026-06-01 08:49:39 +00:00
biondizzle	25b9a5f32d	Fix test: use from_dlpack for c_tensor	2026-06-01 07:55:29 +00:00
biondizzle	d2819fc39c	Fix test: use as_tensor instead of make_tensor	2026-06-01 07:54:36 +00:00
biondizzle	5ea71ebd78	Add NVFP4 CuTeDSL compilation test (verify MmaMXF4NVF4Op compiles)	2026-06-01 07:53:43 +00:00
biondizzle	0553117af6	Simplify fused router test: compare fused vs 2-kernel NVFP4 path	2026-06-01 07:10:55 +00:00
biondizzle	44a0e59808	Fix fused router test: use quantize_weight_to_nvfp4 (correct function name)	2026-06-01 07:08:56 +00:00
biondizzle	940f37fb6c	NVFP4 fused router kernel: full rewrite with proper block-scaled GEMM setup Major fixes: - Added tiled_mma_sfb creation (always CtaGroup.ONE, rounded N) - Added mma_tiler_sfb, cta_tile_shape_mnk_sfb, cluster_layout_sfb_vmnk - Use blockscaled_utils.make_smem_layout_sfa/sfb (with sf_vec_size) instead of sm100_utils (which doesn't support block-scaled SF layouts) - Proper TMEM column accounting for SFA + SFB + accumulator - Fixed make_blockscaled_trivial_tiled_mma argument order (a_dtype, b_dtype, a_major, b_major, sf_dtype, sf_vec_size, cta_group, mma_inst_shape) - Fixed SFB TMA atom to use tiled_mma_sfb and cluster_layout_sfb_vmnk - Fixed SFB partition_SFB to use tiled_mma_sfb.get_slice - Fixed SFB global tile partitioning to use mma_tiler_sfb - Fixed mainloop_s2t_copy_and_partition to use TMEM fragments (make_fragment_SFA/SFB) as the tSF parameter - Updated run_nvfp4_fused_router wrapper to accept processed weight tensors from Nvfp4Linear._mat_b and _scale_b - Updated test to properly build Nvfp4Linear and use processed weights The old code was a rough sketch that never worked — it was missing the entire tiled_mma_sfb infrastructure, used wrong SMEM layout functions, and had broken TMA atom setup for scale factors.	2026-06-01 07:08:12 +00:00
biondizzle	e6803b450d	rewrite: simplified fused router test (reference + import check)	2026-06-01 06:53:17 +00:00
biondizzle	262cec262d	fix: add shape assertions to fused router test	2026-06-01 06:51:47 +00:00
biondizzle	db07d17a62	fix: set activation global scale in fused router test	2026-06-01 06:50:41 +00:00
biondizzle	2abb4a19d9	fix: set gs and ws2 fields for Nvfp4Linear in fused router test	2026-06-01 06:49:43 +00:00
biondizzle	61c04f7152	fix: Nvfp4Linear field is sf not scale_b	2026-06-01 06:48:39 +00:00
biondizzle	982f245c67	fix: use correct Nvfp4Linear field names (fp4, scale_b, gsb)	2026-06-01 06:47:15 +00:00
biondizzle	16af96380f	fix: use internal fields for Nvfp4Linear weight setup in test	2026-06-01 06:46:05 +00:00
biondizzle	7f1f224c78	fix: quantize_weight_to_nvfp4 returns 3 values, not 4	2026-06-01 06:43:53 +00:00
biondizzle	27fd847dd0	fix: correct quantize function name in fused router test	2026-06-01 06:41:54 +00:00
biondizzle	0873d65253	test: add fused router kernel test Compares NVFP4 fused CuTeDSL kernel against reference (Nvfp4Linear + activation_topk) for correctness.	2026-06-01 06:40:46 +00:00
biondizzle	9f14cb17d1	test: add compressor position_bias unit test Verifies CUDA kernel matches PyTorch reference with and without position_bias for both CSA (m=4) and HCA (m=128) paths.	2026-06-01 05:55:05 +00:00
biondizzle	2155fd6c90	test: production compressor kernel unit test	2026-06-01 05:19:13 +00:00
biondizzle	13be3ad443	FMHA sink bias in kernel + single_shot production rewrite FMHA kernel (fmha_6warp_tma_multirow_multitile.cuh): - Added sink_bias field to FmhaTmaMultiRowMultiTileParams - After KV tile loop, sink logit is included in online softmax rescale: new_max = max(running_max, sink_bias * scale) rescale existing O_unnorm and running_sum running_sum += exp(sink_bias * scale - new_max) No PV contribution from sink (D5c: single softmax) - C API: fmha_multitile_decode_launch now takes sink_bias_ptr - Python: fmha_multitile_decode_raw accepts attn_sink tensor single_shot_inference.py: - Full rewrite to use production kernel stack - mHC: uses dsv4.layers.mhc.mHCLayer (proper Sinkhorn-Knopp) - Projections: uses Nvfp4Linear (CuTeDSL GEMM) for q_a, q_b, kv, o_b - FMHA: 6-warp TMA multi-tile with sink bias (no SDPA fallback) - MoE: Nvfp4MoE + Nvfp4SharedExpert (no reference fallback) - Router: production dense/hash dispatch - Compressor/Indexer: reference dequant (not yet on tensor cores) - NO try/except fallbacks on production paths	2026-05-31 23:10:13 +00:00
biondizzle	baee36e728	Fix dtype mismatch in validate_layer: cast flat to float before F.linear	2026-05-31 20:23:18 +00:00
biondizzle	46c4ef2cf5	Add per-layer validation test (tests/validate_layer.py) Compares forward_layer output with step-by-step PyTorch reference to identify where residual blowup originates. Uses our own NVFP4 dequant — no HF dependency.	2026-05-31 20:22:13 +00:00
biondizzle	98fa410167	Add HF reference test script	2026-05-31 20:11:37 +00:00
biondizzle	7d9e70c5d5	Fix remaining mHC API references: layer_compare.py, layer.py comment	2026-05-31 18:38:34 +00:00
biondizzle	f6c02f808f	Add layer-by-layer comparison test for debugging	2026-05-31 12:48:43 +00:00
biondizzle	6ad577bd18	Add HuggingFace reference comparison test	2026-05-31 12:05:19 +00:00
biondizzle	429fc3db40	Fix expert weight indexing for 1D tensor	2026-05-31 09:23:10 +00:00
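
Note on commit 2bbbead984 (CUDA RoPE kernel): the message names a "GPT-J interleaved" RoPE with forward and inverse modes. The following is a minimal pure-Python sketch of that convention for a single head vector, not the kernel itself. It assumes adjacent elements (x[2i], x[2i+1]) form a rotation pair, a frequency schedule of base^(-2i/d), and base = 10000; none of these are confirmed by the log, and the function name rope_interleaved is hypothetical.

```python
import math

def rope_interleaved(x, pos, base=10000.0, inverse=False):
    """Rotate adjacent pairs (x[2i], x[2i+1]) by the angle pos * base**(-2i/d).

    Assumptions (not taken from the commit): base=10000 and this
    frequency schedule. inverse=True rotates by the negated angle,
    which undoes the forward rotation.
    """
    d = len(x)
    if d % 2 != 0:
        raise ValueError("head dimension must be even")
    out = [0.0] * d
    for i in range(d // 2):
        theta = pos * base ** (-2.0 * i / d)
        if inverse:
            theta = -theta
        c, s = math.cos(theta), math.sin(theta)
        a, b = x[2 * i], x[2 * i + 1]
        out[2 * i] = a * c - b * s      # real part of (a + ib) * e^{i*theta}
        out[2 * i + 1] = a * s + b * c  # imaginary part
    return out

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(p * q for p, q in zip(u, v))
    nu = math.sqrt(sum(p * p for p in u))
    nv = math.sqrt(sum(q * q for q in v))
    return dot / (nu * nv)

if __name__ == "__main__":
    x = [0.5, -1.25, 2.0, 0.75, -0.5, 1.5, 3.0, -2.0]
    y = rope_interleaved(x, pos=17)
    x_back = rope_interleaved(y, pos=17, inverse=True)
    # Forward followed by inverse should reproduce the input.
    print(cosine_similarity(x, x_back))
```

A cosine-similarity comparison like the one above is one way a correctness check in the spirit of the commit's "cos >= 0.999998" threshold could be written. The launch counts in the message are consistent with 183 RoPE calls per token at 5 PyTorch ops each (183 × 5 = 915), which a single fused launch per call reduces to 183.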

1058 Commits
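
Note on commit 13be3ad443 (FMHA sink bias): the message describes folding a sink logit into the online softmax after the KV tile loop, contributing to the denominator but not to the PV accumulation. The following is a minimal pure-Python sketch of that update for one query row, checked against a direct softmax. It assumes the same scale is applied to the q·k products and to sink_bias; the function names, the tile size, and the list-based layout are hypothetical and are not the kernel's API.

```python
import math

def attend_with_sink(qk, values, sink_bias, scale, tile=2):
    """Online-softmax attention for one query row, with a sink logit.

    qk: unscaled q.k dot products; values: one value vector per key.
    The sink adds exp(sink_bias * scale - max) to the softmax
    denominator but contributes nothing to the output numerator.
    """
    running_max = -math.inf
    running_sum = 0.0
    o_unnorm = [0.0] * len(values[0])
    for start in range(0, len(qk), tile):
        s = [x * scale for x in qk[start:start + tile]]
        new_max = max(running_max, max(s))
        corr = math.exp(running_max - new_max)  # 0.0 on the first tile
        running_sum *= corr
        o_unnorm = [o * corr for o in o_unnorm]
        for j, sj in enumerate(s):
            p = math.exp(sj - new_max)
            running_sum += p
            o_unnorm = [o + p * v for o, v in zip(o_unnorm, values[start + j])]
        running_max = new_max
    # Sink step, after the KV tile loop: rescale, then grow only the sum.
    sink_logit = sink_bias * scale
    new_max = max(running_max, sink_logit)
    corr = math.exp(running_max - new_max)
    o_unnorm = [o * corr for o in o_unnorm]
    running_sum = running_sum * corr + math.exp(sink_logit - new_max)
    return [o / running_sum for o in o_unnorm]

def reference_with_sink(qk, values, sink_bias, scale):
    """Direct softmax over [scaled logits..., sink logit]; sink weight is dropped."""
    logits = [x * scale for x in qk] + [sink_bias * scale]
    m = max(logits)
    w = [math.exp(l - m) for l in logits]
    z = sum(w)
    dim = len(values[0])
    return [sum(w[j] * values[j][k] for j in range(len(values))) / z
            for k in range(dim)]

if __name__ == "__main__":
    qk = [0.5, -1.0, 2.0, 0.25, 1.5]
    values = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [-1.0, 0.5], [0.5, -0.5]]
    print(attend_with_sink(qk, values, sink_bias=0.7, scale=0.125))
    print(reference_with_sink(qk, values, sink_bias=0.7, scale=0.125))
```

Because the sink only enlarges the denominator, the attention weights over real keys sum to less than 1, so a row attending to all-ones values produces an output strictly below 1.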