Commit Graph

1047 Commits

Author SHA1 Message Date
55ea109cca test: fused SwiGLU kernel compilation + correctness (P0/P1 gate) 2026-06-02 08:09:57 +00:00
9254cb0b0d test: NVFP4 runtime gsa accuracy vs PyTorch reference 2026-06-02 04:31:18 +00:00
f52eedbdce Add production-value tests: ALL tests use Pro config (61L, HD=512, 384 experts, HCA=128, 1M context)
Previous unit tests used toy values (HD=64-256, T=16, small N).
These tests validate the actual production configuration:
- FMHA: HD=512, 128 Q heads, N=128/2048/8192
- Compression: CSA T=4096, HCA T=16384, full 1M context
- NVFP4: production weight shapes (q_a, kv, wo_a, gate)
- MoE: 384 experts, top-6, 3072 intermediate
- mHC: 4 streams, 61 layers, residual bounded, doubly-stochastic
- Router: 384 experts hash + noaux-TC
- Memory budget: 1M context KV pool, 8-GPU weight distribution
2026-06-02 04:10:39 +00:00
9d57b0453b auto: pre-test commit 2026-06-01 15:04:46 +00:00
3b2714410f Add NVFP4 linear accuracy test: prod vs ref with all-ones input 2026-06-01 14:15:27 +00:00
3e47d5f20a Add prod vs ref GEMM comparison test + gate logits diagnostic 2026-06-01 14:11:37 +00:00
7b3f6cb13c Fix fused router: use run_nvfp4_fused_router wrapper, correct CuTe tensor API
- kernel wrapper converts torch tensors to CuTe tensors with mark_layout_dynamic
- test uses the wrapper instead of calling kernel.run() directly
- mat_b/scale_b are now torch tensors (converted inside wrapper)
2026-06-01 09:19:48 +00:00
483e759d53 Fix: use tensor.mark_layout_dynamic() method (not cute.mark_layout_dynamic) 2026-06-01 09:16:33 +00:00
2412745b21 Test fix: slice NVFP4 logits to actual expert count (GEMM padding) 2026-06-01 09:15:06 +00:00
4f4ae8febd Test: enumerate CuTeDSL math API to check available operations 2026-06-01 09:11:29 +00:00
9b86b2b414 Test: fix fused router test - proper NVFP4 quantization and CuTe tensor setup
- Use quantize_to_nvfp4 for weight quantization
- Use quantize_activation_nvfp4 with computed global_scale
- Get mat_b and scale_b from Nvfp4Linear after finalize_weights
- Compare against both BF16 reference and NVFP4 GEMM reference
2026-06-01 08:56:20 +00:00
b94f8d4ed8 Test: fused router kernel vs BF16 reference path
- BF16 GEMM + activation_topk as reference
- NVFP4 GEMM + fused router epilogue as test target
- Proper NVFP4 quantization and CuTe tensor creation
- Cosine similarity and topk_ids matching validation
2026-06-01 08:54:24 +00:00
2433700a69 Fused router kernel: rewrite epilogue with proper CuTeDSL constructs
- Replace Python lists with individual scalar variables (s0..s5, i0..i5, a0..a5)
- Replace min-heap sift-down with fully unrolled sorted insertion
  (descending order, no dynamic indexing, no while loops)
- Replace raw SMEM pointer arithmetic with CuTeDSL SMEM tensors
  (s_merge_s, s_merge_i, s_merge_a)
- Replace cute.where with cute.math.fmax
- Fix expert index calculation: col + tile_n_offset + subtile_idx * epi_n
- Top-6 accumulates across all N-tiles (for E=384 with 3 tiles of 128)
- Add iter_acc_early_release for overlapping accumulator
- Rewrite test to compare fused kernel vs 2-kernel reference path
- Remove stale memory doc
2026-06-01 08:49:39 +00:00
25b9a5f32d Fix test: use from_dlpack for c_tensor 2026-06-01 07:55:29 +00:00
d2819fc39c Fix test: use as_tensor instead of make_tensor 2026-06-01 07:54:36 +00:00
5ea71ebd78 Add NVFP4 CuTeDSL compilation test (verify MmaMXF4NVF4Op compiles) 2026-06-01 07:53:43 +00:00
0553117af6 Simplify fused router test: compare fused vs 2-kernel NVFP4 path 2026-06-01 07:10:55 +00:00
44a0e59808 Fix fused router test: use quantize_weight_to_nvfp4 (correct function name) 2026-06-01 07:08:56 +00:00
940f37fb6c NVFP4 fused router kernel: full rewrite with proper block-scaled GEMM setup
Major fixes:
- Added tiled_mma_sfb creation (always CtaGroup.ONE, rounded N)
- Added mma_tiler_sfb, cta_tile_shape_mnk_sfb, cluster_layout_sfb_vmnk
- Use blockscaled_utils.make_smem_layout_sfa/sfb (with sf_vec_size)
  instead of sm100_utils (which doesn't support block-scaled SF layouts)
- Proper TMEM column accounting for SFA + SFB + accumulator
- Fixed make_blockscaled_trivial_tiled_mma argument order
  (a_dtype, b_dtype, a_major, b_major, sf_dtype, sf_vec_size, cta_group, mma_inst_shape)
- Fixed SFB TMA atom to use tiled_mma_sfb and cluster_layout_sfb_vmnk
- Fixed SFB partition_SFB to use tiled_mma_sfb.get_slice
- Fixed SFB global tile partitioning to use mma_tiler_sfb
- Fixed mainloop_s2t_copy_and_partition to use TMEM fragments
  (make_fragment_SFA/SFB) as the tSF parameter
- Updated run_nvfp4_fused_router wrapper to accept processed weight
  tensors from Nvfp4Linear._mat_b and _scale_b
- Updated test to properly build Nvfp4Linear and use processed weights

The old code was a rough sketch that never worked — it was missing
the entire tiled_mma_sfb infrastructure, used wrong SMEM layout
functions, and had broken TMA atom setup for scale factors.
2026-06-01 07:08:12 +00:00
e6803b450d rewrite: simplified fused router test (reference + import check) 2026-06-01 06:53:17 +00:00
262cec262d fix: add shape assertions to fused router test 2026-06-01 06:51:47 +00:00
db07d17a62 fix: set activation global scale in fused router test 2026-06-01 06:50:41 +00:00
2abb4a19d9 fix: set gs and ws2 fields for Nvfp4Linear in fused router test 2026-06-01 06:49:43 +00:00
61c04f7152 fix: Nvfp4Linear field is sf not scale_b 2026-06-01 06:48:39 +00:00
982f245c67 fix: use correct Nvfp4Linear field names (fp4, scale_b, gsb) 2026-06-01 06:47:15 +00:00
16af96380f fix: use internal fields for Nvfp4Linear weight setup in test 2026-06-01 06:46:05 +00:00
7f1f224c78 fix: quantize_weight_to_nvfp4 returns 3 values, not 4 2026-06-01 06:43:53 +00:00
27fd847dd0 fix: correct quantize function name in fused router test 2026-06-01 06:41:54 +00:00
0873d65253 test: add fused router kernel test
Compares NVFP4 fused CuTeDSL kernel against reference
(Nvfp4Linear + activation_topk) for correctness.
2026-06-01 06:40:46 +00:00
9f14cb17d1 test: add compressor position_bias unit test
Verifies CUDA kernel matches PyTorch reference with and without
position_bias for both CSA (m=4) and HCA (m=128) paths.
2026-06-01 05:55:05 +00:00
2155fd6c90 test: production compressor kernel unit test 2026-06-01 05:19:13 +00:00
13be3ad443 FMHA sink bias in kernel + single_shot production rewrite
FMHA kernel (fmha_6warp_tma_multirow_multitile.cuh):
- Added sink_bias field to FmhaTmaMultiRowMultiTileParams
- After KV tile loop, sink logit is included in online softmax rescale:
  new_max = max(running_max, sink_bias * scale)
  rescale existing O_unnorm and running_sum
  running_sum += exp(sink_bias * scale - new_max)
  No PV contribution from sink (D5c: single softmax)
- C API: fmha_multitile_decode_launch now takes sink_bias_ptr
- Python: fmha_multitile_decode_raw accepts attn_sink tensor

single_shot_inference.py:
- Full rewrite to use production kernel stack
- mHC: uses dsv4.layers.mhc.mHCLayer (proper Sinkhorn-Knopp)
- Projections: uses Nvfp4Linear (CuTeDSL GEMM) for q_a, q_b, kv, o_b
- FMHA: 6-warp TMA multi-tile with sink bias (no SDPA fallback)
- MoE: Nvfp4MoE + Nvfp4SharedExpert (no reference fallback)
- Router: production dense/hash dispatch
- Compressor/Indexer: reference dequant (not yet on tensor cores)
- NO try/except fallbacks on production paths
2026-05-31 23:10:13 +00:00
baee36e728 Fix dtype mismatch in validate_layer: cast flat to float before F.linear 2026-05-31 20:23:18 +00:00
46c4ef2cf5 Add per-layer validation test (tests/validate_layer.py)
Compares forward_layer output with step-by-step PyTorch reference
to identify where residual blowup originates. Uses our own NVFP4
dequant — no HF dependency.
2026-05-31 20:22:13 +00:00
98fa410167 Add HF reference test script 2026-05-31 20:11:37 +00:00
7d9e70c5d5 Fix remaining mHC API references: layer_compare.py, layer.py comment 2026-05-31 18:38:34 +00:00
f6c02f808f Add layer-by-layer comparison test for debugging 2026-05-31 12:48:43 +00:00
6ad577bd18 Add HuggingFace reference comparison test 2026-05-31 12:05:19 +00:00
429fc3db40 Fix expert weight indexing for 1D tensor 2026-05-31 09:23:10 +00:00
33004dcbf4 Fix expert weight broadcasting (wt.item() for scalar multiply) 2026-05-31 09:22:27 +00:00
1434b35971 Add residual diagnostic test — per-layer magnitude tracking 2026-05-31 09:21:41 +00:00
970869d017 Fix mHCBlock import + relax RoPE round-trip threshold (BF16 noise expected) 2026-05-31 09:17:07 +00:00
a2ee78b564 Fix RoPE shape bug (interleave needs separate even/odd assembly) 2026-05-31 09:15:59 +00:00
9d96c2fbbf CRITICAL FIX: FP32 RoPE cache + FP32 arithmetic for inverse RoPE round-trip
BF16 cos/sin cache destroys cos²+sin²=1 identity (can be 0.996 in BF16).
This causes ~3% error per RoPE→inverse RoPE round-trip, accumulating
across 61 layers into garbage output. FP32 cache + FP32 arithmetic
gives exact round-trip (diff < 1e-7).

Also fixes: MoE expert loop indentation (was only running last expert).
2026-05-31 09:14:59 +00:00
db74a887ab Add minimal e2e test + fix MoE expert loop bug (indentation) 2026-05-31 09:14:03 +00:00
fac269c938 fix verify_attention: proper multi-head SDPA + GQA 2026-05-31 05:55:10 +00:00
2333fc8b4b fix verify_attention.py: proper nvfp4_linear calls 2026-05-31 05:53:49 +00:00
c09f68c867 add verify_attention.py: single-layer attention component test 2026-05-31 05:51:36 +00:00
4472928506 E3: model construction test 2026-05-30 21:22:34 +00:00
c4b40dd06c E2: CSA/HCA integration test — gather + FMHA end-to-end
Tests:
- CSA: gather_compressed_kv (top-k) + gather_swa_kv + sparse FMHA
- HCA: gather_all_compressed_kv + gather_swa_kv + dense FMHA
- Verifies shapes, dtypes, and numerical sanity (no NaN/Inf)
2026-05-30 21:19:28 +00:00