bdb25ee5cd
Add production-value unit tests for kv_quantize kernels
2026-06-02 10:01:07 +00:00
d74ff5768d
KV diag test
2026-06-02 09:43:45 +00:00
f23320b5b2
KV-1/KV-2: Fused compress+NVFP4 quantize kernels + dequant
...
- compressor_reduce_quant.cu: Single-kernel CSA/HCA compress + RMSNorm + NVFP4 quantize.
No intermediate BF16. FP32 → E2M1 + E4M3 + FP32 gsa in one kernel.
Shared memory: ~2.5KB per CTA (FP32 staging + nibble buffer).
- dequant_nvfp4.cu: NVFP4 → BF16 dequantization kernels.
Full dequant (HCA dense gather) and selective dequant (CSA top-k gather).
Single kernel launch per gather operation.
- production_compress.py: Added csa_compress_production_nvfp4() and
hca_compress_production_nvfp4() — production path for KV-1/KV-2.
- loader.py: Preload dequant_nvfp4 and compressor_reduce_quant modules.
- test_kv_compress_quant.py: Unit tests verifying cos >= 0.999
between BF16 reference and NVFP4 round-trip path.
2026-06-02 09:37:53 +00:00
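A minimal sketch of the round-trip accuracy check described in this commit, in plain Python rather than CUDA. It assumes the usual NVFP4 layout of 16-element blocks with E2M1 values. As a simplification, the block scale is kept in full float precision instead of being rounded to E4M3, and the FP32 global scale is omitted. The names `quantize_block`, `roundtrip`, and `cosine` are illustrative and do not come from the repository.

```python
import math

# E2M1 (FP4) representable magnitudes: 1 sign bit, 2 exponent bits, 1 mantissa bit.
E2M1_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
BLOCK = 16  # assumed block size: one scale factor per 16 consecutive elements


def quantize_block(block):
    """Return (scale, codes) for one block; codes are signed E2M1 grid values."""
    amax = max(abs(v) for v in block)
    # Map the block maximum onto 6.0, the largest E2M1 magnitude.
    # Simplification: a real kernel would round this scale to E4M3.
    scale = amax / 6.0 if amax > 0.0 else 1.0
    codes = []
    for v in block:
        # Nearest grid value; ties go to the smaller magnitude here,
        # which may differ from hardware rounding.
        mag = min(E2M1_GRID, key=lambda g: abs(abs(v) / scale - g))
        codes.append(math.copysign(mag, v))
    return scale, codes


def roundtrip(x):
    """Quantize then dequantize x block by block."""
    out = []
    for start in range(0, len(x), BLOCK):
        scale, codes = quantize_block(x[start:start + BLOCK])
        out.extend(c * scale for c in codes)
    return out


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(p * q for p, q in zip(a, b))
    na = math.sqrt(sum(p * p for p in a))
    nb = math.sqrt(sum(q * q for q in b))
    return dot / (na * nb)


if __name__ == "__main__":
    x = [math.sin(0.37 * i) for i in range(64)]
    print("cos(reference, round trip) =", cosine(x, roundtrip(x)))
```

This sketch quantizes raw values directly, so its cosine will generally not match the gate applied to the compressed and normalized KV path in the actual tests.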
2bbbead984
P3: CUDA RoPE kernel — single launch per call (vs 5-6 PyTorch ops)
...
New files:
- dsv4/kernels/cuda/rope_cuda.cu: GPT-J interleaved RoPE kernel (forward+inverse)
- dsv4/ops/rope_cuda.py: Python bridge with ctypes loading
- tests/unit/test_rope_cuda.py: correctness test (cos >= 0.999998)
Savings: ~915 launches/token → 183 launches/token
2026-06-02 09:05:22 +00:00
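For reference, the GPT-J interleaved convention rotates adjacent element pairs `(x[2i], x[2i+1])` rather than splitting the head dimension into halves. The sketch below shows the forward and inverse rotation for a single position in plain Python. The function name and the `base=10000.0` default are assumptions for illustration, not values taken from `rope_cuda.cu`.

```python
import math


def rope_interleaved(x, pos, base=10000.0, inverse=False):
    """Rotate adjacent pairs (x[2i], x[2i+1]) by the angle pos * base**(-2i/d).

    inverse=True rotates by the negated angle, which undoes the forward pass.
    """
    d = len(x)
    assert d % 2 == 0, "head dimension must be even"
    out = [0.0] * d
    for i in range(d // 2):
        theta = pos * base ** (-2.0 * i / d)
        c, s = math.cos(theta), math.sin(theta)
        if inverse:
            s = -s  # rotation by -theta
        a, b = x[2 * i], x[2 * i + 1]
        out[2 * i] = a * c - b * s
        out[2 * i + 1] = a * s + b * c
    return out


if __name__ == "__main__":
    x = [0.1 * (k + 1) for k in range(8)]
    y = rope_interleaved(x, pos=5)
    x_back = rope_interleaved(y, pos=5, inverse=True)
    print(max(abs(p - q) for p, q in zip(x, x_back)))
```

Because each pair undergoes a pure rotation, the vector norm is preserved and the inverse recovers the input up to floating-point rounding, which is the property a correctness test can gate on.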
b13c1057f5
test: verify GEMM shape with production weight format
2026-06-02 08:43:40 +00:00
40fb49d670
test: verify GEMM output shape
2026-06-02 08:41:22 +00:00
5ed4c86137
fix: expert_offsets for 4-expert fused SwiGLU test
2026-06-02 08:24:32 +00:00
53362d2579
test: isolate fused SwiGLU — test no-clamp first
2026-06-02 08:23:28 +00:00
ae4506d722
fix: w_gs is scalar not iterable
2026-06-02 08:22:29 +00:00
b0c71b947e
test: fused SwiGLU — smoke test + correctness comparison with graceful degradation
2026-06-02 08:21:33 +00:00
2cfca36095
fix: compute correct gs from data in fused SwiGLU test
2026-06-02 08:20:27 +00:00
4a05a40cf0
fix: fused SwiGLU test — proper weight quant + 128-token alignment
2026-06-02 08:19:31 +00:00
fa769b6214
fix: pad activation as uint8 view for float4 dtype
2026-06-02 08:18:26 +00:00
024be1a60b
fix: test weight quantization dtype for fused SwiGLU test
2026-06-02 08:17:35 +00:00
55ea109cca
test: fused SwiGLU kernel compilation + correctness (P0/P1 gate)
2026-06-02 08:09:57 +00:00
9254cb0b0d
test: NVFP4 runtime gsa accuracy vs PyTorch reference
2026-06-02 04:31:18 +00:00
9d57b0453b
auto: pre-test commit
2026-06-01 15:04:46 +00:00
3b2714410f
Add NVFP4 linear accuracy test: prod vs ref with all-ones input
2026-06-01 14:15:27 +00:00
3e47d5f20a
Add prod vs ref GEMM comparison test + gate logits diagnostic
2026-06-01 14:11:37 +00:00
7b3f6cb13c
Fix fused router: use run_nvfp4_fused_router wrapper, correct CuTe tensor API
...
- kernel wrapper converts torch tensors to CuTe tensors with mark_layout_dynamic
- test uses the wrapper instead of calling kernel.run() directly
- mat_b/scale_b are now torch tensors (converted inside wrapper)
2026-06-01 09:19:48 +00:00
483e759d53
Fix: use tensor.mark_layout_dynamic() method (not cute.mark_layout_dynamic)
2026-06-01 09:16:33 +00:00
2412745b21
Test fix: slice NVFP4 logits to actual expert count (GEMM padding)
2026-06-01 09:15:06 +00:00
4f4ae8febd
Test: enumerate CuTeDSL math API to check available operations
2026-06-01 09:11:29 +00:00
9b86b2b414
Test: fix fused router test - proper NVFP4 quantization and CuTe tensor setup
...
- Use quantize_to_nvfp4 for weight quantization
- Use quantize_activation_nvfp4 with computed global_scale
- Get mat_b and scale_b from Nvfp4Linear after finalize_weights
- Compare against both BF16 reference and NVFP4 GEMM reference
2026-06-01 08:56:20 +00:00
b94f8d4ed8
Test: fused router kernel vs BF16 reference path
...
- BF16 GEMM + activation_topk as reference
- NVFP4 GEMM + fused router epilogue as test target
- Proper NVFP4 quantization and CuTe tensor creation
- Cosine similarity and topk_ids matching validation
2026-06-01 08:54:24 +00:00
2433700a69
Fused router kernel: rewrite epilogue with proper CuTeDSL constructs
...
- Replace Python lists with individual scalar variables (s0..s5, i0..i5, a0..a5)
- Replace min-heap sift-down with fully unrolled sorted insertion
(descending order, no dynamic indexing, no while loops)
- Replace raw SMEM pointer arithmetic with CuTeDSL SMEM tensors
(s_merge_s, s_merge_i, s_merge_a)
- Replace cute.where with cute.math.fmax
- Fix expert index calculation: col + tile_n_offset + subtile_idx * epi_n
- Top-6 accumulates across all N-tiles (for E=384 with 3 tiles of 128)
- Add iter_acc_early_release for overlapping accumulator
- Rewrite test to compare fused kernel vs 2-kernel reference path
- Remove stale memory doc
2026-06-01 08:49:39 +00:00
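The sorted-insertion idea from this commit can be sketched in plain Python. Each candidate is bubbled through a fixed-length descending list with compare-and-swap steps, so the loop has a constant trip count and no data-dependent indexing, which is what lets a kernel unroll it into scalar registers. The expert index formula and the E=384 / 3 tiles of 128 shape follow the commit message. `EPI_N = 32` and the function names are hypothetical.

```python
K = 6          # top-k size
E = 384        # expert count from the commit message
TILE_N = 128   # N-tile width, giving 3 tiles for E = 384
EPI_N = 32     # hypothetical epilogue subtile width


def insert_sorted(scores, ids, s, i):
    """Insert (s, i) into a descending top-K list using only compare-and-swap.

    The loop always runs K iterations; in a kernel it would be fully
    unrolled over scalar variables instead of list slots.
    """
    for j in range(K):
        if s > scores[j]:
            scores[j], s = s, scores[j]
            ids[j], i = i, ids[j]
    # Whatever remains in (s, i) has fallen off the end of the list.


def top6(logits):
    """Accumulate one top-K list across all N-tiles and epilogue subtiles."""
    scores = [float("-inf")] * K
    ids = [-1] * K
    for tile in range(E // TILE_N):
        tile_n_offset = tile * TILE_N
        for subtile_idx in range(TILE_N // EPI_N):
            for col in range(EPI_N):
                expert = col + tile_n_offset + subtile_idx * EPI_N
                insert_sorted(scores, ids, logits[expert], expert)
    return scores, ids


if __name__ == "__main__":
    # Distinct scores: 37 is coprime with 384, so this is a permutation.
    logits = [float((e * 37) % E) for e in range(E)]
    print(top6(logits))
```

With tied scores this sketch still keeps the correct top-K values, but which of the tied indices survives depends on visit order.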
25b9a5f32d
Fix test: use from_dlpack for c_tensor
2026-06-01 07:55:29 +00:00
d2819fc39c
Fix test: use as_tensor instead of make_tensor
2026-06-01 07:54:36 +00:00
5ea71ebd78
Add NVFP4 CuTeDSL compilation test (verify MmaMXF4NVF4Op compiles)
2026-06-01 07:53:43 +00:00
0553117af6
Simplify fused router test: compare fused vs 2-kernel NVFP4 path
2026-06-01 07:10:55 +00:00
44a0e59808
Fix fused router test: use quantize_weight_to_nvfp4 (correct function name)
2026-06-01 07:08:56 +00:00
940f37fb6c
NVFP4 fused router kernel: full rewrite with proper block-scaled GEMM setup
...
Major fixes:
- Added tiled_mma_sfb creation (always CtaGroup.ONE, rounded N)
- Added mma_tiler_sfb, cta_tile_shape_mnk_sfb, cluster_layout_sfb_vmnk
- Use blockscaled_utils.make_smem_layout_sfa/sfb (with sf_vec_size)
instead of sm100_utils (which doesn't support block-scaled SF layouts)
- Proper TMEM column accounting for SFA + SFB + accumulator
- Fixed make_blockscaled_trivial_tiled_mma argument order
(a_dtype, b_dtype, a_major, b_major, sf_dtype, sf_vec_size, cta_group, mma_inst_shape)
- Fixed SFB TMA atom to use tiled_mma_sfb and cluster_layout_sfb_vmnk
- Fixed SFB partition_SFB to use tiled_mma_sfb.get_slice
- Fixed SFB global tile partitioning to use mma_tiler_sfb
- Fixed mainloop_s2t_copy_and_partition to use TMEM fragments
(make_fragment_SFA/SFB) as the tSF parameter
- Updated run_nvfp4_fused_router wrapper to accept processed weight
tensors from Nvfp4Linear._mat_b and _scale_b
- Updated test to properly build Nvfp4Linear and use processed weights
The old code was a rough sketch that never worked — it was missing
the entire tiled_mma_sfb infrastructure, used wrong SMEM layout
functions, and had broken TMA atom setup for scale factors.
2026-06-01 07:08:12 +00:00
e6803b450d
rewrite: simplified fused router test (reference + import check)
2026-06-01 06:53:17 +00:00
262cec262d
fix: add shape assertions to fused router test
2026-06-01 06:51:47 +00:00
db07d17a62
fix: set activation global scale in fused router test
2026-06-01 06:50:41 +00:00
2abb4a19d9
fix: set gs and ws2 fields for Nvfp4Linear in fused router test
2026-06-01 06:49:43 +00:00
61c04f7152
fix: Nvfp4Linear field is sf not scale_b
2026-06-01 06:48:39 +00:00
982f245c67
fix: use correct Nvfp4Linear field names (fp4, scale_b, gsb)
2026-06-01 06:47:15 +00:00
16af96380f
fix: use internal fields for Nvfp4Linear weight setup in test
2026-06-01 06:46:05 +00:00
7f1f224c78
fix: quantize_weight_to_nvfp4 returns 3 values, not 4
2026-06-01 06:43:53 +00:00
27fd847dd0
fix: correct quantize function name in fused router test
2026-06-01 06:41:54 +00:00
0873d65253
test: add fused router kernel test
...
Compares NVFP4 fused CuTeDSL kernel against reference
(Nvfp4Linear + activation_topk) for correctness.
2026-06-01 06:40:46 +00:00
9f14cb17d1
test: add compressor position_bias unit test
...
Verifies CUDA kernel matches PyTorch reference with and without
position_bias for both CSA (m=4) and HCA (m=128) paths.
2026-06-01 05:55:05 +00:00
2155fd6c90
test: production compressor kernel unit test
2026-06-01 05:19:13 +00:00
13be3ad443
FMHA sink bias in kernel + single_shot production rewrite
...
FMHA kernel (fmha_6warp_tma_multirow_multitile.cuh):
- Added sink_bias field to FmhaTmaMultiRowMultiTileParams
- After the KV tile loop, the sink logit is folded into the online softmax rescale:
new_max = max(running_max, sink_bias * scale)
rescale existing O_unnorm and running_sum by exp(running_max - new_max)
running_sum += exp(sink_bias * scale - new_max)
The sink adds no PV contribution (D5c: single softmax)
- C API: fmha_multitile_decode_launch now takes sink_bias_ptr
- Python: fmha_multitile_decode_raw accepts attn_sink tensor
single_shot_inference.py:
- Full rewrite to use production kernel stack
- mHC: uses dsv4.layers.mhc.mHCLayer (proper Sinkhorn-Knopp)
- Projections: uses Nvfp4Linear (CuTeDSL GEMM) for q_a, q_b, kv, o_b
- FMHA: 6-warp TMA multi-tile with sink bias (no SDPA fallback)
- MoE: Nvfp4MoE + Nvfp4SharedExpert (no reference fallback)
- Router: production dense/hash dispatch
- Compressor/Indexer: reference dequant (not yet on tensor cores)
- NO try/except fallbacks on production paths
2026-05-31 23:10:13 +00:00
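The sink-bias step in the FMHA notes above can be checked against a direct softmax with a small Python model. This is a sketch under simplifying assumptions: one query row, scalar values instead of head-dimension vectors, and a hypothetical function name `attend_with_sink`. It follows the update order given in the commit message, including multiplying `sink_bias` by `scale`.

```python
import math


def attend_with_sink(logit_tiles, value_tiles, sink_bias, scale):
    """Online softmax over KV tiles, then fold in a sink logit that has no value.

    logit_tiles: list of tiles, each a list of raw q.k scores.
    value_tiles: matching list of tiles, each a list of scalar values.
    """
    running_max = float("-inf")
    running_sum = 0.0
    o_unnorm = 0.0
    for logits, values in zip(logit_tiles, value_tiles):
        new_max = max(running_max, max(l * scale for l in logits))
        alpha = math.exp(running_max - new_max)  # 0.0 on the first tile
        running_sum *= alpha
        o_unnorm *= alpha
        for l, v in zip(logits, values):
            p = math.exp(l * scale - new_max)
            running_sum += p
            o_unnorm += p * v
        running_max = new_max
    # Sink step: it joins the normalizer but contributes nothing to the output.
    new_max = max(running_max, sink_bias * scale)
    alpha = math.exp(running_max - new_max)
    o_unnorm *= alpha
    running_sum = running_sum * alpha + math.exp(sink_bias * scale - new_max)
    return o_unnorm / running_sum


if __name__ == "__main__":
    tiles = [[1.0, 2.0], [3.0, 0.5]]
    vals = [[1.0, 2.0], [3.0, 4.0]]
    print(attend_with_sink(tiles, vals, sink_bias=0.7, scale=0.5))
```

The result equals a single softmax over the KV logits plus one extra logit whose value is zero, so a large sink bias shrinks the output toward zero instead of redistributing weight onto real keys.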
4b9eed02e1
Cleanup C1-C7: delete dead CuTeDSL FMHA, test probes, scratch files
...
- Deleted fmha.py (CuTeDSL slow path), FmhaKernel, Python KV merge
- Deleted fmha_sm100.cuh, fmha_sm100_tc.cuh, fmha_sm100_launch.cu, fmha_epilogue_sm100.cuh
- Moved fmha_qk_verify.cuh to tests/unit/qk_verify_kernel.cuh
- Deleted decode_sparse.py, decode_swa.py, kernels/decode/
- Deleted 46 test_d*.py probes, test_smem_*, test_cotiled_*, test_tmem_*,
test_smem_p_*, test_ultra_minimal, test_fmha_pv16, test_working_softmax_maybe
- Deleted root scratch: debug_linear.py, test_mapping.py, run_router_tests.py
- Moved archive/ to archived_plans/code_archive/
- Rewrote production.py: single fast path via 6-warp multi-tile kernel
- Added STATUS.md, audit_attention_live.md
- Moved NEXT_PRIORITIES*.md to archived_plans/
2026-05-30 21:08:12 +00:00
2c18609296
P8: Fix P6 test imports after deleting multihead module
2026-05-30 17:25:01 +00:00
e1b9e94c24
P8: Fix test imports after deleting multihead module
2026-05-30 17:23:13 +00:00
e747742598
P7: Document TMEM column layout, add multi-row softmax test
...
docs/p7_tmem_column_layout.md: Verified that tcgen05.ld 32x32b.x8 is
the correct instruction for multi-row softmax. Each call reads 8 KV
positions for 32 rows. No instruction change needed from single-row.
test_p7_multi_row_softmax.py: Tests T=1,4,32,64,128 at various HD and N.
Gate: cos >= 0.999996.
2026-05-30 17:17:54 +00:00
f1ce47e3c9
P7: Add TMEM column layout probe test
2026-05-30 17:14:50 +00:00