nvfp4-megamoe-kernel/tests at f23320b5b2a2dc019cf4129d27dca050b5bffc77

biondizzle/nvfp4-megamoe-kernel


biondizzle f23320b5b2 KV-1/KV-2: Fused compress+NVFP4 quantize kernels + dequant

- compressor_reduce_quant.cu: Single-kernel CSA/HCA compress + RMSNorm + NVFP4 quantize.
  No intermediate BF16. FP32 → E2M1 + E4M3 + FP32 gsa in one kernel.
  Shared memory: ~2.5KB per CTA (FP32 staging + nibble buffer).

- dequant_nvfp4.cu: NVFP4 → BF16 dequantization kernels.
  Full dequant (HCA dense gather) and selective dequant (CSA top-k gather).
  Single kernel launch per gather operation.

- production_compress.py: Added csa_compress_production_nvfp4() and
  hca_compress_production_nvfp4() — production path for KV-1/KV-2.

- loader.py: Preload dequant_nvfp4 and compressor_reduce_quant modules.

- test_kv_compress_quant.py: Unit tests verifying cos >= 0.999
  between BF16 reference and NVFP4 round-trip path.
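This commit describes the NVFP4 representation: 4-bit E2M1 values, E4M3 block scales, and an FP32 scale, plus a selective dequant that only expands the top-k rows. The following minimal pure-Python sketch shows that round trip. It assumes the common NVFP4 layout of one E4M3 scale per 16-element block and one FP32 global scale of amax / (6 * 448) per row. The names `quantize_nvfp4`, `dequantize_nvfp4` and `selective_dequant` are hypothetical and are not the API of `compressor_reduce_quant.cu` or `dequant_nvfp4.cu`. Those kernels fuse these steps on the GPU and may use a different rounding mode and memory layout.

```python
# Hypothetical sketch of an NVFP4 round trip; not the repository's kernel API.
# Assumptions: 16-element blocks, E4M3 block scales, FP32 global scale per row.
import math
import random

E2M1_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)  # E2M1 magnitudes
E2M1_MAX = 6.0
E4M3_MAX = 448.0  # largest finite FP8 E4M3 (fn) value
BLOCK = 16        # assumed block size: one E4M3 scale per 16 values


def round_e4m3(v: float) -> float:
    """Round a non-negative float to the nearest FP8 E4M3 value (ties to even)."""
    if v <= 0.0:
        return 0.0
    if v >= E4M3_MAX:
        return E4M3_MAX
    if v < 2.0 ** -6:                # subnormal range: fixed step of 2^-9
        step = 2.0 ** -9
    else:
        _, ex = math.frexp(v)        # v = m * 2**ex with 0.5 <= m < 1
        step = 2.0 ** (ex - 1 - 3)   # 3 mantissa bits below the leading 1
    return min(round(v / step) * step, E4M3_MAX)


def round_e2m1(v: float) -> float:
    """Round to the nearest signed E2M1 value; magnitudes above 6 clamp to 6."""
    mag = min(E2M1_GRID, key=lambda g: abs(abs(v) - g))
    return math.copysign(mag, v)


def quantize_nvfp4(x):
    """Return (codes, block_scales, global_scale); len(x) must be a multiple of BLOCK."""
    assert len(x) % BLOCK == 0
    amax = max((abs(v) for v in x), default=0.0)
    global_scale = amax / (E2M1_MAX * E4M3_MAX) if amax > 0 else 1.0
    codes, scales = [], []
    for i in range(0, len(x), BLOCK):
        blk = x[i:i + BLOCK]
        # Block scale maps the block's amax onto the E2M1 maximum (6.0).
        s = round_e4m3(max(abs(v) for v in blk) / (E2M1_MAX * global_scale))
        scales.append(s)
        denom = s * global_scale
        codes.extend(round_e2m1(v / denom) if denom > 0 else 0.0 for v in blk)
    return codes, scales, global_scale


def dequantize_nvfp4(codes, scales, global_scale):
    """Full dequant: value = code * block_scale * global_scale."""
    return [c * scales[i // BLOCK] * global_scale for i, c in enumerate(codes)]


def selective_dequant(cache, indices):
    """Dequantize only the rows listed in `indices` (e.g. a top-k selection)."""
    return [dequantize_nvfp4(*cache[j]) for j in indices]


def cosine(a, b):
    dot = sum(p * q for p, q in zip(a, b))
    return dot / (math.sqrt(sum(p * p for p in a)) * math.sqrt(sum(q * q for q in b)))


if __name__ == "__main__":
    rng = random.Random(0)
    x = [rng.gauss(0.0, 1.0) for _ in range(512)]
    x_hat = dequantize_nvfp4(*quantize_nvfp4(x))
    print(f"cosine(x, x_hat) = {cosine(x, x_hat):.4f}")
```

The two-level scale lets the 8-bit block scale cover only the local dynamic range, while the FP32 scale absorbs the row's overall magnitude. This is why the global scale divides by 6 * 448, the product of the E2M1 and E4M3 maxima. Selective dequant avoids expanding rows that the top-k gather never reads.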

2026-06-02 09:37:53 +00:00

E3: model construction test

2026-05-30 21:22:34 +00:00

Restructure: cutedsl/ -> dsv4/ with proper layering

2026-05-21 17:30:44 +00:00

KV-1/KV-2: Fused compress+NVFP4 quantize kernels + dequant

2026-06-02 09:37:53 +00:00

check_log.sh

Add check_log.sh convenience script

2026-05-22 17:07:23 +00:00

compare_hf_reference.py

Add HuggingFace reference comparison test

2026-05-31 12:05:19 +00:00

compare_layer0.py

Add HF reference test script

2026-05-31 20:11:37 +00:00

layer_compare.py

Fix remaining mHC API references: layer_compare.py, layer.py comment

2026-05-31 18:38:34 +00:00

production_values_test.py

Add production-value tests: ALL tests use Pro config (61L, HD=512, 384 experts, HCA=128, 1M context)

2026-06-02 04:10:39 +00:00

requirements.txt

test: add standalone layer 0 comparison test (no vLLM, no Docker)

2026-05-16 02:13:18 +00:00

run_test.sh

run_test.sh: SIGKILL all children of screen session on cleanup

2026-05-22 17:08:12 +00:00

test_minimal_e2e.py

Fix mHCBlock import + relax RoPE round-trip threshold (BF16 noise expected)

2026-05-31 09:17:07 +00:00

test_residual_diagnostic.py

Fix expert weight indexing for 1D tensor

2026-05-31 09:23:10 +00:00

validate_layer.py

Fix dtype mismatch in validate_layer: cast flat to float before F.linear

2026-05-31 20:23:18 +00:00

verify_attention.py

fix verify_attention: proper multi-head SDPA + GQA

2026-05-31 05:55:10 +00:00