nvfp4-megamoe-kernel/tests at 2672e98e4cb53103e5b9e904996a44c8243bf3f2 - nvfp4-megamoe-kernel

biondizzle/nvfp4-megamoe-kernel

biondizzle 3de75c4e37 Add CSA/HCA attention kernel (PyTorch SDPA, Blackwell-safe)

Replaces vLLM's FlashMLA sparse attention, which does not work on
SM100 (Blackwell), with torch.nn.functional.scaled_dot_product_attention,
which works on all GPUs.

Architecture:
- CSA (C128A): Batched sparse gather + SDPA on top-k positions
- HCA (C4A): Same with compressed KV + per-layer indexer
- SWA: Sliding window attention
- Full reference: standard SDPA for testing without compression

Also adds test_csa_attention_b200.py to verify the full attention path.
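
The gather-then-attend idea in the architecture list can be sketched as follows. This is a dependency-free illustration in plain Python, not the kernel from this commit, which calls torch.nn.functional.scaled_dot_product_attention on batched tensors. The function names, the `index_scores` input (standing in for whatever the indexer produces), and the single-query shape are all assumptions made for the example.

```python
import math


def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attend(q, keys, values):
    # Scaled dot-product attention for one query vector.
    scale = 1.0 / math.sqrt(len(q))
    weights = softmax([dot(q, k) * scale for k in keys])
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]


def topk_sparse_attention(q, keys, values, index_scores, k):
    # Hypothetical stand-in for the sparse path: keep the k positions with
    # the highest index score, gather their K/V rows, then attend as usual.
    top = sorted(range(len(keys)), key=lambda i: index_scores[i], reverse=True)[:k]
    top.sort()  # restore positional order after selection
    out = attend(q, [keys[i] for i in top], [values[i] for i in top])
    return out, top


def sliding_window_attention(q, keys, values, window):
    # Hypothetical stand-in for SWA: attend only to the last `window` positions.
    return attend(q, keys[-window:], values[-window:])


q = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
values = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
index_scores = [0.1, 0.9, 0.8, 0.2]  # made-up scores for illustration

sparse_out, picked = topk_sparse_attention(q, keys, values, index_scores, k=2)
full_out = attend(q, keys, values)  # reference path without any selection
print(picked, sparse_out, full_out)
```

With k equal to the sequence length the sparse path reduces to the full reference, which gives a simple correctness check for a gather-based implementation.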

2026-05-19 07:58:10 +00:00

cudagraph_test.py

fix: test L2 weight N dim should be hidden_size, not hidden_size//2

2026-05-16 19:07:36 +00:00

debug_output.py

Update CURRENT_BUG.md: current status, outstanding garbage output issue, hypotheses

2026-05-17 16:52:40 +00:00

layertest.py

restore: new bridge/moe_pipeline/layertest

2026-05-16 19:55:19 +00:00

requirements.txt

test: add standalone layer 0 comparison test (no vLLM, no Docker)

2026-05-16 02:13:18 +00:00

run_test.sh

fix: use setup.py install for CUTLASS extension build

2026-05-16 02:21:17 +00:00

test_attention_path_b200.py

Add attention path test: pinpoint FlashMLA failure

2026-05-19 07:54:01 +00:00

test_attention.py

Add NVFP4 linear runner + attention projection test

2026-05-18 20:14:03 +00:00

test_b_layout.py

cleanup: move useful tests to tests/, nuke stale debug tests

2026-05-16 02:14:37 +00:00

test_compile_custom_op.py

Fix compile test: add warmup for activation global scales

2026-05-19 01:57:16 +00:00

test_csa_attention_b200.py

Add CSA/HCA attention kernel (PyTorch SDPA, Blackwell-safe)

2026-05-19 07:58:10 +00:00

test_custom_op.py

Replace autograd.Function with torch.library.custom_op for Dynamo compat

2026-05-19 01:54:48 +00:00

test_cutedsl.py

fix: B tensor K-major strides, scale_b axis swap

2026-05-16 03:04:31 +00:00

test_full_layer_b200.py

Fix checkpoint keys: attn_hc.*, compressor.*, q_a_proj/q_b_proj/kv_proj

2026-05-19 07:17:37 +00:00

test_inv_rope.py

Add unit tests for NVFP4 weight mapper and inverse RoPE BF16

2026-05-19 03:22:00 +00:00

test_model_forward_b200.py

Rewrite test: diagnose whether warmup gs matters at inference time

2026-05-19 07:49:41 +00:00

test_multilayer.py

Add MoE scale ratio output

2026-05-17 22:58:27 +00:00

test_nvfp4_mapper.py

Fix hc_head mapping: checkpoint uses hc_head.hc_fn, model params are flat hc_head_fn

2026-05-19 03:58:25 +00:00

test_o_projection_b200.py

Fix dims: o_groups=16, o_lora_rank=1024 from config

2026-05-19 06:37:25 +00:00

test_o_projection.py

Patch attention forward: BF16 inv RoPE + BMM wo_a + NVFP4 wo_b

2026-05-19 06:30:18 +00:00

test_pipeline_real_weights.py

Pipeline test: use max_num_tokens=8192 matching vLLM

2026-05-17 23:04:44 +00:00

test_quick_rand.py

cleanup: move useful tests to tests/, nuke stale debug tests

2026-05-16 02:14:37 +00:00

test_runner_vs_pipeline.py

test: runner vs pipeline comparison + scale assembly comparison

2026-05-17 07:33:20 +00:00

test_scale_assembly.py

fix: separate L1/L2 scale buffers (different K_sf), fix assembly calls

2026-05-17 07:43:05 +00:00

test_scale_debug.py

test: scale assembly debug

2026-05-17 07:37:47 +00:00

test_shared_expert.py

Fix hidden_size: shared expert uses 7168, not HC_DIM 28672

2026-05-18 20:10:32 +00:00

test_uniform_fp4.py

cleanup: move useful tests to tests/, nuke stale debug tests

2026-05-16 02:14:37 +00:00

test_warmup_gs.py

test: use runner's built-in warmup method

2026-05-17 08:24:27 +00:00

test_wo_a_bmm.py

Fix BF16 wo_a: per-group BMM instead of flat linear

2026-05-19 04:10:02 +00:00

test_wo_a.py

Fix test: cos_sin_cache on CUDA device

2026-05-19 02:37:50 +00:00