nvfp4-megamoe-kernel/tests at dfd9c10ae90a3d74eeb8db0701b541f004a9dbbc

biondizzle/nvfp4-megamoe-kernel

biondizzle c289c44920 Fix BF16 wo_a: per-group BMM instead of flat linear

The BF16 wo_a path was calling self.wo_a(o_inv.reshape(num_tokens, -1)),
which flattens across groups to (num_tokens, n_local_heads*head_dim) = (num_tokens, 8192).
But wo_a is a per-group BMM whose in_features is n_heads*head_dim/n_groups = 4096.

The FP8 path handles this via einsum 'bhr,hdr->bhd' with per-group shapes.
The BF16 path now does the same: reshape o_inv to per-group format,
do torch.bmm, then reshape output and handle TP all-gather manually.
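
The shape handling described above can be sketched roughly as follows. This is a minimal illustration, not the repository's code: the helper name wo_a_per_group, the weight layout (n_groups, out_per_group, in_per_group), and the tiny dimensions are assumptions chosen to match the einsum 'bhr,hdr->bhd' quoted in the commit message. It runs in float32 on CPU rather than BF16, and it leaves out the TP all-gather step.

```python
import torch


def wo_a_per_group(o_inv: torch.Tensor, weight: torch.Tensor) -> torch.Tensor:
    """Hypothetical helper: apply a separate projection to each group.

    o_inv:  (num_tokens, n_groups, in_per_group)
    weight: (n_groups, out_per_group, in_per_group)  # assumed layout
    returns (num_tokens, n_groups, out_per_group)
    """
    # torch.bmm expects the batch dimension first, so groups become the batch.
    x = o_inv.transpose(0, 1)        # (n_groups, num_tokens, in_per_group)
    w_t = weight.transpose(1, 2)     # (n_groups, in_per_group, out_per_group)
    out = torch.bmm(x, w_t)          # (n_groups, num_tokens, out_per_group)
    return out.transpose(0, 1).contiguous()


torch.manual_seed(0)
# Scaled-down stand-ins for the real sizes (8192 flat, 4096 per group).
num_tokens, n_groups, in_per_group, out_per_group = 3, 2, 8, 4

# The flat layout that the old path passed straight to a linear layer.
o_flat = torch.randn(num_tokens, n_groups * in_per_group)
weight = torch.randn(n_groups, out_per_group, in_per_group)

# Reshape to per-group format before the batched matmul.
o_grouped = o_flat.reshape(num_tokens, n_groups, in_per_group)
bmm_out = wo_a_per_group(o_grouped, weight)

# Reference: the einsum form quoted for the FP8 path.
ref = torch.einsum("bhr,hdr->bhd", o_grouped, weight)
```

Under these assumptions the BMM result and the einsum reference agree to floating-point tolerance, which is the kind of equivalence a test such as test_wo_a_bmm.py could check.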

2026-05-19 04:10:02 +00:00

cudagraph_test.py

fix: test L2 weight N dim should be hidden_size, not hidden_size//2

2026-05-16 19:07:36 +00:00

debug_output.py

Update CURRENT_BUG.md: current status, outstanding garbage output issue, hypotheses

2026-05-17 16:52:40 +00:00

layertest.py

restore: new bridge/moe_pipeline/layertest

2026-05-16 19:55:19 +00:00

requirements.txt

test: add standalone layer 0 comparison test (no vLLM, no Docker)

2026-05-16 02:13:18 +00:00

run_test.sh

fix: use setup.py install for CUTLASS extension build

2026-05-16 02:21:17 +00:00

test_attention.py

Add NVFP4 linear runner + attention projection test

2026-05-18 20:14:03 +00:00

test_b_layout.py

cleanup: move useful tests to tests/, nuke stale debug tests

2026-05-16 02:14:37 +00:00

test_compile_custom_op.py

Fix compile test: add warmup for activation global scales

2026-05-19 01:57:16 +00:00

test_custom_op.py

Replace autograd.Function with torch.library.custom_op for Dynamo compat

2026-05-19 01:54:48 +00:00

test_cutedsl.py

fix: B tensor K-major strides, scale_b axis swap

2026-05-16 03:04:31 +00:00

test_inv_rope.py

Add unit tests for NVFP4 weight mapper and inverse RoPE BF16

2026-05-19 03:22:00 +00:00

test_multilayer.py

Add MoE scale ratio output

2026-05-17 22:58:27 +00:00

test_nvfp4_mapper.py

Fix hc_head mapping: checkpoint uses hc_head.hc_fn, model params are flat hc_head_fn

2026-05-19 03:58:25 +00:00

test_pipeline_real_weights.py

Pipeline test: use max_num_tokens=8192 matching vLLM

2026-05-17 23:04:44 +00:00

test_quick_rand.py

cleanup: move useful tests to tests/, nuke stale debug tests

2026-05-16 02:14:37 +00:00

test_runner_vs_pipeline.py

test: runner vs pipeline comparison + scale assembly comparison

2026-05-17 07:33:20 +00:00

test_scale_assembly.py

fix: separate L1/L2 scale buffers (different K_sf), fix assembly calls

2026-05-17 07:43:05 +00:00

test_scale_debug.py

test: scale assembly debug

2026-05-17 07:37:47 +00:00

test_shared_expert.py

Fix hidden_size: shared expert uses 7168, not HC_DIM 28672

2026-05-18 20:10:32 +00:00

test_uniform_fp4.py

cleanup: move useful tests to tests/, nuke stale debug tests

2026-05-16 02:14:37 +00:00

test_warmup_gs.py

test: use runner's built-in warmup method

2026-05-17 08:24:27 +00:00

test_wo_a_bmm.py

Fix BF16 wo_a: per-group BMM instead of flat linear

2026-05-19 04:10:02 +00:00

test_wo_a.py

Fix test: cos_sin_cache on CUDA device

2026-05-19 02:37:50 +00:00