nvfp4-megamoe-kernel

Author	SHA1	Message	Date
biondizzle	d15c43294b	fix: test L2 weight N dim should be hidden_size, not hidden_size//2	2026-05-16 19:07:36 +00:00
biondizzle	28788c6f55	fix: L1 weight N dimension is 2*intermediate (gate+up), not intermediate. float4_e2m1fn_x2 packs 2 values per byte along K, not N. The GEMM output N dimension is the logical N from mat_b.shape[2], not 2x packed. Previous n_dim*2 was wrong; it accidentally worked in the test because intermediate_size*2 == 2*intermediate_size. Real model with N=9216 exposed the bug.	2026-05-16 19:07:08 +00:00
biondizzle	54c470e535	fix: use float16->float8 cast for rand_sf (torch.rand doesn't support float8)	2026-05-16 18:13:14 +00:00
biondizzle	f2de95c526	fix: use randint for float4 dummy weights in cudagraph test	2026-05-16 18:08:45 +00:00
biondizzle	f66d4b69a4	GPU-only scale assembly + cudagraph test harness. (1) assemble_activation_scales_gpu: builds the padded+swizzled scale tensor without .item() or .tolist() CPU syncs; uses GPU index arange + cat + a single scatter instead of per-expert Python slicing. (2) Still has a for e in range(num_experts) loop, but num_experts is a compile-time constant, so torch.compile unrolls it. (3) Added tests/cudagraph_test.py: attempts CUDA graph capture on the MoE runner and diagnoses sync violations with patched torch functions. (4) Removed the if total_slots == 0 early return (Python control flow on GPU data).	2026-05-16 18:05:13 +00:00
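
The shape rule behind commits 28788c6f55 and d15c43294b can be illustrated with a small pure-Python sketch. It is not code from the repository: the sizes, the (N, K) weight layout, the low-nibble-first packing order, the assumption that the L2 GEMM's K is intermediate_size, and the helper names packed_weight_shape and pack_row are all hypothetical, chosen only to show that packing two 4-bit values per byte halves K and leaves N untouched.

```python
# Hypothetical sizes for illustration only; not taken from the repository.
hidden_size = 8        # K of the L1 GEMM, N of the L2 GEMM
intermediate_size = 6  # width of each of the gate and up projections


def packed_weight_shape(n, k):
    """Byte-level storage shape of an (n, k) FP4 weight packed 2 values per byte along K."""
    if k % 2:
        raise ValueError("K must be even to pack two 4-bit values per byte")
    return (n, k // 2)  # N is unchanged; only K is halved


def pack_row(nibbles):
    """Pack 4-bit codes (0..15) into bytes, low nibble first (an assumed order)."""
    return bytes(lo | (hi << 4) for lo, hi in zip(nibbles[0::2], nibbles[1::2]))


# L1 fuses gate and up, so its logical N is 2 * intermediate_size.
l1_logical = (2 * intermediate_size, hidden_size)
# L2 projects back to the model width, so its logical N is hidden_size
# (its K is assumed here to be intermediate_size).
l2_logical = (hidden_size, intermediate_size)

l1_packed = packed_weight_shape(*l1_logical)
l2_packed = packed_weight_shape(*l2_logical)

# One row of K = hidden_size 4-bit codes becomes hidden_size // 2 bytes.
row = [i % 16 for i in range(hidden_size)]
packed_row = pack_row(row)
```

Under these assumptions the GEMM output width is always read from the logical N (12 for L1, 8 for L2 here), never from a doubled or halved storage dimension.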