nvfp4-megamoe-kernel

Author	SHA1	Message	Date
biondizzle	6692166d0f	Update CURRENT_BUG.md: Bug 25 (swiglu_limit), shared expert path verification, variable padded offsets	2026-05-17 17:56:04 +00:00
biondizzle	87a223f1ac	Update CURRENT_BUG.md: current status, outstanding garbage output issue, hypotheses	2026-05-17 16:52:40 +00:00
biondizzle	3d0b1408b4	Update CURRENT_BUG.md: Bug 21 (shared buffers), clean up status	2026-05-17 15:52:06 +00:00
biondizzle	e2f33596a2	Update CURRENT_BUG.md: status through Bug 20, fixed-layout padding architecture	2026-05-17 15:46:13 +00:00
biondizzle	0d3c928ff2	Update CURRENT_BUG.md: full status through Bug 14, vLLM integration status, architecture docs	2026-05-17 13:32:41 +00:00
biondizzle	eb7d4f099b	Update CURRENT_BUG.md with Bug 8 (global→local expert ID) and Bug 8b (.cpu() sync)	2026-05-17 09:01:24 +00:00
biondizzle	ca3cba5bbd	Fix global→local expert ID remapping for EP and remove .cpu() sync. Root cause of CUDA_ERROR_ASSERT index out of bounds: topk_ids contains GLOBAL expert IDs (0-255) but the runner treated them as local IDs (0-31 with EP=8). Tokens for non-local experts got wrong expert assignments, causing out-of-bounds scatter indices in _assemble_scales_cudagraph_safe. Fixes: (1) add experts_start_idx param to CuTeDSLMoERunner; (2) in run(), remap global→local IDs and zero weights for non-local experts; (3) move _token_indices from CPU to GPU (remove sort_idx.cpu() sync); (4) add _fill_token_indices() and _needs_token_refill to handle CuTeDSL JIT GPU memory corruption (refill after first GEMM call).	2026-05-17 08:58:43 +00:00
biondizzle	ddffb7d8df	docs: current bug analysis — scale_a layout vs expert_offsets mismatch	2026-05-17 07:53:58 +00:00

8 Commits
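
The global→local remapping described in commit ca3cba5bbd can be illustrated with a small sketch. This is not the repository's code: the function name `remap_global_to_local`, the use of plain Python lists instead of GPU tensors, and the choice to point non-local slots at local expert 0 are all assumptions made for illustration. Only `experts_start_idx`, the 0-255 global / 0-31 local ID ranges, and the "zero weights for non-local experts" rule come from the commit message.

```python
from typing import List, Tuple


def remap_global_to_local(
    topk_ids: List[List[int]],
    topk_weights: List[List[float]],
    experts_start_idx: int,
    num_local_experts: int,
) -> Tuple[List[List[int]], List[List[float]]]:
    """Hypothetical helper: convert global expert IDs to rank-local IDs.

    Slots that route to an expert owned by another rank keep an in-range
    placeholder ID (0, an assumption) and get weight 0.0, so they cannot
    produce out-of-bounds indices or contribute to the output.
    """
    local_ids: List[List[int]] = []
    local_weights: List[List[float]] = []
    for row_ids, row_weights in zip(topk_ids, topk_weights):
        out_ids: List[int] = []
        out_weights: List[float] = []
        for global_id, weight in zip(row_ids, row_weights):
            local_id = global_id - experts_start_idx
            if 0 <= local_id < num_local_experts:
                # Expert lives on this rank: keep its weight.
                out_ids.append(local_id)
                out_weights.append(weight)
            else:
                # Expert lives elsewhere: safe placeholder, zero weight.
                out_ids.append(0)
                out_weights.append(0.0)
        local_ids.append(out_ids)
        local_weights.append(out_weights)
    return local_ids, local_weights


# Example: 256 global experts, EP=8, so 32 experts per rank.
# Rank 1 owns global experts 32..63 (experts_start_idx = 32).
ids, weights = remap_global_to_local(
    topk_ids=[[40, 3], [63, 200]],
    topk_weights=[[0.7, 0.3], [0.6, 0.4]],
    experts_start_idx=32,
    num_local_experts=32,
)
print(ids)      # [[8, 0], [31, 0]]
print(weights)  # [[0.7, 0.0], [0.6, 0.0]]
```

On a GPU the same logic would presumably be expressed as elementwise tensor operations (a subtraction, a range mask, and a masked select) so that no host synchronization is needed, which is consistent with the commit's removal of the `.cpu()` call.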