nvfp4-megamoe-kernel

Author	SHA1	Message	Date
biondizzle	b3451c74f8	Update README and CURRENT_BUG.md with current state. README: updated NVFP4 coverage table, status, and plan. CURRENT_BUG.md: full debugging journey, what works, what's next. Both reflect decision to build our own CuTeDSL kernels.	2026-05-18 20:05:03 +00:00
biondizzle	e8b289e30d	WIP: CuTeDSL shared expert kernel. Dedicated runner (shared_expert_pipeline.py) and test (test_shared_expert.py). Tried reusing the MoE runner with 1 expert; this fails because the MoE runner assumes hidden_size != HC_DIM for scatter. Needs a dedicated runner with correct scale assembly. Will continue tomorrow.	2026-05-18 20:02:19 +00:00
biondizzle	a51edd238e	Add post-quant-init forward hook to fix attention NVFP4. The key insight: process_weights_after_loading runs AFTER load_weights and sets up FlashInferCutlassNvFp4LinearKernel with broken input_global_scale_inv. Any fix inside load_weights gets overwritten. Solution: register a one-shot forward pre-hook that runs on the first forward call (guaranteed after all init). It dequantizes attention NVFP4 weights to BF16 and replaces quant_method with UnquantizedLinearMethod. Since process_weights_after_loading already ran, our changes won't be overwritten. Standalone test confirmed: all attention weights produce valid non-NaN output when dequantized to BF16.	2026-05-18 17:56:19 +00:00
biondizzle	5c1dda10f6	Add granular attention diagnostics: pre/post attn, embed, dequant stats	2026-05-18 14:24:14 +00:00
biondizzle	334e95047e	Fix: dequantize ALL attention NVFP4 projections to BF16. Root cause of NaN from layer 0: FlashInferCutlassNvFp4LinearKernel uses checkpoint input_scale for activation quantization, which produces NaN immediately. Fix: dequantize all attention NVFP4 weights (wq_a, wq_b, wkv, wo_a, wo_b) to BF16 at load time, bypassing the broken input_scale entirely. Uses existing _dequant_nvfp4_to_bf16 method. This trades memory for correctness. Future optimization: add warmup for attention input_global_scale_inv (same as MoE warmup).	2026-05-18 13:09:36 +00:00
biondizzle	9e7639fba4	Add layer-by-layer diagnostic prints (CLAWMINE_DEBUG=1, enforce-eager). When CLAWMINE_DEBUG=1, prints amax/mean/NaN/Inf after each layer. Must run with --enforce-eager (data-dependent prints break Dynamo). Gated by os.environ so the prints are dead-code-eliminated during compilation.	2026-05-18 12:51:51 +00:00
biondizzle	e65f2b2ba2	Update CURRENT_BUG.md with Bug 26 fix	2026-05-17 21:36:25 +00:00
biondizzle	6692166d0f	Update CURRENT_BUG.md: Bug 25 (swiglu_limit), shared expert path verification, variable padded offsets	2026-05-17 17:56:04 +00:00
biondizzle	87a223f1ac	Update CURRENT_BUG.md: current status, outstanding garbage output issue, hypotheses	2026-05-17 16:52:40 +00:00
biondizzle	3d0b1408b4	Update CURRENT_BUG.md: Bug 21 (shared buffers), clean up status	2026-05-17 15:52:06 +00:00
biondizzle	e2f33596a2	Update CURRENT_BUG.md: status through Bug 20, fixed-layout padding architecture	2026-05-17 15:46:13 +00:00
biondizzle	0d3c928ff2	Update CURRENT_BUG.md: full status through Bug 14, vLLM integration status, architecture docs	2026-05-17 13:32:41 +00:00
biondizzle	eb7d4f099b	Update CURRENT_BUG.md with Bug 8 (global→local expert ID) and Bug 8b (.cpu() sync)	2026-05-17 09:01:24 +00:00
biondizzle	ca3cba5bbd	Fix global→local expert ID remapping for EP and remove .cpu() sync. Root cause of CUDA_ERROR_ASSERT index out of bounds: topk_ids contains GLOBAL expert IDs (0-255) but the runner treated them as local IDs (0-31 with EP=8). Tokens for non-local experts got wrong expert assignments, causing out-of-bounds scatter indices in _assemble_scales_cudagraph_safe. Fixes: (1) add experts_start_idx param to CuTeDSLMoERunner; (2) in run(), remap global→local IDs and zero weights for non-local experts; (3) move _token_indices from CPU to GPU (remove sort_idx.cpu() sync); (4) add _fill_token_indices() and _needs_token_refill to handle CuTeDSL JIT GPU memory corruption (refill after first GEMM call).	2026-05-17 08:58:43 +00:00
biondizzle	ddffb7d8df	docs: current bug analysis — scale_a layout vs expert_offsets mismatch	2026-05-17 07:53:58 +00:00

15 Commits
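
The global→local expert ID remapping described in commit ca3cba5bbd can be illustrated with a small sketch. This is not the runner's actual code: it uses plain Python lists instead of GPU tensors, and the function name `remap_to_local`, the `num_local_experts` parameter, and the choice to point non-local entries at local index 0 are assumptions made for illustration. Only `experts_start_idx`, `topk_ids`, and the "zero weights for non-local experts" behaviour come from the commit message.

```python
def remap_to_local(topk_ids, topk_weights, experts_start_idx, num_local_experts):
    """Convert global expert IDs to rank-local IDs (hypothetical helper).

    Entries routed to experts owned by another rank keep a safe in-range
    index (0, an assumption) and get weight 0.0, so they contribute nothing
    and can never produce an out-of-bounds scatter index.
    """
    local_ids, local_weights = [], []
    for ids_row, weights_row in zip(topk_ids, topk_weights):
        row_ids, row_weights = [], []
        for global_id, weight in zip(ids_row, weights_row):
            local_id = global_id - experts_start_idx
            if 0 <= local_id < num_local_experts:
                row_ids.append(local_id)
                row_weights.append(weight)
            else:
                row_ids.append(0)        # placeholder index, always in range
                row_weights.append(0.0)  # non-local expert: zero its weight
        local_ids.append(row_ids)
        local_weights.append(row_weights)
    return local_ids, local_weights


# 256 global experts with EP=8 gives 32 experts per rank.
# The rank owning experts 64..95 has experts_start_idx = 64.
ids, weights = remap_to_local(
    topk_ids=[[3, 70], [95, 200]],
    topk_weights=[[0.6, 0.4], [0.7, 0.3]],
    experts_start_idx=64,
    num_local_experts=32,
)
print(ids)      # [[0, 6], [31, 0]]
print(weights)  # [[0.0, 0.4], [0.7, 0.0]]
```

In a tensor implementation the same logic would typically be a subtraction plus a mask rather than a loop, which avoids any host-side synchronisation.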