nvfp4-megamoe-kernel

Author	SHA1	Message	Date
biondizzle	e38d60a6e8	Add pipeline test with real model weights, add swiglu_limit to reference moe_pipeline	2026-05-17 18:07:44 +00:00
biondizzle	cc75a55bd9	restore: new bridge/moe_pipeline/layertest	2026-05-16 19:55:19 +00:00
biondizzle	0c878b3a9e	temp: restore old layertest+bridge for cosine comparison	2026-05-16 19:54:04 +00:00
biondizzle	0069769d12	debug: print global scales	2026-05-16 19:38:31 +00:00
biondizzle	84589fe984	debug: more prints	2026-05-16 19:31:54 +00:00
biondizzle	fa2d5708c5	debug: add L1 GEMM and SiLU output debug prints	2026-05-16 19:29:42 +00:00
biondizzle	4c06c51ec3	fix: moe_pipeline.py gate/up split — L1 output is 2*intermediate, not intermediate	2026-05-16 19:28:15 +00:00
biondizzle	174ad70dca	fix: same gate/up split fix in moe_pipeline.py	2026-05-16 04:04:53 +00:00
biondizzle	09ff5c5b98	feat: full NVFP4 MoE pipeline (L1→SiLU→L2→scatter)	2026-05-16 03:22:43 +00:00
	cutedsl/moe_pipeline.py: complete pipeline
	- stage_activation: BF16 → NVFP4 (keeps data in FP4)
	- L1 GEMM: NVFP4 × NVFP4 → BF16 (gate+up)
	- SiLU(gate) * up: BF16 (only nonlinear, can't avoid)
	- Re-quantize: BF16 → NVFP4 (back to native)
	- L2 GEMM: NVFP4 × NVFP4 → BF16 (down_proj)
	- Scatter with routing weights → BF16 output
	layertest.py: now tests the FULL MoE pipeline against BF16 reference.
	NVFP4-native: both GEMMs use float4_e2m1fn_x2 for A and B, float8_e4m3fn for block scales, float32 for global scales. BF16 only for SiLU activation and final scatter.
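
The data flow named in commit 09ff5c5b98 (L1 GEMM, gate/up split, SiLU(gate) * up, L2 GEMM, scatter with routing weights) can be sketched as a plain floating-point reference of the kind the layer test compares against. This is a minimal NumPy sketch, not the repository's moe_pipeline.py. The function names, argument layout, and weight shapes are hypothetical. It assumes the L1 output stores gate in the first half and up in the second half of its 2*intermediate columns, which the log does not confirm. NVFP4 quantization and swiglu_limit are left out.

```python
import numpy as np


def silu(x):
    # SiLU(x) = x * sigmoid(x)
    return x / (1.0 + np.exp(-x))


def reference_moe(x, w1, w2, topk_ids, topk_weights):
    """Float reference for L1 -> SiLU(gate) * up -> L2 -> weighted scatter.

    Hypothetical shapes (assumptions, not taken from the repository):
      x            (tokens, hidden)
      w1           (experts, hidden, 2 * intermediate)  gate and up fused
      w2           (experts, intermediate, hidden)      down_proj
      topk_ids     (tokens, k) integer expert index per routing slot
      topk_weights (tokens, k) routing weight per slot
    """
    tokens, hidden = x.shape
    intermediate = w2.shape[1]
    # The L1 output is 2 * intermediate wide, not intermediate (see 4c06c51ec3).
    assert w1.shape[2] == 2 * intermediate
    out = np.zeros((tokens, hidden), dtype=np.float32)
    for e in range(w1.shape[0]):
        # Tokens (and their routing slots) that selected expert e.
        tok, slot = np.nonzero(topk_ids == e)
        if tok.size == 0:
            continue
        h = x[tok] @ w1[e]                       # L1 GEMM: (n, 2 * intermediate)
        gate, up = h[:, :intermediate], h[:, intermediate:]  # assumed layout
        act = silu(gate) * up                    # the only nonlinear step
        y = act @ w2[e]                          # L2 GEMM: (n, hidden)
        # Scatter-add each expert output back to its token, scaled by routing weight.
        np.add.at(out, tok, topk_weights[tok, slot][:, None] * y)
    return out


def cosine_similarity(a, b):
    # Flattened cosine similarity, a common metric for comparing two pipeline outputs.
    a = np.asarray(a, dtype=np.float64).ravel()
    b = np.asarray(b, dtype=np.float64).ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```

A quantized pipeline's output could then be compared with `cosine_similarity(quantized_out, reference_moe(...))`. The log mentions a cosine comparison but does not state what threshold counts as a pass.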