nvfp4-megamoe-kernel

Author	SHA1	Message	Date
biondizzle	aa8563c626	Fused SwiGLU epilogue with granularity-8 weight interleave - Fix interleave_l1_weights: remove //2 bug (g=granularity_bf16 for N-axis) - Apply L1 weight+SF interleave in runner._ensure_stacked() and moe_pipeline - De-interleave L1 GEMM output before gate/up split - Fused SwiGLU kernel: epi_tile=(128,8) for subtile-level pairing - Even subtiles = gate: SiLU in FP32 registers, save to register buffer - Odd subtiles = up: silu(gate)*up from buffer - Both branches produce same BF16 tensor type (CuTeDSL constraint) - run_nvfp4_moe_fused() pipeline: fused L1 + PyTorch L2 - Runner: fused_swiglu=True option for CuTeDSLMoERunner - Layertest: both fused and non-fused paths PASS (cosine 0.988) - README.md updated with current status and lessons learned	2026-05-20 04:13:52 +00:00
biondizzle	6c04155167	wip: Step 2 gate/up pairing: SiLU validated, runtime conditionals blocked by CuTeDSL. SiLU in registers: PASS (0.034% error, Step 1 stable). Gate/up subtile detection: blocked by CuTeDSL type system. CuTeDSL compiles the kernel for ALL subtile iterations at once. Runtime conditionals (if is_gate_subtile) that affect: - Register tensor assignment → DSLRuntimeError (type structure mismatch) - TMA store skipping → corrupted output - Mask blending → wrong results. Path forward: use const_expr debug flag for the BF16 side output, or process gate/up in a separate post-GEMM kernel.	2026-05-20 03:26:20 +00:00
biondizzle	9f0c1b8c5d	wip: Step 1 SiLU validation complete, Step 2 gate/up pairing planning. Step 1 VALIDATED: - cute.exp works on register tensors in the epilogue - SiLU (x / (1+exp(-x))) produces correct results - Relative error vs PyTorch: 0.034%, max abs: 0.0625 (BF16 precision). Step 2 (gate/up pairing) approach: - Register-level pairing requires understanding acc_vec layout from tiled_copy_r2s - DeepGEMM pattern: (values[0], values[2]) pairs for tcgen05.ld - CuTeDSL retile may produce different layout than direct PTX loads - SMEM-level SiLU is a valid intermediate: avoids GMEM round-trip while working in logical (M, N) coordinate space - Non-interleaved weights + SMEM SiLU is simplest starting point	2026-05-20 03:16:34 +00:00
biondizzle	b84f2f7bf9	fix: cutlass.Float32 not cutlass.float32_t in fused epilogue. Step 1 SiLU validation: PASS - cute.exp works on register tensors - SiLU (x / (1+exp(-x))) in registers matches PyTorch reference - Relative error: 0.034%, Max abs error: 0.0625 (BF16 precision limit)	2026-05-20 03:12:23 +00:00
biondizzle	9c43c69a4c	wip: fused SwiGLU Stage 1 - SiLU in registers (full acc_vec). Stage 1 of the fused epilogue: applies SiLU (x * sigmoid(x)) to the full accumulator register tensor before writing BF16 to C. This validates that cute.exp and element-wise FP32 operations work on CuTe register tensors in the epilogue. The gate/up pairing is not yet implemented (Stage 2). The fused_swiglu flag is const_expr(0) by default, so the standard epilogue path is unchanged unless the flag is enabled.	2026-05-20 03:07:02 +00:00
biondizzle	2f053f674e	wip: fused SwiGLU kernel scaffold + bridge interleave + plan - fused_swiglu_grouped_mm.py: copy of torch_scaled_grouped_mm.py with class rename and fused_swiglu/swiglu_limit params added - bridge.py: added interleave_l1_weights, deinterleave_l1_weights, warmup_fused_swiglu_compilation - Pure-PyTorch interleave invariant passes (A@cat vs deinterleave(A@interleave)) - Standalone GEMM interleave test fails due to kernel-internal N-tiling layout (expected, skipping per plan) - FUSED_EPILOGUE_PLAN.md updated with register layout, amax shuffle plan, 4-step implementation strategy	2026-05-20 03:04:38 +00:00
biondizzle	ca28f1335d	refactor: copy CuTeDSL kernel into repo with local imports. Copied from CUTLASS examples (no more runtime dependency on /root/cutlass/examples/). Fixed all imports to use cutedsl.kernel.* instead of blackwell.kernel.*. Structure: cutedsl/__init__.py, cutedsl/kernel/__init__.py, cutedsl/kernel/moe/ (the MoE scaled grouped GEMM), cutedsl/kernel/blockscaled_gemm/ (dense blockscaled GEMM). test_cutedsl.py updated to import from our local copy.	2026-05-16 02:57:54 +00:00
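
The interleave invariant mentioned in commit 2f053f674e (A@cat vs deinterleave(A@interleave)) can be illustrated with a minimal pure-Python sketch. This is not the repo's bridge.py code: the helper names `interleave_blocks`, `deinterleave_blocks`, and `matmul` are hypothetical, and the sketch assumes weights stored as (K, N) with gate and up blocks of width g alternating along the N axis.

```python
def interleave_blocks(gate, up, g=8):
    """Alternate g-wide column blocks: [gate 0..g-1, up 0..g-1, gate g..2g-1, ...]."""
    n = len(gate[0])
    assert n == len(up[0]) and n % g == 0
    out = []
    for gate_row, up_row in zip(gate, up):
        row = []
        for start in range(0, n, g):
            row.extend(gate_row[start:start + g])
            row.extend(up_row[start:start + g])
        out.append(row)
    return out


def deinterleave_blocks(mat, g=8):
    """Inverse of interleave_blocks along the column axis; returns (gate, up)."""
    gate, up = [], []
    for row in mat:
        gate_row, up_row = [], []
        for start in range(0, len(row), 2 * g):
            gate_row.extend(row[start:start + g])
            up_row.extend(row[start + g:start + 2 * g])
        gate.append(gate_row)
        up.append(up_row)
    return gate, up


def matmul(a, b):
    """Plain (M, K) x (K, N) product on nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]
```

Because a matrix product computes each output column independently, permuting the weight columns only permutes the output columns, so de-interleaving the GEMM output should recover A@gate and A@up exactly. As the commit notes, this holds at the logical level only; the kernel's internal N-tiling may lay the output out differently.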

7 Commits
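
As a reference for the epilogue math described in the commits above, here is a minimal pure-Python sketch of SiLU and of gate/up pairing on an interleaved output row. It models only the arithmetic, not the CuTeDSL kernel: the function names are hypothetical, and it assumes even-numbered g-wide blocks hold gate values and odd-numbered blocks hold up values, as in the granularity-8 scheme of commit aa8563c626.

```python
import math


def silu(x):
    """SiLU as written in the commits: x / (1 + exp(-x)), i.e. x * sigmoid(x)."""
    # Note: math.exp(-x) overflows for very negative x; acceptable for a sketch.
    return x / (1.0 + math.exp(-x))


def swiglu_interleaved(row, g=8):
    """Compute silu(gate) * up for a row whose g-wide blocks alternate gate, up."""
    assert len(row) % (2 * g) == 0
    out = []
    for start in range(0, len(row), 2 * g):
        gate = row[start:start + g]          # even block: gate
        up = row[start + g:start + 2 * g]    # odd block: up
        out.extend(silu(a) * b for a, b in zip(gate, up))
    return out
```

The output row is half the width of the input row, which matches the usual SwiGLU shape where the L1 GEMM produces 2N columns and the activation reduces them to N.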