nvfp4-megamoe-kernel

Files

biondizzle 676a0448c0 CRITICAL FIX: _l1_out_buf was 2x too narrow — caused GPU memory corruption

The L1 GEMM produces gate+up combined output with 2*intermediate_size
BF16 columns, but _l1_out_buf was only allocated with intermediate_size
columns. The GEMM wrote past the buffer boundary, corrupting GPU memory
and causing cudaErrorInvalidValue on subsequent operations.

This was the root cause of all the cudaErrorInvalidValue errors in the
shared expert and MoE L2 paths: memory corrupted by the L1 buffer
overflow propagated downstream.

Fix: _l1_out_buf shape (max_rows, 2*intermediate_size) instead of
(max_rows, intermediate_size). Applied to both shared_expert.py and moe.py.
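
The sizing arithmetic behind this fix can be sketched in plain Python. This is
an illustrative sketch only, not code from shared_expert.py or moe.py: the
helper names l1_out_shape and overflow_bytes, and the example sizes, are
hypothetical. It assumes a contiguous BF16 buffer at 2 bytes per element.

```python
BF16_BYTES = 2  # bfloat16 stores each element in 2 bytes


def l1_out_shape(max_rows, intermediate_size):
    """Shape needed to hold the fused gate+up output of the L1 GEMM."""
    # Gate and up projections are written side by side, so the output
    # has 2 * intermediate_size columns.
    return (max_rows, 2 * intermediate_size)


def overflow_bytes(buf_shape, out_shape):
    """Bytes written past the end of the buffer (0 if the output fits)."""
    buf_bytes = buf_shape[0] * buf_shape[1] * BF16_BYTES
    out_bytes = out_shape[0] * out_shape[1] * BF16_BYTES
    return max(0, out_bytes - buf_bytes)


# Hypothetical sizes, chosen only for illustration.
max_rows, intermediate_size = 8, 2048

needed = l1_out_shape(max_rows, intermediate_size)
old_shape = (max_rows, intermediate_size)  # the too-narrow allocation

print("old overflow:", overflow_bytes(old_shape, needed), "bytes")
print("new overflow:", overflow_bytes(needed, needed), "bytes")
```

With the old shape, half of the GEMM output lands outside the buffer. With the
corrected shape, the overflow is zero.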

Also removed all DEBUG sync/print statements from quantize.py and
shared_expert.py. The bug was the buffer overflow, not the quantize
kernels.

2026-06-04 02:06:18 +00:00

__init__.py

Restructure: cutedsl/ -> dsv4/ with proper layering

2026-05-21 17:30:44 +00:00

custom_ops.py

Stage E: head-packed MQA/GQA, batch dim, custom_op, integration API

2026-05-27 15:15:03 +00:00

gemm_runner.py

Fix: Add out= parameter to run_fused_swiglu_grouped_gemm signature

2026-06-03 21:45:15 +00:00

layouts.py

Restructure: cutedsl/ -> dsv4/ with proper layering