nvfp4-megamoe-kernel

Files

biondizzle 2262e10fca fix: PV GEMM — V canonical uses CORES_MN_V=2 (block_mn=16), not 16

V is the B operand with block_mn=16 in the PV MMA. Its canonical layout
uses CORES_MN=16/8=2, not 128/8=16. The previous code used CORES_MN=16
which produced wrong canonical indexing → garbage PV output.

Also:
- V SMEM size is (16,16) canonical = 256 BF16, not (128,16) = 2048
- P written as 16 elements at row 0 (T=1 decode)
- V loaded from TMA (16,128) and sub-sampled to (16,16) canonical
- V TMA coord: {col_start, d_base} for (HD,s_k) tensor

2026-05-29 18:54:02 +00:00

cache

Flush compressor: schema fix, prepare_forward, flush_write kernels, state rotation

2026-05-22 00:25:47 +00:00

kernels

fix: PV GEMM — V canonical uses CORES_MN_V=2 (block_mn=16), not 16

2026-05-29 18:54:02 +00:00

layers

NVFP4-1.1 integration: GPU-only quantize kernel + MoE pipeline wiring

2026-05-25 16:19:07 +00:00

loader

Restructure: cutedsl/ -> dsv4/ with proper layering