nvfp4-megamoe-kernel

Files

biondizzle 016edbcc97 D5c: add row_sum output for proper external normalization

The kernel's O_unnorm is max-shifted (divided by 2^row_max), so
O_norm != O_unnorm * exp(-LSE). Instead, O_norm = O_unnorm / row_sum.
Added mRowSums output tensor to enable correct normalization.

2026-05-26 15:07:22 +00:00
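
The normalization argument in the commit message above can be checked numerically. The following is a minimal sketch, not the kernel's code: it assumes a single query row, uses natural-base exp instead of the kernel's base-2 exponent for simplicity, and all function names (`max_shifted_row`, `normalize_with_row_sum`, `normalize_with_lse`, `reference_output`) are hypothetical. It shows that a max-shifted unnormalized output is recovered by dividing by row_sum, while multiplying by exp(-LSE) leaves a stray factor of exp(-row_max).

```python
import math


def max_shifted_row(scores, values):
    """Emulate one attention row with max-shifted softmax weights.

    Returns (o_unnorm, row_max, row_sum, lse), where
    o_unnorm[d] = sum_j exp(scores[j] - row_max) * values[j][d].
    """
    row_max = max(scores)
    weights = [math.exp(s - row_max) for s in scores]
    row_sum = sum(weights)
    dim = len(values[0])
    o_unnorm = [
        sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)
    ]
    # LSE of the raw scores: log(sum_j exp(scores[j]))
    lse = row_max + math.log(row_sum)
    return o_unnorm, row_max, row_sum, lse


def normalize_with_row_sum(o_unnorm, row_sum):
    """Correct external normalization for a max-shifted output."""
    return [x / row_sum for x in o_unnorm]


def normalize_with_lse(o_unnorm, lse):
    """Only correct if o_unnorm was NOT max-shifted."""
    return [x * math.exp(-lse) for x in o_unnorm]


def reference_output(scores, values):
    """Plain softmax(scores) @ values, computed without any shifting."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    dim = len(values[0])
    return [
        sum((e / total) * v[d] for e, v in zip(exps, values))
        for d in range(dim)
    ]
```

With this sketch, `normalize_with_row_sum` matches `reference_output`, whereas `normalize_with_lse` returns the reference scaled by exp(-row_max), which is the mismatch the commit describes.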

cache

Flush compressor: schema fix, prepare_forward, flush_write kernels, state rotation

2026-05-22 00:25:47 +00:00

kernels

D5c: add row_sum output for proper external normalization

2026-05-26 15:07:22 +00:00

layers

NVFP4-1.1 integration: GPU-only quantize kernel + MoE pipeline wiring

2026-05-25 16:19:07 +00:00

loader

Restructure: cutedsl/ -> dsv4/ with proper layering

2026-05-21 17:30:44 +00:00

model

Fix layer construction: match existing API signatures, add RMSNorm impl

2026-05-21 23:31:58 +00:00

ops

NVFP4-3: add use_2cta_instrs conditional to gemm_runner

2026-05-25 16:42:02 +00:00

reference

Restructure: cutedsl/ -> dsv4/ with proper layering

2026-05-21 17:30:44 +00:00

__init__.py

Restructure: cutedsl/ -> dsv4/ with proper layering

2026-05-21 17:30:44 +00:00