nvfp4-megamoe-kernel

Files

biondizzle 3cb339129b FMHA SM100: Fix Phase 1 — single-thread reference for correctness

Use thread 0 for all computation (slow but correct).
SMEM for Q and O sharing across threads.
Online softmax with O rescale — correct D1.5 approach.
D3 SWA mask implemented.
Target: cosine similarity ~0.999998, then parallelize.

2026-05-28 05:32:47 +00:00
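
The commit message above mentions online softmax with O rescale, an SWA mask, and a cosine-similarity target. As a rough illustration only, here is a minimal pure-Python sketch of that idea for a single query row. It is not the repository's kernel. The function names, the block size, and the mask convention (causal, with key `j` visible when `q_pos - window < j <= q_pos`) are all assumptions made for this example.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def visible(j, q_pos, window):
    # Assumed SWA convention: causal, keeping the last `window` positions.
    return q_pos - window < j <= q_pos


def attend_reference(q, keys, values, q_pos, window):
    """Two-pass masked softmax attention for one query row."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * dot(q, k) if visible(j, q_pos, window) else -math.inf
              for j, k in enumerate(keys)]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]  # masked keys get weight 0.0
    denom = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / denom
            for d in range(dim)]


def attend_online(q, keys, values, q_pos, window, block=4):
    """Single-pass attention over key blocks with a running max and O rescale."""
    scale = 1.0 / math.sqrt(len(q))
    dim = len(values[0])
    m = -math.inf        # running row maximum
    l = 0.0              # running softmax denominator
    o = [0.0] * dim      # unnormalized output accumulator
    for start in range(0, len(keys), block):
        idx = [j for j in range(start, min(start + block, len(keys)))
               if visible(j, q_pos, window)]
        if not idx:
            continue     # fully masked block contributes nothing
        s = [scale * dot(q, keys[j]) for j in idx]
        m_new = max(m, max(s))
        alpha = math.exp(m - m_new)   # 0.0 on the first block (m is -inf)
        l *= alpha
        o = [x * alpha for x in o]    # rescale O to the new maximum
        for sj, j in zip(s, idx):
            p = math.exp(sj - m_new)
            l += p
            for d in range(dim):
                o[d] += p * values[j][d]
        m = m_new
    return [x / l for x in o]


def cosine(a, b):
    return dot(a, b) / math.sqrt(dot(a, a) * dot(b, b))


# Small deterministic comparison (sizes are arbitrary, chosen for illustration).
rng = random.Random(0)
seq_len, head_dim = 16, 8
q = [rng.uniform(-1, 1) for _ in range(head_dim)]
keys = [[rng.uniform(-1, 1) for _ in range(head_dim)] for _ in range(seq_len)]
values = [[rng.uniform(-1, 1) for _ in range(head_dim)] for _ in range(seq_len)]

ref = attend_reference(q, keys, values, q_pos=11, window=6)
out = attend_online(q, keys, values, q_pos=11, window=6)
print("cos =", cosine(ref, out))
```

In double precision the two paths should agree almost exactly. A lower-precision GPU kernel would be expected to land slightly below 1.0, which is presumably why the commit states a target rather than exact equality.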

cache

Flush compressor: schema fix, prepare_forward, flush_write kernels, state rotation

2026-05-22 00:25:47 +00:00

kernels

FMHA SM100: Fix Phase 1 — single-thread reference for correctness

2026-05-28 05:32:47 +00:00

layers

NVFP4-1.1 integration: GPU-only quantize kernel + MoE pipeline wiring

2026-05-25 16:19:07 +00:00

loader

Restructure: cutedsl/ -> dsv4/ with proper layering

2026-05-21 17:30:44 +00:00

model

Fix layer construction: match existing API signatures, add RMSNorm impl

2026-05-21 23:31:58 +00:00

ops

Stage E: head-packed MQA/GQA, batch dim, custom_op, integration API

2026-05-27 15:15:03 +00:00

reference

Restructure: cutedsl/ -> dsv4/ with proper layering

2026-05-21 17:30:44 +00:00

__init__.py

Restructure: cutedsl/ -> dsv4/ with proper layering

2026-05-21 17:30:44 +00:00