nvfp4-megamoe-kernel

Files

biondizzle afb93eae22 D1.5: Revert broken TMEM round-trip O rescale, document as fundamentally broken

TMEM round-trip via Ld32x32bOp/St32x32bOp corrupts O accumulator data
even with CUTLASS correction_rescale pattern. All variants tested:
- Repetition(16) + composition (CUTLASS exact pattern) — BROKEN
- Repetition(32) + composition — BROKEN
- Repetition(16) raw layout (no composition) — BROKEN
Even NO-OP (multiply by 1.0) produces catastrophically wrong results.

The production path remains the Python KV merge (cosine similarity 0.999998 for s_k up to 1024).
Next: SMEM accumulator approach (one-way TMEM→REGS→SMEM per kt).

2026-05-26 20:55:16 +00:00
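
For readers unfamiliar with the rescale step the commit message refers to, here is a minimal pure-Python sketch of merging per-chunk attention results with a running max and softmax denominator. It assumes the "Python KV merge" follows the standard log-sum-exp merge used in split-KV attention; the repository's actual code is not shown on this page, and all names below (`partial_attention`, `merge`, `chunked_attention`, `cosine`) are hypothetical.

```python
import math
import random


def partial_attention(q, keys, values):
    """Attention of one query over a single KV chunk.

    Returns (o, m, l): the unnormalized output, the max score in the
    chunk, and the softmax denominator relative to that max.
    """
    scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    m = max(scores)
    p = [math.exp(s - m) for s in scores]
    l = sum(p)
    dim = len(values[0])
    o = [sum(pj * v[d] for pj, v in zip(p, values)) for d in range(dim)]
    return o, m, l


def merge(a, b):
    """Combine two partial results by rescaling both to a shared max."""
    o_a, m_a, l_a = a
    o_b, m_b, l_b = b
    m = max(m_a, m_b)
    s_a = math.exp(m_a - m)  # rescale factor for the first partial O
    s_b = math.exp(m_b - m)  # rescale factor for the second partial O
    o = [s_a * x + s_b * y for x, y in zip(o_a, o_b)]
    return o, m, s_a * l_a + s_b * l_b


def chunked_attention(q, keys, values, chunk):
    """Process KV in chunks of `chunk` rows and merge the partial results."""
    state = None
    for start in range(0, len(keys), chunk):
        part = partial_attention(
            q, keys[start:start + chunk], values[start:start + chunk]
        )
        state = part if state is None else merge(state, part)
    o, _, l = state
    return [x / l for x in o]  # normalize once, after the last chunk


def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Illustrative data only; sizes are arbitrary and not taken from the repo.
rng = random.Random(0)
dim, s_k = 8, 64
q = [rng.uniform(-1, 1) for _ in range(dim)]
keys = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(s_k)]
values = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(s_k)]

reference = chunked_attention(q, keys, values, chunk=s_k)  # single chunk
merged = chunked_attention(q, keys, values, chunk=16)      # four chunks
print(cosine(reference, merged))
```

The rescale factors `s_a` and `s_b` play the role of the O rescale: each partial accumulator is multiplied by `exp(m_old - m_new)` before being summed. Doing this merge outside the kernel avoids reading the accumulator back and rewriting it in place, which is the round-trip the commit reports as broken.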

cache

Flush compressor: schema fix, prepare_forward, flush_write kernels, state rotation

2026-05-22 00:25:47 +00:00

kernels

D1.5: Revert broken TMEM round-trip O rescale, document as fundamentally broken

2026-05-26 20:55:16 +00:00

layers

NVFP4-1.1 integration: GPU-only quantize kernel + MoE pipeline wiring

2026-05-25 16:19:07 +00:00

loader

Restructure: cutedsl/ -> dsv4/ with proper layering