nvfp4-megamoe-kernel

Files

biondizzle 7d5c093c99 Fix KV cache crash: skip SWA cache write on Blackwell

The SWA KV cache uses the fp8_ds_mla packed layout (37376 bytes per slot,
not 512), so our naive FP8 quant + write hit a shape mismatch.

Fix: skip the SWA cache write entirely. The compressor (Triton)
handles the compressed cache. For full SDPA attention, we use the
raw kv tensor directly, so the paged cache is not needed at all
during prefill.
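
The mismatch and the skip can be illustrated with a minimal sketch. This is not the repository's code: the function name, the constants' names, and the `skip_write` flag are hypothetical, and only the two byte counts come from the commit message above.

```python
# Illustrative sketch only: every name here is hypothetical, not taken from the repository.
FP8_DS_MLA_SLOT_BYTES = 37376  # packed slot size quoted in the commit message
NAIVE_SLOT_BYTES = 512         # slot size the naive FP8 quant + write assumed


def plan_swa_cache_write(payload_bytes: int, slot_bytes: int, skip_write: bool) -> str:
    """Decide what to do with one SWA cache slot write."""
    if skip_write:
        # The fix: never write; attention reads the raw kv tensor instead.
        return "skip"
    if payload_bytes != slot_bytes:
        # The crash condition: the payload does not fit the packed layout.
        raise ValueError(
            f"shape mismatch: payload {payload_bytes} B vs slot {slot_bytes} B"
        )
    return "write"
```

With `skip_write=True` the size check is never reached, which is why skipping the write avoids the crash without touching the packed layout.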

2026-05-19 08:21:57 +00:00

kernels/linear/nvfp4

Fix OOM: add --max-model-len=876544 + revert CPU dummy weight

2026-05-19 07:35:43 +00:00

patches

Fix KV cache crash: skip SWA cache write on Blackwell

2026-05-19 08:21:57 +00:00

cutedsl_quant_method.py

Fix OOM: add --max-model-len=876544 + revert CPU dummy weight

2026-05-19 07:35:43 +00:00

nvfp4_cutedsl.py

Replace autograd.Function with torch.library.custom_op for Dynamo compat

2026-05-19 01:54:48 +00:00