nvfp4-megamoe-kernel

Files

biondizzle eef0ef76af Fix NVFP4 compressor scale loading: buffer and concatenate scale shards

The stacked params mapping (wkv + wgate → fused_wkv_wgate) calls
weight_loader(param, weight, shard_id), but the PerTensorScaleParameter
and ModelWeightParameter used for NVFP4 scale params don't support
shard_id: load_column_parallel_weight asserts shape equality, so a
single shard cannot be loaded into the fused param.

Fix: buffer input_scale, weight_scale, weight_scale_2 for fused_wkv_wgate
shards, then concatenate along dim 0 and copy_ into the param after all
weights are loaded.
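
A minimal sketch of the buffer-then-concatenate approach described above, not the code from this commit. ScaleShardBuffer, its methods, the parameter names, and the wkv-before-wgate shard order are all hypothetical and chosen for illustration. Only the idea comes from the commit message: hold the scale shards back, concatenate them along dim 0, and copy_ the result into the fused param once loading is done.

```python
from collections import defaultdict

import torch


class ScaleShardBuffer:
    """Hypothetical helper: holds scale shards until every weight is loaded."""

    def __init__(self, shard_order):
        # Order in which shards are stacked in the fused param (assumed here).
        self.shard_order = list(shard_order)
        self._pending = defaultdict(dict)  # param name -> {shard_id: tensor}

    def add(self, param_name, shard_id, tensor):
        """Buffer one shard instead of passing it to weight_loader."""
        if shard_id not in self.shard_order:
            raise KeyError(f"unknown shard_id: {shard_id!r}")
        tensor = tensor.detach()
        # Promote 0-dim per-tensor scales to shape (1,) so torch.cat accepts them.
        if tensor.dim() == 0:
            tensor = tensor.reshape(1)
        self._pending[param_name][shard_id] = tensor

    def flush(self, params):
        """Concatenate buffered shards along dim 0 and copy them into the params."""
        for name, shards in self._pending.items():
            missing = [s for s in self.shard_order if s not in shards]
            if missing:
                raise ValueError(f"{name}: missing shards {missing}")
            fused = torch.cat([shards[s] for s in self.shard_order], dim=0)
            param = params[name]
            if param.shape != fused.shape:
                raise ValueError(
                    f"{name}: expected {tuple(param.shape)}, got {tuple(fused.shape)}"
                )
            with torch.no_grad():
                param.copy_(fused)
        self._pending.clear()


# Usage: shards may arrive in any order; flush() runs after all weights are loaded.
buf = ScaleShardBuffer(["wkv", "wgate"])
buf.add("fused_wkv_wgate.weight_scale", "wgate", torch.full((2, 4), 2.0))
buf.add("fused_wkv_wgate.weight_scale", "wkv", torch.full((3, 4), 1.0))
buf.add("fused_wkv_wgate.input_scale", "wkv", torch.tensor(0.5))
buf.add("fused_wkv_wgate.input_scale", "wgate", torch.tensor(0.25))

params = {
    "fused_wkv_wgate.weight_scale": torch.nn.Parameter(
        torch.zeros(5, 4), requires_grad=False
    ),
    "fused_wkv_wgate.input_scale": torch.nn.Parameter(
        torch.zeros(2), requires_grad=False
    ),
}
buf.flush(params)
```

Deferring the copy until flush() means the shape check runs once against the fully fused tensor, which sidesteps the per-shard shape assertion.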

2026-05-18 23:24:08 +00:00

patches

Fix NVFP4 compressor scale loading: buffer and concatenate scale shards

2026-05-18 23:24:08 +00:00

cutedsl_quant_method.py

Fix: add abstract create_weights to CuTeDSLNvfp4LinearMethod

2026-05-18 20:40:48 +00:00

nvfp4_cutedsl.py

Fix torch.compile: use custom autograd Function instead of @torch.compiler.disable

2026-05-18 21:38:28 +00:00