Rewrite single_shot_inference.py: correct weight keys, NVFP4 two-level scale, compressor+indexer connected

- Fixed weight key format: model.layers.{li}.self_attn.* (was layers.{li}.attn.*)
- Added NVFP4 two-level scale: weight_scale * weight_scale_2 * input_scale
- Proper CSA compressor: overlapping Ca/Cb streams, token-level softmax
- Proper HCA compressor: non-overlapping, single stream
- Indexer: NVFP4 q_b_proj + weights_proj + own compressor at index_head_dim
- Compressed KV (dim=hd) concatenated with SWA KV for attention
- Correct MoE key format: gate_proj/up_proj/down_proj
- Correct mHC key format: attn_hc.{fn,base,scale} and ffn_hc.{fn,base,scale}
- Compressor is no longer disconnected: the pipeline now runs end to end (E2E)
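
The attention key rename in the first bullet can be illustrated with a small remapping helper. This is only a sketch: `remap_attn_key` and the `q_proj.weight` suffix are hypothetical names chosen for illustration, and only the two prefixes (`layers.{li}.attn.*` and `model.layers.{li}.self_attn.*`) come from this commit message.

```python
import re

# Old attention key layout named in the commit message: layers.{li}.attn.*
_OLD_ATTN_KEY = re.compile(r"^layers\.(\d+)\.attn\.(.+)$")


def remap_attn_key(old_key: str) -> str:
    """Map an old-style attention key to the new layout.

    Hypothetical helper: keys that do not match the old attention
    pattern are returned unchanged.
    """
    match = _OLD_ATTN_KEY.match(old_key)
    if match is None:
        return old_key
    layer_idx, rest = match.groups()
    # New layout named in the commit message: model.layers.{li}.self_attn.*
    return f"model.layers.{layer_idx}.self_attn.{rest}"


if __name__ == "__main__":
    # "q_proj.weight" is an illustrative suffix, not taken from the checkpoint.
    print(remap_attn_key("layers.3.attn.q_proj.weight"))
```

A helper like this is mainly useful for checking which layout a given checkpoint uses before loading it.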
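
As a rough illustration of the weight-side part of the NVFP4 two-level scale, the pure-Python sketch below decodes packed FP4 (E2M1) codes and applies a per-block scale followed by a per-tensor scale. It rests on several assumptions that are not confirmed by this commit: two codes are packed per byte with the low nibble first, `weight_scale` holds one value per block of 16 elements, and `weight_scale_2` is a single per-tensor factor. How `input_scale` enters the product is not spelled out in the message, so it is left out here. The function names are hypothetical.

```python
# Magnitudes representable by a 3-bit E2M1 code (the 4th bit is the sign).
E2M1_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def decode_e2m1(nibble: int) -> float:
    """Decode one 4-bit E2M1 code into a float."""
    sign = -1.0 if nibble & 0x8 else 1.0
    return sign * E2M1_VALUES[nibble & 0x7]


def dequantize_row(packed, weight_scale, weight_scale_2, block_size=16):
    """Dequantize one row of packed FP4 codes.

    packed:         bytes, two codes per byte (assumed: low nibble first)
    weight_scale:   list of per-block scales, one per `block_size` elements
    weight_scale_2: single per-tensor scale
    """
    values = []
    for byte in packed:
        values.append(decode_e2m1(byte & 0x0F))  # low nibble
        values.append(decode_e2m1(byte >> 4))    # high nibble
    return [
        v * weight_scale[i // block_size] * weight_scale_2
        for i, v in enumerate(values)
    ]


if __name__ == "__main__":
    # Tiny example with block_size=2 so both blocks are visible.
    row = dequantize_row(bytes([0x21, 0xF7]), [2.0, 0.5], 0.5, block_size=2)
    print(row)
```

Keeping the two scales as separate multiplications mirrors the way the bullet writes the product; a real implementation would vectorize this on tensors rather than loop in Python.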

biondizzle

2026-05-31 21:48:59 +00:00

parent 4988e77179

commit eb08cd06d1

1 changed file with 378 additions and 744 deletions

single_shot_inference.py (1122 lines changed)

File diff suppressed because it is too large.
