doc: INDEXER_PROBE_RESULTS_20260602 — compressed key width is ihd=128, not n_ih*ihd=8192

2026-06-02 05:51:24 +00:00
parent 8162c586c3
commit 0fbf28dd54


# Indexer probe results — 2026-06-02
## Raw output
### Indexer load state (after fix for weight path bug)
```
Indexer L2: q_b_lin=True wp_lin=True compressor=True
Indexer L4: q_b_lin=True wp_lin=True compressor=True
Indexer L6: q_b_lin=True wp_lin=True compressor=True
```
Note: before the weight-path fix, these lines reported `compressor=False`. The original code looked for
`*.indexer.compressor.kv_proj.weight`, but the checkpoint keys are `*.indexer.kv_proj.weight`
(there is no extra `.compressor` level below `indexer`). Fix: `Indexer.load` now looks for
`f"{pfx}.kv_proj.weight"` instead of `f"{pfx}.compressor.kv_proj.weight"`.
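
A minimal sketch of why the old lookup missed and the new one hits, using the L2 key names from the checkpoint scan further down. `kv_proj_key` is a hypothetical helper, and the layer-2 prefix value is an assumption about how `pfx` is built.

```python
# Key names copied from the safetensors scan of the L2 indexer.
CHECKPOINT_KEYS = {
    "model.layers.2.self_attn.compressor.indexer.q_b_proj.weight",
    "model.layers.2.self_attn.compressor.indexer.weights_proj.weight",
    "model.layers.2.self_attn.compressor.indexer.kv_proj.weight",
    "model.layers.2.self_attn.compressor.indexer.gate_proj.weight",
    "model.layers.2.self_attn.compressor.indexer.position_bias",
    "model.layers.2.self_attn.compressor.indexer.kv_norm.weight",
}


def kv_proj_key(pfx: str, nested: bool = False) -> str:
    """Build the kv_proj weight key for an indexer prefix (hypothetical helper).

    nested=True reproduces the old, buggy path with an extra `.compressor` level.
    """
    return f"{pfx}.compressor.kv_proj.weight" if nested else f"{pfx}.kv_proj.weight"


pfx = "model.layers.2.self_attn.compressor.indexer"  # assumed prefix for layer 2
old_key = kv_proj_key(pfx, nested=True)
new_key = kv_proj_key(pfx)
print(old_key in CHECKPOINT_KEYS, new_key in CHECKPOINT_KEYS)  # False True
```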
### Compressor output shapes (at first block boundary, token 3 of prefill)
```
COMPRESSOR OUT [hd=512 kv_dim=1024 ratio=4 is_csa=True]: compressed.shape=(1, 512) dtype=torch.bfloat16 stride=(512, 1) contig=True
COMPRESSOR OUT [hd=128 kv_dim=256 ratio=4 is_csa=True]: compressed.shape=(1, 128) dtype=torch.bfloat16 stride=(128, 1) contig=True
```
The first line is the **main CSA compressor** (compresses KV for attention).
The second line is the **indexer's internal compressor** (compresses hidden states for indexer scoring).
### Reshape failure (at Indexer.forward, L2, token 3)
```
!!! RESHAPE FAILURE L2 !!!
comp_indexer_kv.shape = (1, 128)
tried to reshape to (1, 64, 128)
total elements: have 128, need 8192
k_idx = comp_indexer_kv.reshape(n_comp, self.n_ih, self.ihd)
RuntimeError: shape '[1, 64, 128]' is invalid for input of size 128
```
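
The failure is a plain element-count mismatch: a reshape only succeeds when the target shape holds exactly as many elements as the input. The check below reproduces that arithmetic with the L2 numbers (`n_comp=1`, `n_ih=64`, `ihd=128`); `can_reshape` is an illustrative stand-in for the check PyTorch performs, not PyTorch itself.

```python
import math


def can_reshape(src_shape, dst_shape) -> bool:
    """True when dst_shape has exactly as many elements as src_shape."""
    return math.prod(src_shape) == math.prod(dst_shape)


n_comp, n_ih, ihd = 1, 64, 128
have = (n_comp, ihd)        # comp_indexer_kv.shape observed at L2
want = (n_comp, n_ih, ihd)  # target used by Indexer.forward

print(math.prod(have), math.prod(want))     # 128 8192
print(can_reshape(have, want))              # False
print(math.prod(want) // math.prod(have))   # 64, i.e. n_ih
```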
### Checkpoint weight shapes (from safetensors scan of L2 indexer)
```
model.layers.2.self_attn.compressor.indexer.q_b_proj.weight: shape=(8192, 768) dtype=uint8
model.layers.2.self_attn.compressor.indexer.weights_proj.weight: shape=(64, 3584) dtype=uint8
model.layers.2.self_attn.compressor.indexer.kv_proj.weight: shape=(256, 3584) dtype=uint8
model.layers.2.self_attn.compressor.indexer.gate_proj.weight: shape=(256, 3584) dtype=uint8
model.layers.2.self_attn.compressor.indexer.position_bias: shape=(4, 256) dtype=bfloat16
model.layers.2.self_attn.compressor.indexer.kv_norm.weight: shape=(128,) dtype=bfloat16
```
### KVCache comp_idx_buf crash (before width fix)
```
RuntimeError: The expanded size of the tensor (512) must match the existing size (128) at non-singleton dimension 1. Target sizes: [1, 512]. Tensor sizes: [128]
at: self.comp_idx_buf[self.n_comp:end] = idx_kv
```
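The original `comp_idx_buf` was allocated as `(max_comp, head_dim=512)`, but the indexer's compressed keys have width 128, so each 128-wide `idx_kv` row cannot be expanded into the `[1, 512]` target slice.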
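
A sketch of a width-checked buffer sized for the indexer keys. It models the buffer with Python lists instead of a `torch` tensor; the class name `CompIdxBuf`, the `max_comp` value, and the `append` method are hypothetical and only mirror the `self.comp_idx_buf[self.n_comp:end] = idx_kv` write from the traceback. The point is that the row width should come from `ihd`, not `head_dim`, and that a mismatch should fail with a message naming both widths.

```python
class CompIdxBuf:
    """Preallocated store for compressed indexer keys, one row per compressed block."""

    def __init__(self, max_comp: int, width: int):
        self.width = width  # row width; should be ihd (128), not head_dim (512)
        self.rows = [[0.0] * width for _ in range(max_comp)]
        self.n_comp = 0  # rows written so far

    def append(self, idx_kv):
        """Write compressed rows at [n_comp, end); every row must be `width` wide."""
        end = self.n_comp + len(idx_kv)
        if end > len(self.rows):
            raise IndexError(f"buffer full: need {end} rows, have {len(self.rows)}")
        for row in idx_kv:  # validate everything before writing anything
            if len(row) != self.width:
                raise ValueError(f"row width {len(row)} != buffer width {self.width}")
        self.rows[self.n_comp:end] = [list(r) for r in idx_kv]
        self.n_comp = end


ihd, head_dim = 128, 512
buf = CompIdxBuf(max_comp=8, width=ihd)
buf.append([[1.0] * ihd])  # one (1, 128) compressed block: accepted

bad = CompIdxBuf(max_comp=8, width=head_dim)  # the original 512-wide layout
try:
    bad.append([[1.0] * ihd])
    rejected = False
except ValueError as err:
    rejected = True
    print(err)  # row width 128 != buffer width 512
```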

---
## Answers
### Q1: shape of indexer.compressor.forward(...)[0]
Observed: `(1, 128)` — width **W = 128 = ihd** (the indexer head dim)
Hypothesis matched: **A** (paper-aligned: `c_I = 128`)
The indexer compressor outputs one compressed block of width `ihd=128` per `m=4` tokens.
This is NOT `n_ih × ihd = 8192` (hypothesis B) and NOT `512` (hypothesis C, the original `comp_idx_buf` width).
### Q2: indexer.compressor.kv_dim
Observed: **256** (= `2 × ihd = 2 × 128`)
Expected per hypothesis A: 256 ✓
This is the internal projection width *before* the softmax/reduce: the compressor's
two GEMMs (`kv_proj` and `gate_proj`) each produce `(T, 256)`, and the CUDA reduce
kernel then collapses each group of `m=4` tokens into one `(1, 128)` output.
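
Both compressor lines in the raw output follow the same shape pattern: `kv_dim` is twice the output width, and one row comes out per `ratio` tokens. The function below is a shape-only model of that pattern, inferred from the two observed configurations; it says nothing about how the CUDA reduce kernel combines the values, and `compressed_shape` is a hypothetical name.

```python
def compressed_shape(num_tokens: int, kv_dim: int, ratio: int) -> tuple[int, int]:
    """Compressor output shape after num_tokens tokens (shape bookkeeping only).

    One row per complete block of `ratio` tokens; row width is kv_dim // 2,
    matching both observed compressors (1024 -> 512 and 256 -> 128).
    """
    return (num_tokens // ratio, kv_dim // 2)


# Assumes "token 3 of prefill" is a 0-based index, so 4 tokens have been seen.
print(compressed_shape(4, kv_dim=1024, ratio=4))  # (1, 512): main CSA compressor
print(compressed_shape(4, kv_dim=256, ratio=4))   # (1, 128): indexer compressor
print(compressed_shape(3, kv_dim=256, ratio=4))   # (0, 128): no complete block yet
```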
### Q3: q_b_lin and wp_lin shapes
From the checkpoint (NVFP4 packs two 4-bit values into each `uint8`, so the stored weight shape is `(N, K / 2)` and only the input dimension is halved):
- **q_b_lin**: in_features = 768×2 = 1536 (q_a lora dim), out_features = 8192 (= n_ih × ihd = 64 × 128) ✓
- **wp_lin**: in_features = 3584×2 = 7168 (hidden size), out_features = 64 (= n_ih) ✓
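
The Q3 numbers follow from a single rule: only the input dimension is packed, two FP4 values per byte. A small sketch that recovers logical `(out_features, in_features)` from the packed shapes in the checkpoint scan; `logical_shape` is a hypothetical helper and ignores any scale tensors, which the scan above does not list.

```python
FP4_PER_BYTE = 2  # two 4-bit values packed into each uint8


def logical_shape(packed_shape):
    """Map a packed (N, K // 2) uint8 weight shape to logical (out, in) features."""
    n, k_packed = packed_shape
    return (n, k_packed * FP4_PER_BYTE)


packed = {
    "q_b_proj": (8192, 768),
    "weights_proj": (64, 3584),
    "kv_proj": (256, 3584),
    "gate_proj": (256, 3584),
}
for name, shape in packed.items():
    print(name, logical_shape(shape))
# q_b_proj (8192, 1536)
# weights_proj (64, 7168)
# kv_proj (256, 7168)
# gate_proj (256, 7168)
```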
### Q4: Runtime k_idx shape and reshape validity
- `comp_indexer_kv.shape` before reshape: **(1, 128)**
- Reshape target `(n_comp, 64, 128)`: **FAILED**
- Total elements: **have=128, need=8192** — off by **64×** (exactly `n_ih=64`)

The current `Indexer.forward` calls `comp_indexer_kv.reshape(n_comp, self.n_ih, self.ihd)`,
which assumes the stored indexer keys hold `n_ih × ihd = 8192` elements per block.
The actual stored width is `ihd = 128`: one vector per compressed block, not one per
indexer head.

As a result, the scoring einsum `torch.einsum('tnd,cnd->tnc', q_idx, k_idx)` cannot
work as written. The indexer query `q_idx` is `(T, 64, 128)` (one vector per indexer head),
but the stored key is `(n_comp, 128)`: a single vector per block, with no head axis to
match the `n` in `'cnd'`. The scoring formula therefore has to share each block key
across all 64 indexer heads instead of indexing a per-head key.
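
One scoring form that fits a single key per block is the lightning-indexer formula from the DeepSeek-V3.2 paper, I(t, s) = Σ_j w(t, j) · ReLU(q(t, j) · k(s)), where every indexer head scores against the same key and the per-head weights w would presumably come from `wp_lin` (out_features = n_ih = 64). Whether the paper this probe follows uses exactly this form is an assumption. In tensor terms the dot products become `torch.einsum('tnd,cd->tnc', q_idx, k_idx)` (no head axis on the key), followed by a ReLU and a weighted sum over heads. The pure-Python sketch below spells that out on toy sizes; `index_scores` is hypothetical and is not the current `Indexer.forward`.

```python
def relu(x: float) -> float:
    return x if x > 0.0 else 0.0


def dot(a, b) -> float:
    return sum(x * y for x, y in zip(a, b))


def index_scores(q_idx, w, k_idx):
    """Score every query token against every compressed block.

    q_idx: [T][n_ih][ihd]  per-indexer-head queries
    w:     [T][n_ih]       per-head weights (assumed to come from wp_lin)
    k_idx: [n_comp][ihd]   ONE key per compressed block, shared by all heads
    returns [T][n_comp]
    """
    return [
        [sum(w_t[j] * relu(dot(q_t[j], k_c)) for j in range(len(q_t))) for k_c in k_idx]
        for q_t, w_t in zip(q_idx, w)
    ]


# Toy sizes: T=1 token, n_ih=2 heads, ihd=2, n_comp=2 blocks.
q_idx = [[[1.0, 0.0], [0.0, 1.0]]]
w = [[0.5, 2.0]]
k_idx = [[3.0, -1.0], [1.0, 1.0]]
print(index_scores(q_idx, w, k_idx))  # [[1.5, 2.5]]
```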

---
## Conclusion
The indexer compressor produces compressed keys of width **`ihd = 128`** (one
vector per compressed block, matching the paper's `c_I`). `Indexer.forward` incorrectly
assumes the stored keys have width `n_ih × ihd = 8192` (one key per indexer head),
which causes the 64× reshape failure at the scoring step. The original `comp_idx_buf`
in `KVCache` was also 4× too wide (512 vs 128). Both the scoring einsum and the key
storage need to be rearchitected around the paper's single-vector-per-block compressed
key format.

---
## Additional findings (not in original scope)
1. **Weight path bug**: `Indexer.load` looked for `*.indexer.compressor.kv_proj.weight`
but the checkpoint has `*.indexer.kv_proj.weight` (no `.compressor` nesting).
Fixed in commit 5be31d8.
2. **comp_idx_buf width**: was `head_dim=512`, should be `ihd=128`. Temporarily fixed
for this probe in commit 8162c58; the proper fix depends on the audit rewrite.
3. **Indexer compressor never loaded before**: the weight path bug meant `indexer.compressor`
was always `None`, so the indexer was always skipped (`comp_idx_kv=None` on every
CSA layer). This means the indexer has NEVER been exercised in production runs.
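
Finding 3 went unnoticed because a missing compressor silently disabled the indexer. A sketch of a load-time guard that would turn that silent skip into an error; `check_indexers_loaded` and the dict-per-layer layout are hypothetical, while the three flag names mirror the load-state log at the top of this file.

```python
def check_indexers_loaded(load_state: dict) -> None:
    """Raise if any CSA layer's indexer is missing a component after loading.

    load_state maps layer index -> {"q_b_lin": bool, "wp_lin": bool, "compressor": bool}.
    """
    missing = {
        layer: [name for name, ok in parts.items() if not ok]
        for layer, parts in load_state.items()
        if not all(parts.values())
    }
    if missing:
        raise RuntimeError(f"indexer components not loaded: {missing}")


after_fix = {n: {"q_b_lin": True, "wp_lin": True, "compressor": True} for n in (2, 4, 6)}
check_indexers_loaded(after_fix)  # passes: all components present

# Assumed pre-fix state: only the compressor failed to load.
before_fix = {n: {"q_b_lin": True, "wp_lin": True, "compressor": False} for n in (2, 4, 6)}
try:
    check_indexers_loaded(before_fix)
    caught = None
except RuntimeError as err:
    caught = err
    print(err)
```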