doc: INDEXER_PROBE_RESULTS_20260602 — compressed key width is ihd=128, not n_ih*ihd=8192
archived_plans/INDEXER_PROBE_RESULTS_20260602.md
# Indexer probe results — 2026-06-02

## Raw output

### Indexer load state (after fix for weight path bug)

```
Indexer L2: q_b_lin=True wp_lin=True compressor=True
Indexer L4: q_b_lin=True wp_lin=True compressor=True
Indexer L6: q_b_lin=True wp_lin=True compressor=True
```

Note: before the weight path fix, these lines reported `compressor=False`. The original code looked for
`*.indexer.compressor.kv_proj.weight`, but the checkpoint keys are `*.indexer.kv_proj.weight`
(no extra `.compressor` nesting). Fix: changed `Indexer.load` to look for
`f"{pfx}.kv_proj.weight"` instead of `f"{pfx}.compressor.kv_proj.weight"`.
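
A minimal sketch of the corrected lookup, assuming the checkpoint is available as a plain mapping from key names to tensors. The helper name `find_indexer_kv_proj` and the decision to raise on a miss are illustrative assumptions, not the actual `Indexer.load` code; only the two key patterns come from the note above.

```python
def find_indexer_kv_proj(checkpoint, pfx):
    """Return the indexer kv_proj weight stored under prefix `pfx`.

    `checkpoint` is assumed to be a dict-like mapping of key name -> tensor.
    """
    key = f"{pfx}.kv_proj.weight"  # key layout actually present in the checkpoint
    if key in checkpoint:
        return checkpoint[key]
    # Report near misses so a path bug is visible instead of silently ignored.
    near = sorted(k for k in checkpoint if k.endswith(".kv_proj.weight"))
    raise KeyError(f"{key} not found; similar keys: {near}")


# Hypothetical one-entry checkpoint; the string stands in for a tensor.
sample_ckpt = {
    "model.layers.2.self_attn.compressor.indexer.kv_proj.weight": "W_kv",
}
sample_pfx = "model.layers.2.self_attn.compressor.indexer"
weight = find_indexer_kv_proj(sample_ckpt, sample_pfx)
```

Passing the old prefix (`sample_pfx + ".compressor"`) to the same helper raises `KeyError` and lists the key that does exist, which makes this class of path bug easier to spot.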

### Compressor output shapes (at first block boundary, token 3 of prefill)

```
COMPRESSOR OUT [hd=512 kv_dim=1024 ratio=4 is_csa=True]: compressed.shape=(1, 512) dtype=torch.bfloat16 stride=(512, 1) contig=True
COMPRESSOR OUT [hd=128 kv_dim=256 ratio=4 is_csa=True]: compressed.shape=(1, 128) dtype=torch.bfloat16 stride=(128, 1) contig=True
```

The first line is the **main CSA compressor** (compresses KV for attention).
The second line is the **indexer's internal compressor** (compresses hidden states for indexer scoring).

### Reshape failure (at Indexer.forward, L2, token 3)

```
!!! RESHAPE FAILURE L2 !!!
comp_indexer_kv.shape = (1, 128)
tried to reshape to (1, 64, 128)
total elements: have 128, need 8192
k_idx = comp_indexer_kv.reshape(n_comp, self.n_ih, self.ihd)
RuntimeError: shape '[1, 64, 128]' is invalid for input of size 128
```

### Checkpoint weight shapes (from safetensors scan of L2 indexer)

```
model.layers.2.self_attn.compressor.indexer.q_b_proj.weight: shape=(8192, 768) dtype=uint8
model.layers.2.self_attn.compressor.indexer.weights_proj.weight: shape=(64, 3584) dtype=uint8
model.layers.2.self_attn.compressor.indexer.kv_proj.weight: shape=(256, 3584) dtype=uint8
model.layers.2.self_attn.compressor.indexer.gate_proj.weight: shape=(256, 3584) dtype=uint8
model.layers.2.self_attn.compressor.indexer.position_bias: shape=(4, 256) dtype=bfloat16
model.layers.2.self_attn.compressor.indexer.kv_norm.weight: shape=(128,) dtype=bfloat16
```

### KVCache comp_idx_buf crash (before width fix)

```
RuntimeError: The expanded size of the tensor (512) must match the existing size (128) at non-singleton dimension 1. Target sizes: [1, 512]. Tensor sizes: [128]
at: self.comp_idx_buf[self.n_comp:end] = idx_kv
```

The original `comp_idx_buf` was allocated as `(max_comp, head_dim=512)`, but the indexer's compressed keys have width 128.

---

## Answers

### Q1: shape of indexer.compressor.forward(...)[0]

Observed: `(1, 128)`, i.e. width **W = 128 = ihd** (the indexer head dim)
Hypothesis matched: **A** (paper-aligned: `c_I = 128`)

The indexer compressor outputs one compressed block of width `ihd=128` per `m=4` tokens.
This is NOT `n_ih × ihd = 8192` (hypothesis B) and NOT `512` (hypothesis C, the current buffer width).

### Q2: indexer.compressor.kv_dim

Observed: **256** (= `2 × ihd = 2 × 128`)
Expected per hypothesis A: 256 ✓

This is the internal projection width *before* the softmax/reduce. The compressor's
two GEMMs (`kv_proj` and `gate_proj`) each produce `(T, 256)`, then the CUDA reduce
kernel collapses every `m=4` tokens into one `(1, 128)` output.
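
The shape bookkeeping described here can be sketched in a few lines. This is shape arithmetic only, not the softmax/reduce math. It assumes, based solely on the two probe lines, that `kv_dim = 2 * head_dim` and that each complete group of `ratio` tokens yields one output row. It also assumes "token 3" is a zero-based index, so four tokens have been seen at the first block boundary. `compressor_shapes` is a hypothetical helper, not part of the codebase.

```python
def compressor_shapes(n_tokens, head_dim, ratio=4):
    """Return (kv_dim, projection_shape, compressed_shape) for one compressor call."""
    kv_dim = 2 * head_dim               # assumption: matches 1024 = 2*512 and 256 = 2*128
    proj_shape = (n_tokens, kv_dim)     # shape produced by kv_proj and by gate_proj
    n_comp = n_tokens // ratio          # assumption: incomplete groups emit nothing yet
    return kv_dim, proj_shape, (n_comp, head_dim)


# Indexer compressor (ihd=128) and main CSA compressor (hd=512) after 4 tokens.
indexer_shapes = compressor_shapes(n_tokens=4, head_dim=128)
main_shapes = compressor_shapes(n_tokens=4, head_dim=512)
```

Under these assumptions `indexer_shapes` is `(256, (4, 256), (1, 128))`, which agrees with the `kv_dim=256` and `compressed.shape=(1, 128)` values in the probe output.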

### Q3: q_b_lin and wp_lin shapes

From checkpoint (NVFP4 packed: weight shape = (N_packed, K_packed), with two 4-bit values per `uint8` byte along K, so in_features = 2 × K_packed):
- **q_b_lin**: in_features = 768×2 = 1536 (q_a lora dim), out_features = 8192 (= n_ih × ihd = 64 × 128) ✓
- **wp_lin**: in_features = 3584×2 = 7168 (hidden size), out_features = 64 (= n_ih) ✓
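
The arithmetic behind these two bullets, as a small sketch. It assumes the ×2 factor applies only to the second (input) dimension of the stored `uint8` weight, which is what the numbers above imply. `logical_shape` is a hypothetical helper, not a function from any NVFP4 library.

```python
def logical_shape(packed_shape, values_per_byte=2):
    """Map a stored (out_features, packed_in) uint8 shape to (out_features, in_features)."""
    out_features, packed_in = packed_shape
    return out_features, packed_in * values_per_byte


N_IH, IHD = 64, 128  # indexer head count and indexer head dim from this report

q_b_shape = logical_shape((8192, 768))   # q_b_proj.weight as stored in the checkpoint
wp_shape = logical_shape((64, 3584))     # weights_proj.weight as stored in the checkpoint
```

Here `q_b_shape` comes out as `(8192, 1536)` with `8192 == N_IH * IHD`, and `wp_shape` as `(64, 7168)` with `64 == N_IH`.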

### Q4: Runtime k_idx shape and reshape validity

- `comp_indexer_kv.shape` before reshape: **(1, 128)**
- Reshape target `(n_comp, 64, 128)`: **FAILED**
- Total elements: **have=128, need=8192**, off by **64×** (exactly `n_ih=64`)

The current `Indexer.forward` tries `comp_indexer_kv.reshape(n_comp, self.n_ih, self.ihd)`,
which assumes the stored indexer keys have `n_ih × ihd = 8192` elements per block.
But the actual stored width is `ihd = 128` (one vector per compressed block, NOT
per-indexer-head). The 64× gap is exactly `n_ih = 64`.
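
The element-count mismatch can be reproduced without any tensors. The sketch below mirrors the rule a reshape has to satisfy (source and target must contain the same number of elements). `check_reshape` is a hypothetical helper written for this note, not part of `Indexer`.

```python
import math


def check_reshape(src_shape, dst_shape):
    """Raise ValueError unless src_shape can be reshaped to dst_shape."""
    have, need = math.prod(src_shape), math.prod(dst_shape)
    if have != need:
        raise ValueError(
            f"cannot reshape {src_shape} to {dst_shape}: have {have}, need {need}"
        )
    return dst_shape


n_comp, n_ih, ihd = 1, 64, 128
# Ratio between what the reshape needs and what the compressor delivers.
gap = math.prod((n_comp, n_ih, ihd)) // math.prod((n_comp, ihd))
```

With the probe's numbers `gap` equals `n_ih`, and `check_reshape((1, 128), (1, 64, 128))` raises with the same "have 128, need 8192" figures as the log.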

This means the scoring einsum `torch.einsum('tnd,cnd->tnc', q_idx, k_idx)` cannot
work as written. The indexer query `q_idx` is `(T, 64, 128)` (per-indexer-head),
but the stored key is `(n_comp, 128)` (a single vector). The correct scoring
formula must be different from what the current code assumes.
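
The shape conflict can be checked symbolically. The sketch below is a simplified, pure-Python validator for einsum specs (no ellipsis, no size-1 broadcasting); `einsum_output_shape` is a hypothetical helper and not a PyTorch API. The last call shows that a spec with a key shared across heads, `tnd,cd->tnc`, is at least shape-compatible with the observed tensors. Whether that is the right scoring formula is not established by this probe.

```python
def einsum_output_shape(spec, *shapes):
    """Validate operand shapes against an einsum spec and return the output shape."""
    inputs, output = spec.split("->")
    terms = inputs.split(",")
    if len(terms) != len(shapes):
        raise ValueError("operand count does not match spec")
    sizes = {}
    for term, shape in zip(terms, shapes):
        if len(term) != len(shape):
            raise ValueError(f"'{term}' expects rank {len(term)}, got shape {shape}")
        for label, dim in zip(term, shape):
            # Each label must map to one size across all operands.
            if sizes.setdefault(label, dim) != dim:
                raise ValueError(f"label '{label}' is both {sizes[label]} and {dim}")
    return tuple(sizes[label] for label in output)


T, n_ih, ihd, n_comp = 5, 64, 128, 1  # T=5 is an arbitrary example length

# What the current code assumes: per-head keys of shape (n_comp, n_ih, ihd).
assumed = einsum_output_shape("tnd,cnd->tnc", (T, n_ih, ihd), (n_comp, n_ih, ihd))

# What is actually stored: one key vector per block, shape (n_comp, ihd).
try:
    einsum_output_shape("tnd,cnd->tnc", (T, n_ih, ihd), (n_comp, ihd))
    actual_error = None
except ValueError as exc:
    actual_error = str(exc)  # rank mismatch: 'cnd' needs 3 dims, key has 2

# Shape-compatible candidate (an assumption, not a confirmed fix).
candidate = einsum_output_shape("tnd,cd->tnc", (T, n_ih, ihd), (n_comp, ihd))
```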

---

## Conclusion

The indexer compressor produces compressed keys of width **`ihd = 128`** (one
vector per compressed block, matching the paper's `c_I`). The current code incorrectly
assumes the stored keys have width `n_ih × ihd = 8192` (per-indexer-head multi-head
keys), so the reshape at the scoring step fails with a 64× element-count mismatch. The `comp_idx_buf` in `KVCache`
is also 4× too wide (512 vs 128). The indexer's scoring einsum and key storage both
need rearchitecting to match the paper's single-vector-per-block compressed key format.

---

## Additional findings (not in original scope)

1. **Weight path bug**: `Indexer.load` looked for `*.indexer.compressor.kv_proj.weight`,
   but the checkpoint has `*.indexer.kv_proj.weight` (no `.compressor` nesting).
   Fixed in commit 5be31d8.

2. **comp_idx_buf width**: was `head_dim=512`, should be `ihd=128`. Temporarily fixed
   for the probe in commit 8162c58. The proper fix depends on the audit rewrite.

3. **Indexer compressor never loaded before**: the weight path bug meant `indexer.compressor`
   was always `None`, so the indexer was always skipped (`comp_idx_kv=None` on every
   CSA layer). This means the indexer has NEVER been exercised in production runs.
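
One way to keep a silent skip like this from recurring is to validate the load state explicitly instead of treating `None` as "indexer disabled". A minimal sketch, assuming the three flags printed in the load-state log (`q_b_lin`, `wp_lin`, `compressor`) correspond to attributes that are `None` when loading failed. `IndexerState` and `require_loaded` are hypothetical names, not the real classes.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class IndexerState:
    """Stand-in for the loaded pieces of one layer's indexer (hypothetical)."""
    layer: int
    q_b_lin: Optional[Any] = None
    wp_lin: Optional[Any] = None
    compressor: Optional[Any] = None


def require_loaded(state):
    """Raise if any indexer component is missing rather than skipping the indexer."""
    names = ("q_b_lin", "wp_lin", "compressor")
    missing = [name for name in names if getattr(state, name) is None]
    if missing:
        raise RuntimeError(f"Indexer L{state.layer}: not loaded: {', '.join(missing)}")
    return state


# State before the weight path fix: compressor was never populated.
broken = IndexerState(layer=2, q_b_lin=object(), wp_lin=object())
try:
    require_loaded(broken)
    load_error = None
except RuntimeError as exc:
    load_error = str(exc)
```

With a guard of this kind, the pre-fix state would have failed at load time with a message naming `compressor` instead of quietly disabling the indexer on every CSA layer.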