doc: INDEXER_PROBE_RESULTS_20260602 — compressed key width is ihd=128, not n_ih*ihd=8192

2026-06-02 05:51:24 +00:00
parent 8162c586c3
commit 0fbf28dd54
1 changed files with 126 additions and 0 deletions
--- a/archived_plans/INDEXER_PROBE_RESULTS_20260602.md
+++ b/archived_plans/INDEXER_PROBE_RESULTS_20260602.md
@@ -0,0 +1,126 @@
+# Indexer probe results — 2026-06-02
+
+## Raw output
+
+### Indexer load state (after fix for weight path bug)
+
+```
+Indexer L2: q_b_lin=True wp_lin=True compressor=True
+Indexer L4: q_b_lin=True wp_lin=True compressor=True
+Indexer L6: q_b_lin=True wp_lin=True compressor=True
+```
+
+Note: before the weight path fix, these lines reported `compressor=False`. The original
+code looked for `*.indexer.compressor.kv_proj.weight`, but the checkpoint keys are
+`*.indexer.kv_proj.weight` (no extra `.compressor` nesting). The fix changes
+`Indexer.load` to look for `f"{pfx}.kv_proj.weight"` instead of
+`f"{pfx}.compressor.kv_proj.weight"`.
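A load-time check along these lines would surface a key mismatch immediately instead of silently leaving the compressor unloaded. This is only a minimal sketch: `find_indexer_weight` and its candidate list are hypothetical and not part of `Indexer.load`; the two key layouts are the ones quoted in the note above.

```python
def find_indexer_weight(keys, pfx, name="kv_proj.weight"):
    """Return the checkpoint key for an indexer weight, or raise if absent.

    `keys` is any collection of checkpoint key names and `pfx` is the
    indexer prefix. This helper is hypothetical, for illustration only.
    """
    candidates = [
        f"{pfx}.{name}",             # layout actually present in the checkpoint
        f"{pfx}.compressor.{name}",  # layout the original loader expected
    ]
    for key in candidates:
        if key in keys:
            return key
    # Fail loudly rather than leaving the module unset.
    raise KeyError(f"none of {candidates} found in checkpoint")


ckpt_keys = {
    "model.layers.2.self_attn.compressor.indexer.kv_proj.weight",
    "model.layers.2.self_attn.compressor.indexer.gate_proj.weight",
}
pfx = "model.layers.2.self_attn.compressor.indexer"
resolved = find_indexer_weight(ckpt_keys, pfx)
print(resolved)
```

Raising on a missing key, rather than returning `None`, is the design choice that matters here: a silent `None` is what allowed the indexer to be skipped unnoticed.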
+
+### Compressor output shapes (at first block boundary, token 3 of prefill)
+
+```
+COMPRESSOR OUT [hd=512 kv_dim=1024 ratio=4 is_csa=True]: compressed.shape=(1, 512) dtype=torch.bfloat16 stride=(512, 1) contig=True
+COMPRESSOR OUT [hd=128 kv_dim=256 ratio=4 is_csa=True]: compressed.shape=(1, 128) dtype=torch.bfloat16 stride=(128, 1) contig=True
+```
+
+The first line is the **main CSA compressor** (compresses KV for attention).
+The second line is the **indexer's internal compressor** (compresses hidden states for indexer scoring).
+
+### Reshape failure (at Indexer.forward, L2, token 3)
+
+```
+!!! RESHAPE FAILURE L2 !!!
+comp_indexer_kv.shape = (1, 128)
+tried to reshape to (1, 64, 128)
+total elements: have 128, need 8192
+    k_idx = comp_indexer_kv.reshape(n_comp, self.n_ih, self.ihd)
+RuntimeError: shape '[1, 64, 128]' is invalid for input of size 128
+```
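The element counts in the failure message can be reproduced without any tensors. The sketch below uses a hypothetical helper, `check_reshape`, and the shape values reported above; it only compares element counts, which is the condition `reshape` enforces.

```python
import math


def check_reshape(src_shape, dst_shape):
    """Return (have, need, ok) element counts for a proposed reshape."""
    have = math.prod(src_shape)
    need = math.prod(dst_shape)
    return have, need, have == need


# Values taken from the probe output above.
n_comp, n_ih, ihd = 1, 64, 128
have, need, ok = check_reshape((n_comp, ihd), (n_comp, n_ih, ihd))
print(have, need, ok)  # 128 8192 False
print(need // have)    # 64, i.e. exactly n_ih
```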
+
+### Checkpoint weight shapes (from safetensors scan of L2 indexer)
+
+```
+model.layers.2.self_attn.compressor.indexer.q_b_proj.weight:       shape=(8192, 768)  dtype=uint8
+model.layers.2.self_attn.compressor.indexer.weights_proj.weight:   shape=(64, 3584)   dtype=uint8
+model.layers.2.self_attn.compressor.indexer.kv_proj.weight:        shape=(256, 3584)  dtype=uint8
+model.layers.2.self_attn.compressor.indexer.gate_proj.weight:      shape=(256, 3584)  dtype=uint8
+model.layers.2.self_attn.compressor.indexer.position_bias:         shape=(4, 256)     dtype=bfloat16
+model.layers.2.self_attn.compressor.indexer.kv_norm.weight:        shape=(128,)       dtype=bfloat16
+```
+
+### KVCache comp_idx_buf crash (before width fix)
+
+```
+RuntimeError: The expanded size of the tensor (512) must match the existing size (128) at non-singleton dimension 1.  Target sizes: [1, 512].  Tensor sizes: [128]
+  at: self.comp_idx_buf[self.n_comp:end] = idx_kv
+```
+
+The original `comp_idx_buf` was allocated as `(max_comp, head_dim=512)`, but the indexer's compressed keys have width `ihd=128`, so the row assignment cannot broadcast.
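A width guard on the write path would turn the broadcast error into a clearer message. The class below is a hypothetical, list-based stand-in for the `comp_idx_buf` portion of `KVCache`, written only to illustrate the check; the real buffer is a tensor and its interface is not shown in this report.

```python
class CompIdxBuffer:
    """Hypothetical stand-in for the indexer key buffer in KVCache."""

    def __init__(self, max_comp, width):
        self.max_comp = max_comp
        self.width = width
        self.rows = []

    def append(self, idx_kv):
        # Reject rows whose width differs from the allocated width.
        if len(idx_kv) != self.width:
            raise ValueError(
                f"indexer key width {len(idx_kv)} != buffer width {self.width}"
            )
        if len(self.rows) >= self.max_comp:
            raise IndexError("comp_idx_buf is full")
        self.rows.append(list(idx_kv))


ihd = 128
buf = CompIdxBuffer(max_comp=4, width=ihd)  # sized from ihd, not head_dim
buf.append([0.0] * ihd)
print(len(buf.rows))
```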
+
+---
+
+## Answers
+
+### Q1: shape of indexer.compressor.forward(...)[0]
+
+Observed: `(1, 128)` — width **W = 128 = ihd** (the indexer head dim)
+Hypothesis matched: **A** (paper-aligned: `c_I = 128`)
+
+The indexer compressor outputs one compressed block of width `ihd=128` per `m=4` tokens.
+This is NOT `n_ih × ihd = 8192` (hypothesis B) and NOT `512` (hypothesis C / current buffer width).
+
+### Q2: indexer.compressor.kv_dim
+
+Observed: **256** (= `2 × ihd = 2 × 128`)
+Expected per hypothesis A: 256 ✓
+
+This is the internal projection width *before* the softmax/reduce. The compressor's
+two GEMMs (`kv_proj` and `gate_proj`) each produce `(T, 256)`, then the CUDA reduce
+kernel collapses every `m=4` tokens into one `(1, 128)` output.
+
+### Q3: q_b_lin and wp_lin shapes
+
+From checkpoint (NVFP4 packed, so the stored `uint8` weight shape is `(out_features, in_features / 2)`):
+- **q_b_lin**: in_features = 768×2 = 1536 (q_a lora dim), out_features = 8192 (= n_ih × ihd = 64 × 128) ✓
+- **wp_lin**: in_features = 3584×2 = 7168 (hidden size), out_features = 64 (= n_ih) ✓
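The ×2 factor in the bullets above can be applied mechanically to the scanned shapes. This sketch assumes two 4-bit values are packed into each `uint8` along the input dimension, which is what the arithmetic above implies; `logical_shape` is a hypothetical helper, and any separate scale tensors are ignored.

```python
def logical_shape(packed_shape, values_per_byte=2):
    """Map a packed (out_features, in_features / 2) shape to logical dims."""
    out_features, packed_in = packed_shape
    return out_features, packed_in * values_per_byte


# Packed shapes copied from the safetensors scan of the L2 indexer.
q_b = logical_shape((8192, 768))
wp = logical_shape((64, 3584))
kv = logical_shape((256, 3584))
print(q_b, wp, kv)  # (8192, 1536) (64, 7168) (256, 7168)
```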
+
+### Q4: Runtime k_idx shape and reshape validity
+
+- `comp_indexer_kv.shape` before reshape: **(1, 128)**
+- Reshape target `(n_comp, 64, 128)`: **FAILED**
+- Total elements: **have=128, need=8192** — off by **64×** (exactly `n_ih=64`)
+
+The current `Indexer.forward` tries `comp_indexer_kv.reshape(n_comp, self.n_ih, self.ihd)`,
+which assumes the stored indexer keys have `n_ih × ihd = 8192` elements per block.
+But the actual stored width is `ihd = 128` (one vector per compressed block, NOT
+per-indexer-head). The 64× gap is exactly `n_ih = 64`.
+
+This means the scoring einsum `torch.einsum('tnd,cnd->tnc', q_idx, k_idx)` cannot
+work as written. The indexer query `q_idx` is `(T, 64, 128)` (per-indexer-head),
+but the stored key is `(n_comp, 128)` (a single vector). The correct scoring
+formula must be different from what the current code assumes.
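One way the shapes could be reconciled is to share each block's single key across all indexer heads and then combine the per-head scores with per-head weights. This is an assumption, not something the probe confirmed: the function `score_blocks`, the use of per-token head weights (suggested only by the `n_ih`-wide output of `wp_lin`), and the absence of any nonlinearity are all illustrative choices. In tensor terms the first step would correspond to an einsum of the form `'tnd,cd->tnc'`.

```python
def score_blocks(q_idx, k_idx, head_w):
    """Score compressed blocks against per-head queries.

    q_idx:  [T][n_ih][ihd]  per-head indexer queries
    k_idx:  [C][ihd]        one key vector per compressed block
    head_w: [T][n_ih]       per-token head weights (assumed)
    Returns scores of shape [T][C].
    """
    scores = []
    for q_t, w_t in zip(q_idx, head_w):
        row = []
        for k_c in k_idx:
            # The same key k_c is reused for every head.
            per_head = [sum(a * b for a, b in zip(q_h, k_c)) for q_h in q_t]
            row.append(sum(w * s for w, s in zip(w_t, per_head)))
        scores.append(row)
    return scores


# Tiny example: T=1, n_ih=2, ihd=3, C=2.
q_idx = [[[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]]
k_idx = [[2.0, 3.0, 4.0], [1.0, 1.0, 1.0]]
head_w = [[0.5, 1.0]]
scores = score_blocks(q_idx, k_idx, head_w)
print(scores)  # [[4.0, 1.5]]
```

With this layout the key buffer stays at width `ihd`, and no reshape to `(n_comp, n_ih, ihd)` is needed.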
+
+---
+
+## Conclusion
+
+The implementation stores indexer compressed keys at width **`ihd = 128`** (one
+vector per compressed block, matching the paper's `c_I`). The current code incorrectly
+assumes the stored keys have width `n_ih × ihd = 8192` (separate keys per indexer
+head), so the reshape at the scoring step fails with 64× too few elements. The
+`comp_idx_buf` in `KVCache` is also 4× too wide (512 vs 128). Both the indexer's scoring
+einsum and its key storage need to be reworked to match the paper's
+single-vector-per-block compressed key format.
+
+---
+
+## Additional findings (not in original scope)
+
+1. **Weight path bug**: `Indexer.load` looked for `*.indexer.compressor.kv_proj.weight`
+   but the checkpoint has `*.indexer.kv_proj.weight` (no `.compressor` nesting).
+   Fixed in commit 5be31d8.
+
+2. **comp_idx_buf width**: was `head_dim=512`, should be `ihd=128`. Temporarily fixed
+   for probe in commit 8162c58. Proper fix depends on audit rewrite.
+
+3. **Indexer compressor never loaded before this fix**: the weight path bug meant
+   `indexer.compressor` was always `None`, so the indexer was skipped on every CSA
+   layer (`comp_idx_kv=None`). The indexer has therefore never been exercised in production runs.