Test 2: fix topk tensor shape (flatten before iterating)

2026-06-03 08:47:32 +00:00
parent 89f6e64057
commit 3c06fd5591
3 changed files with 102 additions and 0 deletions
--- a/TEMP/CROSS_REFERENCE.md
+++ b/TEMP/CROSS_REFERENCE.md
@@ -0,0 +1,100 @@
+# vLLM Independent-Reference Cross-Checks (DSV4)
+
+**Purpose:** catch the architectural bugs your cosine backlog is blind to. Every check in `CORRECTNESS_BACKLOG.md` compares your kernel to your *own* PyTorch reference — if both share a wrong assumption (RoPE position, RoPE style, norm ordering, compression overlap), they agree at cos 0.9999 while both being wrong. vLLM's DSV4 is an **independent** implementation; it can't share your blind spots. Use it as the tiebreaker.
+
+**How to use this:** these are mostly *static code diffs*, not GPU runs — fast. For each check, find the line in `single_shot_inference.py` (or `dsv4/`), find the cited line in the vLLM tree, and record match / mismatch + the fix.
+
+**Ground rules**
+- The attached vLLM tree "works but is unstable" and strayed from the paper in places. So trust it as an oracle for the **math** of the compressor / mHC / RoPE / norm-ordering — it produces coherent output against the real weights — but NOT as a stability or paper-fidelity reference everywhere.
+- **DO NOT mirror vLLM's attention.** `attention.py` / `nvidia/flashmla.py` shoehorn CSA/HCA onto the V3.2 MLA-sparse backend + a bolted-on SWA cache. That's the unstable stopgap. Your from-scratch shared-KV MQA + grouped-output is more paper-faithful — keep it.
+- Where vLLM and your code disagree on a detail your cosines can't see (below), **vLLM is the tiebreaker worth investigating** — match it and re-test.
+
+Reference files: `deepseek_v4/nvidia/model.py`, `deepseek_v4/compressor.py`, `deepseek_v4/common/rope.py`, `deepseek_v4/common/ops/*`, `deepseek_v4/nvidia/ops/*cutedsl.py`.
+
+---
+
+## Cross-check 1 — Final tail ordering: collapse → final RMSNorm → lm_head
+*(Resolves the Test-2 / mHC question. If this is wrong, it explains the |X|=860 confusion.)*
+
+**vLLM reference (`nvidia/model.py`):** the expanded mHC residual stays `n_hc × d` "until `hc_head()` collapses it" (comment ~line 1130). Tail sequence:
+- `mhc_post_tilelang(...)` collapse (`hc_head`, `n_hc·d → d`) — ~line 1172
+- `hidden_states = self.norm(hidden_states)` where `self.norm = RMSNorm(hidden_size)` — ~line 1191
+- `compute_logits` → `lm_head` — ~lines 1444–1448
+
+So the canonical tail is **collapse → final RMSNorm → lm_head, in that order.**
+
+**Check:** trace the last layer's residual in `single_shot`. Does it (a) collapse the expanded mHC residual via `hc_head`/`mhc_post`, then (b) apply a model-final `RMSNorm`, then (c) `lm_head` — in that order?
+
+**Decision**
+- Missing collapse, missing final norm, or reordered → **real bug, fix the ordering.** This is the legitimate "the final norm is broken" branch from the decode runbook — a small fix, NOT residual clipping.
+- All present and correct → RMSNorm is scale-invariant, so |X|=860 is normalized away → **mHC growth is not the cause**, corroborating the Test-2 falsification.
+
+---
+
+## Cross-check 2 — Compressed-entry RoPE position (block-aligned) — HIGHEST PRIORITY
+*(Off-by-(r−1) here is invisible to same-input cosines and wrecks long-range attention.)*
+
+**vLLM reference (`compressor.py`):** the position used to RoPE a compressed entry is
+```
+(positions // compress_ratio) * compress_ratio    # block-aligned = FIRST position of the block
+```
+
+**Your earlier code used:** `((bi+1) * r - 1)` — the **LAST** position of the block. Difference = `r − 1`: **3 for CSA, 127 for HCA.** A constant 127-position offset on every HCA entry degrades long-range attention while every per-layer cosine stays high (your reference makes the same choice).
+
+**Check:** what position does `single_shot`'s compressor assign when RoPE-ing compressed KV entries **and** the indexer keys? Compare against the block-aligned floor formula.
+
+**Decision:** anything other than block-aligned `(pos // ratio) * ratio` → change to it and re-test. This is the single most likely concrete bug in this document.
+
+---
+
+## Cross-check 3 — RoPE style: GPT-J interleaved, not NeoX split-half
+
+**vLLM reference (`compressor.py` compress kernel comment + `common/rope.py`):**
+- `is_neox_style=False` → **interleaved pairs**, NOT split-half
+- applied to the **last `rope_head_dim`** elements of `head_dim`
+- `cos_sin_cache` layout `[max_pos, rope_head_dim]`: first half cos, second half sin (per-pair, each length `rope_head_dim // 2`)
+
+**Check:** is `single_shot`'s RoPE interleaved (GPT-J) or split-half (NeoX)? Verify **both** the forward RoPE on Q/KV **and** the inverse RoPE on the attention output, and that the cos/sin cache layout matches.
+
+**Decision:** split-half where the model expects interleaved (or a mismatched cos/sin layout) scrambles every roped vector — invisible to same-input cosines, wrong against the real weights. Match vLLM's interleaved layout.
+
+---
+
+## Cross-check 4 — CSA overlap (2m) vs HCA no-overlap
+
+**vLLM reference (`compressor.py`):** `overlap = (compress_ratio == 4)` and `coff = 1 + (compress_ratio == 4)`. So CSA (ratio 4) uses **overlapped** compression — each compressed entry derived from `2m` KV entries via the `C^a` + `C^b` overlap (paper eqs. 11–12) — while HCA (ratio 128) does **not** overlap.
+
+**Check:** does `single_shot`'s CSA compressor produce overlapped blocks (each entry from `2m` entries with the `C^b` overlap), and HCA non-overlapped?
+
+**Decision:** if CSA isn't overlapping, the compressed representation differs from what the model was trained on → degraded attention. Match the overlap behavior.
+
+---
+
+## Reporting
+
+Same harness/workflow as the decode runbook: edit locally → commit → push → pull on B200 → test; `TEST_LAYERS` as an **env var**, never a CLI arg. Most checks here are static diffs (no GPU). For each, paste: `single_shot` line, vLLM reference line, MATCH/MISMATCH, and the fix if mismatched.
+
+---
+
+## Appendix A — vLLM reference → your kernel map (for the kernel work, not the bug)
+
+Independent CuteDSL/op references for the kernels you hand-wrote. BF16/FP8, not NVFP4 — so a *correctness/layout* reference, not liftable code (and useful evidence for the NVIDIA gaps writeup):
+
+| Your kernel | vLLM independent reference |
+|---|---|
+| compress → RMSNorm → RoPE → FP8 quant → cache-write | `nvidia/ops/sparse_attn_compress_cutedsl.py` (`compress_norm_rope_store_cutedsl`) |
+| dequant + gather K (your `gather_mixed_*`) | `nvidia/ops/dequant_gather_k_cutedsl.py` |
+| indexer Q | `nvidia/ops/fused_indexer_q_cutedsl.py`, `common/ops/fused_indexer_q.py` |
+| inverse RoPE + FP8 quant | `common/ops/fused_inv_rope_fp8_quant.py` |
+| QK RMSNorm | `common/ops/fused_qk_rmsnorm.py` |
+| uncompressed tail / partial-state buffer (your decode buffering) | `common/ops/save_partial_states.py` |
+| MoE dispatch prep | `nvidia/ops/prepare_megamoe.py` |
+| mHC pre/post | `vllm…kernels.mhc.tilelang` (`mhc_pre`, `mhc_post`, `mhc_fused_post_pre`) — used in `nvidia/model.py` |
+
+## Appendix B — determinism gotcha worth stealing
+
+`compressor.py` disables PDL (`launch_pdl=False`) because the compress kernels consume a preceding GEMM's output without emitting/waiting on PDL grid-dependency primitives, and `launch_pdl=True` caused a read-after-write race → non-deterministic output. You target bitwise determinism, so this is a hazard already mapped: any kernel that depends on a prior kernel's output without an explicit grid-dependency primitive must not use PDL.
+
+## Appendix C — what NOT to take
+
+`attention.py`, `nvidia/flashmla.py`, and the `DeepseekV4SparseMLA` / `DeepseekV4SWACache` machinery are the MLA shoehorn — the unstable stopgap that motivated this whole project. Use vLLM's compressor / mHC / RoPE / norm-ordering as oracles; leave its attention impl alone.
--- a/TEMP/deepseek_v4.zip
+++ b/TEMP/deepseek_v4.zip
--- a/tests/unit/test_degeneration_2_mhc_falsify.py
+++ b/tests/unit/test_degeneration_2_mhc_falsify.py
@@ -345,6 +345,8 @@ def main():
    cos_AB = F.cosine_similarity(logits_A_f.flatten(), logits_B_f.flatten(), dim=0).item()
    top5_A_vals, top5_A_ids = logits_A_f.topk(5)
    top5_B_vals, top5_B_ids = logits_B_f.topk(5)
+    top5_A_ids = top5_A_ids.flatten(); top5_A_vals = top5_A_vals.flatten()
+    top5_B_ids = top5_B_ids.flatten(); top5_B_vals = top5_B_vals.flatten()

    print(f"\n   logits_A (|X|={X_max:.1f}):")
    print(f"     range: [{logits_A_f.min().item():.2f}, {logits_A_f.max().item():.2f}]")