Test 2: fix topk tensor shape (flatten before iterating)

This commit is contained in:
2026-06-03 08:47:32 +00:00
parent 89f6e64057
commit 3c06fd5591
3 changed files with 102 additions and 0 deletions

100
TEMP/CROSS_REFERENCE.md Normal file
View File

@@ -0,0 +1,100 @@
# vLLM Independent-Reference Cross-Checks (DSV4)
**Purpose:** catch the architectural bugs your cosine backlog is blind to. Every check in `CORRECTNESS_BACKLOG.md` compares your kernel to your *own* PyTorch reference — if both share a wrong assumption (RoPE position, RoPE style, norm ordering, compression overlap), they agree at cos 0.9999 while both being wrong. vLLM's DSV4 is an **independent** implementation; it can't share your blind spots. Use it as the tiebreaker.
**How to use this:** these are mostly *static code diffs*, not GPU runs — fast. For each check, find the line in `single_shot_inference.py` (or `dsv4/`), find the cited line in the vLLM tree, and record match / mismatch + the fix.
**Ground rules**
- The attached vLLM tree "works but is unstable" and strayed from the paper in places. So trust it as an oracle for the **math** of the compressor / mHC / RoPE / norm-ordering — it produces coherent output against the real weights — but NOT as a stability or paper-fidelity reference everywhere.
- **DO NOT mirror vLLM's attention.** `attention.py` / `nvidia/flashmla.py` shoehorn CSA/HCA onto the V3.2 MLA-sparse backend + a bolted-on SWA cache. That's the unstable stopgap. Your from-scratch shared-KV MQA + grouped-output is more paper-faithful — keep it.
- Where vLLM and your code disagree on a detail your cosines can't see (below), **vLLM is the tiebreaker worth investigating** — match it and re-test.
Reference files: `deepseek_v4/nvidia/model.py`, `deepseek_v4/compressor.py`, `deepseek_v4/common/rope.py`, `deepseek_v4/common/ops/*`, `deepseek_v4/nvidia/ops/*cutedsl.py`.
---
## Cross-check 1 — Final tail ordering: collapse → final RMSNorm → lm_head
*(Resolves the Test-2 / mHC question. If this is wrong, it explains the |X|=860 confusion.)*
**vLLM reference (`nvidia/model.py`):** the expanded mHC residual stays `n_hc × d` "until `hc_head()` collapses it" (comment ~line 1130). Tail sequence:
- `mhc_post_tilelang(...)` collapse (`hc_head`, `n_hc·d → d`) — ~line 1172
- `hidden_states = self.norm(hidden_states)` where `self.norm = RMSNorm(hidden_size)` — ~line 1191
- `compute_logits``lm_head` — ~lines 14441448
So the canonical tail is **collapse → final RMSNorm → lm_head, in that order.**
**Check:** trace the last layer's residual in `single_shot`. Does it (a) collapse the expanded mHC residual via `hc_head`/`mhc_post`, then (b) apply a model-final `RMSNorm`, then (c) `lm_head` — in that order?
**Decision**
- Missing collapse, missing final norm, or reordered → **real bug, fix the ordering.** This is the legitimate "the final norm is broken" branch from the decode runbook — a small fix, NOT residual clipping.
- All present and correct → RMSNorm is scale-invariant, so |X|=860 is normalized away → **mHC growth is not the cause**, corroborating the Test-2 falsification.
---
## Cross-check 2 — Compressed-entry RoPE position (block-aligned) — HIGHEST PRIORITY
*(Off-by-(r1) here is invisible to same-input cosines and wrecks long-range attention.)*
**vLLM reference (`compressor.py`):** the position used to RoPE a compressed entry is
```
(positions // compress_ratio) * compress_ratio # block-aligned = FIRST position of the block
```
**Your earlier code used:** `((bi+1) * r - 1)` — the **LAST** position of the block. Difference = `r 1`: **3 for CSA, 127 for HCA.** A constant 127-position offset on every HCA entry degrades long-range attention while every per-layer cosine stays high (your reference makes the same choice).
**Check:** what position does `single_shot`'s compressor assign when RoPE-ing compressed KV entries **and** the indexer keys? Compare against the block-aligned floor formula.
**Decision:** anything other than block-aligned `(pos // ratio) * ratio` → change to it and re-test. This is the single most likely concrete bug in this document.
---
## Cross-check 3 — RoPE style: GPT-J interleaved, not NeoX split-half
**vLLM reference (`compressor.py` compress kernel comment + `common/rope.py`):**
- `is_neox_style=False`**interleaved pairs**, NOT split-half
- applied to the **last `rope_head_dim`** elements of `head_dim`
- `cos_sin_cache` layout `[max_pos, rope_head_dim]`: first half cos, second half sin (per-pair, each length `rope_head_dim // 2`)
**Check:** is `single_shot`'s RoPE interleaved (GPT-J) or split-half (NeoX)? Verify **both** the forward RoPE on Q/KV **and** the inverse RoPE on the attention output, and that the cos/sin cache layout matches.
**Decision:** split-half where the model expects interleaved (or a mismatched cos/sin layout) scrambles every roped vector — invisible to same-input cosines, wrong against the real weights. Match vLLM's interleaved layout.
---
## Cross-check 4 — CSA overlap (2m) vs HCA no-overlap
**vLLM reference (`compressor.py`):** `overlap = (compress_ratio == 4)` and `coff = 1 + (compress_ratio == 4)`. So CSA (ratio 4) uses **overlapped** compression — each compressed entry derived from `2m` KV entries via the `C^a` + `C^b` overlap (paper eqs. 1112) — while HCA (ratio 128) does **not** overlap.
**Check:** does `single_shot`'s CSA compressor produce overlapped blocks (each entry from `2m` entries with the `C^b` overlap), and HCA non-overlapped?
**Decision:** if CSA isn't overlapping, the compressed representation differs from what the model was trained on → degraded attention. Match the overlap behavior.
---
## Reporting
Same harness/workflow as the decode runbook: edit locally → commit → push → pull on B200 → test; `TEST_LAYERS` as an **env var**, never a CLI arg. Most checks here are static diffs (no GPU). For each, paste: `single_shot` line, vLLM reference line, MATCH/MISMATCH, and the fix if mismatched.
---
## Appendix A — vLLM reference → your kernel map (for the kernel work, not the bug)
Independent CuteDSL/op references for the kernels you hand-wrote. BF16/FP8, not NVFP4 — so a *correctness/layout* reference, not liftable code (and useful evidence for the NVIDIA gaps writeup):
| Your kernel | vLLM independent reference |
|---|---|
| compress → RMSNorm → RoPE → FP8 quant → cache-write | `nvidia/ops/sparse_attn_compress_cutedsl.py` (`compress_norm_rope_store_cutedsl`) |
| dequant + gather K (your `gather_mixed_*`) | `nvidia/ops/dequant_gather_k_cutedsl.py` |
| indexer Q | `nvidia/ops/fused_indexer_q_cutedsl.py`, `common/ops/fused_indexer_q.py` |
| inverse RoPE + FP8 quant | `common/ops/fused_inv_rope_fp8_quant.py` |
| QK RMSNorm | `common/ops/fused_qk_rmsnorm.py` |
| uncompressed tail / partial-state buffer (your decode buffering) | `common/ops/save_partial_states.py` |
| MoE dispatch prep | `nvidia/ops/prepare_megamoe.py` |
| mHC pre/post | `vllm…kernels.mhc.tilelang` (`mhc_pre`, `mhc_post`, `mhc_fused_post_pre`) — used in `nvidia/model.py` |
## Appendix B — determinism gotcha worth stealing
`compressor.py` disables PDL (`launch_pdl=False`) because the compress kernels consume a preceding GEMM's output without emitting/waiting on PDL grid-dependency primitives, and `launch_pdl=True` caused a read-after-write race → non-deterministic output. You target bitwise determinism, so this is a hazard already mapped: any kernel that depends on a prior kernel's output without an explicit grid-dependency primitive must not use PDL.
## Appendix C — what NOT to take
`attention.py`, `nvidia/flashmla.py`, and the `DeepseekV4SparseMLA` / `DeepseekV4SWACache` machinery are the MLA shoehorn — the unstable stopgap that motivated this whole project. Use vLLM's compressor / mHC / RoPE / norm-ordering as oracles; leave its attention impl alone.

BIN
TEMP/deepseek_v4.zip Normal file

Binary file not shown.

View File

@@ -345,6 +345,8 @@ def main():
cos_AB = F.cosine_similarity(logits_A_f.flatten(), logits_B_f.flatten(), dim=0).item()
top5_A_vals, top5_A_ids = logits_A_f.topk(5)
top5_B_vals, top5_B_ids = logits_B_f.topk(5)
top5_A_ids = top5_A_ids.flatten(); top5_A_vals = top5_A_vals.flatten()
top5_B_ids = top5_B_ids.flatten(); top5_B_vals = top5_B_vals.flatten()
print(f"\n logits_A (|X|={X_max:.1f}):")
print(f" range: [{logits_A_f.min().item():.2f}, {logits_A_f.max().item():.2f}]")