diff --git a/TEMP/CROSS_REFERENCE.md b/TEMP/CROSS_REFERENCE.md new file mode 100644 index 00000000..58c5751b --- /dev/null +++ b/TEMP/CROSS_REFERENCE.md @@ -0,0 +1,100 @@ +# vLLM Independent-Reference Cross-Checks (DSV4) + +**Purpose:** catch the architectural bugs your cosine backlog is blind to. Every check in `CORRECTNESS_BACKLOG.md` compares your kernel to your *own* PyTorch reference — if both share a wrong assumption (RoPE position, RoPE style, norm ordering, compression overlap), they agree at cos 0.9999 while both being wrong. vLLM's DSV4 is an **independent** implementation; it can't share your blind spots. Use it as the tiebreaker. + +**How to use this:** these are mostly *static code diffs*, not GPU runs — fast. For each check, find the line in `single_shot_inference.py` (or `dsv4/`), find the cited line in the vLLM tree, and record match / mismatch + the fix. + +**Ground rules** +- The attached vLLM tree "works but is unstable" and strayed from the paper in places. So trust it as an oracle for the **math** of the compressor / mHC / RoPE / norm-ordering — it produces coherent output against the real weights — but NOT as a stability or paper-fidelity reference everywhere. +- **DO NOT mirror vLLM's attention.** `attention.py` / `nvidia/flashmla.py` shoehorn CSA/HCA onto the V3.2 MLA-sparse backend + a bolted-on SWA cache. That's the unstable stopgap. Your from-scratch shared-KV MQA + grouped-output is more paper-faithful — keep it. +- Where vLLM and your code disagree on a detail your cosines can't see (below), **vLLM is the tiebreaker worth investigating** — match it and re-test. + +Reference files: `deepseek_v4/nvidia/model.py`, `deepseek_v4/compressor.py`, `deepseek_v4/common/rope.py`, `deepseek_v4/common/ops/*`, `deepseek_v4/nvidia/ops/*cutedsl.py`. + +--- + +## Cross-check 1 — Final tail ordering: collapse → final RMSNorm → lm_head +*(Resolves the Test-2 / mHC question. If this is wrong, it explains the |X|=860 confusion.)* + +**vLLM reference (`nvidia/model.py`):** the expanded mHC residual stays `n_hc × d` "until `hc_head()` collapses it" (comment ~line 1130). Tail sequence: +- `mhc_post_tilelang(...)` collapse (`hc_head`, `n_hc·d → d`) — ~line 1172 +- `hidden_states = self.norm(hidden_states)` where `self.norm = RMSNorm(hidden_size)` — ~line 1191 +- `compute_logits` → `lm_head` — ~lines 1444–1448 + +So the canonical tail is **collapse → final RMSNorm → lm_head, in that order.** + +**Check:** trace the last layer's residual in `single_shot`. Does it (a) collapse the expanded mHC residual via `hc_head`/`mhc_post`, then (b) apply a model-final `RMSNorm`, then (c) `lm_head` — in that order? + +**Decision** +- Missing collapse, missing final norm, or reordered → **real bug, fix the ordering.** This is the legitimate "the final norm is broken" branch from the decode runbook — a small fix, NOT residual clipping. +- All present and correct → RMSNorm is scale-invariant, so |X|=860 is normalized away → **mHC growth is not the cause**, corroborating the Test-2 falsification. + +--- + +## Cross-check 2 — Compressed-entry RoPE position (block-aligned) — HIGHEST PRIORITY +*(Off-by-(r−1) here is invisible to same-input cosines and wrecks long-range attention.)* + +**vLLM reference (`compressor.py`):** the position used to RoPE a compressed entry is +``` +(positions // compress_ratio) * compress_ratio # block-aligned = FIRST position of the block +``` + +**Your earlier code used:** `((bi+1) * r - 1)` — the **LAST** position of the block. Difference = `r − 1`: **3 for CSA, 127 for HCA.** A constant 127-position offset on every HCA entry degrades long-range attention while every per-layer cosine stays high (your reference makes the same choice). + +**Check:** what position does `single_shot`'s compressor assign when RoPE-ing compressed KV entries **and** the indexer keys? Compare against the block-aligned floor formula. + +**Decision:** anything other than block-aligned `(pos // ratio) * ratio` → change to it and re-test. This is the single most likely concrete bug in this document. + +--- + +## Cross-check 3 — RoPE style: GPT-J interleaved, not NeoX split-half + +**vLLM reference (`compressor.py` compress kernel comment + `common/rope.py`):** +- `is_neox_style=False` → **interleaved pairs**, NOT split-half +- applied to the **last `rope_head_dim`** elements of `head_dim` +- `cos_sin_cache` layout `[max_pos, rope_head_dim]`: first half cos, second half sin (per-pair, each length `rope_head_dim // 2`) + +**Check:** is `single_shot`'s RoPE interleaved (GPT-J) or split-half (NeoX)? Verify **both** the forward RoPE on Q/KV **and** the inverse RoPE on the attention output, and that the cos/sin cache layout matches. + +**Decision:** split-half where the model expects interleaved (or a mismatched cos/sin layout) scrambles every roped vector — invisible to same-input cosines, wrong against the real weights. Match vLLM's interleaved layout. + +--- + +## Cross-check 4 — CSA overlap (2m) vs HCA no-overlap + +**vLLM reference (`compressor.py`):** `overlap = (compress_ratio == 4)` and `coff = 1 + (compress_ratio == 4)`. So CSA (ratio 4) uses **overlapped** compression — each compressed entry derived from `2m` KV entries via the `C^a` + `C^b` overlap (paper eqs. 11–12) — while HCA (ratio 128) does **not** overlap. + +**Check:** does `single_shot`'s CSA compressor produce overlapped blocks (each entry from `2m` entries with the `C^b` overlap), and HCA non-overlapped? + +**Decision:** if CSA isn't overlapping, the compressed representation differs from what the model was trained on → degraded attention. Match the overlap behavior. + +--- + +## Reporting + +Same harness/workflow as the decode runbook: edit locally → commit → push → pull on B200 → test; `TEST_LAYERS` as an **env var**, never a CLI arg. Most checks here are static diffs (no GPU). For each, paste: `single_shot` line, vLLM reference line, MATCH/MISMATCH, and the fix if mismatched. + +--- + +## Appendix A — vLLM reference → your kernel map (for the kernel work, not the bug) + +Independent CuteDSL/op references for the kernels you hand-wrote. BF16/FP8, not NVFP4 — so a *correctness/layout* reference, not liftable code (and useful evidence for the NVIDIA gaps writeup): + +| Your kernel | vLLM independent reference | +|---|---| +| compress → RMSNorm → RoPE → FP8 quant → cache-write | `nvidia/ops/sparse_attn_compress_cutedsl.py` (`compress_norm_rope_store_cutedsl`) | +| dequant + gather K (your `gather_mixed_*`) | `nvidia/ops/dequant_gather_k_cutedsl.py` | +| indexer Q | `nvidia/ops/fused_indexer_q_cutedsl.py`, `common/ops/fused_indexer_q.py` | +| inverse RoPE + FP8 quant | `common/ops/fused_inv_rope_fp8_quant.py` | +| QK RMSNorm | `common/ops/fused_qk_rmsnorm.py` | +| uncompressed tail / partial-state buffer (your decode buffering) | `common/ops/save_partial_states.py` | +| MoE dispatch prep | `nvidia/ops/prepare_megamoe.py` | +| mHC pre/post | `vllm…kernels.mhc.tilelang` (`mhc_pre`, `mhc_post`, `mhc_fused_post_pre`) — used in `nvidia/model.py` | + +## Appendix B — determinism gotcha worth stealing + +`compressor.py` disables PDL (`launch_pdl=False`) because the compress kernels consume a preceding GEMM's output without emitting/waiting on PDL grid-dependency primitives, and `launch_pdl=True` caused a read-after-write race → non-deterministic output. You target bitwise determinism, so this is a hazard already mapped: any kernel that depends on a prior kernel's output without an explicit grid-dependency primitive must not use PDL. + +## Appendix C — what NOT to take + +`attention.py`, `nvidia/flashmla.py`, and the `DeepseekV4SparseMLA` / `DeepseekV4SWACache` machinery are the MLA shoehorn — the unstable stopgap that motivated this whole project. Use vLLM's compressor / mHC / RoPE / norm-ordering as oracles; leave its attention impl alone. \ No newline at end of file diff --git a/TEMP/deepseek_v4.zip b/TEMP/deepseek_v4.zip new file mode 100644 index 00000000..77ae49cd Binary files /dev/null and b/TEMP/deepseek_v4.zip differ diff --git a/tests/unit/test_degeneration_2_mhc_falsify.py b/tests/unit/test_degeneration_2_mhc_falsify.py index 08c0004d..cda605cb 100644 --- a/tests/unit/test_degeneration_2_mhc_falsify.py +++ b/tests/unit/test_degeneration_2_mhc_falsify.py @@ -345,6 +345,8 @@ def main(): cos_AB = F.cosine_similarity(logits_A_f.flatten(), logits_B_f.flatten(), dim=0).item() top5_A_vals, top5_A_ids = logits_A_f.topk(5) top5_B_vals, top5_B_ids = logits_B_f.topk(5) + top5_A_ids = top5_A_ids.flatten(); top5_A_vals = top5_A_vals.flatten() + top5_B_ids = top5_B_ids.flatten(); top5_B_vals = top5_B_vals.flatten() print(f"\n logits_A (|X|={X_max:.1f}):") print(f" range: [{logits_A_f.min().item():.2f}, {logits_A_f.max().item():.2f}]")