From 5e09be08afdfe02cb16bbe35cdeb1b7c839e8368 Mon Sep 17 00:00:00 2001
From: biondizzle
Date: Wed, 3 Jun 2026 07:56:19 +0000
Subject: [PATCH] Fix non-contiguous tensor in quantize_nvfp4_gpu_fused (T>1 prefill)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

The intermediate tensor from fused SwiGLU deinterleave is a column slice
(non-contiguous). When T>1, quantize_nvfp4_gpu_fused receives this and
the CUDA kernel crashes with 'input must be contiguous'.

Fix: add is_contiguous() check + .contiguous() in quantize_nvfp4_gpu_fused
and in SharedExpert._run_l2.

This is the root cause, not a workaround — CUDA kernels legitimately
require contiguous memory.
---
 DEGENERATION_TESTS.md        | 107 +++++++++++++++++++++++++++++++++++
 dsv4/layers/shared_expert.py |   4 ++
 dsv4/ops/quantize.py         |   3 +
 3 files changed, 114 insertions(+)
 create mode 100644 DEGENERATION_TESTS.md

diff --git a/DEGENERATION_TESTS.md b/DEGENERATION_TESTS.md
new file mode 100644
index 00000000..08a2d0a0
--- /dev/null
+++ b/DEGENERATION_TESTS.md
@@ -0,0 +1,107 @@
+# DSV4 Decode Degeneration — Two Decisive Tests (run BEFORE any kernel/model change)
+
+**Symptom:** coherent-ish then degenerate decode; loops on a content token ("capital"/"capitalizing"); at times wrong top-1 from step 0.
+
+## ⛔ HARD STOP — do not do any of these until both tests below are run and reported
+
+- **Do NOT modify any kernel.**
+- **Do NOT modify the mHC math.**
+- **Do NOT add residual clipping, `C_l` scaling, or any "tame the residual" change.**
+
+The `CORRECTNESS_BACKLOG.md` verdict — *"mHC residual growth (|X|→860) is the confirmed root cause"* — is **unproven**, and the proposed remedies are surgery on a *trained* model to mask a symptom. If the real cause is the prompt (likely) or a missing final norm, those changes corrupt the model and hide the actual bug.
+
+## Why the backlog does NOT rule this out
+
+Every verification in `CORRECTNESS_BACKLOG.md` is a **same-input cosine**: production kernel vs PyTorch reference, both fed the **identical hand-rolled prompt**. That proves the kernels match *each other*. It is **structurally blind** to a chat-template/prompt bug — feed both sides the same malformed prompt and every layer agrees at cos 0.9999 *while both produce garbage*. So "we ruled out everything" means "everything a same-input cosine can see." The prompt is outside that set. The backlog is **silent** on the two hypotheses below, not a refutation of them.
+
+---
+
+## TEST 1 — Chat-template token-ID diff (most likely the actual bug; run first)
+
+**Hypothesis:** the hand-rolled prompt is out-of-distribution for this reasoning model → degenerate / looping output. The current construction in `single_shot_inference.py` is roughly:
+
+```python
+input_ids = [bos, USER_TOKEN]  # USER_TOKEN = 128803
+input_ids += tokenizer.encode('\n\n' + PROMPT, add_special_tokens=False)
+input_ids.append(ASSISTANT_TOKEN)  # ASSISTANT_TOKEN = 128804
+```
+
+This almost certainly does **not** match what the model was trained on (a reasoning model expects specific assistant-turn + `<think>` priming; THINK_START=128821, THINK_END=128822 exist for a reason).
+
+**Procedure**
+
+1. Print what we actually build:
+   ```python
+   print("hand_rolled ids:", input_ids)
+   print("hand_rolled str:", tokenizer.decode(input_ids))
+   ```
+2. Print the canonical template the tokenizer itself produces:
+   ```python
+   ref_ids = tokenizer.apply_chat_template(
+       [{"role": "user", "content": PROMPT}],
+       add_generation_prompt=True, tokenize=True,
+       # This is a reasoner. Check whether the template takes a thinking kwarg
+       # (e.g. enable_thinking=True / thinking=...). Try with and without.
+   )
+   print("template ids:", ref_ids)
+   print("template str:", tokenizer.apply_chat_template(
+       [{"role":"user","content":PROMPT}], add_generation_prompt=True, tokenize=False))
+   ```
+3. Also dump the raw source so we can read the special-token layout directly:
+   ```python
+   print(tokenizer.chat_template)  # or read tokenizer_config.json / chat_template.jinja
+   ```
+4. Diff `input_ids` vs `ref_ids`. Look specifically at: BOS handling, the user/assistant delimiter tokens, newline placement, and **the `<think>` priming after the assistant token**.
+
+**Decision**
+
+- **They differ (expected):** replace the hand-rolled construction with `apply_chat_template` output, then run a short greedy generation (`--temperature 0`, modest `--max-tokens`). If Paris returns as top-1 and the loop is gone → **this was the bug. Done.** Do not touch mHC.
+- **Identical but still degenerate:** the tokenizer template is faithful yet the model still loops → compare `chat_template.jinja` against the reference inference impl (`deepseek-ai/DeepSeek-V4-Pro/tree/main/inference`), and confirm the thinking-enabled variant is what's being applied. Then proceed to Test 2.
+
+> Note: the NVIDIA sglang run used `--reasoning-parser deepseek-v4` and `SGLANG_DEFAULT_THINKING=1`. The real format is not a bare `USER … ASSISTANT` sandwich — there is a thinking setup the hand-rolled path omits.
+
+---
+
+## TEST 2 — Falsify the mHC "root cause" (run before ANY mHC/residual change)
+
+**Claim under test (from the backlog):** *"|X|=860 compresses the logit range so the model can't distinguish tokens."*
+
+**Why it's suspect:** there is a final RMSNorm before the LM head, and RMSNorm is **scale-invariant** — it divides the magnitude out. So |X|=860 and |X|=8 should produce the *same* logits (modulo the learned norm weight). Also, the residual grows just as much during **prefill** (backlog's own numbers: |X| up to 476, ~6240 on token 0) yet prefill/first-token is correct — magnitude common to both phases cannot be what breaks *only* decode.
+
+**Procedure**
+
+1. **Confirm the final norm exists and is applied.** Trace the path from the last layer's residual `X` → final RMSNorm → `lm_head_lin(x_out)`. Print whether a final norm runs before the LM head.
+   - **If it is MISSING or not applied → STOP. That is the real bug.** The fix is to apply the final norm, *not* to clip the residual.
+2. **Falsification.** At the last decode layer, capture the residual at |X|≈860. Compute logits two ways through the *same* final-norm + LM-head path:
+   ```python
+   logits_A = lm_head(final_norm(X))  # X as-is, |X|≈860
+   logits_B = lm_head(final_norm(X / 100.0))  # scaled down
+   cos = F.cosine_similarity(logits_A.flatten().float(), logits_B.flatten().float(), dim=0)
+   print("argmax_A", logits_A.argmax().item(), "argmax_B", logits_B.argmax().item(), "cos", cos.item())
+   ```
+
+**Decision**
+
+- **argmax_A == argmax_B and cos ≈ 1.0 (expected):** mHC growth is **exonerated**. |X| magnitude is not the cause. Stop chasing mHC; the answer is in Test 1.
+- **They differ materially:** something downstream of the residual is magnitude-sensitive → the final norm is missing/broken/misapplied. **Fix the norm.** Still do not clip the residual.
+
+---
+
+## Test ordering
+
+1. **Test 1 first** — it's the most likely fix and is trivial. If it resolves the loop, you're done and mHC was never the problem.
+2. **Test 2 before touching mHC** — even if Test 1 isn't a full fix, prove (or correctly redirect) the mHC verdict before any model-level change. The only "fix" Test 2 can license is *applying a missing final norm*, never residual clipping.
+
+## Harness / workflow (from CORRECTNESS_BACKLOG §11)
+
+- Run via the harness: `~/.openclaw/workspace/fire_b200_test tests/unit/.py`. Never run or edit directly on the B200.
+- Edit locally → commit → push → pull on B200 → test.
+- Set `TEST_LAYERS` as an **env var** (`export TEST_LAYERS=10`), never as a CLI arg — single_shot's argparse will eat it and corrupt `--max-tokens` (this caused the bogus |X|=3.27e16 "blowups").
+- Both tests above are quick: Test 1 needs no GPU (tokenizer only); Test 2 needs one decode pass with `TEST_LAYERS=61`.
+
+## Report back (paste these)
+
+- **Test 1:** `hand_rolled ids`, `template ids`, the diff, and the greedy top-1 token after switching to `apply_chat_template`.
+- **Test 2:** whether a final norm is applied before the LM head; `argmax_A`, `argmax_B`, `cos`.
+
+Until both are reported, the mHC verdict stays **unproven** and no kernel/model change is authorized.
\ No newline at end of file
diff --git a/dsv4/layers/shared_expert.py b/dsv4/layers/shared_expert.py
index eb824744..8bbb034b 100644
--- a/dsv4/layers/shared_expert.py
+++ b/dsv4/layers/shared_expert.py
@@ -337,6 +337,10 @@ class Nvfp4SharedExpert:
 
     def _run_l2(self, intermediate: torch.Tensor) -> torch.Tensor:
         """L2 GEMM: intermediate × down_weight → BF16."""
+        # The intermediate from fused SwiGLU deinterleave is a column slice
+        # (non-contiguous). quantize_nvfp4_gpu_fused requires contiguous input.
+        if not intermediate.is_contiguous():
+            intermediate = intermediate.contiguous()
         num_tokens = intermediate.shape[0]
         padded_rows = cutedsl_ceil_div(num_tokens, 128) * 128
 
diff --git a/dsv4/ops/quantize.py b/dsv4/ops/quantize.py
index eab0281a..f7fe460c 100644
--- a/dsv4/ops/quantize.py
+++ b/dsv4/ops/quantize.py
@@ -315,6 +315,9 @@ def quantize_nvfp4_gpu_fused(x_bf16, divisor=6.0 * 448.0):
         x_sf: (M, N//16) float8_e4m3fn
         gsa: (M,) float32 GPU tensor — per-row global scale for GEMM
     """
+    # CUDA kernels require contiguous input — column slices from deinterleave are non-contiguous
+    if not x_bf16.is_contiguous():
+        x_bf16 = x_bf16.contiguous()
     from dsv4.kernels.cuda.loader import get_cuda_module
     amax_mod = get_cuda_module("amax_gsa", ["amax_gsa.cu"])
     gsa_gpu = amax_mod.compute_amax_gsa(x_bf16, divisor)  # scalar GPU tensor
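
The T>1 condition in the commit message is consistent with how row-major contiguity is usually defined: a dimension of size 1 places no constraint on its stride, so a single-row column slice can still pass the check while a multi-row one cannot. The sketch below illustrates this in plain Python, with no PyTorch or GPU required. The helper names, the width `N`, and the assumed layout (a dense `(T, 2N)` fused tensor sliced as `fused[:, :N]`) are hypothetical and are not taken from dsv4; if the deinterleave instead takes every other column, the slice would be non-contiguous even at T=1. As far as I know, `torch.Tensor.is_contiguous()` treats size-1 dimensions the same way.

```python
def row_major_strides(shape):
    """Strides (in elements) of a densely packed row-major array."""
    strides = []
    step = 1
    for dim in reversed(shape):
        strides.append(step)
        step *= dim
    return tuple(reversed(strides))


def is_row_major_contiguous(shape, strides):
    """True when the strides describe a dense row-major layout.

    Dimensions of size 1 are skipped: their stride is never used to
    step between elements, so any value is acceptable.
    """
    expected = row_major_strides(shape)
    return all(dim == 1 or s == e for dim, s, e in zip(shape, strides, expected))


N = 4  # hypothetical intermediate width, chosen only for illustration


def leading_half_slice(T):
    """Shape and strides of fused[:, :N] for a dense (T, 2N) parent (assumed layout)."""
    parent_strides = row_major_strides((T, 2 * N))
    # Slicing columns keeps the parent's strides and only narrows the shape.
    return (T, N), parent_strides


for T in (1, 3):
    shape, strides = leading_half_slice(T)
    print("T =", T, "shape", shape, "strides", strides,
          "contiguous:", is_row_major_contiguous(shape, strides))
```

Under these assumptions the row stride stays at `2N` after slicing, which only matters once there is more than one row to step over. Copying the slice, as the patch does with `.contiguous()`, produces a fresh buffer with strides `(N, 1)`, which passes the check for any T.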