The intermediate tensor from fused SwiGLU deinterleave is a column slice (non-contiguous). When T>1, quantize_nvfp4_gpu_fused receives this and the CUDA kernel crashes with 'input must be contiguous'. Fix: add is_contiguous() check + .contiguous() in quantize_nvfp4_gpu_fused and in SharedExpert._run_l2. This is the root cause, not a workaround — CUDA kernels legitimately require contiguous memory.
DSV4 Decode Degeneration — Two Decisive Tests (run BEFORE any kernel/model change)
Symptom: coherent-ish then degenerate decode; loops on a content token ("capital"/"capitalizing"); at times wrong top-1 from step 0.
⛔ HARD STOP — do not do any of these until both tests below are run and reported
- Do NOT modify any kernel.
- Do NOT modify the mHC math.
- Do NOT add residual clipping, C_l scaling, or any "tame the residual" change.
The CORRECTNESS_BACKLOG.md verdict — "mHC residual growth (|X|→860) is the confirmed root cause" — is unproven, and the proposed remedies are surgery on a trained model to mask a symptom. If the real cause is the prompt (likely) or a missing final norm, those changes corrupt the model and hide the actual bug.
Why the backlog does NOT rule this out
Every verification in CORRECTNESS_BACKLOG.md is a same-input cosine: production kernel vs PyTorch reference, both fed the identical hand-rolled prompt. That proves the kernels match each other. It is structurally blind to a chat-template/prompt bug — feed both sides the same malformed prompt and every layer agrees at cos 0.9999 while both produce garbage. So "we ruled out everything" means "everything a same-input cosine can see." The prompt is outside that set. The backlog is silent on the two hypotheses below, not a refutation of them.
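A toy illustration of that blind spot, not the real harness: `reference_layer` and `production_layer` are hypothetical stand-ins for the PyTorch reference and the production kernel, and both token-ID lists are made up. The same-input cosine stays near 1 for the malformed prompt too, even though the two prompts produce very different outputs.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length float sequences."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def reference_layer(ids):
    # Hypothetical stand-in for the reference path: any deterministic map.
    return [float(t % 7) - 3.0 for t in ids]

def production_layer(ids):
    # Hypothetical stand-in for the production kernel: same math plus a
    # tiny numeric perturbation.
    return [v * (1.0 + 1e-6) + 1e-7 for v in reference_layer(ids)]

# Made-up token IDs: a "canonical" prompt and a "malformed" one.
good_prompt = [0, 128803, 1001, 1002, 1003, 128804, 128821]
bad_prompt = [0, 128803, 271, 1001, 1002, 1003, 128804]

# What a same-input check measures: kernel vs reference on one prompt.
same_input_good = cosine(reference_layer(good_prompt), production_layer(good_prompt))
same_input_bad = cosine(reference_layer(bad_prompt), production_layer(bad_prompt))

# What it never measures: how far the malformed prompt moves the output.
cross_input = cosine(reference_layer(good_prompt), reference_layer(bad_prompt))

print("same-input (good prompt):", same_input_good)
print("same-input (bad prompt): ", same_input_bad)
print("good vs bad prompt:      ", cross_input)
```

Both same-input numbers pass any cos > 0.9999 gate; only the third comparison, which the backlog never makes, registers the prompt change.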
TEST 1 — Chat-template token-ID diff (most likely the actual bug; run first)
Hypothesis: the hand-rolled prompt is out-of-distribution for this reasoning model → degenerate / looping output. The current construction in single_shot_inference.py is roughly:
    input_ids = [bos, USER_TOKEN]  # USER_TOKEN = 128803
    input_ids += tokenizer.encode('\n\n' + PROMPT, add_special_tokens=False)
    input_ids.append(ASSISTANT_TOKEN)  # ASSISTANT_TOKEN = 128804
This almost certainly does not match what the model was trained on (a reasoning model expects specific assistant-turn + <think> priming; THINK_START=128821, THINK_END=128822 exist for a reason).
Procedure
- Print what we actually build:

      print("hand_rolled ids:", input_ids)
      print("hand_rolled str:", tokenizer.decode(input_ids))

- Print the canonical template the tokenizer itself produces:

      ref_ids = tokenizer.apply_chat_template(
          [{"role": "user", "content": PROMPT}],
          add_generation_prompt=True,
          tokenize=True,
          # This is a reasoner. Check whether the template takes a thinking kwarg
          # (e.g. enable_thinking=True / thinking=...). Try with and without.
      )
      print("template ids:", ref_ids)
      print("template str:", tokenizer.apply_chat_template(
          [{"role": "user", "content": PROMPT}],
          add_generation_prompt=True, tokenize=False))

- Also dump the raw source so we can read the special-token layout directly:

      print(tokenizer.chat_template)  # or read tokenizer_config.json / chat_template.jinja

- Diff input_ids vs ref_ids. Look specifically at: BOS handling, the user/assistant delimiter tokens, newline placement, and the <think> priming after the assistant token.
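One way to do the diff step without eyeballing two long lists is a sequence diff from the standard library. This is a minimal sketch that assumes both sides are plain Python lists of ints; the helper name `diff_token_ids` and the ID lists are made up for illustration, so substitute the real input_ids and ref_ids.

```python
from difflib import SequenceMatcher

def diff_token_ids(hand_rolled, reference):
    """Return (op, hand_rolled_slice, reference_slice) for every non-matching span."""
    matcher = SequenceMatcher(a=hand_rolled, b=reference, autojunk=False)
    return [
        (op, hand_rolled[i1:i2], reference[j1:j2])
        for op, i1, i2, j1, j2 in matcher.get_opcodes()
        if op != "equal"
    ]

# Made-up ID lists for illustration only.
hand_rolled = [0, 128803, 271, 1001, 1002, 128804]
reference = [0, 128803, 1001, 1002, 128804, 128821]

changes = diff_token_ids(hand_rolled, reference)
for op, ours, theirs in changes:
    print(f"{op:8s} hand_rolled={ours} template={theirs}")
```

An empty `changes` list means the two constructions are token-identical; any entry names the exact IDs to decode and inspect.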
Decision
- They differ (expected): replace the hand-rolled construction with apply_chat_template output, then run a short greedy generation (--temperature 0, modest --max-tokens). If Paris returns as top-1 and the loop is gone → this was the bug. Done. Do not touch mHC.
- Identical but still degenerate: the tokenizer template is faithful yet the model still loops → compare chat_template.jinja against the reference inference impl (deepseek-ai/DeepSeek-V4-Pro/tree/main/inference), and confirm the thinking-enabled variant is what's being applied. Then proceed to Test 2.
Note: the NVIDIA sglang run used --reasoning-parser deepseek-v4 and SGLANG_DEFAULT_THINKING=1. The real format is not a bare USER … ASSISTANT sandwich; there is a thinking setup the hand-rolled path omits.
TEST 2 — Falsify the mHC "root cause" (run before ANY mHC/residual change)
Claim under test (from the backlog): "|X|=860 compresses the logit range so the model can't distinguish tokens."
Why it's suspect: there is a final RMSNorm before the LM head, and RMSNorm is scale-invariant: it divides the magnitude out. So |X|=860 and |X|=8 in the same direction should produce the same logits (the learned norm weight multiplies both identically; only the norm's small eps term breaks exact invariance). Also, the residual grows just as much during prefill (backlog's own numbers: |X| up to 476, ~6240 on token 0) yet prefill/first-token is correct, so a magnitude common to both phases cannot be what breaks only decode.
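The scale-invariance argument, written out. This assumes the standard RMSNorm definition, with learned weight $g$, hidden size $d$, and $\varepsilon$ inside the square root; whether the model's final norm matches that form is exactly what Test 2 checks.

```latex
\mathrm{RMSNorm}(x) = g \odot \frac{x}{\sqrt{\tfrac{1}{d}\sum_{i=1}^{d} x_i^2 + \varepsilon}}
\qquad\Longrightarrow\qquad
\mathrm{RMSNorm}(c\,x)
= g \odot \frac{c\,x}{\sqrt{c^2\,\tfrac{1}{d}\sum_{i=1}^{d} x_i^2 + \varepsilon}}
= g \odot \frac{x}{\sqrt{\tfrac{1}{d}\sum_{i=1}^{d} x_i^2 + \varepsilon / c^2}},
\quad c > 0 .
```

For $c = 1/100$ the only change is $\varepsilon \to 10^{4}\,\varepsilon$, which is negligible whenever the mean square of the scaled residual is much larger than that. The weight $g$ appears unchanged on both sides, so it cannot make the logits depend on $|X|$.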
Procedure
- Confirm the final norm exists and is applied. Trace the path from the last layer's residual X → final RMSNorm → lm_head_lin(x_out). Print whether a final norm runs before the LM head.
  - If it is MISSING or not applied → STOP. That is the real bug. The fix is to apply the final norm, not to clip the residual.
- Falsification. At the last decode layer, capture the residual at |X|≈860. Compute logits two ways through the same final-norm + LM-head path:

      import torch.nn.functional as F

      logits_A = lm_head(final_norm(X))          # X as-is, |X|≈860
      logits_B = lm_head(final_norm(X / 100.0))  # scaled down
      cos = F.cosine_similarity(logits_A.flatten().float(),
                                logits_B.flatten().float(), dim=0)
      print("argmax_A", logits_A.argmax().item(),
            "argmax_B", logits_B.argmax().item(), "cos", cos.item())
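A dry run of the falsification on a toy model shows what the expected outcome looks like before spending a decode pass. Everything here is an assumption for illustration: pure-Python stand-ins for the final norm and LM head, random weights, and a made-up hidden size and vocabulary. It is not the model's code path.

```python
import math
import random

def rms_norm(x, weight, eps=1e-6):
    """Standard RMSNorm: x / sqrt(mean(x^2) + eps), times a learned weight."""
    inv_rms = 1.0 / math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [w * v * inv_rms for w, v in zip(weight, x)]

def lm_head(h, rows):
    """Bias-free linear head: one dot product per vocabulary row."""
    return [sum(r * v for r, v in zip(row, h)) for row in rows]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def argmax(values):
    return max(range(len(values)), key=values.__getitem__)

rng = random.Random(0)
dim, vocab = 16, 32  # toy sizes, not the model's
weight = [rng.uniform(0.5, 1.5) for _ in range(dim)]
rows = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(vocab)]
X = [rng.gauss(0.0, 200.0) for _ in range(dim)]  # large-magnitude toy residual

logits_A = lm_head(rms_norm(X, weight), rows)                       # X as-is
logits_B = lm_head(rms_norm([v / 100.0 for v in X], weight), rows)  # scaled down

argmax_A, argmax_B = argmax(logits_A), argmax(logits_B)
cos = cosine(logits_A, logits_B)
print("argmax_A", argmax_A, "argmax_B", argmax_B, "cos", cos)
```

If the real path gives the same picture (matching argmax, cos at 1.0 to within rounding), the final norm is doing its job and the residual magnitude is not reaching the logits.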
Decision
- argmax_A == argmax_B and cos ≈ 1.0 (expected): mHC growth is exonerated. |X| magnitude is not the cause. Stop chasing mHC; the answer is in Test 1.
- They differ materially: something downstream of the residual is magnitude-sensitive → the final norm is missing/broken/misapplied. Fix the norm. Still do not clip the residual.
Test ordering
- Test 1 first — it's the most likely fix and is trivial. If it resolves the loop, you're done and mHC was never the problem.
- Test 2 before touching mHC — even if Test 1 isn't a full fix, prove (or correctly redirect) the mHC verdict before any model-level change. The only "fix" Test 2 can license is applying a missing final norm, never residual clipping.
Harness / workflow (from CORRECTNESS_BACKLOG §11)
- Run via the harness: ~/.openclaw/workspace/fire_b200_test tests/unit/<test>.py. Never run or edit directly on the B200.
- Edit locally → commit → push → pull on B200 → test.
- Set TEST_LAYERS as an env var (export TEST_LAYERS=10), never as a CLI arg: single_shot's argparse will eat it and corrupt --max-tokens (this caused the bogus |X|=3.27e16 "blowups").
- Both tests above are quick: Test 1 needs no GPU (tokenizer only); Test 2 needs one decode pass with TEST_LAYERS=61.
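The env-var rule can be sketched as follows. The helper `read_test_layers` and the two-flag parser are hypothetical and only mirror the flags named in this document; single_shot's real argument handling may differ. The point is that the layer count never passes through argparse, so it cannot be mistaken for another argument.

```python
import argparse
import os

def read_test_layers(default):
    """Read TEST_LAYERS from the environment; fall back to `default` if unset."""
    raw = os.environ.get("TEST_LAYERS")
    if raw is None or raw.strip() == "":
        return default
    value = int(raw)  # raises ValueError on junk such as "10 layers"
    if value <= 0:
        raise ValueError(f"TEST_LAYERS must be positive, got {value}")
    return value

def build_parser():
    # Hypothetical parser: only the two flags mentioned in this document.
    parser = argparse.ArgumentParser()
    parser.add_argument("--temperature", type=float, default=0.0)
    parser.add_argument("--max-tokens", type=int, default=32)
    return parser

os.environ["TEST_LAYERS"] = "10"  # same effect as `export TEST_LAYERS=10`
args = build_parser().parse_args(["--max-tokens", "16"])
layers = read_test_layers(default=61)
print("layers:", layers, "max_tokens:", args.max_tokens)
```

Failing loudly on a non-integer or non-positive value is a deliberate choice here: a silently misparsed layer count is how a bogus run gets mistaken for a numerical blowup.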
Report back (paste these)
- Test 1: hand_rolled ids, template ids, the diff, and the greedy top-1 token after switching to apply_chat_template.
- Test 2: whether a final norm is applied before the LM head; argmax_A, argmax_B, cos.
Until both are reported, the mHC verdict stays unproven and no kernel/model change is authorized.