biondizzle 5e09be08af Fix non-contiguous tensor in quantize_nvfp4_gpu_fused (T>1 prefill)

The intermediate tensor from fused SwiGLU deinterleave is a column slice
(non-contiguous). When T>1, quantize_nvfp4_gpu_fused receives this and
the CUDA kernel crashes with 'input must be contiguous'.

Fix: add is_contiguous() check + .contiguous() in quantize_nvfp4_gpu_fused
and in SharedExpert._run_l2. This is the root cause, not a workaround —
CUDA kernels legitimately require contiguous memory.

2026-06-03 07:56:19 +00:00

DSV4 Decode Degeneration — Two Decisive Tests (run BEFORE any kernel/model change)

Symptom: coherent-ish then degenerate decode; loops on a content token ("capital"/"capitalizing"); at times wrong top-1 from step 0.

⛔ HARD STOP — do not do any of these until both tests below are run and reported

Do NOT modify any kernel.
Do NOT modify the mHC math.
Do NOT add residual clipping, C_l scaling, or any "tame the residual" change.

The CORRECTNESS_BACKLOG.md verdict — "mHC residual growth (|X|→860) is the confirmed root cause" — is unproven, and the proposed remedies are surgery on a trained model to mask a symptom. If the real cause is the prompt (likely) or a missing final norm, those changes corrupt the model and hide the actual bug.

Why the backlog does NOT rule this out

Every verification in CORRECTNESS_BACKLOG.md is a same-input cosine: production kernel vs PyTorch reference, both fed the identical hand-rolled prompt. That proves the kernels match each other. It is structurally blind to a chat-template/prompt bug — feed both sides the same malformed prompt and every layer agrees at cos 0.9999 while both produce garbage. So "we ruled out everything" means "everything a same-input cosine can see." The prompt is outside that set. The backlog is silent on the two hypotheses below, not a refutation of them.
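The blind spot can be reproduced with a toy model. The sketch below is not the DSV4 code: `reference_layer` and `production_layer` are hypothetical stand-ins for the PyTorch reference and the production kernel, and the random vectors stand in for a well-formed and a malformed prompt. It only illustrates that a same-input cosine stays near 1 whatever the input is, while the outputs for two different inputs can be nearly unrelated.

```python
import math
import random


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def reference_layer(weights, x):
    # Plain matrix-vector product: hypothetical stand-in for the reference path.
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def production_layer(weights, x, noise):
    # Same math plus a tiny per-element error: hypothetical stand-in for the kernel.
    return [y + 1e-6 * n for y, n in zip(reference_layer(weights, x), noise)]


rng = random.Random(0)
dim = 64
weights = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(dim)]
noise = [rng.gauss(0, 1) for _ in range(dim)]
good_input = [rng.gauss(0, 1) for _ in range(dim)]  # stands in for a correct prompt
bad_input = [rng.gauss(0, 1) for _ in range(dim)]   # stands in for a malformed prompt

# Both implementations fed the same (bad) input: they agree almost perfectly.
same_input_cos = cosine(
    reference_layer(weights, bad_input),
    production_layer(weights, bad_input, noise),
)
# One implementation fed the good and the bad input: the outputs diverge.
cross_input_cos = cosine(
    reference_layer(weights, good_input),
    reference_layer(weights, bad_input),
)
print("same-input cos:", same_input_cos)
print("cross-input cos:", cross_input_cos)
```

The same-input check passes even though the input is wrong, which is why only a comparison against an independently built prompt (Test 1) can catch this class of bug.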

TEST 1 — Chat-template token-ID diff (most likely the actual bug; run first)

Hypothesis: the hand-rolled prompt is out-of-distribution for this reasoning model → degenerate / looping output. The current construction in single_shot_inference.py is roughly:

input_ids = [bos, USER_TOKEN]                                   # USER_TOKEN = 128803
input_ids += tokenizer.encode('\n\n' + PROMPT, add_special_tokens=False)
input_ids.append(ASSISTANT_TOKEN)                               # ASSISTANT_TOKEN = 128804

This almost certainly does not match what the model was trained on (a reasoning model expects specific assistant-turn + <think> priming; THINK_START=128821, THINK_END=128822 exist for a reason).

Procedure

Print what we actually build:

print("hand_rolled ids:", input_ids)
print("hand_rolled str:", tokenizer.decode(input_ids))

Print the canonical template the tokenizer itself produces:

ref_ids = tokenizer.apply_chat_template(
    [{"role": "user", "content": PROMPT}],
    add_generation_prompt=True, tokenize=True,
    # This is a reasoner. Check whether the template takes a thinking kwarg
    # (e.g. enable_thinking=True / thinking=...). Try with and without.
)
print("template ids:", ref_ids)
print("template str:", tokenizer.apply_chat_template(
    [{"role":"user","content":PROMPT}], add_generation_prompt=True, tokenize=False))

Also dump the raw source so we can read the special-token layout directly:

print(tokenizer.chat_template)         # or read tokenizer_config.json / chat_template.jinja

Diff input_ids vs ref_ids. Look specifically at: BOS handling, the user/assistant delimiter tokens, newline placement, and the <think> priming after the assistant token.
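One way to do the diff is with the standard library's `difflib.SequenceMatcher`, which works on any list of hashable items. The helper name `diff_token_ids` and the example ID lists are hypothetical; in particular the `reference` list is invented for illustration and is not the template's real output.

```python
from difflib import SequenceMatcher


def diff_token_ids(hand_rolled, reference):
    """Return (tag, hand_rolled_span, reference_span) for every span that differs."""
    matcher = SequenceMatcher(a=hand_rolled, b=reference, autojunk=False)
    return [
        (tag, hand_rolled[i1:i2], reference[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


# Illustrative IDs only: 0 stands in for BOS, 271 for a newline token,
# 1000-1002 for the prompt text. Replace both lists with input_ids and ref_ids.
hand_rolled = [0, 128803, 271, 1000, 1001, 1002, 128804]
reference = [0, 128803, 1000, 1001, 1002, 128804, 128821]

differences = diff_token_ids(hand_rolled, reference)
for tag, ours, theirs in differences:
    print(f"{tag:8s} hand_rolled={ours} template={theirs}")
```

Each reported span can then be passed through `tokenizer.decode` to see which delimiter, newline, or priming token is extra or missing.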

Decision

They differ (expected): replace the hand-rolled construction with apply_chat_template output, then run a short greedy generation (--temperature 0, modest --max-tokens). If Paris returns as top-1 and the loop is gone → this was the bug. Done. Do not touch mHC.
Identical but still degenerate: the tokenizer template is faithful yet the model still loops → compare chat_template.jinja against the reference inference impl (deepseek-ai/DeepSeek-V4-Pro/tree/main/inference), and confirm the thinking-enabled variant is what's being applied. Then proceed to Test 2.

Note: the NVIDIA sglang run used --reasoning-parser deepseek-v4 and SGLANG_DEFAULT_THINKING=1. The real format is not a bare USER … ASSISTANT sandwich — there is a thinking setup the hand-rolled path omits.

TEST 2 — Falsify the mHC "root cause" (run before ANY mHC/residual change)

Claim under test (from the backlog): "|X|=860 compresses the logit range so the model can't distinguish tokens."

Why it's suspect: there is a final RMSNorm before the LM head, and RMSNorm is scale-invariant: it divides the magnitude out. So |X|=860 and |X|=8 should produce the same logits, up to the norm's eps term and floating-point rounding (the learned norm weight is applied after normalization, so it is the same in both cases). Also, the residual grows just as much during prefill (backlog's own numbers: |X| up to 476, ~6240 on token 0) yet prefill/first-token is correct. Magnitude common to both phases cannot be what breaks only decode.
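The scale-invariance argument can be checked on a toy example without a GPU. The sketch below assumes the textbook RMSNorm definition (divide by the root mean square, then multiply by a learned weight); the real model's `eps`, weights, and LM head are not known here, so `rms_norm`, `lm_head`, and all numbers are hypothetical.

```python
import math


def rms_norm(x, weight, eps=1e-6):
    # y_i = weight_i * x_i / sqrt(mean(x^2) + eps)
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [w * v * inv_rms for w, v in zip(weight, x)]


def lm_head(h, rows):
    # One logit per row of the (toy) output projection.
    return [sum(r * v for r, v in zip(row, h)) for row in rows]


def argmax(values):
    return max(range(len(values)), key=values.__getitem__)


# Toy residual with large magnitude, a toy norm weight, and a toy 3-token head.
x = [860.0 * v for v in (0.3, -1.2, 0.7, 2.0)]
weight = [1.0, 0.5, 2.0, 1.5]
head = [[0.2, -0.1, 0.4, 0.3], [-0.5, 0.3, 0.1, 0.2], [0.1, 0.1, -0.3, 0.6]]

logits_a = lm_head(rms_norm(x, weight), head)                         # X as-is
logits_b = lm_head(rms_norm([v / 100.0 for v in x], weight), head)    # X / 100

max_abs_diff = max(abs(a - b) for a, b in zip(logits_a, logits_b))
print("argmax_A", argmax(logits_a), "argmax_B", argmax(logits_b))
print("max |logit_A - logit_B|:", max_abs_diff)
```

The only difference between the two paths comes from `eps`, which matters more as the input shrinks; at these magnitudes it is far below anything that could change the ranking of tokens.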

Procedure

Confirm the final norm exists and is applied. Trace the path from the last layer's residual X → final RMSNorm → lm_head_lin(x_out). Print whether a final norm runs before the LM head.
- If it is MISSING or not applied → STOP. That is the real bug. The fix is to apply the final norm, not to clip the residual.

Falsification. At the last decode layer, capture the residual at |X|≈860. Compute logits two ways through the same final-norm + LM-head path:

import torch.nn.functional as F

# X is the captured last-layer residual; final_norm and lm_head are the model's own modules.
logits_A = lm_head(final_norm(X))            # X as-is, |X|≈860
logits_B = lm_head(final_norm(X / 100.0))    # scaled down
cos = F.cosine_similarity(logits_A.flatten().float(), logits_B.flatten().float(), dim=0)
print("argmax_A", logits_A.argmax().item(), "argmax_B", logits_B.argmax().item(), "cos", cos.item())

Decision

argmax_A == argmax_B and cos ≈ 1.0 (expected): mHC growth is exonerated. |X| magnitude is not the cause. Stop chasing mHC; the answer is in Test 1.
They differ materially: something downstream of the residual is magnitude-sensitive → the final norm is missing/broken/misapplied. Fix the norm. Still do not clip the residual.

Test ordering

Test 1 first — it's the most likely fix and is trivial. If it resolves the loop, you're done and mHC was never the problem.
Test 2 before touching mHC — even if Test 1 isn't a full fix, prove (or correctly redirect) the mHC verdict before any model-level change. The only "fix" Test 2 can license is applying a missing final norm, never residual clipping.

Harness / workflow (from CORRECTNESS_BACKLOG §11)

Run via the harness: ~/.openclaw/workspace/fire_b200_test tests/unit/<test>.py. Never run or edit directly on the B200.
Edit locally → commit → push → pull on B200 → test.
Set TEST_LAYERS as an env var (export TEST_LAYERS=10), never as a CLI arg — single_shot's argparse will eat it and corrupt --max-tokens (this caused the bogus |X|=3.27e16 "blowups").
Both tests above are quick: Test 1 needs no GPU (tokenizer only); Test 2 needs one decode pass with TEST_LAYERS=61.
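For reference, reading the layer count from the environment rather than from the command line could look like the sketch below. How single_shot_inference.py actually consumes TEST_LAYERS is not shown in this document, so the helper name `read_test_layers` and its validation are assumptions.

```python
import os


def read_test_layers(default=None):
    """Read TEST_LAYERS from the environment; return `default` when it is unset."""
    raw = os.environ.get("TEST_LAYERS")
    if raw is None:
        return default
    value = int(raw)  # raises ValueError if the variable is not an integer
    if value <= 0:
        raise ValueError(f"TEST_LAYERS must be positive, got {value}")
    return value


# Equivalent to `export TEST_LAYERS=10` in the shell before launching the test.
os.environ["TEST_LAYERS"] = "10"
print("layers:", read_test_layers())
```

Keeping the value out of the argument list means the argument parser never sees it, so it cannot be mistaken for the value of another flag such as --max-tokens.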

Report back (paste these)

Test 1: hand_rolled ids, template ids, the diff, and the greedy top-1 token after switching to apply_chat_template.
Test 2: whether a final norm is applied before the LM head; argmax_A, argmax_B, cos.

Until both are reported, the mHC verdict stays unproven and no kernel/model change is authorized.
