Fix non-contiguous tensor in quantize_nvfp4_gpu_fused (T>1 prefill)

The intermediate tensor from fused SwiGLU deinterleave is a column slice
(non-contiguous). When T>1, quantize_nvfp4_gpu_fused receives this and
the CUDA kernel crashes with 'input must be contiguous'.

Fix: add is_contiguous() check + .contiguous() in quantize_nvfp4_gpu_fused
and in Nvfp4SharedExpert._run_l2. This is the root cause, not a workaround:
the CUDA kernel legitimately requires contiguous memory.
2026-06-03 07:56:19 +00:00
parent 60309ef124
commit 5e09be08af
3 changed files with 114 additions and 0 deletions

DEGENERATION_TESTS.md Normal file

@@ -0,0 +1,107 @@
# DSV4 Decode Degeneration — Two Decisive Tests (run BEFORE any kernel/model change)
**Symptom:** coherent-ish then degenerate decode; loops on a content token ("capital"/"capitalizing"); at times wrong top-1 from step 0.
## ⛔ HARD STOP — do not do any of these until both tests below are run and reported
- **Do NOT modify any kernel.**
- **Do NOT modify the mHC math.**
- **Do NOT add residual clipping, `C_l` scaling, or any "tame the residual" change.**
The `CORRECTNESS_BACKLOG.md` verdict — *"mHC residual growth (|X|→860) is the confirmed root cause"* — is **unproven**, and the proposed remedies are surgery on a *trained* model to mask a symptom. If the real cause is the prompt (likely) or a missing final norm, those changes corrupt the model and hide the actual bug.
## Why the backlog does NOT rule this out
Every verification in `CORRECTNESS_BACKLOG.md` is a **same-input cosine**: production kernel vs PyTorch reference, both fed the **identical hand-rolled prompt**. That proves the kernels match *each other*. It is **structurally blind** to a chat-template/prompt bug — feed both sides the same malformed prompt and every layer agrees at cos 0.9999 *while both produce garbage*. So "we ruled out everything" means "everything a same-input cosine can see." The prompt is outside that set. The backlog is **silent** on the two hypotheses below, not a refutation of them.
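The blind spot described above can be shown with a toy sketch. The two `layer_*` functions are hypothetical stand-ins for a production kernel and its PyTorch reference, and the token IDs are illustrative; none of this is the real DSV4 code. Fed the same input, the two implementations agree at cosine ≈ 1 whether or not that input is the one the model expects.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical stand-ins: two implementations of the same toy "layer".
def layer_reference(ids):
    return [math.tanh(0.01 * t) + 0.5 * i for i, t in enumerate(ids)]

def layer_production(ids):
    # Same math in a different evaluation order (mimics a fused kernel).
    return [0.5 * i + math.tanh(t * 0.01) for i, t in enumerate(ids)]

good_prompt = [1, 10, 20, 30, 40]  # illustrative IDs only
bad_prompt = [1, 10, 99, 30]       # malformed: wrong token, missing suffix

# Same-input comparison: passes for the good prompt AND for the bad one.
same_input_cos_good = cosine(layer_reference(good_prompt), layer_production(good_prompt))
same_input_cos_bad = cosine(layer_reference(bad_prompt), layer_production(bad_prompt))

# Only a comparison across prompts reveals that the bad prompt changes the output.
outputs_match_across_prompts = layer_reference(good_prompt) == layer_reference(bad_prompt)

print(same_input_cos_good, same_input_cos_bad, outputs_match_across_prompts)
```

Both cosines pass even though the second prompt is wrong, which is why a prompt bug has to be checked against an independent reference (Test 1) rather than against the same pipeline.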
---
## TEST 1 — Chat-template token-ID diff (most likely the actual bug; run first)
**Hypothesis:** the hand-rolled prompt is out-of-distribution for this reasoning model → degenerate / looping output. The current construction in `single_shot_inference.py` is roughly:
```python
input_ids = [bos, USER_TOKEN] # USER_TOKEN = 128803
input_ids += tokenizer.encode('\n\n' + PROMPT, add_special_tokens=False)
input_ids.append(ASSISTANT_TOKEN) # ASSISTANT_TOKEN = 128804
```
This almost certainly does **not** match what the model was trained on (a reasoning model expects specific assistant-turn + `<think>` priming; THINK_START=128821, THINK_END=128822 exist for a reason).
**Procedure**
1. Print what we actually build:
```python
print("hand_rolled ids:", input_ids)
print("hand_rolled str:", tokenizer.decode(input_ids))
```
2. Print the canonical template the tokenizer itself produces:
```python
ref_ids = tokenizer.apply_chat_template(
    [{"role": "user", "content": PROMPT}],
    add_generation_prompt=True,
    tokenize=True,
    # This is a reasoner. Check whether the template takes a thinking kwarg
    # (e.g. enable_thinking=True / thinking=...). Try with and without.
)
print("template ids:", ref_ids)
print("template str:", tokenizer.apply_chat_template(
    [{"role": "user", "content": PROMPT}],
    add_generation_prompt=True,
    tokenize=False,
))
```
3. Also dump the raw source so we can read the special-token layout directly:
```python
print(tokenizer.chat_template) # or read tokenizer_config.json / chat_template.jinja
```
4. Diff `input_ids` vs `ref_ids`. Look specifically at: BOS handling, the user/assistant delimiter tokens, newline placement, and **the `<think>` priming after the assistant token**.
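One way to do the diff in step 4 is Python's standard `difflib`. This is a minimal sketch: it assumes both sequences are plain lists of ints, `diff_token_ids` is a hypothetical helper name, and the example IDs other than 128803, 128804, and 128821 are made up for illustration.

```python
from difflib import SequenceMatcher

def diff_token_ids(hand_rolled, reference):
    """Return (op, hand_rolled_slice, reference_slice) for every mismatch."""
    matcher = SequenceMatcher(a=hand_rolled, b=reference, autojunk=False)
    return [
        (op, hand_rolled[i1:i2], reference[j1:j2])
        for op, i1, i2, j1, j2 in matcher.get_opcodes()
        if op != "equal"
    ]

# Illustrative IDs only; in practice pass input_ids and ref_ids from steps 1-2.
example_hand = [0, 128803, 271, 9906, 128804]
example_ref = [0, 128803, 9906, 128804, 128821]

for op, ours, theirs in diff_token_ids(example_hand, example_ref):
    print(f"{op:8s} hand_rolled={ours} template={theirs}")
```

An empty result means the two constructions are identical. Any `delete`, `insert`, or `replace` entry points at the exact tokens to inspect with `tokenizer.decode`.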
**Decision**
- **They differ (expected):** replace the hand-rolled construction with `apply_chat_template` output, then run a short greedy generation (`--temperature 0`, modest `--max-tokens`). If Paris returns as top-1 and the loop is gone → **this was the bug. Done.** Do not touch mHC.
- **Identical but still degenerate:** the tokenizer template is faithful yet the model still loops → compare `chat_template.jinja` against the reference inference impl (`deepseek-ai/DeepSeek-V4-Pro/tree/main/inference`), and confirm the thinking-enabled variant is what's being applied. Then proceed to Test 2.
> Note: the NVIDIA sglang run used `--reasoning-parser deepseek-v4` and `SGLANG_DEFAULT_THINKING=1`. The real format is not a bare `USER … ASSISTANT` sandwich — there is a thinking setup the hand-rolled path omits.
---
## TEST 2 — Falsify the mHC "root cause" (run before ANY mHC/residual change)
**Claim under test (from the backlog):** *"|X|=860 compresses the logit range so the model can't distinguish tokens."*
**Why it's suspect:** there is a final RMSNorm before the LM head, and RMSNorm is **scale-invariant**: it divides its input by that input's root-mean-square, so the magnitude cancels. |X|=860 and |X|=8 should therefore produce the *same* logits, up to the norm's epsilon term and low-precision rounding (the learned norm weight is applied after normalization, identically in both cases). Also, the residual grows just as much during **prefill** (backlog's own numbers: |X| up to 476, ~6240 on token 0) yet prefill/first-token is correct; a magnitude common to both phases cannot be what breaks *only* decode.
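The scale-invariance argument can be checked numerically with a plain-Python RMSNorm. This is a sketch under assumptions: the `eps` value, the input vector, and the weight are illustrative, not taken from the model config.

```python
import math

def rms_norm(x, weight, eps=1e-6):
    """Plain-Python RMSNorm: x / sqrt(mean(x^2) + eps) * weight."""
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [v * inv_rms * w for v, w in zip(x, weight)]

def max_abs_diff(a, b):
    return max(abs(p - q) for p, q in zip(a, b))

x = [860.0, -430.0, 215.0, 12.5]   # illustrative residual values
weight = [1.0, 0.5, 2.0, 1.5]      # illustrative learned weight

out_full = rms_norm(x, weight)
out_scaled = rms_norm([v / 100.0 for v in x], weight)   # same direction, 100x smaller
out_tiny = rms_norm([v * 1e-6 for v in x], weight)      # small enough for eps to dominate

diff_scaled = max_abs_diff(out_full, out_scaled)
diff_tiny = max_abs_diff(out_full, out_tiny)
print(diff_scaled, diff_tiny)
```

Dividing by 100 leaves the output essentially unchanged, because the learned weight multiplies both cases identically. The invariance only breaks once the input is so small that `eps` dominates the mean square, which is the opposite regime from |X|≈860.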
**Procedure**
1. **Confirm the final norm exists and is applied.** Trace the path from the last layer's residual `X` → final RMSNorm → `lm_head_lin(x_out)`. Print whether a final norm runs before the LM head.
- **If it is MISSING or not applied → STOP. That is the real bug.** The fix is to apply the final norm, *not* to clip the residual.
2. **Falsification.** At the last decode layer, capture the residual at |X|≈860. Compute logits two ways through the *same* final-norm + LM-head path:
```python
import torch.nn.functional as F

# lm_head and final_norm are the model's own modules; X is the captured residual.
logits_A = lm_head(final_norm(X))          # X as-is, |X|≈860
logits_B = lm_head(final_norm(X / 100.0))  # scaled down
cos = F.cosine_similarity(logits_A.flatten().float(), logits_B.flatten().float(), dim=0)
print("argmax_A", logits_A.argmax().item(), "argmax_B", logits_B.argmax().item(), "cos", cos.item())
```
**Decision**
- **argmax_A == argmax_B and cos ≈ 1.0 (expected):** mHC growth is **exonerated**. |X| magnitude is not the cause. Stop chasing mHC; the answer is in Test 1.
- **They differ materially:** something downstream of the residual is magnitude-sensitive → the final norm is missing/broken/misapplied. **Fix the norm.** Still do not clip the residual.
---
## Test ordering
1. **Test 1 first** — it's the most likely fix and is trivial. If it resolves the loop, you're done and mHC was never the problem.
2. **Test 2 before touching mHC** — even if Test 1 isn't a full fix, prove (or correctly redirect) the mHC verdict before any model-level change. The only "fix" Test 2 can license is *applying a missing final norm*, never residual clipping.
## Harness / workflow (from CORRECTNESS_BACKLOG §11)
- Run via the harness: `~/.openclaw/workspace/fire_b200_test tests/unit/<test>.py`. Never run or edit directly on the B200.
- Edit locally → commit → push → pull on B200 → test.
- Set `TEST_LAYERS` as an **env var** (`export TEST_LAYERS=10`), never as a CLI arg — single_shot's argparse will eat it and corrupt `--max-tokens` (this caused the bogus |X|=3.27e16 "blowups").
- Both tests above are quick: Test 1 needs no GPU (tokenizer only); Test 2 needs one decode pass with `TEST_LAYERS=61`.
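The env-var convention for `TEST_LAYERS` can be sketched as below. `read_test_layers` is a hypothetical helper, not the harness's actual code; it only illustrates reading the value from the environment so that it never passes through argparse.

```python
import os

def read_test_layers(default=None, environ=os.environ):
    """Read TEST_LAYERS from the environment (hypothetical helper)."""
    raw = environ.get("TEST_LAYERS")
    if raw is None or raw.strip() == "":
        return default
    value = int(raw)  # raises ValueError on a malformed value rather than guessing
    if value <= 0:
        raise ValueError(f"TEST_LAYERS must be positive, got {value}")
    return value

# Explicit mappings are passed here so the example does not depend on the shell.
print(read_test_layers(environ={"TEST_LAYERS": "10"}))
print(read_test_layers(default=61, environ={}))
```

Failing loudly on a malformed value keeps a typo from silently turning into a wrong layer count.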
## Report back (paste these)
- **Test 1:** `hand_rolled ids`, `template ids`, the diff, and the greedy top-1 token after switching to `apply_chat_template`.
- **Test 2:** whether a final norm is applied before the LM head; `argmax_A`, `argmax_B`, `cos`.
Until both are reported, the mHC verdict stays **unproven** and no kernel/model change is authorized.


@@ -337,6 +337,10 @@ class Nvfp4SharedExpert:
    def _run_l2(self, intermediate: torch.Tensor) -> torch.Tensor:
        """L2 GEMM: intermediate × down_weight → BF16."""
        # The intermediate from fused SwiGLU deinterleave is a column slice
        # (non-contiguous). quantize_nvfp4_gpu_fused requires contiguous input.
        if not intermediate.is_contiguous():
            intermediate = intermediate.contiguous()
        num_tokens = intermediate.shape[0]
        padded_rows = cutedsl_ceil_div(num_tokens, 128) * 128


@@ -315,6 +315,9 @@ def quantize_nvfp4_gpu_fused(x_bf16, divisor=6.0 * 448.0):
        x_sf: (M, N//16) float8_e4m3fn
        gsa: (M,) float32 GPU tensor — per-row global scale for GEMM
    """
    # CUDA kernels require contiguous input — column slices from deinterleave are non-contiguous
    if not x_bf16.is_contiguous():
        x_bf16 = x_bf16.contiguous()
    from dsv4.kernels.cuda.loader import get_cuda_module
    amax_mod = get_cuda_module("amax_gsa", ["amax_gsa.cu"])
    gsa_gpu = amax_mod.compute_amax_gsa(x_bf16, divisor)  # scalar GPU tensor
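
The contiguity requirement behind both guards, and the reason the crash only appears at T>1, can be illustrated with stride arithmetic in plain Python. This is a sketch under assumptions: it supposes the deinterleave yields a view equivalent to `buf[:, :N]` over a `(T, 2N)` row-major buffer, and the sizes are illustrative. The `is_contiguous` function below mirrors the usual rule that the stride of a size-1 dimension is ignored.

```python
def row_major_strides(shape):
    """Strides (in elements) of a dense row-major tensor with this shape."""
    strides, step = [], 1
    for dim in reversed(shape):
        strides.append(step)
        step *= dim
    return tuple(reversed(strides))

def is_contiguous(shape, strides):
    """Dense row-major check; strides of size-1 dimensions are ignored."""
    expected = 1
    for dim, stride in zip(reversed(shape), reversed(strides)):
        if dim == 1:
            continue
        if stride != expected:
            return False
        expected *= dim
    return True

def offset(index, strides):
    """Element offset of a multi-dimensional index under the given strides."""
    return sum(i * s for i, s in zip(index, strides))

N = 8                                          # illustrative width
parent_strides = row_major_strides((4, 2 * N)) # assumed fused buffer: (2N, 1)

# A column slice keeps the parent's strides but has only N columns.
prefill_ok = is_contiguous((4, N), parent_strides)  # T=4
decode_ok = is_contiguous((1, N), parent_strides)   # T=1

# Element (1, 0): where it really lives vs. where a dense-layout kernel would look.
actual = offset((1, 0), parent_strides)
assumed = offset((1, 0), row_major_strides((4, N)))
print(prefill_ok, decode_ok, actual, assumed)
```

With a single row the slice still counts as contiguous, so decode never trips the kernel's check; with several rows the row stride no longer matches a dense layout. Note that `Tensor.contiguous()` already returns the tensor itself when it is contiguous, so the explicit `is_contiguous()` test in the hunks is a readability choice rather than a necessity.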