Add layer-by-layer diagnostic prints (CLAWMINE_DEBUG=1, enforce-eager)

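When CLAWMINE_DEBUG=1, prints amax/mean/NaN/Inf of the hidden states after each layer. Must be run with --enforce-eager, because the data-dependent prints (.item() calls) break Dynamo. The check is gated on os.environ, which is evaluated at trace time, so the diagnostic path is dead code in compiled runs when the flag is unset.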
2026-05-18 12:51:51 +00:00
parent 2d1e9f42b1
commit 9e7639fba4
3 changed files with 84 additions and 200 deletions
--- a/CURRENT_BUG.md
+++ b/CURRENT_BUG.md
@@ -1,212 +1,79 @@
-# Current Bug: CuTeDSLMoERunner — Status & Debug History
+# Current Bug: vLLM produces empty/garbage output with NaN logits

-## Current Status (May 17, 2026 21:30 UTC)
+**Status:** Active debugging
+**Date:** 2026-05-18

-**Bug 26 FIXED. All tests pass.**
+## Symptom
+- vLLM server starts successfully, loads model, captures cudagraph
+- Chat completions return `content: ""` with `finish_reason: "length"`
+- Raw completions API returns: `Out of range float values are not JSON compliant: nan`
+- 50 completion tokens generated but all produce NaN logits
+- Model: DeepSeek-V4-Pro-NVFP4 on 8x B200 (TP=8)

- ✅ `layertest.py` — 0.988 cosine
- ✅ `cudagraph_test.py` — capture + replay works
- ✅ `test_pipeline_real_weights.py` — 0.988 cosine (8 tokens, 48 experts)
- ⏳ vLLM container — needs rebuild + test with Bug 26 fix
+## Known Good
+- `layertest.py` passes (cosine 0.988 with BF16 reference) — MoE kernel math is correct
+- `cudagraph_test.py` passes — no CPU-GPU syncs, capture + replay works
+- Model weights load successfully (281K tensors)
+- Kernel compiles and runs without CUDA errors

-**Latest fix: Missing swiglu_limit=10.0 activation clamping (Bug 25).** DeepSeek-V4 uses `SiluAndMulWithClamp(10.0)` which clamps `silu(gate)` to max 10.0 and `up` to [-10, 10]. Our runner was doing plain `F.silu(gate) * up` without clamping. Large gate values → unbounded SiLU output → corrupted L2 GEMM input → garbage logits → model outputs BOS/thinking tokens.
+## Hypotheses

-**vLLM launch config:**
-```
--gpu_memory_utilization=0.9
--compilation-config='{"cudagraph_mode": "FULL_DECODE_ONLY", "custom_ops": ["all"], "cudagraph_capture_sizes": [1, 2, 4, 8], "max_cudagraph_capture_size": 8}'
-```
+### H1: Activation global scale (gs) is wrong
+- `compute_activation_global_scales` is called during init with **random data** (torch.randn)
+- Random data may produce gs values that don't represent real token distributions
+- If gs is too small: activation quantization clips and information is lost
+- If gs is too large: quantization noise dominates the signal
+- [ ] **Test:** Run layertest with the exact gs the vLLM init computes, compare vs dynamic gs
+- [ ] **Test:** Run runner on real token data outside vLLM, check for NaN/garbage

---
+### H2: Attention layer produces bad hidden states before MoE
+- If attention output is NaN/garbage, MoE amplifies it
+- The MoE kernel may be fine but receives bad input
+- [ ] **Test:** Hook into layer 0 forward, inspect hidden_states before MoE

-## Bugs Found & Fixed
+### H3: Weight loading mismatch between vLLM and test runner
+- vLLM loads weights via its own pipeline (DeepseekV4ForCausalLM weight_loader)
+- Test scripts load directly from safetensors
+- The weight loading patches (model. prefix strip, CKPT_KEY_SUBST) may have bugs
+- [ ] **Test:** Compare weights loaded by vLLM vs direct safetensors load

-### Bug 1: Scale Assembly — Global vs Per-Expert Swizzle
-**Fix:** Two-phase scatter + per-expert swizzle.
+### H4: Expert routing / topk_ids mismatch
+- vLLM uses global expert IDs, runner expects local expert IDs
+- If routing is wrong, wrong experts process tokens
+- [ ] **Test:** Log topk_ids in vLLM vs test, verify they match expected patterns

-### Bug 2: `searchsorted(right=False)`
-**Fix:** Changed to `right=True`.
+### H5: Residual connection scale issue
+- vLLM adds MoE output to residual: `hidden = residual + MoE(hidden)`
+- If MoE output scale is wrong, residual connection can amplify error across layers
+- [ ] **Test:** Run test_multilayer.py to check error accumulation

-### Bug 3: CuTeDSL `cute.compile` GPU Memory Corruption — CRITICAL
-**Symptom:** `_token_indices` all zeros after JIT.
-**Root cause:** `cute.compile` corrupts GPU memory.
-**Fix:** `_fill_token_indices()` builds on CPU, copies to GPU. `_needs_token_refill` flag.
+### H6: Checkpoint `input_scale` is still being used somewhere
+- MEMORY.md says checkpoint input_scale is wrong
+- The code comment says gs default is 1/2688, overridden by warmup
+- But maybe finalize_weights sets it to checkpoint input_scale somewhere?
+- [ ] **Test:** Verify which gs value is actually used at runtime

-### Bug 4: `expert_offsets` With Leading 0
-**Fix:** Pass `expert_offsets[1:]` to GEMM.
+### H7: DeepSeek V4 attention / RoPE bug
+- The cos_sin_cache fix and float32 patch are applied
+- But maybe attention still produces garbage for real token positions
+- [ ] **Test:** Single-layer test with real token positions (not random)

-### Bug 5: Checkpoint `input_scale` Wrong for Runtime gs
-**Root cause:** Calibration value, too-small gs → block scale overflow.
-**Fix:** `compute_activation_global_scales()` warmup method.
+## Test Plan (ordered by ease and likelihood)

-### Bug 6: L1/L2 Need Separate gs
-**Fix:** Compute L2 gs from L1 output after SiLU*up.
+1. **Quick: Run layertest.py on B200** — baseline, confirm kernel still works
+2. **Standalone runner test with real-ish data** — use runner outside vLLM, check output
+3. **Inspect gs values** — print the gs computed by warmup, compare with dynamic gs
+4. **Multi-layer accumulation test** — test_multilayer.py
+5. **Weight loading comparison** — dump vLLM loaded weights vs direct load
+6. **Full pipeline test** — test_pipeline_real_weights.py with 48 experts
+7. **Attention output inspection** — check hidden_states before MoE in vLLM

-### Bug 7: L1/L2 Need Separate Scale Buffers
-**Fix:** Separate `_padded_x_sf_buf_l1`/`_l2`, separate per-expert bufs.
+## Progress

-### Bug 8: Global→Local Expert ID Mismatch — CUDA_ERROR_ASSERT
-**Symptom:** `IndexKernel.cu:111` OOB, cascading CUDA_ERROR_ASSERT (710).
-**Root cause:** `topk_ids` contains global IDs (0-255), runner treated as local.
-**Fix:** `experts_start_idx`, remap global→local, mask non-local tokens.
-
-### Bug 8b: `.cpu()` Sync Breaking Cudagraph
-**Fix:** `_token_indices` on GPU, `_fill_token_indices()` CPU→GPU copy.
-
-### Bug 9–11: Buffer sizing and swizzle layout
-See previous versions for details.
-
-### Bug 12: `torch.full()` During Cudagraph Capture
-**Symptom:** `cudaErrorStreamCaptureUnsupported`.
-**Fix:** Pre-allocated buffers, `.fill_()` instead of `torch.full()`.
-
-### Bug 13: Warmup Passed Global Expert IDs
-**Fix:** Pass local IDs (0..num_experts-1).
-
-### Bug 14: GEMM Scale Layout Mismatch — Fixed 128-Row vs Variable
-**Symptom:** BOS token repeat (garbage logits).
-**Root cause:** Scale assembly at `e*128`, GEMM reads by real expert_offsets. Expert with 500 tokens → GEMM reads 500 scale rows but only 128 have data.
-**Fix:** Variable padded expert offsets, scatter into real padded positions.
-
-### Bug 15: OOM — Per-Layer Padded Buffers (4.3 GB)
-**Root cause:** 72 MB × 60 layers = 4.3 GB. Not enough room for KV cache.
-**Fix:** Shared buffers (Bug 21).
-
-### Bug 16: `padded_max_slots` Mismatch
-**Fix:** Size for `num_experts * max_chunks * 128`.
-
-### Bug 17: Shape Mismatch (49152 vs 3072)
-**Root cause:** Cap `max_num_tokens` to 512 made buffers too small for 8192-token warmup.
-**Fix:** Reverted cap, use shared buffers.
-
-### Bug 18–20: Cudagraph Capture Failures (dynamic allocs, variable loops, GPU scalars)
-**Fix:** Pre-allocate everything, fixed loop counts, Python constants for offsets.
-
-### Bug 21: OOM — Shared Padded Buffers
-**Fix:** Class-level shared buffers dict keyed by device. `padded_hidden`, `padded_activated`, `padded_xsf_l1`/`l2`, `output` all shared. ~150 MB total instead of ~4.3 GB.
-
-### Bug 22: Token Dropping via `clamped_local`
-**Symptom:** Garbage model output (empty/invisible tokens).
-**Root cause:** `local_row.clamp(max=max_rows_per_expert-1)` silently dropped tokens when an expert got more than `max_chunks*128` tokens. `max_chunks` was computed as average (ceil(total_slots / (num_experts*128))), not worst-case. MoE routing is uneven — some experts get 200+ tokens while others get 10.
-**Fix:** Use real padded expert offsets (variable per expert, padded to 128). No clamping needed — each expert gets exactly the space it needs.
-
-### Bug 23: cudaErrorStreamCaptureUnsupported from Dynamic GPU Slicing
-**Symptom:** All 8 workers fail during cudagraph capture.
-**Root cause:** `buf[:total_padded_slots]` where `total_padded_slots` is a GPU scalar — dynamic tensor slicing with a GPU index is a CUDA operation not permitted during stream capture.
-**Fix:** Use full pre-allocated buffers, no dynamic GPU slicing. Pass `x_sf[:num_slots]` (Python int) to scale assembly.
-
-### Bug 24: Scale Assembly `.cpu().tolist()` Breaks Cudagraph
-**Symptom:** `cudaErrorStreamCaptureInvalidated` during capture.
-**Root cause:** Per-expert Python loops with GPU-derived offsets required `.cpu().tolist()` for slicing — CPU-GPU sync invalidates stream capture.
-**Fix:** Full-buffer Blackwell 32_4_4 swizzle. Apply `to_blocked` transform to entire `padded_x_sf` buffer at once. No CPU syncs, no Python loops. The buffer is already 128-row aligned per expert and 4-col aligned, so the full-buffer swizzle produces the correct layout. GEMM reads `scale_a` using `padded_expert_offsets`, matching the scatter layout.
-
-### Bug 25: Missing `swiglu_limit=10.0` Activation Clamping — LIKELY CAUSE OF GARBAGE OUTPUT
-**Symptom:** Model generates 30 tokens of empty/invisible content (BOS or thinking token). Not meaningful text.
-**Root cause:** DeepSeek-V4 uses `SiluAndMulWithClamp(10.0)` which:
- Clamps `silu(gate)` to max 10.0
- Clamps `up` to [-10.0, 10.0]
-
-Our runner did plain `F.silu(gate) * up` without clamping. Large gate values produce unbounded SiLU output (silu(20) ≈ 20, silu(50) ≈ 50). These large values get multiplied by the up projection, producing activations with amax >> 10. This:
-1. Corrupts the L2 GEMM input (quantized with wrong gs)
-2. Produces garbage L2 output
-3. Final logits are wrong → model collapses to most frequent token (BOS)
-
-**Fix:** Added `set_swiglu_limit(limit)` to runner. In `run()`, apply clamping:
-```python
-gate_silu = F.silu(gate)
-if self._swiglu_limit is not None:
-    gate_silu = gate_silu.clamp(max=self._swiglu_limit)
-    up = up.clamp(min=-self._swiglu_limit, max=self._swiglu_limit)
-activated = gate_silu * up
-```
-Called from `deepseek_v4.py` after warmup: `self._cutedsl_runner.set_swiglu_limit(float(self.swiglu_limit))`.
-
-### Bug 26: Padded Buffer x_sf Mismatch — Experts 1+ Get Zero Output — FIXED
-**Symptom:** Runner produces cosine 0.285 vs BF16. Some tokens get exactly zero output. Expert 0 L1 cosine 0.996, experts 1+ get cosine 0.0.
-**Root cause:** Runner quantized `padded_hidden` (4096 rows with zero padding) → `quantize_activation_nvfp4` returns x_sf with 4096 rows. Then `x_sf[:num_slots]` (first 48 rows) only covers expert 0's tokens (padded rows 0-127). Expert 1's tokens are at padded row 128, but x_sf[4] corresponds to padded row 64 (still expert 0's padding), not expert 1's data. The scale assembly scattered wrong scales for experts 1+, producing zero GEMM output.
-**Fix:** Quantize `slot_hidden` (sorted tokens, num_slots rows) instead of `padded_hidden`. This gives x_sf with num_slots rows (one per token), which the scale assembly correctly scatters into padded layout. Scatter x_fp4 into a new `hidden_fp4` padded buffer (uint8→view float4). Same fix for L2 with `activated_fp4` buffer.
-
---
-
-## Current Architecture: Variable Padded Expert Offsets
-
-```
-Each expert padded to next multiple of 128 tokens.
-padded_expert_offsets computed from real tokens_per_expert (GPU).
-
-Scatter: padded_dst = padded_expert_offsets[expert_assign] + local_row
-GEMM input: padded_hidden (full pre-allocated buffer, not sliced)
-GEMM offsets: padded_expert_offsets[1:] (GPU tensor)
-GEMM output: full buffer size; extract via l1_out[padded_dst]
-
-Scale assembly:
-  Phase 1: Scatter x_sf into padded_x_sf at padded_expert_offsets
-  Phase 2: Full-buffer Blackwell 32_4_4 swizzle (no CPU syncs)
-  Zero CPU syncs, zero Python loops
-
-Activation:
-  SiLU(gate) clamped to swiglu_limit (10.0)
-  up clamped to [-swiglu_limit, swiglu_limit]
-  activated = clamped_silu * clamped_up
-
-Shared buffers (class-level, ~150 MB total):
-  padded_hidden, padded_activated, padded_xsf_l1, padded_xsf_l2, output
-```
-
-### Cudagraph Constraints (All Resolved)
- No `.item()`, `.cpu()`, `.tolist()`
- No `torch.zeros/ones/full/empty/arange()` during capture — pre-allocate everything
- No dynamic GPU slicing (`buf[:gpu_scalar]`) — use full buffers
- No Python loops with GPU-derived values — full-buffer ops instead
- No `torch.full()` — pre-allocated `.fill_()`
- Shared buffers OK (layers sequential during capture and replay)
- `F.silu().clamp()` and `.clamp()` are GPU ops — cudagraph-safe ✅
-
-### EP Configuration (DeepSeek-V4-Pro on 8×B200)
- 256 total experts, top_k=6, swiglu_limit=10.0
- EP=8 → 48 local experts per rank (n_routed_experts / ep_size = 256/8 = 32, but logs show 48)
- `experts_start_idx` = rank × 32
- `max_num_tokens` = 8192
- `max_chunks_per_expert` = ceil(8192 × 6 / (48 × 128)) = 8
-
---
-
-## Shared Expert Path (verified correct)
-
-```
-DeepseekV4MoE.forward():
-  1. gate → fused_topk_bias → topk_weights, topk_ids
-  2. self.experts(hidden_states, topk_weights, topk_ids) → routed_output
-  3. EP all-reduce across ranks
-  4. self.shared_experts(hidden_states) → shared_output
-  5. final = routed_output + shared_output
-```
-
- Shared experts: `DeepseekV4MLP` (not NVFP4, uses standard quantization)
- `routed_scaling_factor`: Applied in `fused_topk_bias` to topk_weights ✅
- `renormalize`: Top-k weights normalized to sum to 1 ✅
- `scoring_func=sqrtsoftplus`: Applied in routing ✅
-
---
-
-## Test Files
-
-| File | Purpose |
-|------|---------|
-| `tests/layertest.py` | Reference vs runner, 3 experts. Must pass ≥0.98 cosine. |
-| `tests/cudagraph_test.py` | Cudagraph capture + replay. Must pass. |
-| `tests/test_pipeline_real_weights.py` | Full runner vs BF16 reference, 48 experts, 8 tokens. Must pass ≥0.98 cosine. |
-
-**Run order after any code change:**
-1. `python3 tests/layertest.py` — must pass
-2. `python3 tests/cudagraph_test.py` — must pass
-
---
-
-## Repo Info
-
- **Kernel:** `sweetapi.com/biondizzle/nvfp4-megamoe-kernel` (master)
- **Local:** `~/dev/nvfp4-megamoe-kernel/`
- **B200:** `/root/nvfp4-megamoe-kernel/`
- **Model:** `/root/nvidia-meeting/DeepSeek-V4-Pro-NVFP4` (read-only)
- **Never edit on B200 directly** — edit locally → commit → push → pull on B200
+- [x] Removed NaN check (Dynamo incompatible)
+- [x] vLLM container starts and loads model
+- [x] Confirmed NaN logits from completions API
+- [x] ~~H1: gs is wrong~~ — Warmup gs produces cosine 0.988 with BF16 ref. **gs is NOT the problem** when warmup is used.
+  - Default gs (1/2688) gives cosine 0.621, but vLLM calls warmup during init
+  - **BUT:** Does vLLM actually call warmup before every forward, or just once? If gs is computed from random data once and never updated, it may not generalize.
+- [ ] **New lead:** MoE kernel is fine, problem is upstream (attention, embeddings, or weight loading in vLLM path)
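
Reviewer note on H4 (global vs. local expert IDs): a minimal sketch of the remap that hypothesis is checking, written over plain Python lists so it can be run without a GPU. The function name `remap_to_local` and the `-1` sentinel for non-local slots are assumptions made for illustration; they are not taken from the runner, which may mask non-local tokens differently.

```python
def remap_to_local(topk_ids, experts_start_idx, num_local_experts):
    """Map global expert IDs to rank-local IDs.

    IDs outside [experts_start_idx, experts_start_idx + num_local_experts)
    belong to another rank and are replaced by -1 (hypothetical sentinel).
    """
    end = experts_start_idx + num_local_experts
    return [
        [gid - experts_start_idx if experts_start_idx <= gid < end else -1
         for gid in row]
        for row in topk_ids
    ]


# One token routed to four experts, viewed from a rank that owns IDs 32..63.
print(remap_to_local([[0, 33, 40, 255]], experts_start_idx=32, num_local_experts=32))
```

Logging `topk_ids` before and after such a remap on each rank would show whether any ID falls outside the local range after subtraction, which is the failure mode H4 describes.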
--- a/docker-compose.yml
+++ b/docker-compose.yml
@@ -10,14 +10,14 @@ services:
      - CUDA_LAUNCH_BLOCKING=0
      - PYTHONUNBUFFERED=1
      - VLLM_RPC_TIMEOUT_MS=600000
-      - CLAWMINE_NAN_CHECK=1
+      - CLAWMINE_DEBUG=1
    command:
      - /model
      - --trust-remote-code
      - --enable-expert-parallel
      - --tensor-parallel-size=8
-      #- --enforce-eager
-      - --compilation-config
+      - --enforce-eager
+      #- --compilation-config
      #- '{"cudagraph_mode": "NONE", "custom_ops": ["all"]}'
      - '{"cudagraph_mode": "FULL_DECODE_ONLY", "custom_ops": ["all"], "cudagraph_capture_sizes": [1, 2, 4, 8], "max_cudagraph_capture_size": 8}' # This is what is running right now
      #- '{"cudagraph_mode":"FULL_AND_PIECEWISE", "custom_ops":["all"]}'
--- a/vllm/patches/deepseek_v4.py
+++ b/vllm/patches/deepseek_v4.py
@@ -1210,6 +1210,21 @@ class DeepseekV4DecoderLayer(nn.Module):
        return x


+def _diag_hidden_stats(hidden_states: torch.Tensor, layer_idx: int):
+    """Print hidden state stats after each layer. Disabled unless
+    CLAWMINE_DEBUG=1. os.environ is evaluated at trace time, so
+    the data-dependent path is dead code when disabled."""
+    if os.environ.get('CLAWMINE_DEBUG', '0') != '1':
+        return
+    # Only reached when CLAWMINE_DEBUG=1 (must run with --enforce-eager)
+    with torch.no_grad():
+        amax = hidden_states.abs().amax().item()  # abs() so large negative outliers are reported too
+        mean = hidden_states.float().mean().item()
+        has_nan = torch.isnan(hidden_states).any().item()
+        has_inf = torch.isinf(hidden_states).any().item()
+        print(f"[CLAWMINE] Layer {layer_idx}: amax={amax:.4f} mean={mean:.6f} NaN={has_nan} Inf={has_inf}")
+
+
@support_torch_compile
 class DeepseekV4Model(nn.Module):
    def __init__(self, *, vllm_config: VllmConfig, prefix: str = ""):
@@ -1316,6 +1331,8 @@ class DeepseekV4Model(nn.Module):
                positions,
                input_ids,
            )
+            # Diagnostic: print amax/mean every layer (eager-mode only, no Dynamo)
+            _diag_hidden_stats(hidden_states, layer_idx)


        # Stash pre-hc_head residual for the MTP draft (captured copy_).
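
Reviewer note: once the server runs with `CLAWMINE_DEBUG=1`, the per-layer lines can be scanned for the first layer that reports NaN or Inf. The sketch below assumes the log lines keep exactly the format printed by `_diag_hidden_stats` in this commit; the helper name `first_bad_layer` is hypothetical and not part of the patch.

```python
import re

# Matches the line emitted by _diag_hidden_stats, e.g.
# "[CLAWMINE] Layer 3: amax=12.5000 mean=0.000100 NaN=False Inf=False"
LINE_RE = re.compile(
    r"\[CLAWMINE\] Layer (?P<layer>\d+): "
    r"amax=(?P<amax>\S+) mean=(?P<mean>\S+) "
    r"NaN=(?P<nan>True|False) Inf=(?P<inf>True|False)"
)


def first_bad_layer(log_text):
    """Return the layer index of the first log line reporting NaN or Inf,
    in log order, or None if no such line is found."""
    for line in log_text.splitlines():
        m = LINE_RE.search(line)
        if m is None:
            continue  # unrelated log output
        if m["nan"] == "True" or m["inf"] == "True":
            return int(m["layer"])
    return None


sample = (
    "[CLAWMINE] Layer 0: amax=3.2500 mean=0.001200 NaN=False Inf=False\n"
    "[CLAWMINE] Layer 1: amax=nan mean=nan NaN=True Inf=False\n"
)
print(first_bad_layer(sample))
```

With TP=8 every rank prints its own lines and they interleave, so the result is the first bad line in log order rather than a per-rank answer; filtering the log by worker before scanning would narrow it down.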