From eb7d4f099b05a2c8fd5b4b6a7789db499a8fc5f5 Mon Sep 17 00:00:00 2001
From: biondizzle <biondizzle@gmail.com>
Date: Sun, 17 May 2026 09:01:24 +0000
Subject: [PATCH] =?UTF-8?q?Update=20CURRENT=5FBUG.md=20with=20Bug=208=20(g?=
 =?UTF-8?q?lobal=E2=86=92local=20expert=20ID)=20and=20Bug=208b=20(.cpu()?=
 =?UTF-8?q?=20sync)?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

---
 CURRENT_BUG.md | 41 ++++++++++++++++++++++++++++++++++-------
 1 file changed, 34 insertions(+), 7 deletions(-)

diff --git a/CURRENT_BUG.md b/CURRENT_BUG.md
index ea40856d..454bbeab 100644
--- a/CURRENT_BUG.md
+++ b/CURRENT_BUG.md
@@ -1,18 +1,18 @@
 # Current Bug: CuTeDSLMoERunner — Status & Debug History
 
-## Current Status (May 17, 2026 08:35 UTC)
+## Current Status (May 17, 2026 09:01 UTC)
 
-**Mostly fixed. 0.97 cosine with warmup gs. Ready for vLLM container test.**
+**Bug 8 fixed. Ready for vLLM container test.**
 
 - ✅ `layertest.py` — 0.988 cosine
 - ✅ `cudagraph_test.py` — capture + replay works
-- ✅ `test_warmup_gs.py` — 0.97 cosine with `compute_activation_global_scales()` warmup
+- ✅ `test_warmup_gs.py` — warmup gs computation works (test script has a pre-existing NameError in safety margin section, not a runner bug)
 - ❌ vLLM server — not yet tested with these fixes
 
-**Remaining concerns:**
-- The CPU-based `_token_indices` uses `sort_idx.cpu()` which is a CPU-GPU sync — may interfere with cudagraph capture
-- The `compute_activation_global_scales()` warmup needs to be called from `deepseek_v4.py` during model warmup
-- The checkpoint `input_scale` should NOT be used as the activation global_scale (it's a calibration value, not a runtime value)
+**Fixed in this round:**
+- Bug 8: Global→local expert ID remapping (was causing CUDA_ERROR_ASSERT)
+- Removed `.cpu()` sync from `run()` — `_token_indices` now on GPU, cudagraph-safe
+- Added `_needs_token_refill` flag to handle CuTeDSL JIT GPU memory corruption after first GEMM call
 
 ---
 
@@ -94,6 +94,33 @@ sorted_token_ids = token_indices[sort_idx.cpu()].to(device)
 
 **Fix:** Separate `_padded_x_sf_buf_l1` and `_padded_x_sf_buf_l2`, plus separate `_per_expert_scale_bufs_l1` and `_per_expert_scale_bufs_l2`.
 
+### Bug 8: Global→Local Expert ID Mismatch — CUDA_ERROR_ASSERT
+
+**Symptom:** `IndexKernel.cu:111` assertion `-sizes[i] <= index && index < sizes[i]` failed, cascading into CUDA_ERROR_ASSERT (710) across all workers. vLLM server crash on first inference.
+
+**Root cause:** With expert parallelism (EP=8), `topk_ids` contains **global** expert IDs (0-255), but `CuTeDSLMoERunner` treated them as **local** IDs (0-31). Each rank only owns 32 experts (`num_experts=32`), so tokens assigned to experts 32-255 produced:
+1. Wrong `expert_offsets` computation (tokens matched no local expert → zero counts for many experts)
+2. Out-of-bounds scatter indices in `_assemble_scales_cudagraph_safe` (`dst_rows` exceeded `padded_x_sf` buffer size)
+3. CUDA device-side assert → all subsequent CUDA calls fail with error 710
+
+The layertest never hit this because it uses local expert IDs directly (no EP).
+
+**Fix:**
+1. Added `experts_start_idx` param to `CuTeDSLMoERunner`
+2. In `run()`: remap global→local via `local_ids = topk_ids - experts_start_idx`, mask non-local tokens with zero weight, clamp IDs to valid range
+3. Passed `experts_start_idx` from `deepseek_v4.py` (which already stores it from EP setup)
+
+### Bug 8b: `.cpu()` Sync Breaking Cudagraph Compatibility
+
+**Symptom:** `run()` called `sort_idx.cpu()`, a CPU-GPU synchronization point that cannot be captured in a cudagraph.
+
+**Root cause:** `_token_indices` was kept on CPU to avoid CuTeDSL JIT GPU memory corruption (Bug 3). But cudagraph requires all ops to be GPU-only.
+
+**Fix:**
+1. Moved `_token_indices` to GPU
+2. Added `_fill_token_indices()` method to refill the tensor after potential corruption
+3. Added `_needs_token_refill` flag — set after `_ensure_stacked()` (weight JIT), checked/cleared after first `run()` call (GEMM JIT). After both JITs have fired, the tensor is stable.
+
 ---
 
 ## Debug Methodology — How We Got Here