From eb7d4f099b05a2c8fd5b4b6a7789db499a8fc5f5 Mon Sep 17 00:00:00 2001 From: biondizzle Date: Sun, 17 May 2026 09:01:24 +0000 Subject: [PATCH] =?UTF-8?q?Update=20CURRENT=5FBUG.md=20with=20Bug=208=20(g?= =?UTF-8?q?lobal=E2=86=92local=20expert=20ID)=20and=20Bug=208b=20(.cpu()?= =?UTF-8?q?=20sync)?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit --- CURRENT_BUG.md | 41 ++++++++++++++++++++++++++++++++++------- 1 file changed, 34 insertions(+), 7 deletions(-) diff --git a/CURRENT_BUG.md b/CURRENT_BUG.md index ea40856d..454bbeab 100644 --- a/CURRENT_BUG.md +++ b/CURRENT_BUG.md @@ -1,18 +1,18 @@ # Current Bug: CuTeDSLMoERunner — Status & Debug History -## Current Status (May 17, 2026 08:35 UTC) +## Current Status (May 17, 2026 09:01 UTC) -**Mostly fixed. 0.97 cosine with warmup gs. Ready for vLLM container test.** +**Bug 8 fixed. Ready for vLLM container test.** - ✅ `layertest.py` — 0.988 cosine - ✅ `cudagraph_test.py` — capture + replay works -- ✅ `test_warmup_gs.py` — 0.97 cosine with `compute_activation_global_scales()` warmup +- ✅ `test_warmup_gs.py` — warmup gs computation works (test script has a pre-existing NameError in safety margin section, not a runner bug) - ❌ vLLM server — not yet tested with these fixes -**Remaining concerns:** -- The CPU-based `_token_indices` uses `sort_idx.cpu()` which is a CPU-GPU sync — may interfere with cudagraph capture -- The `compute_activation_global_scales()` warmup needs to be called from `deepseek_v4.py` during model warmup -- The checkpoint `input_scale` should NOT be used as the activation global_scale (it's a calibration value, not a runtime value) +**Fixed in this round:** +- Bug 8: Global→local expert ID remapping (was causing CUDA_ERROR_ASSERT) +- Removed `.cpu()` sync from `run()` — `_token_indices` now on GPU, cudagraph-safe +- Added `_needs_token_refill` flag to handle CuTeDSL JIT GPU memory corruption after first GEMM call --- @@ -94,6 +94,33 @@ sorted_token_ids = token_indices[sort_idx.cpu()].to(device) **Fix:** Separate `_padded_x_sf_buf_l1` and `_padded_x_sf_buf_l2`, plus separate `_per_expert_scale_bufs_l1` and `_per_expert_scale_bufs_l2`. +### Bug 8: Global→Local Expert ID Mismatch — CUDA_ERROR_ASSERT + +**Symptom:** `IndexKernel.cu:111` assertion `-sizes[i] <= index && index < sizes[i]` failed, cascading into CUDA_ERROR_ASSERT (710) across all workers. vLLM server crash on first inference. + +**Root cause:** With expert parallelism (EP=8), `topk_ids` contains **global** expert IDs (0-255), but `CuTeDSLMoERunner` treated them as **local** IDs (0-31). Each rank only owns 32 experts (`num_experts=32`), so tokens assigned to experts 32-255 produced: +1. Wrong `expert_offsets` computation (tokens matched no local expert → zero counts for many experts) +2. Out-of-bounds scatter indices in `_assemble_scales_cudagraph_safe` (`dst_rows` exceeded `padded_x_sf` buffer size) +3. CUDA device-side assert → all subsequent CUDA calls fail with error 710 + +The layertest never hit this because it uses local expert IDs directly (no EP). + +**Fix:** +1. Added `experts_start_idx` param to `CuTeDSLMoERunner` +2. In `run()`: remap global→local via `local_ids = topk_ids - experts_start_idx`, mask non-local tokens with zero weight, clamp IDs to valid range +3. Pass `experts_start_idx` from `deepseek_v4.py` (which already stores it from EP setup) + +### Bug 8b: `.cpu()` Sync Breaking Cudagraph Compatibility + +**Symptom:** `sort_idx.cpu()` in `run()` — a CPU-GPU synchronization point that cudagraph cannot capture. + +**Root cause:** `_token_indices` was kept on CPU to avoid CuTeDSL JIT GPU memory corruption (Bug 3). But cudagraph requires all ops to be GPU-only. + +**Fix:** +1. Moved `_token_indices` to GPU +2. Added `_fill_token_indices()` method to refill the tensor after potential corruption +3. Added `_needs_token_refill` flag — set after `_ensure_stacked()` (weight JIT), checked/cleared after first `run()` call (GEMM JIT). After both JITs have fired, the tensor is stable. + --- ## Debug Methodology — How We Got Here