Update CURRENT_BUG.md with Bug 8 (global→local expert ID) and Bug 8b (.cpu() sync)

2026-05-17 09:01:24 +00:00
parent ca3cba5bbd
commit eb7d4f099b
1 changed files with 34 additions and 7 deletions
--- a/CURRENT_BUG.md
+++ b/CURRENT_BUG.md
@@ -1,18 +1,18 @@
 # Current Bug: CuTeDSLMoERunner — Status & Debug History

-## Current Status (May 17, 2026 08:35 UTC)
+## Current Status (May 17, 2026 09:01 UTC)

-**Mostly fixed. 0.97 cosine with warmup gs. Ready for vLLM container test.**
+**Bug 8 fixed. Ready for vLLM container test.**

 - ✅ `layertest.py` — 0.988 cosine
 - ✅ `cudagraph_test.py` — capture + replay works
- ✅ `test_warmup_gs.py` — 0.97 cosine with `compute_activation_global_scales()` warmup
+- ✅ `test_warmup_gs.py` — warmup gs computation works (test script has a pre-existing NameError in safety margin section, not a runner bug)
 - ❌ vLLM server — not yet tested with these fixes

-**Remaining concerns:**
- The CPU-based `_token_indices` uses `sort_idx.cpu()` which is a CPU-GPU sync — may interfere with cudagraph capture
- The `compute_activation_global_scales()` warmup needs to be called from `deepseek_v4.py` during model warmup
- The checkpoint `input_scale` should NOT be used as the activation global_scale (it's a calibration value, not a runtime value)
+**Fixed in this round:**
+- Bug 8: Global→local expert ID remapping (was causing CUDA_ERROR_ASSERT)
+- Removed `.cpu()` sync from `run()` — `_token_indices` now on GPU, cudagraph-safe
+- Added `_needs_token_refill` flag to handle CuTeDSL JIT GPU memory corruption after first GEMM call

 ---

@@ -94,6 +94,33 @@ sorted_token_ids = token_indices[sort_idx.cpu()].to(device)

 **Fix:** Separate `_padded_x_sf_buf_l1` and `_padded_x_sf_buf_l2`, plus separate `_per_expert_scale_bufs_l1` and `_per_expert_scale_bufs_l2`.

+### Bug 8: Global→Local Expert ID Mismatch — CUDA_ERROR_ASSERT
+
+**Symptom:** `IndexKernel.cu:111` assertion `-sizes[i] <= index && index < sizes[i]` failed, cascading into CUDA_ERROR_ASSERT (710) across all workers. vLLM server crash on first inference.
+
+**Root cause:** With expert parallelism (EP=8), `topk_ids` contains **global** expert IDs (0-255), but `CuTeDSLMoERunner` treated them as **local** IDs (0-31). Each rank only owns 32 experts (`num_experts=32`), so tokens assigned to experts 32-255 produced:
+1. Wrong `expert_offsets` computation (tokens matched no local expert → zero counts for many experts)
+2. Out-of-bounds scatter indices in `_assemble_scales_cudagraph_safe` (`dst_rows` exceeded `padded_x_sf` buffer size)
+3. CUDA device-side assert → all subsequent CUDA calls fail with error 710
+
+The layertest never hit this because it uses local expert IDs directly (no EP).
+
+**Fix:** 
+1. Added `experts_start_idx` param to `CuTeDSLMoERunner`
+2. In `run()`: remap global→local via `local_ids = topk_ids - experts_start_idx`, mask non-local tokens with zero weight, clamp IDs to valid range
+3. Pass `experts_start_idx` from `deepseek_v4.py` (which already stores it from EP setup)
+
+### Bug 8b: `.cpu()` Sync Breaking Cudagraph Compatibility
+
+**Symptom:** `sort_idx.cpu()` in `run()` — a CPU-GPU synchronization point that cudagraph cannot capture.
+
+**Root cause:** `_token_indices` was kept on CPU to avoid CuTeDSL JIT GPU memory corruption (Bug 3). But cudagraph requires all ops to be GPU-only.
+
+**Fix:**
+1. Moved `_token_indices` to GPU
+2. Added `_fill_token_indices()` method to refill the tensor after potential corruption
+3. Added `_needs_token_refill` flag — set after `_ensure_stacked()` (weight JIT), checked/cleared after first `run()` call (GEMM JIT). After both JITs have fired, the tensor is stable.
+
 ---

 ## Debug Methodology — How We Got Here