Update CURRENT_BUG.md with Bug 8 (global→local expert ID) and Bug 8b (.cpu() sync)

2026-05-17 09:01:24 +00:00
parent ca3cba5bbd
commit eb7d4f099b


@@ -1,18 +1,18 @@
# Current Bug: CuTeDSLMoERunner — Status & Debug History
## Current Status (May 17, 2026 08:35 UTC)
## Current Status (May 17, 2026 09:01 UTC)
**Mostly fixed. 0.97 cosine with warmup gs. Ready for vLLM container test.**
**Bug 8 fixed. Ready for vLLM container test.**
- `layertest.py` — 0.988 cosine
- `cudagraph_test.py` — capture + replay works
- `test_warmup_gs.py` reaches 0.97 cosine with `compute_activation_global_scales()` warmup
- `test_warmup_gs.py` warmup gs computation works (the test script has a pre-existing NameError in its safety margin section, which is not a runner bug)
- ❌ vLLM server — not yet tested with these fixes
**Remaining concerns:**
- The CPU-based `_token_indices` uses `sort_idx.cpu()` which is a CPU-GPU sync — may interfere with cudagraph capture
- The `compute_activation_global_scales()` warmup needs to be called from `deepseek_v4.py` during model warmup
- The checkpoint `input_scale` should NOT be used as the activation global_scale (it's a calibration value, not a runtime value)
**Fixed in this round:**
- Bug 8: Global→local expert ID remapping (was causing CUDA_ERROR_ASSERT)
- Removed `.cpu()` sync from `run()`; `_token_indices` is now on GPU and cudagraph-safe
- Added `_needs_token_refill` flag to handle CuTeDSL JIT GPU memory corruption after first GEMM call
---
@@ -94,6 +94,33 @@ sorted_token_ids = token_indices[sort_idx.cpu()].to(device)
**Fix:** Separate `_padded_x_sf_buf_l1` and `_padded_x_sf_buf_l2`, plus separate `_per_expert_scale_bufs_l1` and `_per_expert_scale_bufs_l2`.
### Bug 8: Global→Local Expert ID Mismatch — CUDA_ERROR_ASSERT
**Symptom:** `IndexKernel.cu:111` assertion `-sizes[i] <= index && index < sizes[i]` failed, cascading into CUDA_ERROR_ASSERT (710) across all workers. vLLM server crash on first inference.
**Root cause:** With expert parallelism (EP=8), `topk_ids` contains **global** expert IDs (0-255), but `CuTeDSLMoERunner` treated them as **local** IDs (0-31). Each rank only owns 32 experts (`num_experts=32`), so tokens assigned to experts 32-255 produced:
1. Wrong `expert_offsets` computation (tokens matched no local expert → zero counts for many experts)
2. Out-of-bounds scatter indices in `_assemble_scales_cudagraph_safe` (`dst_rows` exceeded `padded_x_sf` buffer size)
3. CUDA device-side assert → all subsequent CUDA calls fail with error 710
The layertest never hit this because it uses local expert IDs directly (no EP).
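To make the mismatch concrete, here is a minimal sketch. It assumes experts are split into contiguous, equal blocks across ranks (256 experts, EP=8, so 32 per rank), which the use of a single start index suggests but this log does not confirm. The helper names are hypothetical, and plain Python lists stand in for the tensors so the snippet runs without a GPU.

```python
NUM_GLOBAL_EXPERTS = 256
EP_SIZE = 8
NUM_LOCAL_EXPERTS = NUM_GLOBAL_EXPERTS // EP_SIZE  # 32 experts per rank


def local_range(rank):
    """Global ID range [start, end) owned by `rank` (assumes contiguous blocks)."""
    start = rank * NUM_LOCAL_EXPERTS
    return start, start + NUM_LOCAL_EXPERTS


def count_out_of_range(topk_ids):
    """Count IDs that would index past a buffer sized for local experts only."""
    return sum(1 for e in topk_ids if not 0 <= e < NUM_LOCAL_EXPERTS)


topk_ids = [3, 40, 255, 31, 128, 64]  # global IDs, as a router would emit them
print(local_range(2))                # (64, 96)
print(count_out_of_range(topk_ids))  # 4
```

Any nonzero count here corresponds to an index that a device-side bounds check would reject once the IDs are used to address per-expert buffers.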
**Fix:**
1. Added `experts_start_idx` param to `CuTeDSLMoERunner`
2. In `run()`: remap global→local via `local_ids = topk_ids - experts_start_idx`, mask non-local tokens with zero weight, clamp IDs to valid range
3. Pass `experts_start_idx` from `deepseek_v4.py` (which already stores it from EP setup)
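The remapping in step 2 can be sketched as follows. This is an illustration of the described logic (subtract, mask, clamp), not the runner's actual code: `remap_to_local` is a hypothetical name, and plain lists are used in place of tensors. In the runner the same steps would presumably be elementwise tensor operations, so they stay on the GPU.

```python
def remap_to_local(topk_ids, topk_weights, experts_start_idx, num_local_experts):
    """Return (local_ids, weights); non-local entries get zero weight and a clamped ID."""
    local_ids, weights = [], []
    for gid, w in zip(topk_ids, topk_weights):
        lid = gid - experts_start_idx
        is_local = 0 <= lid < num_local_experts
        # Non-local tokens keep a valid index but contribute nothing to the output.
        weights.append(w if is_local else 0.0)
        # Clamp so every ID can safely index buffers sized for local experts.
        local_ids.append(min(max(lid, 0), num_local_experts - 1))
    return local_ids, weights


# Rank 2 of 8 under the contiguous-block assumption: owns global experts 64..95.
ids, w = remap_to_local([64, 70, 3, 255], [0.5, 0.25, 0.125, 0.125], 64, 32)
print(ids)  # [0, 6, 0, 31]
print(w)    # [0.5, 0.25, 0.0, 0.0]
```

Clamping alone would silently route non-local tokens to experts 0 or 31, which is why the zero-weight mask is applied alongside it.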
### Bug 8b: `.cpu()` Sync Breaking Cudagraph Compatibility
**Symptom:** `sort_idx.cpu()` in `run()` — a CPU-GPU synchronization point that cudagraph cannot capture.
**Root cause:** `_token_indices` was kept on CPU to avoid CuTeDSL JIT GPU memory corruption (Bug 3). But cudagraph requires all ops to be GPU-only.
**Fix:**
1. Moved `_token_indices` to GPU
2. Added `_fill_token_indices()` method to refill the tensor after potential corruption
3. Added `_needs_token_refill` flag — set after `_ensure_stacked()` (weight JIT), checked/cleared after first `run()` call (GEMM JIT). After both JITs have fired, the tensor is stable.
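The flag lifecycle in step 3 can be sketched as below. The class is a hypothetical stand-in that reuses the method and attribute names from this log; a Python list replaces the GPU tensor, and the corruption is simulated by hand. The real method bodies are not shown in this document, so treat the details as assumptions.

```python
class TokenIndexState:
    """Hypothetical stand-in for the runner's `_token_indices` bookkeeping."""

    def __init__(self, max_tokens):
        self.max_tokens = max_tokens
        self._token_indices = list(range(max_tokens))
        self._needs_token_refill = False

    def _fill_token_indices(self):
        # Rewrite in place so the buffer object (and its address) is unchanged.
        self._token_indices[:] = range(self.max_tokens)

    def _ensure_stacked(self):
        # Weight JIT may have clobbered the buffer, so request a refill.
        self._needs_token_refill = True

    def run(self):
        # ... GEMM calls would go here; the first one triggers the GEMM JIT ...
        if self._needs_token_refill:
            self._fill_token_indices()
            self._needs_token_refill = False


state = TokenIndexState(4)
buf = state._token_indices
state._ensure_stacked()
state._token_indices[1] = -1  # simulate corruption caused by a JIT compile
state.run()
print(state._token_indices)  # [0, 1, 2, 3]
```

The refill writes into the existing buffer rather than allocating a new one. That matters for cudagraph use, because a captured graph replays against the memory addresses seen at capture time.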
---
## Debug Methodology — How We Got Here