Update CURRENT_BUG.md with Bug 8 (global→local expert ID) and Bug 8b (.cpu() sync)
@@ -1,18 +1,18 @@
# Current Bug: CuTeDSLMoERunner — Status & Debug History
## Current Status (May 17, 2026 08:35 UTC)
## Current Status (May 17, 2026 09:01 UTC)
**Mostly fixed. 0.97 cosine with warmup gs. Ready for vLLM container test.**
**Bug 8 fixed. Ready for vLLM container test.**
- ✅ `layertest.py` — 0.988 cosine
- ✅ `cudagraph_test.py` — capture + replay works
- ✅ `test_warmup_gs.py` — 0.97 cosine with `compute_activation_global_scales()` warmup
- ✅ `test_warmup_gs.py` — warmup gs computation works (test script has a pre-existing NameError in safety margin section, not a runner bug)
- ❌ vLLM server — not yet tested with these fixes
**Remaining concerns:**
- The CPU-based `_token_indices` uses `sort_idx.cpu()` which is a CPU-GPU sync — may interfere with cudagraph capture
- The `compute_activation_global_scales()` warmup needs to be called from `deepseek_v4.py` during model warmup
- The checkpoint `input_scale` should NOT be used as the activation global_scale (it's a calibration value, not a runtime value)
**Fixed in this round:**
- Bug 8: Global→local expert ID remapping (was causing CUDA_ERROR_ASSERT)
- Removed `.cpu()` sync from `run()` — `_token_indices` now on GPU, cudagraph-safe
- Added `_needs_token_refill` flag to handle CuTeDSL JIT GPU memory corruption after first GEMM call
---
@@ -94,6 +94,33 @@ sorted_token_ids = token_indices[sort_idx.cpu()].to(device)
**Fix:** Separate `_padded_x_sf_buf_l1` and `_padded_x_sf_buf_l2`, plus separate `_per_expert_scale_bufs_l1` and `_per_expert_scale_bufs_l2`.
### Bug 8: Global→Local Expert ID Mismatch — CUDA_ERROR_ASSERT
**Symptom:** `IndexKernel.cu:111` assertion `-sizes[i] <= index && index < sizes[i]` failed, cascading into CUDA_ERROR_ASSERT (710) across all workers. The vLLM server crashed on the first inference request.
**Root cause:** With expert parallelism (EP=8), `topk_ids` contains **global** expert IDs (0-255), but `CuTeDSLMoERunner` treated them as **local** IDs (0-31). Each rank only owns 32 experts (`num_experts=32`), so tokens assigned to experts 32-255 produced:
1. Wrong `expert_offsets` computation (tokens matched no local expert → zero counts for many experts)
2. Out-of-bounds scatter indices in `_assemble_scales_cudagraph_safe` (`dst_rows` exceeded `padded_x_sf` buffer size)
3. CUDA device-side assert → all subsequent CUDA calls fail with error 710
The layertest never hit this because it uses local expert IDs directly (no EP).
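A small framework-free sketch of the arithmetic behind this root cause. It assumes experts are split into contiguous blocks of 32 per rank (consistent with the `experts_start_idx` fix, but not confirmed here); `owned_range` and `count_tokens_per_local_expert` are hypothetical helpers, and the `topk_ids` values are made up.

```python
NUM_GLOBAL_EXPERTS = 256  # global ID range 0-255
EP_SIZE = 8               # expert-parallel ranks
NUM_LOCAL_EXPERTS = NUM_GLOBAL_EXPERTS // EP_SIZE  # 32 experts per rank


def owned_range(rank):
    """Half-open range [start, end) of global expert IDs owned by a rank.

    Assumes a contiguous block partition (an assumption of this sketch).
    """
    start = rank * NUM_LOCAL_EXPERTS
    return start, start + NUM_LOCAL_EXPERTS


def count_tokens_per_local_expert(ids):
    """Count routed slots per expert while treating ids as local (the buggy assumption)."""
    counts = [0] * NUM_LOCAL_EXPERTS
    unmatched = 0
    for expert_id in ids:
        if 0 <= expert_id < NUM_LOCAL_EXPERTS:
            counts[expert_id] += 1
        else:
            unmatched += 1  # no local expert matches this ID
    return counts, unmatched


topk_ids = [3, 40, 97, 255, 31, 128]  # global IDs, illustrative only
counts, unmatched = count_tokens_per_local_expert(topk_ids)
print(owned_range(1))          # (32, 64)
print(sum(counts), unmatched)  # 2 4
```

Four of the six routed slots match no local expert, which is the "zero counts" effect in item 1. Using those unmatched IDs as indices is what leads to the out-of-bounds scatter in item 2.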
**Fix:**
1. Added `experts_start_idx` param to `CuTeDSLMoERunner`
2. In `run()`: remap global→local via `local_ids = topk_ids - experts_start_idx`, mask non-local tokens with zero weight, clamp IDs to valid range
3. Pass `experts_start_idx` from `deepseek_v4.py` (which already stores it from EP setup)
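The remap in step 2 can be sketched in plain Python as follows. This is an illustration of the subtract, mask, and clamp logic only, not the runner's actual code: `remap_to_local` is a hypothetical helper, lists stand in for GPU tensors, and the real `run()` presumably does the same thing with elementwise tensor ops.

```python
def remap_to_local(topk_ids, topk_weights, experts_start_idx, num_local_experts):
    """Remap global expert IDs to local ones, zeroing weights for non-local slots."""
    local_ids = []
    local_weights = []
    for global_id, weight in zip(topk_ids, topk_weights):
        local_id = global_id - experts_start_idx
        is_local = 0 <= local_id < num_local_experts
        # Clamp so the ID is always a valid index, even for non-local slots.
        local_id = min(max(local_id, 0), num_local_experts - 1)
        local_ids.append(local_id)
        # Non-local slots keep a valid ID but contribute nothing to the output.
        local_weights.append(weight if is_local else 0.0)
    return local_ids, local_weights


# Rank 1 under the assumed layout: owns global experts 32-63.
ids, weights = remap_to_local([3, 40, 97, 63], [0.5, 0.2, 0.2, 0.1], 32, 32)
print(ids)      # [0, 8, 31, 31]
print(weights)  # [0.0, 0.2, 0.0, 0.1]
```

Clamping without masking would not be enough on its own. The clamped slots for experts 3 and 97 land on local experts 0 and 31, so their weights must be zero to keep them out of the result.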
### Bug 8b: `.cpu()` Sync Breaking Cudagraph Compatibility
**Symptom:** `run()` called `sort_idx.cpu()`, a CPU-GPU synchronization point that cannot be captured in a CUDA graph.
**Root cause:** `_token_indices` was kept on CPU to avoid CuTeDSL JIT GPU memory corruption (Bug 3), so indexing it required `sort_idx.cpu()`. CUDA graph capture cannot include a host synchronization, so every op in `run()` must stay on the GPU.
**Fix:**
1. Moved `_token_indices` to GPU
2. Added `_fill_token_indices()` method to refill the tensor after potential corruption
3. Added `_needs_token_refill` flag — set after `_ensure_stacked()` (weight JIT), checked/cleared after first `run()` call (GEMM JIT). After both JITs have fired, the tensor is stable.
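The flag lifecycle in step 3 can be modeled with a toy class. This is a hedged sketch, not the real `CuTeDSLMoERunner`: the class name, `_clobber()`, and `_gemm_jit_done` are hypothetical, a Python list stands in for the GPU tensor, and the corruption is simulated. It also assumes the refill happens inside `run()` right after the first GEMM, which the notes above do not spell out.

```python
class TokenIndexSketch:
    """Toy model of the refill-flag lifecycle (illustrative only)."""

    def __init__(self, max_tokens):
        self._max_tokens = max_tokens
        self._token_indices = []
        self._needs_token_refill = False
        self._gemm_jit_done = False  # hypothetical: tracks the one-time GEMM JIT
        self._fill_token_indices()

    def _fill_token_indices(self):
        # Rewrite the buffer with 0..max_tokens-1.
        self._token_indices = list(range(self._max_tokens))

    def _clobber(self):
        # Stand-in for a JIT compile overwriting the buffer (simulated).
        self._token_indices = [-1] * self._max_tokens

    def _ensure_stacked(self):
        self._clobber()                  # weight JIT fires
        self._needs_token_refill = True  # flag is set after the weight JIT

    def run(self):
        if not self._gemm_jit_done:
            self._clobber()              # GEMM JIT fires on the first call only
            self._gemm_jit_done = True
        if self._needs_token_refill:
            self._fill_token_indices()   # repair after both JITs have fired
            self._needs_token_refill = False
        return list(self._token_indices)


runner = TokenIndexSketch(4)
runner._ensure_stacked()
first = runner.run()
second = runner.run()
print(first, second, runner._needs_token_refill)  # [0, 1, 2, 3] [0, 1, 2, 3] False
```

After the first `run()`, the flag stays cleared and later calls do no refill work, so steady-state calls contain only GPU ops and are safe to capture.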
---
## Debug Methodology — How We Got Here