# Current Bug: CuTeDSLMoERunner — Status & Debug History
## Current Status (May 17, 2026 17:54 UTC)
**Build #11 in progress (includes swiglu_limit fix). Previous builds produced empty/invisible token output.**
- ✅ `layertest.py`: 0.988 cosine
- ✅ `cudagraph_test.py`: capture + replay works
- ✅ vLLM container starts, loads weights, warmup gs computed, cudagraph capture succeeds
- ❌ Model output was empty content (30 invisible tokens) — **swiglu_limit fix not yet tested in container**
**Latest fix: Missing swiglu_limit=10.0 activation clamping (Bug 25).** DeepSeek-V4 uses `SiluAndMulWithClamp(10.0)` which clamps `silu(gate)` to max 10.0 and `up` to [-10, 10]. Our runner was doing plain `F.silu(gate) * up` without clamping. Large gate values → unbounded SiLU output → corrupted L2 GEMM input → garbage logits → model outputs BOS/thinking tokens.
**vLLM launch config:**
```
--gpu_memory_utilization=0.9
--compilation-config='{"cudagraph_mode": "FULL_DECODE_ONLY", "custom_ops": ["all"], "cudagraph_capture_sizes": [1, 2, 4, 8], "max_cudagraph_capture_size": 8}'
```
---
## Bugs Found & Fixed
### Bug 1: Scale Assembly — Global vs Per-Expert Swizzle
**Fix:** Two-phase scatter + per-expert swizzle.
### Bug 2: `searchsorted(right=False)`
**Fix:** Changed to `right=True`.
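
The effect of the `right` flag can be reproduced on the CPU with the standard-library `bisect` module, whose `bisect_left` and `bisect_right` follow the same boundary rules as `torch.searchsorted` with `right=False` and `right=True`. This is only a sketch: it assumes the lookup maps a row index to its expert through cumulative end offsets, and the offsets and the helper `expert_for_row` are made up for illustration.

```python
from bisect import bisect_left, bisect_right

# Hypothetical cumulative end offsets: expert 0 owns rows 0-2,
# expert 1 owns rows 3-4, expert 2 owns rows 5-8.
expert_ends = [3, 5, 9]

def expert_for_row(row, right):
    """Return the index of the first offset past `row` (right=True)
    or the first offset not below `row` (right=False)."""
    find = bisect_right if right else bisect_left
    return find(expert_ends, row)

rows = list(range(9))
with_left = [expert_for_row(r, right=False) for r in rows]
with_right = [expert_for_row(r, right=True) for r in rows]

print(with_left)   # rows 3 and 5 sit on a boundary and land one expert too early
print(with_right)  # boundary rows go to the expert that starts there
```

Rows that fall exactly on an offset boundary are the only ones that differ between the two modes, which is why the bug shows up only for the first row of each expert.
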
### Bug 3: CuTeDSL `cute.compile` GPU Memory Corruption — CRITICAL
**Symptom:** `_token_indices` all zeros after JIT.
**Root cause:** `cute.compile` corrupts GPU memory.
**Fix:** `_fill_token_indices()` builds on CPU, copies to GPU. `_needs_token_refill` flag.
### Bug 4: `expert_offsets` With Leading 0
**Fix:** Pass `expert_offsets[1:]` to GEMM.
### Bug 5: Checkpoint `input_scale` Wrong for Runtime gs
**Root cause:** The checkpoint `input_scale` is a calibration value; at runtime it yields a too-small gs → block scale overflow.
**Fix:** `compute_activation_global_scales()` warmup method.
### Bug 6: L1/L2 Need Separate gs
**Fix:** Compute L2 gs from L1 output after SiLU*up.
### Bug 7: L1/L2 Need Separate Scale Buffers
**Fix:** Separate `_padded_x_sf_buf_l1`/`_l2`, separate per-expert bufs.
### Bug 8: Global→Local Expert ID Mismatch — CUDA_ERROR_ASSERT
**Symptom:** `IndexKernel.cu:111` OOB, cascading CUDA_ERROR_ASSERT (710).
**Root cause:** `topk_ids` contains global IDs (0-255), runner treated as local.
**Fix:** `experts_start_idx`, remap global→local, mask non-local tokens.
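
A minimal CPU-side sketch of the remap-and-mask step, using plain lists instead of GPU tensors. The function name `remap_to_local`, the placeholder ID 0 for non-local slots, and the sample IDs are assumptions for illustration, not the runner's actual code.

```python
def remap_to_local(topk_ids, experts_start_idx, num_local_experts):
    """Map global expert IDs to local ones and flag IDs owned by other ranks."""
    local_ids, is_local = [], []
    for gid in topk_ids:
        lid = gid - experts_start_idx
        owned = 0 <= lid < num_local_experts
        # Non-local slots get an in-range placeholder ID so indexing stays
        # in bounds; the mask tells later stages to ignore them.
        local_ids.append(lid if owned else 0)
        is_local.append(owned)
    return local_ids, is_local

# Hypothetical rank that owns global experts 32..63.
local_ids, is_local = remap_to_local(
    [5, 32, 63, 64, 200], experts_start_idx=32, num_local_experts=32
)
print(local_ids)  # local IDs, with 0 as the placeholder for foreign experts
print(is_local)   # True only for experts this rank owns
```

Without the range check, the global ID 200 would be used directly as an index into a 32-entry table, which is the kind of out-of-bounds access the `IndexKernel.cu` assert reports.
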
### Bug 8b: `.cpu()` Sync Breaking Cudagraph
**Fix:** `_token_indices` on GPU, `_fill_token_indices()` CPU→GPU copy.
### Bugs 9–11: Buffer Sizing and Swizzle Layout
See previous versions for details.
### Bug 12: `torch.full()` During Cudagraph Capture
**Symptom:** `cudaErrorStreamCaptureUnsupported`.
**Fix:** Pre-allocated buffers, `.fill_()` instead of `torch.full()`.
### Bug 13: Warmup Passed Global Expert IDs
**Fix:** Pass local IDs (0..num_experts-1).
### Bug 14: GEMM Scale Layout Mismatch — Fixed 128-Row vs Variable
**Symptom:** BOS token repeat (garbage logits).
**Root cause:** Scale assembly at `e*128`, GEMM reads by real expert_offsets. Expert with 500 tokens → GEMM reads 500 scale rows but only 128 have data.
**Fix:** Variable padded expert offsets, scatter into real padded positions.
### Bug 15: OOM — Per-Layer Padded Buffers (4.3 GB)
**Root cause:** 72 MB × 60 layers = 4.3 GB. Not enough room for KV cache.
**Fix:** Shared buffers (Bug 21).
### Bug 16: `padded_max_slots` Mismatch
**Fix:** Size for `num_experts * max_chunks * 128`.
### Bug 17: Shape Mismatch (49152 vs 3072)
**Root cause:** Capping `max_num_tokens` at 512 made the buffers too small for the 8192-token warmup.
**Fix:** Reverted cap, use shared buffers.
### Bugs 18–20: Cudagraph Capture Failures (dynamic allocs, variable loops, GPU scalars)
**Fix:** Pre-allocate everything, fixed loop counts, Python constants for offsets.
### Bug 21: OOM — Shared Padded Buffers
**Fix:** Class-level shared buffers dict keyed by device. `padded_hidden`, `padded_activated`, `padded_xsf_l1`/`l2`, `output` all shared. ~150 MB total instead of ~4.3 GB.
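
The class-level registry idea can be sketched as follows. A `bytearray` stands in for a GPU tensor, and the names `SharedBuffers` and `get` are hypothetical. The sketch also assumes every layer requests the same size, so it never reallocates.

```python
class SharedBuffers:
    """Class-level registry: one set of named buffers per device."""
    _by_device = {}

    @classmethod
    def get(cls, device, name, nbytes):
        buffers = cls._by_device.setdefault(device, {})
        if name not in buffers:
            # Allocate once; every later caller on this device reuses it.
            buffers[name] = bytearray(nbytes)  # stand-in for a GPU tensor
        return buffers[name]

# 60 "layers" asking for the same buffer all receive the same object.
layer_bufs = [SharedBuffers.get("cuda:0", "padded_hidden", 1024) for _ in range(60)]
print(all(buf is layer_bufs[0] for buf in layer_bufs))

# Per-layer cost from Bug 15 for comparison: 72 MB per layer, 60 layers.
per_layer_total_gb = 72 * 60 / 1000
print(per_layer_total_gb)
```

Sharing is only safe because layers run one after another, so no two layers need the buffer contents at the same time.
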
### Bug 22: Token Dropping via `clamped_local`
**Symptom:** Garbage model output (empty/invisible tokens).
**Root cause:** `local_row.clamp(max=max_rows_per_expert-1)` silently dropped tokens when an expert got more than `max_chunks*128` tokens. `max_chunks` was computed as average (ceil(total_slots / (num_experts*128))), not worst-case. MoE routing is uneven — some experts get 200+ tokens while others get 10.
**Fix:** Use real padded expert offsets (variable per expert, padded to 128). No clamping needed — each expert gets exactly the space it needs.
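
The gap between average-case and worst-case sizing can be shown with a few lines of arithmetic. The routing counts below are invented, and `average_max_chunks` and `lost_to_clamp` are hypothetical helper names. Only the `ceil(total_slots / (num_experts*128))` formula comes from the description above.

```python
def average_max_chunks(total_slots, num_experts, chunk=128):
    """Average-case sizing: ceil(total_slots / (num_experts * chunk))."""
    return -(-total_slots // (num_experts * chunk))

def lost_to_clamp(tokens_per_expert, max_chunks, chunk=128):
    """Tokens per expert that exceed the fixed capacity and collide on the last row."""
    capacity = max_chunks * chunk
    return [max(0, n - capacity) for n in tokens_per_expert]

# Hypothetical uneven routing: one hot expert, three cold ones (256 slots total).
tokens_per_expert = [200, 10, 10, 36]
max_chunks = average_max_chunks(sum(tokens_per_expert), len(tokens_per_expert))
lost = lost_to_clamp(tokens_per_expert, max_chunks)

print(max_chunks)  # capacity in 128-row chunks under average-case sizing
print(lost)        # tokens the hot expert cannot store
```

With variable padded offsets the hot expert simply receives 256 rows and the cold ones 128 each, so nothing is lost.
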
### Bug 23: cudaErrorStreamCaptureUnsupported from Dynamic GPU Slicing
**Symptom:** All 8 workers fail during cudagraph capture.
**Root cause:** `buf[:total_padded_slots]` where `total_padded_slots` is a GPU scalar — dynamic tensor slicing with a GPU index is a CUDA operation not permitted during stream capture.
**Fix:** Use full pre-allocated buffers, no dynamic GPU slicing. Pass `x_sf[:num_slots]` (Python int) to scale assembly.
### Bug 24: Scale Assembly `.cpu().tolist()` Breaks Cudagraph
**Symptom:** `cudaErrorStreamCaptureInvalidated` during capture.
**Root cause:** Per-expert Python loops with GPU-derived offsets required `.cpu().tolist()` for slicing — CPU-GPU sync invalidates stream capture.
**Fix:** Full-buffer Blackwell 32_4_4 swizzle. Apply `to_blocked` transform to entire `padded_x_sf` buffer at once. No CPU syncs, no Python loops. The buffer is already 128-row aligned per expert and 4-col aligned, so the full-buffer swizzle produces the correct layout. GEMM reads `scale_a` using `padded_expert_offsets`, matching the scatter layout.
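
The claim that one full-buffer swizzle matches per-expert swizzles when every expert is 128-row aligned can be checked on the CPU. The sketch below assumes the 32_4_4 layout is the 128×4 tile arrangement used for block-scaled GEMM scale factors (within a tile, element `(r, c)` goes to `(r % 32) * 16 + (r // 32) * 4 + c`). `swizzle_32_4_4` and `make_rows` are illustrative helpers, not the runner's `to_blocked`.

```python
def swizzle_32_4_4(mat):
    """Rearrange a (rows x cols) scale matrix into 128x4 tiles of 512 entries."""
    rows, cols = len(mat), len(mat[0])
    assert rows % 128 == 0 and cols % 4 == 0, "buffer must already be aligned"
    n_col_blocks = cols // 4
    out = [0] * (rows * cols)
    for r in range(rows):
        for c in range(cols):
            tile = (r // 128) * n_col_blocks + (c // 4)  # row-block-major tile order
            r_in = r % 128
            offset = (r_in % 32) * 16 + (r_in // 32) * 4 + (c % 4)
            out[tile * 512 + offset] = mat[r][c]
    return out

def make_rows(start, n_rows, n_cols):
    """Build a matrix of distinct values so misplaced entries are detectable."""
    return [[start + r * n_cols + c for c in range(n_cols)] for r in range(n_rows)]

expert_a = make_rows(0, 128, 8)        # expert padded to 128 rows
expert_b = make_rows(10_000, 256, 8)   # expert padded to 256 rows

full = swizzle_32_4_4(expert_a + expert_b)
per_expert = swizzle_32_4_4(expert_a) + swizzle_32_4_4(expert_b)
print(full == per_expert)
```

Because tiles are laid out row-block first and each tile covers exactly 128 rows, an expert boundary on a multiple of 128 never splits a tile, so no per-expert loop is needed.
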
### Bug 25: Missing `swiglu_limit=10.0` Activation Clamping — LIKELY CAUSE OF GARBAGE OUTPUT
**Symptom:** Model generates 30 tokens of empty/invisible content (BOS or thinking token). Not meaningful text.
**Root cause:** DeepSeek-V4 uses `SiluAndMulWithClamp(10.0)` which:
- Clamps `silu(gate)` to max 10.0
- Clamps `up` to [-10.0, 10.0]
Our runner did plain `F.silu(gate) * up` without clamping. Large gate values produce unbounded SiLU output (silu(20) ≈ 20, silu(50) ≈ 50). These large values get multiplied by the up projection, producing activations with amax >> 10. This:
1. Corrupts the L2 GEMM input (quantized with wrong gs)
2. Produces garbage L2 output
3. Final logits are wrong → model collapses to most frequent token (BOS)
**Fix:** Added `set_swiglu_limit(limit)` to runner. In `run()`, apply clamping:
```python
gate_silu = F.silu(gate)
if self._swiglu_limit is not None:
    # Clamp both factors before multiplying.
    gate_silu = gate_silu.clamp(max=self._swiglu_limit)
    up = up.clamp(min=-self._swiglu_limit, max=self._swiglu_limit)
activated = gate_silu * up
```
Called from `deepseek_v4.py` after warmup: `self._cutedsl_runner.set_swiglu_limit(float(self.swiglu_limit))`.
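
To make the magnitudes concrete, here is a scalar version of the same clamping in plain Python. It is a sketch for checking the arithmetic, not the runner's tensor code. The helper names and sample inputs are chosen for illustration.

```python
import math

def silu(x):
    """SiLU activation: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))

def swiglu(gate, up, limit=None):
    """SiLU(gate) * up, optionally clamping both factors as described above."""
    g = silu(gate)
    if limit is not None:
        g = min(g, limit)                   # silu(gate) capped at +limit
        up = max(-limit, min(up, limit))    # up clamped to [-limit, limit]
    return g * up

unclamped = swiglu(20.0, 15.0)              # silu(20) is about 20, so about 300
clamped = swiglu(20.0, 15.0, limit=10.0)    # both factors capped at 10
small = swiglu(1.0, 2.0, limit=10.0)        # in-range values pass through unchanged

print(unclamped, clamped, small)
```
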
---
## Current Architecture: Variable Padded Expert Offsets
```
Each expert padded to next multiple of 128 tokens.
padded_expert_offsets computed from real tokens_per_expert (GPU).
Scatter: padded_dst = padded_expert_offsets[expert_assign] + local_row
GEMM input: padded_hidden (full pre-allocated buffer, not sliced)
GEMM offsets: padded_expert_offsets[1:] (GPU tensor)
GEMM output: full buffer size; extract via l1_out[padded_dst]
Scale assembly:
  Phase 1: Scatter x_sf into padded_x_sf at padded_expert_offsets
  Phase 2: Full-buffer Blackwell 32_4_4 swizzle (no CPU syncs)
  Zero CPU syncs, zero Python loops
Activation:
  SiLU(gate) clamped to swiglu_limit (10.0)
  up clamped to [-swiglu_limit, swiglu_limit]
  activated = clamped_silu * clamped_up
Shared buffers (class-level, ~150 MB total):
  padded_hidden, padded_activated, padded_xsf_l1, padded_xsf_l2, output
```
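
The offset and scatter arithmetic above can be traced with plain Python lists. The token counts and assignments are invented, and the real runner computes these as GPU tensors. The sketch only illustrates the formula `padded_dst = padded_expert_offsets[expert_assign] + local_row`.

```python
def padded_expert_offsets(tokens_per_expert, align=128):
    """Cumulative start offsets with each expert rounded up to a multiple of `align`."""
    offsets = [0]
    for n in tokens_per_expert:
        padded = -(-n // align) * align  # ceiling to the next multiple of align
        offsets.append(offsets[-1] + padded)
    return offsets

# Hypothetical routing: 500, 10 and 130 tokens pad to 512, 128 and 256 rows.
tokens_per_expert = [500, 10, 130]
offsets = padded_expert_offsets(tokens_per_expert)

# Scatter destinations for a few (expert, row-within-expert) pairs.
expert_assign = [0, 0, 1, 2, 2]
local_row = [0, 499, 9, 0, 129]
padded_dst = [offsets[e] + r for e, r in zip(expert_assign, local_row)]

# The GEMM takes end offsets only, so the leading 0 is dropped.
gemm_offsets = offsets[1:]

print(offsets, padded_dst, gemm_offsets)
```

Every destination stays inside its own expert's padded range, so no clamping is involved and the scale rows line up with the rows the GEMM reads.
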
### Cudagraph Constraints (All Resolved)
- No `.item()`, `.cpu()`, `.tolist()`
- No `torch.zeros/ones/full/empty/arange()` during capture — pre-allocate everything
- No dynamic GPU slicing (`buf[:gpu_scalar]`) — use full buffers
- No Python loops with GPU-derived values — full-buffer ops instead
- No `torch.full()` — pre-allocated `.fill_()`
- Shared buffers OK (layers sequential during capture and replay)
- `F.silu().clamp()` and `.clamp()` are GPU ops — cudagraph-safe ✅
### EP Configuration (DeepSeek-V4-Pro on 8×B200)
- 256 total experts, top_k=6, swiglu_limit=10.0
- EP=8 → logs show 48 local experts per rank, even though `n_routed_experts / ep_size` = 256 / 8 = 32 would be expected
- `experts_start_idx` = rank × 32
- `max_num_tokens` = 8192
- `max_chunks_per_expert` = ceil(8192 × 6 / (48 × 128)) = 8
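
The sizing numbers in this list can be re-derived with integer arithmetic. The sketch evaluates the chunk formula for both local-expert counts mentioned above (48 from the logs, 32 from the division) without taking a position on which is right. `ceil_div` is a helper defined here.

```python
def ceil_div(a, b):
    """Integer ceiling division."""
    return -(-a // b)

max_num_tokens, top_k, chunk = 8192, 6, 128
total_slots = max_num_tokens * top_k               # one slot per (token, expert) pair

chunks_if_48 = ceil_div(total_slots, 48 * chunk)   # value used in the list above
chunks_if_32 = ceil_div(total_slots, 32 * chunk)   # value if each rank held 32 experts

# experts_start_idx per rank, following the "rank x 32" rule above.
start_indices = [rank * 32 for rank in range(8)]

print(total_slots, chunks_if_48, chunks_if_32, start_indices)
```
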
---
## Shared Expert Path (verified correct)
```
DeepseekV4MoE.forward():
  1. gate → fused_topk_bias → topk_weights, topk_ids
  2. self.experts(hidden_states, topk_weights, topk_ids) → routed_output
  3. EP all-reduce across ranks
  4. self.shared_experts(hidden_states) → shared_output
  5. final = routed_output + shared_output
```
- Shared experts: `DeepseekV4MLP` (not NVFP4, uses standard quantization)
- `routed_scaling_factor`: Applied in `fused_topk_bias` to topk_weights ✅
- `renormalize`: Top-k weights normalized to sum to 1 ✅
- `scoring_func=sqrtsoftplus`: Applied in routing ✅
---
## Test Files
| File | Purpose |
|------|---------|
| `tests/layertest.py` | Reference vs runner, 3 experts. Must pass ≥0.98 cosine. |
| `tests/cudagraph_test.py` | Cudagraph capture + replay. Must pass. |
**Run order after any code change:**
1. `python3 tests/layertest.py` — must pass
2. `python3 tests/cudagraph_test.py` — must pass
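
For reference, the ≥0.98 cosine threshold used by `tests/layertest.py` corresponds to a check like the one below. This is a plain-Python sketch with made-up vectors, and the actual test presumably compares full output tensors.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical reference output and a slightly perturbed runner output.
reference = [1.0, 2.0, 3.0, 4.0]
runner = [1.1, 1.9, 3.2, 3.9]

score = cosine_similarity(reference, runner)
passed = score >= 0.98  # threshold from the test table above

print(round(score, 4), passed)
```
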
---
## Repo Info
- **Kernel:** `sweetapi.com/biondizzle/nvfp4-megamoe-kernel` (master)
- **Local:** `~/dev/nvfp4-megamoe-kernel/`
- **B200:** `/root/nvfp4-megamoe-kernel/`
- **Model:** `/root/nvidia-meeting/DeepSeek-V4-Pro-NVFP4` (read-only)
- **Never edit on B200 directly** — edit locally → commit → push → pull on B200