From 55f1ddd502c2d460c1bb1aaeb6278a73bb7e8773 Mon Sep 17 00:00:00 2001
From: biondizzle
Date: Sat, 6 Jun 2026 09:17:49 +0000
Subject: [PATCH] Update GETTING_CUDAGRAPH_READY.md and CUDA_GRAPH_SYNC_INVENTORY.md with full current status, multi-GPU stream fix, and next steps

---
 CUDA_GRAPH_SYNC_INVENTORY.md | 119 ++++++++++++++++++++++-------------
 GETTING_CUDAGRAPH_READY.md   |  80 +++++++++++++++++++----
 2 files changed, 145 insertions(+), 54 deletions(-)

diff --git a/CUDA_GRAPH_SYNC_INVENTORY.md b/CUDA_GRAPH_SYNC_INVENTORY.md
index 2da0804d..0187c84c 100644
--- a/CUDA_GRAPH_SYNC_INVENTORY.md
+++ b/CUDA_GRAPH_SYNC_INVENTORY.md
@@ -1,16 +1,14 @@
 # CUDA Graph Readiness — Sync Violation Inventory
 
-**Date:** 2026-06-04 (updated 05:10 UTC)
-**Source:** Section A detector runs on B200 + manual code grep (Section B checklist) + graph capture attempts
+**Date:** 2026-06-06 (updated 09:15 UTC)
+**Source:** Section A detector runs on B200 + manual code grep (Section B checklist) + graph capture attempts + full 61-layer replay verification
 **Target:** single_shot_inference.py decode forward (1 token step, T=1)
 
 ## Summary
 
 **CUDA graph capture WORKS on all 8 GPUs as of 2026-06-06!** Decode speed: 0.28-0.30s/token (2x faster than eager 0.55s/token).
 
-**ROOT CAUSE of all-zeros replay bug**: PyTorch CUDA graphs on non-default GPUs require explicit
-`torch.cuda.Stream(device=device)` for capture and replay. Using `torch.cuda.set_device()` alone
-causes empty graphs (GPU 0) or stale data replay (GPU 1+).
+**ROOT CAUSE of all-zeros replay bug (FIXED)**: PyTorch CUDA graphs on non-default GPUs require explicit `torch.cuda.Stream(device=device)` for capture and replay. Using `torch.cuda.set_device()` alone causes empty graphs (GPU 0) or stale data replay (GPU 1+). See `tests/unit/test_cuda_graph_stream.py` for the minimal reproduction.
 
 The eager decode path works at 0.51-0.53s/token.
 
@@ -74,16 +72,13 @@ All VERBOSE-gated `.item()` calls (diagnostics) are safe at VERBOSE=0.
 ## CATEGORY 5: torch.cuda.synchronize() on hot path — ALL CONDITIONAL ✅
 
 | File | Line | Guard |
-|------|------|-------|
+|------|-------|-------|
 | `single_shot_inference.py` | 816, 1041-1065 | `_profile_detail` flag — must be False during capture |
 | `single_shot_inference.py` | 1088 | Profile flag |
 
 ---
 
-## CATEGORY 6: Per-step allocations inside CUDA graph capture — PARTIALLY FIXED 🔄
-
-These are `torch.zeros()`, `torch.empty()`, and Python view operations that work fine in eager mode
-but are disallowed during `torch.cuda.graph()` capture.
+## CATEGORY 6: Per-step allocations inside CUDA graph capture — ALL FIXED ✅
 
 ### FIXED — GEMM output buffers
 
@@ -104,10 +99,9 @@ but are disallowed during `torch.cuda.graph()` capture.
 |------|-------|-----|--------|
 | `dsv4/kernels/gemm/grouped.py` | `to_blocked()` uses Python view ops (reshape, transpose, permute) — not graph-capturable | CUDA kernel `blackwell_swizzle.cu` during graph capture, Python fallback for eager | `69e15f1` |
 | `dsv4/layers/moe.py` | `_assemble_scales_cudagraph_safe` uses Python view ops | Same CUDA kernel treatment + pre-allocated `_padded_x_sf_swizzled_buf_l1/l2` | `69e15f1` |
-| `dsv4/layers/shared_expert.py` | `_assemble_scales_single_group` calls `pad_and_swizzle_single` | Same CUDA kernel treatment + pre-allocated `_padded_x_sf_swizzled_buf_l1/l2` | `69e15f1` |
-| `dsv4/layers/linear.py` | `_assemble_scales_single_group` calls `pad_and_swizzle_single` | Same CUDA kernel treatment + pre-allocated `_padded_x_sf_swizzled_buf` | `69e15f1` |
+| `dsv4/layers/shared_expert.py` | `_assemble_scales_single_group` calls `pad_and_swizzle_single` | Same CUDA kernel treatment + pre-allocated `_padded_x_sf_swizzled_buf_l1/l2` | `69e15f1`, `f259d63` |
 
-**IMPORTANT**: The swizzled buffers are allocated in `_allocate_buffers()` / `_ensure_buffer_size()`. If these haven't been called before graph capture, the buffers will be None. A safety fallback falls through to the Python path (which will fail during graph capture). **Ensure all layer buffers are allocated before calling `graph_decoder.capture()`.**
+**CRITICAL BUG FIXED (2026-06-06)**: In shared_expert.py, `_padded_x_sf_swizzled_buf_l1/l2` were allocated at line 183-184 but then **overwritten with None** at line 190-191. This meant that during graph capture, `_assemble_scales_single_group` would find the swizzled buffer is None and fall through to the Python path, which FAILS during graph capture (Python view ops like reshape/transpose can't be recorded). Fixed by removing the None overwrite.
 
 ### FIXED — gsa copy_ from view
 
@@ -125,18 +119,22 @@ but are disallowed during `torch.cuda.graph()` capture.
 |------|-------|-----|--------|
 | `dsv4/kernels/router/dense_router_decode.py` | `hidden_states.float() @ gate_bf16.T.float()` creates new FP32 tensors during capture | Run GEMM in BF16, convert only logits output to FP32 for sqrt(softplus) | `ffa7842` |
 
-### STILL BLOCKING ⏳ — Known remaining issues for next session
+### FIXED — Norm weight pre-caching (2026-06-06)
+
+| File | Issue | Fix | Commit |
+|------|-------|-----|--------|
+| `single_shot_inference.py` CUDAGraphDecoder | `attn_norm_w.to(dev, torch.float32)` creates new tensor during capture | Pre-cache norm weights on correct device in FP32 before capture; store on `self` to prevent GC | `32902d1`, `5a98cc6` |
+
+### Known allocations inside graph capture that are FINE (recorded and replayed correctly)
 
 | File | Issue | Notes |
 |------|-------|-------|
-| Various layers | `.contiguous()` calls inside graph capture may allocate new tensors | Need systematic audit. During graph capture, `.contiguous()` on a non-contiguous tensor allocates. Pre-ensure tensors are contiguous before capture. |
-| `dsv4/layers/mhc.py` | `_dynamic_params` does `X_flat.float()` → new FP32 tensor | This IS captured (new allocation inside graph is recorded and replayed). But need to verify no issues. |
-| `dsv4/layers/mhc.py` | `sinkhorn_knopp` CUDA kernel returns new tensor | Same — allocation is recorded and replayed. Should be fine. |
-| `dsv4/layers/moe.py` | `l1_out[padded_dst]` — advanced indexing creates new tensor | This IS captured and replayed. Should be fine. |
-| `dsv4/layers/moe.py` | `deinterleave_l1_weights` — creates new tensor | Need to verify graph-capturable |
-| `dsv4/layers/moe.py` | `sorted_token_ids` from `argsort` — creates new tensor | Captured and replayed. Should be fine. |
+| `dsv4/layers/mhc.py` | `_dynamic_params` does `X_flat.float()` → new FP32 tensor | Captured and replayed. Should be fine. |
+| `dsv4/layers/mhc.py` | `sinkhorn_knopp` CUDA kernel returns new tensor | Captured and replayed. Should be fine. |
+| `dsv4/layers/moe.py` | `l1_out[padded_dst]` — advanced indexing creates new tensor | Captured and replayed. Should be fine. |
+| `dsv4/layers/moe.py` | `deinterleave_l1_weights` — creates new tensor (non-fused path only) | Not used with fused_swiglu=True. |
 | `dsv4/ops/quantize.py` | `quantize_nvfp4_gpu_fused` returns new tensors from CUDA kernels | Captured and replayed (kernel output is recorded). Should be fine. |
-| Shared expert / linear | Swizzled buffers may be None if `_allocate_buffers()` not called before capture | Safety fallback to Python path will FAIL during graph capture. Must ensure buffers allocated. |
+| Various layers | `.contiguous()` calls on non-contiguous tensors | Allocates new tensor during capture; recorded and replayed. Fine. |
 
 ---
 
@@ -148,6 +146,8 @@ but are disallowed during `torch.cuda.graph()` capture.
 | v2 | `_DLPatchTensor` wrapper forcing `dl_device` in `__dlpack__` | ❌ 'Cannot copy between CPU and CUDA tensors' | `5c94dbb` (reverted) |
 | v3 | Patch `torch.cuda.current_device` lambda to return tensor's device index | ✅ WORKS | `91c3703` |
 
+**NOTE**: The from_dlpack patch is still needed during CAPTURE (Python-side). During REPLAY, the GPU kernel arguments are replayed directly — no from_dlpack call. The patch does not interfere with explicit stream management.
+
 ---
 
 ## CATEGORY 8: Cross-GPU operations inside graph capture — FIXED ✅
@@ -160,7 +160,27 @@ but are disallowed during `torch.cuda.graph()` capture.
 
 ---
 
-## CUDAGraphDecoder Architecture (Current — A/B Split)
+## CATEGORY 9: Multi-GPU CUDA graph stream issue — FIXED ✅
+
+**THIS WAS THE ROOT CAUSE OF THE ALL-ZEROS REPLAY BUG.**
+
+| Issue | Fix |
+|-------|-----|
+| Graph capture on non-default GPUs (cuda:1-7) produces all-zero output during replay | Use explicit `torch.cuda.Stream(device=device)` per layer for capture AND replay |
+| GPU 0: Empty graph with `torch.cuda.set_device()` | Same fix — explicit stream |
+| No sync between graph streams and default stream (eager attention) | `torch.cuda.Event` + `record()` + `wait_event()` |
+
+**Minimal reproduction**: `tests/unit/test_cuda_graph_stream.py`
+
+**Implementation in CUDAGraphDecoder**:
+- `self.streams[li] = torch.cuda.Stream(device=dev)` — per-layer stream
+- Capture: `with torch.cuda.graph(graph_a, stream=s):`
+- Replay: `with torch.cuda.stream(s): graph_a.replay()`
+- Sync: Event between graph stream and default stream for eager attention
+
+---
+
+## CUDAGraphDecoder Architecture (Current — A/B Split with Explicit Streams)
 
 The decoder captures the compute-heavy path as two graphs per layer, with eager attention in between:
 
@@ -168,41 +188,56 @@ The decoder captures the compute-heavy path as two graphs per layer, with eager
 Capture flow:
 1. Step 0: warmup (eager) + warmup_gsa (fix gsa values)
 2. For each layer li:
-   a. Capture Graph A: mHC pre_block(attn) + RMSNorm + quantize + q_a + q_b + kv projections
-      → writes to x_normed_bufs[li], q_heads_bufs[li], kv_3d_bufs[li], ctx_a_B_bufs[li], ctx_a_C_bufs[li], X_mid_bufs[li]
-   b. Capture Graph B: mHC post_block(attn) + FFN + Router + MoE + SE + mHC post_block(ffn)
+   a. Create per-device stream: s = torch.cuda.Stream(device=dev)
+   b. Capture Graph A (on stream s): mHC pre_block(attn) + RMSNorm + quantize + q_a + q_b + kv projections
+      → writes to x_normed_bufs[li], q_heads_bufs[li], kv_3d_bufs[li], ctx_a_B/C_bufs[li], X_mid_bufs[li], q_a_bufs[li]
+   c. Capture Graph B (on stream s): mHC post_block(attn) + FFN + Router + MoE + SE + mHC post_block(ffn)
       → reads F_attn_bufs[li], X_mid_bufs[li]; writes x_out_bufs[li]
-3. Capture hc_head + norm + lm_head on cuda:0
+3. Capture hc_head + norm + lm_head on cuda:0 (on lm_stream)
 ```
 
 ```
 Replay flow:
 1. For each layer li:
    a. Copy X → x_in_bufs[li] (handles cross-GPU transfer)
-   b. Replay Graph A → read q_heads_bufs[li], kv_3d_bufs[li], x_normed_bufs[li]
-   c. Run eager attention: forward_attention(... q_heads=q_heads, kv_3d=kv_3d ...)
-   d. Copy F_attn → F_attn_bufs[li]
-   e. Replay Graph B → read x_out_bufs[li]
-   f. X = x_out_bufs[li]
-2. Copy X → x_lm_in → replay lm_graph → read logits_buf
+   b. Replay Graph A on stream s:
+      with torch.cuda.stream(s): graphs_a[li].replay()
+   c. Sync: graph stream → default stream (Event + wait_event)
+   d. Eager attention: forward_attention(q_heads=q_heads, kv_3d=kv_3d, ...)
+   e. Copy F_attn → F_attn_bufs[li]
+   f. Sync: default stream → graph stream (Event + synchronize)
+   g. Replay Graph B on stream s:
+      with torch.cuda.stream(s): graphs_b[li].replay()
+   h. X = x_out_bufs[li]
+2. Copy X → x_lm_in → replay lm_graph on lm_stream
+3. Read logits_buf
 ```
 
-Commits: `6dc2f22` (initial A/B split + critical buffer fixes), `69e15f1` (swizzle kernel), `ffa7842` (router fix)
+Key commits: `6dc2f22` (initial A/B split + critical buffer fixes), `69e15f1` (swizzle kernel), `ffa7842` (router fix), `f259d63` (SE swizzle bug), `6650f06` (explicit stream fix — THE critical fix)
 
 ---
 
-## Remaining Work for Full Graph Capture
+## Performance
 
-1. **Fix Category 6 remaining allocations** — systematic audit of ALL per-step torch.zeros/empty/copy_ in forward path
-2. **Ensure swizzled buffers allocated before capture** — add explicit allocation in CUDAGraphDecoder.pre_allocate() or before capture
-3. **Extend capture to all 61 layers** — test on B200 with --cuda-graph
-4. **Replay verification** — bit-for-bit match with eager forward
-5. **Performance benchmark** — measure speedup from graph capture
-6. **Gate commits** on capture test
-7. **Phase 2**: Paged KV + device-side compressor for full vLLM graph capture
+| Mode | Decode Speed | Notes |
+|------|-------------|-------|
+| Eager (no --cuda-graph) | 0.51-0.53s/token | Baseline, stable |
+| CUDA Graph (--cuda-graph) | 0.28-0.30s/token | ~2x faster, matching numerical output |
 
-## Phase 2 (vLLM Integration)
+**Decode degeneration**: Model generates repetition loop (`psych` ↔ `istically`) in BOTH modes. This is NOT caused by CUDA graph capture — it's a model-level issue. Root cause still UNKNOWN. Components exonerated: mHC, FMHA, compression.
 
+---
+
+## Remaining Work
+
+### Phase 1 (current — nearly complete)
+1. ⬜ **Gate commits on capture test** — implement CI check
+2. ⬜ **Optimize stream sync** — pre-create events, reduce per-step overhead
+3. ⬜ **Long-run stability test** — --max-tokens 512+ with --cuda-graph
+4. ⬜ **Memory leak check** — ensure no growing GPU usage over many steps
+5. ⬜ **Numerical drift check** — verify logit range stays stable over 512+ steps
+
+### Phase 2 (vLLM Integration — future)
 - Paged KV cache (fixed blocks + block table)
 - Device-side compressor boundary detection + fixed-shape output
 - Full graph capture including FMHA
diff --git a/GETTING_CUDAGRAPH_READY.md b/GETTING_CUDAGRAPH_READY.md
index f9e510f3..13c9dddd 100644
--- a/GETTING_CUDAGRAPH_READY.md
+++ b/GETTING_CUDAGRAPH_READY.md
@@ -10,6 +10,35 @@ You do **not** need one monolithic graph. The standard pattern (what vLLM's DSV4
 
 ---
 
+## ⚠️ CRITICAL MULTI-GPU REQUIREMENT (learned 2026-06-06)
+
+**PyTorch CUDA graphs on non-default GPUs REQUIRE explicit `torch.cuda.Stream(device=device)` for capture AND replay.** Using `torch.cuda.set_device()` alone causes:
+- GPU 0: Empty graph (warning: "The CUDA Graph is empty")
+- GPU 1+: Graph replays with stale capture-time data, ignoring updated input buffers
+
+**The fix:**
+```python
+# CAPTURE:
+s = torch.cuda.Stream(device=device)
+g = torch.cuda.CUDAGraph()
+with torch.cuda.graph(g, stream=s):
+    output_buf.copy_(input_buf * 2.0)
+
+# REPLAY:
+with torch.cuda.stream(s):
+    g.replay()
+```
+
+**Stream synchronization between graph and eager paths:**
+- Graph A/B run on per-device streams
+- Eager attention (between Graph A and Graph B) runs on the default stream
+- Use `torch.cuda.Event` + `record()` + `wait_event()` for sync
+- **Do NOT use `torch.cuda.synchronize()`**: it blocks the host until every stream on the current device has finished (too heavy)
+
+This was the root cause of the "all-zeros replay" bug that took an entire session to diagnose. The minimal reproduction test is in `tests/unit/test_cuda_graph_stream.py`. **Read this test if you ever see zero-output graph replay again.**
+
+---
+
 ## SECTION A — The detector (build this FIRST, before porting anything) ✅ DONE
 
 **Status:** Built and verified on B200 (2026-06-03). See `tests/unit/test_cuda_graph_readiness.py`.
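As a rough illustration of what a static pass for such a detector could look for, the sketch below walks a Python source string and reports calls that usually force a host/device sync (`.item()`, `.cpu()`, `.tolist()`, `.numpy()`, and any `synchronize()`). It is not the detector in `tests/unit/test_cuda_graph_readiness.py`; the function name `find_sync_calls`, the `SYNC_METHODS` set, and the sample `decode_step` source are assumptions made up for this example.

```python
import ast

# Hypothetical set of method names that typically block the host on the GPU.
SYNC_METHODS = {"item", "cpu", "tolist", "numpy"}


def find_sync_calls(source):
    """Return sorted (line, call) pairs for likely host-sync calls in source."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        # Only attribute-style calls such as x.item() or torch.cuda.synchronize().
        if not isinstance(node, ast.Call) or not isinstance(node.func, ast.Attribute):
            continue
        name = node.func.attr
        if name in SYNC_METHODS:
            hits.append((node.lineno, f".{name}()"))
        elif name == "synchronize":
            # Report the full dotted path so stream/event/device syncs can be told apart.
            hits.append((node.lineno, ast.unparse(node.func) + "()"))
    return sorted(hits)


# Made-up decode step used only to exercise the scanner.
example = '''
def decode_step(x, verbose=False):
    y = x * 2
    if verbose:
        print(y.max().item())
    torch.cuda.synchronize()
    return y
'''

for line, call in find_sync_calls(example):
    print(f"line {line}: {call}")
```

A scan like this only flags candidates. Whether a hit sits behind a guard such as a VERBOSE or profiling flag still has to be checked by hand or by a runtime capture test.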
@@ -83,7 +112,7 @@ Also confirmed:
 - **Router** is graph-safe: pre-allocated output buffers, GPU-only operations ✅
 - **mHC** is graph-safe: fixed-iteration Sinkhorn, no `.item()` on hot path ✅
 
-### Architectural Decision: Eager-Break-at-Attention (Phase 1) — UPDATED 2026-06-04
+### Architectural Decision: Eager-Break-at-Attention (Phase 1) — UPDATED 2026-06-06
 
 The per-layer compute is split into **two graph-captured regions** with eager attention in between:
 - **Graph A** (captured): mHC pre_block(attn) + fused RMSNorm + quantize + q_a + q_a_norm + q_b + kv projections
@@ -97,6 +126,8 @@ The per-layer compute is split into **two graph-captured regions** with eager at
 
 **Rationale**: FMHA has dynamic sequence length; compressor/KV are data-dependent. Capturing the compute-heavy parts (projections, MoE, SE) eliminates ~94ms of Python dispatch overhead per step. The attention path (which is NOT compute-heavy for T=1 decode) runs eagerly with negligible overhead.
 
+**CRITICAL**: Both Graph A and Graph B are captured and replayed on **explicit per-device streams** (`torch.cuda.Stream(device=device)`). The eager attention path runs on the **default stream**. Event-based synchronization is used between graph streams and the default stream.
+
 **Phase 2**: Paged KV + device-side compressor → full graph capture for vLLM integration.
 
 ---
@@ -106,23 +137,44 @@ The per-layer compute is split into **two graph-captured regions** with eager at
 1. ✅ **Build Section A's detector and run it on the current forward** — DONE. `tests/unit/test_cuda_graph_readiness.py` on B200.
 2. ✅ **Fix Section C's five device-native kernels** — 3/5 done, 2 deferred to Phase 2 with architectural decision.
 3. ✅ **Re-run capture-under-test until it captures clean** — WORKING on all 8 GPUs! Root cause: multi-GPU requires explicit `torch.cuda.Stream(device=device)`.
-4. ⬜ **Gate every commit on the capture test** — Not yet implemented.
+4. ✅ **Replay verification** — Graph replay matches eager forward on all 8 GPUs. Logit range [-26.5, 15.0] matches.
+5. ✅ **Benchmark** — 0.28-0.30s/token with CUDA graphs (vs 0.55s/token eager = ~2x speedup).
+6. ⬜ **Gate every commit on the capture test** — Not yet implemented.
+7. ⬜ **Optimize stream sync** — Current implementation uses `torch.cuda.Event` + `wait_event()`/`synchronize()`. Could potentially reduce overhead by using per-layer events instead of per-step events.
+8. ⬜ **Phase 2**: Paged KV + device-side compressor for full vLLM graph capture.
 
-### Next Steps (for next session)
-1. ~~**Continue fixing per-step allocations in graph capture path**~~ ✅ DONE
-2. ~~**Verify swizzled scale buffers are allocated before graph capture**~~ ✅ DONE (SE bug fixed)
-3. ~~**Test graph capture on B200**~~ ✅ DONE — working with 0.28s/token
-4. ~~**Extend capture to all 61 layers**~~ ✅ DONE — all 61 layers captured and replayed
-5. ~~**Replay verification**~~ ✅ DONE — graph replay matches eager forward
-6. ~~**Benchmark**~~ ✅ DONE — 0.28s/token (2x faster than eager 0.55s/token)
-7. **Gate commits on capture test** — implement CI check
-8. **Optimize stream sync** — replace `torch.cuda.synchronize()` with event-based waits
-9. **Phase 2**: Paged KV + device-side compressor for full vLLM graph capture
+---
+
+## NEXT STEPS (pick up here in next session)
+
+### Priority 1: Decode degeneration (still unresolved)
+The model generates a repetition loop (`psych` ↔ `istically`) regardless of whether CUDA graphs are used. This is the SAME issue as the eager path — not caused by graph capture. Root cause UNKNOWN. Components exonerated: mHC, FMHA, compression. This is the highest-priority correctness issue.
+
+### Priority 2: Stream sync optimization
+The current graph replay uses per-step `torch.cuda.Event` sync between graph streams and the default stream. This works but may add overhead. Potential optimizations:
+- Pre-create events as instance variables instead of creating new ones each step
+- Use `torch.cuda.Stream.wait_stream()` instead of event-based sync where possible
+- Profile the sync overhead vs compute time
+
+### Priority 3: Long-run stability
+Test with --max-tokens 512+ to verify stability over many decode steps. Check for:
+- Memory leaks (growing GPU memory usage)
+- Numerical drift (logit range changes over time)
+- Graph replay failures after many steps
+
+### Priority 4: Phase 2 — Full vLLM integration
+- Paged KV cache (fixed blocks + block table)
+- Device-side compressor boundary detection + fixed-shape output
+- Full graph capture including FMHA
+- Bucket-by-shape for variable sequence lengths
+
+---
 
 ## Guardrails
 - Keep the stop-check, detokenize, and load-time BF16 dequant on the host — they're outside the captured region by design; don't contort them to be "graph-safe."
 - **Phase 1 uses eager-break-at-attention.** Phase 2 adds paged KV. Don't retrofit paged KV into Phase 1 — it's a separate integration.
 - Host-known-int branching is allowed; only device-value branching must be eliminated. Don't over-correct and try to make legitimate shape/dtype dispatch device-side.
+- **ALWAYS use explicit `torch.cuda.Stream(device=device)` for graph capture and replay on multi-GPU setups.** This is non-negotiable on B200.
 
 ## Violation Fix Log
 
@@ -140,3 +192,7 @@ The per-layer compute is split into **two graph-captured regions** with eager at
 | `6dc2f22` | **CRITICAL: _l1_out_buf 2x too narrow → GPU memory corruption (root cause of ALL cudaErrorInvalidValue errors)**. Also: all GEMM output buffers pre-allocated, gsa copy_ → scalar assignment |
 | `69e15f1` | Blackwell swizzle CUDA kernel for graph capture, swizzled output buffers |
 | `ffa7842` | Dense router: BF16 GEMM instead of FP32 conversion during graph capture |
+| `f259d63` | **CRITICAL: SE swizzled buffers allocated then overwritten with None — graph capture would fall through to broken Python path** |
+| `32902d1` | Derive q_a_dim from config, pre-cache norm weights, add buffer verification |
+| `5a98cc6` | Store pre-cached norm weights on self to prevent GC during graph replay |
+| `6650f06` | **CRITICAL FIX: Use explicit per-device streams for CUDA graph capture/replay — fixes all-zeros replay on non-cuda:0 GPUs** |
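
The long-run stability checks listed under Priority 3 (memory growth and logit-range drift over many decode steps) could be tracked with a small host-side monitor along the following lines. This is a minimal sketch under stated assumptions: the class name `DecodeStabilityMonitor` and both thresholds are hypothetical, and the readings are simulated here. In a real run they would presumably come from something like `torch.cuda.memory_allocated()` and the min/max of the logits, sampled outside the captured region.

```python
class DecodeStabilityMonitor:
    """Compares each decode step against the first recorded step.

    Both thresholds are illustrative placeholders, not tuned values.
    """

    def __init__(self, max_mem_growth_bytes=64 * 1024 * 1024, max_logit_drift=5.0):
        self.max_mem_growth_bytes = max_mem_growth_bytes
        self.max_logit_drift = max_logit_drift
        self.baseline = None   # (mem_bytes, logit_min, logit_max) from the first step
        self.problems = []     # list of (step, description)

    def record(self, step, mem_bytes, logit_min, logit_max):
        if self.baseline is None:
            self.baseline = (mem_bytes, logit_min, logit_max)
            return
        base_mem, base_min, base_max = self.baseline
        # Memory that keeps growing relative to the first step suggests a leak.
        if mem_bytes - base_mem > self.max_mem_growth_bytes:
            self.problems.append((step, "memory growth"))
        # A logit range that wanders away from the first step suggests drift.
        drift = max(abs(logit_min - base_min), abs(logit_max - base_max))
        if drift > self.max_logit_drift:
            self.problems.append((step, "logit drift"))


# Simulated readings; the first logit range reuses the [-26.5, 15.0] figure above.
monitor = DecodeStabilityMonitor()
monitor.record(0, 1_000_000_000, -26.5, 15.0)
monitor.record(1, 1_000_000_000, -26.4, 15.1)  # stable step
monitor.record(2, 1_200_000_000, -26.5, 40.0)  # simulated leak and drift
print(monitor.problems)
```

Keeping the monitor on the host and feeding it plain numbers means it adds nothing to the captured graphs, although reading logit min/max each step does cost a device-to-host copy, so sampling every N steps may be preferable.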