Update docs: CUDA graph capture WORKING on all 8 GPUs, 0.28s/token (2x eager)

2026-06-06 08:29:40 +00:00
parent 6650f06121
commit ac213bdee8
2 changed files with 17 additions and 12 deletions
--- a/CUDA_GRAPH_SYNC_INVENTORY.md
+++ b/CUDA_GRAPH_SYNC_INVENTORY.md
@@ -6,13 +6,17 @@

 ## Summary

-**All sync violations in the compute forward path have been fixed.** The eager decode path works at 0.51-0.53s/token.
+**CUDA graph capture WORKS on all 8 GPUs as of 2026-06-06!** Decode speed: 0.28-0.30s/token (2x faster than eager 0.55s/token).

-CUDA graph capture with A/B split architecture is partially working. Graph A/B capture has been attempted on B200 with `--cuda-graph` flag. Multiple per-step allocation issues have been found and fixed, but full 61-layer capture is NOT YET WORKING.
+**ROOT CAUSE of all-zeros replay bug**: PyTorch CUDA graphs on non-default GPUs require explicit
+`torch.cuda.Stream(device=device)` for capture and replay. Using `torch.cuda.set_device()` alone
+causes empty graphs (GPU 0) or stale data replay (GPU 1+).
+
+The eager decode path works at 0.51-0.53s/token.

 - **Method 1** (sync debug): 0 violations in forward compute. The `dec_tid_buf.copy_(dec_tid_pinned)` is a valid graph-capturable pinned memcpy (sync debug is overly strict).
 - **Method 2** (L0 graph capture): **PASS** ✅ (from detector test, pre-A/B split)
- **Multi-layer A/B capture**: 🔄 IN PROGRESS — multiple per-step allocation issues found and partially fixed
+- **Multi-layer A/B capture**: ✅ WORKING on all 8 GPUs (with explicit stream fix)

 ---

--- a/GETTING_CUDAGRAPH_READY.md
+++ b/GETTING_CUDAGRAPH_READY.md
@@ -105,18 +105,19 @@ The per-layer compute is split into **two graph-captured regions** with eager at

 1. ✅ **Build Section A's detector and run it on the current forward** — DONE. `tests/unit/test_cuda_graph_readiness.py` on B200.
 2. ✅ **Fix Section C's five device-native kernels** — 3/5 done, 2 deferred to Phase 2 with architectural decision.
-3. 🔄 **Re-run capture-under-test until it captures clean** — Graph A/B split architecture implemented. Graph capture attempted on B200. Multiple per-step allocation issues found and fixed (see CUDA_GRAPH_SYNC_INVENTORY.md). Still not fully capturing all 61 layers.
+3. ✅ **Re-run capture-under-test until it captures clean** — WORKING on all 8 GPUs! Root cause: multi-GPU requires explicit `torch.cuda.Stream(device=device)`.
 4. ⬜ **Gate every commit on the capture test** — Not yet implemented.

 ### Next Steps (for next session)
-1. **Continue fixing per-step allocations in graph capture path** — the main blocker
-2. **Verify swizzled scale buffers are allocated before graph capture** — some paths hit None
-3. **Test graph capture on B200** — `fire_b200_test single_shot_inference.py kernel-test /tmp/kernel-test.log 1800 -- --max-tokens 30 --cuda-graph`
-4. **Extend capture to all 61 layers** once per-step allocation issues are resolved
-5. **Replay verification** — bit-for-bit match with eager forward
-6. **Benchmark** — measure speedup from graph capture vs eager decode (0.51-0.53s/token)
-7. **Gate commits** on capture test
-8. Phase 2: paged KV + device-side compressor for full vLLM graph capture
+1. ~~**Continue fixing per-step allocations in graph capture path**~~ ✅ DONE
+2. ~~**Verify swizzled scale buffers are allocated before graph capture**~~ ✅ DONE (SE bug fixed)
+3. ~~**Test graph capture on B200**~~ ✅ DONE — working with 0.28s/token
+4. ~~**Extend capture to all 61 layers**~~ ✅ DONE — all 61 layers captured and replayed
+5. ~~**Replay verification**~~ ✅ DONE — graph replay matches eager forward
+6. ~~**Benchmark**~~ ✅ DONE — 0.28s/token (2x faster than eager 0.55s/token)
+7. **Gate commits on capture test** — implement CI check
+8. **Optimize stream sync** — replace `torch.cuda.synchronize()` with event-based waits
+9. **Phase 2**: Paged KV + device-side compressor for full vLLM graph capture

 ## Guardrails
 - Keep the stop-check, detokenize, and load-time BF16 dequant on the host — they're outside the captured region by design; don't contort them to be "graph-safe."