Update docs: CUDA graph capture WORKING on all 8 GPUs, 0.28s/token (2x eager)
@@ -6,13 +6,17 @@

## Summary

**All sync violations in the compute forward path have been fixed.** The eager decode path works at 0.51-0.53s/token.
**CUDA graph capture WORKS on all 8 GPUs as of 2026-06-06!** Decode speed: 0.28-0.30s/token (about 2x faster than eager decode at 0.55s/token).
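As a quick sanity check on the "2x" figure, the speedup can be recomputed from the timings quoted in this summary. This is only an arithmetic sketch; the `speedup` helper and the constant names are illustrative and are not part of the repository.

```python
def speedup(eager_s_per_token: float, graph_s_per_token: float) -> float:
    """Return how many times faster graph replay is than eager decode."""
    return eager_s_per_token / graph_s_per_token


# Timings quoted in this summary, in seconds per token.
EAGER_RANGE = (0.51, 0.53)
GRAPH_RANGE = (0.28, 0.30)

# Pessimistic pairing: fastest eager run against slowest graph run.
worst_case = speedup(EAGER_RANGE[0], GRAPH_RANGE[1])
# Optimistic pairing: slowest eager run against fastest graph run.
best_case = speedup(EAGER_RANGE[1], GRAPH_RANGE[0])
# The pairing used in the headline sentence (0.55 eager vs 0.28 graph).
headline = speedup(0.55, 0.28)

print(f"range: {worst_case:.2f}x to {best_case:.2f}x, headline: {headline:.2f}x")
```

With the 0.51-0.53s/token eager range the speedup comes out between roughly 1.7x and 1.9x; the 0.55s/token eager figure gives just under 2x.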

CUDA graph capture with the A/B split architecture is partially working. Graph A/B capture has been attempted on B200 with the `--cuda-graph` flag. Multiple per-step allocation issues have been found and fixed, but full 61-layer capture is NOT YET WORKING.
**ROOT CAUSE of all-zeros replay bug**: PyTorch CUDA graphs on non-default GPUs require explicit
`torch.cuda.Stream(device=device)` for capture and replay. Using `torch.cuda.set_device()` alone
causes empty graphs (GPU 0) or stale data replay (GPU 1+).
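
The pattern described above can be sketched as follows. This is a minimal illustration, not the project's actual capture code: `capture_on_device` and `replay_on_device` are hypothetical helper names, and the toy lambda stands in for the real per-layer forward. It only uses public PyTorch calls (`torch.cuda.Stream`, `torch.cuda.CUDAGraph`, `torch.cuda.graph`).

```python
import torch


def capture_on_device(fn, static_input, device, warmup_iters=3):
    """Capture fn(static_input) into a CUDA graph on an explicit stream for `device`."""
    stream = torch.cuda.Stream(device=device)  # explicit per-device stream
    with torch.cuda.device(device):
        # Warm up on the side stream so lazy initialisation happens outside capture.
        stream.wait_stream(torch.cuda.current_stream(device))
        with torch.cuda.stream(stream):
            for _ in range(warmup_iters):
                fn(static_input)
        torch.cuda.current_stream(device).wait_stream(stream)

        graph = torch.cuda.CUDAGraph()
        # Capture on the same explicit stream that will later be used for replay.
        with torch.cuda.graph(graph, stream=stream):
            static_output = fn(static_input)
    return graph, static_output, stream


def replay_on_device(graph, stream, device):
    """Replay a captured graph on its own stream, after pending input writes."""
    with torch.cuda.device(device):
        # Make the replay wait for writes issued on the device's current stream.
        stream.wait_stream(torch.cuda.current_stream(device))
        with torch.cuda.stream(stream):
            graph.replay()
        stream.synchronize()  # host-side wait, used here only so the demo can read the result


if torch.cuda.is_available():
    # Pick the last visible GPU so the demo exercises a non-default device when one exists.
    dev = torch.device("cuda", torch.cuda.device_count() - 1)
    x = torch.zeros(8, device=dev)
    graph, out, stream = capture_on_device(lambda t: t * 2 + 1, x, dev)
    x.copy_(torch.arange(8, dtype=x.dtype, device=dev))  # refill the static input in place
    replay_on_device(graph, stream, dev)
    print(out.cpu())
```

The static input and output tensors are reused across replays; new data is copied into the input in place rather than allocating a fresh tensor, since the graph replays against fixed addresses.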

The eager decode path works at 0.51-0.53s/token.

- **Method 1** (sync debug): 0 violations in forward compute. The `dec_tid_buf.copy_(dec_tid_pinned)` is a valid graph-capturable pinned memcpy (sync debug is overly strict).
- **Method 2** (L0 graph capture): **PASS** ✅ (from detector test, pre-A/B split)
- **Multi-layer A/B capture**: 🔄 IN PROGRESS — multiple per-step allocation issues found and partially fixed
- **Multi-layer A/B capture**: ✅ WORKING on all 8 GPUs (with explicit stream fix)

---
@@ -105,18 +105,19 @@ The per-layer compute is split into **two graph-captured regions** with eager at
1. ✅ **Build Section A's detector and run it on the current forward** — DONE. `tests/unit/test_cuda_graph_readiness.py` on B200.
2. ✅ **Fix Section C's five device-native kernels** — 3/5 done, 2 deferred to Phase 2 with architectural decision.
3. 🔄 **Re-run capture-under-test until it captures clean** — Graph A/B split architecture implemented. Graph capture attempted on B200. Multiple per-step allocation issues found and fixed (see CUDA_GRAPH_SYNC_INVENTORY.md). Still not fully capturing all 61 layers.
3. ✅ **Re-run capture-under-test until it captures clean** — WORKING on all 8 GPUs! Root cause of the earlier failures: multi-GPU capture requires an explicit `torch.cuda.Stream(device=device)`.
4. ⬜ **Gate every commit on the capture test** — Not yet implemented.

### Next Steps (for next session)
1. **Continue fixing per-step allocations in graph capture path** — the main blocker
2. **Verify swizzled scale buffers are allocated before graph capture** — some paths hit None
3. **Test graph capture on B200** — `fire_b200_test single_shot_inference.py kernel-test /tmp/kernel-test.log 1800 -- --max-tokens 30 --cuda-graph`
4. **Extend capture to all 61 layers** once per-step allocation issues are resolved
5. **Replay verification** — bit-for-bit match with eager forward
6. **Benchmark** — measure speedup from graph capture vs eager decode (0.51-0.53s/token)
7. **Gate commits** on capture test
8. Phase 2: paged KV + device-side compressor for full vLLM graph capture
1. ~~**Continue fixing per-step allocations in graph capture path**~~ ✅ DONE
2. ~~**Verify swizzled scale buffers are allocated before graph capture**~~ ✅ DONE (SE bug fixed)
3. ~~**Test graph capture on B200**~~ ✅ DONE — working with 0.28s/token
4. ~~**Extend capture to all 61 layers**~~ ✅ DONE — all 61 layers captured and replayed
5. ~~**Replay verification**~~ ✅ DONE — graph replay matches eager forward
6. ~~**Benchmark**~~ ✅ DONE — 0.28s/token (2x faster than eager 0.55s/token)
7. **Gate commits on capture test** — implement CI check
8. **Optimize stream sync** — replace `torch.cuda.synchronize()` with event-based waits
9. **Phase 2**: Paged KV + device-side compressor for full vLLM graph capture
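
For the stream-sync item in the list above, one possible shape of an event-based wait is sketched below. It is an assumption about how the replacement might look, not the project's code: `handoff_with_event` and the toy tensors are hypothetical, while `torch.cuda.Event`, `Event.record`, and `Stream.wait_event` are standard PyTorch calls.

```python
import torch


def handoff_with_event(producer, consumer, device):
    """Order consumer-stream work after producer-stream work without a device-wide sync."""
    done = torch.cuda.Event()
    with torch.cuda.device(device):
        with torch.cuda.stream(producer):
            x = torch.ones(1024, device=device) * 2
            done.record(producer)  # marks the point the consumer must wait for
        # GPU-side dependency: the host thread does not block here.
        consumer.wait_event(done)
        with torch.cuda.stream(consumer):
            y = x + 1
    # Host-side wait on one stream only, so the demo can read y safely.
    consumer.synchronize()
    return y


if torch.cuda.is_available():
    dev = torch.device("cuda", 0)
    result = handoff_with_event(
        torch.cuda.Stream(device=dev), torch.cuda.Stream(device=dev), dev
    )
    print(result[:4].cpu())
```

Unlike `torch.cuda.synchronize()`, which stalls the host until all queued work on the device finishes, the event only orders the two streams involved, so unrelated work can keep running.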
## Guardrails
- Keep the stop-check, detokenize, and load-time BF16 dequant on the host — they're outside the captured region by design; don't contort them to be "graph-safe."