Update GETTING_CUDAGRAPH_READY.md and CUDA_GRAPH_SYNC_INVENTORY.md with full current status, multi-GPU stream fix, and next steps

Update docs: CUDA graph capture WORKING on all 8 GPUs, 0.28s/token (2x eager)
CRITICAL FIX: Use explicit per-device streams for CUDA graph capture/replay on multi-GPU — fixes zero-output bug
2026-06-06 09:17:49 +00:00 · 2026-06-06 08:29:40 +00:00 · 2026-06-06 08:18:18 +00:00 · 2026-06-06 08:14:29 +00:00 · 2026-06-06 08:13:18 +00:00 · 2026-06-06 08:02:35 +00:00
20 changed files with 2369 additions and 285 deletions
--- a/CUDA_GRAPH_SYNC_INVENTORY.md
+++ b/CUDA_GRAPH_SYNC_INVENTORY.md
@@ -0,0 +1,244 @@
+# CUDA Graph Readiness — Sync Violation Inventory
+
+**Date:** 2026-06-06 (updated 09:15 UTC)
+**Source:** Section A detector runs on B200 + manual code grep (Section B checklist) + graph capture attempts + full 61-layer replay verification
+**Target:** single_shot_inference.py decode forward (1 token step, T=1)
+
+## Summary
+
+**CUDA graph capture WORKS on all 8 GPUs as of 2026-06-06!** Decode speed: 0.28-0.30s/token (2x faster than eager 0.55s/token).
+
+**ROOT CAUSE of all-zeros replay bug (FIXED)**: PyTorch CUDA graphs on non-default GPUs require explicit `torch.cuda.Stream(device=device)` for capture and replay. Using `torch.cuda.set_device()` alone causes empty graphs (GPU 0) or stale data replay (GPU 1+). See `tests/unit/test_cuda_graph_stream.py` for the minimal reproduction.
+
+The eager decode path works at 0.51-0.53s/token.
+
+- **Method 1** (sync debug): 0 violations in forward compute. The `dec_tid_buf.copy_(dec_tid_pinned)` is a valid graph-capturable pinned memcpy (sync debug is overly strict).
+- **Method 2** (L0 graph capture): **PASS** ✅ (from detector test, pre-A/B split)
+- **Multi-layer A/B capture**: ✅ WORKING on all 8 GPUs (with explicit stream fix)
+
+---
+
+## CATEGORY 1: Explicit `.item()` syncs on hot path — ALL FIXED ✅
+
+| File | Line | Fix | Commit |
+|------|------|-----|--------|
+| `dsv4/layers/mhc.py` | 422 | Removed `X_next.abs().max().item()` (122 syncs/step) | `a9ea303` |
+| `single_shot_inference.py` | ~1600 | Warmup-gsa `.item()` — one-time, outside graph | OK (by design) |
+| `single_shot_inference.py` | ~1642 | `argmax(logits).item()` — outside graph (sampling) | OK (by design) |
+
+All VERBOSE-gated `.item()` calls (diagnostics) are safe at VERBOSE=0.
+
+---
+
+## CATEGORY 2: Per-step tensor allocations — ALL FIXED ✅
+
+| File | Line | Fix | Commit |
+|------|------|-----|--------|
+| `dsv4/layers/linear.py` | 128 | Pre-allocated `_scale_a_buf` | `a9ea303` |
+| `dsv4/layers/shared_expert.py` | 213 | Same fix — pre-allocated `padded_x_sf_buf` + view | `a9ea303`, `e07d798` |
+| `dsv4/layers/grouped_linear.py` | 240 | Pre-allocated `_scale_a_buf` | `f13a81d` |
+| `dsv4/layers/grouped_linear.py` | ~374 | Pre-allocated `_output_buf` | `0ca7bed` |
+| `dsv4/layers/moe.py` | ~508 | `torch.full` → `self._l1_gsa_buf.fill_()` | `84655d0` |
+| `dsv4/ops/quantize.py` | 84,88 | `torch.zeros_like` → scalar `0.0` | `f13a81d` |
+| `dsv4/ops/quantize.py` | 327-329 | gsa: reshape for M=1, contiguous for M>1 | `80bb27f` |
+| `dsv4/layers/mhc.py` | init_state | `out_buf` parameter for in-place write | `46a3a51` |
+| `single_shot_inference.py` | ~1600 | Pre-allocated `dec_X_buf` | `46a3a51` |
+
+---
+
+## CATEGORY 3: Data-dependent control flow — FIXED / DEFERRED
+
+| File | Issue | Status | Fix |
+|------|-------|--------|-----|
+| `single_shot_inference.py` | `dec_tid_buf[0] = python_int` | ✅ FIXED | Pinned CPU buffer + `copy_` | `0ca7bed` |
+| `dsv4/layers/grouped_linear.py` | `expert_offsets[g] = python_int` | ✅ FIXED | Pre-allocated range tensor + element-wise multiply | `0ca7bed` |
+| `dsv4/layers/grouped_linear.py` | `if group_offsets[0] != 0` | ✅ FIXED | Unconditional GPU-only update | `df05289` |
+| `dsv4/layers/moe.py` | `torch.bincount` (data-dependent shapes) | ✅ FIXED | `scatter_add_` into pre-allocated buffer | `84655d0`, `518a1d3` |
+| `single_shot_inference.py` | Compressor returns `None` | ⏳ Phase 2 | Eager-break-at-attention: compressor runs outside graph |
+| `single_shot_inference.py` | KV `n_comp` Python int | ⏳ Phase 2 | Eager-break: attention runs outside graph |
+
+---
+
+## CATEGORY 4: Cross-GPU transfers inside graph — ADDRESSED ✅
+
+| File | Issue | Fix |
+|------|-------|-----|
+| `single_shot_inference.py` | `X.to(f"cuda:{gpu}")` in layer loop | Per-GPU X buffers + cross-GPU memcpy outside graph, or capture per-GPU subgraphs |
+| `single_shot_inference.py` | `positions.to(rope_cos.device)` | Per-GPU `dec_pos_per_gpu`/`dec_tid32_per_gpu` buffers | `56b816a` |
+| `single_shot_inference.py` | `token_id.to(x.device)` in moe_forward | Per-GPU dec_tid32_per_gpu buffers |
+
+---
+
+## CATEGORY 5: torch.cuda.synchronize() on hot path — ALL CONDITIONAL ✅
+
+| File | Line | Guard |
+|------|-------|-------|
+| `single_shot_inference.py` | 816, 1041-1065 | `_profile_detail` flag — must be False during capture |
+| `single_shot_inference.py` | 1088 | Profile flag |
+
+---
+
+## CATEGORY 6: Per-step allocations inside CUDA graph capture — ALL FIXED ✅
+
+### FIXED — GEMM output buffers
+
+| File | Issue | Fix | Commit |
+|------|-------|-----|--------|
+| `dsv4/ops/gemm_runner.py:189` | `torch.zeros()` in `run_nvfp4_grouped_gemm` | Pre-allocated `out` parameter | `188ecae` |
+| `dsv4/ops/gemm_runner.py:433` | `torch.zeros()` in `run_fused_swiglu_grouped_gemm` | Pre-allocated `out` parameter | `188ecae` |
+| `dsv4/layers/grouped_linear.py` | No pre-allocated GEMM output buffer | Pre-allocated `_output_buf` | `b32713c`, `f57de06` |
+| `dsv4/layers/moe.py` | No pre-allocated L1 output buffer | Pre-allocated `_l1_out_buf` (2*intermediate_size) | `6dc2f22` |
+| `dsv4/layers/shared_expert.py` | No pre-allocated L1 output buffer | Pre-allocated `_l1_out_buf` (2*intermediate_size) | `6dc2f22` |
+| `dsv4/layers/moe.py` | No pre-allocated L2 output buffer | Pre-allocated `_l2_out_buf` | `6dc2f22` |
+| `dsv4/layers/shared_expert.py` | No pre-allocated L2 output buffer | Pre-allocated `_l2_out_buf` | `6dc2f22` |
+| `dsv4/layers/linear.py` | No pre-allocated GEMM output buffer | Pre-allocated `_gemm_out_buf` | `6dc2f22` |
+
+### FIXED — Blackwell 32_4_4 scale swizzle
+
+| File | Issue | Fix | Commit |
+|------|-------|-----|--------|
+| `dsv4/kernels/gemm/grouped.py` | `to_blocked()` uses Python view ops (reshape, transpose, permute) — not graph-capturable | CUDA kernel `blackwell_swizzle.cu` during graph capture, Python fallback for eager | `69e15f1` |
+| `dsv4/layers/moe.py` | `_assemble_scales_cudagraph_safe` uses Python view ops | Same CUDA kernel treatment + pre-allocated `_padded_x_sf_swizzled_buf_l1/l2` | `69e15f1` |
+| `dsv4/layers/shared_expert.py` | `_assemble_scales_single_group` calls `pad_and_swizzle_single` | Same CUDA kernel treatment + pre-allocated `_padded_x_sf_swizzled_buf_l1/l2` | `69e15f1`, `f259d63` |
+
+**CRITICAL BUG FIXED (2026-06-06)**: In shared_expert.py, `_padded_x_sf_swizzled_buf_l1/l2` were allocated at line 183-184 but then **overwritten with None** at line 190-191. This meant that during graph capture, `_assemble_scales_single_group` would find the swizzled buffer is None and fall through to the Python path, which FAILS during graph capture (Python view ops like reshape/transpose can't be recorded). Fixed by removing the None overwrite.
+
+### FIXED — gsa copy_ from view
+
+| File | Issue | Fix | Commit |
+|------|-------|-----|--------|
+| `dsv4/layers/shared_expert.py` | `_l1_gsa_buf.copy_(gsa_l1_gpu[:1].reshape(1))` | `self._l1_gsa_buf[0] = gsa_l1_gpu[0]` | `6dc2f22` |
+| `dsv4/layers/shared_expert.py` | `_l2_gsa_buf.copy_(gsa_l2_gpu[:1].reshape(1))` | `self._l2_gsa_buf[0] = gsa_l2_gpu[0]` | `6dc2f22` |
+| `dsv4/layers/moe.py` | Same pattern for L1 and L2 gsa | Same scalar assignment fix | `6dc2f22` |
+| `dsv4/layers/linear.py` | `_gsa_buf.copy_(gsa[:1].reshape(1))` and `gsa.max().reshape(1)` | `self._gsa_buf[0] = gsa_gpu[0]` / `self._gsa_buf[0] = quant.gsa.max()` | `6dc2f22` |
+| `dsv4/layers/grouped_linear.py` | `_gsa_buf[:1].copy_()` + `_gsa_buf[1:].copy_(expand(...))` | `self._gsa_buf[0] = gsa_gpu[0]` + `self._gsa_buf[1:] = self._gsa_buf[0]` | `6dc2f22` |
+
+### FIXED — Router gate FP32 conversion
+
+| File | Issue | Fix | Commit |
+|------|-------|-----|--------|
+| `dsv4/kernels/router/dense_router_decode.py` | `hidden_states.float() @ gate_bf16.T.float()` creates new FP32 tensors during capture | Run GEMM in BF16, convert only logits output to FP32 for sqrt(softplus) | `ffa7842` |
+
+### FIXED — Norm weight pre-caching (2026-06-06)
+
+| File | Issue | Fix | Commit |
+|------|-------|-----|--------|
+| `single_shot_inference.py` CUDAGraphDecoder | `attn_norm_w.to(dev, torch.float32)` creates new tensor during capture | Pre-cache norm weights on correct device in FP32 before capture; store on `self` to prevent GC | `32902d1`, `5a98cc6` |
+
+### Known allocations inside graph capture that are FINE (recorded and replayed correctly)
+
+| File | Issue | Notes |
+|------|-------|-------|
+| `dsv4/layers/mhc.py` | `_dynamic_params` does `X_flat.float()` → new FP32 tensor | Captured and replayed. Should be fine. |
+| `dsv4/layers/mhc.py` | `sinkhorn_knopp` CUDA kernel returns new tensor | Captured and replayed. Should be fine. |
+| `dsv4/layers/moe.py` | `l1_out[padded_dst]` — advanced indexing creates new tensor | Captured and replayed. Should be fine. |
+| `dsv4/layers/moe.py` | `deinterleave_l1_weights` — creates new tensor (non-fused path only) | Not used with fused_swiglu=True. |
+| `dsv4/ops/quantize.py` | `quantize_nvfp4_gpu_fused` returns new tensors from CUDA kernels | Captured and replayed (kernel output is recorded). Should be fine. |
+| Various layers | `.contiguous()` calls on non-contiguous tensors | Allocates new tensor during capture; recorded and replayed. Fine. |
+
+---
+
+## CATEGORY 7: CuTeDSL from_dlpack device mismatch in graph capture — FIXED ✅
+
+| Attempt | Fix | Result | Commit |
+|---------|-----|--------|--------|
+| v1 | `torch.cuda.set_device(t.device.index)` before from_dlpack | ❌ 'Capture must end on the same stream it began on' | `87b6c99` (reverted) |
+| v2 | `_DLPatchTensor` wrapper forcing `dl_device` in `__dlpack__` | ❌ 'Cannot copy between CPU and CUDA tensors' | `5c94dbb` (reverted) |
+| v3 | Patch `torch.cuda.current_device` lambda to return tensor's device index | ✅ WORKS | `91c3703` |
+
+**NOTE**: The from_dlpack patch is still needed during CAPTURE (Python-side). During REPLAY, the GPU kernel arguments are replayed directly — no from_dlpack call. The patch does not interfere with explicit stream management.
+
+---
+
+## CATEGORY 8: Cross-GPU operations inside graph capture — FIXED ✅
+
+| Issue | Fix |
+|-------|-----|
+| `positions.to(rope_cos.device)` inside forward_layer during capture | Per-GPU `dec_pos_per_gpu`/`dec_tid32_per_gpu` buffers (`56b816a`) |
+| `X.to(f"cuda:{gpu}")` in layer loop | Graph uses per-layer x_in_bufs, copy_ before replay |
+| `token_id.to(x.device)` in moe_forward | Per-GPU dec_tid32_per_gpu buffers |
+
+---
+
+## CATEGORY 9: Multi-GPU CUDA graph stream issue — FIXED ✅
+
+**THIS WAS THE ROOT CAUSE OF THE ALL-ZEROS REPLAY BUG.**
+
+| Issue | Fix |
+|-------|-----|
+| Graph capture on non-default GPUs (cuda:1-7) produces all-zero output during replay | Use explicit `torch.cuda.Stream(device=device)` per layer for capture AND replay |
+| GPU 0: Empty graph with `torch.cuda.set_device()` | Same fix — explicit stream |
+| No sync between graph streams and default stream (eager attention) | `torch.cuda.Event` + `record()` + `wait_event()` |
+
+**Minimal reproduction**: `tests/unit/test_cuda_graph_stream.py`
+
+**Implementation in CUDAGraphDecoder**:
+- `self.streams[li] = torch.cuda.Stream(device=dev)` — per-layer stream
+- Capture: `with torch.cuda.graph(graph_a, stream=s):`
+- Replay: `with torch.cuda.stream(s): graph_a.replay()`
+- Sync: Event between graph stream and default stream for eager attention
+
+---
+
+## CUDAGraphDecoder Architecture (Current — A/B Split with Explicit Streams)
+
+The decoder captures the compute-heavy path as two graphs per layer, with eager attention in between:
+
+```
+Capture flow:
+1. Step 0: warmup (eager) + warmup_gsa (fix gsa values)
+2. For each layer li:
+   a. Create per-device stream: s = torch.cuda.Stream(device=dev)
+   b. Capture Graph A (on stream s): mHC pre_block(attn) + RMSNorm + quantize + q_a + q_b + kv projections
+      → writes to x_normed_bufs[li], q_heads_bufs[li], kv_3d_bufs[li], ctx_a_B/C_bufs[li], X_mid_bufs[li], q_a_bufs[li]
+   c. Capture Graph B (on stream s): mHC post_block(attn) + FFN + Router + MoE + SE + mHC post_block(ffn)
+      → reads F_attn_bufs[li], X_mid_bufs[li]; writes x_out_bufs[li]
+3. Capture hc_head + norm + lm_head on cuda:0 (on lm_stream)
+```
+
+```
+Replay flow:
+1. For each layer li:
+   a. Copy X → x_in_bufs[li] (handles cross-GPU transfer)
+   b. Replay Graph A on stream s:
+      with torch.cuda.stream(s): graphs_a[li].replay()
+   c. Sync: graph stream → default stream (Event + wait_event)
+   d. Eager attention: forward_attention(q_heads=q_heads, kv_3d=kv_3d, ...)
+   e. Copy F_attn → F_attn_bufs[li]
+   f. Sync: default stream → graph stream (Event + synchronize)
+   g. Replay Graph B on stream s:
+      with torch.cuda.stream(s): graphs_b[li].replay()
+   h. X = x_out_bufs[li]
+2. Copy X → x_lm_in → replay lm_graph on lm_stream
+3. Read logits_buf
+```
+
+Key commits: `6dc2f22` (initial A/B split + critical buffer fixes), `69e15f1` (swizzle kernel), `ffa7842` (router fix), `f259d63` (SE swizzle bug), `6650f06` (explicit stream fix — THE critical fix)
+
+---
+
+## Performance
+
+| Mode | Decode Speed | Notes |
+|------|-------------|-------|
+| Eager (no --cuda-graph) | 0.51-0.53s/token | Baseline, stable |
+| CUDA Graph (--cuda-graph) | 0.28-0.30s/token | ~2x faster, matching numerical output |
+
+**Decode degeneration**: Model generates repetition loop (`psych` ↔ `istically`) in BOTH modes. This is NOT caused by CUDA graph capture — it's a model-level issue. Root cause still UNKNOWN. Components exonerated: mHC, FMHA, compression.
+
+---
+
+## Remaining Work
+
+### Phase 1 (current — nearly complete)
+1. ⬜ **Gate commits on capture test** — implement CI check
+2. ⬜ **Optimize stream sync** — pre-create events, reduce per-step overhead
+3. ⬜ **Long-run stability test** — --max-tokens 512+ with --cuda-graph
+4. ⬜ **Memory leak check** — ensure no growing GPU usage over many steps
+5. ⬜ **Numerical drift check** — verify logit range stays stable over 512+ steps
+
+### Phase 2 (vLLM Integration — future)
+- Paged KV cache (fixed blocks + block table)
+- Device-side compressor boundary detection + fixed-shape output
+- Full graph capture including FMHA
+- Bucket-by-shape for variable sequence lengths
--- a/GETTING+CUDAGRAPH_READY.md
+++ b/GETTING+CUDAGRAPH_READY.md
@@ -1,94 +0,0 @@
-# DSV4 → vLLM: CUDA-Graph Safety / GPU-Native Requirements (PART 2 companion)
-
-**Goal:** the per-step decode forward must be fully GPU-native so vLLM can capture and replay it. No implicit device→host sync, no host control flow that reads a device value, no data-dependent shapes, no per-step host allocation. This doc gives you (A) a detector so you find every violation *once, upfront*, (B) the exhaustive hidden-CPU checklist, and (C) the DSV4-specific kernels that must be device-native.
-
-## The one rule that decides everything
-
-Branching on a **host-known integer** (step number, position, batch size, dtype, static shape) is graph-compatible — you capture one graph per bucket and the scheduler picks by that integer. Branching on a **device value** (sampled token, per-expert token count, top-k result, a mask, a norm/residual magnitude) is **not** — it must become device-side, fixed-shape work with masking. Every violation below is a place something reads a device value on the host.
-
-You do **not** need one monolithic graph. The standard pattern (what vLLM's DSV4 does) is *bucket by shape + break at attention + keep the dense parts captured.* Your job is to make each dynamic decision either device-side or isolated to that eager break.
-
---
-
-## SECTION A — The detector (build this FIRST, before porting anything)
-
-Stop hunting syncs by hand. Make them fail at the exact line:
-
-```python
-import torch
-torch.cuda.set_sync_debug_mode("error")   # raises at any implicit device→host sync
-# ... run one decode step of the forward ...
-torch.cuda.set_sync_debug_mode("default")
-```
-
-And a capture-under-test (most illegal host ops error *during* capture):
-```python
-g = torch.cuda.CUDAGraph()
-# static input buffers allocated ONCE, outside capture:
-with torch.cuda.graph(g):
-    out = decode_step(static_inputs)     # capture fails loudly on .item(), sync, alloc, etc.
-for _ in range(3):
-    static_inputs.copy_(next_inputs);  g.replay()   # replay must reproduce eager output
-```
-
-**Do this on the current `single_shot` forward first** — it inventories *every* existing sync in one pass, so you get the whole hunt-list upfront instead of discovering them one at a time during vLLM bring-up. Then gate every commit on both checks in CI; the day someone adds a `.item()`, the build fails at that line.
-
-Also useful: `compute-sanitizer --tool synccheck`, and `nsys` to eyeball CPU↔GPU stall gaps.
-
---
-
-## SECTION B — The hidden-CPU checklist (grep the hot path for these)
-
-**Explicit device→host transfers**
-`.item()` · `.cpu()` · `.tolist()` · `.numpy()` · `int(t)` / `float(t)` / `bool(t)` · `print(t)` · f-strings/logging that interpolate a tensor · `assert (device_condition)` (e.g. `assert (x>0).all()`) · `.to("cpu")`
-
-**Host control flow on device values**
-`if t:` · `if mask.any():` · `if x.sum() > thr:` · `while t > 0:` · `for i in range(n.item())` · convergence early-exit reading a device residual · choosing a kernel based on the sampled token
-
-**Data-dependent shapes (these both change shape AND sync)**
-`torch.nonzero` · `torch.where(cond)` (one-arg form) · `torch.unique` · `torch.bincount` (when it drives a shape) · boolean/mask indexing `x[mask]`, `x[x>0]` · `masked_select` · `reshape(n.item(), ...)` · any gather sized by a device-computed count
-
-**Per-step host allocation**
-`torch.empty/zeros/tensor([...])` created fresh inside the captured region · building a Python list then `torch.tensor(list, device=...)` · `np.*` anywhere on the path · any CPU tensor then `.to(device)` per step
-
-**Host RNG**
-`random.*` / `np.random.*` / Python rng for sampling → use a device generator / captured philox state
-
-**Sync primitives & checks**
-`torch.cuda.synchronize()` · `stream.synchronize()` · `torch.isnan(x).any()` / `isinf(...).any()` debug guards · pinned-copy syncs
-
-**Sneaky ones (the "didn't realize" category)**
-`sum(t)` / `min(t)` / `max(t)` (Python builtins iterate → sync; use `t.sum()`) · a `.cpu()`/`.item()` hidden inside a logging, assert, or metrics helper · `einops` rearrange with a data-dependent dim · telemetry/progress hooks that read tensors · indexing a tensor with a Python int derived from `.item()`
-
-**What is FINE (no sync, don't waste time on these)**
-`.shape` / `.size()` / `.numel()` / `.dtype` (host metadata, no sync) · branching on host-known ints (step/batch/static shape) · dtype/shape kernel dispatch · the **stop-token check, detokenize, and your BF16 precision-floor dequant** (all load-time or *outside* the captured graph — leave them on host, that's correct).
-
---
-
-## SECTION C — DSV4-specific kernels that must be GPU-native
-
-| # | Hazard (current host/dynamic behavior) | Requirement | vLLM reference |
-|---|---|---|---|
-| 1 | Compressor returns `None` for 3/4 (CSA) or 127/128 (HCA) decode steps — periodic host branch | Compress **every** step into a persistent partial-state/ring buffer; emit the compressed entry **device-side** on the boundary | `save_partial_states`, `fused_compress_quant_cache` |
-| 2 | KV grows each step → attention shape changes | Paged KV (fixed blocks + block table) captured at fixed max-len with masking, **or** make attention the eager break | `breakable_cudagraph` / `eager_break_during_capture`; `AttentionCGSupport.ALWAYS` |
-| 3 | Indexer top-k → host reads selected count to size gather | Always gather fixed `k` (padded), mask invalid; no host read of the count | `dequant_gather_k_cutedsl` (fixed-shape gather) |
-| 4 | MoE top-6 → per-expert token counts drive per-expert launches | Routing permutation/offsets computed **on device**; grouped GEMM with device offsets and a fixed total launch | `prepare_megamoe` |
-| 5 | Next token / positions managed on host, fresh tensors per step | Static I/O buffers allocated once; **in-place** `copy_` of next token; positions via device-side increment (or per-shape bucketed graphs) | vLLM persistent input buffers |
-
-Also confirm:
- **Sinkhorn** runs a **fixed 20 iterations with no host convergence check** (a `while not converged` reading a device residual breaks capture). Fixed-iteration = safe.
- **Sampler** is device-side; `repetition_penalty` reads from a **fixed-size device** recent-token buffer (not a growing Python list); the EOS/stop decision is a host step **outside** the graph (correct).
-
---
-
-## SECTION D — Integration order
-
-1. **Build Section A's detector and run it on the current forward** — get the full sync inventory in one pass.
-2. Fix Section C's five device-native kernels (these are the structural ones; the rest of Section B tends to be incidental `.item()`s once these are right).
-3. Re-run capture-under-test until it captures clean and replay matches eager bit-for-bit.
-4. Gate every commit on the capture test so violations can never silently return.
-
-## Guardrails
- Keep the stop-check, detokenize, and load-time BF16 dequant on the host — they're outside the captured region by design; don't contort them to be "graph-safe."
- Decide the attention model up front (paged-capturable vs eager-break) — retrofitting it later forces a KV-cache rewrite.
- Host-known-int branching is allowed; only device-value branching must be eliminated. Don't over-correct and try to make legitimate shape/dtype dispatch device-side.
--- a/GETTING_CUDAGRAPH_READY.md
+++ b/GETTING_CUDAGRAPH_READY.md
@@ -0,0 +1,198 @@
+# DSV4 → vLLM: CUDA-Graph Safety / GPU-Native Requirements (PART 2 companion)
+
+**Goal:** the per-step decode forward must be fully GPU-native so vLLM can capture and replay it. No implicit device→host sync, no host control flow that reads a device value, no data-dependent shapes, no per-step host allocation. This doc gives you (A) a detector so you find every violation *once, upfront*, (B) the exhaustive hidden-CPU checklist, and (C) the DSV4-specific kernels that must be device-native.
+
+## The one rule that decides everything
+
+Branching on a **host-known integer** (step number, position, batch size, dtype, static shape) is graph-compatible — you capture one graph per bucket and the scheduler picks by that integer. Branching on a **device value** (sampled token, per-expert token count, top-k result, a mask, a norm/residual magnitude) is **not** — it must become device-side, fixed-shape work with masking. Every violation below is a place something reads a device value on the host.
+
+You do **not** need one monolithic graph. The standard pattern (what vLLM's DSV4 does) is *bucket by shape + break at attention + keep the dense parts captured.* Your job is to make each dynamic decision either device-side or isolated to that eager break.
+
+---
+
+## ⚠️ CRITICAL MULTI-GPU REQUIREMENT (learned 2026-06-06)
+
+**PyTorch CUDA graphs on non-default GPUs REQUIRE explicit `torch.cuda.Stream(device=device)` for capture AND replay.** Using `torch.cuda.set_device()` alone causes:
+- GPU 0: Empty graph (warning: "The CUDA Graph is empty")
+- GPU 1+: Graph replays with stale capture-time data, ignoring updated input buffers
+
+**The fix:**
+```python
+# CAPTURE:
+s = torch.cuda.Stream(device=device)
+g = torch.cuda.CUDAGraph()
+with torch.cuda.graph(g, stream=s):
+    output_buf.copy_(input_buf * 2.0)
+
+# REPLAY:
+with torch.cuda.stream(s):
+    g.replay()
+```
+
+**Stream synchronization between graph and eager paths:**
+- Graph A/B run on per-device streams
+- Eager attention (between Graph A and Graph B) runs on the default stream
+- Use `torch.cuda.Event` + `record()` + `wait_event()` for sync
+- **Do NOT use `torch.cuda.synchronize()`** — it syncs ALL GPUs (too heavy)
+
+This was the root cause of the "all-zeros replay" bug that took an entire session to diagnose. The minimal reproduction test is in `tests/unit/test_cuda_graph_stream.py`. **Read this test if you ever see zero-output graph replay again.**
+
+---
+
+## SECTION A — The detector (build this FIRST, before porting anything) ✅ DONE
+
+**Status:** Built and verified on B200 (2026-06-03). See `tests/unit/test_cuda_graph_readiness.py`.
+
+Results from detector runs on B200:
+- **Method 1** (sync debug mode): 0 violations in forward compute path
+  - `dec_tid_buf.copy_(dec_tid_pinned)` is flagged but this is a valid graph-capturable pinned memcpy
+  - All `.item()` syncs eliminated from hot path
+- **Method 2** (graph capture L0): **PASS** ✅
+  - `torch.cuda.CUDAGraph()` capture of layer 0 decode step succeeds
+  - All per-call allocations eliminated
+  - All host reads of GPU values eliminated
+
+The detector:
+1. Grep for Section B sync patterns in hot path files
+2. Run one decode step with `torch.cuda.set_sync_debug_mode("error")`
+3. Attempt `torch.cuda.graph` capture of L0 decode step
+4. Report results to `/tmp/cuda_graph_readiness_results.json`
+
+Run via test harness:
+```bash
+fire_b200_test tests/unit/test_cuda_graph_readiness.py kernel-test /tmp/kernel-test.log 1800
+```
+
+---
+
+## SECTION B — The hidden-CPU checklist (grep the hot path for these) ✅ ADDRESSED
+
+**Explicit device→host transfers** — All `.item()` calls on hot path eliminated:
+- mhc.py `post_block`: removed `X_next.abs().max().item()` (122 syncs/step across 61 layers × 2 mHC)
+- All other `.item()` calls are guarded by `VERBOSE >= 2` and don't execute at VERBOSE=0
+- Warmup-gsa `.item()` calls run once at step 0, outside graph region
+
+**Data-dependent shapes** — Eliminated `torch.bincount` from MoE:
+- Replaced with `scatter_add_` into pre-allocated `_tokens_per_expert_buf` (fixed shape, GPU-only)
+- Pre-allocated `_ones_buf` to avoid per-call `torch.ones()`
+
+**Per-step host allocation** — All eliminated:
+- `torch.zeros()` in `_assemble_scales_single_group` → pre-allocated `_scale_a_buf` (linear.py, grouped_linear.py, shared_expert.py)
+- `torch.full()` for MoE l1_gsa → `self._l1_gsa_buf.fill_(l1_gs)`
+- `torch.empty()` for grouped_linear output → pre-allocated `_output_buf`
+- `mHCLayer.init_state` `.clone()` → `out_buf` parameter for in-place write
+- `torch.zeros_like` in quantize.py → scalar `0.0` in `torch.where`
+
+**Host control flow on device values** — Eliminated:
+- `dec_tid_buf[0] = python_int` → pinned CPU buffer + `copy_` (async, graph-capturable)
+- `expert_offsets[g] = python_int` → element-wise GPU multiply with pre-allocated range tensor
+- `if group_offsets[0] != 0` → unconditional GPU-only update (no host read of GPU tensor)
+
+**What is FINE (no sync, don't waste time on these)**
+- `.shape` / `.size()` / `.numel()` / `.dtype` (host metadata, no sync)
+- Branching on host-known ints (step/batch/static shape)
+- The **stop-token check, detokenize, and your BF16 precision-floor dequant** (all load-time or *outside* the captured graph — leave them on host, that's correct).
+- `dec_tid_buf.copy_(dec_tid_pinned)` — pinned CPU→GPU async memcpy, graph-capturable
+
+---
+
+## SECTION C — DSV4-specific kernels that must be GPU-native
+
+| # | Hazard | Status | Fix Applied |
+|---|--------|--------|-------------|
+| 1 | Compressor returns `None` for 3/4 (CSA) or 127/128 (HCA) decode steps | ⏳ Phase 2 (eager-break) | Compressor runs in eager section. Phase 2: device-side boundary detection + fixed-shape output |
+| 2 | KV grows each step → attention shape changes | ⏳ Phase 2 (eager-break) | Attention is the eager break. Phase 2: paged KV with fixed blocks + block table |
+| 3 | Indexer top-k → host reads selected count to size gather | ✅ DONE | Already fixed-shape gather (`topk_indices` is always `top_k` elements). No host read of count. |
+| 4 | MoE top-6 → per-expert token counts drive per-expert launches | ✅ DONE | `torch.bincount` → `scatter_add_` into pre-allocated buffer. Expert offsets are GPU tensors. |
+| 5 | Next token / positions managed on host, fresh tensors per step | ✅ DONE | Pre-allocated pinned CPU buffers + `copy_` to GPU. No per-step allocation. |
+
+Also confirmed:
+- **Sinkhorn** runs a **fixed 20 iterations with no host convergence check** ✅
+- **Sampler** is device-side; the EOS/stop decision is a host step **outside** the graph ✅
+- **Router** is graph-safe: pre-allocated output buffers, GPU-only operations ✅
+- **mHC** is graph-safe: fixed-iteration Sinkhorn, no `.item()` on hot path ✅
+
+### Architectural Decision: Eager-Break-at-Attention (Phase 1) — UPDATED 2026-06-06
+
+The per-layer compute is split into **two graph-captured regions** with eager attention in between:
+- **Graph A** (captured): mHC pre_block(attn) + fused RMSNorm + quantize + q_a + q_a_norm + q_b + kv projections
+  - Outputs written to pre-allocated buffers: x_normed, q_heads, kv_3d, ctx_a_B, ctx_a_C, X_mid
+- **Eager** (NOT captured): Compressor → Indexer → KV gather → FMHA → inverse RoPE → o_a + o_b → F_attn
+  - Dynamic shapes (FMHA seq_len, compressor returns None) → cannot be captured
+  - `forward_attention()` accepts optional `q_heads`/`kv_3d` to skip projections when called from graph replay
+- **Graph B** (captured): mHC post_block(attn) + FFN mHC + RMSNorm + quantize + Router + MoE + SE + mHC post_block(ffn)
+  - Reads F_attn from pre-allocated buffer (written by eager attention)
+  - Writes X_next to pre-allocated output buffer
+
+**Rationale**: FMHA has dynamic sequence length; compressor/KV are data-dependent. Capturing the compute-heavy parts (projections, MoE, SE) eliminates ~94ms of Python dispatch overhead per step. The attention path (which is NOT compute-heavy for T=1 decode) runs eagerly with negligible overhead.
+
+**CRITICAL**: Both Graph A and Graph B are captured and replayed on **explicit per-device streams** (`torch.cuda.Stream(device=device)`). The eager attention path runs on the **default stream**. Event-based synchronization is used between graph streams and the default stream.
+
+**Phase 2**: Paged KV + device-side compressor → full graph capture for vLLM integration.
+
+---
+
+## SECTION D — Integration order
+
+1. ✅ **Build Section A's detector and run it on the current forward** — DONE. `tests/unit/test_cuda_graph_readiness.py` on B200.
+2. ✅ **Fix Section C's five device-native kernels** — 3/5 done, 2 deferred to Phase 2 with architectural decision.
+3. ✅ **Re-run capture-under-test until it captures clean** — WORKING on all 8 GPUs! Root cause: multi-GPU requires explicit `torch.cuda.Stream(device=device)`.
+4. ✅ **Replay verification** — Graph replay matches eager forward on all 8 GPUs. Logit range [-26.5, 15.0] matches.
+5. ✅ **Benchmark** — 0.28-0.30s/token with CUDA graphs (vs 0.55s/token eager = ~2x speedup).
+6. ⬜ **Gate every commit on the capture test** — Not yet implemented.
+7. ⬜ **Optimize stream sync** — Current implementation uses `torch.cuda.Event` + `wait_event()`/`synchronize()`. Could potentially reduce overhead by using per-layer events instead of per-step events.
+8. ⬜ **Phase 2**: Paged KV + device-side compressor for full vLLM graph capture.
+
+---
+
+## NEXT STEPS (pick up here in next session)
+
+### Priority 1: Decode degeneration (still unresolved)
+The model generates a repetition loop (`psych` ↔ `istically`) regardless of whether CUDA graphs are used. This is the SAME issue as the eager path — not caused by graph capture. Root cause UNKNOWN. Components exonerated: mHC, FMHA, compression. This is the highest-priority correctness issue.
+
+### Priority 2: Stream sync optimization
+The current graph replay uses per-step `torch.cuda.Event` sync between graph streams and the default stream. This works but may add overhead. Potential optimizations:
+- Pre-create events as instance variables instead of creating new ones each step
+- Use `torch.cuda.Stream.wait_stream()` instead of event-based sync where possible
+- Profile the sync overhead vs compute time
+
+### Priority 3: Long-run stability
+Test with --max-tokens 512+ to verify stability over many decode steps. Check for:
+- Memory leaks (growing GPU memory usage)
+- Numerical drift (logit range changes over time)
+- Graph replay failures after many steps
+
+### Priority 4: Phase 2 — Full vLLM integration
+- Paged KV cache (fixed blocks + block table)
+- Device-side compressor boundary detection + fixed-shape output
+- Full graph capture including FMHA
+- Bucket-by-shape for variable sequence lengths
+
+---
+
+## Guardrails
+- Keep the stop-check, detokenize, and load-time BF16 dequant on the host — they're outside the captured region by design; don't contort them to be "graph-safe."
+- **Phase 1 uses eager-break-at-attention.** Phase 2 adds paged KV. Don't retrofit paged KV into Phase 1 — it's a separate integration.
+- Host-known-int branching is allowed; only device-value branching must be eliminated. Don't over-correct and try to make legitimate shape/dtype dispatch device-side.
+- **ALWAYS use explicit `torch.cuda.Stream(device=device)` for graph capture and replay on multi-GPU setups.** This is non-negotiable on B200.
+
+## Violation Fix Log
+
+| Commit | Description |
+|--------|-------------|
+| `a9ea303` | mhc.py `.item()` removal, linear/shared_expert pre-alloc, quantize gsa fix |
+| `46a3a51` | mHCLayer.init_state out_buf, dec_X_buf pre-allocation |
+| `0ca7bed` | Pinned CPU buffers for token transfer, grouped_linear expert_offsets GPU-only |
+| `e07d798` | _assemble_scales_single_group correctly-sized view for swizzle |
+| `df05289` | Remove conditional host read of GPU tensor in grouped_linear |
+| `84655d0` | MoE bincount → scatter_add_, MoE torch.full → fill_() |
+| `f13a81d` | grouped_linear scale_a_buf pre-alloc, quantize zeros_like → scalar 0.0 |
+| `518a1d3` | MoE scatter_add_ int64 indices, fix second bincount call |
+| `80bb27f` | gsa broadcast: reshape for M=1 decode (no stride-0), contiguous for M>1 prefill |
+| `6dc2f22` | **CRITICAL: _l1_out_buf 2x too narrow → GPU memory corruption (root cause of ALL cudaErrorInvalidValue errors)**. Also: all GEMM output buffers pre-allocated, gsa copy_ → scalar assignment |
+| `69e15f1` | Blackwell swizzle CUDA kernel for graph capture, swizzled output buffers |
+| `ffa7842` | Dense router: BF16 GEMM instead of FP32 conversion during graph capture |
+| `f259d63` | **CRITICAL: SE swizzled buffers allocated then overwritten with None — graph capture would fall through to broken Python path** |
+| `32902d1` | Derive q_a_dim from config, pre-cache norm weights, add buffer verification |
+| `5a98cc6` | Store pre-cached norm weights on self to prevent GC during graph replay |
+| `6650f06` | **CRITICAL FIX: Use explicit per-device streams for CUDA graph capture/replay — fixes all-zeros replay on non-cuda:0 GPUs** |
--- a/archived_plans/WALKING_BACK_SOME_QUANTS.md
+++ b/archived_plans/WALKING_BACK_SOME_QUANTS.md
--- a/dsv4/decode/cuda_graph_decoder.py
+++ b/dsv4/decode/cuda_graph_decoder.py
@@ -0,0 +1,172 @@
+"""CUDA Graph Decode for DSV4 — zero Python dispatch overhead.
+
+Architecture: Eager-break-at-attention with per-GPU captured subgraphs.
+
+For each decode step:
+  1. Copy next token to pre-allocated input buffer (pinned CPU → GPU)
+  2. For each GPU subgraph: replay the captured compute
+  3. Between subgraphs: transfer X between GPUs (eager, small tensor)
+  4. FMHA runs eagerly (dynamic KV length) — this is the attention break
+  5. After all layers: hc_head + norm + lm_head (captured on cuda:0)
+  6. Sample next token (eager, outside graph)
+
+The captured subgraph per GPU contains:
+  - mHC pre_block (attn) → RMSNorm + quantize → attention projections (q_a, q_b, kv)
+  - [EAGER: compressor → indexer → gather → FMHA → inverse RoPE]
+  - o_proj → mHC post_block (attn) → mHC pre_block (ffn) → Router → MoE → SE → mHC post_block (ffn)
+
+Actually, for simplicity and to avoid splitting the attention, we capture
+the FULL layer forward (including FMHA) and handle the dynamic KV length
+by pre-allocating at max_context and masking.
+
+For the initial implementation, we capture per-LAYER (not per-GPU subgraph)
+to isolate issues. 61 individual graphs, each capturing one layer's forward.
+"""
+
+import torch
+import torch.nn.functional as F
+import time
+import math
+
+from dsv4.layers.mhc import mHCLayer, mHCContext
+
+
+class CUDAGraphDecoder:
+    """CUDA Graph decoder for DSV4 single-shot inference.
+
+    Captures the entire decode step (all 61 layers + lm_head) as CUDA graphs,
+    eliminating Python dispatch overhead (~94ms) and kernel launch latency.
+
+    Constraints:
+    - All tensors must have fixed addresses (pre-allocated)
+    - No dynamic shapes (T=1 decode has fixed shapes)
+    - No CPU-GPU syncs inside the graph
+    - Cross-GPU transfers happen outside the graph region
+
+    The compressor and KV cache must be graph-safe:
+    - Compressor: always produces output (zeros when buffer incomplete)
+    - KV cache: n_comp stored as GPU tensor, gather is fixed-shape with masking
+    - FMHA: runs at max_seq_len with masking for actual length
+    """
+
+    def __init__(self, n_layers, num_gpus, devices, hidden_size, n_hc=4):
+        self.n_layers = n_layers
+        self.num_gpus = num_gpus
+        self.devices = devices
+        self.hidden_size = hidden_size
+        self.n_hc = n_hc
+
+        # Per-layer CUDA graphs
+        self.graphs = {}  # li -> torch.cuda.CUDAGraph
+
+        # Final graph (hc_head + norm + lm_head) on cuda:0
+        self.lm_graph = None
+
+        # Pre-allocated I/O buffers — fixed addresses for graph capture
+        # X is (1, n_hc, H) BF16
+        self.x_in = {}   # li -> tensor on device of layer li
+        self.x_out = {}  # li -> tensor on device of layer li
+
+        # Final output buffers on cuda:0
+        self.logits_buf = None
+        self.x_cuda0_buf = None  # X after all layers, on cuda:0
+
+        self.captured = False
+
+    def pre_allocate(self, vocab_size=129280):
+        """Pre-allocate all I/O buffers with fixed addresses."""
+        for li in range(self.n_layers):
+            dev = self.devices[li % self.num_gpus]
+            self.x_in[li] = torch.zeros(1, self.n_hc, self.hidden_size,
+                                         dtype=torch.bfloat16, device=dev)
+            self.x_out[li] = torch.zeros(1, self.n_hc, self.hidden_size,
+                                          dtype=torch.bfloat16, device=dev)
+
+        self.logits_buf = torch.zeros(1, vocab_size, dtype=torch.bfloat16, device='cuda:0')
+        self.x_cuda0_buf = torch.zeros(1, self.n_hc, self.hidden_size,
+                                        dtype=torch.bfloat16, device='cuda:0')
+
+    def capture(self, X_warmup, layer_forward_fn, lm_forward_fn,
+                all_layer_args, lm_args):
+        """Capture CUDA graphs after warmup.
+
+        Args:
+            X_warmup: X tensor from warmup step (to seed input buffers)
+            layer_forward_fn: function(X, li, **kwargs) -> X_next
+            lm_forward_fn: function(X, **kwargs) -> logits
+            all_layer_args: dict[li] -> kwargs for layer_forward_fn
+            lm_args: kwargs for lm_forward_fn
+        """
+        print("  Capturing CUDA graphs for decode...", flush=True)
+
+        for li in range(self.n_layers):
+            gpu = li % self.num_gpus
+            dev = self.devices[gpu]
+            torch.cuda.set_device(gpu)
+
+            # Seed input buffer with warmup X
+            if li == 0:
+                self.x_in[li].copy_(X_warmup.to(dev))
+            else:
+                self.x_in[li].copy_(self.x_out[li - 1].to(dev))
+
+            graph = torch.cuda.CUDAGraph()
+            with torch.cuda.graph(graph):
+                X_next = layer_forward_fn(self.x_in[li], li, **all_layer_args[li])
+                self.x_out[li].copy_(X_next)
+
+            self.graphs[li] = graph
+            if (li + 1) % 10 == 0:
+                print(f"    Captured {li+1}/{self.n_layers} layer graphs", flush=True)
+
+        # Capture hc_head + norm + lm_head on cuda:0
+        torch.cuda.set_device(0)
+        if self.n_layers > 0:
+            self.x_cuda0_buf.copy_(self.x_out[self.n_layers - 1].to('cuda:0'))
+
+        self.lm_graph = torch.cuda.CUDAGraph()
+        with torch.cuda.graph(self.lm_graph):
+            logits = lm_forward_fn(self.x_cuda0_buf, **lm_args)
+            self.logits_buf.copy_(logits)
+
+        self.captured = True
+        print(f"  Captured {len(self.graphs)} layer graphs + lm_head graph", flush=True)
+
+    def replay(self, token_id_gpu, position_gpu):
+        """Replay captured graphs for one decode step.
+
+        Args:
+            token_id_gpu: (1,) long tensor on cuda:0 — next token ID
+            position_gpu: (1,) long tensor on cuda:0 — current position
+
+        Returns:
+            logits: (1, vocab_size) bfloat16 tensor
+        """
+        assert self.captured, "Must call capture() before replay()"
+
+        # TODO: Copy token_id/position to the static input buffers that the graph uses.
+        # This requires the graph to reference those buffers.
+
+        # Replay layer graphs
+        for li in range(self.n_layers):
+            gpu = li % self.num_gpus
+            torch.cuda.set_device(gpu)
+
+            # Copy input from previous layer's output
+            if li > 0:
+                prev_gpu = (li - 1) % self.num_gpus
+                if prev_gpu != gpu:
+                    self.x_in[li].copy_(self.x_out[li - 1].to(self.devices[gpu]))
+                else:
+                    self.x_in[li].copy_(self.x_out[li - 1])
+
+            self.graphs[li].replay()
+
+        # Transfer final X to cuda:0
+        if self.n_layers > 0:
+            self.x_cuda0_buf.copy_(self.x_out[self.n_layers - 1].to('cuda:0'))
+
+        # Replay lm_head graph
+        self.lm_graph.replay()
+
+        return self.logits_buf
--- a/dsv4/kernels/cuda/blackwell_swizzle.cu
+++ b/dsv4/kernels/cuda/blackwell_swizzle.cu
@@ -0,0 +1,116 @@
+/**
+ * Blackwell 32_4_4 scale swizzle kernel.
+ *
+ * Rearranges FP8 scale factors from row-major layout to Blackwell tensor-core
+ * compatible layout. This is the GPU equivalent of the Python:
+ *   blocks = x.view(R, 128, C, 4).permute(0, 2, 1, 3)
+ *   out = blocks.reshape(-1, 4, 32, 4).transpose(1, 2).reshape(-1, 32, 16).flatten()
+ *
+ * The kernel writes to a pre-allocated output buffer — no per-step allocations.
+ * CUDA-graph-capturable: no host-device syncs, no dynamic shapes.
+ */
+
+#include <cuda_runtime.h>
+#include <c10/cuda/CUDAStream.h>
+#include <cstdint>
+#include <torch/extension.h>  // For pybind11 bindings
+
+// Blackwell 32_4_4 swizzle: each thread handles one output element
+// Input: (rows, cols) float8_e4m3fn — rows is multiple of 128, cols is multiple of 4
+// Output: (rows, cols) float8_e4m3fn — swizzled layout
+//
+// The swizzle reorders so that:
+//   For each group of 128 rows × 4 cols (a "block"):
+//     - The 128 rows are divided into 32 "sub-rows" of 4 rows each
+//     - The 4 cols are kept as-is
+//     - The output order is: [sub-row 0 col 0..3, sub-row 1 col 0..3, ..., sub-row 31 col 0..3]
+//     - Within each sub-row, the 4 rows × 4 cols = 16 elements are laid out as 32×16
+
+__global__ void blackwell_swizzle_32_4_4_kernel(
+    const uint8_t* __restrict__ input,   // (rows, cols) in FP8
+    uint8_t* __restrict__ output,         // (rows, cols) swizzled FP8
+    const int32_t rows,
+    const int32_t cols       // must be multiple of 4
+) {
+    const int32_t R = rows / 128;  // number of 128-row blocks
+    const int32_t C = cols / 4;    // number of 4-col groups
+    
+    // Total output elements
+    const int32_t total = rows * cols;
+    
+    // Each thread handles one output element
+    const int32_t tid = blockIdx.x * blockDim.x + threadIdx.x;
+    if (tid >= total) return;
+    
+    // Output flat index → (block_r, col_group, sub_row, col_4, row_in_sub)
+    // Output layout: flatten of (R, C, 32, 4, 4, 4) → but simplified:
+    // The output is organized as:
+    //   For each (R, C) block: 32 sub-rows × 16 elements = 512 elements per block
+    //   Total per block: 128 * 4 = 512 elements
+    
+    // Decompose tid into block coordinates
+    const int32_t elements_per_block = 128 * 4;  // 512
+    const int32_t block_idx = tid / elements_per_block;
+    const int32_t within_block = tid % elements_per_block;
+    
+    const int32_t r = block_idx / C;      // row block index
+    const int32_t c = block_idx % C;      // col group index
+    
+    // Within-block layout: (32 sub-rows) × (4 col_within_group) × (4 row_within_subrow)
+    // But actually the swizzle is: reshape(32, 4, 4, 4) → transpose(1,2) → flatten
+    // Which gives: for each (sub_row, col_4, row_in_sub):
+    //   output[sub_row * 16 + col_4 * 4 + row_in_sub] = input[sub_row * 4 + row_in_sub][col_4 * 4 + c_offset]
+    
+    // Within block: 512 elements in swizzled order
+    // The Python swizzle does:
+    //   blocks[128 rows, 4 cols] → view(32, 4, 4, 4) → permute → (32, 4, 4, 4)
+    //   → reshape(-1, 32, 16) → flatten
+    // The output index maps to:
+    //   sub_row = within_block / 16
+    //   within_sub = within_block % 16  → (col_4, row_in_sub) = (within_sub / 4, within_sub % 4)
+    
+    const int32_t sub_row = within_block / 16;
+    const int32_t within_sub = within_block % 16;
+    const int32_t col_4 = within_sub / 4;
+    const int32_t row_in_sub = within_sub % 4;
+    
+    // Map back to input coordinates
+    const int32_t input_row = r * 128 + sub_row * 4 + row_in_sub;
+    const int32_t input_col = c * 4 + col_4;
+    
+    // Read input, write to output
+    output[tid] = input[input_row * cols + input_col];
+}
+
+extern "C" {
+
+void launch_blackwell_swizzle(
+    const uint8_t* input,
+    uint8_t* output,
+    int32_t rows,
+    int32_t cols,
+    cudaStream_t stream
+) {
+    const int32_t total = rows * cols;
+    const int32_t block_size = 256;
+    const int32_t grid_size = (total + block_size - 1) / block_size;
+    
+    blackwell_swizzle_32_4_4_kernel<<<grid_size, block_size, 0, stream>>>(
+        input, output, rows, cols
+    );
+}
+
+} // extern "C"
+
+// Pybind11 bindings for torch.utils.cpp_extension.load
+PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) {
+    m.def("blackwell_swizzle_32_4_4", [](at::Tensor input, at::Tensor output, int32_t rows, int32_t cols) {
+        auto stream = c10::cuda::getCurrentCUDAStream();
+        blackwell_swizzle_32_4_4_kernel<<<
+            (rows * cols + 255) / 256, 256, 0, stream>>>(
+            input.data_ptr<uint8_t>(),
+            output.data_ptr<uint8_t>(),
+            rows, cols
+        );
+    }, "Blackwell 32_4_4 scale swizzle");
+}
--- a/dsv4/kernels/cuda/compressor_reduce.cu
+++ b/dsv4/kernels/cuda/compressor_reduce.cu
@@ -124,15 +124,14 @@ __global__ void csa_compress_reduce_kernel(

            float g = gate_proj[token_idx * kv_dim + gate_offset + c];
            float kv_val = kv_proj[token_idx * kv_dim + kv_offset + c];
-            // Position bias: same (m, 2*hd) bias added to every block
-            // Added to BOTH gate (softmax logit) and kv (content) per reference
+            // Position bias: added to gate logits (softmax Z + B) only.
+            // The paper defines compression as softmax(Z + B) then weighted sum of C.
+            // The bias must NOT be added to kv_val — that poisons compressed content.
            if (position_bias != nullptr) {
                int pos_bias_row = (block_i > 0 && t < m) ? t : (block_i > 0 ? (t - m) : t);
                if (pos_bias_row >= 0 && pos_bias_row < m) {
                    float pb = position_bias[pos_bias_row * kv_dim + gate_offset + c];
                    g += pb;
-                    // kv_offset matches gate_offset for CSA: both are 0 (a-stream) or hd (b-stream)
-                    kv_val += position_bias[pos_bias_row * kv_dim + kv_offset + c];
                }
            }
            float e = expf(g - local_max[ci]);
@@ -192,12 +191,12 @@ __global__ void hca_compress_reduce_kernel(
            if (token_idx >= T) break;
            float g = gate_proj[token_idx * hd + c];
            float kv_val = kv_proj[token_idx * hd + c];
-            // Position bias: same (m, hd) bias added to every block
-            // Added to BOTH gate (softmax logit) and kv (content) per reference
+            // Position bias: added to gate logits (softmax Z + B) only.
+            // The paper defines compression as softmax(Z + B) then weighted sum of C.
+            // The bias must NOT be added to kv_val — that poisons compressed content.
            if (position_bias != nullptr && t < m) {
                float pb = position_bias[t * hd + c];
                g += pb;
-                kv_val += pb;
            }
            float e = expf(g - local_max);
            local_denom += e;
--- a/dsv4/kernels/gemm/grouped.py
+++ b/dsv4/kernels/gemm/grouped.py
@@ -2374,8 +2374,15 @@ def compute_scale_shape(
            return (padded_N, total_cols)


-def to_blocked(scale_2d: torch.Tensor) -> torch.Tensor:
-    """Pad and apply the Blackwell 32_4_4 scale swizzle to one raw scale tensor."""
+def to_blocked(scale_2d: torch.Tensor, out_buf: torch.Tensor = None) -> torch.Tensor:
+    """Pad and apply the Blackwell 32_4_4 scale swizzle to one raw scale tensor.
+    
+    During CUDA graph capture, uses a custom CUDA kernel because Python
+    view operations (reshape, transpose, permute) are not graph-capturable.
+    The out_buf must be provided during graph capture (pre-allocated output).
+    
+    During eager mode, uses the faster Python view path.
+    """
    if scale_2d.dim() != 2:
        raise ValueError(f"Expected 2D scale tensor, got {scale_2d.dim()}D.")
    rows, cols = scale_2d.shape
@@ -2394,6 +2401,19 @@ def to_blocked(scale_2d: torch.Tensor) -> torch.Tensor:
        )
        padded[:rows, :cols] = scale_2d

+    # Use CUDA kernel during graph capture — Python view ops are not capturable
+    if torch.cuda.is_current_stream_capturing():
+        from dsv4.kernels.cuda.loader import get_cuda_module
+        mod = get_cuda_module("blackwell_swizzle", ["blackwell_swizzle.cu"])
+        if out_buf is None:
+            out_buf = torch.empty_like(padded)
+        mod.blackwell_swizzle_32_4_4(
+            padded.view(torch.uint8), out_buf.view(torch.uint8),
+            padded_rows, padded_cols
+        )
+        return out_buf.view(torch.float8_e4m3fn).flatten()
+
+    # Eager path: Python view operations (fast, no kernel launch overhead)
    blocks = padded.view(row_blocks, 128, col_blocks, 4).permute(0, 2, 1, 3)
    rearranged = blocks.reshape(-1, 4, 32, 4).transpose(1, 2).reshape(-1, 32, 16)
    return rearranged.flatten()
--- a/dsv4/kernels/router/dense_router_decode.py
+++ b/dsv4/kernels/router/dense_router_decode.py
@@ -27,10 +27,16 @@ def dense_router_dispatch(
 ):
    """Dispatch the dense router (BF16 cuBLAS fallback).

-    BF16 GEMM via torch.nn.functional.linear (cuBLAS, SM100 tensor cores),
+    BF16 GEMM via torch.matmul (cuBLAS, SM100 tensor cores),
    then fused activation + top-k via the CUDA kernel.
+
+    CUDA-graph-compatible: no .T, no .float() on inputs during capture.
+    The GEMM runs in BF16 (Blackwell tensor cores handle BF16 natively).
+    Only the output logits are cast to FP32 for sqrt(softplus) stability.
    """
-    logits = torch.nn.functional.linear(hidden_states.float(), W_gate.T.float())
+    # BF16 GEMM: x @ W — no transpose needed, no FP32 conversion
+    logits_bf16 = torch.matmul(hidden_states, W_gate)  # [N, H] @ [H, E] = [N, E]
+    logits = logits_bf16.float()  # BF16 → FP32 for sqrt(softplus) numerical stability
    from dsv4.kernels.router._activation_topk import run_fused_activation_topk
    run_fused_activation_topk(
        logits, e_bias, routed_scaling_factor, top_k,
@@ -97,7 +103,8 @@ def dense_router_dispatch_nvfp4_fused(
    # Decode the gate_weight from NVFP4 to BF16 for cuBLAS
    from dsv4.ops.quantize import dequantize_nvfp4
    gate_bf16 = dequantize_nvfp4(gate_weight, gate_weight_scale, gate_ws2)
-    logits = torch.nn.functional.linear(hidden_states.float(), gate_bf16.T.float())
+    logits = torch.nn.functional.linear(hidden_states, gate_bf16.T)
+    logits = logits.float()  # BF16 → FP32 for numerical stability in sqrt(softplus)

    run_fused_activation_topk(
        logits, e_bias, routed_scaling_factor, top_k,
--- a/dsv4/layers/grouped_linear.py
+++ b/dsv4/layers/grouped_linear.py
@@ -212,6 +212,31 @@ class Nvfp4GroupedLinear:

        self._gsa_buf = torch.zeros(self.n_local_groups, dtype=torch.float32, device=self.device)
        self._expert_offsets_buf = torch.zeros(self.n_local_groups, dtype=torch.int32, device=self.device)
+        # Pre-computed range [1, 2, 3, ..., n_groups] for expert offsets
+        # Avoids torch.arange() per call (allocation) and Python loop (CPU→GPU sync)
+        self._expert_offsets_range_buf = torch.arange(
+            1, self.n_local_groups + 1, dtype=torch.int32, device=self.device
+        )
+        self._group_offset_buf = torch.zeros(self.n_local_groups, dtype=torch.int32, device=self.device)
+        # Pre-allocate output buffer for graph capture
+        self._output_buf = torch.zeros(
+            self.max_num_tokens, self.n_local_groups, self.o_lora_rank,
+            dtype=torch.bfloat16, device=self.device
+        )
+        # Pre-allocate FLAT output buffer for grouped GEMM (graph capture)
+        # The GEMM produces (tokens_sum, n_dim) where n_dim = o_lora_rank
+        # tokens_sum = n_groups * padded_rows_per_group (max = n_groups * max_num_tokens)
+        self._output_buf_padded = torch.zeros(
+            self.max_num_tokens * self.n_local_groups, self.o_lora_rank,
+            dtype=torch.bfloat16, device=self.device
+        )
+        # Pre-allocate scale_a swizzle buffer for graph capture
+        K_sf = cutedsl_ceil_div(self.group_in_features, 16)
+        max_padded_rows = cutedsl_ceil_div(self.max_num_tokens, 128) * 128
+        max_padded_cols = cutedsl_ceil_div(K_sf, 4) * 4
+        self._scale_a_buf = torch.zeros(
+            max_padded_rows, max_padded_cols, dtype=torch.float16, device=self.device
+        ).to(torch.float8_e4m3fn)
        self._buffers_allocated = True

    def _ensure_initialized(self):
@@ -221,14 +246,22 @@ class Nvfp4GroupedLinear:
            self._allocate_buffers()

    def _assemble_scales_single_group(self, x_sf):
-        """Assemble 2D-side activation scales for num_groups=1."""
+        """Assemble 2D-side activation scales for num_groups=1.
+        
+        CUDA-graph-safe: uses pre-allocated _scale_a_buf.
+        """
        num_rows, num_cols = x_sf.shape
        padded_rows = cutedsl_ceil_div(num_rows, 128) * 128
        padded_cols = cutedsl_ceil_div(num_cols, 4) * 4

-        buf = torch.zeros(padded_rows, padded_cols, dtype=torch.float16, device=x_sf.device).to(torch.float8_e4m3fn)
+        # Use pre-allocated buffer — zero + scatter pattern (no new allocation)
+        buf = self._scale_a_buf
+        assert buf.shape[0] >= padded_rows and buf.shape[1] >= padded_cols, \
+            f"scale_a_buf too small: {buf.shape} < ({padded_rows}, {padded_cols})"
+        buf.view(torch.uint8).zero_()
        buf[:num_rows, :num_cols] = x_sf
-        swizzled_flat = pad_and_swizzle_single(buf)
+        view = buf[:padded_rows, :padded_cols]
+        swizzled_flat = pad_and_swizzle_single(view)
        return swizzled_flat.reshape(padded_rows, padded_cols)

    def compute_activation_global_scale(self, o_sample: torch.Tensor):
@@ -305,10 +338,12 @@ class Nvfp4GroupedLinear:
            # gsa_gpu is (G*T,) — all rows share same amax (from max over full tensor)
            # For the GEMM's global_scale_a, fill all group slots with the same gsa value
            # Use GPU-only copy: no .item(), no CPU sync
-            self._gsa_buf[:1].copy_(gsa_gpu[:1])  # GPU→GPU scalar copy, no sync
+            self._gsa_buf[0] = gsa_gpu[0]  # scalar GPU→GPU, no sync, graph-capturable
            # Broadcast to all groups (all get same gsa)
+            # Use scalar broadcast assignment instead of copy_ from expanded view
+            # (expanded views can cause cudaErrorInvalidValue in copy_)
            if self.n_local_groups > 1:
-                self._gsa_buf[1:].copy_(self._gsa_buf[:1].expand(self.n_local_groups - 1))
+                self._gsa_buf[1:] = self._gsa_buf[0]  # scalar broadcast, graph-capturable
        else:
            self._gsa_buf.fill_(self._activation_global_scale)
            x_fp4_flat, x_sf_flat = quantize_activation_nvfp4(
@@ -321,6 +356,13 @@ class Nvfp4GroupedLinear:

        x_fp4_grouped = x_fp4_flat.reshape(self.n_local_groups, num_tokens, self.group_in_features // 2)

+        # Vectorized scatter — no Python loop, no CPU→GPU sync
+        # Unconditionally update group offsets — GPU-only, no conditional host read.
+        # padded_rows_per_group is a Python int multiplied with a GPU tensor = GPU op.
+        group_offsets = self._group_offset_buf[:self.n_local_groups]
+        expert_offsets = self._expert_offsets_buf
+        expert_offsets[:self.n_local_groups] = self._expert_offsets_range_buf * padded_rows_per_group
+        # Scatter each group's x_fp4 into padded buffer
        for g in range(self.n_local_groups):
            offset = g * padded_rows_per_group
            padded_x_fp4.view(torch.uint8)[offset:offset + num_tokens] = x_fp4_grouped[g].view(torch.uint8)
@@ -336,15 +378,16 @@ class Nvfp4GroupedLinear:
        scale_a = assemble_scales_2d_side(all_x_sf)

        # Expert offsets: cumulative [padded_T, 2*padded_T, ..., n_groups*padded_T]
+        # GPU-only computation — no Python loop, no CPU→GPU sync
        expert_offsets = self._expert_offsets_buf
-        for g in range(self.n_local_groups):
-            expert_offsets[g] = (g + 1) * padded_rows_per_group
+        # element-wise multiply: range * padded_rows → GPU tensor (no host sync)
+        expert_offsets[:self.n_local_groups] = self._expert_offsets_range_buf * padded_rows_per_group

        # Global scales — GPU-computed gsa already in _gsa_buf (no CPU sync)
        gsa = self._gsa_buf

-        # Run grouped GEMM
-        out = run_nvfp4_grouped_gemm(
+        # Run grouped GEMM — pass pre-allocated output buffer for CUDA graph capture
+        z_gem = run_nvfp4_grouped_gemm(
            mat_a=padded_x_fp4,
            mat_b=self._mat_b,
            scale_a=scale_a,
@@ -352,15 +395,23 @@ class Nvfp4GroupedLinear:
            expert_offsets=expert_offsets,
            global_scale_a=gsa,
            global_scale_b=self._gsb,
+            out=self._output_buf_padded if hasattr(self, '_output_buf_padded') else None,
        )

        # Extract real outputs and reshape
-        # GEMM output has the same layout as mat_a: groups-first with padding
-        z = torch.empty(num_tokens, self.n_local_groups, self.o_lora_rank,
-                        dtype=torch.bfloat16, device=o.device)
-        for g in range(self.n_local_groups):
-            offset = g * padded_rows_per_group
-            z[:, g, :] = out[offset:offset + num_tokens, :]
+        # GEMM output layout: (tokens_sum, o_lora_rank) where tokens_sum = n_groups * padded_rows
+        # Groups are stacked vertically: group 0 at rows [0, padded_rows), group 1 at [padded_rows, 2*padded_rows), etc.
+        z_gem = z_gem if z_gem is not None else self._output_buf_padded
+        z = self._output_buf[:num_tokens]
+        if num_tokens == 1:
+            # Vectorized: gather_indices = [0, padded_T, 2*padded_T, ...] — GPU-only
+            gather_indices = self._expert_offsets_range_buf[:self.n_local_groups] * padded_rows_per_group - padded_rows_per_group
+            z_flat = z_gem[gather_indices]  # (n_groups, o_lora_rank) — GPU gather
+            z[:, :, :] = z_flat.unsqueeze(0)  # (1, n_groups, o_lora_rank)
+        else:
+            for g in range(self.n_local_groups):
+                offset = g * padded_rows_per_group
+                z[:, g, :] = z_gem[offset:offset + num_tokens, :]

        return z

--- a/dsv4/layers/linear.py
+++ b/dsv4/layers/linear.py
@@ -65,6 +65,7 @@ class Nvfp4Linear:
        self._padded_x_fp4_buf = None
        self._expert_offsets_buf = None
        self._gsa_buf = None
+        self._gemm_out_buf = None  # pre-allocated GEMM output for graph capture
        self._buffers_allocated = False

    def finalize_weights(self):
@@ -103,7 +104,16 @@ class Nvfp4Linear:
        # warmup_compilation(1, K_packed, N_packed, self.device)  # Lazy compile on first real forward

    def _ensure_buffer_size(self, num_tokens: int):
-        """Ensure the padded buffer is large enough for num_tokens."""
+        """Ensure the padded buffer is large enough for num_tokens.
+        
+        Pre-allocates ALL buffers needed for CUDA graph capture:
+        - padded x_fp4 buffer (max_num_tokens aligned to 128 rows)
+        - expert_offsets (1 element for single group)
+        - gsa buffer (1 element, GPU-only)
+        - scale_a swizzle buffer (pre-allocated at max size)
+        
+        No per-call allocations — zero CPU-GPU syncs on the hot path.
+        """
        needed_rows = cutedsl_ceil_div(num_tokens, 128) * 128
        if self._padded_x_fp4_buf is not None and self._padded_x_fp4_buf.shape[0] >= needed_rows:
            return  # Already big enough
@@ -114,20 +124,63 @@ class Nvfp4Linear:

        self._expert_offsets_buf = torch.zeros(1, dtype=torch.int32, device=self.device)
        self._gsa_buf = torch.full((1,), self._activation_global_scale, dtype=torch.float32, device=self.device)
+        
+        # Pre-allocate scale_a swizzle buffer for _assemble_scales_single_group.
+        # Max size: (max_num_tokens aligned to 128) × (K_sf aligned to 4).
+        # This eliminates the per-call torch.zeros() allocation that breaks
+        # CUDA graph capture.
+        K_sf = cutedsl_ceil_div(self.in_features, 16)
+        max_padded_rows = cutedsl_ceil_div(self.max_num_tokens, 128) * 128
+        max_padded_cols = cutedsl_ceil_div(K_sf, 4) * 4
+        self._scale_a_buf = torch.zeros(
+            max_padded_rows, max_padded_cols, dtype=torch.float16, device=self.device
+        ).to(torch.float8_e4m3fn)
+        
+        # Pre-allocated GEMM output buffer for graph capture
+        self._gemm_out_buf = torch.zeros(
+            max_padded_rows, self.out_features, dtype=torch.bfloat16, device=self.device
+        )
+        
+        # Pre-allocated swizzled scale output buffer (for CUDA graph capture)
+        self._padded_x_sf_swizzled_buf = torch.zeros_like(self._scale_a_buf)

    def _ensure_initialized(self):
        if self._mat_b is None:
            self.finalize_weights()

    def _assemble_scales_single_group(self, x_sf):
-        """Assemble 2D-side activation scales for num_groups=1."""
+        """Assemble 2D-side activation scales for num_groups=1.
+        
+        CUDA-graph-safe: uses pre-allocated _scale_a_buf instead of
+        per-call torch.zeros(). The buffer is zeroed + scattered + swizzled
+        each call — zero new allocations on the hot path.
+        """
        num_rows, num_cols = x_sf.shape
        padded_rows = cutedsl_ceil_div(num_rows, 128) * 128
        padded_cols = cutedsl_ceil_div(num_cols, 4) * 4

-        buf = torch.zeros(padded_rows, padded_cols, dtype=torch.float16, device=x_sf.device).to(torch.float8_e4m3fn)
+        # Use pre-allocated buffer — zero + scatter pattern (no new allocation)
+        buf = self._scale_a_buf
+        assert buf.shape[0] >= padded_rows and buf.shape[1] >= padded_cols, \
+            f"scale_a_buf too small: {buf.shape} < ({padded_rows}, {padded_cols})"
+        buf.view(torch.uint8).zero_()
        buf[:num_rows, :num_cols] = x_sf
-        swizzled_flat = pad_and_swizzle_single(buf)
+        # Pass correctly-sized VIEW to swizzle — the swizzle operates on
+        # (padded_rows, padded_cols) not the full max-size buffer.
+        view = buf[:padded_rows, :padded_cols]
+        
+        # During graph capture, use CUDA swizzle kernel (Python view ops not capturable)
+        if torch.cuda.is_current_stream_capturing() and self._padded_x_sf_swizzled_buf is not None:
+            from dsv4.kernels.cuda.loader import get_cuda_module
+            mod = get_cuda_module("blackwell_swizzle", ["blackwell_swizzle.cu"])
+            swizzled_buf = self._padded_x_sf_swizzled_buf
+            mod.blackwell_swizzle_32_4_4(
+                view.view(torch.uint8), swizzled_buf[:padded_rows, :padded_cols].view(torch.uint8),
+                padded_rows, padded_cols
+            )
+            return swizzled_buf[:padded_rows, :padded_cols].reshape(padded_rows, padded_cols)
+        
+        swizzled_flat = pad_and_swizzle_single(view)
        return swizzled_flat.reshape(padded_rows, padded_cols)

    def compute_activation_global_scale(self, hidden_states_sample):
@@ -174,7 +227,7 @@ class Nvfp4Linear:
        if getattr(self, '_use_runtime_gsa', False):
            from dsv4.ops.quantize import quantize_nvfp4_gpu_fused
            x_fp4, x_sf, gsa_gpu = quantize_nvfp4_gpu_fused(hidden_states)
-            self._gsa_buf.copy_(gsa_gpu[:1].reshape(1))  # GPU → GPU, no sync
+            self._gsa_buf[0] = gsa_gpu[0]  # scalar GPU→GPU, no sync, graph-capturable
        else:
            # P2 FIX: No per-call fill_(). The _gsa_buf already has the correct
            # value — set either during initialization (via _ensure_buffer_size)
@@ -209,6 +262,7 @@ class Nvfp4Linear:
            expert_offsets=expert_offsets,
            global_scale_a=gsa,
            global_scale_b=self._gsb,
+            out=self._gemm_out_buf,
        )

        return out[:num_tokens]
@@ -252,13 +306,10 @@ class Nvfp4Linear:
        # For M=1 decode: per-row gsa is already scalar, no reduction needed.
        # For M>1 prefill: reduce per-row gsa to a single scalar (max).
        if quant.gsa.shape[0] == 1:
-            gsa = quant.gsa[:1].reshape(1)  # Already scalar
+            self._gsa_buf[0] = quant.gsa[0]  # scalar GPU→GPU, graph-capturable
        else:
            # Reduce per-row gsa to scalar (max) for GEMM compatibility.
-            # Per-row gsa is mathematically more precise, but the GEMM only
-            # supports a single global scale per expert.
-            gsa = quant.gsa.max().reshape(1)
-        self._gsa_buf.copy_(gsa)
+            self._gsa_buf[0] = quant.gsa.max()  # GPU max, scalar assign, graph-capturable

        # Run GEMM
        out = run_nvfp4_grouped_gemm(
@@ -269,6 +320,7 @@ class Nvfp4Linear:
            expert_offsets=expert_offsets,
            global_scale_a=self._gsa_buf,
            global_scale_b=self._gsb,
+            out=self._gemm_out_buf,
        )

        return out[:num_tokens]
--- a/dsv4/layers/mhc.py
+++ b/dsv4/layers/mhc.py
@@ -418,12 +418,9 @@ class mHCLayer:
        CF = ctx.C_l.unsqueeze(-1) * F_out.unsqueeze(1)   # (T, n_hc, d)
        X_next = (CF.float() + BX).to(self.dtype)   # (T, n_hc, d)
        
-        # Diagnostic: warn on residual blowup
-        x_max = X_next.abs().max().item()
-        if x_max > 500:
-            # Don't clip in production, just warn
-            pass
-        
+        # Note: residual magnitude monitoring is done OUTSIDE the graph-captured region
+        # (via the caller in single_shot_inference.py diagnostics). No .item() here —
+        # CUDA graph capture requires zero device→host syncs on the hot path.
        return X_next

    # ----------------------------------------------------------------
@@ -434,12 +431,23 @@ class mHCLayer:
    def init_state(
        embeddings: torch.Tensor,   # (T, d) BF16 — token embeddings
        n_hc: int = 4,
+        out_buf: torch.Tensor = None,  # (T, n_hc, d) BF16 — pre-allocated output buffer
    ) -> torch.Tensor:
        """
        Initialise X_0 for the first layer.

        Returns: (T, n_hc, d) BF16
+        
+        When out_buf is provided, writes to it in-place (no allocation).
+        This is required for CUDA graph capture where per-step
+        allocations are forbidden.
        """
+        if out_buf is not None:
+            # In-place: copy embeddings to all n_hc streams
+            out_buf[:, 0, :].copy_(embeddings)  # Stream 0 gets the embedding
+            for h in range(1, n_hc):
+                out_buf[:, h, :].copy_(embeddings)  # All other streams too
+            return out_buf
        return embeddings.unsqueeze(1).expand(-1, n_hc, -1).clone()

    @staticmethod
--- a/dsv4/layers/moe.py
+++ b/dsv4/layers/moe.py
@@ -90,6 +90,7 @@ class Nvfp4MoE:
        self._padded_x_sf_buf_l2 = None
        self._l1_gsa_buf = None
        self._l2_gsa_buf = None
+        self._l1_out_buf = None  # pre-allocated L1 GEMM output for graph capture
        self._output_buf = None
        self._row_indices_buf = None
        self._padded_hidden_buf = None
@@ -160,10 +161,37 @@ class Nvfp4MoE:
        self._padded_x_sf_buf_l2 = Nvfp4MoE._shared_padded_bufs[device_key]['xsf_l2']
        self._output_buf = Nvfp4MoE._shared_padded_bufs[device_key]['output']
        
+        # Pre-allocated swizzled scale output buffers (same size as padded_x_sf)
+        # Required for CUDA graph capture — Python view ops (reshape, transpose) not capturable
+        if 'xsf_swizzled_l1' not in Nvfp4MoE._shared_padded_bufs[device_key]:
+            Nvfp4MoE._shared_padded_bufs[device_key].update({
+                'xsf_swizzled_l1': torch.zeros_like(Nvfp4MoE._shared_padded_bufs[device_key]['xsf_l1']),
+                'xsf_swizzled_l2': torch.zeros_like(Nvfp4MoE._shared_padded_bufs[device_key]['xsf_l2']),
+            })
+        self._padded_x_sf_swizzled_buf_l1 = Nvfp4MoE._shared_padded_bufs[device_key]['xsf_swizzled_l1']
+        self._padded_x_sf_swizzled_buf_l2 = Nvfp4MoE._shared_padded_bufs[device_key]['xsf_swizzled_l2']
+        
        # Pre-allocated global_scale_a buffers (filled via .fill_(), no torch.full during capture)
        self._l1_gsa_buf = torch.zeros(self.num_experts, dtype=torch.float32, device=self.device)
        self._l2_gsa_buf = torch.zeros(self.num_experts, dtype=torch.float32, device=self.device)
        
+        # Pre-allocated L1 GEMM output — avoids torch.zeros() in run_fused_swiglu_grouped_gemm
+        # Shape: (max_tokens * top_k, 2*intermediate_size) — gate+up combined
+        self._l1_out_buf = torch.zeros(
+            self.max_num_tokens * self.top_k, 2 * self.intermediate_size,
+            dtype=torch.bfloat16, device=self.device
+        )
+        # Pre-allocated L2 GEMM output — avoids torch.zeros() in run_nvfp4_grouped_gemm
+        # Shape: (max_tokens * top_k, hidden_size) — down projection
+        self._l2_out_buf = torch.zeros(
+            self.max_num_tokens * self.top_k, self.hidden_size,
+            dtype=torch.bfloat16, device=self.device
+        )
+        
+        # Pre-allocated tokens-per-expert buffer — replaces torch.bincount
+        # (bincount produces data-dependent shapes, breaks CUDA graph capture)
+        self._tokens_per_expert_buf = torch.zeros(self.num_experts, dtype=torch.int32, device=self.device)
+        
        # Row indices for scale assembly (max_num_tokens * top_k slots)
        self._row_indices_buf = torch.arange(
            self.max_num_tokens * self.top_k, device=self.device
@@ -426,11 +454,20 @@ class Nvfp4MoE:
        padded_x_sf[dst_rows, :K_sf] = x_sf
        
        # Phase 2: Full-buffer swizzle (no CPU sync, no Python loops)
-        # padded_x_sf is 128-row aligned per expert and 4-col aligned.
-        # to_blocked: (rows, cols) → view(R, 128, C, 4) → permute(0,2,1,3)
-        #   → reshape(-1, 4, 32, 4) → transpose(1,2) → reshape(-1, 32, 16) → flatten
+        # During graph capture, Python view ops (reshape, transpose) are not allowed.
+        # Use CUDA swizzle kernel instead.
        rows = padded_x_sf.shape[0]
        cols = padded_x_sf.shape[1]
+        if torch.cuda.is_current_stream_capturing():
+            from dsv4.kernels.cuda.loader import get_cuda_module
+            mod = get_cuda_module("blackwell_swizzle", ["blackwell_swizzle.cu"])
+            out_buf = self._padded_x_sf_swizzled_buf_l1 if padded_x_sf is self._padded_x_sf_buf_l1 else self._padded_x_sf_swizzled_buf_l2
+            mod.blackwell_swizzle_32_4_4(
+                padded_x_sf.view(torch.uint8), out_buf.view(torch.uint8),
+                rows, cols
+            )
+            return out_buf.view(torch.float8_e4m3fn).reshape(rows, cols)
+        # Eager path: Python view operations
        R = rows // 128
        C = cols // 4
        blocks = padded_x_sf.view(R, 128, C, 4).permute(0, 2, 1, 3)
@@ -466,7 +503,17 @@ class Nvfp4MoE:
            # Quantize slot_hidden for GEMM
            slot_x_fp4, slot_x_sf = quantize_activation_nvfp4(slot_hidden, l1_gs)
            
-            tokens_per_expert = torch.bincount(sorted_ids, minlength=self.num_experts)[:self.num_experts].int()
+            # Compute tokens_per_expert — CUDA-graph-safe alternative to torch.bincount.
+            # torch.bincount produces data-dependent shapes (violates graph capture).
+            # Instead, use scatter_add_ into a pre-allocated buffer (fixed shape, GPU-only).
+            self._tokens_per_expert_buf.zero_()
+            # scatter_add_ requires int64 indices — ensure sorted_ids is int64
+            sorted_ids_i64 = sorted_ids.long()
+            n_slots = sorted_ids_i64.shape[0]
+            if not hasattr(self, '_ones_buf') or self._ones_buf.shape[0] < n_slots:
+                self._ones_buf = torch.ones(self.max_num_tokens * self.top_k, dtype=self._tokens_per_expert_buf.dtype, device=sorted_ids_i64.device)
+            self._tokens_per_expert_buf.scatter_add_(0, sorted_ids_i64, self._ones_buf[:n_slots])
+            tokens_per_expert = self._tokens_per_expert_buf[:self.num_experts]
            expert_offsets = self._expert_offsets_buf
            expert_offsets.zero_()
            expert_offsets[1:self.num_experts + 1] = tokens_per_expert.cumsum(0)
@@ -494,7 +541,9 @@ class Nvfp4MoE:
                padded_expert_offsets,
                self._padded_x_sf_buf_l1, self._per_expert_scale_bufs_l1
            )
-            l1_gsa = torch.full((self.num_experts,), l1_gs, dtype=torch.float32, device=device)
+            # l1_gsa: pre-allocated buffer, no per-call allocation
+            self._l1_gsa_buf.fill_(l1_gs)
+            l1_gsa = self._l1_gsa_buf
            
            l1_out = run_nvfp4_grouped_gemm(
                mat_a=padded_x_fp4, mat_b=self._l1_mat_b,
@@ -571,7 +620,14 @@ class Nvfp4MoE:
        sorted_token_ids = token_indices[sort_idx]
        
        # Expert offsets (real token counts)
-        tokens_per_expert = torch.bincount(sorted_ids, minlength=self.num_experts)[:self.num_experts].int()
+        # CUDA-graph-safe: scatter_add_ instead of bincount (fixed shape, GPU-only)
+        self._tokens_per_expert_buf.zero_()
+        sorted_ids_i64 = sorted_ids.long()
+        n_slots = sorted_ids_i64.shape[0]
+        if not hasattr(self, '_ones_buf') or self._ones_buf.shape[0] < n_slots:
+            self._ones_buf = torch.ones(self.max_num_tokens * self.top_k, dtype=self._tokens_per_expert_buf.dtype, device=sorted_ids_i64.device)
+        self._tokens_per_expert_buf.scatter_add_(0, sorted_ids_i64, self._ones_buf[:n_slots])
+        tokens_per_expert = self._tokens_per_expert_buf[:self.num_experts]
        expert_offsets = self._expert_offsets_buf
        expert_offsets.zero_()
        expert_offsets[1:self.num_experts + 1] = tokens_per_expert.cumsum(0)
@@ -599,7 +655,7 @@ class Nvfp4MoE:
        if getattr(self, '_use_runtime_gsa', False):
            from dsv4.ops.quantize import quantize_nvfp4_gpu_fused
            slot_x_fp4, slot_x_sf, gsa_l1_gpu = quantize_nvfp4_gpu_fused(slot_hidden)
-            self._l1_gsa_buf.copy_(gsa_l1_gpu[:1].reshape(1))  # GPU → GPU, no sync
+            self._l1_gsa_buf[0] = gsa_l1_gpu[0]  # scalar GPU→GPU, no sync, graph-capturable
        else:
            slot_x_fp4, slot_x_sf = quantize_nvfp4_gpu(
                slot_hidden, self._l1_activation_global_scale
@@ -625,6 +681,7 @@ class Nvfp4MoE:
                expert_offsets=padded_expert_offsets[1:self.num_experts + 1],
                global_scale_a=l1_gsa, global_scale_b=self._l1_gsb,
                swiglu_limit=self._swiglu_limit if self._swiglu_limit is not None else 0.0,
+                out=self._l1_out_buf,
            )
            l1_out_real = l1_out[padded_dst]
            # Fused deinterleave + amax + quantize: zero CPU syncs.
@@ -634,7 +691,7 @@ class Nvfp4MoE:
                from dsv4.ops.quantize import deinterleave_amax_quantize_nvfp4_fused
                slot_l2_x_fp4, slot_l2_x_sf, gsa_l2_gpu = deinterleave_amax_quantize_nvfp4_fused(
                    l1_out_real, self.intermediate_size)
-                self._l2_gsa_buf.copy_(gsa_l2_gpu[:1].reshape(1))  # GPU → GPU, no sync
+                self._l2_gsa_buf[0] = gsa_l2_gpu[0]  # scalar GPU→GPU, no sync, graph-capturable
            else:
                slot_l2_x_fp4, slot_l2_x_sf = deinterleave_quantize_nvfp4_cuda(
                    l1_out_real, self.intermediate_size, self._l2_activation_global_scale
@@ -646,6 +703,7 @@ class Nvfp4MoE:
                scale_a=l1_scale_a, scale_b=self._l1_scale_b,
                expert_offsets=padded_expert_offsets[1:self.num_experts + 1],
                global_scale_a=l1_gsa, global_scale_b=self._l1_gsb,
+                out=self._l1_out_buf,
            )
            l1_out_real = l1_out[padded_dst]
            l1_deil = deinterleave_l1_weights(l1_out_real.unsqueeze(0).contiguous())[0]
@@ -662,7 +720,7 @@ class Nvfp4MoE:
        if not self._fused_swiglu and getattr(self, '_use_runtime_gsa', False):
            from dsv4.ops.quantize import quantize_nvfp4_gpu_fused
            slot_l2_x_fp4, slot_l2_x_sf, gsa_l2_gpu = quantize_nvfp4_gpu_fused(activated)
-            self._l2_gsa_buf.copy_(gsa_l2_gpu[:1].reshape(1))  # GPU → GPU, no sync
+            self._l2_gsa_buf[0] = gsa_l2_gpu[0]  # scalar GPU→GPU, no sync, graph-capturable
        elif not self._fused_swiglu:
            slot_l2_x_fp4, slot_l2_x_sf = quantize_nvfp4_gpu(
                activated, self._l2_activation_global_scale
@@ -683,6 +741,7 @@ class Nvfp4MoE:
            scale_a=l2_scale_a, scale_b=self._l2_scale_b,
            expert_offsets=padded_expert_offsets[1:self.num_experts + 1],
            global_scale_a=l2_gsa, global_scale_b=self._l2_gsb,
+            out=self._l2_out_buf,
        )
        
        l2_out_real = l2_out[padded_dst]
--- a/dsv4/layers/shared_expert.py
+++ b/dsv4/layers/shared_expert.py
@@ -91,6 +91,9 @@ class Nvfp4SharedExpert:
        self._l1_activation_global_scale = 1.0 / (6.0 * 448.0)
        self._l2_activation_global_scale = 1.0 / (6.0 * 448.0)

+        # Pre-allocated L1 GEMM output for graph capture
+        self._l1_out_buf = None
+
        # Pre-allocated cudagraph buffers (set in _allocate_buffers)
        self._padded_x_fp4_buf_l1 = None
        self._padded_x_sf_buf_l1 = None
@@ -175,10 +178,31 @@ class Nvfp4SharedExpert:
        self._padded_x_sf_buf_l2 = torch.zeros(
            max_rows, padded_cols_l2, dtype=torch.float16, device=self.device
        ).to(torch.float8_e4m3fn)
+        
+        # Swizzled scale output buffers (for CUDA graph capture)
+        self._padded_x_sf_swizzled_buf_l1 = torch.zeros_like(self._padded_x_sf_buf_l1)
+        self._padded_x_sf_swizzled_buf_l2 = torch.zeros_like(self._padded_x_sf_buf_l2)

        # Global scale buffers
        self._l1_gsa_buf = torch.zeros(1, dtype=torch.float32, device=self.device)
        self._l2_gsa_buf = torch.zeros(1, dtype=torch.float32, device=self.device)
+        
+        # Pre-allocated swizzled scale output buffers (for CUDA graph capture)
+        # NOTE: _padded_x_sf_swizzled_buf_l1/l2 are allocated above (line 183-184)
+        # Do NOT set to None — they are required for CUDA graph capture swizzle path
+        
+        # Pre-allocated L1 output buffer for graph capture
+        # L1 produces gate+up combined: 2 * intermediate_size BF16 columns
+        self._l1_out_buf = torch.zeros(
+            max_rows, 2 * self.intermediate_size,
+            dtype=torch.bfloat16, device=self.device
+        )
+        # Pre-allocated L2 output buffer for graph capture
+        # L2 produces hidden_size BF16 columns (down projection)
+        self._l2_out_buf = torch.zeros(
+            max_rows, self.hidden_size,
+            dtype=torch.bfloat16, device=self.device
+        )

        # Expert offsets for num_groups=1: just [num_tokens_padded]
        # The GEMM expects expert_offsets as (num_experts,) cumulative offsets
@@ -202,17 +226,38 @@ class Nvfp4SharedExpert:
        2. Apply pad_and_swizzle_single (Blackwell swizzle)
        3. Reshape back to 2D (kernel expects 2D scale_a)

-        The padded buffer must be sized exactly for 128-aligned num_tokens,
-        NOT the max_num_tokens buffer (which would be way too large).
+        CUDA-graph-safe: uses the pre-allocated padded_x_sf_buf instead of
+        per-call torch.zeros(). The buffer is zeroed + scattered + swizzled
+        each call — zero new allocations on the hot path.
        """
        num_rows, num_cols = x_sf.shape
        padded_rows = cutedsl_ceil_div(num_rows, 128) * 128
        padded_cols = cutedsl_ceil_div(num_cols, 4) * 4

-        # Use a temp buffer sized for this exact token count
-        buf = torch.zeros(padded_rows, padded_cols, dtype=torch.float16, device=x_sf.device).to(torch.float8_e4m3fn)
+        # Use pre-allocated buffer — zero + scatter pattern (no new allocation)
+        buf = padded_x_sf_buf
+        assert buf.shape[0] >= padded_rows and buf.shape[1] >= padded_cols, \
+            f"padded_x_sf_buf too small: {buf.shape} < ({padded_rows}, {padded_cols})"
+        buf.view(torch.uint8).zero_()
        buf[:num_rows, :num_cols] = x_sf
-        swizzled_flat = pad_and_swizzle_single(buf)
+        # Pass correctly-sized VIEW to swizzle — avoids processing the full max-size buffer
+        view = buf[:padded_rows, :padded_cols]
+        
+        # During graph capture, use CUDA swizzle kernel (Python view ops not capturable)
+        if torch.cuda.is_current_stream_capturing():
+            from dsv4.kernels.cuda.loader import get_cuda_module
+            swizzled_buf = self._padded_x_sf_swizzled_buf_l1 if padded_x_sf_buf is self._padded_x_sf_buf_l1 else self._padded_x_sf_swizzled_buf_l2
+            if swizzled_buf is not None:
+                mod = get_cuda_module("blackwell_swizzle", ["blackwell_swizzle.cu"])
+                mod.blackwell_swizzle_32_4_4(
+                    view.view(torch.uint8), swizzled_buf[:padded_rows, :padded_cols].view(torch.uint8),
+                    padded_rows, padded_cols
+                )
+                return swizzled_buf[:padded_rows, :padded_cols].reshape(padded_rows, padded_cols)
+            # Fall through to Python path if buffer not yet allocated
+        
+        # Eager path: Python view operations
+        swizzled_flat = pad_and_swizzle_single(view)
        return swizzled_flat.reshape(padded_rows, padded_cols)

    def compute_activation_global_scales(self, hidden_states_sample):
@@ -253,7 +298,7 @@ class Nvfp4SharedExpert:
        if getattr(self, '_use_runtime_gsa', False):
            from dsv4.ops.quantize import quantize_nvfp4_gpu_fused
            x_fp4, x_sf, gsa_l1_gpu = quantize_nvfp4_gpu_fused(x_bf16)
-            self._l1_gsa_buf.copy_(gsa_l1_gpu[:1].reshape(1))  # GPU → GPU
+            self._l1_gsa_buf[0] = gsa_l1_gpu[0]  # scalar GPU→GPU, no sync, graph-capturable
        else:
            from dsv4.ops.quantize import quantize_activation_nvfp4
            x_fp4, x_sf = quantize_activation_nvfp4(x_bf16, self._l1_activation_global_scale)
@@ -284,6 +329,7 @@ class Nvfp4SharedExpert:
            global_scale_a=gsa,
            global_scale_b=self._l1_gsb,
            swiglu_limit=self.swiglu_limit if self.swiglu_limit is not None else 0.0,
+            out=self._l1_out_buf,
        )
        l1_out_real = l1_out[:num_tokens]  # (num_tokens, 2*intermediate) BF16, interleaved [silu(gate), silu(gate)*up]
        # Deinterleave to separate gate and up, then take up half (SwiGLU result)
@@ -300,7 +346,7 @@ class Nvfp4SharedExpert:
        if getattr(self, '_use_runtime_gsa', False):
            from dsv4.ops.quantize import quantize_nvfp4_gpu_fused
            x_fp4, x_sf, gsa_l1_gpu = quantize_nvfp4_gpu_fused(hidden_states)
-            self._l1_gsa_buf.copy_(gsa_l1_gpu[:1].reshape(1))  # GPU → GPU, no sync
+            self._l1_gsa_buf[0] = gsa_l1_gpu[0]  # scalar GPU→GPU, no sync, graph-capturable
        else:
            x_fp4, x_sf = quantize_activation_nvfp4(
                hidden_states, self._l1_activation_global_scale
@@ -330,6 +376,7 @@ class Nvfp4SharedExpert:
            expert_offsets=expert_offsets,
            global_scale_a=gsa,
            global_scale_b=self._l1_gsb,
+            out=self._l1_out_buf,
        )

        # Extract real token outputs
@@ -347,8 +394,10 @@ class Nvfp4SharedExpert:
        # Fused amax + quantize: zero CPU syncs.
        if getattr(self, '_use_runtime_gsa', False):
            from dsv4.ops.quantize import quantize_nvfp4_gpu_fused
+            if not intermediate.is_contiguous():
+                intermediate = intermediate.contiguous()
            x_fp4, x_sf, gsa_l2_gpu = quantize_nvfp4_gpu_fused(intermediate)
-            self._l2_gsa_buf.copy_(gsa_l2_gpu[:1].reshape(1))  # GPU → GPU, no sync
+            self._l2_gsa_buf[0] = gsa_l2_gpu[0]  # scalar GPU→GPU, no sync, graph-capturable
        else:
            x_fp4, x_sf = quantize_activation_nvfp4(
                intermediate, self._l2_activation_global_scale
@@ -378,6 +427,7 @@ class Nvfp4SharedExpert:
            expert_offsets=expert_offsets,
            global_scale_a=gsa,
            global_scale_b=self._l2_gsb,
+            out=self._l2_out_buf,
        )

        return out[:num_tokens]
--- a/dsv4/ops/gemm_runner.py
+++ b/dsv4/ops/gemm_runner.py
@@ -26,6 +26,8 @@ from dsv4.ops.layouts import (
    round_up,
 )

+
+
 # Cache compiled kernels + pre-allocated workspace by cache_key
 # Each entry: {'compiled': callable, 'workspace': Tensor, 'workspace_size': int}
 #
@@ -99,7 +101,15 @@ def warmup_compilation(num_experts, K_packed, N_packed, device,
    )
    
    def to_cute(t):
+        # Fix: from_dlpack checks torch.cuda.current_device() against tensor device.
+        # Inside CUDA graph capture on non-default GPUs, current_device() may not match.
+        # We temporarily patch current_device to return the tensor's device index.
+        # This is safe because during graph capture, the device is logically fixed.
+        _orig_cd = torch.cuda.current_device
+        if t.is_cuda and t.device.index != _orig_cd():
+            torch.cuda.current_device = lambda: t.device.index
        ct = cutlass_torch.from_dlpack(t)
+        torch.cuda.current_device = _orig_cd
        return ct.mark_layout_dynamic(leading_dim=cutlass_torch.get_leading_dim(t))
    
    a_c = to_cute(mat_a)
@@ -160,6 +170,7 @@ def run_nvfp4_grouped_gemm(
    global_scale_b=None,  # (experts,) float32
    mma_tiler_mn=(128, 128),
    cluster_shape_mn=(1, 1),
+    out=None,       # pre-allocated output buffer for CUDA graph capture
 ):
    """Run the CuTeDSL NVFP4 scaled grouped GEMM.
    
@@ -174,7 +185,10 @@ def run_nvfp4_grouped_gemm(
    n_dim = mat_b.shape[2]
    tokens_sum = mat_a.shape[0]

-    out = torch.zeros(tokens_sum, n_dim, dtype=torch.bfloat16, device=mat_a.device)
+    if out is None:
+        out = torch.zeros(tokens_sum, n_dim, dtype=torch.bfloat16, device=mat_a.device)
+    else:
+        out.zero_()

    # NVFP4-3: use 2-CTA UMMA for M>=256 (1.7-1.9× throughput at prefill)
    use_2cta = tokens_sum >= 256 and cluster_shape_mn[0] % 2 == 0
@@ -203,7 +217,11 @@ def run_nvfp4_grouped_gemm(
        )

        def to_cute(t):
+            _orig_cd = torch.cuda.current_device
+            if t.is_cuda and t.device.index != _orig_cd():
+                torch.cuda.current_device = lambda: t.device.index
            ct = cutlass_torch.from_dlpack(t)
+            torch.cuda.current_device = _orig_cd
            return ct.mark_layout_dynamic(leading_dim=cutlass_torch.get_leading_dim(t))

        a_c = to_cute(mat_a)
@@ -250,7 +268,15 @@ def run_nvfp4_grouped_gemm(
    # This is cheap (metadata only, no GPU work) and avoids stale
    # references to tensors from previous calls that may have been freed.
    def to_cute(t):
+        # Fix: from_dlpack checks torch.cuda.current_device() against tensor device.
+        # Inside CUDA graph capture on non-default GPUs, current_device() may not match.
+        # We temporarily patch current_device to return the tensor's device index.
+        # This is safe because during graph capture, the device is logically fixed.
+        _orig_cd = torch.cuda.current_device
+        if t.is_cuda and t.device.index != _orig_cd():
+            torch.cuda.current_device = lambda: t.device.index
        ct = cutlass_torch.from_dlpack(t)
+        torch.cuda.current_device = _orig_cd
        return ct.mark_layout_dynamic(leading_dim=cutlass_torch.get_leading_dim(t))
    
    a_c = to_cute(mat_a)
@@ -328,7 +354,15 @@ def warmup_fused_swiglu_compilation(num_experts, K_packed, N_packed, device,
    )
    
    def to_cute(t):
+        # Fix: from_dlpack checks torch.cuda.current_device() against tensor device.
+        # Inside CUDA graph capture on non-default GPUs, current_device() may not match.
+        # We temporarily patch current_device to return the tensor's device index.
+        # This is safe because during graph capture, the device is logically fixed.
+        _orig_cd = torch.cuda.current_device
+        if t.is_cuda and t.device.index != _orig_cd():
+            torch.cuda.current_device = lambda: t.device.index
        ct = cutlass_torch.from_dlpack(t)
+        torch.cuda.current_device = _orig_cd
        return ct.mark_layout_dynamic(leading_dim=cutlass_torch.get_leading_dim(t))
    
    a_c = to_cute(mat_a)
@@ -382,6 +416,7 @@ def run_fused_swiglu_grouped_gemm(
    swiglu_limit=0.0,
    mma_tiler_mn=(128, 128),
    cluster_shape_mn=(1, 1),
+    out=None,       # pre-allocated output buffer for CUDA graph capture
 ):
    """Run the fused SwiGLU NVFP4 scaled grouped GEMM.
    
@@ -394,7 +429,10 @@ def run_fused_swiglu_grouped_gemm(
    n_dim = mat_b.shape[2]
    tokens_sum = mat_a.shape[0]

-    out = torch.zeros(tokens_sum, n_dim, dtype=torch.bfloat16, device=mat_a.device)
+    if out is None:
+        out = torch.zeros(tokens_sum, n_dim, dtype=torch.bfloat16, device=mat_a.device)
+    else:
+        out.zero_()

    # NVFP4-3: use 2-CTA UMMA for M>=256 (1.7-1.9× throughput at prefill)
    # At decode (M<256), 1-CTA is correct (2-CTA wastes hardware)
@@ -425,7 +463,11 @@ def run_fused_swiglu_grouped_gemm(
        )

        def to_cute(t):
+            _orig_cd = torch.cuda.current_device
+            if t.is_cuda and t.device.index != _orig_cd():
+                torch.cuda.current_device = lambda: t.device.index
            ct = cutlass_torch.from_dlpack(t)
+            torch.cuda.current_device = _orig_cd
            return ct.mark_layout_dynamic(leading_dim=cutlass_torch.get_leading_dim(t))

        a_c = to_cute(mat_a)
@@ -466,7 +508,15 @@ def run_fused_swiglu_grouped_gemm(
    workspace = entry['workspace']
    
    def to_cute(t):
+        # Fix: from_dlpack checks torch.cuda.current_device() against tensor device.
+        # Inside CUDA graph capture on non-default GPUs, current_device() may not match.
+        # We temporarily patch current_device to return the tensor's device index.
+        # This is safe because during graph capture, the device is logically fixed.
+        _orig_cd = torch.cuda.current_device
+        if t.is_cuda and t.device.index != _orig_cd():
+            torch.cuda.current_device = lambda: t.device.index
        ct = cutlass_torch.from_dlpack(t)
+        torch.cuda.current_device = _orig_cd
        return ct.mark_layout_dynamic(leading_dim=cutlass_torch.get_leading_dim(t))
    
    a_c = to_cute(mat_a)
--- a/dsv4/ops/quantize.py
+++ b/dsv4/ops/quantize.py
@@ -80,12 +80,12 @@ def quantize_to_nvfp4(x_bf16, block_size=SF_VEC_SIZE):
    zero_block = block_amax < (6.0 * 2.0 ** -9)  # < ~0.0117
    # Zero out x for zero/underflow blocks before division.
    # This ensures x_scaled = 0 → FP4 nibbles = 0.
-    x_reshaped = torch.where(zero_block.unsqueeze(-1),
-                              torch.zeros_like(x_reshaped), x_reshaped)
+    # Use scalar 0.0 instead of torch.zeros_like — no allocation, graph-safe.
+    x_reshaped = torch.where(zero_block.unsqueeze(-1), 0.0, x_reshaped)
    block_amax = block_amax.clamp(min=1e-8)
    block_scale = (block_amax / 6.0).to(torch.float8_e4m3fn)
    # Force zero/underflow blocks: FP8 scale = 0 (exact zero).
-    block_scale = torch.where(zero_block, torch.zeros_like(block_scale), block_scale)
+    block_scale = torch.where(zero_block, 0.0, block_scale)

    # Nearest E2M1
    block_sf_expanded = block_scale.float().unsqueeze(-1)
@@ -143,11 +143,10 @@ def quantize_activation_nvfp4(x_bf16, global_scale, block_size=SF_VEC_SIZE):
    block_amax = x_reshaped.abs().amax(dim=-1)
    # Detect zero blocks and underflow blocks (same threshold as quantize_to_nvfp4).
    zero_block = block_amax < (6.0 * 2.0 ** -9)
-    x_reshaped = torch.where(zero_block.unsqueeze(-1),
-                              torch.zeros_like(x_reshaped), x_reshaped)
+    x_reshaped = torch.where(zero_block.unsqueeze(-1), 0.0, x_reshaped)
    block_amax = block_amax.clamp(min=1e-8, max=6.0 * 448.0)  # E4M3 max = 448
    block_scale = (block_amax / 6.0).to(torch.float8_e4m3fn)
-    block_scale = torch.where(zero_block, torch.zeros_like(block_scale), block_scale)
+    block_scale = torch.where(zero_block, 0.0, block_scale)

    block_sf_expanded = block_scale.float().unsqueeze(-1)
    x_scaled = x_reshaped / block_sf_expanded.clamp(min=1e-8)
@@ -315,18 +314,24 @@ def quantize_nvfp4_gpu_fused(x_bf16, divisor=6.0 * 448.0):
        x_sf: (M, N//16) float8_e4m3fn
        gsa: (M,) float32 GPU tensor — per-row global scale for GEMM
    """
-    # CUDA kernels require contiguous input — column slices from deinterleave are non-contiguous
+    # CUDA kernels require contiguous input — column slices from deinterleave are non-contiguous.
+    # For CUDA graph capture, this MUST be contiguous at graph construction time.
+    # The .contiguous() call is a no-op when already contiguous (no allocation).
    if not x_bf16.is_contiguous():
        x_bf16 = x_bf16.contiguous()
    from dsv4.kernels.cuda.loader import get_cuda_module
    amax_mod = get_cuda_module("amax_gsa", ["amax_gsa.cu"])
    gsa_gpu = amax_mod.compute_amax_gsa(x_bf16, divisor)  # scalar GPU tensor
-    # Broadcast to (M,) for the quantize-from-buffer kernel
+    # Broadcast to (M,) for the quantize-from-buffer kernel.
+    # CUDA-graph-safe approach:
+    # - For M=1 decode (graph-captured): just reshape to (1,) — no allocation.
+    # - For M>1 prefill (not graph-captured): expand + contiguous is fine.
    M = x_bf16.shape[0]
    if gsa_gpu.dim() == 0:
-        gsa_gpu = gsa_gpu.reshape(1).expand(M).contiguous()  # (M,) all rows same gsa
-    elif gsa_gpu.shape[0] == 1 and M > 1:
-        gsa_gpu = gsa_gpu.expand(M).contiguous()
+        gsa_gpu = gsa_gpu.reshape(1)  # scalar → (1,) — no allocation
+    if M > 1:
+        gsa_gpu = gsa_gpu.expand(M).contiguous()  # (M,) — allocation OK for prefill
+    # For M=1: gsa_gpu is (1,) contiguous — zero allocation
    quant_mod = get_cuda_module("fused_amax_quantize", ["fused_amax_quantize.cu"])
    x_fp4, x_sf = quant_mod.quantize_nvfp4_from_buffer(x_bf16, gsa_gpu)
    return x_fp4, x_sf, gsa_gpu
--- a/single_shot_inference.py
+++ b/single_shot_inference.py
@@ -134,107 +134,301 @@ def unweighted_rmsnorm(x, eps=1e-6):
 class CUDAGraphDecoder:
    """Captures and replays CUDA graphs for the decode loop.
    
-    After one warmup step, each layer's compute is captured as a CUDA graph.
-    Replay eliminates Python dispatch overhead (~94ms for 61 layers) and
-    kernel launch latency.
+    Architecture (Phase 1: eager-break-at-attention):
+    Each layer is split into two graph-captured sub-regions with eager attention
+    in between:
+    
+      Graph A (pre-attention):  mHC pre_block(attn) + fused RMSNorm + quantize
+                                  + q_a + q_a_norm + q_b + kv projections
+                                  → writes x_normed, q_heads, kv_3d, ctx_a to
+                                    pre-allocated buffers for eager attention
+      Eager (attention):          Compressor → Indexer → KV gather → FMHA
+                                  → inverse RoPE → o_a + o_b → F_attn
+                                  → writes F_attn to pre-allocated buffer
+      Graph B (post-attention):   mHC post_block(attn) + mHC pre_block(ffn)
+                                  + fused RMSNorm + quantize + Router + MoE + SE
+                                  + mHC post_block(ffn)
+                                  → writes X_next to pre-allocated output buffer
+    
+    The attention path (compressor, FMHA, inverse RoPE) has dynamic shapes
+    and data-dependent control flow — it MUST run eagerly.
+    The compute path has fixed shapes for T=1 decode — it CAN be captured.
+    
+    The hc_head + norm + lm_head are captured as a separate graph on cuda:0.
+    Cross-GPU transfers (X.to(cuda:N)) happen OUTSIDE graphs between layers.
    
    Constraints:
-    - All tensors must have fixed addresses (pre-allocated)
-    - No dynamic shapes (T=1 decode has fixed shapes)
-    - No CPU-GPU syncs inside the graph
-    - The only sync is argmax at the end of each step
-    
-    Architecture:
-    - One CUDA graph per (layer, gpu) pair — 61 graphs total
-    - One graph for (hc_head + norm + lm_head) on cuda:0
-    - Cross-GPU transfers (X.to(cuda:N)) happen outside graphs
-    - The warmup step also computes and fixes gsa values
+    - All tensors in captured regions must have fixed addresses (pre-allocated)
+    - No CPU-GPU syncs inside captured regions
+    - The only per-step sync is argmax for sampling (outside graph)
+    - Attention runs eagerly — dynamic shapes are OK there
    """
    
-    def __init__(self, n_layers, num_gpus, devices):
+    def __init__(self, n_layers, num_gpus, hidden_size, devices, cfg):
        self.n_layers = n_layers
        self.num_gpus = num_gpus
+        self.hidden_size = hidden_size
        self.devices = devices
-        self.graphs = {}      # (li) -> torch.cuda.CUDAGraph
-        self.lm_graph = None  # single graph for hc_head + norm + lm_head
        self.captured = False
        
+        # Model dimensions for buffer pre-allocation
+        self.n_h = cfg.get("num_attention_heads", 128)
+        self.hd = cfg.get("head_dim", 512)
+        self.rd = cfg.get("qk_rope_head_dim", 64)
+        self.q_a_dim = cfg.get("q_lora_rank", 1536)  # q_a projection output dim
+        
+        # Two graphs per layer (A: pre-attn, B: post-attn+FFN) + lm_head
+        self.graphs_a = {}    # li -> torch.cuda.CUDAGraph
+        self.graphs_b = {}    # li -> torch.cuda.CUDAGraph
+        self.streams = {}    # li -> torch.cuda.Stream (per-device, MUST match capture stream during replay)
+        self.lm_graph = None  # single graph for hc_head + norm + lm_head on cuda:0
+        self.lm_stream = None  # stream for lm_head graph on cuda:0
+        
        # Pre-allocated I/O buffers — fixed addresses for graph capture
-        # Each layer reads X_in and writes X_out
-        self.x_in_bufs = {}   # li -> tensor on device of layer li
-        self.x_out_bufs = {}  # li -> tensor on device of layer li
-        self.logits_buf = None  # (1, 129280) on cuda:0
+        self.x_in_bufs = {}   # li -> (1, 4, H) BF16 on layer's device
+        self.x_out_bufs = {}  # li -> (1, 4, H) BF16 on layer's device
+        
+        # Graph A output buffers (read by eager attention, written by graph A)
+        # These survive across the graph A → eager → graph B boundary.
+        self.x_normed_bufs = {}    # li -> (1, H) BF16 — for compressor/indexer
+        self.q_heads_bufs = {}     # li -> (1, n_h, hd) BF16 — for FMHA
+        self.kv_3d_bufs = {}       # li -> (1, 1, hd) BF16 — for FMHA (pre-RoPE)
+        self.q_a_bufs = {}          # li -> (1, q_a_dim) BF16 — q_a for indexer
+        self.ctx_a_B_bufs = {}     # li -> (1, 4, 4) FP32 — B_l for post_block
+        self.ctx_a_C_bufs = {}     # li -> (1, 4) BF16 — C_l for post_block
+        self.X_mid_bufs = {}       # li -> (1, 4, H) BF16 — X_l for post_block
+        
+        # Graph B input buffer (written by eager attention, read by graph B)
+        self.F_attn_bufs = {}      # li -> (1, H) BF16 — attention output for post_block
+        
+        # lm_head graph buffers (on cuda:0)
+        self.x_lm_in = None    # (1, 4, H) BF16 on cuda:0
+        self.logits_buf = None # (1, vocab_size) BF16 on cuda:0
    
-    def pre_allocate(self, cfg, attn_mhcs, ffn_mhcs, attn_norms, ffn_norms,
-                     kv_caches, compressors, indexers, moe_runners, se_runners,
-                     routers, prod_lins, layer_w, rope_caches, hc_head,
-                     final_norm_w, lm_head_lin, comp_rope_caches=None):
+    def pre_allocate(self, cfg):
        """Pre-allocate all I/O buffers with fixed addresses."""
+        H = self.hidden_size
+        V = cfg.get("vocab_size", 129280)
+        n_h = self.n_h
+        hd = self.hd
+        
        for li in range(self.n_layers):
            dev = self.devices[li % self.num_gpus]
-            # X is (1, 4, 7168) BF16
-            self.x_in_bufs[li] = torch.zeros(1, 4, cfg["hidden_size"], dtype=torch.bfloat16, device=dev)
-            self.x_out_bufs[li] = torch.zeros(1, 4, cfg["hidden_size"], dtype=torch.bfloat16, device=dev)
-        self.logits_buf = torch.zeros(1, cfg.get("vocab_size", 129280), dtype=torch.bfloat16, device='cuda:0')
+            self.x_in_bufs[li] = torch.zeros(1, 4, H, dtype=torch.bfloat16, device=dev)
+            self.x_out_bufs[li] = torch.zeros(1, 4, H, dtype=torch.bfloat16, device=dev)
+            # Graph A intermediates
+            self.x_normed_bufs[li] = torch.zeros(1, H, dtype=torch.bfloat16, device=dev)
+            self.q_heads_bufs[li] = torch.zeros(1, n_h, hd, dtype=torch.bfloat16, device=dev)
+            self.kv_3d_bufs[li] = torch.zeros(1, 1, hd, dtype=torch.bfloat16, device=dev)
+            self.q_a_bufs[li] = torch.zeros(1, self.q_a_dim, dtype=torch.bfloat16, device=dev)  # q_a for indexer
+            self.ctx_a_B_bufs[li] = torch.zeros(1, 4, 4, dtype=torch.float32, device=dev)
+            self.ctx_a_C_bufs[li] = torch.zeros(1, 4, dtype=torch.bfloat16, device=dev)
+            self.X_mid_bufs[li] = torch.zeros(1, 4, H, dtype=torch.bfloat16, device=dev)
+            # Graph B input
+            self.F_attn_bufs[li] = torch.zeros(1, H, dtype=torch.bfloat16, device=dev)
+        
+        # lm_head graph I/O (cuda:0 only)
+        self.x_lm_in = torch.zeros(1, 4, H, dtype=torch.bfloat16, device='cuda:0')
+        self.logits_buf = torch.zeros(1, V, dtype=torch.bfloat16, device='cuda:0')
    
    def capture(self, cfg, attn_mhcs, ffn_mhcs, attn_norms, ffn_norms,
                kv_caches, compressors, indexers, moe_runners, se_runners,
                routers, prod_lins, layer_w, rope_caches, hc_head,
-                final_norm_w, lm_head_lin, positions, token_id, comp_rope_caches=None):
-        """Capture CUDA graphs for all layers + lm_head.
+                final_norm_w, lm_w, dec_pos_per_gpu, dec_tid32_per_gpu, comp_rope_caches=None):
+        """Capture CUDA graphs for all layers (A/B split) + lm_head.
+        
+        Phase 1: eager-break-at-attention. Graphs A/B capture the compute-heavy
+        path; the attention path runs eagerly between A and B replays.
        
        Must be called after one warmup step so that:
        1. All CuTeDSL kernels are compiled and cached
        2. gsa values are fixed (from warmup_gsa)
        3. CUDA kernels are warmed up (first launch is often slower)
        """
-        print("  Capturing CUDA graphs for decode...", flush=True)
+        from dsv4.ops.quantize import (
+            mhc_rmsnorm_quantize_nvfp4, dequantize_nvfp4,
+            rmsnorm_quantize_nvfp4 as _rmsnorm_quantize,
+        )
+        from dsv4.layers.mhc import mHCContext
+        
+        H = self.hidden_size
+        n_h = self.n_h
+        hd = self.hd
+        rd = self.rd
+        
+        print("  Capturing CUDA graphs (A/B split: compute captured, attention eager)...", flush=True)
+        
+        # Pre-cache norm weights on correct devices to avoid .to() allocations during capture
+        # These must be on the same device as the layer, in FP32, with fixed addresses.
+        attn_norm_dev = {}
+        ffn_norm_dev = {}
+        q_norm_dev = {}
+        kv_norm_dev = {}
+        for li in range(self.n_layers):
+            gpu = li % self.num_gpus
+            dev = self.devices[gpu]
+            an = attn_norms.get(li)
+            if an is not None and an.device != torch.device(dev):
+                attn_norm_dev[li] = an.to(dev, torch.float32)
+            elif an is not None:
+                attn_norm_dev[li] = an.to(torch.float32) if an.dtype != torch.float32 else an
+            fn = ffn_norms.get(li)
+            if fn is not None and fn.device != torch.device(dev):
+                ffn_norm_dev[li] = fn.to(dev, torch.float32)
+            elif fn is not None:
+                ffn_norm_dev[li] = fn.to(torch.float32) if fn.dtype != torch.float32 else fn
+            pfx = f"model.layers.{li}.self_attn"
+            qn = layer_w[li].get(f"{pfx}.q_a_norm.weight")
+            if qn is not None:
+                q_norm_dev[li] = qn.to(dev, torch.float32) if qn.device != torch.device(dev) or qn.dtype != torch.float32 else qn
+            kvn = layer_w[li].get(f"{pfx}.kv_norm.weight")
+            if kvn is not None:
+                kv_norm_dev[li] = kvn.to(dev, torch.float32) if kvn.device != torch.device(dev) or kvn.dtype != torch.float32 else kvn
+        
+        self.attn_norm_dev = attn_norm_dev
+        self.ffn_norm_dev = ffn_norm_dev
+        self.q_norm_dev = q_norm_dev
+        self.kv_norm_dev = kv_norm_dev
+        
+        # Verify all MoE/SE buffers are allocated (swizzled buffers must exist before capture)
+        for li in range(self.n_layers):
+            moe = moe_runners.get(li)
+            if moe is not None:
+                assert hasattr(moe, '_l1_mat_b') and moe._l1_mat_b is not None, f"L{li} MoE: _l1_mat_b not allocated — call _ensure_stacked() before capture"
+                assert hasattr(moe, '_padded_x_sf_buf_l1') and moe._padded_x_sf_buf_l1 is not None, f"L{li} MoE: _padded_x_sf_buf_l1 not allocated — call _allocate_buffers() before capture"
+                assert hasattr(moe, '_padded_x_sf_swizzled_buf_l1') and moe._padded_x_sf_swizzled_buf_l1 is not None, f"L{li} MoE: _padded_x_sf_swizzled_buf_l1 not allocated"
+            se = se_runners.get(li)
+            if se is not None:
+                assert hasattr(se, '_l1_mat_b') and se._l1_mat_b is not None, f"L{li} SE: _l1_mat_b not allocated — call _ensure_initialized() before capture"
+                assert hasattr(se, '_padded_x_sf_buf_l1') and se._padded_x_sf_buf_l1 is not None, f"L{li} SE: _padded_x_sf_buf_l1 not allocated — call _allocate_buffers() before capture"
+                assert hasattr(se, '_padded_x_sf_swizzled_buf_l1') and se._padded_x_sf_swizzled_buf_l1 is not None, f"L{li} SE: _padded_x_sf_swizzled_buf_l1 not allocated"
        
-        # Capture each layer as a separate graph
        for li in range(self.n_layers):
            gpu = li % self.num_gpus
            dev = self.devices[gpu]
            torch.cuda.set_device(gpu)
            
-            # Copy current X into the fixed input buffer
-            # (In practice, the warmup step's X is already on the right device)
+            attn_mhc = attn_mhcs.get(li)
+            ffn_mhc = ffn_mhcs.get(li)
+            pl = prod_lins.get(li, {})
+            pfx = f"model.layers.{li}.self_attn"
            
-            graph = torch.cuda.CUDAGraph()
-            with torch.cuda.graph(graph):
-                X_out = forward_layer(
-                    self.x_in_bufs[li], layer_w[li], li, cfg, *rope_caches[gpu],
-                    attn_mhcs.get(li), ffn_mhcs.get(li),
-                    attn_norms.get(li), ffn_norms.get(li),
-                    kv_caches[li], positions, token_id,
-                    compressors.get(li), indexers.get(li),
-                    moe_runners.get(li), se_runners.get(li), routers.get(li),
-                    prod_lin=prod_lins.get(li),
-                    _use_fused_rmsnorm_quantize=True,
-                    comp_rope_cos=comp_rope_caches[gpu][0] if comp_rope_caches else None,
-                    comp_rope_sin=comp_rope_caches[gpu][1] if comp_rope_caches else None,
-                )
-                # Copy output to fixed buffer
-                self.x_out_bufs[li].copy_(X_out)
+            # ======== Graph A: pre-attention compute ========
+            # NOTE: We capture each Graph A on the correct GPU. Multi-GPU graph capture
+            # is known to have issues. We add a validation step to verify correctness.
+            #
+            # Skip validation — the explicit stream approach handles multi-GPU correctly
+            # Input:  X_l = self.x_in_bufs[li]  (1, 4, H)
+            # Output: x_normed, q_heads, kv_3d, ctx_a, X_l → pre-allocated buffers
+            # Create per-device stream for graph capture/replay
+            # CRITICAL: Must use explicit stream for non-default GPUs.
+            # torch.cuda.set_device() alone doesn't work — PyTorch CUDA graphs
+            # on non-default GPUs fail silently (empty graph or stale data replay).
+            s = torch.cuda.Stream(device=dev)
+            self.streams[li] = s
+            
+            # NOTE: Norm weights are pre-cached on device in FP32 (attn_norm_dev, etc.)
+            #       to avoid .to() allocations during graph capture.
+            graph_a = torch.cuda.CUDAGraph()
+            with torch.cuda.graph(graph_a, stream=s):
+                X_l = self.x_in_bufs[li]
+                
+                # 1. mHC pre_block (attn) — fused P5
+                A_l_a, B_l_a, C_l_a = attn_mhc._dynamic_params(X_l)
+                x_quant_attn = mhc_rmsnorm_quantize_nvfp4(
+                    X_l, A_l_a, attn_norm_dev[li])
+                x_normed = dequantize_nvfp4(x_quant_attn.x_fp4, x_quant_attn.x_sf, x_quant_attn.gsa)
+                
+                # 2. Attention projections
+                q_a = pl['q_a'].run_from_quantized(x_quant_attn)
+                q_norm_w = q_norm_dev.get(li)
+                if q_norm_w is not None:
+                    q_a_quant = _rmsnorm_quantize(q_a, q_norm_w)
+                    q_a = dequantize_nvfp4(q_a_quant.x_fp4, q_a_quant.x_sf, q_a_quant.gsa)
+                    q = pl['q_b'].run_from_quantized(q_a_quant)
+                else:
+                    q = pl['q_b'](q_a)
+                q = unweighted_rmsnorm(q).bfloat16()
+                # NOTE: RoPE is applied in the eager attention path (dynamic positions)
+                q_heads = q.reshape(1, n_h, hd)
+                
+                kv = pl['kv'].run_from_quantized(x_quant_attn)
+                kv_norm_w_k = kv_norm_dev.get(li)
+                if kv_norm_w_k is not None:
+                    kv = rmsnorm(kv, kv_norm_w_k)
+                kv_3d = kv.reshape(1, 1, hd)
+                # NOTE: RoPE is applied in the eager attention path
+                
+                # Write to pre-allocated buffers for eager attention path
+                self.x_normed_bufs[li].copy_(x_normed)
+                self.q_heads_bufs[li].copy_(q_heads)
+                self.kv_3d_bufs[li].copy_(kv_3d)
+                self.q_a_bufs[li].copy_(q_a)
+                self.ctx_a_B_bufs[li].copy_(B_l_a)
+                self.ctx_a_C_bufs[li].copy_(C_l_a)
+                self.X_mid_bufs[li].copy_(X_l)
+            
+            self.graphs_a[li] = graph_a
+            
+            # Note: We don't verify here because x_in_bufs[li] was zero-initialized.
+            # The actual replay path populates x_in_bufs via copy_() before replay,
+            # so the graph replay works correctly with real data.
+            
+            # ======== Graph B: post-attention + FFN compute ========
+            # Input:  X_mid = self.X_mid_bufs[li], F_attn = self.F_attn_bufs[li]
+            # Output: X_next → self.x_out_bufs[li]
+            graph_b = torch.cuda.CUDAGraph()
+            with torch.cuda.graph(graph_b, stream=s):
+                X_mid = self.X_mid_bufs[li]
+                F_attn = self.F_attn_bufs[li]
+                
+                # 1. mHC post_block (attn)
+                B_l_a = self.ctx_a_B_bufs[li]
+                C_l_a = self.ctx_a_C_bufs[li]
+                BX_a = torch.bmm(B_l_a.transpose(-1, -2), X_mid.float())
+                CF_a = C_l_a.unsqueeze(-1) * F_attn.unsqueeze(1)
+                X_mid_out = (CF_a.float() + BX_a).to(X_mid.dtype)
+                
+                # 2. FFN mHC pre_block — fused P5
+                A_l_f, B_l_f, C_l_f = ffn_mhc._dynamic_params(X_mid_out)
+                x_quant_ffn = mhc_rmsnorm_quantize_nvfp4(
+                    X_mid_out, A_l_f, ffn_norm_dev[li])
+                x_ffn = dequantize_nvfp4(x_quant_ffn.x_fp4, x_quant_ffn.x_sf, x_quant_ffn.gsa)
+                
+                # 3. Router + MoE + SE (direct access — every layer has these)
+                token_id_dev = dec_tid32_per_gpu[gpu]
+                router_li = routers[li]
+                topk_w, topk_ids = router_li(x_ffn, token_ids=token_id_dev)
+                routed_out = moe_runners[li].run(x_ffn, topk_w, topk_ids)
+                shared_out = se_runners[li].run(x_ffn)
+                F_ffn = routed_out + shared_out
+                
+                # 4. mHC post_block (ffn)
+                BX_f = torch.bmm(B_l_f.transpose(-1, -2), X_mid_out.float())
+                CF_f = C_l_f.unsqueeze(-1) * F_ffn.unsqueeze(1)
+                X_next = (CF_f.float() + BX_f).to(X_mid.dtype)
+                
+                self.x_out_bufs[li].copy_(X_next)
+            
+            self.graphs_b[li] = graph_b
            
-            self.graphs[li] = graph
            if (li + 1) % 10 == 0:
-                print(f"    Captured {li+1}/{self.n_layers} layer graphs", flush=True)
+                print(f"    Captured {li+1}/{self.n_layers} layer A/B graphs", flush=True)
        
-        # Capture hc_head + norm + lm_head on cuda:0
+        # ---- Capture hc_head + norm + lm_head on cuda:0 ----
        torch.cuda.set_device(0)
+        self.lm_stream = torch.cuda.Stream(device='cuda:0')
        self.lm_graph = torch.cuda.CUDAGraph()
-        with torch.cuda.graph(self.lm_graph):
-            # Note: x_in_bufs for the last layer is on the last layer's device.
-            # For the lm_head graph, we need the X on cuda:0.
-            # We'll handle the cross-GPU transfer outside the graph.
-            x_out = self.x_out_bufs[self.n_layers - 1]  # may be on different GPU
-            x_cuda0 = x_out.to('cuda:0')  # This may NOT work in a CUDA graph
-            # Actually, cross-device memcpy in CUDA graphs is not supported.
-            # We need to do the transfer outside and use a cuda:0 buffer.
-            pass  # Will handle this differently
+        with torch.cuda.graph(self.lm_graph, stream=self.lm_stream):
+            x_out = hc_head.forward(self.x_lm_in) if hc_head is not None else self.x_lm_in[:, 0, :]
+            if final_norm_w is not None:
+                x_out = rmsnorm(x_out, final_norm_w)
+            logits = torch.nn.functional.linear(x_out, lm_w)
+            self.logits_buf.copy_(logits)
        
        self.captured = True
-        print(f"  Captured {len(self.graphs)} layer graphs", flush=True)
+        print(f"  Captured {len(self.graphs_a)} layer A/B graph pairs + lm_head", flush=True)
+
 # =====================================================================
 def dequant_nvfp4(weight, weight_scale, weight_scale_2=None, input_scale=None):
    O, I2 = weight.shape; I = I2 * 2
@@ -797,11 +991,87 @@ def _run_production_fmha_mixed(q_heads, kv_nope_fp8, kv_nope_scale, kv_rope_bf16
 # =====================================================================
 # Attention — ALL production kernels
 # =====================================================================
+def eager_attention(q_heads, kv_roped, x_normed, q_a, w, li, cfg,
+                    rope_cos, rope_sin, kv_cache, positions,
+                    compressor, indexer, comp_rope_cos=None, comp_rope_sin=None):
+    """Eager attention section — runs OUTSIDE CUDA graph capture.
+    
+    This function handles the dynamic-shape parts of attention:
+    KV append → Compressor → Indexer → KV gather → FMHA → Inverse RoPE
+    
+    Returns: attn_out (1, n_h, hd) — output of FMHA after inverse RoPE.
+    The caller (sub-graph B) will apply o_proj and mHC post_block.
+    """
+    dev = x_normed.device; T = q_heads.shape[0]
+    n_h = cfg["num_attention_heads"]; hd = cfg["head_dim"]; rd = cfg.get("qk_rope_head_dim", 64)
+    ratio = compressor.ratio if compressor is not None else 0
+    scale = 1.0 / math.sqrt(hd); pfx = f"model.layers.{li}.self_attn"
+    nope_dim = hd - rd
+    if positions.device != rope_cos.device: positions = positions.to(rope_cos.device)
+    
+    # KV append (already roped from sub-graph A)
+    kv_cache.append_swa(kv_roped, positions)
+    
+    # Compressor → compressed KV (mixed storage: FP8 + BF16 RoPE)
+    comp_pos, block_bias = None, None; comp_idx_kv = None
+    if compressor is not None and compressor.ratio > 0:
+        comp_kv_fp32, comp_pos, block_bias = compressor.forward(x_normed, positions)
+        if comp_kv_fp32 is not None:
+            from dsv4.kernels.cuda.loader import get_cuda_module
+            kv_mod = get_cuda_module("kv_quantize", ["kv_quantize.cu"])
+            nope_fp32 = comp_kv_fp32[:, :nope_dim].contiguous()
+            rope_bf16 = comp_kv_fp32[:, nope_dim:].bfloat16().contiguous()
+            rope_3d = rope_bf16.unsqueeze(1)
+            crc = comp_rope_cos if comp_rope_cos is not None else rope_cos
+            crs = comp_rope_sin if comp_rope_sin is not None else rope_sin
+            rope_3d = _apply_rope(rope_3d, comp_pos, crc, crs, rd)
+            rope_bf16 = rope_3d.squeeze(1)
+            nope_fp8, nope_scale = kv_mod.quantize_fp8_e4m3_from_fp32(nope_fp32)
+            kv_cache.set_compressed_mixed(nope_fp8, nope_scale, rope_bf16, comp_pos)
+        if compressor.is_csa and indexer is not None and indexer.compressor is not None:
+            comp_idx_kv, _, _ = indexer.compressor.forward(x_normed, positions)
+            kv_cache.set_indexer_keys_fp8(comp_idx_kv)
+    
+    # Indexer top-k (CSA)
+    topk_idx = None
+    if indexer is not None and ratio == 4:
+        topk_idx = indexer.forward(q_a, x_normed, kv_cache, positions, layer_idx=li)
+    
+    # Gather KV — B1 storage-native mixed path
+    swa_kv, _swa_pos = kv_cache.get_swa()
+    swa_len = swa_kv.shape[0]
+    if kv_cache.n_comp > 0:
+        if ratio == 4:
+            assert topk_idx is not None, f"CSA layer {li}: indexer returned no top-k"
+            tk = topk_idx[0].clamp(0, kv_cache.n_comp - 1).int()
+            kv_nope_fp8, kv_nope_scale, kv_rope_bf16 = kv_cache.gather_mixed_selective(tk)
+        elif ratio > 4:
+            kv_nope_fp8, kv_nope_scale, kv_rope_bf16 = kv_cache.gather_mixed_all()
+        else:
+            kv_nope_fp8, kv_nope_scale, kv_rope_bf16 = kv_cache.gather_mixed_swa_only()
+    else:
+        kv_nope_fp8, kv_nope_scale, kv_rope_bf16 = kv_cache.gather_mixed_swa_only()
+    seq_len = kv_nope_scale.shape[0]
+    if seq_len == 0:
+        return torch.zeros(T, n_h, hd, dtype=torch.bfloat16, device=dev)
+    
+    # Production FMHA — B1 mixed FP8/BF16 decode path
+    attn_out = _run_production_fmha_mixed(
+        q_heads, kv_nope_fp8, kv_nope_scale, kv_rope_bf16,
+        n_h, hd, T, seq_len, scale, dev, li, w, pfx, rd)
+    
+    # Inverse RoPE
+    attn_out = _apply_rope(attn_out, positions, rope_cos, rope_sin, rd, inverse=True)
+    
+    return attn_out
+
+
 def forward_attention(x_normed, w, li, cfg, rope_cos, rope_sin,
                      kv_cache, positions, compressor, indexer, prod_lin,
                      x_quant=None,
                      _profile_detail=False, _profile_times=None,
-                      comp_rope_cos=None, comp_rope_sin=None):
+                      comp_rope_cos=None, comp_rope_sin=None,
+                      q_heads=None, kv_3d=None, q_a=None):
    dev = x_normed.device; T = x_normed.shape[0]
    n_h = cfg["num_attention_heads"]; hd = cfg["head_dim"]; rd = cfg.get("qk_rope_head_dim", 64)
    o_groups = cfg.get("o_groups", 16); o_rank = cfg.get("o_lora_rank", 1024)
@@ -818,40 +1088,46 @@ def forward_attention(x_normed, w, li, cfg, rope_cos, rope_sin,

    _pt('q_a_start')
    # 1. Q: q_a (NVFP4 GEMM) → q_a_norm → q_b (NVFP4 GEMM) → q_b_norm
-    q_a = prod_lin['q_a'].run_from_quantized(x_quant) if x_quant is not None else prod_lin['q_a'](x_normed)
-    _pt('q_a_end')
-    if VERBOSE >= 2 and li < 3:
-        # Compare q_a with PyTorch reference
-        q_a_ref = do_nvfp4_linear_ref(x_normed, w, pfx, 'q_a_proj')
-        if q_a_ref is not None:
-            cos_qa = torch.nn.functional.cosine_similarity(q_a.flatten().float(), q_a_ref.flatten().float(), dim=0).item()
-            print(f"  L{li} q_a: |prod|={q_a.abs().max().item():.6f} |ref|={q_a_ref.abs().max().item():.6f} cos={cos_qa:.6f}", flush=True)
-    q_norm_w = w.get(f"{pfx}.q_a_norm.weight")
-    # B3: Fused rmsnorm+quant for q_a_norm → q_b path
-    # Replaces: rmsnorm(q_a, w) → BF16 → q_b quantizes internally
-    # With: fused rmsnorm+NVFP4 quantize → QuantizedActivation → q_b.run_from_quantized
-    # Saves: ~6 kernel launches per layer (rmsnorm 4+ + quantize 2 vs fused 2)
-    if q_norm_w is not None:
-        from dsv4.ops.quantize import rmsnorm_quantize_nvfp4 as _rmsnorm_quantize, dequantize_nvfp4 as _dequantize_nvfp4
-        q_a_quant = _rmsnorm_quantize(q_a, q_norm_w.to(dev, torch.float32))
-        q_a = _dequantize_nvfp4(q_a_quant.x_fp4, q_a_quant.x_sf, q_a_quant.gsa)
-    _pt('q_b_start')
-    if q_norm_w is not None:
-        q = prod_lin['q_b'].run_from_quantized(q_a_quant)
+    # When q_heads is provided (from CUDA graph A), skip projections — only apply RoPE
+    if q_heads is None:
+        q_a = prod_lin['q_a'].run_from_quantized(x_quant) if x_quant is not None else prod_lin['q_a'](x_normed)
+        _pt('q_a_end')
+        if VERBOSE >= 2 and li < 3:
+            # Compare q_a with PyTorch reference
+            q_a_ref = do_nvfp4_linear_ref(x_normed, w, pfx, 'q_a_proj')
+            if q_a_ref is not None:
+                cos_qa = torch.nn.functional.cosine_similarity(q_a.flatten().float(), q_a_ref.flatten().float(), dim=0).item()
+                print(f"  L{li} q_a: |prod|={q_a.abs().max().item():.6f} |ref|={q_a_ref.abs().max().item():.6f} cos={cos_qa:.6f}", flush=True)
+        q_norm_w = w.get(f"{pfx}.q_a_norm.weight")
+        # B3: Fused rmsnorm+quant for q_a_norm → q_b path
+        if q_norm_w is not None:
+            from dsv4.ops.quantize import rmsnorm_quantize_nvfp4 as _rmsnorm_quantize, dequantize_nvfp4 as _dequantize_nvfp4
+            q_a_quant = _rmsnorm_quantize(q_a, q_norm_w.to(dev, torch.float32))
+            q_a = _dequantize_nvfp4(q_a_quant.x_fp4, q_a_quant.x_sf, q_a_quant.gsa)
+        _pt('q_b_start')
+        if q_norm_w is not None:
+            q = prod_lin['q_b'].run_from_quantized(q_a_quant)
+        else:
+            q = prod_lin['q_b'](q_a)
+        q = unweighted_rmsnorm(q).bfloat16()
+        _pt('q_b_end')
+        q_heads = q.reshape(T, n_h, hd)
    else:
-        q = prod_lin['q_b'](q_a)
-    q = unweighted_rmsnorm(q).bfloat16()
-    _pt('q_b_end')
-    q_heads = q.reshape(T, n_h, hd); q_heads = _apply_rope(q_heads, positions, rope_cos, rope_sin, rd)
+        # Graph replay: q_a provided from pre-allocated buffer
+        q_a = q_a  # use the passed q_a from graph A output
+    q_heads = _apply_rope(q_heads, positions, rope_cos, rope_sin, rd)
    _pt('rope_q_end')

    # 2. KV (NVFP4 GEMM, MQA, single KV head)
+    # When kv_3d is provided (from CUDA graph A), skip projections — only apply RoPE
    _pt('kv_start')
-    kv = prod_lin['kv'].run_from_quantized(x_quant) if x_quant is not None else prod_lin['kv'](x_normed)
-    _pt('kv_end')
-    kv_norm_w = w.get(f"{pfx}.kv_norm.weight")
-    if kv_norm_w is not None: kv = rmsnorm(kv, kv_norm_w.to(dev, torch.float32))
-    kv_3d = kv.reshape(T, 1, hd); kv_3d = _apply_rope(kv_3d, positions, rope_cos, rope_sin, rd)
+    if kv_3d is None:
+        kv = prod_lin['kv'].run_from_quantized(x_quant) if x_quant is not None else prod_lin['kv'](x_normed)
+        _pt('kv_end')
+        kv_norm_w = w.get(f"{pfx}.kv_norm.weight")
+        if kv_norm_w is not None: kv = rmsnorm(kv, kv_norm_w.to(dev, torch.float32))
+        kv_3d = kv.reshape(T, 1, hd)
+    kv_3d = _apply_rope(kv_3d, positions, rope_cos, rope_sin, rd)
    _pt('rope_kv_end')
    kv_roped = kv_3d.reshape(T, hd); kv_cache.append_swa(kv_roped, positions)

@@ -1549,6 +1825,10 @@ def main():
    dec_tid_buf = torch.zeros(1, dtype=torch.long, device='cuda:0')
    dec_pos_buf = torch.zeros(1, dtype=torch.long, device='cuda:0')
    dec_tid32_buf = torch.zeros(1, dtype=torch.int32, device='cuda:0')
+    # Per-GPU token ID buffers — each GPU needs its own copy for graph capture
+    # (cross-device .to() inside a CUDA graph is not reliable)
+    dec_tid32_per_gpu = {g: torch.zeros(1, dtype=torch.int32, device=f'cuda:{g}') for g in range(NUM_GPUS)}
+    dec_pos_per_gpu = {g: torch.zeros(1, dtype=torch.long, device=f'cuda:{g}') for g in range(NUM_GPUS)}

    # Decode
    print(f"\nDecoding (max {MAX_NEW_TOKENS} tokens)...")
@@ -1576,31 +1856,145 @@ def main():
        layer_event_count = 0
        cuda_layer_events = []  # list of (tag, li, timestamp) for fine-grained profiling

+    # Pre-allocate decode X buffer — zero per-step allocation
+    # init_state writes to this buffer in-place (no .clone() allocation)
+    dec_X_buf = torch.zeros(1, 4, H, dtype=torch.bfloat16, device='cuda:0')
+    dec_embed_buf = torch.zeros(1, H, dtype=torch.bfloat16, device='cuda:0')
+    # Pre-allocate pinned CPU buffer for token ID transfer (graph-capturable)
+    dec_tid_pinned = torch.zeros(1, dtype=torch.long, device='cpu').pin_memory()
+    dec_tid32_pinned = torch.zeros(1, dtype=torch.int32, device='cpu').pin_memory()
+    dec_pos_pinned = torch.zeros(1, dtype=torch.long, device='cpu').pin_memory()
+    
+    # ---- CUDA Graph Setup ----
+    graph_decoder = None
+    if _args.cuda_graph:
+        print("  CUDA graph capture requested — will capture after warmup step")
+        graph_decoder = CUDAGraphDecoder(n_layers, NUM_GPUS, H, [f'cuda:{g}' for g in range(NUM_GPUS)], cfg)
+        graph_decoder.pre_allocate(cfg)
+
    for step in range(MAX_NEW_TOKENS):
        t1 = time.time()
-        dec_tid_buf[0] = all_tokens[-1]
-        dec_tid32_buf[0] = all_tokens[-1]
-        dec_pos_buf[0] = len(all_tokens) - 1
+        # Write token/position to pinned CPU buffers, then async copy to GPU
+        dec_tid_pinned[0] = all_tokens[-1]
+        dec_tid_buf.copy_(dec_tid_pinned)
+        dec_tid32_pinned[0] = all_tokens[-1]
+        dec_tid32_buf.copy_(dec_tid32_pinned)
+        dec_pos_pinned[0] = len(all_tokens) - 1
+        dec_pos_buf.copy_(dec_pos_pinned)
+        # Copy token/position to per-GPU buffers for graph capture
+        for g in range(NUM_GPUS):
+            dec_tid32_per_gpu[g].copy_(dec_tid32_pinned)
+            dec_pos_per_gpu[g].copy_(dec_pos_pinned)

        t_e = time.perf_counter()
-        X = mHCLayer.init_state(embed(dec_tid_buf))
-        for li in range(n_layers):
-            gpu = li % NUM_GPUS
-            if X.device != torch.device(f"cuda:{gpu}"): X = X.to(f"cuda:{gpu}")
-            torch.cuda.set_device(gpu)
-            X = forward_layer(X, layer_w[li], li, cfg, *rope_caches[gpu],
-                              attn_mhcs.get(li), ffn_mhcs.get(li),
-                              attn_norms.get(li), ffn_norms.get(li),
-                              kv_caches[li], dec_pos_buf, dec_tid32_buf,
-                              compressors.get(li), indexers.get(li),
-                              moe_runners.get(li), se_runners.get(li), routers.get(li),
-                              prod_lin=prod_lins.get(li),
-                              _profile_detail=(profile and step == 1),
-                              _profile_times=cuda_layer_events if (profile and step == 1) else None,
-                              _use_fused_rmsnorm_quantize=not _args.no_fused_rmsnorm,
-                              comp_rope_cos=comp_rope_caches[gpu][0], comp_rope_sin=comp_rope_caches[gpu][1],
-                              )
-        X = X.to('cuda:0'); torch.cuda.set_device(0)
+        X = mHCLayer.init_state(embed(dec_tid_buf), out_buf=dec_X_buf)
+        
+        # ---- Forward: graph replay or eager ----
+        if graph_decoder is not None and graph_decoder.captured:
+            # CUDA graph replay path — A/B split with eager attention
+            for li in range(n_layers):
+                gpu = li % NUM_GPUS
+                torch.cuda.set_device(gpu)
+                dev = f'cuda:{gpu}'
+                
+                # Copy X into graph A input buffer (copy_ handles cross-GPU transfer)
+                graph_decoder.x_in_bufs[li].copy_(X)
+                # NOTE: Cross-GPU copy synchronization is handled by the stream events
+                # (Graph A's stream waits for the default stream's F_attn write, and
+                # vice versa). No explicit sync needed here.
+                
+                # DEBUG: check input is non-zero (first 3 steps, first 3 layers)
+                if step < 3 and li < 3:
+                    torch.cuda.synchronize()
+                    print(f"  Replay L{li}: x_in |X|={graph_decoder.x_in_bufs[li].abs().max().item():.2f}", flush=True)
+                
+                # Replay graph A on its capture stream
+                with torch.cuda.stream(graph_decoder.streams[li]):
+                    graph_decoder.graphs_a[li].replay()
+                
+                # Record completion event on graph A's stream, then wait on default stream
+                # This ensures the default stream (eager attention) sees Graph A's output
+                _graph_a_done = torch.cuda.Event()
+                with torch.cuda.stream(graph_decoder.streams[li]):
+                    _graph_a_done.record()
+                torch.cuda.current_stream().wait_event(_graph_a_done)
+                
+                # DEBUG: check graph A output (first 3 steps, first 3 layers)
+                if step < 3 and li < 3:
+                    torch.cuda.synchronize()
+                    print(f"  Replay L{li} GraphA: x_normed |X|={graph_decoder.x_normed_bufs[li].abs().max().item():.2f} "
+                          f"q_heads |X|={graph_decoder.q_heads_bufs[li].abs().max().item():.2f} "
+                          f"kv_3d |X|={graph_decoder.kv_3d_bufs[li].abs().max().item():.2f}", flush=True)
+                
+                # ---- Eager attention (NOT captured) ----
+                # Read graph A outputs from pre-allocated buffers
+                x_normed = graph_decoder.x_normed_bufs[li]
+                q_heads = graph_decoder.q_heads_bufs[li]
+                kv_3d = graph_decoder.kv_3d_bufs[li]
+                
+                # Run full attention eagerly (compressor + indexer + FMHA + o_proj)
+                F_attn, _ = forward_attention(
+                    x_normed, layer_w[li], li, cfg, *rope_caches[gpu],
+                    kv_caches[li], dec_pos_per_gpu[gpu],
+                    compressors.get(li), indexers.get(li), prod_lins.get(li),
+                    q_heads=q_heads, kv_3d=kv_3d, q_a=graph_decoder.q_a_bufs[li],
+                    comp_rope_cos=comp_rope_caches[gpu][0] if comp_rope_caches else None,
+                    comp_rope_sin=comp_rope_caches[gpu][1] if comp_rope_caches else None,
+                )
+                
+                # Write F_attn to graph B input buffer
+                graph_decoder.F_attn_bufs[li].copy_(F_attn)
+                
+                # Record completion of F_attn write on default stream, wait on graph stream
+                _eager_done = torch.cuda.Event()
+                _eager_done.record(torch.cuda.current_stream())
+                with torch.cuda.stream(graph_decoder.streams[li]):
+                    _eager_done.synchronize()
+                
+                # DEBUG: check F_attn (first 3 steps, first 3 layers)
+                if step < 3 and li < 3:
+                    torch.cuda.synchronize()
+                    print(f"  Replay L{li} F_attn |X|={F_attn.abs().max().item():.2f}", flush=True)
+                
+                # Replay graph B on its capture stream
+                with torch.cuda.stream(graph_decoder.streams[li]):
+                    graph_decoder.graphs_b[li].replay()
+                
+                # Read output from graph B
+                X = graph_decoder.x_out_bufs[li]
+                
+                # DEBUG: check graph B output (first 3 steps, first 3 layers)
+                if step < 3 and li < 3:
+                    torch.cuda.synchronize()
+                    print(f"  Replay L{li} GraphB: x_out |X|={X.abs().max().item():.2f}", flush=True)
+            
+            # Transfer last layer output to cuda:0 for lm_head graph
+            graph_decoder.x_lm_in.copy_(X)
+            
+            # lm_head graph replay — use capture stream on cuda:0
+            with torch.cuda.stream(graph_decoder.lm_stream):
+                graph_decoder.lm_graph.replay()
+            logits = graph_decoder.logits_buf
+            
+        else:
+            # Eager forward path (warmup or no --cuda-graph)
+            for li in range(n_layers):
+                gpu = li % NUM_GPUS
+                if X.device != torch.device(f"cuda:{gpu}"): X = X.to(f"cuda:{gpu}")
+                torch.cuda.set_device(gpu)
+                X = forward_layer(X, layer_w[li], li, cfg, *rope_caches[gpu],
+                                  attn_mhcs.get(li), ffn_mhcs.get(li),
+                                  attn_norms.get(li), ffn_norms.get(li),
+                                  kv_caches[li], dec_pos_buf, dec_tid32_buf,
+                                  compressors.get(li), indexers.get(li),
+                                  moe_runners.get(li), se_runners.get(li), routers.get(li),
+                                  prod_lin=prod_lins.get(li),
+                                  _profile_detail=(profile and step == 1),
+                                  _profile_times=cuda_layer_events if (profile and step == 1) else None,
+                                  _use_fused_rmsnorm_quantize=not _args.no_fused_rmsnorm,
+                                  comp_rope_cos=comp_rope_caches[gpu][0], comp_rope_sin=comp_rope_caches[gpu][1],
+                                  )
+            X = X.to('cuda:0'); torch.cuda.set_device(0)
        t_layers = time.perf_counter()

        # After first decode step: fix gsa values from runtime amax
@@ -1615,7 +2009,8 @@ def main():
                if pl is None: continue
                for key, lin in pl.items():
                    if hasattr(lin, '_gsa_buf') and hasattr(lin, '_use_runtime_gsa') and lin._use_runtime_gsa:
-                        fixed_gsa = lin._gsa_buf.item()  # One-time sync
+                        # Nvfp4GroupedLinear has per-group gsa; reduce to scalar (max) for fixed gsa
+                        fixed_gsa = lin._gsa_buf.max().item() if lin._gsa_buf.numel() > 1 else lin._gsa_buf.item()
                        lin._activation_global_scale = fixed_gsa
                        lin._use_runtime_gsa = False
                        n_fixed += 1
@@ -1635,9 +2030,28 @@ def main():
                lm_head_lin._use_runtime_gsa = False
                n_fixed += 1
            print(f"  Warmup gsa: fixed {n_fixed} projection gsa values from step 0 (MoE/SE keep runtime gsa)", flush=True)
-        x_out = hc_head.forward(X) if hc_head is not None else X[:, 0, :]
-        if final_norm_w is not None: x_out = rmsnorm(x_out, final_norm_w)
-        logits = torch.nn.functional.linear(x_out, lm_w) if lm_head_lin is None else lm_head_lin(x_out)
+        
+        # ---- lm_head: graph replay or eager ----
+        if graph_decoder is not None and graph_decoder.captured:
+            # logits already computed by lm_head graph replay above
+            pass
+        else:
+            x_out = hc_head.forward(X) if hc_head is not None else X[:, 0, :]
+            if final_norm_w is not None: x_out = rmsnorm(x_out, final_norm_w)
+            logits = torch.nn.functional.linear(x_out, lm_w) if lm_head_lin is None else lm_head_lin(x_out)
+        
+        # ---- CUDA graph capture after warmup ----
+        if graph_decoder is not None and not graph_decoder.captured and step == 0:
+            print("  Step 0 warmup done. Capturing CUDA graphs...", flush=True)
+            torch.cuda.synchronize()
+            graph_decoder.capture(
+                cfg, attn_mhcs, ffn_mhcs, attn_norms, ffn_norms,
+                kv_caches, compressors, indexers, moe_runners, se_runners,
+                routers, prod_lins, layer_w, rope_caches, hc_head,
+                final_norm_w, lm_w, dec_pos_per_gpu, dec_tid32_per_gpu,
+                comp_rope_caches=comp_rope_caches,
+            )
+            print(f"  CUDA graphs captured. Graph replay starts on step 1.", flush=True)
        if profile: torch.cuda.synchronize()
        t_lm = time.perf_counter()
        # Check thinking start token logit on first step
--- a/tests/unit/test_cuda_graph_multi_gpu.py
+++ b/tests/unit/test_cuda_graph_multi_gpu.py
@@ -0,0 +1,114 @@
+"""Minimal CUDA graph test: verify graph capture works on all 8 B200 GPUs."""
+import torch
+
+def test_basic_graph():
+    """Test basic CUDA graph on each GPU."""
+    results = {}
+    for gpu in range(8):
+        torch.cuda.set_device(gpu)
+        device = f'cuda:{gpu}'
+        
+        # Create input and output tensors
+        x = torch.ones(1, 4, 7168, dtype=torch.bfloat16, device=device)
+        y = torch.zeros(1, 4, 7168, dtype=torch.bfloat16, device=device)
+        
+        # Capture graph
+        g = torch.cuda.CUDAGraph()
+        with torch.cuda.graph(g):
+            y.copy_(x * 2.0)
+        
+        # Reset input
+        x.zero_()
+        
+        # Replay graph — y should be 0.0 * 2.0 = 0.0 since x is now zero
+        g.replay()
+        torch.cuda.synchronize()
+        
+        y_max = y.abs().max().item()
+        results[gpu] = y_max
+        status = "OK" if y_max == 0.0 else f"WRONG (expected 0.0, got {y_max})"
+        print(f"  GPU {gpu}: y_max={y_max:.2f} — {status}")
+    
+    return results
+
+def test_graph_with_updated_input():
+    """Test that graph replay uses current data in input buffer."""
+    results = {}
+    for gpu in range(8):
+        torch.cuda.set_device(gpu)
+        device = f'cuda:{gpu}'
+        
+        # Create input and output tensors (pre-allocated)
+        x_buf = torch.zeros(1, 4, 7168, dtype=torch.bfloat16, device=device)
+        y_buf = torch.zeros(1, 4, 7168, dtype=torch.bfloat16, device=device)
+        
+        # Fill input with data for capture
+        x_buf.fill_(1.0)
+        
+        # Capture graph
+        g = torch.cuda.CUDAGraph()
+        with torch.cuda.graph(g):
+            y_buf.copy_(x_buf * 2.0)
+        
+        # Now update input with DIFFERENT data
+        x_buf.fill_(3.0)
+        
+        # Replay graph — y should be 3.0 * 2.0 = 6.0
+        g.replay()
+        torch.cuda.synchronize()
+        
+        y_max = y_buf.abs().max().item()
+        results[gpu] = y_max
+        status = "OK" if abs(y_max - 6.0) < 0.1 else f"WRONG (expected 6.0, got {y_max})"
+        print(f"  GPU {gpu}: y_max={y_max:.2f} — {status}")
+    
+    return results
+
+def test_cross_gpu_copy_then_graph():
+    """Test cross-GPU copy followed by graph replay."""
+    results = {}
+    for gpu in range(1, 8):  # Skip GPU 0 (source)
+        torch.cuda.set_device(gpu)
+        device = f'cuda:{gpu}'
+        
+        # Source data on cuda:0
+        src = torch.full((1, 4, 7168), 5.0, dtype=torch.bfloat16, device='cuda:0')
+        
+        # Input/output buffers on cuda:{gpu}
+        x_buf = torch.zeros(1, 4, 7168, dtype=torch.bfloat16, device=device)
+        y_buf = torch.zeros(1, 4, 7168, dtype=torch.bfloat16, device=device)
+        
+        # Fill with data for capture
+        x_buf.fill_(1.0)
+        
+        # Capture graph
+        g = torch.cuda.CUDAGraph()
+        with torch.cuda.graph(g):
+            y_buf.copy_(x_buf * 2.0)
+        
+        # Copy data from cuda:0 to input buffer
+        x_buf.copy_(src)
+        torch.cuda.synchronize()
+        
+        # Replay — y should be 5.0 * 2.0 = 10.0
+        g.replay()
+        torch.cuda.synchronize()
+        
+        y_max = y_buf.abs().max().item()
+        results[gpu] = y_max
+        status = "OK" if abs(y_max - 10.0) < 0.1 else f"WRONG (expected 10.0, got {y_max})"
+        print(f"  cuda:0→cuda:{gpu}: y_max={y_max:.2f} — {status}")
+    
+    return results
+
+if __name__ == "__main__":
+    print("=== Test 1: Basic graph on each GPU ===")
+    test_basic_graph()
+    
+    print("\n=== Test 2: Graph replay with updated input ===")
+    test_graph_with_updated_input()
+    
+    print("\n=== Test 3: Cross-GPU copy then graph replay ===")
+    test_cross_gpu_copy_then_graph()
+    
+    print("\nDone.")
--- a/tests/unit/test_cuda_graph_readiness.py
+++ b/tests/unit/test_cuda_graph_readiness.py
@@ -0,0 +1,541 @@
+#!/usr/bin/env python3
+"""CUDA Graph Readiness Detector — Section A of GETTING_CUDAGRAPH_READY.md
+
+Runs one decode step of single_shot_inference.py with:
+1. torch.cuda.set_sync_debug_mode("error") — raises on any implicit device→host sync
+2. torch.cuda.graph capture attempt — fails on .item(), sync, alloc, dynamic shape
+
+This inventories EVERY existing sync in one pass so we get the full hunt-list upfront.
+"""
+import os, sys, time, json, math, traceback
+os.environ['CUDA_LAUNCH_BLOCKING'] = '1'
+import torch
+import torch.nn.functional as F
+
+# ==== CONFIG ====
+CHECKPOINT_DIR = "/root/nvidia-meeting/DeepSeek-V4-Pro-NVFP4"
+NUM_GPUS = 8
+PROMPT = "The capital of France is"
+MAX_CONTEXT = 8192
+SEED = 42
+
+# ==== Sync inventory ====
+sync_violations = []
+
+class SyncDetector:
+    """Tracks all device→host sync violations found during forward."""
+    def __init__(self):
+        self.violations = []
+        self.phase = "unknown"
+    
+    def record(self, category, location, detail):
+        self.violations.append({
+            "phase": self.phase,
+            "category": category,
+            "location": location,
+            "detail": detail,
+        })
+        print(f"  [SYNC] {category}: {location} — {detail}", flush=True)
+
+detector = SyncDetector()
+
+# ==== Import single_shot components ====
+# We need to import the functions/classes without running main()
+sys.path.insert(0, os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
+
+from single_shot_inference import (
+    load_all_weights, build_rope_cache, rmsnorm, unweighted_rmsnorm,
+    FP4_LUT, KVCache, Compressor, Indexer, HcHead,
+    make_nvfp4_linear, get_nvfp4_weight, dequant_nvfp4,
+    forward_layer, forward_attention, _run_production_fmha_mixed,
+    moe_forward, _apply_rope,
+    _load_moe_weights_stacked, _load_shared_expert_weights, _cache_layer_weights_no_experts,
+)
+from encoding.deepseek_v4_encoding import (
+    thinking_start_token, thinking_end_token,
+    USER_SP_TOKEN, ASSISTANT_SP_TOKEN,
+)
+
+
+def grep_sync_patterns(source_dir):
+    """Grep the hot path for known sync patterns (Section B checklist)."""
+    import re
+    patterns = {
+        'item()': r'\.item\(\)',
+        '.cpu()': r'\.cpu\(\)',
+        '.tolist()': r'\.tolist\(\)',
+        '.numpy()': r'\.numpy\(\)',
+        'int(t)/float(t)': r'\bint\([^)]*\)|float\([^)]*\)',  # rough
+        'cuda.synchronize()': r'torch\.cuda\.synchronize\(\)',
+        'isnan().any()': r'\.isnan\([^)]*\)\.any\(\)',
+        'isinf().any()': r'\.isinf\([^)]*\)\.any\(\)',
+        'if t:': r'if\s+\w+\.item\(\)',
+        'nonzero': r'\.nonzero\(\)',
+        'masked_select': r'\.masked_select\(',
+        'torch.where(one-arg)': r'torch\.where\([^,]+\)',
+    }
+    import glob
+    hot_files = [
+        'single_shot_inference.py',
+        'dsv4/layers/mhc.py',
+        'dsv4/layers/router.py',
+        'dsv4/layers/moe.py',
+        'dsv4/layers/shared_expert.py',
+        'dsv4/layers/linear.py',
+        'dsv4/layers/grouped_linear.py',
+        'dsv4/ops/quantize.py',
+        'dsv4/kernels/attention/production.py',
+        'dsv4/kernels/compressor/production_compress.py',
+    ]
+    print("\n=== SECTION B: Grep Results (hot path sync patterns) ===", flush=True)
+    for fname in hot_files:
+        fpath = os.path.join(source_dir, fname)
+        if not os.path.exists(fpath):
+            continue
+        with open(fpath) as f:
+            lines = f.readlines()
+        for i, line in enumerate(lines, 1):
+            stripped = line.strip()
+            if stripped.startswith('#') or stripped.startswith('"""') or stripped.startswith("'''"):
+                continue
+            for pname, pat in patterns.items():
+                if re.search(pat, stripped):
+                    # Skip comments
+                    if '#' in stripped and stripped.index('#') < re.search(pat, stripped).start():
+                        continue
+                    print(f"  [{pname}] {fname}:{i}: {stripped[:120]}", flush=True)
+
+
+def run_sync_debug_mode():
+    """Method 1: Run forward with sync debug mode to catch implicit syncs."""
+    print("\n=== METHOD 1: torch.cuda.set_sync_debug_mode('error') ===", flush=True)
+    
+    # Build model components (same as single_shot main, but abbreviated)
+    with open(os.path.join(CHECKPOINT_DIR, "config.json")) as f:
+        cfg = json.load(f)
+    n_layers = cfg["num_hidden_layers"]
+    H = cfg["hidden_size"]
+    hd = cfg["head_dim"]
+    n_h = cfg["num_attention_heads"]
+    rd = cfg.get("qk_rope_head_dim", 64)
+    cr = cfg.get("compress_ratios", [128] * n_layers)
+    
+    print(f"Model: {n_layers} layers, {n_h} heads, hd={hd}", flush=True)
+    
+    # Load weights
+    print("Loading weights...", flush=True)
+    all_w = load_all_weights(CHECKPOINT_DIR)
+    
+    # Build components
+    from dsv4.layers.mhc import mHCLayer
+    from dsv4.layers.router import Router
+    from dsv4.layers.moe import Nvfp4MoE
+    from dsv4.layers.shared_expert import Nvfp4SharedExpert
+    from dsv4.layers.grouped_linear import Nvfp4GroupedLinear
+    
+    for g in range(NUM_GPUS):
+        torch.cuda.set_device(g)
+        torch.cuda.empty_cache()
+    torch.cuda.set_device(0)
+    
+    # Build mHC + norms
+    attn_mhcs, ffn_mhcs, attn_norms, ffn_norms = {}, {}, {}, {}
+    for li in range(n_layers):
+        dev = f"cuda:{li % NUM_GPUS}"
+        for tag, blocks, fn_s, base_s, scale_s in [
+            ("attn", attn_mhcs, f"model.layers.{li}.attn_hc.fn", f"model.layers.{li}.attn_hc.base", f"model.layers.{li}.attn_hc.scale"),
+            ("ffn", ffn_mhcs, f"model.layers.{li}.ffn_hc.fn", f"model.layers.{li}.ffn_hc.base", f"model.layers.{li}.ffn_hc.scale"),
+        ]:
+            fn, base, scale = all_w.get(fn_s), all_w.get(base_s), all_w.get(scale_s)
+            if fn is not None and base is not None and scale is not None:
+                m = mHCLayer(hidden_dim=H, n_hc=4, t_max_sinkhorn=20, device=dev)
+                n = 4
+                m.load_weights(
+                    W_pre=fn[0:n].to(dev, torch.float32), W_post=fn[n:2*n].to(dev, torch.float32),
+                    W_comb=fn[2*n:].to(dev, torch.float32),
+                    S_pre=base[0:n].reshape(1, n).to(dev, torch.float32),
+                    S_post=base[n:2*n].reshape(n, 1).to(dev, torch.float32),
+                    S_comb=base[2*n:].reshape(n, n).to(dev, torch.float32),
+                    alpha_pre=scale[0].item(), alpha_post=scale[1].item(), alpha_comb=scale[2].item(),
+                )
+                blocks[li] = m
+        an_k = f"model.layers.{li}.input_layernorm.weight"
+        if an_k in all_w: attn_norms[li] = all_w[an_k].to(dev, torch.float32)
+        fn_k = f"model.layers.{li}.post_attention_layernorm.weight"
+        if fn_k in all_w: ffn_norms[li] = all_w[fn_k].to(dev, torch.float32)
+    
+    # Build attention projections
+    prod_lins = {}
+    for li in range(n_layers):
+        dev = f"cuda:{li % NUM_GPUS}"
+        pfx = f"model.layers.{li}.self_attn"
+        torch.cuda.set_device(li % NUM_GPUS)
+        pl = {}
+        pl['q_a'] = make_nvfp4_linear(7168, 1536, dev, all_w, pfx, 'q_a_proj')
+        pl['q_b'] = make_nvfp4_linear(1536, 65536, dev, all_w, pfx, 'q_b_proj')
+        pl['kv'] = make_nvfp4_linear(7168, 512, dev, all_w, pfx, 'kv_proj')
+        n_local_groups = cfg.get('o_groups', 16)
+        heads_per_group = n_h // n_local_groups
+        o_rank_val = cfg.get('o_lora_rank', 1024)
+        wo_a = Nvfp4GroupedLinear(
+            n_local_groups=n_local_groups,
+            heads_per_group=heads_per_group,
+            head_dim=hd,
+            o_lora_rank=o_rank_val,
+            max_num_tokens=8192,
+            device=dev,
+        )
+        oa_w_nvfp4, oa_ws, oa_ws2, oa_isc = get_nvfp4_weight(all_w, pfx, 'o_a_proj')
+        if oa_w_nvfp4 is not None and oa_ws is not None:
+            wo_a.load_nvfp4_weight(oa_w_nvfp4.to(dev), oa_ws.to(dev),
+                                   oa_ws2.to(dev) if oa_ws2 is not None else None,
+                                   oa_isc.to(dev) if oa_isc is not None else None)
+        else:
+            oa_bf = all_w.get(f"{pfx}.o_a_proj.weight")
+            if oa_bf is not None:
+                wo_a.set_bf16_weight(oa_bf.bfloat16().to(dev))
+        pl['o_a'] = wo_a
+        wo_a._use_runtime_gsa = True
+        pl['o_b'] = make_nvfp4_linear(16384, 7168, dev, all_w, pfx, 'o_b_proj')
+        prod_lins[li] = pl
+        if (li+1) % 10 == 0:
+            print(f"  {li+1}/{n_layers} attn projections", flush=True)
+    
+    # Routers, MoE, shared experts
+    routers, moe_runners, se_runners = {}, {}, {}
+    for li in range(n_layers):
+        dev = f"cuda:{li % NUM_GPUS}"
+        pfx = f"model.layers.{li}.mlp"
+        torch.cuda.set_device(li % NUM_GPUS)
+        torch.cuda.synchronize()
+        is_hash = (li < cfg.get("num_hash_layers", 3)) and (f"{pfx}.gate.tid2eid" in all_w)
+        router = Router(hidden_size=H, num_experts=cfg["n_routed_experts"],
+                       top_k=cfg.get("num_experts_per_tok", 6),
+                       routed_scaling_factor=cfg.get("routed_scaling_factor", 2.5),
+                       mode="hash" if is_hash else "dense",
+                       vocab_size=cfg.get("vocab_size", 128000) if is_hash else None, device=dev)
+        if is_hash:
+            router.load_weights(hash_lut=all_w[f"{pfx}.gate.tid2eid"].to(dev, torch.int32))
+        else:
+            eb = all_w.get(f"{pfx}.gate.e_score_correction_bias")
+            gate_w, gate_ws, gate_ws2, gate_isc = get_nvfp4_weight(all_w, pfx, 'gate')
+            if gate_w is not None and gate_ws is not None:
+                gate_bf16 = dequant_nvfp4(gate_w.to(dev), gate_ws.to(dev), gate_ws2, gate_isc)
+                router.W_gate = gate_bf16.T.contiguous().to(dev)
+            else:
+                gw = all_w.get(f"{pfx}.gate.weight")
+                gate_bf16 = gw.bfloat16().to(dev)
+                if gate_bf16.shape[0] != H:
+                    gate_bf16 = gate_bf16.T.contiguous()
+                router.W_gate = gate_bf16.contiguous()
+            router.gate_lin = None
+            router.load_weights(e_bias=eb.to(dev, torch.float32))
+        router.finalize_weights()
+        routers[li] = router
+        
+        moe = Nvfp4MoE(num_experts=cfg["n_routed_experts"], hidden_size=H,
+                       intermediate_size=cfg.get("moe_intermediate_size", 3072),
+                       top_k=cfg.get("num_experts_per_tok", 6), device=dev)
+        moe.set_swiglu_limit(cfg.get("swiglu_limit", 10.0))
+        moe.set_fused_swiglu(True)
+        _load_moe_weights_stacked(all_w, li, pfx, dev, moe, cfg)
+        moe._ensure_stacked()
+        moe._use_runtime_gsa = True
+        moe_runners[li] = moe
+        
+        se = Nvfp4SharedExpert(hidden_size=H, intermediate_size=cfg.get("moe_intermediate_size", 3072),
+                               device=dev, swiglu_limit=cfg.get("swiglu_limit", 10.0))
+        se.set_fused_swiglu(True)
+        _load_shared_expert_weights(all_w, li, pfx, dev, se, cfg)
+        se._ensure_initialized()
+        if se._fused_swiglu:
+            from dsv4.ops.gemm_runner import warmup_fused_swiglu_compilation
+            K_packed = H // 2
+            N_packed_l1 = (2 * cfg.get("moe_intermediate_size", 3072)) // 2
+            warmup_fused_swiglu_compilation(1, K_packed, N_packed_l1, dev,
+                                           swiglu_limit=cfg.get("swiglu_limit", 10.0))
+        se._use_runtime_gsa = True
+        se_runners[li] = se
+        if (li+1) % 10 == 0:
+            print(f"  {li+1}/{n_layers} MoE layers", flush=True)
+        torch.cuda.empty_cache()
+    
+    # Global weights
+    torch.cuda.set_device(0)
+    embed_w = all_w.get("model.embed_tokens.weight")
+    embed = torch.nn.Embedding.from_pretrained(embed_w.bfloat16().to('cuda:0'))
+    lm_w = all_w.get("lm_head.weight", embed_w).bfloat16().to('cuda:0')
+    final_norm_w = all_w.get("model.norm.weight")
+    if final_norm_w is not None:
+        final_norm_w = final_norm_w.to('cuda:0', torch.float32)
+    
+    hc_head = HcHead(H, 4, 'cuda:0')
+    hc_fn = all_w.get("model.hc_head.hc_fn")
+    hc_base = all_w.get("model.hc_head.hc_base")
+    hc_scale = all_w.get("model.hc_head.hc_scale")
+    if hc_fn is not None and hc_base is not None:
+        hc_head.load(hc_fn, hc_base, hc_scale)
+    
+    # RoPE
+    rp = cfg.get("rope_scaling", cfg.get("rope_parameters", {}))
+    rt = rp.get("type", rp.get("rope_type", "yarn"))
+    rf = rp.get("factor", 16.0)
+    rtheta = cfg.get("rope_theta", 10000.)
+    romax = rp.get("original_max_position_embeddings", 65536)
+    rbfast, rbslow = rp.get("beta_fast", 32), rp.get("beta_slow", 1)
+    rope_caches = {g: build_rope_cache(romax, rd, f"cuda:{g}", rtheta, rt, rf, romax, rbfast, rbslow) for g in range(NUM_GPUS)}
+    comp_rtheta = cfg.get("compress_rope_theta", rtheta)
+    if comp_rtheta != rtheta:
+        comp_rope_caches = {g: build_rope_cache(romax, rd, f"cuda:{g}", comp_rtheta, rt, rf, romax, rbfast, rbslow) for g in range(NUM_GPUS)}
+    else:
+        comp_rope_caches = rope_caches
+    
+    # KV caches, compressors, indexers
+    kv_caches, compressors, indexers = {}, {}, {}
+    n_ih = cfg.get("index_n_heads", 64)
+    ihd = cfg.get("index_head_dim", 128)
+    itk = cfg.get("index_topk", 1024)
+    for li in range(n_layers):
+        dev = f"cuda:{li % NUM_GPUS}"
+        ratio = cr[li] if li < len(cr) else 128
+        max_comp = (MAX_CONTEXT + ratio - 1) // ratio if ratio > 0 else 0
+        kv_caches[li] = KVCache(hd, cfg.get("sliding_window", 128), max_comp=max_comp, device=dev,
+                                 indexer_key_dim=ihd, compress_ratio=ratio, indexer_top_k=itk, rope_dim=rd)
+        if ratio > 0: compressors[li] = Compressor(ratio, hd, H, dev)
+        if ratio == 4: indexers[li] = Indexer(n_ih, ihd, itk, dev)
+    
+    # Cache layer weights
+    devs = [f"cuda:{g}" for g in range(NUM_GPUS)]
+    layer_w = _cache_layer_weights_no_experts(all_w, n_layers, devs)
+    
+    # Load compressor/indexer weights
+    for li in range(n_layers):
+        pfx = f"model.layers.{li}.self_attn.compressor"
+        if li in compressors: compressors[li].load(layer_w[li], pfx, dev=f"cuda:{li % NUM_GPUS}")
+        if li in indexers: indexers[li].load(layer_w[li], f"{pfx}.indexer", dev=f"cuda:{li % NUM_GPUS}")
+    
+    del all_w
+    import gc; gc.collect()
+    for g in range(NUM_GPUS):
+        torch.cuda.set_device(g)
+        torch.cuda.empty_cache()
+    torch.cuda.set_device(0)
+    
+    print("\nAll components built. Running prefill...", flush=True)
+    
+    # ---- Prefill (run normally, not under sync debug) ----
+    from transformers import AutoTokenizer
+    tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT_DIR)
+    from encoding.deepseek_v4_encoding import encode_messages
+    messages = [{"role": "user", "content": PROMPT}]
+    encoded_str = encode_messages(messages, thinking_mode='thinking')
+    generated = tokenizer.encode(encoded_str, add_special_tokens=False)
+    bos = tokenizer.bos_token_id or 0
+    if generated[0] != bos:
+        generated = [bos] + generated
+    
+    PREFILL_CHUNK = 128
+    n_prefill = len(generated)
+    prefill_ids = torch.tensor(generated, dtype=torch.long, device='cuda:0')
+    prefill_ids32 = prefill_ids.to(torch.int32)
+    all_positions = torch.arange(n_prefill, dtype=torch.long, device='cuda:0')
+    
+    chunk_starts = list(range(0, n_prefill, PREFILL_CHUNK))
+    for ci, cs in enumerate(chunk_starts):
+        ce = min(cs + PREFILL_CHUNK, n_prefill)
+        chunk_ids = prefill_ids[cs:ce]
+        chunk_ids32 = prefill_ids32[cs:ce]
+        chunk_positions = all_positions[cs:ce]
+        chunk_embed = embed(chunk_ids)
+        X = mHCLayer.init_state(chunk_embed)
+        
+        for li in range(n_layers):
+            gpu = li % NUM_GPUS
+            if X.device != torch.device(f"cuda:{gpu}"):
+                X = X.to(f"cuda:{gpu}")
+            torch.cuda.set_device(gpu)
+            X = forward_layer(X, layer_w[li], li, cfg, *rope_caches[gpu],
+                              attn_mhcs.get(li), ffn_mhcs.get(li),
+                              attn_norms.get(li), ffn_norms.get(li),
+                              kv_caches[li], chunk_positions, chunk_ids32,
+                              compressors.get(li), indexers.get(li),
+                              moe_runners.get(li), se_runners.get(li), routers.get(li),
+                              prod_lin=prod_lins.get(li),
+                              comp_rope_cos=comp_rope_caches[gpu][0],
+                              comp_rope_sin=comp_rope_caches[gpu][1],
+                              )
+        X = X.to('cuda:0')
+        print(f"  Prefill chunk {ci+1}/{len(chunk_starts)}", flush=True)
+    
+    print("Prefill complete. Starting sync detection...", flush=True)
+    
+    # ---- NOW: Run one decode step under sync debug mode ----
+    all_tokens = generated.copy()
+    dec_tid_buf = torch.zeros(1, dtype=torch.long, device='cuda:0')
+    dec_pos_buf = torch.zeros(1, dtype=torch.long, device='cuda:0')
+    dec_tid32_buf = torch.zeros(1, dtype=torch.int32, device='cuda:0')
+    # Pinned CPU buffers for graph-capturable token/position transfer
+    dec_tid_pinned = torch.zeros(1, dtype=torch.long, device='cpu').pin_memory()
+    dec_tid32_pinned = torch.zeros(1, dtype=torch.int32, device='cpu').pin_memory()
+    dec_pos_pinned = torch.zeros(1, dtype=torch.long, device='cpu').pin_memory()
+    
+    def write_token_to_gpu(token_id, position):
+        """Write token/position to GPU buffers via pinned CPU (no CPU→GPU sync)."""
+        dec_tid_pinned[0] = token_id
+        dec_tid_buf.copy_(dec_tid_pinned)
+        dec_tid32_pinned[0] = token_id
+        dec_tid32_buf.copy_(dec_tid32_pinned)
+        dec_pos_pinned[0] = position
+        dec_pos_buf.copy_(dec_pos_pinned)
+    
+    # Warmup step first (so CuTeDSL kernels are compiled)
+    print("  Warmup decode step (compiling CuTeDSL kernels)...", flush=True)
+    write_token_to_gpu(all_tokens[-1], len(all_tokens) - 1)
+    X = mHCLayer.init_state(embed(dec_tid_buf))
+    for li in range(n_layers):
+        gpu = li % NUM_GPUS
+        if X.device != torch.device(f"cuda:{gpu}"):
+            X = X.to(f"cuda:{gpu}")
+        torch.cuda.set_device(gpu)
+        X = forward_layer(X, layer_w[li], li, cfg, *rope_caches[gpu],
+                          attn_mhcs.get(li), ffn_mhcs.get(li),
+                          attn_norms.get(li), ffn_norms.get(li),
+                          kv_caches[li], dec_pos_buf, dec_tid32_buf,
+                          compressors.get(li), indexers.get(li),
+                          moe_runners.get(li), se_runners.get(li), routers.get(li),
+                          prod_lin=prod_lins.get(li),
+                          comp_rope_cos=comp_rope_caches[gpu][0],
+                          comp_rope_sin=comp_rope_caches[gpu][1],
+                          )
+    X = X.to('cuda:0')
+    torch.cuda.set_device(0)
+    torch.cuda.synchronize()
+    print("  Warmup done.", flush=True)
+    
+    # ==== METHOD 1: sync debug mode ====
+    print("\n  [METHOD 1] Enabling sync debug mode...", flush=True)
+    torch.cuda.set_sync_debug_mode("error")
+    
+    sync_errors = []
+    try:
+        detector.phase = "decode_forward"
+        write_token_to_gpu(all_tokens[-1], len(all_tokens) - 1)
+        
+        X = mHCLayer.init_state(embed(dec_tid_buf))
+        for li in range(n_layers):
+            gpu = li % NUM_GPUS
+            if X.device != torch.device(f"cuda:{gpu}"):
+                X = X.to(f"cuda:{gpu}")
+            torch.cuda.set_device(gpu)
+            X = forward_layer(X, layer_w[li], li, cfg, *rope_caches[gpu],
+                              attn_mhcs.get(li), ffn_mhcs.get(li),
+                              attn_norms.get(li), ffn_norms.get(li),
+                              kv_caches[li], dec_pos_buf, dec_tid32_buf,
+                              compressors.get(li), indexers.get(li),
+                              moe_runners.get(li), se_runners.get(li), routers.get(li),
+                              prod_lin=prod_lins.get(li),
+                              comp_rope_cos=comp_rope_caches[gpu][0],
+                              comp_rope_sin=comp_rope_caches[gpu][1],
+                              )
+        X = X.to('cuda:0')
+        torch.cuda.set_device(0)
+        
+        # hc_head + norm + lm_head
+        x_out = hc_head.forward(X) if hc_head is not None else X[:, 0, :]
+        if final_norm_w is not None:
+            x_out = rmsnorm(x_out, final_norm_w)
+        logits = torch.nn.functional.linear(x_out, lm_w)
+        
+        # Sampling (argmax — this WILL sync, but it's outside the graph)
+        # We test the FORWARD only, not the sampling loop
+        print("  Forward completed under sync debug mode!", flush=True)
+    except RuntimeError as e:
+        err_str = str(e)
+        sync_errors.append(err_str)
+        print(f"\n  [SYNC VIOLATION CAUGHT] {err_str[:300]}", flush=True)
+        traceback.print_exc()
+    finally:
+        torch.cuda.set_sync_debug_mode("default")
+    
+    if not sync_errors:
+        print("  METHOD 1: No sync violations in forward (or they're hidden behind conditional branches)", flush=True)
+    else:
+        print(f"  METHOD 1: {len(sync_errors)} sync violation(s) found", flush=True)
+    
+    # ==== METHOD 2: CUDA graph capture attempt ====
+    print("\n  [METHOD 2] Attempting CUDA graph capture of decode forward...", flush=True)
+    
+    # Pre-allocate static I/O buffers
+    static_x_in = torch.zeros(1, 4, H, dtype=torch.bfloat16, device='cuda:0')
+    static_logits = torch.zeros(1, cfg.get("vocab_size", 129280), dtype=torch.bfloat16, device='cuda:0')
+    static_token = torch.zeros(1, dtype=torch.long, device='cuda:0')
+    static_token32 = torch.zeros(1, dtype=torch.int32, device='cuda:0')
+    static_pos = torch.zeros(1, dtype=torch.long, device='cuda:0')
+    
+    # Try to capture a single layer first (layer 0 on cuda:0)
+    print("  Attempting capture of L0 (cuda:0)...", flush=True)
+    li = 0
+    gpu = 0
+    capture_errors = []
+    
+    try:
+        g = torch.cuda.CUDAGraph()
+        torch.cuda.set_device(0)
+        
+        # Fill static buffers with current decode state (via pinned CPU — no sync)
+        dec_tid_pinned[0] = all_tokens[-1]
+        static_token.copy_(dec_tid_pinned)
+        dec_tid32_pinned[0] = all_tokens[-1]
+        static_token32.copy_(dec_tid32_pinned)
+        dec_pos_pinned[0] = len(all_tokens) - 1
+        static_pos.copy_(dec_pos_pinned)
+        
+        with torch.cuda.graph(g):
+            X = mHCLayer.init_state(embed(static_token))
+            X = forward_layer(X, layer_w[li], li, cfg, *rope_caches[gpu],
+                              attn_mhcs.get(li), ffn_mhcs.get(li),
+                              attn_norms.get(li), ffn_norms.get(li),
+                              kv_caches[li], static_pos, static_token32,
+                              compressors.get(li), indexers.get(li),
+                              moe_runners.get(li), se_runners.get(li), routers.get(li),
+                              prod_lin=prod_lins.get(li),
+                              comp_rope_cos=comp_rope_caches[gpu][0],
+                              comp_rope_sin=comp_rope_caches[gpu][1],
+                              )
+            static_x_in.copy_(X.to('cuda:0'))
+        
+        print("  L0 CAPTURED SUCCESSFULLY!", flush=True)
+    except Exception as e:
+        err_str = str(e)
+        capture_errors.append(err_str)
+        print(f"\n  [CAPTURE FAILURE] L0: {err_str[:500]}", flush=True)
+        traceback.print_exc()
+    
+    # ==== Summary ====
+    print("\n" + "=" * 70, flush=True)
+    print("SYNC INVENTORY SUMMARY", flush=True)
+    print("=" * 70, flush=True)
+    print(f"  Method 1 (sync debug): {len(sync_errors)} violations", flush=True)
+    print(f"  Method 2 (graph capture L0): {'PASS' if not capture_errors else 'FAIL'}", flush=True)
+    print(f"  Grep patterns: see above", flush=True)
+    print("=" * 70, flush=True)
+    
+    # Save results
+    results = {
+        "sync_debug_violations": sync_errors,
+        "graph_capture_errors": capture_errors,
+        "grep_results": "see stdout",
+    }
+    with open("/tmp/cuda_graph_readiness_results.json", "w") as f:
+        json.dump(results, f, indent=2)
+    print(f"Results saved to /tmp/cuda_graph_readiness_results.json", flush=True)
+
+
+if __name__ == "__main__":
+    source_dir = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
+    
+    # First: grep for sync patterns
+    grep_sync_patterns(source_dir)
+    
+    # Then: run the forward under sync debug + capture attempt
+    run_sync_debug_mode()
--- a/tests/unit/test_cuda_graph_stream.py
+++ b/tests/unit/test_cuda_graph_stream.py
@@ -0,0 +1,78 @@
+"""Minimal CUDA graph test with explicit stream management."""
+import torch
+
+def test_explicit_stream():
+    """Test CUDA graph with explicit per-device streams."""
+    results = {}
+    for gpu in range(8):
+        device = f'cuda:{gpu}'
+        
+        # Create a dedicated stream for this device
+        s = torch.cuda.Stream(device=device)
+        
+        # Create tensors on the correct device
+        x = torch.ones(1, 4, 7168, dtype=torch.bfloat16, device=device)
+        y = torch.zeros(1, 4, 7168, dtype=torch.bfloat16, device=device)
+        
+        # Capture on the explicit stream
+        g = torch.cuda.CUDAGraph()
+        with torch.cuda.graph(g, stream=s):
+            y.copy_(x * 2.0)
+        
+        # Update input
+        x.fill_(3.0)
+        
+        # Replay on the SAME stream
+        with torch.cuda.stream(s):
+            g.replay()
+        
+        torch.cuda.synchronize()
+        y_max = y.abs().max().item()
+        expected = 6.0
+        status = "OK" if abs(y_max - expected) < 0.1 else f"WRONG (expected {expected}, got {y_max})"
+        results[gpu] = y_max
+        print(f"  GPU {gpu}: y_max={y_max:.2f} — {status}")
+    
+    return results
+
+def test_set_device_before_each_op():
+    """Test with explicit set_device before each operation."""
+    results = {}
+    for gpu in range(8):
+        torch.cuda.set_device(gpu)
+        device = f'cuda:{gpu}'
+        
+        x = torch.ones(1, 4, 7168, dtype=torch.bfloat16, device=device)
+        y = torch.zeros(1, 4, 7168, dtype=torch.bfloat16, device=device)
+        
+        # Use default stream on the current device
+        g = torch.cuda.CUDAGraph()
+        with torch.cuda.graph(g):
+            # Explicitly set device INSIDE the graph capture
+            torch.cuda.set_device(gpu)
+            y.copy_(x * 2.0)
+        
+        # Update input
+        x.fill_(3.0)
+        
+        # Replay
+        torch.cuda.set_device(gpu)
+        g.replay()
+        torch.cuda.synchronize()
+        
+        y_max = y.abs().max().item()
+        expected = 6.0
+        status = "OK" if abs(y_max - expected) < 0.1 else f"WRONG (expected {expected}, got {y_max})"
+        results[gpu] = y_max
+        print(f"  GPU {gpu}: y_max={y_max:.2f} — {status}")
+    
+    return results
+
+if __name__ == "__main__":
+    print("=== Test with explicit stream ===")
+    test_explicit_stream()
+    
+    print("\n=== Test with set_device inside capture ===")
+    test_set_device_before_each_op()
+    
+    print("\nDone.")