55f1ddd502
Update GETTING_CUDAGRAPH_READY.md and CUDA_GRAPH_SYNC_INVENTORY.md with full current status, multi-GPU stream fix, and next steps
master
biondizzle 2026-06-06 09:17:49 +00:00
ac213bdee8
Update docs: CUDA graph capture WORKING on all 8 GPUs, 0.28s/token (2x eager)
biondizzle 2026-06-06 08:29:40 +00:00
6650f06121
CRITICAL FIX: Use explicit per-device streams for CUDA graph capture/replay on multi-GPU — fixes zero-output bug
biondizzle 2026-06-06 08:18:18 +00:00
90ac38cde0
Add CUDA graph stream management test
biondizzle 2026-06-06 08:14:29 +00:00
26042e3f01
Add minimal CUDA graph multi-GPU test to isolate zero-output bug
biondizzle 2026-06-06 08:13:18 +00:00
86275851d4
Add minimal CUDA graph test per GPU during capture to isolate multi-GPU graph issue
biondizzle 2026-06-06 08:02:35 +00:00
2cbf7a43e9
Add sync after cross-GPU copy before graph replay; remove misleading zero-input verification
biondizzle 2026-06-06 07:51:22 +00:00
2bb52c7cae
Add per-layer graph capture verification — replay immediately and check for zeros
biondizzle 2026-06-06 07:40:19 +00:00
5a98cc6d90
Store pre-cached norm weights on self to prevent GC during graph replay — root cause of all-zeros replay bug
biondizzle 2026-06-06 07:29:33 +00:00
dcb2495a5b
Add graph replay debug prints for first 3 steps/layers
biondizzle 2026-06-06 07:19:07 +00:00
16b9a4def2
Fix CUDA graph replay: set device to cuda:0 before lm_head graph replay
biondizzle 2026-06-06 07:18:49 +00:00
f259d63930
CRITICAL FIX: SE swizzled buffers were allocated then overwritten with None — graph capture would fall through to broken Python path
biondizzle 2026-06-06 07:01:52 +00:00
32902d1036
CUDA graph capture: derive q_a_dim from config, pre-cache norm weights, add buffer verification, use direct dict access for routers/moe/se
biondizzle 2026-06-06 07:01:12 +00:00
64f547058e
Fix graph replay: pass q_a from Graph A output to forward_attention
biondizzle 2026-06-04 08:09:30 +00:00
26da6d33af
Fix graph replay: remove extra token_id arg from forward_attention call
biondizzle 2026-06-04 06:10:02 +00:00
ae26f6b83c
Fix dense router BF16 dispatch: use torch.matmul instead of F.linear
biondizzle 2026-06-04 05:58:24 +00:00
e46b615873
Fix dense router BF16 dispatch for CUDA graph capture
biondizzle 2026-06-04 05:50:13 +00:00
b4a59d0940
Update CUDA graph docs with current status, A/B split, buffer fixes, remaining blockers
biondizzle 2026-06-04 05:13:51 +00:00
ffa7842b58
Fix dense router: run GEMM in BF16, convert to FP32 only for activation
biondizzle 2026-06-04 04:49:08 +00:00
119e6d471e
Add safety check for swizzled buffers: fall through to Python path if None
biondizzle 2026-06-04 04:32:00 +00:00
fae61d3ef7
Add c10/cuda/CUDAStream.h include for getCurrentCUDAStream
biondizzle 2026-06-04 04:13:40 +00:00
ee86969f6c
Fix CUDA stream: use c10::cuda::getCurrentCUDAStream() directly in kernel launch
biondizzle 2026-06-04 03:57:59 +00:00
e26c28a1ce
Fix CUDA stream API: getCurrentCUDAStream().stream()
biondizzle 2026-06-04 03:43:04 +00:00
7450ebc67a
CORRECTNESS_BACKLOG.md: comprehensive production pipeline verification results — all tested and confirmed findings from PART A diagnostics
biondizzle 2026-06-03 07:31:01 +00:00