|
|
df05289d6f
|
CUDA graph: Fix remaining sync violations from B200 detector run 2
1. grouped_linear.py: Remove conditional host read of GPU tensor
- 'if group_offsets[0] != 0' branches in Python on a GPU tensor value, which forces a device→host sync
- Fix: unconditionally update offsets every call (GPU-only multiply)
2. test_cuda_graph_readiness.py: Use pinned CPU buffers for token transfer
- 'dec_tid_buf[0] = python_int' writes a Python int directly into the GPU buffer → CPU→GPU sync
- Fix: write the token to a pinned CPU buffer, then copy_ it into the GPU buffer (async, graph-capturable)
3. Add dsv4/decode/cuda_graph_decoder.py (skeleton)
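The two fixes above can be sketched roughly as follows. This is a minimal illustration, not the code from this commit: the buffer names `base_offsets`, `tid_host`, `tid_dev` and the helpers `update_offsets` and `stage_token` are hypothetical, and the real offset computation in grouped_linear.py may differ. The sketch falls back to CPU (without pinning) when no CUDA device is present.

```python
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
use_pinned = device.type == "cuda"  # pinned memory needs a CUDA runtime

# Fix 1 (hypothetical shapes/names): update offsets on every call with a
# device-side multiply instead of branching on `group_offsets[0] != 0`,
# which would copy the element to the host to evaluate the `if`.
base_offsets = torch.tensor([0, 4, 8], dtype=torch.long, device=device)
group_offsets = torch.empty_like(base_offsets)

def update_offsets(stride: int) -> None:
    # Pure device op writing into a preallocated buffer; no host read.
    torch.mul(base_offsets, stride, out=group_offsets)

# Fix 2 (hypothetical names): stage the token id in a CPU buffer (pinned
# when CUDA is available), then copy it into a fixed device buffer.
tid_host = torch.zeros(1, dtype=torch.long, pin_memory=use_pinned)
tid_dev = torch.zeros(1, dtype=torch.long, device=device)

def stage_token(token_id: int) -> None:
    tid_host[0] = token_id                      # plain host-memory write
    tid_dev.copy_(tid_host, non_blocking=True)  # async when source is pinned

update_offsets(16)
stage_token(42)
```

Both buffers are allocated once and reused, so their addresses stay fixed across calls, which graph replay requires. The host buffer should not be overwritten until the previous copy has completed.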
|
2026-06-03 17:20:34 +00:00 |
|