biondizzle
df05289d6f
CUDA graph: Fix remaining sync violations from B200 detector run 2
1. grouped_linear.py: Remove conditional host read of GPU tensor
- 'if group_offsets[0] != 0' reads a GPU value on the host → GPU→CPU sync
- Fix: unconditionally update offsets every call (GPU-only multiply)
2. test_cuda_graph_readiness.py: Use pinned CPU buffers for token transfer
- 'dec_tid_buf[0] = python_int' writes a Python int into a GPU tensor → CPU→GPU sync
- Fix: write to a pinned CPU buffer, then copy_ to the GPU (async, graph-capturable)
3. Add dsv4/decode/cuda_graph_decoder.py (skeleton)
2026-06-03 17:20:34 +00:00
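
The two fixes described in the commit message can be illustrated with a minimal PyTorch sketch. This is not the code from grouped_linear.py or test_cuda_graph_readiness.py: the helper names (`update_offsets`, `make_token_buffers`, `stage_token`), the `scale` factor, and the buffer shapes are hypothetical and chosen only for illustration. The sketch falls back to CPU tensors when no CUDA device is present, in which case pinning is skipped.

```python
import torch

# Assumption: fall back to CPU so the sketch also runs without a GPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")


def update_offsets(base_offsets: torch.Tensor, scale: int,
                   out: torch.Tensor) -> torch.Tensor:
    """Hypothetical helper: recompute offsets on every call.

    There is deliberately no `if base_offsets[0] != 0:` guard. Evaluating
    that condition converts a device tensor to a Python bool, which forces
    the host to wait for the GPU. The multiply below stays on the device
    and writes into a preallocated buffer.
    """
    return torch.mul(base_offsets, scale, out=out)


def make_token_buffers(dev: torch.device):
    """Hypothetical helper: one pinned host buffer, one device buffer."""
    pinned = dev.type == "cuda"  # pinning needs an accelerator runtime
    host_buf = torch.empty(1, dtype=torch.long, pin_memory=pinned)
    dev_buf = torch.zeros(1, dtype=torch.long, device=dev)
    return host_buf, dev_buf


def stage_token(host_buf: torch.Tensor, dev_buf: torch.Tensor,
                token_id: int) -> None:
    """Write the token on the CPU side, then copy it to the device."""
    host_buf[0] = token_id                      # plain CPU memory write
    dev_buf.copy_(host_buf, non_blocking=True)  # async when host_buf is pinned


if __name__ == "__main__":
    base = torch.tensor([0, 4, 8], device=device)
    scaled = torch.empty_like(base)
    update_offsets(base, 2, out=scaled)

    host_buf, dev_buf = make_token_buffers(device)
    stage_token(host_buf, dev_buf, 42)

    if device.type == "cuda":
        torch.cuda.synchronize()  # only so the prints below see final values
    print(scaled.tolist(), dev_buf.tolist())
```

Both buffers are allocated once and reused, so their addresses stay fixed across calls, which is what CUDA graph replay expects of its inputs. The explicit `torch.cuda.synchronize()` is only there so the demo can print results; it would not belong in a decode loop.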