nvfp4-megamoe-kernel/dsv4/decode/cuda_graph_decoder.py at master

biondizzle df05289d6f CUDA graph: Fix remaining sync violations from B200 detector run 2

1. grouped_linear.py: Remove conditional host read of GPU tensor
   - 'if group_offsets[0] != 0' reads a GPU value on the host → GPU→CPU sync
   - Fix: unconditionally update offsets every call (GPU-only multiply)

2. test_cuda_graph_readiness.py: Use pinned CPU buffers for token transfer
   - 'dec_tid_buf[0] = python_int' writes a Python int into a GPU tensor → CPU→GPU sync
   - Fix: write to pinned CPU buffer, then copy_ (async, graph-capturable)

3. Add dsv4/decode/cuda_graph_decoder.py (skeleton)
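
Fixes 1 and 2 can be sketched roughly as follows. This is a minimal
illustration in PyTorch, not the code from grouped_linear.py or
test_cuda_graph_readiness.py: base_offsets, stride, host_tid_buf,
update_offsets and stage_token are hypothetical names, and the meaning of the
"GPU-only multiply" (base offsets times a stride) is an assumption. The sketch
falls back to CPU tensors when no GPU is present, in which case pinning is
skipped.

```python
import torch

use_cuda = torch.cuda.is_available()
device = torch.device("cuda" if use_cuda else "cpu")

# --- Fix 1: no host-side branch on a device value -------------------------
# Hypothetical data: per-group base offsets and a preallocated output buffer.
base_offsets = torch.tensor([0, 2, 5], dtype=torch.long, device=device)
group_offsets = torch.zeros_like(base_offsets)

def update_offsets(stride: int) -> torch.Tensor:
    # Anti-pattern (kept as a comment): the condition needs the tensor's
    # value on the host, so evaluating it forces a sync.
    #   if group_offsets[0] != 0:
    #       ...
    # Instead, recompute unconditionally into the same buffer on every call.
    torch.mul(base_offsets, stride, out=group_offsets)
    return group_offsets

# --- Fix 2: stage the token id through a pinned CPU buffer -----------------
# Pinned memory is only requested when CUDA is available (assumption for
# portability of this sketch).
host_tid_buf = torch.zeros(1, dtype=torch.long, pin_memory=use_cuda)
dec_tid_buf = torch.zeros(1, dtype=torch.long, device=device)

def stage_token(token_id: int) -> None:
    # Writing a Python int into a CPU tensor is an ordinary host write.
    host_tid_buf[0] = token_id
    # Copy from the pinned buffer to the device buffer without blocking.
    dec_tid_buf.copy_(host_tid_buf, non_blocking=True)
```

Both helpers write into preallocated buffers, so the tensor addresses stay
fixed across calls. Callers that read the results back on the host still need
to synchronize first, for example with torch.cuda.synchronize().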

2026-06-03 17:20:34 +00:00

6.6 KiB
