biondizzle
  • Joined on 2025-12-10
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-06-06 09:18:38 +00:00
55f1ddd502 Update GETTING_CUDAGRAPH_READY.md and CUDA_GRAPH_SYNC_INVENTORY.md with full current status, multi-GPU stream fix, and next steps
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-06-06 08:29:42 +00:00
ac213bdee8 Update docs: CUDA graph capture WORKING on all 8 GPUs, 0.28s/token (2x eager)
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-06-06 08:18:21 +00:00
6650f06121 CRITICAL FIX: Use explicit per-device streams for CUDA graph capture/replay on multi-GPU — fixes zero-output bug
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-06-06 08:14:55 +00:00
90ac38cde0 Add CUDA graph stream management test
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-06-06 08:13:20 +00:00
26042e3f01 Add minimal CUDA graph multi-GPU test to isolate zero-output bug
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-06-06 08:02:39 +00:00
86275851d4 Add minimal CUDA graph test per GPU during capture to isolate multi-GPU graph issue
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-06-06 07:51:25 +00:00
2cbf7a43e9 Add sync after cross-GPU copy before graph replay; remove misleading zero-input verification
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-06-06 07:40:21 +00:00
2bb52c7cae Add per-layer graph capture verification — replay immediately and check for zeros
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-06-06 07:29:36 +00:00
5a98cc6d90 Store pre-cached norm weights on self to prevent GC during graph replay — root cause of all-zeros replay bug
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-06-06 07:19:10 +00:00
dcb2495a5b Add graph replay debug prints for first 3 steps/layers
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-06-06 07:18:51 +00:00
16b9a4def2 Fix CUDA graph replay: set device to cuda:0 before lm_head graph replay
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-06-06 07:01:56 +00:00
f259d63930 CRITICAL FIX: SE swizzled buffers were allocated then overwritten with None — graph capture would fall through to broken Python path
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-06-06 07:01:18 +00:00
32902d1036 CUDA graph capture: derive q_a_dim from config, pre-cache norm weights, add buffer verification, use direct dict access for routers/moe/se
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-06-04 08:09:46 +00:00
64f547058e Fix graph replay: pass q_a from Graph A output to forward_attention
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-06-04 06:10:04 +00:00
26da6d33af Fix graph replay: remove extra token_id arg from forward_attention call
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-06-04 05:58:27 +00:00
ae26f6b83c Fix dense router BF16 dispatch: use torch.matmul instead of F.linear
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-06-04 05:50:21 +00:00
e46b615873 Fix dense router BF16 dispatch for CUDA graph capture
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-06-04 05:13:54 +00:00
b4a59d0940 Update CUDA graph docs with current status, A/B split, buffer fixes, remaining blockers
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-06-04 04:49:10 +00:00
ffa7842b58 Fix dense router: run GEMM in BF16, convert to FP32 only for activation
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-06-04 04:32:02 +00:00
119e6d471e Add safety check for swizzled buffers: fall through to Python path if None