biondizzle

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-06-06 09:18:38 +00:00

55f1ddd502 Update GETTING_CUDAGRAPH_READY.md and CUDA_GRAPH_SYNC_INVENTORY.md with full current status, multi-GPU stream fix, and next steps

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-06-06 08:29:42 +00:00

ac213bdee8 Update docs: CUDA graph capture WORKING on all 8 GPUs, 0.28s/token (2x eager)

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-06-06 08:18:21 +00:00

6650f06121 CRITICAL FIX: Use explicit per-device streams for CUDA graph capture/replay on multi-GPU — fixes zero-output bug

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-06-06 08:14:55 +00:00

90ac38cde0 Add CUDA graph stream management test

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-06-06 08:13:20 +00:00

26042e3f01 Add minimal CUDA graph multi-GPU test to isolate zero-output bug

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-06-06 08:02:39 +00:00

86275851d4 Add minimal CUDA graph test per GPU during capture to isolate multi-GPU graph issue

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-06-06 07:51:25 +00:00

2cbf7a43e9 Add sync after cross-GPU copy before graph replay; remove misleading zero-input verification

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-06-06 07:40:21 +00:00

2bb52c7cae Add per-layer graph capture verification — replay immediately and check for zeros

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-06-06 07:29:36 +00:00

5a98cc6d90 Store pre-cached norm weights on self to prevent GC during graph replay — root cause of all-zeros replay bug

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-06-06 07:19:10 +00:00

dcb2495a5b Add graph replay debug prints for first 3 steps/layers

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-06-06 07:18:51 +00:00

16b9a4def2 Fix CUDA graph replay: set device to cuda:0 before lm_head graph replay

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-06-06 07:01:56 +00:00

f259d63930 CRITICAL FIX: SE swizzled buffers were allocated then overwritten with None — graph capture would fall through to broken Python path

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-06-06 07:01:18 +00:00

32902d1036 CUDA graph capture: derive q_a_dim from config, pre-cache norm weights, add buffer verification, use direct dict access for routers/moe/se

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-06-04 08:09:46 +00:00

64f547058e Fix graph replay: pass q_a from Graph A output to forward_attention

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-06-04 06:10:04 +00:00

26da6d33af Fix graph replay: remove extra token_id arg from forward_attention call

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-06-04 05:58:27 +00:00

ae26f6b83c Fix dense router BF16 dispatch: use torch.matmul instead of F.linear

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-06-04 05:50:21 +00:00

e46b615873 Fix dense router BF16 dispatch for CUDA graph capture

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-06-04 05:13:54 +00:00

b4a59d0940 Update CUDA graph docs with current status, A/B split, buffer fixes, remaining blockers

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-06-04 04:49:10 +00:00

ffa7842b58 Fix dense router: run GEMM in BF16, convert to FP32 only for activation

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-06-04 04:32:02 +00:00

119e6d471e Add safety check for swizzled buffers: fall through to Python path if None