biondizzle

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-06-03 23:53:17 +00:00

1d6610c46d CUDA graph A/B split: eager-break-at-attention architecture

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-06-03 23:41:43 +00:00

800e974d20 Update CUDA_GRAPH_SYNC_INVENTORY.md with session 2 progress

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-06-03 23:17:54 +00:00

a468f72a0e CUDA graph: Pre-allocate L1 GEMM output buffers in MoE and SharedExpert

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-06-03 22:56:22 +00:00

56b816a54f CUDA graph: Use per-GPU position/token buffers for graph capture

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-06-03 22:26:51 +00:00

f57de06eb5 Fix grouped_linear GEMM output buffer shape and extraction

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-06-03 22:04:21 +00:00

92225b07e7 CUDA graph: Simplify to single-graph-per-layer capture (revert A/B split)

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-06-03 22:02:03 +00:00

b32713c302 grouped_linear: Pre-allocate output buffer for grouped GEMM (CUDA graph capture)

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-06-03 21:45:18 +00:00

676fad064f Fix: Add out= parameter to run_fused_swiglu_grouped_gemm signature

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-06-03 21:30:26 +00:00

188ecae47f CUDA graph: Eliminate per-step allocations in graph-captured code paths

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-06-03 21:09:15 +00:00

91c370360a Fix CuTeDSL from_dlpack device mismatch in CUDA graph capture (v3)

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-06-03 20:54:21 +00:00

5c94dbbc37 Fix CuTeDSL from_dlpack device mismatch in CUDA graph capture (v2)

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-06-03 20:34:27 +00:00

87b6c9932b Fix CuTeDSL from_dlpack device mismatch inside CUDA graph capture

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-06-03 19:49:56 +00:00

2661cebe9a Fix warmup_gsa: handle multi-element _gsa_buf (Nvfp4GroupedLinear per-group gsa)

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-06-03 19:24:30 +00:00

486f74d900 CUDA graph: Implement eager-break-at-attention decoder with sub-graph A/B split

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-06-03 19:15:29 +00:00

5ea3aa3406 Update GETTING_CUDAGRAPH_READY.md and CUDA_GRAPH_SYNC_INVENTORY.md

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-06-03 18:08:28 +00:00

80bb27f5bf CUDA graph: Fix gsa broadcast — contiguous for prefill, reshape for decode

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-06-03 17:53:49 +00:00

518a1d3f95 CUDA graph: Fix MoE scatter_add_ index dtype + fix second bincount

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-06-03 17:39:21 +00:00

f13a81d48b CUDA graph: Fix per-call allocations in grouped_linear and quantize

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-06-03 17:37:06 +00:00

84655d066a CUDA graph: Fix MoE bincount and per-call allocations (Hazard #4)

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-06-03 17:20:45 +00:00

df05289d6f CUDA graph: Fix remaining sync violations from B200 detector run 2