biondizzle
  • Joined on 2025-12-10
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-06-03 23:53:17 +00:00
1d6610c46d CUDA graph A/B split: eager-break-at-attention architecture
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-06-03 23:41:43 +00:00
800e974d20 Update CUDA_GRAPH_SYNC_INVENTORY.md with session 2 progress
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-06-03 23:17:54 +00:00
a468f72a0e CUDA graph: Pre-allocate L1 GEMM output buffers in MoE and SharedExpert
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-06-03 22:56:22 +00:00
56b816a54f CUDA graph: Use per-GPU position/token buffers for graph capture
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-06-03 22:26:51 +00:00
f57de06eb5 Fix grouped_linear GEMM output buffer shape and extraction
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-06-03 22:04:21 +00:00
92225b07e7 CUDA graph: Simplify to single-graph-per-layer capture (revert A/B split)
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-06-03 22:02:03 +00:00
b32713c302 grouped_linear: Pre-allocate output buffer for grouped GEMM (CUDA graph capture)
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-06-03 21:45:18 +00:00
676fad064f Fix: Add out= parameter to run_fused_swiglu_grouped_gemm signature
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-06-03 21:30:26 +00:00
188ecae47f CUDA graph: Eliminate per-step allocations in graph-captured code paths
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-06-03 21:09:15 +00:00
91c370360a Fix CuTeDSL from_dlpack device mismatch in CUDA graph capture (v3)
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-06-03 20:54:21 +00:00
5c94dbbc37 Fix CuTeDSL from_dlpack device mismatch in CUDA graph capture (v2)
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-06-03 20:34:27 +00:00
87b6c9932b Fix CuTeDSL from_dlpack device mismatch inside CUDA graph capture
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-06-03 19:49:56 +00:00
2661cebe9a Fix warmup_gsa: handle multi-element _gsa_buf (Nvfp4GroupedLinear per-group gsa)
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-06-03 19:24:30 +00:00
486f74d900 CUDA graph: Implement eager-break-at-attention decoder with sub-graph A/B split
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-06-03 19:15:29 +00:00
5ea3aa3406 Update GETTING_CUDAGRAPH_READY.md and CUDA_GRAPH_SYNC_INVENTORY.md
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-06-03 18:08:28 +00:00
80bb27f5bf CUDA graph: Fix gsa broadcast — contiguous for prefill, reshape for decode
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-06-03 17:53:49 +00:00
518a1d3f95 CUDA graph: Fix MoE scatter_add_ index dtype + fix second bincount
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-06-03 17:39:21 +00:00
f13a81d48b CUDA graph: Fix per-call allocations in grouped_linear and quantize
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-06-03 17:37:06 +00:00
84655d066a CUDA graph: Fix MoE bincount and per-call allocations (Hazard #4)
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-06-03 17:20:45 +00:00
df05289d6f CUDA graph: Fix remaining sync violations from B200 detector run 2