Commit Graph

  • 55f1ddd502 Update GETTING_CUDAGRAPH_READY.md and CUDA_GRAPH_SYNC_INVENTORY.md with full current status, multi-GPU stream fix, and next steps (master) biondizzle 2026-06-06 09:17:49 +00:00
  • ac213bdee8 Update docs: CUDA graph capture WORKING on all 8 GPUs, 0.28s/token (2x eager) biondizzle 2026-06-06 08:29:40 +00:00
  • 6650f06121 CRITICAL FIX: Use explicit per-device streams for CUDA graph capture/replay on multi-GPU — fixes zero-output bug biondizzle 2026-06-06 08:18:18 +00:00
  • 90ac38cde0 Add CUDA graph stream management test biondizzle 2026-06-06 08:14:29 +00:00
  • 26042e3f01 Add minimal CUDA graph multi-GPU test to isolate zero-output bug biondizzle 2026-06-06 08:13:18 +00:00
  • 86275851d4 Add minimal CUDA graph test per GPU during capture to isolate multi-GPU graph issue biondizzle 2026-06-06 08:02:35 +00:00
  • 2cbf7a43e9 Add sync after cross-GPU copy before graph replay; remove misleading zero-input verification biondizzle 2026-06-06 07:51:22 +00:00
  • 2bb52c7cae Add per-layer graph capture verification — replay immediately and check for zeros biondizzle 2026-06-06 07:40:19 +00:00
  • 5a98cc6d90 Store pre-cached norm weights on self to prevent GC during graph replay — root cause of all-zeros replay bug biondizzle 2026-06-06 07:29:33 +00:00
  • dcb2495a5b Add graph replay debug prints for first 3 steps/layers biondizzle 2026-06-06 07:19:07 +00:00
  • 16b9a4def2 Fix CUDA graph replay: set device to cuda:0 before lm_head graph replay biondizzle 2026-06-06 07:18:49 +00:00
  • f259d63930 CRITICAL FIX: SE swizzled buffers were allocated then overwritten with None — graph capture would fall through to broken Python path biondizzle 2026-06-06 07:01:52 +00:00
  • 32902d1036 CUDA graph capture: derive q_a_dim from config, pre-cache norm weights, add buffer verification, use direct dict access for routers/moe/se biondizzle 2026-06-06 07:01:12 +00:00
  • 64f547058e Fix graph replay: pass q_a from Graph A output to forward_attention biondizzle 2026-06-04 08:09:30 +00:00
  • 26da6d33af Fix graph replay: remove extra token_id arg from forward_attention call biondizzle 2026-06-04 06:10:02 +00:00
  • ae26f6b83c Fix dense router BF16 dispatch: use torch.matmul instead of F.linear biondizzle 2026-06-04 05:58:24 +00:00
  • e46b615873 Fix dense router BF16 dispatch for CUDA graph capture biondizzle 2026-06-04 05:50:13 +00:00
  • b4a59d0940 Update CUDA graph docs with current status, A/B split, buffer fixes, remaining blockers biondizzle 2026-06-04 05:13:51 +00:00
  • ffa7842b58 Fix dense router: run GEMM in BF16, convert to FP32 only for activation biondizzle 2026-06-04 04:49:08 +00:00
  • 119e6d471e Add safety check for swizzled buffers: fall through to Python path if None biondizzle 2026-06-04 04:32:00 +00:00
  • fae61d3ef7 Add c10/cuda/CUDAStream.h include for getCurrentCUDAStream biondizzle 2026-06-04 04:13:40 +00:00
  • ee86969f6c Fix CUDA stream: use c10::cuda::getCurrentCUDAStream() directly in kernel launch biondizzle 2026-06-04 03:57:59 +00:00
  • e26c28a1ce Fix CUDA stream API: getCurrentCUDAStream().stream() biondizzle 2026-06-04 03:43:04 +00:00
  • 9b3917e248 Fix blackwell_swizzle.cu: add pybind11 bindings for torch extension loader biondizzle 2026-06-04 03:29:10 +00:00
  • 5487a58df4 Fix NameError: add rows/cols variables to MoE swizzle biondizzle 2026-06-04 03:14:27 +00:00
  • a434545d12 Blackwell swizzle CUDA kernel for CUDA graph capture biondizzle 2026-06-04 03:03:02 +00:00
  • e7766254b7 Pre-allocate ALL GEMM output buffers for CUDA graph capture biondizzle 2026-06-04 02:41:59 +00:00
  • 676a0448c0 CRITICAL FIX: _l1_out_buf was 2x too narrow — caused GPU memory corruption biondizzle 2026-06-04 02:06:18 +00:00
  • 0890e578f4 DEBUG: print l1_out shape before gate/up split biondizzle 2026-06-04 01:49:12 +00:00
  • 8546ed725f DEBUG: check SE input magnitude biondizzle 2026-06-04 01:38:24 +00:00
  • 26ecf96328 DEBUG: check intermediate magnitude before SE L2 biondizzle 2026-06-04 01:30:29 +00:00
  • 5303d6a82f DEBUG: test copy_ with contiguous slice vs scalar assign for gsa biondizzle 2026-06-04 01:27:25 +00:00
  • ccbc713658 DEBUG: check gsa values and pinpoint exact failing operation biondizzle 2026-06-04 01:16:37 +00:00
  • e77455c3ba DEBUG: add sync inside quantize_nvfp4_gpu_fused to catch async errors biondizzle 2026-06-04 01:05:47 +00:00
  • 55def5eef9 Restore A/B split + gsa scalar fix (error is pre-existing, not regression) biondizzle 2026-06-04 01:03:36 +00:00
  • 59eccd04ab REVERT: test if cudaErrorInvalidValue is pre-existing or regression biondizzle 2026-06-04 00:53:09 +00:00
  • 5e3ced0b60 DEBUG: isolate which kernel causes cudaErrorInvalidValue in SE L2 path biondizzle 2026-06-04 00:41:28 +00:00
  • b314fde9b7 Fix gsa copy_ cudaErrorInvalidValue: replace view-based copy_ with scalar assignment biondizzle 2026-06-04 00:30:21 +00:00
  • 993bb345d1 DEBUG: fix VERBOSE reference in shared_expert, always print L2 gsa debug biondizzle 2026-06-04 00:15:38 +00:00
  • f0f87df906 DEBUG: add sync + shape prints to shared_expert L2 gsa copy biondizzle 2026-06-04 00:05:08 +00:00
  • 1d6610c46d CUDA graph A/B split: eager-break-at-attention architecture biondizzle 2026-06-03 23:53:08 +00:00
  • 800e974d20 Update CUDA_GRAPH_SYNC_INVENTORY.md with session 2 progress biondizzle 2026-06-03 23:41:42 +00:00
  • a468f72a0e CUDA graph: Pre-allocate L1 GEMM output buffers in MoE and SharedExpert biondizzle 2026-06-03 23:17:43 +00:00
  • 56b816a54f CUDA graph: Use per-GPU position/token buffers for graph capture biondizzle 2026-06-03 22:56:20 +00:00
  • f57de06eb5 Fix grouped_linear GEMM output buffer shape and extraction biondizzle 2026-06-03 22:26:40 +00:00
  • 92225b07e7 CUDA graph: Simplify to single-graph-per-layer capture (revert A/B split) biondizzle 2026-06-03 22:04:18 +00:00
  • b32713c302 grouped_linear: Pre-allocate output buffer for grouped GEMM (CUDA graph capture) biondizzle 2026-06-03 22:02:01 +00:00
  • 676fad064f Fix: Add out= parameter to run_fused_swiglu_grouped_gemm signature biondizzle 2026-06-03 21:45:15 +00:00
  • 188ecae47f CUDA graph: Eliminate per-step allocations in graph-captured code paths biondizzle 2026-06-03 21:30:24 +00:00
  • 91c370360a Fix CuTeDSL from_dlpack device mismatch in CUDA graph capture (v3) biondizzle 2026-06-03 21:09:12 +00:00
  • 5c94dbbc37 Fix CuTeDSL from_dlpack device mismatch in CUDA graph capture (v2) biondizzle 2026-06-03 20:54:18 +00:00
  • 87b6c9932b Fix CuTeDSL from_dlpack device mismatch inside CUDA graph capture biondizzle 2026-06-03 20:34:24 +00:00
  • 2661cebe9a Fix warmup_gsa: handle multi-element _gsa_buf (Nvfp4GroupedLinear per-group gsa) biondizzle 2026-06-03 19:49:54 +00:00
  • 486f74d900 CUDA graph: Implement eager-break-at-attention decoder with sub-graph A/B split biondizzle 2026-06-03 19:24:26 +00:00
  • 5ea3aa3406 Update GETTING_CUDAGRAPH_READY.md and CUDA_GRAPH_SYNC_INVENTORY.md biondizzle 2026-06-03 19:15:27 +00:00
  • 80bb27f5bf CUDA graph: Fix gsa broadcast — contiguous for prefill, reshape for decode biondizzle 2026-06-03 18:08:18 +00:00
  • 518a1d3f95 CUDA graph: Fix MoE scatter_add_ index dtype + fix second bincount biondizzle 2026-06-03 17:53:40 +00:00
  • f13a81d48b CUDA graph: Fix per-call allocations in grouped_linear and quantize biondizzle 2026-06-03 17:39:20 +00:00
  • 84655d066a CUDA graph: Fix MoE bincount and per-call allocations (Hazard #4) biondizzle 2026-06-03 17:37:03 +00:00
  • df05289d6f CUDA graph: Fix remaining sync violations from B200 detector run 2 biondizzle 2026-06-03 17:20:34 +00:00
  • e07d79868f CUDA graph: Fix _assemble_scales_single_group swizzle size biondizzle 2026-06-03 17:02:34 +00:00
  • 0ca7bed0e1 CUDA graph: Fix sync violations found by B200 detector biondizzle 2026-06-03 16:52:19 +00:00
  • 46a3a51832 CUDA graph: Fix per-step allocations in decode loop biondizzle 2026-06-03 16:38:35 +00:00
  • a9ea30353c CUDA graph: Fix sync violations (Category 1-2) biondizzle 2026-06-03 16:37:20 +00:00
  • caac8ae108 Fix syntax error: 'is not not None' -> 'is not None' biondizzle 2026-06-03 16:34:33 +00:00
  • ba68212fa7 Add CUDA graph readiness detector (Section A of GETTING_CUDAGRAPH_READY.md) biondizzle 2026-06-03 16:34:15 +00:00
  • ca5bc814d5 Fix compressor: do not add positional bias to KV content biondizzle 2026-06-03 15:52:00 +00:00
  • 4fe73fe713 auto: pre-test commit (v-precision-floor-fix-20260603) biondizzle 2026-06-03 15:45:15 +00:00
  • f577ed97f4 Fix: Use PyTorch dequant_nvfp4 for weight dequantization (compressor/indexer/router gate) biondizzle 2026-06-03 14:57:40 +00:00
  • 1121cd7b47 Add CUDA_LAUNCH_BLOCKING=1 to catch async errors biondizzle 2026-06-03 14:48:51 +00:00
  • f3bb0ca08c Fix dequant gsa: use ws2 only, NOT input_scale * ws2 biondizzle 2026-06-03 14:38:24 +00:00
  • 470e65fb19 Fix dequant gsb: input_scale * ws2, not 1.0 * ws2 biondizzle 2026-06-03 14:26:59 +00:00
  • 2dd16d5789 Switch compressor + indexer weights_proj to BF16 F.linear biondizzle 2026-06-03 14:19:41 +00:00
  • 95e45a87e3 Add explicit .to(dev) on W_gate after transpose — belt and suspenders biondizzle 2026-06-03 14:17:02 +00:00
  • ef94c48957 Simplify router gate: dequant NVFP4 → BF16, F.linear (no FP8 middleman) biondizzle 2026-06-03 14:14:10 +00:00
  • 715602c87c Switch lm_head to BF16 + router gate to FP8_E4M3 biondizzle 2026-06-03 14:10:28 +00:00
  • 3320abfe24 Fix two correctness bugs: compressor pos bias on KV + SwiGLU clamp ordering (pure-nvfp4) biondizzle 2026-06-03 11:17:49 +00:00
  • 7901470e63 doc clean up (v-official-encoding-path) biondizzle 2026-06-03 10:53:41 +00:00
  • ca7c309463 Add reference/ dir: vLLM tokenizers, reasoning parsers, tool parsers, official inference biondizzle 2026-06-03 10:25:23 +00:00
  • 8cfc1cae58 Canonical encoding: derive special token IDs from official encoding module + tokenizer biondizzle 2026-06-03 10:23:02 +00:00
  • a86d6d90a5 Replace hand-rolled prompt with official DSV4 encoder (canonical path) biondizzle 2026-06-03 09:59:05 +00:00
  • 284fc9ca86 Fix: thread comp_rope_cos/comp_rope_sin through forward_attention biondizzle 2026-06-03 09:30:57 +00:00
  • 6a3374da18 Cross-check 2 complete: block-aligned comp_pos + compress_rope_theta wired through biondizzle 2026-06-03 09:19:11 +00:00
  • 5003e756e2 WIP: cross-check 2 fix — block-aligned compressed RoPE positions + compress_rope_theta support biondizzle 2026-06-03 09:17:54 +00:00
  • 572bdd2840 auto: pre-test commit biondizzle 2026-06-03 09:01:02 +00:00
  • 3c06fd5591 Test 2: fix topk tensor shape (flatten before iterating) biondizzle 2026-06-03 08:47:32 +00:00
  • 89f6e64057 README: document test harness gotchas (timeout arg, stale procs, screen names) biondizzle 2026-06-03 08:36:02 +00:00
  • 29d6986dd4 Test 2: fix quantize_to_nvfp4 import biondizzle 2026-06-03 08:21:39 +00:00
  • 60b9bbd470 Test 2: fix import - use mHCLayer from dsv4.layers.mhc, fixed prompt encoding biondizzle 2026-06-03 08:20:21 +00:00
  • 1e77dfcaa0 Fix prompt encoding: remove \n\n before content per official DSV4 spec; add --chat-mode biondizzle 2026-06-03 08:19:33 +00:00
  • 2a42686e8e Test 1 v2: diff hand-rolled vs official DSV4 encoding biondizzle 2026-06-03 08:18:56 +00:00
  • 11c2d5fe53 Add degeneration test 2: falsify mHC residual growth root cause biondizzle 2026-06-03 08:18:01 +00:00
  • c77b83fffc Add degeneration test 1: chat-template token-ID diff biondizzle 2026-06-03 08:17:09 +00:00
  • c5a131c358 more doc clean up again biondizzle 2026-06-03 08:14:07 +00:00
  • 019a3a34b7 Clean up L0 B1 verify noise (gate on VERBOSE), update FINAL_STRETCH.md biondizzle 2026-06-03 08:12:54 +00:00
  • 5e09be08af Fix non-contiguous tensor in quantize_nvfp4_gpu_fused (T>1 prefill) biondizzle 2026-06-03 07:56:19 +00:00
  • 60309ef124 Batched prefill: replace T=1 token-by-token with chunked T≤128 batch processing biondizzle 2026-06-03 07:39:37 +00:00
  • 0bf276f8c9 more doc cleanup biondizzle 2026-06-03 07:37:13 +00:00
  • d463ac8512 doc cleanup biondizzle 2026-06-03 07:34:12 +00:00
  • 7450ebc67a CORRECTNESS_BACKLOG.md: comprehensive production pipeline verification results — all tested and confirmed findings from PART A diagnostics biondizzle 2026-06-03 07:31:01 +00:00