biondizzle/nvfp4-megamoe-kernel

Commit Graph

master

pre-b1

pure-nvfp4

v-b1-b2-done-20260603

v-c1-c2-c3-20260602

v-e2e-nvfp4-all-projections

v-e2e-paris-32tok-20260601-0549

v-indexer-fix-20260602

v-nvfp4-fused-router-rewrite-20260601-0715

v-nvfp4-router-oa-20260601-0610

v-official-encoding-path

v-p0p1p2p3-fused-swiglu-cuda-rope-20260602

v-perf-part1-p2-reverted-20260602

v-post-indexer-c-fixes-20260602

v-precision-floor-fix-20260603

v-single-shot-paris-20260601-0539

v-working-e2e-20260601-0515

v0.1-e2e-working

55f1ddd502 Update GETTING_CUDAGRAPH_READY.md and CUDA_GRAPH_SYNC_INVENTORY.md with full current status, multi-GPU stream fix, and next steps (master) biondizzle 2026-06-06 09:17:49 +00:00
ac213bdee8 Update docs: CUDA graph capture WORKING on all 8 GPUs, 0.28s/token (2x eager) biondizzle 2026-06-06 08:29:40 +00:00
6650f06121 CRITICAL FIX: Use explicit per-device streams for CUDA graph capture/replay on multi-GPU — fixes zero-output bug biondizzle 2026-06-06 08:18:18 +00:00
90ac38cde0 Add CUDA graph stream management test biondizzle 2026-06-06 08:14:29 +00:00
26042e3f01 Add minimal CUDA graph multi-GPU test to isolate zero-output bug biondizzle 2026-06-06 08:13:18 +00:00
86275851d4 Add minimal CUDA graph test per GPU during capture to isolate multi-GPU graph issue biondizzle 2026-06-06 08:02:35 +00:00
2cbf7a43e9 Add sync after cross-GPU copy before graph replay; remove misleading zero-input verification biondizzle 2026-06-06 07:51:22 +00:00
2bb52c7cae Add per-layer graph capture verification — replay immediately and check for zeros biondizzle 2026-06-06 07:40:19 +00:00
5a98cc6d90 Store pre-cached norm weights on self to prevent GC during graph replay — root cause of all-zeros replay bug biondizzle 2026-06-06 07:29:33 +00:00
dcb2495a5b Add graph replay debug prints for first 3 steps/layers biondizzle 2026-06-06 07:19:07 +00:00
16b9a4def2 Fix CUDA graph replay: set device to cuda:0 before lm_head graph replay biondizzle 2026-06-06 07:18:49 +00:00
f259d63930 CRITICAL FIX: SE swizzled buffers were allocated then overwritten with None — graph capture would fall through to broken Python path biondizzle 2026-06-06 07:01:52 +00:00
32902d1036 CUDA graph capture: derive q_a_dim from config, pre-cache norm weights, add buffer verification, use direct dict access for routers/moe/se biondizzle 2026-06-06 07:01:12 +00:00
64f547058e Fix graph replay: pass q_a from Graph A output to forward_attention biondizzle 2026-06-04 08:09:30 +00:00
26da6d33af Fix graph replay: remove extra token_id arg from forward_attention call biondizzle 2026-06-04 06:10:02 +00:00
ae26f6b83c Fix dense router BF16 dispatch: use torch.matmul instead of F.linear biondizzle 2026-06-04 05:58:24 +00:00
e46b615873 Fix dense router BF16 dispatch for CUDA graph capture biondizzle 2026-06-04 05:50:13 +00:00
b4a59d0940 Update CUDA graph docs with current status, A/B split, buffer fixes, remaining blockers biondizzle 2026-06-04 05:13:51 +00:00
ffa7842b58 Fix dense router: run GEMM in BF16, convert to FP32 only for activation biondizzle 2026-06-04 04:49:08 +00:00
119e6d471e Add safety check for swizzled buffers: fall through to Python path if None biondizzle 2026-06-04 04:32:00 +00:00
fae61d3ef7 Add c10/cuda/CUDAStream.h include for getCurrentCUDAStream biondizzle 2026-06-04 04:13:40 +00:00
ee86969f6c Fix CUDA stream: use c10::cuda::getCurrentCUDAStream() directly in kernel launch biondizzle 2026-06-04 03:57:59 +00:00
e26c28a1ce Fix CUDA stream API: getCurrentCUDAStream().stream() biondizzle 2026-06-04 03:43:04 +00:00
9b3917e248 Fix blackwell_swizzle.cu: add pybind11 bindings for torch extension loader biondizzle 2026-06-04 03:29:10 +00:00
5487a58df4 Fix NameError: add rows/cols variables to MoE swizzle biondizzle 2026-06-04 03:14:27 +00:00
a434545d12 Blackwell swizzle CUDA kernel for CUDA graph capture biondizzle 2026-06-04 03:03:02 +00:00
e7766254b7 Pre-allocate ALL GEMM output buffers for CUDA graph capture biondizzle 2026-06-04 02:41:59 +00:00
676a0448c0 CRITICAL FIX: _l1_out_buf was 2x too narrow — caused GPU memory corruption biondizzle 2026-06-04 02:06:18 +00:00
0890e578f4 DEBUG: print l1_out shape before gate/up split biondizzle 2026-06-04 01:49:12 +00:00
8546ed725f DEBUG: check SE input magnitude biondizzle 2026-06-04 01:38:24 +00:00
26ecf96328 DEBUG: check intermediate magnitude before SE L2 biondizzle 2026-06-04 01:30:29 +00:00
5303d6a82f DEBUG: test copy_ with contiguous slice vs scalar assign for gsa biondizzle 2026-06-04 01:27:25 +00:00
ccbc713658 DEBUG: check gsa values and pinpoint exact failing operation biondizzle 2026-06-04 01:16:37 +00:00
e77455c3ba DEBUG: add sync inside quantize_nvfp4_gpu_fused to catch async errors biondizzle 2026-06-04 01:05:47 +00:00
55def5eef9 Restore A/B split + gsa scalar fix (error is pre-existing, not regression) biondizzle 2026-06-04 01:03:36 +00:00
59eccd04ab REVERT: test if cudaErrorInvalidValue is pre-existing or regression biondizzle 2026-06-04 00:53:09 +00:00
5e3ced0b60 DEBUG: isolate which kernel causes cudaErrorInvalidValue in SE L2 path biondizzle 2026-06-04 00:41:28 +00:00
b314fde9b7 Fix gsa copy_ cudaErrorInvalidValue: replace view-based copy_ with scalar assignment biondizzle 2026-06-04 00:30:21 +00:00
993bb345d1 DEBUG: fix VERBOSE reference in shared_expert, always print L2 gsa debug biondizzle 2026-06-04 00:15:38 +00:00
f0f87df906 DEBUG: add sync + shape prints to shared_expert L2 gsa copy biondizzle 2026-06-04 00:05:08 +00:00
1d6610c46d CUDA graph A/B split: eager-break-at-attention architecture biondizzle 2026-06-03 23:53:08 +00:00
800e974d20 Update CUDA_GRAPH_SYNC_INVENTORY.md with session 2 progress biondizzle 2026-06-03 23:41:42 +00:00
a468f72a0e CUDA graph: Pre-allocate L1 GEMM output buffers in MoE and SharedExpert biondizzle 2026-06-03 23:17:43 +00:00
56b816a54f CUDA graph: Use per-GPU position/token buffers for graph capture biondizzle 2026-06-03 22:56:20 +00:00
f57de06eb5 Fix grouped_linear GEMM output buffer shape and extraction biondizzle 2026-06-03 22:26:40 +00:00
92225b07e7 CUDA graph: Simplify to single-graph-per-layer capture (revert A/B split) biondizzle 2026-06-03 22:04:18 +00:00
b32713c302 grouped_linear: Pre-allocate output buffer for grouped GEMM (CUDA graph capture) biondizzle 2026-06-03 22:02:01 +00:00
676fad064f Fix: Add out= parameter to run_fused_swiglu_grouped_gemm signature biondizzle 2026-06-03 21:45:15 +00:00
188ecae47f CUDA graph: Eliminate per-step allocations in graph-captured code paths biondizzle 2026-06-03 21:30:24 +00:00
91c370360a Fix CuTeDSL from_dlpack device mismatch in CUDA graph capture (v3) biondizzle 2026-06-03 21:09:12 +00:00
5c94dbbc37 Fix CuTeDSL from_dlpack device mismatch in CUDA graph capture (v2) biondizzle 2026-06-03 20:54:18 +00:00
87b6c9932b Fix CuTeDSL from_dlpack device mismatch inside CUDA graph capture biondizzle 2026-06-03 20:34:24 +00:00
2661cebe9a Fix warmup_gsa: handle multi-element _gsa_buf (Nvfp4GroupedLinear per-group gsa) biondizzle 2026-06-03 19:49:54 +00:00
486f74d900 CUDA graph: Implement eager-break-at-attention decoder with sub-graph A/B split biondizzle 2026-06-03 19:24:26 +00:00
5ea3aa3406 Update GETTING_CUDAGRAPH_READY.md and CUDA_GRAPH_SYNC_INVENTORY.md biondizzle 2026-06-03 19:15:27 +00:00
80bb27f5bf CUDA graph: Fix gsa broadcast — contiguous for prefill, reshape for decode biondizzle 2026-06-03 18:08:18 +00:00
518a1d3f95 CUDA graph: Fix MoE scatter_add_ index dtype + fix second bincount biondizzle 2026-06-03 17:53:40 +00:00
f13a81d48b CUDA graph: Fix per-call allocations in grouped_linear and quantize biondizzle 2026-06-03 17:39:20 +00:00
84655d066a CUDA graph: Fix MoE bincount and per-call allocations (Hazard #4) biondizzle 2026-06-03 17:37:03 +00:00
df05289d6f CUDA graph: Fix remaining sync violations from B200 detector run 2 biondizzle 2026-06-03 17:20:34 +00:00
e07d79868f CUDA graph: Fix _assemble_scales_single_group swizzle size biondizzle 2026-06-03 17:02:34 +00:00
0ca7bed0e1 CUDA graph: Fix sync violations found by B200 detector biondizzle 2026-06-03 16:52:19 +00:00
46a3a51832 CUDA graph: Fix per-step allocations in decode loop biondizzle 2026-06-03 16:38:35 +00:00
a9ea30353c CUDA graph: Fix sync violations (Category 1-2) biondizzle 2026-06-03 16:37:20 +00:00
caac8ae108 Fix syntax error: 'is not not None' -> 'is not None' biondizzle 2026-06-03 16:34:33 +00:00
ba68212fa7 Add CUDA graph readiness detector (Section A of GETTING_CUDAGRAPH_READY.md) biondizzle 2026-06-03 16:34:15 +00:00
ca5bc814d5 Fix compressor: do not add positional bias to KV content biondizzle 2026-06-03 15:52:00 +00:00
4fe73fe713 auto: pre-test commit v-precision-floor-fix-20260603 biondizzle 2026-06-03 15:45:15 +00:00
f577ed97f4 Fix: Use PyTorch dequant_nvfp4 for weight dequantization (compressor/indexer/router gate) biondizzle 2026-06-03 14:57:40 +00:00
1121cd7b47 Add CUDA_LAUNCH_BLOCKING=1 to catch async errors biondizzle 2026-06-03 14:48:51 +00:00
f3bb0ca08c Fix dequant gsa: use ws2 only, NOT input_scale * ws2 biondizzle 2026-06-03 14:38:24 +00:00
470e65fb19 Fix dequant gsb: input_scale * ws2, not 1.0 * ws2 biondizzle 2026-06-03 14:26:59 +00:00
2dd16d5789 Switch compressor + indexer weights_proj to BF16 F.linear biondizzle 2026-06-03 14:19:41 +00:00
95e45a87e3 Add explicit .to(dev) on W_gate after transpose — belt and suspenders biondizzle 2026-06-03 14:17:02 +00:00
ef94c48957 Simplify router gate: dequant NVFP4 → BF16, F.linear (no FP8 middleman) biondizzle 2026-06-03 14:14:10 +00:00
715602c87c Switch lm_head to BF16 + router gate to FP8_E4M3 biondizzle 2026-06-03 14:10:28 +00:00
3320abfe24 Fix two correctness bugs: compressor pos bias on KV + SwiGLU clamp ordering (pure-nvfp4) biondizzle 2026-06-03 11:17:49 +00:00
7901470e63 doc clean up (v-official-encoding-path) biondizzle 2026-06-03 10:53:41 +00:00
ca7c309463 Add reference/ dir: vLLM tokenizers, reasoning parsers, tool parsers, official inference biondizzle 2026-06-03 10:25:23 +00:00
8cfc1cae58 Canonical encoding: derive special token IDs from official encoding module + tokenizer biondizzle 2026-06-03 10:23:02 +00:00
a86d6d90a5 Replace hand-rolled prompt with official DSV4 encoder (canonical path) biondizzle 2026-06-03 09:59:05 +00:00
284fc9ca86 Fix: thread comp_rope_cos/comp_rope_sin through forward_attention biondizzle 2026-06-03 09:30:57 +00:00
6a3374da18 Cross-check 2 complete: block-aligned comp_pos + compress_rope_theta wired through biondizzle 2026-06-03 09:19:11 +00:00
5003e756e2 WIP: cross-check 2 fix — block-aligned compressed RoPE positions + compress_rope_theta support biondizzle 2026-06-03 09:17:54 +00:00
572bdd2840 auto: pre-test commit biondizzle 2026-06-03 09:01:02 +00:00
3c06fd5591 Test 2: fix topk tensor shape (flatten before iterating) biondizzle 2026-06-03 08:47:32 +00:00
89f6e64057 README: document test harness gotchas (timeout arg, stale procs, screen names) biondizzle 2026-06-03 08:36:02 +00:00
29d6986dd4 Test 2: fix quantize_to_nvfp4 import biondizzle 2026-06-03 08:21:39 +00:00
60b9bbd470 Test 2: fix import - use mHCLayer from dsv4.layers.mhc, fixed prompt encoding biondizzle 2026-06-03 08:20:21 +00:00
1e77dfcaa0 Fix prompt encoding: remove \n\n before content per official DSV4 spec; add --chat-mode biondizzle 2026-06-03 08:19:33 +00:00
2a42686e8e Test 1 v2: diff hand-rolled vs official DSV4 encoding biondizzle 2026-06-03 08:18:56 +00:00
11c2d5fe53 Add degeneration test 2: falsify mHC residual growth root cause biondizzle 2026-06-03 08:18:01 +00:00
c77b83fffc Add degeneration test 1: chat-template token-ID diff biondizzle 2026-06-03 08:17:09 +00:00
c5a131c358 more doc clean up again biondizzle 2026-06-03 08:14:07 +00:00
019a3a34b7 Clean up L0 B1 verify noise (gate on VERBOSE), update FINAL_STRETCH.md biondizzle 2026-06-03 08:12:54 +00:00
5e09be08af Fix non-contiguous tensor in quantize_nvfp4_gpu_fused (T>1 prefill) biondizzle 2026-06-03 07:56:19 +00:00
60309ef124 Batched prefill: replace T=1 token-by-token with chunked T≤128 batch processing biondizzle 2026-06-03 07:39:37 +00:00
0bf276f8c9 more doc cleanup biondizzle 2026-06-03 07:37:13 +00:00
d463ac8512 doc cleanup biondizzle 2026-06-03 07:34:12 +00:00
7450ebc67a CORRECTNESS_BACKLOG.md: comprehensive production pipeline verification results — all tested and confirmed findings from PART A diagnostics biondizzle 2026-06-03 07:31:01 +00:00