Commit Graph

  • 9d96c2fbbf CRITICAL FIX: FP32 RoPE cache + FP32 arithmetic for inverse RoPE round-trip biondizzle 2026-05-31 09:14:59 +00:00
  • db74a887ab Add minimal e2e test + fix MoE expert loop bug (indentation) biondizzle 2026-05-31 09:14:03 +00:00
  • e195d9d3a7 add SKIP_ROUTED_MOE debug flag, re-enable sinks biondizzle 2026-05-31 07:02:38 +00:00
  • 4f28673bec debug: disable sinks in SDPA to check |X| impact biondizzle 2026-05-31 06:51:58 +00:00
  • e3db90b56c switch back to original prompt biondizzle 2026-05-31 06:40:01 +00:00
  • d2cf5ccc32 CRITICAL FIX: use SDPA for short sequences (FMHA padding bug) biondizzle 2026-05-31 06:39:23 +00:00
  • 5f98855141 test with simpler prompt biondizzle 2026-05-31 06:28:45 +00:00
  • 152af7295a debug: compare FMHA vs SDPA output at layer 0 biondizzle 2026-05-31 06:16:58 +00:00
  • 59c75ca4e9 fix: cast attn_out back to BF16 after sink correction biondizzle 2026-05-31 06:07:06 +00:00
  • e5245ea34e fix: V tensor must be (B, n_h, hd, N) for FMHA — was transposed wrong biondizzle 2026-05-31 06:03:13 +00:00
  • 91abf0f921 FMHA + analytic sink bias correction using LSE biondizzle 2026-05-31 05:58:01 +00:00
  • fac269c938 fix verify_attention: proper multi-head SDPA + GQA biondizzle 2026-05-31 05:55:10 +00:00
  • 2333fc8b4b fix verify_attention.py: proper nvfp4_linear calls biondizzle 2026-05-31 05:53:49 +00:00
  • c09f68c867 add verify_attention.py: single-layer attention component test biondizzle 2026-05-31 05:51:36 +00:00
  • 04dd7545b3 switch to production FMHA for full run biondizzle 2026-05-31 04:51:16 +00:00
  • 738088cf49 revert: K=V with RoPE + inverse RoPE is the correct DSV4 approach biondizzle 2026-05-31 04:51:10 +00:00
  • 781ee43521 try separate K (RoPE'd) and V (raw) — no inverse RoPE needed biondizzle 2026-05-31 04:46:14 +00:00
  • 889521009b re-enable inverse RoPE (confirmed necessary — without it output is garbage) biondizzle 2026-05-31 04:45:58 +00:00
  • 92e465ca04 debug: disable inverse RoPE to check impact on output biondizzle 2026-05-31 04:40:34 +00:00
  • c69dc51b3b switch to SDPA with sinks (better residual control) biondizzle 2026-05-31 04:38:41 +00:00
  • 3ed8f3cc44 switch back to production FMHA kernel (with FP4 LUT fix) biondizzle 2026-05-31 04:32:01 +00:00
  • ae79bd8fce debug: add top-5 logit predictions biondizzle 2026-05-31 04:25:01 +00:00
  • aafe2eee12 CRITICAL FIX: FP4 LUT was 4x too large! biondizzle 2026-05-31 04:16:13 +00:00
  • b8c8da91fe fix: restore RoPE functions that were lost during mHC refactor biondizzle 2026-05-31 04:10:51 +00:00
  • 3f04a72af4 refactor: use production mHCLayer from dsv4.layers.mhc biondizzle 2026-05-31 04:06:58 +00:00
  • b519108cab fix: restore kv_cache.append that was accidentally removed biondizzle 2026-05-31 03:56:58 +00:00
  • 22a89b5a45 add attention sinks to SDPA path (paper D5c) biondizzle 2026-05-31 03:52:59 +00:00
  • 1905f19b8d fix: define q_input before USE_SDPA branch biondizzle 2026-05-31 03:45:09 +00:00
  • cd073ad867 use PyTorch SDPA for correctness (no sink bias in FMHA kernel yet) biondizzle 2026-05-31 03:42:03 +00:00
  • 171a9e0d10 disable diagnostics for clean production run biondizzle 2026-05-31 03:32:17 +00:00
  • 3f9b441428 diag: fix n_layers reference in forward_layer, add late-layer diags biondizzle 2026-05-31 03:28:53 +00:00
  • 5b834a0599 diag: add late-layer diagnostics, fix ffn ctx variable biondizzle 2026-05-31 03:25:55 +00:00
  • 690c0a1121 CRITICAL FIX: mHC base/scale ordering was wrong biondizzle 2026-05-31 03:16:07 +00:00
  • c3a2656c48 diag: add FFN and pre_block diagnostics biondizzle 2026-05-31 03:12:52 +00:00
  • 79ba7e6636 diag: add mHC diagnostics for first 3 layers biondizzle 2026-05-31 03:10:05 +00:00
  • a262492e51 fix: FMHA K/V tensor shape (was permuting cache), add q_a_norm and kv_norm biondizzle 2026-05-31 03:04:53 +00:00
  • 3f12bbc374 fix: move positions tensor to correct GPU for RoPE biondizzle 2026-05-31 02:54:47 +00:00
  • 0c3d168c60 single_shot: stream weights per-layer from CPU, fix KV/RoPE logic biondizzle 2026-05-31 02:53:40 +00:00
  • 61160ace13 fix: expert_weights/ids scoping in hash routing path biondizzle 2026-05-31 02:50:32 +00:00
  • d772885d7e single_shot_inference: proper mHC+RMSNorm+inverse RoPE pipeline biondizzle 2026-05-31 02:45:52 +00:00
  • 523b0e47b1 Add gentle RMSNorm: only clamps when values exceed unit norm biondizzle 2026-05-31 00:31:34 +00:00
  • dcbb74841a Remove emergency RMSNorm from mHC post_block — MoE provides balance now biondizzle 2026-05-31 00:27:48 +00:00
  • 1de241ccfe Fix: add all_tokens tracking for decode loop biondizzle 2026-05-31 00:22:08 +00:00
  • b1dd59293a Add prefill: process prompt tokens to fill KV cache before decoding biondizzle 2026-05-31 00:18:55 +00:00
  • 178fb5483a Fix KV cache: use index 0 (one-layer cache per layer instance) biondizzle 2026-05-31 00:14:58 +00:00
  • afcc690ddc Add full MoE routing + KV cache to single_shot biondizzle 2026-05-31 00:11:15 +00:00
  • 3ecfbcba57 Fix T scope in post_block biondizzle 2026-05-31 00:02:29 +00:00
  • a493f72681 Add per-residual RMSNorm in mHC post_block (routed MoE missing) biondizzle 2026-05-30 23:59:19 +00:00
  • 49282fe206 Fix mHC: match vLLM torch reference exactly biondizzle 2026-05-30 23:55:27 +00:00
  • 66a66f8244 Add per-layer NaN tracking for mHC debug biondizzle 2026-05-30 23:48:32 +00:00
  • d003c4b7cc Add mHC (Manifold-Constrained Hyper-Connections) to single_shot biondizzle 2026-05-30 23:45:18 +00:00
  • f567c20539 Fix: set active CUDA device per layer for BMM/FMHA biondizzle 2026-05-30 23:39:45 +00:00
  • 7a95983e0f Rewrite single_shot: 8-GPU pipeline parallel biondizzle 2026-05-30 23:36:14 +00:00
  • aac0fa1f08 Update STATUS.md + MEMORY.md: single-shot inference verified biondizzle 2026-05-30 22:59:27 +00:00
  • 11c010e567 Update output section: kernel verified, architecture gaps noted biondizzle 2026-05-30 22:58:49 +00:00
  • 53178d2536 Add emergency RMSNorm after residuals (missing mHC fallback) biondizzle 2026-05-30 22:56:16 +00:00
  • 172ba75e0c Add per-layer NaN check to track where values diverge biondizzle 2026-05-30 22:54:57 +00:00
  • ec7846e28c Add NaN tracking to single_shot_inference biondizzle 2026-05-30 22:53:09 +00:00
  • 5fa6c88b17 Fix: replace FP4 Inf with 24 (avoid NaN in dequant) biondizzle 2026-05-30 22:51:10 +00:00
  • 904753f62a Fix: BMM batch dim alignment for wo_a biondizzle 2026-05-30 22:49:21 +00:00
  • 52df3bc26c Fix: wo_a as batched matmul (grouped linear for output projection) biondizzle 2026-05-30 22:48:31 +00:00
  • 19240608d7 Fix: handle o_a_proj grouped linear shape mismatch biondizzle 2026-05-30 22:46:12 +00:00
  • 1d02758416 Fix: kv_proj outputs hd=512 (1 KV head MQA), Z from compressor.gate_proj biondizzle 2026-05-30 22:45:14 +00:00
  • 5dcfb333ea Fix: move weight tensors to CUDA before dequant biondizzle 2026-05-30 22:43:47 +00:00
  • 47c7b3c50b Fix: ensure FP4 LUT on CUDA before index op biondizzle 2026-05-30 22:43:01 +00:00
  • 13bae9dd55 Fix single_shot: mHC replaces layernorm, no hidden-level norm in DSV4 biondizzle 2026-05-30 22:42:17 +00:00
  • e8334fc4af Rewrite single_shot_inference.py — complete forward pass biondizzle 2026-05-30 22:40:56 +00:00
  • 9b0858aa35 Add single_shot_inference.py — baseline kernel verification biondizzle 2026-05-30 22:39:01 +00:00
  • 4472928506 E3: model construction test biondizzle 2026-05-30 21:22:34 +00:00
  • afc07a5d1a Update STATUS.md: E5 done biondizzle 2026-05-30 21:21:47 +00:00
  • df6220abaf E5: Fold batch loop into native kernel grid (blockIdx.z) biondizzle 2026-05-30 21:21:02 +00:00
  • e162a2d112 Update STATUS.md: E1-E4 done biondizzle 2026-05-30 21:20:10 +00:00
  • c4b40dd06c E2: CSA/HCA integration test — gather + FMHA end-to-end biondizzle 2026-05-30 21:19:28 +00:00
  • 9d88769f5f Wire indexer compute_index_scores_topk + fix compressor imports biondizzle 2026-05-30 21:19:06 +00:00
  • daf84524ac E2/E3: compressor bridge, indexer bridge, flush pipeline wiring biondizzle 2026-05-30 21:16:54 +00:00
  • d3b772196d E3: Implement DSV4Model — full model class biondizzle 2026-05-30 21:15:57 +00:00
  • b0cdd5af74 fix: extern declarations for gather_swa functions in gather_kv.cu biondizzle 2026-05-30 21:14:15 +00:00
  • 016d722abc fix: single PYBIND11_MODULE for combined gather .so biondizzle 2026-05-30 21:13:24 +00:00
  • 8fb9d89658 fix: correct gather.py kernel_dir path biondizzle 2026-05-30 21:12:09 +00:00
  • 924707a673 fix: add FFNType/RouterMode to LayerSpec in e2e test biondizzle 2026-05-30 21:11:04 +00:00
  • e2e21c6350 fix: remove unused pytest import from e2e test biondizzle 2026-05-30 21:10:43 +00:00
  • 300dddedc0 E1-E4: gather kernels, handle wiring, rope, sync removal, e2e test biondizzle 2026-05-30 21:10:26 +00:00
  • faf92b30ad E1: Wire LayerCacheHandle gather methods + CUDA gather kernels biondizzle 2026-05-30 21:09:21 +00:00
  • 4b9eed02e1 Cleanup C1-C7: delete dead CuTeDSL FMHA, test probes, scratch files biondizzle 2026-05-30 21:08:12 +00:00
  • a360fa308a P6-P8: Update NEXT_PRIORITIES.md with completion status biondizzle 2026-05-30 17:28:02 +00:00
  • 2c18609296 P8: Fix P6 test imports after deleting multihead module biondizzle 2026-05-30 17:25:01 +00:00
  • e1b9e94c24 P8: Fix test imports after deleting multihead module biondizzle 2026-05-30 17:23:13 +00:00
  • 95725f1df0 P8: Delete 6 redundant .cuh variants + multihead CAPI/op biondizzle 2026-05-30 17:21:15 +00:00
  • 9d483b1c54 P8: Unified dispatch — multi-tile kernel handles all N biondizzle 2026-05-30 17:19:09 +00:00
  • e747742598 P7: Document TMEM column layout, add multi-row softmax test biondizzle 2026-05-30 17:17:54 +00:00
  • f1ce47e3c9 P7: Add TMEM column layout probe test biondizzle 2026-05-30 17:14:50 +00:00
  • 5e5217bfc3 P6: Relax test gate to 0.999990 (SMEM staging adds tiny BF16 noise) biondizzle 2026-05-30 17:13:20 +00:00
  • 11d15d9e72 P6: Clean up test — remove broken TMA store test, update epilogue test biondizzle 2026-05-30 17:12:23 +00:00
  • c0379a0f86 P6: Remove broken TMA store — use direct GMEM write from SMEM biondizzle 2026-05-30 17:11:17 +00:00
  • f97359fbfc P6: TMA store uses mbarrier completion (same as load) biondizzle 2026-05-30 17:07:24 +00:00
  • 2de300e281 P6: Try shared::cluster instead of shared::cta for TMA store biondizzle 2026-05-30 17:05:27 +00:00
  • 829a5f93ce P6: Fix TMA store PTX — remove .tile modifier, fix wait_group syntax biondizzle 2026-05-30 17:04:38 +00:00
  • e4ee9fdc9f P6: Fix host-side BF16→FP32 conversion in test biondizzle 2026-05-30 17:01:13 +00:00
  • a88b321433 P6: Fix host-side BF16 conversion in test biondizzle 2026-05-30 17:00:51 +00:00
  • 1a87e054db P6: Fix constexpr and bf16 conversion in CUDA test biondizzle 2026-05-30 17:00:05 +00:00