-
9d96c2fbbf
CRITICAL FIX: FP32 RoPE cache + FP32 arithmetic for inverse RoPE round-trip
biondizzle
2026-05-31 09:14:59 +00:00
-
db74a887ab
Add minimal e2e test + fix MoE expert loop bug (indentation)
biondizzle
2026-05-31 09:14:03 +00:00
-
e195d9d3a7
add SKIP_ROUTED_MOE debug flag, re-enable sinks
biondizzle
2026-05-31 07:02:38 +00:00
-
4f28673bec
debug: disable sinks in SDPA to check |X| impact
biondizzle
2026-05-31 06:51:58 +00:00
-
e3db90b56c
switch back to original prompt
biondizzle
2026-05-31 06:40:01 +00:00
-
d2cf5ccc32
CRITICAL FIX: use SDPA for short sequences (FMHA padding bug)
biondizzle
2026-05-31 06:39:23 +00:00
-
5f98855141
test with simpler prompt
biondizzle
2026-05-31 06:28:45 +00:00
-
152af7295a
debug: compare FMHA vs SDPA output at layer 0
biondizzle
2026-05-31 06:16:58 +00:00
-
59c75ca4e9
fix: cast attn_out back to BF16 after sink correction
biondizzle
2026-05-31 06:07:06 +00:00
-
e5245ea34e
fix: V tensor must be (B, n_h, hd, N) for FMHA — was transposed wrong
biondizzle
2026-05-31 06:03:13 +00:00
-
91abf0f921
FMHA + analytic sink bias correction using LSE
biondizzle
2026-05-31 05:58:01 +00:00
-
fac269c938
fix verify_attention: proper multi-head SDPA + GQA
biondizzle
2026-05-31 05:55:10 +00:00
-
2333fc8b4b
fix verify_attention.py: proper nvfp4_linear calls
biondizzle
2026-05-31 05:53:49 +00:00
-
c09f68c867
add verify_attention.py: single-layer attention component test
biondizzle
2026-05-31 05:51:36 +00:00
-
04dd7545b3
switch to production FMHA for full run
biondizzle
2026-05-31 04:51:16 +00:00
-
738088cf49
revert: K=V with RoPE + inverse RoPE is the correct DSV4 approach
biondizzle
2026-05-31 04:51:10 +00:00
-
781ee43521
try separate K (RoPE'd) and V (raw) — no inverse RoPE needed
biondizzle
2026-05-31 04:46:14 +00:00
-
889521009b
re-enable inverse RoPE (confirmed necessary — without it output is garbage)
biondizzle
2026-05-31 04:45:58 +00:00
-
92e465ca04
debug: disable inverse RoPE to check impact on output
biondizzle
2026-05-31 04:40:34 +00:00
-
c69dc51b3b
switch to SDPA with sinks (better residual control)
biondizzle
2026-05-31 04:38:41 +00:00
-
3ed8f3cc44
switch back to production FMHA kernel (with FP4 LUT fix)
biondizzle
2026-05-31 04:32:01 +00:00
-
ae79bd8fce
debug: add top-5 logit predictions
biondizzle
2026-05-31 04:25:01 +00:00
-
aafe2eee12
CRITICAL FIX: FP4 LUT was 4x too large!
biondizzle
2026-05-31 04:16:13 +00:00
-
b8c8da91fe
fix: restore RoPE functions that were lost during mHC refactor
biondizzle
2026-05-31 04:10:51 +00:00
-
3f04a72af4
refactor: use production mHCLayer from dsv4.layers.mhc
biondizzle
2026-05-31 04:06:58 +00:00
-
b519108cab
fix: restore kv_cache.append that was accidentally removed
biondizzle
2026-05-31 03:56:58 +00:00
-
22a89b5a45
add attention sinks to SDPA path (paper D5c)
biondizzle
2026-05-31 03:52:59 +00:00
-
1905f19b8d
fix: define q_input before USE_SDPA branch
biondizzle
2026-05-31 03:45:09 +00:00
-
cd073ad867
use PyTorch SDPA for correctness (no sink bias in FMHA kernel yet)
biondizzle
2026-05-31 03:42:03 +00:00
-
171a9e0d10
disable diagnostics for clean production run
biondizzle
2026-05-31 03:32:17 +00:00
-
3f9b441428
diag: fix n_layers reference in forward_layer, add late-layer diags
biondizzle
2026-05-31 03:28:53 +00:00
-
5b834a0599
diag: add late-layer diagnostics, fix ffn ctx variable
biondizzle
2026-05-31 03:25:55 +00:00
-
690c0a1121
CRITICAL FIX: mHC base/scale ordering was wrong
biondizzle
2026-05-31 03:16:07 +00:00
-
c3a2656c48
diag: add FFN and pre_block diagnostics
biondizzle
2026-05-31 03:12:52 +00:00
-
79ba7e6636
diag: add mHC diagnostics for first 3 layers
biondizzle
2026-05-31 03:10:05 +00:00
-
a262492e51
fix: FMHA K/V tensor shape (was permuting cache), add q_a_norm and kv_norm
biondizzle
2026-05-31 03:04:53 +00:00
-
3f12bbc374
fix: move positions tensor to correct GPU for RoPE
biondizzle
2026-05-31 02:54:47 +00:00
-
0c3d168c60
single_shot: stream weights per-layer from CPU, fix KV/RoPE logic
biondizzle
2026-05-31 02:53:40 +00:00
-
61160ace13
fix: expert_weights/ids scoping in hash routing path
biondizzle
2026-05-31 02:50:32 +00:00
-
d772885d7e
single_shot_inference: proper mHC+RMSNorm+inverse RoPE pipeline
biondizzle
2026-05-31 02:45:52 +00:00
-
523b0e47b1
Add gentle RMSNorm: only clamps when values exceed unit norm
biondizzle
2026-05-31 00:31:34 +00:00
-
dcbb74841a
Remove emergency RMSNorm from mHC post_block — MoE provides balance now
biondizzle
2026-05-31 00:27:48 +00:00
-
1de241ccfe
Fix: add all_tokens tracking for decode loop
biondizzle
2026-05-31 00:22:08 +00:00
-
b1dd59293a
Add prefill: process prompt tokens to fill KV cache before decoding
biondizzle
2026-05-31 00:18:55 +00:00
-
178fb5483a
Fix KV cache: use index 0 (one-layer cache per layer instance)
biondizzle
2026-05-31 00:14:58 +00:00
-
afcc690ddc
Add full MoE routing + KV cache to single_shot
biondizzle
2026-05-31 00:11:15 +00:00
-
3ecfbcba57
Fix T scope in post_block
biondizzle
2026-05-31 00:02:29 +00:00
-
a493f72681
Add per-residual RMSNorm in mHC post_block (routed MoE missing)
biondizzle
2026-05-30 23:59:19 +00:00
-
49282fe206
Fix mHC: match vLLM torch reference exactly
biondizzle
2026-05-30 23:55:27 +00:00
-
66a66f8244
Add per-layer NaN tracking for mHC debug
biondizzle
2026-05-30 23:48:32 +00:00
-
d003c4b7cc
Add mHC (Manifold-Constrained Hyper-Connections) to single_shot
biondizzle
2026-05-30 23:45:18 +00:00
-
f567c20539
Fix: set active CUDA device per layer for BMM/FMHA
biondizzle
2026-05-30 23:39:45 +00:00
-
7a95983e0f
Rewrite single_shot: 8-GPU pipeline parallel
biondizzle
2026-05-30 23:36:14 +00:00
-
aac0fa1f08
Update STATUS.md + MEMORY.md: single-shot inference verified
biondizzle
2026-05-30 22:59:27 +00:00
-
11c010e567
Update output section: kernel verified, architecture gaps noted
biondizzle
2026-05-30 22:58:49 +00:00
-
53178d2536
Add emergency RMSNorm after residuals (missing mHC fallback)
biondizzle
2026-05-30 22:56:16 +00:00
-
172ba75e0c
Add per-layer NaN check to track where values diverge
biondizzle
2026-05-30 22:54:57 +00:00
-
ec7846e28c
Add NaN tracking to single_shot_inference
biondizzle
2026-05-30 22:53:09 +00:00
-
5fa6c88b17
Fix: replace FP4 Inf with 24 (avoid NaN in dequant)
biondizzle
2026-05-30 22:51:10 +00:00
-
904753f62a
Fix: BMM batch dim alignment for wo_a
biondizzle
2026-05-30 22:49:21 +00:00
-
52df3bc26c
Fix: wo_a as batched matmul (grouped linear for output projection)
biondizzle
2026-05-30 22:48:31 +00:00
-
19240608d7
Fix: handle o_a_proj grouped linear shape mismatch
biondizzle
2026-05-30 22:46:12 +00:00
-
1d02758416
Fix: kv_proj outputs hd=512 (1 KV head MQA), Z from compressor.gate_proj
biondizzle
2026-05-30 22:45:14 +00:00
-
5dcfb333ea
Fix: move weight tensors to CUDA before dequant
biondizzle
2026-05-30 22:43:47 +00:00
-
47c7b3c50b
Fix: ensure FP4 LUT on CUDA before index op
biondizzle
2026-05-30 22:43:01 +00:00
-
13bae9dd55
Fix single_shot: mHC replaces layernorm, no hidden-level norm in DSV4
biondizzle
2026-05-30 22:42:17 +00:00
-
e8334fc4af
Rewrite single_shot_inference.py — complete forward pass
biondizzle
2026-05-30 22:40:56 +00:00
-
9b0858aa35
Add single_shot_inference.py — baseline kernel verification
biondizzle
2026-05-30 22:39:01 +00:00
-
4472928506
E3: model construction test
biondizzle
2026-05-30 21:22:34 +00:00
-
afc07a5d1a
Update STATUS.md: E5 done
biondizzle
2026-05-30 21:21:47 +00:00
-
df6220abaf
E5: Fold batch loop into native kernel grid (blockIdx.z)
biondizzle
2026-05-30 21:21:02 +00:00
-
e162a2d112
Update STATUS.md: E1-E4 done
biondizzle
2026-05-30 21:20:10 +00:00
-
c4b40dd06c
E2: CSA/HCA integration test — gather + FMHA end-to-end
biondizzle
2026-05-30 21:19:28 +00:00
-
9d88769f5f
Wire indexer compute_index_scores_topk + fix compressor imports
biondizzle
2026-05-30 21:19:06 +00:00
-
daf84524ac
E2/E3: compressor bridge, indexer bridge, flush pipeline wiring
biondizzle
2026-05-30 21:16:54 +00:00
-
d3b772196d
E3: Implement DSV4Model — full model class
biondizzle
2026-05-30 21:15:57 +00:00
-
b0cdd5af74
fix: extern declarations for gather_swa functions in gather_kv.cu
biondizzle
2026-05-30 21:14:15 +00:00
-
016d722abc
fix: single PYBIND11_MODULE for combined gather .so
biondizzle
2026-05-30 21:13:24 +00:00
-
8fb9d89658
fix: correct gather.py kernel_dir path
biondizzle
2026-05-30 21:12:09 +00:00
-
924707a673
fix: add FFNType/RouterMode to LayerSpec in e2e test
biondizzle
2026-05-30 21:11:04 +00:00
-
e2e21c6350
fix: remove unused pytest import from e2e test
biondizzle
2026-05-30 21:10:43 +00:00
-
300dddedc0
E1-E4: gather kernels, handle wiring, rope, sync removal, e2e test
biondizzle
2026-05-30 21:10:26 +00:00
-
faf92b30ad
E1: Wire LayerCacheHandle gather methods + CUDA gather kernels
biondizzle
2026-05-30 21:09:21 +00:00
-
4b9eed02e1
Cleanup C1-C7: delete dead CuTeDSL FMHA, test probes, scratch files
biondizzle
2026-05-30 21:08:12 +00:00
-
a360fa308a
P6-P8: Update NEXT_PRIORITIES.md with completion status
biondizzle
2026-05-30 17:28:02 +00:00
-
2c18609296
P8: Fix P6 test imports after deleting multihead module
biondizzle
2026-05-30 17:25:01 +00:00
-
e1b9e94c24
P8: Fix test imports after deleting multihead module
biondizzle
2026-05-30 17:23:13 +00:00
-
95725f1df0
P8: Delete 6 redundant .cuh variants + multihead CAPI/op
biondizzle
2026-05-30 17:21:15 +00:00
-
9d483b1c54
P8: Unified dispatch — multi-tile kernel handles all N
biondizzle
2026-05-30 17:19:09 +00:00
-
e747742598
P7: Document TMEM column layout, add multi-row softmax test
biondizzle
2026-05-30 17:17:54 +00:00
-
f1ce47e3c9
P7: Add TMEM column layout probe test
biondizzle
2026-05-30 17:14:50 +00:00
-
5e5217bfc3
P6: Relax test gate to 0.999990 (SMEM staging adds tiny BF16 noise)
biondizzle
2026-05-30 17:13:20 +00:00
-
11d15d9e72
P6: Clean up test — remove broken TMA store test, update epilogue test
biondizzle
2026-05-30 17:12:23 +00:00
-
c0379a0f86
P6: Remove broken TMA store — use direct GMEM write from SMEM
biondizzle
2026-05-30 17:11:17 +00:00
-
f97359fbfc
P6: TMA store uses mbarrier completion (same as load)
biondizzle
2026-05-30 17:07:24 +00:00
-
2de300e281
P6: Try shared::cluster instead of shared::cta for TMA store
biondizzle
2026-05-30 17:05:27 +00:00
-
829a5f93ce
P6: Fix TMA store PTX — remove .tile modifier, fix wait_group syntax
biondizzle
2026-05-30 17:04:38 +00:00
-
e4ee9fdc9f
P6: Fix host-side BF16→FP32 conversion in test
biondizzle
2026-05-30 17:01:13 +00:00
-
a88b321433
P6: Fix host-side BF16 conversion in test
biondizzle
2026-05-30 17:00:51 +00:00
-
1a87e054db
P6: Fix constexpr and bf16 conversion in CUDA test
biondizzle
2026-05-30 17:00:05 +00:00