9dbfac9dfa
PART A: verify kv_norm_w loaded correctly
biondizzle
2026-06-03 07:03:39 +00:00
a682c6adf4
PART A: add raw compressor output diagnostic
biondizzle
2026-06-03 06:56:56 +00:00
f2c1b3afd5
PART A: fix KV diagnostics — compute q_a before indexer, add Q_heads magnitude check
biondizzle
2026-06-03 06:33:51 +00:00
86e59c16c5
PART A: add KV gather diagnostics at blowup layer
biondizzle
2026-06-03 06:25:35 +00:00
262f844e2e
PART A: add detailed blowup diagnostics — capture mHC intermediate values when |X| > 1e6
biondizzle
2026-06-03 06:10:33 +00:00
6459fbca9a
fix: import forward_attention
biondizzle
2026-06-03 05:41:33 +00:00
91dfac34d8
PART A: simplified to production-only diagnostics — track per-layer |X| during prefill and decode, detect blowup early
biondizzle
2026-06-03 05:33:22 +00:00
d99503732d
fix: add BF16 gate weight fallback for dense routers (missing from test)
biondizzle
2026-06-03 05:22:47 +00:00
801bfc9a83
add router mode debug print
biondizzle
2026-06-03 05:15:52 +00:00
b385ecc05e
PART A: decode diagnostics test — production vs reference per-layer X comparison at decode step
biondizzle
2026-06-03 05:06:40 +00:00
d518fcb82a
test: correct sink bias reference — denominator-only, no V contribution
biondizzle
2026-06-03 04:57:37 +00:00
9574a9dc2e
test: add sink bias to reference SDPA in decode FMHA comparison
biondizzle
2026-06-03 04:53:55 +00:00
9a9b347b2b
test: add per-head magnitude ratio diagnostics to decode FMHA test
biondizzle
2026-06-03 04:50:23 +00:00
f5fa20c581
fix: syntax error — missing closing paren in indexer.forward call
biondizzle
2026-06-03 04:46:41 +00:00
693975ec92
fix: device mismatches in decode FMHA test — dec_pos must be on per-layer GPU
biondizzle
2026-06-03 04:46:24 +00:00
e1d96c509d
test: decode FMHA layer comparison — checks FMHA accuracy during decode step
biondizzle
2026-06-03 04:39:12 +00:00
1ebe7f0dde
Add PART_A_NEXT_SESSION.md: clues for decode degeneration debugging
biondizzle
2026-06-03 04:34:28 +00:00
d8306be3f2
Fix PART A test: proper FP8 quantization and MQA reference
biondizzle
2026-06-03 04:20:36 +00:00
4126909dfb
Simplify PART A test: compressor + FMHA at production scale
biondizzle
2026-06-03 04:18:13 +00:00
8c54cfa748
Fix KVCache init in PART A test
biondizzle
2026-06-03 04:15:41 +00:00
04cf8ca848
Add PART A diagnostic tests: compressor + KV cache + FMHA at production scale
biondizzle
2026-06-03 04:13:53 +00:00
75288bd12f
Wire prefill FMHA into production.py and single_shot
biondizzle
2026-06-03 03:49:57 +00:00
5417f65b08
CRITICAL FIX: Add T-dimension strides to prefill FMHA kernel
biondizzle
2026-06-03 03:48:17 +00:00
dd1cbe1faa
Fix smem size for prefill debug test
biondizzle
2026-06-03 03:47:01 +00:00
09384a637a
Fix constexpr issues in prefill debug test
biondizzle
2026-06-03 03:46:29 +00:00
d3dc8cf901
Add prefill T=2 debug CUDA test with intermediate value printing
biondizzle
2026-06-03 03:46:14 +00:00
223c22488f
Simplify prefill PV read: use decode kernel's exact pattern
biondizzle
2026-06-03 03:22:49 +00:00
2bf5e74e61
Add prefill debug test: compare T=1 decode vs prefill kernel step by step
biondizzle
2026-06-03 03:05:25 +00:00
eb69c3bfb9
CRITICAL FIX: add missing tb base in QK TMEM read address
biondizzle
2026-06-03 03:00:57 +00:00
99b6de316b
Fix prefill kernel: add missing tb base in PV TMEM read, fix ACCUMULATE for per-row PV
biondizzle
2026-06-03 02:59:19 +00:00
9034f67b0f
Fix prefill kernel: read ALL n_sub PV results (was only n_sub=0)
biondizzle
2026-06-03 02:54:59 +00:00
a4ef6c3454
Add B1 mixed FP8 prefill FMHA kernel (T>1 support)
biondizzle
2026-06-03 02:50:27 +00:00
1f757151ef
Fix router gate BF16 quantize path for production FMHA test
biondizzle
2026-06-03 02:47:47 +00:00
07168357cc
Fix o_a_proj weight loading: add BF16 fallback for grouped linear
biondizzle
2026-06-03 02:38:00 +00:00
27d8d80a40
Fix missing DEVICE constant in production FMHA test
biondizzle
2026-06-03 02:31:11 +00:00
26a817c2f2
Fix production FMHA layer test: compare raw FMHA vs SDPA on production gathered KV
biondizzle
2026-06-03 02:26:37 +00:00
ba67e055f7
Add production FMHA layer comparison test
biondizzle
2026-06-03 02:22:23 +00:00
af58f2c5b2
Add B1 weight/format verification at L0 in single_shot
v-b1-b2-done-20260603
biondizzle
2026-06-03 01:52:55 +00:00
8df5de5477
Update B1 docs with test results and bug fix
biondizzle
2026-06-03 01:50:59 +00:00
3e3b352e7e
Update FINAL_STRETCH.md: B1 and B2 marked DONE with test results and bug fixes
biondizzle
2026-06-03 01:50:21 +00:00
84a02f8995
Remove debug test files, keep production B1/B2 unit tests
biondizzle
2026-06-03 01:49:39 +00:00
6fa9ad7852
B2 indexer: adopt TMEM warp-to-row mapping fix
biondizzle
2026-06-03 01:42:38 +00:00
6c92ff91f3
B2 indexer: temporary heads 0-31 only while figuring out TMEM row 32-63 layout
biondizzle
2026-06-03 01:12:10 +00:00
7732c93f62
Fix B2 indexer: use 16x256b.x1 TMEM read with TMEM_COLS=512
biondizzle
2026-06-03 01:08:48 +00:00
a75a9843af
Fix B2 indexer: add sLogits scratch buffer to SMEM layout
biondizzle
2026-06-03 00:59:06 +00:00
cc7b17fdaa
Fix B2 indexer: use 2-warps for TMEM read (P7 row-slice model)
biondizzle
2026-06-03 00:55:27 +00:00
8d0a02ca67
B2 TMEM debug: try stride=SK_TILE/8=16 for row group 32-63
biondizzle
2026-06-03 00:52:32 +00:00
fdf702470c
Add B2 TMEM read debug kernel and test
biondizzle
2026-06-03 00:50:52 +00:00
f1cf4c0215
Add B2 QK debug test with w_h=1 for simple comparison
biondizzle
2026-06-03 00:46:48 +00:00
d36dbba01c
Fix B2 indexer: increase TMEM_COLS to 512 for full 128-row MMA output
biondizzle
2026-06-03 00:45:15 +00:00
797345dfe9
Add B2 score debug test
biondizzle
2026-06-03 00:43:44 +00:00
afb82b9c89
Fix B2 indexer: replace broken 16x256b TMEM read with proven 32x32b.x8
biondizzle
2026-06-03 00:39:49 +00:00
99e50fcb58
Add B2 minimal debug test to find hang point
biondizzle
2026-06-03 00:35:48 +00:00
e21bd14408
Fix B1 test LSE reference shape handling
biondizzle
2026-06-03 00:25:53 +00:00
4fe7f9dc37
Fix B1 FMHA: swap V matrix canonical layout args (dd, kk) not (kk, dd)
biondizzle
2026-06-03 00:24:20 +00:00
29a95a3db6
Add B1 QK vs PV isolation test
biondizzle
2026-06-03 00:23:35 +00:00
c322e3f301
Add B1 FMHA debug test for cosine failure investigation
biondizzle
2026-06-03 00:22:00 +00:00
5447d1d1dc
Add comprehensive B2 FP8 indexer unit test
biondizzle
2026-06-03 00:21:29 +00:00
38eecb28d8
Add comprehensive B1 mixed FP8 FMHA unit test
biondizzle
2026-06-03 00:20:07 +00:00
f2063c0588
B1: minimal debug test for mixed FP8 FMHA (1 head, N=128)
biondizzle
2026-06-03 00:09:36 +00:00
0cea0b33ff
B1 test: fix BF16 reference to use PyTorch SDPA
biondizzle
2026-06-03 00:07:38 +00:00
a51d19a7fc
B1: add mixed FP8 FMHA cosine verification test (HD=512, N=128-2048)
biondizzle
2026-06-03 00:06:25 +00:00
b9243fe40a
B2: FP8 tensor-core indexer scoring + weighted ReLU + top-k
biondizzle
2026-06-02 23:18:54 +00:00
a9d5e09f4c
B1: mixed FP8/BF16 decode FMHA integration
biondizzle
2026-06-02 22:53:14 +00:00
2eb4f0886e
things
pre-b1
biondizzle
2026-06-02 22:31:13 +00:00
9d4a014fad
Fix NameError: dequantize_nvfp4 not in scope in forward_attention
biondizzle
2026-06-02 21:52:29 +00:00
9ba6476d3f
auto: pre-test commit
biondizzle
2026-06-02 21:39:01 +00:00
845227c06c
Fix stale lock file in CUDA loader — prevents infinite spin on crash recovery
biondizzle
2026-06-02 21:34:58 +00:00
0b6ca0df80
P5 integration + B3 q_a_norm fused + gsa scalar fix
biondizzle
2026-06-02 21:20:34 +00:00
7e42b5e090
A1: Add ◇ (think_start) priming after Assistant token
biondizzle
2026-06-02 20:23:47 +00:00
ac4eedc444
auto: pre-test commit
biondizzle
2026-06-02 20:16:43 +00:00
ecd48ab65e
A1: Add explicit stop set for DSV4 turn-end tokens
biondizzle
2026-06-02 19:59:52 +00:00
35dbb8d12b
Cleanup Part 2: Fix docs, stale references, dead code
biondizzle
2026-06-02 19:27:28 +00:00
f3b551956d
Cleanup Step 2: Archive Lineage P code, fix broken imports
biondizzle
2026-06-02 19:27:07 +00:00
8de47e26ce
Cleanup Step 1: Move root-level files to proper directories
biondizzle
2026-06-02 19:24:39 +00:00
b111525af4
Fix indexer documentation and safety issues
biondizzle
2026-06-02 19:08:40 +00:00
d770111cb1
Remove stale duplicate .cu files from indexer/ subfolder
biondizzle
2026-06-02 18:49:40 +00:00
eb5ef93bf1
Add A/B comparison mode for P4 fused vs unfused RMSNorm+quantize
biondizzle
2026-06-02 18:49:30 +00:00
b8bab01a55
Update PERFORMANCE_AUDIT.md — P4 done, P5 kernel done (pending integration)
biondizzle
2026-06-02 18:26:01 +00:00
8447ba7138
FIX: Deadlock in indexer_score_topk kernel — __syncthreads inside strided loop
biondizzle
2026-06-02 18:11:56 +00:00
c926c4a597
P5: Fix mhc_rmsnorm_quantize_nvfp4 — add proper function definition
biondizzle
2026-06-02 17:57:33 +00:00
36fdbeb56d
stuff
biondizzle
2026-06-02 17:51:46 +00:00
bdf0b15d45
P4: Fix rmsnorm_quantize_nvfp4 returns QuantizedActivation not tuple
biondizzle
2026-06-02 17:43:21 +00:00
454dbdad52
P5: Fused mHC pre_block + RMSNorm + NVFP4 quantize kernel
biondizzle
2026-06-02 16:39:42 +00:00
7bb3207347
P4: Integrate fused RMSNorm+quantize into single_shot (attention path)
biondizzle
2026-06-02 16:38:44 +00:00
0d1cd1e216
P4: Add QuantizedActivation + Nvfp4Linear.run_from_quantized
biondizzle
2026-06-02 16:37:38 +00:00
149ecefb56
P4: Relax test thresholds — per-row gsa vs scalar gsa difference expected
biondizzle
2026-06-02 16:34:49 +00:00
57ab4b9d4c
P4: Fix dequantize_nvfp4 bridge — handle float8_e4m3fn dtype
biondizzle
2026-06-02 16:31:56 +00:00
29f836d711
P4: Fix fused RMSNorm kernel — match quantize_nvfp4.cu encoding
biondizzle
2026-06-02 16:28:44 +00:00
794ebaf7e5
P4: Fused RMSNorm + NVFP4 quantize kernel (2 launches vs 6+)
biondizzle
2026-06-02 16:26:24 +00:00
82294fc21e
Fix nope_dim UnboundLocalError — hoist to function scope
biondizzle
2026-06-02 11:18:58 +00:00
e231b98387
Fix mHC Sinkhorn test: row sums expected to be off (eps after softmax)
biondizzle
2026-06-02 10:46:28 +00:00
b5f29be169
Add mHC Sinkhorn CUDA kernel test
biondizzle
2026-06-02 10:45:02 +00:00
6cb5078821
Fix mHC Sinkhorn kernel: remove VLA, remove Python fallback
biondizzle
2026-06-02 10:44:53 +00:00
c89762ecdd
Fix set_indexer_keys_fp8 None guard + store comp_pos in mixed storage
biondizzle
2026-06-02 10:20:26 +00:00
1f69f61363
Add detailed comment: why compressed KV uses FP8 not NVFP4
biondizzle
2026-06-02 10:19:54 +00:00
edc8e7ee8d
KV-1/KV-2: Mixed FP8+BF16 compressed KV (DeepSeek V4 paper format)
biondizzle
2026-06-02 10:08:43 +00:00
12b6365b42
Fix RoPE test: use proper cos/sin cache
biondizzle
2026-06-02 10:04:01 +00:00
f566b9b748
Fix FP8 quantize return type (2-tuple not 3)
biondizzle
2026-06-02 10:02:01 +00:00
bdb25ee5cd
Add production-value unit tests for kv_quantize kernels
biondizzle
2026-06-02 10:01:07 +00:00