7804b779ce
diag: print wo_a g_flat magnitude to find where zeros come from
biondizzle
2026-06-01 01:40:53 +00:00
efe63caea9
diag: print FMHA output magnitude for first 3 layers
biondizzle
2026-06-01 01:34:02 +00:00
7fbbdc5204
diag: validate router output before MoE
biondizzle
2026-06-01 01:27:16 +00:00
f5fa84016e
diag: sync+error check after each layer on first token
biondizzle
2026-06-01 01:26:50 +00:00
91b3929605
fix: call moe_runner.run() and se_runner.run() (not __call__)
biondizzle
2026-06-01 01:14:38 +00:00
03c45d4bfb
fix: pass int32 token_ids to hash router (was int64)
biondizzle
2026-06-01 01:08:03 +00:00
62efde5c9f
fix: router — use cuBLAS BF16 GEMM + activation_topk CUDA kernel (production path, not CuTeDSL fused)
biondizzle
2026-06-01 01:01:15 +00:00
5591a725e1
fix: router kernel — infer OperandMajorMode from tensor layout (same pattern as MoE GEMM)
biondizzle
2026-06-01 00:59:18 +00:00
0ab5d8c317
fix: disable broken CuTeDSL fused router — use BF16 linear + activation_topk (both are production paths)
biondizzle
2026-06-01 00:56:00 +00:00
c339fe7ad9
fix: router A operand major mode MN (not K) — fixes CuTeDSL local_tile coord error
biondizzle
2026-06-01 00:54:19 +00:00
b7a8c44d26
single_shot: eager MoE/SE weight processing, stale GPU cleanup, --prefill-tokens flag
biondizzle
2026-06-01 00:42:08 +00:00
15f45b57c3
fix: correct Nvfp4Linear dimension inference from checkpoint weights
biondizzle
2026-06-01 00:32:36 +00:00
e671780008
fix: transpose checkpoint weights before make_b_k_major in Nvfp4Linear/SharedExpert
biondizzle
2026-06-01 00:30:37 +00:00
e8a7a9256f
fix: convert uint8 checkpoint weights to float4_e2m1fn_x2 for CuTeDSL GEMM
biondizzle
2026-06-01 00:18:34 +00:00
172448514c
fix: fold weight_scale_2 into global_scale_b for NVFP4 GEMM
biondizzle
2026-06-01 00:10:50 +00:00
563df02aef
fix: import SF_VEC_SIZE from quantize in gemm_runner (was NameError)
biondizzle
2026-06-01 00:04:48 +00:00
be476b2ce2
router: catch CuTeDSL warmup failures fast, don't let MLIR errors slow down init
biondizzle
2026-06-01 00:00:07 +00:00
56dff8d185
fix: W_gate is (H, E) but F.linear expects (E, H), transpose before linear
biondizzle
2026-05-31 23:55:16 +00:00
5396a04c28
router: broaden except to catch all CuTeDSL errors, fall through to cuBLAS+activation_topk path
biondizzle
2026-05-31 23:54:16 +00:00
3b5b9f487c
fix: compute num_tma_load_bytes inside cute.compile context
biondizzle
2026-05-31 23:53:13 +00:00
1bc0da0f35
fix: properly scope swap code inside else/guard blocks, replace continue with if guard
biondizzle
2026-05-31 23:51:43 +00:00
d0d765e1f2
fix: replace break statements with flag-based loops in router kernel (CuTeDSL restriction)
biondizzle
2026-05-31 23:50:39 +00:00
210391e571
fix: PersistentTileSchedulerParams constructor takes (problem_shape, cluster_shape) not from_shape
biondizzle
2026-05-31 23:49:12 +00:00
824d054ad7
fix: inside cute.compile args are already CuTe tensors, no conversion needed
biondizzle
2026-05-31 23:47:33 +00:00
6375e54396
fix: use from_dlpack + mark_layout_dynamic instead of non-existent to_cuTe_tensor in router
biondizzle
2026-05-31 23:46:35 +00:00
cb2ca8591f
fix: add @cute.jit to router compiled function
biondizzle
2026-05-31 23:44:53 +00:00
d5d2b7b4b8
fix: defer router MMA/TMA setup into cute.compile context (matches MoE pattern)
biondizzle
2026-05-31 23:44:00 +00:00
157f1c5258
fix: use OperandMajorMode from nvgpu (not deprecated tcgen05) and mma_tiler_mn in router kernel
biondizzle
2026-05-31 23:39:50 +00:00
1dbc57e2cd
fix: use mma_tiler_mn in _create_tiled_mma (attribute exists at init time)
biondizzle
2026-05-31 23:36:01 +00:00
d05dd50bf5
fix: OperandMajorMode.K not MAJOR_K (correct CuTeDSL API)
biondizzle
2026-05-31 23:34:54 +00:00
a6a8755439
single_shot: switch to head-packed FMHA dispatch (1 kernel launch vs 128)
biondizzle
2026-05-31 23:33:32 +00:00
80002f2efc
single_shot: production NVFP4 GEMM for ALL attention projections
biondizzle
2026-05-31 23:28:16 +00:00
32efd5139d
Fix gate weight transpose: checkpoint is (E, H), Router expects (H, E)
biondizzle
2026-05-31 23:21:09 +00:00
e45c0ff51b
single_shot: use reference dequant for attn projections, focus on MoE+FMHA
biondizzle
2026-05-31 23:20:04 +00:00
dfbffa1df1
single_shot: CUDA_LAUNCH_BLOCKING for debugging
biondizzle
2026-05-31 23:18:35 +00:00
a66fdf6049
single_shot: add sync to catch CUDA errors early
biondizzle
2026-05-31 23:17:46 +00:00
0b35c36d23
single_shot: memory-efficient MoE loading, lazy Nvfp4Linear init
biondizzle
2026-05-31 23:16:45 +00:00
050b5ee449
Fix n_h reference before assignment in single_shot
biondizzle
2026-05-31 23:14:24 +00:00
c5adbbfde6
FMHA sink: don't double-scale sink bias
biondizzle
2026-05-31 23:12:20 +00:00
4adee1207f
FMHA: zero-init my_p_vals to fix N<128 padding NaN
biondizzle
2026-05-31 23:11:12 +00:00
13be3ad443
FMHA sink bias in kernel + single_shot production rewrite
biondizzle
2026-05-31 23:10:13 +00:00
23e88638aa
single_shot: memory-efficient MoE loading (CPU stacking, one-shot GPU transfer)
biondizzle
2026-05-31 22:55:11 +00:00
92200367f3
FMHA kernel fix: N_orig vs N_padded — correct softmax masking for seq_len < 128
biondizzle
2026-05-31 22:52:39 +00:00
d40821c843
single_shot: fix memory (no double-loading MoE weights), FMHA short-seq fallback
biondizzle
2026-05-31 22:49:15 +00:00
91568e12d4
single_shot_inference.py: production kernel stack version
biondizzle
2026-05-31 22:45:44 +00:00
fb96c34b89
rename: single_shot_inference.py → single_shot_PYTORCH_REFERENCE.py
biondizzle
2026-05-31 22:42:06 +00:00
79d1a83348
Add NEXT_STEPS.md: post v0.1 issues, kernel migration plan, lessons learned
biondizzle
2026-05-31 22:30:34 +00:00
acc20dffd7
CRITICAL FIX: don't fold input_scale into NVFP4 weight dequant
v0.1-e2e-working
biondizzle
2026-05-31 22:03:55 +00:00
4e64acbb64
fix MoE gate BF16/NVFP4 handling, add attention diagnostics
biondizzle
2026-05-31 21:57:47 +00:00
0d2b5ceb93
fix positions device mismatch: move to rope cache device in forward_attention
biondizzle
2026-05-31 21:54:56 +00:00
2676476013
fix mHC pre_block bmm dtype mismatch: A is FP32, X is BF16
biondizzle
2026-05-31 21:51:59 +00:00
eb08cd06d1
Rewrite single_shot_inference.py: correct weight keys, NVFP4 two-level scale, compressor+indexer connected
biondizzle
2026-05-31 21:48:59 +00:00
4988e77179
probe key format
biondizzle
2026-05-31 21:42:52 +00:00
ba915dbd53
add probe_shapes script
biondizzle
2026-05-31 21:41:31 +00:00
c54dd15550
find hc keys
biondizzle
2026-05-31 21:38:43 +00:00
52b4971711
Full E2E single-shot: compressor, indexer, correct checkpoint keys (layers.{li}.attn/ffn)
biondizzle
2026-05-31 21:36:17 +00:00
cec17fee7d
fixed prefix
biondizzle
2026-05-31 21:26:04 +00:00
696f3261ab
focused key dump
biondizzle
2026-05-31 21:25:31 +00:00
b7c9bb1262
dump all keys
biondizzle
2026-05-31 21:24:58 +00:00
54e2a3684a
filter expert keys
biondizzle
2026-05-31 21:24:35 +00:00
bafabda01f
add checkpoint key dump script
biondizzle
2026-05-31 21:24:14 +00:00
23f1cf4065
Fix HcHead: use FP32 for RMSNorm + linear (matches HF reference)
biondizzle
2026-05-31 21:13:21 +00:00
274ea13251
Fix critical bug: add hc_head for final mHC readout (was using stream 0)
biondizzle
2026-05-31 21:13:02 +00:00
baee36e728
Fix dtype mismatch in validate_layer: cast flat to float before F.linear
biondizzle
2026-05-31 20:23:18 +00:00
46c4ef2cf5
Add per-layer validation test (tests/validate_layer.py)
biondizzle
2026-05-31 20:22:13 +00:00
abe4210367
Add compact per-layer residual trace (GROWTH_DIAG), disable verbose ATTN_DIAG
biondizzle
2026-05-31 20:21:03 +00:00
98fa410167
Add HF reference test script
biondizzle
2026-05-31 20:11:37 +00:00
a1b39adcaa
Add attention entropy diag (ATTN_DIAG), KV cache diag, --no-thinking mode
biondizzle
2026-05-31 19:29:55 +00:00
2a886fe0f2
Add --no-thinking mode to skip thinking tokens and use second-best
biondizzle
2026-05-31 19:24:21 +00:00
41ef0ebd0f
Add KV cache length diagnostic during decode
biondizzle
2026-05-31 19:17:24 +00:00
8baebf3c2e
Restore --skip-mhc arg, empty system prompt for testing
biondizzle
2026-05-31 19:04:53 +00:00
ca661d32e8
Empty system prompt for testing (was causing model to regurgitate AI assistant tokens)
biondizzle
2026-05-31 19:03:55 +00:00
b09b2cf511
Fix MoE routing: hash layers 0-2 (tid2eid), e_score_correction_bias for layers 3+
biondizzle
2026-05-31 18:52:38 +00:00
7d9e70c5d5
Fix remaining mHC API references: layer_compare.py, layer.py comment
biondizzle
2026-05-31 18:38:34 +00:00
7b123d159f
CRITICAL FIX: mHC fn/base/scale ordering [pre,post,comb] + comb transposed + Sinkhorn softmax
biondizzle
2026-05-31 18:38:12 +00:00
f6c02f808f
Add layer-by-layer comparison test for debugging
biondizzle
2026-05-31 12:48:43 +00:00
6ad577bd18
Add HuggingFace reference comparison test
biondizzle
2026-05-31 12:05:19 +00:00
581c4170f9
Fix sink logits shape: (n_h, T, 1) for concatenation with (n_h, T, seq_len)
biondizzle
2026-05-31 11:57:23 +00:00
0f951a0b1a
Fix attention sinks: logit bias (HuggingFace reference), not dummy KV
biondizzle
2026-05-31 11:53:43 +00:00
daed594902
CRITICAL FIX: Add missing q_b_norm (unweighted RMSNorm after q_b_proj)
biondizzle
2026-05-31 11:47:16 +00:00
dd50c355a6
Fix MHC_DIAG null check when SKIP_MHC is enabled
biondizzle
2026-05-31 11:37:32 +00:00
631e6ea3e4
Add --skip-mhc flag for simple residual diagnostic
biondizzle
2026-05-31 11:33:41 +00:00
d201a9334e
CRITICAL FIX: Add YaRN RoPE scaling (factor=16)
biondizzle
2026-05-31 11:25:52 +00:00
88719f39b4
Add single-layer trace (Phase 2.6) for detailed debugging
biondizzle
2026-05-31 11:20:46 +00:00
8256e23aed
Fix mHCContext attribute access (not tuple unpacking) and enable attention diag
biondizzle
2026-05-31 11:10:37 +00:00
72c139a59f
Enable MHC_DIAG for diagnostic run
biondizzle
2026-05-31 11:07:23 +00:00
cd661c2e40
Add attention and Q/KV diagnostics (MHC_DIAG flag)
biondizzle
2026-05-31 11:07:17 +00:00
9584fcbc23
Fix top5_ids variable name in decode logging
biondizzle
2026-05-31 10:54:40 +00:00
a6d56d10ca
Add top-20 logging and thinking token detection in decode loop
biondizzle
2026-05-31 10:49:28 +00:00
d891ae7e96
Fix prompt format: use DeepSeek V4 chat tokens
biondizzle
2026-05-31 10:33:41 +00:00
f86742ef8e
Cache layer weights on GPU — eliminates per-token CPU→GPU transfer
biondizzle
2026-05-31 10:28:25 +00:00
ce3d6069cc
CRITICAL FIX: mHC base/scale ordering matches fn ordering [pre, res, post]
biondizzle
2026-05-31 10:07:14 +00:00
9a43e9aa77
CRITICAL FIX: mHC fn weight row ordering was wrong
biondizzle
2026-05-31 10:02:57 +00:00
0346e479d4
Add system prompt, CLI args, inverse RoPE flag, minimal e2e test
biondizzle
2026-05-31 09:56:18 +00:00
429fc3db40
Fix expert weight indexing for 1D tensor
biondizzle
2026-05-31 09:23:10 +00:00
33004dcbf4
Fix expert weight broadcasting (wt.item() for scalar multiply)
biondizzle
2026-05-31 09:22:27 +00:00
1434b35971
Add residual diagnostic test — per-layer magnitude tracking
biondizzle
2026-05-31 09:21:41 +00:00
1c18c16c68
Fix production rope.py: FP32 arithmetic for forward_rope_partial + inverse_rope_bf16
biondizzle
2026-05-31 09:17:36 +00:00
970869d017
Fix mHCBlock import + relax RoPE round-trip threshold (BF16 noise expected)
biondizzle
2026-05-31 09:17:07 +00:00
a2ee78b564
Fix RoPE shape bug (interleave needs separate even/odd assembly)
biondizzle
2026-05-31 09:15:59 +00:00