-
7ef6402936
KV-1/KV-2/KV-3: NVFP4 compressed KV + FP8 indexer keys
biondizzle
2026-06-02 10:00:50 +00:00
-
40dd56eac2
KV-1: Fix shared memory corruption in block_reduce
biondizzle
2026-06-02 09:49:12 +00:00
-
0fefadedd4
KV-1: Fix FP8 round-trip mismatch in fused quantize
biondizzle
2026-06-02 09:46:32 +00:00
-
d74ff5768d
KV diag test
biondizzle
2026-06-02 09:43:45 +00:00
-
c2664281c3
KV-1/KV-2: Fix quantize kernel — each thread handles 16-elem blocks independently
biondizzle
2026-06-02 09:41:15 +00:00
-
f23320b5b2
KV-1/KV-2: Fused compress+NVFP4 quantize kernels + dequant
biondizzle
2026-06-02 09:37:53 +00:00
-
107d62dd76
docs: update PERFORMANCE_AUDIT.md — Part 1 (P0-P3) landed, Part 2 KV cache next
biondizzle
2026-06-02 09:30:06 +00:00
-
3c295f225a
P3: integrate CUDA RoPE kernel into single_shot — 732 launches/token eliminated
v-p0p1p2p3-fused-swiglu-cuda-rope-20260602
biondizzle
2026-06-02 09:08:07 +00:00
-
54a9b6961b
fix: rope_cuda path — kernels/cuda not ops/cuda
biondizzle
2026-06-02 09:06:36 +00:00
-
2bbbead984
P3: CUDA RoPE kernel — single launch per call (vs 5-6 PyTorch ops)
biondizzle
2026-06-02 09:05:22 +00:00
-
851ec9b4d5
P3 WIP: fused RMSNorm + quantize kernel skeleton (not yet integrated)
biondizzle
2026-06-02 09:02:52 +00:00
-
b13c1057f5
test: verify GEMM shape with production weight format
biondizzle
2026-06-02 08:43:40 +00:00
-
40fb49d670
test: verify GEMM output shape
biondizzle
2026-06-02 08:41:22 +00:00
-
f01d3f3eac
wip: SE fused SwiGLU deinterleave fix
biondizzle
2026-06-02 08:41:00 +00:00
-
1726cb64a9
fix: interleave_l1_weights granularity_bf16 (not granularity) in SE
biondizzle
2026-06-02 08:29:03 +00:00
-
553275d810
feat: P1 — add eager warmup_fused_swiglu_compilation for SharedExpert (1-group)
biondizzle
2026-06-02 08:25:52 +00:00
-
5ed4c86137
fix: expert_offsets for 4-expert fused SwiGLU test
biondizzle
2026-06-02 08:24:32 +00:00
-
53362d2579
test: isolate fused SwiGLU — test no-clamp first
biondizzle
2026-06-02 08:23:28 +00:00
-
ae4506d722
fix: w_gs is scalar not iterable
biondizzle
2026-06-02 08:22:29 +00:00
-
b0c71b947e
test: fused SwiGLU — smoke test + correctness comparison with graceful degradation
biondizzle
2026-06-02 08:21:33 +00:00
-
2cfca36095
fix: compute correct gs from data in fused SwiGLU test
biondizzle
2026-06-02 08:20:27 +00:00
-
4a05a40cf0
fix: fused SwiGLU test — proper weight quant + 128-token alignment
biondizzle
2026-06-02 08:19:31 +00:00
-
fa769b6214
fix: pad activation as uint8 view for float4 dtype
biondizzle
2026-06-02 08:18:26 +00:00
-
024be1a60b
fix: test weight quantization dtype for fused SwiGLU test
biondizzle
2026-06-02 08:17:35 +00:00
-
19afa52e80
fix: use cute.where() directly for clamp in fused SwiGLU
biondizzle
2026-06-02 08:16:41 +00:00
-
5c746bbdf2
fix: TensorSSA-compatible clamp in fused SwiGLU kernel
biondizzle
2026-06-02 08:15:46 +00:00
-
3a30f35c68
fix: cute.math.fmin/fmax → cute.arch.fmin/fmax in fused SwiGLU kernel
biondizzle
2026-06-02 08:12:55 +00:00
-
fca72427ea
fix: add fp4_out/sf_out/l2_global_scale params to fused_swiglu kernel() signature
biondizzle
2026-06-02 08:11:18 +00:00
-
55ea109cca
test: fused SwiGLU kernel compilation + correctness (P0/P1 gate)
biondizzle
2026-06-02 08:09:57 +00:00
-
7904cf05c4
Add set_fused_swiglu() method to Nvfp4MoE
biondizzle
2026-06-02 07:59:57 +00:00
-
d8e17d70c1
P0+P1+P2: Enable fused SwiGLU (MoE+SE), fix SE _run_l1_fused, remove per-call gsa fill_
biondizzle
2026-06-02 07:57:39 +00:00
-
61d5e7ba53
revert: P2 gsa fill elimination — revert to proven path for e2e stability
v-perf-part1-p2-reverted-20260602
biondizzle
2026-06-02 07:32:10 +00:00
-
790f8c350a
perf: P2 landed (gsa fill elimination). P0/P1 fused SwiGLU disabled — CuTeDSL kernel arg-binding bug.
biondizzle
2026-06-02 07:16:08 +00:00
-
040b2eb6e7
perf: P0/P1/P2 — fused SwiGLU for MoE+SE, eliminate per-call gsa fill
biondizzle
2026-06-02 06:59:25 +00:00
-
e9506e0c20
perf: C1/C2/C3 — per-layer max_comp, pre-allocated gather_buf, SWA views
v-post-indexer-c-fixes-20260602
v-c1-c2-c3-20260602
biondizzle
2026-06-02 06:18:06 +00:00
-
617da29a5b
fix: assert topk_idx is not None in CSA layers — no silent fallback to SWA-only
biondizzle
2026-06-02 06:14:23 +00:00
-
5b4c496512
fix: three indexer bugs — weight path, comp_idx_buf width, scoring einsum
v-indexer-fix-20260602
biondizzle
2026-06-02 05:53:10 +00:00
-
0fbf28dd54
doc: INDEXER_PROBE_RESULTS_20260602 — compressed key width is ihd=128, not n_ih*ihd=8192
biondizzle
2026-06-02 05:51:24 +00:00
-
8162c586c3
probe: fix comp_idx_buf width to ihd=128 so indexer probe can complete
biondizzle
2026-06-02 05:38:44 +00:00
-
5be31d8582
fix: indexer compressor weight path — weights are at *.indexer.kv_proj not *.indexer.compressor.kv_proj
biondizzle
2026-06-02 05:25:44 +00:00
-
fdfcca918c
probe: verify indexer compressor load state
biondizzle
2026-06-02 05:17:00 +00:00
-
fb0ed87626
probe: add indexer compressor early-return and buffering diagnostics
biondizzle
2026-06-02 05:06:18 +00:00
-
06c92f208f
INDEXER PROBE: instrumentation prints for compressed key width investigation
biondizzle
2026-06-02 04:44:47 +00:00
-
510eaf4a26
probe: HF indexer architecture from B200
biondizzle
2026-06-02 04:38:24 +00:00
-
938e9079ce
probe: indexer and compressor weight shapes from checkpoint
biondizzle
2026-06-02 04:36:35 +00:00
-
9254cb0b0d
test: NVFP4 runtime gsa accuracy vs PyTorch reference
biondizzle
2026-06-02 04:31:18 +00:00
-
7e3fb5f4d0
fix: add missing import for quantize_nvfp4_gpu in linear.py fixed-gsa path
biondizzle
2026-06-02 04:28:29 +00:00
-
f52eedbdce
Add production-value tests: ALL tests use Pro config (61L, HD=512, 384 experts, HCA=128, 1M context)
biondizzle
2026-06-02 04:10:39 +00:00
-
668a42e71a
debug: print mhc_sinkhorn CUDA kernel compile errors
biondizzle
2026-06-02 04:02:34 +00:00
-
ca53bdb8e1
perf: skip MQA GQA expansion in FMHA (stride=0, no 128x K/V copy)
biondizzle
2026-06-02 03:54:03 +00:00
-
7b82d31330
perf: fused mHC Sinkhorn CUDA kernel (1 launch vs 38)
biondizzle
2026-06-02 03:50:57 +00:00
-
f0dec9f6bd
profile: fine-grained attention component timing
biondizzle
2026-06-02 03:08:34 +00:00
-
7114c48575
fix: parenthesize profile_detail condition
biondizzle
2026-06-02 02:56:13 +00:00
-
4734e894c7
profile: add per-layer attn vs ffn timing with CUDA sync
biondizzle
2026-06-02 02:46:35 +00:00
-
4017ef2f16
fix: accurate profile sync + remove paris_tids 129K iteration
biondizzle
2026-06-01 23:55:26 +00:00
-
73ae9393da
FIX: RoPE cache 8192→65536 (original_max_position_embeddings), KVCache max_comp 32768→65536
biondizzle
2026-06-01 23:18:37 +00:00
-
36f9782bad
Add thinking/Paris token logit check on step 0 for quality debugging
biondizzle
2026-06-01 23:14:24 +00:00
-
ef7e0d63bb
Add --warmup-gsa flag: fix attention/router gsa after first decode step to eliminate amax kernel launches
biondizzle
2026-06-01 23:04:44 +00:00
-
008e59eb90
Add --profile flag: per-component GPU timing with CUDA sync (embed+layers, lm_head, sampling)
biondizzle
2026-06-01 23:03:46 +00:00
-
106f42c93c
auto: pre-test commit
biondizzle
2026-06-01 23:01:34 +00:00
-
e53645654d
Reduce hot-path .item() syncs: gate li>=58 diagnostics behind VERBOSE>=2, topk on float
biondizzle
2026-06-01 22:33:03 +00:00
-
6f4bbc997a
Add sync after sampler for step<3 to catch async CUDA errors early
biondizzle
2026-06-01 22:32:40 +00:00
-
5493a8727e
P7: compressor early return + decode buffering (skip GEMMs when n_complete=0); sampler SMEM fix (LK=24 fits 48KB default); topk on float not bf16
biondizzle
2026-06-01 22:29:56 +00:00
-
828ba73dff
Update PERFORMANCE_AUDIT.md: P0 complete, P2/P3/P5 done
biondizzle
2026-06-01 22:21:31 +00:00
-
583ad6cfe6
P0 complete: Kill .item() in grouped_linear, reduce hot-path syncs
biondizzle
2026-06-01 22:21:12 +00:00
-
8767c263ab
Add cuda.synchronize + better logits validation after lm_head
biondizzle
2026-06-01 22:06:41 +00:00
-
2a6f9a10b1
lm_head: fall back to BF16 F.linear for stability
biondizzle
2026-06-01 22:05:22 +00:00
-
9bad30c777
Add logits validation debug before topk sampling
biondizzle
2026-06-01 21:59:23 +00:00
-
9fec7d609e
Fix gsa_buffer shape mismatch for MoE (M>1 rows)
biondizzle
2026-06-01 21:33:59 +00:00
-
cacf64232e
CRITICAL FIX: fused_amax_quantize cross-CTA race condition
biondizzle
2026-06-01 21:26:51 +00:00
-
e3412cf913
P5: In-place RoPE — no x.clone(), no empty_like allocation
biondizzle
2026-06-01 21:18:41 +00:00
-
00746c2d2b
Fix module path: move loader code from __init__.py to loader.py
biondizzle
2026-06-01 21:18:29 +00:00
-
230d28e562
Fix KVCache constructor call — device as keyword arg, not positional
biondizzle
2026-06-01 21:11:01 +00:00
-
c9b92cd840
Remove P1 from audit — multi-GPU layout is correct for the reference script
biondizzle
2026-06-01 21:07:59 +00:00
-
c8faf20a99
P0 COMPLETE: Eliminate ALL .item() CPU-GPU syncs from NVFP4 activation path
biondizzle
2026-06-01 21:05:03 +00:00
-
e0607c9e2f
P0: Add fused_amax_quantize.cu kernel + CUDA module loader with compile-once caching
biondizzle
2026-06-01 21:02:03 +00:00
-
d279965db4
Update PERFORMANCE_AUDIT.md: remove invalidated items, add WIP status
biondizzle
2026-06-01 20:55:44 +00:00
-
60715f89bc
Fix CUDA kernel compilation: use c10::cuda::getCurrentCUDAStream
biondizzle
2026-06-01 20:49:55 +00:00
-
2dc5b4ec19
Fix sampler kernel stack overflow: reduce MAX_K from 256 to 128
biondizzle
2026-06-01 20:42:53 +00:00
-
360f76b970
Performance audit fixes: eliminate CPU-GPU syncs
biondizzle
2026-06-01 20:40:19 +00:00
-
4f698baa5d
Production fused CUDA sampler + decode loop optimizations
biondizzle
2026-06-01 20:29:57 +00:00
-
2830a3ee7c
Fix lm_head NVFP4: transpose weight and scales to match Nvfp4Linear checkpoint layout
v-e2e-nvfp4-all-projections
biondizzle
2026-06-01 19:51:21 +00:00
-
16b72b9581
PERF: Eliminate double quantization for o_a_proj + NVFP4 lm_head
biondizzle
2026-06-01 19:41:21 +00:00
-
9a3bb43f20
Set default max-tokens=512 for reasoning model
biondizzle
2026-06-01 17:27:01 +00:00
-
db6e3545da
Fix: add _use_runtime_gsa=True to router gate GEMM in single_shot
biondizzle
2026-06-01 17:25:04 +00:00
-
9d57b0453b
auto: pre-test commit
biondizzle
2026-06-01 15:04:46 +00:00
-
1a6d9ee29b
Reset to greedy decoding (temperature=0)
biondizzle
2026-06-01 15:04:02 +00:00
-
038fe81c68
Fix MoE non-fused L2 runtime gsa + update test harness for extra args
biondizzle
2026-06-01 15:03:54 +00:00
-
a48d6e14ae
Default temperature=0.7 with rep penalty
biondizzle
2026-06-01 14:55:43 +00:00
-
1d64b863ca
Add temperature sampling + repetition penalty to fix degenerate repetition
biondizzle
2026-06-01 14:54:49 +00:00
-
6cca16f97a
Set max-tokens=128 default, clean up for final verification
biondizzle
2026-06-01 14:43:48 +00:00
-
a0e758ec3b
Set default max-tokens=30 for faster iteration
biondizzle
2026-06-01 14:33:55 +00:00
-
2b1fca6dae
CRITICAL FIX: runtime activation global scale to prevent E4M3 overflow
biondizzle
2026-06-01 14:21:16 +00:00
-
3b2714410f
Add NVFP4 linear accuracy test: prod vs ref with all-ones input
biondizzle
2026-06-01 14:15:27 +00:00
-
3e47d5f20a
Add prod vs ref GEMM comparison test + gate logits diagnostic
biondizzle
2026-06-01 14:11:37 +00:00
-
ad143afe37
Add L58-60 diagnostic: mHC A/B/C, MoE routed/shared, topk
biondizzle
2026-06-01 13:55:55 +00:00
-
7a05d3d3af
NVFP4 router gate: use Nvfp4Linear for both checkpoint and quantized paths
biondizzle
2026-06-01 11:25:50 +00:00
-
e5dbe1ed22
Switch router to Nvfp4Linear production GEMM (custom CuTeDSL kernel crashes MLIR)
biondizzle
2026-06-01 11:17:54 +00:00
-
a4324781c3
Fix: properly remove sqrt(softplus) from CuTeDSL kernel
biondizzle
2026-06-01 11:14:04 +00:00
-
6efe90cd85
Move sqrt(softplus) out of CuTeDSL kernel into Python
biondizzle
2026-06-01 11:12:41 +00:00