90ac38cde0
Add CUDA graph stream management test
2026-06-06 08:14:29 +00:00
26042e3f01
Add minimal CUDA graph multi-GPU test to isolate zero-output bug
2026-06-06 08:13:18 +00:00
df05289d6f
CUDA graph: Fix remaining sync violations from B200 detector run 2
1. grouped_linear.py: Remove conditional host read of GPU tensor
   - 'if group_offsets[0] != 0' reads GPU value on host → sync
   - Fix: unconditionally update offsets every call (GPU-only multiply)
2. test_cuda_graph_readiness.py: Use pinned CPU buffers for token transfer
   - dec_tid_buf[0] = python_int → CPU→GPU sync
   - Fix: write to pinned CPU buffer, then copy_ (async, graph-capturable)
3. Add dsv4/decode/cuda_graph_decoder.py (skeleton)
2026-06-03 17:20:34 +00:00
caac8ae108
Fix syntax error: 'is not not None' -> 'is not None'
2026-06-03 16:34:33 +00:00
ba68212fa7
Add CUDA graph readiness detector (Section A of GETTING_CUDAGRAPH_READY.md)
- Grep for Section B sync patterns in hot path files
- Method 1: run decode forward with torch.cuda.set_sync_debug_mode('error')
- Method 2: attempt CUDA graph capture of L0 decode step
- Full model load + prefill + warmup before detection
- Results saved to /tmp/cuda_graph_readiness_results.json
2026-06-03 16:34:15 +00:00
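
The grep step of the detector commit above could look roughly like the sketch below. The pattern list is an assumption (the actual Section B list is not reproduced in this log); it covers a few PyTorch calls that are known to force a host-device synchronization. The name `find_sync_patterns` is hypothetical.

```python
import re

# Hypothetical pattern list: common PyTorch calls that force a host-device
# sync on CUDA tensors. The real "Section B" list is not shown in this log.
SYNC_PATTERNS = {
    "item": re.compile(r"\.item\(\)"),
    "tolist": re.compile(r"\.tolist\(\)"),
    "cpu": re.compile(r"\.cpu\(\)"),
    "numpy": re.compile(r"\.numpy\(\)"),
    "synchronize": re.compile(r"torch\.cuda\.synchronize\("),
}


def find_sync_patterns(source: str):
    """Return (line_number, pattern_name, stripped_line) for each match."""
    hits = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        # Naive comment stripping: ignore everything after the first '#'.
        code = line.split("#", 1)[0]
        for name, pattern in SYNC_PATTERNS.items():
            if pattern.search(code):
                hits.append((lineno, name, line.strip()))
    return hits


sample = """\
def decode_step(x, offsets):
    if offsets[0].item() != 0:   # host read of a GPU value
        x = x * 2
    return x.cpu()
"""

hits = find_sync_patterns(sample)
for lineno, name, text in hits:
    print(f"line {lineno}: {name}: {text}")
```

A textual scan like this only narrows the search. Implicit syncs such as `if tensor[0] != 0` have no fixed call name to match, which is presumably why the commit pairs the grep with the two runtime methods.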
3c06fd5591
Test 2: fix topk tensor shape (flatten before iterating)
2026-06-03 08:47:32 +00:00
29d6986dd4
Test 2: fix quantize_to_nvfp4 import
2026-06-03 08:21:39 +00:00
60b9bbd470
Test 2: fix import - use mHCLayer from dsv4.layers.mhc, fixed prompt encoding
2026-06-03 08:20:21 +00:00
2a42686e8e
Test 1 v2: diff hand-rolled vs official DSV4 encoding
2026-06-03 08:18:56 +00:00
11c2d5fe53
Add degeneration test 2: falsify mHC residual growth root cause
2026-06-03 08:18:01 +00:00
c77b83fffc
Add degeneration test 1: chat-template token-ID diff
2026-06-03 08:17:09 +00:00
9dbfac9dfa
PART A: verify kv_norm_w loaded correctly
2026-06-03 07:03:39 +00:00
a682c6adf4
PART A: add raw compressor output diagnostic
2026-06-03 06:56:56 +00:00
f2c1b3afd5
PART A: fix KV diagnostics — compute q_a before indexer, add Q_heads magnitude check
2026-06-03 06:33:51 +00:00
86e59c16c5
PART A: add KV gather diagnostics at blowup layer
2026-06-03 06:25:35 +00:00
262f844e2e
PART A: add detailed blowup diagnostics — capture mHC intermediate values when |X| > 1e6
2026-06-03 06:10:33 +00:00
6459fbca9a
fix: import forward_attention
2026-06-03 05:41:33 +00:00
91dfac34d8
PART A: simplified to production-only diagnostics — track per-layer |X| during prefill and decode, detect blowup early
2026-06-03 05:33:22 +00:00
d99503732d
fix: add BF16 gate weight fallback for dense routers (missing from test)
2026-06-03 05:22:47 +00:00
801bfc9a83
add router mode debug print
2026-06-03 05:15:52 +00:00
b385ecc05e
PART A: decode diagnostics test — production vs reference per-layer X comparison at decode step
2026-06-03 05:06:40 +00:00
d518fcb82a
test: correct sink bias reference — denominator-only, no V contribution
2026-06-03 04:57:37 +00:00
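
One reading of "denominator-only, no V contribution" is that the sink adds `exp(sink_logit)` to the softmax normalizer but has no value vector of its own. The sketch below illustrates that reading for a single query in plain Python. The function name and argument layout are hypothetical and are not taken from the test file.

```python
import math


def attention_with_sink(scores, values, sink_logit):
    """Single-query attention with a denominator-only sink.

    The sink contributes exp(sink_logit) to the softmax denominator but has
    no value vector, so it only shrinks the weights of the real keys.
    """
    m = max(max(scores), sink_logit)             # shared max for stability
    weights = [math.exp(s - m) for s in scores]
    denom = sum(weights) + math.exp(sink_logit - m)   # sink enters here only
    dim = len(values[0])
    return [
        sum(w * v[d] for w, v in zip(weights, values)) / denom
        for d in range(dim)
    ]


scores = [0.0, 0.0]
values = [[1.0, 0.0], [0.0, 1.0]]
out = attention_with_sink(scores, values, sink_logit=0.0)
print(out)  # each component is 1/3: the sink absorbs a third of the mass
```

With two equal scores and an equal sink logit, the attention weights on the real keys sum to 2/3 rather than 1, which is the effect a reference SDPA without the sink would miss.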
9574a9dc2e
test: add sink bias to reference SDPA in decode FMHA comparison
2026-06-03 04:53:55 +00:00
9a9b347b2b
test: add per-head magnitude ratio diagnostics to decode FMHA test
2026-06-03 04:50:23 +00:00
f5fa20c581
fix: syntax error — missing closing paren in indexer.forward call
2026-06-03 04:46:41 +00:00
693975ec92
fix: device mismatches in decode FMHA test — dec_pos must be on per-layer GPU
2026-06-03 04:46:24 +00:00
e1d96c509d
test: decode FMHA layer comparison — checks FMHA accuracy during decode step
2026-06-03 04:39:12 +00:00
d8306be3f2
Fix PART A test: proper FP8 quantization and MQA reference
2026-06-03 04:20:36 +00:00
4126909dfb
Simplify PART A test: compressor + FMHA at production scale
2026-06-03 04:18:13 +00:00
8c54cfa748
Fix KVCache init in PART A test
2026-06-03 04:15:41 +00:00
04cf8ca848
Add PART A diagnostic tests: compressor + KV cache + FMHA at production scale
2026-06-03 04:13:53 +00:00
dd1cbe1faa
Fix smem size for prefill debug test
2026-06-03 03:47:01 +00:00
09384a637a
Fix constexpr issues in prefill debug test
2026-06-03 03:46:29 +00:00
d3dc8cf901
Add prefill T=2 debug CUDA test with intermediate value printing
2026-06-03 03:46:14 +00:00
2bf5e74e61
Add prefill debug test: compare T=1 decode vs prefill kernel step by step
2026-06-03 03:05:25 +00:00
a4ef6c3454
Add B1 mixed FP8 prefill FMHA kernel (T>1 support)
New files:
- fmha_mixed_fp8_prefill.cuh: kernel supporting T=1..128
  - Sub-batch processing (T_BATCH=32) to fit in 232KB SMEM
  - Multi-row QK TMEM read using tcgen05.ld.32x32b.x8
  - Per-row online softmax
  - Per-row PV MMA (correctness first; batched PV is TODO)
  - Attention sink support
- fmha_mixed_fp8_prefill_capi.cu: C API bridge
- fmha_mixed_fp8_prefill_op.py: Python ctypes loader
- test_b1_mixed_fp8_prefill.py: unit test (T=1..32, N=128..4096)
Also: fix production FMHA layer test (BF16 fallback for o_a_proj,
router gate BF16 quantize path, missing DEVICE constant)
2026-06-03 02:50:27 +00:00
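
The "per-row online softmax" named in the commit above is a standard technique: stream over the keys in blocks while keeping a running max, a running denominator, and an unnormalized output, rescaling earlier partial results whenever the max grows. The following is a minimal plain-Python sketch of that technique for one query row. It is not the kernel's code, and the block size and function names are illustrative assumptions.

```python
import math


def online_softmax_attention(scores, values, block=2):
    """One query row, processed in key blocks with running statistics."""
    dim = len(values[0])
    run_max = float("-inf")   # running max of the scores seen so far
    run_sum = 0.0             # running softmax denominator
    acc = [0.0] * dim         # running unnormalized output
    for start in range(0, len(scores), block):
        s_blk = scores[start:start + block]
        v_blk = values[start:start + block]
        new_max = max(run_max, max(s_blk))
        # Rescale earlier partial results to the new max (0.0 on first block).
        scale = math.exp(run_max - new_max)
        run_sum *= scale
        acc = [a * scale for a in acc]
        for s, v in zip(s_blk, v_blk):
            p = math.exp(s - new_max)
            run_sum += p
            acc = [a + p * x for a, x in zip(acc, v)]
        run_max = new_max
    return [a / run_sum for a in acc]


def reference_attention(scores, values):
    """Two-pass softmax attention, used only to check the streaming version."""
    m = max(scores)
    w = [math.exp(s - m) for s in scores]
    z = sum(w)
    return [sum(wi * v[d] for wi, v in zip(w, values)) / z
            for d in range(len(values[0]))]


scores = [0.5, 2.0, -1.0, 3.0, 0.0]
values = [[1.0, 2.0], [0.0, 1.0], [3.0, -1.0], [0.5, 0.5], [2.0, 0.0]]
streamed = online_softmax_attention(scores, values, block=2)
expected = reference_attention(scores, values)
print(streamed, expected)
```

Each row keeps its own running max and sum, so rows in a sub-batch can be processed independently. That fits the commit's note that the PV step is still done per row.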
1f757151ef
Fix router gate BF16 quantize path for production FMHA test
2026-06-03 02:47:47 +00:00
07168357cc
Fix o_a_proj weight loading: add BF16 fallback for grouped linear
2026-06-03 02:38:00 +00:00
27d8d80a40
Fix missing DEVICE constant in production FMHA test
2026-06-03 02:31:11 +00:00
26a817c2f2
Fix production FMHA layer test: compare raw FMHA vs SDPA on production gathered KV
Phase 1: Run full pipeline to populate KV caches with real model weights.
Phase 2: For each layer, gather KV in mixed FP8/BF16 format, run both
production FMHA and PyTorch SDPA, compare cosine similarity.
Uses random Q (not model-generated) to isolate FMHA kernel accuracy
from upstream pipeline issues.
2026-06-03 02:26:37 +00:00
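
The Phase 2 comparison above reduces to a cosine similarity between two flattened outputs per layer. Below is a small plain-Python sketch of that check. The 0.99 threshold and the `compare_layers` helper are illustrative assumptions; the actual test presumably operates on GPU tensors.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def compare_layers(kernel_outputs, reference_outputs, threshold=0.99):
    """Return {layer_index: (cosine, passed)} for paired per-layer outputs.

    The threshold is a hypothetical pass criterion, not one from the log.
    """
    report = {}
    pairs = zip(kernel_outputs, reference_outputs)
    for layer, (out, ref) in enumerate(pairs):
        cos = cosine_similarity(out, ref)
        report[layer] = (cos, cos >= threshold)
    return report


# Toy data: layer 0 matches closely, layer 1 is badly off.
kernel = [[1.0, 2.0, 3.0], [1.0, 0.0, 0.0]]
reference = [[1.0, 2.0, 3.001], [0.0, 1.0, 0.0]]
report = compare_layers(kernel, reference)
for layer, (cos, ok) in report.items():
    print(f"layer {layer}: cos={cos:.4f} {'OK' if ok else 'FAIL'}")
```

Cosine similarity ignores overall scale, so a check like this would not catch a uniform magnitude error; the per-head magnitude ratio diagnostics added in a later commit cover that gap.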
ba67e055f7
Add production FMHA layer comparison test
Test loads real model weights, runs attention forward for layers 0-4,
compares production B1 mixed FP8 FMHA output vs PyTorch SDPA reference.
This should reveal the FMHA cosine-similarity degradation (was 0.679 at L1)
on real data patterns, not just synthetic random data.
Production values: HD=512, NOPE=448, ROPE=64, H=128, 8 GPUs.
2026-06-03 02:22:23 +00:00
84a02f8995
Remove debug test files, keep production B1/B2 unit tests
2026-06-03 01:49:39 +00:00
fdf702470c
Add B2 TMEM read debug kernel and test
2026-06-03 00:50:52 +00:00
f1cf4c0215
Add B2 QK debug test with w_h=1 for simple comparison
2026-06-03 00:46:48 +00:00
797345dfe9
Add B2 score debug test
2026-06-03 00:43:44 +00:00
99e50fcb58
Add B2 minimal debug test to find hang point
2026-06-03 00:35:48 +00:00
e21bd14408
Fix B1 test LSE reference shape handling
2026-06-03 00:25:53 +00:00
29a95a3db6
Add B1 QK vs PV isolation test
2026-06-03 00:23:35 +00:00
c322e3f301
Add B1 FMHA debug test for cosine failure investigation
2026-06-03 00:22:00 +00:00
5447d1d1dc
Add comprehensive B2 FP8 indexer unit test
2026-06-03 00:21:29 +00:00