6e53e3007c
fix: clamp block_amax to E4M3 max (448) in quantize_activation_nvfp4 — prevents NaN from overflow
2026-06-01 04:59:06 +00:00
563df02aef
fix: import SF_VEC_SIZE from quantize in gemm_runner (was NameError)
2026-06-01 00:04:48 +00:00
be476b2ce2
router: catch CuTeDSL warmup failures fast, don't let MLIR errors slow down init
2026-06-01 00:00:07 +00:00
1c18c16c68
Fix production rope.py: FP32 arithmetic for forward_rope_partial + inverse_rope_bf16
2026-05-31 09:17:36 +00:00
300dddedc0
E1-E4: gather kernels, handle wiring, rope, sync removal, e2e test
...
E1: LayerCacheHandle now exposes gather_compressed_kv,
gather_all_compressed_kv, gather_swa_kv, num_query_heads, head_dim.
Gather kernels in dsv4/kernels/cuda/gather_swa.cu + gather_kv.cu.
Python wrapper in dsv4/kernels/cache/gather.py.
E2: tests/e2e/test_one_layer.py — SWA path smoke test.
E3: Compressor/indexer __init__.py bridges (NotImplementedError stubs
for CSA/HCA compress_and_store, compute_index_scores_topk).
E4: Removed torch.cuda.synchronize() from fmha_multitile_op.py fast path.
Error checking via C API return code instead.
Also: forward_rope_partial in ops/rope.py (GPT-J interleaved, last 64 dims).
2026-05-30 21:10:26 +00:00
4b9eed02e1
Cleanup C1-C7: delete dead CuTeDSL FMHA, test probes, scratch files
...
- Deleted fmha.py (CuTeDSL slow path), FmhaKernel, Python KV merge
- Deleted fmha_sm100.cuh, fmha_sm100_tc.cuh, fmha_sm100_launch.cu, fmha_epilogue_sm100.cuh
- Moved fmha_qk_verify.cuh to tests/unit/qk_verify_kernel.cuh
- Deleted decode_sparse.py, decode_swa.py, kernels/decode/
- Deleted 46 test_d*.py probes, test_smem_*, test_cotiled_*, test_tmem_*,
test_smem_p_*, test_ultra_minimal, test_fmha_pv16, test_working_softmax_maybe
- Deleted root scratch: debug_linear.py, test_mapping.py, run_router_tests.py
- Moved archive/ to archived_plans/code_archive/
- Rewrote production.py: single fast path via 6-warp multi-tile kernel
- Added STATUS.md, audit_attention_live.md
- Moved NEXT_PRIORITIES*.md to archived_plans/
2026-05-30 21:08:12 +00:00
b9f15c250f
Stage E: head-packed MQA/GQA, batch dim, custom_op, integration API
...
- production.py: head-packed M dimension for MQA/GQA (q_per_kv*T rows
in single launch per KV group, eliminating redundant K/V TMA loads)
- production.py: batch dimension support (outer Python loop)
- production.py: warmup_attention_kernels() for pre-compilation
- production.py: dsv4_attention_per_head() for exact per-head sink bias
- __init__.py: sparse_fmha_with_swa, dense_fmha_with_swa, swa_only_fmha
integration functions bridging AttentionSubBlock → production FMHA
- custom_ops.py: dsv4::sparse_fmha_with_swa custom_op registration
- test_production.py: comprehensive tests (MHA/MQA/GQA, head-packed vs
per-head parity, multi-segment KV, SWA+causal+sink, batch, edge cases)
2026-05-27 15:15:03 +00:00
d53e0a33a9
NVFP4-3: add use_2cta_instrs conditional to gemm_runner
...
- run_nvfp4_grouped_gemm: use_2cta = tokens_sum >= 256 and cluster_m even
- run_fused_swiglu_grouped_gemm: same conditional
- Auto-warms up on first use via lazy compilation cache
- 1.7-1.9× throughput at prefill shapes (M>=256)
- Decode (M<256) stays 1-CTA (correct, no waste)
2026-05-25 16:42:02 +00:00
c2e3d15633
NVFP4-1.1 integration: GPU-only quantize kernel + MoE pipeline wiring
...
- Add quantize_nvfp4.cu: BF16→FP4 GPU kernel (no CPU sync, warp shuffle amax)
- Add quantize_nvfp4_gpu() bridge in ops/quantize.py
- Fix deinterleave_quantize kernel path (dsv4/ops/kernels → dsv4/kernels/cuda)
- Wire GPU quantize into Nvfp4MoE._run_impl():
- L1 input: quantize_nvfp4_gpu (replaces quantize_activation_nvfp4)
- Fused SwiGLU L2: deinterleave_quantize_nvfp4_cuda (single kernel)
- Non-fused L2: quantize_nvfp4_gpu
- Add test_nvfp4_gpu_quantize.py for both kernels
2026-05-25 16:19:07 +00:00
e3e67c3992
NVFP4-3: enable 2-CTA UMMA when MMA tile M >= 256 (1.7-1.9x throughput)
2026-05-24 22:57:49 +00:00
401e24768a
fix: import ceil_div in quantize.py (was NameError at runtime)
2026-05-23 08:40:24 +00:00
abfe4485f7
Router: full kernel stack — hash, topk, activation+topk, dense decode/prefill
...
Step 1: Hash router (hash_router.cu)
- One thread per token, gather from [vocab_size, k] LUT
- Uniform 1/k weights, FP32 output
- 3 MB LUT fits in L2 for repeated decode calls
Step 2: topk_select.cu — general top-k primitive
- Per-thread register min-heap (k=6, compile-time unrolled)
- Shared memory merge: thread 0 merges 64 partial heaps
- Tie-breaking: lower index wins on equal scores
- Reusable by CSA indexer
Step 3: activation_topk.cu — fused sqrt(softplus) + bias + topk + renorm
- Single kernel: all 6 steps of the router math, no intermediate buffers
- Numerically stable softplus: max(x,0) + log1p(exp(-|x|))
- Per-thread heap with unbiased activation co-stored
- Shared memory merge → sort descending → renormalize → store
Step 4: dense_router_decode.py — CuTeDSL fused GEMM kernel (skeleton)
- BF16 GEMM with tcgen05.mma, FP32 accumulator
- Custom epilogue: activation + bias + top-k (structure defined, needs TMA/MMA boilerplate)
- Dispatch: N<=64 uses fused decode, N>64 uses prefill path
Step 5: dense_router_prefill.py — prefill path
- torch.nn.functional.linear for GEMM (DeepGEMM integration deferred)
- Calls activation_topk for fused post-GEMM processing
Step 6: Router class + ops/router.py + test_router.py
- Router: construction-time mode (dense/hash), weight loading, custom_op dispatch
- ops/router.py: torch.library.custom_op wrappers, integer-keyed registry
- test_router.py: spec oracle tests (DO NOT RUN — Carmine is testing Stage C)
Test strategy: each kernel tested against its mathematical spec in FP32.
No reference implementation, no two debug streams. The oracle IS the math.
2026-05-21 21:54:05 +00:00
3fb3c925af
Restructure: cutedsl/ -> dsv4/ with proper layering
...
- Split bridge.py -> ops/quantize.py, ops/layouts.py, ops/gemm_runner.py
- Renamed classes: CuTeDSLNvfp4Linear -> Nvfp4Linear, etc.
- Moved kernel code to dsv4/kernels/ (gemm, attention, compressor, decode, cuda)
- Moved PyTorch bridges to dsv4/ops/
- Moved nn.Module layers to dsv4layers/
- Moved reference implementations to dsv4/reference/
- Moved vendored CUTLASS code to vendored/
- Archived ~190 debug tests to tests/archive/
- Kept ~15 canonical tests in tests/unit/
- Updated all import paths
- Added stubs for future components (model/, cache/, loader/)
- Updated pyproject.toml: dsv4-inference package name
2026-05-21 17:30:44 +00:00