Commit Graph

5 Commits

Author SHA1 Message Date
845227c06c Fix stale lock file in CUDA loader — prevents infinite spin on crash recovery
torch.utils.cpp_extension.load creates a 'lock' file in the build
directory during compilation. If the compiling process is killed
(OOM, timeout, user interrupt), the lock file is never removed and
subsequent processes spin forever polling it (clock_nanosleep(100ms)
→ stat(lock) → repeat).

Fix: _cleanup_stale_lock() removes lock files older than 10 minutes
before any compilation attempt. The threshold assumes CUDA kernel
compilation finishes within a few minutes, so a lock that is 10
minutes old is treated as stale.
2026-06-02 21:34:58 +00:00
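The stale-lock cleanup described in commit 845227c06c can be sketched roughly as follows. This is a minimal illustration, not the code from the commit: the function signature, the `max_age` and `now` parameters, and the return value are assumptions. Only the file name `lock` and the 10-minute threshold come from the commit message.

```python
import os
import time

# Assumption: 10 minutes, matching the threshold stated in the commit message.
STALE_LOCK_SECONDS = 10 * 60


def cleanup_stale_lock(build_dir, max_age=STALE_LOCK_SECONDS, now=None):
    """Remove build_dir/lock if its mtime is older than max_age seconds.

    Hypothetical helper. Returns True if a lock file was removed.
    """
    lock_path = os.path.join(build_dir, "lock")
    now = time.time() if now is None else now
    try:
        age = now - os.stat(lock_path).st_mtime
    except FileNotFoundError:
        return False  # no lock present, nothing to do
    if age <= max_age:
        return False  # may belong to a live compilation; leave it alone
    try:
        os.remove(lock_path)
    except FileNotFoundError:
        return False  # another process removed it first
    return True
```

Calling such a helper on the build directory before invoking torch.utils.cpp_extension.load would clear a lock left by a killed process while leaving a recent lock in place.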
f3b551956d Cleanup Step 2: Archive Lineage P code, fix broken imports
- Move dead dsv4/ modules to dsv4/_archive/ (52 files)
  - model/{dsv4,mtp,layer,layer_schedule}
  - layers/{embedding,attention,ffn,norm} (kept as live: linear, mhc, router, moe, shared_expert, grouped_linear)
  - cache/*, kernels/cache/*, kernels/indexer/{csa_indexer,score_topk,compute_valid_lens}
  - kernels/router/{nvfp4_fused_router,dense_router_decode_kernel,dense_router_prefill}
  - ops/{topk,topk_select,rope,router}, loader/{hf_checkpoint,layout_convert}
  - reference/{attention,compressor,csa_attention,moe_pipeline}
  - kernels/compressor/{compress_tail,csa_hca}
- Restore dsv4/ops/{router,custom_ops}.py (needed by live layers)
- Fix dsv4/kernels/{indexer,compressor,attention}/__init__.py (removed broken imports)
- Remove preload_all() from loader.py (dead, referenced nonexistent .cu file)
- Fix loader.py docstring (fused_amax_quantize_nvfp4 → quantize_nvfp4_from_buffer)
- Move broken tests to tests/e2e_archive/
  - test_fused_router, production_values_test, e2e/{one_layer,model_construction,csa_hca}
- vLLM has 0 imports of dsv4 (Step 0 confirmed)
2026-06-02 19:27:07 +00:00
f23320b5b2 KV-1/KV-2: Fused compress+NVFP4 quantize kernels + dequant
- compressor_reduce_quant.cu: Single-kernel CSA/HCA compress + RMSNorm + NVFP4 quantize.
  No intermediate BF16. FP32 → E2M1 + E4M3 + FP32 gsa in one kernel.
  Shared memory: ~2.5KB per CTA (FP32 staging + nibble buffer).

- dequant_nvfp4.cu: NVFP4 → BF16 dequantization kernels.
  Full dequant (HCA dense gather) and selective dequant (CSA top-k gather).
  Single kernel launch per gather operation.

- production_compress.py: Added csa_compress_production_nvfp4() and
  hca_compress_production_nvfp4() — production path for KV-1/KV-2.

- loader.py: Preload dequant_nvfp4 and compressor_reduce_quant modules.

- test_kv_compress_quant.py: Unit tests verifying cos >= 0.999
  between BF16 reference and NVFP4 round-trip path.
2026-06-02 09:37:53 +00:00
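The acceptance check mentioned for test_kv_compress_quant.py (cos >= 0.999 between the BF16 reference and the NVFP4 round trip) could look something like the sketch below. The helper names are hypothetical and plain Python lists stand in for tensors. Only the 0.999 threshold is taken from the commit message.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def assert_round_trip_close(reference, round_trip, threshold=0.999):
    """Fail if the round-trip output drifts too far from the reference.

    Hypothetical helper; threshold defaults to the value in the commit message.
    """
    cos = cosine_similarity(reference, round_trip)
    assert cos >= threshold, f"cos={cos:.6f} is below {threshold}"
    return cos
```

Cosine similarity ignores a uniform rescaling of the output, so a check like this is usually paired with a magnitude comparison when absolute scale matters.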
cacf64232e CRITICAL FIX: fused_amax_quantize cross-CTA race condition
The single-kernel approach used __syncthreads() for cross-CTA amax
reduction, but __syncthreads() only synchronizes threads within one CTA
(same blockIdx). CTA 0 could read s_amax[1] before CTA 1 wrote it: a
race condition that produced a garbage gsa.

Result: residual |X| exploded to 10^37 by L0. F_attn and F_ffn were 0.0.

Fix: Two-kernel approach (correct, zero CPU syncs):
  Kernel 1: amax_gsa.cu — computes gsa on GPU, returns GPU tensor
  Kernel 2: quantize_nvfp4_from_buffer — reads gsa from GPU buffer

The fused_amax_quantize.cu now exports quantize_nvfp4_from_buffer and
deinterleave_quantize_from_buffer (gsa from GPU buffer, not kernel param).

Same P0 win: zero .item() syncs. Two kernel launches instead of one,
but correctness > shaving one launch.
2026-06-01 21:26:51 +00:00
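The two-pass structure from commit cacf64232e can be modeled on the CPU as below. This is a simplified sketch and not the CUDA code: each inner list stands in for the data one CTA would handle, a one-element list stands in for the GPU buffer, the scale formula (amax divided by the largest E2M1 magnitude, 6.0) is an assumption, and the per-block E4M3 scales of real NVFP4 are omitted. The point it illustrates is that pass 2 starts only after pass 1 has finished, so every chunk sees the same completed gsa.

```python
# Magnitudes representable in E2M1 (4-bit float: 2 exponent bits, 1 mantissa bit).
E2M1_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)


def round_e2m1(x):
    """Round to the nearest E2M1 value; ties go to the smaller magnitude (a simplification)."""
    mag = min(E2M1_GRID, key=lambda g: abs(g - abs(x)))
    return -mag if x < 0 else mag


def amax_gsa(chunks, gsa_buffer, target_max=6.0):
    """Pass 1: reduce per-chunk maxima to one global scale and store it in the buffer.

    The scale formula is an assumption made for illustration.
    """
    partial = [max(abs(v) for v in chunk) for chunk in chunks]
    amax = max(partial)
    gsa_buffer[0] = amax / target_max if amax > 0 else 1.0


def quantize_from_buffer(chunks, gsa_buffer):
    """Pass 2: every chunk reads the finished scale from the buffer, then quantizes."""
    gsa = gsa_buffer[0]
    return [[round_e2m1(v / gsa) for v in chunk] for chunk in chunks]


def two_pass_quantize(chunks):
    gsa_buffer = [0.0]  # stands in for a one-element GPU tensor
    amax_gsa(chunks, gsa_buffer)  # the end of pass 1 acts as the grid-wide barrier
    return quantize_from_buffer(chunks, gsa_buffer), gsa_buffer[0]
```

Because the scale stays in a buffer rather than being passed as a host-side value, nothing in this flow requires reading it back on the CPU between the two passes, which mirrors the "zero .item() syncs" goal stated in the commit.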
00746c2d2b Fix module path: move loader code from __init__.py to loader.py
quantize.py and others import from dsv4.kernels.cuda.loader — the module
must be a separate file, not just __init__.py.
2026-06-01 21:18:29 +00:00
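The import problem behind commit 00746c2d2b can be reproduced with a throwaway package. In this sketch, `demo_pkg` and `load()` are hypothetical stand-ins for dsv4.kernels.cuda and its loader code: while the code lives only in `__init__.py`, importing `demo_pkg.loader` fails, and it succeeds once a separate `loader.py` exists.

```python
import importlib
import os
import sys
import tempfile


def demo_loader_import():
    """Return (import_worked_before, value_after) for a temporary package."""
    with tempfile.TemporaryDirectory() as root:
        pkg_dir = os.path.join(root, "demo_pkg")  # hypothetical package name
        os.makedirs(pkg_dir)
        # Code placed only in __init__.py: "demo_pkg.loader" is not a module yet.
        with open(os.path.join(pkg_dir, "__init__.py"), "w") as f:
            f.write("def load():\n    return 'from __init__'\n")
        sys.path.insert(0, root)
        try:
            try:
                importlib.import_module("demo_pkg.loader")
                worked_before = True
            except ModuleNotFoundError:
                worked_before = False
            # Moving the code into its own file makes the dotted import resolve.
            with open(os.path.join(pkg_dir, "loader.py"), "w") as f:
                f.write("def load():\n    return 'from loader.py'\n")
            importlib.invalidate_caches()  # let the finder see the new file
            module = importlib.import_module("demo_pkg.loader")
            return worked_before, module.load()
        finally:
            sys.path.remove(root)
            for name in ("demo_pkg.loader", "demo_pkg"):
                sys.modules.pop(name, None)
```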