nvfp4-megamoe-kernel

Author	SHA1	Message	Date
biondizzle	845227c06c	Fix stale lock file in CUDA loader: prevents infinite spin on crash recovery. torch.utils.cpp_extension.load creates a 'lock' file in the build directory during compilation. If the compiling process is killed (OOM, timeout, user interrupt), the lock file is never removed and subsequent processes spin forever polling it (clock_nanosleep(100ms) → stat(lock) → repeat). Fix: _cleanup_stale_lock() removes lock files older than 10 minutes before any compilation attempt. The threshold assumes CUDA kernel compilation never takes more than a few minutes, so a 10-minute-old lock is treated as stale.	2026-06-02 21:34:58 +00:00
biondizzle	f3b551956d	Cleanup Step 2: Archive Lineage P code, fix broken imports. Move dead dsv4/ modules to dsv4/_archive/ (52 files): model/{dsv4,mtp,layer,layer_schedule}; layers/{embedding,attention,ffn,norm} (kept linear, mhc, router, moe, shared_expert, grouped_linear, which are live); cache/, kernels/cache/, kernels/indexer/{csa_indexer,score_topk,compute_valid_lens}; kernels/router/{nvfp4_fused_router,dense_router_decode_kernel,dense_router_prefill}; ops/{topk,topk_select,rope,router}, loader/{hf_checkpoint,layout_convert}; reference/{attention,compressor,csa_attention,moe_pipeline}; kernels/compressor/{compress_tail,csa_hca}. Restore dsv4/ops/{router,custom_ops}.py (needed by live layers). Fix dsv4/kernels/{indexer,compressor,attention}/__init__.py (removed broken imports). Remove preload_all() from loader.py (dead, referenced nonexistent .cu file). Fix loader.py docstring (fused_amax_quantize_nvfp4 → quantize_nvfp4_from_buffer). Move broken tests to tests/e2e_archive/: test_fused_router, production_values_test, e2e/{one_layer,model_construction,csa_hca}. vLLM has 0 imports of dsv4 (Step 0 confirmed).	2026-06-02 19:27:07 +00:00
biondizzle	f23320b5b2	KV-1/KV-2: Fused compress+NVFP4 quantize kernels + dequant. compressor_reduce_quant.cu: single-kernel CSA/HCA compress + RMSNorm + NVFP4 quantize, with no intermediate BF16 (FP32 → E2M1 + E4M3 + FP32 gsa in one kernel); shared memory is ~2.5KB per CTA (FP32 staging + nibble buffer). dequant_nvfp4.cu: NVFP4 → BF16 dequantization kernels, covering full dequant (HCA dense gather) and selective dequant (CSA top-k gather), with a single kernel launch per gather operation. production_compress.py: added csa_compress_production_nvfp4() and hca_compress_production_nvfp4(), the production path for KV-1/KV-2. loader.py: preload the dequant_nvfp4 and compressor_reduce_quant modules. test_kv_compress_quant.py: unit tests verifying cos >= 0.999 between the BF16 reference and the NVFP4 round-trip path.	2026-06-02 09:37:53 +00:00
biondizzle	cacf64232e	CRITICAL FIX: fused_amax_quantize cross-CTA race condition. The single-kernel approach used __syncthreads() for the cross-CTA amax reduction, but __syncthreads() only synchronizes threads within a CTA (same blockIdx). CTA 0 could read s_amax[1] before CTA 1 wrote it, a race condition that produced a garbage gsa. Result: the residual |X| exploded to 10^37 by L0, and F_attn and F_ffn were 0.0. Fix: a two-kernel approach (correct, zero CPU syncs). Kernel 1, amax_gsa.cu, computes gsa on the GPU and returns a GPU tensor. Kernel 2, quantize_nvfp4_from_buffer, reads gsa from the GPU buffer. fused_amax_quantize.cu now exports quantize_nvfp4_from_buffer and deinterleave_quantize_from_buffer (gsa comes from a GPU buffer, not a kernel parameter). Same P0 win: zero .item() syncs. This costs two kernel launches instead of one, but correctness outweighs saving one launch.	2026-06-01 21:26:51 +00:00
biondizzle	00746c2d2b	Fix module path: move loader code from __init__.py to loader.py. quantize.py and others import from dsv4.kernels.cuda.loader, so the module must be a separate file, not just __init__.py.	2026-06-01 21:18:29 +00:00

5 Commits
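
The stale-lock cleanup described in commit 845227c06c can be sketched roughly as follows. This is a minimal illustration, not the repository's actual `_cleanup_stale_lock()`: the function name `cleanup_stale_lock`, its parameters, and the use of the file's modification time as the lock's age are all assumptions. Only the file name `lock` and the 10-minute threshold come from the commit message.

```python
import os
import time
from pathlib import Path

# Threshold taken from the commit message: locks older than 10 minutes are stale.
STALE_LOCK_SECONDS = 10 * 60


def cleanup_stale_lock(build_dir, max_age_s=STALE_LOCK_SECONDS, now=None):
    """Remove build_dir/lock if it is older than max_age_s seconds.

    Hypothetical helper. Returns True if a stale lock was removed,
    False if there was no lock or the lock is still fresh.
    """
    lock_path = Path(build_dir) / "lock"
    now = time.time() if now is None else now

    try:
        # Assumption: the lock's mtime approximates when compilation started.
        age = now - lock_path.stat().st_mtime
    except FileNotFoundError:
        return False  # no lock file, nothing to clean up

    if age <= max_age_s:
        return False  # lock may belong to a compilation that is still running

    try:
        lock_path.unlink()
    except FileNotFoundError:
        return False  # another process removed the lock first
    return True
```

Under these assumptions, a loader would call the helper on the build directory immediately before `torch.utils.cpp_extension.load(..., build_directory=build_dir)`, so a lock left behind by a killed compiler process is cleared rather than polled indefinitely. Tolerating `FileNotFoundError` on both the `stat` and the `unlink` keeps the cleanup safe when several processes start at once.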