Bug #2 fix: warmup_compilation and warmup_fused_swiglu_compilation now
use valid FP4 data, produced by quantizing random BF16 values through
quantize_to_nvfp4. Random uint8 bytes reinterpreted as FP4 bit patterns
cause cudaErrorIllegalInstruction in Blackwell MMA hardware. Re-enabled
the warmup calls in runner.py.
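
For reviewers unfamiliar with the format, here is a rough pure-Python
sketch of block-scaled FP4 (E2M1) quantization. It is not the code in
quantize_to_nvfp4: the helper names, the 16-element block, the
low-nibble-first packing, and the plain float scale (instead of an
encoded FP8 scale) are all assumptions made for illustration.

```python
# Illustrative sketch only; NOT the project's quantize_to_nvfp4.
# Magnitudes representable by a 3-bit E2M1 code (codes 0..7).
E2M1_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
BLOCK = 16  # assumed number of elements sharing one scale


def encode_e2m1(x):
    """Return the 4-bit code (sign in bit 3) whose value is nearest to x."""
    sign = 0x8 if x < 0 else 0x0
    mag = min(range(8), key=lambda i: abs(E2M1_MAGNITUDES[i] - abs(x)))
    return sign | mag


def decode_e2m1(code):
    """Inverse of encode_e2m1 for a single 4-bit code."""
    mag = E2M1_MAGNITUDES[code & 0x7]
    return -mag if code & 0x8 else mag


def quantize_block_scaled(values):
    """Quantize floats to packed FP4 bytes plus one float scale per block."""
    assert len(values) % BLOCK == 0
    packed, scales = bytearray(), []
    for start in range(0, len(values), BLOCK):
        block = values[start:start + BLOCK]
        amax = max(abs(v) for v in block)
        # Map the block's largest magnitude onto 6.0, the largest E2M1 value.
        scale = amax / 6.0 if amax > 0 else 1.0
        scales.append(scale)
        codes = [encode_e2m1(v / scale) for v in block]
        # Assumed packing: even element in the low nibble, odd in the high.
        for lo, hi in zip(codes[0::2], codes[1::2]):
            packed.append(lo | (hi << 4))
    return bytes(packed), scales
```

The point of the sketch is that the packed bytes and the scales are
derived together from real values, which is what the warmup paths now
get by going through quantize_to_nvfp4.
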
Bug #1 kernel: sparse_topk_metadata.cu with:
- build_c128a_topk_metadata: position-based compressed KV slot lookup
  via block table for C128A (compress_ratio=128) decode tokens
- compute_c4a_global_topk: local topk index -> global slot ID mapping
  via block table for C4A (compress_ratio=4) decode tokens
- Both tested: correct block table lookups, proper padding
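
A minimal Python reference for the two lookups above, useful as a mental
model when reading the kernels. It is a sketch under assumptions, not a
port of sparse_topk_metadata.cu: the function names, the -1 padding
value, the fixed output width, and the rule that a token at `position`
sees `(position + 1) // compress_ratio` compressed entries are all
hypothetical.

```python
# Reference sketch only; names, PAD value and counting rule are assumed.
PAD = -1  # assumed marker for unused output entries


def local_to_global_slot(block_table, local_idx, block_size):
    """Map a sequence-local slot index to a global slot via the block table."""
    if local_idx < 0:
        return PAD  # padding passes through unchanged
    block = block_table[local_idx // block_size]
    return block * block_size + local_idx % block_size


def c4a_global_topk(block_table, local_topk, block_size):
    """C4A-style mapping: each local top-k index becomes a global slot ID."""
    return [local_to_global_slot(block_table, i, block_size)
            for i in local_topk]


def c128a_topk_metadata(block_table, position, block_size, width,
                        compress_ratio=128):
    """C128A-style lookup: slots of all compressed entries visible at
    `position`, padded with PAD up to `width` entries."""
    num_compressed = (position + 1) // compress_ratio
    count = min(num_compressed, width)
    slots = [local_to_global_slot(block_table, i, block_size)
             for i in range(count)]
    return slots + [PAD] * (width - len(slots))
```

With block_table = [7, 2] and block_size = 4, local index 5 lands in the
second block at offset 1, so its global slot is 2 * 4 + 1 = 9.
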
Bug #3 kernel: C4A uses compute_c4a_global_topk (same .cu file)
- Replaces vLLM Triton kernel with our own CUDA kernel
Deleted stale STATUS.md, FUSED_EPILOGUE_STATUS.md, FUSED_EPILOGUE_PLAN.md, CURRENT_BUG.md