nvfp4-megamoe-kernel

Files

biondizzle 67d5e26080 Fix warmup compilation + add sparse topk metadata kernels

Bug #2 fix: warmup_compilation and warmup_fused_swiglu_compilation now
use valid FP4 data, produced by quantizing random BF16 through
quantize_to_nvfp4. Random uint8 bytes interpreted as FP4 bit patterns
caused cudaErrorIllegalInstruction in Blackwell MMA hardware.
Re-enabled the warmup calls in runner.py.
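
The idea behind the warmup fix can be sketched in plain Python: derive the FP4 codes and block scales from real floating-point values instead of filling buffers with random bytes. This is a simplified, hypothetical reference and not the repo's quantize_to_nvfp4. It assumes E2M1 elements, one scale per 16 elements, and low-nibble-first packing, and it keeps the scale as a Python float where real NVFP4 stores it as FP8 E4M3. The names encode_e2m1, quantize_block and make_warmup_blocks are illustrative only.

```python
import random

# Magnitudes representable by an FP4 E2M1 element, indexed by the 3-bit magnitude code.
E2M1_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
BLOCK = 16  # assumed NVFP4 block size: one scale factor per 16 elements


def encode_e2m1(x):
    """Return the 4-bit code (sign bit 0x8 + 3-bit magnitude) closest to x.

    Ties go to the lower magnitude here; real hardware rounding may differ.
    """
    mag = min(range(8), key=lambda i: abs(E2M1_VALUES[i] - abs(x)))
    sign = 0x8 if (x < 0 and mag != 0) else 0x0
    return sign | mag


def quantize_block(values):
    """Quantize one block of floats to (scale, packed bytes), two codes per byte."""
    assert len(values) % 2 == 0, "need an even number of elements to pack nibbles"
    amax = max(abs(v) for v in values)
    # Map the largest magnitude onto 6.0, the largest E2M1 value.
    scale = amax / 6.0 if amax > 0 else 1.0
    codes = [encode_e2m1(v / scale) for v in values]
    # Assumption: element 0 goes in the low nibble, element 1 in the high nibble.
    packed = bytes(codes[i] | (codes[i + 1] << 4) for i in range(0, len(codes), 2))
    return scale, packed


def make_warmup_blocks(n, seed=0):
    """Build warmup data by quantizing random Gaussian values block by block."""
    assert n % BLOCK == 0, "n must be a multiple of the block size"
    rng = random.Random(seed)
    data = [rng.gauss(0.0, 1.0) for _ in range(n)]
    return [quantize_block(data[i:i + BLOCK]) for i in range(0, n, BLOCK)]
```

Calling make_warmup_blocks(64) yields four (scale, packed) pairs of 8 bytes each, with every scale derived from the data it accompanies.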

Bug #1 kernel: sparse_topk_metadata.cu with:
  - build_c128a_topk_metadata: position-based compressed KV slot lookup
    via block table for C128A (compress_ratio=128) decode tokens
  - compute_c4a_global_topk: local topk index -> global slot ID mapping
    via block table for C4A (compress_ratio=4) decode tokens
  - Both tested: correct block table lookups, proper padding
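
A position-based block table lookup of the kind described for C128A can be sketched as a CPU reference in Python. This is a guess at the shape of the computation, not the contents of sparse_topk_metadata.cu. It assumes zero-based token positions, that a compressed entry exists for each complete group of 128 tokens, a paged layout where slot = block * block_size + offset, and -1 as the padding value. The function name c128a_slots_reference and the parameters block_size and max_slots are hypothetical.

```python
PAD = -1  # assumed padding value for unused slot entries


def c128a_slots_reference(positions, block_tables, block_size, max_slots,
                          compress_ratio=128):
    """For each decode token, list the global slots of its compressed KV entries.

    positions[i]    : zero-based position of decode token i
    block_tables[i] : physical block IDs for request i, in logical order
    """
    out = []
    for pos, table in zip(positions, block_tables):
        # Assumption: tokens 0..pos exist, so (pos + 1) // ratio groups are complete.
        n_compressed = (pos + 1) // compress_ratio
        row = []
        for idx in range(min(n_compressed, max_slots)):
            block = table[idx // block_size]               # logical -> physical block
            row.append(block * block_size + idx % block_size)
        row += [PAD] * (max_slots - len(row))              # pad to a fixed width
        out.append(row)
    return out
```

A reference like this is mainly useful for checking a GPU kernel's output on small inputs, for example comparing one padded row per decode token.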

Bug #3 kernel: C4A uses compute_c4a_global_topk (same .cu file)
  - Replaces vLLM Triton kernel with our own CUDA kernel
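
The local-to-global mapping named here can be illustrated with a small CPU reference in Python. It is a sketch under stated assumptions rather than the CUDA kernel itself: local top-k indices are taken to index a request's compressed KV entries in logical order, negative indices are treated as padding, and the global slot is block_table[idx // block_size] * block_size + idx % block_size. The function name c4a_global_topk_reference and the block_size parameter are hypothetical.

```python
def c4a_global_topk_reference(local_topk, block_tables, block_size, pad=-1):
    """Map per-request local top-k indices to global slot IDs via the block table.

    local_topk[i]   : selected local indices for request i (negative = padding)
    block_tables[i] : physical block IDs for request i, in logical order
    """
    out = []
    for indices, table in zip(local_topk, block_tables):
        row = []
        for idx in indices:
            if idx < 0:
                row.append(pad)                        # keep padding as padding
            else:
                block = table[idx // block_size]       # logical -> physical block
                row.append(block * block_size + idx % block_size)
        out.append(row)
    return out
```

With block_size=4 and a block table of [7, 3], local index 5 falls in logical block 1 at offset 1, so it maps to slot 3 * 4 + 1 = 13.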

Deleted stale STATUS.md, FUSED_EPILOGUE_STATUS.md, FUSED_EPILOGUE_PLAN.md, CURRENT_BUG.md

2026-05-20 06:43:43 +00:00

deinterleave_quantize.cu

Custom CUDA kernel for de-interleave plus NVFP4 quantize

2026-05-20 04:39:47 +00:00

sparse_topk_metadata.cu

Fix warmup compilation + add sparse topk metadata kernels

2026-05-20 06:43:43 +00:00