nvfp4-megamoe-kernel

Files

biondizzle 8447ba7138 FIX: Deadlock in indexer_score_topk kernel — __syncthreads inside strided loop

CRITICAL BUG: The old kernel had __syncthreads() and a spinlock INSIDE
the strided loop over num_valid entries. When num_valid % n_threads != 0
(i.e. essentially always at production context lengths), threads that
exit the loop early deadlock on the barrier while others wait forever.

Fix: per-thread local top-k in registers (LOCAL_K=8), block-level merge
after the loop completes. No in-loop barriers, no spinlocks.

Architecture:
- Each thread maintains a private min-heap of LOCAL_K best scores
- After the strided loop (no __syncthreads inside), threads write their
  local top-k to shared memory
- Thread 0 builds the final top-k from all n_threads*LOCAL_K candidates
- For top_k=1024, n_threads=128, LOCAL_K=8: 1024 candidates = exact merge
- SMEM budget: w_h + merge heap + per-thread staging = ~30KB (well under 232KB)

Also updated the copy in dsv4/kernels/cuda/ (the one actually loaded
by the Python bridge).

Future optimization (separate from this fix):
- The dot products are scalar FP32 per thread. At 1M context this is slow.
  Production path should use FP4 tcgen05 MMA (Stage F).
- The block-level merge is single-threaded. Could use warp-reduce or
  bitonic sort for top_k > 256.

2026-06-02 18:11:56 +00:00

cache

E1: Wire LayerCacheHandle gather methods + CUDA gather kernels

2026-05-30 21:09:21 +00:00

kernels

FIX: Deadlock in indexer_score_topk kernel — __syncthreads inside strided loop

2026-06-02 18:11:56 +00:00

layers

P4: Add QuantizedActivation + Nvfp4Linear.run_from_quantized

2026-06-02 16:37:38 +00:00

loader

Restructure: cutedsl/ -> dsv4/ with proper layering