nvfp4-megamoe-kernel

Files

biondizzle 4f698baa5d Production fused CUDA sampler + decode loop optimizations

- Add dsv4/kernels/cuda/sampler.cu: fused temperature + repetition penalty
  + top-k + top-p (nucleus) sampling, single kernel launch, zero CPU syncs
- Add dsv4/model/sampler.py: CUDASampler wrapper + PyTorch reference
- Update single_shot_inference.py:
  - Use CUDASampler for non-greedy decoding (temperature=0.6, top_k=50, top_p=0.95)
  - Pre-allocate decode buffers (no per-step torch.tensor allocation)
  - Track thinking tokens (128821/128822) separately: they are expected output from a reasoning model, not garbage
  - Reduce diagnostic CPU syncs (top-5 logging every 5 steps, NaN check every 20 steps)
  - Add --top-k and --top-p CLI args
  - Default: temperature=0.6 (was 0.0 greedy), rep_penalty=1.1 (was 1.2)
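
For orientation, here is a minimal pure-Python sketch of the sampling stages named above, using the new defaults. It is not the code in dsv4/kernels/cuda/sampler.cu or dsv4/model/sampler.py. The function name `sample_token`, the order of the stages, the CTRL-style form of the repetition penalty, and the treatment of temperature <= 0 as greedy are all assumptions made for illustration.

```python
import math
import random


def sample_token(logits, prev_tokens, temperature=0.6, rep_penalty=1.1,
                 top_k=50, top_p=0.95, rng=random):
    """Hypothetical reference: repetition penalty -> temperature -> top-k -> top-p."""
    scores = list(logits)

    # Repetition penalty (assumed CTRL-style): shrink positive logits and
    # push negative logits further down for tokens already generated.
    for tok in set(prev_tokens):
        if scores[tok] > 0:
            scores[tok] /= rep_penalty
        else:
            scores[tok] *= rep_penalty

    # Assumption: a non-positive temperature means greedy decoding.
    if temperature <= 0.0:
        return max(range(len(scores)), key=scores.__getitem__)

    # Temperature scaling.
    scores = [s / temperature for s in scores]

    # Top-k: keep the k highest-scoring token ids, sorted descending.
    order = sorted(range(len(scores)), key=scores.__getitem__, reverse=True)
    if top_k > 0:
        order = order[:top_k]

    # Softmax over the surviving candidates (subtract the max for stability).
    peak = scores[order[0]]
    weights = [math.exp(scores[i] - peak) for i in order]
    total = sum(weights)
    probs = [w / total for w in weights]

    # Top-p (nucleus): keep the shortest prefix whose cumulative
    # probability reaches top_p.
    cumulative, cutoff = 0.0, len(order)
    for n, p in enumerate(probs, start=1):
        cumulative += p
        if cumulative >= top_p:
            cutoff = n
            break
    order, probs = order[:cutoff], probs[:cutoff]

    # Draw one token id; random.choices renormalises the weights itself.
    return rng.choices(order, weights=probs, k=1)[0]
```

Passing `rng=random.Random(seed)` makes the draw reproducible. A fused GPU kernel would presumably run the same stages on device memory without returning to the host between them, which is what "zero CPU syncs" suggests, but this sketch makes no claim about how the kernel is organised.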

2026-06-01 20:29:57 +00:00

__init__.py

Restructure: cutedsl/ -> dsv4/ with proper layering

2026-05-21 17:30:44 +00:00

config.py

Layer dispatch: config, schedule, attention/FFN sub-blocks, TransformerLayer

2026-05-21 23:11:09 +00:00

dsv4.py

E3: Implement DSV4Model — full model class

2026-05-30 21:15:57 +00:00

layer_schedule.py

Layer dispatch: config, schedule, attention/FFN sub-blocks, TransformerLayer