Files
nvfp4-megamoe-kernel/dsv4/kernels/cuda
biondizzle 5290c91c35 fix quantize_nvfp4 kernel: use proven single-thread-per-CTA pattern from deinterleave_quantize.cu
The warp shuffle approach failed because __shfl_down_sync with 16 threads
has undefined behavior for the odd nibble. Use the same pattern as the
working deinterleave_quantize.cu: 1 CTA per 16-element block, 16 threads
per CTA, each thread reads all 16 elements sequentially and computes
amax + quantize + pack.
2026-05-25 16:21:44 +00:00
..