- Use half_step_to_e2m1 for E2M1 FP4 quantization (not LUT search)
- Use __nv_fp8_e4m3 + memcpy for block scale (not reinterpret_cast)
- Pack nibbles as (nibbles[2*i+1] << 4) | nibbles[2*i] (same as prod)
- Output uint8 buffers, then .view() to FP4/FP8 dtypes
- Handle near-zero block scale same as quantize_nvfp4.cu
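
For reference, a minimal host-side sketch of the nibble packing order and the memcpy-style byte extraction described above. The helper names pack_nibbles and raw_byte are hypothetical and are not taken from the kernel; in the kernel the one-byte type would be __nv_fp8_e4m3, while this sketch uses plain integer types so that it compiles without CUDA headers.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical helper: packs 4-bit codes two per byte.
// Element 2*i goes to the low nibble, element 2*i+1 to the high nibble.
std::vector<uint8_t> pack_nibbles(const std::vector<uint8_t>& nibbles) {
    assert(nibbles.size() % 2 == 0);  // assumption: an even number of codes
    std::vector<uint8_t> packed(nibbles.size() / 2);
    for (std::size_t i = 0; i < packed.size(); ++i) {
        const uint8_t lo = nibbles[2 * i] & 0x0F;
        const uint8_t hi = nibbles[2 * i + 1] & 0x0F;
        packed[i] = static_cast<uint8_t>((hi << 4) | lo);
    }
    return packed;
}

// Hypothetical helper: copies the raw byte of a one-byte value with memcpy
// instead of reinterpret_cast, which avoids aliasing concerns.
template <typename T>
uint8_t raw_byte(const T& value) {
    static_assert(sizeof(T) == 1, "raw_byte expects a one-byte type");
    uint8_t byte;
    std::memcpy(&byte, &value, 1);
    return byte;
}
```

With this ordering, the codes {0x1, 0x2, 0x3, 0x4} pack to the bytes {0x21, 0x43}, so the first element of each pair always lands in the low nibble.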