- Split provided_slot_token vs slot_token_out (returned to caller)
- No gather when slot_token=None (L2 path), no unnecessary alloc
- .contiguous() on gathered tensors for CUTLASS alignment
- Return slot_token_out consistently
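
A minimal PyTorch sketch of the control flow these notes describe, for illustration only. The function name `gather_rows`, the `hidden` argument, the row-wise gather along dim 0, and returning `None` as `slot_token_out` on the no-gather path are all assumptions, not taken from the actual change.

```python
from typing import Optional, Tuple

import torch


def gather_rows(
    hidden: torch.Tensor,
    provided_slot_token: Optional[torch.Tensor] = None,
) -> Tuple[torch.Tensor, Optional[torch.Tensor]]:
    """Hypothetical helper: returns (tensor to feed the kernel, slot_token_out)."""
    if provided_slot_token is None:
        # No indices supplied: skip the gather and reuse the input tensor,
        # so nothing new is allocated. Returning None here is an assumption.
        return hidden, None

    # Keep the caller's argument and the returned value as separate names.
    slot_token_out = provided_slot_token

    # Gather the selected rows, then force a dense row-major layout.
    # .contiguous() is a no-op if the result is already contiguous.
    gathered = hidden.index_select(0, slot_token_out).contiguous()
    return gathered, slot_token_out
```

Both branches return the same two-element tuple, so callers can unpack the result without checking which path was taken.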