128 * (sizeof(float) + sizeof(int)) = 128 * 8 B = 1 KB, which just fits the CUDA default per-thread stack limit (cudaLimitStackSize, typically 1 KB). 256 * 8 B = 2 KB would exceed that limit.
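
The arithmetic above can be checked at compile time. The following is a minimal sketch, not the original code: the names `kAssumedDefaultStackBytes`, `scratch_bytes`, and `fits_default_stack` are hypothetical, and the 1024-byte limit is an assumption that should be confirmed on the target device by calling `cudaDeviceGetLimit` with `cudaLimitStackSize`. It also only accounts for the two arrays; any other per-thread stack usage (register spills, call frames) would come on top, so the 128-element case leaves no headroom.

```cpp
#include <cstddef>

// Assumption: default per-thread stack limit (cudaLimitStackSize) is 1024 bytes.
// Query the real value at runtime with cudaDeviceGetLimit before relying on it.
constexpr std::size_t kAssumedDefaultStackBytes = 1024;

// Bytes needed for n floats plus n ints held in per-thread arrays.
constexpr std::size_t scratch_bytes(std::size_t n) {
    return n * (sizeof(float) + sizeof(int));
}

// True if the arrays alone fit within the assumed default limit.
constexpr bool fits_default_stack(std::size_t n) {
    return scratch_bytes(n) <= kAssumedDefaultStackBytes;
}

// The note assumes 4-byte float and int, i.e. 8 bytes per element pair.
static_assert(sizeof(float) == 4 && sizeof(int) == 4,
              "size arithmetic assumes 4-byte float and int");
static_assert(fits_default_stack(128), "128 elements: 1 KB, fits the assumed limit");
static_assert(!fits_default_stack(256), "256 elements: 2 KB, exceeds the assumed limit");
```

If a larger array is genuinely needed, the limit can be raised with `cudaDeviceSetLimit(cudaLimitStackSize, bytes)` before the kernel launch, or the scratch space can be moved to shared or global memory.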