The CPU dummy weight broke torch.mm(compressor.weight.T), which expects GPU tensors. Instead, reduce max_model_len so that the KV cache fits within available memory (876544 instead of 1048576).
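For illustration only, here is a minimal sketch of how a context length could be sized to a KV cache budget. The function name, its parameters, and the 1024-token rounding granularity are assumptions made for this example; they are not taken from this change or from any library, and the numbers in the usage lines are invented. It only reflects that 876544 = 856 * 1024 and 1048576 = 1024 * 1024.

```python
def fit_max_model_len(kv_budget_bytes: int, bytes_per_token: int,
                      granularity: int = 1024) -> int:
    """Largest context length whose KV cache fits in kv_budget_bytes.

    Hypothetical helper: the result is rounded down to a multiple of
    `granularity` (an assumed alignment, not a documented requirement).
    """
    if bytes_per_token <= 0 or granularity <= 0:
        raise ValueError("bytes_per_token and granularity must be positive")
    # How many tokens of KV cache the budget can hold in total.
    max_tokens = kv_budget_bytes // bytes_per_token
    # Round down so the length stays aligned to the chosen granularity.
    return (max_tokens // granularity) * granularity


# Illustrative numbers only: a budget slightly above 876544 tokens' worth
# of cache at an assumed 100 bytes per token.
requested = 1048576
fitted = fit_max_model_len(kv_budget_bytes=876544 * 100 + 50, bytes_per_token=100)
max_model_len = min(requested, fitted)
```

With these made-up inputs the requested length does not fit, so the smaller fitted value is used instead of keeping the requested length and offloading weights to the CPU.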