monkey patch the monkey patching vllm nonsense again

2026-04-15 23:29:26 +00:00
parent c570c4658e
commit d8f5f88b64
2 changed files with 362 additions and 545 deletions


@@ -1,545 +0,0 @@
(APIServer pid=1) INFO 04-15 22:38:39 [utils.py:299]
(APIServer pid=1) INFO 04-15 22:38:39 [utils.py:299] █ █ █▄ ▄█
(APIServer pid=1) INFO 04-15 22:38:39 [utils.py:299] ▄▄ ▄█ █ █ █ ▀▄▀ █ version 0.19.0
(APIServer pid=1) INFO 04-15 22:38:39 [utils.py:299] █▄█▀ █ █ █ █ model nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4
(APIServer pid=1) INFO 04-15 22:38:39 [utils.py:299] ▀▀ ▀▀▀▀▀ ▀▀▀▀▀ ▀ ▀
(APIServer pid=1) INFO 04-15 22:38:39 [utils.py:299]
(APIServer pid=1) INFO 04-15 22:38:39 [utils.py:233] non-default args: {'model_tag': 'nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4', 'enable_auto_tool_choice': True, 'tool_call_parser': 'qwen3_coder', 'host': '0.0.0.0', 'model': 'nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4', 'trust_remote_code': True, 'max_model_len': 1048576, 'enforce_eager': True, 'attention_backend': 'TRITON_ATTN', 'reasoning_parser': 'super_v3', 'reasoning_parser_plugin': '/opt/super_v3_reasoning_parser.py', 'tensor_parallel_size': 8, 'disable_custom_all_reduce': True, 'gpu_memory_utilization': 0.96, 'enable_prefix_caching': False, 'mamba_ssm_cache_dtype': 'float16', 'enable_chunked_prefill': True, 'disable_hybrid_kv_cache_manager': False, 'async_scheduling': True, 'max_cudagraph_capture_size': 128, 'kv_transfer_config': KVTransferConfig(kv_connector='LMCacheConnectorV1', engine_id='dea40998-1518-4361-a31f-884d3c1c1e74', kv_buffer_device='cuda', kv_buffer_size=1000000000.0, kv_role='kv_both', kv_rank=None, kv_parallel_size=1, kv_ip='127.0.0.1', kv_port=14579, kv_connector_extra_config={}, kv_connector_module_path=None, enable_permute_local_kv=False, kv_load_failure_policy='fail')}
(APIServer pid=1) WARNING 04-15 22:38:39 [envs.py:1744] Unknown vLLM environment variable detected: VLLM_ROUTER_GATEWAY_PORT_80_TCP_ADDR
(APIServer pid=1) WARNING 04-15 22:38:39 [envs.py:1744] Unknown vLLM environment variable detected: VLLM_ROUTER_GATEWAY_PORT
(APIServer pid=1) WARNING 04-15 22:38:39 [envs.py:1744] Unknown vLLM environment variable detected: VLLM_ROUTER_GATEWAY_PORT_9091_TCP_PORT
(APIServer pid=1) WARNING 04-15 22:38:39 [envs.py:1744] Unknown vLLM environment variable detected: VLLM_ROUTER_GATEWAY_PORT_9091_TCP
(APIServer pid=1) WARNING 04-15 22:38:39 [envs.py:1744] Unknown vLLM environment variable detected: VLLM_ROUTER_GATEWAY_SERVICE_PORT
(APIServer pid=1) WARNING 04-15 22:38:39 [envs.py:1744] Unknown vLLM environment variable detected: VLLM_ROUTER_GATEWAY_PORT_9091_TCP_PROTO
(APIServer pid=1) WARNING 04-15 22:38:39 [envs.py:1744] Unknown vLLM environment variable detected: VLLM_ROUTER_GATEWAY_PORT_80_TCP
(APIServer pid=1) WARNING 04-15 22:38:39 [envs.py:1744] Unknown vLLM environment variable detected: VLLM_ROUTER_GATEWAY_PORT_80_TCP_PORT
(APIServer pid=1) WARNING 04-15 22:38:39 [envs.py:1744] Unknown vLLM environment variable detected: VLLM_ROUTER_GATEWAY_SERVICE_PORT_HTTP_MONITORING
(APIServer pid=1) WARNING 04-15 22:38:39 [envs.py:1744] Unknown vLLM environment variable detected: VLLM_ROUTER_GATEWAY_SERVICE_PORT_LISTENER_80
(APIServer pid=1) WARNING 04-15 22:38:39 [envs.py:1744] Unknown vLLM environment variable detected: VLLM_ROUTER_GATEWAY_SERVICE_HOST
(APIServer pid=1) WARNING 04-15 22:38:39 [envs.py:1744] Unknown vLLM environment variable detected: VLLM_ROUTER_GATEWAY_PORT_80_TCP_PROTO
(APIServer pid=1) WARNING 04-15 22:38:39 [envs.py:1744] Unknown vLLM environment variable detected: VLLM_ROUTER_GATEWAY_PORT_9091_TCP_ADDR
(APIServer pid=1) A new version of the following files was downloaded from https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4:
(APIServer pid=1) - configuration_nemotron_h.py
(APIServer pid=1) . Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.
(APIServer pid=1) INFO 04-15 22:38:46 [model.py:549] Resolved architecture: NemotronHForCausalLM
(APIServer pid=1) WARNING 04-15 22:38:46 [model.py:2176] User-specified max_model_len (1048576) is greater than the derived max_model_len (max_position_embeddings=262144.0 or model_max_length=None in model's config.json). VLLM_ALLOW_LONG_MAX_MODEL_LEN must be used with extreme caution. If the model uses relative position encoding (RoPE), positions exceeding derived_max_model_len lead to nan. If the model uses absolute position encoding, positions exceeding derived_max_model_len will cause a CUDA array out-of-bounds error.
(APIServer pid=1) INFO 04-15 22:38:46 [model.py:1678] Using max model len 1048576
(APIServer pid=1) INFO 04-15 22:38:46 [cache.py:227] Using fp8 data type to store kv cache. It reduces the GPU memory footprint and boosts the performance. Meanwhile, it may cause accuracy drop without a proper scaling factor.
(APIServer pid=1) INFO 04-15 22:38:46 [scheduler.py:238] Chunked prefill is enabled with max_num_batched_tokens=8192.
(APIServer pid=1) INFO 04-15 22:38:46 [config.py:281] Setting attention block size to 1056 tokens to ensure that attention page size is >= mamba page size.
(APIServer pid=1) INFO 04-15 22:38:46 [config.py:312] Padding mamba page size by 0.19% to ensure that mamba page size and attention page size are exactly equal.
(APIServer pid=1) WARNING 04-15 22:38:46 [modelopt.py:381] Detected ModelOpt fp8 checkpoint (quant_algo=FP8). Please note that the format is experimental and could change.
(APIServer pid=1) WARNING 04-15 22:38:46 [modelopt.py:998] Detected ModelOpt NVFP4 checkpoint. Please note that the format is experimental and could change in future.
(APIServer pid=1) INFO 04-15 22:38:46 [vllm.py:790] Asynchronous scheduling is enabled.
(APIServer pid=1) WARNING 04-15 22:38:46 [vllm.py:848] Enforce eager set, disabling torch.compile and CUDAGraphs. This is equivalent to setting -cc.mode=none -cc.cudagraph_mode=none
(APIServer pid=1) WARNING 04-15 22:38:46 [vllm.py:859] Inductor compilation was disabled by user settings, optimizations settings that are only active during inductor compilation will be ignored.
(APIServer pid=1) INFO 04-15 22:38:46 [vllm.py:1025] Cudagraph is disabled under eager mode
(APIServer pid=1) INFO 04-15 22:38:51 [compilation.py:290] Enabled custom fusions: norm_quant, act_quant, allreduce_rms
(EngineCore pid=277) INFO 04-15 22:38:58 [core.py:105] Initializing a V1 LLM engine (v0.19.0) with config: model='nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4', speculative_config=None, tokenizer='nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=1048576, download_dir=None, load_format=auto, tensor_parallel_size=8, pipeline_parallel_size=1, data_parallel_size=1, decode_context_parallel_size=1, dcp_comm_backend=ag_rs, disable_custom_all_reduce=True, quantization=modelopt_mixed, enforce_eager=True, enable_return_routed_experts=False, kv_cache_dtype=fp8_e4m3, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='super_v3', reasoning_parser_plugin='/opt/super_v3_reasoning_parser.py', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4, enable_prefix_caching=False, enable_chunked_prefill=True, pooler_config=None, compilation_config={'mode': <CompilationMode.NONE: 0>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['all'], 'splitting_ops': [], 'compile_mm_encoder': False, 'cudagraph_mm_encoder': False, 'encoder_cudagraph_token_budgets': [], 'encoder_cudagraph_max_images_per_batch': 0, 'compile_sizes': [], 'compile_ranges_endpoints': [128, 8192], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'size_asserts': False, 'alignment_asserts': False, 'scalar_asserts': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.NONE: 0>, 'cudagraph_num_of_warmups': 0, 'cudagraph_capture_sizes': [], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': True, 'fuse_act_quant': True, 'fuse_attn_quant': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': True}, 'max_cudagraph_capture_size': 0, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': True, 'static_all_moe_layers': []}
(EngineCore pid=277) INFO 04-15 22:38:58 [multiproc_executor.py:134] DP group leader: node_rank=0, node_rank_within_dp=0, master_addr=127.0.0.1, mq_connect_ip=10.244.248.111 (local), world_size=8, local_world_size=8
(Worker pid=348) INFO 04-15 22:39:03 [parallel_state.py:1400] world_size=8 rank=0 local_rank=0 distributed_init_method=tcp://127.0.0.1:36625 backend=nccl
(Worker pid=415) INFO 04-15 22:39:07 [parallel_state.py:1400] world_size=8 rank=1 local_rank=1 distributed_init_method=tcp://127.0.0.1:36625 backend=nccl
(Worker pid=483) INFO 04-15 22:39:11 [parallel_state.py:1400] world_size=8 rank=2 local_rank=2 distributed_init_method=tcp://127.0.0.1:36625 backend=nccl
(Worker pid=556) INFO 04-15 22:39:15 [parallel_state.py:1400] world_size=8 rank=3 local_rank=3 distributed_init_method=tcp://127.0.0.1:36625 backend=nccl
(Worker pid=629) INFO 04-15 22:39:19 [parallel_state.py:1400] world_size=8 rank=4 local_rank=4 distributed_init_method=tcp://127.0.0.1:36625 backend=nccl
(Worker pid=702) INFO 04-15 22:39:23 [parallel_state.py:1400] world_size=8 rank=5 local_rank=5 distributed_init_method=tcp://127.0.0.1:36625 backend=nccl
(Worker pid=775) INFO 04-15 22:39:27 [parallel_state.py:1400] world_size=8 rank=6 local_rank=6 distributed_init_method=tcp://127.0.0.1:36625 backend=nccl
(Worker pid=848) INFO 04-15 22:39:31 [parallel_state.py:1400] world_size=8 rank=7 local_rank=7 distributed_init_method=tcp://127.0.0.1:36625 backend=nccl
(Worker pid=348) INFO 04-15 22:39:31 [pynccl.py:111] vLLM is using nccl==2.28.9
(Worker pid=348) INFO 04-15 22:39:36 [parallel_state.py:1716] rank 0 in world size 8 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank 0, EPLB rank N/A
(Worker_TP0 pid=348) INFO 04-15 22:39:37 [gpu_model_runner.py:4735] Starting to load model nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4...
(Worker_TP0 pid=348) INFO 04-15 22:39:38 [__init__.py:261] Selected FlashInferFP8ScaledMMLinearKernel for ModelOptFp8LinearMethod
(Worker_TP0 pid=348) INFO 04-15 22:39:38 [deep_gemm.py:115] DeepGEMM E8M0 enabled on current platform.
(Worker_TP0 pid=348) INFO 04-15 22:39:38 [nvfp4_utils.py:85] Using NvFp4LinearBackend.FLASHINFER_CUTLASS for NVFP4 GEMM
(Worker_TP0 pid=348) INFO 04-15 22:39:38 [nvfp4.py:256] Using 'FLASHINFER_TRTLLM' NvFp4 MoE backend out of potential backends: ['FLASHINFER_TRTLLM', 'FLASHINFER_CUTEDSL', 'FLASHINFER_CUTLASS', 'VLLM_CUTLASS', 'MARLIN'].
(Worker_TP1 pid=415) INFO 04-15 22:39:38 [cuda.py:274] Using AttentionBackendEnum.TRITON_ATTN backend.
(Worker_TP0 pid=348) INFO 04-15 22:39:38 [cuda.py:274] Using AttentionBackendEnum.TRITON_ATTN backend.
(Worker_TP4 pid=629) INFO 04-15 22:39:38 [cuda.py:274] Using AttentionBackendEnum.TRITON_ATTN backend.
(Worker_TP2 pid=483) INFO 04-15 22:39:38 [cuda.py:274] Using AttentionBackendEnum.TRITON_ATTN backend.
(Worker_TP6 pid=775) INFO 04-15 22:39:38 [cuda.py:274] Using AttentionBackendEnum.TRITON_ATTN backend.
(Worker_TP7 pid=848) INFO 04-15 22:39:38 [cuda.py:274] Using AttentionBackendEnum.TRITON_ATTN backend.
(Worker_TP3 pid=556) INFO 04-15 22:39:38 [cuda.py:274] Using AttentionBackendEnum.TRITON_ATTN backend.
(Worker_TP5 pid=702) INFO 04-15 22:39:38 [cuda.py:274] Using AttentionBackendEnum.TRITON_ATTN backend.
(Worker_TP1 pid=415) INFO 04-15 22:41:36 [weight_utils.py:581] Time spent downloading weights for nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4: 116.048954 seconds
(Worker_TP0 pid=348)
Loading safetensors checkpoint shards: 0% Completed | 0/17 [00:00<?, ?it/s]
(Worker_TP0 pid=348)
Loading safetensors checkpoint shards: 6% Completed | 1/17 [00:01<00:20, 1.29s/it]
(Worker_TP0 pid=348)
Loading safetensors checkpoint shards: 12% Completed | 2/17 [00:03<00:23, 1.60s/it]
(Worker_TP0 pid=348)
Loading safetensors checkpoint shards: 18% Completed | 3/17 [00:04<00:21, 1.52s/it]
(Worker_TP0 pid=348)
Loading safetensors checkpoint shards: 24% Completed | 4/17 [00:06<00:20, 1.58s/it]
(Worker_TP0 pid=348)
Loading safetensors checkpoint shards: 29% Completed | 5/17 [00:07<00:17, 1.49s/it]
(Worker_TP0 pid=348)
Loading safetensors checkpoint shards: 35% Completed | 6/17 [00:08<00:16, 1.47s/it]
(Worker_TP0 pid=348)
Loading safetensors checkpoint shards: 41% Completed | 7/17 [00:10<00:14, 1.42s/it]
(Worker_TP0 pid=348)
Loading safetensors checkpoint shards: 47% Completed | 8/17 [00:11<00:12, 1.39s/it]
(Worker_TP0 pid=348)
Loading safetensors checkpoint shards: 53% Completed | 9/17 [00:13<00:11, 1.40s/it]
(Worker_TP0 pid=348)
Loading safetensors checkpoint shards: 59% Completed | 10/17 [00:14<00:10, 1.47s/it]
(Worker_TP0 pid=348)
Loading safetensors checkpoint shards: 65% Completed | 11/17 [00:16<00:08, 1.45s/it]
(Worker_TP0 pid=348)
Loading safetensors checkpoint shards: 71% Completed | 12/17 [00:17<00:07, 1.44s/it]
(Worker_TP0 pid=348)
Loading safetensors checkpoint shards: 76% Completed | 13/17 [00:18<00:05, 1.43s/it]
(Worker_TP0 pid=348)
Loading safetensors checkpoint shards: 82% Completed | 14/17 [00:20<00:04, 1.43s/it]
(Worker_TP0 pid=348)
Loading safetensors checkpoint shards: 88% Completed | 15/17 [00:21<00:02, 1.30s/it]
(Worker_TP0 pid=348)
Loading safetensors checkpoint shards: 100% Completed | 17/17 [00:21<00:00, 1.26s/it]
(Worker_TP0 pid=348)
(Worker_TP0 pid=348) INFO 04-15 22:41:59 [default_loader.py:384] Loading weights took 21.38 seconds
(Worker_TP0 pid=348) INFO 04-15 22:41:59 [flashinfer_utils.py:238] Padding intermediate size from 336 to 384 for up/down projection weights.
(Worker_TP0 pid=348) INFO 04-15 22:41:59 [nvfp4.py:401] Using MoEPrepareAndFinalizeNoDPEPMonolithic
(Worker_TP0 pid=348) WARNING 04-15 22:41:59 [kv_cache.py:94] Checkpoint does not provide a q scaling factor. Setting it to k_scale. This only matters for FP8 Attention backends (flash-attn or flashinfer).
(Worker_TP0 pid=348) WARNING 04-15 22:41:59 [kv_cache.py:108] Using KV cache scaling factor 1.0 for fp8_e4m3. If this is unintended, verify that k/v_scale scaling factors are properly set in the checkpoint.
(Worker_TP0 pid=348) INFO 04-15 22:42:01 [gpu_model_runner.py:4820] Model loading took 10.4 GiB memory and 142.225157 seconds
(Worker_TP0 pid=348) INFO 04-15 22:42:10 [gpu_worker.py:436] Available KV cache memory: 158.16 GiB
(EngineCore pid=277) INFO 04-15 22:42:11 [kv_cache_utils.py:1319] GPU KV cache size: 13,819,872 tokens
(EngineCore pid=277) INFO 04-15 22:42:11 [kv_cache_utils.py:1324] Maximum concurrency for 1,048,576 tokens per request: 78.68x
(Worker_TP7 pid=848) ERROR 04-15 22:42:11 [multiproc_executor.py:949] WorkerProc hit an exception.
(Worker_TP7 pid=848) ERROR 04-15 22:42:11 [multiproc_executor.py:949] Traceback (most recent call last):
(Worker_TP7 pid=848) ERROR 04-15 22:42:11 [multiproc_executor.py:949] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 944, in worker_busy_loop
(Worker_TP7 pid=848) ERROR 04-15 22:42:11 [multiproc_executor.py:949] output = func(*args, **kwargs)
(Worker_TP7 pid=848) ERROR 04-15 22:42:11 [multiproc_executor.py:949] ^^^^^^^^^^^^^^^^^^^^^
(Worker_TP7 pid=848) ERROR 04-15 22:42:11 [multiproc_executor.py:949] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/worker_base.py", line 306, in initialize_from_config
(Worker_TP7 pid=848) ERROR 04-15 22:42:11 [multiproc_executor.py:949] self.worker.initialize_from_config(kv_cache_config) # type: ignore
(Worker_TP7 pid=848) ERROR 04-15 22:42:11 [multiproc_executor.py:949] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP7 pid=848) ERROR 04-15 22:42:11 [multiproc_executor.py:949] File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(Worker_TP7 pid=848) ERROR 04-15 22:42:11 [multiproc_executor.py:949] return func(*args, **kwargs)
(Worker_TP7 pid=848) ERROR 04-15 22:42:11 [multiproc_executor.py:949] ^^^^^^^^^^^^^^^^^^^^^
(Worker_TP7 pid=848) ERROR 04-15 22:42:11 [multiproc_executor.py:949] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 527, in initialize_from_config
(Worker_TP7 pid=848) ERROR 04-15 22:42:11 [multiproc_executor.py:949] ensure_kv_transfer_initialized(self.vllm_config, kv_cache_config)
(Worker_TP7 pid=848) ERROR 04-15 22:42:11 [multiproc_executor.py:949] File "/usr/local/lib/python3.12/dist-packages/vllm/distributed/kv_transfer/kv_transfer_state.py", line 67, in ensure_kv_transfer_initialized
(Worker_TP7 pid=848) ERROR 04-15 22:42:11 [multiproc_executor.py:949] _KV_CONNECTOR_AGENT = KVConnectorFactory.create_connector(
(Worker_TP7 pid=848) ERROR 04-15 22:42:11 [multiproc_executor.py:949] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP7 pid=848) ERROR 04-15 22:42:11 [multiproc_executor.py:949] File "/usr/local/lib/python3.12/dist-packages/vllm/distributed/kv_transfer/kv_connector/factory.py", line 59, in create_connector
(Worker_TP7 pid=848) ERROR 04-15 22:42:11 [multiproc_executor.py:949] raise ValueError(
(Worker_TP7 pid=848) ERROR 04-15 22:42:11 [multiproc_executor.py:949] ValueError: Connector LMCacheConnectorV1 does not support HMA but HMA is enabled. Please set `--disable-hybrid-kv-cache-manager`.
(Worker_TP7 pid=848) ERROR 04-15 22:42:11 [multiproc_executor.py:949] Traceback (most recent call last):
(Worker_TP7 pid=848) ERROR 04-15 22:42:11 [multiproc_executor.py:949] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 944, in worker_busy_loop
(Worker_TP7 pid=848) ERROR 04-15 22:42:11 [multiproc_executor.py:949] output = func(*args, **kwargs)
(Worker_TP7 pid=848) ERROR 04-15 22:42:11 [multiproc_executor.py:949] ^^^^^^^^^^^^^^^^^^^^^
(Worker_TP7 pid=848) ERROR 04-15 22:42:11 [multiproc_executor.py:949] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/worker_base.py", line 306, in initialize_from_config
(Worker_TP7 pid=848) ERROR 04-15 22:42:11 [multiproc_executor.py:949] self.worker.initialize_from_config(kv_cache_config) # type: ignore
(Worker_TP7 pid=848) ERROR 04-15 22:42:11 [multiproc_executor.py:949] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP7 pid=848) ERROR 04-15 22:42:11 [multiproc_executor.py:949] File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(Worker_TP7 pid=848) ERROR 04-15 22:42:11 [multiproc_executor.py:949] return func(*args, **kwargs)
(Worker_TP7 pid=848) ERROR 04-15 22:42:11 [multiproc_executor.py:949] ^^^^^^^^^^^^^^^^^^^^^
(Worker_TP7 pid=848) ERROR 04-15 22:42:11 [multiproc_executor.py:949] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 527, in initialize_from_config
(Worker_TP7 pid=848) ERROR 04-15 22:42:11 [multiproc_executor.py:949] ensure_kv_transfer_initialized(self.vllm_config, kv_cache_config)
(Worker_TP7 pid=848) ERROR 04-15 22:42:11 [multiproc_executor.py:949] File "/usr/local/lib/python3.12/dist-packages/vllm/distributed/kv_transfer/kv_transfer_state.py", line 67, in ensure_kv_transfer_initialized
(Worker_TP7 pid=848) ERROR 04-15 22:42:11 [multiproc_executor.py:949] _KV_CONNECTOR_AGENT = KVConnectorFactory.create_connector(
(Worker_TP7 pid=848) ERROR 04-15 22:42:11 [multiproc_executor.py:949] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP7 pid=848) ERROR 04-15 22:42:11 [multiproc_executor.py:949] File "/usr/local/lib/python3.12/dist-packages/vllm/distributed/kv_transfer/kv_connector/factory.py", line 59, in create_connector
(Worker_TP7 pid=848) ERROR 04-15 22:42:11 [multiproc_executor.py:949] raise ValueError(
(Worker_TP7 pid=848) ERROR 04-15 22:42:11 [multiproc_executor.py:949] ValueError: Connector LMCacheConnectorV1 does not support HMA but HMA is enabled. Please set `--disable-hybrid-kv-cache-manager`.
(Worker_TP7 pid=848) ERROR 04-15 22:42:11 [multiproc_executor.py:949]
(Worker_TP6 pid=775) ERROR 04-15 22:42:11 [multiproc_executor.py:949] WorkerProc hit an exception.
(Worker_TP6 pid=775) ERROR 04-15 22:42:11 [multiproc_executor.py:949] Traceback (most recent call last):
(Worker_TP6 pid=775) ERROR 04-15 22:42:11 [multiproc_executor.py:949] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 944, in worker_busy_loop
(Worker_TP6 pid=775) ERROR 04-15 22:42:11 [multiproc_executor.py:949] output = func(*args, **kwargs)
(Worker_TP6 pid=775) ERROR 04-15 22:42:11 [multiproc_executor.py:949] ^^^^^^^^^^^^^^^^^^^^^
(Worker_TP6 pid=775) ERROR 04-15 22:42:11 [multiproc_executor.py:949] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/worker_base.py", line 306, in initialize_from_config
(Worker_TP6 pid=775) ERROR 04-15 22:42:11 [multiproc_executor.py:949] self.worker.initialize_from_config(kv_cache_config) # type: ignore
(Worker_TP6 pid=775) ERROR 04-15 22:42:11 [multiproc_executor.py:949] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP6 pid=775) ERROR 04-15 22:42:11 [multiproc_executor.py:949] File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(Worker_TP6 pid=775) ERROR 04-15 22:42:11 [multiproc_executor.py:949] return func(*args, **kwargs)
(Worker_TP6 pid=775) ERROR 04-15 22:42:11 [multiproc_executor.py:949] ^^^^^^^^^^^^^^^^^^^^^
(Worker_TP6 pid=775) ERROR 04-15 22:42:11 [multiproc_executor.py:949] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 527, in initialize_from_config
(Worker_TP6 pid=775) ERROR 04-15 22:42:11 [multiproc_executor.py:949] ensure_kv_transfer_initialized(self.vllm_config, kv_cache_config)
(Worker_TP6 pid=775) ERROR 04-15 22:42:11 [multiproc_executor.py:949] File "/usr/local/lib/python3.12/dist-packages/vllm/distributed/kv_transfer/kv_transfer_state.py", line 67, in ensure_kv_transfer_initialized
(Worker_TP6 pid=775) ERROR 04-15 22:42:11 [multiproc_executor.py:949] _KV_CONNECTOR_AGENT = KVConnectorFactory.create_connector(
(Worker_TP6 pid=775) ERROR 04-15 22:42:11 [multiproc_executor.py:949] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP6 pid=775) ERROR 04-15 22:42:11 [multiproc_executor.py:949] File "/usr/local/lib/python3.12/dist-packages/vllm/distributed/kv_transfer/kv_connector/factory.py", line 59, in create_connector
(Worker_TP6 pid=775) ERROR 04-15 22:42:11 [multiproc_executor.py:949] raise ValueError(
(Worker_TP6 pid=775) ERROR 04-15 22:42:11 [multiproc_executor.py:949] ValueError: Connector LMCacheConnectorV1 does not support HMA but HMA is enabled. Please set `--disable-hybrid-kv-cache-manager`.
(Worker_TP6 pid=775) ERROR 04-15 22:42:11 [multiproc_executor.py:949] Traceback (most recent call last):
(Worker_TP6 pid=775) ERROR 04-15 22:42:11 [multiproc_executor.py:949] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 944, in worker_busy_loop
(Worker_TP6 pid=775) ERROR 04-15 22:42:11 [multiproc_executor.py:949] output = func(*args, **kwargs)
(Worker_TP6 pid=775) ERROR 04-15 22:42:11 [multiproc_executor.py:949] ^^^^^^^^^^^^^^^^^^^^^
(Worker_TP6 pid=775) ERROR 04-15 22:42:11 [multiproc_executor.py:949] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/worker_base.py", line 306, in initialize_from_config
(Worker_TP6 pid=775) ERROR 04-15 22:42:11 [multiproc_executor.py:949] self.worker.initialize_from_config(kv_cache_config) # type: ignore
(Worker_TP6 pid=775) ERROR 04-15 22:42:11 [multiproc_executor.py:949] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP6 pid=775) ERROR 04-15 22:42:11 [multiproc_executor.py:949] File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(Worker_TP6 pid=775) ERROR 04-15 22:42:11 [multiproc_executor.py:949] return func(*args, **kwargs)
(Worker_TP6 pid=775) ERROR 04-15 22:42:11 [multiproc_executor.py:949] ^^^^^^^^^^^^^^^^^^^^^
(Worker_TP6 pid=775) ERROR 04-15 22:42:11 [multiproc_executor.py:949] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 527, in initialize_from_config
(Worker_TP6 pid=775) ERROR 04-15 22:42:11 [multiproc_executor.py:949] ensure_kv_transfer_initialized(self.vllm_config, kv_cache_config)
(Worker_TP6 pid=775) ERROR 04-15 22:42:11 [multiproc_executor.py:949] File "/usr/local/lib/python3.12/dist-packages/vllm/distributed/kv_transfer/kv_transfer_state.py", line 67, in ensure_kv_transfer_initialized
(Worker_TP6 pid=775) ERROR 04-15 22:42:11 [multiproc_executor.py:949] _KV_CONNECTOR_AGENT = KVConnectorFactory.create_connector(
(Worker_TP6 pid=775) ERROR 04-15 22:42:11 [multiproc_executor.py:949] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP6 pid=775) ERROR 04-15 22:42:11 [multiproc_executor.py:949] File "/usr/local/lib/python3.12/dist-packages/vllm/distributed/kv_transfer/kv_connector/factory.py", line 59, in create_connector
(Worker_TP6 pid=775) ERROR 04-15 22:42:11 [multiproc_executor.py:949] raise ValueError(
(Worker_TP6 pid=775) ERROR 04-15 22:42:11 [multiproc_executor.py:949] ValueError: Connector LMCacheConnectorV1 does not support HMA but HMA is enabled. Please set `--disable-hybrid-kv-cache-manager`.
(Worker_TP6 pid=775) ERROR 04-15 22:42:11 [multiproc_executor.py:949]
(Worker_TP3 pid=556) ERROR 04-15 22:42:11 [multiproc_executor.py:949] WorkerProc hit an exception.
(Worker_TP3 pid=556) ERROR 04-15 22:42:11 [multiproc_executor.py:949] Traceback (most recent call last):
(Worker_TP3 pid=556) ERROR 04-15 22:42:11 [multiproc_executor.py:949] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 944, in worker_busy_loop
(Worker_TP3 pid=556) ERROR 04-15 22:42:11 [multiproc_executor.py:949] output = func(*args, **kwargs)
(Worker_TP3 pid=556) ERROR 04-15 22:42:11 [multiproc_executor.py:949] ^^^^^^^^^^^^^^^^^^^^^
(Worker_TP3 pid=556) ERROR 04-15 22:42:11 [multiproc_executor.py:949] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/worker_base.py", line 306, in initialize_from_config
(Worker_TP3 pid=556) ERROR 04-15 22:42:11 [multiproc_executor.py:949] self.worker.initialize_from_config(kv_cache_config) # type: ignore
(Worker_TP3 pid=556) ERROR 04-15 22:42:11 [multiproc_executor.py:949] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP3 pid=556) ERROR 04-15 22:42:11 [multiproc_executor.py:949] File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(Worker_TP3 pid=556) ERROR 04-15 22:42:11 [multiproc_executor.py:949] return func(*args, **kwargs)
(Worker_TP3 pid=556) ERROR 04-15 22:42:11 [multiproc_executor.py:949] ^^^^^^^^^^^^^^^^^^^^^
(Worker_TP3 pid=556) ERROR 04-15 22:42:11 [multiproc_executor.py:949] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 527, in initialize_from_config
(Worker_TP3 pid=556) ERROR 04-15 22:42:11 [multiproc_executor.py:949] ensure_kv_transfer_initialized(self.vllm_config, kv_cache_config)
(Worker_TP3 pid=556) ERROR 04-15 22:42:11 [multiproc_executor.py:949] File "/usr/local/lib/python3.12/dist-packages/vllm/distributed/kv_transfer/kv_transfer_state.py", line 67, in ensure_kv_transfer_initialized
(Worker_TP3 pid=556) ERROR 04-15 22:42:11 [multiproc_executor.py:949] _KV_CONNECTOR_AGENT = KVConnectorFactory.create_connector(
(Worker_TP3 pid=556) ERROR 04-15 22:42:11 [multiproc_executor.py:949] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP3 pid=556) ERROR 04-15 22:42:11 [multiproc_executor.py:949] File "/usr/local/lib/python3.12/dist-packages/vllm/distributed/kv_transfer/kv_connector/factory.py", line 59, in create_connector
(Worker_TP3 pid=556) ERROR 04-15 22:42:11 [multiproc_executor.py:949] raise ValueError(
(Worker_TP3 pid=556) ERROR 04-15 22:42:11 [multiproc_executor.py:949] ValueError: Connector LMCacheConnectorV1 does not support HMA but HMA is enabled. Please set `--disable-hybrid-kv-cache-manager`.
(Worker_TP3 pid=556) ERROR 04-15 22:42:11 [multiproc_executor.py:949] Traceback (most recent call last):
(Worker_TP3 pid=556) ERROR 04-15 22:42:11 [multiproc_executor.py:949] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 944, in worker_busy_loop
(Worker_TP3 pid=556) ERROR 04-15 22:42:11 [multiproc_executor.py:949] output = func(*args, **kwargs)
(Worker_TP3 pid=556) ERROR 04-15 22:42:11 [multiproc_executor.py:949] ^^^^^^^^^^^^^^^^^^^^^
(Worker_TP3 pid=556) ERROR 04-15 22:42:11 [multiproc_executor.py:949] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/worker_base.py", line 306, in initialize_from_config
(Worker_TP3 pid=556) ERROR 04-15 22:42:11 [multiproc_executor.py:949] self.worker.initialize_from_config(kv_cache_config) # type: ignore
(Worker_TP3 pid=556) ERROR 04-15 22:42:11 [multiproc_executor.py:949] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP3 pid=556) ERROR 04-15 22:42:11 [multiproc_executor.py:949] File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(Worker_TP3 pid=556) ERROR 04-15 22:42:11 [multiproc_executor.py:949] return func(*args, **kwargs)
(Worker_TP3 pid=556) ERROR 04-15 22:42:11 [multiproc_executor.py:949] ^^^^^^^^^^^^^^^^^^^^^
(Worker_TP3 pid=556) ERROR 04-15 22:42:11 [multiproc_executor.py:949] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 527, in initialize_from_config
(Worker_TP3 pid=556) ERROR 04-15 22:42:11 [multiproc_executor.py:949] ensure_kv_transfer_initialized(self.vllm_config, kv_cache_config)
(Worker_TP3 pid=556) ERROR 04-15 22:42:11 [multiproc_executor.py:949] File "/usr/local/lib/python3.12/dist-packages/vllm/distributed/kv_transfer/kv_transfer_state.py", line 67, in ensure_kv_transfer_initialized
(Worker_TP3 pid=556) ERROR 04-15 22:42:11 [multiproc_executor.py:949] _KV_CONNECTOR_AGENT = KVConnectorFactory.create_connector(
(Worker_TP3 pid=556) ERROR 04-15 22:42:11 [multiproc_executor.py:949] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP3 pid=556) ERROR 04-15 22:42:11 [multiproc_executor.py:949] File "/usr/local/lib/python3.12/dist-packages/vllm/distributed/kv_transfer/kv_connector/factory.py", line 59, in create_connector
(Worker_TP3 pid=556) ERROR 04-15 22:42:11 [multiproc_executor.py:949] raise ValueError(
(Worker_TP3 pid=556) ERROR 04-15 22:42:11 [multiproc_executor.py:949] ValueError: Connector LMCacheConnectorV1 does not support HMA but HMA is enabled. Please set `--disable-hybrid-kv-cache-manager`.
(Worker_TP3 pid=556) ERROR 04-15 22:42:11 [multiproc_executor.py:949]
(Worker_TP0 pid=348) ERROR 04-15 22:42:11 [multiproc_executor.py:949] WorkerProc hit an exception.
(Worker_TP0 pid=348) ERROR 04-15 22:42:11 [multiproc_executor.py:949] Traceback (most recent call last):
(Worker_TP0 pid=348) ERROR 04-15 22:42:11 [multiproc_executor.py:949] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 944, in worker_busy_loop
(Worker_TP0 pid=348) ERROR 04-15 22:42:11 [multiproc_executor.py:949] output = func(*args, **kwargs)
(Worker_TP0 pid=348) ERROR 04-15 22:42:11 [multiproc_executor.py:949] ^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=348) ERROR 04-15 22:42:11 [multiproc_executor.py:949] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/worker_base.py", line 306, in initialize_from_config
(Worker_TP0 pid=348) ERROR 04-15 22:42:11 [multiproc_executor.py:949] self.worker.initialize_from_config(kv_cache_config) # type: ignore
(Worker_TP0 pid=348) ERROR 04-15 22:42:11 [multiproc_executor.py:949] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=348) ERROR 04-15 22:42:11 [multiproc_executor.py:949] File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(Worker_TP0 pid=348) ERROR 04-15 22:42:11 [multiproc_executor.py:949] return func(*args, **kwargs)
(Worker_TP0 pid=348) ERROR 04-15 22:42:11 [multiproc_executor.py:949] ^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=348) ERROR 04-15 22:42:11 [multiproc_executor.py:949] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 527, in initialize_from_config
(Worker_TP0 pid=348) ERROR 04-15 22:42:11 [multiproc_executor.py:949] ensure_kv_transfer_initialized(self.vllm_config, kv_cache_config)
(Worker_TP0 pid=348) ERROR 04-15 22:42:11 [multiproc_executor.py:949] File "/usr/local/lib/python3.12/dist-packages/vllm/distributed/kv_transfer/kv_transfer_state.py", line 67, in ensure_kv_transfer_initialized
(Worker_TP0 pid=348) ERROR 04-15 22:42:11 [multiproc_executor.py:949] _KV_CONNECTOR_AGENT = KVConnectorFactory.create_connector(
(Worker_TP0 pid=348) ERROR 04-15 22:42:11 [multiproc_executor.py:949] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=348) ERROR 04-15 22:42:11 [multiproc_executor.py:949] File "/usr/local/lib/python3.12/dist-packages/vllm/distributed/kv_transfer/kv_connector/factory.py", line 59, in create_connector
(Worker_TP0 pid=348) ERROR 04-15 22:42:11 [multiproc_executor.py:949] raise ValueError(
(Worker_TP0 pid=348) ERROR 04-15 22:42:11 [multiproc_executor.py:949] ValueError: Connector LMCacheConnectorV1 does not support HMA but HMA is enabled. Please set `--disable-hybrid-kv-cache-manager`.
(Worker_TP0 pid=348) ERROR 04-15 22:42:11 [multiproc_executor.py:949] Traceback (most recent call last):
(Worker_TP0 pid=348) ERROR 04-15 22:42:11 [multiproc_executor.py:949] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 944, in worker_busy_loop
(Worker_TP0 pid=348) ERROR 04-15 22:42:11 [multiproc_executor.py:949] output = func(*args, **kwargs)
(Worker_TP0 pid=348) ERROR 04-15 22:42:11 [multiproc_executor.py:949] ^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=348) ERROR 04-15 22:42:11 [multiproc_executor.py:949] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/worker_base.py", line 306, in initialize_from_config
(Worker_TP0 pid=348) ERROR 04-15 22:42:11 [multiproc_executor.py:949] self.worker.initialize_from_config(kv_cache_config) # type: ignore
(Worker_TP0 pid=348) ERROR 04-15 22:42:11 [multiproc_executor.py:949] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=348) ERROR 04-15 22:42:11 [multiproc_executor.py:949] File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(Worker_TP0 pid=348) ERROR 04-15 22:42:11 [multiproc_executor.py:949] return func(*args, **kwargs)
(Worker_TP0 pid=348) ERROR 04-15 22:42:11 [multiproc_executor.py:949] ^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=348) ERROR 04-15 22:42:11 [multiproc_executor.py:949] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 527, in initialize_from_config
(Worker_TP0 pid=348) ERROR 04-15 22:42:11 [multiproc_executor.py:949] ensure_kv_transfer_initialized(self.vllm_config, kv_cache_config)
(Worker_TP0 pid=348) ERROR 04-15 22:42:11 [multiproc_executor.py:949] File "/usr/local/lib/python3.12/dist-packages/vllm/distributed/kv_transfer/kv_transfer_state.py", line 67, in ensure_kv_transfer_initialized
(Worker_TP0 pid=348) ERROR 04-15 22:42:11 [multiproc_executor.py:949] _KV_CONNECTOR_AGENT = KVConnectorFactory.create_connector(
(Worker_TP0 pid=348) ERROR 04-15 22:42:11 [multiproc_executor.py:949] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=348) ERROR 04-15 22:42:11 [multiproc_executor.py:949] File "/usr/local/lib/python3.12/dist-packages/vllm/distributed/kv_transfer/kv_connector/factory.py", line 59, in create_connector
(Worker_TP0 pid=348) ERROR 04-15 22:42:11 [multiproc_executor.py:949] raise ValueError(
(Worker_TP0 pid=348) ERROR 04-15 22:42:11 [multiproc_executor.py:949] ValueError: Connector LMCacheConnectorV1 does not support HMA but HMA is enabled. Please set `--disable-hybrid-kv-cache-manager`.
(Worker_TP0 pid=348) ERROR 04-15 22:42:11 [multiproc_executor.py:949]
(Worker_TP4 pid=629) ERROR 04-15 22:42:11 [multiproc_executor.py:949] WorkerProc hit an exception.
(Worker_TP4 pid=629) ERROR 04-15 22:42:11 [multiproc_executor.py:949] Traceback (most recent call last):
(Worker_TP4 pid=629) ERROR 04-15 22:42:11 [multiproc_executor.py:949] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 944, in worker_busy_loop
(Worker_TP4 pid=629) ERROR 04-15 22:42:11 [multiproc_executor.py:949] output = func(*args, **kwargs)
(Worker_TP4 pid=629) ERROR 04-15 22:42:11 [multiproc_executor.py:949] ^^^^^^^^^^^^^^^^^^^^^
(Worker_TP4 pid=629) ERROR 04-15 22:42:11 [multiproc_executor.py:949] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/worker_base.py", line 306, in initialize_from_config
(Worker_TP4 pid=629) ERROR 04-15 22:42:11 [multiproc_executor.py:949] self.worker.initialize_from_config(kv_cache_config) # type: ignore
(Worker_TP4 pid=629) ERROR 04-15 22:42:11 [multiproc_executor.py:949] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP4 pid=629) ERROR 04-15 22:42:11 [multiproc_executor.py:949] File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(Worker_TP4 pid=629) ERROR 04-15 22:42:11 [multiproc_executor.py:949] return func(*args, **kwargs)
(Worker_TP4 pid=629) ERROR 04-15 22:42:11 [multiproc_executor.py:949] ^^^^^^^^^^^^^^^^^^^^^
(Worker_TP4 pid=629) ERROR 04-15 22:42:11 [multiproc_executor.py:949] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 527, in initialize_from_config
(Worker_TP4 pid=629) ERROR 04-15 22:42:11 [multiproc_executor.py:949] ensure_kv_transfer_initialized(self.vllm_config, kv_cache_config)
(Worker_TP4 pid=629) ERROR 04-15 22:42:11 [multiproc_executor.py:949] File "/usr/local/lib/python3.12/dist-packages/vllm/distributed/kv_transfer/kv_transfer_state.py", line 67, in ensure_kv_transfer_initialized
(Worker_TP4 pid=629) ERROR 04-15 22:42:11 [multiproc_executor.py:949] _KV_CONNECTOR_AGENT = KVConnectorFactory.create_connector(
(Worker_TP4 pid=629) ERROR 04-15 22:42:11 [multiproc_executor.py:949] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP4 pid=629) ERROR 04-15 22:42:11 [multiproc_executor.py:949] File "/usr/local/lib/python3.12/dist-packages/vllm/distributed/kv_transfer/kv_connector/factory.py", line 59, in create_connector
(Worker_TP4 pid=629) ERROR 04-15 22:42:11 [multiproc_executor.py:949] raise ValueError(
(Worker_TP4 pid=629) ERROR 04-15 22:42:11 [multiproc_executor.py:949] ValueError: Connector LMCacheConnectorV1 does not support HMA but HMA is enabled. Please set `--disable-hybrid-kv-cache-manager`.
(Worker_TP4 pid=629) ERROR 04-15 22:42:11 [multiproc_executor.py:949] Traceback (most recent call last):
(Worker_TP4 pid=629) ERROR 04-15 22:42:11 [multiproc_executor.py:949] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 944, in worker_busy_loop
(Worker_TP4 pid=629) ERROR 04-15 22:42:11 [multiproc_executor.py:949] output = func(*args, **kwargs)
(Worker_TP4 pid=629) ERROR 04-15 22:42:11 [multiproc_executor.py:949] ^^^^^^^^^^^^^^^^^^^^^
(Worker_TP4 pid=629) ERROR 04-15 22:42:11 [multiproc_executor.py:949] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/worker_base.py", line 306, in initialize_from_config
(Worker_TP4 pid=629) ERROR 04-15 22:42:11 [multiproc_executor.py:949] self.worker.initialize_from_config(kv_cache_config) # type: ignore
(Worker_TP4 pid=629) ERROR 04-15 22:42:11 [multiproc_executor.py:949] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP4 pid=629) ERROR 04-15 22:42:11 [multiproc_executor.py:949] File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(Worker_TP4 pid=629) ERROR 04-15 22:42:11 [multiproc_executor.py:949] return func(*args, **kwargs)
(Worker_TP4 pid=629) ERROR 04-15 22:42:11 [multiproc_executor.py:949] ^^^^^^^^^^^^^^^^^^^^^
(Worker_TP4 pid=629) ERROR 04-15 22:42:11 [multiproc_executor.py:949] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 527, in initialize_from_config
(Worker_TP4 pid=629) ERROR 04-15 22:42:11 [multiproc_executor.py:949] ensure_kv_transfer_initialized(self.vllm_config, kv_cache_config)
(Worker_TP4 pid=629) ERROR 04-15 22:42:11 [multiproc_executor.py:949] File "/usr/local/lib/python3.12/dist-packages/vllm/distributed/kv_transfer/kv_transfer_state.py", line 67, in ensure_kv_transfer_initialized
(Worker_TP4 pid=629) ERROR 04-15 22:42:11 [multiproc_executor.py:949] _KV_CONNECTOR_AGENT = KVConnectorFactory.create_connector(
(Worker_TP4 pid=629) ERROR 04-15 22:42:11 [multiproc_executor.py:949] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP4 pid=629) ERROR 04-15 22:42:11 [multiproc_executor.py:949] File "/usr/local/lib/python3.12/dist-packages/vllm/distributed/kv_transfer/kv_connector/factory.py", line 59, in create_connector
(Worker_TP4 pid=629) ERROR 04-15 22:42:11 [multiproc_executor.py:949] raise ValueError(
(Worker_TP4 pid=629) ERROR 04-15 22:42:11 [multiproc_executor.py:949] ValueError: Connector LMCacheConnectorV1 does not support HMA but HMA is enabled. Please set `--disable-hybrid-kv-cache-manager`.
(Worker_TP4 pid=629) ERROR 04-15 22:42:11 [multiproc_executor.py:949]
(Worker_TP2 pid=483) ERROR 04-15 22:42:11 [multiproc_executor.py:949] WorkerProc hit an exception.
(Worker_TP2 pid=483) ERROR 04-15 22:42:11 [multiproc_executor.py:949] Traceback (most recent call last):
(Worker_TP2 pid=483) ERROR 04-15 22:42:11 [multiproc_executor.py:949] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 944, in worker_busy_loop
(Worker_TP2 pid=483) ERROR 04-15 22:42:11 [multiproc_executor.py:949] output = func(*args, **kwargs)
(Worker_TP2 pid=483) ERROR 04-15 22:42:11 [multiproc_executor.py:949] ^^^^^^^^^^^^^^^^^^^^^
(Worker_TP2 pid=483) ERROR 04-15 22:42:11 [multiproc_executor.py:949] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/worker_base.py", line 306, in initialize_from_config
(Worker_TP2 pid=483) ERROR 04-15 22:42:11 [multiproc_executor.py:949] self.worker.initialize_from_config(kv_cache_config) # type: ignore
(Worker_TP2 pid=483) ERROR 04-15 22:42:11 [multiproc_executor.py:949] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP2 pid=483) ERROR 04-15 22:42:11 [multiproc_executor.py:949] File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(Worker_TP2 pid=483) ERROR 04-15 22:42:11 [multiproc_executor.py:949] return func(*args, **kwargs)
(Worker_TP2 pid=483) ERROR 04-15 22:42:11 [multiproc_executor.py:949] ^^^^^^^^^^^^^^^^^^^^^
(Worker_TP2 pid=483) ERROR 04-15 22:42:11 [multiproc_executor.py:949] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 527, in initialize_from_config
(Worker_TP2 pid=483) ERROR 04-15 22:42:11 [multiproc_executor.py:949] ensure_kv_transfer_initialized(self.vllm_config, kv_cache_config)
(Worker_TP2 pid=483) ERROR 04-15 22:42:11 [multiproc_executor.py:949] File "/usr/local/lib/python3.12/dist-packages/vllm/distributed/kv_transfer/kv_transfer_state.py", line 67, in ensure_kv_transfer_initialized
(Worker_TP2 pid=483) ERROR 04-15 22:42:11 [multiproc_executor.py:949] _KV_CONNECTOR_AGENT = KVConnectorFactory.create_connector(
(Worker_TP2 pid=483) ERROR 04-15 22:42:11 [multiproc_executor.py:949] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP2 pid=483) ERROR 04-15 22:42:11 [multiproc_executor.py:949] File "/usr/local/lib/python3.12/dist-packages/vllm/distributed/kv_transfer/kv_connector/factory.py", line 59, in create_connector
(Worker_TP2 pid=483) ERROR 04-15 22:42:11 [multiproc_executor.py:949] raise ValueError(
(Worker_TP2 pid=483) ERROR 04-15 22:42:11 [multiproc_executor.py:949] ValueError: Connector LMCacheConnectorV1 does not support HMA but HMA is enabled. Please set `--disable-hybrid-kv-cache-manager`.
(Worker_TP2 pid=483) ERROR 04-15 22:42:11 [multiproc_executor.py:949] Traceback (most recent call last):
(Worker_TP2 pid=483) ERROR 04-15 22:42:11 [multiproc_executor.py:949] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 944, in worker_busy_loop
(Worker_TP2 pid=483) ERROR 04-15 22:42:11 [multiproc_executor.py:949] output = func(*args, **kwargs)
(Worker_TP2 pid=483) ERROR 04-15 22:42:11 [multiproc_executor.py:949] ^^^^^^^^^^^^^^^^^^^^^
(Worker_TP2 pid=483) ERROR 04-15 22:42:11 [multiproc_executor.py:949] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/worker_base.py", line 306, in initialize_from_config
(Worker_TP2 pid=483) ERROR 04-15 22:42:11 [multiproc_executor.py:949] self.worker.initialize_from_config(kv_cache_config) # type: ignore
(Worker_TP2 pid=483) ERROR 04-15 22:42:11 [multiproc_executor.py:949] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP2 pid=483) ERROR 04-15 22:42:11 [multiproc_executor.py:949] File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(Worker_TP2 pid=483) ERROR 04-15 22:42:11 [multiproc_executor.py:949] return func(*args, **kwargs)
(Worker_TP2 pid=483) ERROR 04-15 22:42:11 [multiproc_executor.py:949] ^^^^^^^^^^^^^^^^^^^^^
(Worker_TP2 pid=483) ERROR 04-15 22:42:11 [multiproc_executor.py:949] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 527, in initialize_from_config
(Worker_TP2 pid=483) ERROR 04-15 22:42:11 [multiproc_executor.py:949] ensure_kv_transfer_initialized(self.vllm_config, kv_cache_config)
(Worker_TP2 pid=483) ERROR 04-15 22:42:11 [multiproc_executor.py:949] File "/usr/local/lib/python3.12/dist-packages/vllm/distributed/kv_transfer/kv_transfer_state.py", line 67, in ensure_kv_transfer_initialized
(Worker_TP2 pid=483) ERROR 04-15 22:42:11 [multiproc_executor.py:949] _KV_CONNECTOR_AGENT = KVConnectorFactory.create_connector(
(Worker_TP2 pid=483) ERROR 04-15 22:42:11 [multiproc_executor.py:949] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP2 pid=483) ERROR 04-15 22:42:11 [multiproc_executor.py:949] File "/usr/local/lib/python3.12/dist-packages/vllm/distributed/kv_transfer/kv_connector/factory.py", line 59, in create_connector
(Worker_TP2 pid=483) ERROR 04-15 22:42:11 [multiproc_executor.py:949] raise ValueError(
(Worker_TP2 pid=483) ERROR 04-15 22:42:11 [multiproc_executor.py:949] ValueError: Connector LMCacheConnectorV1 does not support HMA but HMA is enabled. Please set `--disable-hybrid-kv-cache-manager`.
(Worker_TP2 pid=483) ERROR 04-15 22:42:11 [multiproc_executor.py:949]
(Worker_TP5 pid=702) ERROR 04-15 22:42:11 [multiproc_executor.py:949] WorkerProc hit an exception.
(Worker_TP5 pid=702) ERROR 04-15 22:42:11 [multiproc_executor.py:949] Traceback (most recent call last):
(Worker_TP5 pid=702) ERROR 04-15 22:42:11 [multiproc_executor.py:949] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 944, in worker_busy_loop
(Worker_TP5 pid=702) ERROR 04-15 22:42:11 [multiproc_executor.py:949] output = func(*args, **kwargs)
(Worker_TP5 pid=702) ERROR 04-15 22:42:11 [multiproc_executor.py:949] ^^^^^^^^^^^^^^^^^^^^^
(Worker_TP5 pid=702) ERROR 04-15 22:42:11 [multiproc_executor.py:949] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/worker_base.py", line 306, in initialize_from_config
(Worker_TP5 pid=702) ERROR 04-15 22:42:11 [multiproc_executor.py:949] self.worker.initialize_from_config(kv_cache_config) # type: ignore
(Worker_TP5 pid=702) ERROR 04-15 22:42:11 [multiproc_executor.py:949] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP5 pid=702) ERROR 04-15 22:42:11 [multiproc_executor.py:949] File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(Worker_TP5 pid=702) ERROR 04-15 22:42:11 [multiproc_executor.py:949] return func(*args, **kwargs)
(Worker_TP5 pid=702) ERROR 04-15 22:42:11 [multiproc_executor.py:949] ^^^^^^^^^^^^^^^^^^^^^
(Worker_TP5 pid=702) ERROR 04-15 22:42:11 [multiproc_executor.py:949] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 527, in initialize_from_config
(Worker_TP5 pid=702) ERROR 04-15 22:42:11 [multiproc_executor.py:949] ensure_kv_transfer_initialized(self.vllm_config, kv_cache_config)
(Worker_TP5 pid=702) ERROR 04-15 22:42:11 [multiproc_executor.py:949] File "/usr/local/lib/python3.12/dist-packages/vllm/distributed/kv_transfer/kv_transfer_state.py", line 67, in ensure_kv_transfer_initialized
(Worker_TP5 pid=702) ERROR 04-15 22:42:11 [multiproc_executor.py:949] _KV_CONNECTOR_AGENT = KVConnectorFactory.create_connector(
(Worker_TP5 pid=702) ERROR 04-15 22:42:11 [multiproc_executor.py:949] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP5 pid=702) ERROR 04-15 22:42:11 [multiproc_executor.py:949] File "/usr/local/lib/python3.12/dist-packages/vllm/distributed/kv_transfer/kv_connector/factory.py", line 59, in create_connector
(Worker_TP5 pid=702) ERROR 04-15 22:42:11 [multiproc_executor.py:949] raise ValueError(
(Worker_TP5 pid=702) ERROR 04-15 22:42:11 [multiproc_executor.py:949] ValueError: Connector LMCacheConnectorV1 does not support HMA but HMA is enabled. Please set `--disable-hybrid-kv-cache-manager`.
(Worker_TP5 pid=702) ERROR 04-15 22:42:11 [multiproc_executor.py:949] Traceback (most recent call last):
(Worker_TP5 pid=702) ERROR 04-15 22:42:11 [multiproc_executor.py:949] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 944, in worker_busy_loop
(Worker_TP5 pid=702) ERROR 04-15 22:42:11 [multiproc_executor.py:949] output = func(*args, **kwargs)
(Worker_TP5 pid=702) ERROR 04-15 22:42:11 [multiproc_executor.py:949] ^^^^^^^^^^^^^^^^^^^^^
(Worker_TP5 pid=702) ERROR 04-15 22:42:11 [multiproc_executor.py:949] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/worker_base.py", line 306, in initialize_from_config
(Worker_TP5 pid=702) ERROR 04-15 22:42:11 [multiproc_executor.py:949] self.worker.initialize_from_config(kv_cache_config) # type: ignore
(Worker_TP5 pid=702) ERROR 04-15 22:42:11 [multiproc_executor.py:949] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP5 pid=702) ERROR 04-15 22:42:11 [multiproc_executor.py:949] File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(Worker_TP5 pid=702) ERROR 04-15 22:42:11 [multiproc_executor.py:949] return func(*args, **kwargs)
(Worker_TP5 pid=702) ERROR 04-15 22:42:11 [multiproc_executor.py:949] ^^^^^^^^^^^^^^^^^^^^^
(Worker_TP5 pid=702) ERROR 04-15 22:42:11 [multiproc_executor.py:949] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 527, in initialize_from_config
(Worker_TP5 pid=702) ERROR 04-15 22:42:11 [multiproc_executor.py:949] ensure_kv_transfer_initialized(self.vllm_config, kv_cache_config)
(Worker_TP5 pid=702) ERROR 04-15 22:42:11 [multiproc_executor.py:949] File "/usr/local/lib/python3.12/dist-packages/vllm/distributed/kv_transfer/kv_transfer_state.py", line 67, in ensure_kv_transfer_initialized
(Worker_TP5 pid=702) ERROR 04-15 22:42:11 [multiproc_executor.py:949] _KV_CONNECTOR_AGENT = KVConnectorFactory.create_connector(
(Worker_TP5 pid=702) ERROR 04-15 22:42:11 [multiproc_executor.py:949] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP5 pid=702) ERROR 04-15 22:42:11 [multiproc_executor.py:949] File "/usr/local/lib/python3.12/dist-packages/vllm/distributed/kv_transfer/kv_connector/factory.py", line 59, in create_connector
(Worker_TP5 pid=702) ERROR 04-15 22:42:11 [multiproc_executor.py:949] raise ValueError(
(Worker_TP5 pid=702) ERROR 04-15 22:42:11 [multiproc_executor.py:949] ValueError: Connector LMCacheConnectorV1 does not support HMA but HMA is enabled. Please set `--disable-hybrid-kv-cache-manager`.
(Worker_TP5 pid=702) ERROR 04-15 22:42:11 [multiproc_executor.py:949]
(Worker_TP1 pid=415) ERROR 04-15 22:42:11 [multiproc_executor.py:949] WorkerProc hit an exception.
(Worker_TP1 pid=415) ERROR 04-15 22:42:11 [multiproc_executor.py:949] Traceback (most recent call last):
(Worker_TP1 pid=415) ERROR 04-15 22:42:11 [multiproc_executor.py:949] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 944, in worker_busy_loop
(Worker_TP1 pid=415) ERROR 04-15 22:42:11 [multiproc_executor.py:949] output = func(*args, **kwargs)
(Worker_TP1 pid=415) ERROR 04-15 22:42:11 [multiproc_executor.py:949] ^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=415) ERROR 04-15 22:42:11 [multiproc_executor.py:949] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/worker_base.py", line 306, in initialize_from_config
(Worker_TP1 pid=415) ERROR 04-15 22:42:11 [multiproc_executor.py:949] self.worker.initialize_from_config(kv_cache_config) # type: ignore
(Worker_TP1 pid=415) ERROR 04-15 22:42:11 [multiproc_executor.py:949] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=415) ERROR 04-15 22:42:11 [multiproc_executor.py:949] File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(Worker_TP1 pid=415) ERROR 04-15 22:42:11 [multiproc_executor.py:949] return func(*args, **kwargs)
(Worker_TP1 pid=415) ERROR 04-15 22:42:11 [multiproc_executor.py:949] ^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=415) ERROR 04-15 22:42:11 [multiproc_executor.py:949] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 527, in initialize_from_config
(Worker_TP1 pid=415) ERROR 04-15 22:42:11 [multiproc_executor.py:949] ensure_kv_transfer_initialized(self.vllm_config, kv_cache_config)
(Worker_TP1 pid=415) ERROR 04-15 22:42:11 [multiproc_executor.py:949] File "/usr/local/lib/python3.12/dist-packages/vllm/distributed/kv_transfer/kv_transfer_state.py", line 67, in ensure_kv_transfer_initialized
(Worker_TP1 pid=415) ERROR 04-15 22:42:11 [multiproc_executor.py:949] _KV_CONNECTOR_AGENT = KVConnectorFactory.create_connector(
(Worker_TP1 pid=415) ERROR 04-15 22:42:11 [multiproc_executor.py:949] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=415) ERROR 04-15 22:42:11 [multiproc_executor.py:949] File "/usr/local/lib/python3.12/dist-packages/vllm/distributed/kv_transfer/kv_connector/factory.py", line 59, in create_connector
(Worker_TP1 pid=415) ERROR 04-15 22:42:11 [multiproc_executor.py:949] raise ValueError(
(Worker_TP1 pid=415) ERROR 04-15 22:42:11 [multiproc_executor.py:949] ValueError: Connector LMCacheConnectorV1 does not support HMA but HMA is enabled. Please set `--disable-hybrid-kv-cache-manager`.
(Worker_TP1 pid=415) ERROR 04-15 22:42:11 [multiproc_executor.py:949] Traceback (most recent call last):
(Worker_TP1 pid=415) ERROR 04-15 22:42:11 [multiproc_executor.py:949] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 944, in worker_busy_loop
(Worker_TP1 pid=415) ERROR 04-15 22:42:11 [multiproc_executor.py:949] output = func(*args, **kwargs)
(Worker_TP1 pid=415) ERROR 04-15 22:42:11 [multiproc_executor.py:949] ^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=415) ERROR 04-15 22:42:11 [multiproc_executor.py:949] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/worker_base.py", line 306, in initialize_from_config
(Worker_TP1 pid=415) ERROR 04-15 22:42:11 [multiproc_executor.py:949] self.worker.initialize_from_config(kv_cache_config) # type: ignore
(Worker_TP1 pid=415) ERROR 04-15 22:42:11 [multiproc_executor.py:949] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=415) ERROR 04-15 22:42:11 [multiproc_executor.py:949] File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(Worker_TP1 pid=415) ERROR 04-15 22:42:11 [multiproc_executor.py:949] return func(*args, **kwargs)
(Worker_TP1 pid=415) ERROR 04-15 22:42:11 [multiproc_executor.py:949] ^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=415) ERROR 04-15 22:42:11 [multiproc_executor.py:949] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 527, in initialize_from_config
(Worker_TP1 pid=415) ERROR 04-15 22:42:11 [multiproc_executor.py:949] ensure_kv_transfer_initialized(self.vllm_config, kv_cache_config)
(Worker_TP1 pid=415) ERROR 04-15 22:42:11 [multiproc_executor.py:949] File "/usr/local/lib/python3.12/dist-packages/vllm/distributed/kv_transfer/kv_transfer_state.py", line 67, in ensure_kv_transfer_initialized
(Worker_TP1 pid=415) ERROR 04-15 22:42:11 [multiproc_executor.py:949] _KV_CONNECTOR_AGENT = KVConnectorFactory.create_connector(
(Worker_TP1 pid=415) ERROR 04-15 22:42:11 [multiproc_executor.py:949] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=415) ERROR 04-15 22:42:11 [multiproc_executor.py:949] File "/usr/local/lib/python3.12/dist-packages/vllm/distributed/kv_transfer/kv_connector/factory.py", line 59, in create_connector
(Worker_TP1 pid=415) ERROR 04-15 22:42:11 [multiproc_executor.py:949] raise ValueError(
(Worker_TP1 pid=415) ERROR 04-15 22:42:11 [multiproc_executor.py:949] ValueError: Connector LMCacheConnectorV1 does not support HMA but HMA is enabled. Please set `--disable-hybrid-kv-cache-manager`.
(Worker_TP1 pid=415) ERROR 04-15 22:42:11 [multiproc_executor.py:949]
(EngineCore pid=277) ERROR 04-15 22:42:11 [core.py:1108] EngineCore failed to start.
(EngineCore pid=277) ERROR 04-15 22:42:11 [core.py:1108] Traceback (most recent call last):
(EngineCore pid=277) ERROR 04-15 22:42:11 [core.py:1108] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 1082, in run_engine_core
(EngineCore pid=277) ERROR 04-15 22:42:11 [core.py:1108] engine_core = EngineCoreProc(*args, engine_index=dp_rank, **kwargs)
(EngineCore pid=277) ERROR 04-15 22:42:11 [core.py:1108] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=277) ERROR 04-15 22:42:11 [core.py:1108] File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=277) ERROR 04-15 22:42:11 [core.py:1108] return func(*args, **kwargs)
(EngineCore pid=277) ERROR 04-15 22:42:11 [core.py:1108] ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=277) ERROR 04-15 22:42:11 [core.py:1108] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 848, in __init__
(EngineCore pid=277) ERROR 04-15 22:42:11 [core.py:1108] super().__init__(
(EngineCore pid=277) ERROR 04-15 22:42:11 [core.py:1108] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 124, in __init__
(EngineCore pid=277) ERROR 04-15 22:42:11 [core.py:1108] kv_cache_config = self._initialize_kv_caches(vllm_config)
(EngineCore pid=277) ERROR 04-15 22:42:11 [core.py:1108] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=277) ERROR 04-15 22:42:11 [core.py:1108] File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=277) ERROR 04-15 22:42:11 [core.py:1108] return func(*args, **kwargs)
(EngineCore pid=277) ERROR 04-15 22:42:11 [core.py:1108] ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=277) ERROR 04-15 22:42:11 [core.py:1108] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 280, in _initialize_kv_caches
(EngineCore pid=277) ERROR 04-15 22:42:11 [core.py:1108] self.model_executor.initialize_from_config(kv_cache_configs)
(EngineCore pid=277) ERROR 04-15 22:42:11 [core.py:1108] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/abstract.py", line 117, in initialize_from_config
(EngineCore pid=277) ERROR 04-15 22:42:11 [core.py:1108] self.collective_rpc("initialize_from_config", args=(kv_cache_configs,))
(EngineCore pid=277) ERROR 04-15 22:42:11 [core.py:1108] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 397, in collective_rpc
(EngineCore pid=277) ERROR 04-15 22:42:11 [core.py:1108] return aggregate(get_response())
(EngineCore pid=277) ERROR 04-15 22:42:11 [core.py:1108] ^^^^^^^^^^^^^^
(EngineCore pid=277) ERROR 04-15 22:42:11 [core.py:1108] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 380, in get_response
(EngineCore pid=277) ERROR 04-15 22:42:11 [core.py:1108] raise RuntimeError(
(EngineCore pid=277) ERROR 04-15 22:42:11 [core.py:1108] RuntimeError: Worker failed with error 'Connector LMCacheConnectorV1 does not support HMA but HMA is enabled. Please set `--disable-hybrid-kv-cache-manager`.', please check the stack trace above for the root cause
(Worker_TP3 pid=556) WARNING 04-15 22:42:11 [multiproc_executor.py:871] WorkerProc was terminated
(Worker_TP0 pid=348) WARNING 04-15 22:42:11 [multiproc_executor.py:871] WorkerProc was terminated
(Worker_TP6 pid=775) WARNING 04-15 22:42:11 [multiproc_executor.py:871] WorkerProc was terminated
(Worker_TP7 pid=848) WARNING 04-15 22:42:11 [multiproc_executor.py:871] WorkerProc was terminated
(Worker_TP4 pid=629) WARNING 04-15 22:42:11 [multiproc_executor.py:871] WorkerProc was terminated
(Worker_TP2 pid=483) WARNING 04-15 22:42:11 [multiproc_executor.py:871] WorkerProc was terminated
(Worker_TP1 pid=415) WARNING 04-15 22:42:11 [multiproc_executor.py:871] WorkerProc was terminated
(Worker_TP5 pid=702) WARNING 04-15 22:42:11 [multiproc_executor.py:871] WorkerProc was terminated
(EngineCore pid=277) ERROR 04-15 22:42:14 [multiproc_executor.py:273] Worker proc VllmWorker-3 died unexpectedly, shutting down executor.
(EngineCore pid=277) Process EngineCore:
(EngineCore pid=277) Traceback (most recent call last):
(EngineCore pid=277) File "/usr/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
(EngineCore pid=277) self.run()
(EngineCore pid=277) File "/usr/lib/python3.12/multiprocessing/process.py", line 108, in run
(EngineCore pid=277) self._target(*self._args, **self._kwargs)
(EngineCore pid=277) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 1112, in run_engine_core
(EngineCore pid=277) raise e
(EngineCore pid=277) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 1082, in run_engine_core
(EngineCore pid=277) engine_core = EngineCoreProc(*args, engine_index=dp_rank, **kwargs)
(EngineCore pid=277) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=277) File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=277) return func(*args, **kwargs)
(EngineCore pid=277) ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=277) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 848, in __init__
(EngineCore pid=277) super().__init__(
(EngineCore pid=277) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 124, in __init__
(EngineCore pid=277) kv_cache_config = self._initialize_kv_caches(vllm_config)
(EngineCore pid=277) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=277) File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=277) return func(*args, **kwargs)
(EngineCore pid=277) ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=277) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 280, in _initialize_kv_caches
(EngineCore pid=277) self.model_executor.initialize_from_config(kv_cache_configs)
(EngineCore pid=277) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/abstract.py", line 117, in initialize_from_config
(EngineCore pid=277) self.collective_rpc("initialize_from_config", args=(kv_cache_configs,))
(EngineCore pid=277) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 397, in collective_rpc
(EngineCore pid=277) return aggregate(get_response())
(EngineCore pid=277) ^^^^^^^^^^^^^^
(EngineCore pid=277) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 380, in get_response
(EngineCore pid=277) raise RuntimeError(
(EngineCore pid=277) RuntimeError: Worker failed with error 'Connector LMCacheConnectorV1 does not support HMA but HMA is enabled. Please set `--disable-hybrid-kv-cache-manager`.', please check the stack trace above for the root cause
(APIServer pid=1) Traceback (most recent call last):
(APIServer pid=1) File "/usr/local/bin/vllm", line 10, in <module>
(APIServer pid=1) sys.exit(main())
(APIServer pid=1) ^^^^^^
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/cli/main.py", line 75, in main
(APIServer pid=1) args.dispatch_function(args)
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/cli/serve.py", line 122, in cmd
(APIServer pid=1) uvloop.run(run_server(args))
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 96, in run
(APIServer pid=1) return __asyncio.run(
(APIServer pid=1) ^^^^^^^^^^^^^^
(APIServer pid=1) File "/usr/lib/python3.12/asyncio/runners.py", line 195, in run
(APIServer pid=1) return runner.run(main)
(APIServer pid=1) ^^^^^^^^^^^^^^^^
(APIServer pid=1) File "/usr/lib/python3.12/asyncio/runners.py", line 118, in run
(APIServer pid=1) return self._loop.run_until_complete(task)
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 48, in wrapper
(APIServer pid=1) return await main
(APIServer pid=1) ^^^^^^^^^^
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 670, in run_server
(APIServer pid=1) await run_server_worker(listen_address, sock, args, **uvicorn_kwargs)
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 684, in run_server_worker
(APIServer pid=1) async with build_async_engine_client(
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=1) return await anext(self.gen)
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 100, in build_async_engine_client
(APIServer pid=1) async with build_async_engine_client_from_engine_args(
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=1) return await anext(self.gen)
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 136, in build_async_engine_client_from_engine_args
(APIServer pid=1) async_llm = AsyncLLM.from_vllm_config(
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 225, in from_vllm_config
(APIServer pid=1) return cls(
(APIServer pid=1) ^^^^
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 154, in __init__
(APIServer pid=1) self.engine_core = EngineCoreClient.make_async_mp_client(
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(APIServer pid=1) return func(*args, **kwargs)
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 130, in make_async_mp_client
(APIServer pid=1) return AsyncMPClient(*client_args)
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(APIServer pid=1) return func(*args, **kwargs)
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 887, in __init__
(APIServer pid=1) super().__init__(
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 535, in __init__
(APIServer pid=1) with launch_core_engines(
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File "/usr/lib/python3.12/contextlib.py", line 144, in __exit__
(APIServer pid=1) next(self.gen)
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/utils.py", line 998, in launch_core_engines
(APIServer pid=1) wait_for_engine_startup(
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/utils.py", line 1057, in wait_for_engine_startup
(APIServer pid=1) raise RuntimeError(
(APIServer pid=1) RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}
/usr/lib/python3.12/multiprocessing/resource_tracker.py:279: UserWarning: resource_tracker: There appear to be 8 leaked shared_memory objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '
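The tracebacks above (worker, EngineCore, APIServer) all bottom out in the same root cause: the LMCacheConnectorV1 KV-transfer connector refuses to be built while the hybrid KV cache manager (the HMA the message refers to) is enabled, and the ValueError names the fix itself, the `--disable-hybrid-kv-cache-manager` launch flag. A minimal sketch of that retry follows, assuming the server keeps being launched through the vllm CLI; the wrapper only appends the flag, and everything else (model name, existing flags) is whatever the deployment already passes in, not something taken from this log.

import subprocess
import sys

# Hypothetical relaunch wrapper, illustration only: forward the existing
# `vllm serve` arguments (given to this script on its own command line) and
# append the flag named in the ValueError above.
subprocess.run(
    ["vllm", "serve", *sys.argv[1:], "--disable-hybrid-kv-cache-manager"],
    check=True,
)

The added log below shows a relaunch of exactly that shape: the connector check no longer fires, the hybrid KV cache manager is reported as disabled, and startup instead fails later, during KV cache profiling.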
View File
@@ -0,0 +1,362 @@
(Worker_TP0 pid=347) INFO 04-15 23:04:43 [default_loader.py:384] Loading weights took 22.09 seconds
(Worker_TP0 pid=347) INFO 04-15 23:04:43 [flashinfer_utils.py:238] Padding intermediate size from 336 to 384 for up/down projection weights.
(Worker_TP0 pid=347) INFO 04-15 23:04:43 [nvfp4.py:401] Using MoEPrepareAndFinalizeNoDPEPMonolithic
(Worker_TP0 pid=347) WARNING 04-15 23:04:44 [kv_cache.py:94] Checkpoint does not provide a q scaling factor. Setting it to k_scale. This only matters for FP8 Attention backends (flash-attn or flashinfer).
(Worker_TP0 pid=347) WARNING 04-15 23:04:44 [kv_cache.py:108] Using KV cache scaling factor 1.0 for fp8_e4m3. If this is unintended, verify that k/v_scale scaling factors are properly set in the checkpoint.
(Worker_TP0 pid=347) INFO 04-15 23:04:46 [gpu_model_runner.py:4820] Model loading took 10.4 GiB memory and 133.856349 seconds
(Worker_TP0 pid=347) INFO 04-15 23:04:53 [backends.py:1051] Using cache directory: /root/.cache/vllm/torch_compile_cache/3fd416396e/rank_0_0/backbone for vLLM's torch.compile
(Worker_TP0 pid=347) INFO 04-15 23:04:53 [backends.py:1111] Dynamo bytecode transform time: 4.26 s
(Worker_TP0 pid=347) INFO 04-15 23:04:53 [flashinfer_all_reduce.py:109] Auto-selected flashinfer allreduce backend: trtllm
(Worker_TP0 pid=347) /usr/local/lib/python3.12/dist-packages/torch/distributed/c10d_logger.py:83: UserWarning: barrier(): using the device under current context. You can specify `device_id` in `init_process_group` to mute this warning.
(Worker_TP0 pid=347) return func(*args, **kwargs)
(Worker_TP0 pid=347) INFO 04-15 23:04:54 [flashinfer_all_reduce.py:149] Initialized FlashInfer Allreduce norm fusion workspace with backend=trtllm
(Worker_TP0 pid=347) INFO 04-15 23:04:57 [backends.py:372] Cache the graph of compile range (1, 128) for later use
(Worker_TP0 pid=347) INFO 04-15 23:04:57 [backends.py:372] Cache the graph of compile range (129, 8192) for later use
(Worker_TP0 pid=347) INFO 04-15 23:05:10 [backends.py:390] Compiling a graph for compile range (1, 128) takes 13.86 s
(Worker_TP0 pid=347) INFO 04-15 23:05:11 [backends.py:390] Compiling a graph for compile range (129, 8192) takes 14.38 s
(Worker_TP0 pid=347) INFO 04-15 23:05:13 [decorators.py:640] saved AOT compiled function to /root/.cache/vllm/torch_compile_cache/torch_aot_compile/843944412cd4c5c9ac31fd76eb61f7a06b6ada8d50eaed83ce0c0803840a330f/rank_0_0/model
(Worker_TP0 pid=347) INFO 04-15 23:05:13 [monitor.py:48] torch.compile took 24.48 s in total
(Worker_TP0 pid=347) INFO 04-15 23:05:20 [monitor.py:76] Initial profiling/warmup run took 7.55 s
(Worker_TP3 pid=555) WARNING 04-15 23:05:24 [kv_cache_utils.py:1175] Hybrid KV cache manager is disabled for this hybrid model, This means we do not enable any optimizations for saving KV cache memory (e.g., dropping the KV cache outside the sliding window). The compute of layers like sliding window is still saved.
(Worker_TP3 pid=555) ERROR 04-15 23:05:24 [multiproc_executor.py:949] WorkerProc hit an exception.
(Worker_TP3 pid=555) ERROR 04-15 23:05:24 [multiproc_executor.py:949] Traceback (most recent call last):
(Worker_TP3 pid=555) ERROR 04-15 23:05:24 [multiproc_executor.py:949] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 944, in worker_busy_loop
(Worker_TP3 pid=555) ERROR 04-15 23:05:24 [multiproc_executor.py:949] output = func(*args, **kwargs)
(Worker_TP3 pid=555) ERROR 04-15 23:05:24 [multiproc_executor.py:949] ^^^^^^^^^^^^^^^^^^^^^
(Worker_TP3 pid=555) ERROR 04-15 23:05:24 [multiproc_executor.py:949] File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 124, in decorate_context
(Worker_TP3 pid=555) ERROR 04-15 23:05:24 [multiproc_executor.py:949] return func(*args, **kwargs)
(Worker_TP3 pid=555) ERROR 04-15 23:05:24 [multiproc_executor.py:949] ^^^^^^^^^^^^^^^^^^^^^
(Worker_TP3 pid=555) ERROR 04-15 23:05:24 [multiproc_executor.py:949] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 381, in determine_available_memory
(Worker_TP3 pid=555) ERROR 04-15 23:05:24 [multiproc_executor.py:949] cudagraph_memory_estimate = self.model_runner.profile_cudagraph_memory()
(Worker_TP3 pid=555) ERROR 04-15 23:05:24 [multiproc_executor.py:949] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP3 pid=555) ERROR 04-15 23:05:24 [multiproc_executor.py:949] File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 124, in decorate_context
(Worker_TP3 pid=555) ERROR 04-15 23:05:24 [multiproc_executor.py:949] return func(*args, **kwargs)
(Worker_TP3 pid=555) ERROR 04-15 23:05:24 [multiproc_executor.py:949] ^^^^^^^^^^^^^^^^^^^^^
(Worker_TP3 pid=555) ERROR 04-15 23:05:24 [multiproc_executor.py:949] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 5864, in profile_cudagraph_memory
(Worker_TP3 pid=555) ERROR 04-15 23:05:24 [multiproc_executor.py:949] self._init_minimal_kv_cache_for_profiling()
(Worker_TP3 pid=555) ERROR 04-15 23:05:24 [multiproc_executor.py:949] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 5804, in _init_minimal_kv_cache_for_profiling
(Worker_TP3 pid=555) ERROR 04-15 23:05:24 [multiproc_executor.py:949] kv_cache_groups = get_kv_cache_groups(self.vllm_config, kv_cache_spec)
(Worker_TP3 pid=555) ERROR 04-15 23:05:24 [multiproc_executor.py:949] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP3 pid=555) ERROR 04-15 23:05:24 [multiproc_executor.py:949] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/core/kv_cache_utils.py", line 1236, in get_kv_cache_groups
(Worker_TP3 pid=555) ERROR 04-15 23:05:24 [multiproc_executor.py:949] unify_hybrid_kv_cache_specs(kv_cache_spec)
(Worker_TP3 pid=555) ERROR 04-15 23:05:24 [multiproc_executor.py:949] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/core/kv_cache_utils.py", line 1216, in unify_hybrid_kv_cache_specs
(Worker_TP3 pid=555) ERROR 04-15 23:05:24 [multiproc_executor.py:949] raise ValueError(
(Worker_TP3 pid=555) ERROR 04-15 23:05:24 [multiproc_executor.py:949] ValueError: Hybrid KV cache manager is disabled but failed to convert the KV cache specs to one unified type.
(Worker_TP3 pid=555) ERROR 04-15 23:05:24 [multiproc_executor.py:949] Traceback (most recent call last):
(Worker_TP3 pid=555) ERROR 04-15 23:05:24 [multiproc_executor.py:949] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 944, in worker_busy_loop
(Worker_TP3 pid=555) ERROR 04-15 23:05:24 [multiproc_executor.py:949] output = func(*args, **kwargs)
(Worker_TP3 pid=555) ERROR 04-15 23:05:24 [multiproc_executor.py:949] ^^^^^^^^^^^^^^^^^^^^^
(Worker_TP3 pid=555) ERROR 04-15 23:05:24 [multiproc_executor.py:949] File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 124, in decorate_context
(Worker_TP3 pid=555) ERROR 04-15 23:05:24 [multiproc_executor.py:949] return func(*args, **kwargs)
(Worker_TP3 pid=555) ERROR 04-15 23:05:24 [multiproc_executor.py:949] ^^^^^^^^^^^^^^^^^^^^^
(Worker_TP3 pid=555) ERROR 04-15 23:05:24 [multiproc_executor.py:949] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 381, in determine_available_memory
(Worker_TP3 pid=555) ERROR 04-15 23:05:24 [multiproc_executor.py:949] cudagraph_memory_estimate = self.model_runner.profile_cudagraph_memory()
(Worker_TP3 pid=555) ERROR 04-15 23:05:24 [multiproc_executor.py:949] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP3 pid=555) ERROR 04-15 23:05:24 [multiproc_executor.py:949] File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 124, in decorate_context
(Worker_TP3 pid=555) ERROR 04-15 23:05:24 [multiproc_executor.py:949] return func(*args, **kwargs)
(Worker_TP3 pid=555) ERROR 04-15 23:05:24 [multiproc_executor.py:949] ^^^^^^^^^^^^^^^^^^^^^
(Worker_TP3 pid=555) ERROR 04-15 23:05:24 [multiproc_executor.py:949] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 5864, in profile_cudagraph_memory
(Worker_TP3 pid=555) ERROR 04-15 23:05:24 [multiproc_executor.py:949] self._init_minimal_kv_cache_for_profiling()
(Worker_TP3 pid=555) ERROR 04-15 23:05:24 [multiproc_executor.py:949] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 5804, in _init_minimal_kv_cache_for_profiling
(Worker_TP3 pid=555) ERROR 04-15 23:05:24 [multiproc_executor.py:949] kv_cache_groups = get_kv_cache_groups(self.vllm_config, kv_cache_spec)
(Worker_TP3 pid=555) ERROR 04-15 23:05:24 [multiproc_executor.py:949] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP3 pid=555) ERROR 04-15 23:05:24 [multiproc_executor.py:949] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/core/kv_cache_utils.py", line 1236, in get_kv_cache_groups
(Worker_TP3 pid=555) ERROR 04-15 23:05:24 [multiproc_executor.py:949] unify_hybrid_kv_cache_specs(kv_cache_spec)
(Worker_TP3 pid=555) ERROR 04-15 23:05:24 [multiproc_executor.py:949] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/core/kv_cache_utils.py", line 1216, in unify_hybrid_kv_cache_specs
(Worker_TP3 pid=555) ERROR 04-15 23:05:24 [multiproc_executor.py:949] raise ValueError(
(Worker_TP3 pid=555) ERROR 04-15 23:05:24 [multiproc_executor.py:949] ValueError: Hybrid KV cache manager is disabled but failed to convert the KV cache specs to one unified type.
(Worker_TP3 pid=555) ERROR 04-15 23:05:24 [multiproc_executor.py:949]
(Worker_TP6 pid=774) WARNING 04-15 23:05:24 [kv_cache_utils.py:1175] Hybrid KV cache manager is disabled for this hybrid model, This means we do not enable any optimizations for saving KV cache memory (e.g., dropping the KV cache outside the sliding window). The compute of layers like sliding window is still saved.
(Worker_TP6 pid=774) ERROR 04-15 23:05:24 [multiproc_executor.py:949] WorkerProc hit an exception.
(Worker_TP6 pid=774) ERROR 04-15 23:05:24 [multiproc_executor.py:949] Traceback (most recent call last):
(Worker_TP6 pid=774) ERROR 04-15 23:05:24 [multiproc_executor.py:949] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 944, in worker_busy_loop
(Worker_TP6 pid=774) ERROR 04-15 23:05:24 [multiproc_executor.py:949] output = func(*args, **kwargs)
(Worker_TP6 pid=774) ERROR 04-15 23:05:24 [multiproc_executor.py:949] ^^^^^^^^^^^^^^^^^^^^^
(Worker_TP6 pid=774) ERROR 04-15 23:05:24 [multiproc_executor.py:949] File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 124, in decorate_context
(Worker_TP6 pid=774) ERROR 04-15 23:05:24 [multiproc_executor.py:949] return func(*args, **kwargs)
(Worker_TP6 pid=774) ERROR 04-15 23:05:24 [multiproc_executor.py:949] ^^^^^^^^^^^^^^^^^^^^^
(Worker_TP6 pid=774) ERROR 04-15 23:05:24 [multiproc_executor.py:949] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 381, in determine_available_memory
(Worker_TP6 pid=774) ERROR 04-15 23:05:24 [multiproc_executor.py:949] cudagraph_memory_estimate = self.model_runner.profile_cudagraph_memory()
(Worker_TP6 pid=774) ERROR 04-15 23:05:24 [multiproc_executor.py:949] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP6 pid=774) ERROR 04-15 23:05:24 [multiproc_executor.py:949] File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 124, in decorate_context
(Worker_TP6 pid=774) ERROR 04-15 23:05:24 [multiproc_executor.py:949] return func(*args, **kwargs)
(Worker_TP6 pid=774) ERROR 04-15 23:05:24 [multiproc_executor.py:949] ^^^^^^^^^^^^^^^^^^^^^
(Worker_TP6 pid=774) ERROR 04-15 23:05:24 [multiproc_executor.py:949] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 5864, in profile_cudagraph_memory
(Worker_TP6 pid=774) ERROR 04-15 23:05:24 [multiproc_executor.py:949] self._init_minimal_kv_cache_for_profiling()
(Worker_TP6 pid=774) ERROR 04-15 23:05:24 [multiproc_executor.py:949] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 5804, in _init_minimal_kv_cache_for_profiling
(Worker_TP6 pid=774) ERROR 04-15 23:05:24 [multiproc_executor.py:949] kv_cache_groups = get_kv_cache_groups(self.vllm_config, kv_cache_spec)
(Worker_TP6 pid=774) ERROR 04-15 23:05:24 [multiproc_executor.py:949] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP6 pid=774) ERROR 04-15 23:05:24 [multiproc_executor.py:949] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/core/kv_cache_utils.py", line 1236, in get_kv_cache_groups
(Worker_TP6 pid=774) ERROR 04-15 23:05:24 [multiproc_executor.py:949] unify_hybrid_kv_cache_specs(kv_cache_spec)
(Worker_TP6 pid=774) ERROR 04-15 23:05:24 [multiproc_executor.py:949] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/core/kv_cache_utils.py", line 1216, in unify_hybrid_kv_cache_specs
(Worker_TP6 pid=774) ERROR 04-15 23:05:24 [multiproc_executor.py:949] raise ValueError(
(Worker_TP6 pid=774) ERROR 04-15 23:05:24 [multiproc_executor.py:949] ValueError: Hybrid KV cache manager is disabled but failed to convert the KV cache specs to one unified type.
(Worker_TP6 pid=774) ERROR 04-15 23:05:24 [multiproc_executor.py:949] Traceback (most recent call last):
(Worker_TP6 pid=774) ERROR 04-15 23:05:24 [multiproc_executor.py:949] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 944, in worker_busy_loop
(Worker_TP6 pid=774) ERROR 04-15 23:05:24 [multiproc_executor.py:949] output = func(*args, **kwargs)
(Worker_TP6 pid=774) ERROR 04-15 23:05:24 [multiproc_executor.py:949] ^^^^^^^^^^^^^^^^^^^^^
(Worker_TP6 pid=774) ERROR 04-15 23:05:24 [multiproc_executor.py:949] File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 124, in decorate_context
(Worker_TP6 pid=774) ERROR 04-15 23:05:24 [multiproc_executor.py:949] return func(*args, **kwargs)
(Worker_TP6 pid=774) ERROR 04-15 23:05:24 [multiproc_executor.py:949] ^^^^^^^^^^^^^^^^^^^^^
(Worker_TP6 pid=774) ERROR 04-15 23:05:24 [multiproc_executor.py:949] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 381, in determine_available_memory
(Worker_TP6 pid=774) ERROR 04-15 23:05:24 [multiproc_executor.py:949] cudagraph_memory_estimate = self.model_runner.profile_cudagraph_memory()
(Worker_TP6 pid=774) ERROR 04-15 23:05:24 [multiproc_executor.py:949] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP6 pid=774) ERROR 04-15 23:05:24 [multiproc_executor.py:949] File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 124, in decorate_context
(Worker_TP6 pid=774) ERROR 04-15 23:05:24 [multiproc_executor.py:949] return func(*args, **kwargs)
(Worker_TP6 pid=774) ERROR 04-15 23:05:24 [multiproc_executor.py:949] ^^^^^^^^^^^^^^^^^^^^^
(Worker_TP6 pid=774) ERROR 04-15 23:05:24 [multiproc_executor.py:949] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 5864, in profile_cudagraph_memory
(Worker_TP6 pid=774) ERROR 04-15 23:05:24 [multiproc_executor.py:949] self._init_minimal_kv_cache_for_profiling()
(Worker_TP6 pid=774) ERROR 04-15 23:05:24 [multiproc_executor.py:949] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 5804, in _init_minimal_kv_cache_for_profiling
(Worker_TP6 pid=774) ERROR 04-15 23:05:24 [multiproc_executor.py:949] kv_cache_groups = get_kv_cache_groups(self.vllm_config, kv_cache_spec)
(Worker_TP6 pid=774) ERROR 04-15 23:05:24 [multiproc_executor.py:949] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP6 pid=774) ERROR 04-15 23:05:24 [multiproc_executor.py:949] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/core/kv_cache_utils.py", line 1236, in get_kv_cache_groups
(Worker_TP6 pid=774) ERROR 04-15 23:05:24 [multiproc_executor.py:949] unify_hybrid_kv_cache_specs(kv_cache_spec)
(Worker_TP6 pid=774) ERROR 04-15 23:05:24 [multiproc_executor.py:949] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/core/kv_cache_utils.py", line 1216, in unify_hybrid_kv_cache_specs
(Worker_TP6 pid=774) ERROR 04-15 23:05:24 [multiproc_executor.py:949] raise ValueError(
(Worker_TP6 pid=774) ERROR 04-15 23:05:24 [multiproc_executor.py:949] ValueError: Hybrid KV cache manager is disabled but failed to convert the KV cache specs to one unified type.
(Worker_TP6 pid=774) ERROR 04-15 23:05:24 [multiproc_executor.py:949]
(Worker_TP5 pid=701) WARNING 04-15 23:05:24 [kv_cache_utils.py:1175] Hybrid KV cache manager is disabled for this hybrid model, This means we do not enable any optimizations for saving KV cache memory (e.g., dropping the KV cache outside the sliding window). The compute of layers like sliding window is still saved.
(Worker_TP5 pid=701) ERROR 04-15 23:05:24 [multiproc_executor.py:949] WorkerProc hit an exception.
(Worker_TP5 pid=701) ERROR 04-15 23:05:24 [multiproc_executor.py:949] Traceback (most recent call last):
(Worker_TP5 pid=701) ERROR 04-15 23:05:24 [multiproc_executor.py:949] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 944, in worker_busy_loop
(Worker_TP5 pid=701) ERROR 04-15 23:05:24 [multiproc_executor.py:949] output = func(*args, **kwargs)
(Worker_TP5 pid=701) ERROR 04-15 23:05:24 [multiproc_executor.py:949] ^^^^^^^^^^^^^^^^^^^^^
(Worker_TP5 pid=701) ERROR 04-15 23:05:24 [multiproc_executor.py:949] File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 124, in decorate_context
(Worker_TP5 pid=701) ERROR 04-15 23:05:24 [multiproc_executor.py:949] return func(*args, **kwargs)
(Worker_TP5 pid=701) ERROR 04-15 23:05:24 [multiproc_executor.py:949] ^^^^^^^^^^^^^^^^^^^^^
(Worker_TP5 pid=701) ERROR 04-15 23:05:24 [multiproc_executor.py:949] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 381, in determine_available_memory
(Worker_TP5 pid=701) ERROR 04-15 23:05:24 [multiproc_executor.py:949] cudagraph_memory_estimate = self.model_runner.profile_cudagraph_memory()
(Worker_TP5 pid=701) ERROR 04-15 23:05:24 [multiproc_executor.py:949] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP5 pid=701) ERROR 04-15 23:05:24 [multiproc_executor.py:949] File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 124, in decorate_context
(Worker_TP5 pid=701) ERROR 04-15 23:05:24 [multiproc_executor.py:949] return func(*args, **kwargs)
(Worker_TP5 pid=701) ERROR 04-15 23:05:24 [multiproc_executor.py:949] ^^^^^^^^^^^^^^^^^^^^^
(Worker_TP5 pid=701) ERROR 04-15 23:05:24 [multiproc_executor.py:949] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 5864, in profile_cudagraph_memory
(Worker_TP5 pid=701) ERROR 04-15 23:05:24 [multiproc_executor.py:949] self._init_minimal_kv_cache_for_profiling()
(Worker_TP5 pid=701) ERROR 04-15 23:05:24 [multiproc_executor.py:949] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 5804, in _init_minimal_kv_cache_for_profiling
(Worker_TP5 pid=701) ERROR 04-15 23:05:24 [multiproc_executor.py:949] kv_cache_groups = get_kv_cache_groups(self.vllm_config, kv_cache_spec)
(Worker_TP5 pid=701) ERROR 04-15 23:05:24 [multiproc_executor.py:949] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP5 pid=701) ERROR 04-15 23:05:24 [multiproc_executor.py:949] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/core/kv_cache_utils.py", line 1236, in get_kv_cache_groups
(Worker_TP5 pid=701) ERROR 04-15 23:05:24 [multiproc_executor.py:949] unify_hybrid_kv_cache_specs(kv_cache_spec)
(Worker_TP5 pid=701) ERROR 04-15 23:05:24 [multiproc_executor.py:949] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/core/kv_cache_utils.py", line 1216, in unify_hybrid_kv_cache_specs
(Worker_TP5 pid=701) ERROR 04-15 23:05:24 [multiproc_executor.py:949] raise ValueError(
(Worker_TP5 pid=701) ERROR 04-15 23:05:24 [multiproc_executor.py:949] ValueError: Hybrid KV cache manager is disabled but failed to convert the KV cache specs to one unified type.
(Worker_TP5 pid=701) ERROR 04-15 23:05:24 [multiproc_executor.py:949] Traceback (most recent call last):
(Worker_TP5 pid=701) ERROR 04-15 23:05:24 [multiproc_executor.py:949] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 944, in worker_busy_loop
(Worker_TP5 pid=701) ERROR 04-15 23:05:24 [multiproc_executor.py:949] output = func(*args, **kwargs)
(Worker_TP5 pid=701) ERROR 04-15 23:05:24 [multiproc_executor.py:949] ^^^^^^^^^^^^^^^^^^^^^
(Worker_TP5 pid=701) ERROR 04-15 23:05:24 [multiproc_executor.py:949] File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 124, in decorate_context
(Worker_TP5 pid=701) ERROR 04-15 23:05:24 [multiproc_executor.py:949] return func(*args, **kwargs)
(Worker_TP5 pid=701) ERROR 04-15 23:05:24 [multiproc_executor.py:949] ^^^^^^^^^^^^^^^^^^^^^
(Worker_TP5 pid=701) ERROR 04-15 23:05:24 [multiproc_executor.py:949] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 381, in determine_available_memory
(Worker_TP5 pid=701) ERROR 04-15 23:05:24 [multiproc_executor.py:949] cudagraph_memory_estimate = self.model_runner.profile_cudagraph_memory()
(Worker_TP5 pid=701) ERROR 04-15 23:05:24 [multiproc_executor.py:949] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP5 pid=701) ERROR 04-15 23:05:24 [multiproc_executor.py:949] File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 124, in decorate_context
(Worker_TP5 pid=701) ERROR 04-15 23:05:24 [multiproc_executor.py:949] return func(*args, **kwargs)
(Worker_TP5 pid=701) ERROR 04-15 23:05:24 [multiproc_executor.py:949] ^^^^^^^^^^^^^^^^^^^^^
(Worker_TP5 pid=701) ERROR 04-15 23:05:24 [multiproc_executor.py:949] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 5864, in profile_cudagraph_memory
(Worker_TP5 pid=701) ERROR 04-15 23:05:24 [multiproc_executor.py:949] self._init_minimal_kv_cache_for_profiling()
(Worker_TP5 pid=701) ERROR 04-15 23:05:24 [multiproc_executor.py:949] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 5804, in _init_minimal_kv_cache_for_profiling
(Worker_TP5 pid=701) ERROR 04-15 23:05:24 [multiproc_executor.py:949] kv_cache_groups = get_kv_cache_groups(self.vllm_config, kv_cache_spec)
(Worker_TP5 pid=701) ERROR 04-15 23:05:24 [multiproc_executor.py:949] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP5 pid=701) ERROR 04-15 23:05:24 [multiproc_executor.py:949] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/core/kv_cache_utils.py", line 1236, in get_kv_cache_groups
(Worker_TP5 pid=701) ERROR 04-15 23:05:24 [multiproc_executor.py:949] unify_hybrid_kv_cache_specs(kv_cache_spec)
(Worker_TP5 pid=701) ERROR 04-15 23:05:24 [multiproc_executor.py:949] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/core/kv_cache_utils.py", line 1216, in unify_hybrid_kv_cache_specs
(Worker_TP5 pid=701) ERROR 04-15 23:05:24 [multiproc_executor.py:949] raise ValueError(
(Worker_TP5 pid=701) ERROR 04-15 23:05:24 [multiproc_executor.py:949] ValueError: Hybrid KV cache manager is disabled but failed to convert the KV cache specs to one unified type.
(Worker_TP5 pid=701) ERROR 04-15 23:05:24 [multiproc_executor.py:949]
(Worker_TP0 pid=347) WARNING 04-15 23:05:24 [kv_cache_utils.py:1175] Hybrid KV cache manager is disabled for this hybrid model, This means we do not enable any optimizations for saving KV cache memory (e.g., dropping the KV cache outside the sliding window). The compute of layers like sliding window is still saved.
(Worker_TP0 pid=347) ERROR 04-15 23:05:24 [multiproc_executor.py:949] WorkerProc hit an exception.
(Worker_TP0 pid=347) ERROR 04-15 23:05:24 [multiproc_executor.py:949] Traceback (most recent call last):
(Worker_TP0 pid=347) ERROR 04-15 23:05:24 [multiproc_executor.py:949] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 944, in worker_busy_loop
(Worker_TP0 pid=347) ERROR 04-15 23:05:24 [multiproc_executor.py:949] output = func(*args, **kwargs)
(Worker_TP0 pid=347) ERROR 04-15 23:05:24 [multiproc_executor.py:949] ^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=347) ERROR 04-15 23:05:24 [multiproc_executor.py:949] File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 124, in decorate_context
(Worker_TP0 pid=347) ERROR 04-15 23:05:24 [multiproc_executor.py:949] return func(*args, **kwargs)
(Worker_TP0 pid=347) ERROR 04-15 23:05:24 [multiproc_executor.py:949] ^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=347) ERROR 04-15 23:05:24 [multiproc_executor.py:949] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 381, in determine_available_memory
(Worker_TP0 pid=347) ERROR 04-15 23:05:24 [multiproc_executor.py:949] cudagraph_memory_estimate = self.model_runner.profile_cudagraph_memory()
(Worker_TP0 pid=347) ERROR 04-15 23:05:24 [multiproc_executor.py:949] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=347) ERROR 04-15 23:05:24 [multiproc_executor.py:949] File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 124, in decorate_context
(Worker_TP0 pid=347) ERROR 04-15 23:05:24 [multiproc_executor.py:949] return func(*args, **kwargs)
(Worker_TP0 pid=347) ERROR 04-15 23:05:24 [multiproc_executor.py:949] ^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=347) ERROR 04-15 23:05:24 [multiproc_executor.py:949] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 5864, in profile_cudagraph_memory
(Worker_TP0 pid=347) ERROR 04-15 23:05:24 [multiproc_executor.py:949] self._init_minimal_kv_cache_for_profiling()
(Worker_TP0 pid=347) ERROR 04-15 23:05:24 [multiproc_executor.py:949] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 5804, in _init_minimal_kv_cache_for_profiling
(Worker_TP0 pid=347) ERROR 04-15 23:05:24 [multiproc_executor.py:949] kv_cache_groups = get_kv_cache_groups(self.vllm_config, kv_cache_spec)
(Worker_TP0 pid=347) ERROR 04-15 23:05:24 [multiproc_executor.py:949] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=347) ERROR 04-15 23:05:24 [multiproc_executor.py:949] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/core/kv_cache_utils.py", line 1236, in get_kv_cache_groups
(Worker_TP0 pid=347) ERROR 04-15 23:05:24 [multiproc_executor.py:949] unify_hybrid_kv_cache_specs(kv_cache_spec)
(Worker_TP0 pid=347) ERROR 04-15 23:05:24 [multiproc_executor.py:949] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/core/kv_cache_utils.py", line 1216, in unify_hybrid_kv_cache_specs
(Worker_TP0 pid=347) ERROR 04-15 23:05:24 [multiproc_executor.py:949] raise ValueError(
(Worker_TP0 pid=347) ERROR 04-15 23:05:24 [multiproc_executor.py:949] ValueError: Hybrid KV cache manager is disabled but failed to convert the KV cache specs to one unified type.
(Worker_TP0 pid=347) ERROR 04-15 23:05:24 [multiproc_executor.py:949] Traceback (most recent call last):
(Worker_TP0 pid=347) ERROR 04-15 23:05:24 [multiproc_executor.py:949] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 944, in worker_busy_loop
(Worker_TP0 pid=347) ERROR 04-15 23:05:24 [multiproc_executor.py:949] output = func(*args, **kwargs)
(Worker_TP0 pid=347) ERROR 04-15 23:05:24 [multiproc_executor.py:949] ^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=347) ERROR 04-15 23:05:24 [multiproc_executor.py:949] File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 124, in decorate_context
(Worker_TP0 pid=347) ERROR 04-15 23:05:24 [multiproc_executor.py:949] return func(*args, **kwargs)
(Worker_TP0 pid=347) ERROR 04-15 23:05:24 [multiproc_executor.py:949] ^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=347) ERROR 04-15 23:05:24 [multiproc_executor.py:949] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 381, in determine_available_memory
(Worker_TP0 pid=347) ERROR 04-15 23:05:24 [multiproc_executor.py:949] cudagraph_memory_estimate = self.model_runner.profile_cudagraph_memory()
(Worker_TP0 pid=347) ERROR 04-15 23:05:24 [multiproc_executor.py:949] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=347) ERROR 04-15 23:05:24 [multiproc_executor.py:949] File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 124, in decorate_context
(Worker_TP0 pid=347) ERROR 04-15 23:05:24 [multiproc_executor.py:949] return func(*args, **kwargs)
(Worker_TP0 pid=347) ERROR 04-15 23:05:24 [multiproc_executor.py:949] ^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=347) ERROR 04-15 23:05:24 [multiproc_executor.py:949] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 5864, in profile_cudagraph_memory
(Worker_TP0 pid=347) ERROR 04-15 23:05:24 [multiproc_executor.py:949] self._init_minimal_kv_cache_for_profiling()
(Worker_TP0 pid=347) ERROR 04-15 23:05:24 [multiproc_executor.py:949] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 5804, in _init_minimal_kv_cache_for_profiling
(Worker_TP0 pid=347) ERROR 04-15 23:05:24 [multiproc_executor.py:949] kv_cache_groups = get_kv_cache_groups(self.vllm_config, kv_cache_spec)
(Worker_TP0 pid=347) ERROR 04-15 23:05:24 [multiproc_executor.py:949] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=347) ERROR 04-15 23:05:24 [multiproc_executor.py:949] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/core/kv_cache_utils.py", line 1236, in get_kv_cache_groups
(Worker_TP0 pid=347) ERROR 04-15 23:05:24 [multiproc_executor.py:949] unify_hybrid_kv_cache_specs(kv_cache_spec)
(Worker_TP0 pid=347) ERROR 04-15 23:05:24 [multiproc_executor.py:949] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/core/kv_cache_utils.py", line 1216, in unify_hybrid_kv_cache_specs
(Worker_TP0 pid=347) ERROR 04-15 23:05:24 [multiproc_executor.py:949] raise ValueError(
(Worker_TP0 pid=347) ERROR 04-15 23:05:24 [multiproc_executor.py:949] ValueError: Hybrid KV cache manager is disabled but failed to convert the KV cache specs to one unified type.
(Worker_TP0 pid=347) ERROR 04-15 23:05:24 [multiproc_executor.py:949]
(EngineCore pid=276) ERROR 04-15 23:05:24 [core.py:1108] EngineCore failed to start.
(EngineCore pid=276) ERROR 04-15 23:05:24 [core.py:1108] Traceback (most recent call last):
(EngineCore pid=276) ERROR 04-15 23:05:24 [core.py:1108] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 1082, in run_engine_core
(EngineCore pid=276) ERROR 04-15 23:05:24 [core.py:1108] engine_core = EngineCoreProc(*args, engine_index=dp_rank, **kwargs)
(EngineCore pid=276) ERROR 04-15 23:05:24 [core.py:1108] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=276) ERROR 04-15 23:05:24 [core.py:1108] File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=276) ERROR 04-15 23:05:24 [core.py:1108] return func(*args, **kwargs)
(EngineCore pid=276) ERROR 04-15 23:05:24 [core.py:1108] ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=276) ERROR 04-15 23:05:24 [core.py:1108] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 848, in __init__
(EngineCore pid=276) ERROR 04-15 23:05:24 [core.py:1108] super().__init__(
(EngineCore pid=276) ERROR 04-15 23:05:24 [core.py:1108] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 124, in __init__
(EngineCore pid=276) ERROR 04-15 23:05:24 [core.py:1108] kv_cache_config = self._initialize_kv_caches(vllm_config)
(EngineCore pid=276) ERROR 04-15 23:05:24 [core.py:1108] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=276) ERROR 04-15 23:05:24 [core.py:1108] File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=276) ERROR 04-15 23:05:24 [core.py:1108] return func(*args, **kwargs)
(EngineCore pid=276) ERROR 04-15 23:05:24 [core.py:1108] ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=276) ERROR 04-15 23:05:24 [core.py:1108] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 247, in _initialize_kv_caches
(EngineCore pid=276) ERROR 04-15 23:05:24 [core.py:1108] available_gpu_memory = self.model_executor.determine_available_memory()
(EngineCore pid=276) ERROR 04-15 23:05:24 [core.py:1108] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=276) ERROR 04-15 23:05:24 [core.py:1108] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/abstract.py", line 136, in determine_available_memory
(EngineCore pid=276) ERROR 04-15 23:05:24 [core.py:1108] return self.collective_rpc("determine_available_memory")
(EngineCore pid=276) ERROR 04-15 23:05:24 [core.py:1108] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=276) ERROR 04-15 23:05:24 [core.py:1108] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 397, in collective_rpc
(EngineCore pid=276) ERROR 04-15 23:05:24 [core.py:1108] return aggregate(get_response())
(EngineCore pid=276) ERROR 04-15 23:05:24 [core.py:1108] ^^^^^^^^^^^^^^
(EngineCore pid=276) ERROR 04-15 23:05:24 [core.py:1108] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 380, in get_response
(EngineCore pid=276) ERROR 04-15 23:05:24 [core.py:1108] raise RuntimeError(
(EngineCore pid=276) ERROR 04-15 23:05:24 [core.py:1108] RuntimeError: Worker failed with error 'Hybrid KV cache manager is disabled but failed to convert the KV cache specs to one unified type.', please check the stack trace above for the root cause
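This is the follow-on failure: with the hybrid KV cache manager disabled (the warning each worker prints just before its error), `unify_hybrid_kv_cache_specs` has to collapse the per-layer KV cache specs of this hybrid model into one unified type before memory profiling, and it cannot, so startup aborts again. The commit title talks about monkey patching vLLM; purely as an illustration of that mechanism, and not the patch this commit actually applies (the log does not show one), a wrapper over the function in the traceback could look like the sketch below. Only the module path, function name, and single kv_cache_spec argument come from the traceback; downgrading the ValueError to a warning is an assumption, and whether leaving the specs un-unified is workable is exactly the open question.

import logging

import vllm.v1.core.kv_cache_utils as kv_cache_utils

# Hypothetical monkey patch, illustration only: keep the original function and
# rebind the module attribute to a wrapper that logs the ValueError instead of
# letting it abort engine startup. Because the caller, get_kv_cache_groups,
# lives in the same module (see the traceback), rebinding the attribute is
# enough to intercept the call.
_original_unify = kv_cache_utils.unify_hybrid_kv_cache_specs

def _tolerant_unify(kv_cache_spec):
    try:
        _original_unify(kv_cache_spec)
    except ValueError as err:
        logging.getLogger(__name__).warning(
            "unify_hybrid_kv_cache_specs skipped: %s", err
        )

kv_cache_utils.unify_hybrid_kv_cache_specs = _tolerant_unify

For a patch like this to matter it would have to run inside the worker processes that raise the error (the Worker_TP* processes above), not only in the API server; how this commit injects its patch is outside what this log shows.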
(Worker_TP6 pid=774) WARNING 04-15 23:05:24 [multiproc_executor.py:871] WorkerProc was terminated
(Worker_TP5 pid=701) WARNING 04-15 23:05:24 [multiproc_executor.py:871] WorkerProc was terminated
(Worker_TP0 pid=347) WARNING 04-15 23:05:24 [multiproc_executor.py:871] WorkerProc was terminated
(Worker_TP3 pid=555) WARNING 04-15 23:05:24 [multiproc_executor.py:871] WorkerProc was terminated
(Worker_TP1 pid=414) WARNING 04-15 23:05:24 [multiproc_executor.py:871] WorkerProc was terminated
(Worker_TP4 pid=628) WARNING 04-15 23:05:24 [multiproc_executor.py:871] WorkerProc was terminated
(Worker_TP2 pid=482) Exception ignored in: <function ExactWeakKeyDictionary.__setitem__.<locals>.<lambda> at 0x7f04b2c57a60>
(Worker_TP2 pid=482) Traceback (most recent call last):
(Worker_TP2 pid=482) File "/usr/local/lib/python3.12/dist-packages/torch/_dynamo/utils.py", line 1025, in <lambda>
(Worker_TP2 pid=482) self.refs[idx] = weakref.ref(key, lambda ref: self._remove_id(idx))
(Worker_TP2 pid=482)
(Worker_TP2 pid=482) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 797, in signal_handler
(Worker_TP2 pid=482) raise SystemExit()
(Worker_TP2 pid=482) SystemExit:
(Worker_TP2 pid=482) WARNING 04-15 23:05:24 [kv_cache_utils.py:1175] Hybrid KV cache manager is disabled for this hybrid model, This means we do not enable any optimizations for saving KV cache memory (e.g., dropping the KV cache outside the sliding window). The compute of layers like sliding window is still saved.
(Worker_TP2 pid=482) ERROR 04-15 23:05:24 [multiproc_executor.py:949] WorkerProc hit an exception.
(Worker_TP2 pid=482) ERROR 04-15 23:05:24 [multiproc_executor.py:949] Traceback (most recent call last):
(Worker_TP2 pid=482) ERROR 04-15 23:05:24 [multiproc_executor.py:949] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 944, in worker_busy_loop
(Worker_TP2 pid=482) ERROR 04-15 23:05:24 [multiproc_executor.py:949] output = func(*args, **kwargs)
(Worker_TP2 pid=482) ERROR 04-15 23:05:24 [multiproc_executor.py:949] ^^^^^^^^^^^^^^^^^^^^^
(Worker_TP2 pid=482) ERROR 04-15 23:05:24 [multiproc_executor.py:949] File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 124, in decorate_context
(Worker_TP2 pid=482) ERROR 04-15 23:05:24 [multiproc_executor.py:949] return func(*args, **kwargs)
(Worker_TP2 pid=482) ERROR 04-15 23:05:24 [multiproc_executor.py:949] ^^^^^^^^^^^^^^^^^^^^^
(Worker_TP2 pid=482) ERROR 04-15 23:05:24 [multiproc_executor.py:949] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 381, in determine_available_memory
(Worker_TP2 pid=482) ERROR 04-15 23:05:24 [multiproc_executor.py:949] cudagraph_memory_estimate = self.model_runner.profile_cudagraph_memory()
(Worker_TP2 pid=482) ERROR 04-15 23:05:24 [multiproc_executor.py:949] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP2 pid=482) ERROR 04-15 23:05:24 [multiproc_executor.py:949] File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 124, in decorate_context
(Worker_TP2 pid=482) ERROR 04-15 23:05:24 [multiproc_executor.py:949] return func(*args, **kwargs)
(Worker_TP2 pid=482) ERROR 04-15 23:05:24 [multiproc_executor.py:949] ^^^^^^^^^^^^^^^^^^^^^
(Worker_TP2 pid=482) ERROR 04-15 23:05:24 [multiproc_executor.py:949] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 5864, in profile_cudagraph_memory
(Worker_TP2 pid=482) ERROR 04-15 23:05:24 [multiproc_executor.py:949] self._init_minimal_kv_cache_for_profiling()
(Worker_TP2 pid=482) ERROR 04-15 23:05:24 [multiproc_executor.py:949] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 5804, in _init_minimal_kv_cache_for_profiling
(Worker_TP2 pid=482) ERROR 04-15 23:05:24 [multiproc_executor.py:949] kv_cache_groups = get_kv_cache_groups(self.vllm_config, kv_cache_spec)
(Worker_TP2 pid=482) ERROR 04-15 23:05:24 [multiproc_executor.py:949] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP2 pid=482) ERROR 04-15 23:05:24 [multiproc_executor.py:949] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/core/kv_cache_utils.py", line 1236, in get_kv_cache_groups
(Worker_TP2 pid=482) ERROR 04-15 23:05:24 [multiproc_executor.py:949] unify_hybrid_kv_cache_specs(kv_cache_spec)
(Worker_TP2 pid=482) ERROR 04-15 23:05:24 [multiproc_executor.py:949] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/core/kv_cache_utils.py", line 1216, in unify_hybrid_kv_cache_specs
(Worker_TP2 pid=482) ERROR 04-15 23:05:24 [multiproc_executor.py:949] raise ValueError(
(Worker_TP2 pid=482) ERROR 04-15 23:05:24 [multiproc_executor.py:949] ValueError: Hybrid KV cache manager is disabled but failed to convert the KV cache specs to one unified type.
(Worker_TP2 pid=482) ERROR 04-15 23:05:24 [multiproc_executor.py:949] Traceback (most recent call last):
(Worker_TP2 pid=482) ERROR 04-15 23:05:24 [multiproc_executor.py:949] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 944, in worker_busy_loop
(Worker_TP2 pid=482) ERROR 04-15 23:05:24 [multiproc_executor.py:949] output = func(*args, **kwargs)
(Worker_TP2 pid=482) ERROR 04-15 23:05:24 [multiproc_executor.py:949] ^^^^^^^^^^^^^^^^^^^^^
(Worker_TP2 pid=482) ERROR 04-15 23:05:24 [multiproc_executor.py:949] File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 124, in decorate_context
(Worker_TP2 pid=482) ERROR 04-15 23:05:24 [multiproc_executor.py:949] return func(*args, **kwargs)
(Worker_TP2 pid=482) ERROR 04-15 23:05:24 [multiproc_executor.py:949] ^^^^^^^^^^^^^^^^^^^^^
(Worker_TP2 pid=482) ERROR 04-15 23:05:24 [multiproc_executor.py:949] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 381, in determine_available_memory
(Worker_TP2 pid=482) ERROR 04-15 23:05:24 [multiproc_executor.py:949] cudagraph_memory_estimate = self.model_runner.profile_cudagraph_memory()
(Worker_TP2 pid=482) ERROR 04-15 23:05:24 [multiproc_executor.py:949] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP2 pid=482) ERROR 04-15 23:05:24 [multiproc_executor.py:949] File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 124, in decorate_context
(Worker_TP2 pid=482) ERROR 04-15 23:05:24 [multiproc_executor.py:949] return func(*args, **kwargs)
(Worker_TP2 pid=482) ERROR 04-15 23:05:24 [multiproc_executor.py:949] ^^^^^^^^^^^^^^^^^^^^^
(Worker_TP2 pid=482) ERROR 04-15 23:05:24 [multiproc_executor.py:949] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 5864, in profile_cudagraph_memory
(Worker_TP2 pid=482) ERROR 04-15 23:05:24 [multiproc_executor.py:949] self._init_minimal_kv_cache_for_profiling()
(Worker_TP2 pid=482) ERROR 04-15 23:05:24 [multiproc_executor.py:949] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 5804, in _init_minimal_kv_cache_for_profiling
(Worker_TP2 pid=482) ERROR 04-15 23:05:24 [multiproc_executor.py:949] kv_cache_groups = get_kv_cache_groups(self.vllm_config, kv_cache_spec)
(Worker_TP2 pid=482) ERROR 04-15 23:05:24 [multiproc_executor.py:949] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP2 pid=482) ERROR 04-15 23:05:24 [multiproc_executor.py:949] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/core/kv_cache_utils.py", line 1236, in get_kv_cache_groups
(Worker_TP2 pid=482) ERROR 04-15 23:05:24 [multiproc_executor.py:949] unify_hybrid_kv_cache_specs(kv_cache_spec)
(Worker_TP2 pid=482) ERROR 04-15 23:05:24 [multiproc_executor.py:949] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/core/kv_cache_utils.py", line 1216, in unify_hybrid_kv_cache_specs
(Worker_TP2 pid=482) ERROR 04-15 23:05:24 [multiproc_executor.py:949] raise ValueError(
(Worker_TP2 pid=482) ERROR 04-15 23:05:24 [multiproc_executor.py:949] ValueError: Hybrid KV cache manager is disabled but failed to convert the KV cache specs to one unified type.
(Worker_TP2 pid=482) ERROR 04-15 23:05:24 [multiproc_executor.py:949]
(Worker_TP7 pid=847) Exception ignored in: <function ExactWeakKeyDictionary.__setitem__.<locals>.<lambda> at 0x7f4be1d7df80>
(Worker_TP7 pid=847) Traceback (most recent call last):
(Worker_TP7 pid=847) File "/usr/local/lib/python3.12/dist-packages/torch/_dynamo/utils.py", line 1025, in <lambda>
(Worker_TP7 pid=847) self.refs[idx] = weakref.ref(key, lambda ref: self._remove_id(idx))
(Worker_TP7 pid=847)
(Worker_TP7 pid=847) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 797, in signal_handler
(Worker_TP7 pid=847) raise SystemExit()
(Worker_TP7 pid=847) SystemExit:
(Worker_TP7 pid=847) WARNING 04-15 23:05:24 [kv_cache_utils.py:1175] Hybrid KV cache manager is disabled for this hybrid model, This means we do not enable any optimizations for saving KV cache memory (e.g., dropping the KV cache outside the sliding window). The compute of layers like sliding window is still saved.
(Worker_TP7 pid=847) ERROR 04-15 23:05:24 [multiproc_executor.py:949] WorkerProc hit an exception.
(Worker_TP7 pid=847) ERROR 04-15 23:05:24 [multiproc_executor.py:949] Traceback (most recent call last):
(Worker_TP7 pid=847) ERROR 04-15 23:05:24 [multiproc_executor.py:949] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 944, in worker_busy_loop
(Worker_TP7 pid=847) ERROR 04-15 23:05:24 [multiproc_executor.py:949] output = func(*args, **kwargs)
(Worker_TP7 pid=847) ERROR 04-15 23:05:24 [multiproc_executor.py:949] ^^^^^^^^^^^^^^^^^^^^^
(Worker_TP7 pid=847) ERROR 04-15 23:05:24 [multiproc_executor.py:949] File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 124, in decorate_context
(Worker_TP7 pid=847) ERROR 04-15 23:05:24 [multiproc_executor.py:949] return func(*args, **kwargs)
(Worker_TP7 pid=847) ERROR 04-15 23:05:24 [multiproc_executor.py:949] ^^^^^^^^^^^^^^^^^^^^^
(Worker_TP7 pid=847) ERROR 04-15 23:05:24 [multiproc_executor.py:949] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 381, in determine_available_memory
(Worker_TP7 pid=847) ERROR 04-15 23:05:24 [multiproc_executor.py:949] cudagraph_memory_estimate = self.model_runner.profile_cudagraph_memory()
(Worker_TP7 pid=847) ERROR 04-15 23:05:24 [multiproc_executor.py:949] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP7 pid=847) ERROR 04-15 23:05:24 [multiproc_executor.py:949] File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 124, in decorate_context
(Worker_TP7 pid=847) ERROR 04-15 23:05:24 [multiproc_executor.py:949] return func(*args, **kwargs)
(Worker_TP7 pid=847) ERROR 04-15 23:05:24 [multiproc_executor.py:949] ^^^^^^^^^^^^^^^^^^^^^
(Worker_TP7 pid=847) ERROR 04-15 23:05:24 [multiproc_executor.py:949] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 5864, in profile_cudagraph_memory
(Worker_TP7 pid=847) ERROR 04-15 23:05:24 [multiproc_executor.py:949] self._init_minimal_kv_cache_for_profiling()
(Worker_TP7 pid=847) ERROR 04-15 23:05:24 [multiproc_executor.py:949] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 5804, in _init_minimal_kv_cache_for_profiling
(Worker_TP7 pid=847) ERROR 04-15 23:05:24 [multiproc_executor.py:949] kv_cache_groups = get_kv_cache_groups(self.vllm_config, kv_cache_spec)
(Worker_TP7 pid=847) ERROR 04-15 23:05:24 [multiproc_executor.py:949] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP7 pid=847) ERROR 04-15 23:05:24 [multiproc_executor.py:949] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/core/kv_cache_utils.py", line 1236, in get_kv_cache_groups
(Worker_TP7 pid=847) ERROR 04-15 23:05:24 [multiproc_executor.py:949] unify_hybrid_kv_cache_specs(kv_cache_spec)
(Worker_TP7 pid=847) ERROR 04-15 23:05:24 [multiproc_executor.py:949] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/core/kv_cache_utils.py", line 1216, in unify_hybrid_kv_cache_specs
(Worker_TP7 pid=847) ERROR 04-15 23:05:24 [multiproc_executor.py:949] raise ValueError(
(Worker_TP7 pid=847) ERROR 04-15 23:05:24 [multiproc_executor.py:949] ValueError: Hybrid KV cache manager is disabled but failed to convert the KV cache specs to one unified type.
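The ValueError above is raised from unify_hybrid_kv_cache_specs when get_kv_cache_groups, called during cudagraph memory profiling, cannot reduce the model's mixed per-layer KV cache specs to one unified type while the hybrid KV cache manager is disabled. The sketch below is a hypothetical diagnostic wrapper, not the patch from this commit: it assumes unify_hybrid_kv_cache_specs receives the per-layer spec mapping shown in the traceback, and only logs which layer spec types failed to unify before re-raising. It would have to run inside each TP worker process (e.g. via a vLLM plugin) to take effect, since the workers are separate processes.

    import logging

    from vllm.v1.core import kv_cache_utils

    logger = logging.getLogger(__name__)

    # Keep a handle on the original so the wrapper can delegate to it.
    _original_unify = kv_cache_utils.unify_hybrid_kv_cache_specs


    def _logged_unify(kv_cache_spec):
        # Assumption: kv_cache_spec maps layer names to their KV cache spec objects.
        try:
            return _original_unify(kv_cache_spec)
        except ValueError:
            # Log which spec types are present so the non-unifiable layers
            # (e.g. full attention vs. mamba/sliding-window) are visible.
            for layer_name, spec in kv_cache_spec.items():
                logger.error("KV cache spec for %s: %s", layer_name, type(spec).__name__)
            raise


    # get_kv_cache_groups resolves the function through module globals
    # (both live in kv_cache_utils.py per the traceback), so rebinding the
    # module attribute is enough for this sketch.
    kv_cache_utils.unify_hybrid_kv_cache_specs = _logged_unify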