Stop using title frontmatter and fix doc that can only be reached by search (#20623)

Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>

Author: Harry Mellor
Date: 2025-07-08 11:27:40 +01:00
Committed by: GitHub
Parent: b4bab81660
Commit: b942c094e3
81 changed files with 82 additions and 238 deletions
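Every file below follows the same pattern: the `title:` frontmatter is removed and replaced with a plain first-level Markdown heading. A sketch of the pattern, using a hypothetical `example.md` for illustration:

```diff
--- a/docs/deployment/example.md
+++ b/docs/deployment/example.md
@@ -1,6 +1,4 @@
----
-title: Example Integration
----
+# Example Integration
 [Example Integration](https://example.com) is ...
```

With the frontmatter gone, the page title now comes from the `#` heading itself, so the page is reachable through normal navigation and rendering rather than only through search.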


@@ -1,6 +1,4 @@
----
-title: Using Docker
----
+# Using Docker
 [](){ #deployment-docker-pre-built-image }


@@ -1,6 +1,5 @@
----
-title: Anyscale
----
+# Anyscale
 [](){ #deployment-anyscale }
 [Anyscale](https://www.anyscale.com) is a managed, multi-cloud platform developed by the creators of Ray.


@@ -1,6 +1,4 @@
----
-title: Anything LLM
----
+# Anything LLM
 [Anything LLM](https://github.com/Mintplex-Labs/anything-llm) is a full-stack application that enables you to turn any document, resource, or piece of content into context that any LLM can use as a reference during chatting.


@@ -1,6 +1,4 @@
----
-title: AutoGen
----
+# AutoGen
 [AutoGen](https://github.com/microsoft/autogen) is a framework for creating multi-agent AI applications that can act autonomously or work alongside humans.


@@ -1,6 +1,4 @@
----
-title: BentoML
----
+# BentoML
 [BentoML](https://github.com/bentoml/BentoML) allows you to deploy a large language model (LLM) server with vLLM as the backend, which exposes OpenAI-compatible endpoints. You can serve the model locally or containerize it as an OCI-compliant image and deploy it on Kubernetes.


@@ -1,6 +1,4 @@
----
-title: Cerebrium
----
+# Cerebrium
 <p align="center">
 <img src="https://i.ibb.co/hHcScTT/Screenshot-2024-06-13-at-10-14-54.png" alt="vLLM_plus_cerebrium"/>


@@ -1,6 +1,4 @@
----
-title: Chatbox
----
+# Chatbox
 [Chatbox](https://github.com/chatboxai/chatbox) is a desktop client for LLMs, available on Windows, Mac, and Linux.


@@ -1,6 +1,4 @@
----
-title: Dify
----
+# Dify
 [Dify](https://github.com/langgenius/dify) is an open-source LLM app development platform. Its intuitive interface combines agentic AI workflows, RAG pipelines, agent capabilities, model management, observability features, and more, allowing you to quickly move from prototype to production.


@@ -1,6 +1,4 @@
----
-title: dstack
----
+# dstack
 <p align="center">
 <img src="https://i.ibb.co/71kx6hW/vllm-dstack.png" alt="vLLM_plus_dstack"/>


@@ -1,6 +1,4 @@
----
-title: Haystack
----
+# Haystack


@@ -1,6 +1,4 @@
----
-title: Helm
----
+# Helm
 A Helm chart to deploy vLLM on Kubernetes.


@@ -1,6 +1,4 @@
----
-title: LiteLLM
----
+# LiteLLM
 [LiteLLM](https://github.com/BerriAI/litellm) lets you call all LLM APIs using the OpenAI format (Bedrock, Hugging Face, VertexAI, TogetherAI, Azure, OpenAI, Groq, etc.).


@@ -1,6 +1,4 @@
----
-title: Lobe Chat
----
+# Lobe Chat
 [Lobe Chat](https://github.com/lobehub/lobe-chat) is an open-source, modern-design ChatGPT/LLM UI and framework.


@@ -1,6 +1,4 @@
----
-title: LWS
----
+# LWS
 LeaderWorkerSet (LWS) is a Kubernetes API that aims to address common deployment patterns of AI/ML inference workloads.
 A major use case is multi-host/multi-node distributed inference.


@@ -1,6 +1,4 @@
----
-title: Modal
----
+# Modal
 vLLM can be run on cloud GPUs with [Modal](https://modal.com), a serverless computing platform designed for fast auto-scaling.


@@ -1,6 +1,4 @@
----
-title: Open WebUI
----
+# Open WebUI
 1. Install [Docker](https://docs.docker.com/engine/install/)


@@ -1,6 +1,4 @@
----
-title: Retrieval-Augmented Generation
----
+# Retrieval-Augmented Generation
 [Retrieval-augmented generation (RAG)](https://en.wikipedia.org/wiki/Retrieval-augmented_generation) is a technique that enables generative artificial intelligence (Gen AI) models to retrieve and incorporate new information. It modifies interactions with a large language model (LLM) so that the model responds to user queries with reference to a specified set of documents, using this information to supplement its pre-existing training data. This allows LLMs to use domain-specific and/or updated information. Use cases include providing chatbot access to internal company data or generating responses based on authoritative sources.


@@ -1,6 +1,4 @@
----
-title: SkyPilot
----
+# SkyPilot
 <p align="center">
 <img src="https://imgur.com/yxtzPEu.png" alt="vLLM"/>


@@ -1,6 +1,4 @@
----
-title: Streamlit
----
+# Streamlit
 [Streamlit](https://github.com/streamlit/streamlit) lets you transform Python scripts into interactive web apps in minutes, instead of weeks. Build dashboards, generate reports, or create chat apps.


@@ -1,5 +1,3 @@
----
-title: NVIDIA Triton
----
+# NVIDIA Triton
 The [Triton Inference Server](https://github.com/triton-inference-server) hosts a tutorial demonstrating how to quickly deploy a simple [facebook/opt-125m](https://huggingface.co/facebook/opt-125m) model using vLLM. Please see [Deploying a vLLM model in Triton](https://github.com/triton-inference-server/tutorials/blob/main/Quick_Deploy/vLLM/README.md#deploying-a-vllm-model-in-triton) for more details.


@@ -1,6 +1,4 @@
----
-title: KServe
----
+# KServe
 vLLM can be deployed with [KServe](https://github.com/kserve/kserve) on Kubernetes for highly scalable distributed model serving.


@@ -1,6 +1,4 @@
----
-title: KubeAI
----
+# KubeAI
 [KubeAI](https://github.com/substratusai/kubeai) is a Kubernetes operator that enables you to deploy and manage AI models on Kubernetes. It provides a simple and scalable way to deploy vLLM in production. Functionality such as scale-from-zero, load-based autoscaling, model caching, and much more is provided out of the box with zero external dependencies.


@@ -1,6 +1,4 @@
----
-title: Llama Stack
----
+# Llama Stack
 vLLM is also available via [Llama Stack](https://github.com/meta-llama/llama-stack).


@@ -1,6 +1,4 @@
----
-title: llmaz
----
+# llmaz
 [llmaz](https://github.com/InftyAI/llmaz) is an easy-to-use and advanced inference platform for large language models on Kubernetes, aimed at production use. It uses vLLM as the default model serving backend.


@@ -1,6 +1,4 @@
----
-title: Production stack
----
+# Production stack
 Deploying vLLM on Kubernetes is a scalable and efficient way to serve machine learning models. This guide walks you through deploying vLLM using the [vLLM production stack](https://github.com/vllm-project/production-stack). Born out of a Berkeley-UChicago collaboration, the production stack is an officially released, production-optimized codebase under the [vLLM project](https://github.com/vllm-project), designed for LLM deployment with:


@@ -1,6 +1,4 @@
----
-title: Using Kubernetes
----
+# Using Kubernetes
 Deploying vLLM on Kubernetes is a scalable and efficient way to serve machine learning models. This guide walks you through deploying vLLM using native Kubernetes.


@@ -1,6 +1,4 @@
----
-title: Using Nginx
----
+# Using Nginx
 This document shows how to launch multiple vLLM serving containers and use Nginx as a load balancer between the servers.