Are you building Large Language Model (LLM) applications and struggling with slow inference speeds or memory limitations? vLLM is the solution top API and backend engineers are adopting to accelerate LLM serving, handle high concurrency, and reduce infrastructure costs. This hands-on guide explains what vLLM is, how it works, how to install it, and how to use it for both batch and real-time API inference, so your team can deliver fast, reliable AI features at scale.
What is vLLM? Why Does It Matter for LLM APIs?
vLLM is an open-source, high-throughput, memory-efficient inference engine designed for serving large language models. Originally developed at UC Berkeley's Sky Computing Lab and now driven by a broad open-source community, it tackles two of the biggest challenges facing LLM deployments:
- Slow inference speeds: Especially with many concurrent users or large batch jobs.
- High memory usage: Traditional attention mechanisms waste GPU memory, limiting the size and number of models you can serve.
vLLM's core innovations:
- PagedAttention: Optimizes the key-value (KV) cache, drastically reducing memory waste by using a virtual memory-like paging system.
- Continuous batching: Dynamically batches requests as they arrive, maximizing GPU utilization and minimizing user wait times.
Think of vLLM as a turbocharged backend engine for LLM APIs, especially for developers who need scalable, production-ready inference.
Why API Developers and Backend Engineers Prefer vLLM
vLLM is quickly becoming the go-to LLM inference engine for technical teams because it delivers:
- State-of-the-art throughput: Serve more user requests per second and process large datasets faster.
- Efficient GPU usage: Fit bigger models on your GPUs, or reduce hardware costs for existing workloads.
- Dynamic batching: No fixed batch windows; vLLM adapts to real traffic, keeping your GPUs busy.
- OpenAI-compatible API: Seamlessly replace or supplement OpenAI endpoints with your own, self-hosted models.
- Simple, flexible APIs: Both for offline batch jobs and live serving.
- Broad model support: Llama, Mistral, Qwen, OPT, Falcon, and more from Hugging Face and ModelScope.
- Active open-source development: Frequent updates, growing community, and cutting-edge features.
See the full list of supported models in the vLLM documentation.
Tip: If you're building or testing LLM-powered APIs, consider integrating with Apidog. Apidog makes it easy to design, test, and document your LLM endpoints—whether you're using vLLM, OpenAI, or custom backends—helping teams streamline API collaboration and QA.
Supported LLMs: Which Models Work with vLLM?
vLLM natively supports a wide range of transformer-based models, including:
- Llama series: Llama, Llama 2, Llama 3
- Mistral and Mixtral
- Qwen and Qwen2
- GPT-2, GPT-J, GPT-NeoX
- OPT
- Bloom
- Falcon
- MPT
- And more, including multi-modal models
The list is growing. For the most current compatibility, check the official vLLM Supported Models List.
Note: If your model is not listed but shares architecture with a supported one, it may still work—test carefully. Custom architectures may require contributing code upstream.
Key Concepts: PagedAttention and Continuous Batching
Understanding these two concepts will help you optimize your LLM deployments:
PagedAttention
- Problem: Traditional attention uses contiguous memory for the KV cache, causing fragmentation and wasted GPU VRAM.
- Solution: PagedAttention breaks the KV cache into fixed-size blocks ("pages"), much like virtual memory pages in an operating system. The vLLM authors report this cuts KV-cache waste from the 60-80% typical of contiguous allocation to under 4%, and it enables memory sharing for common sequence prefixes. A toy sketch of the idea follows.
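To make the paging idea concrete, here is a toy Python sketch (conceptual only, not vLLM's actual internals): each sequence keeps a small block table mapping logical token positions to physical KV-cache blocks, so memory is taken on demand from a shared pool instead of being reserved contiguously up front.

# Toy illustration of paged KV-cache bookkeeping (not vLLM internals)
BLOCK_SIZE = 16                      # tokens stored per physical KV block
free_blocks = list(range(64))        # shared pool of physical block ids
block_tables = {}                    # sequence id -> list of physical block ids

def append_token(seq_id, token_index):
    """Allocate a new physical block only when the sequence crosses a block boundary."""
    table = block_tables.setdefault(seq_id, [])
    if token_index % BLOCK_SIZE == 0:        # all existing blocks are full
        table.append(free_blocks.pop())      # any free block works; no contiguity needed
    return table[token_index // BLOCK_SIZE]  # physical block holding this token's KV entry

for i in range(40):                  # a 40-token sequence occupies only 3 blocks
    append_token("seq-0", i)
print(block_tables["seq-0"])         # three non-contiguous block ids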
Continuous Batching
- Problem: Static batching (waiting for a full batch before starting) leads to idle GPU time and high latency.
- Solution: Continuous batching slots new requests into the running batch as soon as GPU capacity frees up, maximizing throughput and minimizing user wait times. A toy scheduler sketch follows.
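The toy scheduler below (again, a conceptual sketch rather than vLLM's real scheduler) shows the key behavior: finished requests leave the batch immediately and waiting requests are admitted the moment a slot opens, so the GPU never idles waiting for a "full" batch.

# Toy continuous-batching loop (conceptual only)
from collections import deque

waiting = deque([("req-1", 2), ("req-2", 4), ("req-3", 3)])  # (request id, tokens left to generate)
running = {}                                                 # request id -> tokens remaining
MAX_RUNNING = 2                                              # pretend the GPU fits two requests

step = 0
while waiting or running:
    while waiting and len(running) < MAX_RUNNING:            # admit work as soon as a slot opens
        req_id, remaining = waiting.popleft()
        running[req_id] = remaining
    for req_id in list(running):                             # one decode step: one token per request
        running[req_id] -= 1
        if running[req_id] == 0:
            del running[req_id]                              # finished requests exit immediately
    step += 1
    print(f"step {step}: running={sorted(running)} waiting={[r for r, _ in waiting]}")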
These optimizations are why vLLM outperforms many other LLM serving frameworks.
Prerequisites: What You Need Before Installing vLLM
Before you get started, make sure your environment meets these requirements (a quick sanity-check script follows the list):
- Operating System: Linux recommended (WSL2 and macOS possible, but Linux is best supported).
- Python: 3.9, 3.10, 3.11, or 3.12. Use a virtual environment.
- NVIDIA GPU with CUDA: For best performance. (vLLM relies on CUDA; CPU-only and other accelerators have limited or experimental support.)
- PyTorch: vLLM installs a compatible version automatically, but you may pre-install it for custom CUDA versions.
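If you want a quick pre-flight check, the short script below (a minimal sketch; the exact output depends on your driver and hardware) verifies your Python version and that the NVIDIA driver is visible:

# Quick environment sanity check before installing vLLM
import shutil, subprocess, sys

print(sys.version)                        # should report Python 3.9 - 3.12
if shutil.which("nvidia-smi"):
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,memory.total", "--format=csv"],
        capture_output=True, text=True,
    )
    print(result.stdout)                  # GPU name and VRAM, if the driver is installed
else:
    print("nvidia-smi not found - check your NVIDIA driver installation")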
How to Install vLLM: Step-by-Step
1. Using pip (Recommended)
python -m venv vllm-env
source vllm-env/bin/activate
# On Windows: vllm-env\Scripts\activate
pip install vllm
This installs vLLM and its dependencies (including PyTorch).
2. Using Conda
conda create -n vllm-env python=3.11 -y
conda activate vllm-env
pip install vllm
Tip: For custom CUDA versions, install PyTorch with conda first, then vLLM.
3. Using uv (for super-fast installs)
uv venv vllm-env --python 3.12 --seed
source vllm-env/bin/activate
uv pip install vllm
4. Verify Installation
python -c "import vllm; print(vllm.__version__)"
vllm --help
You should see the installed version and command-line help.
Offline Batch Inference with vLLM
Batch inference is ideal for running predictions on a list of prompts—great for evaluation, dataset generation, or bulk processing.
Example: Batch Inference Script
from vllm import LLM, SamplingParams

# 1. Define prompts
prompts = [
    "The capital of France is",
    "Explain the theory of relativity in simple terms:",
    "Write a short poem about a rainy day:",
    "Translate 'Hello, world!' to German:",
]

# 2. Set sampling parameters
sampling_params = SamplingParams(
    temperature=0.7,
    top_p=0.95,
    max_tokens=150,
    stop=["\n", " Human:", " Assistant:"]
)

# 3. Initialize the vLLM engine (choose a model your GPU can handle)
llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.1")

# 4. Generate outputs
outputs = llm.generate(prompts, sampling_params)

# 5. Display results
for output in outputs:
    print("-" * 20)
    print(f"Prompt: {output.prompt!r}")
    print(f"Generated Text: {output.outputs[0].text!r}")
    print("-" * 20)
Tips:
- vLLM defaults to Hugging Face Hub models. To load from ModelScope instead, set the environment variable VLLM_USE_MODELSCOPE=1.
- To override a model's generation config with vLLM's defaults, pass generation_config="vllm" to the LLM constructor.
- For quantized models (AWQ, GPTQ, etc.), check the vLLM documentation and the Hugging Face model cards; a short example follows.
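For example, loading a quantized checkpoint while forcing vLLM's own generation defaults might look like the sketch below (the model name and quantization mode are placeholders; they must match the checkpoint you actually use, and the generation_config option depends on your vLLM version):

from vllm import LLM

llm = LLM(
    model="TheBloke/Mistral-7B-Instruct-v0.1-AWQ",  # example AWQ checkpoint; substitute your own
    quantization="awq",              # must match how the checkpoint was quantized
    generation_config="vllm",        # ignore the model's bundled generation defaults
)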
Running vLLM as an OpenAI-Compatible API Server
Want to serve LLMs via an OpenAI-like API? vLLM makes it easy to swap endpoints, test new models, and integrate with API tools like Apidog for seamless design, mock, and QA workflows.
Start the vLLM Server
source vllm-env/bin/activate
vllm serve mistralai/Mistral-7B-Instruct-v0.1
# Or, for another model:
# vllm serve Qwen/Qwen2-1.5B-Instruct
Key options:
- <model_name_or_path>: The model to serve (passed as the positional argument to vllm serve, as shown above)
- --host 0.0.0.0: Bind to all interfaces (for remote access)
- --port 8000: Specify the port
- --tensor-parallel-size <N>: Distribute the model across N GPUs
- --api-key <key>: Require an API key for requests (useful in production)
- --generation-config vllm: Use vLLM's default generation parameters
- --chat-template <path>: Custom chat template (for advanced usage)
Server runs at http://localhost:8000 by default.
Using the Completions API Endpoint
cURL Example:
curl http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "mistralai/Mistral-7B-Instruct-v0.1",
"prompt": "San Francisco is a city in",
"max_tokens": 50,
"temperature": 0.7
}'
Python Example (OpenAI Client):
from openai import OpenAI

client = OpenAI(
    api_key="EMPTY",  # Or your API key if set
    base_url="http://localhost:8000/v1"
)

completion = client.completions.create(
    model="mistralai/Mistral-7B-Instruct-v0.1",
    prompt="Explain the benefits of using vLLM:",
    max_tokens=150,
    temperature=0.5
)

print(completion.choices[0].text)
Using the Chat Completions API Endpoint
cURL Example:
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "mistralai/Mistral-7B-Instruct-v0.1",
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "What is the main advantage of PagedAttention in vLLM?"}
],
"max_tokens": 100,
"temperature": 0.7
}'
Python Example:
chat_response = client.chat.completions.create(
    model="mistralai/Mistral-7B-Instruct-v0.1",
    messages=[
        {"role": "system", "content": "You are a helpful programming assistant."},
        {"role": "user", "content": "Write a simple Python function to calculate factorial."}
    ],
    max_tokens=200,
    temperature=0.5
)

print(chat_response.choices[0].message.content)
With Apidog, you can quickly design, mock, and test these API endpoints, ensuring smooth integration and automated QA for your LLM-powered products.
vLLM Attention Backends: FlashAttention, xFormers, and FlashInfer
vLLM supports multiple attention computation backends for optimal speed and memory efficiency:
- FlashAttention (1 & 2): Fastest for most modern NVIDIA GPUs, minimizes memory usage.
- xFormers: Broad compatibility; good fallback for older or less common hardware.
- FlashInfer: Advanced, recently added; requires manual install.
Automatic selection: vLLM chooses the best backend for your hardware and model by default.
Manual override: Set the environment variable VLLM_ATTENTION_BACKEND to FLASH_ATTN, XFORMERS, or FLASHINFER before running vLLM if you wish to force a backend.
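For example, in Python you can set the variable before vLLM initializes (a minimal sketch; FLASH_ATTN is just one of the valid values):

import os
os.environ["VLLM_ATTENTION_BACKEND"] = "FLASH_ATTN"   # or "XFORMERS" / "FLASHINFER"

from vllm import LLM
llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.1")  # backend choice is picked up at engine startup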
Troubleshooting Common vLLM Issues
1. CUDA Out of Memory Errors
- Try a smaller model (e.g., OPT-1.3B)
- Reduce concurrent requests or batch size
- Use quantized models (AWQ, GPTQ, etc.)
- Distribute the model across multiple GPUs (--tensor-parallel-size); a short example of these memory settings follows this list
- Check for other GPU processes with nvidia-smi
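A minimal sketch combining several of these ideas, plus two related engine arguments (gpu_memory_utilization and max_model_len, which are not mentioned above but also reduce memory pressure); the values are illustrative and should be tuned for your GPU:

from vllm import LLM

llm = LLM(
    model="facebook/opt-1.3b",        # smaller model as a fallback
    gpu_memory_utilization=0.85,      # leave headroom for other GPU processes
    max_model_len=2048,               # cap context length to shrink the KV cache
    tensor_parallel_size=1,           # raise this to shard a larger model across GPUs
)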
2. Installation & Compatibility Problems
- Ensure CUDA, PyTorch, and NVIDIA drivers are compatible (see PyTorch compatibility matrix)
- Pre-install PyTorch if needed
- Use official vLLM Docker images for hassle-free setup
3. Model Loading Failures
- Double-check the model name (e.g., mistralai/Mistral-7B-Instruct-v0.1)
- Use trust_remote_code=True if the model requires it
- Use local paths for pre-downloaded models (a hedged example follows this list)
- Check disk space and internet connectivity
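A hedged example of those loading options (the local paths below are hypothetical placeholders):

from vllm import LLM

llm = LLM(
    model="/path/to/local/checkpoint",   # or a Hugging Face repo id such as mistralai/Mistral-7B-Instruct-v0.1
    trust_remote_code=True,              # only if the model repo ships custom code
    download_dir="/data/hf-cache",       # hypothetical cache directory with enough free disk space
)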
4. Slow Inference
- Monitor GPU utilization (nvidia-smi)
- Update vLLM, dependencies, and drivers
- Experiment with different attention backends
- Adjust sampling parameters (max_tokens, etc.)
5. Unexpected or Nonsensical Output
- Ensure correct prompt formatting (see the model card); a chat-template sketch follows this list
- Tune sampling parameters (temperature, top_p)
- Try a different model to isolate the issue
- Check the chat template used on the server
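To check prompt formatting offline, you can render the model's own chat template before calling llm.generate (a sketch assuming the transformers library, which vLLM installs as a dependency):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.1")
prompt = tok.apply_chat_template(
    [{"role": "user", "content": "Explain PagedAttention briefly."}],
    tokenize=False,
    add_generation_prompt=True,   # appends the assistant turn marker the model expects
)
print(prompt)                     # pass this formatted string to llm.generate(...)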
Next Steps: Level Up Your LLM API Workflow
With vLLM, you can deploy and scale LLM-powered APIs faster—and with Apidog, you gain a complete toolkit for API design, testing, and documentation. This combination empowers teams to:
- Develop, mock, and test LLM endpoints with real-world traffic patterns
- Automate QA for both vLLM and OpenAI-compatible APIs
- Collaborate across teams with clear, up-to-date API docs
Explore vLLM's advanced features (quantization, multi-LoRA, distributed serving, speculative decoding) in the official documentation, and boost your LLM development lifecycle with Apidog for seamless API management.