What is vLLM? How to Install and Use vLLM, Explained

Mark Ponomarev

14 April 2025

Welcome to the comprehensive guide on vLLM! If you're involved in the world of Large Language Models (LLMs), you've likely encountered the challenges of inference speed and throughput. Serving these massive models efficiently can be a bottleneck. This is where vLLM steps in as a game-changing solution. This tutorial will walk you through everything you need to know as a beginner: what vLLM is, why it's important, how to install it, and how to use it for both offline batch processing and online serving.

What Exactly is vLLM?

At its core, vLLM is a high-throughput and memory-efficient library specifically designed for Large Language Model (LLM) inference and serving. Developed by researchers and engineers aiming to overcome the performance limitations of existing serving systems, vLLM significantly speeds up the process of getting predictions (inference) from LLMs.

Traditional methods for LLM inference often struggle with managing the large memory footprint of the model's attention mechanism (specifically, the KV cache) and efficiently batching incoming requests. vLLM introduces novel techniques, most notably PagedAttention, to address these challenges head-on. It allows for much higher throughput (more requests processed per second) and can serve models faster and more cost-effectively compared to many standard Hugging Face Transformers implementations or other serving frameworks when dealing with concurrent requests.
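
To get a feel for why KV-cache management matters so much, here is a rough back-of-the-envelope calculation (a sketch assuming a hypothetical Llama-style 7B configuration with 32 layers, 32 KV heads, a head dimension of 128, and fp16 precision; models that use grouped-query attention store proportionally less):

# Rough KV-cache size for an assumed Llama-style 7B configuration (illustrative only)
num_layers = 32
num_kv_heads = 32
head_dim = 128
bytes_per_value = 2  # fp16

# Each token stores one key vector and one value vector per layer
kv_bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
print(f"KV cache per token: {kv_bytes_per_token / 1024:.0f} KiB")  # ~512 KiB

seq_len = 2048
print(f"KV cache per 2,048-token sequence: {kv_bytes_per_token * seq_len / 1024**3:.2f} GiB")  # ~1 GiB

Fragmentation and over-allocation of this per-sequence cache are exactly what PagedAttention is designed to avoid.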

Think of vLLM as a highly optimized engine for running pre-trained LLMs. You provide the model and the prompts, and vLLM handles the complex task of generating text quickly and efficiently, whether for a single large batch of prompts or for many simultaneous users interacting with a deployed model.

Why Choose vLLM for LLM Inference?

Several compelling reasons make vLLM a preferred choice for developers and organizations working with LLMs:

  1. State-of-the-Art Performance: vLLM delivers significantly higher throughput compared to many baseline implementations. This means you can handle more user requests simultaneously or process large datasets faster with the same hardware.
  2. Efficient Memory Management: The core innovation, PagedAttention, drastically reduces memory waste by managing the KV cache more effectively. This allows you to fit larger models onto your GPUs or serve existing models with less memory overhead, potentially reducing hardware costs.
  3. Continuous Batching: Unlike static batching (where the server waits for a full batch before processing), vLLM uses continuous batching. It processes requests dynamically as they arrive, significantly improving GPU utilization and reducing latency for individual requests.
  4. OpenAI-Compatible Server: vLLM includes a built-in server that mimics the OpenAI API. This makes it incredibly easy to use vLLM as a drop-in replacement for applications already built using the OpenAI Python client or compatible tools. You can often switch your endpoint URL and API key, and your existing code will work with your self-hosted vLLM instance.
  5. Ease of Use: Despite its sophisticated internals, vLLM offers a relatively simple API for both offline inference (LLM class) and online serving (vllm serve command).
  6. Broad Model Compatibility: vLLM supports a wide range of popular open-source LLMs available on the Hugging Face Hub (and potentially ModelScope).
  7. Active Development and Community: vLLM is an actively maintained open-source project with a growing community, ensuring ongoing improvements, bug fixes, and support for new models and features.
  8. Optimized Kernels: vLLM utilizes highly optimized CUDA kernels for various operations, further boosting performance on NVIDIA GPUs.

If speed, efficiency, and scalability are crucial for your LLM application, vLLM is a technology you should seriously consider.

What Models Does vLLM Support?

vLLM supports a wide array of popular transformer-based models hosted on the Hugging Face Hub. This includes many variants of model families such as Llama, Mistral, Mixtral, Qwen, Gemma, Phi, Falcon, MPT, OPT, and GPT-2/GPT-NeoX, among many others.

The list is constantly growing. For the most up-to-date and comprehensive list of officially supported models, always refer to the official vLLM documentation:

vLLM Supported Models List

If a model isn't explicitly listed, it might still work if its architecture is similar to a supported one, but compatibility isn't guaranteed without official support or testing. Adding new model architectures usually requires code contributions to the vLLM project.

Key vLLM Terminology

While vLLM is easy to use on the surface, understanding a couple of its core concepts helps you appreciate why it's so effective:

  1. PagedAttention: Inspired by virtual memory and paging in operating systems, PagedAttention stores the KV cache in fixed-size blocks rather than one large contiguous buffer. A block table maps each sequence's logical token positions to physical blocks, which nearly eliminates memory fragmentation and allows blocks to be shared across sequences (for example, when many requests share the same prompt prefix).
  2. Continuous Batching: Instead of waiting for a fixed-size batch to fill up, the vLLM scheduler adds new requests to the running batch and retires finished ones at every generation step, keeping the GPU busy and reducing per-request latency.

These two techniques work together synergistically to give vLLM its impressive performance characteristics.
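
The snippet below is a purely conceptual illustration of the block-table idea behind PagedAttention; the block size, class, and data layout are hypothetical and do not reflect vLLM's actual internals:

# Conceptual sketch of a paged KV cache (NOT vLLM's real implementation)
BLOCK_SIZE = 16  # tokens per block (hypothetical)

class PagedKVCache:
    def __init__(self, num_blocks: int):
        # Pool of physical blocks; each would hold keys/values for BLOCK_SIZE tokens
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}  # sequence id -> list of physical block ids

    def append_token(self, seq_id: int, token_index: int) -> int:
        """Return the physical block storing this token, allocating only at block boundaries."""
        table = self.block_tables.setdefault(seq_id, [])
        logical_block = token_index // BLOCK_SIZE
        if logical_block == len(table):           # a new block is needed only every BLOCK_SIZE tokens
            table.append(self.free_blocks.pop())  # any free physical block will do
        return table[logical_block]

cache = PagedKVCache(num_blocks=1024)
for t in range(40):                 # a 40-token sequence needs ceil(40 / 16) = 3 blocks
    cache.append_token(seq_id=0, token_index=t)
print(cache.block_tables[0])        # three physical block ids, which need not be contiguous

Because memory is reserved one small block at a time rather than for the maximum possible sequence length up front, very little of it sits idle.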

Prerequisites: What to Check Before Getting Started with vLLM

Before you can install and run vLLM, ensure your system meets the following requirements:

  1. Operating System: Linux is the primary supported OS. While community efforts might exist for other OSes (like WSL2 on Windows or macOS), Linux provides the most straightforward and officially supported experience.
  2. Python Version: vLLM requires Python 3.9, 3.10, 3.11, or 3.12. It's highly recommended to use a virtual environment to manage your Python dependencies.
  3. NVIDIA GPU with CUDA: For optimal performance and access to the core features, you need an NVIDIA GPU with a compute capability supported by PyTorch and the necessary CUDA toolkit installed. vLLM heavily relies on CUDA for its optimized kernels. While CPU-only inference and support for other accelerators (like AMD GPUs or AWS Inferentia/Trainium) are available or under development, the primary path involves NVIDIA hardware. Check the official PyTorch website for CUDA compatibility with your specific GPU driver version.
  4. PyTorch: vLLM is built on PyTorch. The installation process usually handles installing a compatible version, but if you encounter issues, ensure you have a working PyTorch installation that matches your CUDA version (a quick check is shown below).
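
A quick way to confirm that PyTorch can see your GPU before installing vLLM (a minimal check, assuming PyTorch is already installed in the active environment):

python -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.is_available())"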

Step-by-Step Guide to Install vLLM

The recommended way to install vLLM is using a package manager within a virtual environment. This prevents conflicts with other Python projects. Here are the steps using popular tools:

Using pip with vLLM

pip is the standard Python package installer.

Create and Activate a Virtual Environment (Recommended):

python -m venv vllm-env
source vllm-env/bin/activate
# On Windows use: vllm-env\Scripts\activate

Install vLLM:

pip install vllm

This command will download and install the latest stable version of vLLM and its core dependencies, including a compatible version of PyTorch for your detected CUDA setup (if possible).

Using Conda with vLLM

Conda is another popular environment and package manager, especially in the data science community.

Create and Activate a Conda Environment:

conda create -n vllm-env python=3.11 -y # Or use 3.9, 3.10, 3.12
conda activate vllm-env

Install vLLM using pip within Conda: It's generally recommended to use pip to install vLLM even within a Conda environment to ensure you get the latest compatible build easily.

pip install vllm

You might need to install PyTorch separately via Conda first if you prefer managing it that way, ensuring compatibility: conda install pytorch torchvision torchaudio pytorch-cuda=11.8 -c pytorch -c nvidia (adjust CUDA version as needed). Then run pip install vllm.

Using uv with vLLM

uv is a newer, extremely fast Python package installer and resolver.

Install uv (if you haven't already): Follow instructions on the official uv documentation.

Create and Activate an Environment using uv:

uv venv vllm-env --python 3.12 --seed # Or use 3.9, 3.10, 3.11
source vllm-env/bin/activate
# On Windows use: vllm-env\Scripts\activate

Install vLLM using uv:

uv pip install vllm

Verifying Your vLLM Installation

After installation, you can quickly verify it by trying to import vLLM in a Python interpreter or running a basic command:

# Activate your virtual environment first (e.g., source vllm-env/bin/activate)
python -c "import vllm; print(vllm.__version__)"

This should print the installed vLLM version without errors.

Alternatively, try the help command for the server (requires successful installation):

vllm --help

Performing Offline Batch Inference with vLLM

Offline batch inference refers to generating text for a predefined list of input prompts all at once. This is useful for tasks like evaluating a model, generating responses for a dataset, or pre-computing results. vLLM makes this efficient using its LLM class.

Understanding vLLM's LLM Class

The vllm.LLM class is the main entry point for offline inference. You initialize it by specifying the model you want to use.

from vllm import LLM

# Initialize the LLM engine with a model from Hugging Face Hub
# Make sure you have enough GPU memory for the chosen model!
# Example: Using a smaller model like OPT-125m
llm = LLM(model="facebook/opt-125m")

# Example: Using a larger model like Llama-3-8B-Instruct (requires significant GPU memory)
# llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct")

print("vLLM engine initialized.")

By default, vLLM downloads models from the Hugging Face Hub. If your model is hosted on ModelScope, you need to set the environment variable VLLM_USE_MODELSCOPE=1 before running your Python script.
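
For example, to pull models from ModelScope instead of the Hugging Face Hub (your_offline_script.py is just a placeholder name):

export VLLM_USE_MODELSCOPE=1
python your_offline_script.py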

Configuring vLLM Sampling Parameters

To control how the text is generated, you use the vllm.SamplingParams class. This allows you to set parameters such as temperature (randomness of sampling), top_p and top_k (nucleus and top-k sampling cutoffs), max_tokens (maximum length of the generated text), stop (strings that end generation when produced), and n (number of output sequences per prompt).

from vllm import SamplingParams

# Define sampling parameters
# If not specified, vLLM might use defaults from the model's generation_config.json
sampling_params = SamplingParams(
    temperature=0.7,
    top_p=0.9,
    max_tokens=100, # Limit the length of the generated text
    stop=["\\\\n", " Human:", " Assistant:"] # Stop generation if these tokens appear
)

print("Sampling parameters configured.")

Important Note: By default, vLLM attempts to load and use settings from the generation_config.json file associated with the model on Hugging Face Hub. If you want to ignore this and use vLLM's default sampling parameters unless overridden by your SamplingParams object, initialize the LLM class like this: llm = LLM(model="...", generation_config="vllm"). If you do provide a SamplingParams object to the generate method, those parameters will always take precedence over both the model's config and vLLM's defaults.
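
As a minimal sketch of that precedence, reusing the small model from earlier (the generation_config="vllm" option is exactly the one described in the note above):

from vllm import LLM, SamplingParams

# Ignore the model's generation_config.json defaults and fall back to vLLM's own defaults
llm = LLM(model="facebook/opt-125m", generation_config="vllm")

# Explicit SamplingParams always take precedence over both sets of defaults
outputs = llm.generate(["The capital of France is"], SamplingParams(temperature=0.0, max_tokens=16))
print(outputs[0].outputs[0].text)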

Running Your First vLLM Batch Job

Now, let's combine the LLM object, SamplingParams, and a list of prompts to generate text.

from vllm import LLM, SamplingParams

# 1. Define your input prompts
prompts = [
    "The capital of France is",
    "Explain the theory of relativity in simple terms:",
    "Write a short poem about a rainy day:",
    "Translate 'Hello, world!' to German:",
]

# 2. Configure sampling parameters
sampling_params = SamplingParams(temperature=0.7, top_p=0.95, max_tokens=150)

# 3. Initialize the vLLM engine (use a model suitable for your hardware)
try:
    # Using a relatively small, capable model
    llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.1")
    # Or for smaller GPUs:
    # llm = LLM(model="facebook/opt-1.3b")
    # llm = LLM(model="facebook/opt-125m")
except Exception as e:
    print(f"Error initializing LLM: {e}")
    print("Please ensure you have enough GPU memory and CUDA is set up correctly.")
    exit()

# 4. Generate text for the prompts
# The generate method takes the list of prompts and sampling parameters
print("Generating responses...")
outputs = llm.generate(prompts, sampling_params)
print("Generation complete.")

# 5. Print the results
# The output is a list of RequestOutput objects
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text # Get the text from the first generated sequence
    print("-" * 20)
    print(f"Prompt: {prompt!r}")
    print(f"Generated Text: {generated_text!r}")
    print("-" * 20)

This script initializes vLLM, defines prompts and parameters, runs the generation process efficiently in a batch, and then prints the output for each prompt. The llm.generate() call handles the complexities of batching and GPU execution internally.

Setting Up the vLLM OpenAI-Compatible Server

One of vLLM's most powerful features is its ability to act as a high-performance backend server that speaks the same language as the OpenAI API. This allows you to easily host your own open-source models and integrate them into applications designed for OpenAI.

Launching the vLLM Server

Starting the server is straightforward using the vllm serve command in your terminal.

Activate your virtual environment where vLLM is installed.

source vllm-env/bin/activate

Run the vllm serve command: You need to specify the model you want to serve.

# Example using Mistral-7B-Instruct
vllm serve mistralai/Mistral-7B-Instruct-v0.1

# Example using a smaller model like Qwen2-1.5B-Instruct
# vllm serve Qwen/Qwen2-1.5B-Instruct

This command will:

  1. Download the specified model from the Hugging Face Hub (if it isn't already cached locally).
  2. Load the model weights onto your GPU(s).
  3. Start an OpenAI-compatible API server, listening on port 8000 by default (the examples below use http://localhost:8000).

Common Options (see the example below):

  1. --host / --port: Network interface and port the server listens on.
  2. --api-key: Require clients to send this key with each request.
  3. --tensor-parallel-size: Split the model across multiple GPUs.
  4. --gpu-memory-utilization: Fraction of GPU memory vLLM may use (e.g., 0.9).
  5. --max-model-len: Maximum context length to support, which also bounds KV-cache memory.
  6. --dtype: Numerical precision (e.g., auto, half, bfloat16).
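
For example, a sketch combining a few of these options (the values are purely illustrative, not recommendations):

vllm serve mistralai/Mistral-7B-Instruct-v0.1 \
    --port 8000 \
    --gpu-memory-utilization 0.90 \
    --max-model-len 8192 \
    --api-key my-secret-key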

The server will output logs indicating it's running and ready to accept requests.

Interacting with the vLLM Server: Completions API

Once the server is running, you can send requests to its /v1/completions endpoint, just like you would with OpenAI's older completions API.

Using curl:

curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "mistralai/Mistral-7B-Instruct-v0.1",
        "prompt": "San Francisco is a city in",
        "max_tokens": 50,
        "temperature": 0.7
    }'

(Replace "mistralai/Mistral-7B-Instruct-v0.1" with the actual model you are serving)

Using the openai Python Library:

from openai import OpenAI

# Point the client to your vLLM server endpoint
client = OpenAI(
    api_key="EMPTY", # Use "EMPTY" or your actual key if you set one with --api-key
    base_url="<http://localhost:8000/v1>"
)

print("Sending request to vLLM server (Completions)...")

try:
    completion = client.completions.create(
        model="mistralai/Mistral-7B-Instruct-v0.1", # Model name must match the one served
        prompt="Explain the benefits of using vLLM:",
        max_tokens=150,
        temperature=0.5
    )

    print("Response:")
    print(completion.choices[0].text)

except Exception as e:
    print(f"An error occurred: {e}")

(Remember to replace the model name if you are serving a different one)

Interacting with the vLLM Server: Chat Completions API

vLLM also supports the more modern /v1/chat/completions endpoint, suitable for conversational models and structured message formats (system, user, assistant roles).

Using curl:

curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "mistralai/Mistral-7B-Instruct-v0.1",
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "What is the main advantage of PagedAttention in vLLM?"}
        ],
        "max_tokens": 100,
        "temperature": 0.7
    }'

(Replace model name as needed)

Using the openai Python Library:

from openai import OpenAI

# Point the client to your vLLM server endpoint
client = OpenAI(
    api_key="EMPTY", # Use "EMPTY" or your actual key
    base_url="<http://localhost:8000/v1>"
)

print("Sending request to vLLM server (Chat Completions)...")

try:
    chat_response = client.chat.completions.create(
        model="mistralai/Mistral-7B-Instruct-v0.1", # Model name must match the one served
        messages=[
            {"role": "system", "content": "You are a helpful programming assistant."},
            {"role": "user", "content": "Write a simple Python function to calculate factorial."}
        ],
        max_tokens=200,
        temperature=0.5
    )

    print("Response:")
    print(chat_response.choices[0].message.content)

except Exception as e:
    print(f"An error occurred: {e}")

(Remember to replace the model name if necessary)

Using the OpenAI-compatible server is a powerful way to deploy high-performance LLM inference endpoints with minimal changes to your existing application logic.

Let’s Talk About vLLM Attention Backends

vLLM utilizes specialized "backends" to compute the attention mechanism efficiently. These backends are optimized implementations leveraging different libraries or techniques, primarily targeting NVIDIA GPUs. The choice of backend can impact performance and memory usage. The main ones include:

  1. FlashAttention: Uses the FlashAttention library (versions 1 and 2). FlashAttention is a highly optimized attention algorithm that significantly speeds up computation and reduces memory usage by avoiding the need to materialize the large intermediate attention matrix in GPU High Bandwidth Memory (HBM). It's often the fastest option for many modern GPUs (like Ampere, Hopper architectures) and sequence lengths. vLLM typically includes pre-built wheels with FlashAttention support.
  2. Xformers: Leverages the xFormers library, developed by Meta AI. xFormers also provides memory-efficient and optimized attention implementations (like MemoryEfficientAttention). It offers broad compatibility across various GPU architectures and can be a good alternative or fallback if FlashAttention isn't available or optimal for a specific scenario. vLLM's standard installation often includes support for xFormers.
  3. FlashInfer: A more recent backend option utilizing the FlashInfer library. FlashInfer provides highly optimized kernels specifically tailored for deploying LLMs, focusing on various prefill and decoding scenarios, including features like speculative decoding and efficient handling of paged KV caches. There are typically no pre-built vLLM wheels containing FlashInfer, meaning you must install it separately in your environment before vLLM can use it. Refer to the FlashInfer official documentation or the vLLM Dockerfiles for installation instructions if you intend to use this backend.

Automatic Backend Selection: By default, vLLM automatically detects the most suitable and performant attention backend based on your hardware (GPU architecture), the installed libraries (whether FlashAttention, xFormers, or FlashInfer is available), and the specific model being used. It performs compatibility checks and aims to provide the best out-of-the-box performance without manual configuration.

Manual Backend Selection: In some advanced use cases or for benchmarking purposes, you might want to force vLLM to use a specific backend. You can do this by setting the VLLM_ATTENTION_BACKEND environment variable before launching your vLLM process (either the offline script or the server).

# Example: Force using FlashAttention (if installed and compatible)
export VLLM_ATTENTION_BACKEND=FLASH_ATTN
python your_offline_script.py
# or
# export VLLM_ATTENTION_BACKEND=FLASH_ATTN
# vllm serve your_model ...

# Example: Force using xFormers
export VLLM_ATTENTION_BACKEND=XFORMERS
python your_offline_script.py

# Example: Force using FlashInfer (requires prior installation)
export VLLM_ATTENTION_BACKEND=FLASHINFER
python your_offline_script.py

For most beginners, relying on vLLM's automatic backend selection is recommended. Manually setting the backend is typically reserved for experimentation or troubleshooting specific performance issues.

Troubleshooting Common vLLM Installation and Usage Issues

While vLLM aims for ease of use, you might encounter some common hurdles, especially during setup. Here are some frequent problems and their potential solutions:

  1. CUDA Out of Memory (OOM) Errors: The model plus its KV cache doesn't fit on your GPU. Try a smaller model, lower gpu_memory_utilization, reduce max_model_len, or use a quantized variant (see the sketch below).
  2. Installation Errors (CUDA/PyTorch Compatibility): Usually a mismatch between your GPU driver, CUDA toolkit, and PyTorch build. Re-run the PyTorch/CUDA check from the prerequisites section and reinstall inside a clean virtual environment.
  3. Model Loading Failures: Check that the model name matches the Hugging Face Hub exactly, that you have accepted the license and logged in (huggingface-cli login) for gated models, and that you have enough disk space for the download.
  4. Slow Performance: Confirm the GPU is actually being used, that an optimized attention backend (FlashAttention/xFormers) was selected, and that you aren't bottlenecked by very small batches or extremely long prompts.
  5. Incorrect Output or Gibberish: Often caused by using a base (non-instruct) model with chat-style prompts, a missing or wrong chat template, or extreme sampling parameters; try an instruct-tuned model and moderate temperature/top_p values.

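For the OOM case in offline inference, here is a minimal sketch of the memory-related knobs on the LLM class (the values are illustrative and should be tuned for your GPU):

from vllm import LLM

llm = LLM(
    model="facebook/opt-1.3b",       # smaller model from the earlier examples
    gpu_memory_utilization=0.80,     # cap vLLM's share of GPU memory (default is about 0.9)
    max_model_len=4096,              # shorter maximum context -> smaller KV cache
    dtype="half",                    # fp16 weights and activations
)
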
Consulting the official vLLM documentation and the project's GitHub Issues page is also highly recommended when encountering problems.

Conclusion: Your Journey with vLLM

Congratulations! You've taken your first steps into the world of high-performance LLM inference with vLLM. We've covered the fundamental concepts, from understanding what vLLM is and why its PagedAttention technology is revolutionary, to the practical steps of installation and usage.

You now know how to:

  1. Install vLLM into a virtual environment with pip, Conda, or uv.
  2. Run offline batch inference with the LLM class and SamplingParams.
  3. Launch the OpenAI-compatible server with vllm serve.
  4. Query that server through the Completions and Chat Completions APIs, using curl or the openai Python client.
  5. Understand, and if needed override, vLLM's attention backends.

vLLM significantly lowers the barrier to deploying powerful LLMs efficiently. By leveraging its speed and memory optimization, you can build faster, more scalable, and potentially more cost-effective AI applications. Whether you're processing large datasets or building real-time conversational agents, vLLM provides the engine to power your LLM inference needs.

This guide provides a solid foundation, but the vLLM ecosystem is rich with more advanced features like quantization support, multi-LoRA inference, speculative decoding, distributed serving, and much more. The best way to continue your learning journey is to explore the official vLLM documentation, experiment with different models and parameters, and perhaps even contribute to the vibrant open-source community. Happy coding, and enjoy the speed of vLLM!

