Large language models (LLMs) like Qwen3 are revolutionizing the AI landscape with their impressive capabilities in coding, reasoning, and natural language understanding. Developed by the Qwen team at Alibaba, Qwen3 offers quantized models that enable efficient local deployment, making it accessible for developers, researchers, and enthusiasts to run these powerful models on their own hardware. Whether you're using Ollama, LM Studio, or vLLM, this guide will walk you through the process of setting up and running Qwen3 quantized models locally.
In this technical guide, we’ll explore the setup process, model selection, deployment methods, and API integration. Let’s get started.
What Are Qwen3 Quantized Models?
Qwen3 is the latest generation of LLMs from Alibaba, designed for high performance across tasks like coding, math, and general reasoning. Alongside the full-precision BF16 release, quantized variants in FP8, GGUF, AWQ, and GPTQ formats reduce compute and memory requirements, making them ideal for local deployment on consumer-grade hardware.
The Qwen3 family includes various models:
- Qwen3-235B-A22B (MoE): A mixture-of-experts model with BF16, FP8, GGUF, and GPTQ-int4 formats.
- Qwen3-30B-A3B (MoE): Another MoE variant with similar quantization options.
- Qwen3-32B, 14B, 8B, 4B, 1.7B, 0.6B (Dense): Dense models available in BF16, FP8, GGUF, AWQ, and GPTQ-int8 formats.

These models support flexible deployment through platforms like Ollama, LM Studio, and vLLM, which we’ll cover in detail. Additionally, Qwen3 offers features like "thinking mode," which can be toggled for better reasoning, and generation parameters to fine-tune output quality.
Now that we understand the basics, let’s move on to the prerequisites for running Qwen3 locally.
Prerequisites for Running Qwen3 Locally
Before deploying Qwen3 quantized models, ensure your system meets the following requirements:
Hardware:
- A modern CPU or GPU (NVIDIA GPUs are recommended for vLLM).
- At least 16GB of RAM for smaller models like Qwen3-4B; 32GB or more for larger models like Qwen3-32B.
- Sufficient storage (e.g., Qwen3-235B-A22B GGUF may require ~150GB).
Software:
- A compatible operating system (Windows, macOS, or Linux).
- Python 3.8+ for vLLM and API interactions.
- Docker (optional, for vLLM).
- Git for cloning repositories.
Dependencies:
- Install required libraries like `torch`, `transformers`, and `vllm` (for vLLM).
- Download Ollama or LM Studio binaries from their official websites.
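If you want to confirm the Python side of your environment before moving on, a quick sanity check like the sketch below can save debugging time later. It assumes you have already installed `torch` and `transformers` with pip.

```python
# Sanity-check the core Python dependencies for this guide.
# Assumes you have already run: pip install torch transformers (and vllm, if you plan to use it)
import torch
import transformers

print("torch:", torch.__version__)
print("transformers:", transformers.__version__)
print("CUDA available:", torch.cuda.is_available())  # strongly recommended for vLLM
```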
With these prerequisites in place, let’s proceed to download the Qwen3 quantized models.
Step 1: Download Qwen3 Quantized Models
First, you need to download the quantized models from trusted sources. The Qwen team provides Qwen3 models on both Hugging Face and ModelScope:
- Hugging Face: Qwen3 Collection
- ModelScope: Qwen3 Collection
How to Download from Hugging Face
- Visit the Hugging Face Qwen3 collection.
- Select a model, such as Qwen3-4B in GGUF format for lightweight deployment.
- Click the "Download" button or use the
git clone
command to fetch the model files:
git clone https://huggingface.co/Qwen/Qwen3-4B-GGUF
- Store the model files in a directory, such as
/models/qwen3-4b-gguf
.
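If you prefer a scripted download over `git clone`, the `huggingface_hub` library offers a `snapshot_download` helper. The sketch below assumes you have installed it (`pip install huggingface_hub`); the target directory is just an example.

```python
# Download a Qwen3 GGUF repository from Hugging Face programmatically.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="Qwen/Qwen3-4B-GGUF",
    local_dir="models/qwen3-4b-gguf",   # example target directory
    # allow_patterns=["*Q8_0*"],        # optionally restrict the download to one quantization level
)
print("Model files downloaded to:", local_dir)
```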
How to Download from ModelScope
- Navigate to the ModelScope Qwen3 collection.
- Choose your desired model and quantization format (e.g., AWQ or GPTQ).
- Download the files manually or use their API for programmatic access.
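For programmatic access, the `modelscope` Python package provides a similar `snapshot_download` helper. This is a minimal sketch assuming that package is installed (`pip install modelscope`); the repo ID is illustrative, so substitute the model and format you actually need.

```python
# Download a Qwen3 model from ModelScope programmatically.
from modelscope import snapshot_download

model_dir = snapshot_download("Qwen/Qwen3-4B-AWQ")  # illustrative repo ID; pick your desired format
print("Model files downloaded to:", model_dir)
```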
Once the models are downloaded, let’s explore how to deploy them using Ollama.
Step 2: Deploy Qwen3 Using Ollama
Ollama provides a user-friendly way to run LLMs locally with minimal setup. It supports Qwen3’s GGUF format, making it ideal for beginners.

Install Ollama
- Visit the official Ollama website and download the binary for your operating system.
- Install Ollama by running the installer or following the command-line instructions: `curl -fsSL https://ollama.com/install.sh | sh`
- Verify the installation: `ollama --version`

Run Qwen3 with Ollama
- Start the model: `ollama run qwen3:235b-a22b-q8_0` (substitute a smaller tag such as `qwen3:4b` if your hardware can't hold the 235B MoE model)
- Once the model is running, you can interact with it directly at the command-line prompt, e.g. `>>> Hello, how can I assist you today?`
Ollama also provides a local API endpoint (typically `http://localhost:11434`) for programmatic access, which we'll test later using Apidog.
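Before we get to Apidog, here is a minimal Python sketch of what a call to that endpoint looks like. It assumes the `requests` package is installed and that you have already pulled a Qwen3 model in Ollama (the `qwen3:4b` tag is used as an example); note that Ollama expects sampling settings inside an `options` object.

```python
# Minimal example of calling Ollama's local /api/generate endpoint.
import requests

payload = {
    "model": "qwen3:4b",  # must match a model you have pulled with `ollama pull` or `ollama run`
    "prompt": "Give me a one-sentence summary of what Qwen3 is.",
    "stream": False,      # return a single JSON response instead of a stream
    "options": {"temperature": 0.6, "top_p": 0.95, "top_k": 20},
}

resp = requests.post("http://localhost:11434/api/generate", json=payload, timeout=300)
resp.raise_for_status()
print(resp.json()["response"])
```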
Next, let’s explore how to use LM Studio for running Qwen3.
Step 3: Deploy Qwen3 Using LM Studio
LM Studio is another popular tool for running LLMs locally, offering a graphical interface for model management.

Install LM Studio
- Download LM Studio from its official website.
- Install the application by following the on-screen instructions.
- Launch LM Studio and ensure it’s running.
Load Qwen3 in LM Studio
In LM Studio, go to the "Local Models" section.
Click "Add Model" and search the model to download it:

Configure the model settings, such as:
- Temperature: 0.6
- Top-P: 0.95
- Top-K: 20
These settings match Qwen3’s recommended thinking mode parameters.
Start the model server by clicking "Start Server." LM Studio will provide a local API endpoint (e.g., `http://localhost:1234`).
Interact with Qwen3 in LM Studio
- Use LM Studio’s built-in chat interface to test the model.
- Alternatively, access the model via its API endpoint, which we’ll explore in the API testing section.
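As a preview of that API access, here is a small sketch that talks to LM Studio's local server, which follows the OpenAI API format. It assumes the `openai` Python package is installed, the server is running on the default `http://localhost:1234` port, and the model identifier is a placeholder for whatever model you loaded.

```python
# Query LM Studio's OpenAI-compatible local server.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")  # any non-empty key works locally

response = client.chat.completions.create(
    model="qwen3-4b",  # placeholder; use the identifier LM Studio shows for your loaded model
    messages=[{"role": "user", "content": "Explain quantization in one paragraph."}],
    temperature=0.6,
    top_p=0.95,
)
print(response.choices[0].message.content)
```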
With LM Studio set up, let’s move on to a more advanced deployment method using vLLM.
Step 4: Deploy Qwen3 Using vLLM
vLLM is a high-performance serving solution optimized for LLMs, supporting Qwen3’s FP8 and AWQ quantized models. It’s ideal for developers building robust applications.

Install vLLM
- Ensure Python 3.8+ is installed on your system.
- Install vLLM using pip: `pip install vllm`
- Verify the installation: `python -c "import vllm; print(vllm.__version__)"`
Run Qwen3 with vLLM
Start a vLLM server to load and run your Qwen3 model:

`vllm serve "Qwen/Qwen3-235B-A22B"`

Adding the `--enable-thinking=False` flag disables Qwen3's thinking mode. Once the server starts, it provides an OpenAI-compatible API endpoint at `http://localhost:8000`.
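Because the server speaks the OpenAI API, you can talk to it with the standard `openai` Python client. This is a minimal sketch; it assumes the server from the previous step is running and that passing `top_k` through `extra_body` is supported by your vLLM version.

```python
# Query the vLLM OpenAI-compatible server started with `vllm serve`.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # vLLM ignores the key by default

response = client.chat.completions.create(
    model="Qwen/Qwen3-235B-A22B",  # must match the model name passed to `vllm serve`
    messages=[{"role": "user", "content": "Write a haiku about local LLMs."}],
    temperature=0.6,
    top_p=0.95,
    extra_body={"top_k": 20},  # vLLM-specific sampling extension
)
print(response.choices[0].message.content)
```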
Configure vLLM for Optimal Performance
vLLM supports advanced configurations, such as:
- Tensor Parallelism: Adjust `--tensor-parallel-size` based on your GPU setup.
- Context Length: Qwen3 supports up to 32,768 tokens, which can be set via `--max-model-len 32768`.
- Generation Parameters: Use the API to set `temperature`, `top_p`, and `top_k` (e.g., 0.7, 0.8, and 20 for non-thinking mode).
With vLLM running, let’s test the API endpoint using Apidog.
Step 5: Test Qwen3 API with Apidog
Apidog is a powerful tool for testing API endpoints, making it perfect for interacting with your locally deployed Qwen3 model.
Set Up Apidog
- Download and install Apidog from the official website.
- Launch Apidog and create a new project.

Test Ollama API
- Create a new API request in Apidog.
- Set the endpoint to `http://localhost:11434/api/generate`.
- Configure the request:
- Method: POST
- Body (JSON), where Ollama expects sampling settings inside an `options` object and the model tag must match a model you have pulled:

```json
{
  "model": "qwen3:4b",
  "prompt": "Hello, how can I assist you today?",
  "stream": false,
  "options": {
    "temperature": 0.6,
    "top_p": 0.95,
    "top_k": 20
  }
}
```
- Send the request and verify the response.
Test vLLM API
- Create another API request in Apidog.
- Set the endpoint to `http://localhost:8000/v1/completions`.
- Configure the request:
- Method: POST
- Body (JSON), where the `model` value must match the name you passed to `vllm serve` (or the `--served-model-name` override):

```json
{
  "model": "Qwen/Qwen3-235B-A22B",
  "prompt": "Write a Python script to calculate factorial.",
  "max_tokens": 512,
  "temperature": 0.7,
  "top_p": 0.8,
  "top_k": 20
}
```
- Send the request and check the output.
Apidog makes it easy to validate your Qwen3 deployment and ensure the API is functioning correctly. Now, let’s fine-tune the model’s performance.
Step 6: Fine-Tune Qwen3 Performance
To optimize Qwen3’s performance, adjust the following settings based on your use case:
Thinking Mode
Qwen3 supports a "thinking mode" for enhanced reasoning. You can control it in two ways:
- Soft Switch: Add `/think` or `/no_think` to your prompt.
  - Example: `Solve this math problem /think`
- Hard Switch: Disable thinking entirely in vLLM with `--enable-thinking=False`.
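If you load Qwen3 directly with `transformers`, the hard switch is also exposed through the chat template. The sketch below only builds the prompt text (no model weights are loaded) and assumes a recent `transformers` version plus the `enable_thinking` argument documented in the Qwen3 model cards.

```python
# Demonstrate Qwen3's thinking-mode switches at the chat-template level.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-4B")
messages = [{"role": "user", "content": "Solve this math problem: 37 * 43 /think"}]  # soft switch in the prompt

# Hard switch: with enable_thinking=False the template suppresses thinking, and /think tags are ignored.
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False,
)
print(prompt)
```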
Generation Parameters
Fine-tune the generation parameters for better output quality:
- Temperature: Use 0.6 for thinking mode or 0.7 for non-thinking mode.
- Top-P: Set to 0.95 (thinking) or 0.8 (non-thinking).
- Top-K: Use 20 for both modes.
- Avoid greedy decoding, as recommended by the Qwen team.
Experiment with these settings to achieve the desired balance between creativity and accuracy.
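If you run Qwen3 through vLLM's offline Python API instead of the HTTP server, the same recommendations map directly onto `SamplingParams`. A minimal sketch, assuming a GPU large enough for the chosen model:

```python
# Apply the recommended thinking-mode sampling settings with vLLM's offline API.
from vllm import LLM, SamplingParams

sampling = SamplingParams(temperature=0.6, top_p=0.95, top_k=20, max_tokens=512)
llm = LLM(model="Qwen/Qwen3-4B")  # pick a model that fits your hardware

outputs = llm.generate(["Briefly explain why greedy decoding is discouraged for Qwen3."], sampling)
print(outputs[0].outputs[0].text)
```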
Troubleshooting Common Issues
While deploying Qwen3, you may encounter some issues. Here are solutions to common problems:
Model Fails to Load in Ollama:
- Ensure the GGUF file path in the `Modelfile` is correct.
- Check if your system has enough memory to load the model.
vLLM Tensor Parallelism Error:
- If you see an error like "output_size is not divisible by weight quantization block_n," reduce the `--tensor-parallel-size` (e.g., to 4).
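A quick way to see which tensor-parallel sizes are even possible on your machine is to check how many GPUs PyTorch can see; this sketch assumes `torch` is installed with CUDA support.

```python
# --tensor-parallel-size cannot exceed the number of visible GPUs,
# and smaller values are more likely to avoid the divisibility error described above.
import torch

print("Visible GPUs:", torch.cuda.device_count())
```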
API Request Fails in Apidog:
- Verify that the server (Ollama, LM Studio, or vLLM) is running.
- Double-check the endpoint URL and request payload.
By addressing these issues, you can ensure a smooth deployment experience.
Conclusion
Running Qwen3 quantized models locally is a straightforward process with tools like Ollama, LM Studio, and vLLM. Whether you’re a developer building applications or a researcher experimenting with LLMs, Qwen3 offers the flexibility and performance you need. By following this guide, you’ve learned how to download models from Hugging Face and ModelScope, deploy them using various frameworks, and test their API endpoints with Apidog.
Start exploring Qwen3 today and unlock the power of local LLMs for your projects!