How to Use Qwen3 Quantized Models Locally: A Step-by-Step Guide

Learn how to use Qwen3 quantized models locally with Ollama, LM Studio, and vLLM.

Ashley Innocent

14 May 2025

Large language models (LLMs) like Qwen3 are revolutionizing the AI landscape with their impressive capabilities in coding, reasoning, and natural language understanding. Developed by the Qwen team at Alibaba, Qwen3 offers quantized models that enable efficient local deployment, making it accessible for developers, researchers, and enthusiasts to run these powerful models on their own hardware. Whether you're using Ollama, LM Studio, or vLLM, this guide will walk you through the process of setting up and running Qwen3 quantized models locally.

💡
Before diving in, ensure you have the right tools to test and interact with your local Qwen3 setup. Apidog is an excellent API testing tool that can help you validate your local model’s API endpoints with ease. Download Apidog for free to streamline your API testing workflow while working with Qwen3!

In this technical guide, we’ll explore the setup process, model selection, deployment methods, and API integration. Let’s get started.

What Are Qwen3 Quantized Models?

Qwen3 is the latest generation of LLMs from Alibaba, designed for high performance across tasks like coding, math, and general reasoning. Quantized models, such as those in BF16, FP8, GGUF, AWQ, and GPTQ formats, reduce the computational and memory requirements, making them ideal for local deployment on consumer-grade hardware.

The Qwen3 family spans several sizes, so you can match the model to your hardware:

  - Dense models: Qwen3-0.6B, Qwen3-1.7B, Qwen3-4B, Qwen3-8B, Qwen3-14B, and Qwen3-32B.
  - Mixture-of-Experts (MoE) models: Qwen3-30B-A3B (3B active parameters) and Qwen3-235B-A22B (22B active parameters).

These models support flexible deployment through platforms like Ollama, LM Studio, and vLLM, which we’ll cover in detail. Additionally, Qwen3 offers features like "thinking mode," which can be toggled for better reasoning, and generation parameters to fine-tune output quality.

Now that we understand the basics, let’s move on to the prerequisites for running Qwen3 locally.

Prerequisites for Running Qwen3 Locally

Before deploying Qwen3 quantized models, ensure your system meets the following requirements:

Hardware:

  - A modern multi-core CPU, and ideally an NVIDIA GPU for accelerated inference. VRAM needs scale with model size: roughly 4-8 GB is enough for a quantized Qwen3-4B, while the largest variants require multi-GPU servers.
  - At least 16 GB of system RAM for the smaller models, plus enough free disk space for the weights (a few GB for Qwen3-4B in GGUF, far more for Qwen3-235B-A22B).

Software:

  - Windows, macOS, or Linux.
  - Python 3.8 or later (required for vLLM).
  - Recent GPU drivers, plus the CUDA toolkit if you plan to use GPU acceleration with vLLM.

Dependencies:

  - At least one runtime: Ollama, LM Studio, or vLLM (installation is covered below).
  - git with git-lfs, or the huggingface_hub/modelscope Python packages, for downloading model weights.
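
If you want a quick sanity check before installing anything, here is a minimal Python sketch that reports your Python version and whether a CUDA-capable GPU is visible. It assumes PyTorch is available (vLLM installs it as a dependency; Ollama and LM Studio do not need it):

# Rough environment check for local LLM inference
import platform
import sys

print(f"Python {sys.version.split()[0]} on {platform.system()}")
try:
    import torch  # pulled in automatically when you install vLLM
    if torch.cuda.is_available():
        props = torch.cuda.get_device_properties(0)
        print(f"GPU: {props.name}, {props.total_memory / 1024**3:.1f} GB VRAM")
    else:
        print("No CUDA GPU detected; expect CPU-only inference.")
except ImportError:
    print("PyTorch not installed; fine if you only plan to use Ollama or LM Studio.")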

With these prerequisites in place, let’s proceed to download the Qwen3 quantized models.

Step 1: Download Qwen3 Quantized Models

First, you need to download the quantized models from trusted sources. The Qwen team publishes Qwen3 models on both Hugging Face and ModelScope.

How to Download from Hugging Face

  1. Visit the Hugging Face Qwen3 collection.
  2. Select a model, such as Qwen3-4B in GGUF format for lightweight deployment.
  3. Click the "Download" button, or clone the repository with git (make sure git-lfs is installed, since the weights are stored with Git LFS):
git clone https://huggingface.co/Qwen/Qwen3-4B-GGUF
  4. Store the model files in a directory, such as /models/qwen3-4b-gguf.
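
If you prefer a scriptable download over git, the huggingface_hub package offers snapshot_download; here is a minimal sketch (the local_dir path is illustrative):

# pip install huggingface_hub
from huggingface_hub import snapshot_download

# Fetch every file in the Qwen3-4B GGUF repository into a local folder
snapshot_download(
    repo_id="Qwen/Qwen3-4B-GGUF",
    local_dir="models/qwen3-4b-gguf",
)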

How to Download from ModelScope

  1. Navigate to the ModelScope Qwen3 collection.
  2. Choose your desired model and quantization format (e.g., AWQ or GPTQ).
  3. Download the files manually, or fetch them programmatically as sketched below.
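
As a sketch of that programmatic route, the modelscope Python package exposes a snapshot_download helper similar to Hugging Face's (the model ID and cache directory below are illustrative):

# pip install modelscope
from modelscope import snapshot_download

# Download the model files into a local cache and print the resulting path
model_dir = snapshot_download("Qwen/Qwen3-4B", cache_dir="models")
print(model_dir)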

Once the models are downloaded, let’s explore how to deploy them using Ollama.

Step 2: Deploy Qwen3 Using Ollama

Ollama provides a user-friendly way to run LLMs locally with minimal setup. It supports Qwen3’s GGUF format, making it ideal for beginners.

Install Ollama

  1. Visit the official Ollama website and download the binary for your operating system.
  2. Install Ollama by running the installer or following the command-line instructions:
curl -fsSL https://ollama.com/install.sh | sh
  3. Verify the installation:
ollama --version

Run Qwen3 with Ollama

  1. Start the model (Ollama will download the weights automatically the first time you run this):
ollama run qwen3:235b-a22b-q8_0
  2. Once the model is running, you can interact with it via the command line:
>>> Hello, how can I assist you today?

Ollama also provides a local API endpoint (typically http://localhost:11434) for programmatic access, which we’ll test later using Apidog.
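
As a quick preview of that endpoint, here is a minimal Python sketch using the requests package (the prompt is illustrative; note that Ollama expects sampling parameters inside a nested options object and returns a single JSON object when stream is false):

import requests

# Non-streaming generation request against the local Ollama server
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "qwen3:235b-a22b-q8_0",  # use whichever tag you pulled above
        "prompt": "Explain quantization in one sentence.",
        "stream": False,  # return one JSON object instead of a token stream
        "options": {"temperature": 0.6, "top_p": 0.95, "top_k": 20},
    },
    timeout=300,
)
resp.raise_for_status()
print(resp.json()["response"])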

Next, let’s explore how to use LM Studio for running Qwen3.

Step 3: Deploy Qwen3 Using LM Studio

LM Studio is another popular tool for running LLMs locally, offering a graphical interface for model management.

Install LM Studio

  1. Download LM Studio from its official website.
  2. Install the application by following the on-screen instructions.
  3. Launch LM Studio and ensure it’s running.

Load Qwen3 in LM Studio

  1. In LM Studio, go to the "Local Models" section.
  2. Click "Add Model," search for a Qwen3 model (e.g., Qwen3-4B in GGUF format), and download it.
  3. Configure the model settings, such as:
  - Context length: how many tokens the model can attend to; larger values use more memory.
  - GPU offload: how many layers to run on the GPU versus the CPU.
  - Sampling defaults: temperature, top-p, and top-k (see Step 6 for recommended values).
  4. Start the model server by clicking "Start Server." LM Studio will provide a local API endpoint (e.g., http://localhost:1234).

Interact with Qwen3 in LM Studio

  1. Use LM Studio’s built-in chat interface to test the model.
  2. Alternatively, access the model via its API endpoint, as sketched below (and covered further in the API testing section).
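
LM Studio's local server speaks the OpenAI-compatible API, so a minimal Python sketch looks like this (the model identifier varies; use the one shown in LM Studio's server tab):

import requests

# Chat completion against LM Studio's OpenAI-compatible endpoint
resp = requests.post(
    "http://localhost:1234/v1/chat/completions",
    json={
        "model": "qwen3-4b",  # illustrative; copy the ID from LM Studio's server tab
        "messages": [{"role": "user", "content": "Summarize what a GGUF file is."}],
        "temperature": 0.7,
    },
    timeout=300,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])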

With LM Studio set up, let’s move on to a more advanced deployment method using vLLM.

Step 4: Deploy Qwen3 Using vLLM

vLLM is a high-performance serving solution optimized for LLMs, supporting Qwen3’s FP8 and AWQ quantized models. It’s ideal for developers building robust applications.

Install vLLM

  1. Ensure Python 3.8+ is installed on your system.
  2. Install vLLM using pip:
pip install vllm
  3. Verify the installation:
python -c "import vllm; print(vllm.__version__)"

Run Qwen3 with vLLM

Start a vLLM server with your Qwen3 model:

# Load and run the model:
vllm serve "Qwen/Qwen3-235B-A22B"

To disable Qwen3's thinking mode at serve time, append the --enable-thinking=False flag to this command.

Once the server starts, it will provide an API endpoint at http://localhost:8000.
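
Since vLLM serves the OpenAI-compatible API, you can exercise it from Python right away; here is a minimal sketch with requests (the model field must match the name passed to vllm serve):

import requests

# Text completion against vLLM's OpenAI-compatible endpoint
resp = requests.post(
    "http://localhost:8000/v1/completions",
    json={
        "model": "Qwen/Qwen3-235B-A22B",  # must match the served model name
        "prompt": "Write a haiku about local inference.",
        "max_tokens": 128,
        "temperature": 0.7,
    },
    timeout=300,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["text"])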

Configure vLLM for Optimal Performance

vLLM supports advanced configurations, such as:

  - --tensor-parallel-size: shard the model across multiple GPUs (e.g., --tensor-parallel-size 4).
  - --gpu-memory-utilization: the fraction of VRAM vLLM is allowed to reserve (default 0.9).
  - --max-model-len: cap the context window to reduce memory usage.
  - --quantization: select the quantization method (e.g., awq) when serving quantized checkpoints.

With vLLM running, let’s test the API endpoint using Apidog.

Step 5: Test Qwen3 API with Apidog

Apidog is a powerful tool for testing API endpoints, making it perfect for interacting with your locally deployed Qwen3 model.

Set Up Apidog

  1. Download and install Apidog from the official website.
  2. Launch Apidog and create a new project.

Test Ollama API

  1. Create a new API request in Apidog.
  2. Set the method to POST and the endpoint to http://localhost:11434/api/generate.
  3. Configure the request body. The model field must match the tag you pulled in Step 2, and Ollama expects sampling parameters inside a nested options object:
{
  "model": "qwen3:235b-a22b-q8_0",
  "prompt": "Hello, how can I assist you today?",
  "stream": false,
  "options": {
    "temperature": 0.6,
    "top_p": 0.95,
    "top_k": 20
  }
}
  4. Send the request and verify the response.

Test vLLM API

  1. Create another API request in Apidog.
  2. Set the method to POST and the endpoint to http://localhost:8000/v1/completions.
  3. Configure the request body. The model field must match the name you passed to vllm serve:
{
  "model": "Qwen/Qwen3-235B-A22B",
  "prompt": "Write a Python script to calculate factorial.",
  "max_tokens": 512,
  "temperature": 0.7,
  "top_p": 0.8,
  "top_k": 20
}
  4. Send the request and check the output.

Apidog makes it easy to validate your Qwen3 deployment and ensure the API is functioning correctly. Now, let’s fine-tune the model’s performance.

Step 6: Fine-Tune Qwen3 Performance

To optimize Qwen3’s performance, adjust the following settings based on your use case:

Thinking Mode

Qwen3 supports a "thinking mode" for enhanced reasoning. You can control it in two ways:

  1. Soft Switch: Add /think or /no_think to your prompt (see the sketch after this list).
  2. Hard Switch: Disable thinking entirely in vLLM with --enable-thinking=False.
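
To see the soft switch in action, here is a minimal Python sketch against the Ollama endpoint from Step 2 (the model tag and prompt are illustrative); appending /no_think asks Qwen3 to skip its reasoning phase:

import requests

# Soft switch: append /no_think (or /think) to the prompt itself
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "qwen3:235b-a22b-q8_0",
        "prompt": "What is 17 * 24? /no_think",
        "stream": False,
    },
    timeout=300,
)
print(resp.json()["response"])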

Generation Parameters

Fine-tune the generation parameters for better output quality:

  - Temperature: controls randomness; the Qwen team recommends 0.6 for thinking mode and 0.7 for non-thinking mode.
  - Top-p: nucleus-sampling threshold; 0.95 with thinking mode, 0.8 without.
  - Top-k: restricts sampling to the k most likely tokens; 20 works well in both modes.
  - Max tokens: caps the response length; raise it when you expect long reasoning traces.

Experiment with these settings to achieve the desired balance between creativity and accuracy.

Troubleshooting Common Issues

While deploying Qwen3, you may encounter some issues. Here are solutions to common problems:

Model Fails to Load in Ollama:

  - Run ollama list to confirm the model tag exists locally, and check that you have enough free RAM/VRAM for the chosen quantization. If loading still fails, try a smaller variant such as Qwen3-4B.

vLLM Tensor Parallelism Error:

  - Ensure --tensor-parallel-size matches the number of available GPUs and divides the model's attention-head count evenly. Lower the value, or switch to a smaller model, if you hit out-of-memory errors.

API Request Fails in Apidog:

  - Confirm the server is actually running, the port matches the tool (11434 for Ollama, 1234 for LM Studio, 8000 for vLLM), the method is POST, and the "model" field matches the loaded model's name.

By addressing these issues, you can ensure a smooth deployment experience.

Conclusion

Running Qwen3 quantized models locally is a straightforward process with tools like Ollama, LM Studio, and vLLM. Whether you’re a developer building applications or a researcher experimenting with LLMs, Qwen3 offers the flexibility and performance you need. By following this guide, you’ve learned how to download models from Hugging Face and ModelScope, deploy them using various frameworks, and test their API endpoints with Apidog.

Start exploring Qwen3 today and unlock the power of local LLMs for your projects!
