How to Run Qwen3 Embedding and Reranker Models Locally with Ollama

Mark Ponomarev

25 June 2025

The world of large language models (LLMs) is expanding at an explosive pace, but for a long time, accessing state-of-the-art capabilities meant relying on cloud-based APIs. This dependency often comes with concerns about privacy, cost, and customization. The tide is turning, however, thanks to powerful open-source models and tools like Ollama that make running them on your local machine easier than ever.

Among the most exciting recent developments is the release of the Qwen3 model family by Alibaba Cloud. These models, particularly the specialized embedding and reranker versions, are setting new benchmarks in performance. When paired with Ollama, they provide a potent toolkit for developers and researchers looking to build sophisticated AI applications, such as advanced search engines and Retrieval-Augmented Generation (RAG) systems, all from the comfort of their own hardware.

This article is your comprehensive, step-by-step guide to harnessing this power. We will demystify what embedding and reranker models are, walk through the setup of Ollama, and provide practical Python code to run the Qwen3 embedding and reranker models for a complete, end-to-end RAG workflow.

💡
Want a great API Testing tool that generates beautiful API Documentation?

Want an integrated, All-in-One platform for your Developer Team to work together with maximum productivity?

Apidog delivers all your demands, and replaces Postman at a much more affordable price!

The Power Duo: Understanding Embedding and Reranker Models

Before we dive into the "how," let's understand the "what." In the context of a RAG system, embedding and reranker models play distinct but complementary roles.

1. The Librarian: The Embedding Model

Imagine a massive library with millions of books but no catalog system. Finding information would be a nightmare. An embedding model is like a hyper-intelligent librarian who reads every single document and assigns it a specific location in a vast, multi-dimensional "concept space."

Technically, a text embedding is a dense vector of numbers produced by converting a piece of text (a word, sentence, or entire document). This vector captures the semantic meaning of the text: documents with similar meanings end up with vectors that are "close" to each other in this space.

When you have a query, the embedding model converts your question into a vector and then searches the library for document vectors that are closest to it. This initial search is incredibly efficient and retrieves a broad set of potentially relevant documents.
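
To make the idea of "closeness" concrete, here is a tiny, self-contained sketch using made-up three-dimensional vectors (real embeddings have thousands of dimensions). It measures closeness with cosine similarity, the standard way of comparing the direction of two vectors:

import numpy as np

# Toy "embeddings" -- invented for illustration, not produced by a real model
query_vec = np.array([0.9, 0.1, 0.2])   # "How do I run models locally?"
doc_a_vec = np.array([0.8, 0.2, 0.1])   # a document about running LLMs locally
doc_b_vec = np.array([0.1, 0.9, 0.7])   # an unrelated document

def cosine_similarity(a, b):
    """Cosine similarity: close to 1.0 = similar meaning, near 0.0 = unrelated."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine_similarity(query_vec, doc_a_vec))  # high score: semantically close
print(cosine_similarity(query_vec, doc_b_vec))  # lower score: semantically distant

In practice, the embedding model produces these vectors for you, and a vector database performs this comparison at scale.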

2. The Expert Consultant: The Reranker Model

The initial retrieval from the embedding model is fast, but it's not always perfect. It might pull in documents that are thematically related but don't precisely answer the user's query. This is where the reranker comes in.

If the embedding model is a generalist librarian, the reranker is a subject-matter expert. It takes the top results from the initial search and performs a more nuanced, computationally intensive analysis. Instead of comparing a query vector to document vectors independently, a reranker model (typically a cross-encoder) looks at the query and each document as a pair.

This direct comparison allows the reranker to calculate a much more accurate relevance score. It then re-orders the documents based on this score, pushing the most relevant results to the top. This two-stage process—a fast initial retrieval followed by a precise reranking—is the secret to state-of-the-art RAG performance.

Meet the Qwen3 Models: A New Standard in Open-Source AI

The Qwen3 series from Alibaba Cloud isn't just another set of models; it represents a significant leap forward in open-source NLP. The embedding and reranker variants are built on the Qwen3 foundation models and are released in 0.6B, 4B, and 8B parameter sizes, so you can trade quality for speed and memory. They support over 100 languages and instruction-aware prompting, and at release the 8B embedding model topped the MTEB multilingual leaderboard, bringing open-weight retrieval quality in line with proprietary APIs.

Setting Up Your Local AI Environment

Now, let's get our hands dirty. The first step is to set up Ollama and download the Qwen3 models.

Step 1: Install Ollama

Ollama provides a simple, single-line installation command for macOS and Linux. Open your terminal and run:

curl -fsSL https://ollama.com/install.sh | sh

For Windows, download the official installer from the Ollama website.

Once installed, you can verify it's working by running:

ollama --version

Step 2: Download the Qwen3 Models

With Ollama running, you can pull models from the Ollama model hub. Community GGUF builds of the Qwen3 embedding and reranker models are hosted under the dengcao namespace; we'll pull a recommended version of each. The :Q5_K_M tag refers to a specific quantization level that offers a good balance between quality and resource usage.

In your terminal, run the following commands:

# Download the 8B parameter embedding model
ollama pull dengcao/Qwen3-Embedding-8B:Q5_K_M

# Download the 4B parameter reranker model
ollama pull dengcao/Qwen3-Reranker-4B:Q5_K_M

These downloads might take some time, depending on your internet connection. Once complete, you can see your locally available models by running ollama list.

Part 1: Generating Embeddings with Qwen3

With the embedding model downloaded, let's generate some vectors. We'll use the official ollama Python library. If you don't have it installed, run pip install ollama.

Here’s a simple Python script to generate an embedding for a piece of text:

import ollama

# Define the model name as downloaded
EMBEDDING_MODEL = 'dengcao/Qwen3-Embedding-8B:Q5_K_M'

def get_embedding(text: str):
    """Generates an embedding for a given text."""
    try:
        response = ollama.embeddings(
            model=EMBEDDING_MODEL,
            prompt=text
        )
        return response['embedding']
    except Exception as e:
        print(f"An error occurred: {e}")
        return None

# --- Example Usage ---
sentence = "Ollama makes it easy to run LLMs locally."
embedding = get_embedding(sentence)

if embedding:
    print(f"Embedding for: '{sentence}'")
    # Print the first few dimensions for brevity
    print(f"First 5 dimensions: {embedding[:5]}")
    print(f"Total dimensions: {len(embedding)}")

This script will output the first five values of the generated vector and its total size (which is 4096 for the 8B model). This vector is the numerical representation of our sentence, ready to be stored and compared.
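
Because all vectors from the same embedding model live in a shared space, you can compare them directly. The short follow-up below is an illustrative sketch (the comparison sentences are arbitrary); it reuses the get_embedding helper and the embedding variable from the script above, along with the same cosine similarity measure introduced earlier:

import numpy as np

# Reuses get_embedding() and 'embedding' from the script above
related_vec = get_embedding("Running large language models on your own machine is simple with Ollama.")
unrelated_vec = get_embedding("The capital of France is Paris.")

if embedding and related_vec and unrelated_vec:
    a = np.array(embedding)
    b = np.array(related_vec)
    c = np.array(unrelated_vec)
    # Cosine similarity: a higher value means the sentences are semantically closer
    print(f"Related sentence:   {np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)):.4f}")
    print(f"Unrelated sentence: {np.dot(a, c) / (np.linalg.norm(a) * np.linalg.norm(c)):.4f}")

The related sentence should score noticeably higher, which is exactly the property that powers the retrieval step of a RAG pipeline.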

Part 2: Refining Results with the Qwen3 Reranker

Using the reranker is slightly different. Ollama doesn't expose a dedicated rerank endpoint, so we use the standard chat endpoint instead, crafting a prompt that asks the model to act as a relevance judge: it takes a query and a document as input and outputs a relevance verdict.

Let's create a Python function to handle this. We'll ask the model to return a simple "Yes" or "No" to indicate relevance, which we can easily convert into a score.

import ollama

# Define the model name as downloaded
RERANKER_MODEL = 'dengcao/Qwen3-Reranker-4B:Q5_K_M'

def rerank_document(query: str, document: str) -> float:
    """
    Uses the Qwen3 Reranker to score the relevance of a document to a query.
    Returns a score of 1.0 for 'Yes' and 0.0 for 'No'.
    """
    prompt = f"""
    You are an expert relevance grader. Your task is to evaluate if the
    following document is relevant to the user's query.
    Please answer with a simple 'Yes' or 'No'.

    Query: {query}
    Document: {document}
    """
    try:
        response = ollama.chat(
            model=RERANKER_MODEL,
            messages=[{'role': 'user', 'content': prompt}],
            options={'temperature': 0.0} # For deterministic output
        )
        answer = response['message']['content'].strip().lower()
        if 'yes' in answer:
            return 1.0
        return 0.0
    except Exception as e:
        print(f"An error occurred during reranking: {e}")
        return 0.0

# --- Example Usage ---
user_query = "How do I run models locally?"
doc1 = "Ollama is a tool for running large language models on your own computer."
doc2 = "The capital of France is Paris."

score1 = rerank_document(user_query, doc1)
score2 = rerank_document(user_query, doc2)

print(f"Relevance of Doc 1: {'Relevant' if score1 > 0.5 else 'Not Relevant'} (Score: {score1})")
print(f"Relevance of Doc 2: {'Relevant' if score2 > 0.5 else 'Not Relevant'} (Score: {score2})")

This function demonstrates how to interact with the reranker. It correctly identifies that doc1 is highly relevant to the query while doc2 is not.
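
The binary Yes/No approach is simple and robust, but it cannot break ties between several relevant documents. If you need finer-grained ordering, one option (a sketch of an alternative prompt, not part of the workflow above) is to ask the model for a numeric rating and parse it, falling back to 0.0 when no number can be extracted:

import re
import ollama

RERANKER_MODEL = 'dengcao/Qwen3-Reranker-4B:Q5_K_M'

def rerank_document_scored(query: str, document: str) -> float:
    """Asks the reranker for a 0-10 relevance rating and normalizes it to 0.0-1.0."""
    prompt = f"""
    You are an expert relevance grader. Rate how relevant the following document
    is to the user's query on a scale from 0 (irrelevant) to 10 (perfectly relevant).
    Answer with the number only.

    Query: {query}
    Document: {document}
    """
    try:
        response = ollama.chat(
            model=RERANKER_MODEL,
            messages=[{'role': 'user', 'content': prompt}],
            options={'temperature': 0.0}  # For deterministic output
        )
        # Extract the first number in the reply and scale it to 0.0-1.0
        match = re.search(r'\d+(\.\d+)?', response['message']['content'])
        if match:
            return min(float(match.group()), 10.0) / 10.0
    except Exception as e:
        print(f"An error occurred during reranking: {e}")
    return 0.0

How reliably the model follows the numeric format depends on the model and quantization, so compare its behavior against the simpler Yes/No version before relying on it.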

Putting It All Together: A Simple RAG Implementation

Now for the main event. Let's build a mini-RAG pipeline that uses both models to retrieve and rank the most relevant context for a query from a small knowledge base. The script below reuses the rerank_document function from Part 2, and for the similarity search we'll use numpy; install it with pip install numpy.

import ollama
import numpy as np

# --- Model Definitions ---
EMBEDDING_MODEL = 'dengcao/Qwen3-Embedding-8B:Q5_K_M'
RERANKER_MODEL = 'dengcao/Qwen3-Reranker-4B:Q5_K_M'

# --- 1. Corpus and Offline Embedding Generation ---
documents = [
    "The Qwen3 series of models was developed by Alibaba Cloud.",
    "Ollama provides a simple command-line interface for running LLMs.",
    "A reranker model refines search results by calculating a precise relevance score.",
    "To install Ollama on Linux, you can use a curl command.",
    "Embedding models convert text into numerical vectors for semantic search.",
]

# In a real application, you would store these embeddings in a vector database
corpus_embeddings = []
print("Generating embeddings for the document corpus...")
for doc in documents:
    response = ollama.embeddings(model=EMBEDDING_MODEL, prompt=doc)
    corpus_embeddings.append(response['embedding'])
print("Embeddings generated.")

def cosine_similarity(v1, v2):
    """Calculates cosine similarity between two vectors."""
    return np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))

# --- 2. Online Retrieval and Reranking ---
user_query = "How do I install Ollama?"

# Embed the user's query
query_embedding = ollama.embeddings(model=EMBEDDING_MODEL, prompt=user_query)['embedding']

# Perform initial retrieval (semantic search)
retrieval_scores = [cosine_similarity(query_embedding, emb) for emb in corpus_embeddings]
top_k_indices = np.argsort(retrieval_scores)[::-1][:3] # Get top 3 results

print("\n--- Initial Retrieval Results (before reranking) ---")
for i in top_k_indices:
    print(f"Score: {retrieval_scores[i]:.4f} - Document: {documents[i]}")

# --- 3. Rerank the top results ---
retrieved_docs = [documents[i] for i in top_k_indices]

print("\n--- Reranking the top results ---")
reranked_scores = [rerank_document(user_query, doc) for doc in retrieved_docs]

# Combine documents with their new scores and sort
reranked_results = sorted(zip(retrieved_docs, reranked_scores), key=lambda x: x[1], reverse=True)

print("\n--- Final Results (after reranking) ---")
for doc, score in reranked_results:
    print(f"Relevance Score: {score:.2f} - Document: {doc}")

When you run this script, you should see the power of the two-stage process. The initial retrieval surfaces documents related to "Ollama" and "installing," and the reranker then confirms which of them actually answer the question, so documents like the one about the curl install command rise to the top of the final list with a perfect score.
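
To complete the end-to-end RAG loop, you can hand the top reranked document to a general-purpose chat model as context for answer generation. The snippet below is a minimal sketch that assumes you have also pulled a chat model such as qwen3:8b (ollama pull qwen3:8b); substitute any chat model you have installed locally:

# Assumption: a general-purpose chat model is available locally, e.g. 'qwen3:8b'
GENERATION_MODEL = 'qwen3:8b'

# Take the highest-scoring document after reranking
best_doc = reranked_results[0][0]

rag_prompt = f"""Answer the user's question using only the context below.

Context: {best_doc}

Question: {user_query}"""

answer = ollama.chat(
    model=GENERATION_MODEL,
    messages=[{'role': 'user', 'content': rag_prompt}],
)
print("\n--- Generated Answer ---")
print(answer['message']['content'])

With that, the retrieve, rerank, and generate stages are all running locally.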

Conclusion

You have now successfully set up and used one of the most powerful open-source AI duos available today, right on your local machine. By combining the broad reach of the Qwen3 embedding model with the sharp precision of the Qwen3 reranker, you can build applications that understand and process language with a level of nuance that was previously the exclusive domain of large, proprietary systems.

The journey doesn't end here. You can experiment with different model sizes, try various quantization levels, and integrate this pipeline into more complex applications. The ability to run these tools locally unlocks a world of possibilities, empowering you to create, innovate, and explore without compromising on privacy or performance. Welcome to the new era of local, open-source AI.

