The world of large language models (LLMs) is expanding at an explosive pace, but for a long time, accessing state-of-the-art capabilities meant relying on cloud-based APIs. This dependency often comes with concerns about privacy, cost, and customization. The tide is turning, however, thanks to powerful open-source models and tools like Ollama that make running them on your local machine easier than ever.
Among the most exciting recent developments is the release of the Qwen3 model family by Alibaba Cloud. These models, particularly the specialized embedding and reranker versions, are setting new benchmarks in performance. When paired with Ollama, they provide a potent toolkit for developers and researchers looking to build sophisticated AI applications, such as advanced search engines and Retrieval-Augmented Generation (RAG) systems, all from the comfort of their own hardware.
This article is your comprehensive, step-by-step guide to harnessing this power. We will demystify what embedding and reranker models are, walk through the setup of Ollama, and provide practical Python code to run the Qwen3 embedding and reranker models for a complete, end-to-end RAG workflow.
Want an integrated, All-in-One platform for your Developer Team to work together with maximum productivity?
Apidog delivers on all your demands and replaces Postman at a much more affordable price!
The Power Duo: Understanding Embedding and Reranker Models
Before we dive into the "how," let's understand the "what." In the context of a RAG system, embedding and reranker models play distinct but complementary roles.
1. The Librarian: The Embedding Model
Imagine a massive library with millions of books but no catalog system. Finding information would be a nightmare. An embedding model is like a hyper-intelligent librarian who reads every single document and assigns it a specific location in a vast, multi-dimensional "concept space."
Technically, a text embedding is a dense vector of numbers that represents a piece of text (a word, a sentence, or an entire document). This vector captures the semantic meaning of the text: documents with similar meanings will have vectors that are "close" to each other in this space.
When you have a query, the embedding model converts your question into a vector and then searches the library for document vectors that are closest to it. This initial search is incredibly efficient and retrieves a broad set of potentially relevant documents.
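To make that notion of "closeness" concrete, here is a tiny illustrative sketch. The three-dimensional vectors below are made up purely for demonstration (real embeddings have thousands of dimensions), but the cosine similarity math is the same one we use later in this article:

import numpy as np

# Toy "embeddings" with made-up values; real models produce thousands of dimensions
vec_cat = np.array([0.90, 0.20, 0.10])
vec_kitten = np.array([0.85, 0.25, 0.05])
vec_car = np.array([0.10, 0.10, 0.95])

def cosine(a, b):
    """Cosine similarity: close to 1.0 means very similar, close to 0.0 means unrelated."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(vec_cat, vec_kitten))  # high score: similar meanings sit close together
print(cosine(vec_cat, vec_car))     # lower score: unrelated meanings sit far apart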
2. The Expert Consultant: The Reranker Model
The initial retrieval from the embedding model is fast, but it's not always perfect. It might pull in documents that are thematically related but don't precisely answer the user's query. This is where the reranker comes in.
If the embedding model is a generalist librarian, the reranker is a subject-matter expert. It takes the top results from the initial search and performs a more nuanced, computationally intensive analysis. Instead of comparing a query vector to document vectors independently, a reranker model (typically a cross-encoder) looks at the query and each document as a pair.
This direct comparison allows the reranker to calculate a much more accurate relevance score. It then re-orders the documents based on this score, pushing the most relevant results to the top. This two-stage process—a fast initial retrieval followed by a precise reranking—is the secret to state-of-the-art RAG performance.
Meet the Qwen3 Models: A New Standard in Open-Source AI
The Qwen3 series from Alibaba Cloud isn't just another set of models; it represents a significant leap forward in open-source NLP. Here’s what makes the embedding and reranker models stand out:
- State-of-the-Art Performance: The Qwen3-Embedding-8B model, at the time of its release, shot to the #1 spot on the highly competitive MTEB (Massive Text Embedding Benchmark) multilingual leaderboard, outperforming many established models.
- Exceptional Multilingual Capability: Trained on a vast corpus, the models support over 100 languages, making them ideal for building global-scale applications.
- Flexible and Efficient: The models come in various sizes (e.g., 0.6B, 4B, and 8B parameters), allowing developers to choose the best balance of performance, speed, and hardware requirements. They also support various quantization levels, which further reduce their memory footprint with minimal impact on accuracy.
- Instruction-Aware: You can provide custom instructions to the models to tailor their performance for specific tasks or domains, a feature that can yield significant performance gains.
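As a quick illustration of that last point, the Qwen3 embedding models are designed to accept a short task description alongside the query. The sketch below assumes the "Instruct: ... / Query: ..." format recommended in the model's documentation; check the model card for the exact wording your build expects, and treat the task text itself as an example to adapt:

import ollama

EMBEDDING_MODEL = 'dengcao/Qwen3-Embedding-8B:Q5_K_M'

# Example task instruction (an assumption; tailor it to your domain).
# The Qwen3 documentation suggests prefixing queries, not documents, this way.
task = "Given a web search query, retrieve relevant passages that answer the query"
query = "How do I run models locally?"

instructed_query = f"Instruct: {task}\nQuery: {query}"
response = ollama.embeddings(model=EMBEDDING_MODEL, prompt=instructed_query)
print(f"Instruction-aware embedding has {len(response['embedding'])} dimensions")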
Setting Up Your Local AI Environment
Now, let's get our hands dirty. The first step is to set up Ollama and download the Qwen3 models.
Step 1: Install Ollama
Ollama provides a simple, single-line installation command for macOS and Linux. Open your terminal and run:
curl -fsSL https://ollama.com/install.sh | sh
For Windows, download the official installer from the Ollama website.
Once installed, you can verify it's working by running:
ollama --version
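If you also want to confirm that the background server is reachable (it listens on port 11434 by default), a quick request to the local API should reply with a short status message along the lines of "Ollama is running":

curl http://localhost:11434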
Step 2: Download the Qwen3 Models
With Ollama running, you can pull models from the Ollama library. The Qwen3 models we'll use are hosted under the `dengcao` namespace. We'll pull recommended versions of the embedding and reranker models; the `:Q5_K_M` tag signifies a specific quantization level that offers a great balance between performance and resource usage.
In your terminal, run the following commands:
# Download the 8B parameter embedding model
ollama pull dengcao/Qwen3-Embedding-8B:Q5_K_M
# Download the 4B parameter reranker model
ollama pull dengcao/Qwen3-Reranker-4B:Q5_K_M
These downloads might take some time, depending on your internet connection. Once complete, you can see your locally available models by running `ollama list`.
Part 1: Generating Embeddings with Qwen3
With the embedding model downloaded, let's generate some vectors. We'll use the official `ollama` Python library. If you don't have it installed, run `pip install ollama`.
Here’s a simple Python script to generate an embedding for a piece of text:
import ollama
# Define the model name as downloaded
EMBEDDING_MODEL = 'dengcao/Qwen3-Embedding-8B:Q5_K_M'
def get_embedding(text: str):
    """Generates an embedding for a given text."""
    try:
        response = ollama.embeddings(
            model=EMBEDDING_MODEL,
            prompt=text
        )
        return response['embedding']
    except Exception as e:
        print(f"An error occurred: {e}")
        return None

# --- Example Usage ---
sentence = "Ollama makes it easy to run LLMs locally."
embedding = get_embedding(sentence)

if embedding:
    print(f"Embedding for: '{sentence}'")
    # Print the first few dimensions for brevity
    print(f"First 5 dimensions: {embedding[:5]}")
    print(f"Total dimensions: {len(embedding)}")
This script will output the first five values of the generated vector and its total size (which is 4096 for the 8B model). This vector is the numerical representation of our sentence, ready to be stored and compared.
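If you need vectors for many texts at once, recent versions of the ollama Python library also expose a batch-style call, ollama.embed, which accepts a list of inputs. The sketch below assumes your installed version supports it; if it doesn't, simply loop over get_embedding() as shown above:

import ollama

EMBEDDING_MODEL = 'dengcao/Qwen3-Embedding-8B:Q5_K_M'

texts = [
    "Ollama makes it easy to run LLMs locally.",
    "Embedding models map text into a shared vector space.",
]

# One call, many inputs; note the plural 'embeddings' key in the response
response = ollama.embed(model=EMBEDDING_MODEL, input=texts)
for text, vector in zip(texts, response['embeddings']):
    print(f"{len(vector)} dimensions for: '{text}'")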
Part 2: Refining Results with the Qwen3 Reranker
Using the reranker is slightly different. Instead of a dedicated `rerank` endpoint, we use the standard `chat` endpoint. We craft a specific prompt that asks the model to act as a reranker, taking a query and a document as input and outputting a relevance score.
Let's create a Python function to handle this. We'll ask the model to return a simple "Yes" or "No" to indicate relevance, which we can easily convert into a score.
import ollama
# Define the model name as downloaded
RERANKER_MODEL = 'dengcao/Qwen3-Reranker-4B:Q5_K_M'
def rerank_document(query: str, document: str) -> float:
    """
    Uses the Qwen3 Reranker to score the relevance of a document to a query.
    Returns a score of 1.0 for 'Yes' and 0.0 for 'No'.
    """
    prompt = f"""
You are an expert relevance grader. Your task is to evaluate if the
following document is relevant to the user's query.
Please answer with a simple 'Yes' or 'No'.

Query: {query}
Document: {document}
"""
    try:
        response = ollama.chat(
            model=RERANKER_MODEL,
            messages=[{'role': 'user', 'content': prompt}],
            options={'temperature': 0.0}  # For deterministic output
        )
        answer = response['message']['content'].strip().lower()
        if 'yes' in answer:
            return 1.0
        return 0.0
    except Exception as e:
        print(f"An error occurred during reranking: {e}")
        return 0.0
# --- Example Usage ---
user_query = "How do I run models locally?"
doc1 = "Ollama is a tool for running large language models on your own computer."
doc2 = "The capital of France is Paris."
score1 = rerank_document(user_query, doc1)
score2 = rerank_document(user_query, doc2)
print(f"Relevance of Doc 1: {'Relevant' if score1 > 0.5 else 'Not Relevant'} (Score: {score1})")
print(f"Relevance of Doc 2: {'Relevant' if score2 > 0.5 else 'Not Relevant'} (Score: {score2})")
This function demonstrates how to interact with the reranker. It correctly identifies that `doc1` is highly relevant to the query while `doc2` is not.
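A binary Yes/No is often all you need, but you can coax a more graded signal out of the same chat endpoint by asking for a number instead. The variant below is just one possible prompt design, not an official reranker API, and the parsing may need hardening for production use:

import re
import ollama

RERANKER_MODEL = 'dengcao/Qwen3-Reranker-4B:Q5_K_M'

def rerank_document_scored(query: str, document: str) -> float:
    """Asks the reranker for a 0-10 relevance rating and normalizes it to 0.0-1.0."""
    prompt = (
        "Rate how relevant the following document is to the query on a scale "
        "from 0 (irrelevant) to 10 (perfectly relevant). "
        "Answer with a single number only.\n\n"
        f"Query: {query}\nDocument: {document}"
    )
    response = ollama.chat(
        model=RERANKER_MODEL,
        messages=[{'role': 'user', 'content': prompt}],
        options={'temperature': 0.0},
    )
    match = re.search(r'\d+(?:\.\d+)?', response['message']['content'])
    return min(float(match.group()), 10.0) / 10.0 if match else 0.0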
Putting It All Together: A Simple RAG Implementation
Now for the main event. Let's build a mini-RAG pipeline that uses both of our models to answer a query from a small knowledge base. For the similarity search, we'll use `numpy`. Install it with `pip install numpy`.
import ollama
import numpy as np
# --- Model Definitions ---
EMBEDDING_MODEL = 'dengcao/Qwen3-Embedding-8B:Q5_K_M'
RERANKER_MODEL = 'dengcao/Qwen3-Reranker-4B:Q5_K_M'
# --- 1. Corpus and Offline Embedding Generation ---
documents = [
"The Qwen3 series of models was developed by Alibaba Cloud.",
"Ollama provides a simple command-line interface for running LLMs.",
"A reranker model refines search results by calculating a precise relevance score.",
"To install Ollama on Linux, you can use a curl command.",
"Embedding models convert text into numerical vectors for semantic search.",
]
# In a real application, you would store these embeddings in a vector database
corpus_embeddings = []
print("Generating embeddings for the document corpus...")
for doc in documents:
    response = ollama.embeddings(model=EMBEDDING_MODEL, prompt=doc)
    corpus_embeddings.append(response['embedding'])
print("Embeddings generated.")

def cosine_similarity(v1, v2):
    """Calculates cosine similarity between two vectors."""
    return np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
# --- 2. Online Retrieval and Reranking ---
user_query = "How do I install Ollama?"
# Embed the user's query
query_embedding = ollama.embeddings(model=EMBEDDING_MODEL, prompt=user_query)['embedding']
# Perform initial retrieval (semantic search)
retrieval_scores = [cosine_similarity(query_embedding, emb) for emb in corpus_embeddings]
top_k_indices = np.argsort(retrieval_scores)[::-1][:3] # Get top 3 results
print("\n--- Initial Retrieval Results (before reranking) ---")
for i in top_k_indices:
    print(f"Score: {retrieval_scores[i]:.4f} - Document: {documents[i]}")
# --- 3. Rerank the top results ---
retrieved_docs = [documents[i] for i in top_k_indices]
print("\n--- Reranking the top results ---")
reranked_scores = [rerank_document(user_query, doc) for doc in retrieved_docs]
# Combine documents with their new scores and sort
reranked_results = sorted(zip(retrieved_docs, reranked_scores), key=lambda x: x[1], reverse=True)
print("\n--- Final Results (after reranking) ---")
for doc, score in reranked_results:
    print(f"Relevance Score: {score:.2f} - Document: {doc}")
When you run this script, you'll see the power of the two-stage process. The initial retrieval correctly finds documents related to "Ollama" and "installing." However, the reranker then precisely identifies the document about using `curl` as the most relevant, pushing it to the top with a perfect score.
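To close the loop on the "G" in RAG, you can hand the reranked context to any chat model you have pulled locally. Continuing from the script above, the sketch below assumes a qwen3:8b chat model is available; substitute whatever model appears in your `ollama list`:

# --- 4. (Optional) Generate an answer from the reranked context ---
CHAT_MODEL = 'qwen3:8b'  # assumption: replace with any chat model you have pulled

# Keep only the documents the reranker judged relevant
top_context = "\n".join(doc for doc, score in reranked_results if score > 0.5)

answer = ollama.chat(
    model=CHAT_MODEL,
    messages=[{
        'role': 'user',
        'content': (
            "Answer the question using only the context below.\n\n"
            f"Context:\n{top_context}\n\nQuestion: {user_query}"
        ),
    }],
)
print("\n--- Generated Answer ---")
print(answer['message']['content'])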
Conclusion
You have now successfully set up and used one of the most powerful open-source AI tandems available today, right on your local machine. By combining the broad reach of the Qwen3 embedding model with the sharp precision of the Qwen3 reranker, you can build applications that understand and process language with a level of nuance that was previously the exclusive domain of large, proprietary systems.
The journey doesn't end here. You can experiment with different model sizes, try various quantization levels, and integrate this pipeline into more complex applications. The ability to run these tools locally unlocks a world of possibilities, empowering you to create, innovate, and explore without compromising on privacy or performance. Welcome to the new era of local, open-source AI.
Want an integrated, All-in-One platform for your Developer Team to work together with maximum productivity?
Apidog delivers on all your demands and replaces Postman at a much more affordable price!