MiMo-7B-RL: the Reasoning LLM from Xiaomi

Xiaomi's LLM-Core Team presents MiMo-7B-RL, challenging the idea that top-tier reasoning in AI requires massive models. This 7-billion-parameter model, specifically engineered for mathematical and coding tasks, demonstrates performance that rivals much larger models and specialized systems like OpenAI's o1-mini. The result comes from optimizing the entire model lifecycle, from pre-training data through reinforcement learning, and suggests that strong reasoning can be unlocked in smaller, more efficient models.

What is MiMo-7B?

The development of MiMo-7B hinges on the belief that a model's fundamental reasoning capability is established during pre-training. While later fine-tuning stages are essential, the initial foundation is critical. Xiaomi identified that many smaller models struggle with complex reasoning because their base training lacks sufficient exposure to logical patterns.

To counter this, MiMo's pre-training was meticulously designed to maximize "reasoning pattern density." This involved sophisticated data processing: enhancing text extraction to capture complex structures in technical documents and code, applying multi-dimensional filters to concentrate reasoning examples, and generating vast synthetic datasets embodying logical steps and problem-solving. A three-stage data mixture strategy was employed during pre-training, utilizing approximately 25 trillion tokens to build the MiMo-7B-Base model.

Furthermore, Xiaomi incorporated Multi-Token Prediction (MTP) as an auxiliary training objective. With MTP, the model learns to predict several tokens ahead rather than only the next one, which may improve its grasp of complex dependencies. At inference time, the extra predictions can serve as draft tokens for speculative decoding, which can speed up generation.
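To make the speculative decoding idea concrete, here is a toy sketch of a greedy draft-and-verify loop. It is not Xiaomi's implementation: `target_next` and `draft_next` are hypothetical stand-ins for the main model and a cheap MTP-style draft head, and tokens are plain integers. The parameter name `num_speculative_tokens` is borrowed from the vLLM example later in this article only for readability.

```python
from typing import Callable, List, Tuple

Token = int
NextFn = Callable[[List[Token]], Token]  # maps a token sequence to the next token


def speculative_step(prefix: List[Token], target_next: NextFn,
                     draft_next: NextFn, num_speculative_tokens: int = 1) -> List[Token]:
    """Run one greedy draft-and-verify step and return the accepted tokens."""
    # 1. The cheap draft head proposes a few tokens ahead.
    drafted: List[Token] = []
    for _ in range(num_speculative_tokens):
        drafted.append(draft_next(prefix + drafted))

    # 2. The main model checks each drafted token in order.
    #    (A real engine would verify all positions in one batched forward pass.)
    accepted: List[Token] = []
    for tok in drafted:
        expected = target_next(prefix + accepted)
        if tok != expected:
            accepted.append(expected)  # keep the main model's token and stop
            return accepted
        accepted.append(tok)

    # 3. Every draft was accepted, so verification also yields one bonus token.
    accepted.append(target_next(prefix + accepted))
    return accepted


def generate(prompt: List[Token], target_next: NextFn, draft_next: NextFn,
             max_new_tokens: int, num_speculative_tokens: int = 1) -> Tuple[List[Token], int]:
    """Generate tokens and report how many verification steps were needed."""
    out = list(prompt)
    steps = 0
    while len(out) - len(prompt) < max_new_tokens:
        out.extend(speculative_step(out, target_next, draft_next, num_speculative_tokens))
        steps += 1
    return out[: len(prompt) + max_new_tokens], steps


if __name__ == "__main__":
    def toy_target(seq: List[Token]) -> Token:
        return (seq[-1] + 1) % 10  # toy "main model": counts upward modulo 10

    print(generate([0], toy_target, toy_target, 6))      # accurate drafts
    print(generate([0], toy_target, lambda seq: 0, 6))   # inaccurate drafts
```

In this sketch the output always matches plain greedy decoding with the main model. Accurate drafts only reduce the number of verification steps, which is where the potential speedup comes from.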

Advanced Reinforcement Learning

Building upon the fine-tuned MiMo-7B-SFT model, the Reinforcement Learning (RL) phase specifically targets math and code proficiency. A high-quality dataset of 130,000 carefully curated math and code problems, all verifiable through rule-based checks (like unit tests or numerical validation), formed the basis for training.
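As a rough illustration of what a rule-based numerical check might look like, the sketch below scores a math response by comparing the last number it contains against a reference answer. The function names, the "last number wins" extraction rule, and the tolerance are all assumptions made for this example. The article does not describe Xiaomi's actual verifier.

```python
import math
import re
from typing import Optional


def extract_final_number(text: str) -> Optional[float]:
    """Return the last number that appears in the text, or None if there is none."""
    matches = re.findall(r"-?\d+(?:\.\d+)?", text)
    return float(matches[-1]) if matches else None


def math_reward(response: str, reference: float, tol: float = 1e-6) -> float:
    """Hypothetical rule-based reward: 1.0 if the final number matches, else 0.0."""
    answer = extract_final_number(response)
    if answer is None:
        return 0.0
    return 1.0 if math.isclose(answer, reference, rel_tol=0.0, abs_tol=tol) else 0.0


if __name__ == "__main__":
    print(math_reward("Adding the terms gives 40 + 2 = 42.", 42.0))
    print(math_reward("I believe the total is 41.", 42.0))
```

Because the check is a fixed rule rather than a learned judge, the model cannot earn reward by producing text that merely looks convincing.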

To ensure genuine capability improvement and avoid "reward hacking," only objective, rule-based accuracy rewards were used. A novel "test difficulty driven code reward" system was introduced to tackle the sparse reward problem inherent in complex code generation. Instead of an all-or-nothing reward, this system grants partial credit for passing easier test cases within a problem, providing a denser gradient signal for the model to learn from.
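The difference between an all-or-nothing reward and a partial-credit reward can be sketched as follows. This is a simplified illustration, not the scheme from Xiaomi's work: the tier names, the weights in `TIER_WEIGHTS`, and the rule that a tier only counts when all of its tests pass are assumptions chosen for the example.

```python
from typing import Dict, List

# Hypothetical difficulty tiers and weights (they sum to 1.0).
TIER_WEIGHTS: Dict[str, float] = {"easy": 0.2, "medium": 0.3, "hard": 0.5}


def all_or_nothing_reward(results: Dict[str, List[bool]]) -> float:
    """Sparse baseline: reward only when every test in every tier passes."""
    return 1.0 if all(all(outcomes) for outcomes in results.values()) else 0.0


def tiered_reward(results: Dict[str, List[bool]],
                  weights: Dict[str, float] = TIER_WEIGHTS) -> float:
    """Partial credit: add a tier's weight when all tests in that tier pass."""
    reward = 0.0
    for tier, outcomes in results.items():
        if outcomes and all(outcomes):
            reward += weights[tier]
    return reward


if __name__ == "__main__":
    # A solution that handles the easy cases but fails harder ones.
    results = {"easy": [True, True], "medium": [True, False], "hard": [False]}
    print(all_or_nothing_reward(results))  # no signal at all
    print(tiered_reward(results))          # some credit for the easy tier
```

Under the sparse baseline, a partially correct solution looks identical to a completely wrong one. The tiered version separates the two, which is the denser signal the paragraph above refers to.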

Efficiency was also a priority. As the model improved, a data re-sampling strategy down-weighted easier problems so that training focused on more challenging examples. Xiaomi also developed a "Seamless Rollout Engine," an optimized RL infrastructure that integrates continuous generation, asynchronous reward calculation, and early termination to minimize GPU idle time. It reportedly makes training 2.29x faster and validation 1.96x faster.
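One simple way to down-weight easy problems is to sample each problem with a probability tied to how often the current policy fails it. The sketch below does that with a small floor so that solved problems are still seen occasionally. The weighting formula, the `floor` value, and the function names are assumptions for illustration, and Xiaomi's actual re-sampling rule may differ.

```python
import random
from typing import Dict, List


def sampling_weights(pass_rates: Dict[str, float], floor: float = 0.05) -> Dict[str, float]:
    """Weight each problem by its failure rate, never dropping below `floor`."""
    return {pid: max(1.0 - rate, floor) for pid, rate in pass_rates.items()}


def sample_batch(pass_rates: Dict[str, float], batch_size: int,
                 rng: random.Random) -> List[str]:
    """Draw a training batch (with replacement) that favors harder problems."""
    weights = sampling_weights(pass_rates)
    ids = list(weights)
    return rng.choices(ids, weights=[weights[i] for i in ids], k=batch_size)


if __name__ == "__main__":
    # Hypothetical per-problem pass rates measured on recent rollouts.
    rates = {"solved": 1.0, "borderline": 0.5, "unsolved": 0.0}
    batch = sample_batch(rates, 1000, random.Random(0))
    for pid in rates:
        print(pid, batch.count(pid))
```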

MiMo-7B-RL Family: A Quick Look

Xiaomi has released several models showcasing the development stages:

Model	Description
MiMo-7B-Base	Base model with strong inherent reasoning potential
MiMo-7B-RL-Zero	RL applied directly to the base model
MiMo-7B-SFT	Supervised Fine-Tuned model from the base
MiMo-7B-RL	RL applied to the SFT model, top reasoning performance

MiMo-7B-RL Benchmarks

Evaluation results highlight MiMo-7B-RL's strengths relative to leading models. MiMo-7B-RL was evaluated with a generation temperature of 0.6.

Comparative Performance:

Benchmark	GPT-4o-0513	Claude-3.5-Sonnet-1022	OpenAI o1-mini	MiMo-7B-RL
Mathematics
MATH-500(Pass@1)	74.6	78.3	90.0	95.8
AIME 2024(Pass@1)	9.3	16.0	63.6	68.2
AIME 2025(Pass@1)	11.6	7.4	50.7	55.4
Code
LiveCodeBench v5(Pass@1)	32.9	38.9	53.8	57.8
LiveCodeBench v6(Pass@1)	30.9	37.2	46.8	49.3

(Selected math/code benchmarks shown)

MiMo-7B-RL demonstrates exceptional performance in mathematics and coding, exceeding significantly larger models and specialized reasoning models like o1-mini on challenging benchmarks such as MATH-500, AIME, and recent LiveCodeBench versions. While its general reasoning scores are strong for its size, they naturally trail the largest frontier models, reflecting its specialized training focus.
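For readers unfamiliar with the Pass@1 metric used in these tables, the sketch below implements the widely used unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), where n is the number of samples drawn per problem and c is the number that are correct. For k = 1 it reduces to the fraction of correct samples. The article does not state how many samples per problem were used, so the numbers in the usage example are made up.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n samples of which c are correct."""
    if n - c < k:
        # Fewer than k incorrect samples: any k samples include a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    # Hypothetical problem: 32 samples drawn, 8 of them correct.
    print(pass_at_k(n=32, c=8, k=1))
```

A benchmark score is then the average of this per-problem value over all problems.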

Performance Within the MiMo Series:

Benchmark	MiMo-7B-Base	MiMo-7B-RL-Zero	MiMo-7B-SFT	MiMo-7B-RL
Mathematics
MATH-500(Pass@1)	37.4	93.6	93.0	95.8
AIME 2024(Pass@1)	32.9	56.4	58.7	68.2
Code
LiveCodeBench v5(Pass@1)	32.9	49.1	52.3	57.8

This internal comparison illustrates the effectiveness of each training stage. The base model shows strong initial reasoning, which is significantly boosted by SFT, and further refined to peak performance by the final RL phase targeting math and code. Applying RL directly to the base (RL-Zero) is effective, but the SFT intermediate step appears beneficial for achieving the highest scores.

Running MiMo-7B-RL

The models are readily available on the Hugging Face Hub.

Model Access:

Find MiMo-7B-RL and the other models in the series on the XiaomiMiMo organization page on Hugging Face. The model has approximately 7.83 billion parameters, stored as BF16 weights in Safetensors format.

Running Inference with vLLM (Recommended)

Xiaomi recommends using their fork of vLLM (based on v0.7.3) for inference, as it supports the Multi-Token Prediction feature for potentially faster generation.

Using the Xiaomi vLLM Fork (with MTP):

# Ensure Xiaomi's vLLM fork is installed
from vllm import LLM, SamplingParams

# Source: https://huggingface.co/XiaomiMiMo/MiMo-7B-RL Model Card
model_path = "/path/to/XiaomiMiMo/MiMo-7B-RL" # Replace with your download path

llm = LLM(
    model=model_path,
    trust_remote_code=True,  # Required for MiMo's custom code
    num_speculative_tokens=1, # Enables MTP speculative decoding
    disable_log_stats=False
)
# Recommended sampling temperature for benchmark replication
sampling_params = SamplingParams(temperature=0.6)

# Example conversation structure (empty system prompt recommended)
conversation = [
    {
        "role": "system",
        "content": "" # Use an empty system prompt
    },
    {
        "role": "user",
        "content": "Write a python function to compute the nth Fibonacci number.",
    },
]

# Generate the response
outputs = llm.chat(conversation,
                   sampling_params=sampling_params,
                   use_tqdm=False)

# Process and print output
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}")
    print("-" * 20)
    print(f"Generated text: {generated_text!r}")

print("=" * 80)

Using Standard vLLM (without MTP):
If not using the MTP feature or using a standard vLLM build, register the MiMo architecture first using the register_mimo_in_vllm.py script provided by Xiaomi.

# Source: https://huggingface.co/XiaomiMiMo/MiMo-7B-RL Model Card
# Ensure register_mimo_in_vllm.py is accessible
import register_mimo_in_vllm

from vllm import LLM, SamplingParams

model_path = "/path/to/XiaomiMiMo/MiMo-7B-RL" # Replace with your download path
llm = LLM(
    model=model_path,
    trust_remote_code=True,
    # Do not set num_speculative_tokens if not using MTP
    disable_log_stats=False
)
sampling_params = SamplingParams(temperature=0.6)

# Conversation setup and generation call is the same as the MTP example...
conversation = [
    {"role": "system", "content": ""},
    {"role": "user", "content": "Write a python function to compute the nth Fibonacci number."},
]
outputs = llm.chat(conversation, sampling_params=sampling_params, use_tqdm=False)
# Processing output is the same...
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}\n{'-'*20}\nGenerated text: {generated_text!r}")

Using HuggingFace Transformers

Inference with the standard HuggingFace transformers library is also possible. Remember that trust_remote_code=True is required.

# Source: https://huggingface.co/XiaomiMiMo/MiMo-7B-RL Model Card
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "/path/to/XiaomiMiMo/MiMo-7B-RL" # Replace with your download path

# Load the model and tokenizer
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    trust_remote_code=True, # Essential for loading MiMo
    device_map="auto"       # Place weights on available devices (GPU if present); requires the accelerate package
)
tokenizer = AutoTokenizer.from_pretrained(model_path)

# Prepare the input prompt
prompt = "Write a python function to compute the nth Fibonacci number."
# Tokenize the input
inputs = tokenizer([prompt], return_tensors='pt').to(model.device)

# Generate the output sequence
output_sequences = model.generate(
    **inputs,
    max_new_tokens=256,      # Control output length
    temperature=0.6,         # Recommended temperature
    do_sample=True           # Sampling must be enabled for temperature to take effect
)

# Decode the output
generated_text = tokenizer.decode(output_sequences[0], skip_special_tokens=True)
print(generated_text)

Usage Recommendations

For best results, especially when trying to replicate benchmark scores, use the recommended setup: Xiaomi's vLLM fork (based on v0.7.3) and an empty system prompt.

Final Thoughts: Efficient Reasoning Realized by Xiaomi?

Xiaomi's MiMo-7B-RL demonstrates that exceptional reasoning performance in specialized domains like mathematics and coding is achievable without resorting to enormous model sizes. Through careful pre-training focused on reasoning patterns and innovative reinforcement learning techniques, they've created an efficient model that competes effectively with much larger counterparts. The open release of the MiMo series provides valuable tools and insights, pushing forward the development of powerful, accessible AI reasoning capabilities.
