What Is Microsoft VibeVoice? How to Use the Open-Source Voice AI Models

VibeVoice is Microsoft's open-source voice AI: TTS (90 min, 4 speakers), streaming, and ASR (60 min, 50+ languages). MIT-licensed. Learn to install and use it.

Ashley Innocent

2 April 2026

TL;DR

VibeVoice is Microsoft’s open-source voice AI family with three models: VibeVoice-1.5B for text-to-speech (up to 90 minutes, 4 speakers), VibeVoice-Realtime-0.5B for streaming TTS, and VibeVoice-ASR for speech recognition (60-minute audio, 50+ languages, 7.77% WER). All models are MIT-licensed and run locally. This guide covers installation, usage, and API integration.

Introduction

Microsoft released VibeVoice as an open-source voice AI framework in early 2026. It includes models for both speech synthesis (text-to-speech) and speech recognition (automatic speech recognition), all running locally on your hardware with no cloud dependency.

The framework has three models:

- VibeVoice-1.5B: long-form text-to-speech, generating up to 90 minutes of audio with up to 4 distinct speakers
- VibeVoice-Realtime-0.5B: a lightweight streaming TTS model for low-latency output
- VibeVoice-ASR: speech recognition that transcribes up to 60 minutes of audio across 50+ languages

The TTS models caused controversy after release. Microsoft temporarily disabled the main GitHub repository when they discovered voice cloning misuse. The community forked the code, and Microsoft later re-enabled the repo with added safeguards: an audible AI disclaimer embedded in generated audio and imperceptible watermarking for provenance verification.

VibeVoice-ASR is now available on Azure AI Foundry for cloud deployment. The TTS models remain research-focused with an MIT license.

This guide walks through installation, text-to-speech generation, speech recognition, API integration, and how to test voice AI endpoints with Apidog.

How VibeVoice works: architecture overview

The tokenizer breakthrough

VibeVoice’s core advancement is its continuous speech tokenizers operating at an ultra-low frame rate of 7.5 Hz. For comparison, most speech models process audio at 50-100 Hz. This 7-13x reduction in frame rate means the model handles long sequences (90 minutes of audio) without running out of context.
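The arithmetic behind this claim is worth making concrete. A quick back-of-the-envelope comparison in plain Python:

```python
def frames_for(minutes: float, frame_rate_hz: float) -> int:
    """Number of tokenizer frames needed to represent `minutes` of audio."""
    return int(minutes * 60 * frame_rate_hz)

vibevoice = frames_for(90, 7.5)    # VibeVoice's 7.5 Hz tokenizer
typical_lo = frames_for(90, 50)    # low end of typical speech models
typical_hi = frames_for(90, 100)   # high end of typical speech models

print(vibevoice, typical_lo, typical_hi)
# → 40500 270000 540000

# Roughly the 7-13x reduction cited above:
print(f"{typical_lo / vibevoice:.1f}x to {typical_hi / vibevoice:.1f}x")
# → 6.7x to 13.3x
```

At 7.5 Hz, 90 minutes of audio fits in about 40,500 frames, which sits comfortably inside the model's 64K-token context alongside the text; at 50-100 Hz it would not.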

The system uses two tokenizers:

- An acoustic tokenizer that compresses raw waveforms into continuous latent vectors and reconstructs them with high fidelity.
- A semantic tokenizer that captures the linguistic content of the speech, keeping generation aligned with the input text.

Both operate at the same 7.5 Hz frame rate.

Next-token diffusion

The model combines an LLM backbone (Qwen2.5-1.5B) with a lightweight diffusion head (~123M parameters). The LLM handles textual context and dialogue flow. The diffusion head generates high-fidelity acoustic details using DDPM (Denoising Diffusion Probabilistic Models) with Classifier-Free Guidance.

Total parameter count: 3B (including tokenizers and diffusion head).

Training approach

VibeVoice uses curriculum learning, progressively training on longer sequences: 4K, 16K, 32K, then 64K tokens. The pre-trained tokenizers stay frozen during this phase; only the LLM and diffusion head parameters update. This lets the model learn to handle increasingly long audio without forgetting short-form capabilities.

VibeVoice model specifications

| Model | Parameters | Purpose | Max length | Languages | License |
|---|---|---|---|---|---|
| VibeVoice-1.5B | 3B (total) | Text-to-speech | 90 minutes | English, Chinese | MIT |
| VibeVoice-Realtime-0.5B | ~0.5B | Streaming TTS | Long-form | English, Chinese | MIT |
| VibeVoice-ASR | ~9B | Speech recognition | 60 minutes | 50+ languages | MIT |

VibeVoice-1.5B (TTS)

| Specification | Value |
|---|---|
| LLM base | Qwen2.5-1.5B |
| Context length | 64K tokens |
| Max speakers | 4 simultaneous |
| Audio output | 24 kHz WAV, mono |
| Tensor type | BF16 |
| Format | Safetensors |
| HuggingFace downloads | 62,630/month |
| Community forks | 12 fine-tuned variants |

VibeVoice-ASR

| Specification | Value |
|---|---|
| Architecture base | Qwen2.5 |
| Parameters | ~9B |
| Audio processing | Up to 60 minutes in a single pass |
| Frame rate | 7.5 Hz |
| Average WER | 7.77% (across 8 English datasets) |
| LibriSpeech Clean WER | 2.20% |
| TED-LIUM WER | 2.57% |
| Languages | 50+ |
| Output | Structured (Who + When + What) |
| Supported audio | WAV, FLAC, MP3 at 16 kHz+ |

Installation and setup

Prerequisites

- A recent Python 3 environment with pip
- An NVIDIA GPU: 7-8 GB VRAM is enough for the TTS models; the ASR model needs 24 GB+
- git, to clone the repository
- ffmpeg, if you need to convert reference audio for voice cloning

Install VibeVoice TTS

# Clone the repository
git clone https://github.com/microsoft/VibeVoice.git
cd VibeVoice

# Install dependencies
pip install -r requirements.txt

Models download automatically from HuggingFace on first run. You can also pre-download them:

from huggingface_hub import snapshot_download

# Download the 1.5B TTS model
snapshot_download(
    "microsoft/VibeVoice-1.5B",
    local_dir="./models/VibeVoice-1.5B",
    local_dir_use_symlinks=False
)

Install via pip (community package)

pip install vibevoice

Install for ASR

VibeVoice-ASR uses a separate setup:

git clone https://github.com/microsoft/VibeVoice.git
cd VibeVoice
pip install -r requirements-asr.txt

Or deploy through Azure AI Foundry for managed cloud inference.

Generating speech with VibeVoice-1.5B

Single-speaker generation

Create a text file with your script:

Alice: Welcome to the Apidog developer podcast. Today we're covering API testing strategies for 2026.

Run inference:

python demo/inference_from_file.py \
  --model_path microsoft/VibeVoice-1.5B \
  --txt_path script.txt \
  --speaker_names Alice \
  --cfg_scale 1.5

Output saves as a .wav file in the outputs/ directory.
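You can sanity-check the generated file with Python's standard library. A small sketch (the path in the comment is an example; use whatever filename appears in your outputs/ directory):

```python
import wave

def describe_wav(path: str) -> tuple[int, int, float]:
    """Return (channels, sample_rate, duration_seconds) for a WAV file."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        return wav.getnchannels(), rate, wav.getnframes() / rate

# Example (filename depends on your run):
#   channels, rate, seconds = describe_wav("outputs/script_generated.wav")
# VibeVoice outputs 24 kHz mono, so expect channels == 1 and rate == 24000.
```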

Multi-speaker podcast generation

VibeVoice handles up to 4 speakers with consistent voice identities throughout the entire recording:

Alice: Welcome back to the show. Today we have two API experts joining us.
Bob: Thanks for having me. I've been working on REST API design patterns for the past five years.
Carol: And I focus on GraphQL performance optimization. Happy to be here.
Alice: Let's start with the debate everyone wants to hear. REST versus GraphQL for microservices.
Bob: REST gives you clear resource boundaries. Each endpoint maps to a specific resource.
Carol: GraphQL gives you flexibility. One endpoint, and the client decides what data it needs.

Run inference, listing every speaker name that appears in the script:

python demo/inference_from_file.py \
  --model_path microsoft/VibeVoice-1.5B \
  --txt_path podcast_script.txt \
  --speaker_names Alice Bob Carol \
  --cfg_scale 1.5

The model maintains distinct voice characteristics for each speaker across the full conversation, even at 90-minute lengths.

Voice cloning (zero-shot)

Clone a voice from a reference audio sample:

Audio requirements:

- WAV format, 24 kHz sample rate, mono
- Clean, noise-free speech; around 30 seconds or more of the target voice gives better results

Convert existing audio to the right format:

ffmpeg -i source_recording.m4a -ar 24000 -ac 1 reference_voice.wav

Use the Gradio demo interface for voice cloning:

python demo/gradio_demo.py

This launches a web UI at http://127.0.0.1:7860 where you upload your reference audio, select the cloned voice, and generate speech.

Streaming with VibeVoice-Realtime-0.5B

For applications needing low-latency audio output (~300ms first chunk):

python demo/streaming_inference_from_file.py \
  --model_path microsoft/VibeVoice-Realtime-0.5B \
  --txt_path script.txt \
  --speaker_name Alice

The Realtime model is smaller and faster but produces lower fidelity audio than the full 1.5B model. Use it for interactive applications; use the 1.5B for pre-generated content.
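The exact interface depends on the demo script, but the consuming side of any streaming TTS follows the same pattern: handle each audio chunk as it arrives instead of waiting for the full clip. A minimal sketch (fake_stream is a stand-in for the model's real chunk generator, not part of VibeVoice):

```python
from typing import Iterable, Iterator

def consume_audio_stream(chunks: Iterable[bytes]) -> bytes:
    """Collect streamed PCM chunks as they arrive.

    In a real-time app you would hand each chunk to an audio sink
    (speaker, WebSocket, etc.) instead of buffering the whole clip.
    """
    buffer = bytearray()
    for chunk in chunks:
        # Play or forward `chunk` here; the first one arrives in ~300 ms.
        buffer.extend(chunk)
    return bytes(buffer)

def fake_stream() -> Iterator[bytes]:
    """Stand-in generator; swap in the model's real streaming output."""
    yield from (b"\x00\x01", b"\x02\x03")
```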

Using VibeVoice with Python

Pipeline API

from transformers import pipeline
from huggingface_hub import snapshot_download

# Download model
model_path = snapshot_download("microsoft/VibeVoice-1.5B")

# Load pipeline
pipe = pipeline(
    "text-to-speech",
    model=model_path,
    no_processor=False
)

# Prepare multi-speaker script
script = [
    {"role": "Alice", "content": "How do you handle API versioning?"},
    {"role": "Bob", "content": "We use URL path versioning. v1, v2, and so on."},
]

# Apply chat template
input_data = pipe.processor.apply_chat_template(script)

# Generate audio
generate_kwargs = {
    "cfg_scale": 1.5,
    "n_diffusion_steps": 50,
}

output = pipe(input_data, generate_kwargs=generate_kwargs)
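The pipeline returns audio data you still need to write to disk. Assuming the output follows the common transformers text-to-speech convention (a dict with a float "audio" array and a "sampling_rate" int; verify against your installed integration), you can save it with the standard library:

```python
import wave

import numpy as np

def save_wav(audio: np.ndarray, sample_rate: int, path: str) -> None:
    """Write a float waveform in [-1, 1] as 16-bit mono PCM."""
    pcm = (np.clip(audio, -1.0, 1.0) * 32767).astype(np.int16)
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)      # VibeVoice output is mono
        wav.setsampwidth(2)      # 16-bit samples
        wav.setframerate(sample_rate)
        wav.writeframes(pcm.tobytes())

# Assuming `output` from the pipeline above:
#   save_wav(output["audio"], output["sampling_rate"], "dialogue.wav")
```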

FastAPI wrapper for production

The community built a FastAPI wrapper that exposes VibeVoice as an OpenAI-compatible TTS API:

git clone https://github.com/ncoder-ai/VibeVoice-FastAPI.git
cd VibeVoice-FastAPI
docker compose up

This gives you an API endpoint compatible with OpenAI’s TTS format:

curl -X POST http://localhost:8000/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{
    "model": "vibevoice-1.5b",
    "input": "Your API documentation should be a conversation, not a monologue.",
    "voice": "alice"
  }' \
  --output speech.wav

This OpenAI-compatible endpoint means you can test your VibeVoice API integration with Apidog using the same request format you’d use for OpenAI’s TTS API. Import the endpoint, configure your request body, and test voice generation without writing application code.
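The same request from application code, using only the Python standard library (the helper names here are illustrative; the endpoint path and voice follow the wrapper's defaults shown above, so adjust them to your deployment):

```python
import json
import urllib.request

def build_speech_request(text: str, voice: str = "alice") -> dict:
    """Build an OpenAI-compatible TTS request body."""
    return {"model": "vibevoice-1.5b", "input": text, "voice": voice}

def synthesize(text: str, out_path: str,
               base_url: str = "http://localhost:8000") -> None:
    """POST to the wrapper's /v1/audio/speech endpoint and save the WAV."""
    req = urllib.request.Request(
        f"{base_url}/v1/audio/speech",
        data=json.dumps(build_speech_request(text)).encode(),
        headers={"Content-Type": "application/json"},
    )
    # Long-form generation can be slow, hence the generous timeout.
    with urllib.request.urlopen(req, timeout=120) as resp, \
            open(out_path, "wb") as f:
        f.write(resp.read())

# Usage (requires the FastAPI wrapper running locally):
#   synthesize("Your docs should be a conversation.", "speech.wav")
```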

Using VibeVoice-ASR for speech recognition

Basic transcription

python asr_inference.py \
  --model_path microsoft/VibeVoice-ASR \
  --audio_path meeting_recording.wav

Structured output format

VibeVoice-ASR produces structured transcriptions with three fields per segment:

- Who: a speaker label for each segment, from the built-in diarization
- When: start and end timestamps in seconds
- What: the transcribed text

Example output:

{
  "segments": [
    {
      "speaker": "Speaker 1",
      "start": 0.0,
      "end": 4.2,
      "text": "Let's review the API endpoints for the new release."
    },
    {
      "speaker": "Speaker 2",
      "start": 4.5,
      "end": 8.1,
      "text": "I've added three new endpoints for the billing module."
    }
  ]
}
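Because the output is plain JSON, post-processing is straightforward. A small sketch that flattens the segments above into a speaker-labeled transcript (the format_transcript helper is ours, not part of VibeVoice):

```python
def format_transcript(result: dict) -> str:
    """Render ASR segments as '[mm:ss] Speaker: text' lines."""
    lines = []
    for seg in result["segments"]:
        minutes, seconds = divmod(int(seg["start"]), 60)
        lines.append(f"[{minutes:02d}:{seconds:02d}] {seg['speaker']}: {seg['text']}")
    return "\n".join(lines)

result = {
    "segments": [
        {"speaker": "Speaker 1", "start": 0.0, "end": 4.2,
         "text": "Let's review the API endpoints for the new release."},
        {"speaker": "Speaker 2", "start": 4.5, "end": 8.1,
         "text": "I've added three new endpoints for the billing module."},
    ]
}

print(format_transcript(result))
# [00:00] Speaker 1: Let's review the API endpoints for the new release.
# [00:04] Speaker 2: I've added three new endpoints for the billing module.
```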

ASR as an MCP server

VibeVoice-ASR can run as an MCP (Model Context Protocol) server, plugging directly into Claude Code, Cursor, and other AI coding tools:

# Install the MCP server
pip install vibevoice-mcp-server

# Run it
vibevoice-mcp serve

This lets your coding agent transcribe meetings, voice notes, or audio recordings as part of its workflow. You dictate requirements, the MCP server transcribes them, and the coding agent processes the text.

When to use VibeVoice-ASR vs Whisper

Use case Best choice Why
Long meetings (30-60 min) VibeVoice-ASR Single-pass 60-min processing, speaker ID
Interviews with multiple speakers VibeVoice-ASR Built-in diarization
Podcasts needing timestamps VibeVoice-ASR Structured Who/When/What output
Multilingual content (50+ languages) VibeVoice-ASR Broader language support
Short clips in noisy environments Whisper Better noise robustness
Edge/mobile deployment Whisper Smaller model size, wider device support
Non-English languages (specialized) Whisper More mature multilingual fine-tuning

Testing voice AI APIs with Apidog

Whether you’re using the VibeVoice FastAPI wrapper, Azure AI Foundry endpoint, or building your own voice AI API, Apidog helps you test and debug these integrations.

Test the TTS endpoint

  1. Create a new POST request in Apidog pointing to your VibeVoice FastAPI server
  2. Set the request body to the OpenAI-compatible format:
{
  "model": "vibevoice-1.5b",
  "input": "Test speech synthesis with proper intonation and pacing.",
  "voice": "alice",
  "response_format": "wav"
}
  3. Send the request and verify the response headers include the audio/wav content type
  4. Save the response as a WAV file to verify audio quality

Test the ASR endpoint

For speech-to-text APIs:

  1. Set up a POST request with multipart/form-data
  2. Attach your audio file as a form field
  3. Verify the structured JSON response includes speaker IDs, timestamps, and transcribed text

Validate audio API contracts

Voice AI APIs handle binary data (audio files) alongside JSON metadata. Apidog’s request builder handles both: you can send JSON bodies or multipart file uploads from the same interface and inspect the binary responses that come back.

Download Apidog to test your voice AI integrations before deploying to production.

Safety and responsible use

Microsoft added several safeguards after the initial misuse incidents:

- An audible disclaimer embedded in generated audio, identifying it as AI-generated
- Imperceptible watermarking, so the provenance of a clip can be verified

What’s allowed

- Research, development, and commercial use under the MIT license
- Generating speech with voices you have the rights to, or with the speaker’s consent
- Accessibility tools, podcast production, prototyping, and similar applications

What’s not allowed

- Impersonating real people without their consent
- Deepfakes, fraud, or disinformation
- Removing or bypassing the disclaimer and watermarking

Limitations to know about

Language support is narrow for TTS. VibeVoice-1.5B supports English and Chinese. Other languages produce unintelligible output. VibeVoice-ASR has broader coverage at 50+ languages.

Hardware requirements are steep for ASR. The ASR model needs 24 GB+ VRAM (A100/H100 class GPUs). The TTS models run on consumer GPUs with 7-8 GB VRAM.

No overlapping speech handling. The TTS model doesn’t model speakers talking over each other. All dialogue is turn-based.

Inherited model biases. Both models inherit biases from their Qwen2.5 base. Outputs can contain unexpected, biased, or inaccurate content.

Research-grade software. This is not production-ready. Expect rough edges in error handling, unusual inputs, and non-English output.

Deploying VibeVoice-ASR on Azure AI Foundry

For teams that don’t want to manage GPU infrastructure, Microsoft made VibeVoice-ASR available through Azure AI Foundry. This gives you a managed API endpoint without provisioning hardware.

The Azure deployment handles scaling, model updates, and infrastructure maintenance. You get an HTTPS endpoint that accepts audio files and returns structured transcriptions in the same Who/When/What format as the local model.

This is particularly useful for production workloads where you need consistent uptime and SLA guarantees that self-hosted GPU inference can’t provide. Check Azure AI Foundry’s model catalog for current pricing and deployment options.

For testing your Azure-hosted VibeVoice endpoint before integrating it into your application, set up the endpoint URL and authentication headers in Apidog and run test transcriptions against sample audio files.

Community and ecosystem

VibeVoice has an active community:

- 62,000+ monthly downloads of the 1.5B model on HuggingFace
- 12 community fine-tuned variants
- An independent fork that kept development going while the official repo was offline

Notable community projects:

- VibeVoice-FastAPI: an OpenAI-compatible TTS API wrapper
- vibevoice-mcp-server: runs VibeVoice-ASR as an MCP server for AI coding tools
- Apple Silicon inference scripts for running the models on M-series Macs

FAQ

Is VibeVoice free to use?

Yes. All three models (TTS 1.5B, Realtime 0.5B, ASR) are MIT-licensed. You can use them for commercial and non-commercial purposes. Azure AI Foundry hosting has separate pricing for managed cloud inference.

Can VibeVoice run on Apple Silicon Macs?

The community has contributed scripts for M-series Mac inference. Check the HuggingFace discussions for the VibeVoice-1.5B model. Performance is slower than CUDA GPUs but functional.

How does VibeVoice compare to ElevenLabs?

VibeVoice runs locally with no API costs and no data leaving your machine. ElevenLabs offers higher quality, more voices, and easier setup, but requires a paid subscription and cloud processing. For privacy-sensitive applications or offline use, VibeVoice wins. For production quality and ease of use, ElevenLabs is ahead.

Why was the GitHub repository temporarily disabled?

Microsoft discovered people using voice cloning for impersonation and deepfakes. They disabled the repo, added safety features (audible disclaimers, watermarking), and re-enabled it. The community fork kept development going during the downtime.

Can I fine-tune VibeVoice on custom voices?

Yes. The community has produced 12 fine-tuned variants on HuggingFace. You need voice samples (30-60 seconds of clear WAV audio at 24kHz mono) and GPU resources for training.

What audio formats does VibeVoice output?

WAV at 24,000 Hz mono. You can convert to MP3, OGG, FLAC, or other formats with ffmpeg after generation.
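From application code, that conversion is a one-line subprocess call (requires ffmpeg on your PATH; the helper names and filenames are illustrative):

```python
import subprocess

def build_ffmpeg_cmd(src: str, dst: str) -> list[str]:
    """ffmpeg command converting src to dst; format inferred from extension."""
    return ["ffmpeg", "-y", "-i", src, dst]

def convert_audio(src: str, dst: str) -> None:
    """Run the conversion, raising if ffmpeg exits non-zero."""
    subprocess.run(build_ffmpeg_cmd(src, dst), check=True)

# convert_audio("outputs/episode.wav", "episode.mp3")
```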

Can I use VibeVoice-ASR as a Whisper replacement?

For long-form audio with speaker identification, yes. VibeVoice-ASR handles 60-minute recordings in a single pass with built-in diarization. Whisper needs external tools for speaker identification and struggles with recordings over 30 minutes without chunking. For short, noisy clips or edge deployment, Whisper remains the better choice.

Does VibeVoice support real-time voice chat?

VibeVoice-Realtime-0.5B supports streaming text input with ~300ms first-chunk latency. It’s usable for near-real-time applications but isn’t designed for full-duplex voice conversation. For that, look at Azure OpenAI’s GPT-Realtime or similar hosted solutions.
