TL;DR
VibeVoice is Microsoft’s open-source voice AI family with three models: VibeVoice-1.5B for text-to-speech (up to 90 minutes, 4 speakers), VibeVoice-Realtime-0.5B for streaming TTS, and VibeVoice-ASR for speech recognition (60-minute audio, 50+ languages, 7.77% WER). All models are MIT-licensed and run locally. This guide covers installation, usage, and API integration.
Introduction
Microsoft released VibeVoice as an open-source voice AI framework in early 2026. It includes models for both speech synthesis (text-to-speech) and speech recognition (automatic speech recognition), all running locally on your hardware with no cloud dependency.

The framework has three models:
- VibeVoice-1.5B generates expressive, multi-speaker conversational audio from text scripts. It can synthesize up to 90 minutes of speech with 4 distinct speakers in a single pass.
- VibeVoice-Realtime-0.5B is a lightweight streaming variant that produces audio with ~300ms first-chunk latency.
- VibeVoice-ASR transcribes up to 60 minutes of continuous audio with speaker identification, timestamps, and structured output across 50+ languages.

The TTS models caused controversy after release. Microsoft temporarily disabled the main GitHub repository when they discovered voice cloning misuse. The community forked the code, and Microsoft later re-enabled the repo with added safeguards: an audible AI disclaimer embedded in generated audio and imperceptible watermarking for provenance verification.
VibeVoice-ASR is now available on Azure AI Foundry for cloud deployment. The TTS models remain research-focused with an MIT license.
This guide walks through installation, text-to-speech generation, speech recognition, API integration, and how to test voice AI endpoints with Apidog.
How VibeVoice works: architecture overview
The tokenizer breakthrough
VibeVoice’s core advancement is its continuous speech tokenizers, which operate at an ultra-low frame rate of 7.5 Hz. For comparison, most speech models process audio at 50-100 Hz. This roughly 7-13x reduction in frame rate is what lets the model handle long sequences (up to 90 minutes of audio) without exhausting its context window.
The system uses two tokenizers:
- Acoustic Tokenizer: A sigma-VAE variant with ~340M parameters in a mirror-symmetric encoder-decoder. It downsamples 3,200x from 24kHz input audio.
- Semantic Tokenizer: Mirrors the acoustic tokenizer’s architecture but is trained with an ASR proxy task to capture linguistic meaning.
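The arithmetic behind these numbers is easy to verify: a 3,200x downsampling of 24 kHz audio yields 7.5 frames per second, so even a full 90-minute recording needs only about 40,500 acoustic tokens.

```python
# Frame-rate arithmetic for the acoustic tokenizer (figures from this article).
SAMPLE_RATE = 24_000   # Hz, input audio
DOWNSAMPLE = 3_200     # tokenizer downsampling factor

frame_rate = SAMPLE_RATE / DOWNSAMPLE      # frames per second of audio
frames_90_min = int(90 * 60 * frame_rate)  # acoustic tokens for a 90-minute recording

print(frame_rate)     # 7.5
print(frames_90_min)  # 40500
```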
Next-token diffusion
The model combines an LLM backbone (Qwen2.5-1.5B) with a lightweight diffusion head (~123M parameters). The LLM handles textual context and dialogue flow. The diffusion head generates high-fidelity acoustic details using DDPM (Denoising Diffusion Probabilistic Models) with Classifier-Free Guidance.
Total parameter count: 3B (including tokenizers and diffusion head).
Training approach
VibeVoice uses curriculum learning, progressively training on longer sequences: 4K, 16K, 32K, then 64K tokens. The pre-trained tokenizers stay frozen during this phase; only the LLM and diffusion head parameters update. This lets the model learn to handle increasingly long audio without forgetting short-form capabilities.
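Ignoring text tokens, each curriculum stage caps how much audio the model can see at once. The sketch below (taking "K" as 1,024, an assumption) computes those upper bounds at 7.5 acoustic tokens per second; in practice text tokens share the window, which is why the usable limit at 64K is about 90 minutes rather than the theoretical ceiling.

```python
FRAME_RATE = 7.5  # acoustic tokens per second

def audio_ceiling_minutes(ctx_tokens):
    """Upper-bound minutes of audio that fit in ctx_tokens, ignoring text tokens."""
    return ctx_tokens / FRAME_RATE / 60

# Curriculum stages: 4K, 16K, 32K, 64K tokens
for ctx in (4_096, 16_384, 32_768, 65_536):
    print(f"{ctx:>6} tokens -> up to {audio_ceiling_minutes(ctx):6.1f} min of audio")
```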
VibeVoice model specifications
| Model | Parameters | Purpose | Max length | Languages | License |
|---|---|---|---|---|---|
| VibeVoice-1.5B | 3B (total) | Text-to-speech | 90 minutes | English, Chinese | MIT |
| VibeVoice-Realtime-0.5B | ~0.5B | Streaming TTS | Long-form | English, Chinese | MIT |
| VibeVoice-ASR | ~9B | Speech recognition | 60 minutes | 50+ languages | MIT |
VibeVoice-1.5B (TTS)
| Specification | Value |
|---|---|
| LLM base | Qwen2.5-1.5B |
| Context length | 64K tokens |
| Max speakers | 4 simultaneous |
| Audio output | 24kHz WAV mono |
| Tensor type | BF16 |
| Format | Safetensors |
| HuggingFace downloads | 62,630/month |
| Community forks | 12 fine-tuned variants |
VibeVoice-ASR
| Specification | Value |
|---|---|
| Architecture base | Qwen2.5 |
| Parameters | ~9B |
| Audio processing | Up to 60 minutes single pass |
| Frame rate | 7.5 Hz |
| Average WER | 7.77% (across 8 English datasets) |
| LibriSpeech Clean WER | 2.20% |
| TED-LIUM WER | 2.57% |
| Languages | 50+ |
| Output | Structured (Who + When + What) |
| Supported audio | WAV, FLAC, MP3 at 16kHz+ |
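For readers unfamiliar with the metric: word error rate (WER) is the word-level edit distance (substitutions + insertions + deletions) divided by the number of reference words. A minimal implementation you can use to spot-check transcripts on your own data:

```python
def wer(reference, hypothesis):
    """Word error rate: word-level edit distance over reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edits needed to turn ref[:i] into hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # delete all i reference words
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # insert all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(
                dp[i - 1][j] + 1,        # deletion
                dp[i][j - 1] + 1,        # insertion
                dp[i - 1][j - 1] + cost  # match or substitution
            )
    return dp[len(ref)][len(hyp)] / len(ref)
```

A 7.77% WER means roughly one word in thirteen is wrong relative to the reference transcript.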
Installation and setup
Prerequisites
- Python 3.8+
- NVIDIA GPU with CUDA support
- Minimum 7-8 GB VRAM for TTS models
- Minimum 24 GB VRAM for ASR model (A100/H100 recommended)
- 32 GB RAM minimum (64 GB recommended for ASR)
- CUDA 11.8+ (CUDA 12.0+ recommended)
Install VibeVoice TTS
# Clone the repository
git clone https://github.com/microsoft/VibeVoice.git
cd VibeVoice
# Install dependencies
pip install -r requirements.txt
Models download automatically from HuggingFace on first run. You can also pre-download them:
from huggingface_hub import snapshot_download

# Download the 1.5B TTS model
snapshot_download(
    "microsoft/VibeVoice-1.5B",
    local_dir="./models/VibeVoice-1.5B",
    local_dir_use_symlinks=False
)
Install via pip (community package)
pip install vibevoice
Install for ASR
VibeVoice-ASR uses a separate setup:
git clone https://github.com/microsoft/VibeVoice.git
cd VibeVoice
pip install -r requirements-asr.txt
Or deploy through Azure AI Foundry for managed cloud inference.
Generating speech with VibeVoice-1.5B
Single-speaker generation
Create a text file with your script:
Alice: Welcome to the Apidog developer podcast. Today we're covering API testing strategies for 2026.
Run inference:
python demo/inference_from_file.py \
    --model_path microsoft/VibeVoice-1.5B \
    --txt_path script.txt \
    --speaker_names Alice \
    --cfg_scale 1.5
The output is saved as a .wav file in the outputs/ directory.
Multi-speaker podcast generation
VibeVoice handles up to 4 speakers with consistent voice identities throughout the entire recording:
Alice: Welcome back to the show. Today we have two API experts joining us.
Bob: Thanks for having me. I've been working on REST API design patterns for the past five years.
Carol: And I focus on GraphQL performance optimization. Happy to be here.
Alice: Let's start with the debate everyone wants to hear. REST versus GraphQL for microservices.
Bob: REST gives you clear resource boundaries. Each endpoint maps to a specific resource.
Carol: GraphQL gives you flexibility. One endpoint, and the client decides what data it needs.
python demo/inference_from_file.py \
    --model_path microsoft/VibeVoice-1.5B \
    --txt_path podcast_script.txt \
    --speaker_names Alice Bob Carol \
    --cfg_scale 1.5
The model maintains distinct voice characteristics for each speaker across the full conversation, even at 90-minute lengths.
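Before kicking off a long render, it can save time to sanity-check the script format locally. This hypothetical helper (not part of the VibeVoice repo) parses the `Name: dialogue` format used above and enforces the 4-speaker limit:

```python
def validate_script(text, max_speakers=4):
    """Parse a 'Name: dialogue' script and enforce the speaker limit."""
    speakers, turns = [], []
    for lineno, line in enumerate(text.strip().splitlines(), start=1):
        if not line.strip():
            continue  # skip blank lines between turns
        name, sep, content = line.partition(":")
        if not sep or not content.strip():
            raise ValueError(f"line {lineno}: expected 'Name: dialogue'")
        name = name.strip()
        if name not in speakers:
            speakers.append(name)
        turns.append((name, content.strip()))
    if len(speakers) > max_speakers:
        raise ValueError(
            f"{len(speakers)} speakers found; VibeVoice-1.5B supports {max_speakers}"
        )
    return speakers, turns

speakers, turns = validate_script("Alice: Hi.\nBob: Hello.")
print(speakers)  # ['Alice', 'Bob']
```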
Voice cloning (zero-shot)
Clone a voice from a reference audio sample:
Audio requirements:
- Format: WAV (mono)
- Sample rate: 24,000 Hz
- Duration: 30-60 seconds of clear speech
Convert existing audio to the right format:
ffmpeg -i source_recording.m4a -ar 24000 -ac 1 reference_voice.wav
Use the Gradio demo interface for voice cloning:
python demo/gradio_demo.py
This launches a web UI at http://127.0.0.1:7860 where you upload your reference audio, select the cloned voice, and generate speech.
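To confirm a converted clip actually meets the reference-audio requirements before uploading it, a quick check with Python's standard wave module works (a hypothetical helper, not part of the VibeVoice tooling):

```python
import wave

def check_reference(path, rate=24_000, min_s=30, max_s=60):
    """Return a list of problems; empty means the clip meets the requirements above."""
    problems = []
    with wave.open(path, "rb") as w:
        duration = w.getnframes() / w.getframerate()
        if w.getnchannels() != 1:
            problems.append(f"{w.getnchannels()} channels (need mono)")
        if w.getframerate() != rate:
            problems.append(f"{w.getframerate()} Hz (need {rate})")
        if not min_s <= duration <= max_s:
            problems.append(f"{duration:.1f}s (need {min_s}-{max_s}s)")
    return problems
```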
Streaming with VibeVoice-Realtime-0.5B
For applications needing low-latency audio output (~300ms first chunk):
python demo/streaming_inference_from_file.py \
    --model_path microsoft/VibeVoice-Realtime-0.5B \
    --txt_path script.txt \
    --speaker_name Alice
The Realtime model is smaller and faster but produces lower fidelity audio than the full 1.5B model. Use it for interactive applications; use the 1.5B for pre-generated content.
Using VibeVoice with Python
Pipeline API
from transformers import pipeline
from huggingface_hub import snapshot_download

# Download model
model_path = snapshot_download("microsoft/VibeVoice-1.5B")

# Load pipeline
pipe = pipeline(
    "text-to-speech",
    model=model_path,
    no_processor=False
)

# Prepare multi-speaker script
script = [
    {"role": "Alice", "content": "How do you handle API versioning?"},
    {"role": "Bob", "content": "We use URL path versioning. v1, v2, and so on."},
]

# Apply chat template
input_data = pipe.processor.apply_chat_template(script)

# Generate audio
generate_kwargs = {
    "cfg_scale": 1.5,
    "n_diffusion_steps": 50,
}
output = pipe(input_data, generate_kwargs=generate_kwargs)
FastAPI wrapper for production
The community built a FastAPI wrapper that exposes VibeVoice as an OpenAI-compatible TTS API:
git clone https://github.com/ncoder-ai/VibeVoice-FastAPI.git
cd VibeVoice-FastAPI
docker compose up
This gives you an API endpoint compatible with OpenAI’s TTS format:
curl -X POST http://localhost:8000/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{
    "model": "vibevoice-1.5b",
    "input": "Your API documentation should be a conversation, not a monologue.",
    "voice": "alice"
  }' \
  --output speech.wav
This OpenAI-compatible endpoint means you can test your VibeVoice API integration with Apidog using the same request format you’d use for OpenAI’s TTS API. Import the endpoint, configure your request body, and test voice generation without writing application code.
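If you'd rather script the call than use curl, the same request works from Python's standard library. The URL, model name, and voice below mirror the curl example; the wrapper's actual names may differ in your deployment.

```python
import json
from urllib import request

def tts_payload(text, model="vibevoice-1.5b", voice="alice"):
    """Build the OpenAI-style request body shown in the curl example above."""
    return {"model": model, "input": text, "voice": voice}

def synthesize(text, url="http://localhost:8000/v1/audio/speech", out_path="speech.wav"):
    """POST the payload to the local FastAPI wrapper and save the returned WAV bytes."""
    req = request.Request(
        url,
        data=json.dumps(tts_payload(text)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp, open(out_path, "wb") as f:
        f.write(resp.read())
    return out_path

if __name__ == "__main__":
    synthesize("Your API documentation should be a conversation, not a monologue.")
```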
Using VibeVoice-ASR for speech recognition
Basic transcription
python asr_inference.py \
    --model_path microsoft/VibeVoice-ASR \
    --audio_path meeting_recording.wav
Structured output format
VibeVoice-ASR produces structured transcriptions with three fields per segment:
- Who: Speaker identity (Speaker 1, Speaker 2, etc.)
- When: Start and end timestamps
- What: Transcribed text content
Example output:
{
  "segments": [
    {
      "speaker": "Speaker 1",
      "start": 0.0,
      "end": 4.2,
      "text": "Let's review the API endpoints for the new release."
    },
    {
      "speaker": "Speaker 2",
      "start": 4.5,
      "end": 8.1,
      "text": "I've added three new endpoints for the billing module."
    }
  ]
}
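Because the output is plain structured JSON, it converts easily to downstream formats. As one example, this sketch turns Who/When/What segments like the ones above into SRT subtitles:

```python
def to_srt(segments):
    """Convert Who/When/What segments into an SRT subtitle string."""
    def ts(seconds):
        # SRT timestamps look like HH:MM:SS,mmm
        ms = round(seconds * 1000)
        h, rem = divmod(ms, 3_600_000)
        m, rem = divmod(rem, 60_000)
        s, ms = divmod(rem, 1000)
        return f"{h:02}:{m:02}:{s:02},{ms:03}"

    blocks = []
    for i, seg in enumerate(segments, start=1):
        blocks.append(
            f"{i}\n{ts(seg['start'])} --> {ts(seg['end'])}\n"
            f"{seg['speaker']}: {seg['text']}\n"
        )
    return "\n".join(blocks)
```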
ASR as an MCP server
VibeVoice-ASR can run as an MCP (Model Context Protocol) server, plugging directly into Claude Code, Cursor, and other AI coding tools:
# Install the MCP server
pip install vibevoice-mcp-server
# Run it
vibevoice-mcp serve
This lets your coding agent transcribe meetings, voice notes, or audio recordings as part of its workflow. You dictate requirements, the MCP server transcribes them, and the coding agent processes the text.
When to use VibeVoice-ASR vs Whisper
| Use case | Best choice | Why |
|---|---|---|
| Long meetings (30-60 min) | VibeVoice-ASR | Single-pass 60-min processing, speaker ID |
| Interviews with multiple speakers | VibeVoice-ASR | Built-in diarization |
| Podcasts needing timestamps | VibeVoice-ASR | Structured Who/When/What output |
| Multilingual content (50+ languages) | VibeVoice-ASR | Broader language support |
| Short clips in noisy environments | Whisper | Better noise robustness |
| Edge/mobile deployment | Whisper | Smaller model size, wider device support |
| Non-English languages (specialized) | Whisper | More mature multilingual fine-tuning |
Testing voice AI APIs with Apidog
Whether you’re using the VibeVoice FastAPI wrapper, Azure AI Foundry endpoint, or building your own voice AI API, Apidog helps you test and debug these integrations.

Test the TTS endpoint
- Create a new POST request in Apidog pointing to your VibeVoice FastAPI server
- Set the request body to the OpenAI-compatible format:
{
  "model": "vibevoice-1.5b",
  "input": "Test speech synthesis with proper intonation and pacing.",
  "voice": "alice",
  "response_format": "wav"
}
- Send the request and verify the response headers include an `audio/wav` content type
- Save the response as a WAV file to verify audio quality
Test the ASR endpoint
For speech-to-text APIs:
- Set up a POST request with a `multipart/form-data` body
- Attach your audio file as a form field
- Verify the structured JSON response includes speaker IDs, timestamps, and transcribed text
Validate audio API contracts
Voice AI APIs handle binary data (audio files) alongside JSON metadata. Apidog’s request builder handles both:
- Binary file uploads for ASR endpoints
- JSON body formatting for TTS endpoints
- Response validation for structured transcription output
- Environment variables to switch between local and cloud endpoints
Download Apidog to test your voice AI integrations before deploying to production.
Safety and responsible use
Microsoft added several safeguards after the initial misuse incidents:
- Audible AI disclaimer: All generated audio includes an automatic “This segment was generated by AI” message
- Imperceptible watermarking: Hidden markers enable third-party verification of VibeVoice-generated content
- Inference logging: Hashed logs detect abuse patterns with quarterly aggregated statistics
- MIT license: Permits commercial use, but Microsoft recommends against production deployment without further testing
What’s allowed
- Research and academic use
- Internal prototyping and testing
- Podcast generation with proper AI disclosure
- Accessibility applications (text-to-speech for visually impaired users)
What’s not allowed
- Voice impersonation without explicit recorded consent
- Deepfakes or presenting AI audio as genuine human recordings
- Real-time voice conversion for live deepfake applications
- Generating non-speech audio (music, sound effects)
Limitations to know about
Language support is narrow for TTS. VibeVoice-1.5B supports English and Chinese. Other languages produce unintelligible output. VibeVoice-ASR has broader coverage at 50+ languages.

Hardware requirements are steep for ASR. The ASR model needs 24 GB+ VRAM (A100/H100 class GPUs). The TTS models run on consumer GPUs with 7-8 GB VRAM.
No overlapping speech handling. The TTS model doesn’t model speakers talking over each other. All dialogue is turn-based.
Inherited model biases. Both models inherit biases from their Qwen2.5 base. Outputs can contain unexpected, biased, or inaccurate content.
Research-grade software. This is not production-ready. Expect rough edges in edge cases, error handling, and non-English output.
Deploying VibeVoice-ASR on Azure AI Foundry
For teams that don’t want to manage GPU infrastructure, Microsoft made VibeVoice-ASR available through Azure AI Foundry. This gives you a managed API endpoint without provisioning hardware.
The Azure deployment handles scaling, model updates, and infrastructure maintenance. You get an HTTPS endpoint that accepts audio files and returns structured transcriptions in the same Who/When/What format as the local model.
This is particularly useful for production workloads where you need consistent uptime and SLA guarantees that self-hosted GPU inference can’t provide. Check Azure AI Foundry’s model catalog for current pricing and deployment options.
For testing your Azure-hosted VibeVoice endpoint before integrating it into your application, set up the endpoint URL and authentication headers in Apidog and run test transcriptions against sample audio files.
Community and ecosystem
VibeVoice has an active community:
- 62,630+ monthly HuggingFace downloads for the 1.5B model
- 2,280+ likes on HuggingFace
- 79+ HuggingFace Spaces running the model
- 12 fine-tuned variants from the community
- 4 quantized versions for lower-VRAM deployment
- Community fork at `vibevoice-community/VibeVoice` with active maintenance
Notable community projects:
- VibeVoice-FastAPI: Production REST API wrapper with Docker support
- VibeVoice MCP Server: Integration with AI coding tools via Model Context Protocol
- Apple Silicon support: Community scripts for M-series Mac inference
- Quantized models: GGUF and other formats for reduced VRAM usage
FAQ
Is VibeVoice free to use?
Yes. All three models (TTS 1.5B, Realtime 0.5B, ASR) are MIT-licensed. You can use them for commercial and non-commercial purposes. Azure AI Foundry hosting has separate pricing for managed cloud inference.
Can VibeVoice run on Apple Silicon Macs?
The community has contributed scripts for M-series Mac inference. Check the HuggingFace discussions for the VibeVoice-1.5B model. Performance is slower than CUDA GPUs but functional.
How does VibeVoice compare to ElevenLabs?
VibeVoice runs locally with no API costs and no data leaving your machine. ElevenLabs offers higher quality, more voices, and easier setup, but requires a paid subscription and cloud processing. For privacy-sensitive applications or offline use, VibeVoice wins. For production quality and ease of use, ElevenLabs is ahead.
Why was the GitHub repository temporarily disabled?
Microsoft discovered people using voice cloning for impersonation and deepfakes. They disabled the repo, added safety features (audible disclaimers, watermarking), and re-enabled it. The community fork kept development going during the downtime.
Can I fine-tune VibeVoice on custom voices?
Yes. The community has produced 12 fine-tuned variants on HuggingFace. You need voice samples (30-60 seconds of clear WAV audio at 24kHz mono) and GPU resources for training.
What audio formats does VibeVoice output?
WAV at 24,000 Hz mono. You can convert to MP3, OGG, FLAC, or other formats with ffmpeg after generation.
Can I use VibeVoice-ASR as a Whisper replacement?
For long-form audio with speaker identification, yes. VibeVoice-ASR handles 60-minute recordings in a single pass with built-in diarization. Whisper needs external tools for speaker identification and struggles with recordings over 30 minutes without chunking. For short, noisy clips or edge deployment, Whisper remains the better choice.
Does VibeVoice support real-time voice chat?
VibeVoice-Realtime-0.5B supports streaming text input with ~300ms first-chunk latency. It’s usable for near-real-time applications but isn’t designed for full-duplex voice conversation. For that, look at Azure OpenAI’s GPT-Realtime or similar hosted solutions.