Qwen3.5-Omni Is Here: Alibaba's Omnimodal AI Beats Gemini on Audio

Qwen3.5-Omni launched March 30 with 113-language speech, voice cloning, and benchmark wins over Gemini 3.1 Pro. Here's what's new and why it matters.

Ashley Innocent

Ashley Innocent

31 March 2026

Qwen3.5-Omni Is Here: Alibaba's Omnimodal AI Beats Gemini on Audio

TL;DR

Alibaba released Qwen3.5-Omni on March 30, 2026. It processes text, images, audio, and video in a single model and outputs both text and real-time speech. It outperforms Gemini 3.1 Pro on general audio understanding and reasoning benchmarks, supports 113 languages for speech recognition, and includes voice cloning. Three variants are available: Plus, Flash, and Light.

One model for everything

Most AI workflows today involve stitching together separate models: one for speech-to-text, another for vision, another for text generation, and another for text-to-speech. Each handoff adds latency, cost, and points of failure.

Qwen3.5-Omni collapses that stack. It takes text, images, audio, and video as input and returns text or speech as output, all within a single model inference call. The context window holds 256,000 tokens, which covers over 10 hours of audio or roughly 400 seconds of 720p video with audio.

Alibaba trained it on over 100 million hours of native audio-visual data. The result is a model that doesn’t just handle multiple modalities; it reasons across them at the same time.

If you’re building apps that involve any combination of voice, video, images, and text, this changes what’s possible at the API level.

What changed from Qwen3-Omni

The previous generation, Qwen3-Omni Flash, launched in December 2025 with 234ms response latency. Qwen3.5-Omni is the next full release. Here’s what changed:

Language coverage expanded significantly

Speech recognition in Qwen3-Omni covered 19 languages. Qwen3.5-Omni covers 113 languages and dialects. Speech generation went from 10 languages to 36. That’s not a minor bump; it’s the difference between a model that works for Western markets and one that works globally.

Voice cloning is now built in

You can upload a voice sample and have the model respond in that voice. In the previous generation, this wasn’t available. In Qwen3.5-Omni Plus and Flash, voice cloning is accessible via the API. The model matches speaker identity well enough to pass as a consistent voice persona across long conversations.

ARIA technology eliminates audio garbling

Numbers and unusual words (product names, technical terms, proper nouns) have historically garbled in neural TTS systems. ARIA, Qwen’s dynamic text-speech synchronization layer, specifically addresses this. It reads ahead in the text buffer and adjusts phoneme generation before outputting audio, so “IPv6,” “$249.99,” and “Qwen3.5-Omni” all come out correctly.

Semantic interruption works the way humans expect

When you say “uh-huh” during a voice response, you want the model to keep talking. When you say “wait, stop,” you want it to stop. Earlier voice AI systems treated any audio input as an interruption command. Qwen3.5-Omni distinguishes between backchannels (acknowledgments) and actual interruptions, making voice conversations feel more natural.

Real-time web search is integrated

The model can query the web during inference and incorporate live results into its response. You don’t need to pre-fetch context and inject it into the prompt; the model handles retrieval itself when needed.

Audio-Visual Vibe Coding

Screen recordings now function as a coding input. Record your screen, pass the video to the model, and ask it to replicate or improve what it sees. It generates working code from the visual context. This is the multimodal equivalent of Cursor’s context-aware code generation, except the input is video.

Benchmark results

Across 36 audio and audio-visual benchmarks:

For speech generation quality specifically, it beats ElevenLabs, GPT-Audio, and Minimax on multilingual voice stability across 20 languages. That’s a meaningful comparison: ElevenLabs is a dedicated voice AI company with years of focus on this problem.


Model variants

Alibaba ships three versions:

Variant Best for
Qwen3.5-Omni Plus Maximum quality; audio-visual reasoning, voice cloning, long-context tasks
Qwen3.5-Omni Flash Balanced speed and quality; real-time voice chat, production APIs
Qwen3.5-Omni Light Low-latency tasks; mobile and edge scenarios

All three handle the full input modality stack (text, images, audio, video). The differences are in output quality, latency, and cost. Plus is the benchmark leader; Flash is what most production applications should start with.

The 256K token context window

256K tokens is the input ceiling. What does that translate to in practice?

For most multimodal use cases, 256K is enough that you won’t need to chunk inputs. A 30-minute meeting recording, a full product demo video, or a long customer support call all fit in a single request.

Compare this to GPT-4o’s 128K context or Gemini 2.5 Pro’s 1M context. Qwen3.5-Omni is smaller than Gemini’s ceiling, but its audio-visual performance on benchmarks compensates for that difference in most real-world tasks.


113-language speech recognition

The jump from 19 to 113 languages in speech recognition isn’t just a marketing number. It matters for three categories of applications:

Customer support for global products. If your users speak Thai, Bengali, Swahili, or Finnish, you now have a single model that can handle their voice input without routing through a separate ASR pipeline.

Multilingual content processing. Podcasts, videos, and interviews in non-English languages can be transcribed, translated, and summarized in one call.

Mid-conversation language switching. Bilingual speakers often switch languages mid-sentence. Qwen3.5-Omni handles this natively. A conversation that moves between English and Spanish doesn’t confuse the model or degrade recognition accuracy.

Architecture: Thinker-Talker with MoE

The model uses a Thinker-Talker architecture. The Thinker component processes multimodal input and generates reasoning tokens. The Talker component converts those tokens to natural speech in real time using a multi-codebook approach that minimizes latency.

Under the hood, the Plus variant uses Mixture of Experts (MoE), which means only a subset of model parameters activate per token. This keeps inference fast and memory efficient relative to a dense model of equivalent quality.

For local deployment, vLLM is the recommended inference server because of how it handles MoE routing. HuggingFace Transformers works but is slower on MoE architectures.

Where Apidog fits in

If you’re evaluating whether to build on Qwen3.5-Omni’s API, you’ll be sending multimodal requests: JSON bodies with base64-encoded audio, image URLs, video references, and text all mixed together.

Debugging those requests without a proper API client gets painful quickly. Apidog handles this well. You can build and save your Qwen3.5-Omni request templates, set environment variables for your API keys, and write automated tests that verify response structure and content.

For teams evaluating the three model variants, Apidog makes it easy to run the same request against Plus, Flash, and Light and compare latency and output quality side by side.

Download Apidog free to start testing multimodal API requests.

button

Who this is for

Qwen3.5-Omni makes sense to evaluate if you’re building:

Voice assistants. Real-time speech in, speech out, with conversation memory and web retrieval. The semantic interruption and ARIA features solve two of the hardest problems in voice UX.

Video analysis tools. Automated video summarization, meeting transcription, tutorial generation from screen recordings. The 256K context window means you can pass in long recordings without chunking.

Multilingual customer products. 113-language ASR and 36-language TTS in one model. No separate vendor for each language tier.

Accessibility tooling. Alt-text generation for images, audio descriptions for video content, real-time caption generation with language support for under-resourced languages.

Developer productivity tools. Audio-Visual Vibe Coding turns screen recordings into working code. That’s a new input modality for code assistants.

Access

Qwen3.5-Omni is available through:

The API follows Alibaba Cloud’s standard authentication model. You’ll need a DashScope API key. See the DashScope documentation for endpoint details and pricing per modality.

What to watch

Qwen3.5-Omni is strong on audio benchmarks. Whether those benchmark gains translate to real-world quality in your specific use case is worth testing directly. Benchmarks measure aggregate performance across curated test sets; they don’t predict how the model handles your domain’s vocabulary, your users’ accents, or your video formats.

The voice cloning feature is API-only for now. The qwen.ai web interface doesn’t expose it yet.

Local deployment requires significant GPU memory. The Plus variant (30B MoE) needs at least 40GB VRAM for comfortable inference. Flash and Light variants are more accessible.

FAQ

How is Qwen3.5-Omni different from Qwen2.5-Omni?

Qwen2.5-Omni supported 7B and 3B dense model sizes with 19 languages for speech. Qwen3.5-Omni uses an MoE architecture, expands speech recognition to 113 languages, adds voice cloning, and introduces ARIA for better audio quality. The benchmark performance and context window also grew significantly.

Can I run Qwen3.5-Omni locally?

Yes, via HuggingFace Transformers or vLLM. The Plus variant needs 40GB+ VRAM. Flash and Light variants run on smaller GPUs. vLLM is the better choice for production local deployment because of MoE optimization.

Is there a free tier?

The qwen.ai web interface is free to use. API access through DashScope is paid. Pricing per modality (audio tokens, video frames, text tokens) is available in the DashScope pricing documentation.

Does it support real-time streaming?

Yes. The Thinker-Talker architecture outputs audio in a streaming chunked manner, so the first audio bytes arrive before the full response is generated. This is what makes live voice conversation feel natural.

What’s the difference between Plus, Flash, and Light?

Plus is the highest quality, best for tasks where accuracy matters more than speed. Flash is the balanced option for most production APIs. Light is the fastest, intended for latency-sensitive applications like mobile or edge inference.

Can I use my own voice with the API?

Yes, via voice cloning on the API. You upload an audio sample of the target voice, and the model uses it for speech output. This is not available through the web interface yet.

How does it compare to ElevenLabs for voice generation?

On Alibaba’s benchmarks across 20 languages, Qwen3.5-Omni Plus outperforms ElevenLabs on multilingual voice stability. ElevenLabs has a longer track record and more voice customization options in its product. If you need voice-only capabilities, ElevenLabs is still worth comparing. If you need an integrated multimodal model, Qwen3.5-Omni is the cleaner choice.

Is it safe to send sensitive audio or video data through the API?

Review Alibaba Cloud’s data processing agreement before sending sensitive content. As with any cloud API, assume data may be logged unless the agreement explicitly guarantees otherwise.

Explore more

7 Best API Management Tools in 2026, Ranked by G2

7 Best API Management Tools in 2026, Ranked by G2

G2 Spring 2026 named Apidog and viaSocket Leaders in API Management. Honest, hands-on comparison of the 7 ranked tools and who each one fits.

15 May 2026

What is ERNIE 5.1? Baidu's New MoE Model

What is ERNIE 5.1? Baidu's New MoE Model

Baidu's ERNIE 5.1 hit 4th globally on Arena Search at ~6% of frontier pre-training cost. Architecture, benchmarks, and how it compares to DeepSeek V4 and Kimi K2.6.

14 May 2026

Claude Code Weekly Limits Just Jumped 50% Through July 13: What Pro, Max, and Team Users Should Do With the Extra Quota

Claude Code Weekly Limits Just Jumped 50% Through July 13: What Pro, Max, and Team Users Should Do With the Extra Quota

Anthropic raised Claude Code weekly limits 50% through July 13, 2026. What changed for Pro, Max, Team, and Enterprise, plus how to use the extra quota.

14 May 2026

Practice API Design-first in Apidog

Discover an easier way to build and use APIs