How to Use Qwen3.5-Omni: Text, Audio, Video, and Voice Cloning via API

Step-by-step guide to using Qwen3.5-Omni via the DashScope API. Send audio, video, images, and text. Clone voices. Run it locally. Full code examples included.

@apidog

31 March 2026

TL;DR

Qwen3.5-Omni accepts text, images, audio, and video as input and returns text or real-time speech. Access it through the Alibaba Cloud DashScope API or run it locally via HuggingFace Transformers. This guide covers API setup, working code examples for each modality, voice cloning, and how to test your requests with Apidog.

What you’re working with

Qwen3.5-Omni is a single model that handles four input types simultaneously: text, images, audio, and video. It returns either text or natural speech, depending on how you configure the request.

Released March 30, 2026, it’s built on a Thinker-Talker architecture with an MoE backbone. The Thinker processes multimodal input and reasons over it. The Talker converts the output to speech using a multi-codebook system that starts streaming audio before the full response is complete.

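Three variants are available: Plus (qwen3.5-omni-plus) for maximum quality, Flash (qwen3.5-omni-flash) as the general-purpose default, and Light (qwen3.5-omni-light) when latency is the priority.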

This guide uses Flash for most examples since it’s the right starting point for most applications. Swap in Plus where you need maximum quality.

API access via DashScope

Alibaba Cloud’s DashScope API is the primary way to use Qwen3.5-Omni in production. You’ll need a DashScope account and an API key.

Step 1: Create a DashScope account

Go to dashscope.aliyuncs.com and sign up. If you already have an Alibaba Cloud account, use that.

Step 2: Get your API key

  1. Log in to the DashScope console
  2. Click API Key Management in the left sidebar
  3. Click Create API Key
  4. Copy the key (format: sk-...)

Step 3: Install the SDK

pip install dashscope

Or use the OpenAI-compatible endpoint directly with the openai SDK:

pip install openai

DashScope exposes an OpenAI-compatible API at https://dashscope.aliyuncs.com/compatible-mode/v1, which means you can swap your base_url and use the same code you’d write for OpenAI.
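One way to keep the key out of source code is to read it from an environment variable before constructing the client. The sketch below assumes a variable named DASHSCOPE_API_KEY and the sk- prefix shown in Step 2; the helper name load_api_key is hypothetical.

```python
import os


def load_api_key(env=None, var_name="DASHSCOPE_API_KEY"):
    """Return the API key from the environment, or raise with a clear message.

    var_name is an assumption; use whatever variable name your deployment sets.
    """
    env = os.environ if env is None else env
    key = env.get(var_name, "").strip()
    if not key:
        raise RuntimeError(f"Set {var_name} before running this script")
    if not key.startswith("sk-"):
        # Step 2 shows keys in the form sk-...; anything else is likely a paste error.
        raise RuntimeError(f"{var_name} does not look like a DashScope key (expected sk-...)")
    return key
```

The result can then be passed as api_key when constructing the OpenAI client, in place of the hard-coded placeholder string.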

Text input and output

Start with the simplest case: text in, text out.

from openai import OpenAI

client = OpenAI(
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
    api_key="sk-YOUR_DASHSCOPE_KEY",
)

response = client.chat.completions.create(
    model="qwen3.5-omni-flash",
    messages=[
        {
            "role": "user",
            "content": "Explain the difference between REST and GraphQL APIs in plain terms."
        }
    ],
)

print(response.choices[0].message.content)

Switch to qwen3.5-omni-plus for harder reasoning tasks or qwen3.5-omni-light when latency is the priority.


Audio input: transcription and understanding

Pass an audio file URL or base64-encoded audio. The model transcribes, understands, and reasons over the content natively. No separate ASR step needed.

import base64
from openai import OpenAI

client = OpenAI(
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
    api_key="sk-YOUR_DASHSCOPE_KEY",
)

# Load a local audio file
with open("meeting_recording.wav", "rb") as f:
    audio_data = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="qwen3.5-omni-flash",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "input_audio",
                    "input_audio": {
                        "data": audio_data,
                        "format": "wav"
                    }
                },
                {
                    "type": "text",
                    "text": "Summarize the key decisions made in this meeting and list any action items."
                }
            ]
        }
    ],
)

print(response.choices[0].message.content)

The model handles 113 languages for speech recognition. You don’t need to specify the language; it detects it automatically.

Supported audio formats: WAV, MP3, M4A, OGG, FLAC.
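If you accept arbitrary files from users, it may help to check the extension against that list before encoding. A minimal sketch, assuming the "format" value is simply the lowercase file extension (the helper name build_audio_part is hypothetical):

```python
import base64
from pathlib import Path

# Formats listed in this guide; the "format" string is assumed to match the extension.
SUPPORTED_AUDIO_FORMATS = {"wav", "mp3", "m4a", "ogg", "flac"}


def build_audio_part(path):
    """Build an input_audio content part from a local audio file."""
    p = Path(path)
    fmt = p.suffix.lower().lstrip(".")
    if fmt not in SUPPORTED_AUDIO_FORMATS:
        # Reject early, before reading or uploading anything.
        raise ValueError(f"Unsupported audio format: {fmt or '(none)'}")
    data = base64.b64encode(p.read_bytes()).decode("utf-8")
    return {"type": "input_audio", "input_audio": {"data": data, "format": fmt}}
```

The returned dictionary can be placed directly in the content array next to a text part.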

Audio output: text-to-speech in the response

To get speech back instead of text, set the modalities parameter and configure audio output:

from openai import OpenAI
import base64

client = OpenAI(
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
    api_key="sk-YOUR_DASHSCOPE_KEY",
)

response = client.chat.completions.create(
    model="qwen3.5-omni-flash",
    modalities=["text", "audio"],
    audio={"voice": "Chelsie", "format": "wav"},
    messages=[
        {
            "role": "user",
            "content": "Describe the steps to authenticate a REST API using OAuth 2.0."
        }
    ],
)

# The response includes both text and audio
text_content = response.choices[0].message.content
audio_data = response.choices[0].message.audio.data  # base64-encoded WAV

# Save the audio
with open("response.wav", "wb") as f:
    f.write(base64.b64decode(audio_data))

print(f"Text: {text_content}")
print("Audio saved to response.wav")

Two built-in voices are available: Chelsie (female) and Ethan (male). Speech generation works in 36 languages.

Image input: visual understanding

Pass an image URL or base64-encoded image alongside a text question:

from openai import OpenAI

client = OpenAI(
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
    api_key="sk-YOUR_DASHSCOPE_KEY",
)

response = client.chat.completions.create(
    model="qwen3.5-omni-flash",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://example.com/api-diagram.png"
                    }
                },
                {
                    "type": "text",
                    "text": "Describe this API architecture diagram and identify any potential bottlenecks."
                }
            ]
        }
    ],
)

print(response.choices[0].message.content)

For local images, encode them as base64:

import base64

with open("screenshot.png", "rb") as f:
    image_data = base64.b64encode(f.read()).decode("utf-8")

# Use data URL format
image_url = f"data:image/png;base64,{image_data}"

response = client.chat.completions.create(
    model="qwen3.5-omni-flash",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {"url": image_url}
                },
                {
                    "type": "text",
                    "text": "What error is shown in this screenshot?"
                }
            ]
        }
    ],
)
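The data URL pattern above hard-codes image/png. A small helper can guess the MIME type from the file extension instead, which also covers JPEG images and MP4 video. This is a sketch using only the standard library; the name to_data_url is hypothetical.

```python
import base64
import mimetypes
from pathlib import Path


def to_data_url(path):
    """Encode a local file as a data URL, guessing the MIME type from its extension."""
    mime, _ = mimetypes.guess_type(str(path))
    if mime is None:
        raise ValueError(f"Cannot guess MIME type for {path}")
    encoded = base64.b64encode(Path(path).read_bytes()).decode("utf-8")
    return f"data:{mime};base64,{encoded}"
```

For example, to_data_url("screenshot.png") yields a string starting with data:image/png;base64, that can be used as the url value of an image_url part.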

Video input: understanding recordings and screen captures

Video input is where Qwen3.5-Omni does something no text or image model can do: reason across both the visual and audio tracks simultaneously.

from openai import OpenAI
import base64

client = OpenAI(
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
    api_key="sk-YOUR_DASHSCOPE_KEY",
)

# Pass a video URL
response = client.chat.completions.create(
    model="qwen3.5-omni-flash",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "video_url",
                    "video_url": {
                        "url": "https://example.com/product-demo.mp4"
                    }
                },
                {
                    "type": "text",
                    "text": "Describe what the developer is building in this demo and write equivalent code."
                }
            ]
        }
    ],
)

print(response.choices[0].message.content)

Audio-Visual Vibe Coding

The “Vibe Coding” use case is passing a screen recording and asking the model to generate code from what it sees:

with open("screen_recording.mp4", "rb") as f:
    video_data = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="qwen3.5-omni-plus",  # Use Plus for best code generation quality
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "video_url",
                    "video_url": {
                        "url": f"data:video/mp4;base64,{video_data}"
                    }
                },
                {
                    "type": "text",
                    "text": "Watch this screen recording and write the complete code that replicates what you see being built. Include all the UI components and their interactions."
                }
            ]
        }
    ],
)

print(response.choices[0].message.content)

The 256K token context window fits roughly 400 seconds of 720p video with audio. For recordings longer than that, trim or split.
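Planning the split is simple arithmetic. A minimal sketch that computes segment boundaries, assuming the rough figure of 400 seconds per request quoted above (the helper name split_ranges is hypothetical, and the actual cutting would be done with a video tool of your choice):

```python
def split_ranges(duration_s, max_segment_s=400.0):
    """Return (start, end) pairs in seconds that cover the whole recording."""
    if duration_s <= 0 or max_segment_s <= 0:
        raise ValueError("duration_s and max_segment_s must be positive")
    ranges = []
    start = 0.0
    while start < duration_s:
        # The last segment is shorter when the duration is not a multiple of the limit.
        end = min(start + max_segment_s, duration_s)
        ranges.append((start, end))
        start = end
    return ranges


print(split_ranges(1000))
```

A 1000-second recording comes out as three segments: 0 to 400, 400 to 800, and 800 to 1000. Passing a smaller max_segment_s leaves headroom in the context window for the prompt and the response.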

Voice cloning

Voice cloning lets you give the model a target voice and have it respond in that voice. This is available on Plus and Flash via the API.

import base64
from openai import OpenAI

client = OpenAI(
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
    api_key="sk-YOUR_DASHSCOPE_KEY",
)

# Load a 10-30 second voice sample for cloning
with open("voice_sample.wav", "rb") as f:
    voice_sample = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="qwen3.5-omni-plus",
    modalities=["text", "audio"],
    audio={
        "voice": "custom",
        "format": "wav",
        "voice_sample": {
            "data": voice_sample,
            "format": "wav"
        }
    },
    messages=[
        {
            "role": "user",
            "content": "Welcome to the Apidog developer portal. How can I help you today?"
        }
    ],
)

audio_data = response.choices[0].message.audio.data
with open("cloned_response.wav", "wb") as f:
    f.write(base64.b64decode(audio_data))

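Tips for voice cloning quality: use a clean recording with background noise removed, at least 15 seconds long, saved as WAV at 16kHz or 44.1kHz.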
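Duration and sample rate are easy to check before upload. A minimal sketch using the standard-library wave module, with thresholds taken from the troubleshooting advice later in this guide (at least 15 seconds; 16kHz or 44.1kHz). The helper name check_voice_sample is hypothetical, and it does not detect background noise.

```python
import wave


def check_voice_sample(path, min_seconds=15.0, good_rates=(16000, 44100)):
    """Return a list of warnings for a WAV voice sample; an empty list means it looks fine."""
    with wave.open(str(path), "rb") as wav:
        rate = wav.getframerate()
        seconds = wav.getnframes() / float(rate)
    warnings = []
    if seconds < min_seconds:
        warnings.append(f"sample is {seconds:.1f}s; aim for at least {min_seconds:.0f}s")
    if rate not in good_rates:
        warnings.append(f"sample rate is {rate} Hz; 16 kHz or 44.1 kHz is recommended")
    return warnings
```

Calling check_voice_sample("voice_sample.wav") before the request gives you a chance to re-record rather than debug poor output afterwards.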

Streaming responses

For real-time voice chat or interactive applications, use streaming. The model starts returning audio before the full response is generated:

from openai import OpenAI

client = OpenAI(
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
    api_key="sk-YOUR_DASHSCOPE_KEY",
)

stream = client.chat.completions.create(
    model="qwen3.5-omni-flash",
    modalities=["text", "audio"],
    audio={"voice": "Ethan", "format": "pcm16"},
    messages=[
        {
            "role": "user",
            "content": "Explain how WebSocket connections differ from HTTP polling."
        }
    ],
    stream=True,
)

audio_chunks = []
text_chunks = []

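for chunk in stream:
    if not chunk.choices:
        continue  # skip chunks that carry no choices (for example, a usage-only chunk)
    delta = chunk.choices[0].delta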
    if hasattr(delta, "audio") and delta.audio:
        if delta.audio.get("data"):
            audio_chunks.append(delta.audio["data"])
    if delta.content:
        text_chunks.append(delta.content)
        print(delta.content, end="", flush=True)

print()  # newline after streaming text

# Combine and save audio chunks
if audio_chunks:
    import base64
    full_audio = b"".join(base64.b64decode(chunk) for chunk in audio_chunks)
    with open("streamed_response.pcm", "wb") as f:
        f.write(full_audio)

PCM16 format works well for streaming since you can pipe it directly to an audio output buffer without waiting for a complete file.
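A raw .pcm file has no header, so most players will not open it. If you want a playable file after the stream ends, you can wrap the bytes in a WAV container. The sketch below uses the standard-library wave module and assumes 24 kHz mono output, matching the sample rate used in the local deployment example; confirm the actual rate for your API responses. The helper name is hypothetical.

```python
import io
import wave


def pcm16_to_wav_bytes(pcm_bytes, sample_rate=24000, channels=1):
    """Wrap raw 16-bit PCM in a WAV container.

    sample_rate=24000 and channels=1 are assumptions, not documented API values.
    """
    if len(pcm_bytes) % (2 * channels) != 0:
        raise ValueError("PCM length is not a whole number of 16-bit frames")
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit audio
        wav.setframerate(sample_rate)
        wav.writeframes(pcm_bytes)
    return buf.getvalue()
```

Writing the returned bytes to streamed_response.wav gives a file that ordinary audio players can open.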

Multi-turn conversation with mixed modalities

Real conversations mix inputs across turns. Here’s how to manage conversation history with different modalities:

from openai import OpenAI

client = OpenAI(
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
    api_key="sk-YOUR_DASHSCOPE_KEY",
)

conversation = []

def send_message(content_parts):
    conversation.append({"role": "user", "content": content_parts})
    
    response = client.chat.completions.create(
        model="qwen3.5-omni-flash",
        messages=conversation,
    )
    
    reply = response.choices[0].message.content
    conversation.append({"role": "assistant", "content": reply})
    return reply

# Turn 1: text
print(send_message([{"type": "text", "text": "I have an API that keeps returning 503 errors."}]))

# Turn 2: add an image (error log screenshot)
import base64
with open("error_log.png", "rb") as f:
    img = base64.b64encode(f.read()).decode()

print(send_message([
    {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{img}"}},
    {"type": "text", "text": "Here's the error log screenshot. What's causing this?"}
]))

# Turn 3: follow-up text
print(send_message([{"type": "text", "text": "How do I fix the connection pool exhaustion you mentioned?"}]))

The 256K context window means you can carry long conversations, including ones with embedded images and audio, without hitting truncation issues.

Local deployment with HuggingFace

If you need to run Qwen3.5-Omni on your own infrastructure:

pip install transformers==4.57.3
pip install accelerate
pip install qwen-omni-utils -U
pip install -U flash-attn --no-build-isolation

import soundfile as sf
from transformers import Qwen3OmniMoeForConditionalGeneration, Qwen3OmniMoeProcessor
from qwen_omni_utils import process_mm_info

model_path = "Qwen/Qwen3-Omni-30B-A3B-Instruct"

model = Qwen3OmniMoeForConditionalGeneration.from_pretrained(
    model_path,
    device_map="auto",
    attn_implementation="flash_attention_2",
)
processor = Qwen3OmniMoeProcessor.from_pretrained(model_path)

conversation = [
    {
        "role": "system",
        "content": [
            {"type": "text", "text": "You are Qwen, a virtual human developed by the Qwen Team, Alibaba Group, capable of perceiving auditory and visual inputs, as well as generating text and speech."}
        ],
    },
    {
        "role": "user",
        "content": [
            {"type": "audio", "audio": "path/to/your/audio.wav"},
            {"type": "text", "text": "What is being discussed in this audio?"}
        ],
    },
]

text = processor.apply_chat_template(
    conversation,
    add_generation_prompt=True,
    tokenize=False,
)
audios, images, videos = process_mm_info(conversation, use_audio_in_video=True)
inputs = processor(
    text=text,
    audio=audios,
    images=images,
    videos=videos,
    return_tensors="pt",
    padding=True,
)
inputs = inputs.to(model.device).to(model.dtype)

text_ids, audio_output = model.generate(**inputs, speaker="Chelsie")

text_response = processor.batch_decode(text_ids, skip_special_tokens=True)[0]
sf.write("local_response.wav", audio_output.reshape(-1).cpu().numpy(), samplerate=24000)

print(text_response)

GPU memory requirements for local deployment:

Variant | Precision | Min VRAM
Plus (30B MoE) | BF16 | ~40GB
Flash | BF16 | ~20GB
Light | BF16 | ~10GB

For production local inference, use vLLM instead of HuggingFace Transformers. MoE models run faster under vLLM’s routing optimizations.

Testing your Qwen3.5-Omni requests with Apidog

Multimodal API requests are harder to debug than plain JSON. You’re dealing with base64-encoded audio and video, nested content arrays, and responses that can include both text and audio. Doing this from a terminal gets tedious quickly.

Apidog handles this cleanly. Set up your DashScope endpoint as a new collection, store your API key as an environment variable, and build request templates for each modality you’re working with.

For each variant (Plus, Flash, Light), you can duplicate the base request and change the model parameter. Run all three in sequence and compare responses, latency, and output quality in one view.

You can also write test assertions in Apidog to verify your multimodal responses.

This is useful when you’re deciding which variant to use in production.
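Apidog assertions are written inside Apidog itself, so the following is not Apidog script code. It is a Python sketch of the kind of structural checks you might assert on a parsed response body, assuming the OpenAI-style response shape used in the examples in this guide. The function name check_chat_response is hypothetical.

```python
def check_chat_response(body, expect_audio=False):
    """Return a list of problems found in a parsed chat-completion JSON body."""
    choices = body.get("choices") or []
    if not choices:
        return ["no choices in response"]
    message = choices[0].get("message") or {}
    audio = message.get("audio") or {}
    problems = []
    if expect_audio:
        # Audio responses carry base64 data under message.audio.data.
        if not audio.get("data"):
            problems.append("audio.data is missing or empty")
    elif not message.get("content"):
        problems.append("message.content is missing or empty")
    return problems
```

An empty list means the response has the expected shape; anything else names the field to inspect.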

Error handling and retry logic

Rate limits and timeouts are common with large multimodal models, especially for video inputs. Build retry handling from the start:

import time
import random
from openai import OpenAI, RateLimitError, APITimeoutError, APIConnectionError

client = OpenAI(
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
    api_key="sk-YOUR_DASHSCOPE_KEY",
    timeout=120,  # 2-minute timeout for large video inputs
)

def call_with_retry(messages, model="qwen3.5-omni-flash", max_retries=3):
    for attempt in range(max_retries):
        try:
            return client.chat.completions.create(
                model=model,
                messages=messages,
            )
        except RateLimitError:
            if attempt == max_retries - 1:
                raise  # out of attempts, so skip the final sleep
            wait = (2 ** attempt) + random.uniform(0, 1)
            print(f"Rate limit hit. Waiting {wait:.1f}s...")
            time.sleep(wait)
        except (APITimeoutError, APIConnectionError) as e:
            if attempt == max_retries - 1:
                raise
            wait = (2 ** attempt) + random.uniform(0, 1)
            print(f"Connection error: {e}. Retrying in {wait:.1f}s...")
            time.sleep(wait)
    raise RuntimeError(f"Failed after {max_retries} attempts")

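For video inputs larger than 100MB, consider passing a URL reference instead of base64-encoding the file, trimming the clip, or lowering its resolution.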

Common issues and fixes

“Audio output is garbled on numbers or technical terms”

This is the problem ARIA technology addresses. Make sure you’re on Qwen3.5-Omni (not an earlier version). If you’re self-hosting, use the latest model weights from HuggingFace.

“The model keeps talking when I send an audio interruption”

Semantic interruption requires the Flash or Plus variant. Light may not have this feature. Also check that you’re streaming the response (not batch) for interruption to work.

“Voice cloning quality is poor”

The voice sample needs to be clean. Remove background noise with a tool like Audacity before uploading. Use at least 15 seconds of audio. WAV at 16kHz or 44.1kHz works best.

“Video input returns an error about token limits”

256K tokens covers roughly 400 seconds of 720p video. Longer videos need trimming or lower resolution. Check your video duration and reduce to under 6 minutes for safety.

“Local deployment is very slow”

Use vLLM, not HuggingFace Transformers, for production local inference. MoE models need vLLM’s routing optimizations for reasonable throughput.

FAQ

Which DashScope model ID do I use for Qwen3.5-Omni?

Use qwen3.5-omni-plus, qwen3.5-omni-flash, or qwen3.5-omni-light depending on your quality and latency needs. Start with Flash for most use cases.

Can I use the OpenAI Python SDK with DashScope?

Yes. Set base_url="https://dashscope.aliyuncs.com/compatible-mode/v1" and use your DashScope key as api_key. The request and response format follows the OpenAI chat completions API, with additional content types such as video_url for multimodal input.

How do I send multiple files (audio + image) in one request?

Put them in the content array as separate typed objects alongside your text prompt. All four modalities can appear in the same message.
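As a sketch, a small builder can assemble such a message from whichever inputs you have. The part shapes mirror the examples earlier in this guide; the function name build_mixed_message and its parameters are hypothetical.

```python
def build_mixed_message(text, image_url=None, audio_b64=None, audio_format="wav", video_url=None):
    """Build one user message whose content array mixes media parts with a text prompt."""
    parts = []
    if image_url:
        parts.append({"type": "image_url", "image_url": {"url": image_url}})
    if audio_b64:
        parts.append({
            "type": "input_audio",
            "input_audio": {"data": audio_b64, "format": audio_format},
        })
    if video_url:
        parts.append({"type": "video_url", "video_url": {"url": video_url}})
    # The text prompt goes last, as in the other examples in this guide.
    parts.append({"type": "text", "text": text})
    return {"role": "user", "content": parts}
```

The result goes straight into the messages list of a chat.completions.create call.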

Is there a size limit for audio or video files?

DashScope has per-request payload limits. For large files, use a URL reference instead of base64 encoding. Host the file on accessible storage and pass the URL in the audio or video_url field.
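Base64 inflates a payload by about a third, which is worth accounting for when deciding between inline data and a URL. The arithmetic is sketched below; the max_inline_bytes threshold is a placeholder chosen for illustration, not a documented DashScope limit, and both function names are hypothetical.

```python
def base64_size(n_bytes):
    """Length in characters of the padded base64 encoding of n_bytes of data."""
    # Every 3 input bytes become 4 output characters, rounded up to a full group.
    return 4 * ((n_bytes + 2) // 3)


def should_use_url(n_bytes, max_inline_bytes=10 * 1024 * 1024):
    """Return True when the encoded payload would exceed the chosen inline budget.

    max_inline_bytes is an assumption; replace it with the limit that applies to you.
    """
    return base64_size(n_bytes) > max_inline_bytes
```

A 100MB video, for instance, grows to roughly 133MB once encoded, so it is a clear candidate for a URL reference.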

How do I disable audio output and get text only?

Set modalities=["text"] or omit the modalities parameter. Text-only responses are faster and cheaper.

Does it support function/tool calling?

Yes. Use the standard tools parameter with function definitions, same as with any OpenAI-compatible model. The model returns structured tool call objects that you execute in your own code.
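A minimal sketch of the client side of that loop: a tool definition in the OpenAI-style tools format and a dispatcher that runs the requested function. The get_status tool and its return value are invented for illustration, and the sketch assumes the model reports each call as a function name plus a JSON string of arguments, as OpenAI-compatible APIs do.

```python
import json

# A hypothetical tool definition in the OpenAI-style "tools" format.
TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "get_status",
            "description": "Return the HTTP status last seen for a service.",
            "parameters": {
                "type": "object",
                "properties": {"service": {"type": "string"}},
                "required": ["service"],
            },
        },
    }
]


def get_status(service):
    # Stand-in implementation; a real one would query your monitoring system.
    return {"service": service, "status": 503}


LOCAL_FUNCTIONS = {"get_status": get_status}


def run_tool_call(name, arguments_json):
    """Execute one tool call and return a JSON string to send back as the tool result."""
    if name not in LOCAL_FUNCTIONS:
        raise KeyError(f"Model requested unknown tool: {name}")
    kwargs = json.loads(arguments_json)
    return json.dumps(LOCAL_FUNCTIONS[name](**kwargs))
```

With the openai SDK you would pass tools=TOOLS in the request, call run_tool_call(call.function.name, call.function.arguments) for each entry in message.tool_calls, and append the result as a message with role "tool" and the matching tool_call_id before asking the model to continue.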

What’s the best way to handle long audio recordings?

For recordings under 10 hours, send them as a single request. For longer recordings, split at natural pause points and process each segment separately. Aggregate the results in your application layer.

How do I test my multimodal requests before building a full application?

Use Apidog to build and save request templates for each modality. You can switch between model variants, inspect the full response structure, and write assertions that verify output quality without writing application code first.
