What is Qwen-Audio?
Qwen-Audio is the audio branch of Alibaba Cloud's Qwen model family — a series of open-source large audio-language models that can listen to audio inputs and respond with natural text or speech. Where Qwen-VL teaches the Qwen language model to see, Qwen-Audio teaches it to hear. The result is a single, unified model that handles human speech, environmental sound, and music as first-class inputs, eliminating the awkward pipeline of "convert audio to text with a separate ASR system, then send the text to an LLM" that powers most older voice AI products.
The series began with the original Qwen-Audio in late 2023, followed by Qwen2-Audio in 2024 and the audio capabilities of Qwen3-Omni and Qwen3.5-Omni in 2026. The Qwen team also released Qwen3-TTS in January 2026 as a companion text-to-speech model, completing the loop so that Qwen models can not only listen but also speak in real time, with voice cloning from as little as three seconds of reference audio.
What makes Qwen-Audio distinctive in the crowded audio AI landscape is universal coverage. Most audio models specialize: Whisper handles speech-to-text, MusicLM handles music, CLAP handles sound classification. Qwen-Audio was designed from the start to handle all of these — speech, natural sound, music, songs — with a single unified architecture. You can feed it a meeting recording and ask "summarize the key decisions," then feed it a song and ask "what genre is this and what instruments are playing," and the same model handles both with strong accuracy.
Key Capabilities
- **Voice Chat**: Talk to the model directly. No separate ASR step; speech goes in, text or voice comes out.
- **Speech Recognition**: Highly accurate ASR in 113 languages and dialects, beating dedicated speech-to-text systems on many benchmarks.
- **Sound Analysis**: Identifies environmental sounds (glass breaking, dog barking, traffic) and describes what's happening.
- **Music Understanding**: Genre classification, instrument identification, mood analysis, and even lyrics transcription.
- **Translation**: Direct speech-to-text translation; listen in one language, answer in another, with no intermediate step needed.
- **Speaker Awareness**: Distinguishes multiple speakers and estimates gender and approximate age from voice characteristics.
- **Real-Time Streaming**: Qwen3-Omni delivers streaming voice responses with low latency, suitable for live conversation.
- **Voice Cloning (via Qwen3-TTS)**: Generates speech in any voice from 3 seconds of reference audio, with controllable emotion and pacing.
Architecture: How Qwen-Audio Listens
Qwen-Audio's architecture has evolved significantly across generations, but the core idea is consistent: take a powerful Qwen language model and extend it with an audio encoder that converts raw waveforms into tokens the LLM can reason about.
Qwen-Audio and Qwen2-Audio: The Foundation
The first two generations used a Whisper-initialized audio encoder followed by a learned projection adapter that maps audio features into the LLM's embedding space. Training happened in three stages: multi-task pretraining for audio-language alignment, supervised fine-tuning for downstream skills, and direct preference optimization (DPO) to match human preferences in dialogue. This setup gave Qwen2-Audio strong performance across speech recognition, sound classification, and music tasks while supporting two distinct interaction modes: voice chat (just talk to it) and audio analysis (provide audio plus a text instruction).
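As a mental model, the adapter is just a learned map from encoder frames into the LLM's embedding space. Here is a minimal PyTorch sketch of that bridge; the dimensions and the single-linear-layer design are illustrative assumptions, not the official Qwen-Audio hyperparameters.

```python
import torch
import torch.nn as nn

class AudioProjectionAdapter(nn.Module):
    """Toy version of the audio-to-LLM bridge (dims are assumptions)."""

    def __init__(self, audio_dim: int = 1280, llm_dim: int = 4096):
        super().__init__()
        # Maps each audio-encoder frame into the LLM token-embedding space.
        self.proj = nn.Linear(audio_dim, llm_dim)

    def forward(self, audio_features: torch.Tensor) -> torch.Tensor:
        # audio_features: (batch, frames, audio_dim) from a Whisper-style encoder
        # returns:        (batch, frames, llm_dim), i.e. "audio tokens" the LLM
        # attends over alongside ordinary text embeddings
        return self.proj(audio_features)

adapter = AudioProjectionAdapter()
frames = torch.randn(1, 750, 1280)   # dummy frame count for illustration
audio_embeds = adapter(frames)
print(audio_embeds.shape)            # torch.Size([1, 750, 4096])
```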
Qwen3-Omni and Qwen3.5-Omni: The Native Audio Transformer
The newest generation breaks from the Whisper-encoder tradition entirely. Qwen3-Omni and Qwen3.5-Omni use a native Audio Transformer (AuT) encoder pretrained on more than 100 million hours of audio-visual data. Rather than treating audio as something to be transcribed and then reasoned about, the AuT encoder gives the model a grounded sense of temporal and acoustic patterns, which dramatically improves performance on tasks that depend on prosody, timing, and acoustic context.
Thinker-Talker Architecture
Qwen3-Omni introduces a bifurcated architecture with two components: the Thinker reasons about the input and decides what to say, and the Talker converts that decision into actual speech tokens in real time. Both components use Hybrid-Attention Mixture-of-Experts (MoE), which means only a fraction of parameters activate per token, keeping latency low even for a model with hundreds of billions of total parameters. This split is what enables Qwen3-Omni to deliver fluent streaming voice without the awkward stutters and number-misreadings that plague single-stream voice models.
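To picture the split, here is a toy, runnable sketch of the streaming control flow; the real Thinker and Talker are MoE transformers, and every name below is invented for illustration.

```python
from typing import Iterator

def thinker_stream(prompt: str) -> Iterator[str]:
    # Stand-in for the Thinker: yields the reply one text token at a time.
    for token in "Streaming replies begin before the sentence is finished".split():
        yield token

def talker_stream(tokens: Iterator[str], chunk_size: int = 3) -> Iterator[list[str]]:
    # Stand-in for the Talker: converts small chunks of text tokens into
    # (fake) speech-codec tokens as each chunk arrives, so audio playback
    # can start long before the Thinker finishes its full reply.
    chunk: list[str] = []
    for tok in tokens:
        chunk.append(tok)
        if len(chunk) == chunk_size:
            yield [f"<codec:{t}>" for t in chunk]
            chunk = []
    if chunk:
        yield [f"<codec:{t}>" for t in chunk]

for codec_chunk in talker_stream(thinker_stream("what was that sound?")):
    print("play:", codec_chunk)  # a real system would stream audio here
```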
ARIA Alignment
A common failure mode in streaming voice models is misalignment between the text the model "thinks" and the audio it actually outputs — numbers get misread, words get clipped, transitions stutter. Alibaba's ARIA (Adaptive Rate Interleave Alignment) technique dynamically aligns text and speech units during generation, eliminating these issues and giving Qwen3-Omni notably smoother conversational flow than earlier streaming voice systems.
Model Variants at a Glance
| Model | Year | Best For | Size |
|---|---|---|---|
| Qwen-Audio | 2023 | Original audio understanding (legacy) | 7B |
| Qwen2-Audio-7B | 2024 | Pretrained base for fine-tuning | 7B |
| Qwen2-Audio-7B-Instruct | 2024 | Chat-tuned voice & analysis | 7B |
| Qwen3-Omni-30B-A3B-Instruct | 2025 | Production omni-modal chat | 30B MoE |
| Qwen3-Omni-30B-A3B-Thinking | 2025 | Deep audio reasoning | 30B MoE |
| Qwen3-TTS (0.6B / 1.7B) | Jan 2026 | Voice cloning & speech generation | 0.6–1.7B |
| Qwen3.5-Omni-Plus | Mar 2026 | Frontier real-time omni-modal | Multi-billion |
Download & Access Qwen-Audio
Like Qwen-VL, the Qwen-Audio family is genuinely open. Model weights are public on Hugging Face and ModelScope, and the original training code is on GitHub. There are also hosted options if you'd rather not run inference yourself.
- 🤗 **Hugging Face**: Official weights for Qwen2-Audio, Qwen3-Omni, and Qwen3-TTS. The canonical source.
- 📦 **ModelScope**: Alibaba's own model hub, recommended for users in mainland China for faster downloads.
- 🐙 **GitHub (Qwen2-Audio)**: Official Qwen2-Audio repo with training code, inference scripts, and demos.
- 🐙 **GitHub (Qwen3-Omni)**: The latest omni-modal repo covering audio, vision, and real-time voice.
- 🌐 **Qwen Chat (Web)**: Talk to Qwen-Omni in your browser; full voice chat, no install required.
- ☁️ **Alibaba Cloud API**: Hosted API for production; pay-per-use, includes Qwen3-TTS voice design.
Installation Guide
How you "install" Qwen-Audio depends on what you want to do. For casual exploration use the web app. For local inference on your own data, use Hugging Face transformers or vLLM. For real-time voice deployment, use the official Docker image. We'll cover each path below.
Option 1 — Qwen Chat (Web, Zero Install)
- Open chat.qwen.ai in any modern browser.
- Sign in with an Alibaba Cloud, Google, or phone-number account.
- Click the microphone icon in the input bar to record audio, or attach an audio file.
- Type or speak your question about the audio and submit.
- For full voice chat with spoken replies, switch to voice mode in the chat settings — Qwen-Omni will respond aloud in real time.
Option 2 — Hugging Face Transformers (Python)
For local inference on your own GPU, the Hugging Face route is the most flexible. You'll need Python 3.10+, PyTorch, and roughly 16 GB of VRAM for the 7B model.
- Install the required libraries: `pip install transformers torch librosa accelerate`
- (Optional) For Qwen3-Omni, install the latest transformers from source: `pip install git+https://github.com/huggingface/transformers`
- Use the snippet below to run your first audio query against Qwen2-Audio-7B-Instruct.
```python
from io import BytesIO
from urllib.request import urlopen

import librosa
from transformers import Qwen2AudioForConditionalGeneration, AutoProcessor

model_id = "Qwen/Qwen2-Audio-7B-Instruct"
processor = AutoProcessor.from_pretrained(model_id)
model = Qwen2AudioForConditionalGeneration.from_pretrained(
    model_id, device_map="auto"
)

# Build a conversation with an audio input
conversation = [
    {"role": "user", "content": [
        {"type": "audio",
         "audio_url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/"
                      "Qwen2-Audio/audio/glass-breaking-151256.mp3"},
        {"type": "text", "text": "What's that sound? Should I be alarmed?"},
    ]}
]
text = processor.apply_chat_template(
    conversation, add_generation_prompt=True, tokenize=False
)

# Load every audio element referenced in the conversation
audios = []
for msg in conversation:
    if isinstance(msg["content"], list):
        for el in msg["content"]:
            if el["type"] == "audio":
                audio, _ = librosa.load(
                    BytesIO(urlopen(el["audio_url"]).read()),
                    sr=processor.feature_extractor.sampling_rate,
                )
                audios.append(audio)

inputs = processor(text=text, audios=audios, return_tensors="pt", padding=True)
inputs = inputs.to(model.device)

generate_ids = model.generate(**inputs, max_length=512)
# Strip the prompt tokens, keeping only the newly generated reply
response = processor.batch_decode(
    generate_ids[:, inputs.input_ids.size(1):],
    skip_special_tokens=True,
)[0]
print(response)
```
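The same pipeline also supports pure voice-chat mode, where the spoken words themselves are the instruction. Drop the text element from the user turn and keep the rest of the code unchanged; the sample URL below follows the naming pattern of the official demo files and is an assumption you may need to swap out.

```python
# Voice-chat mode: the user turn carries only audio; the speech itself
# is the instruction, so no accompanying text element is needed.
conversation = [
    {"role": "user", "content": [
        {"type": "audio",
         "audio_url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/"
                      "Qwen2-Audio/audio/guess_age_gender.wav"},
    ]}
]
# Template application, audio loading, and generation proceed exactly
# as in the snippet above.
```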
Option 3 — Qwen3-Omni via Docker (Real-Time Voice)
For the full Qwen3-Omni experience including real-time voice chat with a web UI, the official Docker image is the easiest path.
```bash
# Pull and run the Qwen3-Omni demo container
docker run --gpus all -it -p 8901:80 \
    --name qwen3-omni \
    -v /path/to/your/workspace:/data/shared/Qwen3-Omni \
    qwenllm/qwen3-omni

# Inside the container, launch the web demo
python web_demo.py \
    -c Qwen/Qwen3-Omni-30B-A3B-Instruct \
    --server-port 80 \
    --server-name 0.0.0.0
```
Once running, open http://localhost:8901 in your browser for a full voice-chat interface.
Option 4 — vLLM for Production Serving
For high-throughput production deployment with batching and an OpenAI-compatible API:
```bash
# Install vLLM with Omni support
pip install vllm-omni

# Serve Qwen3-Omni with an OpenAI-compatible API
vllm serve Qwen/Qwen3-Omni-30B-A3B-Instruct \
    --host 0.0.0.0 \
    --port 8000 \
    --max-model-len 32768
```
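Once the server is up, any OpenAI-compatible client can reach it. A minimal text round-trip sketch is below; the model name must match what you passed to `vllm serve`, and for audio inputs check the vLLM docs for the supported content format.

```python
from openai import OpenAI

# Point the standard OpenAI client at the local vLLM server started above.
client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")

response = client.chat.completions.create(
    model="Qwen/Qwen3-Omni-30B-A3B-Instruct",
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
)
print(response.choices[0].message.content)
```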
Option 5 — Qwen3-TTS for Voice Output
If you specifically need speech generation (not just understanding), the standalone Qwen3-TTS family is your friend. The lightweight setup with uv:
```bash
# One-shot CLI via uv (no local install needed)
uv run https://tools.simonwillison.net/python/q3_tts.py \
    'Hello, world! This is Qwen3-TTS speaking.' \
    -i 'warm friendly voice' \
    -o hello.wav
```
On Apple Silicon Macs, use `pip install mlx-audio` for an optimized MLX build.
Using the Hosted API
If you'd rather not run inference yourself, Alibaba Cloud's Model Studio (DashScope) hosts the Qwen-Audio and Qwen-Omni models with pay-per-token pricing. The endpoint is OpenAI-compatible, so any existing OpenAI client works with a base URL swap.
```python
import base64
import os

from openai import OpenAI

client = OpenAI(
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
)

# Encode local audio as base64
with open("meeting.mp3", "rb") as f:
    audio_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="qwen-audio-turbo",
    messages=[{
        "role": "user",
        "content": [
            {"type": "input_audio",
             "input_audio": {"data": audio_b64, "format": "mp3"}},
            {"type": "text", "text": "Summarize the key decisions in this meeting."},
        ],
    }],
)
print(response.choices[0].message.content)
```
Real-World Use Cases
Voice Assistants and Smart Speakers
The most obvious application is voice assistants. Because Qwen-Audio bypasses the traditional ASR-LLM-TTS pipeline, the latency between you speaking and the assistant replying is dramatically lower. Combined with Qwen3-TTS for output, you get a complete voice-in voice-out loop that feels conversational rather than transactional.
Meeting Transcription and Summarization
Feed Qwen-Audio a recorded meeting and ask for action items, decisions, or a full transcript with speaker labels. Because it understands both speech and context simultaneously, the summaries are noticeably sharper than a "transcribe then summarize" pipeline.
Audio Content Moderation
For platforms hosting user audio (podcasts, voice messages, video uploads), Qwen-Audio can flag problematic content — hate speech, violence, copyrighted music — far more accurately than keyword-based filters operating on a transcription.
Music Tagging and Discovery
Qwen-Audio understands music structurally: tempo, key, instrumentation, mood, genre. Music streaming services, DJ tools, and music libraries use it for automatic tagging and similarity search without needing specialized music models.
Accessibility
Real-time sound description for deaf and hard-of-hearing users, voice-driven interfaces for users with motor impairments, and multilingual translation for users with limited English literacy all benefit from Qwen-Audio's unified architecture.
Audiobook Production with Qwen3-TTS
The voice cloning capability means a single voice actor can record a short reference sample and then "narrate" thousands of pages automatically. Combined with Qwen3-TTS's emotion and pacing controls, this enables consistent, natural narration for long-form content at a fraction of the production cost.
Tips for Best Results
- Use clean audio when possible. Qwen-Audio is robust to background noise, but a clean recording at 16 kHz or higher always produces sharper results.
- Specify the task clearly. "Transcribe this in Spanish" or "identify the genre and instruments" produces better output than vague prompts like "what's this?".
- For long audio, use Qwen3-Omni. Qwen2-Audio is limited to clips of roughly 30 seconds; Qwen3-Omni natively handles recordings tens of minutes long.
- Try the Thinking variant for hard cases. Multi-speaker meetings, ambiguous sounds, and music with subtle features benefit from extended reasoning.
- Use voice cloning responsibly. Qwen3-TTS makes voice cloning trivial — always get consent from the person whose voice you're cloning, and consider watermarking your output.
Frequently Asked Questions
Is Qwen-Audio free to use commercially?
The Qwen-Audio family is released under permissive licenses, with most models allowing free commercial use. Always check the LICENSE file in the specific Hugging Face repository, since terms can vary slightly between models.
Can it run on a laptop without a GPU?
Qwen2-Audio-7B can run on a CPU with enough RAM (32 GB recommended) but inference will be slow — expect seconds per response rather than real-time. For real-time voice chat you'll want a GPU with at least 16 GB of VRAM, or an Apple Silicon Mac with 32 GB+ unified memory.
How does it compare to Whisper for speech recognition?
On standard ASR benchmarks like LibriSpeech and CommonVoice, Qwen2-Audio significantly outperformed previous state-of-the-art models including Whisper at release. For pure transcription Whisper is still excellent and lighter-weight, but when you need both transcription and downstream reasoning in one model, Qwen-Audio is the better choice.
Does it work with phone-call quality audio?
Yes. Qwen-Audio handles 8 kHz telephone audio reasonably well, though 16 kHz or higher gives noticeably better results. For critical use cases, upsample first or use a specialized telephony model.
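Upsampling is a two-liner with librosa and soundfile; in this sketch, "call.wav" is a placeholder filename.

```python
import librosa
import soundfile as sf

# Load 8 kHz telephone audio and resample to 16 kHz before inference.
y, sr = librosa.load("call.wav", sr=8000)
y_16k = librosa.resample(y, orig_sr=8000, target_sr=16000)
sf.write("call_16k.wav", y_16k, 16000)
```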
Can I fine-tune it on my own audio data?
Yes. The official GitHub repositories include training and fine-tuning scripts, and tools like Unsloth and LLaMA-Factory support efficient LoRA fine-tuning of Qwen2-Audio on consumer hardware.
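As a rough sketch of what a LoRA setup looks like with the peft library; the rank, dropout, and target modules here are illustrative defaults, not a tested recipe.

```python
from peft import LoraConfig, get_peft_model
from transformers import Qwen2AudioForConditionalGeneration

model = Qwen2AudioForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-Audio-7B-Instruct", device_map="auto"
)

# Illustrative LoRA config: low-rank adapters on the attention projections.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of all weights
```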
Final Thoughts
Qwen-Audio is one of the most underrated releases in the open-source AI ecosystem. While vision-language models grab most of the attention, audio is in many ways a harder problem — it's continuous, temporal, and dense with information that text can only crudely approximate. The fact that Alibaba has released a series of audio-language models that handle speech, sound, music, and real-time voice interaction at this level of quality, all with open weights and permissive licenses, is a significant gift to the developer community.
For most applications today, Qwen3-Omni is the right starting point — it handles audio, vision, and text in a single model with real-time streaming. For pure audio understanding without the multimodal overhead, Qwen2-Audio remains a strong, lightweight choice. And for any application that needs to speak back to users, Qwen3-TTS rounds out the stack with state-of-the-art voice generation and cloning.
The easiest way to start is at chat.qwen.ai, where you can have a full voice conversation with Qwen-Omni in your browser. Once you're convinced of the quality, the open weights are one git clone away.