Qwen Coder: Open Source Coding LLM

The code-specialized LLM family from Alibaba: six sizes from 0.5B to 32B,
128K context, 92 programming languages, and Apache 2.0 licensing.


What is Qwen-Coder?

Qwen-Coder is the code-specialized branch of Alibaba Cloud's Qwen model family — a series of open-source large language models built specifically for software engineering tasks. Originally launched as CodeQwen and renamed in late 2024, the family today comprises the production-ready Qwen2.5-Coder series (six sizes from 0.5B to 32B) and the newer Qwen3-Coder line including Qwen3-Coder-30B-A3B-Instruct and the agentic-optimized Qwen3-Coder-Next. All variants are released under Apache 2.0, fully self-hostable, and engineered specifically to do one thing exceptionally well: write, complete, repair, and reason about code.

What makes Qwen-Coder distinctive is the deliberate trade-off behind it. Rather than building a general-purpose model that happens to code, Alibaba's team built code models that happen to retain strong general reasoning. The training corpus is 5.5 trillion tokens deliberately balanced at roughly 45% code and 55% natural language, with meticulous data cleaning, scalable synthetic data generation, and balanced sampling across languages and repository types. The result is a family of models that punch dramatically above their weight on code benchmarks while still handling math, instruction-following, and general dialogue surprisingly well.

The flagship Qwen2.5-Coder-32B-Instruct is the headline release — it matches GPT-4o on EvalPlus, LiveCodeBench, and BigCodeBench, scores 73.7 on the Aider code-repair benchmark (also matching GPT-4o), and handles 92 programming languages with strong performance on niche languages like Haskell, Racket, and OCaml. The 7B variant runs on a single consumer GPU and matches earlier 20B+ code models. The 0.5B and 1.5B variants make on-device code completion feasible. And the Qwen3-Coder-30B-A3B Mixture-of-Experts model brings the same code intelligence to agentic workflows where latency and throughput matter as much as raw accuracy.

Key Features

Qwen-Coder is built around software engineering as a first-class task, not as a side capability. The headline features: six model sizes from 0.5B to 32B (plus Mixture-of-Experts variants), 128K-token context for repository-scale work, coverage of 92 programming languages, fill-in-the-middle training for in-editor completion, and Apache 2.0 licensing across the whole family.

Demo

Here's a quick example of what Qwen-Coder does in practice. Given a natural-language prompt, the 32B model generates clean, idiomatic, production-leaning code with appropriate type hints, docstrings, and error handling — not just the bare minimum that compiles.

qwen-coder demo — chat.qwen.ai
Write a Python function that downloads a URL with retries and exponential backoff. Use the requests library.
import requests
import time
from typing import Optional

def fetch_with_retry(
    url: str,
    max_retries: int = 5,
    base_delay: float = 1.0,
    timeout: float = 10.0,
) -> Optional[requests.Response]:
    """Fetch a URL with exponential backoff on transient failures.

    Retries on connection errors, timeouts, and 5xx responses.
    Returns the Response on success, or None if all retries fail.
    """
    for attempt in range(max_retries):
        try:
            response = requests.get(url, timeout=timeout)
            if response.status_code < 500:
                return response
            # 5xx: fall through and retry with backoff
        except (requests.ConnectionError, requests.Timeout):
            pass
        delay = base_delay * (2 ** attempt)
        time.sleep(delay)
    return None

Notice what the model did without being asked: it added type hints, wrote a real docstring explaining the retry behavior, and distinguished transient failures (connection errors, timeouts, 5xx responses) from permanent ones (4xx), retrying only the former. This kind of judgment about idiomatic code style is what separates a code-specialized model from a general LLM doing its best.

Try the demo yourself: open chat.qwen.ai, pick Qwen-Coder from the model menu, and paste a real coding task from your work. For IDE integration, install the Continue extension in VS Code and point it at a local Qwen-Coder instance for fill-in-the-middle completion as you type.

Architecture and Training

Qwen-Coder inherits the decoder-only transformer architecture from the Qwen2.5 base series, with grouped query attention, rotary position embeddings, RMSNorm, and SwiGLU activations. All six Qwen2.5-Coder sizes share the same head dimensions but differ in layer count, hidden size, and intermediate size — the 7B model uses 28 layers with a hidden size of 3584, while the 32B scales to 64 layers and 5120 hidden size. The newer Qwen3-Coder-30B-A3B variant is a Mixture-of-Experts model that activates only 3B parameters per token, giving you 30B-class capability with 3B-class inference cost.
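You can check these hyperparameters against the published checkpoints yourself. A minimal sketch using transformers, which reads only the small config file from the hub, not the weights:

from transformers import AutoConfig

# Compare architecture hyperparameters across two Qwen2.5-Coder sizes.
for model_id in ("Qwen/Qwen2.5-Coder-7B-Instruct", "Qwen/Qwen2.5-Coder-32B-Instruct"):
    cfg = AutoConfig.from_pretrained(model_id)
    print(model_id, cfg.num_hidden_layers, cfg.hidden_size, cfg.num_attention_heads)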

The training pipeline is where the real work happens. Continued pretraining starts from the Qwen2.5 base checkpoint and runs for 5.5 trillion additional tokens of carefully curated data. The team built scalable synthetic data generation pipelines to expand coverage of rare languages and edge cases, applied multiple rounds of deduplication and quality filtering, and balanced the mixture so the model retains general and math capabilities rather than overfitting to code. Post-training includes instruction tuning on millions of code-related conversations, fill-in-the-middle objectives for completion tasks, and reinforcement learning from human feedback specifically calibrated for code quality.

The Qwen3-Coder-Next variant takes a different approach: rather than scaling model size, the team scaled agentic training. The model was specifically post-trained on long-horizon software engineering tasks — multi-file edits, test-driven debugging cycles, codebase navigation — which makes it disproportionately strong at the kinds of workflows actually run by production coding agents like SWE-Agent, OpenHands, and Cursor's agent mode. On SWE-Bench Pro, Qwen3-Coder-Next matches or outperforms models with an order of magnitude more active compute.

Downloads

The full Qwen-Coder family is openly downloadable. Pick the source that matches your workflow — direct weights from Hugging Face for maximum control, Ollama for one-line local serving, or hosted access via the web app and API.

🤗 Hugging Face

Canonical weights for all Qwen2.5-Coder and Qwen3-Coder sizes — base, instruct, and quantized.

Browse models →

📦 ModelScope

Alibaba's official model hub, recommended for users in mainland China for faster downloads.

Browse →

🦙 Ollama

Pre-quantized GGUF builds with one-command local serving and an OpenAI-compatible API.

Pull from Ollama →

🐙 GitHub

Official repo with inference code, evaluation scripts, training recipes, and IDE integration examples.

View on GitHub →

🌐 Qwen Chat (Web)

Try Qwen-Coder live in your browser — no install, fastest way to evaluate the model.

Open chat →

☁️ Alibaba Cloud API

Hosted API access via DashScope with pay-per-token billing and OpenAI-compatible endpoints.

Get API key →

Hardware quick reference: the 7B model at 4-bit quantization fits in 4 GB of VRAM (an RTX 3060 is plenty), the 14B fits in 8 GB at Q4, and the 32B in 18 GB at Q4 (RTX 4090). Unquantized FP16 inference needs roughly four times the Q4 footprint, about 2 GB per billion parameters. Budget several extra GB on top of that for the KV cache if you run long contexts, up to the full 128K window.
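The rule of thumb behind those numbers is simple: weight memory is roughly parameter count times bits per weight divided by 8, plus about 10% overhead. A quick sketch (the 10% overhead factor is an assumption, not a measured constant):

def vram_gb(params_billion: float, bits: int) -> float:
    """Approximate weight memory in GB: params x bits/8, plus ~10% overhead."""
    return params_billion * bits / 8 * 1.10

for size in (7, 14, 32):
    q4, fp16 = vram_gb(size, 4), vram_gb(size, 16)
    print(f"{size}B: ~{q4:.1f} GB at Q4, ~{fp16:.1f} GB at FP16")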

Installation Guide

The right install path depends on whether you want to use Qwen-Coder as an interactive assistant, integrate it into your IDE, or serve it as an API for your own apps. Here are four practical setups covering all three cases.

Option 1: Ollama (Easiest, One Command)

Ollama is the simplest way to run Qwen-Coder locally. Install Ollama from ollama.com/download, then pull and run the model:

# Pull the 7B Instruct model (default, good for most use)
ollama pull qwen2.5-coder

# Or pick a specific size
ollama pull qwen2.5-coder:32b      # full power, needs 18 GB+ VRAM
ollama pull qwen2.5-coder:14b      # balanced
ollama pull qwen2.5-coder:1.5b     # edge / laptop

# Start interactive chat
ollama run qwen2.5-coder

# Or serve as an API at http://localhost:11434
ollama serve

Ollama exposes an OpenAI-compatible HTTP API once running, so you can point any OpenAI client at http://localhost:11434/v1 with a placeholder API key and start making requests immediately.
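For instance, a minimal smoke test with the official openai Python package (the model tag and prompt are illustrative; the API key is required by the client but ignored by Ollama):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

response = client.chat.completions.create(
    model="qwen2.5-coder",  # must match the tag you pulled
    messages=[{"role": "user", "content": "Write a Python one-liner that flattens a nested list."}],
)
print(response.choices[0].message.content)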

Option 2: Hugging Face Transformers (Python, Full Control)

For maximum flexibility, run Qwen-Coder directly through Hugging Face transformers. You'll need Python 3.10+, PyTorch, and roughly 16 GB of VRAM for the 7B at FP16 or 4 GB at 4-bit:

pip install transformers torch accelerate
# Optional: 4-bit quantization
pip install bitsandbytes

Then write your first code-generation request:

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2.5-Coder-7B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="auto",
)

prompt = (
    "Write a Rust function that reads a CSV file from disk "
    "and returns a vector of structs. Use the csv crate."
)

messages = [
    {"role": "system", "content": "You are an expert programmer. "
                                  "Write clean, idiomatic, well-documented code."},
    {"role": "user", "content": prompt}
]

text = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
inputs = tokenizer([text], return_tensors="pt").to(model.device)

output = model.generate(**inputs, max_new_tokens=1024)
response = tokenizer.decode(
    output[0][inputs.input_ids.shape[1]:],
    skip_special_tokens=True,
)

print(response)

Option 3: IDE Integration via Continue (VS Code / JetBrains)

For fill-in-the-middle code completion in your editor — the killer feature of any code model — use the Continue extension. It works with VS Code, JetBrains IDEs, and others, and connects to a local Ollama or Hugging Face server:
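A minimal config sketch, assuming the Ollama setup from Option 1. Continue reads a config.json from ~/.continue/; the exact schema varies across Continue versions, so treat this as illustrative rather than canonical:

{
  "models": [
    {
      "title": "Qwen2.5-Coder 7B",
      "provider": "ollama",
      "model": "qwen2.5-coder:7b"
    }
  ],
  "tabAutocompleteModel": {
    "title": "Qwen2.5-Coder 1.5B",
    "provider": "ollama",
    "model": "qwen2.5-coder:1.5b"
  }
}

Pointing tab autocomplete at the small 1.5B model and chat at the 7B keeps completion latency low while preserving quality where it matters.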

Option 4: vLLM for Production Serving

For team or production deployment with batching, high throughput, and an OpenAI-compatible API:

pip install vllm

vllm serve Qwen/Qwen2.5-Coder-32B-Instruct \
    --host 0.0.0.0 --port 8000 \
    --max-model-len 32768 \
    --tensor-parallel-size 2     # adjust to your GPU count

Once running, hit http://localhost:8000/v1/chat/completions from any OpenAI-compatible client. vLLM also handles continuous batching, so you can serve dozens of concurrent users from a single GPU without queue blocking.
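As a quick smoke test, here is a streaming request from Python (a sketch: the model name must match what you passed to vllm serve, and "EMPTY" is just a placeholder key):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Stream tokens as they are generated, which suits interactive tools.
stream = client.chat.completions.create(
    model="Qwen/Qwen2.5-Coder-32B-Instruct",
    messages=[{"role": "user", "content": "Write a binary search function in Python."}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)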

Using the Hosted API

If you'd rather skip self-hosting, Alibaba Cloud's DashScope hosts Qwen-Coder with pay-per-token pricing. The endpoint is OpenAI-compatible, so any existing OpenAI client works with a base URL swap:

from openai import OpenAI
import os

client = OpenAI(
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
)

response = client.chat.completions.create(
    model="qwen2.5-coder-32b-instruct",
    messages=[
        {"role": "system", "content": "You are an expert programmer."},
        {"role": "user", "content": "Refactor this function to use async/await: ..."}
    ],
    # Qwen-recommended sampling extra for code generation
    extra_body={"repetition_penalty": 1.1}
)

print(response.choices[0].message.content)

Real-World Use Cases

Qwen-Coder's combination of strong benchmarks, open licensing, and a full size range makes it a strong fit for several practical applications. Code assistants and copilots are the most obvious — the model has been shown working inside Cursor, Continue, and bespoke editor plugins, providing inline completion and chat-based refactoring without sending your code to a third party. Repository-level code understanding tools use Qwen-Coder's 128K context to ingest a full codebase and answer questions about it, locate functions, or explain how subsystems connect.

For agentic engineering workflows, Qwen3-Coder-Next excels at SWE-Bench-style tasks where the model must read code, run tests, identify bugs, and produce verified patches. Code review automation uses Qwen-Coder to flag issues in pull requests, suggest improvements, and check for stylistic consistency. And in education, the 1.5B and 3B models are small enough to embed in interactive tutorials, where they can explain concepts, generate worked examples, and grade student solutions without the latency or cost of calling a hosted API.

Tips and Best Practices

A few pragmatic tips for getting the most out of Qwen-Coder. First, match the model size to the task — the 32B isn't strictly better for every workflow; for in-IDE tab completion, the 1.5B or 7B variants give noticeably lower latency and the quality difference is small. Use the 32B for complex, multi-file generation or repository-level reasoning where accuracy beats speed. Second, use the right system prompt: "You are an expert programmer. Write clean, idiomatic, well-documented code." consistently produces better output than generic instructions, because the model was trained on data with similar framing.

For code completion (filling in a function body, completing a line), enable FIM mode rather than using chat-style prompts — Qwen-Coder was specifically trained with FIM objectives and produces sharper completions in that mode. When generating new code, set repetition_penalty to 1.1 (the Qwen-specific default) rather than the OpenAI frequency_penalty. And finally, always review generated code before running it. Like all code models, Qwen-Coder occasionally produces plausible-looking but subtly wrong code — strong typing, tests, and human review remain essential.
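As a sketch of what FIM prompting looks like in practice, here is a completion request against the base (non-instruct) model. The special tokens are the ones documented for Qwen2.5-Coder; the function being completed is illustrative:

from transformers import AutoModelForCausalLM, AutoTokenizer

# FIM uses the base model, not the instruct variant.
model_id = "Qwen/Qwen2.5-Coder-7B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

# The model generates the code that belongs between prefix and suffix.
prefix = "def fib(n: int) -> int:\n    "
suffix = "\n    return fib(n - 1) + fib(n - 2)\n"
prompt = f"<|fim_prefix|>{prefix}<|fim_suffix|>{suffix}<|fim_middle|>"

inputs = tokenizer([prompt], return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(out[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))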

Final Thoughts

Qwen-Coder is one of the most complete open-source code model releases available today. The combination of six (going on eight) model sizes, 128K context, Apache 2.0 licensing, strong benchmarks across generation, completion, repair, and reasoning, and seamless integration with the existing developer tool ecosystem (Ollama, Continue, vLLM) makes it the obvious default for teams that want frontier-level coding capability without the closed-API tax. The newer Qwen3-Coder variants extend that lineage into the agentic era, where models don't just write code — they edit, test, and verify it across full repositories.

The easiest way to start is at chat.qwen.ai with a real coding task from your work. If the output meets your bar, one ollama pull qwen2.5-coder later you'll have it running locally with no ongoing cost, no vendor lock-in, and no API rate limits.