Qwen3-Coder-30B-A3B-Instruct-FP8: Efficient, Scalable Agentic Coding for the Future


Qwen3-Coder-30B-A3B-Instruct-FP8 is the FP8-quantized release of Qwen3-Coder-30B-A3B-Instruct, a Mixture-of-Experts member of the Qwen3-Coder series. It combines strong coding performance, fine-grained FP8 quantization, and a native 256K-token context window, and it is tuned for agentic coding tasks. Whether you're building AI-assisted developer tools, autonomous research agents, or long-context code comprehension systems, this model offers a practical and scalable option.


Model Overview

| Feature | Details |
|---|---|
| Model Type | Causal Language Model |
| Training Stage | Pretraining & Post-training |
| Parameters | 30.5B total, 3.3B activated |
| Architecture | 48 layers; GQA with 32 query heads and 4 key/value heads |
| Experts | 128 total, 8 activated per token |
| Context Length | 262,144 tokens (native), up to about 1M with YaRN |
| Quantization | Fine-grained FP8, block size 128 |
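The "30.5B total, 3.3B activated" figure comes from expert routing: each token is processed by only 8 of the 128 experts in a layer. The sketch below illustrates a common top-k gating scheme (softmax over router logits, keep the 8 largest, renormalize their weights). It is a simplified illustration, not necessarily the exact routing code used in Qwen3-Coder, and the router logits are made-up numbers.

```python
import math
import random

NUM_EXPERTS = 128  # experts per MoE layer (from the table above)
TOP_K = 8          # experts activated per token

def route_token(router_logits, top_k=TOP_K):
    """Return [(expert_index, weight), ...] for the top_k experts.

    Simplified sketch: softmax over all experts, keep the top_k,
    then renormalize so the selected weights sum to 1.
    """
    # Numerically stable softmax over all expert logits
    m = max(router_logits)
    exps = [math.exp(x - m) for x in router_logits]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Pick the top_k experts by probability, highest first
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]

    # Renormalize the selected weights
    selected_total = sum(probs[i] for i in top)
    return [(i, probs[i] / selected_total) for i in top]

if __name__ == "__main__":
    random.seed(0)
    logits = [random.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]  # made-up router scores
    routes = route_token(logits)
    print("experts used:", [i for i, _ in routes])
    print("weights sum:", round(sum(w for _, w in routes), 6))
    print(f"fraction of experts active: {TOP_K / NUM_EXPERTS:.4f}")
```

Only 8/128 = 6.25% of each layer's expert weights are used per token. The 3.3B activated count is higher than 6.25% of 30.5B because attention, embedding, and router weights are used for every token.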

Note: This model supports only non-thinking mode and does not generate <think></think> blocks in its output, so specifying enable_thinking=False is no longer required.


Key Enhancements

✅ 1. Agentic Coding at Scale

Qwen3-Coder-30B-A3B-Instruct-FP8 is built for agentic use cases:

  • Supports tool-calling natively

  • Works with agentic coding platforms such as Qwen Code and CLINE, and can be served behind OpenAI-compatible APIs

  • Uses a specially designed function-call format, while tools can be declared with the OpenAI-style tools schema used in the tool calling example below

✅ 2. Long-Context Support

The model natively supports 256K (262,144) tokens of context, extendable to about 1M tokens with YaRN. This makes it well suited for:

  • Reading large codebases

  • Multi-file reasoning

  • Repository-level understanding
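Going beyond the native window means scaling RoPE positions with YaRN. The hypothetical helper below derives the scaling factor as the target length divided by the native 262,144 tokens and emits a `rope_scaling` dictionary using the key names Hugging Face transformers uses for YaRN (`rope_type`, `factor`, `original_max_position_embeddings`). Treating "1M" as 1,048,576 tokens is an assumption for illustration; check the official model card or your serving framework's documentation for the exact values it expects.

```python
NATIVE_CONTEXT = 262_144  # native context length from the model overview

def yarn_rope_scaling(target_context, native_context=NATIVE_CONTEXT):
    """Hypothetical helper: build a YaRN rope_scaling dict for a target context length."""
    if target_context <= native_context:
        raise ValueError("YaRN is only needed beyond the native context length")
    factor = target_context / native_context
    return {
        "rope_type": "yarn",
        "factor": factor,
        "original_max_position_embeddings": native_context,
    }

print(yarn_rope_scaling(1_048_576))
# {'rope_type': 'yarn', 'factor': 4.0, 'original_max_position_embeddings': 262144}
```

Many frameworks apply YaRN statically, so the factor is used even for short inputs; enabling it only when long inputs are actually needed avoids affecting short-prompt quality.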

✅ 3. FP8 Quantization for Efficiency

Using fine-grained FP8 quantization, this variant offers:

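  • Roughly half the weight memory of the BF16 model (1 byte instead of 2 per parameter), plus potentially faster compute on GPUs with native FP8 support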

  • Lower deployment cost

  • Support in inference frameworks such as transformers, SGLang, and vLLM
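A back-of-the-envelope estimate shows where the memory saving comes from: BF16 stores each weight in 2 bytes, FP8 in 1 byte, and fine-grained quantization adds one scale per block. This sketch assumes 128×128 weight blocks with one 4-byte scale each and treats every parameter as quantized; in practice some tensors (for example embeddings and norms) typically stay in higher precision, so real checkpoints are somewhat larger. KV cache and activations are not included.

```python
GIB = 2**30
TOTAL_PARAMS = 30.5e9  # total parameters from the model overview

def weight_memory_gib(num_params, bytes_per_param, block_size=None, scale_bytes=4):
    """Approximate weight memory in GiB.

    If block_size is given, assume one scale of scale_bytes per
    block_size x block_size block (an assumption for this sketch).
    """
    total = num_params * bytes_per_param
    if block_size is not None:
        num_blocks = num_params / (block_size * block_size)
        total += num_blocks * scale_bytes
    return total / GIB

bf16 = weight_memory_gib(TOTAL_PARAMS, 2)
fp8 = weight_memory_gib(TOTAL_PARAMS, 1, block_size=128)
print(f"BF16 weights: {bf16:.1f} GiB")
print(f"FP8 weights:  {fp8:.1f} GiB")
print(f"ratio: {bf16 / fp8:.2f}x")
```

Under these assumptions the weights shrink from roughly 57 GiB to roughly 28 GiB, a factor of about 2; the per-block scales add well under 0.1% overhead.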


Quickstart Example (Hugging Face Transformers)

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen3-Coder-30B-A3B-Instruct-FP8"

# Load the tokenizer and the FP8 model; device_map="auto" places layers on available GPUs
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto",
)

# Build the chat-formatted prompt
prompt = "Write a quick sort algorithm."
messages = [{"role": "user", "content": prompt}]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# Generate, then drop the prompt tokens so only the completion is decoded
generated_ids = model.generate(**model_inputs, max_new_tokens=65536)
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()
content = tokenizer.decode(output_ids, skip_special_tokens=True)

print("content:", content)
```

OOM Warning: If you run into out-of-memory issues, reduce the context length to 32,768 or less.
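Reducing the context length helps because the key/value cache grows linearly with the number of tokens. The estimate below uses the layer and KV-head counts from the model overview; the head dimension of 128 and a 2-byte (BF16) cache are assumptions, since neither appears in this document. Model weights and activations are not counted.

```python
NUM_LAYERS = 48     # from the model overview
NUM_KV_HEADS = 4    # GQA: 4 key/value heads shared by 32 query heads
HEAD_DIM = 128      # assumption: per-head dimension
CACHE_BYTES = 2     # assumption: BF16 KV cache

def kv_cache_gib(num_tokens, kv_heads=NUM_KV_HEADS):
    """Approximate KV cache size in GiB for a single sequence."""
    # Factor 2: one key tensor and one value tensor per layer
    per_token = 2 * NUM_LAYERS * kv_heads * HEAD_DIM * CACHE_BYTES
    return num_tokens * per_token / 2**30

for tokens in (32_768, 262_144):
    print(f"{tokens:>7} tokens: {kv_cache_gib(tokens):.1f} GiB")

# Without GQA (one KV head per query head) the cache would be 8x larger
print(f"262144 tokens, 32 KV heads: {kv_cache_gib(262_144, kv_heads=32):.1f} GiB")
```

Under these assumptions a full 262,144-token sequence needs about 24 GiB of cache on top of the weights, versus about 3 GiB at 32,768 tokens.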


Tool Calling Example

```python
from openai import OpenAI

# Define your tool; the parameter name must match the key declared in the schema below
def square_the_number(input_num: float) -> float:
    return input_num ** 2

# Tool definition in the OpenAI "tools" format
tools = [
    {
        "type": "function",
        "function": {
            "name": "square_the_number",
            "description": "output the square of the number.",
            "parameters": {
                "type": "object",
                "required": ["input_num"],
                "properties": {
                    "input_num": {
                        "type": "number",
                        "description": "input_num is a number that will be squared",
                    }
                },
            },
        },
    }
]

# Call the model through an OpenAI-compatible server (e.g. vLLM or SGLang) running locally
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
messages = [{"role": "user", "content": "square the number 1024"}]

completion = client.chat.completions.create(
    messages=messages,
    model="Qwen3-Coder-30B-A3B-Instruct-FP8",  # must match the model name the server exposes
    max_tokens=65536,
    tools=tools,
)

print(completion.choices[0])
```
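The example above stops at printing the model's first choice. When the model decides to call a tool, the assistant message carries a `tool_calls` list whose `arguments` field is a JSON string. The sketch below shows one way to execute those calls locally and build the `"role": "tool"` messages to send back in a follow-up request. To stay self-contained it works on plain dictionaries shaped like an OpenAI-compatible response; the `TOOL_REGISTRY` and `run_tool_calls` names and the tool call contents (including the id `call_0`) are hypothetical.

```python
import json

def square_the_number(input_num: float) -> float:
    return input_num ** 2

# Map tool names (as declared in the tools schema) to local Python callables
TOOL_REGISTRY = {"square_the_number": square_the_number}

def run_tool_calls(tool_calls):
    """Execute each requested tool call and return the tool-result messages."""
    results = []
    for call in tool_calls:
        fn = TOOL_REGISTRY[call["function"]["name"]]
        args = json.loads(call["function"]["arguments"])  # arguments arrive as a JSON string
        output = fn(**args)
        results.append({
            "role": "tool",
            "tool_call_id": call["id"],
            "content": json.dumps(output),
        })
    return results

# Hypothetical tool call, shaped like an OpenAI-compatible response
example_calls = [{
    "id": "call_0",
    "type": "function",
    "function": {"name": "square_the_number", "arguments": '{"input_num": 1024}'},
}]
print(run_tool_calls(example_calls))
```

In a real loop you would append the assistant message and these tool messages to `messages` and call `client.chat.completions.create` again so the model can use the result. The openai v1 client returns pydantic objects, which can be converted to dictionaries with `model_dump()`.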

Best Practices for Usage

| Setting | Recommendation |
|---|---|
| Temperature | 0.7 |
| Top-p | 0.8 |
| Top-k | 20 |
| Repetition penalty | 1.05 |
| Max output tokens | 65,536 for instruct-style generations |
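The recommended settings map onto different parameter names depending on the stack. The sketch below keeps them in one place and builds keyword arguments for transformers' `generate()` and for an OpenAI-compatible request. The helper names are hypothetical, and passing `top_k` and `repetition_penalty` through `extra_body` assumes a server, such as vLLM, that accepts these non-standard fields.

```python
RECOMMENDED = {
    "temperature": 0.7,
    "top_p": 0.8,
    "top_k": 20,
    "repetition_penalty": 1.05,
    "max_tokens": 65_536,
}

def transformers_generate_kwargs(cfg=RECOMMENDED):
    """Keyword arguments for model.generate(); sampling is enabled explicitly."""
    return {
        "do_sample": True,
        "temperature": cfg["temperature"],
        "top_p": cfg["top_p"],
        "top_k": cfg["top_k"],
        "repetition_penalty": cfg["repetition_penalty"],
        "max_new_tokens": cfg["max_tokens"],
    }

def openai_request_kwargs(cfg=RECOMMENDED):
    """Keyword arguments for client.chat.completions.create()."""
    return {
        "temperature": cfg["temperature"],
        "top_p": cfg["top_p"],
        "max_tokens": cfg["max_tokens"],
        # Not part of the OpenAI schema; forwarded verbatim to servers that support them
        "extra_body": {
            "top_k": cfg["top_k"],
            "repetition_penalty": cfg["repetition_penalty"],
        },
    }
```

For example, `model.generate(**model_inputs, **transformers_generate_kwargs())` would replace the explicit `max_new_tokens` in the Quickstart, and `**openai_request_kwargs()` would replace `max_tokens` in the tool calling request.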

Known Issues

  • transformers currently has issues with fine-grained FP8 quantization in distributed (multi-GPU) inference.

  • If you run the fine-grained FP8 model with transformers for distributed inference, you may need to set the environment variable CUDA_LAUNCH_BLOCKING=1.
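CUDA reads `CUDA_LAUNCH_BLOCKING` when it initializes, so the variable is safest set before PyTorch (and therefore transformers) is imported, either in the shell or at the very top of the script, as in this minimal sketch. The flag makes kernel launches synchronous, which slows inference, so treat it as a workaround rather than a default.

```python
import os
import sys
import warnings

# Set before torch/transformers are imported, so it is in place before CUDA initializes
if "torch" in sys.modules:
    warnings.warn("torch is already imported; CUDA_LAUNCH_BLOCKING may not take effect")
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

# Only now import the heavy libraries, e.g.:
# from transformers import AutoModelForCausalLM, AutoTokenizer
```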


Citation

If Qwen3-Coder benefits your research or application, consider citing:

```bibtex
@misc{qwen3technicalreport,
      title={Qwen3 Technical Report},
      author={Qwen Team},
      year={2025},
      eprint={2505.09388},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2505.09388},
}
```
