Build Your Own Qwen3 API Endpoint with vLLM + FastAPI


Introduction: Host Your Own LLM API, No Tokens Needed

Want to:

  • Call Qwen3 the same way you'd call OpenAI's API?

  • Keep everything local & secure?

  • Serve blazing-fast inference?

With vLLM + FastAPI, you can:

  • Deploy Qwen3 models (e.g. 8B / 14B / 32B)

  • Use OpenAI-style /v1/chat/completions endpoints

  • Host your own private or public API


1. Environment Setup

First, install the necessary packages:

bash
pip install vllm fastapi uvicorn httpx openai

Make sure you have:

  • Python ≥ 3.9

  • A GPU with 16GB+ VRAM (enough for the 8B model with 4-bit quantization; FP16 and larger models need more memory or multiple GPUs); a quick sanity check is shown below
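Before launching anything, it helps to confirm the GPU is visible to both the driver and PyTorch (which vLLM builds on). A minimal sanity check, assuming an NVIDIA GPU with the standard tooling installed:

bash
# Show GPU name and total VRAM as seen by the driver
nvidia-smi --query-gpu=name,memory.total --format=csv
# Confirm PyTorch can see CUDA and how many devices are available
python -c "import torch; print(torch.cuda.is_available(), torch.cuda.device_count())"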


2. Start vLLM API Server (Qwen3)

bash
python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen3-8B \
  --port 8000

Optional: load a quantized checkpoint (e.g. 4-bit AWQ) to cut VRAM usage, or rely on FlashAttention-2, which recent vLLM builds typically enable automatically on supported GPUs; an example launch command follows below.
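A sketch of a leaner launch command. It assumes an AWQ-quantized checkpoint such as Qwen/Qwen3-8B-AWQ is available and that your vLLM version supports these flags; adjust names and values to your hardware:

bash
python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen3-8B-AWQ \
  --quantization awq \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.90 \
  --port 8000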


3. Create FastAPI Proxy Server

python
# app.py
from fastapi import FastAPI, Request
import httpx

app = FastAPI()

VLLM_API = "http://localhost:8000/v1/chat/completions"

# Expose the same OpenAI-style path the SDK expects (see step 5)
@app.post("/v1/chat/completions")
async def chat_proxy(request: Request):
    body = await request.json()
    # No timeout: generation can easily exceed httpx's 5-second default
    async with httpx.AsyncClient(timeout=None) as client:
        res = await client.post(VLLM_API, json=body)
    return res.json()
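As written, the proxy buffers the full response, so requests with "stream": true won't stream through it. If you want streaming (listed in section 6), the following sketch forwards vLLM's server-sent events as they arrive. The file name is hypothetical and this is a starting point, not production code:

python
# app_streaming.py: streaming-aware variant of the proxy above
from fastapi import FastAPI, Request
from fastapi.responses import StreamingResponse
import httpx

app = FastAPI()

VLLM_API = "http://localhost:8000/v1/chat/completions"

@app.post("/v1/chat/completions")
async def chat_proxy(request: Request):
    body = await request.json()
    if body.get("stream"):
        # Relay vLLM's server-sent events chunk by chunk
        async def relay():
            async with httpx.AsyncClient(timeout=None) as client:
                async with client.stream("POST", VLLM_API, json=body) as upstream:
                    async for chunk in upstream.aiter_bytes():
                        yield chunk
        return StreamingResponse(relay(), media_type="text/event-stream")
    # Non-streaming requests: forward once and return the JSON body
    async with httpx.AsyncClient(timeout=None) as client:
        res = await client.post(VLLM_API, json=body)
    return res.json()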

4. Run Your API

bash
uvicorn app:app --host 0.0.0.0 --port 7860

You now have:

  • Qwen3 running on localhost:8000 (vLLM backend)

  • FastAPI proxy on localhost:7860/v1/chat/completions (tested with curl below)
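Before wiring up a client library, a quick smoke test with curl confirms the whole chain works; the model field must match the checkpoint name loaded in step 2:

bash
curl http://localhost:7860/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "Qwen/Qwen3-8B", "messages": [{"role": "user", "content": "Hello!"}]}'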


5. Test Locally with OpenAI SDK

python
from openai import OpenAI

# Point the SDK at the FastAPI proxy; the key is a dummy value unless you add auth (step 7)
client = OpenAI(base_url="http://localhost:7860/v1", api_key="qwen-key")

response = client.chat.completions.create(
    model="Qwen/Qwen3-8B",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What's the capital of Japan?"},
    ],
)
print(response.choices[0].message.content)

6. Supported Features with vLLM

Feature              Supported
Chat Completions     ✅
Streaming Output     ✅
Token Usage Count    ✅
Function Calling     ✅ (via prompt)
Batch Inference      ✅
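For example, streaming output can be consumed with the OpenAI SDK as below. This sketch assumes the streaming-aware proxy variant from step 3 (or pointing base_url directly at the vLLM server on port 8000, which streams natively):

python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:7860/v1", api_key="qwen-key")  # dummy key

# Request a streamed response and print tokens as they arrive
stream = client.chat.completions.create(
    model="Qwen/Qwen3-8B",
    messages=[{"role": "user", "content": "Write a haiku about Tokyo."}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()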

7. Optional: Add API Key Auth

python
from fastapi import Header, HTTPException

@app.post("/v1/chat/completions")
async def chat_proxy(request: Request, authorization: str = Header(None)):
    # Reject requests that don't carry the expected Bearer token
    if authorization != "Bearer YOUR_API_KEY":
        raise HTTPException(status_code=403, detail="Forbidden")
    # continue...
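On the client side nothing extra is needed: the OpenAI SDK sends whatever you pass as api_key in an Authorization: Bearer <key> header, so using api_key="YOUR_API_KEY" in the step 5 script satisfies this check.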

8. Optional: Expose API with Cloudflare Tunnel or Ngrok

bash
cloudflared tunnel --url http://localhost:7860

Or use ngrok http 7860 for quick sharing during development.


Conclusion: Full-Stack Qwen3 API in Minutes

You now have:

  • ✅ A blazing-fast LLM server

  • ✅ OpenAI-compatible API

  • ✅ Full control over privacy, speed, and scaling

This is ideal for production apps, enterprise tools, research assistants, and more.

