Build Your Own Qwen3 API Endpoint with vLLM + FastAPI
Introduction: Host Your Own LLM API, No Tokens Needed
Want to:

- Call Qwen3 like OpenAI's API?
- Keep everything local & secure?
- Serve blazing fast inference?
With vLLM + FastAPI, you can:

- Deploy Qwen3 models (7B/14B/72B)
- Use OpenAI-style `/v1/chat/completions` endpoints
- Host your own private or public API
1. Environment Setup
First, install the necessary packages:
```bash
pip install vllm fastapi uvicorn openai httpx
```
Make sure you have:

- Python ≥ 3.9
- A GPU with 16GB+ VRAM for the 7B/14B models (a quick check is sketched below)
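A minimal sanity check, assuming the packages above installed cleanly (PyTorch is pulled in as a vLLM dependency, so no extra install is needed):

```python
# check_env.py -- a minimal sketch: confirm the packages import and a GPU is visible
import torch
import vllm
import fastapi

print("vLLM:", vllm.__version__)
print("FastAPI:", fastapi.__version__)
print("CUDA available:", torch.cuda.is_available())

if torch.cuda.is_available():
    # Report the name and VRAM of the first GPU in GiB
    props = torch.cuda.get_device_properties(0)
    print(f"GPU: {props.name}, {props.total_memory / 1024**3:.1f} GiB VRAM")
```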
2. Start vLLM API Server (Qwen3)
```bash
python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen1.5-7B-Chat \
  --port 8000
```
Optional: load a 4-bit quantized checkpoint (e.g. an AWQ/GPTQ model via vLLM's `--quantization` flag), or rely on FlashAttention-2, which vLLM picks up automatically when it is available.
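Before adding the proxy, you can check that the vLLM server answers on its OpenAI-compatible route. A minimal sketch using httpx (assuming the server from step 2 is running on port 8000):

```python
# smoke_test_vllm.py -- a minimal sketch: call the vLLM OpenAI-compatible
# endpoint directly (assumes the server from step 2 is running on port 8000)
import httpx

payload = {
    "model": "Qwen/Qwen1.5-7B-Chat",  # must match the --model name served by vLLM
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Say hello in one sentence."},
    ],
}

resp = httpx.post(
    "http://localhost:8000/v1/chat/completions",
    json=payload,
    timeout=60.0,  # generation can take a while; the httpx default is only 5 s
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```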
3. Create FastAPI Proxy Server
```python
# app.py
from fastapi import FastAPI, Request
import httpx

app = FastAPI()

# Address of the vLLM OpenAI-compatible backend started in step 2
VLLM_API = "http://localhost:8000/v1/chat/completions"

# Serve the same path as the OpenAI API so off-the-shelf SDKs can point here
@app.post("/v1/chat/completions")
async def chat_proxy(request: Request):
    # Forward the request body unchanged to the vLLM backend
    body = await request.json()
    async with httpx.AsyncClient(timeout=None) as client:  # no timeout: generation can be slow
        res = await client.post(VLLM_API, json=body)
    return res.json()
```
4. Run Your API
```bash
uvicorn app:app --host 0.0.0.0 --port 7860
```
You now have:

- Qwen3 running on `localhost:8000` (vLLM backend)
- A FastAPI proxy on `localhost:7860/v1/chat/completions`
5. Test Locally with OpenAI SDK
```python
# test_client.py -- requires openai>=1.0 (which `pip install openai` gives you)
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:7860/v1",  # the FastAPI proxy
    api_key="qwen-key",                   # dummy key; the proxy ignores it
)

response = client.chat.completions.create(
    model="Qwen/Qwen1.5-7B-Chat",  # must match the model name served by vLLM
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What's the capital of Japan?"},
    ],
)

print(response.choices[0].message.content)
```
6. Supported Features with vLLM
| Feature | Supported |
|---|---|
| Chat Completions | ✅ |
| Streaming Output | ✅ (sketch below) |
| Token Usage Count | ✅ |
| Function Calling | ✅ (via prompt) |
| Batch Inference | ✅ |
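The proxy from step 3 buffers the full JSON response, so streamed tokens would only reach the client once generation finishes. Below is a minimal sketch, not part of the original `app.py`, of a streaming-aware version of the route that relays vLLM's server-sent events whenever the client sets `"stream": true`:

```python
# streaming_proxy.py -- a sketch of a streaming-aware version of the route from step 3
from fastapi import FastAPI, Request
from fastapi.responses import StreamingResponse
import httpx

app = FastAPI()
VLLM_API = "http://localhost:8000/v1/chat/completions"

@app.post("/v1/chat/completions")
async def chat_proxy(request: Request):
    body = await request.json()

    if not body.get("stream"):
        # Non-streaming: forward and return the buffered JSON as before
        async with httpx.AsyncClient(timeout=None) as client:
            res = await client.post(VLLM_API, json=body)
        return res.json()

    async def relay():
        # Streaming: pass vLLM's server-sent-event chunks straight through
        async with httpx.AsyncClient(timeout=None) as client:
            async with client.stream("POST", VLLM_API, json=body) as upstream:
                async for chunk in upstream.aiter_bytes():
                    yield chunk

    return StreamingResponse(relay(), media_type="text/event-stream")
```

With this in place, `client.chat.completions.create(..., stream=True)` against the proxy yields chunks the same way it would against OpenAI's API.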
7. Optional: Add API Key Auth
```python
# Add to app.py: require a bearer token before forwarding requests
from fastapi import Header, HTTPException

@app.post("/v1/chat/completions")
async def chat_proxy(request: Request, authorization: str = Header(None)):
    if authorization != "Bearer YOUR_API_KEY":
        raise HTTPException(status_code=403, detail="Forbidden")
    # continue as in step 3: forward the body to the vLLM backend
```
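On the client side nothing extra is needed: the OpenAI SDK already sends its `api_key` as a bearer token, which is exactly what the check above expects. A short sketch:

```python
# The OpenAI SDK sends "Authorization: Bearer <api_key>" with every request,
# so pointing api_key at the proxy's expected key satisfies the check above.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:7860/v1", api_key="YOUR_API_KEY")
```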
8. Optional: Expose API with Cloudflare Tunnel or Ngrok
```bash
cloudflared tunnel --url http://localhost:7860
```
Or use `ngrok http 7860` for quick sharing during development.
Conclusion: Full-Stack Qwen3 API in Minutes
You now have:

- ✅ A blazing fast LLM server
- ✅ An OpenAI-compatible API
- ✅ Full control over privacy, speed, and scaling
This is ideal for production apps, enterprise tools, research assistants, and more.
Resources
Qwen3 Coder - Agentic Coding Adventure
Step into a new era of AI-powered development with Qwen3 Coder, the world's most agentic open-source coding model.