Build Your Own Qwen3 API Endpoint with vLLM + FastAPI
Introduction: Host Your Own LLM API, No Tokens Needed
Want to:

- Call Qwen3 like OpenAI's API?
- Keep everything local & secure?
- Serve blazing fast inference?
With vLLM + FastAPI, you can:

- Deploy Qwen3 models (7B/14B/72B)
- Use OpenAI-style `/v1/chat/completions` endpoints
- Host your own private or public API
1. Environment Setup
First, install the necessary packages:
```bash
pip install vllm fastapi uvicorn openai httpx
```
Make sure you have:

- Python ≥ 3.9
- A GPU with 16GB+ VRAM for the 7B/14B models (a quick check is sketched below)
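A minimal sanity check, assuming the packages above installed cleanly (PyTorch is pulled in as a vLLM dependency, so no extra install is needed):

```python
# check_env.py -- a minimal sketch: confirm the packages import and a GPU is visible
import torch
import vllm
import fastapi

print("vLLM:", vllm.__version__)
print("FastAPI:", fastapi.__version__)
print("CUDA available:", torch.cuda.is_available())

if torch.cuda.is_available():
    # Report the name and VRAM of the first GPU in GiB
    props = torch.cuda.get_device_properties(0)
    print(f"GPU: {props.name}, {props.total_memory / 1024**3:.1f} GiB VRAM")
```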
2. Start vLLM API Server (Qwen3)
```bash
python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen1.5-7B-Chat \
  --port 8000
```
Optional: load a 4-bit quantized checkpoint (e.g. an AWQ/GPTQ model via vLLM's `--quantization` flag), or rely on FlashAttention-2, which vLLM picks up automatically when it is available.
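Before adding the proxy, you can check that the vLLM server answers on its OpenAI-compatible route. A minimal sketch using httpx (assuming the server from step 2 is running on port 8000):

```python
# smoke_test_vllm.py -- a minimal sketch: call the vLLM OpenAI-compatible
# endpoint directly (assumes the server from step 2 is running on port 8000)
import httpx

payload = {
    "model": "Qwen/Qwen1.5-7B-Chat",  # must match the --model name served by vLLM
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Say hello in one sentence."},
    ],
}

resp = httpx.post(
    "http://localhost:8000/v1/chat/completions",
    json=payload,
    timeout=60.0,  # generation can take a while; the httpx default is only 5 s
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```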
3. Create FastAPI Proxy Server
```python
# app.py
from fastapi import FastAPI, Request
import httpx

app = FastAPI()

# Address of the vLLM OpenAI-compatible backend started in step 2
VLLM_API = "http://localhost:8000/v1/chat/completions"

# Serve the same path as the OpenAI API so off-the-shelf SDKs can point here
@app.post("/v1/chat/completions")
async def chat_proxy(request: Request):
    # Forward the request body unchanged to the vLLM backend
    body = await request.json()
    async with httpx.AsyncClient(timeout=None) as client:  # no timeout: generation can be slow
        res = await client.post(VLLM_API, json=body)
    return res.json()
```
4. Run Your API
```bash
uvicorn app:app --host 0.0.0.0 --port 7860
```
You now have:

- Qwen3 running on `localhost:8000` (vLLM backend)
- A FastAPI proxy on `localhost:7860/v1/chat/completions`
5. Test Locally with OpenAI SDK
```python
# test_client.py -- requires openai>=1.0 (which `pip install openai` gives you)
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:7860/v1",  # the FastAPI proxy
    api_key="qwen-key",                   # dummy key; the proxy ignores it
)

response = client.chat.completions.create(
    model="Qwen/Qwen1.5-7B-Chat",  # must match the model name served by vLLM
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What's the capital of Japan?"},
    ],
)

print(response.choices[0].message.content)
```
6. Supported Features with vLLM
| Feature | Supported |
|---|---|
| Chat Completions | ✅ |
| Streaming Output | ✅ (sketch below) |
| Token Usage Count | ✅ |
| Function Calling | ✅ (via prompt) |
| Batch Inference | ✅ |
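The proxy from step 3 buffers the full JSON response, so streamed tokens would only reach the client once generation finishes. Below is a minimal sketch, not part of the original `app.py`, of a streaming-aware version of the route that relays vLLM's server-sent events whenever the client sets `"stream": true`:

```python
# streaming_proxy.py -- a sketch of a streaming-aware version of the route from step 3
from fastapi import FastAPI, Request
from fastapi.responses import StreamingResponse
import httpx

app = FastAPI()
VLLM_API = "http://localhost:8000/v1/chat/completions"

@app.post("/v1/chat/completions")
async def chat_proxy(request: Request):
    body = await request.json()

    if not body.get("stream"):
        # Non-streaming: forward and return the buffered JSON as before
        async with httpx.AsyncClient(timeout=None) as client:
            res = await client.post(VLLM_API, json=body)
        return res.json()

    async def relay():
        # Streaming: pass vLLM's server-sent-event chunks straight through
        async with httpx.AsyncClient(timeout=None) as client:
            async with client.stream("POST", VLLM_API, json=body) as upstream:
                async for chunk in upstream.aiter_bytes():
                    yield chunk

    return StreamingResponse(relay(), media_type="text/event-stream")
```

With this in place, `client.chat.completions.create(..., stream=True)` against the proxy yields chunks the same way it would against OpenAI's API.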
7. Optional: Add API Key Auth
```python
# Add to app.py: require a bearer token before forwarding requests
from fastapi import Header, HTTPException

@app.post("/v1/chat/completions")
async def chat_proxy(request: Request, authorization: str = Header(None)):
    if authorization != "Bearer YOUR_API_KEY":
        raise HTTPException(status_code=403, detail="Forbidden")
    # continue as in step 3: forward the body to the vLLM backend
```
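On the client side nothing extra is needed: the OpenAI SDK already sends its `api_key` as a bearer token, which is exactly what the check above expects. A short sketch:

```python
# The OpenAI SDK sends "Authorization: Bearer <api_key>" with every request,
# so pointing api_key at the proxy's expected key satisfies the check above.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:7860/v1", api_key="YOUR_API_KEY")
```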
8. Optional: Expose API with Cloudflare Tunnel or Ngrok
```bash
cloudflared tunnel --url http://localhost:7860
```
Or use `ngrok http 7860` for quick sharing during development.
Conclusion: Full-Stack Qwen3 API in Minutes
You now have:

- ✅ A blazing fast LLM server
- ✅ An OpenAI-compatible API
- ✅ Full control over privacy, speed, and scaling
This is ideal for production apps, enterprise tools, research assistants, and more.
Resources
Qwen3 Coder - Agentic Coding Adventure
Step into a new era of AI-powered development with Qwen3 Coder, the world's most agentic open-source coding model.