7.1 — FastAPI for Model Serving — Building AI Endpoints
You have a trained model. Now you need to turn it into an API that clients can call. FastAPI has become the standard framework for serving AI models: it's fast, async-native, auto-generates interactive documentation, validates requests with Pydantic, and integrates cleanly with Python's ML ecosystem. This lesson teaches you to build production-grade AI API endpoints from scratch.
Why FastAPI for AI?
| Framework | Speed | Async | Auto-Docs | Type Validation | ML Ecosystem |
|---|---|---|---|---|---|
| FastAPI | Very fast (ASGI) | Yes | Yes (Swagger + ReDoc) | Yes (Pydantic) | Perfect |
| Flask | Medium | Limited (WSGI) | No | No | Good |
| Django | Slower | Limited | No | No | Overkill |
| Express (Node) | Fast | Yes | No | No | Poor for ML |
Your First AI Endpoint
Project Structure
llm-api/
├── main.py # FastAPI app + routes
├── models.py # Pydantic request/response schemas
├── inference.py # Model loading + prediction logic
├── config.py # Settings and environment variables
├── requirements.txt
└── Dockerfile
Step 1: Define Request/Response Schemas
# models.py
from pydantic import BaseModel, Field

class ChatRequest(BaseModel):
    prompt: str = Field(..., min_length=1, max_length=4096)
    max_tokens: int = Field(default=512, ge=1, le=4096)
    temperature: float = Field(default=0.7, ge=0.0, le=2.0)
    system_prompt: str = Field(default="You are a helpful assistant.")

class ChatResponse(BaseModel):
    text: str
    tokens_used: int
    model: str
    latency_ms: float
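Because every field carries constraints, bad input never reaches the model: FastAPI rejects it with a 422 and a readable error message. A quick illustrative check of the schema above (the values are chosen only to trip the bounds):

# illustrative check of the constraints defined above
from pydantic import ValidationError
from models import ChatRequest

try:
    ChatRequest(prompt="Hi", temperature=3.5)   # violates le=2.0
except ValidationError as err:
    print(err)                                  # explains that temperature must be <= 2.0

req = ChatRequest(prompt="Summarize this article.")
print(req.max_tokens, req.temperature)          # defaults: 512 0.7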
Step 2: Model Loading
# inference.py
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
import time

class ModelEngine:
    def __init__(self, model_path: str):
        self.device = "cuda" if torch.cuda.is_available() else "cpu"
        print(f"Loading model from {model_path} on {self.device}...")
        self.tokenizer = AutoTokenizer.from_pretrained(model_path)
        self.model = AutoModelForCausalLM.from_pretrained(
            model_path,
            torch_dtype=torch.float16,
            device_map="auto"
        )
        print("Model loaded successfully.")

    def generate(self, prompt: str, max_tokens: int = 512,
                 temperature: float = 0.7) -> dict:
        start = time.time()
        inputs = self.tokenizer(prompt, return_tensors="pt").to(self.device)
        with torch.no_grad():
            outputs = self.model.generate(
                **inputs,
                max_new_tokens=max_tokens,
                temperature=temperature,
                do_sample=temperature > 0
            )
        text = self.tokenizer.decode(outputs[0], skip_special_tokens=True)
        latency = (time.time() - start) * 1000
        return {
            "text": text[len(prompt):].strip(),
            "tokens_used": len(outputs[0]) - len(inputs.input_ids[0]),
            "latency_ms": round(latency, 2)
        }
Step 3: FastAPI Application
# main.py
from fastapi import FastAPI, HTTPException
from contextlib import asynccontextmanager
from models import ChatRequest, ChatResponse
from inference import ModelEngine
from config import settings

engine = None

@asynccontextmanager
async def lifespan(app: FastAPI):
    global engine
    engine = ModelEngine(settings.MODEL_PATH)
    yield
    del engine

app = FastAPI(
    title="LLM Inference API",
    version="1.0.0",
    lifespan=lifespan
)

@app.get("/health")
async def health():
    return {"status": "healthy", "model": settings.MODEL_NAME}

@app.post("/v1/chat", response_model=ChatResponse)
async def chat(request: ChatRequest):
    try:
        result = engine.generate(
            prompt=request.prompt,
            max_tokens=request.max_tokens,
            temperature=request.temperature
        )
        return ChatResponse(
            text=result["text"],
            tokens_used=result["tokens_used"],
            model=settings.MODEL_NAME,
            latency_ms=result["latency_ms"]
        )
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))

@app.get("/v1/models")
async def list_models():
    return {"models": [settings.MODEL_NAME]}
Step 4: Run It
pip install fastapi uvicorn transformers torch

# one worker: each uvicorn worker process loads its own copy of the model
uvicorn main:app --host 0.0.0.0 --port 8000 --workers 1
# Test with curl
curl -X POST http://localhost:8000/v1/chat \
-H "Content-Type: application/json" \
-d '{"prompt": "Explain quantum computing in simple terms", "max_tokens": 256}'
Visit http://localhost:8000/docs for interactive Swagger documentation.
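The same request from Python, for clients that prefer it over curl (a minimal sketch using the requests library; the URL and payload mirror the curl call above):

# minimal Python client for the /v1/chat endpoint
import requests

resp = requests.post(
    "http://localhost:8000/v1/chat",
    json={"prompt": "Explain quantum computing in simple terms", "max_tokens": 256},
    timeout=120,  # generation can take a while on CPU
)
resp.raise_for_status()
data = resp.json()
print(data["text"])
print(f'{data["tokens_used"]} tokens in {data["latency_ms"]} ms')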
Production Patterns
Async Request Queuing
For models that can only handle one request at a time:
import asyncio

class RequestQueue:
    def __init__(self):
        self.queue = asyncio.Queue(maxsize=100)
        self.processing = False

    async def enqueue(self, request: ChatRequest) -> ChatResponse:
        future = asyncio.get_running_loop().create_future()
        await self.queue.put((request, future))
        if not self.processing:
            asyncio.create_task(self._process())
        return await future

    async def _process(self):
        self.processing = True
        while not self.queue.empty():
            request, future = await self.queue.get()
            # run the blocking generate() call in a thread so the event loop stays responsive
            result = await asyncio.to_thread(
                engine.generate,
                prompt=request.prompt,
                max_tokens=request.max_tokens
            )
            future.set_result(ChatResponse(model=settings.MODEL_NAME, **result))
        self.processing = False
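Wiring it in is a one-liner per endpoint; a sketch assuming a single global queue created at import time (the route name here is purely illustrative):

# one global queue shared by the app
request_queue = RequestQueue()

@app.post("/v1/chat/queued", response_model=ChatResponse)
async def chat_queued(request: ChatRequest):
    # each handler simply awaits its turn; the queue serves one request at a time
    return await request_queue.enqueue(request)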
Response Streaming
For long generations, stream tokens as they're produced:
from fastapi.responses import StreamingResponse
import json

@app.post("/v1/chat/stream")
async def chat_stream(request: ChatRequest):
    async def generate_tokens():
        for token in engine.stream_generate(request.prompt):
            yield f"data: {json.dumps({'token': token})}\n\n"
        yield "data: [DONE]\n\n"
    return StreamingResponse(
        generate_tokens(),
        media_type="text/event-stream"
    )
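This assumes ModelEngine exposes a stream_generate method, which Step 2 doesn't define. One possible sketch uses transformers' TextIteratorStreamer and a background thread so decoded text can be yielded as generation progresses:

# inference.py — add these imports at the top of the file
from threading import Thread
from transformers import TextIteratorStreamer

# ...and this method inside the ModelEngine class:
def stream_generate(self, prompt: str, max_tokens: int = 512):
    inputs = self.tokenizer(prompt, return_tensors="pt").to(self.device)
    streamer = TextIteratorStreamer(
        self.tokenizer, skip_prompt=True, skip_special_tokens=True
    )
    # run generation in a background thread; the streamer yields decoded text chunks
    thread = Thread(
        target=self.model.generate,
        kwargs={**inputs, "max_new_tokens": max_tokens, "streamer": streamer},
    )
    thread.start()
    for chunk in streamer:
        yield chunk
    thread.join()

When testing from the command line, pass -N to curl so it doesn't buffer the streamed response.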
Batched Inference
Process multiple requests together for better GPU utilization:
import asyncio
import time
from dataclasses import dataclass, field

@dataclass
class QueuedRequest:
    prompt: str
    future: asyncio.Future          # resolved with this prompt's result
    enqueued_at: float = field(default_factory=time.time)

batch_buffer: list[QueuedRequest] = []
BATCH_SIZE = 8
BATCH_TIMEOUT = 0.1  # seconds to wait before flushing a partial batch

async def batch_processor():
    while True:
        waited = time.time() - batch_buffer[0].enqueued_at if batch_buffer else 0.0
        if len(batch_buffer) >= BATCH_SIZE or (batch_buffer and waited > BATCH_TIMEOUT):
            batch = batch_buffer[:BATCH_SIZE]
            del batch_buffer[:BATCH_SIZE]
            # one forward pass over the whole batch keeps the GPU busy
            # (assumes ModelEngine also exposes a batch_generate(prompts) method)
            results = await asyncio.to_thread(
                engine.batch_generate, [item.prompt for item in batch]
            )
            for item, result in zip(batch, results):
                item.future.set_result(result)
        await asyncio.sleep(0.01)
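The loop has to be started once and fed by the request handlers. A sketch of the wiring, extending the Step 3 lifespan (the endpoint path is illustrative):

# start the batching loop at startup and feed it from an endpoint
@asynccontextmanager
async def lifespan(app: FastAPI):
    global engine
    engine = ModelEngine(settings.MODEL_PATH)
    batch_task = asyncio.create_task(batch_processor())  # background batching loop
    yield
    batch_task.cancel()

@app.post("/v1/chat/batched")
async def chat_batched(request: ChatRequest):
    item = QueuedRequest(prompt=request.prompt,
                         future=asyncio.get_running_loop().create_future())
    batch_buffer.append(item)
    return await item.future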
Error Handling & Middleware
from fastapi.middleware.cors import CORSMiddleware
from fastapi.responses import JSONResponse
import logging
import time

# CORS for web clients
app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],  # restrict in production
    allow_methods=["POST", "GET"],
    allow_headers=["*"],
)

# Request logging
@app.middleware("http")
async def log_requests(request, call_next):
    start = time.time()
    response = await call_next(request)
    duration = time.time() - start
    logging.info(f"{request.method} {request.url.path} - {response.status_code} - {duration:.3f}s")
    return response

# Global error handler
@app.exception_handler(Exception)
async def global_exception_handler(request, exc):
    logging.error(f"Unhandled error: {exc}")
    return JSONResponse(
        status_code=500,
        content={"error": "Internal server error", "detail": str(exc)}
    )
Practice Lab
Task 1: Basic AI API
Build a FastAPI app that serves a sentiment analysis model (use distilbert-base-uncased-finetuned-sst-2-english). Create a /predict endpoint that accepts text and returns a positive/negative label with a confidence score.
Task 2: Streaming Endpoint
Add a /v1/chat/stream endpoint that returns Server-Sent Events. Test it with curl and verify tokens stream in real-time.
Task 3: Load Testing
Use wrk or hey to benchmark your API. Measure: requests/sec, p50/p95/p99 latency, error rate. Identify the bottleneck.
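For example, with hey (-n is the total number of requests, -c the number of concurrent clients; the payload is just a small sample prompt):

hey -n 200 -c 8 -m POST \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Hello", "max_tokens": 64}' \
  http://localhost:8000/v1/chat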
Pakistan Case Study
Meet Tariq — a freelance Python developer from Lahore building AI APIs for international clients on Upwork.
His first project: A Karachi logistics company needed a route optimization API. Tariq built it with FastAPI + a fine-tuned model.
His API architecture:
- FastAPI with Pydantic validation (caught bad inputs before hitting the model)
- Health check endpoint (client's monitoring system needed it)
- Response streaming for the chatbot interface
- Auto-generated Swagger docs (client's frontend team used them directly)
His pricing evolution:
- First AI API project: $500 (learning experience)
- After 3 projects with good reviews: $2,000-3,000 per API
- Now charges $150/hour for AI API consulting
- Monthly Upwork income: $4,000-6,000 (PKR 1.1M-1.7M)
His advice: "Every AI startup needs someone who can turn a Jupyter notebook into a production API. Learn FastAPI + Docker + basic deployment, and you have a $100+/hour skill."
Key Takeaways
- FastAPI is the standard for serving AI models — fast, typed, auto-documented
- Load models during startup (lifespan), not per-request
- Use Pydantic models for request/response validation
- Stream responses for long-running LLM generations
- Batch inference improves GPU utilization (process 8 requests at once vs. 1)
- Always include a /health endpoint for load balancers and monitoring
- Auto-generated docs at /docs eliminate the need for separate API documentation
Next lesson: Load balancing and auto-scaling your AI API services.