7.1 — FastAPI for Model Serving — Building AI Endpoints
You have a trained model. Now you need to turn it into an API that clients can call. FastAPI has become the standard framework for serving AI models: it's fast, async-native, auto-generates interactive documentation, validates requests with Pydantic, and integrates cleanly with Python's ML ecosystem. This lesson teaches you to build production-grade AI API endpoints from scratch.
Why FastAPI for AI?
| Framework | Speed | Async | Auto-Docs | Type Validation | ML Ecosystem |
|---|---|---|---|---|---|
| FastAPI | Very fast (ASGI) | Yes | Yes (Swagger + ReDoc) | Yes (Pydantic) | Perfect |
| Flask | Medium | Limited (WSGI) | No | No | Good |
| Django | Slower | Limited | No | No | Overkill |
| Express (Node) | Fast | Yes | No | No | Poor for ML |
Your First AI Endpoint
Project Structure
llm-api/
├── main.py # FastAPI app + routes
├── models.py # Pydantic request/response schemas
├── inference.py # Model loading + prediction logic
├── config.py # Settings and environment variables
├── requirements.txt
└── Dockerfile
Step 1: Define Request/Response Schemas
# models.py
from pydantic import BaseModel, Field

class ChatRequest(BaseModel):
    prompt: str = Field(..., min_length=1, max_length=4096)
    max_tokens: int = Field(default=512, ge=1, le=4096)
    temperature: float = Field(default=0.7, ge=0.0, le=2.0)
    system_prompt: str = Field(default="You are a helpful assistant.")

class ChatResponse(BaseModel):
    text: str
    tokens_used: int
    model: str
    latency_ms: float
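Because every field carries constraints, bad input never reaches the model: FastAPI rejects it with a 422 and a readable error message. A quick illustrative check of the schema above (the values are chosen only to trip the bounds):

# illustrative check of the constraints defined above
from pydantic import ValidationError
from models import ChatRequest

try:
    ChatRequest(prompt="Hi", temperature=3.5)   # violates le=2.0
except ValidationError as err:
    print(err)                                  # explains that temperature must be <= 2.0

req = ChatRequest(prompt="Summarize this article.")
print(req.max_tokens, req.temperature)          # defaults: 512 0.7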
Step 2: Model Loading
# inference.py
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
import time

class ModelEngine:
    def __init__(self, model_path: str):
        self.device = "cuda" if torch.cuda.is_available() else "cpu"
        print(f"Loading model from {model_path} on {self.device}...")
        self.tokenizer = AutoTokenizer.from_pretrained(model_path)
        self.model = AutoModelForCausalLM.from_pretrained(
            model_path,
            torch_dtype=torch.float16,
            device_map="auto"
        )
        print("Model loaded successfully.")

    def generate(self, prompt: str, max_tokens: int = 512,
                 temperature: float = 0.7) -> dict:
        start = time.time()
        inputs = self.tokenizer(prompt, return_tensors="pt").to(self.device)
        with torch.no_grad():
            outputs = self.model.generate(
                **inputs,
                max_new_tokens=max_tokens,
                temperature=temperature,
                do_sample=temperature > 0
            )
        text = self.tokenizer.decode(outputs[0], skip_special_tokens=True)
        latency = (time.time() - start) * 1000
        return {
            "text": text[len(prompt):].strip(),
            "tokens_used": len(outputs[0]) - len(inputs.input_ids[0]),
            "latency_ms": round(latency, 2)
        }
Step 3: FastAPI Application
# main.py
from fastapi import FastAPI, HTTPException
from contextlib import asynccontextmanager
from models import ChatRequest, ChatResponse
from inference import ModelEngine
from config import settings

engine = None

@asynccontextmanager
async def lifespan(app: FastAPI):
    global engine
    engine = ModelEngine(settings.MODEL_PATH)
    yield
    del engine

app = FastAPI(
    title="LLM Inference API",
    version="1.0.0",
    lifespan=lifespan
)

@app.get("/health")
async def health():
    return {"status": "healthy", "model": settings.MODEL_NAME}

@app.post("/v1/chat", response_model=ChatResponse)
async def chat(request: ChatRequest):
    try:
        result = engine.generate(
            prompt=request.prompt,
            max_tokens=request.max_tokens,
            temperature=request.temperature
        )
        return ChatResponse(
            text=result["text"],
            tokens_used=result["tokens_used"],
            model=settings.MODEL_NAME,
            latency_ms=result["latency_ms"]
        )
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))

@app.get("/v1/models")
async def list_models():
    return {"models": [settings.MODEL_NAME]}
Step 4: Run It
pip install fastapi uvicorn transformers torch

# one worker: each uvicorn worker process loads its own copy of the model
uvicorn main:app --host 0.0.0.0 --port 8000 --workers 1
# Test with curl
curl -X POST http://localhost:8000/v1/chat \
-H "Content-Type: application/json" \
-d '{"prompt": "Explain quantum computing in simple terms", "max_tokens": 256}'
Visit http://localhost:8000/docs for interactive Swagger documentation.
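The same request from Python, for clients that prefer it over curl (a minimal sketch using the requests library; the URL and payload mirror the curl call above):

# minimal Python client for the /v1/chat endpoint
import requests

resp = requests.post(
    "http://localhost:8000/v1/chat",
    json={"prompt": "Explain quantum computing in simple terms", "max_tokens": 256},
    timeout=120,  # generation can take a while on CPU
)
resp.raise_for_status()
data = resp.json()
print(data["text"])
print(f'{data["tokens_used"]} tokens in {data["latency_ms"]} ms')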
Production Patterns
Async Request Queuing
For models that can only handle one request at a time:
import asyncio

class RequestQueue:
    def __init__(self):
        self.queue = asyncio.Queue(maxsize=100)
        self.processing = False

    async def enqueue(self, request: ChatRequest) -> ChatResponse:
        future = asyncio.get_running_loop().create_future()
        await self.queue.put((request, future))
        if not self.processing:
            asyncio.create_task(self._process())
        return await future

    async def _process(self):
        self.processing = True
        while not self.queue.empty():
            request, future = await self.queue.get()
            # run the blocking generate() call in a thread so the event loop stays responsive
            result = await asyncio.to_thread(
                engine.generate,
                prompt=request.prompt,
                max_tokens=request.max_tokens
            )
            future.set_result(ChatResponse(model=settings.MODEL_NAME, **result))
        self.processing = False
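Wiring it in is a one-liner per endpoint; a sketch assuming a single global queue created at import time (the route name here is purely illustrative):

# one global queue shared by the app
request_queue = RequestQueue()

@app.post("/v1/chat/queued", response_model=ChatResponse)
async def chat_queued(request: ChatRequest):
    # each handler simply awaits its turn; the queue serves one request at a time
    return await request_queue.enqueue(request)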
Response Streaming
For long generations, stream tokens as they're produced:
from fastapi.responses import StreamingResponse
import json

@app.post("/v1/chat/stream")
async def chat_stream(request: ChatRequest):
    async def generate_tokens():
        for token in engine.stream_generate(request.prompt):
            yield f"data: {json.dumps({'token': token})}\n\n"
        yield "data: [DONE]\n\n"
    return StreamingResponse(
        generate_tokens(),
        media_type="text/event-stream"
    )
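This assumes ModelEngine exposes a stream_generate method, which Step 2 doesn't define. One possible sketch uses transformers' TextIteratorStreamer and a background thread so decoded text can be yielded as generation progresses:

# inference.py — add these imports at the top of the file
from threading import Thread
from transformers import TextIteratorStreamer

# ...and this method inside the ModelEngine class:
def stream_generate(self, prompt: str, max_tokens: int = 512):
    inputs = self.tokenizer(prompt, return_tensors="pt").to(self.device)
    streamer = TextIteratorStreamer(
        self.tokenizer, skip_prompt=True, skip_special_tokens=True
    )
    # run generation in a background thread; the streamer yields decoded text chunks
    thread = Thread(
        target=self.model.generate,
        kwargs={**inputs, "max_new_tokens": max_tokens, "streamer": streamer},
    )
    thread.start()
    for chunk in streamer:
        yield chunk
    thread.join()

When testing from the command line, pass -N to curl so it doesn't buffer the streamed response.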
Batched Inference
Process multiple requests together for better GPU utilization:
import asyncio
import time
from dataclasses import dataclass, field

@dataclass
class QueuedRequest:
    prompt: str
    future: asyncio.Future          # resolved with this prompt's result
    enqueued_at: float = field(default_factory=time.time)

batch_buffer: list[QueuedRequest] = []
BATCH_SIZE = 8
BATCH_TIMEOUT = 0.1  # seconds to wait before flushing a partial batch

async def batch_processor():
    while True:
        waited = time.time() - batch_buffer[0].enqueued_at if batch_buffer else 0.0
        if len(batch_buffer) >= BATCH_SIZE or (batch_buffer and waited > BATCH_TIMEOUT):
            batch = batch_buffer[:BATCH_SIZE]
            del batch_buffer[:BATCH_SIZE]
            # one forward pass over the whole batch keeps the GPU busy
            # (assumes ModelEngine also exposes a batch_generate(prompts) method)
            results = await asyncio.to_thread(
                engine.batch_generate, [item.prompt for item in batch]
            )
            for item, result in zip(batch, results):
                item.future.set_result(result)
        await asyncio.sleep(0.01)
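The loop has to be started once and fed by the request handlers. A sketch of the wiring, extending the Step 3 lifespan (the endpoint path is illustrative):

# start the batching loop at startup and feed it from an endpoint
@asynccontextmanager
async def lifespan(app: FastAPI):
    global engine
    engine = ModelEngine(settings.MODEL_PATH)
    batch_task = asyncio.create_task(batch_processor())  # background batching loop
    yield
    batch_task.cancel()

@app.post("/v1/chat/batched")
async def chat_batched(request: ChatRequest):
    item = QueuedRequest(prompt=request.prompt,
                         future=asyncio.get_running_loop().create_future())
    batch_buffer.append(item)
    return await item.future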
Error Handling & Middleware
from fastapi.middleware.cors import CORSMiddleware
from fastapi.responses import JSONResponse
import logging
import time

# CORS for web clients
app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],  # restrict in production
    allow_methods=["POST", "GET"],
    allow_headers=["*"],
)

# Request logging
@app.middleware("http")
async def log_requests(request, call_next):
    start = time.time()
    response = await call_next(request)
    duration = time.time() - start
    logging.info(f"{request.method} {request.url.path} - {response.status_code} - {duration:.3f}s")
    return response

# Global error handler
@app.exception_handler(Exception)
async def global_exception_handler(request, exc):
    logging.error(f"Unhandled error: {exc}")
    return JSONResponse(
        status_code=500,
        content={"error": "Internal server error", "detail": str(exc)}
    )
Practice Lab
Task 1: Basic AI API
Build a FastAPI app that serves a sentiment analysis model (use distilbert-base-uncased-finetuned-sst-2-english). Create a /predict endpoint that accepts text and returns a positive/negative label with a confidence score.
Task 2: Streaming Endpoint
Add a /v1/chat/stream endpoint that returns Server-Sent Events. Test it with curl and verify tokens stream in real-time.
Task 3: Load Testing
Use wrk or hey to benchmark your API. Measure: requests/sec, p50/p95/p99 latency, error rate. Identify the bottleneck.
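For example, with hey (-n is the total number of requests, -c the number of concurrent clients; the payload is just a small sample prompt):

hey -n 200 -c 8 -m POST \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Hello", "max_tokens": 64}' \
  http://localhost:8000/v1/chat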
Pakistan Case Study
Meet Tariq — a freelance Python developer from Lahore building AI APIs for international clients on Upwork.
His first project: A Karachi logistics company needed a route optimization API. Tariq built it with FastAPI + a fine-tuned model.
His API architecture:
- FastAPI with Pydantic validation (caught bad inputs before hitting the model)
- Health check endpoint (client's monitoring system needed it)
- Response streaming for the chatbot interface
- Auto-generated Swagger docs (client's frontend team used them directly)
His pricing evolution:
- First AI API project: $500 (learning experience)
- After 3 projects with good reviews: $2,000-3,000 per API
- Now charges $150/hour for AI API consulting
- Monthly Upwork income: $4,000-6,000 (PKR 1.1M-1.7M)
His advice: "Every AI startup needs someone who can turn a Jupyter notebook into a production API. Learn FastAPI + Docker + basic deployment, and you have a $100+/hour skill."
Key Takeaways
- FastAPI is the standard for serving AI models — fast, typed, auto-documented
- Load models during startup (lifespan), not per-request
- Use Pydantic models for request/response validation
- Stream responses for long-running LLM generations
- Batch inference improves GPU utilization (process 8 requests at once vs. 1)
- Always include a /health endpoint for load balancers and monitoring
- Auto-generated docs at /docs eliminate the need for separate API documentation
Next lesson: Load balancing and auto-scaling your AI API services.