Module 7: AI Infrastructure & Local LLMs

7.1 FastAPI for Model Serving — Building AI Endpoints

30 min · 9 code blocks · Practice Lab · Quiz (4Q)

You have a trained model. Now you need to turn it into an API that clients can call. FastAPI is the de facto standard for serving AI models in Python: it's fast and async-first, auto-generates documentation, validates requests with Pydantic, and plays well with the Python ML ecosystem. This lesson teaches you to build production-grade AI API endpoints from scratch.

Why FastAPI for AI?

Framework      | Speed          | Async             | Auto-Docs             | Type Validation | ML Ecosystem
FastAPI        | Fastest Python | Yes               | Yes (Swagger + ReDoc) | Yes (Pydantic)  | Perfect
Flask          | Medium         | No (needs Gevent) | No                    | No              | Good
Django         | Slower         | Limited           | No                    | No              | Overkill
Express (Node) | Fast           | Yes               | No                    | No              | Poor for ML

Your First AI Endpoint

Project Structure

code
llm-api/
├── main.py              # FastAPI app + routes
├── models.py            # Pydantic request/response schemas
├── inference.py         # Model loading + prediction logic
├── config.py            # Settings and environment variables
├── requirements.txt
└── Dockerfile

Step 1: Define Request/Response Schemas

python
# models.py
from pydantic import BaseModel, Field

class ChatRequest(BaseModel):
    prompt: str = Field(..., min_length=1, max_length=4096)
    max_tokens: int = Field(default=512, ge=1, le=4096)
    temperature: float = Field(default=0.7, ge=0.0, le=2.0)
    system_prompt: str = Field(default="You are a helpful assistant.")

class ChatResponse(BaseModel):
    text: str
    tokens_used: int
    model: str
    latency_ms: float
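
Because the schema carries the constraints, FastAPI rejects out-of-range input with a 422 before any model code runs. A quick illustration of what Pydantic does with a bad payload (a standalone sketch, not part of the lesson's files):

python
from pydantic import ValidationError

try:
    ChatRequest(prompt="", max_tokens=10_000)  # empty prompt, max_tokens above the le=4096 cap
except ValidationError as e:
    print(e)  # both violations are reported together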

Step 2: Model Loading

python
# inference.py
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
import time

class ModelEngine:
    def __init__(self, model_path: str):
        self.device = "cuda" if torch.cuda.is_available() else "cpu"
        print(f"Loading model from {model_path} on {self.device}...")

        self.tokenizer = AutoTokenizer.from_pretrained(model_path)
        self.model = AutoModelForCausalLM.from_pretrained(
            model_path,
            torch_dtype=torch.float16,
            device_map="auto"
        )
        print("Model loaded successfully.")

    def generate(self, prompt: str, max_tokens: int = 512,
                 temperature: float = 0.7) -> dict:
        start = time.time()

        inputs = self.tokenizer(prompt, return_tensors="pt").to(self.device)

        with torch.no_grad():
            outputs = self.model.generate(
                **inputs,
                max_new_tokens=max_tokens,
                temperature=temperature,
                do_sample=temperature > 0
            )

        text = self.tokenizer.decode(outputs[0], skip_special_tokens=True)
        latency = (time.time() - start) * 1000

        return {
            "text": text[len(prompt):].strip(),
            "tokens_used": len(outputs[0]) - len(inputs.input_ids[0]),
            "latency_ms": round(latency, 2)
        }
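
Before wiring the engine into FastAPI, it helps to sanity-check it on its own. A quick smoke test (the model ID below is only an example of a small instruct model; substitute whatever checkpoint you plan to serve):

python
# smoke_test.py
from inference import ModelEngine

engine = ModelEngine("Qwen/Qwen2.5-0.5B-Instruct")  # example model; any causal-LM path works
result = engine.generate("Say hello in one sentence.", max_tokens=32)
print(result["text"])
print(f"{result['tokens_used']} tokens in {result['latency_ms']} ms")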

Step 3: FastAPI Application

python
# main.py
from fastapi import FastAPI, HTTPException
from contextlib import asynccontextmanager
from models import ChatRequest, ChatResponse
from inference import ModelEngine
from config import settings

engine = None

@asynccontextmanager
async def lifespan(app: FastAPI):
    global engine
    engine = ModelEngine(settings.MODEL_PATH)
    yield
    del engine

app = FastAPI(
    title="LLM Inference API",
    version="1.0.0",
    lifespan=lifespan
)

@app.get("/health")
async def health():
    return {"status": "healthy", "model": settings.MODEL_NAME}

@app.post("/v1/chat", response_model=ChatResponse)
async def chat(request: ChatRequest):
    try:
        result = engine.generate(
            prompt=request.prompt,
            max_tokens=request.max_tokens,
            temperature=request.temperature
        )
        return ChatResponse(
            text=result["text"],
            tokens_used=result["tokens_used"],
            model=settings.MODEL_NAME,
            latency_ms=result["latency_ms"]
        )
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))

@app.get("/v1/models")
async def list_models():
    return {"models": [settings.MODEL_NAME]}
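
main.py imports settings from config.py, which the project layout lists but the lesson doesn't show. A minimal sketch using pydantic-settings (the default values are placeholders, and reading from environment variables is one reasonable choice, not the lesson's prescribed one):

python
# config.py
from pydantic_settings import BaseSettings

class Settings(BaseSettings):
    # Both values can be overridden via environment variables of the same name
    MODEL_PATH: str = "./model"       # placeholder local path or Hugging Face model ID
    MODEL_NAME: str = "local-llm"     # placeholder name returned by /health and /v1/models

settings = Settings()

This relies on the pydantic-settings package (pip install pydantic-settings).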

Step 4: Run It

bash
pip install fastapi uvicorn transformers torch
uvicorn main:app --host 0.0.0.0 --port 8000 --workers 1

# Test with curl
curl -X POST http://localhost:8000/v1/chat \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Explain quantum computing in simple terms", "max_tokens": 256}'

Visit http://localhost:8000/docs for interactive Swagger documentation.

Production Patterns

Async Request Queuing

For models that can only handle one request at a time:

python
import asyncio

class RequestQueue:
    """Serializes requests so the model runs one generation at a time."""

    def __init__(self):
        self.queue = asyncio.Queue(maxsize=100)
        self.processing = False

    async def enqueue(self, request: ChatRequest) -> ChatResponse:
        future = asyncio.get_running_loop().create_future()
        await self.queue.put((request, future))
        if not self.processing:
            asyncio.create_task(self._process())
        return await future

    async def _process(self):
        self.processing = True
        while not self.queue.empty():
            request, future = await self.queue.get()
            # generate() blocks, so run it in a worker thread to keep the event loop responsive
            result = await asyncio.to_thread(
                engine.generate,
                prompt=request.prompt,
                max_tokens=request.max_tokens,
                temperature=request.temperature,
            )
            # generate() does not return the model name, so fill it in here
            future.set_result(ChatResponse(**result, model=settings.MODEL_NAME))
        self.processing = False

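To wire the queue into the app, an endpoint can simply await enqueue(). A minimal sketch (the /v1/chat/queued path is an arbitrary name for illustration, not part of the lesson's API):

python
request_queue = RequestQueue()

@app.post("/v1/chat/queued", response_model=ChatResponse)
async def chat_queued(request: ChatRequest):
    # Requests wait in line; the model handles one generation at a time
    return await request_queue.enqueue(request)
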
Response Streaming

For long generations, stream tokens as they're produced:

python
from fastapi.responses import StreamingResponse
import json

@app.post("/v1/chat/stream")
async def chat_stream(request: ChatRequest):
    async def generate_tokens():
        for token in engine.stream_generate(request.prompt):
            yield f"data: {json.dumps({'token': token})}\n\n"
        yield "data: [DONE]\n\n"

    return StreamingResponse(
        generate_tokens(),
        media_type="text/event-stream"
    )
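
The ModelEngine class defined earlier has no stream_generate method. One way to add it is Hugging Face's TextIteratorStreamer, which runs generate() in a background thread and yields decoded text chunks as they are produced; this is a sketch under that assumption, reusing the tokenizer and model attributes from Step 2:

python
# inference.py: additional imports and a new ModelEngine method
from threading import Thread
from transformers import TextIteratorStreamer

class ModelEngine:
    # ... __init__ and generate() as defined in Step 2 ...

    def stream_generate(self, prompt: str, max_tokens: int = 512,
                        temperature: float = 0.7):
        inputs = self.tokenizer(prompt, return_tensors="pt").to(self.device)
        # The streamer yields decoded text chunks as tokens are generated
        streamer = TextIteratorStreamer(
            self.tokenizer, skip_prompt=True, skip_special_tokens=True
        )
        generation_kwargs = dict(
            **inputs,
            max_new_tokens=max_tokens,
            temperature=temperature,
            do_sample=temperature > 0,
            streamer=streamer,
        )
        # generate() blocks, so run it in a background thread and consume chunks here
        thread = Thread(target=self.model.generate, kwargs=generation_kwargs)
        thread.start()
        for chunk in streamer:
            yield chunk
        thread.join()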

Batched Inference

Process multiple requests together for better GPU utilization:

python
import asyncio
import time

# Each entry is (request, future, enqueue_time); an endpoint appends and awaits the future
batch_buffer = []
BATCH_SIZE = 8
BATCH_TIMEOUT = 0.1  # max seconds to wait before flushing a partial batch

async def batch_processor():
    while True:
        oldest_wait = time.time() - batch_buffer[0][2] if batch_buffer else 0.0
        if len(batch_buffer) >= BATCH_SIZE or \
           (batch_buffer and oldest_wait > BATCH_TIMEOUT):
            batch = batch_buffer[:BATCH_SIZE]
            del batch_buffer[:BATCH_SIZE]
            # batch_generate (not shown above) is assumed to return one result per prompt;
            # run it in a worker thread so the event loop stays responsive
            results = await asyncio.to_thread(
                engine.batch_generate, [req.prompt for req, _, _ in batch]
            )
            for (_, future, _), result in zip(batch, results):
                future.set_result(result)
        await asyncio.sleep(0.01)

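The loop above only consumes the buffer; requests still need a way in, and the processor has to be started once. A minimal sketch of the producer side (the /v1/chat/batched path and the lifespan change are illustrative assumptions, not part of the lesson's API):

python
@app.post("/v1/chat/batched")
async def chat_batched(request: ChatRequest):
    future = asyncio.get_running_loop().create_future()
    batch_buffer.append((request, future, time.time()))
    return await future  # resolved by batch_processor() once the batch is flushed

# Start the consumer once at startup, e.g. inside the lifespan handler:
#     asyncio.create_task(batch_processor())
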
Error Handling & Middleware

python
import logging
import time

from fastapi.middleware.cors import CORSMiddleware
from fastapi.responses import JSONResponse

logging.basicConfig(level=logging.INFO)

# CORS for web clients
app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],  # restrict to known origins in production
    allow_methods=["POST", "GET"],
    allow_headers=["*"],
)

# Request logging
@app.middleware("http")
async def log_requests(request, call_next):
    start = time.time()
    response = await call_next(request)
    duration = time.time() - start
    logging.info(f"{request.method} {request.url.path} - {response.status_code} - {duration:.3f}s")
    return response

# Global error handler
@app.exception_handler(Exception)
async def global_exception_handler(request, exc):
    logging.error(f"Unhandled error: {exc}")
    return JSONResponse(
        status_code=500,
        content={"error": "Internal server error", "detail": str(exc)}
    )

Practice Lab

Task 1: Basic AI API. Build a FastAPI app that serves a sentiment analysis model (use distilbert-base-uncased-finetuned-sst-2-english). Create a /predict endpoint that accepts text and returns positive/negative with a confidence score.

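A possible starting point for Task 1, using the transformers pipeline API (the file name and response shape are suggestions, not requirements):

python
# sentiment_api.py
from fastapi import FastAPI
from pydantic import BaseModel
from transformers import pipeline

app = FastAPI(title="Sentiment API")
# Small model, loaded once at import time; for larger models use the lifespan pattern from this lesson
classifier = pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english",
)

class PredictRequest(BaseModel):
    text: str

@app.post("/predict")
def predict(request: PredictRequest):
    # pipeline() returns e.g. [{"label": "POSITIVE", "score": 0.9998}]
    result = classifier(request.text)[0]
    return {"label": result["label"], "confidence": round(result["score"], 4)}
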
Task 2: Streaming Endpoint. Add a /v1/chat/stream endpoint that returns Server-Sent Events. Test it with curl and verify that tokens stream in real time.

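One way to check the Task 2 stream; curl's -N flag disables output buffering so tokens appear as they arrive:

bash
curl -N -X POST http://localhost:8000/v1/chat/stream \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Write a haiku about GPUs"}'
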
Task 3: Load Testing. Use wrk or hey to benchmark your API. Measure requests/sec, p50/p95/p99 latency, and error rate, then identify the bottleneck.

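A sample hey invocation for Task 3 (tune the request count and concurrency to your hardware; hey prints a latency distribution when it finishes):

bash
hey -n 200 -c 10 -m POST \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Summarize FastAPI in one sentence", "max_tokens": 64}' \
  http://localhost:8000/v1/chat
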
Pakistan Case Study

Meet Tariq — a freelance Python developer from Lahore building AI APIs for international clients on Upwork.

His first project: A Karachi logistics company needed a route optimization API. Tariq built it with FastAPI + a fine-tuned model.

His API architecture:

  • FastAPI with Pydantic validation (caught bad inputs before hitting the model)
  • Health check endpoint (client's monitoring system needed it)
  • Response streaming for the chatbot interface
  • Auto-generated Swagger docs (client's frontend team used them directly)

His pricing evolution:

  • First AI API project: $500 (learning experience)
  • After 3 projects with good reviews: $2,000-3,000 per API
  • Now charges $150/hour for AI API consulting
  • Monthly Upwork income: $4,000-6,000 (PKR 1.1M-1.7M)

His advice: "Every AI startup needs someone who can turn a Jupyter notebook into a production API. Learn FastAPI + Docker + basic deployment, and you have a $100+/hour skill."

Key Takeaways

  • FastAPI is the standard for serving AI models — fast, typed, auto-documented
  • Load models during startup (lifespan), not per-request
  • Use Pydantic models for request/response validation
  • Stream responses for long-running LLM generations
  • Batch inference improves GPU utilization (process 8 requests at once vs. 1)
  • Always include a /health endpoint for load balancers and monitoring
  • Auto-generated docs at /docs eliminate the need for separate API documentation

Next lesson: Load balancing and auto-scaling your AI API services.

Lesson Summary

Includes a hands-on practice lab, 9 runnable code examples, and a 4-question knowledge check below.

Quiz: FastAPI for Model Serving — Building AI Endpoints

4 questions to test your understanding. Score 60% or higher to pass.