Monitoring, Logging & Error Recovery

Deploying an agent to production is one thing. That agent crashing at 3 a.m. for some reason, without anyone finding out, is quite another. In production agent systems, what you cannot observe you cannot fix, and what you cannot fix will eventually fail at the worst possible moment. Monitoring, logging, and error recovery are not afterthoughts; they are the invisible infrastructure that turns a demo into a production business. This lesson covers the core observability stack (structured logs, tiered error recovery, and failure alerts) for autonomous agent systems running in Pakistan and internationally.

Section 1: What to Log in an Agent System

Every agent execution should produce a structured log with at minimum:

python

{
    "run_id": "unique_identifier_per_execution",
    "timestamp_start": "2026-03-26T03:00:00.000Z",
    "timestamp_end": "2026-03-26T03:00:47.230Z",
    "agent_name": "karachi_outreach_agent",
    "task_input": "Find 5 DHA restaurants, generate personalized pitches",
    "steps_completed": [
        {
            "step": 1,
            "action": "query_business_database",
            "params": {"city": "Karachi", "category": "restaurant", "limit": 5},
            "result_summary": "5 restaurants found",
            "duration_ms": 234,
            "tokens_used": {"input": 150, "output": 80}
        },
        {
            "step": 2,
            "action": "generate_pitch",
            "params": {"business_name": "Sakura Restaurant DHA"},
            "result_summary": "Pitch generated (200 words)",
            "duration_ms": 1847,
            "tokens_used": {"input": 450, "output": 200}
        }
    ],
    "total_steps": 10,
    "status": "completed",
    "errors": [],
    "total_cost_usd": 0.0234,
    "total_cost_pkr": 6.67,
    "output_summary": "5 pitches generated, saved to outreach_batch_2026-03-26.csv"
}

This structured log answers every question you will have when debugging: What ran? In what order? How long did each step take? What did it cost? What was the output?
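
As a rough sketch of how such a log can answer those questions in code, the function below reports the time spent in the logged steps, the slowest action, token totals, and an estimated cost. The field names follow the example log above (adjust them if your logger writes different keys). The per-million-token prices and the PKR exchange rate are placeholder assumptions, not real pricing, and `summarize_run` is a hypothetical helper name.

```python
def summarize_run(run_log: dict,
                  usd_per_million_input: float = 3.0,    # assumed price, not a real quote
                  usd_per_million_output: float = 15.0,  # assumed price, not a real quote
                  usd_to_pkr: float = 285.0) -> dict:    # assumed exchange rate
    """Answer the basic debugging questions from one structured run log."""
    steps = run_log.get("steps_completed", [])
    # The slowest step is usually the first place to look when a run drags
    slowest = max(steps, key=lambda s: s.get("duration_ms", 0), default=None)
    tokens_in = sum(s.get("tokens_used", {}).get("input", 0) for s in steps)
    tokens_out = sum(s.get("tokens_used", {}).get("output", 0) for s in steps)
    cost_usd = (tokens_in * usd_per_million_input
                + tokens_out * usd_per_million_output) / 1_000_000
    return {
        "run_id": run_log.get("run_id"),
        "status": run_log.get("status"),
        "logged_step_ms": sum(s.get("duration_ms", 0) for s in steps),
        "slowest_action": slowest["action"] if slowest else None,
        "tokens": {"input": tokens_in, "output": tokens_out},
        "estimated_cost_usd": round(cost_usd, 6),
        "estimated_cost_pkr": round(cost_usd * usd_to_pkr, 2),
    }


example_log = {
    "run_id": "a1b2c3d4",  # hypothetical run ID
    "status": "completed",
    "steps_completed": [
        {"step": 1, "action": "query_business_database", "duration_ms": 234,
         "tokens_used": {"input": 150, "output": 80}},
        {"step": 2, "action": "generate_pitch", "duration_ms": 1847,
         "tokens_used": {"input": 450, "output": 200}},
    ],
}

summary = summarize_run(example_log)
print(summary)
```

Keeping prices and the exchange rate as parameters means the same log can be re-costed later without re-running the agent.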

Section 2: The Complete Logging System

python

import json
import time
import uuid
import traceback
from datetime import datetime, timezone
from pathlib import Path
from typing import Any, Optional
import os

class AgentLogger:
    """Production-grade logger for autonomous agents"""

    def __init__(self, agent_name: str, log_dir: str = "logs"):
        self.agent_name = agent_name
        self.log_dir = Path(log_dir)
        self.log_dir.mkdir(parents=True, exist_ok=True)
        self.run_id = str(uuid.uuid4())[:8]
        self.run_log = {
            "run_id": self.run_id,
            "agent_name": agent_name,
            "timestamp_start": self._now(),
            "steps": [],
            "errors": [],
            "status": "running",
            "total_cost_usd": 0.0,
            "total_tokens": {"input": 0, "output": 0}
        }
        self._start_time = time.time()
        self._log_path = None  # Set once the run has been finalized
        print(f"[{agent_name}] Run {self.run_id} started")

    @staticmethod
    def _now() -> str:
        """Current UTC time as an ISO 8601 string (datetime.utcnow() is deprecated)"""
        return datetime.now(timezone.utc).isoformat()

    def log_step(self, step_name: str, params: dict, result: Any,
                 duration_ms: int, tokens: Optional[dict] = None,
                 cost_usd: float = 0.0):
        """Log a completed step"""
        step = {
            "step_number": len(self.run_log["steps"]) + 1,
            "name": step_name,
            "params": params,
            "result_summary": str(result)[:200],  # Cap at 200 chars
            "duration_ms": duration_ms,
            "tokens": tokens or {}
        }
        self.run_log["steps"].append(step)

        if tokens:
            self.run_log["total_tokens"]["input"] += tokens.get("input", 0)
            self.run_log["total_tokens"]["output"] += tokens.get("output", 0)
        # Accumulate cost so total_cost_usd does not stay at 0.0
        self.run_log["total_cost_usd"] += cost_usd

        print(f"[{self.agent_name}] Step {step['step_number']}: {step_name} ({duration_ms}ms)")

    def log_error(self, error: Exception, step_name: str, fatal: bool = False):
        """Log an error"""
        error_entry = {
            "timestamp": self._now(),
            "step": step_name,
            "error_type": type(error).__name__,
            "error_message": str(error),
            # Format the passed-in exception, so this also works outside an except block
            "traceback": "".join(
                traceback.format_exception(type(error), error, error.__traceback__)
            ),
            "fatal": fatal
        }
        self.run_log["errors"].append(error_entry)
        print(f"[{self.agent_name}] ERROR in {step_name}: {error}")

        if fatal:
            self.finalize("failed")

    def finalize(self, status: str = "completed"):
        """Write final log to disk"""
        if self._log_path is not None:
            # Already finalized (for example by a fatal error): do not write a second file
            return self._log_path

        self.run_log["status"] = status
        self.run_log["timestamp_end"] = self._now()
        self.run_log["duration_seconds"] = round(time.time() - self._start_time, 2)

        log_path = self.log_dir / f"{self.agent_name}_{self.run_id}_{status}.json"
        with open(log_path, "w") as f:
            json.dump(self.run_log, f, indent=2, default=str)

        self._log_path = log_path
        print(f"[{self.agent_name}] Run {self.run_id} {status}. Log: {log_path}")
        return log_path
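
Measuring `duration_ms` by hand around every step gets repetitive. One possible approach, sketched below, is a small context manager that times a block and then calls `log_step` (or `log_error` if the block raises). `timed_step` is a hypothetical helper, not part of the class above, and `RecordingLogger` is only a stand-in with the same two method names so the sketch runs on its own; in a real agent you would pass an `AgentLogger` instance instead.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed_step(logger, step_name: str, params: dict):
    """Time a block of work and report it through logger.log_step / logger.log_error."""
    outcome = {"result": None, "tokens": None}  # The block fills these in
    start = time.perf_counter()
    try:
        yield outcome
    except Exception as exc:
        logger.log_error(exc, step_name)
        raise  # Let the caller decide whether the failure is fatal
    else:
        duration_ms = int((time.perf_counter() - start) * 1000)
        logger.log_step(step_name, params, outcome["result"], duration_ms, outcome["tokens"])


class RecordingLogger:
    """Hypothetical stand-in exposing the same two methods as AgentLogger."""

    def __init__(self):
        self.steps, self.errors = [], []

    def log_step(self, step_name, params, result, duration_ms, tokens=None):
        self.steps.append((step_name, params, result, duration_ms, tokens))

    def log_error(self, error, step_name, fatal=False):
        self.errors.append((step_name, type(error).__name__))


logger = RecordingLogger()
with timed_step(logger, "query_business_database", {"city": "Karachi"}) as step:
    step["result"] = "5 restaurants found"  # Real work would go here
    step["tokens"] = {"input": 150, "output": 80}
```

The timing uses `time.perf_counter()` because it is monotonic, so a system clock adjustment in the middle of a run cannot produce a negative duration.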

Section 3: Error Recovery Strategies

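Not all errors are equal. Rate limits and server errors (5xx) are usually transient and worth retrying. Client errors (4xx) will not be fixed by repeating the same request, so switch to a fallback model and, if that also fails, return nothing rather than crash the run. Implement tiered recovery: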

python

import os
import time

import anthropic

class ResilientAgentRunner:
    """Agent runner with retry logic and graceful degradation"""

    def __init__(self, max_retries: int = 3, retry_delay: float = 2.0):
        self.max_retries = max_retries
        self.retry_delay = retry_delay
        self.client = anthropic.Anthropic(api_key=os.environ.get("ANTHROPIC_API_KEY"))

    def safe_api_call(self, prompt: str, model: str = "claude-sonnet-4-6",
                      fallback_model: str = "claude-haiku-4-5-20251001"):
        """API call with retry, fallback, and graceful degradation"""

        for attempt in range(self.max_retries):
            try:
                response = self.client.messages.create(
                    model=model,
                    max_tokens=1000,
                    messages=[{"role": "user", "content": prompt}]
                )
                return response.content[0].text, model

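            except anthropic.RateLimitError:
                wait = self.retry_delay * (2 ** attempt)  # Exponential backoff
                print(f"Rate limited. Waiting {wait}s before retry {attempt + 1}/{self.max_retries}")
                time.sleep(wait)

            except anthropic.APIConnectionError:
                # Network failures and timeouts are not APIStatusError, so catch them separately
                print(f"Connection error. Retry {attempt + 1}/{self.max_retries}")
                time.sleep(self.retry_delay)
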
            except anthropic.APIStatusError as e:
                if e.status_code >= 500:  # Server errors — retry
                    time.sleep(self.retry_delay)
                    continue
                else:  # Client errors — do not retry, try fallback
                    print(f"API error {e.status_code}. Trying fallback model: {fallback_model}")
                    try:
                        response = self.client.messages.create(
                            model=fallback_model,
                            max_tokens=1000,
                            messages=[{"role": "user", "content": prompt}]
                        )
                        return response.content[0].text, fallback_model
                    except Exception:
                        return None, None  # Graceful degradation

        print("All retries exhausted. Returning None.")
        return None, None
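
To check the backoff schedule without triggering a real rate limit, the retry loop can be factored into a function whose `sleep` is injectable. The sketch below is a generic illustration of exponential backoff with optional jitter, not the Anthropic SDK's own retry mechanism; `retry_with_backoff` and `FakeRateLimitError` are hypothetical names used only for this example.

```python
import random
import time
from typing import Callable, Optional, Tuple, Type, TypeVar

T = TypeVar("T")


def retry_with_backoff(func: Callable[[], T],
                       retryable: Tuple[Type[BaseException], ...],
                       max_retries: int = 3,
                       base_delay: float = 2.0,
                       jitter: float = 0.0,
                       sleep: Callable[[float], None] = time.sleep) -> Optional[T]:
    """Call func, retrying on `retryable` errors with exponentially growing waits."""
    for attempt in range(max_retries):
        try:
            return func()
        except retryable:
            if attempt == max_retries - 1:
                break  # No point waiting after the final attempt
            # 2s, 4s, 8s, ... plus random jitter so parallel agents do not retry in lockstep
            delay = base_delay * (2 ** attempt) + random.uniform(0, jitter)
            sleep(delay)
    return None  # Graceful degradation: the caller checks for None


class FakeRateLimitError(Exception):
    """Hypothetical stand-in for a provider's rate limit exception."""


calls = {"count": 0}
waits = []  # Records the requested delays instead of actually sleeping


def flaky_call() -> str:
    calls["count"] += 1
    if calls["count"] < 3:
        raise FakeRateLimitError("simulated 429")
    return "ok"


result = retry_with_backoff(flaky_call, (FakeRateLimitError,), sleep=waits.append)
print(result, waits)
```

Passing `waits.append` as `sleep` makes the test instant while still showing exactly which delays the runner would have used.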

Section 4: Alerting for Pakistani Production Systems

When your agent fails in production, you need to know immediately. Three alerting channels for Pakistani developers:

Channel 1: WhatsApp via WATI (best for Pakistan)

python

def send_whatsapp_alert(message: str, phone: str):
    """Send critical agent failure alert via WhatsApp"""
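    # WATI API call here (the endpoint and access token come from your WATI account)
    # Raise until implemented so a missing alert channel is never silently ignored
    raise NotImplementedError("WATI alert channel is not implemented yet")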

Channel 2: Email via Gmail SMTP

python

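import os
import smtplib
from email.mime.text import MIMEText

def send_email_alert(subject: str, body: str, to_email: str):
    sender = os.environ["ALERT_EMAIL"]
    # ALERT_EMAIL_PASSWORD is an assumed variable name; Gmail expects an app password here
    password = os.environ["ALERT_EMAIL_PASSWORD"]

    msg = MIMEText(body)
    msg["Subject"] = f"[AGENT ALERT] {subject}"
    msg["From"] = sender
    msg["To"] = to_email

    # Send via Gmail SMTP over SSL (port 465)
    with smtplib.SMTP_SSL("smtp.gmail.com", 465) as server:
        server.login(sender, password)
        server.send_message(msg)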

Channel 3: Telegram Bot (free, instant)

python

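import os
import requests

def send_telegram_alert(message: str):
    bot_token = os.environ.get("TELEGRAM_BOT_TOKEN")
    chat_id = os.environ.get("TELEGRAM_CHAT_ID")
    response = requests.post(f"https://api.telegram.org/bot{bot_token}/sendMessage",
                             json={"chat_id": chat_id, "text": message},
                             timeout=10)  # Do not let a hung request block the agent
    response.raise_for_status()  # Surface failed deliveries instead of ignoring them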
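
An alert that fails to send is itself a silent failure. One way to combine the three channels, sketched below, is to try them in priority order and stop at the first one that succeeds. `dispatch_alert` is a hypothetical helper, and a channel only counts as failed if its function raises an exception. The two fake channels exist purely so the sketch runs without credentials; in practice you would pass functions such as `send_telegram_alert`.

```python
from typing import Callable, Iterable, Optional, Tuple

Channel = Tuple[str, Callable[[str], None]]


def dispatch_alert(message: str, channels: Iterable[Channel]) -> Optional[str]:
    """Try each alert channel in order; return the name of the first that succeeds."""
    for name, send in channels:
        try:
            send(message)
            return name
        except Exception as exc:
            # Keep going: one broken channel must not block the others
            print(f"Alert channel '{name}' failed: {exc}")
    return None  # Every channel failed; the caller should at least log this


delivered = []  # Stands in for messages arriving on your phone


def fake_telegram(message: str) -> None:
    raise ConnectionError("simulated network outage")


def fake_email(message: str) -> None:
    delivered.append(message)


used = dispatch_alert("karachi_outreach_agent failed at step 2",
                      [("telegram", fake_telegram), ("email", fake_email)])
print(used)
```

Returning the channel name lets the run log record how the alert actually went out.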

Practice Lab

Exercise 1: Add Logging to an Existing Agent

Take any agent you have built in this course. Add the AgentLogger class to it. Run the agent 3 times. Review the log files. Are the step timings what you expected? Are there any steps taking longer than they should?

Exercise 2: Trigger and Test Error Recovery

Deliberately cause a rate limit error by making rapid API calls. Verify that the exponential backoff in ResilientAgentRunner correctly waits and retries. Check that the error is logged correctly in the run log.

Exercise 3: Set Up Your Alert Channel

Create a Telegram bot (free: search "BotFather" on Telegram). Configure the send_telegram_alert function with your bot token and chat ID. Test it by triggering a manual alert from your agent. Confirm you receive it on your phone.

Key Takeaways

Structured logging with run IDs, step-by-step records, and cost tracking is the foundation of observable production agent systems
Every log entry should answer: what ran, when, how long it took, what it cost, and what was the output — without this, debugging production failures is guesswork
Tiered error recovery (retry with exponential backoff → fallback model → graceful degradation) prevents a single API hiccup from crashing your entire agent run
Pakistani developers should set up Telegram bot alerts for critical failures — it is free, instant, and reaches you on your phone
The monitoring infrastructure described in this lesson is what separates a demo-level agent from a production-grade system that clients and businesses can depend on
