5.1 — Monitoring, Logging & Error Recovery
Deploying an agent to production is one thing. Having that agent crash at 3 a.m. for some reason without anyone noticing is another matter entirely. In production agent systems, what you cannot observe you cannot fix, and what you cannot fix will eventually fail at the worst possible moment. Monitoring, logging, and error recovery are not afterthoughts; they are the invisible infrastructure that turns a demo into a production business. This lesson gives you a complete observability stack for autonomous agent systems running in Pakistan and internationally.
Section 1: What to Log in an Agent System
Every agent execution should produce a structured log with at minimum:
{
  "run_id": "unique_identifier_per_execution",
  "timestamp_start": "2026-03-26T03:00:00.000Z",
  "timestamp_end": "2026-03-26T03:00:47.230Z",
  "agent_name": "karachi_outreach_agent",
  "task_input": "Find 5 DHA restaurants, generate personalized pitches",
  "steps_completed": [
    {
      "step": 1,
      "action": "query_business_database",
      "params": {"city": "Karachi", "category": "restaurant", "limit": 5},
      "result_summary": "5 restaurants found",
      "duration_ms": 234,
      "tokens_used": {"input": 150, "output": 80}
    },
    {
      "step": 2,
      "action": "generate_pitch",
      "params": {"business_name": "Sakura Restaurant DHA"},
      "result_summary": "Pitch generated (200 words)",
      "duration_ms": 1847,
      "tokens_used": {"input": 450, "output": 200}
    }
  ],
  "total_steps": 10,
  "status": "completed",
  "errors": [],
  "total_cost_usd": 0.0234,
  "total_cost_pkr": 6.67,
  "output_summary": "10 pitches generated, saved to outreach_batch_2026-03-26.csv"
}
This structured log answers every question you will have when debugging: What ran? In what order? How long did each step take? What did it cost? What was the output?
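As a rough illustration of how such a log can be queried, the sketch below takes a run log shaped like the example above and reports the slowest step, the token totals, and an estimated cost. The function name `summarize_run` and the per-million-token prices are placeholders invented for this example, not real rates; substitute your provider's current pricing.

```python
from typing import Any

# Hypothetical prices in USD per million tokens; replace with real rates.
PRICE_PER_MTOK_USD = {"input": 3.0, "output": 15.0}


def summarize_run(run_log: dict[str, Any]) -> dict[str, Any]:
    """Summarize a run log that follows the structure shown above."""
    steps = run_log.get("steps_completed", [])

    # Add up the token counts recorded on each step.
    totals = {"input": 0, "output": 0}
    for step in steps:
        tokens = step.get("tokens_used", {})
        totals["input"] += tokens.get("input", 0)
        totals["output"] += tokens.get("output", 0)

    # The step with the largest duration is the first place to look
    # when a run is slower than expected.
    slowest = max(steps, key=lambda s: s.get("duration_ms", 0), default=None)

    estimated_cost = sum(
        totals[kind] * PRICE_PER_MTOK_USD[kind] / 1_000_000 for kind in totals
    )
    return {
        "run_id": run_log.get("run_id"),
        "status": run_log.get("status"),
        "logged_steps": len(steps),
        "slowest_step": slowest.get("action") if slowest else None,
        "total_duration_ms": sum(s.get("duration_ms", 0) for s in steps),
        "total_tokens": totals,
        "estimated_cost_usd": round(estimated_cost, 6),
    }
```

Comparing `logged_steps` with the log's own `total_steps` field is a cheap way to spot runs where some steps were never recorded.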
Section 2: The Complete Logging System
import json
import os
import time
import traceback
import uuid
from datetime import datetime, timezone
from pathlib import Path
from typing import Any, Optional


class AgentLogger:
    """Production-grade logger for autonomous agents"""

    def __init__(self, agent_name: str, log_dir: str = "logs"):
        self.agent_name = agent_name
        self.log_dir = Path(log_dir)
        self.log_dir.mkdir(parents=True, exist_ok=True)
        self.run_id = str(uuid.uuid4())[:8]  # Short ID, easy to grep for
        self.run_log = {
            "run_id": self.run_id,
            "agent_name": agent_name,
            # Timezone-aware UTC; datetime.utcnow() is deprecated since Python 3.12
            "timestamp_start": datetime.now(timezone.utc).isoformat(),
            "steps": [],
            "errors": [],
            "status": "running",
            "total_cost_usd": 0.0,
            "total_tokens": {"input": 0, "output": 0}
        }
        self._start_time = time.time()
        print(f"[{agent_name}] Run {self.run_id} started")

    def log_step(self, step_name: str, params: dict, result: Any,
                 duration_ms: int, tokens: Optional[dict] = None):
        """Log a completed step"""
        step = {
            "step_number": len(self.run_log["steps"]) + 1,
            "name": step_name,
            "params": params,
            "result_summary": str(result)[:200],  # Cap at 200 chars
            "duration_ms": duration_ms,
            "tokens": tokens or {}
        }
        self.run_log["steps"].append(step)
        if tokens:
            self.run_log["total_tokens"]["input"] += tokens.get("input", 0)
            self.run_log["total_tokens"]["output"] += tokens.get("output", 0)
        print(f"[{self.agent_name}] Step {step['step_number']}: {step_name} ({duration_ms}ms)")

    def log_error(self, error: Exception, step_name: str, fatal: bool = False):
        """Log an error. Call this from inside an except block so the traceback is captured."""
        error_entry = {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "step": step_name,
            "error_type": type(error).__name__,
            "error_message": str(error),
            "traceback": traceback.format_exc(),
            "fatal": fatal
        }
        self.run_log["errors"].append(error_entry)
        print(f"[{self.agent_name}] ERROR in {step_name}: {error}")
        if fatal:
            self.finalize("failed")

    def finalize(self, status: str = "completed"):
        """Write final log to disk"""
        self.run_log["status"] = status
        self.run_log["timestamp_end"] = datetime.now(timezone.utc).isoformat()
        self.run_log["duration_seconds"] = round(time.time() - self._start_time, 2)
        log_path = self.log_dir / f"{self.agent_name}_{self.run_id}_{status}.json"
        with open(log_path, "w") as f:
            json.dump(self.run_log, f, indent=2, default=str)
        print(f"[{self.agent_name}] Run {self.run_id} {status}. Log: {log_path}")
        return log_path
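Measuring `duration_ms` by hand around every step gets repetitive. One possible convenience, sketched here under the assumption that the logger exposes `log_step` and `log_error` with the signatures of `AgentLogger` above, is a small context manager. The name `timed_step` and the `outcome` dictionary are hypothetical helpers introduced for this example, not part of the lesson's code.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed_step(logger, step_name: str, params: dict):
    """Time a block of work and report it via logger.log_step / log_error.

    `logger` is any object exposing log_step(...) and log_error(...) with
    the same signatures as AgentLogger.
    """
    outcome = {"result": None, "tokens": None}
    start = time.perf_counter()
    try:
        # The caller fills in outcome["result"] and, optionally, outcome["tokens"].
        yield outcome
    except Exception as exc:
        # Still inside the except block, so the logger can capture the traceback.
        logger.log_error(exc, step_name)
        raise
    else:
        duration_ms = int((time.perf_counter() - start) * 1000)
        logger.log_step(step_name, params, outcome["result"],
                        duration_ms, outcome["tokens"])


# Usage sketch (assumes `logger = AgentLogger("karachi_outreach_agent")`):
#
# with timed_step(logger, "query_business_database", {"city": "Karachi"}) as out:
#     out["result"] = "5 restaurants found"
```

The exception is re-raised after logging so the caller still decides whether the failure is fatal.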
Section 3: Error Recovery Strategies
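Not all errors are equal: rate limits and server-side (5xx) errors are usually transient and worth retrying, while client-side (4xx) errors will not succeed on a plain retry. Implement tiered recovery: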
import os
import time

import anthropic


class ResilientAgentRunner:
    """Agent runner with retry logic and graceful degradation"""

    def __init__(self, max_retries: int = 3, retry_delay: float = 2.0):
        self.max_retries = max_retries
        self.retry_delay = retry_delay
        self.client = anthropic.Anthropic(api_key=os.environ.get("ANTHROPIC_API_KEY"))

    def safe_api_call(self, prompt: str, model: str = "claude-sonnet-4-6",
                      fallback_model: str = "claude-haiku-4-5-20251001"):
        """API call with retry, fallback, and graceful degradation.

        Returns (text, model_used), or (None, None) if every option failed.
        """
        for attempt in range(self.max_retries):
            try:
                response = self.client.messages.create(
                    model=model,
                    max_tokens=1000,
                    messages=[{"role": "user", "content": prompt}]
                )
                return response.content[0].text, model
            except anthropic.RateLimitError:
                # Must be caught before APIStatusError, which it subclasses
                wait = self.retry_delay * (2 ** attempt)  # Exponential backoff
                print(f"Rate limited. Waiting {wait}s before retry {attempt + 1}/{self.max_retries}")
                time.sleep(wait)
            except anthropic.APIStatusError as e:
                if e.status_code >= 500:  # Server errors: retry
                    time.sleep(self.retry_delay)
                    continue
                else:  # Client errors: do not retry, try fallback
                    print(f"API error {e.status_code}. Trying fallback model: {fallback_model}")
                    try:
                        response = self.client.messages.create(
                            model=fallback_model,
                            max_tokens=1000,
                            messages=[{"role": "user", "content": prompt}]
                        )
                        return response.content[0].text, fallback_model
                    except Exception:
                        return None, None  # Graceful degradation
        print("All retries exhausted. Returning None.")
        return None, None
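The backoff schedule used above (the base delay doubled on each attempt) can also be isolated in a small helper, which makes it testable without touching the API. This is a sketch rather than a drop-in replacement: the names `backoff_delays` and `retry_with_backoff`, the optional `jitter` parameter, and the injectable `sleep` are assumptions introduced for illustration. Unlike `safe_api_call`, it re-raises after the last attempt instead of returning `None`.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def backoff_delays(base_delay: float, max_retries: int) -> list[float]:
    """Delays for successive attempts: base, 2 * base, 4 * base, ..."""
    return [base_delay * (2 ** attempt) for attempt in range(max_retries)]


def retry_with_backoff(
    func: Callable[[], T],
    retryable: tuple[type[BaseException], ...],
    max_retries: int = 3,
    base_delay: float = 2.0,
    jitter: float = 0.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call func(); on a retryable exception wait and try again.

    After the final attempt the exception is re-raised so the caller can
    log it or fall back to something else.
    """
    for attempt, delay in enumerate(backoff_delays(base_delay, max_retries)):
        try:
            return func()
        except retryable:
            if attempt == max_retries - 1:
                raise  # out of attempts
            # Optional random jitter spreads out retries from parallel agents.
            sleep(delay + random.uniform(0, jitter))
    raise ValueError("max_retries must be at least 1")
```

Passing a fake `sleep` (for example `recorded.append`) lets you check the waiting times in a unit test in milliseconds instead of waiting for real rate limits.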
Section 4: Alerting for Pakistani Production Systems
When your agent fails in production, you need to know immediately. Three alerting channels for Pakistani developers:
Channel 1: WhatsApp via WATI (best for Pakistan)
def send_whatsapp_alert(message: str, phone: str):
    """Send critical agent failure alert via WhatsApp"""
    # WATI API call here
    pass
Channel 2: Email via Gmail SMTP
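import os
import smtplib
from email.mime.text import MIMEText


def send_email_alert(subject: str, body: str, to_email: str):
    sender = os.environ.get("ALERT_EMAIL")
    msg = MIMEText(body)
    msg["Subject"] = f"[AGENT ALERT] {subject}"
    msg["From"] = sender
    msg["To"] = to_email
    # Send via SMTP over SSL. ALERT_EMAIL_PASSWORD is an assumed variable name;
    # Gmail expects an app password here, not the normal account password.
    with smtplib.SMTP_SSL("smtp.gmail.com", 465) as server:
        server.login(sender, os.environ.get("ALERT_EMAIL_PASSWORD"))
        server.send_message(msg)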
Channel 3: Telegram Bot (free, instant)
import os
import requests


def send_telegram_alert(message: str):
    bot_token = os.environ.get("TELEGRAM_BOT_TOKEN")
    chat_id = os.environ.get("TELEGRAM_CHAT_ID")
    response = requests.post(f"https://api.telegram.org/bot{bot_token}/sendMessage",
                             json={"chat_id": chat_id, "text": message},
                             timeout=10)  # do not let a slow alert hang the agent
    response.raise_for_status()  # surface failed deliveries
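To connect these channels to the run log, one option is to build the alert text from the log itself and try each channel in order until one succeeds. The sketch below assumes a run log with the fields written by `AgentLogger`; `build_alert_message` and `alert_if_failed` are hypothetical names introduced for this example.

```python
from typing import Callable, Iterable


def build_alert_message(run_log: dict) -> str:
    """Turn a finalized run log into a short, phone-friendly alert text."""
    errors = run_log.get("errors", [])
    last = errors[-1] if errors else {}
    return (
        f"Agent {run_log.get('agent_name', 'unknown')} "
        f"run {run_log.get('run_id', '?')} status: {run_log.get('status', '?')}\n"
        f"Errors: {len(errors)}\n"
        f"Last error: {last.get('error_type', 'n/a')} in step "
        f"{last.get('step', 'n/a')}: {last.get('error_message', '')}"
    )


def alert_if_failed(run_log: dict,
                    senders: Iterable[Callable[[str], None]]) -> bool:
    """Send an alert through the first sender that works.

    Returns True if an alert was delivered, False otherwise (including
    the case where the run did not fail and no alert was needed).
    """
    if run_log.get("status") != "failed":
        return False
    message = build_alert_message(run_log)
    for send in senders:
        try:
            send(message)
            return True
        except Exception as exc:  # a broken channel must not hide the alert
            print(f"Alert channel {getattr(send, '__name__', send)} failed: {exc}")
    return False
```

A call such as `alert_if_failed(logger.run_log, [send_telegram_alert])` would then go right after `finalize`. The fallback only works if each sender raises when delivery fails, for example by checking the HTTP response status.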
Practice Lab
Exercise 1: Add Logging to an Existing Agent
Take any agent you have built in this course. Add the AgentLogger class to it. Run the agent 3 times. Review the log files. Are the step timings what you expected? Are there any steps taking longer than they should?
Exercise 2: Trigger and Test Error Recovery
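Deliberately cause a rate limit error by making rapid API calls. Verify that the exponential backoff in ResilientAgentRunner correctly waits and retries. Check that the error is logged correctly in the run log; ResilientAgentRunner only prints to the console, so record the failure with AgentLogger.log_error in your own code.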
Exercise 3: Set Up Your Alert Channel
Create a Telegram bot (free — search "BotFather" on Telegram). Configure the send_telegram_alert function with your bot token and chat ID. Test it by triggering a manual alert from your agent. Confirm you receive it on your phone.
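If you are unsure what your chat ID is, one common approach is to send your new bot any message and then read it back through the Bot API's `getUpdates` method. The sketch below assumes no webhook is set for the bot and that the message was sent recently; `extract_chat_ids` and `fetch_chat_ids` are hypothetical helper names for this exercise.

```python
import json
import urllib.request


def extract_chat_ids(updates: dict) -> list[int]:
    """Collect the unique chat IDs found in a getUpdates response."""
    chat_ids: list[int] = []
    for update in updates.get("result", []):
        chat = update.get("message", {}).get("chat", {})
        chat_id = chat.get("id")
        if chat_id is not None and chat_id not in chat_ids:
            chat_ids.append(chat_id)
    return chat_ids


def fetch_chat_ids(bot_token: str) -> list[int]:
    """Call getUpdates for the bot and return the chat IDs it has seen."""
    url = f"https://api.telegram.org/bot{bot_token}/getUpdates"
    with urllib.request.urlopen(url, timeout=10) as response:
        return extract_chat_ids(json.load(response))
```

Running `fetch_chat_ids(os.environ["TELEGRAM_BOT_TOKEN"])` once (after `import os`) should print a list containing your own chat ID, which you can then store in `TELEGRAM_CHAT_ID`.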
Key Takeaways
- Structured logging with run IDs, step-by-step records, and cost tracking is the foundation of observable production agent systems
- Every log entry should answer: what ran, when, how long it took, what it cost, and what was the output — without this, debugging production failures is guesswork
- Tiered error recovery (retry with exponential backoff → fallback model → graceful degradation) prevents a single API hiccup from crashing your entire agent run
- Pakistani developers should set up Telegram bot alerts for critical failures — it is free, instant, and reaches you on your phone
- The monitoring infrastructure described in this lesson is what separates a demo-level agent from a production-grade system that clients and businesses can depend on