5.2 — Cost Management — Token Budgets & Caching
Now let's talk about money, because an agent that runs unchecked until it has eaten your entire budget is not an agent, it is a financial disaster. Agent systems have a cost structure that differs from single API calls: they make multiple calls per task, may run hundreds of tasks per day, and can spiral out of control in runaway loops. Professional agent developers treat cost management as a first-class concern, not an afterthought. This lesson covers three tools for keeping agent costs predictable and bounded in the Pakistani and international context: per-run token and cost budgets, prompt caching, and a monthly cost calculator.
Section 1: Understanding Where Agent Costs Come From
In a multi-step agent, every step costs tokens. Here is a realistic cost breakdown for a Karachi outreach agent run:
```
AGENT RUN: Karachi Restaurant Outreach (10 businesses)
──────────────────────────────────────────────────────────
Step 1: Trend analysis (Gemini Flash)
  Input:  500 tokens (prompt) × 10 businesses = 5,000
  Output: 200 tokens × 10 = 2,000
  Cost:   $0.000975
Step 2: Personalized pitch generation (Claude Sonnet)
  Input:  800 tokens (context + business data) × 10 = 8,000
  Output: 300 tokens (pitch) × 10 = 3,000
  Cost:   $0.069
Step 3: QC review (Claude Haiku)
  Input:  600 tokens (pitch + criteria) × 10 = 6,000
  Output: 100 tokens (verdict) × 10 = 1,000
  Cost:   $0.00275
Step 4: Email formatting (Claude Haiku)
  Input:  500 tokens × 10 = 5,000
  Output: 200 tokens × 10 = 2,000
  Cost:   $0.00375

TOTAL: $0.076 (PKR ~21.7)
At 50 runs/day: $3.80/day (PKR ~1,083/day)
At 50 runs/day for 30 days: $114/month (PKR ~32,490/month)
```
Without cost controls, a bug in the loop logic could trigger 500 runs in an hour. At $0.076 per run that is about $38 (roughly PKR 10,800) in API costs before you notice.
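The breakdown above is plain arithmetic: tokens multiplied by a per-token rate, summed over steps. Here is a minimal sketch that recomputes it from the token counts, assuming the approximate per-million-token rates used in this lesson's monthly calculator. `RATES_PER_MILLION`, `STEPS`, `step_cost`, and `run_cost` are illustrative names, not part of any SDK.

```python
# Approximate USD rates per 1M tokens as (input, output), taken from this lesson.
# These are assumptions; verify against current provider pricing before relying on them.
RATES_PER_MILLION = {
    "gemini-flash": (0.075, 0.30),
    "claude-sonnet": (3.00, 15.00),
    "claude-haiku": (0.25, 1.25),
}

# (step name, model, total input tokens, total output tokens) for one 10-business run
STEPS = [
    ("Trend analysis", "gemini-flash", 5_000, 2_000),
    ("Pitch generation", "claude-sonnet", 8_000, 3_000),
    ("QC review", "claude-haiku", 6_000, 1_000),
    ("Email formatting", "claude-haiku", 5_000, 2_000),
]

PKR_RATE = 285.0  # assumed exchange rate, the same one used elsewhere in the lesson


def step_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the USD cost of one step given its total token counts."""
    in_rate, out_rate = RATES_PER_MILLION[model]
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000


def run_cost() -> float:
    """Return the USD cost of one full agent run (all four steps)."""
    return sum(step_cost(model, i, o) for _, model, i, o in STEPS)


if __name__ == "__main__":
    for name, model, i, o in STEPS:
        print(f"{name:<18} {model:<14} ${step_cost(model, i, o):.6f}")
    total = run_cost()
    print(f"Per run:   ${total:.3f} (PKR {total * PKR_RATE:.1f})")
    print(f"Per month: ${total * 50 * 30:.0f} at 50 runs/day")
```

Because the totals here are computed from unrounded step costs, they can differ by a rupee or a dollar from figures derived from the rounded $0.076 per run. Pitch generation on Sonnet accounts for about 90% of the run cost, so that is the step to optimize first.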
Section 2: Token Budget Enforcement
```python
import os

import anthropic


class BudgetedAgent:
    """Agent with hard token and cost limits per run."""

    def __init__(self, max_tokens_per_run: int = 50000,
                 max_cost_usd_per_run: float = 0.50,
                 pkr_rate: float = 285.0):
        self.client = anthropic.Anthropic(api_key=os.environ.get("ANTHROPIC_API_KEY"))
        self.max_tokens = max_tokens_per_run
        self.max_cost = max_cost_usd_per_run
        self.pkr_rate = pkr_rate
        self.tokens_used = {"input": 0, "output": 0}
        self.cost_usd = 0.0
        # USD per 1K tokens, as model: (input_rate, output_rate).
        # Illustrative figures; check your provider's current price list.
        self.rates = {
            "claude-haiku-4-5-20251001": (0.00025, 0.00125),
            "claude-sonnet-4-6": (0.003, 0.015),
        }

    def _check_budget(self, model: str, estimated_input: int, estimated_output: int):
        """Raise ValueError if proceeding would exceed the cost or token budget."""
        # Unknown models fall back to the Sonnet rates
        rates = self.rates.get(model, (0.003, 0.015))
        estimated_cost = (estimated_input / 1000 * rates[0]) + \
                         (estimated_output / 1000 * rates[1])
        projected_total = self.cost_usd + estimated_cost
        projected_tokens = self.tokens_used["input"] + self.tokens_used["output"] + \
                           estimated_input + estimated_output
        if projected_total > self.max_cost:
            raise ValueError(
                f"Budget exceeded: projected ${projected_total:.3f} > limit ${self.max_cost:.2f} "
                f"(PKR {projected_total * self.pkr_rate:.0f} > PKR {self.max_cost * self.pkr_rate:.0f})"
            )
        if projected_tokens > self.max_tokens:
            raise ValueError(
                f"Token budget exceeded: projected {projected_tokens:,} > limit {self.max_tokens:,}"
            )

    def call(self, prompt: str, model: str = "claude-sonnet-4-6",
             max_output: int = 500):
        """Make a budget-checked API call and return the response text."""
        # Pre-call budget check: rough estimate of ~4 characters per input token,
        # and the worst case (max_output) for output tokens
        estimated_input = len(prompt) // 4
        self._check_budget(model, estimated_input, max_output)

        # Execute call
        response = self.client.messages.create(
            model=model,
            max_tokens=max_output,
            messages=[{"role": "user", "content": prompt}]
        )

        # Track actual usage as reported by the API, not the estimate
        actual_input = response.usage.input_tokens
        actual_output = response.usage.output_tokens
        rates = self.rates.get(model, (0.003, 0.015))
        call_cost = (actual_input / 1000 * rates[0]) + (actual_output / 1000 * rates[1])
        self.tokens_used["input"] += actual_input
        self.tokens_used["output"] += actual_output
        self.cost_usd += call_cost
        print(f"[Budget] Used: ${self.cost_usd:.4f}/{self.max_cost} | "
              f"PKR {self.cost_usd * self.pkr_rate:.1f}/{self.max_cost * self.pkr_rate:.0f}")
        return response.content[0].text
```
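BudgetedAgent bounds a single run, but the runaway scenario from Section 1 (hundreds of runs in an hour) also needs a cap across runs. One possible guard is a sliding-window run counter. The sketch below is an assumption layered on top of the lesson's code, not something the lesson itself defines: `RunLimiter`, `max_runs`, and `window_seconds` are hypothetical names, and the clock is injectable only so the behaviour can be checked without waiting.

```python
import time
from collections import deque
from typing import Callable, Deque


class RunLimiter:
    """Refuses to start more than max_runs agent runs within window_seconds."""

    def __init__(self, max_runs: int = 60, window_seconds: float = 3600.0,
                 clock: Callable[[], float] = time.monotonic):
        self.max_runs = max_runs
        self.window_seconds = window_seconds
        self.clock = clock
        self._starts: Deque[float] = deque()  # start times of recent runs

    def allow_run(self) -> bool:
        """Record and allow a run if the window has room; otherwise return False."""
        now = self.clock()
        # Drop start times that have aged out of the window
        while self._starts and now - self._starts[0] >= self.window_seconds:
            self._starts.popleft()
        if len(self._starts) >= self.max_runs:
            return False
        self._starts.append(now)
        return True


if __name__ == "__main__":
    limiter = RunLimiter(max_runs=3, window_seconds=3600.0)
    for attempt in range(5):
        if limiter.allow_run():
            print(f"run {attempt}: started")
        else:
            print(f"run {attempt}: blocked, hourly cap reached")
```

Calling `allow_run()` before each agent run and stopping (or alerting) when it returns False turns a silent loop bug into a bounded, visible failure. The cap should sit comfortably above normal traffic, for example a few times the expected hourly run count.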
Section 3: Prompt Caching for Agent Cost Reduction
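In agent systems, the same large system prompt is often sent with every API call. With prompt caching, the cached part of the prompt is billed at roughly 10% of the normal input rate on each cache hit, a 90% reduction for those tokens. Cache entries expire after a short period (five minutes by default, refreshed on each hit), so the benefit applies to calls made close together: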
```python
import os

import anthropic


def get_cached_system_call(system_prompt: str, user_message: str,
                           model: str = "claude-sonnet-4-6"):
    """
    Uses Anthropic's prompt caching to reduce the cost of repeated system prompts.
    The first call writes system_prompt to the cache; later calls that hit the
    cache are billed at about 10% of the normal input rate for those tokens.
    """
    client = anthropic.Anthropic(api_key=os.environ.get("ANTHROPIC_API_KEY"))
    response = client.messages.create(
        model=model,
        max_tokens=1000,
        system=[
            {
                "type": "text",
                "text": system_prompt,
                "cache_control": {"type": "ephemeral"}  # Cache this large system prompt
            }
        ],
        messages=[{"role": "user", "content": user_message}]
    )

    # Cache hit status (these usage fields can be None, so default to 0)
    cache_read = response.usage.cache_read_input_tokens or 0
    cache_created = response.usage.cache_creation_input_tokens or 0
    if cache_read > 0:
        # Savings estimate assumes the Sonnet input rate of $0.003 per 1K tokens
        print(f"Cache HIT: {cache_read} tokens read from cache "
              f"(saved ~{cache_read * 0.003 / 1000 * 0.9:.4f} USD)")
    elif cache_created > 0:
        print(f"Cache MISS (created): {cache_created} tokens cached for future calls")
    return response.content[0].text
```
Example savings: A system prompt of 10,000 tokens sent 100 times per day, assuming every call after the first hits the cache:
- Without caching: 1,000,000 tokens × $0.003/1K = $3.00/day (PKR 855/day)
- With caching: 10,000 (first call) + 99 × 10,000 × 0.10 (cached) = 109,000 token-equivalents = $0.33/day (PKR ~93/day)
- Savings: about 89%, or PKR ~762/day, roughly PKR 22,900/month
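The same comparison can be written as a small function, which makes it easy to try other prompt sizes and call volumes. This is a sketch under stated assumptions: `daily_prompt_cost_usd` is a hypothetical helper, every call after the first is treated as a cache hit, and the multipliers (1.25× for the cache-writing call, 0.10× for cache reads) should be checked against current Anthropic pricing. Because it bills the first call at the write surcharge, its result lands slightly above the rounded figure in the list above.

```python
def daily_prompt_cost_usd(prompt_tokens: int, calls_per_day: int,
                          rate_per_1k: float = 0.003,
                          cached: bool = True,
                          write_multiplier: float = 1.25,
                          read_multiplier: float = 0.10) -> float:
    """Daily USD cost of sending the same system prompt calls_per_day times.

    Assumes one cache write on the first call and a cache hit on every later
    call, i.e. calls arrive before the cache entry expires.
    """
    if calls_per_day <= 0:
        return 0.0
    if not cached:
        billed_tokens = prompt_tokens * calls_per_day
    else:
        first_call = prompt_tokens * write_multiplier
        later_calls = prompt_tokens * read_multiplier * (calls_per_day - 1)
        billed_tokens = first_call + later_calls
    return billed_tokens / 1000 * rate_per_1k


if __name__ == "__main__":
    pkr_rate = 285.0  # assumed exchange rate used throughout the lesson
    without = daily_prompt_cost_usd(10_000, 100, cached=False)
    with_cache = daily_prompt_cost_usd(10_000, 100, cached=True)
    saving = without - with_cache
    print(f"Without caching: ${without:.2f}/day")
    print(f"With caching:    ${with_cache:.4f}/day")
    print(f"Saving:          {saving / without:.1%} (PKR {saving * pkr_rate * 30:,.0f}/month)")
```

If the agent's calls are spread evenly through the day, many of them will arrive after the cache entry has expired and pay the write rate again, so real savings depend on how tightly the calls are batched.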
Practice Lab
Exercise 1: Cost Audit Your Existing Agent
Add the BudgetedAgent wrapper to any agent you have built. Run it 5 times and review the cost report. Are the costs what you expected? Which step is the most expensive?
Exercise 2: Identify Cache Opportunities
Review your agent's prompts. Which parts of the system prompt or context are identical across every call? Calculate the potential monthly savings from caching those sections.
Exercise 3: Daily Budget Alert
Write a simple script that reads your agent's log files, calculates total daily spend, and sends a Telegram alert if the daily spend exceeds your set threshold. This is your financial alarm system.
Key Takeaways
- A production agent making 50 runs/day at $0.076/run costs PKR 32,490/month — understand your cost model before deploying at scale
- Hard token and cost limits per run (BudgetedAgent) prevent runaway loops from causing financial disasters; this is essential production infrastructure
- Prompt caching reduces the cost of repeated large system prompts by about 90%: a 10,000-token system prompt used 100 times/day saves roughly PKR 23,000/month
- Always track actual API usage from response.usage; do not rely on estimates, as actual token counts often differ significantly from character-based estimates
- PKR cost visibility (not just USD) is important for Pakistani developers: plan your API budget against your revenue in the same currency
🇵🇰 Pakistan Case Study: The Agency That Made Agents Profitable
Hasan ran a Karachi-based AI automation agency. His first production agent — a competitor intelligence bot — was losing money. Monthly API cost: PKR 48,000. Monthly revenue from the service: PKR 35,000. Net loss: PKR 13,000/month.
He ran a cost audit and found 3 major leaks:
Leak 1: No token limits
The agent was generating 3,000-token reports when clients only needed 800-word summaries. Fix: max_tokens=600. Saved 65% of output tokens.
Leak 2: No model tiering
All tasks ran on Claude Sonnet, including simple "is this URL a competitor?" classification. Fix: Route classification to Claude Haiku and reserve Sonnet for strategic analysis. Result: 70% cost reduction on classification tasks.
Leak 3: No prompt caching
The 8,000-token system prompt was re-sent on every API call. Fix: Enable cache_control: ephemeral on system prompt. After first call: 90% reduction on cached tokens.
Cost after optimization:
| Optimization | Before | After | Monthly Saving |
|---|---|---|---|
| Token limits | PKR 19,200 | PKR 6,720 | PKR 12,480 |
| Model tiering | PKR 18,000 | PKR 5,400 | PKR 12,600 |
| Prompt caching | PKR 10,800 | PKR 1,080 | PKR 9,720 |
| Total | PKR 48,000 | PKR 13,200 | PKR 34,800 |
Monthly cost dropped from PKR 48,000 to PKR 13,200. Revenue: PKR 35,000. Net margin: PKR 21,800 (from -PKR 13,000 to +PKR 21,800).
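The fix for Leak 2, routing simple tasks to a cheaper model, can be as small as a lookup. Here is a minimal sketch, assuming the model IDs used earlier in this lesson. The task categories in `CHEAP_TASKS` and the `pick_model` helper are hypothetical and would need to match your own agent's steps.

```python
# Hypothetical task categories that are simple enough for the cheaper model.
CHEAP_TASKS = {"classification", "formatting", "qc_review"}

# Model IDs as used elsewhere in this lesson; swap in whatever your account offers.
MODEL_FOR_TIER = {
    "cheap": "claude-haiku-4-5-20251001",
    "strong": "claude-sonnet-4-6",
}


def pick_model(task_type: str) -> str:
    """Return the cheaper model for simple tasks and the stronger one otherwise."""
    tier = "cheap" if task_type in CHEAP_TASKS else "strong"
    return MODEL_FOR_TIER[tier]


if __name__ == "__main__":
    for task in ("classification", "strategic_analysis", "formatting"):
        print(f"{task:<20} -> {pick_model(task)}")
```

Unknown task types fall through to the stronger model here, which favours output quality over cost. If cost control matters more, invert the default and list the tasks that genuinely need the stronger model instead.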
📊 Monthly Cost Calculator for Pakistani Agents
```python
def calculate_monthly_cost_pkr(
    daily_runs: int,
    avg_input_tokens: int,
    avg_output_tokens: int,
    model: str = "claude-sonnet-4-6",
    pkr_rate: float = 285.0
) -> dict:
    """Calculate estimated monthly API cost in PKR for an agent.

    avg_input_tokens and avg_output_tokens are per-run averages.
    """
    # USD per 1M tokens (approximate 2026 rates)
    model_pricing = {
        "claude-haiku-4-5": {"input": 0.25, "output": 1.25},
        "claude-sonnet-4-6": {"input": 3.00, "output": 15.00},
        "claude-opus-4-6": {"input": 15.00, "output": 75.00},
        "gemini-2.5-flash": {"input": 0.075, "output": 0.30},
        "gemini-2.5-pro": {"input": 1.25, "output": 5.00},
    }
    # Unknown models fall back to Sonnet pricing
    pricing = model_pricing.get(model, model_pricing["claude-sonnet-4-6"])
    cost_per_run_usd = (
        (avg_input_tokens * pricing["input"] / 1_000_000) +
        (avg_output_tokens * pricing["output"] / 1_000_000)
    )
    monthly_runs = daily_runs * 30
    monthly_cost_usd = cost_per_run_usd * monthly_runs
    monthly_cost_pkr = monthly_cost_usd * pkr_rate
    return {
        "model": model,
        "daily_runs": daily_runs,
        "cost_per_run_usd": round(cost_per_run_usd, 4),
        "monthly_cost_usd": round(monthly_cost_usd, 2),
        "monthly_cost_pkr": round(monthly_cost_pkr, 0),
        "break_even_pkr": f"Charge clients at least PKR {monthly_cost_pkr * 3:,.0f}/mo for 3x margin"
    }


# Example: Intelligence agent running 50 times/day
result = calculate_monthly_cost_pkr(
    daily_runs=50,
    avg_input_tokens=2000,
    avg_output_tokens=800,
    model="claude-sonnet-4-6",
    pkr_rate=285
)
print(result)
# Output: {'model': 'claude-sonnet-4-6', 'daily_runs': 50, 'cost_per_run_usd': 0.018,
#          'monthly_cost_usd': 27.0, 'monthly_cost_pkr': 7695.0,
#          'break_even_pkr': 'Charge clients at least PKR 23,085/mo for 3x margin'}
```