Autonomous AI Agents, Module 5, Lesson 5.2

Cost Management — Token Budgets & Caching

Now let's talk about money, because an agent that runs unchecked until it eats your entire budget is not an agent, it is a financial disaster. Agent systems have a cost structure that differs from single API calls: they make multiple calls per task, may run hundreds of tasks per day, and can spiral out of control in runaway loops. Professional agent developers treat cost management as a first-class concern, not an afterthought. This lesson covers the core tools for keeping agent costs predictable, bounded, and optimized, with figures in both USD and PKR.

Section 1: Understanding Where Agent Costs Come From

In a multi-step agent, every step costs tokens. Here is a realistic cost breakdown for a Karachi outreach agent run:

code
AGENT RUN: Karachi Restaurant Outreach (10 businesses)
──────────────────────────────────────────────────────────
Step 1: Trend analysis (Gemini Flash)
  Input: 500 tokens (prompt) × 10 runs = 5,000
  Output: 200 tokens × 10 = 2,000
  Cost: $0.00098

Step 2: Personalized pitch generation (Claude Sonnet)
  Input: 800 tokens (context + business data) × 10 = 8,000
  Output: 300 tokens (pitch) × 10 = 3,000
  Cost: $0.069

Step 3: QC review (Claude Haiku)
  Input: 600 tokens (pitch + criteria) × 10 = 6,000
  Output: 100 tokens (verdict) × 10 = 1,000
  Cost: $0.0028

Step 4: Email formatting (Claude Haiku)
  Input: 500 tokens × 10 = 5,000
  Output: 200 tokens × 10 = 2,000
  Cost: $0.0038

TOTAL: $0.076 (PKR ~21.7)
At 50 runs/day: $3.80/day (PKR ~1,083/day)
At 50 runs/day for 30 days: $114/month (PKR ~32,490/month)

Without cost controls, a bug in the loop logic could trigger 500 runs in an hour. At $0.076 per run that is about $38, or more than PKR 10,000 in API costs, before you notice.
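
The per-run total above is just token counts multiplied by per-token rates. Below is a minimal sketch that recomputes it, assuming the per-million-token rates listed in this lesson's monthly cost calculator. `RATES_PER_MTOK`, `STEPS`, `step_cost`, and `run_cost` are illustrative names, and the rates are assumptions that should be checked against each provider's current pricing. Individual step figures may differ slightly from the breakdown depending on the rates assumed, but the total still rounds to $0.076.

```python
# Illustrative per-million-token rates in USD (input, output).
# Assumption: taken from this lesson's calculator table, not from live price lists.
RATES_PER_MTOK = {
    "gemini-flash": (0.075, 0.30),
    "claude-sonnet": (3.00, 15.00),
    "claude-haiku": (0.25, 1.25),
}

# (step name, model, input tokens per business, output tokens per business)
STEPS = [
    ("Trend analysis", "gemini-flash", 500, 200),
    ("Pitch generation", "claude-sonnet", 800, 300),
    ("QC review", "claude-haiku", 600, 100),
    ("Email formatting", "claude-haiku", 500, 200),
]


def step_cost(model, input_tokens, output_tokens):
    """USD cost of one step given its total input and output tokens."""
    in_rate, out_rate = RATES_PER_MTOK[model]
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000


def run_cost(businesses=10):
    """USD cost of one full agent run across all steps."""
    return sum(
        step_cost(model, tokens_in * businesses, tokens_out * businesses)
        for _, model, tokens_in, tokens_out in STEPS
    )


pkr_rate = 285.0
cost = run_cost()
print(f"Per run: ${cost:.4f} (PKR {cost * pkr_rate:.1f})")
print(f"500 runaway runs: ${cost * 500:.2f} (PKR {cost * 500 * pkr_rate:,.0f})")
```

Almost all of the spend comes from the Sonnet step, so that is the first place to look when trimming tokens.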

Section 2: Token Budget Enforcement

python
import anthropic
import os

class BudgetedAgent:
    """Agent with hard token and cost limits per run"""

    def __init__(self, max_tokens_per_run: int = 50000,
                 max_cost_usd_per_run: float = 0.50,
                 pkr_rate: float = 285.0):
        self.client = anthropic.Anthropic(api_key=os.environ.get("ANTHROPIC_API_KEY"))
        self.max_tokens = max_tokens_per_run
        self.max_cost = max_cost_usd_per_run
        self.pkr_rate = pkr_rate

        self.tokens_used = {"input": 0, "output": 0}
        self.cost_usd = 0.0

        # Cost per 1K tokens (model: (input_rate, output_rate))
        self.rates = {
            "claude-haiku-4-5-20251001": (0.00025, 0.00125),
            "claude-sonnet-4-6": (0.003, 0.015),
        }

    def _check_budget(self, model: str, estimated_input: int, estimated_output: int):
        """Check if proceeding would exceed budget"""
        rates = self.rates.get(model, (0.003, 0.015))
        estimated_cost = (estimated_input / 1000 * rates[0]) + \
                        (estimated_output / 1000 * rates[1])

        projected_total = self.cost_usd + estimated_cost
        projected_tokens = self.tokens_used["input"] + self.tokens_used["output"] + \
                          estimated_input + estimated_output

        if projected_total > self.max_cost:
            raise ValueError(
                f"Budget exceeded: projected ${projected_total:.3f} > limit ${self.max_cost:.2f} "
                f"(PKR {projected_total * self.pkr_rate:.0f} > PKR {self.max_cost * self.pkr_rate:.0f})"
            )

        if projected_tokens > self.max_tokens:
            raise ValueError(
                f"Token budget exceeded: projected {projected_tokens:,} > limit {self.max_tokens:,}"
            )

    def call(self, prompt: str, model: str = "claude-sonnet-4-6",
             max_output: int = 500):
        """Make a budget-checked API call"""

        # Pre-call budget check (estimate input tokens)
        estimated_input = len(prompt) // 4
        self._check_budget(model, estimated_input, max_output)

        # Execute call
        response = self.client.messages.create(
            model=model,
            max_tokens=max_output,
            messages=[{"role": "user", "content": prompt}]
        )

        # Track actual usage
        actual_input = response.usage.input_tokens
        actual_output = response.usage.output_tokens
        rates = self.rates.get(model, (0.003, 0.015))
        call_cost = (actual_input / 1000 * rates[0]) + (actual_output / 1000 * rates[1])

        self.tokens_used["input"] += actual_input
        self.tokens_used["output"] += actual_output
        self.cost_usd += call_cost

        print(f"[Budget] Used: ${self.cost_usd:.4f}/{self.max_cost} | "
              f"PKR {self.cost_usd * self.pkr_rate:.1f}/{self.max_cost * self.pkr_rate:.0f}")

        return response.content[0].text
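
BudgetedAgent bounds the spend of a single run, but not how many runs get started. Below is a minimal sketch of a sliding-window run limiter that would stop a buggy loop from launching hundreds of runs in an hour. The `RunLimiter` class, the `RunLimitExceeded` exception, and the default limits are hypothetical and not part of the lesson's code.

```python
import time
from collections import deque


class RunLimitExceeded(RuntimeError):
    """Raised when too many agent runs start inside the time window."""


class RunLimiter:
    """Allow at most max_runs run starts per sliding window (hypothetical helper)."""

    def __init__(self, max_runs=60, window_seconds=3600.0, clock=time.monotonic):
        self.max_runs = max_runs
        self.window = window_seconds
        self.clock = clock        # injectable so the limiter can be tested without sleeping
        self._starts = deque()    # start times of recent runs, oldest first

    def start_run(self):
        now = self.clock()
        # Forget run starts that have aged out of the window
        while self._starts and now - self._starts[0] >= self.window:
            self._starts.popleft()
        if len(self._starts) >= self.max_runs:
            raise RunLimitExceeded(
                f"{len(self._starts)} runs in the last {self.window:.0f}s "
                f"(limit {self.max_runs})"
            )
        self._starts.append(now)
```

Calling `limiter.start_run()` before each agent run turns a runaway loop into an exception you can log and alert on, rather than an invoice.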

Section 3: Prompt Caching for Agent Cost Reduction

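In agent systems, the same large system prompt is often sent with every API call. Prompt caching cuts the input cost of that repeated prefix by about 90% on cache hits: cached tokens are billed at roughly 10% of the normal input rate, while the call that first writes the cache pays a small premium. Cache entries expire after a short idle period (five minutes by default), so the savings depend on calls arriving close together: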

python
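import os

import anthropic

def get_cached_system_call(system_prompt: str, user_message: str,
                           model: str = "claude-sonnet-4-6"):
    """
    Uses Anthropic's prompt caching to reduce the cost of repeated system prompts.
    The first call writes system_prompt to the cache; later calls that arrive while
    the cache entry is still alive are billed about 10% of the normal input rate
    for the cached tokens. Uncached input and all output are billed as usual.
    """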
    client = anthropic.Anthropic(api_key=os.environ.get("ANTHROPIC_API_KEY"))

    response = client.messages.create(
        model=model,
        max_tokens=1000,
        system=[
            {
                "type": "text",
                "text": system_prompt,
                "cache_control": {"type": "ephemeral"}  # Cache this large system prompt
            }
        ],
        messages=[{"role": "user", "content": user_message}]
    )

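    # Cache hit status (these usage fields can be None, so default them to 0)
    cache_read = response.usage.cache_read_input_tokens or 0
    cache_created = response.usage.cache_creation_input_tokens or 0
    if cache_read > 0:
        # Savings estimate assumes Sonnet's $0.003/1K input rate and a 90% discount on cache reads
        saved_usd = cache_read * 0.003 / 1000 * 0.9
        print(f"Cache HIT: read {cache_read} cached tokens (saved about {saved_usd:.4f} USD)")
    elif cache_created > 0:
        print(f"Cache MISS (created): {cache_created} tokens cached for future calls")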

    return response.content[0].text

Example savings: A system prompt of 10,000 tokens sent 100 times per day:

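  • Without caching: 100 × 10,000 = 1,000,000 tokens × $0.003/1K = $3.00/day (PKR 855/day)
  • With caching: 10,000 (first call, full price) + 99 × 10,000 × 0.10 (cache reads) = 109,000 token-equivalents ≈ $0.33/day (PKR ~93/day), assuming every later call arrives while the cache is still warm
  • Savings: roughly 89%, about PKR 760/day or PKR 23,000/month (slightly less once the cache-write premium on the first call is counted)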
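
The arithmetic above can be sketched as a small function. This is an estimate under stated assumptions, not a billing formula: `daily_prompt_cost_usd` is a hypothetical helper. It assumes cache reads cost 10% of the normal input rate, the first call pays a 1.25× cache-write premium, and every later call is a cache hit. Because it counts the write premium, it lands slightly above the bullet figures.

```python
def daily_prompt_cost_usd(calls, prompt_tokens, rate_per_1k=0.003,
                          cached=False, read_multiplier=0.10,
                          write_multiplier=1.25):
    """Estimated daily cost of sending one system prompt `calls` times.

    Assumptions: read_multiplier and write_multiplier reflect the cache
    pricing described in the lead-in; all calls after the first hit the cache.
    """
    full_price = prompt_tokens / 1000 * rate_per_1k
    if not cached or calls == 0:
        return calls * full_price
    # First call writes the cache, the remaining calls read from it
    return full_price * write_multiplier + (calls - 1) * full_price * read_multiplier


pkr_rate = 285.0
uncached = daily_prompt_cost_usd(100, 10_000)
cached = daily_prompt_cost_usd(100, 10_000, cached=True)
saving = uncached - cached
print(f"Uncached: ${uncached:.2f}/day, cached: ${cached:.4f}/day")
print(f"Saving: {saving / uncached:.1%}, PKR {saving * pkr_rate * 30:,.0f}/month")
```

If calls are spread thinly across the day, many of them will miss the cache, so treat the result as a best case.
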

Practice Lab

Exercise 1: Cost Audit Your Existing Agent
Add the BudgetedAgent wrapper to any agent you have built. Run it 5 times and review the cost report. Are the costs what you expected? Which step is the most expensive?

Exercise 2: Identify Cache Opportunities
Review your agent's prompts. Which parts of the system prompt or context are identical across every call? Calculate the potential monthly savings from caching those sections.

Exercise 3: Daily Budget Alert
Write a simple script that reads your agent's log files, calculates total daily spend, and sends a Telegram alert if the daily spend exceeds your set threshold. This is your financial alarm system.

Key Takeaways

  • A production agent making 50 runs/day at $0.076/run costs PKR 32,490/month — understand your cost model before deploying at scale
  • Hard token and cost limits per run (BudgetedAgent) prevent runaway loops from causing financial disasters — this is essential production infrastructure
  • Prompt caching cuts the cost of repeated large system prompts by roughly 90% on cache hits: a 10,000-token system prompt used 100 times/day saves about PKR 23,000/month
  • Always track actual API usage from response.usage — do not rely on estimates, as actual token counts often differ significantly from character-based estimates
  • PKR cost visibility (not just USD) is important for Pakistani developers — your API budget should be planned against your revenue in the same currency

🇵🇰 Pakistan Case Study: The Agency That Made Agents Profitable

Hasan ran a Karachi-based AI automation agency. His first production agent — a competitor intelligence bot — was losing money. Monthly API cost: PKR 48,000. Monthly revenue from the service: PKR 35,000. Net loss: PKR 13,000/month.

He ran a cost audit and found 3 major leaks:

Leak 1: No token limits
The agent was generating 3,000-token reports when clients only needed 800-word summaries. Fix: max_tokens=600. Saved 65% of output tokens.

Leak 2: No model tiering
All tasks ran on Claude Sonnet, including simple "is this URL a competitor?" classification. Fix: route classification to Claude Haiku and save Sonnet for strategic analysis. Result: 70% cost reduction on classification tasks.

Leak 3: No prompt caching
The 8,000-token system prompt was re-sent on every API call. Fix: enable cache_control: {"type": "ephemeral"} on the system prompt. After the first call: 90% reduction on cached tokens.
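
The model-tiering fix in Leak 2 can be sketched as a lookup from task type to model. The task labels and the `pick_model` helper are hypothetical, and the model IDs are the ones used in this lesson's BudgetedAgent example.

```python
# Hypothetical task labels mapped to the cheapest model assumed to handle them well
MODEL_BY_TASK = {
    "classification": "claude-haiku-4-5-20251001",
    "formatting": "claude-haiku-4-5-20251001",
    "strategic_analysis": "claude-sonnet-4-6",
}

# Assumption: unknown task types default to the cheaper model, so a typo in a
# task label cannot silently route traffic to the expensive tier.
DEFAULT_MODEL = "claude-haiku-4-5-20251001"


def pick_model(task_type: str) -> str:
    """Return the model ID to use for a given task type."""
    return MODEL_BY_TASK.get(task_type, DEFAULT_MODEL)


print(pick_model("classification"))
print(pick_model("strategic_analysis"))
```

With the BudgetedAgent class from Section 2, this would be used as `agent.call(prompt, model=pick_model("classification"))`. Defaulting to the cheap tier trades some quality risk for cost safety. Whether that is the right default depends on the workload.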

Cost after optimization:

| Optimization | Before | After | Monthly Saving |
|---|---|---|---|
| Token limits | PKR 19,200 | PKR 6,720 | PKR 12,480 |
| Model tiering | PKR 18,000 | PKR 5,400 | PKR 12,600 |
| Prompt caching | PKR 10,800 | PKR 1,080 | PKR 9,720 |
| Total | PKR 48,000 | PKR 13,200 | PKR 34,800 |

Monthly cost dropped from PKR 48,000 to PKR 13,200. Against revenue of PKR 35,000, the service went from a PKR 13,000 monthly loss to a PKR 21,800 monthly profit.
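
The case study numbers can be checked by applying each stated reduction (65%, 70%, 90%) to its "before" cost. The variable names below are illustrative. The figures are the ones quoted in the case study.

```python
# Monthly PKR figures and percentage reductions as quoted in the case study
before = {"Token limits": 19_200, "Model tiering": 18_000, "Prompt caching": 10_800}
reduction_pct = {"Token limits": 65, "Model tiering": 70, "Prompt caching": 90}
revenue = 35_000

# Integer arithmetic keeps the PKR amounts exact
after = {name: cost * (100 - reduction_pct[name]) // 100 for name, cost in before.items()}
total_before = sum(before.values())
total_after = sum(after.values())

for name in before:
    saved = before[name] - after[name]
    print(f"{name:15} {before[name]:>7,} -> {after[name]:>6,} (saves {saved:,})")
print(f"Total cost: {total_before:,} -> {total_after:,}")
print(f"Profit:     {revenue - total_before:,} -> {revenue - total_after:,}")
```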

📊 Monthly Cost Calculator for Pakistani Agents

python
def calculate_monthly_cost_pkr(
    daily_runs: int,
    avg_input_tokens: int,
    avg_output_tokens: int,
    model: str = "claude-sonnet-4-6",
    pkr_rate: float = 285.0
) -> dict:
    """Calculate estimated monthly API cost in PKR for an agent."""

    # USD per 1M tokens (approximate 2026 rates)
    model_pricing = {
        "claude-haiku-4-5":  {"input": 0.25,  "output": 1.25},
        "claude-sonnet-4-6": {"input": 3.00,  "output": 15.00},
        "claude-opus-4-6":   {"input": 15.00, "output": 75.00},
        "gemini-2.5-flash":  {"input": 0.075, "output": 0.30},
        "gemini-2.5-pro":    {"input": 1.25,  "output": 5.00},
    }

    pricing = model_pricing.get(model, model_pricing["claude-sonnet-4-6"])

    cost_per_run_usd = (
        (avg_input_tokens * pricing["input"] / 1_000_000) +
        (avg_output_tokens * pricing["output"] / 1_000_000)
    )

    monthly_runs = daily_runs * 30
    monthly_cost_usd = cost_per_run_usd * monthly_runs
    monthly_cost_pkr = monthly_cost_usd * pkr_rate

    return {
        "model": model,
        "daily_runs": daily_runs,
        "cost_per_run_usd": round(cost_per_run_usd, 4),
        "monthly_cost_usd": round(monthly_cost_usd, 2),
        "monthly_cost_pkr": round(monthly_cost_pkr, 0),
        "break_even_pkr": f"Charge clients at least PKR {monthly_cost_pkr * 3:,.0f}/mo for 3x margin"
    }

# Example: Intelligence agent running 50 times/day
result = calculate_monthly_cost_pkr(
    daily_runs=50,
    avg_input_tokens=2000,
    avg_output_tokens=800,
    model="claude-sonnet-4-6",
    pkr_rate=285
)
print(result)
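# Output (one Sonnet call per run; a multi-step run costs more):
# {'model': 'claude-sonnet-4-6', 'daily_runs': 50, 'cost_per_run_usd': 0.018,
#  'monthly_cost_usd': 27.0, 'monthly_cost_pkr': 7695.0,
#  'break_even_pkr': 'Charge clients at least PKR 23,085/mo for 3x margin'}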
