AI Prediction Markets · Module 2

2.2 News Sentiment Analysis with AI — Reuters, AP, Twitter

35 min · 11 code blocks · Practice Lab · Quiz (4Q)


Market prices react to information. Your bot's edge is processing information faster and more accurately than competing traders. The news sentiment pipeline is the "perception layer" of your Oracle — it reads the world and converts raw text into quantified probability signals. In this lesson, we build a multi-source sentiment engine using Reuters, AP, Twitter/X, Pakistani news, and AI models to generate trading signals.

The Sentiment Pipeline Architecture

code
RAW NEWS (4 sources)
    │
    ▼
┌─────────────────────────┐
│  INGEST                  │  ← aiohttp + feedparser
│  80-120 articles/hour    │    (parallel fetching)
└──────────┬──────────────┘
           │
           ▼
┌─────────────────────────┐
│  DEDUPLICATE             │  ← difflib.SequenceMatcher
│  Remove cross-source     │    (similarity > 0.8 = duplicate)
│  duplicates              │
│  80 articles → ~45       │
└──────────┬──────────────┘
           │
           ▼
┌─────────────────────────┐
│  STAGE 1: FILTER         │  ← Gemini Flash (cheap)
│  "Is this relevant to    │    Cost: ~$0.0001 per article
│   any tracked market?"   │
│  45 articles → ~5-8      │
└──────────┬──────────────┘
           │
           ▼
┌─────────────────────────┐
│  STAGE 2: SCORE          │  ← Claude Sonnet (deep reasoning)
│  "How does this change   │    Cost: ~$0.005 per article
│   the implied prob?"     │
│  Output: trading signal  │
└──────────┬──────────────┘
           │
           ▼
┌─────────────────────────┐
│  SIGNAL OUTPUT           │  → Feeds into execution engine
│  market_id, new_prob,    │    (Module 4)
│  confidence, urgency     │
└─────────────────────────┘

COST AT SCALE:
Without 2-stage: 45 articles × $0.005 = $0.225/hour = $5.40/day
With 2-stage: 45 × $0.0001 + 6 × $0.005 = $0.0345/hour = $0.83/day
SAVINGS: 85% ← This is why the pipeline architecture matters

The separation of stages is critical for cost management. Running every headline through an expensive reasoning model costs $5-6/day at this lesson's volume and can reach $50-200/day at scale; routing headlines through a cheap filter first cuts the expensive calls by 85-90%.
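
Stitched together, one cycle of the pipeline looks roughly like the sketch below. This is a minimal sketch, not a finished implementation: it assumes the NewsIngester class and deduplicate() function built later in this lesson, and treats the two model calls (stage1_filter, stage2_score) as injected callables, sketched in the filtering section further down.

python
async def run_pipeline(ingester, markets, stage1_filter, stage2_score):
    """One pipeline cycle: ingest → dedup → cheap filter → expensive scoring.

    `markets` is a list of dicts holding each tracked market's question,
    current price, and resolution criteria. `stage1_filter` and `stage2_score`
    wrap the Gemini Flash and Claude Sonnet calls sketched later in this lesson.
    """
    articles = await ingester.fetch_all()            # INGEST (parallel RSS fetch)
    articles = deduplicate(articles)                 # DEDUPLICATE (~80 → ~45)
    questions = [m["question"] for m in markets]
    relevant = stage1_filter(articles, questions)    # STAGE 1: relevance filter (cheap)
    # Simplification: score every surviving article against every tracked market;
    # a production bot would first map each article to its matching market.
    signals = [stage2_score(a, m) for a in relevant for m in markets]
    return signals                                   # SIGNAL OUTPUT → execution engine (Module 4)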

Building the Multi-Source Ingester

Source Priority for Pakistani Markets

code
SOURCE PRIORITY TABLE:

| Source | Speed | Credibility | Best For | RSS Available |
|--------|-------|------------|----------|--------------|
| Twitter/X | Real-time | Variable | Breaking news, leaks | API/Nitter |
| Dawn | 1-2 hours | High | Pak politics, economy | Yes |
| Business Recorder | 1-3 hours | High | SBP, finance, trade | Yes |
| Geo News | 30 min | Medium-High | Breaking Pakistan news | Yes |
| Reuters | 2-4 hours | Highest | Global events | Yes |
| AP News | 2-4 hours | Highest | US politics, global | Yes |

KEY INSIGHT:
Pakistani sources (Dawn, Geo, BR) provide 2-6 HOUR information lead
over Western wire services for South Asian events.

If you're trading a Pakistan-related market and you only read Reuters,
you're the LAST person to know. Pakistani news sources are your edge.

The Async Ingester Code

python
import asyncio
import aiohttp
import feedparser
from datetime import datetime, timezone

class NewsIngester:
    """Fetches headlines from multiple RSS sources concurrently."""

    SOURCES = {
        "reuters": "https://feeds.reuters.com/reuters/worldNews",
        "ap": "https://rsshub.app/apnews/topics/world-news",
        "dawn": "https://www.dawn.com/feeds/home",
        "business_recorder": "https://www.brecorder.com/feeds",
        "geo": "https://www.geo.tv/rss/1/0",
    }

    async def fetch_feed(self, session, name, url):
        """Fetch and parse a single RSS feed."""
        try:
            async with session.get(url, timeout=10) as response:
                text = await response.text()
                feed = feedparser.parse(text)
                articles = []
                for entry in feed.entries[:20]:  # Last 20 per source
                    articles.append({
                        "source": name,
                        "title": entry.get("title", ""),
                        "summary": entry.get("summary", "")[:500],
                        "url": entry.get("link", ""),
                        "published": entry.get("published", ""),
                        "fetched_at": datetime.utcnow().isoformat()
                    })
                return articles
        except Exception as e:
            print(f"[WARN] {name} fetch failed: {e}")
            return []

    async def fetch_all(self):
        """Fetch all sources in parallel — 2-4 sec vs 10-15 sec sequential."""
        async with aiohttp.ClientSession() as session:
            tasks = [
                self.fetch_feed(session, name, url)
                for name, url in self.SOURCES.items()
            ]
            results = await asyncio.gather(*tasks)
            # Flatten list of lists
            all_articles = [a for batch in results for a in batch]
            print(f"[INGEST] Fetched {len(all_articles)} articles from {len(self.SOURCES)} sources")
            return all_articles

Run all fetches in parallel with asyncio.gather() — fetching 5 sources sequentially takes 10-15 seconds; in parallel it takes 2-4 seconds.
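
A quick smoke test, assuming the NewsIngester class above is in the same file, that also measures the parallel fetch time:

python
import asyncio
import time

if __name__ == "__main__":
    ingester = NewsIngester()
    start = time.perf_counter()
    articles = asyncio.run(ingester.fetch_all())   # runs all 5 feeds concurrently
    elapsed = time.perf_counter() - start
    print(f"Fetched {len(articles)} articles in {elapsed:.1f}s")
    for article in articles[:3]:                   # spot-check the first few results
        print(f"[{article['source']}] {article['title']}")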

Deduplication

The same news story appears across multiple sources. Without deduplication, you process (and pay for) the same story 3-4 times.

python
from difflib import SequenceMatcher

# Credibility ranking taken from the source priority table above (lower = more credible).
SOURCE_RANK = {"reuters": 0, "ap": 0, "dawn": 1, "business_recorder": 1, "geo": 2, "twitter": 3}

def deduplicate(articles, threshold=0.8):
    """Remove near-duplicate articles based on title similarity.

    Articles are processed in credibility order, so the survivor of each
    duplicate cluster is the version from the higher-credibility source.
    """
    ranked = sorted(articles, key=lambda a: SOURCE_RANK.get(a["source"], 99))
    unique = []
    for article in ranked:
        is_dupe = False
        for existing in unique:
            similarity = SequenceMatcher(
                None,
                article["title"].lower(),
                existing["title"].lower()
            ).ratio()
            if similarity > threshold:
                is_dupe = True  # an equal-or-higher credibility copy is already kept
                break
        if not is_dupe:
            unique.append(article)

    print(f"[DEDUP] {len(articles)} → {len(unique)} unique articles")
    return unique

For a pipeline ingesting 80 articles/hour from 5 sources, deduplication typically reduces the batch to 40-55 unique stories, roughly halving your AI filtering costs.

Two-Stage AI Filtering

Stage 1 — Relevance Filter (Gemini Flash)

python
STAGE_1_PROMPT = """You are a news relevance filter for a prediction market trading bot.

Active markets being tracked:
{market_list}

For each headline below, respond with ONLY the relevant headlines
(ones that could affect the probability of any tracked market).
Return one headline per line. If none are relevant, return "NONE".

Headlines:
{headlines}"""

# Example market_list:
# - "Will SBP cut rates below 15% by Dec 2026?"
# - "Will India-Pakistan bilateral trade resume by Q3 2026?"
# - "Will PIA complete privatization by June 2026?"

Cost: ~$0.000075 per 1,000 input tokens (Gemini Flash), roughly $0.0001 per article once the market list is included in the prompt, so an hourly batch of 45 headlines costs well under a cent. This stage eliminates 80-90% of the noise.
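
As a concrete sketch of Stage 1, the call below assumes the google-generativeai SDK, a GEMINI_API_KEY environment variable, and the gemini-1.5-flash model name; swap in whichever cheap model you actually use. Matching returned headlines back to articles by exact string comparison is a simplification.

python
import os
import google.generativeai as genai

genai.configure(api_key=os.environ["GEMINI_API_KEY"])
flash = genai.GenerativeModel("gemini-1.5-flash")  # cheap model for the relevance filter

def stage1_filter(articles, market_list):
    """Return only the articles whose headlines the filter judges relevant."""
    prompt = STAGE_1_PROMPT.format(
        market_list="\n".join(f"- {m}" for m in market_list),
        headlines="\n".join(a["title"] for a in articles),
    )
    response = flash.generate_content(prompt)
    # Naive exact-match mapping of returned headlines back to articles;
    # fragile if the model paraphrases, but fine for a first pass.
    kept = {line.strip() for line in response.text.splitlines()
            if line.strip() and line.strip() != "NONE"}
    return [a for a in articles if a["title"].strip() in kept]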

Stage 2 — Probability Scoring (Claude Sonnet)

python
STAGE_2_PROMPT = """You are an expert prediction market analyst.

Market question: "{market_question}"
Current market price (implied probability): {current_price}
Resolution criteria: "{resolution_criteria}"

Breaking news headline: "{headline}"
Summary: "{summary}"
Source: {source}
Published: {published}

Based on this news, analyze:
1. How does this headline change the implied probability?
2. What is your confidence in this assessment?
3. How urgent is this signal (should the bot act now or wait)?

Return JSON only:
{{
    "new_probability": 0.XX,
    "confidence": 0.XX,
    "urgency": "high" | "medium" | "low",
    "reasoning": "One sentence explanation",
    "direction": "bullish" | "bearish" | "neutral"
}}"""

Cost: ~$0.003 per 1,000 tokens. Only 5-8 articles reach this stage per hour, keeping daily costs under $1.
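
A matching sketch of Stage 2, assuming the anthropic Python SDK and an ANTHROPIC_API_KEY environment variable; the model name is a placeholder for whichever Sonnet version you run, and the market dict keys (question, current_price, resolution_criteria) are names chosen here for illustration.

python
import os
import json
import anthropic

client = anthropic.Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])

def stage2_score(article, market, model="claude-sonnet-4-20250514"):
    """Ask the reasoning model how the headline shifts the market's probability."""
    prompt = STAGE_2_PROMPT.format(
        market_question=market["question"],
        current_price=market["current_price"],
        resolution_criteria=market["resolution_criteria"],
        headline=article["title"],
        summary=article["summary"],
        source=article["source"],
        published=article["published"],
    )
    message = client.messages.create(
        model=model,
        max_tokens=300,
        messages=[{"role": "user", "content": prompt}],
    )
    return json.loads(message.content[0].text)  # the prompt asks for JSON only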

Cost Comparison at Scale

code
DAILY AI COSTS:

                    WITHOUT 2-STAGE     WITH 2-STAGE
Articles processed   1,080/day            1,080/day
Stage 1 (Flash)     —                    $0.10/day
Stage 2 (Sonnet)    $5.40/day            $0.72/day (only 144 articles)
─────────────────────────────────────────────────────
TOTAL               $5.40/day            $0.82/day
MONTHLY             $162/month           $24.60/month
SAVINGS             —                    85%

At 1 PKR = $0.0036:
Without: PKR 45,000/month → TOO EXPENSIVE for most traders
With:    PKR 6,800/month  → AFFORDABLE even for students
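
The same arithmetic in a few lines of Python, using the per-article costs assumed throughout this lesson, is a quick way to sanity-check the table:

python
ARTICLES_PER_DAY = 45 * 24      # ~1,080 unique articles/day after dedup
FLASH_COST = 0.0001             # Stage 1 cost per article (assumed)
SONNET_COST = 0.005             # Stage 2 cost per article (assumed)
PASS_RATE = 144 / 1080          # ~13% of articles survive Stage 1

without_two_stage = ARTICLES_PER_DAY * SONNET_COST
with_two_stage = ARTICLES_PER_DAY * FLASH_COST + ARTICLES_PER_DAY * PASS_RATE * SONNET_COST

print(f"Without 2-stage: ${without_two_stage:.2f}/day")                  # ≈ $5.40/day
print(f"With 2-stage:    ${with_two_stage:.2f}/day")                     # ≈ $0.83/day
print(f"Savings:         {1 - with_two_stage / without_two_stage:.0%}")  # ≈ 85%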

Twitter/X Integration

Real-time Twitter monitoring catches breaking news 15-30 minutes before RSS feeds. For Pakistani market events, key accounts to follow:

code
MUST-FOLLOW TWITTER ACCOUNTS (by category):

ECONOMY/FINANCE:
├── @StateBank_Pak — SBP rate announcements (CRITICAL for rate markets)
├── @FinaborPk — Pakistan finance news
├── @baborjakhar — Business journalist, breaks SBP news
└── @BusinessRecrdr — Business Recorder breaking news

POLITICS:
├── @PakPMO — Prime Minister's Office
├── @ForeignOfficePk — Foreign ministry (bilateral trade, BRICS)
├── @NAaborPakistan — National Assembly proceedings
└── @dawn_com — Dawn breaking news

SPORTS:
├── @GeoSuper — Cricket, PSL (for cricket market trades)
├── @TheRealPCB — Pakistan Cricket Board
└── @ESPNcricinfo — International cricket updates

INTERNATIONAL:
├── @IMFNews — IMF program updates (critical for PKR markets)
├── @Reuters — Global breaking news
└── @AP — Associated Press

Twitter API vs. Free Alternatives

code
OPTION 1: Twitter API v2 ($100/month)
├── Real-time filtered stream
├── Push-based (instant delivery)
├── Best for serious trading bots
└── Rate limit: 500,000 tweets/month

OPTION 2: Nitter RSS Bridge (FREE)
├── nitter.net/[username]/rss
├── 5-10 minute delay vs real-time
├── No API key needed
├── Fragile (Nitter may go down)
└── Good enough for learning/testing

OPTION 3: RSS.app or RSSHub ($0-5/month)
├── Converts Twitter profiles to RSS
├── 1-5 minute delay
├── More reliable than Nitter
└── Recommended starting point
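
Because Options 2 and 3 expose Twitter accounts as plain RSS, they drop straight into the NewsIngester from earlier. A minimal sketch, assuming the nitter.net/[username]/rss URL pattern above and that the instance you point at is reachable:

python
# Twitter accounts exposed as RSS via a Nitter-style bridge. Swap nitter.net
# for whichever instance or RSS bridge you actually use.
TWITTER_FEEDS = {
    "twitter_sbp": "https://nitter.net/StateBank_Pak/rss",
    "twitter_pmo": "https://nitter.net/PakPMO/rss",
    "twitter_imf": "https://nitter.net/IMFNews/rss",
}

ingester = NewsIngester()
ingester.SOURCES = {**NewsIngester.SOURCES, **TWITTER_FEEDS}  # per-instance override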

Structuring the Signal Output

The final output of your sentiment pipeline is a list of trading signals that feed directly into your execution engine (Module 4):

json
{
    "market_id": "will-sbp-cut-rates-q2-2026",
    "headline": "State Bank signals dovish shift in latest quarterly review",
    "current_price": 0.42,
    "new_probability": 0.65,
    "confidence": 0.78,
    "urgency": "high",
    "direction": "bullish",
    "source": "dawn",
    "timestamp": "2026-03-26T09:15:00Z",
    "reasoning": "Quarterly review language shifted from 'maintaining' to 'reviewing appropriate levels' - historically precedes rate cuts"
}
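
Inside the bot it helps to carry this payload as a typed object rather than a loose dict. A minimal sketch using a dataclass whose fields mirror the JSON above:

python
from dataclasses import dataclass

@dataclass
class TradingSignal:
    """One actionable signal from the sentiment pipeline (fields mirror the JSON above)."""
    market_id: str
    headline: str
    current_price: float      # market's implied probability before the news
    new_probability: float    # AI's post-news probability estimate
    confidence: float         # 0-1, how sure the model is
    urgency: str              # "high" | "medium" | "low"
    direction: str            # "bullish" | "bearish" | "neutral"
    source: str
    timestamp: str
    reasoning: str

    @property
    def edge(self) -> float:
        """Gap between the AI's estimate and the current market price."""
        return self.new_probability - self.current_price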

Signal-to-Action Rules

code
TRADING RULES (configure in your bot):

IF new_probability − current_price > 0.15
   AND confidence > 0.70
   AND urgency == "high"
   → EXECUTE BUY (limit order at current_price + 0.02)

IF new_probability − current_price > 0.10
   AND confidence > 0.60
   AND urgency == "medium"
   → QUEUE for next batch execution (every 15 min)

IF confidence < 0.50
   OR abs(new_probability − current_price) < 0.05
   → SKIP (signal too weak)

IF urgency == "high" AND direction == "bearish"
   AND you hold YES shares
   → ALERT: consider selling current position
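
The same rules expressed as a small decision function, a sketch over the TradingSignal object above with the thresholds hard-coded exactly as listed; in practice you would load them from your bot's config:

python
def decide_action(signal: TradingSignal, holding_yes: bool = False) -> str:
    """Map a sentiment signal to an action using the threshold rules above."""
    edge = signal.new_probability - signal.current_price

    if edge > 0.15 and signal.confidence > 0.70 and signal.urgency == "high":
        return "EXECUTE_BUY"   # limit order at current_price + 0.02
    if edge > 0.10 and signal.confidence > 0.60 and signal.urgency == "medium":
        return "QUEUE"         # batched execution every 15 min
    if signal.confidence < 0.50 or abs(edge) < 0.05:
        return "SKIP"          # signal too weak
    if signal.urgency == "high" and signal.direction == "bearish" and holding_yes:
        return "ALERT_SELL"    # consider exiting the YES position
    return "SKIP"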

Practice Lab

Exercise 1: Build the Ingester. Set up the async fetcher with at least 3 sources (Reuters, Dawn, one more). Fetch and print the first 5 articles from each. Confirm Dawn returns Pakistani news faster than Reuters for South Asian events. Measure fetch time: parallel vs sequential.

Exercise 2: Test the Filter. Pick a market from your Module 1 paper trading. Feed 30 headlines through a Stage 1 relevance filter prompt with 5-10 market keywords. Count how many survive filtering. What percentage was noise? Was the filter too aggressive or too lenient?

Exercise 3: Sentiment Scoring Test. Take 3 surviving headlines from Exercise 2 and run each through a Stage 2 Claude Sonnet scoring prompt. Record the new_probability, confidence, and reasoning for each. Compare the AI's probability estimates with your own intuition. Where is the AI miscalibrated? Where is it more objective than you?

Pakistan Case Study

Meet Bilal — data science student at LUMS, building a Polymarket bot as his final year project.

His pipeline evolution:

Version 1 (Week 1-2):

  • Single source (Reuters RSS only)
  • No filtering — every headline sent to GPT-4o
  • No deduplication
  • Cost: $4.80/day = PKR 40,000/month
  • Signal quality: Noisy, many false positives
  • Win rate on signals: 48% (basically random)

Version 2 (Week 3-4):

  • Added Dawn + Business Recorder + Geo
  • Added deduplication (cut articles by 40%)
  • Still no 2-stage filtering
  • Cost: $3.20/day = PKR 27,000/month
  • Better coverage of Pakistan events
  • Win rate: 52% (slight improvement from local sources)

Version 3 (Week 5-6 — after this lesson):

  • Full 5-source ingestion
  • Deduplication + 2-stage filtering
  • Stage 1: Gemini Flash filter (90% reduction)
  • Stage 2: Claude Sonnet scoring
  • Twitter monitoring for @StateBank_Pak and @PakPMO
  • Cost: $0.90/day = PKR 7,500/month (81% cost reduction)
  • Signal quality: High — only 5-8 actionable signals/day
  • Win rate: 64% (real edge emerging)

The breakthrough: Bilal's bot caught an SBP monetary policy committee meeting announcement on Dawn 3 hours before Reuters covered it. The bot bought YES shares on a rate-cut market at $0.41. By the time Western traders reacted, the price was $0.58. Profit: $17 on a $41 position (41% return in 3 hours).

His key insight: "Pakistani news sources are my unfair advantage. Dawn publishes SBP news hours before Reuters. My bot reads Dawn at machine speed while other traders wait for Bloomberg. The 2-stage pipeline keeps costs under PKR 8,000/month — I'm spending less than a ChatGPT subscription to run a trading bot."

Key Takeaways

  • Two-stage filtering (cheap filter → expensive scorer) reduces AI costs by 85% while maintaining signal quality
  • Pakistani news sources (Dawn, Business Recorder, Geo) provide 2-6 hour information leads over Reuters/AP for South Asian events — this is your competitive edge
  • Always deduplicate before filtering — 5 sources reporting the same story should produce 1 signal, not 5
  • The async ingester fetches all sources in 2-4 seconds (parallel) vs. 10-15 seconds (sequential)
  • Twitter catches breaking news 15-30 minutes before RSS feeds — monitor key accounts like @StateBank_Pak
  • Signal output (new_probability, confidence, urgency) feeds directly into your trading execution engine
  • Monthly pipeline cost: PKR 7,000-8,000 with 2-stage filtering vs. PKR 40,000+ without — the architecture pays for itself
  • Stage 1 (Gemini Flash) filters noise, Stage 2 (Claude Sonnet) provides deep reasoning — never skip Stage 1

Next lesson: Building the execution engine — turning sentiment signals into actual trades.

Lesson Summary

Includes a hands-on practice lab, 11 runnable code examples, and a 4-question knowledge check below.

Quiz: News Sentiment Analysis with AI — Reuters, AP, Twitter

4 questions to test your understanding. Score 60% or higher to pass.