2.2 — News Sentiment Analysis with AI — Reuters, AP, Twitter
Market prices react to information. Your bot's edge is processing information faster and more accurately than competing traders. The news sentiment pipeline is the "perception layer" of your Oracle — it reads the world and converts raw text into quantified probability signals. In this lesson, we build a multi-source sentiment engine using Reuters, AP, Twitter/X, Pakistani news, and AI models to generate trading signals.
The Sentiment Pipeline Architecture
RAW NEWS (4 sources)
│
▼
┌─────────────────────────┐
│ INGEST │ ← aiohttp + feedparser
│ 80-120 articles/hour │ (parallel fetching)
└──────────┬──────────────┘
│
▼
┌─────────────────────────┐
│ DEDUPLICATE │ ← difflib.SequenceMatcher
│ Remove cross-source │ (similarity > 0.8 = duplicate)
│ duplicates │
│ 80 articles → ~45 │
└──────────┬──────────────┘
│
▼
┌─────────────────────────┐
│ STAGE 1: FILTER │ ← Gemini Flash (cheap)
│ "Is this relevant to │ Cost: ~$0.0001 per article
│ any tracked market?" │
│ 45 articles → ~5-8 │
└──────────┬──────────────┘
│
▼
┌─────────────────────────┐
│ STAGE 2: SCORE │ ← Claude Sonnet (deep reasoning)
│ "How does this change │ Cost: ~$0.005 per article
│ the implied prob?" │
│ Output: trading signal │
└──────────┬──────────────┘
│
▼
┌─────────────────────────┐
│ SIGNAL OUTPUT │ → Feeds into execution engine
│ market_id, new_prob, │ (Module 4)
│ confidence, urgency │
└─────────────────────────┘
COST AT SCALE:
Without 2-stage: 45 articles × $0.005 = $0.225/hour = $5.40/day
With 2-stage: 45 × $0.0001 + 6 × $0.005 = $0.0345/hour = $0.83/day
SAVINGS: 85% ← This is why the pipeline architecture matters
The separation of stages is critical for cost management. Running every headline through an expensive reasoning model would cost $50-200/day at scale. Running it through a cheap filter first reduces expensive calls by 85-90%.
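The arithmetic above can be captured in a small cost model. This is a sketch using the per-article cost estimates from this lesson ($0.0001 for the Flash filter, $0.005 for the Sonnet scorer) — they are illustrative figures, not official API pricing:

```python
# Illustrative per-article costs from this lesson (not official pricing)
FLASH_COST = 0.0001   # Stage 1 relevance filter
SONNET_COST = 0.005   # Stage 2 probability scorer

def daily_cost(articles_per_hour, pass_rate, two_stage=True):
    """Estimate USD/day for the sentiment pipeline.

    pass_rate: fraction of articles that survive the Stage 1 filter.
    """
    per_day = articles_per_hour * 24
    if not two_stage:
        # Every article goes straight to the expensive model
        return per_day * SONNET_COST
    survivors = per_day * pass_rate
    return per_day * FLASH_COST + survivors * SONNET_COST

print(f"Without 2-stage: ${daily_cost(45, 1.0, two_stage=False):.2f}/day")
print(f"With 2-stage:    ${daily_cost(45, 6/45):.2f}/day")
```

With 45 deduplicated articles/hour and ~6 survivors, this reproduces the $5.40/day vs ~$0.83/day comparison from the diagram.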
Building the Multi-Source Ingester
Source Priority for Pakistani Markets
SOURCE PRIORITY TABLE:
| Source | Speed | Credibility | Best For | RSS Available |
|--------|-------|------------|----------|--------------|
| Twitter/X | Real-time | Variable | Breaking news, leaks | API/Nitter |
| Dawn | 1-2 hours | High | Pak politics, economy | Yes |
| Business Recorder | 1-3 hours | High | SBP, finance, trade | Yes |
| Geo News | 30 min | Medium-High | Breaking Pakistan news | Yes |
| Reuters | 2-4 hours | Highest | Global events | Yes |
| AP News | 2-4 hours | Highest | US politics, global | Yes |
KEY INSIGHT:
Pakistani sources (Dawn, Geo, BR) provide 2-6 HOUR information lead
over Western wire services for South Asian events.
If you're trading a Pakistan-related market and you only read Reuters,
you're the LAST person to know. Pakistani news sources are your edge.
The Async Ingester Code
import asyncio
import aiohttp
import feedparser
from datetime import datetime

class NewsIngester:
    """Fetches headlines from multiple RSS sources concurrently."""

    SOURCES = {
        "reuters": "https://feeds.reuters.com/reuters/worldNews",
        "ap": "https://rsshub.app/apnews/topics/world-news",
        "dawn": "https://www.dawn.com/feeds/home",
        "business_recorder": "https://www.brecorder.com/feeds",
        "geo": "https://www.geo.tv/rss/1/0",
    }

    async def fetch_feed(self, session, name, url):
        """Fetch and parse a single RSS feed."""
        try:
            async with session.get(
                url, timeout=aiohttp.ClientTimeout(total=10)
            ) as response:
                text = await response.text()
            feed = feedparser.parse(text)
            articles = []
            for entry in feed.entries[:20]:  # Last 20 per source
                articles.append({
                    "source": name,
                    "title": entry.get("title", ""),
                    "summary": entry.get("summary", "")[:500],
                    "url": entry.get("link", ""),
                    "published": entry.get("published", ""),
                    "fetched_at": datetime.utcnow().isoformat(),
                })
            return articles
        except Exception as e:
            print(f"[WARN] {name} fetch failed: {e}")
            return []

    async def fetch_all(self):
        """Fetch all sources in parallel — 2-4 sec vs 10-15 sec sequential."""
        async with aiohttp.ClientSession() as session:
            tasks = [
                self.fetch_feed(session, name, url)
                for name, url in self.SOURCES.items()
            ]
            results = await asyncio.gather(*tasks)
        # Flatten list of lists
        all_articles = [a for batch in results for a in batch]
        print(f"[INGEST] Fetched {len(all_articles)} articles "
              f"from {len(self.SOURCES)} sources")
        return all_articles
Run all fetches in parallel with asyncio.gather() — fetching 5 sources sequentially takes 10-15 seconds; in parallel it takes 2-4 seconds.
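The speed-up is easy to verify without touching the network. This sketch substitutes `asyncio.sleep` for real HTTP latency (a stand-in, not the actual `fetch_feed`) and times both strategies:

```python
import asyncio
import time

async def fake_fetch(name, delay=0.2):
    """Stand-in for fetch_feed: simulates network latency with sleep."""
    await asyncio.sleep(delay)
    return name

async def sequential():
    # Each await blocks until the previous "fetch" finishes
    return [await fake_fetch(n) for n in range(4)]

async def parallel():
    # All "fetches" run concurrently; total time ≈ slowest single fetch
    return await asyncio.gather(*(fake_fetch(n) for n in range(4)))

start = time.perf_counter()
asyncio.run(sequential())
seq = time.perf_counter() - start

start = time.perf_counter()
asyncio.run(parallel())
par = time.perf_counter() - start

print(f"sequential: {seq:.2f}s, parallel: {par:.2f}s")  # ~0.8s vs ~0.2s
```

The ratio scales with source count: with 5 real feeds at 2-3 seconds each, sequential fetching pays the full sum while `gather()` pays roughly the slowest single fetch.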
Deduplication
The same news story appears across multiple sources. Without deduplication, you process (and pay for) the same story 3-4 times.
from difflib import SequenceMatcher

def deduplicate(articles, threshold=0.8):
    """Remove near-duplicate articles based on title similarity.

    Keeps the first occurrence of each story — sort `articles` by
    source credibility beforehand if you want the most credible
    source to win.
    """
    unique = []
    for article in articles:
        is_dupe = False
        for existing in unique:
            similarity = SequenceMatcher(
                None,
                article["title"].lower(),
                existing["title"].lower(),
            ).ratio()
            if similarity > threshold:
                is_dupe = True
                break
        if not is_dupe:
            unique.append(article)
    print(f"[DEDUP] {len(articles)} → {len(unique)} unique articles")
    return unique
For a pipeline ingesting 80 articles/hour from 5 sources, deduplication typically reduces to 40-55 unique stories — cutting your AI filter costs in half.
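To get a feel for the 0.8 threshold, here is a quick check of `SequenceMatcher` ratios on made-up headlines — near-duplicate phrasings score high, unrelated stories score low:

```python
from difflib import SequenceMatcher

def similarity(x, y):
    """Title similarity in [0, 1], case-insensitive."""
    return SequenceMatcher(None, x.lower(), y.lower()).ratio()

a = "SBP cuts policy rate by 100 basis points"
b = "SBP cuts policy rate by 100bps in surprise move"  # same story, reworded
c = "Pakistan wins T20 series against New Zealand"     # unrelated

print(f"a vs b: {similarity(a, b):.2f}")  # high — likely above threshold
print(f"a vs c: {similarity(a, c):.2f}")  # low — clearly a different story
```

Tune the threshold on your own feed: too high and wire-service rewrites slip through (you pay twice); too low and distinct stories about the same topic get merged.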
The Two-Stage AI Filtering
Stage 1 — Relevance Filter (Gemini Flash)
STAGE_1_PROMPT = """You are a news relevance filter for a prediction market trading bot.
Active markets being tracked:
{market_list}
For each headline below, respond with ONLY the relevant headlines
(ones that could affect the probability of any tracked market).
Return one headline per line. If none are relevant, return "NONE".
Headlines:
{headlines}"""
# Example market_list:
# - "Will SBP cut rates below 15% by Dec 2026?"
# - "Will India-Pakistan bilateral trade resume by Q3 2026?"
# - "Will PIA complete privatization by June 2026?"
Cost: ~$0.000075 per 1,000 tokens (Gemini Flash). Processing 45 headlines costs roughly $0.0001. This stage eliminates 80-90% of noise.
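Before sending anything to the model, you need to render the prompt template with your live market list and the hour's headlines. This sketch shows only the string assembly — the actual API call (client setup, model name) is omitted, since it depends on which SDK you use; the example markets and headlines are hypothetical:

```python
# Same template as above, abbreviated for the example
STAGE_1_PROMPT = """You are a news relevance filter for a prediction market trading bot.

Active markets being tracked:
{market_list}

For each headline below, respond with ONLY the relevant headlines.
Return one headline per line. If none are relevant, return "NONE".

Headlines:
{headlines}"""

markets = [
    "Will SBP cut rates below 15% by Dec 2026?",
    "Will PIA complete privatization by June 2026?",
]
hour_headlines = [
    "SBP hints at monetary easing in quarterly review",  # relevant
    "Karachi Kings sign new fast bowler for PSL",        # noise
]

prompt = STAGE_1_PROMPT.format(
    market_list="\n".join(f"- {m}" for m in markets),
    headlines="\n".join(f"- {h}" for h in hour_headlines),
)
print(prompt)
```

Batching all of the hour's headlines into one Stage 1 call (rather than one call per headline) is what keeps the filter cost near $0.0001 per article.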
Stage 2 — Probability Scoring (Claude Sonnet)
STAGE_2_PROMPT = """You are an expert prediction market analyst.
Market question: "{market_question}"
Current market price (implied probability): {current_price}
Resolution criteria: "{resolution_criteria}"
Breaking news headline: "{headline}"
Summary: "{summary}"
Source: {source}
Published: {published}
Based on this news, analyze:
1. How does this headline change the implied probability?
2. What is your confidence in this assessment?
3. How urgent is this signal (should the bot act now or wait)?
Return JSON only:
{{
"new_probability": 0.XX,
"confidence": 0.XX,
"urgency": "high" | "medium" | "low",
"reasoning": "One sentence explanation",
"direction": "bullish" | "bearish" | "neutral"
}}"""
Cost: ~$0.003 per 1,000 tokens. Only 5-8 articles reach this stage per hour, keeping daily costs under $1.
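Because Stage 2 output feeds the execution engine, parse it defensively: models sometimes wrap JSON in markdown fences, omit fields, or emit out-of-range numbers. A minimal validation sketch (the field names follow the JSON schema above; the fence-stripping heuristic is an assumption about model behavior):

```python
import json

REQUIRED = {"new_probability", "confidence", "urgency",
            "reasoning", "direction"}

def parse_signal(raw: str) -> dict:
    """Parse and sanity-check a Stage 2 model response."""
    # Strip markdown code fences the model may wrap around the JSON
    raw = raw.strip().removeprefix("```json").removesuffix("```").strip()
    signal = json.loads(raw)
    missing = REQUIRED - signal.keys()
    if missing:
        raise ValueError(f"missing fields: {missing}")
    for field in ("new_probability", "confidence"):
        if not 0.0 <= signal[field] <= 1.0:
            raise ValueError(f"{field} out of range: {signal[field]}")
    if signal["urgency"] not in ("high", "medium", "low"):
        raise ValueError(f"bad urgency: {signal['urgency']}")
    return signal

demo = ('{"new_probability": 0.65, "confidence": 0.78, '
        '"urgency": "high", "reasoning": "Dovish shift", '
        '"direction": "bullish"}')
print(parse_signal(demo)["urgency"])  # high
```

Rejecting a malformed response and re-prompting is far cheaper than letting a garbage probability trigger a trade.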
Cost Comparison at Scale
DAILY AI COSTS:
WITHOUT 2-STAGE WITH 2-STAGE
Articles processed 1,080/day 1,080/day
Stage 1 (Flash) — $0.10/day
Stage 2 (Sonnet) $5.40/day $0.72/day (only 144 articles)
─────────────────────────────────────────────────────
TOTAL $5.40/day $0.82/day
MONTHLY $162/month $24.60/month
SAVINGS — 85%
At 1 PKR = $0.0036:
Without: PKR 45,000/month → TOO EXPENSIVE for most traders
With: PKR 6,800/month → AFFORDABLE even for students
Twitter/X Integration
Real-time Twitter monitoring catches breaking news 15-30 minutes before RSS feeds. For Pakistani market events, key accounts to follow:
MUST-FOLLOW TWITTER ACCOUNTS (by category):
ECONOMY/FINANCE:
├── @StateBank_Pak — SBP rate announcements (CRITICAL for rate markets)
├── @FinaborPk — Pakistan finance news
├── @baborjakhar — Business journalist, breaks SBP news
└── @BusinessRecrdr — Business Recorder breaking news
POLITICS:
├── @PakPMO — Prime Minister's Office
├── @ForeignOfficePk — Foreign ministry (bilateral trade, BRICS)
├── @NAofPakistan — National Assembly proceedings
└── @dawn_com — Dawn breaking news
SPORTS:
├── @GeoSuper — Cricket, PSL (for cricket market trades)
├── @TheRealPCB — Pakistan Cricket Board
└── @ESPNcricinfo — International cricket updates
INTERNATIONAL:
├── @IMFNews — IMF program updates (critical for PKR markets)
├── @Reuters — Global breaking news
└── @AP — Associated Press
Twitter API vs. Free Alternatives
OPTION 1: Twitter API v2 ($100/month)
├── Real-time filtered stream
├── Push-based (instant delivery)
├── Best for serious trading bots
└── Rate limit: 500,000 tweets/month
OPTION 2: Nitter RSS Bridge (FREE)
├── nitter.net/[username]/rss
├── 5-10 minute delay vs real-time
├── No API key needed
├── Fragile (Nitter may go down)
└── Good enough for learning/testing
OPTION 3: RSS.app or RSSHub ($0-5/month)
├── Converts Twitter profiles to RSS
├── 1-5 minute delay
├── More reliable than Nitter
└── Recommended starting point
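Because Options 2 and 3 expose tweets as RSS, they slot straight into the ingester: each monitored account becomes one more entry in the `SOURCES` dict. A sketch using the `nitter.net/[username]/rss` pattern from Option 2 (whether any given Nitter instance is up is not guaranteed — swap in your RSSHub/RSS.app URLs as needed):

```python
# Accounts to monitor (from the lists above)
ACCOUNTS = ["StateBank_Pak", "PakPMO", "IMFNews"]

# Build source entries in the same {name: feed_url} shape as
# NewsIngester.SOURCES, so they can be merged with dict.update()
twitter_sources = {
    f"twitter_{acct}": f"https://nitter.net/{acct}/rss"
    for acct in ACCOUNTS
}

print(twitter_sources["twitter_StateBank_Pak"])
# https://nitter.net/StateBank_Pak/rss
```

After `SOURCES.update(twitter_sources)`, tweets flow through the same dedup and two-stage filter as the news feeds — no special-case code path needed.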
Structuring the Signal Output
The final output of your sentiment pipeline is a list of trading signals that feed directly into your execution engine (Module 4):
{
"market_id": "will-sbp-cut-rates-q2-2026",
"headline": "State Bank signals dovish shift in latest quarterly review",
"current_price": 0.42,
"new_probability": 0.65,
"confidence": 0.78,
"urgency": "high",
"direction": "bullish",
"source": "dawn",
"timestamp": "2026-03-26T09:15:00Z",
"reasoning": "Quarterly review language shifted from 'maintaining' to 'reviewing appropriate levels' - historically precedes rate cuts"
}
Signal-to-Action Rules
TRADING RULES (configure in your bot):
IF new_probability − current_price > 0.15
AND confidence > 0.70
AND urgency == "high"
→ EXECUTE BUY (limit order at current_price + 0.02)
IF new_probability − current_price > 0.10
AND confidence > 0.60
AND urgency == "medium"
→ QUEUE for next batch execution (every 15 min)
IF confidence < 0.50
OR abs(new_probability − current_price) < 0.05
→ SKIP (signal too weak)
IF urgency == "high" AND direction == "bearish"
AND you hold YES shares
→ ALERT: consider selling current position
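The rules above translate directly into a decision function. This is one possible encoding — the rule precedence (skip weak signals first, then check the bearish-exit alert before entries) is a design choice, and the thresholds are meant to live in your bot's config, not be hard-coded:

```python
def decide_action(signal, holds_yes=False):
    """Map a sentiment signal to a trading action per the rules above."""
    edge = signal["new_probability"] - signal["current_price"]

    # Weak signal: low confidence or negligible edge
    if signal["confidence"] < 0.50 or abs(edge) < 0.05:
        return "SKIP"

    # Defensive alert takes priority over new entries
    if (signal["urgency"] == "high" and signal["direction"] == "bearish"
            and holds_yes):
        return "ALERT_SELL"

    # Strong, urgent edge: act immediately
    if edge > 0.15 and signal["confidence"] > 0.70 and signal["urgency"] == "high":
        return "EXECUTE_BUY"   # limit order at current_price + 0.02

    # Moderate edge: batch it
    if edge > 0.10 and signal["confidence"] > 0.60 and signal["urgency"] == "medium":
        return "QUEUE"         # next 15-min batch execution

    return "SKIP"

sig = {"new_probability": 0.65, "current_price": 0.42,
       "confidence": 0.78, "urgency": "high", "direction": "bullish"}
print(decide_action(sig))  # EXECUTE_BUY
```

Running the SBP example signal from above (0.42 → 0.65, confidence 0.78, high urgency) yields an immediate buy, exactly as the first rule intends.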
Practice Lab
Exercise 1: Build the Ingester Set up the async fetcher with at least 3 sources (Reuters, Dawn, one more). Fetch and print the first 5 articles from each. Confirm Dawn returns Pakistani news faster than Reuters for South Asian events. Measure fetch time: parallel vs sequential.
Exercise 2: Test the Filter Pick a market from your Module 1 paper trading. Feed 30 headlines through a Stage 1 relevance filter prompt with 5-10 market keywords. Count how many survive filtering. What percentage was noise? Was the filter too aggressive or too lenient?
Exercise 3: Sentiment Scoring Test Take 3 surviving headlines from Exercise 2 and run each through a Stage 2 Claude Sonnet scoring prompt. Record the new_probability, confidence, and reasoning for each. Compare the AI's probability estimates with your own intuition. Where is the AI miscalibrated? Where is it more objective than you?
Pakistan Case Study
Meet Bilal — data science student at LUMS, building a Polymarket bot as his final year project.
His pipeline evolution:
Version 1 (Week 1-2):
- Single source (Reuters RSS only)
- No filtering — every headline sent to GPT-4o
- No deduplication
- Cost: $4.80/day = PKR 40,000/month
- Signal quality: Noisy, many false positives
- Win rate on signals: 48% (basically random)
Version 2 (Week 3-4):
- Added Dawn + Business Recorder + Geo
- Added deduplication (cut articles by 40%)
- Still no 2-stage filtering
- Cost: $3.20/day = PKR 27,000/month
- Better coverage of Pakistan events
- Win rate: 52% (slight improvement from local sources)
Version 3 (Week 5-6 — after this lesson):
- Full 5-source ingestion
- Deduplication + 2-stage filtering
- Stage 1: Gemini Flash filter (90% reduction)
- Stage 2: Claude Sonnet scoring
- Twitter monitoring for @StateBank_Pak and @PakPMO
- Cost: $0.90/day = PKR 7,500/month (81% cost reduction)
- Signal quality: High — only 5-8 actionable signals/day
- Win rate: 64% (real edge emerging)
The breakthrough: Bilal's bot caught an SBP monetary policy committee meeting announcement on Dawn 3 hours before Reuters covered it. The bot bought YES shares on a rate-cut market at $0.41. By the time Western traders reacted, the price was $0.58. Profit: $17 on a $41 position (41% return in 3 hours).
His key insight: "Pakistani news sources are my unfair advantage. Dawn publishes SBP news hours before Reuters. My bot reads Dawn at machine speed while other traders wait for Bloomberg. The 2-stage pipeline keeps costs under PKR 8,000/month — I'm spending less than a ChatGPT subscription to run a trading bot."
Key Takeaways
- Two-stage filtering (cheap filter → expensive scorer) reduces AI costs by 85% while maintaining signal quality
- Pakistani news sources (Dawn, Business Recorder, Geo) provide 2-6 hour information leads over Reuters/AP for South Asian events — this is your competitive edge
- Always deduplicate before filtering — 5 sources reporting the same story should produce 1 signal, not 5
- The async ingester fetches all sources in 2-4 seconds (parallel) vs. 10-15 seconds (sequential)
- Twitter catches breaking news 15-30 minutes before RSS feeds — monitor key accounts like @StateBank_Pak
- Signal output (new_probability, confidence, urgency) feeds directly into your trading execution engine
- Monthly pipeline cost: PKR 7,000-8,000 with 2-stage filtering vs. PKR 40,000+ without — the architecture pays for itself
- Stage 1 (Gemini Flash) filters noise, Stage 2 (Claude Sonnet) provides deep reasoning — never skip Stage 1
Next lesson: Building the execution engine — turning sentiment signals into actual trades.