3.3 — Ensemble Voting — Combining Multiple AI Signals
No single AI model is right all the time. Gemini Pro is excellent at geopolitical analysis. Claude excels at parsing resolution criteria nuance. A specialized prompt focused on statistical base rates outperforms both on certain market categories. Ensemble voting — combining predictions from multiple independent models — consistently outperforms any single model. This lesson builds the ensemble voting layer that aggregates signals and produces a final trading decision.
Why Ensemble Methods Work
The mathematical basis for ensemble voting is error independence. When two models make errors on different inputs (uncorrelated errors), their combination reduces the overall error rate. If Model A is correct 70% of the time and Model B is independently correct 70% of the time, then on the cases where they agree, the shared answer is correct approximately 84% of the time (0.49 / (0.49 + 0.09) ≈ 0.845).
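The ~84% figure is a two-line conditional-probability calculation; a minimal sketch:

```python
# Two independent models, each correct 70% of the time (figures from the text).
# When both models agree, how often is the shared answer correct?
p = 0.70

p_both_correct = p * p            # 0.49: both right, so they agree on the truth
p_both_wrong = (1 - p) * (1 - p)  # 0.09: both wrong, so they agree on a falsehood
p_agree = p_both_correct + p_both_wrong

p_correct_given_agree = p_both_correct / p_agree
print(f"P(correct | models agree) = {p_correct_given_agree:.3f}")  # ≈ 0.845
```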
For prediction markets, this translates directly to edge: an ensemble bot that's right 65-70% of the time on 50-cent markets earns significantly more than a single-model bot that's right 58-62% of the time.
The catch: models must make independent errors. If Gemini Pro and Gemini Flash are both trained on the same data and produce correlated errors (both wrong on the same types of inputs), ensembling them doesn't help. True diversity requires different model families (Gemini vs. Claude), different prompt strategies (base rate vs. news analysis), or different input data.
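Whether two candidate ensemble members actually make independent errors is measurable from your own trade log. A sketch, using hypothetical per-market error flags (1 = model was wrong on that market):

```python
# Hypothetical error flags for two models over ten resolved markets.
model_a_wrong = [0, 1, 0, 0, 1, 0, 1, 0, 0, 0]
model_b_wrong = [0, 1, 0, 1, 1, 0, 0, 0, 0, 1]

def error_correlation(a: list[int], b: list[int]) -> float:
    """Phi coefficient between two binary error series."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / n
    var_a = sum((x - mean_a) ** 2 for x in a) / n
    var_b = sum((y - mean_b) ** 2 for y in b) / n
    return cov / (var_a * var_b) ** 0.5

phi = error_correlation(model_a_wrong, model_b_wrong)
# Near 0: errors look independent (good ensemble members).
# Near 1: the models fail on the same inputs, so ensembling adds little.
print(f"error correlation: {phi:.2f}")
```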
The Oracle's 4-Tier Signal Stack
Your Oracle uses four distinct signal types as ensemble members:
Signal 1 — Gemini Flash Fast Triage (weight: 0.1): Directional signal only (BULLISH_YES or BEARISH_YES). Low weight because it's the cheapest and least nuanced.
Signal 2 — Claude Deep Analysis (weight: 0.4): The primary reasoning signal. Claude Sonnet's calibrated probability estimate with full reasoning. Highest weight because it's most reliable.
Signal 3 — Base Rate Calculator (weight: 0.25): No news involved. What is the historical base rate for this type of market? E.g., "In 2024-2025, central banks that were in an IMF program cut rates at 60% of their scheduled meetings." This signal is independent of current news, providing a valuable anchor against over-reaction.
Signal 4 — Resolution Criteria Parser (weight: 0.25): Focused analysis on the resolution criteria text itself, looking for edge cases, timing ambiguities, and source specificity that could cause a surprising resolution. Often uses Claude Haiku with a specific resolution-focused prompt.
The Weighted Ensemble Aggregator
```python
from dataclasses import dataclass

@dataclass
class Signal:
    source: str
    probability: float   # the signal's probability estimate for YES
    confidence: float    # the signal's self-reported confidence, 0-1
    weight: float        # ensemble weight for this signal source

def aggregate_ensemble(signals: list[Signal], market_price: float) -> dict:
    """
    Weighted ensemble aggregation with confidence-adjusted voting.
    """
    total_weight = sum(s.weight * s.confidence for s in signals)
    if total_weight == 0:
        return {"final_probability": 0.5, "consensus": "UNCERTAIN", "execute": False}

    # Confidence-weighted probability
    weighted_prob = sum(
        s.probability * s.weight * s.confidence
        for s in signals
    ) / total_weight

    # Measure consensus — how much do signals agree?
    probs = [s.probability for s in signals]
    consensus_spread = max(probs) - min(probs)
    if consensus_spread < 0.10:
        consensus = "STRONG"    # All signals agree closely
    elif consensus_spread < 0.25:
        consensus = "MODERATE"  # Some disagreement
    else:
        consensus = "WEAK"      # Signals diverge significantly

    # Execute only on strong-enough consensus with a meaningful edge
    edge = abs(weighted_prob - market_price)
    execute = consensus in ("STRONG", "MODERATE") and edge > 0.12

    return {
        "final_probability": weighted_prob,
        "consensus": consensus,
        "consensus_spread": consensus_spread,
        "edge": edge,
        "execute": execute,
        "direction": "BUY_YES" if weighted_prob > market_price else "BUY_NO",
    }
```
The 0.12 Edge Threshold
The edge > 0.12 threshold means: only execute when the ensemble's probability estimate differs from the current market price by at least 12 cents. This accounts for:
- Bid-ask spread (typically 2-4 cents): your entry cost
- Model error margin (±5-8%): models aren't perfectly calibrated
- Expected profit requirement: you need at least 5c expected profit per trade to justify the time and cost
At 12c edge, your expected profit after spread and model error is approximately 5-7 cents per dollar wagered. On a PKR 28,000 (≈$100) position, that's PKR 1,400–1,960 expected profit. Not huge, but compound over hundreds of trades per month and the numbers become meaningful.
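The arithmetic above can be sketched directly. The spread and model-error haircuts below are assumptions (midpoints chosen to land in the text's 5-7 cents-per-dollar range), not fixed constants:

```python
# Back-of-envelope check of the 12-cent edge threshold.
edge = 0.12                 # ensemble estimate vs. market price
spread_cost = 0.03          # assumed midpoint of the 2-4 cent bid-ask spread
model_error_haircut = 0.03  # assumed portion of edge written off to miscalibration

net_edge = edge - spread_cost - model_error_haircut  # per dollar wagered

position_pkr = 28_000  # the ≈$100 position from the text
expected_profit_pkr = position_pkr * net_edge
print(f"net edge: {net_edge:.2f}/dollar, expected profit: PKR {expected_profit_pkr:,.0f}")
```

With these assumptions the net edge is 6 cents per dollar, i.e. PKR 1,680 on the PKR 28,000 position, inside the PKR 1,400-1,960 range quoted above.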
Handling Signal Disagreement
When signals diverge significantly (consensus = "WEAK"), the ensemble system should not trade. But disagreement itself is informative — it suggests the market may be genuinely hard to predict, or that you have a data gap.
Log all weak-consensus cases to a "review queue" in your database. Review them manually weekly. Over time, you'll identify which signal sources are systematically wrong in which market categories — letting you tune weights.
Example: You might discover that the Base Rate Calculator systematically underestimates the probability of SBP rate cuts because its training data overweights historical periods of monetary tightening. Fix: adjust its weight from 0.25 to 0.15 for SBP-related markets, increase Claude's weight from 0.40 to 0.50.
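The review queue described above can be sketched with SQLite. The table name and schema here are illustrative, not from the lesson's codebase; an in-memory database stands in for the production file:

```python
import sqlite3

# In-memory for this sketch; point at a file path in production.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE IF NOT EXISTS review_queue (
        market_id TEXT,
        category TEXT,
        consensus_spread REAL,
        signals_json TEXT,
        logged_at TEXT DEFAULT CURRENT_TIMESTAMP
    )
""")

def log_weak_consensus(market_id: str, category: str,
                       spread: float, signals_json: str) -> None:
    """Record a no-trade disagreement case for the weekly manual review."""
    conn.execute(
        "INSERT INTO review_queue (market_id, category, consensus_spread, signals_json) "
        "VALUES (?, ?, ?, ?)",
        (market_id, category, spread, signals_json),
    )
    conn.commit()

log_weak_consensus("sbp-sept-cut", "sbp_macro", 0.31,
                   '{"claude": 0.70, "base_rate": 0.39}')
```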
Pakistan Case Study: The Cricket Market That Exposed Weight Bias
Omar from NUST Islamabad ran his ensemble bot for 60 days. Overall win rate: 61%. But when he segmented by market category, he found a shocking discrepancy:
| Market Category | Win Rate | Avg Edge | Net P&L |
|---|---|---|---|
| SBP macro markets | 71% | 14c | +PKR 28,400 |
| US politics markets | 59% | 11c | +PKR 8,200 |
| Pakistan cricket | 38% | 9c | -PKR 12,600 |
| Global geopolitics | 64% | 13c | +PKR 16,800 |
His cricket market performance was destroying his overall P&L.
The diagnosis: His Base Rate Calculator (Signal 3, weight 0.25) was using a generic "sports event" base rate template that assumed roughly 50/50 outcomes. But Pakistan cricket markets are not 50/50 — they're heavily influenced by recent form, pitch conditions, player injuries, and opponent strength that the base rate model ignored.
The fix — category-specific weight overrides:
```python
# Standard weights for all categories (from the default weights table below)
DEFAULT_WEIGHTS = {
    "gemini_flash_triage": 0.10,
    "claude_deep_analysis": 0.40,
    "base_rate_calculator": 0.25,
    "resolution_criteria_parser": 0.25,
}

CATEGORY_WEIGHT_OVERRIDES = {
    "cricket": {
        # Base rate is unreliable for cricket — reduce weight
        "base_rate_calculator": 0.10,
        # Boost resolution criteria parsing (catches scoring nuances)
        "resolution_criteria_parser": 0.35,
        # Claude deep analysis handles cricket context well
        "claude_deep_analysis": 0.45,
        # Keep flash low
        "gemini_flash_triage": 0.10,
    }
}

def get_ensemble_weights(market_category: str) -> dict:
    return CATEGORY_WEIGHT_OVERRIDES.get(
        market_category,
        DEFAULT_WEIGHTS,  # standard weights for all other categories
    )
```
Result after weight tuning:
- Cricket market win rate: 38% → 59% (over next 30 days)
- Overall bot win rate: 61% → 66%
- Monthly P&L improvement: PKR 18,000+
Omar's insight: "The ensemble is only as good as the weights. Generic equal weighting is a starting point — never a permanent configuration. After 30 trades in any market category, review the category-specific win rate and adjust weights accordingly."
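Omar's 30-trade review rule can be sketched as a small segmentation pass over a trade log. The record shape and the 55% review floor are illustrative assumptions:

```python
from collections import defaultdict

# Hypothetical trade log records: one dict per resolved trade.
trades = [
    {"category": "sbp_macro", "won": True},
    {"category": "cricket", "won": False},
    # ... hundreds more from your own log
]

def category_review(trades: list[dict], min_trades: int = 30) -> dict[str, float]:
    """Return win rate per category, only for categories with enough trades to judge."""
    wins: dict[str, int] = defaultdict(int)
    counts: dict[str, int] = defaultdict(int)
    for t in trades:
        counts[t["category"]] += 1
        wins[t["category"]] += int(t["won"])
    return {
        cat: wins[cat] / counts[cat]
        for cat in counts
        if counts[cat] >= min_trades
    }

for cat, rate in category_review(trades).items():
    if rate < 0.55:  # illustrative floor for "this category needs attention"
        print(f"{cat}: win rate {rate:.0%} — review ensemble weights")
```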
Ensemble Signal Aggregation Visualization
4-SIGNAL ENSEMBLE VOTING EXAMPLE
Market: "Will SBP cut rates in September?"
Current price: 0.42 (market says 42% chance)
Signal 1 — Gemini Flash (weight 0.10):
├── Direction: BULLISH_YES
├── Probability: 0.65
└── Confidence: 0.80
Contribution: 0.65 × 0.10 × 0.80 = 0.052
Signal 2 — Claude Sonnet (weight 0.40):
├── Direction: BULLISH_YES
├── Probability: 0.72
└── Confidence: 0.88
Contribution: 0.72 × 0.40 × 0.88 = 0.253
Signal 3 — Base Rate Calculator (weight 0.25):
├── Historical: SBP cuts rates at 55% of MPC meetings when CPI falls
├── Probability: 0.58
└── Confidence: 0.75
Contribution: 0.58 × 0.25 × 0.75 = 0.109
Signal 4 — Resolution Criteria Parser (weight 0.25):
├── Note: Resolution requires "official SBP press release" — no leak risk
├── Probability: 0.68
└── Confidence: 0.82
Contribution: 0.68 × 0.25 × 0.82 = 0.139
Total confidence-adjusted weight: 0.10×0.80 + 0.40×0.88 + 0.25×0.75 + 0.25×0.82 = 0.825
Weighted probability: (0.052 + 0.253 + 0.109 + 0.139) / 0.825 = 0.67
Market price: 0.42
Edge: 0.67 - 0.42 = 0.25 (25 cents — above the 12-cent threshold)
Consensus spread: 0.72 - 0.58 = 0.14 (highest minus lowest signal probability)
DECISION: MODERATE consensus + 25c edge → EXECUTE BUY_YES
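A quick numeric re-check of the walkthrough, inlining the confidence-weighted math with the four signal values above (printed figures are rounded):

```python
# (probability, weight, confidence) for each of the four signals
signals = [
    (0.65, 0.10, 0.80),  # Gemini Flash triage
    (0.72, 0.40, 0.88),  # Claude Sonnet deep analysis
    (0.58, 0.25, 0.75),  # Base rate calculator
    (0.68, 0.25, 0.82),  # Resolution criteria parser
]

total_weight = sum(w * c for _, w, c in signals)
weighted_prob = sum(p * w * c for p, w, c in signals) / total_weight

market_price = 0.42
edge = weighted_prob - market_price
spread = max(p for p, _, _ in signals) - min(p for p, _, _ in signals)

print(f"weighted prob: {weighted_prob:.2f}, edge: {edge:.2f}, spread: {spread:.2f}")
```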
Default vs Category-Specific Weights Table
| Market Category | Flash | Claude Sonnet | Base Rate | Res. Criteria |
|---|---|---|---|---|
| Default (all) | 0.10 | 0.40 | 0.25 | 0.25 |
| Pakistan cricket | 0.10 | 0.45 | 0.10 | 0.35 |
| SBP macro | 0.10 | 0.50 | 0.15 | 0.25 |
| US politics | 0.10 | 0.35 | 0.35 | 0.20 |
| Global geopolitics | 0.10 | 0.40 | 0.20 | 0.30 |
Tune these weights after every 30+ trades per category. The default is your starting point, not your final answer.
Practice Lab
- Manual ensemble exercise: Take one signal-market pair you've analyzed in previous lessons. Generate all four signals manually (Gemini triage, Claude analysis, your own base rate estimate, your read of the resolution criteria). Feed them into the aggregation function with the suggested weights. Does the ensemble output match your gut feeling? If not, which signal do you trust more — and why?
- Consensus disagreement analysis: Deliberately create a scenario where signals disagree. Set Gemini Flash to BULLISH_YES (probability 0.70) and Claude to BEARISH_YES (probability 0.35). What does the aggregation output? Is "WEAK" consensus the right call not to trade? What additional information would resolve the disagreement?
- Weight sensitivity test: Take one real prediction and vary the Claude weight from 0.2 to 0.6 while keeping total weights at 1.0. How much does the final probability change? This tells you how sensitive your bot is to Claude's accuracy — and whether you're over-relying on it.
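The weight sensitivity test can be automated. A sketch with hypothetical signal values: Claude's weight is swept while the remainder is redistributed evenly so the weights still sum to 1.0:

```python
# name: (probability, confidence) — hypothetical signal values for one market
signals = {
    "flash": (0.65, 0.80),
    "claude": (0.72, 0.88),
    "base_rate": (0.58, 0.75),
    "res_criteria": (0.68, 0.82),
}

def weighted_prob(claude_weight: float) -> float:
    """Ensemble probability with Claude at the given weight, rest split evenly."""
    other = (1.0 - claude_weight) / 3
    weights = {name: (claude_weight if name == "claude" else other)
               for name in signals}
    total = sum(weights[n] * c for n, (_, c) in signals.items())
    return sum(p * weights[n] * c for n, (p, c) in signals.items()) / total

for w in (0.2, 0.3, 0.4, 0.5, 0.6):
    print(f"claude weight {w:.1f} -> ensemble prob {weighted_prob(w):.3f}")
```

With these values the output drifts upward as Claude's weight grows, because Claude's 0.72 is the highest probability in the set; a small drift means the ensemble is robust to Claude's weight, a large one means you are leaning heavily on a single model.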
Key Takeaways
- Ensemble methods consistently outperform single models because independent error sources cancel out — two independent 70%-accurate models are right ~84% of the time on the predictions where they agree
- Model diversity is critical: combine different model families (Claude vs. Gemini) and different reasoning strategies (news analysis vs. base rates vs. resolution parsing)
- The 12-cent edge threshold accounts for spread costs, model error, and minimum profit requirements — trading below this threshold is unprofitable in expectation
- Disagreement between signals (weak consensus) is not a failure — it's actionable information. Log it, review it, and use it to tune ensemble weights for specific market categories
- Category-specific weight overrides are the highest-value tuning lever — Omar's cricket market win rate jumped from 38% to 59% after adjusting weights for that category alone
- Review category-level performance after every 30+ trades and update your weight overrides accordingly — generic equal weighting is a starting point, not an endpoint