Methodology

How We Score Pundits

We don't just track whether pundits are right. We track how and why they're wrong.

The Accountability Gap

Sports media runs on predictions. Every week, pundits make dozens of public calls — spread picks, player props, bold season takes. When they hit, they celebrate loudly. When they miss, the tape quietly rolls on.

No one systematically tracks whether these people are actually right. Bettors lose real money tailing personalities who have never had their record audited. Fans form opinions based on pundits who sound confident but have no track record to back it up.

The Pundit Prediction Ledger closes that gap. But to do it right, we need to understand not just whether pundits are wrong, but the different ways they can be wrong — and what those patterns reveal.

Framework

The Axes of Truthiness

Before we get to numbers, we need to understand the different ways a prediction can fail — and what each failure reveals about the person making it.

Degrees of Falsehood

Not all wrong predictions are created equal. There's a spectrum from genuine error to deliberate deception.

Honest error: Genuinely believed it, had a reasonable basis, was wrong. "I thought the O-line would hold up. I didn't expect 3 injuries."
Lazy wrongness: Didn't do the homework, just vibed. "I like the over here," with no analysis and no reasoning.
Bullshitting: Doesn't know or care if it's true; says what sounds good. Indifferent to truth, not opposed to it.
Motivated reasoning: Has a bias (network narrative, fandom, paid relationship) that distorts the analysis without the audience knowing.
Deliberate deception: Knows it's wrong, says it anyway for engagement or money. "Lock of the century" on a pick they don't believe in.

Epistemic Basis

What is the prediction actually built on? The foundation matters as much as the outcome.

Data-grounded: Cites specific stats, film study, injury reports. The prediction has a traceable analytical basis.
Experience-based: "I've covered this team for 20 years, I know their tendencies." Valid but unverifiable.
Gut intuition: "I just feel it." Might be pattern recognition, or might be nothing.
Narrative-driven: "This is a revenge game." The prediction serves a storyline, not an analysis.
Contrarian for clicks: Hot take with no basis other than being provocative. The prediction is content, not conviction.

Calibration

Does their confidence match their accuracy? The gap between how sure they sound and how often they're right is the bullshit detector.

Well-calibrated: Says "70% confident" and is right about 70% of the time. Their confidence is informative.
Overconfident: Everything's a lock, but the hit rate is 52%. Their confidence is noise.
Strategic hedger: Never commits firmly enough to be proven wrong. "I could see the Chiefs winning" is unfalsifiable by design.
Erratic: All-in one week, wishy-washy the next. No consistent signal for anyone to act on.

Accountability

What happens after a pundit is proven wrong? This is the only axis that measures character, not competence.

Owns it: "I was wrong, here's what I missed." Rare and valuable; shows the analysis is genuine.
Silent burial: Never mentions it again. Moves on to the next take and hopes you forgot.
Revisionism: "What I actually said was..." Reframes the prediction after the fact to look less wrong.
Doubling down: "I'm still right, the outcome was fluky." Refuses to update even when the evidence is clear.
Deflection: "Nobody could've predicted that." Externalizes all blame to preserve the illusion of competence.

The Scores

How Our Dimensions Map to Truthiness

Each scoring dimension acts as a detector for specific patterns of pundit unreliability.

Free Tier: The Triangle

Accuracy

Are they right?

The most basic truth test. Correct predictions divided by total predictions. Catches honest errors and deliberate bad picks alike — anyone can get lucky, but accuracy over hundreds of predictions reveals the signal.

Detects: Honest error, lazy wrongness, deliberate deception

Magnitude

When they're wrong, how wrong?

Separates informed misses from wild guesses. A pundit who says 'Chiefs by 3' when they win by 1 is very different from one who says 'Chiefs by 20' when they lose by 14. Small misses are forgiven; whoppers tank the score.

Detects: Lazy wrongness, gut-based predictions, overconfidence

Volume

Do they make enough testable claims to judge?

Filters out strategic hedgers who avoid commitment. If a pundit only makes 4 testable predictions per season, their score is unreliable. Low sample sizes receive a confidence penalty that shrinks the score toward the mean.
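One common way to apply a sample-size penalty of this kind is to blend the pundit's raw score with the population mean, giving the raw score more weight as the number of predictions grows. The sketch below illustrates that idea only; the function name, the `prior_strength` parameter, and its default of 20 are assumptions for illustration, not the ledger's actual settings.

```python
def shrink_toward_mean(raw_score: float, n: int, population_mean: float,
                       prior_strength: float = 20.0) -> float:
    """Blend a raw score with the population mean, weighted by sample size.

    prior_strength is a hypothetical tuning knob: roughly the number of
    predictions at which the raw score and the mean get equal weight.
    """
    if n < 0:
        raise ValueError("n must be non-negative")
    weight = n / (n + prior_strength)  # 0 when n == 0, approaches 1 as n grows
    return weight * raw_score + (1.0 - weight) * population_mean


# A pundit at 75% on only 4 picks is pulled most of the way back to the mean;
# the same 75% over 400 picks is barely adjusted.
small_sample = shrink_toward_mean(0.75, n=4, population_mean=0.50)
large_sample = shrink_toward_mean(0.75, n=400, population_mean=0.50)
```

With these assumed settings, a hot streak over four picks cannot outrank a long, steady record.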

Detects: Strategic hedging, vague takes, low commitment

Pro Tier: The Full Profile

Precision

When they say 'lock,' do they mean it?

The bullshitting detector. Tracks predictions where the pundit expressed high confidence — 'lock of the week,' 'guaranteed,' 'hammer this' — versus their actual hit rate on those picks. High precision means their conviction calls hit. Low precision means they're performing confidence, not demonstrating it.
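As a rough illustration, precision can be read as the hit rate restricted to high-confidence picks. The sketch below assumes each prediction is a dict with `confidence` and `binary_correct` keys and uses a cutoff of 0.9 for "lock"-style language; both the record shape and the cutoff are assumptions, not documented values.

```python
HIGH_CONFIDENCE = 0.9  # assumed cutoff for "lock"-style language


def precision(predictions, threshold=HIGH_CONFIDENCE):
    """Hit rate on high-confidence picks only; None if there are none."""
    locks = [p for p in predictions if p["confidence"] >= threshold]
    if not locks:
        return None
    return sum(p["binary_correct"] for p in locks) / len(locks)


picks = [
    {"confidence": 0.95, "binary_correct": 1.0},  # "lock of the week", hit
    {"confidence": 0.95, "binary_correct": 0.0},  # "guaranteed", miss
    {"confidence": 0.60, "binary_correct": 1.0},  # ordinary pick, ignored here
    {"confidence": 0.92, "binary_correct": 0.0},  # "hammer this", miss
]
lock_hit_rate = precision(picks)  # one hit out of three conviction calls
```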

Detects: Bullshitting, confidence inflation, overconfidence

Consistency

Are they steady or streaky?

Separates skill from luck. A pundit who's 70% one month and 30% the next is less useful than one steady at 50%. Measures the standard deviation of rolling accuracy windows — bettors need to know if the signal is reliable week to week.
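A minimal sketch of the rolling-window idea, assuming overlapping windows of 10 resolved predictions and the population standard deviation. The window size, the overlap, and the function names are illustrative assumptions; the source only states that the standard deviation of rolling accuracy windows is measured.

```python
from statistics import pstdev


def rolling_accuracy(outcomes, window):
    """Accuracy over each overlapping window of consecutive outcomes."""
    return [
        sum(outcomes[i:i + window]) / window
        for i in range(len(outcomes) - window + 1)
    ]


def consistency_spread(outcomes, window=10):
    """Standard deviation of rolling accuracy; lower means steadier."""
    windows = rolling_accuracy(outcomes, window)
    if len(windows) < 2:
        return None  # not enough history to measure spread
    return pstdev(windows)


steady = [1, 0] * 10            # alternates hit and miss: always 50% per window
streaky = [1] * 10 + [0] * 10   # ten straight hits, then ten straight misses
steady_spread = consistency_spread(steady)
streaky_spread = consistency_spread(streaky)
```

Both pundits here finish at 50% overall, but only the streaky one shows a large spread.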

Detects: Streakiness, survivorship bias, small-window luck

Boldness

Do they actually say anything?

Measures how often a pundit goes against the consensus line or public betting percentages. High boldness plus high accuracy equals genuinely valuable signal. High boldness plus low accuracy means fade material. Low boldness means a chalk parrot who just picks favorites.
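In its simplest form, this could be computed as the share of picks that land on the opposite side from the consensus. The sketch below assumes each pick carries a `side` and a `consensus_side` field; those field names are hypothetical, and how consensus is derived from lines or public betting percentages is left out.

```python
def boldness(picks):
    """Share of picks that go against the consensus side; None if empty."""
    if not picks:
        return None
    against = sum(1 for p in picks if p["side"] != p["consensus_side"])
    return against / len(picks)


week = [
    {"side": "Chiefs", "consensus_side": "Chiefs"},   # chalk
    {"side": "Bears", "consensus_side": "Packers"},   # against consensus
    {"side": "Over", "consensus_side": "Over"},       # chalk
    {"side": "Jets", "consensus_side": "Bills"},      # against consensus
]
week_boldness = boldness(week)
```

Boldness is only meaningful next to accuracy, since a high value alone does not separate useful contrarians from fade material.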

Detects: Chalk parroting, narrative-driven picks, contrarian performance

Coming Soon: Accountability

The 7th dimension. We're building the ability to scan whether a pundit references past misses in subsequent content — do they own their mistakes, bury them, revise history, or double down? The only dimension that measures character, not just competence.

Eligibility

What Counts as a Prediction

Not every statement is a scoreable prediction. We apply strict eligibility criteria — and the criteria themselves are part of the methodology.

A scoreable prediction must be:

Resolvable: A clear right/wrong outcome must exist.
Actionable: A bettor could place a wager based on it.

Valid Predictions
  • “Chiefs -3”
  • “Over 47.5”
  • “Eagles win the Super Bowl”
  • “Mahomes over 285.5 passing yards”
  • “They'll trade for a WR before the deadline”

Not Scoreable
  • “I like the Chiefs this week”
  • “I think the offense will be better”
  • “This team has momentum”
  • “He's going to have a big game”

Why we filter

Vague predictions are the pundit's escape hatch. By requiring testable claims, we remove the ability to retroactively claim “that's what I meant.” This is by design — if you can't be proven wrong, you shouldn't get credit for being right.

Taxonomy

The Claim Type Taxonomy

The Woj/Schefter distinction matters. A failed prediction is a bad call. A report whose underlying deal fell through is not the same thing, and we score it differently. This is the contract.

Prediction (speech_act_type: assertion)
  Definition: Analyst asserts a future outcome will occur.
  Example: “The Eagles will win the Super Bowl.”
  How we score it: CORRECT or INCORRECT at resolution. Scored.

Report (speech_act_type: recall)
  Definition: Analyst relays insider information about the current state of affairs.
  Example: “The Eagles are looking to trade for a WR per sources.”
  How we score it: CORRECT if independently confirmed. VOID if the underlying deal or situation falls through, because a collapsed deal is not a failed prediction.

Conditional (speech_act_type: conditional)
  Definition: Outcome depends on a contingent event.
  Example: “If they draft a left tackle, the line will be top-10.”
  How we score it: Scored only when the condition resolves cleanly. VOID if the “if” premise never occurs.

Opinion / Take (speech_act_type: opinion)
  Definition: Subjective qualitative assessment with no falsifiable outcome.
  Example: “Mahomes is the best QB ever.”
  How we score it: Not scored. Filtered at extraction; never enters the prediction ledger.

Commentary / Analogy / Joke / Rhetorical (speech_act_type: commentary, analogy, joke, rhetorical_question, hedge)
  Definition: Analysis, comparison, humor, hedged uncertainty, or rhetorical framing with no testable outcome.
  Example: “They're playing like it's 1985.”
  How we score it: Not scored. Filtered at extraction.

Schema Mapping

The public types above map to the internal speech_act_type field documented in pipeline/src/domain_protocol.py. The first four types map 1:1; the last groups several internal values. Only claims with speech_act_type of assertion, conditional, or recall are promoted to the scoreable prediction ledger. Everything else is filtered upstream.
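The promotion rule can be sketched as a simple allow-list filter. This is an illustration only: the dict-based claim shape and the function names are assumptions, and the real logic lives in the pipeline module named above, which is not reproduced here.

```python
# The three speech_act_type values named above as eligible for the ledger.
SCOREABLE_SPEECH_ACTS = frozenset({"assertion", "conditional", "recall"})


def is_scoreable(claim: dict) -> bool:
    """True if the claim's speech_act_type is promoted to the ledger."""
    return claim.get("speech_act_type") in SCOREABLE_SPEECH_ACTS


def promote(claims):
    """Keep only claims that qualify for the scoreable prediction ledger."""
    return [c for c in claims if is_scoreable(c)]


extracted = [
    {"text": "The Eagles will win the Super Bowl.", "speech_act_type": "assertion"},
    {"text": "Mahomes is the best QB ever.", "speech_act_type": "opinion"},
    {"text": "They're playing like it's 1985.", "speech_act_type": "analogy"},
    {"text": "If they draft a left tackle, the line will be top-10.",
     "speech_act_type": "conditional"},
]
ledger_candidates = promote(extracted)
```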

Scoring

How a Prediction Is Scored

Every resolved prediction is reduced to three numbers. The math is public so anyone can replicate it.

binary_correct

The raw outcome value, in {0, 0.5, 1}:

  • 1.0: CORRECT (claim came true)
  • 0.5: PARTIAL (partial credit, e.g., predicted score within 3)
  • 0.0: INCORRECT (claim was wrong)

VOID and PENDING outcomes are excluded — they do not appear in the denominator of any aggregate.

accuracy = mean(binary_correct) over resolved, non-VOID predictions
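A minimal sketch of this aggregate, assuming resolutions arrive as the status strings used above. The mapping and function name are illustrative; the key point is that VOID and PENDING are dropped before the mean is taken.

```python
OUTCOME_VALUES = {"CORRECT": 1.0, "PARTIAL": 0.5, "INCORRECT": 0.0}


def accuracy(resolutions):
    """Mean binary_correct over resolved, non-VOID predictions."""
    # VOID and PENDING have no entry in OUTCOME_VALUES, so they never
    # reach the numerator or the denominator.
    scored = [OUTCOME_VALUES[r] for r in resolutions if r in OUTCOME_VALUES]
    if not scored:
        return None
    return sum(scored) / len(scored)


season = ["CORRECT", "INCORRECT", "PARTIAL", "VOID", "PENDING", "CORRECT"]
season_accuracy = accuracy(season)  # (1 + 0 + 0.5 + 1) / 4
```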

weighted_score

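Confidence-weighted accuracy. A correct “lock of the week” (confidence 0.95) contributes 0.95, while a correct hedge at 0.55 contributes 0.55; a miss contributes 0 at any confidence, so a pundit who calls everything a lock and misses gains nothing from the bravado. The extra penalty for confident misses comes from the Brier score. The formula: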

weighted_score = mean(binary_correct × confidence)
where confidence ∈ [0.5, 1.0]

Confidence is extracted from the language of the claim (“guaranteed”, “might”, “hammer this”) when a pundit doesn't state a probability directly. Predictions where confidence can't be inferred default to 0.5 and contribute neutrally.
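The formula can be sketched directly. The example below assumes predictions arrive as (binary_correct, confidence) pairs, with `None` standing in for a confidence that could not be inferred; that input shape is an assumption. Note that under the formula as written, a miss contributes 0 regardless of confidence.

```python
DEFAULT_CONFIDENCE = 0.5  # used when confidence cannot be inferred


def weighted_score(predictions):
    """mean(binary_correct * confidence), with confidence in [0.5, 1.0]."""
    terms = []
    for binary_correct, confidence in predictions:
        if confidence is None:
            confidence = DEFAULT_CONFIDENCE
        if not 0.5 <= confidence <= 1.0:
            raise ValueError("confidence must lie in [0.5, 1.0]")
        terms.append(binary_correct * confidence)
    if not terms:
        return None
    return sum(terms) / len(terms)


resolved = [
    (1.0, 0.95),  # "lock of the week", hit: contributes 0.95
    (0.0, 0.95),  # "guaranteed", miss: contributes 0
    (1.0, None),  # no confidence language, hit: contributes 0.5
    (0.5, 0.60),  # partial credit at modest confidence: contributes 0.3
]
score = weighted_score(resolved)
```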

Brier score

The gold standard for measuring calibration. For each probabilistic prediction with stated probability p and outcome o ∈ {0, 1}:

brier_i = (p_i − o_i)²
brier_overall = mean(brier_i)

  • 0.00: perfect score (every forecast fully confident and correct)
  • 0.25: uninformative (random guessing at 50%)
  • 1.00: maximally wrong with maximum confidence

Lower is better. The Brier score rewards pundits who say “60% confident” and are right 60% of the time — not the ones who say “lock” on everything and hit 52%.
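To make the comparison concrete, here is a small sketch that computes the Brier score for the two pundits described above. Treating "lock" as a stated probability of 0.95 is an assumption borrowed from the weighted_score example; the sample sizes are made up for illustration.

```python
def brier_score(forecasts):
    """Mean squared gap between stated probability p and outcome o in {0, 1}."""
    if not forecasts:
        return None
    return sum((p - o) ** 2 for p, o in forecasts) / len(forecasts)


# Says "60% confident" on every pick and is right 6 times out of 10.
calibrated = [(0.6, 1)] * 6 + [(0.6, 0)] * 4

# Says "lock" (assumed p = 0.95) on every pick and hits 52 out of 100.
lock_merchant = [(0.95, 1)] * 52 + [(0.95, 0)] * 48

calibrated_brier = brier_score(calibrated)        # about 0.24
lock_merchant_brier = brier_score(lock_merchant)  # about 0.43
```

The calibrated pundit lands just under the 0.25 coin-flip mark, while the lock merchant scores far worse despite a similar hit rate.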

Why three numbers, not one?

A pundit can have high accuracy and a poor Brier score: they hit often but always at low stated confidence, so the confidence carries little information. Or take a contrarian who calls underdogs with 70% confidence and hits 30% of the time: accuracy alone shows a weak record, but only the Brier score shows how far the stated confidence sits from reality. Each number catches a different failure mode.

VOID

When a Prediction Becomes VOID

VOID is not the same as wrong. VOID means the claim cannot be fairly judged — and we exclude it from every aggregate rather than silently counting it against the pundit.

The four VOID conditions

A prediction resolves VOID when any one of the following applies:

  1. The underlying event the claim depends on was cancelled or never occurred.
  2. A report was accurate at the time but circumstances changed before resolution (trade fell through, injury healed, player traded).
  3. A conditional claim's premise (the “if X”) never resolved.
  4. Resolution criteria are ambiguous and no authoritative ground truth exists. An explanation is attached to every ambiguity-based VOID.
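The four conditions can be sketched as a single check that returns a reason code, or nothing if the claim should be scored normally. Everything concrete here is hypothetical: the reason-code names, the claim fields (`event_cancelled`, `circumstances_changed`, `premise_resolved`, `ambiguous`, `explanation`), and the order of the checks are assumptions for illustration.

```python
from enum import Enum
from typing import Optional


class VoidReason(Enum):
    EVENT_CANCELLED = "event_cancelled"              # condition 1
    CIRCUMSTANCES_CHANGED = "circumstances_changed"  # condition 2
    PREMISE_UNRESOLVED = "premise_unresolved"        # condition 3
    AMBIGUOUS_CRITERIA = "ambiguous_criteria"        # condition 4


def void_reason(claim: dict) -> Optional[VoidReason]:
    """Return the VOID reason for a claim, or None if it should be scored."""
    if claim.get("event_cancelled"):
        return VoidReason.EVENT_CANCELLED
    if claim.get("speech_act_type") == "recall" and claim.get("circumstances_changed"):
        return VoidReason.CIRCUMSTANCES_CHANGED
    if claim.get("speech_act_type") == "conditional" and not claim.get("premise_resolved"):
        return VoidReason.PREMISE_UNRESOLVED
    if claim.get("ambiguous"):
        # Ambiguity-based VOIDs must always carry an explanation.
        if not claim.get("explanation"):
            raise ValueError("ambiguity-based VOID requires an explanation")
        return VoidReason.AMBIGUOUS_CRITERIA
    return None


postponed = {"speech_act_type": "assertion", "event_cancelled": True}
mahomes_inactive = {"speech_act_type": "conditional", "premise_resolved": False}
normal_pick = {"speech_act_type": "assertion"}
```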

Worked examples

Example 1 — Report, deal fell through

“The Eagles are finalizing a trade for DK Metcalf, per sources.” — no trade materializes before the deadline.

Resolution: VOID. The pundit may have accurately reported the state of negotiations; the deal's collapse is not a failed prediction, so the report is not scored as a miss.

Example 2 — Cancelled event

“Vikings -2.5 on Sunday” — the game is postponed and never rescheduled in the same season.

Resolution: VOID. The event the claim was about no longer exists. Counting this against the pundit would be unfair.

Example 3 — Conditional premise never met

“If Mahomes plays, Chiefs win by 10+.” — Mahomes is inactive on game day.

Resolution: VOID. The conditional was never triggered. A pundit cannot be scored on a claim whose premise didn't happen.

Example 4 — Ambiguous resolution criteria

“The Bears will have a top-tier defense this year.” — “top-tier” isn't defined; the team finishes 12th by DVOA and 6th by yards allowed.

Resolution: VOID with explanation. If we had a clear quantitative threshold from the pundit (top 10 in DVOA, top 5 in points allowed) we would score it. Without one, scoring is arbitrary and we say so publicly.

VOID is auditable, not a get-out-of-jail card.

Every VOID resolution is logged with a reason code and remains visible on the prediction's page. If a pundit accumulates an unusual rate of VOIDs, that pattern is itself visible — and suggestive — in the public record.

Integrity

The Immutable Ledger

Every prediction is cryptographically sealed at the moment of ingestion.

Hash-chained records

Each prediction receives a SHA-256 hash that includes the previous record's hash — forming an unbroken chain. Altering any record would break every hash that follows.

Append-only storage

The prediction ledger is write-once. No one — not even us — can edit or delete a prediction after it's been recorded. The infrastructure enforces this at every layer.

Public verification

Chain integrity can be independently verified via our API. If the data has been tampered with, anyone can detect it.
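The hash-chaining and verification ideas can be sketched in a few lines. This is not the ledger's actual record format or API: the JSON serialization, the all-zero genesis hash, the record fields, and the function names are all assumptions chosen to make the example self-contained.

```python
import hashlib
import json
from typing import Optional

GENESIS_HASH = "0" * 64  # assumed placeholder for "no previous record"


def record_hash(payload: dict, prev_hash: str) -> str:
    """SHA-256 over the previous record's hash plus a canonical payload."""
    body = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


def append(ledger: list, payload: dict) -> None:
    """Add a record whose hash commits to everything before it."""
    prev_hash = ledger[-1]["hash"] if ledger else GENESIS_HASH
    ledger.append({
        "payload": payload,
        "prev_hash": prev_hash,
        "hash": record_hash(payload, prev_hash),
    })


def verify(ledger: list) -> Optional[int]:
    """Return the index of the first broken record, or None if intact."""
    prev_hash = GENESIS_HASH
    for index, record in enumerate(ledger):
        expected = record_hash(record["payload"], prev_hash)
        if record["prev_hash"] != prev_hash or record["hash"] != expected:
            return index
        prev_hash = record["hash"]
    return None


ledger = []
append(ledger, {"pundit": "A", "claim": "Chiefs -3"})
append(ledger, {"pundit": "B", "claim": "Over 47.5"})
append(ledger, {"pundit": "A", "claim": "Eagles win the Super Bowl"})
intact = verify(ledger)

ledger[1]["payload"]["claim"] = "Under 47.5"  # simulate tampering
first_broken = verify(ledger)
```

Because each hash includes the previous one, quietly repairing the tampered record would also require rewriting every record after it.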

Disputes

Flagging a Scoring Error

We score every prediction the same way. If we got one wrong, we want to know.

Flag any prediction

Use the flag button on any prediction page. Tell us what you think we got wrong and include a source.

5-day review

Flags are reviewed within 5 business days. Confirmed errors are corrected and logged publicly in the prediction's history.

Important: Pundits cannot unilaterally remove predictions from their record. The ledger is immutable. A confirmed scoring error results in a correction entry — the original prediction remains visible with an updated resolution.

Dispute publicly on GitHub

We track adjudication disputes as public issues so the discussion and outcome are part of the record. File a dispute with the prediction URL, the current resolution, the resolution you believe is correct, and a link to authoritative source data.

Or email corrections@cap-alpha.co. Methodology questions: support@cap-alpha.co.

Editorial Standards

Editorial Standards

Our commitment to independence, transparency, and consistent editorial practice.

Editorial Independence

Cap Alpha is not affiliated with any pundit, media organization, sports team, league, or sportsbook. Pundits are selected based on volume of public predictions. Scoring is automated and applies uniformly — no pundit can pay to be excluded, promoted, or re-scored.

Claim Selection Policy

Categories Tracked

Game outcome, player performance, trade, draft pick, injury, contract

Excluded

Entertainment predictions, non-sports content, and ambiguous claims that cannot be objectively resolved

Quality Threshold

A minimum testability threshold (testability score ≥ 0.6) is required for a claim to enter the ledger

Resolution Sources

Outcomes are determined from authoritative public sources:

Sports Reference / PFR — game outcomes and box scores

Spotrac and Over the Cap — contracts and transactions

Official league transaction wire — roster moves

Edge cases resolved by manual review are logged and attributed.

Extraction Transparency

Predictions are extracted using LLM assistance. The model version and prompt version are logged with each prediction. The raw verbatim quote is always stored alongside the extracted structured claim. Both are publicly visible on every prediction card.

See it in action

Check the leaderboard to see how your favorite pundits actually perform.
