Calibration Training: How Accurate Are Your Probabilities? | The Science of Superforecasting

You can decompose, blend views, update Bayes-style, and aggregate lenses—and still lie to yourself with precision theater. Calibration training is where superforecasting becomes accountable: your fifty-five percent must mean fifty-five percent over many events, or your Kelly sizing is built on sand.

Traders who skip calibration often confuse being right once with being well-calibrated. Markets forgive the former until they do not. Calibration asks the uncomfortable question: when you say sixty-two percent, does the event happen about sixty-two percent of the time over many trials? Training turns your fair probability from rhetoric into a measurable skill linking Brier scoring, trade journals, and mispricing economics.

Calibration is not confidence or hit rate

Calibration means stated probabilities match long-run frequencies. Confidence is how sure you feel, and the two diverge constantly. Accuracy in the sense of "I called YES and won" hides overconfidence: you can hit often and still post an awful Brier score if you price every favorite at eighty-five percent and those favorites win noticeably less often than that.

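Resolution (sharpness) is whether you use the full scale. Always saying fifty percent is calibrated whenever about half your questions resolve YES, but it is useless. The Brier score is the squared distance between p and the outcome (1 for YES, 0 for NO), averaged over forecasts, so lower is better; a good Brier score needs calibration and sensible sharpness together.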
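A minimal sketch of the Brier score, to make the gap between hit rate and calibration concrete. The journal here is hypothetical: ten favorites, seven of which resolve YES, scored three ways.

```python
def brier_score(forecasts, outcomes):
    """Mean squared distance between stated probability and the 0/1 outcome."""
    if len(forecasts) != len(outcomes) or not forecasts:
        raise ValueError("need equal-length, non-empty inputs")
    return sum((p - o) ** 2 for p, o in zip(forecasts, outcomes)) / len(forecasts)


# Hypothetical journal: ten favorites, seven resolve YES (a 70% hit rate).
outcomes = [1] * 7 + [0] * 3

overconfident = brier_score([0.85] * 10, outcomes)  # every call labeled 85%
honest = brier_score([0.70] * 10, outcomes)         # every call labeled 70%
coin_flip = brier_score([0.50] * 10, outcomes)      # every call labeled 50%

print(round(overconfident, 4), round(honest, 4), round(coin_flip, 4))
# prints: 0.2325 0.21 0.25
```

A seventy percent hit rate feels like winning, yet the eighty-five percent labels score barely better than saying fifty percent on everything.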

Reading price as probability on trading venues taught you the market's language; calibration makes sure your own language is honest before you bet against consensus.

The reliability diagram

Bucket forecasts by stated probability—everything you labeled fifty-five to sixty-five percent, then sixty-five to seventy-five, and so on. For each bucket, plot hit rate (share that resolved YES) against the bucket midpoint.

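Perfect calibration lies on the diagonal. Points below the diagonal in high buckets mean overconfidence: you said eighty percent and reality behaved like sixty-five percent. Points above the diagonal in low buckets are the same error pointed at NO: events you called unlikely happened more often than you said. Underconfidence is the reverse pattern, with high buckets sitting above the diagonal and low buckets sitting below it.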
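The bucketing step can be sketched as below. This is an illustration, not a prescribed tool: the function name, the ten-point bucket width, and the choice to put bucket edges at multiples of ten are all assumptions, and the sample forecasts are made up.

```python
from collections import defaultdict


def reliability_table(forecasts, outcomes, width_pct=10):
    """Return (midpoint_pct, n, hit_rate) rows, one per non-empty bucket.

    Buckets are [0, 10), [10, 20), ... in percent; 100% joins the top bucket.
    """
    n_bins = 100 // width_pct
    buckets = defaultdict(list)
    for p, outcome in zip(forecasts, outcomes):
        pct = round(p * 100)  # integer percent avoids float division surprises
        idx = min(pct // width_pct, n_bins - 1)
        buckets[idx].append(outcome)
    rows = []
    for idx in sorted(buckets):
        hits = buckets[idx]
        midpoint = idx * width_pct + width_pct / 2
        rows.append((midpoint, len(hits), sum(hits) / len(hits)))
    return rows


# Hypothetical forecasts and outcomes (1 = resolved YES).
forecasts = [0.82, 0.85, 0.80, 0.88, 0.22, 0.25]
outcomes = [1, 0, 1, 0, 0, 1]

for midpoint, n, hit_rate in reliability_table(forecasts, outcomes):
    gap = 100 * hit_rate - midpoint  # negative: point sits below the diagonal
    print(f"midpoint {midpoint:.0f}%  n={n}  hit rate {100 * hit_rate:.0f}%  gap {gap:+.0f}")
```

Plotting hit rate against midpoint for each row gives the reliability diagram; the sign of the gap says which side of the diagonal the point falls on.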

Fix the worst bucket gap first. After a hot streak, the seventy-to-ninety percent bins are where traders inflate—not the fifty-to-fifty-five band.

Building bins from journal rows

Log thirty resolved trades with pre-trade p never edited. Sort into buckets. Suppose nine forecasts lived in fifty-five to sixty-five percent and six resolved YES—hit rate sixty-seven percent versus midpoint sixty percent, mild underconfidence in that slice.

Seven forecasts in seventy to eighty percent with four YES—hit rate fifty-seven percent versus midpoint seventy-five percent, roughly eighteen points overconfident. Shrink new forecasts in that band by ten to twelve points until the bucket refills with at least twenty-five fresh calls.
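The arithmetic for those two buckets, written out as a small check. The helper name is an assumption; the counts and bucket edges are the ones in the example above.

```python
def bucket_gap(n, yes, low_pct, high_pct):
    """Return (hit_rate, midpoint, gap) in percentage points for one bucket."""
    hit_rate = 100 * yes / n
    midpoint = (low_pct + high_pct) / 2
    return hit_rate, midpoint, hit_rate - midpoint


buckets = {
    "55-65": (9, 6, 55, 65),  # nine forecasts, six resolved YES
    "70-80": (7, 4, 70, 80),  # seven forecasts, four resolved YES
}

for label, args in buckets.items():
    hit, mid, gap = bucket_gap(*args)
    print(f"{label}: hit {hit:.0f}% vs midpoint {mid:.0f}% -> gap {gap:+.0f} points")
# prints:
# 55-65: hit 67% vs midpoint 60% -> gap +7 points
# 70-80: hit 57% vs midpoint 75% -> gap -18 points
```

A positive gap means events happened more often than stated; a negative gap in a high bucket is the overconfidence the shrink is meant to correct.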

When Brier and P&L disagree

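Suppose ten trades leave the bankroll up eight percent while your mean Brier is 0.24 against a market baseline of 0.20. Lower Brier is better, so the market forecast those events better than you did: you got paid for luck or for carrying tail risk, not for calibration. Shrinking your probabilities and cutting size come before any aggressive Kelly sizing.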

Flat P&L with a Brier of 0.17 against the market's 0.21 suggests your beliefs are trustworthy and the edge is leaking through fees, slippage, or passivity. It does not mean you should firehose size before the economics work.
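One way to get the two numbers being compared is to score your own p and the market price at entry against the same outcomes. The sketch below assumes journal rows of the form (your p, market price, outcome); the rows and function names are hypothetical.

```python
def brier(probs, outcomes):
    """Mean squared error of probabilities against 0/1 outcomes."""
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(outcomes)


def brier_vs_market(rows):
    """rows: (my_p, market_price, outcome). Returns (mine, market, edge)."""
    outcomes = [r[2] for r in rows]
    mine = brier([r[0] for r in rows], outcomes)
    market = brier([r[1] for r in rows], outcomes)
    return mine, market, market - mine  # positive edge: you beat the market


# Hypothetical journal rows, not real trades.
journal = [
    (0.80, 0.65, 1),
    (0.30, 0.45, 0),
    (0.60, 0.55, 0),
    (0.75, 0.70, 1),
]

mine, market, edge = brier_vs_market(journal)
print(f"mine {mine:.3f}  market {market:.3f}  edge {edge:+.3f}")
```

Four rows prove nothing; the comparison only becomes informative over many locked forecasts, and P&L should be read alongside it rather than instead of it.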

Public leaderboards as mirror

Sites like Metaculus and Good Judgment Open are useful not because their prices are tradable, but because they force locks and public scores. Import the habit: deadline, probability, no edits, resolve, Brier. Your trading journal should feel equally strict.

Drills that do not require clicking

Trivia bins: answer fifty binary questions, forcing every answer between ten and ninety percent rather than zero or one hundred. Public forecasting sites: use questions with locked deadlines. Shadow books: log p and market price without trading. Dual forecast: write an outside-view p₁ and an inside-view p₂, then blend. Pre-mortem: assume the question resolved NO and explain why. Step size: use finer steps near fifty percent and coarser steps in the tails.

Gate: if you cannot assign p before seeing market price, you are anchored—write blind, then compare.

Shrinking toward consensus when bins scream

When a bucket is ten or more points overconfident with under twenty samples, blend your estimate with the market price, putting more weight on the market when miscalibration is severe. Whatever blend weights you start with, replace them with your own table as data grows; the point is that your economics should use an honest p, not the peak of your story.
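A sketch of such a blend. The weight schedule below is entirely hypothetical (the text gives no coefficients), and the function names are assumptions; only the trigger, ten or more points of overconfidence with under twenty samples, comes from the rule above.

```python
def market_weight(overconfidence_points, n_samples):
    """Weight placed on the market price. The schedule is illustrative only."""
    if overconfidence_points < 10 or n_samples >= 20:
        return 0.0  # outside the rule sketched here; handle these cases separately
    if overconfidence_points < 15:
        return 0.3
    if overconfidence_points < 20:
        return 0.5
    return 0.7


def shrunk_p(my_p, market_price, overconfidence_points, n_samples):
    """Linear blend of your estimate with the market price."""
    w = market_weight(overconfidence_points, n_samples)
    return (1 - w) * my_p + w * market_price


# A bucket that is 18 points overconfident on 7 samples; market at 0.66.
print(round(shrunk_p(0.80, 0.66, overconfidence_points=18, n_samples=7), 2))
# prints: 0.73
```

The linear blend is the simplest choice; the numbers in `market_weight` are placeholders to be replaced by whatever your own bins justify.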

Platform traps that poison bins

Play-money bravado inflates tails. Retail caps distort extremes you should not train on. AMM slippage means you must log effective price, not mid fantasy. Thin books and cascade chasing after spikes should be excluded from calibration sets or tagged separately.

Segment Brier by venue when rules and liquidity differ—one curve for regulated order books, another for thin on-chain props.
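Segmenting can be as simple as grouping journal rows by a venue tag before averaging. The row layout and the venue labels below are assumptions made for illustration.

```python
from collections import defaultdict


def brier_by_venue(rows):
    """rows: (venue, p, outcome). Returns {venue: (n, mean_brier)}."""
    totals = defaultdict(lambda: [0, 0.0])  # venue -> [count, sum of squared errors]
    for venue, p, outcome in rows:
        totals[venue][0] += 1
        totals[venue][1] += (p - outcome) ** 2
    return {venue: (n, sq_err / n) for venue, (n, sq_err) in totals.items()}


# Hypothetical rows with made-up venue tags.
rows = [
    ("order_book", 0.70, 1),
    ("order_book", 0.60, 0),
    ("onchain_thin", 0.90, 0),
    ("onchain_thin", 0.80, 1),
]

for venue, (n, score) in brier_by_venue(rows).items():
    print(f"{venue}: n={n}  Brier {score:.3f}")
```

Keeping the count next to each score matters: a venue with a handful of rows should not drive a shrink decision on its own.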

Ninety-day program sketch

Weeks one–two: every trade logs p, price, timestamp. Weeks three–four: forty public question locks without size. Weeks five–six: first reliability diagram; fix top gap. Weeks seven–eight: live shrink on entries. Weeks nine–ten: compare Brier to consensus. Weeks eleven–twelve: mispricing only when bins support range.

Sunday ritual: update diagram, one paragraph on dominant bias, adjust shrink.

Brier decomposition in plain language

Mean Brier over many events splits into calibration error (systematic bucket bias), resolution (whether you use informative spreads), and irreducible uncertainty. Traders fix calibration first because it is the bias you control directly. Do not flatten everything to fifty percent to look calibrated—that destroys resolution and earns a different kind of bad score.
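The three-way split can be checked numerically. The sketch below uses the standard Murphy decomposition, Brier = reliability - resolution + uncertainty, and groups forecasts by their exact stated value, which makes the identity exact; with wider bins an extra within-bin term appears. The data and function name are hypothetical.

```python
from collections import defaultdict


def brier_decomposition(forecasts, outcomes):
    """Return (reliability, resolution, uncertainty), grouping by distinct p."""
    n = len(forecasts)
    base_rate = sum(outcomes) / n
    groups = defaultdict(list)
    for p, outcome in zip(forecasts, outcomes):
        groups[p].append(outcome)
    # Reliability: squared gap between stated p and observed frequency (calibration error).
    rel = sum(len(ys) * (p - sum(ys) / len(ys)) ** 2 for p, ys in groups.items()) / n
    # Resolution: how far each group's frequency sits from the overall base rate.
    res = sum(len(ys) * (sum(ys) / len(ys) - base_rate) ** 2 for ys in groups.values()) / n
    # Uncertainty: variance of the outcomes themselves, outside your control.
    unc = base_rate * (1 - base_rate)
    return rel, res, unc


outcomes = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]
spread = brier_decomposition([0.8] * 5 + [0.3] * 5, outcomes)
flat = brier_decomposition([0.5] * 10, outcomes)

for name, (rel, res, unc) in {"spread": spread, "flat 50%": flat}.items():
    print(f"{name}: rel {rel:.3f}  res {res:.3f}  unc {unc:.3f}  Brier {rel - res + unc:.3f}")
```

In this made-up sample the spread forecaster is imperfectly calibrated but still scores 0.225, while the flat forecaster has zero resolution and scores 0.250.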

Beating market Brier by two points (0.02, for example 0.19 against 0.21) over a hundred locked forecasts is meaningful evidence if those forecasts were not copied from price at entry.

Worked example: lucky month, bad bins

Three wins at a stated eighty-five percent, when the true frequency in that bin is sixty percent, can still show positive P&L. The diagram screams shrink; Kelly, fed the stated eighty-five percent, says size up. Separate the celebration of P&L from the diagnosis of p.
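To see how far apart the two signals can be, the sketch below feeds both probabilities into the standard Kelly fraction for a binary contract bought at a given price and paying 1 on YES. The entry price of 0.70 is a hypothetical, fees are ignored, and the three trades are assumed independent.

```python
def kelly_fraction(p, price):
    """Kelly stake (fraction of bankroll) for buying YES at `price`, no fees.

    Floored at zero: a negative value means there is no edge on the YES side.
    """
    return max(0.0, (p - price) / (1 - price))


stated_p = 0.85  # what the trader wrote down
bin_freq = 0.60  # what the bucket actually delivers
price = 0.70     # hypothetical entry price, not from the example

print(f"Kelly on stated p:       {kelly_fraction(stated_p, price):.2f}")
print(f"Kelly on bin frequency:  {kelly_fraction(bin_freq, price):.2f}")
print(f"P(three wins in a row at 60%): {bin_freq ** 3:.3f}")
```

At this assumed price the stated probability asks for half the bankroll, the bin frequency asks for nothing, and three straight wins still happen about one time in five by luck alone.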

Red flags that pause scaling

Pause scaling when any of these appear. The ninety-percent bucket hits below eighty percent over fifteen or more trials: cap new highs at eighty percent until fixed. Your Brier is worse than "always copy the market" over a hundred events: default to humility. You edited p after resolution: reset the honor system. Your edge shows up only in illiquid ghost markets: remove those from the calibration sample.
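The numeric red flags lend themselves to a mechanical check run before any size increase. The function and argument names below are assumptions; the thresholds are the ones listed above, and the illiquidity flag is left out because it has no numeric threshold in the text.

```python
def scaling_red_flags(top_bucket_n, top_bucket_hit_rate,
                      my_brier, market_brier, n_events,
                      edited_after_resolution):
    """Return a list of reasons to pause scaling; an empty list means none fired."""
    flags = []
    if top_bucket_n >= 15 and top_bucket_hit_rate < 0.80:
        flags.append("ninety-percent bucket underperforms: cap new highs at 0.80")
    if n_events >= 100 and my_brier > market_brier:
        flags.append("Brier worse than copying the market: default humble")
    if edited_after_resolution:
        flags.append("p edited after resolution: reset the honor system")
    return flags


# Hypothetical snapshot of a journal.
flags = scaling_red_flags(
    top_bucket_n=16, top_bucket_hit_rate=0.75,
    my_brier=0.22, market_brier=0.20, n_events=120,
    edited_after_resolution=False,
)
for flag in flags:
    print(flag)
```

An empty list is not permission to scale; it only means none of these particular alarms went off.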

What comes next

You have a scoreboard. The next chapter is how top forecasters produce numbers daily—rhythms, team rituals, and revision rules that keep calibration gains through boring weeks and news floods.

Key ideas to carry forward

Bins reveal overconfidence. Fix the worst bucket first. Brier and P&L can diverge, so track both and diagnose them separately. Shrink before you size up. Ninety-day drills build the curve.

Calibration is the feedback loop that makes every other superforecasting habit measurable. Without bins, you are practicing style, not skill.

Treat calibration like physical training: repetitive, measurable, unglamorous, cumulative.

Your future self will trust a shrunk sixty-two percent more than a heroic eighty that your bins do not support.
