It turns out the simplest March Madness strategy, just picking the higher seed, is very good. A 1 beats a 16. A 2 beats a 15. Picking the higher seed gets 70% of games right.
We started with over 110 candidate features to predict the win probability between two teams and ended up keeping only 16. Trained on 23 seasons of data, that 16-feature model improves accuracy by only 2.3 percentage points over the seed-only baseline.
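That baseline is easy to check. Here is a minimal sketch, assuming a hypothetical `games` DataFrame with one row per tournament game and `winner_seed`/`loser_seed` columns (lower number = better seed); same-seed matchups, where the baseline has no pick, are excluded.

```python
import pandas as pd

# Hypothetical results table: one row per tournament game.
games = pd.DataFrame({
    "winner_seed": [1, 8, 3, 12],
    "loser_seed":  [16, 9, 14, 5],
})

# Drop same-seed matchups (e.g. Final Four games), where seed gives no pick.
decided = games[games["winner_seed"] != games["loser_seed"]]

# The baseline is right whenever the better (numerically lower) seed won.
baseline_accuracy = (decided["winner_seed"] < decided["loser_seed"]).mean()
print(f"Seed-only baseline accuracy: {baseline_accuracy:.1%}")
```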
The most important feature by far is AdjEM (Adjusted Efficiency Margin): the difference between a team's points scored and points allowed per 100 possessions, adjusted for opponent strength. A team that beats weak opponents by 20 looks less impressive than one that beats strong opponents by 10. In the trained model, AdjEM carries more feature importance than all the other features combined.
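To make that concrete, here is a minimal sketch of how an AdjEM gap can be turned into a head-to-head win probability with a logistic curve. The `scale` parameter is an illustrative stand-in, not the model's fitted coefficient.

```python
import math

def win_probability(adj_em_a: float, adj_em_b: float, scale: float = 0.1) -> float:
    """Logistic win probability for team A from the AdjEM difference.

    `scale` controls how quickly probability saturates with the efficiency gap;
    the value here is illustrative, not a fitted coefficient.
    """
    return 1.0 / (1.0 + math.exp(-scale * (adj_em_a - adj_em_b)))

# A +25 AdjEM team vs a +5 AdjEM team: about 88% under this toy scale.
print(f"{win_probability(25.0, 5.0):.1%}")
```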
Try it yourself: select two teams to see how the model predicts win probability. [interactive demo]
Over 90 features were evaluated and rejected. Raw counting stats (points, rebounds, steals, blocks) were too noisy without opponent adjustment. Ranking systems (KenPom, Sagarin, Massey, and 6 others) were highly correlated with each other and with AdjEM. Coach experience (tournament wins, appearances, Final Four trips) was statistically redundant with program history and seed. Clutch performance (close-game win rate, overtime record) suffered from small sample sizes. Shooting variance and conference tournament results also added no value in cross-validation.
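The correlation-based pruning step can be sketched as follows. This is a simplified illustration, assuming a feature matrix `X` as a pandas DataFrame and a hypothetical 0.9 cutoff; the actual thresholds and the cross-validation harness that confirmed each rejection are not shown.

```python
import pandas as pd

def drop_correlated(X: pd.DataFrame, threshold: float = 0.9) -> pd.DataFrame:
    """Keep a feature only if it is not highly correlated with an earlier-kept one."""
    corr = X.corr().abs()
    kept: list[str] = []
    for col in X.columns:
        if all(corr.loc[col, k] < threshold for k in kept):
            kept.append(col)
    return X[kept]

# Toy example: a ranking column that tracks AdjEM closely collapses into it.
X = pd.DataFrame({
    "adj_em": [20.1, 5.3, -3.2, 12.8],
    "kenpom": [20.0, 5.5, -3.0, 12.5],   # nearly duplicates adj_em
    "tempo":  [68.0, 71.2, 65.4, 69.9],
})
print(drop_correlated(X).columns.tolist())  # ['adj_em', 'tempo']
```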
We run a Monte Carlo simulation: the entire bracket is played out 50,000 times, with each simulated game's outcome drawn at random, weighted by the model's win probabilities. In a bracket, Team A's path to the Final Four depends on who else advances; simulation captures this path dependency, which a pairwise model alone cannot.
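A minimal sketch of the simulation loop, assuming a seed-ordered, power-of-two field and a hypothetical `win_prob(a, b)` function like the logistic one above; the real tournament structure and per-round tallying are more involved.

```python
import random
from collections import Counter

def simulate_bracket(teams, win_prob, rng):
    """Play one single-elimination bracket; teams are paired in order each round."""
    field = list(teams)
    while len(field) > 1:
        next_round = []
        for a, b in zip(field[::2], field[1::2]):
            # Draw the game's outcome weighted by the model's probability.
            next_round.append(a if rng.random() < win_prob(a, b) else b)
        field = next_round
    return field[0]

def championship_odds(teams, win_prob, n_sims=50_000, seed=0):
    """Estimate each team's title probability across repeated simulations."""
    rng = random.Random(seed)
    wins = Counter(simulate_bracket(teams, win_prob, rng) for _ in range(n_sims))
    return {team: wins[team] / n_sims for team in teams}

# Toy 4-team field with hypothetical AdjEM ratings.
adj_em = {"Duke": 28.0, "Gonzaga": 24.0, "Yale": 8.0, "Drake": 6.0}
win_prob = lambda a, b: 1 / (1 + 10 ** (-(adj_em[a] - adj_em[b]) / 20))
print(championship_odds(list(adj_em), win_prob))
```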
Advancement probabilities from 50,000 simulated tournaments. The green and red deltas show where our model diverges from the seed-only baseline: the places where all those extra features actually shift the prediction.
| Team | Seed | R64 | R32 | S16 | E8 | F4 | Champ | vs Seed |
|---|---|---|---|---|---|---|---|---|
Season-by-season accuracy:
The model fares worst in upset-heavy years like 2011 and 2014.