Scientific backing for MLB outcome prediction.
Our pipeline synthesizes historical play-by-play data with modern sabermetric theory and atmospheric physics. This page details the academic foundations of our model architecture.
Baseline Architecture
Model Methodology
We employ a dual-layered feature vector processed through a calibrated Logistic Regression model to isolate talent from variance.
- Predictive Accuracy of FIP (DIPS Theory): The model prioritizes Fielding Independent Pitching (FIP) over ERA. Following Voros McCracken’s DIPS theory (2001), we isolate the events a pitcher most directly controls: strikeouts (K), walks (BB), and home runs (HR). FIP stabilizes over a much smaller sample than ERA, roughly 100 innings versus 500 or more for ERA to shed defensive noise. McCracken, V. (2001). Pitching and Defense. Baseball Prospectus.
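- Linear Weights & wOBA Integration: Our offensive signals use Weighted On-Base Average (wOBA). Unlike OPS or AVG, wOBA weights each event with linear weights derived from the base-out run expectancy matrix (the same 24-state matrix that underlies RE24). A double (coefficient approx. 1.25) therefore counts for more than a single (approx. 0.9) in proportion to its run value rather than its total bases. These coefficients are on the wOBA scale, not raw runs. Tango, T. M., et al. (2007). The Book: Playing the Percentages in Baseball.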
- Logistic Regression for Noisy Binary Outcomes: For sports betting, we favor Logistic Regression (LR) for its robustness to stochastic noise. Because LR is fit by minimizing log loss, it tends to produce well-calibrated probabilities without post-hoc correction, whereas gradient-boosted trees (GBDT, e.g. XGBoost) often need a separate calibration step. This keeps the Brier score more stable in high-randomness environments. Bradbury, J. C. (2007). Does the Pitcher Matter? Journal of Sports Economics.
- Market Efficiency & Closing Line Value (CLV): We evaluate model performance primarily through CLV. We treat the betting market as semi-strong efficient, so the closing line reflects all public situational information (injuries, sharp action). Consistently beating the closing line is therefore a more reliable, lower-variance indicator of long-term ROI than short-run win/loss results. Thaler, R. H., & Ziemba, W. T. (1988). Parimutuel Betting Markets.
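The two feature definitions above can be sketched in a few lines of Python. This is a minimal illustration rather than the pipeline's actual code. The FIP constant and the wOBA weights below are assumed, typical values (both are re-derived every season), and the function names and the `counts` dictionary layout are hypothetical.

```python
# Assumption: the FIP constant is recomputed each season so that league FIP
# matches league ERA. 3.10 is only a typical value, not a figure from this page.
FIP_CONSTANT = 3.10

# Assumption: illustrative wOBA weights. Published weights change every
# season with the run environment.
WOBA_WEIGHTS = {
    "ubb": 0.69,     # unintentional walk
    "hbp": 0.72,     # hit by pitch
    "single": 0.89,
    "double": 1.27,
    "triple": 1.62,
    "hr": 2.10,
}


def fip(hr, bb, hbp, k, ip, constant=FIP_CONSTANT):
    """Fielding Independent Pitching from home runs, walks, HBP, and strikeouts."""
    if ip <= 0:
        raise ValueError("innings pitched must be positive")
    return (13 * hr + 3 * (bb + hbp) - 2 * k) / ip + constant


def woba(counts, weights=WOBA_WEIGHTS):
    """Weighted On-Base Average from a dict of event counts.

    `counts` needs the six weighted events plus "ab" (at-bats) and
    "sf" (sacrifice flies). Intentional walks are excluded via "ubb".
    """
    numerator = sum(weights[event] * counts[event] for event in weights)
    denominator = counts["ab"] + counts["ubb"] + counts["sf"] + counts["hbp"]
    if denominator <= 0:
        raise ValueError("no plate appearances to evaluate")
    return numerator / denominator


if __name__ == "__main__":
    print(round(fip(hr=20, bb=50, hbp=5, k=200, ip=200.0), 3))
    season = {"ab": 500, "ubb": 50, "hbp": 5, "sf": 5,
              "single": 100, "double": 30, "triple": 3, "hr": 20}
    print(round(woba(season), 3))
```

Keeping the constant and the weights as parameters makes it straightforward to swap in season-specific values.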
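The evaluation metrics named above, the Brier score and CLV, can be illustrated with a short Python sketch. It assumes American odds and defines CLV as the difference in implied probability between the closing price and the price taken. That is one common convention, not necessarily the one used in this pipeline, and the vig is not removed. All function names are hypothetical.

```python
def implied_probability(american_odds):
    """Convert American odds to the implied win probability (vig included)."""
    if american_odds == 0:
        raise ValueError("American odds cannot be zero")
    if american_odds < 0:
        # Favorite: risk |odds| to win 100.
        return -american_odds / (-american_odds + 100)
    # Underdog: risk 100 to win `odds`.
    return 100 / (american_odds + 100)


def closing_line_value(bet_odds, closing_odds):
    """Assumed CLV definition: closing implied probability minus the
    implied probability of the price taken. Positive means the bet
    was placed at a better price than the close."""
    return implied_probability(closing_odds) - implied_probability(bet_odds)


def brier_score(probabilities, outcomes):
    """Mean squared error between forecast probabilities and 0/1 outcomes."""
    if len(probabilities) != len(outcomes) or not probabilities:
        raise ValueError("need equally sized, non-empty inputs")
    total = sum((p - y) ** 2 for p, y in zip(probabilities, outcomes))
    return total / len(probabilities)


if __name__ == "__main__":
    # Bet taken at +120, line closed at +100: the market moved toward our side.
    print(round(closing_line_value(bet_odds=120, closing_odds=100), 4))
    print(round(brier_score([0.6, 0.4], [1, 0]), 2))
```

A lower Brier score indicates better probability forecasts; a coin-flip forecast of 0.5 on every game scores 0.25.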
Active Development
Strategic Roadmap
Our upcoming integrations focus on situational physics and latent skill variables that market averages often under-weight.
- Hierarchical Bayesian Catcher Framing: We are implementing a Bayesian model that estimates called-strike probability by treating the catcher as a random effect in a multi-level model. This separates a catcher's framing value from umpire and pitcher effects. Deshpande, S. K., & Wyner, A. J. (2017). Bayesian Model of Pitch Framing.
- Atmospheric Physics & Air Density Adjustments: We are integrating dynamic coefficients for air density ($\rho$). Warmer air is less dense and exerts less drag, so a 10°F rise in temperature adds roughly 3.3 ft of carry to a well-hit fly ball. We track game-time temperature, dew point, and barometric pressure to adjust home run probabilities. Nathan, A. M. (2008). The effect of spin on the flight of a baseball.
- Fatigue Mechanics & TTTO Penalty: We track the "Times Through The Order" (TTTO) penalty and bullpen pitch counts over the trailing 48 hours. Fresh high-velocity relievers typically outperform tiring starters who are facing a lineup for the third time. Lichtman, M. G. (2013). The Third Time Through the Order.
- Statcast Matchup Vectors (xwOBA): We use Exit Velocity (EV) and Launch Angle (LA) distributions to construct expected metrics such as xwOBA. These capture "hard-hit" form that traditional box scores miss, providing a cleaner signal of future offensive output. Savant/Statcast (2024). Expected Metrics Methodology.
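As an illustration of the air density input described above, the sketch below computes $\rho$ from temperature, pressure, and dew point. It uses the ideal gas law for a mix of dry air and water vapor, with the Magnus approximation for vapor pressure. It is a sketch under stated assumptions, not the production implementation: the function names are hypothetical, the pressure is assumed to be station pressure (not sea-level corrected), and the mapping from density to home run probability is left to the model.

```python
import math

R_DRY = 287.058    # J/(kg*K), specific gas constant of dry air
R_VAPOR = 461.495  # J/(kg*K), specific gas constant of water vapor


def fahrenheit_to_celsius(temp_f):
    return (temp_f - 32.0) * 5.0 / 9.0


def vapor_pressure_pa(dew_point_c):
    """Actual water vapor pressure in Pa from dew point (Magnus approximation)."""
    return 610.94 * math.exp(17.625 * dew_point_c / (dew_point_c + 243.04))


def air_density(temp_f, pressure_hpa, dew_point_f):
    """Density of moist air in kg/m^3.

    Assumption: `pressure_hpa` is the station pressure at the ballpark,
    not a sea-level corrected reading.
    """
    temp_k = fahrenheit_to_celsius(temp_f) + 273.15
    p_vapor = vapor_pressure_pa(fahrenheit_to_celsius(dew_point_f))
    p_dry = pressure_hpa * 100.0 - p_vapor
    return p_dry / (R_DRY * temp_k) + p_vapor / (R_VAPOR * temp_k)


if __name__ == "__main__":
    cool_night = air_density(temp_f=60, pressure_hpa=1013.25, dew_point_f=45)
    hot_afternoon = air_density(temp_f=95, pressure_hpa=1013.25, dew_point_f=70)
    # Hot, humid air is less dense, which reduces drag on the ball.
    print(round(cool_night, 3), round(hot_afternoon, 3))
```

Because water vapor is lighter than dry air, raising the dew point at a fixed temperature and pressure lowers the density slightly, which is why humidity is tracked alongside temperature.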