Technical Methodology

Scientific backing for MLB outcome prediction.

Our pipeline synthesizes historical play-by-play data with modern sabermetric theory and atmospheric physics. This page details the academic foundations of our model architecture.

Baseline Architecture

Model Methodology

We employ a dual-layered feature vector processed through a calibrated Logistic Regression model to isolate talent from variance.

Predictive Accuracy of FIP (DIPS Theory): The model prioritizes Fielding Independent Pitching (FIP) over ERA. Following Voros McCracken’s DIPS theory (2001), we isolate the events a pitcher controls most directly: strikeouts (K), walks (BB), and home runs (HR). FIP is commonly cited as stabilizing in roughly 100 innings, whereas ERA needs 500+ innings before defensive noise averages out. McCracken, V. (2001). Pitching and Defense. Baseball Prospectus.
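As a minimal sketch, FIP can be computed from the widely published formula below. Two things here go beyond the text above and should be read as assumptions: the standard formula also counts hit-by-pitch (HBP) alongside walks, and the additive constant is league- and season-specific, so the 3.10 default is only a placeholder.

```python
def fip(hr: int, bb: int, hbp: int, k: int, ip: float,
        fip_constant: float = 3.10) -> float:
    """Fielding Independent Pitching from the standard public formula.

    ip is true decimal innings (180.333 for 180 1/3), not the box-score
    shorthand "180.1". fip_constant is an assumed placeholder; the real
    value is recomputed per league and season.
    """
    if ip <= 0:
        raise ValueError("ip must be positive")
    return (13 * hr + 3 * (bb + hbp) - 2 * k) / ip + fip_constant


# Hypothetical season line, invented for illustration only.
example_fip = fip(hr=20, bb=50, hbp=5, k=200, ip=180.0)
print(round(example_fip, 2))
```

Only home runs, walks, hit batters, and strikeouts enter the numerator, so balls in play (and the defense behind the pitcher) never affect the result.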
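Linear Weights & wOBA Integration: Our offensive signals use Weighted On-Base Average (wOBA). Unlike OPS or AVG, wOBA applies linear weights derived from the run expectancy matrix of the 24 base-out states, so a double (weight approx. 1.25) is credited relative to a single (approx. 0.9) in proportion to its run value rather than its total bases. These weights are expressed on the wOBA scale, not in raw runs. Tango, T. M., et al. (2007). The Book: Playing the Percentages in Baseball.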
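The calculation can be sketched as follows. The weights are illustrative round numbers in the range commonly published for recent seasons, not the values our model uses; actual weights are recomputed every season. The function name and the sample batting line are hypothetical.

```python
# Illustrative wOBA-scale weights (assumed; real weights change each season).
WOBA_WEIGHTS = {
    "ubb": 0.69,     # unintentional walk
    "hbp": 0.72,     # hit by pitch
    "single": 0.89,
    "double": 1.27,
    "triple": 1.62,
    "hr": 2.10,
}


def woba(ab: int, ubb: int, hbp: int, sf: int, singles: int, doubles: int,
         triples: int, hr: int, weights: dict = WOBA_WEIGHTS) -> float:
    """Weighted On-Base Average: weighted events over AB + uBB + SF + HBP."""
    numerator = (weights["ubb"] * ubb + weights["hbp"] * hbp
                 + weights["single"] * singles + weights["double"] * doubles
                 + weights["triple"] * triples + weights["hr"] * hr)
    denominator = ab + ubb + sf + hbp
    if denominator <= 0:
        raise ValueError("no qualifying plate appearances")
    return numerator / denominator


# Hypothetical batting line, invented for illustration only.
example_woba = woba(ab=500, ubb=50, hbp=5, sf=5,
                    singles=90, doubles=30, triples=3, hr=25)
print(round(example_woba, 3))
```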
Logistic Regression for Noisy Binary Outcomes: For sports betting, we favor Logistic Regression (LR) for its robustness to stochastic noise. Because it is fit by minimizing log loss, LR tends to produce well-calibrated probabilities without a separate calibration step, whereas gradient-boosted trees (GBDT, e.g. XGBoost) often need post-hoc calibration. This helps keep the Brier score stable in high-randomness environments. Bradbury, J. C. (2007). Does the Pitcher Matter? Journal of Sports Economics.
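To make the Brier score comparison concrete, here is a self-contained sketch that fits a one-feature logistic regression by batch gradient descent on synthetic data and scores it against a coin-flip baseline. Everything in it is an assumption for illustration: the single feature, the synthetic data generator, and the learning-rate settings are not taken from our production model.

```python
import math
import random


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(xs, ys, lr=0.5, epochs=500):
    """Fit p = sigmoid(w*x + b) by batch gradient descent on mean log loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for x, y in zip(xs, ys):
            err = sigmoid(w * x + b) - y  # derivative of log loss w.r.t. the logit
            grad_w += err * x
            grad_b += err
        w -= lr * grad_w / n
        b -= lr * grad_b / n
    return w, b


def brier_score(probs, outcomes):
    """Mean squared error between predicted probabilities and 0/1 outcomes."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)


# Synthetic data: one hypothetical "talent gap" feature with a noisy outcome.
random.seed(0)
xs = [random.uniform(-2.0, 2.0) for _ in range(500)]
ys = [1 if random.random() < sigmoid(1.5 * x) else 0 for x in xs]

w, b = fit_logistic(xs, ys)
model_brier = brier_score([sigmoid(w * x + b) for x in xs], ys)
baseline_brier = brier_score([0.5] * len(ys), ys)  # always predicting 50%
print(model_brier < baseline_brier)
```

A constant 50% forecast always scores exactly 0.25, so any Brier score below that indicates the model is extracting signal from the feature.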
Market Efficiency & Closing Line Value (CLV): We evaluate model performance primarily through CLV. We treat the betting market as semi-strong efficient, meaning the closing line incorporates all public situational information (injuries, sharp action). Consistently beating the closing line is therefore a more statistically reliable indicator of long-term ROI than short-run win/loss results. Thaler, R. H., & Ziemba, W. T. (1988). Parimutuel Betting Markets.
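One common way to quantify CLV is to compare the decimal price obtained with the decimal closing price. The sketch below assumes American odds as input and does not remove the bookmaker's margin (vig) from either price; both choices are simplifications, and the function names are hypothetical.

```python
def american_to_decimal(odds: int) -> float:
    """Convert American odds (+120, -150) to decimal odds (2.20, 1.667)."""
    if odds == 0:
        raise ValueError("American odds cannot be zero")
    if odds > 0:
        return 1.0 + odds / 100.0
    return 1.0 + 100.0 / abs(odds)


def implied_probability(odds: int) -> float:
    """Break-even win probability implied by a price (vig not removed)."""
    return 1.0 / american_to_decimal(odds)


def closing_line_value(bet_odds: int, closing_odds: int) -> float:
    """Fractional edge of the price taken over the closing price.

    Positive means the bet was placed at a better price than the close.
    """
    return american_to_decimal(bet_odds) / american_to_decimal(closing_odds) - 1.0


# Hypothetical example: bet placed at +120, line closes at +105.
example_clv = closing_line_value(120, 105)
print(f"{example_clv:.2%}")
```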

Active Development

Strategic Roadmap

Our upcoming integrations focus on situational physics and latent skill variables that market averages often under-weight.

Hierarchical Bayesian Catcher Framing: We are implementing a hierarchical Bayesian model to estimate called-strike probability ("Strike Rate"), treating the catcher as a random effect in a multi-level model. This separates a catcher's framing value from umpire and pitcher effects. Deshpande, S. K., & Wyner, A. J. (2017). Bayesian Model of Pitch Framing.
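The full multi-level model is beyond a short example, but the partial pooling that a random effect provides can be illustrated with a much simpler stand-in: beta-binomial shrinkage of each catcher's raw called-strike rate toward a league prior. This is not the hierarchical model described above (it ignores umpire and pitcher effects entirely), and the prior mean and prior strength are invented values.

```python
def shrunk_strike_rate(called_strikes: int, taken_pitches: int,
                       prior_mean: float, prior_strength: float) -> float:
    """Posterior mean called-strike rate under a Beta prior.

    prior_strength acts like a count of pseudo-pitches at the league rate,
    so small samples are pulled strongly toward prior_mean and large
    samples are barely moved.
    """
    alpha = prior_mean * prior_strength
    beta = (1.0 - prior_mean) * prior_strength
    return (alpha + called_strikes) / (alpha + beta + taken_pitches)


# Hypothetical league prior and two catchers with the same raw rate (55%).
LEAGUE_RATE, PRIOR_STRENGTH = 0.47, 1000.0
small_sample = shrunk_strike_rate(55, 100, LEAGUE_RATE, PRIOR_STRENGTH)
large_sample = shrunk_strike_rate(3300, 6000, LEAGUE_RATE, PRIOR_STRENGTH)
print(round(small_sample, 3), round(large_sample, 3))
```

Both catchers have the same raw rate, but the estimate for the 100-pitch sample stays close to the league rate while the 6,000-pitch sample keeps most of its observed edge.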
Atmospheric Physics & Air Density Adjustments: We are integrating dynamic coefficients for air density ($\rho$). For every 10°F rise in temperature, fly-ball carry increases by roughly 3.3 ft because warmer, less dense air produces less drag. We track game-time temperature, dew point, and barometric pressure to adjust home run probabilities. Nathan, A. M. (2008). The effect of spin on the flight of a baseball.
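A rough sketch of the density calculation, using the ideal gas law for a mix of dry air and water vapor, with vapor pressure approximated from dew point by a Tetens/Magnus-type formula. The function names, the 70°F reference temperature, and the linear carry rule built on the ~3.3 ft per 10°F figure are assumptions for illustration, not our production coefficients.

```python
R_DRY = 287.05     # J/(kg*K), specific gas constant for dry air
R_VAPOR = 461.495  # J/(kg*K), specific gas constant for water vapor


def fahrenheit_to_celsius(temp_f: float) -> float:
    return (temp_f - 32.0) * 5.0 / 9.0


def vapor_pressure_hpa(dew_point_c: float) -> float:
    """Actual vapor pressure in hPa from dew point (empirical approximation)."""
    return 6.1078 * 10 ** (7.5 * dew_point_c / (dew_point_c + 237.3))


def air_density(temp_f: float, pressure_hpa: float, dew_point_f: float) -> float:
    """Moist-air density in kg/m^3.

    pressure_hpa should be station pressure at the ballpark, not a
    sea-level-adjusted reading.
    """
    temp_k = fahrenheit_to_celsius(temp_f) + 273.15
    p_vapor = vapor_pressure_hpa(fahrenheit_to_celsius(dew_point_f)) * 100.0  # Pa
    p_dry = pressure_hpa * 100.0 - p_vapor                                    # Pa
    return p_dry / (R_DRY * temp_k) + p_vapor / (R_VAPOR * temp_k)


def carry_adjustment_ft(temp_f: float, reference_temp_f: float = 70.0,
                        ft_per_10f: float = 3.3) -> float:
    """Linear rule of thumb: extra fly-ball carry relative to a reference temperature."""
    return ft_per_10f * (temp_f - reference_temp_f) / 10.0


# Hypothetical game-time readings, invented for illustration only.
rho = air_density(temp_f=59.0, pressure_hpa=1013.25, dew_point_f=50.0)
print(round(rho, 3), round(carry_adjustment_ft(90.0), 1))
```

Note that humid air is slightly less dense than dry air at the same temperature and pressure, because water vapor is lighter than the nitrogen and oxygen it displaces.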
Fatigue Mechanics & TTTO Penalty: We are tracking the "Times Through The Order" (TTTO) penalty and bullpen pitch counts over the trailing 48 hours. Fresh high-velocity relievers tend to outperform tiring starters facing a lineup for the third time. Lichtman, M. G. (2013). The Third Time Through the Order.
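Both fatigue features reduce to simple bookkeeping, sketched below. The function names and the (timestamp, pitch count) record layout are hypothetical, and the sketch derives the times-through-the-order index purely from batters faced, which assumes a nine-man lineup with no mid-game substitutions.

```python
from datetime import datetime, timedelta


def times_through_order(batters_faced_before: int) -> int:
    """1 for a pitcher's first pass through the lineup, 2 for the second, etc."""
    if batters_faced_before < 0:
        raise ValueError("batters_faced_before must be non-negative")
    return batters_faced_before // 9 + 1


def trailing_pitch_count(appearances, as_of: datetime,
                         window_hours: int = 48) -> int:
    """Total pitches thrown in the window ending at as_of.

    appearances is an iterable of (timestamp, pitches) tuples.
    """
    cutoff = as_of - timedelta(hours=window_hours)
    return sum(pitches for ts, pitches in appearances if cutoff <= ts < as_of)


# Hypothetical reliever workload, invented for illustration only.
first_pitch = datetime(2024, 6, 15, 19, 5)
workload = [
    (first_pitch - timedelta(hours=10), 20),
    (first_pitch - timedelta(hours=30), 15),
    (first_pitch - timedelta(hours=60), 25),  # outside the 48-hour window
]
recent_pitches = trailing_pitch_count(workload, as_of=first_pitch)
print(times_through_order(18), recent_pitches)
```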
Statcast Matchup Vectors (xwOBA): We are using Exit Velocity (EV) and Launch Angle (LA) distributions to construct expected metrics such as xwOBA. This captures "hard-hit" form that traditional box scores miss, providing a cleaner signal of future offensive output. Savant/Statcast (2024). Expected Metrics Methodology.
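The idea of an expected metric can be sketched with a coarse lookup table: group historical batted balls into EV/LA buckets, average the wOBA value of their outcomes, then score a hitter's contact against that table. This bucketing is a deliberate simplification and is not Statcast's published method; the bucket widths, function names, and sample data are all hypothetical.

```python
from collections import defaultdict


def bucket(ev_mph: float, la_deg: float,
           ev_step: float = 5.0, la_step: float = 10.0) -> tuple:
    """Map a batted ball to a coarse (exit velocity, launch angle) cell."""
    return (int(ev_mph // ev_step), int(la_deg // la_step))


def build_lookup(history) -> dict:
    """Average wOBA value per cell from (ev, la, woba_value) records."""
    sums = defaultdict(lambda: [0.0, 0])
    for ev, la, value in history:
        cell = sums[bucket(ev, la)]
        cell[0] += value
        cell[1] += 1
    return {key: total / count for key, (total, count) in sums.items()}


def expected_woba_on_contact(batted_balls, lookup: dict, default: float) -> float:
    """Mean expected value over (ev, la) pairs; unseen cells fall back to default."""
    values = [lookup.get(bucket(ev, la), default) for ev, la in batted_balls]
    if not values:
        raise ValueError("no batted balls supplied")
    return sum(values) / len(values)


# Tiny invented history: (EV mph, LA degrees, wOBA value of the outcome).
history = [(102, 25, 2.10), (103, 27, 1.27), (88, -5, 0.0), (89, -8, 0.89)]
lookup = build_lookup(history)
example_xwobacon = expected_woba_on_contact([(101, 22), (86, -3)], lookup, default=0.0)
print(round(example_xwobacon, 3))
```

Because the estimate depends only on how the ball was struck, two hitters with identical contact quality receive the same expected value regardless of where the fielders happened to be standing.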