Passes and Patterns: Reverting to the Mean: Regression EP Models in U Sports Football

1. Abstract

A set of five different regression models were tested as measures of Expected Points, parallelling prior work in the field (Clement 2019) - the Multi-Layer Perceptron, Stochastic Gradient Descent, Elastic Net, Ada Boost, and Bayesian Ridge models. The model outputs were viewed and compared to the results of the raw data, and calibration graphs for each model were developed, as well as calibration graphs broken down by down, quarter, and home/away. The Multi-Layer Perceptron proved the only effective model, with the Elastic Net and Bayesian Ridge models effective only in certain limited circumstances, the Ada Boost is of very limited use, and the Stochastic Gradient Descent proved completely useless as a predictor of Expected Points.

2. Introduction

The development of an EP model can proceed by two different approaches. The first, already seen, is to use a classification approach to determine the probabilities of each possible scoring play, and then map those probabilities to the values of each of those scores. This requires each probability to be calculated either one-vs.-one one-vs.-all, and is slow to fit. Furthermore, one cannot determine beforehand whether this additional step will help the model’s accuracy or not. This approach was used in a previous work (Clement 2019), to generally good results. This work seeks to determine whether similar or better results can be obtained by regression models, which are far faster to fit and simpler to interpret.

3. Methods

The data come from the existing Passes & Patterns, with games from 2002-2018 (Clement 2018b), and parsed to create a hierarchical object-oriented database (Clement 2018e). The determination of the next scoring play is as described in the work on classification models (Clement 2019), but in order to create a regression model this was first converted into the numerical value of the scoring play before being fed into the model, in accordance with Table 1 below. 10-fold cross-validation was used, with each play’s output being assigned to an attribute of that play, in order to avoid overfitting. In order to create the output graphs in section 4 the results of every play with a given down, distance, and yardline combination were averaged, in order to properly account for slight differences in the different k-folds. Gaps in the data were determined by predicting them on a model trained using the entire dataset. The calibration graphs were made by binning predicted EP values to the nearest tenth of a point, and averaging the true next scores. The error bars in all cases were determined by bootstrapping. All graphs were created through the use of the matplotlib library (Hunter 2007).

a. Feature Selection

As the mission of this work is to test the efficacy of several different regression models, the same feature selection has been used here as was used in the prior work (Clement 2019). The intent is to provide a simple initial test, thus the data has not been normalized in any way, it is presented to the model as-is, with the understanding that some of the models would benefit from a degree of data standardization. The matter of such data manipulation, however, is not a straightforward one, having already been discussed in some detail (Clement 2019), largely focussing on the issue of the data being measured in different units, as it were, as well as being non-normally distributed.

i. Down (0-3)

While down might more properly be considered a categorical variable, sklearn does not support categorical variables. For a first approximation down will have to be modeled as a continuous variable. Certain models here have the ability to treat down pseudo-categorically because of the way they operate. Down ranges from 0 to 3, with 0 down being used as a placeholder for kickoffs and PATs.

ii. Distance (1-109)

The distance to gain is a continuous feature of any positive integer up to, theoretically, 109. Realistically this is limited to 25 in most cases, usually 10 or less, but there are exceptional cases of 50 or more.

iii. Field Position (1-109)

While field position and distance have the same valid range, field position is far more evenly distributed. Though points nearer the middle of the field are more common, we see no shortage of instances at any given field position.

iv. EP Input

Whereas the previous work with classification models predicted class probabilities which could then be converted to EP predictions, this work uses the score values directly as numerical inputs. The score inputs are derived from the adjusted nominal values, repeated iteratively until the values converge beyond the 10-16 level. The following score values are attributed, along with their bootstrapped 95% confidence intervals.

Score	Lower CI	Value	Upper CI
Field Goal	2.5603	2.6563	2.7593
Rouge	0.5603	0.6563	0.7593
Safety	-3.0665	-2.9936	-2.9177
Touchdown	6.7053	6.7257	6.7460
Half	0 (defined)

Table 1 Score values with confidence intervals

These values differ from those found in the classification-based work because of ongoing improvements in the database regarding bookkeeping errors, an ongoing process due to the inconsistent rigour in the scorekeeping. A particular problem is an inconsistency in labelling the change in possession after a defensive touchdown

b. Model Selection

Of the models used for classification, the logistic regression model simply cannot be applied to regression problems, while the k-nearest neighbours, random forest, and gradient boosting models all return the same results when used in regression mode, without the intermediate step of calculating score probabilities before mapping them to score values to get an EP value. Thus, only the multi-layer perceptron model could be reused in regression mode to get a direct comparison between models. Instead, a different set of models were chosen, based on the recommendations of the sklearn cheat sheet (Pedregosa et al. 2011). The support vector machine was excluded because the dataset is sufficiently large to make the fit times unmanageably long. For consistency of code, and because it is the premier machine learning library for Python, model selection was limited to models within sklearn.

The models chosen are intended to represent a series of different approaches to the problem, even if all of the models are variations of linear approaches there are still great variations in methods that lead to differing results. While a complete examination of all the different algorithms even just within sklearn would be nigh-impossible, this provides a reasonable sampling of different popular approaches that are similar to many of the available methods.

Methods within this Passes & Patterns code were also preserved. K-fold cross validation, with k=10, was used to predict the EP of each play, and these values were assigned to the play objects. Rather than determining the individual probabilities of future scoring plays, the input for the training data is the value of the next scoring play, as adjusted for knock-on effects (Clement 2019).

The same style of correlation graph is used here as in the previous work, and largely the same analysis will be performed, looking at the predicted EP against the true EP of each model, and further breaking down the models by quarter, down and home/away to identify possible biases. The heatmap graphs are the same as in the previous work, with the notable change of the colour scheme. rainbow was formerly used, this has been changed to viridis (Bob Rudis and Garnier 2018) for improved accessibility, as viridis is perceptually uniform and robust to colourblindness. Certain changes to the graphs for basic formatting have also been made, but their fundamental nature remains intact and comparisons between the two are straightforward, while the correlation coefficients R2 and RMSE are directly comparable.

i. Multi-Layer Perceptron (MLP)

This model comes from sklearn.neural_network.MLPRegressor, (Pedregosa et al. 2011) whose classification equivalent was a strong performer, albeit one that was very slow to train, likely due to the oversized network in use. The regression model is much faster to train, and MLP models are also well-adapted to warm starts, drastically decreasing training time after the first k-fold. While MLPs can be very powerful models, they are unfortunately rather inscrutable, with the underlying model consisting of a large number of hidden nodes, with a logical path that cannot be reasonably be traced by hand. Nonetheless, this model has a strong past as a class predictor for both EP and field goals can hopefully provide similar results.

ii. SGD Regressor (SGD)

From sklearn.linear_model.SGDRegressor, this model is known for its strength in sparse problems (Pedregosa et al. 2011), which is important in unusual distance and yardline combinations for 2nd and 3rd downs, that being the area where the raw data itself cannot be used directly. Furthermore, the model scales well with large datasets and large numbers of features. Although EP uses few features, There are over 200,000 records, and so efficient scaling is a valuable attribute for a model. SGD is sensitive to feature scaling, but the features in this model are not scaled, both to maintain the simplicity of the scope of work, and also because the question of how to scale features is uncertain, and has been previously discussed (Clement 2019). Table 2 gives the coefficients and intercepts for the output of the SGD model.

Feature	Coefficient
Down	-0.2766
Distance	-0.0045
Yardline	-0.0759
Intercept	4.6321

Table 2 Coefficients and Intercepts for SGD Model

From Table 2 we see that this model favours down as the strongest predictor of EP. In an unregularized context down is unsurprising to see as the most weighted feature, but these values seem somewhat odd, with field position being more valuable than distance to gain by a factor of 16, despite the narrower domain. Granted, these coefficients are not pure measures of feature importance, and so to compare the coefficients in this way is, at best, a heuristic.

iii. ElasticNet (Elastic)

From sklearn.linear_model.ElasticNet (Pedregosa et al. 2011), the Elastic model is a linear regression that blends the Lasso and Ridge models, using both L1 and L2 regularization to handle both sparse learning and regularization. Table 3 gives the coefficients of the features.

Feature	Coefficient
Down	0
Distance	0
Yardline	-0.0537
Intercept	4.2521

Table 3 Coefficients and Intercepts for Elastic Model

As seen in Table 3, the Elastic model looks to 0 the coefficients of features where possible. While we consider it dubious in this case that down and distance can simply be handwaved away and EP can be reduced to a function of field position as anything more than a first approximation, the model shall be allowed to stand, or fall, for itself.

iv. Ada Boost Regressor (Ada)

The Ada Boost Regressor (Pedregosa et al. 2011) is a form of ensemble model from sklearn.ensemble.AdaBoostRegressor that applies error-adjusted weighting at each iteration to provide progressively better results across each iteration. The Ada model is conceptually similar to other boosted ensemble models, such as the Gradient Boosting Classifier previously seen. As with previous ensemble models, 1000 estimators were used. In Table 4 we have the relative feature importances for the Ada model.

Feature	Coefficient
Down	0.1586
Distance	0.0001
Yardline	0.8413

Table 4 Feature Importances for Ada Model

While the primacy of field position is unsurprising in Table 4. The degree to which it dominates the model is, as is the near-complete eradication of distance as a factor to be considered. While distance may indeed be considered the least important of the features, something that has been the general consensus across the various models both in this work and its predecessor (Clement 2019), it has never before been considered so vastly unimportant as this, where it is little more than a rounding error. The only comparison is the Elastic model, one which actively sought to remove features from consideration.

v. Bayesian Ridge (BR)

The BR model, from sklearn.linear_model.BayesianRidge (Pedregosa et al. 2011) is the first foray into a Bayesian approach, where we begin with a set of uninformative priors and develop the model “along the way,” as it were, tuning the regularization parameter as part of the model fitting. Other models generally treat these parameters as being pre-set, and modify the weights of the model to optimize the model precision. Table 5 gives the coefficients and intercept of the BR model.

Feature	Coefficient
Down	-0.2575
Distance	-0.0018
Yardline	-0.0537
Intercept	4.6690

Table 5 Coefficients and Intercepts for Bayesian Ridge Model

As is typical, Table 5 shows distance as being the least important feature. This model has down as the most heavily weighted feature, though that is still likely a product of non-normalized domains of the data. These values are similar to an order-of-magnitude level to the coefficients of the SGD model, despite their vastly different approaches.

4. Results

In order to fully examine the models, the analysis of the model output has been separated from the evaluation of their performance based on their correlation with reality. While correlation is evaluated based on a number of different axes, looking at down, quarter, and home-away, the model results can only really be effectively visualized when separated by down. Thus each down is looked at separately in the sections below, and within each down the raw results are first evaluated, and then each model in turn.

a. 1st Down

1st Down differs from 2nd and 3rd down in that nearly all plays have a distance of 10 yards. While penalties can create distances of (usually) 5 or 15 yards, these are relatively uncommon compared to 1st & 10. A notable exception is near the goal line, where instead of 1st & 10 we see 1st & goal, and the distance to gain is equal to the yardline. Because of this we are only looking at 1st & 10/1st & goal in the following sections, and omitting any visualizations of other distances. Removing one of the dimensions means that the data can easily be plotted on a two-dimensional plot, and this allows for the inclusion of error bars. The error bars created here are the product of bootstrapping the outcomes from all the 1st & 10 plays at each yardline.

A brief note about the methodology for the models: as with the classification work, the k-fold cross-validation means that the predicted EP for each play comes from the predictions of a model fitted against other data. The predictions from all plays are then included and averaged. This approach was adopted to avoid issues of overfitting for data that lie at unusual combinations of distance and yardline that otherwise become overfitted.

Each model’s data on 1st & 10 was plotted, and a linear regression was applied to the results. The slope, intercept, and coefficients of each are given in Table 6 as a means of comparing the similarity of the different models.

	Slope	Intercept	R2	RMSE
Raw	-0.0643	5.494	0.2082	0.9895
MLP	-0.0645	5.49	0.1592	0.9939
SGD	-0.1021	5.008	0.3489	0.9883
Elastic	-0.0598	4.711	0.3688	0.9630
Ada	-0.0512	4.404	0.5375	0.9
BR	-0.0592	4.808	0.3363	0.9685

Table 6 Fitted Lines and Coefficients for 1st & 10 Results

A cursory glance at the various parameters of the linear fits to the different models is immediately telling. The MLP model is a near-exact match to the raw data, whereas the SGD model seems to have been derived from completely unrelated data. The other models seem to have varying degrees of similarity to the raw data. It should never be forgotten that for 1st down, and for 1st down only, that we hope to have as exact a match as possible to the raw data, as the confidence intervals are very narrow for the raw data here. For other downs a more smoothed effect is preferable, as the data is much sparser.

i. Raw

The use of raw data versus model-derived results is a quintessential question in data science. It is the opinion of the author that where sufficient data exists the data scientist should stay as close to the raw numbers as possible. In football we have a large state-space for EP, but the data is unevenly distributed. A majority of all plays are 1st & 10, and these are only distributed over the 110 yards of the field (of which there are only 109 possible ball placements). Therefore it is reasonable here to use the raw data, as shown in Figure 1. The error bars shown are bootstrapped from the list of all future scoring outcomes from the given situation, using 1000 bootstrap iterations, and show the two-sided 95% confidence interval.

Figure 1 EP for 1st & 10 by Yardline, Raw Data

Raw EP shown in Figure 1 is linear with respect to field position, except at the extremes of the field. This is consistent not just with the previous work (Clement 2019), but also with all previous examinations of EP in American football, both at the NFL and NCAA level (Clement 2018a). The point at 1st & Goal at the 5-yard line continues to baffle, as EP is far lower at this point than any of its neighbours. It is possible that some of this is caused by mislabelled 2-point conversions in the data being listed as 1st down instead of 0th down, but such instances that escaped detection during the various data cleanup projects would be sufficiently rare that this seems unlikely. The fitted line should not be taken as a substitute for the raw values, as the true data is not linear at the edges and this influences the intercept of the line, bringing it higher than it should be, as the data points at either extreme of the field are above the fitted line. However, the value of the fitted line is in its slope, we can assess the EP value of a marginal yard, which here is 0.0643 EP/yd, or 1/17 EP/yd as an approximation. Unfortunately 17 is not an easily divisible number but it can provide an effective heuristic for coaches to understand the value of field position in a more quantitative way.

Note also that certain data points have very small bootstrapped intervals, such as at the 75 yard line (-35 yard line in standard parlance), this is the point at which the ball is spotted after rouges and field goals where the defense elects not to receive a kickoff, and is also a common spot for kickoffs to be tackled. Thus there are a huge number of plays from this point, and the error on the point estimate shrinks as a result. A similar, but smaller, effect occurs at 90 yards (-20), as that is the spot after a touchback, and at the 1-yard line, as a result of penalties near the goal line. Note also that the point at the 75 yard line is well below the fitted line. This is likely attributed to teams starting after a rouge. Rouges have three sources: missed field goals, long kickoffs, and punts, either failed coffin corners or punts that go further than expected.. Field goal attempts and punts that have the potential to become rouges require the offense to move into opposition territory, a sign of an effective offense. Rouges from kickoffs generally imply that the kicking team scored, since otherwise they would only kickoff once per game. Thus each of these is a sign of a team whose offense is effective in advancing the ball, and correlate with winning. As a result, the team beginning on the -35 is usually less good than their opponent, and so will tend to score fewer points. A 1st & 10 on the -34 or -36 is NOT the result of a rouge, and so will be selected more-or-less randomly. In future analysis we should then consider two different scenarios for 1st & 10 at the -35 - whether or not this position is the result of a rouge or not. If not, we should interpolate between the adjacent yardlines to get a more accurate picture of the EP in this circumstance, or we should separate this into two EP values, whether from a rouge or not.

ii. MLP

Given that the parameters in Table 2 are a near-perfect match between the raw data and MLP model, expectations are high for the visual comparison. The MLP classification model was also an excellent EP model, so there is reason to believe that the regression equivalent will also be a strong performer

Figure 2 EP for 1st & 10 by Yardline, Multi-Layer Perceptron

As visible in Figure 2, the results of the MLP models are hyper-linearized, almost exactly following the trendline. This stands in sharp contrast to the classification model. The only non-linear aspects to the model are in & Goal situations, but the slight deviations from the trendline show that the model is doing more than simply connecting the dots over the middle portion of the data. In fact, this model may be the most effective to reduce the noise without oversimplifying the results.

iii. SGD

From the values in Table 2 there are immediate concerns about the SGD model. The slope is far shallower than the raw data, or any other model. Given the intercept, the EP at yardline=110 (the opposite goal line) would be ~-6 EP, a number which simply makes no sense, it essentially assumes that the opposing team is near-certain to score a touchdown next. A visual representation in Figure 3 offers the opportunity to confirm this suspicion.

Figure 3 EP for 1st & 10 by Yardline, Stochastic Gradient Descent

Figure 3 is as expected, and in fact the data goes literally off-the-chart at about the 70 yard line, since no previous 1st & 10 EP value had ever approached -2 EP, never mind -6 EP, the scale was thus not calibrated appropriately, or rather the model has clear issues, at least on 1st down. The SGD model clearly cannot be trusted in this situation, these results are entirely divorced from reality.

iv. Elastic

The Elastic models parameters in Table 2 are broadly similar to those of the raw data. While they do not parallel as closely as the MLP, they are certainly in a range that could be conceivably prove reasonable. Ultimately a closer inspection of the plots is required, and can be seen in Figure 4.

Figure 4 EP for 1st & 10 by Yardline, Elastic Net

As is fairly common, Figure 4 faithfully recreates the data for 1st & Goal, but thereafter simply follows a linear trend. This has been seen before from other linear-type models, and it unfortunately provides a weak approximation to the actual data, especially with the sudden drop in EP at the 10 yard line, and the lack of a tailing-off for data at higher yardlines.

v. Ada

From Table 2, the Ada model appeared to be the second-most similar to the raw data, after the MLP, although it was a distance second and nearer to the third- and fourth-place models. In Figure 5 we see the plotted results for the Ada model on 1st & 10 by yardline.

Figure 5 EP for 1st & 10 by Yardline, Ada Boost

While the big-picture parameters may paint the Ada model as a decent approximation of reality, taking even a cursory glance at the plotted data in Figure 5 exposes the model as pure, unadulterated garbage. The model correctly emulates the 1st & Goal data, then proceeds in seemingly arbitrary steps, with sections of completely flat EP, and at no point does it bear a resemblance to the data or its own trendline.

vi. BR

The BR model offers some hope for its novel approach compared to other models, which have mostly been variations on a theme, usually trees or linear models. While the parameters from Table 2 were not a perfect match, nor anything near it, they had enough of a resemblance to put the model in contention with the Ada and Elastic models. In Figure 6 we get a closer look at the BR data.

Figure 6 EP for 1st & 10 by Yardline, Bayesian Ridge

The BR data is very reasonable, especially when compared to the Ada model. It closely resembles the Elastic model, which is unsurprising as they are both related to ridge regression, the BR being a Bayesian form of ridge regression, while the Elastic model is the blending of the ridge and lasso approaches. The 1st & Goal data remains essentially a copy of the raw data, while the rest of the data is a straight line. While not the best representation of the data, it is far from the worst that has been presented in this work.

b. 2nd Down

Whereas 1st down is dominated by a surplus of data and the ease with which models can accurately approximate the raw data, 2nd down blends areas that are well-defined with areas that are poorly defined. For all the graphs below the distance to gain is limited at 25, since plays with greater distances are rare and lead to a very sparsely populated space. The models themselves include all the data, and the array of EP objects extends to the full 109 yards that is the maximum possible distance to gain, however improbable it may be. Furthermore, note that in each of the heatmaps the “impossible” combinations of distance and yardline have been omitted. These are combinations where the distance is greater than the yardline, which is truncated to & Goal, or situations where the line to gain is behind the -11, since no 1st & 10 could occur behind a team’s own -1 yard line, meaning that the line to gain cannot be behind the -11. While the models are capable of calculating EP values for such situations, they are meaningless and so have been removed.