Friday, January 24, 2020

The Numbers Behind the Numbers: EP Classification Models at the Class Level

1. Abstract

Using the three highest-performing models from the previous examination of EP classification models (Clement 2019a), this work looks at how well these models predict the individual classes of future scoring. The k-Nearest Neighbours (kNN) model performs adequately, but not as well as the Multi-Layer Perceptron (MLP) and Gradient Boosting Classifier (GBC), both of which performed very well, the GBC slightly better than the MLP. All of the models show bias by quarter and by home-field advantage, demonstrating the need to include game time and home-field as features. Future work will involve feature selection and fine-tuning of the MLP and GBC models.

2. Introduction

Previous work on EP models has looked at the output of the model as an EP value. With regression models (Clement 2020a) this is obvious, as the output of the model is an EP value. When using a classification model the output is instead a set of probabilities for each possible outcome class. These can be mapped to the point values of each score, adjusted for the knock-on effects of the ensuing kickoff or change of possession. The performance of the model can then be evaluated at two levels: one can look at whether the final EP values match reality, or one can compare the predicted probabilities of the individual classes against their actual outcomes. Here we do the latter, having already accomplished the former in a prior work, which served as a first pass to identify models that offered the possibility of utility.

3. Models

In this work we are looking at models from the prior work on classification models (Clement 2019a). While that work examined five different classification models, here we will only be focusing on three of those models, having eliminated the logistic regression and random forest models. This was done in the interest of brevity and focus, as those were the two worst-performing models. The three remaining models are k-Nearest Neighbours (kNN), Multi-Layer Perceptron (MLP), and Gradient Boosting Classifier (GBC). These three models all performed well in predicting the EP values, and the goal of this work is to evaluate their performance at predicting the individual probabilities that feed into those values.
We are retaining the same features as before - down, distance, and field position - as the core features behind any EP model. Other features of value, such as time and home-field advantage, have shown themselves quite useful elsewhere (Horowitz, Yurko, and Ventura 2017), but they are excluded here in order to reduce the dimensionality of the models and make preliminary determinations about the efficacy of the various models. Data from the 2019 season has been added to the database (Clement 2020b) and is included in this evaluation; its addition did not noticeably change the performance of any of the models.
The models are each described in brief below; a more thorough examination of these models can be found in the previous work or in the sklearn documentation (Pedregosa et al. 2011).
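As a concrete sketch of the shared setup, the three candidates can be fit on the core features as below. The toy data, hyperparameters, and composite label encoding are illustrative placeholders, not the tuned configurations from the previous work.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import GradientBoostingClassifier

# Toy stand-in for the play-by-play data: X holds the core features
# [down, distance, field_position]; y holds composite labels encoding
# (next_score, next_score_is_offense).
rng = np.random.default_rng(0)
X = np.column_stack([rng.integers(1, 4, 1000),      # down: 1st through 3rd
                     rng.integers(1, 26, 1000),     # distance to gain
                     rng.integers(1, 110, 1000)])   # yards from own goal line
y = rng.choice(["TD,True", "FG,True", "HALF,False"], 1000)

models = {
    "kNN": KNeighborsClassifier(n_neighbors=100),
    "MLP": MLPClassifier(hidden_layer_sizes=(100, 100), max_iter=500),
    "GBC": GradientBoostingClassifier(),
}
for name, model in models.items():
    model.fit(X, y)
    proba = model.predict_proba(X)  # one column per scoring class
```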

a. kNN

The k-Nearest Neighbours model, sklearn.neighbors.KNeighborsClassifier, was the worst-performing of the three models, but it still had excellent performance, giving comparable results to the other two models while requiring a fraction of the training time. It also provides a benchmark for the more complex models. kNN performs well in data-rich areas, where there is enough information to take a simple mean, and even in data-sparse regions it can give passable results. It struggles, however, near the boundaries of the domain of the input data: it must draw on points beyond its own, but it cannot draw symmetrically, so it ends up pulling additional data from only one direction, towards the centre of the data, since no data exists in the other direction. This is especially visible at the corners of the dataset.

b. MLP

The Multi-Layer Perceptron sklearn.neural_network.MLPClassifier was one of the two top-performing models based on their EP values (Clement 2019a). Unfortunately both of the top performers had very slow fit times, raising questions about the scalability of the models. While standardizing the inputs is recommended for the MLP (Pedregosa et al. 2011), the model has proven very effective with no preprocessing of the data. The strength of a neural network is in finding the interactions between features, which is especially important in areas such as the red zone, where the assumption of linearity breaks down and where the kNN model is weakest.

c. GBC

Gradient Boosting Classifiers, here sklearn.ensemble.GradientBoostingClassifier, are an ensemble method, a form of tree model where each tree builds on its predecessor. The GBC performed comparably to the MLP model, but with even slower fit times. The GBC handles categorical data as well as continuous data, and even though sklearn itself does not support categorical features, the ordinal nature of down allows the GBC to treat it as effectively categorical, since tree splits fall between integer values. This will also prove useful for the addition of further features to the model.

4. Probability Correlation

Correlation graphs have been used repeatedly with different models for different purposes in the Passes & Patterns archives (Clement 2019a, 2020a, 2019b). In this work we use probability correlation graphs, which plot the predicted probability against the actual outcome probability, such as were used to predict the outcomes of field goal attempts (Clement 2019b). This contrasts with the value correlation graphs seen in the first work on classification models (Clement 2019a) and in the analysis of regression models (Clement 2020a).
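Below is a minimal sketch of how such a probability correlation graph can be assembled, assuming predictions come from predict_proba. The one-percentage-point bin width is an assumption of the sketch; the N=100 display threshold matches the one noted for the figures below.

```python
import numpy as np

def probability_correlation(p_pred, y_true, bin_width=0.01, min_n=100):
    """Bin predicted probabilities for one class and compare each bin's
    mean prediction with the observed outcome frequency.

    p_pred: predicted probability of one scoring class for each play
            (a single column of predict_proba's output)
    y_true: boolean array, True where that class actually occurred
    Bins holding fewer than min_n plays are suppressed."""
    bins = np.minimum((p_pred / bin_width).astype(int),
                      int(round(1 / bin_width)) - 1)
    points = []
    for b in np.unique(bins):
        mask = bins == b
        if mask.sum() >= min_n:
            points.append((p_pred[mask].mean(), y_true[mask].mean(), mask.sum()))
    return points  # (mean predicted, observed frequency, n) per bin
```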
While the actual score values are not relevant to the results of this work, they are included below in Table 1 for clarity’s sake. The convention used here involves two play-level attributes: next_score and next_score_is_offense. The first is the type of score, one of [“TD”, “FG”, “ROUGE”, “SAFETY”, “HALF”], and the second is a boolean identifying whether that score belongs to the current offensive team, so it lies in [True, False]. Two important conventions should be noted: HALF is always paired with False, as there is no such state as [“HALF”, True], since the current offense did not “score” the end of the half; and safeties are credited to the team that surrenders the safety, that is, the team with possession which finds itself tackled in its own end zone, giving them a nominal value of -2 points.

          Nominal value   Lower CI    Adjusted value   Upper CI
TD        7               6.6877      6.7086           2.7135
FG        3               2.5142      2.6113           2.7135
ROUGE     1               0.5142      0.6113           0.7135
SAFETY    -2              -3.0767     -3.0103          -2.9423
HALF      0               [defined]   0                [defined]

Table 1 Score values with 95% CI
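For clarity, here is a minimal sketch of how the class probabilities map to an EP value using the adjusted values from Table 1. The string label encoding is an assumption of the sketch, and defensive scores are taken as the negative of the offensive values, consistent with the note later in this work that the mapped values are inverses of one another.

```python
import numpy as np

# Adjusted point values from Table 1, keyed by
# "next_score,next_score_is_offense" labels (encoding assumed for this sketch).
ADJUSTED_VALUE = {
    "TD,True": 6.7086,      "TD,False": -6.7086,
    "FG,True": 2.6113,      "FG,False": -2.6113,
    "ROUGE,True": 0.6113,   "ROUGE,False": -0.6113,
    "SAFETY,True": -3.0103, "SAFETY,False": 3.0103,
    "HALF,False": 0.0,      # defined as exactly zero
}

def expected_points(model, X):
    """EP as the probability-weighted sum of the adjusted score values:
    EP = sum over classes c of P(c) * value(c)."""
    values = np.array([ADJUSTED_VALUE[c] for c in model.classes_])
    return model.predict_proba(X) @ values
```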

a. By Model

It should be noted that each point in all of the subsequent figures has exact binomial confidence intervals calculated at the 95% level (converged to 10^-16), but they are ordinarily not visible because they are so small as to be subsumed within the thickness of the line itself. This is largely a function of sample size: each play generates nine predictions, one per class, so instead of 304,109 data points, one per play, we have 2,736,981, nine per play. By massively increasing the sample size we shrink the intervals to the point of invisibility.
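The intervals themselves can be computed with the exact (Clopper-Pearson) method; a sketch using the closed-form beta-quantile expression, rather than the iterative search the convergence note implies:

```python
from scipy.stats import beta

def clopper_pearson(k, n, alpha=0.05):
    """Exact binomial confidence interval for k successes in n trials."""
    lower = beta.ppf(alpha / 2, k, n - k + 1) if k > 0 else 0.0
    upper = beta.ppf(1 - alpha / 2, k + 1, n - k) if k < n else 1.0
    return lower, upper
```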

        R²       RMSE
kNN     0.9897   2.6379
MLP     0.9975   1.3317
GBC     0.9976   1.2773
Table 2 Correlation coefficients for probabilities
As has been the running theme, the kNN performs slightly worse than the other two models, while the MLP and GBC models are nigh-indistinguishable. Still, even the kNN model shows itself to be an excellent predictor of score outcomes. We can see this in a more granular form below in Table 3, where we have the models broken down by individual scoring type.
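Coefficients like those in Table 2 can be derived from the binned correlation points. A sketch, assuming `points` comes from the probability_correlation() sketch above and probabilities are expressed in percent; this unweighted version is an assumption, as the original may weight bins by sample size.

```python
import numpy as np
from sklearn.metrics import r2_score

pred = np.array([100 * p for p, a, n in points])  # mean predicted, percent
act = np.array([100 * a for p, a, n in points])   # observed frequency, percent
r2 = r2_score(act, pred)
rmse = np.sqrt(np.mean((pred - act) ** 2))
print(f"R² = {r2:.4f}, RMSE = {rmse:.4f}")
```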


                 kNN                 MLP                 GBC
                 R²        RMSE      R²        RMSE      R²        RMSE
TD, True         0.9846    2.9143    0.9961    1.5745    0.9979    1.1040
FG, True         0.9592    3.8843    0.9890    2.3919    0.9923    2.0615
ROUGE, True      0.5221    2.0707    0.9396    1.0747    0.9705    1.0768
SAFETY, True     0.9421    4.4357    0.9648    2.9321    0.9834    2.1427
HALF, False      0.8456    1.3823    0.9847    0.4201    0.9393    1.5695
SAFETY, False    0.8133    1.0008    0.9452    0.7673    0.9188    1.4450
ROUGE, False     -0.0492   1.3060    0.9378    0.3510    0.9243    0.9786
FG, False        0.9684    1.1320    0.9485    1.3834    0.9927    0.6926
TD, False        0.8945    2.9091    0.9787    1.4353    0.9944    0.7905
Table 3 Correlation coefficients for probabilities by class
While an exhaustive review of each result in Table 3 is beyond the scope of this work, this is where we can start to see the two more sophisticated models set themselves apart. For the MLP and GBC models every single R² value is above 0.9; the kNN values are good for the more common score types, but the less common ones are decidedly less well-correlated. For [“ROUGE”, False] the kNN’s R² is actually negative, even with all the data compressed into a small range, leading to very large sample sizes. One should note that the RMSE values do not necessarily follow the R² values.

i. kNN

In Figure 1 we see the correlation graph for the kNN model. We see that the model is well-calibrated, almost as well-calibrated as it was for the EP values (Clement 2019a, fig. 2). The calibration is near-perfect below 40%, and becomes noisier thereafter. This is unsurprising, as this is where the vast majority of the probabilities will lie, and the power of averaging, especially for a kNN model that is fundamentally an averaging of similar records, creates a strong model.
Figure 1 Correlation graph for predicted probabilities, kNN
The correlation graph stops at about 90%, above which there are no points that meet the N=100 threshold to be displayed on the graph. This should not come as a surprise: there are few situations where a specific score is near-certain to come next. Certainly no defensive score would ever be this likely, nor would an offensive rouge, and without time remaining as a feature the end of the half would not be this likely either. This leaves the offensive touchdown, field goal, or safety. A touchdown is most likely when the offense has 1st & goal at the 1-yard line, a fairly common occurrence as the result of penalties in the red zone. P(1D) for this situation, which is equivalent to the probability of ending the drive with a touchdown, is 90% (Clement 2018b). There is some small chance of the drive failing and the resultant score on a future drive being a touchdown for the same offensive team, but this is minor noise in the big picture. A field goal would be most likely on 3rd & goal at a distance too great for most teams to consider going for the touchdown, but close enough that the kick is near-certain. This occurs in the 5-10 yard line area, where again the probability of successfully making the kick is around 90%. Finally, a safety might be intentionally conceded when a team is backed up deep in its own end, but as we see in Figure 2 below, this probability never goes much past 50%, as many coaches would prefer to punt from this situation.

Figure 2 Correlation graph for predicted probabilities by class, kNN
On a per-prediction basis the kNN model performs much better than these extremes suggest, but it is precisely in extreme situations that a model must retain its validity, and here we have concerns about the limitations of the kNN model. Extreme probabilities occur in extreme situations, which are precisely those where we have already seen the kNN struggle due to the asymmetry of the data surrounding them; yet these are also the situations with the highest leverage, where we most need the model to remain accurate. We see, for example, the model grossly underpredicting the probability of [“FG”, True] above 50%. We also see that field goals are not part of the 90% probability predictions for the kNN model, and that the very high data points in the complete correlation graph in Figure 1 are largely offensive touchdowns for everything above 60%.
Normalization of the data would also improve some of the results, especially for [“FG”, True] and [“SAFETY”, True], where down is such a driver of the probabilities, but normalization is highly problematic because of the non-normality of each feature’s distribution. A linear compression of each feature would imply that each feature is weighted equally over its valid domain, an equally dubious assumption. Grid-searching the best combination of normalization parameters is possible, but quickly becomes an exercise in overfitting.
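For illustration, one such linear compression in sklearn, with the caveat above that it implicitly weights down, distance, and field position equally over their domains; n_neighbors is a placeholder value.

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.neighbors import KNeighborsClassifier

# Rescale each feature onto [0, 1] before the distance computation.
knn_scaled = make_pipeline(MinMaxScaler(), KNeighborsClassifier(n_neighbors=100))
knn_scaled.fit(X, y)  # X, y as in the earlier setup sketch
```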

ii. MLP

The MLP model has been locked in battle with the GBC to provide the best model. Both were excellent in providing accurate EP values (Clement 2019a), with R² values consistently in excess of 0.995. Figure 3 shows the probability correlation graph for the entire MLP model. As with the kNN model in Figure 1 we see excellent correlation below 40%, slowly deteriorating above that as the sample sizes begin to shrink. The graph extends to a higher probability than the kNN’s, as the MLP is better able to predict at the extremes, for reasons already discussed.
Figure 3 Correlation graph for predicted probabilities, MLP
We also see in Figure 3 that the model, even as the variance grows, shows no discernible bias. In Figure 4 we see each scoring possibility, and we continue to see strong results across the board. Unlike the kNN model we do not see large deteriorations in correlation quality, whether by R² or RMSE, for the less likely scoring plays. We also see marked improvement in [“FG”, True], where this model is better able to predict situations where a field goal is very likely, as well as in [“SAFETY”, True], which is no longer relegated to the lower-left corner.
Figure 4 Correlation graph for predicted probabilities by class, MLP
Looking at each correlation graph in Figure 4, none of the graphs has a visible bias, an encouraging sign that the model is well-calibrated. On both [“TD”, False] and [“FG”, False] we do see a drop into underprediction at the very end; we see this for [“TD”, True] as well, and very faintly for a few others. Whether this is meaningful or simply noise is difficult to discern at this point, but it should be kept in mind when looking at the models along different axes.

iii. GBC

The GBC model has consistently been the best performer, and here it bests the MLP on the overall measures in Table 2 and on most of the individual classes in Table 3, both by R² and by RMSE. In Figure 5 we see the correlation graph for the GBC model in its entirety.
Figure 5 Correlation graph for predicted probabilities, GBC
Figure 5 shows good correlation, no visible bias, and a good range of values, much like the MLP. The only disadvantage of the GBC relative to the MLP is its long fit times, especially when using k-fold cross-validation: the warm_start option for the MLP allows much faster fits by letting the model start where it left off and go immediately to fine-tuning, whereas the GBC’s warm_start cannot be used in this way.
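A sketch of the warm_start refitting trick described above, assuming `folds` yields (X_train, y_train) training splits from a k-fold scheme:

```python
from sklearn.neural_network import MLPClassifier

# folds: assumed iterable of (X_train, y_train) splits, e.g. built from KFold.
mlp = MLPClassifier(hidden_layer_sizes=(100, 100), warm_start=True)
for X_train, y_train in folds:
    # With warm_start=True each call to fit() resumes from the previous
    # weights rather than reinitializing, so later folds converge quickly.
    mlp.fit(X_train, y_train)
```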
Figure 6 Correlation graph for predicted probabilities by class, GBC
Unlike the drop-off seen from the MLP model in Figure 4 for many of the scoring possibilities, the GBC instead shows an “up-down” spike in Figure 6 for several of them, including [“TD”, False], [“SAFETY”, False], [“HALF”, False], and [“SAFETY”, True].

b. By Quarter

To look for problems in the models we can look at the correlation graphs split out along various dimensions to identify consistent biases, as we have done in prior works (Clement 2019a, 2020a, 2019b). We use quarter as a way of splitting the data along an axis that is significant to the game, and therefore likely to show biases, but is not a feature in the model, so the model is unaware of it. Dividing the results along multiple axes, some of which are features and some of which are not, gives us different angles from which to examine the model’s performance based on what the model does or does not know beforehand. Table 4 gives the correlation coefficients of each model by quarter.

       kNN                 MLP                 GBC
       R²        RMSE      R²        RMSE      R²        RMSE
1st    0.9667    4.2199    0.9792    3.2459    0.9781    3.8748
2nd    0.9837    2.9583    0.9893    2.5526    0.9854    3.0189
3rd    0.9697    4.0182    0.9788    3.3294    0.9780    3.6994
4th    0.9323    5.4620    0.9144    6.6543    0.9201    6.2629

Table 4 Correlation coefficients for probabilities by quarter
In Table 4 we see some expected results, such as every model’s worst quarter being the 4th, when Win Probability (WP) considerations overtake EP considerations late in the game. But we also see some unusual results, such as the kNN model providing the best 4th-quarter performance of all three models, and being neck-and-neck with the others in every quarter.

i. kNN

While the kNN model showed some issues when looking at individual class probabilities, its coefficients in Table 4 stand proud alongside two models known to be very strong. In Figure 7 we look at the kNN by quarter. We see a fair bit more noise, especially at higher predicted probabilities, but more interesting is that each of the first three quarters is slightly underpredicted while the 4th quarter is quite a bit overpredicted; in aggregate these biases average out. Overprediction in the 4th quarter is typical of EP models (Clement 2018a), as teams that are leading will look to run out the clock, and teams trailing will make decisions to maximize WP, sometimes at the expense of EP.
Figure 7 Correlation graph for probabilities by quarter, kNN
Figure 7 stands as a perfect example of why EP must either incorporate some sort of game-time feature or only be used in situations where EP and WP are concordant, generally the first three quarters while the score is still close. There is no exact demarcation of this; in principle teams should always aim to maximize WP, but EP is a valid proxy in most situations, one that can provide both better accuracy and precision, whereas WP can be unstable in the early part of the game.

ii. MLP

When looking at the three models by quarter the distinctions between their performances become very subtle. Looking at the MLP by quarter in Figure 8 we see essentially the same thing as for the kNN in Figure 7: the model underpredicts in the first three quarters and then significantly overpredicts in the 4th. It is also well-calibrated and stable below 50% predicted probability, becoming noisier beyond that, with correlation coefficients better than the kNN’s, though not enormously so. Separating the model results by quarter is a useful way to look for potential biases, but it is unable to differentiate between these three models. This is unsurprising, as these models were already selected for being well-calibrated.
Figure 8 Correlation graph for probabilities by quarter, MLP

iii. GBC

The GBC model in Figure 9 leads us to the same conclusion as the other two models in Figures 7 and 8. But when we look at all the models together, with an eye to their similarities instead of their differences, one thing stands out: at about 10% probability there is a downward spike in each of the first three quarters and a sharp upward spike in the 4th. Some low-probability event is happening with disproportionate frequency in the 4th quarter, a good opportunity for further investigation, perhaps the subject of another work. A working hypothesis is that this is an instance of [“HALF”, False]: in the first three quarters there are enough scoring plays that few plays lead to the end of the half with no intervening score, but in the 4th quarter there are many situations in blowout games where both teams will play out the string with no intention of scoring, content to let the game end as it stands. This data point recurs in all three models, so it is unlikely to be random variation, and it merits looking into.
Figure 9 Correlation graph for probabilities by quarter, GBC

c. By Down

As opposed to splitting the model results by quarter, which is not a feature in these models, when we look at our results by down we are looking along one of the dimensions of the model itself. This is especially important because down should properly be considered a categorical variable, but for technical limitations is left as a continuous one in sklearn. The GBC’s nature gives the desired effect of treating down as an ordinal variable, but the kNN and MLP do not; they simply treat it as a continuous numerical variable. This is not the preferred behaviour, but the results have been consistently good for these models. Still, an abundance of prudence dictates that we examine these models most closely at precisely the point we expect to be most problematic. Table 5 gives the correlation coefficients for each model, split up by down.

       kNN                 MLP                 GBC
       R²        RMSE      R²        RMSE      R²        RMSE
1st    0.9895    2.6087    0.9963    1.5758    0.9995    0.5025
2nd    0.9849    2.5865    0.9877    2.3229    0.9965    1.2481
3rd    0.8828    8.0750    0.9805    2.2989    0.9921    2.1645

Table 5 Correlation coefficients for probabilities by down
Unlike Table 4, where the models were split up by quarter, the kNN model suddenly looks much worse when separated by down, and increasingly so with increasing down. The MLP performs very well here, but not quite as well as the GBC, which continues to be the strongest all-around model.

i. kNN

One of the anticipated problems for the kNN model is how it handles down as a variable. Not only does the model fail to grasp the notion of ordinal variables, but because the data is not normalized it treats the difference of a single down as equivalent to a single yard of field position or distance to gain. We can expect the model to struggle in this section because of this, since a difference of a single down is far more impactful than a single yard; a worked example of this appears at the end of this subsection. In Figure 10 we can see the correlation graphs for the kNN by down.
Figure 10 Correlation graph for predicted probabilities by down, kNN
In an unsurprising result, we see the best performance on 1st down, where the correlation is very strong, with perhaps a slight underprediction bias at the upper end, though small enough to perhaps be dismissed as mere variance. 2nd down is relatively strong, though it overpredicts some of its higher probabilities, likely the influence of neighbouring 1st down plays.
On 3rd down, however, the model goes to pot. It massively underpredicts nearly everything above 40%, which we presume to be the region of field goals. Not only are 3rd down plays less common than 2nd and especially 1st down plays, but their distance to gain has greater variance than on the other downs, leading the model to become more “contaminated” by data from other downs whose influence has not been appropriately tempered. In short, the probabilities put out by the kNN model on 3rd down are simply not to be trusted.
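Here is the promised worked example of the unscaled-distance problem: with raw features [down, distance, field position], Euclidean distance treats a one-down difference exactly like a one-yard difference.

```python
import numpy as np

a = np.array([2, 10, 55])  # 2nd & 10 from midfield
b = np.array([3, 10, 55])  # 3rd & 10: a very different situation
c = np.array([2, 10, 54])  # 2nd & 10, one yard back: nearly identical
print(np.linalg.norm(a - b))  # 1.0
print(np.linalg.norm(a - c))  # 1.0 -- the same distance to the model
```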

ii. MLP

The MLP should struggle to properly account for down, as it must treat the input as truly continuous, but when we look at Figure 11 we do not see the expected issues; in fact the 3rd down graph is as good as the 1st down graph was for the kNN in Figure 10. Although the variance increases, we avoid any signs of bias in the model. Because of the large number of hidden nodes, the model is better able to capture the nuances of the different downs and remain accurate.
Figure 11 Correlation graph for predicted probabilities by down, MLP

iii. GBC

The GBC returned strong coefficients in Table 5, and a glance at Figure 12 shows that the model is impeccably well-correlated, with 1st down being almost perfect, and 2nd down not far behind. Even on 3rd down the model is as well-calibrated as kNN and MLP are on 1st down. This is a strong showing for the GBC as we look for the MLP and GBC to distinguish themselves.
Figure 12 Correlation graph for predicted probabilities by down, GBC

d. By Home-Away

When looking at the models by home and away we are very much looking for inherent biases in the data. The notion of home-field advantage in sports is well-known (Vergin and Sosik 1999; Jamieson 2010), and since the model has no knowledge of which team is at home, we should see that ignorance reflected as a bias in the results. To wit, we expect a general overprediction for the away team and an underprediction for the home team, so in Table 6 we do not necessarily want perfect correlation. Since we are correlating the predicted probabilities of all scoring outcomes, however, an underprediction of EP for the home team (and vice versa) will not necessarily materialize as a global underprediction of all scoring probabilities, as an equal number of predictions concern the defensive team scoring. The defensive scores will always have fairly low predicted probabilities, but these are somewhat confounded because many of the offensive scores also have predicted probabilities well below 50%. Instead we can only look at the higher predicted probabilities, which are the exclusive domain of offensive scoring plays in specific situations.

        kNN                 MLP                 GBC
        R²        RMSE      R²        RMSE      R²        RMSE
Away    0.9333    2.6591    0.9929    2.1507    0.9956    1.6403
Home    0.9854    3.0406    0.9960    1.6683    0.9940    1.9575

Table 6 Correlation coefficients for probabilities by home-away

i. kNN

The kNN model’s correlation graph is shown in Figure 13. For the away graph we do not see the expected bias; in fact this is one of the best-calibrated graphs we have seen thus far from the kNN. If one squints very closely there might be an underpredicted region between 25 and 45%, before the graph becomes too noisy and indistinct. For the home graph there is a similar pattern between 40 and 60% where it is weakly underpredicted, but the graphs for predicted probabilities show this bias far less distinctly than the EP values did (Clement 2019a, 2020a).
Figure 13 Correlation graph for probabilities by home-away, kNN

ii. MLP

As the MLP model is overall better calibrated than the kNN, we hope to more clearly see the expected bias in Figure 14, and we are not disappointed. The result is subtle but visible, especially on the away side, where the model is consistently overpredicted above 25%. On the home side the underprediction is even more subtle, nigh-invisible.
Figure 14 Correlation graph for probabilities by home-away, MLP

iii. GBC

The final examination in this work looks at the GBC broken down by home and away, as seen in Figure 15. Contrary to the MLP in Figure 14, we can clearly see the underprediction of the GBC for the home team, but any overprediction for the away team is invisible, or at least very subtle. As far as the prediction of class probabilities is concerned, home-away is a very small source of bias, although an accumulation of small errors can lead to large errors in the predicted EP value, because the mapped values are the inverse of one another.
Figure 15 Correlation graph for predicted probabilities by home-away, GBC

5. Conclusion

The three models that accurately predicted EP values have also shown themselves capable of predicting EP class probabilities. While the EP values are derived from these probabilities, it was unknown whether the individual class probabilities would be well-calibrated, or whether the accuracy of the EP prediction was driven by only a few major classes, with the others serving as minor noise. This affirms the added value of using a classification model over a regression model: the predicted EP values are better (Clement 2019a, 2020a), and the underlying class predictions are also useful.
A more formal comparison of classification and regression models is in order, to put to bed the question of whether the much slower-fitting classification models are worth their cost. Another area of focus should be the selection of additional features, such as home-field advantage and time remaining, to suss out the largest sources of error in the models.
EP is the most valuable family of models in football analytics, as it sits at the intersection of utility and generality. It is usable in almost all game situations, highly precise, and interpretable. This interpretability is important for communicating to non-technical personnel the added value of a quantitative approach to decision-making. The continued development of EP models in Canadian football is ripe for rapid progress.

6. References


