Abstract
A redevelopment of the U Sports parser and calculator previously described (Clement 2018b) that was built using VBA, this time working in Python. Data is imported through Python’s csv package, and parsed using an object-oriented approach, creating games and plays as classes with attributes as appropriate to each. Further objects exist to support analysis on the parsed data, and the numpy package with its arrays allows for far faster calculation of results. Discussion of future work built on this restructured database includes examination of special teams, expected points (EP) and Win Probability (WP).
Introduction
While previous work developed a pair of relational databases for U Sports and CFL data, and this proved adequate for a certain tasks, there was an interest in taking an object-oriented approach to the matter. The prime benefit of an object-oriented approach would be to structure the data in a fashion more akin to how our data is structured in reality. Games are independent events, and the plays are independent within them. By structuring the games as objects, and the plays as a separate class within them, we can also define attributes as being either game-level attributes or play-level attributes. The date of a game is a game-level attribute, whereas field position is a play-level attribute.
In the relational database a large number of rows were dedicated to bookkeeping: score statements, clock statements, and quarter starts among them, consuming as much as one-quarter of all the rows. By moving to an object-oriented structure that information is held at the appropriate level as an attribute of the object, making for a cleaner data structure.
Because this version of the database does not immediately provide a visual output of all the parsed data it was necessary to implement a more rigorous error-checking. In the process of doing so a number of other errors were spotted and corrected, though none that would affect the integrity of the P(1D) results already delivered (Clement 2018b). A number of missing score statements were repaired, as well as errors relating to distance gained on plays with fumbles.
Database
The code is structured using a number of modules. There are three central modules in the central aspect of the code, each discussed below. The program is run from the central Master module, which runs small sections of code and calls other methods and functions, serving to direct the overall flow of the code. A limited number of global variables are held in Globals to ease the difficulty of constantly passing large numbers of local variables, making code maintainability needlessly difficult. Finally, Functions holds a number of small numerical functions that rove useful in a variety of different situations.
Furthermore, there is a folder for the class modules, each of which is responsible for holding one class of object, with its attendant methods, functions and attributes, and any additional code that relates to that object class.
Master Module
Traffic within the program is managed by the Master module, a central module which initializes all the other elements of the program and directs the sequence of calculations. Master also serves as the laboratory to implement initial concepts before better organizing them within the structure of the code.
The first responsibility of the Master module is the importation of the data itself. The data was imported from .csv files using Python’s csv module. The three different formats of U Sports data (Clement 2018a) were each stored in a separate csv file. After importing the csv file the script reads through each row of the data to search for new games, as identified by the presence of “vs.” to create new Game objects. All other rows were added the the game’s rowlist attribute for further treatment. When a new game begins the current game is added to the global list gamelist. The code below demonstrates the approach taken to create the game objects. The dummy variable simply avoids an off-by-one error when reading the first game, in an inelegant but utilitarian fashion.
import csv #to bring in the MULE data from CSV
dummy=0
#IMPORTING DATA FROM THE CSV MULES
with open("Data/CIS MULE 01.csv") as csvfile:
MULE = csv.reader(csvfile)
for row in MULE:
if " vs. " in row[0]: #looking for new games
Globals.gamelist.append(GameClass.game(row[0], 1))#add the old temp to the list of games
elif dummy==0:
dummy=1
elif row[0]!=None or row[1]!=None:
tempplay=[]
tempplay.append(row[0])
tempplay.append(row[1])
Globals.gamelist[-1].rowlist.append(tempplay)
Globals Module
The simple mission of the Globals module is to hold a set of global variables that can be imported en bloc by each other module to give access to the core variables and data of the program and limiting the number of redundant passes. Care was taken to be judicious in the use of these global variables to avoid overencumbering the system with needless global variables. The variables defined in Globals are given below.
CONFIDENCE=0.025 # The one-sided confidence interval size for all statistical tests
BOOTSTRAP_SIZE=10000 # Number of bootstrap iterations to use
TDval=7.0 # Value of a touchdown
TDval_BOOTSTRAP=[] # Bootstrap array to determine confidence interval of TDval
TDval_HIGH=None # Upper bound of TDval CI
TDval_LOW=None # Lower bound of TD CI
FGval=3.0 # Same for field goal
FGval_BOOTSTRAP=[]
FGval_HIGH=None
FGval_LOW=None
ROUGEval=1.0 # For rouge
ROUGEval_BOOTSTRAP=[]
ROUGEval_HIGH=None
ROUGEval_LOW=None
SAFETYval=-2.0 # For safety
SAFETYval_BOOTSTRAP=[]
SAFETYval_HIGH=None
SAFETYval_LOW=None
HALFval=0.0 # Value of the end of the half is fixed and certain
gamelist=[] # The gamelist holds all the games
DummyArray = numpy.full(BOOTSTRAP_SIZE, -100, dtype='int32') # Holds a default value to avoid errors when comparing to None
Functions
The Functions module provides access to a few functions that are reused across several classes and that improve maintainability by centralizing these in a single module
Binomial Error
Binomial error in the program is calculated iteratively to obtain a value that is as exact as we choose to make it, here we go until we converge at the 10-30 level. We recycle the same binomial function used in the previous calculator (Pezzullo 2014), as was discussed in that work (Clement 2018a).
BootCompare
This function takes two bootstrap arrays and determines how many values of array A are greater than values of array B. It returns the proportion of all pairs of elements of A and B where the element from A is greater than the element from B. It assumes that A and B are both sorted arrays, which they are because prior elements of the code demand that they be sorted, and with this assumption is able to operate with only 2 * Globals.BOOTSTRAP_SIZE comparisons, at O(n) efficiency, rather than performing a comparison of every pair at O(n2).
def BootCompare(arrA, arrB):
count=0
b=0
for a in range (0, Globals.BOOTSTRAP_SIZE):
for b in range (b, Globals.BOOTSTRAP_SIZE):
if arrB[b] > arrA[a]:
count += b
break
else:
count += Globals.BOOTSTRAP_SIZE
count /= Globals.BOOTSTRAP_SIZE**2
return float(count)
Game class
The game class creates objects that hold games in their entirety. This allows us to append attributes belonging to the game directly to the object, rather than carrying them with every play as was done in the relational database. Notably, the teams playing, the date of the game, and the final score are all attributes that belong to the game and not to the individual play. The dataset from which the game comes is also passed as the attribute MULE to account for differences in how the parser should handle certain attributes. The code below shows how a game object is initialized.
__init_ method
The __init__ method initializes the game object, creating all of the attributes that will then be determined in later functions. No work is done in this method to assign values to any attributes beyond the statement and data format that are imported initially, so as to improve the maintainability and portability of the code. Most of the attributes are determined in the subsequent game_calc method. The code for the __init__ method is given below.
def __init__(self, statement, MULE):
self.rowlist=[] # Holds all the rows of raw data from the csv
self.playlist=[] # Holds a list of all the plays as objects
self.MULE=MULE # Which data format is this from
self.game_statement = statement # The initial "vs. " statement
self.AWAY=self.game_statement[0:3] # Automatically grabs the Home and Away team from the vs. statement
self.HOME=self.game_statement[8:11]
self.H_WIN=None # If the home team wins the game
self.H_FINAL=None # Final score of them game
self.A_FINAL=None
self.LEAGUE = None
self.CONFERENCE = None # The conference in which the game was played
self.YEAR=None # The year, month, day, weekday of the game
self.MONTH=None
self.DAY=None
self.WEEKDAY=None
self.SEASON=None # The season in whichthe game was played, always the same as YEAR for Canadian football, as seasons don't span over New Year's Day, exists for compatibility with other leagues in the future
Game_calc method
This method determines a number of the core attributes of the game. It separates the row attribute into its components, based on the data format, and through some string interpretation determines the date of the game. It also determines the conference in which the game was played, which considers the decision of Bishop’s University to move from the RSEQ to AUS for the 2017 season. The code of the game_calc method is included here.
# Determines some of the basic info about the game - date, conference
def game_calc(self):
for x in range (2001,2100): # Loop to get the year of the game. Checks 2001 to 2100 to be fully future-proofed and allow for the addition of old PxP games we might find.
if str(x) in self.game_statement: # Check if the year is in the game statement
self.YEAR=x # Assign that year to the attribute
self.SEASON=x
break # Once we find the year there's no need to waste clock cycles
self.MONTH = int(self.game_statement[self.game_statement.find(str(self.YEAR)) + 5 : self.game_statement.find(str(self.YEAR)) + 7]) # Assign the other attributes from string manipulation once we know the year
self.DAY = int(self.game_statement[self.game_statement.find(str(self.YEAR)) + 8 : self.game_statement.find(str(self.YEAR)) + 10])
self.WEEKDAY = date(self.YEAR, self.MONTH, self.DAY).weekday()
self.LEAGUE="U SPORTS" # Always U Sports in this case, will get more sophisticated if we start looking at other leagues.
CWUAA=["MAN", "SKH", "REG", "ALB", "CGY", "UBC", "SFU"] # These are to determine the conference of a game #These two conferences always have the same teams
OUA = ["CAR", "OTT", "QUE", "TOR", "YRK", "MAC", "GUE", "WAT", "WLU", "WES", "WIN"]
if self.YEAR<2016: # Because Bishop's changed conferences in 2016 we need to adapt accordingly
RSEQ=["BIS", "SHE", "MCG", "CON", "MON", "LAV"]
AUS = ["SMU", "SFX", "MTA", "ACA"]
else:
RSEQ=["SHE", "MCG", "CON", "MON", "LAV"]
AUS = ["SMU", "SFX", "MTA", "ACA", "BIS"]
if self.HOME in CWUAA and self.AWAY in CWUAA: # If the two teams are in the same conference we assign that conference, otherwise it's a non-conference game
self.CONFERENCE="CWUAA"
elif self.HOME in OUA and self.AWAY in OUA:
self.CONFERENCE="OUA"
elif self.HOME in RSEQ and self.AWAY in RSEQ:
self.CONFERENCE="RSEQ"
elif self.HOME in AUS and self.AWAY in AUS:
self.CONFERENCE="AUS"
else:
self.CONFERENCE="NONCON"
make_plays method
The make_plays method is forms the basis of the database, iterating over rowlist and identifying what rows are plays, and interpreting the bookkeeping rows , applying them as needed.
def make_plays(self): # Go through the rowlist to identify and create play objects, and to get the data from bookkeeping rows
quarter=1 # Initialize the quarter and scores as zero, obviously, two timeouts per team, clock is at 15:00
homescore=0
awayscore=0
HTO=2
ATO=2
clock="15:00"
off=[] # No team is in offense until one is assigned
for row in self.rowlist: #Looping through the whole rowlist
if row[0]=="2nd": # Identifying new quarters and resetting the clock and timeouts
quarter=2
clock="15:00"
elif row[0]=="3rd":
quarter=3
clock="15:00"
HTO=2
ATO=2
elif row[0]=="4th":
quarter=4
clock="15:00"
elif row[0]=="OT":
quarter=5
clock="00:00"
if self.MULE==1 or self.MULE==3: # checking for possession statements, but these only exist in data formats 1 and 3
if "drive start" in row[1]:
off=row[1][row[1].find("drive start")-4:row[1].find("drive start")-1]
elif self.MULE==2:
if len(row[0])==3:
if row[0]!="1st" and row[0]!="2nd" and row[0]!="3rd" and row[0] !="4th":
off=row[0]
if off != self.HOME and off != self.AWAY:
print ("POSSESSION ERROR", self.MULE, self.game_statement, row)
try:
if self.MULE == 1 or self.MULE == 3:
if "TIMEOUT" in row[1]: # checking for timeout statements
TO=row[1].find("TIMEOUT")
TOTEAM=row[1][ (TO + 8) : (TO + 11)]
if TOTEAM == self.HOME:
HTO=HTO-1
elif TOTEAM == self.AWAY:
ATO=ATO-1
else: #Error checking if the team calling a timeout isn't properly interpreted
print ("TIMEOUT ERROR", self.MULE, row, "TO", TOTEAM, "HOME", self.HOME, "AWAY", self.AWAY)
elif self.MULE == 2:
if "TIMEOUT" in row[3]:
if row[3][row[3].find("TIMEOUT") + 8 : row[3].find("TIMEOUT")+11] == self.HOME:
HTO=HTO-1
elif row[3][row[3].find("TIMEOUT")+8:row[3].find("TIMEOUT")+11]==self.AWAY:
ATO=ATO-1
else: #Error checking if the team calling a timeout isn't properly interpreted
print ("TIMEOUT ERROR", self.MULE, row)
except Exception:
print ("TIMEOUT EXCEPTION", self.MULE, self.game_statement, row[0], row[1])
try:
if self.MULE==1 or self.MULE==3: # find score statements
if len(row[1])>11 and len(row[1]) <15:
if row[1][3]==" ":
if row[1][0:3]==self.AWAY:
awayscore=int(row[1][4:row[1].find(",")].lstrip(" "))
homescore=int(row[1][-2:].lstrip(" "))
elif self.MULE==2:
if len(row[3]) > 11 and len(row[3]) < 15:
if row[3][3]==" ":
if row[3][0:3]==self.AWAY:
awayscore = int(row[3][4 : row[3].find(",")].lstrip(" "))
Homescore = int(row[3][-2:].lstrip(" "))
except Exception:
print ("SCORE ERROR", self.MULE, row, awayscore, homescore)
#identify clock statements
try:
if self.MULE==1 or self.MULE==3:
if ":" in row[1]:
clock=row[1][row[1].find(":")-2:row[1].find(":")+3]
elif self.MULE==2:
if ":" in row[3]:
clock=row[3][row[3].find(":")-2:row[3].find(":")+3]
#identify plays
if self.MULE==1 :
if "rush" in row[1] or "pass" in row[1] or "sack" in row[1] or "kick" in row[1] or "punt" in row[1] or "field goal" in row[1] or "PENALTY" in row[1]:
self.playlist.append(PlayClass.play(row, homescore, awayscore, off, quarter, ATO, HTO, clock, self.MULE))
clock=None
elif self.MULE==2:
if "rush" in row[3] or "pass" in row[3] or "sack" in row[3] or "kick" in row[3] or "punt" in row[3] or "field goal" in row[3] or "PENALTY" in row[3]:
self.playlist.append(PlayClass.play(row, homescore, awayscore, off, quarter, ATO, HTO, clock, self.MULE))
clock=None
elif self.MULE==3:
if "rush" in row[1] or "pass" in row[1] or "sack" in row[1] or "kick" in row[1] or "punt" in row[1] or "field goal" in row[1] or "PENALTY" in row[1]:
self.playlist.append(PlayClass.play(row, homescore, awayscore, off, quarter, ATO, HTO, clock, self.MULE))
clock=None
except Exception:
print ("CLOCK ERROR", self.MULE, row)
if homescore>awayscore: # Identifying the winning team
self.H_WIN=True
elif homescore < awayscore:
self.H_WIN=False
else: #If one isn't greater than the other we have a problem, there are no ties.
print ("H_WIN ERROR", self.MULE, self.game_statement, awayscore, homescore) #error-checking for ties
DEFENSE_FN Function
DEFENSE is the current defensive team, and is, obviously, defined as the team that is not currently the offensive team.
def DEFENSE_FN(self):
for x in self.playlist:
if x.OFFENSE==self.HOME: # Match the offense to the home or away team and set the defense to the other
x.DEFENSE=self.AWAY
elif x.OFFENSE == self.AWAY:
x.DEFENSE=self.HOME
else: # If it's not the home team and it's not the away team something is wrong
print ("DEFENSE ERROR", self.MULE, x.playdesc)
O_D_SCORE_FN Function
O_SCORE and D_SCORE give the current score for the offensive and defensive teams. It is one of a number of functions that convert between existing home and away attributes by comparing whether OFFENSE is equal to HOME or AWAY and copying the appropriate attribute.
def O_D_SCORE_FN(self):
for x in self.playlist:
if x.OFFENSE==self.HOME: # Match the offense to the home or away team and set the scores accordingly
x.O_SCORE=x.HOME_SCORE
x.D_SCORE=x.AWAY_SCORE
elif x.OFFENSE == self.AWAY:
x.O_SCORE=x.AWAY_SCORE
x.D_SCORE=x.HOME_SCORE
else: # If it's not one way or the other something is wrong
print("O/D SCORE ERROR", self.MULE, x.playdesc)
x.O_LEAD=x.O_SCORE-x.D_SCORE
O_D_TO_FN Function
The O_TO and D_TO attributes show the remaining timeouts for both the offense and defense. The calculation follows the same logic as above, comparing OFFENSE to HOME and AWAY.
def O_D_TO_FN(self):
for x in self.playlist:
if x.OFFENSE==self.HOME: # Match the offense to the home or away team and set the timeouts accordingly
x.O_TO=x.HOME_TO
x.D_TO=x.AWAY_TO
elif x.OFFENSE == self.AWAY:
x.O_TO=x.AWAY_TO
x.D_TO=x.HOME_TO
else:
print ("O/D TO ERROR", self.MULE, x.playdesc)
O_WIN_FN Function
The O_WIN attribute is set to True if the offensive team eventually wins the game, and is calculated by the same method as the above attributes, DEFENSE, OFF_SCORE & DEF_SCORE, and O_TO & D_TO.
def O_WIN_FN(self):
for x in self.playlist:
if x.OFFENSE==self.HOME: # Match the offense to the home or away team and set the win setting accordingly
x.O_WIN=self.H_WIN
elif x.OFFENSE == self.AWAY:
x.O_WIN=not(self.H_WIN)
else:
print("O WIN ERROR", self.MULE, x.playdesc)
TIME_FN Function
The TIME_FN uses the existing CLOCK data to interpolate the remaining time in the game, measured in seconds, for all plays. It loops through playlist three times. The first time it takes the existing CLOCK statements and converts them to seconds in the TIME attribute. The second loop interpolates every gap between two TIME statements, and the third loop simply rounds all of the TIME statements to whole seconds.
def TIME_FN(self):
try:
for x in self.playlist: # First loop through converts all the clock statements to time statements
if not(x.CLOCK==None):
x.TIME=int(3600-900*x.QUARTER+int(x.CLOCK[-2:])+60*int(x.CLOCK[:2]))
if x.QUARTER==5:
x.TIME=0
if self.playlist[-1].TIME==None: # If the last play has no time we assign it to 0 to avoid an error down the line
self.playlist[-1].TIME=0
for x in range (0, len(self.playlist)-1): # Second loop splines between all the known times
if self.playlist[x].TIME==None: #Find the start of a gap
for templong in range (x , len(self.playlist)): # Loop to find the end of a gap
if not(self.playlist[templong].TIME==None): #If the gap has ended
for temptwo in range (x,templong): # Another nested loop to spline
self.playlist[temptwo].TIME=self.playlist[temptwo-1].TIME -(self.playlist[x-1].TIME - self.playlist[templong].TIME)/ (templong - x)
x=templong
break
for x in self.playlist: # Third loop to round to the nearest second
x.TIME=int(round(x.TIME,0))
except Exception: # A lot of misc errors can happen here
print ("TIME ERROR", self.MULE, x.playdesc)
SCORING_PLAY_FN Function
The scoring play function identifies plays where there is a scoring event, and identifies the type of score, as well as which team scored. The scoring plays are touchdowns, field goals, rouges, and safeties. Safeties are considered to have been “scored” by the team who surrenders the safety and therefore have a nominal value of -2 points. This function is important for the EP_INPUT_FN.
def SCORING_PLAY_FN(self): # Need to find all the plays with scores
try:
for x in range (1, len(self.playlist)):
if self.playlist[x].FG_RSLT=="ROUGE":
self.playlist[x].SCORING_PLAY="O-ROUGE" # Only the offense is realistically likely to score a rouge, ever
elif self.playlist[x].FG_RSLT=="GOOD" and self.playlist[x].DOWN>0: # GOOD signals a made field goal but need to check the down to avoid PAT. Really only the offense can scorea field goal
self.playlist[x].SCORING_PLAY="O-FG"
elif "TOUCHDOWN" in self.playlist[x].playdesc: # Looking for TDs
for templong in range (x + 1,len(self.playlist)):
if "attempt" in self.playlist[templong].playdesc or "kickoff" in self.playlist[templong].playdesc:
if self.playlist[x].OFFENSE==self.playlist[templong].OFFENSE: # The team that has the PAT or kickoff is the one that scores the TD
self.playlist[x].SCORING_PLAY="O-TD"
elif self.playlist[x].OFFENSE==self.playlist[templong].DEFENSE:
self.playlist[x].SCORING_PLAY="D-TD"
break # If we don't break it will overwrite with future PATs or kickoffs
elif self.playlist[templong].O_SCORE>self.playlist[templong-1].O_SCORE: #Failing in that we can look for a change in one team's score
self.playlist[x].SCORING_PLAY="O-TD"
break
elif self.playlist[templong].D_SCORE>self.playlist[templong-1].D_SCORE:
self.playlist[x].SCORING_PLAY="D-TD"
break
else: # if we can't find the provenance of a TD it's a last-play TD, that all end up being on the offense
self.playlist[x].SCORING_PLAY="O-TD"
elif "SAFETY" in self.playlist[x].playdesc: #Safeties are usually on the O but KOR/PR can lead to a D safety
if self.playlist[x].YDLINE<65: # The field position of the play can effectively tell us whose safety it is.
self.playlist[x].SCORING_PLAY="D-SAFETY"
else:
self.playlist[x].SCORING_PLAY="O-SAFETY"
except Exception:
print ("SCORING PLAY ERROR", self.MULE, self.playlist[x].playdesc)
P1D_INPUT_FN Function
This function determines whether each play leads to a successful first down within that drive. Touchdowns are also considered to be successful first downs, whereas turnovers, punts, field goals, turnovers on downs, and ends of halves are considered unsuccessful.
def P1D_INPUT_FN(self):
for x in self.playlist:
if x.ODK=="OD": # We only care about P(1D) for OD plays
if x.DOWN>0: #We don't care about 2-pt conversions
if x.SCORING_PLAY=="O-TD": # If the O scores a TD then it's obviously good
x.P1D_INPUT=True
elif x.SCORING_PLAY=="D-TD": # if the D scores then it's bad
x.P1D_INPUT=False
elif x.SCORING_PLAY=="O-SAFETY": # Safeties are also bad
x.P1D_INPUT=False
else:
for y in self.playlist[self.playlist.index(x) + 1:]: #Now we loop through the rest of the plays
if y.OFFENSE != x.OFFENSE: #If there's a change of possession it's a fail
x.P1D_INPUT=False
break # Always break to avoid overwriting and keep the structure simple
elif y.SCORING_PLAY=="O-TD": # If offense scores a touchdown that's good
x.P1D_INPUT=True
break
elif y.ODK=="P" or y.ODK=="FG" or y.ODK=="KO": # If there's a non-OD play it implies a failure of the drive
x.P1D_INPUT=False
break
elif y.SCORING_PLAY == "D-TD": # A defensive touchdown is bad
x.P1D_INPUT=False
break
elif y==self.playlist[-1]:
x.P1D_INPUT=False
break
elif y.DOWN==1 and y.DISTANCE==10: # A 1st & 10 is good
x.P1D_INPUT=True
break
elif y.DOWN==1 and y.DISTANCE==y.YDLINE: # 1st & Goal is good
x.P1D_INPUT=True
break
else:
x.P1D_INPUT = False # If we get to the end of the game, finding nothing
elif x.DOWN==0: # Handling 2-point conversions
if "GOOD" in x.playdesc:
x.P1D_INPUT=True
else:
x.P1D_INPUT=False
EP_INPUT_FN Function
EP Input determines the next score in the game, be it one of the four scores above, or the end of a half. It also identifies whether the next score is scored by the current offense or defense. It loops through the playlist looking for a value in SCORING_PLAY or the end of a half.
def EP_INPUT_FN(self):
for x in self.playlist: # Loop through the playlist
for y in self.playlist[self.playlist.index(x):]: # Looping through all the plays going forward
if y.SCORING_PLAY != None: # If there's a scoring play
if y.SCORING_PLAY[0]=="O": # Need to match the scoring team with the current offense
if y.OFFENSE==x.OFFENSE:
x.EP_INPUT=y.SCORING_PLAY
break
elif y.OFFENSE == x.DEFENSE:
x.EP_INPUT="D" + y.SCORING_PLAY[1:]
break
else:
print ("EP INPUT ERROR:", self.MULE, x.playdesc)
elif y.SCORING_PLAY[0] == "D":
if y.OFFENSE == x.DEFENSE:
x.EP_INPUT=y.SCORING_PLAY
break
elif y.OFFENSE == x.OFFENSE:
x.EP_INPUT="D" + y.SCORING_PLAY[1:]
break
else:
print ("EP INPUT ERROR:", self.MULE, x.playdesc)
else:
print ("EP INPUT ERROR:", self.MULE, x.playdesc)
elif y.QUARTER==2 and self.playlist[self.playlist.index(y)+1]==3: # If it's halftime
x.EP_INPUT="HALF"
break
elif y.QUARTER==4 and self.playlist[self.playlist.index(y)+1]==5: # If OT begins
x.EP_INPUT="HALF"
break
else: # If the game ends
x.EP_INPUT = "HALF"
Play Class
Within the game object we have the attribute rowlist, which contains the raw text of every row within that game as found within the initial .csv that was imported. Games also have the attribute playlist, a list of plays that is initially empty. The method make_plays exists to identify plays within the rowlist, and to identify important attributes that need to be passed.
__init__ Method
The __init__ method imports the basic data passed from the game class, and initializes all of the other attributes in the class. Most importantly it brings in the data format identifier, which affects the behaviour of a number of functions.
def __init__(self,row, homescore, awayscore, off, quarter, ATO, HTO, clock, MULE):
self.MULE = MULE # Need to carry this information because it affects some of the parsing
if self.MULE == 1 or self.MULE==3: # The different formats have different row structures
self.DD = row[0] #The first cell has down & distance
self.SPOT = None #This data format doesn't have a separate FPOS cell
self.playdesc = row[1] #play description is in the second cell
elif self.MULE == 2:
self.DD = row[1]#Mule 2 is structured differently
self.SPOT = row[2]
self.playdesc = row[3]
self.HOME_SCORE = homescore#Carry over the score from the parent game
self.AWAY_SCORE = awayscore
self.OFFENSE = off #Carry over the offense
self.HOME_LEAD = self.HOME_SCORE-self.AWAY_SCORE#Home lead is obviously just the difference here, though this could just be a function but it seems dumb to have a one-line function
self.AWAY_TO = ATO#Carry over the TO situation
self.HOME_TO = HTO
self.CLOCK = clock # If there's any clock info
self.QUARTER = quarter # Carry over the qtr
# Here are all the other attributes we figure out via functions but which we initialize to None to start
self.DOWN=None
self.DISTANCE=None
self.RP=None
self.P_RSLT=None
self.FPOS=None
self.DEFENSE=None
self.O_SCORE=None
self.D_SCORE=None
self.ODK=None
self.O_WIN=None
self.FG_RSLT=None
self.GAIN=None
self.TIME=None
self.SCORING_PLAY=None
self.P1D_INPUT=None
self.EP_INPUT=None
DOWN_FN Function
This function determines the DOWN attribute, the down of the play. Because of the three different datasets coding this function essentially serves as three functions in one, determining down differently based on the data format.
def DOWN_FN(self):
try:
if self.MULE==1:#Mule 1 we have to parse it from the DD cell
if "0th" in self.DD:
self.DOWN=0
elif "1st" in self.DD:
self.DOWN=1
elif "2nd" in self.DD:
self.DOWN=2
elif "3rd" in self.DD:
self.DOWN=3
else:
print ("DOWN ERROR:", self.MULE, self.playdesc)#Catch errors
elif self.MULE==2:
self.DOWN=int(self.DD[0]) # In format 2 it's always the first charactor of the DD column
elif self.MULE==3:
self.DOWN=int(self.DD[2]) # In format three down is the third character
except Exception: #This will catch if there's anything non-numeric raising an exception with int()
print ("Down Error", self.MULE, self.playdesc)
DISTANCE_FN Function
DISTANCE is the distance to gain for the offense. Each data set requires a slight difference in the method of calculation, but ultimately it’s a string manipulation to pull an integer out of the DD attribute.
def DISTANCE_FN(self):
try:
if self.MULE==1: # for each data format distance is a simple string interpretation
self.DISTANCE=int(self.DD[8:10])
elif self.MULE==2:
self.DISTANCE=int(self.DD[2:])
elif self.MULE==3:
self.DISTANCE=int(self.DD[4:6])
if self.DISTANCE<=0: #error-checking, distance can't be 0 or negative
print ("DISTANCE ERROR:", self.MULE, self.playdesc)
except Exception:
print ("DISTANCE ERROR:", self.MULE, self.playdesc)
R_P_FN Function
This method identifies run and pass plays by looking for a set of keyword in the playdesc attribute. It is important for improving the efficiency of the P_RSLT_FN function, but also for future research where distinguishing between rush and pass attempts may be of value.
def RP_FN(self): #No need to error-check, either the phrases are in playdesc or not
if "pass" in self.playdesc or "sack" in self.playdesc or "scramble" in self.playdesc:
self.RP="P"
elif "rush" in self.playdesc:
self.RP="R"
P_RSLT_FN Function
Following the identification of pass plays this function looks at the result of said pass plays, be they complete, incomplete, interceptions, sacks, or scrambles. Scrambles are underrepresented, as in many cases the scorekeeper does not label them as such, and there is therefore no way to distinguish them from ordinary rushing plays.
def P_RSLT_FN(self): # Determining the result of a pass
if self.RP=="P": # Obviously only interested in pass plays
if "incomplete" in self.playdesc: # Looking for key strings
self.P_RSLT="I"
elif "complete" in self.playdesc:
self.P_RSLT="C"
elif "intercept" in self.playdesc:
self.P_RSLT="X"
elif "sack" in self.playdesc:
self.P_RSLT ="S"
elif "scramble" in self.playdesc or ("pass" in self.playdesc and "rush" in self.playdesc):
self.P_RSLT="R"
elif "FAILED" in self.playdesc: #catching 2-pt conversions
self.P_RSLT = "I"
elif "GOOD" in self.playdesc or "good" in self.playdesc:
self.P_RSLT = "C"
else: # If we don't have any of the key phrases something is wrong
print ("P RSLT ERROR", self.MULE, self.playdesc)
FPOS_FN Function
This function finds the field position of the play, using the +/- notation, where positive values are on the defense’s side of midfield, and negative values are in the offense’s end. By convention plays at midfield are defined as being the +55, although this would not actually affect any calculations going forward. This is done by checking whether the ball is on the offense’s or defense’s side of midfield by using a string comparison to the offensive team, and then either taking the yard line as-is or multiplying by -1.
def FPOS_FN(self): # We don't need to error check the result because if it's not a proper int the int conversion will raise an exception.
try:
if self.MULE==1: # String interpretation for each data format
self.FPOS=int(self.DD[-2:])
if self.DD[-5:-2]==self.OFFENSE: # If we're on the offensive side of the field we flip the sign
self.FPOS=self.FPOS*(-1)
elif self.MULE==2:
if self.SPOT[:3]==self.OFFENSE: # Data format 2 has a separate column for the field position
self.FPOS=int(self.SPOT[-2:])*(-1)
else:
self.FPOS = int(self.SPOT[-2:])
elif self.MULE == 3:
if self.DD[0] == self.DD[-3]:
self.FPOS=int(self.DD[-2:])*(-1)
else:
self.FPOS=int(self.DD[-2:])
if self.FPOS==-55: # If it's at midfield we define it as positive by convention
self.FPOS=55
except Exception: # Will catch any non-ints
print ("FPOS ERROR", self.MULE, self.playdesc)
YDLINE_FN Function
This function converts the FPOS attribute into YDLINE. YDLINE is simply the number of yards from the goal line, and spans from 1 to 109. It makes calculations much easier because the value is continuous, rather than having the discontinuity that FPOS does of jumping from +55 to -54, requiring a lot of repetitive conversions in later methods.
def YDLINE_FN(self):
if self.FPOS > 0: # Simple conversion from FPOS to YDLINE
self.YDLINE = self.FPOS
elif self.FPOS < 0:
self.YDLINE = 110 + self.FPOS
else:
print ("YDLINE ERROR", self.MULE, self.playdesc)
if self.YDLINE <= 0 or self.YDLINE <= 110: #For out of range errors
print ("YDLINE ERROR", self.MULE, self.playdesc)
ODK_FN Function
This function determines if the play is an offensive play or some kind of special teams play. This is largely done by looking for keywords in playdesc. It includes a few checks for unusual situations, such as intentional safeties being coded as punts and identifying PAT attempts and dead-ball penalties.
def ODK_FN(self):
if self.DOWN == 3 and "SAFETY" in self.playdesc:
self.ODK = "P" #need to account for intentional safeties
elif "punt" in self.playdesc: # Looking for some pretty straightforward key phrases
self.ODK="P"
elif "kickoff" in self.playdesc:
self.ODK="KO"
elif "field goal" in self.playdesc or "kick attempt" in self.playdesc: #kick attempt is for PAT
self.ODK="FG"
elif not(self.RP==None): #otherwise plays that are runs or passes are just "OD," but it comes last because of fakes
self.ODK="OD"
elif "PENALTY" in self.playdesc:
self.ODK="PEN"
else:
print ("ODK ERROR", self.MULE, self.playdesc)
FG_RSLT_FN Function
The field goal result finds field goal attempts and logs whether they were successful or not, classifying them as either good, missed, or rouge. “Missed” includes blocks as well, but those are often not labelled differently and so at this time a separate classification has not been included,
def FG_RSLT_FN(self): #No error-catching because we have the else used for misses
if self.ODK == "FG":
if "GOOD" in self.playdesc:
self.FG_RSLT = "GOOD"
elif "ROUGE" in self.playdesc:
self.FG_RSLT = "ROUGE"
else: # There are several ways to denote failed FG attempts so this is a catch-all
self.FG_RSLT = "MISSED"
GAIN_FN Function
The gain function determines from string comprehension the gain on the play. This version properly accounts for gains of more than 100 yards, which were not properly calculated in the previous database. While these plays are outliers, they have a disproportionate impact on averages.
def GAIN_FN(self):
try:
if not(self.RP==None):
if self.P_RSLT == "I" or self.P_RSLT == "X": # Incompletions obviously have no gain
self.GAIN = 0
elif "no gain" in self.playdesc: # they use "no gain" instead of 0
self.GAIN = 0
elif "GOOD" in self.playdesc and self.DOWN == 0: # Handling 2-pt conversions
self.GAIN = self.YDLINE
elif "FAILED" in self.playdesc:
self.GAIN = 0
elif "loss" in self.playdesc: # They use "loss" instead of negative gain, so need to flip the sign
if self.playdesc[self.playdesc.find("yard") - 4] == 1: # If somehow there's a loss of 100 yards
self.GAIN = int(self.playdesc[self.playdesc.find("yard") - 4 : self.playdesc.find("yard") - 1])
else:
self.GAIN = int(self.playdesc[self.playdesc.find("yard") - 3 : self.playdesc.find("yard") - 1])
else:
if self.playdesc[self.playdesc.find("yard")-4] == 1: # Handles gains of 100+ yards
self.GAIN = int(self.playdesc[self.playdesc.find("yard") - 4 : self.playdesc.find("yard")])
else:
self.GAIN = int(self.playdesc[self.playdesc.find("yard") - 3 : self.playdesc.find("yard")])
except Exception: # Will catch any non-ints
print ("Gain Error", self.MULE, self.playdesc)
Punt Class
To better look at the different kinds of plays it is easiest to create new objects most appropriate for the situation. One of those is to create a class for punts, to better look at the effects of punting in different situations.
Array_Declaration Method
This method sits in the class module and is called by the Master module to create a list of punt objects where the list index can be mapped to the YDLINE attribute. Beyond a certain yardline the objects are largely empty, as there are no punts very near the goal line. This is not a concern as these are just left as placeholders.
def Array_Declaration():
for x in range (0,110):
PUNT_ARRAY.append(PUNT(x))
for x in range (0,110):
PUNT_ARRAY.append(PUNT(x))
__init__ Method
This method creates all the attributes we will need for the punt object. The current form is skeletal and will be developed along with future research that will look at punts. Because of the bizarre multinomial distribution of scoring plays bootstrapping is heavily used in the calculation of most confidence intervals.
def __init__(self, ydline):
self.N=0 #Number of punts from this ydline
self.YDLINE=ydline
self.EP=None #Average EP value of this punt
self.EP_ARRAY=[] #List holding all the EP data
self.EP_LOW=None #Lower CI for EP
self.EP_HIGH=None #Upper CI for EP
self.BOOTSTRAP=Globals.DummyArray #A bootstrap of the EP to determine the CI
Calculate Method
This method determines the EP value of the punt by averaging the EP inputs from all the punt plays that occurred from this yard line.
def calculate(self):
if self.N > 0: # Avoid divide-by-zero
self.EP=sum(self.EP_ARRAY)/self.N
Boot Function
This function generates the a bootstrapping of the EP values to determine a confidence interval and a distribution. It uses the numpy module to improve speed, creating a ten-fold improvement in execution time compared to a strictly Python-coded earlier version
def boot(self):
if self.N>10:
self.BOOTSTRAP = numpy.sort(numpy.array([numpy.average(numpy.random.choice(self.EP_ARRAY, self.N, replace=True)) for _ in range(Globals.BOOTSTRAP_SIZE)], dtype='f4'))
self.EP_HIGH = self.BOOTSTRAP[Globals.BOOTSTRAP_SIZE * (1-Globals.CONFIDENCE)]
self.EP_LOW = self.BOOTSTRAP[Globals.BOOTSTRAP_SIZE * Globals.CONFIDENCE - 1]
KO Class
Similar to the punt class, the kickoff class carries information particular to the needs of analyzing kickoffs. It creates a list of kickoff objects, though the vast majority of them. All the methods and functions are the same as the punt class, holding information about the EP data, and creating a bootstrap of the EP value.
FG Class
The FG class follows the same structure as the punt and kickoff classes, creating an array of objects indexed by yardline. Because of the nature of field goals either being missed, rouge, or good, it tracks each of the three separately to allow them to be combined later.
__init__ method
Because of the nature of field goals either being missed, rouge, or good, each of the three results is tracked in a separate attribute. This also allows the calculation of separate confidence intervals on each of these
def __init__(self, ydline):
self.YDLINE=ydline
self.N=0
self.GOOD=0
self.ROUGE=0
self.MISSED=0
self.P_GOOD=None
self.P_GOOD_LOW=None
self.P_GOOD_HIGH=None
self.P_ROUGE=None
self.P_ROUGE_LOW=None
self.P_ROUGE_HIGH=None
self.P_MISSED=None
self.P_MISSED_LOW=None
self.P_MISSED_HIGH=None
self.EP=None
self.BOOTSTRAP=Globals.DummyArray
self.EP_ARRAY=[]
self.EP_LOW=None
self.EP_HIGH=None
Calculate method
Using the binomial confidence functions from the Functions module we can determine the confidence intervals on each result using a 1 vs. all method.
def calculate(self): #Calculate all the percentages and the binomial CIs. Can do right away b/c there's no EP aspect involved
if self.N>0:
self.P_GOOD=self.GOOD/self.N
self.P_GOOD_HIGH=Functions.BinomHigh(self.GOOD, self.N, Globals.CONFIDENCE)
self.P_GOOD_LOW=Functions.BinomLow(self.GOOD, self.N, Globals.CONFIDENCE)
self.P_ROUGE=self.ROUGE/self.N
self.P_ROUGE_HIGH=Functions.BinomHigh(self.ROUGE, self.N, Globals.CONFIDENCE)
self.P_ROUGE_LOW=Functions.BinomLow(self.ROUGE, self.N, Globals.CONFIDENCE)
self.P_MISSED=self.MISSED/self.N
self.P_MISSED_HIGH=Functions.BinomHigh(self.MISSED, self.N, Globals.CONFIDENCE)
self.P_MISSED_LOW=Functions.BinomLow(self.MISSED, self.N, Globals.CONFIDENCE)
self.EP=sum(self.EP_ARRAY) / self.N
P(1D) Class
The P(1D) class allows us to calculate the first down probability of a situation from the data previously parsed. This recreates the methods used in the previous examination of P(1D) (Clement 2018b) to determine the P(1D) of down & distance situations with confidence intervals.
Array_Declaration Method
Similar to the previous lists that were one-dimensional for field position, P(1D) objects are held in a dimensional list that lets us consider both down and distance. A separate but similarly constructed array handles & Goal situations
def Array_Declaration():
for down in range (0,4):
temp=[]
for distance in range(0,26):
temp.append(P1D(down, distance))
P1D_ARRAY.append(temp)
for down in range (0,4):
temp=[]
for distance in range(0,26):
temp.append(P1D(down, distance))
P1D_GOAL_ARRAY.append(temp)
__init__ Method
This method defines the attributes we will need. Since P(1D) is a simple binomial variable, we only need to hold the attributes related to this. The confidence intervals are calculated with the binomial functions developed in the Functions module.
def __init__(self, down, distance):
self.DOWN=down
self.DISTANCE=distance
self.N=0
self.X=0
self.P=None
self.LOW=None
self.HIGH=None
self.SMOOTHED=None
Binom method
The binom method calls the binomial confidence functions mentioned above to determine the upper and lower bounds on the confidence interval for P(1D).
def binom(self):
if self.X > 0:
self.P=self.X/self.N
self.LOW=Functions.BinomLow(self.X, self.N, Globals.CONFIDENCE)
self.HIGH=Functions.BinomHigh(self.X, self.N, Globals.CONFIDENCE)
if self.X > 0:
self.P=self.X/self.N
self.LOW=Functions.BinomLow(self.X, self.N, Globals.CONFIDENCE)
self.HIGH=Functions.BinomHigh(self.X, self.N, Globals.CONFIDENCE)
Conclusion
With the development of a more sophisticated interpreter of the U Sports database it becomes possible to parse in a mere fraction of the time, allowing for shorter development and testing cycles. This method is also far less resource-intensive as the entirety of the raw and parsed data is not constantly being held in memory but instead the raw data is read once at runtime and the parsing happens there. Heavy use of the numpy package provides significant speed improvements as the package is run in C and is highly optimized.
This framework will provide the basis for future research into the nature of different kicking plays, leading into a discussion of EP and 3rd down decision-making. The availability of Python’s many packages will assist in the development of a future WP model through the sklearn package.
References
Clement, Christopher M. 2018a. “It’s the Data, Stupid: Development of a U Sports Football Database.” Passes and Patterns. June 30, 2018. http://passesandpatterns.blogspot.com/2018/06/its-data-stupid-development-of-u-sports.html.
———. 2018b. “Three Downs Away: P(1D) In U Sports Football.” Passes and Patterns. August 23, 2018. http://passesandpatterns.blogspot.com/2018/08/three-downs-away-p1d-in-u-sports.html.
Pezzullo, John C. 2014. “Exact Confidence Intervals for Samples from the Binomial and Poisson Distributions.” Georgetown University Medical Center. http://statpages.info/confint.xls.
No comments:
Post a Comment