Saturday, October 13, 2018

The Roman Numerals of Computing: An Object-Oriented Database of U Sports Football

  1. Abstract

This work redevelops in Python the U Sports parser and calculator previously described (Clement 2018b), which was built using VBA. Data is imported through Python’s csv package, and parsed using an object-oriented approach, creating games and plays as classes with attributes as appropriate to each. Further objects exist to support analysis on the parsed data, and the numpy package with its arrays allows for far faster calculation of results. Discussion of future work built on this restructured database includes examination of special teams, expected points (EP) and Win Probability (WP).

  1. Introduction

While previous work developed a pair of relational databases for U Sports and CFL data, which proved adequate for certain tasks, there was interest in taking an object-oriented approach to the matter. The prime benefit of an object-oriented approach would be to structure the data in a fashion more akin to how our data is structured in reality. Games are independent events, and the plays are independent within them. By structuring the games as objects, and the plays as a separate class within them, we can also define attributes as being either game-level attributes or play-level attributes. The date of a game is a game-level attribute, whereas field position is a play-level attribute.
In the relational database a large number of rows were dedicated to bookkeeping: score statements, clock statements, and quarter starts among them, consuming as much as one-quarter of all the rows. By moving to an object-oriented structure that information is held at the appropriate level as an attribute of the object, making for a cleaner data structure.
Because this version of the database does not immediately provide a visual output of all the parsed data, it was necessary to implement more rigorous error-checking. In the process of doing so a number of other errors were spotted and corrected, though none that would affect the integrity of the P(1D) results already delivered (Clement 2018b). A number of missing score statements were repaired, as well as errors relating to distance gained on plays with fumbles.
  1. Database

The code is structured using a number of modules. Three central modules form the core of the code, each discussed below. The program is run from the Master module, which runs small sections of code and calls other methods and functions, directing the overall flow of the program. A limited number of global variables are held in Globals, which avoids constantly passing large numbers of local variables between functions and making the code needlessly difficult to maintain. Finally, Functions holds a number of small numerical functions that prove useful in a variety of situations.
Furthermore, there is a folder for the class modules, each of which is responsible for holding one class of object, with its attendant methods, functions and attributes, and any additional code that relates to that object class.
    1. Master Module

Traffic within the program is managed by the Master module, a central module which initializes all the other elements of the program and directs the sequence of calculations. Master also serves as the laboratory to implement initial concepts before better organizing them within the structure of the code.
The first responsibility of the Master module is importing the data. The data is read from .csv files using Python’s csv module, with each of the three formats of U Sports data (Clement 2018a) stored in a separate csv file. The script reads through each row of the file looking for new games, identified by the presence of “vs.”, and creates a new Game object for each, appending it to the global list gamelist. All other rows are added to the rowlist attribute of the most recent game for further treatment. The code below demonstrates the approach taken to create the game objects. The dummy variable simply avoids an off-by-one error when reading the first game, in an inelegant but utilitarian fashion.
import csv #to bring in the MULE data from CSV
dummy=0
#IMPORTING DATA FROM THE CSV MULES
with open("Data/CIS MULE 01.csv") as csvfile:
       MULE = csv.reader(csvfile)
       for row in MULE:
           if " vs. " in row[0]: #looking for new games
               Globals.gamelist.append(GameClass.game(row[0], 1))#add the old temp to the list of games
           elif dummy==0:
               dummy=1
           elif row[0]!=None or row[1]!=None:
               tempplay=[]
               tempplay.append(row[0])
               tempplay.append(row[1])
               Globals.gamelist[-1].rowlist.append(tempplay)
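The same grouping can be expressed without the dummy flag. The following is a minimal, self-contained sketch rather than the program's own code: the function name group_rows, the plain dictionaries standing in for Game objects, and the sample rows are all hypothetical and exist only for illustration.

```python
import csv
import io

def group_rows(rows):
    """Group raw rows into games; a row whose first cell contains ' vs. ' starts a new game."""
    games = []
    for row in rows:
        if not row:
            continue  # skip completely blank lines
        if " vs. " in row[0]:
            # Hypothetical stand-in for GameClass.game(row[0], 1)
            games.append({"statement": row[0], "rowlist": []})
        elif games:
            # Rows that appear before the first game header are ignored
            games[-1]["rowlist"].append(row[:2])
    return games

# Invented sample data, not taken from the real files
sample = io.StringIO(
    "header,ignored\n"
    "AAA vs. BBB 2016-09-03,\n"
    "1st and 10,rush for 3 yards\n"
    "CCC vs. DDD 2016-09-04,\n"
    "2nd and 7,pass incomplete\n"
)
games = group_rows(csv.reader(sample))
print(len(games), games[0]["rowlist"])
```

Checking that the list of games is non-empty before appending plays the role of the dummy variable: any row ahead of the first game header is simply dropped.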
    1. Globals Module

The simple mission of the Globals module is to hold a set of global variables that can be imported en bloc by each other module to give access to the core variables and data of the program and limiting the number of redundant passes. Care was taken to be judicious in the use of these global variables to avoid overencumbering the system with needless global variables. The variables defined in Globals are given below.
import numpy # Needed for DummyArray at the bottom of the module
CONFIDENCE=0.025 # The one-sided confidence interval size for all statistical tests
BOOTSTRAP_SIZE=10000 # Number of bootstrap iterations to use


TDval=7.0 # Value of a touchdown
TDval_BOOTSTRAP=[] # Bootstrap array to determine confidence interval of TDval
TDval_HIGH=None # Upper bound of TDval CI
TDval_LOW=None # Lower bound of TDval CI


FGval=3.0 # Same for field goal
FGval_BOOTSTRAP=[]
FGval_HIGH=None
FGval_LOW=None


ROUGEval=1.0 # For rouge
ROUGEval_BOOTSTRAP=[]
ROUGEval_HIGH=None
ROUGEval_LOW=None


SAFETYval=-2.0 # For safety
SAFETYval_BOOTSTRAP=[]
SAFETYval_HIGH=None
SAFETYval_LOW=None


HALFval=0.0 # Value of the end of the half is fixed and certain
gamelist=[] # The gamelist holds all the games


DummyArray = numpy.full(BOOTSTRAP_SIZE, -100, dtype='int32') # Holds a default value to avoid errors when comparing to None
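The _HIGH and _LOW variables are left as None in this listing. One plausible way to fill them from a bootstrap array is the percentile method, sketched below. This is an assumption about how such bounds might be derived, not a description of the procedure actually used; percentile_bounds and the synthetic data are hypothetical.

```python
import numpy as np

CONFIDENCE = 0.025  # one-sided tail size, as in Globals

def percentile_bounds(bootstrap, tail=CONFIDENCE):
    """Return (low, high) bounds that cut off `tail` of the samples on each side."""
    low, high = np.percentile(bootstrap, [100 * tail, 100 * (1 - tail)])
    return float(low), float(high)

# Synthetic values purely for demonstration, centred on the default TDval
rng = np.random.default_rng(0)
fake_bootstrap = rng.normal(loc=7.0, scale=0.1, size=10000)
low, high = percentile_bounds(fake_bootstrap)
print(low, high)
```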
    1. Functions

The Functions module provides access to a few functions that are reused across several classes, improving maintainability by centralizing them in a single module.
      1. Binomial Error

Binomial error in the program is calculated iteratively to obtain a value that is as exact as we choose to make it; here we iterate until the result converges at the 10^-30 level. We recycle the same binomial function used in the previous calculator (Pezzullo 2014), as was discussed in that work (Clement 2018a).
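Since the binomial function itself is not reproduced here, the sketch below shows one common way an exact binomial interval can be found iteratively: a Clopper-Pearson style interval located by bisection. It is an illustration under that assumption and is not the Pezzullo code; the names binom_cdf, exact_bounds and the tolerance of 1e-12 are choices made for this example only.

```python
from math import comb

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))

def _bisect(func, target, increasing, tol):
    """Find p in [0, 1] where a monotonic func(p) crosses target."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if (func(mid) < target) == increasing:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def exact_bounds(k, n, tail=0.025, tol=1e-12):
    """Lower and upper bounds on the success probability after k successes in n trials."""
    # Lower bound: P(X >= k | p) = tail, which increases with p
    low = 0.0 if k == 0 else _bisect(lambda p: 1 - binom_cdf(k - 1, n, p), tail, True, tol)
    # Upper bound: P(X <= k | p) = tail, which decreases with p
    high = 1.0 if k == n else _bisect(lambda p: binom_cdf(k, n, p), tail, False, tol)
    return low, high

low, high = exact_bounds(5, 10)
print(low, high)
```

Tightening tol makes the result as exact as desired, up to the limits of the floating-point type in use.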
      1. BootCompare

This function takes two bootstrap arrays and returns the proportion of all pairs of elements, one from A and one from B, in which the element from A is greater than the element from B. It assumes that A and B are both sorted arrays, which they are because prior elements of the code demand that they be sorted, and with this assumption it is able to operate with only about 2 * Globals.BOOTSTRAP_SIZE comparisons, at O(n) efficiency, rather than performing a comparison of every pair at O(n^2).
def BootCompare(arrA, arrB):
   count=0
   b=0
   for a in range (0, Globals.BOOTSTRAP_SIZE):
       for b in range (b, Globals.BOOTSTRAP_SIZE):
           if arrB[b] > arrA[a]:
               count += b
               break
       else:
           count += Globals.BOOTSTRAP_SIZE
   count /= Globals.BOOTSTRAP_SIZE**2
   return float(count)
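For comparison, the same proportion can be computed with numpy's searchsorted, which performs a binary search for each element of A. This is a sketch of an alternative and not part of the program; boot_compare_np is a hypothetical name. As in the loop above, a tie between an element of A and an element of B is counted in favour of A.

```python
import numpy as np

def boot_compare_np(arr_a, arr_b):
    """Proportion of (a, b) pairs with b <= a; arr_b must be sorted ascending."""
    arr_a = np.asarray(arr_a)
    arr_b = np.asarray(arr_b)
    # For each a, side="right" returns how many values of arr_b are <= a
    counts = np.searchsorted(arr_b, arr_a, side="right")
    return float(counts.sum()) / (arr_a.size * arr_b.size)

a = np.array([1.0, 3.0, 5.0, 7.0])
b = np.array([2.0, 3.0, 4.0, 8.0])
result = boot_compare_np(a, b)
print(result)  # 0.5
```

This variant runs in O(n log n) rather than O(n), but it does not require A to be sorted and avoids the explicit Python loop.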
    1. Game class

The game class creates objects that hold games in their entirety.  This allows us to append attributes belonging to the game directly to the object, rather than carrying them with every play as was done in the relational database. Notably, the teams playing, the date of the game, and the final score are all attributes that belong to the game and not to the individual play. The dataset from which the game comes is also passed as the attribute MULE to account for differences in how the parser should handle certain attributes. The code below shows how a game object is initialized.
      1. __init__ method

The __init__ method initializes the game object, creating all of the attributes that will then be determined in later functions. No work is done in this method to assign values to any attributes beyond the statement and data format that are imported initially, so as to improve the maintainability and portability of the code. Most of the attributes are determined in the subsequent game_calc method. The code for the __init__ method is given below.
def __init__(self, statement, MULE):
   self.rowlist=[] # Holds all the rows of raw data from the csv
   self.playlist=[] # Holds a list of all the plays as objects
   self.MULE=MULE # Which data format is this from
   self.game_statement = statement # The initial "vs. " statement
   self.AWAY=self.game_statement[0:3] # Automatically grabs the Home and Away team from the vs. statement
   self.HOME=self.game_statement[8:11]
   self.H_WIN=None # If the home team wins the game
   self.H_FINAL=None # Final score of the game
   self.A_FINAL=None
   self.LEAGUE = None
   self.CONFERENCE = None # The conference in which the game was played
   self.YEAR=None # The year, month, day, weekday of the game
   self.MONTH=None
   self.DAY=None
   self.WEEKDAY=None
   self.SEASON=None # The season in which the game was played, always the same as YEAR for Canadian football, as seasons don't span over New Year's Day, exists for compatibility with other leagues in the future
      1. Game_calc method

This method determines a number of the core attributes of the game. It separates the row attribute into its components, based on the data format, and through some string interpretation determines the date of the game. It also determines the conference in which the game was played, which considers the decision of Bishop’s University to move from the RSEQ to AUS for the 2017 season. The code of the game_calc method is included here.
from datetime import date # Module-level import needed for the weekday calculation below
# Determines some of the basic info about the game - date, conference
def game_calc(self):
   for x in range (2001,2100): # Loop to get the year of the game. Checks 2001 through 2099 to allow for future seasons and for any older PxP games we might find.
       if str(x) in self.game_statement: # Check if the year is in the game statement
           self.YEAR=x # Assign that year to the attribute
           self.SEASON=x
           break # Once we find the year there's no need to waste clock cycles
   self.MONTH = int(self.game_statement[self.game_statement.find(str(self.YEAR)) + 5 : self.game_statement.find(str(self.YEAR)) + 7]) # Assign the other attributes from string manipulation once we know the year
   self.DAY = int(self.game_statement[self.game_statement.find(str(self.YEAR)) + 8 : self.game_statement.find(str(self.YEAR)) + 10])
   self.WEEKDAY = date(self.YEAR, self.MONTH, self.DAY).weekday()
   self.LEAGUE="U SPORTS" # Always U Sports in this case, will get more sophisticated if we start looking at other leagues.
   CWUAA=["MAN", "SKH", "REG", "ALB", "CGY", "UBC", "SFU"] # These are to determine the conference of a game #These two conferences always have the same teams
   OUA = ["CAR", "OTT", "QUE", "TOR", "YRK", "MAC", "GUE", "WAT", "WLU", "WES", "WIN"]
   if self.YEAR<2017: # Because Bishop's moved from the RSEQ to the AUS for the 2017 season we need to adapt accordingly
       RSEQ=["BIS", "SHE", "MCG", "CON", "MON", "LAV"]
       AUS = ["SMU", "SFX", "MTA", "ACA"]
   else:
       RSEQ=["SHE", "MCG", "CON", "MON", "LAV"]
       AUS = ["SMU", "SFX", "MTA", "ACA", "BIS"]


   if self.HOME in CWUAA and self.AWAY in CWUAA: # If the two teams are in the same conference we assign that conference, otherwise it's a non-conference game
       self.CONFERENCE="CWUAA"
   elif self.HOME in OUA and self.AWAY in OUA:
       self.CONFERENCE="OUA"
   elif self.HOME in RSEQ and self.AWAY in RSEQ:
       self.CONFERENCE="RSEQ"
   elif self.HOME in AUS and self.AWAY in AUS:
       self.CONFERENCE="AUS"
   else:
       self.CONFERENCE="NONCON"
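The fixed-offset slicing used for the teams and the date can also be written as a single regular expression. The sketch below assumes a game statement laid out as "AAA vs. BBB ... YYYY-MM-DD", which is inferred from the slice positions in the code and not confirmed by the data; parse_game_statement and the sample statement are hypothetical.

```python
import re
from datetime import date

# Assumed layout: three-character away team, " vs. ", three-character home team,
# then a date written as year, month and day separated by one non-digit character
STATEMENT_RE = re.compile(r"^(\w{3}) vs\. (\w{3}).*?(\d{4})\D(\d{2})\D(\d{2})")

def parse_game_statement(statement):
    """Return (away, home, date) parsed from a game statement."""
    match = STATEMENT_RE.search(statement)
    if match is None:
        raise ValueError("unrecognized game statement: " + statement)
    away, home, year, month, day = match.groups()
    return away, home, date(int(year), int(month), int(day))

away, home, played = parse_game_statement("ALB vs. CGY 2016-09-03")
print(away, home, played.isoformat(), played.weekday())
```

A statement that does not fit the pattern raises an error immediately, rather than producing a misaligned slice.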
      1. make_plays method

The make_plays method forms the basis of the database, iterating over rowlist, identifying which rows are plays, and interpreting the bookkeeping rows, applying them as needed.
def make_plays(self): # Go through the rowlist to identify and create play objects, and to get the data from bookkeeping rows
       quarter=1 # Initialize the quarter and scores as zero, obviously, two timeouts per team, clock is at 15:00
       homescore=0
       awayscore=0
       HTO=2
       ATO=2
       clock="15:00"
       off=[] # No team is in offense until one is assigned
       
       for row in self.rowlist: #Looping through the whole rowlist
           if row[0]=="2nd": # Identifying new quarters and resetting the clock and timeouts
               quarter=2
               clock="15:00"
           elif row[0]=="3rd":
               quarter=3
               clock="15:00"
               HTO=2
               ATO=2
           elif row[0]=="4th":
               quarter=4
               clock="15:00"
           elif row[0]=="OT":
               quarter=5
               clock="00:00"
           
           if self.MULE==1 or self.MULE==3: # checking for possession statements, but these only exist in data formats 1 and 3
               if "drive start" in row[1]:
                   off=row[1][row[1].find("drive start")-4:row[1].find("drive start")-1]
           elif self.MULE==2:
               if len(row[0])==3:
                   if row[0]!="1st" and row[0]!="2nd" and row[0]!="3rd" and row[0] !="4th":
                       off=row[0]
           if off != self.HOME and off != self.AWAY:
               print ("POSSESSION ERROR", self.MULE, self.game_statement, row)


           try:
               if self.MULE == 1 or self.MULE == 3:
                   if "TIMEOUT" in row[1]: # checking for timeout statements
                       TO=row[1].find("TIMEOUT")
                       TOTEAM=row[1][ (TO + 8) : (TO + 11)]
                       if TOTEAM == self.HOME:
                           HTO=HTO-1
                       elif TOTEAM == self.AWAY:
                           ATO=ATO-1
                       else: #Error checking if the team calling a timeout isn't properly interpreted
                           print ("TIMEOUT ERROR", self.MULE, row, "TO", TOTEAM, "HOME", self.HOME, "AWAY", self.AWAY)
               elif self.MULE == 2:
                   if "TIMEOUT" in row[3]:
                       if row[3][row[3].find("TIMEOUT") + 8 : row[3].find("TIMEOUT")+11] == self.HOME:
                           HTO=HTO-1
                       elif row[3][row[3].find("TIMEOUT")+8:row[3].find("TIMEOUT")+11]==self.AWAY:
                           ATO=ATO-1
                       else: #Error checking if the team calling a timeout isn't properly interpreted
                           print ("TIMEOUT ERROR", self.MULE, row)
           except Exception:
               print ("TIMEOUT EXCEPTION", self.MULE, self.game_statement, row[0], row[1])


           try:
               if self.MULE==1 or self.MULE==3: # find score statements
                   if len(row[1])>11 and len(row[1]) <15:
                       if row[1][3]==" ":
                           if row[1][0:3]==self.AWAY:
                               awayscore=int(row[1][4:row[1].find(",")].lstrip(" "))
                               homescore=int(row[1][-2:].lstrip(" "))
               elif self.MULE==2:
                   if len(row[3]) > 11 and len(row[3]) < 15:
                       if row[3][3]==" ":
                           if row[3][0:3]==self.AWAY:
                               awayscore = int(row[3][4 : row[3].find(",")].lstrip(" "))
                               homescore = int(row[3][-2:].lstrip(" "))
           except Exception:
               print ("SCORE ERROR", self.MULE, row, awayscore, homescore)
   
           #identify clock statements
           try:
               if self.MULE==1 or self.MULE==3:
                   if ":" in row[1]:
                       clock=row[1][row[1].find(":")-2:row[1].find(":")+3]
               elif self.MULE==2:
                   if ":" in row[3]:
                       clock=row[3][row[3].find(":")-2:row[3].find(":")+3]
   
                   #identify plays
               if self.MULE==1 :
                   if "rush" in row[1] or "pass" in row[1] or "sack" in row[1] or "kick" in row[1] or "punt" in row[1] or "field goal" in row[1] or "PENALTY" in row[1]:
                       self.playlist.append(PlayClass.play(row, homescore, awayscore, off, quarter, ATO, HTO, clock, self.MULE))
                       clock=None
               elif self.MULE==2:
                   if "rush" in row[3] or "pass" in row[3] or "sack" in row[3] or "kick" in row[3] or "punt" in row[3] or "field goal" in row[3] or "PENALTY" in row[3]:
                       self.playlist.append(PlayClass.play(row, homescore, awayscore, off, quarter, ATO, HTO, clock, self.MULE))
                       clock=None
               elif self.MULE==3:
                   if "rush" in row[1] or "pass" in row[1] or "sack" in row[1] or "kick" in row[1] or "punt" in row[1] or "field goal" in row[1] or "PENALTY" in row[1]:
                       self.playlist.append(PlayClass.play(row, homescore, awayscore, off, quarter, ATO, HTO, clock, self.MULE))
                       clock=None
           except Exception:
               print ("CLOCK ERROR", self.MULE, row)
               
       if homescore>awayscore: # Identifying the winning team
           self.H_WIN=True
       elif homescore < awayscore:
           self.H_WIN=False
       else: #If one isn't greater than the other we have a problem, there are no ties.
           print ("H_WIN ERROR", self.MULE, self.game_statement, awayscore, homescore) #error-checking for ties
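The score-statement test in make_plays relies on string length and character positions. As an illustration, the same check can be sketched with a regular expression. The layout "AAA 14, BBB 7" is inferred from the slicing in the code and is an assumption; parse_score is a hypothetical helper, not part of the program.

```python
import re

# Assumed layout of a score statement: "AWAY score, HOME score"
SCORE_RE = re.compile(r"^([A-Z]{3}) (\d+), ([A-Z]{3}) (\d+)$")

def parse_score(text, away, home):
    """Return (awayscore, homescore) if text is a score statement for these teams, else None."""
    match = SCORE_RE.match(text.strip())
    if match is None:
        return None
    team1, score1, team2, score2 = match.groups()
    if team1 == away and team2 == home:
        return int(score1), int(score2)
    return None

print(parse_score("ALB 14, CGY 7", "ALB", "CGY"))
print(parse_score("ALB rush for 5 yards", "ALB", "CGY"))
```

Returning None for a non-matching row lets the caller keep the running score unchanged without a try/except block.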
      1. DEFENSE_FN Function

DEFENSE is the current defensive team, and is, obviously, defined as the team that is not currently the offensive team.
   def DEFENSE_FN(self):
       for x in self.playlist:
           if x.OFFENSE==self.HOME: # Match the offense to the home or away team and set the defense to the other
               x.DEFENSE=self.AWAY
           elif x.OFFENSE == self.AWAY:
               x.DEFENSE=self.HOME
           else: # If it's not the home team and it's not the away team something is wrong
               print ("DEFENSE ERROR", self.MULE, x.playdesc)
      1. O_D_SCORE_FN Function

O_SCORE and D_SCORE give the current score for the offensive and defensive teams. It is one of a number of functions that convert between existing home and away attributes by comparing whether OFFENSE is equal to HOME or AWAY and copying the appropriate attribute.
def O_D_SCORE_FN(self):
   for x in self.playlist:
       if x.OFFENSE==self.HOME: # Match the offense to the home or away team and set the scores accordingly
           x.O_SCORE=x.HOME_SCORE
           x.D_SCORE=x.AWAY_SCORE
       elif x.OFFENSE == self.AWAY:
           x.O_SCORE=x.AWAY_SCORE
           x.D_SCORE=x.HOME_SCORE
       else: # If it's not one way or the other something is wrong
           print("O/D SCORE ERROR", self.MULE, x.playdesc)
       x.O_LEAD=x.O_SCORE-x.D_SCORE
      1. O_D_TO_FN Function

The O_TO and D_TO attributes show the remaining timeouts for both the offense and defense. The calculation follows the same logic as above, comparing OFFENSE to HOME and AWAY.
def O_D_TO_FN(self):
   for x in self.playlist:
       if x.OFFENSE==self.HOME: # Match the offense to the home or away team and set the timeouts accordingly
           x.O_TO=x.HOME_TO
           x.D_TO=x.AWAY_TO
       elif x.OFFENSE == self.AWAY:
           x.O_TO=x.AWAY_TO
           x.D_TO=x.HOME_TO
       else:
           print ("O/D TO ERROR", self.MULE, x.playdesc)
      1. O_WIN_FN Function

The O_WIN attribute is set to True if the offensive team eventually wins the game, and is calculated by the same method as the above attributes, DEFENSE, O_SCORE & D_SCORE, and O_TO & D_TO.
def O_WIN_FN(self):
   for x in self.playlist:
       if x.OFFENSE==self.HOME: # Match the offense to the home or away team and set the win setting accordingly
           x.O_WIN=self.H_WIN
       elif x.OFFENSE == self.AWAY:
           x.O_WIN=not(self.H_WIN)
       else:
           print("O WIN ERROR", self.MULE, x.playdesc)
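The four functions above share one pattern: compare OFFENSE to HOME and AWAY, then copy the matching value. That pattern could be factored into a single helper, sketched here. The helper by_possession is hypothetical, and SimpleNamespace is used only as a stand-in for a play object.

```python
from types import SimpleNamespace

def by_possession(offense, home, away, home_value, away_value):
    """Return (offense_value, defense_value) from the home and away values."""
    if offense == home:
        return home_value, away_value
    if offense == away:
        return away_value, home_value
    raise ValueError("offense is neither the home nor the away team: " + str(offense))

# Stand-in for a play object; the attribute names follow the article
play = SimpleNamespace(OFFENSE="CGY", HOME_SCORE=10, AWAY_SCORE=14, HOME_TO=2, AWAY_TO=1)
o_score, d_score = by_possession(play.OFFENSE, "ALB", "CGY", play.HOME_SCORE, play.AWAY_SCORE)
o_to, d_to = by_possession(play.OFFENSE, "ALB", "CGY", play.HOME_TO, play.AWAY_TO)
print(o_score, d_score, o_to, d_to)  # 14 10 1 2
```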
      1. TIME_FN Function

The TIME_FN uses the existing CLOCK data to interpolate the remaining time in the game, measured in seconds, for all plays. It loops through playlist three times. The first time it takes the existing CLOCK statements and converts them to seconds in the TIME attribute. The second loop interpolates every gap between two TIME statements, and the third loop simply rounds all of the TIME statements to whole seconds.
   def TIME_FN(self):
       try:
           for x in self.playlist: # First loop through converts all the clock statements to time statements
               if not(x.CLOCK==None):
                   x.TIME=int(3600-900*x.QUARTER+int(x.CLOCK[-2:])+60*int(x.CLOCK[:2]))
               if x.QUARTER==5:    
                   x.TIME=0
           if self.playlist[-1].TIME==None: # If the last play has no time we assign it to 0 to avoid an error down the line
               self.playlist[-1].TIME=0
           for x in range (0, len(self.playlist)-1): # Second loop splines between all the known times
               if self.playlist[x].TIME==None: #Find the start of a gap
                   for templong in range (x , len(self.playlist)): # Loop to find the end of a gap
                       if not(self.playlist[templong].TIME==None): #If the gap has ended
                           for temptwo in range (x,templong): # Another nested loop to spline
                               self.playlist[temptwo].TIME=self.playlist[temptwo-1].TIME -(self.playlist[x-1].TIME - self.playlist[templong].TIME)/ (templong - x)
                           x=templong
                           break
           for x in self.playlist: # Third loop to round to the nearest second
               x.TIME=int(round(x.TIME,0))
       except Exception: # A lot of misc errors can happen here
           print ("TIME ERROR", self.MULE, self.game_statement) # x may be an integer index here, so we report the game rather than x.playdesc
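The interpolation step can be illustrated with numpy.interp, which fills the gaps between known times linearly. This is a sketch of the idea rather than a drop-in replacement: it spaces plays evenly between the two surrounding clock readings, which may not match the loop above exactly, and interpolate_times is a hypothetical name.

```python
import numpy as np

def interpolate_times(times):
    """Fill None entries by linear interpolation between the nearest known times."""
    known = [i for i, t in enumerate(times) if t is not None]
    if not known:
        raise ValueError("at least one known time is required")
    # Positions before the first or after the last known time take that known value
    filled = np.interp(range(len(times)), known, [times[i] for i in known])
    return [int(round(float(t))) for t in filled]

print(interpolate_times([3600, None, None, 3540, None, 3500]))
# [3600, 3580, 3560, 3540, 3520, 3500]
```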
      1. SCORING_PLAY_FN Function

The scoring play function identifies plays where there is a scoring event, and identifies the type of score, as well as which team scored. The scoring plays are touchdowns, field goals, rouges, and safeties. Safeties are considered to have been “scored” by the team who surrenders the safety and therefore have a nominal value of -2 points. This function is important for the EP_INPUT_FN.
def SCORING_PLAY_FN(self): # Need to find all the plays with scores
   try:
       for x in range (1, len(self.playlist)):
           if self.playlist[x].FG_RSLT=="ROUGE":
               self.playlist[x].SCORING_PLAY="O-ROUGE" # Only the offense is realistically likely to score a rouge, ever
           elif self.playlist[x].FG_RSLT=="GOOD" and self.playlist[x].DOWN>0: # GOOD signals a made field goal but need to check the down to avoid PAT. Really only the offense can score a field goal
               self.playlist[x].SCORING_PLAY="O-FG"
           elif "TOUCHDOWN" in self.playlist[x].playdesc: # Looking for TDs
               for templong in range (x + 1,len(self.playlist)):
                   if "attempt" in self.playlist[templong].playdesc or "kickoff" in self.playlist[templong].playdesc:
                       if self.playlist[x].OFFENSE==self.playlist[templong].OFFENSE: # The team that has the PAT or kickoff is the one that scores the TD
                           self.playlist[x].SCORING_PLAY="O-TD"
                       elif self.playlist[x].OFFENSE==self.playlist[templong].DEFENSE:
                           self.playlist[x].SCORING_PLAY="D-TD"
                       break # If we don't break it will overwrite with future PATs or kickoffs
                   elif self.playlist[templong].O_SCORE>self.playlist[templong-1].O_SCORE: #Failing in that we can look for a change in one team's score
                       self.playlist[x].SCORING_PLAY="O-TD"
                       break
                   elif self.playlist[templong].D_SCORE>self.playlist[templong-1].D_SCORE:
                       self.playlist[x].SCORING_PLAY="D-TD"
                       break
               else: # if we can't find the provenance of a TD it's a last-play TD, that all end up being on the offense
                   self.playlist[x].SCORING_PLAY="O-TD"
           elif "SAFETY" in self.playlist[x].playdesc: #Safeties are usually on the O but KOR/PR can lead to a D safety
               if self.playlist[x].YDLINE<65: # The field position of the play can effectively tell us whose safety it is.
                   self.playlist[x].SCORING_PLAY="D-SAFETY"
               else:
                   self.playlist[x].SCORING_PLAY="O-SAFETY"
   except Exception:
       print ("SCORING PLAY ERROR", self.MULE, self.playlist[x].playdesc)
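To make the sign convention concrete, the sketch below converts a SCORING_PLAY label into a point value from the offense's point of view, using the default values held in Globals. The mapping is an interpretation of the convention described above, in which a safety is attributed to the team that surrenders it; score_value is a hypothetical helper.

```python
# Mirrors the default values in Globals; a safety is negative for the team charged with it
POINTS = {"TD": 7.0, "FG": 3.0, "ROUGE": 1.0, "SAFETY": -2.0}

def score_value(label):
    """Signed value of a label such as 'O-TD', from the offense's point of view."""
    side, kind = label.split("-", 1)
    value = POINTS[kind]
    return value if side == "O" else -value

for label in ("O-TD", "D-TD", "O-FG", "O-ROUGE", "O-SAFETY", "D-SAFETY"):
    print(label, score_value(label))
```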
      1. P1D_INPUT_FN Function

This function determines whether each play leads to a successful first down within that drive. Touchdowns are also considered to be successful first downs, whereas turnovers, punts, field goals, turnovers on downs, and ends of halves are considered unsuccessful.
def P1D_INPUT_FN(self):
   for x in self.playlist:
       if x.ODK=="OD": # We only care about P(1D) for OD plays
           if x.DOWN>0: #We don't care about 2-pt conversions
               if x.SCORING_PLAY=="O-TD": # If the O scores a TD then it's obviously good
                   x.P1D_INPUT=True
               elif x.SCORING_PLAY=="D-TD": # if the D scores then it's bad
                   x.P1D_INPUT=False
               elif x.SCORING_PLAY=="O-SAFETY": # Safeties are also bad
                       x.P1D_INPUT=False
               else:
                   for y in self.playlist[self.playlist.index(x) + 1:]: #Now we loop through the rest of the plays
                       if y.OFFENSE != x.OFFENSE: #If there's a change of possession it's a fail
                           x.P1D_INPUT=False
                           break # Always break to avoid overwriting and keep the structure simple
                       elif y.SCORING_PLAY=="O-TD": # If offense scores a touchdown that's good
                           x.P1D_INPUT=True
                           break
                       elif y.ODK=="P" or y.ODK=="FG" or y.ODK=="KO": # If there's a non-OD play it implies a failure of the drive
                           x.P1D_INPUT=False
                           break
                       elif y.SCORING_PLAY == "D-TD": # A defensive touchdown is bad
                           x.P1D_INPUT=False
                           break
                       elif y==self.playlist[-1]:
                           x.P1D_INPUT=False
                           break
                       elif y.DOWN==1 and y.DISTANCE==10: # A 1st & 10 is good
                           x.P1D_INPUT=True
                           break
                       elif y.DOWN==1 and y.DISTANCE==y.YDLINE: # 1st & Goal is good
                           x.P1D_INPUT=True
                           break
                   else:
                       x.P1D_INPUT = False # If we get to the end of the game, finding nothing
           elif x.DOWN==0: # Handling 2-point conversions
               if "GOOD" in x.playdesc:
                   x.P1D_INPUT=True
               else:
                   x.P1D_INPUT=False
      1. EP_INPUT_FN Function

EP Input determines the next score in the game, be it one of the four scores above, or the end of a half. It also identifies whether the next score is scored by the current offense or defense. It loops through the playlist looking for a value in SCORING_PLAY or the end of a half.
def EP_INPUT_FN(self):
   for x in self.playlist: # Loop through the playlist
       for y in self.playlist[self.playlist.index(x):]: # Looping through all the plays going forward
           if y.SCORING_PLAY != None: # If there's a scoring play
               if y.SCORING_PLAY[0]=="O": # Need to match the scoring team with the current offense
                   if y.OFFENSE==x.OFFENSE:
                       x.EP_INPUT=y.SCORING_PLAY
                       break
                   elif y.OFFENSE == x.DEFENSE:
                       x.EP_INPUT="D" + y.SCORING_PLAY[1:]
                       break
                   else:
                       print ("EP INPUT ERROR:", self.MULE, x.playdesc)
               elif y.SCORING_PLAY[0] == "D":
                   if y.OFFENSE == x.DEFENSE:
                       x.EP_INPUT="O" + y.SCORING_PLAY[1:] # The scoring defense on play y is the offense on play x
                       break
                   elif y.OFFENSE == x.OFFENSE:
                       x.EP_INPUT="D" + y.SCORING_PLAY[1:]
                       break
                   else:
                       print ("EP INPUT ERROR:", self.MULE, x.playdesc)
               else:
                   print ("EP INPUT ERROR:", self.MULE, x.playdesc)
           elif y is not self.playlist[-1] and y.QUARTER==2 and self.playlist[self.playlist.index(y)+1].QUARTER==3: # If it's halftime
               x.EP_INPUT="HALF"
               break
           elif y is not self.playlist[-1] and y.QUARTER==4 and self.playlist[self.playlist.index(y)+1].QUARTER==5: # If OT begins
               x.EP_INPUT="HALF"
               break
       else: # If the game ends
           x.EP_INPUT = "HALF"
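Because each play's next score is the same as the following play's next score unless a score or a change of half intervenes, the lookup can also be done in a single backward pass instead of a nested loop. The sketch below illustrates that idea under the labelling convention described above. The function next_scores and the SimpleNamespace stand-ins for play objects are hypothetical.

```python
from types import SimpleNamespace

def next_scores(plays):
    """Label each play with the next score, relative to that play's offense."""
    labels = [None] * len(plays)
    upcoming = None  # (scoring team, kind) of the nearest later score in the same half
    for i in range(len(plays) - 1, -1, -1):
        play = plays[i]
        # A change of half (Q2 to Q3, or Q4 to OT) ends the search for earlier plays
        if i + 1 < len(plays) and (play.QUARTER, plays[i + 1].QUARTER) in ((2, 3), (4, 5)):
            upcoming = None
        if play.SCORING_PLAY is not None:
            side, kind = play.SCORING_PLAY.split("-", 1)
            upcoming = (play.OFFENSE if side == "O" else play.DEFENSE, kind)
        if upcoming is None:
            labels[i] = "HALF"
        else:
            team, kind = upcoming
            labels[i] = ("O-" if team == play.OFFENSE else "D-") + kind
    return labels

def mk(quarter, offense, defense, scoring=None):
    return SimpleNamespace(QUARTER=quarter, OFFENSE=offense, DEFENSE=defense, SCORING_PLAY=scoring)

plays = [mk(1, "A", "B"), mk(1, "A", "B", "O-FG"), mk(2, "B", "A"),
         mk(3, "A", "B"), mk(3, "B", "A", "D-TD"), mk(4, "A", "B")]
print(next_scores(plays))
```

Each play is visited once, so the cost grows linearly with the number of plays in the game.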
    1. Play Class

Within the game object we have the attribute rowlist, which contains the raw text of every row within that game as found within the initial .csv that was imported. Games also have the attribute playlist, a list of plays that is initially empty. The method make_plays exists to identify plays within the rowlist, and to identify important attributes that need to be passed.
      1. __init__ Method

The __init__ method imports the basic data passed from the game class, and initializes all of the other attributes in the class. Most importantly it brings in the data format identifier, which affects the behaviour of a number of functions.
def __init__(self,row, homescore, awayscore, off, quarter, ATO, HTO, clock, MULE):
   self.MULE = MULE # Need to carry this information because it affects some of the parsing
   if self.MULE == 1 or self.MULE==3: # The different formats have different row structures
       self.DD = row[0] #The first cell has down & distance
       self.SPOT = None #This data format doesn't have a separate FPOS cell
       self.playdesc = row[1] #play description is in the second cell
   elif self.MULE == 2:
       self.DD = row[1]#Mule 2 is structured differently
       self.SPOT = row[2]
       self.playdesc = row[3]
   self.HOME_SCORE = homescore#Carry over the score from the parent game
   self.AWAY_SCORE = awayscore
   self.OFFENSE = off #Carry over the offense
   self.HOME_LEAD = self.HOME_SCORE-self.AWAY_SCORE#Home lead is obviously just the difference here, though this could just be a function but it seems dumb to have a one-line function
   self.AWAY_TO = ATO#Carry over the TO situation
   self.HOME_TO = HTO
   self.CLOCK = clock # If there's any clock info
   self.QUARTER = quarter # Carry over the qtr


   # Here are all the other attributes we figure out via functions but which we initialize to None to start
   self.DOWN=None
   self.DISTANCE=None
   self.RP=None
   self.P_RSLT=None
   self.FPOS=None
   self.DEFENSE=None
   self.O_SCORE=None
   self.D_SCORE=None
   self.ODK=None
   self.O_WIN=None
   self.FG_RSLT=None
   self.GAIN=None
   self.TIME=None
   self.SCORING_PLAY=None
   self.P1D_INPUT=None
   self.EP_INPUT=None
      1. DOWN_FN Function

This function determines the DOWN attribute, the down of the play. Because the three data formats encode the down differently, this function essentially serves as three functions in one, determining the down according to the data format.
def DOWN_FN(self):
   try:
       if self.MULE==1:#Mule 1 we have to parse it from the DD cell
           if "0th" in self.DD:
               self.DOWN=0
           elif "1st" in self.DD:
               self.DOWN=1
           elif "2nd" in self.DD:
               self.DOWN=2
           elif "3rd" in self.DD:
               self.DOWN=3
           else:
               print ("DOWN ERROR:", self.MULE, self.playdesc)#Catch errors
       elif self.MULE==2:
           self.DOWN=int(self.DD[0]) # In format 2 it's always the first character of the DD column
       elif self.MULE==3:
           self.DOWN=int(self.DD[2]) # In format three down is the third character
   except Exception: #This will catch if there's anything non-numeric raising an exception with int()
       print ("Down Error", self.MULE, self.playdesc)
      1. DISTANCE_FN Function

DISTANCE is the distance to gain for the offense. Each data set requires a slight difference in the method of calculation, but ultimately it’s a string manipulation to pull an integer out of the DD attribute.
def DISTANCE_FN(self):
   try:
       if self.MULE==1: # for each data format distance is a simple string interpretation
           self.DISTANCE=int(self.DD[8:10])
       elif self.MULE==2:
           self.DISTANCE=int(self.DD[2:])
       elif self.MULE==3:
           self.DISTANCE=int(self.DD[4:6])
       if self.DISTANCE<=0: #error-checking, distance can't be 0 or negative
           print ("DISTANCE ERROR:", self.MULE, self.playdesc)
   except Exception:
       print ("DISTANCE ERROR:", self.MULE, self.playdesc)
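The slicing in DOWN_FN and DISTANCE_FN can be tried in isolation with the sketch below. The sample DD strings are hypothetical: their layouts are inferred from the slice indices used above rather than taken from the raw data, and parse_down_distance and ORDINALS are illustrative names that do not appear in the original code.

```python
# Hypothetical DD strings; layouts are inferred from the slice indices
# in DOWN_FN and DISTANCE_FN, not confirmed against the raw data.
ORDINALS = {"0th": 0, "1st": 1, "2nd": 2, "3rd": 3}


def parse_down_distance(dd, mule):
    """Return (down, distance) for a DD string in data format `mule`."""
    if mule == 1:
        down = ORDINALS.get(dd[:3])
        if down is None:
            raise ValueError("unrecognized down in %r" % dd)
        distance = int(dd[8:10])  # int() tolerates a trailing space
    elif mule == 2:
        down = int(dd[0])
        distance = int(dd[2:])
    elif mule == 3:
        down = int(dd[2])
        distance = int(dd[4:6])
    else:
        raise ValueError("unknown data format %r" % mule)
    if distance <= 0:
        raise ValueError("distance must be positive in %r" % dd)
    return down, distance


print(parse_down_distance("1st and 10 at ABC35", 1))
print(parse_down_distance("2-7", 2))
print(parse_down_distance("H 3-12 V42", 3))
```

Raising an exception instead of printing is a design choice for the sketch only; the original methods print and carry on so that one bad row does not halt the parse.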
      1. RP_FN Function

This method identifies run and pass plays by looking for a set of keywords in the playdesc attribute. It is important for improving the efficiency of the P_RSLT_FN function, but also for future research where distinguishing between rush and pass attempts may be of value.
def RP_FN(self): #No need to error-check, either the phrases are in playdesc or not
   if "pass" in self.playdesc or "sack" in self.playdesc or "scramble" in self.playdesc:
       self.RP="P"
   elif "rush" in self.playdesc:
       self.RP="R"
      1. P_RSLT_FN Function

Following the identification of pass plays this function looks at the result of said pass plays, be they complete, incomplete, interceptions, sacks, or scrambles. Scrambles are underrepresented, as in many cases the scorekeeper does not label them as such, and there is therefore no way to distinguish them from ordinary rushing plays.
def P_RSLT_FN(self): # Determining the result of a pass
   if self.RP=="P": # Obviously only interested in pass plays
       if "incomplete" in self.playdesc: # Looking for key strings
           self.P_RSLT="I"
       elif "complete" in self.playdesc:
           self.P_RSLT="C"
       elif "intercept" in self.playdesc:
           self.P_RSLT="X"
       elif "sack" in self.playdesc:
           self.P_RSLT ="S"
       elif "scramble" in self.playdesc or ("pass" in self.playdesc and "rush" in self.playdesc):
           self.P_RSLT="R"
       elif "FAILED" in self.playdesc: #catching 2-pt conversions
           self.P_RSLT = "I"
       elif "GOOD" in self.playdesc or "good" in self.playdesc:
           self.P_RSLT = "C"
       else: # If we don't have any of the key phrases something is wrong
           print ("P RSLT ERROR", self.MULE, self.playdesc)
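The order of the keyword checks matters, because "incomplete" contains "complete" as a substring. The minimal sketch below shows that ordering on its own; pass_result is an illustrative name and the play descriptions are made up for the example.

```python
def pass_result(playdesc):
    """Classify a pass play by keyword; returns None if nothing matches."""
    # "incomplete" must be tested before "complete", since the longer
    # word contains the shorter one.
    checks = [
        ("incomplete", "I"),
        ("complete", "C"),
        ("intercept", "X"),
        ("sack", "S"),
        ("scramble", "R"),
    ]
    for keyword, code in checks:
        if keyword in playdesc:
            return code
    return None


# Made-up play descriptions for illustration only.
print(pass_result("Smith pass incomplete to Jones"))
print(pass_result("Smith pass complete to Jones for 12 yards"))
```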
      1. FPOS_FN Function

This function finds the field position of the play, using the +/- notation, where positive values are on the defense’s side of midfield, and negative values are in the offense’s end. By convention plays at midfield are defined as being the +55, although this would not actually affect any calculations going forward. This is done by checking whether the ball is on the offense’s or defense’s side of midfield by using a string comparison to the offensive team, and then either taking the yard line as-is or multiplying by -1.
def FPOS_FN(self): # We don't need to error check the result because if it's not a proper int the int conversion will raise an exception.
   try:
       if self.MULE==1: # String interpretation for each data format
           self.FPOS=int(self.DD[-2:])
           if self.DD[-5:-2]==self.OFFENSE: # If we're on the offensive side of the field we flip the sign
               self.FPOS=self.FPOS*(-1)
       elif self.MULE==2:
           if self.SPOT[:3]==self.OFFENSE: # Data format 2 has a separate column for the field position
               self.FPOS=int(self.SPOT[-2:])*(-1)
           else:
               self.FPOS = int(self.SPOT[-2:])
       elif self.MULE == 3:
           if self.DD[0] == self.DD[-3]:
               self.FPOS=int(self.DD[-2:])*(-1)
           else:
               self.FPOS=int(self.DD[-2:])
       if self.FPOS==-55: # If it's at midfield we define it as positive by convention
           self.FPOS=55
   except Exception: # Will catch any non-ints
       print ("FPOS ERROR", self.MULE, self.playdesc)
      1. YDLINE_FN Function

This function converts the FPOS attribute into YDLINE. YDLINE is simply the number of yards from the goal line, and spans from 1 to 109. It makes calculations much easier because the value is continuous, rather than having the discontinuity that FPOS does of jumping from +55 to -54, requiring a lot of repetitive conversions in later methods.
def YDLINE_FN(self):
   if self.FPOS > 0: # Simple conversion from FPOS to YDLINE
       self.YDLINE = self.FPOS
   elif self.FPOS < 0:
       self.YDLINE = 110 + self.FPOS
   else:
       print ("YDLINE ERROR", self.MULE, self.playdesc)
       return # An FPOS of 0 is invalid, so there is no YDLINE to range-check
   if self.YDLINE <= 0 or self.YDLINE >= 110: #For out of range errors
       print ("YDLINE ERROR", self.MULE, self.playdesc)
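As a worked example of the conversion, the sketch below maps a few FPOS values onto the continuous 1 to 109 scale. fpos_to_ydline is an illustrative standalone function, not part of the original class.

```python
def fpos_to_ydline(fpos):
    """Convert +/- field position to yards from the opponent's goal line."""
    if fpos > 0:
        ydline = fpos            # defense's side: already yards to go
    elif fpos < 0:
        ydline = 110 + fpos      # offense's side: 110-yard field
    else:
        raise ValueError("FPOS of 0 is not a valid field position")
    if not 1 <= ydline <= 109:
        raise ValueError("YDLINE out of range: %r" % ydline)
    return ydline


# The jump from +55 to -54 in FPOS becomes consecutive values in YDLINE.
for fpos in (1, 35, 55, -54, -35, -1):
    print(fpos, fpos_to_ydline(fpos))
```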
      1. ODK_FN Function

This function determines if the play is an offensive play or some kind of special teams play. This is largely done by looking for keywords in playdesc. It includes a few checks for unusual situations, such as intentional safeties being coded as punts and identifying PAT attempts and dead-ball penalties.
def ODK_FN(self):
   if self.DOWN == 3 and "SAFETY" in self.playdesc:
       self.ODK = "P" #need to account for intentional safeties
   elif "punt" in self.playdesc: # Looking for some pretty straightforward key phrases
       self.ODK="P"
   elif "kickoff" in self.playdesc:
       self.ODK="KO"
   elif "field goal" in self.playdesc or "kick attempt" in self.playdesc: #kick attempt is for PAT
       self.ODK="FG"
   elif not(self.RP==None): #otherwise plays that are runs or passes are just "OD," but it comes last because of fakes
       self.ODK="OD"
   elif "PENALTY" in self.playdesc:
       self.ODK="PEN"
   else:
       print ("ODK ERROR", self.MULE, self.playdesc)
      1. FG_RSLT_FN Function

The field goal result function finds field goal attempts and logs whether they were successful or not, classifying them as either good, missed, or rouge. “Missed” includes blocks as well, but those are often not labelled differently, so at this time a separate classification has not been included.
def FG_RSLT_FN(self): #No error-catching because we have the else used for misses
   if self.ODK == "FG":
       if "GOOD" in self.playdesc:
           self.FG_RSLT = "GOOD"
       elif "ROUGE" in self.playdesc:
           self.FG_RSLT = "ROUGE"
       else: # There are several ways to denote failed FG attempts so this is a catch-all
           self.FG_RSLT = "MISSED"
      1. GAIN_FN Function

The gain function determines the gain on the play by parsing the play description. This version properly accounts for gains of more than 100 yards, which were not properly calculated in the previous database. While these plays are outliers, they have a disproportionate impact on averages.
def GAIN_FN(self):
   try:
       if not(self.RP==None):
           if self.P_RSLT == "I" or self.P_RSLT == "X": # Incompletions obviously have no gain
               self.GAIN = 0
           elif "no gain" in self.playdesc: # they use "no gain" instead of 0
               self.GAIN = 0
           elif "GOOD" in self.playdesc and self.DOWN == 0: # Handling 2-pt conversions
               self.GAIN = self.YDLINE
           elif "FAILED" in self.playdesc:
               self.GAIN = 0
           elif "loss" in self.playdesc: # They use "loss" instead of negative gain, so need to flip the sign
               if self.playdesc[self.playdesc.find("yard") - 4] == "1": # If somehow there's a loss of 100 yards; compare to the character "1", not the int
                   self.GAIN = -int(self.playdesc[self.playdesc.find("yard") - 4 : self.playdesc.find("yard") - 1])
               else:
                   self.GAIN = -int(self.playdesc[self.playdesc.find("yard") - 3 : self.playdesc.find("yard") - 1])
           else:
               if self.playdesc[self.playdesc.find("yard")-4] == "1": # Handles gains of 100+ yards
                   self.GAIN = int(self.playdesc[self.playdesc.find("yard") - 4 : self.playdesc.find("yard")])
               else:
                   self.GAIN = int(self.playdesc[self.playdesc.find("yard") - 3 : self.playdesc.find("yard")])
   except Exception: # Will catch any non-ints
       print ("Gain Error", self.MULE, self.playdesc)
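An alternative to fixed-width slicing, offered only as a sketch and not as the method used in this database, is a regular expression that captures however many digits precede "yard". parse_gain and YARDS_RE are hypothetical names, and the play descriptions are made up; real descriptions may contain other numbers that would need additional handling.

```python
import re

# One or more digits, whitespace, then "yard" or "yards".
YARDS_RE = re.compile(r"(\d+)\s+yards?")


def parse_gain(playdesc):
    """Return the signed gain in yards described by a play description."""
    if "no gain" in playdesc:
        return 0
    match = YARDS_RE.search(playdesc)
    if match is None:
        raise ValueError("no yardage found in %r" % playdesc)
    yards = int(match.group(1))
    # Losses are written as "loss of N yards", so the sign is flipped.
    return -yards if "loss" in playdesc else yards


print(parse_gain("Smith rush for 7 yards"))
print(parse_gain("Smith rush for loss of 3 yards"))
print(parse_gain("Smith pass complete to Jones for 104 yards"))
```

Because the pattern matches any run of digits, one-, two- and three-digit gains are all handled by the same branch.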
    1. Punt Class

To better look at the different kinds of plays it is easiest to create new objects most appropriate for the situation. One of those is to create a class for punts, to better look at the effects of punting in different situations.
      1. Array_Declaration Method

This method sits in the class module and is called by the Master module to create a list of punt objects where the list index can be mapped to the YDLINE attribute. Beyond a certain yardline the objects are largely empty, as there are no punts very near the goal line. This is not a concern as these are just left as placeholders.
def Array_Declaration():
   for x in range (0,110):
       PUNT_ARRAY.append(PUNT(x))  
      1. __init__ Method

This method creates all the attributes we will need for the punt object. The current form is skeletal and will be developed along with future research that will look at punts. Because of the bizarre multinomial distribution of scoring plays, bootstrapping is heavily used in the calculation of most confidence intervals.
def __init__(self, ydline):
   self.N=0 #Number of punts from this ydline
   self.YDLINE=ydline
   self.EP=None #Average EP value of this punt
   self.EP_ARRAY=[] #List holding all the EP data
   self.EP_LOW=None #Lower CI for EP
   self.EP_HIGH=None #Upper CI for EP
   self.BOOTSTRAP=Globals.DummyArray #A bootstrap of the EP to determine the CI
      1. Calculate Method

This method determines the EP value of the punt by averaging the EP inputs from all the punt plays that occurred from this yard line.
def calculate(self):
   if self.N > 0: # Avoid divide-by-zero
       self.EP=sum(self.EP_ARRAY)/self.N
      1. Boot Function

This function generates a bootstrap of the EP values to determine a confidence interval and a distribution. It uses the numpy module to improve speed, giving a ten-fold improvement in execution time compared to an earlier version written strictly in Python.
def boot(self):
   if self.N>10:
       self.BOOTSTRAP = numpy.sort(numpy.array([numpy.average(numpy.random.choice(self.EP_ARRAY, self.N, replace=True)) for _ in range(Globals.BOOTSTRAP_SIZE)], dtype='f4'))
       self.EP_HIGH = self.BOOTSTRAP[int(Globals.BOOTSTRAP_SIZE * (1-Globals.CONFIDENCE))] # Array indices must be ints, so cast the product
       self.EP_LOW = self.BOOTSTRAP[int(Globals.BOOTSTRAP_SIZE * Globals.CONFIDENCE) - 1]
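The bootstrap idea can be sketched as a standalone function. This is a minimal illustration under assumptions, not the code used here: bootstrap_mean_ci and its parameters are hypothetical names, the tail argument assumes the interval is specified as a tail probability (0.025 for a 95% interval), and the EP values are made up.

```python
import numpy as np


def bootstrap_mean_ci(values, n_resamples=1000, tail=0.025, seed=None):
    """Percentile bootstrap CI for the mean; returns (low, high, means)."""
    rng = np.random.default_rng(seed)
    data = np.asarray(values, dtype=float)
    # Draw every resample at once: one row of indices per resample.
    idx = rng.integers(0, data.size, size=(n_resamples, data.size))
    means = np.sort(data[idx].mean(axis=1))
    # Integer positions of the lower and upper percentiles.
    low = means[round(n_resamples * tail)]
    high = means[round(n_resamples * (1 - tail)) - 1]
    return low, high, means


# Made-up EP inputs for a single yard line.
ep_values = [0, 3, 7, -2, 3, 0, 7, 3, 1, 2, 3, 7]
low, high, means = bootstrap_mean_ci(ep_values, seed=0)
print(low, high)
```

Drawing all the resample indices in a single call avoids the Python-level loop, which is one way to get more of the speed benefit that numpy offers.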
    1. KO Class

Similar to the punt class, the kickoff class carries information particular to the needs of analyzing kickoffs. It creates a list of kickoff objects, though the vast majority of them remain empty placeholders, as in the punt list. All the methods and functions are the same as in the punt class, holding information about the EP data and creating a bootstrap of the EP value.
    1. FG Class

The FG class follows the same structure as the punt and kickoff classes, creating an array of objects indexed by yardline. Because of the nature of field goals either being missed, rouge, or good, it tracks each of the three separately to allow them to be combined later.
      1. __init__ method

Each of the three results (good, rouge, or missed) is tracked in a separate attribute. This also allows separate confidence intervals to be calculated for each of them.
def __init__(self, ydline):
   self.YDLINE=ydline
   self.N=0
   self.GOOD=0
   self.ROUGE=0
   self.MISSED=0
   
   self.P_GOOD=None
   self.P_GOOD_LOW=None
   self.P_GOOD_HIGH=None
   self.P_ROUGE=None
   self.P_ROUGE_LOW=None
   self.P_ROUGE_HIGH=None
   self.P_MISSED=None
   self.P_MISSED_LOW=None
   self.P_MISSED_HIGH=None
   
   self.EP=None
   self.BOOTSTRAP=Globals.DummyArray
   self.EP_ARRAY=[]
   self.EP_LOW=None
   self.EP_HIGH=None
      1. Calculate method

Using the binomial confidence functions from the Functions module we can determine the confidence intervals on each result using a one-vs-all method: each outcome in turn is treated as a success and the other two as failures.
def calculate(self): #Calculate all the percentages and the binomial CIs. Can do right away b/c there's no EP aspect involved
   if self.N>0:
       self.P_GOOD=self.GOOD/self.N
       self.P_GOOD_HIGH=Functions.BinomHigh(self.GOOD, self.N, Globals.CONFIDENCE)
       self.P_GOOD_LOW=Functions.BinomLow(self.GOOD, self.N, Globals.CONFIDENCE)
       
       self.P_ROUGE=self.ROUGE/self.N
       self.P_ROUGE_HIGH=Functions.BinomHigh(self.ROUGE, self.N, Globals.CONFIDENCE)
       self.P_ROUGE_LOW=Functions.BinomLow(self.ROUGE, self.N, Globals.CONFIDENCE)
       
       self.P_MISSED=self.MISSED/self.N
       self.P_MISSED_HIGH=Functions.BinomHigh(self.MISSED, self.N, Globals.CONFIDENCE)
       self.P_MISSED_LOW=Functions.BinomLow(self.MISSED, self.N, Globals.CONFIDENCE)
       
       self.EP=sum(self.EP_ARRAY) / self.N
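The BinomLow and BinomHigh functions are not reproduced in this section, so the sketch below substitutes a Wilson score interval purely to illustrate the one-vs-all idea. The interval actually used in the Functions module may differ; wilson_interval is a hypothetical name and the counts are made up.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 ~ 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z ** 2 / n
    centre = (p + z ** 2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


# Made-up counts for one yard line. Each outcome is treated as a
# "success" against the other two combined (one-vs-all).
counts = {"GOOD": 70, "ROUGE": 5, "MISSED": 25}
n = sum(counts.values())
intervals = {name: wilson_interval(x, n) for name, x in counts.items()}
for name, (low, high) in intervals.items():
    print(name, round(low, 3), round(high, 3))
```

Because each outcome gets its own interval, the three intervals are not constrained to be mutually consistent; they are descriptive bounds on each probability taken separately.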
    1. P(1D) Class

The P(1D) class allows us to calculate the first down probability of a situation from the data previously parsed. This recreates the methods used in the previous examination of P(1D) (Clement 2018b) to determine the P(1D) of down & distance situations with confidence intervals.
      1. Array_Declaration Method

Similar to the previous lists, which were one-dimensional and indexed by field position, P(1D) objects are held in a two-dimensional list that lets us consider both down and distance. A separate but similarly constructed array handles & Goal situations.
def Array_Declaration():
   for down in range (0,4):
       temp=[]
       for distance in range(0,26):
           temp.append(P1D(down, distance))
       P1D_ARRAY.append(temp)
   for down in range (0,4):
       temp=[]
       for distance in range(0,26):
           temp.append(P1D(down, distance))
       P1D_GOAL_ARRAY.append(temp)


      1. __init__ Method

This method defines the attributes we will need. Since P(1D) is a simple binomial variable, we only need to hold the attributes related to this. The confidence intervals are calculated with the binomial functions developed in the Functions module.
def __init__(self, down, distance):
   self.DOWN=down
   self.DISTANCE=distance
   self.N=0
   self.X=0
   self.P=None
   self.LOW=None
   self.HIGH=None
   self.SMOOTHED=None
      1. Binom method

The binom method calls the binomial confidence functions mentioned above to determine the upper and lower bounds on the confidence interval for P(1D).
def binom(self):
   if self.X > 0:
       self.P=self.X/self.N
       self.LOW=Functions.BinomLow(self.X, self.N, Globals.CONFIDENCE)
       self.HIGH=Functions.BinomHigh(self.X, self.N, Globals.CONFIDENCE)
  1. Conclusion

With the development of a more sophisticated interpreter of the U Sports database it becomes possible to parse the data in a mere fraction of the time, allowing for shorter development and testing cycles. This method is also far less resource-intensive, as the entirety of the raw and parsed data is not constantly held in memory; instead the raw data is read once at runtime and parsed there. Heavy use of the numpy package provides significant speed improvements, as its core routines are implemented in C and highly optimized.
This framework will provide the basis for future research into the nature of different kicking plays, leading into a discussion of EP and 3rd down decision-making. The availability of Python’s many packages will assist in the development of a future WP model through the sklearn package.
  1. References

