Passes and Patterns: The Roman Numerals of Computing: An Object-Oriented Database of U Sports Football

Abstract

A redevelopment of the U Sports parser and calculator previously described (Clement 2018b) that was built using VBA, this time working in Python. Data is imported through Python’s csv package, and parsed using an object-oriented approach, creating games and plays as classes with attributes as appropriate to each. Further objects exist to support analysis on the parsed data, and the numpy package with its arrays allows for far faster calculation of results. Discussion of future work built on this restructured database includes examination of special teams, expected points (EP) and Win Probability (WP).

Introduction

While previous work developed a pair of relational databases for U Sports and CFL data, and this proved adequate for a certain tasks, there was an interest in taking an object-oriented approach to the matter. The prime benefit of an object-oriented approach would be to structure the data in a fashion more akin to how our data is structured in reality. Games are independent events, and the plays are independent within them. By structuring the games as objects, and the plays as a separate class within them, we can also define attributes as being either game-level attributes or play-level attributes. The date of a game is a game-level attribute, whereas field position is a play-level attribute.

In the relational database a large number of rows were dedicated to bookkeeping: score statements, clock statements, and quarter starts among them, consuming as much as one-quarter of all the rows. By moving to an object-oriented structure that information is held at the appropriate level as an attribute of the object, making for a cleaner data structure.

Because this version of the database does not immediately provide a visual output of all the parsed data it was necessary to implement a more rigorous error-checking. In the process of doing so a number of other errors were spotted and corrected, though none that would affect the integrity of the P(1D) results already delivered (Clement 2018b). A number of missing score statements were repaired, as well as errors relating to distance gained on plays with fumbles.

Database

The code is structured using a number of modules. There are three central modules in the central aspect of the code, each discussed below. The program is run from the central Master module, which runs small sections of code and calls other methods and functions, serving to direct the overall flow of the code. A limited number of global variables are held in Globals to ease the difficulty of constantly passing large numbers of local variables, making code maintainability needlessly difficult. Finally, Functions holds a number of small numerical functions that rove useful in a variety of different situations.

Furthermore, there is a folder for the class modules, each of which is responsible for holding one class of object, with its attendant methods, functions and attributes, and any additional code that relates to that object class.

Master Module

Traffic within the program is managed by the Master module, a central module which initializes all the other elements of the program and directs the sequence of calculations. Master also serves as the laboratory to implement initial concepts before better organizing them within the structure of the code.

The first responsibility of the Master module is the importation of the data itself. The data was imported from .csv files using Python’s csv module. The three different formats of U Sports data (Clement 2018a) were each stored in a separate csv file. After importing the csv file the script reads through each row of the data to search for new games, as identified by the presence of “vs.” to create new Game objects. All other rows were added the the game’s rowlist attribute for further treatment. When a new game begins the current game is added to the global list gamelist. The code below demonstrates the approach taken to create the game objects. The dummy variable simply avoids an off-by-one error when reading the first game, in an inelegant but utilitarian fashion.

import csv #to bring in the MULE data from CSV

dummy=0

#IMPORTING DATA FROM THE CSV MULES

with open("Data/CIS MULE 01.csv") as csvfile:

MULE = csv.reader(csvfile)

for row in MULE:

if " vs. " in row[0]: #looking for new games

Globals.gamelist.append(GameClass.game(row[0], 1))#add the old temp to the list of games

elif dummy==0:

dummy=1

elif row[0]!=None or row[1]!=None:

tempplay=[]

tempplay.append(row[0])

tempplay.append(row[1])

Globals.gamelist[-1].rowlist.append(tempplay)

Globals Module

The simple mission of the Globals module is to hold a set of global variables that can be imported en bloc by each other module to give access to the core variables and data of the program and limiting the number of redundant passes. Care was taken to be judicious in the use of these global variables to avoid overencumbering the system with needless global variables. The variables defined in Globals are given below.

CONFIDENCE=0.025 # The one-sided confidence interval size for all statistical tests

BOOTSTRAP_SIZE=10000 # Number of bootstrap iterations to use

TDval=7.0 # Value of a touchdown

TDval_BOOTSTRAP=[] # Bootstrap array to determine confidence interval of TDval

TDval_HIGH=None # Upper bound of TDval CI

TDval_LOW=None # Lower bound of TD CI

FGval=3.0 # Same for field goal

FGval_BOOTSTRAP=[]

FGval_HIGH=None

FGval_LOW=None

ROUGEval=1.0 # For rouge

ROUGEval_BOOTSTRAP=[]

ROUGEval_HIGH=None

ROUGEval_LOW=None

SAFETYval=-2.0 # For safety

SAFETYval_BOOTSTRAP=[]

SAFETYval_HIGH=None

SAFETYval_LOW=None

HALFval=0.0 # Value of the end of the half is fixed and certain

gamelist=[] # The gamelist holds all the games

DummyArray = numpy.full(BOOTSTRAP_SIZE, -100, dtype='int32') # Holds a default value to avoid errors when comparing to None

Functions

The Functions module provides access to a few functions that are reused across several classes and that improve maintainability by centralizing these in a single module

Binomial Error

Binomial error in the program is calculated iteratively to obtain a value that is as exact as we choose to make it, here we go until we converge at the 10-30 level. We recycle the same binomial function used in the previous calculator (Pezzullo 2014), as was discussed in that work (Clement 2018a).

BootCompare

This function takes two bootstrap arrays and determines how many values of array A are greater than values of array B. It returns the proportion of all pairs of elements of A and B where the element from A is greater than the element from B. It assumes that A and B are both sorted arrays, which they are because prior elements of the code demand that they be sorted, and with this assumption is able to operate with only 2 * Globals.BOOTSTRAP_SIZE comparisons, at O(n) efficiency, rather than performing a comparison of every pair at O(n2).

def BootCompare(arrA, arrB):

count=0

b=0

for a in range (0, Globals.BOOTSTRAP_SIZE):

for b in range (b, Globals.BOOTSTRAP_SIZE):

if arrB[b] > arrA[a]:

count += b

break

else:

count += Globals.BOOTSTRAP_SIZE

count /= Globals.BOOTSTRAP_SIZE**2

return float(count)

Game class

The game class creates objects that hold games in their entirety. This allows us to append attributes belonging to the game directly to the object, rather than carrying them with every play as was done in the relational database. Notably, the teams playing, the date of the game, and the final score are all attributes that belong to the game and not to the individual play. The dataset from which the game comes is also passed as the attribute MULE to account for differences in how the parser should handle certain attributes. The code below shows how a game object is initialized.

__init_ method

The __init__ method initializes the game object, creating all of the attributes that will then be determined in later functions. No work is done in this method to assign values to any attributes beyond the statement and data format that are imported initially, so as to improve the maintainability and portability of the code. Most of the attributes are determined in the subsequent game_calc method. The code for the __init__ method is given below.

def __init__(self, statement, MULE):

self.rowlist=[] # Holds all the rows of raw data from the csv

self.playlist=[] # Holds a list of all the plays as objects

self.MULE=MULE # Which data format is this from

self.game_statement = statement # The initial "vs. " statement

self.AWAY=self.game_statement[0:3] # Automatically grabs the Home and Away team from the vs. statement

self.HOME=self.game_statement[8:11]

self.H_WIN=None # If the home team wins the game

self.H_FINAL=None # Final score of them game

self.A_FINAL=None

self.LEAGUE = None

self.CONFERENCE = None # The conference in which the game was played

self.YEAR=None # The year, month, day, weekday of the game

self.MONTH=None

self.DAY=None

self.WEEKDAY=None

self.SEASON=None # The season in whichthe game was played, always the same as YEAR for Canadian football, as seasons don't span over New Year's Day, exists for compatibility with other leagues in the future

Game_calc method

This method determines a number of the core attributes of the game. It separates the row attribute into its components, based on the data format, and through some string interpretation determines the date of the game. It also determines the conference in which the game was played, which considers the decision of Bishop’s University to move from the RSEQ to AUS for the 2017 season. The code of the game_calc method is included here.

# Determines some of the basic info about the game - date, conference

def game_calc(self):

for x in range (2001,2100): # Loop to get the year of the game. Checks 2001 to 2100 to be fully future-proofed and allow for the addition of old PxP games we might find.

if str(x) in self.game_statement: # Check if the year is in the game statement

self.YEAR=x # Assign that year to the attribute

self.SEASON=x

break # Once we find the year there's no need to waste clock cycles

self.MONTH = int(self.game_statement[self.game_statement.find(str(self.YEAR)) + 5 : self.game_statement.find(str(self.YEAR)) + 7]) # Assign the other attributes from string manipulation once we know the year

self.DAY = int(self.game_statement[self.game_statement.find(str(self.YEAR)) + 8 : self.game_statement.find(str(self.YEAR)) + 10])

self.WEEKDAY = date(self.YEAR, self.MONTH, self.DAY).weekday()

self.LEAGUE="U SPORTS" # Always U Sports in this case, will get more sophisticated if we start looking at other leagues.

CWUAA=["MAN", "SKH", "REG", "ALB", "CGY", "UBC", "SFU"] # These are to determine the conference of a game #These two conferences always have the same teams

OUA = ["CAR", "OTT", "QUE", "TOR", "YRK", "MAC", "GUE", "WAT", "WLU", "WES", "WIN"]

if self.YEAR<2016: # Because Bishop's changed conferences in 2016 we need to adapt accordingly

RSEQ=["BIS", "SHE", "MCG", "CON", "MON", "LAV"]

AUS = ["SMU", "SFX", "MTA", "ACA"]

else:

RSEQ=["SHE", "MCG", "CON", "MON", "LAV"]

AUS = ["SMU", "SFX", "MTA", "ACA", "BIS"]

if self.HOME in CWUAA and self.AWAY in CWUAA: # If the two teams are in the same conference we assign that conference, otherwise it's a non-conference game

self.CONFERENCE="CWUAA"

elif self.HOME in OUA and self.AWAY in OUA:

self.CONFERENCE="OUA"

elif self.HOME in RSEQ and self.AWAY in RSEQ:

self.CONFERENCE="RSEQ"

elif self.HOME in AUS and self.AWAY in AUS:

self.CONFERENCE="AUS"

else:

self.CONFERENCE="NONCON"

make_plays method

The make_plays method is forms the basis of the database, iterating over rowlist and identifying what rows are plays, and interpreting the bookkeeping rows , applying them as needed.

def make_plays(self): # Go through the rowlist to identify and create play objects, and to get the data from bookkeeping rows

quarter=1 # Initialize the quarter and scores as zero, obviously, two timeouts per team, clock is at 15:00

homescore=0

awayscore=0

HTO=2

ATO=2

clock="15:00"

off=[] # No team is in offense until one is assigned

for row in self.rowlist: #Looping through the whole rowlist

if row[0]=="2nd": # Identifying new quarters and resetting the clock and timeouts

quarter=2

clock="15:00"

elif row[0]=="3rd":

quarter=3

clock="15:00"

HTO=2

ATO=2

elif row[0]=="4th":

quarter=4

clock="15:00"

elif row[0]=="OT":

quarter=5

clock="00:00"

if self.MULE==1 or self.MULE==3: # checking for possession statements, but these only exist in data formats 1 and 3

if "drive start" in row[1]:

off=row[1][row[1].find("drive start")-4:row[1].find("drive start")-1]

elif self.MULE==2:

if len(row[0])==3:

if row[0]!="1st" and row[0]!="2nd" and row[0]!="3rd" and row[0] !="4th":

off=row[0]

if off != self.HOME and off != self.AWAY:

print ("POSSESSION ERROR", self.MULE, self.game_statement, row)

try:

if self.MULE == 1 or self.MULE == 3:

if "TIMEOUT" in row[1]: # checking for timeout statements

TO=row[1].find("TIMEOUT")

TOTEAM=row[1][ (TO + 8) : (TO + 11)]

if TOTEAM == self.HOME:

HTO=HTO-1

elif TOTEAM == self.AWAY:

ATO=ATO-1

else: #Error checking if the team calling a timeout isn't properly interpreted

print ("TIMEOUT ERROR", self.MULE, row, "TO", TOTEAM, "HOME", self.HOME, "AWAY", self.AWAY)

elif self.MULE == 2:

if "TIMEOUT" in row[3]:

if row[3][row[3].find("TIMEOUT") + 8 : row[3].find("TIMEOUT")+11] == self.HOME:

HTO=HTO-1

elif row[3][row[3].find("TIMEOUT")+8:row[3].find("TIMEOUT")+11]==self.AWAY:

ATO=ATO-1

else: #Error checking if the team calling a timeout isn't properly interpreted

print ("TIMEOUT ERROR", self.MULE, row)

except Exception:

print ("TIMEOUT EXCEPTION", self.MULE, self.game_statement, row[0], row[1])

try:

if self.MULE==1 or self.MULE==3: # find score statements

if len(row[1])>11 and len(row[1]) <15:

if row[1][3]==" ":

if row[1][0:3]==self.AWAY:

awayscore=int(row[1][4:row[1].find(",")].lstrip(" "))

homescore=int(row[1][-2:].lstrip(" "))

elif self.MULE==2:

if len(row[3]) > 11 and len(row[3]) < 15:

if row[3][3]==" ":

if row[3][0:3]==self.AWAY:

awayscore = int(row[3][4 : row[3].find(",")].lstrip(" "))

Homescore = int(row[3][-2:].lstrip(" "))

except Exception:

print ("SCORE ERROR", self.MULE, row, awayscore, homescore)

#identify clock statements

try:

if self.MULE==1 or self.MULE==3:

if ":" in row[1]:

clock=row[1][row[1].find(":")-2:row[1].find(":")+3]

elif self.MULE==2:

if ":" in row[3]:

clock=row[3][row[3].find(":")-2:row[3].find(":")+3]

#identify plays

if self.MULE==1 :

if "rush" in row[1] or "pass" in row[1] or "sack" in row[1] or "kick" in row[1] or "punt" in row[1] or "field goal" in row[1] or "PENALTY" in row[1]:

self.playlist.append(PlayClass.play(row, homescore, awayscore, off, quarter, ATO, HTO, clock, self.MULE))

clock=None

elif self.MULE==2:

if "rush" in row[3] or "pass" in row[3] or "sack" in row[3] or "kick" in row[3] or "punt" in row[3] or "field goal" in row[3] or "PENALTY" in row[3]:

self.playlist.append(PlayClass.play(row, homescore, awayscore, off, quarter, ATO, HTO, clock, self.MULE))

clock=None

elif self.MULE==3:

if "rush" in row[1] or "pass" in row[1] or "sack" in row[1] or "kick" in row[1] or "punt" in row[1] or "field goal" in row[1] or "PENALTY" in row[1]:

self.playlist.append(PlayClass.play(row, homescore, awayscore, off, quarter, ATO, HTO, clock, self.MULE))

clock=None

except Exception:

print ("CLOCK ERROR", self.MULE, row)

if homescore>awayscore: # Identifying the winning team

self.H_WIN=True

elif homescore < awayscore:

self.H_WIN=False

else: #If one isn't greater than the other we have a problem, there are no ties.

print ("H_WIN ERROR", self.MULE, self.game_statement, awayscore, homescore) #error-checking for ties

DEFENSE_FN Function

DEFENSE is the current defensive team, and is, obviously, defined as the team that is not currently the offensive team.

def DEFENSE_FN(self):

for x in self.playlist:

if x.OFFENSE==self.HOME: # Match the offense to the home or away team and set the defense to the other

x.DEFENSE=self.AWAY

elif x.OFFENSE == self.AWAY:

x.DEFENSE=self.HOME

else: # If it's not the home team and it's not the away team something is wrong

print ("DEFENSE ERROR", self.MULE, x.playdesc)

O_D_SCORE_FN Function

O_SCORE and D_SCORE give the current score for the offensive and defensive teams. It is one of a number of functions that convert between existing home and away attributes by comparing whether OFFENSE is equal to HOME or AWAY and copying the appropriate attribute.

def O_D_SCORE_FN(self):

for x in self.playlist:

if x.OFFENSE==self.HOME: # Match the offense to the home or away team and set the scores accordingly

x.O_SCORE=x.HOME_SCORE

x.D_SCORE=x.AWAY_SCORE

elif x.OFFENSE == self.AWAY:

x.O_SCORE=x.AWAY_SCORE

x.D_SCORE=x.HOME_SCORE

else: # If it's not one way or the other something is wrong

print("O/D SCORE ERROR", self.MULE, x.playdesc)

x.O_LEAD=x.O_SCORE-x.D_SCORE

O_D_TO_FN Function

The O_TO and D_TO attributes show the remaining timeouts for both the offense and defense. The calculation follows the same logic as above, comparing OFFENSE to HOME and AWAY.

def O_D_TO_FN(self):

for x in self.playlist:

if x.OFFENSE==self.HOME: # Match the offense to the home or away team and set the timeouts accordingly

x.O_TO=x.HOME_TO

x.D_TO=x.AWAY_TO

elif x.OFFENSE == self.AWAY:

x.O_TO=x.AWAY_TO

x.D_TO=x.HOME_TO

else:

print ("O/D TO ERROR", self.MULE, x.playdesc)

O_WIN_FN Function

The O_WIN attribute is set to True if the offensive team eventually wins the game, and is calculated by the same method as the above attributes, DEFENSE, OFF_SCORE & DEF_SCORE, and O_TO & D_TO.

def O_WIN_FN(self):

for x in self.playlist:

if x.OFFENSE==self.HOME: # Match the offense to the home or away team and set the win setting accordingly

x.O_WIN=self.H_WIN

elif x.OFFENSE == self.AWAY:

x.O_WIN=not(self.H_WIN)

else:

print("O WIN ERROR", self.MULE, x.playdesc)

TIME_FN Function

The TIME_FN uses the existing CLOCK data to interpolate the remaining time in the game, measured in seconds, for all plays. It loops through playlist three times. The first time it takes the existing CLOCK statements and converts them to seconds in the TIME attribute. The second loop interpolates every gap between two TIME statements, and the third loop simply rounds all of the TIME statements to whole seconds.

def TIME_FN(self):

try:

for x in self.playlist: # First loop through converts all the clock statements to time statements

if not(x.CLOCK==None):

x.TIME=int(3600-900*x.QUARTER+int(x.CLOCK[-2:])+60*int(x.CLOCK[:2]))

if x.QUARTER==5:

x.TIME=0

if self.playlist[-1].TIME==None: # If the last play has no time we assign it to 0 to avoid an error down the line

self.playlist[-1].TIME=0

for x in range (0, len(self.playlist)-1): # Second loop splines between all the known times

if self.playlist[x].TIME==None: #Find the start of a gap

for templong in range (x , len(self.playlist)): # Loop to find the end of a gap

if not(self.playlist[templong].TIME==None): #If the gap has ended

for temptwo in range (x,templong): # Another nested loop to spline

self.playlist[temptwo].TIME=self.playlist[temptwo-1].TIME -(self.playlist[x-1].TIME - self.playlist[templong].TIME)/ (templong - x)

x=templong

break

for x in self.playlist: # Third loop to round to the nearest second

x.TIME=int(round(x.TIME,0))

except Exception: # A lot of misc errors can happen here

print ("TIME ERROR", self.MULE, x.playdesc)

SCORING_PLAY_FN Function

The scoring play function identifies plays where there is a scoring event, and identifies the type of score, as well as which team scored. The scoring plays are touchdowns, field goals, rouges, and safeties. Safeties are considered to have been “scored” by the team who surrenders the safety and therefore have a nominal value of -2 points. This function is important for the EP_INPUT_FN.

def SCORING_PLAY_FN(self): # Need to find all the plays with scores

try:

for x in range (1, len(self.playlist)):

if self.playlist[x].FG_RSLT=="ROUGE":

self.playlist[x].SCORING_PLAY="O-ROUGE" # Only the offense is realistically likely to score a rouge, ever

elif self.playlist[x].FG_RSLT=="GOOD" and self.playlist[x].DOWN>0: # GOOD signals a made field goal but need to check the down to avoid PAT. Really only the offense can scorea field goal

self.playlist[x].SCORING_PLAY="O-FG"

elif "TOUCHDOWN" in self.playlist[x].playdesc: # Looking for TDs

for templong in range (x + 1,len(self.playlist)):

if "attempt" in self.playlist[templong].playdesc or "kickoff" in self.playlist[templong].playdesc:

if self.playlist[x].OFFENSE==self.playlist[templong].OFFENSE: # The team that has the PAT or kickoff is the one that scores the TD

self.playlist[x].SCORING_PLAY="O-TD"

elif self.playlist[x].OFFENSE==self.playlist[templong].DEFENSE:

self.playlist[x].SCORING_PLAY="D-TD"

break # If we don't break it will overwrite with future PATs or kickoffs

elif self.playlist[templong].O_SCORE>self.playlist[templong-1].O_SCORE: #Failing in that we can look for a change in one team's score

self.playlist[x].SCORING_PLAY="O-TD"

break

elif self.playlist[templong].D_SCORE>self.playlist[templong-1].D_SCORE:

self.playlist[x].SCORING_PLAY="D-TD"

break

else: # if we can't find the provenance of a TD it's a last-play TD, that all end up being on the offense

self.playlist[x].SCORING_PLAY="O-TD"

elif "SAFETY" in self.playlist[x].playdesc: #Safeties are usually on the O but KOR/PR can lead to a D safety

if self.playlist[x].YDLINE<65: # The field position of the play can effectively tell us whose safety it is.

self.playlist[x].SCORING_PLAY="D-SAFETY"

else:

self.playlist[x].SCORING_PLAY="O-SAFETY"

except Exception:

print ("SCORING PLAY ERROR", self.MULE, self.playlist[x].playdesc)

P1D_INPUT_FN Function

This function determines whether each play leads to a successful first down within that drive. Touchdowns are also considered to be successful first downs, whereas turnovers, punts, field goals, turnovers on downs, and ends of halves are considered unsuccessful.

def P1D_INPUT_FN(self):

for x in self.playlist:

if x.ODK=="OD": # We only care about P(1D) for OD plays

if x.DOWN>0: #We don't care about 2-pt conversions

if x.SCORING_PLAY=="O-TD": # If the O scores a TD then it's obviously good

x.P1D_INPUT=True

elif x.SCORING_PLAY=="D-TD": # if the D scores then it's bad

x.P1D_INPUT=False

elif x.SCORING_PLAY=="O-SAFETY": # Safeties are also bad

x.P1D_INPUT=False

else:

for y in self.playlist[self.playlist.index(x) + 1:]: #Now we loop through the rest of the plays

if y.OFFENSE != x.OFFENSE: #If there's a change of possession it's a fail

x.P1D_INPUT=False

break # Always break to avoid overwriting and keep the structure simple

elif y.SCORING_PLAY=="O-TD": # If offense scores a touchdown that's good

x.P1D_INPUT=True

break

elif y.ODK=="P" or y.ODK=="FG" or y.ODK=="KO": # If there's a non-OD play it implies a failure of the drive

x.P1D_INPUT=False

break

elif y.SCORING_PLAY == "D-TD": # A defensive touchdown is bad

x.P1D_INPUT=False

break

elif y==self.playlist[-1]:

x.P1D_INPUT=False

break

elif y.DOWN==1 and y.DISTANCE==10: # A 1st & 10 is good

x.P1D_INPUT=True

break

elif y.DOWN==1 and y.DISTANCE==y.YDLINE: # 1st & Goal is good

x.P1D_INPUT=True

break

else:

x.P1D_INPUT = False # If we get to the end of the game, finding nothing

elif x.DOWN==0: # Handling 2-point conversions

if "GOOD" in x.playdesc:

x.P1D_INPUT=True

else:

x.P1D_INPUT=False

EP_INPUT_FN Function

EP Input determines the next score in the game, be it one of the four scores above, or the end of a half. It also identifies whether the next score is scored by the current offense or defense. It loops through the playlist looking for a value in SCORING_PLAY or the end of a half.

def EP_INPUT_FN(self):

for x in self.playlist: # Loop through the playlist

for y in self.playlist[self.playlist.index(x):]: # Looping through all the plays going forward

if y.SCORING_PLAY != None: # If there's a scoring play

if y.SCORING_PLAY[0]=="O": # Need to match the scoring team with the current offense

if y.OFFENSE==x.OFFENSE:

x.EP_INPUT=y.SCORING_PLAY

break

elif y.OFFENSE == x.DEFENSE:

x.EP_INPUT="D" + y.SCORING_PLAY[1:]

break

else:

print ("EP INPUT ERROR:", self.MULE, x.playdesc)

elif y.SCORING_PLAY[0] == "D":

if y.OFFENSE == x.DEFENSE:

x.EP_INPUT=y.SCORING_PLAY

break

elif y.OFFENSE == x.OFFENSE:

x.EP_INPUT="D" + y.SCORING_PLAY[1:]

break

else:

print ("EP INPUT ERROR:", self.MULE, x.playdesc)

else:

print ("EP INPUT ERROR:", self.MULE, x.playdesc)

elif y.QUARTER==2 and self.playlist[self.playlist.index(y)+1]==3: # If it's halftime

x.EP_INPUT="HALF"

break

elif y.QUARTER==4 and self.playlist[self.playlist.index(y)+1]==5: # If OT begins

x.EP_INPUT="HALF"

break

else: # If the game ends

x.EP_INPUT = "HALF"

Play Class

Within the game object we have the attribute rowlist, which contains the raw text of every row within that game as found within the initial .csv that was imported. Games also have the attribute playlist, a list of plays that is initially empty. The method make_plays exists to identify plays within the rowlist, and to identify important attributes that need to be passed.

__init__ Method

The __init__ method imports the basic data passed from the game class, and initializes all of the other attributes in the class. Most importantly it brings in the data format identifier, which affects the behaviour of a number of functions.

def __init__(self,row, homescore, awayscore, off, quarter, ATO, HTO, clock, MULE):

self.MULE = MULE # Need to carry this information because it affects some of the parsing

if self.MULE == 1 or self.MULE==3: # The different formats have different row structures

self.DD = row[0] #The first cell has down & distance

self.SPOT = None #This data format doesn't have a separate FPOS cell

self.playdesc = row[1] #play description is in the second cell

elif self.MULE == 2:

self.DD = row[1]#Mule 2 is structured differently

self.SPOT = row[2]

self.playdesc = row[3]

self.HOME_SCORE = homescore#Carry over the score from the parent game

self.AWAY_SCORE = awayscore

self.OFFENSE = off #Carry over the offense

self.HOME_LEAD = self.HOME_SCORE-self.AWAY_SCORE#Home lead is obviously just the difference here, though this could just be a function but it seems dumb to have a one-line function

self.AWAY_TO = ATO#Carry over the TO situation

self.HOME_TO = HTO

self.CLOCK = clock # If there's any clock info

self.QUARTER = quarter # Carry over the qtr

# Here are all the other attributes we figure out via functions but which we initialize to None to start

self.DOWN=None

self.DISTANCE=None

self.RP=None

self.P_RSLT=None

self.FPOS=None

self.DEFENSE=None

self.O_SCORE=None

self.D_SCORE=None

self.ODK=None

self.O_WIN=None

self.FG_RSLT=None

self.GAIN=None

self.TIME=None

self.SCORING_PLAY=None

self.P1D_INPUT=None

self.EP_INPUT=None

DOWN_FN Function

This function determines the DOWN attribute, the down of the play. Because of the three different datasets coding this function essentially serves as three functions in one, determining down differently based on the data format.

def DOWN_FN(self):

try:

if self.MULE==1:#Mule 1 we have to parse it from the DD cell

if "0th" in self.DD:

self.DOWN=0

elif "1st" in self.DD:

self.DOWN=1

elif "2nd" in self.DD:

self.DOWN=2

elif "3rd" in self.DD:

self.DOWN=3

else:

print ("DOWN ERROR:", self.MULE, self.playdesc)#Catch errors

elif self.MULE==2:

self.DOWN=int(self.DD[0]) # In format 2 it's always the first charactor of the DD column

elif self.MULE==3:

self.DOWN=int(self.DD[2]) # In format three down is the third character

except Exception: #This will catch if there's anything non-numeric raising an exception with int()

print ("Down Error", self.MULE, self.playdesc)

DISTANCE_FN Function

DISTANCE is the distance to gain for the offense. Each data set requires a slight difference in the method of calculation, but ultimately it’s a string manipulation to pull an integer out of the DD attribute.

def DISTANCE_FN(self):

try:

if self.MULE==1: # for each data format distance is a simple string interpretation

self.DISTANCE=int(self.DD[8:10])

elif self.MULE==2:

self.DISTANCE=int(self.DD[2:])

elif self.MULE==3:

self.DISTANCE=int(self.DD[4:6])

if self.DISTANCE<=0: #error-checking, distance can't be 0 or negative

print ("DISTANCE ERROR:", self.MULE, self.playdesc)

except Exception:

print ("DISTANCE ERROR:", self.MULE, self.playdesc)

R_P_FN Function

This method identifies run and pass plays by looking for a set of keyword in the playdesc attribute. It is important for improving the efficiency of the P_RSLT_FN function, but also for future research where distinguishing between rush and pass attempts may be of value.

def RP_FN(self): #No need to error-check, either the phrases are in playdesc or not

if "pass" in self.playdesc or "sack" in self.playdesc or "scramble" in self.playdesc:

self.RP="P"

elif "rush" in self.playdesc:

self.RP="R"

P_RSLT_FN Function

Following the identification of pass plays this function looks at the result of said pass plays, be they complete, incomplete, interceptions, sacks, or scrambles. Scrambles are underrepresented, as in many cases the scorekeeper does not label them as such, and there is therefore no way to distinguish them from ordinary rushing plays.

def P_RSLT_FN(self): # Determining the result of a pass

if self.RP=="P": # Obviously only interested in pass plays

if "incomplete" in self.playdesc: # Looking for key strings

self.P_RSLT="I"

elif "complete" in self.playdesc:

self.P_RSLT="C"

elif "intercept" in self.playdesc:

self.P_RSLT="X"

elif "sack" in self.playdesc:

self.P_RSLT ="S"

elif "scramble" in self.playdesc or ("pass" in self.playdesc and "rush" in self.playdesc):

self.P_RSLT="R"

elif "FAILED" in self.playdesc: #catching 2-pt conversions

self.P_RSLT = "I"

elif "GOOD" in self.playdesc or "good" in self.playdesc:

self.P_RSLT = "C"

else: # If we don't have any of the key phrases something is wrong

print ("P RSLT ERROR", self.MULE, self.playdesc)

FPOS_FN Function

This function finds the field position of the play, using the +/- notation, where positive values are on the defense’s side of midfield, and negative values are in the offense’s end. By convention plays at midfield are defined as being the +55, although this would not actually affect any calculations going forward. This is done by checking whether the ball is on the offense’s or defense’s side of midfield by using a string comparison to the offensive team, and then either taking the yard line as-is or multiplying by -1.

def FPOS_FN(self): # We don't need to error check the result because if it's not a proper int the int conversion will raise an exception.

try:

if self.MULE==1: # String interpretation for each data format

self.FPOS=int(self.DD[-2:])

if self.DD[-5:-2]==self.OFFENSE: # If we're on the offensive side of the field we flip the sign

self.FPOS=self.FPOS*(-1)

elif self.MULE==2:

if self.SPOT[:3]==self.OFFENSE: # Data format 2 has a separate column for the field position

self.FPOS=int(self.SPOT[-2:])*(-1)

else:

self.FPOS = int(self.SPOT[-2:])

elif self.MULE == 3:

if self.DD[0] == self.DD[-3]:

self.FPOS=int(self.DD[-2:])*(-1)

else:

self.FPOS=int(self.DD[-2:])

if self.FPOS==-55: # If it's at midfield we define it as positive by convention

self.FPOS=55

except Exception: # Will catch any non-ints

print ("FPOS ERROR", self.MULE, self.playdesc)

YDLINE_FN Function

This function converts the FPOS attribute into YDLINE. YDLINE is simply the number of yards from the goal line, and spans from 1 to 109. It makes calculations much easier because the value is continuous, rather than having the discontinuity that FPOS does of jumping from +55 to -54, requiring a lot of repetitive conversions in later methods.

def YDLINE_FN(self):

if self.FPOS > 0: # Simple conversion from FPOS to YDLINE

self.YDLINE = self.FPOS

elif self.FPOS < 0:

self.YDLINE = 110 + self.FPOS

else:

print ("YDLINE ERROR", self.MULE, self.playdesc)

if self.YDLINE <= 0 or self.YDLINE <= 110: #For out of range errors

print ("YDLINE ERROR", self.MULE, self.playdesc)

ODK_FN Function

This function determines if the play is an offensive play or some kind of special teams play. This is largely done by looking for keywords in playdesc. It includes a few checks for unusual situations, such as intentional safeties being coded as punts and identifying PAT attempts and dead-ball penalties.

def ODK_FN(self):

if self.DOWN == 3 and "SAFETY" in self.playdesc:

self.ODK = "P" #need to account for intentional safeties

elif "punt" in self.playdesc: # Looking for some pretty straightforward key phrases

self.ODK="P"

elif "kickoff" in self.playdesc:

self.ODK="KO"

elif "field goal" in self.playdesc or "kick attempt" in self.playdesc: #kick attempt is for PAT

self.ODK="FG"

elif not(self.RP==None): #otherwise plays that are runs or passes are just "OD," but it comes last because of fakes

self.ODK="OD"

elif "PENALTY" in self.playdesc:

self.ODK="PEN"

else:

print ("ODK ERROR", self.MULE, self.playdesc)

FG_RSLT_FN Function

The field goal result finds field goal attempts and logs whether they were successful or not, classifying them as either good, missed, or rouge. “Missed” includes blocks as well, but those are often not labelled differently and so at this time a separate classification has not been included,

def FG_RSLT_FN(self): #No error-catching because we have the else used for misses

if self.ODK == "FG":

if "GOOD" in self.playdesc:

self.FG_RSLT = "GOOD"

elif "ROUGE" in self.playdesc:

self.FG_RSLT = "ROUGE"

else: # There are several ways to denote failed FG attempts so this is a catch-all

self.FG_RSLT = "MISSED"

GAIN_FN Function

The gain function determines from string comprehension the gain on the play. This version properly accounts for gains of more than 100 yards, which were not properly calculated in the previous database. While these plays are outliers, they have a disproportionate impact on averages.

def GAIN_FN(self):

try:

if not(self.RP==None):

if self.P_RSLT == "I" or self.P_RSLT == "X": # Incompletions obviously have no gain

self.GAIN = 0

elif "no gain" in self.playdesc: # they use "no gain" instead of 0

self.GAIN = 0

elif "GOOD" in self.playdesc and self.DOWN == 0: # Handling 2-pt conversions

self.GAIN = self.YDLINE

elif "FAILED" in self.playdesc:

self.GAIN = 0

elif "loss" in self.playdesc: # They use "loss" instead of negative gain, so need to flip the sign

if self.playdesc[self.playdesc.find("yard") - 4] == 1: # If somehow there's a loss of 100 yards

self.GAIN = int(self.playdesc[self.playdesc.find("yard") - 4 : self.playdesc.find("yard") - 1])

else:

self.GAIN = int(self.playdesc[self.playdesc.find("yard") - 3 : self.playdesc.find("yard") - 1])

else:

if self.playdesc[self.playdesc.find("yard")-4] == 1: # Handles gains of 100+ yards

self.GAIN = int(self.playdesc[self.playdesc.find("yard") - 4 : self.playdesc.find("yard")])

else:

self.GAIN = int(self.playdesc[self.playdesc.find("yard") - 3 : self.playdesc.find("yard")])

except Exception: # Will catch any non-ints

print ("Gain Error", self.MULE, self.playdesc)

Punt Class

To better look at the different kinds of plays it is easiest to create new objects most appropriate for the situation. One of those is to create a class for punts, to better look at the effects of punting in different situations.

Array_Declaration Method

This method sits in the class module and is called by the Master module to create a list of punt objects where the list index can be mapped to the YDLINE attribute. Beyond a certain yardline the objects are largely empty, as there are no punts very near the goal line. This is not a concern as these are just left as placeholders.

def Array_Declaration():
for x in range (0,110):
PUNT_ARRAY.append(PUNT(x))

__init__ Method

This method creates all the attributes we will need for the punt object. The current form is skeletal and will be developed along with future research that will look at punts. Because of the bizarre multinomial distribution of scoring plays bootstrapping is heavily used in the calculation of most confidence intervals.

def __init__(self, ydline):

self.N=0 #Number of punts from this ydline

self.YDLINE=ydline

self.EP=None #Average EP value of this punt

self.EP_ARRAY=[] #List holding all the EP data

self.EP_LOW=None #Lower CI for EP

self.EP_HIGH=None #Upper CI for EP

self.BOOTSTRAP=Globals.DummyArray #A bootstrap of the EP to determine the CI

Calculate Method

This method determines the EP value of the punt by averaging the EP inputs from all the punt plays that occurred from this yard line.

def calculate(self):

if self.N > 0: # Avoid divide-by-zero

self.EP=sum(self.EP_ARRAY)/self.N

Boot Function

This function generates the a bootstrapping of the EP values to determine a confidence interval and a distribution. It uses the numpy module to improve speed, creating a ten-fold improvement in execution time compared to a strictly Python-coded earlier version

def boot(self):

if self.N>10:

self.BOOTSTRAP = numpy.sort(numpy.array([numpy.average(numpy.random.choice(self.EP_ARRAY, self.N, replace=True)) for _ in range(Globals.BOOTSTRAP_SIZE)], dtype='f4'))

self.EP_HIGH = self.BOOTSTRAP[Globals.BOOTSTRAP_SIZE * (1-Globals.CONFIDENCE)]

self.EP_LOW = self.BOOTSTRAP[Globals.BOOTSTRAP_SIZE * Globals.CONFIDENCE - 1]

KO Class

Similar to the punt class, the kickoff class carries information particular to the needs of analyzing kickoffs. It creates a list of kickoff objects, though the vast majority of them. All the methods and functions are the same as the punt class, holding information about the EP data, and creating a bootstrap of the EP value.

FG Class

The FG class follows the same structure as the punt and kickoff classes, creating an array of objects indexed by yardline. Because of the nature of field goals either being missed, rouge, or good, it tracks each of the three separately to allow them to be combined later.

__init__ method

Because of the nature of field goals either being missed, rouge, or good, each of the three results is tracked in a separate attribute. This also allows the calculation of separate confidence intervals on each of these

def __init__(self, ydline):

self.YDLINE=ydline

self.N=0

self.GOOD=0

self.ROUGE=0

self.MISSED=0

self.P_GOOD=None

self.P_GOOD_LOW=None

self.P_GOOD_HIGH=None

self.P_ROUGE=None

self.P_ROUGE_LOW=None

self.P_ROUGE_HIGH=None

self.P_MISSED=None

self.P_MISSED_LOW=None

self.P_MISSED_HIGH=None

self.EP=None

self.BOOTSTRAP=Globals.DummyArray

self.EP_ARRAY=[]

self.EP_LOW=None

self.EP_HIGH=None

Calculate method

Using the binomial confidence functions from the Functions module we can determine the confidence intervals on each result using a 1 vs. all method.

def calculate(self): #Calculate all the percentages and the binomial CIs. Can do right away b/c there's no EP aspect involved

if self.N>0:

self.P_GOOD=self.GOOD/self.N

self.P_GOOD_HIGH=Functions.BinomHigh(self.GOOD, self.N, Globals.CONFIDENCE)

self.P_GOOD_LOW=Functions.BinomLow(self.GOOD, self.N, Globals.CONFIDENCE)

self.P_ROUGE=self.ROUGE/self.N

self.P_ROUGE_HIGH=Functions.BinomHigh(self.ROUGE, self.N, Globals.CONFIDENCE)

self.P_ROUGE_LOW=Functions.BinomLow(self.ROUGE, self.N, Globals.CONFIDENCE)

self.P_MISSED=self.MISSED/self.N

self.P_MISSED_HIGH=Functions.BinomHigh(self.MISSED, self.N, Globals.CONFIDENCE)

self.P_MISSED_LOW=Functions.BinomLow(self.MISSED, self.N, Globals.CONFIDENCE)

self.EP=sum(self.EP_ARRAY) / self.N

P(1D) Class

The P(1D) class allows us to calculate the first down probability of a situation from the data previously parsed. This recreates the methods used in the previous examination of P(1D) (Clement 2018b) to determine the P(1D) of down & distance situations with confidence intervals.

Array_Declaration Method

Similar to the previous lists that were one-dimensional for field position, P(1D) objects are held in a dimensional list that lets us consider both down and distance. A separate but similarly constructed array handles & Goal situations

def Array_Declaration():

for down in range (0,4):

temp=[]

for distance in range(0,26):

temp.append(P1D(down, distance))

P1D_ARRAY.append(temp)

for down in range (0,4):

temp=[]

for distance in range(0,26):

temp.append(P1D(down, distance))

P1D_GOAL_ARRAY.append(temp)

__init__ Method

This method defines the attributes we will need. Since P(1D) is a simple binomial variable, we only need to hold the attributes related to this. The confidence intervals are calculated with the binomial functions developed in the Functions module.

def __init__(self, down, distance):

self.DOWN=down

self.DISTANCE=distance

self.N=0

self.X=0

self.P=None

self.LOW=None

self.HIGH=None

self.SMOOTHED=None

Binom method

The binom method calls the binomial confidence functions mentioned above to determine the upper and lower bounds on the confidence interval for P(1D).

def binom(self):
   if self.X > 0:
       self.P=self.X/self.N
       self.LOW=Functions.BinomLow(self.X, self.N, Globals.CONFIDENCE)
       self.HIGH=Functions.BinomHigh(self.X, self.N, Globals.CONFIDENCE)

Conclusion

With the development of a more sophisticated interpreter of the U Sports database it becomes possible to parse in a mere fraction of the time, allowing for shorter development and testing cycles. This method is also far less resource-intensive as the entirety of the raw and parsed data is not constantly being held in memory but instead the raw data is read once at runtime and the parsing happens there. Heavy use of the numpy package provides significant speed improvements as the package is run in C and is highly optimized.

This framework will provide the basis for future research into the nature of different kicking plays, leading into a discussion of EP and 3rd down decision-making. The availability of Python’s many packages will assist in the development of a future WP model through the sklearn package.

References

Clement, Christopher M. 2018a. “It’s the Data, Stupid: Development of a U Sports Football Database.” Passes and Patterns. June 30, 2018. http://passesandpatterns.blogspot.com/2018/06/its-data-stupid-development-of-u-sports.html .

———. 2018b. “Three Downs Away: P(1D) In U Sports Football.” Passes and Patterns. August 23, 2018. http://passesandpatterns.blogspot.com/2018/08/three-downs-away-p1d-in-u-sports.html .

Pezzullo, John C. 2014. “Exact Confidence Intervals for Samples from the Binomial and Poisson Distributions.” Georgetown University Medical Center. http://statpages.info/confint.xls .

Saturday, October 13, 2018

The Roman Numerals of Computing: An Object-Oriented Database of U Sports Football

Abstract

Introduction

Database

Master Module

Globals Module

Functions

Binomial Error

BootCompare

Game class

__init_ method

Game_calc method

make_plays method

DEFENSE_FN Function

O_D_SCORE_FN Function

O_D_TO_FN Function

O_WIN_FN Function

TIME_FN Function

SCORING_PLAY_FN Function

P1D_INPUT_FN Function

EP_INPUT_FN Function

Play Class

__init__ Method

DOWN_FN Function

DISTANCE_FN Function

R_P_FN Function

P_RSLT_FN Function

FPOS_FN Function

YDLINE_FN Function

ODK_FN Function

FG_RSLT_FN Function

GAIN_FN Function

Punt Class

Array_Declaration Method

__init__ Method

Calculate Method

Boot Function

KO Class

FG Class

__init__ method

Calculate method

P(1D) Class

Array_Declaration Method

__init__ Method

Binom method

Conclusion

References

No comments:

Post a Comment

Three Downs Away: P(1D) In U Sports Football

init Method

init Method

init method

init Method