Saturday, June 30, 2018

It's the Data, Stupid: Development of a U Sports Football Database

1-Abstract

A discussion of the development of a fully parsed database of U Sports football play-by-play data, including the sources and structures of available play-by-play data, examples of common errors in the data and methods to rectify them, and complete discussion of the code, structured as individual functions, used to parse the data into discrete columns. The code used is included within the discussion of each function.  The parsed data forms a relational database of plays that allow further research to be conducted at the individual play level. This database is believed to be the first of its kind for both U Sports and Canadian football in general.

2-Introduction

A truism, that any statistical analysis needs a dataset to be analyzed, is as applicable to sports analytics as any other topic. For football, a number of existing databases exist for both NFL (Yurko, Ventura, and Horowitz 2018) , as well as a publicly available scraping tool (Horowitz 2016). Analytics in American football, while not at the level of baseball, soccer, or basketball, are still relatively well-developed. Unfortunately that same academic spirit has yet to infiltrate the sport north of the border. Research into Canadian football is scarce, and there are no publicly available datasets to conduct any analysis. Ergo, any attempts to begin the study of the game, either for its own sake or in comparison to the American game, must begin with a quest to find data.
The fundamental unit of football data is the play. Attempts to perform analysis on game-level data are limited to, at best, meta-analysis and general trends. Ergo, a repository of accurate play-by-play data is prerequisite to any substantial work. Canadian football has two major leagues playing consistent, high-level football and tracking play-by-play data: the CFL and U Sports. The author’s choice to build a U Sports database stems first from the better availability of data, and the perception that U Sports is in some ways a “purer” form of Canadian football, owing to the strong American influence found in the CFL coaching ranks.

3-Data

U Sports play-by-play data was acquired from public sources and put into a Microsoft Excel spreadsheet. The choice of using Excel over more sophisticated tools was a pragmatic one. First, the actual quantity of data is not enormous when compared to the limits of Excel. Second, the ubiquity of Excel makes the database far more portable. Third, the calculations needed for this application are not overly sophisticated and fall well within the capabilities of Excel and VBA scripting. The use of other software would simply be overkill. In all, 1497 games exist in the dataset for a total of 247 850 plays, and a list of these games in chronological order is given in Appendix 1.

a-Sources

Play-by-play data for U Sports football is available in five locations. The main U Sports site(U Sports n.d.), formerly the CIS website, and the sites of each of the four conferences (CWUAA n.d.; OUA n.d.; RSEQ n.d.; AUS n.d.). The data exists in three different formats depending on the year and provenance. The first format, that of PrestoSports (PrestoSports n.d.), is what is currently available on the U Sports site as of 2009, as well as CanWest and OUA games in that time period. This is a two-column format, with the first column giving the down, distance, and field position and the play description in the second column. An example of the PrestoSports data format (hereafter referred to as Data Format 1) is given in Table 1.


2nd and 12 at MAN03
Anthony Coombs rush for 107 yards to the ALB0, TOUCHDOWN, 00:30.
Table 1 Example of Data Format 1  (U Sports 2013)
The second format (Data Format 2) is mostly found in CanWest games between 2006 and 2009; it is a four-column format with the team holding possession in the first column, down and distance in the second, field position in the third, and the play description in the fourth column.
MAN
1-10
MAN52
John Makie pass complete to Mike Mizerski for 58 yards to the SFU0, TOUCHDOWN, 13:21.
Table 2 Example of Data Format 2 (CWUAA 2008)
The third format is found in RSEQ, AUS, and CanWest games after 2002. This is a block-text format (The Automated Scorebook n.d.) that needs significant treatment to combine lines into one row per play and then separated into a two-column format similar to the first format. Data Format 3 is also seen in data for the 2017 RSEQ and AUS seasons because a reformatting of the U Sports site no longer makes visible the first column of Data Format 1, and Data Format 3 is the format used in the RSEQ and AUS conference websites. Ultimately the same information can be parsed from each format. An example of Data Format 3 is given in Table 3.
S 1-10 S45
J. Creighton pass complete to C. Hardwick for 65 yards to the WES0, TOUCHDOWN, clock 14:34.
Table 3 Example of Data Format 3 (Atlantic University Sport 2008)
Each game was imported into the database by simple copy and pasting of the data, as U Sports was unable to provide the data in a more practical format (Ridout 2017).

b-Cleanup

Although each of the three formats contains largely the same information, the difference in structure means that they must be treated separately. Because of the largely volunteer and informal nature of this play-by-play tracking, an extensive amount of data cleanup was needed. The following list details some of the more common problems that needed to be fixed.

i-American vs. Canadian Football

The software behind Data Format 1 is designed for American football, and thus 4 downs. As a result, the system does not automatically handle turnovers on downs in Canadian football. Instead it will show 4th down, and because play-by-play uses jersey numbers, will show the corresponding jersey numbers from the other team’s offensive players. Occasionally the new offense will lose enough yardage on 1st down to allow the nominal offense to gain a “1st down” on their 4th down attempt, meaning that four more plays will now be shown with the wrong team on offense, often losing strange amounts of yardage. This was rectified by identifying every instance of “4th down” and fixing the erroneous plays.

ii-Drive Start Statements

Data Formats 1 and 3 rely on “drive start” statements to identify which team has possession. Where these are missing there is no way for the script to know which team has possession, and only a careful reading would notice a change of possession on a play. This problem was rectified with a script to check that a drive start statement exists for the start of every half, and after every play with a punt, field goal, kickoff, interception, fumble, or 3rd down attempt, flagging rows that do not have such a statement, and then manually adding them where necessary. Table 4 shows an example of a drive start statement.

MAN drive start at 00:57.
Table 4 Example of drive start statement (U Sports 2014)

iii-Missing Score Statements

The score in the game is given by periodic rows giving the score in the game. These score statements should occur after every scoring play but are often missing. A script flagged all possible scoring plays that were not followed by a score statement, and the necessary score statements were added manually. A second script then checked to see if there were any jumps in score that did not correspond to a football scoring play, that is, any time both teams changed score, any decrease in score, or any change in score by an amount that is not 1, 2, 3, 6, 7, or 8. Table 5 gives an example of a score statement.

CGY 31, MAN 50
Table 5 Example of a score statement (U Sports 2014)

iv-Team Names

Team names and abbreviations in U Sports are not standardized, and so the abbreviations used in the play-by-play are entirely inconsistent. These variations range from banal variations on a theme, such as the University of Saskatchewan being abbreviated as either SKH, SASK, or SAS, including the season and sport, such as the University of Waterloo being abbreviated as WATMENFB12, the use of team mascots instead of school names, such as Concordia University being shown as STINGERS, and sometimes simply using HOME and AWAY. Ultimately it was decided to use 3-letter abbreviations of the school name, choosing the most common or unambiguous abbreviations. The use of find/replace and VBA scripts to fix some of the more common variations aided enormously, and a list of school’s and their abbreviations sorted by conference is provided in Table 6. These abbreviations are used universally throughout the dataset and are the only means by which teams are identified or referred to.


Canada West
Universities
Athletic
Association
CanWest
Ontario
University
Athletics
OUA
Réseau du
sport
étudiant
du Québec
RSEQ
Atlantic
University
Sport
AUS
University of
Manitoba
MAN
University of
Windsor
WIN
McGill
University
MCG
Acadia
University
ACA
University of
Saskatchewan
SKH
University of
Western Ontario
WES
Concordia
University
CON
Mount Allison
University
MTA
University of
Regina
REG
University of
Waterloo
WAT
Université de
Montreal
MON
St. Francis
Xavier
University
SFX
University of
Calgary
CGY
Wilfrid Laurier
University
WLU
Université de
Sherbrooke
SHE
Saint Mary’s
University
SMU
University of
Alberta
ALB
University of
Guelph
GUE
Université de
Laval
LAV
Bishop’s
University
(from 2017)
BIS
University of
British Columbia
UBC
McMaster
University
MAC
Bishop’s
University
(through 2016)
BIS


Simon Fraser
University
(through 2009)
SFU
University of
Toronto
TOR






York
University
YRK






Queen’s
University
QUE






University of
Ottawa
OTT






Carleton
University
(from 2013)
CAR




Table 6 List of U Sports teams and abbreviations


c-Parsing

The play-by-play data was parsed using a script in Microsoft Excel’s VBA. The columns calculated are listed below in the order in which they are calculated, which is not the order in which they are displayed, because the display is designed for readability, but some columns are dependent on other columns to simplify the calculation and so are calculated in a different order.
The database is parsed column-wise, passing through each row in the set and calculating that column’s value, repeating the loop for each column. This is a less computationally efficient method than could be achieved working row-wise, but it was specifically chosen for it’s maintainability, each column can be treated in a unique function and bugs are not transmitted across columns, making problems easy to isolate. The longer runtime of the script is not a concern, as the entire database is processed in ~15-30 seconds. There are huge opportunities to optimize the runtime, but that is not the goal. Since computing power grows exponentially or perhaps slightly slower, and the dataset will grow linearly or perhaps slightly faster if more teams join the league, the runtime of the code will simply never be a problem. Most importantly, it must be easy to add features and columns to adapt to the changing scope of a research project, and the code as written allows for that by adding a new function to the list of executed functions. If repeated iterations of the program are executed and runtime somehow becomes an issue then any columns that do not have downstream dependencies can simply be commented out. A quick reading of the code will highlight the most computationally-intensive functions that should be the first to comment out in this case.

i-Down (0, 1, 2, 3)

This column shows the down state for each play. Kickoffs and PAT attempts were shown as having down 0 as a placeholder. The down column can be directly parsed from the play-by-play. In Data Format 1 a search for the term “0th,” “1st,” “2nd,” or “3rd” defined the down. In Data format 2 the first character of column 2 is the down, and in Data Format 3 the down is the character immediately preceding the first hyphen. Because the structure of the data formats differs a single function could not be used across all three data formats. Instead only the code for Data Format 1 is shown as an example.
'DOWN - determine the down of the play
For row = 2 To rowcount
   If Len(DataOne(row, 1)) > 3 Then 'if the length of the D&D field is 3 then it's a qtr statement
       If Left(DataOne(row, 1), 3) = "1st" Then 'going through these in the order of probability for efficiency
           DataOne(row, DOWN_COL) = 1
       ElseIf Left(DataOne(row, 1), 3) = "2nd" Then
           DataOne(row, DOWN_COL) = 2
       ElseIf Left(DataOne(row, 1), 3) = "3rd" Then
           DataOne(row, DOWN_COL) = 3
       ElseIf Left(DataOne(row, 1), 3) = "0th" Then
           DataOne(row, DOWN_COL) = 0
       End If
   End If
Next row

ii-Distance (1-109)

This is the distance to gain on the play. Kickoffs have a distance of 10 and PAT attempts are considered & Goal plays and thus have a distance equal to the field position. The distance column can be directly parsed from the play-by-play. In Data format 1 the number after the word “and” in column 1 is the distance. In Data Format 2 the number after the hyphen in column 2 is the distance. In Data Format 3 the number after the hyphen in column 1 is the distance. As with Down, Distance requires a bespoke function for each data format, and the example below is used for Data Format 1.
'DISTANCE - determine the distance to gain of the play
For row = 2 To rowcount
   If InStr(DataOne(row, 1), " and ") Then 'check for an "and" in D&D to show we have a real D&D statement
       DataOne(row, DIST_COL) = CInt(Trim(Mid(DataOne(row, 1), 9, 2))) 'grab the distance, trim out extra spaces to account for 1 or 2-digit numbers, and force conversion to integer to throw an error if something is awry.
   End If
Next row

iii-Offense (Any team abbreviation)

This identifies which team is currently on offense. For kicking plays the kicking team is defined as the offense. Data Format 1 uses “drive start” statements, and example of which is shown in Table 4. Data Format 3 uses the same type of statements to identify possession. The team with possession remains the same until a new possession statement causes it to change. Data Format 2 shows possession directly in column 1.
Function OFF_FN(data As Variant)'Determine the offensive team
For row = 2 To rowcount
   If InStr(data(row, 2), "drive start") Then 'Identify drive start statements
       OFF = Left(data(row, 2), 3) 'Set the offense to the team stated
   End If
   data(row, OFF_COL) = OFF 'Recycle the offensive team until there's a reason to switch
Next row
End Function

iv-Home/Away (Any team abbreviation)

Home and Away are two columns calculated in the same function. Game start statements exist before each game in the same style across all data formats, an example of which is given below in Table 7. The Home and Away teams remain the same until a new game start statement causes it to change.
MAN vs. SMU (2007-11-23 at Toronto, ON)

Table 7 Example of game statement (CWUAA 2007)
Function HOME_AWAY_FN(data As Variant)'Find the home and away teams
For row = 2 To rowcount
   If InStr(data(row, 1), " vs. ") Then 'this is only found on game statements
       awayteam = Left(data(row, 1), 3) 'away team is named first
       hometeam = Mid(data(row, 1), 9, 3) 'home team is named second
   End If
   data(row, AWAY_COL) = awayteam 'recycle until there's a new game
   data(row, HOME_COL) = hometeam
Next row
End Function

v-Defense (Any team abbreviation)

Having determined the offensive team as well as the home and away teams, the defensive team is simply the one that is not the offense.
Function DEF_FN(data As Variant)'Find the defensive team
For row = 2 To rowcount
   If data(row, HOME_COL) = data(row, 28) Then 'If the home team has the ball the away team is the defense
       data(row, DEF_COL) = data(row, AWAY_COL)
   Else
       data(row, DEF_COL) = data(row, HOME_COL) 'or vice-versa
   End If
Next row
End Function

vi-Field Position (1-54, -1--54)

The field position of the current play, using +/- notation where the field position is negative when the ball is on the offensive or kicking team’s side of midfield, and positive when the ball is in the defensive team’s end. For Data Format 1 field position is calculated in two steps. First by using the number that is the last two characters of column 1 to determine the absolute value of field position. Then comparing whether the three characters preceding that match the offensive team. If so then the ball is on the offensive side of midfield and the sign of the field position is made negative. In Data Format 2 much the same process is followed, though field position is given in column 3. In Data Format 3 the first character of column 1 is compared to the single character preceding the last two digits. This is because there is a single character code used to identify the two teams playing in this format. Unfortunately the coding was so inconsistent and ambiguous it could not be used as a play-by-play marker of possession. This column cannot be calculated with the same function for all data formats, so the code shown is an example for Data Format 1.
'FPOS - This loop determines the field position and needs to go AFTER we determine possession
For row = 2 To rowcount
   If InStr(DataOne(row, 1), " and ") Then  'only D&D statements would have "and" in them
       If Mid(DataOne(row, 1), Len(DataOne(row, 1)) - 4, 3) = DataOne(row, OFF_COL) Then 'check for whether we are in + or - territory
           DataOne(row, FPOS_COL) = Right(DataOne(row, 1), 2) * -1 'if in - territory we invert the sign of the field position
       Else
           DataOne(row, FPOS_COL) = Right(DataOne(row, 1), 2)
       End If
       If DataOne(row, FPOS_COL) = "-55" Then 'all plays on the 55 are marked positive
           DataOne(row, FPOS_COL) = 55
       End If
   End If
Next row

vii-Quarter (1-5)

The quarter in which the play occurs, overtime is shown as (5), regardless of whether there are multiple overtimes. Quarter start statements are included in the play-by-play data as a row with just “2nd,” “3rd,” “4th,” or “OT,” an example of which is shown in Table 8. New game statements reset the quarter to (1).
2nd

Table 8 Example of quarter start statement (U Sports 2014)
Function QUARTER_FN(data As Variant)'determine what quarter we're in
For row = 2 To rowcount
   If InStr(data(row, 1), "vs.") Then 'New games mean 1st quarter
       qtr = 1
   ElseIf data(row, 1) = "2nd" Then 'identify quarter statements, this is the entire statement
       qtr = 2
   ElseIf data(row, 1) = "3rd" Then
       qtr = 3
   ElseIf data(row, 1) = "4th" Then
       qtr = 4
   ElseIf data(row, 1) = "OT" Then 'Overtime is considered the 5th quarter
       qtr = 5
   End If
   data(row, QTR_COL) = qtr 'stick with the quarter until there's a reason to change
Next row
End Function

viii-Yardline (0-109)

Yardline is an alternative means of showing field position. The standard notation of field position is not intuitive to computers as it goes from 0 to -54 to 55 to 0. Instead, we use Yardline in certain calculations to show distance from the goal line measured from 109-0. If Field Position is positive then Yardline is equal to Field Position. If Field Position is negative then Yardline is equal to Field Position +110.
Function YDLINE_FN(data As Variant) 'Determine the yardline from the field position
For row = 2 To rowcount
   If IsEmpty(data(row, FPOS_COL)) = False Then 'Don’t bother looking at rows that don’t have a Field Position value
       If data(row, FPOS_COL) > 0 Then
           data(row, YARDLINE_COL) = data(row, FPOS_COL) 'Positive values stay positive
       Else
           data(row, YARDLINE_COL) = 110 + data(row, FPOS_COL) 'Negative values get converted
       End If
   End If
Next row
End Function

ix-Year (2002-2017)

This function returns the year in which the game was played. Since the range of seasons is small and known the function looks for those seasons as strings within the game statements.
Function YEAR_FN(data As Variant) 'Determine the year the game was played
For row = 2 To rowcount
   If InStr(data(row, 1), " vs. ") Then 'Only looking in game statements
       For x = 2002 To 2017 'Loop through all the years in our database
           If InStr(data(row, 1), x) Then 'If the appropriate year is found
               season = x
               Exit For
           End If
       Next x
   End If
   data(row, YEAR_COL) = season 'Recycle until the next game statement
Next row
End Function

x-League (“U Sports”)

This column simply shows “U Sports” on every year. Although seemingly redundant, it allows for better compatibility with datasets of different leagues.
Function LEAGUE_FN(data As Variant, lge As String) 'Just put the league in every cell
For row = 2 To rowcount
   data(row, LEAGUE_COL) = lge
Next row
End Function

xi-Run/Pass (R/P)

This identifies whether the play is a run or a pass. It is blank for most plays that are not “OD” plays. The run/pass column can be parsed directly from the play-by-play by looking in the play description for the phrase “rush” to identify running plays, and either “pass” or “sack” to identify passing plays.
Function RUN_PASS_FN(data As Variant, col As Integer) 'Identify run or pass
For row = 2 To rowcount
   If InStr(data(row, col), "rush") Then 'The word rush tells us it’s a run
       data(row, RP_COL) = "R"
   ElseIf InStr(data(row, col), "pass") Then 'obviously a pass
       data(row, RP_COL) = "P"
   ElseIf InStr(data(row, col), "sacked") Then 'Sacks are pass plays
       data(row, RP_COL) = "P"
   End If
Next row
End Function

xii-Pass Result (C, I, X, S)

Pass Result is only calculated for plays identified as passing plays. By looking in the play description the script finds the phrases “incomplete” (I), “complete” (C), “intercept” (X), or “sack.” Unfortunately scrambles, which are usually denoted by (R) are not shown because the play-by-play simply notes them as rushing plays. Additionally, some sacks are marked as rushing plays as well, but it is not feasible to determine which rushing plays should be changed to sacks or scrambles.
Function PASS_RESULT_FN(data As Variant, col As Integer) 'Find the result of the pass
For row = 2 To rowcount
   If data(row, RP_COL) = "P" Then 'only check pass plays
       If InStr(data(row, col), "incomplete") Then  'incomplete has to come first because otherwise it would catch "complete" within “incomplete”
           data(row, P_RSLT_COL) = "I"
       ElseIf InStr(data(row, col), "complete") Then
           data(row, P_RSLT_COL) = "C"
       ElseIf InStr(data(row, col), "sacked") Then
           data(row, P_RSLT_COL) = "S"
       ElseIf InStr(data(row, col), "intercepted") Then
           data(row, P_RSLT_COL) = "X"
       End If
   End If
Next row
End Function

xiii-ODK (“OD”, “KO”, “FG”, “P”, “PEN”)

This column identified the type of football play, either an offensive play from scrimmage (OD), a kickoff (KO), field goal (FG), punt (P), or penalty (PEN). The phrase “kickoff” appears in all data formats in the play description, as does the phrase “punt.” “Field goal” appears for all field goal attempts, and “kick attempt” for all PAT attempts, which are also coded as “FG.” “OD” plays are plays that remain that are noted as being runs or passes. Penalties are plays that do not fit into the above categories that are simply dead-ball penalties. Intentional safeties are considered punts as well. There is some small chance of a legitimate OD play being coded as a punt in this fashion but it is improbable.
Function ODK_FN(data As Variant, col As Integer) 'Determine the ODK of a play
For row = 2 To rowcount
   If data(row, DOWN_COL) = 3 And InStr(data(row, col), "SAFETY") Then  'accounting for intentional safeties
       data(row, ODK_COL) = "P"
   ElseIf IsEmpty(data(row, RP_COL)) = False Then 'If it's a run or a pass then it's pretty straightforward.
       data(row, ODK_COL) = "OD"
   ElseIf InStr(data(row, col), "punt") Then  'punts are labelled
       data(row, ODK_COL) = "P"
   ElseIf InStr(data(row, col), "kickoff") Then 'kickoffs are labelled
       data(row, ODK_COL) = "KO"
   ElseIf InStr(data(row, col), "field goal") Or InStr(data(row, col), "kick attempt") Then   'field goals, "kick attempt" implies PAT
       data(row, ODK_COL) = "FG"
   ElseIf InStr(data(row, col), "PENALTY") Then  'if it's nothing else but there's a penalty we flag it as such
       data(row, ODK_COL) = "PEN"
   End If
Next row
End Function

xiv-Is Play (TRUE, FALSE)

this column identified plays as being TRUE, as opposed to rows that served administrative functions, such as the start of games, possession changes, score changes, or timeouts (FALSE). This also served as a dummy column when combining the different datasets into the final database to include only football plays. Any play with a value for ODK is marked as TRUE.
Function ISPLAY_FN(data As Variant) 'Determine if it’s a play or admin
For row = 2 To rowcount
   If IsEmpty(data(row, ODK_COL)) = False Then 'If it has ODK it’s a play
       data(row, ISPLAY_COL) = True
   Else
       data(row, ISPLAY_COL) = False
   End If
Next row
End Function

xv-Clock (00:00-15:00)

Clock shows the time remaining on the game clock. This is only shown for plays where there is a clock statement associated, or for the first play after a clock statement occuring on a non-play. Clock statements are identified in the play description by the existence of a colon (:), and the clock time is the 2 characters on either side of the colon and the colon itself. An example of a play with a clock statement is shown in Table 9.
1st and 10 at MAN21
Jordan Yantz pass complete to Andrew Smith for 89 yards to the REG0, TOUCHDOWN, 00:43
Table 9 Example of clock statement (CWUAA 2013b)
Function CLOCK_FN(data As Variant, col As Integer) 'Identify clock statements
For row = 2 To rowcount
   If InStr(data(row, col), ":") Then
       If InStr(data(row, col), ":") < 30 Then 'if it comes early in the play description it means it applies to the current play
           data(row, CLOCK_COL) = Mid(data(row, col), InStr(data(row, col), ":") - 2, 5)
       ElseIf data(row, ISPLAY_COL) Then  'otherwise it's talking about the time on the clock at the END of the play
           data(row + 1, CLOCK_COL) = Mid(data(row, col), InStr(data(row, col), ":") - 2, 5)
       Else 'In this case it's for simple CLOCK statements with no play
           data(row, CLOCK_COL) = Mid(data(row, col), InStr(data(row, col), ":") - 2, 5)
       End If
   ElseIf data(row, QTR_COL) = 5 Then 'time is set to 0 in overtime
       data(row, CLOCK_COL) = "00:00"
   End If
Next row


For row = 2 To rowcount 'this loop puts a clock statement at the beginnings of quarters and the ends of halves
   If data(row, 1) = "1st" Or data(row, 1) = "2nd" Or data(row, 1) = "3rd" Or data(row, 1) = "4th" Or InStr(data(row, 1), " vs. ") Then
       data(row, CLOCK_COL) = "15:00"
   ElseIf InStr(data(row, col), "end of game") Then  'at the end of the game the clock is obv at 0
       data(row, CLOCK_COL) = "00:00"
   End If
Next row


For row = 2 To rowcount 'Here we account for clock statements on non-plays to help fill in some gaps and reduce errors in our averaging
   If IsEmpty(data(row, CLOCK_COL)) = False Then 'if we have a clock statement
       If data(row, ISPLAY_COL) = False Then 'and it's not applied to an actual play
           If IsEmpty(data(row + 1, CLOCK_COL)) Then 'and the next row does not have a clock stamp
               data(row + 1, CLOCK_COL) = data(row, CLOCK_COL) 'paste it into the next row as well
           End If
       End If
   End If
Next row


For row = 2 To rowcount 'Puts clock statements into non-plays where the preceding play has one to fill more gaps
   If IsEmpty(data(row, CLOCK_COL)) = False Then 'if we have a clock statement
       If data(row + 1, ISPLAY_COL) = False Then 'and the next play is a non-play
           If IsEmpty(data(row + 1, CLOCK_COL)) Then  'and the next play doesn't have a clock statement
               data(row + 1, CLOCK_COL) = data(row, CLOCK_COL) 'copy over the previous timestamp
           End If
       End If
   End If
Next row
End Function

xvi-Gain (-109-109)

This column gives the gain of the play from scrimmage for plays that have a value for Run/Pass. Plays that result in a loss are shown as a negative gain. Incomplete passes and interceptions are shown as a gain of 0. The phrase “no gain” is also a gain of 0. Thereafter the gain is given by the number preceding the first instance of the word “yards;” if the phrase “loss” precedes this then the sign is made negative to show the loss on the play.
Function GAIN_FN(data As Variant, col As Integer) 'Find the gain on the play
For row = 2 To rowcount
   If data(row, ODK_COL) = "OD" Then
       If data(row, P_RSLT_COL) = "I" Or data(row, P_RSLT_COL) = "X" Then 'incompletions and interceptions are 0-yard gains
           data(row, GAIN_COL) = 0
       ElseIf InStr(data(row, col), "no gain") Then  'no gain means 0 gain
           data(row, GAIN_COL) = 0
       ElseIf InStr(data(row, col), "loss of") Then  'loss has to come before gain because gains don't specifically say gain
           data(row, GAIN_COL) = "-" + Trim(Mid(data(row, col), InStr(data(row, 2), "loss of") + 7, 3))
       ElseIf data(row, DOWN_COL) = 0 Then 'handling extra points
           If InStr(data(row, col), "good") Then
               data(row, GAIN_COL) = data(row, DIST_COL)
           ElseIf InStr(data(row, col), "failed") Then
               data(row, GAIN_COL) = 0
           End If
       ElseIf InStr(data(row, col), "yard") = 0 Then 'if something goes screwy flag it
           data(row, 12) = "FIX ME!GAIN"
       Else
           data(row, GAIN_COL) = Trim(Mid(data(row, col), InStr(data(row, col), "yard") - 3, 3)) 'and for regular gains
       End If
   End If
Next row
End Function

xvii-Home/Away Score (0-99)

This column determines the current score in the game from the use of score statements that appear periodically and after each scoring play. Score statements follow the format shown below in Table 5. The away score is the number preceding the comma, and the home score is the number at the end of the score statement.
Function HOME_AWAY_SCORE_FN(data As Variant, col As Integer) 'Find the home and away scores
For row = 2 To rowcount
   If Len(data(row, col)) > 11 And Len(data(row, col)) < 15 Then 'score statements have a limited range of lengths
       If Mid(data(row, col), 4, 1) = " " Then '4th char MUST be a space
           If InStr(data(row, col), ", ") Then
               If Left(data(row, col), 3) = data(row, AWAY_COL) Then
                   awayscore = Trim(Mid(data(row, col), InStr(data(row, col), ",") - 2, 2))
                   homescore = Trim(Right(data(row, col), 2))
               End If
           End If
       End If
   End If
   If InStr(data(row, 1), "vs.") Then 'New game resets the score
       awayscore = 0
       homescore = 0
   End If
   data(row, AWAY_SCORE_COL) = awayscore 'Reuse the score until a change happens
   data(row, HOME_SCORE_COL) = homescore
Next row
End Function

xviii-Off/Def Score (0-99)

Offense/Defense score is determined by checking to see whether the home team has possession or not, and then assigning the appropriate score from Home/Away score, and then assigning the other score to the defense.
Function OFF_DEF_SCORE_FN(data As Variant) 'Find the offensive and defensive score
For row = 2 To rowcount
   If data(row, HOME_COL) = data(row, OFF_COL) Then 'If the home team has the ball
       data(row, OFF_SCORE_COL) = data(row, HOME_SCORE_COL)
       data(row, DEF_SCORE_COL) = data(row, AWAY_SCORE_COL)
   Else 'If the away team has the ball
       data(row, OFF_SCORE_COL) = data(row, AWAY_SCORE_COL)
       data(row, DEF_SCORE_COL) = data(row, HOME_SCORE_COL)
   End If
Next row
End Function

xix-Home Lead (0-99)

Home Lead is simply the score of the home team minus the score of the away team. If the lead is negative then the home team is trailing.
Function HOME_AWAY_LEAD_FN(data As Variant) 'Find the lead of the home team
For row = 2 To rowcount
   data(row, HOME_LEAD_COL) = data(row, HOME_SCORE_COL) - data(row, AWAY_SCORE_COL) 'Subtract the away score from the home score
Next row
End Function

xx-Offense Lead (0-99)

Offense Lead is the score of the offensive team minus the score of the defensive team. If the lead is negative then the offense is trailing.
Function OFF_DEF_LEAD_FN(data As Variant) 'Find the lead of the offensive team
For row = 2 To rowcount
   If data(row, HOME_SCORE_COL) = data(row, OFF_SCORE_COL) Then 'If the home team has the ball
       data(row, OFF_LEAD_COL) = data(row, HOME_LEAD_COL) 'Offensive lead equals home lead
   Else 'If the away team has the ball
       data(row, OFF_LEAD_COL) = -1 * data(row, HOME_LEAD_COL) 'Offense lead is negative home lead
   End If
Next row
End Function

xxi-Home Win (TRUE, FALSE)

Home Win shows whether the home team will win this game, in which case the result is TRUE. It is calculated by finding the game start statement, finding the next game start statement, and seeing if Home Lead is positive on the preceding line, then filling in every row of the game with the appropriate value.
Function HOME_WIN_FN(data As Variant, col As Integer) 'State if the home team wins
For row = 2 To rowcount
   If InStr(data(row, 1), " vs. ") Then 'Look for game starts
       For templong = row + 1 To row + 500 'Games donèt last more than 500 rows
           If InStr(data(templong, col), "end of game") Then 'look for the end of the game
               If data(row, HOME_LEAD_COL) > 0 Then 'If the home team leads
                   win = True
               Else 'If the home team is trailing
                   win = False
               End If
               For x = row To templong
                   data(x, HOME_WIN_COL) = win 'Copy the win status for the whole game
               Next x
               row = templong 'jump to the end of the game to save wasted cycles
               Exit For
           End If
       Next templong
   End If
Next row
End Function

xxii-Offense Win (TRUE, FALSE)

Offense Win shows if the offense wins the game. It compares the offensive team to the home team. If they are equal then Offense Win is equal to Home Win. If not then Offense win is the opposite of Home Win.
Function OFF_WIN_FN(data As Variant) 'State if the offense wins
For row = 2 To rowcount
   If data(row, HOME_COL) = data(row, OFF_COL) Then 'If the home team has the ball
       data(row, OFF_WIN_COL) = data(row, HOME_WIN_COL) 'Offense win equals home win
   Else
       data(row, OFF_WIN_COL) = Not (data(row, HOME_WIN_COL)) 'Offense win is the opposite of home win
   End If
Next row
End Function

xxiii-Time (0-3600)

Time gives the amount of time remaining in the game, shown in seconds. It is calculated by first assigning the known values of the start of each quarter, 3600 seconds at the start of the 1st quart, 2700 at the start of the 2nd quarter, 1800 at the start of the 3rd quarter, and 900 seconds at start of the 4th quarter. All overtime plays show 0 seconds. Any plays that have a value for Clock are then converted to seconds by taking the quarter, multiplying by 900, subtracting 900, adding the number of seconds in Clock (the last two digits) and adding 60 times the number of minutes in Clock (the first two digits). Then, for all the gaps the number of plays is counted and the gap between statements is calculated. The average gap is calculated and each play is given a Time value of the preceding play minus the average gap, a linear spline across the gap. Then all Times are rounded to integer values.
Function TIME_FN(data As Variant) 'Determines the time from the clock statements
For row = 2 To rowcount
   If Not (IsEmpty(data(row, CLOCK_COL))) Then 'just convert all the clock cells to times
       data(row, TIME_COL) = 3600 - 900 * data(row, QTR_COL) + 60 * Left(data(row, CLOCK_COL), 2) + Right(data(row, CLOCK_COL), 2) 'Each quarter is 900 seconds, plus 60 * the minutes plus the seconds
   End If
Next row


For row = 2 To rowcount 'This loop splines the gaps
   If IsEmpty(data(row, TIME_COL)) Then 'if we find the start of a gap
       counter = 0 'couns how many isplays are in the gap
       For templong = row To row + 500 'This is longer than any gap could be
           If IsEmpty(data(templong, TIME_COL)) = False Then 'We've found a timestamp to bookend our gap
               Gap = data(row - 1, TIME_COL) - data(templong, TIME_COL) 'now we've defined our gap between the two known timestamps
               Exit For
           ElseIf data(templong, ISPLAY_COL) Then  'if we find an actual play with no timestamp we count it
               counter = counter + 1
           End If
       Next templong
       For x = row To templong - 1 'now we pass over the same gap
           If data(x, ISPLAY_COL) = False Then
               data(x, TIME_COL) = data(x - 1, TIME_COL) 'non-plays just get the same as what's above them
           Else
               data(x, TIME_COL) = data(x - 1, TIME_COL) - (Gap / (counter + 1)) 'plays get the previous time minus the average (+1 because there's one more gap than there are plays in the gap)
           End If
       Next x
       row = templong 'Jump to the end of the gap to save time
   End If
Next row
On Error Resume Next
For row = 2 To rowcount 'this loop rounds all the estimated times we calculated in the previous loop
   data(row, TIME_COL) = Round(data(row, TIME_COL), 0)
Next row
End Function

xxiv-Scoring Play (“O-TD”, “O-FG”, “O-SAFETY”, “O-ROUGE”, “D-SAFETY”, “D-TD”)

This column identifies plays that are scoring plays according to who scored and what type of scoring play. Touchdowns have the word “TOUCHDOWN” in the play description. On OD plays these are offensive touchdowns unless the play is an interception or the play description has the word “FUMBLE.” If the play is a kickoff, a punt, or a field goal attempt it is a defensive touchdown. Field goals are obviously always offensive, and are identified by the phrase “GOOD” in the play description. Rouges are also always to the offense, and have the phrase “ROUGE” in the play description. Safeties are credited to the team conceding the safety, and so usually belong to the offense. On kicking plays behind the -40 yard line these are credited as offensive safeties, and beyond that are defensive safeties, as it is rather rare to concede a 45-yard loss for a safety, or to have a 70-yard punt.
Function SCORING_PLAY_FN(data As Variant, col As Integer) 'Identifies scoring plays
For row = 2 To rowcount
   If InStr(data(row, col), "ROUGE") Then  'ROUGE is easy, defense will never score a rouge
       data(row, SCORING_PLAY_COL) = "O-ROUGE"
   ElseIf data(row, ODK_COL) = "FG" And InStr(data(row, col), "GOOD") Then  'defence also never scores a FG
       data(row, SCORING_PLAY_COL) = "O-FG"
   ElseIf InStr(data(row, col), "TOUCHDOWN") Then  'touchdowns are tricky, they can be scored by either team
       For templong = row To row + 50 'Look for the PAT
           If InStr(data(templong, col), "attempt") Then 'Check which team did the PAT to tell us which team scored
               If data(row, OFF_COL) = data(templong, OFF_COL) Then
                   data(row, SCORING_PLAY_COL) = "O-TD"
               Else
                   data(row, SCORING_PLAY_COL) = "D-TD"
               End If
               row = templong
               Exit For
           ElseIf InStr(data(templong, col), "end of game") Then 'If the game ends on a TD meaning no PAT
               If data(templong, OFF_SCORE_COL) > data(row, OFF_SCORE_COL) Then 'Check to see who’s score increased
                   data(row, SCORING_PLAY_COL) = "O-TD"
               Else
                   data(row, SCORING_PLAY_COL) = "D-TD"
               End If
               row = templong
               Exit For
           End If
       Next templong
   ElseIf InStr(data(row, col), "SAFETY") Then  'Safeties are usually on the offense but sometimes are the result of bad returns
       If data(row, YARDLINE_COL) < 65 Then
           data(row, SCORING_PLAY_COL) = "D-SAFETY" 'it's really hard to give up a safety from beyond your own 45
       Else
           data(row, SCORING_PLAY_COL) = "O SAFETY"
       End If
   End If
Next row
End Function

xxv-P(1D) Input (TRUE, FALSE)

This column identifies whether the offense successfully gains a first down within this drive, in order to calculate First Down Probability (P(1D)). If a 1st & 10, 1st & Goal, or TD occur before a change of possession, punt, field goal attempt, or end of half the column is scored as TRUE.
Function P1D_INPUT_FN(data As Variant) 'Determine the input for P(1D)
For row = 2 To rowcount
   If data(row, ODK_COL) = "OD" Then 'Only looking at OD plays
       If data(row, SCORING_PLAY_COL) = "O-TD" Then 'TD plays obviously success
           data(row, P1D_INPUT_COL) = True
       ElseIf data(row, SCORING_PLAY_COL) = "D-TD" Then 'Defensive TD obviously fail
           data(row, P1D_INPUT_COL) = False
       ElseIf data(row, SCORING_PLAY_COL) = "O-SAFETY" Then 'Safety obviously fail
           data(row, P1D_INPUT_COL) = False
       Else
           For templong = row + 1 To row + 100
               If data(templong, 28) <> data(row, 28) Then 'if loss of possession then p(1D) is failed
                   poned = False
                   Exit For
               ElseIf data(templong, DOWN_COL) = 1 And data(templong, DIST_COL) = 10 Then 'if 1st & 10 then P(1D) is success
                   poned = True
                   Exit For
               ElseIf data(templong, DOWN_COL) = 1 And data(templong, DIST_COL) = data(templong, FPOS_COL) Then 'if 1st & Goal then P(1D) is success
                   poned = True
                   Exit For
               ElseIf data(templong, SCORING_PLAY_COL) = "O-TD" Then 'if score a touchdown then P(1D) is success
                   poned = True
                   Exit For
               ElseIf data(templong, SCORING_PLAY_COL) = "D-TD" Then 'if D score a touchdown then P(1D) is fail
                   poned = False
                   Exit For
               ElseIf data(templong, SCORING_PLAY_COL) = "O-SAFETY" Then 'Giving up safeties is a fail
                   poned = False
                   Exit For
               End If
           Next templong
           For x = row To templong - 1
               data(x, P1D_INPUT_COL) = poned 'Fill for the whole sequence
           Next x
           row = templong - 1 'Jump ahead
       End If
   End If
Next row
End Function

xxvi-EP Input (“O-TD”, “O-FG”, “O-SAFETY”, “O-ROUGE”, “HALF”, “D-SAFETY”, “D-TD”)

This column determines what the next scoring play is to be able to calculate Expected Points (EP). It looks for the next scoring play in the Scoring Play column, as well as for the end of half or end of game. Once it finds the next scoring play it checks if the offense on the next scoring play is the same as current offense. If not then it flips the offense/defense of the scoring play.
Function EP_INPUT_FN(data As Variant, col As Integer) 'Determine the input for EP
For row = 2 To rowcount
   For templong = row To row + 500 'Looking forward
       If IsEmpty(data(templong, SCORING_PLAY_COL)) = False Then 'Find a scoring play
           If Left(data(templong, SCORING_PLAY_COL), 1) = "O" Then 'Determine who scored
               scoreteam = data(templong, 28)
               SCORE = Right(data(templong, SCORING_PLAY_COL), Len(data(templong, SCORING_PLAY_COL)) - 2)
               Exit For
           ElseIf Left(data(templong, SCORING_PLAY_COL), 1) = "D" Then
               scoreteam = data(templong, DEF_COL)
               SCORE = Right(data(templong, SCORING_PLAY_COL), Len(data(templong, SCORING_PLAY_COL)) - 2)
               Exit For
           End If
       ElseIf InStr(data(templong, col), "3rd") Or InStr(data(templong, col), "end of game") Then 'Check for end of half
           scoreteam = data(templong, 28)
           SCORE = "HALF"
           Exit For
       End If
   Next templong
   For x = row To templong 'Fill in from now to the scoring play
       If data(x, 28) = scoreteam Then 'Alternate O/D score based on who has the ball now and who ends up scoring
           data(x, EP_INPUT_COL) = "O-" & SCORE
       Else
           data(x, EP_INPUT_COL) = "D-" & SCORE
       End If
   Next x
   row = templong
Next row
End Function

xxvii-Conference (“CanWest”, “OUA”, “RSEQ”, “AUS”, “NONCON”)

This column shows in what conference the game is played. By using four lists each listing all the teams in a given conference, it checks the two teams in the game. If both teams are in the same list then this game is in that conference. If the two teams are in different lists the game is listed as a non-conference game (NONCON). Because of Bishop’s University moving from the RSEQ to the AUS for the 2017 season the data for that season and future seasons has to be treated with a different set of lists.
Function CONFERENCE_FN(data As Variant, AUS() As String, RSEQ() As String, OUA() As String, CANWEST() As String) 'Determine in what conference the games are played
For row = 2 To rowcount
   If InStr(data(row, 1), " vs. ") Then 'only looking at game statements
       If UBound(Filter(AUS, data(row, AWAY_COL))) > -1 And (UBound(Filter(AUS, data(row, HOME_COL))) > -1) Then 'Check if both teams are in the same conference
           conf = "AUS"
       ElseIf UBound(Filter(RSEQ, data(row, AWAY_COL))) > -1 And (UBound(Filter(RSEQ, data(row, HOME_COL))) > -1) Then
           conf = "RSEQ"
       ElseIf UBound(Filter(OUA, data(row, AWAY_COL))) > -1 And (UBound(Filter(OUA, data(row, HOME_COL))) > -1) Then
           conf = "OUA"
       ElseIf UBound(Filter(CANWEST, data(row, AWAY_COL))) > -1 And (UBound(Filter(CANWEST, data(row, HOME_COL))) > -1) Then
           conf = "CANWEST"
       Else
           conf = "NONCON" 'If they’re not in the same list it’s a nonconference game
       End If
   End If
   data(row, CONFERENCE_COL) = conf 'copy through until next game statement
Next row
End Function

xxviii-Month (8-11)

This column shows the month in which the game was played, which is found in the game statements by using the 10-character sequence at the end of the game statement that matches the date format YYYY-MM-DD, pursuant to ISO 8601 (Munroe 2013).
Function MONTH_FN(data As Variant) 'Find the month the game was played
For row = 2 To rowcount
   If InStr(data(row, 1), "vs.") Then 'look at game statements
       MO = MONTH(Mid(data(row, 1), InStr(data(row, 1), data(row, YEAR_COL)), 10)) 'The last 10 characters are the game date, grab the month
   End If
   data(row, MONTH_COL) = MO 'Reuse until next game
Next row
End Function

xxix-Day (1-31)

This column gives the day of the game’s date by the same method as the Month and Year columns.
Function DAY_FN(data As Variant) 'Get the day the game was play
For row = 2 To rowcount
   If InStr(data(row, 1), "vs.") Then
       DAE = DAY(Mid(data(row, 1), InStr(data(row, 1), data(row, YEAR_COL)), 10))'The last 10 characters are the date, grab the day
   End If
   data(row, DAY_COL) = DAE 'Repeat until next game
Next row
End Function

xxx-Weekday (1-7)

This column gives the day of the week of the game using the same method as the Year, Month, and Day columns.
Function WEEKDAY_FN(data As Variant) 'Get the day of the week of the game
For row = 2 To rowcount
   If InStr(data(row, 1), "vs.") Then
       wkday = Weekday(Mid(data(row, 1), InStr(data(row, 1), data(row, YEAR_COL)), 10)) 'The last 10 characters are the date, get the weekday
   End If
   data(row, WEEKDAY_COL) = wkday 'repeat until next game
Next row
End Function

xxxi-Season (2002-2017)

The season column is equal to the Year column because U Sports football seasons do not span across calendar years. However, this column allows for better compatibility with other datasets from leagues that may have one season played in two calendar years. This function still accounts for leagues that stretch into January or February in order to create future compatibility.
Function SEASON_FN(data As Variant) 'Get the season of the game
For row = 2 To rowcount
   If InStr(data(row, 1), "vs.") Then
       If data(row, MONTH_COL) < 5 Then 'If the game is played in Jan, Feb, Mar, or Apr
           season = data(row, YEAR_COL) - 1 'Season is 1 less than Year
       Else
           season = data(row, YEAR_COL) 'Otherwise Season is Year
       End If
   End If
   data(row, SEASON_COL) = season 'Copy until next game
Next row
End Function

xxxii-FG Result (“MADE”, “MISSED”, “ROUGE”)

This column shows if a field goal attempt was successful (MADE), unsuccessful (MISSED), or unsuccessful leading to a rouge (ROUGE), so as to be able todevelop statistics regarding the probability of converting a field goal. This column is calculated in much the same way as the Scoring Play column, as successful field goal attempts include the phrase “GOOD,” failed attempts include “MISSED” or “BLOCKED” and rouges have “ROUGE.” This column only looks at plays where ODK is (FG), and so rouges on punts are not noted in this column. In American football the absence of the rouge would allow this column to be marked TRUE/FALSE as a boolean. Table 10 shows an example of a field goal play.
SKH
3-15
MAN37
D.Kolodzinski field goal attempt from 44 MISSED - short, spot at MAN37, clock 00:00.
Table 10 Example of field goal play (CWUAA 2013c)
Function FG_RESULT_FN(data As Variant, col As Integer) 'Find if a field goal was good or not
For row = 2 To rowcount
   If data(row, ODK_COL) = "FG" Then 'Only look at FG plays
       If InStr(data(row, col), "GOOD") Then 'Good FGs are labelled
           data(row, FG_RESULT_COL) = "MADE"
       ElseIf InStr(data(row, col), "ROUGE") Then 'Rouges are labelled
           data(row, FG_RESULT_COL) = "ROUGE"
       ElseIf InStr(data(row, col), "MISSED") Or InStr(data(row, col), "BLOCKED") Or InStr(data(row, col), "failed") Then 'To ways to mark a miss
           data(row, FG_RESULT_COL) = "MISSED"
       End If
   End If
Next row
End Function

xxxiii-Home/Away Timeout (0-2)

This column tracks how many timeouts each team has. The column is reset to 2 at the start of each half. When teams use a timeout a timeout statement appears as seen in Table 11, and the column is decremented according to which team used their timeout. On rare occasions, such as MAN @ UBC, 2013-09-14 (CWUAA 2013a) an officiating error allows a team to use three timeouts in a half. These have been fixed manually to remove the timeout awarded in error.
SKH
3-15
MAN37
Timeout MAN, clock 00:02.
Table 11 Example of timeout play (CWUAA 2013c)
Function HA_TO_FN(data As Variant, col As Integer) 'Show how many timeouts teams have left
For row = 2 To rowcount
   If InStr(data(row, 1), " vs. ") Then 'Game start resets TOs
       HTO = 2
       ATO = 2
   ElseIf data(row, col) = "3rd" Then '2nd half resets TOs
       HTO = 2
       ATO = 2
   ElseIf InStr(data(row, col), "Timeout") Then 'Timeout statements
       team = Mid(data(row, col), InStr(data(row, col), "Timeout") + 8, 3) 'The team taking the timeout comes after the word timeout
       If team = data(row, HOME_COL) Then 'decrement the team who called TO
           HTO = HTO - 1
       ElseIf team = data(row, AWAY_COL) Then
           ATO = ATO - 1
       Else
           data(row, 12) = "FIX-TO" 'Flag problems
       End If
   End If
   data(row, HOME_TO_COL) = HTO 'Recycle until new half or TO statement
   data(row, AWAY_TO_COL) = ATO
Next row
End Function

xxxiv-Offense/Defense Timeout (0-2)

Offense/Defense Timeout shows how many timeouts the offense and defense have remaining. This function checks to see if the row has the Offense equal to Home, and if so sets Offense Timeout equal to Home Timeout, with Defense Timeout equal to Away Timeout. If Offense is not equal to Home, then Offense Timeout is set equal to Away Timeout, and Defense Timeout is set equal to Home Timeout.
Function OD_TO_FN(data As Variant) 'Shows the timeouts for O and D
For row = 2 To rowcount
   If data(row, OFF_COL) = data(row, HOME_COL) Then 'If home is offense
       data(row, OFF_TO_COL) = data(row, HOME_TO_COL)
       data(row, DEF_TO_COL) = data(row, AWAY_TO_COL)
   Else
       data(row, OFF_TO_COL) = data(row, AWAY_TO_COL)
       data(row, DEF_TO_COL) = data(row, HOME_TO_COL)
   End If
Next row
End Function

xxxv-Series (0-99)

This column gives the offensive series of the play, the nth sequence of uninterrupted offensive plays for either team. A series is considered to have ended by any change of possession or kicking play. Ergo, if a team punts but a penalty gives them a first down and they continue on offense it is considered a new series, as the defense completed their task and ended the offensive drive, while the defensive special teams unit gave the ball back to the offense.
Function SER_COUNT_FN(data As Variant, col As Integer) 'Determine what series
For row = 2 To rowcount
   If InStr(data(row, 1), " vs. ") Then 'Find game statements
       For templong = row + 1 To row + 500 'Looking for the end of the game
           If InStr(data(templong, col), "end of game") Then
               Exit For
           End If
       Next templong
       counter = 0 'Reset the series counter
       bool = False 'whether we’re on a drive or not
       For x = row To templong 'Scan through the game
           If bool Then
               If data(x, OFF_COL) <> data(x - 1, OFF_COL) Then 'If change of possession
                   If data(x, ODK_COL) = "OD" Then 'and an OD play
                       counter = counter + 1
                   End If
               ElseIf data(x, ODK_COL) = "FG" Then 'Turn off bool
                   bool = False
               ElseIf data(x, ODK_COL) = "P" Then
                   bool = False
               ElseIf data(x, ODK_COL) = "KO" Then
                   bool = False
               ElseIf data(x, col) = "3rd" Then
                   bool = False
               ElseIf InStr(data(x, col), "end of game") Then
                   bool = False
               ElseIf InStr(data(x, col), "OT") Then
                   bool = False
               End If
           Else
               If data(x, ODK_COL) = "OD" Then 'Once we’re back to a  drive
                   counter = counter + 1 'Increment the counter
                   bool = True
               End If
           End If
           If bool Then
               data(x, SER_COUNT_COL) = counter 'Output the counter
           End If
       Next x
       row = x - 1 'Jump to the next game
   End If
Next row
End Function

xxxvi-Home/Away Series (0-99)

This column separates a counter for the two teams by a similar means as the Series column. This column is useful because turnovers and penalties can lead to one team having significantly more  series than the other. Because the two functions are effectively identical, only the Home Series has been included.
Function H_SER_COUNT_FN(data As Variant, col As Integer) 'Find the series of the home team
For row = 2 To rowcount
   If InStr(data(row, 1), " vs. ") Then 'find game starts
       For templong = row + 1 To row + 500 'Look for the game ends
           If InStr(data(templong, col), "end of game") Then
               Exit For
           End If
       Next templong
       bool = False
       counter = 0 'Series counter reset to 0
       For x = row To templong
           If bool Then 'Looking for the ends of series
               If data(x, ODK_COL) = "P" Then 'Set bool to false, ending the series
                   bool = False
               ElseIf data(x, ODK_COL) = "FG" Then
                   bool = False
               ElseIf data(x, ODK_COL) = "FG" Then
                   bool = False
               ElseIf Not (data(x, OFF_COL) = data(x, HOME_COL)) Then
                   bool = False
               ElseIf InStr(data(x, col), "3rd") Then
                   bool = False
               ElseIf InStr(data(x, col), "OT") Then
                   bool = False
               ElseIf InStr(data(x, col), "end of game") Then
                   bool = False
               End If
           Else
               If data(x, ODK_COL) = "OD" Then 'Looking for the start of series
                   If data(x, OFF_COL) = data(x, HOME_COL) Then
                       bool = True
                       counter = counter + 1
                   End If
               End If
           End If
           If bool Then
               data(x, H_SER_COUNT_COL) = counter 'Print the counter
           End If
       Next x
       row = x - 1 'Jump to the next game
   End If
Next row
End Function

xxxvii-Offensive Series (0-99)

This column gives the series of the play for the offensive team, it only counts offensive series by the current offensive team and does not include series by the other team the way the Series column counts every series by either team. Thus, the Offensive Series column is usually about half the value of the Series column.
Function O_SER_COUNT_FN(data As Variant)
For row = 2 To rowcount
   If data(row, OFF_COL) = data(row, HOME_COL) Then 'Check if the home team is on offense
       data(row, O_SER_COUNT_COL) = data(row, H_SER_COUNT_COL) 'Copy the appropriate column
   Else
       data(row, O_SER_COUNT_COL) = data(row, A_SER_COUNT_COL)
   End If
Next row
End Function

4-Conclusion

Development of a fully parsed database of U Sports will allow for in-depth analysis of the nature of the game. The above functions form the basis for a relational database capable of investigating matters relating to P(1D), EP, WP, assorted kicking game indices. Additional features may be added on an as-needed basis pursuant to the needs of the study, and such code will be discussed at that time. The modular structure of the code permits such additions. Refactoring of the existing code for greater efficiency, while possible, is likely unnecessary. However, the existing code could also be applied, with some degree of alteration, to other datasets as well. A CFL dataset would be on the same order of magnitude, and would allow for direct comparison of the two leagues. In American football, there are NFL databases but there is not the same centralized source of NCAA data. Scraping of play-by-play data to create such a database would make for a very large database, perhaps two orders of magnitude larger across the multiple divisions. Such a database would then justify a refactoring of the code. For consistency’s sake it would be preferable then to make the same modifications to the U Sports code. Although the database is currently structured in a relational fashion, with each play acting as an independent record, it could easily be restructured as within an object-oriented paradigm, making better use of the games within the database.
The dataset is sufficiently large to allow for good precision as well as year-over-year comparison and hopefully provides the basis for a growing body of work researching Canadian football. What is needed next is the impetus to do said research, and for different groups to be able to compare their results. While the author hopes that the database will see wider use, he has plans to make use of the data himself in the immediate future.

5-References

No comments:

Post a Comment

Three Downs Away: P(1D) In U Sports Football

1-Abstract A data set of U Sports football play-by-play data was analyzed to determine the First Down Probability (P(1D)) of down & d...