1-Abstract
A discussion of the development of a fully parsed database of U Sports football play-by-play data, including the sources and structures of available play-by-play data, examples of common errors in the data and methods to rectify them, and complete discussion of the code, structured as individual functions, used to parse the data into discrete columns. The code used is included within the discussion of each function. The parsed data forms a relational database of plays that allow further research to be conducted at the individual play level. This database is believed to be the first of its kind for both U Sports and Canadian football in general.
2-Introduction
A truism, that any statistical analysis needs a dataset to be analyzed, is as applicable to sports analytics as any other topic. For football, a number of existing databases exist for both NFL (Yurko, Ventura, and Horowitz 2018) , as well as a publicly available scraping tool (Horowitz 2016). Analytics in American football, while not at the level of baseball, soccer, or basketball, are still relatively well-developed. Unfortunately that same academic spirit has yet to infiltrate the sport north of the border. Research into Canadian football is scarce, and there are no publicly available datasets to conduct any analysis. Ergo, any attempts to begin the study of the game, either for its own sake or in comparison to the American game, must begin with a quest to find data.
The fundamental unit of football data is the play. Attempts to perform analysis on game-level data are limited to, at best, meta-analysis and general trends. Ergo, a repository of accurate play-by-play data is prerequisite to any substantial work. Canadian football has two major leagues playing consistent, high-level football and tracking play-by-play data: the CFL and U Sports. The author’s choice to build a U Sports database stems first from the better availability of data, and the perception that U Sports is in some ways a “purer” form of Canadian football, owing to the strong American influence found in the CFL coaching ranks.
3-Data
U Sports play-by-play data was acquired from public sources and put into a Microsoft Excel spreadsheet. The choice of using Excel over more sophisticated tools was a pragmatic one. First, the actual quantity of data is not enormous when compared to the limits of Excel. Second, the ubiquity of Excel makes the database far more portable. Third, the calculations needed for this application are not overly sophisticated and fall well within the capabilities of Excel and VBA scripting. The use of other software would simply be overkill. In all, 1497 games exist in the dataset for a total of 247 850 plays, and a list of these games in chronological order is given in Appendix 1.
a-Sources
Play-by-play data for U Sports football is available in five locations. The main U Sports site(U Sports n.d.), formerly the CIS website, and the sites of each of the four conferences (CWUAA n.d.; OUA n.d.; RSEQ n.d.; AUS n.d.). The data exists in three different formats depending on the year and provenance. The first format, that of PrestoSports (PrestoSports n.d.), is what is currently available on the U Sports site as of 2009, as well as CanWest and OUA games in that time period. This is a two-column format, with the first column giving the down, distance, and field position and the play description in the second column. An example of the PrestoSports data format (hereafter referred to as Data Format 1) is given in Table 1.
2nd and 12 at MAN03
|
Anthony Coombs rush for 107 yards to the ALB0, TOUCHDOWN, 00:30.
|
Table 1 Example of Data Format 1 (U Sports 2013)
The second format (Data Format 2) is mostly found in CanWest games between 2006 and 2009; it is a four-column format with the team holding possession in the first column, down and distance in the second, field position in the third, and the play description in the fourth column.
MAN
|
1-10
|
MAN52
|
John Makie pass complete to Mike Mizerski for 58 yards to the SFU0, TOUCHDOWN, 13:21.
|
Table 2 Example of Data Format 2 (CWUAA 2008)
The third format is found in RSEQ, AUS, and CanWest games after 2002. This is a block-text format (The Automated Scorebook n.d.) that needs significant treatment to combine lines into one row per play and then separated into a two-column format similar to the first format. Data Format 3 is also seen in data for the 2017 RSEQ and AUS seasons because a reformatting of the U Sports site no longer makes visible the first column of Data Format 1, and Data Format 3 is the format used in the RSEQ and AUS conference websites. Ultimately the same information can be parsed from each format. An example of Data Format 3 is given in Table 3.
S 1-10 S45
|
J. Creighton pass complete to C. Hardwick for 65 yards to the WES0, TOUCHDOWN, clock 14:34.
|
Table 3 Example of Data Format 3 (Atlantic University Sport 2008)
Each game was imported into the database by simple copy and pasting of the data, as U Sports was unable to provide the data in a more practical format (Ridout 2017).
b-Cleanup
Although each of the three formats contains largely the same information, the difference in structure means that they must be treated separately. Because of the largely volunteer and informal nature of this play-by-play tracking, an extensive amount of data cleanup was needed. The following list details some of the more common problems that needed to be fixed.
i-American vs. Canadian Football
The software behind Data Format 1 is designed for American football, and thus 4 downs. As a result, the system does not automatically handle turnovers on downs in Canadian football. Instead it will show 4th down, and because play-by-play uses jersey numbers, will show the corresponding jersey numbers from the other team’s offensive players. Occasionally the new offense will lose enough yardage on 1st down to allow the nominal offense to gain a “1st down” on their 4th down attempt, meaning that four more plays will now be shown with the wrong team on offense, often losing strange amounts of yardage. This was rectified by identifying every instance of “4th down” and fixing the erroneous plays.
ii-Drive Start Statements
Data Formats 1 and 3 rely on “drive start” statements to identify which team has possession. Where these are missing there is no way for the script to know which team has possession, and only a careful reading would notice a change of possession on a play. This problem was rectified with a script to check that a drive start statement exists for the start of every half, and after every play with a punt, field goal, kickoff, interception, fumble, or 3rd down attempt, flagging rows that do not have such a statement, and then manually adding them where necessary. Table 4 shows an example of a drive start statement.
MAN drive start at 00:57.
|
Table 4 Example of drive start statement (U Sports 2014)
iii-Missing Score Statements
The score in the game is given by periodic rows giving the score in the game. These score statements should occur after every scoring play but are often missing. A script flagged all possible scoring plays that were not followed by a score statement, and the necessary score statements were added manually. A second script then checked to see if there were any jumps in score that did not correspond to a football scoring play, that is, any time both teams changed score, any decrease in score, or any change in score by an amount that is not 1, 2, 3, 6, 7, or 8. Table 5 gives an example of a score statement.
CGY 31, MAN 50
|
Table 5 Example of a score statement (U Sports 2014)
iv-Team Names
Team names and abbreviations in U Sports are not standardized, and so the abbreviations used in the play-by-play are entirely inconsistent. These variations range from banal variations on a theme, such as the University of Saskatchewan being abbreviated as either SKH, SASK, or SAS, including the season and sport, such as the University of Waterloo being abbreviated as WATMENFB12, the use of team mascots instead of school names, such as Concordia University being shown as STINGERS, and sometimes simply using HOME and AWAY. Ultimately it was decided to use 3-letter abbreviations of the school name, choosing the most common or unambiguous abbreviations. The use of find/replace and VBA scripts to fix some of the more common variations aided enormously, and a list of school’s and their abbreviations sorted by conference is provided in Table 6. These abbreviations are used universally throughout the dataset and are the only means by which teams are identified or referred to.
Canada West
Universities
Athletic
Association
|
CanWest
|
Ontario
University
Athletics
|
OUA
|
Réseau du
sport
étudiant
du Québec
|
RSEQ
|
Atlantic
University
Sport
|
AUS
|
University of
Manitoba
|
MAN
|
University of
Windsor
|
WIN
|
McGill
University
|
MCG
|
Acadia
University
|
ACA
|
University of
Saskatchewan
|
SKH
|
University of
Western Ontario
|
WES
|
Concordia
University
|
CON
|
Mount Allison
University
|
MTA
|
University of
Regina
|
REG
|
University of
Waterloo
|
WAT
|
Université de
Montreal
|
MON
|
St. Francis
Xavier
University
|
SFX
|
University of
Calgary
|
CGY
|
Wilfrid Laurier
University
|
WLU
|
Université de
Sherbrooke
|
SHE
|
Saint Mary’s
University
|
SMU
|
University of
Alberta
|
ALB
|
University of
Guelph
|
GUE
|
Université de
Laval
|
LAV
|
Bishop’s
University
(from 2017)
|
BIS
|
University of
British Columbia
|
UBC
|
McMaster
University
|
MAC
|
Bishop’s
University
(through 2016)
|
BIS
| ||
Simon Fraser
University
(through 2009)
|
SFU
|
University of
Toronto
|
TOR
| ||||
York
University
|
YRK
| ||||||
Queen’s
University
|
QUE
| ||||||
University of
Ottawa
|
OTT
| ||||||
Carleton
University
(from 2013)
|
CAR
|
Table 6 List of U Sports teams and abbreviations
c-Parsing
The play-by-play data was parsed using a script in Microsoft Excel’s VBA. The columns calculated are listed below in the order in which they are calculated, which is not the order in which they are displayed, because the display is designed for readability, but some columns are dependent on other columns to simplify the calculation and so are calculated in a different order.
The database is parsed column-wise, passing through each row in the set and calculating that column’s value, repeating the loop for each column. This is a less computationally efficient method than could be achieved working row-wise, but it was specifically chosen for it’s maintainability, each column can be treated in a unique function and bugs are not transmitted across columns, making problems easy to isolate. The longer runtime of the script is not a concern, as the entire database is processed in ~15-30 seconds. There are huge opportunities to optimize the runtime, but that is not the goal. Since computing power grows exponentially or perhaps slightly slower, and the dataset will grow linearly or perhaps slightly faster if more teams join the league, the runtime of the code will simply never be a problem. Most importantly, it must be easy to add features and columns to adapt to the changing scope of a research project, and the code as written allows for that by adding a new function to the list of executed functions. If repeated iterations of the program are executed and runtime somehow becomes an issue then any columns that do not have downstream dependencies can simply be commented out. A quick reading of the code will highlight the most computationally-intensive functions that should be the first to comment out in this case.
i-Down (0, 1, 2, 3)
This column shows the down state for each play. Kickoffs and PAT attempts were shown as having down 0 as a placeholder. The down column can be directly parsed from the play-by-play. In Data Format 1 a search for the term “0th,” “1st,” “2nd,” or “3rd” defined the down. In Data format 2 the first character of column 2 is the down, and in Data Format 3 the down is the character immediately preceding the first hyphen. Because the structure of the data formats differs a single function could not be used across all three data formats. Instead only the code for Data Format 1 is shown as an example.
'DOWN - determine the down of the play
For row = 2 To rowcount
If Len(DataOne(row, 1)) > 3 Then 'if the length of the D&D field is 3 then it's a qtr statement
If Left(DataOne(row, 1), 3) = "1st" Then 'going through these in the order of probability for efficiency
DataOne(row, DOWN_COL) = 1
ElseIf Left(DataOne(row, 1), 3) = "2nd" Then
DataOne(row, DOWN_COL) = 2
ElseIf Left(DataOne(row, 1), 3) = "3rd" Then
DataOne(row, DOWN_COL) = 3
ElseIf Left(DataOne(row, 1), 3) = "0th" Then
DataOne(row, DOWN_COL) = 0
End If
End If
Next row
For row = 2 To rowcount
If Len(DataOne(row, 1)) > 3 Then 'if the length of the D&D field is 3 then it's a qtr statement
If Left(DataOne(row, 1), 3) = "1st" Then 'going through these in the order of probability for efficiency
DataOne(row, DOWN_COL) = 1
ElseIf Left(DataOne(row, 1), 3) = "2nd" Then
DataOne(row, DOWN_COL) = 2
ElseIf Left(DataOne(row, 1), 3) = "3rd" Then
DataOne(row, DOWN_COL) = 3
ElseIf Left(DataOne(row, 1), 3) = "0th" Then
DataOne(row, DOWN_COL) = 0
End If
End If
Next row
ii-Distance (1-109)
This is the distance to gain on the play. Kickoffs have a distance of 10 and PAT attempts are considered & Goal plays and thus have a distance equal to the field position. The distance column can be directly parsed from the play-by-play. In Data format 1 the number after the word “and” in column 1 is the distance. In Data Format 2 the number after the hyphen in column 2 is the distance. In Data Format 3 the number after the hyphen in column 1 is the distance. As with Down, Distance requires a bespoke function for each data format, and the example below is used for Data Format 1.
'DISTANCE - determine the distance to gain of the play
For row = 2 To rowcount
If InStr(DataOne(row, 1), " and ") Then 'check for an "and" in D&D to show we have a real D&D statement
DataOne(row, DIST_COL) = CInt(Trim(Mid(DataOne(row, 1), 9, 2))) 'grab the distance, trim out extra spaces to account for 1 or 2-digit numbers, and force conversion to integer to throw an error if something is awry.
End If
Next row
iii-Offense (Any team abbreviation)
This identifies which team is currently on offense. For kicking plays the kicking team is defined as the offense. Data Format 1 uses “drive start” statements, and example of which is shown in Table 4. Data Format 3 uses the same type of statements to identify possession. The team with possession remains the same until a new possession statement causes it to change. Data Format 2 shows possession directly in column 1.
Function OFF_FN(data As Variant)'Determine the offensive team
For row = 2 To rowcount
If InStr(data(row, 2), "drive start") Then 'Identify drive start statements
OFF = Left(data(row, 2), 3) 'Set the offense to the team stated
End If
data(row, OFF_COL) = OFF 'Recycle the offensive team until there's a reason to switch
Next row
End Function
iv-Home/Away (Any team abbreviation)
Home and Away are two columns calculated in the same function. Game start statements exist before each game in the same style across all data formats, an example of which is given below in Table 7. The Home and Away teams remain the same until a new game start statement causes it to change.
MAN vs. SMU (2007-11-23 at Toronto, ON)
|
Table 7 Example of game statement (CWUAA 2007)
Function HOME_AWAY_FN(data As Variant)'Find the home and away teams
For row = 2 To rowcount
If InStr(data(row, 1), " vs. ") Then 'this is only found on game statements
awayteam = Left(data(row, 1), 3) 'away team is named first
hometeam = Mid(data(row, 1), 9, 3) 'home team is named second
End If
data(row, AWAY_COL) = awayteam 'recycle until there's a new game
data(row, HOME_COL) = hometeam
Next row
End Function
v-Defense (Any team abbreviation)
Having determined the offensive team as well as the home and away teams, the defensive team is simply the one that is not the offense.
Function DEF_FN(data As Variant)'Find the defensive team
For row = 2 To rowcount
If data(row, HOME_COL) = data(row, 28) Then 'If the home team has the ball the away team is the defense
data(row, DEF_COL) = data(row, AWAY_COL)
Else
data(row, DEF_COL) = data(row, HOME_COL) 'or vice-versa
End If
Next row
End Function
vi-Field Position (1-54, -1--54)
The field position of the current play, using +/- notation where the field position is negative when the ball is on the offensive or kicking team’s side of midfield, and positive when the ball is in the defensive team’s end. For Data Format 1 field position is calculated in two steps. First by using the number that is the last two characters of column 1 to determine the absolute value of field position. Then comparing whether the three characters preceding that match the offensive team. If so then the ball is on the offensive side of midfield and the sign of the field position is made negative. In Data Format 2 much the same process is followed, though field position is given in column 3. In Data Format 3 the first character of column 1 is compared to the single character preceding the last two digits. This is because there is a single character code used to identify the two teams playing in this format. Unfortunately the coding was so inconsistent and ambiguous it could not be used as a play-by-play marker of possession. This column cannot be calculated with the same function for all data formats, so the code shown is an example for Data Format 1.
'FPOS - This loop determines the field position and needs to go AFTER we determine possession
For row = 2 To rowcount
If InStr(DataOne(row, 1), " and ") Then 'only D&D statements would have "and" in them
If Mid(DataOne(row, 1), Len(DataOne(row, 1)) - 4, 3) = DataOne(row, OFF_COL) Then 'check for whether we are in + or - territory
DataOne(row, FPOS_COL) = Right(DataOne(row, 1), 2) * -1 'if in - territory we invert the sign of the field position
Else
DataOne(row, FPOS_COL) = Right(DataOne(row, 1), 2)
End If
If DataOne(row, FPOS_COL) = "-55" Then 'all plays on the 55 are marked positive
DataOne(row, FPOS_COL) = 55
End If
End If
Next row
vii-Quarter (1-5)
The quarter in which the play occurs, overtime is shown as (5), regardless of whether there are multiple overtimes. Quarter start statements are included in the play-by-play data as a row with just “2nd,” “3rd,” “4th,” or “OT,” an example of which is shown in Table 8. New game statements reset the quarter to (1).
2nd
|
Table 8 Example of quarter start statement (U Sports 2014)
Function QUARTER_FN(data As Variant)'determine what quarter we're in
For row = 2 To rowcount
If InStr(data(row, 1), "vs.") Then 'New games mean 1st quarter
qtr = 1
ElseIf data(row, 1) = "2nd" Then 'identify quarter statements, this is the entire statement
qtr = 2
ElseIf data(row, 1) = "3rd" Then
qtr = 3
ElseIf data(row, 1) = "4th" Then
qtr = 4
ElseIf data(row, 1) = "OT" Then 'Overtime is considered the 5th quarter
qtr = 5
End If
data(row, QTR_COL) = qtr 'stick with the quarter until there's a reason to change
Next row
End Function
viii-Yardline (0-109)
Yardline is an alternative means of showing field position. The standard notation of field position is not intuitive to computers as it goes from 0 to -54 to 55 to 0. Instead, we use Yardline in certain calculations to show distance from the goal line measured from 109-0. If Field Position is positive then Yardline is equal to Field Position. If Field Position is negative then Yardline is equal to Field Position +110.
Function YDLINE_FN(data As Variant) 'Determine the yardline from the field position
For row = 2 To rowcount
If IsEmpty(data(row, FPOS_COL)) = False Then 'Don’t bother looking at rows that don’t have a Field Position value
If data(row, FPOS_COL) > 0 Then
data(row, YARDLINE_COL) = data(row, FPOS_COL) 'Positive values stay positive
Else
data(row, YARDLINE_COL) = 110 + data(row, FPOS_COL) 'Negative values get converted
End If
End If
Next row
End Function
ix-Year (2002-2017)
This function returns the year in which the game was played. Since the range of seasons is small and known the function looks for those seasons as strings within the game statements.
Function YEAR_FN(data As Variant) 'Determine the year the game was played
For row = 2 To rowcount
If InStr(data(row, 1), " vs. ") Then 'Only looking in game statements
For x = 2002 To 2017 'Loop through all the years in our database
If InStr(data(row, 1), x) Then 'If the appropriate year is found
season = x
Exit For
End If
Next x
End If
data(row, YEAR_COL) = season 'Recycle until the next game statement
Next row
End Function
x-League (“U Sports”)
This column simply shows “U Sports” on every year. Although seemingly redundant, it allows for better compatibility with datasets of different leagues.
Function LEAGUE_FN(data As Variant, lge As String) 'Just put the league in every cell
For row = 2 To rowcount
data(row, LEAGUE_COL) = lge
Next row
End Function
xi-Run/Pass (R/P)
This identifies whether the play is a run or a pass. It is blank for most plays that are not “OD” plays. The run/pass column can be parsed directly from the play-by-play by looking in the play description for the phrase “rush” to identify running plays, and either “pass” or “sack” to identify passing plays.
Function RUN_PASS_FN(data As Variant, col As Integer) 'Identify run or pass
For row = 2 To rowcount
If InStr(data(row, col), "rush") Then 'The word rush tells us it’s a run
data(row, RP_COL) = "R"
ElseIf InStr(data(row, col), "pass") Then 'obviously a pass
data(row, RP_COL) = "P"
ElseIf InStr(data(row, col), "sacked") Then 'Sacks are pass plays
data(row, RP_COL) = "P"
End If
Next row
End Function
xii-Pass Result (C, I, X, S)
Pass Result is only calculated for plays identified as passing plays. By looking in the play description the script finds the phrases “incomplete” (I), “complete” (C), “intercept” (X), or “sack.” Unfortunately scrambles, which are usually denoted by (R) are not shown because the play-by-play simply notes them as rushing plays. Additionally, some sacks are marked as rushing plays as well, but it is not feasible to determine which rushing plays should be changed to sacks or scrambles.
Function PASS_RESULT_FN(data As Variant, col As Integer) 'Find the result of the pass
For row = 2 To rowcount
If data(row, RP_COL) = "P" Then 'only check pass plays
If InStr(data(row, col), "incomplete") Then 'incomplete has to come first because otherwise it would catch "complete" within “incomplete”
data(row, P_RSLT_COL) = "I"
ElseIf InStr(data(row, col), "complete") Then
data(row, P_RSLT_COL) = "C"
ElseIf InStr(data(row, col), "sacked") Then
data(row, P_RSLT_COL) = "S"
ElseIf InStr(data(row, col), "intercepted") Then
data(row, P_RSLT_COL) = "X"
End If
End If
Next row
End Function
xiii-ODK (“OD”, “KO”, “FG”, “P”, “PEN”)
This column identified the type of football play, either an offensive play from scrimmage (OD), a kickoff (KO), field goal (FG), punt (P), or penalty (PEN). The phrase “kickoff” appears in all data formats in the play description, as does the phrase “punt.” “Field goal” appears for all field goal attempts, and “kick attempt” for all PAT attempts, which are also coded as “FG.” “OD” plays are plays that remain that are noted as being runs or passes. Penalties are plays that do not fit into the above categories that are simply dead-ball penalties. Intentional safeties are considered punts as well. There is some small chance of a legitimate OD play being coded as a punt in this fashion but it is improbable.
Function ODK_FN(data As Variant, col As Integer) 'Determine the ODK of a play
For row = 2 To rowcount
If data(row, DOWN_COL) = 3 And InStr(data(row, col), "SAFETY") Then 'accounting for intentional safeties
data(row, ODK_COL) = "P"
ElseIf IsEmpty(data(row, RP_COL)) = False Then 'If it's a run or a pass then it's pretty straightforward.
data(row, ODK_COL) = "OD"
ElseIf InStr(data(row, col), "punt") Then 'punts are labelled
data(row, ODK_COL) = "P"
ElseIf InStr(data(row, col), "kickoff") Then 'kickoffs are labelled
data(row, ODK_COL) = "KO"
ElseIf InStr(data(row, col), "field goal") Or InStr(data(row, col), "kick attempt") Then 'field goals, "kick attempt" implies PAT
data(row, ODK_COL) = "FG"
ElseIf InStr(data(row, col), "PENALTY") Then 'if it's nothing else but there's a penalty we flag it as such
data(row, ODK_COL) = "PEN"
End If
Next row
End Function
xiv-Is Play (TRUE, FALSE)
this column identified plays as being TRUE, as opposed to rows that served administrative functions, such as the start of games, possession changes, score changes, or timeouts (FALSE). This also served as a dummy column when combining the different datasets into the final database to include only football plays. Any play with a value for ODK is marked as TRUE.
Function ISPLAY_FN(data As Variant) 'Determine if it’s a play or admin
For row = 2 To rowcount
If IsEmpty(data(row, ODK_COL)) = False Then 'If it has ODK it’s a play
data(row, ISPLAY_COL) = True
Else
data(row, ISPLAY_COL) = False
End If
Next row
End Function
xv-Clock (00:00-15:00)
Clock shows the time remaining on the game clock. This is only shown for plays where there is a clock statement associated, or for the first play after a clock statement occuring on a non-play. Clock statements are identified in the play description by the existence of a colon (:), and the clock time is the 2 characters on either side of the colon and the colon itself. An example of a play with a clock statement is shown in Table 9.
1st and 10 at MAN21
|
Jordan Yantz pass complete to Andrew Smith for 89 yards to the REG0, TOUCHDOWN, 00:43
|
Table 9 Example of clock statement (CWUAA 2013b)
Function CLOCK_FN(data As Variant, col As Integer) 'Identify clock statements
For row = 2 To rowcount
If InStr(data(row, col), ":") Then
If InStr(data(row, col), ":") < 30 Then 'if it comes early in the play description it means it applies to the current play
data(row, CLOCK_COL) = Mid(data(row, col), InStr(data(row, col), ":") - 2, 5)
ElseIf data(row, ISPLAY_COL) Then 'otherwise it's talking about the time on the clock at the END of the play
data(row + 1, CLOCK_COL) = Mid(data(row, col), InStr(data(row, col), ":") - 2, 5)
Else 'In this case it's for simple CLOCK statements with no play
data(row, CLOCK_COL) = Mid(data(row, col), InStr(data(row, col), ":") - 2, 5)
End If
ElseIf data(row, QTR_COL) = 5 Then 'time is set to 0 in overtime
data(row, CLOCK_COL) = "00:00"
End If
Next row
For row = 2 To rowcount 'this loop puts a clock statement at the beginnings of quarters and the ends of halves
If data(row, 1) = "1st" Or data(row, 1) = "2nd" Or data(row, 1) = "3rd" Or data(row, 1) = "4th" Or InStr(data(row, 1), " vs. ") Then
data(row, CLOCK_COL) = "15:00"
ElseIf InStr(data(row, col), "end of game") Then 'at the end of the game the clock is obv at 0
data(row, CLOCK_COL) = "00:00"
End If
Next row
For row = 2 To rowcount 'Here we account for clock statements on non-plays to help fill in some gaps and reduce errors in our averaging
If IsEmpty(data(row, CLOCK_COL)) = False Then 'if we have a clock statement
If data(row, ISPLAY_COL) = False Then 'and it's not applied to an actual play
If IsEmpty(data(row + 1, CLOCK_COL)) Then 'and the next row does not have a clock stamp
data(row + 1, CLOCK_COL) = data(row, CLOCK_COL) 'paste it into the next row as well
End If
End If
End If
Next row
For row = 2 To rowcount 'Puts clock statements into non-plays where the preceding play has one to fill more gaps
If IsEmpty(data(row, CLOCK_COL)) = False Then 'if we have a clock statement
If data(row + 1, ISPLAY_COL) = False Then 'and the next play is a non-play
If IsEmpty(data(row + 1, CLOCK_COL)) Then 'and the next play doesn't have a clock statement
data(row + 1, CLOCK_COL) = data(row, CLOCK_COL) 'copy over the previous timestamp
End If
End If
End If
Next row
End Function
xvi-Gain (-109-109)
This column gives the gain of the play from scrimmage for plays that have a value for Run/Pass. Plays that result in a loss are shown as a negative gain. Incomplete passes and interceptions are shown as a gain of 0. The phrase “no gain” is also a gain of 0. Thereafter the gain is given by the number preceding the first instance of the word “yards;” if the phrase “loss” precedes this then the sign is made negative to show the loss on the play.
Function GAIN_FN(data As Variant, col As Integer) 'Find the gain on the play
For row = 2 To rowcount
If data(row, ODK_COL) = "OD" Then
If data(row, P_RSLT_COL) = "I" Or data(row, P_RSLT_COL) = "X" Then 'incompletions and interceptions are 0-yard gains
data(row, GAIN_COL) = 0
ElseIf InStr(data(row, col), "no gain") Then 'no gain means 0 gain
data(row, GAIN_COL) = 0
ElseIf InStr(data(row, col), "loss of") Then 'loss has to come before gain because gains don't specifically say gain
data(row, GAIN_COL) = "-" + Trim(Mid(data(row, col), InStr(data(row, 2), "loss of") + 7, 3))
ElseIf data(row, DOWN_COL) = 0 Then 'handling extra points
If InStr(data(row, col), "good") Then
data(row, GAIN_COL) = data(row, DIST_COL)
ElseIf InStr(data(row, col), "failed") Then
data(row, GAIN_COL) = 0
End If
ElseIf InStr(data(row, col), "yard") = 0 Then 'if something goes screwy flag it
data(row, 12) = "FIX ME!GAIN"
Else
data(row, GAIN_COL) = Trim(Mid(data(row, col), InStr(data(row, col), "yard") - 3, 3)) 'and for regular gains
End If
End If
Next row
End Function
xvii-Home/Away Score (0-99)
This column determines the current score in the game from the use of score statements that appear periodically and after each scoring play. Score statements follow the format shown below in Table 5. The away score is the number preceding the comma, and the home score is the number at the end of the score statement.
Function HOME_AWAY_SCORE_FN(data As Variant, col As Integer) 'Find the home and away scores
For row = 2 To rowcount
If Len(data(row, col)) > 11 And Len(data(row, col)) < 15 Then 'score statements have a limited range of lengths
If Mid(data(row, col), 4, 1) = " " Then '4th char MUST be a space
If InStr(data(row, col), ", ") Then
If Left(data(row, col), 3) = data(row, AWAY_COL) Then
awayscore = Trim(Mid(data(row, col), InStr(data(row, col), ",") - 2, 2))
homescore = Trim(Right(data(row, col), 2))
End If
End If
End If
End If
If InStr(data(row, 1), "vs.") Then 'New game resets the score
awayscore = 0
homescore = 0
End If
data(row, AWAY_SCORE_COL) = awayscore 'Reuse the score until a change happens
data(row, HOME_SCORE_COL) = homescore
Next row
End Function
xviii-Off/Def Score (0-99)
Offense/Defense score is determined by checking to see whether the home team has possession or not, and then assigning the appropriate score from Home/Away score, and then assigning the other score to the defense.
Function OFF_DEF_SCORE_FN(data As Variant) 'Find the offensive and defensive score
For row = 2 To rowcount
If data(row, HOME_COL) = data(row, OFF_COL) Then 'If the home team has the ball
data(row, OFF_SCORE_COL) = data(row, HOME_SCORE_COL)
data(row, DEF_SCORE_COL) = data(row, AWAY_SCORE_COL)
Else 'If the away team has the ball
data(row, OFF_SCORE_COL) = data(row, AWAY_SCORE_COL)
data(row, DEF_SCORE_COL) = data(row, HOME_SCORE_COL)
End If
Next row
End Function
xix-Home Lead (0-99)
Home Lead is simply the score of the home team minus the score of the away team. If the lead is negative then the home team is trailing.
Function HOME_AWAY_LEAD_FN(data As Variant) 'Find the lead of the home team
For row = 2 To rowcount
data(row, HOME_LEAD_COL) = data(row, HOME_SCORE_COL) - data(row, AWAY_SCORE_COL) 'Subtract the away score from the home score
Next row
End Function
xx-Offense Lead (0-99)
Offense Lead is the score of the offensive team minus the score of the defensive team. If the lead is negative then the offense is trailing.
Function OFF_DEF_LEAD_FN(data As Variant) 'Find the lead of the offensive team
For row = 2 To rowcount
If data(row, HOME_SCORE_COL) = data(row, OFF_SCORE_COL) Then 'If the home team has the ball
data(row, OFF_LEAD_COL) = data(row, HOME_LEAD_COL) 'Offensive lead equals home lead
Else 'If the away team has the ball
data(row, OFF_LEAD_COL) = -1 * data(row, HOME_LEAD_COL) 'Offense lead is negative home lead
End If
Next row
End Function
xxi-Home Win (TRUE, FALSE)
Home Win shows whether the home team will win this game, in which case the result is TRUE. It is calculated by finding the game start statement, finding the next game start statement, and seeing if Home Lead is positive on the preceding line, then filling in every row of the game with the appropriate value.
Function HOME_WIN_FN(data As Variant, col As Integer) 'State if the home team wins
For row = 2 To rowcount
If InStr(data(row, 1), " vs. ") Then 'Look for game starts
For templong = row + 1 To row + 500 'Games donèt last more than 500 rows
If InStr(data(templong, col), "end of game") Then 'look for the end of the game
If data(row, HOME_LEAD_COL) > 0 Then 'If the home team leads
win = True
Else 'If the home team is trailing
win = False
End If
For x = row To templong
data(x, HOME_WIN_COL) = win 'Copy the win status for the whole game
Next x
row = templong 'jump to the end of the game to save wasted cycles
Exit For
End If
Next templong
End If
Next row
End Function
xxii-Offense Win (TRUE, FALSE)
Offense Win shows if the offense wins the game. It compares the offensive team to the home team. If they are equal then Offense Win is equal to Home Win. If not then Offense win is the opposite of Home Win.
Function OFF_WIN_FN(data As Variant) 'State if the offense wins
For row = 2 To rowcount
If data(row, HOME_COL) = data(row, OFF_COL) Then 'If the home team has the ball
data(row, OFF_WIN_COL) = data(row, HOME_WIN_COL) 'Offense win equals home win
Else
data(row, OFF_WIN_COL) = Not (data(row, HOME_WIN_COL)) 'Offense win is the opposite of home win
End If
Next row
End Function
xxiii-Time (0-3600)
Time gives the amount of time remaining in the game, shown in seconds. It is calculated by first assigning the known values of the start of each quarter, 3600 seconds at the start of the 1st quart, 2700 at the start of the 2nd quarter, 1800 at the start of the 3rd quarter, and 900 seconds at start of the 4th quarter. All overtime plays show 0 seconds. Any plays that have a value for Clock are then converted to seconds by taking the quarter, multiplying by 900, subtracting 900, adding the number of seconds in Clock (the last two digits) and adding 60 times the number of minutes in Clock (the first two digits). Then, for all the gaps the number of plays is counted and the gap between statements is calculated. The average gap is calculated and each play is given a Time value of the preceding play minus the average gap, a linear spline across the gap. Then all Times are rounded to integer values.
Function TIME_FN(data As Variant) 'Determines the time from the clock statements
For row = 2 To rowcount
If Not (IsEmpty(data(row, CLOCK_COL))) Then 'just convert all the clock cells to times
data(row, TIME_COL) = 3600 - 900 * data(row, QTR_COL) + 60 * Left(data(row, CLOCK_COL), 2) + Right(data(row, CLOCK_COL), 2) 'Each quarter is 900 seconds, plus 60 * the minutes plus the seconds
End If
Next row
For row = 2 To rowcount 'This loop splines the gaps
If IsEmpty(data(row, TIME_COL)) Then 'if we find the start of a gap
counter = 0 'couns how many isplays are in the gap
For templong = row To row + 500 'This is longer than any gap could be
If IsEmpty(data(templong, TIME_COL)) = False Then 'We've found a timestamp to bookend our gap
Gap = data(row - 1, TIME_COL) - data(templong, TIME_COL) 'now we've defined our gap between the two known timestamps
Exit For
ElseIf data(templong, ISPLAY_COL) Then 'if we find an actual play with no timestamp we count it
counter = counter + 1
End If
Next templong
For x = row To templong - 1 'now we pass over the same gap
If data(x, ISPLAY_COL) = False Then
data(x, TIME_COL) = data(x - 1, TIME_COL) 'non-plays just get the same as what's above them
Else
data(x, TIME_COL) = data(x - 1, TIME_COL) - (Gap / (counter + 1)) 'plays get the previous time minus the average (+1 because there's one more gap than there are plays in the gap)
End If
Next x
row = templong 'Jump to the end of the gap to save time
End If
Next row
On Error Resume Next
For row = 2 To rowcount 'this loop rounds all the estimated times we calculated in the previous loop
data(row, TIME_COL) = Round(data(row, TIME_COL), 0)
Next row
End Function
xxiv-Scoring Play (“O-TD”, “O-FG”, “O-SAFETY”, “O-ROUGE”, “D-SAFETY”, “D-TD”)
This column identifies plays that are scoring plays according to who scored and what type of scoring play. Touchdowns have the word “TOUCHDOWN” in the play description. On OD plays these are offensive touchdowns unless the play is an interception or the play description has the word “FUMBLE.” If the play is a kickoff, a punt, or a field goal attempt it is a defensive touchdown. Field goals are obviously always offensive, and are identified by the phrase “GOOD” in the play description. Rouges are also always to the offense, and have the phrase “ROUGE” in the play description. Safeties are credited to the team conceding the safety, and so usually belong to the offense. On kicking plays behind the -40 yard line these are credited as offensive safeties, and beyond that are defensive safeties, as it is rather rare to concede a 45-yard loss for a safety, or to have a 70-yard punt.
Function SCORING_PLAY_FN(data As Variant, col As Integer) 'Identifies scoring plays
For row = 2 To rowcount
If InStr(data(row, col), "ROUGE") Then 'ROUGE is easy, defense will never score a rouge
data(row, SCORING_PLAY_COL) = "O-ROUGE"
ElseIf data(row, ODK_COL) = "FG" And InStr(data(row, col), "GOOD") Then 'defence also never scores a FG
data(row, SCORING_PLAY_COL) = "O-FG"
ElseIf InStr(data(row, col), "TOUCHDOWN") Then 'touchdowns are tricky, they can be scored by either team
For templong = row To row + 50 'Look for the PAT
If InStr(data(templong, col), "attempt") Then 'Check which team did the PAT to tell us which team scored
If data(row, OFF_COL) = data(templong, OFF_COL) Then
data(row, SCORING_PLAY_COL) = "O-TD"
Else
data(row, SCORING_PLAY_COL) = "D-TD"
End If
row = templong
Exit For
ElseIf InStr(data(templong, col), "end of game") Then 'If the game ends on a TD meaning no PAT
If data(templong, OFF_SCORE_COL) > data(row, OFF_SCORE_COL) Then 'Check to see who’s score increased
data(row, SCORING_PLAY_COL) = "O-TD"
Else
data(row, SCORING_PLAY_COL) = "D-TD"
End If
row = templong
Exit For
End If
Next templong
ElseIf InStr(data(row, col), "SAFETY") Then 'Safeties are usually on the offense but sometimes are the result of bad returns
If data(row, YARDLINE_COL) < 65 Then
data(row, SCORING_PLAY_COL) = "D-SAFETY" 'it's really hard to give up a safety from beyond your own 45
Else
data(row, SCORING_PLAY_COL) = "O SAFETY"
End If
End If
Next row
End Function
xxv-P(1D) Input (TRUE, FALSE)
This column identifies whether the offense successfully gains a first down within this drive, in order to calculate First Down Probability (P(1D)). If a 1st & 10, 1st & Goal, or TD occur before a change of possession, punt, field goal attempt, or end of half the column is scored as TRUE.
Function P1D_INPUT_FN(data As Variant) 'Determine the input for P(1D)
For row = 2 To rowcount
If data(row, ODK_COL) = "OD" Then 'Only looking at OD plays
If data(row, SCORING_PLAY_COL) = "O-TD" Then 'TD plays obviously success
data(row, P1D_INPUT_COL) = True
ElseIf data(row, SCORING_PLAY_COL) = "D-TD" Then 'Defensive TD obviously fail
data(row, P1D_INPUT_COL) = False
ElseIf data(row, SCORING_PLAY_COL) = "O-SAFETY" Then 'Safety obviously fail
data(row, P1D_INPUT_COL) = False
Else
For templong = row + 1 To row + 100
If data(templong, 28) <> data(row, 28) Then 'if loss of possession then p(1D) is failed
poned = False
Exit For
ElseIf data(templong, DOWN_COL) = 1 And data(templong, DIST_COL) = 10 Then 'if 1st & 10 then P(1D) is success
poned = True
Exit For
ElseIf data(templong, DOWN_COL) = 1 And data(templong, DIST_COL) = data(templong, FPOS_COL) Then 'if 1st & Goal then P(1D) is success
poned = True
Exit For
ElseIf data(templong, SCORING_PLAY_COL) = "O-TD" Then 'if score a touchdown then P(1D) is success
poned = True
Exit For
ElseIf data(templong, SCORING_PLAY_COL) = "D-TD" Then 'if D score a touchdown then P(1D) is fail
poned = False
Exit For
ElseIf data(templong, SCORING_PLAY_COL) = "O-SAFETY" Then 'Giving up safeties is a fail
poned = False
Exit For
End If
Next templong
For x = row To templong - 1
data(x, P1D_INPUT_COL) = poned 'Fill for the whole sequence
Next x
row = templong - 1 'Jump ahead
End If
End If
Next row
End Function
xxvi-EP Input (“O-TD”, “O-FG”, “O-SAFETY”, “O-ROUGE”, “HALF”, “D-SAFETY”, “D-TD”)
This column determines what the next scoring play is to be able to calculate Expected Points (EP). It looks for the next scoring play in the Scoring Play column, as well as for the end of half or end of game. Once it finds the next scoring play it checks if the offense on the next scoring play is the same as current offense. If not then it flips the offense/defense of the scoring play.
Function EP_INPUT_FN(data As Variant, col As Integer) 'Determine the input for EP
For row = 2 To rowcount
For templong = row To row + 500 'Looking forward
If IsEmpty(data(templong, SCORING_PLAY_COL)) = False Then 'Find a scoring play
If Left(data(templong, SCORING_PLAY_COL), 1) = "O" Then 'Determine who scored
scoreteam = data(templong, 28)
SCORE = Right(data(templong, SCORING_PLAY_COL), Len(data(templong, SCORING_PLAY_COL)) - 2)
Exit For
ElseIf Left(data(templong, SCORING_PLAY_COL), 1) = "D" Then
scoreteam = data(templong, DEF_COL)
SCORE = Right(data(templong, SCORING_PLAY_COL), Len(data(templong, SCORING_PLAY_COL)) - 2)
Exit For
End If
ElseIf InStr(data(templong, col), "3rd") Or InStr(data(templong, col), "end of game") Then 'Check for end of half
scoreteam = data(templong, 28)
SCORE = "HALF"
Exit For
End If
Next templong
For x = row To templong 'Fill in from now to the scoring play
If data(x, 28) = scoreteam Then 'Alternate O/D score based on who has the ball now and who ends up scoring
data(x, EP_INPUT_COL) = "O-" & SCORE
Else
data(x, EP_INPUT_COL) = "D-" & SCORE
End If
Next x
row = templong
Next row
End Function
xxvii-Conference (“CanWest”, “OUA”, “RSEQ”, “AUS”, “NONCON”)
This column shows in what conference the game is played. By using four lists each listing all the teams in a given conference, it checks the two teams in the game. If both teams are in the same list then this game is in that conference. If the two teams are in different lists the game is listed as a non-conference game (NONCON). Because of Bishop’s University moving from the RSEQ to the AUS for the 2017 season the data for that season and future seasons has to be treated with a different set of lists.
Function CONFERENCE_FN(data As Variant, AUS() As String, RSEQ() As String, OUA() As String, CANWEST() As String) 'Determine in what conference the games are played
For row = 2 To rowcount
If InStr(data(row, 1), " vs. ") Then 'only looking at game statements
If UBound(Filter(AUS, data(row, AWAY_COL))) > -1 And (UBound(Filter(AUS, data(row, HOME_COL))) > -1) Then 'Check if both teams are in the same conference
conf = "AUS"
ElseIf UBound(Filter(RSEQ, data(row, AWAY_COL))) > -1 And (UBound(Filter(RSEQ, data(row, HOME_COL))) > -1) Then
conf = "RSEQ"
ElseIf UBound(Filter(OUA, data(row, AWAY_COL))) > -1 And (UBound(Filter(OUA, data(row, HOME_COL))) > -1) Then
conf = "OUA"
ElseIf UBound(Filter(CANWEST, data(row, AWAY_COL))) > -1 And (UBound(Filter(CANWEST, data(row, HOME_COL))) > -1) Then
conf = "CANWEST"
Else
conf = "NONCON" 'If they’re not in the same list it’s a nonconference game
End If
End If
data(row, CONFERENCE_COL) = conf 'copy through until next game statement
Next row
End Function
xxviii-Month (8-11)
This column shows the month in which the game was played, which is found in the game statements by using the 10-character sequence at the end of the game statement that matches the date format YYYY-MM-DD, pursuant to ISO 8601 (Munroe 2013).
Function MONTH_FN(data As Variant) 'Find the month the game was played
For row = 2 To rowcount
If InStr(data(row, 1), "vs.") Then 'look at game statements
MO = MONTH(Mid(data(row, 1), InStr(data(row, 1), data(row, YEAR_COL)), 10)) 'The last 10 characters are the game date, grab the month
End If
data(row, MONTH_COL) = MO 'Reuse until next game
Next row
End Function
xxix-Day (1-31)
This column gives the day of the game’s date by the same method as the Month and Year columns.
Function DAY_FN(data As Variant) 'Get the day the game was play
For row = 2 To rowcount
If InStr(data(row, 1), "vs.") Then
DAE = DAY(Mid(data(row, 1), InStr(data(row, 1), data(row, YEAR_COL)), 10))'The last 10 characters are the date, grab the day
End If
data(row, DAY_COL) = DAE 'Repeat until next game
Next row
End Function
xxx-Weekday (1-7)
This column gives the day of the week of the game using the same method as the Year, Month, and Day columns.
Function WEEKDAY_FN(data As Variant) 'Get the day of the week of the game
For row = 2 To rowcount
If InStr(data(row, 1), "vs.") Then
wkday = Weekday(Mid(data(row, 1), InStr(data(row, 1), data(row, YEAR_COL)), 10)) 'The last 10 characters are the date, get the weekday
End If
data(row, WEEKDAY_COL) = wkday 'repeat until next game
Next row
End Function
xxxi-Season (2002-2017)
The season column is equal to the Year column because U Sports football seasons do not span across calendar years. However, this column allows for better compatibility with other datasets from leagues that may have one season played in two calendar years. This function still accounts for leagues that stretch into January or February in order to create future compatibility.
Function SEASON_FN(data As Variant) 'Get the season of the game
For row = 2 To rowcount
If InStr(data(row, 1), "vs.") Then
If data(row, MONTH_COL) < 5 Then 'If the game is played in Jan, Feb, Mar, or Apr
season = data(row, YEAR_COL) - 1 'Season is 1 less than Year
Else
season = data(row, YEAR_COL) 'Otherwise Season is Year
End If
End If
data(row, SEASON_COL) = season 'Copy until next game
Next row
End Function
xxxii-FG Result (“MADE”, “MISSED”, “ROUGE”)
This column shows if a field goal attempt was successful (MADE), unsuccessful (MISSED), or unsuccessful leading to a rouge (ROUGE), so as to be able todevelop statistics regarding the probability of converting a field goal. This column is calculated in much the same way as the Scoring Play column, as successful field goal attempts include the phrase “GOOD,” failed attempts include “MISSED” or “BLOCKED” and rouges have “ROUGE.” This column only looks at plays where ODK is (FG), and so rouges on punts are not noted in this column. In American football the absence of the rouge would allow this column to be marked TRUE/FALSE as a boolean. Table 10 shows an example of a field goal play.
SKH
|
3-15
|
MAN37
|
D.Kolodzinski field goal attempt from 44 MISSED - short, spot at MAN37, clock 00:00.
|
Table 10 Example of field goal play (CWUAA 2013c)
Function FG_RESULT_FN(data As Variant, col As Integer) 'Find if a field goal was good or not
For row = 2 To rowcount
If data(row, ODK_COL) = "FG" Then 'Only look at FG plays
If InStr(data(row, col), "GOOD") Then 'Good FGs are labelled
data(row, FG_RESULT_COL) = "MADE"
ElseIf InStr(data(row, col), "ROUGE") Then 'Rouges are labelled
data(row, FG_RESULT_COL) = "ROUGE"
ElseIf InStr(data(row, col), "MISSED") Or InStr(data(row, col), "BLOCKED") Or InStr(data(row, col), "failed") Then 'To ways to mark a miss
data(row, FG_RESULT_COL) = "MISSED"
End If
End If
Next row
End Function
xxxiii-Home/Away Timeout (0-2)
This column tracks how many timeouts each team has. The column is reset to 2 at the start of each half. When teams use a timeout a timeout statement appears as seen in Table 11, and the column is decremented according to which team used their timeout. On rare occasions, such as MAN @ UBC, 2013-09-14 (CWUAA 2013a) an officiating error allows a team to use three timeouts in a half. These have been fixed manually to remove the timeout awarded in error.
SKH
|
3-15
|
MAN37
|
Timeout MAN, clock 00:02.
|
Table 11 Example of timeout play (CWUAA 2013c)
Function HA_TO_FN(data As Variant, col As Integer) 'Show how many timeouts teams have left
For row = 2 To rowcount
If InStr(data(row, 1), " vs. ") Then 'Game start resets TOs
HTO = 2
ATO = 2
ElseIf data(row, col) = "3rd" Then '2nd half resets TOs
HTO = 2
ATO = 2
ElseIf InStr(data(row, col), "Timeout") Then 'Timeout statements
team = Mid(data(row, col), InStr(data(row, col), "Timeout") + 8, 3) 'The team taking the timeout comes after the word timeout
If team = data(row, HOME_COL) Then 'decrement the team who called TO
HTO = HTO - 1
ElseIf team = data(row, AWAY_COL) Then
ATO = ATO - 1
Else
data(row, 12) = "FIX-TO" 'Flag problems
End If
End If
data(row, HOME_TO_COL) = HTO 'Recycle until new half or TO statement
data(row, AWAY_TO_COL) = ATO
Next row
End Function
xxxiv-Offense/Defense Timeout (0-2)
Offense/Defense Timeout shows how many timeouts the offense and defense have remaining. This function checks to see if the row has the Offense equal to Home, and if so sets Offense Timeout equal to Home Timeout, with Defense Timeout equal to Away Timeout. If Offense is not equal to Home, then Offense Timeout is set equal to Away Timeout, and Defense Timeout is set equal to Home Timeout.
Function OD_TO_FN(data As Variant) 'Shows the timeouts for O and D
For row = 2 To rowcount
If data(row, OFF_COL) = data(row, HOME_COL) Then 'If home is offense
data(row, OFF_TO_COL) = data(row, HOME_TO_COL)
data(row, DEF_TO_COL) = data(row, AWAY_TO_COL)
Else
data(row, OFF_TO_COL) = data(row, AWAY_TO_COL)
data(row, DEF_TO_COL) = data(row, HOME_TO_COL)
End If
Next row
End Function
xxxv-Series (0-99)
This column gives the offensive series of the play, the nth sequence of uninterrupted offensive plays for either team. A series is considered to have ended by any change of possession or kicking play. Ergo, if a team punts but a penalty gives them a first down and they continue on offense it is considered a new series, as the defense completed their task and ended the offensive drive, while the defensive special teams unit gave the ball back to the offense.
Function SER_COUNT_FN(data As Variant, col As Integer) 'Determine what series
For row = 2 To rowcount
If InStr(data(row, 1), " vs. ") Then 'Find game statements
For templong = row + 1 To row + 500 'Looking for the end of the game
If InStr(data(templong, col), "end of game") Then
Exit For
End If
Next templong
counter = 0 'Reset the series counter
bool = False 'whether we’re on a drive or not
For x = row To templong 'Scan through the game
If bool Then
If data(x, OFF_COL) <> data(x - 1, OFF_COL) Then 'If change of possession
If data(x, ODK_COL) = "OD" Then 'and an OD play
counter = counter + 1
End If
ElseIf data(x, ODK_COL) = "FG" Then 'Turn off bool
bool = False
ElseIf data(x, ODK_COL) = "P" Then
bool = False
ElseIf data(x, ODK_COL) = "KO" Then
bool = False
ElseIf data(x, col) = "3rd" Then
bool = False
ElseIf InStr(data(x, col), "end of game") Then
bool = False
ElseIf InStr(data(x, col), "OT") Then
bool = False
End If
Else
If data(x, ODK_COL) = "OD" Then 'Once we’re back to a drive
counter = counter + 1 'Increment the counter
bool = True
End If
End If
If bool Then
data(x, SER_COUNT_COL) = counter 'Output the counter
End If
Next x
row = x - 1 'Jump to the next game
End If
Next row
End Function
xxxvi-Home/Away Series (0-99)
This column separates a counter for the two teams by a similar means as the Series column. This column is useful because turnovers and penalties can lead to one team having significantly more series than the other. Because the two functions are effectively identical, only the Home Series has been included.
Function H_SER_COUNT_FN(data As Variant, col As Integer) 'Find the series of the home team
For row = 2 To rowcount
If InStr(data(row, 1), " vs. ") Then 'find game starts
For templong = row + 1 To row + 500 'Look for the game ends
If InStr(data(templong, col), "end of game") Then
Exit For
End If
Next templong
bool = False
counter = 0 'Series counter reset to 0
For x = row To templong
If bool Then 'Looking for the ends of series
If data(x, ODK_COL) = "P" Then 'Set bool to false, ending the series
bool = False
ElseIf data(x, ODK_COL) = "FG" Then
bool = False
ElseIf data(x, ODK_COL) = "FG" Then
bool = False
ElseIf Not (data(x, OFF_COL) = data(x, HOME_COL)) Then
bool = False
ElseIf InStr(data(x, col), "3rd") Then
bool = False
ElseIf InStr(data(x, col), "OT") Then
bool = False
ElseIf InStr(data(x, col), "end of game") Then
bool = False
End If
Else
If data(x, ODK_COL) = "OD" Then 'Looking for the start of series
If data(x, OFF_COL) = data(x, HOME_COL) Then
bool = True
counter = counter + 1
End If
End If
End If
If bool Then
data(x, H_SER_COUNT_COL) = counter 'Print the counter
End If
Next x
row = x - 1 'Jump to the next game
End If
Next row
End Function
xxxvii-Offensive Series (0-99)
This column gives the series of the play for the offensive team, it only counts offensive series by the current offensive team and does not include series by the other team the way the Series column counts every series by either team. Thus, the Offensive Series column is usually about half the value of the Series column.
Function O_SER_COUNT_FN(data As Variant)
For row = 2 To rowcount
If data(row, OFF_COL) = data(row, HOME_COL) Then 'Check if the home team is on offense
data(row, O_SER_COUNT_COL) = data(row, H_SER_COUNT_COL) 'Copy the appropriate column
Else
data(row, O_SER_COUNT_COL) = data(row, A_SER_COUNT_COL)
End If
Next row
End Function
4-Conclusion
Development of a fully parsed database of U Sports will allow for in-depth analysis of the nature of the game. The above functions form the basis for a relational database capable of investigating matters relating to P(1D), EP, WP, assorted kicking game indices. Additional features may be added on an as-needed basis pursuant to the needs of the study, and such code will be discussed at that time. The modular structure of the code permits such additions. Refactoring of the existing code for greater efficiency, while possible, is likely unnecessary. However, the existing code could also be applied, with some degree of alteration, to other datasets as well. A CFL dataset would be on the same order of magnitude, and would allow for direct comparison of the two leagues. In American football, there are NFL databases but there is not the same centralized source of NCAA data. Scraping of play-by-play data to create such a database would make for a very large database, perhaps two orders of magnitude larger across the multiple divisions. Such a database would then justify a refactoring of the code. For consistency’s sake it would be preferable then to make the same modifications to the U Sports code. Although the database is currently structured in a relational fashion, with each play acting as an independent record, it could easily be restructured as within an object-oriented paradigm, making better use of the games within the database.
The dataset is sufficiently large to allow for good precision as well as year-over-year comparison and hopefully provides the basis for a growing body of work researching Canadian football. What is needed next is the impetus to do said research, and for different groups to be able to compare their results. While the author hopes that the database will see wider use, he has plans to make use of the data himself in the immediate future.
5-References
Atlantic University Sport. 2008. “Saint Mary’s vs WESTERN (Nov 16, 2008).” Atlantic University Sport (AUS). November 16, 2008. http://www.atlanticuniversitysport.com/sports/fball/statistics/2008/MITCHELL.HTM.
AUS. n.d. “AUS Home Page.” Atlantic University Sport. Accessed June 10, 2018. http://www.atlanticuniversitysport.com/landing/index.
CWUAA. 2007. “Manitoba vs Saint Mary’s Huskies (Nov 23, 2007).” Canada West Universities Athletic Association. November 23, 2007. https://www.canadawest.org/custompages/Football/Stats/2007-08/pn23mbsm.htm.
———. 2008. “Manitoba vs Simon Fraser (Oct 03, 2008) - Canada West Universities Athletic Association.” Canada West Universities Athletic Association. October 3, 2008. https://www.canadawest.org/custompages/Football/Stats/2008-09/07mansfu.htm#GAME.PLY.
———. 2013a. “Manitoba vs. UBC - Box Score - 9/14/2013.” Canada West Universities Athletic Association. September 14, 2013. https://www.canadawest.org/sports/fball/2013-14/boxscores/20130914_9uzw.xml?view=plays.
———. 2013b. “Regina vs. Manitoba - Box Score - 9/28/2013.” Canada West Universities Athletic Association. September 28, 2013. https://www.canadawest.org/sports/fball/2013-14/boxscores/20130928_4s9g.xml?view=plays.
———. 2013c. “Saskatchewan vs Manitoba (Nov 01, 2013).” Canada West Universities Athletic Association. November 1, 2013. https://www.canadawest.org/custompages/Football/Stats/2013-14/09sskman.htm.
———. n.d. “CWUAA Home Page.” Canada West Universities Athletic Association. Accessed June 10, 2018. https://www.canadawest.org/landing/index.
Horowitz, Maksim. 2016. “Introducing NflscrapR - Part 1.” 2016. https://tartansportsanalytics.com/2016/03/10/introducing-nflscrapr-part-1/.
OUA. n.d. “OUA Home Page.” Ontario University Athletics. Accessed June 10, 2018. http://www.oua.ca/landing/index.
PrestoSports. n.d. “PrestoSports Home Page.” PrestoSports. Accessed June 10, 2018. https://www.prestosports.com/landing/index.
RSEQ. n.d. “RSEQ Home Page.” Réseau Du Sport étudiant Du Québec. Accessed June 10, 2018. http://rseq.ca/.
The Automated Scorebook. n.d. “The Automated Scorebook.” The Automated Scorebook. Accessed June 18, 2018. http://www.automatedscorebook.com/.
U Sports. 2013. “Manitoba At. Alberta | Box Score | 10/26/2013 - U SPORTS - English.” U Sports. October 26, 2013. http://presto-en.usports.ca/sports/fball/2013-14/boxscores/20131026_ra78.xml?view=plays.
No comments:
Post a Comment