Thursday, August 30, 2018

Going Pro: Developing a CFL Play-by-Play Database

1-Abstract

CFL play-by-play data was collected from the league's official website into a single database. Idiosyncrasies in the structure of the data required the development of a VBA HTML scraper to extract it. The data was then parsed with a method that parallels the U Sports database, making it available for future investigation.

2-Introduction

A recent work by the author used a VBA script to parse a database of U Sports football play-by-play data (Clement 2018). The data itself was manually copied from various sources over a period of time. In order to compare the two major leagues of Canadian football, U Sports and the CFL, it became necessary to develop a similar CFL database. Data is available back to the 2004 season, and through the 2017 season 1251 games were scraped. A list of the games included is given in Appendix A.


3-Database

While play-by-play data for CFL games is available at the CFL website (CFL n.d.), the format identifies the team in possession by its logo, which made it impossible to copy the data directly in the manner that had been done with the U Sports data. Thus, it became necessary to develop an HTML scraper that could interpret the logos and return the team name.

a-Game Scraper

Since the ultimate goal was for the data to end up in an Excel sheet, the obvious choice was to develop the scraper in VBA. The CFL data structure consists of nine columns: the first numbers the row within the game; the second shows the logo of the team currently in possession, or of the kicking team in the case of kickoffs; the third, fourth, and fifth give down, distance, and field position; the sixth categorizes the play as a kickoff, rush, pass, punt, field goal, or other category as appropriate; the seventh gives the actual description of the play; and the final two give the away and home team scores. Table 1 gives an example of the data format used by the CFL.


Play #: 158
Down: 1
Distance: 10
Field Position: S26
Play Type: Field Goal
Play Description: (00:00) D. DUVAL Field Goal (33 yds) - Field Goal
SCORING TIME: 15:00 - Montreal Alouettes win their sixth Grey Cup. MOP: Avon Cobourne. CDN: Ben Cahoon
Away Score: 28
Home Score: 27
Table 1 Example of CFL play-by-play data (CFL 2009)
The code used is given below:
Sub WebScraper()
Dim appIE As Object 'creates an object that we will use to open internet explorer
Set appIE = CreateObject("internetexplorer.application") 'will open IE, but in the background
Application.ScreenUpdating = False 'Since we’re going to be printing a lot to the sheet we don’t want to be slowed down by refreshing the visual
Dim tables As Object 'A set of objects to hold tables, rows, and data
Dim itemELE As Object
Dim rows As Long
Dim text As String
Dim StartS As Long
Dim LenS As Long
Dim URL As String
Dim awayteam As String 'holds the away team name parsed from the page source
Dim hometeam As String 'holds the home team name parsed from the page source
Dim x As Integer


rows = 1 'This holds the active row we’re printing on
For x = 1 To 957 '957 is the number of URLs we scraped at one time
   URL = Sheets("URLs").Cells(x, 1).Value 'grabs next URL in the list
   With appIE 'Opens the page
       .navigate (URL)
   End With
   Application.Wait Now + #12:00:10 AM# 'wait an extra ten seconds to make sure the page is fully loaded
   Do While appIE.busy 'kills time while the page loads
       DoEvents
   Loop
   text = appIE.Document.body.innerHTML 'gives us a string with all the html buried in it so we can find the team names
   
   StartS = InStr(text, "team_2_nickname") + 28 'figuring out the road team
   LenS = InStr(Mid(text, StartS, 100), Chr(34)) - 1
   awayteam = Mid(text, StartS, LenS)
   StartS = InStr(text, "team_1_nickname") + 28 'figuring out the home team
   LenS = InStr(Mid(text, StartS, 100), Chr(34)) - 1
   hometeam = Mid(text, StartS, LenS)
   Cells(rows, 1).Value = awayteam & " vs. " & hometeam 'This gets the who vs who line
   rows = rows + 1
   On Error Resume Next    
   'This is the part that actually gets the play by play data.
   For Each tables In appIE.Document.getElementsByTagName("tbody")
   'here we need to pull the Quarter
       For Each itemELE In tables.getElementsByTagName("tr")
           On Error Resume Next
           Cells(rows, 2).Value = itemELE.getAttribute("class") 'This line gets the possession data by finding the name of the team logo
           Cells(rows, 1).Value = itemELE.getAttribute("data-playid") 'This pulls the play number
           Cells(rows, 6).Value = itemELE.getAttribute("data-playtype") 'pulls the ODK
           Cells(rows, 3).Value = itemELE.getElementsByTagName("td")(3).innerText 'pulls the down
           Cells(rows, 4).Value = itemELE.getElementsByTagName("td")(4).innerText 'pulls the distance
           Cells(rows, 5).Value = itemELE.getElementsByTagName("td")(5).innerText 'pulls the FPOS
           Cells(rows, 7).Value = itemELE.getElementsByTagName("td")(9).innerText 'pulls the play description
           Cells(rows, 8).Value = itemELE.getElementsByTagName("td")(10).innerText 'pulls the away score
            Cells(rows, 9).Value = itemELE.getElementsByTagName("td")(11).innerText 'pulls the home score
           rows = rows + 1 'pushes us onto the next line
       Next itemELE
   Next tables
Next x
End Sub
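Note that the scraper depends entirely on the markup of the CFL's play-by-play pages: the row class carrying the team logo, the data-playid and data-playtype attributes, and the positions of the td cells. If the site's markup changes, those attribute names and column indices would need to be revisited.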

b-Game List

While the scraper above effectively gathers the data from individual games, it was impractical to open each game individually to pull the URL and feed it into the scraper. Instead, a second scraper was developed to go through the season schedule and pull the game URLs into a list that can then be fed into the game scraper. The code for this meta-scraper is included below:
Sub URLScraper()
Dim appIE As Object 'creates an object that we will use to open internet explorer
Set appIE = CreateObject("internetexplorer.application") 'will open IE, but in the background
Dim one As Object 'holds each gametracker element in the loop
Dim rows As Integer
rows = 1
With appIE
   .navigate ("https://www.cfl.ca/schedule/?season=2016") 'this is the url for the 2016 season to get the game URLs
End With
Do While appIE.busy 'kills time while the page loads
   DoEvents
Loop
Application.Wait Now + #12:00:03 AM# 'wait an extra three seconds to make sure the page is fully loaded


For Each one In appIE.Document.getElementsByClassName("gametracker")
   Cells(rows, 1).Value = one.getAttribute("data-url") 'pull the url from the gametracker
   rows = rows + 1 'increment the row
Next one
End Sub
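To cover another season, the only change needed is the season parameter in the schedule URL (?season=2016 in the code above).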


Once all the URLs were collected, a simple Excel text formula appended “#playbyplay” to each URL to direct the scraper straight to the play-by-play section of the page.
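For example, with a game URL in cell A1 (the cell reference here is only illustrative), a formula along these lines produces the final address:
=A1 & "#playbyplay"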

c-Cleanup

Having acquired all of the play-by-play data, it became necessary to clean it to prepare it for the parser. The first step was to convert all the team references into a uniform system. In the original data, teams are referred to in varying ways: by full name, city name, team nickname, abbreviations of any of these, or by single-letter codes. As was done with the U Sports database, each team was given a three-letter code. While U Sports program codes were based on the name of the school, CFL team codes are based on the team nickname, owing to the overlap between team and university names that would otherwise create ambiguity: Toronto, Montreal, Ottawa, Saskatchewan, Calgary, and British Columbia all identify CFL teams and also universities with football programs. Table 2 shows each team and its associated abbreviation. Note that there are two teams from Ottawa: the Renegades, who existed from 2002-2005, and the REDBLACKS, formed in 2014. The CFL recognizes these two franchises, as well as the earlier Ottawa Rough Riders (1876-1996), as constituting a single lineage, but in this database they are kept separate to allow for greater flexibility; they can be combined when desired.
Team Name | Abbreviation
Montreal Alouettes | ALS
Ottawa Renegades | REN
Ottawa REDBLACKS | RED
Toronto Argonauts | ARG
Hamilton Tiger-Cats | TIC
Winnipeg Blue Bombers | BBO
Saskatchewan Roughriders | RRI
Edmonton Eskimos | ESK
Calgary Stampeders | STA
BC Lions | LNS
Table 2 CFL Team Names and Abbreviations
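As an illustration of this standardization step, a lookup along the lines below collapses the various spellings into the three-letter codes. This is a minimal sketch rather than the full cleanup code, and the alternate spellings shown are examples only:
Function TeamCode(rawName As String) As String
    'Minimal sketch: map the varying ways a team appears in the raw data to its three-letter code.
    'Only three teams and a handful of example spellings are shown here.
    Select Case UCase(Trim(rawName))
        Case "MONTREAL ALOUETTES", "MONTREAL", "ALOUETTES"
            TeamCode = "ALS"
        Case "SASKATCHEWAN ROUGHRIDERS", "SASKATCHEWAN", "ROUGHRIDERS"
            TeamCode = "RRI"
        Case "BC LIONS", "LIONS"
            TeamCode = "LNS"
        Case Else
            TeamCode = "" 'anything unrecognized is left blank and flagged for manual review
    End Select
End Function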
It should be noted that the CFL data was in far better condition than the U Sports data, and required far less cleaning; only a couple of games proved too incomplete or corrupt to be salvaged. The next step was to fill in down and distance for kickoffs and converts, which are given a down of 0 for bookkeeping purposes, and a distance of 10 for kickoffs or a distance equal to the yard line for converts.
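A sketch of that bookkeeping rule is below, assuming the column layout produced by the scraper (down in column 3, distance in column 4, field position in column 5, play type in column 6); the exact play-type strings and the single-letter field-position prefix are assumptions for illustration:
Sub FillKickoffConvertDownDistance()
    'Minimal sketch of the down-and-distance bookkeeping described above.
    Dim r As Long
    For r = 1 To Cells(Rows.Count, 6).End(xlUp).Row 'walk every row with a play type
        Select Case LCase(Cells(r, 6).Value)
            Case "kickoff"
                Cells(r, 3).Value = 0 'down 0 for bookkeeping
                Cells(r, 4).Value = 10 'nominal distance of 10 for kickoffs
            Case "convert" 'assumed play-type label for converts
                Cells(r, 3).Value = 0 'down 0 for bookkeeping
                Cells(r, 4).Value = Val(Mid(Cells(r, 5).Value, 2)) 'distance equal to the yard line, e.g. "S26" -> 26
        End Select
    Next r
End Sub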
Notably, the score given in each row includes any points scored on that play, whereas the convention in our dataset is to exclude scoring on the current play. Additionally, while the clock statements on most plays follow the standard of showing the time remaining in the quarter, the clock statements attached to scoring plays show the time elapsed in the quarter instead. Finally, end-of-game statements were added at the end of each game.
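For example, the scoring play in Table 1 carries SCORING TIME: 15:00, meaning the full fifteen minutes of the quarter had elapsed, which matches the (00:00) remaining shown in the play description; converting between the two conventions is simply remaining = 15:00 minus elapsed.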

d-Parsing

The play-by-play data was parsed using much the same methods as the U Sports database (Clement 2018), with minor adjustments for the different data structure. The greater separation of attributes into distinct columns simplified the task somewhat, but the end result was the same, with the same attributes parsed for each play.

4-Conclusion

The development of this database, structured in parallel with the U Sports database, will allow a direct comparison of the two leagues on all metrics. It also allows us to ask, and then answer, whether the two games are fundamentally different, whether their differences are the result of differences in skill level or of the small differences in rules, and, if the two leagues prove similar enough, whether that opens the door to combining their data pools, vastly increasing the level of detail in the data.
In the immediate future, a comparison of P(1D) between the two leagues will be forthcoming, since the parsed databases can directly recycle each other’s code.

5-References

