Wednesday, April 3, 2019

Rain or Shine: Incorporating Weather Data into the U Sports Database

1. Abstract

METAR weather data was compiled for the seasons and locations of games within the existing U Sports database by scraping three different sources of archived data. Within the parser, METARs covering the dates and times of games were attached in a list to the game objects. Plays within each game were given an estimated real-world time, and each play was then assigned the METAR nearest to it. Future applications of weather data include its use in development of field goal models, examining the impact of weather on various existing analyses, and contextualizing future research.

2. Introduction

Following the addition of location context data into the U Sports Database, it became possible to use the date, time, and location information to determine other contextual data related to a game. The first use of this is to acquire weather data from the time and place of the game. Weather can impact many aspects of a game, making passing and kicking more difficult, increasing the turnover rate, and causing randomness in the outcomes of plays. Weather data has long been incorporated into models of field goal rates (Clement 2018). Weather data can also be incorporated into future Win Probability models or other characterizations of the game.

3. Sourcing Data

Weather data was sourced from three different locations. Data for 2007-2018 was available from Iowa State University’s Meteorology Department (Herzmann, Arritt, and Todey 2004) and was scraped using their provided Python scraper. Data for 2005-2006 was found with OGIMET (Valor n.d.), using their provided scraping tool. Finally, 2002-2004 data is available through Plymouth State University (Plymouth State University n.d.). While they graciously offered to provide their data, they were unable to source it in a compatible format, and so a scraper was developed using Python’s urllib package (Van Rossum and Drake 1995).
All weather data was stored in METAR format. METARs are standard aviation weather statements with a known format that can be interpreted in a consistent fashion, and include the key information of temperature, wind, and precipitation.
METAR reports follow a common structure, seen in Figure 1. The first part identifies whether the report is of type METAR or SPECI. These are the same except that METAR reports are given hourly, while SPECI reports are issued off the hour when circumstances change in such a way that a new report is required. The ICAO airport code is given next; in this example it is CYUL, Montreal’s Pierre Elliott Trudeau airport. The timestamp that follows shows only the day of the month and the 4-digit UTC time. METARs are meant as a rapid, transient means of conveying weather information to pilots, so the month and year are not stored. For this reason our data requires an additional column as a timestamp. Following that is wind direction given as a 3-digit compass heading, and wind speed in knots. Visibility and cloud cover data follow, as well as temperature, dewpoint, and the altimeter setting. Following this are the remarks, noted by “RMK.” The remarks section has grown significantly over the years as more data has become available, including here cloud types and sea-level pressure.
METAR CYUL 161900Z 23010KT 30SM SCT035 SCT240 24/16 A2985 RMK CU3CI0 SLP109
Figure 1 Example of a METAR (Pierre Elliott Trudeau Airport, 2008-08-16)
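To make the structure concrete, the following is a minimal sketch of pulling the fields described above out of a raw METAR string using only the standard library. It is not the parser used in this work (which relies on the metar package); the regular expression is deliberately simplified, and the function name parse_metar and the returned dictionary keys are illustrative assumptions. The year and month are passed in separately because the METAR itself carries only the day of the month.

```python
import re
from datetime import datetime, timezone

# Simplified pattern covering only the fields discussed in the text,
# not the full METAR grammar (assumption: North American style reports).
METAR_RE = re.compile(
    r"^(?:(?P<type>METAR|SPECI)\s+)?"                       # optional report type
    r"(?P<station>[A-Z]{4})\s+"                             # ICAO station code
    r"(?P<day>\d{2})(?P<hour>\d{2})(?P<minute>\d{2})Z\s+"   # day of month + UTC time
    r"(?P<wind_dir>\d{3}|VRB)(?P<wind_kt>\d{2,3})(?:G\d{2,3})?KT"  # wind
    r".*?\s(?P<temp>M?\d{2})/(?P<dewpoint>M?\d{2})(?=\s|$)"  # temperature/dewpoint
)


def _signed(value):
    """Convert a METAR temperature such as 'M07' or '24' to an int (M means minus)."""
    return -int(value[1:]) if value.startswith("M") else int(value)


def parse_metar(raw, year, month):
    """Parse a raw METAR; year and month come from the external timestamp column."""
    match = METAR_RE.match(raw.strip())
    if match is None:
        raise ValueError(f"Unrecognised METAR: {raw!r}")
    wind_dir = match["wind_dir"]
    return {
        "station": match["station"],
        "time": datetime(year, month, int(match["day"]), int(match["hour"]),
                         int(match["minute"]), tzinfo=timezone.utc),
        "wind_dir": None if wind_dir == "VRB" else int(wind_dir),
        "wind_kt": int(match["wind_kt"]),
        "temp_c": _signed(match["temp"]),
        "dewpoint_c": _signed(match["dewpoint"]),
    }


figure_1 = "METAR CYUL 161900Z 23010KT 30SM SCT035 SCT240 24/16 A2985 RMK CU3CI0 SLP109"
observation = parse_metar(figure_1, year=2008, month=8)
```

Applied to the report in Figure 1, this yields station CYUL, a wind of 230 degrees at 10 knots, and a temperature and dewpoint of 24 and 16 degrees Celsius.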
This data was separated into different files by station and year. This improves organisation and vastly increases the speed at which the parser operates, since it only has to search through files containing on the order of 10,000 lines, whereas the whole dataset contains on the order of 3 million lines. This process could be further accelerated by removing weather data from months outside the football season, or even more aggressively by keeping only the data needed for the individual games. However, splitting by station and year is a simple and standard way to keep the data for a relatively small cost, and it allows the data to be used for other purposes, including CFL data, where the season begins in June. To whittle down the data at this point would constitute a premature optimization (Knuth 1974).
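The partitioning can be sketched as a simple grouping step. The file naming scheme (for example CYUL_2008.csv), the function names, and the row layout below are hypothetical; the source does not specify how the files are named or structured.

```python
from collections import defaultdict
from datetime import datetime


def file_key(station, timestamp):
    """Hypothetical naming scheme: one file per station and year."""
    return f"{station}_{timestamp.year}.csv"


def partition(rows):
    """Group (station, timestamp, raw_metar) rows by their target file name."""
    files = defaultdict(list)
    for station, timestamp, raw in rows:
        files[file_key(station, timestamp)].append((timestamp, raw))
    return dict(files)


# Illustrative rows only; the raw strings are taken from Figure 1 and Table 1.
sample_rows = [
    ("CYUL", datetime(2008, 8, 16, 19, 0), "METAR CYUL 161900Z 23010KT 30SM SCT035 SCT240 24/16 A2985 RMK CU3CI0 SLP109"),
    ("CYWG", datetime(2007, 11, 17, 22, 0), "CYWG 172200Z 12010KT 15SM FEW034 FEW090 FEW240 M02/M08 A3021 RMK SC1AC1CI1 SLP243"),
    ("CYWG", datetime(2007, 11, 17, 23, 0), "CYWG 172300Z 12008KT 15SM FEW090 FEW240 M05/M09 A3020 RMK AC1CI1 SLP242"),
]
partitioned = partition(sample_rows)
```

A game then only needs to open the single file matching its station and year, rather than scanning the full dataset.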
For games played in domed stadia a dummy METAR was used, using room temperature, with no wind or precipitation and standard pressure. To some extent the pressure will not accurately reflect local pressure, but this variation is small compared to the effect of altitude on pressure. Some domes are held at a higher-than-ambient pressure, such as BC Place, in order to support the roof, but this again is considered a minor effect and is set aside at this time.

4. Incorporating into Parse

Game start times, now consistently available from the game statement (Clement 2019), were combined with the game stadium attribute, which holds the time zone information. Using Python’s datetime package (“Datetime — Basic Date and Time Types — Python 3.7.2 Documentation” n.d.), objects were created that neatly contain the date and time of the game. The pytz package (Bishop n.d.) allows proper time zones to be handled from the Olson database (“IANA — Time Zone Database” n.d.), including daylight saving time adjustments. This is important because the METAR data shows time in UTC, and because the U Sports football season crosses the daylight saving time divide.
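The effect of the daylight saving time divide can be illustrated with a short sketch. Note that it uses the standard library’s zoneinfo module (Python 3.9 and later) rather than pytz, purely to avoid a third-party dependency; both draw on the same IANA database. The function name kickoff_utc and the 13:00 kickoff times are hypothetical.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo  # stand-in for pytz; requires system tz data


def kickoff_utc(year, month, day, hour, minute, tz_name):
    """Convert a local kickoff time at the stadium to UTC."""
    local = datetime(year, month, day, hour, minute, tzinfo=ZoneInfo(tz_name))
    return local.astimezone(timezone.utc)


# Same hypothetical local kickoff time in Winnipeg, on either side of the
# end of daylight saving time (4 November 2007).
september = kickoff_utc(2007, 9, 15, 13, 0, "America/Winnipeg")  # 18:00 UTC (UTC-5)
november = kickoff_utc(2007, 11, 17, 13, 0, "America/Winnipeg")  # 19:00 UTC (UTC-6)
```

The same local kickoff time maps to two different UTC hours, which is why a fixed offset per stadium would attach the wrong METARs to late-season games.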
Having the start time of the game, we look to assign an estimated real time to each play, held in the attribute realTime, a datetime object of the same type as game_date. This is done by estimating the length of a U Sports game as 3 hours for regulation play and 30 minutes for overtime. Halftime is assumed to take 15 minutes, and the remaining 2 hours and 45 minutes are splined linearly over the plays in regulation, while the 30 minutes of overtime are distributed evenly across the overtime plays. This is hardly a precise means of measuring the time of any given play, but should prove accurate enough for this purpose, as shall be demonstrated shortly.
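One way to read this scheme is sketched below. Several details are assumptions not stated in the text: plays are spaced evenly by play index within each half, the 2 hours and 45 minutes of play are split equally between the halves with halftime in between, and overtime begins at the 3-hour mark. The function and constant names are illustrative.

```python
from datetime import datetime, timedelta

HALF = timedelta(minutes=82, seconds=30)  # (3 h - 15 min halftime) / 2, assumed equal halves
HALFTIME = timedelta(minutes=15)
REGULATION = timedelta(hours=3)
OVERTIME = timedelta(minutes=30)


def spread(start, span, n_plays):
    """Return n_plays timestamps spaced evenly from start across span."""
    return [start + span * i / n_plays for i in range(n_plays)]


def estimate_real_times(kickoff, n_first_half, n_second_half, n_overtime=0):
    """Estimate a real-world time for each play, in play order."""
    second_half_start = kickoff + HALF + HALFTIME
    overtime_start = kickoff + REGULATION
    return (
        spread(kickoff, HALF, n_first_half)
        + spread(second_half_start, HALF, n_second_half)
        + spread(overtime_start, OVERTIME, n_overtime)
    )


# Tiny illustrative game: 3 plays per half and 2 overtime plays.
kickoff = datetime(2007, 11, 17, 19, 0)
times = estimate_real_times(kickoff, 3, 3, 2)
```

With real play counts of well over a hundred per game, adjacent plays are separated by roughly a minute, far finer than the hourly spacing of the weather reports.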
A new attribute for the game object, METARList, holds a list of METARs that are relevant to the game. These are held as Metar objects from Python’s metar package (Pollard 2019), which parses the text format and returns all of the individual elements as different attributes. In order to find the appropriate data for this list, the game loops through the appropriate raw data csv until it finds the first METAR that is within one hour of the game start time. From there it adds all METARs up to and including the first one at or after the end of the game. Since METARs are issued at least hourly, if not more often (when they are properly known as SPECI reports, but have the same information and formatting), a one-hour buffer before the game ensures that the beginning of the game is covered. An example of a game’s METARList is given in Table 1.


CYWG 172200Z 12010KT 15SM FEW034 FEW090 FEW240 M02/M08 A3021 RMK SC1AC1CI1 SLP243
CYWG 172300Z 12008KT 15SM FEW090 FEW240 M05/M09 A3020 RMK AC1CI1 SLP242
CYWG 180000Z 10006KT 15SM SKC M07/M10 A3019 RMK SLP242
CYWG 180100Z 11006KT 15SM SKC M07/M10 A3018 RMK SLP237


Table 1 METARList from a game (WES vs. MAN 2007-11-17)
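The selection rule for METARList can be sketched as follows, assuming the raw data has already been read into chronologically sorted (timestamp, raw_metar) pairs. The function name, the tuple representation, and the game times in the example are hypothetical; the actual parser stores Metar objects.

```python
from datetime import datetime, timedelta


def select_metars(rows, game_start, game_end):
    """Collect reports from one hour before kickoff through the first report
    at or after the end of the game, inclusive. rows must be sorted by time."""
    earliest = game_start - timedelta(hours=1)
    selected = []
    for timestamp, raw in rows:
        if timestamp < earliest:
            continue  # still before the one-hour buffer
        selected.append((timestamp, raw))
        if timestamp >= game_end:
            break  # first report at or after the end of the game
    return selected


# Hypothetical hourly reports (UTC) around a game assumed to run 22:30 to 01:30.
first_report = datetime(2007, 11, 17, 20, 0)
hourly = [(first_report + timedelta(hours=h), f"report {h}") for h in range(8)]
metar_list = select_metars(
    hourly,
    game_start=datetime(2007, 11, 17, 22, 30),
    game_end=datetime(2007, 11, 18, 1, 30),
)
```

Because the loop stops at the first report at or after the final whistle, every play is bracketed by a report on each side.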
Finally, the game loops through the playlist and, for each play, finds the METAR from METARList whose time attribute is nearest to the realTime attribute of that play. Each play then has the weather data that was issued nearest to its estimated real time. So while the method of estimating the real time of a play may not be precise, it is accurate in the sense that it should largely place plays in the appropriate area to be assigned the correct METAR. Absent the discovery of a trove of data allowing the real time to be assigned to plays, from which it would be possible to spline between these known points in the same way as is done with the clock and time attributes, this is the best method currently available.
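The nearest-report lookup reduces to a minimum over absolute time differences. The sketch below uses (timestamp, raw_metar) tuples in place of Metar objects, and the function name and play times are hypothetical; the reports are those of Table 1.

```python
from datetime import datetime


def nearest_metar(metar_list, real_time):
    """Return the report whose timestamp is closest to real_time.
    On an exact tie, the earlier report in the list is returned."""
    return min(metar_list, key=lambda report: abs(report[0] - real_time))


# Reports from Table 1, with full UTC timestamps.
table_1 = [
    (datetime(2007, 11, 17, 22, 0), "CYWG 172200Z 12010KT 15SM FEW034 FEW090 FEW240 M02/M08 A3021 RMK SC1AC1CI1 SLP243"),
    (datetime(2007, 11, 17, 23, 0), "CYWG 172300Z 12008KT 15SM FEW090 FEW240 M05/M09 A3020 RMK AC1CI1 SLP242"),
    (datetime(2007, 11, 18, 0, 0), "CYWG 180000Z 10006KT 15SM SKC M07/M10 A3019 RMK SLP242"),
    (datetime(2007, 11, 18, 1, 0), "CYWG 180100Z 11006KT 15SM SKC M07/M10 A3018 RMK SLP237"),
]

# Hypothetical estimated play times.
early_play = nearest_metar(table_1, datetime(2007, 11, 17, 22, 20))
later_play = nearest_metar(table_1, datetime(2007, 11, 17, 22, 40))
```

An error of several minutes in a play’s estimated time changes the assigned report only when the play falls close to the midpoint between two reports, which is why the coarse time estimate is adequate here.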
An unfortunate side effect of this addition to the data is that the parser is now slower to execute by a factor of between 30 and 50. To streamline this process Python’s pickle module is used to serialize the results of gamelist, preserving it as a separate file that can then be reimported as needed to develop models, rather than reparsing the entire dataset. The file storing gamelist as a byte stream exceeds 175MB. Similarly, other data structures that are slow to calculate can be pickled in the same way.
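The caching step might look like the following minimal sketch. The function names and the file path are hypothetical, and a small placeholder list stands in for the real gamelist.

```python
import pickle


def save_gamelist(gamelist, path):
    """Serialize the parsed games to disk so they need not be reparsed."""
    with open(path, "wb") as handle:
        pickle.dump(gamelist, handle, protocol=pickle.HIGHEST_PROTOCOL)


def load_gamelist(path):
    """Reload a previously pickled gamelist. Only load files you created
    yourself, since unpickling untrusted data can execute arbitrary code."""
    with open(path, "rb") as handle:
        return pickle.load(handle)
```

A modelling script can then call load_gamelist on the saved file instead of running the parser, paying the parsing cost only when the underlying data changes.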

5. Conclusions

The inclusion of weather data into the dataset will allow for better contextualization of play-by-play data, and the development of more sophisticated models. Future data will be easier to include, as it has become more available through different sources and in more standard formats, simplifying integration with the existing data.

6. References


