Chapter 2. Ingesting Data into the Cloud

In Chapter 1, we explored how to decide, in a data-driven way, whether to cancel a meeting. We settled on a probabilistic decision criterion: cancel the meeting with a client if the probability of the flight arriving within 15 minutes of its scheduled arrival time is less than 70%. To model the arrival delay given a variety of attributes about the flight, we need historical data that covers a large number of flights. Historical data that includes this information from 1987 onward is available from the US Bureau of Transportation Statistics (BTS). One of the reasons the government captures this data is to monitor the fraction of each carrier's flights that are on time (defined as flights that arrive less than 15 minutes late), so that airlines can be held accountable. Because the key use case is to compute on-time performance, the dataset that captures flight delays is called Airline On-Time Performance Data. That's the dataset we will use in this book.
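To make the two thresholds above concrete, here is a minimal Python sketch of the on-time definition and the cancellation rule. The function and constant names are hypothetical, chosen only for illustration, and the sample delays are made up. Treating the on-time fraction of past flights as a stand-in for the probability is a simplifying assumption, not the model developed later in the book.

```python
# Illustrative sketch only: names and sample data are hypothetical.
ON_TIME_WINDOW_MIN = 15   # a flight is on time if it arrives less than 15 minutes late
CANCEL_THRESHOLD = 0.70   # cancel if P(on-time arrival) is below 70%


def is_on_time(arrival_delay_min):
    """Return True if the arrival delay (in minutes) is under the 15-minute window."""
    return arrival_delay_min < ON_TIME_WINDOW_MIN


def on_time_fraction(arrival_delays_min):
    """Fraction of flights that arrived on time (the quantity BTS monitors per carrier)."""
    delays = list(arrival_delays_min)
    if not delays:
        raise ValueError("need at least one flight")
    return sum(is_on_time(d) for d in delays) / len(delays)


def should_cancel_meeting(prob_on_time):
    """Apply the decision criterion: cancel when the probability is below 70%."""
    return prob_on_time < CANCEL_THRESHOLD


# Made-up arrival delays in minutes (negative means the flight arrived early).
sample_delays = [-5, 0, 14, 15, 30, 2, 9, 40, 3, 1]
fraction = on_time_fraction(sample_delays)
decision = should_cancel_meeting(fraction)
```

In this sketch a delay of exactly 15 minutes counts as late, following the "less than 15 minutes late" wording, and a probability of exactly 70% does not trigger a cancellation, because the criterion is "less than 70%".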

All of the code snippets in this chapter are available in the folder 02_ingest of the book’s GitHub repository. See the last section of Chapter 1 for instructions on how to clone the repository, and see the README.md file in the 02_ingest directory for instructions on how to do the steps described in this chapter.

Airline On-Time Performance Data

For nearly 40 years, all major US air carriers have been required to file statistics about each of their domestic flights with ...
