In this Lab you will analyze data about flights that departed from New York City airports in 2013.
You will need to use the Databases chapter of the course textbook to complete this Lab.
We have used this dataset before in CDS 101, but this time the flight data is stored as a table in a database along with a table of weather data.
We will need to join these two tables together so that we can investigate the following question:
How is a flight’s departure delay related to bad weather at the departure airport?
About the database
Running the set-up chunk will create a SQLite database in the project
folder called nycflights13.sqlite
. This database contains
several tables, of which we will be using two: flights
and
weather
.
The flights
table contains the same dataset that we used
for Assignment 3: each of the 336,776 rows represents a flight that
departed a New York City (NYC) airport in 2013.
The weather
tables contains hourly weather data for the
three NYC airports (which are JFK, LaGuardia, and Newark). Each row
represents the weather at one of these three airports at a particular
hour during the year of 2013.
The columns in the weather
table are:
variable(s) | description |
---|---|
origin |
The airport weather station. Will be either: JFK ,
LGA , or EWA to match one of the 3 NYC airports
in the origin column of the flights
table. |
year , month , day ,
hour |
Columns representing the date and hour of the weather observation. |
temp , dewp |
Temperature and dewpoint in F. |
wind_dir , wind_speed ,
wind_gust |
Wind direction (in degrees), speed and gust speed (in mph). |
precip |
Precipitation (e.g. rain, snow, etc.), in inches. |
pressure |
Sea level pressure in millibars. |
visib |
Visibility in miles. |
time_hour |
Date and hour of the recording as a POSIXct date. Ignore this column for this assignment. |
Tips and Reminders
Big tables of data will overflow the pages of your PDF. You should not need to print out any tables of data for the exercises here, but if you decide to do so anyway then make sure you reduce the size of any tables you print in your PDF to fit within the width of a page, and not to be longer than ~1 page (but it is OK if a smaller table wraps over two pages).
You should use the
labs()
function to add appropriate titles and axis labels to all your plots in this (and future) assignments. The syntax for doing this is:If you don’t remember how to use a function, you can look up its help page in RStudio by typing
??function_name
in the Console (or by Googling the function!)Remember to commit your work after each question.
Exercises
You will need to use the Databases chapter of the course textbook to complete these exercises.
The database is stored in a file called
"nycflights13.sqlite"
. (If this file does not exist, then you need to create it by running the set-up code chunk in yourflights_database.Rmd
answer file.)This is an example of a SQLite database.
Add a code chunk to your
flights_database.Rmd
answer file create a database connection to the"nycflights13.sqlite"
database. This connection should be assigned to an R variable calledcon
.Hint: Refer to the Databases chapter of the course textbook for the code to do this.
When you run this code chunk you should see the
con
variable appear in the Environment tab of RStudio. It should have a data type that says Formal class SQLiteConnection.Commit your work when you have finished this exercise.
You can see a list of the tables in the database by running this command:
DBI::dbListTables(con)
(you do not need this code in your answer file, so run it in the Console instead of putting it in a code chunk).The
flights
dataset that we analyzed in Assignment 3 is stored in a table of the database calledflights
.In a new code chunk, create a database table (with the
tbl()
function) for theflights
table. Assign this table to an R variable calledflights_tbl
.When you run this code chunk you should see the
flights_tbl
variable appear in the Environment with a data type description that says List of 2.Commit your work when you have finished this exercise.
Write a query that selects the
year
,month
,day
,hour
,dep_delay
, andorigin
columns from theflights_tbl
database table.Even though
flights_tbl
is a database table instead of a dataframe, you can do this using theselect()
function that we have used all semester! Assign this query to a new R variable calledflights_query
.When you run this code chunk you should see the
flights_query
variable appear in the Environment with a data type description that says List of 2.Commit your work when you have finished this exercise.
Our database query has not yet been run. To do so, R will need to translate our R code into the language most commonly used to interact with databases: SQL.
We do not need to write any SQL ourselves (or even understand it!), but we can get R to show us the SQL statement that will be executed.
In a new code chunk, use an appropriate function to display the SQL query stored in the
flights_query
variable.Commit your work when you have finished this exercise.
Add a code chunk and run the
flights_query
query with thecollect()
function. Assign the resulting dataframe (which should have 336,776 rows and 6 columns) to a new variable calledflights_df
.Note that these are the same 336,776 flights to/from New York that we analyzed in Assignment 3!
Below your code, answer this question: what column tells us the airport that a flight left from?
Commit your work when you have finished this exercise.
Create a graph showing the distribution of the
dep_delay
column in theflights_df
variable. Make sure to add an appropriate title.Describe the shape of the distribution.
Calculate summary statistics (mean, median, standard deviation, minimum and maximum) for the
dep_delay
column using thesummarize()
function. Hint: you may have to add thena.rm = TRUE
parameter to each of the summary functions to tell them to ignore missing data in thedep_delay
column.Note any interesting patterns about the summary statistics (make sure that you explain why this is interesting, i.e. instead of merely stating (for example), “The mean is 10.”, a better answer would be something like “The mean is 10 which means that …”).
Commit your work when you have finished this exercise.
So far we have extracted information about flights from the flights table. However, we are interested in the relationship between the departure delay and the weather at the originating airport.
The database stores the weather information in a different table, which is called
weather
. Take a look back at the About the Database section, and answer the following questions:What does each row in the weather table represent?
What column contains information about the amount of rain (or other precipitation) that fell?
What column indicates the airport that each weather observation was made at?
Commit your work when you have finished this exercise.
Create an R variable for the weather table and then write a query that selects the following columns:
origin
,year
,month
,day
,hour
,temp
,wind_speed
,precip
.Make sure you store the query in a new variable called
weather_query
so that it can be reused. Then in a separate line of code, useshow_query()
to print out the SQL statement for this query.I.e. you are essentially repeating Exercises @flights_table - @show_query.
You do not need to
collect()
this dataset.Commit your work when you have finished this exercise.
We want to create a dataset that joins the flights data with the weather data. We will do this by joining the datasets selected by flights query and our weather query.
Join the
flights_query
(left) withweather_query
(right) using theleft_join()
function by matching theorigin
,year
,month
,day
, andhour
columns from both datasets.Store the new query in a new variable called
joined_query
.Write a sentence or two explaining what this code will do.
Then in a separate line of code, use
show_query()
to print out the SQL statement for this query.Commit your work when you have finished this exercise.
Run the
joined_query
query with thecollect()
function, and store the resulting dataframe to a new variable.What does each row in this new dataset represent?
Commit your work when you have finished this exercise.
Using your new joined dataframe, create 2 scatter plots of
dep_delay
(y-axis) vs (1)precip
and (2)wind_speed
.Describe any patterns you see in the scatter plots (i.e. does it look like flights are more delayed when there is more rain?).
Using your graphs from the previous exercise, write a paragraph or so answering our original question of interest (“Does bad weather affect the departure delay of flights?”).
Is this what you expected? Why or why not?
You may wish to consider that this dataset only contains flights that actually departed (and not flights that were cancelled). This is called survivorship bias (Wikipedia). How might survivorship bias be affecting the apparent relationship between bad weather and departure delay that we are seeing in our graphs? Consider this famous story about a case of survivorship bias in World War 2:
You don’t want your planes to get shot down by enemy fighters, so you armor them. But armor makes the plane heavier, and heavier planes are less maneuverable and use more fuel. Armoring the planes too much is a problem; armoring the planes too little is a problem. Somewhere in between there’s an optimum. The reason you have a team of mathematicians socked away in an apartment in New York City is to figure out where that optimum is.
The military came to the SRG with some data they thought might be useful. When American planes came back from engagements over Europe, they were covered in bullet holes. But the damage wasn’t uniformly distributed across the aircraft. There were more bullet holes in the fuselage, not so many in the engines…
But exactly how much more armor belonged on those parts of the plane? That was the answer they came to Wald for. It wasn’t the answer they got.
The armor, said Wald, doesn’t go where the bullet holes are. It goes where the bullet holes aren’t: on the engines.
Wald’s insight was simply to ask: where are the missing holes? The ones that would have been all over the engine casing, if the damage had been spread equally all over the plane? Wald was pretty sure he knew. The missing bullet holes were on the missing planes. The reason planes were coming back with fewer hits to the engine is that planes that got hit in the engine weren’t coming back.
Excepted from How not to be wrong by Jordan Ellenberg - full excerpt
Commit your work when you have finished this exercise.
How to Submit
To submit your lab assignment, follow the two steps below. Your lab will be graded for credit after you’ve completed both steps!
Save, commit, and push your completed RMarkdown file so that everything is synchronized to GitHub. If you do this right, then you will be able to view your completed file on the GitHub website.
Knit your R Markdown document to the PDF format, export (download) the PDF file from RStudio Server, and then upload it to Lab 11 posting on Blackboard.
Credits
This lab is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. Exercises and instructions written by Dominic White.