This week’s lab will show you how to access data published online through APIs.

What are APIs?

An Application Programming Interface (API) is a broad term for anything in computer programming that has a standardized interface so that a computer can interact with it. In essence, it is a set of rules that says: here is a list of operations that you can perform on me, and here is what will happen if you perform any of them.

In this lab, we will use R to interact with a specific sort of API: web APIs for accessing data. Essentially these are URLs (web addresses) that return data instead of web pages. Each URL returns a specific type of data, and has options that we can insert into the URL to customize exactly what data we get back.

Because the data has to be accessed through these well-defined URLs, and because the data we get back will be in a certain format (a “schema”), it is very easy for a computer to download data through an API with minimal human intervention.

In this lab, we will look at data about the inhabitants of the USA that is collected by the US Census Bureau. The Census Bureau conducts two main surveys of the US population:

The Census Bureau has a website where anyone can download anonymized statistics collected from these surveys. We will use a package called tidycensus to download data from this website automatically.

Setup

The Census Bureau API

The web address or API endpoint for the 2010 US census that we will be using is: https://api.census.gov/data/2010/dec/sf1

If you open this link in your web browser, you will get a webpage that looks like a jumble of text:

However, you may notice that there appears to be some structure to this mess! In fact, this data is in a format called JSON. JSON is hard for us humans to read, but because of its structure, it is easy for computers to parse.
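To see what "easy for computers to parse" means in practice, here is a minimal sketch that turns a small piece of JSON text into ordinary R objects. It assumes the jsonlite package is installed (this lab does not otherwise mention it), and the JSON text is invented purely for illustration; it is not what the Census Bureau API returns.

```r
# Assumption: the jsonlite package is installed; the lab itself does not require it.
library(jsonlite)

# A made-up JSON document (NOT real Census data): one object with two fields.
json_text <- '{"course": "Data Science", "labs": [1, 2, 3]}'

# fromJSON() converts the JSON text into ordinary R objects.
parsed <- fromJSON(json_text)

parsed$course  # a character string
parsed$labs    # a vector of numbers
```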

To get particular data from this API, we also need to specify some parameters about that data. We add parameters to an endpoint by putting a ? after the endpoint URL, and then parameterName=parameterValue. Multiple parameters are separated by &.

For example, here we have added two parameters to our endpoint:

https://api.census.gov/data/2010/dec/sf1?get=P001001&for=state

You can copy the URL above into your browser and get the data for yourself!
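As a rough sketch of the same rule in R, the code below assembles that URL from an endpoint and a set of named parameters using only base R. The helper name build_api_url is made up for this illustration; it is not part of tidycensus, which builds these URLs for you.

```r
# Hypothetical helper (not from tidycensus): join an endpoint and named
# parameters into one URL of the form endpoint?name1=value1&name2=value2
build_api_url <- function(endpoint, params) {
  # Pair each parameter name with its value, then separate the pairs with "&".
  query <- paste0(names(params), "=", params, collapse = "&")
  # The "?" marks where the endpoint ends and the parameters begin.
  paste0(endpoint, "?", query)
}

url <- build_api_url(
  "https://api.census.gov/data/2010/dec/sf1",
  c(get = "P001001", "for" = "state")
)
print(url)
```

This simple version assumes the parameter values contain no spaces or special characters. Values that do would need to be encoded first, for example with utils::URLencode().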

Exercises

  1. Copy and paste this code into a code chunk in your answer file:

    va_pop_2010 <- get_decennial(
      geography = "county",
      state = "VA",
      variables = c("P001001"),
      year = 2010,
      show_call = TRUE,
      geometry = TRUE
    )

    Run the code and answer the following questions:

    1. What is the full URL (i.e. web address) of the API endpoint that the get_decennial() function is accessing data from?

    2. What are the parameters and values that we are supplying to this endpoint? (Format your answer as a Markdown table, like we did in Lab 1, with one parameter and value per row.)

    3. Copy and paste the endpoint URL into a new tab in your web browser. Compare the page that you see with the dataframe stored in the va_pop_2010 variable created by the code above, and complete the following two tasks in your answer file:

    * Describe how a row of data is recorded in the API response (i.e. does it look like a nicely formatted table, and if not, how are rows and columns denoted?) 
    * Take a screenshot of the webpage that you see, and add it as an image within your answer file (again, like we did in Lab 1). (Note that you will need to save or upload the screenshot image file to the folder of files for this Project in RStudio.)

    When you have finished this exercise, commit your work.

  2. The code in the previous exercise downloaded the value of a census variable called P001001 for every county in the state of Virginia.

    P001001 tells us the total population, i.e. we have a dataset of Virginia county populations.

    We have also downloaded the geographic outline of each of these counties by using the geometry = TRUE parameter. This means that we can plot a choropleth map, which is a type of map where each area is colored by some value.

    To do so, copy and paste the following code into your answer file:

    va_pop_2010 %>%
      ggplot(aes(fill = value)) + 
      geom_sf() + 
      scale_fill_viridis_c(option = "magma")

    The color of each county in this map tells us the value of the variable we plotted, which in this case was the population.

    Compare this choropleth map with a map of Virginia’s counties and answer the following questions: Which Virginia county has the highest population? How can you tell from the choropleth map?

    When you have finished this exercise, commit your work.

  3. Run the get_decennial() function again, but this time use it to also get data about (1) the number of housing units, and (2) the number of white people. In addition, we will look at tracts instead of counties (tracts are smaller areas that counties are divided into).

    You can use the following code template (also see the hints below it):

    ... <- get_decennial(
      geography = "...",
      state = "VA",
      variables = c("P001001", "...", ...etc.),
      year = 2010
    )
    • Change the geographic area from "county" to "tract".

    • List all the variables that we want:

      • P001001 (total population)
      • P003002 (white population)
      • H001001 (total number of housing units)
      • H004002 (owner-occupied housing units with a mortgage)
      • H004003 (owner-occupied housing units owned outright, i.e. no mortgage)
    • Note that we do not want to use the geometry or show_call parameters here.

    When you have finished this exercise, commit your work.

  4. If you look at the dataframe stored in the new R variable you created in the previous exercise, you will see that the values of all the census variables are in a single column called value, and that there is a column of names called variable.

    In effect, the dataframe looks like somebody has run pivot_longer() on all the columns!

    Unfortunately, we need to reverse this and put every census variable in its own column. To convert rows back into columns we want to use a function called pivot_wider(), which we can do with the code template below:

    ... <- 
      ... %>%
      pivot_wider(
        names_from = variable,
        values_from = ...
      )

    We will need to pipe in the dataframe from the previous exercise, and assign the final wider output dataframe to a new variable. You should be able to figure out what column to pass to the values_from parameter by looking at the columns that we have in the input dataframe (hint: think about what column contains the actual numerical values of all of our census variables).

    (Your final pivoted dataframe should contain 7 columns and 1907 rows - each row will represent a unique tract in Virginia.)
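
    If pivot_wider() is new to you, here is a minimal sketch on a made-up dataframe. The city, measure, and reading columns are invented for this illustration and have nothing to do with the census data. It assumes the tidyr package is installed, which it will be if you have installed the tidyverse.

    ```r
    # Assumption: tidyr is installed (it is part of the tidyverse).
    library(tidyr)

    # A made-up "long" dataframe: one row per city per measurement.
    long <- data.frame(
      city    = c("A", "A", "B", "B"),
      measure = c("temp", "rain", "temp", "rain"),
      reading = c(20, 5, 25, 1)
    )

    # Spread the measurements out so each one gets its own column:
    # new column names come from `measure`, cell values come from `reading`.
    wide <- pivot_wider(long, names_from = measure, values_from = reading)

    wide  # one row per city, with separate temp and rain columns
    ```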

    When you have finished this exercise, commit your work.

  5. Using your wide dataframe from the previous exercise and the variable descriptions in Exercise 3, create two new columns using the mutate() function:

    • One column should calculate the percentage of each tract that is non-white. Since we know the total population and the number of white inhabitants, you will need to calculate the number of non-white inhabitants and then figure out what percentage they represent.

      For example, if a tract had 1000 people, and 700 are white, then the percentage that is non-white is

      \[ \frac{1000-700}{1000} \times 100 = \frac{300}{1000} \times 100 = 30\% \]

    • The other column should calculate the percentage of homes that are owned by the person living there (either with or without a mortgage). To do this, you will need to add up the columns that represent mortgaged and non-mortgaged homes and divide by the total number of homes (and again multiply by 100 to convert this fraction to a percentage).

    Make sure to assign the dataframe with the new columns to a new R variable, and when you have finished this exercise, commit your work.

  6. Visualize the distributions of the data in each of the two new columns created in the previous exercise.

    Your graphs should be of the appropriate type for showing the distribution of a single column (i.e. histogram, box plot, violin plot), and should be adequately formatted and labelled for the type of graph you pick.

    Using your graphs, describe both distributions.

    When you have finished this exercise, commit your work.

  7. Finally, let’s look at the co-variation of the two new variables.

    To do this, create a scatter plot of the percentage of owner-occupied homes (on the y-axis) vs. the percentage of non-white residents (on the x-axis) in each tract. Add a linear trend line to this same plot with the geom_smooth() function (i.e. this trend line should be straight and not wavy). Don’t forget to add appropriate labels to the graph as well.

    Describe any patterns you see in the graph, and interpret these given what you know about socioeconomic trends in the USA and/or Virginia.

    When you have finished this exercise, commit your work.

How to submit

To submit your lab assignment, follow the two steps below. Your lab will be graded for credit after you’ve completed both steps!

Credits

This lab is released under a Creative Commons Attribution-ShareAlike 4.0 International License. Lab instructions were written by Dominic White.