This week’s lab will show you how to access data published online through APIs.

What are APIs?

An Application Programming Interface (API) is a broad term for anything in computer programming that exposes a standardized interface so that a computer can interact with it. In essence, it is a set of rules that says: here is the list of operations that you can perform on me, and here is what will happen if you perform any of them.

In this lab, we will use R to interact with a specific sort of API: web APIs for accessing data. Essentially, these are URLs (web addresses) that return data instead of web pages. Each URL returns a specific type of data, and has options that we can insert into the URL to customize exactly what data we get back.

Because the data has to be accessed through these well-defined URLs, and because the data we get back will be in a well-defined format (a “schema”), it is very easy for a computer to download data through an API with minimal human intervention.

In this lab, we will look at air quality data published by an organization called OpenAQ, whose goal is to make air quality data from all over the world accessible to anyone who wishes to access it. You can read more about their mission on their website: https://openaq.org/

The OpenAQ API

The base URL for the OpenAQ API is https://api.openaq.org/v2/

If you open this link in your web browser, you will get a webpage that looks like a jumble of text.

However, you may notice that there appears to be some structure to this mess! In fact, this data is in a format called JSON. You can install an extension for most browsers (search your browser’s extension store for a “JSON formatter”) that converts JSON into a more readable format.

When converted into a more readable format, you will see that JSON is actually nested lists of information.

When formatted properly, you can see the structure of the JSON data. The base endpoint (the URLs of an API are called endpoints) contains a list of all the other OpenAQ endpoints.

For example, if we want to get a list of the cities in the OpenAQ API, we would go to this endpoint: https://api.openaq.org/v2/cities (i.e. we put the word cities at the end of the URL).

Anatomy of an API endpoint

The OpenAQ API is documented here: https://docs.openaq.org/

The locations endpoint, at this URL: https://api.openaq.org/v2/locations, returns a list of locations of every air quality sensor that provides data to OpenAQ.

However, if you visit the endpoint in your browser, you will see that there are a lot of sensor locations! How can we reduce this down to, for example, just the sensors located in a small region?

Take a look at the locations endpoint on this documentation page (there are actually three variations of the locations URL, so make sure you pick the first one). In the documentation, you will see that there are a number of parameters that we can supply to restrict the list of sensor locations. For example, we can restrict our list to just sensor locations within 18,000 meters of Mason’s Fairfax campus using the coordinates and radius parameters.

We add parameters to an endpoint by putting a ? after the endpoint URL, followed by parameterName=parameterValue. Multiple parameters are separated by &.

For example, the documentation tells us that the coordinates parameter takes a latitude and longitude (separated by a comma). GMU is at a latitude of 38.83 and a longitude of -77.30. We would write this parameter as coordinates=38.83,-77.30.

We then add this onto our endpoint URL as follows:

https://api.openaq.org/v2/locations + ? + coordinates=38.83,-77.30

to give a combined URL of

https://api.openaq.org/v2/locations?coordinates=38.83,-77.30

Unfortunately, if you navigate to this URL in your browser, you will find that there are no sensors returned by this search. If you check the documentation, you will see that the default radius within which OpenAQ looks for sensors around a set of coordinates is just 2,500 meters (about a mile and a half). We can increase the radius parameter to broaden our search.

Let’s add radius=18000 to broaden our search to 18,000 meters. This gives us the following URL (remember that parameters are separated by & symbols):

https://api.openaq.org/v2/locations?coordinates=38.83,-77.30&radius=18000

You should now get a single sensor returned from this search!
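
If you would rather build such URLs in R than type them out by hand, here is a minimal sketch using paste0() with the GMU coordinates above (substitute your own values as needed):

    # Build the locations URL by pasting its pieces together (you can also
    # simply type the full URL as a single text string)
    base_url <- "https://api.openaq.org/v2/locations"
    locations_url <- paste0(base_url, "?coordinates=38.83,-77.30&radius=18000")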

  1. Play around with the locations endpoint and the different parameters. When you are comfortable with it, use the coordinates and radius parameters to create your own URL that finds the closest sensor to a point of interest to you (i.e. your home, somewhere you went on vacation, or maybe somewhere that you would like to go). You can use a website such as https://www.latlong.net/ to find the coordinates of a place that interests you.

    Write this URL down in your answer file.

  2. We can get the JSON data from an API in R using the fromJSON function from the jsonlite package.

    library(jsonlite)  # provides the fromJSON() function
    locations_page <- fromJSON("https://api.openaq.org/v2/locations")

    The locations_page variable contains a list with two items: meta (metadata about the API) and results (the list of sensors that we are interested in). To convert the results element to a dataframe, run the following (note the use of the $ symbol to access the results element from the list):

    library(tibble)    # provides as_tibble() (also loaded as part of the tidyverse)
    locations_df <- as_tibble(locations_page$results)

    In your answer file, run the fromJSON() and as_tibble() steps above on the URL that you constructed in the previous question. You should end up with a dataframe that contains information about the sensor closest to the geographic coordinates that you picked.
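
    For example, with the GMU URL from earlier (yours will use your own coordinates and radius), the steps might look like this:

    # Assumes jsonlite and tibble are loaded as above
    # Fetch the sensor(s) near GMU and convert the results to a dataframe
    gmu_url  <- "https://api.openaq.org/v2/locations?coordinates=38.83,-77.30&radius=18000"
    gmu_page <- fromJSON(gmu_url)
    gmu_df   <- as_tibble(gmu_page$results)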

  3. Now that you have identified a region with some sensors, let’s fetch the air quality measurement data from this region.

    By default, the API will only return 100 measurements. You can (and should) increase this to a larger number using the limit parameter in the API URL. The maximum is 10000, so try a few thousand, e.g. limit=2000.

    The air quality measurements are available from this endpoint:

    https://api.openaq.org/v2/measurements

    To get a selection of recent air quality measurements for your region of interest, you will need to add the same coordinates and radius parameters as before, as well as the limit parameter.

    You should be able to run the same two functions as before (fromJSON() and as_tibble()) to create a dataframe of all the measurements. Store this dataframe in a new variable called measurements_df.
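
    As a sketch (using the GMU coordinates and radius from earlier, plus the limit parameter; substitute the values from your own URL), this might look like:

    # Assumes jsonlite and tibble are loaded (see question 2)
    # Fetch up to 2000 recent measurements near the chosen point
    measurements_url <- "https://api.openaq.org/v2/measurements?coordinates=38.83,-77.30&radius=18000&limit=2000"
    measurements_page <- fromJSON(measurements_url)
    measurements_df   <- as_tibble(measurements_page$results)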

    Being a good API citizen

    In general, it is a good idea to minimize the number of times we request data from an API. This is because requests for large amounts of data can put a lot of strain on the server.

    It is usually a good idea to make the request once, and then save the resulting data into a file. Then in the future you just load the data from the file rather than having to request the same data all over again.
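
    For example, one simple approach (a sketch using base R’s saveRDS() and readRDS(), which preserve the nested date column) is:

    # After the first request, save the dataframe to a file...
    saveRDS(measurements_df, "measurements.rds")
    # ...and in later sessions, load it from the file instead of calling the API again
    measurements_df <- readRDS("measurements.rds")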

    We will not worry about this during this lab, as we will not be making too many requests to the API, but bear this in mind if you use APIs in the future.

  4. Let’s take a look at the data you downloaded.

    If you open the measurements_df dataframe, you should see several columns of interest:

    • The parameter column records the type of air quality measurement recorded in that row. There are several different types, including pm25 (PM2.5 - particulate matter smaller than 2.5 micrometers), so2 (sulphur dioxide), and o3 (ozone). Different sensors record different measurements, so the values in this column will vary depending on which sensor you pick.

    • The value column records the actual measurement.
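
    To check which measurement types your sensor actually reports, here is a quick (optional) sketch using dplyr’s count() function:

    library(dplyr)   # for count(), group_by(), and summarize()

    # How many measurements of each type did we download?
    count(measurements_df, parameter)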

    We will start by calculating some summary statistics of the data using the group_by and summarize functions. You should group_by the “parameter” variable, and calculate the mean, median, min, max, standard deviation, and inter-quartile range of the “value” column.

    As a reminder, the standard deviation is calculated by the sd function, and the inter-quartile range is calculated by the IQR function.
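
    A minimal sketch of this summary (assuming measurements_df from the previous question; the exact column names you choose are up to you) might look like:

    # group_by()/summarize() and %>% come from dplyr (loaded above)
    measurements_df %>%
      group_by(parameter) %>%
      summarize(
        mean_value   = mean(value, na.rm = TRUE),
        median_value = median(value, na.rm = TRUE),
        min_value    = min(value, na.rm = TRUE),
        max_value    = max(value, na.rm = TRUE),
        sd_value     = sd(value, na.rm = TRUE),
        iqr_value    = IQR(value, na.rm = TRUE)
      )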

  5. Finally, let’s create a graph to visualize the air pollution data at the sensor you chose.

    The air pollution data is an example of time series data: data measured at regular intervals of time. Time series data is often best plotted as a line graph. In ggplot, we can do that with the geom_line function.

    Create a line graph, plotting each type of data on a separate facet (i.e. use the facet_wrap function to facet over the parameter variable, and supply the scales = "free_y" argument to allow the y-axis to vary between facets).

    Note that you will need to plot time on the x axis. However, the date “column” is actually a dataframe itself with two columns, local and utc (i.e. a dataframe within a dataframe). To get one of the date sub-columns you should use the $ selector operator, e.g. x = date$local.
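
    Putting the pieces together, one possible sketch is shown below. Note that the timestamps come back from the API as text; converting them to a date-time (for example with as.POSIXct(), an extra step not described above) lets ggplot treat time as a continuous axis:

    library(ggplot2)

    # Assumes the local timestamps look like "YYYY-MM-DDTHH:MM:SS..."
    ggplot(measurements_df,
           aes(x = as.POSIXct(date$local, format = "%Y-%m-%dT%H:%M:%S"),
               y = value)) +
      geom_line() +
      facet_wrap(~ parameter, scales = "free_y") +
      labs(x = "Time (local)", y = "Measured value")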

    Describe the trends that you see in your graph.

How to submit

To submit your lab assignment, follow the two steps below. Your lab will be graded for credit after you’ve completed both steps!

Credits

This lab is released under a Creative Commons Attribution-ShareAlike 4.0 International License. Lab instructions were written by Dominic White.