This week’s lab will show you how to access data published online through APIs.
What are APIs?
An Application Programming Interface (API) is a broad term for anything in computer programming that has a standardized interface so that a computer can interact with it. In essence, it is a set of rules that says: here is a list of operations that you can do to me, and here is what will happen if you undertake any of those operations.
In this lab, we will use R to interact with a specific sort of API: web APIs for accessing data. Essentially, these are URLs (web addresses) that return data instead of web pages. Each URL returns a specific type of data, and has options that we can insert into the URL to customize exactly what data we get back.
Because the data has to be accessed through these well-defined URLs, and because the data we get back is in a known format (a “schema”), it is very easy for a computer to download data through an API with minimal human intervention.
In this lab, we will look at air quality data published by an organization called OpenAQ, whose goal is to make air quality data from all over the world accessible to anyone who wishes to access it. You can read more about their mission on their website: https://openaq.org/
The OpenAQ API
The base URL for the OpenAQ API is https://api.openaq.org/v2/
If you open this link in your web browser, you will get a page that looks like a jumble of text.
However, you may notice that there appears to be some structure to this mess! In fact, this data is in a format called JSON. You can install an extension for most browsers that converts JSON into a more readable format. Here are some suggestions:
- Chrome: install the JSONView extension
- Firefox: has a built-in JSON viewer or you can try the JSONView extension
- Safari: you can try the JSON Peep extension
When converted into a more readable format, you will see that JSON is actually made up of nested lists of information.
When formatted properly, we can see the structure of the data in the JSON format. The base endpoint (the URLs of an API are called endpoints) returns a list of all the other OpenAQ endpoints.
For example, if we want to get a list of the cities in the OpenAQ API, we would go to this endpoint: https://api.openaq.org/v2/cities (i.e. we put the word `cities` at the end of the URL).
Anatomy of an API endpoint
The OpenAQ API is documented here: https://docs.openaq.org/
The `locations` endpoint, at this URL: https://api.openaq.org/v2/locations, returns a list of the locations of every air quality sensor that provides data to OpenAQ.
However, if you visit the endpoint in your browser, you will see that there are a lot of sensor locations! How can we reduce this down to, for example, just the sensors located in a small region?
Take a look at the `locations` endpoint on this documentation page (there are actually three variations of the `locations` URL, so make sure you pick the first one). In the documentation, you will see that there are a number of parameters that we can supply to restrict the list of sensor locations. For example, we can restrict our list to just sensor locations within 18,000 meters of Mason’s Fairfax campus using the `coordinates` and `radius` parameters.
We add parameters to an endpoint by putting a `?` after the endpoint URL, followed by `parameterName=parameterValue`. Multiple parameters are separated by `&`.
For example, the documentation tells us that the `coordinates` parameter takes a latitude and longitude (separated by a comma). GMU is at a latitude of 38.83 and a longitude of -77.30. We would write this parameter as `coordinates=38.83,-77.30`.
We then add this onto our endpoint URL as follows: `https://api.openaq.org/v2/locations` + `?` + `coordinates=38.83,-77.30`, to give a combined URL of:
https://api.openaq.org/v2/locations?coordinates=38.83,-77.30
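If you would rather assemble such a URL in R than type it by hand, simple string pasting works. A minimal sketch (the variable names here are our own, not part of the API):

```r
# Build the parameterized URL from its pieces (variable names are illustrative)
base_endpoint <- "https://api.openaq.org/v2/locations"
coords        <- "38.83,-77.30"

url <- paste0(base_endpoint, "?", "coordinates=", coords)

url
# "https://api.openaq.org/v2/locations?coordinates=38.83,-77.30"
```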
Unfortunately, if you navigate to this URL in your browser, you will find that no sensors are returned by this search. If you check the documentation, you will see that the default `radius` that OpenAQ searches around a set of coordinates is just 2500 meters (about a mile and a half). We can increase the `radius` parameter to broaden our search.
Let’s add in `radius=18000` to broaden our search to 18,000 meters. This gives us this URL (remember that parameters are separated by `&` symbols):
https://api.openaq.org/v2/locations?coordinates=38.83,-77.30&radius=18000
You should now get a single sensor returned from this search!
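The same `?`-then-`&` pattern generalizes to any number of parameters. As a sketch, you could wrap it in a small helper (the `build_url` function is our own invention, not part of any package):

```r
# Join name=value pairs with "&" and attach them after "?" (illustrative helper)
build_url <- function(endpoint, params) {
  query <- paste(names(params), unlist(params), sep = "=", collapse = "&")
  paste0(endpoint, "?", query)
}

build_url(
  "https://api.openaq.org/v2/locations",
  list(coordinates = "38.83,-77.30", radius = 18000)
)
# "https://api.openaq.org/v2/locations?coordinates=38.83,-77.30&radius=18000"
```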
Play around with the `locations` endpoint and the different parameters. When you are comfortable with it, use the `coordinates` and `radius` parameters to create your own URL that finds the closest sensor to a point of interest to you (e.g. your home, somewhere you went on vacation, or maybe somewhere that you would like to go). You can use a website such as https://www.latlong.net/ to find the coordinates of a place that interests you. Write this URL down in your answer file.
We can get the JSON data from an API in R using the `fromJSON` function from the `jsonlite` package:

```r
library(jsonlite)

locations_page <- fromJSON("https://api.openaq.org/v2/locations")
```
The `locations_page` variable contains a list with two items: `meta` (metadata about the API) and `results` (the list of sensors that we are interested in). We can convert this `results` element to a dataframe (note the use of the `$` symbol to access the `results` element from the list):

```r
library(tibble)

locations_df <- as_tibble(locations_page$results)
```
In your answer file, run these two lines of code on the URL that you constructed in the previous question. You should end up with a dataframe that contains information about the sensor closest to the geographic coordinates that you picked.
Now that you have identified a region with some sensors, let’s fetch the air quality measurement data from this region.
By default, the API will only return 100 measurements. You can (and should) increase this to a larger number using the `limit` parameter in the API URL. The maximum is 10000, so try a few thousand, e.g. `limit=2000`.
The air quality measurements are available from this endpoint:
https://api.openaq.org/v2/measurements
To get a selection of recent air quality measurements for your region of interest, you will need to add the same `coordinates` and `radius` parameters as before, as well as the `limit` parameter. You should be able to run the same two functions as before (`fromJSON()` and `as_tibble()`) to create a dataframe of all the measurements. Store this dataframe in a new variable called `measurements_df`.
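Putting those pieces together, the fetch might look like the following sketch (the coordinates are the GMU ones from earlier, so swap in your own; this needs an internet connection):

```r
library(jsonlite)
library(tibble)

# Same pattern as for locations, but on the measurements endpoint,
# with a larger `limit` so we get more than the default 100 rows
measurements_url <- paste0(
  "https://api.openaq.org/v2/measurements",
  "?coordinates=38.83,-77.30&radius=18000&limit=2000"
)

measurements_page <- fromJSON(measurements_url)
measurements_df   <- as_tibble(measurements_page$results)
```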
Being a good API citizen
In general, it is a good idea to minimize the number of times we request data from an API. This is because requests for large amounts of data can put a lot of strain on the server.
It is usually a good idea to make the request once, and then save the resulting data into a file. Then in the future you just load the data from the file rather than having to request the same data all over again.
We will not worry about this during this lab, as we will not be making too many requests to the API, but bear this in mind if you use APIs in the future.
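One simple way to save and reload data in R is the `saveRDS()`/`readRDS()` pair. A sketch (the file name is our own choice, and the small dataframe stands in for the one you downloaded):

```r
# A stand-in for the dataframe you downloaded (use your real measurements_df)
measurements_df <- data.frame(parameter = c("pm25", "o3"), value = c(12, 0.04))

# Save the downloaded data to disk once...
saveRDS(measurements_df, "measurements.rds")

# ...then, in later sessions, load it from the file instead of re-querying the API
measurements_df <- readRDS("measurements.rds")
```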
Let’s take a look at the data you downloaded.
If you open the `measurements_df` dataframe, you should see several columns of interest:
- The `parameter` column records the type of air quality measurement recorded in that row. There are several different types, including `pm25` (PM2.5: particulate matter smaller than 2.5 micrometers), `so2` (sulphur dioxide), and `o3` (ozone). Different sensors record different measurements, so the values in this column will vary depending on which sensor you pick.
- The `value` column records the actual measurement.
We will start by calculating some summary statistics of the data using the `group_by` and `summarize` functions. You should `group_by` the `parameter` variable, and calculate the mean, median, min, max, standard deviation, and inter-quartile range of the `value` column.
As a reminder, the standard deviation is calculated by the `sd` function, and the inter-quartile range is calculated by the `IQR` function.
Finally, let’s create a graph to visualize the air pollution data at the sensor you chose.
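The grouped summary described above might be sketched like this on a small invented dataframe (your real `measurements_df` will differ, and the column names `mean_value` etc. are our own choices):

```r
library(dplyr)

# A small made-up stand-in for measurements_df (the numbers are invented)
toy_df <- tibble::tibble(
  parameter = c("pm25", "pm25", "o3", "o3"),
  value     = c(10, 14, 0.03, 0.05)
)

summary_df <- toy_df %>%
  group_by(parameter) %>%
  summarize(
    mean_value   = mean(value),
    median_value = median(value),
    min_value    = min(value),
    max_value    = max(value),
    sd_value     = sd(value),
    iqr_value    = IQR(value)
  )
```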
The air pollution data is an example of time series data: data measured at regular intervals of time. Time series data is often best plotted as a line graph. In ggplot, we can do that with the `geom_line` function.
Create a line graph, plotting each type of data on a separate facet (i.e. use the `facet_wrap` function to facet over the `parameter` variable, and supply the `scales = "free_y"` argument to allow the y-axis to vary between facets).
Note that you will need to plot time on the x axis. However, the `date` “column” is actually a dataframe itself with two columns, `local` and `utc` (i.e. a dataframe within a dataframe). To get one of the date sub-columns, you should use the `$` selector operator, e.g. `x = date$local`.
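A sketch of what this plot code might look like, using a small invented dataframe that mimics the nested `date` column (your real `measurements_df` goes in its place):

```r
library(ggplot2)

# Made-up stand-in for measurements_df; note the nested `date` dataframe
toy_measurements <- data.frame(
  parameter = rep(c("pm25", "o3"), each = 3),
  value     = c(10, 14, 12, 0.03, 0.05, 0.04)
)
toy_measurements$date <- data.frame(
  local = rep(seq(as.POSIXct("2023-01-01"), by = "hour", length.out = 3), 2),
  utc   = rep(seq(as.POSIXct("2023-01-01"), by = "hour", length.out = 3), 2)
)

# Line graph of value over time, one facet per measurement type
p <- ggplot(toy_measurements, aes(x = date$local, y = value)) +
  geom_line() +
  facet_wrap(~ parameter, scales = "free_y")
```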
Describe the trends that you see in your graph.
How to submit
To submit your lab assignment, follow the two steps below. Your lab will be graded for credit after you’ve completed both steps!
1. Save, commit, and push your completed R Markdown file so that everything is synchronized to GitHub. If you do this right, then you will be able to view your completed file on the GitHub website.
2. Knit your R Markdown document to the PDF format, export (download) the PDF file from RStudio Server, and then upload it to the Lab 9 posting on Blackboard.
Credits
This lab is released under a Creative Commons Attribution-ShareAlike 4.0 International License. Lab instructions were written by Dominic White.