This week’s lab shows you how to use the ggplot2 package to visualize datasets and how visualization plays a crucial role in data exploration.
Why data visualization?
Why is data visualization an important topic? On the face of it, you might wonder why we need to dedicate any time to this topic. Aren’t plots really easy now that we all have computers? And isn’t making plots and figures one of the last things that we do for a project or lab report, after we’ve figured everything out? Why start with this? Since a picture (or visualization) is worth a thousand words, take a moment to explore the data visualizations linked below.
After a few minutes, be prepared to share with the class one thing you noticed about one of the visualizations that you think made it effective at conveying information.
Why do buses bunch? http://setosa.io/bus/
U.S. Age Pyramid Becomes a Rectangle: http://www.pewresearch.org/next-america/#Two-Dramas-in-Slow-Motion
Visualizations have an important role to play in nearly every stage of a data science project. High-quality visualizations help people to understand your results and can activate their curiosity about your work and ideas. Creating visualizations in R is also easy and fun, and learning how to make them will help you become more comfortable with using R and RStudio. You will quickly see how simple it is to make colorful and eye-catching plots!
You should refer back to the Data Visualization interactive tutorial from CDS 101 when working on this lab.
About this week’s dataset
You will be exploring a tidied dataset, scraped from the website reelgood.com, on movies streamed on four big platforms (Netflix, Prime Video, Hulu, Disney+). It is automatically loaded into the variable `streaming` for you in the RMarkdown file for your lab report.
To explore the dataset, type the following in the Console tab of RStudio (you will need to run the set-up chunk in the RMarkdown file first):
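A minimal sketch of what that exploration might look like, assuming `View(streaming)` (the command mentioned later in this lab) is what is intended here. The data frame below is a tiny made-up stand-in that only mirrors the column names and types listed under Variables, so the snippet runs on its own; the `age` labels and the 0 to 100 scale for `rotten_tomatoes` are assumptions, not values taken from the real dataset.

```r
# Hypothetical stand-in for the lab's `streaming` data frame.
# All values are made up for illustration only.
streaming <- data.frame(
  title = c("Movie A", "Movie B", "Movie C"),
  year = c(2001L, 2010L, 2019L),
  age = c("7+", "13+", "18+"),              # assumed labels
  im_db = c(6.5, 7.8, 5.2),
  rotten_tomatoes = c(70, 91, 48),           # assumed 0-100 scale
  runtime = c(95L, 120L, 104L),
  streaming = c("Netflix", "Hulu", "Disney+"),
  main_language = c("English", "Foreign", "English"),
  stringsAsFactors = FALSE
)

# View() opens RStudio's spreadsheet-style viewer, so only call it
# in an interactive session.
if (interactive()) View(streaming)

# Console-friendly alternatives from base R:
str(streaming)            # column names and types
head(streaming)           # first few rows
summary(streaming$im_db)  # quick numeric summary of one column
```

In the lab itself you would skip the stand-in and run these commands on the real `streaming` variable after running the set-up chunk.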
Variables
This is a tabular dataset with 1000 observations on the following variables:
Variable | Type | Description |
---|---|---|
title | chr | title of the movie |
year | int | the year in which the movie was created |
age | chr | appropriate target audience |
im_db | dbl | IMDB rating |
rotten_tomatoes | dbl | Rotten Tomatoes rating |
runtime | int | the length of the movie |
streaming | chr | the platform on which the movie is streamed |
main_language | chr | the main language of the movie |
Sources
- The dataset was scraped by Ruchi Bathia, who published it at https://www.kaggle.com/ruchi798/movies-on-netflix-prime-video-hulu-and-disney. It has been lightly cleaned for use in this lab.
Visualization by Example
In your lab report, create an R code block that contains the following code:
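A minimal sketch of such a block, assuming the plots in this lab are built with `qplot()` (the function named in the Modeling section) and that the histogram is drawn from the `im_db` column. The lab's actual block may differ. The `streaming` data frame defined here is a made-up stand-in so that the snippet runs on its own; in your report you would use the real `streaming` data instead.

```r
library(ggplot2)

# Hypothetical stand-in for `streaming` (random, made-up ratings).
set.seed(102)
streaming <- data.frame(
  im_db = round(runif(200, min = 1, max = 10), 1)
)

# With a single numeric variable passed as `x`, qplot() draws a histogram.
# Note: ggplot2 3.4.0 and later mark qplot() as deprecated and may print
# a warning when it is called.
p <- qplot(x = im_db, data = streaming)
print(p)
```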
To run the code, either click the green “play” button in the upper right corner of the R code block or, while your cursor is inside the code block, press `<CTRL>-<SHIFT>-<ENTER>`. This should create a plot called a histogram.

After creating the histogram, look at the `im_db` column in the data table you can view with `View(streaming)` and compare it with the histogram. Then, describe what the histogram is doing with the data in this column.

When you have finished this exercise, commit your work.
It is simple to add additional arguments to the aesthetic input `aes()` that change the way data are shown, which can reveal trends that were previously hidden from view.

Let’s see what the `fill` argument does to our histogram. Write the following code in your lab report:

Run the block and look at the output. Then answer the following questions:

- What did adding `fill = main_language` do?
- What language are most of the movies in this database?

When you have finished this exercise, commit your work.
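For reference, the `fill = main_language` variant of the histogram can be sketched as follows. This assumes the `qplot()` style used in the Modeling section; because the text above talks about `aes()`, the equivalent `ggplot()` form is shown next to it. The stand-in data are made up and split evenly between the two languages on purpose, so they say nothing about the real dataset.

```r
library(ggplot2)

# Hypothetical stand-in for `streaming` (made-up values, even split on purpose).
set.seed(102)
streaming <- data.frame(
  im_db = round(runif(200, min = 1, max = 10), 1),
  main_language = rep(c("English", "Foreign"), each = 100)
)

# qplot() form: `fill` is passed next to `x` and treated as an aesthetic.
p_qplot <- qplot(x = im_db, data = streaming, fill = main_language)

# Equivalent ggplot() form, with the mapping written explicitly inside aes().
p_aes <- ggplot(streaming, aes(x = im_db, fill = main_language)) +
  geom_histogram()

print(p_qplot)
```

Both objects describe the same stacked histogram; which spelling you use is a matter of style.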
Describe the shape (skewness and modality) of the English and Foreign IMDB rating distributions, and where each seems to be centered. Based on your visual inspection, does there appear to be a tangible difference in the average IMDB rating for these two distributions?
Based on your experience of watching movies in recent years, is this a result that you would have expected to see? (Explain why or why not.)
When you have finished this exercise, commit your work.
If we have a lot of categories, it can be confusing to try to break our graph down by colors. (If you want an example, try temporarily replacing the `main_language` variable with the `age` variable, but make sure to change it back again afterwards!)

An alternative to coloring is to create a separate sub-plot for each category. In the `ggplot2` library, sub-plots are called “facets”.

Add this code in a new code block in your lab report:
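A minimal sketch of a faceted histogram, assuming the `qplot()` style with its `facets` parameter and faceting over `age` (the variable a later exercise says was used here). The lab's actual block may differ. The stand-in data use placeholder `age` labels ("A", "B", "C") that are not the real categories.

```r
library(ggplot2)

# Hypothetical stand-in for `streaming`; the `age` labels are placeholders.
set.seed(102)
streaming <- data.frame(
  im_db = round(runif(150, min = 1, max = 10), 1),
  age = rep(c("A", "B", "C"), times = 50)
)

# A one-sided formula in `facets` gives one sub-plot per value of `age`.
p <- qplot(x = im_db, data = streaming, facets = ~ age)
print(p)
```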
Then answer the following questions:
- How many facets are there?
- What does each faceted sub-plot represent? (Hint: what is the variable we are faceting over?)
- Which facet’s distribution contains the most movies?
Scatterplots
Let’s further explore the data using another type of visualization, the scatter plot.
Use the following code to create a scatter plot of each movie’s IMDB rating versus its Rotten Tomatoes rating.
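A minimal sketch of such a scatter plot, assuming the `qplot()` style and placing `rotten_tomatoes` on the x-axis and `im_db` on the y-axis (explanatory and response variable, respectively). The lab's actual block may differ. The stand-in values are independent random numbers on an assumed 0 to 100 scale for `rotten_tomatoes`, chosen so the sketch does not suggest any trend in the real data.

```r
library(ggplot2)

# Hypothetical stand-in for `streaming`: unrelated random values.
set.seed(102)
streaming <- data.frame(
  rotten_tomatoes = round(runif(100, min = 0, max = 100)),  # assumed scale
  im_db = round(runif(100, min = 1, max = 10), 1)
)

# With both `x` and `y` given, qplot() draws points by default.
p <- qplot(x = rotten_tomatoes, y = im_db, data = streaming)
print(p)
```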
Here, `im_db` is the response (dependent) variable and `rotten_tomatoes` is the explanatory (independent) variable. Describe any trends that you see using full sentences.

When you have finished this exercise, commit your work.
Next, we should try and create a plot that is similar in spirit to what we did in Exercise 2, so that we can see how the `im_db` variable depends on the `rotten_tomatoes` variable when the `main_language` variable is taken into account. One important difference to know is that we need to use the word `color` as a parameter instead of `fill`. Otherwise, the procedure for grouping over the `main_language` variable is basically the same.

To do this, figure out how to color the scatter plot by the `main_language` variable using the `color` input and create a new plot. (Hint: start with your code from the last exercise and add the extra argument.)

What does this plot tell you about the relationship between the ratings and the `main_language` variable? (I.e. does the relationship between ratings on IMDB and Rotten Tomatoes look different for English and foreign language movies?)

When you have finished this exercise, commit your work.
Faceting
Rather than coloring our scatter plot, let’s facet it.
Create another scatter plot of the IMDB vs Rotten Tomatoes ratings, but this time add the `facets` parameter instead of the `color` parameter. Facet over the `age` variable, just like we did in Exercise 4.

Is the information presented here any different from the information in Exercise 5? (I.e. do any age categories show a different relationship between their ratings on IMDB vs Rotten Tomatoes?)
When you have finished this exercise, commit your work.
Modeling in ggplot2
When we are looking for a relationship between two variables, it’s often useful to plot the linear regression line through the data (i.e. just like a “line of best fit” that you may have drawn by hand in a prior science or math class).
We can do this with `qplot()` by customizing the “geometry” of the graph with the `geom` parameter.

By default, `qplot()` will use `"point"` geometry to create a scatter plot for two variables. If we want to plot a line, we can instead add the argument `geom = "smooth"`. We will also need to add the argument `method = "lm"` so that the smoother fits a straight (linear regression) line.

Copy the following code into a new code chunk:
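A minimal sketch of the call described above, using the `geom = "smooth"` and `method = "lm"` arguments named in the text. The axis assignment (`rotten_tomatoes` on x, `im_db` on y) follows the earlier scatter plot exercise, and the stand-in data are unrelated random values on an assumed 0 to 100 scale, so the slope shown here means nothing about the real dataset.

```r
library(ggplot2)

# Hypothetical stand-in for `streaming`: unrelated random values.
set.seed(102)
streaming <- data.frame(
  rotten_tomatoes = round(runif(100, min = 0, max = 100)),  # assumed scale
  im_db = round(runif(100, min = 1, max = 10), 1)
)

# geom = "smooth" replaces the default points with a fitted curve;
# method = "lm" is passed on to the smoother so the fit is a straight line.
p <- qplot(x = rotten_tomatoes, y = im_db, data = streaming,
           geom = "smooth", method = "lm")
print(p)
```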
Does it follow the trends (if any) you’ve previously described in the data?
Note: The semi-transparent gray region around the line represents the uncertainty in the position of the fitted line (by default, a 95% confidence interval).
When you have finished this exercise, commit your work.
Unfortunately, plotting a linear regression line by itself is not very useful. Although we know from an earlier exercise that these two variables happen to be linearly related, you can also draw straight lines through non-linear data. The line won’t be a good fit, but unless you show the underlying data in the graph as well, a viewer will never know that.
Therefore, a linear regression line by itself can be misleading, and we always want to avoid misleading visualizations.
To plot multiple geometries on the same graph, we need to supply a vector of geometries to the `geom` parameter. Copy your code from the previous exercise, but now change the `geom` parameter to `geom = c("point", "smooth")`.

When you are finished, commit your work!
How to submit
To submit your lab assignment, follow the two steps below. Your lab will be graded for credit after you’ve completed both steps!
1. Save, commit, and push your completed RMarkdown file so that everything is synchronized to GitHub. If you do this right, then you will be able to view your completed file on the GitHub website.
2. Knit your R Markdown document to the PDF format, export (download) the PDF file from RStudio Server, and then upload it to the Lab 2 posting on Blackboard. Make sure you proofread your PDF for spelling and formatting!
Credits
This lab is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. The streaming dataset was modified from Ruchi Bathia’s work with necessary data tidying, and some of the lab exercises were adapted from problem sets found in Modern Data Science with R by Benjamin Baumer, Daniel Kaplan, and Nicholas Horton. All other exercises and instructions were written by Felicia Natalie Wijaya, James Glasbrenner, and Dominic White for CDS 102.
Tip
As you can now see, changing one of the inputs in your ggplot2 code can have a substantial effect on the way your visualization looks. When a visualization reveals new information, we should describe and interpret it in our lab reports.