This week’s lab shows you how to use the ggplot2 package to visualize datasets and how visualization plays a crucial role in data exploration.

Why data visualization?

Why is data visualization an important topic? On the face of it, you might wonder why we need to dedicate any time to this topic. Aren’t plots really easy now that we all have computers? And isn’t making plots and figures one of the last things that we do for a project or lab report, after we’ve figured everything out? Why start with this? Since a picture (or visualization) is worth a thousand words, take a moment to explore the data visualizations linked below.

After a few minutes, be prepared to share with the class one thing you noticed about one of the visualizations that you think made it effective at conveying information.

Visualizations have an important role to play in nearly every stage of a data science project. High-quality visualizations help people to understand your results and can activate their curiosity about your work and ideas. Creating visualizations in R is also easy and fun, and learning how to make them will help you become more comfortable with using R and RStudio. You will quickly see how simple it is to make colorful and eye-catching plots!

You should refer back to the Data Visualization interactive tutorial from CDS 101 when working on this lab.

About this week’s dataset

You will be exploring a tidied dataset on movies streamed on four big platforms (Netflix, Prime Video, Hulu, and Disney+), scraped from the website reelgood.com. The dataset is automatically loaded into the variable streaming for you in the RMarkdown file for your lab report.

To explore the dataset, type the following in the Console tab of RStudio (you will need to run the set-up chunk in the RMarkdown file first):

View(streaming)

Variables

This is a tabular dataset with 1000 observations on the following variables:

Variable         Type  Description
title            chr   title of the movie
year             int   the year in which the movie was created
age              chr   appropriate target audience
im_db            dbl   IMDB rating
rotten_tomatoes  dbl   Rotten Tomatoes rating
runtime          int   the length of the movie
streaming        chr   the platform on which the movie is streamed
main_language    chr   the main language used in the movie
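
The type abbreviations in the table (chr, int, dbl) correspond to R's character, integer, and double (numeric) column types. As a minimal sketch of how you could check column types yourself, the code below builds a tiny stand-in data frame called streaming_demo. That name and every value in it are invented for illustration; in the lab you would run the same three inspection calls on streaming instead.

```r
# Hypothetical stand-in for the lab's `streaming` data; every value is invented
streaming_demo <- data.frame(
  title = c("Movie A", "Movie B", "Movie C"),
  year = c(2010L, 2015L, 2019L),
  age = c("13+", "18+", "7+"),
  im_db = c(7.4, 6.1, 8.0),
  rotten_tomatoes = c(80, 61, 92),
  runtime = c(101L, 95L, 128L),
  streaming = c("Netflix", "Hulu", "Disney+"),
  main_language = c("English", "Foreign", "English"),
  stringsAsFactors = FALSE
)

str(streaming_demo)            # compact listing of each column's type and first values
dim(streaming_demo)            # number of rows (observations) and columns (variables)
sapply(streaming_demo, class)  # one class per column: character, integer, or numeric
```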

Sources

Visualization by Example

  1. In your lab report, create an R code block that contains the following code:

    qplot(x = im_db, binwidth = 0.5, data = streaming)

    To run the code, either click the green “play” button in the upper right corner of the R code block or, while your cursor is inside the code block, press <CTRL>-<SHIFT>-<ENTER>. This should create a plot called a histogram.

    After creating the histogram, look at the im_db column in the data table you can view with View(streaming) and compare it with the histogram. Then, describe what the histogram is doing with the data in this column.

    When you have finished this exercise, commit your work.
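
For reference, qplot() is a shortcut around the fuller ggplot() interface, and recent ggplot2 releases flag qplot() as deprecated. The sketch below shows what we believe is the equivalent ggplot() call for the histogram. It runs on a small invented data frame (streaming_demo is a hypothetical name, not part of the lab files) so that it works on its own; with the lab data you would pass streaming instead.

```r
library(ggplot2)

# Hypothetical stand-in for the lab's `streaming` data; the ratings are invented
streaming_demo <- data.frame(
  im_db = c(5.2, 6.1, 6.3, 6.8, 7.1, 7.4, 7.6, 8.0)
)

# aes() maps the im_db column to the x axis;
# geom_histogram() draws the bars, each covering an interval 0.5 wide
p_hist <- ggplot(streaming_demo, aes(x = im_db)) +
  geom_histogram(binwidth = 0.5)

print(p_hist)
```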

  2. It is simple to add further arguments that map variables to visual properties of the plot (called “aesthetics”; qplot() passes them on to ggplot2’s aes() function). Changing how the data are shown in this way can reveal trends that were previously hidden from view.

    Let’s see what the fill argument does to our histogram. Write the following code in your lab report:

    qplot(
      x = im_db,
      binwidth = 0.5,
      fill = main_language,
      data = streaming
    )

    Run the block and look at the output. Then answer the following questions:

    • What did adding fill = main_language do?
    • What language are most of the movies in this database?

    When you have finished this exercise, commit your work.
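
The qplot() call above never mentions aes() directly, so here is a sketch of the same idea written with ggplot(), where the aesthetic mapping is explicit. The data frame streaming_demo and its values are invented for illustration and are not part of the lab files.

```r
library(ggplot2)

# Hypothetical stand-in for the lab's `streaming` data; every value is invented
streaming_demo <- data.frame(
  im_db = c(5.2, 6.1, 6.3, 6.8, 7.1, 7.4, 7.6, 8.0),
  main_language = c("Foreign", "English", "English", "Foreign",
                    "English", "English", "Foreign", "English")
)

# Inside aes(), fill = main_language asks ggplot2 to pick one bar color
# per distinct value of the main_language column
p_fill <- ggplot(streaming_demo, aes(x = im_db, fill = main_language)) +
  geom_histogram(binwidth = 0.5)

print(p_fill)
```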

Tip

As you can now see, changing one of the inputs in your ggplot2 code can have a substantial effect on the way your visualization looks. When a visualization reveals new information, we should describe and interpret it in our lab reports.

  3. Describe the shape (skewness and modality) of the English and Foreign IMDB rating distributions, and say where each one appears to be centered. Based on your visual inspection, does there appear to be a tangible difference in the average IMDB rating between these two distributions?

    Based on your experience of watching movies in recent years, is this a result that you would have expected to see? (Explain why or why not.)

    When you have finished this exercise, commit your work.

  4. If we have a lot of categories, it can be confusing to try to break our graph down by colors. (If you want an example, try temporarily replacing the main_language variable with the age variable, but make sure to change it back again afterwards!)

    An alternative to coloring is to create a separate sub-plot for each category. In the ggplot2 library, sub-plots are called “facets”.

    Add this code in a new code block in your lab report:

    qplot(
      x = im_db,
      binwidth = 0.5,
      facets = ~ age,
      data = streaming
    )

    Then answer the following questions:

    • How many facets are there?
    • What does each faceted sub-plot represent? (Hint: what is the variable we are faceting over?)
    • Which facet’s distribution contains the most movies?

    A note on faceting

    When faceting, we have to specify the faceting argument with a tilde symbol: ~, e.g.

    facets = ~ age

    We will learn more about the tilde symbol later on in the course.
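
As a small preview, the tilde creates an R object called a formula, which records variable names without evaluating them. The sketch below inspects such an object and then hands it to facet_wrap(), the ggplot() counterpart of the facets argument. The names streaming_demo and facet_spec and all the values are invented for illustration.

```r
library(ggplot2)

# Hypothetical stand-in for the lab's `streaming` data; every value is invented
streaming_demo <- data.frame(
  im_db = c(5.2, 6.1, 6.8, 7.1, 7.4, 8.0),
  age = c("7+", "13+", "18+", "13+", "7+", "18+")
)

facet_spec <- ~ age
class(facet_spec)     # "formula"
all.vars(facet_spec)  # "age": the variable named on the right of the tilde

# facet_wrap() draws one sub-plot for each distinct value of age
p_facet <- ggplot(streaming_demo, aes(x = im_db)) +
  geom_histogram(binwidth = 0.5) +
  facet_wrap(facet_spec)

print(p_facet)
```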

Scatterplots

  5. Let’s further explore the data using another type of visualization, the scatter plot.

    Use the following code to create a scatter plot of each movie’s IMDB rating versus its Rotten Tomatoes rating.

    qplot(x = rotten_tomatoes, y = im_db, data = streaming)

    Here, im_db is the response (dependent) variable and rotten_tomatoes is the explanatory (independent) variable. Describe any trends that you see using full sentences.

    When you have finished this exercise, commit your work.
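
When describing a trend in a scatter plot, it can help to pair the picture with a single number. The sketch below draws the scatter plot with ggplot() and geom_point(), then computes the correlation between the two ratings with the base R function cor(). It uses an invented data frame (streaming_demo is a hypothetical name, and the rating scale shown is an assumption), so the value it reports says nothing about the real lab data.

```r
library(ggplot2)

# Hypothetical stand-in for the lab's `streaming` data; every value is invented
streaming_demo <- data.frame(
  rotten_tomatoes = c(45, 61, 70, 75, 80, 92),
  im_db = c(5.2, 6.1, 6.8, 7.1, 7.4, 8.0)
)

# Explanatory variable on the x axis, response variable on the y axis
p_scatter <- ggplot(streaming_demo, aes(x = rotten_tomatoes, y = im_db)) +
  geom_point()

print(p_scatter)

# Correlation ranges from -1 to 1; values near 1 indicate that
# the two ratings tend to rise together
rating_cor <- cor(streaming_demo$rotten_tomatoes, streaming_demo$im_db)
rating_cor
```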

  6. Next, we should try to create a plot that is similar in spirit to what we did in Exercise 2, so that we can see how the im_db variable depends on the rotten_tomatoes variable when the main_language variable is taken into account. One important difference is that we need to use color as the parameter instead of fill. Otherwise, the procedure for grouping over the main_language variable is basically the same.

    To do this, figure out how to color the scatter plot by the main_language variable using the color input and create a new plot. (Hint: start with your code from the last exercise and add the extra argument.)

    What does this plot tell you about the relationship between the ratings and the main_language variable? (I.e. does the relationship between ratings on IMDB and Rotten Tomatoes look different for English and foreign language movies?)

    When you have finished this exercise, commit your work.

Faceting

  7. Instead of coloring our scatter plot, let’s facet it.

    Create another scatter plot of the IMDB vs Rotten Tomatoes ratings, but this time add the facets parameter instead of the color parameter. Facet over the age variable, just like we did in Exercise 4.

    Is the information presented here any different from the information in Exercise 5? (I.e. do any age categories show a different relationship between their ratings on IMDB vs Rotten Tomatoes?)

    When you have finished this exercise, commit your work.

Modeling in ggplot2

  8. When we are looking for a relationship between two variables, it’s often useful to plot the linear regression line through the data (i.e. just like a “line of best fit” that you may have drawn by hand in a prior science or math class).

    We can do this with qplot() by customizing the “geometry” of the graph with the geom parameter.

    By default, qplot() will use "point" geometry to create a scatter plot for two variables. If we want to plot a line, we can instead add the argument geom = "smooth". We will also need to add the argument method = "lm" (short for “linear model”) to make this a straight line.

    Copy the following code into a new code chunk:

    qplot(
      x = rotten_tomatoes,
      y = im_db,
      geom = "smooth",
      method = "lm",
      data = streaming
    )

    Does it follow the trends (if any) you’ve previously described in the data?

    Note: The semi-transparent gray region around the line represents the uncertainty in the position of the best-fit line (by default, a 95% confidence interval).

    When you have finished this exercise, commit your work.
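
To see where the plotted line comes from, the sketch below fits the same kind of linear model directly with the base R function lm() and prints its intercept and slope, then draws the line with geom_smooth(), the ggplot() counterpart of geom = "smooth". The data frame streaming_demo and its values are invented for illustration, so the coefficients say nothing about the real lab data.

```r
library(ggplot2)

# Hypothetical stand-in for the lab's `streaming` data; every value is invented
streaming_demo <- data.frame(
  rotten_tomatoes = c(45, 61, 70, 75, 80, 92),
  im_db = c(5.2, 6.1, 6.8, 7.1, 7.4, 8.0)
)

# Fit im_db as a straight-line function of rotten_tomatoes
fit <- lm(im_db ~ rotten_tomatoes, data = streaming_demo)
coef(fit)  # intercept and slope of the best-fit line

# geom_smooth(method = "lm") fits the same model behind the scenes;
# level = 0.95 is the default width of the gray confidence band
p_lm <- ggplot(streaming_demo, aes(x = rotten_tomatoes, y = im_db)) +
  geom_smooth(method = "lm", formula = y ~ x, level = 0.95)

print(p_lm)
```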

  9. Unfortunately, plotting a linear regression line by itself is not very useful. Although we know from an earlier exercise that these two variables happen to be linearly related, you can also draw straight lines through non-linear data. The line won’t be a good fit, but unless you show the underlying data in the graph as well, a viewer will never know that.

    A linear regression line by itself can therefore be misleading, and we always want to avoid misleading visualizations.

    To plot multiple geometries on the same graph, we need to supply a vector of geometries to the geom parameter. Copy your code from the previous exercise, but now change the geom parameter to geom = c("point", "smooth").

    When you are finished, commit your work!
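
To illustrate why the points matter, the sketch below uses a different dataset from the lab: anscombe, which ships with base R. Its x2 and y2 columns follow a curve rather than a line. lm() still returns a slope, and the line-only plot looks perfectly reasonable; only after the points are layered on does the poor fit become visible. The object names fit_curved, p_line_only, and p_both are our own choices for this example.

```r
library(ggplot2)

# `anscombe` is a built-in R dataset; x2 and y2 are related by a curve
fit_curved <- lm(y2 ~ x2, data = anscombe)
coef(fit_curved)  # a straight-line fit is still produced

# Line only: nothing in this plot reveals that the data are curved
p_line_only <- ggplot(anscombe, aes(x = x2, y = y2)) +
  geom_smooth(method = "lm", formula = y ~ x)

# Line plus points: the mismatch between line and data is now visible
p_both <- p_line_only + geom_point()

print(p_both)
```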

How to submit

To submit your lab assignment, follow the two steps below. Your lab will be graded for credit after you’ve completed both steps!

Credits

This lab is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. The streaming dataset was modified from Ruchi Bathia’s work with necessary data tidying, and some of the lab exercises were adapted from problem sets found in Modern Data Science with R by Benjamin Baumer, Daniel Kaplan, and Nicholas Horton. All other exercises and instructions were written by Felicia Natalie Wijaya, James Glasbrenner, and Dominic White for CDS 102.