This week’s lab will introduce you to data transformations using the dplyr library, which is loaded as part of tidyverse. To keep things from becoming too complicated, we practice these commands on a small, synthetic dataset that represents sales of caffeinated drinks.
What is “data wrangling”?
The previous lab focused on constructing visualizations using the
ggplot2
library. The dataset used for that lab was selected
because it was relatively small in size and could be visualized in
interesting ways without the need for additional processing. It would
certainly be nice if all datasets came in such a form! However, in
reality, many datasets need cleaning and to be transformed before you
can create a meaningful visualization or answer a question. The dataset
need not even be messy; datasets that have been preprocessed and are
clean and tidy still may require transformations. The full pipeline for
obtaining data, cleaning it up, and transforming it is informally
referred to as “data wrangling”, which is summarized in the figure from
R for Data
Science below:
We start with the transform part of the data wrangling pipeline, which relies on commands that are part of the dplyr package, itself part of tidyverse. Like many of the functions available in the tidyverse, the command names provide us with clues as to what they do. If you have any prior experience with spreadsheet software such as Microsoft Excel, you will probably find the commands in dplyr to be familiar.
About this week’s dataset
This is a synthetic dataset that contains sales data of office
supplies to different regions of the United States. The dataset is
stored in the variable coffeeshop
. The data was taken from
decisivedata.net.
Variables
This is a tabular dataset with 43 observations on the following variables:
Variable | Type | Description |
---|---|---|
cogs | dbl | cost of goods sold (the ingredient’s buying price; in US Dollars) |
date | date | the date when the order’s placed |
market | chr | the district in which the shop’s located |
product_line | chr | the ingredient’s form (beans or leaves) |
product | chr | product name |
sales | dbl | the product’s selling price (in US Dollars) |
state | chr | the state in which the shop’s located |
total_expenses | dbl | the total amount spent on creating the product (in US Dollars) |
type | chr | caffeinated (Regular) or decaffeinated (Decaf) |
Like in previous labs, it’s recommended that you take a first look at
the dataset by viewing it by running View(coffeeshop)
in
the Console pane of RStudio (you will need to run the set-up
chunk in the RMarkdown file first).
While working through these exercises, you may find it helpful to refer back to the following resources:
- Data Wrangling interactive tutorial from CDS 101 (see your CDS 101 Blackboard site for link)
- Data transformation with
dplyr
cheatsheet - Data tidying with
tidyr
cheatsheet - Supplemental reading: R for Data Science by Wickham and Grolemund (online textbook):
Exercises
Let’s start with the
select()
function. Theselect()
function selects columns from a dataset, which is useful when you’re working with a dataset that contains dozens of variables.Try running the following code in the Console (you do not need to add this example code to your answer file):
The output will be a new dataframe with a single column. We can select more columns by passing them as additional arguments to the select function, e.g.
select(coffeeshop, market, sales)
.Obviously writing out the name of every column you need can be a little cumbersome if you are working with a lot of columns. Fortunately there are shortcuts we can use to specify multiple columns according to a pattern.
For example, if we wish for a series of consecutive columns we can just write the first and last columns separated by a colon
:
. This will select all the columns in that range, e.g.select(coffeeshop, product_line:sales)
.Other operators and helper functions that we can use are listed on this web page.
Add the following four code examples to new code chunks in the Lab03.Rmd answer file, and write an explanation of what each does.
(Note that the
!
operator is interchangable with the-
operator in this context.)
The strange looking symbol
%>%
is called the pipe operator, and it is a handy way to pass a dataset through a chain of commands. It sends the output of the expression on its left to be the first argument of the function on its right.Create a new code chunk and rewrite this code so that we pipe the
coffeeshop
dataset to theselect()
function:Create a new code chunk and rewrite this code so that we pipe the
coffeeshop
dataset to theselect()
function, and then pipe the output ofselect()
tofilter()
:Note
This code example nests one function inside another. While this is valid R code, it is hard to read, and it is much easier to understand if we rewrite it as a series of operations connected via the pipe operator.
Let’s now try an example that uses the
mutate()
function, which is a little more complex.mutate()
lets us transform our dataset by applying the same operation to each row in the dataset and appending the results as a new column. This would allow you to create a new column in your dataset calledproduct_profit
that contains the revenue from each order.To calculate the revenue an order, we need to deduct the expenses (cost of goods sold (COGS) and operating expenses) from our sales. Do this by adding the following
mutate()
command to a new code chunk:After confirming that the above command works, copy the same code into a new code block and remove
product_profit =
from the function input.Does the code still run? If so, what (if anything) is different in the output? What if you ran
mutate(product_profit = sales - (cogs + total_expenses))
?Based on these outputs, describe what the
product_profit =
part of the command seems to be doing.We are not limited to only one input at a time in
mutate()
. As long as we separate each input by a comma, we can put as many inputs as we want in themutate()
function!Starting again with this example,
Modify it so that there’s a second input in
mutate()
,expiration_date = date + 2
.Explain how the dataframe output by this code is different to the dataframe output by the code in the previous exercise’s
mutate()
function. Also, what has happened by adding the number 2 to the values in thedate
column?It’s worth pausing for a moment and asking whether any of these commands are permanently changing the way the dataset looks in
coffeeshop
.Inspect your dataset using
View(coffeeshop)
(in the Console) to check. Based on what you observe, has the data frame incoffeeshop
been updating with each command that we have run in previous exercises, or is it unchanged?As a comparison, add the command
to a new code chunk, run the code chunk, and take a look at the contents of the new variable
coffeeshop_updated
. (You can find any variables that you have created in the Environment tab in the top right pane of RStudio.)Has the output of the
mutate()
function persisted in either thecoffeeshop
orcoffeeshop_updated
variables?Immutability
When input data is not changed by a function we say that it is “immutable”.
All the functions that we will be learning in R are immutable. Their transformed output is always a copy of the input, leaving the original input unchanged.
It is common to want to summarize the information contained within a dataset, such as computing sums and averages, or counting how many data points belong to different groups.
This is called data aggregation, as it aggregates many data points together and uses them to compute a cetain quantity. We perform data aggregation in R by using the commands
group_by()
andsummarize()
, which frequently show up as a pair.- The
group_by()
command is applied to one or more columns, and allows you create groups that share common values in a column of categorical data. - The
summarize()
command can be used when you want to do things like calculate the average number of units sold by each representative, or calculate the gross earnings of eachitem
.
group_by()
andsummarize()
are functions that are easier to understand using examples. First, add this example to a code chunk (we are usingmean()
to calculate the average of a column):Which product has the highest product profit on average?
Let’s run another example where we add together the numbers in a column using
sum()
:coffeeshop %>% group_by(market) %>% summarize( total_profit = sum(sales - (cogs + total_expenses)), total_sales = sum(sales), profit_margin = total_profit / total_sales )
Which market had the greatest profit margin?
- The
Small changes to how we group our data can have a big impact on what our summary tables look like. This gives us a lot of flexibility in aggregating our data for analysis.
Take the second code example from Exercise 6 and modify it so that it groups over two variables instead of one:
coffeeshop %>% group_by(product_line, type) %>% summarize( total_profit = sum(sales - (cogs + total_expenses)), total_sales = sum(sales), profit_margin = total_profit / total_sales )
How did this change the summary table from what you obtained in Exercise 6? Interpret the table you get as output (i.e. the profitability of different types of product).
How to submit
To submit your lab assignment, follow the two steps below. Your lab will be graded for credit after you’ve completed both steps!
Save, commit, and push your completed RMarkdown file so that everything is synchronized to GitHub. If you do this right, then you will be able to view your completed file on the GitHub website.
Knit your R Markdown document to the PDF format, export (download) the PDF file from RStudio Server, and then upload it to Lab 3 posting on Blackboard.
Credits
This lab is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. Exercises and instructions written by Dominic White, Ajay Kulkarni, Felicia Natalie Wijaya, and James Glasbrenner for CDS-102.