Lab 3: Wrangling Sales Data

This week’s lab will introduce you to data transformations using the dplyr library, which is loaded as part of tidyverse. To keep things from becoming too complicated, we practice these commands on a small, synthetic dataset that represents sales of caffeinated drinks.

What is “data wrangling”?

The previous lab focused on constructing visualizations using the ggplot2 library. The dataset used for that lab was selected because it was relatively small in size and could be visualized in interesting ways without the need for additional processing. It would certainly be nice if all datasets came in such a form! However, in reality, many datasets need cleaning and to be transformed before you can create a meaningful visualization or answer a question. The dataset need not even be messy; datasets that have been preprocessed and are clean and tidy still may require transformations. The full pipeline for obtaining data, cleaning it up, and transforming it is informally referred to as “data wrangling”, which is summarized in the figure from R for Data Science below:

We start with the transform part of the data wrangling pipeline, which relies on commands that are part of the dplyr package, itself part of tidyverse. Like many of the functions available in the tidyverse, the command names provide us with clues as to what they do. If you have any prior experience with spreadsheet software such as Microsoft Excel, you will probably find the commands in dplyr to be familiar.

About this week’s dataset

This is a synthetic dataset that contains sales data of office supplies to different regions of the United States. The dataset is stored in the variable coffeeshop. The data was taken from decisivedata.net.

Variables

This is a tabular dataset with 43 observations on the following variables:

Variable	Type	Description
cogs	dbl	cost of goods sold (the ingredient’s buying price; in US Dollars)
date	date	the date when the order’s placed
market	chr	the district in which the shop’s located
product_line	chr	the ingredient’s form (beans or leaves)
product	chr	product name
sales	dbl	the product’s selling price (in US Dollars)
state	chr	the state in which the shop’s located
total_expenses	dbl	the total amount spent on creating the product (in US Dollars)
type	chr	caffeinated (Regular) or decaffeinated (Decaf)

Like in previous labs, it’s recommended that you take a first look at the dataset by viewing it by running View(coffeeshop) in the Console pane of RStudio (you will need to run the set-up chunk in the RMarkdown file first).

While working through these exercises, you may find it helpful to refer back to the following resources:

Data Wrangling interactive tutorial from CDS 101 (see your CDS 101 Blackboard site for link)
Data transformation with dplyr cheatsheet
Data tidying with tidyr cheatsheet
Supplemental reading: R for Data Science by Wickham and Grolemund (online textbook):
- Data transformation chapter
- Tidy data chapter

Exercises

Let’s start with the select() function. The select() function selects columns from a dataset, which is useful when you’re working with a dataset that contains dozens of variables.

Try running the following code in the Console (you do not need to add this example code to your answer file):
```
select(coffeeshop, market)
```
The output will be a new dataframe with a single column. We can select more columns by passing them as additional arguments to the select function, e.g. select(coffeeshop, market, sales).

Obviously writing out the name of every column you need can be a little cumbersome if you are working with a lot of columns. Fortunately there are shortcuts we can use to specify multiple columns according to a pattern.

For example, if we wish for a series of consecutive columns we can just write the first and last columns separated by a colon :. This will select all the columns in that range, e.g. select(coffeeshop, product_line:sales).

Other operators and helper functions that we can use are listed on this web page.

Add the following four code examples to new code chunks in the Lab03.Rmd answer file, and write an explanation of what each does.
1. (Note that the ! operator is interchangable with the - operator in this context.)
```
select(coffeeshop, !cogs)
```
2. ```
select(coffeeshop, starts_with('prod'))
```
3. ```
select(coffeeshop, contains("pe"))
```
4. ```
select(coffeeshop, caffeination = type)
```
The strange looking symbol %>% is called the pipe operator, and it is a handy way to pass a dataset through a chain of commands. It sends the output of the expression on its left to be the first argument of the function on its right.
1. Create a new code chunk and rewrite this code so that we pipe the coffeeshop dataset to the select() function:
```
select(coffeeshop, caffeination = type)
```
2. Create a new code chunk and rewrite this code so that we pipe the coffeeshop dataset to the select() function, and then pipe the output of select() to filter():
```
filter(select(coffeeshop, starts_with('prod')), product == "Darjeeling")
```
  Note
  
  This code example nests one function inside another. While this is valid R code, it is hard to read, and it is much easier to understand if we rewrite it as a series of operations connected via the pipe operator.
Let’s now try an example that uses the mutate() function, which is a little more complex. mutate() lets us transform our dataset by applying the same operation to each row in the dataset and appending the results as a new column. This would allow you to create a new column in your dataset called product_profit that contains the revenue from each order.

To calculate the revenue an order, we need to deduct the expenses (cost of goods sold (COGS) and operating expenses) from our sales. Do this by adding the following mutate() command to a new code chunk:
```
coffeeshop %>%
  mutate(product_profit = sales - (cogs + total_expenses))
```
After confirming that the above command works, copy the same code into a new code block and remove product_profit = from the function input.

Does the code still run? If so, what (if anything) is different in the output? What if you ran mutate(product_profit = sales - (cogs + total_expenses))?

Based on these outputs, describe what the product_profit = part of the command seems to be doing.
We are not limited to only one input at a time in mutate(). As long as we separate each input by a comma, we can put as many inputs as we want in the mutate() function!

Starting again with this example,
```
coffeeshop %>%
  mutate(product_profit = sales - (cogs + total_expenses))
```
Modify it so that there’s a second input in mutate(), expiration_date = date + 2.

Explain how the dataframe output by this code is different to the dataframe output by the code in the previous exercise’s mutate() function. Also, what has happened by adding the number 2 to the values in the date column?
It’s worth pausing for a moment and asking whether any of these commands are permanently changing the way the dataset looks in coffeeshop.

Inspect your dataset using View(coffeeshop) (in the Console) to check. Based on what you observe, has the data frame in coffeeshop been updating with each command that we have run in previous exercises, or is it unchanged?

As a comparison, add the command
```
coffeeshop_updated <- mutate(
  coffeeshop, 
  product_profit = sales - (cogs + total_expenses)
 )
```
to a new code chunk, run the code chunk, and take a look at the contents of the new variable coffeeshop_updated. (You can find any variables that you have created in the Environment tab in the top right pane of RStudio.)

Has the output of the mutate() function persisted in either the coffeeshop or coffeeshop_updated variables?

Immutability

When input data is not changed by a function we say that it is “immutable”.

All the functions that we will be learning in R are immutable. Their transformed output is always a copy of the input, leaving the original input unchanged.
It is common to want to summarize the information contained within a dataset, such as computing sums and averages, or counting how many data points belong to different groups.

This is called data aggregation, as it aggregates many data points together and uses them to compute a cetain quantity. We perform data aggregation in R by using the commands group_by() and summarize(), which frequently show up as a pair.
- The group_by() command is applied to one or more columns, and allows you create groups that share common values in a column of categorical data.
- The summarize() command can be used when you want to do things like calculate the average number of units sold by each representative, or calculate the gross earnings of each item.
group_by() and summarize() are functions that are easier to understand using examples. First, add this example to a code chunk (we are using mean() to calculate the average of a column):
```
coffeeshop_updated %>%
  group_by(product) %>%
  summarize(avg_profit = mean(product_profit))
```
Which product has the highest product profit on average?

Let’s run another example where we add together the numbers in a column using sum():
```
coffeeshop %>%
  group_by(market) %>%
  summarize(
    total_profit = sum(sales - (cogs + total_expenses)),
    total_sales = sum(sales),
    profit_margin = total_profit / total_sales
    )
```
Which market had the greatest profit margin?
Small changes to how we group our data can have a big impact on what our summary tables look like. This gives us a lot of flexibility in aggregating our data for analysis.

Take the second code example from Exercise 6 and modify it so that it groups over two variables instead of one:
```
coffeeshop %>%
  group_by(product_line, type) %>%
  summarize(
    total_profit = sum(sales - (cogs + total_expenses)),
    total_sales = sum(sales),
    profit_margin = total_profit / total_sales
    )
```
How did this change the summary table from what you obtained in Exercise 6? Interpret the table you get as output (i.e. the profitability of different types of product).

How to submit

To submit your lab assignment, follow the two steps below. Your lab will be graded for credit after you’ve completed both steps!

Save, commit, and push your completed RMarkdown file so that everything is synchronized to GitHub. If you do this right, then you will be able to view your completed file on the GitHub website.
Knit your R Markdown document to the PDF format, export (download) the PDF file from RStudio Server, and then upload it to Lab 3 posting on Blackboard.

Credits

This lab is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. Exercises and instructions written by Dominic White, Ajay Kulkarni, Felicia Natalie Wijaya, and James Glasbrenner for CDS-102.