Lab 4: Tidying your dataset

This week’s lab will introduce you to the concepts of tidy data and how you can reshape your dataset to take advantage of the tidyverse tools. You will then be guided through the process of using these tools to reshape a real-world gene expression dataset that tested the effect of starvation and growth rate on baker’s yeast.¹

Tidy data

The principles of Tidy Data are visually represented in the figure below:

The three panels are an illustration of the following three rules,

Each variable must have its own column.
Each observation must have its own row.
Each value must have its own cell.

It is worth emphasizing that there is a difference between a tidy dataset and a dirty dataset. “Tidying” a dataset means reshaping it by transposing the rows and columns until the format matches the criteria outlined in the above rules, which then allows us to more easily use the ggplot2 and dplyr functions to analyze and visualize a dataset. Cleaning a “dirty” dataset means that you are fixing misspellings, data entry errors, and dealing with other irregularities in the raw data.

About this week’s dataset

The following quote, taken from a discussion about this paper², describes the meaning of this dataset pretty well:

Through the process of gene regulation, a cell can control which genes are transcribed from DNA to RNA — what we call being “expressed”. (If a gene is never turned into RNA, it may as well not be there at all). This provides a sort of “cellular switchboard” that can activate some systems and deactivate others, which can speed up or slow down growth, switch what nutrients are transported into or out of the cell, and respond to other stimuli. A gene expression microarray lets us measure how much of each gene is expressed in a particular condition. We can use this to figure out the function of a specific gene (based on when it turns on and off), or to get an overall picture of the cell’s activity.

Brauer 2008 used microarrays to test the effect of starvation and growth rate on baker’s yeast S. cerevisiae, a popular model organism for studying molecular genomics because of its simplicity). Basically, if you give yeast plenty of nutrients (a rich media), except that you sharply restrict its supply of one nutrient, you can control the growth rate to whatever level you desire (we do this with a tool called a chemostat). For example, you could limit the yeast’s supply of glucose (sugar, which the cell metabolizes to get energy and carbon), of leucine (an essential amino acid), or of ammonium (a source of nitrogen).

“Starving” the yeast of these nutrients lets us find genes that:

Raise or lower their activity in response to growth rate. Growth-rate dependent expression patterns can tell us a lot about cell cycle control, and how the cell responds to stress.

Respond differently when different nutrients are being limited. These genes may be involved in the transport or metabolism of those nutrients.

Variables

This is a tabular dataset with 5,537 rows and 40 columns:

Variable	Description
GID	One kind of ID for each gene
YORF	Yeast Open Reading Frame (a kind of sub-unit of genetic information)
NAME	See below
GWEIGHT	The paper doesn’t make this clear, but all entries are 1
[GNP][0.05 \(\le\) x \(\le\) 0.30]	The letters G, N, and P represent the restricted nutrient. The decimal value is the yeast growth rate.

The NAME column contains the following information separated by the double bar || symbols:

Variable	Description
Gene name	for example, SFB2. Not all genes have a name here.
Biological process	for example, “proteolysis and peptidolysis”
Molecular function	for example, “metalloendopeptidase activity”
Systematic ID	for example, YNL049C. Every gene has one of these unique IDs.
Unknown ID Number	for example, 1082129. The paper doesn’t explain what these mean.

Like in previous labs, it’s recommended that you take a first look at the dataset by viewing it by running View(brauer) in your Console window.

The tidyr package

Reshaping the gene expression dataset will require us to use two functions found in the tidyr package, pivot_longer() and separate(). Let’s review how each of these functions works with the extended example from Chapter 12.6 in the R for Data Science textbook.

Running the library(tidyverse) command at the top of our RMarkdown file loads many packages and example datasets for us, which includes a dataset from the World Health Organization that is stored in the variable who. The first few lines of the who dataset are:

country	iso2	iso3	year	new_sp_m014	new_sp_m1524	new_sp_m2534	new_sp_m3544	new_sp_m4554	…
Afghanistan	AF	AFG	1980	NA	NA	NA	NA	NA	…
Afghanistan	AF	AFG	1981	NA	NA	NA	NA	NA	…
Afghanistan	AF	AFG	1982	NA	NA	NA	NA	NA	…
Afghanistan	AF	AFG	1983	NA	NA	NA	NA	NA	…
Afghanistan	AF	AFG	1984	NA	NA	NA	NA	NA	…
Afghanistan	AF	AFG	1985	NA	NA	NA	NA	NA	…
…	…	…	…	…	…	…	…	…	…

Using `pivot_longer()` to reshape columns into rows

In this data frame, the names of the 57 columns starting with new_sp_m014 and ending with newrel_f65 each refer to a set of three categories, violating the first rule for tidy data. This can easily be fixed by transposing these columns into rows using the pivot_longer() command:

who1 <- who %>% 
  pivot_longer(
    cols = new_sp_m014:newrel_f65, 
    names_to = "key", 
    values_to = "cases", 
    values_drop_na = TRUE
    )

After applying the pivot_longer() operation, the first few rows in the dataset now look as follows:

country	iso2	iso3	year	key	cases
Afghanistan	AF	AFG	1997	new_sp_m014	0
Afghanistan	AF	AFG	1997	new_sp_m1524	10
Afghanistan	AF	AFG	1997	new_sp_m2534	6
Afghanistan	AF	AFG	1997	new_sp_m3544	3
Afghanistan	AF	AFG	1997	new_sp_m4554	5
Afghanistan	AF	AFG	1997	new_sp_m5564	2
…	…	…	…	…	…

As you can see, we’ve taken the 57 category columnms and converted them into categories underneath a single column named key with their corresponding values placed underneath the column cases.

To summarize, the syntax for pivot_longer() is as follows:

dataset %>%
  pivot_longer(
    cols = ...,           #  Columns you want to pivot into rows
    names_to = "...",   #  Variable for storing names of pivoted columns
    values_to = "..."  #  Variable for values stored under pivoted columns
  )

Using `separate()` to split one column into many

The other function you will need to use is separate(). This function takes values in a single column and splits them out into multiple columns. This is used when you have a dataset that doesn’t follow the third rule of tidy data.

Going back to the who example, after you pivoted the 57 columns into a single column, we might wonder what the values under key mean. If you run ?who in the Console, you’ll see that the underscores _ separate different variable values. This means that the key column contains 3 values per cell, not one, so this is a case for using separate().

To separate the columns, we run the following:

who2 <- who1 %>%
  mutate(key = str_replace(key, "newrel", "new_rel")) %>%
  separate(
    col = key,
    into = combine("new", "type", "sexage"),
    sep = "_"
  )

You can ignore the second line with a mutate() command for now, as this just fixes the missing underscore that you need to do before running separate.

The first few lines of the separated dataset are:

country	iso2	iso3	year	new	type	sexage	cases
Afghanistan	AF	AFG	1997	new	sp	m014	0
Afghanistan	AF	AFG	1997	new	sp	m1524	10
Afghanistan	AF	AFG	1997	new	sp	m2534	6
Afghanistan	AF	AFG	1997	new	sp	m3544	3
Afghanistan	AF	AFG	1997	new	sp	m4554	5
Afghanistan	AF	AFG	1997	new	sp	m5564	2
…	…	…	…	…	…	…	…

As you can see, this has successfully split our one column into three.

To summarize, the syntax for separate() is as follows:

dataset %>%
  separate(
    col = ...,            #  Name of column to separate
    into = combine(...),  #  Names for new columns formed after separation
    sep = ...,            #  Specifies the separator symbol(s) or position
    convert = ...         #  If TRUE, tries to set data type for new columns
  )

Your should note that the into keyword needs to be specified as a vector of strings that you can create using combine().

Submit the answers to the following questions in your lab report.

Prune the dataset

When dealing with larger datasets that have many columns, it is helpful if you can identify any columns that are extraneous and will not be used for analysis. This shrinks the overall size of the dataset so that it takes up less space in the computer’s memory. For this dataset, the GID, YORF, and GWEIGHT columns aren’t particularly important for any kind of analysis, so they can be removed.

Remove the GID, YORF, and GWEIGHT variables from the dataset and assign the result to the variable brauer2.

Enforcing one observation per row

Now we will employ a systematic approach while tidying this dataset by going through the rules one by one and determining what (if anything) needs to be reshaped. We start with the first rule that “each variable must have its own column.” The columns G0.05 through U0.3 each contain the results of different experimental trials, telling us which nutrient was restricted and what growth rate was selected for the yeast. Knowing this, we conclude that this dataset violates the first rule, as each row contains multiple experimental trials.

Use pivot_longer() to transpose the columns G0.05 through U0.3 into a series of rows to satisfy the rule that “each variable must have its own column. Use the label sample for the new categorical variable created by pivot_longer() that contains the pivoted column names and use the label expression for the new numerical variable containing the values originally under the G0.05 through U0.3 columns. Assign your result to the variable brauer3.

After completing this step, the one observation per row rule will be satisfied.

Enforcing one column per variable and one value per cell

The second and third rules are intertwined, and so fixing one will help to satisfy the other. Looking at the version of the dataset in brauer3, we can clearly see that the one value per cell rule is violated by the NAME and sample columns. These are violations that can be fixed using the separate() function. Let’s separate the sample column we created with the pivot_longer() function first.

Hint: If you don’t want the separate() function to delete any individual letters or symbols when splitting a column, you can set the sep argument equal to an integer number that specifies a position to split the values at. For example sep = 1 tells separate() to split the column right after the first letter.

Separate the sample column that you created in Exercise 2 into two variables, nutrient and rate. After separation, the data type for the rate column should be a number (<dbl> or <num>), not character. Assign your result to the variable brauer4.

Now we turn to the NAME column where each value is separated by two bars || and there are five variables contained in each cell. Using the information from the about this week’s dataset section, we deduce that we should name the new columns gene_name, biological_process, molecular_function, systematic_id, and number.

Hint: If you try to use sep = "||" in separate(), it will not work as expected. By default, the bar symbol | is interpreted as the boolean operator OR, which should be familiar from creating rules for filter(). To get around this, we can preceed | with backslashes to signal to R that we want it to read | as a symbol and not as the boolean operator OR. Thus, we should use sep = "\\|\\|" to define our separator.

Use separate() to split the values in the NAME column into five columns named gene_name, biological_process, molecular_function, systematic_id, and number. Assign your result to the variable brauer5.

We’re nearly done! If you inspect the dataset with View(brauer5), you’ll notice that the text under some of the columns has quotation marks with unnecessary blank space. For example, " molecular function unknown ". Removing this blank space (typically called “white space”) is called trimming, and we can use the str_trim() function for this purpose.

Remove the whitespace from the five new columns you created using mutate_at() in combination with str_trim():
```
brauer5 %>%
  mutate_at(vars(gene_name:systematic_id), str_trim)
```
Assign your result to the variable brauer_tidy.

Now we’re done!

Visualizations using the tidy dataset

Having a tidy dataset makes it easy to use the rest of the tidyverse packages, such as ggplot2. To demonstrate this, we will now create a plot that a biologist would use to explore gene expression trends that emerge due to restricting different nutrients and enforcing different growth rates.

Filter brauer_tidy so that it only contains entries with gene_name equal to LEU1. Then, create a line plot where you place rate on the horizontal axis, expression on the vertical axis, and assign different colors to nutrient. You will also have to add the group = nutrient argument inside the aes() function to group the observations into the correct lines.

How to submit

To submit your lab assignment, follow the two steps below. Your lab will be graded for credit after you’ve completed both steps!

Save, commit, and push your completed RMarkdown file so that everything is synchronized to GitHub. If you do this right, then you will be able to view your completed file on the GitHub website.
Knit your R Markdown document to the PDF format, export (download) the PDF file from RStudio Server, and then upload it to Lab 4 posting on Blackboard.

Credits

This lab is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. Exercises and instructions written by James Glasbrenner for CDS-102.

Brauer et. al., “Coordination of growth rate, cell cycle, stress response, and metabolic activity in yeast”, Mol. Biol. Cell 19, 352 (2008).↩︎
Robinson, David, “Cleaning and Visualizing Genomic Data: A Case Study in Tidy Analysis,” Variance Explained (2015).↩︎