This week’s lab will introduce you to the concept of tidy data and how you can reshape your dataset to take advantage of the tidyverse tools. You will then be guided through the process of using these tools to reshape a real-world gene expression dataset that tested the effect of starvation and growth rate on baker’s yeast.
Tidy data
The principles of Tidy Data are visually represented in the figure below:
The three panels illustrate the following three rules:
- Each variable must have its own column.
- Each observation must have its own row.
- Each value must have its own cell.
It is worth emphasizing that tidying a dataset is not the same as cleaning one. “Tidying” a dataset means reshaping it, by transposing rows and columns, until its format matches the three rules above, which then lets us use the ggplot2 and dplyr functions more easily to analyze and visualize it. Cleaning a “dirty” dataset means fixing misspellings and data entry errors and dealing with other irregularities in the raw data.
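To make the three rules concrete, here is a minimal sketch using a made-up table (the data frame untidy, its column names, and its values are all hypothetical). The year is a variable, but its values are stored in the column names, so the table is not tidy. The sketch uses pivot_longer(), which is introduced later in this lab.

```r
library(tidyverse)

# Hypothetical untidy table: "year" is a variable, but its values
# (1999 and 2000) are spread across the column names.
untidy <- tibble(
  country = c("A", "B"),
  `1999` = c(10, 20),
  `2000` = c(15, 25)
)

# Tidy version: one column per variable (country, year, cases) and
# one row per observation (one country in one year).
tidy <- untidy %>%
  pivot_longer(
    cols = c(`1999`, `2000`),
    names_to = "year",
    values_to = "cases"
  )

print(tidy)
```

Note that nothing was "cleaned" here: every value is unchanged, only the shape of the table is different.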
About this week’s dataset
The following quote, taken from a discussion about this paper, describes the meaning of this dataset pretty well:
Through the process of gene regulation, a cell can control which genes are transcribed from DNA to RNA — what we call being “expressed”. (If a gene is never turned into RNA, it may as well not be there at all). This provides a sort of “cellular switchboard” that can activate some systems and deactivate others, which can speed up or slow down growth, switch what nutrients are transported into or out of the cell, and respond to other stimuli. A gene expression microarray lets us measure how much of each gene is expressed in a particular condition. We can use this to figure out the function of a specific gene (based on when it turns on and off), or to get an overall picture of the cell’s activity.
Brauer 2008 used microarrays to test the effect of starvation and growth rate on baker’s yeast (S. cerevisiae, a popular model organism for studying molecular genomics because of its simplicity). Basically, if you give yeast plenty of nutrients (a rich media), except that you sharply restrict its supply of one nutrient, you can control the growth rate to whatever level you desire (we do this with a tool called a chemostat). For example, you could limit the yeast’s supply of glucose (sugar, which the cell metabolizes to get energy and carbon), of leucine (an essential amino acid), or of ammonium (a source of nitrogen).
“Starving” the yeast of these nutrients lets us find genes that:
- Raise or lower their activity in response to growth rate. Growth-rate dependent expression patterns can tell us a lot about cell cycle control, and how the cell responds to stress.
- Respond differently when different nutrients are being limited. These genes may be involved in the transport or metabolism of those nutrients.
Variables
This is a tabular dataset with 5,537 rows and 40 columns:
Variable | Description |
---|---|
GID | One kind of ID for each gene |
YORF | Yeast Open Reading Frame (a kind of sub-unit of genetic information) |
NAME | See below |
GWEIGHT | The paper doesn’t make this clear, but all entries are 1 |
[GNPSLU][0.05 \(\le\) x \(\le\) 0.30] | The letter (G, N, P, S, L, or U) represents the restricted nutrient. The decimal value is the yeast growth rate. |
The NAME column contains the following information, separated by the double bar || symbols:
Variable | Description |
---|---|
Gene name | for example, SFB2. Not all genes have a name here. |
Biological process | for example, “proteolysis and peptidolysis” |
Molecular function | for example, “metalloendopeptidase activity” |
Systematic ID | for example, YNL049C. Every gene has one of these unique IDs. |
Unknown ID Number | for example, 1082129. The paper doesn’t explain what these mean. |
As in previous labs, we recommend that you take a first look at the dataset by running View(brauer) in your Console window.
The tidyr package
Reshaping the gene expression dataset will require us to use two functions found in the tidyr package, pivot_longer() and separate(). Let’s review how each of these functions works with the extended example from Chapter 12.6 in the R for Data Science textbook.
Running the library(tidyverse) command at the top of our RMarkdown file loads many packages and example datasets for us, including a dataset from the World Health Organization that is stored in the variable who. The first few lines of the who dataset are:
country | iso2 | iso3 | year | new_sp_m014 | new_sp_m1524 | new_sp_m2534 | new_sp_m3544 | new_sp_m4554 | … |
---|---|---|---|---|---|---|---|---|---|
Afghanistan | AF | AFG | 1980 | NA | NA | NA | NA | NA | … |
Afghanistan | AF | AFG | 1981 | NA | NA | NA | NA | NA | … |
Afghanistan | AF | AFG | 1982 | NA | NA | NA | NA | NA | … |
Afghanistan | AF | AFG | 1983 | NA | NA | NA | NA | NA | … |
Afghanistan | AF | AFG | 1984 | NA | NA | NA | NA | NA | … |
Afghanistan | AF | AFG | 1985 | NA | NA | NA | NA | NA | … |
… | … | … | … | … | … | … | … | … | … |
Using pivot_longer() to reshape columns into rows
In this data frame, the names of the 56 columns starting with new_sp_m014 and ending with newrel_f65 each refer to a combination of three categories. The column names are therefore values of variables rather than variables themselves, which violates the first rule for tidy data. This can be fixed by transposing these columns into rows using the pivot_longer() command:
who1 <- who %>%
  pivot_longer(
    cols = new_sp_m014:newrel_f65,
    names_to = "key",
    values_to = "cases",
    values_drop_na = TRUE
  )
After applying the pivot_longer() operation, the first few rows in the dataset now look as follows:
country | iso2 | iso3 | year | key | cases |
---|---|---|---|---|---|
Afghanistan | AF | AFG | 1997 | new_sp_m014 | 0 |
Afghanistan | AF | AFG | 1997 | new_sp_m1524 | 10 |
Afghanistan | AF | AFG | 1997 | new_sp_m2534 | 6 |
Afghanistan | AF | AFG | 1997 | new_sp_m3544 | 3 |
Afghanistan | AF | AFG | 1997 | new_sp_m4554 | 5 |
Afghanistan | AF | AFG | 1997 | new_sp_m5564 | 2 |
… | … | … | … | … | … |
As you can see, we’ve taken the 56 category columns and converted them into categories underneath a single column named key, with their corresponding values placed underneath the column cases.
To summarize, the syntax for pivot_longer() is as follows:
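A sketch of the general pattern, inferred from the who1 call above, is given in the comments below. It is followed by a small runnable example; the data frame scores and its columns are made up purely for illustration.

```r
library(tidyverse)

# General pattern (placeholders, mirroring the who1 example above):
#   dataset %>%
#     pivot_longer(
#       cols = ...,            # Columns to pivot into rows
#       names_to = ...,        # Name of new column holding the old column names
#       values_to = ...,       # Name of new column holding the cell values
#       values_drop_na = ...   # If TRUE, drops rows whose value is NA
#     )

# Hypothetical toy data: one row per student, one column per quiz.
scores <- tibble(
  student = c("x", "y"),
  quiz1 = c(8, NA),
  quiz2 = c(9, 7)
)

# Pivot the two quiz columns into (quiz, score) pairs, dropping the NA.
scores_long <- scores %>%
  pivot_longer(
    cols = quiz1:quiz2,
    names_to = "quiz",
    values_to = "score",
    values_drop_na = TRUE
  )

print(scores_long)
```

Because values_drop_na = TRUE, the missing quiz1 score for student y produces no row in scores_long.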
Using separate() to split one column into many
The other function you will need to use is separate(). This function takes values in a single column and splits them out into multiple columns. This is used when you have a dataset that doesn’t follow the third rule of tidy data.
Going back to the who example, after we pivoted the 56 columns into a single column, we might wonder what the values under key mean. If you run ?who in the Console, you’ll see that the underscores _ separate different variable values. This means that the key column contains three values per cell, not one, so this is a case for using separate().
To separate the columns, we run the following:
who2 <- who1 %>%
  mutate(key = str_replace(key, "newrel", "new_rel")) %>%
  separate(
    col = key,
    into = combine("new", "type", "sexage"),
    sep = "_"
  )
You can ignore the second line with the mutate() command for now, as it just inserts a missing underscore, which needs to be done before running separate().
The first few lines of the separated dataset are:
country | iso2 | iso3 | year | new | type | sexage | cases |
---|---|---|---|---|---|---|---|
Afghanistan | AF | AFG | 1997 | new | sp | m014 | 0 |
Afghanistan | AF | AFG | 1997 | new | sp | m1524 | 10 |
Afghanistan | AF | AFG | 1997 | new | sp | m2534 | 6 |
Afghanistan | AF | AFG | 1997 | new | sp | m3544 | 3 |
Afghanistan | AF | AFG | 1997 | new | sp | m4554 | 5 |
Afghanistan | AF | AFG | 1997 | new | sp | m5564 | 2 |
… | … | … | … | … | … | … | … |
As you can see, this has successfully split our one column into three.
To summarize, the syntax for separate() is as follows:
dataset %>%
  separate(
    col = ...,            # Name of column to separate
    into = combine(...),  # Names for new columns formed after separation
    sep = ...,            # Specifies the separator symbol(s) or position
    convert = ...         # If TRUE, tries to set data type for new columns
  )
You should note that the into keyword needs to be specified as a vector of strings, which you can create using combine() (base R’s c() works as well).
Submit the answers to the following questions in your lab report.
Prune the dataset
When dealing with larger datasets that have many columns, it is helpful if you can identify any columns that are extraneous and will not be used for analysis. This shrinks the overall size of the dataset so that it takes up less space in the computer’s memory. For this dataset, the GID, YORF, and GWEIGHT columns aren’t particularly important for any kind of analysis, so they can be removed.
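As a reminder of how columns can be dropped, here is a minimal sketch on a made-up data frame (survey and all of its columns are hypothetical). It uses select() from dplyr with a minus sign in front of each column to remove.

```r
library(tidyverse)

# Hypothetical toy data with two columns we do not need for analysis.
survey <- tibble(
  respondent = 1:3,
  internal_code = c("a", "b", "c"),
  weight = c(1, 1, 1),
  answer = c("yes", "no", "yes")
)

# A minus sign inside select() drops the named column and keeps the rest.
survey_small <- survey %>%
  select(-internal_code, -weight)

print(survey_small)
```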
- Remove the GID, YORF, and GWEIGHT variables from the dataset and assign the result to the variable brauer2.
Enforcing one observation per row
Now we will take a systematic approach to tidying this dataset by going through the rules one by one and determining what (if anything) needs to be reshaped. We start with the rule that “each observation must have its own row.” The columns G0.05 through U0.3 each contain the results of a different experimental trial, telling us which nutrient was restricted and what growth rate was selected for the yeast. Knowing this, we conclude that this dataset violates the rule, as each row contains multiple experimental trials.
- Use pivot_longer() to transpose the columns G0.05 through U0.3 into a series of rows to satisfy the rule that “each observation must have its own row.” Use the label sample for the new categorical variable created by pivot_longer() that contains the pivoted column names, and use the label expression for the new numerical variable containing the values originally under the G0.05 through U0.3 columns. Assign your result to the variable brauer3.
After completing this step, the one observation per row rule will be satisfied.
Enforcing one column per variable and one value per cell
The remaining two rules, one column per variable and one value per cell, are intertwined, so fixing one will help to satisfy the other. Looking at the version of the dataset in brauer3, we can see that the one value per cell rule is violated by the NAME and sample columns. These violations can be fixed using the separate() function. Let’s first separate the sample column we created with the pivot_longer() function.
Hint: If you don’t want the separate() function to delete any individual letters or symbols when splitting a column, you can set the sep argument equal to an integer that specifies the position at which to split the values. For example, sep = 1 tells separate() to split the column right after the first letter.
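The positional split described in the hint can be sketched on a made-up data frame (plates and its well labels are hypothetical). The example also sets convert = TRUE, which asks separate() to guess a data type for each new column.

```r
library(tidyverse)

# Hypothetical toy data: a letter followed directly by a number.
plates <- tibble(well = c("A1", "B12", "C7"))

# sep = 1 splits each value right after its first character, so no
# characters are consumed as a separator.
plates_split <- plates %>%
  separate(
    col = well,
    into = c("plate_row", "plate_col"),
    sep = 1,
    convert = TRUE  # the numeric part becomes a number, not text
  )

print(plates_split)
```

Without convert = TRUE, both new columns would be stored as character.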
- Separate the sample column that you created in Exercise 2 into two variables, nutrient and rate. After separation, the data type for the rate column should be a number (<dbl> or <num>), not character. Assign your result to the variable brauer4.
Now we turn to the NAME column, where the values are separated by two bars || and there are five variables contained in each cell. Using the information from the “About this week’s dataset” section, we deduce that we should name the new columns gene_name, biological_process, molecular_function, systematic_id, and number.
Hint: If you try to use sep = "||" in separate(), it will not work as expected. The sep argument is interpreted as a regular expression, in which the bar symbol | acts as the boolean operator OR, which should be familiar from creating rules for filter(). To get around this, we can precede each | with backslashes to signal to R that we want it to read | as a literal symbol and not as the OR operator. Thus, we should use sep = "\\|\\|" to define our separator.
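The escaped separator can be tried out on a made-up data frame before applying it to the real dataset (records and its contents are hypothetical, and base R’s c() is used here in place of combine()).

```r
library(tidyverse)

# Hypothetical toy data: two fields joined by a double bar.
records <- tibble(entry = c("apple||red", "banana||yellow"))

# "\\|\\|" is the R string for the regular expression \|\| ,
# which matches two literal bar characters in a row.
records_split <- records %>%
  separate(
    col = entry,
    into = c("fruit", "color"),
    sep = "\\|\\|"
  )

print(records_split)
```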
- Use separate() to split the values in the NAME column into five columns named gene_name, biological_process, molecular_function, systematic_id, and number. Assign your result to the variable brauer5.
We’re nearly done! If you inspect the dataset with View(brauer5), you’ll notice that the text in some of the columns is padded with unnecessary blank space. For example, " molecular function unknown ". Removing this blank space (typically called “whitespace”) is called trimming, and we can use the str_trim() function for this purpose.
- Remove the whitespace from the five new columns you created using mutate_at() in combination with str_trim(). Assign your result to the variable brauer_tidy.
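The combination of mutate_at() and str_trim() can be sketched on a made-up data frame (padded and its columns are hypothetical). The vars() helper lists the columns that the function should be applied to.

```r
library(tidyverse)

# Hypothetical toy data with leading and trailing spaces in two columns.
padded <- tibble(
  id = 1:2,
  label = c(" alpha ", "beta  "),
  note = c("  x", " y ")
)

# Apply str_trim() to the label and note columns only; id is untouched.
trimmed <- padded %>%
  mutate_at(vars(label, note), str_trim)

print(trimmed)
```

In recent versions of dplyr, the same result can be obtained with mutate(across(c(label, note), str_trim)).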
Now we’re done!
Visualizations using the tidy dataset
Having a tidy dataset makes it easy to use the rest of the tidyverse packages, such as ggplot2. To demonstrate this, we will now create a plot that a biologist would use to explore gene expression trends that emerge due to restricting different nutrients and enforcing different growth rates.
- Filter brauer_tidy so that it only contains entries with gene_name equal to LEU1. Then, create a line plot where you place rate on the horizontal axis, expression on the vertical axis, and assign different colors to nutrient. You will also have to add the group = nutrient argument inside the aes() function to group the observations into the correct lines.
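To see what the group argument does, here is a minimal sketch of a grouped line plot on made-up data (growth, its columns, and its values are all hypothetical and unrelated to the yeast dataset).

```r
library(tidyverse)

# Hypothetical toy data: two conditions measured at three doses each.
growth <- tibble(
  condition = rep(c("low", "high"), each = 3),
  dose = rep(c(0.1, 0.2, 0.3), times = 2),
  response = c(1.0, 1.5, 2.1, 0.8, 0.6, 0.3)
)

# group = condition tells geom_line() to draw one line per condition,
# and color = condition gives each of those lines its own color.
p <- ggplot(growth) +
  geom_line(aes(x = dose, y = response, color = condition, group = condition))

print(p)
```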
How to submit
To submit your lab assignment, follow the two steps below. Your lab will be graded for credit after you’ve completed both steps!
Save, commit, and push your completed RMarkdown file so that everything is synchronized to GitHub. If you do this right, then you will be able to view your completed file on the GitHub website.
Knit your R Markdown document to the PDF format, export (download) the PDF file from RStudio Server, and then upload it to the Lab 4 posting on Blackboard.
Credits
This lab is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. Exercises and instructions written by James Glasbrenner for CDS-102.