12 {DESeq2}: count normalization
This tutorial will use many tools found in the {tidyverse} package to reproduce how {DESeq2} calculates the size factors to normalize the count data.
12.2 Learning outcomes
By the end of this tutorial, you should be able to:
- Define the challenges of differential analysis.
- Apply different count normalization strategies.
- Reproduce count normalization of {DESeq2} in R using {tidyverse}.
You should have both R and RStudio installed. Please open RStudio and work with the examples by editing your deseq.R
script.
12.3 Setup
12.3.2 Data
The RNAseq data were collected from Drosophila melanogaster in which the splicing regulator Pasilla was depleted by RNAi (treated
) or not (untreated
) (1). Download the data sets pasilla_gene_counts.csv
and pasilla_metadata.csv
below and place them in your deseq
directory.
12.4 Working with data frames
12.4.1 Loading tabular data from a file
Data frames can be loaded into R using many different functions. In the {tidyverse}, they are called read_*()
. For example, let’s inspect Spasilla_gene_counts.cs
which contains a table using Notepad (Windows) or TextEdit (macOS).
gene_id,treated1,treated2,treated3,untreated1,untreated2,untreated3,untreated4
0000003,0,0,1,0,0,0,0
0000008,140,88,70,92,161,76,70
0000014,4,0,0,5,1,0,0
You will notice that the first row is different from all the others, and it actually contains column names for our data. Each row after the first one makes up a row in the table. You will also notice that commas separate values in each row. The first element in each row (i.e. before the first comma) makes up the value in the first column, each second element the value in the second column, and so on. Such a file is described as having comma-separated values. For that reason, files with that format often have the extension .csv
.
We can load pasilla_gene_counts.csv
into R with read_csv
for a comma-separated file and specify the arguments that describe our data as follows:
-
file
: the name of the file you want to load as a character data type, i.e. wrapped in quotes. -
col_names
: can take either the valueTRUE
orFALSE
and tells R if the first row contains column names.
read_csv(file = "pasilla_gene_counts.csv", col_names = TRUE)
The data are printed to the console in R.
12.4.2 Save data in the environment
Since we want to do more with our data after reading it in, we need to save it as a variable in R with the <-
operator. You can choose to name the object whatever you like, though this module assumes the names used below.
counts <- read_csv(file = "pasilla_gene_counts.csv", col_names = TRUE)
12.4.3 Data exploration
Let’s explore the data that we’ve imported into R by clicking on the triangle of the counts
in the Environment tab in the top-right pane. You can see each column and its data type in a long list.
## spec_tbl_df [14,599 × 8] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ gene_id : chr [1:14599] "0000003" "0000008" "0000014" "0000015" ...
## $ treated1 : num [1:14599] 0 140 4 1 6205 ...
## $ treated2 : num [1:14599] 0 88 0 0 3072 ...
## $ treated3 : num [1:14599] 1 70 0 0 3334 ...
## $ untreated1: num [1:14599] 0 92 5 0 4664 ...
## $ untreated2: num [1:14599] 0 161 1 2 8714 ...
## $ untreated3: num [1:14599] 0 76 0 1 3564 ...
## $ untreated4: num [1:14599] 0 70 0 2 3150 310 0 3 0 672 ...
If we directly click on counts
in our Environment tab on the top-right pane, we can view the table. Row and column names are in grey boxes, and data is in white boxes.
Using different functions, we can look at the dimensions of our data, number of rows, and number of columns:
# Number of rows followed by number of columns
dim(counts)
## [1] 14599 8
# Number of rows
nrow(counts)
## [1] 14599
# Number of columns
ncol(counts)
## [1] 8
We can list the column names using colnames()
:
colnames(counts)
## [1] "gene_id" "treated1" "treated2" "treated3" "untreated1"
## [6] "untreated2" "untreated3" "untreated4"
We can call specific columns using the $
notation, which is useful for calling an entire column, regardless of its length. For example, calling the untreated1
column:
head(counts$untreated1)
## [1] 0 92 5 0 4664 583
12.5 Data wrangling with {dplyr}
A popular package for data wrangling is {dplyr} in the {tidyverse}. This package is so good at what it does and integrates so well with other popular tools like {ggplot2} that it has rapidly become the de-facto standard.
{dplyr} code is very readable because all operations are based on using {dplyr} functions or verbs (mutate()
, filter()
, …). Each verb works similarly:
- Input data frame in the first argument.
- Other arguments can refer to variables as if they were local objects (i.e. you do not need to use quotes around a
"<variable>"
name and can just directly refer to it with<variable>
). - Output is another data frame.
12.5.1 Creating of modifying variables with mutate()
Use mutate()
to apply a transformation to some variable and assign it to a new variable. As the first step for determining the size factor for each sample, {DESeq2} calculates the natural logarithm of each gene count.
If we want to determine the logarithm for each variable, then the code becomes quickly tedious and repetitive.
counts_two_log_column <- mutate(counts,
log_untreated1 = log(untreated1),
log_untreated2 = log(untreated2))
Instead, we can use across()
to apply the same function to each variable and overwrite it. We can select a range of variables with <first_variable>:<last_variable>
.
Alternatively, we can define what variables the function NOT to apply to with !
. In our example, we want to skip gene_id
.
12.5.2 Summarizing across variables with rowwise()
and c_across()
We can group our data with rowwise()
and then use c_across()
,to calculate the mean across multiple columns (step 2 of {DESeq2} normalization). The mean of the log values is not as sensitive to outliers compared to the means of the count values themselves.
12.5.3 Piping with %>%
Because of the basic {dplyr} verb syntax, these functions lend themselves to piping, i.e. taking the output of one function and passing it in as a (by default the first) argument of the following function.
12.5.4 Remove groupings with ungroup()
Because we are finished with the row-wise operation, we should remove the grouping with ungroup()
.
12.5.5 filter()
Conditional statements and logical operators are essential when working with data in R. You can use filter()
to select specific rows based on a logical condition of a variable. For quick reference, here are the most commonly used statements and operators.
R code | meaning |
---|---|
== |
equals |
< or > |
less/greater than |
<= or >= |
less/greater than or equal to |
%in% |
in |
is.na() |
is missing (NA) |
! |
not (as in not equal to != ) |
& |
and |
| |
or |
As step 3 of {DESeq2} count normalization, all rows that contain an infinite value for their geometric mean are removed. is.infinite()
would return rows where the variable is infinite. !is.infinite()
returns rows where it is not.
geometric_means <- log_counts %>%
rowwise() %>%
mutate(gm = mean(c_across(treated1:untreated4))) %>%
ungroup() %>%
filter(!is.infinite(gm))
In step 4 of {DESeq2} of count normalization, the geometric mean is subtracted of each \(log(count) - log(gm) = log\left(\frac{{counts}}{{gm}}\right)\), i.e. it looks at the ratio of the sample to the mean.
12.5.6 Column-wise operation with summarise()
Now that we know the ratio of sample to mean, we select the median()
across all remaining genes for each sample. To calculate a summary statistic for a column, we use summarise()
. Finally, we take the inverse logarithm to receive the size factor for each sample.
12.6 Additional resources
12.6.2 {dplyr} and {tidyr}
- R cheatsheets also available in RStudio under Help > Cheatsheets
- Introduction to {dplyr}
- {dplyr} tutorial
- {dplyr} video tutorial
- More functions in {dplyr} and {tidyr}