12 {DESeq2}: count normalization

This tutorial will use many tools found in the {tidyverse} package to reproduce how {DESeq2} calculates the size factors to normalize the count data.

12.1 Slides

You can download the slides for this tutorial below.

12.2 Learning outcomes

By the end of this tutorial, you should be able to:

Define the challenges of differential analysis.
Apply different count normalization strategies.
Reproduce count normalization of {DESeq2} in R using {tidyverse}.

You should have both R and RStudio installed. Please open RStudio and work with the examples by editing your deseq.R script.

12.3 Setup

12.3.1 R packages

We will work with the {tidyverse} package so make sure that it is loaded.

library(tidyverse)

12.3.2 Data

The RNAseq data were collected from Drosophila melanogaster in which the splicing regulator Pasilla was depleted by RNAi (treated) or not (untreated) (1). Download the data sets pasilla_gene_counts.csv and pasilla_metadata.csv below and place them in your deseq directory.

12.4 Working with data frames

12.4.1 Loading tabular data from a file

Data frames can be loaded into R using many different functions. In the {tidyverse}, they are called read_*(). For example, let’s inspect Spasilla_gene_counts.cs which contains a table using Notepad (Windows) or TextEdit (macOS).

gene_id,treated1,treated2,treated3,untreated1,untreated2,untreated3,untreated4
0000003,0,0,1,0,0,0,0
0000008,140,88,70,92,161,76,70
0000014,4,0,0,5,1,0,0

You will notice that the first row is different from all the others, and it actually contains column names for our data. Each row after the first one makes up a row in the table. You will also notice that commas separate values in each row. The first element in each row (i.e. before the first comma) makes up the value in the first column, each second element the value in the second column, and so on. Such a file is described as having comma-separated values. For that reason, files with that format often have the extension .csv.

We can load pasilla_gene_counts.csv into R with read_csv for a comma-separated file and specify the arguments that describe our data as follows:

file: the name of the file you want to load as a character data type, i.e. wrapped in quotes.
col_names: can take either the value TRUE or FALSE and tells R if the first row contains column names.

read_csv(file = "pasilla_gene_counts.csv", col_names = TRUE)

The data are printed to the console in R.

12.4.2 Save data in the environment

Since we want to do more with our data after reading it in, we need to save it as a variable in R with the <- operator. You can choose to name the object whatever you like, though this module assumes the names used below.

counts <- read_csv(file = "pasilla_gene_counts.csv", col_names = TRUE)

12.4.3 Data exploration

Let’s explore the data that we’ve imported into R by clicking on the triangle of the counts in the Environment tab in the top-right pane. You can see each column and its data type in a long list.

## spec_tbl_df [14,599 × 8] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ gene_id   : chr [1:14599] "0000003" "0000008" "0000014" "0000015" ...
##  $ treated1  : num [1:14599] 0 140 4 1 6205 ...
##  $ treated2  : num [1:14599] 0 88 0 0 3072 ...
##  $ treated3  : num [1:14599] 1 70 0 0 3334 ...
##  $ untreated1: num [1:14599] 0 92 5 0 4664 ...
##  $ untreated2: num [1:14599] 0 161 1 2 8714 ...
##  $ untreated3: num [1:14599] 0 76 0 1 3564 ...
##  $ untreated4: num [1:14599] 0 70 0 2 3150 310 0 3 0 672 ...

If we directly click on counts in our Environment tab on the top-right pane, we can view the table. Row and column names are in grey boxes, and data is in white boxes.

Using different functions, we can look at the dimensions of our data, number of rows, and number of columns:

# Number of rows followed by number of columns
dim(counts)

## [1] 14599     8

# Number of rows
nrow(counts)

## [1] 14599

# Number of columns
ncol(counts)

## [1] 8

We can list the column names using colnames():

colnames(counts)

## [1] "gene_id"    "treated1"   "treated2"   "treated3"   "untreated1"
## [6] "untreated2" "untreated3" "untreated4"

We can call specific columns using the $ notation, which is useful for calling an entire column, regardless of its length. For example, calling the untreated1 column:

head(counts$untreated1)

## [1]    0   92    5    0 4664  583

12.5 Data wrangling with {dplyr}

A popular package for data wrangling is {dplyr} in the {tidyverse}. This package is so good at what it does and integrates so well with other popular tools like {ggplot2} that it has rapidly become the de-facto standard.

{dplyr} code is very readable because all operations are based on using {dplyr} functions or verbs (mutate(), filter(), …). Each verb works similarly:

Input data frame in the first argument.
Other arguments can refer to variables as if they were local objects (i.e. you do not need to use quotes around a "<variable>" name and can just directly refer to it with <variable>).
Output is another data frame.

12.5.1 Creating of modifying variables with `mutate()`

Use mutate() to apply a transformation to some variable and assign it to a new variable. As the first step for determining the size factor for each sample, {DESeq2} calculates the natural logarithm of each gene count.

counts_single_log_column <- mutate(counts, log_untreated1 = log(untreated1))

If we want to determine the logarithm for each variable, then the code becomes quickly tedious and repetitive.

counts_two_log_column <- mutate(counts,
                                log_untreated1 = log(untreated1),
                                log_untreated2 = log(untreated2))

Instead, we can use across() to apply the same function to each variable and overwrite it. We can select a range of variables with <first_variable>:<last_variable>.

log_counts <- mutate(counts, across(treated1:untreated4, log))

Alternatively, we can define what variables the function NOT to apply to with !. In our example, we want to skip gene_id.

log_counts <- mutate(counts, across(!gene_id, log))

12.5.2 Summarizing across variables with `rowwise()` and `c_across()`

We can group our data with rowwise() and then use c_across(),to calculate the mean across multiple columns (step 2 of {DESeq2} normalization). The mean of the log values is not as sensitive to outliers compared to the means of the count values themselves.

log_counts_grouped <- rowwise(log_counts)
geometric_means <- mutate(log_counts_grouped,
                          mean = mean(c_across(treated1:untreated4)))

12.5.3 Piping with `%>%`

Because of the basic {dplyr} verb syntax, these functions lend themselves to piping, i.e. taking the output of one function and passing it in as a (by default the first) argument of the following function.

geometric_means <- log_counts %>% 
  rowwise() %>% 
  mutate(gm = mean(c_across(treated1:untreated4)))

12.5.4 Remove groupings with `ungroup()`

Because we are finished with the row-wise operation, we should remove the grouping with ungroup().

geometric_means <- log_counts %>% 
  rowwise() %>% 
  mutate(gm = mean(c_across(treated1:untreated4))) %>% 
  ungroup()

12.5.5 `filter()`

Conditional statements and logical operators are essential when working with data in R. You can use filter() to select specific rows based on a logical condition of a variable. For quick reference, here are the most commonly used statements and operators.

R code	meaning
`==`	equals
`< or >`	less/greater than
`<= or >=`	less/greater than or equal to
`%in%`	in
`is.na()`	is missing (NA)
`!`	not (as in not equal to `!=`)
`&`	and
`\|`	or

As step 3 of {DESeq2} count normalization, all rows that contain an infinite value for their geometric mean are removed. is.infinite() would return rows where the variable is infinite. !is.infinite() returns rows where it is not.

geometric_means <- log_counts %>% 
  rowwise() %>% 
  mutate(gm = mean(c_across(treated1:untreated4))) %>% 
  ungroup() %>% 
  filter(!is.infinite(gm))

In step 4 of {DESeq2} of count normalization, the geometric mean is subtracted of each $log(count) - log(gm) = log\left(\frac{{counts}}{{gm}}\right)$, i.e. it looks at the ratio of the sample to the mean.

ratio <- geometric_means %>% 
  mutate(across(treated1:untreated4, ~ . - gm))

12.5.6 Column-wise operation with `summarise()`

Now that we know the ratio of sample to mean, we select the median() across all remaining genes for each sample. To calculate a summary statistic for a column, we use summarise(). Finally, we take the inverse logarithm to receive the size factor for each sample.

size_factors <- ratio %>% 
  summarise(across(treated1:untreated4, ~ exp(median(.))))

12.6 Additional resources

12.6.1 Count normalization

12.6.2 {dplyr} and {tidyr}

R cheatsheets also available in RStudio under Help > Cheatsheets
Introduction to {dplyr}
{dplyr} tutorial
{dplyr} video tutorial
More functions in {dplyr} and {tidyr}

11 Introduction to R and RStudio

13 {DESeq2}: independent filtering