7.2 Differential analysis with DESeq2: count normalization

class: center, middle, inverse, title-slide

# 7.2 Differential analysis with DESeq2: count normalization
## MICB 405 101 2021W1 Bioinformatics
### <a href="mailto:stephan.koenig@ubc.ca">Stephan Koenig</a>
### University of British Columbia
### February 17, 2022

---

## Learning outcomes

- Define the challenges of differential analysis.

- Apply different count normalization strategies.

- Reproduce DESeq2's count normalization in R using the tidyverse.

---

class: center middle

## Goal: Determine significant changes in gene expression between conditions

---

## Challenges to identify differentially expressed genes <svg aria-hidden="true" role="img" viewBox="0 0 512 512" style="height:1em;width:1em;vertical-align:-0.125em;margin-left:auto;margin-right:auto;font-size:inherit;fill:FireBrick;overflow:visible;position:relative;"><path d="M497.9 142.1l-46.1 46.1c-4.7 4.7-12.3 4.7-17 0l-111-111c-4.7-4.7-4.7-12.3 0-17l46.1-46.1c18.7-18.7 49.1-18.7 67.9 0l60.1 60.1c18.8 18.7 18.8 49.1 0 67.9zM284.2 99.8L21.6 362.4.4 483.9c-2.9 16.4 11.4 30.6 27.8 27.8l121.5-21.3 262.6-262.6c4.7-4.7 4.7-12.3 0-17l-111-111c-4.8-4.7-12.4-4.7-17.1 0zM124.1 339.9c-5.5-5.5-5.5-14.3 0-19.8l154-154c5.5-5.5 14.3-5.5 19.8 0s5.5 14.3 0 19.8l-154 154c-5.5 5.5-14.3 5.5-19.8 0zM88 424h48v36.3l-64.5 11.3-31.1-31.1L51.7 376H88v48z"/></svg>

- Distinguish technical variation from variation due to treatment.

- The majority of genes do not change between treatments.

- Only a few replicates per treatment, so variance is difficult to estimate.

---

- **Distinguish technical variation from variation due to treatment.**

- The majority of genes do not change between treatments.

- Only a few replicates per treatment, so variance is difficult to estimate.

---

## Why count normalization? <svg aria-hidden="true" role="img" viewBox="0 0 512 512" style="height:1em;width:1em;vertical-align:-0.125em;margin-left:auto;margin-right:auto;font-size:inherit;fill:FireBrick;overflow:visible;position:relative;"><path d="M497.9 142.1l-46.1 46.1c-4.7 4.7-12.3 4.7-17 0l-111-111c-4.7-4.7-4.7-12.3 0-17l46.1-46.1c18.7-18.7 49.1-18.7 67.9 0l60.1 60.1c18.8 18.7 18.8 49.1 0 67.9zM284.2 99.8L21.6 362.4.4 483.9c-2.9 16.4 11.4 30.6 27.8 27.8l121.5-21.3 262.6-262.6c4.7-4.7 4.7-12.3 0-17l-111-111c-4.8-4.7-12.4-4.7-17.1 0zM124.1 339.9c-5.5-5.5-5.5-14.3 0-19.8l154-154c5.5-5.5 14.3-5.5 19.8 0s5.5 14.3 0 19.8l-154 154c-5.5 5.5-14.3 5.5-19.8 0zM88 424h48v36.3l-64.5 11.3-31.1-31.1L51.7 376H88v48z"/></svg>

The counts of genes that are not differentially expressed should not vary between samples because of **sampling depth** or **RNA composition**. To correct for both, we determine a **size factor** for each sample and divide that sample's counts by it.
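A minimal sketch of what a size factor does, in plain Python. The counts and the size factors here are made-up toy values (in practice the size factors are estimated from the data), and sample B is assumed to be sequenced twice as deep as sample A.

```python
# Toy counts for three genes that are NOT differentially expressed.
# Hypothetical numbers: sample B was sequenced twice as deep as sample A.
counts = {
    "A": [10, 200, 50],
    "B": [20, 400, 100],
}

# Assumed size factors for illustration only; normally estimated from the data.
size_factors = {"A": 1.0, "B": 2.0}

# Divide every count in a sample by that sample's size factor.
normalized = {
    sample: [count / size_factors[sample] for count in gene_counts]
    for sample, gene_counts in counts.items()
}

print(normalized["A"])
print(normalized["B"])
```

After dividing by the size factors, the three genes have the same value in both samples, so the difference in sequencing depth no longer looks like a change in expression.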

???

We do not normalize for gene length because we only compare the same gene across samples (within-gene comparisons).

---

## Count normalization in DESeq2 <svg aria-hidden="true" role="img" viewBox="0 0 512 512" style="height:1em;width:1em;vertical-align:-0.125em;margin-left:auto;margin-right:auto;font-size:inherit;fill:FireBrick;overflow:visible;position:relative;"><path d="M497.9 142.1l-46.1 46.1c-4.7 4.7-12.3 4.7-17 0l-111-111c-4.7-4.7-4.7-12.3 0-17l46.1-46.1c18.7-18.7 49.1-18.7 67.9 0l60.1 60.1c18.8 18.7 18.8 49.1 0 67.9zM284.2 99.8L21.6 362.4.4 483.9c-2.9 16.4 11.4 30.6 27.8 27.8l121.5-21.3 262.6-262.6c4.7-4.7 4.7-12.3 0-17l-111-111c-4.8-4.7-12.4-4.7-17.1 0zM124.1 339.9c-5.5-5.5-5.5-14.3 0-19.8l154-154c5.5-5.5 14.3-5.5 19.8 0s5.5 14.3 0 19.8l-154 154c-5.5 5.5-14.3 5.5-19.8 0zM88 424h48v36.3l-64.5 11.3-31.1-31.1L51.7 376H88v48z"/></svg>

1. Take the natural logarithm of each gene count.

2. Calculate the mean of the log counts for each gene (row). This is the log of the geometric mean and serves as a pseudo-reference sample.

3. Remove genes with infinite values (a zero count gives log(0) = -Inf).

4. Subtract the log reference from the log counts (equivalent to the log of the ratio of counts to the reference).

`$$\log(\text{counts}) - \log(\text{reference}) = \log\left(\frac{\text{counts}}{\text{reference}}\right)$$`

5. Calculate the median of the log ratios for each sample.

6. Exponentiate the median to convert it back from the log scale. The result is the sample's size factor.
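The six steps can be sketched in plain Python as follows. This is an illustration, not DESeq2's own code: the gene names, sample names, counts, and the `safe_log` helper are all hypothetical.

```python
import math
from statistics import median

# Hypothetical raw counts: one row per gene, one column per sample.
samples = ["A", "B", "C"]
counts = {
    "gene1": [100, 200, 50],
    "gene2": [10, 20, 5],
    "gene3": [0, 30, 8],      # zero count in sample A: removed in step 3
    "gene4": [400, 800, 200],
    "gene5": [30, 60, 150],   # behaves differently in sample C
}

def safe_log(x):
    """Natural log that returns -inf for zero, like log(0) in R."""
    return math.log(x) if x > 0 else float("-inf")

# Step 1: natural logarithm of every count.
log_counts = {g: [safe_log(k) for k in row] for g, row in counts.items()}

# Step 2: row mean of the logs = log of the geometric mean (pseudo-reference).
log_reference = {g: sum(row) / len(row) for g, row in log_counts.items()}

# Step 3: remove genes whose reference is infinite (a zero count in any sample).
kept = [g for g in counts if math.isfinite(log_reference[g])]

# Step 4: log(counts) - log(reference) = log(counts / reference).
log_ratios = {g: [lc - log_reference[g] for lc in log_counts[g]] for g in kept}

# Steps 5 and 6: median log ratio per sample, converted back with exp().
size_factors = {
    sample: math.exp(median(log_ratios[g][j] for g in kept))
    for j, sample in enumerate(samples)
}

print(size_factors)
```

With these toy counts the size factors come out at roughly 1, 2, and 0.5 for samples A, B, and C. The atypical `gene5` does not shift them, because the median ignores its extreme ratio.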

???

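Together, these steps make the size factors robust to outliers: removing infinite values drops genes with a zero count in any sample, the geometric mean dampens the influence of extreme counts, and the median ignores genes with extreme ratios, such as strongly differentially expressed genes.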