7.3 Differential analysis with DESeq2: independent filtering

class: center, middle, inverse, title-slide

# 7.3 Differential analysis with DESeq2: independent filtering
## MICB 405 101 2021W1 Bioinformatics
### <a href="mailto:stephan.koenig@ubc.ca">Stephan Koenig</a>
### University of British Columbia
### February 17, 2022

---

## Learning outcomes

- Explain how a large number of negatives, combined with FDR control, affects the retention of positives.

- Produce a toy data set in R to explore the relationship among FDR-controlling procedures, negatives, and positives.

- Visualize population and probability distributions in R.

- Refactor code into functions for repeated use.

---

class: center middle

## Goal: Determine significant changes in gene expression between conditions

---

## Challenges in identifying differentially expressed genes <svg aria-hidden="true" role="img" viewBox="0 0 512 512" style="height:1em;width:1em;vertical-align:-0.125em;margin-left:auto;margin-right:auto;font-size:inherit;fill:FireBrick;overflow:visible;position:relative;"><path d="M497.9 142.1l-46.1 46.1c-4.7 4.7-12.3 4.7-17 0l-111-111c-4.7-4.7-4.7-12.3 0-17l46.1-46.1c18.7-18.7 49.1-18.7 67.9 0l60.1 60.1c18.8 18.7 18.8 49.1 0 67.9zM284.2 99.8L21.6 362.4.4 483.9c-2.9 16.4 11.4 30.6 27.8 27.8l121.5-21.3 262.6-262.6c4.7-4.7 4.7-12.3 0-17l-111-111c-4.8-4.7-12.4-4.7-17.1 0zM124.1 339.9c-5.5-5.5-5.5-14.3 0-19.8l154-154c5.5-5.5 14.3-5.5 19.8 0s5.5 14.3 0 19.8l-154 154c-5.5 5.5-14.3 5.5-19.8 0zM88 424h48v36.3l-64.5 11.3-31.1-31.1L51.7 376H88v48z"/></svg>

- Distinguish technical variation from variation due to treatment.

- The majority of genes do not change between treatments.

- Only a few replicates per treatment, which makes variance difficult to estimate.

---

- Distinguish technical variation from variation due to treatment.

- **The majority of genes do not change between treatments.**

- Only a few replicates per treatment, which makes variance difficult to estimate.

---

## Limits to FDR-controlling procedures <svg aria-hidden="true" role="img" viewBox="0 0 512 512" style="height:1em;width:1em;vertical-align:-0.125em;margin-left:auto;margin-right:auto;font-size:inherit;fill:FireBrick;overflow:visible;position:relative;"><path d="M497.9 142.1l-46.1 46.1c-4.7 4.7-12.3 4.7-17 0l-111-111c-4.7-4.7-4.7-12.3 0-17l46.1-46.1c18.7-18.7 49.1-18.7 67.9 0l60.1 60.1c18.8 18.7 18.8 49.1 0 67.9zM284.2 99.8L21.6 362.4.4 483.9c-2.9 16.4 11.4 30.6 27.8 27.8l121.5-21.3 262.6-262.6c4.7-4.7 4.7-12.3 0-17l-111-111c-4.8-4.7-12.4-4.7-17.1 0zM124.1 339.9c-5.5-5.5-5.5-14.3 0-19.8l154-154c5.5-5.5 14.3-5.5 19.8 0s5.5 14.3 0 19.8l-154 154c-5.5 5.5-14.3 5.5-19.8 0zM88 424h48v36.3l-64.5 11.3-31.1-31.1L51.7 376H88v48z"/></svg>

- Multiple testing (tens of thousands of genes) causes false positives.

- With FDR correction, the more true negatives are tested, the more true positives are lost as false negatives.
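
The effect of additional negatives can be explored with a small simulation. The following is a minimal sketch, not part of DESeq2: the helper `count_retained()`, the number of positives, and the effect size of 3 standard errors are all assumptions chosen for illustration. The *p*-values of the positives stay fixed while more and more uniformly distributed negative *p*-values are added before Benjamini-Hochberg adjustment.

```r
set.seed(405)  # arbitrary seed so the sketch is reproducible

n_positives <- 100

# p-values of true positives from a two-sided z-test.
# Assumption: every positive has an effect of 3 standard errors.
p_positives <- 2 * pnorm(-abs(rnorm(n_positives, mean = 3)))

# Hypothetical helper: number of positives that remain significant after
# BH adjustment when n_negatives null p-values are tested alongside them.
count_retained <- function(n_negatives, alpha = 0.05) {
  p_negatives <- runif(n_negatives)  # null p-values are uniform on [0, 1]
  p_adjusted <- p.adjust(c(p_positives, p_negatives), method = "BH")
  sum(p_adjusted[seq_len(n_positives)] < alpha)
}

n_negatives <- c(0, 100, 1000, 10000, 100000)
retained <- sapply(n_negatives, count_retained)

data.frame(n_negatives, retained_positives = retained)
```

In this toy setting the number of retained positives should shrink as negatives are added, even though the positives themselves never change.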

---

## Modeling gene expression <svg aria-hidden="true" role="img" viewBox="0 0 512 512" style="height:1em;width:1em;vertical-align:-0.125em;margin-left:auto;margin-right:auto;font-size:inherit;fill:FireBrick;overflow:visible;position:relative;"><path d="M497.9 142.1l-46.1 46.1c-4.7 4.7-12.3 4.7-17 0l-111-111c-4.7-4.7-4.7-12.3 0-17l46.1-46.1c18.7-18.7 49.1-18.7 67.9 0l60.1 60.1c18.8 18.7 18.8 49.1 0 67.9zM284.2 99.8L21.6 362.4.4 483.9c-2.9 16.4 11.4 30.6 27.8 27.8l121.5-21.3 262.6-262.6c4.7-4.7 4.7-12.3 0-17l-111-111c-4.8-4.7-12.4-4.7-17.1 0zM124.1 339.9c-5.5-5.5-5.5-14.3 0-19.8l154-154c5.5-5.5 14.3-5.5 19.8 0s5.5 14.3 0 19.8l-154 154c-5.5 5.5-14.3 5.5-19.8 0zM88 424h48v36.3l-64.5 11.3-31.1-31.1L51.7 376H88v48z"/></svg>

- If a gene is not differentially expressed between two conditions, the samples come from the same distribution.

- If a gene is differentially expressed between two conditions, the samples come from two different distributions.

### Caveats

- We will use a *t*-test, although DESeq2 does **NOT** use it.

- We will use continuous normal distributions to generate our data, although gene counts are discrete.
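
A toy version of this model might look as follows. This is only a sketch under the caveats above: the means, standard deviation, and number of replicates are assumed values, and the *t*-test stands in for the test DESeq2 actually performs.

```r
set.seed(405)  # arbitrary seed so the sketch is reproducible

n_replicates <- 3  # assumed number of replicates per condition

# Non-differentially expressed gene: both conditions share one distribution.
control_same   <- rnorm(n_replicates, mean = 10, sd = 1)
treatment_same <- rnorm(n_replicates, mean = 10, sd = 1)

# Differentially expressed gene: the treatment mean is shifted (assumed shift).
control_diff   <- rnorm(n_replicates, mean = 10, sd = 1)
treatment_diff <- rnorm(n_replicates, mean = 14, sd = 1)

# Student's t-test with equal variances, used here for illustration only.
p_same <- t.test(control_same, treatment_same, var.equal = TRUE)$p.value
p_diff <- t.test(control_diff, treatment_diff, var.equal = TRUE)$p.value

c(same = p_same, different = p_diff)
```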

---

## Same distributions (non-differentially expressed genes)

---

## Two distributions (differentially expressed genes)

---

- If a gene is not differentially expressed between two conditions, the samples come from the same distribution, and the **distribution of *p*-values is uniform** from 0 to 1.

- If a gene is differentially expressed between two conditions, the samples come from two different distributions, and the **distribution of *p*-values is skewed toward 0**, with most *p*-values below 0.05 if the difference is large enough.
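
Both *p*-value distributions can be checked by repeating the test for many simulated genes. The sketch below wraps one simulated gene in a hypothetical helper, `simulate_p_value()`; the means, standard deviation, replicate number, and number of genes are assumptions for illustration.

```r
set.seed(405)  # arbitrary seed so the sketch is reproducible

# Hypothetical helper: simulate one gene in two conditions and return the
# p-value of a Student's t-test (equal variances).
simulate_p_value <- function(mean_treatment, mean_control = 10, sd = 1, n = 3) {
  control   <- rnorm(n, mean = mean_control, sd = sd)
  treatment <- rnorm(n, mean = mean_treatment, sd = sd)
  t.test(control, treatment, var.equal = TRUE)$p.value
}

n_genes <- 2000

# Same distribution in both conditions (not differentially expressed).
p_null <- replicate(n_genes, simulate_p_value(mean_treatment = 10))

# Shifted treatment mean (differentially expressed, assumed shift of 4 SD).
p_alt <- replicate(n_genes, simulate_p_value(mean_treatment = 14))

# Fraction of p-values below 0.05 in each group.
c(null = mean(p_null < 0.05), alternative = mean(p_alt < 0.05))

# Histograms: roughly flat for the null genes, piled up near 0 otherwise.
hist(p_null, breaks = 20, main = "Same distribution", xlab = "p-value")
hist(p_alt, breaks = 20, main = "Two distributions", xlab = "p-value")
```

For the null genes, about 5% of *p*-values should fall below 0.05, as expected for a uniform distribution.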

---

## Benjamini-Hochberg method <svg aria-hidden="true" role="img" viewBox="0 0 512 512" style="height:1em;width:1em;vertical-align:-0.125em;margin-left:auto;margin-right:auto;font-size:inherit;fill:FireBrick;overflow:visible;position:relative;"><path d="M497.9 142.1l-46.1 46.1c-4.7 4.7-12.3 4.7-17 0l-111-111c-4.7-4.7-4.7-12.3 0-17l46.1-46.1c18.7-18.7 49.1-18.7 67.9 0l60.1 60.1c18.8 18.7 18.8 49.1 0 67.9zM284.2 99.8L21.6 362.4.4 483.9c-2.9 16.4 11.4 30.6 27.8 27.8l121.5-21.3 262.6-262.6c4.7-4.7 4.7-12.3 0-17l-111-111c-4.8-4.7-12.4-4.7-17.1 0zM124.1 339.9c-5.5-5.5-5.5-14.3 0-19.8l154-154c5.5-5.5 14.3-5.5 19.8 0s5.5 14.3 0 19.8l-154 154c-5.5 5.5-14.3 5.5-19.8 0zM88 424h48v36.3l-64.5 11.3-31.1-31.1L51.7 376H88v48z"/></svg>

Adjust *p*-values by making them larger:

1. Rank *p*-values (from smallest to largest) and start with the largest.

1. Adjust each *p*-value by taking the smaller of

    - The adjusted *p*-value of the next higher rank (not applicable for the highest rank)

        `$\operatorname{adjusted\ p-value}_{rank+1}$`, or

    - `$\operatorname{p-value}_{rank} \cdot \frac{\text{total number of p-values}}{rank}$`
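
The two steps above can be written out in R and compared against base R's `p.adjust()`. This is a minimal sketch: `adjust_bh()` is a hypothetical helper written for this slide, and the example *p*-values are made up.

```r
# Hypothetical helper that follows the two steps above literally.
adjust_bh <- function(p_values) {
  n <- length(p_values)
  if (n == 0) return(numeric(0))

  rank_order <- order(p_values)   # indices from smallest to largest p-value
  sorted <- p_values[rank_order]
  adjusted <- numeric(n)

  # Highest rank: total / rank equals 1, so the p-value is unchanged.
  adjusted[n] <- sorted[n]

  # Work down from the second-highest rank to rank 1.
  for (rank in rev(seq_len(n - 1))) {
    adjusted[rank] <- min(adjusted[rank + 1], sorted[rank] * n / rank)
  }

  # Return the adjusted p-values in the original order.
  adjusted[order(rank_order)]
}

p_values <- c(0.01, 0.04, 0.03, 0.005, 0.8)  # made-up example values

adjust_bh(p_values)
p.adjust(p_values, method = "BH")

# Both approaches should agree.
isTRUE(all.equal(adjust_bh(p_values), p.adjust(p_values, method = "BH")))
```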
    
---

- Multiple testing (tens of thousands of genes) causes false positives.

- With FDR correction, the more true negatives are tested, the more true positives are lost as false negatives.

### Solution

- The variance of lowly expressed genes cannot be estimated reliably.

- Remove lowly expressed genes before adjusting *p*-values.

---

## Independent Filtering <svg aria-hidden="true" role="img" viewBox="0 0 512 512" style="height:1em;width:1em;vertical-align:-0.125em;margin-left:auto;margin-right:auto;font-size:inherit;fill:FireBrick;overflow:visible;position:relative;"><path d="M497.9 142.1l-46.1 46.1c-4.7 4.7-12.3 4.7-17 0l-111-111c-4.7-4.7-4.7-12.3 0-17l46.1-46.1c18.7-18.7 49.1-18.7 67.9 0l60.1 60.1c18.8 18.7 18.8 49.1 0 67.9zM284.2 99.8L21.6 362.4.4 483.9c-2.9 16.4 11.4 30.6 27.8 27.8l121.5-21.3 262.6-262.6c4.7-4.7 4.7-12.3 0-17l-111-111c-4.8-4.7-12.4-4.7-17.1 0zM124.1 339.9c-5.5-5.5-5.5-14.3 0-19.8l154-154c5.5-5.5 14.3-5.5 19.8 0s5.5 14.3 0 19.8l-154 154c-5.5 5.5-14.3 5.5-19.8 0zM88 424h48v36.3l-64.5 11.3-31.1-31.1L51.7 376H88v48z"/></svg>

- Remove genes with low counts, which are hard to quantify accurately. Keep a gene only if

`$$\text{sample mean} > \text{filter threshold}$$`

- Determine the number of significant genes for different thresholds (expressed as quantiles) and plot the number of significant genes vs. quantiles.

- Fit curve.

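- Determine the filter threshold using the SD of the residuals around the fitted curve: choose the lowest threshold at which the number of significant genes is within one SD of the curve's maximum.

`$$\text{significant genes at filter threshold} \geq \text{max of curve} - SD$$`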
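
The procedure can be tried out on toy data. The sketch below is not DESeq2's implementation: the simulated mean counts, the share of differentially expressed genes, the link between expression and effect size, and the `lowess()` smoothing span are all assumptions made for illustration.

```r
set.seed(405)  # arbitrary seed so the sketch is reproducible

n_genes <- 10000
alpha <- 0.05

# Assumed toy data: a mean count per gene and 10% truly changing genes.
mean_count <- rlnorm(n_genes, meanlog = 3, sdlog = 2)
is_de <- runif(n_genes) < 0.1

# Assumption: the detectable effect grows with expression, so lowly
# expressed genes yield almost uniform p-values even when they change.
effect <- ifelse(is_de, 0.5 * sqrt(mean_count), 0)
p_values <- 2 * pnorm(-abs(rnorm(n_genes, mean = effect)))

# Candidate filter thresholds, expressed as quantiles of the mean count.
quantiles <- seq(0, 0.95, by = 0.05)
thresholds <- quantile(mean_count, probs = quantiles)

# Number of significant genes when only genes above a threshold are tested.
n_significant <- sapply(thresholds, function(threshold) {
  keep <- mean_count > threshold
  sum(p.adjust(p_values[keep], method = "BH") < alpha)
})

# Fit a smooth curve and compute the SD of the residuals around it.
fit <- lowess(quantiles, n_significant, f = 0.4)
residual_sd <- sqrt(mean((n_significant - fit$y)^2))

# Lowest quantile with a count within one SD of the curve's maximum;
# fall back to no filtering if no quantile qualifies.
candidates <- which(n_significant >= max(fit$y) - residual_sd)
chosen <- if (length(candidates) > 0) candidates[1] else 1

plot(quantiles, n_significant,
     xlab = "Quantile of mean count", ylab = "Significant genes")
lines(fit)
abline(v = quantiles[chosen], lty = 2)

c(quantile = quantiles[chosen], threshold = unname(thresholds[chosen]))
```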