class: center, middle, inverse, title-slide

# 7.3 Differential analysis with DESeq2: independent filtering
## MICB 405 101 2021W1 Bioinformatics
### Stephan Koenig
### University of British Columbia
### February 17, 2022

---

## Learning outcomes

- Explain how a large number of negatives, combined with controlling the FDR, affects the retention of positives.
- Produce a toy data set in R to explore the relationship among an FDR-controlling procedure, negatives, and positives.
- Visualize population and probability distributions in R.
- Refactor code into functions for repeated use.

---

class: center middle

## Goal: Determine significant changes in gene expression between conditions

---

## Challenges to identify differentially expressed genes

- Distinguish technical variation from variation due to treatment.
- Majority of genes do not change between treatments.
- Only a few replicates per treatment, difficult to estimate variance.

---

## Challenges to identify differentially expressed genes

- Distinguish technical variation from variation due to treatment.
- **Majority of genes do not change between treatments.**
- Only a few replicates per treatment, difficult to estimate variance.

---

## Limits to FDR-controlling procedures

- Multiple testing (tens of thousands of genes) causes false positives.
- When the FDR is controlled, the more negatives are tested, the more positives are missed (false negatives).

---

## Modeling gene expression

- If a gene is not differentially expressed between two conditions, the samples from both conditions come from the same distribution.
- If a gene is differentially expressed between two conditions, the samples come from two different distributions.

--

### Caveats

- We will use a *t*-test, although DESeq2 does **NOT** use it.
- We will use continuous normal distributions to generate our data, although gene counts are discrete.

---

## Same distributions (non-differentially expressed genes)

<img src="data:image/png;base64,#independent_filtering_files/figure-html/unnamed-chunk-2-1.png" width="1152" />

---

## Two distributions (differentially expressed genes)

<img src="data:image/png;base64,#independent_filtering_files/figure-html/unnamed-chunk-3-1.png" width="1152" />

---

## Modeling gene expression
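
The toy model above (normal distributions compared with a *t*-test) can be sketched in base R. This is a minimal sketch under assumed values that are not from the slides: 3 replicates per condition, a standard deviation of 1, and a mean shift of 3 for differentially expressed genes. `simulate_p_values()` is a hypothetical helper name.

```r
set.seed(405)  # arbitrary seed so the sketch is reproducible

# Simulate `n_genes` genes and return one t-test p-value per gene.
# `shift` is the difference in means between the two conditions
# (0 means both conditions sample from the same distribution).
simulate_p_values <- function(n_genes, shift, n_reps = 3) {
  replicate(n_genes, {
    control   <- rnorm(n_reps, mean = 0, sd = 1)
    treatment <- rnorm(n_reps, mean = shift, sd = 1)
    t.test(control, treatment)$p.value
  })
}

p_null <- simulate_p_values(n_genes = 2000, shift = 0)  # not differentially expressed
p_alt  <- simulate_p_values(n_genes = 2000, shift = 3)  # differentially expressed

# Compare the two p-value distributions side by side.
par(mfrow = c(1, 2))
hist(p_null, breaks = 20, main = "Same distribution", xlab = "p-value")
hist(p_alt,  breaks = 20, main = "Two distributions", xlab = "p-value")

# Fraction of p-values below 0.05 in each group.
mean(p_null < 0.05)
mean(p_alt < 0.05)
```

Wrapping the simulation in a function makes it easy to rerun with a different number of genes, replicates, or effect size.
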

- If a gene is not differentially expressed between two conditions, the samples come from the same distribution, and the **distribution of *p*-values is uniform** from 0 to 1.
- If a gene is differentially expressed between two conditions, the samples come from two different distributions, and the **distribution of *p*-values is skewed toward 0**, with most *p*-values below 0.05.

---

## Benjamini-Hochberg method

Adjust *p*-values by making them larger:

1. Rank *p*-values (from smallest to largest) and start with the largest.
1. Adjust each *p*-value by taking the smaller of
    - the adjusted *p*-value of the next higher rank, `\(\text{adjusted } p\text{-value}_{\text{rank}+1}\)` (not applicable for the highest rank), or
    - `\(p\text{-value}_{\text{rank}} \cdot \frac{\text{total number of } p\text{-values}}{\text{rank}}\)`

---

## Limits to FDR-controlling procedures
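
A minimal sketch in base R of the Benjamini-Hochberg steps, checked against the built-in `p.adjust(method = "BH")`, and then used to illustrate the limit named in this slide's title: the same positives are retained less often when more negatives are tested alongside them. `bh_adjust()` and `count_retained()` are hypothetical helper names, and all *p*-values are made-up toy values (the negatives are idealised as evenly spread between 0 and 1).

```r
# Manual Benjamini-Hochberg adjustment following the ranking steps.
bh_adjust <- function(p) {
  n <- length(p)
  ord <- order(p)                     # indices from smallest to largest p-value
  scaled <- p[ord] * n / seq_len(n)   # p-value * total number of p-values / rank
  # Walk from the largest rank down, keeping the running minimum.
  adjusted_sorted <- rev(cummin(rev(scaled)))
  adjusted <- numeric(n)
  adjusted[ord] <- pmin(adjusted_sorted, 1)  # adjusted p-values cannot exceed 1
  adjusted
}

p_toy <- c(0.001, 0.008, 0.039, 0.041, 0.27, 0.60)
all.equal(bh_adjust(p_toy), p.adjust(p_toy, method = "BH"))

# Toy positives: 100 small p-values, log-spaced between 1e-6 and 1e-2.
p_positives <- 10^seq(-6, -2, length.out = 100)

# Number of positives still significant after adding `n_negatives`
# evenly spread p-values and applying the BH adjustment.
count_retained <- function(n_negatives, alpha = 0.05) {
  p_negatives <- seq_len(n_negatives) / (n_negatives + 1)
  p_all <- c(p_positives, p_negatives)
  is_positive <- seq_along(p_all) <= length(p_positives)
  sum(p.adjust(p_all, method = "BH")[is_positive] < alpha)
}

retained <- sapply(c(0, 1000, 10000), count_retained)
retained
```

The positives themselves never change between the three runs; only the number of negatives does.
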

- Multiple testing (tens of thousands of genes) causes false positives.
- When the FDR is controlled, the more negatives are tested, the more positives are missed (false negatives).

--

### Solution

- The variance of low-expressed genes cannot be estimated reliably.
- Remove low-expressed genes.

---

## Independent Filtering

- Remove genes with low counts because it is hard to get an accurate count. Keep only genes with
  `$$\text{sample mean} > \text{filter threshold}$$`
- Determine the number of significant genes for different thresholds (expressed as quantiles) and plot the number of significant genes vs. quantiles.
- Fit a curve.
- Determine the filter threshold with the SD of the fitted curve: choose the lowest threshold at which the number of significant genes exceeds
  `$$\text{max of curve} - \text{SD}$$`
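
The procedure above can be sketched in base R on made-up data. Everything here is an assumption for illustration: the exponentially distributed mean counts, the rule that only genes with a mean count above 50 can be differentially expressed, `alpha = 0.1`, and the `lowess()` smoother with its span. The sketch mimics the idea of independent filtering and is not DESeq2's implementation; `count_significant()` is a hypothetical helper name.

```r
set.seed(405)  # arbitrary seed so the sketch is reproducible

n_genes <- 5000
# Hypothetical toy data: one mean count and one p-value per gene.
mean_count <- rexp(n_genes, rate = 1 / 100)
# Assumption: only genes with a mean count above 50 can be differentially
# expressed (about 10% of them); those genes get small p-values.
is_de <- mean_count > 50 & runif(n_genes) < 0.1
p_value <- ifelse(is_de, rbeta(n_genes, 1, 200), runif(n_genes))

# Candidate filter thresholds, expressed as quantiles of the mean count.
quantiles <- seq(0, 0.95, by = 0.05)
thresholds <- quantile(mean_count, probs = quantiles)

# Number of significant genes after filtering at a given threshold.
count_significant <- function(threshold, alpha = 0.1) {
  keep <- mean_count > threshold
  sum(p.adjust(p_value[keep], method = "BH") < alpha)
}
n_significant <- sapply(thresholds, count_significant)

# Fit a smooth curve and compute the SD of the residuals around it.
fit <- lowess(quantiles, n_significant, f = 0.4)
residual_sd <- sqrt(mean((n_significant - fit$y)^2))

# Lowest threshold whose count exceeds the curve's maximum minus one SD
# (fall back to the first threshold if no count exceeds the cutoff).
cutoff <- max(fit$y) - residual_sd
above <- n_significant > cutoff
chosen <- if (any(above)) which(above)[1] else 1

plot(quantiles, n_significant,
     xlab = "Quantile of mean count", ylab = "Number of significant genes")
lines(fit, col = "red")
abline(v = quantiles[chosen], lty = 2)

thresholds[chosen]
```

Because the filter uses only the mean count and not the *p*-value, removing genes below the chosen threshold reduces the number of tests without looking at the test results.
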