class: center, middle, inverse, title-slide # 7.4 Differential analysis with DESeq2: model fitting ## MICB 405 101 2021W1 Bioinformatics ###
Stephan Koenig
### University of British Columbia ### February 17, 2022 --- ## Learning outcomes - Describe properties of read count data and how DESeq models them. - Use R markdown to create reproducible documents. --- class: center middle ## Goal: Determine significant changes in gene expression between conditions --- ## Challenges to identify differentially expressed genes
- Distinguish technical variation from variation due to treatment. - Majority of genes do not change between treatments. - Only a few replicates per treatment, difficult to estimate variance. --- ## Challenges to identify differentially expressed genes
- Distinguish technical variation from variation due to treatment. - Majority of genes do not change between treatments. - **Only a few replicates per treatment, difficult to estimate variance.** --- ## Goal: Model count matrix Read count matrix `\(K_{ij}\)` for each gene (row) `\(i\)` and sample (column) `\(j\)` <table> <thead> <tr> <th style="text-align:right;"> treated1 </th> <th style="text-align:right;"> treated2 </th> <th style="text-align:right;"> treated3 </th> <th style="text-align:right;"> untreated1 </th> <th style="text-align:right;"> untreated2 </th> <th style="text-align:right;"> untreated3 </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> </tr> <tr> <td style="text-align:right;"> 140 </td> <td style="text-align:right;"> 88 </td> <td style="text-align:right;"> 70 </td> <td style="text-align:right;"> 92 </td> <td style="text-align:right;"> 161 </td> <td style="text-align:right;"> 76 </td> </tr> <tr> <td style="text-align:right;"> 4 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 5 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 0 </td> </tr> <tr> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 2 </td> <td style="text-align:right;"> 1 </td> </tr> <tr> <td style="text-align:right;"> 6205 </td> <td style="text-align:right;"> 3072 </td> <td style="text-align:right;"> 3334 </td> <td style="text-align:right;"> 4664 </td> <td style="text-align:right;"> 8714 </td> <td style="text-align:right;"> 3564 </td> </tr> <tr> <td style="text-align:right;"> 722 </td> <td style="text-align:right;"> 299 </td> <td style="text-align:right;"> 308 </td> <td style="text-align:right;"> 583 </td> <td style="text-align:right;"> 761 </td> <td style="text-align:right;"> 245 </td> </tr> </tbody> </table> ??? - Data: Pasilla data set - We aim to model the raw counts `\(K_{ij}\)` to infer the actual quantity of RNA in the cells. --- ## Read count distribution
<img src="data:image/png;base64,#model_fitting_files/figure-html/unnamed-chunk-3-1.png" width="1152" /> ??? - Data: Pasilla data set - Read counts do not follow normal distribution. - One approach: log-transformation. (Not testable) --- ## Properties of read counts
- Sparse event, i.e. small likelihood `\(p\)` of a read mapping to a specific gene. - Discrete and high number of events `\(n\)`, i.e. sampling depth (number of reads). -- - One way of modelling raw counts is the Poisson distribution. -- An important property of this distribution is that mean = variance ??? - Only holds true if samples are technical replicates. - Formula for Poisson distribution (not testable): `$$P\left( x \right) = \frac{{e^{ - \lambda } \lambda ^x }}{{x!}}$$` Calculates `\(P\left( x \right)\)` the **probability** of observing `\(x\)` **number of events** with `\(\lambda\)` **average number of events per interval**. One property of this distribution is that mean = variance = lambda --- ## In reality, RNAseq data is over-dispersed: <br /> mean < variance <img src="data:image/png;base64,#model_fitting_files/figure-html/unnamed-chunk-4-5-1.png" width="1152" /> ??? - Higher variability in low-expression genes. - Heteroscedasticity: difference in variability across means --- ## Dealing with variance
- Only few replicates/conditions make it difficult to estimate variance. - At low expression very noisy. -- ### {DESeq2} solution - Model count matrix with negative binomial distribution. - Borrow information across genes to estimate dispersion and ultimately variance. - Determine the log2 fold change and its significance (Wald statistic). --- ## Goal: Model count matrix Read count matrix `\(K_{ij}\)` for each gene (row) `\(i\)` and sample (column) `\(j\)` <table> <thead> <tr> <th style="text-align:right;"> treated1 </th> <th style="text-align:right;"> treated2 </th> <th style="text-align:right;"> treated3 </th> <th style="text-align:right;"> untreated1 </th> <th style="text-align:right;"> untreated2 </th> <th style="text-align:right;"> untreated3 </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> </tr> <tr> <td style="text-align:right;"> 140 </td> <td style="text-align:right;"> 88 </td> <td style="text-align:right;"> 70 </td> <td style="text-align:right;"> 92 </td> <td style="text-align:right;"> 161 </td> <td style="text-align:right;"> 76 </td> </tr> <tr> <td style="text-align:right;"> 4 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 5 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 0 </td> </tr> <tr> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 2 </td> <td style="text-align:right;"> 1 </td> </tr> <tr> <td style="text-align:right;"> 6205 </td> <td style="text-align:right;"> 3072 </td> <td style="text-align:right;"> 3334 </td> <td style="text-align:right;"> 4664 </td> <td style="text-align:right;"> 8714 </td> <td style="text-align:right;"> 3564 </td> </tr> <tr> <td style="text-align:right;"> 722 </td> <td style="text-align:right;"> 299 </td> <td style="text-align:right;"> 308 </td> <td style="text-align:right;"> 583 </td> <td style="text-align:right;"> 761 </td> <td style="text-align:right;"> 245 </td> </tr> </tbody> </table> --- ## Negative bionomial distribution Distribution with over-dispersion `$$K_{ij} \sim \operatorname{NB}\left(\mu_{ij}, \alpha_i \right)$$` with `\(K\)` raw count of gene `\(i\)` in sample `\(j\)` with fitted mean `\(\mu_{ij}\)` and gene-specific dispersion `\(\alpha_i\)`. -- with mean `$$\mu_{ij} = s_{j}q_{ij}$$` sample-specific size factor `\(s_j\)`, true expression level `\(q_{ij}\)` -- and variance `$$\operatorname{Var}\left(K_{ij} \right) = \mu_{ij} + \alpha_i\mu^2_{ij}$$` -- By estimating gene-wise dispersion `\(\alpha_i\)` we can determine the variance. ??? The tilde symbol `\(\sim\)` means "model." Size factor `\(s_j\)` was determined for count normalization. Form for negative binomial distribution (not testable) `$$k \mapsto \binom{k+ r - 1 }{k} \cdot (1-p)^rp^k$$` with `\(k\)` number of success(es) before `\(r\)` failure(s) with `\(p\)` probability of success with mean `\(\frac{{p r}}{{1 - p}}\)` and variance `\(\frac{{p r}}{{\left(1 - p \right)^2}}\)`. `\(\log_2\)` fold changes `$$\log_2\left(q_{ij}\right) = x_j\beta_i$$` with coefficients `\(\beta_i\)` giving the `\(\log_2\)` fold changes for gene `\(i\)` for each column of the model matrix `\(X\)`. --- ## Estimating gene-wise dispersion <img src="data:image/png;base64,#../../images/shrink_dipsersion_estimates.svg" width="60%" style="display: block; margin: auto;" /> Love et al., 2014, Genome Biol 15:550. ???