Data normalization is the process of transforming data to a common scale so that they can be compared [1]. This is especially useful in microbial ecology, where data often come from diverse samples that were processed in different ways, both physically and computationally [2]. Here we cover several common normalization methods that can be applied in our Data Manipulator app.
Percent Relative Abundance (PRA) is a technique that transforms the data into percentages within each sample. Also known as Relative Species Abundance in microbial ecology, it is a measure of how common a species is relative to other species in a defined sample [3].
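The calculation described above can be sketched in a few lines. This is a minimal illustration, not the Data Manipulator app's own code: it assumes a sample is stored as a mapping from taxon name to raw count, and the function name percent_relative_abundance and the example taxa are hypothetical.

```python
def percent_relative_abundance(counts):
    """Convert raw counts for one sample into percentages of the sample total.

    `counts` is assumed to be a dict mapping taxon name -> raw count.
    """
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sample has no counts; percentages are undefined")
    return {taxon: 100.0 * n / total for taxon, n in counts.items()}


# Hypothetical sample with 100 reads in total.
sample = {"Bacteroides": 50, "Prevotella": 30, "Roseburia": 20}
pra = percent_relative_abundance(sample)
print(pra)  # {'Bacteroides': 50.0, 'Prevotella': 30.0, 'Roseburia': 20.0}
```

Because every sample is rescaled to sum to 100, samples with very different totals can be placed side by side, although the values within one sample are no longer independent of each other.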
Strengths:
Weaknesses:
Random Subsampling, also known as rarefaction, is a technique that draws random subsets from the data [4]. It is used to determine the species richness of samples that differ in area, volume, or sampling effort [5].
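One common way to apply random subsampling to count data is to draw the same number of individual observations, without replacement, from every sample. The sketch below assumes that approach and is not the Data Manipulator app's own implementation. The function name rarefy, the depth and seed parameters, and the example taxa are all hypothetical.

```python
import random


def rarefy(counts, depth, seed=None):
    """Randomly draw `depth` observations without replacement from one sample.

    `counts` is assumed to be a dict mapping taxon name -> raw count.
    Returns a dict with the same keys and the subsampled counts.
    """
    # Expand the counts into one entry per observed individual.
    pool = [taxon for taxon, n in counts.items() for _ in range(n)]
    if depth > len(pool):
        raise ValueError("depth exceeds the number of observations in the sample")

    rng = random.Random(seed)  # a seed makes the draw reproducible
    drawn = rng.sample(pool, depth)

    rarefied = {taxon: 0 for taxon in counts}
    for taxon in drawn:
        rarefied[taxon] += 1
    return rarefied


# Two hypothetical samples with unequal sampling effort (1000 vs. 200 reads).
deep = {"Bacteroides": 700, "Prevotella": 295, "Roseburia": 5}
shallow = {"Bacteroides": 120, "Prevotella": 80, "Roseburia": 0}

# Subsample both to the size of the smaller sample before comparing richness.
depth = min(sum(deep.values()), sum(shallow.values()))
for name, sample in (("deep", deep), ("shallow", shallow)):
    sub = rarefy(sample, depth, seed=1)
    richness = sum(1 for n in sub.values() if n > 0)
    print(name, sub, "richness:", richness)
```

Observations left out of the draw are discarded, so rare taxa can disappear from the subsampled data.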
Strengths:
Weaknesses:
Multiple Imputation is a statistical technique for analyzing incomplete or missing data via a three-step process [10, 11]:
Strengths:
Weaknesses:
Variance Stabilizing Transformation (VST) applies a function f to the values x in a dataset to create y = f(x) such that the variability of the values y is not related to their mean value (that is, y has a constant variance) [18].
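A textbook illustration of this idea, offered here as an assumption and not as the transformation used by any particular package, is the square root applied to Poisson-distributed counts. The variance of a raw Poisson count equals its mean, whereas the variance of its square root stays close to 1/4 whatever the mean. The sketch below computes both quantities directly from the Poisson probabilities; the helper names are hypothetical.

```python
import math


def poisson_pmf(k, lam):
    """Probability of observing k under a Poisson distribution with mean lam."""
    return math.exp(k * math.log(lam) - lam - math.lgamma(k + 1))


def mean_and_variance(lam, f=lambda x: x):
    """Mean and variance of f(X) for X ~ Poisson(lam), by direct summation.

    The sum is truncated far into the upper tail, where the remaining
    probability mass is negligible.
    """
    upper = int(lam + 12 * math.sqrt(lam) + 50)
    probs = [poisson_pmf(k, lam) for k in range(upper + 1)]
    mean = sum(p * f(k) for k, p in enumerate(probs))
    var = sum(p * (f(k) - mean) ** 2 for k, p in enumerate(probs))
    return mean, var


for lam in (20, 100, 500):
    _, raw_var = mean_and_variance(lam)
    _, vst_var = mean_and_variance(lam, math.sqrt)
    # raw_var grows with the mean; vst_var stays near 0.25.
    print(f"mean={lam:4d}  var(x)={raw_var:8.2f}  var(sqrt(x))={vst_var:.4f}")
```

Real count data are usually more variable than a Poisson model predicts, so a plain square root is only a starting point for seeing what "variance not related to the mean" looks like in practice.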
Strengths:
(DESeq2 package) [19]
Weaknesses:
[1] Borgatti S. http://www.analytictech.com/ba762/handouts/normalization.htm (2018).
[2] Aguirre de Cárcer D, Denman SE, McSweeney C, Morrison M. Evaluation of Subsampling-Based Normalization Strategies for Tagged High-Throughput Sequencing Data Sets from Gut Microbiomes. Applied and Environmental Microbiology. 2011; 77: 8795-8798.
[3] Socratic. How do species richness and relative abundance of species affect species diversity? https://socratic.org/questions/how-do-species-richness-and-relative-abundance-of-species-affect-species-diversi (2018).
[4] Dieterle F. Random Subsampling. http://www.frank-dieterle.com/phd/2_4_3.html (2018).
[5] Chiarucci A, Bacaro G, Rocchini D, Ricotta C, Palmer MW, Scheiner SM. Spatially constrained rarefaction: incorporating the autocorrelated structure of biological communities into sample-based rarefaction. Community Ecology. 2009; 10: 209-214.
[6] Hughes JB, Hellmann JJ. The application of rarefaction techniques to molecular inventories of microbial diversity. In: Vol 397. United States: Elsevier Science & Technology; 2005: 292-308.
[7] McMurdie PJ, Holmes S. Waste not, want not: why rarefying microbiome data is inadmissible. PLoS Computational Biology. 2014; 10: e1003531.
[8] Gotelli NJ, Colwell RK. Estimating species richness. Frontiers in Measuring Biodiversity. 2011; 12: 39-54.
[9] Tipper JC. Rarefaction and Rarefiction; The Use and Abuse of a Method in Paleoecology. Paleobiology. 1979; 5: 423-434.
[10] van Buuren S. Multiple Imputation. http://www.stefvanbuuren.nl/mi/mi.html (2018).
[11] Maldonado AD, Aguilera PA, Salmeron A. An Experimental Comparison of Methods to Handle Missing Values in Environmental Datasets. International Congress on Environmental Modelling and Software. 2016: 3.
[12] Anonymous. Statistics How To. http://www.statisticshowto.com/multiple-imputation/ (2018).
[13] Sterne JAC, White IR, Carlin JB, et al. Multiple imputation for missing data in epidemiological and clinical research: potential and pitfalls. BMJ. 2009; 338: 157-160.
[14] Kanwar N, Scott HM, Norby B, et al. Impact of treatment strategies on cephalosporin and tetracycline resistance gene quantities in the bovine fecal metagenome. Scientific Reports. 2014; 4: 5100.
[15] Xu L, Paterson AD, Turpin W, Xu W. Assessment and Selection of Competing Models for Zero-Inflated Microbiome Data. PLoS One. 2015; 10: e0129606.
[16] Kaul A, Mandal S, Davidov O, Peddada SD. Analysis of Microbiome Data in the Presence of Excess Zeros. Frontiers in Microbiology. 2017; 8: 2114.
[17] Quora. How do you handle missing data (statistics)? What imputation techniques do you recommend or follow? https://www.quora.com/How-do-you-handle-missing-data-statistics-What-imputation-techniques-do-you-recommend-or-follow (2018).
[18] NC State University. Nonlinear Statistical Models for Univariate and Multivariate Response. https://www.stat.ncsu.edu/people/bloomfield/courses/ST762/slides/MD-02-2.pdf (2018).
[19] Love MI, Huber W, Anders S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biology. 2014; 15: 550.
[20] Weiss S, Xu ZZ, Peddada S, et al. Normalization and microbial differential abundance strategies depend upon data characteristics. Microbiome. 2017; 5: 27.