7 Analysing the TreeSAPP classifications in R

7.1 Introduction and goals

These instructions will help you to complete the script template treesapp_analysis.R to complete the initial analysis of your TreeSAPP classifications. The following code chunks will show you how to load, aggregate, subset and transform continuous and discrete variables in R using the tidyverse. You will also be provided with examples for plotting TreeSAPP classification data using the ggplot2 library.

7.2 Resources for analysis of TreeSAPP output in R

  • The TreeSAPP output file layered_classifications.tsv with the metagenome and metatranscriptome abundances and gene functions you generated with TreeSAPP. If you are unable to retrieve this file, it can be downloaded from here.
  • An R script template called treesapp_analysis.R on Canvas.
  • Saanich_TimeSeries_Chemical_DATA.csv file containing geochemical measurements on Canvas.

7.3 Checklist to write and run R script

  • Place treesapp_analysis.R, layered_classifications.tsv and Saanich_TimeSeries_Chemical_DATA.csv files into a single folder and create a new RStudio project in that folder on your local computer.
  • Edit the treesapp_analysis.R script following the instructions below.

7.4 Writing the R script

In the treesapp_analysis.R script you will load and subset your TreeSAPP data to the variables and marker genes of interest. You will then break the one taxonomic information column into multiple taxonomic ranks. You will also load the Saanich_TimeSeries_Chemical_DATA.csv file for geochemical measurements to learn more about the conditions at your assigned depth. Finally, you will visualize the taxonomic, functional and chemical data using the ggplot2 library.

To complete this, replace all instances of test between angle brackets, like <SOME TEXT>, with the code in each section.

7.5 Load any required packages

7.6 Load metagenomic and metatranscriptomic TreeSAPP data

ts_dat <- read.table(file="layered_classifications.tsv",
                     header=TRUE,
                     sep="\t")

7.6.1 Classification table

Let’s look at TreeSAPP’s classification table for these FunGene XmoA sequences. In RStudio, you are able to click on a variable name under the “Environment” tab in the top-right panel. Optionally, you can use the View function in the R-console to view an object. For example, you could run View(ts_dat).

A description of each of the fields can be found under the treesapp assign documentation page.

7.7 Subset your data to the variables and marker genes of interest

Determine which variables and marker genes you will need and subset your data to them. The only “Marker” in these data is XmoA, so we will filter to XmoA.

ts_df <- ts_dat %>%
  filter(Marker == "XmoA")

When working with real data you may be analyzing multiple reference package’s worth of data. If you want to subset those data to a couple of reference packages you will find the ‘%in%’ operator handy. In the following example we are keeping data where the Marker variable is equal to any of “XmoA”, “PmoA” or “AmoA”. In reality, they will only match “XmoA”.

ts_df <- ts_df %>%
  filter(Marker %in% c("XmoA", "PmoA", "AmoA"))

7.8 Reformat variables

7.8.1 Processing taxonomic information

Your taxonomic information contains information from the highest to lowest available rank. Look at a few rows containing taxonomic information and determine the separator between each rank. We can use a function in the tidyverse called separate that allows you split a character object at each of these separators. As a result, your initial single taxonomic variable will be split into multiple new variables, one for each rank. Use the R object taxa_ranks to assign the names for those new variables.

taxa_ranks <- c("Root", "Domain", "Phylum", "Class",
                "Order", "Family", "Genus", "Species")
ts_df <- ts_df %>%
  separate(col = Taxonomy,
           into = taxa_ranks,
           sep = "; ", fill = "right", remove = T)

7.8.2 Distinguishing metagenomes from metatranscriptomes

ts_df <- ts_df %>% 
  mutate(Sample = gsub("_pe", "_MetaG", Sample)) %>% 
  separate(col = Sample,
           into = c("Cruise", "Depth", "SeqType"),
           extra = "drop")

7.8.3 Processing depth data

Modify the data frame to split the sample name into “Cruise” and “Depth” variables. Then, remove the ‘m’ character from the “Depth” variable to create a numeric variable called “Depth.m”.

ts_df <- ts_df %>%
  mutate(Depth.m = as.numeric(gsub('m', '', Depth))) %>% 
  mutate(Cruise = as.numeric(gsub('SI0', '', Cruise)))

7.9 Load the geochemical data

Geochemical data for the Saanich Inlet time series is in another file called Saanich_TimeSeries_Chemical_DATA.csv. We will first load it into its own data frame called geochem_df, and subset it to just the samples we are interested in.

geochem_df <- read.table("Saanich_TimeSeries_Chemical_DATA.csv",
                          header=TRUE, sep=',') %>% 
  filter(Cruise == 72) %>% 
  select(Cruise, Depth,
         CTD_O2, NO3,
         Mean_NH4, Mean_NO2, Mean_H2S, Mean_CH4) %>% 
  mutate(Mean_CH4 = Mean_CH4*1E-3) %>% 
  mutate(across(.fns=as.numeric))

We are going to use the function left_join from the dplyr library to merge the geochemical data for cruise 72 with the TreeSAPP classifications for XmoA by Sample. The combined data will be saved to a new data frame called ts_geo_df.

ts_geo_df <- left_join(ts_df,
                       geochem_df,
                       by=c("Cruise",
                            "Depth.m" = "Depth"))

7.10 Visualize the taxonomic, functional and chemical data from Saanich Inlet

7.10.1 Analyze TreeSAPP’s taxonomic classifications

We’re going to begin with making a line plot to show the distribution of abundances of taxonomic orders at different depths. The only taxa that will be shown are those with an XmoA gene and these are not the only organisms in Saanich Inlet! We will create a new variable called Sum that sums all the Abundance (TPM values) for each taxonomic order at each depth. Without the group_by() function, it would return a single value but since we’re grouping by Depth and Order, sum yields a value for each taxon at each depth.

ts_geo_df %>%
  filter(SeqType == "MetaG") %>% 
  group_by(Depth.m, Order) %>%
  summarise(Sum = sum(Abundance)) %>%
  ungroup() %>%
  ggplot(aes(x=Depth.m, y=Sum, colour=Order)) +
  geom_line() +
  coord_flip() +
  scale_x_reverse() +
  labs(x="Depth (m)",
       y="Relative abundance (TPM)")

ggsave("taxonomic_order_lineplot.png")

The above plot gives us an idea of the absolute TPM values, but what is each taxon’s share of the total abundance at each depth? To answer this we will create a stacked barplot to show the relative change in proportions at different depths. Instead of creating a variable to hold each taxon’s total TPM we will create a new variable to store the proportion of TPM called Proportion. This could easily be changed to a percentage by multiplying it by 100.

In the ts_geo_df dataframe there are two variables storing the sample depth: one is a numeric type and the other is a character type. We will be plotting the character type depth (Depth) on the x-axis but by default R’s alphabetical ordering makes ‘100m’ precede ‘10m’. The reorder function is used here to ensure that the depth is sorted numerically by ordering Depth (character type) by Depth.m (numeric type).

Additionally, we are faceting this plot by the sequence type: ‘MetaG’ (metagenome) and ‘MetaT’ (metatranscriptome).

ts_geo_df %>% 
  group_by(Depth, Depth.m, SeqType) %>% 
  mutate(Proportion = Abundance/sum(Abundance)) %>% 
  group_by(Depth, Depth.m, Order, SeqType) %>% 
  summarise(sum = sum(Proportion)) %>% 
  ungroup() %>% 
  mutate(Depth = reorder(Depth, Depth.m)) %>% 
  ggplot(aes(x=Depth, y=sum, fill=Order)) +
  geom_bar(stat = "identity") +
  facet_wrap(~SeqType) +
  labs(x="Depth",
       y="Relative abundance") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

ggsave("taxonomic_order_barplot.png")

7.10.2 Analyze the functional data

In the following figure we are instead going to visualize the abundance of each classified XmoA’s function (indicated in the CuMMO field) across the water column. We are going to use a bubble-plot for this figure where the bubbles at each depth-order intersection are scaled by the TPM value (Abundance). We are colouring the bubbles according to their assigned CuMMO. We are again faceting this plot by the sequence type.

ts_geo_df %>% 
  group_by(Depth, Depth.m, SeqType, Order, CuMMO) %>% 
  summarise(Sum = sum(Abundance)) %>% 
  ungroup() %>%
  mutate(Depth = reorder(Depth, desc(Depth.m))) %>% 
  ggplot(aes(x=Depth, y=Order, colour=CuMMO)) +
  geom_point(aes(size=Sum)) +
  facet_wrap(~SeqType) +
  coord_flip() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

ggsave("functional_bubbleplot.png")

Can you see any discrepancies between the metagenome and metatranscriptome abundances?

7.10.3 Analyze the chemical measurements

The following plot is for visualizing the chemical profiles for each depth. This is not exhaustive but does include most of the molecules we’re interested in.

To generate this plot, we’ve had to perform a common data manipulation technique called pivoting. Essentially, instead of having a separate variable for each of the chemical measurements, we have combined them into two columns: one for the name of the molecule (which was previously the variable name) and another for their value.

ts_geo_df %>% 
  pivot_longer(cols=c(starts_with("Mean"), NO3, CTD_O2),
               values_to = "Value.uM",
               names_to = "Molecule") %>%
  ggplot(aes(x=Depth.m, y=Value.uM)) +
  geom_point() +
  geom_line() +
  facet_wrap(~Molecule, scales = "free_x") +
  coord_flip() +
  scale_x_reverse()

ggsave("geochemical_lineplot.png")

Notice that the x-axis scales all differ. This was enabled with scales = "free_x" in the function facet_wrap and was essential to observe the trends across the water column. Without it many of the values would be squished against the y-axis as their values are close to zero and the oxygen concentration gets up to ~200 micromolar.

7.10.4 Bringing them all together

In this final figure we are going to combine the methane concentration data with the TPM values for the two orders, Methylococcales and Nitrosomonadales.

ts_geo_df %>%
  filter(SeqType=="MetaT") %>%
  filter(!is.na(Order)) %>% 
  mutate(Depth = reorder(Depth, Depth.m)) %>% 
  ggplot(aes(x=Depth, y=Abundance, colour=Mean_CH4)) +
  geom_boxplot() +
  geom_point() +
  facet_wrap(~Order) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

ggsave("taxonomic_order_boxplot.png")

How does the abundance of these two groups change with methane concentration?

7.11 Analyzing classifications for your group’s reference package

  • How does abundance of your gene vary with depth? Are trends similar for both RNA and DNA?

  • How does microbial diversity vary with depth? Are trends similar for both RNA and DNA? You will need to make a decision at what taxonomic rank you will calculate diversity. Why might you not be able to use the lowest possible? Why should your taxonomic rank not be too high?

  • How does the abundance of your gene relate to water column geochemistry in Saanich Inlet (use the chemical measurement data in Saanich_TimeSeries_Chemical_DATA.csv)?

  • Determine the geochemical conditions at your given depths. Which is the best available terminal electron acceptor?