6 Calculating relative abundance of meta’omic data

6.1 Introduction and goals

Genome assemblies contain less information than the sequence reads and recovering it is not always possible. The shotgun DNA sequencing technique samples subsequences from the genomes that comprise a microbial community. Sequencing bias, while present, is subtle enough that we can assume that each subsequence’s frequency in the reads approximates its abundance in the underlying community. Whenever a (meta)genome is assembled from reads, however, sequences that are tricky to assemble (due to repetitive motifs or low coverage) are removed and the abundance information is effectively lost. One way to recover these read-level abundances is by aligning the reads back to the contigs and calculating the number of reads that mapped to each. One can go a step further still and use relative abundance normalization measures popular in transcriptomics, such as fragments per kilobase per million mappable reads (FPKM) or transcripts per million (TPM) (12).

TreeSAPP’s subcommand abundance is capable of calculating FPKM or TPM for all classified query sequences. These can be derived from the reads in the FASTQ files of multiple different samples. This is only possible when the input fasta are nucleotide sequences from which ORFs were predicted within the pipeline (i.e. this thisn’t possible with amino acid sequences).

The goal of this tutorial is to show you how to use treesapp abundance to calculate TPM for metagenome-derived ORF sequences.

6.2 Workflow description

The workflow employed by TreeSAPP is common to many bioinformatics pipelines:

Collect nucleotide sequences for the classified ORFs.
BWA (13) creates an index from these nucleotide sequences - this will be the reference database.
Align the short reads to the reference database with BWA MEM.
Calculate the relative abundance metrics. TreeSAPP uses samsum.

There are potentially two common use cases for treesapp abundance.

Through treesapp assign where a single sample’s read data are mapped to its corresponding assembly (i.e. contigs or scaffolds) and the relative abundances of these sequences are obtained.
A combined assembly was generated from multiple samples, classified with treesapp assign, then the reads of each sample were mapped individually to generate sample-specific relative abundance values with treesapp abundance.

6.3 Usage

There are two required arguments for treesapp abundance:

treesapp_output: Path to the output directory from treesapp assign.
reads: Path to a FASTQ file containing single-end reads, or foward mates or interleaved reads from a paired-end sequencing library.

Below are descriptions for the most important options:

reverse: Path to a FASTQ file containing reverse mates from a paired-end sequencing library.
report: This argument controls what treesapp abundance does with the resulting abundance values. The ‘nothing’ option is most applicable to the Python API, and is used by treesapp assign to simply return the abundance object instances. ‘update’ can be used when the abundance values in the classification table should be replaced with these newly calculated values. ‘append’ should be used when reads of new samples are being used to calculate abundance (e.g. reads from a time-course experiment).
metric: Either Fragments per Kilobase per Mllion mappable reads (FPKM) or Transcripts per Million (TPM).

6.4 Outputs

The sole output of treesapp abundance is an edited classification table, previously generated by treesapp assign. No new files are generated, beyond intermediate files (e.g. SAM files). The values are in the ‘Abundance’ field of final_outputs/classifications.tsv.

6.5 Tutorial steps

Connect to your group’s server, then move to the directory /data/<user>/tutorial.

cd /data/<user>/tutorial/

This tutorial requires many FASTQ and FASTA-formatted files that are subsetted versions of the originals. They were created by searching metagenome assemblies and reads for XmoA then pulling out all sequences that matched into new files. They are all available in the SI072_sequence_data directory downloaded during the last tutorial from .

We will begin by classifying XmoA sequences found in the metagenome contigs. Since these are nucleotide sequences representing random stretches of DNA from the microbial community, treesapp assign will begin by predicting open reading frames (ORFs) from these. TreeSAPP will then classify the amino acid sequences for each of the predicted ORFs. Once the XmoA sequences are classified from these ORFs we can begin to generate the relative abundance values for the XmoA sequences.

Assign taxonomic labels to Saanich Inlet metagenomic contigs using the updated XmoA reference package with treesapp assign.

treesapp assign \
  -n 4 \
  --trim_align \
  --refpkg_dir XmoA_MAG_update/final_outputs/ \
  --fastx_input SI072_sequence_data/SI072_MetaG_contigs.fasta \
  --output SI072_MetaG_contigs_XmoA_assign/

Now we’ll need to access the processed data generated by treesapp assign in SI072_MetaG_contigs_XmoA_assign/. treesapp abundance will access the ORFs in SI072_MetaG_contigs_XmoA_assign/intermediates/orf-call/SI072_MetaG_contigs_ORFs.fna and pass these to bwa mem for sequence alignment. Due to this, the flag ‘--delete’ should not be used with treesapp assign if you want to eventually run treesapp abundance as the intermediates directory would be removed.

We will begin by calculating TPM for the metagenome reads. The reads we are using to calculate these values are paired-end meaning that the there are two reads to represent each sequenced DNA fragment. These two reads, in the case of the metagenomic data, are split across two files: one for the reads sequenced in the forward direction and another for the reads sequenced in the reverse. Their extensions are ‘.1.fq.gz’ and ‘.2.fq.gz’, respectively. Therefore, we must use the parameters ‘--reads’ and ‘--reverse’ to point TreeSAPP towards these files. You’ll notice that we didn’t list out seven files for each, either; we used a wildcard character (the asterisk). For the forward reads, the shell will interpret this as all files within the SI072_sequence_data directory that begin with ‘SI072_’ and end with ‘m_pe.1.fq.gz’. You can test this expansion using the ls command.

ls SI072_sequence_data/SI072_*m_pe.1.fq.gz

We would like to replace the current abundance values from treesapp assign (set to 1) with the new TPM values, so we will set the ‘--report’ parameter to ‘update’. Finally, we can specify that we want TPM values with --metric tpm. Although, ‘tpm’ is the default so this isn’t necessary.

Run treesapp abundance to overwrite the previous abundance values with the TPM values calculated from each of the seven SI072 datasets.

treesapp abundance \
  -n 4 \
  --treesapp_output SI072_MetaG_contigs_XmoA_assign/ \
  --reads SI072_sequence_data/SI072_*m_pe.1.fq.gz \
  --reverse SI072_sequence_data/SI072_*m_pe.2.fq.gz \
  --report update \
  --metric tpm

Now we’re going to generate TPM values for the seven matching metatranscriptome datasets. You will notice the format of these FASTQ files is different from the metagenomes. All of the reads for each sample are contained in a single file, and this is referred to as interleaved FASTQ format. Since there is no file for the reverse reads we can just use the ‘--reads’ parameter. Another key difference is we’re setting the ‘--report’ parameter to ‘append’. This is because we don’t want to overwrite the TPM values we just generated for the metagenomes. Instead, we want the new metatranscriptome values to be added to SI072_MetaG_contigs_XmoA_assign/final_outputs/classifications.tsv, so in the end the classification table will hold TPM values for both the metagenomes and metatranscriptomes.

Use treesapp abundance to calculate TPM values from the seven metatranscriptomes.

treesapp abundance \
 -n 4 \
 --treesapp_output SI072_MetaG_contigs_XmoA_assign/ \
 --reads SI072_sequence_data/SI072_*m_MetaT_QC_Filtered.fq.gz \
 --pairing pe \
 --metric tpm \
 --report append

The next tutorial will cover visualizing the TreeSAPP classifications. For this we also want the functional annotations that we generated in the last tutorial.

Run treesapp layer to annotate the XmoA query sequences with their respective paralog.

treesapp layer \
 -o SI072_MetaG_contigs_XmoA_assign/ \
 --refpkg_dir XmoA_MAG_update/final_outputs/

That’s it! You’re now a seasoned veteran of TreeSAPP.

6.6 Use your group’s reference package to calculate abundances

Continue to work with your group’s reference package in the directory /data/<gene name>. Visualize the appropriate outputs with iTOL.

5 Classifying unknown sequences with TreeSAPP and updating a reference package

7 Analysing the TreeSAPP classifications in R