4 Creating a reference package for TreeSAPP

4.1 Introduction and goals

In this tutorial we will demonstrate how to build a TreeSAPP reference package and assess whether there are any mis-annotated sequences in the final product.

The goals of this tutorial are:

To learn how to use treesapp create to build a reference package for a simple protein family (XmoA) using the sequences acquired in the previous tutorial.
To learn how to assess whether the reference package contains only homologous sequences using treesapp purity.
To learn how to view a reference package’s phylogenetic tree using the interactive Tree of Life (iTOL).

4.2 Making the reference package

In the Acquiring reference sequence data tutorial, PmoA & AmoA sequences were downloaded from EggNOG and UniProt. Here, we will use the EggNOG sequences. Let’s call this our seed reference package. The steps involved are:

All steps are executed on the server unless stated otherwise. Connect to your group’s server container and continue to work in your /data/<user>/tutorial directory.

cd /data/<user>/tutorial

Optional: If this directory doesn’t exist, first create it using the command mkdir.

cd /data/
mkdir <user>
cd <user>
mkdir tutorial

Optional: If you could not download the reference sequences of XmoA yourself (see Acquiring reference sequence data), download xmoa_file_list.txt — a list of the necessary fasta files — with wget to the tutorial directory, then run wget again to retrieve each file within this list.

wget https://raw.githubusercontent.com/EDUCE-UBC/MICB425/main/data/xmoa_file_list.txt
wget -i xmoa_file_list.txt

There are 73 PmoA and AmoA fasta sequences in this file.

Let’s look briefly at the newly generated fasta file using cat and less. cat stands for concatenate, i.e. the files are pasted together. The output of cat is funneled, or piped, with the | symbol to a text viewer called less, i.e. cat file_1.ext file_2.ext | less. The -S argument for less command turns off line-wrapping to maintain a readable format.

cat ENOG5028JPK_EggNOGv5.faa arCOG08676_EggNOGv5.faa | less -S

You will notice that there are lines that start with a > character. This character indicates all subsequent characters on the line correspond to the name of a sequence (and potentially other information depending on the database), with the protein sequence itself on one or more lines after.

We can count and confirm the number of sequences in the fasta file by counting those > using grep. grep is an incredibly powerful tool to find matching text patterns (using a format called “regular expression”) and is beyond the scope of this tutorial. The code below searches for a > at the beginning of lines (defined with "^>") and counts each occurrence due to the option -c.

cat ENOG5028JPK_EggNOGv5.faa arCOG08676_EggNOGv5.faa | grep -c "^>"

Use submodule treesapp create to build the XmoA reference package with different arguments (see visualization of the create workflow). To speed things along, we will use FastTree to infer the phylogeny and will skip bootstrapping with --fast. To remove potential redundant sequences, we will cluster the candidate reference sequences at 97% similarity using the argument -p. When clustering the candidate sequences TreeSAPP would normally ask which sequence to use for the representative of the cluster. This can be handy in cases when some sequences are better annotated and/or are especially important. To speed things up even more, the flag --headless will prevent these requests. This command will take ~2 minutes to complete.

treesapp create \
  --fast \
  --headless \
  --cluster \
  -n 4 \
  -p 0.97 \
  --fastx_input ENOG5028JPK_EggNOGv5.faa arCOG08676_EggNOGv5.faa \
  -c XmoA \
  --output XmoA_seed

The final reference package file is located in XmoA_seed/final_outputs/XmoA_build.pkl. This file contains all the individual components of a reference package (multiple sequence alignment, profile HMM, phylogenetic tree, taxonomic lineages) as well as some other data. These files were bundled up using the joblib Python library. They can be accessed individually using the submodule treesapp package.

Replace the current refpkg_code Z1111 with unique identifier N0102.

treesapp package edit \
  refpkg_code N0102 \
  --overwrite \
  --refpkg_path XmoA_seed/final_outputs/XmoA_build.pkl

We can also modify the reference package’s description while we’re here.

treesapp package edit \
  description "Alpha subunits of copper membrane monooxygenase enzymes" \
  --overwrite \
  --refpkg_path XmoA_seed/final_outputs/XmoA_build.pkl

4.3 Testing the purity of the reference package

We will determine whether there were any mis-annotated sequences that were included in our reference package with treesapp purity. To do this, we take a well-curated database and attempt to classify sequences in it using treesapp assign. The results are then analysed and displayed for the user to evaluate. The TIGRFAM database is fairly comprehensive, representing 4488 different groups in version 15, and by using just the TIGRFAM seed sequences (TIGRFAM_seed_named.faa) we can be fairly sure we won’t be evaluating our classifications with mis-annotated sequences.

treesapp purity \
  -n 4 \
  -r XmoA_seed/final_outputs/XmoA_build.pkl \
  --extra_info TIGRFAM_info.tsv \
  -i TIGRFAM_seed_named.faa \
  --output XmoA_purity

The important bit of the output should look like this

Summarizing assignments for reference package XmoA
Ortholog        Hits    Leaves  Tree-coverage   Description
--------------------------------------------------------------------------------
TIGR03080       3       3       8.3    methane monooxygenase/ammonia monooxygenase, subunit A

From this summary it appears that the reference package classified three homologous sequences that were placed at leaf nodes in the tree (i.e. they’re closely related) and the sequences were all from the family “TIGR03080”, also known as “methane monooxygenase/ammonia monooxygenase, subunit A”. In all, we are probably good to proceed!

4.4 Viewing the reference package tree with iTOL

Among the many files inferred by treesapp create is a phylogenetic tree for XmoA. To ensure that the tree topology is sensible, without any anomalously long branches for example, we should visualize it in iTOL. Since reference packages are compressed files with many attributes, we will use the subcommand treesapp package to extract the tree in Newick format.

Use treesapp package to write the reference package tree to a file.

treesapp package \
  view tree \
  --refpkg_path XmoA_seed/final_outputs/XmoA_build.pkl > \
  XmoA_seed/XmoA_labelled_tree.txt

On your computer: Transfer XmoA_seed/XmoA_labelled_tree.txt from the server to your local directory ts_tutorial_local.

 scp root@<server address>:/data/<user>/tutorial/XmoA_seed/XmoA_labelled_tree.txt <path to ts_tutorial_local>

We can quickly and easily view the sequences placed on the phylogeny using iTOL. We will do this by uploading the text file produced by TreeSAPP.

Navigate to https://itol.embl.de/ using a web browser and click “Upload” at the top of the page.
Upload the file XmoA_labelled_tree.txt. You can do this using either your computer’s file browser then navigating to the parent directory and clicking-and-dragging the file, or iTOL’s file uploader by choosing a file.
The figure should look identical to this figure below Figure 4.1.

FIGURE 4.1: Phylogenetic tree of the seed XmoA reference package viewed in iTOL. Sequence names at the tree’s leaf tips show the organism name and unique EggNOG accession identifier.

4.5 Booster shot

The reference package for XmoA is relatively small, and ideal for an in-class tutorial. However, many protein families are much more complex with members spread broadly across the tree of life. In these cases, you may need to use other tactics to reduce the size of your reference package so treesapp create can finish in under a few hours. Using any or all of the following options are not required, but may be useful.

Remove more candidate reference sequences from the final reference package by decreasing the percent similarity used during sequence clustering. In the tutorial we set “-p”/“--similarity” to 0.97 but this is going to be too high in many cases. Try something between 0.80 and 0.95 to get between 100 and 2000 sequences.
A profile HMM can be used to accomplish two things: first it can be used to remove sequences with only remote homology to the protein family you’re building a reference package for; second, each HMM alignment can be used to trim the input sequences to just the conserved regions improving the quality of the multiple sequence alignment and phylogeny. You can download a profile HMM from EggNOG by going to your gene’s EggNOG page, clicking Download ᐁ and then “HMM model”. This will typically only be available for subgroups such as “Bacteria”, “Archaea”, “Proteobacteria”, etc. (i.e. not ‘root’). Provide the HMM file to treesapp create with the “--profile” argument.
Metagenomic data from Saanich Inlet were generated from biomass where larger Eukaryotic cells had been physically removed prior to sequencing. Therefore, these organisms do not need to be present in the reference packages. You can remove sequences that are Eukaryotic in origin by using either “--screen” or “--filter” arguments. For example, to only include Bacteria and Archaea use --screen Bacteria,Archaea or --filter Eukaryota.
The “--min_taxonomic_rank” argument specifies the minimum taxonomic rank a sequence’s taxonomic lineage must be resolved to. For example, using --min_taxonomic_rank g will ensure that all sequences in the reference package have been annotated to the genus, species or strain taxonomic rank. To specify others ranks , use the lowercase first letter of that taxonomic rank. If your reference package is still very large after using the above options, setting this argument to a highly resolved taxonomic rank can be a potent way to remove candidate reference sequences.

To bring all of these arguments together for you here is an example command:

treesapp create
  --fast \
  --headless \
  --overwrite \
  --cluster \
  --trim_align \
  -n 4 \
  --similarity 0.90 \
  --screen Bacteria,Archaea \
  --min_taxonomic_rank g \
  --profile arCOG08676_hmm.txt \
  --fastx_input XmoA_seed.faa \
  -c XmoA \
  --output XmoA_seed

3 Acquiring reference sequence data

5 Classifying unknown sequences with TreeSAPP and updating a reference package