4 Creating a reference package for TreeSAPP
4.1 Introduction and goals
In this tutorial we will demonstrate how to build a TreeSAPP reference package and assess whether there are any mis-annotated sequences in the final product.
The goals of this tutorial are:
- To learn how to use
treesapp create
to build a reference package for a simple protein family (XmoA) using the sequences acquired in the previous tutorial. - To learn how to assess whether the reference package contains only homologous sequences using
treesapp purity
. - To learn how to view a reference package’s phylogenetic tree using the interactive Tree of Life (iTOL).
4.2 Making the reference package
In the Acquiring reference sequence data tutorial, PmoA & AmoA sequences were downloaded from EggNOG and UniProt. Here, we will use the EggNOG sequences. Let’s call this our seed reference package. The steps involved are:
-
All steps are executed on the server unless stated otherwise. Connect to your group’s server container and continue to work in your
/data/<user>/tutorial
directory.
cd /data/<user>/tutorial
-
Optional: If this directory doesn’t exist, first create it using the command
mkdir
.
cd /data/
mkdir <user>
cd <user>
mkdir tutorial
-
Optional: If you could not download the reference sequences of XmoA yourself (see Acquiring reference sequence data), download
xmoa_file_list.txt
— a list of the necessary fasta files — withwget
to thetutorial
directory, then runwget
again to retrieve each file within this list.
wget https://raw.githubusercontent.com/EDUCE-UBC/MICB425/main/data/xmoa_file_list.txt
wget -i xmoa_file_list.txt
There are 73 PmoA and AmoA fasta sequences in this file.
- Let’s look briefly at the newly generated fasta file using
cat
andless
.cat
stands for concatenate, i.e. the files are pasted together. The output ofcat
is funneled, or piped, with the|
symbol to a text viewer calledless
, i.e.cat file_1.ext file_2.ext | less
. The-S
argument forless
command turns off line-wrapping to maintain a readable format.
cat ENOG5028JPK_EggNOGv5.faa arCOG08676_EggNOGv5.faa | less -S
You will notice that there are lines that start with a >
character. This character indicates all subsequent characters on the line correspond to the name of a sequence (and potentially other information depending on the database), with the protein sequence itself on one or more lines after.
- We can count and confirm the number of sequences in the fasta file by counting those
>
usinggrep
.grep
is an incredibly powerful tool to find matching text patterns (using a format called “regular expression”) and is beyond the scope of this tutorial. The code below searches for a>
at the beginning of lines (defined with"^>"
) and counts each occurrence due to the option-c
.
cat ENOG5028JPK_EggNOGv5.faa arCOG08676_EggNOGv5.faa | grep -c "^>"
- Use submodule
treesapp create
to build the XmoA reference package with different arguments (see visualization of thecreate
workflow). To speed things along, we will use FastTree to infer the phylogeny and will skip bootstrapping with--fast
. To remove potential redundant sequences, we will cluster the candidate reference sequences at 97% similarity using the argument-p
. When clustering the candidate sequences TreeSAPP would normally ask which sequence to use for the representative of the cluster. This can be handy in cases when some sequences are better annotated and/or are especially important. To speed things up even more, the flag--headless
will prevent these requests. This command will take ~2 minutes to complete.
treesapp create \
--fast \
--headless \
--cluster \
-n 4 \
-p 0.97 \
--fastx_input ENOG5028JPK_EggNOGv5.faa arCOG08676_EggNOGv5.faa \
-c XmoA \
--output XmoA_seed
The final reference package file is located in XmoA_seed/final_outputs/XmoA_build.pkl
. This file contains all the individual components of a reference package (multiple sequence alignment, profile HMM, phylogenetic tree, taxonomic lineages) as well as some other data. These files were bundled up using the joblib Python library. They can be accessed individually using the submodule treesapp package
.
- Replace the current refpkg_code
Z1111
with unique identifierN0102
.
treesapp package edit \
\
refpkg_code N0102 --overwrite \
--refpkg_path XmoA_seed/final_outputs/XmoA_build.pkl
- We can also modify the reference package’s description while we’re here.
treesapp package edit \
"Alpha subunits of copper membrane monooxygenase enzymes" \
description --overwrite \
--refpkg_path XmoA_seed/final_outputs/XmoA_build.pkl
4.3 Testing the purity of the reference package
We will determine whether there were any mis-annotated sequences that were included in our reference package with treesapp purity
. To do this, we take a well-curated database and attempt to classify sequences in it using treesapp assign
. The results are then analysed and displayed for the user to evaluate. The TIGRFAM database is fairly comprehensive, representing 4488 different groups in version 15, and by using just the TIGRFAM seed sequences (TIGRFAM_seed_named.faa
) we can be fairly sure we won’t be evaluating our classifications with mis-annotated sequences.
treesapp purity \
-n 4 \
-r XmoA_seed/final_outputs/XmoA_build.pkl \
--extra_info TIGRFAM_info.tsv \
-i TIGRFAM_seed_named.faa \
--output XmoA_purity
The important bit of the output should look like this
Summarizing assignments for reference package XmoA
Ortholog Hits Leaves Tree-coverage Description
--------------------------------------------------------------------------------
TIGR03080 3 3 8.3 methane monooxygenase/ammonia monooxygenase, subunit A
From this summary it appears that the reference package classified three homologous sequences that were placed at leaf nodes in the tree (i.e. they’re closely related) and the sequences were all from the family “TIGR03080”, also known as “methane monooxygenase/ammonia monooxygenase, subunit A”. In all, we are probably good to proceed!
4.4 Viewing the reference package tree with iTOL
Among the many files inferred by treesapp create
is a phylogenetic tree for XmoA. To ensure that the tree topology is sensible, without any anomalously long branches for example, we should visualize it in iTOL. Since reference packages are compressed files with many attributes, we will use the subcommand treesapp package
to extract the tree in Newick format.
- Use
treesapp package
to write the reference package tree to a file.
treesapp package \
\
view tree --refpkg_path XmoA_seed/final_outputs/XmoA_build.pkl > \
XmoA_seed/XmoA_labelled_tree.txt
-
On your computer: Transfer
XmoA_seed/XmoA_labelled_tree.txt
from the server to your local directoryts_tutorial_local
.
scp root@<server address>:/data/<user>/tutorial/XmoA_seed/XmoA_labelled_tree.txt <path to ts_tutorial_local>
We can quickly and easily view the sequences placed on the phylogeny using iTOL. We will do this by uploading the text file produced by TreeSAPP.
Navigate to https://itol.embl.de/ using a web browser and click “Upload” at the top of the page.
Upload the file
XmoA_labelled_tree.txt
. You can do this using either your computer’s file browser then navigating to the parent directory and clicking-and-dragging the file, or iTOL’s file uploader by choosing a file.-
The figure should look identical to this figure below Figure 4.1.
4.5 Booster shot
The reference package for XmoA is relatively small, and ideal for an in-class tutorial. However, many protein families are much more complex with members spread broadly across the tree of life. In these cases, you may need to use other tactics to reduce the size of your reference package so treesapp create
can finish in under a few hours. Using any or all of the following options are not required, but may be useful.
Remove more candidate reference sequences from the final reference package by decreasing the percent similarity used during sequence clustering. In the tutorial we set “-p”/“--similarity” to 0.97 but this is going to be too high in many cases. Try something between 0.80 and 0.95 to get between 100 and 2000 sequences.
A profile HMM can be used to accomplish two things: first it can be used to remove sequences with only remote homology to the protein family you’re building a reference package for; second, each HMM alignment can be used to trim the input sequences to just the conserved regions improving the quality of the multiple sequence alignment and phylogeny. You can download a profile HMM from EggNOG by going to your gene’s EggNOG page, clicking
Download ᐁ
and then “HMM model”. This will typically only be available for subgroups such as “Bacteria”, “Archaea”, “Proteobacteria”, etc. (i.e. not ‘root’). Provide the HMM file totreesapp create
with the “--profile” argument.Metagenomic data from Saanich Inlet were generated from biomass where larger Eukaryotic cells had been physically removed prior to sequencing. Therefore, these organisms do not need to be present in the reference packages. You can remove sequences that are Eukaryotic in origin by using either “--screen” or “--filter” arguments. For example, to only include Bacteria and Archaea use
--screen Bacteria,Archaea
or--filter Eukaryota
.The “--min_taxonomic_rank” argument specifies the minimum taxonomic rank a sequence’s taxonomic lineage must be resolved to. For example, using
--min_taxonomic_rank g
will ensure that all sequences in the reference package have been annotated to the genus, species or strain taxonomic rank. To specify others ranks , use the lowercase first letter of that taxonomic rank. If your reference package is still very large after using the above options, setting this argument to a highly resolved taxonomic rank can be a potent way to remove candidate reference sequences.
To bring all of these arguments together for you here is an example command:
treesapp create
--fast \
--headless \
--overwrite \
--cluster \
--trim_align \
-n 4 \
--similarity 0.90 \
--screen Bacteria,Archaea \
--min_taxonomic_rank g \
--profile arCOG08676_hmm.txt \
--fastx_input XmoA_seed.faa \
-c XmoA \
--output XmoA_seed