5 Classifying unknown sequences with TreeSAPP and updating a reference package
5.1 Introduction and goals
We will be updating the new particulate methane monooxygenase (pMMO) and ammonia monooxygenase (AMO) alpha subunit, or XmoA, reference package with XmoA amino acid sequences sourced from UniProt. You already created the reference package (Creating a Reference Package For TreeSAPP) and downloaded the sequences that will be used to update it (Acquiring reference sequence data).
There are four goals of this tutorial:
- Learn how to classify query sequences using
treesapp assign
. - Learn how to update a TreeSAPP reference package with
treesapp update
. - Learn how to use
treesapp colour
to colour a phylogeny in iTOL. - Learn how to annotate features of phylogenetic clades using
treesapp layer
.
The first step involves using treesapp assign
to classify the UniProt XmoA sequences. The second step is to use treesapp update
to add the new sequences to the original XmoA reference package.
5.2 Classify new sequences
Before we can begin using TreeSAPP, we must log onto the server and make sure we’re in the same directory as all our data.
cd /data/<user>/tutorial/
According to treesapp purity
in the last chapter, and all other sanity checks, the XmoA reference package (contained in your working directory, /data/<user>/tutorial
, at XmoA_seed/final_outputs/XmoA_build.pkl
) only contains XmoA sequences. At this point, you’re able to use this reference package to classify sequences in any dataset using treesapp assign
.
The command treesapp assign
is perhaps the most popular command and is described in detail on the TreeSAPP wiki.
For these commands we will use the Block Mapping and Gathering with Entropy (BMGE) software to filter out regions of questionable homology from the alignment (6). Refer to the BMGE publication for more information. To trim the multiple sequence alignments prior to phylogenetic placement use the --trim_align
flag.
To use the PmoA and AmoA reference package we’ve built instead of any other reference packages that come with TreeSAPP we can use the argument --refpkg_dir
with the path to a directory containing reference packages we want to use.
As usual, if you’re unclear as to what any arguments are, you can use treesapp assign -h
to get the complete usage and descriptions of each argument.
5.2.1 Classifying amino acid sequences
We’re first going to classify PmoA and AmoA sequences sourced from UniProt.
Before we classify them, the sequences all need to be combined in a single file using the cat
shell command.
- Use
cat
to combine PmoA and AmoA sequences into the fileUniProt_XmoA.fasta
. The output ofcat
is redirected from the console with the>
symbol to a<new file>
.
cat UniProt_AmoA.fasta UniProt_PmoA.fasta > UniProt_XmoA.fasta
This new file will be the input to treesapp assign
, provided with the --fastx_input
argument.
In past TreeSAPP commands we specified the path to a single reference package with the --refpkg_path
argument.
Since treesapp assign
is capable of using multiple reference packages, the analogous argument is --refpkg_dir
.
This should point to a directory containing one or more reference package .pkl files, all of which can be used for classifying the query sequences.
The output directory is UniProt_XmoA_assign/
.
- Classify the 69661 protein sequences in
UniProt_XmoA.fasta
usingtreesapp assign
.
treesapp assign \
-n 4 \
--trim_align \
--refpkg_dir XmoA_seed/final_outputs/ \
--fastx_input UniProt_XmoA.fasta \
--output UniProt_XmoA_assign/
The command should take approximately twenty seconds to complete and if you see the line TreeSAPP has finished successfully.
at the end then you’re good to move on.
5.2.2 Classifying ORFs predicted from genomes
The second batch of sequences that we will classify XmoA from are metagenome-assembled genomes (MAGs). MAGs are derived from assembled metagenome datasets (i.e. contigs or scaffolds) and represent the genome of a closely related microbial population. You can think of them as representing a single strain present in a microbial community.
We will need to download two files: the FASTA file containing the genomes and a table with the taxonomic label for each genome. These files, SI072_MAGs.fa
and SI072_MAGs_gtdbtk.bac120.summary.tsv
, are contained within a larger data package that we will be revisiting in the next tutorial.
- Download the data package
SI072_sequence_data.tar.gz
from Zenodo and decompress it withtar
.
wget https://zenodo.org/record/6323402/files/SI072_sequence_data.tar.gz && \
tar -xzvf SI072_sequence_data.tar.gz
The FASTA file was created by concatenating twelve different MAGs into a single file and prepending each sequence name (i.e. header) with the MAG name using the software seqkit.
Concatenating multiple genomes into a single file saves time since you only need to run treesapp assign
once instead of classifying each MAG individually.
Prepending the header with its respective MAG name is necessary to properly map the names of the query sequences back to their original genome names during the update process. Here is the shell command used (do not run):
for f in *fa
do
mag_name=$( basename $f | sed 's/.fa//g' )
cat $f | seqkit replace -p "^" -r "${mag_name}_" >>SI072_MAGs.fa
done
The treesapp assign
command we use is very similar from those we’ve used in the past.
We do not need to indicate which molecule type it is (i.e., nucleotides or amino acids) as TreeSAPP automatically determines this when reading the input.
treesapp assign \
-n 4 \
--trim_align \
--refpkg_dir XmoA_seed/final_outputs/ \
--fastx_input SI072_sequence_data/SI072_MAGs.fa \
--output SI072_MAGs_assign/
This command should take less than a minute to complete. An extra step was required here for open reading frames (ORFs) to be predicted using the software Prodigal. The ORFs were translated to create protein sequences required by TreeSAPP for homology search, phylogenetic placement, etc.
5.3 Update the reference package with new sequences
The initial XmoA_seed reference package was pretty small with just 36 sequences (information was retrieved using treesapp package view num_seqs -r XmoA_seed/final_outputs/XmoA_build.pkl
). Also, there were only bacterial sequences included in this version and we know there are archaeal ammonia oxidizers too.
Let’s see if we can expand on this seed reference package using the sequences from UniProt.
5.3.1 Updating with publicly available sequences
You have already seen many of these flags and arguments before, and they’re used again here just to remind TreeSAPP that we want results fast.
A couple of new ones are used, though. treesapp update
won’t take just any old FASTA file but relies on the outputs produced by treesapp assign
.
So we’ll guide it to the relevant output directory with --treesapp_output UniProt_XmoA_assign/
.
From there, it will figure out what sequences were classified and what their assigned lineages were.
By default TreeSAPP will use the lineage that was assigned to each classified query sequence from the treesapp assign
outputs.
Although these sequences are from UniProt (and originally GenBank) they do not have NCBI accessions or taxonomy IDs in the header, so we must provide taxonomic lineages for each sequence in a table.
Fortunately, we already have this information in the files UniProt_AmoA.tab
and UniProt_PmoA.tab
.
- Create a table mapping sequence names to taxonomic lineage information for
treesapp update
. The first command creates a new file (UniProt_XmoA.tsv
) with the table column names. The second command pipes all rows in the UniProt lineage tables except the header line by using thetail
command. The-q
parameter preventstail
from including the file name in the output. The UniProt table rows are appended to the new file by using>>
instead of one>
.
echo -e "SeqID\tOrganism\tDomain\tPhylum\tClass\tOrder\tFamily\tGenus\tSpecies" >UniProt_XmoA.tsv
tail -q -n +2 UniProt_*tab >>UniProt_XmoA.tsv
Alternatively, inverse pattern matching the word “Entry” — the first column name in the exported UniProt tables — with grep --no-filename -v Entry UniProt_*tab
could have also worked.
To make TreeSAPP use this new table instead of using the taxonomic classifications generated by treesapp assign
we use the --skip_assign
flag.
Finally, we are going to require that taxonomic lineages are resolved to at least the rank of Order with the argument --min_taxonomic_rank
.
Resolving lineages of sequences to at least this rank will help us link them to their respective CuMMO groups more precisely.
This level of resolution isn’t necessary for all reference packages but given the metabolic diversity within the CuMMOs of Proteobacteria alone, it is here.
- Update the reference package
XmoA_seed/final_outputs/XmoA_build.pkl
with classified UniProt XmoA sequences. Use the argument--seqs2lineage
to specify the path to the table containing taxonomic lineages for all UniProt XmoA sequences.
treesapp update \
--fast \
--headless \
--overwrite \
--delete \
--cluster \
--trim_align \
--min_taxonomic_rank o \
-n 4 \
--output XmoA_UniProt_update/ \
--skip_assign \
--treesapp_output UniProt_XmoA_assign/ \
--refpkg_path XmoA_seed/final_outputs/XmoA_build.pkl \
--seqs2lineage UniProt_XmoA.tsv
It looks like 39 new sequences were introduced into the reference package, bringing the total to 75 sequences. And judging from the taxonomic rank summary printed during runtime both Archaea and Bacteria are now included:
Number of unique lineages:
root 1
domain 2
phylum 6
class 7
order 10
family 14
genus 29
species 46
5.3.2 Updating with MAG sequences
We are now going to update the reference package that we just updated using the MAG sequences to demonstrate how to iteratively update reference packages. Like the previous update, we must provide the taxonomic labels for each MAG. The MAGs were assigned taxonomic labels using the Genome Taxonomy Database toolkit (GTDB-tk)(7).
We first need to reformat the table containing the taxonomic labels of the MAGs. We will employ the shell functions awk
and sed
to pull out the two columns we need and find-and-replace words and characters.
awk -F"\t" '{ OFS="\t"; print $1,$2 }' SI072_sequence_data/SI072_MAGs_gtdbtk.bac120.summary.tsv | \
sed 's/user_genome/Organism/g' | \
sed 's/classification/Lineage/g' | \
sed 's/;/; /g' >gtdb_classifications.tsv
Now that we have everything we need we can proceed to run treesapp update
to add the MAG sequences to the reference package.
treesapp update \
--fast \
--headless \
--overwrite \
--delete \
--cluster \
--trim_align \
-n 4 \
--output XmoA_MAG_update/ \
--skip_assign \
--seqs2lineage gtdb_classifications.tsv \
--treesapp_output SI072_MAGs_assign/ \
--refpkg_path XmoA_UniProt_update/final_outputs/XmoA_build.pkl
One of the XmoA sequences from the MAGs should have been added to the reference package bringing the total to 77 reference sequences.
A summary of the reference package will have been printed to the screen if treesapp update
completed properly, and it should look like this:
Summary of the updated reference package:
ReferencePackage instance of XmoA (N0102):
Molecule type: 'prot'
TreeSAPP version: '0.11.3'
Profile HMM length: '246'
Substitution model used for phylogenetic inference: 'LG+G4'
Number of reference sequences (leaf nodes): 77
Software used to infer phylogeny: 'FastTree'
Date of last update: '2022-03-01'
Description: 'Alpha subunits of copper membrane monooxygenase enzymes'
For supplementary reading on how to integrate sequences from SAGs and MAGs into reference packages please visit the TreeSAPP Wiki.
5.4 Annotate clades
Protein families vary in their evolutionary complexity; many, especially genes involved in scavenging and metabolizing nutrients, have often been involved in some gene duplication and/or lateral gene transfer event. Moreover, closely related enzymes can vary in their activities and it is often valuable, though not necessary, to annotate such features. As previously discussed, XmoA includes PmoA and AmoA, which are involved in oxidizing methane and ammonia, respectively. Still other activities are mediated by members of this metabolically diverse protein family, including ethylene, propane and butane oxidation (5). Here, we will annotate reference sequences in the XmoA reference package so TreeSAPP can classify sequences more precisely and create colour-annotation files for iTOL.
Creating your own annotations involves scouring the literature to identify the phenotypes associated with each organism or taxon. For example, you would need to determine what paralogs or substrates are used by each organism in the reference package. Often though, these phenotypes are conserved across larger taxonomic groups, such as whole genera, families, orders, etc.
If you are not familiar with the evolution of a protein family it may be helpful to find reference sequences for the activity, function, or other feature you’re interested in and place them on the tree with treesapp assign
. Then, guided by these sequences, it should be easier to annotate the clades.
Seeing as a literature review can’t be accomplished within the time span of this tutorial, you can use the annotations already created for the XmoA reference package with extensive support from the literature (5, 8–11).
wget https://raw.githubusercontent.com/hallamlab/RefPkgs/master/Nitrogen_metabolism/Nitrification/XmoA/XmoA_taxonomy-phenotype_map.tsv
You can view the contents of this file with the Unix command less
:
less XmoA_taxonomy-phenotype_map.tsv
d__Archaea AmoA_AOA
p__Verrucomicrobia PmoA
p__Actinobacteria BmoA
o__Chromatiales AmoA_AOB
f__Methylococcaceae PmoA
f__Nitrosomonadaceae AmoA_AOB
g__Haliea EmoA
g__Methylocystis PmoA
g__Methylosinus PmoA
g__Methylohalobius PmoA
g__Methylocapsa PmoA
g__Methylothermus PmoA
g__Methylomarinovum PmoA
g__Hydrogenophaga Unknown
s__Crenothrix polyspora PmoA
Candidatus Methylomirabilis oxyfera PmoA
1116472.MGMO_167c00070 PxmA
743836.AYNA01000128_gene1400 PxmA
1116472.MGMO_123c00040 PxmA
697282.Mettu_1055 PxmA
686340.Metal_1434 PxmA
ACE95886 PxmA
ACE95893 PxmA
ACE95890 PxmA
AEF15859 PxmA
tr|A0A1R4H2I8|A0A1R4H2I8_9GAMM PxmA
To create annotate the XmoA reference sequences as these more resolved orthologous groups we will use treesapp package
.
The specific reference package attribute that is modified is “feature_annotations”.
The treesapp package edit
subcommand only allows for a single attribute to be specified.
If the “feature_annotations” attribute is being edited the next command-line parameter indicates the name of the feature-annotation to edit.
- Add a new reference package feature annotation called CuMMO that will store the type of CuMMO for each reference XmoA sequence.
Include
--overwrite
to write the updated reference package to the same file.
treesapp package edit \
\
feature_annotations CuMMO -r XmoA_MAG_update/final_outputs/XmoA_build.pkl \
--taxa_map XmoA_taxonomy-phenotype_map.tsv \
--overwrite
5.4.1 Annotate features
With the CuMMO feature annotation included in the reference package, we can re-classify the PmoA and AmoA sequences from UniProt onto the MAG-updated reference package with treesapp assign
.
treesapp assign \
-n 4 \
--trim_align \
--refpkg_dir XmoA_MAG_update/final_outputs/ \
--fastx_input UniProt_XmoA.fasta \
--output UniProt_XmoA_update_assign/
With the clade colours file in the tutorial directory, it is now simple to assign each of the classified query sequences to its respective functional guild with treesapp layer
.
treesapp layer \
--refpkg_dir XmoA_MAG_update/final_outputs/ \
--treesapp_output UniProt_XmoA_update_assign/
The output of this command is UniProt_XmoA_update_assign/final_outputs/layered_classifications.tsv
.
It contains an extra column for the functional annotations but is otherwise identical to the original classifications.tsv
file.
You can view the first 10 lines by using the Unix command head
:
head UniProt_XmoA_update_assign/final_outputs/layered_classifications.tsv
5.4.2 Colouring phylogenies in iTOL
TreeSAPP can also create iTOL-compatible annotation files for colouring the phylogeny.
treesapp colour \
--output_dir XmoA_MAG_update/final_outputs/ \
--refpkg_path XmoA_MAG_update/final_outputs/XmoA_build.pkl \
--attribute CuMMO \
--unknown_colour grey
The two output files —XmoA_CuMMO_colours_style.txt
and XmoA_CuMMO_colour_strip.txt
— can be used to bring colour to the JPlace file in iTOL.
The default colour palette used is “BrBG” from colorbrewer.org but this can be changed by the --palette
argument to any other colorbrewer palette.
Let’s copy these two files into the UniProt_XmoA_update_assign/iTOL_output
directory and transfer them to <path to ts_tutorial_local>
on your (local) computer.
cp XmoA_MAG_update/final_outputs/XmoA_CuMMO_*.txt UniProt_XmoA_update_assign/iTOL_output/
-
On your computer: Transfer
UniProt_XmoA_update_assign/iTOL_output/
from the server to your local directoryts_tutorial_local
.
scp -r root@<server address>:/data/<user>/tutorial/UniProt_XmoA_update_assign/iTOL_output <path to ts_tutorial_local>
Navigate to https://itol.embl.de/ using a web browser and sign in using the group’s username and password.
Upload the file
XmoA_complete_profile.jplace
. You can do this using either your computer’s file browser then navigating to theiTOL_output
directory and clicking-and-dragging the file, or iTOL’s file uploader—the “Upload tree files” button.Next, navigate to the page displaying the phylogeny and click-and-drag the file
XmoA_labels.txt
from your file browser into the iTOL window. This should convert the TreeSAPP identifiers (e.g. 1_XmoA) to more useful descriptions with the organism name and accession for each leaf node.The two files
XmoA_CuMMO_colours_style.txt
andXmoA_CuMMO_colour_strip.txt
can be dragged into the iTOL browser window to bring colour to the XmoA tree.Finally, turn on the “Phylogenetic Placements” dataset (right of the screen). The figure should look very similar to Figure 5.1