3 Acquiring reference sequence data

3.1 Introduction and goals

Reference data in the context of biological sequence analysis are those that come from well-curated open-source databases. These data have been used to create all TreeSAPP reference packages to date, and they should be used to create all future reference packages. But what does “well-curated” mean, and how can you identify data that would meet this standard?

The goals of this tutorial are:

  1. Develop a sense of biological sequence database curation quality
  2. Learn how to subset and download sequences from the EggNOG and UniProt databases

3.2 The golden standard

The field of genomics is blessed with an abundance of reference data. In fact, it is not uncommon for biologists and bioinformaticians alike to gripe over the increased run-time of computational analyses caused by steps that include comparisons to such large reference databases. Not all of these data are, however, of the same quality. The vast majority have been automatically curated using algorithms with known shortcomings, and only a small portion have been manually curated with verified functional characterization. These latter data are the ones we consider well-curated or golden. More details on choosing databases for retrieving sequences can be found on the TreeSAPP wiki.

3.4 XmoA example

To show you how to access these data, we will download the alpha subunit sequences of the enzymes particulate methane monooxygenase (PmoA) and ammonia monooxygenase (AmoA) that are used in the treesapp create tutorial from the EggNOG and UniProt databases. Since XmoA is not a recognized identifier, but a convenient label that reflects the operon structure (xmoCAB) of homologous oxidoreductase enzymes, we will need to search in databases multiple times.

3.4.1 EggNOG

Go to the EggNOG website at http://eggnog5.embl.de/#/app/home and search for either “ENOG5028JPK” or “arCOG08676” in the search bar at the top of the page. These EggNOG ortholog identifiers correspond to PmoA/AmoA from Bacteria and AmoA from Archaea, respectively. In the image below, we’ve searched for “ENOG5028JPK”.

Click “Download” at the bottom-left of the panel and select “All XX sequences (FASTA)”, where XX is the number of sequences belonging to the orthologous group (OG). This will open a new window with all sequences from that OG in FASTA format. Download this file to a file on your computer following the format <OG name>_EggNOGv5.faa where <OG name> is the EggNOG ortholog identifier.

Repeat the above steps with the other EggNOG identifier.

3.4.2 UniProt

The UniProt database architecture is built upon standard biological attributes, in contrast to the de novo clustering and classification scheme of EggNOG. Protein sequences are therefore searchable using multiple keys covering a range of taxonomic and functional specificities. The most specific is the unique identifier of a single protein sequence (e.g. Q04507 is an AmoA sequence in Nitrosomonas europaea). More broad searchable attributes include protein names, gene names, organism, Enzyme Commission (EC) numbers, and identifiers of other databases such as PFam and TIGRFAM.

Here, we will use the gene names AmoA and PmoA to find reference sequences in UniProt. Begin by navigating to https://www.uniprot.org/ and typing “Particulate methane monooxygenase” (including the double quotation marks) into the search bar at the top of the page.

3.4.2.2 Download sequences

Downloading sequences in UniProt is quick and easy. Click the button (atop the table), ensure the format is FASTA, and click “Go” to specify where the file should be saved on your computer. Change the file name to UniProt_PmoA.fasta.

3.4.2.3 Download taxonomic lineages

UniProt sequence names are not searchable within the NCBI’s database, so software (including TreeSAPP) is unable to pair them with their corresponding taxonomic lineages like with EggNOG. Therefore, a separate table must be downloaded from UniProt and provided to TreeSAPP during reference package creation.

To create a properly formatted table, we must first change the columns in the UniProt table of search results. Begin by clicking the icon to the right of the “Download” button.

Once there, only check the boxes for organism, superkingdom, phylum, class, order, family, genus and species as shown above. Click “Save” near the top-right of the page to save your selection.

You can export this table similar to before. Click the button to select all sequences, change the format to “Tab-separated”, then click “Go” to save the table to your computer. Change the file name to UniProt_PmoA.tab.

To gather all sequences in UniProt for the XmoA reference package, repeat the previous three steps with the string “Ammonia monooxygenase alpha subunit”, Gene name [GN] “amoA”, and the same taxonomic filter. There should be >3,000 entries returned. Save the sequences and taxonomic lineages in UniProt_AmoA.fasta and UniProt_AmoA.tab, respectively.

3.4.2.4 Downloading reviewed sequences from SwissProt

As you may have noticed in the previous steps, there was a single manually-reviewed sequence for PmoA and AmoA each. In this case, there are too few sequences to build a seed reference package that could be scaffolded on to eventually construct phylogenetically comprehensive reference package. Still, once there are more than ten or so phylogenetically diverse manually-reviewed sequences, the UniProt/SwissProt database should be sufficient to construct a seed reference package. By filtering by reviewed sequenced (gold star) and following the previous two steps you can retrieve sequences that have been manually curated, indicating a high-degree of confidence in the functional annotation of those sequences. Keep this functionality in mind for future reference packages.

3.4.3 Copying data onto the server

  • On your computer: Create a new directory called ts_tutorial_local inside of your home directory by using a file browser, i.e. File Explorer (Windows 10) or Finder (macOS). Place any reference sequences in ts_tutorial_local that you identified and downloaded from the databases.

    Next, connect to your personal server using the terminal app, i.e. Windows Terminal (Windows 10) or Terminal (macOS).

    ssh root@<server address>
  • On the server: Move to your directory /data (if it is not your working directory already) and create a new directory called <user>/tutorial where <user>/ is unique within your project group.

    cd /data/
    mkdir <user>
    cd <user>
    mkdir tutorial
  • On your computer: Open a second terminal window that is connected to your local computer, then copy the reference sequences from your computer to your server.

    scp <path to ts_tutorial_local>/* root@<server address>:/data/<user>/tutorial/

3.5 Alternative sources

EggNOG and UniProt are often the most useful sources, though this leaves out the majority of biological sequence databases for no reason other than simplicity. IMG, KEGG, PFam, and other similar databases are all great but their utility is blunted because they either lack an API to retrieve lineage information, requiring a separate table to link this information, or the sequence accession (how database sequences are uniquely named) do not allow for easy referencing in the NCBI taxonomic hierarchy. Details on this table’s format and how to provide it can be found under the treesapp create wiki page.

3.6 Data access hazards

Accessing the fraction of well-curated data from multiple databases is not trivial. Perhaps the most obvious reason for this are unrelated genes with shared names. For example, the gene product of mcrB in E. coli widely refers to 5-methylcytosine-specific restriction enzyme B, but in the Archaea, it would refer to the beta subunit of methyl-coenzyme M reductase—a completely unrelated protein. This is an inevitability in a field of science as broad as biology, where scientists that are responsible for naming genes cannot possibly be aware of all gene names in circulation. Another issue can be genes or proteins with synonyms, making the search for these data more complex. In these cases, you may need to search multiple databases with several queries in order to access all the sequences belonging to the same protein family.

3.7 Validation

There are a number of quality control measures that can be taken to ensure the functional characteristics are as expected.

3.7.1 Basic

  • BLAST: The first method for validating a set of candidate reference sequences could be to use the web-based Basic Local Alignment Search Tool available through the NCBI. Submit a FASTA file containing your sequences to Protein BLAST (blastp) with an appropriate database. To further reduce the processing time, you can restrict the reference sequences to specific taxa or use different algorithms. One thing to note is the query size limit - blastp will accept only FASTA files containing fewer than 100,000 characters. So, it is a good idea to take a representative sample or cluster the sequences before submitting the job.

  • EggNOG-mapper: If you didn’t download your query sequences from the EggNOG database, a good resource to check your sequences is EggNOG mapper. It will annotate your query sequences against the EggNOG database and serve the result in several formats.

3.7.2 Advanced

  • Phylogenetic inference: Try building a phylogeny from the candidate reference sequences and see whether there are any anomolously long branches between clades.
  • hmmsearch across PFam database: Got some time to kill? Try aligning your candidate reference sequences to all profile HMMs in the PFam database! This will need to be done on the command-line and you will need to download Pfam-A.hmm.gz from the PFam ftp site.