11 Running TreeSAPP on a server

These instructions will help you to complete the script template treesapp_analysis.sh to run your TreeSAPP analysis of the Saanich Inlet data of your assigned depth on the server and copy the output files to your local machine. To adjust the code examples below and in the script template, replace any angle brackets, e.g. <SOME TEXT>, with the missing code.

11.1 Resources for TreeSAPP analysis

  • A shell script template called treesapp_analysis.sh on Canvas.
  • Access to a server and <server address>.
  • The Saanich Inlet data located in /mnt/datasets

11.2 Checklist to write and run shell script

  • Edit the treesapp_analysis.sh script following the instructions below.

  • Using the terminal on your computer, copy the script to your group’s server using:

    scp <path to file on your computer> <user>@<server address>:<destination>
  • Make the script executable on the server (otherwise you won’t be able to run it). In the CLI and while in the directory containing your script, run the following command:

    chmod u+x treesapp_analysis.sh
  • Still in the CLI, start a new session with screen -S <session name>.

  • In the new session, start the script with ./treesapp_analysis.sh.

  • After the script has run (will take at most 8 hours), locate the final_output/marker_contig_map.tsv files in the TreeSAPP output directories and copy each one to your local computer using scp.

After these steps, you will continue with the analysis of your results in R on your local computer using R and the provided treesapp_analysis.R script template.

11.3 Writing the shell script

In the treesapp_analysis.sh script, we will define some variables to make our code more readable when running commands, give some feedback to the user as the script runs, and finally run the TreeSAPP workflow. Edit the script either on your local computer or after you have copied it to the server using the text editor of your choice (e.g. RStudio on your local machine or nano on the server).

11.3.1 Defining the shell

The script template starts with:

#! bin/bash

and defines the shell used to interpret this script and where the shell is located (i.e. the bash shell in the /bin directory). You do not need to modify anything about that step.

11.3.2 Setting a variable for your depth

Let’s define our first variable with the name “depth”. Replace <your depth> with the depth your group has been assigned to (e.g. 100).

depth="<your  depth>"

11.3.3 Providing user feedback

It is good practice to provide feedback to the user while a script is running by printing messages to the terminal. Let’s print a welcome message to the terminal using the echo command to informing the user what the script is going to do. This is a great opportunity to call the depth variable that we have just defined.

Hint: Remember, to call a variable, you have to prepend its name with a $, e.g. if the variable’s name is some_variable then you call it with $some_variable.

echo "TreeSAPP analysis of Saanich Inlet Data at Depth <call depth variable> m"

11.3.4 Defining variable for input directory

Define a variable for path to the input data directory. Let’s call the variable input_dir.

<variable_name>="<path to data directory>"

11.3.5 Creating output directory

Define a variable for the TreeSAPP output directory and choose a name for it yourself. The directory should be located in your home directory (which is also shared between your host and container). To refer to your home directory, use the environmental variable HOME which is automatically defined by the shell (i.e. you do not need to first define it yourself). Finally, create the output directory.

<set output directory variable>
<create the ouput directory>

11.3.6 Reporting variables to user

Let’s provide the user some feedback where TreeSAPP is going to save its output and what bind paths have been set.

<print command> "<Some text explaining what variables have been set>"

11.3.7 Looping over multiple files

For-loops minimize the amount of redundant code and ensure all data is processed similarly. Below is an example for-loop that could be used to classify metagenome sequences with treesapp assign.

for f in <path/to/assembly_files>/SI072_*m_contig.fa
  do sample=$( basename $f | sed 's/_contig.fa//g')
  treesapp assign -i $f \
  --refpkg_dir <path_to_refpkgs/> \
  --output <path/to/outputs>/${sample}_assign \
  --trim_align -n 8
done

This loop iterates over the seven metagenome assembly files that match the pattern ’SI072_*m_contig.fa’, where the asterisk represents any characters one or more times. Effectively, it replaces seven nearly identical treesapp assign commands. A variable called ‘sample’ is created by calling basename on the file path to return the filename, and using the sed (stream editor) to remove the assembly-specific suffix ’_contig.fa’. The name of the output directory is based on this ‘sample’ variable to ensure the directories are unique across metagenomes.

This for-loop can be modified to loop over treesapp abundance commands as well (assuming you’ve followed the above naming convention for the treesapp assign output directory):

for f in <path/to/outputs>/SI072_*assign
  do sample=$( basename $f | sed 's/_assign//g')
  treesapp abundance \
  --treesapp_output $f \
  --reads <path/to>/MetaG_Trim_QC_Reads/${sample}_pe.1.fq.gz \
  --reverse <path/to>/MetaG_Trim_QC_Reads/${sample}_pe.2.fq.gz \
  -n 8 \
  --report update
done

11.3.8 Listing TreeSAPP version

Let’s print the information about the configuration of TreeSAPP.

 treesapp info

Hint: When we have very long commands in a script, we can introduce line breaks using \ which are best used right after a space in the command (see example above after the word “following”).

11.3.9 Executing TreeSAPP analysis

In the terminal (i.e. not as part of this script), check the documentation for treesapp assign, the command you will use for the automated reconstruction of a nutrient cycle (visit its wikipage to learn more).

treesapp assign -h

Looking at the TreeSAPP documentation, determine the flags you need in addition to the ones already included below.

Time to run our analysis! You will have to identify the input file(s) that you need to process. You will run the analysis once for metagenomic (two files, forward reads are in the file that contains .1 in its name, and reverse reads in file the with .2) and once for metatranscriptomic reads (a single file that contains both forward and reverse reads) and align them to the assembled contigs. All reads are pair-end.

You will need to provide the following settings using flags:

  • Which assembly file to use.
  • To use 8 CPUs for the analysis.
  • To do the analysis for all marker genes provided in TreeSAPP.
  • To output a verbose runtime log.
  • Where to save the results.
  • To calculate the rpkm values for detected sequences.
  • Which reads to use and what type they are (i.e. pair-end or single-end).
  • To delete any intermediate files generated by the analysis.
  • To use position masking of multiple sequence alignment.

Hints:

  • Will you ouput the data for metagenomic and metatranscriptomic analysis to the same folder? Does the folder already exist?
  • If you want the default value of a flag, then you do not need to provide it.
  • Where is the input data located in the singularity container?
  • Metatrascriptomic reads are NOT rRNA.
  • In scripts you often refer to variables while combined with additional text. Let’s say you have defined the variable file_name="some_file" to refer to the file some_file. Now you want to save a modified version of the file as some_file_modified. If you would write $file_name_modified, then the shell does not know that you just want to call a variable called $file_name and instead will assume you are calling a variable called $file_name_modified and will return an empty string. To indicate the beginning and end of a variable name, use curly braces. In the above example, you would use ${file_name}_modified.

Optional bonus challenge:
Earlier we defined the depth variable. Could you use it in some way when you define which assembly and reads you are using in the analysis?

# Metagenomic analysis
treesapp assign \
-i <input data dir><file path to your assembly> \
-m <choose the correct option> \
<number of CPU threads, set to 8>
<your output directory> \
<your remaining settings>

# Metatranscriptomic analysis
<you are on your own now :)>

11.3.10 Concatenating classification tables

Once the treesapp assign, treesapp abundance and treesapp layer commands have completed successfully it is convenient to store classification data from all samples in a single table. This one table can then be downloaded from the server and loaded into R.

You can combine all tables with two lines of code in bash, using cat, head, and tail. The name of the combined table for this example is ‘SI072_classifications.tsv’. This first line writes just the header from one of the classification tables to the combined table while the second writes the rows (i.e. not including the column names) to the combined table. This ensures the final table is properly formatted, i.e., one row containing the column names and the remaining rows containing data.

cat /data/SI072_*m_assign/final_outputs/classifications.tsv | head -n 1 >SI072_classifications.tsv
tail -q -n +2 /data/SI072_*m_assign/final_outputs/classifications.tsv >>SI072_classifications.tsv

The above command will work for both the classifications.tsv file created by treesapp assign or the layered_classifications.tsv file created by treesapp layer if the file name is changed from ‘classifications.tsv’ to ‘layered_classifications.tsv’ in the commands above.