7 RNA-seq alignment with STAR


7.1 STAR workflow

Normally, you would download the mouse genomic data (in this case, the GRCm38/mm10 FASTA and GTF files) from Ensembl, but we have prepared these for you in advance at /projects/micb405/analysis/STAR_tutorial. Create a new working directory and a subdirectory for your STAR index:
```
mkdir ~/star && cd ~/star
mkdir STARIndex
```

Before you begin your alignment, STAR must build its own index from the genomic information you provide. Generate a STAR index from the mm10 FASTA and GTF files, then proceed to the next step (downloading the FASTQ files) while it runs:

```
STAR \
  --runMode genomeGenerate \
  --genomeDir STARIndex \
  --genomeFastaFiles /projects/micb405/analysis/STAR_tutorial/Mus_musculus.GRCm38.dna.primary_assembly.fa \
  --sjdbGTFfile /projects/micb405/analysis/STAR_tutorial/Mus_musculus.GRCm38.84.gtf \
  --sjdbOverhang 49 \
  --runThreadN 16
```
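
The STAR manual recommends setting `--sjdbOverhang` to the read length minus 1, so the value of 49 above presumably corresponds to 50 bp reads. If you are unsure of the read length in your own files, the following is a minimal sketch for checking it. The function name `suggest_overhang` is hypothetical, and the sketch assumes a gzipped, four-line-per-record FASTQ file; it only inspects the first 10,000 reads.

```shell
#!/usr/bin/env bash
# Print a suggested --sjdbOverhang value (longest read length minus 1)
# for a gzipped FASTQ file. Only the first 10,000 reads (40,000 lines)
# are inspected to keep the check fast.
suggest_overhang() {
  fastq="$1"
  gzip -dc "$fastq" | head -n 40000 |
    awk 'NR % 4 == 2 { if (length($0) > max) max = length($0) }
         END { print max - 1 }'
}

# Run the check only when a file is given on the command line.
if [ "$#" -gt 0 ]; then
  suggest_overhang "$1"
fi
```

If the printed value differs from 49, consider regenerating the index with the matching `--sjdbOverhang`.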

Download healthy tissue RNA-seq FASTQ files from the paper “An RNA-Seq atlas of gene expression in mouse and rat normal tissues”. The files are available through the associated ArrayExpress accession, which is linked from the data citations on the article’s NCBI page.

Discuss with your group what you could investigate given the data available, then download the FASTQ files that will allow you to do so. Recall that RNA-seq indicates relative expression over the gene intervals you provide, which can later be quantified and normalized with HTSeq and DESeq2.
```
# In my case, I'm downloading 4 tissue types from the first individual mouse in order to observe differential gene expression in these areas.
# These are just an example! 
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR213/000/ERR2130640/ERR2130640.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR213/009/ERR2130649/ERR2130649.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR213/003/ERR2130623/ERR2130623.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR213/004/ERR2130614/ERR2130614.fastq.gz
```
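
The four URLs above share a pattern: the first six characters of the run accession, then `00` plus its last digit, then the accession itself. If your group picks many samples, the URLs can be generated rather than typed. This is only a sketch: the function name `ena_fastq_url` is hypothetical, and it assumes the same layout holds for every ten-character accession you choose, so check the URL listed in ArrayExpress if a download fails.

```shell
#!/usr/bin/env bash
# Build the FTP URL for a single-end FASTQ file from a ten-character
# run accession, following the pattern of the example URLs above.
ena_fastq_url() {
  acc="$1"
  prefix="$(printf '%s' "$acc" | cut -c1-6)"   # e.g. ERR213
  last="$(printf '%s' "$acc" | tail -c 1)"     # last digit of the accession
  printf 'ftp://ftp.sra.ebi.ac.uk/vol1/fastq/%s/00%s/%s/%s.fastq.gz\n' \
    "$prefix" "$last" "$acc" "$acc"
}

# Example accessions from the tutorial; replace with your own choices.
for acc in ERR2130640 ERR2130649 ERR2130623 ERR2130614; do
  ena_fastq_url "$acc"
done
```

Saving the output to a file (for example `fastq_urls.txt`, a name chosen here for illustration) lets you fetch everything with `wget -c -i fastq_urls.txt`, where `-c` resumes partial downloads.
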
When the STAR index is ready, run STAR, outputting into a separate directory for each sample you wish to align. You can control the output location with the string provided to --outFileNamePrefix. You must create these directories beforehand; STAR does not create them for you. Since you won’t have time for the whole process to run today, create a script that runs this STAR command for each of your samples, and run it in a way that keeps it going after you exit your shell (Group Assignment part #2 below).

Be sure that --readFilesIn, --outFileNamePrefix, and --outSAMattrRGline change for each of your samples.
```
STAR \
  --genomeDir STARIndex/ \
  --readFilesIn sample1.fastq.gz \
  --outFileNamePrefix sample1/sample1.fastq.gz \
  --runThreadN 8 \
  --limitBAMsortRAM 60000000000 \
  --outSAMattrRGline ID:sample1.fastq.gz SM:sample1.fastq.gz \
  --outBAMsortingThreadN 8 \
  --outSAMtype BAM SortedByCoordinate \
  --outSAMunmapped Within \
  --outSAMstrandField intronMotif \
  --readFilesCommand zcat \
  --chimSegmentMin 20 \
  --genomeLoad NoSharedMemory
```
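
One way to avoid repeating this command per sample is to wrap it in a function and loop over the FASTQ files. The following is a sketch under several assumptions, not the expected assignment answer: the script name `align_all.sh`, the helper names, and the `STAR_BIN` variable are hypothetical, the FASTQ files are assumed to sit in the current directory, and the output prefix `sample/sample_` is a simplification of the one shown above.

```shell
#!/usr/bin/env bash
# align_all.sh (hypothetical name): run STAR once per FASTQ file.
set -eu

# Derive a sample name from a FASTQ path,
# e.g. data/ERR2130640.fastq.gz -> ERR2130640
sample_name() {
  basename "$1" .fastq.gz
}

# Align one FASTQ file into its own output directory.
# STAR_BIN is an illustrative override; it defaults to STAR.
align_one() {
  fastq="$1"
  sample="$(sample_name "$fastq")"
  mkdir -p "$sample"   # STAR does not create the directory itself
  "${STAR_BIN:-STAR}" \
    --genomeDir STARIndex/ \
    --readFilesIn "$fastq" \
    --outFileNamePrefix "$sample/${sample}_" \
    --runThreadN 8 \
    --limitBAMsortRAM 60000000000 \
    --outSAMattrRGline "ID:$sample" "SM:$sample" \
    --outBAMsortingThreadN 8 \
    --outSAMtype BAM SortedByCoordinate \
    --outSAMunmapped Within \
    --outSAMstrandField intronMotif \
    --readFilesCommand zcat \
    --chimSegmentMin 20 \
    --genomeLoad NoSharedMemory
}

for fastq in *.fastq.gz; do
  [ -e "$fastq" ] || continue   # skip the literal pattern if nothing matches
  align_one "$fastq"
done
```

Running `STAR_BIN=echo bash align_all.sh` prints the commands without executing them, which is a cheap way to check the per-sample arguments. To keep the real run alive after you log out, one common option is `nohup bash align_all.sh > align_all.log 2>&1 &`; a `screen` or `tmux` session would work as well.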

7.2 Group assignment

1. Which samples from the paper did you choose, and what did you want to test with them?
2. Create a script that performs the STAR alignment for your files, avoiding excessive repetition, and can be run so that it is not interrupted when you close your connection with the server.

7.3 Resources

STAR documentation
