6 ChIP-seq analysis
6.1 Setup
Make sure you have IGV installed on your laptop
Check the
macs2
is installed and in your$PATH
by runningwhich macs2
-
If you do not see the folder where
macs2
is installed, run: -
Set up a new folder to work with in the tutorial:
Check the files needed for the tutorial using
ls -lh /projects/micb405/data/mouse/chip_tutorial
. These are the ChIP-seq reads that constitute the main input of your analysis.
6.2 Suggested workflow
The (recommended) tools that you will working with to complete this project include:
- BWA
- SAMtools
- MACS2
- IGV (installed on your computer)
6.3 Aligning using BWA-MEM
We will now process and map the reads using bwa mem
. Note the run time command syntax for bwa mem
is different (easier and faster) from bwa aln
introduced in class. The alignment process will take ~20 min to complete depending on server load. It is highly recommended to construct your commands within a shell script and generated associated log files by redirecting STDOUT as described in class. To make your subsequent commands easier, remember that you can define paths to commonly used files/resources in your script or shell as variables. For example, you can define the path to where the indexed mm10 genome is and the directory to where the tutorial input files are stored (see below). Best practice is to use the absolute rather than relative path when defining these variables, as that will ensure that your script will behave similarly no matter where it is located on your filesystem.
GENOME=/projects/micb405/analysis/references/mm10/mm10.fa
DATA=/projects/micb405/data/mouse/chip_tutorial
You only need to export
a variable if you want to add it to your current shell and have it persist between sessions. Since a script runs a subshell in a continuous session with its own unique variables, it’s not necessary to export
variables within it.
As should now be familiar, once defined you can call these paths in your script/shell by entering the special character $
followed by the variable name as shown below. The following command will run bwa mem
and stream the sam
file into samtools view
, which will then automatically convert it into a bam
file. You can use this or design your own.
# Input a.k.a. control alignment
bwa mem \
-t 8 \
$GENOME \
$DATA/Naive_Input_1.fastq \
$DATA/Naive_Input_2.fastq \
| samtools view -h -b - -o Naive_Input.bam
# Treatment a.k.a. immunoprecipitated a.k.a. IP alignment
bwa mem \
-t 8 \
$GENOME \
$DATA/Naive_H3K27ac_1.fastq \
$DATA/Naive_H3K27ac_2.fastq \
| samtools view -h -b - -o Naive_H3K27ac.bam
Note that the -t 8
is to do multithreaded processing to improve speed. The $GENOME
specifies the location of reference genome to use. The $DATA/Naive_H3K27ac_1.fastq
$DATA/Naive_H3K27ac_2.fastq
specifies the location of the reads. This step will take some time. Expect the program to run for about 20 mins or longer depending on the server load. You can add nohup
in front of each command, or in in front of your command running the script with these commands, in order to assure that if you are disconnected from the server, these analysis steps do not stop.
6.5 Sort, mark duplicates, and index alignments using SAMtools
The output from bwa mem
is an unsorted SAM file. Downstream tools require a sorted BAM file as input, so an intermediate step is to sort and index the alignment to facilitate future steps, similar to what was covered in our samtools/bcftools tutorial.
# Tag mates
samtools fixmate -m Naive_Input.bam Naive_Input.fixmate.bam
# Position sort for markdup
samtools sort Naive_Input.fixmate.bam -o Naive_Input.fixmate.sort.bam
# Mark duplicates. Note that you don't have to remove them as this is done by MACS2 automatically
samtools markdup -S Naive_Input.fixmate.sort.bam Naive_Input.markdup.bam
# Sort again
samtools sort Naive_Input.markdup.bam -o Naive_Input.final.bam
# Index BAM
samtools index Naive_Input.final.bam
Repeat this for the treatment file as well.
6.6 Clean-up intermediate files
Now is the time to clean-up any intermediate files that are not needed downstream, keeping only the .final.bam
files needed by MACS2. When you are finished cleaning your files the working directory should be ~2.5G total.
6.7 Call peaks using MACS2
Before doing peak calling, it is necessary to have a sample data set, as well as a control data set. While MACS2 offers the option to perform both regular (narrow) and broad peak-calling, in this case, we will only be doing regular peak calling, however the instructions are similar for broad peak-calling just add the additional flag --broad
to the command.
We will call peaks on sample Naive_H3K27ac.final.bam
and we will be using the Naive_Input.final.bam
as a control (background). Once you have the sorted, duplicate-marked bam
files for all samples, you can perform the MACS2 peakcalling like this:
macs2 callpeak \
-t Naive_H3K27ac.final.bam \
-c Naive_Input.final.bam \
-f BAMPE \
-g mm \
-n Naive_H3K27ac \
-B \
-q 0.01
- The
-t
is the treatment or IP (ImmunoPrecipitated) aligned and dup-marked bam file - The
-c
is the control or INPUT aligned and dup-marked bam file - The
-f
indicates the input file type (BAM paired-end or BAMPE in this case) - The
-g
indicates the effective genome size (here precomputed for mm10 and provided asmm
) - The
-n
is the prefix you will give your output files - The
-B
indicates that the program should create a BedGraph (.bdg
) file with the results - The
-q
is the FDR cutoff for which to call peaks
6.8 Check files
At the end, you should have something similar to:
user01@orca01:~/ChIP_tutorial$ ls -lh
total 3.2G
-rw-r--r-- 1 mhirst orca_users 958M Oct 4 15:20 Naive_H3K27ac.final.bam
-rw-r--r-- 1 mhirst orca_users 5.6M Oct 4 15:20 Naive_H3K27ac.final.bam.bai
-rw-r--r-- 1 mhirst orca_users 815M Oct 4 15:57 Naive_H3K27ac_control_lambda.bdg
-rw-r--r-- 1 mhirst orca_users 1.1M Oct 4 15:57 Naive_H3K27ac_peaks.narrowPeak
-rw-r--r-- 1 mhirst orca_users 1.3M Oct 4 15:57 Naive_H3K27ac_peaks.xls
-rw-r--r-- 1 mhirst orca_users 754K Oct 4 15:57 Naive_H3K27ac_summits.bed
-rw-r--r-- 1 mhirst orca_users 310M Oct 4 15:57 Naive_H3K27ac_treat_pileup.bdg
-rw-r--r-- 1 mhirst orca_users 1.1G Oct 4 15:30 Naive_Input.final.bam
-rw-r--r-- 1 mhirst orca_users 5.8M Oct 4 15:30 Naive_Input.final.bam.bai
-rw------- 1 mhirst orca_users 58 Oct 4 05:36 nohup.out
6.9 Final Processing
Convert to bigWig and sort (on the Orca server).
wget http://hgdownload.cse.ucsc.edu/goldenPath/mm10/bigZips/mm10.chrom.sizes
bedtools sort -i Naive_H3K27ac_treat_pileup.bdg > Naive_H3K27ac_treat_pileup.sort.bdg
bedtools sort -i Naive_H3K27ac_control_lambda.bdg > Naive_H3K27ac_control_lambda.sort.bdg
bedGraphToBigWig Naive_H3K27ac_treat_pileup.sort.bdg mm10.chrom.sizes Naive_H3K27ac_treat_pileup.sort.bw
bedGraphToBigWig Naive_H3K27ac_control_lambda.sort.bdg mm10.chrom.sizes Naive_H3K27ac_control_lambda.sort.bw
6.10 Transfer all relevant output files to your local computer
Using a different terminal window that is not connected to the server, retrieve the bigWig and narrowPeak files
Pre-run analysis can be found here:
You can replace this folder for your own ~/ChIP_tutorial
folder if your script doesn’t finish running in time.
Launch IGV and load in the bigWigs and bed (narrowPeak) files.
If you haven’t installed it yet, please get it here IGV download. Make sure you are loading the Mouse (mm10) reference genome by clicking on the drop-down menu on the top left hand corner (Genomes > Load Genome from Server > mm10).
CONGRATULATIONS! You have now completed your first ChIP-seq analysis.
Questions:
- What do each of the files outputted by MACS2 represent?
- Find one H3K27ac peak, either by displaying a text file in your terminal or looking at your files through IGV. You can also use the text file to file a location to jump to within IGV. What is its chromosomal position?
- How many total peaks were called in this analysis?
- With these peaks on hand, what is one downstream analysis step you can perform that might rely on a set of called peaks?