16.2.22 UTAP: ChIP-seq pipeline guidelines - (almost done: JL please add comments from NGS meeting)
Pipeline website: http://utap.wexac.weizmann.ac.il
The ChIP-seq (Chromatin Immuno-Precipitation followed by Sequencing) pipeline facilitates the analysis of ChIP-seq data in order to identify genome-wide DNA binding sites for transcription factors and other proteins. The pipeline receives single or paired-end reads as input (the type of input is automatically determined by the number of fastq files generated per sample), performs quality control and pre-processing steps, and maps the reads onto mouse or human genomes. Peak analysis is then performed (after some post-processing), on the identified DNA binding fragments, and significant peaks (as compared to a control background if present) are selected and analyzed.
Before you start:
This pipeline runs on the Wexac cluster.
Please prepare the following in advance:
- An account (userID) on Wexac, via your department administrator.
- A "Collaboration" folder within your lab folder on Wexac, with read and write permission for Bioinformatics Unit staff. This must be set up by the computing center (hpc@weizmann.ac.il).
- Sufficient free storage space on Wexac (> 400Gb), via your department administrator.
Setting up a new CHIP-seq analysis
In order to run a new transcriptome analysis, you must first transfer demultiplexed sequencing data (fastq files) to your Collaboration folder. Within the Collaboration folder, the files must conform to the directory structure described below.
Then, login to utap.wexac.weizmann.ac.il via Firefox or Chrome (the pipeline is NOT compatible with Internet Explorer) using your Weizmann userID and password and click on "Run pipeline":
- Click on CHIP-seq in the Choose pipeline box:
2. Provide the name of the input folder:
Browse within your Collaboration folder, and select the folder containing your sample (fastq) files. Fastq files must be organized, within the selected folder (depicted as root_folder in the example below), into subfolders as shown below.
Note that if you wish to go up one level (or more), click the desired folder level on the path at the top of the folder-browsing window.
Fastq file name conventions: Fastq file names must start with the same sample name as the subfolders, and end with "_R1.fastq" (or "_R1.fastq.gz") for single-read data. For paired-end data, corresponding files must exist that are IDENTICAL in their name, but contain the suffix "_R2.fastq" (or "_R2.fastq.gz") instead of "_R1.fastq" (or "_R1.fastq.gz").
For example:
- root_folder
- sample1_R1.fastq
- sample1_R2.fastq (must exist in paired-end)
- sample2_R1.fastq.gz
- sample2_R2.fastq.gz (must exist in paired-end)
- sample1
- sample2
The pipeline also supports the fastq file format conventions _S*_L00*_R1.fastq or _S*_L00*_R1_0*.fastq.
For example:
- root_folder
- sample1_S0_L001_R1_001.fastq
- sample1_S0_L001_R1_002.fastq
- sample1_S0_L002_R1_001.fastq
- sample1_S0_L002_R1_002.fastq
- sample1_S0_L001_R2_001.fastq (must exist in paired-end)
- sample1_S0_L001_R2_002.fastq (must exist in paired-end)
- sample1_S0_L002_R2_001.fastq (must exist in paired-end)
- sample1_S0_L002_R2_002.fastq (must exist in paired-end)
- sample2_S0_L001_R1_001.fastq
- sample2_S0_L001_R1_002.fastq
- sample2_S0_L002_R1_001.fastq
- sample2_S0_L002_R1_002.fastq
- sample2_S0_L001_R2_001.fastq (must exist in paired-end)
- sample2_S0_L001_R2_002.fastq (must exist in paired-end)
- sample2_S0_L002_R2_001.fastq (must exist in paired-end)
- sample2_S0_L002_R2_002.fastq (must exist in paired-end)
- sample1
- sample2
3. Optionally change the name of the output folder
If you want the output folder to be different from the one automatically filled in (based on the selected input folder), overwrite the output folder name in the text box associated with the screen’s Output folder: field with your name of choice.
Additional setups
Fill in a project name, and select the reference genome to which the reads will be aligned.
Default adapters are the P5 and P7 adapters of the Tru-seq protocol.
4. Run with control
Chose “run with control” (in the drop-down menu associated with the run with control line) in order to enable comparison of each treatment with its corresponding control. When selecting this option, a new group of control and treatment boxes will open. Organize the samples by selecting them and using the arrows to move items to the appropriate categories.
If you have more than one treatment against control, press on the "Add group" button as shown in the figure below.
Each group must contain at least one sample in each of the treatment and control boxes.
When moving a sample to the control box, a copy of the sample is retained, so that you can use it again in a new group.
If you move more than one sample to the treatment or control box, the pipeline will automatically combine the samples into one big treatment/control sample.
Important: All of the pipeline steps (mapping, counts etc.) will be run (only) on the samples in the treatment/control boxes.
5. Run the pipeline
Finally, click the “Run analysis” button to submit the analysis. You will be notified by email when the analysis is ready (usually after a few hours). All of its output files will be stored in the relevant subfolders within your wexac Collaboration folder.
Analysis workflow
Pipeline steps and associated tools:
- Quality control: Reads are trimmed using cutadapt (DOI: 14806/ej.17.1.200) (with the parameters --times 2 -q 20 -m 25). In this process, primers corresponding to the Tru-seq protocol are removed.
- Quality control: Reads quality control is evaluated using FastQC (with the parameter --cassava). A report file, containing quality results for all of the samples is generated using multiQC.
- Mapping to genomes: The quality trimmed reads are mapped to Mouse/Human genomes: /shareDB/iGenomes/Mus_musculus/UCSC/mm10/Sequence/Bowtie2Index/genome /shareDB/iGenomes/Homo_sapiens/UCSC/hg19/Sequence/Bowtie2Index/genome, /shareDB/iGenomes/Homo_sapiens/UCSC/hg38/Sequence/Bowtie2Index/genome respectively, using Bowtie2 (doi: 10.1038/nmeth.1923.) (with the parameters --local for all analyses and, in addition, -X 2000 for paired end analyses). Refseq annotation is provided for the mapped genes.
- Alignment filtering: Following the alignment, reads are filtered using samtools view with the parameters -F 4 to remove unmapped reads that are output by bowtie2, -q 39 and with -f 0x2 for paired end reads. The remaining unique reads are then indexed and sorted, using samtools index and samtools sort.
- Generation of statistics on the alignment using flagstat.
- Visualization in graphs: The reads are graphically visualized using ngsplot (with the parameters -G -R genebody -C -O samples -D refseq -L 50000).
- Peak calling: Significant chip regions (peaks) are evaluated and compared to control samples if present using macs2 callpeak (with the parameters --bw 300 -B -f --SPMR -g -keep-dup auto -q 0.01 for all analyses, BAMPE --nomodel for paired end analyses, and BAM for single end analyses). The resulting peaks "*_peaks_filtered.broadPeak" are filtered to exclude peaks from the blacklist (https://github.com/Boyle-Lab/Blacklist/tree/master/lists).
- Conversion to BigWig format: Files containing the predicted peaks coordinates in BedGraph format are converted to BigWig format using bedtools slop (with the parameters -g -b 0), bedClip stdin and bedGraphToBigWig (with default parameters)
- Peak annotation: The predicted peaks are collected from all samples using multiIntersectBed and then annotated according to the corresponding genome using Homer (with default parameters). Analysis of peaks distribution in genomic regions, and around TSS is done using ChipSeeker, together with a Venn diagram of overlap of peaks sets.
Output folders:
0_concatenating_fastq
1_cutadapt
2_fastqc
3_multiQC
4_mapping
5_filtered_alignment
6_peaks_prediction
7_peaks_annotation
8_graphs
9_BigWig
10_reports
Log files (stored one directory above the output directory)
snakemake_stdout.txt (stored one directory above the output directory)