16.2.22 UTAP: ATAC-Seq pipeline guidelines

The ATAC-Seq (Assay for transposase-accessible chromatin using sequencing)  pipeline facilitates the analysis of ATAC-Seq data in order to capture open and accessible regions of chromatin across the genome. The pipeline receives paired-end reads as input, performs quality control and pre-processing steps, and maps the reads onto mouse or human genomes. Nucleosome-free fragments are selected, peaks are identified and analyzed.

This pipeline is available in https://utap.wexac.weizmann.ac.il/

Analysis setup

The ATAC-Seq pipeline run on the Wexac cluster.  In order to run a new analysis, your fastq files must be in your Collaboration folder on Wexac, in the correct structure (UTAP requirements & description).     

To run a new analysis from existing files in the Collaboration folder login to https://utap.wexac.weizmann.ac.il using your Weizmann userID and password and click on "Run pipeline":



Click "Run pipeline"

  1. Select the type of analysis you desire:
  2. Select the input folder:

Browse within your Collaboration folder and select the folder containing your sample files (fastq). Fastq files must be organized, within the selected folder (root folder), into subfolders as shown below. 

Note that if you wish to go up one level (or more), click the desired folder level on the path at the top of the window.


UTAP user interface input information required: 

  • Select the output folder

If you want the output folder to be different from the one automatically filled in (based on the selected input folder), overwrite the output folder name in the text box associated with the screen’s Output folder: field with your name of choice. 

  • Additional setups

Fill in a project name, and select the reference genome to which the reads will be aligned.

For mouse, you have the option to choose a TSS file, containing either a broad or narrow definition of the genes’ TSS (Transcription Start Site) regions (based on Nature. 2016 Jun 30;534(7609):652-7 - The landscape of accessible chromatin in mammalian preimplantation embryos).

Default adapters are P5 and P7 adapters of the TruSeq protocol.

By default, the ATAC-Seq pipeline runs without a “naked-DNA” control.

  • Run the pipeline

Finally, click the “Run analysis” button to submit the analysis. Once the analysis is completed, you will be notified by email (usually after a few hours).
All of the output files will be stored in your Wexac Collaboration folder.
At this point no report is being created.

Analysis workflow

Pipeline steps and associated tools:

  1. Reads trimming: Reads are quality trimmed using cutadapt. In this process primers corresponding to the TruSeq protocol are removed (output is in folder 1).
  2. Quality control: Reads quality control is evaluated using FastQC (in output folder 2), and a report file, containing quality reports for all of the samples, is generated using multiQC (in output folder 3).
  3. Mapping to genome: The quality trimmed paired-end reads are mapped to Mouse/Human genomes using Bowtie2 (output is in folder 4).
  4. Alignment filtering: Following the alignment, mitochondrial genes are removed from the analysis (using the grep command). Reads are filtered using samtools view with the parameters -F 4 to remove unmapped reads that are output by bowtie2, and with -f 0x2 -q 39 for paired end reads. Duplicated reads are removed using picard-tools. The remaining unique reads are indexed and sorted using samtools index and samtools sort. Statistics on the alignment is generated using flagstat (output is in folder 5).
  5. Reads shifting: To account for the 9-bp duplication created by DNA repair of the nick by Tn5 transposase (as described by Buenrostro et al. 2013 PMID: 24097267), aligned reads are shifted + 4 bp and − 5 bp for positive and negative strand respectively, using using bedtools bamtobed and awk commands.

  6. Select nucleosome-free fragments: fragments of length <120bp are selected using the awk command (alignments are in folder 6), and insert size distributions are plotted before and after size selection (output is in folder 8, plots after selection end with "_nucl_free").
  7. Visualization in graphs: reads coverage on gene body and around the TSS are graphically visualized using ngsplot (output is in folder 7).
  8. Read counts on TSS: for mm10 genome we count the number of reads on genes’ TSS (Transcription Start Site) regions based on, Nature. 2016 Jun 30;534(7609):652-7 ).
  9. Peak calling: Broad peaks are called using MACS2 (output is in folder 11). The resulting peaks "*_peaks_filtered.broadPeak" are filtered to exclude peaks from the blacklist (https://github.com/Boyle-Lab/Blacklist/tree/master/lists).
  10. Peak annotation: The predicted peaks are collected from all samples using multi_intersectBed and then annotated according to the corresponding genome using Homer (with default parameters). Analysis of peaks distribution in genomic regions, and around TSS is done using ChIPseeker, together with a venn diagram of overlap of peaks sets.


Output folders:

1_cutadapt

2_fastqc

3_multiqc

4_mapping

5_process_alignment

6_nucleosome_free

7_shift_bam

8_ngs_plot

9_picard_plot

10_tss_count

11_call_peak

12_peaks_annotation

13_reports


Log files (one directory above the output directory):

snakemake_stdout.txt