UTAP: ATAC-seq pipeline guidelines

The ATAC-seq (assay for transposase-accessible chromatin using sequencing) pipeline analyzes ATAC-seq data in order to capture open, accessible regions of chromatin across the genome. The pipeline receives paired-end reads as input, performs quality control and pre-processing steps, and maps the reads to the mouse or human genome. After some post-processing, nucleosome-free fragments are selected, and peaks are identified and analyzed.

Pipeline website: http://utap.wexac.weizmann.ac.il

Before you start:

This pipeline runs on the Wexac cluster.
Please prepare the following in advance:

An account (userID) on Wexac, via your department administrator.
A "Collaboration" folder within your lab folder on Wexac, with read and write permission for Bioinformatics Unit staff. This must be set up by the computing center (hpc@weizmann.ac.il).
Sufficient free storage space on Wexac (more than 400 GB), via your department administrator.

Setting up a new analysis

In order to run a new ATAC-seq analysis, you must first transfer demultiplexed sequencing data (fastq files) to your Collaboration folder. Within the Collaboration folder, the files must conform to the directory structure described below.

Then, log in to utap.wexac.weizmann.ac.il via Firefox or Chrome (the pipeline is NOT compatible with Internet Explorer) using your Weizmann userID and password, and click on Run pipeline.

1. Click on ATAC-seq in the Choose pipeline box

2. Provide the name of the input folder:

Browse within your Collaboration folder and select the folder containing your sample (fastq) files. Fastq files must be organized, within the selected folder (root_folder), into subfolders as shown below.

Note that if you wish to go up one level (or more), click the desired folder level on the path at the top of the folder-browsing window.

Fastq file name conventions: Fastq file names must start with the same sample name as their subfolder, and end with "_R1.fastq" (or "_R1.fastq.gz") for single-read data. For paired-end data, a corresponding file must exist whose name is IDENTICAL except that it carries the suffix "_R2.fastq" (or "_R2.fastq.gz") instead of "_R1.fastq" (or "_R1.fastq.gz").

For example:

root_folder
    sample1
        - sample1_R1.fastq
        - sample1_R2.fastq (must exist in paired-end)
    sample2
        - sample2_R1.fastq.gz
        - sample2_R2.fastq.gz (must exist in paired-end)
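The naming rules above can be checked before submitting a run. The following is a minimal sketch in Python, not part of UTAP itself: the function names (split_fastq_name, check_sample_files, check_root_folder) are hypothetical, and it covers only the simple "_R1"/"_R2" convention, assuming each sample subfolder holds its own fastq files.

```python
from pathlib import Path


def split_fastq_name(name):
    """Split 'x_R1.fastq.gz' into ('x_R1', '.fastq.gz'); return None for non-fastq names."""
    for suffix in (".fastq.gz", ".fastq"):
        if name.endswith(suffix):
            return name[: -len(suffix)], suffix
    return None


def check_sample_files(sample, file_names, paired_end=True):
    """Return a list of problems for one sample subfolder (an empty list means OK)."""
    problems = []
    names = set(file_names)
    r1_count = 0
    for name in sorted(names):
        parts = split_fastq_name(name)
        if parts is None:
            continue  # files that are not fastq are ignored
        stem, suffix = parts
        if not name.startswith(sample):
            problems.append(f"{name}: does not start with sample name '{sample}'")
        if stem.endswith("_R1"):
            r1_count += 1
            mate = stem[:-3] + "_R2" + suffix
            if paired_end and mate not in names:
                problems.append(f"{name}: missing paired file {mate}")
        elif not stem.endswith("_R2"):
            problems.append(f"{name}: must end with _R1 or _R2 before the extension")
    if r1_count == 0:
        problems.append(f"{sample}: no _R1 fastq file found")
    return problems


def check_root_folder(root, paired_end=True):
    """Check every subfolder of root; the subfolder name is taken as the sample name."""
    report = {}
    for sub in sorted(Path(root).iterdir()):
        if sub.is_dir():
            files = [p.name for p in sub.iterdir() if p.is_file()]
            report[sub.name] = check_sample_files(sub.name, files, paired_end)
    return report
```

Calling check_root_folder on the selected input folder returns a dictionary that maps each sample name to its list of problems, so a folder is ready for upload when every list is empty.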

The pipeline also supports the fastq file name conventions _S*_L00*_R1.fastq and _S*_L00*_R1_0*.fastq.

For example:

root_folder
    sample1
        - sample1_S0_L001_R1_001.fastq
        - sample1_S0_L001_R1_002.fastq
        - sample1_S0_L002_R1_001.fastq
        - sample1_S0_L002_R1_002.fastq
        - sample1_S0_L001_R2_001.fastq (must exist in paired-end)
        - sample1_S0_L001_R2_002.fastq (must exist in paired-end)
        - sample1_S0_L002_R2_001.fastq (must exist in paired-end)
        - sample1_S0_L002_R2_002.fastq (must exist in paired-end)
    sample2
        - sample2_S0_L001_R1_001.fastq
        - sample2_S0_L001_R1_002.fastq
        - sample2_S0_L002_R1_001.fastq
        - sample2_S0_L002_R1_002.fastq
        - sample2_S0_L001_R2_001.fastq (must exist in paired-end)
        - sample2_S0_L001_R2_002.fastq (must exist in paired-end)
        - sample2_S0_L002_R2_001.fastq (must exist in paired-end)
        - sample2_S0_L002_R2_002.fastq (must exist in paired-end)
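The lane-split convention can be parsed with a regular expression. The sketch below is only an illustration and does not show how UTAP itself reads the names: the pattern ILLUMINA_NAME and the helpers parse_name and missing_mates are hypothetical, and accepting a ".gz" extension for this convention is an assumption, since the text above lists only ".fastq".

```python
import re

# sample name, sample index (S*), lane (L00*), read (R1/R2), optional chunk (0*)
ILLUMINA_NAME = re.compile(
    r"^(?P<sample>.+)_S(?P<index>\d+)_L(?P<lane>\d+)_R(?P<read>[12])"
    r"(?:_(?P<chunk>\d+))?\.fastq(?:\.gz)?$"
)


def parse_name(name):
    """Return the name's parts as a dict, or None if it does not follow the convention."""
    match = ILLUMINA_NAME.match(name)
    return match.groupdict() if match else None


def missing_mates(file_names):
    """List the R2 files that paired-end data would require but that are absent."""
    names = set(file_names)
    missing = []
    for name in sorted(names):
        match = ILLUMINA_NAME.match(name)
        if match and match.group("read") == "1":
            # Swap only the read digit, keeping lane and chunk unchanged.
            mate = name[: match.start("read")] + "2" + name[match.end("read"):]
            if mate not in names:
                missing.append(mate)
    return missing
```

For example, parse_name("sample1_S0_L001_R1_001.fastq") yields the sample name "sample1", lane "001" and chunk "001", and missing_mates reports every R1 file whose R2 counterpart is not in the list.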

UTAP user interface input information required:

Optionally change the name of the output folder

The output folder name is filled in automatically, based on the selected input folder. To use a different name, overwrite the text in the Output folder: field.

Additional setups

Fill in a project name, and select the reference genome to which the reads will be aligned.

For mouse, you have the option to choose a TSS file, containing either a broad or narrow definition of the genes’ TSS (Transcription Start Site) regions (based on Wu et al, Nature. 2016 Jun 30;534(7609):652-7 - The landscape of accessible chromatin in mammalian preimplantation embryos).

The default adapters are the P5 and P7 adapters of the TruSeq protocol.

By default, the ATAC-seq pipeline runs without a “black-list” or “naked-DNA” control.

4. Run with control

Choose “run with control” (in the drop-down menu associated with the run with control line) in order to enable comparison of each treatment with its corresponding control. When selecting this option, a new group of control and treatment boxes will open. Organize the samples by selecting them and using the arrows to move items to the appropriate categories.

If you have more than one treatment to compare against a control, click the "Add group" button as shown in the figure below.

Each group must contain at least one sample in each of the treatment and control boxes.

When moving a sample to the control box, a copy of the sample is retained, so that you can use it again in a new group.

If you move more than one sample to the treatment or control box, the pipeline will automatically combine the samples into one big treatment/control sample.

Important: All of the pipeline steps (mapping, counts etc.) will be run (only) on the samples in the treatment/control boxes.
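To make the group rules concrete, the sketch below validates a set of groups and builds one MACS2 peak-calling command per group. It is an illustration under stated assumptions, not UTAP's actual implementation: the source does not say how the pipeline combines samples or which MACS2 parameters it uses. Here, several BAM files are passed to -t and -c, which MACS2 pools, as one possible way to obtain a single combined treatment or control sample. The function name, group names, BAM file names and output directory are hypothetical.

```python
def build_macs2_commands(groups, genome="mm", outdir="9_call_peak"):
    """Build one 'macs2 callpeak' argument list per treatment/control group.

    groups: list of dicts such as
        {"name": "g1", "treatment": ["t1.bam"], "control": ["c1.bam"]}
    genome: MACS2 genome-size shortcut ("mm" for mouse, "hs" for human).
    """
    commands = []
    for group in groups:
        # Each group needs at least one sample on both sides.
        if not group["treatment"] or not group["control"]:
            raise ValueError(
                f"group {group['name']!r} needs at least one treatment "
                "and one control sample"
            )
        commands.append(
            [
                "macs2", "callpeak",
                "-t", *group["treatment"],  # several files are pooled by MACS2
                "-c", *group["control"],
                "-f", "BAMPE",              # paired-end BAM input
                "-g", genome,
                "-n", group["name"],
                "--outdir", outdir,
            ]
        )
    return commands
```

Because a copy of a control sample is retained when it is moved, the same control BAM file may appear in several groups. The returned lists can be passed to subprocess.run once the BAM files exist.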

Run the pipeline

Finally, click the “Run analysis” button to submit the analysis. Once the analysis is completed, you will be notified by email (usually after a few hours). All of the output files will be stored in your Wexac Collaboration folder.

Analysis workflow

Pipeline steps and associated tools:

Trimming: Reads are quality trimmed using cutadapt. In this process, adapters corresponding to the TruSeq protocol are removed.
Quality control: Read quality is evaluated using FastQC, and a report file, containing quality reports for all of the samples, is generated using multiQC.
Mapping to genome: The quality-trimmed paired-end reads are mapped to the mouse or human genome using Bowtie2.
Following the alignment, reads mapped to the mitochondrial genome are removed from the analysis. Duplicated reads are removed using picard-tools. The remaining unique reads are sorted and indexed using samtools sort and samtools index.
Alignment statistics are generated using samtools flagstat.
Visualization in graphs: The analyzed reads are graphically visualized using ngsplot.
Select nucleosome-free fragments: Fragments shorter than 120 bp are selected.
Peak calling: Peaks are called using MACS2.
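Two of the post-alignment steps above, removing mitochondrial reads and selecting nucleosome-free fragments, can be illustrated on SAM-formatted text. This is a minimal sketch and not the pipeline's own code: it assumes the mitochondrial contig is named "chrM" or "MT", uses the SAM TLEN field (column 9) as the fragment length, and applies both filters in one pass, whereas the pipeline runs them as separate steps. The function names are hypothetical.

```python
def keep_alignment(sam_line, max_fragment=120, mito_names=("chrM", "MT")):
    """Return True for a non-mitochondrial read from a fragment shorter than max_fragment bp."""
    fields = sam_line.rstrip("\n").split("\t")
    reference = fields[2]          # RNAME: reference sequence name
    fragment_len = int(fields[8])  # TLEN: signed template (fragment) length
    if reference in mito_names:
        return False
    # TLEN is 0 when the fragment length is undefined; such reads are dropped.
    return 0 < abs(fragment_len) < max_fragment


def filter_sam(lines, max_fragment=120):
    """Yield header lines unchanged and only the alignments that pass keep_alignment."""
    for line in lines:
        if line.startswith("@") or keep_alignment(line, max_fragment):
            yield line
```

In practice the same selection would be applied to BAM files (for example by piping through samtools view); plain SAM text is used here only to keep the example self-contained.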

Output folders:

1_cutadapt

2_fastqc

3_multiqc

4_mapping

5_process_alignment

6_ngs_plot

7_nucleosome_free

8_tss_count

9_call_peak

10_reports

Log files (one directory above the output directory):

snakemake_stdout.txt

Examples of reports

RNA-Seq example: public data set from Klepikova AV et al. BMC Genomics. 2015 Jun 18;16:466

Mars-seq example: public data set from Feigelson SW et al. Cell Rep. 2018 Jan 23;22(4):849-859

Please regard this analysis as a good starting point and not an end result.

List of links

Transcriptome pipeline for Weizmann Institute users: http://utap.wexac.weizmann.ac.il

Demo of the UTAP interface (for internal and external users): http://utap-demo.weizmann.ac.il

Acknowledgments

Citation:

Kohen et al. BMC Bioinformatics (2019) 20:154 https://doi.org/10.1186/s12859-019-2728-2 (PMID: 30909881)

Bioinformatics support staff for UTAP:

UTAP development and maintenance team: utap@weizmann.ac.il
Dena Leshkowitz
Ester Feldmesser
Gil Stelzer
Bareket Dassa
Noa Wigoda
