UTAP: Transcriptome (RNA -seq and MARS-seq) pipelines guidelines

Analysis setup AFTER sequencing completion

Pipeline website: http://utap.wexac.weizmann.ac.il

The transcriptome pipelines run on the wexac cluster. In order to run a new transcriptome analysis, you must first transfer demultiplexed sequencing data (fastq files) to your Collaboration folder on wexac.

If you wish to run an analysis AFTER sequencing has completed using the link provided in the email you received (after demultiplexing and QC finish), paste the link in a browser. Then, simply set up the analysis (enter project name, create categories etc) and the files will be copied for you to your collaboration folder on Wexac as described above.

If you wish to run a new analysis from existing files in the Collaboration folder, or if you uploaded data to wexac from an external source (e.g. sequencing data not performed in the LSCF), login to http://utap.wexac.weizmann.ac.il using your Weizmann userID and password and click on "Run pipeline":

Select the pipeline you desire:

This page concentrates on the common aspects of the RNA-seq and Mars-seq pipelines.

If your protocol is Mars-seq, you will get this screen:

Browse within your Collaboration directory structure and Select the root folder of the sub-folders of the samples for analysis with the appropriate button. Note that if you wish to go up one level (or more), please click the desired folder level on the path at the top of the window.

Fastq files must be organized, directly under the selected folder (root folder), into subfolders as shown below. The subfolders names are derived from the sample names.

Fastq files must start with the same sample name as the subfolders and end with "_R1.fastq" (or "_R1.fastq.gz") for single-read data. In the case of paired-end data (required for Mars-Seq), corresponding files must exist that are IDENTICAL in their name except for the ending "_R2.fastq" (or "_R2.fastq.gz") instead of "_R1.fastq", where R prefixes the read number.

For example:

root_folder
- sample1
  - sample1_R1.fastq
  - sample1_R2.fastq (must exist in Mars-seq and in paired-end)
- sample2

The pipeline also supports the convention of the fastq file format _S*_L00*_R1.fastq or _S*_L00*_R1_0*.fastq.

For example:

root_folder
- sample1
  - sample1_S0_L001_R1_001.fastq
  - sample1_S0_L001_R1_002.fastq
  - sample1_S0_L002_R1_001.fastq
  - sample1_S0_L002_R1_002.fastq
  - sample1_S0_L001_R2_001.fastq (must exist in Mars-seq and in paired-end)
  - sample1_S0_L001_R2_002.fastq (must exist in Mars-seq and in paired-end)
  - sample1_S0_L002_R2_001.fastq (must exist in Mars-seq and in paired-end)
  - sample1_S0_L002_R2_002.fastq (must exist in Mars-seq and in paired-end)
- sample2
  - sample2_S0_L001_R1_001.fastq
  - sample2_S0_L001_R1_002.fastq
  - sample2_S0_L002_R1_001.fastq
  - sample2_S0_L002_R1_002.fastq
  - sample2_S0_L001_R2_001.fastq (must exist in Mars-seq and in paired-end)
  - sample2_S0_L001_R2_002.fastq (must exist in Mars-seq and in paired-end)
  - sample2_S0_L002_R2_001.fastq (must exist in Mars-seq and in paired-end)
  - sample2_S0_L002_R2_002.fastq (must exist in Mars-seq and in paired-end)

If there is an error with the folder you selected, you must fix it in order to run the pipeline.

The output folder should be different from the one automatically filled in (based on the selected input folder).

Select the desired output folder.

Continue by filling in all of the fields.

If your protocol is RNA-seq, you will get this screen:

Fill in the project name, select the genome and annotation.

Select whether or not your protocol is stranded (the sequenced reads saves the original strand of RNA fragments) or non-stranded. If you don't know, select the "find automatically" option.

Specify your adapters for each read (R1 and R2). These adapters will be removed from the reads by the pipeline. You can retain the default adapters if you used them with i5 and i7 adapters of the True-seq protocol.

Select whether or not you desire to identify differentially expressed genes using the DESeq2 package as described in the DESeq2 manual. If you select this option, by default, two categories must be created (by filling in the category names).

Sort the samples by selecting them and using the arrows to move to the appropriate categories

You may add additional categories by using the relevant buttons

If the samples were prepared in different pools, you can add this information: After moving the samples into categories boxes, click on the "Add Batch Effect" button, then select the samples that belong to one batch and click on the "Batch 1" button. Repeat the operation with the other groups of the samples.

All of the steps of the pipeline (mapping, counts etc.) will be run on all of the samples, but Deseq2 will be run only on the samples with categories.

Finally, submit the run for analysis.

Analysis pipeline steps and reports

The steps performed by the pipeline include:

Trim adapter sequences
Fastqc for quality control of the samples (run in parallel with the other steps)
Map reads to the selected reference genome
When the protocol is Mars-Seq, add UMI and gene information to the reads
Quantify gene expression by counting reads
When the protocol is Mars-Seq and Dseq2 is selected, count UMIs to avoid PCR duplications.
Note that counting of UMIs (critical for the MARS-seq library) takes into account that reads should be uniquely aligned to a genome; if a read aligns equally well to two different genomes, it is not counted on behalf of a gene. So for example, if you’ve aligned your fastq files to both human and mouse genomes, best to use UTAP with the merged human and mouse genome; this will provide the unique read or UMI counts to both human and mouse genes.
Detect Differentially Expressed (DE) genes for a model with a single factor

Steps 4 and 6 are performed only for Mars-Seq

Upon completion, you will get an email with links to the results reports

The report includes several sections:

Sequencing and Mapping QC
1. Figure 1 - Plots the average quality of each base across all reads. Qualities of 30 (predicted error rate 1:1000) and up are good
2. Figure 2 - Histogram showing the number of reads for each sample in the raw data
3. Figure 3 - Histogram showing the percent of reads discarded after trimming the adapters (after the removing of the adapters, some read and polyA/T or low quality reads may be too short; the pipeline discards them)
4. Figure 4 - Histogram with the number of reads for each sample in each step of the pipeline
5. Figure 5 - Plots showing sequence coverage on and near gene regions
6. Figure 6 -
  1. Histogram showing the percent of reads that mapped uniquely and not uniquely per sample
  2. Histogram showing the percent of the uniquely mapped reads that mapped to genes (genes included must have at least 5 reads)
Exploratory Analysis
1. Figure 7 - Heatmap plotting the highly-expressed genes (above 5% of total expression). For example the expression of gene RN45S in sample SRR3112243 amounts to 15% of the expression
2. Figure 8 - Heatmap of Pearson correlation between samples according to the gene expression values
3. Figure 9 - Clustering dendogram of the samples according to the gene expression
4. Figure 10 - PCA analysis
  1. Histogram of % explained variability for each PC component
  2. PCA plot of PC1 vs PC2 c. PCA plot of PC1 vs PC3
Differential Expression Analysis (this section exists only if you run the DESeq2 analysis) - a table with the number of differentially expressed (DE) genes in each category (up/down) for the different contrasts. In addition, links for p-value distribution, volcano plots and heatmaps, as well as a table of the DE genes with dot plots of their expression values
Bioinformatics Pipeline Methods - description of the utilized pipeline methods
Links to additional results - links for downloading tables with raw, normalized counts, log normalized values (rld) and statistical data of contrasts. In the case of model with batches, "combat" values are calculated (instead of rld) using the "sva" package, and are batch corrected normalized log2 count values.

Annotation file:

For the counts of the reads per gene we use with annotation files (gtf format) from RefSeq or GENCODE (more elaborate i.e. contains more genes and transcripts) . In MARS-seq analysis we use the 5' end of the transcripts by extending the 3' UTR exon by /** add how many **/ bases in both the 5' and 3' direction.

Examples of reports

RNA-Seq example: public data set from Klepikova AV et al. BMC Genomics. 2015 Jun 18;16:466

Mars-seq example: public data set from Feigelson SW et al. Cell Rep. 2018 Jan 23;22(4):849-859

Please regard this analysis as a good starting point and not an end result.

List of links

Transcriptome pipeline for Weizmann Institute users: http://utap.wexac.weizmann.ac.il

Demo of the UTAP interface (for internal and external users): http://utap-demo.weizmann.ac.il

Acknowledgments

Citation:

Kohen et al. BMC Bioinformatics (2019) 20:154 https://doi.org/10.1186/s12859-019-2728-2 (PMID: 30909881)

Bioinformatics support staff for UTAP:

UTAP development and maintenance team: utap@weizmann.ac.il
Dena Leshkowitz
Ester Feldmesser
Gil Stelzer
Bareket Dassa
Noa Wigoda

Bioinformatics public