Copy of UTAP - User guide

Contents:
- Registration to system
- Input data format
- Import Input Data
- MARS-seq Analysis Setup
- RNA-seq Analysis Setup
- Analysis pipeline steps and reports

Registration to system

Import Input data

To run the transcriptome analysis pipeline, the fastq sequence files must be located on the server.

To copy the input data to the server, you need SSH client software installed on your computer.

For example, you can use the WinSCP software (https://winscp.net/eng/download.php):

Open the software and connect to the server using the SFTP protocol, the Host name and Port number of the server, and the User name and Password with which you registered in the signup form in the web browser.

Host name is the IP address or DNS name of the server, and Port number is the value of the $HOST_SSH_PORT variable with which the system was installed.

Click on Login.

Now drag and drop the folder with the input data from your computer (left side of the screen) into your directory on the server (right side of the screen):

Only you (and the administrator) have permission to access your input data and output results.
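
For users who prefer the command line to WinSCP, the same copy can be sketched with the OpenSSH `scp` client. This is only an illustration under assumptions: it assumes `scp` is installed locally and that the server accepts it on the same SSH port. The user name, host name, port, and remote directory below are hypothetical placeholders, not values from this guide.

```python
import shlex


def build_upload_command(local_folder, user, host, port, remote_dir="."):
    """Return the argument list for a recursive scp upload.

    All arguments are placeholders supplied by the caller; remote_dir="."
    means the user's login directory on the server (an assumption).
    """
    return [
        "scp",
        "-r",              # copy the folder recursively
        "-P", str(port),   # SSH port of the server ($HOST_SSH_PORT)
        local_folder,
        f"{user}@{host}:{remote_dir}",
    ]


# Hypothetical values, for illustration only.
cmd = build_upload_command("my_fastq_root", "alice", "utap.example.org", 2222)
print(shlex.join(cmd))
```

The printed command can be pasted into a terminal, or the list can be passed to `subprocess.run(cmd, check=True)` to run it from Python.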

Input data format

Fastq files must be organized within the selected folder (root folder) into subfolders, as shown below. The subfolder names are derived from the sample names.

Fastq file names must start with the same sample name as their subfolder and end with "_R1.fastq" (or "_R1.fastq.gz") for single-read data. For paired-end data (required for MARS-seq), corresponding files must exist whose names are IDENTICAL except for the ending "_R2.fastq" (or "_R2.fastq.gz") instead of "_R1.fastq" (or "_R1.fastq.gz").

Here the digit after "R" is the read number.

For example:

root_folder
- sample1
  - sample1_R1.fastq
  - sample1_R2.fastq (must exist for MARS-seq and paired-end data)
- sample2

The pipeline also supports the fastq file naming conventions _S*_L00*_R1.fastq and _S*_L00*_R1_0*.fastq.

For example:

root_folder
- sample1
  - sample1_S0_L001_R1_001.fastq
  - sample1_S0_L001_R1_002.fastq
  - sample1_S0_L002_R1_001.fastq
  - sample1_S0_L002_R1_002.fastq
  - sample1_S0_L001_R2_001.fastq (must exist for MARS-seq and paired-end data)
  - sample1_S0_L001_R2_002.fastq (must exist for MARS-seq and paired-end data)
  - sample1_S0_L002_R2_001.fastq (must exist for MARS-seq and paired-end data)
  - sample1_S0_L002_R2_002.fastq (must exist for MARS-seq and paired-end data)
- sample2
  - sample2_S0_L001_R1_001.fastq
  - sample2_S0_L001_R1_002.fastq
  - sample2_S0_L002_R1_001.fastq
  - sample2_S0_L002_R1_002.fastq
  - sample2_S0_L001_R2_001.fastq (must exist for MARS-seq and paired-end data)
  - sample2_S0_L001_R2_002.fastq (must exist for MARS-seq and paired-end data)
  - sample2_S0_L002_R2_001.fastq (must exist for MARS-seq and paired-end data)
  - sample2_S0_L002_R2_002.fastq (must exist for MARS-seq and paired-end data)
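
Before uploading, the naming rules above can be checked locally. The following is a minimal sketch, not the validation the system itself performs: the function names are hypothetical, the regular expression is one reading of the conventions described above, and it assumes ".gz" is also accepted for the _S*_L00* form.

```python
import re
import tempfile
from pathlib import Path


def r1_pattern(sample):
    """Regex for R1 file names of one sample folder (one reading of the rules).

    Slightly permissive: it also accepts a "_0*" chunk suffix without the
    "_S*_L00*" part.
    """
    return re.compile(
        rf"^(?P<head>{re.escape(sample)}(?:_S\d+_L00\d+)?)_R1"
        rf"(?P<tail>(?:_0\d+)?\.fastq(?:\.gz)?)$"
    )


def check_root_folder(root, paired_end=True):
    """Return a list of problems; an empty list means the layout looks valid."""
    problems = []
    sample_dirs = sorted(p for p in Path(root).iterdir() if p.is_dir())
    if not sample_dirs:
        problems.append(f"{root}: no sample subfolders found")
    for sample_dir in sample_dirs:
        sample = sample_dir.name
        names = {p.name for p in sample_dir.iterdir() if p.is_file()}
        pattern = r1_pattern(sample)
        r1_matches = [m for m in map(pattern.match, sorted(names)) if m]
        if not r1_matches:
            problems.append(f"{sample}: no R1 fastq file found")
        if paired_end:
            for m in r1_matches:
                # The mate must be identical except for R2 instead of R1.
                r2_name = m["head"] + "_R2" + m["tail"]
                if r2_name not in names:
                    problems.append(f"{sample}: missing {r2_name}")
    return problems


# Small self-contained demonstration on a temporary folder.
with tempfile.TemporaryDirectory() as root:
    for name in [
        "sample1/sample1_R1.fastq",
        "sample1/sample1_R2.fastq",
        "sample2/sample2_S0_L001_R1_001.fastq.gz",
    ]:
        path = Path(root, name)
        path.parent.mkdir(exist_ok=True)
        path.touch()
    print(check_root_folder(root, paired_end=True))
    print(check_root_folder(root, paired_end=False))
```

In the demonstration, the paired-end check reports the missing R2 mate of sample2, while the single-read check returns an empty list.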

MARS-seq Analysis Setup

After importing your data (or if you have data on the server that was imported in the past), you can run the pipeline by selecting the "Run pipeline" option.

You can now select the pipeline from the options available.

Browse within your directory structure and select the root folder for analysis. To go up one or more levels, click the desired folder level in the path at the top of the window.

The input folder must be in the correct format, as explained above. If there is an error in the folder you selected, you will be able to select it only after the error has been resolved.

If the output folder should be different from the one automatically filled in (based on the selected input folder), select the desired output folder.

Fill in the project name, then select the genome and annotation.

Select whether you want to identify differentially expressed genes using the DESeq2 package. If you select this option, two categories must be created by default (fill in the category names).

Choose the samples by selecting them, and use the arrows to move them to the appropriate categories.

You may add additional categories.

The order of the comparisons is determined by the order of the category boxes. For example, DESeq2 will output the "Treatment" vs "Control" comparison if the user enters "Treatment" as the first category and "Control" as the second.
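
The effect of the box order can be illustrated with a short sketch. For two categories it reproduces the example above. For more than two categories the exact set of comparisons is not specified in this guide, so the sketch assumes that every earlier box is compared against every later one. The function name is hypothetical.

```python
from itertools import combinations


def planned_comparisons(categories):
    """List "A vs B" labels, with A always the earlier category box.

    Assumption: all pairs are compared, in the order of the boxes.
    """
    return [f"{first} vs {second}" for first, second in combinations(categories, 2)]


print(planned_comparisons(["Treatment", "Control"]))
print(planned_comparisons(["Control", "Treatment"]))
```

Swapping the two boxes swaps the direction of the comparison, which in turn flips which genes are reported as up or down.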

If the samples were prepared in different batches, you can add this information: after moving the samples into the category boxes, click the "Add Batch Effect" button, then select the samples from the category boxes that belong to one batch and click the "Batch 1" button. Repeat the operation for the other batches. Be sure that the batch effect is designed correctly; see the DESeq2 documentation.
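
One common design problem is a batch that is confounded with a category, for example when all "Treatment" samples are in one batch and all "Control" samples are in another, so the two effects cannot be separated. The sketch below is a simple sanity check under that assumption, not the check DESeq2 or the pipeline performs. It tests whether categories and batches are linked into a single connected group. The function and sample names are hypothetical.

```python
def batch_design_is_connected(sample_category, sample_batch):
    """True if all categories and batches are linked through shared samples.

    Both arguments map sample name -> label. A False result means at least
    one group of categories shares no batch with the rest, so batch and
    category effects cannot be told apart.
    """
    parent = {}

    def find(node):
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path compression
            node = parent[node]
        return node

    for sample, category in sample_category.items():
        # Each sample links its category to its batch.
        root_cat = find(("category", category))
        root_batch = find(("batch", sample_batch[sample]))
        parent[root_cat] = root_batch

    roots = {find(node) for node in list(parent)}
    return len(roots) == 1


categories = {"s1": "Treatment", "s2": "Treatment", "s3": "Control", "s4": "Control"}
confounded = {"s1": "Batch 1", "s2": "Batch 1", "s3": "Batch 2", "s4": "Batch 2"}
balanced = {"s1": "Batch 1", "s2": "Batch 2", "s3": "Batch 1", "s4": "Batch 2"}

print(batch_design_is_connected(categories, confounded))
print(batch_design_is_connected(categories, balanced))
```

The first design is confounded and fails the check; the second spreads each category over both batches and passes.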

All steps of the pipeline (mapping, counts, etc.) will be run on all samples, but DESeq2 will be run only on the samples assigned to categories.

Finally, submit the run for analysis.

At the end of the run, an email will be sent to you informing you that the analysis is complete.

RNA-seq Analysis Setup

If your protocol is RNA-seq, you will get this screen:

Fill in the project name, select the genome and annotation.

Select whether your protocol is stranded (the sequenced reads preserve the original strand of the RNA fragments) or non-stranded.

Enter the adapters for each read (R1 and R2). These adapters will be removed from the reads by the pipeline. You can leave the default adapters if you use the P5 and P7 adapters of the TruSeq protocol.
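
To illustrate what adapter removal means for a single read, here is a deliberately naive sketch. It is not the trimming tool used by the pipeline: it only handles exact matches, and the adapter sequence in the example is an illustrative placeholder rather than the pipeline default.

```python
def trim_3prime_adapter(read, adapter, min_overlap=3):
    """Remove an adapter from the 3' end of a read (exact matching only).

    Case 1: the full adapter occurs in the read; cut from its start.
    Case 2: the read ends with a prefix of the adapter that is at least
    min_overlap bases long; cut that prefix.
    """
    position = read.find(adapter)
    if position != -1:
        return read[:position]
    longest = min(len(adapter) - 1, len(read))
    for length in range(longest, min_overlap - 1, -1):
        if read.endswith(adapter[:length]):
            return read[:-length]
    return read


adapter = "AGATCGGAAGAGC"  # placeholder sequence for the example
print(trim_3prime_adapter("ACGT" + adapter + "TTT", adapter))
print(trim_3prime_adapter("ACGTACGT" + adapter[:10], adapter))
print(trim_3prime_adapter("ACGTACGT", adapter))
```

Reads that become very short after trimming carry little mapping information, which is why a pipeline may discard them.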

Analysis pipeline steps and reports

The steps performed by the pipeline -

1. Trim adapter sequences
2. Map reads to the selected reference genome
3. Add UMI and gene information to the reads
4. Quantify gene expression by counting reads
5. Count UMIs to account for PCR bias
6. Detect differentially expressed (DE) genes for a model with a single factor

FastQC, for quality control of the samples, is run in parallel to the steps above.

Steps 3 and 5 are performed only for MARS-seq.

Step 6 is performed only if DESeq2 is selected.
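
The protocol-dependent rules above can be summarized in a short sketch. It assumes the numbering implied by the text, in which FastQC runs alongside the other steps and is not counted as a numbered step. The names below are hypothetical and are not part of the pipeline.

```python
PIPELINE_STEPS = [
    (1, "Trim adapter sequences"),
    (2, "Map reads to the selected reference genome"),
    (3, "Add UMI and gene information to the reads"),
    (4, "Quantify gene expression by counting reads"),
    (5, "Count UMIs"),
    (6, "Detect differentially expressed genes"),
]
MARS_SEQ_ONLY = {3, 5}  # UMI-related steps
DESEQ2_ONLY = {6}       # runs only when DESeq2 is selected


def steps_to_run(protocol, run_deseq2):
    """Return the step numbers executed for a protocol ("MARS-seq" or "RNA-seq")."""
    selected = []
    for number, _name in PIPELINE_STEPS:
        if number in MARS_SEQ_ONLY and protocol != "MARS-seq":
            continue
        if number in DESEQ2_ONLY and not run_deseq2:
            continue
        selected.append(number)
    return selected


print(steps_to_run("RNA-seq", run_deseq2=False))
print(steps_to_run("MARS-seq", run_deseq2=True))
```

An RNA-seq run without DESeq2 therefore performs only trimming, mapping, and read counting, while a MARS-seq run with DESeq2 performs all six steps.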

Upon completion, you will get an email with links to the results report.

The report includes several sections -

Sequencing and Mapping QC
1. Figure 1 - Plots the average quality of each base across all reads. A quality of 30 (predicted error rate of 1:1000) or higher is good
2. Figure 2 - Histogram showing the number of reads for each sample in raw data
3. Figure 3 - Histogram showing the percent of reads discarded after trimming the adapters (after removal of the adapters, polyA/T stretches, or low-quality bases, some reads may be too short, and the pipeline discards them)
4. Figure 4 - Histogram with the number of reads for each sample in each step of the pipeline
5. Figure 5 - Plots sequence coverage on and near gene regions
6. Figure 6 -
  1. Histogram showing the percent of reads that mapped uniquely and not uniquely per sample
  2. Histogram showing the percent of the uniquely mapped reads that mapped to genes (genes included must have at least 5 reads)
Exploratory Analysis
1. Figure 7 - Heatmap plotting the highly expressed genes (above 5% of total expression). For example, the expression of gene RN45S in sample SRR3112243 amounts to 15% of the total expression
2. Figure 8 - Heatmap of Pearson correlation between samples according to the gene expression values
3. Figure 9 - Clustering dendrogram of the samples according to the gene expression
4. Figure 10 - PCA analysis
  1. Histogram of % explained variability for each PC component
  2. PCA plot of PC1 vs PC2
  3. PCA plot of PC1 vs PC3
Differential Expression Analysis (this section exists only if you run the DESeq2 analysis) - a table with the number of differentially expressed (DE) genes in each category (up/down) for the different contrasts. In addition, links for p-value distribution, volcano plots and heatmaps, as well as a table of the DE genes with dot plots of their expression values
Bioinformatics Pipeline Methods - description of pipeline methods utilized
Links to additional results - links for downloading tables with raw counts, normalized counts, log normalized values (rld) and statistical data of the contrasts. For a model with batches, "combat" values are calculated (instead of rld) using the "sva" package; these are batch-corrected, normalized log2 count values.
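
The quality threshold mentioned for Figure 1 follows from the standard Phred scale, in which a quality score Q corresponds to an error probability of 10^(-Q/10). The short sketch below shows the conversion in both directions; the function names are chosen for illustration.

```python
import math


def phred_to_error_probability(quality):
    """Predicted probability that a base call with this Phred score is wrong."""
    return 10 ** (-quality / 10)


def error_probability_to_phred(probability):
    """Phred quality score for a given error probability."""
    return -10 * math.log10(probability)


for quality in (10, 20, 30, 40):
    probability = phred_to_error_probability(quality)
    print(f"Q{quality}: about 1 error in {round(1 / probability)} bases")
```

On this scale a quality of 30 corresponds to roughly one error in 1000 bases, and every additional 10 points reduces the predicted error rate tenfold.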

Annotation file:

For counting reads per gene we use annotation files (GTF format) from Ensembl or GENCODE. In MARS-seq analysis we extend the 3' UTR exon away from the transcript on the DNA, and extend or cut the 3' UTR exon towards the 5' direction on the mRNA.

Examples of reports

RNA-Seq example:

https://bip.weizmann.ac.il/rna-seq

MARS-seq example:

https://bip.weizmann.ac.il/mars-seq

Please regard this analysis as a good starting point and not an end result.
