UTAP pipeline for MAR-Seq - run with command line

The pipeline file:

/home/labs/bioservices/services/miniconda2/envs/utap/lib/python2.7/site-packages/ngs-snakemake/snakefiles/snakefile-marseq.txt

Previous version:

From the next version

Graph of the pipeline:

marseq-graph.png

The pipeline requires the following in the working directory:
a. The pipeline file
b. Config file (the default path of the config file is WORKING_DIRECTORY/config-marseq.yaml; a different path can be given on the snakemake command line).
c. Factors file (required only if you want to run differential expression analysis): a tab-separated values file with the description of each sample. See example here

Before running the pipeline, you need to modify the config file so that it contains the list of samples.

The Config file: /home/labs/bioservices/services/miniconda2/envs/utap/lib/python2.7/site-packages/ngs-snakemake/config_example/config-marseq.yaml

run_id: any name for the job. After the first run has finished, you can run the last step of the pipeline (deseq) again with other parameters. In that case, change this name so that the new output is written to a new folder and does not overwrite the output of the old run.

job_name: the title of the run (it appears in the title of the report).

fastq_dir: full path of a folder containing one subfolder per sample. Each subfolder contains the fastq files of the sample (in fastq or fastq.gz format). The subfolder name must be identical to the sample name.
output_dir: full path of the output directory (it will be created if it does not exist)
factors_file: full path to the factors.tsv file (remove this line if you do not want to run differential expression analysis).
gtf: full path of the gtf file, for example: /shareDB/iGenomes/Mus_musculus/UCSC/mm10/Annotation/Genes/genes.gtf #Notice: the name of the gtf file must start with the organism name, for example mm10.genes.gtf; otherwise some functionality will not work
my_star_index: full path to the STAR index directory, for example: /shareDB/iGenomes/Mus_musculus/UCSC/mm10/Sequence/STAR_index/
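As a minimal sketch of what a filled-in config file might look like, the helper below writes a config-marseq.yaml with the keys listed above. The function name write_marseq_config and all placeholder values (run1, my_marseq_run, the /path/to/... entries) are assumptions for illustration only. The config_example file shipped with the pipeline may contain further keys, so it is safer to start from that file and edit it.

```shell
#!/usr/bin/env bash
# Sketch: write a config-marseq.yaml with the keys described in this section.
# write_marseq_config is a hypothetical helper, not part of UTAP.
# Values are placeholders; the gtf and STAR index paths are the mm10 examples
# quoted in the text. Replace everything with your own paths.
write_marseq_config() {
    local workdir="$1"
    cat > "${workdir}/config-marseq.yaml" <<'EOF'
run_id: run1
job_name: my_marseq_run
fastq_dir: /path/to/fastq_root
output_dir: /path/to/output
factors_file: /path/to/factors.tsv
gtf: /shareDB/iGenomes/Mus_musculus/UCSC/mm10/Annotation/Genes/genes.gtf
my_star_index: /shareDB/iGenomes/Mus_musculus/UCSC/mm10/Sequence/STAR_index/
EOF
}

# Usage (writes WORKING_FOLDER/config-marseq.yaml):
#   write_marseq_config /path/to/WORKING_FOLDER
```

Delete the factors_file line from the generated file if you do not want differential expression analysis.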
NOTICE:
You can omit the factors_file line from the config file. In this case, snakemake will find the sample names automatically and run the pipeline without Deseq.
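To preview which sample names would be picked up, the sketch below lists the subfolders of fastq_dir that contain at least one fastq or fastq.gz file. It only mirrors the folder layout described for fastq_dir and is not the pipeline's own discovery code; the function name list_samples is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: print one sample name per line, taken from the subfolder names of
# fastq_dir that hold at least one *.fastq or *.fastq.gz file.
# list_samples is a hypothetical helper, not part of UTAP.
list_samples() {
    local fastq_dir="$1" sub f
    for sub in "${fastq_dir}"/*/; do
        [ -d "$sub" ] || continue          # skip if the glob matched nothing
        for f in "${sub}"*.fastq "${sub}"*.fastq.gz; do
            if [ -e "$f" ]; then
                basename "$sub"            # subfolder name = sample name
                break                      # print each sample only once
            fi
        done
    done
}

# Usage:
#   list_samples /path/to/fastq_root
```

Subfolders without fastq files are skipped, which makes it easy to spot a misnamed or empty sample folder before starting a run.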
Example of running the pipeline (not on the cluster):
snakemake --snakefile snakefile-marseq.txt #the default path of the config file is WORKING_FOLDER/config-marseq.yaml
OR give the config file name on the command line:
snakemake --snakefile snakefile-marseq.txt --configfile config.yaml
Steps for running the pipeline on the cluster (can be run from the access server):
a. Make a working directory. R1 and R2 fastq files are expected to be uncompressed or gz compressed.
b. Copy the config file to your working directory - WORKING_FOLDER/config-marseq.yaml
c. Run the pipeline with the following command, replacing J with the number of parallel jobs that you want (can be a large number) and M with the memory you need per thread (in MB):
Notice: the memory must be larger than the size of the STAR index. In the example above, the mm10 index is ~27G (27000 MB), so we will use 35G. The number of threads is hard-coded to 20, so divide: 35000M/20 = 1750M, rounded up to ~2000M. Therefore we write: rusage[mem=2000].
module load python/3.5.2;module load python/2.7;module load fastqc;module load jre;module load star/2.5.2b;module load samtools/1.3.1;module load R/3.4.2;module load pandoc/1.18;module load ngsplot;
snakemake -p --jobs J --latency-wait 60 --snakefile snakefile-marseq.txt --cluster 'bsub -q new-short -n {threads} -o cluster.output -e cluster.error -R "rusage[mem=M]" -R "span[hosts=1]"' &> stdout.txt
The number of threads per job is hard-coded to 20, and the user cannot change it.
Note: internal users can use "-q bio" instead of "-q new-short".
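The memory arithmetic from the notice above can be sketched as a small helper. It assumes, as the division in the notice implies, that the rusage value is reserved per thread, so the total budget in MB is divided by the number of threads and rounded up. The function name mem_per_thread is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: per-thread memory (MB) for rusage[mem=...], computed as
# ceil(total_mb / threads). mem_per_thread is a hypothetical helper.
mem_per_thread() {
    local total_mb="$1" threads="${2:-20}"   # 20 = hard-coded thread count
    echo $(( (total_mb + threads - 1) / threads ))
}

# Usage with the numbers from the text (35G total, 20 threads):
#   mem_per_thread 35000 20    # prints 1750; the text rounds this up to 2000
```

Rounding the result up further, as the text does, simply leaves extra headroom above the size of the STAR index.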

You can also run on a local machine with the following command, replacing T with the maximum total number of threads you would like to provide:
module load python/3.5.2;module load python/2.7;module load fastqc;module load jre;module load star/2.5.2b;module load samtools/1.3.1;module load R/3.4.2;module load pandoc/1.18;module load ngsplot;
snakemake -p --cores T --snakefile snakefile-marseq.txt
The number of threads per job is hard-coded to 20. If the value of cores is 40, two jobs can run in parallel; if the value is 20 or less, only one job runs at a time.
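The relation between the value of --cores and the number of 20-thread jobs that can run at once can be sketched as follows. This is only the arithmetic described above, under the assumption that every job requests the full 20 threads; the function name parallel_jobs is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: how many jobs of threads_per_job threads fit into the given cores.
# With fewer cores than threads_per_job, a single job still runs.
# parallel_jobs is a hypothetical helper, not part of UTAP or snakemake.
parallel_jobs() {
    local cores="$1" threads_per_job="${2:-20}"
    local n=$(( cores / threads_per_job ))
    if [ "$n" -lt 1 ]; then
        n=1
    fi
    echo "$n"
}

# Usage:
#   parallel_jobs 40    # prints 2
#   parallel_jobs 20    # prints 1
#   parallel_jobs 8     # prints 1
```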
Running only part of the rules:
You can run only some of the rules using the flags --forcerun and --until.
For example:
To run only the "rule_7_dedup_counts" rule, add to the command:
--forcerun rule_7_dedup_counts --until rule_7_dedup_counts
or to run only the "rule_6_mark_dup" and "rule_7_dedup_counts" rules, add to the command:
--forcerun rule_6_mark_dup --until rule_7_dedup_counts
You can see the rules list in this diagram: marseq-graph.png
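As a small sketch of how these flags fit into a full command, the helper below builds the --forcerun/--until pair from a first and last rule name. The function name partial_run_flags is hypothetical; the rule names are the ones used in the examples above.

```shell
#!/usr/bin/env bash
# Sketch: build the flags for running a contiguous part of the pipeline.
# If only one rule is given, it is used as both the first and the last rule.
# partial_run_flags is a hypothetical helper, not part of UTAP or snakemake.
partial_run_flags() {
    local first="$1" last="${2:-$1}"
    echo "--forcerun ${first} --until ${last}"
}

# Usage on a local machine (add -n for a dry run that only lists the jobs):
#   snakemake -p --cores 20 --snakefile snakefile-marseq.txt \
#       $(partial_run_flags rule_6_mark_dup rule_7_dedup_counts)
```

Running with -n first shows which jobs would be executed without starting them.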
Report:
The report output is 4_reports/report.html