Epi-Seq - A bioinformatics pipeline for predicting
cancer-specific epitopes from RNA-Seq data
Version 1.0.0 (03/05/2013)
=========================================================


INTRODUCTION

The Epi-Seq pipeline is a multi-step analysis process that starts from the raw RNA-Seq reads of a tumor sample
and produces a set of predicted tumor-specific expressed epitopes. It incorporates several tools developed in-house,
including SNVQ [2] for calling mutations and RefHap [1] for phasing them, as well as the NetMHC [3,4] epitope
prediction algorithm developed by the Center for Biological Sequence Analysis at the Technical University of Denmark.


REQUIREMENTS

To run the epitope prediction step (see the RUNNING Epi-Seq section), NetMHC 3.0 must be installed. To run the full Epi-Seq pipeline, bowtie must also be installed, and bowtie mapping indices must be built for both the genome and the transcriptome.


INSTALLATION


1. Create an Epi-Seq directory and download the compressed Epi-Seq archive
   from http://dna.engr.uconn.edu/software/Epi-Seq/Epi-Seq-1.0.0.tar.gz

2. Uncompress the downloaded archive (Epi-Seq-1.0.0.tar.gz) into the directory.

3. If you want to rebuild Epi-Seq, run the Unix script "build"
   provided in the compressed file.

4. Run the script "setup" to set up the paths of the Epi-Seq executables.

5. Edit the file SNV_jars/epitopeFinder.properties and set netmhc.program to the path of the local installation of
   NetMHC 3.0.
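
Step 5 can also be scripted. The following is a minimal Python sketch, assuming that epitopeFinder.properties uses plain "key=value" lines; the helper name set_property is hypothetical and is not part of Epi-Seq.

```python
def set_property(text, key, value):
    """Return properties text with `key` set to `value`.

    Assumes plain "key=value" lines (an assumption about the file format).
    Comment lines starting with '#' or '!' and blank lines are kept as-is.
    """
    out = []
    found = False
    for line in text.splitlines():
        stripped = line.strip()
        if stripped and not stripped.startswith(("#", "!")):
            name = stripped.split("=", 1)[0].strip()
            if name == key:
                # Replace the existing entry in place.
                out.append(f"{key}={value}")
                found = True
                continue
        out.append(line)
    if not found:
        # Key was not present: append it at the end.
        out.append(f"{key}={value}")
    return "\n".join(out) + "\n"
```

To use it, read SNV_jars/epitopeFinder.properties, pass its text through set_property(text, "netmhc.program", "/path/to/netMHC") with the real path substituted, and write the result back to the same file.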


RUNNING Epi-Seq


The two main executables are in the bin directory:

									  
A. Epi-Seq: Predicting SNVs and epitopes starting from RNA-Seq data

Usage:
		Epi-Seq genome_index transcriptome_index genome GTF CCDS_Map CCDS_cDNA MHC_Allele phred_convert clipping_mode
						clip5 clip3 read_length output_prefix fastq_file


		genome_index 				path to bowtie genome index

		transcriptome_index 			path to bowtie transcriptome index

		genome					genome fasta file

		GTF 					transcriptome annotation file

		CCDS_Map 				CCDS annotation file

		CCDS_cDNA 				cDNA sequences fasta file

		MHC_Allele 				MHC I allele for which epitopes will be predicted

		phred_convert		 		yes/no (whether to convert base quality scores to the Phred+33 scale expected by SNVQ)

		clipping_mode				read clipping mode
							-user_defined: clip reads according to the values of clip5 and clip3
							-none : do not clip reads
							-auto: automatically determine number of bases to clip based on mismatch statistics
							
		clip5 					number of bases to clip at the 5' end of the reads (set to 0 for clipping_mode auto or none)

		clip3 					number of bases to clip at the 3' end of the reads (set to 0 for clipping_mode auto or none)

		read_length 				length of RNA-Seq reads

		output_prefix 				prefix for output files

		fastq_file				RNA-Seq reads file

Note: the compressed file includes a sample script for calling the pipeline (run-pipeline), in which the arguments can be easily edited.
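
The argument order above can be sketched in Python as follows. This is a minimal illustration, not the shipped run-pipeline script: the function name, the default executable path "bin/Epi-Seq", and the assumption that clipping_mode values are written without a leading dash are all assumptions.

```python
# Clipping modes as listed above (assumed to be passed without a leading dash).
CLIPPING_MODES = ("user_defined", "none", "auto")


def build_epi_seq_command(genome_index, transcriptome_index, genome, gtf,
                          ccds_map, ccds_cdna, mhc_allele, phred_convert,
                          clipping_mode, clip5, clip3, read_length,
                          output_prefix, fastq_file,
                          executable="bin/Epi-Seq"):
    """Return the Epi-Seq argument list in the order given under Usage."""
    if phred_convert not in ("yes", "no"):
        raise ValueError("phred_convert must be 'yes' or 'no'")
    if clipping_mode not in CLIPPING_MODES:
        raise ValueError(f"unknown clipping_mode: {clipping_mode!r}")
    # clip5 and clip3 are only meaningful for user_defined clipping.
    if clipping_mode != "user_defined" and (clip5 != 0 or clip3 != 0):
        raise ValueError("clip5 and clip3 must be 0 unless clipping_mode is user_defined")
    return [executable, genome_index, transcriptome_index, genome, gtf,
            ccds_map, ccds_cdna, mhc_allele, phred_convert, clipping_mode,
            str(clip5), str(clip3), str(read_length), output_prefix, fastq_file]
```

The returned list can be passed to subprocess.run(cmd, check=True) to launch the pipeline; building the list first makes it easy to check the argument order before a long run.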


Output:

1. <output_prefix>.snv
The list of SNVs called by SNVQ. The output is a tab-delimited text file with the following fields:

- Sequence name
- Position
- Reference Allele
- Alternative Allele
- Genotype as a two-character string
- Coverage of the reference allele
- Coverage of the alternative allele
- Phred quality score of the genotype
- Phred quality score of the probability of having a variant in this locus
- Number of A calls
- Number of C calls
- Number of G calls
- Number of T calls 
- Coverage of the reference allele on the positive strand
- Coverage of the alternative allele on the positive strand
- Number of A calls on the positive strand
- Number of C calls on the positive strand
- Number of G calls on the positive strand
- Number of T calls on the positive strand
- Coverage of the reference allele on the negative strand
- Coverage of the alternative allele on the negative strand
- Number of A calls on the negative strand
- Number of C calls on the negative strand
- Number of G calls on the negative strand
- Number of T calls on the negative strand
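
The field list above can be read into named records with a short Python sketch. The field names below are hypothetical labels chosen for illustration, and the sketch assumes that the file has no header line and contains the 25 fields in the order listed.

```python
import csv

# Hypothetical field names, in the order listed above (25 fields).
SNV_FIELDS = [
    "sequence", "position", "ref_allele", "alt_allele", "genotype",
    "ref_coverage", "alt_coverage", "genotype_quality", "variant_quality",
    "a_calls", "c_calls", "g_calls", "t_calls",
    "ref_coverage_pos", "alt_coverage_pos",
    "a_calls_pos", "c_calls_pos", "g_calls_pos", "t_calls_pos",
    "ref_coverage_neg", "alt_coverage_neg",
    "a_calls_neg", "c_calls_neg", "g_calls_neg", "t_calls_neg",
]


def read_snvs(handle):
    """Yield one dict per SNV line from an open <output_prefix>.snv file."""
    for row in csv.reader(handle, delimiter="\t"):
        if not row:
            continue  # skip blank lines
        if len(row) < len(SNV_FIELDS):
            raise ValueError(f"expected {len(SNV_FIELDS)} fields, got {len(row)}")
        # Values are kept as strings; any extra trailing columns are ignored.
        yield dict(zip(SNV_FIELDS, row))
```

Values are left as strings, so convert them with int() or float() as needed, for example float(record["genotype_quality"]) when filtering by genotype quality.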


2. <output_prefix>.phase
Phasing of a subset of the called SNVs (those connected by read evidence), as output by RefHap. The output is a tab-delimited text file with the following fields:

- Sequence name
- Position
- Haplotype of the first base in the genotype (0/1)
- Haplotype of the second base in the genotype (0/1)
- Block number: SNVs with the same block number are phased together
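
Since SNVs that share a block number are phased together, a natural first step is to group the file by block. The sketch below assumes there is no header line and keys each block by (sequence name, block number), which is safe whether block numbers are unique per file or only per sequence; the function name is hypothetical.

```python
from collections import defaultdict


def read_phase_blocks(handle):
    """Group lines of an open <output_prefix>.phase file by phasing block.

    Returns a dict mapping (sequence, block) to a list of
    (position, haplotype_first, haplotype_second) tuples.
    """
    blocks = defaultdict(list)
    for line in handle:
        line = line.rstrip("\r\n")
        if not line:
            continue  # skip blank lines
        sequence, position, hap_first, hap_second, block = line.split("\t")[:5]
        blocks[(sequence, block)].append((int(position), hap_first, hap_second))
    return dict(blocks)
```

Each value in the returned dict lists the SNVs of one block, so two mutations can be treated as being on the same haplotype only when they appear under the same key.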


3. <output_prefix>.out
The list of predicted epitopes. The output is a tab-delimited text file with the following fields:
- Mutation (SequenceName_position)
- Gene name
- CCDS transcript ID
- URL to mutation position on UCSC genome browser
- Long WT peptide
- Long mutated peptide
- Reference allele
- Alternative allele
- Coverage of the reference allele
- Coverage of the alternative allele
- Phred quality score of the genotype
- Genotype
- Nearby mutations
- MHC I allele
- Short WT peptide
- WT peptide score
- Short mutated peptide
- Mutant peptide score
- Score difference
- Number of predicted epitopes for this mutation
- Minimum peptide score difference for this mutation
- Maximum peptide score difference for this mutation
- Maximum peptide score for this mutation
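
Because the report can contain several predicted epitopes for one mutation, it is often convenient to group its lines by mutation. The following Python sketch does this using column positions taken from the field order above; it assumes there is no header line, and the function and constant names are hypothetical.

```python
import csv
from collections import defaultdict

# Zero-based column positions, following the field order listed above.
COL_MUTATION = 0
COL_GENE = 1
COL_SHORT_MUT_PEPTIDE = 16
COL_MUT_SCORE = 17
N_FIELDS = 23


def mutant_peptides_by_mutation(handle):
    """Map each mutation to its (gene, short mutated peptide, mutant score) rows."""
    grouped = defaultdict(list)
    for row in csv.reader(handle, delimiter="\t"):
        if not row:
            continue  # skip blank lines
        if len(row) < N_FIELDS:
            raise ValueError(f"expected {N_FIELDS} fields, got {len(row)}")
        grouped[row[COL_MUTATION]].append(
            (row[COL_GENE], row[COL_SHORT_MUT_PEPTIDE], row[COL_MUT_SCORE])
        )
    return dict(grouped)
```

Scores are kept as strings here because their scale depends on the NetMHC prediction algorithm used; convert them with float() before comparing or sorting.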


4. <output_prefix>.out.snvStat
List of SNVs within CCDS coding regions. The output is a tab-delimited text file with the following fields:
- Mutation (SequenceName_position)
- Reference allele
- Alternative allele
- Genotype
- Heterozygous/homozygous
- Synonymous/non-synonymous
- Type of synonymous mutation
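
A quick summary of this file is a tally of SNVs by zygosity and by synonymous/non-synonymous status. The Python sketch below counts whatever labels appear in the fifth and sixth columns, since the exact label text is not specified here; it assumes there is no header line, and the function name is hypothetical.

```python
import csv
from collections import Counter


def tally_snv_stats(handle):
    """Count SNVs in an open .snvStat file by (zygosity, effect) label pair."""
    counts = Counter()
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < 6:
            continue  # skip blank or incomplete lines
        zygosity, effect = row[4], row[5]
        counts[(zygosity, effect)] += 1
    return counts
```

The result is a collections.Counter, so counts.most_common() lists the label pairs from most to least frequent.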


B. epitopeFinder.sh: Predicting epitopes given a list of called SNVs and phased SNVs

Usage:
		epitopeFinder.sh <ARGUMENTS> 

Arguments (all required):

		-outFilename <output_file>		 file where the predictions report will be stored

		-ccdsFilename				 file with the CCDS transcripts map

		-cdnaFilename				 file with CCDS transcripts cdna sequences in fasta format

		-algorithm 				 NetMHC prediction algorithm (PWM or ANN)

		-motifs 				 MHC I allele for which epitopes will be predicted

		-phaseInFilename 			 file of phased SNVs (output 2 in A. above)

		called_snps 				 file of called SNVs (output 1 in A. above)
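
The arguments above can be assembled in Python as sketched below. This is an illustration only: the function name and the default executable path "bin/epitopeFinder.sh" are assumptions, and the flags are emitted in the order listed, with the called SNVs file as the final positional argument.

```python
def build_epitope_finder_command(out_filename, ccds_filename, cdna_filename,
                                 algorithm, motifs, phase_in_filename,
                                 called_snps,
                                 executable="bin/epitopeFinder.sh"):
    """Return the epitopeFinder.sh argument list, with all required arguments."""
    if algorithm not in ("PWM", "ANN"):
        raise ValueError("algorithm must be 'PWM' or 'ANN'")
    return [
        executable,
        "-outFilename", out_filename,
        "-ccdsFilename", ccds_filename,
        "-cdnaFilename", cdna_filename,
        "-algorithm", algorithm,
        "-motifs", motifs,
        "-phaseInFilename", phase_in_filename,
        called_snps,  # positional: file of called SNVs
    ]
```

As with the full pipeline, the list can be handed to subprocess.run(cmd, check=True); the <output_prefix>.phase and <output_prefix>.snv files from A. are the natural inputs for phase_in_filename and called_snps.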


Output:
1. <output_file>
The list of predicted epitopes. Same as output 3 in A. above.

2. <output_file>.snvStat
List of SNVs within CCDS coding regions. Same as output 4 in A. above.


CONTACTS


For questions or suggestions regarding Epi-Seq you can contact:

     Sahar Al Seesi (sahar@engr.uconn.edu)
     Ion Mandoiu (ion@engr.uconn.edu)


REFERENCES


[1]	J. Duitama, et al., ReFHap: A Reliable and fast algorithm for Single Individual Haplotyping, Proc. ACM-BCB, pp. 160-169, 2010.
[2]	J. Duitama, P.K. Srivastava, and I.I. Mandoiu, Towards accurate detection and genotyping of expressed variants from Whole Transcriptome Sequencing data, BMC Genomics, 13(Suppl 2):S6, 2012.
[3]	M. Nielsen, et al., Reliable prediction of T-cell epitopes using neural networks with novel sequence representations, Protein Sci., 12:1007-17, 2003.
[4]	M. Nielsen, et al., Improved prediction of MHC class I and class II epitopes using a novel Gibbs sampling approach, Bioinformatics, 20(9):1388-97, 2004.