Epi-Seq - A Bioinformatics pipeline for predicting cancer-specific epitopes from RNA-Seq data
Version 1.0.0 (03/05/2013)
=========================================================

INTRODUCTION

The Epi-Seq pipeline is a multi-step analysis process that starts from the raw RNA-Seq reads of a tumor sample and produces a set of predicted tumor-specific expressed epitopes. It incorporates a number of tools developed in house, including SNVQ [2] for calling mutations and RefHap [1] for phasing them, as well as the NetMHC [3,4] epitope prediction algorithm developed by the Center for Biological Sequence Analysis at the Technical University of Denmark.

REQUIREMENTS

To run the epitope prediction step (see the RUNNING Epi-Seq section), NetMHC 3.0 must be installed.
To run the full Epi-Seq pipeline, bowtie must be installed, and bowtie mapping indices must be built for both the genome and the transcriptome.

INSTALLATION

1. Create an Epi-Seq directory and download the compressed Epi-Seq package from
   http://dna.engr.uconn.edu/software/Epi-Seq/Epi-Seq-1.0.0.tar.gz
2. Uncompress the downloaded archive into the directory.
3. If you want to rebuild Epi-Seq, run the unix script "build" provided in the compressed file.
4. Run the script "setup" to set up the paths of the Epi-Seq executables.
5. Edit the file SNV_jars/epitopeFinder.properties and set netmhc.program to the path of the local installation of NetMHC 3.0.

RUNNING Epi-Seq

The two main executables are in the bin directory.

A. Epi-Seq: Predicting SNVs and epitopes starting from RNA-Seq data

Usage: Epi-Seq genome_index transcriptome_index genome GTF CCDS_Map CCDS_cDNA MHC_Allele phred_convert clipping_mode clip5 clip3 read_length output_prefix fastq_file

  genome_index          path to bowtie genome index
  transcriptome_index   path to bowtie transcriptome index
  genome                genome fasta file
  GTF                   transcriptome annotation file
  CCDS_Map              CCDS annotation file
  CCDS_cDNA             cDNA sequences fasta file
  MHC_Allele            MHC I allele for which epitopes will be predicted
  phred_convert         yes/no (option to convert base quality scores to the Phred+33 scale, as expected by SNVQ)
  clipping_mode         read clipping mode
                        - user_defined: clip reads according to the values of clip5 and clip3
                        - none: do not clip reads
                        - auto: automatically determine the number of bases to clip based on mismatch statistics
  clip5                 number of bases to clip at the 5' end of the reads (set to 0 for clipping_mode auto or none)
  clip3                 number of bases to clip at the 3' end of the reads (set to 0 for clipping_mode auto or none)
  read_length           length of the RNA-Seq reads
  output_prefix         prefix for output files
  fastq_file            RNA-Seq reads file

Note: the compressed file includes a sample script that calls the pipeline (run-pipeline), in which the arguments can be easily edited.

Output:

1. .snv
   The list of SNVs called by SNVQ.
   The output is a tab-delimited text file with the following fields:
   - Sequence name
   - Position
   - Reference allele
   - Alternative allele
   - Genotype as a two-character string
   - Coverage of the reference allele
   - Coverage of the alternative allele
   - Phred quality score of the genotype
   - Phred quality score of the probability of having a variant in this locus
   - Number of A calls
   - Number of C calls
   - Number of G calls
   - Number of T calls
   - Coverage of the reference allele on the positive strand
   - Coverage of the alternative allele on the positive strand
   - Number of A calls on the positive strand
   - Number of C calls on the positive strand
   - Number of G calls on the positive strand
   - Number of T calls on the positive strand
   - Coverage of the reference allele on the negative strand
   - Coverage of the alternative allele on the negative strand
   - Number of A calls on the negative strand
   - Number of C calls on the negative strand
   - Number of G calls on the negative strand
   - Number of T calls on the negative strand

2. .phase
   Phasing of a subset of the called SNVs (those for which there is read evidence connecting them); RefHap output.
   The output is a tab-delimited text file with the following fields:
   - Sequence name
   - Position
   - Haplotype of the first base in the genotype (0/1)
   - Haplotype of the second base in the genotype (0/1)
   - Block number: SNVs with the same block number are phased together

3. .out
   The list of predicted epitopes.
   The output is a tab-delimited text file with the following fields:
   - Mutation (SequenceName_position)
   - Gene name
   - CCDS transcript ID
   - URL to mutation position on UCSC genome browser
   - Long WT peptide
   - Long mutated peptide
   - Reference allele
   - Alternative allele
   - Coverage of the reference allele
   - Coverage of the alternative allele
   - Phred quality score of the genotype
   - Genotype
   - Nearby mutations
   - MHC I allele
   - Short WT peptide
   - WT peptide score
   - Short mutated peptide
   - Mutant peptide score
   - Score difference
   - Number of predicted epitopes for this mutation
   - Minimum peptide score difference for this mutation
   - Maximum peptide score difference for this mutation
   - Maximum peptide score for this mutation

4. .out.snvStat
   List of SNVs within CCDS coding regions.
   The output is a tab-delimited text file with the following fields:
   - Mutation (SequenceName_position)
   - Reference allele
   - Alternative allele
   - Genotype
   - Heterozygous/homozygous
   - Synonymous/non-synonymous
   - Type of synonymous mutation

B. epitopeFinder.sh: Predicting epitopes given a list of called SNVs and phased SNVs

Usage: epitopeFinder.sh

Arguments (all required):
  -outFilename       file where the predictions report will be stored
  -ccdsFilename      file with the CCDS transcripts map
  -cdnaFilename      file with CCDS transcripts cDNA sequences in fasta format
  -algorithm         NetMHC prediction algorithm [PWM - ANN]
  -motifs            MHC I allele for which epitopes will be predicted
  -phaseInFilename   file of phased SNVs (output 2 in A. above)
  called_snps        file of called SNVs (output 1 in A. above)

Output:

1. The list of predicted epitopes. Same as output 3 for epitopePredictionPipeline.sh
2. .snvStat
   List of SNVs within CCDS coding regions. Same as output 4 for epitopePredictionPipeline.sh

CONTACTS

For questions or suggestions regarding Epi-Seq you can contact:
  Sahar Al Seesi (sahar@engr.uconn.edu)
  Ion Mandoiu (ion@engr.uconn.edu)

REFERENCES

[1] J. Duitama, et al., ReFHap: A Reliable and fast algorithm for Single Individual Haplotyping, Proc. ACM-BCB, pp. 160-169, 2010.
[2] J. Duitama, P.K. Srivastava, and I.I. Mandoiu, Towards accurate detection and genotyping of expressed variants from Whole Transcriptome Sequencing data, BMC Genomics, 13(Suppl 2):S6, 2012.
[3] M. Nielsen, et al., Reliable prediction of T-cell epitopes using neural networks with novel sequence representations, Protein Sci., 12:1007-17, 2003.
[4] M. Nielsen, et al., Improved prediction of MHC class I and class II epitopes using a novel Gibbs sampling approach, Bioinformatics, 20(9):1388-97, 2004.
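
EXAMPLE: READING THE .snv OUTPUT

The .snv field list given under RUNNING Epi-Seq may be easier to work with in a downstream script. The following is a minimal Python sketch, not part of the Epi-Seq distribution, that reads the first nine columns of each line into a named record. It assumes the file has no header line and that the coverage and quality columns parse as numbers; the names SnvRecord, read_snvs and is_heterozygous are hypothetical, and the sample line is made up for illustration.

```python
import csv
import io
from typing import Iterator, NamedTuple


class SnvRecord(NamedTuple):
    """First nine .snv columns, in the order listed in this README."""
    sequence: str
    position: int
    ref_allele: str
    alt_allele: str
    genotype: str
    ref_coverage: float
    alt_coverage: float
    genotype_quality: float
    variant_quality: float


def read_snvs(handle) -> Iterator[SnvRecord]:
    """Yield one SnvRecord per non-empty tab-delimited line.

    Assumption: no header line. Coverage is parsed as float so that
    either integer or decimal formatting is accepted.
    """
    for row in csv.reader(handle, delimiter="\t"):
        if not row:
            continue  # skip blank lines
        yield SnvRecord(
            row[0], int(row[1]), row[2], row[3], row[4],
            float(row[5]), float(row[6]), float(row[7]), float(row[8]),
        )


def is_heterozygous(record: SnvRecord) -> bool:
    """A two-character genotype with two different bases is heterozygous."""
    return len(record.genotype) == 2 and record.genotype[0] != record.genotype[1]


# Made-up line with 25 tab-separated fields; not real Epi-Seq output.
sample = "chr1\t12345\tA\tG\tAG\t10\t8\t50\t60" + "\t0" * 16 + "\n"
records = list(read_snvs(io.StringIO(sample)))
print(records[0].genotype, is_heterozygous(records[0]))  # prints: AG True
```

To read a real file, pass an open file handle instead of the io.StringIO object, for example open("sample.snv", newline=""), where sample.snv stands for whatever .snv file the pipeline produced. The remaining sixteen per-base and per-strand count columns are left unparsed here and can be added to the record in the same way.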