GeneSeq - Calling genotypes from next generation reads Version 1.0.0 (12/05/08) ============================================== The GeneSeq package provides methods to call the genotype and the haplotypes of a single individual given reads produced by second generation sequencing tecnologies like 454, Illumina, and ABI SOLiD. GeneSeq improves the calling accuracy by taking into account haplotypes from the population to which the individual to genotype belongs. GeneSeq implements a Hidden Markov Model (HMM) of evolution through recombination and mutation from a small set of founder haplotypes to mix information provided by the population haplotypes with the information provided by mapped reads. ---------------- Building GeneSeq ---------------- The code has been compiled and tested successfully on Gentoo Linux with the GNU gcc compiler version 3.4.5. To build the GEDI executables run the following commands in the directory where GeneSeq_1.0.0.tar.gz is located: tar -xzvf GeneSeq-1.0.0.tar.gz cd GeneSeq make all -------------- Runing GeneSeq -------------- GeneSeq is the command-line tool. Basic usage requires a text file with mapped reads in a format similar to nucmer that will be explained below, a legend text file with the snps to be called, a text file with the population haplotypes in 0-1 format, and a file name to store the output. Examples of input these files are provided. The basic command line example looks like this: GENE-SEQ -ref_snp_info chr22.hapmap_legend -ref_chr_hap chr22.population_01phasing -reads_file chr22.unsorted -output_file_name chr22.output The full list of options is explained below: Usage: GENE-SEQ REQUIRED_ARGUMENTS: -ref_snp_info : Info file for SNPs to genotype. It must have a header line and four columns separated by tabs. Columns are snp id, snp position in the chromosome, base encoded as 0, and base encoded as 1 -ref_chr_hap : Reference haplotypes. It has a 0-1 format with a line per haplotype and alleles separated by spaces. The number of alleles per haplotype must match the number of snps in snp_info_file -reads_file : File with mapped reads (format explained below) -output_file_name : File where output will be stored. For each snp in snp_info_file a line will be produced with the snp id and the two called bases (equal if the snp is called as homozygous). For posterior methods the fourth column is the posterior genotype probability. For other methods it just will be 0. For viterbi methods the second column across snps corresponds with the first haplotype and the second column corresponds to the second haplotype. For other methods the two alleles of each snp will be sorted alphabetically. OPTIONAL_ARGUMENTS: -version (posterior1|viterbi|posterior-viterbi) : Algorithm to run. Algorithms explained below. Default: posterior-viterbi -find_snps_out_file : Output file with information provided by the reads. For each snp in snp_info_file a line will be produced with the snp id, the two base alleles ordered alphabetically, the number of reads covering each allele, the probability of the reads covering the snp given each of the three possible genotypes and the posterior probability of homozygous0 and heterozygous, assuming as prior that the population information reflects the true allele frequencies and that genotypes are in HWE. -read_qualities_status (N|I|P) : N if the reads file does not include read qualities, I for ignoring them, P to consider them. Default: I -nr_founders : Number of haplotype founders. Default: 7 -blocksize : Number of snps in a single HMM window. Default: 50 -half_overlap : Half of the number of snps overlaping two HMM windows. Default: 10 -seed : Seed for random number generation. Default 1 -founder_bias : Initial probability of staying in the same founder from one snp to the next. Default 0.9 -noise : Desired percentage of noise on initial transition and emission probabilities. Default 0.4 -max_steps : Maximum of iterations for the HMM training. Default 1000 ---------------------------- Sample data and reads format ---------------------------- The distribution includes one set of sample input files. The files chr22.hapmap_legend and chr22.population_01phasing correspond to snps and population information that has been taken from http://ftp.hapmap.org/phasing/2007-08_rel22/phased/ for chromosome 22 and Yoruban population. The file chr22.unsorted correspond to Illumina reads available in ftp://ftp.ncbi.nih.gov/pub/TraceDB/ShortRead/SRA000272 mapped to the reference human genome version 36.3 using maq software. Just reads mapped to chromosome 22 are included. The reads file includes one line per read with the following information separated by spaces: - Read id - Start position in the reference genome - End position in the reference genome - Start position in the read - End position in the read - Delta vector with insertions and deletions. String of signed digits finishing with 0 (zero). Each digit represents the distance to the next insertion in the reference (positive int) or deletion in the reference (negative int). For example, the insertions and deletions of the alignment: A = ABC.DACBDCAC B = .BCCDAC.DCAC are encoded as 1 -3 4 0. Further information about this encoding can be found in http://mummer.sourceforge.net/manual/ - Read quality. Phred scale score for the entire read. This field is optional, if present, read_qualities_status parameter must be set to I or P. Otherwise, it must be set to N - Read sequence - Phred quality score for each base of the read. There must be as many scores as the length of the read. Score must be separated by spaces. To convert reads mapped by maq to the format explained above, a separated module, included in this distribution, has been developed. The module called maqGeneSeqTranslator is a simple command line tool that receives the text file generated by the command maq mapview (See http://maq.sourceforge.net/maq-man.shtml for details) and a contigs text file with the desired distribution in chromosomes and generates (or add reads to) an output file per chromosome with the reads mapped to sequences related with the chromosome. The format of the contigs file has five columns: chromosome number, sequence id, start position in the chromosome, end position in the chromosome and strand (zero if the strand is not known). A contig file for the 22 autosomal chromosomes of the reference human genome version 36.3 and a small reads file in maq format are included as samples. No extra steps are needed to build this tool. The command line to run it looks like: maqGeneSeqTranslator -contigsFile ref36-3.contig_map -readsFile reads.txt --------------------------- Genotype Calling Algorithms --------------------------- Different algorithms can be applied in the genotype calling step depending on the goal to be achieved. The implemented algorithms are the following: posterior1 : Calculates for each SNP the genotype with maximum posterior probability given the reads and the population information. Implements a posterior decoding procedure on the HMM. It produces high accuracy genotypes but it does not produce haplotypes viterbi : Calculates the path over the founders with maximum likelihood given the reads by implementing a Viterbi algorithm. Genotyping accuracy of this method is less than posterior1 but it produces more accurate haplotypes than phasing the genotype generated by posterior1. posterior-viterbi : Posterior1 + phasing. Produces as good genotypes as posterior1 but haplotypes less accurate than viterbi. ---------------- Revision history ---------------- Version 1.0.0 (12/05/08) - first public release ------------------- Contact Information ------------------- For questions and bug reports please send e-mail to jlk02019@engr.uconn.edu, jduitama@engr.uconn.edu, or ion@engr.uconn.edu.