GeneSeq - Calling genotypes from next generation reads
Version 1.0.0 (12/05/08)
==============================================

The GeneSeq package provides methods to call the genotype and the haplotypes 
of a single individual given reads produced by second generation sequencing 
tecnologies like 454, Illumina, and ABI SOLiD. GeneSeq improves the calling 
accuracy by taking into account haplotypes from the population to which the
individual to genotype belongs. GeneSeq implements a Hidden Markov Model (HMM)
of evolution through recombination and mutation from a small set of founder
haplotypes to mix information provided by the population haplotypes with the
information provided by mapped reads.

----------------
Building GeneSeq 
----------------

The code has been compiled and tested successfully on Gentoo Linux with 
the GNU gcc compiler version 3.4.5.  To build the GEDI executables run 
the following commands in the directory where GeneSeq_1.0.0.tar.gz is 
located:

tar -xzvf GeneSeq-1.0.0.tar.gz
cd GeneSeq
make all

--------------
Runing GeneSeq
--------------

GeneSeq is the command-line tool. Basic usage requires
a text file with mapped reads in a format similar to nucmer that will be explained 
below, a legend text file with the snps to be called, a text file with the
population haplotypes in 0-1 format, and a file name to store the output. Examples 
of input these files are provided. The basic command line example looks like this:

GENE-SEQ -ref_snp_info chr22.hapmap_legend -ref_chr_hap chr22.population_01phasing -reads_file chr22.unsorted -output_file_name chr22.output 

The full list of options is explained below:

Usage:
	GENE-SEQ <REQUIRED_ARGUMENTS> <OPTIONAL_ARGUMENTS>

REQUIRED_ARGUMENTS:
        -ref_snp_info <snp_info_file>   : Info file for SNPs to genotype. It must 
					 have a header line and four columns separated 
					 by tabs. Columns are snp id, snp position in 
					 the chromosome, base encoded as 0, and base encoded as 1
        -ref_chr_hap <hap_file>         : Reference haplotypes. It has a 0-1 format with a line per
					 haplotype and alleles separated by spaces. The number of alleles per
					 haplotype must match the number of snps in snp_info_file
        -reads_file <reads_file>        : File with mapped reads (format explained below)
        -output_file_name <output_file> : File where output will be stored. For each snp in snp_info_file a
					  line will be produced with the snp id and the two called bases (equal if
					  the snp is called as homozygous). For posterior methods the fourth column is
					  the posterior genotype probability. For other methods it just will be 0.
					  For viterbi methods the second column across snps corresponds with the first
					  haplotype and the second column corresponds to the second haplotype. For other methods
					  the two alleles of each snp will be sorted alphabetically.

OPTIONAL_ARGUMENTS:
        -version (posterior1|viterbi|posterior-viterbi) : Algorithm to run. Algorithms explained below. Default: posterior-viterbi
        -find_snps_out_file <snps_output_file> : Output file with information provided by the reads. For each snp in 
						 snp_info_file a line will be produced with the snp id, the two base alleles
						 ordered alphabetically, the number of reads covering each allele, the probability
						 of the reads covering the snp given each of the three possible genotypes and the posterior
						 probability of homozygous0 and heterozygous, assuming as prior that the population information 
						 reflects the true allele frequencies and that genotypes are in HWE.
        -read_qualities_status (N|I|P)         : N if the reads file does not include read qualities, I for ignoring them, 
						 P to consider them. Default: I
        -nr_founders <int>                     : Number of haplotype founders. Default: 7
        -blocksize <int>                       : Number of snps in a single HMM window. Default: 50
        -half_overlap <int>                    : Half of the number of snps overlaping two HMM windows. Default: 10
        -seed <long>                           : Seed for random number generation. Default 1
        -founder_bias <double>                 : Initial probability of staying in the same founder from one snp to the next. Default 0.9
        -noise <double>                        : Desired percentage of noise on initial transition and emission probabilities. Default 0.4
        -max_steps <int>                       : Maximum of iterations for the HMM training. Default 1000

----------------------------
Sample data and reads format
----------------------------

The distribution includes one set of sample input files. The files
chr22.hapmap_legend and chr22.population_01phasing correspond to snps and
population information that has been taken from 
http://ftp.hapmap.org/phasing/2007-08_rel22/phased/ for chromosome 22 and 
Yoruban population. The file chr22.unsorted correspond to Illumina reads available in
ftp://ftp.ncbi.nih.gov/pub/TraceDB/ShortRead/SRA000272 mapped to the
reference human genome version 36.3 using maq software. Just reads mapped to
chromosome 22 are included. 

The reads file includes one line per read with the following information separated by spaces:
- Read id
- Start position in the reference genome
- End position in the reference genome
- Start position in the read
- End position in the read
- Delta vector with insertions and deletions. String of signed digits 
  finishing with 0 (zero). Each digit represents the distance to the next insertion in 
  the reference (positive int) or deletion in the reference (negative int). For example, the 
  insertions and deletions of the alignment:

    A = ABC.DACBDCAC
    B = .BCCDAC.DCAC

  are encoded as 1 -3 4 0. Further information about this encoding can be found in
  http://mummer.sourceforge.net/manual/
- Read quality. Phred scale score for the entire read. This field is optional, if present, 
  read_qualities_status parameter must be set to I or P. Otherwise, it must be set to N
- Read sequence
- Phred quality score for each base of the read. There must be as many scores as the length of
  the read. Score must be separated by spaces.

To convert reads mapped by maq to the format explained above, a separated module, included in this distribution,
has been developed. The module called maqGeneSeqTranslator is a simple command line tool that
receives the text file generated by the command maq mapview 
(See http://maq.sourceforge.net/maq-man.shtml for details) and a contigs text file with the desired
distribution in chromosomes and generates (or add reads to) an output file per chromosome with the reads
mapped to sequences related with the chromosome. The format of the contigs file has five columns: chromosome number,
sequence id, start position in the chromosome, end position in the chromosome and strand (zero if the strand is not known).
A contig file for the 22 autosomal chromosomes of the reference human genome version 36.3 and a small reads file in maq format
are included as samples. No extra steps are needed to build this tool. The command line to run it looks like:

maqGeneSeqTranslator -contigsFile ref36-3.contig_map -readsFile reads.txt

---------------------------
Genotype Calling Algorithms
---------------------------
Different algorithms can be applied in the genotype calling step depending on the goal to be achieved. The implemented algorithms
are the following:

posterior1        : Calculates for each SNP the genotype with maximum posterior probability given the reads and the population information.
	            Implements a posterior decoding procedure on the HMM. It produces high accuracy genotypes but it does not produce haplotypes 
viterbi           : Calculates the path over the founders with maximum likelihood given the reads by implementing a Viterbi algorithm.
	            Genotyping accuracy of this method is less than posterior1 but it produces more accurate haplotypes than phasing the genotype
                    generated by posterior1.
posterior-viterbi : Posterior1 + phasing. Produces as good genotypes as posterior1 but haplotypes less accurate than viterbi.  

----------------
Revision history
----------------

Version 1.0.0 (12/05/08)  - first public release

-------------------
Contact Information
-------------------

For questions and bug reports please send e-mail to jlk02019@engr.uconn.edu,
jduitama@engr.uconn.edu, or ion@engr.uconn.edu.