ENT - Genotype Phasing by Entropy Minimization
Version 1.0.2, Oct. 5, 2008
==============================================

A Single Nucleotide Polymorphism (SNP) is a position in the genome at
which two or more of the four possible nucleotides occur in a large
percentage of the population. SNPs account for most of the genetic
variability between individuals, and mapping SNPs in the human
population has become the next high priority in genomics after the
completion of the Human Genome project.

In diploid organisms such as humans, there are two non-identical copies
of each autosomal chromosome. A description of the SNPs in a chromosome
is called a haplotype. At present, it is prohibitively expensive to
directly determine the haplotypes of an individual, but it is rather
easy to obtain the conflated SNP information in the so-called genotype.
Computational methods for genotype phasing, i.e., inferring haplotypes
from genotype data, have received much attention in recent years because
haplotype information increases the statistical power of disease
association tests.

ENT is a highly scalable genotype phasing algorithm based on entropy
minimization. ENT is capable of phasing both unrelated genotypes and
related genotypes coming from complex pedigrees. The open source
implementation of ENT and a web interface are publicly available at
http://dna.engr.uconn.edu/~software/ent/.

Building ENT
------------

To build the ent executable, run the following commands in the directory
where ent_1.0.2.tar.gz is located:

    tar -xzvf ent_1.0.2.tar.gz
    cd ent_1.0.2
    make all

After compiling the executable, make runs a regression test (assuming
that awk is installed on your system) to verify that it works as
expected. Upon completion of this test you should get the message
"REGRESSION TEST SUCCESSFUL!!!"

The code has been compiled and tested successfully on Gentoo Linux with
gcc versions 3.3.5 and 3.4.5.

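
Example: genotypes, haplotypes and entropy
------------------------------------------

The relationship between haplotypes, genotypes and an entropy objective
can be illustrated with a small Python sketch. It is not part of the ENT
distribution and does not reproduce ENT's algorithm. It assumes the
0/1/2 genotype encoding described under File Formats, haplotypes written
as 0/1 strings, and Shannon entropy of the empirical haplotype
frequencies as the quantity being minimized (the text above only says
"entropy minimization", so this choice is an assumption). All function
names are hypothetical. The search is plain brute force, usable only on
toy inputs, and missing genotypes (?) and pedigrees are not handled.

```python
from collections import Counter
from itertools import product
from math import log2


def conflate(hap_a, hap_b):
    """Merge two 0/1 haplotype strings into one 0/1/2 genotype string."""
    if len(hap_a) != len(hap_b):
        raise ValueError("haplotypes must have the same length")
    # Equal alleles stay as they are; differing alleles become heterozygous (2).
    return "".join(a if a == b else "2" for a, b in zip(hap_a, hap_b))


def compatible_pairs(genotype):
    """List every unordered haplotype pair that conflates to `genotype`."""
    if set(genotype) - set("012"):
        raise ValueError("only 0/1/2 genotypes are handled in this sketch")
    het_sites = [i for i, g in enumerate(genotype) if g == "2"]
    if not het_sites:
        return [(genotype, genotype)]
    pairs = []
    # Pin the first heterozygous site to 0 on hap_a so that each
    # unordered pair is listed exactly once.
    for bits in product("01", repeat=len(het_sites) - 1):
        hap_a, hap_b = list(genotype), list(genotype)
        for site, bit in zip(het_sites, ("0",) + bits):
            hap_a[site] = bit
            hap_b[site] = "1" if bit == "0" else "0"
        pairs.append(("".join(hap_a), "".join(hap_b)))
    return pairs


def entropy(haplotypes):
    """Shannon entropy (in bits) of the empirical haplotype frequencies."""
    counts = Counter(haplotypes)
    total = sum(counts.values())
    return sum(c / total * log2(total / c) for c in counts.values())


def min_entropy_phasing(genotypes):
    """Brute-force the joint phasing with the lowest haplotype entropy."""
    best = None
    for choice in product(*(compatible_pairs(g) for g in genotypes)):
        haps = [h for pair in choice for h in pair]
        score = entropy(haps)
        if best is None or score < best[0]:
            best = (score, list(choice))
    return best


if __name__ == "__main__":
    print(conflate("0101", "0011"))
    print(min_entropy_phasing(["00", "11", "22"]))
```

For the genotypes 00, 11 and 22, the doubly heterozygous individual can
be phased as 00/11 or as 01/10. The first choice reuses haplotypes that
the other two individuals already carry, so it gives the lower entropy
and is the one the sketch selects.
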
Using ENT
---------

ENT is a command-line tool which by default reads from a user-specified
input file and prints the results to the standard output.

Usage: ent [options] -input

OPTIONS:
    -free N      : free window size (default: automatically selected)
    -locked N    : locked window size (default: automatically selected)
    -seed N      : random generator seed (default: 1)
    -no_batching : turn off batching (default: use batching)
    -count_all   : compute entropy over all haplotypes
                   (default: count founder haplotypes only)

A sample input file (HapMap release 23a genotypes for chromosome 22 of
CEU samples) is included with the code distribution.

Sample command:

    ent -input chr22_CEU_r23a.ent_gen > chr22_CEU_r23a.ent_gen.ent_hap

File Formats
------------

ENT accepts sequences of the form 0/1/2/? where 0 and 1 denote genotypes
that are homozygous for the major and minor allele, respectively, 2
denotes a heterozygous genotype, and ? denotes an unknown genotype.

The ENT input file format is as follows:

  * First line:
  * Additional lines:

All individual ids must be non-zero; a parent id of 0 represents no
known parent.

The output file format is as follows:

  * First line:
  * Additional lines:

----------------
Revision history
----------------

Version 1.0.2 (10/5/08)
  - corrected a bug in the handling of large pedigrees

Version 1.0.1 (8/23/08)
  - added upper bounds of 5 for free and locked in automatic window size
    selection to ensure more predictable runtime on large datasets

Contact Information
-------------------

For questions and bug reports please send e-mail to
gusev@cs.columbia.edu, bogdan@icsi.berkeley.edu, or ion@engr.uconn.edu.