GEDI - Genotype Error Detection and Imputation
Version 1.0.3 (8/30/09)
==============================================

The GEDI package provides methods for
  * error detection in whole-genome SNP genotype data
  * recovery of missing SNP genotypes
  * imputation of genotypes at untyped SNPs based on reference haplotypes
    such as those provided by the Hapmap project
  * genotype phasing through a copy of our highly scalable ENT algorithm.

GEDI handles genotype data from unrelated individuals as well as
individuals related by simple pedigrees such as trios. GEDI relies on
efficient likelihood computations based on a Hidden Markov Model of
haplotype diversity in the population under study. For further details on
the statistical model and algorithms implemented by GEDI see
http://arxiv.org/abs/0911.1765

-------------
Building GEDI
-------------

The code has been compiled and tested successfully on Gentoo Linux with
the GNU gcc compiler version 3.4.5. To build the GEDI executables, run the
following commands in the directory where gedi_1.0.3.tar.gz is located:

    tar -xzvf gedi-1.0.3.tar.gz
    cd GEDI-1.0.3
    make all

This will create three executables: GEDI, the executable normally called
by the user (see below); a second executable, which provides additional
functionality for advanced users; and ent, a fast phasing program based on
entropy minimization (see http://dna.engr.uconn.edu/~software/ent/ for
details).

------------
Running GEDI
------------

GEDI is the command-line tool normally called by the user.
The other two executables can be used directly, but are normally called by
GEDI after appropriate file format conversions (they must be present in
the directory from which GEDI is run).

Usage: GEDI

COMMANDS:
  -ED [-LLRT ]       Perform error detection on input genotype data,
                     correcting SNP genotypes with log-likelihood ratio
                     above threshold (default LLRT is 3)
  -MDR               Perform imputation of missing genotypes at typed SNP
                     loci
  -IMP [-flanking ]  Perform imputation of SNPs in -imp_snp file using
                     posterior probability based on typed SNPs before and
                     after imputed locus
  -PHASE             Perform phasing of genotype data, using ENT software

REQUIRED_ARGUMENTS:
  -chr               : chromosome number
  -pop_snp_info      : info file for typed SNPs
  -pop_ped           : population pedigree file
  -pop_chr_gen       : typed SNP genotypes in row format

OPTIONAL_ARGUMENTS:
  -excl_sample       : individuals to exclude
  -excl_snp          : SNPs to exclude
  -imp_snp           : SNPs to be imputed (required by -IMP)
  -ref_snp_info      : info file for reference SNPs (required for -IMP and
                       when using -useRefHapMDR or -useRefHapED)
  -ref_chr_hap       : reference haplotypes (required for -IMP and when
                       using -useRefHapMDR or -useRefHapED)
  -ref_chr_hap2      : reference haplotypes for admixed populations
  -hap_training_ED   : haplotypes used for -ED (default is to phase
                       genotypes internally using ENT)
  -useRefHapED       : use reference haplotypes for -ED
  -hap_training_MDR  : haplotypes used for -MDR (default is to phase
                       genotypes internally using ENT)
  -useRefHapMDR      : use reference haplotypes for -MDR

GEDI creates several output files depending on the specified commands:

  * Summary.txt -- summary statistics for each performed step (ED, MDR,
    IMP)
  * .res (preceded by the prefix of the population genotype file) -- file
    including, in row format, all SNP genotypes at the end of GEDI's
    analysis (corrected, imputed, or simply unmodified original genotypes)
  * .res.mi_err -- file containing Mendelian inconsistencies (if any)
    discovered and corrected by GEDI
  * .res.mc_err -- file containing Mendelian consistent errors (if
    any) that pass the detection threshold in -ED
  * .res.mdr -- file containing missing genotypes (if any) imputed by -MDR
  * .res.imp -- file containing genotypes imputed by -IMP (including
    posterior probabilities)
  * .res.phase -- file containing genotypes phased by -PHASE

For full details on GEDI file formats and command line parameters see
http://dna.engr.uconn.edu/~software/GEDI/GEDI_1.0_ReleaseDoc.pdf

For ent file formats and command line parameters see
http://dna.engr.uconn.edu/~software/ent/README.TXT

-----------------------
File Formats
-----------------------

pop_snp_info: one SNP per line containing
'rsID position MinorAllele MajorAllele strand', with header line, e.g.:

    #rsID pos A0 A1 strand
    rs4040617 819185 G A +
    rs2980300 825852 T C +
    rs4075116 1043552 C T +
    rs9442385 1137258 T G +
    rs10907175 1170650 C A +

pop_ped: one sample per line containing 'SampleID FatherId MotherId', with
header line, e.g.:

    #Geno_Sample_ID Fa_id Ma_id
    ADMX1 0 0
    ADMX2 0 0
    ADMX3 0 0
    ADMX4 0 0
    ADMX5 0 0

pop_chr_gen: one genotype per line containing
'rsID SampleID Allele0 Allele1 QC_score', with header line, e.g.:

    #snpID Ind A0 A1 qc
    rs4040617 ADMX1 G A 0
    rs2980300 ADMX1 T C 0
    rs4075116 ADMX1 C T 0
    rs9442385 ADMX1 T T 0
    rs10907175 ADMX1 C C 0

ref_snp_info: one SNP per line containing
'rsID position MinorAllele MajorAllele', with header line, e.g.:

    snp position a0 a1
    rs4040617 819185 G A
    rs2980300 825852 T C
    rs4075116 1043552 C T
    rs9442385 1137258 T G
    rs10907175 1170650 C A

ref_chr_hap: one haplotype per line, with entries separated by spaces or
tabs (no header line), e.g.:

    0 0 1 1 ...
    0 0 1 0 ...
    0 0 1 0 ...
    0 0 1 1 ...
    1 1 1 1 ...

imp_snp: list of SNPs to be imputed if the -IMP flag is set (no header
line), e.g.:

    rs6687835
    rs7517989
    rs13375764
    rs6666453
    rs7418357
    rs12121044

-----------------------
Sample data and scripts
-----------------------

The distribution includes sample input files and two sample scripts:

  * run_GEDI_sample_unrelated.sh, which performs error detection, missing
    data recovery, and imputation on a set of unrelated samples (using ENT
    to phase genotypes in the ED and MDR steps), and
  * run_GEDI_sample_trios.sh, which performs imputation for a trio dataset.

The two datasets consist of chromosome 22 genotypes for the parents
(unrelated dataset) and for all individuals (trio dataset) in the 30 CEU
trios of Hapmap, re-genotyped at the Broad Institute Center for Genotyping
and Analysis with the Affymetrix 500K Array 2.0 as part of the GAIN
studies. The list of imputed SNPs was obtained from the SNPs present on
the Affymetrix 6.0 chip by excluding those also on the 500K array.
Genotypes typed with both platforms for all 270 Hapmap samples are
available from dbGaP at ftp://ftp.ncbi.nih.gov/dbgap/GAIN/genotypeQC/

----------------
Revision history
----------------

Version 1.0.3 (8/30/09) - added phasing, combined core
Version 1.0.2 (7/04/08) - small changes in output file formats
Version 1.0.1 (6/11/08) - fixed memory leak, included sample files
Version 1.0.0 (5/22/08) - first public release

-------------------
Contact Information
-------------------

For questions and bug reports please send e-mail to
jlk02019@engr.uconn.edu, bogdan@engr.uconn.edu, or ion@engr.uconn.edu.
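
------------------------------
Example: checking a input file
------------------------------

The pop_chr_gen layout described under File Formats (five
whitespace-separated fields per genotype, with a header line) can be
sanity-checked before a run. The sketch below is not part of the GEDI
distribution: check_pop_chr_gen is a hypothetical helper name, and it
assumes that the header line starts with '#', as in the example above. It
reports malformed lines on standard error and prints the number of
genotype records found for each sample.

```shell
#!/bin/sh
# check_pop_chr_gen FILE
# Hypothetical helper (not shipped with GEDI). Checks that every data line
# of a pop_chr_gen file has the five documented fields
# (rsID SampleID Allele0 Allele1 QC_score) and prints "SampleID<TAB>count".
# Assumption: header/comment lines start with '#'.
check_pop_chr_gen() {
    awk '
        /^#/ || NF == 0 { next }        # skip header lines and blank lines
        NF != 5 {                       # wrong number of fields: report it
            printf "line %d: expected 5 fields, found %d\n", NR, NF > "/dev/stderr"
            bad = 1
            next
        }
        { count[$2]++ }                 # field 2 is the sample ID
        END {
            for (s in count) printf "%s\t%d\n", s, count[s]
            exit (bad ? 1 : 0)          # non-zero status if any line was malformed
        }
    ' "$1"
}
```

After sourcing the function, call it with the path of the file that will
be passed to GEDI via -pop_chr_gen. The order of the per-sample lines is
not defined by awk, so pipe the output through sort if a stable order is
needed. A non-zero exit status means at least one line did not have five
fields.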