GEDI - Genotype Error Detection and Imputation
Version 1.0.3 (8/30/09)
==============================================

The GEDI package provides methods for

* error detection in whole-genome SNP genotype data

* recovery of missing SNP genotypes

* imputation of genotypes at untyped SNPs based on reference haplotypes such 
  as those provided by the Hapmap project

* genotype phasing through a copy of our highly scalable ENT algorithm.

GEDI handles genotype data from unrelated individuals as well as 
individuals related by simple pedigrees such as trios. GEDI computations 
rely on efficient likelihood computations based on a Hidden Markov Model 
of haplotype diversity in the population under study.  For further details 
on the statistical model and algorithms implemented by GEDI see 
http://arxiv.org/abs/0911.1765

-------------
Building GEDI 
-------------

The code has been compiled and tested successfully on Gentoo Linux with 
the GNU gcc compiler version 3.4.5.  To build the GEDI executables run 
the following commands in the directory where gedi-1.0.3.tar.gz is 
located:

tar -xzvf gedi-1.0.3.tar.gz
cd GEDI-1.0.3
make all

This will create the following executables: GEDI, which is normally called 
by the user (see below) and also provides additional functionality for 
advanced users, and ent, a fast phasing program based on entropy 
minimization (see http://dna.engr.uconn.edu/~software/ent/ for details). 

------------
Running GEDI
------------

GEDI is the command-line tool normally called by the user.
ent can be used directly, but is normally called by GEDI after 
appropriate file format conversions (it must be present in the 
directory from which GEDI is run).

Usage:
        GEDI <COMMANDS> <REQUIRED_ARGUMENTS> <OPTIONAL_ARGUMENTS>

COMMANDS:
        -ED [-LLRT <N>]
                Perform error detection on input genotype data, correcting
                SNP genotypes with log-likelihood ratio above threshold <N>
                (default LLRT is 3)

        -MDR
                Perform imputation of missing genotypes at typed SNP loci

        -IMP [-flanking <K>]
                Perform imputation of SNPs in -imp_snp file using posterior
                probability based on <K> typed SNPs before and after imputed locus

        -PHASE
                Perform phasing of genotype data, using the ENT software

REQUIRED_ARGUMENTS:
        -chr <chr_number>               : chromosome number
        -pop_snp_info <snp_info_file>   : info file for typed SNPs
        -pop_ped <ped_file>             : population pedigree file
        -pop_chr_gen <row_genotype_file>: typed SNP genotypes in row format

OPTIONAL_ARGUMENTS:
        -excl_sample <sample_list>      : individuals to exclude
        -excl_snp <snp_list>            : SNPs to exclude
        -imp_snp <snp_list>             : SNPs to be imputed (required by -IMP)
        -ref_snp_info <snp_info_file>   : info file for reference SNPs
                (required for -IMP and when using -useRefHapMDR or -useRefHapED)
        -ref_chr_hap <hap_file>         : reference haplotypes
                (required for -IMP and when using -useRefHapMDR or -useRefHapED)
        -ref_chr_hap2 <hap_file>        : reference haplotypes for admixed populations
        -hap_training_ED <hap_file>     : haplotypes used for -ED
                (default is to phase genotypes internally using ENT)
        -useRefHapED                    : use reference haplotypes for -ED
        -hap_training_MDR <hap_file>    : haplotypes used for -MDR
                (default is to phase genotypes internally using ENT)
        -useRefHapMDR                   : use reference haplotypes for -MDR
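
To illustrate how the commands and arguments above combine, the following 
Python sketch assembles a GEDI command line and checks the dependencies 
stated in the usage text (for example, that -IMP needs -imp_snp, 
-ref_snp_info, and -ref_chr_hap).  The helper build_gedi_command and all 
file names are hypothetical and not part of the GEDI distribution; only the 
flags are taken from the usage text.

```python
def build_gedi_command(commands, chr_number, snp_info, ped, genotypes,
                       llrt=None, flanking=None, optional=None,
                       executable="./GEDI"):
    """Assemble a GEDI argument list (hypothetical helper, not part of GEDI).

    commands -- list drawn from "-ED", "-MDR", "-IMP", "-PHASE"
    optional -- dict of optional flags; use None as the value for switches
                such as -useRefHapED that take no argument
    """
    optional = dict(optional or {})
    allowed = ("-ED", "-MDR", "-IMP", "-PHASE")
    if not commands or any(c not in allowed for c in commands):
        raise ValueError("commands must be a non-empty subset of %s" % (allowed,))

    # Flags that the usage text marks as required in specific situations.
    required = []
    if "-IMP" in commands:
        required.append("-imp_snp")
    if ("-IMP" in commands or "-useRefHapED" in optional
            or "-useRefHapMDR" in optional):
        required += ["-ref_snp_info", "-ref_chr_hap"]
    missing = [flag for flag in required if optional.get(flag) is None]
    if missing:
        raise ValueError("missing arguments: " + ", ".join(missing))

    # Order follows the usage line: commands, required, then optional arguments.
    args = [executable]
    for command in commands:
        args.append(command)
        if command == "-ED" and llrt is not None:
            args += ["-LLRT", str(llrt)]
        if command == "-IMP" and flanking is not None:
            args += ["-flanking", str(flanking)]
    args += ["-chr", str(chr_number),
             "-pop_snp_info", snp_info,
             "-pop_ped", ped,
             "-pop_chr_gen", genotypes]
    for flag, value in optional.items():
        args.append(flag)
        if value is not None:
            args.append(str(value))
    return args


# File names below are placeholders, not files shipped with GEDI.
command = build_gedi_command(["-ED", "-MDR"], 22, "snp_info.txt",
                             "ped.txt", "genotypes.txt", llrt=3)
print(" ".join(command))
```

The returned list can be passed to Python's subprocess.run on a machine 
where GEDI and ent have been built.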


GEDI creates several output files depending on the specified commands:

* Summary.txt -- summary statistics for each performed step (ED, MDR, IMP)

* <filename>.res, where <filename> is the prefix of the population 
  genotype file -- file including, in row format, all SNP genotypes 
  at the end of GEDI's analysis (corrected, imputed, or simply 
  unmodified original genotypes)  

* <filename>.res.mi_err -- file containing Mendelian inconsistencies 
  (if any) discovered and corrected by GEDI

* <filename>.res.mc_err -- file containing Mendelian consistent errors 
  (if any) that pass the detection threshold in -ED 

* <filename>.res.mdr -- file containing missing genotypes (if any) 
  imputed by -MDR 

* <filename>.res.imp -- file containing genotypes imputed by -IMP 
  (including posterior probabilities)

* <filename>.res.phase -- file containing genotypes phased by -PHASE
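
As a convenience, the file names listed above can be generated from the 
prefix and checked after a run.  The Python sketch below is only an 
illustration: the helper documented_outputs is hypothetical, the prefix is 
passed in explicitly rather than derived from the genotype file name, and 
the files are assumed to be written to the current directory.

```python
import os


def documented_outputs(prefix):
    """Map each documented output file name to a short description."""
    res = prefix + ".res"
    return {
        "Summary.txt": "summary statistics for each performed step",
        res: "all SNP genotypes after analysis, in row format",
        res + ".mi_err": "Mendelian inconsistencies corrected by GEDI",
        res + ".mc_err": "Mendelian consistent errors flagged by -ED",
        res + ".mdr": "missing genotypes imputed by -MDR",
        res + ".imp": "genotypes imputed by -IMP, with posterior probabilities",
        res + ".phase": "genotypes phased by -PHASE",
    }


# "mydata" is a placeholder prefix; files are looked up in the current directory.
for name, description in documented_outputs("mydata").items():
    status = "found" if os.path.isfile(name) else "absent"
    print("%-20s %-7s %s" % (name, status, description))
```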

For full details on GEDI file formats and command line parameters 
see http://dna.engr.uconn.edu/~software/GEDI/GEDI_1.0_ReleaseDoc.pdf
For ent file formats and command line parameters see 
http://dna.engr.uconn.edu/~software/ent/README.TXT
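
To make the term "Mendelian inconsistency" used for the .res.mi_err file 
concrete, the Python sketch below applies the textbook definition to a 
single SNP in a trio: a child genotype is consistent if one of its alleles 
can come from the father and the other from the mother.  This is not GEDI's 
detection or correction procedure; the function name is hypothetical and 
all three genotypes are assumed to be called (no missing alleles).

```python
def is_mendelian_consistent(child, father, mother):
    """Check one SNP in a trio; each genotype is a pair of alleles.

    Example genotype: ("G", "A").  Assumes no missing alleles.
    """
    c0, c1 = child
    return ((c0 in father and c1 in mother) or
            (c1 in father and c0 in mother))


# A G/G father and an A/A mother can only have a G/A child.
print(is_mendelian_consistent(("G", "A"), ("G", "G"), ("A", "A")))
print(is_mendelian_consistent(("A", "A"), ("G", "G"), ("A", "A")))
```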


-----------------------
File Formats
-----------------------

pop_snp_info: one snp per line containing 'rsID position MinorAllele MajorAllele strand' with header line
e.g.:
#rsID pos A0 A1 strand
rs4040617 819185 G A +
rs2980300 825852 T C +
rs4075116 1043552 C T +
rs9442385 1137258 T G +
rs10907175 1170650 C A +
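
A minimal Python sketch for reading this format is given below.  It assumes 
that the header line starts with '#', as in the example, and that fields 
are separated by whitespace; SnpInfo and parse_snp_info are hypothetical 
names, not part of GEDI.

```python
from typing import NamedTuple


class SnpInfo(NamedTuple):
    rsid: str
    position: int
    a0: str
    a1: str
    strand: str


def parse_snp_info(lines):
    """Parse pop_snp_info lines into a list of SnpInfo records."""
    snps = []
    for number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks and the header
            continue
        fields = line.split()
        if len(fields) != 5:
            raise ValueError("line %d: expected 5 fields, got %d"
                             % (number, len(fields)))
        rsid, position, a0, a1, strand = fields
        snps.append(SnpInfo(rsid, int(position), a0, a1, strand))
    return snps


example = """#rsID pos A0 A1 strand
rs4040617 819185 G A +
rs2980300 825852 T C +
""".splitlines()
snps = parse_snp_info(example)
print(snps[0].rsid, snps[0].position)
```

The function accepts any iterable of lines, so an open file object can be 
passed in place of the in-memory example.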

pop_ped: one sample per line containing 'SampleID FatherId MotherId' with header line
e.g.:
#Geno_Sample_ID Fa_id Ma_id
ADMX1 0 0
ADMX2 0 0
ADMX3 0 0
ADMX4 0 0
ADMX5 0 0
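
The Python sketch below reads this format and lists the trios it contains. 
It assumes that a parent ID of 0 means that no parent is recorded, as the 
example of unrelated samples suggests; the function names and the sample 
KID1 are hypothetical.

```python
def parse_ped(lines):
    """Return {sample_id: (father_id, mother_id)} from pop_ped lines."""
    pedigree = {}
    for number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks and the header
            continue
        fields = line.split()
        if len(fields) != 3:
            raise ValueError("line %d: expected 3 fields, got %d"
                             % (number, len(fields)))
        sample, father, mother = fields
        pedigree[sample] = (father, mother)
    return pedigree


def find_trios(pedigree, unknown="0"):
    """List (child, father, mother) where both parents are also samples."""
    return [(child, father, mother)
            for child, (father, mother) in pedigree.items()
            if father != unknown and mother != unknown
            and father in pedigree and mother in pedigree]


example = """#Geno_Sample_ID Fa_id Ma_id
ADMX1 0 0
ADMX2 0 0
KID1 ADMX1 ADMX2
""".splitlines()
pedigree = parse_ped(example)
print(find_trios(pedigree))
```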

pop_chr_gen: one genotype per line containing 'rsID SampleID Allele0 Allele1 QC_score' with header line
e.g.:
#snpID Ind A0 A1 qc
rs4040617 ADMX1 G A 0
rs2980300 ADMX1 T C 0
rs4075116 ADMX1 C T 0
rs9442385 ADMX1 T T 0
rs10907175 ADMX1 C C 0
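
A minimal Python sketch for loading this row format and grouping genotypes 
by sample follows.  It assumes a '#' header line and whitespace-separated 
fields as in the example; the QC score is kept as text because its numeric 
type is not specified here.  The name parse_genotypes is hypothetical.

```python
from collections import defaultdict


def parse_genotypes(lines):
    """Return {sample_id: {rsid: (allele0, allele1, qc)}}."""
    genotypes = defaultdict(dict)
    for number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks and the header
            continue
        fields = line.split()
        if len(fields) != 5:
            raise ValueError("line %d: expected 5 fields, got %d"
                             % (number, len(fields)))
        rsid, sample, a0, a1, qc = fields
        genotypes[sample][rsid] = (a0, a1, qc)
    return dict(genotypes)


example = """#snpID Ind A0 A1 qc
rs4040617 ADMX1 G A 0
rs2980300 ADMX1 T C 0
rs9442385 ADMX1 T T 0
""".splitlines()
genotypes = parse_genotypes(example)

# Count heterozygous genotypes for one sample.
het = sum(1 for a0, a1, _ in genotypes["ADMX1"].values() if a0 != a1)
print("ADMX1 heterozygous genotypes:", het)
```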

ref_snp_info: one SNP per line containing 'rsID position MinorAllele MajorAllele' with header line
e.g.:
snp     position        a0      a1
rs4040617       819185  G       A
rs2980300       825852  T       C
rs4075116       1043552 C       T
rs9442385       1137258 T       G
rs10907175      1170650 C       A

ref_chr_hap: one haplotype per line, with alleles separated by spaces or tabs (no header line)
e.g.:
0       0       1       1     ...
0       0       1       0     ...
0       0       1       0     ...
0       0       1       1     ...
1       1       1       1     ...
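
The reference haplotype file passed with -ref_chr_hap can be sanity-checked 
with a short Python sketch such as the one below.  It assumes that every 
entry is 0 or 1, as in the example, and that each haplotype has one column 
per SNP in the reference SNP info file; both assumptions and the name 
parse_haplotypes are illustrative rather than taken from the GEDI 
documentation.

```python
def parse_haplotypes(lines, n_snps=None):
    """Parse haplotype rows into lists of 0/1 integers.

    n_snps -- expected number of columns (for example the number of SNPs
              in the reference SNP info file); if None, the first row
              sets the expected width.
    """
    haplotypes = []
    for number, line in enumerate(lines, start=1):
        fields = line.split()  # splits on spaces and tabs alike
        if not fields:
            continue
        if any(field not in ("0", "1") for field in fields):
            raise ValueError("line %d: entries must be 0 or 1" % number)
        if n_snps is None:
            n_snps = len(fields)
        if len(fields) != n_snps:
            raise ValueError("line %d: expected %d columns, got %d"
                             % (number, n_snps, len(fields)))
        haplotypes.append([int(field) for field in fields])
    return haplotypes


example = ["0 0 1 1", "0\t0\t1\t0", "1 1 1 1"]
haplotypes = parse_haplotypes(example, n_snps=4)
print(len(haplotypes), "haplotypes with", len(haplotypes[0]), "SNPs each")
```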

imp_snp: list of SNPs to be imputed if -IMP flag is set, one rsID per line (no header line)
e.g.:
rs6687835
rs7517989
rs13375764
rs6666453
rs7418357
rs12121044
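
Because -IMP imputes untyped SNPs from reference haplotypes, it can be 
useful to check the imp_snp list against the typed and reference SNP lists 
before running GEDI.  The Python sketch below reports SNPs that are missing 
from the reference or that are already typed.  Whether GEDI itself rejects 
such entries is not stated here, so treat this as an optional check; the 
function name and the sample IDs in the usage lines are hypothetical.

```python
def check_imp_snps(imp_ids, typed_ids, reference_ids):
    """Report imp_snp entries that look problematic.

    Returns a dict with two lists, both in the order of imp_ids:
      not_in_reference -- SNPs absent from the reference SNP info file
      already_typed    -- SNPs that also appear in pop_snp_info
    """
    typed = set(typed_ids)
    reference = set(reference_ids)
    return {
        "not_in_reference": [s for s in imp_ids if s not in reference],
        "already_typed": [s for s in imp_ids if s in typed],
    }


report = check_imp_snps(
    imp_ids=["rs6687835", "rs7517989", "rs4040617"],
    typed_ids=["rs4040617", "rs2980300"],
    reference_ids=["rs6687835", "rs4040617", "rs2980300"],
)
print(report)
```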


-----------------------
Sample data and scripts
-----------------------

The distribution includes sample input files and two sample scripts: 

* run_GEDI_sample_unrelated.sh, which performs error detection, missing data 
  recovery, and imputation on a set of unrelated samples (using ENT to phase 
  genotypes in the ED and MDR steps), and 

* run_GEDI_sample_trios.sh, which performs imputation for a trio dataset

The two datasets consist of chromosome 22 genotypes for the parents only 
(unrelated dataset) and for all individuals (trio dataset) in the 30 CEU 
trios of Hapmap, re-genotyped at 
the Broad Institute Center for Genotyping and Analysis with the Affymetrix 
500K Array 2.0 as part of the GAIN studies.  The list of imputed SNPs was 
obtained from the SNPs present on the Affymetrix 6.0 chip by excluding SNPs 
on the 500K.  Genotypes typed with both platforms for all 270 Hapmap samples 
are available from dbGaP at ftp://ftp.ncbi.nih.gov/dbgap/GAIN/genotypeQC/


----------------
Revision history
----------------

Version 1.0.3 (8/30/09)  - added phasing, combined core
Version 1.0.2 (7/04/08)  - small changes in output file formats
Version 1.0.1 (6/11/08)  - fixed memory leak, included sample files
Version 1.0.0 (5/22/08)  - first public release

-------------------
Contact Information
-------------------

For questions and bug reports please send e-mail to jlk02019@engr.uconn.edu,
bogdan@engr.uconn.edu, or ion@engr.uconn.edu.