CAREER: Combinatorial Algorithms for High-Throughput Collection and Analysis of Genomic Diversity Data
Funding agency: National Science Foundation, Division of Information & Intelligent Systems
Award #: IIS-0546457
Amount: $550,000
PI: Ion I. Mandoiu
Period: 01/2006–12/2011
Abstract:
Genomic diversity analyses of large-scale case/control and population studies promise to provide answers to fundamental problems ranging from determining the genetic basis of disease susceptibility to uncovering the pattern of historical population migrations. However, the feasibility of such studies critically depends on addressing a number of technological and computational challenges. On the technological front, despite the huge advances made in recent years, there is still a need for a flexible high-throughput platform capable of typing hundreds of thousands of SNPs at a very low cost per experiment. Computationally, there is a need to integrate recently developed statistical models of the structure of genomic variability in human populations with efficient combinatorial methods that deliver predictable solution quality.
The proposed research and education activities will address the above challenges at several levels, including modeling and formalizing the underlying biological and technological problems, finding efficient algorithms for the identified problems, engineering these algorithms into high-quality open-source bioinformatics tools, and collaborating closely with industry researchers and molecular geneticists in validating the proposed methods and applying them to population-scale genomic data. Major project outcomes will include (1) development of an innovative high-throughput SNP genotyping assay that realizes the as yet untapped potential of k-mer arrays by combining them with solution-phase single-base extension, (2) optimization of two proven technologies that are in common use in SNP genotyping, DNA tag arrays and multiplex-PCR, (3) novel likelihood maximization algorithms with predictable solution quality for challenging computational problems arising in association studies with two-stage sampling designs, including haplotype tagging SNP selection and haplotype reconstruction from genotype data, (4) robust open-source software implementations and principled methodologies for the empirical evaluation of the proposed algorithms, and (5) innovative curriculum and educational materials, including the creation of a new textbook on computational genomics. The successful completion of the project will lead to decreased data collection costs in large-scale association studies, thus enabling more studies to be completed within the same budget. The proposed assay architecture based on k-mer arrays is expected to enable additional applications of genomic technologies, such as genomics-based point-of-care medical diagnosis and large-scale species identification. Broader impacts of the proposed educational and outreach activities include increasing the participation of under-represented groups in research and training future researchers with unique interdisciplinary skills.
Software Packages
- ViSpA: Viral Spectrum Assembler
- NGSTools: Java tools for analysis of Next Generation Sequencing data
- IsoEM: Inferring Alternative Splicing Isoform Frequencies from High-Throughput RNA-Seq Data
- DGE-EM: Accurate Estimation of Gene Expression Levels from DGE Sequencing Data
- GEDI-ADMX: Genotype Error Detection and Imputation for Admixed Populations
- GeneSeq: LD-based SNP genotype calling from shotgun sequencing reads
- GEDI: Genotype Error Detection and Imputation
- PrimerHunter: Primer Design for PCR-Based Virus Subtype Identification
- ENT: Genotype Phasing by Entropy Minimization
- DNA-BAR: Distinguisher Selection for DNA Barcoding
- G-POT: Multiplex-PCR Primer Set Selection