NGSTools - Java tools for analysis of Next Generation Sequencing (NGS) data
Version 2.0.0 (03/10/2013)
===========================================================================

The NGSTools package provides an object model that supports different kinds
of analysis of Next Generation Sequencing (NGS) data, and some utility
programs to process reads aligned to different reference genomes. The most
important tools in this package are SNVQ and HardMerge. SNVQ is an accurate
Single Nucleotide Variant (SNV) detection and genotyping algorithm that
works from base calls and quality scores. HardMerge merges alignments of a
set of reads to two references (genome and transcriptome) following a set
of rules designed so that SNVs can be called confidently from the resulting
set of alignments. Every tool in this package processes alignments in SAM
format, which makes it possible to integrate NGSTools with commonly used
mapping programs such as Bowtie
(http://bowtie-bio.sourceforge.net/index.shtml) and with other analysis
packages such as SAMTools (http://samtools.sourceforge.net/).

----------------------
Installing NGSTools
----------------------

NGSTools has been compiled and run successfully on the standard JDK version
1.7.0. To install NGSTools:

1. Download the compressed NGSTools-2.0.0 from
   http://dna.engr.uconn.edu/software/NGSTools/NGSTools-2.0.0.zip
2. Uncompress NGSTools-2.0.0.zip
3. If you want to rebuild NGSTools, run the command: make all
4. Run the script "setup" to set up the paths of the NGSTools executables

--------------------------
Calling variants with SNVQ
--------------------------

SNVQ is a Single Nucleotide Variant (SNV) detection and genotyping
algorithm. It takes a reference genome and an alignments file in SAM format
and writes the list of SNVs to standard output.

Usage: snvq

OPTIONS:
    -h FLOAT                Heterozygosity rate.
                            Default: 0.001
    -querySeq STRING        Call variants just for this sequence name
    -start INT              Call variants just from this locus in the given
                            query sequence
    -end INT                Call variants just until this locus in the given
                            query sequence
    -minAltCoverage INT     Minimum coverage of the alternative allele to
                            call a SNV. Default: 0
    -maxAltCoverage INT     Maximum coverage of the alternative allele to
                            call a SNV. Default: Integer.MAX_VALUE
    -minProbability FLOAT   Minimum genotype posterior probability to call
                            a SNV. Default: 0
    -reference FILE         Reference assembly file in fasta format. An
                            alignments file can be used as input if this
                            file is provided. The sequence names in the
                            alignments file are assumed to match the
                            sequence names in this reference assembly
    -keepLowerCaseRef       Keep variant calls in loci where the reference
                            allele is lower case. By default, loci where
                            the reference is lower case are assumed to lie
                            in repetitive regions, where false positives
                            may be produced
    -strand                 Output strand specific base call statistics

INPUT_FILE: read alignments SAM file

The output is a tab delimited text file with the following fields:

  - Sequence name
  - Position
  - Reference allele
  - Alternative allele
  - Genotype as a two-character string
  - Coverage of the reference allele
  - Coverage of the alternative allele
  - Phred quality score of the genotype
  - Phred quality score of the probability of having a variant in this locus
  - Number of A calls
  - Number of C calls
  - Number of G calls
  - Number of T calls

If the -strand option is selected, the following fields are also included:

  - Coverage of the reference allele on the positive strand
  - Coverage of the alternative allele on the positive strand
  - Number of A calls on the positive strand
  - Number of C calls on the positive strand
  - Number of G calls on the positive strand
  - Number of T calls on the positive strand
  - Coverage of the reference allele on the negative strand
  - Coverage of the alternative allele on the negative strand
  - Number of A calls on the negative strand
  - Number of C calls on the negative strand
  - Number of G calls on the negative strand
  - Number of T calls on the negative strand

----------------------------------------------------------
Merging Genome and Transcriptome Alignments with HardMerge
----------------------------------------------------------

NGSTools also includes a tool to merge read alignments against a reference
genome with alignments against a transcript library. Both alignment files
must be in SAM V-0.1.2 format. The alignments against the transcript
library must be in genome coordinates. Conversion into genome coordinates
can be done using the tool convert-iso-to-genome-coords, which is part of
the IsoEM software (http://dna.engr.uconn.edu/?page_id=105). The output is
another SAM file with alignments merged using the rules summarized in the
table below. HardMerge also outputs the set of reads that were filtered out
by the merging rules in a fastq file.

    Genome        Transcripts   Agree?   Keep?
    Mapping       Mapping
    ------------------------------------------
    Unique        Unique        Yes      Yes
    Unique        Unique        No       No
    Unique        Multiple      No       No
    Unique        Not Mapped    No       Yes
    Multiple      Unique        No       No
    Multiple      Multiple      No       No
    Multiple      Not Mapped    No       No
    Not Mapped    Unique        No       Yes
    Not Mapped    Multiple      No       No
    Not Mapped    Not Mapped    No       No

The usage is as follows:

Usage: hardmerge []

    TRANSCRIPTOME_ALIGNMENTS: Transcriptome alignments SAM file in genome
                              coordinates
    GENOME_ALIGNMENTS:        Genome alignments SAM file
    OUTPUT_SAM_FILE:          Output SAM file
    FILTERED_READS_FASTQ:     Fastq file with reads filtered out by the
                              HardMerge merging rules
    LOCAL_ALIGNMENT_LENGTH_THRESHOLD [optional]:
                              Minimum number of consecutive uniquely aligned
                              bases; for local alignments. Default: 15

------------------------
Clipping read alignments
------------------------

NGSTools includes a utility to clip a given number of bases from the 5' end
and from the 3' end of each alignment in a SAM file.
The input for this tool is a file of read alignments in SAM format, the
number of bases to clip from the 5' end and the number of bases to clip
from the 3' end. Clipped alignments are reported in the standard output.
This tool assumes that alignments are grouped by sequence name and that
alignments to the same sequence are sorted by start position. The output
SAM file will also be sorted in this way. The usage is the following:

Usage: ClipReadAlignments

    ALIGNMENTS_FILE:      Input alignments SAM file
    CLIPPING_5PRIME_END:  Number of bases to be clipped at the 5' end
    CLIPPING_3PRIME_END:  Number of bases to be clipped at the 3' end
    OUTPUT_FILE:          Output clipped alignments SAM file

---------------------------
Modifying Reference Genomes
---------------------------

The objective of this tool is to modify a public reference genome according
to a given set of variants specific to the type of organism being studied.
The tool receives a reference assembly and a set of SNVs and outputs a new
version of the reference that has the alternative allele as the reference
allele at each locus included in the SNVs file. The usage is the following:

Usage: ModifyReference [-l ]

    LENGTH [optional]:  Line length in the output fasta file. Default: 100
    REFERENCE_FILE:     Genome fasta file
    SNVS_FILE:          List of SNVs; SNVQ output
    OUT_FILE:           Modified reference output file

-----------------
Merging SAM files
-----------------

This tool has the same basic functionality as the merge command of SAMTools
but it is able to process SAM files. It receives a list of SAM files, each
assumed to be grouped by sequence name with alignments to the same sequence
sorted by start position, and merges them into a single file that follows
the same ordering. The first file must have an SQ header for each sequence
name present in any file. These headers will appear in the output file,
which is written to standard output.
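
The merge described above behaves like a k-way merge of inputs that are
already sorted. The following is a minimal Python sketch of that idea, not
the NGSTools implementation. It assumes that the order of sequences is the
order of the @SQ header lines in the first file, that each input is given
as a list of SAM lines, and that headers of the other inputs are dropped.
The function names sequence_order and merge_sam_lines are hypothetical.

```python
import heapq


def sequence_order(first_file_lines):
    """Map each sequence name to its rank, taken from @SQ header order."""
    order = {}
    for line in first_file_lines:
        if line.startswith("@SQ"):
            for field in line.rstrip("\n").split("\t")[1:]:
                if field.startswith("SN:"):
                    order.setdefault(field[3:], len(order))
    return order


def merge_sam_lines(files):
    """Merge lists of SAM lines that are each sorted by (sequence, start).

    Assumption: headers of the first input are copied to the output and
    headers of the remaining inputs are ignored. Unmapped reads (reference
    name '*') are not handled in this sketch.
    """
    order = sequence_order(files[0])
    headers = [line for line in files[0] if line.startswith("@")]

    def sort_key(line):
        fields = line.split("\t")
        # SAM columns: the 3rd is the reference name, the 4th the start.
        return (order[fields[2]], int(fields[3]))

    bodies = [[l for l in f if not l.startswith("@")] for f in files]
    # heapq.merge keeps only one pending line per input at a time.
    return headers + list(heapq.merge(*bodies, key=sort_key))
```

Because heapq.merge only compares the current head of each input, the same
approach can be applied to file iterators instead of lists when the SAM
files are too large to hold in memory.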
The usage for this tool is the following:

merge-sam +

---------------------------------
Calculating mismatches statistics
---------------------------------

This is a small tool that takes a set of alignments and a reference genome
and counts the number of mismatches with the reference for each read
position from the 5' end to the 3' end. This report is useful to detect
sequencing error biases. The usage for this tool is the following:

mismatch-stats

    REFERENCE_FILE:   Genome fasta file
    ALIGNMENTS_FILE:  Input alignments SAM file
    MAX_READ_LENGTH:  Maximum read length in the SAM file
    OUTPUT:           Path to mismatch statistics output file

------------
Source Code
------------

The source code can be found in the src directory under the installation
path.

----------------
Revision history
----------------

Version 2.0.0 (3/10/2013)
  - HardMerge: Supports paired end reads
  - HardMerge: Supports local alignments
  - HardMerge: Memory efficiency; sorting of alignments is done externally
    to support large SAM files
  - HardMerge: Independence of CCDS; transcriptome alignments are expected
    in genome coordinates. convert-iso-to-genome-coords, which is part of
    the IsoEM software (http://dna.engr.uconn.edu/?page_id=105), can be
    used to convert transcriptome coordinates externally.
  - SNVQ: Memory and time efficiency (for cases with large exons)
  - SNVQ: Strand specific coverage in SNVQ output (optional)
  - ReadPositionStatistics: masked bases (different case) are not
    considered mismatches
  - User friendly command line

Version 1.0.0 (10/01/10)
  - First public release

-------
Contact
-------

For questions or suggestions regarding NGSTools you can contact:

Jorge Duitama (j.duitama@cgiar.org)
Sahar Al Seesi (sahar@engr.uconn.edu)
Ion Mandoiu (ion@engr.uconn.edu)