NGSTools - Java tools for analysis of Next Generation Sequencing (NGS) data
Version 2.0.0 (03/10/2013)
===========================================================================

The NGSTools package provides an object model that supports different kinds of analysis
of Next Generation Sequencing (NGS) data, together with utility programs to process reads
aligned to different reference genomes. The most important tools in this package are
SNVQ and HardMerge. SNVQ is an accurate algorithm for detecting and genotyping Single
Nucleotide Variants (SNVs) from base calls and quality scores. HardMerge merges alignments
of a set of reads to two references (genome and transcriptome) according to a set of rules
that ensure SNVs can be called confidently from the resulting set of alignments. Every tool
in this package processes alignments in SAM format, which makes it possible to integrate
NGSTools with commonly used mapping programs such as Bowtie
(http://bowtie-bio.sourceforge.net/index.shtml) and other analysis packages such as
SAMTools (http://samtools.sourceforge.net/).


----------------------
Installing NGSTools 
----------------------

NGSTools has been compiled and run successfully
on the standard JDK version 1.7.0. To install NGSTools:

1. Download the compressed NGSTools-2.0.0 archive
   from http://dna.engr.uconn.edu/software/NGSTools/NGSTools-2.0.0.zip

2. Uncompress NGSTools-2.0.0.zip 

3. If you want to rebuild NGSTools, run the command:
	make all 

4. Run the script "setup" to set up the paths of the NGSTools executables


--------------------------
Calling variants with SNVQ
--------------------------

SNVQ is a Single Nucleotide Variant (SNV) detection and genotyping
algorithm. It takes a reference genome and an alignments file
in SAM format and writes the list of SNVs to standard output.

Usage: 

snvq <OPTIONS> <INPUT_FILE>

OPTIONS:
		-h FLOAT				Heterozygosity rate. Default: 0.001
		-querySeq STRING			Call variants just for this sequence name 
		-start INT				Call variants just from this locus in the given query sequence
		-end INT				Call variants just until this locus in the given query sequence
		-minAltCoverage	INT			Minimum coverage of the alternative allele to call a SNV. 
							  Default: 0
		-maxAltCoverage	INT			Maximum coverage of the alternative allele to call a SNV. 
							  Default: Integer.MAX_VALUE
		
		-minProbability	FLOAT			Minimum genotype posterior probability to call a SNV. Default: 0
		-reference FILE				Reference assembly file in fasta format. An alignments file can be used
							  as input if this file is provided. It is assumed that the sequence names
							  in the alignments file correspond with the sequence names in this
							  reference assembly
		-keepLowerCaseRef			Keep variant calls in loci where the reference allele is lower case.
							  By default, lower case loci are assumed to lie in repetitive regions,
							  where false positives may be produced.
		-strand					Output strand specific base call statistics


INPUT_FILE: 	Read alignments SAM file


The output is a tab delimited text file with the following fields:
- Sequence name
- Position
- Reference Allele
- Alternative Allele
- Genotype as a two-character string
- Coverage of the reference allele
- Coverage of the alternative allele
- Phred quality score of the genotype
- Phred quality score of the probability of having a variant in this locus
- Number of A calls
- Number of C calls
- Number of G calls
- Number of T calls 

If the -strand option is selected, the following fields are also included:
- Coverage of the reference allele on the positive strand
- Coverage of the alternative allele on the positive strand
- Number of A calls on the positive strand
- Number of C calls on the positive strand
- Number of G calls on the positive strand
- Number of T calls on the positive strand
- Coverage of the reference allele on the negative strand
- Coverage of the alternative allele on the negative strand
- Number of A calls on the negative strand
- Number of C calls on the negative strand
- Number of G calls on the negative strand
- Number of T calls on the negative strand
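
The default field list above can be read with a few lines of Python. This is
only a minimal sketch: the names SnvCall and parse_snvq_line are hypothetical,
the column order is taken from the list above, and it is an assumption that
coverages and base counts are printed as integers and that quality scores
parse as numbers. Columns added by -strand are ignored.

```python
from typing import Dict, NamedTuple


class SnvCall(NamedTuple):
    """One row of SNVQ output (the 13 default columns; names are hypothetical)."""
    sequence: str
    position: int
    ref_allele: str
    alt_allele: str
    genotype: str
    ref_coverage: int
    alt_coverage: int
    genotype_quality: float
    variant_quality: float
    base_counts: Dict[str, int]


def parse_snvq_line(line: str) -> SnvCall:
    """Parse one tab-delimited row; any extra -strand columns are ignored."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 13:
        raise ValueError(f"expected at least 13 columns, got {len(fields)}")
    return SnvCall(
        sequence=fields[0],
        position=int(fields[1]),
        ref_allele=fields[2],
        alt_allele=fields[3],
        genotype=fields[4],
        ref_coverage=int(fields[5]),
        alt_coverage=int(fields[6]),
        genotype_quality=float(fields[7]),
        variant_quality=float(fields[8]),
        # Columns 10-13 are the A, C, G and T call counts, in that order.
        base_counts=dict(zip("ACGT", (int(x) for x in fields[9:13]))),
    )
```

A row is parsed with parse_snvq_line(line), after which fields such as
call.position or call.base_counts["G"] can be used for downstream filtering.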


----------------------------------------------------------
Merging Genome and Transcriptome Alignments with HardMerge
----------------------------------------------------------

NGSTools also includes a tool to merge read alignments against a reference genome with alignments 
against a transcript library. Both alignment files must be in SAM V-0.1.2 format. The 
alignments against the transcript library must be in genome coordinates. Conversion into genome 
coordinates can be done using the tool convert-iso-to-genome-coords, which is part of the IsoEM 
software (http://dna.engr.uconn.edu/?page_id=105). The output is another SAM file with alignments 
merged using the rules summarized in the table below. HardMerge also writes the set of reads that 
were filtered out by the merging rules to a FASTQ file.

Genome		Transcripts		Agree?	Keep?
Mapping		Mapping
Unique		Unique			Yes		Yes
Unique		Unique			No		No
Unique		Multiple		No		No
Unique		Not Mapped		No		Yes
Multiple	Unique			No		No
Multiple	Multiple		No		No
Multiple	Not Mapped		No		No
Not Mapped	Unique			No		Yes
Not Mapped	Multiple		No		No
Not Mapped	Not Mapped		No		No
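
The table above reduces to a small decision function. The sketch below is an
illustration of the rules only, not HardMerge's actual code; the function
name keep_read and the three status constants are hypothetical.

```python
UNIQUE = "unique"
MULTIPLE = "multiple"
NOT_MAPPED = "not_mapped"


def keep_read(genome: str, transcripts: str, agree: bool = False) -> bool:
    """Return True if the merge rules in the table keep the read.

    genome / transcripts: mapping status of the read against each reference.
    agree: whether the two unique alignments agree (only used in that case).
    """
    if genome == UNIQUE and transcripts == UNIQUE:
        return agree  # both unique: keep only if the alignments agree
    if genome == UNIQUE and transcripts == NOT_MAPPED:
        return True   # unique in the genome only
    if genome == NOT_MAPPED and transcripts == UNIQUE:
        return True   # unique in the transcriptome only
    return False      # any multiple mapping, or unmapped in both
```

In short, a read is kept when it has exactly one candidate location: either
a single unique alignment, or two unique alignments that agree.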

The usage is as follows:

Usage: 

hardmerge <TRANSCRIPTOME_ALIGNMENTS> <GENOME_ALIGNMENTS> <OUTPUT_SAM_FILE> <FILTERED_READS_FASTQ> [<LOCAL_ALIGNMENT_LENGTH_THRESHOLD>]

TRANSCRIPTOME_ALIGNMENTS:	Transcriptome alignments SAM file in genome coordinates
GENOME_ALIGNMENTS:		Genome alignments SAM file
OUTPUT_SAM_FILE:		Output SAM file
FILTERED_READS_FASTQ:		FASTQ file with reads filtered out by HardMerge merging rules
LOCAL_ALIGNMENT_LENGTH_THRESHOLD [optional]:
				Minimum number of consecutive uniquely aligned bases, for local alignments.
				Default: 15


------------------------
Clipping read alignments
------------------------

NGSTools includes a utility to clip a given number of bases from the 5' end and from the 3' end of 
each alignment in a SAM file. The input for this tool is a file of read alignments in SAM format, the
number of bases to clip from the 5' end and the number of bases to clip from the 3' end. Clipped alignments
are reported in the standard output. This tool assumes that alignments are grouped by sequence name and 
that alignments to the same sequence are sorted by start position. The output SAM file will also be
sorted in this way. The usage is the following:

Usage: 

ClipReadAlignments <ALIGNMENTS_FILE> <CLIPPING_5PRIME_END> <CLIPPING_3PRIME_END> <OUTPUT_FILE>

ALIGNMENTS_FILE:	Input alignments SAM file
CLIPPING_5PRIME_END:	Number of bases to be clipped at the 5' end
CLIPPING_3PRIME_END:	Number of bases to be clipped at the 3' end
OUTPUT_FILE:		Output clipped alignments SAM file
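
To illustrate what clipping does to a single alignment, here is a minimal
sketch for the simplest case only: an ungapped alignment on the forward
strand. How the tool itself handles CIGAR strings, gaps and reverse-strand
reads is not described in this document, so none of that is modeled, and the
function name clip_ungapped is hypothetical.

```python
from typing import Tuple


def clip_ungapped(pos: int, seq: str, qual: str,
                  clip5: int, clip3: int) -> Tuple[int, str, str]:
    """Clip clip5 bases from the 5' end and clip3 bases from the 3' end.

    Assumes an ungapped, forward-strand alignment whose leftmost base is
    aligned at reference position pos.
    """
    if clip5 < 0 or clip3 < 0:
        raise ValueError("clip lengths must be non-negative")
    if clip5 + clip3 >= len(seq):
        raise ValueError("clipping would remove the whole read")
    end = len(seq) - clip3
    # Removing bases from the 5' end moves the alignment start to the right.
    return pos + clip5, seq[clip5:end], qual[clip5:end]
```

Because the start position of each alignment can change, the ordering of
alignments within a sequence matters for the output.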


---------------------------
Modifying Reference Genomes
---------------------------

The objective of this tool is to modify a public reference genome according to a given set of variants
specific to the type of organism being studied. The tool receives a reference assembly and a set of SNVs
and outputs a new version of the reference that has the alternative allele as reference at the loci
included in the SNVs file. The usage is the following:

Usage:

ModifyReference [-l <LENGTH>] <REFERENCE_FILE> <SNVS_FILE> <OUT_FILE>

LENGTH [optional]:		Line length in the output fasta file. Default: 100
REFERENCE_FILE:			Genome fasta file
SNVS_FILE:			List of SNVs; SNVQ output
OUT_FILE:			Modified reference output file
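
The substitution itself can be sketched as follows. This is an illustration
under stated assumptions, not the tool's implementation: the function names
apply_snvs and wrap_fasta are hypothetical, positions are assumed to be
1-based, and each SNV is given as (sequence name, position, alternative
allele).

```python
from typing import Dict, Iterable, Tuple


def apply_snvs(reference: Dict[str, str],
               snvs: Iterable[Tuple[str, int, str]]) -> Dict[str, str]:
    """Return a copy of the reference with each SNV's alternative allele applied."""
    edited = {name: list(seq) for name, seq in reference.items()}
    for name, position, alt in snvs:
        edited[name][position - 1] = alt  # assumed 1-based position
    return {name: "".join(bases) for name, bases in edited.items()}


def wrap_fasta(name: str, sequence: str, line_length: int = 100) -> str:
    """Format one sequence as a fasta record with fixed-width lines."""
    lines = [sequence[i:i + line_length]
             for i in range(0, len(sequence), line_length)]
    return "\n".join([">" + name] + lines) + "\n"
```

The line_length argument plays the role of the -l option, with the same
default of 100.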


-----------------
Merging SAM files
-----------------

This tool has the same basic functionality as the merge command of SAMTools, but it is able to process
SAM files. It receives a list of SAM files, each assumed to be grouped by sequence name with alignments 
to the same sequence sorted by start position, and merges them into a single file that preserves this
ordering. The first file must have an SQ header for each sequence name present in any file.
These headers will appear in the output file, which is written to standard output. The usage for
this tool is the following:

merge-sam <SAM_FILE>+
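
The merge described above is a k-way merge of already sorted inputs. The
sketch below shows the idea with Python's heapq.merge; it is not the tool's
implementation. The function names are hypothetical, and it is an assumption
that sequence groups are ordered as their @SQ headers appear in the first
file. Column positions (RNAME third, POS fourth) follow the SAM format.

```python
import heapq
from typing import Dict, Iterable, Iterator


def sequence_rank(header_lines: Iterable[str]) -> Dict[str, int]:
    """Map each sequence name to the order of its @SQ header line."""
    rank: Dict[str, int] = {}
    for line in header_lines:
        fields = line.rstrip("\n").split("\t")
        if fields[0] != "@SQ":
            continue
        for tag in fields[1:]:
            if tag.startswith("SN:"):
                rank.setdefault(tag[3:], len(rank))
    return rank


def merge_sorted_alignments(rank: Dict[str, int],
                            *streams: Iterable[str]) -> Iterator[str]:
    """Lazily merge alignment lines sorted by (sequence, start position)."""
    def key(line: str):
        fields = line.split("\t")
        return rank[fields[2]], int(fields[3])  # RNAME, POS
    return heapq.merge(*streams, key=key)
```

Since heapq.merge is lazy, only one pending line per input needs to be held
in memory at a time.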


-------------------------------
Calculating mismatch statistics
-------------------------------

This is a small tool that takes a set of alignments and a reference genome and counts the number of 
mismatches with the reference for each read position from 5' to 3' end. This report is useful to detect
sequencing error biases. The usage for this tool is the following:

mismatch-stats <REFERENCE_FILE> <ALIGNMENTS_FILE> <MAX_READ_LENGTH> <OUTPUT>

REFERENCE_FILE:		Genome fasta file
ALIGNMENTS_FILE:	Input alignments SAM file
MAX_READ_LENGTH:	Maximum read length in SAM file
OUTPUT:			Path to mismatch statistics output file
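
The per-position count can be sketched as below. This is a simplified
illustration, not the tool's implementation: the function name is
hypothetical, and each alignment is assumed to be supplied as a pair of
equal-orientation strings (read bases, reference bases), ungapped and
ordered from the 5' to the 3' end of the read. The comparison ignores case,
in line with the revision history note that masked bases are not counted as
mismatches.

```python
from typing import Iterable, List, Tuple


def count_mismatches_by_position(alignments: Iterable[Tuple[str, str]],
                                 max_read_length: int) -> List[int]:
    """Count mismatches with the reference at each read position (0-based)."""
    counts = [0] * max_read_length
    for read_bases, ref_bases in alignments:
        for i, (read_base, ref_base) in enumerate(zip(read_bases, ref_bases)):
            if i >= max_read_length:
                break
            # Case differences (masked bases) are not treated as mismatches.
            if read_base.upper() != ref_base.upper():
                counts[i] += 1
    return counts
```

A peak at particular indices of the returned list, for example near the 3'
end, is the kind of position-specific bias this report is meant to reveal.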


------------
Source Code
------------

The source code can be found in the src directory under the 
installation path.


----------------
Revision history
----------------
Version 2.0.0 (3/10/2013)	- HardMerge: Supports paired end reads
				- HardMerge: Supports local alignments
				- HardMerge: Memory efficiency; sorting of alignments is done externally to support large SAM files
				- HardMerge: Independence of CCDS; transcriptome alignments are expected in genome coordinates. 
						convert-iso-to-genome-coords, which is part of the IsoEM software (http://dna.engr.uconn.edu/?page_id=105)
						can be used to convert transcriptome coordinates externally.
				- SNVQ: Memory and time efficiency (for cases with large exons)
				- SNVQ: Strand specific coverage in SNVQ output (optional)

				- ReadPositionStatistics: masked bases (different case) are not considered mismatches.
				- User friendly command line
				- User friendly command line

Version 1.0.0 (10/01/10)	- First public release


-------
Contact
-------
For questions or suggestions regarding NGSTools you can contact:

     Jorge Duitama (j.duitama@cgiar.org)
     Sahar Al Seesi (sahar@engr.uconn.edu)
     Ion Mandoiu (ion@engr.uconn.edu)