Guide to running IsoEM by aligning reads on a library of known isoforms
=======================================================================

IsoEM takes as input alignments given in genome coordinates in SAM format.
However, best results are obtained if the reads are aligned directly on a
library of known isoforms rather than directly on the genome. IsoEM comes with
a series of tools designed to assist in the process. In this guide we show all
the steps necessary to obtain isoform frequency estimates from a set of reads.

A. Download and install IsoEM. You can find it at:
   http://dna.engr.uconn.edu/software/IsoEM

B. Download the sample.zip archive from
   http://dna.engr.uconn.edu/software/IsoEM/sample.zip
   NOTE: the file is large, because it contains the full human genome
   reference sequence.

   The archive contains the following files and directories:
   =====================
   README-SAMPLE.TXT         - This file
   sample.sh                 - Script to run all the steps presented below
   sample/
      hg18_ref_genome.fa     - Human genome in fasta format
      reads.fastq            - 1 million 25bp single reads in fastq format
      knownGene.gtf          - UCSC known isoforms in GTF format
      knownToEnsembl.txt   \
      knownToGnfAtlas2.txt   Three different mappings of isoforms to genes
      knownToRefSeq.txt    /
   =====================

C. Unzip the archive in the same directory where IsoEM is installed!

D. Download and install bowtie from:
   http://bowtie-bio.sourceforge.net/index.shtml

E. Take a look at the sample.sh script. The script is intended to run
   everything needed. Here's a summary of all the steps executed by this
   script:
   1. Extract the isoform nucleotide sequences for all the known isoforms
      from the genome sequence based on the coordinates given in the .gtf
      file.
   2. Create a bowtie index for the isoform sequences
   3. Align the reads on the library of isoforms
   4. Convert the alignments from isoform coordinates to genome coordinates
   5. Run IsoEM
   6. Run Isoviz

If the script runs correctly, you'll end up with multiple new files in the
sample/ directory.
The interesting files are:
===================
sample/genome_aligned_reads.iso_estimates
    - estimated FPKMs (Fragments Per Kilobase per Million reads) for all the
      isoforms in the GTF file
sample/genome_aligned_reads.gene_estimates
    - estimated FPKMs for genes
sample/genome_aligned_reads_iso_read_coverage.bed
    - isoform coverage by reads
sample/genome_aligned_reads_isoforms_w_fpkm.gtf
    - isoforms with their FPKM values
===================

If you have any questions or suggestions please contact one of:
Marius Nicolae (man09004@engr.uconn.edu)
Serghei Mangul (serghei.mangul@gmail.com)
Sahar Al Seesi (sahar@engr.uconn.edu)
Ion Mandoiu (ion@engr.uconn.edu)
Alex Zelikovsky (alexz@cs.gsu.edu)
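
Once the estimate files listed above exist, a natural next step is to look at
the most highly expressed isoforms or genes. The sketch below is one possible
way to do that in Python. It ASSUMES that each line of the .iso_estimates and
.gene_estimates files has the form "<identifier><TAB><FPKM>"; this guide does
not document the layout, so check it against your own output first. The
function names read_estimates and top_expressed are made up for this example
and are not part of IsoEM.

```python
import csv
from typing import Iterable, List, Tuple


def read_estimates(lines: Iterable[str]) -> List[Tuple[str, float]]:
    """Parse (identifier, FPKM) pairs from tab-separated lines.

    ASSUMPTION: each line is "<identifier><TAB><FPKM>". This layout is not
    documented in this guide, so verify it against your own output files.
    """
    estimates = []
    for row in csv.reader(lines, delimiter="\t"):
        if len(row) < 2:
            continue  # skip blank or incomplete lines
        try:
            estimates.append((row[0], float(row[1])))
        except ValueError:
            continue  # skip lines whose second field is not a number
    return estimates


def top_expressed(
    estimates: List[Tuple[str, float]], n: int = 10
) -> List[Tuple[str, float]]:
    """Return the n entries with the highest FPKM, highest first."""
    return sorted(estimates, key=lambda pair: pair[1], reverse=True)[:n]
```

To use it, open sample/genome_aligned_reads.iso_estimates (or the
.gene_estimates file), pass the open file object to read_estimates, and pass
the result to top_expressed. Lines that do not match the assumed layout are
skipped silently, so an empty result is a hint that the assumption is wrong.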