Applications & Databases

The bioinformatics tools and databases currently available on the HPC cluster are listed below.

If the bioinformatics tool you are looking for is not on the HPC cluster, please request it from the Bioimaging Group.

Applications

Apps	Use	Version	Description
ABYSS	De novo Assembly	1.3.5	ABySS is a de novo, parallel, paired-end sequence assembler that is designed for short reads. The single-processor version is useful for assembling genomes up to 100 Mbases in size. The parallel version is implemented using MPI and is capable of assembling larger genomes.
Allpathslg	De novo Assembly	r46360	ALLPATHS-LG is a whole-genome shotgun assembler that can generate high-quality genome assemblies using short reads (~100bp) such as those produced by the new generation of sequencers.
AMOS	De novo Assembly	3.1.0	AMOS is a collection of tools and class interfaces for the assembly of DNA reads. The package includes a robust infrastructure, modular assembly pipelines, and tools for overlapping, consensus generation, contigging, and assembly manipulation.
augustus	Gene prediction	2.2.5	AUGUSTUS is a program that predicts genes in eukaryotic genomic sequences.
Bambus	Scaffolding	1.13	Bambus is a general purpose scaffolder. Bambus accepts the output from most current assemblers and provides the user with great flexibility in choosing the scaffolding parameters. In particular, Bambus is able to accept contig linking data other than specified by mate-pairs. Such sources of information include alignment to a reference genome (Bambus can directly use the output of MUMmer), physical mapping data, or information about gene synteny.
bcl2fastq	Conversion Software	v2.26	bcl2fastq Conversion Software both demultiplexes data and converts BCL files generated by Illumina sequencing systems to standard FASTQ file formats for downstream analysis. For Illumina sequencing systems running RTA version 1.18.54 and later, use bcl2fastq2 Conversion Software v2.17 or later. For Illumina sequencing systems running RTA versions earlier than 1.18.54, use bcl2fastq Conversion Software v1.8.4.
bedtools	Short Read Analysis	v2.24.0	The BEDTools utilities allow one to address common genomics tasks such as finding feature overlaps and computing coverage. The utilities are largely based on four widely-used file formats: BED, GFF/GTF, VCF, and SAM/BAM. Using BEDTools, one can develop sophisticated pipelines that answer complicated research questions by "streaming" several BEDTools together.
Bioperl	Bioinformatics Programming Packages	1.6.1	BioPerl is a collection of Perl modules that facilitate the development of Perl scripts for bioinformatics applications.
Biopython	Bioinformatics Programming Packages	1.6.1	Biopython is a set of freely available tools for biological computation written in Python.
Blast2go_basic	Functional Annotation	3.3	Blast2GO is an all-in-one tool for functional annotation of (novel) sequences and the analysis of annotation data.
blat	Sequence alignment	v. 35	BLAT (BLAST-like alignment tool) is a pairwise sequence alignment algorithm that was developed by Jim Kent at the University of California Santa Cruz (UCSC) in the early 2000s to assist in the assembly and annotation of the human genome. It was designed primarily to decrease the time needed to align millions of mouse genomic reads and expressed sequence tags against the human genome sequence.
bowtie	Short Read Reference Mapping	1.1.2	Bowtie is an ultrafast, memory-efficient short read aligner. It aligns short DNA sequences (reads) to the human genome at a rate of over 25 million 35-bp reads per hour. Bowtie indexes the genome with a Burrows-Wheeler index to keep its memory footprint small: typically about 2.2 GB for the human genome (2.9 GB for paired-end).
bowtie2	Short Read Reference Mapping	2.2.5	Bowtie 2 is an ultrafast and memory-efficient tool for aligning sequencing reads to long reference sequences. It is particularly good at aligning reads of about 50 up to 100s or 1,000s of characters, and particularly good at aligning to relatively long (e.g. mammalian) genomes. Bowtie 2 indexes the genome with an FM Index to keep its memory footprint small: for the human genome, its memory footprint is typically around 3.2 GB. Bowtie 2 supports gapped, local, and paired-end alignment modes.
bwa	Short Read Reference Mapping	0.7.4-r385	BWA is a software package for mapping low-divergent sequences against a large reference genome, such as the human genome. It consists of three algorithms: BWA-backtrack, BWA-SW and BWA-MEM. The first algorithm is designed for Illumina sequence reads up to 100bp, while the other two are for longer sequences ranging from 70bp to 1Mbp.
cap3	De novo Assembly	VersionDate: 10/15/07	CAP3 is a DNA sequence assembly program for small-scale assembly with or without quality values.
cufflinks	Short Read Analysis	2.2.1	Cufflinks assembles transcripts, estimates their abundances, and tests for differential expression and regulation in RNA-Seq samples. It accepts aligned RNA-Seq reads and assembles the alignments into a parsimonious set of transcripts. Cufflinks then estimates the relative abundances of these transcripts based on how many reads support each one, taking into account biases in library preparation protocols.
Discovar	Variant Calling	r52488	DISCOVAR can call variants on a region by region basis, potentially tiling an entire large genome. DISCOVAR variant calling is under active development and transitioning to VCF.
DiscovarDeNovo	De novo Assembly	r52488	DISCOVAR de novo can generate de novo assemblies for both large and small genomes. It currently does not call variants.
exonerate	Sequence alignment	2.2.0	Exonerate is a generic tool for pairwise sequence comparison. It allows you to align sequences using many alignment models, with either exhaustive dynamic programming or a variety of heuristics.
FastQC	Sequence Quality Control	v0.11.5	FastQC is a quality control tool for high throughput sequence data. It provides a modular set of analyses which you can use to give a quick impression of whether your data has any problems of which you should be aware before doing any further analysis.
FASTX Toolkit	Sequence Quality Control	0.0.14	The FASTX-Toolkit is a collection of command line tools for Short-Reads FASTA/FASTQ files preprocessing. It is sometimes more productive to preprocess the FASTA/FASTQ files before mapping the sequences to the genome, manipulating the sequences to produce better mapping results.
GATK	Genome annotation	v3.8-0	The GATK is the industry standard for identifying SNPs and indels in germline DNA and RNAseq data. Its scope is now expanding to include somatic variant calling tools, and to tackle copy number (CNV) and structural variation (SV). In addition to the variant callers themselves, the GATK also includes many utilities to perform related tasks such as processing and quality control of high-throughput sequencing data.
GapCloser	De novo Assembly	1.2	The GapCloser is designed to close the gaps emerging during the scaffolding process by SOAPdenovo, using the abundant pair relationships of short reads.
Gossamer	De novo Assembly	1.2.2	Gossamer is an application for doing de novo assembly of high throughput sequencing data. Large data sets can be assembled on computers with small amounts of memory, making for an extremely space-efficient genome assembler.
HMMER	Sequence alignment	3.1	HMMER is used for searching sequence databases for sequence homologs, and for making sequence alignments. It implements methods using probabilistic models called profile hidden Markov models (profile HMMs).
htseq	Python Package	0.6.0	HTSeq is a Python package that provides infrastructure to process data from high-throughput sequencing assays.
IGV	Genome browser	2.3.8	The Integrative Genomics Viewer (IGV) is a high-performance visualization tool for interactive exploration of large, integrated genomic datasets. It supports a wide variety of data types, including array-based and next-generation sequence data, and genomic annotations.
MACS	ChIP-Seq Analysis	1.4, 2.0	Model-based Analysis of ChIP-Seq (MACS) analyzes data generated by short read sequencers such as Solexa's Genome Analyzer. MACS empirically models the shift size of ChIP-Seq tags, and uses it to improve the spatial resolution of predicted binding sites. MACS also uses a dynamic Poisson distribution to effectively capture local biases in the genome, allowing for more robust predictions.
MAKER	Genome Annotation	6.5.4-7	MAKER is a portable and easily configurable genome annotation pipeline. Its purpose is to allow smaller eukaryotic and prokaryotic genome projects to independently annotate their genomes and to create genome databases. MAKER identifies repeats, aligns ESTs and proteins to a genome, produces ab-initio gene predictions and automatically synthesizes these data into gene annotations having evidence-based quality values.
MEGA	Statistical Analysis	7	Molecular Evolutionary Genetics Analysis (MEGA) is computer software for conducting statistical analysis of molecular evolution and for constructing phylogenetic trees. It includes many sophisticated methods and tools for phylogenomics and phylomedicine.
meme	Motif-based Analysis	4.11	The MEME Suite allows the biologist to discover novel motifs in collections of unaligned nucleotide or protein sequences, and to perform a wide variety of other motif-based analyses.
mirDeep	miRNA identification tool		The miRDeep package was developed to discover active known or novel miRNAs from deep sequencing data (Solexa/Illumina, 454, …). The package consists of everything you need to analyze your own deep sequencing data after removal of ligation adapters: a number of scripts to preprocess the mapped data, and the core miRDeep algorithm that will analyze and score these data.
MIRA	De novo Assembly	V3.4.1.1	The mira genome fragment assembler is a specialised assembler for sequencing projects classified as 'hard' due to high number of similar repeats. For EST transcripts, miraEST is specialised on reconstructing pristine mRNA transcripts while detecting and classifying single nucleotide polymorphisms (SNP) occurring in different variations thereof.
MUMmer	Sequence alignment	3.23	MUMmer is a system for rapidly aligning entire genomes, whether in complete or draft form.
NCBI-blast	Sequence alignment	2.2.28+	A BLAST search enables a researcher to compare a query sequence with a library or database of sequences, and identify library sequences that resemble the query sequence above a certain threshold.
NGSQCToolkit	Sequence Quality Control	v2.3	NGS QC Toolkit: A toolkit for the quality control (QC) of next generation sequencing (NGS) data. The toolkit comprises user-friendly stand-alone tools for quality control of the sequence data generated using Illumina and Roche 454 platforms with detailed results in the form of tables and graphs, and filtering of high-quality sequence data. It also includes a few other tools, which are helpful in NGS data quality control and analysis.
nseg	Functional Annotation	0	NSEG is used to mask nucleic acid sequences, needed by RepeatScout.
Oases	De novo Assembly	0.2.8	Oases is a de novo transcriptome assembler designed to produce transcripts from short read sequencing technologies, such as Illumina, SOLiD, or 454 in the absence of any genomic assembly.
perl	Programming Language	system 5.10; module available 5.15
python	Programming Language	system 2.6, module 2.7.3, 2.7.12
R	Programming Language	3.25	R is a language and environment for statistical computing and graphics. R provides a wide variety of statistical (linear and nonlinear modelling, classical statistical tests, time-series analysis, classification, clustering, ...) and graphical techniques, and is highly extensible.
RepeatMasker	Functional Annotation	open-4.0.2	RepeatMasker is a program that screens DNA sequences for interspersed repeats and low complexity DNA sequences. The output of the program is a detailed annotation of the repeats that are present in the query sequence as well as a modified version of the query sequence in which all the annotated repeats have been masked (default: replaced by Ns).
RepeatScout	Functional Annotation	1.0.5	RepeatScout is a tool to discover repetitive substrings in DNA.
rmBlast	Functional Annotation	BLAST with RepeatMasker Extensions 2.2.27+	RMBlast is a RepeatMasker compatible version of the standard NCBI BLAST suite. The primary difference between this distribution and the NCBI distribution is the addition of a new program "rmblastn" for use with RepeatMasker and RepeatModeler.
rpy2	Programming language	v2.8.x	The high-level interface in rpy2 is designed to facilitate the use of R by Python programmers.
samtools	Short Read Analysis	1.2	SAM Tools provide various utilities for manipulating alignments in the SAM format, including sorting, merging, indexing and generating alignments in a per-position format.
SHRiMP	Short Read Reference Mapping	2.2.3	SHRiMP is a software package for aligning genomic reads against a target genome.
SNAP	Sequence alignment	28/7/2006	SNAP is a new sequence aligner that is 3-20x faster and just as accurate as existing tools like BWA-mem, Bowtie2 and Novoalign.
SOAP2	Short Read Reference Mapping	2.21	SOAP2 is a program for fast and efficient alignment of short oligonucleotides onto reference sequences. SOAP2 is compatible with numerous applications, including single-read or pair-end resequencing.
SOAPdenovo	De novo Assembly	2.04-r240	SOAPdenovo is a novel short-read assembly method that can build a de novo draft assembly for human-sized genomes. The program is specially designed to assemble Illumina GA short reads.
SOPRA	De novo Assembly	1.4.6	SOPRA is an assembler for mate pair/paired-end reads from high throughput sequencing platforms, e.g. Illumina and SOLiD.
SRAtoolkit	Sequence Analysis	2.8.1	The SRA Toolkit and SDK from NCBI is a collection of tools and libraries for using data in the INSDC Sequence Read Archives.
stacks1.44	Short Read Reference Mapping	v.1.44	Stacks is a software pipeline for building loci from short-read sequences, such as those generated on the Illumina platform. Stacks was developed to work with restriction enzyme-based data, such as RAD-seq, for the purpose of building genetic maps and conducting population genomics and phylogeography.
staden	Sequence Analysis	1.4	A fully developed set of DNA sequence assembly (Gap4 and Gap5), editing and analysis tools (Spin) for Unix, Linux, MacOSX and MS Windows.
StringTie	Sequence alignment	v1.3.3	StringTie is a fast and highly efficient assembler of RNA-Seq alignments into potential transcripts. It uses a novel network flow algorithm as well as an optional de novo assembly step to assemble and quantitate full-length transcripts representing multiple splice variants for each gene locus.
tophat	Short Read Reference Mapping	2.1.0	TopHat is a fast splice junction mapper for RNA-Seq reads. It aligns RNA-Seq reads to mammalian-sized genomes using the ultra high-throughput short read aligner Bowtie, and then analyzes the mapping results to identify splice junctions between exons.
TRF	Tandem Repeat Finder	4.04	Tandem Repeats Finder (TRF) is a program that locates and displays tandem repeats in DNA sequences.
Trinity	De novo Assembly	trinityrnaseq_r20140717	Trinity is a de novo transcriptome assembler exclusively for Illumina/Solexa data.
velvet	De novo Assembly	1.2.09	Velvet is a de novo genomic assembler specially designed for short read sequencing technologies, such as Solexa or 454. Velvet currently takes in short read sequences, removes errors then produces high quality unique contigs. It then uses paired-end read and long read information, when available, to retrieve the repeated areas between contigs.
Vienna RNA Package	RNA Secondary Structure	1.6	The Vienna RNA Package consists of a C code library and several stand-alone programs for the prediction and comparison of RNA secondary structures.

Database	Description	Version
The Reference Sequence (RefSeq) database	The Reference Sequence (RefSeq) database, built by the National Center for Biotechnology Information (NCBI), is an open-access, annotated, and curated collection of publicly available nucleotide sequences and their protein products. It provides only a single record for each natural biological molecule (i.e. DNA, RNA or protein) for major organisms ranging from viruses to bacteria to eukaryotes. This directory contains NCBI transcript reference sequences.
NT	Nucleotide sequence database, with entries from all traditional divisions of GenBank, EMBL, and DDBJ, excluding bulk divisions (gss, sts, pat, est, and htg divisions). wgs entries are also excluded. Not non-redundant.
NR	Non-redundant protein sequence database with entries from GenPept, Swiss-Prot, PIR, PRF, PDB and NCBI RefSeq.
Drosophila genome (release 5)	The Release 5 D. melanogaster assembly combines euchromatic and heterochromatic sequence.	Mar-06
Rfam	Rfam is a database containing information about non-coding RNA (ncRNA) families and other structured RNA elements. It is an annotated, open access database hosted by the Wellcome Trust Sanger Institute in collaboration with Janelia Farm.
Drosophila transposons	This file contains 'canonical' sequences of the transposable elements from Drosophila.	Apr-05 (v9.41)
D. melanogaster Apr. 2006 (BDGP R5/dm3)	This directory contains the Apr. 2006 assembly of the D. melanogaster genome (dm3, BDGP Release 5), as well as repeat annotations and GenBank sequences.	Apr-06
D. mojavensis Aug. 2005 (Agencourt prelim/droMoj2)	The 1 August 2005 Drosophila mojavensis genome assembly was produced by Agencourt Bioscience Corporation using the Arachne assembler. The assembly contains 6,843 scaffolds ranging in size from 101 bases to 34,172,700 bases, with a mean size of 28,389.6 bases and a median of 1,671 bases.	Aug-05
Mouse Dec. 2011 (GRCm38/mm10)	Genome Reference Consortium GRCm38, which includes approximately 2.6 Gb of sequence, is considered to be "essentially complete". The assembly includes chromosomes 1-19, X, Y, M (mitochondrial DNA) and chr_random (unlocalized) and chrUn_ (unplaced clone contigs). For information about the process used to assemble this version, see the GRC website.	Dec-11
Human Feb. 2009 (GRCh37/hg19)	This directory contains the Feb. 2009 assembly of the human genome (hg19, GRCh37 Genome Reference Consortium Human Reference 37) in one gzip-compressed FASTA file per chromosome. The GRCh37 build reference sequence is considered to be "finished", a technical term indicating that the sequence is highly accurate (with fewer than one error per 10,000 bases) and highly contiguous (with the only remaining gaps corresponding to regions whose sequence cannot be reliably resolved with current technology).	Feb-09
Zebrafish Jul. 2010 (Zv9/danRer7)	The Zv9 assembly comprises a sequence length of 1.4 Gb in 26 chromosomes and 1,107 scaffolds. This assembly is based on a clone path sorted with the high-density meiotic map SATMAP (Clark et al., in preparation). The data freeze was taken on 1 April 2010. The remaining gaps were filled with sequence from WGS31, a combined Illumina and capillary assembly. The assembly integration process involved sequence alignments as well as cDNA, marker and BAC/Fosmid end sequence placements. For more details about the Zv9 assembly, see the Sanger Institute page for the Danio rerio Sequencing Project.	Jul-10
FlyBase Drosophila melanogaster 5.52	FlyBase 5.52 contains a complete annotation of the Drosophila melanogaster genome.	5.52
mirBase version 20	The miRBase database is a searchable database of published miRNA sequences and annotation. Each entry in the miRBase Sequence database represents a predicted hairpin portion of a miRNA transcript (termed mir in the database), with information on the location and sequence of the mature miRNA sequence (termed miR). Both hairpin and mature sequences are available.	20
DDBJ (DNA Data Bank of Japan) Sequence Read Archive selected datasets	DDBJ Sequence Read Archive (DRA) is an archive database for output data generated by next-generation sequencing machines including Roche 454 GS System®, Illumina Genome Analyzer®, Applied Biosystems SOLiD® System, and others. DRA is a member of the International Nucleotide Sequence Database Collaboration (INSDC) and archives the data in close collaboration with NCBI Sequence Read Archive (SRA) and EBI Sequence Read Archive (ERA).
NCBI (National Center for Biotechnology Information) Gene Expression Omnibus selected datasets	The Gene Expression Omnibus (GEO) is a public repository that archives and freely distributes high throughput gene expression data submitted by the scientific community. GEO currently stores approximately a billion individual gene expression measurements, derived from over 100 organisms, addressing a wide range of biological issues. These huge volumes of data may be effectively explored, queried, and visualized using user-friendly Web-based tools.
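
Several of the genome entries above, such as the hg19 directory, are stored as gzip-compressed FASTA files, one per chromosome. As a quick sanity check before using such a file, the sketch below reports the sequence length of each record. It is only an illustration and assumes Python 3 with the standard library; the function names and the in-memory demo record are hypothetical, and the actual file names and paths on the cluster are not given here.

```python
import gzip
import io


def fasta_lengths(handle):
    """Return {record_name: sequence_length} for a FASTA text stream."""
    lengths = {}
    name = None
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Header line: use the first word after '>' as the record name.
            fields = line[1:].split()
            name = fields[0] if fields else ""
            lengths[name] = 0
        elif name is not None:
            # Sequence line: add its length to the current record.
            lengths[name] += len(line)
    return lengths


def gzipped_fasta_lengths(path):
    """Open a gzip-compressed FASTA file in text mode and measure each record."""
    with gzip.open(path, "rt") as handle:
        return fasta_lengths(handle)


if __name__ == "__main__":
    # Tiny in-memory record standing in for a real chromosome file.
    demo = io.StringIO(">chrDemo test record\nACGTNN\nacgt\n")
    print(fasta_lengths(demo))
```

For a real file, `gzipped_fasta_lengths` would be called with the path to one of the per-chromosome `.gz` files. The file is read line by line, so memory use stays small even for large chromosomes.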