Applications for the Educational Clusters

The applications offered on the HPC educational clusters, Mamba and DSBA Hadoop, are a subset of what is available on the research clusters, mainly because of licensing limitations. Faculty who will be teaching a course may request that software or codes be installed, but the request must be made before the start of the semester in which the class will be taught so that we have time to configure, build, install, and test the application in our environment. We typically do not install new applications or update existing ones during the semester, so that the environment stays consistent throughout the semester.



ABySS, Assembly by Short Sequences, is a de novo, parallel, paired-end sequence assembler that is designed for short reads. The single-processor version is useful for assembling genomes up to 100 Mbases in size. The parallel version is implemented using MPI and is capable of assembling larger genomes.
versions available: 1.9.0, 2.1.1

AGWG-merge is a version of the 3D-DNA pipeline (Dudchenko et al., Science, 2017) that was used to help generate the AaegL5 genome assembly for the mosquito Aedes aegypti.
versions available: 180114

AllenNLP makes it easy to design and evaluate new deep learning models for nearly any NLP problem, along with the infrastructure to easily run them. AllenNLP includes reference implementations of high quality models for both core NLP problems (e.g. semantic role labeling) and NLP applications (e.g. textual entailment).
versions available: 0.8.3

ALLPATHS‐LG is a whole‐genome shotgun assembler that can generate high‐quality genome assemblies using short reads (~100bp) such as those produced by the new generation of sequencers.
versions available: 52488

AmberTools18 consists of several independently developed packages that work well by themselves, and with Amber itself. The suite can also be used to carry out complete molecular dynamics simulations, with either explicit water or generalized Born solvent models. AmberTools18 consists of the following major codes: NAB/sff, antechamber and MCPB, tleap and parmed, sqm, pbsa, 3D-RISM, sander, mdgx, cpptraj and pytraj, and amberlite
versions available: 18, 18-mpi

ANGSD is software for analyzing next generation sequencing data, and can handle a number of different input types from mapped reads to imputed genotype probabilities. Most methods take genotype uncertainty into account instead of basing the analysis on called genotypes, which is especially useful for low and medium depth data.
versions available: 0.918

Anvi'o is an open-source, community-driven analysis and visualization platform for 'omics data. It brings together many aspects of today's cutting-edge genomic, metagenomic, metatranscriptomic, pangenomic, and phylogenomic analysis practices to address a wide array of needs.
versions available: 5.3

ART (art_Illumina Q version) is a simulation program to generate sequence read data of Illumina sequencers. ART generates reads according to the empirical read quality profile summarized from large real read data. ART has been used for testing or benchmarking a variety of methods or tools for next-generation sequencing data analysis, including read alignment, de novo assembly, and detection of SNP, CNV, or other structural variation.
versions available: 03.19.15

AsciiDoc is a presentable text document format for writing articles, UNIX man pages and other small to medium sized documents. The asciidoc(1) command translates AsciiDoc files to HTML, DocBook and LinuxDoc formats.
versions available: 8.6.9

AUGUSTUS is a gene prediction program for eukaryotes. It can be used as an ab initio program, which means it bases its prediction purely on the sequence.
versions available: 3.2.3, 3.3, 3.3.2

BamTools is a project that provides both a C++ API and a command-line toolkit for reading, writing, and manipulating BAM (genome alignment) files.
versions available: 2.4.1, 2.5.1

BCFtools is a set of utilities that manipulate variant calls in the Variant Call Format (VCF) and its binary counterpart BCF. All commands work transparently with both VCFs and BCFs, both uncompressed and BGZF-compressed.
versions available: 1.3.1, 1.6, 1.9

Collectively, the bedtools utilities are a Swiss Army knife of tools for a wide range of genomics analysis tasks. The most widely used tools enable genome arithmetic: that is, set theory on the genome.
versions available: 2.26.0
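
To illustrate what "genome arithmetic" means, here is a minimal Python sketch of an interval intersection, the kind of set operation the bedtools utilities perform. It is not bedtools code and does not use the bedtools interface; the function name `intersect`, the tuple layout, and the sample intervals are assumptions made up for this example.

```python
from typing import List, Tuple

# Hypothetical interval layout for this sketch: (chromosome, start, end),
# using 0-based, half-open coordinates as in the BED format.
Interval = Tuple[str, int, int]


def intersect(a: List[Interval], b: List[Interval]) -> List[Interval]:
    """Return the overlapping portion of every pair of intervals from a and b.

    Naive O(len(a) * len(b)) comparison; real tools use sorted input or
    index structures to avoid comparing every pair.
    """
    overlaps = []
    for chrom_a, start_a, end_a in a:
        for chrom_b, start_b, end_b in b:
            if chrom_a != chrom_b:
                continue  # intervals on different chromosomes never overlap
            start = max(start_a, start_b)
            end = min(end_a, end_b)
            if start < end:  # non-empty overlap
                overlaps.append((chrom_a, start, end))
    return overlaps


# Made-up example data.
genes = [("chr1", 100, 200), ("chr1", 500, 600)]
peaks = [("chr1", 150, 250), ("chr2", 100, 200)]
print(intersect(genes, peaks))
```

With the sample data above, only the first gene and the first peak share a chromosome and overlap, so the result contains the single interval covering positions 150 to 200 on chr1.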

Bioconductor provides tools for the analysis and comprehension of high-throughput genomic data. Bioconductor uses the R statistical programming language, and is open source and open development.
versions available: 3.2

Bismark is a set of tools for the time-efficient analysis of Bisulfite-Seq (BS-Seq) data. Bismark performs alignments of bisulfite-treated reads to a reference genome and cytosine methylation calls at the same time. (Requires Bowtie or Bowtie2)
versions available: 0.18.0

The PacBio long read aligner
versions available: 5.3

NCBI BLAST (Basic Local Alignment Search Tool) is a suite of programs for aligning query sequences against those present in a selected target database.
versions available: 2.2.29+, 2.3.0+, 2.5.0+

Blat produces two major classes of alignments: 1) at the DNA level between two sequences that are of 95% or greater identity, but which may include large inserts, 2) at the protein or translated DNA level between sequences that are of 80% or greater identity and may also include large inserts. (v36 / 64-bit)
versions available: 36

Bowtie 2 is an ultrafast and memory-efficient tool for aligning sequencing reads to long reference sequences. It is particularly good at aligning reads of about 50 up to 100s or 1,000s of characters to relatively long (e.g. mammalian) genomes.
versions available: 2.1.0, 2.2.9, 2.2.9-sse2

BRAKER2 is a pipeline for unsupervised RNA-Seq-based genome annotation with GeneMark-ET and AUGUSTUS.
versions available: 2.0.6, 2.1.2

BUSCO provides quantitative measures for the assessment of genome assembly, gene set, and transcriptome completeness, based on evolutionarily-informed expectations of gene content from near-universal single-copy orthologs selected from OrthoDB v9.
versions available: 3.0.2

BWA (Burrows-Wheeler Aligner) is a software package for mapping DNA sequences against a large reference genome, such as the human genome. It consists of three algorithms: BWA-backtrack, BWA-SW and BWA-MEM.
versions available: 0.7.12, 0.7.17

c2x reads and writes a selection of file formats which relate to DFT electronic structure codes. It is not simply a conversion utility, but it can also perform manipulations and some analysis. It can form supercells and find primitive cells.
versions available: 2.26b

Caffe is a deep learning framework made with expression, speed, and modularity in mind. It is developed by the Berkeley Vision and Learning Center (BVLC) and by community contributors.
versions available: 1.0, 1.0.0rc3, 1.0.0rc3-cuda8, 1.0-cuda8

Canu is a fork of the Celera Assembler, designed for high-noise single-molecule sequencing, such as the PacBio RS II/Sequel or Oxford Nanopore MinION.
versions available: 1.8

CG2AT-Traj is a program to convert a coarse-grained molecular dynamics simulation trajectory to an all-atom trajectory. This project started on Sat Oct 17 2015 as part of a 'hack day' at the CECAM workshop for 'setting up simulations'.
versions available: traj

ClustalW2 is a general purpose DNA or protein multiple sequence alignment program for three or more sequences.
versions available: 2.1

ClustalW2 is a general purpose DNA or protein multiple sequence alignment program for three or more sequences.
versions available: 0.13

CMake is a cross-platform, open-source build system. CMake is a family of tools designed to build, test and package software.
versions available: 3.10.2, 3.12.3

Constraint Network Analysis (CNAnalysis) is a graph theory-based rigidity analysis approach that analyzes global and local flexibility and rigidity characteristics of proteins by carrying out thermal unfolding simulations.
versions available: 2.0

CNVnator - a tool for CNV discovery and genotyping from depth-of-coverage by mapped reads
versions available: 0.3.3

CrossMap (for Python 2.7.x) is a program for convenient conversion of genome coordinates (or annotation files) between different assemblies (such as Human hg18 (NCBI36) <> hg19 (GRCh37), Mouse mm9 (MGSCv37) <> mm10 (GRCm38)). It supports most commonly used file formats including SAM/BAM, Wiggle/BigWig, BED, GFF/GTF, VCF.
versions available: 0.2.9, 0.3.3

The NVIDIA CUDA Toolkit provides a development environment for creating high performance GPU-accelerated applications. With the CUDA Toolkit, you can develop, optimize and deploy your applications on GPU-accelerated embedded systems, desktop workstations, enterprise data centers, cloud-based platforms and HPC supercomputers.
versions available: 10.0, 8.0, 9.0, 9.2

Cufflinks assembles transcripts, estimates their abundances, and tests for differential expression and regulation in RNA-Seq samples.
versions available: 2.2.1

Cutadapt finds and removes adapter sequences, primers, poly-A tails and other types of unwanted sequence from your high-throughput sequencing reads.
versions available: 1.14

Deepgaze is a library for human-computer interaction, people detection and tracking which uses Convolutional Neural Networks (CNNs) for face detection, head pose estimation and classification. The focus of attention of a person can be approximately estimated finding the head orientation.
versions available: 0.1-cuda8

Eclipse provides IDEs and platforms for nearly every language and architecture, including Java, C/C++, JavaScript and PHP.
versions available: 4.3.2

EMBOSS (European Molecular Biology Open Software Suite) is a software analysis package specially developed for the needs of the molecular biology (e.g. EMBnet) user community. The software automatically copes with data in a variety of formats and even allows transparent retrieval of sequence data from the web.
versions available: 6.6.0

Quantum Espresso is an integrated suite of Open-Source computer codes for electronic-structure calculations and materials modeling at the nanoscale.
versions available: 5.3-intel-mpi, 6.3-intel-mpi

This code implements the popular RAxML search algorithm for maximum likelihood based inference of phylogenetic trees. It uses a radically new MPI parallelization approach that yields improved parallel efficiency, in particular on partitioned multi-gene or whole-genome datasets.
versions available: 3.0.16, 3.0.17

Exonerate is a generic tool for sequence alignment
versions available: 2.4.0

Face detection using the Faster R-CNN. It is developed based on the awesome py-faster-rcnn repository.
versions available: 1.0-cuda8

FastQC is an application which takes a FastQ file and runs a series of tests on it to generate a comprehensive QC report. This will tell you if there is anything unusual about your sequence. Each test is flagged as a pass, warning or fail depending on how far it departs from what you'd expect from a normal large dataset with no significant biases.
versions available: 0.11.5

The FASTX-Toolkit is a collection of command line tools for Short-Reads FASTA/FASTQ files preprocessing.
versions available: 0.0.14

Fire Dynamics Simulator (FDS) is a large-eddy simulation (LES) code for low-speed flows, with an emphasis on smoke and heat transport from fires.
versions available: 6.5.3, 6.7.0

FFmpeg is the leading multimedia framework, able to decode, encode, transcode, mux, demux, stream, filter and play pretty much anything that humans and machines have created. It supports the most obscure ancient formats up to the cutting edge. It contains libavcodec, libavutil, libavformat, libavfilter, libavdevice, libswscale and libswresample, which can be used by applications, as well as ffmpeg, ffserver, ffplay and ffprobe, which can be used by end users for transcoding, streaming and playing.
versions available: 2.8.13, 3.2.14, 4.1.3

Mozilla Firefox (or simply Firefox) is a free and open-source web browser developed by Mozilla Foundation and its subsidiary, Mozilla Corporation.
versions available: 58.0.2

FMLRC, or FM-index Long Read Corrector, is a tool for performing hybrid correction of long read sequencing using the BWT and FM-index of short-read sequencing data. Included in this module: msbwt and ropebwt2
versions available: 0.1.2

GARLI, Genetic Algorithm for Rapid Likelihood Inference, is a program for inferring phylogenetic trees. Using an approach similar to a classical genetic algorithm, it rapidly searches the space of evolutionary trees and model parameters to find the solution maximizing the likelihood score.
versions available: 0.942, 2.01

GATK (Genome Analysis Toolkit) offers a wide variety of tools with a primary focus on variant discovery and genotyping. Its powerful processing engine and high-performance computing features make it capable of taking on projects of any size.
versions available: 3.8

GeneMark, developed in 1993, was the first gene finding method recognized as an efficient and accurate tool for genome projects. GeneMark was used for annotation of the first completely sequenced bacterium, Haemophilus influenzae, and the first completely sequenced archaeon, Methanococcus jannaschii.
versions available: 4.32, 4.38

Git is a version-control system for tracking changes in computer files and coordinating work on those files among multiple people. It is primarily used for source-code management in software development, but it can be used to keep track of changes in any set of files. As a distributed revision-control system, it is aimed at speed, data integrity, and support for distributed, non-linear workflows.
versions available: 2.19.2

GRASS (Geographic Resources Analysis Support System), is a free and open source GIS software suite used for geospatial data management and analysis, image processing, graphics and maps production, spatial modeling, and visualization.
versions available: 7.0.3

GROMACS is a versatile package to perform molecular dynamics, i.e. simulate the Newtonian equations of motion for systems with hundreds to millions of particles. It is primarily designed for biochemical molecules like proteins, lipids and nucleic acids that have a lot of complicated bonded interactions, but since GROMACS is extremely fast at calculating the nonbonded interactions, many groups are also using it for research on non-biological systems, e.g. polymers.
versions available: 2016.3, 2016.3-cuda, 2016.3-mpi, 2016.3-mpi-cuda, 2018, 2018-cuda, 2018-mpi, 2018-mpi-cuda, 5.1.2, 5.1.2-cuda, 5.1.2-mpi, 5.1.2-mpi-cuda

The Gurobi Optimizer is the state-of-the-art mathematical programming solver for prescriptive analytics, for all major model types including LP, MILP, MIQP, and others.
versions available: 7.5.1, 8.0.1, 8.1.1

HiC-Pro is an optimized and flexible pipeline for Hi-C data processing. HiC-Pro was designed to process Hi-C data, from raw fastq files (paired-end Illumina data) to the normalized contact maps.
versions available: 2.11.1

HMMER is used for searching sequence databases for sequence homologs, and for making sequence alignments. It implements methods using probabilistic models called profile hidden Markov models (profile HMMs).
versions available: 3.2.1

HUMAnN is a pipeline for efficiently and accurately profiling the presence/absence and abundance of microbial pathways in a community from metagenomic or metatranscriptomic sequencing data (typically millions of short DNA/RNA reads).
versions available: 0.11.2

InterPro is a database which integrates together predictive information about proteins' function from a number of partner resources, giving an overview of the families that a protein belongs to and the domains and sites it contains.
versions available: 5.31-70.0, 5.34-73.0

I-TASSER is an integrated package for protein structure and function predictions. For a given sequence, I-TASSER first identifies template proteins from the Protein Data Bank (PDB) by multiple threading techniques (LOMETS).
versions available: 5.1

ITK (Insight Segmentation and Registration Toolkit) is an open-source software toolkit for performing registration and segmentation. Segmentation is the process of identifying and classifying data found in a digitally sampled representation.
versions available: 4.9.0

JELLYFISH is a tool for fast, memory-efficient counting of k-mers in DNA. JELLYFISH can count k-mers using an order of magnitude less memory and an order of magnitude faster than other k-mer counting packages by using an efficient encoding of a hash table and by exploiting the 'compare-and-swap' CPU instruction to increase parallelism.
versions available: 2.2.6
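
For readers unfamiliar with the term, the following minimal Python sketch shows what counting k-mers means. It is only an illustration of the task: it does not reproduce JELLYFISH's compact hash-table encoding or its compare-and-swap parallelism, and the function name `count_kmers` and the sample sequence are assumptions for this example.

```python
from collections import Counter


def count_kmers(seq: str, k: int) -> Counter:
    """Count every overlapping substring of length k in seq."""
    seq = seq.upper()
    # A sequence of length n has n - k + 1 overlapping k-mers;
    # the range is empty when the sequence is shorter than k.
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


# Made-up example sequence.
counts = count_kmers("ACGTACGT", 3)
for kmer, n in sorted(counts.items()):
    print(kmer, n)
```

A plain in-memory dictionary like this is fine for short sequences but grows quickly with real sequencing data, which is the problem dedicated k-mer counters are designed to address.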

KneadData is a tool designed to perform quality control on metagenomic and metatranscriptomic sequencing data, especially data from microbiome experiments.
versions available: 0.7.2

Kraken is a system for assigning taxonomic labels to short DNA sequences, usually obtained through metagenomic studies.
versions available: 1.1

kWIP: A method for calculating genetic similarity between samples. Unlike similar alternatives, e.g. SNP-based distance calculation, kWIP operates directly upon next-gen sequencing reads.
versions available: 0.2.0

LACHESIS is a software tool to measure the thread of life.
versions available: 201701

versions available: 12Dec18-intel-gpu, 12Dec18-intel-mpi, 18Jun19-intel-gpu, 18Jun19-intel-mpi, 31Mar17

LASTZ: A tool for (1) aligning two DNA sequences, and (2) inferring appropriate scoring parameters automatically.
versions available: 1.04.00

LIGGGHTS(R)-PUBLIC is an Open Source Discrete Element Method Particle Simulation Software based on LAMMPS. LIGGGHTS (R) stands for LAMMPS improved for general granular and granular heat transfer simulations. LIGGGHTS (R) aims to improve the capabilities of LAMMPS with the goal to apply it to industrial applications.
versions available: 3.7.0

LoRDEC is a program to correct sequencing errors in long reads from 3rd generation sequencing with high error rate, and is especially intended for PacBio reads. It uses a hybrid strategy, meaning that it uses two sets of reads: the reference read set, whose error rate is assumed to be small, and the PacBio read set, which is then corrected using the reference set. Typically, the reference set contains Illumina reads.
versions available: 0.8

MAFFT is a multiple alignment program for amino acid or nucleotide sequences. It offers a range of multiple alignment methods, L-INS-i (accurate; for alignment of <∼200 sequences), FFT-NS-2 (fast; for alignment of <∼30,000 sequences), etc. (7.273woe)
versions available: 7.055woe, 7.273woe

The MaSuRCA (Maryland Super Read Cabog Assembler) assembler combines the benefits of deBruijn graph and Overlap-Layout-Consensus assembly approaches. MaSuRCA supports hybrid assembly with short Illumina reads and long high error PacBio/MinION data.
versions available: 3.2.4, 3.2.7

Mauve is a system for constructing multiple genome alignments in the presence of large-scale evolutionary events such as rearrangement and inversion. Multiple genome alignments provide a basis for research into comparative genomics and the study of genome-wide evolutionary dynamics.
versions available: 2015.02

Multiple Em for Motif Elicitation. The MEME Suite allows the biologist to discover novel motifs in collections of unaligned nucleotide or protein sequences, and to perform a wide variety of other motif-based analyses.
versions available: 4.12.0, 4.9.1, 5.0.1

Meraculous is a whole genome assembler for Next Generation Sequencing data geared for large genomes. It is a hybrid k-mer/read-based assembler that capitalizes on the high accuracy of Illumina sequence by eschewing an explicit error correction step which we argue to be redundant with the assembly process.
versions available: 2.2.5

MetaBAT: A robust statistical framework for reconstructing genomes from metagenomic data
versions available: 2.13

MetaPhlAn is a computational tool for profiling the composition of microbial communities (Bacteria, Archaea, Eukaryotes and Viruses) from metagenomic shotgun sequencing data (i.e. not 16S) with species-level resolution. With the newly added StrainPhlAn module, it is now possible to perform accurate strain-level microbial profiling.
versions available: 2.7.2

MethylDackel is a (mostly) universal methylation extractor for BS-seq experiments. It will process a coordinate-sorted and indexed BAM or CRAM file containing some form of BS-seq alignments and extract per-base methylation metrics from them.
versions available: 0.2.1

Minialign is a little bit fast and moderately accurate nucleotide sequence alignment tool designed for PacBio and Nanopore long reads. It is built on three key algorithms, minimizer-based index of the minimap overlapper, array-based seed chaining, and SIMD-parallel Smith-Waterman-Gotoh extension.
versions available: 0.4.4, 0.6.0

Miniasm is a very fast OLC-based de novo assembler for noisy long reads. It takes all-vs-all read self-mappings, typically produced by minimap, as input and outputs an assembly graph in the GFA format.
versions available: 0.2

Minimap2 is a versatile sequence alignment program that aligns DNA or mRNA sequences against a large reference database.
versions available: 2.10

miRDeep2 is a software package for identification of novel and known miRNAs in deep sequencing data. Furthermore, it can be used for miRNA expression profiling across samples. Last, a new module for preprocessing of raw Illumina sequencing data produces files for downstream analysis with the miRDeep2 or quantifier module. Colorspace sequencing data is currently not supported by the preprocessing module, but support is planned.
versions available: 0.1.2

microRNA PREdiction From small RNAseq data (miR-PREFeR) uses expression patterns of miRNA and follows the criteria for plant microRNA annotation to accurately predict plant miRNAs from one or more small RNA-Seq data samples of the same species. We tested miR-PREFeR on several plant species. The results show that miR-PREFeR is sensitive, accurate, fast, and has a low memory footprint.
versions available: 0.24

MODELLER is used for homology or comparative modeling of protein three-dimensional structures (1,2). The user provides an alignment of a sequence to be modeled with known related structures and MODELLER automatically calculates a model containing all non-hydrogen atoms.
versions available: 9.16

The MOSEK optimization software is designed to solve large-scale mathematical optimization problems. The strongest point of MOSEK is its state-of-the-art interior-point optimizer for continuous linear, quadratic and conic problems.
versions available: 8.1

Open-Source Parallel BLAST (Basic Local Alignment Search Tool). BLAST is a suite of programs provided by NCBI for aligning query sequences against those present in a selected target database.
versions available: 1.6.0, 1.6.0-ib

MrBayes is a program for Bayesian inference and model choice across a wide range of phylogenetic and evolutionary models. MrBayes uses Markov chain Monte Carlo (MCMC) methods to estimate the posterior distribution of model parameters.
versions available: 3.2.2, 3.2.2-ib

NAMD (2.11 x86_64 mpi) is a parallel molecular dynamics code designed for high-performance simulation of large biomolecular systems. Based on Charm++ parallel objects, NAMD scales to hundreds of cores for typical simulations and beyond 500,000 cores for the largest simulations.
versions available: 2.11-mcore, 2.11-mcore-cuda, 2.11-mpi

The Netwide Assembler, NASM, is an 80x86 and x86-64 assembler designed for portability and modularity. It supports a range of object file formats, including Linux and `*BSD' `a.out', `ELF', `COFF', `Mach-O', 16-bit and 32-bit `OBJ' (OMF) format, `Win32' and `Win64'.
versions available: 2.14

The NCO (netCDF Operators) toolkit manipulates and analyzes data stored in netCDF-accessible formats, including DAP, HDF4, and HDF5. It exploits the geophysical expressivity of many CF (Climate & Forecast) metadata conventions, the flexible description of physical dimensions translated by UDUnits, the network transparency of OPeNDAP, the storage features of HDF, and many powerful mathematical and statistical algorithms of GSL.
versions available: 4.4.8

NetBeans is a free, open source IDE that allows you to quickly and easily develop desktop, mobile and web applications with Java, HTML5, PHP, C/C++ and more.
versions available: 8.0.2, 8.1

NetLogo is a programmable modeling environment for simulating natural and social phenomena. NetLogo is particularly well suited for modeling complex systems developing over time.
versions available: 5.0.4, 5.1.0

Node.js is a JavaScript runtime built on Chrome's V8 JavaScript engine. Node.js uses an event-driven, non-blocking I/O model that makes it lightweight and efficient.
versions available: 4.4.0

Open Babel is a chemical toolbox designed to speak the many languages of chemical data. It's an open, collaborative project allowing anyone to search, convert, analyze, or store data from molecular modeling, chemistry, solid-state materials, biochemistry, or related areas.
versions available: 2.3.2

OpenFOAM is the free, open source CFD software released and developed primarily by OpenCFD Ltd since 2004. OpenFOAM has an extensive range of features to solve anything from complex fluid flows involving chemical reactions, turbulence and heat transfer, to acoustics, solid mechanics and electromagnetics.
versions available: 1606+, 1706, 1806

OpenSfM is an Open source Structure from Motion pipeline. The library serves as a processing pipeline for reconstructing camera poses and 3D scenes from multiple images.
versions available: 0.2.0

OrthoFinder is a fast, accurate and comprehensive platform for comparative genomics. It finds orthogroups and orthologs, infers rooted gene trees for all orthogroups and identifies all of the gene duplication events in those gene trees.
versions available: 2.2.7

PacBio develops comprehensive solutions for scientists that propel the field of genomics, improve science and research, and create positive impact globally. This module includes several of the PacBio open source tools, including blasr, ConsensusCore, GenomicConsensus, pbalign, pbcommand, pbcore, and pbcoretools, pbbam, bam2fastx, pb-dazzler, PB Assembly, and FALCON.
versions available: 2018.8, 2019.8

GNU parallel is a shell tool for executing jobs in parallel using one or more computers. A job can be a single command or a small script that has to be run for each of the lines in the input.
versions available: 20190322

ParaView is an open-source, multi-platform data analysis and visualization application. ParaView users can quickly build visualizations to analyze their data using qualitative and quantitative techniques.
versions available: 5.0.0-mpi

Software for Long-Read Sequencing Data from PacBio. PBSuite is made up of 2 tools: PBJelly and PBHoney. PBJelly is a highly automated pipeline that aligns long sequencing reads (such as PacBio RS reads or long 454 reads in fasta format) to high-confidence draft assemblies. PBHoney is an implementation of two variant-identification approaches designed to exploit the high mappability of long reads (i.e., greater than 10,000 bp).
versions available: 15.8.24

PBZIP2 is a parallel implementation of the bzip2 block-sorting file compressor that uses pthreads and achieves near-linear speedup on SMP machines. The output of this version is fully compatible with bzip2 v1.0.2 or newer
versions available: 1.1.13

PDL (Perl Data Language) gives standard Perl the ability to compactly store and speedily manipulate the large N-dimensional data arrays which are the bread and butter of scientific computing.
versions available: 2.015

Peridigm is an open-source computational peridynamics code developed at Sandia National Laboratories for massively-parallel multi-physics simulations. It has been applied primarily to problems in solid mechanics involving pervasive material failure.
versions available: 1.4.1-mpi

Picard is a set of Java command line tools for manipulating high-throughput sequencing (HTS) data and formats. Picard is implemented using the HTSJDK Java library to support accessing file formats that are commonly used for high-throughput sequencing data such as SAM and VCF.
versions available: 2.18.29, 2.9.2

PLINK is a free, open-source whole genome association analysis toolset, designed to perform a range of basic, large-scale analyses in a computationally efficient manner.
versions available: 1.90b3.32

POY is a phylogenetic analysis program that supports multiple kinds of data (e.g. morphology, nucleotides, genes and gene regions, chromosomes, whole genomes, etc). POY is particular in that it can perform true alignment and phylogeny inference (i.e. input sequences need not be prealigned).
versions available: 5.0.1, 5.0.1-ib, 5.1.2, 5.1.2-ib

Implementation of the Pairwise Sequentially Markovian Coalescent (PSMC) model
versions available: 0.6.5

PyTorch is a python package that provides two high-level features: Tensor computation (like numpy) with strong GPU acceleration, and Deep Neural Networks built on a tape-based autodiff system. Built with CUDA Toolkit 10.0, for GPUs with Compute Capabilities of 3.7, 6.1, and 7.0
versions available: 0.4.0-anaconda3-cuda9.2-sm3.7, 0.4.0-anaconda3-cuda9.2-sm6.1, 1.0.1-anaconda3-cuda10.0, 1.2.0-anaconda3-cuda10.0

QIIME (canonically pronounced chime) stands for Quantitative Insights Into Microbial Ecology. QIIME is an open-source bioinformatics pipeline for performing microbiome analysis from raw DNA sequencing data.
versions available: 1.9.1

QIIME 2 is a powerful, extensible, and decentralized microbiome analysis package with a focus on data and analysis transparency. QIIME 2 enables researchers to start an analysis with raw DNA sequence data and finish with publication-quality figures and statistical results.
versions available: 2018.6, 2018.8, 2019.1

QuickFlash provides high-performance data access and processing routines for large, multi-dimensional datasets in environments where memory and access to other software may be limited, such as on desktop computers or parallel supercomputer nodes.
versions available: 1.0.0, 1.0.0-ib

R is a language and environment for statistical computing and graphics. It is a GNU project which is similar to the S language and environment which was developed at Bell Labs, by John Chambers and colleagues. R can be considered as a different implementation of S.
versions available: 3.4.3, 3.5.0, 3.6.0

Racon is intended as a standalone consensus module to correct raw contigs generated by rapid assembly methods which do not include a consensus step. The goal of Racon is to generate genomic consensus which is of similar or better quality compared to the output generated by assembly methods which employ both error correction and consensus steps, while providing a speedup of several times compared to those methods.
versions available: 1.3.1

RAxML (Randomized Axelerated Maximum Likelihood) is a program for sequential and parallel Maximum Likelihood based inference of large phylogenetic trees.
versions available: 7.4.2, 7.4.2-mpi, 8.2.12, 8.2.12-mpi, 8.2.4, 8.2.4-mpi

REPdenovo is designed for constructing repeats directly from sequence reads. It is based on the idea of frequent k-mer assembly.
versions available: 0.0

RepeatMasker is a program that screens DNA sequences for interspersed repeats and low complexity DNA sequences. The output of the program is a detailed annotation of the repeats that are present in the query sequence as well as a modified version of the query sequence in which all the annotated repeats have been masked (default: replaced by Ns).
versions available: 4.0.8

RepeatModeler is a de-novo repeat family identification and modeling package. At the heart of RepeatModeler are two de-novo repeat finding programs (RECON and RepeatScout) which employ complementary computational methods for identifying repeat element boundaries and family relationships from sequence data. RepeatModeler assists in automating the runs of RECON and RepeatScout given a genomic database and uses the output to build, refine and classify consensus models of putative interspersed repeats.
versions available: 1.0.11

The purpose of the RepeatScout software is to identify repeat family sequences from genomes where hand-curated repeat databases (a la RepBase update) are not available.
versions available: 1.0.5

RMBlast is a RepeatMasker compatible version of the standard NCBI blastn program. The primary difference between this distribution and the NCBI distribution is the addition of a new program 'rmblastn' for use with RepeatMasker and RepeatModeler.
versions available: 2.6.0

Rosetta is a comprehensive software suite for modeling macromolecular structures. As a flexible, multi-purpose application, it includes tools for structure prediction, design, and remodeling of proteins and nucleic acids.
versions available: 2015.02, 2016.10, 2019.07

RStudio is an integrated development environment (IDE) for R. It includes a console, syntax-highlighting editor that supports direct code execution, as well as tools for plotting, history, debugging and workspace management.
versions available: 0.99, 1.1.442

Samtools is a suite of programs for interacting with high-throughput sequencing data, allowing you to read/write/edit/index/view SAM/BAM/CRAM format.
versions available: 0.1.18, 1.3.1, 1.6, 1.9

SAS (Statistical Analysis System) is a software suite developed by SAS Institute for advanced analytics, multivariate analyses, business intelligence, data management, and predictive analytics.
versions available: 9.4

SEER, Sequence Element (kmer) Enrichment Analysis, finds sequence elements (words/kmers) that are enriched in samples with a given phenotype, which may be binary or continuous.
versions available: 1.1.1-intel

SegNet is a Deep Convolutional Encoder-Decoder Architecture for Robust Semantic Pixel-Wise Labelling based on Caffe. Caffe is a deep learning framework made with expression, speed, and modularity in mind. It is developed by the Berkeley Vision and Learning Center (BVLC) and by community contributors.
versions available: 1.0.0, 1.0.0-cuda8, 1.0.0-cuda8-nodnn

SeqGL is a new group lasso-based algorithm to extract multiple transcription factor (TF) binding signals from ChIP- and DNase-seq profiles. Benchmarked on over 100 ChIP-seq experiments, SeqGL outperformed traditional motif discovery tools in discriminative accuracy and cofactor detection.
versions available: 1.1.4

Seqtk is a fast and lightweight tool for processing sequences in the FASTA or FASTQ format. It seamlessly parses both FASTA and FASTQ files which can also be optionally compressed by gzip.
versions available: 1.2
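
To show what parsing both formats with optional gzip compression involves, here is a minimal Python sketch using only the standard library. It is not how Seqtk is implemented (Seqtk is a C program); the names `open_maybe_gzip` and `read_sequences` are hypothetical, and the sketch assumes four-line FASTQ records, while FASTA sequences may span several lines.

```python
import gzip


def open_maybe_gzip(path):
    """Open a text file, transparently handling gzip compression."""
    with open(path, "rb") as fh:
        magic = fh.read(2)
    if magic == b"\x1f\x8b":  # gzip magic bytes
        return gzip.open(path, "rt")
    return open(path, "r")


def read_sequences(path):
    """Yield (name, sequence, quality) tuples; quality is None for FASTA."""
    with open_maybe_gzip(path) as fh:
        line = fh.readline()
        while line:
            line = line.rstrip("\n")
            if line.startswith(">"):
                # FASTA record: collect sequence lines until the next header.
                name = line[1:]
                chunks = []
                line = fh.readline()
                while line and not line.startswith((">", "@")):
                    chunks.append(line.strip())
                    line = fh.readline()
                yield name, "".join(chunks), None
            elif line.startswith("@"):
                # FASTQ record: header, sequence, '+' separator, quality.
                name = line[1:]
                seq = fh.readline().strip()
                fh.readline()  # skip the '+' separator line
                qual = fh.readline().strip()
                yield name, seq, qual
                line = fh.readline()
            else:
                line = fh.readline()  # skip blank or unexpected lines
```

Checking the first two bytes rather than the file extension means a compressed file is recognized even when it is not named `.gz`.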

ShortBRED (Short, Better Representative Extract Dataset) is a pipeline to take a set of protein sequences, reduce them to a set of unique identifying strings ('markers'), and then search for these markers in metagenomic data and determine the presence and abundance of the protein families of interest.
versions available: 0.9.5

SIESTA, a first-principles materials simulation code using DFT, is both a method and its computer program implementation, to perform efficient electronic structure calculations and ab initio molecular dynamics simulations of molecules and solids.
versions available: 3.2, 4.0.2

SLiM is an evolutionary simulation framework that combines a powerful engine for population genetic simulations with the capability of modeling arbitrarily complex evolutionary scenarios.
versions available: 2.5

SMOG v2 is a software package designed to allow the user to start with a structure of a biomolecule (e.g. a PDB file) and construct a structure-based model, which is then simulated using Gromacs or NAMD.
versions available: 2.0.2

sowhat automates the SOWH phylogenetic topology test. It works on amino acid, nucleotide, and binary character state datasets. Partitions (including codon position partitioning) can be specified.
versions available: 0.36

SPAdes (St. Petersburg genome assembler) is intended for both standard isolates and single-cell MDA bacteria assemblies.
versions available: 2.5.1, 3.11.1, 3.7.1
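
The version lists in this catalog are in lexical order, which is why 3.11.1 appears before 3.7.1 in the list above. If you need to pick the newest version in a script, a natural sort by numeric components may help. The following Python sketch is one way to do that; the helper name `version_key` is hypothetical, and the handling of suffixes such as `-mpi` is a simplifying assumption.

```python
import re


def version_key(version):
    """Split a version string into numeric and text parts for natural sorting."""
    parts = re.split(r"(\d+)", version)
    # Digit runs compare as integers; everything else compares as text.
    return [(0, int(p), "") if p.isdigit() else (1, 0, p) for p in parts if p]


if __name__ == "__main__":
    versions = ["2.5.1", "3.11.1", "3.7.1"]
    print(sorted(versions, key=version_key))
```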

The Sequence Read Archive (SRA) stores raw sequence data from 'next-generation' sequencing technologies including Illumina, 454, IonTorrent, Complete Genomics, PacBio and Oxford Nanopore. In addition to raw sequence data, SRA now stores alignment information in the form of read placements on a reference sequence. Includes NCBI VDB and NGS SDK.
versions available: 2.8.2-1, 2.9.4

Stacks is a software pipeline for building loci from short-read sequences, such as those generated on the Illumina platform. Stacks was developed to work with restriction enzyme-based data, such as RAD-seq, for the purpose of building genetic maps and conducting population genomics and phylogeography.
versions available: 1.4.7

Spliced Transcripts Alignment to a Reference (STAR) is an ultrafast universal RNA-seq aligner, which was developed to align a large (>80 billion reads) ENCODE Transcriptome RNA-seq dataset.
versions available: 2.7.0c

SUBREAD is a tool kit for processing next-gen sequencing data. It includes Subread aligner, Subjunc exon-exon junction detector and featureCounts read summarization program. Subread aligner can be used to align both gDNA-seq and RNA-seq reads.
versions available: 1.5.2

TensorFlow is an open source software library for numerical computation using data flow graphs. Nodes in the graph represent mathematical operations, while the graph edges represent the multidimensional data arrays (tensors) communicated between them.
versions available: 1.10-anaconda2-cuda9.2, 1.10-anaconda3-cuda9.2, 1.13-anaconda2-cuda10.0, 1.13-anaconda3-cuda10.0, 1.14-anaconda2-cuda10.0, 1.14-anaconda3-cuda10.0

TopHat is a fast splice junction mapper for RNA-Seq reads. It aligns RNA-Seq reads to mammalian-sized genomes using the ultra high-throughput short read aligner Bowtie, and then analyzes the mapping results to identify splice junctions between exons.
versions available: 2.1.1

Trim Galore is a wrapper around Cutadapt and FastQC to consistently apply adapter and quality trimming to FastQ files, with extra functionality for RRBS data.
versions available: 0.4.4

Trimmomatic is a fast, multithreaded command line tool that can be used to trim and crop Illumina (FASTQ) data as well as to remove adapters. These adapters can pose a real problem depending on the library preparation and downstream application.
versions available: 0.38

Trinity assembles transcript sequences from Illumina RNA-Seq data. Trinity combines three independent software modules: Inchworm, Chrysalis, and Butterfly, applied sequentially to process large volumes of RNA-seq reads.
versions available: 2.4.0, 2.8.5

The UDUNITS package supports units of physical quantities. Its C library provides for arithmetic manipulation of units and for conversion of numeric values between compatible units.
versions available: 2.2.19

Valgrind is an instrumentation framework for building dynamic analysis tools. There are Valgrind tools that can automatically detect many memory management and threading bugs, and profile your programs in detail. You can also use Valgrind to build new tools.
versions available: 3.13.0

VCFtools is a program package designed for working with VCF files, such as those generated by the 1000 Genomes Project. The aim of VCFtools is to provide easily accessible methods for working with complex genetic variation data in the form of VCF files.
versions available: 0.1.10, 0.1.15

Velvet is a sequence assembler for very short reads.
versions available: 1.2.10

VEP (Variant Effect Predictor) predicts the functional effects of genomic variants.
versions available: 95.1

The ViennaRNA Package consists of a C code library and several stand-alone programs for the prediction and comparison of RNA secondary structures.
versions available: 2.4.13

VisIt is an open source, interactive, scalable visualization, animation, and analysis tool.
versions available: 2.10.0, 2.10.0-ib, 2.10.3, 2.10.3-ib, 2.13.1, 2.13.1-ib

VMD is a molecular visualization program for displaying, animating, and analyzing large biomolecular systems using 3-D graphics and built-in scripting. (1.9.3 x86_64 64-bit, CUDA 8.0, SSE and AVX2, OpenGL)
versions available: 1.9.3-cuda75-text-egl, 1.9.3-cuda8-opengl, 1.9.3-text

Visual Studio Code is a lightweight but powerful source code editor which runs on your desktop and is available for Windows, macOS and Linux. It comes with built-in support for JavaScript, TypeScript and Node.js and has a rich ecosystem of extensions for other languages (such as C++, C#, Java, Python, PHP, Go) and runtimes (such as .NET and Unity).
versions available: 1.33

The Visualization Toolkit (VTK) is an open-source, freely available software system for 3D computer graphics, modeling, image processing, volume rendering, scientific visualization, and information visualization.
versions available: 6.2.0, 7.0.0, 7.0.0-mpi

A fast, memory efficient implementation of the Weighted Histogram Analysis Method (WHAM).
versions available: 2.0.9

Wise2 is a package focused on comparisons of biopolymers, commonly DNA and protein sequences.
versions available: 2.2.0, 2.4.1, 2.4.1-pthreads

The Weather Research and Forecasting (WRF) Model is a next-generation mesoscale numerical weather prediction system designed to serve both atmospheric research and operational forecasting needs.
versions available: 3.9.1-intel-mpi, 3.9.1-intel-serial, 4.0.1-intel-mpi, 4.0.1-intel-serial

wtdbg is a fuzzy Bruijn graph (FBG) approach to long noisy reads assembly. wtdbg is designed to assemble huge genomes in very limited time; it requires a PowerPC with multiple cores and very large RAM (1 TB+). wtdbg can assemble a 100X human PacBio dataset within one day.
versions available: 1.1, 2.3

XCrySDen is a crystalline and molecular structure visualisation program aiming at display of isosurfaces and contours, which can be superimposed on crystalline structures and interactively rotated and manipulated.
versions available: 1.5.60

Xerces-C++ is a validating XML parser written in a portable subset of C++. Xerces-C++ makes it easy to give your application the ability to read and write XML data. A shared library is provided for parsing, generating, manipulating, and validating XML documents using the DOM, SAX, and SAX2 APIs.
versions available: 3.2.1

Yade is an extensible open-source framework for discrete numerical models, focused on the Discrete Element Method. The computation parts are written in C++ using a flexible object model, allowing independent implementation of new algorithms and interfaces. Python is used for rapid and concise scene construction, simulation control, postprocessing and debugging.
versions available: 2017.01a

Yices 2 is an SMT solver that decides the satisfiability of formulas containing uninterpreted function symbols with equality, linear real and integer arithmetic, bitvectors, scalar types, and tuples.
versions available: 2.4.2

Z3 Theorem Prover is an efficient, high-performance SMT Solver being developed at Microsoft Research.
versions available: 4.4.1

Compilers / Interpreters

Anaconda (python 2.7.14-based) is the world’s most popular Python data science platform. Anaconda, Inc. continues to lead open source projects like Anaconda, NumPy and SciPy that form the foundation of modern data science. Load this module for CPU ONLY (NON-GPU) compute jobs.
versions available: 5.0.1, 5.0.1-cuda8

Anaconda (python 3.6.3-based) is the world’s most popular Python data science platform. Anaconda, Inc. continues to lead open source projects like Anaconda, NumPy and SciPy that form the foundation of modern data science. Load this module for CPU ONLY (NON-GPU) compute jobs.
versions available: 5.0.1, 5.0.1-cuda8

Bazel is Google's own build tool. Bazel has built-in support for building both client and server software, and also provides an extensible framework that you can use to develop your own build rules.
versions available: 0.18.1, 0.19.2, 0.21.0, 0.22.0

The GNU Compiler Collection includes front ends for C, C++, Objective-C, and Fortran, as well as libraries for these languages (libstdc++, libgcj, ...).
versions available: 6.4.0, 7.3.0, 8.2.0

The Glasgow Haskell Compiler (GHC) is a state-of-the-art, open source compiler and interactive environment for the functional language Haskell. Haskell is a polymorphically statically typed, lazy, purely functional computer programming language.
versions available: 7.10.3

Oracle's Java Platform, Standard Edition (Java SE JDK) lets you develop and deploy Java applications on desktops and servers. Java offers the rich user interface, performance, versatility, portability, and security that today's applications require.
versions available: 9.0.4

Julia is a high-level, high-performance dynamic programming language for numerical computing. It provides a sophisticated compiler, distributed parallel execution, numerical accuracy, and an extensive mathematical function library.
versions available: 1.0.1

PyPy is a replacement for CPython. PyPy implements Python 2.7.10. It supports all of the core language, passing the Python test suite (with minor modifications that were already accepted in the main python in newer versions). It is built using the RPython language that was co-developed with it. The main reason to use it instead of CPython is speed: it runs generally faster.
versions available: 5.3.1

Scala is an acronym for 'Scalable Language'. Scala is a pure-bred object-oriented language. Conceptually, every value is an object and every operation is a method-call. The language supports advanced component architectures through classes and traits.
versions available: 2.10.4, 2.11.7

YASM, an assembler and disassembler for the Intel x86 architecture, is a complete rewrite of the NASM assembler. YASM currently supports the x86 and AMD64 instruction sets, accepts NASM and GAS assembler syntaxes, outputs binary, ELF32, ELF64, 32 and 64-bit Mach-O, RDOFF2, COFF, Win32, and Win64 object formats, and generates source debugging information in STABS, DWARF 2, and CodeView 8 formats.
versions available: 1.3.0


Armadillo is a high quality linear algebra library (matrix maths) for the C++ language, aiming towards a good balance between speed and ease of use. It provides high-level syntax (API) deliberately similar to Matlab.
versions available: 8.400.0

ARPACK is a collection of Fortran77 subroutines designed to solve large scale eigenvalue problems. The package is designed to compute a few eigenvalues and corresponding eigenvectors of a general n by n matrix A.
versions available: 2.1

ARPACK-NG is a collection of Fortran77 subroutines designed to solve large scale eigenvalue problems.
versions available: 3.5.0

Boost is a set of libraries for the C++ programming language that provide support for tasks and structures such as linear algebra, pseudorandom number generation, multithreading, image processing, regular expressions, and unit testing.
versions available: 1.65.1

Ceres Solver is an open source C++ library for modeling and solving large, complicated optimization problems. It can be used to solve Non-linear Least Squares problems with bounds constraints and general unconstrained optimization problems.
versions available: 1.14.0, 1.14.0-cuda

clBLAS is a software library containing BLAS functions written in OpenCL.
versions available: 1.10

The NVIDIA CUDA Deep Neural Network library (cuDNN) is a GPU-accelerated library of primitives for deep neural networks. cuDNN provides highly tuned implementations for standard routines such as forward and backward convolution, pooling, normalization, and activation layers.
versions available: 6.0-cuda8, 7.0-cuda8, 7.0-cuda9, 7.2.1-cuda9, 7.2.1-cuda9.2, 7.4.2-cuda10, 7.4.2-cuda9.2

Dlib is a modern C++ toolkit containing machine learning algorithms and tools for creating complex software in C++ to solve real world problems. It is used in both industry and academia in a wide range of domains including robotics, embedded devices, mobile phones, and large high performance computing environments.
versions available: 19.10, 19.10-cuda8

Eigen is a C++ template library for linear algebra: matrices, vectors, numerical solvers, and related algorithms.
versions available: 3.3.6, 3.3.7

The GATB-CORE project provides a set of highly efficient algorithms to analyse NGS data sets. These methods enable the analysis of data sets of any size on multi-core desktop computers, including very large amounts of read data from any kind of organism, such as bacteria, plants, animals, and even complex samples.
versions available: 1.4.1

The OpenGL Extension Wrangler Library (GLEW) is a cross-platform open-source C/C++ extension loading library. GLEW provides efficient run-time mechanisms for determining which OpenGL extensions are supported on the target platform. OpenGL core and extension functionality is exposed in a single header file.
versions available: 1.13.0

Various Google codes, including Gflags v2.2.2 (Google's commandline flags library); Glog v0.4.0 (C++ implementation of the Google logging module); LevelDB v1.21 (a fast key-value storage library); Protocol Buffers v3.7.1 (Google's language-neutral, platform-neutral, extensible mechanism for serializing structured data).
versions available: 2015, 2019

HDF5 is a data model, library, and file format for storing and managing data. It supports an unlimited variety of datatypes, and is designed for flexible and efficient I/O and for high volume and complex data.
versions available: 1.10.2, 1.10.2-mpi, 1.8.16, 1.8.16-mpi

HTS Lib is a C library for high-throughput sequencing data formats.
versions available: 1.4.1, 1.6, 1.9

Intel runtime libraries
versions available: 16.0.0, 19.0.0

libgpuarray provides a common GPU ndarray (n-dimensional array) that can be reused by all projects.
versions available: 0.7.5

NetCDF is a set of software libraries and self-describing, machine-independent data formats that support the creation, access, and sharing of array-oriented scientific data.
versions available: 4.4.0, 4.4.0-mpi, 4.6.3, 4.6.3-mpi

OpenBLAS is an optimized BLAS (Basic Linear Algebra Subprograms) library based on GotoBLAS2 1.13 BSD version.
versions available: 0.2.18, 0.2.20

OpenCV (Open Source Computer Vision Library) is an open source computer vision and machine learning software library. OpenCV was built to provide a common infrastructure for computer vision applications and to accelerate the use of machine perception in commercial products.
versions available: 2.4.12, 2.4.12-cuda8, 3.1.0, 3.1.0-cuda, 4.1.0, 4.1.0-cuda

OpenGV is a library for solving calibrated central and non-central geometric vision problems. OpenGV stands for Open Geometric Vision. It contains classical central and more recent non-central absolute and relative camera pose computation algorithms, as well as triangulation and point-cloud alignment functionalities, all extended by non-linear optimization and RANSAC contexts.
versions available: 1.0

ROOT is a modular scientific software framework. It provides all the functionalities needed to deal with big data processing, statistical analysis, visualisation and storage.
versions available: 6.12.04

SuiteSparse is a suite of sparse matrix algorithms, including GraphBLAS, Mongoose, ssget, UMFPACK, CHOLMOD, SPQR, KLU and BTF, CSparse and CXSparse, spqr_rank, Factorize, SSMULT, SFMULT, and ordering methods (AMD, CAMD, COLAMD, and CCOLAMD); AMD and COLAMD appear in MATLAB.
versions available: 5.4.0, 5.4.0-cuda

SuperLU (Supernodal LU) is a general purpose library for the direct solution of large, sparse, nonsymmetric systems of linear equations. It supports both real and complex datatypes, both single and double precision, and 64-bit integer indexing.
versions available: 5.2.1

SuperLU_DIST (Supernodal LU for distributed memory) is a general purpose library for the direct solution of large, sparse, nonsymmetric systems of linear equations. It supports both real and complex datatypes, both single and double precision, and 64-bit integer indexing.
versions available: 5.3.0

NVIDIA TensorRT is a platform for high-performance deep learning inference. It includes a deep learning inference optimizer and runtime that delivers low latency and high-throughput for deep learning inference applications.
versions available:,,

Trilinos is an object-oriented software framework for the solution of large-scale, complex multi-physics engineering and scientific problems.
versions available: 11.14.1-mpi, 12.6.4-mpi

The ZeroMQ (0MQ) lightweight messaging kernel is a library which extends the standard socket interfaces with features traditionally provided by specialised messaging middleware products. 0MQ sockets provide an abstraction of asynchronous message queues, multiple messaging patterns, message filtering (subscriptions), seamless access to multiple transport protocols and more. This module includes czmq, which is a High-level C binding for 0MQ.
versions available: 4.1.4

MPI (Message Passing Interface)

MPICH is a high-performance and widely portable implementation of the Message Passing Interface (MPI) standard MPI-1, MPI-2 and MPI-3.
versions available: 3.2.1, 3.2.1-ib

The Open MPI Project is an open source MPI-2 implementation that is developed and maintained by a consortium of academic, research, and industry partners.
versions available: 1.10.0, 1.10.0-ib, 1.10.0-intel, 1.10.0-intel-ib, 1.10.0-pgi, 1.10.0-pgi-ib, 2.1.5, 2.1.5-intel, 2.1.5-pgi, 3.1.2, 3.1.2-intel, 3.1.2-pgi