
SmrtLink and UVA HPC

Description

PacBio’s open-source SMRT Analysis software suite is designed for use with Single Molecule, Real-Time (SMRT) Sequencing data. You can analyze, visualize, and manage your data through an intuitive GUI or command-line interface. You can also integrate SMRT Analysis into your existing data workflow through the extensive set of APIs provided.

Software Category: bio

For detailed information, visit the SmrtLink website.

Available Versions

To find the available versions and learn how to load them, run:

module spider smrtlink

The output of the command shows the available SmrtLink module versions.

For detailed information about a particular SmrtLink module, including how to load the module, run the module spider command with the module’s full version label. For example:

module spider smrtlink/13.1.0.221970

Module	Version	Module Load Command
smrtlink	13.1.0.221970	module load smrtlink/13.1.0.221970
smrtlink	25.2.0	module load smrtlink/25.2.0

Using SmrtLink

On the HPC system, SmrtLink tools can be run only in non-interactive mode, without a graphical user interface. Several SmrtLink tools can run on multiple CPU cores. Commonly used tools include blasr, ngmlr, pbalign, and the pbsv commands (pbsv align and pbsv call).

In a Slurm job script, the number of CPU cores requested per task is available in the environment variable SLURM_CPUS_PER_TASK.
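Slurm only sets SLURM_CPUS_PER_TASK when the job requests --cpus-per-task, so a script that is also tested outside a job may want a fallback. The following is a minimal sketch of that idea; the variable name NPROC is an assumption chosen for illustration and is not defined by Slurm or SmrtLink.

```shell
# NPROC is an illustrative name, not something Slurm or SmrtLink defines.
# Fall back to a single thread when SLURM_CPUS_PER_TASK is not set,
# for example when the script is run outside of a Slurm job.
NPROC="${SLURM_CPUS_PER_TASK:-1}"
echo "Running with ${NPROC} thread(s)"
```

The value can then be passed to a tool's thread option, for example --nproc "$NPROC".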

PacBio BAM files

The BAM format is a binary compressed format for raw or aligned sequence reads. The associated SAM format is a text representation of the same data. Both formats are defined in the SAM/BAM specification.

PacBio-produced BAM files are a fully compatible extension of the BAM specification. In addition to the typical BAM headers, the PacBio BAM header includes @RG (read group) entries with the following tags: ID, PL, PM, PU, DS. This means that SmrtLink tools that require a PacBio BAM file will not accept generic (i.e. non-PacBio) BAM files.

A more detailed description of the PacBio BAM format is available in the PacBio BAM format specification.
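As a rough pre-check before submitting a job, the header tags listed above can be tested with standard text tools. The sketch below defines a hypothetical helper, has_pacbio_rg, that reads SAM header text on standard input; the sample header values are made up for illustration. The check is necessary but not sufficient: a file that passes may still be rejected by a SmrtLink tool for other reasons.

```shell
# Hypothetical helper: reads SAM header text on stdin and succeeds only if
# at least one @RG line carries all five tags ID, PL, PM, PU and DS.
# Assumes each tag appears at most once per line, as the SAM format requires.
has_pacbio_rg() {
    awk -F'\t' '
        $1 == "@RG" {
            n = 0
            for (i = 2; i <= NF; i++) {
                tag = substr($i, 1, 3)
                if (tag == "ID:" || tag == "PL:" || tag == "PM:" ||
                    tag == "PU:" || tag == "DS:") n++
            }
            if (n == 5) found = 1
        }
        END { exit (found ? 0 : 1) }
    '
}

# Made-up sample header, used only to demonstrate the helper.
printf '@HD\tVN:1.5\tSO:unknown\n@RG\tID:abc123\tPL:PACBIO\tDS:READTYPE=SUBREAD\tPU:movie1\tPM:SEQUEL\n' \
    | has_pacbio_rg && echo "PacBio-style read group found"
```

On a real file the header could be supplied with samtools view -H reads.bam | has_pacbio_rg, assuming samtools is available in the environment.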

Running blasr

blasr is a read-mapping program that maps reads to positions in a genome by clustering short exact matches between the read and the genome, and scoring the clusters by alignment. The matches are generated by searching all suffixes of a read against the genome using a suffix array, and global chaining methods are used to score clusters of matches. When the suffix array index of a genome is not specified, the suffix array is built before any alignment is produced. This may be prohibitively slow when the genome is large (e.g. human). It is best to precompute the suffix array of a genome using the program sawriter, and then specify the suffix array on the command line using -sa genome.fa.sa.
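The precompute-then-reuse pattern can be sketched as a small helper that adds the -sa option only when the suffix array file already exists. This is an illustration under stated assumptions: the helper name blasr_cmd and all file names are hypothetical, and the helper only prints the command line rather than running blasr.

```shell
# All file names below are placeholders.
REF=genome.fasta
SA="${REF}.sa"
NPROC="${SLURM_CPUS_PER_TASK:-1}"

# Hypothetical helper: print the blasr command line, adding -sa only when
# the precomputed suffix array file is present.
blasr_cmd() {
    set -- blasr reads.bam "$REF" --bam --out alignment.bam --nproc "$NPROC"
    if [ -f "$SA" ]; then
        set -- "$@" -sa "$SA"
    fi
    echo "$*"
}

# The suffix array would be built once beforehand, for example with:
#   sawriter "$SA" "$REF"
blasr_cmd
```

Printing the command first makes it easy to confirm that the index is being picked up before committing cluster time to a long alignment.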

Command line arguments

The only required inputs to blasr are a file of reads (FASTA, PacBio BAM, or bax.h5) and a reference genome (FASTA format). Although reads may be input in FASTA format, the recommended input is PacBio BAM because these files contain quality value information that is used in the alignment and leads to higher-quality variant detection.

--out : specifies the output file. Although alignments can be output in various formats, the recommended output format is PacBio BAM, selected with the --bam option.
--nproc : sets the number of threads used. Match it to the number of CPU cores requested for the Slurm job.

To get a more complete description of all available command line options run this command:

blasr --help

Slurm script example

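#!/bin/bash
#SBATCH -A mygroup
#SBATCH -p standard
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --mem-per-cpu=9000
#SBATCH --time=06:00:00

module purge
# Load a version that is listed by "module spider smrtlink"
module load smrtlink/5.1.0.26412
# Align PacBio BAM reads to the reference using all requested cores
blasr reads.bam genome.fasta --bam --out alignment.bam --nproc "$SLURM_CPUS_PER_TASK"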
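Because memory is requested per core, the total for the job is the product of the two settings: 8 cores × 9000 MB = 72000 MB in the script above (Slurm reads --mem-per-cpu in megabytes unless a unit suffix is given). A minimal sketch of the same arithmetic follows; the variable names are illustrative, and the fallback values simply mirror the example script for use outside a job.

```shell
# Inside a job Slurm exports SLURM_CPUS_PER_TASK and SLURM_MEM_PER_CPU;
# the fallbacks mirror the example script so the snippet also runs elsewhere.
CPUS="${SLURM_CPUS_PER_TASK:-8}"
MEM_PER_CPU_MB="${SLURM_MEM_PER_CPU:-9000}"

# Total memory available to the job, in megabytes.
TOTAL_MB=$(( CPUS * MEM_PER_CPU_MB ))
echo "Job memory: ${TOTAL_MB} MB across ${CPUS} core(s)"
```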

Running pbalign

pbalign maps PacBio sequences to references using predefined, selectable alignment algorithms (blasr, bowtie, or gmap). The input can be a FASTA, pls.h5, bas.h5, or ccs.h5 file, or a fofn (file of file names). The output can be in SAM or BAM format. If the output is in BAM format, the aligner must be blasr, and QVs (quality values) are loaded automatically.
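A fofn is a plain text file with one input path per line. As a minimal sketch, the hypothetical helper below collects matching files from a directory into a fofn; the helper name, the directory, and the file name pattern in the example are assumptions for illustration.

```shell
# Hypothetical helper: write a fofn that lists every regular file in a
# directory matching a name pattern, one absolute path per line, sorted.
#   $1 = directory to search (not its subdirectories)
#   $2 = file name pattern, quoted so the shell does not expand it
#   $3 = output fofn
make_fofn() {
    find "$(cd "$1" && pwd)" -maxdepth 1 -type f -name "$2" | sort > "$3"
}

# Example call (paths are placeholders):
#   make_fofn ./run1 '*.subreads.bam' input.fofn
```

The resulting file can then be given to pbalign in place of a single input file.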

Command line arguments

pbalign expects a minimum of three command line arguments: the name of the sequence input file (set of subreads or unaligned sequences in PacBio BAM format), the path to the reference files (Fasta format), and a filename for the computed alignment results. The blasr algorithm is used by default.

--nproc : specifies how many CPU cores are available for the alignment computation. Match this value to the number of requested CPU cores via the SLURM_CPUS_PER_TASK environment variable.

To get a more complete description of all available command line options run this command:

pbalign --help

Slurm script example

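#!/bin/bash
#SBATCH -A mygroup
#SBATCH -p standard
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --mem-per-cpu=6000
#SBATCH --time=06:00:00

module purge
# Load a version that is listed by "module spider smrtlink"
module load smrtlink/5.1.0.26412
# Replace the <...> placeholders with actual file names before submitting
pbalign --nproc "$SLURM_CPUS_PER_TASK" <inputFilename> <referencePath> <outputFilename>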

Running ngmlr

ngmlr is a long-read mapper designed to align PacBio or Oxford Nanopore reads to a reference genome. It is optimized for structural variation detection.

Command line arguments

ngmlr is typically run with these command line arguments (defaults are shown in square brackets):

-r : specifies the path to the reference genome (FASTA/Q, can be gzipped)
-q : specifies the path to the read file (FASTA/Q) [/dev/stdin]
-o : specifies the path to the output file [stdout]
-t : specifies how many CPU cores are available for the alignment computation. Match this value to the number of requested CPU cores via the SLURM_CPUS_PER_TASK environment variable.

To get a more complete description of all available command line options run this command:

ngmlr --help

Slurm script example

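#!/bin/bash
#SBATCH -A mygroup
#SBATCH -p standard
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --mem-per-cpu=9000
#SBATCH --time=06:00:00

module purge
# Load a version that is listed by "module spider smrtlink"
module load smrtlink/5.1.0.26412
# Replace the <...> placeholders with actual file names; -o is optional
ngmlr -t "$SLURM_CPUS_PER_TASK" -r <reference> -q <reads> [-o <output>]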

Running sawriter

sawriter creates a suffix array from one or more FASTA input files for a reference genome. The suffix array provides an additional index that speeds up the mapping step (e.g. with blasr). This is particularly useful when handling large reference files (i.e. larger than bacterial genomes).

Command line arguments

sawriter expects at least two command line arguments. The first specifies the output file, e.g. sa_outputfile; the remaining ones specify the input files in FASTA format. At least one FASTA file has to be provided. Additional FASTA files can be appended to the end of the command, separated by whitespace.

To get a more complete description of all available command line options run this command:

sawriter --help

Slurm script example

#!/bin/bash
#SBATCH -A mygroup
#SBATCH -p standard
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1 # multi-threading not supported
#SBATCH --mem-per-cpu=9000
#SBATCH --time=06:00:00

module purge
# Load a version that is listed by "module spider smrtlink"
module load smrtlink/5.1.0.26412
# First argument is the output file; one or more FASTA inputs follow
sawriter sa_outputfile input1.fasta # or multiple input files: input1.fasta input2.fasta input3.fasta ...

Updated June 22, 2019 | HPC, software, bio multi-core