This is the command pbbarcode that can be run in the OnWorks free hosting provider using one of our multiple free online workstations such as Ubuntu Online, Fedora Online, Windows online emulator or MAC OS online emulator
PROGRAM:
NAME
pbbarcode - annotate PacBio sequencing reads with barcode information
DESCRIPTION
The pbbarcode package provides utilities for annotating individual ZMWs directly from a
bas.h5 file, emitting fast[a|q] files for each barcode, labeling alignments stored in a
cmp.h5 file, and calling consensus on small amplicons (requires pbdagcon(1))
At the moment, Barcodes can be scored in two different ways: symmetric and paired.
Symmetric mode supports barcode designs with two identical barcodes on both sides of a
SMRTbell, e.g., for barcodes (A, B), molecules are labeled as A--A or B--B. The paired
mode supports designs with two distinct barcodes on each side of the molecule, but neither
barcode appears without its mate. The minimum example is given with the following
barcodes: (ALeft, ARight, BLeft, BRight), where the following barcode sets are checked:
ALeft--ARight, BLeft--BRight.
It is important to highlight that a barcode FASTA file specifies a list of available
barcodes to evaluate. Depending on the scoring mode, the barcodes are grouped together in
different ways. For instance, in the symmetric case, the number of possible barcode
outcomes are simply the number of barcodes that are supplied to the routine in the FASTA
file (see below for usage) plus an additional NULL barcode indicating that no barcode
could be evaluated (denoted by: '--'). Labels like this (A--A) are used in the final
outputs. In the paired mode, the number of possible barcode outcomes are half the number
of the sequences in the FASTA file plus the NULL barcode. The NULL barcode indicates that
no attempt was made to score the molecule or it was filtered out by the user's criteria.
The majority of cases when a molecule is not scored are related to not observing any
adapters. If a user has executed a "hot-start" run, the user can try the '--scoreFirst'
parameter to attempt to label the first adapter's barcode. This increases the yield of the
labeleing procedure at the expense of some probably false positives.
The software is implemented as a standard python package. Barcodes are labeled according
to the following high-level logic. For each molecule, all adapters are found. For each
adapter, we align (using standard Smith-Watterman alignment) each barcode and its reverse
complement to flanking sequence of the adapter. If two complete flanking sequences are
available, we divide by 2, else 1 if only one flanking sequence was available (average
score at adapter). This allows the scores across adapters to be on the same scale (chimera
detection). Depending on the mode, we then determine which barcode(s) are maximally
scoring. We store the two maximally scoring barcodes, the sum of their alignment scores
across the adapters. The average barcode score then can be given approximately by:
total-score/number-of-adapters. At the moment, the alignment parameters are fixed at:
┌──────────┬───────┐
│type │ score │
├──────────┼───────┤
│insertion │ -1 │
├──────────┼───────┤
│deletion │ -1 │
├──────────┼───────┤
│missmatch │ -2 │
├──────────┼───────┤
│match │ 2 │
└──────────┴───────┘
Input and output
labelZmws
usage: pbbarcode labelZmws [-h] [--outDir OUTDIR] [--outFofn OUTFOFN]
[--adapterSidePad ADAPTERSIDEPAD] [--insertSidePad INSERTSIDEPAD] [--scoreMode
{symmetric,paired}] [--maxAdapters MAXADAPTERS] [--scoreFirst]
[--startTimeCutoff STARTTIMECUTOFF] [--nZmws NZMWS] [--nProcs NPROCS]
[--saveExtendedInfo] barcode.fasta input.fofn
Creates a barcode.h5 file from base h5 files.
positional arguments:
barcode.fasta Input barcode fasta file input.fofn Input base
fofn
optional arguments:
-h, --help
show this help message and exit
--outDir OUTDIR
Where to write the newly created barcode.h5 files. (default:
/home/UNIXHOME/jbullard/projects/software/bioinformatics/tools/pbbarcode/doc)
--outFofn OUTFOFN
Write to outFofn (default: barcode.fofn)
--adapterSidePad ADAPTERSIDEPAD
Pad with adapterSidePad bases (default: 4)
--insertSidePad INSERTSIDEPAD
Pad with insertSidePad bases (default: 4)
--scoreMode {symmetric,paired}
The mode in which the barcodes should be scored. (default: symmetric)
--maxAdapters MAXADAPTERS
Only score the first maxAdapters (default: 20)
--scoreFirst
Whether to try to score the leftmost barcode in a trace. (default: False)
--startTimeCutoff STARTTIMECUTOFF
Reads must start before this value in order to be included when
scoreFirst is set. (default: 10.0)
--nZmws NZMWS
Use the first n ZMWs for testing (default: -1)
--nProcs NPROCS
How many processes to use (default: 8)
--saveExtendedInfo
Whether to save extended information tothe barcode.h5 files; this
information is useful for debugging and chimera detection (default:
False)
The labelZmws command takes an input.fofn representing a set of bas.h5 files to operate
on. Additionally, it takes a barcode.fasta file. Depending on scoreMode, the FASTA file
will be processed in different ways. Specifically, in paired mode, each two consecutive
barcodes in the file are considered a set.
The parameters, adapterSidePad and insertSidePad represents how many bases should be
considered on each side of the putative barcode. These parameters are constrained such
that: |adapterSidePad| + |insertSidePad| + |barcode| < 65.
Users have the option to specify a different output location for the various outputs.
Specifically, for each bas.h5 file in input.fofn, a bc.h5 (barcode hdf5) file is
generated. These files are listed in the file outFofn which is typically just called
barcode.fofn. See below for a description of the barcode hdf5 file.
labelAlignments
usage: pbbarcode labelAlignments [-h]
[--minAvgBarcodeScore MINAVGBARCODESCORE] [--minNumBarcodes MINNUMBARCODES]
[--minScoreRatio MINSCORERATIO] barcode.fofn aligned_reads.cmp.h5
Adds information about barcode alignments to a cmp.h5 file from a previous call to
"labelZmws".
positional arguments:
barcode.fofn input barcode fofn file aligned_reads.cmp.h5 cmp.h5 file
to add barcode labels
optional arguments:
-h, --help
show this help message and exit
--minAvgBarcodeScore MINAVGBARCODESCORE
ZMW Filter: exclude ZMW if average barcode score is less than this value
(default: 0.0)
--minNumBarcodes MINNUMBARCODES
ZMW Filter: exclude ZMW if number of barcodes observed is less than this
value (default: 1)
--minScoreRatio MINSCORERATIO
ZMW Filter: exclude ZMWs whose best score divided by the 2nd best score
is less than this ratio (default: 1.0)
The labelAlignments command takes as input a barcode.fofn computed from a call to
labelZMWs and a cmp.h5 file where the barcode information is written to. See below for a
description of the cmp.h5 file additions.
emitFastqs
usage: pbbarcode emitFastqs [-h] [--outDir output.dir] [--subreads]
[--unlabeledZmws] [--trim TRIM] [--fasta] [--minMaxInsertLength
MINMAXINSERTLENGTH] [--hqStartTime HQSTARTTIME] [--minReadScore MINREADSCORE]
[--minAvgBarcodeScore MINAVGBARCODESCORE] [--minNumBarcodes MINNUMBARCODES]
[--minScoreRatio MINSCORERATIO] input.fofn barcode.fofn
Takes a bas.h5 fofn and a barcode.h5 fofn and produces a fast[a|q] file for each
barcode.
positional arguments:
input.fofn input base or CCS fofn file barcode.fofn input
barcode.h5 fofn file
optional arguments:
-h, --help
show this help message and exit
--outDir output.dir output directory to write fastq files (default: /home/
UNIXHOME/jbullard/projects/software/bioinformatics/too ls/pbbarcode/doc)
--subreads
whether to produce fastq files for the subreads;the default is to use the
CCS reads. This option onlyapplies when input.fofn has both consensus and
raw reads,otherwise the read type from input.fofn will be returned.
(default: False)
--unlabeledZmws
whether to emit a fastq file for the unlabeled ZMWs. These are the ZMWs
where no adapters are found typically (default: False)
--trim TRIM
trim off barcodes and any excess constant sequence (default: 20)
--fasta
whether the files produced should be FASTA files asopposed to FASTQ
(default: False)
--minMaxInsertLength MINMAXINSERTLENGTH
ZMW Filter: exclude ZMW if the longest subreadis less than this amount
(default: 0)
--hqStartTime HQSTARTTIME
ZMW Filter: exclude ZMW if start time of HQ regiongreater than this value
(seconds) (default: inf)
--minReadScore MINREADSCORE
ZMW Filter: exclude ZMW if readScore is less thanthis value (default: 0)
--minAvgBarcodeScore MINAVGBARCODESCORE
ZMW Filter: exclude ZMW if average barcode score is less than this value
(default: 0.0)
--minNumBarcodes MINNUMBARCODES
ZMW Filter: exclude ZMW if number of barcodes observed is less than this
value (default: 1)
--minScoreRatio MINSCORERATIO
ZMW Filter: exclude ZMWs whose best score divided by the 2nd best score
is less than this ratio (default: 1.0)
The emitFastqs command takes as input both an input.fofn for the bas.h5 files as well as a
barcode.fofn from a call to labelZmws. The optional parameter outDir dictates where the
files will be written. For each detected barcode, a fast[a|q] file will be emitted with
all of the reads for that barcode. The trim parameter dictates how much of the read should
be trimmed off. The default parameter for trim is the length of the barcode (which is
stored in the barcode hdf5 files). At the moment, all barcodes in the barcode FASTA file
must be the same length, therefore only a constant trim value is supported. In practice,
one can aggressively trim in order to ensure that extra bases aren't left on the ends of
reads. Finally, the subreads parameter dictates whether subreads or CCS reads should be
returned with the default being the appropriate reads according to the input file type,
either CCS or subreads. This parameter is only inspected if the input.fofn contains both
CCS and subread data, if the input.fofn contains only subread or CCS data then that is
returned irrespective of the state of the the subreads parameter and a warning is issued.
consensus
usage: pbbarcode consensus [-h] [--subsample SUBSAMPLE] [--nZmws NZMWS]
[--outDir OUTDIR] [--keepTmpDir] [--ccsFofn CCSFOFN] [--nProcs NPROCS]
[--noQuiver] [--minMaxInsertLength MINMAXINSERTLENGTH] [--hqStartTime
HQSTARTTIME] [--minReadScore MINREADSCORE] [--minAvgBarcodeScore
MINAVGBARCODESCORE] [--minNumBarcodes MINNUMBARCODES] [--minScoreRatio
MINSCORERATIO] [--barcode BARCODE [BARCODE ...]] input.fofn barcode.fofn
Compute consensus sequences for each barcode.
positional arguments:
input.fofn input bas.h5 fofn file barcode.fofn input bc.h5
fofn file
optional arguments:
-h, --help
show this help message and exit
--subsample SUBSAMPLE
Subsample ZMWs (default: 1)
--nZmws NZMWS
Take n ZMWs (default: -1)
--outDir OUTDIR
Use this directory to output results (default: .)
--keepTmpDir --ccsFofn CCSFOFN Obtain CCS data from ccsFofn instead of
input.fofn
(default: )
--nProcs NPROCS
Use nProcs to execute. (default: 16)
--noQuiver --minMaxInsertLength MINMAXINSERTLENGTH
ZMW Filter: exclude ZMW if the longest subreadis less than this amount
(default: 0)
--hqStartTime HQSTARTTIME
ZMW Filter: exclude ZMW if start time of HQ regiongreater than this value
(seconds) (default: inf)
--minReadScore MINREADSCORE
ZMW Filter: exclude ZMW if readScore is less thanthis value (default: 0)
--minAvgBarcodeScore MINAVGBARCODESCORE
ZMW Filter: exclude ZMW if average barcode score is less than this value
(default: 0.0)
--minNumBarcodes MINNUMBARCODES
ZMW Filter: exclude ZMW if number of barcodes observed is less than this
value (default: 1)
--minScoreRatio MINSCORERATIO
ZMW Filter: exclude ZMWs whose best score divided by the 2nd best score
is less than this ratio (default: 1.0)
--barcode BARCODE [BARCODE ...]
Use this to extract consensus for just one barcode. (default: None)
The emitFastqs command takes as input both an input.fofn for the bas.h5 files as well as a
barcode.fofn from a call to labelZmws. The results are a FASTA file with an entry for each
barcode containing the consensus amplicon sequence. This mode utilizes Quiver and pbdagcon
to compute consensus.
In cases where the amplicon is fewer than 2.5k bases, using CCS data is quite helpful. The
--ccsFofn allows one to pass directly the ccs files. In many cases, both the CCS and raw
basecalls are in the same file so you can check by passing the same parameter to
input.fofn as to ccsFofn.
Dependencies
The pbbarcode package depends on a standard pbcore installation (‐
https://github.com/PacificBiosciences/pbcore). If one wishes to use the consensus tool,
pbdagcon needs to be installed (https://github.com/PacificBiosciences/pbdagcon).
Barcode HDF5 File
The barcode hdf5 file, bc.h5, represents a simple data store for barcode calls and their
scores for each ZMW. Generally, a user need not interact with barcode hdf5 files, but can
use the results stored in either the resulting cmp.h5 file or fast[a|q] files. The barcode
hdf5 file contains the following structure:
/BarcodeCalls/best - (nZMWs, 6)[32-bit integer] dataset with the following columns:
holeNumber,nAdapters,barcodeIdx1,barcodeScore1,barcodeIdx2,barcodeScore2
Additionally, the best dataset has the following attributes:
┌────────────┬─────────────────────────────────────────────────────────────────┐
│movieName │ m120408_042614_richard_c100309392550000001523011508061222_s1_p0 │
├────────────┼─────────────────────────────────────────────────────────────────┤
│columnNames │ holeNumber,nAdapters,barcodeIdx1,barcodeScore1,barcodeIdx2, │
│ │ barcodeScore2 │
└────────────┴─────────────────────────────────────────────────────────────────┘
│scoreMode │ [symmetric|paired] │
├────────────┼─────────────────────────────────────────────────────────────────┤
│barcodes │ 'bc_1', 'bc_2', ...., 'bc_N' │
└────────────┴─────────────────────────────────────────────────────────────────┘
The two barcodeIdx1 and barcodeIdx2 columns are indices into barcodes attribute. The
scoreMode is scoring mode used to align the barcodes. The barcodes attribute correspond to
the barcode.fasta sequence names.
Additionally, in some circumstances, it is useful to retain the entire history of the
scoring, i.e., each barcode scored to each adapter across all ZMWs. In oder to retain this
information, one must call:
pbbarcode labelZmws --saveExtendedInfo ...
In this mode, the resultant HDF5 file will have an additional dataset under the
BarcodeCalls group, named: all. This dataset has the following format:
/BarcodeCalls/all - (nbarcodes * nadapters[zmw_i], 4) forall i in 1 ... nZMWs
`holeNumber, adapterIdx, barcodeIdx, score`
The adapterIdx is the index of the adapter along the molecule, i.e., adapterIdx 1 is the
first adapter scored.
Additions to the compare HDF5 (cmp.h5) File
In addition to the barcode hdf5 file, a call to labelAlignments will annotate a cmp.h5
file. This annotation is stored in ways consistent with the cmp.h5 file format.
Specifically, a new group:
/BarcodeInfo/
ID (nBarcodeLabels + 1, 1)[32-bit integer]
Name (nBarcodeLabels + 1, 1)[variable length string]
In addition to the /BarcodeInfo/ group, the key dataset which assigns alignments to
barcodes is located at:
/AlnInfo/Barcode (nAlignments, 3)[32-bit integer] with the following colums:
index,count,bestIndex,bestScore,secondBestIndex,secondBestScore
Here index refers to the index into the Name vector, score corresponds to the sum of the
scores for the barcodes, and finally, count refers to the number of adapters found in the
molecule.
December 2015 PBBARCODE(1)
Use pbbarcode online using onworks.net services