NAME
swarm — find clusters of nearly-identical nucleotide amplicons
SYNOPSIS
swarm [ options ] filename
DESCRIPTION
Environmental or clinical molecular studies generate large volumes of amplicons (e.g., 16S
or 18S SSU-rRNA sequences) that need to be clustered into molecular operational taxonomic
units (OTUs). Common clustering methods are based on greedy, input-order dependent
algorithms, with arbitrary selection of global cluster size and cluster centroids. To
address that problem, we developed swarm, a fast and robust method that recursively groups
amplicons with d or fewer differences. swarm produces natural and stable clusters centered
on local peaks of abundance, free from the input-order dependency induced by centroid selection.
Exact clustering is impractical on large data sets when using a naïve all-vs-all approach
(more precisely a 2-combination without repetitions), as it implies unrealistic numbers of
pairwise comparisons. swarm is based on a maximum number of differences d between two
amplicons, and focuses only on very close local relationships. For d = 1 (default value),
swarm uses an algorithm of linear complexity that performs exact-string matching by
comparing hash-values. For d = 2 or greater, swarm uses an algorithm of quadratic
complexity that performs pairwise string comparisons. Efficient k-mer-based filtering
and an astute use of comparison results obtained during the clustering process avoid
most of the amplicon comparisons needed in a naïve approach. To speed up the
remaining amplicon comparisons, swarm implements an extremely fast Needleman-Wunsch
algorithm making use of the Streaming SIMD Extensions (SSE2) of modern x86-64 CPUs. If
SSE2 instructions are not available, swarm exits with an error message.
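The hash-based strategy used for d = 1 can be illustrated with a minimal sketch. It assumes (this is not taken from swarm's source code) that every sequence at one difference from a seed is generated and then looked up in a hash set of input sequences, so the work per amplicon depends on sequence length and not on the number of amplicons. The function names microvariants and neighbours are hypothetical.

```python
def microvariants(seq, alphabet="acgt"):
    """Return every sequence at exactly one difference from seq."""
    variants = set()
    for i in range(len(seq)):
        # deletion of the nucleotide at position i
        variants.add(seq[:i] + seq[i + 1:])
        for base in alphabet:
            if base != seq[i]:
                # substitution at position i
                variants.add(seq[:i] + base + seq[i + 1:])
    for i in range(len(seq) + 1):
        for base in alphabet:
            # insertion before position i
            variants.add(seq[:i] + base + seq[i:])
    return variants


def neighbours(seed, pool):
    """Amplicons of pool at one difference from seed, found by hash lookups only."""
    return sorted(microvariants(seed) & pool)


pool = {"acgt", "acgg", "acg", "aacgt", "ttgt"}
print(neighbours("acgt", pool))  # ['aacgt', 'acg', 'acgg']
```

In this toy pool, "ttgt" is not reported because it lies two differences away from the seed.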
swarm reads the named input filename, a fasta file of nucleotide amplicons. The amplicon
identifier is defined as the string comprised between the ">" symbol and the first space
or the end of the line, whichever comes first. As swarm outputs lists of amplicon
identifiers, amplicon identifiers must be unique to avoid ambiguity; swarm exits with an
error message if identifiers are not unique. Amplicon identifiers must end with a "_"
followed by a positive integer representing the amplicon copy number (or abundance
annotation; usearch/vsearch users can use the option -z to change that behavior).
Abundance annotations play a crucial role in the clustering process, and swarm exits with
an error message if that information is not available. The amplicon sequence is defined as
a string of [acgt] or [acgu] symbols (case insensitive), starting after the end of the
identifier line and ending before the next identifier line or the file end; swarm exits
with an error message if any other symbol is present.
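The input rules above can be checked before running swarm. The following is a minimal sketch of such a check, not swarm's own parser; the function name parse_fasta is hypothetical, and the options -a and -z are not modelled.

```python
import re

ABUNDANCE = re.compile(r"_([0-9]+)$")


def parse_fasta(text):
    """Return {identifier: (abundance, sequence)} or raise ValueError."""
    records = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # identifier: text between ">" and the first space or end of line
            identifier = line[1:].split(" ", 1)[0]
            match = ABUNDANCE.search(identifier)
            if match is None or int(match.group(1)) < 1:
                raise ValueError(f"missing abundance annotation: {identifier!r}")
            if identifier in records:
                raise ValueError(f"duplicated identifier: {identifier!r}")
            current = identifier
            records[current] = [int(match.group(1)), ""]
        else:
            if current is None:
                raise ValueError("sequence found before the first identifier line")
            sequence = line.lower()
            if set(sequence) - set("acgtu"):
                raise ValueError(f"illegal symbol in sequence of {current!r}")
            records[current][1] += sequence
    return {name: (count, seq) for name, (count, seq) in records.items()}


example = ">a_3 some description\nACGT\nacgu\n>b_1\nttga\n"
print(parse_fasta(example))
```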
General options
-b, --boundary positive integer
when using the option --fastidious (-f), define the minimum mass of a large OTU
as the number given with this option. The default value is 3, indicating that any
OTU with mass 3 or more is considered "large". By default, an OTU is "small" if
it has a mass of 2 or less, meaning that it is composed of either one amplicon of
abundance 2, or two amplicons of abundance 1. Any positive value greater than 1
can be specified. Using higher boundary values will speed up the second pass, but
also reduce the taxonomical resolution of swarm results.
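The distinction between small and large OTUs can be written down as a short sketch. It assumes that the mass of an OTU is the sum of the abundances of its amplicons, as the examples above suggest; the function names are hypothetical.

```python
def mass(abundances):
    """Total abundance of the amplicons forming an OTU."""
    return sum(abundances)


def is_large(abundances, boundary=3):
    """An OTU is large when its mass reaches the boundary (default 3)."""
    return mass(abundances) >= boundary


print(is_large([2]))        # False: one amplicon of abundance 2
print(is_large([1, 1]))     # False: two amplicons of abundance 1
print(is_large([2, 1]))     # True: mass 3
```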
-c, --ceiling positive integer
when using the option --fastidious (-f), define swarm's maximum memory footprint
(in megabytes). swarm will adjust the --bloom-bits (-y) value of the Bloom filter
to fit within the specified amount of memory. That option is not active by
default.
-d, --differences zero or positive integer
maximum number of differences allowed between two amplicons, meaning that two
amplicons will be grouped if they have integer (or fewer) differences. This is
swarm's most important parameter. The number of differences is calculated as the
number of mismatches (substitutions, insertions or deletions) between the two
amplicons once the optimal pairwise global alignment has been found (see
"pairwise alignment advanced options" to influence that step). Any integer
between 0 and 256 can be used, but high d values will decrease the taxonomical
resolution of swarm results. Commonly used d values are 1, 2 or 3, rarely higher.
When using d = 0, swarm will output results corresponding to a strict
dereplication of the dataset, i.e. merging identical amplicons. Warning, swarm
still requires fasta entries to present abundance values. Default number of
differences is 1.
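As an illustration of how differences are counted, here is a minimal sketch that uses the unit-cost edit distance (each substitution, insertion or deletion counts as one). This is a simplification: swarm counts mismatches on the optimal alignment found with its own scoring parameters, which may occasionally give a different number. The function names are hypothetical.

```python
def differences(a, b):
    """Unit-cost edit distance between two sequences (dynamic programming)."""
    previous = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        current = [i]
        for j, y in enumerate(b, start=1):
            current.append(min(previous[j] + 1,              # deletion
                               current[j - 1] + 1,           # insertion
                               previous[j - 1] + (x != y)))  # match or substitution
        previous = current
    return previous[-1]


def would_group(a, b, d=1):
    """Two amplicons are grouped if they have d or fewer differences."""
    return differences(a, b) <= d


print(differences("acgt", "acga"))       # 1
print(would_group("acgt", "ttgt"))       # False
print(would_group("acgt", "ttgt", d=2))  # True
```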
-f, --fastidious
when working with d = 1, perform a second clustering pass to reduce the number of
small OTUs (recommended option). During the clustering process with d = 1, an
intermediate amplicon can be missing for purely stochastic reasons, interrupting
the aggregation process. That option will create virtual amplicons, making it possible to
graft small OTUs upon bigger ones. By default, an OTU is "small" if it has a mass
of 2 or less (see the --boundary option to increase that value). To speed things
up, swarm uses a Bloom filter to store intermediate results. Warning, that second
pass can be 2 to 3 times slower than the first pass and requires much more
memory. See the options --bloom-bits (-y) or --ceiling (-c) to control the memory
footprint of the Bloom filter. Warning, the fastidious option modifies clustering
results. The output files produced by the options --log (-l), --output-file (-o),
--mothur (-r), --uclust-file, and --seeds (-w) are updated to reflect these
modifications; the file --statistics-file (-s) is partially updated (columns 6
and 7 are not updated); the output file --internal-structure (-i) is not updated.
-h, --help
display this help and exit.
-n, --no-otu-breaking
deactivate the built-in OTU refinement (not recommended). Amplicon abundance
values are used to identify transitions among in-contact OTUs and to separate
them, yielding higher-resolution clustering results. That option prevents that
separation, and in practice, allows the creation of a link between amplicons A
and B, even if the abundance of B is higher than the abundance of A.
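The effect of this option on a single link can be sketched as follows. The sketch assumes that A is an amplicon already in the OTU and B is a candidate neighbour, and that by default a link is refused only when B is more abundant than A; the function name may_link is hypothetical and the real refinement logic may involve more than this test.

```python
def may_link(abundance_a, abundance_b, no_otu_breaking=False):
    """Decide whether amplicon B may be linked to amplicon A."""
    if no_otu_breaking:
        # refinement deactivated: abundance is ignored
        return True
    # refinement active: refuse links that climb towards higher abundance
    return abundance_b <= abundance_a


print(may_link(10, 3))                        # True
print(may_link(3, 10))                        # False
print(may_link(3, 10, no_otu_breaking=True))  # True
```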
-t, --threads positive integer
number of computation threads to use. The number of threads should be less than or
equal to the number of available CPU cores. Default number of threads is 1.
-v, --version
output version information and exit.
-y, --bloom-bits positive integer
when using the option --fastidious (-f), define the size (in bits) of each entry
in the Bloom filter. That option makes it possible to balance the efficiency (i.e. speed)
and the memory footprint of the Bloom filter. Large values will make the Bloom
filter more efficient but will require more memory. Any value between 4 and 20
can be used. Default value is 16. See the --ceiling (-c) option for an
alternative way to control the memory footprint.
Input/output options
-a, --append-abundance positive integer
set abundance value to use when some or all amplicons in the input file lack
abundance values. Warning, it is not recommended to use swarm on datasets where
abundance values are all identical. We provide that option as a courtesy to
advanced users, please use it carefully. swarm exits with an error message if
abundance values are missing and if this option is not used.
-i, --internal-structure filename
output all pairs of nearly-identical amplicons to filename using a five-column
tab-delimited format:
1. amplicon A label.
2. amplicon B label.
3. number of differences between amplicons A and B (positive integer).
4. OTU number (positive integer). OTUs are numbered in their order of
delineation, starting from 1. All pairs of amplicons belonging to the
same OTU will receive the same number.
5. number of steps from the OTU seed to amplicon B (positive integer).
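The five-column layout described above can be read with a few lines of code. This is a minimal sketch; the class name Link, the function name and the sample rows are made up for illustration.

```python
import csv
import io
from typing import NamedTuple


class Link(NamedTuple):
    amplicon_a: str
    amplicon_b: str
    differences: int
    otu: int
    steps: int


def read_internal_structure(handle):
    """Parse the tab-delimited pairs written with --internal-structure."""
    reader = csv.reader(handle, delimiter="\t")
    return [Link(a, b, int(d), int(otu), int(steps))
            for a, b, d, otu, steps in (row for row in reader if row)]


# made-up sample content, two links inside OTU number 1
sample = io.StringIO("a_10\tb_2\t1\t1\t1\nb_2\tc_1\t1\t1\t2\n")
for link in read_internal_structure(sample):
    print(link.amplicon_a, "->", link.amplicon_b, "steps:", link.steps)
```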
-l, --log filename
output all messages to filename instead of standard error, with the exception of
error messages of course. That option is useful in situations where writing to
standard error is problematic (for example, with certain job schedulers).
-o, --output-file filename
output clustering results to filename. Results consist of a list of OTUs, one OTU
per line. An OTU is a list of amplicon identifiers separated by spaces. Default
is to write to standard output.
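The default output is simple to post-process. The sketch below reads one OTU per line and sums the abundance annotations carried by the identifiers (the default "_" style is assumed); the function names and the sample content are hypothetical.

```python
def read_otus(text):
    """Return a list of OTUs, each a list of amplicon identifiers."""
    return [line.split(" ") for line in text.splitlines() if line]


def abundance(identifier):
    """Extract the copy number that follows the last underscore."""
    return int(identifier.rsplit("_", 1)[1])


def total_abundance(otu):
    """Sum of the abundances of all amplicons in an OTU."""
    return sum(abundance(identifier) for identifier in otu)


otus = read_otus("a_10 b_2 c_1\nd_3\n")
print([total_abundance(otu) for otu in otus])  # [13, 3]
```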
-r, --mothur
output clustering results in a format compatible with Mothur. That option
modifies swarm's default output format.
-s, --statistics-file filename
output statistics to filename. The file is a tab-separated table with one OTU per
row and seven columns of information:
1. number of unique amplicons in the OTU,
2. total copy number of amplicons in the OTU,
3. identifier of the initial seed,
4. initial seed copy number,
5. number of amplicons with a copy number of 1 in the OTU,
6. maximum number of iterations before the OTU reached its natural
limits,
7. theoretical maximum radius of the OTU (i.e., number of cumulated
differences between the seed and the furthermost amplicon in the OTU).
The actual maximum radius of the OTU is often much smaller.
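The seven columns can be loaded into named fields for further analysis. This is a minimal sketch; the field names are hypothetical labels for the columns listed above, and the sample row is made up.

```python
import csv
import io
from typing import NamedTuple


class OtuStats(NamedTuple):
    unique_amplicons: int
    total_copy_number: int
    seed: str
    seed_copy_number: int
    singletons: int
    iterations: int
    max_radius: int


def read_statistics(handle):
    """Parse the tab-separated table written with --statistics-file."""
    rows = []
    for row in csv.reader(handle, delimiter="\t"):
        if not row:
            continue
        uniq, total, seed, seed_copies, singles, iterations, radius = row
        rows.append(OtuStats(int(uniq), int(total), seed, int(seed_copies),
                             int(singles), int(iterations), int(radius)))
    return rows


sample = io.StringIO("3\t13\ta_10\t10\t1\t2\t2\n")  # made-up row
print(read_statistics(sample)[0].seed)  # a_10
```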
-u, --uclust-file filename
output clustering results in uclust-like file format to the specified file. That
option does not modify swarm's default output format.
-w, --seeds filename
output OTU representatives to filename in fasta format. The abundance value of
each representative is the sum of the abundances of all the amplicons in the OTU.
-z, --usearch-abundance
accept amplicon abundance values in usearch/vsearch's style
(>label;size=integer[;]). That option influences the abundance annotation style
used in output files.
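The usearch/vsearch annotation style shown above can be parsed with a regular expression. This is a minimal sketch with a hypothetical function name; it accepts the optional trailing semicolon.

```python
import re

USEARCH = re.compile(r"^(?P<label>.*?);size=(?P<size>[0-9]+);?$")


def parse_usearch_header(identifier):
    """Split 'label;size=integer[;]' into (label, abundance)."""
    match = USEARCH.match(identifier)
    if match is None:
        raise ValueError(f"no usearch-style abundance in {identifier!r}")
    return match.group("label"), int(match.group("size"))


print(parse_usearch_header("seq1;size=42;"))  # ('seq1', 42)
print(parse_usearch_header("seq1;size=42"))   # ('seq1', 42)
```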
Pairwise alignment advanced options
when using d > 1, swarm recognizes advanced command-line options modifying the pairwise
global alignment scoring parameters:
-m, --match-reward positive integer
set the reward for a nucleotide match. Default is 5.
-p, --mismatch-penalty positive integer
set the penalty for a nucleotide mismatch. Default is 4.
-g, --gap-opening-penalty positive integer
set the gap open penalty. Default is 12.
-e, --gap-extension-penalty positive integer
set the gap extension penalty. Default is 4.
As swarm focuses on close relationships (i.e. d = 2 or 3), clustering results are
resilient to changes in the pairwise alignment model parameters. Modifying model parameters
has a stronger impact when clustering using a higher d value.
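To make the role of the four parameters concrete, the sketch below scores an already aligned pair of sequences with the default values. It assumes that a gap of length k costs the opening penalty plus k times the extension penalty; this convention is an assumption and is not stated above. The function name alignment_score is hypothetical.

```python
def alignment_score(row_a, row_b, match=5, mismatch=4, gap_open=12, gap_extend=4):
    """Score two aligned rows of equal length; '-' marks a gap."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    score = 0
    gap_row = None  # row holding the current run of gaps, if any
    for x, y in zip(row_a, row_b):
        if x == "-" or y == "-":
            row = "a" if x == "-" else "b"
            if gap_row != row:
                score -= gap_open   # a new gap starts here
            score -= gap_extend     # assumed: every gap position is charged
            gap_row = row
        else:
            score += match if x == y else -mismatch
            gap_row = None
    return score


print(alignment_score("acgt", "acgt"))  # 20
print(alignment_score("acgt", "acga"))  # 11
print(alignment_score("acgt", "ac-t"))  # -1
```

With these defaults, a single gap costs far more than a single mismatch, so alignments with substitutions are preferred over alignments with indels whenever both explain the same pair.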
EXAMPLES
Cluster the data set myfile.fasta into OTUs with the finest resolution possible (1
difference, built-in breaking, fastidious option) using 4 computation threads. OTUs are
written to the file myfile.swarms, and OTU representatives are written to
myfile.representatives.fasta.
swarm -t 4 -f -w myfile.representatives.fasta < myfile.fasta > myfile.swarms
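The same invocation can be scripted. The following is a minimal sketch that assumes a swarm executable is available on the PATH; the helper names swarm_command and run_swarm are hypothetical, and only options documented above are used.

```python
import subprocess


def swarm_command(threads=1, fastidious=False, seeds=None, differences=None):
    """Build the argument list for a swarm run."""
    command = ["swarm", "-t", str(threads)]
    if differences is not None:
        command += ["-d", str(differences)]
    if fastidious:
        command.append("-f")
    if seeds is not None:
        command += ["-w", seeds]
    return command


def run_swarm(command, fasta_path, swarms_path):
    """Feed the fasta file on standard input and save OTUs from standard output."""
    with open(fasta_path, "rb") as fasta, open(swarms_path, "wb") as swarms:
        subprocess.run(command, stdin=fasta, stdout=swarms, check=True)


command = swarm_command(threads=4, fastidious=True,
                        seeds="myfile.representatives.fasta")
print(" ".join(command))
# run_swarm(command, "myfile.fasta", "myfile.swarms")  # requires swarm to be installed
```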
AUTHORS
Concept by Frédéric Mahé, implementation by Torbjørn Rognes.
CITATION
Mahé F, Rognes T, Quince C, de Vargas C, Dunthorn M. (2014) Swarm: robust and fast
clustering method for amplicon-based studies. PeerJ 2:e593
<http://dx.doi.org/10.7717/peerj.593>
Mahé F, Rognes T, Quince C, de Vargas C, Dunthorn M. (2015) Swarm v2: highly-scalable and
high-resolution amplicon clustering. PeerJ 3:e1420 <http://dx.doi.org/10.7717/peerj.1420>
REPORTING BUGS
Submit suggestions and bug-reports at <https://github.com/torognes/swarm/issues>, send a
pull request on <https://github.com/torognes/swarm>, or compose a friendly or curmudgeonly
e-mail to Frédéric Mahé and Torbjørn Rognes.
AVAILABILITY
The software is available from <https://github.com/torognes/swarm>
COPYRIGHT
Copyright (C) 2012, 2013, 2014, 2015 Frédéric Mahé & Torbjørn Rognes
This program is free software: you can redistribute it and/or modify it under the terms of
the GNU Affero General Public License as published by the Free Software Foundation, either
version 3 of the License, or any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY;
without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
See the GNU Affero General Public License for more details.
You should have received a copy of the GNU Affero General Public License along with this
program. If not, see <http://www.gnu.org/licenses/>.