NAME
cmbuild - construct covariance model(s) from structurally annotated RNA multiple sequence
alignment(s)
SYNOPSIS
cmbuild [options] <cmfile_out> <msafile>
DESCRIPTION
For each multiple sequence alignment in <msafile> build a covariance model and save it to
a new file <cmfile_out>.
The alignment file must be in Stockholm or SELEX format, and must contain consensus
secondary structure annotation. cmbuild uses the consensus structure to determine the
architecture of the CM.
<msafile> may be '-' (dash), which means reading this input from stdin rather than a file.
To use '-', you must also specify the alignment file format with --informat <s>, as in
--informat stockholm (because of a current limitation in our implementation, MSA file
formats cannot be autodetected in a nonrewindable input stream).
<cmfile_out> may not be '-' (stdout), because sending the CM file to stdout would conflict
with the other text output of the program.
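The two rules above for '-' can be sketched as a small helper that assembles a cmbuild argument list without running it. This is only an illustration: the function name cmbuild_args and its parameters are hypothetical, and it uses only options described on this page (-F and --informat).

```python
def cmbuild_args(cmfile_out, msafile, informat=None, force=False):
    """Assemble a cmbuild command line as an argument list (nothing is executed).

    Hypothetical helper for illustration; it only encodes the rules
    stated in the DESCRIPTION section.
    """
    if cmfile_out == "-":
        # The CM file may not go to stdout: it would mix with the summary output.
        raise ValueError("<cmfile_out> may not be '-'")
    if msafile == "-" and informat is None:
        # The MSA format cannot be autodetected on a nonrewindable stream.
        raise ValueError("reading <msafile> from stdin requires --informat")
    args = ["cmbuild"]
    if force:
        args.append("-F")  # allow <cmfile_out> to be overwritten
    if informat is not None:
        args += ["--informat", informat]
    args += [cmfile_out, msafile]
    return args


print(cmbuild_args("my.cm", "-", informat="stockholm"))
```

The resulting list could be handed to subprocess.run together with the alignment text on stdin, assuming cmbuild is installed and on the PATH.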
In addition to writing CM(s) to <cmfile_out>, cmbuild also outputs a single line for each
model created to stdout. Each line has the following fields: "aln": the index of the
alignment used to build the CM; "idx": the index of the CM in the <cmfile_out>; "name":
the name of the CM; "nseq": the number of sequences in the alignment used to build the CM;
"eff_nseq": the effective number of sequences used to build the model; "alen": the length
of the alignment used to build the CM; "clen": the number of columns from the alignment
defined as consensus (match) columns; "bps": the number of basepairs in the CM; "bifs":
the number of bifurcations in the CM; "rel entropy: CM": the total relative entropy of the
model divided by the number of consensus columns; "rel entropy: HMM": the total relative
entropy of the model ignoring secondary structure divided by the number of consensus
columns; "description": a description of the model/alignment.
OPTIONS
-h Help; print a brief reminder of command line usage and available options.
-n <s> Name the new CM <s>. The default is to use the name of the alignment (if one is
present in the <msafile>), or, failing that, the name of the <msafile>. If
<msafile> contains more than one alignment, -n doesn't work, and every alignment
must have a name annotated in the <msafile> (as in Stockholm #=GF ID annotation).
-F Allow <cmfile_out> to be overwritten. Without this option, if <cmfile_out> already
exists, cmbuild exits with an error.
-o <f> Direct the summary output to file <f>, rather than to stdout.
-O <f> After each model is constructed, resave annotated source alignments to a file <f>
in Stockholm format. Sequences are annotated with the relative sequence weights
they were assigned. The alignments are also annotated with a reference annotation line
indicating which columns were assigned as consensus. If the source alignment had
reference annotation ("#=GC RF") it will be replaced with the consensus residue of
the model for consensus columns and '.' for insert columns, unless the --hand
option was used for specifying consensus positions, in which case it will be
unchanged.
--devhelp Print help, as with -h, but also include expert options that are not
displayed with -h. These expert options are not expected to be relevant for the
vast majority of users and so are not described in the manual page. The only
resources for understanding what they actually do are the brief one-line
descriptions output when --devhelp is enabled, and the source code.
OPTIONS CONTROLLING MODEL CONSTRUCTION
These options control how consensus columns are defined in an alignment.
--fast Define consensus columns automatically as those that have a fraction >= symfrac of
residues as opposed to gaps. (See below for the --symfrac option.) This is the
default.
--hand Use reference coordinate annotation (#=GC RF line, in Stockholm) to determine which
columns are consensus, and which are inserts. Any non-gap character indicates a
consensus column. (For example, mark consensus columns with "x", and insert columns
with ".".) This option was called --rf in previous versions of Infernal (0.1
through 1.0.2).
--symfrac <x>
Define the residue fraction threshold necessary to define a consensus column when
not using --hand. The default is 0.5. The symbol fraction in each column is
calculated after taking relative sequence weighting into account. Setting this to
0.0 means that every alignment column will be assigned as consensus, which may be
useful in some cases. Setting it to 1.0 means that only columns that include 0 gaps
will be assigned as consensus. This option replaces the --gapthresh <y> option
from previous versions of Infernal (0.1 through 1.0.2), with <x> equal to (1.0 -
<y>). For example, to reproduce the behavior of cmbuild --gapthresh 0.8 in a
previous version, use cmbuild --symfrac 0.2 with this version.
--noss Ignore the secondary structure annotation, if any, in <msafile> and build a CM with
zero basepairs. This model will be similar to a profile HMM and the cmsearch and
cmscan programs will use HMM algorithms which are faster than CM ones for this
model. Additionally, a zero basepair model need not be calibrated with cmcalibrate
prior to running cmsearch with it. The --noss option must be used if there is no
secondary structure annotation in <msafile>.
--rsearch <f>
Parameterize emission scores a la RSEARCH, using the RIBOSUM matrix in file <f>.
With --rsearch enabled, all alignments in <msafile> must contain exactly one
sequence or the --call option must also be enabled. All positions in each sequence
will be considered consensus "columns". Actually, the emission scores for these
models will not be identical to RIBOSUM scores due to differences in the modelling
strategy between Infernal and RSEARCH, but they will be as similar as possible.
RIBOSUM matrix files are included with Infernal in the "matrices/" subdirectory of
the top-level "infernal-xxx" directory. RIBOSUM matrices are substitution score
matrices trained specifically for structural RNAs with separate single stranded
residue and base pair substitution scores. For more information see the RSEARCH
publication (Klein and Eddy, BMC Bioinformatics 4:44, 2003).
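The two ways of defining consensus columns described above (--fast with --symfrac, and --hand) can be sketched as follows. This is a simplified illustration, not Infernal's implementation: the function names are hypothetical, and treating only '-' and '.' as gap characters is an assumption of this sketch.

```python
GAPS = set("-.")  # assumption: only these two characters count as gaps here


def consensus_fast(seqs, weights=None, symfrac=0.5):
    """Columns whose weighted fraction of residues (vs. gaps) is >= symfrac."""
    if weights is None:
        weights = [1.0] * len(seqs)  # no relative weighting
    total = sum(weights)
    consensus = []
    for col in range(len(seqs[0])):
        residue_weight = sum(w for s, w in zip(seqs, weights) if s[col] not in GAPS)
        if residue_weight / total >= symfrac:
            consensus.append(col)
    return consensus


def consensus_hand(rf_line):
    """Columns marked by any non-gap character in a reference (RF) line."""
    return [i for i, ch in enumerate(rf_line) if ch not in GAPS]


def symfrac_from_gapthresh(gapthresh):
    """Convert an old --gapthresh <y> value to the equivalent --symfrac <x>."""
    return 1.0 - gapthresh


aligned = ["GGA-UCC", "GGAAUCC", "GG--UCC", "G-A-UC-"]
print(consensus_fast(aligned, symfrac=0.5))
print(consensus_fast(aligned, symfrac=1.0))
print(consensus_hand("xxx.xxx"))
```

Passing relative sequence weights through the weights argument mirrors the statement that the symbol fraction is calculated after weighting.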
OTHER MODEL CONSTRUCTION OPTIONS
--null <f>
Read a null model from <f>. The null model defines the probability of each RNA
nucleotide in background sequence; the default is to use 0.25 for each nucleotide.
The format of null files is specified in the user guide.
--prior <f>
Read a Dirichlet prior from <f>, replacing the default mixture Dirichlet. The
format of prior files is specified in the user guide.
Use --devhelp to see additional, otherwise undocumented, model construction options.
OPTIONS CONTROLLING RELATIVE WEIGHTS
cmbuild uses an ad hoc sequence weighting algorithm to downweight closely related
sequences and upweight distantly related ones. This has the effect of making models less
biased by uneven phylogenetic representation. For example, two identical sequences would
typically each receive half the weight that one sequence would. These options control
which algorithm gets used.
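The effect of downweighting identical sequences can be illustrated with a minimal sketch of position-based weighting in the style of Henikoff and Henikoff. It is not Infernal's code: the function name is hypothetical, and skipping gap characters and normalizing the weights to sum to the number of sequences are assumptions of this sketch.

```python
from collections import Counter


def position_based_weights(seqs, gaps="-."):
    """Position-based weights for aligned sequences (simplified sketch).

    In each column, a sequence earns 1 / (k * n), where k is the number of
    distinct residues in the column and n is how many sequences share this
    sequence's residue. Assumes at least one residue in the alignment.
    """
    n_seqs = len(seqs)
    raw = [0.0] * n_seqs
    for col in range(len(seqs[0])):
        column = [s[col] for s in seqs]
        counts = Counter(ch for ch in column if ch not in gaps)
        k = len(counts)  # number of distinct residues in this column
        for i, ch in enumerate(column):
            if ch not in gaps:
                raw[i] += 1.0 / (k * counts[ch])
    total = sum(raw)
    # Assumption: normalize so the weights sum to the number of sequences.
    return [w * n_seqs / total for w in raw]


# Two identical sequences plus one sequence that differs in every column.
print(position_based_weights(["ACGU", "ACGU", "UGCA"]))
```

In this toy alignment the two identical sequences each end up with half the weight of the third sequence.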
--wpb Use the Henikoff position-based sequence weighting scheme [Henikoff and Henikoff,
J. Mol. Biol. 243:574, 1994]. This is the default.
--wgsc Use the Gerstein/Sonnhammer/Chothia weighting algorithm [Gerstein et al, J. Mol.
Biol. 235:1067, 1994].
--wnone
Turn sequence weighting off; i.e. explicitly set all sequence weights to 1.0.
--wgiven
Use sequence weights as given in annotation in the input alignment file. If no
weights were given, assume they are all 1.0. The default is to determine new
sequence weights by the Henikoff position-based algorithm (--wpb), ignoring any
annotated weights.
--wblosum
Use the BLOSUM filtering algorithm to weight the sequences, instead of the default
position-based weighting. Cluster the sequences at a given percentage identity (see --wid);
assign each cluster a total weight of 1.0, distributed equally amongst the members
of that cluster.
--wid <x>
Controls the behavior of the --wblosum weighting option by setting the percent
identity for clustering the alignment to <x>.
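The clustering scheme described for --wblosum can be sketched as below. Several details are assumptions of this sketch rather than facts from this page: the identity threshold is passed as a fraction, identity is computed only over columns where both sequences have a residue, and clusters are formed by single linkage. The function names are hypothetical.

```python
def pairwise_identity(a, b, gaps="-."):
    """Fraction of identical residues over columns where both have a residue."""
    both = [(x, y) for x, y in zip(a, b) if x not in gaps and y not in gaps]
    if not both:
        return 0.0
    return sum(x == y for x, y in both) / len(both)


def blosum_weights(seqs, wid):
    """Cluster at identity >= wid; each cluster shares a total weight of 1.0."""
    n = len(seqs)
    cluster = list(range(n))  # start with every sequence in its own cluster
    for i in range(n):
        for j in range(i + 1, n):
            if pairwise_identity(seqs[i], seqs[j]) >= wid:
                # Single linkage: merge the whole cluster of j into that of i.
                old, new = cluster[j], cluster[i]
                cluster = [new if c == old else c for c in cluster]
    sizes = {}
    for c in cluster:
        sizes[c] = sizes.get(c, 0) + 1
    # Distribute each cluster's weight of 1.0 equally among its members.
    return [1.0 / sizes[c] for c in cluster]


seqs = ["ACGUACGU", "ACGUACGA", "UUUUCCCC"]
print(blosum_weights(seqs, wid=0.62))  # 0.62 is an illustrative threshold
```

With this scheme the weights always sum to the number of clusters, so a large family of near-identical sequences counts as roughly one sequence.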
OPTIONS CONTROLLING EFFECTIVE SEQUENCE NUMBER
After relative weights are determined, they are normalized to sum to a total effective
sequence number, eff_nseq. This number may be the actual number of sequences in the
alignment, but it is almost always smaller than that. The default entropy weighting
method (--eent) reduces the effective sequence number to reduce the information content
(relative entropy, or average expected score on true homologs) per consensus position. The
target relative entropy is controlled by a two-parameter function, where the two
parameters are settable with --ere and --esigma.
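The idea behind entropy weighting can be sketched with a heavily simplified model. The assumptions here are significant: only single-stranded columns are considered, a uniform pseudocount stands in for the mixture Dirichlet prior, the background is 0.25 per nucleotide, and the effective sequence number is found by bisection. All function names are hypothetical and this is not how Infernal is implemented.

```python
import math


def mean_match_relative_entropy(counts, scale, pseudocount=1.0, background=0.25):
    """Mean per-column relative entropy (bits) after scaling the counts.

    counts: one list of four weighted nucleotide counts per consensus column.
    scale:  eff_nseq / nseq, applied to every count.
    """
    total = 0.0
    for col in counts:
        scaled = [c * scale + pseudocount for c in col]
        denom = sum(scaled)
        for x in scaled:
            p = x / denom
            total += p * math.log2(p / background)
    return total / len(counts)


def entropy_weighted_eff_nseq(counts, nseq, target_re, tol=1e-6):
    """Largest effective sequence number whose mean relative entropy <= target."""
    if mean_match_relative_entropy(counts, 1.0) <= target_re:
        return float(nseq)  # already at or below target; no reduction needed
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if mean_match_relative_entropy(counts, mid) > target_re:
            hi = mid
        else:
            lo = mid
    return lo * nseq


def rescale_weights(weights, eff_nseq):
    """Normalize relative weights so that they sum to eff_nseq."""
    total = sum(weights)
    return [w * eff_nseq / total for w in weights]


counts = [[10, 0, 0, 0], [5, 5, 0, 0], [0, 0, 10, 0]]
eff = entropy_weighted_eff_nseq(counts, nseq=10, target_re=0.59)
print(eff, mean_match_relative_entropy(counts, eff / 10))
```

Under these assumptions, shrinking the counts pulls each column's distribution toward the background, which is why a smaller effective sequence number lowers the relative entropy per position.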
--eent Use the entropy weighting strategy to determine the effective sequence number that
gives a target mean match state relative entropy. This option is the default, and
can be turned off with --enone. The default target mean match state relative
entropy is 0.59 bits for models with at least 1 basepair and 0.38 bits for models
with zero basepairs, but this can be changed with --ere. The default of 0.59 or 0.38 bits is
automatically changed if the total relative entropy of the model (summed match
state relative entropy) is less than a cutoff, which is 6.0 bits by default, but
can be changed with the expert, undocumented --eX option. If you really want to
play with that option, consult the source code.
--enone
Turn off the entropy weighting strategy. The effective sequence number is just the
number of sequences in the alignment.
--ere <x>
Set the target mean match state relative entropy as <x>. By default the target
relative entropy per match position is 0.59 bits for models with at least 1
basepair and 0.38 for models with zero basepairs.
--eminseq <x>
Define the minimum allowed effective sequence number as <x>.
--ehmmre <x>
Set the target HMM mean match state relative entropy as <x>. Entropy for
basepairing match states is calculated using marginalized basepair emission
probabilities.
--eset <x>
Set the effective sequence number for entropy weighting as <x>.
OPTIONS CONTROLLING FILTER P7 HMM CONSTRUCTION
For each CM that cmbuild constructs, an accompanying filter p7 HMM is built from the input
alignment as well. These options control filter HMM construction:
--p7ere <x>
Set the target mean match state relative entropy for the filter p7 HMM as <x>. By
default the target relative entropy per match position is 0.38 bits.
--p7ml Use a maximum likelihood p7 HMM built from the CM as the filter HMM. This HMM will
be as similar as possible to the CM (while necessarily ignorant of secondary
structure).
Use --devhelp to see additional, otherwise undocumented, filter HMM construction options.
OPTIONS CONTROLLING FILTER P7 HMM CALIBRATION
After building each filter HMM, cmbuild determines appropriate E-value parameters to use
during filtering in cmsearch and cmscan by sampling a set of sequences and searching them
with each HMM filter configuration and algorithm.
--EmN <n> Set the number of sampled sequences for local MSV filter HMM calibration to <n>.
200 by default.
--EvN <n> Set the number of sampled sequences for local Viterbi filter HMM calibration to
<n>. 200 by default.
--ElfN <n> Set the number of sampled sequences for local Forward filter HMM calibration to
<n>. 200 by default.
--EgfN <n> Set the number of sampled sequences for glocal Forward filter HMM calibration
to <n>. 200 by default.
Use --devhelp to see additional, otherwise undocumented, filter HMM calibration options.
OPTIONS FOR REFINING THE INPUT ALIGNMENT
--refine <f>
Attempt to refine the alignment before building the CM using expectation-
maximization (EM). A CM is first built from the initial alignment as usual. Then,
the sequences in the alignment are realigned optimally (with the HMM banded CYK
algorithm; optimal here means optimal given the bands) to the CM, and a new CM is built
from the resulting alignment. The sequences are then realigned to the new CM, and a
new CM is built from that alignment. This is continued until convergence,
specifically when the alignments for two successive iterations are not
significantly different (the summed bit score of all the sequences in the
alignment changes by less than 1% between two successive iterations). The final
alignment (the alignment used to build the CM that gets written to <cmfile_out>) is
written to <f>.
-l With --refine, turn on the local alignment algorithm, which allows the alignment to
span two or more subsequences if necessary (e.g. if the structures of the query
model and target sequence are only partially shared), allowing certain large
insertions and deletions in the structure to be penalized differently than normal
indels. The default is to globally align the query model to the target sequences.
--gibbs
Modifies the behavior of --refine so Gibbs sampling is used instead of EM. The
difference is that during the alignment stage the alignment is not necessarily
optimal; instead, an alignment (parsetree) for each sequence is sampled from the
posterior distribution of alignments as determined by the Inside algorithm. Due to
this sampling step --gibbs is non-deterministic, so different runs with the same
alignment may yield different results. This is not true when --refine is used
without the --gibbs option, in which case the final alignment and CM will always be
the same. When --gibbs is enabled, the --seed <n> option can be used to seed the
random number generator predictably, making the results reproducible. The goal of
the --gibbs option is to help expert RNA alignment curators refine structural
alignments by allowing them to observe alternative high scoring alignments.
--seed <n>
Seed the random number generator with <n>, an integer >= 0. This option can only
be used in combination with --gibbs. If <n> is nonzero, stochastic sampling of
alignments will be reproducible; the same command will give the same results. If
<n> is 0, the random number generator is seeded arbitrarily, and stochastic
samplings may vary from run to run of the same command. The default seed is 0.
--cyk With --refine, align with the CYK algorithm. By default the optimal accuracy
algorithm is used. There is more information on this in the cmalign manual page.
--notrunc
With --refine, turn off the truncated alignment algorithm. There is more
information on this in the cmalign manual page.
Use --devhelp to see additional, otherwise undocumented, alignment refinement options as
well as other output file options and options for building multiple models for a single
alignment.
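The iteration described for --refine can be sketched as a generic loop. The callables build_model and align are hypothetical stand-ins for model construction and realignment, and max_iter is a safety bound added for this sketch; only the 1% convergence rule on the summed bit score comes from the description above.

```python
def refine(alignment, seqs, build_model, align, rel_tol=0.01, max_iter=100):
    """Iterate build -> realign -> rebuild until the summed score stabilizes.

    build_model(alignment) -> model
    align(model, seqs)     -> (new_alignment, summed_bit_score)
    Both callables are hypothetical placeholders supplied by the caller.
    """
    model = build_model(alignment)  # model from the initial alignment
    prev_score = None
    for _ in range(max_iter):
        alignment, score = align(model, seqs)
        model = build_model(alignment)  # model from the newest alignment
        if prev_score is not None and abs(score - prev_score) < rel_tol * abs(prev_score):
            break  # summed score changed by less than 1% between iterations
        prev_score = score
    return model, alignment


# Toy stand-ins: the "model" is a number and each realignment moves it
# halfway toward 100, so the score converges.
def toy_build(alignment):
    return alignment


def toy_align(model, seqs):
    new_alignment = model / 2.0 + 50.0
    return new_alignment, new_alignment


final_model, final_alignment = refine(0.0, [], toy_build, toy_align)
print(final_model, final_alignment)
```

The returned model is always built from the returned alignment, matching the statement that the final alignment is the one used to build the CM that gets written.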