NAME
alimask - Add mask line to a multiple sequence alignment
SYNOPSIS
alimask [options] <msafile> <postmsafile>
DESCRIPTION
alimask is used to apply a mask line to a multiple sequence alignment, based on provided
alignment or model coordinates. When hmmbuild receives a masked alignment as input, it
produces a profile model in which the emission probabilities at masked positions are set
to match the background frequency, rather than being set based on observed frequencies in
the alignment. Position-specific insertion and deletion rates are not altered, even in
masked regions. alimask autodetects input format, and produces masked alignments in
Stockholm format. <msafile> may contain only one sequence alignment.
A common motivation for masking a region in an alignment is that the region contains a
simple tandem repeat that is observed to cause an unacceptably high rate of false positive
hits.
In the simplest case, a mask range is given in coordinates relative to the input
alignment, using --alirange <s>. However, it is more often the case that the region to be
masked has been identified in coordinates relative to the profile model (e.g. based on
recognizing a simple repeat pattern in false hit alignments or in the HMM logo). Not all
alignment columns are converted to match state positions in the profile (see the --symfrac
flag for hmmbuild for discussion), so model positions do not necessarily match up to
alignment column positions. To remove the burden of converting model positions to
alignment positions, alimask accepts the mask range input in model coordinates as well,
using --modelrange <s>. When using this flag, alimask determines which alignment
positions would be identified by hmmbuild as match states, a process that requires that
all hmmbuild flags impacting that decision be supplied to alimask. It is for this reason
that many of the hmmbuild flags are also used by alimask.
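
The relationship between model coordinates and alignment coordinates can be
illustrated with a minimal Python sketch. This is not alimask's implementation:
the function name model_to_ali_range and the is_match list are hypothetical, and
the sketch assumes the match/insert decision for each alignment column has
already been made.

```python
def model_to_ali_range(is_match, start, end):
    """Map a 1-based inclusive model range to 1-based alignment columns.

    is_match[i] is True when alignment column i+1 is a match (consensus)
    column. Both names are hypothetical, for illustration only.
    """
    # Alignment column of each model position, in model order.
    match_cols = [i + 1 for i, m in enumerate(is_match) if m]
    if not (1 <= start <= end <= len(match_cols)):
        raise ValueError("model range outside 1..%d" % len(match_cols))
    return match_cols[start - 1], match_cols[end - 1]


# Hypothetical 8-column alignment in which columns 3 and 6 are not match columns.
is_match = [True, True, False, True, True, False, True, True]
print(model_to_ali_range(is_match, 2, 5))  # prints (2, 7)
```

Model positions 2 through 5 correspond here to alignment columns 2 through 7,
because two non-match columns fall inside the range.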
OPTIONS
-h Help; print a brief reminder of command line usage and all available options.
-o <f> Direct the summary output to file <f>, rather than to stdout.
OPTIONS FOR SPECIFYING MASK RANGE
A single mask range is given as a dash-separated pair, like --modelrange 10-20, and
multiple ranges may be submitted as a comma-separated list, --modelrange 10-20,30-42.
--modelrange <s>
Supply the given range(s) in model coordinates.
--alirange <s>
Supply the given range(s) in alignment coordinates.
--appendmask
Add to the existing mask found with the alignment. The default is to overwrite any
existing mask.
--model2ali <s>
Rather than actually produce the masked alignment, simply print alignment range(s)
corresponding to input model range(s).
--ali2model <s>
Rather than actually produce the masked alignment, simply print model range(s)
corresponding to input alignment range(s).
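
The range syntax described in this section, and the difference between
overwriting and appending to an existing mask, can be sketched in Python as
follows. This is an illustration under stated assumptions, not alimask's code:
the function names are hypothetical, and the use of 'm' for a masked column and
'.' for an unmasked column is an assumption about how a mask line is written.

```python
def parse_ranges(spec):
    """Parse a string such as '10-20,30-42' into 1-based inclusive pairs."""
    ranges = []
    for part in spec.split(","):
        start_text, end_text = part.split("-")
        start, end = int(start_text), int(end_text)
        if start < 1 or end < start:
            raise ValueError("bad range: %r" % part)
        ranges.append((start, end))
    return ranges


def mask_line(n_columns, ranges, existing=None):
    """Build a per-column mask string for an alignment of n_columns columns.

    Assumption: 'm' marks a masked column and '.' an unmasked one.
    Passing an existing mask appends to it; otherwise the mask starts empty.
    """
    mask = list(existing) if existing is not None else ["."] * n_columns
    for start, end in ranges:
        # Clip ranges that run past the last alignment column.
        for col in range(start, min(end, n_columns) + 1):
            mask[col - 1] = "m"
    return "".join(mask)


print(parse_ranges("10-20,30-42"))                      # [(10, 20), (30, 42)]
print(mask_line(10, parse_ranges("3-5")))               # ..mmm.....
print(mask_line(10, parse_ranges("8-9"), "..mmm....."))  # ..mmm..mm.
```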
OPTIONS FOR SPECIFYING THE ALPHABET
The alphabet type (amino, DNA, or RNA) is autodetected by default, by looking at the
composition of the msafile. Autodetection is normally quite reliable, but occasionally
alphabet type may be ambiguous and autodetection can fail (for instance, on tiny toy
alignments of just a few residues). To avoid this, or to increase robustness in automated
analysis pipelines, you may specify the alphabet type of msafile with these options.
--amino
Specify that all sequences in msafile are proteins.
--dna Specify that all sequences in msafile are DNAs.
--rna Specify that all sequences in msafile are RNAs.
OPTIONS CONTROLLING PROFILE CONSTRUCTION
These options control how consensus columns are defined in an alignment.
--fast Define consensus columns as those that have a fraction >= symfrac of residues as
opposed to gaps. (See below for the --symfrac option.) This is the default.
--hand Define consensus columns using the reference annotation of the multiple
alignment. This allows you to define any consensus columns you like.
--symfrac <x>
Define the residue fraction threshold necessary to define a consensus column when
using the --fast option. The default is 0.5. The symbol fraction in each column is
calculated after taking relative sequence weighting into account, and ignoring gap
characters corresponding to ends of sequence fragments (as opposed to internal
insertions/deletions). Setting this to 0.0 means that every alignment column will
be assigned as consensus, which may be useful in some cases. Setting it to 1.0
means that only columns that include 0 gaps (internal insertions/deletions) will be
assigned as consensus.
--fragthresh <x>
We only want to count terminal gaps as deletions if the aligned sequence is known
to be full-length, not if it is a fragment (for instance, because only part of it
was sequenced). HMMER uses a simple rule to infer fragments: if the sequence length
L is less than or equal to a fraction <x> times the alignment length in columns,
then the sequence is handled as a fragment. The default is 0.5. Setting
--fragthresh 0 will define no (nonempty) sequence as a fragment; you might want to
do this if you know you've got a carefully curated alignment of full-length
sequences. Setting --fragthresh 1 will define all sequences as fragments; you might
want to do this if you know your alignment is entirely composed of fragments, such
as translated short reads in metagenomic shotgun data.
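
The fragment rule and the --fast consensus rule described above can be sketched
together in Python. This is a simplified illustration, not HMMER's
implementation: it assumes uniform sequence weights (the real calculation uses
relative weights), treats '-' and '.' as gap characters, and uses hypothetical
function names.

```python
GAP_CHARS = "-."  # assumption: both characters denote gaps


def is_fragment(aligned_seq, fragthresh=0.5):
    """True if residue count L <= fragthresh * alignment length in columns."""
    length = sum(1 for ch in aligned_seq if ch not in GAP_CHARS)
    return length <= fragthresh * len(aligned_seq)


def consensus_columns(alignment, symfrac=0.5, fragthresh=0.5):
    """Return one boolean per column: True if the column is consensus.

    Terminal gaps of fragments are ignored; all sequences get weight 1.
    """
    n_cols = len(alignment[0])
    counted = []  # per sequence: set of columns that count toward the fraction
    for seq in alignment:
        cols = range(n_cols)
        if is_fragment(seq, fragthresh):
            res = [i for i, ch in enumerate(seq) if ch not in GAP_CHARS]
            # Only columns between the first and last residue are counted.
            cols = range(res[0], res[-1] + 1) if res else range(0)
        counted.append(set(cols))
    result = []
    for col in range(n_cols):
        total = sum(1 for cols in counted if col in cols)
        residues = sum(1 for seq, cols in zip(alignment, counted)
                       if col in cols and seq[col] not in GAP_CHARS)
        result.append(total > 0 and residues / total >= symfrac)
    return result


# Hypothetical toy alignment; the last two sequences are fragments.
alignment = ["ACDEFG", "AC-EFG", "A--EF-", "---EFG"]
print(consensus_columns(alignment))
# [True, True, False, True, True, True]
```

With symfrac set to 0.0 every column in this toy alignment is consensus, and
with 1.0 only the columns without counted gaps are.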
OPTIONS CONTROLLING RELATIVE WEIGHTS
HMMER uses an ad hoc sequence weighting algorithm to downweight closely related sequences
and upweight distantly related ones. This has the effect of making models less biased by
uneven phylogenetic representation. For example, two identical sequences would typically
each receive half the weight that one sequence would. These options control which
algorithm gets used.
--wpb Use the Henikoff position-based sequence weighting scheme [Henikoff and Henikoff,
J. Mol. Biol. 243:574, 1994]. This is the default.
--wgsc Use the Gerstein/Sonnhammer/Chothia weighting algorithm [Gerstein et al, J. Mol.
Biol. 235:1067, 1994].
--wblosum
Use the same clustering scheme that was used to weight data in calculating BLOSUM
substitution matrices [Henikoff and Henikoff, Proc. Natl. Acad. Sci 89:10915, 1992].
Sequences are single-linkage clustered at an identity threshold (default 0.62; see
--wid) and within each cluster of c sequences, each sequence gets relative weight
1/c.
--wnone
No relative weights. All sequences are assigned uniform weight.
--wid <x>
Sets the identity threshold used by single-linkage clustering when using --wblosum.
Invalid with any other weighting scheme. Default is 0.62.
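
The clustering-based weighting described under --wblosum can be sketched in
Python. This is a minimal illustration rather than HMMER's implementation: the
function names are hypothetical, and both the definition of pairwise identity
(identical residues over columns where both sequences have a residue) and the
use of >= for the threshold comparison are assumptions.

```python
def pairwise_identity(a, b):
    """Fraction of identical residues over columns where both have a residue.

    Assumption: this is one of several possible identity definitions.
    """
    pairs = [(x, y) for x, y in zip(a, b) if x not in "-." and y not in "-."]
    if not pairs:
        return 0.0
    return sum(1 for x, y in pairs if x == y) / len(pairs)


def blosum_weights(alignment, wid=0.62):
    """Single-linkage cluster at identity >= wid; weight each sequence 1/c."""
    n = len(alignment)
    parent = list(range(n))  # union-find forest over sequence indices

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # Single linkage: one pair above the threshold is enough to join clusters.
    for i in range(n):
        for j in range(i + 1, n):
            if pairwise_identity(alignment[i], alignment[j]) >= wid:
                parent[find(i)] = find(j)

    roots = [find(i) for i in range(n)]
    return [1.0 / roots.count(r) for r in roots]


# Hypothetical alignment: three close relatives and one unrelated sequence.
seqs = ["ACDEFGHIKL", "ACDEFGHIKL", "ACDEFGHIKV", "MNPQRSTVWY"]
weights = blosum_weights(seqs)
print(weights)
```

The three similar sequences form one cluster and each receive weight 1/3, while
the unrelated sequence keeps weight 1.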
OTHER OPTIONS
--informat <s>
Declare that the input msafile is in format <s>. Currently the accepted multiple
alignment sequence file formats include Stockholm, Aligned FASTA, Clustal, NCBI
PSI-BLAST, PHYLIP, Selex, and UCSC SAM A2M. Default is to autodetect the format of
the file.
--seed <n>
Seed the random number generator with <n>, an integer >= 0. If <n> is nonzero, any
stochastic simulations will be reproducible; the same command will give the same
results. If <n> is 0, the random number generator is seeded arbitrarily, and
stochastic simulations will vary from run to run of the same command. The default
seed is 42.