NAME


cmemit - sample sequences from a covariance model

SYNOPSIS


cmemit [options] <cmfile>

DESCRIPTION


The cmemit program samples (emits) sequences from the covariance model(s) in <cmfile>, and
writes them to output. Sampling sequences may be useful for a variety of purposes,
including creating synthetic true positives for benchmarks or tests.

The default is to sample ten unaligned sequences from each CM. Alternatively, with the -c
option, you can emit a single majority-rule consensus sequence; or with the -a option, you
can emit an alignment.
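
The three output modes described above can be selected from a script. The following is a
minimal sketch, assuming cmemit is installed and on PATH. The helper names
build_cmemit_args and run_cmemit and the file name my.cm are hypothetical; only options
documented on this page are used.

```python
import subprocess


def build_cmemit_args(cmfile, mode="unaligned", n=None, outfile=None):
    """Build an argument list for cmemit (hypothetical helper, not part of Infernal).

    mode: "unaligned" (the default behavior), "consensus" (-c), or "alignment" (-a).
    n:    number of sequences to sample (-N); this page does not say whether
          -N may be combined with -c, so leave it as None in consensus mode.
    """
    flags = {"unaligned": [], "consensus": ["-c"], "alignment": ["-a"]}
    if mode not in flags:
        raise ValueError(f"unknown mode: {mode}")
    args = ["cmemit"] + flags[mode]
    if n is not None:
        args += ["-N", str(n)]
    if outfile is not None:
        args += ["-o", outfile]
    args.append(cmfile)
    return args


def run_cmemit(args):
    """Run cmemit and return what it wrote to stdout (requires cmemit on PATH)."""
    result = subprocess.run(args, check=True, capture_output=True, text=True)
    return result.stdout


if __name__ == "__main__":
    # Only prints the command line; call run_cmemit(...) to actually execute it.
    print(build_cmemit_args("my.cm", mode="alignment", n=5))
```

Building the argument list separately from running it keeps the command easy to inspect
and log before any sequences are generated.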

The <cmfile> may contain a library of CMs, in which case each CM will be used in turn.

<cmfile> may be '-' (dash), which means reading this input from stdin rather than a file.

For models with zero basepairs, sequences are sampled from the profile HMM filter instead
of the CM. However, since these models will be nearly identical (unless special options
were used in cmbuild to prevent this), using the HMM instead of the CM will not change the
output in a significant way, unless the -l option is used. With -l, the HMM will be
configured for equiprobable model begin and end positions, while the CM will not. You can
force cmemit to always sample from the CM with the --nohmmonly option.

OPTIONS


-h Help; print a brief reminder of command line usage and available options.

-o <f> Save the synthetic sequences to file <f> rather than writing them to stdout.

-N <n> Generate <n> sequences. The default value for <n> is 10.

-u Write the generated sequences in unaligned format (FASTA). This is the default
behavior.

-a Write the generated sequences in an aligned format (STOCKHOLM) with consensus
structure annotation rather than FASTA. Other output formats are possible with the
--outformat option.

-c Predict a single majority-rule consensus sequence instead of sampling sequences
from the CM's probability distribution. Highly conserved residues (base-paired
residues that score higher than 3.0 bits, or single-stranded residues that score
higher than 1.0 bits) are shown in upper case; others are shown in lower case.
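
The upper/lower case rule for -c can be written out as a small function. This is only an
illustrative sketch of the thresholds stated above; the function name consensus_case is
hypothetical, and how cmemit computes the per-residue bit scores is not covered here.

```python
def consensus_case(residue, bits, base_paired):
    """Return the residue in upper case if it counts as highly conserved.

    Thresholds are taken from the -c description: more than 3.0 bits for
    base-paired residues, more than 1.0 bits for single-stranded residues.
    """
    threshold = 3.0 if base_paired else 1.0
    return residue.upper() if bits > threshold else residue.lower()
```

Note that the comparison is strict, matching the "higher than" wording: a base-paired
residue scoring exactly 3.0 bits stays in lower case in this sketch.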

-e <n> Embed the CM emitted sequences in a larger randomly generated sequence of length
<n> generated from an HMM that was trained on real genomic sequences with various
GC contents (the same HMM used by cmcalibrate). You can use the --iid option to
generate 25% A, C, G, and U sequence instead. The CM emitted sequence will begin
at a random position within the larger sequence and will be included in its
entirety unless the --u5p or --u3p options are used. When -e is used in
combination with --u5p, the CM emitted sequence will always begin at position 1 of
the larger sequence and will be truncated 5'. When used in combination with --u3p the CM
emitted sequence will always end at position <n> of the larger sequence and will be
truncated 3'.
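
To make the embedding idea concrete, here is a minimal sketch of the --iid case: a
background of length <n> with each of A, C, G and U drawn at 25%, and the emitted sequence
placed whole at a random position. The function name embed_iid is hypothetical, the
uniform choice of position is an assumption, and this is not cmemit's implementation. In
particular, raising an error when the emitted sequence is longer than <n> is a choice made
for this sketch only.

```python
import random


def embed_iid(emitted, n, rng):
    """Embed `emitted` in its entirety inside an iid background of length n.

    Returns the larger sequence and the 0-based offset where `emitted` starts.
    """
    if len(emitted) > n:
        raise ValueError("emitted sequence is longer than the target length")
    # Background residues: each of A, C, G, U with probability 0.25.
    background = [rng.choice("ACGU") for _ in range(n)]
    # Pick a start so that the whole emitted sequence fits (assumed uniform).
    start = rng.randint(0, n - len(emitted))
    background[start:start + len(emitted)] = list(emitted)
    return "".join(background), start


if __name__ == "__main__":
    seq, start = embed_iid("GGGAAACCC", 30, random.Random(1))
    print(start, seq)
```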

-l Configure the CMs into local mode before emitting sequences. By default the model
will be in global mode. In local mode, large insertions and deletions are more
common than in global mode.

OPTIONS FOR TRUNCATING EMITTED SEQUENCES


--u5p Truncate all emitted sequences at a randomly chosen start position <n>, by only
outputting residues beginning at <n>. A different start point is randomly chosen
for each sequence.

--u3p Truncate all emitted sequences at a randomly chosen end position <n>, by only
outputting residues up to position <n>. A different end point is randomly chosen
for each sequence.
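
The effect of --u5p and --u3p on a single sequence can be sketched as follows. The
function names are hypothetical, positions are treated as 1-based, and drawing the
position uniformly is an assumption; this page does not state how cmemit chooses it.

```python
import random


def truncate_5p(seq, rng):
    """Keep only residues beginning at a randomly chosen start position."""
    start = rng.randint(1, len(seq))  # 1-based start position (assumed uniform)
    return seq[start - 1:]


def truncate_3p(seq, rng):
    """Keep only residues up to a randomly chosen end position."""
    end = rng.randint(1, len(seq))  # 1-based end position (assumed uniform)
    return seq[:end]
```

Because a fresh position is drawn on every call, applying either function to each emitted
sequence gives a different truncation point per sequence, as described above.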

--a5p <n>
In combination with the -a option, truncate the emitted alignment at start match
position <n>, by only outputting alignment columns for positions
after match state <n> - 1. <n> must be an integer between 0 and the consensus
length of the model (which can be determined using the cmstat program). As a special
case, using 0 as <n> will result in a randomly chosen start position.

--a3p <n>
In combination with the -a option, truncate the emitted alignment at end match
position <n>, by only outputting alignment columns for positions
before match state <n> + 1. <n> must be an integer between 1 and the consensus
length of the model (which can be determined using the cmstat program). As a
special case, using 0 as <n> will result in a randomly chosen end position.

OTHER OPTIONS


--seed <n>
Seed the random number generator with <n>, an integer >= 0. If <n> is nonzero,
stochastic sampling of sequences will be reproducible; the same command will give
the same results. If <n> is 0, the random number generator is seeded arbitrarily,
and stochastic samplings will vary from run to run of the same command. The
default seed is 0.
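
The seeding behavior described for --seed follows a common pattern, shown here with
Python's own random module rather than cmemit's generator. The names make_rng and sample
are hypothetical and the sampled strings are just uniform random residues, not CM output.

```python
import random


def make_rng(seed=0):
    """Return a generator: reproducible for a nonzero seed, arbitrary for 0."""
    if seed < 0:
        raise ValueError("seed must be an integer >= 0")
    if seed == 0:
        return random.Random()  # seeded arbitrarily from the operating system
    return random.Random(seed)  # same seed gives the same stream


def sample(rng, length=12):
    """Draw a toy sequence of uniform random residues."""
    return "".join(rng.choice("ACGU") for _ in range(length))


if __name__ == "__main__":
    print(sample(make_rng(42)) == sample(make_rng(42)))  # nonzero seed: reproducible
    print(sample(make_rng(0)), sample(make_rng(0)))      # seed 0: varies between runs
```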

--iid With -e, generate the larger sequences as 25% each A, C, G and U.

--rna Specify that the emitted sequences be output as RNA sequences. This is true by
default.

--dna Specify that the emitted sequences be output as DNA sequences. By default, the
output alphabet is RNA.

--idx <n>
Specify that the emitted sequences be named starting with <modelname>.<n>. By
default <n> is 1.

--outformat <s>
With -a, specify the output alignment format as <s>. Acceptable formats are: Pfam,
AFA, A2M, Clustal, and Phylip. AFA is aligned fasta. Only Pfam and Stockholm
alignment formats will include consensus structure annotation.

--tfile <f>
Dump tabular sequence parsetrees (tracebacks) for each emitted sequence to file
<f>. Primarily useful for debugging.

--exp <x>
Exponentiate the emission and transition probabilities of the CM by <x> and then
renormalize those distributions before emitting sequences. This option changes the
CM probability distribution of parsetrees relative to default. With <x> less than
1.0 the emitted sequences will tend to have lower bit scores upon alignment to the
CM. With <x> greater than 1.0, the emitted sequences will tend to have higher bit
scores upon alignment to the CM. This bit score difference will increase as <x>
moves further away from 1.0 in either direction. If <x> equals 1.0, this option
has no effect relative to default. This option is useful for generating sequences
that are either more difficult (<x> < 1.0) or easier (<x> > 1.0) for the CM to
distinguish as homologous from background, random sequence.
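
The exponentiate-and-renormalize step can be illustrated on a single probability
distribution. This is a sketch of the arithmetic only, with a hypothetical function name
and a made-up four-residue distribution; it does not reproduce how cmemit applies the
operation across a whole CM.

```python
def exponentiate(probs, x):
    """Raise each probability to the power x, then rescale so they sum to 1."""
    powered = [p ** x for p in probs]
    total = sum(powered)
    return [p / total for p in powered]


if __name__ == "__main__":
    emission = [0.7, 0.1, 0.1, 0.1]      # made-up example distribution
    print(exponentiate(emission, 2.0))   # x > 1.0: mass concentrates on the likeliest residue
    print(exponentiate(emission, 0.5))   # x < 1.0: distribution flattens
    print(exponentiate(emission, 1.0))   # x = 1.0: unchanged
```

With <x> greater than 1.0 the most probable outcome becomes even more probable, which is
why sampled sequences tend to score higher against the model; with <x> less than 1.0 the
distribution moves toward uniform and the sampled sequences tend to score lower.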

--hmmonly
Emit from the filter profile HMM instead of the CM.

--nohmmonly
Never emit from the filter profile HMM, always use the CM, even for models with
zero basepairs.
