

NAME


mlv-smile - inference of structured signals in multiple sequences

SYNOPSIS


mlv-smile <parameter_file>
mlv-smile [-g number]

DESCRIPTION


This manual page briefly documents the mlv-smile command. For more details and examples,
see the documentation files installed with it.

mlv-smile is a program that was primarily designed to extract promoter sequences from DNA
sequences. Its main interest is that it simultaneously infers several motifs (called
boxes) that respect distance constraints. The user writes in a parameter file the
list of criteria that the signal must satisfy. In a first step (extraction), all
signals satisfying these criteria are found. In a second step, they are all statistically
evaluated in order to detect those that are exceptionally represented in the original
sequences. Since version 1.4, mlv-smile can extract such signals over any
alphabet, in any kind of sequences.

OPTIONS


The program normally expects a parameter file that contains all the required criteria. The
only option is:

-g number
writes to standard output a generic parameter file for extracting signals made of
number boxes.

HOW TO


How to use mlv-smile?
The only command you'll use is 'mlv-smile'. You give it a single argument: the name of a
parameter file that describes the characteristics of the motifs you want to extract.

How to start?
You first have to write an alphabet file, which contains the alphabet used to describe the
motifs. Then you have to write a parameter file, and you're ready to use mlv-smile.

What should I write in the alphabet file?
The first line should contain the type of the alphabet's elements, chosen among
"Nucleotides", "Proteins" and "Others". This allows mlv-smile to display, for
instance, the "A or G" symbol as an R in DNA sequences. Then write the elements of the
motif's alphabet, one per line.

Example: if you want to extract simple motifs (A,C,G,T) from clean DNA sequences written
with a four-letter alphabet (A,C,G,T), then you may write an alphabet file containing:
Type:Nucleotides
A
C
G
T
Let's call this file 'alpha'.
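
The same file can also be generated from a script. The following is a minimal Python sketch, assuming only the layout described above (a "Type:" line followed by one alphabet element per line); the function name alphabet_file_text is hypothetical and is not part of mlv-smile.

```python
from pathlib import Path

# The three alphabet types named in this manual page.
ALLOWED_TYPES = ("Nucleotides", "Proteins", "Others")


def alphabet_file_text(alphabet_type, symbols):
    """Return alphabet file content: a Type line, then one element per line.

    Hypothetical helper; the layout follows the description in this manual.
    """
    if alphabet_type not in ALLOWED_TYPES:
        raise ValueError(f"type must be one of {ALLOWED_TYPES}")
    lines = [f"Type:{alphabet_type}"] + list(symbols)
    return "\n".join(lines) + "\n"


# Content of the 'alpha' file used in the examples of this page.
ALPHA_TEXT = alphabet_file_text("Nucleotides", ["A", "C", "G", "T"])

if __name__ == "__main__":
    Path("alpha").write_text(ALPHA_TEXT)
```

Running the script writes the 'alpha' file in the current directory.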

How to write a simple parameter file?
You first have to write an alphabet file. You also need a sequence file in FASTA
format. Then, you can create a parameter file, using the "mlv-smile -g number_of_boxes"
command to help you.

Example: Let's write a parameter file to extract simple motifs. If you don't already have
one, let's first create a small DNA file in FASTA format, containing several sequences:

> Seq A
AGGCTAGTCAGGGCATGCGATCAGCAGGCATCAGGCGAGCATCGACAGCA
> Seq B
GGAGAGCGCAGAGCGAGCATCATCATGCAGCATCAGAGATCTTTCT
Let's call this file 'seq'.

Our purpose is now to extract from these sequences all motifs of length 13 that appear at
least once in 100% of the sequences, allowing one substitution. We may write the
following parameter file (with the help of the 'mlv-smile -g 1' command):
FASTA file seq // previously created
Output file results

Alphabet file alpha //previously created
Quorum 100
Total min length 13
Total max length 13
Total substitutions 1
Boxes 1
Let's call this file 'param'.

How to extract a simple motif?
You can launch "mlv-smile" after having created the alphabet and parameter files.

Example: With the previous alphabet, sequences and parameter files, you can now launch
mlv-smile: "mlv-smile param". You will obtain the following motifs in the "results" file:
GCGAGCATCAACA 2120210310010 2
Seq 1 Pos 12
Seq 0 Pos 34
2
GCGAGCATCGTCA 2120210312310 2
Seq 1 Pos 12
Seq 0 Pos 34
2
The first motif found, GCGAGCATCAACA, appears at position 12 in the second sequence and
position 34 in the first one (all position and sequence numbers start at zero).
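
The reported positions can be checked by hand. The sketch below is not the algorithm used by mlv-smile: it is a naive scan, written in Python under the assumption that an occurrence is any window differing from the motif by at most the allowed number of substitutions. The names hamming and occurrences are hypothetical.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def occurrences(motif, sequences, max_subst):
    """Return (sequence index, position) pairs, both counted from zero."""
    hits = []
    for seq_index, seq in enumerate(sequences):
        for pos in range(len(seq) - len(motif) + 1):
            window = seq[pos:pos + len(motif)]
            if hamming(motif, window) <= max_subst:
                hits.append((seq_index, pos))
    return hits


# The two sequences of the 'seq' file created earlier.
SEQS = [
    "AGGCTAGTCAGGGCATGCGATCAGCAGGCATCAGGCGAGCATCGACAGCA",  # Seq A, index 0
    "GGAGAGCGCAGAGCGAGCATCATCATGCAGCATCAGAGATCTTTCT",      # Seq B, index 1
]

for motif in ("GCGAGCATCAACA", "GCGAGCATCGTCA"):
    print(motif, occurrences(motif, SEQS, max_subst=1))
```

For both motifs the scan finds one occurrence at position 34 of sequence 0 and one at position 12 of sequence 1, in agreement with the "results" file.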

How to evaluate the significance of the motifs found?
You have to add some evaluation lines at the end of the parameter file.

Example: At the bottom of the previous "param" parameter file, you can add:
Shufflings 100
Size k-mer 2
which means that the original sequences will be shuffled 100 times, preserving
dinucleotides. The significance of the motifs found previously is computed from their
frequency of occurrence in the shuffled sequences. The more shufflings you perform,
the more stable the results, but the longer the computation.

For this example, you may obtain results such as the following (in the "results.shuffle" file):
STATISTICS ON THE NUMBER OF SEQUENCES HAVING AT LEAST ONE OCCURRENCE
Model          %right   #right  %shfl.  #shfl.  Sigma  Chi2  Z-score
====================================================================
GCGAGCATCGTCA  100.00%  2       0.50%   0.01    0.10   3.96  19.90
GCGAGCATCAACA  100.00%  2       1.00%   0.02    0.14   3.92  14.07

STATISTICS ON THE TOTAL NUMBER OF OCCURRENCES
Model          #right  #shfl.  Sigma  Chi2  Z-score
===================================================
GCGAGCATCGTCA  2       0.01    0.10   1.99  19.90
GCGAGCATCAACA  2       0.02    0.14   1.96  14.07

The first block of results shows the statistics on the number of sequences having at least
one occurrence. For each motif found, you can read its frequency of occurrence in the
original and shuffled sequences, and the two statistical scores (Chi2 and Z-score) derived
from them. Motifs are sorted by decreasing Z-score. A high Z-score means that the motif
is surprisingly frequent in the original sequences.
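
This page does not state how the Z-score is computed. The Python sketch below assumes the usual definition: the observed count, minus the mean count over the shufflings, divided by the sample standard deviation of those counts. That assumption reproduces the Sigma and Z-score values of the example. The per-shuffling counts are hypothetical, chosen only to match the #shfl. averages shown above.

```python
from statistics import mean, stdev


def z_score(observed, shuffled_counts):
    """(observed - mean of shuffled counts) / sample standard deviation.

    Assumed formula; the manual page does not spell it out.
    """
    return (observed - mean(shuffled_counts)) / stdev(shuffled_counts)


# Hypothetical counts over 100 shufflings, matching the averages in the table.
counts_gtca = [1] + [0] * 99      # mean 0.01, as for GCGAGCATCGTCA
counts_aaca = [1, 1] + [0] * 98   # mean 0.02, as for GCGAGCATCAACA

print(round(stdev(counts_gtca), 2), round(z_score(2, counts_gtca), 2))
print(round(stdev(counts_aaca), 2), round(z_score(2, counts_aaca), 2))
```

With these counts the sketch yields Sigma 0.10 and Z-score 19.90 for the first motif, and Sigma 0.14 and Z-score 14.07 for the second.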

How to extract structured motifs?
The parameter file should be modified to indicate the characteristics of the structured
motifs to infer. You have to write global parameters for the whole motif, and local
parameters for each box of it.

Example: Let's extract from the previous "seq" sequences structured motifs composed of 2
boxes of length 5 to 6, where the whole motif must have length 11. The two boxes may be
separated by 10 to 15 nucleotides. If you allow at most one substitution in each box, and
require at least one occurrence of a motif in 100% of the sequences, you may write the
following parameter file:
FASTA file seq
Output file results

Alphabet file alpha
Quorum 100
Total min length 11
Total max length 11
Total substitutions 2
Boxes 2

BOX 1 ================
Min length 5
Max length 6
Substitutions 1
Min spacer length 10
Max spacer length 15

BOX 2 ================
Min length 5
Max length 6
Substitutions 1
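
To see how the global and local length criteria interact in this example, the following Python sketch enumerates the box lengths that satisfy both. It only illustrates the arithmetic and makes no claim about how mlv-smile works internally; the function name box_length_combinations is hypothetical, and the special value 0 ("infinity") for maximum lengths is not handled.

```python
from itertools import product


def box_length_combinations(box_ranges, total_min, total_max):
    """List box-length tuples whose sum lies in [total_min, total_max]."""
    choices = [range(low, high + 1) for low, high in box_ranges]
    return [c for c in product(*choices) if total_min <= sum(c) <= total_max]


# Two boxes of length 5 to 6, total motif length exactly 11.
combos = box_length_combinations([(5, 6), (5, 6)], 11, 11)
print(combos)

# The total length ignores the spacer, so the region covered on a sequence
# is the sum of the box lengths plus a spacer of 10 to 15 symbols.
spans = sorted({sum(c) + gap for c in combos for gap in range(10, 16)})
print(spans[0], spans[-1])
```

Only the combinations (5, 6) and (6, 5) remain, and an occurrence covers between 21 and 26 symbols of a sequence.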

PARAMETER FILE CRITERIA


FASTA File <filename>
The name of the file that contains the sequences to use for inference. These
sequences must be in FASTA format. This file must contain at least two
sequences, as you cannot detect motifs which are common to several sequences in one
sequence!

Output file <filename>
The name of the file where results of extraction will be written.

Alphabet file <filename>
The name of the file that tells mlv-smile which alphabet to use to infer
motifs. The first line of this file contains "Type:" followed by the type of
symbols you use, chosen among "Nucleotides", "Proteins" and "Others". Each
following line lists the symbols of the sequences that may be
matched by one symbol of a motif. A line containing "ANR" means that there is a
symbol in the motif's alphabet which matches A, N or R in the sequences. If Type is
set to Nucleotides, mlv-smile will display this ANR symbol as an A to make
it more readable. These associations are printed at the beginning of the
execution.

Quorum <number>
The percentage of sequences in which at least one occurrence of a motif must appear to
make it valid. 100 means that a motif must have occurrences in every sequence.

Total min length <number>
The minimal length of the whole motif, i.e. the minimal sum of the lengths of its boxes.
Warning: the length of the gaps between boxes must not be taken into account. The
total minimal length may differ from the sum of the boxes' minimal lengths: you can, for
instance, infer motifs made of two boxes, each with a min length of 4, and
a total min length of 10.

Total max length <number>
Same explanation as "Total min length", except that a length of 0 means "infinity".

Total substitutions <number>
Total maximum number of substitutions for the whole motif. As for the total length,
this is not necessarily the sum of each box's substitution number.

Boxes <number>
The number of boxes that compose the motifs to infer. When inferring simple one-box
motifs, it is not necessary to use local criteria, as global and local criteria
will be the same.

Composition in <symbol> <number> [OPTIONAL]
The number of occurrences of a given symbol of the motif's alphabet may be limited to
a maximum by this criterion.

BOX <number>
Begins the description of the criteria of a given box of the motif.

Min length <number>
Minimum length for the current box.

Max length <number>
Same explanation as "Min length", except that a length of 0 means "infinity".

Substitution <number>
Maximum number of substitutions allowed for the current box.

Composition in <symbol> <number> [OPTIONAL]
Same as the global composition, but for the current box.

Min spacer length <number>
Minimum number of symbols between the end of the current box and the beginning of
the next one. This parameter must not appear in the last box's criteria, since the
last box has no next box.

Max spacer length <number>
Same explanation as "Min spacer length", but for the maximum number of symbols.

Delta <number> [OPTIONAL]
This criterion allows one to infer motifs composed of several boxes without knowing
precisely the distance between these boxes. The min and max spacer lengths are
used as a "large" interval, and the delta value defines the size of small
intervals inside this large one. An inference of two-box motifs with a [10-20]
range of distance between the boxes will produce motifs whose occurrences respect
this range. A "Delta" criterion set to 2, for instance, will perform the same
inference in all the possible ranges [i-delta, i+delta] (here: [10-14], [11-15],
...). One output file is produced per range.

Palindrome of box <number> [OPTIONAL]
Indicates that the box concerned must be the biological palindrome of one of the
previous boxes.

Shufflings <number> [OPTIONAL]
The number of shufflings of the original sequences to perform for the evaluation of
the statistical significance of the motifs found.

Size k-mer <number> [OPTIONAL, always with shuffling]
Length of the words to conserve during shufflings (usually 2).

Against wrong sequences <filename> [OPTIONAL]
Another method to evaluate the significance of the motifs (not compatible with the
shuffling method). If you have a sequence file in which you believe
that the motifs you look for in the first set of sequences do not appear, you can give
this file to mlv-smile. The statistical evaluation of the motifs found will
then be made by computing their frequency in the "wrong sequences".
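
As an illustration of the "Delta" criterion described above, the following Python sketch lists the small ranges [i-delta, i+delta] contained in a large spacer interval. The manual only gives the first two ranges of its example, so the handling of the interval boundaries here is an assumption, and the function name delta_ranges is hypothetical.

```python
def delta_ranges(min_spacer, max_spacer, delta):
    """Sub-ranges [i - delta, i + delta] lying inside [min_spacer, max_spacer].

    Boundary handling is an assumption; it is not specified in this manual.
    """
    return [(i - delta, i + delta)
            for i in range(min_spacer + delta, max_spacer - delta + 1)]


# Large interval [10-20] with Delta set to 2, as in the example above.
ranges = delta_ranges(10, 20, 2)
for low, high in ranges:
    print(f"[{low}-{high}]")
```

Under this assumption the example produces seven ranges, from [10-14] to [16-20], and therefore seven output files.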

WARNING


mlv-smile is an exact combinatorial algorithm. It is not designed to infer every kind of
motif. The amount of data on which the extraction is run can be very large, but some
criteria (in particular the number of substitutions) must be kept to reasonable values:
one or two substitutions allowed in a motif of length 10 is fine, but not 6 or 8
substitutions. The notion of spacers exists to avoid the use of too many substitutions.
