Thresholds for the power and length of motifs are set by the user. Sets of motifs can be constructed using different scoring matrices.
When the standard constant M=M(W) is chosen, a
large number of the obtained motifs will be "noise", despite their formal
statistical significance. This noise, which hinders further alignment construction by
enlarging the search, is caused by the null hypothesis that the sequences being
aligned are independent (indeed, the very desire to align sequences is evidence of their dependence!).
Clearly, for dependent (similar) sequences the mean mismatch weight is higher than the value
given by formula (2). The program allows the user to account for this by increasing M (by
setting the parameter "Minimum homology ratio" in the
interval from 0 to 1; 0.01 is recommended). Experiments demonstrate that this procedure
filters out the majority of noise motifs.
DNA vs. protein: the program counts the A, C, G, T, U and
N characters in each sequence. If 80% or more of the characters in a sequence are among these,
DNA/RNA is assumed; otherwise the sequence is treated as protein.
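The 80% rule above can be sketched in a few lines of Python. This is only an illustration, not the program's own code: the function name guess_sequence_type, the case-insensitive comparison and the decision to count only alphabetic characters are assumptions.

```python
NUCLEOTIDE_LETTERS = set("ACGTUN")


def guess_sequence_type(sequence, cutoff=0.8):
    """Return "DNA/RNA" if at least `cutoff` of the letters are A, C, G, T, U or N.

    Assumptions: letters are compared case-insensitively, and characters
    that are not letters (digits, spaces, line breaks) are ignored.
    """
    letters = [ch for ch in sequence.upper() if ch.isalpha()]
    if not letters:
        raise ValueError("sequence contains no letters")
    nucleotide_count = sum(1 for ch in letters if ch in NUCLEOTIDE_LETTERS)
    if nucleotide_count / len(letters) >= cutoff:
        return "DNA/RNA"
    return "protein"
```

For instance, guess_sequence_type("ACGTACGTNN") gives "DNA/RNA", while the first line of the FOSB_HUMAN example is classified as "protein" because only a small share of its letters fall in the nucleotide alphabet.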
To exclude noise motifs from the graphic, the noise threshold must be equal to the noise upper bound. The text output contains only non-noise motifs. If the noise upper bound is 0, all motifs are considered to be non-noise.
A value of 3 to 5 is recommended.
Your Sequences
All sequences (cut & paste) must be in FASTA format, for example:
>FOSB_HUMAN P53539 homo sapiens (human). fosb protein
MFQAFPGDYDSGSRCSSSPSAESQYLSSVDSFGSPPTAAASQECAGLGEMPGSFVPTVTA
ITTSQDLQWLVQPTLISSMAQSQGQPLASQPPVVDPYDMPGTSYSTPGMSGYSSGGASGS
GGPSTSGTTSGPGPARPARARPRRPREETLTPEEEEKRRVRRERNKLAAAKCRNRRRELT
DRLQAETDQLEEEKAELESEIAELQKEKERLEFVLVAHKPGCKIPYEEGPGPGPLAEVRD
LPGSAPAKEDGFSWLLPPPPPPPLPFQTSQDAPPNLTASLFTHSEVQVLGDPFPVVNPSY
TSSFVLTCPEVSAFAGAQRTSGSDQPSDPLNSPSLLAL
>FOSB_MOUSE P13346 mus musculus (mouse). fosb protein.
MFQAFPGDYDSGSRCSSSPSAESQYLSSVDSFGSPPTAAASQECAGLGEMPGSFVPTVTA
ITTSQDLQWLVQPTLISSMAQSQGQPLASQPPAVDPYDMPGTSYSTPGLSAYSTGGASGS
GGPSTSTTTSGPVSARPARARPRRPREETLTPEEEEKRRVRRERNKLAAAKCRNRRRELT
DRLQAETDQLEEEKAELESEIAELQKEKERLEFVLVAHKPGCKIPYEEGPGPGPLAEVRD
LPGSTSAKEDGFGWLLPPPPPPPLPFQSSRDAPPNLTASLFTHSEVQVLGDPFPVVSPSY
TSSFVLTCPEVSAFAGAQRTSGSEQPSDPLNSPSLLAL
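For readers who want to check their input before pasting it, the FASTA layout used in the example (a header line starting with ">" followed by one or more sequence lines) can be read with a short Python sketch. The function name parse_fasta is hypothetical, and the sketch makes no claim about how the program itself parses input.

```python
def parse_fasta(text):
    """Parse FASTA text into a list of (header, sequence) pairs.

    A record starts at a line beginning with ">"; the following lines,
    up to the next ">" line, are joined into one sequence string.
    Blank lines are skipped.
    """
    records = []
    header = None
    chunks = []
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header = line[1:].strip()
            chunks = []
        elif header is None:
            raise ValueError("sequence data found before the first '>' header line")
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records
```

Applied to the two records above, this would return two pairs whose headers begin with FOSB_HUMAN and FOSB_MOUSE.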
General options
Map title
Type in a title for this session so that you can recognize it later.
Number of motifs to plot
You may choose a maximum number of motifs to be plotted in the graphical map.
All or some of them have references to the generated text output; see the next option.
"Noise" motifs, if any, are plotted in light gray; significant motifs are plotted with
thicker lines and dark colors. See the power options.
Number of motifs in the text output
You may choose a maximum number of motifs to be included in the text output.
"Noise" motifs, i.e. those with power less than the noise upper bound, are not included.
DotHelix options
Length threshold
Enter a minimum length of motifs (7 is recommended).
Power noise threshold (DotHelix threshold)
The DotHelix procedure itself operates only with a "noise" threshold, which defines the noise lower bound.
If the noise threshold is set to 0, the resulting graphic typically shows the set of non-noise
motifs over a gray background of noise motifs. The color of the non-noise motifs ranges from yellow to dark brown (significant) depending on the computed power value.
Accurate DotHelix
The DotHelix procedure is only slightly less efficient than the "window method": as a rule it requires about N * ln(N) operations instead of N for the window method, but in complicated cases the number can reach N * N. To avoid such cases, the "Accurate DotHelix" parameter can be switched "Off", which decreases the number of operations.
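To get a feeling for the three growth rates mentioned above, the following Python sketch prints rough operation counts for two values of N. Constant factors are ignored, the function name operation_estimates is hypothetical, and N is simply the problem size as used in the text.

```python
import math


def operation_estimates(n):
    """Return rough operation counts for problem size n (constant factors ignored)."""
    return {
        "window method, N": n,
        "DotHelix typical, N * ln(N)": n * math.log(n),
        "DotHelix worst case, N * N": n * n,
    }


for n in (1_000, 100_000):
    for label, count in operation_estimates(n).items():
        print(f"N = {n:>7}  {label:<30} {count:,.0f}")
```

For N = 1000 the typical DotHelix estimate is roughly seven times the window-method count, while the worst case is a thousand times larger, which is why a switch that limits the work can matter for long sequences.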
Other options
Power basic threshold - noise upper bound
First, the basic threshold defines the text output: only motifs with
power greater than the basic threshold are included in the output.
The basic threshold also defines the noise range, which extends from the noise threshold
(lower bound) up to the basic threshold (upper bound).
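The interplay of the two thresholds can be sketched in Python as follows. This is an illustration under stated assumptions, not the program's code: the names classify_motif and select_text_output are hypothetical, a motif whose power equals the basic threshold is treated as noise, and the text output is assumed to list the strongest motifs first.

```python
def classify_motif(power, noise_threshold, basic_threshold):
    """Classify a motif by its power.

    Assumed rule: a basic threshold of 0 makes every motif non-noise;
    otherwise power above the basic threshold is "significant", power
    from the noise threshold up to the basic threshold is "noise", and
    anything lower is below the noise lower bound.
    """
    if basic_threshold == 0 or power > basic_threshold:
        return "significant"
    if power >= noise_threshold:
        return "noise"
    return "below noise threshold"


def select_text_output(motifs, noise_threshold, basic_threshold, max_motifs):
    """Pick the motifs for the text output from a list of (label, power) pairs.

    Only non-noise motifs are kept; ordering by descending power is an
    assumption made for this sketch.
    """
    significant = [
        (label, power)
        for label, power in motifs
        if classify_motif(power, noise_threshold, basic_threshold) == "significant"
    ]
    significant.sort(key=lambda item: item[1], reverse=True)
    return significant[:max_motifs]
```

With a noise threshold of 2 and a basic threshold of 4, a motif of power 3 would be drawn as noise in the map and left out of the text output, while a motif of power 6 would appear in both.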
Minimum homology ratio
Minimum homology ratio may be set in the range of 0 to 1, and the value 0.01 is
recommended.
Motif frequences recalc
The power of a motif depends strongly on the frequencies of letters in the sequences being compared. If the letter frequencies in the matching stretches deviate significantly from the values used at the beginning of the calculations (for example, the stretch is polyA), the power must be recalculated with the new frequencies. This decreases the power of insignificant motifs such as the match of a polyA stretch in the query sequence with a polyA stretch in a databank sequence.
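Why local letter frequencies matter can be illustrated with a small Python sketch. It does not reproduce the program's power formula, which is not given here; it only compares the probability that two randomly drawn letters coincide under whole-sequence frequencies and under the frequencies of a polyA stretch. The function names and the toy sequence are assumptions.

```python
from collections import Counter


def letter_frequencies(sequence):
    """Return the relative frequency of each letter in the sequence."""
    if not sequence:
        raise ValueError("empty sequence")
    counts = Counter(sequence.upper())
    total = sum(counts.values())
    return {letter: count / total for letter, count in counts.items()}


def chance_match_probability(freq_a, freq_b):
    """Probability that one letter drawn from each distribution is identical."""
    return sum(p * freq_b.get(letter, 0.0) for letter, p in freq_a.items())


# Toy data (hypothetical): a mixed sequence that ends in a polyA stretch.
whole = "ACGT" * 10 + "A" * 10
stretch = "A" * 10

global_freq = letter_frequencies(whole)
local_freq = letter_frequencies(stretch)
print(chance_match_probability(global_freq, global_freq))
print(chance_match_probability(local_freq, local_freq))
```

With the whole-sequence frequencies, a chance match is fairly unlikely, so a run of matches looks significant; with the polyA frequencies every position matches by chance, so the same run carries no information. Recalculating with the local frequencies is what removes such motifs.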
A C D E F G H I K L M N P Q R S T V W Y
A 8
C 4 13
D 2 1 10
E 3 0 6 9
F 2 2 1 1 10
G 4 1 3 2 1 10
H 2 1 3 4 3 2 12
I 3 3 1 1 4 0 1 8
K 3 1 3 5 1 2 3 1 9
L 3 3 0 1 4 0 1 6 2 8
M 3 3 1 2 4 1 2 5 3 6 9
N 2 1 5 4 1 4 5 1 4 1 2 10
P 3 1 3 3 0 2 2 1 3 1 2 2 11
Q 3 1 4 6 1 2 4 1 5 2 4 4 3 9
R 3 1 2 4 1 2 4 1 6 2 3 4 2 5 9
S 5 3 4 4 2 4 3 2 4 2 3 5 3 4 3 8
T 4 3 3 3 2 2 2 3 3 3 3 4 3 3 3 5 9
V 4 3 1 2 3 1 1 7 2 5 5 1 2 2 1 2 4 8
W 1 2 0 1 5 2 2 1 1 2 3 0 0 2 1 1 2 1 15
Y 2 2 1 2 7 1 6 3 2 3 3 2 1 3 2 2 2 3 6 11
Johnson Matrix
Unique Identifier: 94016587 (MEDLINE)
Authors: Johnson M. S., Overington J. P.
Institution: Department of Crystallography, Birkbeck College, University of London, U.K.
Title: A structural basis for sequence comparisons. An evaluation of scoring methodologies.
Source: Journal of Molecular Biology. 233(4):716-38, 1993 Oct 20.
Abstract:
A residue-exchange matrix has been
derived that is suitable for comparison of amino acid sequences. This
matrix is based on the tabulation of 207,795 amino acid replacements
observed in 65 homologous sets of structurally aligned three-dimensional
structures (235 proteins). The majority of the data is from structural
comparisons where there is between 15 and 40% sequence identity. As a
result, a scoring matrix such as the one devised here should provide a
sensitive basis for the comparison of amino acid sequences and the search
for homologous sequences in amino acid databases. In order to assess the
value of this matrix we have made a comparative analysis with 12 other
published scoring matrices that have been used for the alignment of
protein amino acid sequences. We find that the matrix derived here is
among the better performers in terms of alignment significance, detection
of homologous sequences and the accuracy of alignments.
A C D E F G H I K L M N P Q R S T V W Y
A 16
C 6 26
D 8 0 18
E 9 3 12 18
F 7 5 3 3 20
G 9 2 8 7 1 18
H 7 2 9 7 8 7 22
I 8 2 5 5 10 4 5 18
K 9 1 8 11 4 6 10 5 17
L 6 1 2 4 12 3 5 12 6 17
M 8 5 4 7 9 5 7 12 8 14 21
N 8 2 12 9 6 8 11 5 10 5 6 18
P 9 1 9 8 5 7 5 4 9 7 0 7 20
Q 9 3 9 12 3 7 11 3 11 5 9 9 6 19
R 8 4 6 10 4 7 10 4 13 6 6 8 6 12 20
S 10 2 10 8 5 8 7 5 8 5 5 11 9 9 9 16
T 9 4 8 9 5 6 7 7 10 5 7 10 8 9 8 12 17
V 9 5 5 6 8 4 6 14 6 12 10 4 5 6 5 5 8 17
W 4 1 4 2 13 3 6 6 4 9 9 4 2 2 6 4 0 5 25
Y 6 2 6 6 13 4 9 7 6 7 8 7 3 5 8 6 7 8 12 20
Gonnet Matrix
Unique Identifier: 1604319 (MEDLINE)
Authors: Gonnet G. H., Cohen M. A., Benner S. A.
Institution: Institute for Scientific Computation, Swiss Federal Institute of Technology, Zurich, Switzerland.
Title: Exhaustive matching of the entire protein sequence database.
Source: Science. 1992 Sep 18;257(5077):1609-10.
Abstract:
The entire protein sequence database has been exhaustively matched. Definitive mutation
matrices and models for scoring gaps were obtained from the matching and used to organize the
sequence database as sets of evolutionarily connected components. The methods developed are
general and can be used to manage sequence data generated by major genome sequencing
projects. The alignments made possible by the exhaustive matching are the starting point for
successful de novo prediction of the folded structures of proteins, for reconstructing sequences
of ancient proteins and metabolisms in ancient organisms, and for obtaining new perspectives in
structural biochemistry.
C S T P A G N D E Q H R K M I L V F Y W X *
C 12
S 0 2
T 0 2 2
P -3 0 0 8
A 0 1 1 0 2
G -2 0 -1 -2 0 7
N -2 1 0 -1 0 0 4
D -3 0 0 -1 0 0 2 5
E -3 0 0 0 0 -1 1 3 4
Q -2 0 0 0 0 -1 1 1 2 3
H -1 0 0 -1 -1 -1 1 0 0 1 6
R -2 0 0 -1 -1 -1 0 0 0 2 1 5
K -3 0 0 -1 0 -1 1 0 1 2 1 3 3
M -1 -1 -1 -2 -1 -4 -2 -3 -2 -1 -1 -2 -1 4
I -1 -2 -1 -3 -1 -4 -3 -4 -3 -2 -2 -2 -2 2 4
L -2 -2 -1 -2 -1 -4 -3 -4 -3 -2 -2 -2 -2 3 3 4
V 0 -1 0 -2 0 -3 -2 -3 -2 -2 -2 -2 -2 2 3 2 3
F -1 -3 -2 -4 -2 -5 -3 -4 -4 -3 0 -3 -3 2 1 2 0 7
Y 0 -2 -2 -3 -2 -4 -1 -3 -3 -2 2 -2 -2 0 -1 0 -1 5 8
W -1 -3 -4 -5 -4 -4 -4 -5 -4 -3 -1 -2 -4 -1 -2 -1 -3 4 4 14
X -3 0 0 -1 0 -1 0 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -2 -2 -4 -1
* -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 1
Unitary Matrix for DNA/RNA
A C D E F G H I K L M N P Q R S T V W Y
A 10
C 0 10
D 0 0 10
E 0 0 0 10
F 0 0 0 0 10
G 0 0 0 0 0 10
H 0 0 0 0 0 0 10
I 0 0 0 0 0 0 0 10
K 0 0 0 0 0 0 0 0 10
L 0 0 0 0 0 0 0 0 0 10
M 0 0 0 0 0 0 0 0 0 0 10
N 0 0 0 0 0 0 0 0 0 0 0 10
P 0 0 0 0 0 0 0 0 0 0 0 0 10
Q 0 0 0 0 0 0 0 0 0 0 0 0 0 10
R 0 0 0 0 0 0 0 0 0 0 0 0 0 0 10
S 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 10
T 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 10
V 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 10
W 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 10
Y 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 10
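The matrices above are listed in lower-triangular form: a header line with the letters, then one row per letter containing the scores against all letters up to and including itself. A minimal Python sketch for reading this layout into a symmetric lookup table is given below. The names parse_lower_triangular and ungapped_score are hypothetical, and the excerpt is just the first four rows of the matrix that follows the Gonnet reference.

```python
def parse_lower_triangular(text):
    """Read a lower-triangular matrix listing into a symmetric dict.

    The first line holds the letters; row i holds its letter followed by
    i + 1 integer scores. The result maps (letter, letter) to a score
    in both orders.
    """
    lines = [line.split() for line in text.strip().splitlines()]
    header, rows = lines[0], lines[1:]
    scores = {}
    for row_index, fields in enumerate(rows):
        row_letter = fields[0]
        values = [int(value) for value in fields[1:]]
        if row_letter != header[row_index] or len(values) != row_index + 1:
            raise ValueError(f"malformed row for {row_letter!r}")
        for column_letter, value in zip(header, values):
            scores[(row_letter, column_letter)] = value
            scores[(column_letter, row_letter)] = value
    return scores


def ungapped_score(scores, first, second):
    """Sum the pair scores of two equally long, ungapped stretches."""
    if len(first) != len(second):
        raise ValueError("stretches must have the same length")
    return sum(scores[(a, b)] for a, b in zip(first, second))


EXCERPT = """\
C S T P
C 12
S 0 2
T 0 2 2
P -3 0 0 8
"""

scores = parse_lower_triangular(EXCERPT)
print(ungapped_score(scores, "CSTP", "CTTP"))
```

Storing each score under both letter orders keeps the lookup simple, at the cost of holding every off-diagonal value twice.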