GeneBee DotHelix Motifs' Map Help

References

Leontovich A.M., Brodsky L.I., Gorbalenya A.E. Construction of the full local similarity map for two biopolymers, 1993, Biosystems, 30, 57-63.
Brodsky L.I., Ivanov V.V., Kalaidzidis Ya.L., Leontovich A.M., Nikolaev V.K., Feranchuk S.I., Drachev V.A. GeneBee-NET: Internet-based server for analyzing biopolymers structure, 1995, Biochemistry, 60, 8, 923-928.

Construction of Pairwise Motifs

Pairwise motifs are constructed by the DotHelix procedure (Leontovich et al., 1993).

Thresholds for the power and length of motifs are set by the user. Sets of motifs can be constructed using different matrices.

When the standard constant M=M(W) is chosen, many of the motifs obtained are "noise", despite their formal statistical significance. This noise, which hampers subsequent alignment construction by enlarging the search, is caused by the null hypothesis that the sequences being aligned are independent (indeed, the very desire to align sequences is evidence of their dependence!). Clearly, for dependent (similar) sequences the mean mismatch weight increases as compared to formula (2). The program lets the user account for this by increasing M through the parameter "Minimum homology ratio", which takes values from 0 to 1 (0.01 is recommended). Experiments show that this procedure filters out the majority of noise motifs.

Your Sequences

The sequences (cut & paste) must all be in FASTA format.

Example in FASTA format:

>FOSB_HUMAN P53539 homo sapiens (human). fosb protein
MFQAFPGDYDSGSRCSSSPSAESQYLSSVDSFGSPPTAAASQECAGLGEMPGSFVPTVTA
ITTSQDLQWLVQPTLISSMAQSQGQPLASQPPVVDPYDMPGTSYSTPGMSGYSSGGASGS
GGPSTSGTTSGPGPARPARARPRRPREETLTPEEEEKRRVRRERNKLAAAKCRNRRRELT
DRLQAETDQLEEEKAELESEIAELQKEKERLEFVLVAHKPGCKIPYEEGPGPGPLAEVRD
LPGSAPAKEDGFSWLLPPPPPPPLPFQTSQDAPPNLTASLFTHSEVQVLGDPFPVVNPSY
TSSFVLTCPEVSAFAGAQRTSGSDQPSDPLNSPSLLAL
>FOSB_MOUSE P13346 mus musculus (mouse). fosb protein.
MFQAFPGDYDSGSRCSSSPSAESQYLSSVDSFGSPPTAAASQECAGLGEMPGSFVPTVTA
ITTSQDLQWLVQPTLISSMAQSQGQPLASQPPAVDPYDMPGTSYSTPGLSAYSTGGASGS
GGPSTSTTTSGPVSARPARARPRRPREETLTPEEEEKRRVRRERNKLAAAKCRNRRRELT
DRLQAETDQLEEEKAELESEIAELQKEKERLEFVLVAHKPGCKIPYEEGPGPGPLAEVRD
LPGSTSAKEDGFGWLLPPPPPPPLPFQSSRDAPPNLTASLFTHSEVQVLGDPFPVVSPSY
TSSFVLTCPEVSAFAGAQRTSGSEQPSDPLNSPSLLAL
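
To check pasted input before submitting it, the FASTA layout shown above can be read with a few lines of Python. This is only a minimal sketch: the function name `parse_fasta` and the two short toy records are illustrative assumptions, not part of GeneBee.

```python
def parse_fasta(text):
    """Parse FASTA text into a list of (header, sequence) pairs."""
    records = []
    header = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header = line[1:].strip()
            chunks = []
        elif header is not None:
            # Sequence lines are concatenated until the next header.
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Toy input (hypothetical records, shortened for illustration).
example = """>seq1 first toy record
MFQAFPGDYD
SGSRCSSSPS
>seq2 second toy record
MFQAFPGDYDSG
"""

records = parse_fasta(example)
for header, sequence in records:
    print(header.split()[0], len(sequence))
# prints:
# seq1 20
# seq2 12
```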

DNA vs. PROTEIN: The program counts the A, C, G, T, U and N characters in each sequence. If 80% or more of the characters in a sequence belong to this set, DNA / RNA is assumed; otherwise the sequence is treated as protein.
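
The 80% rule can be illustrated with a short Python sketch. The function name `guess_sequence_type`, the case-insensitive comparison, and the decision to count only alphabetic characters are assumptions made for this example; the server's actual handling of case and non-letter characters is not documented here.

```python
NUCLEOTIDE_CHARS = set("ACGTUN")


def guess_sequence_type(sequence, threshold=0.8):
    """Return "DNA/RNA" if at least `threshold` of the letters are
    A, C, G, T, U or N, and "protein" otherwise."""
    # Assumption: only letters are counted, and case is ignored.
    letters = [c for c in sequence.upper() if c.isalpha()]
    if not letters:
        raise ValueError("sequence contains no letters")
    nucleotide_count = sum(1 for c in letters if c in NUCLEOTIDE_CHARS)
    ratio = nucleotide_count / len(letters)
    return "DNA/RNA" if ratio >= threshold else "protein"


print(guess_sequence_type("ACGTACGTNNACGU"))        # all 14 letters are in the set
print(guess_sequence_type("MFQAFPGDYDSGSRCSSSPS"))  # only 4 of 20 letters are in the set
```

Note that a short protein fragment rich in A, C, G and T residues could be misclassified by this rule, so very short inputs deserve a manual check.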

General options

DotHelix options

Length threshold

Enter the minimum motif length (7 is recommended).

Power noise threshold (DotHelix threshold)

The DotHelix procedure itself works only with a "noise" threshold, which defines the lower bound of the noise range. If the noise threshold is set to 0, the graphic typically shows the non-noise motifs over a gray background of noise motifs. The color of the non-noise motifs ranges from yellow to dark brown (significant), depending on the computed power value.

To exclude noise motifs from the graphic, set the noise threshold equal to the noise upper bound. The text output contains only non-noise motifs. If the noise upper bound is 0, all motifs are considered non-noise.

Accurate DotHelix

The DotHelix procedure is not much less efficient than the "window method": as a rule it requires N * ln(N) operations instead of N for the window method, but in complicated cases the number can reach N * N. To rule out such cases, the procedure has the parameter "Accurate DotHelix", which decreases the number of operations when set to "Off".

Other options

Power basic threshold - noise upper bound

First, the basic threshold defines the text output: only motifs with power greater than the basic threshold are included. The basic threshold also defines the noise range, which extends from the noise threshold (lower bound) up to the basic threshold (upper bound).

A value of 3 to 5 is recommended.
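
One way to read the interplay of the two thresholds is sketched below in Python. The function `classify_motif`, the label "discarded" for motifs below the noise threshold, and the treatment of boundary values are assumptions made for illustration; they are an interpretation of the description above, not the server's code.

```python
def classify_motif(power, noise_threshold, basic_threshold):
    """Classify a motif by its power (hypothetical helper).

    - power above the basic threshold: non-noise (appears in text output)
    - power within [noise_threshold, basic_threshold]: noise (gray background)
    - anything else: discarded (assumed not to be drawn at all)
    """
    if power > basic_threshold:
        return "non-noise"
    # When the two thresholds coincide, the noise range is empty.
    if noise_threshold < basic_threshold and power >= noise_threshold:
        return "noise"
    return "discarded"


# Noise threshold 0, basic threshold 4: weak motifs form the gray background.
print(classify_motif(6.0, 0, 4))  # non-noise
print(classify_motif(2.0, 0, 4))  # noise
# Noise threshold equal to the upper bound: noise motifs are excluded.
print(classify_motif(2.0, 4, 4))  # discarded
# Upper bound 0: every motif with positive power counts as non-noise.
print(classify_motif(1.0, 0, 0))  # non-noise
```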

Minimum homology ratio

Minimum homology ratio may be set in the range of 0 to 1, and the value 0.01 is recommended.

Motif frequences recalc

The power of a motif depends strongly on the letter frequencies in the sequences being compared. If the letter frequencies in the matched stretches deviate significantly from the values used at the beginning of the calculation (for example, the stretch is polyA), the power must be recalculated with the new frequencies. This decreases the power of insignificant motifs such as a match between a polyA stretch in the query sequence and a polyA stretch in a databank sequence.

So, "On" is recommended.
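
The kind of deviation that triggers a recalculation can be made concrete with a small Python sketch. It only compares letter frequencies; the power formula itself is not reproduced here. The helper names and the choice of "largest absolute frequency difference" as the deviation measure are assumptions for illustration, not the criterion GeneBee uses.

```python
from collections import Counter


def letter_frequencies(sequence):
    """Return the relative frequency of each letter in a sequence."""
    counts = Counter(sequence.upper())
    total = sum(counts.values())
    return {letter: n / total for letter, n in counts.items()}


def max_frequency_shift(whole, stretch):
    """Largest absolute difference between the letter frequencies of a
    matched stretch and those of the whole sequence (hypothetical measure)."""
    f_whole = letter_frequencies(whole)
    f_stretch = letter_frequencies(stretch)
    letters = set(f_whole) | set(f_stretch)
    return max(abs(f_stretch.get(c, 0.0) - f_whole.get(c, 0.0)) for c in letters)


# Toy sequence: five ACGT repeats followed by a short polyA tail.
sequence = "ACGT" * 5 + "AAAA"

print(max_frequency_shift(sequence, "AAAA"))  # polyA stretch: shift of 0.625
print(max_frequency_shift(sequence, "ACGT"))  # balanced stretch: shift of 0.125
```

The polyA stretch departs far more from the overall composition than the balanced one, which is the situation in which recalculating the power with stretch-specific frequencies matters.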

Weight Matrices

There are 4 groups of matrices implemented in GeneBee. You may choose any of them (Dayhoff, Blosum, Gonnet and Johnson) at the prompt in the full query page. The default matrix is Blosum62.

Dayhoff Matrix

(modified 250 PAM matrix from Atlas of Protein sequence and structure, (1978), v.5, suppl. 3, pp.345-358)

     A  C  D  E  F  G  H  I  K  L  M  N  P  Q  R  S  T  V  W  Y
A   12
C    8 22
D   10  5 14
E   10  5 13 14
F    6  6  4  5 19
G   11  7 11 10  5 15
H    9  7 11 11  8  8 16
I    9  8  8  8 11  7  8 15
K    9  5 10 10  5  8 10  8 15
L    8  4  6  7 12  6  8 12  7 16
M    9  5  7  8 10  7  8 12 10 14 16
N   10  6 12 11  6 10 12  8 11  7  8 12
P   11  7  9  9  5  9 10  8  9  7  8  9 16
Q   10  5 12 12  5  9 13  8 11  8  9 11 10 14
R    8  6  9  9  6  7 12  8 13  7 10 10 10 11 16
S   11 10 10 10  7 11  9  9 10  7  8 11 11  9 10 12
T   11  8 10 10  7 10  9 10 10  8  9 10 10  9  9 11 13
V   10  8  8  8  9  9  8 14  8 12 12  8  9  8  8  9 10 14
W    4  2  3  3 10  3  7  5  7  8  6  6  4  5 12  8  5  4 27
Y    7 10  6  6 17  5 10  9  6  9  8  8  5  6  6  7  7  8 10 20
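
The matrices on this page are printed as lower triangles. The Python sketch below shows one way to load such a table into a symmetric lookup and sum the pair weights of two equal-length segments. It uses only the first four rows of the Dayhoff table above; the function names are assumptions, and the summed weight is a plain illustration, not the DotHelix power statistic.

```python
DAYHOFF_EXCERPT = """
     A  C  D  E
A   12
C    8 22
D   10  5 14
E   10  5 13 14
"""


def parse_triangular_matrix(text):
    """Read a lower-triangular weight table into a symmetric dict
    keyed by (letter, letter)."""
    lines = [line for line in text.splitlines() if line.strip()]
    columns = lines[0].split()  # header row with the column letters
    weights = {}
    for line in lines[1:]:
        fields = line.split()
        row_letter = fields[0]
        values = [int(v) for v in fields[1:]]
        for column_letter, value in zip(columns, values):
            # Store both orders so lookups are symmetric.
            weights[(row_letter, column_letter)] = value
            weights[(column_letter, row_letter)] = value
    return weights


def segment_weight(weights, first, second):
    """Sum the pair weights of two ungapped segments of equal length."""
    if len(first) != len(second):
        raise ValueError("segments must have equal length")
    return sum(weights[(a, b)] for a, b in zip(first, second))


weights = parse_triangular_matrix(DAYHOFF_EXCERPT)
print(weights[("C", "A")], weights[("A", "C")])  # 8 8
print(segment_weight(weights, "ACDE", "ACED"))   # 12 + 22 + 13 + 13 = 60
```

The same parser works for tables with negative entries, such as the Gonnet matrix, because each field is converted with `int`.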

Blosum62 Matrix

Unique Identifier: 93066354 (MEDLINE)
Authors: Henikoff S., Henikoff J. G.
Institution: Howard Hughes Medical Institute, Fred Hutchinson Cancer Research Center, Seattle, WA 98104.
Title: Amino acid substitution matrices from protein blocks.
Source: Proceedings of the National Academy of Sciences of the United States of America, (1992), v.89, pp.10915-10919.
Abstract:
Methods for alignment of protein sequences typically measure similarity by using a substitution matrix with scores for all possible exchanges of one amino acid with another. The most widely used matrices are based on the Dayhoff model of evolutionary rates. Using a different approach, we have derived substitution matrices from about 2000 blocks of aligned sequence segments characterizing more than 500 groups of related proteins. This led to marked improvements in alignments and in searches using queries from each of the groups.

     A  C  D  E  F  G  H  I  K  L  M  N  P  Q  R  S  T  V  W  Y
A    8
C    4 13
D    2  1 10
E    3  0  6  9
F    2  2  1  1 10
G    4  1  3  2  1 10
H    2  1  3  4  3  2 12
I    3  3  1  1  4  0  1  8
K    3  1  3  5  1  2  3  1  9
L    3  3  0  1  4  0  1  6  2  8
M    3  3  1  2  4  1  2  5  3  6  9
N    2  1  5  4  1  4  5  1  4  1  2 10
P    3  1  3  3  0  2  2  1  3  1  2  2 11
Q    3  1  4  6  1  2  4  1  5  2  4  4  3  9
R    3  1  2  4  1  2  4  1  6  2  3  4  2  5  9
S    5  3  4  4  2  4  3  2  4  2  3  5  3  4  3  8
T    4  3  3  3  2  2  2  3  3  3  3  4  3  3  3  5  9
V    4  3  1  2  3  1  1  7  2  5  5  1  2  2  1  2  4  8
W    1  2  0  1  5  2  2  1  1  2  3  0  0  2  1  1  2  1 15
Y    2  2  1  2  7  1  6  3  2  3  3  2  1  3  2  2  2  3  6 11

Johnson Matrix

Unique Identifier: 94016587 (MEDLINE)
Authors: Johnson M. S., Overington J. P.
Institution: Department of Crystallography, Birkbeck College, University of London, U.K.
Title: A structural basis for sequence comparisons. An evaluation of scoring methodologies.
Source: Journal of Molecular Biology. 233(4):716-38, 1993 Oct 20.
Abstract:
A residue-exchange matrix has been derived that is suitable for comparison of amino acid sequences. This matrix is based on the tabulation of 207,795 amino acid replacements observed in 65 homologous sets of structurally aligned three-dimensional structures (235 proteins). The majority of the data is from structural comparisons where there is between 15 and 40% sequence identity. As a result, a scoring matrix such as the one devised here should provide a sensitive basis for the comparison of amino acid sequences and the search for homologous sequences in amino acid databases. In order to assess the value of this matrix we have made a comparative analysis with 12 other published scoring matrices that have been used for the alignment of protein amino acid sequences. We find that the matrix derived here is among the better performers in terms of alignment significance, detection of homologous sequences and the accuracy of alignments.

    A  C  D  E  F  G  H  I  K  L  M  N  P  Q  R  S  T  V  W  Y
A  16
C   6 26
D   8  0 18
E   9  3 12 18
F   7  5  3  3 20
G   9  2  8  7  1 18
H   7  2  9  7  8  7 22
I   8  2  5  5 10  4  5 18
K   9  1  8 11  4  6 10  5 17
L   6  1  2  4 12  3  5 12  6 17
M   8  5  4  7  9  5  7 12  8 14 21
N   8  2 12  9  6  8 11  5 10  5  6 18
P   9  1  9  8  5  7  5  4  9  7  0  7 20
Q   9  3  9 12  3  7 11  3 11  5  9  9  6 19
R   8  4  6 10  4  7 10  4 13  6  6  8  6 12 20
S  10  2 10  8  5  8  7  5  8  5  5 11  9  9  9 16
T   9  4  8  9  5  6  7  7 10  5  7 10  8  9  8 12 17
V   9  5  5  6  8  4  6 14  6 12 10  4  5  6  5  5  8 17
W   4  1  4  2 13  3  6  6  4  9  9  4  2  2  6  4  0  5 25
Y   6  2  6  6 13  4  9  7  6  7  8  7  3  5  8  6  7  8 12 20

Gonnet Matrix

Unique Identifier: 1604319 (MEDLINE)
Authors: Gonnet G. H., Cohen M. A., Benner S. A.
Institution: Institute for Scientific Computation, Swiss Federal Institute of Technology, Zurich, Switzerland.
Title: Exhaustive matching of the entire protein sequence database.
Source: Science. 1992 Sep 18;257(5077):1609-10.
Abstract:
The entire protein sequence database has been exhaustively matched. Definitive mutation matrices and models for scoring gaps were obtained from the matching and used to organize the sequence database as sets of evolutionarily connected components. The methods developed are general and can be used to manage sequence data generated by major genome sequencing projects. The alignments made possible by the exhaustive matching are the starting point for successful de novo prediction of the folded structures of proteins, for reconstructing sequences of ancient proteins and metabolisms in ancient organisms, and for obtaining new perspectives in structural biochemistry.

   C  S  T  P  A  G  N  D  E  Q  H  R  K  M  I  L  V  F  Y  W  X  *
C 12
S  0  2
T  0  2  2
P -3  0  0  8
A  0  1  1  0  2
G -2  0 -1 -2  0  7
N -2  1  0 -1  0  0  4
D -3  0  0 -1  0  0  2  5
E -3  0  0  0  0 -1  1  3  4
Q -2  0  0  0  0 -1  1  1  2  3
H -1  0  0 -1 -1 -1  1  0  0  1  6
R -2  0  0 -1 -1 -1  0  0  0  2  1  5
K -3  0  0 -1  0 -1  1  0  1  2  1  3  3
M -1 -1 -1 -2 -1 -4 -2 -3 -2 -1 -1 -2 -1  4
I -1 -2 -1 -3 -1 -4 -3 -4 -3 -2 -2 -2 -2  2  4
L -2 -2 -1 -2 -1 -4 -3 -4 -3 -2 -2 -2 -2  3  3  4
V  0 -1  0 -2  0 -3 -2 -3 -2 -2 -2 -2 -2  2  3  2  3
F -1 -3 -2 -4 -2 -5 -3 -4 -4 -3  0 -3 -3  2  1  2  0  7
Y  0 -2 -2 -3 -2 -4 -1 -3 -3 -2  2 -2 -2  0 -1  0 -1  5  8
W -1 -3 -4 -5 -4 -4 -4 -5 -4 -3 -1 -2 -4 -1 -2 -1 -3  4  4 14
X -3  0  0 -1  0 -1  0 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -2 -2 -4 -1
* -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8  1

Unitary Matrix for DNA/RNA

     A  C  D  E  F  G  H  I  K  L  M  N  P  Q  R  S  T  V  W  Y
A   10
C    0 10
D    0  0 10
E    0  0  0 10
F    0  0  0  0 10
G    0  0  0  0  0 10
H    0  0  0  0  0  0 10
I    0  0  0  0  0  0  0 10
K    0  0  0  0  0  0  0  0 10
L    0  0  0  0  0  0  0  0  0 10
M    0  0  0  0  0  0  0  0  0  0 10
N    0  0  0  0  0  0  0  0  0  0  0 10
P    0  0  0  0  0  0  0  0  0  0  0  0 10
Q    0  0  0  0  0  0  0  0  0  0  0  0  0 10
R    0  0  0  0  0  0  0  0  0  0  0  0  0  0 10
S    0  0  0  0  0  0  0  0  0  0  0  0  0  0  0 10
T    0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0 10
V    0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0 10
W    0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0 10
Y    0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0 10

Last updated: January 14, 2001.