2. Leontovich A.M., Brodsky L.I., Gorbalenya A.E. Construction of the full local similarity map for two biopolymers, 1993, Biosystems, 30, 57-63. [Word97 doc]
3. Brodsky L.I., Vasiliev A.V., Kalaidzidis Ya.L., Osipov Yu.S., Tatuzov R.L., Feranchuk S.I. GeneBee: the program package for biopolymer structure analysis, 1992, Dimacs, 8, 127-139. [Word97 doc]
4. Leontovich A.M., Tokmachev K.Y., van Houwelingen H.C. The comparative analysis of statistics, based on the likelihood ratio criterion, in the automated annotation problem, 2008, BMC Bioinformatics, 9, 31. [PDF doc]
Example in FASTA format:
DNA vs. PROTEIN: The program counts the number of A, C, G, T, U and N characters. If 80% or more of the characters in a sequence are among these, then DNA/RNA is assumed, protein otherwise.
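The 80% rule above can be sketched in a few lines of Python. This is only an illustration of the rule as stated here, not the GeneBee code; the function name guess_sequence_type and the handling of whitespace and letter case are assumptions.

```python
def guess_sequence_type(sequence, threshold=0.8):
    """Classify a sequence as "DNA/RNA" or "protein" by its letter composition.

    Assumption: whitespace is ignored and letter case does not matter.
    """
    residues = [ch for ch in sequence.upper() if not ch.isspace()]
    if not residues:
        raise ValueError("empty sequence")
    # Count the characters named in the rule: A, C, G, T, U and N.
    nucleotide_like = sum(ch in "ACGTUN" for ch in residues)
    if nucleotide_like / len(residues) >= threshold:
        return "DNA/RNA"
    return "protein"


print(guess_sequence_type("ACGTNNACGTAC"))
print(guess_sequence_type("MFQAFPGDYDSGSRCSSSPSAESQYLSS"))
```

Note that a short protein fragment made up mostly of A, C, G and T residues would be taken for DNA/RNA by this rule.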
So, "On" is recommended.
Predicting Annotation:
An automatic annotation of a sequence is based on statistics assembled from the results of the homology search, that is, on a prediction of the description elements (DEL) of the given sequence. The theoretical approach and the algorithms underlying automatic sequence annotation are described in detail in the report of A.M. Leontovich [?].
Your Sequence
The sequence (cut & paste) must be in FASTA format.
>FOSB_HUMAN P53539 homo sapiens (human). fosb protein
MFQAFPGDYDSGSRCSSSPSAESQYLSSVDSFGSPPTAAASQECAGLGEMPGSFVPTVTA
ITTSQDLQWLVQPTLISSMAQSQGQPLASQPPVVDPYDMPGTSYSTPGMSGYSSGGASGS
GGPSTSGTTSGPGPARPARARPRRPREETLTPEEEEKRRVRRERNKLAAAKCRNRRRELT
DRLQAETDQLEEEKAELESEIAELQKEKERLEFVLVAHKPGCKIPYEEGPGPGPLAEVRD
LPGSAPAKEDGFSWLLPPPPPPPLPFQTSQDAPPNLTASLFTHSEVQVLGDPFPVVNPSY
TSSFVLTCPEVSAFAGAQRTSGSDQPSDPLNSPSLLAL
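For readers preparing input programmatically, the following is a minimal sketch of how FASTA text like the example above can be split into a header and a sequence. The function parse_fasta is hypothetical and is not part of GeneBee; only the first two sequence lines of the example are used to keep the sketch short.

```python
def parse_fasta(text):
    """Parse FASTA text into a list of (header, sequence) pairs."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        elif header is not None:
            chunks.append(line)
        else:
            raise ValueError("sequence data found before the first '>' line")
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


example = """>FOSB_HUMAN P53539 homo sapiens (human). fosb protein
MFQAFPGDYDSGSRCSSSPSAESQYLSSVDSFGSPPTAAASQECAGLGEMPGSFVPTVTA
ITTSQDLQWLVQPTLISSMAQSQGQPLASQPPVVDPYDMPGTSYSTPGMSGYSSGGASGS
"""
header, sequence = parse_fasta(example)[0]
print(header.split()[0], len(sequence))
```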
General options:
Annotation title
Type in a title that will help you remember this session.
Start/End position of the sequence
Set the borders (Start & End) of the zone of the query sequence in which the homology search will be done.
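As an illustration, selecting such a zone might look as follows. The sketch assumes that positions are 1-based and inclusive, which this page does not state; the function select_zone is hypothetical.

```python
def select_zone(sequence, start=None, end=None):
    """Return the part of `sequence` between `start` and `end`.

    Assumption: positions are 1-based and inclusive; None means
    "from the beginning" or "to the end" of the sequence.
    """
    start = 1 if start is None else start
    end = len(sequence) if end is None else end
    if not 1 <= start <= end <= len(sequence):
        raise ValueError("expected 1 <= start <= end <= sequence length")
    return sequence[start - 1:end]


print(select_zone("MFQAFPGDYDSGSRC", 3, 8))
```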
Additional output
In addition to the text form of the prediction, the following outputs are available:
Graphical alignment - graphical image of the found sequences with the most frequent KW, DE, FT keywords;
FT fragments & DR keywords - graphical image of the found sequences with FT fragments and DR keywords;
Cross-reference map - cross-reference map of the found sequences;
Alignments description - brief description of the found sequences and the first 10 supermotifs;
Statistics - statistics for the most frequent keywords.
DotHelix options:
Motif's length threshold
Enter a minimum length of motifs (7 is recommended).
Motif's power threshold
Enter a minimum power of motifs (3 - 5 is recommended).
Accurate DotHelix
The DotHelix procedure is only slightly less efficient than the "window method": as a rule it requires N * ln(N) operations instead of N for the window method, but in complicated cases the number can reach N * N. To rule out such cases there is the procedure parameter "Accurate DotHelix", which decreases the number of operations when set to "Off".
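To get a feel for the three growth rates named above, the short sketch below evaluates N, N * ln(N) and N * N for a few sequence lengths. These are order-of-magnitude figures with constant factors ignored, not measured running times; the function name operation_estimates is an assumption.

```python
import math


def operation_estimates(n):
    """Return rough operation counts for a sequence of length n."""
    if n < 1:
        raise ValueError("n must be positive")
    return {
        "window": n,                          # window method
        "dothelix_typical": n * math.log(n),  # usual DotHelix case
        "dothelix_worst": n * n,              # complicated cases
    }


for n in (100, 1000, 10000):
    estimates = operation_estimates(n)
    print(n, {name: round(value) for name, value in estimates.items()})
```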
Supermotif's options:
Alignment's power threshold
Enter a minimum power of selected local alignments (6 - 9 is recommended).
Gap penalty
Enter a gap penalty in the units of standard deviation (1 is recommended).
Best shifts
Enter a number of best shifts (5-10 is recommended).
Reported alignments
Restricts the number of matching sequences reported to the number specified. The default limit is 100 sequences.
FT fragment picture options:
Max number of FT keywords
Restricts the number of FT keys to be displayed. The most frequent keys are displayed first. The smaller this value, the better the color distinction. The maximum value is 20.
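Choosing the most frequent keys up to such a limit can be sketched as follows. The helper most_frequent_keys and the sample key list are hypothetical; only the cap of 20 comes from the description above.

```python
from collections import Counter

MAX_FT_KEYS = 20  # maximum value stated in the option description


def most_frequent_keys(ft_keys, limit):
    """Return up to `limit` distinct FT keys, most frequent first."""
    if not 1 <= limit <= MAX_FT_KEYS:
        raise ValueError(f"limit must be between 1 and {MAX_FT_KEYS}")
    return [key for key, _ in Counter(ft_keys).most_common(limit)]


# Hypothetical FT keys collected from the found sequences.
found = ["DOMAIN", "HELIX", "DOMAIN", "CHAIN", "HELIX", "DOMAIN", "TURN"]
print(most_frequent_keys(found, 2))
```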
Excluded fragments
Some fragments can be excluded to make the picture of the other ones clearer:
- Secondary structure: HELIX, TURN, STRAND;
- CHAIN;
- DOMAIN.
DOMAIN extension to qualifier
The qualifier is added to the DOMAIN key to specify the key in more detail.
Other options:
Coincidence ratio
The desired share of coinciding "one-color" pair-patterns on the selected shift, relative to the maximum reached in the case of complete similarity. The value is given as a fraction of 1, in hundredths (0.02 for a protein and 0.08 for a nucleotide query sequence).
Min. homology ratio
Min. homology ratio may be set in the range of 0 to 1, and the value 0.01 is
recommended.
Motif frequencies recalc
The power of a motif depends strongly on the letter frequencies in the compared sequences. If the letter frequencies in the selected matching stretches deviate significantly from the values used at the beginning of the calculations (for example, the stretch is polyA), then the power must be recalculated with the new frequencies. This decreases the power of insignificant motifs such as the match of a polyA stretch in the query sequence with a polyA stretch in a databank sequence.
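The effect of letter composition can be illustrated with a simple quantity: the probability that two letters drawn independently from two stretches coincide by chance. This is not the GeneBee formula for motif power, which is not given on this page; the sketch only shows why a match between two polyA-like stretches is much less surprising than a match under the overall frequencies. All function names are assumptions.

```python
from collections import Counter


def letter_frequencies(sequence):
    """Return the relative frequency of each letter in `sequence`."""
    counts = Counter(sequence)
    return {letter: count / len(sequence) for letter, count in counts.items()}


def chance_match_probability(freq_a, freq_b):
    """Probability that one random letter from each stretch coincide."""
    return sum(p * freq_b.get(letter, 0.0) for letter, p in freq_a.items())


overall = letter_frequencies("ACGT" * 25)     # uniform composition
stretch = letter_frequencies("A" * 19 + "C")  # polyA-like stretch

print(chance_match_probability(overall, overall))
print(chance_match_probability(stretch, stretch))
```

With uniform frequencies a chance coincidence has probability 0.25 per position, while for the polyA-like stretch it is close to 1, so a long run of matches there carries little information.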
Strand
This option sets which frames will be processed. If 'Only forward' is chosen, then the three forward frames will be processed when searching protein against nucleotide databanks, and the single forward strand will be processed when searching nucleotide against nucleotide databanks. If 'Both' is chosen, then all six frames will be processed when searching protein against nucleotide databanks, and both the forward and backward strands will be processed when searching nucleotide against nucleotide databanks.
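The four cases described above can be summarized in a small lookup. The function frames_to_process and the frame labels "+1" to "-3" are hypothetical names chosen for illustration; only the option values 'Only forward' and 'Both' come from this page.

```python
def frames_to_process(query_type, databank_type, strand):
    """List the frames or strands searched for a given Strand setting."""
    if strand not in ("Only forward", "Both"):
        raise ValueError("strand must be 'Only forward' or 'Both'")
    if query_type == "protein" and databank_type == "nucleotide":
        # The databank sequence is read in three frames per strand.
        forward = ["+1", "+2", "+3"]
        backward = ["-1", "-2", "-3"]
    elif query_type == "nucleotide" and databank_type == "nucleotide":
        forward = ["forward strand"]
        backward = ["backward strand"]
    else:
        raise ValueError("only nucleotide databanks are covered by this sketch")
    return forward + backward if strand == "Both" else forward


print(frames_to_process("protein", "nucleotide", "Only forward"))
print(frames_to_process("nucleotide", "nucleotide", "Both"))
```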
Clusterization type
This option makes sense only when searching nucleotide against protein databanks or protein against nucleotide databanks. If 'Each frame separately' is chosen, then the found motifs will be clustered (into supermotifs) separately for each frame. If 'Codirectional jointly' is chosen, then motifs found on codirectional frames (forward or backward) will be clustered jointly, so it is possible to obtain a supermotif containing motifs from 2 or 3 forward (or backward) frames. The reason for this option is probable errors in query or databank sequences.
Weight Matrices:
There are 3 matrices implemented in GeneBee. You may choose any of them (Dayhoff, Blosum62, or Johnson) at the prompt in the full query page. The default matrix is Dayhoff.
Dayhoff Matrix
(modified 250 PAM matrix from Atlas of Protein Sequence and Structure, v. 5, suppl. 3, pp. 345-358):
   A  C  D  E  F  G  H  I  K  L  M  N  P  Q  R  S  T  V  W  Y
A 12
C  8 22
D 10  5 14
E 10  5 13 14
F  6  6  4  5 19
G 11  7 11 10  5 15
H  9  7 11 11  8  8 16
I  9  8  8  8 11  7  8 15
K  9  5 10 10  5  8 10  8 15
L  8  4  6  7 12  6  8 12  7 16
M  9  5  7  8 10  7  8 12 10 14 16
N 10  6 12 11  6 10 12  8 11  7  8 12
P 11  7  9  9  5  9 10  8  9  7  8  9 16
Q 10  5 12 12  5  9 13  8 11  8  9 11 10 14
R  8  6  9  9  6  7 12  8 13  7 10 10 10 11 16
S 11 10 10 10  7 11  9  9 10  7  8 11 11  9 10 12
T 11  8 10 10  7 10  9 10 10  8  9 10 10  9  9 11 13
V 10  8  8  8  9  9  8 14  8 12 12  8  9  8  8  9 10 14
W  4  2  3  3 10  3  7  5  7  8  6  6  4  5 12  8  5  4 27
Y  7 10  6  6 17  5 10  9  6  9  8  8  5  6  6  7  7  8 10 20
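The matrices on this page are printed as lower triangles, so the weight for a pair of residues is the same in either order. A minimal sketch of reading such a table into a symmetric lookup is given below. The function names are assumptions, and only the first five rows of the Dayhoff table above are used to keep the example short.

```python
def parse_lower_triangle(text):
    """Parse a lower-triangular table into a dict keyed by unordered pairs."""
    lines = [line.split() for line in text.strip().splitlines()]
    letters = lines[0]
    weights = {}
    for row_index, fields in enumerate(lines[1:]):
        row_letter, values = fields[0], fields[1:]
        # Row k must be labelled with the k-th letter and hold k values.
        if row_letter != letters[row_index] or len(values) != row_index + 1:
            raise ValueError(f"malformed row for {row_letter!r}")
        for column_letter, value in zip(letters, values):
            weights[frozenset((row_letter, column_letter))] = int(value)
    return weights


def weight(weights, a, b):
    """Look up the weight of the pair (a, b); the order does not matter."""
    return weights[frozenset((a, b))]


DAYHOFF_EXCERPT = """
A C D E F
A 12
C 8 22
D 10 5 14
E 10 5 13 14
F 6 6 4 5 19
"""

dayhoff = parse_lower_triangle(DAYHOFF_EXCERPT)
print(weight(dayhoff, "F", "D"), weight(dayhoff, "D", "F"), weight(dayhoff, "A", "A"))
```

Keying the dictionary by frozenset stores each pair once and makes the lookup symmetric without duplicating entries.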
Blosum62 Matrix
Unique Identifier: 93066354 (MEDLINE)
Authors: Henikoff S., Henikoff J. G.
Institution: Howard Hughes Medical Institute, Fred Hutchinson Cancer Research Center, Seattle, WA 98104.
Title: Amino acid substitution matrices from protein blocks.
Source: Proceedings of the National Academy of Sciences of the United States of America. 89(22):10915-9, 1992 Nov 15.
Abstract: Methods for alignment of protein sequences typically measure similarity by using a substitution matrix with scores for all possible exchanges of one amino acid with another. The most widely used matrices are based on the Dayhoff model of evolutionary rates. Using a different approach, we have derived substitution matrices from about 2000 blocks of aligned sequence segments characterizing more than 500 groups of related proteins. This led to marked improvements in alignments and in searches using queries from each of the groups.
   A  C  D  E  F  G  H  I  K  L  M  N  P  Q  R  S  T  V  W  Y
A  8
C  4 13
D  2  1 10
E  3  0  6  9
F  2  2  1  1 10
G  4  1  3  2  1 10
H  2  1  3  4  3  2 12
I  3  3  1  1  4  0  1  8
K  3  1  3  5  1  2  3  1  9
L  3  3  0  1  4  0  1  6  2  8
M  3  3  1  2  4  1  2  5  3  6  9
N  2  1  5  4  1  4  5  1  4  1  2 10
P  3  1  3  3  0  2  2  1  3  1  2  2 11
Q  3  1  4  6  1  2  4  1  5  2  4  4  3  9
R  3  1  2  4  1  2  4  1  6  2  3  4  2  5  9
S  5  3  4  4  2  4  3  2  4  2  3  5  3  4  3  8
T  4  3  3  3  2  2  2  3  3  3  3  4  3  3  3  5  9
V  4  3  1  2  3  1  1  7  2  5  5  1  2  2  1  2  4  8
W  1  2  0  1  5  2  2  1  1  2  3  0  0  2  1  1  2  1 15
Y  2  2  1  2  7  1  6  3  2  3  3  2  1  3  2  2  2  3  6 11
Johnson Matrix
Unique Identifier: 94016587 (MEDLINE)
Authors: Johnson M. S., Overington J. P.
Institution: Department of Crystallography, Birkbeck College, University of London, U.K.
Title: A structural basis for sequence comparisons. An evaluation of scoring methodologies.
Source: Journal of Molecular Biology. 233(4):716-38, 1993 Oct 20.
Abstract: A residue-exchange matrix has been derived that is suitable for comparison of amino acid sequences. This matrix is based on the tabulation of 207,795 amino acid replacements observed in 65 homologous sets of structurally aligned three-dimensional structures (235 proteins). The majority of the data is from structural comparisons where there is between 15 and 40% sequence identity. As a result, a scoring matrix such as the one devised here should provide a sensitive basis for the comparison of amino acid sequences and the search for homologous sequences in amino acid databases. In order to assess the value of this matrix we have made a comparative analysis with 12 other published scoring matrices that have been used for the alignment of protein amino acid sequences. We find that the matrix derived here is among the better performers in terms of alignment significance, detection of homologous sequences and the accuracy of alignments.
   A  C  D  E  F  G  H  I  K  L  M  N  P  Q  R  S  T  V  W  Y
A 16
C  6 26
D  8  0 18
E  9  3 12 18
F  7  5  3  3 20
G  9  2  8  7  1 18
H  7  2  9  7  8  7 22
I  8  2  5  5 10  4  5 18
K  9  1  8 11  4  6 10  5 17
L  6  1  2  4 12  3  5 12  6 17
M  8  5  4  7  9  5  7 12  8 14 21
N  8  2 12  9  6  8 11  5 10  5  6 18
P  9  1  9  8  5  7  5  4  9  7  0  7 20
Q  9  3  9 12  3  7 11  3 11  5  9  9  6 19
R  8  4  6 10  4  7 10  4 13  6  6  8  6 12 20
S 10  2 10  8  5  8  7  5  8  5  5 11  9  9  9 16
T  9  4  8  9  5  6  7  7 10  5  7 10  8  9  8 12 17
V  9  5  5  6  8  4  6 14  6 12 10  4  5  6  5  5  8 17
W  4  1  4  2 13  3  6  6  4  9  9  4  2  2  6  4  0  5 25
Y  6  2  6  6 13  4  9  7  6  7  8  7  3  5  8  6  7  8 12 20
Unitary Matrix for DNA/RNA:
   A  C  D  E  F  G  H  I  K  L  M  N  P  Q  R  S  T  V  W  Y
A 10
C  0 10
D  0  0 10
E  0  0  0 10
F  0  0  0  0 10
G  0  0  0  0  0 10
H  0  0  0  0  0  0 10
I  0  0  0  0  0  0  0 10
K  0  0  0  0  0  0  0  0 10
L  0  0  0  0  0  0  0  0  0 10
M  0  0  0  0  0  0  0  0  0  0 10
N  0  0  0  0  0  0  0  0  0  0  0 10
P  0  0  0  0  0  0  0  0  0  0  0  0 10
Q  0  0  0  0  0  0  0  0  0  0  0  0  0 10
R  0  0  0  0  0  0  0  0  0  0  0  0  0  0 10
S  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0 10
T  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0 10
V  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0 10
W  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0 10
Y  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0 10