SEARCHING FOR HOMOLOGIES ALONG A SEQUENCE DATABANKS
COMPARISION SENSITIVITY of the GENEBEE SEARCH, BLAST and FASTA
REFERENCES
ALGORITHM
PARAMETERS
REFERENCES:
- Brodsky L.I., Vasiliev A.V., Kalaidzidis Ya.L., Osipov Yu.S., Tatuzov R.L., Feranchuk S.I.
GeneBee: the program package for biopolymer structure analysis, 1992, Dimacs, 8, 127-139. [Word97 doc]
-
Leontovich A.M. On the optimality of the Dayhoff matrix for computing the similarity score
between fragments of biological sequences, 1992, Dimacs, 8, 1-9. [ Word97 doc ]
-
Leontovich A.M.,Brodsky L.I., Gorbalenya A.E. Construction of the full local similarity map for
two biopolymers, 1993, Biosystems, 30,57-63.[Word97 doc]
-
Brodsky L.I., Ivanov V.V., Kalaidzidis Ya.L., Leontovich A.M., Nikolaev V.K., Feranchuk
S.I., Drachev V.A. GeneBee-NET:Internet-based server for analyzing biopolymers structure, 1995,
Biochemistry, 60, 8, 923-928. [Word97 doc]
ALGORITHM
Normally protein sequence (alignment) goes against protein databanks (SWISSPROT and
PDB), but since the comletion of the protein databanks falls far behind the adds to the
databank of nucleotide sequences (EMBL), there is the possibility of screening the protein
sequence along desirable parts of EMBL (with translation of the direct and complementary
chains in six reading frames).
Because of possible long duration of the screening (from 2 minutes
to 24 hours) depending on the query sequence/alignment length and the volume of treated
databank the result will be sent to you by E-mail along with it's WWW demonstration.
By default we suppose that the protein sequence/alignment should be
analized against protein databank (SWISSPROT) and the nucleotide sequence/alignment -
against complete nucleotide databank (EMBL). In protein case it will be taken Dayhoff
matrix by default for similarity calculations and aminoacids will be divided to several
groups ("colors") according to their physico-chemical properties.
The query sequence/alignment should be written in one-letter code
(low or upper case) and can be divided to several strings. In case of alignment it have to
be blank string between sequental batches and at the begining of every batch all sequence
names should be repeated:
KAG2_CAVPO -----------------------------CGGVLVDPQWVLTAAHCIND--S---N
KAG_PIG -----------------------------CGGVLVNPKWVLTAAHCKND--N---Y
PLMN_PIG parvvggcvsiphswpwqislryryrgHFCGGTLISPEWVLTAKHC----------
TRYP_PIG -----------------------nsgsHFCGGSLINSQWVVSAAHCYKS--R---I
UROK_PIG -----------------------------CGGSLISPCWVVSATHCFINYQQKEDY
KAG2_CAVPO QVKLGRHNLFEDEDTA-----QHFLVSQSVPHPDFN
KAG_PIG EVGWLRHNLFENENTA-----QFFGVTADFPHPGFN
PLMN_PIG ------------lekssspssykvilgaheeyhlge
TRYP_PIG QVRLGEHNIDVLEGNE-----QFINAAKIITHPNFN
UROK_PIG IVYLGRQTLHSSTHGEMKFEVEKLILHEDYSADSLA
There are two variants of searching for homologies:
- between a query sequence and the databank sequences or
- between an alignment and the databank ones.
As a rule you will want to perform screening the protein
sequence/alignment along protein databank (SWISSPROT) and screening of nucleotide one
along single (or several) of EMBL parts, but since the completion of the protein
databank falls far behind the adds to the databank of nucleotide sequences, the program
sports the possibility of screening the protein sequence along the nucleotide bank (with
translation of the direct and complementary chains in 6 reading frames).
The algoritm of searching for homologies of a
sequence (or alignment) along a databank consists of a sequential tests of all sequences
in a databank, selecting those, which either form a good alignment with the search one, or
have a significant fragment, homologous to it.
The procedure consists of three steps. The first one
finds those SHIFTS of the sequences, when homologous fragments are to coincide. The second
one selects best local similarities without gaps (the "MOTIFS") for each shift,
and the third step constructs the best total and local ALIGNMENTS of two sequences (or the
query alignment and a databank sequence).
SHIFTS
The determination of sequence
shifts is performed by the modified Lipman - Pearson procedure. It marks the shifts for
which sufficiently large number of matching "equivalent" patterns can be
observed. In most cases this method indeed discovers the shifts that lead to matching of
similar fragments. The main advantage of this approach is that the search is rather fast
(method of Looking table).
Let's consider it in details. On one hand, the letters (residues)
are divided into several classes ("COLORS"). This coloring is consistent
with the weight matrix: substitution weights within a class should be sufficiently large.
For each shift of sequences the following procedure is performed: for each position of a
window (of length, say, 7) the value T(T-1)/2 is computed, where T is the number of color
matches within this window. Then the sum of these values for all window positions for a
given shif t is computed. This characteristic is preferable as compared to the more widely
used number of perfect matches or number of color matches (that is, the sum of T), since
it is sensitive to determination of shifts setting in correspondence locally similar
regions. On the other hand, our procedure is sufficiently fast, since the above
characteristic can be computed using preformed looking table of all color pairs (the
distance between which does not exceed the window length) for query sequence/ alignment. A
user-defined number of best shifts is subject for further processing.
COLORS
These are following the standard groupes of
aminoacids ("PROTEIN" colors):
A and C |
are usually grouped together for having a small side chain, |
C |
is the only to build S-S bridges, |
D,E,N,Q |
a carboxil or an amid qroup in the side chain, |
I,L,M,V |
hydrophobic Interaction of the I kind, |
F,Y,W |
hydrophobic Interaction of the II kind, |
H |
positively charged and has a reactional ability, |
K,R |
positively carged side chain, |
P |
can't make hydrogen bonds, |
S,T |
make non-peptide hydrogen bonds in the side chain. |
For DNA/RNA sequences only T and U are united in one
group - all others are in separate groups ("DNA" colors).
MOTIFS
Motifs are sets of similar fragments of different
sequences, having the same length. Those can be paired or multiple: the similarities
between fragments of two sequences in the first case and of several sequences in the
other.
Non-contradictional sets of motifs for the given
pair of sequences can be united at local alignment (with gaps).
Mostly important motif/alignment characteristic is
its statistical value in some definite statistical system. When supposing independent
letter appearance, we can get the probability distribution of the given statistic, and,
taking it into consideration, see the probability of fixed value of the choosen statistic,
calculated for the given motif. This procedure selects less probable motifs and alignments
(from the hypothesis of random letter appearance).
Traditionally, the mainly used statistic is the so
called "sum of weights for changes" according to the MATRIX of residue's
similarities. Different variants of this statistic can be formed, using different matrices
(Dayhoff matrix, matrices of chemical and physical properties: hydrophobic, polarity,
participation in secondary and 3D structure elements etc.). It is common to use the
Standard Square Deviation (SD) units for measuring probability. We shall call the
probability, measured in a such way, the POWER of the motif/alignment.
Power of motif highly depends on frequences of
letters in comparing sequences. If frequences of letters in selected stretches of matching
are significantly deviate from values in begining of calculations (for example the stretch
is polyA), then it's necessary to recalculate the power with new frequences and this will
decrease power of such unsignificant motif as, for example, the match of polyA stretch in
query sequence with polyA stretch in databank sequence
( "Motif frequency recalc" parameter, "YES" is recommended ).
DOTHELIX PROCEDURE
We use DOTHELIX procedure to separate
statistically significant motifs. Giving a fixed shift of two sequences (or alignments),
it marks a set of non-overlapping motifs with the following property: the first motif has
the maximum of the power (least possibility of appearing for random sequences of the
statistic "sum of weights for letter changes along given matrix"), the second
has the best power for the set of motifs, non-overlapping with the first and etc.
Boundaries of the corresponding fragments can be found exactly since extending or
shortening of fragments by one letter necessarily decreases their similarity (power).
The most important advantage of the suggested
approach as compared with the traditional one (looking at the given sequence shift by
window) is the following. First, the method is more sensitive (all motifs will be
discovered independently on the difference at their lengths). Second, the method is more
precise: exact boundaries of similar fragments are found and the power characterizing the
similarity of the fragments is computed.
Dothelix procedure is not much less effective in
comparison with "window method": as a rule it demands N * ln(N) operation istead
of N for window method, but in sophisticated cases the number could be equal N * N. To
eliminate such possibilities there is procedure parameter ("Accurate DotHelix")
that bound the number of operations (case "NO").
If user marks several matrices, then Dothelix
procedure will be performed for all of them separately. All the motifs with power more
than threshold will be used for the best ALIGNMENTS construction.
ALIGNMENTS
On the next step of the examination of the current
pair of sequences (or query alignment and databank sequence) two alignments are found: the
best integral alignment (according to the sum of the residual exchange weights
along total length) and the best local alignment of them (in the power terms). The
both are built from the motifs previously found (for several matrices). If one of the
following happens: the power of the local alignment or the alignment is higher than the
threshold, then the sequence from the databank and the local alignment itself are saved.
The results of the program are presented in two lists: the separate list of the best local
alignments and a list of the best integral alignments. The inclusion of the two lists is
brought up by the fact, that they show two biologically different senses: the presence of
a best local alignment shows the existence of a good and, maybe, functionally significant
pattern, while the integral alignment shows, there could be an evolutional relation of the
two sequences.
Each aligned sequense consists of motif's fragments
(fragments of query sequence/alignment similar to fragments of databank sequences),
fragments not belonging to any motif, and gaps. At the first step supermotifs are formed.
Supermotif is defined as a partial alignment whose power exceeds powers of all its local
alignments. In particular, its power exceeds powers of all motifs belonging to it. In the
program two methods for linking motifs into supermotifs are implemented: cluster method
for local alignment and dinamic (for total alignment).
In the cluster method the following scheme is
employed. Initially each motif is considered as a supermotif. At each step two such
supermotifs are linked, that the power of the obtained supermotif exceeds powers of its
constituents, and it is the maximum one among all possible linkages.
It have been implemented a dynamic programming
procedure for construction of a total alignment of query and databank sequences. We
represent an alignment as a chain of motifs. These motifs can belong to an alignment not
only completely, but partially as well. It is taken into account in our program, but below
we for simplicity assume that motifs enter an alignment entirely. Now total alignment
quality is measured by its "price" that is defined as a sum of centered weights
for letter exchange along the alignment path minus gap penalties. For each motif it is
possible to point out motifs that can precede or follow given in an alignment. Thus the
set of motifs is a partially ordered set. If a partial alignment has been constructed,
further aligning depends only on the most recently aligned fragment. Thus the dynamic
programming principles, and in particular, a Needleman-Wunsch type procedure can be
employed. For each motif we find the maximum price of alignments starting at the beginning
of the sequences and ending at this motif. As result the best total alignment is
constructed. If it's PERCENT OF HOMOLOGY (see PARAMETERS) and its power are more than
thresholds then the alignment will be included in result list.
RESULT
The result holds the selected sequences and the
alignments formed. Each motif of such alignment is built from a fragment of the search
sequence (or alignment) and a fragment of the databank one.
If the screening goes with query alignment, then in
the results will be not pair motifs: from one side it will be there all the variant of
"thickening" of the search alignment because of all the newly found homologous
fragments, and all the sequences, taking part in the new motifs, from the other side.
Example:
SUPERMOTIF 9 Power 12.901
psuste (11) WINWFLGIKGNEFLCHVPTNYFQDTFNQLGLEYFSQ---------pldvilk
**+*******++****+*++**+++*+++**+**++
DMSTEGHAb (391) WINWFLGIKGNEFLCHVPTNYFQDTFNQMGLEYFSQhwt*s*srrstvprac
psuste pafdsssglfyddekkwygmiharyirsergvndihrkymrgdfescpniscnrkntl
DMSTEGHAb sttmrksgta*ftpdtsgrsva*micteti*eetlnrvrispvigrtpcqwasamcga
psuste pvglRSTAHAVKTTSILKLIHSHVRAQLSGHLLNAAAELETAPGQPT-----------
*+++*++*+++******+*+*++*.****+++++*+++**.*+
DMSTEGHAb sqr*RSTAHAVKTTSILKLIHSHVRAQLPGHLLNAAAELETAPGRPTvsnspnivlvv
psuste ---------SLCFXLHQKALMPLKSPKSSPKNIKYSSS
+*** **+*+*****+**++**.* +.+
DMSTEGHAb f*tkrlnlqSLCFLLHQKALMPLKSPKSSPKKIGSSAS
SUPERMOTIF 10 Power 12.619
psuste (11) WINWFLGIKGNEFLCHVPTNYFQDTFNQLGLEYFSQ
**+*******++****+*++**+++*+++**+**++
DMSTEGHAb (391) WINWFLGIKGNEFLCHVPTNYFQDTFNQMGLEYFSQ
The signs between aligned sequences shows the
quality of the letter exchage according to matrix selected the first.
In the dialog window you should set the following
parameters of the program:
query sequence (should be in one-letter code format):
[sequence in one-letter code]
for example:
MNIFEMLRIDEGLRLKIYKDTEGYYTIGIGHLLTKSPSLNAA
KSELDKAIGRNTNGVITKDEAEKLFNQDVDAAVRGILRN AKLKPVYDSL DAVRRAAL
INMVFQMGETGVAGFTNSLRMLQQKRWDEAAVNLAKSRWYNQTPNRAKRVITTFRTGTWDA
YK
or query alignment which should be represented as following:
ACRO_PIG --------------------------------------------------------
EL1_PIG ggteaqrnswpsqislqyrsgsswahtCGGTLIRQNWVMTAAHCVDRELTFRVVVG
EL2_PIG ggedarpnswpwqvslqydssgqwrhtCGGTLVDQSWVLTAAHCISSSRTYRVVLG
FA9_CAVPO FRVVGGEDAKPGQFPWQVLLNGETEAFCGGSIVNEKWIVTAAHCILPGIKIEVVAG
FA9_PIG IRIVGGENAKPGQFPWQVLLNGKIDAFCGGSIINEKWVVTAAHCIEPGVKITVVAG
KAG2_CAVPO ---------------------------CGGVLVDPQWVLTAAHCINDS--NQVKLG
KAG_PIG ---------------------------CGGVLVNPKWVLTAAHCKNDN--YEVGWL
PLMN_PIG maensktspiarmrdvvlfekriylsecktgngknyrgttsktksgvicqkwsvss
TRYP_PIG ddkivggytcaansipyqvslnsgshfCGGSLINSQWVVSAAHCYKSR--IQVRLG
UROK_PIG kkfqgehceidtsqtcfegnghsyrgkantntggrpclpwnsatvllntyhahrpd
ACRO_PIG ----------------------------
EL1_PIG EHNLNQNDGTEQYVGVQKIVVHPYWNTD
EL2_PIG RHSLSTNEPGSLAVKVSKLVVHQDWNSN
FA9_CAVPO KHNIEKKEDTEQRRNVTQIILHHSYNAS
FA9_PIG EYNTEETEPTEQRRNVIRAIPHHSYNAT
KAG2_CAVPO RHNLFEDEDTAQHFLVSqdfta------
KAG_PIG RHNLFENENTAQFFGVTADFPHPGFnls
PLMN_PIG phipkyspekfplagleenycrnpdnde
TRYP_PIG EHNIDVLEGNEQFINAAKIITHPNF---
UROK_PIG alqlglgkhnycrnpdnqrrpwcyvqvg
the method of similarity selection: Best local similarity (local
alignments) and/or Integral similarity (total alignments) - both methods are recommended;
minimum length of motifs selected (7 is recommended);
the motif's minimum power (3 - 5 is recommended);
minimum power of selected local alignemnts (6 - 9 is
recommended);
minimum percent of homology (in part of complete similarity)
for integral alignments found (0.15 - 0.25 is recommended);
Looking table technique uses its own set of the
parameters:
The additional search parameters are:
The same dialog window sets the DotHelix mode
(accurate and simplified) and the list of amino-acid residue similarity MATRICES, which
will be used to search for motifs. If more than one matrix is selected, then the motifs
are searched for all of the defined types of similarity. Currently it's possible to use
the following three matrices:
- Dayhoff matrix (modified 250 PAM matrix from Atlas of Protein
sequence and structure, (1978), v.5, suppl. 3, pp.345-358):
A C D E F G H I K L M N P Q R S T V W Y
A 12
C 8 22
D 10 5 14
E 10 5 13 14
F 6 6 4 5 19
G 11 7 11 10 5 15
H 9 7 11 11 8 8 16
I 9 8 8 8 11 7 8 15
K 9 5 10 10 5 8 10 8 15
L 8 4 6 7 12 6 8 12 7 16
M 9 5 7 8 10 7 8 12 10 14 16
N 10 6 12 11 6 10 12 8 11 7 8 12
P 11 7 9 9 5 9 10 8 9 7 8 9 16
Q 10 5 12 12 5 9 13 8 11 8 9 11 10 14
R 8 6 9 9 6 7 12 8 13 7 10 10 10 11 16
S 11 10 10 10 7 11 9 9 10 7 8 11 11 9 10 12
T 11 8 10 10 7 10 9 10 10 8 9 10 10 9 9 11 13
V 10 8 8 8 9 9 8 14 8 12 12 8 9 8 8 9 10 14
W 4 2 3 3 10 3 7 5 7 8 6 6 4 5 12 8 5 4 27
Y 7 10 6 6 17 5 10 9 6 9 8 8 5 6 6 7 7 8 10 20
- BLOSUM62 matrix constructed from 2000 multiple aligned blocks
(equivalent to 160 PAM, Henikoff S. et al, Proc. Nat. Acad. Sci. USA, (1992), v.89, pp.10915
-10919):
A C D E F G H I K L M N P Q R S T V W Y
A 8
C 4 13
D 2 1 10
E 3 0 6 9
F 2 2 1 1 10
G 4 1 3 2 1 10
H 2 1 3 4 3 2 12
I 3 3 1 1 4 0 1 8
K 3 1 3 5 1 2 3 1 9
L 3 3 0 1 4 0 1 6 2 8
M 3 3 1 2 4 1 2 5 3 6 9
N 2 1 5 4 1 4 5 1 4 1 2 10
P 3 1 3 3 0 2 2 1 3 1 2 2 11
Q 3 1 4 6 1 2 4 1 5 2 4 4 3 9
R 3 1 2 4 1 2 4 1 6 2 3 4 2 5 9
S 5 3 4 4 2 4 3 2 4 2 3 5 3 4 3 8
T 4 3 3 3 2 2 2 3 3 3 3 4 3 3 3 5 9
V 4 3 1 2 3 1 1 7 2 5 5 1 2 2 1 2 4 8
W 1 2 0 1 5 2 2 1 1 2 3 0 0 2 1 1 2 1 15
Y 2 2 1 2 7 1 6 3 2 3 3 2 1 3 2 2 2 3 6 11
- Johnson matrix based on AA replacements in 65 sets structure aligned
3D structures (M.S.Johnson et al., J.Mol.Biol. (1993), v.233, pp. 716 - 738) :
A C D E F G H I K L M N P Q R S T V W Y
A 16
C 6 26
D 8 0 18
E 9 3 12 18
F 7 5 3 3 20
G 9 2 8 7 1 18
H 7 2 9 7 8 7 22
I 8 2 5 5 10 4 5 18
K 9 1 8 11 4 6 10 5 17
L 6 1 2 4 12 3 5 12 6 17
M 8 5 4 7 9 5 7 12 8 14 21
N 8 2 12 9 6 8 11 5 10 5 6 18
P 9 1 9 8 5 7 5 4 9 7 0 7 20
Q 9 3 9 12 3 7 11 3 11 5 9 9 6 19
R 8 4 6 10 4 7 10 4 13 6 6 8 6 12 20
S 10 2 10 8 5 8 7 5 8 5 5 11 9 9 9 16
T 9 4 8 9 5 6 7 7 10 5 7 10 8 9 8 12 17
V 9 5 5 6 8 4 6 14 6 12 10 4 5 6 5 5 8 17
W 4 1 4 2 13 3 6 6 4 9 9 4 2 2 6 4 0 5 25
Y 6 2 6 6 13 4 9 7 6 7 8 7 3 5 8 6 7 8 12 20
We should remind, that the strategy of parameter
selection is very individual and depends highly on the problem solved.
The results, got from the main computer, will
contain selected sequences and homologies found, the query sequence/alignment and the
parameters of search.
|