|
|
MULTIPLE ALIGNMENT (algorithm)REFERENCES:
Multiple Sequence Alignment is the arrangement of several protein or nucleic acid sequences with postulated gaps so that similar residues (in one-letter code) are juxtaposed. GeneBee algorithm of MULTIPLE ALIGNMENT proceeds in four steps:
CONTENTS:GENERAL DEFINITIONSAPPENDIX1. POWER GENERAL DEFINITIONS1.1. MULTIPLE ALIGNMENT. 1.2. MOTIFS. 1.3. POWER. 1.4. TOTAL and LOCAL
alignments. 1.5. SUPERMOTIFS. 1.6. ALGORITHM.
The 4th step can be performed in 2 ways.
APPENDIX1. POWER L S S ( W{b(r1,k), b(r2,k)} - M ) k 1<r1<r2<K A = ------------------------------------------- (1) sqrt (DL) where Increase of power corresponds to increase of similarity between
motif layers. The notion of POWER as defined by formula (1) is a base for other definitions: CORRECTED MOTIF POWER, CONDITIONAL POWER, ALIGNMENT POWER. The latter values are the ones directly used in this program. 1.1. Corrected power The formula for A_cor is based on the following informal reasoning. Let's assume for simplicity that all sequences have similar length n. The number of possible relative shifts of sequences, when motifs of thickness k are considered has the order n**(k-1), and it is approximately n**(k-1) times greater than that for motifs of thickness 2, for which no correction is performed. Thus the correction should decrease the corresponding probability P(A) at n**(k-2) times. In order to do that it is necessary to decrease half of the squared power by (k-2)*ln(n). Thus we get the following formula for corrected power: A_cor = sqrt{A**2 - 2*(k-2)*ln(n)} (2). 1.2. Alignment power
One of the reasons for introducing the gap penalty is the correction similar to one in the corrected motif power A_cor. The problem is that the number of possible alignments sharply increases as the allowed number of gaps increases. Thus for random independent Bernoulli sequences the maximum of the sum of weights along the alignment path increases, and so the power of alignments with many gaps is too large. In order to compensate for this effect, the gap penalty is introduced. Unfortunately, unlike the situation with the corrected motif power, it is difficult to estimate theoretically the gap penalty, since sums of weights along different path are strongly dependent. 2. CONSTRUCTION OF PAIRWISE
MOTIFS Thresholds for power
and lengths of motifs are set by user. Sets of motifs can be
constructed employing different matrices. When the standard constant M=M(W) is chosen, a large number of the obtained motifs will be "noise", despite their formal statistical significance. This fact, obstacling further alignment construction by increase of the search, is caused by the null hypothesis of independence of the sequence being aligned (indeed, the very desire to align sequences is an evidence of their dependence!). Clearly, for dependent (similar) sequences the mean mismatch weight increases as compared to formula (2). The program allows a user to account for this fact by increasing M (by setting parameter "Minimum homology ratio" in the interval from 0 to 1: 0.01 is recommended). Experiments demonstrate that this procedure allows to filter out a majority of noise motifs. Thus it is possible to find motifs for various M and different weight matrices. A user can lump together all obtained motifs, and it increases possibilities of alignment. The described procedure is fairly slow, since it has to process a very large number of relative sequence shifts. To accelerate this process, it's possible to consider only the most "promising" shifts, which are processed further. Of course, this pre-processing can lead to the loss of some motifs. Such sorting of registers is performed by a Lipman-Pearson type procedure (Lipman and Pearson, 1985). Consider it in more detail. First, the letters (residues) are divided into several classes ("colors"). This coloring is consistent with the weight matrix: substitution weights within a class should be sufficiently large. For each shift of sequences the following procedure is performed: for each position of a window (of length, say, 7) the value t(t-1)/2 is calculated, where 't' is the number of color matches within this window. Second, the sum of these values for all window positions for a given shift is calculated. To select a given shift the ratio of this sum to maxsum should be more than user-defined value (parameter "Coincidence ratio"). The described characteristic is more preferable as opposed to the more widely used number of perfect matches or number of color matches (that is, the sum of t), since it is sensitive to determination of shifts setting in correspondence locally similar regions. On the other hand, our procedure is sufficiently fast, since the above characteristic can be calculated using pre-formed looking tables of all color pairs (the distance between which does not exceed the window length) for all sequences. A user-defined number of shifts is a subject for further processing. 3. CONSTRUCTION OF MULTIPLE
MOTIFS. Consider this algorithm in more details. It is sufficient to describe the procedure of linking of two motifs (or a motif and a sequence fragment). If the motifs have no common layers, the DotHelix procedure is applied (each motif is considered as a unit). Linking of a fragment to a motif is performed in the same way. If motifs have common layers, then (if it's possible) the natural correspondence between layers is established, and then DotHelix is applied to thus defined comparison register. Three options of the parameter "How to process motifs" relate to three nested possible ways of
motif construction: The result of this procedure is a set of motifs of various lengths. Each motif is characterized by the corrected power (A_cor), calculated when the motif was constructed. Motifs with the power exceeding certain thresholds are retained. Incomplete thickness of a motif means that only some sequences possess a good similarity in this region. Intermediate storage of motifs is based on the stack principle, so that only some best (with regards to the corrected power) motifs are retained. When a new motif is constructed, the stack is reordered and the worst of the stored motifs are deleted. It allows to save memory and calculation time. Speaking metaphorically, the suggested method first selects areas of potential similarity, based on pairwise comparisons. Then some of these areas become more "visible" due to their similarity with fragments of other sequences, while the others are not supported by additional similarities and disappear from the stack. The accelerating procedure described in Section 2 can be applied to the construction of multiple motifs as well. It can be further accelerated by considering color matches only in conserved positions of motif (i.e. positions colored almost entirely in one color). The percentage of one-colored letters in a conservative position is a user-defined parameter "Minimum letter frequence". 4. CONSTRUCTION OF
ALIGNMENTS (supermotifs and total alignments). 4.1. Construction of
supermotifs In the greedy method first the most powerful motif is considered, then it is linked with such motif, that the power of the constructed supermotif exceeds powers of both motifs belonging to it. Then the obtained supermotif is linked to the third motif (if it is possible) etc. This procedure is performed while the supermotif power increases. In the cluster method the following scheme is employed. Initially each motif is considered as a supermotif. At each step two such supermotifs are linked, that the power of the obtained supermotif exceeds powers of its constituents, and it is the maximum one among all possible linkages. It should be noted, that not all supermotifs can be constructed by these procedures. Construction of algorithms that would allow us to construct all existing supermotifs is a subject of further research. 4.2. Construction of total
DRAFT alignments. 4.2.1. Greedy procedure This algorithm can be generalized in the following way (the derived procedure would not be a greedy one). First, one of t best (super)motifs is taken, where t is defined by a user. Then each obtained partial alignment (currently consisting of a single supermotif) is linked by a new supermotif. This is done by one of t ways so that the power of the alignment, if possible, increases. The procedure continues until there are supermotifs that can be still linked to an alignment. If t=1, this is the above described greedy procedure, If t= number of motifs in the stack, this procedure reduces to the full search. As t increases, the calculation time also increases, but the quality of alignment improves. 4.2.2. Dynamic programming
(Needleman - Wunsch). We represent an alignment as a chain of motifs. These motifs can belong to an alignment not only completely, but partially as well. It is taken into account in our program, but below we for simplicity assume that motifs enter an alignment entirely. Alignment quality is measured by its "price", that is defined as a sum of centered weights W(al,bl) of mismatches along the alignment path minus gap penalties. For each motif it is possible to point out motifs that can precede or follow it in an alignment. Thus the set of motifs is a partially ordered set. If a partial alignment has been constructed, further aligning depends only on the most recently aligned fragment. Thus the dynamic programming principles, and in particular, a Needleman-Wunsch type procedure can be employed. For each motif we find the maximum value of alignments starting at the beginning of the sequences and ending at this motif. We do not consider all transitions from a current motif to subsequent ones: among motifs having close registers only transition to the nearest one is allowed. If many sequences are aligned, a complete alignment can also be represented as a chain of pairwise motifs (since a multiple motif can be decomposed into pairwise ones). However, unlike the previous case, alignment depends not only on the last aligned motif, but on some previous motifs. Thus in this case the dynamic programming principle cannot be applied. In this connection we introduce the following notion. Each partial alignment is set in correspondence with its FRONT, defined as the set of the rightmost in each sequence pairwise motifs belonging to the alignment, and their relative positions in the alignment. Thus in order to define a front, it is necessary, first, to set for each sequence such motif, that one of its layers belongs to this sequence, and, second, to define the position of this motif in the alignment relative to the initial position of the leftmost motif in this front. The initial position of any frontal motif in a sequence, to which it is ascribed, should be situated rightwards relative to beginnings in this sequence of all other frontal motifs. One motif can be ascribed to two sequences (recall that all motifs are pairwise ones). It is simple to demonstrate that further construction depends only on the front of the already aligned part. Thus a dynamic programming procedure is applicable. It is clear, that as the number of motifs increases, the number of fronts also increases - at a very fast rate (roughly as [N/t]**t) , where 'N' - number of motifs, 't' - - number of sequences). Thus it seems to be necessary to bound the number of fronts by a reasonable stack size, so that alignment of natural sequences would be possible. 4.3. Construction of the REFINED alignment. The REFINED alignment is constructed by the iterative improvement of the DRAFT alignment. Each layer (sequence) of the DRAFT alignment is aligned against all other alignment's layers taken as a single batch by the dynamic programming procedure. The process of iteration is completed when the quality of alignment doesn't become any better. The basic multiple motifs were inserted while constructing the DRAFT alignment. These basic motifs contain consensus (conservative) sub-columns. This is exactly why the DRAFT alignment is a good starting point for the iterative process of the improvement of alignments by the consequtive procedures of dynamic programming. The euristic reasons are following. Let's call the consensus sub-columns "embryos". As a rule, the largest faction of "embryos" does correspond to the correct alignment. However, some of the "embryos" can be incorrect. The task of the iterative improvement is to expand these correct "embryos"/sub-columns up to the thickness of a complete alignment. During this process incorrect "embryos" would disappear, because they won't find good correspondence in the other lines, trying to expand. The residues, participating in these incorrect "embryos", would be forced to move to the other columns, so that they could provide the strengthening of the correct "embryos". There are several matrices inplemented in GeneBee. You may
choose any of them - Dayhoff, Blosum62, DAYHOFF MATRIX A C D E F G H I K L M N P Q R S T V W Y A 12 C 8 22 D 10 5 14 E 10 5 13 14 F 6 6 4 5 19 G 11 7 11 10 5 15 H 9 7 11 11 8 8 16 I 9 8 8 8 11 7 8 15 K 9 5 10 10 5 8 10 8 15 L 8 4 6 7 12 6 8 12 7 16 M 9 5 7 8 10 7 8 12 10 14 16 N 10 6 12 11 6 10 12 8 11 7 8 12 P 11 7 9 9 5 9 10 8 9 7 8 9 16 Q 10 5 12 12 5 9 13 8 11 8 9 11 10 14 R 8 6 9 9 6 7 12 8 13 7 10 10 10 11 16 S 11 10 10 10 7 11 9 9 10 7 8 11 11 9 10 12 T 11 8 10 10 7 10 9 10 10 8 9 10 10 9 9 11 13 V 10 8 8 8 9 9 8 14 8 12 12 8 9 8 8 9 10 14 W 4 2 3 3 10 3 7 5 7 8 6 6 4 5 12 8 5 4 27 Y 7 10 6 6 17 5 10 9 6 9 8 8 5 6 6 7 7 8 10 20 BLOSUM62 MATRIX A C D E F G H I K L M N P Q R S T V W Y A 8 C 4 13 D 2 1 10 E 3 0 6 9 F 2 2 1 1 10 G 4 1 3 2 1 10 H 2 1 3 4 3 2 12 I 3 3 1 1 4 0 1 8 K 3 1 3 5 1 2 3 1 9 L 3 3 0 1 4 0 1 6 2 8 M 3 3 1 2 4 1 2 5 3 6 9 N 2 1 5 4 1 4 5 1 4 1 2 10 P 3 1 3 3 0 2 2 1 3 1 2 2 11 Q 3 1 4 6 1 2 4 1 5 2 4 4 3 9 R 3 1 2 4 1 2 4 1 6 2 3 4 2 5 9 S 5 3 4 4 2 4 3 2 4 2 3 5 3 4 3 8 T 4 3 3 3 2 2 2 3 3 3 3 4 3 3 3 5 9 V 4 3 1 2 3 1 1 7 2 5 5 1 2 2 1 2 4 8 W 1 2 0 1 5 2 2 1 1 2 3 0 0 2 1 1 2 1 15 Y 2 2 1 2 7 1 6 3 2 3 3 2 1 3 2 2 2 3 6 11JOHNSON MATRIX Unique Identifier: 94016587 (MEDLINE) Authors: Johnson M. S. Overington J. P. Institution: Department of Crystallography, Birkbeck College, University of London, U.K. Title: A structural basis for sequence comparisons. An evaluation of scoring methodologies. Source: Journal of Molecular Biology. 233(4):716-38, 1993 Oct 20. Abstract: A residue-exchange matrix has been derived that is suitable for comparison of amino acid sequences. This matrix is based on the tabulation of 207,795 amino acid replacements observed in 65 homologous sets of structurally aligned three-dimensional structures (235 proteins). The majority of the data is from structural comparisons where there is between 15 and 40% sequence identity. As a result, a scoring matrix such as the one devised here should provide a sensitive basis for the comparison of amino acid sequences and the search for homologous sequences in amino acid databases. In order to assess the value of this matrix we have made a comparative analysis with 12 other published scoring matrices that have been used for the alignment of protein amino acid sequences. We find that the matrix derived here is among the better performers in terms of alignment significance, detection of homologous sequences and the accuracy of alignments. A C D E F G H I K L M N P Q R S T V W Y A 16 C 6 26 D 8 0 18 E 9 3 12 18 F 7 5 3 3 20 G 9 2 8 7 1 18 H 7 2 9 7 8 7 22 I 8 2 5 5 10 4 5 18 K 9 1 8 11 4 6 10 5 17 L 6 1 2 4 12 3 5 12 6 17 M 8 5 4 7 9 5 7 12 8 14 21 N 8 2 12 9 6 8 11 5 10 5 6 18 P 9 1 9 8 5 7 5 4 9 7 0 7 20 Q 9 3 9 12 3 7 11 3 11 5 9 9 6 19 R 8 4 6 10 4 7 10 4 13 6 6 8 6 12 20 S 10 2 10 8 5 8 7 5 8 5 5 11 9 9 9 16 T 9 4 8 9 5 6 7 7 10 5 7 10 8 9 8 12 17 V 9 5 5 6 8 4 6 14 6 12 10 4 5 6 5 5 8 17 W 4 1 4 2 13 3 6 6 4 9 9 4 2 2 6 4 0 5 25 Y 6 2 6 6 13 4 9 7 6 7 8 7 3 5 8 6 7 8 12 20 Unitary matrix for DNA/RNA A C G T U A 10 C 0 10 G 0 0 10 T 0 0 0 10 U 0 0 0 10 10 |