Thompson J.D., Higgins D.G., Gibson T.J.; "CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice."; Nucleic Acids Res. 22:4673-4680(1994).
FASTA (Pearson), NBRF/PIR, EMBL/Swiss Prot, GDE, CLUSTAL, GCG/MSF, GCG9/RSF.
The program tries to "guess" which format is being used and whether the sequences are nucleic acid (DNA/RNA) or amino acid (proteins). The format is recognised by the first characters in the file. This is crude, but it works most of the time, and it is difficult to do reliably any other way.
Format          First non-blank word or character in the file
...............................................................
FASTA           >
NBRF            >P1; or >D1;
EMBL/SWISS      ID
GDE protein     %
GDE nucleotide  #
CLUSTAL         CLUSTAL (blocked multiple alignments)
GCG/MSF         PILEUP or !!AA_MULTIPLE_ALIGNMENT
                or !!NA_MULTIPLE_ALIGNMENT or MSF on the first
                line, and '..' at the end of line
GCG9/RSF        !!RICH_SEQUENCE
Example in FASTA format:
>FOSB_HUMAN P53539 homo sapiens (human). fosb protein
MFQAFPGDYDSGSRCSSSPSAESQYLSSVDSFGSPPTAAASQECAGLGEMPGSFVPTVTA
ITTSQDLQWLVQPTLISSMAQSQGQPLASQPPVVDPYDMPGTSYSTPGMSGYSSGGASGS
GGPSTSGTTSGPGPARPARARPRRPREETLTPEEEEKRRVRRERNKLAAAKCRNRRRELT
DRLQAETDQLEEEKAELESEIAELQKEKERLEFVLVAHKPGCKIPYEEGPGPGPLAEVRD
LPGSAPAKEDGFSWLLPPPPPPPLPFQTSQDAPPNLTASLFTHSEVQVLGDPFPVVNPSY
TSSFVLTCPEVSAFAGAQRTSGSDQPSDPLNSPSLLAL
>FOSB_MOUSE P13346 mus musculus (mouse). fosb protein.
MFQAFPGDYDSGSRCSSSPSAESQYLSSVDSFGSPPTAAASQECAGLGEMPGSFVPTVTA
ITTSQDLQWLVQPTLISSMAQSQGQPLASQPPAVDPYDMPGTSYSTPGLSAYSTGGASGS
GGPSTSTTTSGPVSARPARARPRRPREETLTPEEEEKRRVRRERNKLAAAKCRNRRRELT
DRLQAETDQLEEEKAELESEIAELQKEKERLEFVLVAHKPGCKIPYEEGPGPGPLAEVRD
LPGSTSAKEDGFGWLLPPPPPPPLPFQSSRDAPPNLTASLFTHSEVQVLGDPFPVVSPSY
TSSFVLTCPEVSAFAGAQRTSGSEQPSDPLNSPSLLAL
Note that the only way of spotting that a file is MSF format is if the word PILEUP appears at the very beginning of the file. If you produce this format from software other than the GCG pileup program, then you will have to insert the word PILEUP at the start of the file. Similarly, if you use clustal format, the word CLUSTAL must appear first.
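The recognition rules in the table above can be sketched in a few lines of Python. This is only an illustration, not the program's actual code: the function name guess_format and the returned labels are assumptions made for this example, and the MSF test follows the table (PILEUP, the !! markers, or MSF with a trailing '..' on the first line).

```python
def guess_format(text):
    """Guess a sequence file format from its first non-blank word (sketch)."""
    stripped = text.lstrip()
    if not stripped:
        return None
    first_line = stripped.splitlines()[0].rstrip()
    first_word = first_line.split()[0]
    # NBRF must be tested before FASTA, since both start with ">".
    if first_word.startswith((">P1;", ">D1;")):
        return "NBRF/PIR"
    if first_word.startswith(">"):
        return "FASTA"
    if first_word == "ID":
        return "EMBL/Swiss-Prot"
    if first_word.startswith("%"):
        return "GDE protein"
    if first_word.startswith("#"):
        return "GDE nucleotide"
    if first_word.startswith("CLUSTAL"):
        return "CLUSTAL"
    if first_word.startswith("!!RICH_SEQUENCE"):
        return "GCG9/RSF"
    if first_word in ("PILEUP", "!!AA_MULTIPLE_ALIGNMENT",
                      "!!NA_MULTIPLE_ALIGNMENT"):
        return "GCG/MSF"
    if "MSF" in first_line and first_line.endswith(".."):
        return "GCG/MSF"
    return None  # unknown format


print(guess_format(">FOSB_HUMAN P53539 homo sapiens (human). fosb protein\nMFQ"))
```

Anything that matches none of the rules is reported as None here, which a caller could treat as "format not recognised".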
All of these formats can be used to read in AN EXISTING FULL ALIGNMENT. With CLUSTAL format, this is just the same as the output format of this program and Clustal V. If you use PILEUP or CLUSTAL format, all sequences must be the same length, INCLUDING GAPS ("-" in clustal format; "." in MSF). With the other formats, sequences can be gapped with "-" characters. If you read in any gaps these are kept during any later alignments. You can use this facility to read in an alignment in order to calculate a phylogenetic tree OR to output the same alignment in a different format (from the output format options menu of the multiple alignment menu) e.g. read in a GCG/MSF format alignment and output a PHYLIP format alignment. This is also useful to read in one reference alignment and to add one or more new sequences to it using the "profile alignment" facilities.
DNA vs. PROTEIN: The program will count the
number of A,C,G,T,U and N characters. If 85% or more of the characters
in a sequence are as above, then DNA / RNA is assumed, protein otherwise.
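As a rough illustration of the 85% rule, the check can be written as follows. The function name guess_sequence_type is hypothetical, and ignoring gaps and other non-letter characters when counting is an assumption of this sketch, not something stated above.

```python
def guess_sequence_type(sequence, threshold=0.85):
    """Return "DNA/RNA" or "PROTEIN" using the A,C,G,T,U,N fraction (sketch)."""
    # Assumption: only letters are counted; gaps and digits are skipped.
    residues = [c for c in sequence.upper() if c.isalpha()]
    if not residues:
        return None
    nucleotide_like = sum(c in "ACGTUN" for c in residues)
    if nucleotide_like / len(residues) >= threshold:
        return "DNA/RNA"
    return "PROTEIN"


print(guess_sequence_type("ACGTTGCANNACGU"))
print(guess_sequence_type("MFQAFPGDYDSGSRCSSSPSAESQYLSS"))
```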
By default, the initial pairwise alignments are now carried out using a full
dynamic programming algorithm. This is more accurate than the older hash/
k-tuple based alignments (Wilbur and Lipman) but is MUCH slower. On a fast
workstation you may not notice but on a slow box, the difference is extreme.
The construction of the dendrogram can be very time consuming if you wish to
align many sequences (e.g. for 100 sequences you need to carry out 100x99/2
sequence comparisons = 4950). During every multiple alignment, a dendrogram is
constructed and saved to a file (something.dnd).
This option lets you choose whether or not to show the dendrogram in the results.
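The number of pairwise comparisons quoted above grows quadratically with the number of sequences. A small sketch (the function name is chosen for this example only) makes the arithmetic explicit:

```python
from itertools import combinations


def pairwise_comparisons(n_sequences):
    """Number of distinct sequence pairs: n * (n - 1) / 2."""
    return n_sequences * (n_sequences - 1) // 2


# The same count, obtained by actually enumerating the pairs.
pairs = list(combinations(range(100), 2))
print(pairwise_comparisons(100), len(pairs))
```

Doubling the number of sequences roughly quadruples the number of comparisons, which is why the choice between the slow and fast pairwise methods matters for large inputs.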
CLUSTAL FORMAT: This is a self explanatory alignment. The alignment is
written out in blocks. Identities are highlighted and (if you use a PAM
250 matrix) positions in the alignment where all of the residues are
"similar" to each other (PAM 250 score of 8 or more) are indicated.
GCG FORMAT: In version 7 of the Wisconsin GCG package, a new
multiple sequence format was introduced. This is the MSF (Multiple Sequence
Format) format. It can be used as input to the GCG sequence editor or any of
the GCG programs that make use of multiple alignments.
PHYLIP FORMAT: This format can be used by the Phylip package of
Joe Felsenstein. Phylip allows you to do a huge range of phylogenetic
analyses (we just offer one method in this program) and is probably the most
widely used set of programs for drawing trees. It also works on just about
every computer you can think of, providing you have a decent Pascal compiler.
NBRF/PIR FORMAT: This is the usual NBRF/PIR format with gaps
indicated by hyphens ("-"). This format is exactly compatible with
the sequence input format. Therefore you can read in these alignments again
for profile alignments or for calculating phylogenetic trees.
GDE FORMAT: This format is used by Steven Smith's GDE package.
The consensus line:
"*" = identical or conserved residues in all sequences in the alignment
"." = indicates semi-conserved substitutions.
BLOSUM (Henikoff): These matrices appear to be the best available
for carrying out data base similarity (homology searches). The matrices
used are: Blosum80, 62, 40 and 30.
PAM (Dayhoff): These have been extremely widely used since
the late '70s. We use the PAM 120, 160, 250 and 350 matrices.
GONNET: These matrices were derived using almost the same procedure
as the Dayhoff one (above) but are much more up to date and are based on a far
larger data set. They appear to be more sensitive than the Dayhoff series.
We use the GONNET 40, 80, 120, 160, 250 and 350 matrices.
ID: We also supply an identity matrix which gives a score of 10 to two
identical amino acids and a score of zero otherwise.
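The identity matrix is simple enough to build by hand. The sketch below is an illustration only: the alphabet of 20 standard amino acids and the helper names are assumptions, while the scores (10 for a match, 0 otherwise) are the ones given above.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumption: 20 standard residues


def identity_matrix(alphabet=AMINO_ACIDS, match=10, mismatch=0):
    """Map every residue pair to 10 if identical, 0 otherwise."""
    return {(a, b): (match if a == b else mismatch)
            for a in alphabet for b in alphabet}


def score_ungapped(seq_a, seq_b, matrix):
    """Sum the matrix scores over two equal-length, ungapped sequences."""
    return sum(matrix[(a, b)] for a, b in zip(seq_a, seq_b))


matrix = identity_matrix()
print(score_ungapped("MFQAF", "MFQGF", matrix))
```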
IUB: This is the default scoring matrix used by BESTFIT for the comparison
of nucleic acid sequences. X's and N's are treated as matches to any IUB
ambiguity symbol. All matches score 1.9; all mismatches for IUB symbols score 0.
CLUSTALW (1.6): The previous system used by ClustalW, in which matches
score 1.0 and mismatches score 0. All matches for IUB symbols also score 0.
Each of these formats can be presented graphically.
Clustal format:
This format is verbose and lists all of the distances between the sequences
and the number of alignment positions used for each. The tree is described
at the end of the file. It lists the sequences that are joined at each
alignment step and the branch lengths. After two sequences are joined, it is
referred to later as a NODE. The number of a NODE is the number of the
lowest sequence in that NODE.
Phylip format:
This format is the New Hampshire format, used by many phylogenetic analysis
packages. It consists of a series of nested parentheses, describing the
branching order, with the sequence names and branch lengths. It can
be used by the RETREE, DRAWGRAM and DRAWTREE programs of the PHYLIP
package to see the trees graphically. This is the same format used during
multiple alignment for the guide trees. Some other packages that can read and
display New Hampshire format are TreeTool, TreeView, Phylowin and NJPlot.
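To show what the nested parentheses look like, here is a minimal sketch that writes a small tree in New Hampshire format. The tree representation (nested tuples), the function names and the sequence names seqA, seqB and seqC are all hypothetical and chosen for this example; the five-decimal branch lengths are likewise just a formatting choice.

```python
def to_newick(node):
    """Serialise one node: a leaf is (name, length), an internal node is
    ([child, ...], length). A length of None is left out."""
    label, length = node
    if isinstance(label, list):
        text = "(" + ",".join(to_newick(child) for child in label) + ")"
    else:
        text = label
    if length is not None:
        text += ":%.5f" % length
    return text


def write_tree(root):
    """A New Hampshire tree is terminated by a semicolon."""
    return to_newick(root) + ";"


tree = ([([("seqA", 0.1), ("seqB", 0.2)], 0.05), ("seqC", 0.3)], None)
print(write_tree(tree))
```

Here seqA and seqB are joined first, and the resulting node is then joined with seqC.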
The distances only:
This format just outputs a matrix of all the pairwise distances in a format
that can be used by the Phylip package. It used to be useful when one
could not produce distances from protein sequences in the Phylip package but
is now redundant (Protdist of Phylip 3.5 now does this).
Alignment title
Type in a title for this alignment session for you to remember.
Your E-Mail
A valid internet email address in the form somebody@somewhere.domain.country.
You must type your email address in this text box if you want the results
sent to you by email. You don't have to fill in this box if you want to run
your search interactively.
Alignment
You may choose to run a full alignment, which uses a stringent algorithm for
generating the guide tree, or a fast algorithm.
Show dendrogram
To do a complete multiple alignment, we need to know the approximate
relationships of the sequences to each other (which ones are most similar to
each other). We do this by calculating a crude phylogenetic tree which we
call a dendrogram (to distinguish it from the more sensitive trees available
under the phylogenetic tree option). This dendrogram is used as a guide to
align bigger and bigger groups of sequences during the multiple alignment.
The dendrogram is calculated in 2 stages: 1) all pairs of sequences are compared
using the fast/approximate method of Wilbur and Lipman (1983); the result of
each comparison is a similarity score. 2) the similarity scores are used to
construct the dendrogram using the UPGMA cluster analysis method of Sneath
and Sokal (1973).
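Stage 2 can be sketched as follows. This is a bare-bones UPGMA clustering written for illustration, not the program's own code: it works directly on similarity scores (joining the most similar pair first), returns only the branching order as nested tuples, and the input dictionary keyed by pairs of sequence names is an assumption of this example.

```python
from itertools import combinations


def upgma(names, similarity):
    """Cluster sequences by average similarity; return a nested tuple tree.

    similarity maps (name_a, name_b) -> score, one entry per pair.
    """
    index = {name: i for i, name in enumerate(names)}
    # cluster id -> (subtree, number of sequences in the cluster)
    clusters = {i: (name, 1) for i, name in enumerate(names)}
    scores = {frozenset((index[a], index[b])): s
              for (a, b), s in similarity.items()}
    next_id = len(names)
    while len(clusters) > 1:
        # Join the two clusters with the highest average similarity.
        a, b = max(combinations(sorted(clusters), 2),
                   key=lambda pair: scores[frozenset(pair)])
        tree_a, size_a = clusters.pop(a)
        tree_b, size_b = clusters.pop(b)
        # Size-weighted mean equals the mean over all member sequences.
        for other in clusters:
            scores[frozenset((next_id, other))] = (
                size_a * scores[frozenset((a, other))]
                + size_b * scores[frozenset((b, other))]
            ) / (size_a + size_b)
        clusters[next_id] = ((tree_a, tree_b), size_a + size_b)
        next_id += 1
    tree, _size = next(iter(clusters.values()))
    return tree


example = {("A", "B"): 90, ("A", "C"): 40, ("B", "C"): 50}
print(upgma(["A", "B", "C"], example))
```

In the example, A and B are joined first because they have the highest score, and C is then joined to that pair.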
Output Format
Here you decide which output format you want your multiple sequence
alignment in. The options are CLUSTAL, GCG, PHYLIP, PIR and GDE.
The GCG (MSF) format is only supported in version 7 of the GCG package or later.
Output Order
This option is used to control the order of the sequences in the output
alignments. By default, the order corresponds to the order in which the
sequences were aligned (from the guide tree/dendrogram), thus automatically
grouping closely related sequences. This switch can be used to set the order
to the same as the input file.
":" = indicates conserved substitutions
Pairwise alignment options:
A distance is calculated between every pair of sequences and these are
used to construct the dendrogram which guides the final multiple alignment.
The scores are calculated from separate pairwise alignments. These can be
calculated using 2 methods: dynamic programming (slow but accurate) or by the
method of Wilbur and Lipman (extremely fast but approximate).
Fast alignment options:
These similarity scores are calculated from fast, approximate, global
alignments, which are controlled by 4 parameters. 2 techniques are used to make
these alignments very fast: 1) only exactly matching fragments (k-tuples) are
considered; 2) only the 'best' diagonals (the ones with most k-tuple matches)
are used.
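The two techniques can be illustrated with a short sketch that counts exact k-tuple matches per diagonal and then keeps the diagonals with the most matches. It is a simplified illustration only, not the Wilbur and Lipman implementation used by the program; the function names are hypothetical, and a diagonal is identified here by the offset i - j between the two match positions.

```python
from collections import Counter, defaultdict


def ktuple_diagonal_hits(seq_a, seq_b, k=2):
    """Count exact k-tuple matches on each diagonal (offset i - j)."""
    positions = defaultdict(list)
    for j in range(len(seq_b) - k + 1):
        positions[seq_b[j:j + k]].append(j)
    hits = Counter()
    for i in range(len(seq_a) - k + 1):
        for j in positions.get(seq_a[i:i + k], ()):
            hits[i - j] += 1
    return hits


def best_diagonals(hits, topdiag=5):
    """Keep only the diagonals with the most k-tuple matches."""
    return [diagonal for diagonal, _count in hits.most_common(topdiag)]


hits = ktuple_diagonal_hits("XXACGT", "ACGT", k=2)
print(hits, best_diagonals(hits, topdiag=1))
```

A larger k gives fewer chance matches and less work, and a smaller number of retained diagonals restricts the search further. Both trade sensitivity for speed.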
K-Tuple (Word size)
This option allows you to choose which 'word-length' to use when
calculating fast pairwise alignments. Can be 1 or 2 for proteins and 1 to 4
for DNA. Increase this to increase speed; decrease to improve sensitivity.
Window length
This is the number of diagonals around each "top" diagonal that are considered.
Decrease for speed; increase for greater sensitivity.
The allowed range is 1 to 50.
TOPDIAG
This is the number of best diagonals in the imaginary dot-matrix plot that
are considered. Decrease (must be greater than zero) to increase speed;
increase to improve sensitivity. The allowed range is 1 to 50.
PAIRGAP
Here you can set the gap penalty. This is the number of matching residues
that must be found in order to introduce a gap. This should be larger than
K-Tuple size. This has little effect on speed or sensitivity.
The allowed range is 1 to 500.
Score type
The similarity scores may be expressed as raw scores (number of identical
residues minus a "gap penalty" for each gap) or as percentage scores.
If the sequences are of very different lengths, percentage scores make more sense.
Slow alignment options:
These parameters do not have any effect on the speed of the alignments. They
are used to give initial alignments which are then rescored to give percent
identity scores. These % scores are the ones which are displayed on the
screen. The scores are converted to distances for the trees.
Protein weight matrix
Here you can select the scoring table which describes the similarity
of each amino acid to each other.
DNA weight matrix
Here you can select the matrix with the scores assigned to matches and
mismatches (including IUB ambiguity codes).
Gap open
Here you can set the penalty for opening a gap in the alignment.
The allowed range is 0.0 to 100.0
Gap extension
Here you can set the penalty for extending a gap by 1 residue.
The allowed range is 0.0 to 10.0
Multiple sequence alignment options:
These parameters control the final multiple alignment. This is the core of
the program and the details are complicated. To fully understand the use
of the parameters and the scoring system, you will have to refer to the
documentation.
Type
It is critically important for the program to know whether or not it is
aligning DNA or protein sequences. The input routines attempt to guess which
type of sequence is being used by counting the number of A,C,G,T or U's in the
sequences. If the total is more than 85% of the sequence length then DNA is
assumed. If you use very bizarre sequences (proteins with really strange amino
acid compositions or DNA sequences with loads of strange ambiguity codes) you
might confuse the program. Here you can define the sequence type.
Protein weight matrix
This option allows you to choose which matrix series to use when generating
the multiple sequence alignment. The program goes through the chosen matrix
series, spanning the full range of amino acid distances.
DNA weight matrix
For DNA, a single matrix (not a series) is used. Two hard-coded matrices are
available: IUB and CLUSTALW (1.6).
Gap open
This option controls the cost of opening up every new gap.
Increasing the gap opening penalty will make gaps less frequent.
The allowed range is 0.0 to 100.0
Gap extension
This option controls the cost of every item in a gap. Increasing the gap
extension penalty will make gaps shorter.
The allowed range is 0.0 to 10.0
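How the two penalties combine can be shown with a simple sketch. It assumes a plain affine scheme in which a gap costs the opening penalty once plus the extension penalty for every position in the gap; the program modifies these penalties in position-specific ways, so this shows the basic idea only. The function names are hypothetical; the ranges checked are the ones given above.

```python
from itertools import groupby


def gap_cost(length, gap_open, gap_extension):
    """Cost of one gap: opening penalty plus extension per gap position."""
    if not 0.0 <= gap_open <= 100.0:
        raise ValueError("gap open must be between 0.0 and 100.0")
    if not 0.0 <= gap_extension <= 10.0:
        raise ValueError("gap extension must be between 0.0 and 10.0")
    if length <= 0:
        return 0.0
    return gap_open + gap_extension * length


def row_gap_cost(aligned_row, gap_open, gap_extension, gap_char="-"):
    """Total gap cost of one aligned sequence (sum over its gap runs)."""
    total = 0.0
    for is_gap, run in groupby(aligned_row, key=lambda c: c == gap_char):
        if is_gap:
            total += gap_cost(len(list(run)), gap_open, gap_extension)
    return total


print(row_gap_cost("AC---GT--A", gap_open=10.0, gap_extension=0.5))
```

With a high opening penalty, two short gaps cost much more than one long gap, so fewer gaps are introduced. With a high extension penalty, each extra gap position is expensive, so gaps stay short.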
Phylogenetic tree
The method used is the NJ (Neighbour Joining) method of Saitou and Nei. First
you calculate distances (percent divergence) between all pairs of sequences from
a multiple alignment; second you apply the NJ method to the distance matrix.
Tree type
Three output formats are offered: Clustal, Phylip and Just the distances.
Kimura's correction of distances
For small divergence (less than 10%) this option makes no difference.
For greater divergence, this option corrects for the fact that observed
distances underestimate actual evolutionary distances. This is because, as
sequences diverge, more than one substitution will happen at many sites. However,
you only see one difference when you look at the present day sequences.
Therefore, this option has the effect of stretching branch lengths in trees
(especially long branches). The corrections used here (for DNA or proteins)
are both due to Motoo Kimura. See the documentation for details.
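The stretching effect can be seen with a small numerical sketch. The formula below is Kimura's widely quoted approximation for protein distances, K = -ln(1 - p - 0.2 p^2), where p is the observed fraction of differing positions. Treat it as an assumption of this example: the exact corrections the program applies, including the one for DNA, are described in its documentation and are not reproduced here.

```python
import math


def kimura_protein_distance(p):
    """Corrected distance from the observed fraction of differences p,
    using the approximation K = -ln(1 - p - 0.2 * p**2)."""
    inner = 1.0 - p - 0.2 * p * p
    if inner <= 0.0:
        raise ValueError("observed divergence too large for this formula")
    return -math.log(inner)


for observed in (0.05, 0.10, 0.30, 0.50):
    print(observed, round(kimura_protein_distance(observed), 4))
```

For small observed distances the corrected value is almost the same as the observed one, while for large distances it is noticeably bigger, which is why long branches are stretched most.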
Ignore gaps in alignment
With this option, any alignment positions where ANY of the sequences have a
gap will be ignored. This means that 'like' will be compared to 'like' in
all distances. It also, automatically throws away the most ambiguous parts
of the alignment, which are concentrated around gaps (usually). The
disadvantage is that you may throw away much of the data if there are many gaps.
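The effect of this option on the distance calculation can be sketched as follows. The code computes percent divergence for every pair from an existing alignment, optionally dropping every column in which any sequence has a gap. It is an illustration only: the function name, the dictionary input and the decision to skip positions gapped in the pair being compared when the option is off are assumptions of this example.

```python
from itertools import combinations


def percent_divergence(aligned, ignore_gaps=False, gap_chars="-."):
    """Percent of differing positions for each pair of aligned sequences.

    aligned maps sequence name -> aligned sequence (all the same length).
    """
    rows = list(aligned.values())
    columns = range(len(rows[0]))
    if ignore_gaps:
        # Keep only columns where no sequence at all has a gap.
        columns = [c for c in columns
                   if all(row[c] not in gap_chars for row in rows)]
    distances = {}
    for a, b in combinations(aligned, 2):
        compared = differing = 0
        for c in columns:
            x, y = aligned[a][c], aligned[b][c]
            if x in gap_chars or y in gap_chars:
                continue  # assumption: positions gapped in this pair are skipped
            compared += 1
            differing += x != y
        distances[(a, b)] = 100.0 * differing / compared if compared else 0.0
    return distances


alignment = {"s1": "ACGT-A", "s2": "ACGTTA", "s3": "AGGTTC"}
print(percent_divergence(alignment))
print(percent_divergence(alignment, ignore_gaps=True))
```

In the example, the gap in s1 removes one column from the s2/s3 comparison as well when the option is on, so every distance is based on the same five positions.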
Picture formats
Phylogenetic trees can be presented in one or more graphical forms
(picture types):
- slanted cladogram, two versions;
- rectangular cladogram, two versions;
- phylogram, that is, a rectangular cladogram with branches scaled by their length (weight);
- unrooted, two versions (unscaled and scaled branches).
All images have the same size. The width and height of the images may be set
from 320 to 2000 and from 240 to 1500 pixels, respectively. The defaults are 640 and 480.
The unrooted tree with scaled branches (Unrooted 2) has a special option:
max/min factor. A scaled unrooted tree does not look good when its branches (edges)
have very different lengths. This option limits the difference, so that very
short branches are plotted at a length only factor times smaller than the maximum
plotted length. Such branches are displayed in orange.
Very long branches (three at most) are also plotted with shorter, partly
dashed, green lines.
Other advanced options
-case=            LOWER or UPPER (for GDE output only)
-seqnos=          OFF or ON (for Clustal output only)
-negative         protein alignment with negative values in matrix
-endgaps          no end gap separation penalty
-gapdist=n        gap separation penalty range
-nopgap           residue-specific gaps off
-nohgap           hydrophilic gaps off
-hgapresidues=    list of hydrophilic residues
-maxdiv=n         % identity for delay
-transweight=f    transitions weighting
-secstrout=       STRUCTURE or MASK or BOTH or NONE output in alignment file
-helixgap=n       gap penalty for helix core residues
-strandgap=n      gap penalty for strand core residues
-loopgap=n        gap penalty for loop regions
-terminalgap=n    gap penalty for structure termini
-helixendin=n     number of residues inside helix to be treated as terminal
-helixendout=n    number of residues outside helix to be treated as terminal
-strandendin=n    number of residues inside strand to be treated as terminal
-strandendout=n   number of residues outside strand to be treated as terminal