- What is BLAST?
- What are Gapped BLAST and PSI-BLAST?
- Which BLAST program should I use?
- Why do I see a string of "X"s (or "N"s) in my query sequence
  that I did not put there?
- What is low-complexity sequence?
- What is the Expect (E) value?
- How do I perform a similarity search with a short peptide/nucleotide
  sequence?
- How can I see low-similarity matches when there are many
  strong hits to my query sequence?
Q: What is BLAST?
BLAST (Basic Local Alignment Search Tool) is a set of similarity search
programs designed to explore all of the available sequence databases regardless
of whether the query is protein or DNA. The BLAST programs have been designed
for speed, with a minimal sacrifice of sensitivity to distant sequence
relationships. The scores assigned in a BLAST search have a well-defined
statistical interpretation, making real matches easier to distinguish from
random background hits. BLAST uses a heuristic algorithm which seeks local
as opposed to global alignments and is therefore able to detect relationships
among sequences which share only isolated regions of similarity (Altschul
et al., 1990).
Q: What are Gapped BLAST and PSI-BLAST?
Gapped BLAST and PSI-BLAST are useful search tools provided by the BLAST
server (version 2.0) (Altschul
et al., 1997).
The Gapped BLAST algorithm allows gaps (deletions and insertions) to
be introduced into the alignments that are returned. Allowing gaps means
that similar regions are not broken into several segments. The scoring
of these gapped alignments tends to reflect biological relationships more
closely.
Position-Specific Iterated BLAST (PSI-BLAST) provides an automated,
easy-to-use version of a "profile" search, which is a sensitive way to
look for sequence homologues. The program first performs a gapped BLAST
database search. The PSI-BLAST program uses the information from any significant
alignments returned to construct a position-specific score matrix, which
replaces the query sequence for the next round of database searching. PSI-BLAST
may be iterated until no new significant alignments are found. At this
time PSI-BLAST may be used only for comparing protein queries with protein
databases.
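As a rough illustration of what a position-specific score matrix is, the sketch below builds one from a handful of ungapped, equal-length aligned segments and scores new segments against it. It is a toy under stated assumptions, not the PSI-BLAST method: the uniform background frequency, the fixed pseudocount, the function names, and the example sequences are all invented here, and the actual program weights sequences and estimates frequencies in a more elaborate way (Altschul et al., 1997).

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(aligned, pseudocount=1.0):
    """Return one dict of log-odds scores (in bits) per alignment column.

    Assumes ungapped, equal-length segments made of the 20 standard
    residues and a uniform background of 1/20 per residue. Both are
    simplifications chosen for illustration only.
    """
    if not aligned:
        raise ValueError("need at least one aligned segment")
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("all segments must have the same length")

    background = 1.0 / len(AMINO_ACIDS)
    total = len(aligned) + pseudocount * len(AMINO_ACIDS)
    pssm = []
    for col in range(length):
        counts = Counter(seq[col] for seq in aligned)
        pssm.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return pssm


def score(pssm, segment):
    """Sum the column scores for a segment of the same length as the matrix."""
    return sum(column[aa] for column, aa in zip(pssm, segment))


# Made-up segments standing in for significant hits from a first search round.
hits = ["MKVLA", "MKILA", "MRVLA", "MKVLG"]
pssm = build_pssm(hits)
print(score(pssm, "MKVLA"), score(pssm, "MWWWA"))
```

Residues that recur in a column receive positive scores and residues never seen there receive negative ones, which is why a matrix of this kind can be more sensitive than the single query sequence it replaces.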
Q: Which BLAST program should I use?
You have many choices to make between different BLAST programs and how
to access them. Please see the Overview
for more information on this topic. The easiest way to search is to use
the BLAST Web pages. Simply paste your sequence into the box, choose a
database, choose a program that matches your sequence to the database,
and press Search. There are many additional parameters that can be controlled,
but for a basic search, the default options work well.
Q: After running a search why do I see a string of "X"s (or "N"s) in my
query sequence that I did not put there?
You are seeing the result of automatic filtering of your query for low-complexity
sequence, which is performed to prevent artifactual hits. The filter replaces
any low-complexity sequence that it finds with the letter "N" in nucleotide
sequences (e.g., "NNNNNNNNNNNNN") or the letter "X" in protein sequences
(e.g., "XXXXXXXXX"). Low-complexity regions can result in high scores that
reflect compositional bias rather than significant position-by-position
alignment (Wootton & Federhen, 1996). Filter programs can eliminate these
potentially confounding matches from the BLAST reports, leaving regions whose
BLAST statistics reflect the specificity of their pairwise alignment. Queries
searched with the blastn program are filtered with DUST. The other BLAST
programs use SEG.
You can change the default and remove these filters if you like. On
the Basic BLAST Web interface, there is a button you can click to
remove the filter. On the Advanced Page, you can set the filter to "none"
in the menu. For email BLAST, you can use the following command: filter
NONE.
Q: What is low-complexity sequence?
Regions with low-complexity sequence have an unusual composition and this
can create problems in sequence similarity searching (Wootton
& Federhen, 1996). Low-complexity sequence can often be recognized
by visual inspection. For example, the protein sequence PPCDPPPPPKDKKKKDDGPP
has low complexity and so does the nucleotide sequence AAATAAAAAAAATAAAAAAT.
Filters are used to remove low-complexity sequence because it can cause
artifactual hits (please see Q: After running a search why
do I see a string of "X"s (or "N"s) in my query sequence that I did
not put there?).
In BLAST searches performed without a filter, often certain hits will
be reported with high scores only because of the presence of a low-complexity
region. Most often, this type of match cannot be thought of as the result
of homology shared by the sequences. Rather, it is as if the low-complexity
region is "sticky" and is pulling out many sequences that are not truly
related.
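One simple way to put a number on "low complexity" is the Shannon entropy of the residues in a sliding window: windows dominated by one or two letters score close to zero. The sketch below masks such windows. It is only an illustration and is neither SEG nor DUST; the window length, the entropy threshold, and the function names are arbitrary assumptions chosen for this example.

```python
import math
from collections import Counter


def shannon_entropy(window):
    """Entropy in bits of the letter composition of a string."""
    counts = Counter(window)
    n = len(window)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def mask_low_complexity(seq, window=12, threshold=2.2, mask_char="X"):
    """Replace every window whose entropy is below threshold with mask_char.

    window and threshold are illustrative values, not BLAST defaults.
    Use mask_char="N" for nucleotide sequences.
    """
    masked = list(seq)
    last_start = max(len(seq) - window, 0)
    for start in range(last_start + 1):
        chunk = seq[start:start + window]
        if shannon_entropy(chunk) < threshold:
            for i in range(start, start + len(chunk)):
                masked[i] = mask_char
    return "".join(masked)


# The two example sequences from the answer above.
print(mask_low_complexity("PPCDPPPPPKDKKKKDDGPP"))
print(mask_low_complexity("AAATAAAAAAAATAAAAAAT", threshold=1.0, mask_char="N"))
```

A nucleotide alphabet has at most 2 bits of entropy per position and a protein alphabet about 4.3, so a sensible threshold depends on the sequence type.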
Q: What is the Expect (E) value?
The Expect value (E) is a parameter that describes the number of hits one
can "expect" to see just by chance when searching a database of a particular
size. It decreases exponentially with the Score (S) that is assigned to
a match between two sequences. Essentially, the E value describes the random
background noise that exists for matches between sequences.
The Expect value is used as a convenient way to create a significance
threshold for reporting results. When the Expect value is increased from
the default value of 10, a larger list with more low-scoring hits can be
reported.
In BLAST 2.0, the Expect value is also used instead of the P value (probability)
to report the significance of matches. For example, an E value of 1 assigned
to a hit can be interpreted as meaning that in a database of the current
size one might expect to see 1 match with a similar score simply by chance.
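The dependence of E on score and database size is commonly written in the Karlin-Altschul form E = K * m * n * exp(-lambda * S), where m is the query length, n is the total database length, and K and lambda are statistical parameters tied to the scoring system. The sketch below evaluates that expression. The default values of K and lambda are illustrative assumptions (of the size often quoted for ungapped BLOSUM62 scoring), and the function name is hypothetical. BLAST reports the parameters it actually used in its output.

```python
import math


def expect_value(raw_score, query_len, db_len, lam=0.318, k=0.134):
    """E = K * m * n * exp(-lambda * S).

    lam and k are placeholder values for illustration; real values
    depend on the scoring matrix and gap penalties in use.
    """
    return k * query_len * db_len * math.exp(-lam * raw_score)


# Same query and database, increasing raw score: E falls off exponentially.
for s in (30, 40, 50, 60):
    print(s, expect_value(s, query_len=250, db_len=100_000_000))
```

Doubling the database size doubles E for the same score, which is why a match of a given score becomes less significant as the database grows.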
Q: How do I perform a similarity search with a short peptide/nucleotide
sequence?
First, you will probably need to increase the Expect (E) value in your
search. A short query is more likely to occur by chance in the database.
Therefore, even a perfect match can have low statistical significance and
may not be reported. Increasing the E value allows you to look farther
down in the hit list and see matches that would normally be discarded because
of low statistical significance.
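A back-of-the-envelope calculation shows why short queries match by chance. Under the simplifying assumption that database letters are independent and equally frequent (not true of real sequences), an exact copy of a query of length L is expected about N * A^(-L) times in a database of N letters over an alphabet of size A. The function name below is hypothetical.

```python
def expected_chance_matches(query_len, db_len, alphabet_size):
    """Expected exact occurrences of a fixed word in random sequence.

    Assumes independent, uniformly distributed letters, which is a
    deliberate simplification for illustration.
    """
    positions = max(db_len - query_len + 1, 0)
    return positions * alphabet_size ** (-query_len)


# Nucleotide queries of increasing length against a database of 10**9 bases.
for length in (8, 12, 16, 20):
    print(length, expected_chance_matches(length, 10**9, 4))
```

Each extra base divides the expected count by four, so a few residues of query length make a large difference to significance.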
For most searches, an Expect value up to 1000 is enough to see results.
However, you can raise the E value further on the Advanced BLAST Web page
by typing -e 10000, for example, in the Other Advanced Options box.
If you still do not get results after increasing the E value, you may
want to try decreasing the Word size (W), another parameter that becomes
important with a short query. The BLAST algorithm uses "words" to nucleate
regions of similarity. The default Word size for a protein sequence is
3 residues and for nucleotide sequences it is 11 bp. A blastn search will
not work with a Word size of less than 7. A good rule of thumb is that
the query length must be at least twice the Word size. For example, if
your query is a protein sequence of 4 residues, then the Word size should
be reduced to 2. Please note that the smaller the Word size, the slower
your search will be.
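The rule of thumb above can be sketched as a small helper, together with a function that lists the overlapping words of a query. The defaults (3 for protein, 11 for nucleotide) and the blastn minimum of 7 come from the text above. The function names and the "protein"/"nucleotide" labels are hypothetical.

```python
DEFAULT_WORD_SIZE = {"protein": 3, "nucleotide": 11}
MIN_BLASTN_WORD_SIZE = 7


def suggest_word_size(query_len, kind):
    """Largest word size not above the default and at most half the query.

    Returns None when no word size satisfies the rule of thumb, for
    example a nucleotide query shorter than 14 bases.
    """
    word_size = min(DEFAULT_WORD_SIZE[kind], query_len // 2)
    if word_size < 1:
        return None
    if kind == "nucleotide" and word_size < MIN_BLASTN_WORD_SIZE:
        return None
    return word_size


def query_words(query, word_size):
    """All overlapping words of the given size contained in the query."""
    return [query[i:i + word_size] for i in range(len(query) - word_size + 1)]


print(suggest_word_size(4, "protein"))       # the 4-residue example above
print(query_words("MKVL", 2))
```

A query of length L contains L - W + 1 words of size W, so a short query with a large word size leaves very few places for a match to start.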
You can lower the default Word size on the Advanced BLAST Web page.
In the Other Advanced Options box, type -W some_number (for example, -W 9).
Sometimes a short query does not produce results because it
contains low-complexity sequence. Often this type of sequence can be recognized
by the human eye because it looks very redundant, for example the protein
sequence PADPPPDPPPP or the nucleotide sequence AAATTTAAAAAT. A filter
for low-complexity sequence is applied by default to BLAST nucleotide and
protein searches. If your query has regions of low-complexity sequence,
then large portions of your query may be filtered out, essentially making
your query shorter than you might have expected. Removing the filter will
help in these cases.
Finally, you can change the matrix to optimize for searching with short
protein sequences. For information on query length and the matrix see the
document
Q: How can I see low-similarity matches when there are many strong hits
to my query sequence?
Often, when the query is a member of a large sequence family, the summary
hit list and the alignments returned only contain very high scoring hits.
To look at low-similarity matches, you must increase the maximum number
of results returned.
On the BLAST Web pages, it is often sufficient to increase the size
of the summary hit list and the number of alignments shown using the menus
on the Advanced pages. However, it is possible to increase the lists even
further using the Other Advanced Options box on the Advanced BLAST pages.
For BLAST 2.0, "-v 2000", for example, will increase the number of
descriptions returned in the summary hit list to 2000. The option
"-b 2000" will similarly increase the number of alignments returned.
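If you prepare searches from a script, the option strings mentioned in this FAQ (-e, -W, -v, -b) can be assembled with a small helper like the one below. The helper and its parameter names are hypothetical conveniences. It only builds the text you would type into the Other Advanced Options box and does not contact any BLAST server.

```python
def advanced_options(expect=None, word_size=None, descriptions=None, alignments=None):
    """Build an option string using the flags described in this FAQ.

    -e Expect value, -W Word size, -v number of descriptions,
    -b number of alignments. Unset options are left out.
    """
    parts = []
    if expect is not None:
        parts += ["-e", str(expect)]
    if word_size is not None:
        parts += ["-W", str(word_size)]
    if descriptions is not None:
        parts += ["-v", str(descriptions)]
    if alignments is not None:
        parts += ["-b", str(alignments)]
    return " ".join(parts)


print(advanced_options(descriptions=2000, alignments=2000))
print(advanced_options(expect=10000, word_size=9))
```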