BLAST Frequently Asked Questions (FAQ)

What is BLAST?
What are Gapped BLAST and PSI-BLAST?
Which BLAST program should I use?
Why do I see a string of "X"s (or "N"s) in my query sequence that I did not put there?
What is low-complexity sequence?
What is the Expect (E) value?
How do I perform a similarity search with a short peptide/nucleotide sequence?
How can I see low-similarity matches when there are many strong hits to my query sequence?

Q: What is BLAST?

(Altschul et al., 1990)

Q: What are Gapped BLAST and PSI-BLAST?

(Altschul et al., 1997)

The Gapped BLAST algorithm allows gaps (deletions and insertions) to be introduced into the alignments that are returned. Allowing gaps means that similar regions are not broken into several segments. The scoring of these gapped alignments tends to reflect biological relationships more closely.

Position-Specific Iterated BLAST (PSI-BLAST) provides an automated, easy-to-use version of a "profile" search, which is a sensitive way to look for sequence homologues. The program first performs a gapped BLAST database search. It then uses the information from any significant alignments returned to construct a position-specific score matrix, which replaces the query sequence for the next round of database searching. PSI-BLAST may be iterated until no new significant alignments are found. At this time, PSI-BLAST may be used only for comparing protein queries with protein databases.
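The iterate-until-nothing-new control flow described above can be sketched in a few lines of Python. This is only a toy illustration of the loop, not PSI-BLAST itself: the names iterate_until_converged, search, and build_profile are hypothetical, the "profile" is just a set of sequence names rather than a position-specific score matrix, and the "database" is a hand-written table of which sequences find which others.

```python
from typing import Callable, FrozenSet, Set, Tuple

Profile = FrozenSet[str]


def iterate_until_converged(
    query: str,
    search: Callable[[Profile], Set[str]],
    build_profile: Callable[[str, Set[str]], Profile],
    max_rounds: int = 10,
) -> Tuple[Set[str], int]:
    """Repeat search -> rebuild profile until a round finds no new hits."""
    found: Set[str] = set()
    profile = build_profile(query, found)  # round 1 uses the query alone
    rounds = 0
    while rounds < max_rounds:
        rounds += 1
        new_hits = search(profile) - found
        if not new_hits:
            break  # converged: nothing new was found in this round
        found |= new_hits
        profile = build_profile(query, found)
    return found, rounds


# Toy stand-ins (assumptions for illustration only).
NEIGHBORS = {"q": {"a", "b"}, "a": {"c"}, "b": set(), "c": set()}


def toy_search(profile: Profile) -> Set[str]:
    """Return every sequence that any member of the profile can find."""
    hits: Set[str] = set()
    for member in profile:
        hits |= NEIGHBORS[member]
    return hits


def toy_build_profile(query: str, hits: Set[str]) -> Profile:
    """Combine the query with all hits found so far."""
    return frozenset({query} | hits)


found, rounds = iterate_until_converged("q", toy_search, toy_build_profile)
print(sorted(found), rounds)
```

In this toy run, sequence "c" is not reachable from the query directly and is found only in the second round, after "a" has been added to the profile. That is the kind of gain in sensitivity an iterated profile search aims for.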

Q: Which BLAST program should I use?

Overview

Q: After running a search, why do I see a string of "X"s (or "N"s) in my query sequence that I did not put there?

(Wootton & Federhen, 1996)

You can change the default and remove these filters if you like. On the Basic BLAST Web interface you will see a button to click that will remove the filter. On the Advanced Page you can set the filter to "none" in the menu. For email BLAST you can use the following command (filter NONE).

Q: What is low-complexity sequence?

(Wootton & Federhen, 1996)

See also: Why do I see a string of "X"s (or "N"s) in my query sequence?

In BLAST searches performed without a filter, often certain hits will be reported with high scores only because of the presence of a low-complexity region. Most often, this type of match cannot be thought of as the result of homology shared by the sequences. Rather, it is as if the low-complexity region is "sticky" and is pulling out many sequences that are not truly related.

Q: What is the Expect (E) value?

The Expect value is used as a convenient way to create a significance threshold for reporting results. When the Expect value is increased from the default value of 10, a larger list with more low-scoring hits can be reported.

In BLAST 2.0, the Expect value is also used instead of the P value (probability) to report the significance of matches. For example, an E value of 1 assigned to a hit can be interpreted as meaning that in a database of the current size one might expect to see 1 match with a similar score simply by chance.
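As a rough numerical illustration, the sketch below uses the standard relation between a normalized bit score S' and the Expect value, E = m * n * 2^(-S'), where m is the query length and n is the database length. This formula is not given in this FAQ and is brought in as an assumption. The function names are hypothetical, and raw lengths are used where BLAST itself applies corrected "effective" lengths, so the numbers will not match real BLAST output exactly.

```python
import math


def expect_value(bit_score: float, query_len: int, db_len: int) -> float:
    """Expected number of chance hits scoring at least `bit_score`.

    Uses E = m * n * 2**(-S'), with raw (uncorrected) lengths.
    """
    return query_len * db_len * 2.0 ** (-bit_score)


def p_value(e_value: float) -> float:
    """Probability of at least one chance hit, P = 1 - exp(-E)."""
    return -math.expm1(-e_value)


def reportable(e_value: float, threshold: float = 10.0) -> bool:
    """True if a hit passes the Expect threshold (default 10, as in the text)."""
    return e_value <= threshold


# A 1,024-residue query against a 1,048,576-residue database.
for score in (20, 30, 40):
    e = expect_value(score, 1024, 2**20)
    print(score, e, p_value(e), reportable(e))
```

Under these assumptions, each additional bit of score halves the E value. For small E, the P value and the E value are nearly identical, which is one reason reporting E in place of P loses little information.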

Q: How do I perform a similarity search with a short peptide/nucleotide sequence?

For most searches, an Expect value up to 1000 is enough to see results. However, you can raise the E value further on the Advanced BLAST Web page by typing -e 10000, for example, in the Other Advanced Options box.

If you still do not get results after increasing the E value, you may want to try decreasing the Word size (W), another parameter that becomes important with a short query. The BLAST algorithm uses "words" to nucleate regions of similarity. The default Word size for a protein sequence is 3 residues and for nucleotide sequences it is 11 bp. A blastn search will not work with a Word size of less than 7. A good rule of thumb is that the query length must be at least twice the Word size. For example, if your query is a protein sequence of 4 residues, then the Word size should be reduced to 2. Please note that the smaller the Word size, the slower your search will be.

You can lower the default word size on the Advanced BLAST Web page. In the Other Advanced Options, type -W some_number (for example, -W 9).
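The rule of thumb above can be written as a small helper. This is a minimal sketch: the function name suggest_word_size is hypothetical, and it encodes only the numbers stated in this FAQ (defaults of 3 and 11, a blastn minimum of 7, and a query length of at least twice the Word size). It does not check which values a particular BLAST version actually accepts.

```python
from typing import Optional

DEFAULT_WORD_SIZE = {"protein": 3, "nucleotide": 11}
MIN_BLASTN_WORD_SIZE = 7  # blastn will not work below this value


def suggest_word_size(query_len: int, kind: str) -> Optional[int]:
    """Largest Word size W <= default such that query_len >= 2 * W.

    Returns None when the query is too short for any usable Word size.
    """
    word_size = min(DEFAULT_WORD_SIZE[kind], query_len // 2)
    if word_size < 1:
        return None
    if kind == "nucleotide" and word_size < MIN_BLASTN_WORD_SIZE:
        return None
    return word_size


print(suggest_word_size(4, "protein"))      # 4-residue peptide from the text
print(suggest_word_size(18, "nucleotide"))  # 18 bp query
print(suggest_word_size(12, "nucleotide"))  # too short for blastn by this rule
```

By this rule, a nucleotide query shorter than 14 bp cannot satisfy both the blastn minimum of 7 and the twice-the-Word-size guideline, so the helper returns None instead of a value.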

Sometimes a short query does not produce results because it contains low-complexity sequence. Often this type of sequence can be recognized by the human eye because it looks very redundant, for example the protein sequence PADPPPDPPPP or the nucleotide sequence AAATTTAAAAAT. A filter for low-complexity sequence is applied by default to BLAST nucleotide and protein searches. If your query has regions of low-complexity sequence, then large portions of your query may be filtered out, essentially making your query shorter than you might have expected. Removing the filter will help in these cases.
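To make "looks very redundant" concrete, the sketch below scores a sequence by the Shannon entropy of its residue composition and masks windows that fall below a cutoff. This is not the filter that BLAST applies. It is a simplified stand-in, and the function names, window length, and threshold are arbitrary assumptions chosen for illustration.

```python
import math
from collections import Counter


def shannon_entropy(seq: str) -> float:
    """Compositional entropy in bits; lower means more repetitive."""
    total = len(seq)
    counts = Counter(seq)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def mask_low_complexity(seq: str, window: int = 12,
                        threshold: float = 2.2, mask_char: str = "X") -> str:
    """Replace residues in every low-entropy window with `mask_char`."""
    masked = [False] * len(seq)
    # A sequence shorter than the window is scored as a single window.
    for start in range(max(1, len(seq) - window + 1)):
        chunk = seq[start:start + window]
        if shannon_entropy(chunk) < threshold:
            for i in range(start, start + len(chunk)):
                masked[i] = True
    return "".join(mask_char if m else c for c, m in zip(seq, masked))


for example in ("PADPPPDPPPP", "ACDEFGHIKLMNPQRSTVWY"):
    filtered = mask_low_complexity(example)
    remaining = sum(1 for c in filtered if c != "X")
    print(example, filtered, remaining)
```

With these illustrative settings, the proline-rich example is masked completely, leaving no residues to search with, while the sequence containing each of the 20 amino acids once is left untouched. This mirrors how filtering can make a short query effectively shorter than its nominal length.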

Finally, you can change the matrix to optimize for searching with short protein sequences. For information on query length and the matrix see the document

Q: How can I see low-similarity matches when there are many strong hits to my query sequence?

Other Advanced Options