03-511/711, 15-495/856 Course Notes - November 14, 2006


Database Searching and the BLAST Algorithm

Database searching overview

Database searching is essentially a local alignment problem. In theory, a database could be searched for sequences similar to a query sequence using dynamic programming. However, the running time of dynamic programming is O(mn), where m is the length of the query sequence and n is the length of the database. For large databases, this cost is prohibitive.
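To make the O(mn) cost concrete, here is a minimal sketch of local alignment scoring by dynamic programming. The scoring values (match, mismatch, and a linear gap penalty) and the function name are illustrative assumptions, not values from these notes; a real protein search would use a substitution matrix.

```python
def local_alignment_score(query, target, match=1, mismatch=-1, gap=-2):
    """Return the best local alignment score of query against target.

    Every one of the len(query) * len(target) cells is filled once,
    which is where the O(mn) running time comes from.
    """
    m, n = len(query), len(target)
    prev = [0] * (n + 1)   # previous row of the DP table
    best = 0
    for i in range(1, m + 1):
        curr = [0] * (n + 1)
        for j in range(1, n + 1):
            s = match if query[i - 1] == target[j - 1] else mismatch
            curr[j] = max(
                0,                   # start a new local alignment here
                prev[j - 1] + s,     # align query[i-1] with target[j-1]
                prev[j] + gap,       # gap in the target
                curr[j - 1] + gap,   # gap in the query
            )
            best = max(best, curr[j])
        prev = curr
    return best
```

Only two rows are kept in memory, but the number of cells visited is still m times n, so the approach does not scale to a whole database.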

The "typical" amino acid query sequence is 250-300 residues long, but query sequences can be much longer. For example:

BRCA2
   Genomic sequence:   86,101 bp
   mRNA:               10,257 bp
   Protein:             3,418 amino acids

Database (as of 11/08/2006)

              Nucleic Acid Sequences    Amino Acid Sequences
   Letters:           18,386,426,667           1,417,848,065
   Sequences:              4,549,725               4,111,659
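Plugging the sizes above into the O(mn) bound gives a feel for the scale. The sketch below counts dynamic programming cells for a protein query against the amino acid database. The throughput figure is an assumption chosen for illustration only, not a measurement, and the names are hypothetical.

```python
# Sizes taken from the figures listed above.
TYPICAL_QUERY_LENGTH = 300
BRCA2_PROTEIN_LENGTH = 3_418
AMINO_ACID_DB_LETTERS = 1_417_848_065

# Assumed throughput, for illustration only (not a measured figure).
ASSUMED_CELLS_PER_SECOND = 1e9


def dp_cells(m, n):
    """Number of table cells filled by an O(mn) dynamic program."""
    return m * n


def estimated_seconds(m, n, cells_per_second=ASSUMED_CELLS_PER_SECOND):
    """Rough running time under the assumed throughput."""
    return dp_cells(m, n) / cells_per_second


for name, m in [("typical query", TYPICAL_QUERY_LENGTH),
                ("BRCA2 protein", BRCA2_PROTEIN_LENGTH)]:
    cells = dp_cells(m, AMINO_ACID_DB_LETTERS)
    minutes = estimated_seconds(m, AMINO_ACID_DB_LETTERS) / 60
    print(f"{name}: {cells:.2e} cells, about {minutes:.0f} minutes")
```

Under this assumed rate, the BRCA2 protein query alone fills roughly 4.8 x 10^12 cells, which is more than an hour for a single search.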



BLAST

Note that one could instead define L to be the set of high-scoring words in the database and compare each w-mer in the query sequence with L. However, this would result in a much larger hash table. It would also incur a performance penalty, because the database would be accessed randomly rather than scanned sequentially.
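The arrangement implied here, in which the word list L is built from the query and the database is scanned sequentially, can be sketched as follows. This is a toy illustration under stated assumptions: a four-letter DNA alphabet, a simple match/mismatch word score in place of a substitution matrix, and hypothetical function names (build_word_table, find_hits). It is not the BLAST implementation.

```python
from itertools import product

ALPHABET = "ACGT"  # toy alphabet (assumption); proteins would use 20 letters


def word_score(a, b, match=2, mismatch=-1):
    """Score two equal-length words position by position (assumed scoring)."""
    return sum(match if x == y else mismatch for x, y in zip(a, b))


def build_word_table(query, w, T):
    """Map each word scoring >= T against some query w-mer to its query positions."""
    table = {}
    for i in range(len(query) - w + 1):
        qword = query[i:i + w]
        # Brute-force enumeration of all words of length w; fine for a sketch.
        for letters in product(ALPHABET, repeat=w):
            word = "".join(letters)
            if word_score(qword, word) >= T:
                table.setdefault(word, []).append(i)
    return table


def find_hits(table, database, w):
    """Scan the database once, left to right, and report (query_pos, db_pos) hits."""
    hits = []
    for j in range(len(database) - w + 1):
        for i in table.get(database[j:j + w], ()):
            hits.append((i, j))
    return hits


table = build_word_table("ACGT", w=3, T=6)   # T=6 admits exact matches only
print(find_hits(table, "TTACGTTT", w=3))
```

The table size depends only on the query length, w, and T, and the database is read in a single sequential pass. Building the table from the database instead would make it grow with the database size.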

How to select S, w and T?

How are S and E related?

Selecting w and T

Steps 1 and 2 are relatively fast. Step 3 is slow. Therefore, we want to select the parameter values w and T to minimize the number of hits that must be extended in Step 3. We select S based on the number of false positives that we can tolerate. Then, given S, we select w and T to optimize speed and sensitivity (that is, to limit the number of false negatives). In particular, we need to consider the following tradeoff:

In their 1990 paper, Altschul and his colleagues used simulation studies to estimate the probability that hits found in the database with a given set of parameter values would in fact be contained in local ungapped alignments with score ≥ S. In other words, they used a statistical approach to minimize the probability of unnecessarily attempting to extend a hit. They determined empirically that a choice of w=4 and T=17 offered a good compromise between maximizing this probability and keeping the running time acceptable. This is discussed in detail in Altschul et al., 1990, on electronic reserve.
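To illustrate what extending a hit into a local ungapped alignment involves, here is a minimal sketch. It assumes a simple match/mismatch score and a hypothetical function name, extend_hit. It also extends all the way to the sequence ends and keeps the best-scoring extent in each direction. BLAST itself stops extending once the score drops far enough below the best seen so far, which this sketch omits for simplicity.

```python
def extend_hit(query, target, i, j, w, match=2, mismatch=-1):
    """Extend a word hit at query[i:i+w] / target[j:j+w] without gaps.

    Returns (score, query_start, target_start, length) of the best
    ungapped alignment that contains the hit.
    """
    def score(a, b):
        return match if a == b else mismatch

    def best_extension(pairs):
        # Walk outward from the hit, remembering the best running score
        # and how many positions were needed to reach it.
        best = run = length = 0
        for k, (a, b) in enumerate(pairs, start=1):
            run += score(a, b)
            if run > best:
                best, length = run, k
        return best, length

    seed = sum(score(query[i + k], target[j + k]) for k in range(w))
    right, r_len = best_extension(zip(query[i + w:], target[j + w:]))
    left, l_len = best_extension(zip(query[:i][::-1], target[:j][::-1]))
    return seed + left + right, i - l_len, j - l_len, l_len + w + r_len


print(extend_hit("GGACGTCC", "TTACGTAA", i=2, j=2, w=3))
```

A hit would be reported only if the returned score is at least S. Because each extension can touch many positions, the number of hits passed to this step dominates the running time, which is why w and T are chosen to keep that number small.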




Last modified: November 14, 2006.
Maintained by Dannie Durand (durand@cs.cmu.edu) and Annette McLeod.