03-511/711, 15-495/856 Course Notes - November 19, 2009


Database Searching and the BLAST Algorithm

Database searching overview

Database searching is essentially a local alignment problem. In theory, a database could be searched for sequences similar to a query sequence using dynamic programming. However, the running time for dynamic programming is O(mn), where m is the length of the query sequence and n is the length of the database. For large databases, this running time is prohibitive.
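To make the O(mn) cost concrete, here is a minimal sketch of local alignment scoring by dynamic programming (Smith-Waterman with a linear gap penalty). The scoring values (match, mismatch, gap) are illustrative assumptions, not parameters taken from these notes; a real protein search would use a substitution matrix.

```python
def local_alignment_score(query, target, match=2, mismatch=-1, gap=-2):
    """Best local alignment score of query vs. target.

    Fills an (m+1) x (n+1) matrix one row at a time, so the running
    time is O(mn). The scoring values are assumed for illustration.
    """
    prev = [0] * (len(target) + 1)
    best = 0
    for i in range(1, len(query) + 1):
        curr = [0] * (len(target) + 1)
        for j in range(1, len(target) + 1):
            s = match if query[i - 1] == target[j - 1] else mismatch
            curr[j] = max(
                0,                 # start a new local alignment here
                prev[j - 1] + s,   # align query[i-1] with target[j-1]
                prev[j] + gap,     # gap in the target
                curr[j - 1] + gap, # gap in the query
            )
            best = max(best, curr[j])
        prev = curr
    return best


print(local_alignment_score("TTACGTTT", "GGACGGG"))
```

The two nested loops visit every (i, j) pair exactly once, which is where the mn factor comes from when the target is an entire database.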

The "typical" amino acid query sequence is 250 - 300 residues long, but query sequences could be much longer:

BRCA2
   Genomic sequence:    86,101 bp
   mRNA:                10,257 bp
   Protein:             3,418 amino acids

Database size (Nov 20, 2009 5:42 PM)
                 Nucleic Acid Sequences    Amino Acid Sequences
   Letters:      29,494,641,191            3,433,246,473
   Sequences:    10,254,931                10,067,804
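As a rough, back-of-the-envelope illustration of why O(mn) is prohibitive at this scale, the sketch below counts the dynamic programming matrix cells needed to search the amino acid database. The rate of 10^9 cell updates per second is purely an assumption for the sake of the estimate, not a measured figure.

```python
def dp_cells(query_len, db_letters):
    """Number of matrix cells filled by an O(mn) dynamic programming search."""
    return query_len * db_letters


AMINO_ACID_DB_LETTERS = 3_433_246_473  # amino acid letters, from the table above
BRCA2_PROTEIN_LEN = 3_418              # from the BRCA2 figures above
ASSUMED_CELLS_PER_SECOND = 1e9         # assumption, for illustration only

for name, m in [("typical query", 300), ("BRCA2 protein", BRCA2_PROTEIN_LEN)]:
    cells = dp_cells(m, AMINO_ACID_DB_LETTERS)
    hours = cells / ASSUMED_CELLS_PER_SECOND / 3600
    print(f"{name}: {cells:.2e} cells, about {hours:.1f} h per search")
```

Under that assumed rate, a single BRCA2 protein query takes on the order of hours, and the nucleic acid database is nearly ten times larger.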



BLAST

Note that one could also define L to be the high scoring words in the data base and compare each w-mer in the query sequence with L. However, this would result in a much bigger hash table. This approach also incurs a performance penalty because the database is accessed randomly rather than scanned sequentially.
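The query-side alternative can be sketched as follows: build a hash table of all words that score at least T against some w-mer of the query, then scan the database sequentially and look up each database w-mer in that table. This is only an illustrative sketch; `toy_score`, `build_word_table`, and `scan_database` are hypothetical names, and the toy scoring function stands in for a real substitution matrix.

```python
from collections import defaultdict
from itertools import product


def toy_score(a, b):
    # Hypothetical scoring function, standing in for a substitution matrix.
    return 5 if a == b else -1


def build_word_table(query, w, T, alphabet, score=toy_score):
    """Map every word scoring >= T against a query w-mer to its query offsets."""
    table = defaultdict(list)
    for i in range(len(query) - w + 1):
        qword = query[i:i + w]
        for cand in product(alphabet, repeat=w):
            if sum(score(a, b) for a, b in zip(qword, cand)) >= T:
                table["".join(cand)].append(i)
    return table


def scan_database(table, database, w):
    """Scan the database left to right; return (query_offset, db_offset) hits."""
    hits = []
    for j in range(len(database) - w + 1):
        for i in table.get(database[j:j + w], ()):
            hits.append((i, j))
    return hits


table = build_word_table("ACD", w=3, T=9, alphabet="ACDE")
print(scan_database(table, "EEACEEE", w=3))
```

In this sketch the table size depends only on the query length and on T, while the database is read once from start to end, which matches the sequential access pattern described above.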

How to select S, w and T?

How are S and E related?

Selecting w and T

Steps 1 and 2 are relatively fast. Step 3 is slow. Therefore, we want to select parameter values, w and T, to minimize the number of hits that must be extended in Step 3. We select S based on the number of false positives that we can tolerate. Then, given S, we select w and T to optimize speed and sensitivity (number of false negatives). In particular, we need to consider the following tradeoff:

In their 1990 paper, Altschul and his colleagues used simulation studies to estimate the probability that hits found in the database with a given set of parameter values would in fact be contained in local ungapped alignments with score ≥ S. In other words, they used a statistical approach to minimize the probability of unnecessarily attempting to extend a hit. They determined empirically that a choice of w=4 and T=17 offered a good compromise between maximizing this probability and avoiding excessive running time. This is discussed in detail in Altschul et al., 1990, on electronic reserve.
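To illustrate why every extra hit is costly, here is a minimal sketch of an ungapped hit extension of the kind performed in Step 3. The function names, the toy scoring function, and the `x_drop` stopping rule (give up once the running score falls more than `x_drop` below the best score seen) are assumptions made for this sketch and are not specified in these notes.

```python
def toy_score(a, b):
    # Hypothetical scoring function, standing in for a substitution matrix.
    return 5 if a == b else -1


def extend_hit(query, target, i, j, w, score=toy_score, x_drop=10):
    """Extend the hit query[i:i+w] / target[j:j+w] in both directions, no gaps.

    Returns the best score of an ungapped segment pair containing the hit.
    """
    seed = sum(score(query[i + k], target[j + k]) for k in range(w))

    def extend(qi, tj, step):
        best = running = 0
        while 0 <= qi < len(query) and 0 <= tj < len(target):
            running += score(query[qi], target[tj])
            if running > best:
                best = running
            elif best - running > x_drop:
                break  # assumed drop-off rule: stop extending
            qi += step
            tj += step
        return best

    return seed + extend(i + w, j + w, 1) + extend(i - 1, j - 1, -1)


S = 20  # illustrative cutoff only
print(extend_hit("ACDEF", "ACDEF", i=1, j=1, w=3) >= S)
```

Each extension walks along both sequences one residue at a time, so a hit that never reaches score S is pure overhead. This is the cost that a good choice of w and T is meant to reduce.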




Last modified: November 19, 2009.
Maintained by Dannie Durand (durand@cs.cmu.edu) and Annette McLeod.