03-511/711, 15-495/856 Course Notes - November 8, 2011

Database Searching and the BLAST Algorithm

Data base searching overview

Some applications:
- Compare cDNA with genomic DNA
  - To find gene location
  - To identify intron/exon boundaries
- Protein homology
  - Compare protein sequence with protein database
  - Structural and functional predictions
  - Evolutionary analyses
- Verification of gene prediction
Goals
- Find all high scoring local alignments (above some threshold)
- Determine significance of alignments found
  - Null hypothesis - sequences are unrelated
    p_j, p_k background frequencies
  - Alternate hypothesis - sequences are related
    q_jk target frequencies

Database searching is essentially a local alignment problem. In theory, a data base could be searched for sequences similar to a query sequence using dynamic programming. However, the running time for dynamic programing is O(mn), where m is the length of the query sequence and n is the length of the data base. For large data bases, the complexity of this approach is prohibitive.

The "typical" amino acid query sequence is 250 - 300 residues long, but query sequences could be much longer:

BRCA2
Genomic sequence:	86,101 bp
mRNA	10,257 bp
Protein	3,418 amino acids

Data base
Nov 7, 2009 2:15 PM
	Nucleic Acid Sequences	Amino Acid Sequences
Letters:	38,689,746,090	5,510,094,084
Sequences:	14,905,504	16,036,619

Database searching heuristics
- FASTA - Pearson and Lipman, 1988 not covered in this course.
- Original, ungapped BLAST: Altschul et al., '90
- Gapped BLAST: Altschul et al. '97

BLAST

Terminology
- Query sequence, Q, of length m.
- Data base, D, of length n.
- Scoring matrix S[ ].
- Score threshold: S
- Segment pair
  - Ungapped local alignment
- Maximal segment pair (MSP)
  - A segment pair whose score cannot be improved by extending or shortening
- High scoring segment pair (HSP)
  - Segment pair with score ≥ S
- Word or w-mers
  - w consecutive letters
Goals
- Given Q and D, find all HSP's with statistical scores
BLAST statistics (Karlin and Altschul, 1990)
- Given S, estimate expected number of HSP's under a random sequence model.
- Used to
  - optimize the performance of the heuristic.
  - evaluate search results independent of scoring matrix.
Approach
1. Construct L=list of high scoring words derived from query sequence. For proteins, high scoring words are w-mers with a similarity score ≥ T when aligned with a subsequence of length w in the query sequence using the PAM120 substitution matrices. (I used BLOSUM62 in class.) For DNA, high scoring words are the m - w + 1 subsequences of length w.
2. Compare each w-mer in the database sequence with the words in L to find "hits". This can be done by hashing L or by using a Mealy machine.
3. Extend hits to obtain HSP's with scores ≥ S. Limit time spent on this step by using a score cutoff. If score of extended alignment is lower than the best score seen so far by the amount of the cutoff, give up aligning in that direction.

Note that one could also define L to be the high scoring words in the data base and compare each w-mer in the query sequence with L. However, this would result in a much bigger hash table. This approach also incurs a performance penalty because the database is accessed randomly rather than scanned sequentially.

How to select S, w and T?

Choose desired E (the number false positives tolerated). This gives a value for S.
Given S, select w and t to optimize the number of false negatives and speed of search

How are S and E related?

Let E be the expected number of false positives; i.e., E is the expected number of HSPs with score >= S given a "random" query sequence of length m and a "random" data base sequence of length n. Here, "random" means sampled with background frequencies.
- H0 - sequences are unrelated (background frequencies). The probability of seeing i aligned with j is p_i p_j.
- Ha - sequences are related (target frequencies). The probability of seeing i aligned with j is q_ij.
Approach to estimating E
- Model alignment as a random walk
  
  G G A G ...
  G A A C ...
  - We can treat an alignment as random walk, where a match is a step in the positive direction and a mismatch is a step in the negative direction.
  - Each site corresponds to one step in the walk.
  - Step size - S[i,j]
  - Step probability - p_i• p_j
- HSP's in the alignment are "excursions" in the random walk.
  - We define a "ladder point" L to be a new low in this random walk.
  - Let the random variable Y_i describe the maximum point between the ladder points L_i and L_i+1
  - An excursion is the walk from the L_i to Y_i.
  - Excursions correspond to segment pairs (ungapped local arrangements).
- Apply known statistics about excursions of length at least S to get statistics about HSPs of score at least S:
  - In order to apply random walk theory, we require:
    1. at least one positive step (S[i,j] ≥ 0) and one negative step (S[k,l] < 0)
    2. step size has negative mean Σ_i,j S[i,j] p_i p_j < 0
  - Then the expected number of HSP's under H0 is
```
   E = Kmne^-λS     (1)
```
    where λ is specified by the equation
```
   1 = Σ_i,j pi•pj e^λS[i,j]
```
    and K can be computed analytically for various substitution matrices from the theory. (Simulations are not necessary.)
- Note that
  - The PAM and BLOSUM matrices meet conditions (i) and (ii). They contain positive and negative entries and Σ_i,j S[i,j] p_ip_j < 0,
  - E increases with sequence lengths m and n and decreases exponentially with S.

Selecting w and T

Steps 1 and 2 are relatively fast. Step 3 is slow. Therefore, we want to select parameter values, w and T, to minimize the number of hits that must be extended in Step 3. We select S based on the number false positives that we can tolerate. Then, given S, select w and T to optimize speed and sensitivity (number of false negatives). In particular, we need to consider the following tradeoff:

T Low
- Fewer false positives
- More hits to extend
T High
- More false positives
- Fewer hits to extend

In this paper, Altschul and his colleagues used simulation studies to estimate the probability that hits found with a given set of parameter values in the data base would in fact be contained in local ungapped alignments with score ≥ S. In other words, they used a statistical approach to minimize the probability of unnecessarily attempting to extend a hit. They determined in empirically that a choice of w=4 and T=17 offered a good compromise between maximizing this probability and an excessive running time. This is discussed in detail in Altschul et al., 1990, on electronic reserve.

Last modified: November 8, 2011.
Maintained by Dannie Durand (durand@cs.cmu.edu) and Annette McLeod.