03-511/711, 15-495/856 Course Notes - November 8, 2011
Database Searching and the BLAST Algorithm
Data base searching overview
- Some applications:
- Compare cDNA with genomic DNA
- To find gene location
- To identify intron/exon boundaries
- Protein homology
- Compare protein sequence with protein database
- Structural and functional predictions
- Evolutionary analyses
- Verification of gene prediction
- Goals
- Find all high scoring local alignments (above some threshold)
- Determine significance of alignments found
- Null hypothesis - sequences are
unrelated
pj, pk
background frequencies
- Alternate hypothesis - sequences are related
qjk target frequencies
Database searching is essentially a local alignment problem. In theory, a
data base could be searched for sequences similar to a query sequence using
dynamic programming. However, the running time for dynamic programing is
O(mn), where m is the length of the query sequence and n is
the length of the data base. For large data bases, the complexity of this
approach is prohibitive.
The "typical" amino acid query sequence is 250 - 300 residues long, but query
sequences could be much longer:
BRCA2 |
Genomic sequence: |
86,101 bp |
mRNA |
10,257 bp |
Protein |
3,418 amino
acids |
Data base |
Nov 7, 2009 2:15 PM |
|
Nucleic Acid Sequences |
Amino Acid Sequences |
Letters: |
38,689,746,090 |
5,510,094,084 |
Sequences: |
14,905,504 |
16,036,619 |
- Database searching heuristics
- FASTA - Pearson and Lipman, 1988 not covered in this course.
- Original, ungapped BLAST: Altschul et al., '90
- Gapped BLAST: Altschul et al. '97
BLAST
- Terminology
- Query sequence, Q, of length m.
- Data base, D, of length n.
- Scoring matrix S[ ].
- Score threshold: S
- Segment pair
- Maximal segment pair (MSP)
- A segment pair whose score cannot be improved by extending or
shortening
- High scoring segment pair (HSP)
- Segment pair with score ≥ S
- Word or w-mers
- Goals
- Given Q and D, find all HSP's with statistical scores
- BLAST statistics (Karlin and Altschul, 1990)
- Given S, estimate expected number of HSP's under a random
sequence model.
- Used to
- optimize the performance of the heuristic.
- evaluate search results independent of scoring matrix.
- Approach
- Construct L=list of high scoring words derived from query
sequence. For proteins, high scoring words are w-mers with a
similarity score ≥ T when aligned with a subsequence of length
w in the query sequence using the PAM120 substitution matrices. (I
used BLOSUM62 in class.) For DNA, high scoring words are the m - w + 1
subsequences of length w.
- Compare each w-mer in the database sequence with the words in
L to find "hits". This can be done by hashing L or by using a
Mealy machine.
- Extend hits to obtain HSP's with scores ≥ S. Limit time spent on
this step by using a score cutoff. If score of extended alignment is lower
than the best score seen so far by the amount of the cutoff, give up
aligning in that direction.
Note that one could also define L to be the high scoring words in the
data base and compare each w-mer in the query sequence with
L. However, this would result in a much bigger hash table. This approach
also incurs a performance penalty because the database is accessed randomly
rather than scanned sequentially.
How to select S, w and T?
- Choose desired E (the number false positives tolerated). This gives a
value for S.
- Given S, select w and t to optimize the number of false negatives
and speed of search
How are S and E related?
- Let E be the expected number of false positives; i.e., E
is the expected number of HSPs with score >= S
given a "random" query sequence of length m and a
"random" data base sequence of length n. Here,
"random" means sampled with background frequencies.
- H0 - sequences are unrelated (background
frequencies). The probability of seeing i aligned with
j is pi pj.
- Ha - sequences are related (target frequencies). The
probability of seeing i aligned with j is qij.
- Approach to estimating E
- Model alignment as a random walk
G G A G ...
G A A C ...
- We can treat an alignment as random walk, where a match is a step in the
positive direction and a mismatch is a step in the negative direction.
- Each site corresponds to one step in the walk.
- Step size - S[i,j]
- Step probability - pi• pj
- HSP's in the alignment are "excursions" in the random walk.
- We define a "ladder point" L to be a new low in
this random walk.
- Let the random variable Yi describe the maximum point between the ladder
points Li and Li+1
- An excursion is the walk from the
Li to Yi.
- Excursions correspond to segment pairs (ungapped local arrangements).
- Apply known statistics about excursions of length at least
S to get statistics about HSPs of score at least
S:
- In order to apply random walk theory, we require:
-
at least one positive step (S[i,j] ≥ 0) and one negative
step (S[k,l] < 0)
- step size has negative mean
Σi,j S[i,j] pi pj < 0
- Then the expected number of HSP's under H0 is
E = Kmne-λS (1)
where λ is specified by the equation
1 = Σi,j pi•pj eλS[i,j]
and
K can be computed analytically for various substitution matrices
from the theory. (Simulations are not necessary.)
- Note that
- The PAM and BLOSUM matrices meet conditions (i) and (ii). They contain
positive and negative entries and Σi,j S[i,j]
pipj < 0,
- E increases with sequence lengths m and n and
decreases exponentially with S.
Selecting w and T
Steps 1 and 2 are relatively fast. Step 3 is slow. Therefore, we want to
select parameter values, w and T, to minimize the number of hits
that must be extended in Step 3. We select S based on the number false
positives that we can tolerate. Then, given S, select w and
T to optimize speed and sensitivity (number of false negatives). In
particular, we need to consider the following tradeoff:
- T Low
- Fewer false positives
- More hits to extend
- T High
- More false positives
- Fewer hits to extend
In this paper, Altschul and his colleagues used simulation studies to
estimate the probability that hits found with a given set of parameter values in
the data base would in fact be contained in local ungapped alignments with score
≥ S. In other words, they used a statistical approach to minimize the
probability of unnecessarily attempting to extend a hit. They determined in
empirically that a choice of w=4 and T=17 offered a good
compromise between maximizing this probability and an excessive running time. This is discussed
in detail in Altschul et al., 1990, on electronic reserve.
Last modified: November 8, 2011.
Maintained by Dannie Durand (durand@cs.cmu.edu) and Annette McLeod.