03-511/711, 15-495/856 Course Notes - November 9, 2006


Introduction to Database Searching

BLOSUM Matrices


See Ewens and Grant, Section 6.5.2, for a detailed discussion of how the BLOSUM matrices are computed.


Comparing PAM and BLOSUM Matrices

 
Evolutionary model
   PAM:       Explicit evolutionary model
   BLOSUM:    None
Data
   PAM:       Full-length MSAs of closely related sequences. Assumes no multiple substitutions.
   BLOSUM:    Conserved blocks in protein families. Residues are under selective pressure.
Bias correction
   PAM:       Trees
   BLOSUM:    Clustering
Evolutionary distance
   PAM:       Obtained from a Markov model of sequence evolution.
   BLOSUM:    Obtained from clustering of sequences.
Matrices
   PAM:       Transition and log odds scoring matrices
   BLOSUM:    Log odds scoring matrix only
Parameter n
   PAM:       Distance increases with n
   BLOSUM:    Distance decreases with n
Biophysical properties
   PAM:       Derived indirectly from data
   BLOSUM:    Derived indirectly from data


The PAM matrices were constructed from an explicit evolutionary model, and the BLOSUM matrices from conserved blocks in which amino acids are under selective constraint. Neither construction uses biochemical properties directly. Nevertheless, both families of matrices favor the replacement of amino acids that share biochemical properties. Inspection of the BLOSUM 62 matrix shows that alignments of residues in the same biochemical group tend to have positive log odds scores: these residues are more likely to be observed together in related sequences than by chance. Residues from different groups tend to have negative scores: these residues are less likely to be observed together in related sequences than in chance alignments. A score of zero means that the pair of residues is equally likely in related and chance alignments.
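The sign convention of a log odds score can be illustrated with a short Python sketch. The function name `log_odds` and all of the frequencies below are hypothetical, chosen only to show the three cases (positive, zero, negative). They are not taken from any PAM or BLOSUM data, and the scaling and rounding applied in published matrices are omitted.

```python
import math

def log_odds(q_ab, p_a, p_b):
    """Log odds score in bits.

    q_ab: frequency of the aligned pair (a, b) in related sequences
    p_a, p_b: background frequencies of a and b, so p_a * p_b is the
              frequency of the pair in chance alignments
    """
    return math.log2(q_ab / (p_a * p_b))

# Hypothetical background frequencies; the chance pair frequency is 0.005.
p_a, p_b = 0.05, 0.10

# Pair seen more often in related sequences than by chance: positive score.
more_than_chance = log_odds(0.02, p_a, p_b)

# Pair seen exactly as often as by chance: score of (approximately) zero.
same_as_chance = log_odds(0.005, p_a, p_b)

# Pair seen less often in related sequences than by chance: negative score.
less_than_chance = log_odds(0.00125, p_a, p_b)

print(more_than_chance, same_as_chance, less_than_chance)
```

With these made-up numbers the three scores are close to 2, 0, and -2 bits, up to floating point rounding.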


   Seq Identity (%)    PAM    BLOSUM
   20                  250    45
   30                  160    62
   40                  120    80
   50                  80     -
   60                  60     -

Outline


Database searching overview

Database searching is essentially a local alignment problem. In theory, a database could be searched for sequences similar to a query sequence using dynamic programming. However, the running time of dynamic programming is O(mn), where m is the length of the query sequence and n is the total length of the database. For large databases, this cost is prohibitive.
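To make the O(mn) cost concrete, here is a minimal Python sketch of local alignment scoring by dynamic programming. It assumes a simple match/mismatch scoring scheme and a linear gap penalty; these parameter values and the function name `local_alignment_score` are illustrative choices, not something specified in these notes. A real protein search would score pairs with a PAM or BLOSUM matrix instead.

```python
def local_alignment_score(query, target, match=1, mismatch=-1, gap=-2):
    """Best local alignment score of query against target.

    Fills an (m+1) x (n+1) dynamic programming table row by row,
    keeping only the previous row in memory. The nested loops visit
    m * n cells, which is the source of the O(mn) running time.
    """
    m, n = len(query), len(target)
    prev = [0] * (n + 1)  # row i-1 of the table
    best = 0
    for i in range(1, m + 1):
        curr = [0] * (n + 1)  # row i of the table
        for j in range(1, n + 1):
            s = match if query[i - 1] == target[j - 1] else mismatch
            curr[j] = max(
                0,                # start a new local alignment here
                prev[j - 1] + s,  # align query[i-1] with target[j-1]
                prev[j] + gap,    # gap in the target
                curr[j - 1] + gap # gap in the query
            )
            best = max(best, curr[j])
        prev = curr
    return best

# An exact occurrence of the query scores one match per residue.
print(local_alignment_score("ACGT", "TTACGTTT"))
```

Searching a database this way means running the inner loop once for every letter in the database, for every residue of the query.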

A "typical" amino acid query sequence is 250 to 300 residues long, but query sequences can be much longer:

BRCA2
   Genomic sequence:    86,101 bp
   mRNA:                10,257 bp
   Protein:             3,418 amino acids

Database size (11/8/2006)
                  Nucleic Acid Sequences    Amino Acid Sequences
   Letters:       18,386,426,667            1,417,848,065
   Sequences:     4,549,725                 4,111,659
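The figures above give a sense of scale for an exhaustive dynamic programming search. The following Python arithmetic counts the number of table cells, m times n, for a 300-residue query and for the BRCA2 protein against the amino acid database. The variable names are illustrative, and the count ignores constant factors such as the work done per cell.

```python
# Lengths taken from the figures above.
query_length = 300                # upper end of the "typical" protein query
brca2_length = 3_418              # BRCA2 protein, in amino acids
database_letters = 1_417_848_065  # amino acid letters in the database

# Dynamic programming fills one cell per (query residue, database letter) pair.
typical_cells = query_length * database_letters
brca2_cells = brca2_length * database_letters

print(f"typical query: {typical_cells:,} cells")  # 425,354,419,500
print(f"BRCA2 query:   {brca2_cells:,} cells")    # 4,846,204,686,170
```

That is roughly 4 x 10^11 cells for a typical query and nearly 5 x 10^12 cells for BRCA2, for a single search.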



Last modified: November 9, 2006.
Maintained by Dannie Durand (durand@cs.cmu.edu) and Annette McLeod.