See Ewens and Grant, Section 6.5.2, for a detailed discussion of how the BLOSUM matrices are computed.

| | PAM | BLOSUM |
|---|---|---|
| Evolutionary model | Explicit evolutionary model | None |
| Data | Full-length MSAs of closely related sequences. Assumes no multiple substitutions. | Conserved blocks in protein families. Residues are under selective pressure. |
| Bias correction | Trees | Clustering |
| Evolutionary distance | Obtained from Markov model of sequence evolution | Obtained from clustering of sequences |
| Matrices | Transition and log odds scoring matrices | Log odds scoring matrix only |
| Parameter n | Distance increases with n | Distance decreases with n |
| Biophysical properties | Derived indirectly from data | Derived indirectly from data |

The PAM matrices were constructed from an explicit evolutionary model, whereas the BLOSUM matrices were constructed from conserved blocks in which amino acids are under selective constraint. Nevertheless, both families of matrices favor the replacement of amino acids that share biochemical properties. Inspection of the BLOSUM62 matrix shows that alignments of residues in the same biochemical group tend to have positive log odds scores: these residues are more likely to be observed together in related sequences than by chance. Residues from different groups tend to have negative scores: these residues are less likely to be observed together in related sequences than in chance alignments. A score of zero means that the pair of residues is equally likely in related and chance alignments.
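The sign convention described above can be illustrated with a minimal sketch. It assumes the score is a scaled base-2 log odds ratio rounded to an integer (BLOSUM62 is conventionally reported in half-bit units, hence the default scale of 2). The function name `log_odds_score` and the frequencies in the usage note are hypothetical and are not taken from any real matrix.

```python
import math


def log_odds_score(q_ab, p_a, p_b, scale=2.0):
    """Return a rounded, scaled log2 odds score for aligning residues a and b.

    q_ab  -- probability of seeing a aligned with b in related sequences
             (ordered-pair frequency; an assumption of this sketch)
    p_a, p_b -- background frequencies of a and b
    scale -- 2.0 gives half-bit units
    """
    odds = q_ab / (p_a * p_b)          # related vs. chance alignment
    return round(scale * math.log2(odds))


# Hypothetical frequencies, chosen only to show the three cases.
more_than_chance = log_odds_score(0.02, 0.1, 0.1)    # odds of about 2
same_as_chance = log_odds_score(0.01, 0.1, 0.1)      # odds of about 1
less_than_chance = log_odds_score(0.005, 0.1, 0.1)   # odds of about 0.5
```

With these made-up inputs, a pair seen twice as often as chance gets a positive score, a pair seen as often as chance scores zero, and a pair seen half as often as chance gets a negative score.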
Database searching is essentially a local alignment problem. In theory, a database could be searched for sequences similar to a query sequence using dynamic programming. However, the running time of dynamic programming is O(mn), where m is the length of the query sequence and n is the total length of the database. For large databases, the cost of this approach is prohibitive.
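To make the O(mn) cost concrete, here is a minimal sketch of local alignment scoring by dynamic programming in the Smith-Waterman style. It assumes a simple match/mismatch score and a linear gap penalty rather than a PAM or BLOSUM matrix; the function name and the default parameter values are hypothetical. The two nested loops visit every (query position, target position) pair once, which is where the mn term comes from.

```python
def local_alignment_score(query, target, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between query and target.

    Time is O(m * n) for m = len(query), n = len(target);
    memory is O(n) because only two rows of the table are kept.
    """
    prev = [0] * (len(target) + 1)
    best = 0
    for i in range(1, len(query) + 1):
        curr = [0] * (len(target) + 1)
        for j in range(1, len(target) + 1):
            pair = match if query[i - 1] == target[j - 1] else mismatch
            curr[j] = max(
                0,                   # start a new local alignment here
                prev[j - 1] + pair,  # align query[i-1] with target[j-1]
                prev[j] + gap,       # gap in target
                curr[j - 1] + gap,   # gap in query
            )
            best = max(best, curr[j])
        prev = curr
    return best
```

Searching a database this way means running the inner loop over every letter of every database sequence for each query residue.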
The "typical" amino acid query sequence is 250-300 residues long, but query sequences can be much longer:

| BRCA2 | |
|---|---|
| Genomic sequence: | 86,101 bp |
| mRNA | 10,257 bp |
| Protein | 3,418 amino acids |

| Database (11/8/2006) | Nucleic Acid Sequences | Amino Acid Sequences |
|---|---|---|
| Letters | 18,386,426,667 | 1,417,848,065 |
| Sequences | 4,549,725 | 4,111,659 |
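
The figures above can be combined into a rough count of dynamic programming table cells for a full search of the amino acid database. This is a back-of-the-envelope sketch: the lengths come from the tables above, while the throughput `CELLS_PER_SECOND` is a purely hypothetical assumption, not a measured value.

```python
# Sizes taken from the tables above.
BRCA2_PROTEIN_LENGTH = 3_418            # amino acids
TYPICAL_QUERY_LENGTH = 300              # upper end of the "typical" range
PROTEIN_DB_LETTERS = 1_417_848_065      # amino acid database, 11/8/2006

# Hypothetical throughput, assumed only for illustration.
CELLS_PER_SECOND = 1e9


def dp_cells(query_length, database_letters):
    """Number of table cells filled by O(mn) dynamic programming."""
    return query_length * database_letters


def estimated_seconds(cells, cells_per_second=CELLS_PER_SECOND):
    """Wall-clock estimate under the assumed throughput."""
    return cells / cells_per_second


brca2_cells = dp_cells(BRCA2_PROTEIN_LENGTH, PROTEIN_DB_LETTERS)
typical_cells = dp_cells(TYPICAL_QUERY_LENGTH, PROTEIN_DB_LETTERS)
print(f"BRCA2 query:   {brca2_cells:,} cells")
print(f"Typical query: {typical_cells:,} cells")
```

A single BRCA2 protein query against this database fills roughly 4.8 × 10^12 cells, and the count grows linearly with the size of the database.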