03-511/711, 15-495/856 Course Notes

03-511/711, 15-495/856 Course Notes - Nov. 7th, 2006

Amino Acid Substitution Matrices

Scoring overview

Scoring in pairwise alignment

simplest: match score M, mismatch score m, gap score g
more general: p(x,y) - similarity of x and y
select p( ) to take into account
- biophysical properties of residues.
- evolutionary divergence. Substitutions that are surprising in closely related sequences might not be in distant ones.
- (possibly) multiple substitutions.

Jukes-Cantor

Markov model of point mutations in nucleic acid sequence

Instantaneous rate matrix
- P[i,j] = α for i ≠ j
- P[i,i] = 1 - 3α

Probability of change over time, t.
```
P(i,i) = 1/4 + 3/4 e^{-4αt}
```
```
P(i,j) = 1/4 (1 - e^{-4αt}),   i ≠ j
```
This is a scoring matrix parameterized by the evolutionary distance, αt. Note that when t = 0, P(i,i) = 1 and P(i,j) = 0. As t → ∞, P(i,i) = P(i,j) = 1/4.
However, this scoring matrix isn't very useful because we don't know αt. Instead, we count the number of mismatches and then correct the distance. Summing P(i,j) over the three residues j ≠ i gives P(i≠j) = 3/4 (1 - e^{-4αt}), where t is the time separating the two sequences. Solving for the expected number of substitutions per site, 3αt, gives
```
3αt = -3/4 ln(1 - 4/3 P(i≠j)),
```
where P(i≠j) is the expected fraction of mismatched positions and can be estimated by m/n, the number of observed mismatches divided by the alignment length.
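
As a minimal sketch of this correction, the following Python computes the Jukes-Cantor probabilities for a given αt and converts an observed mismatch count into a corrected distance. The function names are illustrative and not part of the course materials, and t is assumed to be the total time separating the two sequences.

```python
import math

def jc_probs(alpha_t):
    """Return (P(i,i), P(i,j)) for a given value of alpha * t."""
    decay = math.exp(-4.0 * alpha_t)
    return 0.25 + 0.75 * decay, 0.25 * (1.0 - decay)

def jc_distance(mismatches, length):
    """Corrected distance -3/4 ln(1 - 4/3 p), with p estimated as m/n."""
    p = mismatches / length
    if p >= 0.75:
        # At or beyond saturation the logarithm is undefined.
        raise ValueError("mismatch fraction must be below 3/4")
    return -0.75 * math.log(1.0 - (4.0 / 3.0) * p)

# Round trip: start from alpha*t = 0.1, predict the mismatch fraction,
# then recover the expected number of substitutions per site.
same, diff = jc_probs(0.1)
recovered = jc_distance(3 * diff * 1000, 1000)
```

With P(i,j) as written above, the round trip recovers 3αt = 0.3. For small m/n the corrected distance is close to m/n itself, and it grows without bound as m/n approaches 3/4.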

Jukes-Cantor model
- has an explicit evolutionary model
- takes evolutionary divergence into account
- does not take the biophysical differences between residues into account
- is determined from first principles

Kimura 2-parameter model

Takes some biophysical properties into account by giving transitions (purine ↔ purine, pyrimidine ↔ pyrimidine) and transversions (purine ↔ pyrimidine) separate rates, but does not capture all aspects.

Amino Acid Substitution Matrices

Overview

Goal: Amino acid similarity matrices that take into account

- biophysical properties of residues,
- evolutionary divergence, and
- multiple substitutions.

Markov models of sequence evolution require a transition probability for every ordered pair of states:

- Nucleic acids: 4 states, 16 transitions
- Amino acids: 20 states, 400 transitions

Use data to infer transition probabilities for amino acids.

Two commonly used families of amino acid substitution matrices:

- PAM - Dayhoff et al., 1978
- BLOSUM - Henikoff S., Henikoff JG., 1992

Each family is parameterized by evolutionary distance. Both use the following approach:

1. Start from "trusted" MSAs (ungapped).
2. Count substitutions, correcting for sample bias in choice of sequences.
3. Estimate substitution frequencies:
   - PAM - evolutionary model
   - BLOSUM - directly from data
4. Construct a log odds scoring matrix.

PAM matrices

Definition:
- PAM = "accepted point mutation" or "percent accepted mutation".
- PAM is a unit of evolutionary distance.
- We say two sequences are n PAMs apart if there are, on average, n actual changes (including multiple substitutions) between them per 100 residues.
- Our goal is to construct a family of matrices parameterized by PAM distance.
Approach:
- Construct a family of Markov chains with twenty states. If the chain is in state j at time t, we say that we see residue j at site i at time t. Note that this model assumes site independence.
- Derive the PAM 1 transition probability, P¹[j,k], from closely related alignments (no multiple substitutions).
- Extrapolate to obtain the PAM n transition probability, Pⁿ[j,k]. Pⁿ[j,k] is the probability that j will be replaced with k in sequences that are n PAM units apart.
Strategy:
1. "Trusted" ungapped MSAs of 71 groups of closely related sequences. Within each group, the sequence similarity is >= 85%.
2. Count replacements, correcting for sample bias in choice of sequences by averaging over all most parsimonious trees. For each tree, T, we calculate A^T_jk by counting the number of edges connecting j and k. A^T_jj equals twice the number of edges connecting j and j. We obtain overall counts by averaging over all trees, where |T| is the number of trees:
```
A_jk = (1/|T|) Σ_T A^T_jk
```
3. Using A_jk, we obtain the transition matrix P¹[j,k], the probability that amino acid j will be replaced by amino acid k in sequences separated by one PAM of evolutionary distance, as follows:
```
P¹[j,k] = m_j A_jk / Σ_{i ≠ j} A_ji      (k ≠ j)

P¹[j,j] = 1 - m_j
```
```
m_j = (1 / (n p_j z)) · (Σ_{i ≠ j} A_ji / Σ_h Σ_{i ≠ h} A_hi)
```
  where p_j is the frequency of j in the MSA and n is the length of the MSA.
  Select the normalization factor, z, so that
```
Σ_{j = 1 to 20} p_j m_j = 0.01

yielding

m_j = (0.01 / p_j) · (Σ_{i ≠ j} A_ji / Σ_h Σ_{i ≠ h} A_hi)
```
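
Steps 2 and 3 can be sketched in Python as follows. This is only an illustration under stated assumptions: the input format (each tree reduced to a list of residue pairs, one per edge), the function names, and the three-letter toy alphabet with its frequencies are all made up for the example and are not Dayhoff's data. An edge between j and k is counted in both A_jk and A_kj, which is one reading of the counting rule above.

```python
from collections import defaultdict

def average_counts(trees):
    """Average the edge counts A^T_jk over all trees.

    Each tree is a list of (residue, residue) pairs, one per edge.
    This input format is hypothetical, chosen to keep the sketch short.
    """
    total = defaultdict(float)
    for edges in trees:
        for a, b in edges:
            if a == b:
                total[(a, a)] += 2.0   # A^T_jj is twice the number of j-j edges
            else:
                total[(a, b)] += 1.0   # edges are undirected, so count both ways
                total[(b, a)] += 1.0
    return {pair: count / len(trees) for pair, count in total.items()}

def pam1(A, p):
    """Return (P1, m): PAM 1 transition probabilities and mutabilities."""
    residues = sorted(p)
    # Off-diagonal row sums of the counts, and their grand total.
    off = {j: sum(A.get((j, i), 0.0) for i in residues if i != j)
           for j in residues}
    grand_total = sum(off.values())
    m = {j: 0.01 * off[j] / (p[j] * grand_total) for j in residues}
    P1 = {}
    for j in residues:
        P1[j] = {k: m[j] * A.get((j, k), 0.0) / off[j]
                 for k in residues if k != j and off[j] > 0}
        P1[j][j] = 1.0 - m[j]
    return P1, m

# Toy three-letter alphabet with made-up trees and frequencies.
trees = [[("A", "G"), ("A", "A"), ("G", "C")],
         [("A", "G"), ("G", "G"), ("A", "C")]]
p = {"A": 0.5, "G": 0.3, "C": 0.2}
A = average_counts(trees)
P1, m = pam1(A, p)
```

In this toy example the weighted mutabilities sum to 0.01 and every row of P1 sums to 1, as the normalization requires. The sketch assumes at least one off-diagonal count is nonzero.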
Note: P¹[j,k] defines a Markov chain
- rows sum to 1
- history independent
- finite, aperiodic, irreducible => stationary distribution
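
The stationary distribution mentioned in the last point can be approximated numerically. The sketch below uses simple power iteration on a made-up 3-state chain. The matrix values and the function name are assumptions for illustration, not PAM data.

```python
def stationary(P, tol=1e-12, max_iter=100000):
    """Approximate pi satisfying pi = pi P by repeated multiplication."""
    n = len(P)
    pi = [1.0 / n] * n                     # start from the uniform distribution
    for _ in range(max_iter):
        nxt = [sum(pi[j] * P[j][k] for j in range(n)) for k in range(n)]
        if max(abs(a - b) for a, b in zip(nxt, pi)) < tol:
            return nxt
        pi = nxt
    raise RuntimeError("power iteration did not converge")

# Made-up chain: every row sums to 1 and all entries are positive,
# so the chain is irreducible and aperiodic.
P = [[0.90, 0.075, 0.025],
     [0.15, 0.80, 0.05],
     [0.25, 0.25, 0.50]]
pi = stationary(P)
```

Because the chain is finite, irreducible and aperiodic, the iteration converges to the same distribution from any starting vector.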

PAM2 matrix:

P²[j,k] = Σ_l P¹[j,l] P¹[l,k] = (P¹)²[j,k], the [j,k] entry of the matrix product P¹ P¹

PAMn matrix:

Pⁿ[j,k] = (P¹)ⁿ[j,k], the [j,k] entry of the n-th matrix power of P¹
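
The extrapolation is a matrix power: each step sums over every intermediate residue l, so it is not an entry-by-entry power. A minimal pure-Python sketch, using a made-up 3-state matrix in place of a real PAM 1 matrix (the function names are illustrative):

```python
def mat_mul(X, Y):
    """Matrix product of two square matrices given as lists of rows."""
    n = len(X)
    return [[sum(X[j][l] * Y[l][k] for l in range(n)) for k in range(n)]
            for j in range(n)]

def pam_n(P1, n):
    """Return the n-th matrix power of P1 for an integer n >= 1."""
    result = P1
    for _ in range(n - 1):
        result = mat_mul(result, P1)
    return result

# Made-up transition matrix; a real PAM 1 matrix is 20 x 20 and far
# closer to the identity.
P1 = [[0.90, 0.075, 0.025],
      [0.15, 0.80, 0.05],
      [0.25, 0.25, 0.50]]
P2 = pam_n(P1, 2)
P250 = pam_n(P1, 250)
```

Here P2[0][0] is 0.8275, whereas squaring the single entry P1[0][0] would give 0.81. For large n the rows of the power become nearly identical, since each row approaches the stationary distribution.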

Obtain log odds scoring matrix

Let q[j,k] be the probability that, at a given position, we see amino acid j aligned with amino acid k;
i.e., that we see amino acid j and it is replaced by amino acid k after n PAMs of mutational change, so that q[j,k] = p_j Pⁿ[j,k]. Then the PAM n scoring matrix is

S[j,k] = λ log ( q[j,k] / (p_j p_k) )

       = λ log ( Pⁿ[j,k] / p_k ),

where λ is a constant. Typically λ = 10 and the entries of S[] are rounded to the nearest integer.
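
A minimal sketch of the log odds step, assuming the logarithm is base 10 (the notes do not state the base) and using a made-up three-residue example rather than real PAM values. The function name is illustrative.

```python
import math

def log_odds(Pn, p, lam=10.0):
    """Return S[j][k] = round(lam * log10(Pn[j][k] / p[k])).

    Assumes every entry of Pn is positive; log10 of zero is undefined.
    """
    n = len(p)
    return [[round(lam * math.log10(Pn[j][k] / p[k])) for k in range(n)]
            for j in range(n)]

# Toy frequencies and transition probabilities (made-up numbers).
p = [0.5, 0.3, 0.2]
Pn = [[0.9925, 0.005, 0.0025],
      [1 / 120, 0.9875, 1 / 240],
      [0.00625, 0.00625, 0.9875]]
S = log_odds(Pn, p)
# S == [[3, -18, -19], [-18, 5, -17], [-19, -17, 7]]
```

Positive scores mark pairs seen more often than expected by chance, and negative scores mark pairs seen less often.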

Are PAM matrices symmetric?

- Transition matrix - no. Replacing amino acid i with amino acid j is not the same as replacing j with i.
- Scoring matrix - yes. In an alignment, we cannot determine direction of evolution.
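
The contrast can be checked numerically. The sketch below reuses a made-up three-residue example whose transition probabilities were built from symmetric counts. The numbers and function names are assumptions for illustration. The joint probabilities q[j,k] = p_j P[j,k] come out symmetric even though P does not, and since the denominator p_j p_k is also symmetric, so is the scoring matrix.

```python
def joint(P, p):
    """q[j][k] = p[j] * P[j][k]: probability of seeing j aligned with k."""
    n = len(p)
    return [[p[j] * P[j][k] for k in range(n)] for j in range(n)]

def is_symmetric(M, tol=1e-12):
    """True if M[j][k] and M[k][j] agree within tol for all j, k."""
    n = len(M)
    return all(abs(M[j][k] - M[k][j]) <= tol
               for j in range(n) for k in range(j + 1, n))

# Toy frequencies and transition probabilities (made-up numbers).
p = [0.5, 0.3, 0.2]
P = [[0.9925, 0.005, 0.0025],
     [1 / 120, 0.9875, 1 / 240],
     [0.00625, 0.00625, 0.9875]]
q = joint(P, p)
transition_symmetric = is_symmetric(P)   # False
joint_symmetric = is_symmetric(q)        # True
```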

Last modified: November 7, 2006.
Maintained by Dannie Durand (durand@cs.cmu.edu) and Annette McLeod.