03-511/711, 15-495/856 Course Notes - Oct. 28, 2010
Gibbs Sampler
A method for discovering ungapped local alignments (Lawrence, et al. '93). See Ewens &
Grant, section 6.6 for a straightforward introduction to their algorithm.
- Input: k sequences, t1...tk, of lengths n1, n2...nk
- Output: k subsequences of length w, starting at positions i1...ik, that are "most similar" to each other
- Search space: all possible alignments or, equivalently, all sets of starting positions
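To get a feel for the size of this search space, the number of candidate alignments can be computed directly, since each alignment is just one starting position per sequence. A minimal sketch (the sequence lengths below are made up for illustration):

```python
from math import prod

def n_alignments(lengths, w):
    # One starting position per sequence; sequence i of length n_i
    # has (n_i - w + 1) possible starting positions for a window of width w.
    return prod(n - w + 1 for n in lengths)

n_alignments([100, 120, 90], w=10)  # 91 * 111 * 81 = 818181
```

Even for a handful of modest-length sequences, this product grows far too quickly for exhaustive search, which motivates sampling.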
Intuition:
- Input: Grandma's illegible cake recipe
- Initialization:
- Guess quantities of butter, sugar...
Repeat {
- select ingredient i at random.
- Vary quantity of i. Bake. Taste. Change quantity
of i in recipe.
} until (cake doesn't change or time for party).
Mathematical Framework:
We seek a property of the joint probability distribution of a set
of random variables: f(i1, i2, ... ik). For example, these
random variables might be the quantities of the ingredients in
the cake recipe, or the starting points of a conserved pattern in
a set of k sequences. In many cases, calculating the joint
probability is hard, but calculating the conditional probability,
f(ij | i1, ..., ij−1, ij+1, ..., ik), is relatively easy. The Gibbs
method simulates the joint distribution by generating an instance
of each random variable in turn, conditional on the current values of
the remaining k-1 random variables.
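This framework can be illustrated on a toy joint distribution over two binary variables, where each conditional is easy to sample. Everything in this sketch (the weights, the function names) is invented for illustration and is not from the lecture:

```python
import random

# Unnormalized joint weights over two binary variables (x, y).
# The diagonal states (0,0) and (1,1) are four times more likely.
weights = {(0, 0): 4.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 4.0}

def sample_conditional(fixed_value, which):
    """Sample one variable conditioned on the current value of the other.

    which = 0: sample x given y = fixed_value
    which = 1: sample y given x = fixed_value
    """
    if which == 0:
        w0, w1 = weights[(0, fixed_value)], weights[(1, fixed_value)]
    else:
        w0, w1 = weights[(fixed_value, 0)], weights[(fixed_value, 1)]
    return 1 if random.random() < w1 / (w0 + w1) else 0

def gibbs(n_iter=10000, seed=0):
    random.seed(seed)
    x, y = 0, 0
    counts = {s: 0 for s in weights}
    for _ in range(n_iter):
        x = sample_conditional(y, which=0)  # draw x | y
        y = sample_conditional(x, which=1)  # draw y | x
        counts[(x, y)] += 1
    return counts
```

After many iterations, the visit counts approximate the joint distribution, even though only the conditionals were ever sampled.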
Algorithm:
Let t* = t1.
For j = 2 to k, let ij be a random number between 1 and nj − w + 1.
Let A be a (k−1) x w matrix containing the local MSA of the k−1
subsequences of length w starting at t2[i2], t3[i3], ..., tk[ik].
Repeat
{
- Let S[|Σ|, w] be the PSSM of the local alignment A[ ].
- For each position i in t*, let

      P[i] = score(t*, i, w) / Σj score(t*, j, w)        (1)

  where the sum in the denominator is over all positions j = 1 to
  (n* − w + 1), n* is the length of t*, and
  score(t*, j, w) = ΣL S[t*[j+L], L], where 0 ≤ L < w.
- Select a starting position i* in t* with probability
P(i*)
- Select a row, r, in A[ ] at random and replace it
with the subsequence of length w starting at t*[i*].
- Let t* be the sequence whose subsequence formerly occupied row r.
} until (S stops changing)
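The loop above can be sketched in Python. This sketch withholds a uniformly random sequence on each round, a common variant that is equivalent in spirit to the row-replacement rotation in the pseudocode; it also scores windows by a product of pseudocount-smoothed column frequencies rather than a log-odds PSSM, which is a simplification. All function names are my own:

```python
import random

ALPHABET = "ACGT"

def build_pssm(rows, w, pseudocount=1.0):
    """Column-wise frequency matrix from the (k-1) x w alignment rows,
    with pseudocounts so every character has nonzero probability."""
    pssm = [{c: pseudocount for c in ALPHABET} for _ in range(w)]
    for row in rows:
        for L, c in enumerate(row):
            pssm[L][c] += 1.0
    for col in pssm:
        total = sum(col.values())
        for c in col:
            col[c] /= total
    return pssm

def window_score(seq, j, w, pssm):
    """score(t*, j, w): product of column frequencies over the window."""
    s = 1.0
    for L in range(w):
        s *= pssm[L][seq[j + L]]
    return s

def gibbs_motif(seqs, w, n_iter=200, seed=0):
    random.seed(seed)
    k = len(seqs)
    # Initialization: a random starting position in each sequence.
    starts = [random.randrange(len(t) - w + 1) for t in seqs]
    for _ in range(n_iter):
        star = random.randrange(k)  # sequence withheld this round
        rows = [seqs[j][starts[j]:starts[j] + w] for j in range(k) if j != star]
        pssm = build_pssm(rows, w)
        t = seqs[star]
        scores = [window_score(t, j, w, pssm) for j in range(len(t) - w + 1)]
        total = sum(scores)
        # Equation (1): sample a new start with probability score/total.
        r = random.random() * total
        acc = 0.0
        for j, s in enumerate(scores):
            acc += s
            if r <= acc:
                starts[star] = j
                break
        else:
            starts[star] = len(scores) - 1  # guard against float round-off
    return starts
```

Note that the stopping criterion here is a fixed iteration count; in practice one would monitor S (or the alignment score) for convergence, as in the pseudocode.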
- Comments
- We assume that each sequence has exactly one copy of
the pattern. Problems could occur if there is more than one
copy, no copies, or we did not guess the length of the pattern
correctly. Also, we might find a biologically meaningful
pattern, but not the one we hoped to find. For example, we might
find direct repeats from a transposable element insertion when
we were looking for a transcription factor binding site.
- Pseudocounts guarantee that every character is represented at every
position.
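A minimal sketch of how pseudocounts keep every character represented in a PSSM column (the column and alphabet here are illustrative):

```python
from collections import Counter

def column_frequencies(column, alphabet="ACGT", pseudocount=1.0):
    # Add pseudocount to each character's count so that characters
    # absent from the column still get nonzero probability.
    counts = Counter(column)
    total = len(column) + pseudocount * len(alphabet)
    return {c: (counts[c] + pseudocount) / total for c in alphabet}

freqs = column_frequencies("AAAC")  # no G or T observed in this column
# G and T nevertheless get probability 1/8 each; A gets 4/8, C gets 2/8
```

Without pseudocounts, a single zero-frequency character would zero out the score of every window containing it, making that window unsamplable.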
- Convergence:
- The Gibbs method samples a Markov chain with N states, where
N = Πi (ni − w + 1).
The states correspond to all possible sets of starting positions.
- The transition probability is determined by equation (1).
- We require randomness to avoid being trapped in local optima.
- This Markov chain is finite, irreducible (the state space is
connected), and aperiodic. Therefore, it has a stationary distribution.
- Given enough time, the algorithm will converge to the
alignment that maximizes this distribution. (For details, see optional readings.)
- In practice, the algorithm can get stuck in local optima.
This problem can be ameliorated by executing several runs
with different random starts.
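The multiple-restart strategy can be sketched generically: run a stochastic local search several times from random starts and keep the best result. The toy objective below (two peaks, one local and one global) stands in for an alignment score and is entirely invented:

```python
import random

def objective(x):
    # Toy score: a local optimum at x = 2 (value 0) and the
    # global optimum at x = 8 (value 10).
    return -((x - 2) ** 2) * 0.5 if x < 5 else 10 - (x - 8) ** 2

def hill_climb(start, step=0.1, n_steps=200):
    # Greedy local search: gets stuck at whichever peak is nearest.
    x = start
    for _ in range(n_steps):
        x = max([x - step, x, x + step], key=objective)
    return x, objective(x)

def best_of_restarts(n_restarts=10, seed=0):
    random.seed(seed)
    results = [hill_climb(random.uniform(0, 10)) for _ in range(n_restarts)]
    return max(results, key=lambda r: r[1])
```

Any single run may converge to the inferior peak, but with several random starts the best run almost always finds the global optimum, mirroring the advice above.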
- Black magic
- How to choose pseudocounts?
- How to choose the window length, w?
- When to quit?
- How to choose the initial alignment?
- More about theory, convergence proofs, black magic
- Ewens & Grant 10.5.2, Lawrence, et al. 1993, Casella & George, 1992.
Last modified: October 30, 2010.
Maintained by Dannie Durand (durand@cs.cmu.edu) and Annette McLeod.