Given s[1..m] and t[1..n], α(s',t') is an alignment if
s', t' in (∑')*, where ∑' is the alphabet ∑ extended with the gap character "_"
|s'| = |t'| = l ≥ max{m,n}
s is the sequence obtained by removing every "_" from s' (ditto for t and t')
There is no value of i for which s'[i] = t'[i] = "_".
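The conditions above can be checked mechanically. The following is a minimal sketch, assuming sequences are Python strings and that "_" never occurs in the alphabet itself; the function name is_alignment is hypothetical. The length condition l ≥ max{m,n} is not tested separately because it follows once removing gaps recovers s and t.

```python
GAP = "_"  # gap character; assumed not to be part of the sequence alphabet


def is_alignment(s, t, s_aligned, t_aligned):
    """Return True if (s_aligned, t_aligned) is a valid alignment of s and t."""
    # Both rows must have the same length l.
    if len(s_aligned) != len(t_aligned):
        return False
    # Removing every gap must recover the original sequences.
    if s_aligned.replace(GAP, "") != s or t_aligned.replace(GAP, "") != t:
        return False
    # No column may hold a gap in both rows.
    return all(not (a == GAP and b == GAP) for a, b in zip(s_aligned, t_aligned))
```

For example, is_alignment("ACGT", "AGT", "ACGT", "A_GT") holds, while the pair ("ACG_T", "A_G_T") is rejected because its fourth column is a gap in both rows.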
Goal: Find the optimal alignment w.r.t. a given scoring scheme
Distance Based Scoring
D[s,t] = ∑ d(s'[i], t'[i]), i = 1..l
d(x,x) = 0
d(x,y) > 0 for x ≠ y
d("_","_") = 0 or undefined
d(x,z) ≤ d(x,y) + d(y,z) (triangle inequality)
NOTE:
If d(x,y) = 1 for x ≠ y and d(x,"_") = 1, then
D[s,t] for an optimal alignment is the minimum number of operations
required to transform s into t, where the operations are substitution,
insertion and deletion. This is called the "edit distance".
If d(x,y) > 1 and d(x,"_") > 1, then it is called the "weighted edit distance".
D[s,t] is a metric
D[s,t] is an additive scoring function. We assume positional
independence.
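Because the score is additive and each position is scored independently, the distance of a given alignment is just a column-by-column sum. A minimal sketch, assuming string inputs and unit costs (the edit distance case); the names unit_cost and alignment_distance are hypothetical, and any other cost function d can be passed in.

```python
GAP = "_"  # gap character; assumed not to be part of the sequence alphabet


def unit_cost(x, y):
    """Unit-cost d: 0 for identical symbols, 1 for a substitution or a gap."""
    return 0 if x == y else 1


def alignment_distance(s_aligned, t_aligned, d=unit_cost):
    """Sum d over the columns of an alignment (additive, position-independent)."""
    if len(s_aligned) != len(t_aligned):
        raise ValueError("aligned sequences must have equal length")
    return sum(d(a, b) for a, b in zip(s_aligned, t_aligned))
```

With unit costs, the alignment ACGT over A_GT has distance 1 (a single gap column), and AC_GT over A_TGA has distance 3.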
Dynamic Programming Algorithm for Global Alignment
Initialization
D[0,0] = 0
D[i,0] = D[i-1,0] + d(s[i],"_"), i = 1..m
D[0,j] = D[0,j-1] + d("_",t[j]), j = 1..n
Recurrence
D[i,j] = min {
D[i-1,j] + d(s[i], "_")
D[i-1,j-1] + d(s[i], t[j])
D[i,j-1] + d("_", t[j])
}
Termination condition
Compute score of all pairs of prefixes in O(m • n) time.
D[m,n] gives the score of the optimal alignment.
Trace back through the alignment matrix in O(m+n) time to obtain
the optimal alignment.
There may be more than one optimal alignment
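The initialization, recurrence and traceback can be put together as follows. This is a sketch under stated assumptions rather than a reference implementation: sequences are Python strings (0-indexed, so s[i-1] plays the role of s[i]), costs are exact numbers such as integers so that the equality tests in the traceback are reliable, and ties are broken in favor of the diagonal move, which returns only one of possibly several optimal alignments. The names unit_cost and global_alignment are hypothetical.

```python
GAP = "_"  # gap character; assumed not to be part of the sequence alphabet


def unit_cost(x, y):
    """Unit-cost d: 0 for identical symbols, 1 for a substitution or a gap."""
    return 0 if x == y else 1


def global_alignment(s, t, d=unit_cost):
    """Return (distance, s_aligned, t_aligned) for one optimal global alignment."""
    m, n = len(s), len(t)
    # D[i][j] is the optimal score for the prefixes s[1..i] and t[1..j].
    D = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):  # first column: s[1..i] against only gaps
        D[i][0] = D[i - 1][0] + d(s[i - 1], GAP)
    for j in range(1, n + 1):  # first row: t[1..j] against only gaps
        D[0][j] = D[0][j - 1] + d(GAP, t[j - 1])
    for i in range(1, m + 1):  # fill the table in O(m * n) time
        for j in range(1, n + 1):
            D[i][j] = min(
                D[i - 1][j] + d(s[i - 1], GAP),
                D[i - 1][j - 1] + d(s[i - 1], t[j - 1]),
                D[i][j - 1] + d(GAP, t[j - 1]),
            )
    # Trace back from D[m][n] to D[0][0] in O(m + n) steps.
    i, j = m, n
    s_cols, t_cols = [], []
    while i > 0 or j > 0:
        if i > 0 and j > 0 and D[i][j] == D[i - 1][j - 1] + d(s[i - 1], t[j - 1]):
            s_cols.append(s[i - 1]); t_cols.append(t[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and D[i][j] == D[i - 1][j] + d(s[i - 1], GAP):
            s_cols.append(s[i - 1]); t_cols.append(GAP)
            i -= 1
        else:  # only the move from D[i][j-1] remains
            s_cols.append(GAP); t_cols.append(t[j - 1])
            j -= 1
    return D[m][n], "".join(reversed(s_cols)), "".join(reversed(t_cols))
```

With unit costs, global_alignment("kitten", "sitting") reports a distance of 3. Changing the order of the tests in the traceback loop yields other optimal alignments with the same score.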
Similarity Scoring
p(x,y): similarity of x and y
p(x,"_"): gap cost
Score of alignment = ∑ p(s'[i], t'[i]), i = 1..l
A simple similarity score that treats all characters equally:
p(x, x) = M (match)
p(x, y) = m for x ≠ y (mismatch; here m is a score, not the length of s)
p(x, "_") = g (gap)
In this case, the recurrence relation is:
a[i,j] = max {
a[i,j-1] + g
a[i-1,j-1] + p(s[i], t[j])
a[i-1,j] + g
}
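Require 2g < m < M. If we allow 2g ≥ m then there will be no
substitutions: a deletion followed by an insertion (score 2g) would
score at least as well as a mismatch (score m).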
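The similarity recurrence can be sketched in the same way as the distance version, with max in place of min. The notes do not state the boundary conditions for a[i,j]; the sketch below assumes a[i,0] = i·g and a[0,j] = j·g by analogy with the distance initialization. The function name similarity_score and the default values M = 1, m = -1, g = -2 are illustrative assumptions chosen to satisfy 2g < m < M.

```python
def similarity_score(s, t, match=1, mismatch=-1, gap=-2):
    """Optimal global similarity score with match M, mismatch m and gap g."""
    # Enforce 2g < m < M so that a mismatch beats a deletion plus an insertion.
    if not (2 * gap < mismatch < match):
        raise ValueError("scores must satisfy 2g < m < M")
    rows, cols = len(s), len(t)
    a = [[0] * (cols + 1) for _ in range(rows + 1)]
    # Assumed boundary conditions: a prefix aligned entirely against gaps.
    for i in range(1, rows + 1):
        a[i][0] = i * gap
    for j in range(1, cols + 1):
        a[0][j] = j * gap
    for i in range(1, rows + 1):
        for j in range(1, cols + 1):
            p = match if s[i - 1] == t[j - 1] else mismatch
            a[i][j] = max(
                a[i][j - 1] + gap,    # gap in s
                a[i - 1][j - 1] + p,  # match or mismatch
                a[i - 1][j] + gap,    # gap in t
            )
    return a[rows][cols]
```

With these defaults, similarity_score("ACGT", "AGT") is 1 (three matches and one gap), and similarity_score("A", "C") is -1, since the mismatch is preferred over two gaps scoring -4.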
Last modified: September 5th, 2006.
Maintained by Dannie Durand (durand@cs.cmu.edu) and Annette McLeod.