03-511/711, 15-495/856 Course Notes - Sept 12, 2006
Pairwise alignment continued
Alignment algorithms
The dynamic programs for sequence alignment compute a matrix a[i,j],
which gives the scores of the optimal alignments of all prefixes. These algorithms have four components:
- Initialization of the first row and column of a[i,j].
- A recurrence relation for a[i,j], i,j ≥ 1.
- Determination of the score of the optimal alignment
from the matrix a[i,j] in O(mn) time, where m and n are the lengths of the two sequences.
- Trace back through the alignment matrix to obtain the
optimal alignment in O(m+n) time.
The details of each of these steps are what differentiate global,
semi-global and local alignment.
Global alignment with similarity scoring
- p(x,y): similarity of x and y
- p(x,"_"): gap cost
- Score of alignment = ∑ p(s'[i], t'[i]), for i = 1..l
- A simple similarity scoring function that treats all characters equally:
- p(i, i) = M (match)
- p(i, j) = m, for i ≠ j (mismatch)
- p(i, "_") = g (gap)
- We require that 2g < m < M. If we allow 2g ≥ m, a pair of gaps always scores
at least as well as a mismatch, so there will be no substitutions.
In this case, all matches are
accorded the same weight, as are all mismatches. Later in the semester we will
consider substitution matrices where the scores for matches and
mismatches vary for different characters i and j.
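As a small illustration of the scoring function above, the sketch below scores a given alignment column by column. The function names `p` and `alignment_score` and the default values M = 1, m = -1, g = -2 are assumptions chosen for the example (they satisfy 2g < m < M); they are not part of the notes.

```python
GAP = "_"

def p(x, y, M=1, m=-1, g=-2):
    """Simple similarity score: M for a match, m for a mismatch, g for a gap.

    The default values are illustrative assumptions that satisfy 2g < m < M.
    """
    if x == GAP or y == GAP:
        return g
    return M if x == y else m

def alignment_score(s_aln, t_aln, M=1, m=-1, g=-2):
    """Sum the column scores of two gapped strings s' and t' of equal length."""
    if len(s_aln) != len(t_aln):
        raise ValueError("aligned strings must have the same length")
    return sum(p(x, y, M, m, g) for x, y in zip(s_aln, t_aln))

# Three matches, one gap, one mismatch: 1 + 1 - 2 + 1 - 1
print(alignment_score("AC_GT", "ACTGA"))  # 0
```

With these values, aligning A against C as a substitution scores m = -1, while replacing that column by two gap columns ("A_" over "_C") scores 2g = -4, which is why the constraint 2g < m keeps substitutions in play.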
Under this simple scoring function, the dynamic programming algorithm for
global alignment has the following initialization and recursion steps:
- Initialization
- a[0,0] = 0
- a[i,0] = a[i-1,0] + g, for i ≥ 1
- a[0,j] = a[0,j-1] + g, for j ≥ 1
- Recurrence relation:
a[i,j] = max {
    a[i,j-1] + g,
    a[i-1,j-1] + p(s[i], t[j]),
    a[i-1,j] + g
}
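The initialization, recurrence and trace back for global alignment can be sketched as follows. This is a minimal illustration, not a reference implementation: the function name `global_align` and the default scores M = 1, m = -1, g = -2 are assumptions, and when several optimal alignments exist the trace back simply returns the first one it finds (preferring the diagonal move).

```python
def global_align(s, t, M=1, m=-1, g=-2):
    """Global alignment under the simple scoring function (illustrative defaults).

    Returns (score, s_aligned, t_aligned), with "_" marking gaps.
    """
    rows, cols = len(s) + 1, len(t) + 1
    a = [[0] * cols for _ in range(rows)]

    # Initialization: first column and first row accumulate gap costs.
    for i in range(1, rows):
        a[i][0] = a[i - 1][0] + g
    for j in range(1, cols):
        a[0][j] = a[0][j - 1] + g

    # Recurrence: s[i-1] and t[j-1] are the i-th and j-th characters.
    for i in range(1, rows):
        for j in range(1, cols):
            sub = M if s[i - 1] == t[j - 1] else m
            a[i][j] = max(a[i][j - 1] + g,
                          a[i - 1][j - 1] + sub,
                          a[i - 1][j] + g)

    # Trace back from the bottom-right cell to a[0][0].
    i, j = len(s), len(t)
    s_aln, t_aln = [], []
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = M if s[i - 1] == t[j - 1] else m
            if a[i][j] == a[i - 1][j - 1] + sub:
                s_aln.append(s[i - 1])
                t_aln.append(t[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and a[i][j] == a[i - 1][j] + g:
            s_aln.append(s[i - 1])
            t_aln.append("_")
            i -= 1
        else:
            s_aln.append("_")
            t_aln.append(t[j - 1])
            j -= 1
    return a[len(s)][len(t)], "".join(reversed(s_aln)), "".join(reversed(t_aln))

print(global_align("ACGT", "AGT"))  # (1, 'ACGT', 'A_GT')
```

Filling the matrix takes time proportional to the product of the sequence lengths; the trace back visits at most one cell per step along a path from the last cell to a[0][0].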
Semiglobal Alignment
Semiglobal alignment is global alignment with no end gap penalties. Some
applications include:
- Finding overlaps between fragments for sequence assembly.
- Aligning cDNA's or EST's with genomic DNA to identify gene structure.
The global dynamic programming algorithm can be modified for semi-global
alignment as follows:
- Initialization
- initialize the first row or the first column
of a[i,j] to zero, to avoid leading gap penalties.
- Recurrence relation
- The same as for global alignment.
- Score of the optimal alignment
- To avoid trailing gap penalties, the score of the optimal semiglobal alignment is
max over i of a[i,n] or max over j of a[m,j], that is, the largest value in the last column or the last row.
- Trace back
- To avoid trailing gap penalties, start the trace back at the
cell in the last row (or column) with the maximum score.
- To avoid leading gap penalties, end the trace back anywhere in
the first row (or column) to optimize the score.
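The modifications for semiglobal alignment can be sketched as below. This version makes one particular choice that the notes leave open: it zeroes both the first row and the first column and takes the maximum over both the last row and the last column, so end gaps are free in either sequence. The function name `semiglobal_score` and the default scores are assumptions for the example.

```python
def semiglobal_score(s, t, M=1, m=-1, g=-2):
    """Score of the best alignment of s and t with no end gap penalties.

    Assumption: end gaps are free in both sequences, so the first row and
    the first column are zero and the best cell may lie anywhere in the
    last row or the last column.
    """
    rows, cols = len(s) + 1, len(t) + 1
    # First row and first column stay at zero: no leading gap penalties.
    a = [[0] * cols for _ in range(rows)]

    # Same recurrence as for global alignment.
    for i in range(1, rows):
        for j in range(1, cols):
            sub = M if s[i - 1] == t[j - 1] else m
            a[i][j] = max(a[i][j - 1] + g,
                          a[i - 1][j - 1] + sub,
                          a[i - 1][j] + g)

    # No trailing gap penalties: best value in the last column or last row.
    best_last_col = max(a[i][cols - 1] for i in range(rows))
    best_last_row = max(a[rows - 1][j] for j in range(cols))
    return max(best_last_col, best_last_row)

# The suffix CGT of the first fragment overlaps the prefix CGT of the second.
print(semiglobal_score("AAACGT", "CGTTTT"))  # 3
```

This is the fragment-overlap case from the applications above: the unaligned ends (AAA and TTT) cost nothing, so the score reflects only the overlapping region.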
Local Alignment
- Initialize the first row and column to zero: a[i,0] = a[0,j] = 0 for all i and j
- Recurrence
a[i,j] = max {
    a[i-1,j] + g,
    a[i-1,j-1] + p(s[i], t[j]),
    a[i,j-1] + g,
    0
}
- The score of the optimal alignment is max{ a[i,j]}, where the
maximum is taken over all i and all j.
- Trace back starting at a*[i,j], the cell corresponding
to the maximum score. End the trace back when the score
reaches zero.
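The local alignment steps above can be sketched in the same style. The function name `local_align` and the default scores are assumptions for the example; when the maximum score occurs in more than one cell, this sketch keeps the first one encountered in row-major order.

```python
def local_align(s, t, M=1, m=-1, g=-2):
    """Local alignment under the simple scoring function (illustrative defaults).

    Returns (score, s_aligned, t_aligned) for one optimal local alignment.
    """
    rows, cols = len(s) + 1, len(t) + 1
    # First row and first column are initialized to zero.
    a = [[0] * cols for _ in range(rows)]
    best, best_i, best_j = 0, 0, 0

    for i in range(1, rows):
        for j in range(1, cols):
            sub = M if s[i - 1] == t[j - 1] else m
            a[i][j] = max(a[i - 1][j] + g,
                          a[i - 1][j - 1] + sub,
                          a[i][j - 1] + g,
                          0)
            # Track the maximum over all cells.
            if a[i][j] > best:
                best, best_i, best_j = a[i][j], i, j

    # Trace back from the maximum cell until the score reaches zero.
    i, j = best_i, best_j
    s_aln, t_aln = [], []
    while i > 0 and j > 0 and a[i][j] > 0:
        sub = M if s[i - 1] == t[j - 1] else m
        if a[i][j] == a[i - 1][j - 1] + sub:
            s_aln.append(s[i - 1])
            t_aln.append(t[j - 1])
            i, j = i - 1, j - 1
        elif a[i][j] == a[i - 1][j] + g:
            s_aln.append(s[i - 1])
            t_aln.append("_")
            i -= 1
        else:
            s_aln.append("_")
            t_aln.append(t[j - 1])
            j -= 1
    return best, "".join(reversed(s_aln)), "".join(reversed(t_aln))

print(local_align("GGGACGTCCC", "TTACGTAA"))  # (4, 'ACGT', 'ACGT')
```

If the two sequences share no similarity, every cell is zero and the function returns a score of 0 with empty aligned strings.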
Note that:
- There can be more than one optimal alignment
- Suboptimal alignments may be of interest
- M > m > 2g
- Global and semi-global alignments can use distance or
similarity functions.
- Local pairwise alignment requires that
- The scoring function be a similarity function.
- The similarity matrix, p[i,j], must contain at least one positive
value.
- The expected random alignment score must be
negative.
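The last requirement can be checked numerically for the simple scoring function. The sketch below computes the expected score of aligning two characters drawn independently from given frequencies; the function name `expected_score`, the uniform DNA frequencies and the score values are all assumptions for illustration.

```python
def expected_score(freqs, M=1, m=-1):
    """Expected score of one aligned pair of independently drawn characters.

    freqs maps each character to its frequency (assumed to sum to 1).
    """
    return sum(fx * fy * (M if x == y else m)
               for x, fx in freqs.items()
               for y, fy in freqs.items())

# Hypothetical uniform nucleotide frequencies.
uniform_dna = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}

print(expected_score(uniform_dna, M=1, m=-1))  # -0.5
print(expected_score(uniform_dna, M=5, m=-1))  # 0.5
```

Under these assumptions, M = 1 and m = -1 give a negative expected score, as local alignment requires, whereas M = 5 and m = -1 give a positive one, so random sequences would tend to produce long, high-scoring local alignments.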
Last modified: September 12, 2006.
Maintained by Dannie Durand (durand@cs.cmu.edu) and Annette McLeod.