03-511/711, 15-495/856 Course Notes - Sept 1, 2009
Pairwise alignment continued
Alignment algorithms
The dynamic programs for sequence alignment compute a matrix a[i,j], where a[i,j] is the score of the optimal alignment of the first i characters of s with the first j characters of t.
These algorithms have four components:
- Initialization of the first row and column of a[i,j].
- A recurrence relation for a[i,j], i,j ≥ 1.
- Determination of the score of the optimal alignment
from the matrix a[i,j]; filling in the matrix takes O(mn) time.
- Trace back through the alignment matrix to obtain the
optimal alignment in O(m+n) time.
The details of each of these steps are what differentiate global,
semiglobal and local alignment.
Global alignment with similarity scoring
- p(x,y): similarity of x and y
- p(x,"_"): gap cost
- Score of alignment = ∑ p(s'[i], t'[i]), i = 1..l, where s' and t' are s and t with gaps inserted and l is the length of the alignment
- A simple similarity scoring function that treats all characters equally:
- p(i, i) = M
- p(i, j) = m
- p(i, "_") = g
- We require that 2g ≤ m < M. If 2g > m, a pair of gaps would always score better than a mismatch,
so the optimal alignment would contain no substitutions.
In this case, all matches are
accorded the same weight, as are all mismatches. Later in the semester we will
consider substitution matrices where the scores for matches and
mismatches vary for different characters i and j.
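The scoring scheme above can be sketched in a few lines of Python. This is only an illustration: the function names and the values M = 1, m = -1, g = -2 are assumptions chosen to satisfy 2g ≤ m < M, and are not taken from the course.

```python
GAP = "_"

def p(x, y, M=1, m=-1, g=-2):
    """Simple similarity score: M for a match, m for a mismatch, g for a gap.
    The default values are illustrative assumptions with 2g <= m < M."""
    if x == GAP or y == GAP:
        return g
    return M if x == y else m

def alignment_score(s_aligned, t_aligned, M=1, m=-1, g=-2):
    """Sum p over the columns of an alignment; both rows must have length l."""
    if len(s_aligned) != len(t_aligned):
        raise ValueError("aligned sequences must have the same length")
    return sum(p(x, y, M, m, g) for x, y in zip(s_aligned, t_aligned))

# Three matches, one gap and one mismatch: 1 + 1 - 2 + 1 - 1
print(alignment_score("AC_GT", "ACTGA"))
# One substitution (m) versus the same two characters opposite gaps (2g)
print(alignment_score("A", "C"), alignment_score("A_", "_C"))
```

With these values a substitution scores -1 while the two-gap alternative scores -4, so the optimal alignment prefers the substitution, as the constraint 2g ≤ m intends.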
Under this simple scoring function, the dynamic programming algorithm for
global alignment has the following initialization and recursion steps:
- Initialization
- a[0,0] = 0
- a[i,0] = a[i-1,0] + g, i = 1..m
- a[0,j] = a[0,j-1] + g, j = 1..n
- Recurrence relation:
a[i,j] = max {
a[i,j-1] + g,
a[i-1,j-1] + p(s[i],t[j]),
a[i-1,j] + g
}
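The initialization, recurrence and trace back for global alignment can be put together as follows. This is a minimal sketch under stated assumptions: the function name global_align, the default scores M = 1, m = -1, g = -2, and the rule for breaking ties in the trace back (diagonal first) are illustrative choices, not part of the notes. Python strings are 0-indexed, so s[i-1] plays the role of s[i].

```python
def global_align(s, t, M=1, m=-1, g=-2):
    """Return (score, s_aligned, t_aligned) for one optimal global alignment."""
    rows, cols = len(s), len(t)
    # a[i][j] = score of the best alignment of s[:i] with t[:j]
    a = [[0] * (cols + 1) for _ in range(rows + 1)]
    for i in range(1, rows + 1):          # first column: i leading gaps
        a[i][0] = a[i - 1][0] + g
    for j in range(1, cols + 1):          # first row: j leading gaps
        a[0][j] = a[0][j - 1] + g
    for i in range(1, rows + 1):
        for j in range(1, cols + 1):
            sub = M if s[i - 1] == t[j - 1] else m
            a[i][j] = max(a[i][j - 1] + g,
                          a[i - 1][j - 1] + sub,
                          a[i - 1][j] + g)

    # Trace back from a[rows][cols] to a[0][0]; at most rows + cols steps.
    i, j = rows, cols
    s_al, t_al = [], []
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = M if s[i - 1] == t[j - 1] else m
            if a[i][j] == a[i - 1][j - 1] + sub:
                s_al.append(s[i - 1])
                t_al.append(t[j - 1])
                i -= 1
                j -= 1
                continue
        if i > 0 and a[i][j] == a[i - 1][j] + g:
            s_al.append(s[i - 1])         # character of s opposite a gap
            t_al.append("_")
            i -= 1
        else:
            s_al.append("_")              # character of t opposite a gap
            t_al.append(t[j - 1])
            j -= 1
    return a[rows][cols], "".join(reversed(s_al)), "".join(reversed(t_al))

print(global_align("ACGT", "ACT"))
```

For s = ACGT and t = ACT the sketch returns a score of 1 with the alignment ACGT over AC_T: three matches and one gap.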
Semiglobal Alignment
Semiglobal alignment is global alignment with no end gap penalties. Some
applications include:
- Finding overlaps between fragments for sequence assembly.
- Aligning cDNAs or ESTs with genomic DNA to identify gene structure.
The global dynamic programming algorithm can be modified for semiglobal
alignment as follows:
- Initialization
- Initialize the first row and the first column
of a[i,j] to zero, to avoid leading gap penalties.
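- Recurrence relation
- Unchanged from global alignment.
- Score of the optimal alignment
- To avoid trailing gap penalties, the score of the optimal semiglobal alignment is the larger of
max over i of a[i,n] and max over j of a[m,j], i.e. the maximum over the last column and the last row.
- Trace back
- To avoid trailing gap penalties, start the trace back at the
cell(s) in the last row (or column) with the maximum score.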
Note that because the first row and column of the matrix are
initialized to zero, the traceback will end in the first row (or
column) but not necessarily in the cell a[0,0].
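The semiglobal modifications can be sketched in the same style as a global alignment routine. The code below is an illustration under assumptions: the function name semiglobal_align, the default scores M = 1, m = -1, g = -2, the tie-breaking rules, and the choice to return the start and end cells along with the aligned region are all hypothetical. Both the first row and the first column are set to zero, and the best cell is searched for in both the last row and the last column.

```python
def semiglobal_align(s, t, M=1, m=-1, g=-2):
    """Return (score, start_cell, end_cell, s_aligned, t_aligned).
    End gaps are free: row 0 and column 0 are zero, and the score is the
    maximum over the last row and the last column."""
    rows, cols = len(s), len(t)
    a = [[0] * (cols + 1) for _ in range(rows + 1)]   # borders stay at zero
    for i in range(1, rows + 1):
        for j in range(1, cols + 1):
            sub = M if s[i - 1] == t[j - 1] else m
            a[i][j] = max(a[i][j - 1] + g,
                          a[i - 1][j - 1] + sub,
                          a[i - 1][j] + g)

    # Best cell in the last column a[i][cols] or the last row a[rows][j].
    candidates = [(a[i][cols], i, cols) for i in range(rows + 1)]
    candidates += [(a[rows][j], rows, j) for j in range(cols + 1)]
    score, i, j = max(candidates)     # ties resolved toward larger i, then j
    end = (i, j)

    # Trace back until the first row or the first column is reached.
    s_al, t_al = [], []
    while i > 0 and j > 0:
        sub = M if s[i - 1] == t[j - 1] else m
        if a[i][j] == a[i - 1][j - 1] + sub:
            s_al.append(s[i - 1])
            t_al.append(t[j - 1])
            i -= 1
            j -= 1
        elif a[i][j] == a[i - 1][j] + g:
            s_al.append(s[i - 1])
            t_al.append("_")
            i -= 1
        else:
            s_al.append("_")
            t_al.append(t[j - 1])
            j -= 1
    start = (i, j)
    return score, start, end, "".join(reversed(s_al)), "".join(reversed(t_al))

# Overlap detection: the suffix CGT of s matches the prefix CGT of t.
print(semiglobal_align("AAACGT", "CGTTTT"))
```

In this overlap example the trace back starts at cell (6,3) in the last row and ends at cell (3,0) in the first column rather than at a[0,0], which is the behaviour described in the note above.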
Last modified: September 2, 2009.
Maintained by Dannie Durand (durand@cs.cmu.edu) and Annette McLeod.