03-511/711, 15-495/856 Course Notes - Oct 1, 09


Review of distance-based methods

Greedy methods for distance-based phylogeny reconstruction

Taxa are points in a metric space with pairwise distances, D[i,j]. Tree building is equivalent to hierarchical clustering of these points.

Both of these greedy algorithms maintain a forest of subtrees, beginning with the set of singleton trees (i.e., trees with one leaf and no edges). At each iteration, the algorithm merges two neighboring subtrees in the forest. The length(s) of the edge(s) connecting the subtrees are calculated and the distance matrix is updated. This step is repeated until only one tree remains, which is the final result.
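The merge loop described above can be sketched in Python as follows. This is only an illustrative sketch, not code from the course: the function name greedy_average_linkage, the nested-tuple tree representation, and the input format (a dict keyed by pairs of leaf names) are all assumptions. It uses the closest pair as the merge criterion and a size-weighted average to update the distance matrix, which is the average-linkage rule used by UPGMA (described below).

```python
from itertools import combinations

def greedy_average_linkage(names, D):
    """Greedy tree building with a size-weighted average-linkage update.

    names: list of leaf labels.
    D: dict mapping a pair of labels, e.g. ("A", "B"), to their distance.
    Returns (tree, height); a subtree is either a leaf label or a pair of
    (child, edge_length) tuples.
    """
    # The forest maps each subtree to (number of leaves, height above leaves).
    forest = {name: (1, 0.0) for name in names}
    dist = {frozenset(pair): d for pair, d in D.items()}
    while len(forest) > 1:
        # Merge criterion: the closest pair of subtrees in the forest.
        a, b = min(combinations(forest, 2), key=lambda p: dist[frozenset(p)])
        d_ab = dist[frozenset((a, b))]
        (na, ha), (nb, hb) = forest.pop(a), forest.pop(b)
        # The new internal node sits halfway between the two subtrees.
        height = d_ab / 2.0
        merged = ((a, height - ha), (b, height - hb))
        # Update the distance matrix: average over all leaves of a and b.
        for c in forest:
            d_ac = dist[frozenset((a, c))]
            d_bc = dist[frozenset((b, c))]
            dist[frozenset((merged, c))] = (na * d_ac + nb * d_bc) / (na + nb)
        forest[merged] = (na + nb, height)
    (tree, (_, height)), = forest.items()
    return tree, height
```

For example, with D = {("A", "B"): 2.0, ("A", "C"): 6.0, ("B", "C"): 6.0}, the sketch first joins A and B at height 1 and then joins that subtree with C at height 3.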

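The algorithms differ in how they choose the pair of subtrees to merge, how they calculate the edge lengths, and how they update the distance matrix.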

Unweighted Pair Group Method with Arithmetic Mean (UPGMA)

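The UPGMA algorithm is a variant of average linkage. UPGMA is based on the molecular clock assumption. The consequences of this assumption are that D is ultrametric, that all leaves are equidistant from the root, and that the position of the root can be determined directly from the data.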

However, if the assumption is violated (i.e., if D is not ultrametric), then UPGMA may reconstruct an incorrect tree.

Neighbor Joining (NJ)

The NJ algorithm deals with this problem by correcting for variations in the rate of change. The "corrected" distance between a pair of nodes, i and j, is calculated by subtracting from D[i,j] the average distance from i to all other leaves and the average distance from j to all other leaves.
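A minimal sketch of this correction is given below. It assumes the formulation in Durbin et al., in which the corrected distance is D[i,j] - (r[i] + r[j]) and r[i] is the sum of the distances from i to all other leaves divided by n - 2; the notes may intend a slightly different normalization. The function name corrected_distances and the list-of-lists matrix format are illustrative choices.

```python
def corrected_distances(D):
    """Return {(i, j): corrected distance} for all pairs i < j.

    D is a symmetric n x n list of lists with zeros on the diagonal (n > 2).
    """
    n = len(D)
    # r[i]: "average" distance from leaf i to the other leaves, using the
    # n - 2 normalization from Durbin et al. (an assumption here).
    r = [sum(row) / (n - 2) for row in D]
    return {(i, j): D[i][j] - (r[i] + r[j])
            for i in range(n) for j in range(i + 1, n)}

# Additive distances for the unrooted tree ((A,B),(C,D)), where the edges
# leading to B and D are long (length 4) and all other edges have length 1.
D = [[0, 5, 3, 6],
     [5, 0, 6, 9],
     [3, 6, 0, 5],
     [6, 9, 5, 0]]
corrected = corrected_distances(D)
closest_raw = min(corrected, key=lambda p: D[p[0]][p[1]])
closest_corrected = min(corrected, key=corrected.get)
```

In this example the smallest raw distance is between A and C (indices 0 and 2), which are not neighbors, while the smallest corrected distance is attained by the true neighbor pairs (A,B) and (C,D).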

Thm:
If D is additive, the pair of taxa that minimizes the "corrected" distance are neighbors in the true tree.
Proof:
Durbin et al., 7.8

If D is additive, then NJ will reconstruct the correct unrooted tree in polynomial time (O(n^3) for the standard algorithm, where n is the number of taxa).
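The complete procedure can be sketched as follows. This is an illustrative sketch under stated assumptions, not a reference implementation: it uses the edge-length and distance-update formulas from Durbin et al., and the function name neighbor_joining, the internal node labels ("internal0", "internal1", ...), and the edge-list output format are hypothetical.

```python
def neighbor_joining(names, D):
    """Return the unrooted NJ tree as a list of (node, node, length) edges.

    names: list of leaf labels; D: symmetric list-of-lists distance matrix.
    """
    nodes = list(names)
    dist = {(a, b): float(D[i][j])
            for i, a in enumerate(names) for j, b in enumerate(names)}
    edges = []
    next_id = 0
    while len(nodes) > 2:
        n = len(nodes)
        # r[a]: summed distance from a to every other node, divided by n - 2.
        r = {a: sum(dist[a, k] for k in nodes) / (n - 2) for a in nodes}
        # Choose the pair that minimizes the corrected distance.
        pairs = ((x, y) for i, x in enumerate(nodes) for y in nodes[i + 1:])
        a, b = min(pairs, key=lambda p: dist[p] - r[p[0]] - r[p[1]])
        new = f"internal{next_id}"
        next_id += 1
        # Lengths of the two edges joining a and b to the new internal node.
        len_a = 0.5 * (dist[a, b] + r[a] - r[b])
        len_b = dist[a, b] - len_a
        edges += [(a, new, len_a), (b, new, len_b)]
        # Replace a and b by the new node and update the distance matrix.
        nodes = [k for k in nodes if k not in (a, b)]
        for k in nodes:
            d = 0.5 * (dist[a, k] + dist[b, k] - dist[a, b])
            dist[new, k] = dist[k, new] = d
        dist[new, new] = 0.0
        nodes.append(new)
    # Connect the last two nodes with a single edge.
    a, b = nodes
    edges.append((a, b, dist[a, b]))
    return edges
```

Applied to the additive matrix [[0, 5, 3, 6], [5, 0, 6, 9], [3, 6, 0, 5], [6, 9, 5, 0]] for taxa A, B, C and D, the sketch joins A with B and C with D and recovers edge lengths 1, 4, 1 and 4 for the leaves and 1 for the internal edge.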

Determining the root of the tree

If D is ultrametric, then the root can be determined directly from the data as a consequence of the molecular clock hypothesis. The root is located at the midpoint of the longest path between two taxa. UPGMA does this automatically. For trees obtained using other algorithms, the root can be estimated using midpoint rooting after the tree is constructed, as long as the tree has branch lengths.
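As a rough illustration, the sketch below tests whether a distance matrix is ultrametric and locates the midpoint of the longest path. The test used is the standard three-point condition (in every triple of taxa, the two largest pairwise distances are equal). The function names, the tolerance parameter tol, and the list-of-lists matrix format are assumptions made for this example.

```python
from itertools import combinations

def is_ultrametric(D, tol=1e-9):
    """Three-point condition: in each triple the two largest distances tie."""
    for i, j, k in combinations(range(len(D)), 3):
        _, second, largest = sorted((D[i][j], D[i][k], D[j][k]))
        if largest - second > tol:
            return False
    return True

def midpoint_of_longest_path(D):
    """Return the most distant pair of taxa and the root's distance to each."""
    i, j = max(combinations(range(len(D)), 2), key=lambda p: D[p[0]][p[1]])
    return (i, j), D[i][j] / 2.0

clock_like = [[0, 2, 6],
              [2, 0, 6],
              [6, 6, 0]]
not_clock_like = [[0, 5, 3, 6],
                  [5, 0, 6, 9],
                  [3, 6, 0, 5],
                  [6, 9, 5, 0]]
```

Here is_ultrametric(clock_like) is True and the root lies at distance 3 from the two most distant taxa, whereas is_ultrametric(not_clock_like) is False, so the root of the second data set cannot be read directly from the distances.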

If D is not ultrametric, then additional information is needed. To root a tree, one should add an outgroup to the data set. An outgroup is a taxon for which external information (e.g., paleontological information, morphology, ...) is available that indicates that the outgroup branched off before all other taxa. For example, bear could be used as an outgroup in a canine phylogeny.

The outgroup should be neither too closely nor too distantly related to the taxa in question. If no suitable outgroup is available and the rates of change do not differ greatly from one lineage to the next, then midpoint rooting may give a reasonable approximation.




Last modified: October 1st, 2009.
Maintained by Dannie Durand (durand@cs.cmu.edu) and Annette McLeod.