Phylogenetic Tree Simulator: UPGMA Tree Construction

6-taxon UPGMA tree — height = 50 Myr

A UPGMA tree built from 6 species, with a mutation rate of 2.2×10⁻⁹/site/year and a maximum divergence of 100 Myr, is ultrametric, with a height of 50 Myr and 5 internal nodes.

Formula

p = number of differing sites / alignment_length (p-distance between sequences A and B)
d_JC = -3/4 × ln(1 - 4p/3) (Jukes-Cantor correction, defined for p < 3/4)
h(node) = d(Ci,Cj) / 2 (UPGMA node height when clusters Ci and Cj merge)
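
As a rough illustration of the first two formulas, the sketch below computes a p-distance and its Jukes-Cantor correction for a toy pair of aligned sequences. The function names and the two sequences are hypothetical, chosen only for this example, and the code assumes a gap-free alignment of equal-length strings.

```python
import math


def p_distance(seq_a: str, seq_b: str) -> float:
    """Proportion of aligned sites at which two sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    if not seq_a:
        raise ValueError("sequences must be non-empty")
    differences = sum(a != b for a, b in zip(seq_a, seq_b))
    return differences / len(seq_a)


def jukes_cantor(p: float) -> float:
    """Jukes-Cantor corrected distance; undefined once p reaches 3/4."""
    if p >= 0.75:
        raise ValueError("Jukes-Cantor distance is undefined for p >= 0.75")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


# Toy 10-site alignment with 2 mismatches (hypothetical data).
p = p_distance("ACGTACGTAC", "ACGTTCGTAA")
d = jukes_cantor(p)
print(f"p = {p:.3f}, d_JC = {d:.3f}")
```

The corrected distance is always at least as large as p, because the correction accounts for sites that may have changed more than once.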

From Sequences to Trees

Phylogenetic trees represent the evolutionary relationships among species or genes. Building a tree from molecular data requires three steps: multiple sequence alignment, distance estimation, and tree construction. UPGMA — one of the earliest and simplest tree-building algorithms — takes a matrix of pairwise distances and iteratively clusters the closest pairs until a single rooted tree emerges.

The UPGMA Algorithm

UPGMA begins by treating each species as a single-member cluster. At each iteration, it identifies the two clusters with the smallest average inter-cluster distance, merges them into a new cluster, and places the joining node at a height of half that distance. The distance matrix is then updated: the distance from the new cluster to each remaining cluster is the arithmetic mean of all pairwise distances between their members, which equals the average of the two merged clusters' distances weighted by cluster size. After n-1 iterations on n species, a fully resolved rooted tree is obtained.
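
The steps above can be sketched in a few lines of Python. This is a minimal illustration, not the simulator's implementation: the `upgma` function, the frozenset-keyed distance dictionary, and the four-taxon matrix are all assumptions made for the example, and ties between equally close pairs are broken arbitrarily by iteration order.

```python
from itertools import combinations


def upgma(names, dist):
    """Return (newick_string, root_height) for a symmetric distance matrix.

    names: list of unique taxon labels.
    dist:  dict mapping frozenset({a, b}) -> distance between a and b.
    """
    # Active clusters, keyed by their Newick label: label -> (size, node height).
    clusters = {name: (1, 0.0) for name in names}
    d = dict(dist)
    while len(clusters) > 1:
        # 1. Find the closest pair of active clusters.
        a, b = min(combinations(clusters, 2), key=lambda pair: d[frozenset(pair)])
        (size_a, h_a), (size_b, h_b) = clusters[a], clusters[b]
        # 2. The joining node sits at half the inter-cluster distance.
        height = d[frozenset((a, b))] / 2.0
        merged = f"({a}:{height - h_a:.3f},{b}:{height - h_b:.3f})"
        # 3. Distance to every other cluster: size-weighted mean of the
        #    distances from the two merged clusters.
        for c in clusters:
            if c not in (a, b):
                d[frozenset((merged, c))] = (
                    size_a * d[frozenset((a, c))] + size_b * d[frozenset((b, c))]
                ) / (size_a + size_b)
        del clusters[a], clusters[b]
        clusters[merged] = (size_a + size_b, height)
    newick, (_, root_height) = next(iter(clusters.items()))
    return newick + ";", root_height


# Hypothetical clock-like distances for four taxa.
pairs = {("A", "B"): 2.0, ("A", "C"): 6.0, ("B", "C"): 6.0,
         ("A", "D"): 10.0, ("B", "D"): 10.0, ("C", "D"): 10.0}
matrix = {frozenset(k): v for k, v in pairs.items()}
newick, root_height = upgma(["A", "B", "C", "D"], matrix)
print(newick)       # (D:5.000,(C:3.000,(A:1.000,B:1.000):2.000):2.000);
print(root_height)  # 5.0
```

Four taxa take three merges, and every leaf ends up 5.0 units from the root, as expected for an ultrametric tree.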

The Molecular Clock Assumption

UPGMA produces ultrametric trees: trees in which every leaf is the same distance from the root. This implies a molecular clock, meaning that all lineages accumulate mutations at the same rate. When this assumption holds (as is often approximately true for closely related species), UPGMA gives accurate topologies and meaningful divergence time estimates. When rates vary between lineages, UPGMA can misplace fast-evolving species, because their distances to all other taxa are inflated.
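
One way to check whether a distance matrix is compatible with a clock is the three-point condition: distances are ultrametric exactly when, for every triple of taxa, the two largest of the three pairwise distances are equal. The sketch below tests that condition. The function name, the tolerance, and both matrices are hypothetical, and real data would need a looser tolerance because estimated distances are noisy.

```python
from itertools import combinations


def is_ultrametric(names, dist, tol=1e-9):
    """True if, for every triple of taxa, the two largest distances are equal."""
    for a, b, c in combinations(names, 3):
        _, middle, largest = sorted(
            (dist[frozenset((a, b))], dist[frozenset((a, c))], dist[frozenset((b, c))])
        )
        if largest - middle > tol:
            return False
    return True


def as_matrix(pairs):
    """Convert {(a, b): distance} into a frozenset-keyed lookup."""
    return {frozenset(k): v for k, v in pairs.items()}


names = ["A", "B", "C", "D"]
# Hypothetical clock-like distances.
clock_like = as_matrix({("A", "B"): 2.0, ("A", "C"): 6.0, ("B", "C"): 6.0,
                        ("A", "D"): 10.0, ("B", "D"): 10.0, ("C", "D"): 10.0})
# Hypothetical distances in which lineage B evolves faster than the others.
rate_varied = as_matrix({("A", "B"): 5.0, ("A", "C"): 3.0, ("A", "D"): 3.0,
                         ("B", "C"): 6.0, ("B", "D"): 6.0, ("C", "D"): 2.0})

print(is_ultrametric(names, clock_like))   # True
print(is_ultrametric(names, rate_varied))  # False
```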

Beyond UPGMA

Modern phylogenetics has largely moved beyond UPGMA to methods that relax the clock assumption. Neighbor-joining builds unrooted trees without assuming equal rates. Maximum likelihood and Bayesian inference evaluate explicit models of sequence evolution, incorporating rate heterogeneity across sites and lineages. Bootstrap resampling and posterior probabilities provide statistical support for each branch, essential for drawing reliable biological conclusions.

FAQ

What is UPGMA?

UPGMA (Unweighted Pair Group Method with Arithmetic Mean) is a hierarchical clustering algorithm that builds phylogenetic trees from a distance matrix. At each step, it joins the two closest clusters, placing the node at half their distance. UPGMA assumes a molecular clock — equal evolutionary rates across all lineages — producing ultrametric trees where all tips are equidistant from the root.

What is a molecular clock?

The molecular clock hypothesis proposes that DNA and protein sequences accumulate mutations at a roughly constant rate over time. If true, the expected number of substitutions separating two sequences is proportional to the time since they diverged. UPGMA relies on this assumption, while methods like neighbor-joining do not.

When does UPGMA fail?

UPGMA produces incorrect tree topologies when the molecular clock assumption is violated, that is, when different lineages evolve at different rates. In such cases, a fast-evolving lineage appears distant from all other taxa, so it tends to be joined too late and placed too close to the root, as if it had diverged earlier than it did. Neighbor-joining, maximum likelihood, and Bayesian methods handle rate variation more robustly.
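
A small constructed case makes the failure concrete. Assume the true rooted tree is ((A,B),(C,D)), with B evolving several times faster than A. The distances below are hypothetical and were built from assumed branch lengths (A = 1, B = 4, C = 1, D = 1, with an internal branch of 1 between the two pairs). The `upgma_topology` helper is a compact sketch written for this example. It returns only the branching order, as nested tuples.

```python
from itertools import combinations


def upgma_topology(names, dist):
    """Return the UPGMA branching order as nested tuples (no branch lengths)."""
    size = {name: 1 for name in names}
    d = {frozenset(k): v for k, v in dist.items()}
    while len(size) > 1:
        # Join the closest pair of active clusters.
        a, b = min(combinations(size, 2), key=lambda pair: d[frozenset(pair)])
        merged = (a, b)
        # Size-weighted average distance to each remaining cluster.
        for c in size:
            if c not in (a, b):
                d[frozenset((merged, c))] = (
                    size[a] * d[frozenset((a, c))] + size[b] * d[frozenset((b, c))]
                ) / (size[a] + size[b])
        size[merged] = size.pop(a) + size.pop(b)
    return next(iter(size))


# Hypothetical distances: A and B are true sisters, but B's branch is long.
distances = {("A", "B"): 5.0, ("A", "C"): 3.0, ("A", "D"): 3.0,
             ("B", "C"): 6.0, ("B", "D"): 6.0, ("C", "D"): 2.0}
tree = upgma_topology(["A", "B", "C", "D"], distances)
print(tree)  # ('B', ('A', ('C', 'D')))
```

UPGMA joins C and D first, then attaches A to them because A is closer to that pair (3.0) than to its own sister B (5.0). B is joined last, next to the root, so the A+B group from the assumed true tree is lost.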

What is long-branch attraction?

Long-branch attraction is a systematic error in which rapidly evolving lineages are incorrectly grouped together: with many substitutions on each long branch, some sites end up in the same state by chance, and the method mistakes this similarity for shared ancestry. It affects parsimony methods most severely but can also bias distance-based methods when distance corrections are inadequate.
