The Assembly Challenge
Short-read DNA sequencers produce billions of reads (100-300 bp each) from a genome that may be millions or billions of base pairs long. Genome assembly is the computational puzzle of reconstructing the complete sequence from these overlapping fragments: like assembling a jigsaw puzzle where many pieces look nearly identical and you have 50 copies of each. The de Bruijn graph approach, introduced to genomics by Pevzner and colleagues, transformed this challenge into a tractable graph theory problem.
De Bruijn Graph Construction
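The assembly pipeline begins by decomposing every read into overlapping k-mers (subsequences of length k). Each distinct k-mer becomes an edge in the de Bruijn graph, connecting its (k-1)-mer prefix to its (k-1)-mer suffix. In the idealized case (error-free reads, complete coverage, and no k-mer occurring more than once in the genome), the genome corresponds to an Eulerian path: a path that traverses every edge exactly once. This formulation is elegant because duplicate k-mers from overlapping reads collapse into single edges, naturally handling the massive redundancy of deep sequencing. Real data breaks the idealization, since repeats and sequencing errors add branches that must be resolved before a path can be read off.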
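The construction above can be sketched in a few lines of Python. This is a minimal illustration under the idealized assumptions (error-free reads, every k-mer unique in the genome), not how a production assembler stores its graph. The function names and the toy reads are hypothetical, chosen only for this example.

```python
from collections import defaultdict


def kmers(read, k):
    """Yield every overlapping k-mer of a read."""
    for i in range(len(read) - k + 1):
        yield read[i:i + k]


def build_de_bruijn(reads, k):
    """Return {prefix (k-1)-mer: [suffix (k-1)-mers]}, one edge per distinct k-mer."""
    # The set collapses duplicate k-mers that come from overlapping reads.
    distinct = {km for read in reads for km in kmers(read, k)}
    graph = defaultdict(list)
    for km in sorted(distinct):  # sorted only to make the result deterministic
        graph[km[:-1]].append(km[1:])
    return dict(graph)


def eulerian_path(graph):
    """Hierholzer's algorithm. Assumes the graph actually has an Eulerian path."""
    balance = defaultdict(int)  # out-degree minus in-degree per node
    for u, successors in graph.items():
        balance[u] += len(successors)
        for v in successors:
            balance[v] -= 1
    # Start where out-degree exceeds in-degree; for a circuit any node works.
    start = next((n for n, b in balance.items() if b == 1), next(iter(graph)))
    remaining = {u: list(vs) for u, vs in graph.items()}
    stack, path = [start], []
    while stack:
        node = stack[-1]
        if remaining.get(node):
            stack.append(remaining[node].pop())  # follow an unused edge
        else:
            path.append(stack.pop())  # node is exhausted, emit it
    return path[::-1]


def spell(path):
    """Rebuild a sequence: the first node plus the last base of each later node."""
    return path[0] + "".join(node[-1] for node in path[1:])


# Toy reads (hypothetical) that tile the sequence ATGGCGTGCA.
reads = ["ATGGCG", "GGCGTG", "GCGTGC", "CGTGCA"]
graph = build_de_bruijn(reads, k=4)
assembled = spell(eulerian_path(graph))
print(assembled)  # ATGGCGTGCA
```

The four reads contain twelve 4-mers in total, but only seven distinct ones, so the graph has seven edges. With real data the Eulerian path is rarely unique, which is why assemblers report contigs rather than a single traversal.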
The Role of K-mer Size
Choosing the right k-mer size involves a fundamental tradeoff. Small k values produce well-connected graphs (good for low-coverage regions) but introduce ambiguity where different genomic locations share short sequences by chance. Large k values resolve more repeats and reduce false connections, but they require higher coverage, because a read of length L yields only L-k+1 k-mers, and they amplify the impact of sequencing errors, because a single miscalled base corrupts up to k consecutive k-mers. Assemblers such as SPAdes use iterative multi-k strategies, starting with small k to establish connectivity and progressively increasing k to resolve ambiguities.
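Both sides of this tradeoff can be seen on a toy example. The sketch below is illustrative only: the genome, the reads, and the helper names are hypothetical. The genome contains the 4-base repeat GCAT twice, and the three reads are 8 bp windows of it. For each k, the code counts nodes with more than one outgoing edge (ambiguity caused by the repeat) and genomic k-mers that no read contains (gaps caused by too little coverage for that k).

```python
from collections import defaultdict

# Hypothetical toy genome: AC + GCAT + TG + GCAT + CA (the repeat is GCAT).
GENOME = "ACGCATTGGCATCA"
# Hypothetical 8 bp reads taken at offsets 0, 4 and 6.
READS = [GENOME[0:8], GENOME[4:12], GENOME[6:14]]


def kmer_set(sequences, k):
    """Distinct k-mers across a list of sequences."""
    return {s[i:i + k] for s in sequences for i in range(len(s) - k + 1)}


def branching_nodes(kmers):
    """Count (k-1)-mer nodes that have more than one distinct successor."""
    successors = defaultdict(set)
    for km in kmers:
        successors[km[:-1]].add(km[1:])
    return sum(1 for nxt in successors.values() if len(nxt) > 1)


def missing_kmers(genome, reads, k):
    """Count genomic k-mers that appear in none of the reads."""
    return len(kmer_set([genome], k) - kmer_set(reads, k))


for k in range(3, 7):
    forks = branching_nodes(kmer_set([GENOME], k))
    gaps = missing_kmers(GENOME, READS, k)
    print(f"k={k}: branching nodes={forks}, uncovered k-mers={gaps}")
```

In this example the fork caused by the repeat disappears once k-1 exceeds the repeat length (k=6), but at that same k one genomic k-mer is no longer covered by any read, so the graph gains a gap. A multi-k strategy tries to get the connectivity of small k and the repeat resolution of large k.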
From Graph to Contigs
The raw de Bruijn graph contains artifacts from sequencing errors and from genomic repeats. Errors typically produce tips (short dead-end branches) and bubbles (pairs of paths that diverge and then reconverge), while repeats produce tangles. Graph simplification algorithms remove erroneous tips, merge bubbles from heterozygous variants or errors, and carefully resolve simple repeat structures. The remaining unambiguous paths become contigs: the contiguous assembled sequences. Scaffolding with paired-end or long-read information then orders and orients contigs and bridges gaps between them, in favorable cases yielding assemblies that approach chromosome scale.
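Two of these steps, tip removal and contig extraction, can be sketched as follows. This is a simplified illustration, not the procedure of any particular assembler. The names clip_tips, contigs_from_kmers, and max_tip_edges are hypothetical. The tip clipper only handles dead-end tips (not tips that start at a node with no incoming edge), it does not compare a tip against the competing branch as real tools usually do, and the contig walker ignores isolated cycles. Bubble merging and repeat resolution are omitted.

```python
from collections import defaultdict


def clip_tips(kmers, max_tip_edges):
    """Drop dead-end branches of at most max_tip_edges edges that hang off a fork."""
    kmers = set(kmers)
    out_edges, in_edges = defaultdict(set), defaultdict(set)
    for km in kmers:
        out_edges[km[:-1]].add(km)
        in_edges[km[1:]].add(km)
    doomed = set()
    dead_ends = [n for n in in_edges if not out_edges.get(n)]
    for node in dead_ends:
        branch = []
        # Walk backwards while the path is unbranched and still short.
        while (len(branch) <= max_tip_edges
               and len(in_edges.get(node, ())) == 1
               and len(out_edges.get(node, ())) <= 1):
            (edge,) = in_edges[node]
            branch.append(edge)
            node = edge[:-1]
        # Clip only if the walk stopped at a fork with another way forward.
        if branch and len(branch) <= max_tip_edges and len(out_edges.get(node, ())) > 1:
            doomed.update(branch)
    return kmers - doomed


def contigs_from_kmers(kmers):
    """Spell maximal non-branching paths (isolated cycles are ignored)."""
    out_edges, in_deg = defaultdict(list), defaultdict(int)
    for km in sorted(kmers):
        out_edges[km[:-1]].append(km[1:])
        in_deg[km[1:]] += 1
    nodes = set(out_edges) | set(in_deg)

    def is_simple(n):
        return in_deg.get(n, 0) == 1 and len(out_edges.get(n, [])) == 1

    contigs = []
    for node in sorted(nodes):
        if is_simple(node):
            continue  # contigs start only at forks, joins, sources
        for nxt in out_edges.get(node, []):
            path = node + nxt[-1]
            while is_simple(nxt):
                nxt = out_edges[nxt][0]
                path += nxt[-1]
            contigs.append(path)
    return contigs


# Hypothetical data: the 4-mers of ATGGCGTGCA plus one erroneous k-mer (GGCT).
true_kmers = {"ATGG", "TGGC", "GGCG", "GCGT", "CGTG", "GTGC", "TGCA"}
noisy = true_kmers | {"GGCT"}
raw = contigs_from_kmers(noisy)
cleaned = contigs_from_kmers(clip_tips(noisy, max_tip_edges=2))
print(raw)      # ['ATGGC', 'GGCGTGCA', 'GGCT']
print(cleaned)  # ['ATGGCGTGCA']
```

The single erroneous k-mer splits the sequence into three fragments. Once the one-edge tip is clipped, the graph is a single unbranched path and yields one contig. The threshold max_tip_edges is a tuning choice: too small leaves errors in place, too large risks clipping genuine sequence near the ends of the graph.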