Genome Assembly Simulator: De Bruijn Graph Construction

simulator advanced ~13 min
Loading simulation...
N50 ≈ 45 kb — estimated assembly contiguity

With k=31, 50× coverage, 150 bp reads, and 0.5% error rate, the de Bruijn graph assembly produces contigs with an estimated N50 of 45 kb, limited primarily by repeat structures longer than the read length.

Formula

k-mers per read = L - k + 1
k-mer coverage = C × (L - k + 1) / L
N50: shortest contig s.t. Σ(contigs ≥ N50) ≥ total/2

The Assembly Challenge

Modern DNA sequencers produce billions of short reads (100-300 bp) from a genome that may be millions or billions of base pairs long. Genome assembly is the computational puzzle of reconstructing the complete sequence from these overlapping fragments — like assembling a jigsaw puzzle where every piece is nearly identical and you have 50 copies of each. The de Bruijn graph approach, introduced to genomics by Pevzner and colleagues, transformed this challenge into a tractable graph theory problem.

De Bruijn Graph Construction

The assembly pipeline begins by decomposing every read into overlapping k-mers (subsequences of length k). Each k-mer becomes an edge in the de Bruijn graph, connecting its (k-1)-mer prefix to its (k-1)-mer suffix. The genome corresponds to an Eulerian path — a path that traverses every edge exactly once. This formulation is elegant because duplicate k-mers from overlapping reads collapse into single edges, naturally handling the massive redundancy of deep sequencing.

The Role of K-mer Size

Choosing the right k-mer size involves a fundamental tradeoff. Small k values produce well-connected graphs (good for low-coverage regions) but introduce ambiguity where different genomic locations share short sequences by chance. Large k values resolve more repeats and reduce false connections but require higher coverage and amplify the impact of sequencing errors. Modern assemblers like SPAdes use iterative multi-k strategies, starting with small k to establish connectivity and progressively increasing k to resolve ambiguities.

From Graph to Contigs

The raw de Bruijn graph contains artifacts from sequencing errors (tips and bubbles) and genomic repeats (tangles). Graph simplification algorithms remove erroneous tips, merge bubbles from heterozygous variants or errors, and carefully resolve simple repeat structures. The remaining unambiguous paths become contigs — the contiguous assembled sequences. Scaffolding with paired-end or long-read information then orders and orients contigs, bridging gaps to produce chromosome-scale assemblies.

FAQ

What is a de Bruijn graph in genome assembly?

A de Bruijn graph represents all k-mers from sequencing reads as edges, with (k-1)-mer prefixes and suffixes as nodes. The genome sequence corresponds to an Eulerian path through the graph. This formulation elegantly handles the massive redundancy of high-coverage sequencing data, since each k-mer appears as a single edge regardless of how many reads contain it.

How does k-mer size affect assembly?

Small k values produce simpler graphs with better connectivity but more ambiguity from coincidental k-mer matches. Large k values resolve more repeats but require higher coverage (each read produces fewer k-mers) and are more sensitive to sequencing errors. Many assemblers try multiple k values and merge results.

What is N50 and why does it matter?

N50 is the length of the shortest contig such that contigs of this length or longer cover at least 50% of the assembly. Higher N50 indicates a more contiguous assembly. A bacterial genome (5 Mb) assembled into 200 contigs with N50 = 50 kb means most of the genome is in contigs ≥50 kb long.

Why are repeats problematic for assembly?

Genomic repeats longer than the read length (or k-mer size) create ambiguous paths in the assembly graph — the assembler cannot determine which flanking unique sequences belong together. This fragments the assembly at repeat boundaries. Long-read sequencing (PacBio, Oxford Nanopore) resolves more repeats by spanning them entirely.

Sources

Embed

<iframe src="https://homo-deus.com/lab/bioinformatics/genome-assembly/embed" width="100%" height="400" frameborder="0"></iframe>
View source on GitHub