The Alignment Problem
Comparing biological sequences — DNA, RNA, or protein — is the most fundamental operation in bioinformatics. Every genome project, every evolutionary study, and every protein function prediction begins with alignment. The challenge is to find the optimal correspondence between two sequences, accounting for mutations (substitutions), insertions, and deletions that have accumulated since their divergence from a common ancestor.
Dynamic Programming Solution
Needleman and Wunsch's 1970 algorithm solves global alignment optimally using dynamic programming. It constructs a matrix where entry F(i,j) stores the best alignment score for the first i characters of sequence X and first j characters of sequence Y. Each cell is computed from its three neighbors: diagonal (match/mismatch), left (gap in Y), and above (gap in X). The traceback from the bottom-right corner reveals the optimal alignment.
Scoring Schemes
The choice of scoring parameters profoundly affects the alignment. Simple match/mismatch scores work for DNA, but protein alignment requires substitution matrices like BLOSUM62 or PAM250 that reflect the biochemical similarity between amino acids. Affine gap penalties (open + extend) better model the biology of insertions and deletions, since a single mutational event often creates a multi-residue gap.
From Pairwise to Database Search
While Needleman-Wunsch guarantees the optimal alignment, its O(mn) complexity makes it impractical for searching databases of billions of residues. BLAST (Basic Local Alignment Search Tool) uses word-seeding heuristics to rapidly identify candidate regions, then extends alignments locally. Modern tools like DIAMOND and MMseqs2 push throughput further using precomputed index structures, enabling metagenomic analyses of millions of sequences.