From Photos to 3D
Structure from Motion (SfM) is the algorithmic backbone of modern photogrammetry. Given a collection of overlapping photographs — potentially unordered and taken with unknown cameras — SfM automatically detects visual features, matches them across images, estimates camera poses, and triangulates a sparse 3D point cloud. The process loosely mirrors how human vision recovers depth from motion parallax: the way scene points shift against one another as the viewpoint moves encodes their 3D structure.
Feature Detection and Matching
SfM begins by detecting distinctive visual features (corners, blobs) in each image using algorithms like SIFT or SuperPoint. These features are encoded as high-dimensional descriptors and matched across image pairs. RANSAC-based geometric verification eliminates false matches by fitting the fundamental matrix — only matches consistent with a valid geometric relationship survive. The quality of these matches directly determines reconstruction accuracy.
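The descriptor-matching step can be sketched in a few lines. Below is a minimal nearest-neighbour matcher with Lowe's ratio test (the standard heuristic used with SIFT descriptors); the descriptors here are tiny synthetic vectors rather than real 128-D SIFT output, and the geometric verification stage is omitted:

```python
import numpy as np

def match_descriptors(desc_a, desc_b, ratio=0.75):
    """Nearest-neighbour matching with Lowe's ratio test.

    desc_a, desc_b: (N, D) arrays of feature descriptors (e.g. 128-D SIFT).
    Returns a list of (i, j) index pairs that pass the ratio test.
    """
    matches = []
    for i, d in enumerate(desc_a):
        # Squared Euclidean distance to every descriptor in image B.
        dists = np.sum((desc_b - d) ** 2, axis=1)
        j1, j2 = np.argsort(dists)[:2]
        # Ratio test: the best match must be clearly better than the runner-up,
        # otherwise the feature is ambiguous and the match is discarded.
        if dists[j1] < ratio ** 2 * dists[j2]:
            matches.append((i, j1))
    return matches

# Toy example: three 4-D descriptors, permuted and slightly perturbed.
rng = np.random.default_rng(0)
a = rng.normal(size=(3, 4))
b = a[[1, 0, 2]] + rng.normal(scale=0.01, size=(3, 4))
print(match_descriptors(a, b))  # [(0, 1), (1, 0), (2, 2)]
```

In a real pipeline these raw matches would then be fed to RANSAC-based fundamental-matrix estimation, as described above, to discard the remaining outliers.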
Incremental Reconstruction
Most SfM pipelines build the reconstruction incrementally: starting from a well-matched initial pair, they triangulate an initial point cloud, then register additional cameras one by one. Each new camera is localized against existing 3D points (Perspective-n-Point, PnP), new points are triangulated from the new viewpoint, and bundle adjustment periodically refines everything. This incremental approach scales to thousands of images but can accumulate drift over long sequences.
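The triangulation step at the core of this loop can be illustrated with the classic linear (DLT) method: each observation of a point contributes two rows to a homogeneous system, whose least-squares solution is the right singular vector with the smallest singular value. This is a minimal sketch with identity intrinsics and noise-free observations, not a production triangulator:

```python
import numpy as np

def triangulate_dlt(P1, P2, x1, x2):
    """Linear (DLT) triangulation of one point from two views.

    P1, P2: (3, 4) camera projection matrices.
    x1, x2: (2,) observations of the same 3D point in normalized image coords.
    Returns the 3D point in Euclidean coordinates.
    """
    # Each view contributes two rows of the homogeneous system A X = 0.
    A = np.stack([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    # Least-squares null vector of A via SVD.
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]
    return X[:3] / X[3]  # dehomogenize

# Toy setup: identity-intrinsics cameras separated by a unit baseline in x.
P1 = np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = np.hstack([np.eye(3), np.array([[-1.0], [0.0], [0.0]])])
X_true = np.array([0.5, 0.2, 4.0])

def project(P, X):
    h = P @ np.append(X, 1.0)
    return h[:2] / h[2]

X_est = triangulate_dlt(P1, P2, project(P1, X_true), project(P2, X_true))
print(np.allclose(X_est, X_true))  # True: the noise-free case is recovered exactly
```

With noisy observations the DLT answer is only an algebraic approximation, which is one reason periodic bundle adjustment (below) is essential.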
Bundle Adjustment
Bundle adjustment is the mathematical heart of SfM — a massive nonlinear optimization that simultaneously refines all camera parameters and 3D point positions to minimize total reprojection error. The Jacobian of this system is extremely sparse (each observation involves only one camera and one point), enabling efficient solution via the Schur complement. Modern solvers like Ceres handle millions of observations in seconds, making high-quality reconstruction practical even on consumer hardware.
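The Schur-complement trick that exploits this sparsity can be sketched on a toy dense system (the sizes and values here are illustrative, not from a real reconstruction). The normal equations split into a camera block B, a block-diagonal point block C that is cheap to invert, and cross terms E; eliminating the points first leaves a much smaller reduced camera system:

```python
import numpy as np

# Normal equations of one Gauss-Newton step in bundle adjustment:
#   [B   E] [dc]   [v]
#   [E'  C] [dp] = [w]
# B couples cameras, C couples points (block-diagonal), E holds cross terms.
rng = np.random.default_rng(1)
nc, npts = 4, 9                       # toy camera- and point-parameter dims
B = np.eye(nc) * 5 + rng.normal(size=(nc, nc)) * 0.1
B = B @ B.T                           # make the camera block symmetric PD
C = np.diag(rng.uniform(1, 2, npts))  # point block: trivially invertible
E = rng.normal(size=(nc, npts)) * 0.1
v = rng.normal(size=nc)
w = rng.normal(size=npts)

# Reduced camera system: (B - E C^-1 E') dc = v - E C^-1 w
Cinv = np.diag(1.0 / np.diag(C))
S = B - E @ Cinv @ E.T                # the Schur complement
dc = np.linalg.solve(S, v - E @ Cinv @ w)
dp = Cinv @ (w - E.T @ dc)            # back-substitute for the point updates

# Sanity check against solving the full system directly.
full = np.block([[B, E], [E.T, C]])
ref = np.linalg.solve(full, np.concatenate([v, w]))
print(np.allclose(np.concatenate([dc, dp]), ref))  # True
```

The payoff is that the expensive factorization is performed only on the small camera system S, while the (much larger) point block is handled by cheap per-point inversions; this is essentially what solvers such as Ceres do with their sparse Schur linear solvers.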