Structure from Motion Simulator: SfM 3D Reconstruction Pipeline

Simulator · Advanced · ~15 min
ε = 0.42 px — mean reprojection error

With 20 images at 60% match ratio and 10 BA iterations, the SfM pipeline reconstructs a sparse point cloud with sub-pixel reprojection accuracy, ready for dense multi-view stereo.

Formula

x = PX = K[R|t]X (projection equation)
min Σᵢⱼ ||xᵢⱼ - π(Cⱼ, Xᵢ)||² (bundle adjustment)
x'ᵀ F x = 0 (fundamental matrix / epipolar constraint)
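The projection equation x = K[R|t]X can be checked numerically. A minimal sketch with illustrative values (this K, R, t, and the test point are made up, not taken from the simulator):

```python
import numpy as np

# Illustrative pinhole camera, not from the simulator above.
K = np.array([[800.0,   0.0, 320.0],   # fx, skew, cx
              [  0.0, 800.0, 240.0],   # fy, cy
              [  0.0,   0.0,   1.0]])
R = np.eye(3)                          # no rotation
t = np.zeros((3, 1))                   # camera at the world origin

P = K @ np.hstack([R, t])              # 3x4 projection matrix P = K[R|t]

X = np.array([0.1, -0.2, 2.0, 1.0])    # homogeneous world point
x = P @ X
u, v = x[0] / x[2], x[1] / x[2]        # dehomogenize to pixel coordinates
print(u, v)                            # → 360.0 160.0
```

Dividing by the third homogeneous coordinate is what makes depth along the ray unobservable from a single image, which is exactly why SfM needs multiple views.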

From Photos to 3D

Structure from Motion is the algorithmic backbone of modern photogrammetry. Given a collection of overlapping photographs — potentially unordered and taken with unknown cameras — SfM automatically detects visual features, matches them across images, estimates camera positions, and triangulates a sparse 3D point cloud. The entire process mirrors how our visual cortex infers 3D structure from the changing patterns we see as we move through the world.

Feature Detection and Matching

SfM begins by detecting distinctive visual features (corners, blobs) in each image using algorithms like SIFT or SuperPoint. These features are encoded as high-dimensional descriptors and matched across image pairs. RANSAC-based geometric verification eliminates false matches by fitting the fundamental matrix — only matches consistent with a valid geometric relationship survive. The quality of these matches directly determines reconstruction accuracy.
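The descriptor-matching step can be sketched with a NumPy-only nearest-neighbor search plus Lowe's ratio test; the synthetic random descriptors below stand in for real SIFT output (which a real pipeline would get from something like `cv2.SIFT_create().detectAndCompute()`):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for 128-D SIFT-style descriptors.
desc_a = rng.normal(size=(50, 128))
desc_b = np.vstack([
    desc_a[:30] + 0.01 * rng.normal(size=(30, 128)),  # true correspondences
    rng.normal(size=(20, 128)),                       # distractors
])

def ratio_test_match(da, db, ratio=0.75):
    """Match each descriptor in da to db, keeping a match only when the
    nearest neighbor is clearly better than the second nearest (Lowe's test)."""
    matches = []
    for i, d in enumerate(da):
        dists = np.linalg.norm(db - d, axis=1)
        j1, j2 = np.argsort(dists)[:2]
        if dists[j1] < ratio * dists[j2]:
            matches.append((i, j1))
    return matches

matches = ratio_test_match(desc_a, desc_b)
print(len(matches))  # at least the 30 perturbed copies match back correctly
```

The ratio test prunes ambiguous matches before RANSAC ever runs; geometric verification with the fundamental matrix then removes the remaining outliers.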

Incremental Reconstruction

Most SfM pipelines build the reconstruction incrementally: starting from a well-matched initial pair, they triangulate an initial point cloud, then register additional cameras one by one. Each new camera is localized against existing 3D points (PnP), new points are triangulated from the new viewpoint, and bundle adjustment periodically refines everything. This incremental approach handles thousands of images but can accumulate drift over long sequences.
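The triangulation step inside this loop is commonly done with the linear DLT method. A self-contained sketch with a hypothetical two-camera rig (the intrinsics and baseline below are illustrative):

```python
import numpy as np

def triangulate_dlt(P1, P2, x1, x2):
    """Linear (DLT) triangulation: solve for the homogeneous point X with
    x1 ~ P1 X and x2 ~ P2 X via SVD of the stacked cross-product constraints."""
    A = np.vstack([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]
    return X[:3] / X[3]  # dehomogenize

# Hypothetical rig: one camera at the origin, one translated along x.
K = np.diag([700.0, 700.0, 1.0])
P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = K @ np.hstack([np.eye(3), np.array([[-1.0], [0.0], [0.0]])])

def project(P, X):
    x = P @ np.append(X, 1.0)
    return x[:2] / x[2]

X_true = np.array([0.5, 0.2, 4.0])
X_est = triangulate_dlt(P1, P2, project(P1, X_true), project(P2, X_true))
print(X_est)  # recovers the original point up to numerical precision
```

With noisy observations the DLT result is only an algebraic minimizer, which is why bundle adjustment re-optimizes the true reprojection error afterwards.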

Bundle Adjustment

Bundle adjustment is the mathematical heart of SfM — a massive nonlinear optimization that simultaneously refines all camera parameters and 3D point positions to minimize total reprojection error. The Jacobian of this system is extremely sparse (each observation involves only one camera and one point), enabling efficient solution via the Schur complement. Modern solvers like Ceres handle millions of observations in seconds, making high-quality reconstruction practical even on consumer hardware.
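Production solvers exploit the sparse Jacobian via the Schur complement, but the objective itself fits in a few lines. A dense toy sketch with `scipy.optimize.least_squares`, jointly refining two camera translations and eight 3D points (all values illustrative; rotations are held fixed at identity for brevity):

```python
import numpy as np
from scipy.optimize import least_squares

K = np.diag([500.0, 500.0, 1.0])
n_pts = 8
rng = np.random.default_rng(1)
pts_true = rng.uniform([-1, -1, 4], [1, 1, 6], size=(n_pts, 3))
t_true = np.array([[0.0, 0.0, 0.0], [-1.0, 0.0, 0.0]])  # two camera translations

def project(t, X):
    x = K @ (X + t)          # R = I for both cameras in this toy setup
    return x[:2] / x[2]

# Synthetic noise-free observations of every point in both cameras.
obs = np.array([[project(t, X) for X in pts_true] for t in t_true])

def residuals(params):
    t = params[:6].reshape(2, 3)
    pts = params[6:].reshape(n_pts, 3)
    r = [project(t[j], pts[i]) - obs[j, i]
         for j in range(2) for i in range(n_pts)]  # one camera, one point each
    return np.concatenate(r)

# Perturb the true parameters and let the optimizer refine them back.
x0 = (np.concatenate([t_true.ravel(), pts_true.ravel()])
      + 0.05 * rng.normal(size=6 + 3 * n_pts))
sol = least_squares(residuals, x0)
print(np.abs(sol.fun).max())  # reprojection error driven to near zero
```

Note that each residual depends on exactly one camera and one point, which is the sparsity pattern the Schur complement trick exploits at scale.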

FAQ

What is structure from motion (SfM)?

Structure from motion simultaneously estimates 3D scene structure and camera positions from a collection of unordered 2D images. By detecting and matching features across images, triangulating 3D points, and refining everything through bundle adjustment, SfM produces a sparse 3D point cloud and calibrated camera poses.

How does bundle adjustment work?

Bundle adjustment is a nonlinear least squares optimization that minimizes the total reprojection error — the sum of squared distances between observed feature points and their predicted positions through the estimated camera models and 3D points. It jointly refines all camera parameters and point coordinates.

How many images do you need for SfM?

Two images are the theoretical minimum for triangulating 3D structure, but practical SfM requires 70-80% overlap between adjacent images. For a complete object, 20-60 images from diverse viewpoints typically suffice. Larger image sets improve robustness and fill gaps.

What is the difference between SfM and MVS?

SfM produces sparse 3D points and camera poses from feature matches. Multi-view stereo (MVS) then uses these calibrated cameras to compute dense depth maps and generate detailed 3D surface models. SfM is the geometric backbone; MVS adds surface detail.
