Seeing Depth from Two Eyes
Stereo vision mimics human binocular perception: two cameras separated by a known baseline capture the same scene from slightly different angles. The horizontal offset (disparity) between corresponding points in the left and right images encodes depth — closer objects exhibit larger disparity shifts, while distant features show nearly zero displacement. This geometric principle underpins autonomous vehicle perception, robotic navigation, and aerial 3D mapping.
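The inverse relationship between depth and disparity can be checked in a few lines. A minimal sketch, assuming a hypothetical rig with a 700-pixel focal length and a 12 cm baseline (both values invented for illustration):

```python
# Hypothetical rig parameters (assumed for illustration, not from the text).
f_px = 700.0   # focal length, in pixels
B_m = 0.12     # baseline between the two cameras, in meters

def disparity_for_depth(z_m):
    """Pixel shift between the two views for a point at depth z_m: d = f*B/Z."""
    return f_px * B_m / z_m

near_d = disparity_for_depth(1.0)    # object 1 m away: large disparity
far_d = disparity_for_depth(50.0)    # object 50 m away: near-zero disparity
```

With these assumed numbers, a point 1 m away shifts by about 84 pixels between the views, while a point 50 m away shifts by under 2, matching the rule that closer objects exhibit larger disparity.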
Block Matching Algorithms
The simplest dense stereo method slides a small window along each epipolar line, computing a matching cost (typically the Sum of Absolute Differences, SAD) at every candidate disparity. The disparity with the lowest cost is selected at each pixel (winner-take-all), producing a dense disparity map. Block size controls the smoothness-detail tradeoff: small blocks preserve edges but are sensitive to noise, while large blocks produce smoother results at the cost of blurred object boundaries.
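The sliding-window search can be sketched in pure Python. This is a toy illustration on a synthetic image pair, not an efficient implementation; the image pattern, window size, and disparity range are all invented for the demo:

```python
def sad(left, right, y, xl, xr, half):
    """Sum of Absolute Differences between the (2*half+1)^2 windows
    centered at (y, xl) in the left image and (y, xr) in the right."""
    return sum(abs(left[y + dy][xl + dx] - right[y + dy][xr + dx])
               for dy in range(-half, half + 1)
               for dx in range(-half, half + 1))

def block_match_row(left, right, y, half, max_disp):
    """Winner-take-all disparity for each valid pixel on row y: try every
    candidate disparity and keep the one with the lowest SAD."""
    width = len(left[0])
    disparities = []
    for x in range(half + max_disp, width - half):
        best = min((sad(left, right, y, x, x - d, half), d)
                   for d in range(max_disp + 1))
        disparities.append(best[1])
    return disparities

# Synthetic demo: the right image is the left image shifted 2 px along
# each row, so every pixel's true disparity is 2.
left = [[(3 * x + 5 * y) % 32 for x in range(20)] for y in range(9)]
right = [[(3 * (x + 2) + 5 * y) % 32 for x in range(20)] for y in range(9)]
row = block_match_row(left, right, y=4, half=2, max_disp=4)
```

On this textured synthetic pair the recovered disparity is 2 at every valid pixel; real images add noise, occlusions, and repetitive texture that make the minimum far less clean.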
Depth Precision and Range
The fundamental depth equation Z = fB/d, where f is the focal length, B the baseline, and d the disparity, reveals that depth precision degrades quadratically with distance: a fixed disparity error Δd produces a depth error ΔZ ≈ (Z²/fB)·Δd, so doubling the range quadruples the depth uncertainty. This means stereo systems excel at close-range reconstruction but struggle at long distances. Increasing the baseline or focal length extends the useful range, though wider baselines introduce more occlusion artifacts at object boundaries.
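The quadratic degradation follows from differentiating Z = fB/d with respect to d. A short sketch, again assuming hypothetical rig parameters (700-pixel focal length, 12 cm baseline) and a quarter-pixel disparity error, all invented for illustration:

```python
# Hypothetical rig parameters (assumed for illustration).
f_px, B_m = 700.0, 0.12

def depth(d_px):
    """Depth from disparity: Z = f*B/d."""
    return f_px * B_m / d_px

def depth_error(z_m, disp_err_px=0.25):
    """First-order depth uncertainty: |dZ| ~= Z**2 / (f*B) * |dd|.
    The 0.25 px disparity error is an assumed matching accuracy."""
    return z_m ** 2 / (f_px * B_m) * disp_err_px

err_10m = depth_error(10.0)   # uncertainty at 10 m
err_20m = depth_error(20.0)   # at twice the range: four times the error
```

Doubling z_m from 10 m to 20 m multiplies the error by exactly four, which is the "doubling range quadruples uncertainty" behavior in numbers.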
Modern Advances
Semi-Global Matching (SGM) dramatically improved stereo quality by enforcing piecewise smoothness across multiple directions, achieving near-real-time dense reconstruction. Recent deep learning approaches like RAFT-Stereo and LEAStereo learn to match features directly, handling textureless regions and reflections that defeat traditional methods. These advances power LiDAR-free depth sensing in smartphones and self-driving cars.
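SGM's core idea can be sketched as a dynamic-programming recurrence along one scan direction: the aggregated cost at each pixel favors staying at the previous pixel's disparity, charges a small penalty P1 for a one-level change, and a larger penalty P2 for any bigger jump. A real implementation aggregates 8 or 16 such paths and sums them; the sketch below runs a single left-to-right path, with P1, P2, and the toy cost volume all chosen arbitrarily for illustration:

```python
def aggregate_path(costs, P1=1, P2=8):
    """One SGM aggregation path. costs[x][d] is the raw matching cost at
    pixel x and disparity d; returns the path-aggregated cost volume.
    Subtracting prev_min keeps values bounded, as in the SGM recurrence."""
    n, num_disp = len(costs), len(costs[0])
    L = [list(costs[0])]
    for x in range(1, n):
        prev = L[-1]
        prev_min = min(prev)
        row = []
        for d in range(num_disp):
            candidates = [prev[d],            # keep the same disparity
                          prev_min + P2]      # large jump: big penalty
            if d > 0:
                candidates.append(prev[d - 1] + P1)  # small change down
            if d < num_disp - 1:
                candidates.append(prev[d + 1] + P1)  # small change up
            row.append(costs[x][d] + min(candidates) - prev_min)
        L.append(row)
    return L

# Toy cost volume over 4 pixels x 4 disparities: the true disparity is 1,
# but noise makes pixel 2 prefer disparity 3 under winner-take-all.
costs = [[5, 0, 5, 5], [5, 0, 5, 5], [5, 4, 5, 0], [5, 0, 5, 5]]
agg = aggregate_path(costs)
```

On this toy volume, per-pixel winner-take-all on the raw costs picks disparities [1, 1, 3, 1], while the aggregated costs pick [1, 1, 1, 1]: the smoothness penalties suppress the isolated outlier, which is the effect that made SGM's disparity maps so much cleaner than plain block matching.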