Hash Tables: O(1) Lookup Through Hashing


With 20 keys in 16 buckets (load factor 1.25), collisions are mathematically unavoidable. A good hash function distributes them evenly; a poor one creates pathological clustering.

Formula

Expected lookup (chaining): O(1 + α) where α = n/m is the load factor
Birthday paradox: P(collision) ≈ 1 - e^(-n²/(2m)) for n keys in m buckets
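The birthday-paradox approximation above can be evaluated directly. A minimal sketch (the function name is illustrative):

```python
import math

def collision_probability(n: int, m: int) -> float:
    """Approximate probability that at least two of n keys
    share a bucket among m buckets: 1 - e^(-n^2 / (2m))."""
    return 1 - math.exp(-n * n / (2 * m))

# The classic birthday setting: 23 people, 365 "buckets"
# already gives better-than-even odds of a shared birthday.
print(collision_probability(23, 365))   # slightly above 0.5

# 20 keys in 16 buckets: a collision is a near-certainty.
print(collision_probability(20, 16))
```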

The Most Important Data Structure

Hash tables are arguably the single most important data structure in computer science. They power dictionaries in Python, objects in JavaScript, HashMaps in Java, and the key-value stores that underpin modern databases. The magic of hash tables is constant-time lookup: given a key, you can find its value in O(1) time on average, regardless of how many items are stored. This seemingly impossible feat is achieved through hash functions.

Hash Functions and Collisions

A hash function maps a key (like a string or number) to an integer index in the table's bucket array. A good hash function distributes keys uniformly across all buckets, minimizing collisions. But collisions are unavoidable — by the pigeonhole principle, if you have more possible keys than buckets, some keys must share a bucket. The art of hash table design lies in handling collisions efficiently.
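The key-to-bucket mapping can be sketched in one line using Python's built-in `hash` (note that string hashes are salted per process, so the exact indices vary between runs):

```python
def bucket_index(key, num_buckets: int) -> int:
    """Map any hashable key to a bucket in [0, num_buckets)."""
    return hash(key) % num_buckets

# With more keys than buckets, some indices must repeat
# (pigeonhole principle) -- those repeats are collisions.
keys = ["apple", "banana", "cherry", "date", "elderberry"]
indices = [bucket_index(k, 4) for k in keys]
```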

Collision Resolution Strategies

The two main strategies are chaining and open addressing. With chaining, each bucket holds a linked list of all keys that hash to it. Lookups traverse the list. With open addressing, when a collision occurs, the algorithm probes subsequent buckets until an empty one is found. The visualization shows both strategies: chaining as linked lists growing from buckets, and open addressing as arrows showing the probe sequence.
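Chaining is the simpler of the two to sketch. A minimal, illustrative table (class and method names are ours, not from any standard library):

```python
class ChainedHashTable:
    """Hash table with separate chaining: each bucket is a
    list of (key, value) pairs that hashed to it."""

    def __init__(self, num_buckets: int = 16):
        self.buckets = [[] for _ in range(num_buckets)]

    def put(self, key, value):
        bucket = self.buckets[hash(key) % len(self.buckets)]
        for i, (k, _) in enumerate(bucket):
            if k == key:                  # key already present: overwrite
                bucket[i] = (key, value)
                return
        bucket.append((key, value))       # new key: grow the chain

    def get(self, key):
        bucket = self.buckets[hash(key) % len(self.buckets)]
        for k, v in bucket:               # traverse the chain: O(1 + alpha)
            if k == key:
                return v
        raise KeyError(key)
```

Lookup cost is the hash plus a walk down one chain, matching the O(1 + α) formula above.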

Performance and Resizing

Hash table performance depends critically on the load factor α = n/m (keys divided by buckets). At low load factors, most lookups find the target immediately. As α approaches 1, collisions multiply and performance degrades toward O(n). Real implementations track the load factor and resize (typically double) the table when it exceeds a threshold — usually 0.7 to 0.75. Resizing requires rehashing every key, but this amortized cost keeps average operations at O(1).
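The resize policy can be added to the chaining sketch in a few lines. A hypothetical implementation with a 0.75 threshold (real libraries differ in details such as initial capacity and growth factor):

```python
class ResizingHashTable:
    """Chaining table that doubles its bucket array when the
    load factor alpha = n/m would exceed MAX_LOAD."""

    MAX_LOAD = 0.75

    def __init__(self):
        self.buckets = [[] for _ in range(8)]
        self.size = 0

    def put(self, key, value):
        if (self.size + 1) / len(self.buckets) > self.MAX_LOAD:
            self._resize()
        bucket = self.buckets[hash(key) % len(self.buckets)]
        for i, (k, _) in enumerate(bucket):
            if k == key:
                bucket[i] = (key, value)
                return
        bucket.append((key, value))
        self.size += 1

    def get(self, key):
        for k, v in self.buckets[hash(key) % len(self.buckets)]:
            if k == key:
                return v
        raise KeyError(key)

    def _resize(self):
        old = self.buckets
        self.buckets = [[] for _ in range(2 * len(old))]  # double m
        for bucket in old:
            for key, value in bucket:   # every key must be rehashed
                self.buckets[hash(key) % len(self.buckets)].append((key, value))
```

Each resize is O(n), but doubling means it happens geometrically less often, which is what keeps the amortized cost per operation at O(1).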

FAQ

How does a hash table work?

A hash table stores key-value pairs by computing a hash function on the key to determine which bucket (array slot) to store it in. This gives O(1) average-case lookup, insertion, and deletion. When two keys hash to the same bucket (a collision), the table uses either chaining (linked lists) or open addressing (probing) to resolve it.

What is a hash collision?

A collision occurs when two different keys produce the same hash value, mapping to the same bucket. By the pigeonhole principle, collisions are unavoidable when the number of possible keys exceeds the table size. Good hash functions minimize collisions by distributing keys uniformly across all buckets.

What is the load factor?

The load factor α = n/m is the ratio of stored keys (n) to table size (m). As α increases, collisions become more frequent and performance degrades. Most implementations resize (typically doubling) when α exceeds 0.7–0.75. After resizing, all keys must be rehashed to their new positions.

Chaining vs open addressing — which is better?

Chaining (linked lists at each bucket) is simpler and degrades gracefully with high load factors. Open addressing (linear/quadratic probing or double hashing) has better cache performance and uses less memory. Java's HashMap uses chaining; Python's dict uses open addressing. The choice depends on expected load factor and access patterns.
