The Most Common Word Dominates
In English, the word "the" accounts for about 7% of all words in a typical text. "Of" appears about 3.5% of the time, "and" about 2.8%. This roughly regular decay, in which a word's frequency is approximately inversely proportional to its rank, is known as Zipf's Law. It was popularized by George Kingsley Zipf in 1935, though the pattern had been noticed by stenographers decades earlier.
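To make the inverse-rank relationship concrete, here is a minimal sketch that takes the 7% share quoted for the top-ranked word and computes what a strict 1/rank rule would predict for the next ranks. The function name `zipf_prediction` is hypothetical, and the "quoted" values are simply the approximate figures from the paragraph above.

```python
def zipf_prediction(top_share: float, max_rank: int) -> list[float]:
    """Predicted share for ranks 1..max_rank if share is proportional to 1/rank."""
    return [top_share / rank for rank in range(1, max_rank + 1)]


# Approximate percentages quoted in the text for "the", "of", "and".
quoted = [7.0, 3.5, 2.8]
predicted = zipf_prediction(top_share=7.0, max_rank=3)

for rank, (p, q) in enumerate(zip(predicted, quoted), start=1):
    print(f"rank {rank}: predicted {p:.2f}%, quoted {q:.2f}%")
```

The second rank matches the prediction exactly, while the third quoted figure sits somewhat above the strict 1/3 value, which is why the relationship is best read as an approximation.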
A Universal Linguistic Law
What makes Zipf's Law remarkable is its apparent universality. It holds, at least approximately, not just for English but for nearly every natural language studied, including Chinese, Arabic, Finnish, Swahili, and even extinct languages like Sumerian. The exponent of the rank-frequency relationship is typically close to 1.0, suggesting a deep structural property of human communication rather than a quirk of any particular grammar.
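One rough way to check the exponent on a corpus of your own is to fit a straight line to log(frequency) against log(rank); the negated slope is the estimated exponent. The sketch below uses only the standard library. The function name `estimate_zipf_exponent` and the synthetic token list are illustrative assumptions, and an ordinary least-squares fit on log-log data is a crude estimator, so treat the result as indicative only.

```python
import math
from collections import Counter


def estimate_zipf_exponent(tokens: list[str]) -> float:
    """Estimate the Zipf exponent as the negated slope of a log-log line fit."""
    counts = sorted(Counter(tokens).values(), reverse=True)
    if len(counts) < 2:
        raise ValueError("need at least two distinct words to fit a slope")

    xs = [math.log(rank) for rank in range(1, len(counts) + 1)]
    ys = [math.log(count) for count in counts]

    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return -sxy / sxx


# Synthetic corpus (an assumption for illustration): the word of rank r
# appears about 1200 / r times, so the fitted exponent should be near 1.
synthetic_tokens = []
for rank in range(1, 51):
    synthetic_tokens.extend([f"word{rank}"] * round(1200 / rank))

exponent = estimate_zipf_exponent(synthetic_tokens)
print(f"estimated exponent: {exponent:.3f}")
```

On real text, the fit is usually sensitive to the rare-word tail, so it is worth inspecting the log-log plot rather than relying on a single number.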
The Long Tail Problem
Zipf's Law creates a computational challenge: most words in a vocabulary are extremely rare. In a million-word corpus, roughly half of all unique words typically appear only once (such words are called hapax legomena). This long tail means that no matter how large your training data is, there will always be words your model has never seen, which is a fundamental problem in natural language processing.
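The share of hapax legomena is easy to measure directly. The following sketch counts how many distinct words occur exactly once in a token list; `hapax_fraction` is a hypothetical helper name, and the whitespace tokenization of the toy sentence is a simplifying assumption, not a recommendation for real corpora.

```python
from collections import Counter


def hapax_fraction(tokens: list[str]) -> float:
    """Fraction of distinct words (types) that occur exactly once."""
    counts = Counter(tokens)
    if not counts:
        return 0.0
    hapaxes = sum(1 for count in counts.values() if count == 1)
    return hapaxes / len(counts)


# Toy example; a real measurement would use a large tokenized corpus.
sample_tokens = "the cat sat on the mat and the dog sat too".split()
print(f"hapax fraction: {hapax_fraction(sample_tokens):.2f}")
```

A sample this small is not representative of a million-word corpus; it only shows how the statistic is computed.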
Information-Theoretic Optimality
Some research suggests that Zipf's Law may be close to an optimal distribution for communication. On this view, if all words were equally frequent, messages would be longer than necessary; if a single word dominated completely, communication would be impossible, because messages could not be told apart. The Zipfian distribution sits near the sweet spot, balancing information transfer against the cognitive cost of maintaining a large vocabulary.
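As a loose illustration of the two extremes, the sketch below compares the Shannon entropy per word, in bits, of three word distributions over the same vocabulary: uniform, Zipfian with exponent 1, and a degenerate case where one word has all the probability. The vocabulary size of 10,000 and the helper names are assumptions chosen for the example. The comparison shows only that the Zipfian case falls between the extremes; it does not by itself demonstrate optimality.

```python
import math


def entropy_bits(probs: list[float]) -> float:
    """Shannon entropy in bits; zero-probability outcomes contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def zipf_distribution(vocab_size: int, exponent: float = 1.0) -> list[float]:
    """Probabilities proportional to 1 / rank**exponent, normalized to sum to 1."""
    weights = [1 / rank**exponent for rank in range(1, vocab_size + 1)]
    total = sum(weights)
    return [w / total for w in weights]


vocab_size = 10_000  # illustrative assumption
uniform = [1 / vocab_size] * vocab_size
zipfian = zipf_distribution(vocab_size)
single_word = [1.0] + [0.0] * (vocab_size - 1)

for name, dist in [("uniform", uniform), ("zipfian", zipfian), ("single word", single_word)]:
    print(f"{name:>11}: {entropy_bits(dist):.2f} bits per word")
```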