Benford's Law Is A Mathematical Mystery, But We Use It All the Time

We want you to think of a data set, preferably a huge one. It doesn’t have to be random; it could be “every social security number” or “the populations of all US cities.” It must, however, span several orders of magnitude: “human height” or “birthday month” won’t do, because all the possible values sit too close together. Got one? Great. Now, which leading digit do you think appears most often in that set?

The question doesn’t seem to make much sense at first glance, does it? The leading digit is simply the first digit of each entry; the leading digit of 633, for example, is 6. Because the set of numbers is enormous and often unpredictable, it seems reasonable that the leading digits would be distributed uniformly: about one-ninth of the data (roughly 11 percent) would begin with a one, one-ninth with a two, one-ninth with a three, and so on. What if we told you otherwise? In fact, one is almost certainly the most common leading digit, and by a significant margin. You’ll typically find that around 30 percent of your data points begin with a one. What is happening?

What is Benford’s law?

This lopsided frequency is the mathematical phenomenon known as Benford’s Law. Despite the name, it was actually discovered, entirely by accident, by astronomer Simon Newcomb. In 1881, while working with logarithm tables, he noticed that the pages for numbers starting with one were noticeably more worn than the others. He promptly sent a note to the American Journal of Mathematics, and a phenomenon was quietly born.

Nobody took the discovery seriously until 1938, when a physicist named Frank Benford decided to put it to the test. There’s a good reason it’s called Benford’s Law rather than Newcomb’s Law: Benford put in the effort. He checked more than 20,000 data points drawn from 20 different data sets, including mortality rates, molecular weights, population estimates, addresses, river flows, numbers printed in Reader’s Digest, and more, and the first-digit law held up across them.

Sounds absurd, doesn’t it? Let’s put it to the test; all we need is a sizable set of naturally occurring data. What about the area, in square kilometers, of every nation in the world? Tallying how often each leading digit appears, and leaving out Vatican City because it is too tiny for our purposes, we arrive at the following. The bars show the, er, actual figures. The line shows what Benford’s Law would predict. Spooky!
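The article does not spell out the formula behind that prediction line. In its usual statement, Benford’s Law says the share of entries with leading digit d is log10(1 + 1/d). Here is a minimal Python sketch of that calculation; the function name benford_expected is our own label, not something from the original piece.

```python
import math

def benford_expected():
    """Predicted share of each leading digit d in base ten: log10(1 + 1/d)."""
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}

expected = benford_expected()

# Print the predicted percentage for each leading digit, 1 through 9.
for digit, share in expected.items():
    print(f"{digit}: {share:.1%}")
```

Under this formula a leading one is predicted about 30.1 percent of the time and a leading nine only about 4.6 percent, which matches the "30% or so" figure quoted earlier.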

Why does Benford’s law exist?

Looking at that chart, you might speculate that this is a human quirk; perhaps we simply like smaller numbers, so we stop enlarging our kingdoms, or whatever, once we reach one million square kilometers. But see this one? Yes, it follows the same pattern. This one, however, is hardly the work of human hands: it shows the leading digits of 2^n.
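The powers-of-two claim is easy to check for yourself. The sketch below tallies the leading digits of 2 raised to the n for n from 1 to 1,000 and sets them beside the usual Benford prediction, log10(1 + 1/d). The cutoff of 1,000 is our own arbitrary choice; the article does not say how many powers its chart used.

```python
import math
from collections import Counter

N = 1000  # how many powers of two to examine (an assumption, not from the article)

# The leading digit is the first character of the number's decimal representation.
counts = Counter(int(str(2 ** n)[0]) for n in range(1, N + 1))

for d in range(1, 10):
    observed = counts[d] / N
    predicted = math.log10(1 + 1 / d)
    print(f"{d}: observed {observed:.1%}, predicted {predicted:.1%}")
```

With these settings, 301 of the 1,000 powers begin with a one, right in line with the roughly 30 percent the law predicts.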

Now, I’m sure some of the mathematicians among you are already heading to the comments section to point out that this effect probably depends on the base you choose. We work in base ten, so when we say that a data set’s leading digits are often ones, we really mean that many of the entries are ones, or something-teens, or one hundred and somethings, and so on.

The same values will be written differently if we switch to, say, base five or hexadecimal; they won’t necessarily start with a one, so it stands to reason that the frequency of leading digits will change too. The wonderful part is that the law holds regardless of base. Let’s convert our nation sizes data set into base… Oh, let’s go with base 8.
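We don’t have the nation sizes data to hand, so here is a rough sketch of the same experiment using stand-in data: the first 1,000 powers of three (our own substitute, not the article’s data set). It assumes the standard generalization of the law to base b, where the predicted share of leading digit d is log_b(1 + 1/d). Powers of two would be a poor choice here, because 8 is itself a power of two and their base-8 leading digits simply cycle through 1, 2 and 4.

```python
import math
from collections import Counter

def leading_digit(value, base):
    """Leading digit of a positive integer when written in the given base."""
    while value >= base:
        value //= base  # drop the last digit in that base
    return value

BASE = 8
N = 1000  # sample size is an arbitrary choice

# Stand-in data: powers of three, NOT the article's nation sizes.
sample = [3 ** n for n in range(1, N + 1)]
counts = Counter(leading_digit(v, BASE) for v in sample)

for d in range(1, BASE):
    predicted = math.log(1 + 1 / d, BASE)
    print(f"{d}: observed {counts[d] / N:.1%}, predicted {predicted:.1%}")
```

In base 8 the possible leading digits run from 1 to 7, and the predicted share for a leading one rises to log_8(2), exactly one-third.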