Understanding Benford's Law Proof: Scaling and Invariance

  • Context: High School 
  • Thread starter: etotheipi
  • Tags: Law, Proof
SUMMARY

Benford's Law states that in many naturally occurring datasets, the leading digit is more likely to be small, with approximately 30% of numbers starting with the digit 1. The law applies to data that are not dimensionless, so the numerical values depend on the units of measurement; any universal probability distribution over such data must therefore be invariant under a change of scale. The discussion works through the mathematical argument behind Benford's Law, emphasizing that the distribution must remain the same whatever units are used, and participants also explore practical applications of Benford's Law in random number generation and data simulation.

PREREQUISITES
  • Understanding of Benford's Law and its mathematical formulation.
  • Familiarity with probability distributions and their properties.
  • Knowledge of scaling transformations in statistical data.
  • Basic concepts of random number generation and simulation techniques.
NEXT STEPS
  • Research the mathematical proof of Benford's Law, particularly the work of Theodore Hill (1998).
  • Explore applications of Benford's Law in forensic accounting and fraud detection.
  • Learn about the implications of scaling transformations in statistical analysis.
  • Investigate methods for generating random numbers that conform to Benford's Law.
USEFUL FOR

Data scientists, statisticians, mathematicians, and anyone involved in data analysis or simulation who seeks to understand the implications of Benford's Law in real-world datasets.

etotheipi
The leading significant digits of numbers in sets of numerical data supposedly follow "Benford's Law", which asserts that the probability that the first digit in a given data point is ##D## is about ##\log_{10}(1+ \frac{1}{D})##. An upshot is that we expect ~30% of leading digits to be ##1##.

The proof is outlined in the Wolfram MathWorld article; I can follow their reasoning but can't understand the very first step. They say:
Benford's law applies to data that are not dimensionless, so the numerical values of the data depend on the units. If there exists a universal probability distribution ##P(x)## over such numbers, then it must be invariant under a change of scale, so ##P(kx) = f(k)P(x)##

If you take that to be true you can show ##f(k) = \frac{1}{k}##, though I wonder how you come up with the above assertion in the first place. What do we mean by scaling? I thought ##P(x)## was just supposed to model a PDF over the digits from 1 to 9.
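
[For reference, the step from ##P(kx) = f(k)P(x)## to ##f(k) = \frac{1}{k}## is just normalisation; a minimal sketch, assuming ##P## is normalised over ##(0,\infty)##: substituting ##u = kx##,
$$\int_0^{\infty} P(kx)\,dx = \frac{1}{k}\int_0^{\infty} P(u)\,du = \frac{1}{k}, \qquad \int_0^{\infty} f(k)P(x)\,dx = f(k),$$
and equating the two gives ##f(k) = \frac{1}{k}##.]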
 
.Scott
He is saying that Benford's law is true when the numbers being considered have a "universal probability distribution" which he then defines as being invariant to the units of measure.

So Benford's Law will work when you have the same distribution of numbers regardless of the units of measure - for example: feet, miles, cm, meters, etc.
 
.Scott said:
He is saying that Benford's law is true when the numbers being considered have a "universal probability distribution" which he then defines as being invariant to the units of measure.

Though if the domain of ##P(x)## is digits from 1-9, then what does it mean to consider ##P(kx)##?

I can imagine that if you had a pdf ##f_X(x)## whose domain was ##x \in [0,5]##, if you then considered something like ##Y = 5X## (i.e. converting into a weird new unit), then the domain of this new ##f_Y(y)## is ##y \in [0,25]##, and to normalise it we would squish the curve down by a factor of 5.
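
[That squishing is exactly the standard change-of-variables rule: if ##Y = kX## then
$$f_Y(y) = \frac{1}{k} f_X\!\left(\frac{y}{k}\right),$$
so for ##k = 5## the domain stretches to ##[0, 25]## and the curve is scaled down by a factor of 5, as described above.]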

I guess I wonder what they mean by 'such numbers'; I had assumed this was only over the significant digits.
 
BvU
Data that are not dimensionless come about by dividing some observation by a unit: what you report when you write e.g. length = 3.4 m is actually length / 1 m = 3.4, and there is your 1/k!
 
Another approach: from 1, the next digit up is 50% (oops... :biggrin:) 100% away, but the next digit down is only 10% away.
 
etotheipi said:
Though if the domain of ##P(x)## is digits from 1-9, then what does it mean to consider ##P(kx)##?

I can imagine that if you had a pdf ##f_X(x)## whose domain was ##x \in [0,5]##, if you then considered something like ##Y = 5X## (i.e. converting into a weird new unit), then the domain of this new ##f_Y(y)## is ##y \in [0,25]##, and to normalise it we would squish the curve down by a factor of 5.

I guess I wonder what they mean by 'such numbers'; I had assumed this was only over the significant digits.

The ##x## in ##P(x)## are all possible measurements. That ##x## is not 1,2,3,4,5,6,7,8,9.
Later in the proof, he uses ##P(D)## - where ##D## is one of the nine decimal digits.
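
[To make the connection explicit, a sketch of how the article gets from ##P(x)## to ##P(D)##: with ##P(x) \propto \frac{1}{x}## restricted to one decade, say ##[1, 10)##,
$$P(D) = \frac{\int_D^{D+1} \frac{dx}{x}}{\int_1^{10} \frac{dx}{x}} = \frac{\ln\left(\frac{D+1}{D}\right)}{\ln 10} = \log_{10}\!\left(1 + \frac{1}{D}\right),$$
which is Benford's law.]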
 
BvU said:
Data that are not dimensionless come about by dividing some observation by a unit: what you report when you write e.g. length = 3.4 m is actually length / 1 m = 3.4, and there is your 1/k!

I see; the dimensions are the key part. Thank you!
 
.Scott said:
The ##x## in ##P(x)## are all possible measurements. That ##x## is not 1,2,3,4,5,6,7,8,9.
Later in the proof, he uses ##P(D)## - where ##D## is one of the nine decimal digits.

That makes more sense. Thanks for clarifying!

My "homework" (well, I guess isn't all work "homework" now...?) is supposedly to trawl through a bunch of newspapers to see if the distribution fits... but the maths behind it is much more interesting!
 
etotheipi said:
I see; the dimensions are the key part. Thank you!

PeroK
I must admit I didn't know there was such a law. I thought it was just fairly obvious that numbers tend to start with ##1##. For example, take the price of everything in the supermarket and then apply inflation - it doesn't matter what percentage you use.

If something starts at £1, it takes a long time to get to £2, less time to get to £3 and so on. With inflation at 3%, say, the price spends about 24 years at £1.something and only about 4 years at £9.something, and then the same cycle repeats for £10.something, etc. By a rough calculation, therefore, the price of 27% of all items should start with ##1##.

It doesn't apply to phone numbers, for example, as these are codes and not actually numbers.
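
[A quick numerical check of the inflation argument above; the 3% rate is taken from the example, the rest is standard:]

```python
import math

rate = 0.03  # 3% annual inflation, as in the example above

# At a constant growth rate r, the time spent with leading digit d (the time
# for the price to grow from d to d+1) is log((d+1)/d) / log(1+r).
total = math.log(10) / math.log(1 + rate)  # years for one full 1 -> 10 cycle
for d in range(1, 10):
    years = math.log((d + 1) / d) / math.log(1 + rate)
    print(f"digit {d}: {years:4.1f} years ({years / total:.1%} of the cycle)")
```

[This gives about 23.4 years at £1.something and 3.6 years at £9.something; the exact fraction of the cycle spent at leading digit 1 is ##\log_{10} 2 \approx 30.1\%##, close to the rough 27% above.]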
 
  • #10
PeroK said:
I must admit I didn't know there was such a law. I thought it was just fairly obvious that numbers tend to start with ##1##. For example, take the price of everything in the supermarket and then apply inflation - it doesn't matter what percentage you use.

At first I thought it was slightly outlandish, but when I thought about it for a little longer it didn't seem so far-fetched. I quite like your inflation example: it's easy to see how, in a fixed period of time, you get larger and larger raw increases, which push you quickly through the £70s, £80s, etc. into the £100s, where you then spend a bit longer whilst the rate of raw increase keeps growing throughout the other hundreds.

Some of the other things on that list are a bit harder to visualise (e.g. area of rivers, X-ray volts?) but I suppose the same principle applies.

Cool stuff!
 
  • #11
PS On a topical, if grim, note: in 37% of countries the number of Coronavirus cases begins with 1. And another 18% begin with 2. You can see the same "inflationary" pattern in the numbers.
 
  • #12
Klystron
I encountered Benford's law, though probably not a proof, in several contexts including designing pseudo-random number generators to model actual data sets. This quote contains a useful distinction [added LaTeX formatting]:
The quantity ##P(d)## is proportional to the space between ##d## and ##d+1## on a logarithmic scale. Therefore, this is the distribution expected if the logarithms of the numbers (but not the numbers themselves) are uniformly and randomly distributed.

When slide rules were in general use and, therefore, logarithmic numeric forms were more pervasive, Benford's distribution in otherwise random data sets would have been more apparent.
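
[The quoted point is easy to check numerically: generate numbers whose base-10 logarithms are uniformly distributed, and the leading digits come out Benford-distributed. A minimal Python sketch:]

```python
import random
from collections import Counter

random.seed(0)

# Numbers whose base-10 logs are uniform over six whole decades, 1 to 10^6.
samples = [10 ** random.uniform(0, 6) for _ in range(100_000)]

# Tally the first significant digit of each sample.
counts = Counter(int(str(x)[0]) for x in samples)
for d in range(1, 10):
    print(d, round(counts[d] / len(samples), 3))  # ~ log10(1 + 1/d)
```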
 
  • #13
Klystron said:
When slide rules were in general use and, therefore, logarithmic numeric forms were more pervasive, Benford's distribution in otherwise random data sets would have been more apparent.

That's a bit like the part in the Wolfram article where it mentions the first few pages in tables of logarithms were observed to be more worn than those later on :smile:

Klystron said:
I encountered Benford's law, though probably not a proof, in several contexts including designing pseudo-random number generators to model actual data sets.

I know very little about how random number generators work (I only really know of the iterative approach with the seed and the remainders... not sure if it has a name!); if you were designing a pseudo-random number generator whose aim was to generate numbers with a uniform distribution between 0 and 100,000, I wouldn't expect Benford's law to be obeyed, since you're imposing an arbitrary cut-off in possible values. Is that right?

I wonder, how do you get the algorithm to spit out numbers that conform to the law? Was such an alteration necessary for what you were doing?
 
  • #14
Klystron
etotheipi said:
That's a bit like the part in the Wolfram article where it mentions the first few pages in tables of logarithms were observed to be more worn than those later on :smile:
I know very little about how random number generators work (I only really know of the iterative approach with the seed and the remainders... not sure if it has a name!); if you were designing a pseudo-random number generator whose aim was to generate numbers with a uniform distribution between 0 and 100,000, I wouldn't expect Benford's law to be obeyed, since you're imposing an arbitrary cut-off in possible values. Is that right?

I wonder, how do you get the algorithm to spit out numbers that conform to the law? Was such an alteration necessary for what you were doing?
My memory seems cloudy of late, but one application involved a seeded RNG that produced numbers between zero and one that met the requirements for randomness. The application programmers used simple arithmetic to turn those into digits from 1...10 for CFD test data, but the requestor was not satisfied with the output.

I applied Benford's distribution to weight the generated data (digits) to more accurately mimic actual fluids, similar to this table:

[Image: table of Benford's first-digit probabilities, ##P(D) = \log_{10}(1 + \frac{1}{D})## for ##D = 1, \dots, 9##]
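
[Two simple ways to generate digits with that weighting, as a sketch; this is the standard Benford table, not necessarily the method used in the project described:]

```python
import math
import random

random.seed(1)

digits = list(range(1, 10))
weights = [math.log10(1 + 1 / d) for d in digits]  # Benford's probabilities

# 1) Weighted sampling against the Benford table:
sample_a = random.choices(digits, weights=weights, k=10)

# 2) Inverse transform: for u uniform on [0, 1), floor(10**u) lands in
#    [d, d+1) with probability log10(1 + 1/d), i.e. a Benford digit.
sample_b = [int(10 ** random.random()) for _ in range(10)]

print(sample_a, sample_b)
```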

A colleague really improved what she called my "brutal force" methods by subtle reiteration of the weighting algorithm to simulate more natural data streams.

I later applied similar algorithms to a group artificial intelligence project at university that required simulated shoppers entering and leaving queues to initiate the simulation. Previous data sets produced unnatural predictable bunches of shoppers. Applying Benford's Law lent verisimilitude to the sim that led to a successful project.
 
  • #15
Klystron said:
I later applied similar algorithms to a group artificial intelligence project at university that required simulated shoppers entering and leaving queues to initiate the simulation. Previous data sets produced unnatural predictable bunches of shoppers. Applying Benford's Law lent verisimilitude to the sim that led to a successful project.

That's very cool, how such a peculiar mathematical quirk can change the results so drastically. Thanks for sharing!
 
  • #16
WWGD
On a separate note, I wonder if Benford's law relates to Zipf's law, on the distribution of proportions of traits in a population, where e.g. the 2nd, 3rd, etc. largest cities in a country will have proportions of the total population that remain constant across different populations. At any rate, maybe a good point to make is that Benford's is a Practical Statistical but not a Mathematical law.
 
  • #17
pbuk
WWGD said:
On a separate note, I wonder if Benford's law relates to Zipf's law, on the distribution of proportions of traits in a population, where e.g. the 2nd, 3rd, etc. largest cities in a country will have proportions of the total population that remain constant across different populations.
No, Zipf's law is an empirical law - it is derived from the observation of different data sets, some of which obey it and some of which don't. Benford's law on the other hand is deterministic - oversimplifying a little, if you have a data set with a distribution that is scale invariant then it can be shown (Hill 1998) that, for example, the first digits of the data set will follow Benford's law.

WWGD said:
At any rate, maybe a good point to make is that Benford's is a Practical Statistical but not a Mathematical law.
Not sure what your distinction is there, but I would consider it a 'Mathematical Law' in a way that Zipf's law, the Pareto principle etc. are not, although even Theodore Hill in his original paper and in subsequent publications does not use the word 'proof', and nor would I. The distinction that makes it mathematical is that the explanation for Zipf and Pareto lies in socio-economic or other factors intrinsic to the data that is being studied whereas the explanation for Benford's law lies in the mathematical properties of the numbers used to measure the data.
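
[The scale-invariance point above is also easy to see numerically: rescale a Benford-conforming data set by any constant, e.g. a hypothetical inches-to-cm conversion, and the leading-digit frequencies barely move. A sketch:]

```python
import random
from collections import Counter

random.seed(2)

def first_digit_freqs(data):
    """Relative frequency of each first significant digit, 1 through 9."""
    counts = Counter(int(f"{x:e}"[0]) for x in data)
    return [round(counts[d] / len(data), 3) for d in range(1, 10)]

# A Benford-conforming sample: uniform base-10 logs over five whole decades.
data = [10 ** random.uniform(0, 5) for _ in range(100_000)]

print(first_digit_freqs(data))                      # ~ log10(1 + 1/d)
print(first_digit_freqs([2.54 * x for x in data]))  # rescaled: nearly identical
```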
 
  • #18
WWGD
pbuk said:
No, Zipf's law is an empirical law - it is derived from the observation of different data sets, some of which obey it and some of which don't. Benford's law on the other hand is deterministic - oversimplifying a little, if you have a data set with a distribution that is scale invariant then it can be shown (Hill 1998) that, for example, the first digits of the data set will follow Benford's law.

Not sure what your distinction is there, but I would consider it a 'Mathematical Law' in a way that Zipf's law, the Pareto principle etc. are not, although even Theodore Hill in his original paper and in subsequent publications does not use the word 'proof', and nor would I. The distinction that makes it mathematical is that the explanation for Zipf and Pareto lies in socio-economic or other factors intrinsic to the data that is being studied whereas the explanation for Benford's law lies in the mathematical properties of the numbers used to measure the data.
I am not an expert on it, but basic sources like Wiki and Wolfram describe it as something that tends to occur in some datasets. So I would not call it Mathematical unless that statement was unsubstantiated.
 
  • #19
WWGD said:
I am not an expert on it, but basic sources like Wiki and Wolfram describe it as something that tends to occur in some datasets. So I would not call it Mathematical unless that statement was unsubstantiated.
I don't consider Wikipedia authoritative. Be careful with the word 'tends': you seem to be using it to mean 'has a tendency to', whereas when Wolfram says 'Benford's law states that in listings, tables of statistics, etc., the digit 1 tends to occur with probability ~30%, much greater than the expected 11.1% (i.e., one digit out of 9)', they are using it to mean 'in the limit approaches'.

But in any case I am not really concerned with linguistic definitions :wink: if you don't want to use the term mathematical to describe this phenomenon then don't.
 
