Optimal Discrete Sampling of Histogram

Jarvis323 · Mar 25, 2021

I am wondering if this problem has a name, and what is the most efficient way to solve it. Say you have a normalized histogram ##h(P)## (representing a pdf estimated from a large population), with ##n## bins, you want to generate a sample of points ##S## from ##h(P)## of size ##k##, such that $$\sum_{i=0}^{n-1}|h(S)_i - h(P)_i|$$ is minimized. Note that since the histogram is discrete, it is enough to select a ##c(i)## for each bin, where $$\sum_{i=0}^{n-1}c(i) = k$$ and the solution is minimal.

Stephen Tashi · Mar 26, 2021

How is ##c(i)## defined? Is it the number of elements selected that have a value equal to the value represented by the ##i##th bin? Is ##c(i) = k\ h(P)_i## ?

Jarvis323 · Mar 26, 2021

##c(i)## is just the number of elements you choose to generate in bin ##i##.

Another way of thinking about it is that you have ##k## apples to give out and ##n## people. Each person has a score ##p \in \mathbb{R}##, which says how deserving they are of apples. You want to give out the apples as fairly as possible according to ##p##. Since ##p## is continuous and you have discrete apples, some people may deserve slightly more than they get, or get more than they deserve.

I guess it seems somewhat simple, and I have an idea to do it efficiently, which is to just give everyone the number of apples they wholy deserve, and then sort the people according to how much apple they still deserve, and then just give the ##m## apples left over to the first ##m## people.

But it would be nice if you could do it without having to do the sorting.

Stephen Tashi · Mar 27, 2021

Jarvis323 said:

##c(i)## is just the number of elements you choose to generate in bin ##i##.

I think what your are asking is illustrated by the following problem: Suppose there are 3 bins with ##h(P)_0 = 0.4, h(P)_1 = 0.4, h(P)_2 = 0.2##. and we want to generate a sample of size 11. We can't pick values of the ##c(i)## to make ##\sum_{i=0}^2 |h(S_)i - h(P)_i|## exactly zero. So what are the ways of assigning values to the ##c(i)## that minimize that sum?

I am wondering if this problem has a name,

Problems of this general type are called "integer programming" problems. The function ##C(c_0, c_2,...c_m) = \sum_{i=0}^{m} |h(S_)i - h(P)_i|## is the "objective function". The constraints are ## 0 \le c_i \le n ##, ##\sum_{i=0}^m c_i = n##, where ##n## represents the size of the sample to be generated.

I don't know what, if any, name is given to a problem with such an objective function. Your objective function is unusual because it does not depend on what the bins represent, but only on what fraction of the population they contain. A particular example of this problem may have more than one solution because two different bins might each contain the same fraction of the population and the distinction in what the two bins represent doesn't affect the objective function.

It's worth noting that if you generate a sample of size ##n## by minimizing the objective function, you are not generating ##n## independent random samples from the population, so theorems in statistics that apply to random sampling do not apply to generating samples in this manner.

Optimal Discrete Sampling of Histogram

What is "Optimal Discrete Sampling of Histogram"?

Why is "Optimal Discrete Sampling of Histogram" important in data analysis?

How is the optimal number of bins determined in "Optimal Discrete Sampling of Histogram"?

What is the purpose of finding the optimal bin width in "Optimal Discrete Sampling of Histogram"?

How can "Optimal Discrete Sampling of Histogram" be applied in real-world situations?

Similar threads

Hot Threads

Recent Insights