Optimal Discrete Sampling of Histogram

Click For Summary
SUMMARY

The discussion centers on the optimal discrete sampling of a normalized histogram, denoted as ##h(P)##, with ##n## bins to generate a sample of size ##k## that minimizes the objective function $$\sum_{i=0}^{n-1}|h(S)_i - h(P)_i|$$. The variables ##c(i)## represent the number of elements selected from each bin, constrained by $$\sum_{i=0}^{n-1}c(i) = k$$. This problem falls under the category of "integer programming," where the objective function is defined as $$C(c_0, c_1,...c_m) = \sum_{i=0}^{m} |h(S)_i - h(P)_i|$$. The discussion highlights the challenge of distributing discrete samples fairly according to continuous scores without sorting.

PREREQUISITES
  • Understanding of normalized histograms and probability density functions (PDFs).
  • Familiarity with integer programming concepts and objective functions.
  • Knowledge of constraints in optimization problems.
  • Basic principles of sampling theory and its limitations.
NEXT STEPS
  • Research integer programming techniques for optimization problems.
  • Explore advanced sampling methods in statistics, particularly in non-independent sampling scenarios.
  • Study the implications of objective functions in optimization and their applications.
  • Investigate algorithms for efficient distribution of discrete resources based on continuous metrics.
USEFUL FOR

Data scientists, statisticians, and operations researchers interested in optimization techniques for sampling distributions and resource allocation problems.

Jarvis323
Messages
1,247
Reaction score
988
I am wondering if this problem has a name, and what is the most efficient way to solve it. Say you have a normalized histogram ##h(P)## (representing a pdf estimated from a large population), with ##n## bins, you want to generate a sample of points ##S## from ##h(P)## of size ##k##, such that $$\sum_{i=0}^{n-1}|h(S)_i - h(P)_i|$$ is minimized. Note that since the histogram is discrete, it is enough to select a ##c(i)## for each bin, where $$\sum_{i=0}^{n-1}c(i) = k$$ and the solution is minimal.
 
Technology news on Phys.org
How is ##c(i)## defined? Is it the number of elements selected that have a value equal to the value represented by the ##i##th bin? Is ##c(i) = k\ h(P)_i## ?
 
##c(i)## is just the number of elements you choose to generate in bin ##i##.

Another way of thinking about it is that you have ##k## apples to give out and ##n## people. Each person has a score ##p \in \mathbb{R}##, which says how deserving they are of apples. You want to give out the apples as fairly as possible according to ##p##. Since ##p## is continuous and you have discrete apples, some people may deserve slightly more than they get, or get more than they deserve.

I guess it seems somewhat simple, and I have an idea to do it efficiently, which is to just give everyone the number of apples they wholy deserve, and then sort the people according to how much apple they still deserve, and then just give the ##m## apples left over to the first ##m## people.

But it would be nice if you could do it without having to do the sorting.
 
Jarvis323 said:
##c(i)## is just the number of elements you choose to generate in bin ##i##.

I think what your are asking is illustrated by the following problem: Suppose there are 3 bins with ##h(P)_0 = 0.4, h(P)_1 = 0.4, h(P)_2 = 0.2##. and we want to generate a sample of size 11. We can't pick values of the ##c(i)## to make ##\sum_{i=0}^2 |h(S_)i - h(P)_i|## exactly zero. So what are the ways of assigning values to the ##c(i)## that minimize that sum?

I am wondering if this problem has a name,
Problems of this general type are called "integer programming" problems. The function ##C(c_0, c_2,...c_m) = \sum_{i=0}^{m} |h(S_)i - h(P)_i|## is the "objective function". The constraints are ## 0 \le c_i \le n ##, ##\sum_{i=0}^m c_i = n##, where ##n## represents the size of the sample to be generated.

I don't know what, if any, name is given to a problem with such an objective function. Your objective function is unusual because it does not depend on what the bins represent, but only on what fraction of the population they contain. A particular example of this problem may have more than one solution because two different bins might each contain the same fraction of the population and the distinction in what the two bins represent doesn't affect the objective function.

It's worth noting that if you generate a sample of size ##n## by minimizing the objective function, you are not generating ##n## independent random samples from the population, so theorems in statistics that apply to random sampling do not apply to generating samples in this manner.
 
  • Like
Likes   Reactions: Jarvis323

Similar threads

  • · Replies 2 ·
Replies
2
Views
2K
  • · Replies 7 ·
Replies
7
Views
2K
Replies
26
Views
5K
  • · Replies 0 ·
Replies
0
Views
2K
  • · Replies 12 ·
Replies
12
Views
2K
  • · Replies 15 ·
Replies
15
Views
3K
  • · Replies 4 ·
Replies
4
Views
2K
  • · Replies 4 ·
Replies
4
Views
2K
  • · Replies 17 ·
Replies
17
Views
3K
Replies
31
Views
3K