Optimal Discrete Sampling of Histogram

Click For Summary

Discussion Overview

The discussion revolves around the problem of optimal discrete sampling from a normalized histogram, specifically how to generate a sample of points that minimizes the difference between the histogram of the sample and the original histogram. The scope includes theoretical considerations and potential methods for solving this optimization problem.

Discussion Character

  • Exploratory
  • Technical explanation
  • Debate/contested

Main Points Raised

  • One participant inquires about the name of the problem and the most efficient solution for generating a sample from a normalized histogram.
  • Another participant seeks clarification on the definition of ##c(i)##, questioning whether it represents the number of elements selected for each bin and suggesting a potential formula.
  • A participant clarifies that ##c(i)## is the number of elements chosen for bin ##i## and provides an analogy involving distributing apples based on scores to illustrate the fairness in allocation.
  • Further, the analogy is expanded upon, suggesting a method of initially distributing apples based on whole numbers and then addressing any remaining apples without sorting.
  • A different participant reiterates the definition of ##c(i)## and presents a specific example with three bins, posing a question about minimizing the objective function given certain constraints.
  • Another participant identifies the problem as an "integer programming" problem, describing the objective function and constraints involved, while noting the uniqueness of the objective function's dependence on population fractions rather than bin representations.
  • It is mentioned that generating a sample by minimizing the objective function does not yield independent random samples, which may affect the applicability of statistical theorems related to random sampling.

Areas of Agreement / Disagreement

Participants express differing views on the definition and implications of the problem, with some agreeing on the nature of ##c(i)## while others explore various methods of solving the optimization challenge. The discussion remains unresolved regarding the most efficient approach and the specific naming of the problem.

Contextual Notes

Limitations include the need for further clarification on the assumptions underlying the problem, the dependence on definitions of terms like ##c(i)##, and the unresolved nature of the mathematical steps involved in minimizing the objective function.

Jarvis323
Messages
1,247
Reaction score
988
I am wondering if this problem has a name, and what is the most efficient way to solve it. Say you have a normalized histogram ##h(P)## (representing a pdf estimated from a large population), with ##n## bins, you want to generate a sample of points ##S## from ##h(P)## of size ##k##, such that $$\sum_{i=0}^{n-1}|h(S)_i - h(P)_i|$$ is minimized. Note that since the histogram is discrete, it is enough to select a ##c(i)## for each bin, where $$\sum_{i=0}^{n-1}c(i) = k$$ and the solution is minimal.
 
Technology news on Phys.org
How is ##c(i)## defined? Is it the number of elements selected that have a value equal to the value represented by the ##i##th bin? Is ##c(i) = k\ h(P)_i## ?
 
##c(i)## is just the number of elements you choose to generate in bin ##i##.

Another way of thinking about it is that you have ##k## apples to give out and ##n## people. Each person has a score ##p \in \mathbb{R}##, which says how deserving they are of apples. You want to give out the apples as fairly as possible according to ##p##. Since ##p## is continuous and you have discrete apples, some people may deserve slightly more than they get, or get more than they deserve.

I guess it seems somewhat simple, and I have an idea to do it efficiently, which is to just give everyone the number of apples they wholy deserve, and then sort the people according to how much apple they still deserve, and then just give the ##m## apples left over to the first ##m## people.

But it would be nice if you could do it without having to do the sorting.
 
Jarvis323 said:
##c(i)## is just the number of elements you choose to generate in bin ##i##.

I think what your are asking is illustrated by the following problem: Suppose there are 3 bins with ##h(P)_0 = 0.4, h(P)_1 = 0.4, h(P)_2 = 0.2##. and we want to generate a sample of size 11. We can't pick values of the ##c(i)## to make ##\sum_{i=0}^2 |h(S_)i - h(P)_i|## exactly zero. So what are the ways of assigning values to the ##c(i)## that minimize that sum?

I am wondering if this problem has a name,
Problems of this general type are called "integer programming" problems. The function ##C(c_0, c_2,...c_m) = \sum_{i=0}^{m} |h(S_)i - h(P)_i|## is the "objective function". The constraints are ## 0 \le c_i \le n ##, ##\sum_{i=0}^m c_i = n##, where ##n## represents the size of the sample to be generated.

I don't know what, if any, name is given to a problem with such an objective function. Your objective function is unusual because it does not depend on what the bins represent, but only on what fraction of the population they contain. A particular example of this problem may have more than one solution because two different bins might each contain the same fraction of the population and the distinction in what the two bins represent doesn't affect the objective function.

It's worth noting that if you generate a sample of size ##n## by minimizing the objective function, you are not generating ##n## independent random samples from the population, so theorems in statistics that apply to random sampling do not apply to generating samples in this manner.
 
  • Like
Likes   Reactions: Jarvis323

Similar threads

  • · Replies 2 ·
Replies
2
Views
2K
  • · Replies 7 ·
Replies
7
Views
2K
Replies
26
Views
5K
  • · Replies 0 ·
Replies
0
Views
2K
  • · Replies 12 ·
Replies
12
Views
2K
  • · Replies 15 ·
Replies
15
Views
3K
  • · Replies 4 ·
Replies
4
Views
2K
  • · Replies 4 ·
Replies
4
Views
2K
  • · Replies 17 ·
Replies
17
Views
3K
Replies
31
Views
4K