Optimal Discrete Sampling of Histogram

AI Thread Summary
The discussion centers on the challenge of optimal discrete sampling from a normalized histogram, aiming to minimize the difference between a sample histogram and the original population histogram. The problem can be classified as an "integer programming" issue, where the objective function measures the discrepancy between the two histograms. Participants explore how to define the number of samples per bin, denoted as c(i), while adhering to constraints on total sample size. A proposed method involves distributing samples based on the proportionate "deserving" scores of bins, though concerns about the need for sorting arise. Overall, the conversation highlights the complexity of achieving an optimal sampling strategy while acknowledging limitations in statistical independence.
Jarvis323
Messages
1,247
Reaction score
988
I am wondering if this problem has a name, and what is the most efficient way to solve it. Say you have a normalized histogram ##h(P)## (representing a pdf estimated from a large population), with ##n## bins, you want to generate a sample of points ##S## from ##h(P)## of size ##k##, such that $$\sum_{i=0}^{n-1}|h(S)_i - h(P)_i|$$ is minimized. Note that since the histogram is discrete, it is enough to select a ##c(i)## for each bin, where $$\sum_{i=0}^{n-1}c(i) = k$$ and the solution is minimal.
 
Technology news on Phys.org
How is ##c(i)## defined? Is it the number of elements selected that have a value equal to the value represented by the ##i##th bin? Is ##c(i) = k\ h(P)_i## ?
 
##c(i)## is just the number of elements you choose to generate in bin ##i##.

Another way of thinking about it is that you have ##k## apples to give out and ##n## people. Each person has a score ##p \in \mathbb{R}##, which says how deserving they are of apples. You want to give out the apples as fairly as possible according to ##p##. Since ##p## is continuous and you have discrete apples, some people may deserve slightly more than they get, or get more than they deserve.

I guess it seems somewhat simple, and I have an idea to do it efficiently, which is to just give everyone the number of apples they wholy deserve, and then sort the people according to how much apple they still deserve, and then just give the ##m## apples left over to the first ##m## people.

But it would be nice if you could do it without having to do the sorting.
 
Jarvis323 said:
##c(i)## is just the number of elements you choose to generate in bin ##i##.

I think what your are asking is illustrated by the following problem: Suppose there are 3 bins with ##h(P)_0 = 0.4, h(P)_1 = 0.4, h(P)_2 = 0.2##. and we want to generate a sample of size 11. We can't pick values of the ##c(i)## to make ##\sum_{i=0}^2 |h(S_)i - h(P)_i|## exactly zero. So what are the ways of assigning values to the ##c(i)## that minimize that sum?

I am wondering if this problem has a name,
Problems of this general type are called "integer programming" problems. The function ##C(c_0, c_2,...c_m) = \sum_{i=0}^{m} |h(S_)i - h(P)_i|## is the "objective function". The constraints are ## 0 \le c_i \le n ##, ##\sum_{i=0}^m c_i = n##, where ##n## represents the size of the sample to be generated.

I don't know what, if any, name is given to a problem with such an objective function. Your objective function is unusual because it does not depend on what the bins represent, but only on what fraction of the population they contain. A particular example of this problem may have more than one solution because two different bins might each contain the same fraction of the population and the distinction in what the two bins represent doesn't affect the objective function.

It's worth noting that if you generate a sample of size ##n## by minimizing the objective function, you are not generating ##n## independent random samples from the population, so theorems in statistics that apply to random sampling do not apply to generating samples in this manner.
 
Dear Peeps I have posted a few questions about programing on this sectio of the PF forum. I want to ask you veterans how you folks learn program in assembly and about computer architecture for the x86 family. In addition to finish learning C, I am also reading the book From bits to Gates to C and Beyond. In the book, it uses the mini LC3 assembly language. I also have books on assembly programming and computer architecture. The few famous ones i have are Computer Organization and...
I have a quick questions. I am going through a book on C programming on my own. Afterwards, I plan to go through something call data structures and algorithms on my own also in C. I also need to learn C++, Matlab and for personal interest Haskell. For the two topic of data structures and algorithms, I understand there are standard ones across all programming languages. After learning it through C, what would be the biggest issue when trying to implement the same data...
Back
Top