# Entropy of a histogram with different bin sizes

1. Nov 2, 2012

### obesogen

I'm wondering if there's an expression/correction for finding the entropy of a density using histograms of different bin sizes. I know that as the number of bins increases, entropy will also increase (given a sufficient number of data points), but is there an expression relating the two? All I can find is work on finding the optimal bin size given some data.
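A quick numerical illustration of the effect I mean (a NumPy sketch; the Gaussian data, sample size, and bin counts are arbitrary choices, just for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=100_000)  # plenty of data points, per the caveat above

def hist_entropy(x, bins):
    """Shannon entropy (nats) of the normalized histogram of x."""
    counts, _ = np.histogram(x, bins=bins)
    p = counts / counts.sum()
    p = p[p > 0]  # empty bins contribute 0 * log 0 := 0
    return -np.sum(p * np.log(p))

# entropy keeps climbing as the binning gets finer
entropies = [hist_entropy(data, b) for b in (8, 32, 128, 512)]
```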

2. Nov 4, 2012

### obesogen

Any ideas at all?

3. Nov 4, 2012

### Stephen Tashi

Your question is slightly ambiguous. On the one hand, you might have a formula for a probability density $f(x)$ and be trying to approximate $-\int f(x) \ln f(x)\,dx$ using the areas of a finite number of rectangles. Or you might have a set of data with an unknown probability distribution and be trying to estimate the entropy of that unknown distribution. Or you might declare that the data defines an empirical distribution (i.e., a random sample from the distribution is one of the data points selected at random) and want to know the entropy of that empirical distribution.
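To make the first two readings concrete, here is a rough sketch (assuming NumPy; the standard normal density, grid, sample size, and bin count are arbitrary choices):

```python
import numpy as np

# Reading 1: known density f(x) -- approximate the differential entropy
# -integral of f(x) ln f(x) dx of a standard normal by rectangles.
xs = np.linspace(-8.0, 8.0, 4001)
dx = xs[1] - xs[0]
f = np.exp(-xs**2 / 2) / np.sqrt(2 * np.pi)
h_rect = -np.sum(f * np.log(f)) * dx
h_exact = 0.5 * np.log(2 * np.pi * np.e)  # known closed form for N(0, 1)

# Reading 2: unknown density -- plug-in estimate from a sample, i.e. the
# entropy of the empirical bin probabilities (this one depends on the binning).
rng = np.random.default_rng(1)
sample = rng.normal(size=50_000)
counts, _ = np.histogram(sample, bins=100)
p = counts / counts.sum()
p = p[p > 0]
h_discrete = -np.sum(p * np.log(p))
```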

4. Nov 4, 2012

### obesogen

Thanks for the reply. The case is approximating an unknown probability distribution with a histogram. I'm actually trying to figure out how many 'bins' I need to discretize the state space in a dynamical system, so I am looking for the number of bins that sufficiently decreases the entropy (i.e., where the 'elbow' in a # of states vs. entropy plot would be). The problem is that adding more bins always increases the entropy, so I'm trying to find an expression to correct for this.
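The kind of correction I'm after would look something like adding $\log \Delta$ for bin width $\Delta$, since (as far as I understand the standard discrete-to-differential entropy relation) the binned entropy behaves like $h(X) + \log(1/\Delta)$. A rough NumPy sketch (Gaussian data and bin counts are illustrative only):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=200_000)

def corrected_entropy(x, bins):
    """Histogram entropy plus log(bin width): with enough data this
    approximates the differential entropy, and so it is roughly
    stable across different bin counts."""
    counts, edges = np.histogram(x, bins=bins)
    p = counts / counts.sum()
    p = p[p > 0]
    return -np.sum(p * np.log(p)) + np.log(edges[1] - edges[0])

vals = [corrected_entropy(x, b) for b in (16, 64, 256)]
spread = max(vals) - min(vals)  # small once the correction is applied
```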

5. Nov 4, 2012

### Stephen Tashi

I don't know much about dynamical systems, so I don't know what the criterion is for "sufficiently" decreasing the entropy. If we are talking about a continuous distribution, doesn't it have a definite entropy, not one that can be adjusted by the number of bins?

Are you trying to find the (in some sense) optimum estimate for the entropy of a continuous distribution from a given set of data?

6. Nov 4, 2012

### obesogen

Sorry for being vague. I'm looking at the entropy of P(state j | state i,action a), so for a system obeying the Markov property (which this one does), the probability distribution should become sparser and sparser as the state space is better represented, ideally being 1 for one state and 0 for the rest. The way I thought to represent this sparseness is entropy, although I am definitely open to other suggestions. So in this way, I'm really not looking at a fixed density, but I thought I could apply some ideas on histograms to this problem. As for the "sufficient" part, that was probably an abuse of terminology; I would be satisfied with finding an 'elbow' of an entropy vs. number of bins graph.
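A toy version of the sparseness measure I have in mind (NumPy; the transition counts are made up to mimic a well-resolved vs. an under-resolved state):

```python
import numpy as np

# Hypothetical transition counts out of one fixed (state i, action a):
counts_sharp = np.array([30, 0, 0, 0])   # well-resolved: deterministic
counts_spread = np.array([10, 8, 7, 5])  # under-resolved: states mixed up

def row_entropy(counts):
    """Entropy (nats) of P(state j | state i, action a) estimated
    from transition counts for one fixed (i, a)."""
    p = counts / counts.sum()
    p = p[p > 0]
    return -np.sum(p * np.log(p))

h_sharp = row_entropy(counts_sharp)    # 0: no uncertainty at all
h_spread = row_entropy(counts_spread)  # close to log(4): very uncertain
```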

7. Nov 5, 2012

### Stephen Tashi

Let me see if I can understand that. I'll speculate.

We have a physical process that (in reality) has states defined by continuous variables. For example, it might be defined by (temperature, concentration of substance A, concentration of substance B). I don't know how the "actions" are parameterized. Are they also continuous? -- like adding heat at a rate of 0.342 BTUs per second? The physical process is a continuous time Markov process.

I don't know how such a process is described using continuous variables. Perhaps there is a function that gives the "instantaneous" probability of a transition to state (t1,A1,B1) given you are in state (t0,A0,B0) and take action F. I can see that for a fixed (t0,A0,B0) and F there is a probability distribution over all possible states (t1,A1,B1). That distribution would have an entropy. Do you want entropy as a function of (t0,A0,B0,F)?

I don't know what is meant by the "elbow".

8. Nov 6, 2012

### obesogen

Yes, you're right--my states are actually continuous. I am trying to discretize them so I can use learning algorithms that operate on tables. That being the case, the actions I make available to the agent are also a discrete set. My system is also in discrete time.

My idea behind calculating entropy of P(state j | state i,action a) is that, since the system is a Markov process, if I have represented the state space well enough (i.e., have enough vectors in the discretization), the transition probabilities (which are in reality deterministic for the 'real' states) will become very sparse. If I underrepresent the states, I will classify two different states as the same, and consequently, my transition probabilities will not be as sparse. i.e., transition probabilities will be more 'uncertain'--have more entropy.

Ideally, the entropy would then be monotonically decreasing for more vectors added to the discretization. By 'elbow', I mean the point in a monotonically decreasing graph right before the curve begins to flatten out, i.e., point of highest curvature. Analogous to identifying the number of labels in a clustering problem by looking at # of labels vs. inter-cluster error.
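One way to locate that elbow programmatically is to take the point farthest from the chord joining the curve's endpoints (the same trick used for # of labels vs. inter-cluster error plots); a sketch with a made-up curve (NumPy, toy data):

```python
import numpy as np

# Toy monotonically decreasing curve with an elbow at x = 5
x = np.arange(1, 21, dtype=float)
y = np.where(x < 5, 10.0 - 2.0 * (x - 1), 2.0 - 0.05 * (x - 5))

# Elbow = point with the greatest perpendicular distance from the
# straight line through the first and last points of the curve.
p1 = np.array([x[0], y[0]])
p2 = np.array([x[-1], y[-1]])
d = p2 - p1
dist = np.abs(d[1] * (x - p1[0]) - d[0] * (y - p1[1])) / np.hypot(d[0], d[1])
elbow = x[np.argmax(dist)]  # recovers x = 5
```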

The problem, though, is that the graph looks like an upside-down elbow, because the entropy calculated on more bins keeps increasing. I don't think all is lost, though, since the distribution itself determines how much entropy increases when bins are added. For example, a uniform distribution gains the most entropy from adding bins--its entropy grows as the log of the number of bins. The more concentrated the density is, the less adding bins affects the entropy.
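Concretely (NumPy; the peaked density's mean, width, and the bin counts are arbitrary choices): a Uniform(0, 1) density binned into $n$ equal cells has mass $1/n$ per cell, so its entropy is exactly $\log n$ -- linear against log(number of bins) -- while a sharply peaked density gains far less from extra bins:

```python
import numpy as np

ns = np.array([2, 4, 8, 16, 32])

# Uniform(0, 1) in n equal cells: every cell has mass 1/n, so H = log(n).
h_uniform = np.log(ns)

# A sharply peaked density barely notices extra bins: almost all of
# its mass stays inside a single cell.
rng = np.random.default_rng(3)
x = rng.normal(0.3, 0.01, size=100_000)  # narrow bump inside (0, 1)

def hist_entropy(x, bins):
    c, _ = np.histogram(x, bins=bins, range=(0.0, 1.0))
    p = c[c > 0] / c.sum()
    return -np.sum(p * np.log(p))

# grows far more slowly than h_uniform as bins are added
h_peaked = np.array([hist_entropy(x, n) for n in ns])
```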

9. Nov 6, 2012

### Stephen Tashi

Whoa! We can't talk about concepts involving probability unless we keep a clear picture.

In the first place, let's clarify what probability distribution (or distributions) we are talking about. P( state j | state i, action a) is a single number for a given j,i,a. So which of these are we varying in order to create a distribution? It would seem most natural to keep i and 'a' constant while varying j. But this means you get a possibly different distribution for each i and 'a'. In that case what entropy are we concerned with? The max or min entropy taken over the whole collection of distributions defined by various i and 'a' ?

If the physical process is really deterministic, then how are you introducing probability into the model? If a bin i is defined as all physical states within some volume of the (actual) state space (for example, all (t,A,B) such that t0 - dt < t < t0 + dt, A0 - dA < A < A0 + dA, B0 - dB < B < B0 + dB), then are you introducing probability by saying that, given the system is in bin i, the probability of the actual state is uniformly spread over the volume defined by the bin?

I don't understand your assertion that the entropy calculated with more bins must be increasing - partly because I don't understand what distribution(s) are involved and I don't understand how probability enters into the model. Can you give a simple example of a scenario (not necessarily from your model) where finer binning increases the entropy of a distribution?