The Fisher information is also an approximation to the Kullback-Leibler divergence. http://uni-leipzig.de/~strimmer/lab/statisticalthinking/pdf/c4.pdf [Broken]

The Kullback-Leibler divergence is also called the relative entropy. An example involving coin tossing to show its intuitive meaning is given in http://arxiv.org/abs/quant-ph/0102094

The easiest way to think of this is to understand that the variance of a parameter estimate is inversely related to the information. The curve described is a likelihood function which is maximal at the best estimate of the parameter in terms of information. This estimate is best defined (high information) when the variance is minimal and less well defined when the variance is large. The variance is described in terms of the partial derivative of the density function log-f and is conditional on a given value of the parameter theta.

Thanks for your replies, and thanks for correcting my notation. I guess one can understand this, by looking at the inequality and the fact that it is the inverse of the minimum varince of an unbiased estimator. But I'd like to understand it directly from the equation.

Let me reprhase the question, and maybe you can understand better what I am asking. When asking this question, I am assuming, that if the fisher information is high, the information in a single sample x, gives us a good idea of what theta is. question:

Why is it that if :[tex] I(\theta) = \frac {\partial}{\partial \theta} log f (X;\theta) [/tex]
tends to have many possible likely outcomes(high variance), a sample value x will tell us a lot about theta. But if [tex]I(\theta) = \frac {\partial}{\partial \theta} log f (X;\theta) [/tex] do not have many outcomes, a sample x will not give us much information about theta?

This is solved iteratively to find the MLE and requires a computer (unless you have a lot of time and a very good hand calculator)

I'm not sure what you mean by many possible outcomes. The MLE program iterates over many values of [itex] \theta [/itex] to find the MLE. It is the variance of this value which is inversely related to the information. Think of a flat line for the likelihood function. This corresponds to "infinite" variance and 0 information. With a low variance in the data, likelihood function is well defined around the estimate, corresponding to the higher information represented by the estimate.

You are right, I shouldnt have used I(theta)=... But besides that I stand by the question. What I ment by many outcomes, Is that the fisher informtion, is the variance of what I wrote. So if the varince of what I wrote is high, then another way to say it, is that the expression I wrote have many different possible outcomes. And also, since the varince of what I wrote, is the fisher information, this means that the information is high, if the expression has high variance.

What I dont understand about this is that you are explainging it in terms of variance when theta is the free variable. That is you explain it how it varies over different values of theta? But in the expression we hold theta fixed, and calculate the variance with x as variating. My problem in understanding this is then that when we calculate MLE, we let theta variate, but here we have theta fixed and let x variate.

Theta is not really a free variable. MLE selects the distribution that best fits the data. The MLE estimate of theta is a single value. You may be able calculate theta by the usual way (sum observations and divide by n) which is the MLE for some common distributions, but for some purposes, the shape of the likelihood function is of interest. It is especially useful in curve fitting to multiple data points.

If your thinking in terms of entropy, yes. The more states a system can exist in, the greater its entropy. That means observing a particular state has high information because there are many other possibilities. Don't confuse that with the variance of the estimate. I've been saying all along that the variance of the estimate is inversely related to the information of the estimate. The MLE of theta is the value which has the least variance and therefore the most information.

You have better knowledge of theta if the error of its estimate is less. If you flip a coin ten times, there are 1024 possible sequences. Therefore there is "value" (information) in one sequence if it's the "winner" of a bet. If you now introduce an "error" around the outcome, to include say three sequences, you've increased the probability of success (and therefore reduced its "value") to 3/1024. I hope this clears things up a bit. It is a concept that a lot of people have found difficult, including me.