# How to understand Fisher information?

1. Jan 10, 2012

### skwey

Hello, I am trying to understand what "Fisher information" is.

It is defined as V [∂/∂∅(lnf(X,∅)) ]=E[ (∂/∂∅[lnf(X,∅)])^2 ].

From Wikipedia:
Can you please help me understand why this is the case? How can this be explained by looking at the equation?

2. Jan 10, 2012

### atyy

The Fisher information occurs in the Cramér-Rao bound, which is a lower bound on the variance of an unbiased estimator. http://en.wikipedia.org/wiki/Cramér–Rao_bound

The Fisher information is also an approximation to the Kullback-Leibler divergence. http://uni-leipzig.de/~strimmer/lab/statisticalthinking/pdf/c4.pdf [Broken]

The Kullback-Leibler divergence is also called the relative entropy. An example involving coin tossing to show its intuitive meaning is given in http://arxiv.org/abs/quant-ph/0102094
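
A quick numerical check of the Fisher/Kullback-Leibler connection, as a sketch only: for a small shift delta in the parameter, the divergence between the two distributions is roughly 0.5 * I(theta) * delta^2. The Bernoulli model, the value p = 0.3, and the function names below are my own illustrative choices, not something taken from the linked notes.

```python
import math

def kl_bernoulli(p, q):
    """KL divergence D(Bernoulli(p) || Bernoulli(q)) in nats."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def fisher_bernoulli(p):
    """Fisher information of a single Bernoulli(p) observation."""
    return 1.0 / (p * (1 - p))

p = 0.3  # assumed parameter value, chosen only for illustration
for delta in (0.1, 0.01, 0.001):
    exact = kl_bernoulli(p, p + delta)
    approx = 0.5 * fisher_bernoulli(p) * delta ** 2
    # The ratio should approach 1 as delta shrinks.
    print(f"delta={delta}: KL={exact:.3e}  0.5*I*delta^2={approx:.3e}  "
          f"ratio={exact / approx:.4f}")
```

The agreement gets better as delta shrinks, which is the sense in which the Fisher information is a local (second-order) approximation to the relative entropy.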

Last edited by a moderator: May 5, 2017

3. Jan 10, 2012

### SW VandeCarr

I think the equation you're looking for is:

$$I(\theta) = E\left[\left(\frac{\partial}{\partial \theta} \log f(X;\theta)\right)^2 \,\Big|\, \theta\right]$$

You're using the empty set symbol for theta.
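
The variance form and the expected-square form of the definition agree because the derivative of the log-density (the score) has mean zero. A sketch of why, assuming $f$ is smooth enough in $\theta$ that differentiation and integration can be swapped (a regularity assumption that is not stated in this thread):

$$E\left[\frac{\partial}{\partial \theta} \log f(X;\theta) \,\Big|\, \theta\right] = \int \frac{\partial f(x;\theta)/\partial \theta}{f(x;\theta)}\, f(x;\theta)\, dx = \frac{\partial}{\partial \theta} \int f(x;\theta)\, dx = \frac{\partial}{\partial \theta} 1 = 0$$

$$V\left[\frac{\partial}{\partial \theta} \log f(X;\theta)\right] = E\left[\left(\frac{\partial}{\partial \theta} \log f(X;\theta)\right)^2 \,\Big|\, \theta\right] - 0^2 = I(\theta)$$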

The easiest way to think of this is that the variance of a parameter estimate is inversely related to the information. The curve described is a likelihood function, which is maximal at the best estimate of the parameter. This estimate is well defined (high information) when its variance is small and less well defined when its variance is large. The information is written in terms of the partial derivative of the log-density, log f, with respect to theta, and the expectation is taken at a given value of the parameter theta.
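
To make the inverse relation concrete, here is a small simulation sketch. It assumes a Normal(theta, sigma^2) model with sigma known, where the Fisher information per observation is 1/sigma^2 and the sample mean is the MLE of theta. The model, sample sizes, and function names are my own assumptions for illustration.

```python
import random
import statistics

def variance_of_sample_mean(theta, sigma, n, reps, seed=0):
    """Monte Carlo variance of the sample mean (the MLE of theta here)."""
    rng = random.Random(seed)
    estimates = [
        statistics.fmean(rng.gauss(theta, sigma) for _ in range(n))
        for _ in range(reps)
    ]
    return statistics.variance(estimates)

def cramer_rao_bound(sigma, n):
    """1 / (n * I(theta)), with I(theta) = 1 / sigma**2 per observation."""
    return sigma ** 2 / n

n = 25
for sigma in (0.5, 2.0):
    simulated = variance_of_sample_mean(theta=1.0, sigma=sigma, n=n, reps=4000)
    print(f"sigma={sigma}: I={1 / sigma**2:.2f}  "
          f"simulated var={simulated:.4f}  1/(n*I)={cramer_rao_bound(sigma, n):.4f}")
```

The smaller sigma gives the larger information and the smaller variance of the estimate, and in this model the simulated variance sits close to 1/(n I).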

Last edited: Jan 10, 2012

4. Jan 13, 2012

### skwey

Thanks for your replies, and thanks for correcting my notation. I guess one can understand this by looking at the inequality and the fact that the information is the inverse of the minimum variance of an unbiased estimator. But I'd like to understand it directly from the equation.

Let me rephrase the question, and maybe you can understand better what I am asking. In asking it, I am assuming that if the Fisher information is high, a single sample x gives us a good idea of what theta is. Question:

Why is it that if $$I(\theta) = \frac{\partial}{\partial \theta} \log f(X;\theta)$$
tends to have many possible likely outcomes (high variance), a sample value x will tell us a lot about theta, but if $$I(\theta) = \frac{\partial}{\partial \theta} \log f(X;\theta)$$ does not have many outcomes, a sample x will not give us much information about theta?

5. Jan 14, 2012

### SW VandeCarr

You're not using the correct formula: $I(\theta) = E\left[\left(\frac{\partial}{\partial \theta} \log f(X;\theta)\right)^2 \,\Big|\, \theta\right]$

This is an expectation conditional on theta, and theta itself can be estimated by maximum likelihood estimation (MLE). The general form of the likelihood is:

$$L(\theta \mid x_1, x_2, \ldots, x_n) = f(x_1, x_2, \ldots, x_n \mid \theta) = \prod_{i=1}^{n} f(x_i \mid \theta)$$

In general this is maximized iteratively to find the MLE, which requires a computer (unless you have a lot of time and a very good hand calculator).

I'm not sure what you mean by many possible outcomes. The MLE program iterates over many values of $\theta$ to find the MLE. It is the variance of this estimate which is inversely related to the information. Think of a flat line for the likelihood function. This corresponds to "infinite" variance and 0 information. With low variance in the data, the likelihood function is sharply peaked around the estimate, corresponding to the higher information represented by the estimate.
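
The flat-versus-peaked picture can be sketched numerically. The example below assumes a Normal(theta, sigma^2) model with sigma known, does a crude grid search for the MLE, and measures the peakedness of the log-likelihood as the negative second derivative at the maximum (often called the observed information). The model, the grid, and all function names are my assumptions for illustration, not the output of any particular MLE program.

```python
import math
import random

def log_likelihood(theta, data, sigma):
    """Log-likelihood of a Normal(theta, sigma^2) model with sigma known."""
    n = len(data)
    return (-n * math.log(sigma * math.sqrt(2 * math.pi))
            - sum((x - theta) ** 2 for x in data) / (2 * sigma ** 2))

def grid_mle(data, sigma, lo=-5.0, hi=5.0, steps=2001):
    """Crude MLE: evaluate the log-likelihood on a grid and keep the best theta."""
    grid = [lo + (hi - lo) * k / (steps - 1) for k in range(steps)]
    return max(grid, key=lambda t: log_likelihood(t, data, sigma))

def observed_information(theta, data, sigma, h=1e-3):
    """Negative second derivative of the log-likelihood (central difference)."""
    ll = lambda t: log_likelihood(t, data, sigma)
    return -(ll(theta + h) - 2 * ll(theta) + ll(theta - h)) / h ** 2

rng = random.Random(0)
n = 50
for sigma in (0.5, 3.0):
    data = [rng.gauss(1.0, sigma) for _ in range(n)]
    theta_hat = grid_mle(data, sigma)
    info = observed_information(theta_hat, data, sigma)
    print(f"sigma={sigma}: theta_hat={theta_hat:.3f}  "
          f"curvature={info:.2f}  n/sigma^2={n / sigma**2:.2f}")
```

With low-variance data the curvature at the peak is large (a sharply defined maximum); with high-variance data the curve is much flatter. In this model the curvature equals n/sigma^2, which is n times the per-observation Fisher information.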

Last edited: Jan 14, 2012

6. Jan 14, 2012

### skwey

You are right, I shouldn't have used I(theta)=... But besides that I stand by the question. What I meant by many outcomes is that the Fisher information is the variance of what I wrote. So if the variance of what I wrote is high, another way to say it is that the expression I wrote has many different possible outcomes. And since the variance of what I wrote is the Fisher information, this means that the information is high if the expression has high variance.

What I don't understand about this is that you are explaining it in terms of variance when theta is the free variable. That is, you explain how it varies over different values of theta? But in the expression we hold theta fixed and calculate the variance with x varying. My problem in understanding this is that when we calculate the MLE we let theta vary, but here we have theta fixed and let x vary.
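
The "theta fixed, x varying" reading can be checked directly. This is a sketch assuming a Normal(theta, sigma^2) density with sigma known, for which the derivative of log f with respect to theta is (x - theta) / sigma^2. The model and the function names are assumptions made for illustration.

```python
import random
import statistics

def score(x, theta, sigma):
    """d/dtheta of log f(x; theta) for a Normal(theta, sigma^2) density."""
    return (x - theta) / sigma ** 2

def score_mean_and_variance(theta, sigma, draws=100_000, seed=0):
    """Hold theta fixed, draw many x from f(x; theta), summarize the score."""
    rng = random.Random(seed)
    scores = [score(rng.gauss(theta, sigma), theta, sigma) for _ in range(draws)]
    return statistics.fmean(scores), statistics.variance(scores)

theta = 1.0
for sigma in (0.5, 2.0):
    mean, var = score_mean_and_variance(theta, sigma)
    print(f"sigma={sigma}: mean score={mean:+.4f}  "
          f"var of score={var:.4f}  1/sigma^2={1 / sigma**2:.4f}")
```

When sigma is small, a small change in x moves the score a lot, so the score has high variance at the fixed theta. That is the same situation in which the log-likelihood, viewed as a function of theta, is sharply curved, so the two pictures describe one quantity.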

7. Jan 14, 2012

### SW VandeCarr

Theta is not really a free variable. MLE selects the distribution that best fits the data. The MLE estimate of theta is a single value. You may be able to calculate theta in the usual way (sum the observations and divide by n), which is the MLE for some common distributions, but for some purposes the shape of the likelihood function is of interest. It is especially useful in curve fitting to multiple data points.

Last edited: Jan 14, 2012

8. Jan 14, 2012

### SW VandeCarr

If you're thinking in terms of entropy, yes. The more states a system can exist in, the greater its entropy. That means observing a particular state has high information because there are many other possibilities. Don't confuse that with the variance of the estimate. I've been saying all along that the variance of the estimate is inversely related to the information of the estimate. The MLE of theta is the value which has the least variance and therefore the most information.

You have better knowledge of theta if the error of its estimate is less. If you flip a coin ten times, there are 1024 possible sequences. Therefore there is "value" (information) in one sequence if it's the "winner" of a bet. If you now introduce an "error" around the outcome, to include say three sequences, you've increased the probability of success (and therefore reduced its "value") to 3/1024. I hope this clears things up a bit. It is a concept that a lot of people have found difficult, including me.
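
The coin-flip arithmetic can be put in bits. This sketch reads the "value" of an outcome as its Shannon surprisal, -log2(probability); that reading is my assumption, and it is an entropy-style measure rather than the Fisher information itself.

```python
import math

n_flips = 10
n_sequences = 2 ** n_flips  # 1024 equally likely sequences of a fair coin

def surprisal_bits(probability):
    """Shannon information, in bits, of an event with the given probability."""
    return -math.log2(probability)

one_sequence = surprisal_bits(1 / n_sequences)     # a single exact sequence
three_sequences = surprisal_bits(3 / n_sequences)  # any of three sequences

print(f"{n_sequences} sequences")
print(f"one sequence:    {one_sequence:.3f} bits")
print(f"three sequences: {three_sequences:.3f} bits")
```

Widening the target from one sequence to three raises the probability of success from 1/1024 to 3/1024 and lowers the information in a "win" by log2(3), roughly 1.58 bits.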

Last edited: Jan 14, 2012