How to understand Fisher information?

  • #1
skwey
How to understand "Fisher information"?

Hello, I am trying to understand what "Fisher information" is.

It is defined as V[∂/∂∅ (ln f(X,∅))] = E[(∂/∂∅ (ln f(X,∅)))^2].

From Wikipedia:
The Fisher information is a way of measuring the amount of information that an observable random variable X carries about an unknown parameter θ upon which the probability of X depends.

Can you please help me understand why this is the case? How can this be explained by looking at the equation?
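
A concrete worked case may help (this is my own standard-textbook illustration, not part of the original post): for a Bernoulli variable X with success probability θ, the score and its variance follow directly from the definition,

[tex] \frac{\partial}{\partial \theta} \ln f(X;\theta) = \frac{\partial}{\partial \theta}\left[X \ln\theta + (1-X)\ln(1-\theta)\right] = \frac{X-\theta}{\theta(1-\theta)}, \qquad I(\theta) = V\!\left[\frac{X-\theta}{\theta(1-\theta)}\right] = \frac{1}{\theta(1-\theta)}. [/tex]

The information is largest when θ is near 0 or 1, which is where a single observation pins the parameter down most sharply.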
 
  • #2
The Fisher information occurs in the Cramer-Rao bound, which gives a lower bound on the variance of an unbiased estimator. http://en.wikipedia.org/wiki/Cramér–Rao_bound
http://www.colorado.edu/isl/papers/info/node2.html

The Fisher information also gives a quadratic (second-order) approximation to the Kullback-Leibler divergence between nearby distributions. http://uni-leipzig.de/~strimmer/lab/statisticalthinking/pdf/c4.pdf

The Kullback-Leibler divergence is also called the relative entropy. An example involving coin tossing to show its intuitive meaning is given in http://arxiv.org/abs/quant-ph/0102094
 
  • #3


skwey said:
Hello, I am trying to understand what "Fisher information" is.

It is defined as V[∂/∂∅ (ln f(X,∅))] = E[(∂/∂∅ (ln f(X,∅)))^2].

From Wikipedia: [...] Can you please help me understand why this is the case? How can this be explained by looking at the equation?

I think the equation you're looking for is:

[tex] I(\theta) = E[(\frac {\partial}{\partial \theta} log f (X;\theta))^2 | \theta][/tex]

You're using the empty set symbol for theta.

The easiest way to think of this is that the variance of a parameter estimate is inversely related to the information. The curve in question is the likelihood function, which is maximal at the best estimate of the parameter. That estimate is well defined (high information) when its variance is small and poorly defined when the variance is large. The variance here is expressed through the partial derivative of the log-density, log f, and is conditional on a given value of the parameter theta.
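
As a rough numerical sketch of that idea (my own addition, not part of the original reply, and assuming a normal model with known spread), the snippet below compares a tight and a wide model: the log-likelihood of the tight one is much more sharply curved around its maximum, which is exactly what a larger Fisher information means.

[code]
import numpy as np

rng = np.random.default_rng(0)

def log_likelihood(mu, data, sigma):
    """Log-likelihood of a N(mu, sigma^2) model for the observed data."""
    return np.sum(-0.5 * np.log(2 * np.pi * sigma**2)
                  - (data - mu)**2 / (2 * sigma**2))

for sigma in (0.5, 2.0):                      # tight model vs. wide model
    data = rng.normal(loc=1.0, scale=sigma, size=200)
    mu_hat = data.mean()                      # MLE of the mean
    # Numerical curvature (observed information) of the log-likelihood
    # at the MLE; for this model it equals n / sigma^2.
    h = 1e-3
    curvature = -(log_likelihood(mu_hat + h, data, sigma)
                  - 2 * log_likelihood(mu_hat, data, sigma)
                  + log_likelihood(mu_hat - h, data, sigma)) / h**2
    print(f"sigma = {sigma}: curvature ~ {curvature:.1f}, "
          f"n/sigma^2 = {len(data) / sigma**2:.1f}")
[/code]

The smaller-sigma model carries far more information about the mean, and correspondingly its estimate of the mean has a much smaller variance (sigma^2/n), which is the inverse relation described above.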
 
  • #4


Thanks for your replies, and thanks for correcting my notation. I guess one can understand this by looking at the inequality and the fact that the Fisher information is the inverse of the minimum variance of an unbiased estimator. But I'd like to understand it directly from the equation.

Let me rephrase the question, and maybe you will understand better what I am asking. I am assuming that if the Fisher information is high, then a single sample x gives us a good idea of what theta is. The question:

Why is it that if [tex] I(\theta) = \frac {\partial}{\partial \theta} log f (X;\theta) [/tex]
tends to have many possible likely outcomes (high variance), a sample value x will tell us a lot about theta, but if [tex] I(\theta) = \frac {\partial}{\partial \theta} log f (X;\theta) [/tex] does not have many outcomes, a sample x will not give us much information about theta?
 
  • #5


skwey said:
Thanks for your replies, and thanks for correcting my notation. I guess one can understand this by looking at the inequality and the fact that the Fisher information is the inverse of the minimum variance of an unbiased estimator. But I'd like to understand it directly from the equation.

Let me rephrase the question, and maybe you will understand better what I am asking. I am assuming that if the Fisher information is high, then a single sample x gives us a good idea of what theta is. The question:

Why is it that if [tex] I(\theta) = \frac {\partial}{\partial \theta} log f (X;\theta) [/tex]
tends to have many possible likely outcomes (high variance), a sample value x will tell us a lot about theta, but if [tex] I(\theta) = \frac {\partial}{\partial \theta} log f (X;\theta) [/tex] does not have many outcomes, a sample x will not give us much information about theta?

You're not using the correct formula: [itex] I(\theta) = E[(\frac {\partial}{\partial \theta} log f (X;\theta))^2 | \theta][/itex]

This is an expectation conditional on theta, and the parameter can be estimated by maximum likelihood estimation (MLE). The general form of the likelihood is:

[tex]L(\theta|x_1, x_2, \ldots, x_n)=f(x_1, x_2, \ldots, x_n|\theta)=\prod_{i=1}^{n} f(x_i|\theta)[/tex]

This is usually solved iteratively to find the MLE and in practice requires a computer (unless you have a lot of time and a very good hand calculator).

I'm not sure what you mean by many possible outcomes. The MLE program iterates over many values of [itex] \theta [/itex] to find the MLE. It is the variance of this value which is inversely related to the information. Think of a flat line for the likelihood function: this corresponds to "infinite" variance and zero information. With a low variance in the data, the likelihood function is well defined (sharply peaked) around the estimate, corresponding to the higher information carried by the estimate.
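
Here is a minimal sketch of what such an iteration looks like (my own illustration, assuming a simple Bernoulli coin-flip model rather than anything from the thread): a grid search over θ plays the role of the MLE program, and the curvature of the log-likelihood at its peak plays the role of the information.

[code]
import numpy as np

rng = np.random.default_rng(1)
data = rng.binomial(n=1, p=0.3, size=100)   # coin flips with unknown bias

thetas = np.linspace(0.01, 0.99, 981)       # candidate parameter values
# Bernoulli log-likelihood evaluated over the whole grid
loglik = np.array([np.sum(data * np.log(t) + (1 - data) * np.log(1 - t))
                   for t in thetas])

i = np.argmax(loglik)
theta_hat = thetas[i]                       # grid-search MLE
# Curvature at the peak (observed information); for a Bernoulli sample
# this should be close to n / (theta_hat * (1 - theta_hat)).
h = thetas[1] - thetas[0]
observed_info = -(loglik[i + 1] - 2 * loglik[i] + loglik[i - 1]) / h**2

print(f"MLE of theta: {theta_hat:.3f}")
print(f"observed information: {observed_info:.1f}")
print(f"n / (theta_hat*(1-theta_hat)): {len(data) / (theta_hat * (1 - theta_hat)):.1f}")
[/code]

A flat log-likelihood would give a curvature near zero (no information), while a sharply peaked one gives a large curvature and a correspondingly small variance for the estimate.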
 
  • #6


You are right, I shouldn't have used I(theta) = ... But besides that, I stand by the question. What I meant by many outcomes is that the Fisher information is the variance of the expression I wrote. So if the variance of that expression is high, another way to say it is that the expression I wrote has many different possible outcomes. And since the variance of that expression is the Fisher information, this means the information is high if the expression has high variance.

I'm not sure what you mean by many possible outcomes. The MLE program iterates over many values of θ to find the MLE. It is the variance of this value which is inversely related to the information. Think of a flat line for the likelihood function: this corresponds to "infinite" variance and zero information. With a low variance in the data, the likelihood function is well defined (sharply peaked) around the estimate, corresponding to the higher information carried by the estimate.

What I don't understand about this is that you are explaining it in terms of variance when theta is the free variable. That is, you explain how it varies over different values of theta. But in the expression we hold theta fixed and calculate the variance with X varying. My problem in understanding this is that when we calculate the MLE we let theta vary, but here we have theta fixed and let X vary.
 
  • #7


skwey said:
What I don't understand about this is that you are explaining it in terms of variance when theta is the free variable. That is, you explain how it varies over different values of theta. But in the expression we hold theta fixed and calculate the variance with X varying. My problem in understanding this is that when we calculate the MLE we let theta vary, but here we have theta fixed and let X vary.

Theta is not really a free variable. MLE selects the distribution that best fits the data, and the MLE estimate of theta is a single value. You may be able to calculate theta the usual way (sum the observations and divide by n), which is the MLE for some common distributions, but for some purposes the shape of the likelihood function is of interest. It is especially useful in curve fitting to multiple data points.
 
  • #8


skwey:

... the expression I wrote has many different possible outcomes. And since the variance of that expression is the Fisher information, this means the information is high if the expression has high variance.

If you're thinking in terms of entropy, yes. The more states a system can exist in, the greater its entropy. That means observing a particular state carries high information because there are many other possibilities. Don't confuse that with the variance of the estimate. I've been saying all along that the variance of the estimate is inversely related to the information of the estimate. The MLE of theta is the value which has the least variance and therefore the most information.

You have better knowledge of theta if the error of its estimate is smaller. If you flip a coin ten times, there are 1024 possible sequences. Therefore there is "value" (information) in one sequence if it's the "winner" of a bet. If you now introduce an "error" around the outcome, to include, say, three sequences, you've increased the probability of success (and therefore reduced its "value") to 3/1024. I hope this clears things up a bit. It is a concept that a lot of people have found difficult, including me.
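
To put rough numbers on that coin example (my own back-of-the-envelope addition, measuring the "value" of an outcome by its surprisal in bits):

[tex] 2^{10} = 1024, \qquad -\log_2 \frac{1}{1024} = 10 \text{ bits}, \qquad -\log_2 \frac{3}{1024} \approx 8.4 \text{ bits}. [/tex]

Widening the accepted outcome from one sequence to three raises the probability of success and lowers the information content of a win by about 1.6 bits.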
 

1. What is Fisher information?

Fisher information is a statistical measure that quantifies the amount of information that a random variable contains about an unknown parameter in a statistical model. In other words, it measures how much a set of data can inform us about the parameters of a probability distribution.

2. Why is Fisher information important?

Fisher information is important because it helps us understand the precision and accuracy of statistical estimates. It also plays a crucial role in various statistical methods, such as maximum likelihood estimation and hypothesis testing.

3. How is Fisher information calculated?

Fisher information is calculated as the negative expected value of the second derivative of the log-likelihood function with respect to the parameter of interest. Equivalently, it is the variance of the score function, which is the first derivative of the log-likelihood function.
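
As an informal numerical check of that equivalence (a sketch of my own, assuming a Poisson model rather than anything from the thread), the two formulas can be compared by simulation; both should land near the exact value 1/λ:

[code]
import numpy as np

rng = np.random.default_rng(2)
lam = 4.0
x = rng.poisson(lam=lam, size=1_000_000)

# log f(x; lam) = x*log(lam) - lam - log(x!)
score = x / lam - 1.0            # first derivative of the log-density
second_deriv = -x / lam**2       # second derivative of the log-density

info_as_score_variance = np.var(score)        # variance of the score
info_as_neg_hessian = -np.mean(second_deriv)  # negative expected second derivative

print(info_as_score_variance, info_as_neg_hessian, 1 / lam)
[/code]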

4. What is the relationship between Fisher information and the Cramer-Rao bound?

The Cramer-Rao bound is a lower bound on the variance of any unbiased estimator of a parameter. It is directly related to Fisher information, as the Cramer-Rao bound is equal to the inverse of Fisher information. Therefore, a higher Fisher information corresponds to a lower Cramer-Rao bound, indicating a more efficient estimator.
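
A standard worked case (my addition, not part of the original answer): for n observations from a normal distribution with known variance σ² and unknown mean μ, the Fisher information is I(μ) = n/σ², so the Cramer-Rao bound reads

[tex] \mathrm{Var}(\hat{\mu}) \ge \frac{1}{I(\mu)} = \frac{\sigma^2}{n}, [/tex]

and the sample mean attains this bound exactly, which is why it is called an efficient estimator.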

5. How does Fisher information relate to information theory?

Fisher information is a fundamental concept in information theory, which deals with the quantification of information. It is used to measure the amount of information that a random variable contains about a parameter, and it is closely related to other information-theoretic measures such as entropy and mutual information.
