How to understand Fisher information?

  • #1
skwey
How to understand "Fisher information"?

Hello, I am trying to understand what "Fisher information" is.

It is defined as V[∂/∂∅ (ln f(X,∅))] = E[(∂/∂∅ (ln f(X,∅)))^2].

From Wikipedia:
The Fisher information is a way of measuring the amount of information that an observable random variable X carries about an unknown parameter θ upon which the probability of X depends.

Can you please help me understand why this is the case? How can this be explained by looking at the equation?
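
A concrete worked case may help (this is my own standard-textbook illustration, not part of the original post): for a Bernoulli variable X with success probability θ, the score and its variance follow directly from the definition,

[tex] \frac{\partial}{\partial \theta} \ln f(X;\theta) = \frac{\partial}{\partial \theta}\left[X \ln\theta + (1-X)\ln(1-\theta)\right] = \frac{X-\theta}{\theta(1-\theta)}, \qquad I(\theta) = V\!\left[\frac{X-\theta}{\theta(1-\theta)}\right] = \frac{1}{\theta(1-\theta)}. [/tex]

The information is largest when θ is near 0 or 1, which is where a single observation pins the parameter down most sharply.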
 
  • #2
The Fisher information occurs in the Cramer-Rao bound, which gives a lower bound on the variance of an unbiased estimator. http://en.wikipedia.org/wiki/Cramér–Rao_bound
http://www.colorado.edu/isl/papers/info/node2.html

The Fisher information also gives a quadratic (second-order) approximation to the Kullback-Leibler divergence between nearby distributions. http://uni-leipzig.de/~strimmer/lab/statisticalthinking/pdf/c4.pdf

The Kullback-Leibler divergence is also called the relative entropy. An example involving coin tossing to show its intuitive meaning is given in http://arxiv.org/abs/quant-ph/0102094
 
  • #3


skwey said:
Hello, I am trying to understand what "Fisher information" is.

It is defined as V[∂/∂∅ (ln f(X,∅))] = E[(∂/∂∅ (ln f(X,∅)))^2].

From Wikipedia: [...] Can you please help me understand why this is the case? How can this be explained by looking at the equation?

I think the equation you're looking for is:

[tex] I(\theta) = E[(\frac {\partial}{\partial \theta} log f (X;\theta))^2 | \theta][/tex]

You're using the empty set symbol for theta.

The easiest way to think of this is that the variance of a parameter estimate is inversely related to the information. The curve in question is the likelihood function, which is maximal at the best estimate of the parameter. That estimate is well defined (high information) when its variance is small and poorly defined when the variance is large. The variance here is expressed through the partial derivative of the log-density, log f, and is conditional on a given value of the parameter theta.
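
As a rough numerical sketch of that idea (my own addition, not part of the original reply, and assuming a normal model with known spread), the snippet below compares a tight and a wide model: the log-likelihood of the tight one is much more sharply curved around its maximum, which is exactly what a larger Fisher information means.

[code]
import numpy as np

rng = np.random.default_rng(0)

def log_likelihood(mu, data, sigma):
    """Log-likelihood of a N(mu, sigma^2) model for the observed data."""
    return np.sum(-0.5 * np.log(2 * np.pi * sigma**2)
                  - (data - mu)**2 / (2 * sigma**2))

for sigma in (0.5, 2.0):                      # tight model vs. wide model
    data = rng.normal(loc=1.0, scale=sigma, size=200)
    mu_hat = data.mean()                      # MLE of the mean
    # Numerical curvature (observed information) of the log-likelihood
    # at the MLE; for this model it equals n / sigma^2.
    h = 1e-3
    curvature = -(log_likelihood(mu_hat + h, data, sigma)
                  - 2 * log_likelihood(mu_hat, data, sigma)
                  + log_likelihood(mu_hat - h, data, sigma)) / h**2
    print(f"sigma = {sigma}: curvature ~ {curvature:.1f}, "
          f"n/sigma^2 = {len(data) / sigma**2:.1f}")
[/code]

The smaller-sigma model carries far more information about the mean, and correspondingly its estimate of the mean has a much smaller variance (sigma^2/n), which is the inverse relation described above.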
 
  • #4


Thanks for your replies, and thanks for correcting my notation. I guess one can understand this by looking at the inequality and the fact that the Fisher information is the inverse of the minimum variance of an unbiased estimator. But I'd like to understand it directly from the equation.

Let me rephrase the question, and maybe you will understand better what I am asking. I am assuming that if the Fisher information is high, then a single sample x gives us a good idea of what theta is. The question:

Why is it that if [tex] I(\theta) = \frac {\partial}{\partial \theta} log f (X;\theta) [/tex]
tends to have many possible likely outcomes (high variance), a sample value x will tell us a lot about theta, but if [tex] I(\theta) = \frac {\partial}{\partial \theta} log f (X;\theta) [/tex] does not have many outcomes, a sample x will not give us much information about theta?
 
  • #5


skwey said:
Thanks for your replies, and thanks for correcting my notation. I guess one can understand this by looking at the inequality and the fact that the Fisher information is the inverse of the minimum variance of an unbiased estimator. But I'd like to understand it directly from the equation.

Let me rephrase the question, and maybe you will understand better what I am asking. I am assuming that if the Fisher information is high, then a single sample x gives us a good idea of what theta is. The question:

Why is it that if [tex] I(\theta) = \frac {\partial}{\partial \theta} log f (X;\theta) [/tex]
tends to have many possible likely outcomes (high variance), a sample value x will tell us a lot about theta, but if [tex] I(\theta) = \frac {\partial}{\partial \theta} log f (X;\theta) [/tex] does not have many outcomes, a sample x will not give us much information about theta?

You're not using the correct formula: [itex] I(\theta) = E[(\frac {\partial}{\partial \theta} log f (X;\theta))^2 | \theta][/itex]

This is an expectation conditional on theta, and the parameter can be estimated by maximum likelihood estimation (MLE). The general form of the likelihood is:

[tex]L(\theta|x_1, x_2, \ldots, x_n)=f(x_1, x_2, \ldots, x_n|\theta)=\prod_{i=1}^{n} f(x_i|\theta)[/tex]

This is usually solved iteratively to find the MLE and in practice requires a computer (unless you have a lot of time and a very good hand calculator).

I'm not sure what you mean by many possible outcomes. The MLE program iterates over many values of [itex] \theta [/itex] to find the MLE. It is the variance of this value which is inversely related to the information. Think of a flat line for the likelihood function: this corresponds to "infinite" variance and zero information. With a low variance in the data, the likelihood function is well defined (sharply peaked) around the estimate, corresponding to the higher information carried by the estimate.
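
Here is a minimal sketch of what such an iteration looks like (my own illustration, assuming a simple Bernoulli coin-flip model rather than anything from the thread): a grid search over θ plays the role of the MLE program, and the curvature of the log-likelihood at its peak plays the role of the information.

[code]
import numpy as np

rng = np.random.default_rng(1)
data = rng.binomial(n=1, p=0.3, size=100)   # coin flips with unknown bias

thetas = np.linspace(0.01, 0.99, 981)       # candidate parameter values
# Bernoulli log-likelihood evaluated over the whole grid
loglik = np.array([np.sum(data * np.log(t) + (1 - data) * np.log(1 - t))
                   for t in thetas])

i = np.argmax(loglik)
theta_hat = thetas[i]                       # grid-search MLE
# Curvature at the peak (observed information); for a Bernoulli sample
# this should be close to n / (theta_hat * (1 - theta_hat)).
h = thetas[1] - thetas[0]
observed_info = -(loglik[i + 1] - 2 * loglik[i] + loglik[i - 1]) / h**2

print(f"MLE of theta: {theta_hat:.3f}")
print(f"observed information: {observed_info:.1f}")
print(f"n / (theta_hat*(1-theta_hat)): {len(data) / (theta_hat * (1 - theta_hat)):.1f}")
[/code]

A flat log-likelihood would give a curvature near zero (no information), while a sharply peaked one gives a large curvature and a correspondingly small variance for the estimate.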
 
  • #6


You are right, I shouldn't have used I(theta) = ... But besides that, I stand by the question. What I meant by many outcomes is that the Fisher information is the variance of the expression I wrote. So if the variance of that expression is high, another way to say it is that the expression I wrote has many different possible outcomes. And since the variance of that expression is the Fisher information, this means the information is high if the expression has high variance.

I'm not sure what you mean by many possible outcomes. The MLE program iterates over many values of θ to find the MLE. It is the variance of this value which is inversely related to the information. Think of a flat line for the likelihood function: this corresponds to "infinite" variance and zero information. With a low variance in the data, the likelihood function is well defined (sharply peaked) around the estimate, corresponding to the higher information carried by the estimate.

What I don't understand about this is that you are explaining it in terms of variance when theta is the free variable. That is, you explain how it varies over different values of theta. But in the expression we hold theta fixed and calculate the variance with X varying. My problem in understanding this is that when we calculate the MLE we let theta vary, but here we have theta fixed and let X vary.
 
  • #7


skwey said:
What I don't understand about this is that you are explaining it in terms of variance when theta is the free variable. That is, you explain how it varies over different values of theta. But in the expression we hold theta fixed and calculate the variance with X varying. My problem in understanding this is that when we calculate the MLE we let theta vary, but here we have theta fixed and let X vary.

Theta is not really a free variable. MLE selects the distribution that best fits the data, and the MLE estimate of theta is a single value. You may be able to calculate theta the usual way (sum the observations and divide by n), which is the MLE for some common distributions, but for some purposes the shape of the likelihood function is of interest. It is especially useful in curve fitting to multiple data points.
 
  • #8


skwey:

... the expression I wrote has many different possible outcomes. And since the variance of that expression is the Fisher information, this means the information is high if the expression has high variance.

If you're thinking in terms of entropy, yes. The more states a system can exist in, the greater its entropy. That means observing a particular state carries high information because there are many other possibilities. Don't confuse that with the variance of the estimate. I've been saying all along that the variance of the estimate is inversely related to the information of the estimate. The MLE of theta is the value which has the least variance and therefore the most information.

You have better knowledge of theta if the error of its estimate is smaller. If you flip a coin ten times, there are 1024 possible sequences. Therefore there is "value" (information) in one sequence if it's the "winner" of a bet. If you now introduce an "error" around the outcome, to include, say, three sequences, you've increased the probability of success (and therefore reduced its "value") to 3/1024. I hope this clears things up a bit. It is a concept that a lot of people have found difficult, including me.
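
To put rough numbers on that coin example (my own back-of-the-envelope addition, measuring the "value" of an outcome by its surprisal in bits):

[tex] 2^{10} = 1024, \qquad -\log_2 \frac{1}{1024} = 10 \text{ bits}, \qquad -\log_2 \frac{3}{1024} \approx 8.4 \text{ bits}. [/tex]

Widening the accepted outcome from one sequence to three raises the probability of success and lowers the information content of a win by about 1.6 bits.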
 

1. What is Fisher information?

Fisher information is a statistical measure that quantifies the amount of information that a random variable contains about an unknown parameter in a statistical model. In other words, it measures how much a set of data can inform us about the parameters of a probability distribution.

2. Why is Fisher information important?

Fisher information is important because it helps us understand the precision and accuracy of statistical estimates. It also plays a crucial role in various statistical methods, such as maximum likelihood estimation and hypothesis testing.

3. How is Fisher information calculated?

Fisher information is calculated as the negative expected value of the second derivative of the log-likelihood function with respect to the parameter of interest. Equivalently, it is the variance of the score function, which is the first derivative of the log-likelihood function.
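
As an informal numerical check of that equivalence (a sketch of my own, assuming a Poisson model rather than anything from the thread), the two formulas can be compared by simulation; both should land near the exact value 1/λ:

[code]
import numpy as np

rng = np.random.default_rng(2)
lam = 4.0
x = rng.poisson(lam=lam, size=1_000_000)

# log f(x; lam) = x*log(lam) - lam - log(x!)
score = x / lam - 1.0            # first derivative of the log-density
second_deriv = -x / lam**2       # second derivative of the log-density

info_as_score_variance = np.var(score)        # variance of the score
info_as_neg_hessian = -np.mean(second_deriv)  # negative expected second derivative

print(info_as_score_variance, info_as_neg_hessian, 1 / lam)
[/code]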

4. What is the relationship between Fisher information and the Cramer-Rao bound?

The Cramer-Rao bound is a lower bound on the variance of any unbiased estimator of a parameter. It is directly related to Fisher information, as the Cramer-Rao bound is equal to the inverse of Fisher information. Therefore, a higher Fisher information corresponds to a lower Cramer-Rao bound, indicating a more efficient estimator.
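
A standard worked case (my addition, not part of the original answer): for n observations from a normal distribution with known variance σ² and unknown mean μ, the Fisher information is I(μ) = n/σ², so the Cramer-Rao bound reads

[tex] \mathrm{Var}(\hat{\mu}) \ge \frac{1}{I(\mu)} = \frac{\sigma^2}{n}, [/tex]

and the sample mean attains this bound exactly, which is why it is called an efficient estimator.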

5. How does Fisher information relate to information theory?

Fisher information is a fundamental concept in information theory, which deals with the quantification of information. It is used to measure the amount of information that a random variable contains about a parameter, and it is closely related to other information-theoretic measures such as entropy and mutual information.
