Forecast with the Fisher formalism: loss or gain of information

In summary, the problem is how to extract information from a set of data using the Fisher formalism. The context is astrophysics, but the problem generalizes to many other subjects.
  • #1
fab13
TL;DR Summary
I am asking about cross-correlating data sets to increase the information on cosmological parameters in a forecast.
In the context of a forecast, I am currently working with the Fisher formalism, which is part of the more general theory of information. My problem concerns estimating cosmological parameters from input data with the Fisher formalism and the usual recipes for building a Fisher matrix. The context is astrophysics, but it can apply to many subjects.

Here is a summary: the input data consists of 4 columns, the first giving the redshift of the galaxies (i.e. their distance) and the other 3 each giving the bias of a given galaxy type (roughly, how strongly that type of galaxy clusters relative to the underlying matter). There are therefore 3 types of galaxies and one value per redshift (8 redshifts in all), so I have a table of 8x3 bias values.

1) First part :
Now I'm trying to cross the data to extract additional information because, for example, for the first type of galaxy only the first 2 biases are non-zero (I mean for the first 2 redshifts), while for the third type I have 6 non-zero values, for the 6 redshifts above the previous 2.

Here is the file of biases for the 3 populations (b1, b2, b3) as a function of redshift (first column):

# z b1 b2 b3
1.7500000000e-01 1.1133849956e+00 0.0000000000e+00 0.0000000000e+00
4.2500000000e-01 1.7983127401e+00 0.0000000000e+00 0.0000000000e+00
6.5000000000e-01 0.0000000000e+00 1.4469899900e+00 7.1498329000e-01
8.5000000000e-01 0.0000000000e+00 1.4194157200e+00 7.0135835000e-01
1.0500000000e+00 0.0000000000e+00 1.4006739400e+00 6.9209771000e-01
1.2500000000e+00 0.0000000000e+00 0.0000000000e+00 6.8562140000e-01
1.4500000000e+00 0.0000000000e+00 0.0000000000e+00 6.8097541000e-01
1.6500000000e+00 0.0000000000e+00 0.0000000000e+00 6.7756594000e-01

My teacher suggested that I merge the b1 column (corresponding to the first type of galaxy) with the b3 column (corresponding to the third type of galaxy population), so as to obtain a single vector containing only non-zero bias values. This way, I simulate a "single population" treatment.

From a statistical point of view, will there be a loss or a gain of information if I fuse these 2 columns?
The problem seems rather complex because everything depends on the values of the data.

2) Second part: another point of view suggested by my teacher. If I take a sample, cut it into 2 parts, and cross-correlate the data between the 2 subsets obtained, will I gain or lose information from a statistical point of view, i.e. in the accuracy of the parameters that I extract from the cross-correlation between the 2 subsets? (In my case, with galaxy biases and the Fisher formalism, I mean the constraints that I get after building my Fisher matrix and inverting it.)

He thinks that at first sight I cannot lose information (which seems intuitive, because cutting a sample in 2 is not a loss of information per se), but he says that everything depends on whether or not I know precisely the ratio of the biases between the 2 subsamples. I did not quite understand this notion of the ratio between the biases of the 2 subsets.

I am therefore looking for information on this problem; maybe on this forum statisticians will be able to help me with this technique of cross-correlation and with knowing whether one gains or loses information by bringing together several sources of information.

I think that the gain or loss of information will be a function of the redundancy of the data (this is where Shannon entropy comes in, I think?).

3) Third part:
I could also cross data where 2 columns overlap (2 values for each redshift), but I think this is a different problem from a statistical point of view. At the beginning of this post I talked about crossing data by merging 2 vectors, but "cross-correlation" is rather defined for the case of overlapping values, right?

However, in both cases we are crossing data, in a certain sense.

For the moment, in my algorithm, I treat the first 2 values with the 1st population, the next 3 as overlapping between the 2nd and 3rd populations, and the last 3 with the 3rd population, i.e. 8 values in total (I mean 8 redshifts): so there are 2 "auto-spectra" and 1 overlapping (cross) spectrum.

My measure of the information gain that I have been talking about since the beginning of this post is the computation of the constraints obtained by inverting the Fisher matrix, which gives me the covariance matrix and therefore the variances and correlations of the parameters that I want to estimate: the smaller the standard deviations, the higher the information gain.
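For reference, here is a minimal sketch (Python/NumPy) of this last step, with a made-up 3x3 Fisher matrix; the numbers are purely illustrative and not my actual forecast:

```python
import numpy as np

# Purely illustrative 3x3 Fisher matrix for three parameters
# (in practice F comes from the forecast pipeline).
F = np.array([[4.0e4, 1.2e4, 3.0e3],
              [1.2e4, 9.0e3, 1.5e3],
              [3.0e3, 1.5e3, 2.0e3]])

cov = np.linalg.inv(F)                  # parameter covariance matrix
sigma = np.sqrt(np.diag(cov))           # 1-sigma marginalized constraints
corr = cov / np.outer(sigma, sigma)     # parameter correlation matrix

print("1-sigma constraints:", sigma)
print("correlation matrix:\n", corr)
```

The smaller the entries of sigma, the tighter the constraints, i.e. the larger the information gain in the sense used above.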

Your advice or suggestions on this issue are precious and will help me better understand the logic of this "data cross-correlation" story.

Any help is welcome.

PS: If the topic seems to be in the wrong forum, feel free to move it to Mathematics/Statistics, but I think this is actually an Astrophysics/Cosmology problem.

Regards
 
  • #2
1) If you combine dimensions, you generally lose information, except in the case that the two combined dimensions are perfectly degenerate. Basically, if you diagonalize the Fisher matrix and find that any of its eigenvalues are zero, then that means you can improve the calculation by getting rid of the degenerate data.

If the eigenvalues are small but nonzero, then there may still be some gain in collapsing the dimensions: computers have a hard time making calculations for Fisher matrices that have very uneven eigenvalues. The small eigenvalue represents a direction which the data constrain very poorly, and thus doesn't add much to the final result. So if you get rid of it you're losing a little bit of information, but the numerical computation errors you're removing can be much more important. (A short numerical sketch of this eigenvalue check is given at the end of this post.)

2) You definitely don't lose information by splitting up your sample in two. If the split is meaningful, you actually gain information because now the data in sample subset A is treated differently from sample subset B. If the two samples are split in a random or not-very-meaningful way, then this additional information is not useful (well, technically you're not gaining information so much as removing less information when doing the statistical aggregation).

Any statistic you compute over the entire sample set should be identical whether the sample is split or not.
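To make the eigenvalue check in point 1) concrete, here is a minimal sketch (Python/NumPy) using a made-up, nearly degenerate 2-parameter Fisher matrix; the numbers are purely illustrative:

```python
import numpy as np

# Toy 2-parameter Fisher matrix with a near-perfect degeneracy
# (the numbers are made up for illustration only).
F = np.array([[100.00, 99.00],
              [ 99.00, 98.01]])

eigvals, eigvecs = np.linalg.eigh(F)   # F is symmetric, so eigh is appropriate
print("eigenvalues:", eigvals)

# An eigenvalue at (or numerically indistinguishable from) zero means the data
# do not constrain that linear combination of parameters at all; dropping or
# merging that direction loses essentially no information.
tol = 1e-8 * eigvals.max()
for lam, vec in zip(eigvals, eigvecs.T):
    if lam < tol:
        print("degenerate direction:", vec, "(eigenvalue ~ 0)")
```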
 
  • #3
@kimbyd thanks for your answer.

I would like to know: what do you mean by "if the split is meaningful"? What are the criteria for a meaningful split? I don't understand this very well.

In fact, I was wondering whether merging the bias of the second column (bias of the first population of galaxies) with the fourth column (third population of galaxies) is "physically correct" or "statistically correct", since these are not the same type of galaxy (so the bias may be different).

##\begin{array}{cccc}
\text{redshift} & \text{population 1} & \text{population 2} & \text{population 3}\\
0.1750 & 1.1134 & 0 & 0\\
0.4250 & 1.7983 & 0 & 0\\
0.6500 & 0 & 1.4470 & 0.7150\\
0.8500 & 0 & 1.4194 & 0.7014\\
1.0500 & 0 & 1.4007 & 0.6921\\
1.2500 & 0 & 0 & 0.6856\\
1.4500 & 0 & 0 & 0.6810\\
1.6500 & 0 & 0 & 0.6776
\end{array}##

Indeed, my teacher tells me that we must avoid, as far as possible, zero values for the bias: a value of zero is a value (almost) like any other in the statistical sense, and it could by itself bring information in the Fisher sense. That is what I do not understand.

Regarding the matter of splitting a sample into 2 sub-samples and cross-correlating the 2 sub-samples: statistical information cannot be lost compared to a simple statistical study of the starting sample. Do we agree?

If you have a link or document that proves this statement, it would be really nice if you could send it to me.

In this regard, my teacher told me that under certain conditions (I think he means the values themselves), the 2nd column (which has only 2 values) can cause a loss of information in the Fisher sense, but as we bring in more values we gain more than we lose, so in the end we still have a net gain.

The situation could very well be symmetrical: it could be the values of the 3rd population (4th column above) that cause the loss of information (and then the gain, leading to a net final gain as already mentioned above).

I have some notions about entropy, especially Shannon entropy, with which I draw a parallel here: if the information is redundant, that is to say always more or less the same, then the information that we can extract from it is weak. Is it the same here for the fusion of columns?

In conclusion, that would mean that depending on whether the difference between the values of the 2 columns is big or not, there will or will not be a significant gain, but never a loss.

These topics are very interesting but a little technical, even if the Fisher formalism is somewhat like the theory of entropy or information.
 
  • #4
Meaningful would be a statement that population 1 is measuring something fundamentally different from population 2. If galaxies are separated into population 1 and 2 randomly, or by some arbitrary criteria such as, "The north half of the sample goes in population 1, south half in population 2," then the split is not meaningful.
 
  • #5
To answer a bit more:
1) Splitting the sample doesn't lose any fundamental information, but can reduce numerical accuracy. Smaller sample sets tend to be more prone to numerical inaccuracies in this kind of analysis.
2) The main problem with Fisher analysis is large differences in eigenvalues. If some eigenvalues are very small compared to others, then that represents a partial degeneracy (or a complete degeneracy if the eigenvalue is zero). Degeneracies cause numerical errors. Adding more data, especially if that data has different statistical properties (e.g. a population of galaxies at a different set of redshifts), can resolve these degeneracies.
3) To see how much information is lost when fusing columns, you need only check the eigenvalues between the two columns. This can be done by reducing the matrix to a 2x2 including only the columns you wish to fuse. If one eigenvalue is much much larger than the other, then fusing the two columns loses little but increases your numerical accuracy a lot. If the two eigenvalues are similar in magnitude, then fusing them loses substantial information and doesn't benefit you much numerically.

To get a handle on how much larger one eigenvalue should be than another, computers typically handle eigenvalue ratios of approximately ##10^{10}## without much inaccuracy creeping in (assuming double precision floating point). From the other end, as the eigenvalues are inverse variances, the square root of the ratio represents the relative constraining power between the two. For example, if one eigenvalue is ##10,000## and the other eigenvalue is ##1##, then the larger eigenvalue contributes a hundred times as much to the final result as the other.
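As a rough sketch of point 3 above (Python/NumPy; the fuse_check helper and the 3x3 matrix are made up purely for illustration):

```python
import numpy as np

def fuse_check(F, i, j):
    """Diagnostic for how much information would be lost by fusing
    columns i and j of the Fisher matrix F, as described above."""
    block = F[np.ix_([i, j], [i, j])]   # the 2x2 sub-block (xx, xy, yx, yy)
    lam = np.linalg.eigvalsh(block)     # eigenvalues, ascending order
    ratio = lam[1] / lam[0]
    print("eigenvalues:", lam)
    # Very uneven eigenvalues (ratio up to ~1e10 in double precision)
    # -> fusing loses little information and helps numerical stability.
    # Comparable eigenvalues -> fusing discards real information.
    print("eigenvalue ratio:", ratio)
    # Eigenvalues are inverse variances, so sqrt(ratio) compares the
    # constraining power of the two directions.
    print("relative constraining power ~", np.sqrt(ratio))

# Made-up 3x3 Fisher matrix, checking whether columns 0 and 1 could be fused:
F = np.array([[2.0e4, 5.0e3, 1.0e3],
              [5.0e3, 1.3e3, 2.5e2],
              [1.0e3, 2.5e2, 8.0e1]])
fuse_check(F, 0, 1)
```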
 
  • #6
@kimbyd, thanks for your answer

kimbyd said:
1) Splitting the sample doesn't lose any fundamental information, but can reduce numerical accuracy. Smaller sample sets tend to be more prone to numerical inaccuracies in this kind of analysis.

When you talk about "doesn't lose any fundamental information", do you mean:

1) Shannon entropy (i.e. with discrete values, since in my case I manipulate discrete values like cosmological biases or galaxy densities)? I mean, do you consider in this phrase the definition of Shannon entropy:

##H_{b}(X)=-\mathbb{E}[\log _{b}{P(X)}]=\sum _{i=1}^{n}P_{i}\log _{b}\left({\dfrac{1}{P_{i}}}\right)##)
with the logarithm expressed in base ##b##.

OR

2) or do you mean the definition of (Fisher) information, i.e. ##\mathcal{I}(\theta)=\mathrm{E}\left\{\left.\left[{\dfrac{\partial}{\partial\theta}}\ln L(X;\theta)\right]^{2}\right|\theta\right\}##?
With this formula, I can also work with discrete data values.

3) Maybe the two are equivalent if I take base 10 for the Shannon entropy: if this is the case, it would be interesting to see the demonstration (I don't know if I will personally manage to perform this demonstration, so any link or remark is welcome).

4) By the way, in the definition of information (question 2), is ##\ln L(X;\theta)## the log of the likelihood function, which represents the data ##X## given a model parameter ##\theta##?

Indeed, it is noted on Wikipedia that ##L(X;\theta)## is the probability law of ##X##: isn't ##\ln L(X;\theta)## rather the logarithm of the product of the PDFs (probability density functions) of parameter ##\theta##, with the data values ##X## inserted into each factor ##f_X(\theta)##? So that would be, or be equal to, the likelihood function, wouldn't it?

5)
kimbyd said:
3) To see how much information is lost when fusing columns, you need only check the eigenvalues between the two columns. This can be done by reducing the matrix to a 2x2 including only the columns you wish to fuse. If one eigenvalue is much much larger than the other, then fusing the two columns loses little but increases your numerical accuracy a lot. If the two eigenvalues are similar in magnitude, then fusing them loses substantial information and doesn't benefit you much numerically.

Which "matrix reduced" are you considering : the 2x2 block in Total Fisher Matrix (with all parameters) ? I don't understand how to build a square matrix with 2 vectors (i.e the 2 columns that I want to fuse) , if you could explain more precisely...

Regards, and sorry if my English is not perfect; I hope you will understand my 5 questions correctly.
 
  • #7
I'm not attempting to argue information in any formal sense, though I'm sure that would also be possible. What I'm pointing out is that if you combine two populations, you lose information about which observable is in which population. If the division between the populations carries no meaning (e.g. random assignment into the two populations), then nothing of value is lost.

As to your last question, I mean that you can evaluate the result of combining two columns of an existing Fisher matrix by taking only the four elements that are shared between the two columns (xx, xy, yx, yy). Take those four as a 2x2 matrix, diagonalize it, and see what the eigenvalues are. Similar eigenvalues mean that the two populations are providing different information. Very different eigenvalues mean that they both contain mostly the same information.
 
  • #8
@kimbyd

Sorry, I didn't understand the meaning of your sentence (in bold):

you lose information about which observable is in which population. If the division between the populations carries no meaning (e.g. random assignment into the two populations), then nothing of value is lost.

So finally, could we say that the Shannon entropy, starting from a single sample, can never decrease if we add another sample? Since even if the combination is randomly performed or carries no meaning, we keep at least the information contained in the single sample, so we can never lose information. Is that right?

If this is the case, it makes me think of the irreversibility of phenomena in thermodynamics and the fact that, in the Universe, entropy is always increasing because, as it becomes more and more homogeneous, the number of configurations in the distribution of galaxies keeps increasing. But maybe this comparison is not correct...

Kind regards, and thanks for your help.
 
  • #9
fab13 said:
Sorry, I didn't understand the sense of your phrase (in bold) :
you lose information about which observable is in which population. If the division between the populations carries no meaning (e.g. random assignment into the two populations), then nothing of value is lost.
An example of what I think @kimbyd means is this. Measure the average height of a group of adults. Separate them randomly into two groups and you'd expect to get the same average height for both groups, plus or minus sampling error. Separate the people into men and women, however, and you'd expect to get different heights. Furthermore, the variance on each mean is likely to be tighter than for the random groups. This is because you've made a split which had meaning for the measure you were making.
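A quick toy simulation of this (Python/NumPy, with made-up height distributions) shows the effect numerically:

```python
import numpy as np
rng = np.random.default_rng(0)

# Made-up toy population: women ~ N(165, 7) cm, men ~ N(178, 7) cm.
women = rng.normal(165, 7, 500)
men = rng.normal(178, 7, 500)
everyone = np.concatenate([women, men])

# Random (meaningless) split: both halves estimate the same mixed mean.
rng.shuffle(everyone)
half_a, half_b = everyone[:500], everyone[500:]
print("random split means:   ", half_a.mean(), half_b.mean())
print("random split std devs:", half_a.std(ddof=1), half_b.std(ddof=1))

# Meaningful split (by sex): the two means differ, and each group's spread
# is tighter than the mixed sample's, so each mean is better constrained.
print("by-sex means:   ", women.mean(), men.mean())
print("by-sex std devs:", women.std(ddof=1), men.std(ddof=1))
print("mixed std dev:  ", everyone.std(ddof=1))
```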
 

1. What is the Fisher formalism used for in forecasting?

The Fisher formalism is a mathematical framework used to forecast how precisely a set of model parameters can be estimated from a planned measurement. It is widely used in statistics and data analysis, and in cosmology in particular, to assess the expected accuracy and reliability of parameter constraints before the data are taken.

2. How does the Fisher formalism measure information gain or loss?

The Fisher matrix is built from the expected curvature of the log-likelihood with respect to the model parameters. Inverting it gives a forecast of the parameter covariance matrix: smaller forecast variances mean more information about the parameters, so changes in these variances quantify the information gained or lost when data are added, merged, or split.

3. Can the Fisher formalism be applied to any type of forecasting?

Yes, the Fisher formalism can be applied to any forecasting problem for which a likelihood of the data given the model parameters can be written down and differentiated. It is commonly used in cosmology and in many other fields.

4. How can the Fisher formalism help improve forecasting accuracy?

By revealing which parameters (or combinations of parameters) are poorly constrained or degenerate, a Fisher analysis shows where additional or complementary data would help most, and so can guide the design of a survey or experiment toward more accurate parameter estimates.

5. Are there any limitations to using the Fisher formalism in forecasting?

One limitation of the Fisher formalism is that it assumes the likelihood is approximately Gaussian around the fiducial parameter values, so it provides only a lower bound (the Cramér-Rao bound) on the achievable parameter uncertainties. It can also be numerically unstable when the matrix is nearly degenerate, and it does not account for systematic errors in the data or in the model.
