Statistics with the Fisher formalism: loss or gain of information with data cross-correlations


I am currently working on the Fisher formalism, which is part of the more general theory of information. My problem concerns estimating cosmological parameters from input data with the Fisher formalism, and the recipes for building a Fisher matrix. The context is astrophysics, but the question can apply to many subjects.

As it is more statistics than astrophysics, I posted it here on the mathematics Stack Exchange forum.

Here is a summary: the input data consists of 4 columns. The first gives the redshift of the galaxies (i.e. their distance), and each of the other 3 gives the bias of a given galaxy type (by definition, the linear factor between the local density of matter and the global density of matter in the Universe). There are therefore 3 types of galaxies and one value per redshift (8 redshifts in all), so I have a table of 8x3 bias values.

1) First part: I am now trying to cross the data to extract additional information. For example, for the first type of galaxy only the first 2 biases are non-zero (i.e. at the first 2 redshifts), and for the third type I have 6 non-zero values, at the 6 redshifts above those two.

Here is the file of biases for the 3 populations (b1, b2, b3) as a function of redshift (first column):

# z              b1               b2               b3
1.7500000000e-01 1.1133849956e+00 0.0000000000e+00 0.0000000000e+00
4.2500000000e-01 1.7983127401e+00 0.0000000000e+00 0.0000000000e+00
6.5000000000e-01 0.0000000000e+00 1.4469899900e+00 7.1498329000e-01
8.5000000000e-01 0.0000000000e+00 1.4194157200e+00 7.0135835000e-01
1.0500000000e+00 0.0000000000e+00 1.4006739400e+00 6.9209771000e-01
1.2500000000e+00 0.0000000000e+00 0.0000000000e+00 6.8562140000e-01
1.4500000000e+00 0.0000000000e+00 0.0000000000e+00 6.8097541000e-01
1.6500000000e+00 0.0000000000e+00 0.0000000000e+00 6.7756594000e-01

My teacher suggested merging the first bias column (the first type of galaxy) with the third one (the third type of galaxy), so as to obtain a single vector containing only non-zero bias values. This way, I simulate a "single population" treatment.

From a statistical point of view, will there be a loss or a gain of information if I merge these 2 columns? The problem seems rather complex, because everything depends on the values of the data.
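
A minimal sketch of this merge, assuming it simply means taking b1 wherever it is non-zero and b3 everywhere else. The use of NumPy and the variable names (`table`, `merged`) are illustrative assumptions, not part of the original pipeline:

```python
import numpy as np

# Bias table from the post: columns are z, b1, b2, b3.
table = np.array([
    [0.175, 1.1133849956, 0.0,          0.0],
    [0.425, 1.7983127401, 0.0,          0.0],
    [0.650, 0.0,          1.4469899900, 0.71498329],
    [0.850, 0.0,          1.4194157200, 0.70135835],
    [1.050, 0.0,          1.4006739400, 0.69209771],
    [1.250, 0.0,          0.0,          0.68562140],
    [1.450, 0.0,          0.0,          0.68097541],
    [1.650, 0.0,          0.0,          0.67756594],
])
z, b1, b2, b3 = table.T

# Merge populations 1 and 3: keep b1 where it is defined (non-zero),
# otherwise fall back to b3. This works here because b1 and b3 are
# never non-zero at the same redshift.
merged = np.where(b1 != 0.0, b1, b3)

for zi, bi in zip(z, merged):
    print(f"z = {zi:.3f}  merged bias = {bi:.8f}")
```

With this table the merged vector has one non-zero bias per redshift, so the 8 bins can be processed as if they belonged to a single population. Note that b2 is ignored entirely in this sketch.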

2) Second part: Another point of view suggested by my teacher: if I take a sample and split it into 2 parts, and then cross-correlate the data between the 2 resulting subsets, will I gain or lose information from a statistical point of view, i.e. in the accuracy of the parameters that I extract from the cross-correlation between the 2 subsets? (In my case, i.e. galaxy biases and the Fisher formalism, I mean the constraints that I get after building my Fisher matrix and inverting it.)

He thinks that, at first sight, I cannot lose information (which seems intuitive, because cutting a sample in 2 is not a loss of information per se), but he says that everything depends on whether or not I know precisely the ratio of the biases between the 2 subsamples. I did not quite understand this notion of the ratio between the biases of the 2 subsets.
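
One possible reading of this remark, stated here as an assumption and not necessarily what was meant: treat the ratio of the two biases as an extra (nuisance) parameter $r$ alongside a parameter of interest $\theta$. For a positive-definite 2x2 Fisher matrix with entries $F_{\theta\theta}$, $F_{\theta r}$, $F_{rr}$ (symbols introduced only for illustration), the standard Fisher results for a known versus a marginalized $r$ are:

```latex
\sigma_\theta^{\,\text{known } r} = \frac{1}{\sqrt{F_{\theta\theta}}},
\qquad
\sigma_\theta^{\,\text{marg.}} = \sqrt{\left(F^{-1}\right)_{\theta\theta}}
= \frac{1}{\sqrt{F_{\theta\theta} - F_{\theta r}^{2}/F_{rr}}}
\;\ge\; \frac{1}{\sqrt{F_{\theta\theta}}}
```

Under this reading, not knowing the ratio can only degrade the constraint on $\theta$ or leave it unchanged, with equality when $F_{\theta r} = 0$.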

I am therefore looking for information on this problem. Maybe statisticians on this forum will be able to help me with this cross-correlation technique, and with the question of whether one gains or loses information by bringing together several sources of information.

I think that the gain or loss of information will be a function of the redundancy of the data (this is related to Shannon entropy, I think, isn't it?).

3) Third part: I could also cross data between 2 columns that overlap (2 values for each redshift), but I think this is a different problem from a statistical point of view. By the way, at the beginning I described "data crossing" as the merging of 2 vectors, but "cross-correlation" is rather defined for overlapping values, right?

However, in both cases, we cross data, in a certain way.

For the moment, in my algorithm, I treat the first 2 values of the 1st type of population, the next 3, which overlap between the 2nd and 3rd types, and the last 3 of the 3rd type, which makes 8 values in total (i.e. 8 redshifts): so there are 2 "auto-spectra" and 1 overlapped spectrum.
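
The split into "auto" and "overlapped" bins can be read directly off the bias table. The sketch below is only an illustration of that bookkeeping, assuming a zero bias means "population absent in this bin"; the names `groups` and `active` are hypothetical and not taken from the actual algorithm:

```python
import numpy as np

z = np.array([0.175, 0.425, 0.65, 0.85, 1.05, 1.25, 1.45, 1.65])
# Columns are b1, b2, b3; a zero means the population is absent in that bin.
bias = np.array([
    [1.1133849956, 0.0,          0.0],
    [1.7983127401, 0.0,          0.0],
    [0.0,          1.4469899900, 0.71498329],
    [0.0,          1.4194157200, 0.70135835],
    [0.0,          1.4006739400, 0.69209771],
    [0.0,          0.0,          0.68562140],
    [0.0,          0.0,          0.68097541],
    [0.0,          0.0,          0.67756594],
])

# Group redshift bins by the set of populations present (1-based labels).
groups = {}
for zi, row in zip(z, bias):
    active = tuple(int(i) + 1 for i in np.flatnonzero(row))
    groups.setdefault(active, []).append(float(zi))

for active, bins in groups.items():
    kind = "auto" if len(active) == 1 else "overlapped"
    print(f"populations {active}: {kind}, z bins = {bins}")
```

For this table the grouping gives population 1 alone in 2 bins, populations 2 and 3 together in 3 bins, and population 3 alone in 3 bins, which matches the 2 + 3 + 3 = 8 breakdown described above.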

I measure the information gain that I have been talking about since the beginning of this post by computing constraints: inverting the Fisher matrix gives me the covariance matrix, and therefore the variances and correlations of the parameters that I want to estimate. The smaller the standard deviations, the higher the information gain.

Your advice or suggestions on this issue would be valuable and will help me better understand the logic of this "data cross-correlation" matter.
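
As a concrete illustration of this step, here is a minimal sketch that turns a Fisher matrix into marginalized 1-sigma errors and a correlation matrix. The 2x2 matrix is made up purely for the example, and the function name `constraints_from_fisher` is hypothetical:

```python
import numpy as np

def constraints_from_fisher(fisher):
    """Return (sigma, corr): marginalized 1-sigma errors and correlation matrix."""
    cov = np.linalg.inv(fisher)          # covariance = inverse Fisher matrix
    sigma = np.sqrt(np.diag(cov))        # marginalized standard deviations
    corr = cov / np.outer(sigma, sigma)  # normalize covariance to correlations
    return sigma, corr

# Hypothetical Fisher matrix for two parameters (illustrative numbers only).
fisher = np.array([[4.0, 1.0],
                   [1.0, 2.0]])

sigma, corr = constraints_from_fisher(fisher)
print("sigma:", sigma)
print("correlation:", corr[0, 1])
```

Comparing `sigma` between two analyses (for example merged versus separate populations) is then one way to quantify which configuration carries more information about the parameters.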

Any help is welcome.

PS: If the topic seems to be on the wrong forum, feel free to move it to the Astrophysics/Cosmology forum, but I think this is really a mathematical issue, since it is about statistics.

Regards