Weird covariance matrix for linear fit

  • #1
Malamala
Hello! I really don't know much about statistics, so I am sorry if this question is stupid or obvious. I have this data: ##x = [0,1,2,3]##, ##y = [25.885,26.139,27.404,30.230]##, ##y_{err}=[1.851,0.979,2.049,6.729]##. I need to fit the following function to this data: $$y = a (x+0.5)/4.186 + b$$ So basically a straight line (the other constants can easily be absorbed into the obtained values for a and b). I did this fit using Python and I get these values: ##a = 3.78 \pm 1.70## and ##b = 24.99 \pm 0.66##, and the off-diagonal value of the covariance matrix is ##-1.03##. First of all, the y values are all equal to each other within the given errors, so I expected the value of a to also be consistent with 0, i.e. something like ##3 \pm 4##, but my value seems to be 2 sigma away from zero. Does that make sense (again, please forgive my lack of knowledge in statistics)? Also, why is the covariance -1? I remember that variables being so strongly correlated is not a good sign for a fit. Does my fit make sense? Is there anything I can do to improve it? Thank you!
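For reference, here is a minimal sketch of the fit, assuming it was done with scipy.optimize.curve_fit; with the default absolute_sigma=False the covariance matrix is rescaled by ##\chi^2/\mathrm{dof}##, which is what reproduces the numbers above:

```python
import numpy as np
from scipy.optimize import curve_fit

x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([25.885, 26.139, 27.404, 30.230])
y_err = np.array([1.851, 0.979, 2.049, 6.729])

def model(x, a, b):
    # The function to fit: y = a*(x + 0.5)/4.186 + b
    return a * (x + 0.5) / 4.186 + b

# With the default absolute_sigma=False, y_err only sets relative weights
# and the returned covariance matrix is rescaled by chi^2/dof.
popt, pcov = curve_fit(model, x, y, sigma=y_err)
a, b = popt
a_err, b_err = np.sqrt(np.diag(pcov))
print(f"a = {a:.2f} +/- {a_err:.2f}")      # a = 3.78 +/- 1.70
print(f"b = {b:.2f} +/- {b_err:.2f}")      # b = 24.99 +/- 0.66
print(f"cov(a, b) = {pcov[0, 1]:.2f}")     # cov(a, b) = -1.03
```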
 
  • #2
Hello,

First thing to do in such a case (and many others!) is to make a plot of the data:

[Figure: the data points with their error bars and the fitted lines described below]

The blue points are your data and the red lines indicate the given errors. They came with the data points and can be classified as internal errors.
(As opposed to external errors, which follow from the scatter of the data. I hope I have this right; perhaps someone has a good reference?)

I added some ornaments:

The blue line is a linear fit without weights. It yields ##y = (1.43\pm0.41) x + (25.27\pm 0.76)## (The errors are found using Excel | Data | Data Analysis | Regression ). With only four data points the error on the error estimate is some 50%, so we really should report only ##y = (1.4\pm0.4) x + (25.3\pm 0.8)##. The blue line goes through the unweighted average point ##\ (x_{\sf avg},y_{\sf avg})=(1.5,27.4)\ ## in the middle.

The purple line represents the weighted average ##y## with the dashed lines at ##\pm \sigma##. As you say, all data are consistent with ##\ y = 26.3\pm 0.9##
[Edit] I made an error in the error: I should have said the weighted average is ##\ y = 26.3\pm 0.4##, which means two points are outside ##\pm\sigma##.
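The weighted average and both flavours of its error can be computed as follows (a sketch: the "internal" error is propagated from the quoted ##y_{err}##, the "external" one is a common convention based on the weighted scatter about the mean):

```python
import numpy as np

y = np.array([25.885, 26.139, 27.404, 30.230])
y_err = np.array([1.851, 0.979, 2.049, 6.729])

w = 1.0 / y_err**2                      # inverse-variance weights
y_avg = np.sum(w * y) / np.sum(w)       # weighted average

# "Internal" error: propagated from the quoted y_err values.
err_internal = np.sqrt(1.0 / np.sum(w))
# "External" error: from the weighted scatter of the points about the mean.
n = len(y)
err_external = np.sqrt(np.sum(w * (y - y_avg)**2) / ((n - 1) * np.sum(w)))

print(f"weighted average: {y_avg:.2f}")   # 26.34
print(f"internal error:   {err_internal:.2f}")
print(f"external error:   {err_external:.2f}")
```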

Now that we have something to look at, we can start asking questions. That is to say: you can start asking yourself questions, because we don't have any idea at all how the data and the error estimates came about.

Statistics really hardly apply with so few data points. If you repeated the measurements (with e.g. ##x = -0.5, 0.5, 1.5, 2.5, 3.5##), it could happen that the new points fall exactly on a parabola through the old points. Or the new points could scatter like crazy, but still be consistent with the errors you estimate. No way to tell from outside.

In the first case the internal errors are an over-estimate and you can consider the possibility that they have something systematic in common.
In the second case the error estimates from your fit will be considerable.

Further comment: looking at the data alone, a second-order fit yields an almost perfect result. All the more reason to ask what this data represents and how the results and the error estimates came about!
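To check this, a quick unweighted quadratic fit (sketched here with numpy.polyfit) gives residuals far smaller than the quoted errors:

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([25.885, 26.139, 27.404, 30.230])

# Unweighted second-order polynomial fit: four points, one degree of freedom left.
coeffs = np.polyfit(x, y, 2)
residuals = y - np.polyval(coeffs, x)
print("max |residual|:", np.max(np.abs(residuals)))  # below 0.1, vs errors of 1-7
```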

Malamala said:
I did this fit using Python
That is about as useful as stating that you used a keyboard!

I'm not good at weighted stuff, but I made an attempt to at least visually represent your results:

Same figure with your fit result, now with a weighted average ##\ (x_{\sf avg},y_{\sf avg})=(0.99,26.34)\ ## in the middle.

[Figure: the same plot with the weighted fit result and the weighted average point added]


Malamala said:
again please forgive my lack of knowledge in statistics
Stop apologizing. You want to learn something and you make a very good effort !

Malamala said:
my value seems to be 2 sigma away from zero. Does that make sense? Also, why is the covariance -1? I remember that variables being so strongly correlated is not a good sign for a fit. Does my fit make sense? Is there anything I can do to improve it? Thank you!
As you can see, the fit result is quite consistent with the data. Note that for weighting, only the relative errors count (i.e. the errors relative to each other): a single common factor for ALL errors does not change the result. Due to the relatively large error in the 4th point, it hardly contributes (the weights are 0.29, 1.04, 0.24, 0.02). So it's hardly surprising that the first three points determine the result and the weighted average is very close to the second point.
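These weights are just the inverse variances, ##w_i = 1/\sigma_i^2##:

```python
import numpy as np

y_err = np.array([1.851, 0.979, 2.049, 6.729])
w = 1.0 / y_err**2                 # inverse-variance weights
print(np.round(w, 2))              # [0.29 1.04 0.24 0.02]
print(np.round(w / w.sum(), 2))    # relative contribution of each point
```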

If the error estimates are exactly right, about 32% of the data should lie more than one sigma away from the fit. For your data that would start to happen if you divided all the error estimates by 3 (i.e. increased all weights by a factor of 9!). The fourth point would then be at 1 ##\sigma##. More importantly, a slope ##a=0## would then look very unlikely (still, with real statistics, 5% of the data fall outside ##\pm 2\sigma##).

Now about this correlation in the estimated errors for ##a## and ##b##. I have shown these average points in the plots for a reason: linear fit lines always go through them, and the lines can be moved in two independent ways: wiggle (affecting ##a##) and shift parallel up and down (affecting ##b##). Independent, so at that point the errors are uncorrelated. Repeating the fit after subtracting 0.994 from all ##x## values should yield zero correlation. Something worth checking, and easily done.

With your original x-scale the wiggling influences the y-axis intercept much more than the shifting up and down, and that's where this -1.03 covariance comes from.
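This is easy to check: redo the fit with the regressor centered on its weighted mean (a sketch, again assuming scipy.optimize.curve_fit was used):

```python
import numpy as np
from scipy.optimize import curve_fit

x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([25.885, 26.139, 27.404, 30.230])
y_err = np.array([1.851, 0.979, 2.049, 6.729])

t = (x + 0.5) / 4.186                     # the regressor that a actually multiplies
w = 1.0 / y_err**2
t_bar = np.sum(w * t) / np.sum(w)         # weighted mean of the regressor

def centered(x, a, b):
    # Same model, but with the regressor shifted to its weighted mean,
    # so the slope and intercept errors decouple.
    return a * ((x + 0.5) / 4.186 - t_bar) + b

popt, pcov = curve_fit(centered, x, y, sigma=y_err)
print("a unchanged:", popt[0])                       # still ~3.78
print("cov(a, b) after centering:", pcov[0, 1])      # ~0
```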

(Actually I am more used to correlation coefficients, but I think your -1.03 is ##\sigma_{ab}##, so to get the correlation coefficient you'd compute ##\displaystyle{\sigma_{ab}\over \sigma_a\sigma_b}##. That way |coefficient| ##\le## 1, as desirable.)
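Plugging in the quoted numbers (assuming -1.03 is indeed ##\sigma_{ab}##, with ##\sigma_a = 1.70## and ##\sigma_b = 0.66##):

```python
import numpy as np

# Covariance matrix assembled from the values quoted in the thread.
pcov = np.array([[1.70**2, -1.03],
                 [-1.03,    0.66**2]])
sigma = np.sqrt(np.diag(pcov))
corr = pcov / np.outer(sigma, sigma)    # element-wise normalization
print("correlation coefficient:", corr[0, 1])   # about -0.92, i.e. |r| <= 1
```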
Malamala said:
Does my fit make sense? Is there anything I can do to improve it? Thank you!
Yes it does.
Best thing to do is add more points. Statistics with just a few points is very risky.
You're welcome !
 
Last edited:

1. What is a covariance matrix?

A covariance matrix is a square matrix whose diagonal entries are the variances of a set of variables and whose off-diagonal entries are the covariances between pairs of them. It is commonly used in statistics to measure the linear relationship between two or more variables.

2. Why is a covariance matrix important for linear fit?

A covariance matrix is important for linear fit because it provides information about the relationship between the independent and dependent variables. It helps to determine the strength and direction of the linear relationship, as well as the variability of the data points around the best-fit line.

3. What does it mean if the covariance matrix is weird for a linear fit?

If the covariance matrix is weird for a linear fit, it means that there may be issues with the data or the model used for the fit. This could indicate that the data is not normally distributed, the variables are not linearly related, or there are outliers present.

4. How is a weird covariance matrix identified for a linear fit?

A weird covariance matrix for a linear fit can be identified by inspecting its values: if the variances differ wildly, if an off-diagonal entry implies a correlation coefficient at or near ##\pm 1##, or if the values are inconsistent with the expected relationships in the data, then the matrix deserves a closer look.

5. Can a weird covariance matrix affect the accuracy of a linear fit?

Yes, a weird covariance matrix can affect the accuracy of a linear fit. If the matrix is not representative of the data, then the resulting linear fit may not accurately reflect the relationship between the variables. It is important to investigate and address any issues with the covariance matrix to ensure the accuracy of the linear fit.
