Amount of data vs. degrees of freedom in fit

  • #1
Niles

Main Question or Discussion Point

Hi

I am struggling with a problem here. I have 6 data points, and I have derived the solution of a model that I believe should describe the behavior of the data. Now I am trying to fit the parameters of that solution to the 6 data points.

The model contains 5 degrees of freedom, all of which are known, but not very precisely. When I fit the expression to the data, I get very large standard deviations on my parameters; in addition, they are not at all close to their expected values.

Naturally, it is entirely possible that my model is simply wrong. However, I also doubt how much value I can assign to the fit. Visually it looks good, but the reduced χ² ≫ 1.

My question is: is it possible to be in a situation where the data points are so few that the statistics are simply too poor to determine so many degrees of freedom?
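For concreteness, here is a minimal sketch of the kind of fit I mean (the model, data, and uncertainties below are invented stand-ins, and I use scipy.optimize.curve_fit here, though the actual fitting routine may differ):

```python
import numpy as np
from scipy.optimize import curve_fit

# Hypothetical 5-parameter model -- a stand-in for the actual solution,
# which is not specified here. Data and errors are invented too.
def model(x, a, b, c, d, e):
    return a * x**4 + b * x**3 + c * x**2 + d * x + e

x = np.array([0.5, 1.0, 1.5, 2.0, 2.5, 3.0])
y = np.array([2.1, 1.4, 1.3, 1.7, 2.6, 4.0])
sigma = np.full_like(y, 0.2)          # assumed measurement uncertainties

popt, pcov = curve_fit(model, x, y, sigma=sigma, absolute_sigma=True)
perr = np.sqrt(np.diag(pcov))         # 1-sigma parameter uncertainties

# With 6 points and 5 parameters only 1 degree of freedom is left, so
# the reported uncertainties are typically huge and highly correlated.
for name, p, e in zip("abcde", popt, perr):
    print(f"{name} = {p:8.3f} +/- {e:.3f}")
```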


Thanks for any feedback in advance.

Best,
Niles.
 

Answers and Replies

  • #2
Hurkyl
My question is: is it possible to be in a situation where the data points are so few that the statistics are simply too poor to determine so many degrees of freedom?
Of course. More data means more precision, so less data means less precision. Honestly, I don't think you have enough data to confidently fit one degree of freedom, let alone 5. (But then, I don't know the particulars of your situation.)
 
  • #3
Stephen Tashi
The model contains 5 degrees of freedom, all of which are known, but not very precisely.
Do you mean that the model contains 5 unknown and independent parameters?

When I fit the expression to the data, I get very large standard deviations on my parameters; in addition, they are not at all close to their expected values.
I'm not sure whether "expected values" is meant in the sense of values you'd expect to get based on expert scientific knowledge (like the known mass of a molecule), or in a statistical sense, as an average of a set of numbers.

How are you computing "standard deviations on my parameters"? After all, to do this in a straightforward way, you would need many samples of your parameters, and you only have samples of data, not samples of parameters.

There are curve-fitting software packages that purport both to find the parameters and to assign a standard deviation to them, but as I understand what that software is doing, it must make many assumptions in order to do this computation. It would be best to understand whether these assumptions apply to your problem.
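As a sketch of what such packages typically do (my reconstruction under the usual assumptions, not a description of any specific package): linearize the model at the best-fit point and propagate an assumed independent Gaussian error on each point through the Jacobian.

```python
import numpy as np

def approx_param_errors(residuals, jacobian):
    """Linearized error estimate used by many least-squares routines.

    Assumes independent Gaussian errors and a locally linear model --
    exactly the kind of assumptions mentioned above.
    """
    n, p = jacobian.shape
    dof = n - p                       # here: 6 - 5 = 1, dangerously small
    s2 = np.sum(residuals**2) / dof   # residual variance estimate
    cov = s2 * np.linalg.inv(jacobian.T @ jacobian)
    return np.sqrt(np.diag(cov))      # "standard deviations on parameters"
```

With only one degree of freedom left, the residual variance estimate itself is extremely noisy, which is one reason the reported standard deviations can come out huge.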
 
  • #4
Niles
Hi

Thanks for the replies.

Do you mean that the model contains 5 unknown and independent parameters?
Yes, these I use as degrees of freedom when doing a least-squares fit.


I'm not sure whether "expected values" is meant in the sense of values you'd expect to get based on expert scientific knowledge (like the known mass of a molecule), or in a statistical sense, as an average of a set of numbers.
The first case.


How are you computing "standard deviations on my parameters"? After all, to do this in a straightforward way, you would need many samples of your parameters, and you only have samples of data, not samples of parameters.

There are curve-fitting software packages that purport both to find the parameters and to assign a standard deviation to them, but as I understand what that software is doing, it must make many assumptions in order to do this computation. It would be best to understand whether these assumptions apply to your problem.
The standard deviations I referred to are the results given by a least-squares fit. But I think you are both right; my problem is simply that I have far too little data.
 
  • #5
chiro
One way people deal with having very little data is to use priors and Bayesian analysis to do statistical inference (the medical field runs into this problem frequently).

In saying the above, though, small data sets require a good deal of expert knowledge about the data, to make sure that the little data you do have is as useful as it can be.
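As a toy illustration of the Bayesian route (a one-parameter example with invented numbers; a real problem needs a model-specific likelihood and an honestly elicited prior):

```python
import numpy as np

# Toy example: infer a single parameter 'theta' from 6 noisy points,
# with an informative Gaussian prior encoding expert knowledge.
data = np.array([1.2, 0.9, 1.1, 1.4, 0.8, 1.0])   # invented
sigma = 0.3                                        # assumed noise level

theta = np.linspace(0.0, 2.5, 2001)                # grid for the posterior
log_prior = -0.5 * ((theta - 1.0) / 0.2) ** 2      # prior: theta ~ N(1.0, 0.2)
log_like = -0.5 * np.sum((data[:, None] - theta) ** 2, axis=0) / sigma**2
log_post = log_prior + log_like

post = np.exp(log_post - log_post.max())
post /= np.trapz(post, theta)                      # normalize on the grid

mean = np.trapz(theta * post, theta)
sd = np.sqrt(np.trapz((theta - mean) ** 2 * post, theta))
print(f"posterior: {mean:.3f} +/- {sd:.3f}")
```

With only six points the prior visibly pulls the estimate; that is both the point and the danger of this approach.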

But you have a ridiculously small amount of data (even compared with situations known for scarce data, such as a data set of surgical outcomes for some exotic niche specialty), so my question for you is two-fold:

1) Why do you have this amount of data? and
2) Can you collect more data if at all possible?
 
  • #6
mfb
Does your fit program give you a p-value for the fit quality? If that is too small even with just 6 data points, forget the fit.
If not, you can use a table to check your (χ², ndf) value.

In general, with less data it is easier to get high p-values, as the fit is not so sensitive to details of the real distribution.

I get very large standard deviations on my parameters; in addition, they are not at all close to their expected values.

Naturally, it is entirely possible that my model is simply wrong. However, I also doubt how much value I can assign to the fit. Visually it looks good, but the reduced χ² ≫ 1.
Three signs of a bad model or some other error.
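For reference, the table lookup can also be done directly with the χ² survival function, e.g. in scipy (numbers invented):

```python
from scipy.stats import chi2

chisq = 9.3   # invented fit result
ndf = 1       # 6 data points - 5 fitted parameters

# Probability of a chi-square at least this large if the model is correct.
p_value = chi2.sf(chisq, ndf)
print(f"p = {p_value:.4f}")   # small p => poor fit (or bad luck)
```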
 
  • #7
Niles
Does your fit program give you a p-value for the fit quality? If that is too small even with just 6 data points, forget the fit.
If not, you can use a table to check your (χ², ndf) value.
Thanks. However, I don't understand the quoted part. My fit routine does give me the p-value. In my case I am after a small p-value, so that I know the model is correct; that must be true regardless of the number of data points, I would say. Maybe I have misunderstood something?

Best,
Niles.
 
  • #8
mfb
?
The p-value is the probability that random data, generated from your fitted model, would agree with the model as well as the actual data does, or worse.
If you have a small p-value (for example 0.0001), it means that either (a) your model is wrong, or (b) you were extremely unlucky and got random deviations that occur just once in 10,000 repetitions of the experiment. While (b) is possible, (a) looks more likely.

If your model is correct, you would expect a p-value somewhere around 0.5. It might be 0.05, it might be 0.95, but anything below 0.01 is very suspicious.
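This definition can be checked by brute force (a sketch with an invented model prediction and assumed uncertainties): generate many fake datasets from the fitted model and count how often they look at least as bad as the real data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented example: data, model prediction, and uncertainties.
y_obs = np.array([1.1, 2.3, 2.8, 4.2, 4.9, 6.3])
y_model = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
sigma = np.full(6, 0.3)

chisq_obs = np.sum(((y_obs - y_model) / sigma) ** 2)

# Fake experiments drawn from the model itself.
fake = y_model + sigma * rng.standard_normal((100_000, 6))
chisq_fake = np.sum(((fake - y_model) / sigma) ** 2, axis=1)

# Fraction of repetitions that look as bad as the data or worse.
p_mc = np.mean(chisq_fake >= chisq_obs)
print(f"Monte Carlo p-value: {p_mc:.4f}")
```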
 
  • #9
Niles
Then I may have misunderstood the concept of the p-value. I thought small p-values (p < α) are evidence against the null hypothesis, so the result is statistically significant.
 
  • #10
Then I may have misunderstood the concept of the p-value. I thought small p-values (p < α) are evidence against the null hypothesis, so the result is statistically significant.
I'd say you're right. I don't know what mfb is talking about.
 
  • #11
mfb
Small p-values of a fit are evidence that your fit function (the "null hypothesis") is wrong, and that the difference between data and fit function is significant.
That is what I said.
 
  • #12
Small p-values of a fit are evidence that your fit function (the "null hypothesis") is wrong, and that the difference between data and fit function is significant.
That is what I said.
Alright, I see what you mean, but it does depend on what exactly you are testing for. There are other tests where a low p-value indicates you can reject the null hypothesis in favor of your model.
 
  • #13
D H
Small p-values of a fit are evidence that your fit function (the "null hypothesis") is wrong, and that the difference between data and fit function is significant.
That is what I said.
That is exactly backwards. The null hypothesis is that the data are just random numbers drawn from a hat. A small p-value is evidence that this null hypothesis should be rejected.


In my case I am after a small p-value, so that I know the model is correct; that must be true regardless of the number of data points, I would say. Maybe I have misunderstood something?
There's a big problem with using the p-value for your analysis. Suppose you dropped one of those data points, leaving five data points for five unknowns. You will get a perfect fit (zero residual) if the matrix isn't singular. Zero residual means a p-value of zero. Does this mean your model is good? Absolutely not. It might be a perfect fit, but it is perfectly useless. You are going to get similar problems when you only have a few more data points than unknowns. The p-value statistic is meaningless in these cases.

There are a number of other tricks beyond the p-value if you know the errors/uncertainties in the individual data points. Note well: any regression will be of dubious value if you don't know those uncertainties. One thing you can do is perform a principal component analysis (PCA); this will give you a fairly good idea of how many statistically meaningful degrees of freedom you have.
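A sketch of that idea (with an invented 6×5 predictor matrix standing in for the real design matrix):

```python
import numpy as np

rng = np.random.default_rng(1)

# Invented stand-in for the 6 observations of 5 predictor variables.
X = rng.standard_normal((6, 5))
Xc = X - X.mean(axis=0)            # center the columns

# Singular values play the role of principal-component scales.
s = np.linalg.svd(Xc, compute_uv=False)
explained = s**2 / np.sum(s**2)

# Components explaining almost no variance are not meaningful dof.
for i, frac in enumerate(explained, 1):
    print(f"component {i}: {frac:.1%} of variance")
```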

Another thing you can do is build the model up from scratch. Build a set of five one-variable models and pick the one that best explains the data. If none of the variables does a good job, you don't have a model (here the p-value might be of help). Let's assume you do have something meaningful. Regress the remaining four variables against the one that gives the best fit to yield four new variables, and regress the residual from the first regression against each of these four new variables. Pick the new variable that does the best job of explaining the residual from the first regression. Repeat until either you have run out of variables or the decrease in the residual is statistically insignificant.

You can also take the opposite approach: start with the kitchen-sink model (all variables tossed into the mix) and repeatedly delete variables from the mix until you reach the point where a deletion would be statistically significant.
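A sketch of this build-up procedure (ordinary least squares at each step; the stopping rule here is a simple relative-improvement threshold rather than a formal significance test):

```python
import numpy as np

def forward_select(X, y, tol=1e-3):
    """Greedy forward selection: repeatedly add the column that most
    reduces the residual sum of squares; stop when the improvement
    is negligible. A sketch, not a rigorous significance test."""
    remaining = list(range(X.shape[1]))
    chosen = []
    rss_prev = np.sum((y - y.mean()) ** 2)
    while remaining:
        best = None
        for j in remaining:
            cols = chosen + [j]
            beta, rss, *_ = np.linalg.lstsq(X[:, cols], y, rcond=None)
            r = rss[0] if rss.size else np.sum((y - X[:, cols] @ beta) ** 2)
            if best is None or r < best[1]:
                best = (j, r)
        j, rss_new = best
        if rss_prev - rss_new < tol * rss_prev:  # negligible improvement
            break
        chosen.append(j)
        remaining.remove(j)
        rss_prev = rss_new
    return chosen

# Invented usage: six observations, five candidate variables.
rng = np.random.default_rng(2)
X = rng.standard_normal((6, 5))
y = 2.0 * X[:, 1] + 0.1 * rng.standard_normal(6)
print(forward_select(X, y))  # typically picks column 1 first
```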
 
  • #14
mfb
That is exactly backwards. The null hypothesis is that the data are just random numbers drawn from a hat. A small p-value is evidence that this null hypothesis should be rejected.
You cannot reject the hypothesis "the data are drawn from some unknown distribution"; it is consistent with ALL datasets.
You can reject specific distributions, if the p-value (with that distribution as the model) is small.

Suppose you dropped one of those data points, leaving five data points for five unknowns.
Don't do that.

You are going to get similar problems when you only have a few more data points than unknowns. The p-value statistic is meaningless in these cases.
It is not meaningless: while a high p-value does not mean that your model is right, a small p-value means that your model is probably wrong.
 
  • #15
D H
a small p-value means that your model is probably wrong.
This is wrong, very wrong. You are either using a very non-standard definition of the p-value, or you completely misunderstand what it means. A small p-value, typically < 0.05, is a prerequisite for statistical significance. You are doing something very wrong if, using the standard definition of the p-value, you are rejecting models with a small p-value and accepting models with a high p-value.
 
  • #16
mfb
A small p-value, typically < 0.05, is a prerequisite for statistical significance.
Statistical significance of a deviation from the model you used to calculate the p-value.
If you have that significant deviation, the model you used to calculate the p-value might be wrong (or you had bad luck).

0.05... :D I'm a particle physicist; there, even something beyond 3 standard deviations (roughly p ≤ 0.003) is just considered a fluctuation (or an error in the analysis).
 
  • #17
I believe mfb is referring to the F-test for lack of fit, where the null hypothesis actually is that your model is correct, and a low p-value means you reject the model.

http://en.wikipedia.org/wiki/Lack-of-fit_sum_of_squares

This is a valid use of p-values, despite being somewhat reversed from how they are usually used.
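For concreteness, a sketch of that lack-of-fit statistic (the test needs replicate measurements at the same x; the data and the two-parameter straight-line fit below are invented):

```python
import numpy as np
from scipy.stats import f as f_dist

# Invented data with replicates at each x, and an assumed fitted model.
x = np.array([1.0, 1.0, 2.0, 2.0, 3.0, 3.0])
y = np.array([1.1, 1.3, 2.0, 2.4, 2.8, 3.3])
y_hat = 1.0 + 0.7 * x            # assumed fitted model, p = 2 parameters

groups = [y[x == xv] for xv in np.unique(x)]
ss_pe = sum(np.sum((g - g.mean()) ** 2) for g in groups)   # pure error
ss_res = np.sum((y - y_hat) ** 2)
ss_lof = ss_res - ss_pe                                    # lack of fit

n, p, m = len(y), 2, len(groups)
F = (ss_lof / (m - p)) / (ss_pe / (n - m))
p_value = f_dist.sf(F, m - p, n - m)   # small p => reject the model
print(f"F = {F:.2f}, p = {p_value:.3f}")
```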
 
  • #18
mfb
Maybe it is clearer with an example. Consider the Higgs boson search (the ATLAS experiment as an example; CMS has similar plots), and especially the decay channel Higgs to two photons (figure 4 in the pdf). The y-axis is the number of events (a) / number of weighted events (b); the x-axis is a measured parameter (the mass).

We have two hypotheses: "no Higgs, background only" (dashed line) and "Higgs + background" (solid line).
How probable is the data under the background-only hypothesis? Well, we have large deviations close to 126 GeV; the p-value is very small (especially if you care only about the region around 126 GeV).
How probable is the data under the Higgs+background hypothesis? It fits well; the p-value is something reasonable.

Figures 7 and 8 give the local p-values for the background-only hypothesis; the dips in the logarithmic plots correspond to very small p-values.
The background-only hypothesis is rejected, as its p-value is too small. As a result, the discovery of a new boson was announced.

despite being somewhat reversed from how they are usually used.
Your turn.
 
