Best goodness of fit test for low expected values

  • Thread starter MathewsMD
In summary, the conversation discusses the difficulty of finding a good measure of goodness of fit for a histogram with a Gaussian shape. The expected values for many bins are extremely low, which breaks traditional tests such as the chi-square and G-tests. A suggestion is made to use Monte Carlo: generate artificial data sets from the model and use them to test the fit of the model to the data. The conversation also touches on the asymmetry of the data and the possibility that a different function is needed for a better fit.
  • #1
MathewsMD
Hi,

I am currently working with a histogram (with 25 bins) that looks Gaussian, and I am trying to fit a function to it and compute its goodness of fit. The function I am fitting is a Gaussian (it looks like a good fit from visual inspection), and I am treating it as my expected-value function. The expected values for some bins come out extremely small (e.g. ##10^{-126}##), and some of the observed values (the frequencies for certain bins) are 0. I have tried using fewer bins, which brings the smallest expected values up to ~##10^{-30}## (still very small), but I'd rather find a better fitting function or GoF test than modify my histogram. My observed values range from ##0## to ##50## on average, while my expected values range from ##10^{-126}## to ##50##. The Gaussian I am fitting has 3 parameters (amplitude, mean, sigma).

I have looked into the standard chi-square and G-tests for GoF, but these are not applicable with such low expected values. I also can't seem to find tests that apply to data with more than 1 degree of freedom. If you could refer me to any methods that would be applicable to my situation for measuring GoF, that would be greatly appreciated!

This is more out of curiosity, given my limited background in statistics, but with the chi-square test, if I instead compute
## \tilde{\chi}^2=\sum_{k=1}^{n}\frac{(O_k - E_k)^2}{O_k}\ ## where I've used the observed value instead of the expected value in the denominator, what is the major difference in the meaning of this new value? If ##E_k = 0##, the ##k##-th term reduces to ##O_k##, which at least stays finite, but is there any merit to this method in any regard?
 
Last edited:
  • #2
There are problems here that you simply cannot "solve" in any real sense. Your expected frequencies out on the "wings" of a Gaussian are going to be extremely low. And you are going to see variation in real samples.

A "left field" suggestion for you: try some Monte Carlo. Take your estimate of the data fitting curve. Use it to randomly generate many sets of artificial data with the same number of entries as your actual data. Then do the statistics on these sets. If you generate 1000 sets of data and bin them, for example, you can do stats on each bin. Is your real data within 1 sigma of the expected value in each bin?
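The steps above could be sketched like this in Python. All the numbers here (mean, sigma, bin edges, number of sets) are hypothetical stand-ins, since the actual fitted parameters aren't given in the thread:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical fitted parameters (the amplitude is implied by the sample size).
mu, sigma, n_points, n_bins = 10.0, 2.0, 100, 25
edges = np.linspace(mu - 5 * sigma, mu + 5 * sigma, n_bins + 1)

# Generate many artificial data sets from the fitted model and bin each one.
n_sets = 1000
counts = np.empty((n_sets, n_bins))
for i in range(n_sets):
    sample = rng.normal(mu, sigma, n_points)
    counts[i], _ = np.histogram(sample, bins=edges)

# Per-bin mean and spread under the model.
bin_mean = counts.mean(axis=0)
bin_std = counts.std(axis=0)

# A real binned histogram `observed` could then be checked bin by bin:
# within_1sigma = np.abs(observed - bin_mean) <= bin_std
```

The per-bin standard deviations give a direct feel for how much scatter the model itself predicts, which is exactly what the raw chi-square loses when the expectations are near zero.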
 
  • #3
DEvens said:
There are problems here that you simply cannot "solve" in any real sense. Your expected frequencies out on the "wings" of a Gaussian are going to be extremely low. And you are going to see variation in real samples.

A "left field" suggestion for you: try some Monte Carlo. Take your estimate of the data fitting curve. Use it to randomly generate many sets of artificial data with the same number of entries as your actual data. Then do the statistics on these sets. If you generate 1000 sets of data and bin them, for example, you can do stats on each bin. Is your real data within 1 sigma of the expected value in each bin?

Thank you for the response!

I've attached a figure that shows the histogram (it has 100 data points) and the function, f, that I am using to model the histogram's data. The 3 errors are associated with the amplitude, mean, and sigma, respectively, for f.

Yes, that seems to be the case. That's an interesting suggestion, since it would reduce the difference between the observed and expected values, making my numerator and denominator (i.e. ##(E-O)^2## and ##E##) closer in magnitude. But for my purposes, I want to ensure I am fitting my model to my data very accurately, and generating random points based on the model doesn't seem too helpful until I actually have a good model.

Once again thank you for the advice.
 

Attachments

  • Screen Shot 2015-06-16 at 1.32.24 PM.png
  • #4
The idea of the Monte Carlo is to see how good the fit is. By generating a large number of artificial data sets based on the model, and then doing statistics on those data sets, you can see what the expected variation in your bins would be. In other words, you are testing your model to see if it is really consistent with your data.

For example, your data appears to be distinctly non-symmetric. You seem to have a tail on the right side. The question is, is this really inconsistent with a Gaussian? If so, how inconsistent is it? What is the probability that a Gaussian would produce this degree of asymmetry?

There are probably statistical formulas that will give you an estimate of the probability your data is Gaussian. I never studied much stats, and what I did study I did poorly in. But you can "hum a few bars" and get by without it if you do the Monte Carlo stuff.

If you generate 1000 data sets and the average value in those right-hand bins is much smaller (relative to the 1-sigma value in each bin) than the value you have in your actual data, it tells you that you might need a different shape from simple Gaussian.
 
  • #5
You don't need a numerical goodness of fit test to tell you that those data do not fit a Gaussian (normal) distribution - the tail frequencies, particularly to the right, are much too high. Yes the central peak looks sort of normal (are you sure that the normal curve you have drawn has the right parameters - it looks a bit narrow to me as though you have taken the square root of the sample variance twice?), but if the outliers are part of the same data set they are never going to fit with any meaningful confidence.
 
  • #6
MrAnchovy said:
You don't need a numerical goodness of fit test to tell you that those data do not fit a Gaussian (normal) distribution - the tail frequencies, particularly to the right, are much too high. Yes the central peak looks sort of normal (are you sure that the normal curve you have drawn has the right parameters - it looks a bit narrow to me as though you have taken the square root of the sample variance twice?), but if the outliers are part of the same data set they are never going to fit with any meaningful confidence.

The normal curve was generated using curve_fit in Python to find the optimal parameters a, b, and c of the function ## f = ae^{- \frac{(x-b)^2}{2c^2}} ##, so it's interesting you bring that up, because I do agree it seems a bit narrow, but I have not interfered with the computation of its standard deviation (i.e. c in the equation) or any of the other parameters.

I'll look through the code to see if I input something incorrectly.
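For reference, a minimal sketch of the kind of fit described (a 3-parameter Gaussian fitted to a histogram with SciPy's curve_fit), using synthetic stand-in data since the real data isn't available here:

```python
import numpy as np
from scipy.optimize import curve_fit

def gaussian(x, a, b, c):
    # Three-parameter Gaussian: amplitude a, mean b, width c.
    return a * np.exp(-((x - b) ** 2) / (2 * c ** 2))

rng = np.random.default_rng(1)
data = rng.normal(5.0, 1.5, 100)          # stand-in for the 100 data points
counts, edges = np.histogram(data, bins=25)
centers = 0.5 * (edges[:-1] + edges[1:])

# p0 seeds the optimiser; without it, curve_fit starts from (1, 1, 1)
# and can fail to converge on histogram data.
popt, pcov = curve_fit(gaussian, centers, counts,
                       p0=[counts.max(), data.mean(), data.std()])
perr = np.sqrt(np.diag(pcov))             # 1-sigma errors on a, b, c
```

One thing worth noting: a least-squares fit to bin counts weights every bin equally, so the near-empty tail bins have little influence on the fitted width, which can contribute to a curve that looks too narrow at the peak.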
 
Last edited:
  • #7
That's not a normal curve, so no wonder it looks strange: a normal curve has only 2 free parameters, so the amplitude a is determined by the other quantities (for a histogram, by the sample size, bin width, and c). And you don't estimate a normal distribution by curve fitting.

You should look at a high school primer on statistics before you start to play with computational tools.
 
  • #8
MrAnchovy said:
That's not a normal curve, so no wonder it looks strange: a normal curve has only 2 free parameters, so the amplitude a is determined by the other quantities (for a histogram, by the sample size, bin width, and c). And you don't estimate a normal distribution by curve fitting.

You should look at a high school primer on statistics before you start to play with computational tools.

Yes, you're right, that's not a normal curve. Poor choice of words from me.

I did get the mean and standard deviation of the data set, and I could plot a Gaussian using these parameters, but this function appears to have a higher chi-square value than the one produced from the curve fitting, which is why I wanted to look at the latter instead.
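A sketch of that comparison (moment-based parameters vs. a least-squares fit), on synthetic stand-in data. The chi-square here is restricted to bins with a non-negligible expectation, since the near-zero bins would otherwise dominate:

```python
import numpy as np
from scipy.optimize import curve_fit

def gaussian(x, a, b, c):
    return a * np.exp(-((x - b) ** 2) / (2 * c ** 2))

def chi2(obs, exp, floor=1e-3):
    # Restrict to bins where the expectation is not vanishingly small,
    # which is exactly the problem discussed in this thread.
    m = exp > floor
    return np.sum((obs[m] - exp[m]) ** 2 / exp[m])

rng = np.random.default_rng(2)
data = rng.normal(0.0, 1.0, 100)
counts, edges = np.histogram(data, bins=25)
centers = 0.5 * (edges[:-1] + edges[1:])
width = edges[1] - edges[0]

# Candidate 1: parameters from the sample moments, scaled to bin counts.
a_m = len(data) * width / (data.std() * np.sqrt(2 * np.pi))
moment_curve = gaussian(centers, a_m, data.mean(), data.std())

# Candidate 2: parameters from least-squares curve fitting.
popt, _ = curve_fit(gaussian, centers, counts,
                    p0=[counts.max(), data.mean(), data.std()])
fit_curve = gaussian(centers, *popt)

print(chi2(counts, moment_curve), chi2(counts, fit_curve))
```

The two will generally disagree because least squares minimises unweighted squared residuals, while the chi-square statistic weights each bin by 1/E; they optimise different objectives.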
 
  • #9
MathewsMD said:
... but this function appears to have a higher chi-square value than the one produced from the curve fitting which is why I wanted to look at that instead.
I thought you said the chi-squared was coming out enormous because of the near-zero expected values? There's a reason for that - it is not a good fit!
 
  • #10
MrAnchovy said:
I thought you said the chi-squared was coming out enormous because of the near-zero expected values? There's a reason for that - it is not a good fit!

Yes, after making this post I added a constant baseline, again found through curve fitting (so the function no longer approaches 0 far from the mean, although the result isn't strictly a Gaussian). When I did this, the chi-square came back down to relatively reasonable numbers, but still too high to be considered a good fit.

My main goal right now isn't necessarily to find the best fit for my data (maybe a little later), but to find how good a fit a Gaussian is to this data. I haven't quite found the right test for this yet: the very low expected values drive the chi-square so high that it obviously indicates a poor fit, but the value doesn't tell me how bad the fit truly is.
 
  • #11
MathewsMD said:
but the value doesn't quite tell me how bad of a fit it truly is.

The phrase "goodness of fit" is not a mathematically precise term. (Likewise for "badness of fit".) There are different measures of "goodness of fit". What do you have in mind when you talk about the "true" measure of fit?
 
  • Like
Likes MathewsMD
  • #12
Stephen Tashi said:
The phrase "goodness of fit" is not a mathematically precise term. (Likewise for "badness of fit".) There are different measures of "goodness of fit". What do you have in mind when you talk about the "true" measure of fit?

In this case I was referencing Pearson's chi-square test, but I agree it is not a great method for my data. I was really just looking for a well-established formula that takes into consideration very low expected values (i.e. ~0) and also the number of parameters used in the fitting function.
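One standard approach to exactly this situation (tiny expected counts plus fitted parameters) is to simulate the null distribution of the chi-square statistic rather than rely on its asymptotic distribution. A minimal sketch with stand-in parameters; a fully rigorous version would also refit the parameters on each replicate:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(4)

def chi2_stat(obs, exp, floor=1e-6):
    # Drop bins whose expectation is vanishingly small, the exact bins
    # that blow up the standard statistic.
    m = exp > floor
    return np.sum((obs[m] - exp[m]) ** 2 / exp[m])

# Hypothetical fitted model (stand-ins for the real fitted mean and sigma).
mu, sigma, n = 0.0, 1.0, 100
edges = np.linspace(-4, 4, 26)

# Expected bin counts from the normal CDF rather than the density at the
# bin centre, so they are exact for any bin width.
exp_counts = n * np.diff(norm.cdf(edges, mu, sigma))

# Null distribution of the statistic by simulation under the fitted model.
null_stats = np.array([
    chi2_stat(np.histogram(rng.normal(mu, sigma, n), bins=edges)[0], exp_counts)
    for _ in range(500)
])

# Monte Carlo p-value for an observed histogram (here also simulated).
obs_counts = np.histogram(rng.normal(mu, sigma, n), bins=edges)[0]
stat = chi2_stat(obs_counts, exp_counts)
p_value = np.mean(null_stats >= stat)
```

The resulting p-value answers "how bad is the fit, really": it is the probability that the model itself would produce a statistic at least this extreme, with no reliance on the chi-square approximation that fails for near-empty bins.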
 

1. What is a goodness of fit test?

A goodness of fit test is a statistical test used to determine whether a set of observed data follows a specific theoretical distribution. It is used to assess how well a set of data fits a particular distribution or model.

2. Why is a goodness of fit test important for low expected values?

In cases of low expected values, it is important to use a goodness of fit test to determine if the observed data significantly deviates from the expected values. This can indicate whether the observed data is random or if there is a significant underlying pattern or trend.

3. What is the best goodness of fit test for low expected values?

The best goodness of fit test for low expected values depends on the specific situation and the type of data being analyzed. Some commonly used tests include the chi-square test, Kolmogorov-Smirnov test, and Anderson-Darling test.
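As an illustration, the Kolmogorov-Smirnov and Anderson-Darling tests for normality can both be run with SciPy; the data here is a synthetic stand-in:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
data = rng.normal(0.0, 1.0, 100)

# Kolmogorov-Smirnov against a normal with parameters estimated from the data.
# (Estimating parameters from the same data makes the nominal p-value
# optimistic; the Lilliefors correction or a bootstrap addresses this.)
ks = stats.kstest(data, "norm", args=(data.mean(), data.std(ddof=1)))

# Anderson-Darling weights the tails more heavily, which matters when the
# suspect region is the tails, as with very low expected bin counts.
ad = stats.anderson(data, dist="norm")

print(ks.pvalue, ad.statistic, ad.critical_values)
```

Both tests work on the raw data rather than binned counts, so they sidestep the near-zero expected bin frequencies entirely.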

4. How does a goodness of fit test work?

A goodness of fit test works by comparing the observed data to the expected values from a particular distribution or model. It calculates a test statistic, which is then compared to a critical value from a known distribution. If the test statistic is greater than the critical value, the observed data is considered to significantly deviate from the expected values.
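A minimal worked example of that procedure, using hypothetical bin counts against a uniform model:

```python
import numpy as np
from scipy import stats

observed = np.array([18, 22, 16, 14, 12, 18])   # hypothetical bin counts
expected = np.full(6, observed.sum() / 6)       # uniform model, same total

# Test statistic: sum over bins of (O - E)^2 / E.
stat = np.sum((observed - expected) ** 2 / expected)

# Compare against the 95% critical value of chi^2 with k - 1 degrees
# of freedom (one more degree is lost per fitted parameter).
df = len(observed) - 1
critical = stats.chi2.ppf(0.95, df)
reject = stat > critical
```

Here the statistic falls below the critical value, so the uniform model is not rejected at the 5% level.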

5. What are the limitations of using a goodness of fit test for low expected values?

One limitation of using a goodness of fit test for low expected values is that it may not be sensitive enough to detect small deviations from the expected values. Additionally, the results of a goodness of fit test may be affected by the sample size and the choice of distribution or model being tested.
