Best goodness of fit test for low expected values

  • Thread starter MathewsMD
In summary, the conversation discusses the difficulty in finding a good measure of goodness of fit for a histogram with a Gaussian shape. The expected values for each bin are extremely low, causing issues with traditional tests such as the chi-square and G-tests. A suggestion is made to try using Monte Carlo to generate artificial data sets and test the fit of the model to the data. The conversation also touches on the non-symmetry of the data and the possibility of needing a different function for a better fit.
  • #1
MathewsMD
Hi,

I am currently working with a histogram (with 25 bins) that looks Gaussian, and I am trying to fit a function to it and compute its goodness of fit. The function I am fitting is a Gaussian (it looks like a good fit from visual inspection), and I am treating it as my expected value function. When I compute the expected value for each bin, some values are extremely low (e.g. ##10^{-126}##), and some of the observed values (the frequencies for certain bins) are 0. I have tried using fewer bins, which brings the smallest expected values up to ~##10^{-30}## (still very low), but I'd rather find a better fitting function or GoF test than modify my histogram. My observed values range from 0 to about 50, while my expected values range from ##10^{-126}## to about 50. The Gaussian I am fitting has 3 parameters (amplitude, mean, and sigma).

I have looked into the standard chi-square and G-tests for GoF but these are not applicable for such low expected values. I also don't seem able to find tests that can be used for data that has more than 1 degree of freedom. If you could refer me to any methods that would seem applicable to my current situation in finding a good measure of GoF, that would be greatly appreciated!

This is more out of curiosity due to my lack of knowledge in statistics, but with the chi-square test, if I do:
## \tilde{\chi}^2=\sum_{k=1}^{n}\frac{(O_k - E_k)^2}{O_k} ## where I've used the observed value instead of the expected value in the denominator, what is the major difference in the meaning of this new value? If ##E_k = 0##, the corresponding term reduces to ##O_k## rather than blowing up, but is there any merit to this method in any regard?
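To make the comparison concrete, here is a minimal Python sketch that evaluates both statistics on made-up counts. The statistic with the observed count in the denominator is usually called Neyman's modified chi-square. The arrays below are hypothetical numbers chosen only to mimic near-zero expected values in the outer bins, and skipping empty bins in `neyman_chi2` is an assumption made here because the term is undefined when ##O_k = 0##.

```python
import numpy as np

def pearson_chi2(observed, expected):
    """Pearson's statistic: squared residuals divided by the expected counts."""
    observed = np.asarray(observed, dtype=float)
    expected = np.asarray(expected, dtype=float)
    return np.sum((observed - expected) ** 2 / expected)

def neyman_chi2(observed, expected):
    """Modified statistic: squared residuals divided by the observed counts.

    Bins with zero observed counts are skipped (an assumption of this sketch),
    because the term is undefined there.
    """
    observed = np.asarray(observed, dtype=float)
    expected = np.asarray(expected, dtype=float)
    mask = observed > 0
    return np.sum((observed[mask] - expected[mask]) ** 2 / observed[mask])

# Hypothetical counts: the outer bins have nearly zero expected values.
observed = np.array([1.0, 10.0, 50.0, 12.0, 2.0])
expected = np.array([1e-6, 11.0, 48.0, 11.0, 1e-6])

chi2_pearson = pearson_chi2(observed, expected)
chi2_neyman = neyman_chi2(observed, expected)
print(chi2_pearson, chi2_neyman)
```

With these numbers the Pearson value is dominated by the two outer bins, while the modified value stays small. The modified form hides exactly the bins where the model predicts almost nothing but data were seen, so a small value should not be read as evidence of a good fit.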
 
  • #2
There are problems here that you simply cannot "solve" in any real sense. Your expected frequencies out on the "wings" of a Gaussian are going to be extremely low. And you are going to see variation in real samples.

A "left field" suggestion for you: try some Monte Carlo. Take your estimate of the data fitting curve. Use it to randomly generate many sets of artificial data with the same number of entries as your actual data. Then do the statistics on these sets. If you generate 1000 sets of data and bin them, for example, you can do stats on each bin. Is your real data within 1 sigma of the expected value in each bin?
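A rough Python sketch of that procedure, assuming NumPy is available. The fitted mean, sigma, bin edges, and the stand-in "observed" histogram are all hypothetical placeholders to be replaced with the real fit and data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical fit results and binning; substitute the real values.
mu, sigma = 0.0, 1.0
n_points = 100                          # same number of entries as the real data
bin_edges = np.linspace(-5.0, 5.0, 26)  # 25 bins

# Generate many artificial data sets from the fitted curve and bin each one.
n_sets = 1000
counts = np.empty((n_sets, len(bin_edges) - 1))
for i in range(n_sets):
    sample = rng.normal(mu, sigma, size=n_points)
    counts[i], _ = np.histogram(sample, bins=bin_edges)

# Per-bin statistics across the artificial data sets.
bin_mean = counts.mean(axis=0)
bin_std = counts.std(axis=0, ddof=1)

# Stand-in for the real histogram (hypothetical; use the actual counts here).
observed, _ = np.histogram(rng.normal(mu, sigma, size=n_points), bins=bin_edges)

# Distance of each observed count from the simulated mean, in units of the
# simulated spread. Bins that never filled in any simulation give NaN.
with np.errstate(divide="ignore", invalid="ignore"):
    z = np.where(bin_std > 0, (observed - bin_mean) / bin_std, np.nan)
print(np.round(z, 2))
```

In the outer bins the simulated counts are almost always 0, so the spread there is far from symmetric and a "within 1 sigma" check is only a rough guide. Comparing the observed count with the simulated distribution of counts in that bin is more informative.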
 
  • #3
DEvens said:
There are problems here that you simply cannot "solve" in any real sense. Your expected frequencies out on the "wings" of a Gaussian are going to be extremely low. And you are going to see variation in real samples.

A "left field" suggestion for you: try some Monte Carlo. Take your estimate of the data fitting curve. Use it to randomly generate many sets of artificial data with the same number of entries as your actual data. Then do the statistics on these sets. If you generate 1000 sets of data and bin them, for example, you can do stats on each bin. Is your real data within 1 sigma of the expected value in each bin?

Thank you for the response!

I've attached a figure that shows the histogram (it has 100 data points), and the function, f, that I am using to model the histogram's data. The 3 errors are associated to the amplitude, mean, and sigma, respectively, for f.

Yes, that seems to be the case. That's an interesting suggestion since it would reduce the difference between the observed and expected values, thus making my numerator and denominator (i.e. ##[E-O]^2## and ##E##) closer in magnitude. But for my purposes, I want to ensure I am fitting my model to my data very accurately. Generating random points based on the model doesn't seem too helpful until I actually have a good model.

Once again thank you for the advice.
 

Attachments

  • Screen Shot 2015-06-16 at 1.32.24 PM.png
  • #4
The idea of the Monte Carlo is to see how good the fit is. By generating a large number of artificial data sets based on the model, and then doing statistics on those data sets, you can see what the expected variation in your bins would be. In other words, you are testing your model to see if it is really consistent with your data.

For example, your data appears to be distinctly non-symmetric. You seem to have a tail on the right side. The question is, is this really inconsistent with a Gaussian? If so, how inconsistent is it? What is the probability that a Gaussian would produce this degree of asymmetry?

There are probably statistical formulas that will give you an estimate of the probability your data is Gaussian. I never studied much stats, and what I did study I did poorly in. But you can "hum a few bars" and get by without it if you do the Monte Carlo stuff.

If you generate 1000 data sets and the average value in those right-hand bins is much smaller (relative to the 1-sigma value in each bin) than the value you have in your actual data, it tells you that you might need a different shape from simple Gaussian.
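The asymmetry question can be put in numbers with the same Monte Carlo idea. The sketch below assumes SciPy is available and uses the sample skewness as the measure of asymmetry, which is a choice made here and not something fixed by the thread. The `demo` data are a hypothetical right-tailed stand-in for the real measurements.

```python
import numpy as np
from scipy.stats import skew

def skewness_p_value(data, n_sets=2000, seed=0):
    """Estimate how often a Gaussian sample of the same size is at least as
    right-skewed as the data (one-sided Monte Carlo p-value)."""
    rng = np.random.default_rng(seed)
    data = np.asarray(data, dtype=float)
    observed_skew = skew(data)

    # Simulate Gaussian data sets with the sample mean and standard deviation.
    mu, sigma = data.mean(), data.std(ddof=1)
    sims = rng.normal(mu, sigma, size=(n_sets, data.size))
    sim_skew = skew(sims, axis=1)

    # The +1 terms keep the estimate from being exactly zero.
    p = (np.sum(sim_skew >= observed_skew) + 1) / (n_sets + 1)
    return observed_skew, p

# Hypothetical right-tailed data standing in for the real measurements.
demo = np.random.default_rng(1).exponential(size=100)
demo_skew, demo_p = skewness_p_value(demo)
print(demo_skew, demo_p)
```

A small p-value says that a Gaussian with the same mean and spread rarely produces a right tail this pronounced.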
 
  • #5
You don't need a numerical goodness of fit test to tell you that those data do not fit a Gaussian (normal) distribution - the tail frequencies, particularly to the right, are much too high. Yes the central peak looks sort of normal (are you sure that the normal curve you have drawn has the right parameters - it looks a bit narrow to me as though you have taken the square root of the sample variance twice?), but if the outliers are part of the same data set they are never going to fit with any meaningful confidence.
 
  • #6
MrAnchovy said:
You don't need a numerical goodness of fit test to tell you that those data do not fit a Gaussian (normal) distribution - the tail frequencies, particularly to the right, are much too high. Yes the central peak looks sort of normal (are you sure that the normal curve you have drawn has the right parameters - it looks a bit narrow to me as though you have taken the square root of the sample variance twice?), but if the outliers are part of the same data set they are never going to fit with any meaningful confidence.

The normal curve was generated using curve_fit in Python to find the optimal parameters a, b, and c of the function ## f = ae^{-\frac{(x-b)^2}{2c^2}} ##. It's interesting you bring that up, because I do agree that it seems a bit narrow, but I have not interfered with the computation of its standard deviation (i.e. c in the equation) or any of the parameters for that matter.
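For reference, a minimal sketch of this kind of fit, assuming it was done with `scipy.optimize.curve_fit` on the bin centres and counts. The data are a hypothetical stand-in, and the starting guess `p0` is an illustrative choice, not the settings actually used.

```python
import numpy as np
from scipy.optimize import curve_fit

def gaussian(x, a, b, c):
    """Gaussian-shaped curve: amplitude a, centre b, width c.
    Note the squared (x - b) in the exponent."""
    return a * np.exp(-((x - b) ** 2) / (2.0 * c ** 2))

# Hypothetical data standing in for the real 100 measurements.
rng = np.random.default_rng(0)
data = rng.normal(5.0, 2.0, size=100)

counts, edges = np.histogram(data, bins=25)
centers = 0.5 * (edges[:-1] + edges[1:])

# Start the optimizer from the sample statistics.
p0 = [counts.max(), data.mean(), data.std(ddof=1)]
popt, pcov = curve_fit(gaussian, centers, counts, p0=p0)
perr = np.sqrt(np.diag(pcov))  # 1-sigma uncertainties on a, b, c

a_fit, b_fit, c_fit = popt
c_fit = abs(c_fit)  # the sign of c does not change the curve
print(a_fit, b_fit, c_fit, perr)
```

An unweighted least-squares fit like this is driven mostly by the tall central bins, so the fitted width `c_fit` can come out narrower than the sample standard deviation when the data have heavy tails.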

Gonna look through the code to see if I did input something incorrectly.
 
  • #7
That's not a normal curve, no wonder it looks strange - a normal curve has only 2 parameters so a is dependent on b and c. And you don't estimate a normal distribution by curve fitting.

You should look at a high school primer on statistics before you start to play with computational tools.
 
  • #8
MrAnchovy said:
That's not a normal curve, no wonder it looks strange - a normal curve has only 2 parameters so a is dependent on b and c. And you don't estimate a normal distribution by curve fitting.

You should look at a high school primer on statistics before you start to play with computational tools.

Yes, you're right, that's not a normal curve. Poor choice of words from me.

I did get the mean and standard deviation of the data set, and I could plot a Gaussian using these parameters, but this function appears to have a higher chi-square value than the one produced from the curve fitting which is why I wanted to look at that instead.
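A sketch of how the expected count per bin follows from the sample mean and standard deviation alone, assuming SciPy's `norm` distribution. The data are again a hypothetical stand-in. It also shows why the amplitude is not a free parameter once the number of points and the bin width are fixed.

```python
import numpy as np
from scipy.stats import norm

# Hypothetical data standing in for the real 100 measurements.
rng = np.random.default_rng(0)
data = rng.normal(5.0, 2.0, size=100)
counts, edges = np.histogram(data, bins=25)

# Estimate the two parameters of the normal distribution from the sample.
mu_hat = data.mean()
sigma_hat = data.std(ddof=1)

# Expected count in a bin = N * (probability mass of the bin).
expected = data.size * np.diff(norm.cdf(edges, loc=mu_hat, scale=sigma_hat))

# Equivalent peak height of a curve drawn through the histogram:
# fixed by N, the bin width, and sigma.
bin_width = edges[1] - edges[0]
amplitude = data.size * bin_width / (sigma_hat * np.sqrt(2.0 * np.pi))
print(expected.round(3), amplitude)
```

The expected counts sum to slightly less than the number of data points, because the normal distribution places some probability outside the range of the histogram.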
 
  • #9
MathewsMD said:
... but this function appears to have a higher chi-square value than the one produced from the curve fitting which is why I wanted to look at that instead.
I thought you said the chi-squared was coming out enormous because of the near-zero expected values? There's a reason for that - it is not a good fit!
 
  • #10
MrAnchovy said:
I thought you said the chi-squared was coming out enormous because of the near-zero expected values? There's a reason for that - it is not a good fit!

Yes, I just added a constant to use as a baseline after making this post, and this was once again found through curve fitting (therefore it does not approach 0 far from the mean, although this isn't strictly a Gaussian). When I did this, the chi-square came back down to relatively reasonable numbers, but still too high to be considered a good fit.

My main goal right now isn't necessarily to find the best fit for my data (maybe a little later), but to find how good a fit a Gaussian is to this data. I haven't quite found the test to do this yet. The very low expected values that drive the chi-square so high do obviously indicate it's not a good fit, but the value doesn't quite tell me how bad of a fit it truly is.
 
  • #11
MathewsMD said:
but the value doesn't quite tell me how bad of a fit it truly is.

The phrase "goodness of fit" is not a mathematically precise term. (Likewise for "badness of fit".) There are different measures of "goodness of fit". What do you have in mind when you talk about the "true" measure of fit?
 
  • #12
Stephen Tashi said:
The phrase "goodness of fit" is not a mathematically precise term. (Likewise for "badness of fit".) There are different measures of "goodness of fit". What do you have in mind when you talk about the "true" measure of fit?

In this case I was referencing Pearson's chi-square test, but I agree it is not a great method for my data. I was really just looking for a well-established formula that takes into consideration very low expected values (i.e. ~0) and also the number of parameters used in the fitting function.
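One way to get both properties without a closed-form formula is to calibrate the chi-square statistic by simulation, along the lines of the Monte Carlo suggestion earlier in the thread. The sketch below is an illustration under stated assumptions, not an established named test: the function names, the clipping of out-of-range points into the outer bins, and the `demo` data are all hypothetical choices. The parameters are re-estimated for every simulated data set, so the effect of fitting them is absorbed into the reference distribution.

```python
import numpy as np
from scipy.stats import norm

def binned_chi2(sample, edges):
    """Pearson chi-square of a sample against a normal refitted to that sample.
    The outermost bins are treated as open-ended."""
    mu, sigma = sample.mean(), sample.std(ddof=1)
    clipped = np.clip(sample, edges[0], edges[-1])  # outliers go to end bins
    counts, _ = np.histogram(clipped, bins=edges)
    cdf = norm.cdf(edges, loc=mu, scale=sigma)
    cdf[0], cdf[-1] = 0.0, 1.0                      # open-ended outer bins
    expected = sample.size * np.diff(cdf)
    expected = np.maximum(expected, 1e-300)         # guard against 0/0
    return np.sum((counts - expected) ** 2 / expected)

def monte_carlo_gof(data, edges, n_sets=1000, seed=0):
    """Monte Carlo p-value: fraction of simulated Gaussian data sets whose
    statistic is at least as large as the observed one."""
    rng = np.random.default_rng(seed)
    data = np.asarray(data, dtype=float)
    stat_obs = binned_chi2(data, edges)
    mu, sigma = data.mean(), data.std(ddof=1)
    stat_sim = np.array([
        binned_chi2(rng.normal(mu, sigma, size=data.size), edges)
        for _ in range(n_sets)
    ])
    p_value = (np.sum(stat_sim >= stat_obs) + 1) / (n_sets + 1)
    return stat_obs, p_value

# Hypothetical data: a Gaussian core plus a cluster of points far to the right.
rng = np.random.default_rng(1)
demo = np.concatenate([rng.normal(0.0, 1.0, 90), rng.normal(8.0, 1.0, 10)])
demo_edges = np.linspace(demo.min(), demo.max(), 26)  # 25 bins
stat_obs, p_value = monte_carlo_gof(demo, demo_edges)
print(stat_obs, p_value)
```

Because the reference distribution comes from simulation, the usual requirement of reasonably large expected counts in every bin no longer applies. The p-value answers "how often would a Gaussian with these parameters fit this badly or worse?", which is one way to quantify how bad the fit is.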
 

FAQ: Best goodness of fit test for low expected values

1. What is a goodness of fit test?

A goodness of fit test is a statistical test used to determine whether a set of observed data follows a specific theoretical distribution. It is used to assess how well a set of data fits a particular distribution or model.

2. Why is a goodness of fit test important for low expected values?

In cases of low expected values, it is important to use a goodness of fit test to determine if the observed data significantly deviates from the expected values. This can indicate whether the observed data is random or if there is a significant underlying pattern or trend.

3. What is the best goodness of fit test for low expected values?

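The best goodness of fit test for low expected values depends on the specific situation and the type of data being analyzed. The chi-square test is commonly used for binned data but becomes unreliable when expected counts are small (a common rule of thumb asks for at least about 5 per bin). The Kolmogorov-Smirnov and Anderson-Darling tests work on the unbinned data and so avoid the binning problem, and a Monte Carlo simulation, as suggested in the thread, can calibrate a binned statistic directly.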

4. How does a goodness of fit test work?

A goodness of fit test works by comparing the observed data to the expected values from a particular distribution or model. It calculates a test statistic, which is then compared to a critical value from a known distribution. If the test statistic is greater than the critical value, the observed data is considered to significantly deviate from the expected values.

5. What are the limitations of using a goodness of fit test for low expected values?

One limitation of using a goodness of fit test for low expected values is that it may not be sensitive enough to detect small deviations from the expected values. Additionally, the results of a goodness of fit test may be affected by the sample size and the choice of distribution or model being tested.
