Data Analysis -- How to choose the best statistical model to use?

Discussion Overview

The discussion revolves around selecting the most appropriate statistical model for analyzing a set of data points with corresponding measurements and errors. Participants explore various statistical methods for model comparison, particularly in situations where the underlying distribution of the data is uncertain or varies.

Discussion Character

  • Exploratory
  • Technical explanation
  • Debate/contested

Main Points Raised

  • One participant inquires about the best statistical model or method to determine consistency with measurements, suggesting options like chi-square, mean squared error, and Kolmogorov-Smirnov.
  • Another participant recommends using Bayesian Information Criterion (BIC) or Akaike Information Criterion (AIC) for model comparison, cautioning against using the same data for both model selection and testing due to the small sample size.
  • A different viewpoint suggests that in the absence of a theoretical model or known distributions, a spline curve could be a valid approach, emphasizing that statistical results may not be necessary.
  • One participant expresses confusion regarding the implementation of BIC and AIC, seeking guidance on how to obtain these metrics, while another participant suggests using R software for statistical analysis, noting its capabilities for calculating AIC and BIC.

Areas of Agreement / Disagreement

Participants express differing opinions on the necessity and appropriateness of statistical results versus visual methods like spline curves. There is no consensus on the best approach to model selection, and the discussion remains unresolved regarding the most suitable statistical method.

Contextual Notes

Participants highlight limitations related to the small sample size of 10 data points, which may affect the reliability of model selection and testing. The discussion also reflects varying levels of familiarity with statistical methods and software.

parazit
TL;DR
What is the most convenient statistical method for determining which model is most consistent with the measurements?
Hi all.

Let's assume I have the following situation. I have a set of 10 x values. I also have the corresponding measured y value for each x, along with its error. I then perform calculations with, let's say, 5 different models, each of which uses the x values to obtain y values.

In the end, I have the x values, the measured y values and their errors, and five different sets of calculated y values. You may see the attached file as an example.

My question is this: What is the most convenient statistical method for determining which model is most consistent with the measurements? Should I use chi-square, reduced chi-square, mean squared error, root mean square error, mean weighted deviation, the relative variance, Kolmogorov-Smirnov, or something else?

You may wonder about the distribution of the y values, for example whether they are linear, polynomial, etc. Let's assume they do not follow any particular distribution, or that their distribution varies from one situation to another. My main interest here is to identify a statistical method for such cases.

Thank you so much for your time in advance.
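
As a rough illustration of the error-weighted measures named in the question, the sketch below computes chi-square, reduced chi-square, and RMSE for each candidate model against the measurements. All data values and model names are hypothetical placeholders, not taken from the attached file, and the default of zero fitted parameters assumes the model curves were calculated independently of the measurements.

```python
import math

def chi_square(y_obs, y_err, y_model):
    """Sum of squared residuals, each scaled by its measurement error."""
    return sum(((yo - ym) / e) ** 2 for yo, e, ym in zip(y_obs, y_err, y_model))

def reduced_chi_square(y_obs, y_err, y_model, n_params=0):
    """Chi-square divided by the degrees of freedom (N minus fitted parameters)."""
    dof = len(y_obs) - n_params
    if dof <= 0:
        raise ValueError("need more data points than fitted parameters")
    return chi_square(y_obs, y_err, y_model) / dof

def rmse(y_obs, y_model):
    """Root mean square error; ignores the measurement errors entirely."""
    n = len(y_obs)
    return math.sqrt(sum((yo - ym) ** 2 for yo, ym in zip(y_obs, y_model)) / n)

# Hypothetical measurements and their errors (illustrative only).
y_obs = [1.0, 2.1, 2.9, 4.2]
y_err = [0.1, 0.1, 0.2, 0.2]

# Hypothetical model predictions at the same x values.
models = {
    "model_A": [1.0, 2.0, 3.0, 4.0],
    "model_B": [1.2, 2.4, 3.3, 4.8],
}

scores = {name: chi_square(y_obs, y_err, y) for name, y in models.items()}
best = min(scores, key=scores.get)

for name, y in models.items():
    print(name,
          "chi2 =", round(scores[name], 3),
          "reduced chi2 =", round(reduced_chi_square(y_obs, y_err, y), 3),
          "RMSE =", round(rmse(y_obs, y), 3))
print("lowest chi-square:", best)
```

Chi-square and its reduced form use the quoted errors, so a poorly measured point counts for less; RMSE treats every point equally.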
 

This is called "model comparison". I generally use the BIC or the closely related AIC for model comparison:

https://en.wikipedia.org/wiki/Bayesian_information_criterion
https://en.wikipedia.org/wiki/Akaike_information_criterion

Note, you should not use the same data for model selection as for model testing. With only 10 data points you have far too few to do either job reliably. Your result is likely to be highly dependent on the sample noise and not very robust at all. I would recommend acquiring much more data, 40 at a bare minimum so that you can have 20 points for model selection and 20 points for model testing, but ideally substantially more than that.
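
The selection/testing split described above could be sketched as follows. This is only an illustration of the idea: the function name, the fixed seed, and the 40 placeholder points are assumptions made for the example, not part of the original advice.

```python
import random

def split_for_selection_and_testing(points, seed=0):
    """Randomly divide the data into two halves: one for choosing a model,
    the other held back for testing the chosen model."""
    rng = random.Random(seed)      # fixed seed so the split is reproducible
    shuffled = list(points)        # copy, so the caller's data is untouched
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]

# Placeholder data: 40 (x, y) pairs, the bare minimum suggested above.
points = [(x, 2.0 * x) for x in range(40)]
selection, testing = split_for_selection_and_testing(points)
print(len(selection), "points for model selection,", len(testing), "for testing")
```

The model is chosen using only the first half; its quality is then reported using only the second half, so the reported score is not flattered by the selection step.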
 
It sounds like you want a general method to use when you have no idea about the theoretical model or statistical distributions that apply. In that situation, I do not think that you should concern yourself with a statistical result. A spline curve through the data points would be as valid as anything.
 
Dale said:
This is called "model comparison". I generally use the BIC or the closely related AIC for model comparison:

https://en.wikipedia.org/wiki/Bayesian_information_criterion
https://en.wikipedia.org/wiki/Akaike_information_criterion

Note, you should not use the same data for model selection as for model testing. With only 10 data points you have far too few to do either job reliably. Your result is likely to be highly dependent on the sample noise and not very robust at all. I would recommend acquiring much more data, 40 at a bare minimum so that you can have 20 points for model selection and 20 points for model testing, but ideally substantially more than that.

Thank you so much for your reply, Dale. The file was just a sample to show the data and their distribution. I have looked into BIC and AIC, yet I am more confused now, since I have no experience in calculating them. I would be grateful if you could show me a way forward or an example. Thank you in advance for your guidance.
Best regards.
 
FactChecker said:
It sounds like you want a general method to use when you have no idea about the theoretical model or statistical distributions that apply. In that situation, I do not think that you should concern yourself with a statistical result. A spline curve through the data points would be as valid as anything.

Dear FactChecker,

Thanks for your reply. You're right, actually. I normally plot a spline through the calculated results to compare them visually with the experimental data. However, I would also like a numerical comparison for a better understanding, which is why I asked. Thank you for your contribution, reply, and time.
Best regards.
 
parazit said:
I have looked into BIC and AIC, yet I am more confused now, since I have no experience in calculating them. I would be grateful if you could show me a way forward or an example.
Most good software packages will have them already implemented and you just have to call them. I would not recommend implementing them by hand!

I use R for my statistics. It is free and very powerful, and RStudio ( https://www.rstudio.com/ ) is a very nice development environment for R. Installing RStudio will probably take less time than programming your own AIC or BIC. Here is the page for AIC and BIC using R, including an example at the bottom:

https://stat.ethz.ch/R-manual/R-devel/library/stats/html/AIC.html
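
For readers who only want to see what the two criteria measure, here is a minimal sketch rather than a replacement for a tested package. It assumes independent Gaussian measurement errors with known sizes, in which case minus twice the log-likelihood equals chi-square plus a constant that is the same for every model compared on the same data. That constant is dropped here, so only differences between models are meaningful. The function names and the chi-square and parameter-count values are hypothetical.

```python
import math

def aic_from_chi2(chi2, k):
    """AIC = 2k - 2 ln L; with Gaussian errors, -2 ln L = chi2 + const."""
    return chi2 + 2 * k

def bic_from_chi2(chi2, k, n):
    """BIC = k ln(n) - 2 ln L, using the same Gaussian assumption."""
    return chi2 + k * math.log(n)

n = 10  # number of data points, as in the original question

# Hypothetical candidates: (chi-square against the data, number of fitted parameters).
candidates = {
    "simple_model": (12.0, 2),
    "flexible_model": (9.5, 5),
}

aic = {name: aic_from_chi2(c2, k) for name, (c2, k) in candidates.items()}
bic = {name: bic_from_chi2(c2, k, n) for name, (c2, k) in candidates.items()}

for name in candidates:
    print(name, "AIC =", round(aic[name], 3), "BIC =", round(bic[name], 3))
print("preferred by AIC:", min(aic, key=aic.get))
print("preferred by BIC:", min(bic, key=bic.get))
```

In this made-up case the flexible model fits more closely (lower chi-square), but both criteria prefer the simple model because the improvement does not pay for three extra parameters. Lower values are better for both criteria.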
 