Data Analysis -- How to choose the best statistical model to use?

parazit · Aug 1, 2019

Hi all.

Let's assume I have the following situation. I have a set of 10 x values. I also have the corresponding measured y value for each x, along with its error. Then I perform calculations with, let's say, 5 different models, in which I use the x values to obtain the y values.

In the end, I have the x values, the measured y values and their errors, and five different sets of calculated y values. You may see the attached file as an example.

My question is this: What is the most convenient statistical measure or method for determining which model is most consistent with the measurements? Should I use chi-square, reduced chi-square, mean squared error, root mean square error, mean weighted deviation, the relative variance, Kolmogorov-Smirnov or something else?
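To make a few of the candidates named above concrete, here is a minimal sketch of chi-square, reduced chi-square and RMSE for fixed model predictions compared against measurements with errors. The data values and the names `model_a` and `model_b` are hypothetical and are not taken from the attached file; the helper names are illustrative only.

```python
import math

def chi_square(y_obs, y_err, y_model):
    """Sum of squared residuals, each divided by its measurement error."""
    return sum(((yo - ym) / err) ** 2
               for yo, err, ym in zip(y_obs, y_err, y_model))

def reduced_chi_square(y_obs, y_err, y_model, n_fitted_params=0):
    """Chi-square per degree of freedom (N minus parameters tuned on this data)."""
    dof = len(y_obs) - n_fitted_params
    if dof <= 0:
        raise ValueError("need more data points than fitted parameters")
    return chi_square(y_obs, y_err, y_model) / dof

def rmse(y_obs, y_model):
    """Root mean square error; ignores the measurement errors entirely."""
    squared = sum((yo - ym) ** 2 for yo, ym in zip(y_obs, y_model))
    return math.sqrt(squared / len(y_obs))

# Hypothetical measurements and model predictions (assumed for illustration).
y_obs = [1.0, 2.1, 2.9, 4.2]
y_err = [0.1, 0.2, 0.1, 0.2]
predictions = {
    "model_a": [1.0, 2.0, 3.0, 4.0],
    "model_b": [1.2, 2.4, 3.1, 4.8],
}

scores = {name: chi_square(y_obs, y_err, y_model)
          for name, y_model in predictions.items()}
best = min(scores, key=scores.get)

for name, y_model in predictions.items():
    print(name,
          round(scores[name], 3),
          round(reduced_chi_square(y_obs, y_err, y_model), 3),
          round(rmse(y_obs, y_model), 3))
print("lowest chi-square:", best)
```

Chi-square uses the quoted errors as weights, while RMSE treats every point equally, so the two can rank models differently when the errors vary a lot from point to point.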

You may wonder about the form of the y values, e.g. whether they are linear, polynomial, etc. Let's assume they do not follow a particular form, or that the form varies from one situation to another. My main interest here is to identify a statistical method for such cases.

Thank you so much for your time in advance.

Dale · Aug 1, 2019

This is called "model comparison". I generally use the BIC or the closely related AIC for model comparison:

https://en.wikipedia.org/wiki/Bayesian_information_criterion
https://en.wikipedia.org/wiki/Akaike_information_criterion

Note, you should not use the same data for model selection as for model testing. With only 10 data points you have far too few to do either job reliably. Your result is likely to be highly dependent on the sample noise and not very robust at all. I would recommend acquiring much more data, 40 at a bare minimum so that you can have 20 points for model selection and 20 points for model testing, but ideally substantially more than that.
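For readers who want to see what the two criteria compute, here is a minimal sketch under one stated assumption: the measurement errors are independent, Gaussian and known. In that case -2 ln L equals chi-square plus a constant that is the same for every model, so AIC = -2 ln L + 2k and BIC = -2 ln L + k ln(n), where k is the number of parameters fitted to the data and n is the number of points. The data, the model names and the parameter counts below are hypothetical.

```python
import math

def neg2_log_likelihood(y_obs, y_err, y_model):
    """-2 ln L for independent Gaussian errors with known standard deviations."""
    return sum(((yo - ym) / err) ** 2 + math.log(2 * math.pi * err * err)
               for yo, err, ym in zip(y_obs, y_err, y_model))

def aic(neg2_log_l, k):
    """Akaike information criterion: -2 ln L + 2k."""
    return neg2_log_l + 2 * k

def bic(neg2_log_l, k, n):
    """Bayesian information criterion: -2 ln L + k ln(n)."""
    return neg2_log_l + k * math.log(n)

# Hypothetical measurements (assumed for illustration).
y_obs = [1.0, 2.1, 2.9, 4.2]
y_err = [0.1, 0.2, 0.1, 0.2]
n = len(y_obs)

# Hypothetical models: (predictions, number of parameters fitted to this data).
models = {
    "simple":   ([1.0, 2.0, 3.0, 4.0], 1),
    "flexible": ([1.0, 2.1, 2.9, 4.1], 3),
}

results = {}
for name, (y_model, k) in models.items():
    n2ll = neg2_log_likelihood(y_obs, y_err, y_model)
    results[name] = (aic(n2ll, k), bic(n2ll, k, n))
    print(name, "AIC =", round(results[name][0], 3),
          "BIC =", round(results[name][1], 3))

# Lower is better for both criteria; only differences between models matter.
best_by_aic = min(results, key=lambda name: results[name][0])
best_by_bic = min(results, key=lambda name: results[name][1])
print("preferred by AIC:", best_by_aic, "| preferred by BIC:", best_by_bic)
```

In this made-up case the flexible model tracks the points more closely, but the penalty for its extra parameters outweighs the gain, so both criteria prefer the simple model. With so few points the ranking would be fragile, which is the concern raised above.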

FactChecker · Aug 1, 2019

It sounds like you want a general method to use when you have no idea about the theoretical model or statistical distributions that apply. In that situation, I do not think that you should concern yourself with a statistical result. A spline curve through the data points would be as valid as anything.

parazit · Aug 1, 2019

Dale said:

This is called "model comparison". I generally use the BIC or the closely related AIC for model comparison:

https://en.wikipedia.org/wiki/Bayesian_information_criterion
https://en.wikipedia.org/wiki/Akaike_information_criterion

Note, you should not use the same data for model selection as for model testing. With only 10 data points you have far too few to do either job reliably. Your result is likely to be highly dependent on the sample noise and not very robust at all. I would recommend acquiring much more data, 40 at a bare minimum so that you can have 20 points for model selection and 20 points for model testing, but ideally substantially more than that.

Thank you so much for your reply, Dale. The file was just a sample to show the data and their distribution. I have looked into BIC and AIC, yet I am more confused now since I have no experience in computing them. It would be a blessing if you could point me in the right direction or show me an example. Thank you in advance for your guidance.
Best regards.

parazit · Aug 1, 2019

FactChecker said:

It sounds like you want a general method to use when you have no idea about the theoretical model or statistical distributions that apply. In that situation, I do not think that you should concern yourself with a statistical result. A spline curve through the data points would be as valid as anything.

Dear FactChecker,

Thanks for your reply. You're right, actually. I normally plot a spline through the calculated results to compare them visually with the experimental data. However, I would also like a numerical comparison for a better understanding. This is why I asked. Thank you for your contribution, reply and time.
Best regards.

Dale · Aug 1, 2019

parazit said:

I have looked into BIC and AIC, yet I am more confused now since I have no experience in computing them. It would be a blessing if you could point me in the right direction or show me an example.

Most good software packages will have them already implemented and you just have to call them. I would not recommend implementing them by hand!

I use R for my statistics. It is free and very powerful, and RStudio ( https://www.rstudio.com/ ) is a very nice environment for working with R. Installing RStudio will probably take less time than programming your own AIC or BIC. Here is the R documentation page for AIC and BIC, which includes an example at the bottom:

https://stat.ethz.ch/R-manual/R-devel/library/stats/html/AIC.html
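For readers without R at hand, the following is a rough Python sketch of the kind of comparison such a package performs for ordinary least-squares fits. It is an illustration, not a substitute for a tested library. It assumes unweighted Gaussian residuals with the noise variance estimated from the fit, and it counts that variance as one extra parameter; as far as I know this is the convention R follows for `lm` fits, but treat that as an assumption. The data and the helper names are hypothetical.

```python
import math

def fit_constant(x, y):
    """Least-squares constant model: returns fitted values and coefficient count."""
    mean = sum(y) / len(y)
    return [mean] * len(y), 1

def fit_line(x, y):
    """Least-squares straight line: returns fitted values and coefficient count."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return [intercept + slope * xi for xi in x], 2

def aic_bic_least_squares(y, fitted, n_coeffs):
    """AIC and BIC for a least-squares fit with the noise variance estimated."""
    n = len(y)
    rss = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    # -2 ln L at the maximum-likelihood variance estimate rss / n.
    neg2_log_l = n * (math.log(2 * math.pi) + math.log(rss / n) + 1)
    k = n_coeffs + 1  # assumption: the estimated variance counts as a parameter
    return neg2_log_l + 2 * k, neg2_log_l + k * math.log(n)

# Hypothetical data, roughly linear with some scatter (assumed for illustration).
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1, 18.0, 19.9]

fitted_const, k_const = fit_constant(x, y)
fitted_line, k_line = fit_line(x, y)

aic_const, bic_const = aic_bic_least_squares(y, fitted_const, k_const)
aic_line, bic_line = aic_bic_least_squares(y, fitted_line, k_line)

print("constant: AIC =", round(aic_const, 2), "BIC =", round(bic_const, 2))
print("line:     AIC =", round(aic_line, 2), "BIC =", round(bic_line, 2))
```

The model with the lower AIC or BIC is preferred; here the straight line should win by a wide margin because the made-up data are nearly linear.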
