Data Analysis -- How to choose the best statistical model to use?

In summary, models can be compared using the BIC or AIC, but with only 10 data points the result may not be reliable, so acquiring more data is recommended. If nothing is known about the underlying model or distribution, a spline curve through the data points is as valid as anything. The BIC and AIC are already implemented in statistical software such as R.
  • #1
parazit
TL;DR Summary
What is the most convenient statistical method for determining which model is most consistent with the measurements?
Hi all.

Let's assume I have the following situation. I have a set of 10 x values. For each x value I also have the corresponding measured y value and its error. Then I perform calculations with, let's say, 5 different models, in which I use the x values to obtain the y values.

In the end, I have the x values, the measured y values and their errors, and five different sets of calculated y values. You may see the attached file as an example.

My question is this: What is the most convenient statistical method for determining which model is most consistent with the measurements? Should I use chi-square, reduced chi-square, mean squared error, root mean square error, mean weighted deviation, the relative variance, Kolmogorov-Smirnov, or something else?
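For concreteness, here is a minimal Python sketch of how some of the measures listed above could be computed for each model. The data values, the model names, and the helper `fit_metrics` are hypothetical placeholders, not the contents of the attached file, and the sketch assumes the errors are independent 1-sigma uncertainties.

```python
import math

# Hypothetical data: 10 measured y values with 1-sigma errors.
y_meas = [1.0, 2.1, 2.9, 4.2, 5.1, 5.8, 7.2, 8.1, 8.9, 10.0]
y_err = [0.2] * 10

# Hypothetical predictions from two of the models at the same x values.
models = {
    "model_A": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0],
    "model_B": [1.5, 2.5, 3.5, 4.5, 5.5, 6.5, 7.5, 8.5, 9.5, 10.5],
}

def fit_metrics(y_meas, y_err, y_model, n_params=0):
    """Return chi-square, reduced chi-square and RMSE for one model.

    n_params is the number of parameters that were tuned on this same
    data set (0 if the model was not fitted to these measurements).
    """
    n = len(y_meas)
    residuals = [m - p for m, p in zip(y_meas, y_model)]
    # Chi-square weights each residual by the measurement error.
    chi2 = sum((r / s) ** 2 for r, s in zip(residuals, y_err))
    # RMSE ignores the errors and treats every point equally.
    rmse = math.sqrt(sum(r * r for r in residuals) / n)
    return {"chi2": chi2, "reduced_chi2": chi2 / (n - n_params), "rmse": rmse}

results = {name: fit_metrics(y_meas, y_err, pred) for name, pred in models.items()}
for name, metrics in results.items():
    print(name, metrics)
```

Because chi-square uses the measurement errors and RMSE does not, the two can rank models differently when the errors vary from point to point.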

You may wonder about the form of the y values, for example whether they are linear, polynomial, etc. Let's assume they do not follow a particular form, or that the form varies from one situation to another. My main interest here is to identify a statistical method for such cases.

Thank you so much for your time in advance.
 

Attachments

  • sample.xlsx
    14.8 KB · Views: 258
  • #2
This is called "model comparison". I generally use the BIC or the closely related AIC for model comparison:

https://en.wikipedia.org/wiki/Bayesian_information_criterion
https://en.wikipedia.org/wiki/Akaike_information_criterion

Note, you should not use the same data for model selection as for model testing. With only 10 data points you have far too few to do either job reliably. Your result is likely to be highly dependent on the sample noise and not very robust at all. I would recommend acquiring much more data, 40 at a bare minimum so that you can have 20 points for model selection and 20 points for model testing, but ideally substantially more than that.
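As a rough illustration of what these criteria do, here is a minimal Python sketch. It assumes independent Gaussian errors with known sizes, in which case minus twice the maximized log-likelihood equals the chi-square plus a constant that is the same for every model. The chi-square values, parameter counts, and the helper `aic_bic_from_chi2` are hypothetical.

```python
import math

def aic_bic_from_chi2(chi2, n_points, n_params):
    """AIC and BIC up to an additive constant shared by all models.

    Assumes independent Gaussian errors with known sigma, so that
    -2 ln(L_max) = chi2 + constant.
    """
    aic = chi2 + 2 * n_params
    bic = chi2 + n_params * math.log(n_points)
    return aic, bic

# Hypothetical inputs: (chi-square, number of fitted parameters) per model.
candidates = {"model_A": (9.8, 2), "model_B": (8.9, 5)}
n_points = 10

scores = {
    name: aic_bic_from_chi2(chi2, n_points, k)
    for name, (chi2, k) in candidates.items()
}
# Lower is better for both criteria; only differences between models matter.
best_by_aic = min(scores, key=lambda name: scores[name][0])
best_by_bic = min(scores, key=lambda name: scores[name][1])
print(scores, best_by_aic, best_by_bic)
```

In this made-up case model_B has the smaller chi-square, but its three extra parameters cost more than the improvement in fit, so both criteria prefer model_A. With only 10 points such a ranking can easily flip under a different noise realization.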
 
  • #3
It sounds like you want a general method to use when you have no idea about the theoretical model or statistical distributions that apply. In that situation, I do not think that you should concern yourself with a statistical result. A spline curve through the data points would be as valid as anything.
 
  • #4
Dale said:
This is called "model comparison". I generally use the BIC or the closely related AIC for model comparison:

https://en.wikipedia.org/wiki/Bayesian_information_criterion
https://en.wikipedia.org/wiki/Akaike_information_criterion

Note, you should not use the same data for model selection as for model testing. With only 10 data points you have far too few to do either job reliably. Your result is likely to be highly dependent on the sample noise and not very robust at all. I would recommend acquiring much more data, 40 at a bare minimum so that you can have 20 points for model selection and 20 points for model testing, but ideally substantially more than that.

Thank you so much for your reply, Dale. The file was just a sample to show the data and their distribution. I have looked into BIC and AIC, yet I am more confused now since I have no experience in computing them. I would be grateful if you could point me to a way forward or an example. Thank you in advance for your guidance.
Best regards.
 
  • #5
FactChecker said:
It sounds like you want a general method to use when you have no idea about the theoretical model or statistical distributions that apply. In that situation, I do not think that you should concern yourself with a statistical result. A spline curve through the data points would be as valid as anything.

Dear FactChecker,

Thanks for your reply. You're right, actually. I normally plot a spline curve through the calculated results to compare them visually with the experimental data. However, I would also like to have a numerical comparison for a better understanding. This is why I asked about these methods. Thank you for your contribution, reply and time.
Best regards.
 
  • #6
parazit said:
I have looked into BIC and AIC yet I am more confused now since I have no experience about obtaining them. It will be a blessing for me if you could show me a way out or an example.
Most good software packages will have them already implemented and you just have to call them. I would not recommend implementing them by hand!

I use R for my statistics. It is free and very powerful, and RStudio ( https://www.rstudio.com/ ) is a very nice environment for working with R. Installing RStudio will probably take less time than programming your own AIC or BIC. Here is the page for AIC and BIC in R, including an example at the bottom:

https://stat.ethz.ch/R-manual/R-devel/library/stats/html/AIC.html
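
For readers who want to see what such a routine computes, here is a minimal Python sketch (not R, and not the implementation behind the page above). It fits two candidate models by ordinary least squares and scores them with the standard least-squares forms of AIC and BIC, dropping a constant that is the same for every model. The data and the helpers `fit_constant`, `fit_line` and `information_criteria` are hypothetical, and counting the noise variance as one extra parameter is a convention assumed here.

```python
import math

# Hypothetical data: 10 points that are roughly linear in x.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [1.1, 1.9, 3.2, 3.9, 5.1, 6.0, 6.8, 8.1, 9.0, 9.9]

def fit_constant(x, y):
    """Least-squares fit of y = c; returns fitted values and coefficient count."""
    mean = sum(y) / len(y)
    return [mean] * len(y), 1

def fit_line(x, y):
    """Least-squares fit of y = a + b*x; returns fitted values and coefficient count."""
    n = len(x)
    x_mean, y_mean = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return [intercept + slope * xi for xi in x], 2

def information_criteria(y, y_fit, n_coeffs):
    """AIC and BIC for a least-squares fit, up to a shared constant."""
    n = len(y)
    rss = sum((yi - fi) ** 2 for yi, fi in zip(y, y_fit))
    k = n_coeffs + 1  # assumption: the noise variance counts as a parameter
    neg2loglik = n * math.log(rss / n)  # Gaussian errors, constant dropped
    return neg2loglik + 2 * k, neg2loglik + k * math.log(n)

results = {}
for name, fit in {"constant": fit_constant, "line": fit_line}.items():
    y_fit, n_coeffs = fit(x, y)
    results[name] = information_criteria(y, y_fit, n_coeffs)
print(results)
```

The model with the lower value is preferred. Absolute values are not meaningful on their own, so values should only be compared between models scored with the same formula on the same data.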
 

FAQ: Data Analysis -- How to choose the best statistical model to use?

1. What is data analysis and why is it important?

Data analysis is the process of collecting, organizing, and interpreting data to uncover patterns, trends, and insights. It is important because it allows us to make informed decisions, identify areas for improvement, and understand the impact of our actions.

2. How do I choose the best statistical model for my data?

Choosing the best statistical model depends on several factors, including the type of data you have, your research question, and the assumptions of the different models. It is important to carefully consider these factors and consult with a statistician or data analyst for guidance.

3. What are some common statistical models used in data analysis?

Some common statistical models used in data analysis include linear regression, logistic regression, ANOVA, and t-tests. Other models such as decision trees, random forests, and neural networks are also becoming increasingly popular.

4. How do I know if my data is suitable for a particular statistical model?

Before choosing a statistical model, it is important to assess the assumptions of the model and determine if your data meets those assumptions. For example, linear regression assumes a linear relationship between the variables, so if your data does not show this pattern, another model may be more appropriate.

5. Can I use more than one statistical model for my data?

Yes, it is common to use multiple statistical models to analyze data. This can help to validate the results and provide a more comprehensive understanding of the data. However, it is important to avoid overfitting the data by using too many models or selecting models based on the desired outcome rather than the data itself.
