# What steps should one follow to determine the right model for data?

## Main Question or Discussion Point

Hi,

I am trying to understand the basic steps one has to follow to determine an appropriate model for a set of sample data. I couldn't find any webpage describing the process.

Suppose you are given 50 values of a variable as a sample. How would you go about deciding the appropriate model that will represent the variable?

Thanks,

MG.

This is called the "density estimation" problem. There are many non-parametric techniques that can be brought to bear on it when the assumptions about the distribution are very limited, kernel density estimation and histograms being the main ones. Fifty data points is really very little data, though, and most of these non-parametric techniques are not usually relied on in such cases. Do you have any more information about the distribution from which you are getting the data: for instance, can you assume the data points are independent and identically distributed?

http://en.wikipedia.org/wiki/Density_estimation
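The two non-parametric estimators mentioned above can be sketched in a few lines with NumPy and SciPy. This is only an illustration: the actual data isn't given, so a synthetic sample of 50 values (the sample size from the question) stands in for it.

```python
import numpy as np
from scipy import stats

# Hypothetical sample: 50 i.i.d. draws, standing in for the real data.
rng = np.random.default_rng(0)
sample = rng.normal(loc=10.0, scale=2.0, size=50)

# Histogram estimate: with density=True, bar heights form a crude density.
counts, edges = np.histogram(sample, bins=8, density=True)

# Kernel density estimate: smooths the sample with Gaussian kernels
# (bandwidth chosen automatically by Scott's rule).
kde = stats.gaussian_kde(sample)
grid = np.linspace(sample.min(), sample.max(), 200)
density = kde(grid)

# Sanity check: both estimates should integrate to roughly 1.
print(np.sum(counts * np.diff(edges)))           # exactly 1 by construction
print(np.sum(density) * (grid[1] - grid[0]))     # close to 1 (tails truncated)
```

With only 50 points, both estimates are quite rough, which is the point made above: the bin width (or kernel bandwidth) choice dominates the picture.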

markst,

Thanks.

Generally, I see that, for any sample, people assign the normal model straight away and extrapolate to values that were not observed in the sample.

Is it because of the Central Limit Theorem, which states that sample means have an approximately normal distribution irrespective of the distribution of the population?

In probability theory there are actually a multitude of "central limit" theorems. The one you generally come across at a more basic level applies when there is a large sample of independent random variables with finite mean and finite variance. In that circumstance the sample average behaves like a normal random variable. Historically a sample of thirty or more was usually considered sufficiently large, but it really depends on the type of data you have and on whether the measurements are "noisy", etc. It also depends on how the numbers should be interpreted: with rank-ordered data, say, it might not be meaningful to talk about sample averages at all. So, to give any concrete advice about what to do with the dataset, I'd really need to know what kind of data it is.

But if the data are real numbers (and those are a meaningful description of the data), and the data can be assumed to come from a process with finite mean and finite variance, then you're probably safe to go with the CLT.
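The CLT behavior described above is easy to see in a small simulation. This sketch (my own illustration, not part of the original exchange) draws many samples of size 30 from a deliberately skewed distribution and checks that the sample means behave like a normal variable with the predicted mean and spread:

```python
import numpy as np

# Simulate the CLT: means of size-30 samples from a skewed (exponential)
# population. n = 30 is the traditional rule-of-thumb sample size.
rng = np.random.default_rng(42)
n, trials = 30, 10_000
samples = rng.exponential(scale=1.0, size=(trials, n))
means = samples.mean(axis=1)

# The CLT predicts: mean of the averages ~ 1.0 (the population mean),
# and their standard deviation ~ 1/sqrt(30) ~ 0.183.
print(means.mean())
print(means.std())
```

Even though each individual observation is strongly right-skewed, the distribution of the 10,000 sample averages is already close to normal at n = 30.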

markst,

My question was not pertaining to any specific data. I have been reading many statistics books and these questions keep popping up in my head.

Thanks again,

MG.

markst,

Please see the attached file. I have plotted a histogram of the mileages of different cars. The histogram looks roughly like a triangle, though I'm not sure.

How can we go about figuring out the correct model for this variable?
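One common approach to this question is to fit several candidate distributions by maximum likelihood and compare how well each one matches the sample. The sketch below is only illustrative: the mileage data itself isn't available, so a synthetic triangular-looking sample stands in, and the candidate set (normal vs. triangular) is my assumption based on the histogram's shape.

```python
import numpy as np
from scipy import stats

# Stand-in for the mileage data: 50 values from a triangular distribution.
rng = np.random.default_rng(1)
data = rng.triangular(left=10, mode=25, right=40, size=50)

# Fit each candidate by maximum likelihood, then measure the discrepancy
# between the data and the fitted model with a Kolmogorov-Smirnov statistic.
# (Note: KS p-values are biased when parameters are estimated from the same
# data, so treat the statistic as a relative score, not a formal test.)
candidates = {"norm": stats.norm, "triang": stats.triang}
results = {}
for name, dist in candidates.items():
    params = dist.fit(data)                       # MLE parameter estimates
    results[name] = stats.kstest(data, name, args=params).statistic

for name, stat in results.items():
    print(name, round(stat, 3))                   # smaller = closer fit
```

With only 50 points, the two fits will often look similar; a lower KS statistic suggests a better-fitting family, but it can't by itself confirm that the chosen model is "correct".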

Thanks,

MG.

#### Attachments

• (histogram plot, 64.5 KB)