Steps to follow to determine the right model for data?

musicgold · May 3, 2009

Hi,

I am trying to understand the basic steps one has to follow to determine an appropriate model for a set of sample data. I couldn’t find any webpage describing the process.

Suppose you are given 50 values of a variable as a sample. How would you go about deciding the appropriate model that will represent the variable?

Thanks,

MG.

markst · May 4, 2009

This is called the "density estimation" problem. Several non-parametric techniques can be brought to bear when very little is assumed about the distribution; kernel density estimation and histograms are the main ones. Fifty data points is very little data, though, and these non-parametric density estimation techniques are usually not relied on with so few points. Do you have any more information on the type of distribution the data come from? For instance, can you assume that the data points are independent and identically distributed?

Here's the Wikipedia link:
http://en.wikipedia.org/wiki/Density_estimation
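
As a rough illustration of the two techniques named above, here is a minimal sketch in plain Python that builds a density-scaled histogram and a Gaussian kernel density estimate from 50 values. The sample is synthetic (drawn from a normal distribution purely for demonstration), and the helper names `histogram_density` and `gaussian_kde`, the bin count, and the bandwidth rule are all assumptions made for this example, not something from the thread.

```python
import math
import random
import statistics

def histogram_density(data, bins=8):
    """Return (bin_edges, densities) scaled so the bar areas sum to 1."""
    lo, hi = min(data), max(data)
    width = (hi - lo) / bins
    counts = [0] * bins
    for x in data:
        # Clamp the index so the maximum value lands in the last bin.
        idx = min(int((x - lo) / width), bins - 1)
        counts[idx] += 1
    edges = [lo + i * width for i in range(bins + 1)]
    densities = [c / (len(data) * width) for c in counts]
    return edges, densities

def gaussian_kde(data, bandwidth=None):
    """Return a function x -> estimated density using Gaussian kernels."""
    n = len(data)
    if bandwidth is None:
        # Normal-reference rule of thumb; an assumed default, and only
        # sensible when the data are roughly unimodal.
        bandwidth = 1.06 * statistics.stdev(data) * n ** (-1 / 5)
    norm = 1.0 / (n * bandwidth * math.sqrt(2 * math.pi))

    def density(x):
        return norm * sum(
            math.exp(-0.5 * ((x - xi) / bandwidth) ** 2) for xi in data
        )

    return density

# Synthetic stand-in for "50 values of a variable" (hypothetical data).
rng = random.Random(0)
sample = [rng.gauss(25.0, 5.0) for _ in range(50)]

edges, dens = histogram_density(sample)
kde = gaussian_kde(sample)

for left, right, d in zip(edges, edges[1:], dens):
    mid = 0.5 * (left + right)
    print(f"[{left:6.2f}, {right:6.2f})  hist={d:.4f}  kde(mid)={kde(mid):.4f}")
```

With only 50 points the histogram changes noticeably with the bin count, and the kernel estimate changes with the bandwidth, which is one reason these estimates are treated with caution on small samples.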

musicgold · May 4, 2009

markst,

Thanks.

Generally, I see that, for any sample, people straight away assign the normal model and extrapolate values that were not observed in the sample.

Is it because of the Central Limit Theorem, which specifies that sample means have a normal distribution irrespective of the distribution of the population?

markst · May 4, 2009

In probability theory there are actually a multitude of "central limit" theorems. The one that you generally come across at a more basic level applies when there is a large sample of independent, identically distributed random variables with finite mean and finite variance. In that circumstance the sample average behaves approximately like a normal random variable. Historically, a sample of thirty or more was usually considered sufficiently large, but it really depends on the type of data that you have, and on whether you have any information about the measurements being "noisy", etc. It also depends on how the numbers are to be interpreted: if the data are rank orderings, for example, it might not be meaningful to talk about sample averages. So, in order to give any sort of concrete advice about what to do with the dataset, I'd really need to know what kind of data it is.

But if the data are real numbers (and those are a meaningful description of the data), and the data can be assumed to be generated from a process with finite mean and finite variance, then you're probably safe to apply the CLT to the sample average.
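
To make the statement about sample averages concrete, here is a small simulation sketch. It assumes an exponential population with mean 1 and variance 1 (a skewed distribution chosen only for illustration), and the function name `sample_means`, the sample sizes, and the number of trials are likewise assumptions for this example. It checks how often the sample average falls within 1.96 standard errors of the true mean, which would be about 95% of the time if the average were exactly normal.

```python
import math
import random
import statistics

def sample_means(draw, n, trials, rng):
    """Average n independent draws, repeated `trials` times."""
    return [statistics.fmean(draw(rng) for _ in range(n)) for _ in range(trials)]

def draw_exponential(rng):
    # Exponential with rate 1: mean 1, variance 1 (assumed population).
    return rng.expovariate(1.0)

rng = random.Random(1)
true_mean, true_sd = 1.0, 1.0
results = {}

for n in (2, 30, 200):
    means = sample_means(draw_exponential, n, 3000, rng)
    std_err = true_sd / math.sqrt(n)  # CLT prediction for the spread
    within = sum(abs(m - true_mean) <= 1.96 * std_err for m in means) / len(means)
    results[n] = (statistics.fmean(means), statistics.stdev(means), within)
    print(
        f"n={n:4d}  mean of averages={results[n][0]:.3f}  "
        f"sd of averages={results[n][1]:.3f}  (predicted {std_err:.3f})  "
        f"fraction within 1.96 SE={within:.3f}"
    )
```

Note that this says something about the distribution of the sample average, not about the distribution of the individual data values, which remain exponential here however large the sample is.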

musicgold · May 6, 2009

markst,

Thanks. Appreciate your help.

My question did not pertain to any specific data. I have been reading many statistics books, and these questions keep popping into my head.

Thanks again,

MG.

musicgold · May 6, 2009

markst,

Please see the attached file. I have plotted a histogram of the mileages of different cars. The histogram looks roughly triangular, though I am not sure.

How can we go about figuring the correct model for this variable?

Thanks,

MG.
