Steps to follow to determine the right model for data?

  • Context: Undergrad
  • Thread starter: musicgold
  • Tags: Data Model

Discussion Overview

The discussion revolves around the steps to determine an appropriate statistical model for a given set of sample data, specifically focusing on density estimation and model selection. The scope includes theoretical considerations, practical applications, and exploratory reasoning regarding statistical methods.

Discussion Character

  • Exploratory
  • Technical explanation
  • Conceptual clarification
  • Debate/contested

Main Points Raised

  • One participant introduces the concept of density estimation and mentions non-parametric techniques like kernel density estimation and histograms, noting the limitations of using these methods with small sample sizes.
  • Another participant questions the common practice of assuming a normal distribution for sample data, linking it to the Central Limit Theorem (CLT) and its various forms.
  • There is a discussion about the conditions under which the CLT applies, including the size of the sample and the nature of the data, with emphasis on the need for finite mean and variance.
  • A participant expresses uncertainty about the appropriateness of the normal model for their data, particularly when the histogram of the data appears triangular.

Areas of Agreement / Disagreement

Participants express differing views on the appropriateness of assuming a normal distribution for sample data, with some advocating for caution and others referencing the Central Limit Theorem. The discussion remains unresolved regarding the best approach to model selection for the specific data presented.

Contextual Notes

Limitations include the small sample size of 50 data points, the lack of information about the underlying distribution, and the potential variability in data characteristics that could affect model selection.

musicgold
Hi,

I am trying to understand the basic steps one has to follow to determine an appropriate model for a set of sample data. I couldn’t find any webpage describing the process.

Suppose you are given 50 values of a variable as a sample. How would you go about deciding the appropriate model that will represent the variable?

Thanks,

MG.
 
This is called the "density estimation" problem. Many non-parametric techniques can be brought to bear on this question, making only very limited assumptions about the distribution; kernel density estimation and histograms are the main ones. 50 data points is really very little data, though, and most of these non-parametric density estimation techniques are usually not relied on in such cases. Do you have any more information on the type of distribution from which you are getting the data? For instance, are the data points independent and identically distributed (can you make that assumption)?

Here's the Wikipedia link:
http://en.wikipedia.org/wiki/Density_estimation
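
As a rough illustration of the two estimators mentioned above, here is a minimal sketch in Python using NumPy and SciPy. The 50 values are synthetic stand-ins drawn from a normal distribution (an assumption made purely for illustration, not data from this thread), and the bin count of 8 is an arbitrary choice.

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
# Hypothetical stand-in for the 50 observed values (not data from this thread).
sample = rng.normal(loc=25.0, scale=5.0, size=50)

# Histogram estimate: density=True rescales the counts so the bars integrate to 1.
density, edges = np.histogram(sample, bins=8, density=True)

# Gaussian kernel density estimate; SciPy picks the bandwidth by Scott's rule by default.
kde = gaussian_kde(sample)
grid = np.linspace(sample.min() - 10, sample.max() + 10, 400)
kde_values = kde(grid)

# Sanity check: both estimates should integrate to approximately 1.
hist_area = float(np.sum(density * np.diff(edges)))
kde_area = float(np.sum(kde_values) * (grid[1] - grid[0]))
print(f"histogram area: {hist_area:.3f}")
print(f"KDE area on grid: {kde_area:.3f}")
```

With only 50 points, the shape of either estimate can change noticeably with the number of bins or the bandwidth, which is one reason these methods are treated with caution on small samples.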
 
markst,

Thanks.

Generally, I see that, for any sample, people straight away assign the normal model and extrapolate values that were not observed in the sample.

Is it because of the Central Limit Theorem, which specifies that sample means have a normal distribution irrespective of the distribution of the population?
 
In probability theory there are actually a multitude of "central limit" theorems. The one that you generally come across at a more basic level applies to the case when there is a large sample of independent random variables with finite mean and finite variance. In that circumstance you say that the sample average behaves like a normal random variable. Historically a sufficiently large sample was usually considered to be thirty or greater, but it really depends on the type of data that you have, and whether you have any information about the measurements being "noisy", etc. It also depends on how the numbers are to be interpreted: if the data are rank orderings, for example, it might not be meaningful to talk about sample averages. So, in order to give any sort of concrete advice about what to do with the dataset I'd really need to know what kind of data it is.

But if the data are real numbers (and those are a meaningful description of the data) and the data can be assumed to be generated from a process with finite mean and finite variance, then you're probably safe to go with the CLT for the sample average.
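
A small simulation can make this concrete. The sketch below is only an illustration under stated assumptions: the population is taken to be exponential with mean 1 and variance 1 (a skewed distribution chosen for the example, not anything from this thread), and the sample size of 50 matches the original question.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50            # size of each sample, matching the thread's example
trials = 20_000   # number of simulated samples

# Exponential population with scale 1: mean 1, variance 1, strongly right-skewed.
draws = rng.exponential(scale=1.0, size=(trials, n))
sample_means = draws.mean(axis=1)

def skewness(x):
    """Moment-based sample skewness; 0 for a symmetric distribution."""
    centered = x - x.mean()
    return float(np.mean(centered**3) / np.mean(centered**2) ** 1.5)

mean_of_means = float(sample_means.mean())
std_of_means = float(sample_means.std(ddof=1))
predicted_std = 1.0 / np.sqrt(n)   # sigma / sqrt(n) with sigma = 1

raw_skew = skewness(draws.ravel())   # skewness of individual values
mean_skew = skewness(sample_means)   # skewness of the sample averages

print(f"mean of sample means: {mean_of_means:.3f}")
print(f"std of sample means:  {std_of_means:.3f} (predicted {predicted_std:.3f})")
print(f"skewness, raw values: {raw_skew:.2f}; sample means: {mean_skew:.2f}")
```

In theory the exponential distribution has skewness 2, while the average of 50 such values has skewness 2/sqrt(50), about 0.28. The simulation shows that it is the sample averages that become close to normal; the individual values stay as skewed as the population they came from.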
 
markst,

Thanks. Appreciate your help.

My question was not pertaining to any specific data. I have been reading many statistics books and these questions keep popping into my head.

Thanks again,

MG.
 
markst,

Please see the attached file. Note that I have plotted the histogram of mileages of different cars. The histogram looks roughly like a triangle, though I am not sure.

How can we go about figuring the correct model for this variable?

Thanks,

MG.
 
