Steps to follow to determine the right model for data?

  • Thread starter musicgold
  • Tags
    Data Model
In summary: The attached file includes a histogram of car mileage data. The histogram looks roughly triangular, and the poster asks how to determine an appropriate model for it.
  • #1
musicgold
Hi,

I am trying to understand the basic steps one has to follow to determine an appropriate model for a set of sample data. I couldn’t find any webpage describing the process.

Suppose you are given 50 values of a variable as a sample. How would you go about deciding the appropriate model that will represent the variable?

Thanks,

MG.
 
  • #2
This is called the "density estimation" problem. There are several non-parametric techniques that can be brought to bear on this question when very little is assumed about the distribution; kernel density estimation and histograms are the main ones. 50 data points is really very little data, though, and most of these non-parametric density estimation techniques are usually not relied on in such cases. Do you have any more information on the type of distribution from which you are getting the data? For instance, can you assume that the data points are independent and identically distributed?

Here's the Wikipedia link:
http://en.wikipedia.org/wiki/Density_estimation
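
To make the two techniques concrete, here is a minimal sketch in plain Python of a histogram and a Gaussian kernel density estimate. The sample is hypothetical (50 simulated values, not data from this thread), and the bandwidth uses the common normal-reference rule of thumb, which is only one of several possible choices. The function names are made up for illustration.

```python
import math
import random


def rule_of_thumb_bandwidth(data):
    """Normal-reference rule of thumb: 1.06 * sd * n^(-1/5)."""
    n = len(data)
    mean = sum(data) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in data) / (n - 1))
    return 1.06 * sd * n ** (-1 / 5)


def gaussian_kde(data, bandwidth):
    """Return a function that evaluates the kernel density estimate at x."""
    n = len(data)
    norm = 1.0 / (n * bandwidth * math.sqrt(2 * math.pi))

    def density(x):
        return norm * sum(
            math.exp(-0.5 * ((x - xi) / bandwidth) ** 2) for xi in data
        )

    return density


def histogram(data, bins):
    """Count how many values fall into each of `bins` equal-width intervals."""
    lo, hi = min(data), max(data)
    width = (hi - lo) / bins
    counts = [0] * bins
    for x in data:
        idx = min(int((x - lo) / width), bins - 1)  # the maximum goes in the last bin
        counts[idx] += 1
    return counts


# Hypothetical sample: 50 draws from a normal distribution (mean 25, sd 5).
rng = random.Random(0)
sample = [rng.gauss(25, 5) for _ in range(50)]

h = rule_of_thumb_bandwidth(sample)
kde = gaussian_kde(sample, h)
counts = histogram(sample, bins=7)

print("bandwidth:", round(h, 3))
print("histogram counts:", counts)
print("estimated density at 25:", round(kde(25.0), 4))
```

With only 50 points, both the bin count and the bandwidth change the picture noticeably, which is one reason these estimates are hard to rely on for small samples.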
 
  • #3
markst,

Thanks.

Generally, I see that, for any sample, people straight away assign the normal model and extrapolate values that were not observed in the sample.

Is it because of the Central Limit Theorem, which specifies that sample means have a normal distribution irrespective of the distribution of the population?
 
  • #4
In probability theory there are actually a multitude of "central limit" theorems. The one that you generally come across at a more basic level applies to the case when there is a large sample of independent, identically distributed random variables with finite mean and finite variance. In that circumstance you say that the sample average behaves approximately like a normal random variable. Historically a sufficiently large sample was usually considered to be thirty or greater, but it really depends on the type of data that you have, and whether you have any information about the measurements being "noisy", etc. It also depends on how the numbers are to be interpreted: for example, if the data are rank orderings, it might not be meaningful to talk about sample averages. So, in order to give any sort of concrete advice about what to do with the dataset I'd really need to know what kind of data it is.

But if the data is real numbers (and those are a meaningful description of the data) and the data can be assumed to be generated from a process with finite mean and finite variance then you're probably safe to go with the CLT.
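
A quick simulation can illustrate the basic theorem. The sketch below assumes an exponential population with mean 1 (a skewed distribution chosen only for illustration) and samples of size 30. It checks that the sample means cluster around the population mean with a spread close to the population standard deviation divided by the square root of the sample size.

```python
import math
import random
import statistics

rng = random.Random(42)

n = 30            # size of each sample
repeats = 5000    # number of samples drawn

# Each sample mean averages n draws from an exponential distribution with mean 1.
sample_means = [
    statistics.fmean([rng.expovariate(1.0) for _ in range(n)])
    for _ in range(repeats)
]

mean_of_means = statistics.fmean(sample_means)
sd_of_means = statistics.stdev(sample_means)
predicted_sd = 1.0 / math.sqrt(n)  # population sd divided by sqrt(n)

# Fraction of sample means within one predicted sd of the population mean.
# A normal distribution would give about 0.683.
within_one_sd = sum(
    abs(m - 1.0) <= predicted_sd for m in sample_means
) / repeats

print("mean of sample means:", round(mean_of_means, 3))
print("sd of sample means:", round(sd_of_means, 3))
print("predicted sd:", round(predicted_sd, 3))
print("fraction within one sd:", round(within_one_sd, 3))
```

Note that this describes the distribution of the sample mean, not of the individual observations, which remain exponential however many of them are collected.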
 
  • #5
markst,

Thanks. Appreciate your help.

My question was not pertaining to any specific data. I have been reading many Statistics books and these questions keep popping in my head.

Thanks again,

MG.
 
  • #6
markst,

Please see the attached file. I have plotted a histogram of the mileages of different cars. The histogram looks like a triangle, but I am not sure.

How can we go about figuring the correct model for this variable?

Thanks,

MG.
 

Attachments

  • car data.xls
    64.5 KB · Views: 205

1. What is the first step in determining the right model for data?

The first step is to clearly define the problem or question that the data will be used to address. This will help guide the selection of appropriate variables and models.

2. How do I choose the right variables to include in my model?

Variables should be chosen based on their relevance to the problem and their ability to contribute to the predictive power of the model. This can be determined through exploratory data analysis and statistical techniques such as regression analysis.

3. Should I use a simple or complex model?

This depends on the complexity of the problem and the amount of data available. In general, it is recommended to start with a simpler model and gradually increase complexity if necessary. A balance between simplicity and accuracy should be considered.
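
One way to make the trade-off between simplicity and fit concrete is the Akaike information criterion (AIC), which is not mentioned in the thread and is used here only as an illustration. The sketch below assumes a hypothetical sample of 50 positive values and two candidate models, a normal distribution (two parameters) and an exponential distribution (one parameter), each fitted by maximum likelihood. A lower AIC indicates a better balance of fit and complexity.

```python
import math
import random


def normal_loglik(data):
    """Maximised log-likelihood of a normal model (two fitted parameters)."""
    n = len(data)
    mean = sum(data) / n
    var = sum((x - mean) ** 2 for x in data) / n  # maximum-likelihood variance
    return -0.5 * n * (math.log(2 * math.pi * var) + 1)


def exponential_loglik(data):
    """Maximised log-likelihood of an exponential model (one fitted parameter).

    Assumes every value in `data` is positive.
    """
    n = len(data)
    mean = sum(data) / n  # the maximum-likelihood rate is 1 / mean
    return -n * (math.log(mean) + 1)


def aic(loglik, n_params):
    """Akaike information criterion: 2k - 2 * log-likelihood."""
    return 2 * n_params - 2 * loglik


# Hypothetical sample: 50 draws from a normal distribution (mean 25, sd 5).
rng = random.Random(1)
sample = [rng.gauss(25, 5) for _ in range(50)]

aic_normal = aic(normal_loglik(sample), n_params=2)
aic_exponential = aic(exponential_loglik(sample), n_params=1)

print("AIC normal:", round(aic_normal, 1))
print("AIC exponential:", round(aic_exponential, 1))
```

The comparison only ranks the candidates that were supplied. It does not show that the better-scoring model is correct, so a plot of the fitted density against the histogram is still worth checking.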

4. How do I evaluate the performance of different models?

There are various metrics that can be used to evaluate the performance of a model; for classification models these include accuracy, precision, recall, and F1 score. It is important to choose the appropriate metric based on the specific problem and to compare the results of different models using the same metric.

5. Can I use the same model for all types of data?

No, different types of data require different models. For example, time series data may require a different model than categorical data. It is important to understand the characteristics of the data and choose a model that is suitable for it.
