Creating a Predictive Model for Time-Independent Data Analysis

In summary, the thread discusses the need to create a model to predict future values of a given set of data. The data consist of values x recorded once a day at times y over roughly six months, which the original poster initially describes as not depending on time. The question is whether fitting the data to a power law by regression or plotting moving averages is the better approach for making accurate predictions, and the discussion clarifies that the purpose of the model is to forecast future values of x.
  • #1
kdbnlin78
Hey all.

I have some data, approximately 6 months' worth. It is values that do not depend on time but are represented by a pair (x, y), such that the value x is measured at a point in time y. The data is therefore equivalent to measuring how much stock I have in a given time period, say. The measurements occur once a day for 6 months.

I need to take this data and create a model that would allow me to predict values of x in (say) one month's, three months' and six months' time.

My question is: what is the best method to use? Should I suppose that a power law exists and use a regression model, or would a simple plot of moving averages allow me to make an accurate prediction?

I guess I don't necessarily need to end up with a functional model representing x as a function of y (x = f(y)), but rather to use some statistical inference to accurately forecast the values of x_{i} in the future.

Any help on this matter would be very appreciated.

Regards,
kdbnlin
 
  • #2
There is no good general answer to this question. There are lots of techniques, but forecasting is an art. (I had an entire one-semester course on it in biz school.)

Fitting to a power law is a good strategy if there is indeed reason to think a power law describes the data. That might be because you have a theory that says it should be a power law, or because a power law fits past performance well. Absent such reason, there are lots of other possible models.

Finally, I don't understand this:

It is values that do not depend on time but are represented by a pair (x, y), such that the value x is measured at a point in time y.
As well as I can make this out, you are saying that you measure x as a function of time. Then, for some mysterious reason, you decide to call time y, and, most mysterious of all, you claim the "values that do not depend on time". Could you make this a little clearer? If the values don't depend on time, how can you possibly think you could build a model that will make predictions for future times? In fact, if the values don't depend on time, doesn't that mean that the value is always the same, so no prediction is necessary?
 
  • #3
pmsrw3 said:
Finally, I don't understand this: ... If the values don't depend on time, how can you possibly think you could build a model that will make predictions for future times?

Hi pmsrw3 thanks for the reply.

Apologies for the (non) use of English!

What I mean is this: A measurement of (say) x does not depend on time. We capture the value of x at some point in time. So for example, a shop sells cans of soup. The shop owner measures how many cans of soup he sells each week. He measures this on Friday evening every week. He can assume that the sales of the soup do not (necessarily) depend on time.

So I have a pair of values (x, t) such that x represents the number of cans sold and t is the time (of the day) at which the shop owner counted them.

I mean, then, that x = x(t) is not really what I am looking for; rather, I would like to know (using the analogy) how many cans of soup I will sell over the next 26 Fridays.

I hope that makes (some kind of) sense.

Regards,
kdbnlin.
 
  • #4
So what you have is a series of sales values, each with a date. There is in addition a completely uninteresting and uninformative datum that gives the time of day, y, at which each report was made. Is that right?

Assuming the answer is yes, you're obviously just going to throw out y. Then you have a series of (t, x) values. Here I'm using t for the date, since a date is also a time measurement. (Sorry if that's strange to you, but it's the way such data are usually handled.)

I can't think of any particular reason to expect this to follow a power law. I would probably just fit it to a constant, a line, a quadratic, etc, stopping when the fit stopped improving (as determined by an F-test, for instance). I'd be very surprised if the data justify going beyond a quadratic. If you had at least a year's worth of data it might also make sense to include a seasonal correction, but you don't have a long enough time series for that. (You might, however, be able to find something useful in BLS sales stats.)
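For what it's worth, here is a minimal Python sketch of that constant/line/quadratic comparison using an F-test on nested least-squares fits. The daily series, noise level and the 0.05 cutoff are placeholder assumptions, not part of the suggestion above.

```python
# Rough sketch: fit polynomials of increasing degree and stop when the
# extra term no longer improves the fit (nested-model F-test).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
t = np.arange(180.0)                          # ~6 months of daily samples
x = 50 + 0.1 * t + rng.normal(0, 5, t.size)   # hypothetical sales-like series

def rss(t, x, degree):
    """Residual sum of squares for a polynomial fit of the given degree."""
    coeffs = np.polyfit(t, x, degree)
    resid = x - np.polyval(coeffs, t)
    return float(resid @ resid)

prev_rss, prev_k = rss(t, x, 0), 1            # start from the constant model
for degree in (1, 2, 3):
    cur_rss, cur_k = rss(t, x, degree), degree + 1
    # F statistic for the nested-model comparison
    f = ((prev_rss - cur_rss) / (cur_k - prev_k)) / (cur_rss / (t.size - cur_k))
    p = stats.f.sf(f, cur_k - prev_k, t.size - cur_k)
    print(f"degree {degree}: F = {f:.2f}, p = {p:.3f}")
    if p > 0.05:                              # extra term did not significantly improve the fit
        break
    prev_rss, prev_k = cur_rss, cur_k
```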
 
  • #5
kdbnlin78 said:
I have some data, approximately 6 months' worth. ... My question is: what is the best method to use? Should I suppose that a power law exists and use a regression model, or would a simple plot of moving averages allow me to make an accurate prediction?

Your question is nearly the same as finding a representation of the function that describes the data, and then using that representation to transform the data back to something linear that can be operated on and analyzed with standard techniques.

As pmsrw3 stated above, there is not a single or easy way to do this.

From your post it seems like you are looking for a parametrization of your data.

One suggestion you could try is to break up your data into smaller chunks and try a parametrization of each subsection, and then, chunk by chunk, try to unify the different parametrized sections together.

If the data follows some simple power law, you may not have to do this, but if it's complicated, this may help (and I emphasize "may").
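A rough sketch of that chunk-by-chunk parametrization, with made-up daily data and an arbitrary 30-day block size; only the idea comes from the post above.

```python
# Fit a low-order polynomial to each block of the series and watch how the
# local parameters drift from block to block.
import numpy as np

rng = np.random.default_rng(1)
t = np.arange(180.0)
x = 50 + 0.1 * t + 3 * np.sin(t / 20) + rng.normal(0, 2, t.size)

block = 30                                    # ~one month of daily data per chunk
for start in range(0, t.size, block):
    ts, xs = t[start:start + block], x[start:start + block]
    slope, intercept = np.polyfit(ts, xs, 1)  # local linear parametrization
    print(f"days {start:3d}-{start + ts.size - 1:3d}: slope {slope:+.3f}, intercept {intercept:.1f}")
```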

One other possible way is to transform your data by time series analysis to make it "smooth" and then use a variety of integral transform techniques to get the required information about its functional structure. Depending on the transform used, you will get specific information about the function in some form or another. This kind of method is more systematic than what I said above, but it's more complex.
 
  • #6
chiro said:
One other possible way is to transform your data by time series analysis to make it "smooth" and then use a variety of integral transform techniques to get the required information about its functional structure. Depending on the transform used, you will get specific information about the function in some form or another. This kind of method is more systematic than what I said above, but it's more complex.

This is the route I think I will take. It seems very straightforward to smooth out the data by a simple moving-averages technique. One question comes to mind: how do I know what "step distance" to take in calculating the moving averages?

I suppose I mean: what value of $n$ should I take in $\frac{1}{n}\sum_{j=i}^{i+n-1} a_{j}$, where the $a_{j}$ are my sequence of data?
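As a concrete illustration of that formula, a minimal moving-average sketch in Python; the series and the candidate window sizes are fabricated, not a recommendation for a particular $n$.

```python
# Simple moving average: the k-th output is the mean of a[k], ..., a[k+n-1].
# In practice n trades off noise reduction against responsiveness to change.
import numpy as np

rng = np.random.default_rng(2)
x = 100 + 0.2 * np.arange(180) + rng.normal(0, 8, 180)   # stand-in daily series

def moving_average(a, n):
    return np.convolve(a, np.ones(n) / n, mode="valid")

for n in (5, 10, 20):
    smooth = moving_average(x, n)
    print(f"n = {n:2d}: {smooth.size} smoothed points, "
          f"std of smoothed series {smooth.std():.2f}")
```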

Further to this, I should imagine a discrete Fourier transform may work in this case, given that my measurements are taken at regular intervals over time.
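And a sketch of the discrete-Fourier-transform idea on an evenly sampled series; the weekly cycle in the fabricated data is an assumption made purely for demonstration.

```python
# Use the DFT of a detrended, evenly sampled daily series to look for
# regular cycles.
import numpy as np

rng = np.random.default_rng(3)
days = np.arange(180)
x = 100 + 5 * np.sin(2 * np.pi * days / 7) + rng.normal(0, 2, days.size)

spectrum = np.abs(np.fft.rfft(x - x.mean()))   # drop the constant level first
freqs = np.fft.rfftfreq(days.size, d=1.0)      # cycles per day

peak = np.argmax(spectrum[1:]) + 1             # skip the zero-frequency bin
print(f"strongest cycle: period ~ {1 / freqs[peak]:.1f} days")
```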
 
  • #7
My personal philosophy on the generalities of fitting models to data is that you should actually have a model. If you hypothesize a probability model (one that is detailed enough to let you write a computer simulation of how the data are generated), then even if this model has unknown parameters, you can answer many questions about the "best" way to fit equations to the data. If you hypothesize such a model you are using subjective judgement, but you can explicitly know and state what you have assumed. If you don't hypothesize a model and instead make a whole collection of subjective judgements based on "I like this transform" or "I'll group these things together", etc., then you have still made subjective judgements, but they have no coherent structure and they don't imply that any particular mathematical methods are optimal.
 
  • #8
Stephen Tashi said:
My personal philosophy on the generalities of fitting models to data is that you should actually have a model.
I like that attitude, but I have some sympathy with kdbnlin78. Sometimes (and this happens a lot in biz forecasting), you may have absolutely no good explanation for why a thing changes in the way it does. And yet you still want to make forecasts. In that case, making up a model is an exercise in subjectivity no better than choosing a Fourier series because you just love Fourier series. I think in that case one is justified in trying a bunch of random stuff and seeing what works. In fact, that may actually lead to a model.
 
  • #9
Stephen Tashi said:
My personal philosophy on the generalities of fitting models to data is that you should actually have a model.
Not necessarily. A model is a nice thing (very nice thing!) to have, but lack of a model does not mean you can't go forward. This is similar in a sense to the distinction between supervised and unsupervised machine learning.

One thing to beware of in model-free fit (or unsupervised learning) is the danger of overfitting. Throw 2001 data points at a fitting algorithm and it will gladly come up with a 2000th-order polynomial that matches every single one of those data points to a T. That 2000th-order polynomial almost certainly has zero predictive capability. Shoot, it almost certainly doesn't even have good interpolative capabilities.
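A small-scale illustration of that point, with synthetic data (60 points and degree 15 rather than 2001 points and degree 2000, purely for demonstration):

```python
# A high-degree polynomial nails the points it was trained on and falls
# apart on held-out points, while a straight line generalizes fine.
import numpy as np

rng = np.random.default_rng(4)
t = np.linspace(0, 10, 60)
x = 2 + 0.5 * t + rng.normal(0, 1, t.size)    # truly linear trend plus noise

train, test = slice(0, 40), slice(40, 60)     # fit on the first part, check on the rest

for degree in (1, 15):
    coeffs = np.polyfit(t[train], x[train], degree)
    train_mse = np.mean((x[train] - np.polyval(coeffs, t[train])) ** 2)
    test_mse = np.mean((x[test] - np.polyval(coeffs, t[test])) ** 2)
    print(f"degree {degree:2d}: train MSE {train_mse:.2f}, test MSE {test_mse:.2e}")
```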
 
  • #10
Stephen Tashi said:
My personal philosophy on the generalities of fitting models to data is that you should actually have a model.


I agree with your sentiment and accept that my philosophy here is somewhat flawed and certainly not perfect. My motivation for using the approach in my previous post is based on a "best-case scenario" given some time series data that isn't really appropriate for answering the questions I am being asked.

However, in this case, like all good Scientists and Mathematicians, I'll proceed and exhaust all possible avenues until one gives me a "best case answer".

As a Mathematician this approach pains me somewhat but needs must.
 
  • #11
D H said:
One thing to beware of in model-free fit (or unsupervised learning) is the danger of overfitting. Throw 2001 data points at a fitting algorithm and it will gladly come up with a 2000th-order polynomial that matches every single one of those data points to a T. That 2000th-order polynomial almost certainly has zero predictive capability. Shoot, it almost certainly doesn't even have good interpolative capabilities.
Exactly. And the same danger exists in a more opaque form when you try lots and lots of different models until you find the one that seems to fit best. You may find one that fits the points you have, but that's not a good reason to think it'll fit the next point you get. Nate Silver at FiveThirtyEight.com is good on this.

The guy's got six months of data sampled weekly -- call it 26 data points. I really think smoothing + integral transforms is overkill (and probably overfitting!). Such a short series probably only tells you where you are and how fast things have been changing lately.

EDIT: Sorry, I went back and read again, and I may have gotten this wrong. In the OP, kdbnlin78 says the data are sampled daily. But then later he says they're sampled on Fridays. So I'm not sure if he has 26 or 180 data points. But the point stands.
 
  • #12
pmsrw3 said:
The guy's got six months of data sampled weekly -- call it 26 data points...etc
I have now been able to go back and retrieve three years' worth of data measured on a daily basis. So I guess that's c. 1100 data points. I appreciate both your comment and the quoted comment too.

This isn't my finest moment as a Mathematician! ;-)
 
  • #13
kdbnlin78 said:
I have now been able to go back and retrieve three years' worth of data measured on a daily basis. So I guess that's c. 1100 data points. I appreciate both your comment and the quoted comment too.
Oh, 1100... Well, that puts a bit of a different complexion on things.

You know, I think you should try LOESS (local regression, http://en.wikipedia.org/wiki/LOESS).
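For reference, a minimal LOESS/LOWESS sketch using the statsmodels implementation; the synthetic series and the frac value are placeholder assumptions. Note that LOWESS only describes the in-sample trend, so forecasting would still need an extra step such as extrapolating the most recent local fit.

```python
# Minimal LOWESS smoothing sketch on a synthetic stand-in for ~1100 daily points.
import numpy as np
from statsmodels.nonparametric.smoothers_lowess import lowess

rng = np.random.default_rng(5)
t = np.arange(1100.0)
x = 200 + 0.05 * t + 20 * np.sin(2 * np.pi * t / 365) + rng.normal(0, 10, t.size)

# frac controls how local the fit is: smaller = wigglier, larger = smoother
smoothed = lowess(x, t, frac=0.1)   # returns an array with columns [t, fitted value]
print(smoothed[-5:])                # last few fitted points
```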
 
  • #14
pmsrw3 said:
Oh, 1100... Well, that puts a bit of a different complexion on things.

You know, I think you should try LOESS (local regression, http://en.wikipedia.org/wiki/LOESS).

OK - local regression. I wasn't aware of it as a technique. I think you may be right though; it looks like a good technique for my problem: an unknown function underlying the data, and my need for a forecast.

(Edit: I see that the technique is local since all we are doing is Taylor-style local expansions - makes sense now)

Thank you very much for that - I'll search the term and do some reading!

Regards,
kdbnlin
 

What is the best model to fit data?

The best model to fit data depends on the type of data and the research question. It is important to consider the assumptions and limitations of different models and choose the one that best represents the data and allows for accurate predictions.

How do I know if a model is a good fit for my data?

There are a few ways to assess if a model is a good fit for your data. One way is to look at the R-squared value, which measures how well the model explains the variation in the data. Additionally, you can plot the model's residuals and check for patterns or outliers, which can indicate a poor fit.
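As a hedged illustration of both checks, with made-up data and a straight-line fit:

```python
# Compute R-squared for a fitted line and build the residuals that one
# would then plot against t (or the fitted values) to look for patterns.
import numpy as np

rng = np.random.default_rng(6)
t = np.arange(100.0)
x = 3 + 0.8 * t + rng.normal(0, 4, t.size)

coeffs = np.polyfit(t, x, 1)
fitted = np.polyval(coeffs, t)
residuals = x - fitted

r_squared = 1 - np.sum(residuals**2) / np.sum((x - x.mean())**2)
print(f"R^2 = {r_squared:.3f}")
# A good fit leaves residuals that look like structureless noise around zero.
```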

What is overfitting and how does it affect model selection?

Overfitting occurs when a model is too complex and fits the training data too closely, resulting in poor performance on new data. This can be problematic when selecting a model because the one that fits the training data best may not generalize well to new data. It is important to balance model complexity and performance to avoid overfitting.

Can I use the same model for all types of data?

No, different types of data require different models. For example, linear regression is suitable for a continuous outcome, while logistic regression is better suited to a categorical (for example, binary) outcome. It is important to understand the type of data and choose an appropriate model for best results.
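A tiny sketch of that distinction using scikit-learn; the feature, the targets and the sample size are all fabricated for illustration.

```python
# Linear regression for a continuous target, logistic regression for a
# binary (categorical) target.
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.default_rng(7)
X = rng.normal(size=(200, 1))
y_continuous = 2.0 * X[:, 0] + rng.normal(0, 0.5, 200)
y_binary = (X[:, 0] + rng.normal(0, 0.5, 200) > 0).astype(int)

print(LinearRegression().fit(X, y_continuous).coef_)                # estimated slope
print(LogisticRegression().fit(X, y_binary).predict_proba(X[:3]))   # class probabilities
```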

What are some common techniques for selecting the best model?

Some common techniques for selecting the best model include cross-validation, which tests the model on held-out subsets of the data, and stepwise regression, which adds or removes predictors and compares the performance of the resulting models. It is also important to consider the interpretability and simplicity of the model when making a selection.
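A short cross-validation sketch (scikit-learn, synthetic data, arbitrary candidate degrees); for a genuine time series a forward-chaining split such as scikit-learn's TimeSeriesSplit would be the safer choice, since random folds let the model peek at the future.

```python
# Compare candidate polynomial degrees by 5-fold cross-validated MSE.
import numpy as np
from sklearn.model_selection import KFold

rng = np.random.default_rng(8)
t = np.arange(200.0)
x = 10 + 0.3 * t + rng.normal(0, 5, t.size)

cv = KFold(n_splits=5, shuffle=True, random_state=0)
for degree in (1, 2, 6):
    errors = []
    for train_idx, test_idx in cv.split(t):
        coeffs = np.polyfit(t[train_idx], x[train_idx], degree)
        pred = np.polyval(coeffs, t[test_idx])
        errors.append(np.mean((x[test_idx] - pred) ** 2))
    print(f"degree {degree}: mean CV MSE {np.mean(errors):.2f}")
```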
