Is this a good forecasting model?

  • Thread starter: musicgold
  • Tags: Model
musicgold
Hi,

Please see the attached Excel file.

I have a sample of 70 data pairs. The correlation between X and Y is -0.68. The OLS regression coefficient is statistically significant, as shown in the file. However, with an R^2 of 0.40, I am not sure whether my model is good enough to forecast Y.
Can you please take a look?

Thanks.
 

Thanks.

If you simply want to know whether this is sufficient to imply that X and Y are not uncorrelated then you should do a t-test:

As shown in the Excel file, the regression analysis gets a very low p-value for the coefficient, so I know they are related (or not independent).

Separately, I also calculated the t-statistic of the correlation coefficient, which was 6.7, i.e. there is a very low chance that the sample correlation value occurred randomly. So I am quite confident that there is a statistically significant relationship.
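For reference, the t-statistic of a correlation coefficient can be recomputed from r^2 and the sample size alone, using t = r * sqrt(n - 2) / sqrt(1 - r^2). The sketch below is only a sanity check: it assumes the R^2 = 0.40 and n = 70 quoted in the first post, and the function name is made up for illustration. It returns the magnitude of t; the sign follows the sign of r, which is negative here.

```python
import math


def t_from_r_squared(r_squared: float, n: int) -> float:
    """Magnitude of the t-statistic for a correlation, given r^2 and n pairs."""
    # |t| = |r| * sqrt(n - 2) / sqrt(1 - r^2), written in terms of r^2
    return math.sqrt(r_squared * (n - 2) / (1.0 - r_squared))


# Assumed inputs, taken from the first post: R^2 = 0.40, n = 70
t_stat = t_from_r_squared(0.40, 70)
print(round(t_stat, 2))  # about 6.73, in line with the 6.7 quoted above
```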

What is making me nervous is the relatively low value of R^2 of the regression line. I am not sure how confident I should be about the predictions by this model.
 
musicgold said:
What is making me nervous is the relatively low value of R^2 of the regression line. I am not sure how confident I should be about the predictions by this model.
As you can see in the graph, knowing the x-value won't give a reliable prediction for y. It is better than not knowing the x-value (that's what the non-zero correlation tells you), but the spread of the datapoints is quite large.
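One rough way to put a number on "better than not knowing x, but not by much": in a simple linear regression the residual spread is about sqrt(1 - R^2) times the spread of y itself. The sketch below assumes the R^2 = 0.40 and n = 70 from the first post; the variable names are illustrative only.

```python
import math

r_squared = 0.40  # assumed, from the first post
n = 70            # assumed, from the first post

# Fraction of the standard deviation of y that remains after using x
residual_fraction = math.sqrt(1.0 - r_squared)

# Same ratio with the usual degrees-of-freedom corrections:
# residual variance uses n - 2, sample variance of y uses n - 1
residual_fraction_adj = math.sqrt((1.0 - r_squared) * (n - 1) / (n - 2))

print(round(residual_fraction, 3), round(residual_fraction_adj, 3))
```

Under these assumptions the typical prediction error is still roughly three quarters of what you would get by always guessing the mean of y.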
 
If you want to determine the predictive value of your model, set aside a portion of your data to use for validation (or collect new measurements and use them for validation). I agree with mfb that the model probably won't have a great deal of predictive value.
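A minimal sketch of such a hold-out check follows. It uses made-up synthetic data as a stand-in for the Excel file, which is not reproduced here, and an arbitrary split of 56 training and 14 validation points chosen at random. The helper names `fit_ols` and `rmse` are hypothetical, not from any library.

```python
import random


def fit_ols(xs, ys):
    """Least-squares slope and intercept for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def rmse(xs, ys, slope, intercept):
    """Root-mean-square error of the line's predictions on (xs, ys)."""
    errors = [(y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys)]
    return (sum(errors) / len(errors)) ** 0.5


rng = random.Random(0)  # fixed seed so the split is reproducible

# Synthetic stand-in data (an assumption, NOT the data from the thread)
xs = [rng.uniform(0, 10) for _ in range(70)]
ys = [5.0 - 0.5 * x + rng.gauss(0, 1.5) for x in xs]

# Random split: shuffle the indices, hold out 14 points for validation
indices = list(range(len(xs)))
rng.shuffle(indices)
n_test = 14
test_idx, train_idx = indices[:n_test], indices[n_test:]

train_x = [xs[i] for i in train_idx]
train_y = [ys[i] for i in train_idx]
test_x = [xs[i] for i in test_idx]
test_y = [ys[i] for i in test_idx]

# Fit on the training points only, then score on the held-out points
slope, intercept = fit_ols(train_x, train_y)
test_rmse = rmse(test_x, test_y, slope, intercept)

# Baseline: ignore x and always predict the training mean of y
baseline_rmse = rmse(test_x, test_y, 0.0, sum(train_y) / len(train_y))

print(f"held-out RMSE: {test_rmse:.2f}, mean-only baseline: {baseline_rmse:.2f}")
```

Comparing the held-out error with the mean-only baseline shows how much x actually helps on data the fit has not seen. With so few points, the result of a single split can vary noticeably with the random seed.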

the regression analysis gets a very low p-value for the coefficient, so I know they are related (or not independent).

Separately, I also calculated the t-statistic of the correlation coefficient, which was 6.7, i.e. there is a very low chance that the sample correlation value occurred randomly.

You should be aware that the p- and t-values don't really allow you to say any of those things. Null hypothesis testing (especially the p-value) is very commonly misinterpreted; the Wikipedia article contains a list of common misconceptions that you may want to read.
 
mfb said:
the spread of the datapoints is quite large.
What do you mean by this?
 
musicgold said:
What do you mean by this?
See the highlighted areas in the attached image - very similar x-values (within each of them), but a large variation in y. Your prediction can be something like "it is probably within that y-range", but not better than that.

[Attached image: reg.png]

Got it. Thanks.
 
Number Nine said:
If you want to determine the predictive value of your model, set aside a portion of your data to use for validation (or collect new measurements and use them for validation).
So should I go back, take only 60 of the 74 points and run the regression analysis again and see how the new model predicts the Y values for the remaining 14 X-values?

If so, how should I select the 60 points? Randomly?

Thanks.
 