Is this a good forecasting model?

  • Context: Undergrad
  • Thread starter: musicgold
  • Tags: Model

Discussion Overview

The discussion revolves around the evaluation of a forecasting model based on a dataset of 70 data pairs, focusing on the correlation and regression analysis between variables X and Y. Participants explore the implications of the statistical results, particularly the significance of the regression coefficient and the low R² value, in the context of predictive modeling.

Discussion Character

  • Exploratory
  • Technical explanation
  • Debate/contested
  • Mathematical reasoning

Main Points Raised

  • One participant expresses concern about the model's ability to forecast Y due to a low R² value of 0.40, despite a statistically significant regression coefficient.
  • Another participant questions the meaning of "good enough to forecast Y," suggesting that the adequacy of the model depends on the specific forecasting goals.
  • Some participants highlight the importance of conducting a t-test to assess the relationship between X and Y, noting a low p-value indicating statistical significance.
  • Concerns are raised about the large spread of data points in the graph, suggesting that while there is a correlation, predictions may not be reliable.
  • One participant recommends validating the model by setting aside a portion of the data for testing, indicating that the current model may not have strong predictive value.
  • A participant seeks clarification on the implications of the spread of data points, leading to a discussion about the variability in Y for similar X-values.
  • Another participant inquires about the methodology for selecting data points for validation, considering whether to randomly choose a subset of the data.

Areas of Agreement / Disagreement

Participants generally agree on the statistical significance of the relationship between X and Y, but there is no consensus on the model's predictive value or the best approach for validation. Multiple competing views on the adequacy of the model remain unresolved.

Contextual Notes

Limitations include the reliance on a single dataset for analysis, the potential misinterpretation of p-values and t-statistics, and the unresolved nature of how to effectively validate the model.

musicgold
Hi,

Please see the attached Excel file.

I have a sample of 70 data pairs. The correlation between X and Y is -0.68. The OLS regression coefficient is statistically significant, as shown in the file. However, with an R^2 of 0.40, I am not sure whether my model would be good enough to forecast Y.
Can you please take a look?

Thanks.
 

Thanks.

If you simply want to know whether this is sufficient to imply that X and Y are not uncorrelated, then you should do a t-test:

As shown in the Excel file, the regression analysis gets a very low p-value for the coefficient, so I know they are related (or not independent).

Separately, I also calculated the t-statistic of the correlation coefficient, which was 6.7, i.e. there is a very low chance that the sample correlation value occurred randomly. So I am quite confident that there is a statistically significant relationship.

What is making me nervous is the relatively low value of R^2 of the regression line. I am not sure how confident I should be about the predictions by this model.
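For readers following along, the t-statistic of a sample correlation coefficient mentioned above is usually computed as t = r·sqrt(n − 2) / sqrt(1 − r²), with n − 2 degrees of freedom. The sketch below is only an illustration: the function name is made up here, and the inputs are the r and n quoted in the opening post. The spreadsheet itself is not available, so the result need not match the 6.7 reported in the thread.

```python
import math


def correlation_t_statistic(r: float, n: int) -> float:
    """t-statistic for testing H0: population correlation = 0.

    Uses t = r * sqrt(n - 2) / sqrt(1 - r^2), which has n - 2
    degrees of freedom under the usual normality assumptions.
    """
    if n <= 2:
        raise ValueError("need more than two data pairs")
    if not -1.0 < r < 1.0:
        raise ValueError("r must lie strictly between -1 and 1")
    return r * math.sqrt(n - 2) / math.sqrt(1.0 - r * r)


# Illustrative inputs taken from the opening post (r = -0.68, n = 70).
t = correlation_t_statistic(-0.68, 70)
print(f"t = {t:.2f} with {70 - 2} degrees of freedom")
```

A large |t| only says that a zero correlation is hard to reconcile with the sample; it says nothing by itself about how tight individual predictions of Y will be.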
 
musicgold said:
What is making me nervous is the relatively low value of R^2 of the regression line. I am not sure how confident I should be about the predictions by this model.
As you can see in the graph, knowing the x-value won't give a reliable prediction for y. It is better than not knowing the x-value (that's what the non-zero correlation tells you), but the spread of the datapoints is quite large.
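One rough way to put a number on "better than not knowing the x-value, but the spread is large": in a simple linear regression, the residual standard deviation is approximately sd(Y)·sqrt(1 − R²), ignoring the small degrees-of-freedom correction. The helper name below is hypothetical and the input is the R^2 quoted in the thread.

```python
import math


def residual_sd_fraction(r_squared: float) -> float:
    """Fraction of Y's standard deviation that remains in the residuals.

    Approximation: sd(residuals) / sd(Y) = sqrt(1 - R^2); the small
    degrees-of-freedom correction is ignored.
    """
    if not 0.0 <= r_squared <= 1.0:
        raise ValueError("R^2 must be between 0 and 1")
    return math.sqrt(1.0 - r_squared)


# R^2 = 0.40 is the value quoted in the opening post.
frac = residual_sd_fraction(0.40)
print(f"Residual spread is about {frac:.0%} of the original spread in Y")
```

Under this approximation, an R^2 of 0.40 leaves roughly three quarters of the original spread in Y unexplained, which is consistent with the wide scatter visible in the graph.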
 
If you want to determine the predictive value of your model, set aside a portion of your data to use for validation (or collect new measurements and use them for validation). I agree with mfb that the model probably won't have a great deal of predictive value.

musicgold said:
the regression analysis gets a very low p-value for the coefficient, so I know they are related (or not independent).

Separately, I also calculated the t-statistic of the correlation coefficient, which was 6.7, i.e. there is a very low chance that the sample correlation value occurred randomly.

You should be aware that the p- and t-values don't really allow you to say any of those things. Null hypothesis testing (especially the p-value) is very commonly misinterpreted; the Wikipedia article contains a list of common misconceptions that you may want to read.
 
mfb said:
the spread of the datapoints is quite large.
What do you mean by this?
 
musicgold said:
What do you mean by this?
See the highlighted areas in the attached image - very similar x-values (within each of them), but a large variation in y. Your prediction can be something like "it is probably within that y-range", but not better than that.

[Attached image: reg.png]
 

Got it. Thanks.
 
Number Nine said:
If you want to determine the predictive value of your model, set aside a portion of your data to use for validation (or collect new measurements and use them for validation).
So should I go back, take only 60 of the 74 points and run the regression analysis again and see how the new model predicts the Y values for the remaining 14 X-values?

If yes, how should I go about selecting the 60 points, randomly?

Thanks.
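A minimal sketch of the holdout idea discussed above, assuming a random split is acceptable (that is, the data pairs have no time ordering or grouping that a random split would break). The data here are synthetic stand-ins generated in the code, not the values from the Excel file, and all function names are made up for illustration. The split sizes (70 pairs, 14 held out) are likewise just an example.

```python
import math
import random


def fit_ols(xs, ys):
    """Least-squares intercept and slope for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def rmse(actual, predicted):
    """Root-mean-square error between two equal-length sequences."""
    return math.sqrt(
        sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)
    )


def holdout_split(pairs, n_test, seed):
    """Shuffle a copy of the data and hold out the first n_test pairs."""
    rng = random.Random(seed)
    shuffled = list(pairs)
    rng.shuffle(shuffled)
    return shuffled[n_test:], shuffled[:n_test]


# Synthetic stand-in data (an assumption, NOT the thread's spreadsheet):
# a negative linear trend plus a fair amount of noise.
data_rng = random.Random(0)
pairs = []
for _ in range(70):
    x = data_rng.uniform(0.0, 10.0)
    y = 10.0 - 0.5 * x + data_rng.gauss(0.0, 2.0)
    pairs.append((x, y))

train, test = holdout_split(pairs, n_test=14, seed=1)
train_x, train_y = zip(*train)
test_x, test_y = zip(*test)

# Fit on the training pairs only, then predict the held-out Y values.
intercept, slope = fit_ols(train_x, train_y)
model_pred = [intercept + slope * x for x in test_x]

# Baseline: ignore X and always predict the training mean of Y.
mean_train_y = sum(train_y) / len(train_y)
baseline_pred = [mean_train_y] * len(test_y)

model_rmse = rmse(test_y, model_pred)
baseline_rmse = rmse(test_y, baseline_pred)
print(f"held-out RMSE, regression:  {model_rmse:.2f}")
print(f"held-out RMSE, mean only:   {baseline_rmse:.2f}")
```

Comparing the regression against the mean-only baseline on the held-out points shows how much knowing X actually helps. With only 14 test points a single split is noisy, so repeating the split with several seeds (or using k-fold cross-validation) gives a steadier picture.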
 
