How to propagate errors through a regression and non-linear model?

Master1022
TL;DR Summary
How to calculate uncertainty bounds for the output of a linear regression model
Hi,

I was working on a predictive linear regression model and was hoping to obtain some bounds to represent the uncertainty present in the model.

Question:
I suppose this boils down into two separate components:
1. What is a good measure of uncertainty from a linear regression model? MSE, or perhaps another metric?
2. How can I propagate that metric through a non-linear function?

Context: I have used a certain dataset in 3 different linear regression models to predict variables ## x_1 ##, ## x_2 ##, and ## x_3 ##. I know the mean squared errors for those predictions - each of the predictions uses the same input variables, but the regression weights are different. Then I am calculating ## y = f(x_1, x_2, x_3) ##, where ## f ## is a non-linear function (not complicated, but there are products ## x_1 \cdot x_2 ## and ## x_1 \cdot x_3 ##). How can I calculate a metric that measures 'uncertainty' such that I can give the output of the model as ## y \pm \Delta y ##?

- ## y ## is being forecast into the future, so I do not have access to data to compare it against and calculate an MSE/metric

Thanks in advance for any help.
 
Master1022 said:
How can I calculate a metric that measures 'uncertainty' such that I can give the output of the model as ## y \pm \Delta y ##?

Suppose we take the simple interpretation that the "uncertainty" of ##y## will be ##\sigma_y## , the standard deviation of ##y## when ##y## is considered as a random variable.

If you have a specific probability model, where each random variable involved has (or is assumed to have) a distribution with known parameters then we can discuss calculating the standard deviation of ##y##.

However, if we only assume the random variables involved are from a general family of distributions (for example, if some have unknown means and variances), then saying we will calculate the standard deviation of ##y## is misleading. A better choice of words is to say that we will estimate the standard deviation of ##y##. The thing we can calculate is an estimator of the standard deviation of ##y##.

What "uncertainty" means in the latter situation is somewhat compliated because we can make an estimate of ##\sigma_y## as a function ##\hat{\sigma}_y## of the data, but in common language terms, there is some uncertainty in our estimate.

Which of the two situations applies to your problem?
 
Master1022 said:
I have used a certain dataset in 3 different linear regression models to predict variables x1, x2, and x3. I know the mean squared errors for those predictions - each of the predictions uses the same input variables, but the regression weights are different.
It would seem unlikely to me that these errors are uncorrelated. Can you explain what you did ?
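As a sketch of what checking that would look like (the residual series below are made-up, generated so they share a common driver; in practice they would be the residuals saved from the three fitted regressions), the sample correlation matrix of the residuals reveals whether the errors are in fact correlated:

```python
import numpy as np

# Hypothetical residual series standing in for the residuals of the
# three regressions; a shared driver makes them correlated by design.
rng = np.random.default_rng(1)
common = rng.normal(size=10_000)
res_a = common + 0.5 * rng.normal(size=10_000)
res_b = common + 0.5 * rng.normal(size=10_000)
res_c = 0.3 * common + 0.5 * rng.normal(size=10_000)

# Rows are variables, columns are observations -> 3x3 correlation matrix.
corr = np.corrcoef([res_a, res_b, res_c])
print(np.round(corr, 2))
```

If the off-diagonal entries are far from zero, treating the three errors as independent will misstate the uncertainty of anything computed from ##a##, ##b##, and ##c## together.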
 
Stephen Tashi said:
Suppose we take the simple interpretation that the "uncertainty" of ##y## will be ##\sigma_y## , the standard deviation of ##y## when ##y## is considered as a random variable.

If you have a specific probability model, where each random variable involved has (or is assumed to have) a distribution with known parameters then we can discuss calculating the standard deviation of ##y##.

However, if we only assume the random variables involved are from a general family of distributions (for example, if some have unknown means and variances), then saying we will calculate the standard deviation of ##y## is misleading. A better choice of words is to say that we will estimate the standard deviation of ##y##. The thing we can calculate is an estimator of the standard deviation of ##y##.

What "uncertainty" means in the latter situation is somewhat compliated because we can make an estimate of ##\sigma_y## as a function ##\hat{\sigma}_y## of the data, but in common language terms, there is some uncertainty in our estimate.

Which of the two situations applies to your problem?

Thanks for your response. So my situation is the latter, so it looks like an estimate is what we are aiming for to measure the uncertainty. How would I go about propagating that through a non-linear function?
 
BvU said:
It would seem unlikely to me that these errors are uncorrelated. Can you explain what you did ?
Thanks for your response @BvU ! So I basically used all three variables ## x_1 ##, ## x_2 ##, ## x_3 ## to calculate three predicted variables ## a ##, ## b ##, ## c ##. Then the output ## y ## had a form that can be condensed to:
## y = c \cdot (b - a) ##
You are right that the errors for ## a ##, ##b##, and ##c## are likely not independent as they all were predictions using the same three variables (## x_1 ##, ## x_2 ##, ## x_3 ##). What is the best way to deal with such a situation in order to get an error estimate for ##y##?
 
Master1022 said:
I basically used all three variables ## x_1 ##, ## x_2 ##, ## x_3 ## to calculate three predicted variables ## a ##, ## b ##, ## c ##. Then the output ## y ## had a form that can be condensed to:
## y = c \cdot (b - a) ##
You are right that the errors for ## a ##, ##b##, and ##c## are likely not independent as they all were predictions using the same three variables (## x_1 ##, ## x_2 ##, ## x_3 ##). What is the best way to deal with such a situation in order to get an error estimate for ##y##?
From what you 'did basically' I can't follow what you did, so all I can give is general advice: find out the correlation matrix and use it to propagate the errors.
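A minimal sketch of that propagation for ##y = c \cdot (b - a)##, assuming point estimates and a 3x3 residual covariance matrix for ##(a, b, c)## are available (the numbers below are hypothetical), is the first-order formula ##\sigma_y^2 \approx \nabla f^T \, \Sigma \, \nabla f##:

```python
import numpy as np

# Hypothetical point estimates and covariance matrix for (a, b, c).
a, b, c = 2.0, 5.0, 1.5
cov = np.array([[0.04, 0.01, 0.00],
                [0.01, 0.09, 0.02],
                [0.00, 0.02, 0.01]])

# y = c * (b - a); gradient with respect to (a, b, c).
grad = np.array([-c, c, b - a])

y = c * (b - a)
var_y = grad @ cov @ grad      # first-order (delta-method) variance
sigma_y = np.sqrt(var_y)
print(f"y = {y:.3f} +/- {sigma_y:.3f}")
```

The off-diagonal covariance entries are exactly where the correlation between the errors of ##a##, ##b##, and ##c## enters; zeroing them out would silently assume independent errors.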
 
BvU said:
From what you 'did basically' I can't follow what you did, so all I can give is general advice: find out the correlation matrix and use it to propagate the errors.
Apologies, which part wasn't clear? I can try to explain further.

The same three time series ## x_1 ##, ## x_2 ##, and ## x_3 ## were used as inputs to three different linear regression models. The outputs of these models were ## a ##, ## b ##, and ## c ##. Then these constants were combined in a formula which was of the form ## c \cdot (b - a) ##. Which part was unclear?
 
Master1022 said:
How would I go about propagating that through a non-linear function?

"Propagating error through a function ##f(x_1,x_2,x_3)##" is usually defined to mean estimating the standard deviation of the random variable ##y = f(x_1,x_2,x_3)##. Although estimating a standard deviation is a different concept that computing the standard deviation from known parameters of distributions, what is usually done is to make a lot of assumptions that justify using the sample values of parameters (such as mean, variance, covariance, etc.) as if they were they were the true values of the parameters. So, we end up in case 1 of post #2, even if are really in case 2 !

Proceeding as if we know the true values of all parameters involved, write a Taylor series (multinomial) approximation of ##f()## expanded about the point ##(\overline{x_1}, \overline{x_2}, \overline{x_3})##, i.e. in powers of the deviations ##(x_k - \overline{x_k})##, where ##\overline{x_k}## is the mean of ##x_k##. (Use the values of the sample means as the values of the actual means.)

Truncate the expansion. Then compute the standard deviation of ##y## by doing the appropriate integration of the multinomial approximation.

Of course, doing the integration can be complicated, but it's possible in principle since the calculations only involve computing "moments" of multinomial functions. For example, to compute ##\overline{y}## we might have to find the expected value of a term like ##\frac{\partial^2 f}{ \partial {x_1}^2} \frac{\partial f}{\partial x_2} (x_1 - \overline{x_1})^2 ( x_2 - \overline{x_2}) ##. The values of the partial derivatives are known because we are evaluating them at ##(\overline{x_1}, \overline{x_2},\overline{x_3})##. For the expected value of ##( x_1- \overline{x_1})^2 (x_2 - \overline{x_2} )##, we use the sample mean of the quantity ##(x_1- \overline{x_1})^2 (x_2 - \overline{x_2})##.

That's an outline of common practice. The mathematics of how well this way of estimating ##\sigma_y## works is a different matter.

For typical distributions, using sample values to estimate "higher moments" like ##\overline{ {x_1}^2 x_2 {x_3}^2 }## performs worse (in an appropriate technical sense) than using sample values to estimate lower moments like ##\overline{x_1}## or ##\overline{x_1 x_2}##. So including a lot of terms in the Taylor series does not necessarily make the estimate of ##\sigma_y## more reliable. The more terms you include, the more higher moments are involved, so the assumption that we can use sample moments as the actual higher moments becomes questionable.
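An alternative that sidesteps the truncation question entirely is a Monte Carlo sanity check: draw ##(a, b, c)## from a multivariate normal built from the sample means and covariance (hypothetical numbers below), push every draw through the model, and read off the sample spread of ##y##:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sample means and covariance matrix for (a, b, c).
mean = np.array([2.0, 5.0, 1.5])
cov = np.array([[0.04, 0.01, 0.00],
                [0.01, 0.09, 0.02],
                [0.00, 0.02, 0.01]])

# Draw many correlated (a, b, c) triples and evaluate y = c * (b - a).
draws = rng.multivariate_normal(mean, cov, size=200_000)
a, b, c = draws[:, 0], draws[:, 1], draws[:, 2]
y = c * (b - a)

sigma_y_mc = y.std(ddof=1)
print(f"sigma_y (Monte Carlo) ~ {sigma_y_mc:.3f}")
```

Because the samples go through the full non-linear ##f## rather than a truncated expansion, comparing this spread against the Taylor-based estimate shows how much the neglected higher-order terms matter; the normality of the inputs is, of course, itself an assumption.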
 
The "goodness of fit" of a regression is usually measured by the "coefficient of correlation". This coefficient is defined as r^{2}=\frac{\sum (Y_{est}-\bar{Y})}{\sum(Y-\bar{Y})} where Y denotes the observed valued, Yest are the values you get from your regression model and \bar{Y} is the mean value of the Ys. r2 varies between 0 and 1, where 1 denotes a perfect correlation and 0 denotes no correlation.
 