Multivariate Regressions?

In summary, a regression can be done between two measured datasets, provided the regressor and the error term are not interdependent; if they are (as happens when both datasets carry measurement error), the estimates may be biased.
  • #1
mathias1979
I have temperature data from two different sites, and I want to develop a relationship for temperature between those two sites. That way, if I have the temperature at just one of the sites, I can approximate the temperature at the other. My question is whether or not a multivariate regression is appropriate here, since each site has the same kinds of errors associated with it, and neither site can necessarily be considered the "true" site; they are both independent variables. Can a multivariate regression be done with only two independent variables, or do I need at least one dependent variable? Is the difference between a multivariate regression and a typical linear regression likely to be significant? And how would I go about doing a multivariate regression; would I need special software? I'm not terribly familiar with statistical methods, so please dumb down responses as much as possible. Thanks for any help!

Matt
 
  • #2
First of all, what you are saying is that both sites have random errors associated with them. In the classical regression model y = b x + u, variable x is nonrandom; variables y and u are random. One way to simplify the terminology is to say y is endogenous and x is exogenous. In your case there are two y's, one for each site. For your purposes, a simple correlation coefficient (between the two sites) may be sufficient (in any case you should start by measuring the simple correlation between the two sites, to see if there is any correlation to begin with).
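If it helps, here is a minimal sketch of that first check in Python with numpy (the arrays below are placeholders for your two measured series):

Code (Python):
import numpy as np

# Placeholder values; substitute the measured temperatures from the two sites.
z1 = np.array([12.1, 15.3, 9.8, 21.0, 18.4, 7.6])
z2 = np.array([11.7, 14.9, 10.2, 20.1, 17.8, 8.3])

# Simple (Pearson) correlation coefficient between the two sites.
r = np.corrcoef(z1, z2)[0, 1]
print("correlation:", round(r, 3))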

If you must apply a regression model, e.g. y1 = a + b y2 + u, then you must consider the measured temperatures z1 and z2 having random errors v1 and v2 associated with each: z1 = y1 + v1, z2 = y2 + v2, where E[v1] = E[v2] = Cov[y1,v1] = Cov[y2,v2] = 0. (For example, z's are the measured temps, y's are the true temps.)

The reg. model y1 = a + b y2 + u can be written as
z1 - v1 = a + b(z2 - v2) + u
or
z1 = a + b z2 + w where w = v1 - b v2 + u.

Proposition 1: As long as z2 is independent from w, the OLS estimators of a and b are unbiased and efficient (minimum variance).

Proposition 2: If z2 and w are interdependent, OLS estimators may be biased and inefficient (at least in small samples). However, if z2 and w are contemporaneously uncorrelated, then the OLS estimators of a and b are "unbiased and efficient" asymptotically (when sample size ---> infinity).

Proposition 3: If z2 is contemporaneously correlated with w, the OLS estimators of a and b are biased, even asymptotically. (There are various techniques that get around this problem, such as the instrumental variables regression technique.)

Kmenta in Chap. 9 of Elements of Econometrics (1986) shows that in cases of measurement errors (errors-in-variables), Prop. 3 applies (i.e. Cov[z2,w] =/= 0).

A shortcut is to say "I know z2 is stochastic, so I will interpret the estimated a and b parameters as conditional on the sample values of z2 that were used to estimate the regression." Which means that you can run z1 = a + b z2 + u and not worry about random errors; you can use the estimates of a and b "with confidence" as long as the distribution of observed z2 at the second site is similar to the sample you will have used to estimate the a and the b. If and when the two distributions differ, you can update the regression with new data to obtain "better" estimates.
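To make Proposition 3 concrete, here is a minimal simulation sketch in Python (all parameter values are made up): even with a very large sample, the OLS slope of z1 on z2 comes out below the true b, because z2 is correlated with w = v1 - b v2 + u.

Code (Python):
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# True relationship y1 = a + b*y2 + u, with made-up parameters.
a, b = 2.0, 0.9
y2 = rng.normal(20.0, 5.0, n)       # "true" temperature at site 2
u = rng.normal(0.0, 1.0, n)
y1 = a + b * y2 + u                 # "true" temperature at site 1

# Measured temps carry independent errors: z = y + v, E[v] = Cov[y,v] = 0.
z1 = y1 + rng.normal(0.0, 2.0, n)
z2 = y2 + rng.normal(0.0, 2.0, n)

# OLS slope of z1 on z2 is Cov(z1, z2) / Var(z2).
b_ols = np.cov(z1, z2)[0, 1] / np.var(z2, ddof=1)
print("true b:", b, " OLS estimate:", round(b_ols, 3))  # biased toward zero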
 
  • #3
The "golden key" will be to fit a separate model to each of your measurements:
z1 = a1 + b11 x1 + b12 x2 + ... + v1
z2 = a2 + b21 x1 + b22 x2 + ... + v2
so you can predict either one without taking any temperature measurement at all! The x's (a.k.a. the instruments) are exogenous and are measured exactly.
 
  • #4
Thanks for the in-depth replies! I may have more questions once I dive back into this problem; for now I have other, more pressing issues to take care of at work.
 
  • #5
As promised, I am back...

First off, just FYI: the correlation between the two datasets I have is 0.96.

I have been trying to digest this a bit more today, and have actually been mulling over a document from a company that has done what I am trying to do...their description arrives at just about the same equation as you suggest. In their form:

y = mx + c + (ε(y) - mε(x))

If I adapt this to your notation, I assume it is the same as

z1 = a + b z2 + w, where w = v1 - b v2 + u

They lack the 'u' in your 'w' equation...what is this 'u'?

Reading through the document I have, I believe I can derive 'a' and 'b'...but I am still a bit uncertain on how to derive 'w'. If I don't know the "true" temperature readings, I don't know what "v" is (or ε in their terms). The document says to make the assumption that the uncertainties are normally distributed with a mean of 0, so that εx ~ N(0,σx) and εy ~ N(0,σy). I'm not sure exactly what to do with that, though. Since I lack any statistical software, I'm trying to do this in Excel (thankfully the dataset is pretty small, so I hope this shouldn't be a problem...I was able to do a standard linear regression manually without any trouble). If I use the NORMDIST function, it also asks for an "x" value in addition to the mean and standard deviation, so I'm not sure what I would use there.

Can you expand at all on how to go about estimating the uncertainty term (or your "w") using the assumptions they suggest?
 
  • #6
The Co. document combines the v1 and u and calls it ε(y), which is okay. You don't need to derive w; the regression will do this for you. That's the residual (error) term, and you can have this series printed or plotted in Excel if you are using the Data Analysis menu item under Tools (this needs the Analysis ToolPak add-in installed). Since Ordinary Least Squares (the standard regression method) assumes the X is nonrandom (exogenous), it will force the residual (error) series to have zero correlation with your "x" (you can verify this). But if your "actual x" is random and correlated with the error term (as in my z2), it will mean that the OLS is making the wrong assumption! So, the statistics that the OLS produces will not be exactly right!
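The zero-correlation property is easy to verify numerically; here is a minimal sketch in Python with made-up data:

Code (Python):
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=200)                      # made-up regressor
y = 1.5 + 0.8 * x + rng.normal(size=200)      # made-up response

# Fit y = a + b*x by OLS and form the residual series.
X = np.column_stack([np.ones_like(x), x])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ coef

# OLS forces the residuals to be uncorrelated with x (up to rounding error).
print("Corr(x, residuals):", np.corrcoef(x, resid)[0, 1])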

Suppose you could measure the random error for the right-hand-side variable (the "x" in your model) by fitting the (first-stage) equation z2 = a + b1 x1 + b2 x2 + ... + v2 (for site 2), where the x's are exogenous (nonrandom) and exactly measured instrumental variables (e.g. altitude, latitude & longitude, day of the month, season of the year, the year itself, etc.) Then you could estimate each of the a and b parameters, and calculate "y2 = a + b1 x1 + b2 x2 + ... " ( = z2 - v2, which is y2) as your "true temperature" in site 2. If you can do this practically, and the first-stage regression turns out to be meaningful, then you can estimate the second-stage regression z1 = c + m y2 + e(z1). In this case, you will have eliminated the source of error in your "x" by replacing the random z2 with a nonrandom y2 = z2 - v2. The v2 is the residual error term that the first-stage regression will print out for you, and you have the z2 (measured temp. at site #2). Their difference is the "true" (nonrandom) temp. at the site.
 
  • #7
I think most of what you are saying makes sense, but it is still not clear to me (remember, very little statistical background) what exactly I need to do to perform the multivariate regression. I understand the OLR simply because I have a stats book that writes out the equations for the slope and y-intercept of my best-fit line. I don't have that for the multivariate case, just what you're saying and what is written in the Co. document, neither of which spells out as clearly what I need to do in order to get my slope and y-intercept. The Co. document has this:

m = m(x)/m(y) ; c = c(y)-mc(x)

Does this mean I do an OLR twice, with the variables I am treating as x and y switched, so that I get an m(x) and an m(y)? Or am I way off base here?

I understand the math to get to the final multivariate regression model...but I don't fully understand how to actually fit my data to that model.
 
  • #8
Simple regression means regression with a single right-hand side variable, as in y = a + b x + u. Multivariate regression means regression with multiple right-hand side variables, as in y = a + b1 x1 + ... + bn xn + u. This distinction is not related to your problem (the right-hand side variable being endogenous) and it will not solve your problem.

You can read more on the subject by clicking on the following links: http://en.wikipedia.org/wiki/Linear_regression
http://en.wikipedia.org/wiki/Regression_analysis
When you're reading either article, you can click on any of the embedded links (in the text of the article) to jump to related subjects.

Let's take it one step at a time:

1. Find a set of non-random variables that you think would correlate with the temp. at site #2, and match these with the measurements at site #2. Knowing very little about the subject, I can think of the following variables: time of the day of the measurement, day of the year of the measurement, the season of the measurement, the year of the measurement, altitude of the measurement. Each of the variables should have more than one value, that is, it cannot be constant across all measurements. (For example, if all measurements are taken at an identical altitude then "altitude" may not be in the list.) You may have a longer and/or a better list. Do you understand/can you do this?

2. Regress the measured temperature at site #2 on those variables. Do you understand/can you do this?

3. Save the "predicted values" from that regression. (Alternatively, save the residuals, then take the difference "measured temp. - residual" for each observation, which will give you the predicted temp.) Do you understand/can you do this?

4. Regress the measured temp. at site #1 on the predicted values that you obtained in the previous step. Do you understand/can you do this?

Please go over the steps above. Let me know any point on which you need specific explanations. The narrower and more specific your questions, the better I can be of help.
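For concreteness, here is a minimal sketch of steps 1-4 in Python; the instruments and every number are made up purely for illustration:

Code (Python):
import numpy as np

# Step 1: made-up exogenous variables matched to the site-2 measurements.
day_of_year = np.array([10., 45., 80., 120., 200., 300., 330.])
hour_of_day = np.array([6., 12., 15., 9., 14., 12., 7.])
z2 = np.array([4.0, 10.5, 14.8, 16.2, 22.9, 9.4, 3.1])  # measured, site 2
z1 = np.array([5.2, 9.8, 14.1, 17.0, 21.5, 8.9, 4.0])   # measured, site 1

# Step 2: regress the measured site-2 temp on those variables (w/ intercept).
X = np.column_stack([np.ones_like(z2), day_of_year, hour_of_day])
coef, *_ = np.linalg.lstsq(X, z2, rcond=None)

# Step 3: save the predicted values (= measured temp minus residual).
z2_pred = X @ coef

# Step 4: regress the measured site-1 temp on the predicted site-2 values.
X2 = np.column_stack([np.ones_like(z2_pred), z2_pred])
(c, m), *_ = np.linalg.lstsq(X2, z1, rcond=None)
print("intercept:", round(c, 3), " slope:", round(m, 3))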
 
  • #9
Alright, I think I'm in over my head here. I've gone ahead and attached the PDF describing the process I am trying to repeat; it does not seem to require deriving any relationship between the uncertainties and the variable of interest. I was actually overlooking the bulk of the process. I don't think this is exactly what you are describing. I was hoping it was somewhat similar to doing a linear regression, but it actually looks above and beyond my knowledge/capabilities. The good news is I only have about 20 points for which I would need to derive this; the downside is it looks much more involved than I originally thought. Does it make any sense to you?
 

Attachments

  • Multivariate Regression.pdf
139.1 KB
  • #10
How tightly are you bound by this document when you are solving this problem?

As I look at it, I don't think you can implement this document with a linear regression model; it will have to be a nonlinear model. (That's because alpha is a nonlinear function of m, and each of x1 and y1 is a nonlinear function of alpha.)
 
  • #11
Well, the goal is to try to replicate the analysis presented in the document as closely as possible, to test whether it offers a significant improvement over an OLR.
 
  • #12
I assume you don't have the variable w, or its equivalent in your situation. Is this right?

You will need a nonlinear regression software package to implement the document. Let σx and σy be the standard deviations of the measured x and the measured y respectively. Then define α = arctan (m σx^2/σy^2). The document seems to imply that you can use measured sample variances (say, sx and sy) to estimate σx and σy. I will assume σx and σy are approximately equal, which you should verify. If so, I can write α = arctan(m).

Then create a new x and a new y variable:

x1 = x cos α + y sin α
y1 = y cos α - x sin α

where x and y denote the measured temps.

The regression equation is y1 = m x1 + c + error, which translates to

y cos(arctan(m)) - x sin(arctan(m)) = m [x cos(arctan(m)) + y sin(arctan(m))] + c + error

which is nonlinear in m. Estimating m requires a nonlinear regression package. (Alternatively you can use an optimization package to determine the value of m from B.7 -- keeping in mind that α is a function of m. I think you can do this in Excel using the Goal Seek menu item under Tools.)

A shortcut would be to estimate y = m x + c + error using the original (measured) temperatures. This will give you a (biased) first estimate of m, which you can then use to calculate the α. Once you have a numerical value for α (say, A), you can create x1 and y1 and run the linear regression:

y cos A - x sin A = m (x cos A + y sin A) + c + error

to determine a new m conditional on α = A. If the new m is not very different from the first m, you can stop. If the new m is very different from the first m, you may have to iterate.
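Here is a minimal sketch of that shortcut in Python (the temperatures are made up; this shows one pass of the procedure, which you would repeat if the two estimates of m disagree):

Code (Python):
import numpy as np

# Made-up measured temperatures; substitute your own series.
x = np.array([11.7, 14.9, 10.2, 20.1, 17.8, 8.3])
y = np.array([12.1, 15.3, 9.8, 21.0, 18.4, 8.9])

def ols(xv, yv):
    """Slope and intercept of yv = m*xv + c by ordinary least squares."""
    X = np.column_stack([xv, np.ones_like(xv)])
    (m, c), *_ = np.linalg.lstsq(X, yv, rcond=None)
    return m, c

# First (biased) estimate of m from the original measurements.
m0, c0 = ols(x, y)

# alpha = arctan(m), assuming sigma_x and sigma_y are roughly equal.
A = np.arctan(m0)
x1 = x * np.cos(A) + y * np.sin(A)
y1 = y * np.cos(A) - x * np.sin(A)

# Rotated regression, conditional on alpha = A.
m1, c1 = ols(x1, y1)
print("first m:", round(m0, 4), " new m:", round(m1, 4))
# If m1 is very different from m0, repeat with A = arctan(m1), or hand the
# whole problem to a nonlinear solver (e.g. Excel's Goal Seek) as noted above.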
 
  • #13
For now I'm going to have to just estimate m; then we'll determine whether this method is honestly worth considering over an OLR, and whether it's worth it to use a nonlinear regression package.

Still have questions, though. I understand getting an estimate for alpha, but I don't understand anything in the document beyond B.8. Isn't B.8 the solution I want? It gives me estimates for m and c using known values (x, y, their means, and alpha). What are y^ and x^? After that, it gets into stuff about t-tests etc...is this necessary? Or am I okay to simply plug the values for c and m that I get from B.8 into the regression equation?

Thanks Enuma, you've been a huge help, and it is very much appreciated!
 
  • #14
mathias1979 said:
Still have questions, though. I understand getting an estimate for alpha, but I don't understand anything in the document beyond B.8. Isn't B.8 the solution I want? It gives me estimates for m and c using known values (x, y, their means, and alpha).
Sure, but the formula treats alpha as a given. How are you estimating the alpha? Don't you at least need an initial value for the alpha to "prime" the formula? Say you arbitrarily start with alpha = 3 degrees. Then you can solve for m (from B.8). But how do you know 3 degrees is the "right" value for alpha? The document claims the right value is alpha = arctan(m), assuming equal variances for x and y. If so, you can invert the formula to tan(alpha) = m, or tan(3°) = m, which implies m ≈ 0.05. So now you have 2 separate estimates of m: one is 0.05, the other is whatever B.8 will produce, and they will not necessarily be identical. In this sense you have an overdetermined system. This calls for a nonlinear solution, or some kind of iterated solution to make the two estimates of m equal (or at least close) to each other.

What are y^ and x^?
These are the predicted values of y and x. If you were running a simple, uncomplicated linear ordinary least squares regression, for example y = c + m x + error, then m^ = (x'x)^(-1) x'y (with x and y taken as deviations from their means), c^ = Mean(y) - (m^) Mean(x), and y^ = c^ + (m^) x. In this simple setup there is no x^, or more precisely x^ is identical to x, because x is assumed to be exogenous (error-free). In your case x has a random error component, so x^ has to be estimated just like y^.
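A tiny numerical sketch of those formulas in Python, with toy numbers:

Code (Python):
import numpy as np

x = np.array([1., 2., 3., 4., 5.])        # toy data
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

xd = x - x.mean()                         # x in deviations from its mean
m_hat = (xd @ y) / (xd @ xd)              # m^ = (x'x)^(-1) x'y, deviation form
c_hat = y.mean() - m_hat * x.mean()       # c^ = Mean(y) - (m^) Mean(x)
y_hat = c_hat + m_hat * x                 # y^ = c^ + (m^) x
print("m^:", round(m_hat, 4), " c^:", round(c_hat, 4))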

After that, it gets into stuff about t-tests etc...is this necessary? Or am I okay to simply plug the values for c and m that I get from B.8 into the regression equation?
B.8 will give you a point estimate for each of m and c, but it will not tell you how "reliable" these estimates are. To see this, you need to construct confidence intervals as in B.16-17. The point of the entire document is to find an ad hoc transformation of the x and of the y so that the regression of the transformed y (that is, y1) on the transformed x (that is, x1) will produce a normally distributed error term, so you may construct the appropriate confidence intervals as in B.16-17. (Unless the regression errors are normally distributed, it is pointless to construct a confidence interval.)

Thanks Enuma, you've been a huge help, and it is very much appreciated!
I am glad; good luck.
 

1. What is multivariate regression?

Multivariate regression is a statistical method used to analyze the relationship between multiple independent variables and a single dependent variable. It is an extension of simple linear regression, which only considers one independent variable.

2. What are the assumptions of multivariate regression?

The assumptions of multivariate regression include linearity, independence of errors, homoscedasticity (equal variance), normality of errors, and absence of multicollinearity (high correlation between independent variables).

3. How is multicollinearity detected in multivariate regression?

Multicollinearity can be detected by examining the correlation matrix of the independent variables and looking for high correlations (above 0.7 or 0.8). Another method is to calculate the Variance Inflation Factor (VIF) for each independent variable, with a VIF above 10 indicating multicollinearity.
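As a minimal sketch, the statsmodels package in Python can compute VIFs directly (the predictors below are simulated, with x2 deliberately made collinear with x1):

Code (Python):
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
x2 = 0.95 * x1 + 0.1 * rng.normal(size=100)   # nearly collinear with x1
x3 = rng.normal(size=100)
X = sm.add_constant(np.column_stack([x1, x2, x3]))

# One VIF per predictor (column 0 is the constant, so skip it).
for i in range(1, X.shape[1]):
    print(f"VIF for x{i}: {variance_inflation_factor(X, i):.1f}")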

4. How do you interpret the results of a multivariate regression?

The results of a multivariate regression will typically include the coefficients, standard errors, p-values, and confidence intervals for each independent variable. The coefficients represent the estimated effect of each independent variable on the dependent variable, while the p-values indicate the significance of these effects (a p-value less than 0.05 is considered significant).
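As a minimal sketch, an OLS fit in Python's statsmodels prints all of these quantities in one table (the data below are simulated):

Code (Python):
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))                  # two independent variables
y = 1.0 + 2.0 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(size=100)

model = sm.OLS(y, sm.add_constant(X)).fit()
print(model.summary())   # coefficients, std errors, p-values, conf. intervals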

5. What are some potential limitations of multivariate regression?

Some potential limitations of multivariate regression include the assumption of linearity, which may not hold in all cases, and the potential for omitted variable bias if important independent variables are not included in the model. Additionally, multicollinearity can affect the accuracy and interpretability of the results, and outliers or influential points can also impact the results.
