Linear regression, error in both variables

Discussion Overview

The discussion revolves around the challenges of performing linear regression when both independent and dependent variables have measurement errors that are not constant. Participants explore methods for estimating the best fit line, including total least squares and weighted least squares, while seeking clarification on the nature of the errors and their implications for analysis.

Discussion Character

  • Exploratory
  • Technical explanation
  • Debate/contested

Main Points Raised

  • One participant describes a dataset with errors in both variables and seeks the best value for the slope (a) and its uncertainty (da) for the linear relationship y = ax + b.
  • Another participant suggests that the variance of the errors may not be constant and questions the definition of "best" in the context of minimizing errors.
  • There is a discussion about the need for a method that averages errors over the entire line, weighting all parts equally, and measuring errors by perpendicular distances from the data points to the line.
  • A participant proposes that the problem may involve a combination of weighted and total/orthogonal least squares.
  • Links to resources, including a lecture on weighted least squares and a Wikipedia article on total least squares, are shared as potentially helpful references.

Areas of Agreement / Disagreement

Participants express varying interpretations of the problem and the methods to address it, indicating that multiple competing views remain. The discussion does not reach a consensus on the best approach or solution.

Contextual Notes

Participants highlight the complexity of the errors involved and the need for further clarification on the nature of the data and the variances associated with the errors. There are unresolved questions regarding the specific characteristics of the errors and their impact on the regression analysis.

fhqwgads2005
Hi y'all, wondering if you could help me with this. I have a data set with a linear relationship between the independent and dependent variables. Both the dependent and independent variables have measurement error, and this error is not constant.

For example,

{x1, x2, x3, x4, x5}
{y1, y2, y3, y4, y5}

{dx1, dx2, dx3, dx4, dx5}
{dy1, dy2, dy3, dy4, dy5}

where one data point would be (x1 ± dx1, y1 ± dy1), and so on.

Assuming the relationship is of the form,

y = ax + b, I need both the best value for a, and its uncertainty, (a ± da).

I've been scouring the internet for more information on total least squares methods, and generalized method of moments, etc. but I can't find something that works for the case where the error in x and y is just some arbitrary value, like in my case.

helpful hints?
 
fhqwgads2005 said:
this error is not constant.

I think what you are trying to say is that the variance of the distribution of the errors is not constant with respect to X and Y.

y = ax + b, I need both the best value for a, and its uncertainty, (a ± da).

You must define what you mean by "best". I'll try to put some words in your mouth.
We want the line y = ax + b that minimizes the expected error between data points and the line, when we average these errors over the whole line between X = (some minimum value of interest) and X = (some maximum value of interest), giving all those parts of the line equal weight in this averaging. The error between a data point (x_i, y_i) and the line will be measured by the perpendicular distance between (x_i, y_i) and the line.
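The criterion above can be sketched directly: the perpendicular distance from a point (x_i, y_i) to the line y = ax + b is |a·x_i − y_i + b| / √(a² + 1), and a crude trial-and-error program just searches (a, b) for the minimum sum of squared distances. The data values below are hypothetical, chosen only so the sketch runs:

```python
import numpy as np

def perp_dist(a, b, x, y):
    """Perpendicular distance from each point (x, y) to the line y = a*x + b."""
    return np.abs(a * x - y + b) / np.sqrt(a**2 + 1)

# Hypothetical data, roughly on y = x
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.0, 2.9, 4.2])

# Crude trial and error: grid search for the (a, b) minimizing the
# sum of squared perpendicular distances
best_cost, best_a, best_b = np.inf, None, None
for a in np.linspace(0.0, 2.0, 201):
    for b in np.linspace(-1.0, 1.0, 201):
        cost = np.sum(perp_dist(a, b, x, y) ** 2)
        if cost < best_cost:
            best_cost, best_a, best_b = cost, a, b
```

A grid search is wasteful but makes the criterion explicit; any numerical minimizer applied to the same cost function would do the same job faster.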

but I can't find something that works

Let's try to define what you mean by "something that works". Do you mean a computer program that could (by trial and error if necessary) estimate the line? Or do you require some symbolic formula that you can use in a math paper?

for the case where the error in x and y is just some arbitrary value, like in my case.

I assume you are talking about the variances of the errors at various values of (x,y).
What exactly do you know about this? For example, if we have a data point (10.0, 50.2), do you have a lot of data with similar values, so that we can estimate the variance in X and Y around the value (10.0, 50.2)? Or do you only have data with widely separated X and Y values, and are basing your assertion that the error variances change with X and Y on the overall scattered appearance of the data?
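On the OP's request for the uncertainty da: one assumption-light approach, not spelled out in the thread, is bootstrap resampling — resample the points with replacement, refit each resample, and take the spread of the refitted slopes. The sketch below uses a plain total-least-squares fit and hypothetical data:

```python
import numpy as np

rng = np.random.default_rng(0)

def tls_slope(x, y):
    """Total-least-squares slope: the normal of the line is the right
    singular vector for the smallest singular value of the centered data."""
    xc, yc = x - x.mean(), y - y.mean()
    _, _, vt = np.linalg.svd(np.column_stack([xc, yc]))
    nx, ny = vt[-1]
    return -nx / ny

# Hypothetical data, roughly on y = 2x
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.2, 3.9, 6.1, 8.0, 9.8, 12.3])

# Resample points with replacement, refit, and take the spread of the
# refitted slopes as a rough uncertainty estimate for a
slopes = []
for _ in range(2000):
    idx = rng.integers(0, len(x), len(x))
    xi, yi = x[idx], y[idx]
    if np.ptp(xi) == 0:   # skip degenerate resamples with no spread in x
        continue
    slopes.append(tls_slope(xi, yi))
a, da = np.mean(slopes), np.std(slopes)
```

This treats the scatter of the points themselves as the error model; it does not use the known per-point dx_i, dy_i, which is exactly the refinement the thread is circling around.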
 
I was hoping to find help on the same topic! Any ideas?

Stephen Tashi said:
We want the line y = ax + b that minimizes the expected error between data points and the line, when we average these errors over the whole line between X = (some minimum value of interest) and X = (some maximum value of interest), giving all those parts of the line equal weight in this averaging. The error between a data point (x_i, y_i) and the line will be measured by the perpendicular distance between (x_i, y_i) and the line.



Do you mean a computer program that could (by trial and error if necessary) estimate the line?

yes and yes.

thanks in advance
 
Look at the Wikipedia article on Total Least Squares http://en.wikipedia.org/wiki/Total_least_squares. I've only scanned the article myself, but it looks like what you want. It has an example written in Octave, which is a free Matlab work-alike.
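In place of the article's Octave example, the basic two-variable total-least-squares fit can be sketched in a few lines of NumPy: the TLS line passes through the centroid of the data, and its normal is the singular vector belonging to the smallest singular value of the centered data matrix. Data values are hypothetical:

```python
import numpy as np

# Hypothetical data with scatter in both coordinates, roughly on y = 2x
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.0, 9.9])

# Center the data: the TLS line passes through the centroid
xc, yc = x - x.mean(), y - y.mean()

# The normal of the best-fit line is the right singular vector belonging
# to the smallest singular value of the centered data matrix
_, _, vt = np.linalg.svd(np.column_stack([xc, yc]))
nx, ny = vt[-1]
a = -nx / ny                 # slope of y = a*x + b
b = y.mean() - a * x.mean()  # intercept: line goes through the centroid
```

Note the caveat raised earlier in the thread: plain TLS implicitly treats the x- and y-errors as having equal variance everywhere. With known per-point dx_i and dy_i that vary, each point's contribution would need to be weighted accordingly, which is the weighted/orthogonal combination the discussion points toward.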
 