Linear regression with asymmetric error bars

Discussion Overview

The discussion revolves around performing linear regression on data with asymmetric error bars for both x and y variables. Participants explore the implications of these asymmetric uncertainties on regression analysis, considering various statistical approaches and models.

Discussion Character

  • Exploratory
  • Technical explanation
  • Debate/contested
  • Mathematical reasoning

Main Points Raised

  • One participant seeks assistance with linear regression on data that has asymmetric error bars, emphasizing the need for different weighting in the regression calculation.
  • Another participant suggests that the question lacks clarity and questions the necessity of linear regression over other estimation methods.
  • A participant proposes that correlation can be assessed without regression lines, focusing on covariance and standard deviations instead.
  • It is mentioned that "total least squares" regression may be more appropriate for data with errors in both x and y variables compared to ordinary linear regression.
  • One participant introduces a model involving piecewise Gaussian variables to handle the asymmetric errors, detailing the mathematical formulation required for this approach.
  • Another participant shares a simpler case with error only in y and suggests converting to log space for the regression, noting that this works only when the error is symmetric in log space.
  • A later reply highlights the ambiguity of asymmetric error bars, arguing that their meaning is not universally defined, which complicates the regression process.

Areas of Agreement / Disagreement

Participants express differing views on how to approach the problem of asymmetric error bars in regression analysis. There is no consensus on a single method or solution, and multiple competing perspectives remain throughout the discussion.

Contextual Notes

Participants note the importance of understanding the source and nature of the error bars, as this affects how they should be treated in statistical analysis. The discussion reveals limitations in the clarity of the initial question and the assumptions underlying the proposed methods.

tmj143
I've been trying to figure out how to do a linear regression on data with asymmetric x and y error bars (different for each data point). Any help would be much appreciated.
 
:smile:That is the mean for x and y.
 
xiaoB said:
:smile:That is the mean for x and y.

? I don't think you understand. There's some probability distribution which says that each data point can lie somewhere in between some x1 and x2 and some y1 and y2; these uncertainties are of different magnitudes for each data point, and the fact that they are different means that the data points need to be weighted differently in the regression calculation. But, because the error bars are asymmetric, I can't just do a straight weighted fit...
 
It's probably best if you give more details and also explain why you are determined to do a linear regression instead of another type of estimation.
 
What other details do you want? I want to see if there's a linear correlation between the x and y variables, and have data points that look like:
      l
      l
      l
------x-------------------
      l

Or something like that where x is the data point and the l's/-'s represent the error (but of different magnitudes for each point, like I said). I don't know how I can better describe this...
 
tmj143 said:
I don't know how I can better describe this...

If you find a way, please post it. I'm too busy to conduct a detailed interrogation. If you really know what you're doing, your question will have an answer. If you don't know what you're doing (for example, if you just think regression and correlation are the "right" thing to do, but you don't understand what you're trying to optimize by using them) then you are beyond help.
 
Stephen Tashi said:
If you find a way, please post it. I'm too busy to conduct a detailed interrogation. If you really know what you're doing, your question will have an answer. If you don't know what you're doing (for example, if you just think regression and correlation are the "right" thing to do, but you don't understand what you're trying to optimize by using them) then you are beyond help.

Look, this is purely a statistical question; if you want me to go into science details, I could, but they're absolutely irrelevant. I have data. I need to figure out if there is a correlation between the x and y variables/the slope of said line in the case of a linear regression. I'm sure that I could do some sort of complicated simulation to randomly sample imaginary data points from within my error bars and calculate fits for all of them to see if I get anything significant, but that is far more complicated than something I want to deal with.

I know that if the error bars were the same, I could do a weighted least squares fit. But they're not. So all I'm asking for is if anyone knows how to deal with the asymmetric error bars in such instances... I'm sure people do such fits all the time, but my ability to google any sort of explanation hasn't been successful.
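
The resampling idea mentioned above need not be very complicated. Here is a minimal sketch, under the assumption (not stated in the thread) that each error bar describes a two-sided Gaussian centred on the plotted point, with the lower and upper bar lengths as the two standard deviations. The function names are made up for illustration, and the fit inside the loop is a plain unweighted least-squares slope, so this only shows how much the slope scatters under resampling; it is not a properly weighted fit.

```python
import random

def draw_split_normal(rng, center, sig_lo, sig_hi):
    """One draw from a two-sided Gaussian that is continuous at `center`.

    The lower side is chosen with probability sig_lo / (sig_lo + sig_hi),
    which is the fraction of the total area lying below `center`.
    """
    if rng.random() < sig_lo / (sig_lo + sig_hi):
        return center - abs(rng.gauss(0.0, sig_lo))
    return center + abs(rng.gauss(0.0, sig_hi))

def ols_slope(x, y):
    """Ordinary (unweighted) least-squares slope of y against x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    return sxy / sxx

def resampled_slopes(x, y, x_lo, x_hi, y_lo, y_hi, n_draws=2000, seed=0):
    """Return the 16th, 50th and 84th percentile of the resampled slopes."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    slopes = []
    for _ in range(n_draws):
        xs = [draw_split_normal(rng, c, lo, hi) for c, lo, hi in zip(x, x_lo, x_hi)]
        ys = [draw_split_normal(rng, c, lo, hi) for c, lo, hi in zip(y, y_lo, y_hi)]
        slopes.append(ols_slope(xs, ys))
    slopes.sort()
    pick = lambda q: slopes[int(q * (n_draws - 1))]
    return pick(0.16), pick(0.50), pick(0.84)
```

If the 16th to 84th percentile range of the slope stays well away from zero, that is at least a hint of a real trend, subject to the assumed error model.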
 
If it's correlation you're after, you don't have to deal with regression lines. The correlation is defined in terms of the covariance and the standard deviations of the two variables.
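
As a small illustration of that definition, the sample correlation coefficient is the covariance divided by the product of the two standard deviations. The sketch below uses only the standard library; the function name is made up here, and it ignores the error bars entirely, which is exactly the limitation being debated in this thread.

```python
import math

def pearson_r(x, y):
    """Sample correlation: cov(x, y) / (std(x) * std(y))."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / (n - 1)
    sx = math.sqrt(sum((xi - mx) ** 2 for xi in x) / (n - 1))
    sy = math.sqrt(sum((yi - my) ** 2 for yi in y) / (n - 1))
    return cov / (sx * sy)
```

Perfectly collinear points give a value of +1 or -1 (up to rounding), and no regression line has to be fitted to get it.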
 
  • #10
Perhaps you can use a model of a piecewise Gaussian variable. Suppose the variable has a mean [itex]a[/itex] and different standard deviations for [itex]x > a[/itex] and [itex]x < a[/itex], i.e. its distribution is:

[tex] \varphi(x) = \left\{\begin{array}{ll} A_{1} \exp\left(-\frac{(x - a)^{2}}{2 \sigma^{2}_{1}}\right), & x > a \\ A_{2} \exp\left(-\frac{(x - a)^{2}}{2 \sigma^{2}_{2}}\right), & x < a \end{array}\right.[/tex]

You have to adjust [itex]A_{1}[/itex] and [itex]A_{2}[/itex] so that:

[tex] E(X) - a = \int_{-\infty}^{\infty}{(x - a) \varphi(x) \, dx} = 0 \Rightarrow A_{1} \int_{0}^{\infty}{t e^{-\frac{t^{2}}{2 \sigma^{2}_{1}}} \, dt} = A_{2} \int_{0}^{\infty}{t e^{-\frac{t^{2}}{2 \sigma^{2}_{2}}} \, dt} \Rightarrow A_{1} \, \sigma^{2}_{1} = A_{2} \, \sigma^{2}_{2}[/tex]

Of course, the probability density must be normalized:

[tex] \int_{-\infty}^{\infty}{\varphi(x) \, dx} = 1 \Rightarrow A_{1} \, \int^{\infty}_{0}{e^{-\frac{t^{2}}{2\sigma^{2}_{1}}} \, dt} + A_{2} \, \int^{\infty}_{0}{e^{-\frac{t^{2}}{2\sigma^{2}_{2}}} \, dt} = 1 \Rightarrow \sqrt{\frac{\pi}{2}} \left(A_{1} \, \sigma_{1} + A_{2} \, \sigma_{2} \right) = 1[/tex]

These two equations allow you to express [itex]A_{1/2}[/itex] in terms of [itex]\sigma_{1/2}[/itex]. Try to find the variance of the variable.
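
For anyone who wants to check their algebra: solving the two conditions above gives, as far as I can tell,

[tex] A_{1} = \sqrt{\frac{2}{\pi}} \, \frac{\sigma_{2}}{\sigma_{1} (\sigma_{1} + \sigma_{2})}, \; A_{2} = \sqrt{\frac{2}{\pi}} \, \frac{\sigma_{1}}{\sigma_{2} (\sigma_{1} + \sigma_{2})}[/tex]

and the variance then comes out as [itex]\sigma_{1} \sigma_{2}[/itex]. The sketch below checks this numerically with a simple midpoint rule; the function names, the integration width and the step count are arbitrary choices made for illustration.

```python
import math

def amplitudes(s1, s2):
    """A1 (for x > a, width s1) and A2 (for x < a, width s2)."""
    norm = math.sqrt(2.0 / math.pi) / (s1 + s2)
    return norm * s2 / s1, norm * s1 / s2

def moments(s1, s2, a=0.0, n=20000, width=10.0):
    """Numerically integrate the density: returns (total, mean, variance)."""
    A1, A2 = amplitudes(s1, s2)
    total = shift = var = 0.0
    # Integrate each side separately in t = |x - a|, out to `width` sigmas.
    for A, s, sign in ((A1, s1, 1.0), (A2, s2, -1.0)):
        dt = width * s / n
        for k in range(n):
            t = (k + 0.5) * dt
            p = A * math.exp(-t * t / (2.0 * s * s)) * dt
            total += p
            shift += sign * t * p   # contribution to E(X) - a
            var += t * t * p        # contribution to E((X - a)^2)
    return total, a + shift, var
```

With, say, [itex]\sigma_{1} = 1[/itex] and [itex]\sigma_{2} = 0.5[/itex], the total should come out as 1, the mean as [itex]a[/itex], and the variance as 0.5.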

Next, consider the variable:

[tex] \varepsilon_{i} = a \, X_{i} + b \, Y_{i} + c, \; a^{2} + b^{2} = 1, \; i = 1, \ldots, N[/tex]

If [itex]X_{i}[/itex] and [itex]Y_{i}[/itex] have the above distribution, what is the expectation value and variance for [itex]\varepsilon_{i}[/itex]?

Approximate these variables as Normal with the above expectation values and variances, and use the maximum likelihood method, which reduces to a least-squares method, to estimate the parameters of the general linear dependence:

[tex] a \, x + b \, y + c = 0, \; a^{2} + b^{2} = 1[/tex]
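
A rough sketch of how that last step could look in practice, under several assumptions that go beyond the post: [itex]X_{i}[/itex] and [itex]Y_{i}[/itex] are taken as independent, each plotted point is taken as the mean of its distribution, the variance of each coordinate is taken as the product of its two bar lengths (which is what the piecewise Gaussian model above gives), and the log-variance term of the likelihood is dropped so that only the weighted sum of squares is minimised. The function name and the crude grid plus golden-section search are illustrative choices only.

```python
import math

def fit_line(x, y, sx_lo, sx_hi, sy_lo, sy_hi, n_grid=1800, n_iter=80):
    """Fit a*x + b*y + c = 0 with a = cos(theta), b = sin(theta)."""
    vx = [lo * hi for lo, hi in zip(sx_lo, sx_hi)]  # assumed Var(X_i)
    vy = [lo * hi for lo, hi in zip(sy_lo, sy_hi)]  # assumed Var(Y_i)

    def chi2(theta):
        a, b = math.cos(theta), math.sin(theta)
        # Var(eps_i) = a^2 Var(X_i) + b^2 Var(Y_i) for independent X_i, Y_i.
        w = [1.0 / (a * a * u + b * b * v) for u, v in zip(vx, vy)]
        # For fixed (a, b) the best c is minus the weighted mean of a*x + b*y.
        c = -sum(wi * (a * xi + b * yi) for wi, xi, yi in zip(w, x, y)) / sum(w)
        total = sum(wi * (a * xi + b * yi + c) ** 2 for wi, xi, yi in zip(w, x, y))
        return total, c

    # Coarse scan over the line direction, then refine around the best angle.
    h = math.pi / n_grid
    k = min(range(n_grid), key=lambda j: chi2(j * h)[0])
    lo, hi = (k - 1) * h, (k + 1) * h
    g = (math.sqrt(5.0) - 1.0) / 2.0
    for _ in range(n_iter):  # golden-section search (both points re-evaluated)
        m1, m2 = hi - g * (hi - lo), lo + g * (hi - lo)
        if chi2(m1)[0] < chi2(m2)[0]:
            hi = m2
        else:
            lo = m1
    theta = 0.5 * (lo + hi)
    total, c = chi2(theta)
    return math.cos(theta), math.sin(theta), c, total
```

When [itex]b \neq 0[/itex] the usual slope and intercept are [itex]-a/b[/itex] and [itex]-c/b[/itex]. The returned sum of squares can be compared with the number of points minus two as a rough goodness-of-fit check, again only to the extent that the Normal approximation holds.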
 
  • #11
I'm dealing with a similar problem. My case is a bit simpler because I only have error in y and because my error becomes symmetric when converted into log space. So to do the linear regression, I just convert into log-space, do a non-linear regression with log(mx+b) as my model curve, and convert back out of log space. There are a couple ways to do the nonlinear regression; I used the NonlinearModelFit command in Mathematica, which allows you to set weights to your points.
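
For readers without Mathematica, the same idea can be sketched in plain Python. This is not the NonlinearModelFit call itself, just a hand-rolled weighted Gauss-Newton fit of log(y) against log(mx+b). It assumes each point carries a multiplicative error factor greater than 1 (for example 1.25 for */÷1.25), that all y values are positive, and that mx+b stays positive along the way; the function name and the starting guess (an unweighted straight-line fit) are illustrative choices.

```python
import math

def fit_log_space(x, y, factor, n_iter=50):
    """Weighted fit of log(y) = log(m*x + b); returns (m, b, chi2)."""
    ly = [math.log(v) for v in y]
    w = [1.0 / math.log(f) ** 2 for f in factor]  # 1 / sigma^2 in log space

    # Starting guess: ordinary unweighted straight line through (x, y).
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    m = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sum((xi - mx) ** 2 for xi in x)
    b = my - m * mx

    def chi2(m, b):
        total = 0.0
        for xi, li, wi in zip(x, ly, w):
            u = m * xi + b
            if u <= 0.0:
                return math.inf  # log undefined: reject this (m, b)
            total += wi * (li - math.log(u)) ** 2
        return total

    current = chi2(m, b)
    if math.isinf(current):
        raise ValueError("starting line is not positive at every x")
    for _ in range(n_iter):
        smm = smb = sbb = gm = gb = 0.0
        for xi, li, wi in zip(x, ly, w):
            u = m * xi + b
            r = li - math.log(u)
            jm, jb = xi / u, 1.0 / u  # derivatives of log(m*x + b)
            smm += wi * jm * jm; smb += wi * jm * jb; sbb += wi * jb * jb
            gm += wi * jm * r; gb += wi * jb * r
        det = smm * sbb - smb * smb
        dm = (sbb * gm - smb * gb) / det
        db = (smm * gb - smb * gm) / det
        step = 1.0  # halve the step until chi2 no longer increases
        while step > 1e-10 and chi2(m + step * dm, b + step * db) > current:
            step /= 2.0
        if step <= 1e-10:
            break
        m, b = m + step * dm, b + step * db
        new = chi2(m, b)
        converged = current - new < 1e-15
        current = new
        if converged:
            break
    return m, b, current
```

The weights play the same role as the weights option mentioned above: points with a larger multiplicative error pull less on the fit.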

This method only works when your error is symmetric in log space, but this is the main kind of asymmetric error I usually run across. In fact, a lot of the error that we call symmetrical is actually symmetrical only in log space, which makes it close to symmetrical for small errors, but quite asymmetrical for larger ones. Very often when we say +/-25%, we really mean */÷1.25, which actually works out to +25%/-20%, of course.
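
The arithmetic behind that last remark, as a quick sketch (the function name is made up for illustration): a multiplicative factor f gives an upper bar of (f - 1) and a lower bar of (1 - 1/f), and the two bars have the same length once you take logs.

```python
import math

def factor_to_percent(f):
    """Convert a */÷ f error into (upper %, lower %) in linear space."""
    up = (f - 1.0) * 100.0
    down = (1.0 - 1.0 / f) * 100.0
    return up, down

up, down = factor_to_percent(1.25)  # roughly +25 % and -20 %

# In log space the two bars are the same length: log(1.25) = -log(1/1.25).
log_up = math.log(1.25)
log_down = -math.log(1.0 / 1.25)
```

For a small factor such as 1.02 the two percentages are nearly equal (2% and about 1.96%), which is why small errors look symmetric either way.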
 
  • #12
"Look, this is purely a statistical question..."

Ha. That's amusing.

The problem is that there's no right way to do this without knowing where those error bars come from. Error bars by themselves have no definite meaning. But when people use symmetric error bars, we know by convention that they probably represent something like root-mean-squared error, or some multiple of it. There's no such universal meaning of asymmetric error bars, so without more information about the error distribution they're intended to summarize, it's hard to say how to handle them correctly.
 
