# Weighted average of a set of slopes with different goodness of fit

1. Jul 11, 2014

### latitude

Hi there, I have a bit of a confusing question, but I'll try to be as clear as I can in asking it.

I have a set of linear fits for three different sets of data, with sample sizes N1 = 5, N2 = 7, N3 = 5 respectively. I have plotted these data against a common x-axis and found the linear fit for each data set, giving me three slopes (m1, m2, m3) with three corresponding errors in slope (m'1, m'2, m'3). Each of the linear fits also has a goodness of fit associated with it (0.79, 0.99, 0.89), measured by the R-squared value.

Here's my question. I need to solve for the average slope and average error in slope. I feel like a simple average and standard deviation isn't indicative of the real average, because the linear fits are not all equally "well-fit". Is there a way to weight the slopes m1, m2, m3 according to their R^2 values so that I can calculate m_avg using a weighted mean? Or is the average of the error in the slope enough? Should the sample sizes Ni come into it at all?

Any advice would be greatly appreciated! Thanks :)
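For concreteness, the "simple average and standard deviation" baseline the question mentions looks like this (the slope values here are invented placeholders, since the post gives only the R-squared values):

```python
import statistics

# Slopes from the three linear fits (illustrative values only;
# the post does not give the actual m1, m2, m3).
slopes = [2.0, 1.6, 1.8]

m_avg = statistics.mean(slopes)    # simple (unweighted) average
m_std = statistics.stdev(slopes)   # sample standard deviation

print(m_avg, m_std)
```

The question is whether this treats a poorly fit slope (R^2 = 0.79) too generously relative to a well-fit one (R^2 = 0.99).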

2. Jul 16, 2014

### vyas22

Hello

Just a suggestion - how about calculating a weighted average, i.e. (0.79*m1 + 0.99*m2 + 0.89*m3)/(0.79 + 0.99 + 0.89), for the average slope?

And I do not feel the sample size should matter here.
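The R-squared-weighted average suggested above, as code (the R-squared weights are from the thread; the slope values are invented for illustration):

```python
# R-squared values from the thread, used as weights
r_squared = [0.79, 0.99, 0.89]
# Hypothetical slopes m1, m2, m3 (not given in the thread)
slopes = [2.0, 1.6, 1.8]

# Weighted average: divide by the sum of the weights, not by 3,
# so that equal weights reduce to the ordinary mean.
m_avg = sum(r * m for r, m in zip(r_squared, slopes)) / sum(r_squared)
print(m_avg)
```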

3. Jul 16, 2014

### Stephen Tashi

The phrase "average slope" doesn't specify a particular mathematical problem. For the word "average" to have a precise meaning you have to say what random variable is involved. If you have difficulty expressing the problem in mathematical terms, try explaining what you are trying to accomplish. If you get a number for "average slope", how do you expect to use it?

4. Jul 16, 2014

### latitude

Basically, I am changing something incrementally and systematically (the x-axis) and noting the induced change in another quantity (the y-axis). The response tends to be linearly increasing, y = mx. I repeated this measurement on four different samples from the same "batch", so theoretically they should all show the same response. Since m has a physical significance, I want to take the average of the set of m's to show the average response for the samples from this batch.

However, there is variance in the slope determined for each sample (not the standard deviation across the set, but the experimental error from running the same test on the same sample a number of times and getting different results). What I would like to do is show the average slope, with each contributor weighted according to how consistently I received that response during the testing iterations.

So if I tested Sample 1 six times and got a consistently linear response of m1 = 2.0 +/- 0.1, I feel that should weigh more heavily in the average than Sample 2, which I tested three times for m2 = 1.6 +/- 0.8. Does that make sense? The average slope would then be a way of saying, "The batch of samples tends to show this response (m_avg +/- error_in_slope) for this particular test."
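The weighting described here (slopes with smaller experimental error counting for more) is exactly inverse-variance weighting, which is the standard way to combine measurements that come with individual uncertainties. A sketch using the two example slopes from the post:

```python
import math

# Slopes and their experimental errors from the post:
# Sample 1: m1 = 2.0 +/- 0.1, Sample 2: m2 = 1.6 +/- 0.8
slopes = [2.0, 1.6]
errors = [0.1, 0.8]

# Inverse-variance weights: w_i = 1 / sigma_i^2
weights = [1.0 / e**2 for e in errors]

m_avg = sum(w * m for w, m in zip(weights, slopes)) / sum(weights)
err_avg = 1.0 / math.sqrt(sum(weights))  # error of the weighted mean

print(m_avg, err_avg)  # Sample 1 dominates, as intended
```

With these numbers the result is roughly 1.994 +/- 0.099: the precise Sample 1 almost entirely determines the answer.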

Thanks very much for the replies!

5. Jul 16, 2014

### FactChecker

I assume there is a reason that you cannot do a single linear regression on all the data. In any case, I think you should make a test case of your proposed method, using real or fabricated data, and see how it compares to a single regression on all the data. I am skeptical that your proposed method is valid, but I could be wrong. I would be interested in hearing the result of that test.
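One way to run the test suggested above: fabricate data sets with a known slope and compare a weighted average of per-set slopes against a single pooled regression. Everything here (the noise levels, the use of `numpy.polyfit`, the y = m*x + noise model) is my assumption for the sketch, not anything from the thread:

```python
import numpy as np

rng = np.random.default_rng(0)
true_m = 2.0

# Fabricate three data sets with different sizes and noise levels
sizes = [5, 7, 5]
noises = [0.5, 0.1, 0.3]
xs, ys, fit_slopes, fit_errs = [], [], [], []

for n, s in zip(sizes, noises):
    x = np.linspace(1, 10, n)
    y = true_m * x + rng.normal(0, s, n)
    # Per-set fit: polyfit with cov=True also gives the slope variance
    (m, b), cov = np.polyfit(x, y, 1, cov=True)
    fit_slopes.append(m)
    fit_errs.append(np.sqrt(cov[0, 0]))
    xs.append(x)
    ys.append(y)

# Method 1: inverse-variance weighted average of the three slopes
w = 1 / np.array(fit_errs) ** 2
m_weighted = np.sum(w * np.array(fit_slopes)) / np.sum(w)

# Method 2: single regression on the pooled data
m_pooled, _ = np.polyfit(np.concatenate(xs), np.concatenate(ys), 1)

print(m_weighted, m_pooled)  # both should land near true_m = 2.0
```

Repeating this many times and comparing the spread of the two estimators around the known true slope is exactly the kind of check FactChecker has in mind.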

6. Jul 16, 2014

### vyas22

If the errors in slope are due to measurement, and the data converge consistently to a theoretically expected mean value of m, it does make sense to take a weighted average based on the frequency of observations rather than the arithmetic mean of the slopes. By the central limit theorem, the value of m should converge to a central tendency in a bell-shaped curve as the experiment is repeated.

7. Jul 17, 2014

### Stephen Tashi

I think the proper statement of your goal is that you want to "estimate" some quantity. (Statistics has two main endeavors. These are "estimation" and "hypothesis testing".)

You need a probability model to make clear what theory says and how errors enter the picture. Here are some alternatives:

For example, you might assume each measurement (X,Y) has the form Y = MX + B + E where M and B are constant and E is a random variable representing an error that occurs with each measurement of Y.

Or you might assume each experiment produces data of the form Y = (M + E1)X + B + E2 + E3 where E1 and E2 are random errors that are constant on each "batch" of measurements and E3 is a random error that occurs on each measurement.

Or you might assume each experiment produces data of the form Y = MX + B + E1 + E3 where E1 is an error that is constant for each "batch" and E3 is an error that occurs on each measurement.

In some real-life situations, (X,Y) measurements can have errors in the measurement of X as well as in the measurement of Y. The usual sort of linear regression assumes there is no error in the X measurements.

Until you get into such detail, we don't have a specific mathematical question.

The least painful way to get into such detail is to use simulation. If you can write computer programs, I agree with FactChecker's advice to make test cases using fabricated data. Pick a specific probability model for simulating the data. That way you will know the actual value of M. Then simulate data and compute your estimate in various ways and see which does best.
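A sketch of that simulation for the second model listed above, Y = (M + E1)X + B + E3, where E1 is a slope error fixed within each batch and E3 is a per-measurement error. All the specific numbers are fabricated, as suggested, so the true M is known in advance:

```python
import numpy as np

rng = np.random.default_rng(42)
M, B = 2.0, 1.0                     # true slope and intercept
sigma_batch, sigma_meas = 0.1, 0.3  # assumed error magnitudes

n_batches, n_points = 3, 7
estimates = []

for _ in range(n_batches):
    e1 = rng.normal(0, sigma_batch)  # batch-level slope error E1
    x = np.linspace(1, 10, n_points)
    # E3: independent error on each measurement of Y
    y = (M + e1) * x + B + rng.normal(0, sigma_meas, n_points)
    m_hat, _ = np.polyfit(x, y, 1)
    estimates.append(m_hat)

# Under this model each fitted batch slope is an unbiased estimate
# of M, so the plain mean of the fitted slopes is one candidate
# estimator; weighted variants can be compared against it the same way.
m_est = np.mean(estimates)
print(m_est)  # should land near the known M = 2.0
```

Because M is chosen by the simulator, different estimators (plain mean, inverse-variance weighted mean, pooled regression) can be scored directly on how close they get to it.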

8. Aug 10, 2014

### haruspex

The only reason I can think of would be that although the slopes are in principle the same, the intercepts are not. If so, it would be logical to adjust the relative intercepts so as to maximise the R-squared value (equivalently, minimise the residual sum of squares) for the linear regression of the conflated set. There's probably a simple algebraic way to do that.
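There is indeed a simple algebraic way: fit one common slope with a separate intercept per data set as a single least-squares problem, using a design matrix with one slope column plus one indicator column per set. A sketch with fabricated data (the true slope and intercepts here are made up):

```python
import numpy as np

rng = np.random.default_rng(1)
true_m = 2.0
intercepts = [0.0, 1.5, -0.7]  # different true intercepts per set

X_rows, y_all = [], []
for g, b in enumerate(intercepts):
    x = np.linspace(1, 10, 6)
    y = true_m * x + b + rng.normal(0, 0.2, x.size)
    for xi, yi in zip(x, y):
        # Row layout: [x, 1{set 0}, 1{set 1}, 1{set 2}]
        row = [xi] + [1.0 if g == j else 0.0 for j in range(3)]
        X_rows.append(row)
        y_all.append(yi)

A = np.array(X_rows)
y = np.array(y_all)

# Least squares over the conflated set: the first coefficient is
# the shared slope, the remaining three are the per-set intercepts.
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
print(coef[0], coef[1:])  # shared slope near 2.0
```

This minimises the residual sum of squares over the conflated set directly, with the relative intercepts adjusted as part of the same fit.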