Weighted average of a set of slopes with different goodness of fit

  • Context: Undergrad
  • Thread starter: latitude
  • Tags: Average, Fit, Set

Discussion Overview

The discussion revolves around the calculation of a weighted average of slopes derived from linear fits of multiple data sets, considering their respective goodness of fit and sample sizes. Participants explore how to appropriately weight these slopes to reflect their reliability and consistency, while also addressing the implications of experimental error in the measurements.

Discussion Character

  • Exploratory
  • Technical explanation
  • Debate/contested
  • Mathematical reasoning

Main Points Raised

  • One participant questions the appropriateness of a simple average for the slopes, suggesting that the goodness of fit (R-squared values) should be used to weight the slopes in the calculation of the average slope.
  • Another participant proposes a specific formula for calculating the weighted average slope using the R-squared values, but expresses that sample size should not factor into this calculation.
  • A participant emphasizes the need for clarity in defining what is meant by "average slope" and suggests that the context of its use should be explained to provide a precise mathematical meaning.
  • One participant describes their experimental setup and the need to account for variability in slopes due to repeated measurements, advocating for a weighted average based on the consistency of responses across samples.
  • Another participant expresses skepticism about the validity of the proposed method and encourages testing it against a single regression of all data to compare results.
  • A later reply discusses the central limit theorem and suggests that if measurements converge to a theoretical mean, a weighted average based on the frequency of observations may be appropriate.
  • One participant highlights the importance of establishing a probability model to clarify how errors affect the measurements and proposes various models to consider.
  • Another participant notes the potential issue of differing intercepts among the data sets, suggesting that adjusting intercepts could minimize the R-squared value for a combined regression analysis.

Areas of Agreement / Disagreement

Participants express differing views on how to calculate the average slope, with some advocating for weighting based on R-squared values and others questioning the validity of this approach. There is no consensus on the best method to use, and the discussion remains unresolved.

Contextual Notes

Participants note the importance of defining the mathematical problem clearly and the potential impact of measurement errors on the results. There are also discussions about the implications of different statistical models and the assumptions underlying them.

latitude
Hi there, I have a bit of a confusing question, but I'll try to be as clear as I can in asking it.

I have linear fits for three different sets of data, with sample sizes N1 = 5, N2 = 7, N3 = 5 respectively. I have plotted these data with respect to a common x-axis. Then I have found the linear fit for each data set, giving me a line with a slope, so 3 slopes total (m1, m2, m3), with three errors in slope as well (m'1, m'2, m'3). Each of the linear fits also has a goodness of fit associated with it (0.79, 0.99, 0.89) using the R-squared value.

Here's my question. I need to solve for the average slope and average error in slope. I feel like a simple average and standard deviation isn't indicative of the real average, because the linear fits are not all equally "well-fit". Is there a way to weight the slopes m1, m2, m3 according to their R^2 values so that I can calculate m_avg using a weighted mean? Or is the average of the error in the slope enough? Should the sample sizes Ni come into it at all?

Any advice would be greatly appreciated! Thanks :)
 
Hello

Just a suggestion: how about calculating a weighted average, i.e. (0.79*m1 + 0.99*m2 + 0.89*m3)/3, for the average slope?

And I do not feel the sample size should matter here.
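A quick sketch of this proposal (the slope values here are made up, since none are quoted in the thread). Note that dividing by the sum of the weights, rather than by 3, is what makes the result a true weighted mean that lies between the smallest and largest slope:

```python
# Sketch of the proposed R^2-weighted average of slopes.
# Slope values are hypothetical; the weights are the R^2 values from the post.
slopes = [2.1, 1.9, 2.0]       # m1, m2, m3 (made-up numbers for illustration)
weights = [0.79, 0.99, 0.89]   # R-squared values quoted in the thread

# As posted: divide by the number of slopes.
m_as_posted = sum(w * m for w, m in zip(weights, slopes)) / 3

# True weighted mean: divide by the sum of the weights, so the result is
# guaranteed to lie between min(slopes) and max(slopes).
m_weighted = sum(w * m for w, m in zip(weights, slopes)) / sum(weights)

print(m_as_posted, m_weighted)
```

With weights close to 1 the two forms are similar, but only the second is scale-invariant in the weights.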
 
latitude said:
I need to solve for the average slope and average error in slope

That phrase doesn't specify a particular mathematical problem. For the word "average" to have a precise meaning you have to say what random variable is involved. If you have difficulty expressing the problem in mathematical terms, try explaining what you are trying to accomplish. If you get a number for "average slope", how do you expect to use it?
 
Basically I am changing something incrementally and systematically (x-axis) and noting the induced change that occurs in another quality (y-axis). The response tends to be linearly increasing, y = mx. I repeated this measurement on four different samples from the same "batch", so that theoretically, they should be showing the same response. Since m has a physical significance, I want to take the average of the set of "m"s to show the average response for the samples from this batch.

However, there is variance in the slope determined for each sample (not the standard deviation in the set, but experimental error associated with running the same test on the same sample a number of times and getting different results). I guess what I would like to do is show the average of the slope, with each contributor to the slope weighted according to how consistently I received that response during the testing iterations.

So if I tested Sample 1 six times and got a consistently linear response of m1 = 2.0 +/- 0.1, I feel that that should weigh more heavily in calculating the average than Sample 2, which I tested three times for m2 = 1.6 +/- 0.8. Does that make sense? So the average slope would just be a way of saying, "The Batch of samples tends to show this response (mavg +/- error_in_slope) for this particular test."
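One standard way to formalise "weight by consistency" is an inverse-variance weighted mean, where each slope counts in proportion to 1/sigma^2. A minimal sketch using the two example slopes above (this is one common convention, not necessarily the only defensible choice for this experiment):

```python
# Inverse-variance weighted mean: slopes with small error bars dominate.
# Numbers are the examples from the post (m1 = 2.0 +/- 0.1, m2 = 1.6 +/- 0.8).
slopes = [2.0, 1.6]
sigmas = [0.1, 0.8]

weights = [1.0 / s ** 2 for s in sigmas]
m_avg = sum(w * m for w, m in zip(weights, slopes)) / sum(weights)

# Standard error of the weighted mean under this convention.
sigma_avg = (1.0 / sum(weights)) ** 0.5

print(m_avg, sigma_avg)
```

As expected, the result sits much closer to the tightly measured m1 = 2.0 than to m2, and the combined error is smaller than either individual error.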

Thanks very much for the replies!
 
I assume there is a reason that you cannot do a single linear regression on all the data. In any case, I think you should make a test case of your proposed method, using real or fabricated data, and see how it compares to a single regression of all the data. I am skeptical that your proposed method is valid, but I could be wrong. I would be interested in hearing the result of that test.
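A minimal version of that test, with fabricated data (the true slope, noise level, and x-values are all made-up choices for illustration):

```python
# Compare an R^2-weighted average of per-set slopes against a single
# regression over the pooled data, using fabricated data with a known slope.
import random

random.seed(0)

def fit_slope(xs, ys):
    """Ordinary least-squares slope and R^2 for y = a + b*x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = my - b * mx
    ss_res = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - my) ** 2 for y in ys)
    return b, 1.0 - ss_res / ss_tot

true_m = 2.0
sets = []
for n in (5, 7, 5):                       # sample sizes from the post
    xs = [i + 1.0 for i in range(n)]
    ys = [true_m * x + random.gauss(0, 0.5) for x in xs]
    sets.append((xs, ys))

fits = [fit_slope(xs, ys) for xs, ys in sets]
m_weighted = sum(r2 * b for b, r2 in fits) / sum(r2 for _, r2 in fits)

all_x = [x for xs, _ in sets for x in xs]
all_y = [y for _, ys in sets for y in ys]
m_pooled, _ = fit_slope(all_x, all_y)

print(m_weighted, m_pooled)
```

Both estimates should land near the known true slope here; the interesting cases are exactly the ones raised later in the thread, e.g. when the sets have different intercepts.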
 
latitude said:
... theoretically, they should be showing the same response. ...but experimental error associated with running the same test on the same sample a number of times and getting different results).

... So if I tested Sample 1 six times and got a consistently linear response of m1 = 2.0 +/- 0.1, I feel that that should weigh more heavily in calculating the average than Sample 2, which I tested three times for m2 = 1.6 +/- 0.8. Does that make sense?

If the errors of slope are due to measurement, and if the data converge consistently to a theoretically expected mean value of m, it does make sense to take a weighted average based on the frequency of observations rather than the arithmetic mean of the slopes. As per the central limit theorem, the value of m should converge to a central tendency in a bell-shaped curve on repeating the experiment.
 
I think the proper statement of your goal is that you want to "estimate" some quantity. (Statistics has two main endeavors. These are "estimation" and "hypothesis testing".)

latitude said:
theoretically, they should be showing the same response

You need a probability model to make clear what theory says and how errors enter the picture. Here are some alternatives:

For example, you might assume each measurement (X,Y) has the form Y = MX + B + E where M and B are constant and E is a random variable representing an error that occurs with each measurement of Y.

Or you might assume each experiment produces data of the form Y = (M + E1)X + B + E2 + E3 where E1 and E2 are random errors that are constant on each "batch" of measurements and E3 is a random error that occurs on each measurement.

Or you might assume each experiment produces data of the form Y = MX + B + E1 + E3 where E1 is an error that is constant for each "batch" and E3 is an error that occurs on each measurement.

In some real-life situations, (X,Y) measurements can have errors in the measurement of X as well as in the measurement of Y. The usual sort of linear regression assumes there is no error in the X measurements.

Until you get into such detail, we don't have a specific mathematical question.

The least painful way to get into such detail is to use simulation. If you can write computer programs, I agree with FactChecker's advice to make test cases using fabricated data. Pick a specific probability model for simulating the data. That way you will know the actual value of M. Then simulate data and compute your estimate in various ways and see which does best.
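As a sketch of that simulation idea, here is the second model above, Y = (M + E1)X + B + E3, with E1 fixed per batch and E3 fresh per measurement. All parameter values are arbitrary made-up choices; since the true M is known, any estimator can be scored against it:

```python
# Simulate batches under Y = (M + E1)*X + B + E3 and check that the
# average of per-batch OLS slopes recovers the known M.
import random

random.seed(1)

M, B = 2.0, 0.3
batch_slopes = []
for _ in range(1000):                  # many simulated batches
    e1 = random.gauss(0, 0.2)          # batch-level error in the slope
    xs = [1.0, 2.0, 3.0, 4.0, 5.0]
    ys = [(M + e1) * x + B + random.gauss(0, 0.1) for x in xs]
    # OLS slope for this batch.
    mx, my = sum(xs) / 5, sum(ys) / 5
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    batch_slopes.append(b)

m_hat = sum(batch_slopes) / len(batch_slopes)
print(m_hat)   # should land near the true M = 2.0
```

Swapping in one of the other models (or a different estimator) is a one-line change, which is the point of the simulation approach: the model, and hence the "right" average, is made explicit.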
 
FactChecker said:
I assume there is a reason that you can not do a single linear regression on all the data.
The only reason I can think of would be that although the slopes are in principle the same, the intercepts are not. If so, it would be logical to adjust the relative intercepts so as to maximise the R-squared value (i.e. minimise the residual sum of squares) for the linear regression of the conflated set. There's probably a simple algebraic way to do that.
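One simple algebraic route, assuming the sets share a slope but not an intercept, is the classic within-group (fixed-effects) trick: centre each data set about its own mean x and mean y, then regress the pooled centred data; the per-set intercepts drop out. A sketch with fabricated numbers (common slope 2, different intercepts):

```python
# Pooled slope estimate when each data set has its own intercept:
# centre each set, then fit one no-intercept regression to the combined data.
def centred(xs, ys):
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    return [x - mx for x in xs], [y - my for y in ys]

# Fabricated data: both sets have slope ~2, intercepts ~0 and ~3.
sets = [
    ([1, 2, 3, 4, 5], [2.1, 4.0, 6.2, 7.9, 10.1]),
    ([1, 2, 3, 4, 5], [5.0, 6.9, 9.1, 11.0, 13.0]),
]

all_x, all_y = [], []
for xs, ys in sets:
    cx, cy = centred(xs, ys)
    all_x += cx
    all_y += cy

# No-intercept OLS on the centred, pooled data gives the common slope.
m = sum(x * y for x, y in zip(all_x, all_y)) / sum(x * x for x in all_x)
print(m)
```

Centring removes each set's intercept exactly, so naively conflating sets with different intercepts (which would bias the pooled slope) is avoided without estimating the intercepts at all.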
 
