Is Averaging Individual Slopes Equivalent to Linear Regression?


Discussion Overview

The discussion revolves around the mathematical equivalence of the slope obtained from linear regression and the average of individual slopes calculated between data points. Participants explore the implications of using different methods for calculating slopes in the context of programming and handling large datasets.

Discussion Character

  • Technical explanation
  • Debate/contested
  • Mathematical reasoning

Main Points Raised

  • Some participants assert that the slope from the regression formula minimizes the sum of squared errors, which may not align with simply averaging individual slopes between data points.
  • Concerns are raised about how to define slopes when multiple y-values correspond to the same x-value, complicating the averaging process.
  • One participant suggests that calculating slopes between all points and averaging them could be more efficient than running a full regression repeatedly.
  • Another participant emphasizes that averaging slopes from closely spaced x-values could yield misleading results unless the x-values are evenly distributed.
  • A suggestion is made that a weighted average might be appropriate given the nature of the data and the "lookback" approach described.
  • Participants discuss the computational feasibility of handling large datasets with millions of calculations, noting that modern computers can manage such tasks efficiently.

Areas of Agreement / Disagreement

Participants converge on the view that the two methods are not mathematically equivalent: the original poster accepts that squared errors give distant points more influence, but opts for a slope-averaging approach anyway, leaving open how its results will compare with a full regression.

Contextual Notes

Participants highlight the complexity of the dataset, including the presence of overlapping data points and the potential for varying distributions of x-values, which may affect the validity of averaging slopes.

sawtooth500
So the linear regression formula is the one found here: https://www.ncl.ac.uk/webtemplate/a...and-correlation/simple-linear-regression.html

Question - is the slope given by the regression formula mathematically equivalent to individually finding the slope between all the datapoints, and then averaging the slopes out? I'm a programmer, and I need to write code that runs a linear regression across parts - only parts - of a very large dataset. I'm only interested in the slope of the regression line in my sample, nothing more. Different parts will have some overlapping data points, though. I'm thinking that if I just find the individual slope between each pair of points and then average those to calculate the slope of the regression line for the set of points I need, that would work - and it would certainly be more efficient code than running an entire regression equation over and over again.... My intuition says yes, I will get the same result, but I've forgotten the math necessary to prove it. Thank you!
 
sawtooth500 said:
So the linear regression formula is https://www.ncl.ac.uk/webtemplate/a...and-correlation/simple-linear-regression.html found here.

Question - is the slope given by the regression formula mathematically equivalent to individually finding the slope between all the datapoints, and then averaging the slopes out?
No. Remember that the regression line minimizes the sum-SQUARED errors of the line versus the sample y-values. So a sample y value being far from the line will have much more effect than if the slopes were just averaged. Also, what slopes are you talking about? The sample might have many different y values from the same x value. How would you define a slope then?
sawtooth500 said:
I'm a programmer, and I need to write code that runs a linear regression across parts, note here only parts, of a very large dataset - I'm only interested in the slope of the linear regression line in my sample, nothing more. However, I only need regression lines across parts of the dataset.
You haven't said how many dimensions you have in your independent variable(s). I will assume that you are talking about simple linear regression.
sawtooth500 said:
Different parts will have some overlapping data points though. I'm thinking that if I just find the individual slope between each pair of points, and then run an average to calculate the slope of the regression line for the set of points I need, that would work. It would certainly be more efficient code than running an entire regression equation over and over again....
There are fairly efficient closed-form calculations. You can use:
##\hat {\beta} = \frac{n \sum x_i y_i - \sum x_i \sum y_i}{n\sum {x^2_i} - ( \sum x_i)^2}##
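The closed-form slope above needs only four running sums, so it can be computed in a single pass. A minimal sketch in Python (the data below are made up for illustration; any paired lists of numbers work):

```python
# Single-pass slope of simple linear regression, using the closed form
# beta = (n*Sxy - Sx*Sy) / (n*Sxx - Sx^2) from the post above.

def regression_slope(xs, ys):
    n = len(xs)
    sx = sum(xs)
    sy = sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return (n * sxy - sx * sy) / (n * sxx - sx * sx)

xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]   # exactly y = 2x + 1
print(regression_slope(xs, ys))  # → 2.0
```

Because only the sums are needed, each new point costs a handful of arithmetic operations rather than a full re-fit.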
 
 
So a bit of clarification -

For any given X value, there will only be one Y value.

Good point about the Y values being squared in a regression - that does give the points further away more sway - at this time we don't want to do that.

The calculation we're doing can be imagined like this - imagine X axis values A-Z.

So we need to calculate the average slope of -

A B C D E
B C D E F
C D E F G
D E F G H

And so on.... except in my actual dataset we are initially working with about 350,000 X values (each X value actually represents a timestamp in nanoseconds). So basically we are taking 5 second "lookbacks" at every 1 second interval - and depending on the lookback, some times are "busier" than others, so we can have hundreds to thousands of individual data points in a 5 second block. Because of this lookback structure, you can see that even a 350,000-point dataset will likely involve tens of millions of calculations, and later datasets have millions of initial entries....

So I was thinking of just having the program go through the ENTIRE dataset once to find the slope between each consecutive pair of points... then just average those slopes within each window.

Of course it will be interesting to also run a linear regression model after we do this one, since the errors are squared in the regression model... and see how the results compare.
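For the overlapping-window ("lookback") pattern described above, a full regression per window does not actually require re-fitting from scratch: the four sums in the closed-form slope can be updated incrementally as points enter and leave the window. A sketch under those assumptions (the 5-unit lookback mirrors the 5 second window in the post; the data are made up):

```python
from collections import deque

# Sliding-window regression slope: keep running sums so each window update
# costs O(points added + points removed), not O(window size).
# Note: with raw nanosecond timestamps, rebase x (subtract the window start)
# before summing, or the large sums will lose floating-point precision.

class RollingSlope:
    def __init__(self):
        self.pts = deque()
        self.n = 0
        self.sx = self.sy = self.sxy = self.sxx = 0.0

    def add(self, x, y):
        self.pts.append((x, y))
        self.n += 1
        self.sx += x; self.sy += y
        self.sxy += x * y; self.sxx += x * x

    def evict_before(self, x_min):
        # Drop points older than the lookback horizon.
        while self.pts and self.pts[0][0] < x_min:
            x, y = self.pts.popleft()
            self.n -= 1
            self.sx -= x; self.sy -= y
            self.sxy -= x * y; self.sxx -= x * x

    def slope(self):
        den = self.n * self.sxx - self.sx ** 2
        return (self.n * self.sxy - self.sx * self.sy) / den

r = RollingSlope()
for x in range(10):
    r.add(float(x), 2.0 * x + 1.0)   # y = 2x + 1
r.evict_before(5.0)                   # keep a 5-unit lookback
print(r.slope())                      # → 2.0
```

This gives the true least-squares slope for every window at roughly the same cost as the slope-averaging idea, which may make the approximation unnecessary.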
 
Suppose that your x values are ordered, ##x_0 \lt x_1 \lt \dots \lt x_n##. An average of the point-to-point slopes can give strange results unless the x values are reasonably equally spaced. Otherwise, you might have two x values very close together where even a small difference in the y values gives a huge slope.
It sounds like you are not wanting to use the sum-squared-errors of linear regression, so you may have to invent your own method. Be careful.
When you mention a 5 second "lookback" it makes me think that some sort of weighted average, where the influence of older values has less weight might be appropriate.
In any case, you should not be too intimidated by tens of millions of calculations unless you require hard real-time results. Today's computers are VERY fast at simple calculations like this.
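The closely-spaced-x hazard described above is easy to demonstrate numerically. A small sketch with made-up data, comparing the least-squares slope against the mean of consecutive point-to-point slopes:

```python
# Two nearly coincident x values with a little y jitter: the regression
# slope stays sensible, while one near-vertical segment dominates the
# average of consecutive slopes.

def regression_slope(xs, ys):
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return (n * sxy - sx * sy) / (n * sxx - sx * sx)

def mean_consecutive_slope(xs, ys):
    slopes = [(ys[i + 1] - ys[i]) / (xs[i + 1] - xs[i])
              for i in range(len(xs) - 1)]
    return sum(slopes) / len(slopes)

xs = [0.0, 1.0, 1.001, 2.0, 3.0]   # two x values only 0.001 apart
ys = [0.0, 1.0, 1.1,   2.0, 3.0]   # small y jitter at the close pair

print(regression_slope(xs, ys))       # close to 1
print(mean_consecutive_slope(xs, ys)) # blown up by the 0.1/0.001 segment
```

The underlying trend is essentially slope 1, but the averaged-slopes estimate is pulled far away by the single near-vertical segment, which illustrates why even spacing (or some weighting scheme) matters for the averaging approach.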
 
