Least squares line - understanding formulas

Discussion Overview

The discussion centers on understanding the formulas for the slope and y-intercept in simple linear regression, exploring both the mathematical derivation and intuitive reasoning behind each component of the formulas. Participants express a desire to grasp the underlying principles and connections within the formulas.

Discussion Character

  • Exploratory
  • Technical explanation
  • Conceptual clarification
  • Debate/contested

Main Points Raised

  • One participant seeks to understand the rationale behind each term in the slope and intercept formulas, questioning the significance of the sums and their manipulations.
  • Another participant notes that the slope formula is related to the Pearson correlation coefficient and suggests that the manipulations aim to simplify calculations.
  • A third participant introduces the idea that the slope can be viewed as the ratio of covariance to variance, linking it to the concepts of correlation.
  • Some participants argue that there is no specific "why" for individual terms in the formulas, asserting that their importance lies in their collective function to minimize error in the regression model.
  • There is a mention of confusion regarding the term "margin of error" in relation to the average value of x, with one participant stating they have not encountered this terminology before.

Areas of Agreement / Disagreement

Participants express differing views on the significance of individual terms in the formulas, with some asserting that the overall purpose is more important than the individual components. There is no consensus on the interpretation of the term "margin of error" as it relates to the average value of x.

Contextual Notes

Participants highlight the complexity of the formulas and the manipulations involved, indicating that the derivations may obscure intuitive understanding. The discussion reflects varying levels of familiarity with statistical terminology and concepts.

Vital
Hello.

I have listened to a great lecture that gave helpful intuitive insight into correlation and regression (basic stuff). But there are formulas that I cannot grasp intuitively and whose origin I don't know. To remember them, I would like to understand what is happening in each part of the formula and why these mathematical combinations give the desired result, i.e. I would like to understand the formulas both mathematically and intuitively.
I will be grateful for your patience and your help.

The first one is for the slope, and the second is for the y-intercept
(both formulas give the coefficients of the simple linear relationship
y = y-intercept + slope multiplied by x).

slope = [ n×Σxy - ΣxΣy ] / [ n×Σx^2 - (Σx)^2 ]

I have "whys" about each part of this formula:
numerator
- why we take the sum of xy
- why we then multiply that sum by n (the number of elements) and what is the meaning and role of the result
- why we subtract from the previous result the sum of x multiplied by the sum of y
denominator
- why we take the sum of x squared
- why we then multiply it by n
- why we take the sum of all x and then square the result
- why we subtract the second from the first
formula
- why we use [n×Σxy - ΣxΣy] for numerator and [n×Σx^2 - (Σx)^2 ] for the denominator,
how do they work together, and what is the intuition behind the process?

The second one is for the y intercept:
intercept = [ Σy/n ] - slope x [ Σx / n]

Same questions here. Finally, what is more confusing is that [ Σx / n ] is called a margin of error. Why is it called a margin of error if it looks like the formula for the average value of x, given n elements? Thank you.
 
The slope formula has been manipulated to be easier to calculate (one pass through the data rather than two). It is closely related to the Pearson correlation coefficient. You can see a fairly intuitive initial definition of the Pearson correlation coefficient which is then manipulated to be close to your slope formula here.

For the intercept, I don't know what slope x means. Whatever it is, I assume that the same sort of manipulations have been done as were done for the slope formula.

I have not heard the sample average called a "margin of error" before, so I can't help you there. The usual use of the term "margin of error" in statistics does not have that definition.
 
Qualitatively, the slope is the covariance divided by the variance.

[ n×Σx^2 - (Σx)^2 ] is the variance of x, up to a factor of n^2:

[Image: 1566226277038.png]

The covariance in the numerator is the same thing with xy instead of x^2, and the n^2 factors cancel in the ratio. If the correlation is perfect and positive, and x and y have the same spread, covariance = variance and the slope is 1. If there is no correlation, then the covariance is zero and so is the slope.
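As a rough illustration of the covariance / variance view, here is a minimal Python sketch. The helper names (`mean`, `covariance`, `variance`) and the five data points are made up for this example; the population (divide by n) definitions are assumed, and any common divisor cancels in the ratio anyway.

```python
def mean(values):
    """Arithmetic mean, i.e. sum / n."""
    return sum(values) / len(values)

def covariance(xs, ys):
    """Population covariance: average product of deviations from the means."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)

def variance(xs):
    """Population variance is the covariance of x with itself."""
    return covariance(xs, xs)

# Hypothetical sample data, chosen only for illustration.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]

slope = covariance(xs, ys) / variance(xs)
intercept = mean(ys) - slope * mean(xs)

# Same slope from the sum-based formula in the original question.
n = len(xs)
sum_xy = sum(x * y for x, y in zip(xs, ys))
slope_from_sums = (n * sum_xy - sum(xs) * sum(ys)) / (
    n * sum(x * x for x in xs) - sum(xs) ** 2
)

print(slope, intercept, slope_from_sums)
```

For this data the covariance is 1.2 and the variance is 2, so both routes give a slope of about 0.6 and an intercept of about 2.2.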
 
@Vital I think your approach here is not going to be fruitful. To my knowledge there is no “why” for the individual terms; there is only a “why” for the whole formula. The individual terms are only there because together they achieve the goal of the overall formula. Individually they have no particular importance.

The purpose of the overall formula is to calculate the ##m## and ##b## that minimize the error from ##y=mx +b##. Specifically, we want to find ##m## and ##b## such that ##\frac{\partial}{\partial m}\Sigma r^2=0## and ##\frac{\partial}{\partial b}\Sigma r^2=0## where ##r## is the residual error ##r=y-(mx+b)##. All of those formulas you are looking into are just what you get when you solve these equations.
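For anyone who wants to see those equations solved, here is a sketch of the algebra, using the same ##m##, ##b## and ##r## as above and writing ##n## for the number of data points. Setting the two partial derivatives to zero gives

$$\frac{\partial}{\partial b}\Sigma r^2 = -2\,\Sigma (y - mx - b) = 0 \quad\Rightarrow\quad \Sigma y = m\,\Sigma x + n b$$

$$\frac{\partial}{\partial m}\Sigma r^2 = -2\,\Sigma x\,(y - mx - b) = 0 \quad\Rightarrow\quad \Sigma xy = m\,\Sigma x^2 + b\,\Sigma x$$

The first equation gives the intercept directly:

$$b = \frac{\Sigma y}{n} - m\,\frac{\Sigma x}{n}$$

Substituting this ##b## into the second equation and multiplying through by ##n##:

$$n\,\Sigma xy = m\,n\,\Sigma x^2 + \Sigma x\,\Sigma y - m\,(\Sigma x)^2 \quad\Rightarrow\quad m = \frac{n\,\Sigma xy - \Sigma x\,\Sigma y}{n\,\Sigma x^2 - (\Sigma x)^2}$$

These are the slope and intercept formulas from the original question.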
 
Dale said:
@Vital I think your approach here is not going to be fruitful. To my knowledge there is no “why” for the individual terms, there is only a “why” for the whole formula.
I agree. It started with a very intuitive formula, but then got manipulated so that the parts are not intuitive. The reason for the manipulation was to make it a single-pass calculation through the data, which is easier than the original two-pass formula (a first pass through the data to get the average followed by a second pass to total all the deviations from that average).
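To make the one-pass versus two-pass distinction concrete, here is a minimal Python sketch. The function names and the data are hypothetical, chosen only for illustration; both functions should return the same slope up to floating-point rounding.

```python
def slope_two_pass(points):
    """Intuitive form: first pass for the means, second pass for deviations."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n   # pass 1: averages
    mean_y = sum(y for _, y in points) / n
    # pass 2: total the deviations from those averages
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    return sxy / sxx

def slope_one_pass(points):
    """Manipulated form: a single loop accumulating n, Σx, Σy, Σxy, Σx^2."""
    n = 0
    sum_x = sum_y = sum_xy = sum_xx = 0.0
    for x, y in points:
        n += 1
        sum_x += x
        sum_y += y
        sum_xy += x * y
        sum_xx += x * x
    return (n * sum_xy - sum_x * sum_y) / (n * sum_xx - sum_x ** 2)

# Hypothetical sample data, chosen only for illustration.
points = [(1.0, 2.0), (2.0, 4.0), (3.0, 5.0), (4.0, 4.0), (5.0, 5.0)]

two_pass = slope_two_pass(points)
one_pass = slope_one_pass(points)
print(two_pass, one_pass)
```

One caveat: the one-pass form subtracts two large, nearly equal numbers when the x values are large compared with their spread, so in floating point it can lose precision that the two-pass form keeps.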
 
Thank you very much for your answers and guidance.
 
