Linear Regression, Population, Sample

  • #1
fog37
TL;DR Summary
Understanding the sample best-fit line and its relation to the population best-fit line
Hello,

1) Let's consider a population of 1,000,000 data points with each data point being represented by the pair of values (x,y).
Let's assume that, when plotted on a graph, the 1,000,000 points look like a spread out cloud with an overall positive linear trend. These 1,000,000 points represent the population. The best-fit line calculated using all the 1,000,000 points will have a specific slope and intercept.
Given a particular x value, the y value provided by the computed best-fit line equation will exactly represent the arithmetic average of all the y values of the data points that have the same x value. Is that correct? In essence, the average value of the y variable depends linearly on the value of the x variable.

2) In this case, instead of using all 1,000,000 data points to plot the graph and calculate the best-fit line, we only use a random sample of 100 points. The best-fit line obtained using the 100 random data points is a different line from the best-fit line calculated using the 1,000,000 points. We can take a different sample of 100 random points and the best-fit line will again be different (but similar in intercept and slope to the previous sample line). In essence, both the slope and the intercept, calculated for each different random sample of size 100, are random variables. Very often we can only work with a sample and not with the 1,000,000-point population. Under which conditions will the sample best-fit line be a good approximation of the population best-fit line? The larger the sample, the closer the sample best-fit line will be to the population best-fit line... What conditions must be met to guarantee that the sample line is close to the population line?
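To make that sampling variability concrete, here is a minimal simulation sketch in Python (the population slope 0.5, intercept 2.0, and noise level are invented values for illustration):

```python
# Sketch: compare the population best-fit line with lines fitted to
# repeated random samples of 100 points. All parameters are invented.
import numpy as np

rng = np.random.default_rng(0)

# "Population": 1,000,000 (x, y) points with a positive linear trend.
N_POP = 1_000_000
x_pop = rng.uniform(0, 10, N_POP)
y_pop = 2.0 + 0.5 * x_pop + rng.normal(0, 1.5, N_POP)

# Population best-fit line (least squares over all points).
pop_slope, pop_intercept = np.polyfit(x_pop, y_pop, 1)
print(f"population: slope={pop_slope:.3f}, intercept={pop_intercept:.3f}")

# Each random sample of 100 points gives a slightly different line.
for _ in range(5):
    idx = rng.choice(N_POP, size=100, replace=False)
    slope, intercept = np.polyfit(x_pop[idx], y_pop[idx], 1)
    print(f"sample:     slope={slope:.3f}, intercept={intercept:.3f}")
```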

Thank you!


  • #2
fog37 said:
Given a particular x value, the y value provided by the computed best-fit line equation will exactly represent the arithmetic average of all the y values of the data points that have the same x value. Is that correct?
That is what linear regression does. If you want to use an X value to get the best estimate of the associated Y value, that is the thing to use. There are other calculations (see principal component analysis) that find the line with the smallest sum of squared perpendicular distances from the points to the line.
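As a rough sketch of that distinction (invented data; np.polyfit gives the ordinary least-squares line, and an SVD of the centered data gives the first principal direction):

```python
# Sketch: OLS minimizes squared *vertical* residuals; the first principal
# component minimizes squared *perpendicular* distances. Data are invented.
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(0, 2, 500)
y = 1.0 + 0.8 * x + rng.normal(0, 1, 500)

# Ordinary least squares.
ols_slope, ols_intercept = np.polyfit(x, y, 1)

# PCA / total least squares: first right singular vector of centered data.
X = np.column_stack([x - x.mean(), y - y.mean()])
_, _, vt = np.linalg.svd(X, full_matrices=False)
dx, dy = vt[0]
pca_slope = dy / dx
pca_intercept = y.mean() - pca_slope * x.mean()

# With noise in y, the perpendicular-fit slope is typically a bit steeper.
print(ols_slope, pca_slope)
```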
fog37 said:
What conditions must be met to guarantee that the sample line is close to the population line?
If the 100 sample points have a large variation (scatter off of the line), then you can expect the line calculated from them to deviate substantially from the population line. Standard linear regression software will report this through the ratio of the scatter of the y values off of the line to the total scatter of the 100 points, which is the basis of ##R^2## (see Assessing Goodness-of-Fit in a Regression Model).
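A minimal sketch of that goodness-of-fit ratio (invented data; ##R^2## is one minus the scatter off the line divided by the total scatter):

```python
# Sketch: compute R^2 from the residual and total sums of squares.
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(0, 10, 100)
y = 3.0 + 0.7 * x + rng.normal(0, 2, 100)

slope, intercept = np.polyfit(x, y, 1)
y_hat = slope * x + intercept

ss_res = np.sum((y - y_hat) ** 2)     # scatter of y off the fitted line
ss_tot = np.sum((y - y.mean()) ** 2)  # total scatter of y
print(f"R^2 = {1 - ss_res / ss_tot:.3f}")
```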
 
  • #3
"Under which conditions will the sample best-fit line be a good approximation of the population best fit line?"

What are you using as your measure for assessing "good approximation"? You could make a very simple argument that for any given set of data, as long as you're referring to linear regression AND using least squares as your way to estimate slope and intercept, the resulting fitted line is a good approximation GIVEN the data set you're working with.

"The larger the sample, the closer the sample best-fit line will be the to population best fit line...What are conditions must be met to guarantee that the sample line is close to the population line?"

I'm not sure I agree with that first statement: for me the issue is knowing the quality of the sample, since simply having a larger sample isn't enough to ensure what you say it does. (I know that basic probability says that both the sample intercept and slope will converge in probability to the population values, but I still have my doubts.)
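As a quick check of that parenthetical convergence claim, a simulation sketch (invented population, same style as the sketch in post #1) shows the sample slope settling near the full-data slope as the sample grows:

```python
# Sketch: the least-squares slope from a sample approaches the slope
# fitted on the full (invented) population as the sample size grows.
import numpy as np

rng = np.random.default_rng(3)
N_POP = 1_000_000
x_pop = rng.uniform(0, 10, N_POP)
y_pop = 2.0 + 0.5 * x_pop + rng.normal(0, 1.5, N_POP)
pop_slope, _ = np.polyfit(x_pop, y_pop, 1)

for n in (10, 100, 1_000, 10_000, 100_000):
    idx = rng.choice(N_POP, size=n, replace=False)
    slope, _ = np.polyfit(x_pop[idx], y_pop[idx], 1)
    print(f"n={n:>6}: sample slope={slope:.4f} (population {pop_slope:.4f})")
```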

I think my biggest bit of uncertainty here is the lack of a clear meaning of what you think constitutes a "good approximation".
 
  • #4
It’s not as if the population gives you a ‘true’ regression model in the same sense that you can get a true mean or variance. Typically you are not trying to sample from a population with OLS; rather, you are trying to infer some relationship between variables. In many cases - such as in financial markets - you are not sampling at all: you are using the complete data set. Even with a sample, the tiny differences between sample estimates and population values of intercepts and betas tend to be immaterial compared to the overall error.
 
  • #5
BWV said:
It’s not as if the population gives you a ‘true’ regression model in the same sense that you can get a true mean or variance. Typically you are not trying to sample from a population with OLS; rather, you are trying to infer some relationship between variables. In many cases - such as in financial markets - you are not sampling at all: you are using the complete data set. Even with a sample, the tiny differences between sample estimates and population values of intercepts and betas tend to be immaterial compared to the overall error.
Thank you BWV. I like your explanation. We have a bunch of data points, and the linear regression model attempts to explain the collective behavior of those points, the goal being to find the relation between the dependent and independent variables (simple linear regression). It may be incorrect to call the best-fit line calculated from the whole population the "true" best-fit line.

@statdad raises the question of what "good approximation" means... I would say that, assuming we collect a good simple random sample of size ##N##, the sample best-fit line approximates the population best-fit line better (the betas and intercepts are closer in value) if the sample size ##N## is large (whatever "large" means)...
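As a rough illustration of that last point, here is a sketch (same kind of invented population as in the earlier examples) showing the spread of the sample slope across repeated samples shrinking as ##N## grows, roughly like ##1/\sqrt{N}##:

```python
# Sketch: the standard deviation of the sample slope across repeated
# random samples shrinks roughly like 1/sqrt(N). Population is invented.
import numpy as np

rng = np.random.default_rng(4)
N_POP = 1_000_000
x_pop = rng.uniform(0, 10, N_POP)
y_pop = 2.0 + 0.5 * x_pop + rng.normal(0, 1.5, N_POP)

for n in (25, 100, 400, 1600):
    slopes = []
    for _ in range(200):
        idx = rng.choice(N_POP, size=n, replace=False)
        slopes.append(np.polyfit(x_pop[idx], y_pop[idx], 1)[0])
    print(f"N={n:>5}: std of sample slope = {np.std(slopes):.4f}")
```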
 

What is linear regression?

Linear regression is a statistical method used to model the relationship between a dependent variable and one or more independent variables by fitting a linear equation to observed data. The simplest form is simple linear regression, which deals with just one independent variable. The goal is to predict the value of the dependent variable based on the values of the independent variables.
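A minimal example (invented numbers) of fitting a simple linear regression and predicting from it:

```python
# Sketch: fit y = a + b*x by least squares and predict at a new x.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

b, a = np.polyfit(x, y, 1)  # slope b, intercept a
print(f"fitted line: y = {a:.2f} + {b:.2f} * x")
print(f"prediction at x = 6: {a + b * 6:.2f}")
```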

What is the difference between population and sample in statistics?

In statistics, a population is the entire set of items or individuals from which data can be collected, while a sample is a subset of the population that is actually observed or measured. Sampling is used because it is often impractical or impossible to collect data from every member of a population. The main goal is to select a sample that is representative of the population, allowing for generalizations and conclusions about the population based on the sample data.
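A small sketch (invented numbers) of estimating a population quantity from a simple random sample:

```python
# Sketch: a sample mean from a simple random sample estimates the
# population mean without measuring every member of the population.
import numpy as np

rng = np.random.default_rng(5)
population = rng.normal(50, 10, 1_000_000)  # the entire set of values

sample = rng.choice(population, size=100, replace=False)  # observed subset
print(f"population mean = {population.mean():.2f}")
print(f"sample mean     = {sample.mean():.2f}")
```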

How do you determine the best fit line in linear regression?

The best fit line in linear regression is typically determined using the method of least squares. This method calculates the line that minimizes the sum of the squared differences (residuals) between the observed values and the values predicted by the linear model. By minimizing these residuals, the line of best fit represents the trend of the data as accurately as possible within the constraints of a linear model.
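A short sketch of that computation (invented data); for simple linear regression the minimizer has the closed form ##b = \mathrm{cov}(x,y)/\mathrm{var}(x)## and ##a = \bar{y} - b\bar{x}##:

```python
# Sketch: least-squares slope and intercept from their closed forms,
# cross-checked against NumPy's polynomial least-squares fit.
import numpy as np

rng = np.random.default_rng(6)
x = rng.uniform(0, 10, 200)
y = 1.5 + 0.9 * x + rng.normal(0, 1, 200)

slope = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
intercept = y.mean() - slope * x.mean()

assert np.allclose([slope, intercept], np.polyfit(x, y, 1))
print(f"slope={slope:.3f}, intercept={intercept:.3f}")
```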

What are the assumptions of linear regression?

Linear regression makes several key assumptions, including linearity, independence, homoscedasticity, and normality. Linearity assumes that there is a linear relationship between the dependent and independent variables. Independence assumes that the residuals (errors) between the observed and predicted values are independent. Homoscedasticity implies that the residuals have constant variance at every level of the independent variable. Normality assumes that the residuals are normally distributed.
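A rough sketch of checking two of these assumptions on the residuals (invented data; a Shapiro-Wilk test from SciPy for normality and a crude split-half comparison for homoscedasticity):

```python
# Sketch: simple residual diagnostics, not a formal test suite.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
x = rng.uniform(0, 10, 300)
y = 2.0 + 0.6 * x + rng.normal(0, 1, 300)

slope, intercept = np.polyfit(x, y, 1)
residuals = y - (slope * x + intercept)

# Normality of residuals: Shapiro-Wilk test.
print("Shapiro-Wilk p-value:", stats.shapiro(residuals).pvalue)

# Homoscedasticity (rough check): residual spread should be similar in
# the lower and upper halves of the x range.
lo, hi = residuals[x < 5], residuals[x >= 5]
print(f"residual std: low x = {lo.std():.3f}, high x = {hi.std():.3f}")
```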

How can you assess the accuracy of a linear regression model?

The accuracy of a linear regression model can be assessed using various metrics, including R-squared, adjusted R-squared, mean squared error (MSE), and root mean squared error (RMSE). R-squared measures the proportion of variance in the dependent variable that is predictable from the independent variables. Adjusted R-squared adjusts the R-squared value for the number of predictors in the model. MSE and RMSE measure the average distance between the predicted and actual values; because residuals are squared, both penalize large errors heavily, and RMSE has the advantage of being expressed in the same units as the dependent variable.
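A short sketch computing these metrics by hand (invented data; one predictor, so p = 1 in the adjusted R-squared formula):

```python
# Sketch: R^2, adjusted R^2, MSE, and RMSE for a simple linear fit.
import numpy as np

rng = np.random.default_rng(8)
x = rng.uniform(0, 10, 100)
y = 3.0 + 0.7 * x + rng.normal(0, 2, 100)

slope, intercept = np.polyfit(x, y, 1)
y_hat = slope * x + intercept

n, p = len(y), 1  # sample size and number of predictors
mse = np.mean((y - y_hat) ** 2)
rmse = np.sqrt(mse)
r2 = 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)

print(f"MSE={mse:.3f}  RMSE={rmse:.3f}  R^2={r2:.3f}  adj R^2={adj_r2:.3f}")
```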
