Question about linear regression and sample sizes

SUMMARY

This discussion centers on the effectiveness of two methods for calculating a linear regression line based on the pass/fail rates of exam takers relative to their ages. Method 1 involves excluding data points with fewer than 50 exam takers, while Method 2 includes all data points but applies weights based on sample sizes. The consensus is that Method 2 is superior as it utilizes all available data and avoids the loss of information inherent in Method 1. The conclusion emphasizes that weighting data points according to sample size leads to a more accurate regression analysis.

PREREQUISITES
  • Understanding of linear regression analysis
  • Familiarity with statistical weighting techniques
  • Knowledge of sample size considerations in research
  • Basic proficiency in data visualization tools
NEXT STEPS
  • Explore advanced linear regression techniques in R or Python
  • Learn about statistical power analysis to determine sample sizes
  • Investigate the use of weighted least squares regression
  • Study data visualization best practices for presenting regression results
USEFUL FOR

Data analysts, statisticians, researchers, and anyone involved in designing studies or interpreting regression analyses will benefit from this discussion.

byee614
Consider this situation. There is an exam designed in such a way that it appears that the pass/failure rate of the exam has a linear relationship to the age of the exam taker. The older the test taker, the higher the pass rate. I'm not interested in the exact scores of the exam, only pass or fail.

I randomly sample 5,000 exam takers and record their age and pass/fail. Then, for each age (35, 40, 53, etc.), I compute the percentage of exam takers who passed and plot that pass rate against age on a graph that looks something like this:

http://img261.imageshack.us/img261/5974/10548526.jpg

One notable fact about this survey is that not all ages have the same number of exam takers. For example there could be 300 35 year olds but only 15 65 year olds. Either way, I plot the pass rate for every age.

Let's say I am interested in calculating a linear regression line like the one shown in the graph. Which of the following methods will give the best regression line?

Method 1: Establish a minimum sample size threshold and exclude all data points that do not meet that minimum. For example, exclude data points that have fewer than 50 exam takers. If there were only 15 65 year olds, that data point is excluded because there is a high possibility that this specific point is inaccurate. All remaining data points are weighted equally and a regression line is calculated.

Method 2: Include all data points but weight them according to the sample size that each point is based on. If there are 300 35 year olds and 15 65 year olds, the 35 year old data point is weighted 20 times heavier than the 65 year old data point. Calculate a regression line using all the weighted data points.
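For reference, Method 2 can be written as a weighted least-squares problem. The notation is introduced here for illustration and is not from the original post: x_i is an age, n_i the number of exam takers of that age, and \hat{p}_i the observed pass rate at that age. The second expression is the standard variance of a sample proportion, which is the usual motivation for weighting by sample size.

```latex
\min_{a,\,b} \; \sum_i n_i \left( \hat{p}_i - a - b\,x_i \right)^2 ,
\qquad
\operatorname{Var}(\hat{p}_i) = \frac{p_i \,(1 - p_i)}{n_i}
```

With n = 300 for the 35 year olds and n = 15 for the 65 year olds, the weight ratio is 300/15 = 20, as described above. Note that weights proportional to n_i match the inverse variance exactly only if p_i(1 - p_i) is roughly the same at every age; that is an assumption, not something the data guarantees.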

Which method is better?
 
First, note that method 1 is just a rough version of method 2, in that it uses 0 or 1 as the weight. In doing so, it wastes information. Consequently, method 2 is the more efficient.
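To make the "0 or 1 as the weight" point concrete, here is a minimal sketch in Python with NumPy. The ages, sample sizes, and pass rates are made-up illustration values, and `weighted_line_fit` is a hypothetical helper written for this example, not a library function. Both methods go through the same weighted fit; only the weight vector differs.

```python
import numpy as np

def weighted_line_fit(x, y, w):
    """Return (slope, intercept) minimizing sum(w * (y - intercept - slope*x)**2)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    w = np.asarray(w, dtype=float)
    # Weighted means of x and y
    x_bar = np.sum(w * x) / np.sum(w)
    y_bar = np.sum(w * y) / np.sum(w)
    # Closed-form weighted least-squares slope and intercept
    slope = np.sum(w * (x - x_bar) * (y - y_bar)) / np.sum(w * (x - x_bar) ** 2)
    intercept = y_bar - slope * x_bar
    return slope, intercept

# Hypothetical data: one row per age group (not from the original survey)
ages = np.array([35, 45, 55, 65])
n_takers = np.array([300, 120, 60, 15])
pass_rates = np.array([0.50, 0.58, 0.66, 0.90])

# Method 1: weight 1 if the group meets the threshold of 50, else weight 0
w_method1 = (n_takers >= 50).astype(float)

# Method 2: weight each group by its sample size
w_method2 = n_takers.astype(float)

fit_method1 = weighted_line_fit(ages, pass_rates, w_method1)
fit_method2 = weighted_line_fit(ages, pass_rates, w_method2)

print("Method 1 (slope, intercept):", fit_method1)
print("Method 2 (slope, intercept):", fit_method2)
```

In this sketch the 65 year old group still influences the Method 2 line, but only in proportion to its 15 exam takers, whereas Method 1 discards it entirely. Weighting the pass rates by sample size also gives the same line as an ordinary least-squares fit on the individual pass/fail records coded as 1 and 0.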
 
"If there were only 15 65 year olds, that data point is excluded because there is a high possibility that this specific point is inaccurate. All remaining data points are weighted equally and a regression line is calculated."

I'm not sure why you would think this is a problem, unless you are considering two groups of such differing ages to be different populations: in that case, do two regressions.
 
