Question about linear regression and sample sizes

byee614 · Jun 4, 2009

Consider this situation. There is an exam designed in such a way that it appears that the pass/failure rate of the exam has a linear relationship to the age of the exam taker. The older the test taker, the higher the pass rate. I'm not interested in the exact scores of the exam, only pass or fail.

I randomly sample 5,000 exam takers and record their age and pass/fail. Then I count how many exam takers of each age (35, 40, 53, etc.) passed and plot the pass rate against age, giving a graph that looks something like this:

http://img261.imageshack.us/img261/5974/10548526.jpg

One notable fact about this survey is that not all ages have the same number of exam takers. For example there could be 300 35 year olds but only 15 65 year olds. Either way, I plot the pass rate for every age.

Let's say I am interested in calculating a linear regression line like the one shown in the graph. Which of the following methods will give the best regression line?

Method 1: Establish a minimum sample size threshold and exclude all data points that do not meet that minimum. For example, exclude data points that have fewer than 50 exam takers. If there were only 15 65 year olds, that data point is excluded because there is a high possibility that this specific point is inaccurate. All remaining data points are weighted equally and a regression line is calculated.

Method 2: Include all data points but weight them according to the sample size that each point is based on. If there are 300 35 year olds and 15 65 year olds, the 35 year old data point is weighted 20 times heavier than the 65 year old data point. Calculate a regression line using all the weighted data points.

Which method is better?

EnumaElish · Jun 4, 2009

First, note that method 1 is just a rough version of method 2, in that it uses 0 or 1 as the weight. In doing so, it wastes information. Consequently, method 2 is the more efficient.
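To make the comparison concrete, here is a minimal sketch in plain Python. It assumes "weighting" means minimizing the sum of w_i * (rate_i - a - b * age_i)^2 over the intercept a and slope b, with w_i equal to the number of exam takers at that age for method 2. The function name `weighted_line_fit`, the cutoff of 50, and all the numbers are made up for illustration; they are not taken from the survey.

```python
def weighted_line_fit(x, y, w):
    """Fit y ~ a + b*x by minimizing sum(w_i * (y_i - a - b*x_i)**2).

    Returns (intercept, slope). Points with weight 0 have no influence.
    """
    sw = sum(w)
    # Weighted means of x and y
    xbar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    ybar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    # Weighted sums of squares and cross products about the means
    sxx = sum(wi * (xi - xbar) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - xbar) * (yi - ybar) for wi, xi, yi in zip(w, x, y))
    slope = sxy / sxx
    intercept = ybar - slope * xbar
    return intercept, slope


# Hypothetical data: one entry per age (illustrative numbers only)
ages = [35, 40, 53, 65]
takers = [300, 120, 60, 15]   # exam takers at each age
passed = [150, 66, 39, 12]    # how many of them passed
rates = [p / n for p, n in zip(passed, takers)]

# Method 1: drop ages below the cutoff, weight the remaining points equally
threshold = 50
kept = [(a, r) for a, r, n in zip(ages, rates, takers) if n >= threshold]
m1 = weighted_line_fit([a for a, _ in kept],
                       [r for _, r in kept],
                       [1.0] * len(kept))

# The same fit written as a 0/1 weighting of all points
zero_one = [1.0 if n >= threshold else 0.0 for n in takers]
m1_as_weights = weighted_line_fit(ages, rates, zero_one)

# Method 2: keep every point, weight by the number of exam takers
m2 = weighted_line_fit(ages, rates, takers)

print("method 1 (intercept, slope):", m1)
print("method 1 as 0/1 weights:    ", m1_as_weights)
print("method 2 (intercept, slope):", m2)
```

The first two printed lines should agree up to rounding, which is the sense in which method 1 is a 0/1 version of method 2. Weighting by the count is only approximately inverse-variance weighting, since the variance of a pass rate also depends on the rate itself.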

statdad · Jun 5, 2009

"If there were only 15 65 year olds, that data point is excluded because there is a high possibility that this specific point is inaccurate. All remaining data points are weighted equally and a regression line is calculated."

I'm not sure why you would think this is a problem, unless you are considering two groups of such differing ages to be different populations: in that case, do two regressions.
