Question about linear regression and sample sizes

byee614 · Jun 4, 2009

Consider this situation. There is an exam designed in such a way that it appears that the pass/failure rate of the exam has a linear relationship to the age of the exam taker. The older the test taker, the higher the pass rate. I'm not interested in the exact scores of the exam, only pass or fail.

I randomly sample 5,000 exam takers and record their age and pass/fail. Then, for each age (35, 40, 53, and so on), I count how many takers of that age passed and plot the pass rate against age on a graph that looks something like this:

http://img261.imageshack.us/img261/5974/10548526.jpg

One notable fact about this survey is that not all ages have the same number of exam takers. For example there could be 300 35 year olds but only 15 65 year olds. Either way, I plot the pass rate for every age.

Let's say I am interested in calculating a linear regression line like the one shown in the graph. Which of the following methods will give the best regression line?

Method 1: Establish a minimum sample size threshold and exclude all data points that do not meet that minimum. For example, exclude data points that have fewer than 50 exam takers. If there were only 15 65 year olds, that data point is excluded because there is a high possibility that this specific point is inaccurate. All remaining data points are weighted equally and a regression line is calculated.

Method 2: Include all data points but weight them according to the sample size that each point is based on. If there are 300 35 year olds and 15 65 year olds, the 35 year old data point is weighted 20 times heavier than the 65 year old data point. Calculate a regression line using all the weighted data points.
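For reference, Method 2 can be written as a weighted least-squares problem. The notation is introduced here and is not from the original post: $x_i$ is an age, $\hat p_i$ the observed pass rate at that age, and $n_i$ the number of exam takers at that age.

$$
\min_{a,\,b}\ \sum_i n_i \left(\hat p_i - a - b\,x_i\right)^2
$$

$$
b = \frac{\sum_i n_i\,(x_i - \bar x_w)(\hat p_i - \bar p_w)}{\sum_i n_i\,(x_i - \bar x_w)^2},
\qquad
a = \bar p_w - b\,\bar x_w,
\qquad
\bar x_w = \frac{\sum_i n_i x_i}{\sum_i n_i},
\quad
\bar p_w = \frac{\sum_i n_i \hat p_i}{\sum_i n_i}
$$

The usual motivation is the binomial variance of an observed pass rate, where $\pi_i$ is the true pass probability at age $x_i$:

$$
\operatorname{Var}(\hat p_i) = \frac{\pi_i (1 - \pi_i)}{n_i}
$$

At the same pass probability, the standard error of a rate based on 15 takers is $\sqrt{300/15} = \sqrt{20} \approx 4.5$ times that of a rate based on 300 takers. Weights proportional to $n_i$ approximate inverse-variance weights only to the extent that $\pi_i(1-\pi_i)$ does not vary much across ages.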

Which method is better?

EnumaElish · Jun 4, 2009

First, note that Method 1 is just a rough version of Method 2, in that it uses only 0 or 1 as the weight. In doing so, it throws away the information in the excluded data points. Consequently, Method 2 is the more efficient of the two.
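A minimal sketch of this point, using NumPy. The counts are invented for illustration, and `weighted_line` is a hypothetical helper, not something from the thread. It checks that Method 1 is the same fit as a weighted regression with 0/1 weights, and that Method 2 with weights equal to the group sizes reproduces an ordinary least-squares line fitted to the individual pass/fail records.

```python
import numpy as np

def weighted_line(x, y, w):
    """Return (slope, intercept) minimizing sum(w * (y - intercept - slope * x)**2)."""
    x, y, w = (np.asarray(a, dtype=float) for a in (x, y, w))
    x_bar = np.sum(w * x) / np.sum(w)  # weighted mean of x
    y_bar = np.sum(w * y) / np.sum(w)  # weighted mean of y
    slope = np.sum(w * (x - x_bar) * (y - y_bar)) / np.sum(w * (x - x_bar) ** 2)
    return slope, y_bar - slope * x_bar

# Hypothetical data: number of exam takers and number who passed at each age.
ages = np.array([35, 40, 45, 53, 60, 65])
takers = np.array([300, 220, 150, 80, 40, 15])
passed = np.array([150, 121, 90, 52, 28, 12])
rates = passed / takers

# Method 1: drop ages with fewer than 50 takers, then fit an unweighted line.
keep = takers >= 50
m1_by_filter = np.polyfit(ages[keep], rates[keep], 1)  # [slope, intercept]

# The same fit expressed as a weighted regression with weights 0 or 1.
m1_as_weights = weighted_line(ages, rates, keep.astype(float))

# Method 2: keep every age, weight each pass rate by its number of takers.
m2 = weighted_line(ages, rates, takers)

# Ordinary least squares on the individual records (1 = pass, 0 = fail).
ind_age = np.repeat(ages, takers)
ind_pass = np.concatenate(
    [np.repeat([1.0, 0.0], [p, n - p]) for n, p in zip(takers, passed)]
)
ols_individual = np.polyfit(ind_age, ind_pass, 1)

print("Method 1 (filter):      ", m1_by_filter)
print("Method 1 (0/1 weights): ", m1_as_weights)
print("Method 2 (weights = n): ", m2)
print("OLS on individuals:     ", ols_individual)
```

In this sketch Method 2 uses every sampled exam taker, while Method 1 ignores the 55 takers in the two smallest age groups. Note that `np.polyfit` also accepts a `w` argument, but it multiplies the unsquared residuals, so weights proportional to sample size would have to be passed as `np.sqrt(takers)`.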

statdad · Jun 5, 2009

"If there were only 15 65 year olds, that data point is excluded because there is a high possibility that this specific point is inaccurate. All remaining data points are weighted equally and a regression line is calculated."

I'm not sure why you would think this is a problem, unless you are considering two groups of such differing ages to be different populations: in that case, do two regressions.
