Question about linear regression and sample sizes

byee614
Consider this situation. There is an exam designed in such a way that it appears that the pass/failure rate of the exam has a linear relationship to the age of the exam taker. The older the test taker, the higher the pass rate. I'm not interested in the exact scores of the exam, only pass or fail.

I randomly sample 5,000 exam takers and record their age and pass/fail. Then I count the number of 35, 40, 53, etc year olds who passed and plot the pass rate percentage against age on a graph that looks something like this:

http://img261.imageshack.us/img261/5974/10548526.jpg

One notable fact about this survey is that not all ages have the same number of exam takers. For example, there could be 300 35-year-olds but only 15 65-year-olds. Either way, I plot the pass rate for every age.

Let's say I am interested in calculating a linear regression line like the one shown in the graph. Which of the following methods will give the best regression line?

Method 1: Establish a minimum sample size threshold and exclude all data points that do not meet that minimum. For example, exclude data points that have fewer than 50 exam takers. If there were only 15 65 year olds, that data point is excluded because there is a high possibility that this specific point is inaccurate. All remaining data points are weighted equally and a regression line is calculated.

Method 2: Include all data points but weight them according to the sample size that each point is based on. If there are 300 35-year-olds and 15 65-year-olds, the 35-year-old data point is weighted 20 times as heavily as the 65-year-old data point. Calculate a regression line using all the weighted data points.

Which method is better?
 
First, note that method 1 is just a rough version of method 2, in that it uses 0 or 1 as the weight. In doing so, it wastes information. Consequently, method 2 is the more efficient.
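
To make the comparison concrete, here is a minimal sketch in Python of both methods as weighted least squares. The ages, counts and pass rates are made-up illustration values, not data from the survey, and `weighted_line_fit` is a hypothetical helper name. Weighting by sample size is motivated by the fact that the variance of an observed pass rate based on n takers is p(1-p)/n, so points with larger n are more reliable; the sketch uses n itself as the weight, as method 2 describes.

```python
import numpy as np

def weighted_line_fit(x, y, w):
    """Fit y ~ intercept + slope*x by minimising sum(w_i * (y_i - fit_i)**2)."""
    x, y, w = (np.asarray(v, dtype=float) for v in (x, y, w))
    x_bar = np.sum(w * x) / np.sum(w)   # weighted mean of x
    y_bar = np.sum(w * y) / np.sum(w)   # weighted mean of y
    slope = np.sum(w * (x - x_bar) * (y - y_bar)) / np.sum(w * (x - x_bar) ** 2)
    intercept = y_bar - slope * x_bar
    return intercept, slope

# Hypothetical per-age summary (illustration only, not the poster's data).
ages = np.array([35, 40, 53, 65])
n_takers = np.array([300, 120, 60, 15])
pass_rate = np.array([0.50, 0.55, 0.70, 0.60])

# Method 1: drop ages with fewer than 50 takers, weight the rest equally.
keep = n_takers >= 50
method1 = weighted_line_fit(ages[keep], pass_rate[keep], np.ones(keep.sum()))

# Method 2: keep every age, weight each point by its number of takers.
method2 = weighted_line_fit(ages, pass_rate, n_takers)

# Method 1 is the same fit as using weights of 0 or 1 on the full data set.
method1_as_weights = weighted_line_fit(ages, pass_rate, keep.astype(float))

print("method 1 (intercept, slope):", method1)
print("method 2 (intercept, slope):", method2)
print("method 1 equals 0/1 weighting:", np.allclose(method1, method1_as_weights))
```

The last comparison shows the point made above: thresholding is just a weighted fit whose weights are restricted to 0 or 1, so it discards whatever information the small groups carry.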
 
"If there were only 15 65 year olds, that data point is excluded because there is a high possibility that this specific point is inaccurate. All remaining data points are weighted equally and a regression line is calculated."

I'm not sure why you would think this is a problem, unless you are considering two groups of such differing ages to be different populations: in that case, do two regressions.
 