Question about linear regression and sample sizes

In summary, the conversation discusses two methods for fitting a linear regression line to pass rates by age, based on a sample of 5,000 exam takers. Method 1 excludes data points whose sample size falls below a chosen threshold, while method 2 includes all data points but weights them according to their sample size. The first reply concludes that method 2 is more efficient because it uses all the available information.
  • #1
byee614
Consider this situation. There is an exam designed in such a way that it appears that the pass/failure rate of the exam has a linear relationship to the age of the exam taker. The older the test taker, the higher the pass rate. I'm not interested in the exact scores of the exam, only pass or fail.

I randomly sample 5,000 exam takers and record their age and pass/fail. Then I count the number of 35-year-olds, 40-year-olds, 53-year-olds, etc. who passed and plot the pass rate (as a percentage) against age on a graph that looks something like this:

http://img261.imageshack.us/img261/5974/10548526.jpg [Broken]

One notable fact about this survey is that not all ages have the same number of exam takers. For example, there could be 300 35-year-olds but only 15 65-year-olds. Either way, I plot the pass rate for every age.

Let's say I am interested in calculating a linear regression line like the one shown in the graph. Which of the following methods will give the best regression line?

Method 1: Establish a minimum sample size threshold and exclude all data points that do not meet that minimum. For example, exclude data points that are based on fewer than 50 exam takers. If there were only 15 65-year-olds, that data point is excluded because there is a high possibility that this specific point is inaccurate. All remaining data points are weighted equally and a regression line is calculated.

Method 2: Include all data points but weight them according to the sample size that each point is based on. If there are 300 35-year-olds and 15 65-year-olds, the 35-year-old data point is weighted 20 times as heavily as the 65-year-old data point. Calculate a regression line using all the weighted data points.

Which method is better?
 
  • #2
First, note that method 1 is just a rough version of method 2, in that it uses 0 or 1 as the weight. In doing so, it wastes information. Consequently, method 2 is the more efficient.
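
A minimal sketch of that comparison in Python with NumPy. Only the 300 and 15 sample sizes come from the question; the other counts and all pass counts are made up for illustration, and `weighted_line` is a hypothetical helper name, not a library function. It shows method 1 as a weighted fit with weights of 0 or 1, and method 2 as the same fit with the sample sizes as weights. Weighting by sample size is a reasonable choice here because the variance of an observed pass rate is p(1-p)/n, so it shrinks roughly in proportion to 1/n.

```python
import numpy as np

def weighted_line(x, y, w):
    """Fit y ~ intercept + slope*x by minimising sum(w * residual**2)."""
    x, y, w = (np.asarray(v, dtype=float) for v in (x, y, w))
    xbar = np.sum(w * x) / np.sum(w)   # weighted mean of x
    ybar = np.sum(w * y) / np.sum(w)   # weighted mean of y
    slope = np.sum(w * (x - xbar) * (y - ybar)) / np.sum(w * (x - xbar) ** 2)
    intercept = ybar - slope * xbar
    return intercept, slope

# Hypothetical survey summary: one row per age (assumed numbers).
ages = np.array([35, 40, 53, 65])
takers = np.array([300, 120, 60, 15])   # exam takers at each age
passed = np.array([150, 66, 39, 12])    # how many of them passed
rates = passed / takers                 # observed pass rate per age

# Method 1: drop ages with fewer than 50 takers, weight the rest equally.
keep = takers >= 50
method1 = weighted_line(ages[keep], rates[keep], np.ones(keep.sum()))

# Method 2: keep every age, weight each pass rate by its sample size.
method2 = weighted_line(ages, rates, takers)

print("method 1 (intercept, slope):", method1)
print("method 2 (intercept, slope):", method2)
```

Passing `keep.astype(float)` as the weights on the full data gives the same line as method 1, which is the sense in which method 1 is method 2 with weights restricted to 0 or 1.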
 
  • #3
"If there were only 15 65-year-olds, that data point is excluded because there is a high possibility that this specific point is inaccurate. All remaining data points are weighted equally and a regression line is calculated."

I'm not sure why you would think this is a problem, unless you are considering two groups of such differing ages to be different populations: in that case, do two regressions.
 

What is linear regression and how is it used in data analysis?

Linear regression is a statistical method used to model the relationship between two or more variables. It is commonly used to predict the value of a dependent variable based on the values of one or more independent variables. It is often used in data analysis to understand and quantify the relationship between variables and make predictions.

How do I determine the sample size needed for linear regression analysis?

The sample size for linear regression analysis depends on several factors, including the number of variables, the expected effect size, and the desired level of statistical power. Generally, a larger sample size is needed for more complex analyses or when the expected effect size is small. A sample size calculator or consultation with a statistician can help determine the appropriate sample size for a specific analysis.

What are the assumptions of linear regression analysis?

The main assumptions of linear regression analysis include linearity, independence of errors, homoscedasticity (equal error variances), and normality of errors. Because most of these assumptions concern the errors, they are usually checked by examining the residuals of the fitted model before the results are relied on. If the assumptions are not met, alternative methods such as weighted least squares, non-linear regression, or transformation of the data may be necessary.

Can linear regression be used with categorical variables?

Yes, linear regression can be used with categorical variables by converting them into dummy variables. Dummy variables represent the different categories of a categorical variable as binary (0 or 1) values, allowing them to be included in the regression model. However, it is important to note that this approach may have limitations and other statistical methods may be more appropriate for analyzing categorical variables.
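
As a rough illustration of dummy coding, the sketch below uses NumPy with made-up data. The group labels, the response values, and the choice of the first level as the reference category are all assumptions for the example.

```python
import numpy as np

# Hypothetical data: a three-level categorical predictor and a response.
groups = np.array(["a", "b", "c", "a", "b", "c"])
y = np.array([1.0, 3.0, 6.0, 1.0, 3.0, 6.0])

levels = np.unique(groups)  # sorted levels: a, b, c

# One 0/1 column per level except the first, which acts as the reference.
dummies = (groups[:, None] == levels[None, 1:]).astype(float)

# Design matrix: intercept column followed by the dummy columns.
X = np.column_stack([np.ones(len(y)), dummies])

# Ordinary least squares fit.
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

# coef[0] is the mean response of the reference level "a";
# coef[1] and coef[2] are the differences of "b" and "c" from "a".
print(dict(zip(["intercept", "b", "c"], coef)))
```

One level is left out so that the dummy columns are not collinear with the intercept column.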

What are some common pitfalls to avoid when using linear regression?

Some common pitfalls to avoid when using linear regression include overfitting the model, relying solely on statistical significance, and assuming causation from correlation. Overfitting occurs when the model is too complex and fits the data too closely, leading to poor performance on new data. Statistical significance does not necessarily indicate a strong or meaningful relationship between variables. And while linear regression can show a relationship between variables, it does not necessarily imply a causal relationship.
