Statistics: Cherry picking data?

  • Thread starter: Nikitin
  • Tags: Data, Statistics
SUMMARY

The discussion centers on the implications of cherry-picking data samples in the context of defining a 90% confidence interval for the slope coefficient ##\hat{\beta_1}## in a linear regression model. Participants analyze whether collecting more samples from certain rock types (1 and 3) while minimizing samples from others (type 2) could effectively reduce the variance of ##\hat{\beta_1}##. The consensus indicates that while this approach may yield a smaller confidence interval, it raises concerns about the independence of error terms and the validity of the regression model. Ultimately, the discussion concludes that pre-selecting samples is not inherently dishonest if done transparently.

PREREQUISITES
  • Understanding of linear regression analysis and the role of slope coefficients.
  • Familiarity with confidence intervals and their calculation.
  • Knowledge of statistical independence and error term conditions.
  • Basic concepts of variance and its impact on regression accuracy.
NEXT STEPS
  • Study the calculation of confidence intervals in linear regression using R or Python.
  • Explore the implications of sample selection bias in statistical analysis.
  • Learn about the assumptions of linear regression, particularly regarding error terms.
  • Investigate alternative sampling methods and their effects on regression outcomes.
USEFUL FOR

Statisticians, data analysts, and researchers involved in regression analysis, particularly those interested in the effects of sampling strategies on statistical inference.

Nikitin

Homework Statement


http://www.math.ntnu.no/~haakont/grunnkurs/oppg/eksDes11e.pdf
Please take the time to read problem 3 ("Oppgave 3") b to understand what I am talking about. Everything is explained there.

The specific part I am curious about:

"Define a 90 % confidence interval for ##\hat{\beta_1}##. Find the actual result of the interval given the
numbers above.
The company discusses whether the confidence interval could have been shorter if they
instead collected three core samples in both rock types 1 and 3, and only one in rock
type 2? Or what if they took one sample in both of rock types 1 and 3, and five samples
in type 2? Please discuss."

Here they ask if it's a good idea to cherry pick the rock types measured in order to reduce the variance of ##\hat{\beta_1}## (the slope of the regression line), and hence make the confidence interval of ##\hat{\beta_1}## smaller.

Homework Equations


##Y_i = \beta_0 + \beta_1 x_i + \epsilon_i##
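
For reference, these are the standard textbook results for this model, written out as a sketch. They assume independent errors ##\epsilon_i \sim N(0, \sigma^2)## with the same ##\sigma## for every rock type, which is the usual assumption and may be worded differently in the linked exam. Here ##s^2 = SSE/(n-2)## is the usual estimate of ##\sigma^2## and ##t_{0.05,\,n-2}## is the upper 5 % point of the t-distribution with ##n-2## degrees of freedom.

```latex
\hat{\beta_1} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(Y_i - \bar{Y})}{SS_{xx}},
\qquad
SS_{xx} = \sum_{i=1}^{n} (x_i - \bar{x})^2,
\qquad
\operatorname{Var}(\hat{\beta_1}) = \frac{\sigma^2}{SS_{xx}}

% 90 % confidence interval for the slope
\hat{\beta_1} \pm t_{0.05,\, n-2} \, \frac{s}{\sqrt{SS_{xx}}}
```

For a given ##s## and ##n##, the interval length is proportional to ##1/\sqrt{SS_{xx}}##, so the only thing the choice of sampling locations changes is ##SS_{xx}##.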

The Attempt at a Solution



Well, I guess from the math it's quite obvious that ##SS_{xx}## WILL become larger if you collect samples primarily from rock types 1 and 3, and hence the estimate of the slope becomes more accurate. BUT won't the independence of the error terms condition be hurt by this? Unless they are identically normally distributed regardless of rock type?

And isn't it kind of like cheating?

EDIT: Hmm I think I understand now why this trick works. The bigger ##\Delta y## is between the data points, the less will each data point's random error screw up the accuracy of ##\hat{\beta_1}##. OK, I guess I answered my questions after writing everything down on text.
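
To put numbers on this, here is a minimal sketch comparing the two sampling plans from the problem text. It assumes the rock types are coded as x = 1, 2, 3 (my assumption, the exam data are not reproduced in this thread) and uses the standard result that the interval half-width is ##t_{0.05,\,n-2}\, s/\sqrt{SS_{xx}}##. The names `ss_xx`, `design_a` and `design_b` are just illustrative.

```python
from math import sqrt


def ss_xx(xs):
    """Sum of squared deviations of the x-values from their mean."""
    x_bar = sum(xs) / len(xs)
    return sum((x - x_bar) ** 2 for x in xs)


# Assumption: rock types are coded as x = 1, 2, 3.
design_a = [1, 1, 1, 2, 3, 3, 3]  # three samples in types 1 and 3, one in type 2
design_b = [1, 2, 2, 2, 2, 2, 3]  # one sample in types 1 and 3, five in type 2

ss_a = ss_xx(design_a)
ss_b = ss_xx(design_b)

# Both plans have n = 7, so they share the same t quantile (5 degrees of
# freedom). For the same residual estimate s, the half-width scales as
# 1 / sqrt(SS_xx), so this ratio says how much wider plan B's interval is.
width_ratio = sqrt(ss_a / ss_b)

print(f"SS_xx, plan A: {ss_a}")
print(f"SS_xx, plan B: {ss_b}")
print(f"plan B interval / plan A interval (same s): {width_ratio:.3f}")
```

Under that coding, the plan that concentrates on types 1 and 3 has the larger ##SS_{xx}## and therefore the shorter interval. The value of ##s## will differ from one data set to the next, so the ratio describes the comparison for equal ##s##, not any particular sample.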
 
Nikitin said:
EDIT: Hmm I think I understand now why this trick works. The bigger ##\Delta y## is between the data points, the less will each data point's random error screw up the accuracy of ##\hat{\beta_1}##. OK, I guess I answered my questions after writing everything down on text.
Right. It is not cheating if you decide which samples to take in advance (or at least consider "what would have happened if we did that"), and then use all those samples in the analysis.
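
A small simulation can illustrate the point. This is only a sketch under stated assumptions: rock types coded as x = 1, 2, 3, errors that are independent and normal with the same spread in every rock type, and made-up true values (`TRUE_INTERCEPT`, `TRUE_SLOPE`, `SIGMA`) that do not come from the exam. The sampling plan is fixed before any data are drawn and every sample is used in the fit.

```python
import random

# Hypothetical true parameters, chosen only for illustration.
TRUE_INTERCEPT = 1.0
TRUE_SLOPE = 2.0
SIGMA = 1.0
TRIALS = 20000


def fit_slope(xs, ys):
    """Least-squares slope estimate S_xy / SS_xx."""
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    return s_xy / s_xx


def simulate_slopes(xs, trials, rng):
    """Refit the slope on many simulated data sets for a fixed design xs."""
    slopes = []
    for _ in range(trials):
        ys = [TRUE_INTERCEPT + TRUE_SLOPE * x + rng.gauss(0.0, SIGMA) for x in xs]
        slopes.append(fit_slope(xs, ys))
    return slopes


def mean_and_variance(values):
    """Sample mean and (population-style) variance of a list of numbers."""
    m = sum(values) / len(values)
    v = sum((val - m) ** 2 for val in values) / len(values)
    return m, v


rng = random.Random(0)  # fixed seed so the run is repeatable

# Assumption: rock types are coded as x = 1, 2, 3.
design_a = [1, 1, 1, 2, 3, 3, 3]  # mostly types 1 and 3
design_b = [1, 2, 2, 2, 2, 2, 3]  # mostly type 2

mean_a, var_a = mean_and_variance(simulate_slopes(design_a, TRIALS, rng))
mean_b, var_b = mean_and_variance(simulate_slopes(design_b, TRIALS, rng))

# Theory for comparison: Var = SIGMA^2 / SS_xx, with SS_xx = 6 and 2 here.
print(f"plan A: mean slope {mean_a:.3f}, variance {var_a:.3f}")
print(f"plan B: mean slope {mean_b:.3f}, variance {var_b:.3f}")
```

In both plans the slope estimates should centre on the true slope, with the spread being smaller for the plan that puts more samples in types 1 and 3. If the error spread actually differed between rock types, the equal-variance assumption used here would not hold and the comparison would need to be redone.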

I hope no company actually uses such a strange model, however - there is absolutely no reason why a linear regression to rock categories should give anything meaningful.
 