Statistics: Cherry picking data?

Nikitin · May 17, 2014

Homework Statement

http://www.math.ntnu.no/~haakont/grunnkurs/oppg/eksDes11e.pdf
Please take the time to read problem 3 ("Oppgave 3") b to understand what I am talking about. Everything's explained there.

The specific part I am curious about:

"Define a 90 % confidence interval for ##\beta_1##. Find the actual result of the interval given the numbers above. The company discusses whether the confidence interval could have been shorter if they instead collected three core samples in both rock types 1 and 3, and only one in rock type 2? Or what if they took one sample in both of rock types 1 and 3, and five samples in type 2? Please discuss."

Here they ask if it's a good idea to cherry pick the rock types measured in order to reduce the variance of ##\hat{\beta_1}## (the estimated slope of the regression line), and hence make the confidence interval for ##\beta_1## narrower.

Homework Equations

##Y_i = \beta_0 + \beta_1 x_i + \epsilon_i##
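
For reference, these are the standard simple-linear-regression results that the rest of the thread relies on, assuming independent errors ##\epsilon_i \sim N(0, \sigma^2)##. The notation (##SS_{xx}##, ##s##) is the usual textbook one and may differ from the exam's:

$$\operatorname{Var}(\hat{\beta_1}) = \frac{\sigma^2}{SS_{xx}}, \qquad SS_{xx} = \sum_{i=1}^{n} (x_i - \bar{x})^2$$

$$\hat{\beta_1} \pm t_{0.05,\,n-2}\,\frac{s}{\sqrt{SS_{xx}}}, \qquad s^2 = \frac{1}{n-2}\sum_{i=1}^{n}\left(y_i - \hat{\beta_0} - \hat{\beta_1} x_i\right)^2$$

So for a fixed ##n## and a given ##s##, the width of the 90 % interval scales as ##1/\sqrt{SS_{xx}}##.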

The Attempt at a Solution

Well, I guess from the math it's quite obvious that ##SS_{xx}## WILL become larger if you collect samples primarily from rock types 1 and 3, so the variance of ##\hat{\beta_1}## shrinks and the estimate of the slope becomes more accurate. BUT won't the condition that the error terms are independent be hurt by this? Unless they are identically normally distributed regardless of rock type?

And isn't it kind of like cheating?

EDIT: Hmm I think I understand now why this trick works. The bigger the spread in ##x## (and hence ##\Delta y##) between the data points, the less each data point's random error will screw up the accuracy of ##\hat{\beta_1}##. OK, I guess I answered my own questions by writing everything down.
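
A quick numerical check of the two alternative designs in the problem. This is only a sketch: it assumes the rock types are coded as ##x = 1, 2, 3## (the coding is not stated in this thread), and the function name `ss_xx` is made up for illustration. Both designs use 7 samples, so the t-quantile is the same, and for a given residual standard error the interval width is proportional to ##1/\sqrt{SS_{xx}}##.

```python
def ss_xx(xs):
    """Sum of squared deviations of the x values from their mean."""
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** 2 for x in xs)

# Assumed coding: rock type k is the regressor value x = k.
design_a = [1, 1, 1, 2, 3, 3, 3]  # three samples in types 1 and 3, one in type 2
design_b = [1, 2, 2, 2, 2, 2, 3]  # one sample in types 1 and 3, five in type 2

ss_a = ss_xx(design_a)
ss_b = ss_xx(design_b)

# For the same residual standard error s and the same n,
# CI width is proportional to 1 / sqrt(SS_xx).
width_ratio = (ss_b / ss_a) ** 0.5

print("SS_xx, design A:", ss_a)
print("SS_xx, design B:", ss_b)
print("CI width of A relative to B:", width_ratio)
```

Under these assumptions, putting the samples at the extremes (design A) gives the larger ##SS_{xx}## and therefore the shorter interval, while concentrating them in type 2 (design B) does the opposite.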

mfb · May 18, 2014

Nikitin said:

EDIT: Hmm I think I understand now why this trick works. The bigger the spread in ##x## (and hence ##\Delta y##) between the data points, the less each data point's random error will screw up the accuracy of ##\hat{\beta_1}##. OK, I guess I answered my own questions by writing everything down.

Right. It is not cheating if you decide which samples to take in advance (or at least consider "what would have happened if we did that"), and then use all those samples in the analysis.

I hope no company actually uses such a strange model, however - there is absolutely no reason why a linear regression on rock categories should give anything meaningful.
