Statistics: Cherry picking data?

In summary, the conversation discusses choosing which rock types to sample (the "cherry picking" of the title) in order to reduce the variance of the estimated slope of a regression line. Concentrating the samples in the extreme rock types gives a shorter confidence interval for the slope. The original poster asks whether this violates the independence of the error terms or amounts to cheating. The reply is that it is not cheating if the sampling design is decided in advance and all collected samples are used, although a linear regression on rock categories may not be a meaningful model.
  • #1
Nikitin

Homework Statement


http://www.math.ntnu.no/~haakont/grunnkurs/oppg/eksDes11e.pdf [Broken]
Please take the time to read problem 3 b ("Oppgave 3" b) to understand what I am talking about. Everything is explained there.

The specific part I am curious about:

"Define a 90 % confidence interval for ##\hat{\beta_1}##. Find the actual result of the interval given the
numbers above.
The company discusses whether the confidence interval could have been shorter if they
instead collected three core samples in both rock types 1 and 3, and only one in rock
type 2? Or what if they took one sample in both of rock types 1 and 3, and five samples
in type 2? Please discuss."

Here they ask if it's a good idea to cherry pick the rock types measured in order to reduce the variance of ##\hat{\beta_1}## (the slope of the regression line), and hence make the confidence interval of ##\hat{\beta_1}## smaller.

Homework Equations


##Y_i = \beta_0 + \beta_1 x_i + \epsilon_i##
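For reference, these are the standard least-squares results for this model. They are textbook formulas, not quoted from the exam paper, and they assume independent errors with common variance ##\sigma^2## (normally distributed, for the interval to be exact):

$$\hat{\beta}_1 = \frac{\sum_i (x_i - \bar{x})(Y_i - \bar{Y})}{SS_{xx}}, \qquad SS_{xx} = \sum_i (x_i - \bar{x})^2, \qquad \mathrm{Var}(\hat{\beta}_1) = \frac{\sigma^2}{SS_{xx}}$$

$$\hat{\beta}_1 \pm t_{0.05,\,n-2}\,\frac{s}{\sqrt{SS_{xx}}}, \qquad s^2 = \frac{1}{n-2}\sum_i \left(Y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i\right)^2$$

The second line is the 90 % confidence interval. Its length depends on the design only through ##SS_{xx}## (and through ##n## via the ##t## quantile).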

The Attempt at a Solution



Well, I guess from the math it's quite obvious that ##SS_{xx}## WILL become larger if you collect samples primarily from rock types 1 and 3. Since ##\mathrm{Var}(\hat{\beta_1}) = \sigma^2 / SS_{xx}##, the estimate of the slope then becomes more precise. BUT won't this hurt the assumption that the error terms are independent? Or is it fine as long as the errors are independent and identically normally distributed regardless of rock type?

And isn't it kind of like cheating?

EDIT: Hmm I think I understand now why this trick works. The bigger ##\Delta y## is between the data points, the less will each data point's random error screw up the accuracy of ##\hat{\beta_1}##. OK, I guess I answered my questions after writing everything down on text.
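A minimal sketch of the comparison, assuming the rock types are coded as ##x = 1, 2, 3## and that each alternative design uses 7 core samples in total (3+1+3 and 1+5+1, as in the problem text). The function name `ss_xx` and the variable names are made up for illustration:

```python
from math import sqrt

def ss_xx(xs):
    """Sum of squared deviations of the x values from their mean."""
    x_bar = sum(xs) / len(xs)
    return sum((x - x_bar) ** 2 for x in xs)

# Assumption: rock types coded as x = 1, 2, 3; 7 samples in each design.
design_spread = [1, 1, 1, 2, 3, 3, 3]      # three samples in types 1 and 3, one in type 2
design_clustered = [1, 2, 2, 2, 2, 2, 3]   # one sample in types 1 and 3, five in type 2

ss_spread = ss_xx(design_spread)
ss_clustered = ss_xx(design_clustered)

# For the same n and the same residual standard error s, the confidence
# interval length is proportional to 1 / sqrt(SS_xx).
length_ratio = sqrt(ss_spread / ss_clustered)

print(ss_spread, ss_clustered, length_ratio)
```

Under these assumptions the spread-out design has three times the ##SS_{xx}## of the clustered one, so for the same ##s## the clustered design gives an interval about ##\sqrt{3}## times as long. The observed ##s## is itself random, so in any single data set the ratio will not be exactly this.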
 
  • #2
Nikitin said:
EDIT: Hmm I think I understand now why this trick works. The bigger ##\Delta y## is between the data points, the less will each data point's random error screw up the accuracy of ##\hat{\beta_1}##. OK, I guess I answered my questions after writing everything down on text.
Right. It is not cheating if you decide which samples to take in advance (or at least consider "what would have happened if we did that"), and then use all those samples in the analysis.

I hope no company actually uses such a strange model, however. There is absolutely no reason why a linear regression on rock categories should give anything meaningful.
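The point about fixing the design in advance can be checked with a small simulation. This is only a sketch: the parameter values (`beta0`, `beta1`, `sigma`), the coding of rock types as 1, 2, 3 and the helper names are all hypothetical, and it assumes the linear model with independent normal errors actually holds.

```python
import random
from statistics import fmean, pvariance

def fit_slope(xs, ys):
    """Least-squares slope estimate for the points (xs, ys)."""
    x_bar, y_bar = fmean(xs), fmean(ys)
    ss_xx = sum((x - x_bar) ** 2 for x in xs)
    ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return ss_xy / ss_xx

def simulate(xs, beta0, beta1, sigma, n_rep, rng):
    """Repeat the experiment n_rep times with the x design fixed in advance.

    Returns the mean and variance of the fitted slopes.
    """
    slopes = []
    for _ in range(n_rep):
        ys = [beta0 + beta1 * x + rng.gauss(0.0, sigma) for x in xs]
        slopes.append(fit_slope(xs, ys))
    return fmean(slopes), pvariance(slopes)

rng = random.Random(0)                  # fixed seed for reproducibility
beta0, beta1, sigma = 10.0, 2.0, 1.0    # hypothetical true parameters

spread = [1, 1, 1, 2, 3, 3, 3]          # samples mostly in rock types 1 and 3
clustered = [1, 2, 2, 2, 2, 2, 3]       # samples mostly in rock type 2

mean_spread, var_spread = simulate(spread, beta0, beta1, sigma, 20000, rng)
mean_clustered, var_clustered = simulate(clustered, beta0, beta1, sigma, 20000, rng)

print(mean_spread, var_spread)          # variance should be near sigma**2 / 6
print(mean_clustered, var_clustered)    # variance should be near sigma**2 / 2
```

Both designs should give slope estimates centred on the true `beta1`. Choosing the x values ahead of time changes only the spread of the estimates, not their centre. Discarding samples after seeing their y values is a different matter and is not what this simulation does.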
 

1. What is cherry picking data in statistics?

Cherry picking data in statistics is the practice of selecting and presenting only the data that supports a particular argument or conclusion, while ignoring other data that may contradict it. This can result in a biased or misleading representation of the data.

2. Why is cherry picking data considered a problem in science?

Cherry picking data is considered a problem in science because it goes against the principles of objectivity and impartiality. It can lead to inaccurate conclusions and undermine the credibility of research findings.

3. How can we identify cherry picking in statistical analysis?

One way to identify cherry picking in statistical analysis is by looking for a lack of transparency in the data selection process. If only certain data points are presented without explanation, it may be a sign of cherry picking. Additionally, comparing the presented data to the overall data set can reveal any discrepancies or biases.

4. What are the consequences of cherry picking data?

The consequences of cherry picking data can include false or misleading conclusions, wasted resources on further research, and damage to the credibility of the scientific community. It can also lead to misguided decisions and policies based on incomplete or biased data.

5. How can we avoid cherry picking data in statistical analysis?

To avoid cherry picking data in statistical analysis, researchers should strive for transparency and objectivity in their data selection process. This can include clearly defining the criteria for data inclusion and presenting the complete data set rather than just selected data points. Peer review and replication studies can also help identify and prevent cherry picking in research.
