Over-searching error (statistics)

In summary, this thread discusses "over-searching" error in machine learning and statistics. Over-searching occurs when the subset K of candidate models drawn from the search space S is made very large, in the extreme case K = S (exhaustive search). The best model in K then fits the evaluation data D more closely, but it may perform worse on new data that is not in D. The reason noted in the thread is that the maximum score found in a search space is biased, and the bias differs between search spaces containing different numbers of models, so the winning model can be one that fit D partly by chance. The single reply observes that a precise explanation is difficult without quantitative measures.
  • #1
0rthodontist
Science Advisor
I guess this should properly go in the Programming forum, but I think I might get a better response here.

My question is with respect to statistics (context of machine learning) about "over-searching" error. You have a search space S of all possible models, from which you choose a subset K. Then you have some test data D, and you evaluate how well each model in K fits D. You pick the best model in K and use that as your model.

Over-searching says that it is bad to do exhaustive sampling, where K = S. Though the model you end up with fits D better than the model you end up with when K is much smaller than S, for some reason the model when K = S does not work as well when it's tested against new data that's not in D.

I didn't quite catch the reason for this and I still do not understand. I wrote down, "two or more search spaces contain different numbers of models. The maximum scores in each space are biased to different degrees." I understand this but I don't see its relevance to over-searching.
 
  • #2
This is difficult to answer without quantitative measures. The general idea could be: the bigger K is, the better the fit, but the tolerances also become smaller. This means that when the model is checked against additional data, suboptimal solutions can still remain solutions, whereas an optimum is likely to be destroyed.
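
The note quoted in the question ("the maximum scores in each space are biased to different degrees") can be illustrated with a small simulation. This is only a sketch under stated assumptions, not an analysis from the thread: every candidate model is assumed to have the same true score of 0, its score on D is that true score plus Gaussian noise, and the function name `selection_bias` is hypothetical.

```python
import random
import statistics


def selection_bias(num_models, num_trials=500, noise_sd=1.0, seed=0):
    """Average gap between the winner's score on D and its true score.

    Assumption: all candidate models share the same true score (0.0), so
    any apparent advantage of the winner comes only from noise in D.
    """
    rng = random.Random(seed)
    true_score = 0.0
    gaps = []
    for _ in range(num_trials):
        # Observed score on D = true score + sampling noise, one per model in K.
        observed = [true_score + rng.gauss(0.0, noise_sd) for _ in range(num_models)]
        # Pick the best-scoring model and record how optimistic its score is.
        gaps.append(max(observed) - true_score)
    return statistics.mean(gaps)


if __name__ == "__main__":
    for size_of_k in (1, 10, 100, 1000):
        print(size_of_k, round(selection_bias(size_of_k), 2))
```

Under these assumptions the gap grows with the size of K, even though no model is actually better than any other. The best score from a large K is therefore more inflated than the best score from a small K, and the two cannot be compared directly.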
 

1. What is an over-searching error in statistics?

An over-searching error in statistics occurs when a researcher conducts multiple statistical tests on the same data set, increasing the likelihood of finding a significant result by chance alone.

2. How does over-searching affect the results of a study?

Over-searching can lead to false positive results, meaning that a relationship or effect is deemed significant when it is actually due to chance rather than a true effect in the population.
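
As a rough worked example of how quickly chance findings accumulate, assume m independent tests, each run at significance level α, with no true effect anywhere. Independence is an assumption made here for illustration, and the symbols m and α do not come from the thread.

```latex
P(\text{at least one false positive}) = 1 - (1 - \alpha)^{m}
% Example: \alpha = 0.05,\ m = 20
% 1 - 0.95^{20} \approx 1 - 0.358 = 0.642
```

So with 20 such tests there is roughly a 64% chance that at least one comes out "significant" purely by chance.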

3. What are some common methods to address over-searching in statistical analyses?

One common method is to adjust the significance threshold (the alpha level against which each p-value is compared) to account for multiple tests, for example with the Bonferroni correction. Other methods include setting a more stringent alpha level in advance, or conducting a pre-planned analysis with a specific hypothesis instead of multiple exploratory analyses.
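
A minimal sketch of the Bonferroni correction follows. It compares each p-value with alpha divided by the number of tests. The function name `bonferroni_reject` and the p-values in the usage lines are hypothetical, chosen only for illustration.

```python
def bonferroni_reject(p_values, alpha=0.05):
    """Return one boolean per test: True if it stays significant
    after the Bonferroni correction.

    Each p-value is compared with alpha / m, where m is the number of tests.
    """
    num_tests = len(p_values)
    if num_tests == 0:
        return []
    threshold = alpha / num_tests
    return [p <= threshold for p in p_values]


if __name__ == "__main__":
    # Hypothetical p-values from four tests on the same data set.
    example_p_values = [0.001, 0.02, 0.04, 0.3]
    print(bonferroni_reject(example_p_values))
```

With four tests and alpha = 0.05, the per-test threshold becomes 0.0125, so a p-value such as 0.02 that would pass an uncorrected 0.05 cutoff no longer counts as significant.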

4. How can researchers prevent over-searching errors in their studies?

Researchers can prevent over-searching errors by pre-registering their analysis plan, clearly stating their hypotheses and planned statistical tests before collecting data. They can also use more conservative significance thresholds and limit the number of tests conducted.

5. Are there any situations where over-searching may be acceptable?

In some cases, such as exploratory research where there is no specific hypothesis, conducting multiple tests may be appropriate. However, researchers should still be cautious about interpreting significant results and should follow up with confirmatory research to validate the findings.
