# Probability of selecting accurate results

1. Mar 13, 2014

### geo101

I’m working on ways to try and assess if data selection methods are capable of isolating accurate results (from a control data set) with respect to random selection and to compare the relative performances of different methods.

In the data set we have multiple specimens and each specimen yields multiple results. So in terms of random selection, if randomly selecting 1 result from specimen #1 has a probability P1 of being an accurate result, and we randomly select 1 result from each specimen (with probabilities P1, P2, P3, … Pn, for n specimens), what is the probability (Pf) of having an accurate result in this final data set?

Now suppose we select the data in some way (that we hope/think/pray will reject inaccurate results). At the specimen level the probabilities become P1′, P2′, P3′, … Pm′, where m <= n. What this means is that some specimens yield no acceptable results. The probability of having an accurate result in this final data set is now Pf′.

How would we assess if our selection process is increasing our chances of obtaining an accurate result? What is the best way to compare Pf and Pf′ (or Pf′ and Pf′′ obtained from two different selection processes), and what factors should we consider?

This is where I get into a philosophical debate with one of my colleagues (neither of us are statisticians). His argument is that as long as Pf′ > Pf the data selection is an improvement. My argument is that the significance of the difference between Pf and Pf′ depends on m and that smaller differences require larger m to be important.
His view is that it doesn't matter what m is, as long as the final result is accurate. My opinion is that this is only possible if Pf′ = 1 (i.e., we can reject all inaccurate results) and even then, only if this can be demonstrated to be universally true (I’m pretty sure that is impossible).
Also, I think more of a balance needs to be struck so as to avoid the situation whereby m is so small that the uncertainty of the final result (the average of the selected results) is so large that we cannot do anything meaningful with it.

As I mentioned, neither of us are statisticians, so some help and advice would be very welcome.

Cheers,
geo101

2. Mar 14, 2014

### Staff: Mentor

At least one? Then it is easier to start with the opposite case (what is the probability of no accurate result?) - this is just the product of the probabilities to get no accurate result in 1, in 2 and so on.
Exactly one? I would not expect this.
It looks weird that you just need a single accurate result in such a set, however. Why just 1? How do you know which one it is?
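For the "at least one" reading, the complement rule described above can be sketched in a few lines. This is only an illustration: it assumes the draws from different specimens are independent, and the probabilities in the example are made-up values, not data from the thread.

```python
from math import prod

def prob_at_least_one_accurate(p):
    """P(at least one accurate result) when one result is drawn from each specimen.

    p is a list of per-specimen probabilities P1..Pn of drawing an accurate result.
    Assumes the draws are independent across specimens.
    """
    # Probability that every draw is inaccurate is the product of (1 - Pi).
    return 1.0 - prod(1.0 - p_i for p_i in p)

# Hypothetical per-specimen probabilities, for illustration only.
example_p = [0.2, 0.5, 0.0, 0.9]
print(prob_at_least_one_accurate(example_p))
```

Note that a specimen with Pi = 0 simply contributes a factor of 1 to the product, while a single specimen with Pi = 1 forces the result to 1.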

I have no idea how "what this means" is related to what you wrote before.

The previous case was for general Pi, special Pi do not change anything.

What did you change in the final dataset?

A bit more background would be helpful, I think.

3. Mar 14, 2014

### geo101

That makes sense, I was confused with the case where, say, P1=0, then the product of accurate probabilities is always zero.

That's just my bad wording, I do mean at least one.

What changes between Pf and Pf′ is the number of specimens and the number of results per specimen used to determine the P1, P2, etc.

So for example, let's say we have some original unselected data set that contains, say, 100 specimens and each specimen yields 40 results. Each P1, P2... is based on 40 results and Pf is from 100 specimens.

We then reject some data based on the selection method (to obtain Pf′). We then have, say, 60 specimens that yield at least 1 acceptable result (i.e., 40 specimens yield no acceptable results). In each of these 60 specimens, say, 10 results are acceptable to calculate P1′, P2′... (in practice this varies from specimen to specimen).

1) If we can ignore that the number of results per specimen has changed, given that the number of specimens is different, how should we compare Pf and Pf′?

2) If we cannot ignore that the number of results per specimen has changed, how should we then compare Pf and Pf′?

I hope that this helps clear up what I am looking for.

Cheers

4. Mar 15, 2014

### Staff: Mentor

No, why should it?
Just P1=1 (zero probability to be wrong) gives a special case - you always have at least one correct result.

Yeah, and that looks weird.

So you just take fewer samples? Then the probability to have at least one correct sample cannot increase.

What does "results are acceptable to calculate [...]" mean?

Please give more context to the problem. It is hard to understand what you are trying to do like that.

I still don't see the problem with applying the same formula for both, just with a different n.

5. Mar 15, 2014

### Stephen Tashi

geo101,

I think you'll get better advice if you describe a real world situation. After all, if you aren't a statistician, is it wise to attempt the abstract mathematical description yourself?

6. Mar 17, 2014

### geo101

Let’s start afresh, keeping it simple….

So we have 2D data from a number of specimens and we are interested in the linear slope of these two variables. From each specimen we can make multiple estimates of the slope, depending on the choice of (consecutive) points we use. We have criteria that we use to try and decide which of the possible fits we should choose. The challenge is that these are somewhat arbitrarily defined and I am working on more theoretically justified methods.

For each specimen, when we apply the selection criteria we accept only the fits that pass. So after selection an individual specimen can yield zero, one, or multiple fits. When we have a number of specimens we choose one fit from each and average these to get our final result. In the case where we have multiple fits for a single specimen, we have some additional approaches to decide what fit to choose, but I don’t want to go into these for the moment (back into the realm of arbitrariness). So for simplicity, let’s say that if a specimen yields more than one fit we make a random choice of one fit from those that are accepted.

What I am interested in is the performance of different approaches to data selection, not at the individual specimen level, but when we consider a suite of specimens, which should give a reasonable representation of the variety of bad and good behaviour that we typically see.

What I would like to do has two parts. First, I want to try and assess if the methods we use are better than simple random chance in terms of our ability to accept (reject) accurate (inaccurate) results. So each specimen yields multiple fits (i.e., all possible fits) and we randomly choose one from each. If we then apply some criteria, are we biasing towards accurate results?
Second, I would like to compare the relative performances of two selection methods – compare the arbitrarily defined method with modifications to this method based on my theoretical approach. Does incorporating theoretical constraints help our ability to isolate accurate results?

My initial thoughts were to compare the Pf values above from the different approaches, with higher Pf indicating higher acceptance rates of accurate results. But the different selection methods can yield wildly different numbers of specimens that give acceptable results, which I thought should be something to take into account when making the comparisons.

Thanks again for your interest and help with this.
geo

7. Mar 17, 2014

### Stephen Tashi

To analyze the performance of a method, one would have to understand how performance is measured. I cannot understand that from your description. The general scenario for many statistical problems is that there is some set of quantities that have "actual" values and one is trying to estimate those actual values from the values in some samples. If your problem fits into this pattern, I don't understand what the actual values are.

You have specimens. From one specimen, you take several samples. The samples give you 2D data of some sort. You either compute several numbers estimating "slopes" from that data or one number estimating a single "slope" from it - which isn't clear to me. It isn't clear whether you are trying to measure some global property of all the specimens (such as "the average slope from the population of all specimens that could theoretically be taken") or whether you are trying to estimate a "slope" for each individual specimen. Maybe you are trying to estimate the function that gives the actual distribution for the values of slope?

Last edited: Mar 17, 2014

8. Mar 17, 2014

### geo101

How I am viewing the system is just as a binary accurate/inaccurate system, which for our control data set we can do since we know the expected answer (accurate is defined as being within a factor of 1.1 of the expected value). So what I am looking for is not the distribution of the slopes, but the distribution of the proportions of accurate results accepted, and whether these are significantly higher than under a random selection process.
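To make the binary criterion concrete, here is a minimal sketch. It assumes that "within a factor of 1.1" means the estimate lies between expected/1.1 and expected×1.1 and that both values are positive; the function names are hypothetical.

```python
def is_accurate(estimate, expected, factor=1.1):
    """True if the estimate is within a multiplicative factor of the expected value.

    Assumed reading: expected / factor <= estimate <= expected * factor,
    with both values positive.
    """
    return expected / factor <= estimate <= expected * factor

def proportion_accurate(estimates, expected, factor=1.1):
    """Fraction of slope estimates classified as accurate."""
    flags = [is_accurate(e, expected, factor) for e in estimates]
    return sum(flags) / len(flags)

# Illustrative slope estimates against an expected slope of 1.0.
print(proportion_accurate([1.0, 1.2, 0.95, 0.5], expected=1.0))
```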

For a specimen we do a series of measurements to produce a 2D plot made up of, say 14 points. To get a slope estimate from this we must choose one segment of the plot to make a fit. If we say that for basic line fitting requirements we should use at least 3 points, we have 63 possible fits to choose from.
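As a rough sketch, the candidate fits can be enumerated as every run of consecutive points of at least the minimum length. The function name and the (start, end) index convention are assumptions for illustration.

```python
def contiguous_segments(n_points, min_points=3):
    """Return (start, end) index pairs, inclusive, for every run of consecutive
    points containing at least min_points points."""
    return [(start, end)
            for start in range(n_points)
            for end in range(start + min_points - 1, n_points)]

segments = contiguous_segments(14, min_points=3)
print(len(segments))   # number of candidate fits under this plain enumeration
print(segments[:3])    # first few (start, end) pairs
```

A plain enumeration like this counts (n - m + 1)(n - m + 2)/2 segments for n points and a minimum of m, which does not match the 63 quoted above for 14 points, so that figure presumably reflects some additional constraint on the fits that is not described here.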

Under ideal conditions, all of our data lie perfectly on the expected slope, so it doesn't matter what segment we choose. However, for various physical reasons, over some portions of the plot, the data are unsuitable to use (assumptions of the theory for the experiment are violated). These regions tend to yield inaccurate results and our selection methods are meant to be able to quantitatively identify these so that we choose fits only in the regions where ideal behaviour holds true. So, by applying selection criteria, we limit our choice of segments to fit over. If the criteria are working as we want them to, this limited choice of fitting should have a higher likelihood of yielding accurate results than if we apply no criteria.

Cheers

9. Mar 18, 2014

### Stephen Tashi

Try to be specific about what you mean by "representation" and "behavior". A representation of the variability of something might be its probability distribution or it might be a few parameters of that distribution, like its mean and variance.

A given sampling method can have all sorts of properties that might be represented. For example, 1) The mean number of specimens with at least one accurate result. 2) The mean total number of accurate results from all specimens. 3) The mean number of accurate results per specimen. 4) The mean value of the fraction of accurate results per specimen. 5) The variance of the sample mean value of accurate results per specimen.

10. Mar 18, 2014

### Stephen Tashi

...and in regard to your original post:
You haven't clarified what you mean by "having an accurate result" in a data set. Is an accurate result for a data set of several specimens the event that each specimen has at least one accurate result (or exactly one accurate result, if this scenario assumes each specimen produces only one result)? Or is "an accurate result for the dataset" the event that at least one specimen in the data set has at least one accurate result?

11. Mar 18, 2014

### geo101

What I'm talking about here is the range of physical behaviour affecting the specimens. For some physical behaviour, some specimens will yield only accurate results, no matter how we select the data. Others will yield only inaccurate results. By looking at a large number of control specimens we hope it is reasonable to assume that different types of physical behaviour are present in the proportions that we would see in a practical, non-control, application.

Maybe I am thinking about this in the wrong fashion. Here is how I first started to look at this problem; perhaps this will give you an idea of my thinking and what I'm looking for...

I started by performing a bootstrap sampling of the control data set, by randomly sampling one result from each specimen to generate a pseudo-data set, which represents a random sampling. The proportion of accurate results in this random sampling was P(%). I then applied some selection criteria to the pseudo-data set and accepted Nt results, Ns of which were accurate. This can be viewed as the result of a Bernoulli trial process, so when sampling from a distribution with P proportion of accurate results, I calculate the probability (pr) of obtaining Ns/Nt proportion of accurate results by random chance from the binomial CDF (1-CDF):
$p_r=1- \sum\limits_{i=0}^{N_{s}}{ N_{t} \choose i } P^{i}(1-P)^{N_{t}-i}$
I then repeated this $10^4$ times to build up a distribution of pr values. I then said, if pr > 0.05, at the 5% significance level this is not a significant increase in the acceptance of accurate results over random chance. In one example, using the arbitrary selection, ~30% of the bootstraps were no better than random chance. By adjusting the method according to my theoretical predictions, this drops to ~3% of bootstraps. So it seems like a good start (at least for my predictions).
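For reference, the tail probability in the formula above can be evaluated directly with the standard library. This is a minimal sketch: the function name and the example numbers are hypothetical, and P is entered as a fraction rather than a percentage.

```python
from math import comb

def binom_upper_tail(n_s, n_t, p):
    """p_r = 1 - sum_{i=0}^{n_s} C(n_t, i) p^i (1-p)^(n_t-i).

    This is P(X > n_s) for X ~ Binomial(n_t, p), i.e. the chance of seeing
    more than n_s accurate results among n_t accepted ones if each accepted
    result were accurate with probability p.
    """
    cdf = sum(comb(n_t, i) * p**i * (1.0 - p)**(n_t - i) for i in range(n_s + 1))
    return 1.0 - cdf

# Hypothetical bootstrap replicate: 30 of 40 accepted results are accurate,
# against a background proportion P = 0.5.
print(binom_upper_tail(30, 40, 0.5))
```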

The problem is, this is not really how we apply the criteria. Instead of sampling and then applying the criteria, we apply the criteria and then only sample from specimens that pass them. What has made me uncertain about applying the above approach is how to quantify P from a random selection. In the above approach P changes with each pseudo-data set, but here would it simply be the proportion of accurate results from all possible fits? Could I then simply bootstrap using only specimens that yield at least one acceptable fit to build up the distribution of pr values?

Thanks again for your input, and patience!

Cheers

12. Mar 18, 2014

### Stephen Tashi

I see why your analysis isn't a straightforward statistics 101 problem.

OK. The practical consequence is that we will assume sampling from the control specimens is sufficient to accurately simulate sampling from the wide world of specimens that you will eventually deal with.

This suggests "the proportion of accurate results" produced by a sampling method is a property of interest. It also suggests that a method produces one result per specimen (even though the method may consider several candidate results for a specimen before producing a final result).

That computes the probability that the proportion of accurate results is greater than Ns/Nt.

You got a distribution of values for $p_r$. What single value for $p_r$ did you use to conclude $p_r < 0.05$? Or did you test whether $p_r < 0.05$ for each value of $p_r$ and thus have a distribution of significance test results?

====

We want to compare two sampling methods. Let's call them "the selective method" and "the random method".

The general goal is to do a "hypothesis test" to determine whether the selective method is better than the random method with respect to the proportion of successful (i.e. "accurate") results and perhaps with respect to some other properties.

One complication of the problem is that "selective methods" may decline to analyze some of the specimens submitted to them. In hypothesis testing, one usually assumes a "null hypothesis" that implements the notion of a process that works "just by random chance". We have the question of how to implement a method that selects certain specimens to analyze "just by random chance". The phrase "select some specimens to analyze just by random chance" doesn't specify a particular mathematical algorithm for randomly selecting a set of specimens.

Another complication is precisely defining "better". For example, the selective method might do worse than the random method on more sets of specimens, but on those where it did better it might do so much better that the mean proportion of correct results from the selective method would be higher than the mean proportion of correct results from the random method.

Another question about "better" is whether it should consider the variety of samples analyzed. Suppose a method always selects just one specimen from a collection of specimens and always gives correct results for that single specimen. Is that "better" than a less accurate method that is willing to deal with more specimens?

Hypothesis testing is a subjective procedure (it is merely a procedure, it isn't a mathematical proof.) So I think we should discuss the above questions in a general way before talking about specific math.

Your problem causes me to imagine another problem. Suppose we are analyzing specimens of ore to determine if they contain precious stones or not. An analysis method on the ore produces 1 of 3 results, which are "yes", "no", "don't know". The "yes" and "no" answers of the method are not completely reliable. How do we compare two analysis methods? If we only intend to prospect one claim then we would be happy with a method that always selected a specimen and gave a correct "yes" answer for it (when such a specimen exists). Instead of being a lone prospector, suppose we ran a mining company. Then we might have resources to explore several claims. So we would want a method that maximized the expected return on our exploration activity given the costs of exploring the claims selected or the constraint on the number of claims we can explore. If we were an academic and not concerned with material gain, our goal might be to estimate the mean number of locations in the world that contain precious stones. So we would be concerned with getting an accurate estimate of this number rather than only selecting specimens that contained precious stones.

13. Mar 18, 2014

### geo101

Thanks for your comments. Today is a little busy, so I don't have time to respond in detail.

I'll digest what you have said and get back to you tomorrow.

Cheers
geo101

14. Mar 20, 2014

### geo101

Finally, some time to respond!

In the interests of simplicity, I think that this is a necessary assumption. For various physical reasons, that I'm not going to go into, this assumption may not be entirely valid.

Yes it is.

This is a physical possibility, but one that is difficult to unravel and assess. These experiments are difficult and time-consuming, so large-scale systematic control studies are not common. My control data set is somewhat of a patchwork quilt. For example, the data set consists of experiments on slightly different materials and with slightly different experimental setups, so we could subdivide on these grounds as the selective method may perform better on certain subgroups. The problem then becomes the number of data per subgroup.

This is then further compounded by variety and ambiguity in physical mechanisms ("non-ideal factors") that lead to bad data. So, for example, a selective method that is extremely sensitive to a particular non-ideal factor may perform well on a subgroup, but poorly on another. Under ideal conditions, we would be able to tailor the selective method to be best suited to a particular data set with knowledge of what non-ideal factors are influencing it. Sadly, unambiguously distinguishing non-ideal factors is all but impossible.

Under these conditions, my approach would be to try and make an assessment that takes some consideration of the general applicability of a selective method. I think selective methods that give the most rounded performance (in terms of discriminating against non-ideal factors) are probably going to be the most useful. In this sense quantifying overall performance from a varied control data set is the best approach. (Until we get a lot more control data).

I think this is a situation that we have to avoid. Our final result would then be the "average" of a single value, so what is our confidence in the mean?

As a bare minimum, we should be able to quantify our desired final result (i.e., the mean), so we need at least 1 result, but we should also be able to quantify the uncertainty/confidence interval around that value. So at an absolute minimum we need at least 2 results. Of course the uncertainty from 2 results will be potentially very big. So we would need some minimum acceptance rate, or some other constraint on acceptable data.

How I have been thinking about this is in terms of striking a balance between the ability to accept accurate results (sensitivity) and the ability to reject inaccurate results (specificity). Before I knew much about them I just multiplied them to get a "score" in the range [0,1]. Zero means all accurate results are rejected, or all inaccurate are accepted. One is the ideal result: all accurate results accepted, all inaccurate ones rejected. Having read more around the topic, I realize that there are many other ways of ranking using a single valued statistic.
In the case you present above, specificity is identically 1, and as long as sensitivity is > 0, the result is acceptable. In a case like this perhaps enforce a minimum acceptable sensitivity?
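The product score described above could be sketched as follows. This is only an illustration: the function names and the boolean-list input format are assumptions, and it requires at least one accurate and one inaccurate result in the control data so that both rates are defined.

```python
def sensitivity_specificity(accepted, accurate):
    """Sensitivity and specificity of a selection method on control data.

    accepted[i] is True if result i passed the selection criteria;
    accurate[i] is True if result i is known to be accurate.
    Assumes at least one accurate and at least one inaccurate result.
    """
    true_pos = sum(a and t for a, t in zip(accepted, accurate))
    true_neg = sum((not a) and (not t) for a, t in zip(accepted, accurate))
    n_accurate = sum(accurate)
    n_inaccurate = len(accurate) - n_accurate
    return true_pos / n_accurate, true_neg / n_inaccurate

def product_score(accepted, accurate):
    """Score in [0, 1]: sensitivity multiplied by specificity."""
    sens, spec = sensitivity_specificity(accepted, accurate)
    return sens * spec

# Illustrative flags for four results.
accepted = [True, True, False, False]
accurate = [True, False, True, False]
print(product_score(accepted, accurate))
```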

Much of this is based around my experiences and my opinion of what is "better". We have a community workshop coming up in the next couple of months and these are concepts that I want to raise there. I think some community consensus on this is needed, but I want to be well armed with ideas and examples.

A problem you have worked on before?

Cheers!

15. Mar 20, 2014

### Stephen Tashi

A way to begin is to ask: "Does the selective method do better than random chance on those specimens that the selective method selects?"

One scenario for an analysis is to think of a contest defined as follows. Method A selects N specimens to analyze and produces a result from each specimen. Method B operates on the specimens that A selected and produces a result from each specimen. The winner of the contest is the method that gets the most accurate results. If we don't want to allow ties, we can say that a tied score is settled by the toss of a fair coin.

If method B is "random selection of a result from a specimen" (which I hope you can define precisely) and method A is, according to our "null hypothesis", also an equivalent method then, by symmetry, the probability that A wins = 1/2 and the probability that B wins = 1/2.

If you simulate K contests then you can use the binomial distribution to determine the probability that A wins M or more times. If you simulate many, many contests, you could do the calculation by approximating the binomial distribution with a normal distribution. The number of times that A wins could be used as a statistic. There is some number M_0 such that the probability that M > M_0 is about 0.05 (if you like 0.05 as a number for "rejecting" the null hypothesis).

That would be a simple place to start. It doesn't use information about the margin by which one method wins.
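A minimal sketch of that binomial calculation, using only the standard library. The function names are hypothetical; `critical_wins` returns the smallest M with P(wins >= M) <= alpha under the null hypothesis, so M_0 in the notation above would be that value minus one.

```python
from math import comb

def prob_at_least(m, k, p=0.5):
    """P(X >= m) for X ~ Binomial(k, p): chance that A wins m or more of k contests."""
    return sum(comb(k, i) * p**i * (1 - p)**(k - i) for i in range(m, k + 1))

def critical_wins(k, alpha=0.05):
    """Smallest number of wins M such that P(X >= M) <= alpha when p = 1/2."""
    for m in range(k + 2):  # m = k + 1 gives probability 0, so the loop always returns
        if prob_at_least(m, k) <= alpha:
            return m

k = 20
m_crit = critical_wins(k)
print(m_crit, prob_at_least(m_crit, k))
```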

16. Mar 23, 2014

### geo101

Let me see if I have the practical application straight in my head...
We apply Method A and end up with N specimens that yield acceptable results. Each of these specimens yields at least 1 acceptable result. Our analysis procedure would then be
1) For each of the N specimens select 1 result from each specimen from the acceptable (Method A) results and determine the test statistic (Stat(A)).
2) For the same specimens, select 1 result from each specimen from all possible fits (Method B, the random approach). Determine the test statistic (Stat(B)).
3) If Stat(A) > Stat(B), Method A wins. If Stat(A) = Stat(B), toss a coin. Else Method A loses.
4) Repeat steps 1-3 k times and count the number of wins for Method A (WINS).
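Steps 1-4 could be simulated along these lines. This is a sketch under stated assumptions: each specimen is represented as a list of `(accurate, accepted)` boolean pairs (a hypothetical data layout), and the test statistic is taken to be the number of accurate results, as in the contest described in post #15.

```python
import random

def run_contests(specimens, k, rng):
    """Count how many of k contests Method A wins.

    specimens: list of specimens, each a list of (accurate, accepted) pairs,
               one pair per possible fit (hypothetical layout).
    Method A draws one fit from the accepted fits of each selected specimen;
    Method B draws one fit from all fits of the same specimens.
    """
    # Only specimens with at least one accepted fit take part.
    selected = [fits for fits in specimens if any(acc for _, acc in fits)]
    accepted_fits = [[f for f in fits if f[1]] for fits in selected]
    wins = 0
    for _ in range(k):
        stat_a = sum(rng.choice(fits)[0] for fits in accepted_fits)
        stat_b = sum(rng.choice(fits)[0] for fits in selected)
        if stat_a > stat_b:
            wins += 1
        elif stat_a == stat_b and rng.random() < 0.5:
            wins += 1  # tie settled by a fair coin
    return wins

# Synthetic control set: 30 specimens with 20 fits each, random flags.
rng = random.Random(1)
specimens = [[(rng.random() < 0.4, rng.random() < 0.5) for _ in range(20)]
             for _ in range(30)]
print(run_contests(specimens, k=1000, rng=rng))
```

In the synthetic example the accepted flag is unrelated to accuracy, so Method A should win roughly half of the contests.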

This is where I am not too sure. Do we set k to some moderate value, say 20, and repeat the above steps, say $10^3$ times and assess the distribution of WINS? That is, the distribution of Method A wins in 20 applications.

I know this depends somewhat on what information we want to feed into our assessment, which depends on what my community views as relevant info, but....
By limiting the application of Method B to the N acceptable specimens from Method A, does it not imply that these N specimens are somehow "special" for Method B too? Method B should be akin to applying "no selection process", so in this respect all specimens are equal. An alternative approach for Method B would be to choose 1 result from each of any N specimens. How would this change the assessment you outlined, if at all?

17. Mar 24, 2014

### Stephen Tashi

You do.

It would be simpler to set k to a very large value so you could use a normal distribution to approximate the binomial distribution.

We must clarify how the selective method selects. What is a realistic model for this? Does it select a deterministic N things from the total collection of specimens? Or is it presented with a random subset of N_s things from the total collection and allowed to select N things from that? I assume the number N of things it selects would vary depending on what it had to choose from.

Yes, the selected specimens might be special for method B. For example, method B might be more likely to get an accurate result by random selection if A tends to select specimens where a majority of the possible results for such a specimen are "accurate". So if method A excelled at selecting "easy" specimens to analyze, it would still have to do better than chance on those specimens to beat method B.

That's a good idea.

It would change the null hypothesis. The null hypothesis would be that the selective method is no better than selecting the same number of specimens that the selective method selects at random and picking a result from each specimen at random.

18. Mar 24, 2014

### geo101

The selective method assesses all specimens and all fits to those specimens. In this case we have 400 specimens. If no results are accepted from a specimen, that specimen is rejected. Specimens are only accepted if at least 1 fit passes the selection process. So we accept N <= 400 specimens.
In most cases the number of fits that we have is > 1. For these specimens, we select, at random, 1 of the acceptable fits.

Given that selection is applied to all specimens, I guess it is more appropriate for the "no selection" approach to also be applied to all specimens. In which case, we simply select, at random, 1 fit from each of the 400 specimens to represent "no selection".

I wonder if it is possible to use a Monte Carlo analysis to estimate the significance of the margin of difference based on N specimens. Perhaps by making our test statistic the difference between Stat(A) and Stat(B) and not simply whether Stat(A) > Stat(B). I need to think this through a little more.

Cheers

19. Mar 24, 2014

### geo101

If we use a Normal distribution to approximate the distribution of wins for a particular method, for a binomially distributed variable that follows B(k, p), where k is our very large number of samples and p is the underlying probability of success, the normal approximation is $\mathcal{N}$(kp, kp(1-p)). Suppose we do this for the success of Method B to yield an estimate of pB.
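As a quick check on how close the approximation is, the exact binomial tail can be compared with the normal one. This sketch uses only the standard library; the continuity correction of 0.5 is an added assumption not mentioned above, and the numbers in the example are illustrative.

```python
from math import comb, erf, sqrt

def binom_tail_exact(m, k, p):
    """Exact P(X >= m) for X ~ B(k, p)."""
    return sum(comb(k, i) * p**i * (1 - p)**(k - i) for i in range(m, k + 1))

def binom_tail_normal(m, k, p):
    """Approximate P(X >= m) using N(kp, kp(1-p)) with a continuity correction."""
    mean = k * p
    sd = sqrt(k * p * (1 - p))
    z = (m - 0.5 - mean) / sd
    # Upper tail of the standard normal, written via the error function.
    return 0.5 * (1.0 - erf(z / sqrt(2.0)))

# Illustrative values: 220 or more successes in 400 trials with p = 0.5.
print(binom_tail_exact(220, 400, 0.5), binom_tail_normal(220, 400, 0.5))
```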

If Method B represents our “no selection” scenario, then pB should be an estimate of the probability of success from the population, which, for simplicity, we will assume to be exactly the probability of success from the population. So our Method B is a random sampling of 400 values from B(400, pB). Under our null hypothesis, any other method, say Method A, is a random sampling of N values from B(N, pB).

For each realization of applying Method A we can then apply the Bernoulli trial principles to assess the probability (pr) of >= Ns accurate results from a random sampling from B(N, pB). (As described in post #11).
We choose some significance level, say 5%, as our threshold for random behaviour. From the distribution of pr values we determine the proportion that are no better than random chance (e.g., pr > 5%). We then have to decide how large this proportion should be to define an "ineffective method".

Last edited: Mar 24, 2014

20. Mar 25, 2014

### Stephen Tashi

I think it's simpler not to introduce an estimate of $p_B$. (Of course, most of the time when practical statistical analysis is done, several different methods are tried.) If you merely assume the two methods have the same probability of success, then the probability that each wins is 1/2, by symmetry.

If you study the problem intensely, you'll become interested in the "power" of whatever statistical test you use. The "power" of a test is a function, not a single number - although there are contexts where people assert a single value of this function as being "the" power of a test.

21. Mar 26, 2014

### geo101

I've been busy programming up some code to crunch through both of these tests for a range of methods. It turns out both give pretty similar results in terms of the number of bootstraps where the selective method beats the random selection.

The method I suggested does tend to give a slightly higher number of occurrences where the selective methods are not better than the random selection.

Is it not important to assess the size of the difference in the proportion of accurate results accepted, particularly when the selective method only accepts a small number of specimens? Or is this something that is better factored into a power analysis?

This is something to be thinking about. Let's say our null hypothesis is that Method A has the same probability of success (accepting the same proportion of accurate results) as a random selection process (Method B). And let pA and pB represent the underlying success rates of the populations sampled by Method A and B respectively.

I guess that the power would be a function of N, the number of specimens accepted by Method A, and the difference between pA and pB. So as N decreases, the difference between pA and pB needs to be larger to achieve the same level of power.
So how would we go about assessing this?

Would it be as simple as taking some fixed pB, say the estimate for our data, then sampling (a very large number of times) from distributions with varying pA (> pB) and N, and counting the number of samples where we correctly identify that our estimate of pA ($\hat{p}_A$) is larger than pB? Then construct a power contour plot in pA-N space. Or perhaps $\hat{p}_A$-N space?
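One way this simulation could look is sketched below. It assumes that "correctly identify" means rejecting the null at the 5% level with a one-sided exact binomial test of the accepted results against a fixed pB; the function names, grid values and number of simulations are all hypothetical.

```python
import random
from math import comb

def upper_tail(n_s, n, p):
    """P(X >= n_s) for X ~ B(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(n_s, n + 1))

def critical_count(n, p_b, alpha=0.05):
    """Smallest count c with P(X >= c) <= alpha under the null rate p_b."""
    for c in range(n + 2):  # c = n + 1 has tail probability 0
        if upper_tail(c, n, p_b) <= alpha:
            return c

def simulated_power(p_a, p_b, n, alpha=0.05, n_sim=2000, rng=None):
    """Fraction of simulated data sets (n specimens, true rate p_a) in which
    the one-sided test against p_b rejects the null."""
    rng = rng or random.Random()
    c = critical_count(n, p_b, alpha)
    rejections = 0
    for _ in range(n_sim):
        n_accurate = sum(rng.random() < p_a for _ in range(n))
        if n_accurate >= c:
            rejections += 1
    return rejections / n_sim

# Coarse grid in pA-N space for a fixed, illustrative pB = 0.3.
rng = random.Random(42)
for n in (10, 50, 100):
    print(n, [simulated_power(p_a, 0.3, n, rng=rng) for p_a in (0.4, 0.5, 0.6)])
```

The printed rows are the raw material for a contour plot: each row is one value of N, each column one value of pA.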

Cheers

22. Mar 27, 2014

### Stephen Tashi

Technically, "power" is defined for "a" statistical test rather than "a family" of statistical tests. The variable in the power curve is something that indexes a family of probability models. In a manner of speaking, the variable quantifies the degree of false-ness of the null hypothesis. For example, if the null hypothesis is that the probability of model_A beating model_B on a randomly chosen specimen is 1/2 then we can consider the case where the probability of model_A beating model_B is 1/2 + x. The power curve plots the probability of rejecting the null hypothesis as a function of 1/2 + x.

You could plot a 3D "power surface" as (x,N, prob. of rejecting null hypothesis). The variable N is of interest. Here N does not represent a decision by the tester "Let sample size be N" - i.e. it is not a parameter set by the experimenter to pick a particular statistical test from a family of statistical tests. Instead it is an index that helps select a particular probability model that makes the null hypothesis false. When you describe such a plot, you should emphasize that N is a property of a model being tested rather than something the tester can set arbitrarily.
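For a single fixed test, the power curve in x can be computed exactly. The sketch below assumes the test is the one-sided binomial test on the number of wins in k contests at the 5% level, with each contest won by A with probability 1/2 + x; the function names are hypothetical, and it covers only the x axis, since adding N would need a model of how the number of accepted specimens changes the win probability.

```python
from math import comb

def tail(m, k, p):
    """P(X >= m) for X ~ B(k, p)."""
    return sum(comb(k, i) * p**i * (1 - p)**(k - i) for i in range(m, k + 1))

def power_curve(k, xs, alpha=0.05):
    """Probability of rejecting the null (win probability 1/2) when the true
    win probability is 1/2 + x, for each x in xs."""
    # Critical number of wins fixed under the null hypothesis.
    m_crit = next(m for m in range(k + 2) if tail(m, k, 0.5) <= alpha)
    return [tail(m_crit, k, 0.5 + x) for x in xs]

xs = [0.0, 0.1, 0.2, 0.3]
for x, pw in zip(xs, power_curve(20, xs)):
    print(x, round(pw, 3))
```

At x = 0 the value is the actual size of the test, which is at most alpha because the number of wins is discrete.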

23. Apr 11, 2014

### geo101

Been away for a period, so sorry for the lateness. Thanks for all the comments and suggestions. I have to put this problem down for a little while (other priorities), but no doubt I will be back with other questions!

Cheers
Geo