Stats: How to combine the results of several binary tests

In summary: Sensitivity is the fraction of people who actually have the disorder whose test comes back positive (the true positive rate). Specificity is the fraction of people who do not have the disorder whose test comes back negative (the true negative rate). PPV (positive predictive value) is the fraction of positive test results that are true positives; NPV (negative predictive value) is the fraction of negative test results that are true negatives. For example, if sensitivity and specificity are both 100%, then PPV and NPV are both 100%; at lower values, PPV and NPV also depend on how common the disorder is in the tested population.
  • #1
DanHicks
(I asked this question on Stack Exchange at http://stats.stackexchange.com/questions/13014/how-to-combine-the-results-of-several-binary-tests, but didn't get anything that was helpful to me.)

I'm a programmer (former EE) working on a medical application. My last stats course was in engineering school 38 years ago, so I'm not exactly current on my stats lingo.

I have the results of essentially 18 different binary (in programmer-speak -- ie, yes/no results, with no adjustable threshold) medical tests, each one of which is a (presumably independent) "proxy" measurement for the disorder being tested for. For each of the tests I have statistics (though in a few cases "artistically derived") for # of true positive, false positive, true negative, false negative when compared to the "gold standard", so I can compute specificity, sensitivity, PPV, NPV, etc. (Typical specificity/sensitivity values are, in %, 50/71, 24/85, 29/84, 72/52.) I do not have a collection of results for the entire suite of tests which would show which combination was true for a given specific patient, and I have no real prospect of making such measurements (or any other new measurements), at least not before an initial formula is produced.
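For reference, here is a minimal sketch (in Python; the example counts are made up) of how those per-test statistics follow from the TP/FP/TN/FN counts:

Code:
def test_stats(tp, fp, tn, fn):
    # Standard definitions, computed from the 2x2 counts against the gold standard.
    sensitivity = tp / (tp + fn)  # P(test positive | disorder present)
    specificity = tn / (tn + fp)  # P(test negative | disorder absent)
    ppv = tp / (tp + fp)          # fraction of positive results that are true positives
    npv = tn / (tn + fn)          # fraction of negative results that are true negatives
    return sensitivity, specificity, ppv, npv

# Made-up counts giving roughly the 50/71 specificity/sensitivity case:
print(test_stats(tp=71, fp=50, tn=50, fn=29))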

What I want to do is, given the individual statistics, derive a formula that, for a given set of inputs (list of test results for a single patient), will decide "probably positive", "probably negative", or "ambiguous" (and, to keep the FDA and the medical bean counters happy, it would be nice if the formula had some small degree of rigor). (The whole idea here is to avoid the expensive and uncomfortable "gold standard" test where possible, especially on the negative side.)

The best scheme I've come up with so far is to combine specificities using the formula
Code:
spec_combined = 1 - (1 - spec_1) * (1 - spec_2) * ... * (1 - spec_N)
combine the sensitivities the same way, and then take the ratio
Code:
(1 - sens_combined) / (1 - spec_combined)
Using >> 1 for POSITIVE, << 1 for NEGATIVE, and near 1 for "ambiguous".
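For concreteness, here is that scheme as a small Python sketch (the function and variable names are mine, and exactly which tests' statistics get pooled for a given patient is left as described above):

Code:
import math

def combine_heuristic(sensitivities, specificities):
    # The ad-hoc "noisy-OR"-style pooling described above.
    sens_combined = 1 - math.prod(1 - s for s in sensitivities)
    spec_combined = 1 - math.prod(1 - s for s in specificities)
    # >> 1 suggests POSITIVE, << 1 suggests NEGATIVE, near 1 is ambiguous.
    return (1 - sens_combined) / (1 - spec_combined)

# Example with the four typical specificity/sensitivity pairs quoted earlier:
r = combine_heuristic([0.71, 0.85, 0.84, 0.52], [0.50, 0.24, 0.29, 0.72])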

This works fairly well, though it seems to behave strangely for some inputs, and it clearly lacks any real mathematical/statistical rigor.

So I'm wondering how one should go about deriving the "correct" formula, given the available statistics.
 
  • #2
What you should be doing is Bayesian inference (http://en.wikipedia.org/wiki/Bayesian_inference). In addition to the data you have, you'll also need the frequency of the disorder in the population. (That'll give you "prior probabilities".) If you have those frequencies as a function of age and ethnicity, all the better.

I'm afraid your description of what you're doing was confusing enough that I didn't fully follow it. (Definitions of "specificity, sensitivity, PPV, NPV" would help. I could probably guess at specificity and sensitivity if I really wanted to think about it, but the only thing NPV means to me is "net present value", and PPV is a complete mystery.) It might be that what you're already doing is Bayesian inference, or close, in which case you can now give it a name and make claim to that small degree of rigor you desire. If not, read up!
 
  • #3
Of course, the "population" is people referred for testing, so there's already a suspicion of the disorder that upsets the validity of any frequency measurement over the general population. But we can probably come up with an approximate frequency in the referred population. (It would be on the order of 30%, I suspect.)

I've looked at Bayesian inference, but (with my weak stats background) have not been able to make much out of it, without a few hints at what I should be looking at.

Specificity, et al, I assumed are standard stats terms:

http://en.wikipedia.org/wiki/Sensitivity_and_specificity
http://en.wikipedia.org/wiki/Positive_predictive_value
http://en.wikipedia.org/wiki/Negative_predictive_value
 
  • #4
The idea is actually very simple. You ask, "What is the probability of getting the results I got if the person is sick?", and "What is the probability of getting the results I got if the person is well?" You'll probably assume, unless you have other information, that each test is independent of the others, so the calculations just turn into a multiplication of the appropriate factors. You then multiply those factors by the prior probabilities to get the ratio [itex]\frac{\mathbb{P}\left[\text{sick}\right]}{\mathbb{P}\left[\text{well}\right]}[/itex]. Don't know if this description is clear, but it's actually quite intuitive once you understand. It is now pretty clear to me that this isn't what you're doing, since you would need the stats for all four possible test results, and you only seem to be using two.
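In Python, under the independence assumption, that calculation might look like this (a sketch only; here sens = P(positive | sick) and spec = P(negative | well) for each test, and the names are mine):

Code:
def posterior_odds(results, sens, spec, prior_sick):
    # Multiply the per-test likelihoods of the observed results, then apply the prior.
    p_results_given_sick = 1.0
    p_results_given_well = 1.0
    for positive, se, sp in zip(results, sens, spec):
        p_results_given_sick *= se if positive else (1 - se)   # P(result | sick)
        p_results_given_well *= (1 - sp) if positive else sp   # P(result | well)
    prior_odds = prior_sick / (1 - prior_sick)
    return prior_odds * p_results_given_sick / p_results_given_well  # P[sick]/P[well]

# e.g. the four tests quoted earlier, all positive, with a 30% prior suspicion:
odds = posterior_odds([True, True, True, True],
                      sens=[0.71, 0.85, 0.84, 0.52],
                      spec=[0.50, 0.24, 0.29, 0.72],
                      prior_sick=0.30)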

DanHicks said:
Specificity, et al, I assumed are standard stats terms:
http://en.wikipedia.org/wiki/Sensitivity_and_specificity
http://en.wikipedia.org/wiki/Positive_predictive_value
http://en.wikipedia.org/wiki/Negative_predictive_value
Thanks for the links. Yes, if you were talking to people who earn a living doing exactly this, you probably wouldn't need to define your terms or even your abbreviations. But on a general purpose forum like this, there might be non-experts who know the answer but not the jargon. The folks I talk to generally use the terms false positive, true positive, false negative, and true negative, which seem a lot more obvious to me.
 
  • #5
Like I said, I've got the 4 quantities true positive/false positive/true negative/false negative -- it's just a matter of figuring out how to use them.

Looking at Bayesian inference again, it seems like maybe I can work something out. I'll give it a shot.
 
  • #6
DanHicks said:
Like I said, I've got the 4 quantities true positive/false positive/true negative/false negative -- it's just a matter of figuring out how to use them.
Yes -- I understand. What I meant was that you weren't using them all in your formulas.

DanHicks said:
Looking at Bayesian inference again, it seems like maybe I can work something out. I'll give it a shot.
I think you'll be surprised how obvious it is once you've figured it out.
 
  • #7
What you are doing looks basically right, but there are lots of continuity issues. If you are doing this for something medical, and you don't really understand what you are doing, you should definitely have someone check over your results.

pmsrw3 is right: you are really looking to do something along the lines of Bayesian inference. It sounds like you don't actually have to do the inference, since the results for each *independent* test are handed to you (they may in fact not be independent, which could be very important for your actual results if you mix them).

Generally speaking, you would try to *classify* the patient as sick/healthy/unsure, based on the ratio of the two probabilities p(sick)/p(healthy).

The issue pmsrw3 raised was the frequencies of sick or healthy people in your test population (these are best thought of as proportions and not priors). Once again, those were all measured separately, could be over totally different populations from one another, and could come from totally different populations than the one your code applies the model to. All of those issues could be bad for what you are trying to do.

Anyways, the model you are trying to use given those tables, and assuming independence, is very straightforward.

Write down your tables.

Test 1          Passed   Failed
Truly Sick        p1       f1
Truly Healthy     p2       f2

where p1 + p2 = 1 and f1 + f2 = 1.

Now, for one test, calculate the probability that the person is sick or healthy:
p(Sick | test = Passed) = p1
p(Healthy | test = Passed) = p2
p(Sick | test = Failed) = f1
...

Now, for the patient, assuming all of the tests are independent, calculate the probability of them being sick and the probability of being healthy; the per-test factors are multiplied together. Generally speaking, if you can do the experiment yourself, you don't want all of these different separate studies being mushed into one classifier.

p(sick | N tests) = p(sick | Test1 = Result1) * p(sick | Test2 = Result2) * ...
p(healthy | N tests) = p(healthy | Test1 = Result1) * p(healthy | Test2 = Result2) * ...

Then take the ratio of the two:
R = p(sick | N tests) / p(healthy | N tests)

You have to choose what the "safe" regions are. High R's and very low R's are of course more discerning than R's near 1.
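To make that last step concrete, a tiny sketch of mapping R onto the three labels (the cutoff values here are placeholders to be tuned, not from anything above):

Code:
def classify(r, lo=0.5, hi=2.0):
    # R = p(sick | N tests) / p(healthy | N tests); lo and hi are arbitrary placeholders.
    if r >= hi:
        return "probably positive"
    if r <= lo:
        return "probably negative"
    return "ambiguous"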
 
  • #8
One point about the scenario is that people taking these tests have a "high suspicion" of a positive -- probably in the neighborhood of 10-20% (though of course it will tend to vary perversely -- the more accurate & accessible the tests become the more likely they are to be applied to lower suspicion populations, skewing the stats that feed the formula, and it will vary by practitioner, etc).

But the probability of a "positive" is high enough that we're not dealing with a thin "tail" on the probability curve, which I would interpret as (helpfully) lowering the sensitivity of the scheme to the precise population makeup.
 
  • #9
DanHicks said:
One point about the scenario is that people taking these tests have a "high suspicion" of a positive -- probably in the neighborhood of 10-20% (though of course it will tend to vary perversely -- the more accurate & accessible the tests become the more likely they are to be applied to lower suspicion populations, skewing the stats that feed the formula, and it will vary by practitioner, etc).

But the probability of a "positive" is high enough that we're not dealing with a thin "tail" on the probability curve, which I would interpret as (helpfully) lowering the sensitivity of the scheme to the precise population makeup.
Yes, I agree. If the ratio of your prior probabilities is not much different from 1, they can practically be ignored. (I mean, if you can get something, I would use it, but...) It's when you're doing random testing for conditions with an incidence of 10^-5 that this becomes important.
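A quick worked example of why the prior matters at tiny incidences (numbers are illustrative): with an incidence of 10^-5 and a single test with 99% sensitivity and 99% specificity, a positive result gives posterior odds of roughly 10^-5 × (0.99/0.01) ≈ 10^-3, i.e. about a 0.1% chance of actually being sick. With a prior in the 10-30% range, the same likelihood ratio of 99 is overwhelming.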
 
  • #10
What about mutually exclusive tests?

What if I have mutually exclusive tests? Eg, I can test for red, green, blue, and yellow, and only one of those four can be true. How do I factor that into Bayesian inference?

If I were testing just red or not red and correlating that with my "gold standard" I can see how I'd compute the Bayesian factors, but I suspect it falls apart with mutually exclusive tests.
 
  • #11


DanHicks said:
What if I have mutually exclusive tests? Eg, I can test for red, green, blue, and yellow, and only one of those four can be true. How do I factor that into Bayesian inference?

If I were testing just red or not red and correlating that with my "gold standard" I can see how I'd compute the Bayesian factors, but I suspect it falls apart with mutually exclusive tests.
Could you clarify what you mean by "mutually exclusive tests"? Do you mean each patient is either red, green, blue, or yellow, and you have some tests that, with less than perfect fidelity, report on that? (In that case nothing has really changed -- this was your original problem, but with two possibilities instead of four.) Or do you mean that the tests themselves are mutually exclusive: e.g., if you test a patient for red you can't test him for blue? Or do you mean that the test results are mutually exclusive: the test will always report red, green, blue, or yellow?

Anyway, it doesn't noticeably change anything. You calculate the probability of getting the results you got for each of the (possibly four now) possibilities, then multiply those by the prior probabilities for each.

Bayesian inference is really a very broad framework. It is not restricted to binary tests, and in fact it can even be used when there's a multidimensional continuum of possibilities.
 
  • #12
pmsrw3 said:
Could you clarify what you mean by "mutually exclusive tests"?
I mean each of the four colors has a different correlation to the diagnosis of foot fungus, and if one color is present the others are not. And since they're mutually exclusive they're obviously not "independent".
 
  • #13
The only place we used independence in the above dialog was in calculating how to combine the results of multiple tests. If I understand right, you're saying that one test (or observation, or whatever) gives one of the four results. This doesn't affect its independence from OTHER tests, so it doesn't affect the ability to multiply results from different tests. We assumed mutually exclusive results in the case of the binary test (result always either positive or negative, never both, never neither), so it's even formally similar. In this case, instead of having four numbers (true/false negative/positive), you'll have eight: the probability that a sick person shows red, green, blue, yellow; and the probability that a well person shows red, green, blue, yellow. You get those 8 P's from just the kind of data you already described to us: a collection of cases in which both the color test and the gold standard were applied. If a patient tests red (for instance), you multiply P(sick) by P(red|sick) and you multiply P(well) by P(red|well). It's completely parallel to the binary case.
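A sketch of the four-color version (the eight probabilities below are invented placeholders, just to show the shape of the calculation):

Code:
# P(color | sick) and P(color | well): eight numbers, each row summing to 1.
# These values are made up for illustration.
p_color_given_sick = {"red": 0.50, "green": 0.25, "blue": 0.15, "yellow": 0.10}
p_color_given_well = {"red": 0.10, "green": 0.30, "blue": 0.30, "yellow": 0.30}

def color_odds_factor(observed_color):
    # Factor that multiplies the running P(sick)/P(well) odds,
    # exactly as a binary test result would.
    return p_color_given_sick[observed_color] / p_color_given_well[observed_color]

factor = color_odds_factor("red")  # a "red" observation multiplies the odds by 5 here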

More broadly, independence of tests is not a condition, anyway, as long as you can calculate a probability of getting the results you got. Correlation just means you won't be able to do a simple multiplication of the P's from the distinct tests.
 
  • #14
I don't suppose you have any suggestions as to how to "fudge" non-independent measurements, so they don't muck up an estimate too much? The vast majority of discussions I can find on the topic of non-independence just deal with the obvious cases analogous to drawing cards from a deck.
 
  • #15
I was actually thinking about that. I have some ideas. I don't think it's fundamentally difficult (at least as long as you restrict yourself to pairwise correlations), but it would certainly require data that you said in the OP that you don't have -- test results on an individual basis. That is, you'd want, for each individual, a list of the tests that were done on him/her, and their results. (For this purpose, it might not be necessary that each of these series include the gold standard test, if that would help you get bigger numbers.)
 
  • #16
Basically, for two pairs of tests, I know that if the first test contributes, say, a multiplier of 1.05 to the probability, then the second test will be "influenced" by about 1.025. I suppose I could simply divide the running product by the square root of the first test's multiplier, but I'm not sure how well-behaved that approach would be, and it's hard to apply to other "influence factors".
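Just to make that fudge concrete, a tiny sketch (the 1.05 and 1.025 are the hypothetical multipliers above; this is purely ad hoc):

Code:
import math

m1 = 1.05   # odds multiplier contributed by the first test of the correlated pair
m2 = 1.025  # multiplier from the second test, which partly repeats the same evidence

running_odds = 1.0
running_odds *= m1
# The proposed fudge: divide the running product by the square root of the
# first test's multiplier when applying the correlated second test.
running_odds *= m2 / math.sqrt(m1)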
 
  • #17
DanHicks said:
Basically, for two pairs of tests, I know that if the first test contributes, say, a multiplier of 1.05 to the probability, then the second test will be "influenced" by about 1.025.
Sorry, I don't get that. How do you know this? How do you know that one test influences another at all? How do you know by what factor? You could only know the answers to these questions on the basis of data that you said in your first post you don't have!
 
  • #18
pmsrw3 said:
Sorry, I don't get that. How do you know this? How do you know that one test influences another at all? How do you know by what factor? You could only know the answers to these questions on the basis of data that you said in your first post you don't have!

Call it a hunch.
 
  • #19
Just want to say that I've worked this out fairly well, using Bayesian inference and an ad-hoc mechanism to deal with the observations that are strongly connected.

(Again, mathematical rigor is not so important here so long as the algorithm "behaves" well for most cases.)
 

What is the purpose of combining the results of several binary tests?

The purpose of combining the results of several binary tests is to gain a more comprehensive understanding of a particular phenomenon or research question. By combining the results, researchers can analyze the overall trend or pattern of the data, rather than examining each test individually.

What is the most commonly used method for combining binary test results?

The most commonly used method for combining binary test results is the meta-analysis approach. This involves combining the effect sizes or odds ratios from each individual study and calculating an overall effect size or odds ratio for the entire set of data.
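As a concrete illustration, here is a minimal fixed-effect (inverse-variance) pooling of log odds ratios in Python; the 2x2 counts are invented, and this is just one common meta-analytic approach:

Code:
import math

# Each study: (events_treated, non_events_treated, events_control, non_events_control).
# Counts are invented for illustration.
studies = [(12, 88, 20, 80), (7, 43, 11, 39), (30, 170, 45, 155)]

num = den = 0.0
for a, b, c, d in studies:
    log_or = math.log((a * d) / (b * c))  # per-study log odds ratio
    var = 1/a + 1/b + 1/c + 1/d           # its approximate variance (Woolf's method)
    weight = 1 / var                      # inverse-variance weight
    num += weight * log_or
    den += weight

pooled_or = math.exp(num / den)           # fixed-effect pooled odds ratio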

How does combining binary test results increase the statistical power of a study?

Combining binary test results can increase the statistical power of a study by increasing the sample size and reducing the variability of the data. This can lead to more precise and reliable estimates of the true effect or relationship being studied.

What are the potential limitations of combining binary test results?

One potential limitation of combining binary test results is the heterogeneity of the data. If the individual studies have different methodologies or populations, it may not be appropriate to combine their results. Additionally, publication bias can also skew the results of a meta-analysis if only certain studies are included.

How can researchers address the limitations of combining binary test results?

Researchers can address the limitations of combining binary test results by carefully selecting which studies to include in the analysis, conducting sensitivity analyses to assess the impact of different inclusion criteria, and using statistical methods to account for potential heterogeneity in the data.
