lavoisier
Hi everyone, and happy new year!
A colleague at work came up with this problem today, which I find very interesting.
We are both chemists, so our ability to solve it is limited - here I am asking for your help!
In a target discovery campaign that we run, we have two separate tests, let's call them A and B.
Each test takes a target as input and gives a positive or negative outcome. By previous validation of these tests, we know that the target is more likely to be active when the test is positive, and more likely to be inactive when the test is negative.
First, a number of targets get tested in A. Some of them are positive.
As it's very expensive to work on a target when it's not really active, we want to make sure that these are not false positives, so we submit the positives from test A to test B, and in the end we only keep those that are positive in B as well.
My colleague has doubts about the validity of this approach. In particular he thinks that, as both tests can also give false negatives, there's no point in running the second test, as we might end up removing as many active targets as inactive ones.
I told him that, at a very superficial glance, it's always better to run the second test, because it's much easier for an individual target to be a false positive once than twice - unless A and B are perfectly correlated (which they aren't, because positive targets do sometimes become negative on retesting).
On more careful consideration, though, it seems to me that there is much more to this problem.
First, the set of targets that gets submitted to B is not a random one - it's pre-selected by A. So if I want to know how many false positives I have after B, I can't just multiply the false positive rate of A by that of B, can I?
Second, if I want to know how many false negatives I have after B, i.e. targets that are actually active but fail to show that in B, do I need to know the false negative rate of both A and B?
I tried to express this in probability terms.
Let 'A+' = 'test A is positive', 'A-' = 'test A is negative', and similarly for B, 'T+' = 'the target is active', 'T-' = 'the target is inactive'. Let P(X) be the probability of X.
Then I think what my colleague wants to know is whether:
P(T+ | (A+ ∧ B+)) > P(T+ | A+)
Intuitively I would say that this is the case, provided the false positive rate of each test is below 50%, but I don't know how to calculate the quantity on the left.
I tried to expand it using the classic formula:
P(T+ | (A+ ∧ B+)) = P(T+ ∧ A+ ∧ B+) / P(A+ ∧ B+)
which tells me that I need some information on the correlation between A and B, but I am stuck with P(T+ ∧ A+ ∧ B+). I thought of separating it into P((T+ ∧ A+) ∧ (T+ ∧ B+)), but I don't see how that helps.
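To make the question concrete, here is a toy calculation with completely made-up rates, under the extra assumption that A and B are conditionally independent given the true state of the target - which is exactly the assumption I'm not sure we can make:

```python
# Made-up numbers, for illustration only (not from our actual assays):
p_T = 0.1                    # prior P(T+): fraction of truly active targets
sens_A, sens_B = 0.8, 0.8    # P(A+|T+), P(B+|T+)  (1 - false negative rate)
fpr_A, fpr_B = 0.2, 0.2      # P(A+|T-), P(B+|T-)  (false positive rate)

# If A and B are conditionally independent given the true state T, then
# P(A+ ∧ B+ | T+) = P(A+|T+) * P(B+|T+), and similarly given T-.
num = p_T * sens_A * sens_B
den = num + (1 - p_T) * fpr_A * fpr_B
p_active_given_both = num / den          # P(T+ | A+ ∧ B+)

# For comparison, after test A alone:
p_active_given_A = (p_T * sens_A) / (p_T * sens_A + (1 - p_T) * fpr_A)

print(p_active_given_A)      # ≈ 0.308
print(p_active_given_both)   # = 0.64
```

With these invented numbers the second test raises the posterior from about 0.31 to 0.64 - but only because of the independence assumption, which is the thing I'd like to get rid of (or at least quantify).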
Can anyone please suggest how I should do this? Maybe a decision tree?
If I had a method, I could also calculate stuff like P(T+ | (A+ ∧ B-)), because at the moment we throw away a target that is positive in A and negative in B. This probability would tell us how safe that is.
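For that last quantity, the same kind of toy calculation (again with made-up rates, and again assuming A and B are conditionally independent given the true state) would look like this:

```python
# Made-up rates, for illustration only (assuming A and B conditionally
# independent given the true state T - the assumption in question):
p_T = 0.1                    # prior P(T+)
sens_A, sens_B = 0.8, 0.8    # P(A+|T+), P(B+|T+)
fpr_A, fpr_B = 0.2, 0.2      # P(A+|T-), P(B+|T-)

# P(A+ ∧ B- | T+) = P(A+|T+) * P(B-|T+) = sens_A * (1 - sens_B)
# P(A+ ∧ B- | T-) = P(A+|T-) * P(B-|T-) = fpr_A * (1 - fpr_B)
num = p_T * sens_A * (1 - sens_B)
den = num + (1 - p_T) * fpr_A * (1 - fpr_B)
p_active_given_A_pos_B_neg = num / den   # P(T+ | A+ ∧ B-)

print(p_active_given_A_pos_B_neg)        # = 0.1
```

Curiously, with these symmetric rates a negative in B exactly cancels the positive in A, and the target ends up back at the prior - no more likely to be active than an untested one. Whether that holds for our real (correlated, asymmetric) tests is the open question.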
Thanks!
L