# How to measure the 'agreement' between two assays?

Hi everyone,
I'm trying to find a test to compare two assays, and I'm not sure which one I should use. Could you please help?
Here's the situation.
My company is setting up two assays, A and B.
Both assays are supposed to measure the same property of certain items, i.e. we would expect A and B to give broadly the same value of the property when applied to the same item.

We plan to test N=100000 items.
However, we can only run either A or B on all of these items, because running both is too expensive.
A is considered less accurate, but is also less expensive than B, so if there is enough 'agreement' between A and B, we would prefer to run A.
We are therefore trying to measure if and to what extent A and B are 'agreeing'.
The plan is to select a random subset of the 100000 items, say 5000, run both A and B on them and compare the results, in particular looking at whether the same items did indeed give broadly the same assay result in both assays.

How would you analyse this?
I was thinking of using the concordance correlation coefficient (CCC), or maybe the rank correlation coefficient.
Is this appropriate?
And if so, can a significance be calculated for these coefficients, like there is one for the 'standard' linear correlation coefficient?

My other question is: are 5000 items out of 100000 sufficient to give us enough confidence in the 'agreement' that we observe?
E.g. how would we calculate the minimum number n of items to pre-test in both assays to reach a significance of p<0.05? But isn't the significance related to the actual coefficient, which one doesn't know beforehand? If so, how is it even possible to estimate n?
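
One common way around the "unknown coefficient" problem is to assume a plausible value for the coefficient and look at how wide its confidence interval would be for a given n. The sketch below does this for the ordinary Pearson coefficient using Fisher's z-transform; it assumes roughly bivariate normal data, the value r = 0.8 is made up, and the function name `pearson_ci` is only illustrative. It does not cover the CCC, which has its own variance formula.

```python
import math
from statistics import NormalDist

def pearson_ci(r, n, level=0.95):
    """Approximate confidence interval for a Pearson correlation (Fisher z-transform)."""
    z = math.atanh(r)                           # Fisher transform of the coefficient
    se = 1.0 / math.sqrt(n - 3)                 # approximate standard error of z
    q = NormalDist().inv_cdf(0.5 + level / 2)   # two-sided normal quantile
    return math.tanh(z - q * se), math.tanh(z + q * se)

for n in (50, 500, 5000):
    lo, hi = pearson_ci(0.8, n)                 # 0.8 is an assumed, made-up coefficient
    print(f"n={n:5d}: 95% CI for r=0.8 is ({lo:.3f}, {hi:.3f})")
```

Read this way, the question shifts from "is the correlation significant" (with thousands of items almost any non-zero correlation will be) to "is the interval narrow enough for the decision we need to make".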

Thank you!
L

mfb
Mentor
It ultimately depends on what you want to do with the result.
A test sample of 5000 (5%) doesn't sound bad, but depending on the cost of the tests, the precision you want and so on, a larger or smaller sample could be better.

You can calculate a significance of correlation between the tests, i.e. you can show that they do measure something related, but that is not a very strong statement.

I guess you want to use test A to get results as precise as possible. In that case, you can calibrate A using B in that sample of 5000: at least correct for the mean and standard deviation (if they differ notably, it might be worth exploring that in more detail). Run the CCC test afterwards; it should now be equivalent to the normal correlation coefficient because the means and standard deviations agree.
If you know more about your items, you can see if those things are correlated with the property you want to measure, to potentially improve the calibration of A.
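
The calibration idea can be checked numerically. The sketch below uses Lin's concordance correlation coefficient, CCC = 2*cov / (var_A + var_B + (mean_A - mean_B)^2), on synthetic data; the data-generating numbers and the helper names (`moments`, `match_mean_sd`) are assumptions made for illustration only.

```python
import math
import random

def moments(x, y):
    """Means, population variances and covariance of two paired samples."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    vx = sum((a - mx) ** 2 for a in x) / n
    vy = sum((b - my) ** 2 for b in y) / n
    cxy = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    return mx, my, vx, vy, cxy

def pearson(x, y):
    _, _, vx, vy, cxy = moments(x, y)
    return cxy / math.sqrt(vx * vy)

def ccc(x, y):
    """Lin's concordance correlation coefficient."""
    mx, my, vx, vy, cxy = moments(x, y)
    return 2 * cxy / (vx + vy + (mx - my) ** 2)

def match_mean_sd(x, y):
    """Rescale x so that it has the same mean and SD as y."""
    mx, my, vx, vy, _ = moments(x, y)
    scale = math.sqrt(vy / vx)
    return [(a - mx) * scale + my for a in x]

# Synthetic PIN values: B is the truth plus small noise; A is offset, stretched and noisier.
rng = random.Random(0)
truth = [rng.gauss(20, 15) for _ in range(5000)]
b = [t + rng.gauss(0, 3) for t in truth]
a = [1.3 * t + 10 + rng.gauss(0, 8) for t in truth]

a_cal = match_mean_sd(a, b)
print(f"Pearson r       : {pearson(a, b):.4f}")
print(f"CCC, raw A vs B : {ccc(a, b):.4f}")
print(f"CCC, calibrated : {ccc(a_cal, b):.4f}")
```

Note that this matches the mean and SD directly. A least-squares fit is not the same thing: the fitted values have an SD shrunk by a factor |r|, so the CCC of fitted values against B works out to 2r²/(1+r²) rather than r.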

One possible strategy for the remaining 95000: measure all with A, and if the value could be problematic / is outside some range / whatever, then also measure it with B.

Dale
Mentor
Is B generally acknowledged as a reference standard, and does B have a continuous outcome or a dichotomous outcome?

Also, is your goal to characterize the assays themselves or to characterize the population of items? If the latter, is there already existing literature characterizing the assays?

Thank you both for your replies.
To be more concrete, I will add that the assays are biological assays, the items are molecules, and the value is the percentage inhibition ('PIN') at a single dose. More details on PIN in my reply to Dale below.

@mfb :
OK, I hadn't thought of comparing mean and standard deviation between A and B. I'm curious to see that.
As for 'calibrating', I don't know if that's a common thing to do in biology. I will ask. In practice, I guess I could just plot the results from A vs the ones from B, calculate the least squares regression, apply the regression coefficients to transform either A to A' or B to B', and then calculate the correlation between A and B' or A' and B. I take it you're saying the CCC on A vs B should be the same as the linear correlation coefficient between A and B' or A' and B, correct?
I think biologists measure the 'robustness' of an assay by running the same assay twice for the same set of molecules, and checking how close the duplicate results are. This would essentially be like saying A=B. If mean and SD differ between duplicates, I don't suppose that's very good, because it means there is something systematic happening over time. I will check, but I think that's actually the plan in our case here. I.e. the test set of 5000 or so will be run in duplicate both in A and in B.
As for your last point, unfortunately we don't have the option to run both A and B on the whole set. We need to choose one assay and from that moment stick with it.

@Dale :
Good question, I will ask. From what I understood, I don't think either assay is the reference standard. They are based on two different technologies, and biologists would prefer one of the two, for reasons that I didn't fully catch. It may be that one is more commonly found in the literature, and the other is something we adapted ourselves in order to save resources, time, etc.
Both assays return PIN as their outcome. PIN is a continuous variable usually ranging between 0 and 100% (although it can often be smaller than 0 or greater than 100%, depending on what normalisation constants are used). PIN is calculated by a linear transformation (normalisation) of a raw physical response (emission of light, absorption of light, measurement of radioactivity...).
Our final goal is indeed to characterise the items. Our initial goal is to determine which assay gives us the best chances to characterise the items correctly.
We selected the set of 100000 molecules using certain QSAR models plus other criteria, in order to maximise the chances of finding molecules that have the desired biological activity that we measure with A and B. We may decide that 'having activity' is equivalent to 'having PIN > 50%', or 'having PIN greater than the assay mean by at least 4 SD'.
So, suppose that 100 molecules out of 100000 have activity, the rest don't. This is a 'hit rate' of 100/100000 = 0.1%.
If we select the test set of 5000 cpds at random, on average we would expect 5 molecules to have activity.
The point is, if indeed we have 5 active molecules in 5000, what are the chances that we are going to miss some of them ('false negatives'); and what are the chances that we are going to consider active molecules that actually aren't ('false positives')?
So I believe at some point we're going to look at the performance of these assays in this sense, too.
I think in theory this process ('validation') should be done with a set where we have certainty about the status of activity or inactivity of the molecules, so we can build a contingency table and measure TPR and FPR, and derive all the probabilities I am talking about.
In many cases though we don't have 'standard' molecules for the test, and we don't know the 'true' hit rate of the set we're testing, which doesn't make it very easy to decide which assay is best. One that gives you many 'tested actives' may be worse than one that gives you fewer, if the former has a higher FPR; but not if the latter has a higher FNR. Difficult stuff.

I don't know if I'm being clear. This is a relatively new thing for me as well; I was working in a different field until last month.

mfb
Mentor
I take it you're saying the CCC on A vs B should be the same as the linear correlation coefficient between A and B' or A' and B, correct?
The CCC between A' and B (or A and B') should be the same as the linear correlation coefficient between all possible comparisons.

If the means differ, something is problematic, sure: test it. If the means agree, everything is fine. The standard deviation can differ: a less precise test will have a larger standard deviation (the natural variation plus the deviation of the test, added in quadrature, if the test uncertainty does not depend on the measured value).

If you are looking for outliers (like 1 in 1000), then your sample size of 5000 could be problematic. With just the 5 expected molecules your uncertainty will be large. You can loosen the activity criteria and try to extrapolate, but that can be problematic as well - you would need some information about how your tests behave for those outliers, or which distribution you expect for your molecules.

Dale
Mentor
Our initial goal is to determine which assay gives us the best chances to characterise the items correctly.
How can you determine if an item is characterized correctly? From your description above it sounds like the assay's performance is unknown, and you don't have a reference sample whose characteristics are known. I don't think it is possible. You need some sort of standard, either a gold standard test or a standard reference sample.

How can you determine if an item is characterized correctly? From your description above it sounds like the assay's performance is unknown, and you don't have a reference sample whose characteristics are known. I don't think it is possible. You need some sort of standard, either a gold standard test or a standard reference sample.
I think you're right that a standard is needed, i.e. something that tells you the true activity or inactivity of the molecules you submit to the assay.
Biologists do usually test molecules of known activity when they develop and validate the assay.
As for the assay 'performance', I guess it depends on what we mean by it.

If you are looking for outliers (like 1 in 1000), then your sample size of 5000 could be problematic. With just the 5 expected molecules your uncertainty will be large.
If I understand correctly, there are two problems here.
One is to know how probable it is to have the 5 expected actives (or more) in a subset of 5000 molecules taken at random from the set of 100000 that contains 100 actives. This should be answered by the hypergeometric distribution, I suppose(?), and indeed, if my calculations are correct, it's only about 56%. I am a bit gutted about that, to be honest.
The second problem is: how likely are we to detect activity given that only very few molecules are active? It may be that by chance one of the active molecules has a large negative error on its PIN value and we categorise it as inactive, thus losing one of our precious few actives. And remember that we screen each molecule only once. Wanting to minimise the loss of wrongly categorised active molecules would argue in favour of an assay with as low as possible a FNR, i.e. a very sensitive assay (high TPR). Funnily enough, I often hear people complaining of false positives instead, so they would advocate a more specific assay (high TNR). Who knows...
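
The hypergeometric figure for the first problem can be reproduced with the standard library alone. This is a minimal sketch using the numbers from the post (100000 molecules, 100 assumed actives, 5000 sampled); the function name `hypergeom_pmf` is just illustrative.

```python
from math import comb

def hypergeom_pmf(k, N, K, n):
    """P(exactly k actives in n items drawn without replacement from N items, K of them active)."""
    return comb(K, k) * comb(N - K, n - k) / comb(N, n)

N, K, n = 100_000, 100, 5_000
p_at_least_5 = 1.0 - sum(hypergeom_pmf(k, N, K, n) for k in range(5))
p_none = hypergeom_pmf(0, N, K, n)
print(f"P(at least 5 actives) = {p_at_least_5:.3f}")
print(f"P(no actives at all)  = {p_none:.4f}")
```

The first number should come out close to the 56% quoted above; the second shows how likely it is that the random subset contains no actives at all.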

In particular there is an old article on the 'famous' Z-prime factor - which is always mentioned when assays and screening are involved - and I still don't know much about it. Things surely moved on since then, but it's probably a good primer.
http://jbx.sagepub.com/content/4/2/67.full.pdf+html
And more in general about the process I am talking about (HTS):
https://en.wikipedia.org/wiki/High-throughput_screening

Thanks
L

mfb
Mentor
And remember that we screen each molecule only once.
Is that such a hard requirement? If you want to find the best ~100 out of 100,000, it would be interesting to do follow-up studies on those 100 (or better 200-300). That relaxes the requirements for the first screening round.

Of course, that's the standard procedure. Sorry, I wasn't clear.
I meant that the 100000 molecules are tested only once at single dose (usually 10 micromolar). The resulting PIN's are measured and analysed.
The (hopefully) 200-300 molecules having a PIN that identifies them as possible actives undergo a 'confirmation' screen, usually at multiple ascending doses, to obtain what we call a 'dose-response curve', which gives us more information and much better accuracy than the single dose experiment.
The reason we can't run assays several times on huge numbers of molecules is that each experimental measurement ('point') uses up time and space (and money).
So, say in the HTS you screen 100000 molecules by taking 100000 points, and it takes you N full days.
Such an effort can't be repeated very often, mainly because of the cost (molecules must also be moved around, so there is a lot of logistics going on), but also because many projects are running at the same time, and the screening team has only got so many resources to allocate.
So when you run the confirmation screen, where you need more points (say 10) for each molecule, and you can only dedicate a much smaller number of days, typically you have to reduce the number of molecules to a few hundred.

There are several interesting statistical approaches to HTS analysis, i.e. to identifying which molecules are the most likely to have 'true' activity, and improving on the necessarily sketchy picture you get from single dose PIN's.
For instance, as the items that are tested are molecules, many researchers proposed to 'cluster' them by structural similarity and measure the 'local hit rate' in clusters, rather than looking at individual PIN's alone. This is supposedly beneficial to rule out false positives and false negatives. A molecule with very low PIN that is found in a cluster with very high hit rate may be a false negative. A molecule with very high PIN that is found in a cluster with very low hit rate may be a false positive. Etc.

Dale
Mentor
I think you're right that a standard is needed, i.e. something that tells you the true activity or inactivity of the molecules you submit to the assay.
So, I do a lot of research developing and characterizing new diagnostic tests based on medical imaging. There are some things that you can do if you have a test that is considered to be a gold standard.

If the gold standard is dichotomous (i.e. gives "positive" or "negative") then I usually do a ROC analysis.

If the gold standard and the new test are continuous and purport to measure the same quantity then I usually do a Bland Altman analysis, often with an ICC just for reviewers who need P values.

If the outcomes are categorical and have the same categories then I usually do Cohen's kappa.
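
For the continuous case, the core of a Bland-Altman analysis is the mean of the paired differences (the bias) and the limits of agreement at roughly bias ± 1.96 SD of those differences. A minimal sketch on synthetic paired PIN values follows; the offset and noise levels are made up, and a full analysis would also plot the differences against the pairwise means to look for trends.

```python
import random
from statistics import mean, stdev

def bland_altman(a, b):
    """Return (bias, lower limit, upper limit) of agreement for paired measurements."""
    diffs = [x - y for x, y in zip(a, b)]
    bias = mean(diffs)        # systematic offset of A relative to B
    sd = stdev(diffs)         # spread of the paired differences
    return bias, bias - 1.96 * sd, bias + 1.96 * sd

# Synthetic paired data: assay A reads about 2 PIN units high and is noisier than B.
rng = random.Random(1)
truth = [rng.gauss(20, 15) for _ in range(5000)]
assay_b = [t + rng.gauss(0, 3) for t in truth]
assay_a = [t + 2 + rng.gauss(0, 6) for t in truth]

bias, lower, upper = bland_altman(assay_a, assay_b)
print(f"bias = {bias:.2f}, 95% limits of agreement = ({lower:.2f}, {upper:.2f})")
```

Whether the resulting limits are acceptable is a judgment about the application (e.g. how many PIN units of disagreement one can tolerate near the hit limit), not something the statistics decide.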

Thanks!
I'm looking this up, and I will mention it to the screening group.
Maybe this will bring a new perspective to the prescreen analysis people usually do in my company.

Stephen Tashi
Both assays are supposed to measure the same property of certain items, i.e. we would expect A and B to give broadly the same value of the property when applied to the same item.

We plan to test N=100000 items.

It isn't clear whether two different items are molecules of different chemical compositions or whether you will apply the test to 100000 samples, each of which is a molecule of the same chemical composition. Or perhaps you are doing some combination of varying both the chemical composition of the molecules and repeating the test on samples of molecules with the same chemical composition.

@Stephen Tashi
Hi Stephen.
Each item in the HTS set of 100000 items is a distinct molecule, i.e. different from all other 99999 molecules.
It may sometimes happen in confirmation screens or regular screens that we test different samples ('lots' or 'batches') of the same molecule, usually to check if a different purity or method of synthesis affects the potency.
In the HTS instead we need to include as many distinct molecules as possible.

A secondary point concerning nomenclature: I would be careful using the term 'composition' to distinguish molecules.
Two molecules can have the same 'composition' - in the sense of same numbers of chemical elements - but still be different ('isomers').

There are very precise, maths-based methods to determine if two isomers are identical or not, so much so that computers can do that.
The first layer is 'connectivity' (related to graph theory, I believe).
Two isomers that have differently connected atoms are 'constitutional isomers'.
If they have the same connectivity, they undergo a second layer of analysis based on symmetry.
If they can be interconverted by a first order symmetry operation, they are identical; if not, they are 'stereoisomers'.
Among stereoisomers you have 'enantiomers', i.e. pairs of stereoisomers that can be interconverted by a second order symmetry operation, and 'diastereoisomers', i.e. pairs of stereoisomers that can't be interconverted by any symmetry operation.

So a single 'composition' can potentially correspond to a large number of chemically distinct molecules.

Dale
Mentor
Wanting to minimise the loss of wrongly categorised active molecules would argue in favour of an assay with as low as possible a FNR, i.e. a very sensitive assay (high TPR). Funnily enough, I often hear people complaining of false positives instead, so they would advocate a more specific assay (high TNR). Who knows.
I just looked back and noticed this. It is an interesting question in its own right. The people advocating for a more specific assay are probably right in this case. Of course, it does depend on the actual cost function so the only way to really know is to run the numbers.

But all things being equal, because the vast majority of the population is negative, the way that the test handles negatives will largely determine the overall performance. I.e. most errors will be misclassified negatives simply because there are so many negatives. This is governed by the specificity. The more rare the condition, the more important the specificity.

Stephen Tashi
Each item in the HTS set of 100000 items is a distinct molecule, i.e. different from all other 99999 molecules.

Are all molecules to be examined of equal importance? One can have a formula that predicts the outcome of test B on a molecule from the outcome of test A that works well in the sense of working on the majority of molecules, but if it fails on certain molecules that are critical for evaluating something important (like the development of a new drug) then the test doesn't work well from that standpoint.

Let's imagine two different situations.

First, imagine that I collect twigs and I want to measure the electrical resistance of twigs. I can use cheap voltmeter A which is easy to connect or use expensive voltmeter B which is time consuming to connect. If I find a formula that predicts the reading of voltmeter B from the reading of voltmeter A then this (intuitively) strikes me as an "apples-to-apples" comparison if both meters are based on the same physics. There is reason to think that test A ought to have some relation to test B.

Next, imagine that I collect books and I want to measure the number of words in books. I have a simple test A that estimates the number of words by the procedure: Open the book to approximately the middle page, count the number of characters in the third line of text on that page and divide that number of characters by 8, then multiply that result by 100 times the thickness of the book in inches. I have a second test B which is to actually count the number of words in the book. If I find the formula that best predicts the result of test B as a function of the results of test A, the formula may work well on books on certain subjects, with certain page thicknesses etc. However, I'm not confident that the formula is based on any "physics" of books that is common to the two tests. For example, it might not work well on scientific dictionaries printed on thin paper. If the majority of books are works of fiction printed on medium thickness paper, then the "best" formula is liable to be the one that works on such books.

Which situation is most similar to your problem?

The people advocating for a more specific assay are probably right in this case. Of course, it does depend on the actual cost function so the only way to really know is to run the numbers.
Are they? Then I must be misunderstanding something.
Let's see where my reasoning is not right.

If I want to avoid throwing away active molecules because the assay wrongly says they are inactive, I want to minimise this probability:
P(mol=active | assay=inactive) = P(M+ | A-) = P(A- | M+) * P(M+) / P(A-) =
= P(A- | M+) * P(M+) / [P(A- | M+) * P(M+) + P(A- | M-) * P(M-)] =
= FNR * P(M+) / [FNR * P(M+) + TNR * P(M-)]

which I think is increasing with FNR and decreasing with TNR.
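
A quick numerical check of that expression, as a sketch: the prevalence of 0.1% is the assumed 100-in-100000 figure from earlier in the thread, and the FNR/TNR pairs are made up.

```python
def p_active_given_negative(fnr, tnr, prevalence):
    """P(M+ | A-) from Bayes' theorem, as in the expression above."""
    num = fnr * prevalence
    return num / (num + tnr * (1.0 - prevalence))

prevalence = 0.001  # assumed P(M+)
for fnr, tnr in [(0.10, 0.95), (0.01, 0.95), (0.10, 0.99)]:
    p = p_active_given_negative(fnr, tnr, prevalence)
    print(f"FNR={fnr:.2f} TNR={tnr:.2f} -> P(M+|A-) = {p:.6f}")
```

At such a low prevalence the result scales almost proportionally with FNR, while changing TNR from 95% to 99% barely moves it.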

So now I'm more confused than before.
One could say that both increasing TNR (more specificity) and decreasing FNR (more sensitivity) help to minimise the above probability.
...

Dale
Mentor
If I want to avoid throwing away active molecules because the assay wrongly says they are inactive
If that is the goal then you should just label all molecules active and not throw any of them away.

Usually the goal is to minimize the probability of being wrong, and there is some trade-off function where increasing specificity decreases sensitivity and vice versa.

I went back to square one to look at this in more detail, and it's indeed interesting.

First, I was wrong about the 'hit rate': that's actually P(A+), not P(M+).
From my calculations, if we assume that the TPR and FPR completely characterise the assay and stay ~constant when changing the test set (is this a reasonable assumption, by the way?), then the hit rate P(A+) is supposed to be between FPR and TPR. Of course we need to have TPR>FPR, otherwise the assay is worse than random.
Within these conditions, the actual numbers of molecules in the 4 squares of the contingency table (TP, TN, FN, FP) can all be calculated from TPR, FPR and N. Apparently also P(M+) can be calculated, although I'm not sure if this is correct in practice.

So, here's a possible scenario, with made-up but reasonable figures.
From the assay validation we find TPR = 90% and FPR = 5%.
We run the assay on N = 100000 molecules and find 5200 tested actives. Hit rate = P(A+) = 5.2%.
P(M+) = [P(A+) - FPR] / [TPR - FPR] = (0.052 - 0.05)/(0.9 - 0.05) = 0.00235 = 0.235%.
TP = N * P(M+) * TPR = 100000 * 0.00235 * 0.9 = 212
FP = 5200 - 212 = 4988
TN = N * (1-P(M+)) * (1 - FPR) = 100000 * (1 - 0.00235) * (1 - 0.05) = 94776
FN = 100000 - 5200 - 94776 = 24
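
The same back-calculation as a small function, so that other TPR/FPR/hit-rate combinations can be tried. This is a sketch of the arithmetic above, not a validated method; it works with expected (non-integer) counts, and the function name is illustrative.

```python
def confusion_from_hit_rate(n, tpr, fpr, hit_rate):
    """Expected TP, FP, TN, FN given TPR, FPR and the observed hit rate P(A+)."""
    # P(M+), solved from P(A+) = TPR * p + FPR * (1 - p)
    prevalence = (hit_rate - fpr) / (tpr - fpr)
    tp = n * prevalence * tpr
    fn = n * prevalence * (1 - tpr)
    fp = n * (1 - prevalence) * fpr
    tn = n * (1 - prevalence) * (1 - fpr)
    return prevalence, tp, fp, tn, fn

prevalence, tp, fp, tn, fn = confusion_from_hit_rate(100_000, 0.90, 0.05, 0.052)
print(f"P(M+) = {prevalence:.5f}")
print(f"TP = {tp:.1f}, FP = {fp:.1f}, TN = {tn:.1f}, FN = {fn:.1f}")
print(f"PPV = {tp / (tp + fp):.3f}")
```

One caveat worth seeing for yourself: because the hit rate is only just above the FPR, the estimate of P(M+) is very sensitive to the FPR. Calling the function with an FPR of 5.1% instead of 5% roughly halves the estimated P(M+).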

In terms of immediate cost, it is quite bad, or even not possible, for us to have to rescreen 5200 molecules only to find 212 true actives among them.
When the hit rate is too high, we often have to artificially 'raise the bar' (i.e. set a higher PIN) for inclusion of molecules into the 'tested active' set.
By doing this we should in principle be picking the best molecules, because the PIN is supposed to be higher for more active molecules, but if there is a large error on the PIN itself, we may actually be inappropriately discriminating between values that are not different, statistically.
In either case, if the probability of true activity in the 'tested active' set is high, it should help, because even choosing at random we would be including more good molecules.
So we would like to increase the PPV = P(M+ | A+) = 212 / 5200 = 4.1%.

At the same time, there is a long term risk of losing good chemical matter: we don't like to lose molecules that are wrongly categorised as inactive.
The number of molecules we lose is FN, which divided by the total number of true actives is simply FNR = 1 - TPR, in this case 1 - 0.9 = 10%.

What happens in the two possible cases, one where we increase sensitivity (TPR) and one where we increase specificity (decrease FPR)?

Let's consider P(M+) constant at 0.235%.

1) increased TPR = 99%, same FPR = 5%.

P(A+) = 5.22%
TP = 233
FP = 4988
TN = 94777
FN = 2
PPV = 4.5%
NPV = 99.998%
FNR = 1%

--> marginal improvement in the PPV, big improvement in the NPV, much smaller loss of true actives as false negatives.

2) same TPR = 90%, decreased FPR = 1%.

P(A+) = 1.2%
TP = 212
FP = 998
TN = 98767
FN = 23
PPV = 17.5%
NPV = 99.976%
FNR = 10%

--> strong decrease of the hit rate, big improvement in the PPV, no change in the loss of true actives as false negatives.

I suppose at this point one should decide if it's worse to have to rescreen many molecules to find true actives in a set with many false positives, in which case specificity should be improved, or lose many molecules as false negatives, in which case sensitivity should be improved.
Best would be to improve both, really :O)
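
The two scenarios can be recomputed side by side with a few lines. This is a sketch using the same made-up TPR/FPR values and the P(M+) of about 0.235% from the baseline; it reports expected counts, so small rounding differences from the hand calculation are to be expected.

```python
def screen_summary(n, tpr, fpr, prevalence):
    """Expected hit rate, PPV, NPV and false negatives for a screen of n molecules."""
    tp = n * prevalence * tpr
    fn = n * prevalence * (1 - tpr)
    fp = n * (1 - prevalence) * fpr
    tn = n * (1 - prevalence) * (1 - fpr)
    return {
        "hit_rate": (tp + fp) / n,
        "PPV": tp / (tp + fp),
        "NPV": tn / (tn + fn),
        "FN": fn,
    }

p = 0.002 / 0.85  # P(M+) from the baseline scenario, about 0.235%
for label, tpr, fpr in [("baseline", 0.90, 0.05),
                        ("higher TPR", 0.99, 0.05),
                        ("lower FPR", 0.90, 0.01)]:
    s = screen_summary(100_000, tpr, fpr, p)
    print(f"{label:11s} hit rate={s['hit_rate']:.2%}  PPV={s['PPV']:.1%}  "
          f"NPV={s['NPV']:.3%}  FN={s['FN']:.1f}")
```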

Then one should figure out how to improve these parameters in practice, or how they can be measured for the two assays A and B, from the prescreen set, if they change between the validation set and the latter.
We're testing the same set, so P(M+) is the same. P(A+) will be known. From that I could calculate a quantity proportional to the ratio TPR/FPR for both assays, and maybe prefer the one with the higher ratio... not sure, I'm really new to this.

I'm going to see what my colleagues in screening and computational chemistry think and we'll take it from there.
But I'll definitely try the Bland-Altman analysis if I get the chance.

Dale
Mentor
I suppose at this point one should decide if it's worse to have to rescreen many molecules to find true actives in a set with many false positives, in which case specificity should be improved, or lose many molecules as false negatives, in which case sensitivity should be improved.
Yes. The decision theory approach would be to assign costs to each possible outcome and then find the threshold that minimizes the expected cost.

I think that you are going to want to look at the ROC analysis. It describes how a given test can be tweaked to trade off between true positive and false positives. Even if you don't use it, you may find it interesting to know.
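
To make the cost-based choice of threshold concrete, here is a minimal sketch. Everything in it is an assumption for illustration: Gaussian PIN distributions for actives and inactives, their means and SDs, the prevalence, and the relative costs of a false positive and a false negative. Sweeping the threshold in this way also traces out the (FPR, TPR) points of the ROC curve.

```python
from statistics import NormalDist

# All numbers below are illustrative assumptions, not measured values.
inactive = NormalDist(mu=0, sigma=15)    # assumed PIN distribution of true inactives
active = NormalDist(mu=70, sigma=15)     # assumed PIN distribution of true actives
prevalence = 0.001                       # assumed fraction of true actives
cost_fp, cost_fn = 1.0, 50.0             # assumed relative cost of each kind of error

def expected_cost(threshold):
    """Expected cost per molecule when PIN > threshold is called a hit."""
    fpr = 1.0 - inactive.cdf(threshold)  # inactives above the threshold
    fnr = active.cdf(threshold)          # actives below the threshold
    return cost_fp * fpr * (1 - prevalence) + cost_fn * fnr * prevalence

thresholds = [t / 2 for t in range(0, 201)]   # 0.0, 0.5, ..., 100.0
best = min(thresholds, key=expected_cost)
print(f"cost-minimising hit limit: PIN > {best}")
print(f"TPR = {1 - active.cdf(best):.3f}, FPR = {1 - inactive.cdf(best):.5f}")
```

Changing the cost ratio or the prevalence moves the optimum, which is the point of the exercise: the "right" threshold is not a property of the assay alone.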

Thank you again @Dale , I looked at the ROC concept as explained in the article here, and it is very interesting indeed.
I found particularly useful the 'Curves in ROC space' section, and the applet where you can see what happens by moving the threshold.
Bizarrely enough, it seems related to another problem I was trying to solve with much help from @Stephen Tashi and others, concerning Bayesian classifiers.

If I understand correctly, the overall ROC curve, via its shape and AUC, tells us how well a model works 'overall' in telling classes apart, and provides a way of comparing different classifiers.
By attaching a 'cost' to FP and FN, as you pointed out, one can then determine the optimal threshold (maybe there can be more than one?).

Doing this requires knowing the shape of the distribution for both classes, if I'm not mistaken.
And that's where I found a big difference with my case.

I spoke with the biologist who manages the HTS, and she confirmed to me that, as I remembered/feared, they don't always have positive or negative controls in the sense of TP or TN. They often have only very few compounds that elicit or suppress/blank the desired response, without necessarily resembling the type of TP we are looking for.
This is simply because we are often looking for TP's that are not known yet, e.g. a molecule that is active on a certain new/unexplored target, or sometimes even without a target, and we do that by testing many molecules with very different structures ('diverse set'), hoping that at least a few of them will approach the shape and electronics we need to 'hit' the assay and produce a good response (high PIN).

So the work they do in screening is often mostly assay-centered, i.e. they try and understand if the intrinsic (i.e. molecule-set-independent) assay reproducibility and variability are within certain ranges.

An initial criterion is the "window", i.e. the difference between min and max response they can measure. For instance, in a presentation I was reading the other day, the conclusion was something like: we have PIN's between 80 and 100 with positive controls, and between -30 and 30 with negative controls, therefore we have a "window".

Once the window is sufficiently good (by tweaking experimental parameters of the assay), a larger set of molecules is tested, with replicates, and the overall distribution of all PIN's, variation between replicates, etc is studied.
http://jbx.sagepub.com/content/4/2/67.long

[I find the choice of Z and Z' quite unfortunate, considering the definition they then give, but OK, we know it's in the context of assays and not pure statistics].
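
For reference, the Z'-factor from the article linked above is usually written as Z' = 1 - 3(SD_pos + SD_neg) / |mean_pos - mean_neg|, computed from the positive and negative control wells. A small sketch follows; the control values are made up, loosely within the ranges quoted above.

```python
from statistics import mean, stdev

def z_prime(positive_controls, negative_controls):
    """Z'-factor: 1 - 3 * (sd_pos + sd_neg) / |mean_pos - mean_neg|."""
    spread = stdev(positive_controls) + stdev(negative_controls)
    window = abs(mean(positive_controls) - mean(negative_controls))
    return 1.0 - 3.0 * spread / window

# Made-up control wells (PIN values).
pos = [88, 92, 95, 90, 97, 85, 93, 91]
neg = [-5, 10, 3, -12, 8, 0, -3, 6]
print(f"Z' = {z_prime(pos, neg):.2f}")
```

Values above roughly 0.5 are commonly read as a comfortable separation between the controls. Note that the statistic only uses means and SDs of the controls, so it says nothing about the shape of the distributions or about the test compounds themselves.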

The aim is to determine if the assay is 'good enough' for testing an even larger set of molecules.
Fig. 2 shows the distribution of PIN's in the whole set and within replicates of a single measurement, and the text in the article mentions that by moving the hit limit (which would be our threshold, I guess) to higher PIN's, one decreases the FPR but increases the FNR.
Which seems fair enough, except that we don't have TP's and TN's in the tested set, so we are making the big assumption that TP's will have higher PIN and TN's will have lower PIN's, based on functional positive and negative controls, without really knowing where each distribution is centered and how spread out it is.

In theory, I thought, one would use these data to set the hit limit so that it maximises TPR/FPR, for instance, so, whatever P(M+) is, we maximise PPV.
I did some calculations in that sense, but I came across some integrals that can't be solved analytically; I'll have to study this in more detail.

In the meantime I asked another biologist how they decide where to put the hit limit, to see if their approach was similar.
The answer was that by now he does that by eye, but if you do the maths, it's generally 3 SD above the overall PIN distribution mean.
And he added that the hit limit can and will change from one assay to the next, because of course both the mean and SD can be different each time.

This worried me a bit. If one day you test a really bad set of molecules (low mean PIN), you'll take the best Z-scoring of them as hits, but the same molecules could Z-score much worse in a new assay where there are better ones.
So there is a tendency to look at each screen in relative terms, and a lack of inter-screen consistency, I would say.
And I'm not even sure my own method would avoid this problem; probably not, because I too would be using the means and SD's from each specific run of the assay.

So I'm back to square one again.
In theory I thought TP controls should always be present to 'set the scene' and evaluate other molecules against them.
Apparently not: we test sets that may even contain no true actives; they do contain true negatives though (we're not that lucky!).
P(A+) is kept essentially constant between different runs of the assay (because the integral of the overall PIN distribution between mean + 3 SD and +inf is a constant, I believe), whereas the hit limit, or PIN value that tells apart A+ and A-, keeps changing.

I suspect it would be better, as suggested by @mfb above, to include some key molecules in each new run, and 'calibrate' it on a previous/better/reference run, so at least a PIN of 50 would mean the same in all runs, and only the variability of each run would determine P(A+), which to me would seem more sensible.
Why should we force the same P(A+) on each run, regardless of the set we screen? The paradox is that if we intentionally ran a set of true inactives, we would still categorise the same percentage P(A+) of them as hits, even if they have a miserable PIN. Of course our biologists know better than that, but the general method of just taking the tail of the curve doesn't sound right to me.
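
The paradox is easy to reproduce. The sketch below assumes Gaussian PIN values and simulates a run made only of inactives (centre and SD are made-up numbers): a limit at mean + 3 SD of the run still flags a fixed fraction of them as hits.

```python
import random
from statistics import NormalDist, mean, stdev

# Fraction of a Gaussian lying above mean + 3 SD, whatever the mean and SD are.
tail = 1.0 - NormalDist().cdf(3.0)
print(f"expected hit rate with a mean+3SD limit: {tail:.4%}")

# Simulated run containing only inactives (PIN centred on 0 with SD 10).
rng = random.Random(2)
pins = [rng.gauss(0, 10) for _ in range(100_000)]
limit = mean(pins) + 3 * stdev(pins)
hits = sum(p > limit for p in pins)
print(f"hit limit = {limit:.1f}, 'hits' among pure inactives = {hits}")
```

With real, non-Gaussian PIN distributions the fraction would differ, but the limit would still track the run's own mean and SD rather than any absolute level of activity.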

Sorry for the long post, complex stuff as usual.

Thanks
L

mfb
Mentor
By attaching a 'cost' to FP and FN, as you pointed out, one can then determine the optimal threshold (maybe there can be more than one?).
If your test works properly, the ratio of (positives)/(negatives) at a given test result is monotonically rising/falling, and there should be just one optimal threshold. Otherwise accepting multiple intervals could be ideal, but that makes the test very odd.
Doing this requires knowing the shape of the distribution for both classes, if I'm not mistaken.
Right. At least some idea of what the shape will look like.
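As a rough illustration of the cost-based threshold, the sketch below assumes both classes are Gaussian with known parameters, which is exactly the information said above to be needed. All numbers (class means, SDs, `prevalence`, `cost_fp`, `cost_fn`) are made up for the example.

```python
from statistics import NormalDist

def expected_cost(t, neg, pos, prevalence, cost_fp, cost_fn):
    """Expected cost per molecule when everything above t is called a hit."""
    fp = (1 - prevalence) * (1 - neg.cdf(t))  # inactives above the threshold
    fn = prevalence * pos.cdf(t)              # actives at or below the threshold
    return cost_fp * fp + cost_fn * fn

# Assumed (hypothetical) class distributions and costs.
neg = NormalDist(10, 30)    # true inactives
pos = NormalDist(100, 30)   # true actives
prevalence = 0.01           # assumed fraction of actives in the set
cost_fp, cost_fn = 1.0, 20.0

# Scan candidate thresholds from -50 to 200 in steps of 0.5.
grid = [x * 0.5 for x in range(-100, 401)]
best_t = min(grid, key=lambda t: expected_cost(t, neg, pos, prevalence, cost_fp, cost_fn))
print(f"threshold with the lowest expected cost: {best_t}")
```

With equal SDs the ratio of the two densities is monotonic in the test result, so the scan finds a single minimum, in line with the remark about one optimal threshold.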
http://jbx.sagepub.com/content/4/2/67.long
If S/N and S/B are typically used without other constraints, as the article suggests, then there is a lot to improve...

The Z there looks quite odd as well. If the distribution is not Gaussian, then you can have tests with better Z scores which perform significantly worse. Why do they try to reinvent the wheel, where ROC curves (invented during WW2!) are so much more powerful?
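The point about non-Gaussian distributions can be illustrated with a small simulation. This sketch assumes the Z in question is the usual Z-factor, 1 - 3(SD_pos + SD_neg)/|mean_pos - mean_neg|, and compares it with the ROC area computed as the fraction of (positive, negative) pairs ranked correctly. Both assays and all their parameters are invented for the example.

```python
import random
import statistics

def z_factor(pos, neg):
    """Z-factor: 1 - 3*(SD_pos + SD_neg) / |mean_pos - mean_neg| (assumed definition)."""
    spread = statistics.stdev(pos) + statistics.stdev(neg)
    return 1 - 3 * spread / abs(statistics.fmean(pos) - statistics.fmean(neg))

def auc(pos, neg):
    """ROC area as the fraction of (pos, neg) pairs ranked correctly; ties count 0.5."""
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

rng = random.Random(1)
n = 400
# Hypothetical assay X: very tight readings, but 5% of negatives read like positives.
neg_x = [rng.gauss(100, 2) if i % 20 == 0 else rng.gauss(0, 2) for i in range(n)]
pos_x = [rng.gauss(100, 2) for _ in range(n)]
# Hypothetical assay Y: plain Gaussian classes with a larger SD.
neg_y = [rng.gauss(0, 15) for _ in range(n)]
pos_y = [rng.gauss(100, 15) for _ in range(n)]

z_x, z_y = z_factor(pos_x, neg_x), z_factor(pos_y, neg_y)
auc_x, auc_y = auc(pos_x, neg_x), auc(pos_y, neg_y)
print(f"assay X: Z = {z_x:.2f}, AUC = {auc_x:.3f}")
print(f"assay Y: Z = {z_y:.2f}, AUC = {auc_y:.3f}")
```

In this constructed case assay X gets the better Z-factor even though it separates the classes less well than assay Y, because Z only sees means and SDs.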

I don't know what to say.
I too am very often puzzled by the way data are treated in screening and biology in general.
None of the (little) statistics I know seems to apply.

Most of the time I give up because 1) I assume people know what they are doing, and it's just me not getting it; 2) they always make it very clear that they've always done things that way, and they don't have any time or need to change it.
This time I am trying a bit harder, because to be frank I've not always been over-impressed with the end results.

E.g., I will tell you something that may shock you, if you're a statistician.
In screening they measure a variable that is not linear, but log linear, by which I mean 10 and 100 are at the same 'distance' as 100 and 1000, etc.
Well, what do you think, if they measure this for the same molecule more than once, they then average the replicates arithmetically! So they always overestimate the average, and that's bad because we usually want to minimise the variable.
I had strenuous fights in the past to convince people to change this to a geometric average, I even did a study to show the overestimation, and the fact that selectivities (ratios of the variable measured on different systems) were also varying by the different calculation methods, but in the end I gave up because it was clear I was not going to be listened to.

Even worse. These data I'm talking about are generated using many experimental points, so the error on the measurement can be calculated from that and also from the general assay variance, and there is even a nice article explaining it (provided to me by @Stephen Tashi). The software that is used to analyse the data does indeed give the error as one of its outputs. Well, do you think they are prepared to give us these error data, e.g. to compare different values and see how different they really are?
No such luck. They say the error isn't good to use, and instead they average arithmetically the values (wrong), and then they measure the SD only based on the replicates of that specific molecule, without taking into account the general assay variance.
So a molecule tested once will have no error defined (why? if the variance of the measurement system is known...); a molecule tested twice that has two different results will have, say, (10+50)/2=30 instead of 22 as the average and the corresponding SD calculated on 10 and 50; and, sit down if you're standing, a molecule tested twice that by chance has the same results will have SD = 0 !!!
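The 30 versus 22 figure can be checked directly. The sketch below also shows one possible way to attach an error to any number of replicates, assuming (and this is only an assumption, questioned further down the thread) that a single assay-wide SD on the log10 scale applies to every molecule. The value of `assay_log_sd` is hypothetical.

```python
import math
import statistics

replicates = [10.0, 50.0]

arith = statistics.fmean(replicates)          # arithmetic mean of the raw values
geo = statistics.geometric_mean(replicates)   # mean taken on the log scale
print(f"arithmetic mean: {arith:.1f}, geometric mean: {geo:.1f}")

# Replicate SD computed on the log10 scale, where the variable is linear.
log_values = [math.log10(v) for v in replicates]
log_sd = statistics.stdev(log_values)

# Hypothetical assay-wide SD (log10 units), assumed equal for all molecules.
assay_log_sd = 0.3

def standard_error(n_replicates, sd=assay_log_sd):
    """Standard error of the log-scale mean under the shared-SD assumption."""
    return sd / math.sqrt(n_replicates)

print(f"replicate SD (log10): {log_sd:.3f}")
print(f"error with 1 replicate: {standard_error(1):.3f}, with 2: {standard_error(2):.3f}")
```

Under the shared-SD assumption, a molecule tested once still gets an error, and two identical replicates no longer produce an SD of zero.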

I know I shouldn't get so worked up about this, and I must respect other people's dedication and professionalism, but I do have the impression that statistics is not really taken seriously in certain circles...

mfb
Mentor
Most of the time I give up because 1) I assume people know what they are doing, and it's just me not getting it; 2) they always make it very clear that they've always done things that way, and they don't have any time or need to change it.
I have bad news for you.
(why? if the variance of the measurement system is known...)
Do you know the variation is the same for every molecule?
but I do have the impression that statistics is not really taken seriously in certain circles...
That is true. There are studies showing that errors in the mathematics are quite frequent in various fields.

Dale
Mentor
2021 Award
The Z there looks quite odd as well. If the distribution is not Gaussian, then you can have tests with better Z scores which perform significantly worse. Why do they try to reinvent the wheel, where ROC curves (invented during WW2!) are so much more powerful?
I think that one advantage of the weird Z scores is that they allow you to compare two tests that would each be identified as "perfect" with an area under the curve of 1.0 in ROC analysis.

mfb
Mentor
But how does the comparison help? If your area under the curve in ROC is 1, you don't know how good your test is - you can only set a lower limit. In this case, a different Z score could indicate different test performance, but you don't know, because your test sample is too small to really see where the tails are.

Dale
Mentor
2021 Award
If your area under the curve in ROC is 1, you don't know how good your test is - you can only set a lower limit. In this case, a different Z score could indicate different test performance, but you don't know, because your test sample is too small to really see where the tails are.
Good point. Also, most ROC analysis packages will give you a confidence interval for the AUC. So I believe that both tests would have an AUC of 1, but the "better" one would have a higher lower bound for the confidence interval.

EnumaElish
Homework Helper
In addition to all of the above, you could estimate the regression model B = α + β A [+ γ X] and see how good the fit is. γ X is optional and is intended to capture variables other than A that could explain B. Ideally γ = 0, i.e. the only significant variable that explains B is A. But if you can think of an X such that γ ≠ 0, then it should be included in the model.

1. Some observations are more informative than others ... If so consider a weighted regression
2. There are outliers that are not "predicted well" by the simple linear model ... If so consider adding polynomial terms + β_k A^k, k = 2, ..., n.
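The simple version of the regression above, B = α + β A without the optional X term, can be sketched with ordinary least squares. The paired results here are simulated under made-up parameters; `fit_line` is a hypothetical helper, not part of any package.

```python
import random

def fit_line(a, b):
    """Ordinary least squares fit of b = alpha + beta * a; returns (alpha, beta, r2)."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    sxx = sum((x - mean_a) ** 2 for x in a)
    sxy = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    beta = sxy / sxx
    alpha = mean_b - beta * mean_a
    ss_res = sum((y - (alpha + beta * x)) ** 2 for x, y in zip(a, b))
    ss_tot = sum((y - mean_b) ** 2 for y in b)
    return alpha, beta, 1 - ss_res / ss_tot

rng = random.Random(2)
# Hypothetical paired results: assay A, and assay B related to A plus noise.
assay_a = [rng.gauss(10, 30) for _ in range(2000)]
assay_b = [5 + 0.9 * x + rng.gauss(0, 10) for x in assay_a]

alpha, beta, r2 = fit_line(assay_a, assay_b)
print(f"alpha = {alpha:.2f}, beta = {beta:.3f}, R^2 = {r2:.3f}")
```

If the two assays really measured the same property on the same scale, one would hope for α near 0 and β near 1; a high R² alone only says that B is predictable from A.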

EnumaElish
Homework Helper
Thank you all again for your input, it's really valuable to me.
I was about to write to the head biologist about the interassay calibration idea, but then I thought it may be better to send the email Friday just before I leave, in case they decide to storm into my office and punch me :O)

I tried to catch up by reading the Wikipedia article on HTS that I mentioned earlier: https://en.wikipedia.org/wiki/High-throughput_screening
The 'Hit selection' section seems to be the closest one to the new problem I'm looking at (before it was comparing two assays, now it's how to select hits to maximise the chances of 'success', whichever assay is chosen).
It says that multiple approaches are possible (Z, t, B-score, SSMD, etc).
I seemed to gather that SSMD is a good one, and indeed the same author of the 1999 article I mentioned also wrote another article in 2007 on the application of SSMD to screening.
http://jbx.sagepub.com/content/12/5/645.full.pdf+html
I must have passed out when I scrolled down this article, because I looked at the clock and found that half an hour or so had gone, leaving me without any recollection of what happened during that time. Either that or I was abducted by aliens :O)
OK, it's possible that the formulae in it are not so complicated after all, but from a cursory glance it seems to me that they're over-complicating it.

In its essence the problem is simple to formulate.
We run an assay on N molecules, and for each of them we collect either a single or a duplicate measurement of a variable called 'PIN'.
In total we'll have N or 2 N data points, of the form {molecule_i, PIN1_i, (PIN2_i)}, where i is an integer index and goes from 1 to N.
The distribution of all PIN's is Gaussian and has mean m and SD s.
As most molecules are inactive and activity is proportional to PIN, m will be a rather low PIN, say between 0 and 20, and can be safely assumed to be close to the mean of the true inactive population. s may be anything, often quite large, say 30 or more.
Positive controls are usually present in the tested set, and their PIN will be quite high, usually around 100. However they are most often not enough in number to form a Gaussian distribution, and they are not strictly speaking true positives, they are just something that elicits the assay response. So we don't know the mean and SD of the true positive population: positives are an invisible bump in the tail of the overall PIN distribution.
When there is no replicate (i.e. PIN2 is not measured), we can't know the SD on individual PIN's.
When there is a replicate, we can calculate an SD for each molecule; or, as I was told by another user in these forums https://www.physicsforums.com/threads/standard-deviation-for-a-biological-assay.844014/#post-5295363, we can sum the squared differences (PIN1_i - mean_i)^2, (PIN2_i - mean_i)^2, over the whole set and divide that by N to obtain a sort of average variance that would apply to each PIN. Of course after checking that the squared differences are uncorrelated to PIN.
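The pooled calculation just described can be written out as a short sketch. The function name `pooled_variance` and the three duplicate pairs are hypothetical.

```python
import math

def pooled_variance(duplicates):
    """Sum (PIN1_i - mean_i)^2 + (PIN2_i - mean_i)^2 over all molecules, divide by N."""
    total = 0.0
    for pin1, pin2 in duplicates:
        mean_i = (pin1 + pin2) / 2
        total += (pin1 - mean_i) ** 2 + (pin2 - mean_i) ** 2
    return total / len(duplicates)

# Hypothetical duplicate measurements (PIN1_i, PIN2_i) for N = 3 molecules.
duplicates = [(10.0, 14.0), (50.0, 46.0), (0.0, 0.0)]

var = pooled_variance(duplicates)
print(f"pooled variance: {var:.3f}, pooled SD: {math.sqrt(var):.3f}")
```

With exactly two replicates per molecule, each molecule's sample variance has one degree of freedom, so dividing the grand sum by N is the same as averaging the per-molecule variances.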

The question I need to answer is: given the above data, what is the best way to select 'hits', in order to strike the best balance between true positives and true negatives, and especially, is that possible at all when one doesn't actually have any information about how true positives behave?
The 2007 article seems to have the answer.
If/when I find the courage and the time I will wedge my eyes open with toothpicks and read it.

EnumaElish
Homework Helper
A̶g̶a̶i̶n̶,̶ ̶y̶o̶u̶ ̶c̶o̶u̶l̶d̶ ̶e̶s̶t̶i̶m̶a̶t̶e̶ ̶a̶ ̶r̶e̶g̶r̶e̶s̶s̶i̶o̶n̶ ̶a̶s̶ ̶I̶ ̶d̶e̶s̶c̶r̶i̶b̶e̶d̶ ̶a̶b̶o̶v̶e̶,̶ ̶t̶h̶e̶n̶ ̶d̶e̶c̶i̶d̶e̶ ̶w̶h̶e̶t̶h̶e̶r̶ ̶A̶ ̶p̶r̶e̶d̶i̶c̶t̶s̶ ̶B̶ ̶"̶s̶u̶f̶f̶i̶c̶i̶e̶n̶t̶l̶y̶ ̶w̶e̶l̶l̶.̶"̶ Also get some sleep IMO.
[missed New Problem flag]

Dale
Mentor
2021 Award
However they are most often not enough in number to form a Gaussian distribution, and they are not strictly speaking true positives, they are just something that elicits the assay response
It sounds like they are essentially for instrument calibration. The same thing happens with other instruments.

I just read your comments and had a thought. Right now this is structured as a classification problem, you are looking to classify active molecules as positive and inactive molecules as negative.

Maybe instead you should look at it as a prioritization problem. You have N molecules for the primary test of which you can only test M molecules in the secondary test. So just run the M most active molecules through your secondary test and don't worry too much about classifying things in terms of "hits" or not.
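The prioritization view needs no threshold at all, only a ranking. A minimal sketch, with a hypothetical `top_m` helper and made-up primary-screen PINs:

```python
import heapq

def top_m(pins, m):
    """Return the m (molecule, PIN) pairs with the highest PIN, best first."""
    return heapq.nlargest(m, pins.items(), key=lambda item: item[1])

# Hypothetical primary-screen results: molecule id -> PIN.
pins = {"mol_a": 12.0, "mol_b": 95.0, "mol_c": 40.0, "mol_d": 71.0}

selected = top_m(pins, 2)
print(selected)
```

Here M is set by the capacity of the secondary test rather than by a hit limit derived from the PIN distribution.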

mfb
Mentor
Maybe instead you should look at it as a prioritization problem. You have N molecules for the primary test of which you can only test M molecules in the secondary test. So just run the M most active molecules through your secondary test and don't worry too much about classifying things in terms of "hits" or not.
I think that is an important point.

Actually, instead of starting by running both tests on 5000 samples, why not go the opposite way? If the better test is too expensive to run on all samples, run the cheaper test on all of them, as you have to do that anyway, then run the more expensive test on the 5000 most promising samples based on the test results. Or fewer or more, depending on how much the test costs vs. how much a missed good molecule costs.

@Dale : actually what you say is what is done in 'simple' HTS practice.
The 'top M' molecules from the initial screen are chosen to go into confirmation.
I think we ended up talking about hit classification because my initial question was how to compare the two assays, and I suppose that is related to the ability of each assay to discriminate well between activity and inactivity.
Lacking a true positive population, I guess the variability on replication and the distribution of the negatives should be the main criteria, because whatever threshold we use, a molecule with a given PIN has a higher probability of being categorised correctly if the PIN has a small SD and if it's 'far' from the negative set.

There is however a subtler point, which I mentioned briefly earlier. We're testing molecules, for which various similarity scores can be calculated. If two molecules are very similar, there is a higher chance that they will have a similar biological activity. So if we have a not-very-diverse set to screen, it may happen that in the top M molecules there will be an over-representation of a certain type of molecule. So by simply picking the top M we will not get more information about the active 'scaffolds' as we call them, we'll just have several copies of same-y stuff. To identify active scaffolds (i.e. groups of closely related molecules, usually by a central 'core'), it is sometimes better to cluster by similarity first, look at the local cluster hit rates and select representative molecules from each high-scoring cluster for confirmation. This would lead to a more diverse confirmation set, and increase the chances to obtain valuable information.
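A rough sketch of the cluster-first selection, assuming the similarity clustering has already been done and each molecule carries a cluster label. The function `pick_representatives`, the cut-offs and the records are all hypothetical.

```python
from collections import defaultdict

def pick_representatives(records, hit_limit, min_hit_rate, per_cluster=1):
    """Pick the best molecule(s) from every cluster whose local hit rate is high enough."""
    clusters = defaultdict(list)
    for mol, cluster_id, pin in records:
        clusters[cluster_id].append((pin, mol))
    picks = []
    for cluster_id, members in clusters.items():
        # Local hit rate: fraction of cluster members at or above the hit limit.
        hit_rate = sum(pin >= hit_limit for pin, _ in members) / len(members)
        if hit_rate >= min_hit_rate:
            best = sorted(members, reverse=True)[:per_cluster]
            picks.extend((mol, cluster_id, pin) for pin, mol in best)
    return picks

# Hypothetical records: (molecule, cluster label, PIN).
records = [
    ("m1", "c1", 90.0), ("m2", "c1", 85.0), ("m3", "c1", 80.0),
    ("m4", "c2", 88.0), ("m5", "c2", 5.0), ("m6", "c2", 3.0),
    ("m7", "c3", 70.0), ("m8", "c3", 65.0),
]

picks = pick_representatives(records, hit_limit=60.0, min_hit_rate=0.5)
print(picks)
```

A plain top-3 on these records would take m1, m4 and m2, two of them from the same cluster; the cluster-first version returns one representative from each consistently active cluster and skips the cluster with a single isolated high PIN.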

@mfb :
Let's call A the more expensive assay, B the less expensive one.
We can't run either assay on the whole set before the 5K cpds have been run; that's what they call 'validation', and it hasn't been done yet.
Even if that weren't the case, I don't know if we should use B to select molecules for A before we know if the two assays tell us the same thing about the molecules.
Suppose for instance that by some crazy effect the two assays give us very poorly correlated results (so there is no good linear or whatever function relating A and B), and from the outcome of the positive controls we are more confident that A is 'telling the truth'.
Then if we pre-screened all 100 K compounds in B first, we would get a confirmation set that is much poorer in true actives (according to A), thus drastically reducing our chances of success.
I may be wrong, but for me it's important to know first how the two assays compare, and if B really doesn't make sense and/or has a much larger replicate SD than A, we give up on saving money and we run A.
In the long run it would cost us more to progress false positives and lose false negatives because of a bad assay, than to run upfront the more expensive assay.

Dale
Mentor
2021 Award
my initial question was how to compare the two assays, and I suppose that is related to the ability of each assay to discriminate well between activity and inactivity.
Yes, that makes sense. I would use ROC analysis for that. The Z score seems reasonable to me, but the statistical methods for ROC analysis have been challenged and tested and validated over decades.

mfb
Mentor
Ah okay, if you don't know if B gives relevant results at all then you need the comparison, sure.