How to measure the 'agreement' between two assays?

  • Thread starter: lavoisier

Summary
To measure the agreement between two assays, A and B, a random subset of 5,000 items will be tested with both assays to compare results, focusing on the concordance correlation coefficient (CCC) and rank correlation coefficient for analysis. The discussion highlights the importance of calibrating assay A using results from assay B to ensure comparability, especially if there are notable differences in mean and standard deviation. Concerns are raised about the adequacy of a 5,000-item sample size, given the low expected hit rate of active molecules, which could lead to high uncertainty in detecting true positives and negatives. The need for a reference standard to validate assay performance is emphasized, as it is crucial for determining the accuracy of characterizing the items. Overall, the conversation underscores the complexities of assay comparison and the necessity for careful statistical consideration in assay validation.
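For concreteness, here is a minimal Python sketch (with made-up paired readouts standing in for the ~5,000 shared items) of the two agreement measures named in the summary, Lin's concordance correlation coefficient and Spearman's rank correlation:

```python
# Minimal sketch: agreement between two assays on the same items.
# The arrays below are random stand-ins for the paired readouts
# (e.g. PIN values) of the ~5,000 compounds run in both assays.
import numpy as np
from scipy.stats import spearmanr

def concordance_ccc(x, y):
    """Lin's concordance correlation coefficient for paired readouts."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()                   # population variances
    cov = ((x - mx) * (y - my)).mean()
    return 2 * cov / (vx + vy + (mx - my) ** 2)

rng = np.random.default_rng(0)
a = rng.normal(0.0, 10.0, 5000)                 # assay A readout (hypothetical)
b = 0.9 * a + 5.0 + rng.normal(0.0, 5.0, 5000)  # assay B: correlated, but offset

rho, _ = spearmanr(a, b)                        # rank (monotonic) agreement
print(f"CCC = {concordance_ccc(a, b):.3f}, Spearman rho = {rho:.3f}")
```

Unlike a plain correlation, the CCC also penalises systematic differences in mean or scale, which is why the summary mentions calibrating one assay against the other.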
  • #31
lavoisier said:
However, they are most often too few in number to form a Gaussian distribution, and they are not, strictly speaking, true positives; they are just something that elicits the assay response
It sounds like they are essentially for instrument calibration. The same thing happens with other instruments.

I just read your comments and had a thought. Right now this is structured as a classification problem: you are looking to classify active molecules as positive and inactive molecules as negative.

Maybe instead you should look at it as a prioritization problem. You have N molecules in the primary test, of which you can only test M in the secondary test. So just run the M most active molecules through your secondary test and don't worry too much about classifying things in terms of "hits" or not.
 
Likes: mfb
  • #32
Dale said:
Maybe instead you should look at it as a prioritization problem. You have N molecules in the primary test, of which you can only test M in the secondary test. So just run the M most active molecules through your secondary test and don't worry too much about classifying things in terms of "hits" or not.
I think that is an important point.

Actually, instead of starting by running both tests on 5000 samples, why not go the opposite way? If the better test is too expensive to run on all samples, run the cheaper test on everything, as you have to do that anyway, then run the more expensive test on the 5000 most promising samples based on those results. Or on fewer or more, depending on how much the test costs vs. how much a missed good molecule costs.
 
  • #33
@Dale : actually what you say is what is done in 'simple' HTS practice.
The 'top M' molecules from the initial screen are chosen to go into confirmation.
I think we ended up talking about hit classification because my initial question was how to compare the two assays, and I suppose that is related to the ability of each assay to discriminate well between activity and inactivity.
Lacking a true-positive population, I guess the replicate variability and the distribution of the negatives should be the main criteria: whatever threshold we use, a molecule with a given PIN has a higher probability of being categorised correctly if its PIN has a small SD and is 'far' from the negative set.
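To put a number on that last point, here is a small sketch (illustrative threshold, PIN values and SDs only, assuming roughly Gaussian replicate noise) of how the chance of a correct 'active' call depends on the replicate SD and on how far the true PIN sits from the threshold:

```python
# Sketch of the point above: the chance that a single measured PIN clears
# a fixed hit-calling threshold, as a function of the compound's "true"
# PIN and its replicate SD. All numbers are made up for illustration.
from scipy.stats import norm

threshold = 30.0                       # hypothetical hit-calling cutoff on PIN

def p_called_active(true_pin, replicate_sd):
    """P(measured PIN > threshold), assuming Gaussian replicate noise."""
    return norm.sf(threshold, loc=true_pin, scale=replicate_sd)

for true_pin, sd in [(40, 5), (40, 15), (32, 5), (32, 15)]:
    print(f"true PIN {true_pin}, SD {sd}: "
          f"P(called active) = {p_called_active(true_pin, sd):.2f}")
```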

There is however a subtler point, which I mentioned briefly earlier. We're testing molecules, for which various similarity scores can be calculated. If two molecules are very similar, there is a higher chance that they will have a similar biological activity. So if we have a not-very-diverse set to screen, it may happen that in the top M molecules there will be an over-representation of a certain type of molecule. So by simply picking the top M we will not get more information about the active 'scaffolds' as we call them, we'll just have several copies of same-y stuff. To identify active scaffolds (i.e. groups of closely related molecules, usually by a central 'core'), it is sometimes better to cluster by similarity first, look at the local cluster hit rates and select representative molecules from each high-scoring cluster for confirmation. This would lead to a more diverse confirmation set, and increase the chances to obtain valuable information.
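A rough sketch of that cluster-then-pick-representatives idea, using generic hierarchical clustering on a precomputed distance matrix (in real use this would be something like 1 - Tanimoto similarity computed from chemical fingerprints; all data, cutoffs and hit-rate thresholds here are placeholders):

```python
# Sketch of "cluster first, then pick representatives": hierarchical
# clustering on a precomputed distance matrix, then the top-scoring
# molecule from each cluster whose local hit rate is high enough.
# The distances, PINs and thresholds are random stand-ins.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

rng = np.random.default_rng(1)
n = 200
dist = rng.uniform(0, 1, (n, n))
dist = (dist + dist.T) / 2             # stand-in for 1 - Tanimoto similarity
np.fill_diagonal(dist, 0.0)
pin = rng.normal(10, 20, n)            # primary-screen PIN values
is_hit = pin > 40                      # provisional hit calls

labels = fcluster(linkage(squareform(dist), method="average"),
                  t=0.4, criterion="distance")

representatives = []
for c in np.unique(labels):
    members = np.where(labels == c)[0]
    if is_hit[members].mean() > 0.2:   # keep only high-scoring clusters
        representatives.append(members[np.argmax(pin[members])])

print(f"{len(np.unique(labels))} clusters, "
      f"{len(representatives)} representatives selected for confirmation")
```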

@mfb :
Let's call A the more expensive assay, B the less expensive one.
We can't run either assay on the whole set before the 5K compounds have been run; that's what they call 'validation', and it hasn't been done yet.
Even if that weren't the case, I don't know if we should use B to select molecules for A before we know if the two assays tell us the same thing about the molecules.
Suppose for instance that by some crazy effect the two assays give us very poorly correlated results (so there is no good linear or whatever function relating A and B), and from the outcome of the positive controls we are more confident that A is 'telling the truth'.
Then if we pre-screened all 100 K compounds in B first, we would get a confirmation set that is much poorer in true actives (according to A), thus drastically reducing our chances of success.
I may be wrong, but for me it's important to know first how the two assays compare, and if B really doesn't make sense and/or has a much larger replicate SD than A, we give up on saving money and we run A.
In the long run it would cost us more to progress false positives and lose false negatives because of a bad assay, than to run upfront the more expensive assay.
 
  • #34
lavoisier said:
my initial question was how to compare the two assays, and I suppose that is related to the ability of each assay to discriminate well between activity and inactivity.
Yes, that makes sense. I would use ROC analysis for that. The Z score seems reasonable to me, but the statistical methods for ROC analysis have been challenged and tested and validated over decades.
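A minimal sketch of what that ROC comparison could look like, assuming you have some reference labels to score against (e.g. control wells, or the calls from whichever assay you trust more); the labels and scores below are simulated placeholders:

```python
# Minimal ROC sketch: given reference labels and each assay's continuous
# readout, compare how well the readouts rank actives above inactives.
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

rng = np.random.default_rng(2)
labels = rng.integers(0, 2, 1000)                 # hypothetical reference labels
score_a = 2.0 * labels + rng.normal(0, 1, 1000)   # assay A readout (simulated)
score_b = 1.0 * labels + rng.normal(0, 1, 1000)   # assay B readout (simulated)

for name, score in [("A", score_a), ("B", score_b)]:
    auc = roc_auc_score(labels, score)
    fpr, tpr, thresholds = roc_curve(labels, score)
    print(f"assay {name}: AUC = {auc:.3f}, "
          f"{len(thresholds)} candidate thresholds on the ROC curve")
```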
 
  • #35
Ah okay, if you don't know if B gives relevant results at all then you need the comparison, sure.
 
  • #36
Hey lavoisier.

I'm wondering whether you have studied clinical trials and biostatistics, because this is precisely the kind of thing that field looks at.

It's going to help to know how much you know about this field before giving more advice, but one thing I should do is direct your attention to clinical trials and the results embodied within them, if you aren't already aware of this material.

Clinical trials often look at treatment "deltas", doing inference and regression on them, along with building models to find the optimal number of tests needed to achieve a given statistical power (controlling Type I/II errors).

There are a number of considerations, including crossover trials (the order of the trials affects the distribution) as well as conditional power (i.e. the power of the test changes conditionally on the successes and failures of prior results). These methods are applied to clinical trials, which look at biological phenomena, and the assay example here has many of the same characteristics that would be considered within a normal clinical trial.
 
Likes: WWGD and EnumaElish
  • #37
Hi chiro,
no, I haven't studied that, I am a (medicinal) chemist by training and I've just moved to chemoinformatics and modelling, so I'm trying to learn these things.

In my company there are people who do biostatistics, I am indeed planning to talk to one of them shortly.

We once had a statistician visiting the company as a consultant. He talked about clinical trials and the need to appreciate the difference between the significance and the power of a test. It's all very well to have a significant result, but he showed that sometimes the number of subjects tested is insufficient to avoid the other type of error. And if I understand correctly, at a fixed sample size, decreasing one type of error increases the other.
Tough stuff, especially because this concept is not always taught clearly at university if you do a degree other than pure maths or statistics.
It doesn't help that, while significance is relatively easy to calculate using the well-known normal distribution formulae, power requires some more involved maths. Nothing incredibly hard, by the look of it, but I wonder how many non-statisticians/mathematicians know how to do it.
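For what it's worth, the "more involved maths" is not too bad in the simplest case. Here is a hedged sketch of the power of a two-sided two-sample test under a normal approximation; the effect size, sample sizes and alpha levels are purely illustrative, and it shows directly how tightening alpha at a fixed n costs power:

```python
# Sketch of the power calculation discussed above: a two-sided two-sample
# z-approximation. Tightening alpha (fewer false positives) lowers power
# (more false negatives) at a fixed sample size. Numbers are illustrative.
import numpy as np
from scipy.stats import norm

def power_two_sample(effect_size, n_per_group, alpha=0.05):
    """Approximate power of a two-sided two-sample z-test."""
    z_crit = norm.ppf(1 - alpha / 2)
    nc = effect_size * np.sqrt(n_per_group / 2)   # noncentrality parameter
    return norm.cdf(nc - z_crit) + norm.cdf(-nc - z_crit)

for alpha in (0.05, 0.01):
    for n in (20, 50, 100):
        print(f"alpha={alpha}, n={n}: "
              f"power = {power_two_sample(0.5, n, alpha):.2f}")
```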
 
  • #38
The idea he is talking about is that if you want to decrease one type of error, it will have the opposite effect on the other, since making the test stricter with respect to one hypothesis (null or alternative) impacts its ability to detect the other. You can see this relationship by examining the probability of reaching the correct conclusion together with the probabilities of false negatives and false positives.

The other thing is that the way these probabilities (the Type I/II errors) change is quite involved in clinical trials - much more so than in ordinary statistics. All sorts of clinical-trial constraints (including ethics) have to be taken into account, so that the test actually run is the minimal one that still achieves the required significance level.

When you do speak to them you should ask what sort of models in clinical trials would be appropriate and how that translates into mathematical constraints, test statistics and of course the significance level with respect to the different hypotheses - if they are well trained they should be able to tell you that.
 
Likes: WWGD
  • #39
lavoisier said:
the need to appreciate the difference between significance and power of a test.

Introducing "power" is the only way to make sense of non-Bayesian statistics.

For example, suppose you are testing whether the mean of population A is different from the mean of population B, and your test statistic X is the difference in sample means from the two populations. The natural procedure is to choose an "acceptance" region for the null hypothesis that is some interval containing zero. Your choice of p-value determines the size of this region.

But what is the logical justification for the (intuitively obvious) procedure of making the acceptance region for the null hypothesis an interval containing zero? Why not define the acceptance region as some collection of disjoint intervals scattered about the real number line? For example, if the desired significance level is ##\alpha = 0.05##, why not pick any old set of intervals such that the probability that X lands in one of them is 0.95? We could even omit any interval containing zero and still find other intervals whose probabilities add to 0.95.

Saying that the purpose of an acceptance region is to specify a set where X is likely to land if the null hypothesis is true doesn't explain why the acceptance region should be a single interval that contains the number zero instead of, say, 10 disjoint intervals, none of which contains zero. There are lots of different ways to pick a bunch of intervals whose total probability is 0.95.
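A quick numeric sketch of this point (illustrative numbers only): the two acceptance regions below both have probability of roughly 0.95 under a N(0,1) null, but the one that excludes zero has power below the significance level against alternatives on one side, i.e. it gives a biased test; it is exactly this power behaviour that singles out the usual central interval.

```python
# Two very different acceptance regions with ~0.95 probability under H0
# (X ~ N(0,1) here); only their power against alternatives distinguishes
# them. Values are illustrative only.
import numpy as np
from scipy.stats import norm

def prob_in_region(intervals, mu):
    """P(X falls in the union of intervals) when X ~ N(mu, 1)."""
    return sum(norm.cdf(b, loc=mu) - norm.cdf(a, loc=mu) for a, b in intervals)

central = [(-1.96, 1.96)]                    # the usual acceptance region
weird = [(-np.inf, -0.01), (0.01, 1.728)]    # excludes zero, ~0.95 under H0

for name, region in [("central", central), ("weird", weird)]:
    print(f"{name}: P(accept | H0) = {prob_in_region(region, 0.0):.3f}, "
          f"power at mu=+0.3: {1 - prob_in_region(region, 0.3):.3f}, "
          f"at mu=-0.3: {1 - prob_in_region(region, -0.3):.3f}")
```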
 
  • #40
Just following on from what Stephen Tashi mentioned - you can express power as a conditional probability, and it then becomes easier to see, via the Type I/II errors, not only how significance and power affect each other but also how they are optimized.

Usually (and I say usually) it's a simple function of the sample size, but it can get complicated, and in clinical trials this is analyzed very thoroughly, both because of how expensive clinical trials are and because ethics committees make sure you don't do anything unnecessary when biology is involved.

For reference - when power is involved, the quantity you are looking at is P(you pick H1 | H1 is actually true), i.e. the probability of correctly rejecting the null when the alternative really holds.
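One way to make that conditional probability concrete is a small Monte Carlo sketch: simulate data under a particular alternative, run the test each time, and count how often H1 is picked. The effect size, group size and alpha below are arbitrary illustrative choices.

```python
# Monte Carlo sketch of power as P(pick H1 | H1 is actually true):
# simulate under a fixed alternative, test repeatedly, count rejections.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(3)
alpha, n, true_shift = 0.05, 30, 0.5
trials, rejections = 2000, 0
for _ in range(trials):
    a = rng.normal(0.0, 1.0, n)            # group at the null mean
    b = rng.normal(true_shift, 1.0, n)     # group shifted as H1 says
    if ttest_ind(a, b).pvalue < alpha:
        rejections += 1

print("estimated power P(pick H1 | H1 true) =", rejections / trials)
```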
 
