
## Main Question or Discussion Point

Hi everyone,

I am looking at an interesting problem at work, and I would like to hear your opinion about it, please.

I'll try to explain it as concisely as I can.

We took about 2500 chemical compounds and we tested them in some 'phenotypic assays'. The assays give as output an activity value (pIC_{50} or pK_{d}), a real number usually between 4 and 10. Let's call P_{i,j} the activity of compound i in phenotypic assay j.

We then searched the literature for the known activities (same units as above) of these compounds at some biological 'targets'. So we have, for each compound, a list of activities at a number of targets; let's call T_{i,k} the activity of compound i at target k.

You can imagine our current dataset as a table or matrix where each row is a compound i, and each column is a target k or a phenotypic assay j, and at the intersections you have the corresponding activity as defined above.

The table isn't full, there are many blanks, because not all compounds have an activity value, especially for the targets; for the phenotypic assays it's more complete.

The interesting part is that it is usually believed, in biology, that the activity in a phenotypic assay (which is a rather complex physiological system) is determined by the activity at one or more targets (which are the 'simpler' biochemical determinants of physiological effects).

So the hypothesis is that for each phenotypic assay P_{j} there is a function F_{j} such that:

P_{i,j} = F_{j}(T_{i,k1}, T_{i,k2}, ..., T_{i,kn})

We don't know how big n is, i.e. we don't know how many targets play a role in the activity of P_{j}, and we don't know if F is linear, logistic, etc., although it's pretty likely that it's linear to some extent.

What we want to do is use our dataset, where as described we have several P_{j} and T_{k} activities for each compound, to find which T's are most determinant for each P_{j}.

So, if we took the 'continuous' approach and tried to fit each F_{j} as a linear function, we would probably take the targets whose regression coefficients have the lowest p-values as the most determinant ones for P_{j}.

What do you think of this approach? Do you foresee any difficulties or inconsistencies, apart from the assumption of linearity? Should we check the correlations between targets first?

One objection to using a regression was that a classification may be better at dealing with experimental uncertainty.

However, the problem with a classification model is that we would need to fix thresholds to classify each activity as "active" or "inactive".

The suggestion from the boss was to avoid arbitrary thresholds, and derive the thresholds from the data.

Here's where I got stuck (note that I'm not a statistician, I'm a chemist, but I'm moving to the computational/modelling department, so I am learning these concepts).

So I looked at the Wikipedia page for Naive Bayes Classifiers, and that clarified a bit the picture in my mind.

I made a simulated dataset, classified phenotypic activities by an arbitrary threshold of 7 (>=7 --> active, <7 --> inactive), as that's inevitable, and then tried to find a way to calculate the best threshold for each target.

Example. I took phenotypic assay P_{1}, and calculated the mean and variance of target T_{1} in each activity class of P_{1}.

For the active class (8 compounds / 20), mean = 9.25, variance = 0.8; for the inactive class (12 compounds / 20), mean = 6.42, variance = 1.5.
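For what it's worth, if you model each class with those normals, there is also a closed-form threshold: the point where the two prior-weighted class densities cross (the usual Gaussian naive-Bayes decision boundary), which with unequal variances is a quadratic in t. A sketch using the example numbers above (these are the post's toy numbers, not real data):

```python
# Threshold where p(active) * N(t; mu_a, var_a) = p(inactive) * N(t; mu_i, var_i).
# Equating the log densities gives a quadratic a*t^2 + b*t + c = 0.
import numpy as np

mu_a, var_a, p_a = 9.25, 0.8, 8 / 20   # active class (example numbers)
mu_i, var_i, p_i = 6.42, 1.5, 12 / 20  # inactive class (example numbers)

a = 1 / var_i - 1 / var_a
b = 2 * (mu_a / var_a - mu_i / var_i)
c = (mu_i**2 / var_i - mu_a**2 / var_a
     + np.log(var_i / var_a) + 2 * np.log(p_a / p_i))

roots = np.roots([a, b, c])
# keep the root that lies between the two class means
t_star = next(r.real for r in roots if mu_i < r.real < mu_a)
print(round(t_star, 2))
```

This is a different criterion from the tail-based condition further down (it balances densities at a point rather than integrated probabilities), but it needs no iteration.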

And here's the main question I still need to answer: can I calculate, from the above data, the best threshold for T_{1} that I can use to classify the activity of T_{1} for use in a future classifier (could be a classification tree, or something like an association rule mining routine)?

This of course would then be repeated for all T_{k}'s.

I already tried something, namely calculating the probability of activity and inactivity given a threshold T_{0}, as the integrated normal distributions, using Bayes' theorem, and then imposed the condition:

P(active | T > T_{0}) = P(inactive | T < T_{0})

because this would be the threshold that separates equal probabilities of finding activity and inactivity.

It seems that I can only solve this by iteration, as the condition involves a sum of erf terms (is this correct?), and it gives me a value between the two class means, shifted a bit toward one or the other depending on the prior probabilities of activity and inactivity and on the variances.
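That matches what a numerical solver finds. A sketch of the tail-balance condition solved by root-finding, again using the example class statistics from above (hypothetical numbers, not real data):

```python
# Solve P(active | T > t0) = P(inactive | T < t0) for t0 numerically,
# with normal class-conditional distributions and the example priors.
import numpy as np
from scipy.stats import norm
from scipy.optimize import brentq

mu_a, var_a, p_a = 9.25, 0.8, 8 / 20   # active class (example numbers)
mu_i, var_i, p_i = 6.42, 1.5, 12 / 20  # inactive class (example numbers)

def posterior_gap(t0):
    above_a = norm.sf(t0, mu_a, np.sqrt(var_a))   # P(T > t0 | active)
    above_i = norm.sf(t0, mu_i, np.sqrt(var_i))   # P(T > t0 | inactive)
    p_above = p_a * above_a + p_i * above_i       # P(T > t0)
    p_below = 1.0 - p_above                       # P(T < t0)
    post_active = p_a * above_a / p_above             # P(active | T > t0)
    post_inactive = p_i * (1.0 - above_i) / p_below   # P(inactive | T < t0)
    return post_active - post_inactive

# the gap changes sign between the two class means, so bracket the root there
t_star = brentq(posterior_gap, mu_i, mu_a)
print(round(t_star, 3))
```

The "sum of erf's" intuition is right: each `norm.sf` term is an erf expression, so the condition has no elementary closed form and a bracketing root-finder like `brentq` is a reasonable way to solve it.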

More generally, do you think this is a good way to look for a threshold for the classification of each target, or can you suggest a better approach?

Thanks!
