- #1

lavoisier

- 177

- 24

I am looking at an interesting problem at work, and I would like to hear your opinion about it, please.

I'll try to explain it as concisely as I can.

We took about 2500 chemical compounds and we tested them in some 'phenotypic assays'. The assays give as output an activity value (pIC

_{50}or pK

_{d}), a real number usually between 4 and 10. Let's call P

_{i,j}the activity of compound i in phenotypic assay j.

We then searched the literature for the known activities (same units as above) of these compounds at some biological 'targets'. So we have for each compound a list of activities in a number of targets, let's call T

_{i,k}the activity of compound i at target k.

You can imagine our current dataset as a table or matrix where each row is a compound i, and each column is a target k or a phenotypic assay j, and at the intersections you have the corresponding activity as defined above.

The table isn't full, there are many blanks, because not all compounds have an activity value, especially for the targets; for the phenotypic assays it's more complete.

The interesting part is that it is usually believed, in biology, that the activity in a phenotypic assay (which is a rather complex physiological system) is determined by the activity at one or more targets (which are the 'simpler' biochemical determinants of physiological effects).

So the hypothesis is that for each phenotypic assay P

_{j}, there is a function F

_{j}such that:

P

_{i,j}= F

_{j}(T

_{i,k1}, T

_{i,k2}, ..., T

_{i,kn})

We don't know how big n is, i.e. we don't know how many targets play a role in the activity of P

_{j}, and we don't know if F is linear, logistic, etc, although it's pretty likely that it's linear to some extent.

What we want to do is use our dataset, where as described we have several P

_{j}and T

_{k}activities for each compound, to find which T's are most determinant for each P

_{j}.

So, if we took the 'continuous' approach, and tried to fit each F

_{j}as a linear function, we would probably take the targets having the regression coefficients with the lowest p-values as the most determinant ones for P

_{i,j}.

What do you think of this approach? Do you foresee any difficulties or inconsistencies, apart from the assumption of linearity? Should we check the correlations between targets first?

One objection to using a regression was that a classification may be better at dealing with experimental uncertainty.

However, the problem with a classification model is that we would need to fix thresholds to classify each activity as "active" or "inactive".

The suggestion from the boss was to avoid arbitrary thresholds, and derive the thresholds from the data.

Here's where I got stuck (note that I'm not a statistician, I'm a chemist, but I'm moving to the computational/modelling department, so I am learning these concepts).

So I looked at the Wikipedia page for Naive Bayes Classifiers, and that clarified a bit the picture in my mind.

I made a simulated dataset, classified phenotypic activities by an arbitrary threshold of 7 (>=7 --> active, <7 --> inactive), as that's inevitable, and then tried to find a way to calculate the best threshold for each target.

Example. I took phenotypic assay P

_{1}, and calculated the mean and variance for target T

_{1}in each activity class of P

_{1}.

For the active class (8 compounds / 20), mean = 9.25, variance = 0.8; for the inactive class (12 compounds / 20), mean = 6.42, variance = 1.5.

And here's the main question I still need to answer: can I calculate, from the above data, the best threshold for T

_{1}that I can use to classify the activity of T

_{1}for use in a future classifier (could be a classification tree, or something like an association rule mining routine)?

This of course would then be repeated for all T

_{k}'s.

I already tried something, namely calculating the probability of activity and inactivity given a threshold T

_{0}, as the integrated normal distributions, using Bayes' theorem, and then imposed the condition:

P(active| T>T

_{0}) = P(inactive| T<T

_{0})

because this would be the threshold that separates equal probabilities of finding activity and inactivity.

It seems that I can only solve this by iteration, as it's like a sum of ERF's (is this correct?), and it gives me a value between the two means, a bit shifted toward one or the other depending on the prior probabilities of activity and inactivity, and the variances.

More in general, do you think this is a good way to look for a threshold for the classification of each target, or can you suggest a better approach?

Thanks!

L