# I Is it better to use a regression or a classification?

1. Apr 21, 2016

### lavoisier

Hi everyone,
I am looking at an interesting problem at work, and I would like to hear your opinion about it, please.

I'll try to explain it as concisely as I can.
We took about 2500 chemical compounds and we tested them in some 'phenotypic assays'. The assays give as output an activity value (pIC50 or pKd), a real number usually between 4 and 10. Let's call Pi,j the activity of compound i in phenotypic assay j.
We then searched the literature for the known activities (same units as above) of these compounds at some biological 'targets'. So we have for each compound a list of activities in a number of targets, let's call Ti,k the activity of compound i at target k.

You can imagine our current dataset as a table or matrix where each row is a compound i, and each column is a target k or a phenotypic assay j, and at the intersections you have the corresponding activity as defined above.
The table isn't full, there are many blanks, because not all compounds have an activity value, especially for the targets; for the phenotypic assays it's more complete.

The interesting part is that it is usually believed, in biology, that the activity in a phenotypic assay (which is a rather complex physiological system) is determined by the activity at one or more targets (which are the 'simpler' biochemical determinants of physiological effects).
So the hypothesis is that for each phenotypic assay Pj, there is a function Fj such that:

Pi,j = Fj(Ti,k1, Ti,k2, ..., Ti,kn)

We don't know how big n is, i.e. we don't know how many targets play a role in the activity of Pj, and we don't know if F is linear, logistic, etc, although it's pretty likely that it's linear to some extent.

What we want to do is use our dataset, where as described we have several Pj and Tk activities for each compound, to find which T's are most determinant for each Pj.

So, if we took the 'continuous' approach, and tried to fit each Fj as a linear function, we would probably take the targets having the regression coefficients with the lowest p-values as the most determinant ones for Pi,j.

What do you think of this approach? Do you foresee any difficulties or inconsistencies, apart from the assumption of linearity? Should we check the correlations between targets first?

One objection to using a regression was that a classification may be better at dealing with experimental uncertainty.
However, the problem with a classification model is that we would need to fix thresholds to classify each activity as "active" or "inactive".
The suggestion from the boss was to avoid arbitrary thresholds, and derive the thresholds from the data.

Here's where I got stuck (note that I'm not a statistician, I'm a chemist, but I'm moving to the computational/modelling department, so I am learning these concepts).
So I looked at the Wikipedia page for Naive Bayes Classifiers, and that clarified a bit the picture in my mind.
I made a simulated dataset, classified phenotypic activities by an arbitrary threshold of 7 (>=7 --> active, <7 --> inactive), as that's inevitable, and then tried to find a way to calculate the best threshold for each target.

Example. I took phenotypic assay P1, and calculated the mean and variance for target T1 in each activity class of P1.
For the active class (8 compounds / 20), mean = 9.25, variance = 0.8; for the inactive class (12 compounds / 20), mean = 6.42, variance = 1.5.

And here's the main question I still need to answer: can I calculate, from the above data, the best threshold for T1 that I can use to classify the activity of T1 for use in a future classifier (could be a classification tree, or something like an association rule mining routine)?
This of course would then be repeated for all Tk's.
I already tried something, namely calculating the probability of activity and inactivity given a threshold T0, as the integrated normal distributions, using Bayes' theorem, and then imposed the condition:
P(active| T>T0) = P(inactive| T<T0)
because this would be the threshold that separates equal probabilities of finding activity and inactivity.
It seems that I can only solve this by iteration, as it's like a sum of ERF's (is this correct?), and it gives me a value between the two means, a bit shifted toward one or the other depending on the prior probabilities of activity and inactivity, and the variances.

More in general, do you think this is a good way to look for a threshold for the classification of each target, or can you suggest a better approach?

Thanks!
L

2. Apr 23, 2016

### Stephen Tashi

It's important to know the proposed use for any statistical model. From those statements, I conclude that a primary use of the model would be to to predict $P_{i,j}$ for a compound $i$ that has been given some of the target assays, but not the phenotypic assay test $j$.

We can hold two contradictory intuitions about phenomena involving compounds. We can imagine that the phenomena depends only on the "gross" properties of matter, like weight, density, charge, etc. On the other hand, we might think of each compound has having some very unique property - like a shape. If the phenomena involves things fitting together, splitting apart, in ways that depend on shapes then (to me) it isn't clear that the dependence of a phenomena $P_{i,j}$ would be the same nice function $f_j(....)$ of other phenomena $T_{i,k}$ for each compound $i$.

If we accept the latter intuition, then the classification problem could involve classifying the compounds $i$ as well as the target assays. The scenario for using the model would be: Here is new compound $i$ and the results of some target assays $T_{i,k}$. These assays indicate that compound $i$ is "type B" and type B compounds have result $P_{i,j}$ predicted by the function $f_{B,j}(T_{i,k_1},T_{i,k_2},...)$.

3. Apr 23, 2016

### FactChecker

If your measured output is a real number where the relative values mean something (2 is twice as much as 1, 1.5 is 50% more of something than 1, etc. ), then it is probably better to use a regression. On the other hand, if they are just codes where their relative values are not meaningful, then you should not use regression.

4. Apr 24, 2016

### lavoisier

Thank you both very much for your replies.

@FactChecker : yes, the numbers are meaningful, and they are a direct measure of either a) the strength of the binding between compound and target (pKd) or b) -log10[concentration] of compound at which half of the maximal expected biological response is observed (pIC50). The larger the number, the stronger the binding and/or the lower the concentration at which the half maximal effect is observed. The number is in logarithmic scale to ensure additivity, so indeed, 4 and 5 are at the same 'distance' as 7 and 8, etc.

@Stephen Tashi : your point is very subtle, and if you're right about the first hypothesis, i.e. that we should expect a different relationship between target(s) activity and phenotypic activity for each compound, then this whole exercise is pointless. In fact, our team accepts that this may be the case, and that the data may eventually tell us exactly that. We want to know how we should best *test* that. It would be a shame to have collected all this data and then throw away potentially good information because we're not interpreting/analysing the data correctly.

To answer your specific questions, no, we are not trying to find a way to predict phenotypic activity from target activity/ies. It is the opposite we're looking for: we want to find which targets matter in producing a given phenotypic activity, i.e. find the simpler target determinants of the complex phenotypic response. In the future we plan to run new phenotypic assays with the same compounds, and use the method defined by this exercise to match the new phenotypic assay to the targets.
Indeed, as you subtly pointed out, this requires looking at compounds as 'tools', not as individual entities that can have more complex effects. So in a way, if two structurally different compounds i=1 and i=2 exist, which have exactly the same target activity vectors, for the purpose of this exercise we will consider them 'equivalent'. Of course this is a big assumption, because the two molecules are indeed different, and we don't have information about ALL possible targets they do hit; it could be that compound i=1 hits targets A, B, C, D, and i=2 hits targets A, B, C, E, and we only know about A, B and C, but not about D and E.
That's where (we thought) statistics should be able to help us. Suppose we test in a phenotypic assay P1 multiple compounds with different target activity vectors, and we find that the probability of the phenotypic assay being active is higher than random when a given target T1, in common between these compounds, is hit. Then by Bayes' theorem in theory we should be able to reverse that and calculate the probability that the target is hit when the phenotypic assay is active.

I don't know, maybe this is an excessively naive way of looking at it, and we'll get false correlations/associations.
The question is really how we should best do this to avoid being misled.

There are many reasons why we want to know if, and if so which targets are determining a given phenotypic activity.
One is that it is often easier and cheaper to test compounds in target assays than in phenotypic assays, so once this exercise and target identification is done, we can test new compounds in the target assay, and only run the phenotypic assay for compounds that pass the target assay filter. Target assays are usually also more stable, reproducible, and better at allowing 'structure-activity relationship' (SAR) elucidation.
Another reason is that in drug discovery it is often preferred to 'know' the 'mechanism' or 'mode of action' responsible for a given biological effect. In the (far) past people were happy to stick a compound directly into animals and see if it cured the disease, regardless of how it did so. Nowadays we prefer to know how, or at least get as close as possible to the true mechanism. This in principle should also allow to be more selective, prevent toxicity, etc.

Anyway, to be more concrete, I attached a table with a simulated example of a target+phenotypic dataset, where I have planted an exact relationship between P1 and two of the targets.

You can see that by using the approach suggested in the linked Wikipedia page, i.e. calculating the mean and variance in the activity subsets of P1, there is indeed some discrimination between targets that matter to P1 and those that don't.

I also tried linear regression and association rule mining methods on the dataset, and found back the same pattern.
Once the 'real' dataset is complete (biologists are still working on it to annotate targets etc), it will be interesting to see if patterns as clear cut as this emerge.

Note that our modeller is also looking at this, and he's going to use very sophisticated methods (he said something about 'multi-layer classifier' or something). In the meantime I am asked to look at it from my simpler point of view, to see if anything obvious comes up.

Thank you again for taking the time to comment on this, it's very useful to hear other people's points of view.
L

#### Attached Files:

• ###### Table.txt
File size:
448 bytes
Views:
81
5. Apr 24, 2016

### Stephen Tashi

I don't understand the exact meaning of "match the new phenotypic assay to the targets". One guess at your goal is that you are (eventually) interested in predicting pheonotypic activity from target activity, but that this involves the initial step of finding which targets are good predictors.

Perhaps the scenario to be "solved" is that you will be given a vector of desired values for some target activites and you want to find which phenotypic tests are most likely to show high activity on compounds whose activity on the list of target activities matches the desired values.

Returning to a question in you original post:

The goal still needs to be stated precisely. I can visualize the data like this: We have T1 activity plotted on the x-axis and P1 activity plotted on the y-axis. The data is a scattergram of points.

Consider a statement of the form: If T1 activity is equal or greater than t-zero then P1 activity will be equal or greater than p-zero and if T1 activity is less than t-zero then P1 activity will be less than p-zero.

There is a set of points for which the statement is true. We visualize a vertical line and and a horizontal line defined by the thresholds t-zero and p-zero. The statement is true for points in the upper right "quadrant" and the lower left "quadrant" defined by the two intersecting lines. The statement is false for the points in the other two "quadrants".

How do we define "best" thresholds ? If we consider t-zero and p-zero both to be variables then we try to find values that maximize the number of data points for which the statement is true. Or we might set p-zero to be a specific threshold of interest and ask which value to t-zero maximizes the number of data points that make the statement true.

Thinking of a joint probability distribution instead of discrete data points, the analogous problem is to maximize the sum of the two integrations of the joint probability density over the areas defined by the two "quadrants" where the statement is correct.

If you have a particular joint density in mind (e.g. bivariate normal or bivariate log normal) then there might be a clever way to calculate an answer.

------

If you consider activity levels (target and phenotypic) the only data, then I think you are dealing with a problem of cluster analysis for continuous variables. However, aren't there discrete and categorical variables that describe complicated chemical compounds ? (e.g. the "type" of a protein - "enzymatic", "hormonal" etc. ). If useful terminology has evolved in your field of study then it would be wise to include categorical variables in the database.

6. Apr 25, 2016

### lavoisier

Thank you @Stephen Tashi for your input.
The part where you describe how to find the threshold is very interesting to me.

Just this morning I tried the t-zero and p-zero approach using contingency tables and looking for the t-zero that minimised the Fisher exact test p-value (p-zero must be set arbitrarily, and I chose 7; others may be happy with 6). It does work, i.e. there is indeed a value of t-zero that gives me the best separation. Trouble is that, in my simulated dataset, even target T2, that was supposed to be uncorrelated with P1, gave a rather decent contingency table.

I also tried the Wikipedia approach, where the distribution of target activities is continuous and assumed to be normally distributed, whereas the phenotypic activity is still arbitrarily pre-classified into active and inactive. The PDF's for the probability of finding a given target activity T, given that the phenotypic assay is active (active subset) or given that it is inactive (inactive subset) are simply the classic bell shaped curves parametrised on the mean and variance of the two subsets, and multiplied by the prior probability of their own class: P(phenotypic=active) or P(phenotypic=inactive). I calculated the integrals of these PDF's from the appropriate sides: -inf to T0 for the active subset, and T0 to +inf for the inactive subset. [Of course this is when the mean for the active subset is larger than the mean for the inactive subset; otherwise I just swapped the PDF's]. Then I equated the two integrals and solved numerically for T0. That sort of worked, too, I think, and the threshold values were not too far from the contingency table approach ones, although I am still not sure if and how I should incorporate the information on the prior target probability.
If I understood correctly the Wikipedia example:
P( phenotypic=active | target>=T0 ) = P( phenotypic=active ) * P( target>=T0 | phenotypic=active ) / P( target>=T0 )
P( phenotypic=inactive | target<T0 ) = P( phenotypic=inactive ) * P( target<T0 | phenotypic=inactive ) / P( target<T0 )
The integrals described above should be the products at the numerator, but then I don't know how to calculate the denominators. Should I use the mean and variance of target activity over the whole set, i.e. without classification of the phenotypic assay, or still within each subset? And is the approach of equating the two right hand sides to find T0 correct?

As for your suggestion to use a 'joint' probability distribution, I am not sure I have come across one before, or maybe I have but I didn't know it was called that.
I will look it up to see what it looks like mathematically.

Concerning your initial doubt on what the goal of this work is, I am sorry I am not able to explain myself better, maybe I am misleading in my wording of the problem.
I will try to phrase it in the most mathematical way I can.
Suppose you are given a data table with N columns of what you consider 'independent' variables x1, x2, x3, ..., xN, and one column of what you consider a 'dependent' variable y. Let's assume that the whole table is filled with data, there are no blanks.
Now you launch R and use the linear model 'lm' function with an initial formula y ~ 1, within a bidirectional 'step' routine with scope ~x1 + x2 + x3 + ... + xN. R will look at all the independent variables one by one, and find which combination of them gives the best AIC.
[Statisticians, please do not shoot me down in flames, I know this is not the best practice, I am just explaining the problem].
You get as the final result the linear model: y ~ x2 + x5 + x6; these 3 independent variables each have a given coefficient (positive or negative, with its own error), t-value and significance.
What we will conclude from this is that y is probably determined by x2, x5 and x6, whereas the other independent variables in the table matter less, or don't add enough information. So in the future we may decide to look at the values of x2, x5 and x6 before trying to determine y.
I suppose this is indeed, as you say, an exercise where we try to predict y as a function of independent variables, but we might never use again the linear model itself: for us it's more important to discover the *identity* of the independent variables that are important for y, because there is some literature where sets of specific independent variables (in our case targets) are already known to 'belong together' in a coordinated sequence (in our case a biological pathway) as determinants of a given independent variable (in our case the phenotypic effect or a more complex physiological response).
That's partly why an association rule mining method would perhaps also suit us, even though it doesn't give us much quantitative information on how to combine target activities into phenotypic activities: it only reveals which items, taken together, increase the probability of another item being present.

I hope this is clearer. As for including more categorical variables that enrich the description of molecules and targets, indeed, we are keeping all the information in the final dataset. We will start with something simple where only activities are considered, and if that's not enough to make sense out of the data, we will probably include categories in the analysis. At the end of the day, we don't want to end up with something so complex that it can't be interpreted/inspected, or we'd go back to a classic QSAR-like black box model, which may be statistically superior but usually doesn't really help understanding the system from a human perspective.

7. Apr 26, 2016

### Stephen Tashi

It's an interesting problem, but keep in mind that this is not the problem treated in the Wikipedia article on Naive Bayesian classifier, which you linked. The Naive Bayesian classifier specifies a decision rule, so you don't have a "free choice" about a threshold T0 and you can't necessarily state the decision rule in terms of a single threshold. The Naive Bayesian classifier is a probability model plus a decision rule. Of course, we can go beyond the scope of that article and investigate the probability model with a different decision rule.

The decision rule in the article uses the "liklihood" $p(C_k) p(x_i|C_k)$ , which is a constant times the value of the probability density. No integration of that density is involved. Think of two bell shaped curves plotted on the same graph. One curve is the density for p( observed value of T = x | compound is active with respect to P) times the constant P(compound is active with respect to P). The other curve is p(observed value of T = x | compound is inactive with respect to P) times the constant P(compound is inactive with respect to T). In a "typical" case, the two curves will cross at only one point and that point will be a threshold. However, I can imagine a case where the curves cross at more than one point.

When you seek a compound that has a particular y value, is it as simple as "I want its y to be big" or "I want its y to be small" or does it get into refinements, like "I want its y value to be between 6.3 and 7" ?

We should investigate whether there are better models for such a situation than a linear model. As I understand your description, you have cases where one $x_i$ is a function of the value of some other $x_j's$. For example $x_2$ might be the output of a process that takes $x_4$ and $x_5$ as inputs.

8. Apr 27, 2016

### lavoisier

For the probability density, indeed, in the Wikipedia example they don't do any integration, they use the PDF itself.
But this gives the probability that y belongs to a specific class (in our case, phenotypic assay active or inactive) given that the x variable (in our case a target activity) has one specific value. So it's as if I asked, what is the probability that my P assay is active given that this target has a pIC50 of T0?
In my case, what I'm asking is, what is the probability that my P assay is active given that this target has a pIC50 equal to or greater than T0? And vice versa, what is the probability that my P assay is inactive given that this target has a pIC50 smaller than T0?
And that's, I think, where integration comes into play. After integration you have two S-shaped curves of opposite 'slope', that intersect only in one point, i.e. where, I guessed, the 'best' threshold T0 is.
Maybe one thing that I didn't clarify is that when a target x determines a phenotypic activity y, normally the two are related in an almost linear fashion, like y = a + b*x. It is possible though quite rare to find 'bell-shaped' responses, where the activity in the target is linearly related to the P activity up to a point, and then it starts to reverse its effect. Let's assume for now that our targets are well-behaved and determine P activities in a linear way.

As for the value of y, phenotypic assay activity, I would like to have a pIC50 greater than 7; others may say 6. In any case yes, it's a 'greater than' condition, not a range.

For your last point, actually no, even if multiple targets are part of a pathway, the potency of a compound in each of these targets is in principle unrelated.
When we want to alter a given biological process that has a known pathway, we usually look at the components of the pathway, i.e. all the (macro)molecules that determine the various steps. They usually are enzymes, receptors and ion channels, all of which are proteins (note that some other types of targets exist).
Example. A receptor takes the message from a molecule in the blood and initiates a cascade of reactions inside a cell where proteins or other substances are transformed, produced, destroyed, etc., where enzymes are activated or inactivated, etc. We can decide to act at the origin, by blocking the receptor from binding the messenger molecule. Or we can target one of the enzymes in the cascade, inhibiting its catalytic activity. And so on. But as each of these elements that we may decide to target is very different, so the chemical compound that blocks the receptor will probably be very different from the one that blocks the enzyme.
So I don't think you can relate the activity of one target to another, when measured in isolation (biochemically), based on the fact that they are on the same pathway.
On the other hand, the expression (i.e. ultimately the quantity) of a target in the pathway can indeed be related to what happens to the rest of them. So if your phenotypic assay has a given pathway, and you agonise or antagonise the receptor at the beginning of the cascade, then it's probable that some of the enzymes in the cascade will increase or decrease in concentration, producing a response that is not directly due to the action on the receptor, but indeed to the long sequence of events that follow. This doesn't change the fact that your chemical compound has only acted on the receptor, and is probably not active at all on any of the enzymes.

So if we discover that y is determined by x2, x5 and x6, our reasoning will be: these targets belong to pathways that affect our phenotypic response. Are they known in the literature to be part of one single pathway? If the answer is yes, then we have probably found 'the' pathway of our P assay, and compounds that block 3 of the targets in this pathway. If the answer is no, then it's possible that we have an unknown pathway with these 3 targets in, or at the other extreme, that each of them is part of a different pathway. Not a big problem for us, as long as we find something.

9. Apr 27, 2016

### Stephen Tashi

I agree. And to answer the question in an earlier post:

p(target>T0) and p(target<T0) would be calculated from the distribution of the target activity over the whole dataset. ( Hence the Wikipedia article remarks: "In practice, there is interest only in the numerator of that fraction, because the denominator does not depend on and the values of the features are given, so that the denominator is effectively constant.")

-----------

Ok, so you are using the "naive bayesian" probability model, but not the "naive bayesian" decision rule. The naive bayesian decision rule doesn't embody an idea like "T0 = 3.4 , so once we know that x is greater than 3.4 , we don't care how much bigger it is, for purposes of making a decison, we'll treat all cases greater than 3.4 the same".

The naive bayesian decision rule allows for the fact that their might be certain intervals of values of x that make a class $C_k$ probable interspersed with other intervals that make class $C_k$ improbable.

It's worth trying to state in what sense it is "best". It isn't necessarily the "best" decision rule out of all possible decision rules, but it is the "best" within the particular set of decision rules that are based on a threshold. It is the "best" value of T0 in the sense that it maximizes "The probability that if I randomly select a compound, that I classify it correctly using the threshold T0".

So if we apply this model to new compounds then we are assuming that a "typical" new compound can be modeled by randomly selecting a compound from those in our database.

When you say "determines a phenotypic activity" are you speaking of one given compound? I think that introduces a new variable - time ? Do we have y(t)=a+b*x(t) ? Or when you say "determines a phenotypic activity" , do you mean that if we plot the the data ( target i activty, phenotypic j activity) for many different compounds, the points are approximately on a line ?

I often hear a problem and, being lazy, get the feeling "Surely somebody has already studied this". So ( just daydreaming) imagine a mathematical model where the basic organization is described by a directed graph. Each "node" represents a process that produces something. The edges that come into the node represent "input" substances. The edges that go out of the node represent "output" substances. The way I imagine people studying this is that they began by making the simplifying assumption that the "outputs" are always linear functions of the inputs. Perhaps some forum member will say "Oh, what your are talking about is the theory of .... [something "well known"]".

In such a model, I visualize a phenotypic test as one that sets the initial inputs at some values and measures some "final" output of the total graph. I visualize a target test as one that is done on some subgraph.

I think such a model can be simplified to a single equation A = (M)(X) where A and X are vectors and M is matrix. But when approached from the view point of statistics, there might be a different M for each phenotypic test and each compound - and there might be data for various subgraphs (target tests) that are common to many compounds and phenotypic tests.

10. May 13, 2016

### chiro

Hey lavoisier.

Aside from what Stephen Tashi has commented on - what sort of domain knowledge are you using?

The ability to make inferences on all kinds of quantities (be them probabilities, means, medians, variances, and so on) is purely statistical.

But a statistical expert can't really do much for someone who wants it applied without expert knowledge.

Stephen Tashi has asked for clarification on many things and I definitely agree with him on this - but in addition to this you need to tell us what you think you know about the process itself and how you setup the experiment, collect data (including rejecting or transforming data) and what you are trying to answer given what you know (or postulate to know) about the processes themselves.

I'm imagining this is a bio-statistical analysis and that carries with it even more issues from a probabilistic and statistical stand-point.

Being able to estimate things, find correlations, find contributions of variance and even construct models is not useful without the above knowledge.

You mention that you want a human understanding - the only real way to do that is with the expert knowledge and that is going to contain a lot more than just the mathematical constraints that you impose on your data and the processes.

You should find that many statistical consultants will go through the same steps with their clients and it is necessary for them to extract this knowledge in as simple and intuitive terms as possible before you can even begin deciding what statistical techniques and logic to use.

11. May 13, 2016

### lavoisier

First of all I need to apologise for disappearing. I was away for a few days, then I was assigned a new task at work (in fact I'll post about it just now), but I will need to come back to this classification problem sooner or later.

I will then reply in detail to Stephen, who's so kindly taken the time to examine the problem.
The second option, i.e. points approximately on a line. There is always an element of time in our assays, phenotypic or not, but it's usually taken into account by the people who run the assays.

@chiro : thank you for cautioning me against mis/over-interpreting data. I can reassure you on this point. This analysis is not being carried out by myself alone. There is a team of scientists from several disciplines working on it, including biology, bio-statistics, modelling, computational biology, chemoinformatics (that'll be me), etc. And we are also aware of the approaches other groups worldwide have adopted to tackle this. Just the other day we had a presentation from a multinational CRO that is able to provide target/pathway finding services.
So I suppose it's fair to say that, taking the whole team together, we have at least 'some' knowledge of the problem we're dealing with, and are reasonably confident that we can give it a go, and see what comes out. Nobody here is going at this like that character from the Simpsons, shooting bullets in the air and shouting yippeee!
Perhaps you were misled by the description I gave, that is necessarily over-simplistic. This is because my role in this is to attempt an initial less sophisticated and less time-consuming approach to see if we can make anything out of the data we currently have. If this doesn't work, we have a backup plan that is actually already operational.
I turned to this forum because I often found very good insights and advice, e.g. from Stephen (and in my own little way I tried to provide advice to those who had questions I thought I could answer).
If you have specific, constructive advice to give on what should be done or not done, I'm open to hear it.
Thanks!
L

12. May 14, 2016

### chiro

I guess it's difficult to say more given the lack of domain knowledge (which I'm sure your team has but that is not communicated within this thread).

It sounds like there are a lot of specifics in what you are doing and I guess without these the only advice given is the standard statistical techniques which I'm sure your team is already fully aware of.

I wish you the best of luck for your project however.

13. May 17, 2016

### Stephen Tashi

Returning to title of the thread, if data for each compound with known properties is allowed to have "an equal vote" in determining what you do when you consider a new compound then you create a simple decision process for analyzing new compounds. For example, if you determine that, for the population of known compounds the most predictive target tests are T1,T2,T5 then you consider those tests most predictive for a new compound.

On the other hand, if you are able to find "clusters" of compounds within the set of known compounds, then it might be that tests T1,T2,T5 are most predictive for compounds in cluster A and test T2,T3,T4 are most predictive in compounds in cluster B. In that type analysis, when you consider a new compound, the first step is to classify which cluster it belongs to.

I find it useful to think about problems in terms of simulations. Even if no computer simulation is written, it is useful to imagine how you would write one. If the task was to write a simulation of how Nature creates the compounds that are relevant to your work, how would we write it? Suppose we aren't trying to simulate the details of molecules, etc. All we want to do is simulate creating a compound that has certain outcomes on phenotypic tests and certain outcomes on target tests. And we want the population of compounds that is created to confirm that the outcome of a phenotypic test has a linear relation to the outcome of some of the target tests.

14. May 17, 2016

### lavoisier

Thanks Stephen.
I take it you're suggesting to cluster by structure (or other intrinsic molecular properties).
It may be that within certain clusters different target activity profiles are required to determine a given phenotypic response.
However, for me clustering is equivalent to incorporating additional variables to the analysis, i.e. structure or whatever the clustering criterion is.
Most groups do seem to do that, or even more actually, e.g. they build QSAR models to estimate missing target activity data, to compute probabilities etc.
I guess they do that because it improves the predictive power of their methods.

I don't know, I would have liked to think that once you test a compound in a target assay, it's the activity at the target that counts, not what the compound is (and I'm a chemist! I must have gone mad...). Otherwise it's like saying that the phenotypic activity is better described by the concatenation of target_activity+structure, not target_activity alone, or in other words that there are features of the structure that affect the response and are not sufficiently captured by however many target assays you run.
But doesn't this defeat the object of this analysis, i.e. using compounds as 'tools' to establish a link between targets and phenotypic responses?
In a way I see the value of this, because it's true that the same phenotypic response can be caused by multiple targets or groups of targets, and similar compounds (i.e. those that belong to the same cluster) are more likely to hit the same group of targets.
My hope is that an association rule mining method will be able to spot different 'relevant' target activity profiles within the whole dataset, without the need for any artificial and arbitrary pre-clustering. Maybe I just need to keep the support threshold low and look more at some measure of association, like Fisher.

For the simulation, I'm not sure what you mean exactly, but for practical purposes I guess what one could do is write a dataset as follows:
- column of N compound identifiers
- column of nT target identifiers
-> make cartesian product of the 2, yielding a dataset of N*nT rows, where each compound has each target assay assigned
- add a column of assay potencies, say normally distributed reals between 4 and 9, maybe interspersing NA to simulate missing data
-> pivot the dataset in such a way that the resulting N rows are still compound identifiers, the nT columns are the potencies in each column title target
- in theory some targets could be inter-correlated, so it may be worth defining a few additional target columns as functions of another target, e.g. T10 = 0.5*T1 + 2 + random_error(between -1 and 1)
- calculate new columns (phenotypic assays) as linear functions of some target columns, e.g. P1 = 0.3*T1 + 0.7*T2 + random_error(between -1 and 1); probably best if the target coefficients sum to 1
- some phenotypic columns could be defined as random, to simulate what happens if there is no correlation between the targets and phenotypic assays

I think this would give you an example of a dataset like the one we are trying to analyse. To be honest, this one would be quite well behaved, not very realistic, but it would be a start to try out some methods. I might give it a shot when I find the time.

Thanks!
L