lavoisier
Hi everyone - I guess this will be a very basic question for you guys, but I'm no statistician and I need your advice, please.
At work (drug discovery, medicinal chemistry) we often get a lot of data, and we expect to find correlations between them. I'm afraid we're not always as good as we could/should be at looking for such correlations.
Example. We run a biochemical assay for a certain target, which gives us a 'potency' value. Then we run the corresponding cellular assay, which is known by experience to deviate a bit from the biochemical one, but is usually correlated to it. For instance, if B is the biochemical potency, the cell potency C can usually be fitted to a function like Ln(C) = m+n*Ln(B), where m and n are constants and the linear correlation coefficient between Ln(C) and Ln(B) is relatively high.
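For concreteness, here is how that log-log fit could be run outside Excel — a minimal sketch in Python/NumPy, with invented potency values standing in for real assay data (the slope, intercept, and noise level are assumptions for illustration, not our actual numbers):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical potencies: B is the biochemical potency, C the cell potency.
# The true relationship Ln(C) = 0.5 + 0.9*Ln(B) plus noise is invented here.
B = 10 ** rng.uniform(-9, -6, 50)                    # e.g. molar IC50 values
lnC = 0.5 + 0.9 * np.log(B) + rng.normal(0, 0.3, 50)

n, m = np.polyfit(np.log(B), lnC, 1)                 # slope n, intercept m
r = np.corrcoef(np.log(B), lnC)[0, 1]
print(f"m = {m:.2f}, n = {n:.2f}, R2 = {r**2:.2f}")
```

A degree-1 `polyfit` on the log-transformed values is exactly the ordinary least-squares fit of Ln(C) = m + n*Ln(B) described above.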
There are, however, many cases where the correlation between Ln(B) and Ln(C) is quite bad. In such cases we invoke other effects, like cell permeability, solubility, etc., but I've never seen a rigorous follow-up, like trying to determine quantitatively if and how these additional variables *concertedly* influence Ln(C).
People will often just look at pairs of variables: for instance, they will plot Ln(C) versus permeability, find no good correlation, and conclude that the two are not linked.
But I think this approach is wrong, because when a given endpoint is determined by a combination of multiple independent variables, one can't expect each single variable to be necessarily well correlated to the endpoint, right?
I made a simulation in Excel to test this. I generated two columns of random numbers between 0 and 1, and called them x and y. I verified that x and y were not correlated (R2=0). Then I calculated z=x*y and plotted z vs x and z vs y. In both cases the correlation was quite poor (R2 about 0.4).
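The same experiment is easy to reproduce outside Excel — a sketch in NumPy rather than a spreadsheet. (The theoretical R2 for z vs. either variable is 3/7 ≈ 0.43, consistent with the "about 0.4" seen above.)

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
x = rng.uniform(0, 1, n)   # two independent uniform columns,
y = rng.uniform(0, 1, n)   # as in the Excel simulation
z = x * y                  # z is fully determined by (x, y)

def r2(a, b):
    """Squared Pearson correlation between a and b."""
    return np.corrcoef(a, b)[0, 1] ** 2

print(f"R2(x, y) = {r2(x, y):.3f}")   # ~0: x and y are uncorrelated
print(f"R2(z, x) = {r2(z, x):.3f}")   # ~0.43: poor, despite z = x*y
print(f"R2(z, y) = {r2(z, y):.3f}")   # ~0.43
```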
In this case I know that I would find a perfect correlation if I tried to fit Ln(z) to a linear combination of Ln(x) and Ln(y). But that's because I made z myself. In 'real' life I can at most presume that some variables have an impact on my endpoint, but I don't know if or to what extent this is the case, and if I wanted to use a multivariate regression, I wouldn't know what mathematical form to use (linear, logarithmic...).
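To illustrate: once a multiplicative form is hypothesized, taking logs turns the fit into an ordinary multiple linear regression, and on the simulated data the "perfect correlation" indeed appears — a sketch:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10_000
x = rng.uniform(0, 1, n)
y = rng.uniform(0, 1, n)
z = x * y

# Regress Ln(z) on [1, Ln(x), Ln(y)] by ordinary least squares.
A = np.column_stack([np.ones(n), np.log(x), np.log(y)])
coef, *_ = np.linalg.lstsq(A, np.log(z), rcond=None)
print(coef)  # ≈ [0, 1, 1]: Ln(z) = 0 + 1*Ln(x) + 1*Ln(y), a perfect fit
```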
It's striking that even in a fictional dataset where a perfect relationship between {x,y} and z exists, the usual 2D visual inspection we normally do won't reveal anything particularly exciting about the data.
So here's the question.
If I have an endpoint z, which I believe may be determined by one or more independent variables w, x, y..., is there a statistical technique that would allow me to measure to what extent each independent variable affects z? And would such a technique also reveal what type of dependency exists between them?
The only technique I have used in some cases is multivariate regression, which requires a prior hypothesis about the mathematical relationship between the variables.
I seem to recall from some discussions I had in the past that principal component analysis may help explore the data in a more unbiased way. Some software we use does that based on molecular structure descriptors. But I wouldn't know how to do that on numerical data, or what the result would tell me about the original variables.
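On plain numerical data, PCA is just a singular value decomposition of the column-standardized data matrix — no structure descriptors needed. A minimal sketch with invented columns (the variable names, and the correlation deliberately built into lnB, are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
nobs = 500
# Hypothetical assay table: permeability, solubility, and a log-potency
# that is constructed here to correlate strongly with permeability.
perm = rng.normal(0, 1, nobs)
sol = rng.normal(0, 1, nobs)
lnB = 0.8 * perm + 0.2 * rng.normal(0, 1, nobs)
X = np.column_stack([lnB, perm, sol])

# Standardize each column, then take the SVD: this is PCA.
Xs = (X - X.mean(axis=0)) / X.std(axis=0)
U, s, Vt = np.linalg.svd(Xs, full_matrices=False)
explained = s**2 / (s**2).sum()
print("explained variance ratios:", explained.round(2))
# Rows of Vt are the principal components; their entries ("loadings")
# say how much each original variable contributes to each component.
print(Vt.round(2))
```

The loadings are what the result tells you about the original variables: here the first component loads heavily on lnB and perm together, flagging them as a correlated pair, while sol dominates a component of its own. PCA is exploratory, though — it groups correlated variables but doesn't by itself model their effect on an endpoint.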
Sorry for the long post, I wanted to give you the context of the problem and explain why it's important to us.
Thanks!
L