How to find which variables w, x, y.... 'most' influence z?

In summary: when an endpoint z depends on several independent variables w, x, y, ..., looking at one pair of variables at a time can fail to reveal any correlation. The replies recommend stepwise multiple linear regression (available in R) to identify the significant variables, and explain why principal component analysis serves a different purpose: it describes the scatter of the data rather than predicting a chosen variable.
  • #1
lavoisier
Hi everyone - I guess this will be a very basic question for you guys, but I'm no statistician and I need your advice, please.

At work (drug discovery, medicinal chemistry) we often get a lot of data, and we expect to find correlations between them. I'm afraid we're not always as good as we could/should be at looking for such correlations.
Example. We run a biochemical assay for a certain target, which gives us a 'potency' value. Then we run the corresponding cellular assay, which is known by experience to deviate a bit from the biochemical one, but is usually correlated to it. For instance, if B is the biochemical potency, the cell potency C can usually be fitted to a function like Ln(C) = m+n*Ln(B), where m and n are constants and the linear correlation coefficient between Ln(C) and Ln(B) is relatively high.
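[In R terms (which come up later in this thread), that fit is just a straight line on the log scale; a minimal sketch, with B and C as placeholder vectors of potencies:

```r
# Fit Ln(C) = m + n*Ln(B); B and C are placeholder potency vectors
# (assumed positive, as potencies are)
fit <- lm(log(C) ~ log(B))
coef(fit)               # m (intercept) and n (slope)
summary(fit)$r.squared  # how strong the log-log correlation is
```
]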

There are, however, many cases where the correlation between Ln(B) and Ln(C) is quite bad. In such cases we invoke other effects, like cell permeability, solubility, etc., but I've never seen a rigorous follow-up, like trying to determine quantitatively if and how these additional variables *concertedly* influence Ln(C).
People will often just look at pairs of variables, for instance they will plot Ln(C) versus permeability, find no good correlation and conclude that the two are not linked.
But I think this approach is wrong, because when a given endpoint is determined by a combination of multiple independent variables, one can't expect each single variable to be necessarily well correlated to the endpoint, right?

I made a simulation in Excel to test this. I generated two columns of random numbers between 0 and 1, and called them x and y. I verified that x and y were not correlated (R^2 = 0). Then I calculated z = x*y and plotted z vs x and z vs y. In both cases the correlation was quite poor (R^2 about 0.4).
In this case I know that I would find a perfect correlation if I tried to fit Ln(z) to a linear combination of Ln(x) and Ln(y). But that's because I made z myself. In 'real' life I can at most presume that some variables have an impact on my endpoint, but I don't know if or to what extent this is the case, and if I wanted to use a multivariate regression, I wouldn't know what mathematical form to use (linear, logarithmic...).
It was shocking to see that even in a fictional dataset where a perfect relationship between {x, y} and z exists, the usual 2D visual inspection we normally do won't reveal anything particularly exciting about the data.
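[The same simulation can be reproduced in R; a minimal sketch, with variable names chosen to match the description above:

```r
set.seed(1)                      # for reproducibility
n <- 100
x <- runif(n)                    # random numbers in (0, 1)
y <- runif(n)
z <- x * y

cor(x, y)^2                      # ~0: x and y are uncorrelated
summary(lm(z ~ x))$r.squared     # poor fit of z vs x alone
summary(lm(z ~ y))$r.squared     # poor fit of z vs y alone

# The true relationship is multiplicative, so it is exactly linear
# on the log scale: ln(z) = ln(x) + ln(y). R reports a perfect fit.
summary(lm(log(z) ~ log(x) + log(y)))$r.squared
```
]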

So here's the question.
If I have an endpoint z, which I believe may be determined by one or more independent variables w, x, y..., is there a statistical technique allowing me to measure to what extent each independent variable affects z? And would such a technique also reveal what type of dependency exists between them?

The only technique I used in some cases is multivariate regression, which requires a prior hypothesis of the mathematical relationship existing between the variables.
I seem to recall from some discussions I had in the past that principal component analysis may help explore the data in a more unbiased way. Some software we use does that based on molecular structure descriptors. But I wouldn't know how to do that on numerical data, or what the result would tell me about the original variables.

Sorry for the long post, I wanted to give you the context of the problem and explain why it's important to us.

Thanks!
L
 
  • #2
There are techniques to correlate multiple variables, but if the functional relationships aren't linear, then some things can be missed.

I usually start by running the linear correlations and then working with the smaller group of variables that has the most promising linear correlations.
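[A minimal sketch of that screening step in R, assuming the data sit in a data frame df whose (all numeric) columns are the candidate variables plus the endpoint z; all names illustrative:

```r
# Linear correlation of every candidate variable with the endpoint z
r_with_z <- cor(df)[, "z"]
r_with_z <- r_with_z[names(r_with_z) != "z"]   # drop z vs itself

# Rank candidates by strength of linear association; keep the most
# promising ones for closer (possibly non-linear) modelling
sort(abs(r_with_z), decreasing = TRUE)
```
]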

But I am also quick to explore the possibilities of non-linear models and relationships.
 
  • #3
The main difference between regression and principal component analysis is this:
Regression looks for a model that explains one particular variable, y, in terms of the other variables, x_i.
Principal component analysis looks for the single combination of all the variables (no special variable y) that gives the best-fitting line through the scattered values of (x_1, x_2, ...). Because it does not identify a special variable, y, the principal component might be a very bad predictor of any one variable, y, that you want to predict. In other words, the thing you want to predict might get lost in the mix.

So you should use the technique that fits your objective best. Are you trying to explain / predict one thing, y, based on the others, x_1, x_2, x_3, ..., or are you trying to describe the scatter of the data (x_1, x_2, x_3, ...)?

Stepwise linear regression can be used very well to fit a lot of models. Remember that "linear" means each term is multiplied by a real coefficient and the terms are added, not that all the variables x_i enter linearly. So you can include terms like x_i, ln(x_i), x_i*x_j, etc. in the "linear" regression model to predict y. Then you can generate a similar model to estimate ln(y) and see which works best. If the scatter of the y values tends to grow in proportion to the magnitude of y, that means the random component is a percentage of y. That implies that ln(y) is the correct model.
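[A brief R sketch of that idea, with df, x1, x2 and y as illustrative names: the model stays linear in the coefficients even though the terms are non-linear in the variables.

```r
# "Linear" regression with non-linear terms: linear in the
# coefficients, not in the original variables (x1 assumed positive)
fit_y   <- lm(y ~ x1 + log(x1) + I(x1 * x2), data = df)
# Competing model for ln(y); compare the two on residual behaviour
fit_lny <- lm(log(y) ~ x1 + log(x1) + I(x1 * x2), data = df)

summary(fit_y)
# If the residual scatter grows with the size of the fitted values,
# the multiplicative-error (log) model is the better choice
plot(fitted(fit_y), resid(fit_y))
```
]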
 
  • #4
Thank you all for your replies.
So I seem to understand that it's a trial-and-error process: a linear multivariate regression is what I could try as a first assessment, and then refine it step by step.

I tried a simple linear regression z = a*x + b*y + c on my simulated z = x*y dataset. I got an R^2 of 0.89, and similar coefficients for x and y:
z = 0.45*x + 0.49*y - 0.23.
F is 65.2 with df = 16. [I guess I could calculate the significance of the correlation from the last two.]
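[Assuming the F statistic comes from the model with 2 predictors, the degrees of freedom are 2 and 16, and the significance follows directly from the F distribution; in R (summary(lm(z ~ x + y)) would report the same numbers):

```r
# p-value of the overall regression from F = 65.2 on (2, 16) df
pf(65.2, df1 = 2, df2 = 16, lower.tail = FALSE)
# a tiny value, so the fitted relationship is highly significant
```
]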

Am I correct in assuming that it's the absolute value of the coefficients that I should look at to decide which variables matter most?
So for instance if I had z = 0.01*w -0.2*x +0.5*y + 2, I should keep x and y, and discard w, in the next trial, right?
Or does it depend on the form of the equation, too? I seem to remember reading that a coefficient should be evaluated based on the function it multiplies.
So for instance in z = 0.5*w^2 + 0.5*x + 0.5*ln(y), maybe the three coefficients of 0.5 haven't got the same impact on the three variables. Or should I consider ln(y) and w^2 as new variables in their own right?

Thanks
L
 
  • #5
lavoisier said:
Am I correct in assuming that it's the absolute value of the coefficients that I should look at to decide which variables matter most?
No. The magnitude of a coefficient is greatly influenced by the units of the variable, which has nothing to do with whether it belongs in the regression. There are ways to decide statistically which variables are the most significant. There are automated tools that implement stepwise multiple linear regression and end up with only the significant variables; R, for example, offers several such methods.

PS. Stepwise is the best; it is better than Forward or Backward. For instance, Forward regression would put in the best single variable first. Then, after a couple more variables are added, the first one might not add as much as the combination of the later variables, so it would be good to remove that first variable. Stepwise would do that, but Forward would not.
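[A minimal sketch with base R's step(), assuming a data frame df with the endpoint y and candidate predictors x1, x2, x3 (all names illustrative):

```r
# Start from the empty model and let stepwise selection add and
# drop terms, searching between the null and the full model
null <- lm(y ~ 1, data = df)
full <- lm(y ~ x1 + x2 + x3, data = df)

fit <- step(null, scope = formula(full), direction = "both", trace = 0)
summary(fit)   # only the terms that survived the selection
```
]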
 
  • #6
That's great, thank you!
I downloaded R and its manuals, I'll see if I can get my head around it.
Do you think I could put, say, x and x^2 in the list of independent variables, or do the IVs have to be uncorrelated?

And another question: do you know if there are tools in R that allow one to measure the correlation or association within a mix of categorical and continuous variables?

Thank you again very much for your advice, it's very useful to me.
 
  • #7
lavoisier said:
Do you think I could put, say, x and x^2 in the list of independent variables, or do the IVs have to be uncorrelated?
You can put both x and x^2 in as IVs. The stepwise regression will determine whether one or both variables should be included. It handles correlated variables well: if it has already included x, it will see if the remaining, unexplained part of y is explained by the remaining, uncorrelated part of x^2.
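[For illustration, in R the I() wrapper lets x and x^2 enter the same model as two candidate terms (names assumed):

```r
# x and x^2 offered together; selection decides which survive
fit <- lm(y ~ x + I(x^2), data = df)
summary(fit)
```
]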
And another question, do you know if there are tools in R that allow to measure the correlation or association within a mix of categorical and continuous variables?
You can always represent a category as a {0,1} variable, where 0 means it is not in the category and 1 means it is. Do not code a 3-category partitioning as {0, 1, 2} because that implies an ordered relationship of the categories, where the third category is twice as large as the second category. Instead, represent each of the last two categories as a separate {0,1} variable (both being 0 means it is in the first category).
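[In R, storing the variable as a factor does this dummy coding automatically; a small sketch with invented category names:

```r
# A 3-level categorical variable
category <- factor(c("A", "B", "C", "B", "A"))

# model.matrix shows the {0,1} dummy columns R generates:
# the first level "A" is the baseline (both dummies 0)
model.matrix(~ category)
```
]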
 
  • #8
Thank you!
IT installed R on my PC at work; I'll try it out when I find the time. The syntax is really different from what I'm used to...
 

1. How do I determine which variables have the most influence on a specific outcome?

To find which variables have the most influence on a specific outcome, you can use statistical methods such as regression analysis or correlation analysis. These methods will help you identify the strength and direction of the relationship between the variables and the outcome.

2. Can I use machine learning algorithms to determine which variables have the most influence?

Yes, machine learning algorithms such as decision trees and random forests can also be used to identify the most influential variables. These algorithms use a combination of statistical techniques and data mining to determine the importance of each variable in predicting the outcome.
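For example, a sketch using the randomForest package in R (assumed installed; df, z and the predictor columns are placeholders):

```r
library(randomForest)

# Fit a random forest and ask it to track variable importance
rf <- randomForest(z ~ ., data = df, importance = TRUE)

# Mean decrease in accuracy / node impurity for each variable
importance(rf)
varImpPlot(rf)
```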

3. What is the difference between correlation and causation in determining variable influence?

Correlation refers to the strength and direction of the relationship between two variables, while causation refers to the effect that one variable has on another. Correlation does not always imply causation, so it is important to consider other factors, such as confounding variables, when determining variable influence.

4. Is there a specific statistical measure I should use to determine variable influence?

The choice of statistical measure for determining variable influence depends on the type of data and the nature of the relationship between the variables and the outcome. Some commonly used measures include Pearson's correlation coefficient, R-squared, and p-value.
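As an illustration, base R reports Pearson's r together with its p-value and confidence interval in a single call (x and y are placeholder vectors):

```r
# Pearson correlation, p-value and 95% confidence interval
cor.test(x, y, method = "pearson")
```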

5. How can I interpret the results of a variable influence analysis?

The results of a variable influence analysis will typically include a measure of the strength of the relationship between each variable and the outcome, as well as its direction (positive or negative). It is also important to consider the significance of these results and any potential confounding factors that may affect the relationship between the variables and the outcome.
