# How to find which variables w, x, y.... 'most' influence z?

• lavoisier
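In summary: To find which variables most influence an endpoint z, fit a multiple regression of z on the candidate variables (including transformed terms such as ln(x), x^2 or x*y where appropriate) and judge each variable by its statistical significance, not by the raw size of its coefficient, which depends on the variable's units. Stepwise regression automates the selection of the significant variables.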
lavoisier
Hi everyone - I guess this will be a very basic question for you guys, but I'm no statistician and I need your advice, please.

At work (drug discovery, medicinal chemistry) we often get a lot of data, and we expect to find correlations between them. I'm afraid we're not always as good as we could/should be at looking for such correlations.
Example. We run a biochemical assay for a certain target, which gives us a 'potency' value. Then we run the corresponding cellular assay, which is known by experience to deviate a bit from the biochemical one, but is usually correlated to it. For instance, if B is the biochemical potency, the cell potency C can usually be fitted to a function like Ln(C) = m+n*Ln(B), where m and n are constants and the linear correlation coefficient between Ln(C) and Ln(B) is relatively high.

There are, however, many cases where the correlation between Ln(B) and Ln(C) is quite bad. In such cases we invoke other effects, like cell permeability, solubility, etc, but I've never seen a rigorous follow-up, like trying to determine quantitatively if and how these additional variables *concertedly* influence Ln(C).
People will often just look at pairs of variables, for instance they will plot Ln(C) versus permeability, find no good correlation and conclude that the two are not linked.
But I think this approach is wrong, because when a given endpoint is determined by a combination of multiple independent variables, one can't expect each single variable to be necessarily well correlated to the endpoint, right?

I made a simulation in Excel to test this. I generated two columns of random numbers between 0 and 1, and called them x and y. I verified that x and y were not correlated (R^2=0). Then I calculated z=x*y and plotted z vs x and z vs y. In both cases the correlation was quite poor (R^2 about 0.4).
In this case I know that I would find a perfect correlation if I tried to fit Ln(z) to a linear combination of Ln(x) and Ln(y). But that's because I made z myself. In 'real' life I can at most presume that some variables have an impact on my endpoint, but I don't know if or to what extent this is the case, and if I wanted to use a multivariate regression, I wouldn't know what mathematical form to use (linear, logarithmic...).
Shocking to see that even in a fictional dataset where a perfect relationship between {x,y} and z exists, the usual 2D visual inspection we normally do won't reveal anything particularly exciting about the data.

So here's the question.
If I have an endpoint z, which I believe may be determined by one or more independent variables w, x, y..., is there a statistical technique allowing me to measure to what extent each independent variable affects z? And would such a technique also reveal what type of dependency exists between them?

The only technique I used in some cases is multivariate regression, which requires a prior hypothesis of the mathematical relationship existing between the variables.
I seem to recall from some discussions I had in the past that principal component analysis may help explore the data in a more unbiased way. Some software we use does that based on molecular structure descriptors. But I wouldn't know how to do that on numerical data, or what the result would tell me about the original variables.

Sorry for the long post, I wanted to give you the context of the problem and explain why it's important to us.

Thanks!
L

There are techniques to correlate multiple variables, but if the functional relationships aren't linear, then some things can be missed.

I usually start by running the linear correlations and then working with the smaller group of variables that has the most promising linear correlations.

But I am also quick to explore the possibilities of non-linear models and relationships.

The main difference between regression and principal component analysis is this:
Regression looks for a model that explains one particular variable, y, in terms of the other variables, xi.
Principal component analysis looks for the single combination of all variables (no special variable y) that is the best-fitting line through the scattered values of (x1, x2, ...). Because it does not single out a special variable y, principal component analysis might be a very bad predictor of any one variable, y, that you want to predict. In other words, the thing you want to predict might get lost in the mix.

So you should use the technique that fits your objective best. Are you trying to explain / predict one thing, y, based on the others, x1, x2, x3,... or are you trying to describe the scatter of the data (x1, x2, x3,...)?

Stepwise linear regression can be used very well to fit a lot of models. Remember that "linear" means that each independent variable is multiplied by a real coefficient and the terms are added, not that all the variables xi enter linearly. So you can include terms such as xi, ln(xi), xi * xj, etc. in the "linear" regression model to predict y. Then you can generate a similar model to estimate ln(y) and see which works best. If the scatter of y values tends to grow in proportion to the magnitude of y, that means the random component is a percentage of y. That implies that a model for ln(y) is the appropriate one.

Thank you all for your replies.
So I seem to understand that it's a trial-and-error process: a linear multivariate regression is what I could try as a first assessment, and then refine it step by step.

I tried a simple linear regression z = a*x+b*y+c in the case of my simulated z=x*y dataset. I got an R^2 of 0.89, and similar coefficients for x and y:
z = 0.45*x+0.49*y-0.23.
F is 65.2 with df=16. [I guess I could calculate the significance of the correlation from the last two].
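
A sketch of that significance calculation in Python, under two assumptions that the post does not state explicitly: the model has two predictors (x and y), and df=16 is the residual degrees of freedom. With exactly two numerator degrees of freedom, the tail probability of the F distribution has a simple closed form, so no statistics library is needed; other numerator values would require the incomplete beta function.

```python
def f_from_r2(r2, n_predictors, df_resid):
    """Overall F statistic of a regression, computed from its R^2."""
    return (r2 / n_predictors) / ((1.0 - r2) / df_resid)

def f_pvalue_two_predictors(f_stat, df_resid):
    """P(F > f_stat) for an F distribution with (2, df_resid) degrees of freedom."""
    return (df_resid / (df_resid + 2.0 * f_stat)) ** (df_resid / 2.0)

# Consistency check: R^2 = 0.89 with 2 predictors and 16 residual df.
f_check = f_from_r2(0.89, n_predictors=2, df_resid=16)
# Significance of the reported F = 65.2.
p_value = f_pvalue_two_predictors(65.2, df_resid=16)

print(f"F from R^2: {f_check:.1f}, p-value for F = 65.2: {p_value:.1e}")
```

Under these assumptions the rounded R^2 gives F of about 64.7, consistent with the reported 65.2, and the p-value is about 2e-8, so the overall fit is highly significant.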

Am I correct in assuming that it's the absolute value of the coefficients that I should look at to decide which variables matter most?
So for instance if I had z = 0.01*w -0.2*x +0.5*y + 2, I should keep x and y, and discard w, in the next trial, right?
Or does it depend on the form of the equation, too? I seem to remember reading that the coefficient should be evaluated based on the function it multiplies.
So for instance in z = 0.5*w^2 + 0.5*x + 0.5*ln(y), maybe the coefficients 0.5 haven't got the same impact on the 3 variables. Or should I consider ln(y) and w^2 as new variables in their own right?

Thanks
L

lavoisier said:
Am I correct in assuming that it's the absolute value of the coefficients that I should look at to decide which variables matter most?
No. The magnitude of a coefficient is greatly influenced by the units of the variable, which has nothing to do with whether it belongs in the regression. There are ways to statistically decide which variables are the most significant. There are automated tools that implement step-wise multiple linear regression and only end up with the significant variables. For a brief description of the methods available in R, see .

PS. Stepwise is the best. It is better than Forward or Backward. For instance, Forward regression would put in the best single variable first. Then, after a couple more variables are added, the first one might no longer add much beyond what the combination of the later variables already explains. So it would be good to remove that first variable. Stepwise would do that, but Forward would not.
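
The add-then-remove behaviour described above can be sketched in plain Python. This is an illustrative toy, not the procedure used by any particular R tool: the entry and removal thresholds (`f_enter=4.0`, `f_remove=3.9`) are assumed values, the least-squares solver is hand-rolled to keep the example self-contained, and the data are synthetic. The variable x1 is built as a noisy proxy for x2 + x3, so it tends to be the best single predictor but becomes redundant once x2 and x3 are both in the model.

```python
import random

def rss(columns, y):
    """Residual sum of squares of the least-squares fit y ~ intercept + columns."""
    design = [[1.0] + [col[i] for col in columns] for i in range(len(y))]
    p = len(design[0])
    # Normal equations (X'X) beta = X'y, solved by Gauss-Jordan elimination.
    a = [[sum(row[j] * row[k] for row in design) for k in range(p)]
         + [sum(row[j] * yi for row, yi in zip(design, y))] for j in range(p)]
    for c in range(p):
        pivot = max(range(c, p), key=lambda r: abs(a[r][c]))
        a[c], a[pivot] = a[pivot], a[c]
        for r in range(p):
            if r != c:
                factor = a[r][c] / a[c][c]
                a[r] = [v - factor * w for v, w in zip(a[r], a[c])]
    beta = [a[j][p] / a[j][j] for j in range(p)]
    return sum((yi - sum(b * v for b, v in zip(beta, row))) ** 2
               for row, yi in zip(design, y))

def partial_f(rss_small, rss_big, n, n_predictors_big):
    """F statistic for the single extra predictor in the bigger model."""
    return (rss_small - rss_big) / (rss_big / (n - n_predictors_big - 1))

def stepwise(candidates, y, f_enter=4.0, f_remove=3.9):
    """Stepwise selection; candidates maps a variable name to its column."""
    n, chosen = len(y), []
    while True:
        current = rss([candidates[k] for k in chosen], y)
        # Forward step: find the best variable not yet in the model.
        trials = {k: rss([candidates[j] for j in chosen + [k]], y)
                  for k in candidates if k not in chosen}
        if not trials:
            return chosen
        best = min(trials, key=trials.get)
        if partial_f(current, trials[best], n, len(chosen) + 1) < f_enter:
            return chosen
        chosen.append(best)
        # Backward step: drop variables made redundant by later additions.
        while len(chosen) > 1:
            full = rss([candidates[k] for k in chosen], y)
            drops = {k: rss([candidates[j] for j in chosen if j != k], y)
                     for k in chosen}
            weakest = min(drops, key=drops.get)
            if partial_f(drops[weakest], full, n, len(chosen)) >= f_remove:
                break
            chosen.remove(weakest)

random.seed(5)
N = 200
x2 = [random.gauss(0.0, 1.0) for _ in range(N)]
x3 = [random.gauss(0.0, 1.0) for _ in range(N)]
x1 = [a + b + random.gauss(0.0, 1.0) for a, b in zip(x2, x3)]  # noisy proxy for x2 + x3
y = [a + b + random.gauss(0.0, 0.5) for a, b in zip(x2, x3)]

selected = stepwise({"x1": x1, "x2": x2, "x3": x3}, y)
print("selected:", selected)
```

With data built this way, x2 and x3 end up in the model; whether x1 is dropped in a given run depends on whether its removal statistic falls below the threshold, which it usually does because x1 adds nothing once x2 and x3 are present. A pure forward procedure would keep x1 regardless.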

lavoisier
That's great, thank you!
Do you think I could put, say, x and x^2 in the list of independent variables, or do the IV's have to be uncorrelated?

And another question, do you know if there are tools in R that can measure the correlation or association within a mix of categorical and continuous variables?

Thank you again very much for your advice, it's very useful to me.

FactChecker
lavoisier said:
Do you think I could put, say, x and x^2 in the list of independent variables, or do the IV's have to be uncorrelated?
You can put both x and x^2 in as IVs. The stepwise regression will determine if one or both variables should be included. It handles correlated variables well. If it has already included x, it will see if the remaining, unexplained part of y is explained by the remaining, uncorrelated part of x^2.
lavoisier said:
And another question, do you know if there are tools in R that can measure the correlation or association within a mix of categorical and continuous variables?
You can always represent a category as a {0,1} variable, where 0 means it is not in the category and 1 means it is. Do not code a 3-category partitioning as {0, 1, 2} because that implies an ordered relationship of the categories, where the third category is twice as large as the second category. Instead, represent each of the last two categories as a separate {0,1} variable (both being 0 means it is in the first category).
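
A minimal sketch of this coding scheme in Python. The function name `dummy_code` and the three labels are hypothetical; the first category "A" is taken as the reference level, so it is represented by zeros in both indicator columns.

```python
def dummy_code(labels, reference):
    """Return one 0/1 column per category, except for the reference category."""
    levels = sorted(set(labels) - {reference})
    return {lvl: [1 if lab == lvl else 0 for lab in labels] for lvl in levels}

# Hypothetical three-category variable; "A" is the reference category.
category = ["A", "B", "C", "B", "A"]
columns = dummy_code(category, reference="A")

print(columns)  # {'B': [0, 1, 0, 1, 0], 'C': [0, 0, 1, 0, 0]}
```

The resulting columns can be used as ordinary independent variables in the regression alongside the continuous ones.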

Thank you!
IT installed R on my PC at work, I'll try it out when I find the time. The syntax is really different from what I'm used to...

## 1. How do I determine which variables have the most influence on a specific outcome?

To find which variables have the most influence on a specific outcome, you can use statistical methods such as regression analysis or correlation analysis. These methods will help you identify the strength and direction of the relationship between the variables and the outcome.

## 2. Can I use machine learning algorithms to determine which variables have the most influence?

Yes. Machine learning methods such as decision trees and random forests can also be used to rank variables. A random forest, for example, reports an importance score for each variable based on how much that variable improves the trees' predictions, and it does not require you to specify the functional form of the relationship in advance.

## 3. What is the difference between correlation and causation in determining variable influence?

Correlation measures the strength and direction of the association between two variables, while causation means that changing one variable actually produces a change in the other. Correlation does not imply causation, so confounding factors must be considered before concluding that a variable influences the outcome.

## 4. Is there a specific statistical measure I should use to determine variable influence?

The choice of statistical measure for determining variable influence depends on the type of data and the nature of the relationship between the variables and the outcome. Some commonly used measures include Pearson's correlation coefficient, R-squared, and p-value.

## 5. How can I interpret the results of a variable influence analysis?

The results of a variable influence analysis will typically include a measure of the strength of the relationship between each variable and the outcome, as well as its direction (positive or negative). It is also important to consider the significance of these results and any potential confounding factors that may affect the relationship between the variables and the outcome.
