How to find which variables w, x, y... 'most' influence z?

  1. Jul 9, 2015 #1
    Hi everyone - I guess this will be a very basic question for you guys, but I'm no statistician and I need your advice, please.

    At work (drug discovery, medicinal chemistry) we often get a lot of data, and we expect to find correlations between them. I'm afraid we're not always as good as we could/should be at looking for such correlations.
    Example. We run a biochemical assay for a certain target, which gives us a 'potency' value. Then we run the corresponding cellular assay, which is known by experience to deviate a bit from the biochemical one, but is usually correlated to it. For instance, if B is the biochemical potency, the cell potency C can usually be fitted to a function like Ln(C) = m+n*Ln(B), where m and n are constants and the linear correlation coefficient between Ln(C) and Ln(B) is relatively high.
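
    The fit described above can be sketched in a few lines of Python (an assumption; the thread itself uses Excel and later R). The potency values below are made up purely for illustration, and `fit_line` is a hypothetical helper, not part of any library.

```python
import math

def fit_line(xs, ys):
    """Ordinary least-squares fit ys ~ m + n*xs; returns (m, n, r_squared)."""
    k = len(xs)
    mean_x = sum(xs) / k
    mean_y = sum(ys) / k
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    n = sxy / sxx               # slope
    m = mean_y - n * mean_x     # intercept
    return m, n, sxy ** 2 / (sxx * syy)

# Made-up potencies in arbitrary units, for illustration only.
B = [1.0, 3.0, 10.0, 30.0, 100.0, 300.0]
C = [5.0, 12.0, 48.0, 110.0, 450.0, 1000.0]

# Fit on the log scale: Ln(C) = m + n*Ln(B)
m, n, r2 = fit_line([math.log(b) for b in B], [math.log(c) for c in C])
print(f"Ln(C) = {m:.2f} + {n:.2f}*Ln(B), R2 = {r2:.3f}")
```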

    There are, however, many cases where the correlation between Ln(B) and Ln(C) is quite bad. In such cases we invoke other effects, like cell permeability, solubility, etc, but I've never seen a rigorous follow-up, like trying to determine quantitatively if and how these additional variables *concertedly* influence Ln(C).
    People will often just look at pairs of variables, for instance they will plot Ln(C) versus permeability, find no good correlation and conclude that the two are not linked.
    But I think this approach is wrong, because when a given endpoint is determined by a combination of multiple independent variables, one can't expect each single variable to be necessarily well correlated to the endpoint, right?

    I made a simulation in Excel to test this. I generated two columns of random numbers between 0 and 1, and called them x and y. I verified that x and y were not correlated (R2=0). Then I calculated z=x*y and plotted z vs x and z vs y. In both cases the correlation was quite poor (R2 about 0.4).
    In this case I know that I would find a perfect correlation if I tried to fit Ln(z) to a linear combination of Ln(x) and Ln(y). But that's because I made z myself. In 'real' life I can at most presume that some variables have an impact on my endpoint, but I don't know if or to what extent this is the case, and if I wanted to use a multivariate regression, I wouldn't know what mathematical form to use (linear, logarithmic...).
    Shocking to see that even in a fictional dataset where a perfect relationship between {x,y} and z exists, the usual 2D visual inspection we normally do won't reveal anything particularly exciting about the data.
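
    For anyone who wants to reproduce this without Excel, here is a minimal sketch in Python (an assumption, as is the sample size of 10,000). For independent uniform x and y, the theoretical R2 between z = x*y and either variable is 3/7, about 0.43, which agrees with the "about 0.4" seen above.

```python
import math
import random

def r_squared(xs, ys):
    """Squared Pearson correlation between two equal-length lists."""
    k = len(xs)
    mx, my = sum(xs) / k, sum(ys) / k
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy ** 2 / (sxx * syy)

random.seed(0)  # fixed seed so reruns give the same numbers
N = 10_000
# 1 - random() lies in (0, 1], so the logarithms below are always defined
x = [1.0 - random.random() for _ in range(N)]
y = [1.0 - random.random() for _ in range(N)]
z = [a * b for a, b in zip(x, y)]

r2_xy = r_squared(x, y)   # x and y are independent, so this is close to 0
r2_zx = r_squared(z, x)   # pairwise view: only a modest correlation
r2_zy = r_squared(z, y)
# On the log scale z = x*y becomes Ln(z) = Ln(x) + Ln(y), an exact linear relation
r2_log = r_squared([math.log(v) for v in z],
                   [math.log(a) + math.log(b) for a, b in zip(x, y)])

print(f"R2(x, y) = {r2_xy:.4f}")
print(f"R2(z, x) = {r2_zx:.3f}, R2(z, y) = {r2_zy:.3f}")
print(f"R2(Ln z, Ln x + Ln y) = {r2_log:.6f}")
```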

    So here's the question.
    If I have an endpoint z, which I believe may be determined by one or more independent variables w, x, y..., is there a statistical technique allowing me to measure to what extent each independent variable affects z? And would such a technique also reveal what type of dependency exists between them?

    The only technique I used in some cases is multivariate regression, which requires a prior hypothesis of the mathematical relationship existing between the variables.
    I seem to recall from some discussions I had in the past that principal component analysis may help explore the data in a more unbiased way. Some software we use does that based on molecular structure descriptors. But I wouldn't know how to do that on numerical data, or what the result would tell me about the original variables.

    Sorry for the long post, I wanted to give you the context of the problem and explain why it's important to us.

  3. Jul 9, 2015 #2
    There are techniques to correlate multiple variables, but if the functional relationships aren't linear, then some things can be missed.

    I usually start by running the linear correlations and then working with the smaller group of variables that has the most promising linear correlations.

    But I am also quick to explore the possibilities of non-linear models and relationships.
  4. Jul 10, 2015 #3


    The main difference between regression and principal component analysis is this:
    Regression looks for a model that explains one particular variable, y, in terms of the other variables, xi.
    Principal component analysis looks for the single combination of all variables (no special variable y) that is the best-fitting line through the scattered values of (x1, x2, ...). Because it does not single out a special variable y, principal component analysis might be a very bad predictor of any one variable y that you want to predict. In other words, the thing you want to predict might get lost in the mix.

    So you should use the technique that fits your objective best. Are you trying to explain / predict one thing, y, based on the others, x1, x2, x3,... or are you trying to describe the scatter of the data (x1, x2, x3,...)?

    Stepwise linear regression can be used very well to fit a lot of models. Remember that "linear" means each independent variable (or term) is multiplied by a real coefficient and added, not that all the variables xi enter linearly. So you can include terms such as xi, ln(xi), xi * xj, etc. in the "linear" regression model to predict y. Then you can generate a similar model to estimate ln(y) and see which works best. If the scatter of y values tends to grow in proportion to the magnitude of y, that means the random component is a percentage of y, which suggests that ln(y) is the better response to model.
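
    As a rough illustration of "linear in the coefficients", the sketch below fits three models to the simulated z = x*y data from earlier in the thread: an additive model, one with an x*y term, and a log-log model. It is written in plain Python (an assumption); `solve` and `ols` are hypothetical helpers that solve the normal equations directly, which is adequate for a toy example but not how a statistics package would do it. Note that R2 values for z and for ln(z) are measured on different scales, so they are only a rough guide when comparing the two.

```python
import math
import random

def solve(A, b):
    """Solve the square system A*coef = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [a - f * c for a, c in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]

def ols(columns, target):
    """Least-squares fit of target on the given columns plus an intercept.
    Returns (coefficients with the intercept first, R2)."""
    rows = [[1.0] + list(vals) for vals in zip(*columns)]
    p = len(rows[0])
    XtX = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    Xty = [sum(r[i] * t for r, t in zip(rows, target)) for i in range(p)]
    coef = solve(XtX, Xty)
    fitted = [sum(c * v for c, v in zip(coef, r)) for r in rows]
    mean_t = sum(target) / len(target)
    ss_res = sum((t - f) ** 2 for t, f in zip(target, fitted))
    ss_tot = sum((t - mean_t) ** 2 for t in target)
    return coef, 1.0 - ss_res / ss_tot

random.seed(0)
x = [random.uniform(0.05, 1.0) for _ in range(200)]
y = [random.uniform(0.05, 1.0) for _ in range(200)]
z = [a * b for a, b in zip(x, y)]          # noise-free, as in the Excel simulation
xy = [a * b for a, b in zip(x, y)]         # a transformed term is just another column
ln = lambda values: [math.log(v) for v in values]

coef_add, r2_add = ols([x, y], z)              # z ~ c + a*x + b*y
coef_int, r2_int = ols([x, y, xy], z)          # z ~ c + a*x + b*y + d*(x*y)
coef_log, r2_log = ols([ln(x), ln(y)], ln(z))  # ln(z) ~ c + a*ln(x) + b*ln(y)

print(f"additive model:     R2 = {r2_add:.3f}")
print(f"with x*y term:      R2 = {r2_int:.6f}")
print(f"log-log model:      R2 = {r2_log:.6f}")
```

    All three are "linear" regressions in the sense used here; only the columns fed to the fit differ.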
    Last edited: Jul 10, 2015
  5. Jul 11, 2015 #4
    Thank you all for your replies.
    So I seem to understand that it's a trial-and-error process: a linear multivariate regression is what I could try as a first assessment, and then refine it step by step.

    I tried a simple linear regression z = a*x+b*y+c in the case of my simulated z=x*y dataset. I got an R2 of 0.89, and similar coefficients for x and y:
    z = 0.45*x+0.49*y-0.23.
    F is 65.2 with df=16. [I guess I could calculate the significance of the correlation from the last two].

    Am I correct in assuming that it's the absolute value of the coefficients that I should look at to decide which variables matter most?
    So for instance if I had z = 0.01*w -0.2*x +0.5*y + 2, I should keep x and y, and discard w, in the next trial, right?
    Or does it depend on the form of the equation, too? I seem to remember reading that the coefficient should be evaluated based on the function it multiplies.
    So for instance in z = 0.5*w^2 + 0.5*x + 0.5*ln(y), maybe the coefficients 0.5 haven't got the same impact on the 3 variables. Or should I consider ln(y) and w^2 as new variables in their own right?

  6. Jul 11, 2015 #5


    No. The magnitude of a coefficient is greatly influenced by the units of the variable, which has nothing to do with whether it belongs in the regression. There are ways to statistically decide which variables are the most significant. There are automated tools that implement step-wise multiple linear regression and only end up with the significant variables. For a brief description of the methods available in R, see .
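
    The point about units can be checked numerically. In the sketch below (Python, with made-up data and hypothetical variable names), the same variable is recorded once in nM and once in uM: its raw coefficient changes by a factor of 1000 while the fit itself is unchanged. The standardized coefficient (raw coefficient times sd of the variable divided by sd of z) is shown as one unit-free alternative; that choice is an assumption of this sketch and is not the significance-based selection mentioned in this post.

```python
import random
import statistics

def fit_two(u, v, z):
    """OLS slopes (a, b) for z ~ c + a*u + b*v, from the centred 2x2 normal equations."""
    mu, mv, mz = statistics.fmean(u), statistics.fmean(v), statistics.fmean(z)
    suu = sum((p - mu) ** 2 for p in u)
    svv = sum((q - mv) ** 2 for q in v)
    suv = sum((p - mu) * (q - mv) for p, q in zip(u, v))
    suz = sum((p - mu) * (t - mz) for p, t in zip(u, z))
    svz = sum((q - mv) * (t - mz) for q, t in zip(v, z))
    det = suu * svv - suv ** 2
    return (suz * svv - svz * suv) / det, (svz * suu - suz * suv) / det

random.seed(1)
n = 500
w_nM = [random.uniform(100, 1000) for _ in range(n)]  # hypothetical variable recorded in nM
x = [random.uniform(0, 1) for _ in range(n)]          # hypothetical unitless variable
z = [0.002 * w + 1.0 * xi + random.gauss(0, 0.05) for w, xi in zip(w_nM, x)]

a_nM, b_nM = fit_two(w_nM, x, z)
w_uM = [w / 1000 for w in w_nM]                       # same measurements, different unit
a_uM, b_uM = fit_two(w_uM, x, z)

sd = statistics.stdev
std_w_nM = a_nM * sd(w_nM) / sd(z)   # standardized coefficients are unit-free
std_w_uM = a_uM * sd(w_uM) / sd(z)
std_x = b_nM * sd(x) / sd(z)

print(f"raw coefficient of w: {a_nM:.5f} (nM) vs {a_uM:.3f} (uM)")
print(f"raw coefficient of x: {b_nM:.3f}")
print(f"standardized: w = {std_w_nM:.3f} (either unit), x = {std_x:.3f}")
```

    Here w has by far the smaller raw coefficient in nM, yet it accounts for more of the variation in z than x does.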

    PS. Stepwise is the best. It is better than Forward or Backward. For instance, Forward regression would put in the best single variable first. Then, after a couple more variables are added, the first one might not add as much as the combination of the later variables. So it would be good to remove that first variable. Stepwise would do that, but Forward would not.
    Last edited: Jul 11, 2015
  7. Jul 12, 2015 #6
    That's great, thank you!
    I downloaded R and its manuals, I'll see if I can get my head around it.
    Do you think I could put, say, x and x^2 in the list of independent variables, or do the IV's have to be uncorrelated?

    And another question: do you know if there are tools in R for measuring the correlation or association within a mix of categorical and continuous variables?

    Thank you again very much for your advice, it's very useful to me.
  8. Jul 12, 2015 #7


    You can put both x and x^2 in as IVs. The stepwise regression will determine if one or both variables should be included. It handles correlated variables well. If it has already included x, it will see if the remaining, unexplained part of y is explained by the remaining, uncorrelated part of x^2.
    You can always represent a category as a {0,1} variable, where 0 means it is not in the category and 1 means it is. Do not code a 3-category partitioning as {0, 1, 2} because that implies an ordered relationship of the categories, where the third category is twice as large as the second category. Instead, represent each of the last two categories as a separate {0,1} variable (both being 0 means it is in the first category).
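
    A minimal sketch of that coding scheme in Python (an assumption; `dummy_code` and the category labels "A", "B", "C" are hypothetical): the reference category gets no column of its own and is represented by all indicators being 0.

```python
def dummy_code(labels, reference):
    """Return one {0,1} indicator column per category other than the reference."""
    levels = sorted(set(labels) - {reference})
    return {level: [1 if lab == level else 0 for lab in labels] for level in levels}

series = ["A", "B", "C", "B", "A"]           # hypothetical three-category variable
dummies = dummy_code(series, reference="A")  # "A" is coded as both indicators being 0
for level, column in dummies.items():
    print(f"is_{level}: {column}")
```

    Each indicator column can then be used as an ordinary independent variable in the regression.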
  9. Jul 15, 2015 #8
    Thank you!
    IT installed R on my PC at work, I'll try it out when I find the time. The syntax is really different from what I'm used to...