I Correlation vs causation in biology/biochemistry

  1. Dec 8, 2016 #1
    Hello everyone,
    we were discussing a project at work just yesterday (I work as a researcher in a medium-size biotech), and at some point the (in)famous 'correlation doesn't imply causation' sentence came up. I would like to know what you think, please. I apologise in advance, long post: the topic is a bit complex, at least for me.

    The project lead chemist showed that in several diverse series of chemical compounds, there was a strong positive correlation between the biological activity observed in target A and the biological activity in an assay B. The trend was observed across a broad range of activities, i.e. not only did we have many compounds active in both A and B, but also many that were poorly active or inactive in both A and B. Only very few compounds were found in the other two 'quadrants' of non-matching activities.

    People usually take this as 'evidence' that 'acting' on target A causes a cascade of biochemical events ultimately leading to the desired effect in assay B, and the 'strength' of the action on A (the activity in this case) is linked by some (at least monotonic, or even ~linear) function to the 'magnitude' of the effect in B.

    However, the project leader was also there, and he said that observing a correlation between the activities or potencies in A and B doesn't imply that A is the mechanistic cause of B.

    I am inclined to agree on a purely logical / statistical basis. All books on statistics I read so far had this word of caution about not over-interpreting correlation. I read the examples, and I am convinced that in the most general sense, two variables A and B, paired or linked by some criterion, may show a correlation in their values even when the value of A isn't determining in any way the value of B.

    However this left me quite baffled about what to do in practice once one gets such data.

    Should we use the above information about A and B, or not?
    If after 100 observations we can say the activity on A is 'so far' a good predictor of the activity in B, should we not use A to predict B just because we can't be 100% sure that 'A causes B'?

    And assuming that we're not comfortable using A as a predictor, what numbers or evidence do we need, how do we test or inspect the hypothesis that A doesn't cause B?

    If we had only one series of compounds, all similar in structure, one could say that the activity in A is perhaps coincidental, and there is instead another unknown target X that is the real cause of the activity in B. So the real causal implication is not A --> B but X --> B, and A --> X happens by chance in the set of compounds we tested. That's fair enough, and it has actually happened a few times in projects I worked on.

    However that only happened indeed in specific series of closely related compounds. But when a large number of diverse compounds have been tested and all showed a good A-B association, what are the chances that a 'hidden' parameter is playing a role? How can so many compounds, regardless of their structure, all have activities on A associated with the activities on X purely by chance? At some point A and X would become so similar that activity in one would always imply activity in the other, so for all practical purposes they would be equivalent.

    I get the point that certainty doesn't exist. Obviously we can't run a huge series of experiments, test all molecules in the world and see if the 'wrong quadrants' get populated.
    From the little I know of statistics, though, I seem to understand that when you sample randomly from a population, the 'behaviour' you observe in your sample will gradually approach the one you'd observe in the overall population, as your sample size increases.
    Isn't it about the same here? OK, we don't really sample randomly, but as I said, chemical diversity is a good criterion to ensure that systematic bias is reduced, if not eliminated.
    How big and how diverse must our sample be before we can confidently say that we're observing something similar to what the whole set of existing compounds would show?

    In biology or biochemistry, it's already difficult enough to measure activity and show a link between things that are known to be causally related.
    How stringent do we want (or need) to be, and in particular, can we measure our 'confidence' like statisticians do when they test hypotheses?
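    One rough way to put a number on the 'quadrant' picture is a one-sided Fisher exact test on the 2x2 table of active/inactive counts. The sketch below uses only the standard library; the counts are made up for illustration and are not the project's data. A small p-value only says the association is unlikely to be chance, it says nothing about mechanism.

```python
from math import comb

def fisher_exact_one_sided(a, b, c, d):
    """One-sided Fisher exact p-value for a 2x2 table.

    Hypothetical layout:
                       active in B   inactive in B
      active in A           a              b
      inactive in A         c              d

    Returns the probability, under independence with fixed margins,
    of seeing at least `a` compounds in the active/active cell.
    """
    row1 = a + b
    col1 = a + c
    n = a + b + c + d
    total = comb(n, col1)
    p = 0.0
    for k in range(a, min(row1, col1) + 1):
        # Hypergeometric probability of exactly k active/active compounds
        p += comb(row1, k) * comb(n - row1, col1 - k) / total
    return p

# Made-up counts: 100 compounds, only 8 in the non-matching quadrants
p_value = fisher_exact_one_sided(45, 5, 3, 47)
print(f"one-sided p-value: {p_value:.3g}")
```

    With counts like these the p-value is tiny, which matches the intuition that the pattern is not a sampling accident; whether A causes B is a separate question.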

    I'm trying to think of how many discoveries we would miss if we under-interpreted correlations, and if that's better or worse than making wrong causal links by over-interpreting them.

    If someone told you that having above a given concentration of 'bad' cholesterol in your blood increases your chances of developing heart disease, what evidence would you ask for?
    This is a case where they actually found plaques of cholesterol in blood vessels.
    If we were incredibly fastidious about it, we might say 'OK, the plaques obstruct blood vessels and cause ischemic events, but there is no proof that a high concentration of that cholesterol is the cause of the formation of these plaques; the fact that people with high cholesterol have more plaques may be a coincidence'; or 'what if the reverse is true, i.e. people who are predisposed to heart disease are more likely to have high cholesterol and/or plaques', etc.
    Where do we stop questioning whether what we observe 'makes sense'? Do we have to believe that everything is completely random and that no number of observations ever 'means' anything? Sounds a lot like cognitive instability to me.

    On the other hand, I read the story about hormone replacement therapy, which was prescribed to menopausal women because it was shown (from historical data) to reduce their risk of heart attacks, only to find years later that it actually increased mortality by favouring several other equally serious events (cancer, stroke...).
    So it's not easy...

    Any thoughts?


    [BTW, I read this thread:
    but I didn't find my answer there, hence my post].
  3. Dec 8, 2016 #2


    Science Advisor
    2017 Award

    Consider the difference between an observational study and an experiment. In an observational study, you look at how your outcome variable (e.g. rate of heart attacks) changes with respect to the independent variable (e.g. cholesterol levels). These are definitely only going to give you correlations, as the independent variable may be correlated with the actual causative variable. (For example, one might find that drowning deaths are correlated with ice cream sales. This occurs because increased drowning deaths are caused by a larger number of people playing in water which, like ice cream sales, is correlated with the temperature outside).
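    The ice cream example can be mimicked with a small simulation. This is only a sketch with invented numbers (the coefficients, noise levels and temperature range are assumptions): temperature drives both variables, and neither affects the other. The two are strongly correlated overall, but the correlation largely disappears once temperature is held roughly fixed.

```python
import random

def pearson_r(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

rng = random.Random(0)

# Hypothetical daily data: temperature is the common cause
temperature = [rng.uniform(0, 35) for _ in range(5000)]
ice_cream = [10 * t + rng.gauss(0, 40) for t in temperature]
drownings = [0.2 * t + rng.gauss(0, 1.5) for t in temperature]

r_all = pearson_r(ice_cream, drownings)

# Hold the confounder roughly fixed: only days between 20 and 22 degrees
idx = [i for i, t in enumerate(temperature) if 20 <= t <= 22]
r_fixed = pearson_r([ice_cream[i] for i in idx], [drownings[i] for i in idx])

print(f"all days:            r = {r_all:.2f}")
print(f"temperature ~ fixed: r = {r_fixed:.2f}")
```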

    In an experiment, however, you are trying to manipulate the independent variable without affecting any other variables, then measuring the effect on outcome. So, for example, this would involve taking two groups of individuals (probably matched with respect to gender, weight, etc), then assigning one group to a low cholesterol diet and another group to a high cholesterol diet, then measuring the incidence of heart attacks in the two populations over time. Obviously, this is not practical in many cases (e.g. unethical to give an unhealthy diet to one group, too costly to follow subjects over time, difficult to maintain compliance with dietary restrictions).

    In your case, you have an experiment, but an imperfect one. You have created a series of compounds to manipulate the activity of target A and are trying to see whether that correlates with biological activity B. However, you cannot be sure that manipulations to target A affect only target A and not any other targets inside the cell. For example, let's say your target was a protein kinase, and you see that drugs that inhibit kinase X also inhibit biological activity B. Now let's say that biological activity B is actually caused by kinase Y, which is similar enough to kinase X that drugs that inhibit kinase X also inhibit kinase Y. In this case, you would see a correlation between inhibition of kinase X and inhibition of biological activity B without a causal relationship between the two. No amount of additional trials would help as the experiment itself suffers from some unavoidable flaws and limitations.

    This is why, in biology, we try to use a few independent methods to test hypotheses such as these. For example, you could try inhibiting target A through siRNA knockdown and/or make a CRISPR knockout. Like pharmacological inhibition, both of these techniques are prone to off target effects and other artifacts, but the off target effects are likely different for each different approach. Seeing a link between target A and biological activity B through multiple different approaches would strengthen your confidence that the relationship is causative and not due to some off-target effect.

    Also, here's an article from Science that I find helpful for thinking about correlation vs causation:
  4. Dec 8, 2016 #3


    Science Advisor
    Gold Member
    2017 Award

    There is a big difference between the existence of a function from A to B and the assumption that anything about A causes B. A function does not imply causation any more than a correlation does. If A and B are correlated, why would one assume that A causes B rather than that B causes A or rather than another factor C causes both A and B?
    For example, the length of a person's left arm and the length of his right arm. They are obviously very highly correlated, but one does not cause the other.
    You can use it appropriately. It is up to the subject matter expert to use it correctly in each case.
    A can be a great predictor of B even if it does not cause B. Use it appropriately. But I am concerned about your use of the term "predictor". In this context, it should not necessarily imply any timing of A before B or any causality of A causing B. Suppose B occurred before A, but it is unknown. If A and B are highly correlated, I would still say that A is a "predictor" of B in that it can be used to estimate B. In the example of arm length, the length of a man's left arm is a great predictor of the length of his right arm. Or you can say that the number of fingers an adult has is a good "predictor" of the number he had as a baby. But timing and causation went the other way.
  5. Dec 8, 2016 #4


    Staff: Mentor

    It is not a question of being 100% sure. It is a question of mechanism. The reason that we don't say that ice cream sales causes drowning deaths is that there is no plausible mechanism by which they could do so.

    Do you have a specific proposed mechanism by which A could cause B?
  6. Dec 9, 2016 #5
    Thank you all for your great input!
    @Ygggdrasil : indeed, the project leader mentioned knockdown experiments. Perhaps that's going to give us some more clarity.
    Concerning the possible alternative target, I have a comment from a medicinal chemistry point of view.
    But as I said, if X and Y become SO similar that structurally unrelated compounds inhibit both, then in practice whatever drug you make will inhibit both. That may be a problem when X causes toxicity. Unless you can find something that discriminates between X and Y (selectivity). Molecular modelling sometimes helps, but then you need to have the crystal structure of X and Y, and if Y is not known, as per our hypothesis... Tough stuff.

    @Dale : You're right (of course). That was always my doubt when I was reading the statisticians' examples on correlation without causation. It was always far-fetched stuff, like the reported cases of autism in the US and the GDP growth in China. The link is just 'same year', so it's a bit obvious that you may find them increasing or decreasing 'at the same time' just by chance, without there being any plausible mechanism relating the two.
    As pointed out, in my example the link is a bit more direct (same compound, tested in ad-hoc assays). And still, as Ygggdrasil said, there are possible doubts even in such controlled conditions.
    To address your question, yes, the project leader has been looking into mechanistic explanation from the known biology/biochemistry of target A, and so far he's not sure he can find one. Target A is not new or unknown, it's in several patents on compounds with oncology indications. Maybe that's why he 'hopes' it's not related, I don't know.

    @FactChecker :
    Indeed. And this goes back to the usual point: if not A, but a factor C is the real cause of B, how do we 'separate' A from C? I was hoping statistics would help.

    If B = c0 + c1*C
    and by chance, in a small set of cases:
    A = c2 + c3*C
    then we will observe our correlation between A and B:
    B = (c0 - c1*c2/c3) + (c1/c3)*A

    But we can't expect A = c2 + c3*C to hold 'everywhere'. There have to be 'places' where this relationship between A and C is not true (unless A and C are identical for all practical purposes!), and then we will detect the disconnect between A and B, if there actually is one.
    My hope was that by expanding the chemical diversity, one could find such places.
    But I think Ygggdrasil's suggestion to act on A in different ways is a very good one and less prone to bias.
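    A quick simulation of the linear relations above, as a sketch only: the coefficients, noise levels and ranges are arbitrary assumptions. B is driven by the hidden target C alone. In a 'narrow' series A happens to track C, so A and B look correlated; in a 'diverse' set where A no longer tracks C, the A-B correlation should fall apart.

```python
import random

def pearson_r(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

rng = random.Random(1)
c0, c1, c2, c3 = 1.0, 2.0, 0.5, 1.5   # arbitrary illustrative coefficients

def simulate(n, a_tracks_c):
    """Return the A-B correlation for n simulated compounds."""
    a_vals, b_vals = [], []
    for _ in range(n):
        c = rng.uniform(0, 10)                  # activity on hidden target C
        b = c0 + c1 * c + rng.gauss(0, 1)       # B depends on C only
        if a_tracks_c:
            a = c2 + c3 * c + rng.gauss(0, 1)   # narrow series: A follows C
        else:
            a = rng.uniform(0, 15)              # diverse set: A unrelated to C
        a_vals.append(a)
        b_vals.append(b)
    return pearson_r(a_vals, b_vals)

r_narrow = simulate(500, a_tracks_c=True)
r_diverse = simulate(500, a_tracks_c=False)
print(f"narrow series: r = {r_narrow:.2f}")
print(f"diverse set:   r = {r_diverse:.2f}")
```

    The catch raised earlier in the thread still applies: if A and C are similar enough that even diverse compounds hit both, the 'diverse' branch never materialises and chemistry alone cannot separate them.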

    So... we will see. Maybe after all this we'll find out that A IS actually causing B :O( :O)

    Thanks again
  7. Dec 9, 2016 #6


    Staff: Mentor

    Be careful here. Proper statistics can rule this out. In fact, the relevant null hypothesis is specifically the hypothesis that the observed correlation is just by chance. So whatever the relationship between A and B you will easily rule out that it is just by chance.

    However, the point is that there are many other non-random relationships than A causes B.

    I think you will need to wait until that is completed. Without a specific causal mechanism any claim that A causes B will be rejected by any competent reviewer.
  8. Dec 9, 2016 #7


    Science Advisor
    2017 Award

    Sure, from a medicinal chemistry point of view, if all you are interested in is inhibiting biological activity B, then it does not matter if your drug does so by inhibiting target A or some other unknown target. However, biologists would absolutely care about whether the drug acts through target A or some other unknown target.

    A nice experiment to test for off target effects would be to engineer a version of target A that is resistant to your drug. If the resistant target A is able to rescue the effects of the drug on biological activity B, you have some good evidence that the drug works through inhibition of target A. If the drug acts through some unknown target and not target A, the rescue with the resistant form of target A would not affect the drug's effect on biological activity B.
  9. Dec 9, 2016 #8
    :O) Well, actually medicinal chemists too care about targets. I once asked my colleagues why, because I didn't, really. Given that phenotypic assays are often much better at predicting in vivo efficacy, why 'go back' to a target, which so often led us to dead ends in the past? The fact that the biology of a living system is so much more complex than a single target in isolation, would be an argument to use only phenotypic data. The answer was somewhat hand-wavy, but I seemed to understand that knowing the target is useful not only for molecular modelling, but to run a less expensive and higher-throughput assay (like a pre-filter), to de-risk side effects when the biology of the target is known, etc.
    Your other suggestion is very interesting. I will see with my colleagues in biology if they already considered it in this specific case.
  10. Dec 9, 2016 #9
    OK, but I guess you mean that this is possible by obtaining additional data/experiments, right? A bit like Ygggdrasil is suggesting?
    If all you have is two columns of data A and B that, scatter-plotted, give you something close to a straight line with very high R squared, and the test on the significance of R squared is also very good, then at least I hope we agree that there is enough evidence to reject the null hypothesis of random correlation.
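    For the 'two columns of data' situation, one assumption-light way to test the null hypothesis of a chance correlation is a permutation test: shuffle one column many times and count how often the shuffled correlation is at least as strong as the observed one. The sketch below uses made-up potency values and only the standard library. Rejecting the null this way rules out chance, not confounding.

```python
import random

def pearson_r(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

def permutation_p_value(xs, ys, n_perm=2000, seed=0):
    """Two-sided permutation p-value for the Pearson correlation."""
    rng = random.Random(seed)
    observed = abs(pearson_r(xs, ys))
    shuffled = list(ys)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)   # break any pairing between xs and ys
        if abs(pearson_r(xs, shuffled)) >= observed:
            hits += 1
    # +1 in numerator and denominator avoids reporting p = 0
    return (hits + 1) / (n_perm + 1)

rng = random.Random(42)
a = [rng.uniform(4, 9) for _ in range(60)]       # made-up potencies in A
b = [0.9 * x + rng.gauss(0, 0.5) for x in a]     # made-up potencies in B

r = pearson_r(a, b)
p = permutation_p_value(a, b)
print(f"r = {r:.2f}, R squared = {r * r:.2f}, permutation p = {p:.4f}")
```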

    The example I mentioned (autism/China) was from 'Naked Statistics' by C. Wheelan. I think the author wants to convey the message that one needs to 'go beyond' the mere observation of a correlation, no matter how good (statistically significant) it is, to establish if there is a causal link between facts.
    So indeed, of course you're right again, if one wants to make the strong(er) statement that A causes B rather than merely that their activities are numerically correlated in a given set, we need more (direct) evidence.
    Part of my question was also: when is something good enough to be considered evidence, given that as scientists we need to doubt everything?
    E.g. I am told that to this day there are people who doubt the existence of molecules, because they say there isn't sufficient evidence. [So incidentally thousands of chemists working in labs and piecing molecules together would be a huge bunch of fools].
    Perhaps at some point one must accept that some 'theories' have enough factual support to be at least a valid working hypothesis to build new knowledge on, otherwise science would be nowhere, I suspect.
  11. Dec 9, 2016 #10


    Science Advisor
    Gold Member
    2017 Award

    Usually the subject matter expert will have to propose a logical theory of what causes what. Statistical correlation by itself says very little about that. In fact, rather than being helpful, there is a real danger that it will be deceptive.
  12. Dec 10, 2016 #11
    Indeed @FactChecker , that's also Wheelan's conclusion in the chapter about regression caveats. Expert researchers must carefully investigate which variables should be included in a regression, to avoid under- or over- fitting stuff.

    Concerning the direction of cause-effect relationships, which as you wrote is another thing one can't deduce from a good correlation alone, if I understand correctly in formal logic things are a lot simpler. I'll describe it as I learnt it.
    Suppose we think there is an implication A → B (equivalent to NOT(A) ∨ B).
    We can't prove A → B, but only disprove it, if we find a case when A is true and B is not true (which is the only configuration that makes A → B false).
    No other case falsifies A → B.
    A = true and B = true, which many would consider 'proof' of the causality A → B, in fact may be due to other unknown causes C, D... determining B (which is a bit what we've been discussing above). I think someone calls this 'confirmation bias'.
    A = false and B = true; if nothing else, this would imply that there are other things that cause B, so if A does, it's not alone.
    A = false and B = false; I don't know if people consider this support for A → B; in a contingency table it would surely contribute to the 'goodness' of the association.
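    The four cases can be checked mechanically. A minimal sketch, using the equivalence A → B ≡ NOT(A) ∨ B stated above:

```python
from itertools import product

def implies(a, b):
    """Material implication: A -> B is equivalent to (not A) or B."""
    return (not a) or b

# Enumerate all four truth assignments for (A, B)
rows = [(a, b, implies(a, b)) for a, b in product([True, False], repeat=2)]
for a, b, result in rows:
    print(f"A={a!s:5}  B={b!s:5}  A->B={result}")

# The only assignment that falsifies the implication
falsifying = [(a, b) for a, b, result in rows if not result]
print("falsified by:", falsifying)
```

    Only A = true, B = false makes the implication false, which is why that quadrant is the informative one in the formal setting.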

    As we're not dealing with formal logic, but complex biological systems where a lot of factors are unknown, even A = true and B = false doesn't definitively disprove that A may cause B. One particular compound M may act on target A in the way that would normally cause B, but at the same time M may also act on another target C that instead masks or prevents B from happening.
    There are a couple of compounds in the A=true, B=false quadrant; maybe we should look at them more carefully and understand why they 'deviate' from what most other compounds do.
  13. Dec 10, 2016 #12


    Staff: Mentor

    Not really. What you need is a theory. If you have a theory where A causes B, then you expect A to be correlated with B. So evidence showing A correlates with B is evidence supporting the theory and therefore supporting the idea that A causes B.

    But the key step in establishing causation instead of just correlation is theoretical.
  14. Dec 11, 2016 #13


    Science Advisor
    Gold Member
    2017 Award

  15. Dec 11, 2016 #14


    Science Advisor
    Gold Member
    2017 Award

    Good point. If we can force A to specific values and observe a correlation in B, then with a well designed experiment we may be able to deduce that A causes / controls B. That seems valid even if there is no theory regarding how A causes B. The method of forcing values of A would have to be such that it has no effect on B except through A. Also, the experiment would have to be designed in such a way that we can rule out an accidental correlation between the values of A and any other antecedent. (see https://en.wikipedia.org/wiki/Spurious_relationship )
  16. Dec 11, 2016 #15


    Staff Emeritus
    Science Advisor

    In my opinion, it's misleading to say that the Granger test implies causality. To give a counter-example, statistically we might be able to say that the presence of smoke from a pile of leaves allows us to predict the future presence of flames. But the smoke didn't cause the flames. The smoke is a symptom that typically precedes the flames.
  17. Dec 11, 2016 #16


    Staff: Mentor

    I agree. I would say that it tests for "prediction" rather than "causation". In your example, smoke does predict fire, even if it doesn't cause it.
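    The smoke/flames point can be illustrated with a toy time series. This is a lagged-correlation sketch, not the actual Granger test (which compares autoregressive models), and all numbers are invented: a hidden 'smoldering' state produces smoke immediately and flames two steps later. Smoke then predicts later flames without causing them.

```python
import random

def pearson_r(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

def lagged_r(xs, ys, lag):
    """Correlation between xs[t] and ys[t + lag]."""
    return pearson_r(xs[:len(xs) - lag], ys[lag:])

rng = random.Random(3)
n = 3000

# Hidden common cause: the pile is smoldering in about 10% of time steps
smolder = [1.0 if rng.random() < 0.1 else 0.0 for _ in range(n)]
# Smoke shows up in the same time step
smoke = [s + rng.gauss(0, 0.2) for s in smolder]
# Flames show up two steps after the smoldering started
flames = [0.0, 0.0] + [smolder[t - 2] + rng.gauss(0, 0.2) for t in range(2, n)]

r_forward = lagged_r(smoke, flames, 2)    # smoke now vs flames two steps later
r_backward = lagged_r(flames, smoke, 2)   # flames now vs smoke two steps later
print(f"smoke -> later flames: r = {r_forward:.2f}")
print(f"flames -> later smoke: r = {r_backward:.2f}")
```

    The asymmetry in time is real, but in this toy model it comes entirely from the hidden state, which is the distinction between "prediction" and "causation" made above.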
  18. Dec 11, 2016 #17


    Science Advisor
    Gold Member
    2017 Award

    Yes. Granger causality is a test for causality, not a mechanism. But it seems a better test than correlation.
  19. Dec 11, 2016 #18


    Staff Emeritus
    Science Advisor

    Is that just because correlation can be spurious (that is, just a quirk due to insufficient sample size)? It seems to me that if a correlation persists over a long enough period of time, then it would satisfy the Granger criterion.
  20. Dec 11, 2016 #19


    Science Advisor
    Gold Member
    2017 Award

    Not sure Stevendaryl.

    What about these examples?

    - High market price volatility is correlated with a decline in stock prices. Contemporaneously, stock prices are lower and price volatility is higher than at a previous time.

    Higher volatility as measured in options prices precedes a fall in stock market prices. A spike in volatility as measured in options prices precedes a decline in the stock market.

    Higher volatility is not only correlated with lower stock prices, it Granger causes lower prices. These seem different.

    Empirically it is true that changes in options implied volatilities Granger cause changes in prices. This is well documented in the Euro-dollar futures market.

    Mechanism: Investors are risk averse. If their perception of uncertainty for their investments goes up, they will liquidate risky assets in favor of less risky or riskless assets.

    Historical example of the mechanism: When Colin Powell announced before the United Nations that the United States was going to invade Iraq, the stock market rallied strongly and bonds traded off. The reason was that uncertainty about a war was eliminated. One would suppose that if Powell had announced that the United States had decided not to invade Iraq, the stock market would have also rallied.

    Before the announcement traders spoke of the "Iraq war risk premium" . This gave them a measure of how large the rally could be if the uncertainty about the war was eliminated. Smart investors bought stocks in front of Powell's announcement since they knew that if no new information would come out then stocks would remain near their current depressed levels but that if new information did come out the market would rally.

    - A physicist measures the spin of an electron in some direction. The electron is prepared along random axes beforehand.

    In another experiment, a physicist measures the spin of an electron along a fixed known axis.

    Does the fixing of the axis Granger cause the outcome of the experiment? Does it cause the outcome of the experiment?

    Last edited: Dec 11, 2016
  21. Dec 11, 2016 #20


    Science Advisor
    Gold Member
    2017 Award

    If you can force input values, A, to get correlated results, B, then the case for causality can be very strong (depending on how the experiment is designed). This is the basis of many experimental proofs. It can be purely statistical and not rely on proposing a theory of the mechanism involved.

    But you have to be able to force values of A that allow you to rule out any hidden cause, C. That means you have to force A values in a random way that would be very unlikely to correlate with any unknown independent cause, C. Also you would have to be sure that your method of forcing values for A does not unintentionally force correlated values of the real cause, C, except through the value of A.

    One problem is that it would show that A can cause B, not that it is the cause of B in all cases of interest where A and B are correlated.
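    A toy simulation of the 'force A at random' idea, with invented parameters throughout. In the observational case a hidden C drives both A and B and A has no effect on B. In the experimental case A is set by a coin flip, independently of C, so any remaining A-B correlation has to come from A itself.

```python
import random

def pearson_r(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

rng = random.Random(7)

def observe(n):
    """Observational data: hidden C drives both A and B; A does nothing to B."""
    a_vals, b_vals = [], []
    for _ in range(n):
        c = rng.gauss(0, 1)
        a_vals.append(c + rng.gauss(0, 0.3))
        b_vals.append(c + rng.gauss(0, 0.3))
    return a_vals, b_vals

def intervene(n, effect_of_a):
    """Experiment: A is forced by a coin flip, independently of C."""
    a_vals, b_vals = [], []
    for _ in range(n):
        c = rng.gauss(0, 1)
        a = rng.choice([0.0, 1.0])   # randomly assigned value of A
        a_vals.append(a)
        b_vals.append(c + effect_of_a * a + rng.gauss(0, 0.3))
    return a_vals, b_vals

r_obs = pearson_r(*observe(2000))
r_forced_no_effect = pearson_r(*intervene(2000, effect_of_a=0.0))
r_forced_real_effect = pearson_r(*intervene(2000, effect_of_a=2.0))

print(f"observational, A has no effect: r = {r_obs:.2f}")
print(f"randomized, A has no effect:    r = {r_forced_no_effect:.2f}")
print(f"randomized, A has real effect:  r = {r_forced_real_effect:.2f}")
```

    The simulation assumes the forcing method touches nothing but A, which is exactly the condition that is hard to guarantee with a real compound.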
    Last edited: Dec 11, 2016
  22. Dec 11, 2016 #21


    Science Advisor
    Gold Member
    2017 Award

    I know nothing about econometrics, but it looks like Granger's theories are deeper than I initially thought (he won a Nobel Prize for them). I am tempted to get a couple of his books, but they may be more expensive and require more study than my idle curiosity justifies.
    It seems that there are ways to analyse time series of economic data to infer causality. (If time series of all the major economic factors are analysed can something based solely on statistics be said about the one with the greatest lagged cross-correlation?)

    PS. At this point I fear that I have gone beyond my knowledge base and will leave further comments to others.
    Last edited: Dec 11, 2016
  23. Dec 12, 2016 #22
    I got some news just this morning: the project leader got some 'new biological data' strongly supporting the hypothesis that A is indeed mechanistically causing B.
    I'll ask him what these data were, maybe the KD Ygggdrasil suggested.

    Very interesting discussion in general.

    Maybe to sum up (from my point of view) the input all you guys kindly contributed:
    1. statistics alone (in isolation) won't (always) determine causality between correlated variables
    2. a solid theoretical justification is (usually) needed, making predictions on what should happen
    3. statistics is then the tool one must use to test whether the predictions and experimental results match sufficiently well to support the theory that is proposed
    4. the specific statistical methods one should use for this analysis depend on the nature of the data one is studying and the desired

    I'm sure I'm not capturing the whole picture; I hope I'm getting close enough.

    Thank you all again very much for your contributions.

    PS: a link on Causality