Hello everyone, we were discussing a project at work just yesterday (I work as a researcher in a medium-size biotech), and at some point the (in)famous 'correlation doesn't imply causation' sentence came up. I would like to know what you think, please. I apologise in advance for the long post: the topic is a bit complex, at least for me.

The project's lead chemist showed that in several diverse series of chemical compounds there was a strong positive correlation between the biological activity observed against target A and the biological activity in an assay B. The trend was observed over a broad range of activities, i.e. not only did we have many compounds active in both A and B, but also many that were poorly active or inactive in A and poorly active or inactive in B. Only very few compounds fell in the other two 'quadrants' of non-matching activities. People usually take this as 'evidence' that 'acting' on target A causes a cascade of biochemical events ultimately leading to the desired effect in assay B, and that the 'strength' of the action on A (the activity, in this case) is linked by some (at least monotonic, or even roughly linear) function to the 'magnitude' of the effect in B.

However, the project leader was also there, and he said that observing a correlation between the activities or potencies in A and B doesn't imply that A is the mechanistic cause of B. I am inclined to agree on a purely logical / statistical basis. All the books on statistics I have read so far had this word of caution about not over-interpreting correlation. I read the examples, and I am convinced that in the most general sense, two variables A and B, paired or linked by some criterion, may show a correlation in their values even when the value of A isn't determining in any way the value of B.

However, this left me quite baffled about what to do in practice once one gets such data. Should we use the above information about A and B, or not? If after 100 observations we can say the activity on A is 'so far' a good predictor of the activity in B, should we refuse to use A to predict B just because we can't be 100% sure that 'A causes B'? And assuming that we're not comfortable using A as a predictor, what numbers or evidence do we need, and how do we test or inspect the hypothesis that A doesn't cause B?

If we had only one series of compounds, all similar in structure, one could say that the activity in A is perhaps coincidental, and that there is instead another unknown target X that is the real cause of the activity in B. So the real causal implication is not A --> B but X --> B, and the association between A and X happens by chance in the set of compounds we tested. That's fair enough, and it has actually happened a few times in projects I worked on; but it only ever happened in specific series of closely related compounds. When a large number of diverse compounds have been tested and all showed a good A-B association, what are the chances that a 'hidden' parameter is playing a role? How could so many compounds, regardless of their structure, all have activities on A associated with activities on X purely by chance? At some point A and X would become so similar that activity in one would always imply activity in the other, so for all practical purposes they would be equivalent. I get the point that certainty doesn't exist; obviously we can't run a huge series of experiments, test all the molecules in the world and see whether the 'wrong quadrants' get populated.
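Just to make the 'hidden target X' worry concrete to myself, I tried a toy simulation (purely made-up numbers, nothing to do with our actual assays): a latent affinity for X drives both the measured activity on A and the readout in B, A has no effect on B at all, and yet A and B come out strongly correlated.

```python
# Toy sketch of the 'hidden target X' scenario: X drives both the activity
# measured against A and the readout in assay B, while A has no direct
# effect on B. All numbers are invented for illustration only.
import numpy as np

rng = np.random.default_rng(0)
n = 100  # pretend we tested 100 diverse compounds

# Latent affinity for the hidden target X (say, a log-scale potency)
x = rng.normal(loc=6.0, scale=1.0, size=n)

# Measured activity on A and readout in B both track X plus assay noise;
# note that activity_A is never used to generate activity_B.
activity_A = x + rng.normal(scale=0.4, size=n)
activity_B = x + rng.normal(scale=0.4, size=n)

r = np.corrcoef(activity_A, activity_B)[0, 1]
print(f"Pearson r between A and B: {r:.2f}")  # typically ~0.8-0.9
```

So I do see how the pattern alone can't distinguish A --> B from X --> (A and B); my question is more about how far that worry should carry once the compound set is large and structurally diverse.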
From the little I know of statistics, though, I seem to understand that when you sample randomly from a population, the 'behaviour' you observe in your sample gradually approaches the one you'd observe in the overall population as the sample size increases. Isn't it about the same here? OK, we don't really sample randomly, but as I said, chemical diversity is a good criterion to ensure that systematic bias is reduced, if not eliminated. How big and how diverse must our sample be before we can confidently say that we're observing something similar to what the whole set of existing compounds would show? In biology or biochemistry it's already difficult enough to measure activity and show a link between things that are known to be causally related. How stringent do we want (or need) to be, and in particular, can we measure our 'confidence' the way statisticians do when they test hypotheses?

I'm trying to think of how many discoveries we would miss if we under-interpreted correlations, and whether that's better or worse than making wrong causal links by over-interpreting them. If someone told you that having above a given concentration of 'bad' cholesterol in your blood increases your chances of developing heart disease, what evidence would you ask for? This is a case where they actually found plaques of cholesterol in blood vessels. If we were incredibly fastidious about it, we might say 'OK, the plaques obstruct blood vessels and cause ischemic events, but there is no proof that a high concentration of that cholesterol is the cause of the formation of these plaques; the fact that people with high cholesterol have more plaques may be a coincidence'; or 'what if the reverse is true, i.e. people who are predisposed to heart disease are more likely to have high cholesterol and/or plaques', etc. Where do we stop questioning whether what we observe 'makes sense'? Do we have to believe that everything is completely random and that no number of observations ever 'means' anything? That sounds a lot like cognitive instability to me. On the other hand, I read the story about hormone replacement therapy, which was prescribed to menopausal women because it was shown (from historical data) to reduce their risk of heart attacks, only for it to be found years later that it actually increased mortality by favouring several other equally serious events (cancer, stroke...). So it's not easy...

Any thoughts? Thanks L

[BTW, I read this thread: https://www.physicsforums.com/threads/correlation-and-the-probability-of-causation.611197/ but I didn't find my answer there, hence my post.]
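PS: since I asked whether we can 'measure our confidence' like statisticians do, here is roughly what I had in mind, just a sketch assuming a plain Pearson correlation and Fisher's z-transform (the r and n below are made up, not our project's data):

```python
# Approximate 95% confidence interval for a Pearson correlation,
# using Fisher's z-transform. Illustrative numbers only.
import math

def pearson_r_ci(r: float, n: int, z_crit: float = 1.96) -> tuple[float, float]:
    """Return an approximate 95% CI for the underlying correlation."""
    z = math.atanh(r)              # Fisher transform of the sample r
    se = 1.0 / math.sqrt(n - 3)    # standard error of z
    lo, hi = z - z_crit * se, z + z_crit * se
    return math.tanh(lo), math.tanh(hi)

# e.g. r = 0.85 observed across 100 diverse compounds
print(pearson_r_ci(0.85, 100))  # roughly (0.78, 0.90)
```

If I understand correctly, though, this only tells us how precisely the correlation itself is estimated from n compounds; it says nothing about whether A actually causes B, which I suppose is exactly the gap I'm asking about.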