I've always been told "correlation does not imply causation." However, I've never been told much about whether it can imply a probability of causation. Moreover, there seem to be competing and often misused definitions of "to cause", i.e., use in a syllogistic sense versus use in a probabilistic sense.

Please consider the following. Imagine we conduct a strictly controlled experiment only once, and it has one of two outcomes:

(Outcome 1) Y is strongly correlated with X.
(Outcome 2) Y is not correlated with X at all.

Suppose the single experiment has outcome (1), and call the probability that X causes Y, P1. Now go back in time, suppose it instead has outcome (2), and call the probability that X causes Y, P2. Is P1 > P2? Why?

I realize this raises lots of questions: what do I mean by "cause"? What do we know about the experiment? What do we mean by strictly controlled? For the sake of my curiosity, I invite you to provide your own assumptions in answer to these questions. I apologize for the vagaries, but it's the vagaries of this question that have me scratching my head! Please feel free to point out what must be clarified, and to suggest some possible clarifications in response (point out the blanks and fill them in). Thanks so much!
Hey koenigcochran and welcome to the forums. I think there are two useful ways to look at this: the idea that you have incomplete information, and the idea (which builds on the first) that there may be a causal chain between the elements running through the information you do not have. Because of this, we adopt the pessimistic viewpoint that correlation doesn't necessarily imply causation: we have to accept that the data is incomplete, and therefore accept the possibility of causal mechanisms that are independent, in one form or another, of the relationship the observed data suggests. Treating incomplete information as the worst case is important, but it doesn't mean you can't make conjectures based on a good line of reasoning. What you can't do is say for sure that two things are causally linked, which means you can't jump to conclusions automatically just because you see an obvious pattern and a 0.9+ correlation coefficient (a toy example of why is sketched below). At the same time, there is no reason why you can't offer your own insights as to why something is the way it is. Be aware that all of science operates under uncertainty: we have gone from the Newtonian paradigm to the probabilistic/statistical paradigm, and as a result have to do many forms of inference under a high degree of uncertainty.
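To make the "independent causal mechanism" point concrete, here is a minimal Python sketch with made-up numbers: a hidden common cause Z drives both X and Y, X has no effect on Y whatsoever, and the correlation coefficient still comes out above 0.9.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Hypothetical setup: a hidden common cause Z drives both X and Y.
# X has no effect on Y at all.
z = rng.normal(size=n)
x = z + 0.3 * rng.normal(size=n)
y = z + 0.3 * rng.normal(size=n)

# The correlation is ~0.9 anyway, purely because of Z.
print(np.corrcoef(x, y)[0, 1])
```

If Z is part of the information you don't have, nothing in the (x, y) data alone distinguishes this scenario from one where X really does cause Y.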
The other thing is the nature of your information. Most information we deal with is more or less a simplification of something more complex; think of it as a projection from a huge, higher-dimensional space down to something much lower. Because of this simplification we are going to miss things and make bad inferences, especially if we forget what was simplified and how it relates to the un-projected, complete data as a whole. So if you find yourself thinking you have the complete data, when the data is in fact a vast simplification that hides many of the internal mechanisms contributing to the final 'simplified' numeric quantities you use in the analysis, remember that what is measured, how the measurements relate to each other, and the context of the experiment all have a big impact on causality and on analyzing correlation. In fact it is a good idea to mention these things in a report, so that other people are aware of your analyses and can draw their own conclusions, whether favorable or not, in a constructive manner.
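One well-known illustration of how "projecting away" part of the data can mislead is Simpson's paradox: drop a grouping variable and the sign of the relationship between X and Y can flip. A short sketch (all numbers invented):

```python
import numpy as np

# Invented numbers: two subgroups, e.g. a covariate that the
# "simplified" (projected) data no longer records.
x_a = np.array([1.0, 2.0, 3.0]); y_a = np.array([3.0, 2.5, 2.0])
x_b = np.array([6.0, 7.0, 8.0]); y_b = np.array([8.0, 7.5, 7.0])

print(np.corrcoef(x_a, y_a)[0, 1])  # -1.0: Y falls with X in group A
print(np.corrcoef(x_b, y_b)[0, 1])  # -1.0: Y falls with X in group B

# Project the group label away (pool the data) and the sign flips:
x = np.concatenate([x_a, x_b]); y = np.concatenate([y_a, y_b])
print(np.corrcoef(x, y)[0, 1])      # ~ +0.89: Y now appears to rise with X
```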
Hi koenigcochran! This is a cool question. OK, my two cents on this one: there are an infinite number of experiments in which X is the cause of Y in (Outcome 1) and another infinite number in which X is not the cause of Y, and exactly the same thing happens in (Outcome 2). So calculating (and comparing) P1 and P2 would only make sense within a well-defined set of scenarios. There is no way whatsoever to prove (or, interestingly, disprove) causality by just looking at the data; you need to understand the underlying model that makes X and Y behave as they do, and then the causality is only as true as your model is. We humans have the feeling that P1 > P2 because our "well-defined set of scenarios" is not the infinite set of mathematically possible cases but our daily experience, which is full of underlying models that we build throughout our lives to improve our chances of survival. So if we see event Y happen right after event X (whether correlated or not), our collection of human underlying models will take X causing Y to be the more likely scenario. We do the same with the example you pose, considering P1 not just higher but much higher than P2, and in that context we are right: P1 >> P2. In short, once you define the context (whether mathematical, physical, human...) you can talk about how causality relates to correlation.
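To show what a well-defined set of scenarios buys you, here is a toy Python sketch. Every number in it (the prior and the two likelihoods) is an assumption I made up, not a fact about any real experiment; but once you commit to such a scenario set, P1 and P2 become computable and P1 > P2 falls straight out of Bayes' rule.

```python
# Toy scenario set; every number here is an assumption.
p_causal = 0.5             # prior probability that X causes Y

# Assumed likelihoods of seeing a strong correlation in one
# strictly controlled run:
p_corr_if_causal = 0.9     # P(Outcome 1 | X causes Y)
p_corr_if_not = 0.1        # P(Outcome 1 | X does not cause Y)

def posterior(saw_correlation: bool) -> float:
    """P(X causes Y | outcome of the single experiment), by Bayes' rule."""
    like_c = p_corr_if_causal if saw_correlation else 1 - p_corr_if_causal
    like_n = p_corr_if_not if saw_correlation else 1 - p_corr_if_not
    num = like_c * p_causal
    return num / (num + like_n * (1 - p_causal))

p1 = posterior(True)    # after Outcome 1: 0.9
p2 = posterior(False)   # after Outcome 2: 0.1
print(p1 > p2)          # True -- under these assumptions, P1 > P2
```

Change the scenario set (say, allow a common-cause scenario that also makes Outcome 1 likely) and the numbers change with it; that is exactly the sense in which the answer depends on the context you define.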
koenigcochran, you have more or less described Bayesian reasoning. Yes, if you have an a priori probability of a hypothesis, and can calculate the odds of an observation on the basis that the hypothesis is true, and again on the basis that it is false, then you can adjust your probability of the hypothesis in light of the data. This is one reason (the main reason?) that a feasible mechanism for a cause contributes greatly to one's confidence in it. Some would say that without such a mechanism the a priori probability is zero, so it can never rise above that. OTOH, we should always allow for the possibility that we just haven't thought of the mechanism yet. Btw, as indicated by earlier posts, we need to distinguish between a cause and a causal connection: there may be a common cause.
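For the record, the update being described here is just Bayes' rule. Writing H for "X causes Y" and D for the observed outcome:

$$P(H \mid D) = \frac{P(D \mid H)\,P(H)}{P(D \mid H)\,P(H) + P(D \mid \neg H)\,P(\neg H)}$$

With D = Outcome 1 this gives P1, and with D = Outcome 2 it gives P2. Whenever a strong correlation is more likely under H than under ¬H, the likelihood ratio exceeds 1 for Outcome 1 and falls below 1 for Outcome 2, so P1 > P2, provided the prior P(H) is not zero in the first place, which is exactly the "feasible mechanism" point above.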