Jason's Homepage

Stats Notes

SEMrefs

Statistics Links

Other links

Lecture 16
Correlation and Causation

I'd like to make a few comments about correlation and causation. It is often said that a correlation between two variables does not imply causation. This is generally true. Just because the amount of time spent working on an exam is positively correlated with the grade on the exam does not mean that spending more time on an exam will necessarily lead to a better grade. The two might be correlated for some other reason.

In some instances, the adage about correlation and causation does not hold. Consider a well-conducted experiment that compares two groups. We compute a correlation coefficient and find a significant relationship between the experimental variable (i.e., group membership) and the dependent variable. With two groups, testing this correlation is equivalent to conducting an independent-samples t-test. Assuming we have eliminated any confounds and used random assignment, we should be able to conclude that the independent variable caused the differences in the dependent variable. So in this case, a correlation between the independent variable and the dependent variable indicates that the independent variable had a causal effect on the dependent variable.
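To make the equivalence between the correlation and the t-test concrete, here is a minimal sketch in Python. The scores and the names `control`, `treatment`, `pearson_r`, and `pooled_t` are made up for illustration. It assumes group membership is coded 0/1 and that the t-test is the usual pooled-variance version, in which case t can be recovered from r with t = r * sqrt(df / (1 - r^2)), where df = N - 2.

```python
import math

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def pooled_t(group0, group1):
    """Independent-samples t statistic using the pooled variance estimate."""
    n0, n1 = len(group0), len(group1)
    m0, m1 = sum(group0) / n0, sum(group1) / n1
    ss0 = sum((v - m0) ** 2 for v in group0)
    ss1 = sum((v - m1) ** 2 for v in group1)
    pooled_var = (ss0 + ss1) / (n0 + n1 - 2)
    return (m1 - m0) / math.sqrt(pooled_var * (1 / n0 + 1 / n1))

# Hypothetical scores for two randomly assigned groups (illustrative only).
control = [4.0, 5.0, 6.0, 5.0, 4.0]
treatment = [6.0, 7.0, 8.0, 7.0, 9.0]

# Code group membership as 0 (control) or 1 (treatment).
group = [0] * len(control) + [1] * len(treatment)
scores = control + treatment

r = pearson_r(group, scores)      # correlation with group membership
df = len(scores) - 2
t_from_r = r * math.sqrt(df / (1 - r ** 2))
t_direct = pooled_t(control, treatment)

print("r =", round(r, 4))
print("t from r    =", round(t_from_r, 4))
print("t from test =", round(t_direct, 4))
```

The two t values agree up to rounding error, so the significance test for the correlation and the t-test lead to the same conclusion.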

If a non-experimental study is conducted, there are likely to be some alternative explanations for a relationship between the independent and the dependent variable, because we did not randomly assign participants to experimental conditions.

Common Alternative Explanations for a Correlation Between X and Y
There are three general reasons we cannot conclude that X caused Y just because X and Y are correlated (in cases where X is not experimentally manipulated). First, we don't know for sure that Y did not cause X. I'll use arrows to indicate causation.

X ----> Y (X causes Y)

But it might be that Y causes X, as in the following diagram:

Y ----> X (Y causes X)

Second, we can't rule out that X and Y cause each other:

X <----> Y (X and Y cause each other)

Finally, there is often a third variable, call it Z, that might cause both X and Y, as this diagram points out:

X <---- Z ----> Y (Z causes both X and Y)

The latter is referred to as the "third variable problem". My favorite example of the third variable problem is the correlation between the number of fire hydrants in a city and the number of dogs in a city. Cities with more fire hydrants tend to have more dogs. Why the relationship? A third variable. Any guesses what the third variable might be?
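A small simulation can show how a third variable produces a correlation between two variables that have no effect on each other. This is only a sketch under made-up assumptions: a hypothetical third variable Z is drawn from a standard normal distribution, and X and Y are each built as Z plus independent noise, so neither X nor Y influences the other. The sample size, the seed, and the helper `pearson_r` are arbitrary choices for illustration. The last step computes the first-order partial correlation between X and Y controlling for Z.

```python
import math
import random

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

random.seed(16)   # arbitrary seed so the run is repeatable
n = 5000          # arbitrary sample size

# Z is the third variable. X and Y each depend on Z plus their own noise,
# and there is no direct path between X and Y.
z = [random.gauss(0, 1) for _ in range(n)]
x = [zi + random.gauss(0, 1) for zi in z]
y = [zi + random.gauss(0, 1) for zi in z]

r_xy = pearson_r(x, y)
r_xz = pearson_r(x, z)
r_yz = pearson_r(y, z)

# Partial correlation between X and Y with Z held constant.
partial_xy_given_z = (r_xy - r_xz * r_yz) / math.sqrt(
    (1 - r_xz ** 2) * (1 - r_yz ** 2)
)

print("r(X, Y)     =", round(r_xy, 3))
print("r(X, Y | Z) =", round(partial_xy_given_z, 3))
```

In this setup X and Y are clearly correlated, yet the partial correlation is close to zero once Z is held constant, which is the pattern we would expect if Z accounts for the whole relationship.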

 
