Causal Inference
Exercise
Suppose that your company's ad spending and revenue are found to be positively associated. What are some potential explanations for why these values are positively associated?
Solution. Perhaps both revenue and ad spending are associated with a third variable, such as proximity to the holiday season. Or maybe management decides to spend more on ads when they have more revenue. Or maybe more ad spending results in more ad impressions and leads to increased sales.
"Association does not imply causation" is a cautionary mantra, and as such it raises an important question: how can we use statistics to discern causation? There are many applications in business and science where the distinction between association and causation is exactly what we're interested in.
We will develop the counterfactual model for describing causation mathematically. The idea is to model causal relationships using random variables which describe potential outcomes.
For example, suppose you choose to drive rather than take the train to work, and you end up being late. It's natural to wonder: would I have been on time if I'd taken the train? In your mind, you're pondering two random variables: the amount of time C(train) that it would have taken if you'd chosen the train, and the amount of time C(car) that it was going to take if you drove. You would model both of these as random variables since you don't know their values at the outset of the trip. When your journey is complete, you've been able to observe the value of one of these random variables, but not the other. Given your decision X, your observed outcome is Y = C(X).
To simplify, let's let C(train) be 0 if the train gets you there on time and 1 if it makes you late. Similarly, we let C(car) be 0 if driving gets you there on time and 1 if it makes you late. Also, we'll use C(0) and C(train) interchangeably, as well as C(1) and C(car) (in other words, we encode train and car as 0 and 1, respectively).
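To make this concrete, here is a minimal Julia sketch of the counterfactual model for this example (the lateness probabilities are invented for illustration): each day carries a full counterfactual pair (C(0), C(1)), but we only ever observe Y = C(X).

using Distributions

n = 10          # number of simulated commutes
p_train = 0.1   # hypothetical probability of being late on the train
p_car = 0.4     # hypothetical probability of being late in the car

C0 = rand(Bernoulli(p_train), n)   # counterfactual outcomes for the train
C1 = rand(Bernoulli(p_car), n)     # counterfactual outcomes for the car
X = rand(Bernoulli(0.5), n)        # each day's decision (false = train, true = car)
Y = ifelse.(X, C1, C0)             # observed outcome: Y = C(X)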
Exercise
Suppose that the joint distribution of X, C(0), and C(1) is the uniform distribution on the rows of a table compatible with the following one:

X   Y   C(0)   C(1)
0   0   0      *
0   0   0      *
0   0   0      *
0   0   0      *
1   1   *      1
1   1   *      1
1   1   *      1
1   1   *      1
Note that the asterisks indicate counterfactual outcomes which are not observed.
We define the association to be

α = E[Y | X = 1] − E[Y | X = 0],

and the average causal effect to be

θ = E[C(1)] − E[C(0)].
Find the association as well as the largest and smallest possible values for the average causal effect. Describe a scenario in which a probability measure giving rise to these extreme average causal effect values might be plausible.
Solution.
The association is α = 1, while the largest possible value for the average causal effect occurs when the last column is all ones and the next-to-last is all zeros. That gives an average causal effect of θ = 1. The smallest possible value is θ = 0, which occurs if the first four rows are all zeros in the last two columns and the last four rows are all ones.
Interpretation-wise, this makes sense. If the table ends in (0, 1) in every row, that means that taking the train always results in our being on time, while taking the car always results in our being late. The value of X in that case definitely has a causal effect. Conversely, if the top half of the table is all zeros in the last two columns and the bottom half is all ones, then on the days we took the train we would have been on time regardless of our mode of transit, and on the days we took the car we would have been late no matter what. So there is no causal effect in that case, and θ is appropriately equal to 0. The latter scenario is plausible if, say, we check the forecast each morning and take the car when the weather is bad: bad weather makes us late no matter how we travel, and good weather means we arrive on time either way.
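As a quick check, here is a short Julia computation of the association and of the two extreme ways of filling in the asterisks (a sketch; the vectors simply transcribe the table above):

using Statistics

X = [0, 0, 0, 0, 1, 1, 1, 1]   # decisions from the table
Y = [0, 0, 0, 0, 1, 1, 1, 1]   # observed outcomes

α = mean(Y[X .== 1]) - mean(Y[X .== 0])   # association: 1.0

# largest causal effect: fill the asterisks so every row ends in (0, 1)
θ_max = mean(ones(8)) - mean(zeros(8))    # 1.0

# smallest causal effect: first four rows (0, 0), last four rows (1, 1)
C0 = [0, 0, 0, 0, 1, 1, 1, 1]
C1 = [0, 0, 0, 0, 1, 1, 1, 1]
θ_min = mean(C1) - mean(C0)               # 0.0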
The punch line of Problem 2 is a negative one: it tells us that the missing counterfactual outcomes can make it impossible to use association to say anything about the causal effect. However, this is not always the case:
Exercise
Suppose that you flip a coin every day to determine whether to take the train or the car. In other words, suppose that X is independent of (C(0), C(1)). Show that in that case, we have α = θ.
Solution. We have

α = E[Y | X = 1] − E[Y | X = 0]
  = E[C(1) | X = 1] − E[C(0) | X = 0]
  = E[C(1)] − E[C(0)]
  = θ,

where the second equality uses the relation Y = C(X), and the third uses the independence of X and (C(0), C(1)).
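We can check this numerically with a simulation (a sketch; the counterfactual distribution is chosen arbitrarily):

using Distributions, Statistics

n = 100_000
C0 = rand(Bernoulli(0.3), n)   # arbitrary counterfactual distributions
C1 = rand(Bernoulli(0.6), n)
X = rand(Bernoulli(0.5), n)    # coin-flip treatment, independent of (C0, C1)
Y = ifelse.(X, C1, C0)         # observed outcome Y = C(X)

α = mean(Y[X]) - mean(Y[.!X])  # association
θ = mean(C1) - mean(C0)        # average causal effect
(α, θ)                         # approximately equal, up to sampling error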
A study in which the treatment value is not randomly assigned is called an observational study. Observational studies are subject to confounding from variables such as the weather in the scenario described in Problem 2. In that situation, the weather Z was associated with both X and (C(0), C(1)), and their non-independence caused the association α to differ from the causal effect θ.
However, if X and (C(0), C(1)) are independent conditioned on Z, and if we record the value of Z as well as X and Y in our study, then we can obtain an unbiased estimator of the causal effect from an unbiased estimator of the association by performing that estimation within each Z group and averaging the results, weighted by the probability of each group:

Σ_z (E[Y | X = 1, Z = z] − E[Y | X = 0, Z = z]) P(Z = z).

This is called the adjusted treatment effect.
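Here is a minimal Julia sketch of this procedure (the data-generating process is invented: bad weather makes us both more likely to drive and more likely to be late, and the true effect of X on the probability of being late is 0.3):

using Distributions, Statistics

n = 100_000
Z = rand(Bernoulli(0.5), n)                        # weather: 1 means bad
X = [rand(Bernoulli(z ? 0.8 : 0.2)) for z in Z]    # more likely to drive in bad weather
Y = [rand(Bernoulli(0.1 + 0.3x + 0.4z)) for (x, z) in zip(X, Z)]

# association within each weather group, averaged with weights P(Z = z):
adjusted = sum(mean(Z .== z) *
               (mean(Y[(X .== 1) .& (Z .== z)]) - mean(Y[(X .== 0) .& (Z .== z)]))
               for z in (0, 1))                    # ≈ 0.3

naive = mean(Y[X .== 1]) - mean(Y[X .== 0])        # confounded: well above 0.3
(adjusted, naive)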
Exercise
Suppose that the probability measure of (Z, X, C(0), C(1)) is uniform on the rows of the following table (Z = 0 means good weather and Z = 1 means bad weather).
(a) Compute the association α.
(b) Compute the average causal effect θ.
(c) Show that X and (C(0), C(1)) are conditionally independent given Z, and compute the adjusted treatment effect.
Solution. (a) The association is equal to α = E[Y | X = 1] − E[Y | X = 0], which we compute by comparing the average of the Y values in the X = 1 rows of the table to the average in the X = 0 rows.
(b) The average causal effect is equal to θ = E[C(1)] − E[C(0)], the difference between the averages of the C(1) and C(0) columns.
(c) The conditional distribution of (C(0), C(1)) given X = 0 and Z = 0 places half its probability mass at each of two points, and the conditional distribution of (C(0), C(1)) given X = 1 and Z = 0 likewise places half its probability mass at the same two points. So X and (C(0), C(1)) are conditionally independent given Z = 0. A similar calculation shows that X and (C(0), C(1)) are conditionally independent given Z = 1. So X and (C(0), C(1)) are conditionally independent given Z.
The adjusted treatment effect is the average of E[Y | X = 1, Z = 0] − E[Y | X = 0, Z = 0] (coming from Z = 0) and E[Y | X = 1, Z = 1] − E[Y | X = 0, Z = 1] (coming from Z = 1). So it is indeed equal to the average causal effect θ.
Continuous random variables
Although we've focused on binary random variables, essentially the same analysis carries over to continuous random variables. If X is real-valued, then the counterfactual vector (C(0), C(1)) becomes a counterfactual process C, which specifies the outcome C(x) that results from each possible value x of X. As in the binary case, only one of the values of the random function C is ever seen for a given observation: given X, we observe Y = C(X).
Example
Suppose that X is a Uniform(0, 10) random variable, that U and V are Uniform(0, 1) and Uniform(-5, 5) random variables (respectively), and that X, U, and V are independent. Suppose that Y = C(X) and that

C(x) = 5 + U       if X + V < 5,
C(x) = x + sin(Ux) otherwise.
Plot several instances of the random function x ↦ C(x) over the interval 0 ≤ x ≤ 10.
Solution.
using Plots, Distributions

plot(xlabel = "x", ylabel = "C(x)")
for i in 1:10
    Z = rand(Uniform(0, 10))   # a draw of X
    U = rand(Uniform(0, 1))
    V = rand(Uniform(-5, 5))
    plot!(0:0.01:10, Z + V < 5 ? x -> 5 + U : x -> x + sin(U*x))
end
current()
Example
Draw 1000 observations from the joint distribution of X and Y, and make a scatter plot.
Solution.
using Plots, Distributions

points = Tuple{Float64, Float64}[]
for i in 1:1000
    U = rand(Uniform(0, 1))
    V = rand(Uniform(-5, 5))
    X = rand(Uniform(0, 10))
    Y = if X + V < 5
        5 + U
    else
        X + sin(U*X)
    end
    push!(points, (X, Y))
end
scatter(points, ms = 1.5, msw = 0.5, color = :LightSeaGreen, markeralpha = 0.5)
Example
The causal regression function is x ↦ E[C(x)]. Find the causal regression function in the example above.
Solution. Since X + V is symmetric about 5, the event X + V < 5 has probability 1/2, and U is independent of that event. So E[C(x)] = (1/2)E[x + sin(Ux)] + (1/2)E[5 + U], which we can compute and plot symbolically:

using SymPy
@vars x u
f = 1//2 * integrate(x + sin(u*x), (u, 0, 1)) + 1//2 * integrate(5 + u, (u, 0, 1))
plot!(0.01:0.01:10, x -> f(x), lw = 2, color = :purple)
Exercise
How does the causal regression function compare to the regression function? Feel free to eyeball the regression function from the graph.
Solution. The causal regression function weights the X + V < 5 and X + V ≥ 5 parts of the probability space equally all along the range from x = 0 to x = 10, rather than giving more weight to the former condition when x is close to 0 and more to the latter when x is close to 10.
You can imagine the distinction between the regression function and the causal regression function by visualizing a person sitting at a particular value x₀ of X and watching a sequence of observations. For the causal regression function, they record every value of C(x₀) they observe. For the ordinary regression function, they wait until they see an observation whose X value is very close to x₀, and only then do they record the pair (X, Y) for that observation.
When x is close to the extremes in this example, the additional conditioning on X ≈ x performed in the ordinary regression obscures the causal relationship between X and Y.
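To see the difference numerically, we can estimate the ordinary regression function by local averaging over the simulated points from the scatter plot example and overlay it on the previous plot (a sketch; the window width h is an arbitrary choice):

using Statistics

# crude local-average estimate of the regression function E[Y | X ≈ x]
function regression_estimate(points, x; h = 0.25)
    mean(y for (xi, y) in points if abs(xi - x) < h)
end

xs = 0.5:0.5:9.5
plot!(xs, [regression_estimate(points, x) for x in xs], lw = 2, color = :orange)

Near x = 0 the estimate hugs the flat 5 + U branch, and near x = 10 it hugs the x + sin(Ux) branch, while the causal regression function mixes the two branches equally everywhere.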
The formula for the adjusted treatment effect in the continuous case becomes

θ(x) = ∫ E[Y | X = x, Z = z] f_Z(z) dz,

where f_Z is the density of Z (note that this is the same idea as in the discrete case: we're averaging the z-specific estimates E[Y | X = x, Z = z], weighted by how frequently those z-values occur).
And as in the discrete case, the adjusted treatment effect is equal to the causal regression function E[C(x)] if X and C(x) are conditionally independent given Z. This implies that, again assuming conditional independence of X and C(x) given Z, if r̂(x, z) is a consistent estimator of E[Y | X = x, Z = z], then (1/n) Σᵢ r̂(x, Zᵢ) is a consistent estimator of θ(x).
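In code, the plug-in idea looks like this (a sketch: r̂ here is a made-up fitted regression, standing in for whatever consistent estimator of E[Y | X = x, Z = z] is available):

# plug-in estimator of the causal regression function:
# average r̂(x, Zᵢ) over the observed values of Z
causal_regression_estimate(r̂, x, Z) = sum(r̂(x, z) for z in Z) / length(Z)

r̂(x, z) = 1 + 2x + 3z                    # hypothetical fitted regression
Z = randn(1000) .+ 5                     # observed confounder values, E[Z] = 5
causal_regression_estimate(r̂, 2.0, Z)    # ≈ 1 + 2·2 + 3·5 = 20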
If the regression function of Y given X and Z is linear (that is, E[Y | X = x, Z = z] = β₀ + β₁x + β₂z), then we can control for Z merely by including Z as a feature in the ordinary least squares regression. In other words, if C is independent of X given Z, then β̂₀ + β̂₁x + β̂₂Z̄ is a consistent estimator of θ(x) = β₀ + β₁x + β₂E[Z].
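Here is a simulated illustration of this point (a sketch with an invented linear data-generating process; the true causal coefficient on X is 2):

n = 10_000
Z = randn(n)                              # confounder
X = Z .+ 0.5 .* randn(n)                  # treatment is pushed around by Z
Y = 1 .+ 2 .* X .+ 3 .* Z .+ randn(n)     # E[Y | X, Z] is linear

β_naive = [ones(n) X] \ Y                 # regressing on X alone: slope biased upward
β_adj = [ones(n) X Z] \ Y                 # including Z as a feature: slope ≈ 2
(β_naive[2], β_adj[2])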
Exercise
Suppose that
(a) Calculate
(b) Calculate
(c) Suppose that
(d) Show that if
Solution. (a) We have
(b) We have
(c) We have
(d) We have
Conclusion
We conclude by noting that the conditional independence of X and the counterfactual process C given Z is an assumption which cannot be checked statistically, since the counterfactual outcomes are never jointly observed. It has to be justified using domain knowledge about the variables in the study.
Congratulations! You've finished the Data Gymnasia Statistics Course.