The Bias of Using Observational Data to Estimate Causal Effect

Let's consider the effect of college attendance on an individual's mental ability. Suppose we observe that individuals who attended college score higher than those who did not. Does college attendance cause an increase in mental ability? There are three possible explanations. First, attending college might make individuals smarter on average. Second, those who attend college might have been smarter in the first place (i.e., even if they had not attended college, their mental ability would still be higher). Third, the mental ability of those who attend college may increase more than that of non-attendees would have increased had they attended (i.e., the two groups respond differently to college attendance). We will make these explanations precise using mathematical formulas.

Let's define the Naive Estimator as:

\[\hat{\delta} = E_N[y_i|d_i = 1] - E_N[y_i|d_i=0]\]

Here, \(N\) is the sample size of the observational data and \(E_N\) denotes the sample average. \(y_i\) is the realized outcome (the observed mental ability) of individual \(i\). \(d_i = 1\) means the individual received the treatment and \(d_i=0\) means they did not. The estimator suggests that the treatment effect can be estimated by subtracting the average mental ability of those who did not attend college from the average mental ability of those who did. Of course, this estimate is a naive one.
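As a minimal sketch of the Naive Estimator, the snippet below computes the difference in group means on a tiny made-up sample. The function name `naive_estimator` and the scores in `y` are hypothetical and chosen only for illustration.

```python
from statistics import mean

def naive_estimator(y, d):
    """Mean observed outcome of the treated (d=1) minus that of the untreated (d=0)."""
    treated = [yi for yi, di in zip(y, d) if di == 1]
    control = [yi for yi, di in zip(y, d) if di == 0]
    return mean(treated) - mean(control)

# Hypothetical observed data: y holds test scores, d flags college attendance.
y = [110, 120, 115, 100, 95, 105]
d = [1, 1, 1, 0, 0, 0]

delta_hat = naive_estimator(y, d)
print(delta_hat)
```

Note that only observed outcomes enter the calculation: nothing in `y` tells us what a college attendee would have scored without college, or the reverse.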

The Average Treatment Effect (ATE) is defined as \(E[\delta]=E[Y^1] - E[Y^0]\). The superscripts \(1\) and \(0\) indicate whether the treatment is received or not. Notice that \(Y\) is a random variable, as opposed to \(y_i\), which is a realized value of that random variable. Also note that, as opposed to the individual treatment effect, which is defined as \(\delta_i = y_i^1 - y_i^0\), we are interested in the aggregate causal effect. Let \(\pi\) be the proportion of the population that takes the treatment. By the law of total expectation, we can rewrite the ATE as:

\[\begin{aligned} E[\delta] ={}& \{\pi E[Y^1|D=1]+(1-\pi)E[Y^1|D=0]\} \\ & - \{\pi E[Y^0|D=1]+(1-\pi)E[Y^0|D=0]\} \end{aligned}\]

As the sample size \(N\) grows, \(E_N[y_i|d_i = 1] \to E[Y^1|D=1]\) and \(E_N[y_i|d_i = 0] \to E[Y^0|D=0]\). Also, \(E_N[d_i] \to \pi\). However, there is no assumption-free way to compute the two remaining unknowns, \(E[Y^1|D=0]\) and \(E[Y^0|D=1]\), which are the counterfactuals: they are never observed in the data. Therefore, we cannot tell from the data alone whether the Naive Estimator equals the ATE. So, when will they differ?

Let's rearrange the ATE formula in the following way (the algebra is a bit tricky, but it is just algebra). Let \(E[\delta]=e\), \(E[Y^1|D=1]=a\), \(E[Y^1|D=0]=b\), \(E[Y^0|D=1]=c\), and \(E[Y^0|D=0] = d\). Then, \(e=\pi a + b - \pi b - \pi c - d + \pi d\). Moving everything to one side gives \(0 = e - b + d - \pi a + \pi b + \pi c - \pi d\). We want an expression for \(a - d\), the quantity the Naive Estimator converges to. Adding this zero to \(a - d\) gives \(a - d = (a - d) + ...\), which becomes \(a - d = e + a - b - ...\). Finally, it simplifies to:

\[a - d = e + (c - d) + (1 - \pi)[(a - c) - (b - d)]\]
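For readers who want the intermediate steps, here is one way the algebra can be filled in, using the same symbols \(a\), \(b\), \(c\), \(d\), \(e\), and \(\pi\) as above. It is a sketch of one possible route, not necessarily the only one:

\[\begin{aligned}
e &= \pi a + (1-\pi) b - \pi c - (1-\pi) d \\
a - d &= e + (a - d) - [\pi a + (1-\pi) b - \pi c - (1-\pi) d] \\
&= e + (1-\pi) a - (1-\pi) b + \pi c - \pi d \\
&= e + (c - d) - (1-\pi)(c - d) + (1-\pi)(a - b) \\
&= e + (c - d) + (1-\pi)[(a - c) - (b - d)]
\end{aligned}\]

The fourth line uses \(\pi (c - d) = (c - d) - (1-\pi)(c - d)\).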

\(a - d\), the limit of the Naive Estimator, can differ from \(e\) only when \(c - d\) is non-zero or \((a - c) - (b - d)\) is non-zero. \(c - d = E[Y^0|D=1] - E[Y^0|D=0]\) is the baseline bias (those who attend college would have had higher mental ability than those who do not, even without attending). \((a - c) - (b - d) = (E[Y^1|D=1] - E[Y^0|D=1]) - (E[Y^1|D=0] - E[Y^0|D=0]) = E[\delta|D=1] - E[\delta|D=0]\). This term, weighted by \(1 - \pi\), is called the differential treatment effect bias (the two groups respond differently to college attendance).
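The decomposition can be checked numerically. The sketch below assumes a small hypothetical population in which both potential outcomes are known for everyone (something never available in real data) and verifies that the naive difference equals the ATE plus the two bias terms. All numbers and variable names are made up for illustration.

```python
from statistics import mean

# Hypothetical population: each row is (y1, y0, d), i.e. the outcome with
# college, the outcome without college, and whether the person attended.
population = [
    (130, 115, 1), (125, 110, 1), (120, 110, 1), (135, 115, 1),
    (105, 100, 0), (100, 95, 0), (110, 100, 0), (95, 95, 0),
    (100, 90, 0), (105, 100, 0),
]

def cond_mean(idx, group):
    """Mean of column idx (0 for y1, 1 for y0) among rows with d == group."""
    return mean(row[idx] for row in population if row[2] == group)

a = cond_mean(0, 1)  # E[Y^1 | D=1], observable
b = cond_mean(0, 0)  # E[Y^1 | D=0], counterfactual
c = cond_mean(1, 1)  # E[Y^0 | D=1], counterfactual
d = cond_mean(1, 0)  # E[Y^0 | D=0], observable
pi = mean(row[2] for row in population)             # share treated
ate = mean(row[0] - row[1] for row in population)   # e = E[Y^1] - E[Y^0]

naive = a - d
baseline_bias = c - d
differential_bias = (1 - pi) * ((a - c) - (b - d))

print(naive, ate + baseline_bias + differential_bias)
```

In this made-up population the attendees have both a higher baseline and a larger response, so both bias terms are positive and the naive difference overstates the ATE.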

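When trying to recover the causal effect from observational data, we use different techniques to remove these two biases. The cleanest way to avoid them altogether is to conduct a randomized controlled experiment, in which \((Y^1,Y^0)\) is independent of \(D\). Under independence, \(E[Y^0|D=1] = E[Y^0|D=0]\) and \(E[\delta|D=1] = E[\delta|D=0]\), so both bias terms vanish and the Naive Estimator converges to the ATE.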
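The effect of randomization can be illustrated with a small simulation. The sketch below is built entirely on assumptions: the outcome model, the selection rule, the threshold of 105, and the seed are hypothetical choices, picked so that self-selection produces both a baseline bias and a differential treatment effect bias.

```python
import random
from statistics import mean

random.seed(0)
N = 50_000

# Hypothetical potential outcomes: the gain from college is larger for
# people whose untreated ability y0 is higher.
y0 = [random.gauss(100, 15) for _ in range(N)]
y1 = [v + 5 + 0.2 * (v - 100) for v in y0]
ate = mean(t - u for t, u in zip(y1, y0))

def naive(d):
    """Naive estimator: treated show y1, untreated show y0."""
    treated = [y1[i] for i in range(N) if d[i] == 1]
    control = [y0[i] for i in range(N) if d[i] == 0]
    return mean(treated) - mean(control)

# Self-selection: people with higher y0 are more likely to attend.
d_selected = [1 if v + random.gauss(0, 10) > 105 else 0 for v in y0]
# Randomization: a coin flip, independent of (y1, y0).
d_random = [1 if random.random() < 0.5 else 0 for _ in range(N)]

naive_selected = naive(d_selected)
naive_random = naive(d_random)
print(f"ATE={ate:.2f} self-selected={naive_selected:.2f} randomized={naive_random:.2f}")
```

Under these assumptions the self-selected comparison should land well above the ATE, while the randomized comparison should land close to it, up to sampling noise.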