Before defining the Doubly Robust Estimator, let me define a few functions and estimators. We denote the dependent variable as y, the binary treatment variable as D, and the nuisance covariates as X. Within the control group (D=0), we can fit a model g^0(X) that predicts y from X; similarly, we can fit g^1(X) within the treated group (D=1). If the two groups are comparable, this alone might already be a fairly good model. (Of course, for some learners, with many covariates the curse of dimensionality means the model may effectively ignore whether D is 0 or 1 and predict from X alone, which can underestimate a policy that actually has an effect.)
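As a minimal sketch of the step above (on simulated data with a known effect of 2.0; the variable names y, D, X follow the text, while the data-generating process and the choice of LinearRegression are my own assumptions for illustration):

```python
# Fit separate outcome models g^0 and g^1 on the control (D=0)
# and treated (D=1) groups, as described in the text.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 1000
X = rng.normal(size=(n, 3))           # nuisance covariates
D = rng.binomial(1, 0.5, size=n)      # binary treatment (randomized here)
y = X @ np.array([1.0, -0.5, 0.2]) + 2.0 * D + rng.normal(size=n)

g0 = LinearRegression().fit(X[D == 0], y[D == 0])  # control-group model
g1 = LinearRegression().fit(X[D == 1], y[D == 1])  # treated-group model

# A naive (not yet doubly robust) effect estimate: the average
# difference of the two models' predictions over the full sample.
naive_ate = np.mean(g1.predict(X) - g0.predict(X))
```

Because treatment is randomized in this toy example, even the naive estimate lands near the true effect; the doubly robust machinery below matters when the groups are not comparable.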
We can also fit a propensity score model p^(x), which predicts the probability of treatment, P(D=1 | X). Separately, q^(x) denotes a model that predicts y from X without using the D variable. (This q^(x) is the model used for the residualization step in DML or Causal Forest.)
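A sketch of those two models under the same kind of simulated data (the logistic-regression choice and the data-generating process are illustrative assumptions, not the text's prescription):

```python
# Fit a propensity model p^(x) = P(D=1 | X) and an outcome model
# q^(x) that predicts y from X alone (no D), as used for the
# residualization step in DML / Causal Forest.
import numpy as np
from sklearn.linear_model import LogisticRegression, LinearRegression

rng = np.random.default_rng(1)
n = 1000
X = rng.normal(size=(n, 3))
# treatment depends on X, so the two groups are not directly comparable
D = rng.binomial(1, 1 / (1 + np.exp(-X[:, 0])))
y = X @ np.array([1.0, -0.5, 0.2]) + 2.0 * D + rng.normal(size=n)

p = LogisticRegression().fit(X, D)   # propensity model p^(x)
q = LinearRegression().fit(X, y)     # outcome model q^(x), no D

p_hat = p.predict_proba(X)[:, 1]     # estimated P(D=1 | X)
y_resid = y - q.predict(X)           # residualized outcome
d_resid = D - p_hat                  # residualized treatment
```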
Note that, in practice, you can use any machine learning method you want to obtain all of the estimates described here. However, to guard against overfitting, one rule when using these methods is to always perform cross-validation, so that each unit's nuisance estimates come from models that did not train on that unit. For detailed implementation and application methods, please refer to the link.
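The cross-validation rule above can be sketched with out-of-fold predictions (often called cross-fitting); the learner choices here are illustrative assumptions:

```python
# Out-of-fold nuisance estimates: each unit's prediction comes from
# a model fit on the folds that exclude that unit.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier, GradientBoostingRegressor
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(2)
n = 500
X = rng.normal(size=(n, 3))
D = rng.binomial(1, 1 / (1 + np.exp(-X[:, 0])))
y = X[:, 0] + 2.0 * D + rng.normal(size=n)

# out-of-fold propensity scores p^(x) and outcome predictions q^(x)
p_hat = cross_val_predict(GradientBoostingClassifier(), X, D,
                          cv=5, method="predict_proba")[:, 1]
q_hat = cross_val_predict(GradientBoostingRegressor(), X, y, cv=5)
```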
The reason it is called Doubly Robust is that the estimator remains correct if either set of nuisance models is accurate: even if the propensity score estimate p^(Xi) is inaccurate, as long as g^1(Xi) and g^0(Xi) are accurate, the expected value is the correct ATE estimate; conversely, even if g^1(Xi) and g^0(Xi) are inaccurate, as long as the propensity score estimate p^(Xi) is accurate, the expected value is likewise the correct ATE estimate.
The Doubly Robust Estimator is defined as follows:
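For reference, a standard AIPW form of this estimator, written in the notation above (a sketch consistent with the surrounding definitions, not necessarily typeset exactly as in the original display):

```latex
\widehat{\text{ATE}}
= \frac{1}{N}\sum_{i=1}^{N}
    \left[\hat g_1(X_i) + \frac{D_i\,\bigl(Y_i - \hat g_1(X_i)\bigr)}{\hat p(X_i)}\right]
- \frac{1}{N}\sum_{i=1}^{N}
    \left[\hat g_0(X_i) + \frac{(1-D_i)\,\bigl(Y_i - \hat g_0(X_i)\bigr)}{1-\hat p(X_i)}\right]
```

Each bracket is an outcome-model prediction plus an inverse-propensity-weighted correction using the observed residual.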
It can then be used as follows. Here we prove the result only for the ATE, but by conditioning on a given X, a similar argument shows that it can be used for the CATE as well.
4. Why is accurate ATE estimation possible if only one of the two is accurate?
Assumption 1: Propensity score p^(Xi) is inaccurate, but g^1(Xi) and g^0(Xi) are accurate
In (5), when g^1 and g^0 are accurate, the residual (Y − g^D(X)) has zero conditional mean, so every correction term drops out of the expectation and only the two terms below remain, confirming that it is an ATE estimator:
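Writing out that step for the treated term (a sketch assuming g^1 equals the true conditional mean E[Y | X, D=1]):

```latex
E\!\left[\hat g_1(X) + \frac{D\,\bigl(Y-\hat g_1(X)\bigr)}{\hat p(X)}\right]
= E\bigl[\hat g_1(X)\bigr]
= E\bigl[E[Y \mid X, D=1]\bigr]
```

The symmetric argument with weight (1−D)/(1−p^(X)) handles the control term, and under unconfoundedness the difference of the two expectations is the ATE. Note that the inaccurate p^(X) appears only inside terms whose numerator vanishes in expectation, which is why it does no harm.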
Assumption 2: g^1(Xi) and g^0(Xi) are inaccurate, but the propensity score p^(Xi) is accurate
In (18), when the propensity score is accurate, E[D − p^(X) | X] = 0, so only the following terms remain. What is left is exactly the Inverse Propensity Weighting estimator, and its correctness can be proven using the law of iterated expectations.
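A numerical check of this second case (an illustrative simulation, not the text's own example): deliberately break the outcome models by replacing them with a constant, keep the propensity model well specified, and the doubly robust estimate still recovers the true effect of 2.0 built into the data.

```python
# Doubly robust (AIPW) estimate with deliberately wrong outcome
# models but a correct propensity model.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(3)
n = 20000
X = rng.normal(size=(n, 3))
p_true = 1 / (1 + np.exp(-X[:, 0]))          # true propensity
D = rng.binomial(1, p_true)
y = X @ np.array([1.0, -0.5, 0.2]) + 2.0 * D + rng.normal(size=n)

# well-specified propensity model (logistic in X, like the truth)
p_hat = LogisticRegression().fit(X, D).predict_proba(X)[:, 1]

# deliberately inaccurate outcome models: predict a constant
g1_hat = np.full(n, y.mean())
g0_hat = np.full(n, y.mean())

dr_ate = np.mean(
    g1_hat + D * (y - g1_hat) / p_hat
    - (g0_hat + (1 - D) * (y - g0_hat) / (1 - p_hat))
)
```

Swapping the roles (accurate g^1, g^0 but a misspecified p^) would demonstrate the first case in the same way.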