The Rubin Causal Model

EC655

Justin Smith

Wilfrid Laurier University

Fall 2022

Introduction

Introduction

  • The workhorse model in econometrics is linear regression

  • Two key uses of this model

    • Prediction

    • Modelling (causal) effect of one variable on another

  • In many empirical applications the focus is on causal effects

    • The independent effect of a particular variable on the outcome
  • We will study linear regression with a focus on causality

  • First, we attempt to understand the underlying concept of causality

  • For this we use the Rubin Causal Model

    • Defines causality and under what conditions we can measure it

Model Basics

Potential Outcomes

  • Start with a binary framework

    • There is a “treatment” or “no treatment”

    • An individual theoretically has an outcome with treatment and without

  • Treatment is defined generally

    • Getting a drug

    • Going to university

    • Being in a large class

  • Define the following potential outcomes

    • \(y_{1}\) is the outcome with treatment

    • \(y_{0}\) is the outcome without treatment

    • \(w\) is a binary variable with 1 denoting treatment, and 0 no treatment

Treatment Effects

  • We would like to know the treatment effect \(y_{1} - y_{0}\) for an individual

    • This is the causal effect of the treatment

    • Effect differs from person to person in the population

  • Fundamental problem of causal inference: we never observe both \(y_{1}\) and \(y_{0}\)

  • We only observe \((y, w)\), where

    \[y = y_{0} + (y_{1} -y_{0})w\]

    • We observe the treatment status and the potential outcome corresponding to that status (see the sketch below)
  • The counterfactual outcome with opposite treatment is never observed
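
  • A minimal R sketch of this switching equation, using made-up numbers for three hypothetical individuals:

# Hypothetical potential outcomes for three people
y0 <- c(1, 2, 3)          # outcomes without treatment
y1 <- c(4, 5, 6)          # outcomes with treatment
w  <- c(0, 1, 0)          # treatment status
y  <- y0 + (y1 - y0) * w  # observed outcome: y1 if treated, else y0
y                         # 1 5 3; y1[1], y0[2], y1[3] are never observed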

Simple Differences in Average Outcomes

  • What if we naïvely compute difference in average outcomes between treated and control? \[E(y|w=1) - E(y|w=0)\]

  • Using the definition of \(y\) above,

    \[E(y|w=1) - E(y|w=0)\] \[= E(y_{1}|w=1) - E(y_{0}|w=0)\] \[= \left [ E(y_{1}|w=1) - E(y_{0}|w=1) \right ] + E(y_{0}|w=1) - E(y_{0}|w=0)\]

  • The first term is called the Average Treatment Effect on the Treated (ATT)

    • Average effect of the treatment for those in the treatment group

Simple Differences in Average Outcomes

  • The second term is Selection Bias

    • Baseline difference between treatment and control groups
  • Simple average differences will not identify a treatment effect

    • It is partly a treatment effect, partly differences in who gets treated
  • Ex: Comparing average incomes of university grads to high school grads

    • Will be partly average causal effect of university

    • Also difference in baseline earning ability without the degree

  • The lesson is that simple differences in averages do not reveal causal effects (a numeric example follows below)

  • Under what conditions can we measure the causal effect of \(w\) on \(y\)?
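
  • A hypothetical numeric illustration (numbers invented): suppose \(E(y_{1}|w=1) = 10\), \(E(y_{0}|w=1) = 7\), and \(E(y_{0}|w=0) = 5\)

    \[E(y|w=1) - E(y|w=0) = 10 - 5 = \left [ 10 - 7 \right ] + (7 - 5) = 3 + 2\]

    • The naïve difference of 5 mixes an ATT of 3 with selection bias of 2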

Randomizing Treatment Status

Randomization and Independence of Treatment

  • A common way to isolate treatment effects is to randomize \(w\)

    • Blindly put people into treatment or control group

    • Ensures that on average the two groups are similar at baseline

  • Mathematically, potential outcomes are independent of treatment \[(y_{0}, y_{1}) \perp w\]

  • Independence means conditioning on \(w\) has no effect on expectation

    \[E(y_{0}|w=1) =E(y_{0}|w=0)\] \[E(y_{0}|w) = E(y_{0})\] \[E(y_{1}|w) = E(y_{1})\]

Randomization and Treatment Effects

  • With randomization, selection bias is zero

    \[E(y_{0}|w=1) - E(y_{0}|w=0) = E(y_{0}|w=1) - E(y_{0}|w=1) = 0\]

  • As a result, the difference in mean \(y\) is \[E(y|w=1) - E(y|w=0)\] \[= E(y_{1}|w=1) - E(y_{0}|w=0)\] \[= E(y_{1}|w=1) - E(y_{0}|w=1)\] \[= E(y_{1}) - E(y_{0})\]

  • The second-to-last expression is the ATT we saw before

  • The last expression is the Average Treatment Effect (ATE)

    • The treatment effect across the whole population

Recent Example in Economics Literature

  • When we randomize treatment we can measure causal effects

  • Randomization is the standard way to measure the effects of medical treatments

  • It is becoming more popular in economics

  • Ex: Bangladesh mask study (Abaluck et al., 2021)

    • Randomized promoting mask use in rural Bangladesh

    • Compare COVID rates between treatment and control

    • Find a protective effect of masks, especially for those aged 50+

  • Next we will show how to model this in a regression framework

Causal Effects without Randomization

Mean Independence of Treatment

  • We cannot always randomize into treatment and control

  • Can we uncover causal effects without experiments?

  • The answer is yes, depending on assumptions

  • One possible assumption is Mean Independence

    \[E(y_{0}|w) = E(y_{0})\] \[E(y_{1}|w) = E(y_{1})\]

  • Says conditional means do not depend on treatment status

    • Weaker assumption than full statistical independence

    • Full independence means one event has no effect on probability of another

Mean Independence of Treatment

  • With mean independence, we get

    \[E(y|w=1) - E(y|w=0)\] \[= \left [ E(y_{1}|w=1) - E(y_{0}|w=1) \right ] + E(y_{0}|w=1) - E(y_{0}|w=0)\] \[= \left [ E(y_{1}|w=1) - E(y_{0}|w=1) \right ]\] \[= E(y_{1}) - E(y_{0})\]

  • This identifies ATT = ATE

  • Is this assumption realistic?

    • Means both potential outcomes unrelated to treatment

    • On average, people in treatment and control have similar treated and non-treated outcomes

    • Whether this is realistic depends on context

Mean Independence of \(y_{0}\)

  • A variation of this assumption is mean independence of \(\mathbf{y_{0}}\)

    \[E(y_{0}|w) = E(y_{0})\]

  • With this assumption, we get

    \[E(y|w=1) - E(y|w=0)\] \[= \left [ E(y_{1}|w=1) - E(y_{0}|w=1) \right ] + E(y_{0}|w=1) - E(y_{0}|w=0)\] \[= \left [ E(y_{1}|w=1) - E(y_{0}|w=1) \right ]\]

  • With this assumption, we only measure the ATT (Not ATE)

  • Is this realistic?

    • Means untreated outcome is same between groups on average

    • Puts no restriction on differences in treated outcome

    • Intuitively, there are no baseline differences between groups

Conditional Mean Independence

  • We can also use other variables to help with our assumptions

  • Suppose we observe a set of pre-treatment characteristics \(\mathbf{x}\)

    • Ex: gender, parental education, school test scores, etc.

    • Key is they are determined before treatment

  • With this information you could assume Conditional Independence

    \[(y_{0}, y_{1}) \perp w |\mathbf{x}\]

  • Conditional on \(\mathbf{x}\), treatment is independent of outcomes

  • Write this mathematically as \[E(y_{0}|w=1, \mathbf{x}) =E(y_{0}|w=0, \mathbf{x})\] \[E(y_{0}|w, \mathbf{x}) = E(y_{0}| \mathbf{x})\] \[E(y_{1}|w, \mathbf{x}) = E(y_{1}|\mathbf{x})\]

Conditional Mean Independence

  • This implies that we can get treatment effects at each \(\mathbf{x}\) \[E(y|w=1, \mathbf{x}) - E(y|w=0, \mathbf{x})\] \[= E(y_{1}|w=1, \mathbf{x}) - E(y_{0}|w=1, \mathbf{x})= E(y_{1} | \mathbf{x}) - E(y_{0}| \mathbf{x})\] \[= ATT( \mathbf{x}) =ATE( \mathbf{x})\]

  • These treatment effects are functions of \(\mathbf{x}\)

    • They will differ across values of \(\mathbf{x}\)

    • So there are multiple treatment effects

  • Finally, a variation on this is Conditional Mean Independence

    \[E(y_{0}|w, \mathbf{x}) = E(y_{0}| \mathbf{x})\] \[E(y_{1}|w, \mathbf{x}) = E(y_{1}|\mathbf{x})\]

  • Gives you the same \(ATE( \mathbf{x}) = ATT( \mathbf{x})\) as above

Summary of Rubin Model

Summary of Rubin Model

  • The Rubin model defines what a causal effect is

  • Roughly speaking, it is an Average Treatment Effect

    • Difference in potential outcomes, on average in population

    • Depending on context, it might be an Average Treatment Effect for the Treated

  • We can express the Rubin model in a regression framework

  • The slope in a linear regression is the causal effect if we can assume one of

    • Randomization of treatment

    • “As good as” randomization

      • Mean Independence, Conditional Independence, Conditional Mean Independence

Summary of Rubin Model

  • When our regression model identifies an underlying causal effect, we call it a structural model

  • In many econometric applications, this is what we want

  • Next, we discuss linear regression in more detail

  • First we discuss the population model

    • We will define the parameters we are measuring

    • Some of this might be new

  • Then we discuss estimation by OLS

    • Focus is on when OLS consistently estimates the parameters

Simulation

Data Setup

  • To help understand the Rubin model we will demonstrate with simulated data

  • The code below creates the potential outcomes

  • For simplicity the treatment effect is set to 5 for everyone

  • Outcomes \(y_{0}\) and \(y_{1}\) have a Normal distribution because of \(\eta\)

# dplyr/magrittr for data manipulation, vtable for sumtable(),
# ggplot2 + ggthemes for the density plots
library(dplyr); library(magrittr); library(vtable)
library(ggplot2); library(ggthemes)

data <- data.frame(eta=rnorm(100000,0,1)) %>%
  mutate(y0 = 2 + eta, y1 = y0 + 5,  # constant treatment effect of 5
         treat_eff = y1 - y0)

sumtable(data, summ=c('notNA(x)','mean(x)','sd(x)'))
Summary Statistics
Variable NotNA Mean Sd
eta 1e+05 0.004 1.004
y0 1e+05 2.004 1.004
y1 1e+05 7.004 1.004
treat_eff 1e+05 5 0

Random Assignment to Treatment

  • Next assign treatment \(w\) using randomization

  • In the code, \(w=1\) is assigned randomly with probability 0.5

  • Compute observed \(y\) based on treatment status

data %<>% mutate(w = if_else(runif(100000) > .5,1,0), 
                 y = y0 + (y1-y0)*w) %>% 
  group_by(w)

head(data)
# A tibble: 6 × 6
# Groups:   w [2]
      eta    y0    y1 treat_eff     w     y
    <dbl> <dbl> <dbl>     <dbl> <dbl> <dbl>
1 -0.471   1.53  6.53         5     0  1.53
2 -0.253   1.75  6.75         5     0  1.75
3  1.17    3.17  8.17         5     1  8.17
4  1.28    3.28  8.28         5     1  8.28
5 -0.0665  1.93  6.93         5     1  6.93
6 -0.873   1.13  6.13         5     1  6.13

Random Assignment to Treatment

  • With random assignment we know

    • \(y_{0}\) is independent of \(w\)

    • \(y_{1}\) is independent of \(w\)

  • So the distributions of \(y_{0}\) and \(y_{1}\) are the same when \(w=0\) and when \(w=1\)

  • Below we plot the distribution of \(y_{0}\)

ggplot(data, aes(x=y0, color=as.factor(w))) +
  geom_density(alpha = .4, size=2) +
  theme_pander(nomargin=FALSE, boxes=TRUE) +
  labs(title = "Distribution of Y0")
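
  • As an extra check (not in the original code), a two-sample Kolmogorov-Smirnov test from base R should fail to reject that the two \(y_{0}\) distributions are equal:

# Under random assignment the y0 distributions coincide across groups,
# so expect a large p-value here
ks.test(data$y0[data$w == 1], data$y0[data$w == 0])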

Random Assignment to Treatment

  • Randomization ensures difference in average \(y\) between groups equals the ATE and ATT

  • Below we show that the difference in mean \(y\) equals 5

    • (it’s not exactly 5 due to sampling error)
summarize(data, mean(y))
# A tibble: 2 × 2
      w `mean(y)`
  <dbl>     <dbl>
1     0      2.00
2     1      7.01
summarize(data,mean(y))$`mean(y)`[2] - 
  summarize(data,mean(y))$`mean(y)`[1]
[1] 5.003349

Random Assignment to Treatment

  • Can implement difference in means as a regression

  • Recall slope in OLS regression of \(y\) on dummy variable is difference in means of \(y\)

lm(y ~ w, data)

Call:
lm(formula = y ~ w, data = data)

Coefficients:
(Intercept)            w  
      2.002        5.003  
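
  • Equivalently, a sketch using base R's t.test, which reports the group means and a confidence interval for their difference:

# t.test reports mean(w=0) minus mean(w=1), so expect about -5
t.test(y ~ w, data = data)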

Selection into Treatment

  • Now simulate selection into treatment based on \(y_{0}\)

    • Treatment now related to value of \(y_{0}\)
  • We know \(\eta\) determines value of \(y_{0}\)

    • If we make \(w=1\) more likely at higher values of \(\eta\), then \(w\) is related to \(y_{0}\)
data2 <- data %>% 
  ungroup() %>% 
  select(eta, y0,y1) %>%
  mutate(w = if_else(eta + runif(100000,-1,1) > 0,1,0), 
         y = y0 + (y1-y0)*w) %>%
  group_by(w)

sumtable(data2, 
         summ=c('notNA(x)','mean(x)','sd(x)'), 
         group="w",
         group.long = TRUE)

Selection into Treatment

  • The means of \(y_{0}\) and \(y_{1}\) are now different by group

  • Because of selection bias

    • Treated group has better non-treated outcomes
Summary Statistics
Variable NotNA Mean Sd
w: 0
eta 49794 -0.684 0.733
y0 49794 1.316 0.733
y1 49794 6.316 0.733
y 49794 1.316 0.733
w: 1
eta 50206 0.686 0.734
y0 50206 2.686 0.734
y1 50206 7.686 0.734
y 50206 7.686 0.734

Selection into Treatment

  • The distribution of \(y_{0}\) differs by \(w\)

    • Treated group has better baseline outcomes
  • This creates selection bias

ggplot(data2, aes(x=y0, color=as.factor(w))) +
  geom_density(alpha = .4, size=2) +
  theme_pander(nomargin=FALSE, boxes=TRUE) +
  labs(title = "Distribution of Y0")

Selection into Treatment

  • Selection bias shows up when you take the difference in mean \(y\)

    • We know the true treatment effect is 5

    • But difference in \(y\) is larger

    • There is positive selection bias

    • Bias is about 1.4

summarize(data2, mean(y))
# A tibble: 2 × 2
      w `mean(y)`
  <dbl>     <dbl>
1     0      1.32
2     1      7.69
summarize(data2,mean(y))$`mean(y)`[2] - 
  summarize(data2,mean(y))$`mean(y)`[1]
[1] 6.370304
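
  • The selection bias term can be computed directly from the simulated potential outcomes:

# Baseline gap E(y0|w=1) - E(y0|w=0): about 1.37,
# exactly the gap between 6.37 and the true effect of 5
mean(data2$y0[data2$w == 1]) - mean(data2$y0[data2$w == 0])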

Selection into Treatment

  • You can implement this as a regression

  • OLS estimates biased treatment effect

  • Remember the intercept is mean of \(y\) when \(w=0\)

lm(y ~ w, data2)

Call:
lm(formula = y ~ w, data = data2)

Coefficients:
(Intercept)            w  
      1.316        6.370  

Mean Independence of \(y_{0}\)

  • Randomization ensures entire distribution of \(y_{0}\) is the same for treatment and control

  • We do not need this to estimate average treatment effects

  • If the mean of \(y_{0}\) is the same between treatment and control we can estimate treatment effect with difference in means

  • Code makes mean of \(y_{0}\) the same, but variance bigger for \(w=1\)

data3 <- data %>% 
  ungroup() %>% 
  select(eta, y0,y1) %>%
  mutate(w =  if_else(between(percent_rank(y0),.25,.75),0,1), 
         y = y0 + (y1-y0)*w) %>%
  group_by(w)

sumtable(data3, 
         summ=c('notNA(x)','mean(x)','sd(x)'), 
         group="w",
         group.long = TRUE)

Mean Independence of \(y_{0}\)


Summary Statistics
Variable NotNA Mean Sd
w: 0
eta 50000 0.004 0.378
y0 50000 2.004 0.378
y1 50000 7.004 0.378
y 50000 2.004 0.378
w: 1
eta 50000 0.003 1.369
y0 50000 2.003 1.369
y1 50000 7.003 1.369
y 50000 7.003 1.369

Mean Independence of \(y_{0}\)

  • The distribution of \(y_{0}\) is plotted below

  • The spread is larger for \(w=1\)

  • This does not affect estimate of the average treatment effect

ggplot(data3, aes(x=y0, color=as.factor(w))) +
  geom_density(alpha = .4, size=2) +
  theme_pander(nomargin=FALSE, boxes=TRUE) +
  labs(title = "Distribution of Y0")
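
  • One way to see mean independence without full independence (an illustrative check using base R's tests):

# A location test finds no difference in mean y0 across w...
t.test(y0 ~ w, data = data3)
# ...but a distribution test rejects equality, since the spreads differ
ks.test(data3$y0[data3$w == 1], data3$y0[data3$w == 0])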

Mean Independence of \(y_{0}\)

  • Take difference in mean \(y\)

    • This equals the treatment effect

    • Difference in variance did not create bias

summarize(data3, mean(y))
# A tibble: 2 × 2
      w `mean(y)`
  <dbl>     <dbl>
1     0      2.00
2     1      7.00
summarize(data3,mean(y))$`mean(y)`[2] - 
  summarize(data3,mean(y))$`mean(y)`[1]
[1] 4.999

Mean Independence of \(y_{0}\)

  • Running regression produces same result

  • The variance in \(y_{1}\) affects the standard error

    • But we are not concerned with that right now
lm(y ~ w, data3)

Call:
lm(formula = y ~ w, data = data3)

Coefficients:
(Intercept)            w  
      2.004        4.999  

Conditional Mean Independence

  • Finally consider conditional mean independence

  • Treatment is related to \(y_{0}\), but only through \(x\)

    • For people with the same \(x\), \(y_{0}\) is unrelated to \(w\)
  • Ex: Education and wages

    • Smart people \((x = 1)\) earn higher wages regardless of schooling \((y_0)\)

    • Smart people are more likely to go to university \((w = 1)\)

    • People at university will have higher \(y_0\)

data4 <- data %>% 
  ungroup() %>% 
  select(eta) %>%
  mutate(x = if_else(runif(100000) > .5,1,0),
         w = if_else(x + runif(100000, -1,1) > .5,1,0),
         y0 = 2 + 3*x + eta,
         y1 = y0 + 5,
         y = y0 + (y1-y0)*w) %>%
  group_by(w)

sumtable(data4, 
         summ=c('notNA(x)','mean(x)','sd(x)'), 
         group="w",
         group.long = TRUE)

Conditional Mean Independence

  • Comparing treatment and control, \(y_{0}\) is bigger when \(w=1\)

  • This is because

    • \(y_{0}\) is bigger when \(x=1\)

    • \(w\) more likely to be \(1\) when \(x=1\)

Summary Statistics
Variable NotNA Mean Sd
w: 0
eta 49882 0.011 1.005
x 49882 0.25 0.433
y0 49882 2.763 1.646
y1 49882 7.763 1.646
y 49882 2.763 1.646
w: 1
eta 50118 -0.003 1.003
x 50118 0.752 0.432
y0 50118 4.253 1.636
y1 50118 9.253 1.636
y 50118 9.253 1.636

Conditional Mean Independence

  • What if we focus only on people with \(x=1\)?

  • No difference in \(y_{0}\) between treated and control

    • Because \(x\) is only reason why they differed

    • This is holding \(x\) fixed

sumtable(filter(data4, x==1), 
         summ=c('notNA(x)','mean(x)','sd(x)'), 
         group="w")
Summary Statistics, by w
                    w = 0                     w = 1
Variable   NotNA    Mean    Sd       NotNA    Mean    Sd
eta        12493    0.017   1        37694    -0.005  1.004
x          12493    1       0        37694    1       0
y0         12493    5.017   1        37694    4.995   1.004
y1         12493    10.017  1        37694    9.995   1.004
y          12493    5.017   1        37694    9.995   1.004

Conditional Mean Independence

  • We get the same result if we hold \(x=0\)

    • Again because \(x\) is only reason why they differed
sumtable(filter(data4, x==0), 
         summ=c('notNA(x)','mean(x)','sd(x)'), 
         group="w")
Summary Statistics, by w
                    w = 0                     w = 1
Variable   NotNA    Mean    Sd       NotNA    Mean    Sd
eta        37389    0.009   1.007    12424    0.001   0.999
x          37389    0       0        12424    0       0
y0         37389    2.009   1.007    12424    2.001   0.999
y1         37389    7.009   1.007    12424    7.001   0.999
y          37389    2.009   1.007    12424    7.001   0.999
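
  • The two stratified comparisons can also be done in one pipeline (a sketch assuming tidyr is available):

# Mean y by (x, w); within each x stratum the treated-control
# gap should be about 5, the true effect
data4 %>%
  group_by(x, w) %>%
  summarize(mean_y = mean(y), .groups = "drop") %>%
  tidyr::pivot_wider(names_from = w, values_from = mean_y,
                     names_prefix = "w") %>%
  mutate(diff = w1 - w0)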

Conditional Mean Independence

  • Regression of \(y\) on \(w\) is biased

    • Because \(w\) is correlated with error
  • But regression of \(y\) on \(w\) and \(x\) recovers the actual treatment effect

  • This is conditional mean independence

    • Holding \(x\) fixed, potential outcomes no longer related to treatment
lm(y ~ w, data4)

Call:
lm(formula = y ~ w, data = data4)

Coefficients:
(Intercept)            w  
      2.763        6.490  
lm(y ~ w + x, data4)

Call:
lm(formula = y ~ w + x, data = data4)

Coefficients:
(Intercept)            w            x  
      2.011        4.985        3.001
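
  • As a final check: since the effect is 5 by construction, the short regression's bias, \(6.490 - 5 = 1.490\), should equal the baseline gap in \(y_{0}\)

# E(y0|w=1) - E(y0|w=0): about 1.49, matching the short regression's bias
mean(data4$y0[data4$w == 1]) - mean(data4$y0[data4$w == 0])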