# The Rubin Causal Model

EC655

Wilfrid Laurier University

Fall 2022

# Introduction

## Introduction

• The workhorse model in econometrics is linear regression

• Two key uses of this model

• Prediction

• Modelling (causal) effect of one variable on another

• In many empirical applications the focus is on causal effects

• The independent effect of a particular variable on the outcome
• We will study linear regression with a focus on causality

• First, we attempt to understand the underlying concept of causality

• For this we use the Rubin Causal Model

• Defines causality and under what conditions we can measure it

# Model Basics

## Potential Outcomes

• There is a “treatment” or “no treatment”

• An individual theoretically has an outcome with treatment and without

• Treatment is defined generally

• Getting a drug

• Going to university

• Being in a large class

• Define the following potential outcomes

• $y_{1}$ is the outcome with treatment

• $y_{0}$ is the outcome without treatment

• $w$ is a binary variable with 1 denoting treatment, and 0 no treatment

## Treatment Effects

• We would like to know the treatment effect $y_{1} - y_{0}$ for an individual

• This is the causal effect of the treatment

• Effect differs from person to person in the population

• Fundamental problem of causal inference: we never observe both $y_{1}$ and $y_{0}$

• We only observe $(y, w)$, where

$y = y_{0} + (y_{1} -y_{0})w$

• We observe treatment status, potential outcome given that treatment status
• The counterfactual outcome with opposite treatment is never observed
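• Plugging each value of $w$ into this equation confirms we observe exactly one potential outcome:

$w=1: \; y = y_{0} + (y_{1} - y_{0})(1) = y_{1} \qquad w=0: \; y = y_{0} + (y_{1} - y_{0})(0) = y_{0}$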

## Simple Differences in Average Outcomes

• What if we naïvely compute difference in average outcomes between treated and control? $E(y|w=1) - E(y|w=0)$

• Using the definition of $y$ above,

$E(y|w=1) - E(y|w=0) = E(y_{1}|w=1) - E(y_{0}|w=0)$

$= \left[ E(y_{1}|w=1) - E(y_{0}|w=1) \right] + \left[ E(y_{0}|w=1) - E(y_{0}|w=0) \right]$

• The second line adds and subtracts $E(y_{0}|w=1)$

• The first term is called the Average Treatment Effect on the Treated (ATT)

• Average effect of the treatment for those in the treatment group

## Simple Differences in Average Outcomes

• The second term is Selection Bias

• Baseline difference between treatment and control groups
• Simple average differences will not identify a treatment effect

• It is partly a treatment effect, partly differences in who gets treated
• Ex: Comparing average incomes of university grads to high school grads

• Will be partly average causal effect of university

• Also difference in baseline earning ability without the degree

• The lesson is that simple differences in averages do not reveal causal effects

• Under what conditions can we measure the causal effect of $w$ on $y$?
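• A hypothetical illustration with made-up numbers (earnings in $1000s): suppose $E(y_{1}|w=1) = 60$, $E(y_{0}|w=1) = 45$, and $E(y_{0}|w=0) = 40$

$E(y|w=1) - E(y|w=0) = \underbrace{(60 - 45)}_{ATT \,=\, 15} + \underbrace{(45 - 40)}_{\text{selection bias} \,=\, 5} = 20$

• The raw gap of 20 overstates the ATT of 15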

# Randomizing Treatment Status

## Randomization and Independence of Treatment

• A common way to isolate treatment effects is to randomize $w$

• Blindly put people into treatment or control group

• Ensures that on average the two groups are similar at baseline

• Mathematically, potential outcomes are independent of treatment: $(y_{0}, y_{1}) \perp w$

• Independence means conditioning on $w$ has no effect on expectation

$E(y_{0}|w=1) =E(y_{0}|w=0)$ $E(y_{0}|w) = E(y_{0})$ $E(y_{1}|w) = E(y_{1})$

## Randomization and Treatment Effects

• With randomization, selection bias is zero

$E(y_{0}|w=1) - E(y_{0}|w=0) = E(y_{0}|w=1) - E(y_{0}|w=1) = 0$

• As a result the difference in mean $y$ identifies the treatment effect

$E(y|w=1) - E(y|w=0) = E(y_{1}|w=1) - E(y_{0}|w=0) = E(y_{1}) - E(y_{0})$

• The middle expression equals the ATT from before, since selection bias is zero

• The final expression is the Average Treatment Effect (ATE)

• The average effect of treatment across the whole population
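• A minimal base-R sketch of this claim, assuming a constant effect of 5 and 50/50 random assignment (the Simulation section later does the same exercise with tidyverse tools):

```r
set.seed(1)                          # reproducibility
n  <- 100000
y0 <- 2 + rnorm(n)                   # untreated potential outcome
y1 <- y0 + 5                         # treated outcome: constant effect of 5
w  <- rbinom(n, 1, 0.5)              # randomized treatment assignment
y  <- y0 + (y1 - y0) * w             # observed outcome

# Selection bias term: difference in mean y0 across groups (should be near 0)
mean(y0[w == 1]) - mean(y0[w == 0])
# Difference in mean y: should be close to the true effect of 5
mean(y[w == 1]) - mean(y[w == 0])
```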

## Recent Example in Economics Literature

• When we randomize treatment we can measure causal effects

• Randomization is the standard way to measure the effects of medical treatments

• It is becoming more popular in economics

• Ex: a randomized masking study compared COVID rates between treatment and control groups

• It found some positive effect of masks, especially for those aged 50+

• Next we will show how to model this in a regression framework

# Causal Effects without Randomization

## Mean Independence of Treatment

• We cannot always randomize into treatment and control

• Can we uncover causal effects without experiments?

• The answer is yes, depending on assumptions

• One possible assumption is Mean Independence

$E(y_{0}|w) = E(y_{0})$

$E(y_{1}|w) = E(y_{1})$

• Says conditional means do not depend on treatment status

• Weaker assumption than full statistical independence

• Full independence means one event has no effect on probability of another

## Mean Independence of Treatment

• With mean independence, we get

$E(y|w=1) - E(y|w=0) = \left[ E(y_{1}|w=1) - E(y_{0}|w=1) \right] + \left[ E(y_{0}|w=1) - E(y_{0}|w=0) \right]$

$= E(y_{1}|w=1) - E(y_{0}|w=1)$

$= E(y_{1}) - E(y_{0})$

• The selection-bias term drops out because $E(y_{0}|w=1) = E(y_{0}|w=0)$, and mean independence turns the conditional means into unconditional ones

• This identifies ATT = ATE

• Is this assumption realistic?

• Means both potential outcomes unrelated to treatment

• On average, people in treatment and control have similar treated and non-treated outcomes

• Whether this is realistic depends on context

## Mean Independence of $y_{0}$

• A variation of this assumption is mean independence of $\mathbf{y_{0}}$

$E(y_{0}|w) = E(y_{0})$

• With this assumption,

$E(y|w=1) - E(y|w=0) = \left[ E(y_{1}|w=1) - E(y_{0}|w=1) \right] + \left[ E(y_{0}|w=1) - E(y_{0}|w=0) \right]$

$= E(y_{1}|w=1) - E(y_{0}|w=1)$

• With this assumption, we only measure the ATT (Not ATE)

• Is this realistic?

• Means untreated outcome is same between groups on average

• Puts no restriction on differences in treated outcome

• Intuitively, there are no baseline differences between groups

## Conditional Mean Independence

• We can also use other variables to help with our assumptions

• Suppose we observe a set of pre-treatment characteristics $\mathbf{x}$

• Ex: gender, parental education, school test scores, etc.

• Key is they are determined before treatment

• With this information you could assume Conditional Independence

$(y_{0}, y_{1}) \perp w |\mathbf{x}$

• Conditional on $\mathbf{x}$, treatment is independent of outcomes

• Write this mathematically as

$E(y_{0}|w=1, \mathbf{x}) = E(y_{0}|w=0, \mathbf{x})$

$E(y_{0}|w, \mathbf{x}) = E(y_{0}|\mathbf{x})$

$E(y_{1}|w, \mathbf{x}) = E(y_{1}|\mathbf{x})$

## Conditional Mean Independence

• This implies that we can get treatment effects at each $\mathbf{x}$

$E(y|w=1, \mathbf{x}) - E(y|w=0, \mathbf{x}) = E(y_{1}|w=1, \mathbf{x}) - E(y_{0}|w=1, \mathbf{x})$

$= E(y_{1}|\mathbf{x}) - E(y_{0}|\mathbf{x}) = ATT(\mathbf{x}) = ATE(\mathbf{x})$

• These treatment effects are functions of $\mathbf{x}$

• They will differ across values of $\mathbf{x}$

• So there are multiple treatment effects

• Finally, a variation on this is Conditional Mean Independence

$E(y_{0}|w, \mathbf{x}) = E(y_{0}| \mathbf{x})$ $E(y_{1}|w, \mathbf{x}) = E(y_{1}|\mathbf{x})$

• Gives you the same $ATE( \mathbf{x}) = ATT( \mathbf{x})$ as above
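• If an overall effect is wanted, it follows by averaging these conditional effects over the distribution of $\mathbf{x}$ (iterated expectations):

$ATE = E_{\mathbf{x}}\left[ ATE(\mathbf{x}) \right] = E_{\mathbf{x}}\left[ E(y_{1}|\mathbf{x}) - E(y_{0}|\mathbf{x}) \right] = E(y_{1}) - E(y_{0})$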

# Summary of Rubin Model

## Summary of Rubin Model

• The Rubin model defines what a causal effect is

• Roughly speaking, it is an Average Treatment Effect

• The difference in potential outcomes, averaged over the population

• Depending on context, it might be an Average Treatment Effect for the Treated

• We can express the Rubin model in a regression framework

• The slope in a linear regression is the causal effect if we can assume one of

• Randomization of treatment

• “As good as” randomization

• Mean Independence, Conditional Independence, Conditional Mean Independence

## Summary of Rubin Model

• When our regression model identifies an underlying causal effect, we call it a structural model

• In many econometric applications, this is what we want

• Next, we discuss in more detail linear regression

• First we discuss the population model

• We will define the parameters we are measuring

• Some of this might be new

• Then we discuss estimation by OLS

• Focus is on when OLS consistently estimates the parameters

# Simulation

## Data Setup

• To help understand the Rubin model we will demonstrate with simulated data

• Code to the right creates potential outcomes

• For simplicity the treatment effect is set to 5 for everyone

• Outcomes $y_{0}$ and $y_{1}$ have a Normal distribution because of $\eta$

library(dplyr)    # %>%, mutate, if_else, group_by, summarize
library(magrittr) # %<>%
library(vtable)   # sumtable
library(ggplot2)  # plots
library(ggthemes) # theme_pander

data <- data.frame(eta = rnorm(100000, 0, 1)) %>%
  mutate(y0 = 2 + eta,
         y1 = y0 + 5,
         treat_eff = y1 - y0)

sumtable(data, summ = c('notNA(x)', 'mean(x)', 'sd(x)'))

Summary Statistics
Variable    NotNA    Mean    Sd
eta         100000   0.004   1.004
y0          100000   2.004   1.004
y1          100000   7.004   1.004
treat_eff   100000   5       0

## Random Assignment to Treatment

• Next assign treatment $w$ using randomization

• In the code, $w=1$ randomly with probability 0.5

• Compute observed $y$ based on treatment status

data %<>%
  mutate(w = if_else(runif(100000) > .5, 1, 0),
         y = y0 + (y1 - y0) * w) %>%
  group_by(w)

head(data)
# A tibble: 6 × 6
# Groups:   w [2]
eta    y0    y1 treat_eff     w     y
<dbl> <dbl> <dbl>     <dbl> <dbl> <dbl>
1 -0.471   1.53  6.53         5     0  1.53
2 -0.253   1.75  6.75         5     0  1.75
3  1.17    3.17  8.17         5     1  8.17
4  1.28    3.28  8.28         5     1  8.28
5 -0.0665  1.93  6.93         5     1  6.93
6 -0.873   1.13  6.13         5     1  6.13

## Random Assignment to Treatment

• With random assignment we know

• $y_{0}$ is independent of $w$

• $y_{1}$ is independent of $w$

• So the distributions of $y_{0}$ and $y_{1}$ are the same when $w=0$ and when $w=1$

• To the right we show the distribution of $y_{0}$

ggplot(data, aes(x=y0, color=as.factor(w))) +
geom_density(alpha = .4, size=2) +
theme_pander(nomargin=FALSE, boxes=TRUE) +
labs(title = "Distribution of Y0")

## Random Assignment to Treatment

• Randomization ensures difference in average $y$ between groups equals the ATE and ATT

• On the right we show the difference in mean of $y$ equals 5

• (it’s not exactly 5 due to sampling error)

summarize(data, mean(y))
# A tibble: 2 × 2
      w `mean(y)`
  <dbl>     <dbl>
1     0      2.00
2     1      7.01
summarize(data, mean(y))$`mean(y)`[2] - summarize(data, mean(y))$`mean(y)`[1]
[1] 5.003349

## Random Assignment to Treatment

• Can implement difference in means as a regression

• Recall slope in OLS regression of $y$ on dummy variable is difference in means of $y$
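• To see why, take conditional expectations in the model $y = \beta_{0} + \beta_{1}w + u$ with $E(u|w) = 0$:

$E(y|w=0) = \beta_{0} \qquad E(y|w=1) = \beta_{0} + \beta_{1}$

$\Rightarrow \beta_{1} = E(y|w=1) - E(y|w=0)$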

lm(y ~ w, data)

Call:
lm(formula = y ~ w, data = data)

Coefficients:
(Intercept)            w
2.002        5.003  

## Selection into Treatment

• Now simulate selection into treatment based on $y_{0}$

• Treatment now related to value of $y_{0}$
• We know $\eta$ determines value of $y_{0}$

• If we make $w=1$ more likely at higher values of $\eta$, then $w$ is related to $y_{0}$
data2 <- data %>%
  ungroup() %>%
  select(eta, y0, y1) %>%
  mutate(w = if_else(eta + runif(100000, -1, 1) > 0, 1, 0),
         y = y0 + (y1 - y0) * w) %>%
  group_by(w)

sumtable(data2,
summ=c('notNA(x)','mean(x)','sd(x)'),
group="w",
group.long = TRUE)

## Selection into Treatment

• The means of $y_{0}$ and $y_{1}$ are now different by group

• Because of selection bias

• Treated group has better non-treated outcomes
Summary Statistics
Variable   NotNA    Mean     Sd
w: 0
eta        49794    -0.684   0.733
y0         49794     1.316   0.733
y1         49794     6.316   0.733
y          49794     1.316   0.733
w: 1
eta        50206     0.686   0.734
y0         50206     2.686   0.734
y1         50206     7.686   0.734
y          50206     7.686   0.734
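
• These group means match the earlier decomposition: the raw gap in $y$ is the ATT plus selection bias,

$E(y|w=1) - E(y|w=0) = \underbrace{5}_{ATT} + \underbrace{(2.686 - 1.316)}_{\text{selection bias}} = 6.37$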

## Selection into Treatment

• The distribution of $y_{0}$ differs by $w$

• Treated group has better baseline outcomes
• This creates selection bias

ggplot(data2, aes(x=y0, color=as.factor(w))) +
geom_density(alpha = .4, size=2) +
theme_pander(nomargin=FALSE, boxes=TRUE) +
labs(title = "Distribution of Y0")

## Selection into Treatment

• Selection bias shows up when you take the difference in mean $y$

• We know the true treatment effect is 5

• But difference in $y$ is larger

• There is positive selection bias

summarize(data2, mean(y))
# A tibble: 2 × 2
      w `mean(y)`
  <dbl>     <dbl>
1     0      1.32
2     1      7.69
summarize(data2, mean(y))$`mean(y)`[2] - summarize(data2, mean(y))$`mean(y)`[1]
[1] 6.370304

## Selection into Treatment

• You can implement this as a regression

• OLS estimates biased treatment effect

• Remember the intercept is mean of $y$ when $w=0$

lm(y ~ w, data2)

Call:
lm(formula = y ~ w, data = data2)

Coefficients:
(Intercept)            w
1.316        6.370  

## Mean Independence of $y_{0}$

• Randomization ensures entire distribution of $y_{0}$ is the same for treatment and control

• We do not need this to estimate average treatment effects

• If the mean of $y_{0}$ is the same between treatment and control we can estimate treatment effect with difference in means

• Code makes mean of $y_{0}$ the same, but variance bigger for $w=1$

data3 <- data %>%
  ungroup() %>%
  select(eta, y0, y1) %>%
  mutate(w = if_else(between(percent_rank(y0), .25, .75), 0, 1),
         y = y0 + (y1 - y0) * w) %>%
  group_by(w)

sumtable(data3,
summ=c('notNA(x)','mean(x)','sd(x)'),
group="w",
group.long = TRUE)

## Mean Independence of $y_{0}$

• The summary statistics confirm this setup

• The mean of $y_{0}$ is (nearly) identical across groups

• The standard deviation of $y_{0}$ is much larger for $w=1$

Summary Statistics
Variable   NotNA   Mean    Sd
w: 0
eta        50000   0.004   0.378
y0         50000   2.004   0.378
y1         50000   7.004   0.378
y          50000   2.004   0.378
w: 1
eta        50000   0.003   1.369
y0         50000   2.003   1.369
y1         50000   7.003   1.369
y          50000   7.003   1.369

## Mean Independence of $y_{0}$

• The distribution of $y_{0}$ is plotted on the right

• The spread is larger for $w=1$

• This does not affect estimate of the average treatment effect

ggplot(data3, aes(x=y0, color=as.factor(w))) +
geom_density(alpha = .4, size=2) +
theme_pander(nomargin=FALSE, boxes=TRUE) +
labs(title = "Distribution of Y0")

## Mean Independence of $y_{0}$

• Take difference in mean $y$

• This equals the treatment effect

• Difference in variance did not create bias

summarize(data3, mean(y))
# A tibble: 2 × 2
      w `mean(y)`
  <dbl>     <dbl>
1     0      2.00
2     1      7.00
summarize(data3, mean(y))$`mean(y)`[2] - summarize(data3, mean(y))$`mean(y)`[1]
[1] 4.999

## Mean Independence of $y_{0}$

• Running regression produces same result

• The variance in $y_{1}$ affects the standard error

• But we are not concerned with that right now
lm(y ~ w, data3)

Call:
lm(formula = y ~ w, data = data3)

Coefficients:
(Intercept)            w
2.004        4.999  

## Conditional Mean Independence

• Finally consider conditional mean independence

• Treatment is related to $y_{0}$, but only through $x$

• For people with the same $x$, $y_{0}$ is unrelated to $w$
• Ex: Education and wages

• Smart people $(x = 1)$ earn higher wages regardless of schooling $(y_0)$

• Smart people are more likely to go to university $(w = 1)$

• People at university will have higher $y_0$

data4 <- data %>%
  ungroup() %>%
  select(eta) %>%
  mutate(x = if_else(runif(100000) > .5, 1, 0),
         w = if_else(x + runif(100000, -1, 1) > .5, 1, 0),
         y0 = 2 + 3*x + eta,
         y1 = y0 + 5,
         y = y0 + (y1 - y0) * w) %>%
  group_by(w)

sumtable(data4,
summ=c('notNA(x)','mean(x)','sd(x)'),
group="w",
group.long = TRUE)

## Conditional Mean Independence

• Comparing treatment and control, $y_{0}$ is bigger when $w=1$

• This is because

• $y_{0}$ is bigger when $x=1$

• $w$ more likely to be $1$ when $x=1$

Summary Statistics
Variable   NotNA    Mean     Sd
w: 0
eta        49882     0.011   1.005
x          49882     0.250   0.433
y0         49882     2.763   1.646
y1         49882     7.763   1.646
y          49882     2.763   1.646
w: 1
eta        50118    -0.003   1.003
x          50118     0.752   0.432
y0         50118     4.253   1.636
y1         50118     9.253   1.636
y          50118     9.253   1.636

## Conditional Mean Independence

• What if we focus only on people with $x=1$?

• No difference in $y_{0}$ between treated and control

• Because $x$ is only reason why they differed

• This is holding $x$ fixed

sumtable(filter(data4, x == 1),
         summ = c('notNA(x)', 'mean(x)', 'sd(x)'),
         group = "w")

Summary Statistics
                     w = 0                      w = 1
Variable   NotNA    Mean     Sd      NotNA    Mean     Sd
eta        12493    0.017    1       37694    -0.005   1.004
x          12493    1        0       37694     1       0
y0         12493    5.017    1       37694     4.995   1.004
y1         12493    10.017   1       37694     9.995   1.004
y          12493    5.017    1       37694     9.995   1.004

## Conditional Mean Independence

• We get the same result if we hold $x=0$

• Again because $x$ is only reason why they differed
sumtable(filter(data4, x == 0),
         summ = c('notNA(x)', 'mean(x)', 'sd(x)'),
         group = "w")

Summary Statistics
                     w = 0                      w = 1
Variable   NotNA    Mean     Sd      NotNA    Mean     Sd
eta        37389    0.009    1.007   12424    0.001    0.999
x          37389    0        0       12424    0        0
y0         37389    2.009    1.007   12424    2.001    0.999
y1         37389    7.009    1.007   12424    7.001    0.999
y          37389    2.009    1.007   12424    7.001    0.999

## Conditional Mean Independence

• Regression of $y$ on $w$ is biased

• Because $w$ is correlated with error
• But regression of $y$ on $w$ and $x$ generates actual treatment effect

• This is conditional mean independence

• Holding $x$ fixed, potential outcomes no longer related to treatment
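• The size of the bias matches the omitted-variable-bias formula: the short-regression slope is the true effect plus the coefficient on $x$ times the difference in mean $x$ between groups,

$5 + 3 \times (0.752 - 0.250) \approx 6.51$

• This is close to the estimated slope of 6.490 in the output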
lm(y ~ w, data4)

Call:
lm(formula = y ~ w, data = data4)

Coefficients:
(Intercept)            w
2.763        6.490  
lm(y ~ w + x, data4)

Call:
lm(formula = y ~ w + x, data = data4)

Coefficients:
(Intercept)            w            x
2.011        4.985        3.001