Presenting results of a Regression Discontinuity Design – Part 3

The plots in part 1 and part 2 will help you identify a spurious jump. Looking at the scatter plot of the outcome on the running variable enables you to correctly identify the functional form of your estimation (i.e., don’t fit a linear regression when you should have added a polynomial term). Plotting the point estimate and confidence intervals for different bandwidths helps make sure that any significant effect is not the result of a particular (and unprincipled) choice of bandwidth.

However, there are more checks to be done. The core assumption of the RDD is that observations in the treatment and control groups are comparable when they are close enough to the cutoff point. More technically, the distribution of their potential outcomes is continuous at the cutoff point or, equivalently, the expected value of the potential outcomes of both groups is the same as they approach the cutoff (go revise the Neyman-Rubin causal inference framework if this ‘potential outcome’ lingo gets you confused).
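For readers who prefer symbols, the assumption can be sketched as follows. The notation is introduced here for illustration and is not taken from the post: X_i is the running variable, c the cutoff, and Y_i(0) and Y_i(1) are the outcomes unit i would have without and with treatment.

```latex
% Continuity of the expected potential outcomes at the cutoff
\lim_{x \uparrow c} E[\,Y_i(d) \mid X_i = x\,]
  \;=\; \lim_{x \downarrow c} E[\,Y_i(d) \mid X_i = x\,],
  \qquad d \in \{0, 1\}

% Under this assumption, the jump in the observed outcome at c
% identifies the treatment effect for units at the cutoff
\tau \;=\; \lim_{x \downarrow c} E[\,Y_i \mid X_i = x\,]
      \;-\; \lim_{x \uparrow c} E[\,Y_i \mid X_i = x\,]
  \;=\; E[\,Y_i(1) - Y_i(0) \mid X_i = c\,]
```

If something other than the treatment also jumps at c, the first equality fails and the second line no longer isolates the effect of the treatment.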

A way to check whether the two groups are comparable is to see if belonging to the treatment or control group changes some variable other than the outcome. Let’s look at a concrete example. Suppose you are trying to test the effects of different voting technologies on political outcomes. You could exploit, for example, a new rule stipulating that cities with more than 200K inhabitants, and these cities only, will have to adopt the electronic ballot. Great opportunity to implement an RDD approach, right? Yes it is, and that’s exactly what Thomas Fujiwara, from Princeton, did. What you should do is look at the cities whose population is slightly below and slightly above 200K and compare your outcome of interest (political competition, voting for left-wing parties, etc.) in both groups.

However, what if cities that are slightly above 200K are also consistently richer, or their population is more educated, older, etc., than those slightly below the threshold? Well, something will smell fishy… This may imply that there is some sorting among observations, i.e., richer or more educated cities deliberately (for some weird reason) over-report their population. The consequence is that the two groups are not comparable any more. You don’t know if the difference between the groups is a consequence of the voting technology or of the better socio-economic performance of cities above the 200K threshold. In other words, their potential outcomes are no longer the same, compromising the RDD causal estimation. Checking whether covariates are balanced (i.e., their average is the same) in the control and treatment groups is pretty standard in experimental settings, but even more important in the RDD approach in order to validate its assumption.

Below I propose a way to visually summarize covariate balance in the RDD in a single graph. All you have to do is run an RDD regression using each covariate as the outcome. If the coefficient of the treatment in this regression (i.e., the ‘effect’ of being treated on the covariate) is statistically indistinguishable from zero (p-value > .1), then there is no evidence that the covariate is unbalanced.
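Written out, the balance check is the same regression used for the outcome, with a covariate on the left-hand side. A sketch, assuming a linear specification inside a bandwidth h around a cutoff normalized to zero (the symbols Z_i, X_i, D_i and h are introduced here for illustration):

```latex
% Balance regression for covariate Z, restricted to |X_i| < h
Z_i \;=\; \alpha + \beta X_i + \tau_Z D_i + \varepsilon_i,
  \qquad D_i = \mathbf{1}[\,i \text{ is treated}\,], \quad |X_i| < h

% Balance check: test the null of no jump in Z at the cutoff
H_0 : \tau_Z = 0
```

Failing to reject the null for every covariate is what the graph is meant to show at a glance.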

Below is the code for this graph. I basically loop an RDD regression model over the covariates, collect the p-values and plot them on the same graph.

 

As before, let’s start by creating some fake data for reproducibility.


 

##creating a fake dataset (N=1000, 500 in the treated group, 500 in the control group)
#set the seed before any random draw so the whole dataset is reproducible
set.seed(123)

#outcome variable
outcome <- c(rnorm(500, mean = 50, sd = 10), rnorm(500, mean = 70, sd = 10))

#running variable
running.var <- seq(0, 1, by = .0001)
running.var <- sample(running.var, size = 1000, replace = TRUE)

##Put negative values for the running variable in the control group
running.var[1:500] <- -running.var[1:500]

#treatment indicator (just a binary variable indicating treated and control groups)
treat.ind <- c(rep(0,500), rep(1,500))

##Now we will create data for covariates. For the RDD estimation to be valid, they are supposed to be balanced (roughly the same mean in both groups), which should be the case for the ones below
covar1 <- c(rnorm(500, mean = 50, sd = 10), rnorm(500, mean = 50, sd = 20))
covar2 <- c(rnorm(500, mean = 10, sd = 20), rnorm(500, mean = 10, sd = 30))
covar3 <- c(rnorm(500, mean = 50, sd = 50), rnorm(500, mean = 55, sd = 60))
covar4 <- c(rnorm(500, mean = 50, sd = 30), rnorm(500, mean = 51, sd = 30))

##Now we add these covariates to the dataset:

data <- data.frame(outcome, running.var, treat.ind, covar1, covar2, covar3, covar4)
data$treat.ind <- as.factor(data$treat.ind)

We are interested in the impact of the treatment close to the cutoff point. So let’s subset the data to a bandwidth of 10% on each side of the cutoff:

 

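#Keep observations with -.1 < running.var < .1
#(written as "> -.1" because R would read "<-.1" as an assignment)
d <- data[data$running.var < .1 & data$running.var > -.1, ]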

 

We want to evaluate the ‘impact’ of the treatment condition on the covariates. So we can use lapply to loop a linear regression model over the covariates.

 

#Bundle the covariates' names together
covars <- c("covar1", "covar2", "covar3", "covar4")

#Loop over them using a convenient feature of the "as.formula" function

models <- lapply(covars, function(x){
 lm(as.formula(paste(x," ~ running.var + treat.ind",sep = "")), data = d)
})
names(models) <- covars
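The specification above forces the running variable to have the same slope on both sides of the cutoff. A common variation, sketched here as an assumption and not as part of the original analysis, lets the slope differ on each side by adding an interaction term. The toy data are generated inside the block so that it runs on its own, and `fit_balance` is a hypothetical helper name.

```r
# Toy data: running variable centred at a cutoff of 0, one balanced covariate
set.seed(42)
n <- 1000
running.var <- runif(n, min = -1, max = 1)
treat.ind <- factor(as.integer(running.var >= 0))
covar1 <- rnorm(n, mean = 50, sd = 10)
toy <- data.frame(running.var, treat.ind, covar1)

# Keep observations inside a 10% bandwidth around the cutoff
toy_bw <- toy[abs(toy$running.var) < 0.1, ]

# Hypothetical helper: regress a covariate on the running variable,
# the treatment indicator and their interaction (separate slopes)
fit_balance <- function(covariate, data) {
  f <- as.formula(paste(covariate, "~ running.var * treat.ind"))
  lm(f, data = data)
}

fit <- fit_balance("covar1", toy_bw)

# p-value of the jump at the cutoff (the treat.ind1 row)
p_jump <- summary(fit)$coefficients["treat.ind1", "Pr(>|t|)"]
p_jump
```

Because the running variable is centred at the cutoff, the `treat.ind1` coefficient is still the estimated jump at the threshold, so the p-value can be collected and plotted exactly as in the rest of the post.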

As said, if the treatment condition is associated with different values of the covariates in the treatment and control groups, then its coefficient will be statistically significant, i.e., its p-value will be small.

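#So, let's extract the p-values for the treatment variable in the regressions above.
#Selecting the row by name ("treat.ind1") is safer than relying on its position.
bal <- sapply(models, function(x){
 p_value <- summary(x)$coefficients["treat.ind1", "Pr(>|t|)"]
 return(p_value)
})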

bal <- data.frame(covariate = covars, p_value = bal)

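#Reorder to have a nicer plot
bal <- bal[order(bal$p_value), ]
#ggplot orders the axis by factor levels, not by row order, so reorder the levels too
bal$covariate <- reorder(bal$covariate, bal$p_value)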

Now we just need to plot it:


###Data for the vertical lines marking the significance levels
ps <- data.frame(value = c(0.1, 0.05, 0.01), threshold = c("10%","5%","1%" ))
##reordering
ps$significance.level <- reorder(ps$threshold, ps$value)

###Plotting:
library(ggplot2)

p <- ggplot(bal, aes(x = p_value))
p + geom_point(aes(y=covariate), colour = "blue", size=4, shape=24, fill="blue")+
 xlim(0, 1)+
 ylab("Covariate")+
 geom_vline(data=ps, aes(xintercept = value, colour=significance.level), show.legend = TRUE)

This should spit out a graph like this:

[Figure: rdd_discontinuity_plot3]

Notice that the p-values of all covariates are above the 10% significance level, which means we are OK in terms of covariate balance.

Before closing this sequence of posts, it must be said that another common practice is to run a McCrary test, which examines the continuity of the density of the running variable across the cutoff point. The idea here is to check whether the observations are able to manipulate the running variable, i.e., whether they can force themselves into or out of the treatment condition, compromising the quasi-randomness when we get close to the cutoff point (take a look at the original paper).
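To give a feel for the idea, here is a rough stand-in in base R. It is not the McCrary test itself, only a crude count comparison: absent manipulation, and with a density that is smooth around the cutoff, an observation in a narrow window should be about equally likely to land on either side. The window half-width of 0.05 and the toy data are assumptions made for this sketch.

```r
# Toy running variable with no manipulation, cutoff at 0
set.seed(42)
running.var <- runif(1000, min = -1, max = 1)

cutoff <- 0
window <- 0.05  # assumed half-width of the comparison window

# Count observations just below and just above the cutoff
n_below <- sum(running.var >= cutoff - window & running.var < cutoff)
n_above <- sum(running.var >= cutoff & running.var < cutoff + window)

# Under a smooth density, each observation in the window falls on
# either side with probability close to 0.5
density_check <- binom.test(n_above, n_below + n_above, p = 0.5)
density_check$p.value
```

A small p-value would suggest bunching on one side of the cutoff. For real work, a dedicated implementation of the McCrary test is preferable to this count comparison.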
