Presenting results of a Regression Discontinuity Design – Part 2

Now that we have graphed the basic scatter plot of a RDD estimation in this post, it is time to estimate and graph the actual effects of the treatment. The tricky point about RDD estimation is that there are different bandwidths around the cutoff point and the estimation results may vary quite a bit depending on the one you choose. Therefore, the researcher needs to be as transparent as possible and show the results (point estimate and confidence intervals) for different bandwidths.

Optimally, your plot should look like the one below.

rdd_discontinuity_plot2

The graph depicts the point estimates and confidence interval (y-axis) for different bandwidths (x-axis). The goal is to show whether your results are consistent across them. Certainly, the closer to the cutoff point (where the bandwidth size is close to zero), the better for causal inference because the ‘randomness’ of an observation ending up in the control or treatment group is higher there. However, this region is also more likely to have fewer observations in many real-world applications. Therefore, the closer to the cutoff point, the higher the uncertainty about the estimate and the larger the confidence interval (notice this is the pattern above).

There are ways to mathematically optimize the bandwidth and a well-known one is to use the Imbens Kalyanaraman (IK) bandwidth. This graph also plots the IK estimation, which is the one highlighted by the vertical dashed line. But as we can see, the IK bandwidth in this case is too large. In sum, the most transparent way to present the RDD results in by plotting estimation for different bandwidths, as well as for the IK one.

The R code below does that. In the end I wrap up the code into single function that takes in the outcome variable, running variable, treatment indicator and bandwidths and spits out the graph above.

Let’s start by, again, creating some fake data for easier replication, as we did in Part 1.


#running variable
running.var <- seq(0, 1, by = .0001)
running.var <- sample(running.var, size = 1000, replace = T)
##Put negative values for the running variable in the control group
running.var[1:500] <- -running.var[1:500]
#treatment indicator (just a binary variable indicating treated and control groups)
treat.ind <- c(rep(0,500), rep(1,500))

Now we’ll choose some sensible bandwidth and loop a simple linear regression estimation at each of them. The size of these bandwidths will depend on your substantive study. Think a bit while you choose it. Here what I have in mind is an election where the running variable is the winning margin. Also, you could make the model more complex by adding a quadratic term in the regression  estimation, but I’ll keep it simple. Finally, don’t forget to make the confidence intervals heteroscedasticity-robust.

#First, collapse the variables above in one dataset
data <- data.frame(cbind(outcome, running.var, treat.ind))

#Define the different bandwidths
bw <- seq(0.01,.5,0.05) #this will depend on your data as well as substantive question. do some thinking to define which BWs you are interested in evaluating.  #Build matrix to  store estimates => you will need one estimate per bandwidth, therefore the number of rows here will be the same amount of bandwidths.
#Create a matrix to collect the results of the regression estimations.
estimates <- matrix(NA, nrow = (length(bw)+1), ncol = 3)

#Loop over different bws to obtain a point estimate and CI for each bw
for (i in 1:length(bw)){
    d <- filter(data, running.var <= bw[i] & running.var >= -bw[i]) ###Filter the dataset for ith bw
    est <- lm(outcome~running.var + treat.ind, data=d)
    #Making it heteroskedastic robust
    require("sandwich")
    require("lmtest")
    est.robust <- coeftest(est, vcov. = vcovHC(est, type = "HC0"))
    #Prevent the function to stop in cases where lm() cannot calculate the point estimate (too few cases in that bandwidth)
    if(dim(est.robust)[1]<3){
      estimates[i,1] <- NA
      estimates[i,2] <- NA
      estimates[i,3] <- NA
    }
    else{
      estimates[i,1] <- est.robust[3,1]
      estimates[i,2] <- est.robust[3,1]+1.96*est.robust[3,2]
      estimates[i,3] <- est.robust[3,1]-1.96*est.robust[3,2]
    }
  }

This code block should give us a matrix with point estimate and confidence intervals for the bandwidths we chose in the beginning. What we are missing now is the estimation in the IK bandwidth:

#We will need to 'rdd' package for that
require(rdd)
#Get the running and outcome variables to feed into the next function
X <- data$running.var
Y <- data$outcome
#Now use IKbandwidth() from the package 'rdd'. Tweak with the parameters as necessary.
IK <- IKbandwidth(X, Y, cutpoint = 0, verbose = T,kernel = "triangular")
#Once you have established the IK band, re-run the lm estimation for the observations within it
d.IK <- filter(data, running.var <= IK & running.var >= -IK)
est.IK <- lm(outcome ~ running.var + treat.ind, data = d.IK)
est.robust.IK <- coeftest(est, vcov. = vcovHC(est.IK, type = "HC0"))
#Include them in the matrix with estimation results.
estimates[length(bw) + 1, 1] <- est.robust.IK[3,1]
estimates[length(bw) + 1, 2] <- est.robust.IK[3,1] + 1.96 * est.robust.IK[3,2]
estimates[length(bw) + 1, 3] <- est.robust.IK[3,1] - 1.96 * est.robust.IK[3,2]

Now you have a matrix with the point estimate and confidence intervals for the bandwidths of your choice as well as the IK one. All you need to do is to plot them on the same graph, which is the figure pasted in the beginning of the post. We will do that using ggplot2. Change xlab and ylab parameters according to your study.

require(ggplot2)
ggplot(data=d, mapping = aes(x = bandwidth, y = point.estimate, ymin = ci.lower, ymax = ci.upper)) +
    geom_vline(xintercept = IK, colour = "royalblue3", linetype = "longdash") +
    geom_pointrange(size = 0.8) +
    geom_hline(yintercept = 0, colour="brown4") +
    ggtitle("RDD results") +
    xlab("Distance from Cutoff Point") +
    ylab("Point estimate and confidence interval")

To make it easier, I’ve wrapped it all in a single function that takes in the variables in the dataset (running variable, outcome variable and treatment indicator), the bandwidths of choice, and spits out the graph above.

rdd <- function(outcome, running.var,treat.ind, bw){
  data <- data.frame(cbind(outcome, running.var, treat.ind))
  estimates <- matrix(NA, nrow = (length(bw)+1), ncol = 3)
  for (i in 1:length(bw)){ ###Looping over different bws to obtain a point estimate and CI for each bw
    d <- filter(data, running.var <= bw[i] & running.var >= -bw[i])
    est <- lm(outcome~running.var + treat.ind, data=d) ###Estimating formula
    require("sandwich")
    require("lmtest")
    est.robust <- coeftest(est, vcov. = vcovHC(est, type = "HC0"))
    if(dim(est.robust)[1]<3){
      estimates[i,1] <- NA
      estimates[i,2] <- NA
      estimates[i,3] <- NA
    }
    else{
      estimates[i,1] <- est.robust[3,1]
      estimates[i,2] <- est.robust[3,1]+1.96*est.robust[3,2]
      estimates[i,3] <- est.robust[3,1]-1.96*est.robust[3,2]
    }
  }
  require(rdd)
  X <- data$running.var
  Y <- data$outcome
  IK <- IKbandwidth(X, Y, cutpoint = 0, verbose = T,kernel = "triangular")
  d.IK <- filter(data, running.var <= IK & running.var >= -IK)
  est.IK <- lm(outcome ~ running.var + treat.ind, data = d.IK)
  est.robust.IK <- coeftest(est, vcov. = vcovHC(est.IK, type = "HC0"))
  estimates[length(bw) + 1, 1] <- est.robust.IK[3,1]
  estimates[length(bw) + 1, 2] <- est.robust.IK[3,1] + 1.96 * est.robust.IK[3,2]
  estimates[length(bw) + 1, 3] <- est.robust.IK[3,1] - 1.96 * est.robust.IK[3,2]
  d <- data.frame(point.estimate = estimates[,1], ci.upper = estimates[,2], ci.lower = estimates[,3])
  d$bandwidth <- c(bw, IK)
  require(ggplot2)
  p <- ggplot(data=d, mapping=aes(x=bandwidth,y=point.estimate, ymin=ci.lower, ymax=ci.upper)) +
    geom_vline(xintercept = IK, colour="royalblue3", linetype = "longdash") +
    geom_pointrange(size=0.8) +
    geom_hline(yintercept=0, colour="brown4") +
    ggtitle("RDD results") +
    xlab("Distance from Cutoff Point") +
    ylab("Point estimate and confidence interval")
  return(p)
}
Advertisement

Presenting results of a Regression Discontinuity Design – Part 1

Regression discontinuity design (RDD) is a great tool for going beyond descriptive statistics and moving into causal inference. I’m assuming the reader knows the theory, assumptions, advantages and weakness of the method. Below I’ll just give a few, basic tips for how to present results and share code that graphs it. The interesting thing about RDD is that its a very visual estimation technique. You can see and, therefore, present to your audience what are the results in a very intuitive way. Here I’m proposing  that you present your RDD estimation using three types of graphs. I present the first, and simplest, one in this post. The next ones are slightly more elaborate.

In short, RDD is about evaluating whether an outcome variable changes significantly after an independent variable changes its value. There is third element: the running variable that defines the value of the independent variable (0 or 1 in the sharp RDD case, which is the only one I’m dealing with here). If the running variable is below a certain value – the cutoff point –  the independent variable takes the value of zero (the observations in this ‘area’ are the control group). If above, the independent variable is one (treated group).

We can show all the behavior of these three variable in a simple bidimensional scatter plot in which the y-axis is the outcome variable, the x-axis is the running variable and there is a vertical line signaling the cutoff point. I.e., this vertical line divides the scatter plot in half: the control group is plotted in the left-hand side, treated group in the right-hand side.

Therefore, this graph, very standard in RDD presentation, should look like the one below. Instead of a linear regression line, fit in a loess line to show the central tendency of the observations in each group. This way you will better visualize the functional form of the data and avoid imposing a linear trend while the data could actually be in, for example, a quadratic form.

rdd_discontinuity_plot1

Below a very simple code that graphs the plot above. We first create some fake data for easier replication.

##creating a fake dataset (N=1000, 500 at treated, 500 at control group)
#outcome variable
outcome <- c(rnorm(500, mean = 50, sd = 10),  rnorm(500, mean = 70, sd = 10))

#running variable
running.var <- seq(0, 1, by = .0001)
running.var <- sample(running.var, size = 1000, replace = T)

##Put negative values for the running variable in the control group
running.var[1:500] <- -running.var[1:500]

#treatment indicator (just a binary variable indicating treated and control groups)
treat.ind <- c(rep(0,500), rep(1,500))
data <- data.frame(cbind(outcome, running.var

Now the plot itself using ggplot2. Notice we need to tweak a bit the legend labels in this graph using the command scale_colour_discrete()

require(ggplot2)
ggplot(data, aes(running.var, outcome, color = treat.ind)) +
  geom_point() + stat_smooth(size = 1.5) +
  geom_vline(xintercept=0, linetype="longdash") +
  xlab("Running variable") +
  ylab("Outcome variable") +
  scale_colour_discrete(name="Experimental\nCondition",
                          breaks=c("0", "1"), labels=c("Control", "Treatment"))

Yes, that’s all. Now let us move to something slightly more complex in the next post.