Now that we have graphed the basic scatter plot of a RDD estimation in this post, it is time to estimate and graph the actual effects of the treatment. The tricky point about RDD estimation is that there are different bandwidths around the cutoff point and the estimation results may vary quite a bit depending on the one you choose. Therefore, the researcher needs to be as transparent as possible and show the results (point estimate and confidence intervals) for different bandwidths.
Optimally, your plot should look like the one below.
The graph depicts the point estimates and confidence interval (y-axis) for different bandwidths (x-axis). The goal is to show whether your results are consistent across them. Certainly, the closer to the cutoff point (where the bandwidth size is close to zero), the better for causal inference because the ‘randomness’ of an observation ending up in the control or treatment group is higher there. However, this region is also more likely to have fewer observations in many real-world applications. Therefore, the closer to the cutoff point, the higher the uncertainty about the estimate and the larger the confidence interval (notice this is the pattern above).
There are ways to mathematically optimize the bandwidth and a well-known one is to use the Imbens Kalyanaraman (IK) bandwidth. This graph also plots the IK estimation, which is the one highlighted by the vertical dashed line. But as we can see, the IK bandwidth in this case is too large. In sum, the most transparent way to present the RDD results in by plotting estimation for different bandwidths, as well as for the IK one.
The R code below does that. In the end I wrap up the code into single function that takes in the outcome variable, running variable, treatment indicator and bandwidths and spits out the graph above.
Let’s start by, again, creating some fake data for easier replication, as we did in Part 1.
#running variable running.var <- seq(0, 1, by = .0001) running.var <- sample(running.var, size = 1000, replace = T) ##Put negative values for the running variable in the control group running.var[1:500] <- -running.var[1:500] #treatment indicator (just a binary variable indicating treated and control groups) treat.ind <- c(rep(0,500), rep(1,500))
Now we’ll choose some sensible bandwidth and loop a simple linear regression estimation at each of them. The size of these bandwidths will depend on your substantive study. Think a bit while you choose it. Here what I have in mind is an election where the running variable is the winning margin. Also, you could make the model more complex by adding a quadratic term in the regression estimation, but I’ll keep it simple. Finally, don’t forget to make the confidence intervals heteroscedasticity-robust.
#First, collapse the variables above in one dataset data <- data.frame(cbind(outcome, running.var, treat.ind)) #Define the different bandwidths bw <- seq(0.01,.5,0.05) #this will depend on your data as well as substantive question. do some thinking to define which BWs you are interested in evaluating. #Build matrix to store estimates => you will need one estimate per bandwidth, therefore the number of rows here will be the same amount of bandwidths. #Create a matrix to collect the results of the regression estimations. estimates <- matrix(NA, nrow = (length(bw)+1), ncol = 3) #Loop over different bws to obtain a point estimate and CI for each bw for (i in 1:length(bw)){ d <- filter(data, running.var <= bw[i] & running.var >= -bw[i]) ###Filter the dataset for ith bw est <- lm(outcome~running.var + treat.ind, data=d) #Making it heteroskedastic robust require("sandwich") require("lmtest") est.robust <- coeftest(est, vcov. = vcovHC(est, type = "HC0")) #Prevent the function to stop in cases where lm() cannot calculate the point estimate (too few cases in that bandwidth) if(dim(est.robust)[1]<3){ estimates[i,1] <- NA estimates[i,2] <- NA estimates[i,3] <- NA } else{ estimates[i,1] <- est.robust[3,1] estimates[i,2] <- est.robust[3,1]+1.96*est.robust[3,2] estimates[i,3] <- est.robust[3,1]-1.96*est.robust[3,2] } }
This code block should give us a matrix with point estimate and confidence intervals for the bandwidths we chose in the beginning. What we are missing now is the estimation in the IK bandwidth:
#We will need to 'rdd' package for that require(rdd) #Get the running and outcome variables to feed into the next function X <- data$running.var Y <- data$outcome #Now use IKbandwidth() from the package 'rdd'. Tweak with the parameters as necessary. IK <- IKbandwidth(X, Y, cutpoint = 0, verbose = T,kernel = "triangular") #Once you have established the IK band, re-run the lm estimation for the observations within it d.IK <- filter(data, running.var <= IK & running.var >= -IK) est.IK <- lm(outcome ~ running.var + treat.ind, data = d.IK) est.robust.IK <- coeftest(est, vcov. = vcovHC(est.IK, type = "HC0")) #Include them in the matrix with estimation results. estimates[length(bw) + 1, 1] <- est.robust.IK[3,1] estimates[length(bw) + 1, 2] <- est.robust.IK[3,1] + 1.96 * est.robust.IK[3,2] estimates[length(bw) + 1, 3] <- est.robust.IK[3,1] - 1.96 * est.robust.IK[3,2]
Now you have a matrix with the point estimate and confidence intervals for the bandwidths of your choice as well as the IK one. All you need to do is to plot them on the same graph, which is the figure pasted in the beginning of the post. We will do that using ggplot2. Change xlab and ylab parameters according to your study.
require(ggplot2) ggplot(data=d, mapping = aes(x = bandwidth, y = point.estimate, ymin = ci.lower, ymax = ci.upper)) + geom_vline(xintercept = IK, colour = "royalblue3", linetype = "longdash") + geom_pointrange(size = 0.8) + geom_hline(yintercept = 0, colour="brown4") + ggtitle("RDD results") + xlab("Distance from Cutoff Point") + ylab("Point estimate and confidence interval")
To make it easier, I’ve wrapped it all in a single function that takes in the variables in the dataset (running variable, outcome variable and treatment indicator), the bandwidths of choice, and spits out the graph above.
rdd <- function(outcome, running.var,treat.ind, bw){ data <- data.frame(cbind(outcome, running.var, treat.ind)) estimates <- matrix(NA, nrow = (length(bw)+1), ncol = 3) for (i in 1:length(bw)){ ###Looping over different bws to obtain a point estimate and CI for each bw d <- filter(data, running.var <= bw[i] & running.var >= -bw[i]) est <- lm(outcome~running.var + treat.ind, data=d) ###Estimating formula require("sandwich") require("lmtest") est.robust <- coeftest(est, vcov. = vcovHC(est, type = "HC0")) if(dim(est.robust)[1]<3){ estimates[i,1] <- NA estimates[i,2] <- NA estimates[i,3] <- NA } else{ estimates[i,1] <- est.robust[3,1] estimates[i,2] <- est.robust[3,1]+1.96*est.robust[3,2] estimates[i,3] <- est.robust[3,1]-1.96*est.robust[3,2] } } require(rdd) X <- data$running.var Y <- data$outcome IK <- IKbandwidth(X, Y, cutpoint = 0, verbose = T,kernel = "triangular") d.IK <- filter(data, running.var <= IK & running.var >= -IK) est.IK <- lm(outcome ~ running.var + treat.ind, data = d.IK) est.robust.IK <- coeftest(est, vcov. = vcovHC(est.IK, type = "HC0")) estimates[length(bw) + 1, 1] <- est.robust.IK[3,1] estimates[length(bw) + 1, 2] <- est.robust.IK[3,1] + 1.96 * est.robust.IK[3,2] estimates[length(bw) + 1, 3] <- est.robust.IK[3,1] - 1.96 * est.robust.IK[3,2] d <- data.frame(point.estimate = estimates[,1], ci.upper = estimates[,2], ci.lower = estimates[,3]) d$bandwidth <- c(bw, IK) require(ggplot2) p <- ggplot(data=d, mapping=aes(x=bandwidth,y=point.estimate, ymin=ci.lower, ymax=ci.upper)) + geom_vline(xintercept = IK, colour="royalblue3", linetype = "longdash") + geom_pointrange(size=0.8) + geom_hline(yintercept=0, colour="brown4") + ggtitle("RDD results") + xlab("Distance from Cutoff Point") + ylab("Point estimate and confidence interval") return(p) }