---
title: "New Features in rdrobust 4.0.0"
author: "Sebastian Calonico, Matias D. Cattaneo, Max H. Farrell, Rocio Titiunik"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{New Features in rdrobust 4.0.0}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(
  collapse  = TRUE,
  comment   = "#>",
  fig.width = 7,
  fig.height = 4.5
)
```

This vignette documents the features added in **rdrobust 4.0.0**:

1. A `data =` argument for `rdrobust()`, `rdbwselect()`, and `rdplot()`, letting you refer to variables by their bare names inside a data frame.
2. Heteroskedasticity-robust variance estimators (`vce = "hc0"`, `"hc1"`, `"hc2"`, `"hc3"`).
3. Cluster-robust variance estimators (`vce = "cr1"`, `"cr2"`, `"cr3"`) with automatic validation.
4. A `plot` method for `rdrobust` objects (`plot.rdrobust`).
5. A `hide = TRUE` flag and `vars_bins` / `vars_poly` / `coef` outputs from `rdplot()`, so you can build custom RD plots on top of `ggplot2` (or any other graphics system) without re-implementing the binning logic.
6. `tidy` / `glance` methods for integration with the **broom** package.

Existing functionality — bandwidth selection, covariates, kernels, fuzzy and kink designs — is unchanged. See `?rdrobust` and `?rdbwselect` for full documentation.

---

## Setup

We use the U.S. Senate incumbency advantage dataset included in the package. The running variable is the Democratic vote margin at election $t$; the outcome is the Democratic vote share at election $t+2$. The `state` variable serves as a natural cluster identifier, since elections within the same state share common political and institutional features.

```{r data}
library(rdrobust)
data(rdrobust_RDsenate)
df <- rdrobust_RDsenate
head(df)
```

---

## 1. The `data =` argument

Previously, callers either had to `attach()` the data frame or pass each column with `df$` prefixing. In 4.0.0, `rdrobust()`, `rdbwselect()`, and `rdplot()` accept a `data =` argument so `y`, `x`, `covs`, `cluster`, `fuzzy`, `weights`, and `subset` can be given as bare names, resolved against the supplied data frame:

```{r data-arg-basic}
r <- rdrobust(y = vote, x = margin, data = df)
round(r$coef, 3)
```

It also works for expressions built from columns (e.g. `cbind()` for multiple covariates, comparisons for `subset`):

```{r data-arg-expr}
r_cov <- rdrobust(y = vote, x = margin,
                  covs = cbind(class, termshouse, termssenate),
                  data = df)
round(r_cov$coef, 3)

# subset via an expression evaluated inside `data`
r_sub <- rdrobust(y = vote, x = margin,
                  subset = year >= 1950, data = df)
round(r_sub$coef, 3)
```

The same pattern works for `rdbwselect()` and `rdplot()`. For the rest of the vignette we pass variables through `data =`.

### 1.1 `covs =`: formula, names, or matrix

The `covs` argument accepts three forms, so the way you specify covariates can match the shape of your problem:

1. **One-sided formula** — recommended when you have factors, interactions, or transformations:

    ```{r covs-formula-factor}
    df$classf <- factor(df$class, labels = c("A", "B", "C"))

    r_fm <- rdrobust(y = vote, x = margin,
                     covs = ~ classf + termshouse + I(termssenate^2),
                     data = df)
    round(r_fm$coef, 3)
    ```

    Factors are expanded to contrast dummies via `model.matrix()`; interactions (`~ z1 * z2`) and transformations (`I(...)`, `log()`, `poly()`) all work, and the intercept column is dropped automatically. Symbols missing from `data` fall through to the caller's environment.

2. **Character vector of column names** — convenient for programmatic specifications:

    ```{r covs-char}
    covs_names <- c("class", "termshouse", "termssenate")
    r_ch <- rdrobust(y = vote, x = margin, covs = covs_names, data = df)
    round(r_ch$coef, 3)
    ```

    Missing column names give a clear error (`Column(s) not found in 'data': ...`). Passing a character vector without `data = ` is rejected up front instead of failing deep inside the estimator.

3. **Numeric vector, matrix, or data frame** — fully backwards compatible:

    ```{r covs-matrix}
    Z <- with(df, cbind(class, termshouse, termssenate))
    r_mat <- rdrobust(y = vote, x = margin, covs = Z, data = df)
    round(r_mat$coef, 3)
    ```

All three forms — formula with the same plain terms, character vector, and pre-built matrix — yield identical results when the covariates are numeric and the expansion is trivial; the formula form is the right choice when factor expansion or interactions matter.

```{r rdplot, fig.cap="RD plot: U.S. Senate incumbency advantage."}
rdplot(y = vote, x = margin, data = df,
       title   = "Senate: incumbency advantage",
       x.label = "Vote margin (t)",
       y.label = "Vote share (t+2)")
```

---

## 2. Heteroskedasticity-robust variance (HC family)

The full `HC0`–`HC3` family is available via `vce =`. `HC0` is the plug-in sandwich; `HC1` adds a finite-sample scalar; `HC2` and `HC3` reweight residuals by leverage, with `HC3` being the most conservative.

```{r hc-variants}
hc_fits <- lapply(c("nn", "hc0", "hc1", "hc2", "hc3"), function(v) {
  rdrobust(y = vote, x = margin, vce = v, data = df)
})
names(hc_fits) <- c("NN", "HC0", "HC1", "HC2", "HC3")

hc_tbl <- do.call(rbind, lapply(names(hc_fits), function(nm) {
  r <- hc_fits[[nm]]
  data.frame(vce    = nm,
             coef   = round(r$coef[2], 3),   # bias-corrected
             se_rb  = round(r$se[3],   3),   # robust SE
             pval   = round(r$pv[3],   3))
}))
print(hc_tbl, row.names = FALSE)
```

As expected, the point estimate is identical across HC variants (the variance estimator does not change it); only the standard errors and p-values move.

---

## 3. Cluster-Robust Variance Estimators (CR family)

When observations are grouped into clusters, standard errors should account for within-cluster correlation. Version 4.0.0 adds three cluster-robust variance estimators:

| `vce` | Description |
|-------|-------------|
| `"cr1"` | CR1 — cluster sandwich with degrees-of-freedom correction; the default when `cluster` is supplied |
| `"cr2"` | CR2 — Bell & McCaffrey (2002); accounts for within-cluster leverage |
| `"cr3"` | CR3 — Pustejovsky & Tipton (2018); leave-one-cluster-out jackknife; most conservative |

```{r cluster-counts}
cat("Number of clusters:", length(unique(df$state)), "\n")
```

```{r crv-variants}
r_cr1 <- rdrobust(y = vote, x = margin, cluster = state, vce = "cr1", data = df)
r_cr2 <- rdrobust(y = vote, x = margin, cluster = state, vce = "cr2", data = df)
r_cr3 <- rdrobust(y = vote, x = margin, cluster = state, vce = "cr3", data = df)

cl_tbl <- data.frame(
  vce    = c("CR1", "CR2", "CR3"),
  coef   = c(r_cr1$coef[2],  r_cr2$coef[2],  r_cr3$coef[2]),
  se_rb  = c(r_cr1$se[3],    r_cr2$se[3],    r_cr3$se[3]),
  pval   = c(r_cr1$pv[3],    r_cr2$pv[3],    r_cr3$pv[3])
)
cl_tbl[, 2:4] <- round(cl_tbl[, 2:4], 3)
print(cl_tbl, row.names = FALSE)
```

The point estimate (bias-corrected) is the same across estimators — only the standard errors differ. CR3 produces the widest intervals and is the safest default when the number of clusters is moderate.

### Automatic validation

When `cluster` is supplied, only `cr1`/`cr2`/`cr3` are valid; the HC variants and `nn` are silently remapped (or warned and remapped, for HC) to the corresponding cluster-robust estimator:

- `vce = "nn"`   + cluster → `cr1` (silent default)
- `vce = "hc0"`  + cluster → `cr1` (with warning)
- `vce = "hc1"`  + cluster → `cr1` (with warning)
- `vce = "hc2"`  + cluster → `cr2` (with warning)
- `vce = "hc3"`  + cluster → `cr3` (with warning)

Conversely, asking for `cr1/cr2/cr3` without supplying a `cluster` falls back to the corresponding HC variant with a warning.

### Full output for CR3

```{r cr3-summary}
summary(r_cr3)
```

---

## 4. `plot.rdrobust`: Visual Summary of RD Results

Version 4.0.0 adds a `plot` S3 method for `rdrobust` objects. It produces a binscatter with polynomial fits and confidence bands on a common scale.

**Signature:**

```r
plot(x, y, x_run, nbins = 20, ci = TRUE, show_effect = FALSE,
     title = NULL, x.label = "Running Variable", y.label = "Outcome", ...)
```

### Basic plot

```{r plot-basic, fig.cap="Binscatter with local polynomial fits and 95% CI bands."}
plot(r_cr1, df$vote, df$margin,
     title   = "Senate: incumbency advantage",
     x.label = "Vote margin (t)",
     y.label = "Vote share (t+2)")
```

### Adding an effect panel

`show_effect = TRUE` appends a second panel with the conventional point estimate, the robust bias-corrected confidence interval, and significance stars.

```{r plot-effect, fig.height=6, fig.cap="Main plot with RD effect panel."}
plot(r_cr1, df$vote, df$margin,
     show_effect = TRUE,
     title       = "Senate: incumbency advantage",
     x.label     = "Vote margin (t)",
     y.label     = "Vote share (t+2)")
```

### Clustered plot

The method works identically for any `rdrobust` object, regardless of how it was estimated.

```{r plot-cr3, fig.cap="CR3 clustered variance, with effect panel."}
plot(r_cr3, df$vote, df$margin,
     show_effect = TRUE,
     title       = "Senate: CR3 clustered SE",
     x.label     = "Vote margin (t)",
     y.label     = "Vote share (t+2)")
```

---

## 5. Custom RD plots from `rdplot()`

`rdplot()` now accepts a `hide = TRUE` flag and returns the bin-level data used to draw the plot, so you can build any visualization on top of `ggplot2`, `lattice`, or base graphics without re-implementing the binning logic.

The returned list contains:

- `vars_bins` — one row per bin: bin midpoint (`rdplot_mean_bin`), within-bin mean of `y` (`rdplot_mean_y`), within-bin standard error (`rdplot_se_y`), and pointwise CI endpoints (`rdplot_ci_l` / `rdplot_ci_r`) when `ci = ` is set.
- `vars_poly` — a fine evaluation grid (`rdplot_x`) with the fitted polynomial values on each side of the cutoff (`rdplot_y`).
- `coef` — the per-side polynomial coefficient matrix (`Left` / `Right` columns) that produced `vars_poly`.

```{r rdplot-hide}
out <- rdplot(y = vote, x = margin, c = 0,
              hide = TRUE, ci = 95, data = df)
str(out$vars_bins, give.attr = FALSE)
str(out$vars_poly, give.attr = FALSE)
out$coef
```

A minimal custom plot using `ggplot2`:

```{r rdplot-manual, fig.cap="Custom RD plot built from rdplot() output."}
library(ggplot2)
poly_l <- subset(out$vars_poly, rdplot_x <  0)
poly_r <- subset(out$vars_poly, rdplot_x >= 0)

ggplot() +
  geom_vline(xintercept = 0, linewidth = 0.4) +
  geom_errorbar(data = out$vars_bins,
                aes(x = rdplot_mean_bin,
                    ymin = rdplot_ci_l, ymax = rdplot_ci_r),
                color = "grey60", width = 0) +
  geom_point(data = out$vars_bins,
             aes(x = rdplot_mean_bin, y = rdplot_mean_y),
             color = "grey40", size = 1.5) +
  geom_line(data = poly_l, aes(x = rdplot_x, y = rdplot_y)) +
  geom_line(data = poly_r, aes(x = rdplot_x, y = rdplot_y)) +
  theme_bw() +
  labs(x = "Vote margin (t)", y = "Vote share (t+2)",
       title = "Senate: regression function fit")
```

You can also reconstruct the fitted polynomial on a custom x-range directly from `coef`, which is convenient when the default evaluation grid doesn't span the range you want:

```{r rdplot-from-coef}
eval_fit <- function(xs, coef, c)
  sapply(xs, function(xi) sum(coef * (xi - c) ^ (0:(length(coef) - 1))))

xs <- seq(-50, 50, length.out = 5)
data.frame(
  x       = xs,
  y_left  = ifelse(xs <  0, eval_fit(xs, out$coef[, "Left"],  0), NA),
  y_right = ifelse(xs >= 0, eval_fit(xs, out$coef[, "Right"], 0), NA)
)
```

The full set of worked examples (default plot, error-bar variant, shaded CI ribbons, coefficient-based reconstruction) is in `rdplot_illustration.R` at the project root, with Python and Stata twins (`rdplot_illustration.py`, `rdplot_illustration.do`).

---

## 6. broom Integration

**rdrobust** 4.0.0 registers `tidy` and `glance` methods so that results integrate seamlessly with the **broom** package and tidyverse workflows.

### `tidy()`: Coefficient-level output

`tidy()` returns one row per estimator (Conventional, Bias-Corrected, Robust), with standard broom column names:

```{r broom-tidy, eval=requireNamespace("broom", quietly=TRUE)}
library(broom)
tidy(r_cr3)
```

For most applications, the **Robust** row (row 3) is the recommended inference — it combines the bias-corrected point estimate with robust standard errors.

### `glance()`: Model-level summaries

`glance()` returns a single-row data frame with key estimation details: sample sizes, bandwidths, VCE method, and whether covariates or clusters were used.

```{r broom-glance, eval=requireNamespace("broom", quietly=TRUE)}
glance(r_cr3)
```

### Collecting results across specifications

The main payoff of broom integration is easy comparison across specifications. Here we compare all three cluster-robust estimators:

```{r broom-collect, eval=requireNamespace("broom", quietly=TRUE)}
specs <- list(CR1 = r_cr1, CR2 = r_cr2, CR3 = r_cr3)

# Robust BC estimates side-by-side
res <- do.call(rbind, lapply(names(specs), function(nm) {
  cbind(spec = nm, tidy(specs[[nm]])[3, ])  # row 3 = robust
}))
res[, c("spec", "estimate", "std.error", "conf.low", "conf.high")]
```

We can also compare model-level details:

```{r broom-glance-compare, eval=requireNamespace("broom", quietly=TRUE)}
gl <- do.call(rbind, lapply(names(specs), function(nm) {
  cbind(spec = nm, glance(specs[[nm]]))
}))
gl[, c("spec", "nobs.effective.left", "nobs.effective.right",
       "h.left", "vce", "clusters.left", "clusters.right")]
```

### Broader comparison: HC3 vs clustered

A common workflow is comparing heteroskedasticity-robust and cluster-robust results to assess sensitivity to within-cluster correlation:

```{r broom-hc-vs-cl, eval=requireNamespace("broom", quietly=TRUE)}
r_hc3 <- rdrobust(y = vote, x = margin, vce = "hc3", data = df)

all_specs <- list("HC3"  = r_hc3,
                  "CR1"  = r_cr1,
                  "CR2"  = r_cr2,
                  "CR3"  = r_cr3)

comparison <- do.call(rbind, lapply(names(all_specs), function(nm) {
  g <- glance(all_specs[[nm]])
  t <- tidy(all_specs[[nm]])[3, ]   # robust row
  data.frame(spec      = nm,
             estimate  = round(t$estimate, 3),
             std.error = round(t$std.error, 3),
             ci        = paste0("[", round(t$conf.low, 3), ", ", round(t$conf.high, 3), "]"),
             h         = round(g$h.left, 3),
             clusters  = g$clusters.left)
}))
print(comparison, row.names = FALSE)
```

---

## 7. Reporting tables: conventional point + RBC CI

A common workflow is to present a small panel of specifications (sharp, covariate-adjusted, cluster-robust, etc.) in a single table. The canonical Calonico-Cattaneo-Titiunik presentation pairs the **conventional point estimate** with the **robust bias-corrected (RBC) confidence interval** — the point estimate is from row 1 of `coef`, the CI is row 3 of `ci`. The same pattern appears in the Stata illustration (`rdrobust_illustration_new_features.do`), built there with `esttab`; the R analog uses `knitr::kable()`.

We fit four specifications side-by-side: sharp, sharp+covs, cluster, cluster+covs.

```{r table-fits}
covs_form <- ~ class + termshouse + termssenate

m1 <- rdrobust(y = vote, x = margin, data = df)
m2 <- rdrobust(y = vote, x = margin, covs = covs_form, data = df)
m3 <- rdrobust(y = vote, x = margin, vce = "cr1", cluster = state, data = df)
m4 <- rdrobust(y = vote, x = margin, vce = "cr1", cluster = state,
               covs = covs_form, data = df)
```

A small helper extracts the conventional point estimate, the RBC confidence interval, and a few model-level stats from a fitted object:

```{r table-row-builder}
rdr_row <- function(fit, label) {
  data.frame(
    spec      = label,
    estimate  = fit$coef[1, 1],                                 # tau_conv
    rbc_ci    = sprintf("[%.3f, %.3f]", fit$ci[3, 1], fit$ci[3, 2]),
    rbc_pval  = fit$pv[3, 1],
    h         = fit$bws[1, 1],                                  # h_left
    N_eff     = fit$N_h[1] + fit$N_h[2],
    clusters  = if (!is.null(fit$n_clust)) sum(fit$n_clust) else NA_integer_
  )
}

tab <- rbind(
  rdr_row(m1, "sharp"),
  rdr_row(m2, "sharp + covs"),
  rdr_row(m3, "cluster"),
  rdr_row(m4, "cluster + covs")
)
```

Render with `knitr::kable()` for the report:

```{r table-kable}
knitr::kable(
  tab,
  digits   = 3,
  col.names = c("Specification", "Estimate", "Robust 95% CI",
                "RBC p-value", "h", "Eff. N", "Clusters"),
  caption  = "RD estimates: conventional point estimate with RBC confidence interval."
)
```

The Estimate column reports `tau_conv` (the standard local-polynomial point estimate); the **Robust 95% CI** is the bias-corrected, robust-SE confidence interval — the inference recommended by CCT 2014 and the same one printed by `summary(fit)` in the Robust row. The CI is centered on `tau_bc`, not on `tau_conv`, which is why a CI may not be symmetric about the reported point estimate. This is a feature of robust bias-corrected inference, not a typo.

### Specs-as-columns variant (paper style)

For a typeset table closer to the Stata `esttab` layout (specifications as columns, statistics as rows), transpose:

```{r table-paper}
paper <- t(tab[, c("estimate", "rbc_ci", "h", "N_eff")])
colnames(paper) <- tab$spec
rownames(paper) <- c("RD effect (conv.)", "Robust 95% CI", "h", "Eff. N")

knitr::kable(
  paper,
  caption = "Senate incumbency: RD estimates across specifications."
)
```

### LaTeX export

To write a publication-ready LaTeX table to a file (the R analog of the Stata `esttab ... using rdrobust_table.tex`):

```{r table-latex, eval=FALSE}
tex <- knitr::kable(
  paper,
  format    = "latex",
  booktabs  = TRUE,
  caption   = "RD estimates for Senate incumbency.",
  label     = "tab:rdrobust"
)
writeLines(tex, "rdrobust_table.tex")
```

For richer formatting (multi-row headers, footnotes, grouping), the `kableExtra` and `gt` packages build on top of `kable()` / `gt::gt()` and accept the same input data frame.

---

## Session Info

```{r session}
sessionInfo()
```