---
title: "Causal Discovery"
vignette: >
  %\VignetteIndexEntry{Causal Discovery}
  %\VignetteEngine{quarto::html}
  %\VignetteEncoding{UTF-8}
knitr:
  opts_chunk:
    collapse: true
    comment: "#>"
    fig.align: "center"
---

```{r}
#| label: setup
library(causalDisco)
```

This vignette provides a very brief introduction to causal discovery using
simulated data. For a thorough introduction to causal discovery concepts, we
recommend Glymour et al. (2019) [Review of Causal Discovery Methods Based on
Graphical
Models](https://www.frontiersin.org/journals/genetics/articles/10.3389/fgene.2019.00524/full)
or Zanga et al. (2022) [A Survey on Causal Discovery: Theory and
Practice](https://arxiv.org/abs/2305.10032).

The goal of causal discovery is to infer the causal relationships among a set of
observed variables using observational data.

# Example of causal discovery

Suppose we have this DAG:

```{r}
#| label: dag
cg <- caugi::caugi(
  Z %-->% X1,
  X3 %-->% X2,
  X1 %-->% Y,
  X2 %-->% Y
)
```

We define a layout that we will reuse for plotting the graphs throughout this vignette:

```{r}
#| label: plot-dag
layout <- caugi::caugi_layout_sugiyama(cg)
plot(cg, layout = layout, main = "True DAG")
```

We can generate data from a linear Gaussian model corresponding to the above DAG
using `generate_dag_data()`. Below, we draw 10,000 samples with a fixed random
seed. The edge coefficients are sampled with absolute values between 0.1 and 0.9
and assigned random signs, and the standard deviation of the additive Gaussian
noise at each node is sampled from a log-uniform distribution between 0.3 and 2.

```{r}
#| label: simple-causal-discovery
data_linear <- generate_dag_data(
  cg,
  n = 10000,
  seed = 1405,
  coef_range = c(0.1, 0.9),
  error_sd = c(0.3, 2)
)
head(data_linear)
```

The R code used to generate the data is stored as an attribute of the data
frame:

```{r}
#| label: generating-model
attr(data_linear, "generating_model")
```
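
As a quick, informal sanity check (not part of the discovery workflow itself),
we can inspect the marginal correlations in the simulated data using base R:

```{r}
#| label: inspect-correlations
round(cor(data_linear), 2)
```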

The goal is now to recover the causal structure from the data alone.

We will use the PC algorithm to learn the causal structure. One of its key
assumptions is causal sufficiency, meaning that there are no unobserved
confounders. We can use the PC algorithm from any of the "tetrad", "pcalg", or
"bnlearn" engines to learn the structure from the data. The motivation for
having multiple engines is that they may implement different algorithms, tests,
or scoring methods, or the same algorithm may be implemented differently across
engines.
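
For example, switching engines only requires changing the `engine` argument when
defining the method. The sketch below is not evaluated in this vignette, and it
assumes that the `"fisher_z"` test name is also accepted by the other engines;
the available tests and options may differ between engines.

```{r}
#| label: pc-other-engines
#| eval: false
# Same PC specification targeting other backends (sketch, not run here);
# available tests and options may differ between engines.
pc_tetrad <- pc(engine = "tetrad", test = "fisher_z", alpha = 0.05)
pc_bnlearn <- pc(engine = "bnlearn", test = "fisher_z", alpha = 0.05)
```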

Below, we set up the PC algorithm using Fisher's Z test, a significance level of
0.05, and `pcalg` as the engine. To do so, we first define the PC method using
the `pc()` function, and then pass it to the `disco()` function along with the
data.

```{r}
#| label: pc-algorithm-simple
pc_pcalg <- pc(engine = "pcalg", test = "fisher_z", alpha = 0.05)
pc_result_pcalg <- disco(data = data_linear, method = pc_pcalg)
```

We can visualize the results using the `plot()` function, explicitly providing
the layout defined above so that the graphs are easier to compare.

```{r}
#| label: plot-pc-results-simple
plot(pc_result_pcalg, layout = layout, main = "PC (pcalg)")
```

The PC algorithm recovers the correct causal structure up to Markov equivalence,
represented as a CPDAG (completed partially directed acyclic graph). A CPDAG
represents the equivalence class of all DAGs that encode the same conditional
independencies.

The first notable feature of this plot is that some edges are directed, while
others are undirected. For example, the edge from `X1` to `Y` is directed,
indicating a causal effect of `X1` on `Y`, but not in the reverse direction. In
contrast, the edge between `X2` and `X3` is undirected, indicating that the data
alone do not provide sufficient information to determine the causal direction.
Both orientations `X2 %-->% X3` and `X3 %-->% X2` are compatible with the
observed conditional independencies.
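
The orientation of the collider at `Y` has a simple data-level intuition, which
we can illustrate with a rough base-R check (regression residuals rather than a
formal test): `X1` and `X2` are marginally (close to) independent, but
conditioning on their common effect `Y` induces an association. How pronounced
this conditional association is depends on the sampled coefficients.

```{r}
#| label: collider-check
# Marginal association between X1 and X2: close to zero under the true DAG
cor(data_linear$X1, data_linear$X2)

# Association after conditioning on the common effect Y (via regression
# residuals): non-zero due to collider bias, with a magnitude that depends
# on the sampled edge coefficients
res_x1 <- resid(lm(X1 ~ Y, data = data_linear))
res_x2 <- resid(lm(X2 ~ Y, data = data_linear))
cor(res_x1, res_x2)
```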

We demonstrate the non-identifiability of the causal direction between `X2` and
`X3` by reversing the direction of this edge in the data-generating process
above and applying the PC algorithm to the resulting data set.

```{r}
#| label: pc-algorithm-reversed
cg_reverse <- caugi::caugi(
  Z %-->% X1,
  X2 %-->% X3,
  X1 %-->% Y,
  X2 %-->% Y
)

data_linear_reverse <- generate_dag_data(
  cg_reverse,
  n = 10000,
  seed = 1405,
  coef_range = c(0.1, 0.9),
  error_sd = c(0.3, 2)
)

pc_result_reverse <- disco(data = data_linear_reverse, method = pc_pcalg)
plot(pc_result_reverse, layout = layout, main = "PC (pcalg) reversed")
```

We learn the same causal structure as before, demonstrating that the direction
of influence between `X2` and `X3` cannot be determined from the data alone.

# Unobserved confounding

In practice, unobserved confounding may be present, violating one of the
assumptions of the PC algorithm. Suppose we have the following DAG with an
unobserved confounder `U` between `X1` and `X2`:

```{r}
#| label: dag-unobserved-confounder
cg_unobserved <- caugi::caugi(
  Z %-->% X1,
  X3 %-->% X2,
  X1 %-->% Y,
  X2 %-->% Y,
  U %-->% X1 + X2
)
```

We can visualize this DAG, drawing the unobserved variable `U` in red and its
edges as dashed red lines to indicate that it is unobserved:

```{r}
#| label: plot-dag-unobserved-confounder
plot(
  cg_unobserved,
  edge_style = list(
    by_edge = list(
      U = list(col = "red", fill = "red", lty = "dashed")
    )
  ),
  node_style = list(
    by_node = list(
      U = list(col = "red", fill = "red")
    )
  ),
  main = "True DAG"
)
```

We can generate data from this DAG using `generate_dag_data()` and then remove
the unobserved variable `U` from the data frame:

```{r}
#| label: data-unobserved-confounder
data_unobserved <- generate_dag_data(
  cg_unobserved,
  n = 10000,
  seed = 1405,
  coef_range = c(0.1, 0.9),
  error_sd = c(0.3, 2)
)
data_unobserved <- data_unobserved[, names(data_unobserved) != "U"]
head(data_unobserved)
```
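
Before applying any discovery algorithm, note what the confounder does to the
observed data: because `U` is a common cause of `X1` and `X2`, these two
variables are now marginally associated even though neither causes the other.
A quick base-R check (the magnitude depends on the sampled coefficients):

```{r}
#| label: confounded-correlation
# X1 and X2 are marginally correlated because of the latent common cause U
cor(data_unobserved$X1, data_unobserved$X2)
```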

We can then apply the PC algorithm as before:

```{r}
#| label: pc-algorithm-unobserved-confounder
pc_pcalg_unobserved <- pc(engine = "pcalg", test = "fisher_z", alpha = 0.05)
pc_result_unobserved <- disco(
  data = data_unobserved,
  method = pc_pcalg_unobserved
)
plot(pc_result_unobserved, layout = layout, main = "PC (pcalg)")
```

We see that the PC algorithm no longer recovers the correct causal structure.
In particular, because of the unobserved confounder `U`, it places an edge
between `X1` and `X2` even though neither variable causes the other in the
true DAG.

# Next steps

For more information about how to incorporate knowledge into causal discovery
methods, see the [knowledge vignette](knowledge.html).

For more information about how to visualize causal discovery results, see the
[visualization vignette](visualization.html).
