The xplainfi package provides several data generating
processes (DGPs) designed to illustrate specific strengths and
weaknesses of different feature importance methods. Each DGP focuses on
one primary challenge to make the differences between methods clear.
This article provides a comprehensive overview of all simulation settings, including their mathematical formulations and causal structures visualized as directed acyclic graphs (DAGs).
| DGP | Challenge | PFI Behavior | CFI Behavior |
|---|---|---|---|
| sim_dgp_correlated | Spurious correlation | High for spurious x2 | Low for spurious x2 |
| sim_dgp_mediated | Mediation effects | Shows total effects | Shows direct effects |
| sim_dgp_confounded | Confounding | Biased upward | Less biased |
| sim_dgp_interactions | Interaction effects | Low (no main effects) | High (captures interactions) |
| sim_dgp_independent | Baseline (no challenges) | Accurate | Accurate |
| sim_dgp_ewald | Mixed effects | Mixed | Mixed |
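The mediated-effects DGP (sim_dgp_mediated) demonstrates the difference between total and direct causal effects. The exposure feature affects the outcome only through the mediator, while the direct feature affects it both directly and through the mediator.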
\[\text{exposure} \sim N(0,1), \quad \text{direct} \sim N(0,1)\]
\[\text{mediator} = 0.8 \cdot \text{exposure} + 0.6 \cdot \text{direct} + \varepsilon_m\]
\[Y = 1.5 \cdot \text{mediator} + 0.5 \cdot \text{direct} + \varepsilon\]
where \(\varepsilon_m \sim N(0, 0.3^2)\) and \(\varepsilon \sim N(0, 0.2^2)\).
DAG for mediated effects DGP
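The equations above imply a total effect of \(1.5 \cdot 0.8 = 1.2\) for exposure and \(0.5 + 1.5 \cdot 0.6 = 1.4\) for direct, while the direct effects are 0 and 0.5. The following base-R sketch simulates the equations by hand (it is not the package's own implementation; the variable names and sample size are chosen here for illustration) and compares a regression that omits the mediator with one that conditions on it.

```r
# Hand-written simulation of the mediated DGP equations (illustrative only).
set.seed(1)
n <- 5000
exposure <- rnorm(n)
direct   <- rnorm(n)
mediator <- 0.8 * exposure + 0.6 * direct + rnorm(n, sd = 0.3)
y        <- 1.5 * mediator + 0.5 * direct + rnorm(n, sd = 0.2)

# Total effects: the mediator is left out, so its pathway is attributed
# to exposure and direct.
total_fit <- lm(y ~ exposure + direct)

# Direct effects: conditioning on the mediator blocks the indirect pathway.
direct_fit <- lm(y ~ exposure + direct + mediator)

round(coef(total_fit), 2)
round(coef(direct_fit), 2)
```

In the second fit the exposure coefficient should shrink towards zero, which mirrors the total versus direct contrast listed for sim_dgp_mediated in the overview table.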
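The confounding DGP (sim_dgp_confounded) includes a confounder \(H\) that affects both the feature \(X_1\) and the outcome \(Y\). A proxy feature provides a noisy measurement of \(H\), and an additional independent feature affects \(Y\) without being related to \(H\).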
\[H \sim N(0,1) \quad \text{(confounder)}\]
\[X_1 = H + \varepsilon_1\]
\[\text{proxy} = H + \varepsilon_p, \quad \text{independent} \sim N(0,1)\]
\[Y = H + X_1 + \text{independent} + \varepsilon\]
where all \(\varepsilon \sim N(0, 0.5^2)\) independently.
DAG for confounding DGP
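To see how strongly the confounder distorts the apparent effect of \(X_1\), the equations can be simulated by hand. This is a base-R sketch under the stated equations, not the package's implementation; the object names are illustrative. With \(H\) ignored, the regression coefficient of \(X_1\) converges to \(\operatorname{Cov}(X_1, Y) / \operatorname{Var}(X_1) = 2.25 / 1.25 = 1.8\) instead of the true 1. Adjusting for the proxy removes part of the bias (about 1.44 in the limit), and adjusting for \(H\) itself removes it entirely.

```r
# Hand-written simulation of the confounded DGP equations (illustrative only).
set.seed(1)
n <- 5000
h           <- rnorm(n)                      # confounder
x1          <- h + rnorm(n, sd = 0.5)
proxy       <- h + rnorm(n, sd = 0.5)        # noisy measurement of h
independent <- rnorm(n)
y           <- h + x1 + independent + rnorm(n, sd = 0.5)

# No adjustment at all: x1 absorbs the effect of h.
fit_unadjusted <- lm(y ~ x1 + independent)

# Feature set of the hidden scenario: the proxy adjusts only partially.
fit_proxy <- lm(y ~ x1 + proxy + independent)

# Feature set of the observed scenario: h is available, bias disappears.
fit_observed <- lm(y ~ x1 + proxy + independent + h)

round(c(unadjusted = coef(fit_unadjusted)[["x1"]],
        proxy      = coef(fit_proxy)[["x1"]],
        observed   = coef(fit_observed)[["x1"]]), 2)
```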
The hidden argument controls whether the confounder is included among the task's features:

```r
set.seed(123)
# Hidden confounder scenario (default)
task_hidden <- sim_dgp_confounded(n = 500, hidden = TRUE)
task_hidden$feature_names # proxy available but not confounder
#> [1] "independent" "proxy" "x1"
# Observable confounder scenario
task_observed <- sim_dgp_confounded(n = 500, hidden = FALSE)
task_observed$feature_names # both confounder and proxy available
#> [1] "confounder" "independent" "proxy" "x1"
```

The interaction DGP (sim_dgp_interactions) demonstrates a pure interaction effect: \(X_1\) and \(X_2\) influence the outcome only through their product and have no main effects, while \(X_3\) has an additive effect.
\[Y = 2 \cdot X_1 \cdot X_2 + X_3 + \varepsilon\]
where \(X_j \sim N(0,1)\) independently and \(\varepsilon \sim N(0, 0.5^2)\).
DAG for interaction effects DGP
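A short base-R sketch can make the "no main effects" property concrete. It simulates the equation above by hand (not via the package; names and sample size are illustrative) and fits one linear model with main effects only and one that includes the product term.

```r
# Hand-written simulation of the interaction DGP equation (illustrative only).
set.seed(1)
n <- 10000
x1 <- rnorm(n)
x2 <- rnorm(n)
x3 <- rnorm(n)
y  <- 2 * x1 * x2 + x3 + rnorm(n, sd = 0.5)

# Main effects only: x1 and x2 look irrelevant because their
# marginal association with y is zero.
main_fit <- lm(y ~ x1 + x2 + x3)

# x1 * x2 expands to x1 + x2 + x1:x2, so the product term is included.
int_fit <- lm(y ~ x1 * x2 + x3)

round(coef(main_fit), 2)
round(coef(int_fit), 2)
c(main = summary(main_fit)$r.squared, interaction = summary(int_fit)$r.squared)
```

The main-effects model should explain only the share of variance due to \(X_3\), whereas the model with the product term recovers a coefficient close to 2 for x1:x2.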
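The independent-features DGP (sim_dgp_independent) is a baseline scenario in which all features are independent and their effects are additive. All importance methods should give similar results, ranking the features by the size of their coefficients.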
\[Y = 2.0 \cdot X_1 + 1.0 \cdot X_2 + 0.5 \cdot X_3 + \varepsilon\]
where \(X_j \sim N(0,1)\) independently and \(\varepsilon \sim N(0, 0.2^2)\).
DAG for independent features DGP
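As a rough illustration of what "accurate" means in this baseline, the sketch below computes a hand-rolled permutation importance for a linear model in base R. It is an assumption-laden toy (a single permutation, evaluated on the training data, with a hypothetical helper named perm_importance), not the PFI implementation of xplainfi. For independent standard normal features and squared-error loss, permuting feature \(j\) increases the expected loss by roughly \(2 \beta_j^2\), so values near 8, 2, and 0.5 are expected.

```r
# Hand-written simulation of the independent DGP equation (illustrative only).
set.seed(1)
n <- 5000
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
dat$y <- 2.0 * dat$x1 + 1.0 * dat$x2 + 0.5 * dat$x3 + rnorm(n, sd = 0.2)

fit <- lm(y ~ x1 + x2 + x3, data = dat)

# Mean squared error of the fitted model on a data frame.
mse <- function(d) mean((d$y - predict(fit, newdata = d))^2)

# Hypothetical helper: loss increase after shuffling a single feature column.
perm_importance <- function(feature) {
  shuffled <- dat
  shuffled[[feature]] <- sample(shuffled[[feature]])
  mse(shuffled) - mse(dat)
}

pfi <- sapply(c("x1", "x2", "x3"), perm_importance)
round(pfi, 2)
```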
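The sim_dgp_ewald DGP reproduces the data generating process from Ewald et al. (2024) for benchmarking feature importance methods. It combines correlated feature pairs (\(X_1, X_2\) and \(X_3, X_4\)) with an interaction effect between \(X_4\) and \(X_5\); only \(X_4\) and \(X_5\) enter the outcome directly.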
\[X_1, X_3, X_5 \sim \text{Uniform}(0,1)\]
\[X_2 = X_1 + \varepsilon_2, \quad \varepsilon_2 \sim N(0, 0.001)\]
\[X_4 = X_3 + \varepsilon_4, \quad \varepsilon_4 \sim N(0, 0.1)\]
\[Y = X_4 + X_5 + X_4 \cdot X_5 + \varepsilon, \quad \varepsilon \sim N(0, 0.1)\]
DAG for Ewald et al. (2024) DGP
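The correlation structure of this DGP can be inspected with a small base-R simulation. One point is an assumption here: the equations do not say whether the second parameter of \(N(0, \cdot)\) is a variance or a standard deviation, and this sketch treats it as a standard deviation. The noise_sd vector is a name introduced for illustration only, and the code is not the package's implementation.

```r
# Hand-written simulation of the Ewald et al. (2024) DGP equations.
# ASSUMPTION: the second parameter of N(0, .) is read as a standard deviation.
set.seed(1)
n <- 5000
noise_sd <- c(x2 = 0.001, x4 = 0.1, y = 0.1)

x1 <- runif(n)
x3 <- runif(n)
x5 <- runif(n)
x2 <- x1 + rnorm(n, sd = noise_sd[["x2"]])   # near-copy of x1
x4 <- x3 + rnorm(n, sd = noise_sd[["x4"]])   # noisy copy of x3
y  <- x4 + x5 + x4 * x5 + rnorm(n, sd = noise_sd[["y"]])

# Pairwise correlations between all features and the outcome.
round(cor(cbind(x1, x2, x3, x4, x5, y)), 2)
```

Under this reading, \(X_1\) and \(X_2\) are almost perfectly correlated yet unrelated to \(Y\), while \(X_3\) is associated with \(Y\) only through \(X_4\).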