splitGraph is an R package for representing biomedical
dataset structure as a typed dependency graph so that leakage-relevant
relationships can be made explicit, validated, queried, and converted
into deterministic split constraints.
It does not fit models, run preprocessing pipelines, or generate resamples by itself. Its job is to encode dataset structure before evaluation so that overlap, provenance, and time-ordering assumptions are inspectable instead of implicit.
The plot above shows six samples (blue) that share three subjects
(orange), two batches (green), two timepoints (red), and two outcome
classes (brown). A plain rsample::vfold_cv() on this dataset would
violate subject, batch, and time structure at the same time,
which is exactly what the graph is designed to make visible.
In biomedical evaluation workflows, leakage often comes from dataset structure rather than obvious coding mistakes. Samples may share:
If those relationships are not modeled explicitly, a train/test split can look correct while still violating the intended scientific separation.
splitGraph makes those dependencies first-class
objects.
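To make the point concrete, the following base R sketch (not part of splitGraph) takes a toy metadata frame with two samples per subject and a split that looks balanced by row count, then checks which subjects appear on both sides. The data frame, the split indices, and the variable names are all assumptions made for illustration.

```r
# Toy metadata (assumed for illustration): two samples per subject.
meta <- data.frame(
  sample_id  = c("S1", "S2", "S3", "S4", "S5", "S6"),
  subject_id = c("P1", "P1", "P2", "P2", "P3", "P3"),
  stringsAsFactors = FALSE
)

# A split that ignores subject structure. The indices are fixed here so the
# result is reproducible; a random split can produce the same situation.
train_idx <- c(1, 3, 5)
test_idx  <- setdiff(seq_len(nrow(meta)), train_idx)

# Subjects contributing samples to both partitions are a leakage risk.
leaked_subjects <- intersect(
  meta$subject_id[train_idx],
  meta$subject_id[test_idx]
)
print(leaked_subjects)
```

The split is a clean 50/50 partition of rows, yet every subject ends up in both the training and the test set, which is the kind of overlap that stays invisible unless the subject relationship is modeled explicitly.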
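Does:

- graph construction from a metadata frame with graph_from_metadata()
- conversion to igraph with as_igraph()
- translation of derived constraints into a tool-agnostic split_spec
- plot() method with per-type colors and a node-type legend
- print(), summary(), and as.data.frame() on all core S3 objects

Does not:

- fit models
- run preprocessing pipelines
- generate resamples (rsample does that)

The package is intentionally narrow: dataset dependency structure for leakage-aware evaluation design.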
Development version from GitHub:

    install.packages("remotes")
    remotes::install_github("selcukorkmaz/splitGraph")

The fastest path is graph_from_metadata(), which auto-detects canonical columns in a metadata frame and assembles a validated dependency_graph:

    library(splitGraph)

    # One row per sample, using the canonical column names.
    meta <- data.frame(
      sample_id     = c("S1", "S2", "S3", "S4", "S5", "S6"),
      subject_id    = c("P1", "P1", "P2", "P2", "P3", "P3"),
      batch_id      = c("B1", "B2", "B1", "B2", "B1", "B2"),
      timepoint_id  = c("T0", "T1", "T0", "T1", "T0", "T1"),
      time_index    = c(0, 1, 0, 1, 0, 1),
      outcome_value = c(0, 1, 0, 1, 1, 0)
    )

    # Build and plot the typed dependency graph.
    g <- graph_from_metadata(meta, graph_name = "demo")
    plot(g)

    # Validate the graph.
    validation <- validate_graph(g)

    # Derive a subject-level constraint and translate it into a split_spec.
    subject_constraint <- derive_split_constraints(g, mode = "subject")
    split_spec <- as_split_spec(subject_constraint, graph = g)
    validate_split_spec(split_spec)

    # Summarize leakage risks for the graph under this constraint.
    summarize_leakage_risks(g, constraint = subject_constraint, split_spec = split_spec)

For full control over node labels, attribute columns, and non-canonical relations, use create_nodes() / create_edges() / build_dependency_graph() directly.
split_spec is the tool-agnostic handoff object produced
by as_split_spec(). splitGraph does not know
about any particular resampling package — downstream consumers are
expected to provide their own adapters so that splitGraph
stays neutral and has no runtime dependency on them.
The typical end-to-end flow is:

1. graph_from_metadata(meta) → typed dependency_graph
2. derive_split_constraints(g, mode = ...) → split_constraint
3. as_split_spec(constraint, graph = g) → split_spec

The sample_data frame carried by split_spec exposes exactly what an adapter needs: sample_id for joining against the observation frame, group_id for grouped resampling, batch_group / study_group for blocking, and order_rank for ordered evaluation. An adapter can be built on top of, for example, rsample::group_vfold_cv() (grouped CV keyed to group_id) or rsample::rolling_origin() (ordered evaluation keyed to order_rank).
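As a hedged illustration of what such an adapter might do, the base R sketch below works on a hand-built stand-in for sample_data. The column names come from the description above, but the values, the helper names assign_group_folds() and ordered_split(), and the round-robin fold assignment are assumptions for illustration, not splitGraph or rsample API.

```r
# Hypothetical stand-in for the sample_data frame carried by a split_spec.
# Column names follow the text above; the values are made up.
sample_data <- data.frame(
  sample_id  = c("S1", "S2", "S3", "S4", "S5", "S6"),
  group_id   = c("P1", "P1", "P2", "P2", "P3", "P3"),
  order_rank = c(1, 2, 1, 2, 1, 2),
  stringsAsFactors = FALSE
)

# Grouped folds: assign each group (not each sample) to a fold, so that all
# samples sharing a group_id land in the same fold.
assign_group_folds <- function(sample_data, v) {
  groups <- unique(sample_data$group_id)
  stopifnot(v >= 2, v <= length(groups))
  # Round-robin assignment of groups to folds 1..v.
  fold_of_group <- setNames(rep_len(seq_len(v), length(groups)), groups)
  unname(fold_of_group[sample_data$group_id])
}

sample_data$fold <- assign_group_folds(sample_data, v = 3)

# Ordered evaluation: analyze ranks up to a cutoff, assess on later ranks.
ordered_split <- function(sample_data, cutoff) {
  list(
    analysis   = sample_data$sample_id[sample_data$order_rank <= cutoff],
    assessment = sample_data$sample_id[sample_data$order_rank > cutoff]
  )
}

time_split <- ordered_split(sample_data, cutoff = 1)

print(sample_data)
print(time_split)
```

In practice an adapter would more likely delegate to rsample::group_vfold_cv() or rsample::rolling_origin(), as noted above. The sketch only shows which column drives each decision: group_id for keeping related samples together, order_rank for keeping evaluation data later than training data.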
Node types: Sample, Subject, Batch, Study, Timepoint, Assay, FeatureSet, Outcome.

Edge types:

- sample_belongs_to_subject
- sample_processed_in_batch
- sample_from_study
- sample_collected_at_timepoint
- sample_measured_by_assay
- sample_uses_featureset
- sample_has_outcome
- subject_has_outcome
- timepoint_precedes
- featureset_generated_from_study
- featureset_generated_from_batch

S3 classes: graph_node_set, graph_edge_set, dependency_graph, depgraph_validation_report, graph_query_result, split_constraint, split_spec, split_spec_validation, leakage_risk_summary.
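To picture what "typed" means here, the following base R sketch represents a tiny graph as a node table and an edge table and then checks that every edge of one type connects the expected node types. It does not use splitGraph's own constructors. The column names (node_id, node_type, from, to, edge_type) and the "subject:" identifier prefix are assumptions for illustration; only the "sample:S1" identifier style and the type names appear in this README.

```r
# Toy metadata (assumed for illustration).
meta <- data.frame(
  sample_id  = c("S1", "S2", "S3"),
  subject_id = c("P1", "P1", "P2"),
  stringsAsFactors = FALSE
)

# Node table: one row per entity, with a type-prefixed identifier.
nodes <- rbind(
  data.frame(
    node_id   = paste0("sample:", meta$sample_id),
    node_type = "Sample",
    stringsAsFactors = FALSE
  ),
  data.frame(
    node_id   = paste0("subject:", unique(meta$subject_id)),
    node_type = "Subject",
    stringsAsFactors = FALSE
  )
)

# Edge table: one typed, directed relation per sample.
edges <- data.frame(
  from      = paste0("sample:", meta$sample_id),
  to        = paste0("subject:", meta$subject_id),
  edge_type = "sample_belongs_to_subject",
  stringsAsFactors = FALSE
)

# Because nodes and edges carry types, the structure can be checked
# mechanically: this edge type must run from a Sample to a Subject.
type_of  <- setNames(nodes$node_type, nodes$node_id)
edges_ok <- all(type_of[edges$from] == "Sample") &&
  all(type_of[edges$to] == "Subject")

print(nodes)
print(edges)
print(edges_ok)
```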
| Layer | Functions |
|---|---|
| Ingestion and construction | ingest_metadata(), graph_from_metadata(), create_nodes(), create_edges(), build_dependency_graph(), dependency_graph(), as_igraph() |
| Validation | validate_graph(), validate_split_spec() |
| Queries | query_node_type(), query_edge_type(), query_neighbors(), query_paths(), query_shortest_paths(), detect_dependency_components(), detect_shared_dependencies() |
| Constraint derivation | derive_split_constraints(), grouping_vector() |
| Split-spec translation | as_split_spec(), summarize_leakage_risks() |
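The name detect_dependency_components() suggests grouping samples that are connected through shared entities. As a rough illustration of that idea only, and not of the package's implementation, the base R sketch below labels samples so that any two samples linked by a shared subject or batch (directly or through a chain) receive the same component label. The helper dependency_components() and its arguments are hypothetical.

```r
# Toy metadata matching the quick-start example (subject and batch only).
meta <- data.frame(
  sample_id  = c("S1", "S2", "S3", "S4", "S5", "S6"),
  subject_id = c("P1", "P1", "P2", "P2", "P3", "P3"),
  batch_id   = c("B1", "B2", "B1", "B2", "B1", "B2"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: samples that share a value in any `via` column,
# directly or transitively, end up with the same component label.
dependency_components <- function(meta, via) {
  label <- seq_len(nrow(meta))
  repeat {
    previous <- label
    for (column in via) {
      # Within each shared value, pull every sample to the smallest label.
      label <- ave(label, meta[[column]], FUN = min)
    }
    # Stop once a full pass changes nothing.
    if (identical(label, previous)) break
  }
  # Renumber labels as 1, 2, ... in order of first appearance.
  match(label, unique(label))
}

by_subject <- dependency_components(meta, via = "subject_id")
by_both    <- dependency_components(meta, via = c("subject_id", "batch_id"))

print(by_subject)
print(by_both)
```

On this toy data, grouping by subject alone yields three components, while grouping by subject and batch together collapses all six samples into a single component, because every subject spans both batches. Under that reading, a constraint requiring both kinds of separation at once would leave nothing to split, which is worth knowing before an evaluation is designed.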
Queries:

    query_node_type(g, "Subject")
    query_edge_type(g, "sample_processed_in_batch")
    query_neighbors(g, node_ids = "sample:S1", edge_types = "sample_belongs_to_subject")
    detect_shared_dependencies(g, via = "Batch")
    detect_dependency_components(g, via = c("Subject", "Batch"))

Constraint derivation:

    subject_constraint <- derive_split_constraints(g, mode = "subject")
    batch_constraint   <- derive_split_constraints(g, mode = "batch")
    study_constraint   <- derive_split_constraints(g, mode = "study")
    time_constraint    <- derive_split_constraints(g, mode = "time")

    strict_composite <- derive_split_constraints(
      g, mode = "composite", strategy = "strict",
      via = c("Subject", "Batch")
    )

    rule_based_composite <- derive_split_constraints(
      g, mode = "composite", strategy = "rule_based",
      priority = c("batch", "study", "subject", "time")
    )

plot(g) renders a typed, layered layout with per-type node colors and an auto-generated node-type legend. Layers: Sample (top), peer dependencies (Subject / Batch / Study / Timepoint) in the middle band, Assay / FeatureSet next, Outcome (bottom).

    plot(g)                                      # typed layered layout (default)
    plot(g, layout = "sugiyama")                 # alternative hierarchical layout
    plot(g, show_labels = FALSE)                 # hide node labels on dense graphs
    plot(g, legend = FALSE)                      # suppress the legend
    plot(g, legend_position = "bottomright")
    plot(g, node_colors = c(Sample = "#000000")) # override type colors

citation("splitGraph") produces:
Korkmaz S (2026). splitGraph: Dataset Dependency Graphs for Leakage-Aware Evaluation. R package version 0.1.0. https://github.com/selcukorkmaz/splitGraph
MIT. See LICENSE.
The package prefers explicit failure over silent guessing. In particular: