scrutr helps you inspect, profile, and convert
collections of structured datasets. This vignette walks
through the main workflows.
inspect() produces a one-row-per-variable summary:
class, distinct count, missing values, void strings, character lengths,
and sample modalities.
result <- inspect(CO2)
result
#> # A tibble: 5 × 10
#>   variables class      nb_distinct prop_distinct nb_na prop_na nb_void prop_void
#>   <chr>     <chr>            <int>         <dbl> <int>   <dbl>   <int>     <dbl>
#> 1 Plant     ordered/f…          12        0.143      0       0       0         0
#> 2 Type      factor               2        0.0238     0       0       0         0
#> 3 Treatment factor               2        0.0238     0       0       0         0
#> 4 conc      numeric              7        0.0833     0       0       0         0
#> 5 uptake    numeric             76        0.905      0       0       0         0
#> # ℹ 2 more variables: nchars <chr>, modalities <chr>
Use nrow = TRUE to also print the number of
observations.
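To make the columns of that summary concrete, here is a minimal base-R sketch of how a one-row-per-variable table could be computed by hand. It is not scrutr's implementation: summarise_vars() and its show_nrow argument are hypothetical names (the latter only mimics the idea behind nrow = TRUE), and the definition of a "void" value as an empty string is an assumption.

```r
# Hypothetical helper (not part of scrutr): one row per variable, base R only.
summarise_vars <- function(df, show_nrow = FALSE) {
  n <- nrow(df)
  if (show_nrow) message("Number of observations: ", n)

  nb_distinct <- vapply(df, function(x) length(unique(x)), integer(1))
  nb_na <- vapply(df, function(x) sum(is.na(x)), integer(1))
  # Assumption: a "void" value is an empty string once coerced to character.
  nb_void <- vapply(
    df,
    function(x) sum(as.character(x) == "", na.rm = TRUE),
    integer(1)
  )

  data.frame(
    variables = names(df),
    class = vapply(df, function(x) paste(class(x), collapse = "/"), character(1)),
    nb_distinct = nb_distinct,
    prop_distinct = nb_distinct / n,
    nb_na = nb_na,
    prop_na = nb_na / n,
    nb_void = nb_void,
    prop_void = nb_void / n,
    row.names = NULL,
    stringsAsFactors = FALSE
  )
}

# CO2 ships with base R's datasets package.
summary_tbl <- summarise_vars(CO2, show_nrow = TRUE)
summary_tbl[, c("variables", "class", "nb_distinct", "nb_na")]
```

The sketch leaves out character lengths and sample modalities, and it counts NA as a distinct value, which may differ from what inspect() reports.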
When working with several related tables, you often need to know which variables appear where and whether their types are consistent.
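Done by hand, that comparison comes down to taking the union of all column names and recording, for each dataset, either the column's class or a marker for "absent". The sketch below shows the idea in base R only; compare_vars() is a hypothetical helper written for this illustration, not a scrutr function, and the "-" marker is simply borrowed from the output style used in this vignette.

```r
# Hypothetical helper (not part of scrutr): class of each variable per dataset,
# with "-" where the variable is absent.
compare_vars <- function(data_list) {
  vars_union <- unique(unlist(lapply(data_list, names), use.names = FALSE))
  classes <- lapply(data_list, function(df) {
    vapply(vars_union, function(v) {
      if (v %in% names(df)) paste(class(df[[v]]), collapse = "/") else "-"
    }, character(1), USE.NAMES = FALSE)
  })
  # The named list 'classes' expands into one column per dataset.
  data.frame(vars_union = vars_union, classes,
             row.names = NULL, stringsAsFactors = FALSE)
}

tables <- list(
  df1 = data.frame(x = 1:3, y = c("a", "b", "c"), stringsAsFactors = FALSE),
  df2 = data.frame(x = c(1.1, 2.2, 3.3), z = c(TRUE, FALSE, TRUE))
)
comparison <- compare_vars(tables)
comparison

# Variables present in both datasets whose classes disagree.
shared <- comparison$df1 != "-" & comparison$df2 != "-"
mismatched <- comparison$vars_union[shared & comparison$df1 != comparison$df2]
mismatched
```

Here x is stored as integer in one table and as numeric in the other, which is exactly the kind of inconsistency worth catching before binding or joining the tables.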
data_list <- list(
  cars = cars,
  mtcars = mtcars[, c("mpg", "hp", "wt", "speed") |> intersect(names(mtcars))],
  iris = iris
)
# Which variables are in which datasets?
vars_detect(data_list)
#> # A tibble: 10 × 4
#>    vars_union   cars  mtcars iris
#>    <chr>        <chr> <chr>  <chr>
#>  1 speed        ok    -      -
#>  2 dist         ok    -      -
#>  3 mpg          -     ok     -
#>  4 hp           -     ok     -
#>  5 wt           -     ok     -
#>  6 Sepal.Length -     -      ok
#>  7 Sepal.Width  -     -      ok
#>  8 Petal.Length -     -      ok
#>  9 Petal.Width  -     -      ok
#> 10 Species      -     -      ok
vars_compclasses() goes further and compares the class
of each shared variable:
# Use two datasets that share some columns
shared_list <- list(
  df1 = data.frame(x = 1:3, y = c("a", "b", "c"), stringsAsFactors = FALSE),
  df2 = data.frame(x = c(1.1, 2.2, 3.3), y = c("d", "e", "f"), stringsAsFactors = FALSE)
)
vars_compclasses(shared_list)
#> # A tibble: 2 × 3
#>   vars_union df1       df2
#>   <chr>      <chr>     <chr>
#> 1 x          integer   numeric
#> 2 y          character character
inspect_vars() is the main collection-level function.
Point it at a folder, and it reads all matching files, inspects each
one, then writes a comprehensive Excel report.
# Create a temporary folder with example datasets
mydir <- file.path(tempdir(), "scrutr_demo")
dir.create(mydir, showWarnings = FALSE)
saveRDS(cars, file.path(mydir, "cars.rds"))
saveRDS(mtcars, file.path(mydir, "mtcars.rds"))
saveRDS(iris, file.path(mydir, "iris.rds"))
# Run the full inspection pipeline
inspect_vars(
  input_path = mydir,
  output_path = mydir,
  output_label = "demo",
  considered_extensions = "rds"
)
# The output Excel file contains multiple sheets:
# dims, inspect_tot, one sheet per dataset, vars_detect, vars_compclasses, etc.
list.files(mydir, pattern = "\\.xlsx$")
convert_all() converts all matching files in a folder to another format.
convert_r() offers more control: it works from an Excel mask that specifies exactly which files to convert and how to name the outputs.
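The two conversion styles can be illustrated without the package. The sketch below is base R only and does not reproduce the signatures of convert_all() or convert_r(), which are not shown here. It assumes .rds input and .csv output, uses a plain data frame as a stand-in for the Excel mask, and the helper name convert_from_mask() and the column names input_file and output_file are hypothetical.

```r
# Hypothetical helper (not part of scrutr): convert the files listed in a mask.
# 'mask' is a data frame with one row per file to convert.
convert_from_mask <- function(mask, input_dir, output_dir = input_dir) {
  stopifnot(all(c("input_file", "output_file") %in% names(mask)))
  dir.create(output_dir, showWarnings = FALSE, recursive = TRUE)
  written <- character(0)
  for (i in seq_len(nrow(mask))) {
    data <- readRDS(file.path(input_dir, mask$input_file[i]))
    out <- file.path(output_dir, mask$output_file[i])
    utils::write.csv(data, out, row.names = FALSE)
    written <- c(written, out)
  }
  invisible(written)
}

demo_dir <- file.path(tempdir(), "convert_sketch")
dir.create(demo_dir, showWarnings = FALSE)
saveRDS(cars, file.path(demo_dir, "cars.rds"))
saveRDS(iris, file.path(demo_dir, "iris.rds"))

# Convert everything: build the mask from every .rds file in the folder.
rds_files <- list.files(demo_dir, pattern = "\\.rds$")
full_mask <- data.frame(
  input_file = rds_files,
  output_file = sub("\\.rds$", ".csv", rds_files),
  stringsAsFactors = FALSE
)
convert_from_mask(full_mask, demo_dir)

# Convert selectively: one file, with a chosen output name.
custom_mask <- data.frame(
  input_file = "iris.rds",
  output_file = "iris_renamed.csv",
  stringsAsFactors = FALSE
)
written <- convert_from_mask(custom_mask, demo_dir)
list.files(demo_dir, pattern = "\\.csv$")
```

Keeping the file selection in a table rather than in code is what makes the mask approach convenient: the list of conversions can be reviewed and edited without touching the script.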
scrutr includes several utilities for common data
quality checks:
# Find duplicates in a data frame
df <- data.frame(id = c(1, 2, 2, 3, 3, 3), value = letters[1:6])
dupl_show(df, "id")
#>   id value
#> 1  2     b
#> 2  2     c
#> 3  3     d
#> 4  3     e
#> 5  3     f
# Check a left join for key issues
left_df <- data.frame(key = c("a", "b", "c"))
right_df <- data.frame(key = c("a", "b", "b", "d"), val = 1:4)
ljoin_checks(left_df, right_df, "key")
#> Checks :
#> ltable rows : 3
#> rtable rows :4
#> jtable rows : 4
#> key are common var names accross the two tables
#>   key val
#> 1   a   1
#> 2   b   2
#> 3   b   3
#> 4   c  NA
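For readers who want to see what such checks involve, both can be approximated in base R. This is only a sketch of the underlying logic with duplicated(), merge(), and setdiff(), not the code behind dupl_show() or ljoin_checks(); the variable names are chosen for this illustration.

```r
# Duplicates: keep every row whose key occurs more than once
# (all occurrences, not only the repeats).
df <- data.frame(id = c(1, 2, 2, 3, 3, 3), value = letters[1:6],
                 stringsAsFactors = FALSE)
dup_rows <- df[df$id %in% df$id[duplicated(df$id)], ]
rownames(dup_rows) <- NULL
dup_rows

# Left join: merge with all.x = TRUE, then compare row counts and keys.
left_df <- data.frame(key = c("a", "b", "c"), stringsAsFactors = FALSE)
right_df <- data.frame(key = c("a", "b", "b", "d"), val = 1:4,
                       stringsAsFactors = FALSE)
joined <- merge(left_df, right_df, by = "key", all.x = TRUE)

# Rows gained by the join: a sign of duplicated keys in the right table.
extra_rows <- nrow(joined) - nrow(left_df)
# Keys that appear more than once in the right table.
dup_right_keys <- unique(right_df$key[duplicated(right_df$key)])
# Left keys with no match on the right (their joined columns are NA).
unmatched_left <- setdiff(left_df$key, right_df$key)

extra_rows
dup_right_keys
unmatched_left
```

With the data above, the join gains one row because "b" is duplicated in the right table, and "c" has no match, which explains the NA in the joined result.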