dqcheckr — Software Specification

Mick Mioduszewski

2026-05-10

Version: 0.1.1 Author: Mick Mioduszewski Date: 2026-05-10


Part A — Purpose and Intent

Why this exists

The analytics organisation ingests external datasets from multiple suppliers on a recurring basis. Verifying that each new delivery is complete, structurally correct, and consistent with prior deliveries is currently slow and largely manual. Analysts spend time on repetitive visual inspection instead of higher-value work, and quality problems risk being missed or caught late.

What we are building

dqcheckr is a purpose-built R package that automates this verification process. For each dataset arrival it produces a single, plain-English HTML report that tells data management staff:

  1. Whether the file is structurally sound (columns, types, row count as expected).
  2. Whether data values are within acceptable bounds and free of obvious errors.
  3. How this delivery compares to the previous one — what has changed, what has drifted.
  4. How this delivery compares to a defined schema contract, if one exists.

Results are also stored in a local SQLite database so that quality trends can be tracked over time across many deliveries.

Design principles


Part B — Business-Level Description

B.1 How it is used

A data officer runs a single command for each arriving dataset:

run_dq_check("customer_accounts", config_dir = "config")

The tool finds the two most recent files in the dataset’s folder, checks the newer one, compares it to the older one, writes a timestamped HTML report to a reports folder, and prints a one-line summary to the console. The report opens automatically in the browser.

Files can also be identified explicitly by naming a current and a previous file, rather than relying on folder modification times.

Scale: the tool is designed for up to 10 dataset arrivals per day across all dataset types.

B.2 What is checked

Data completeness

Data validity

Data integrity

Data consistency

Schema contract (when an expected column list is configured)

Custom checks (when a custom checks file is configured for a dataset)

B.3 What is compared to the previous delivery

B.4 What is produced

HTML report — one file per run, self-contained (no internet required to view). Contains:

  1. A header showing dataset name, file name, run time, and an overall status badge (green = all clear, amber = warnings, red = failures).
  2. A file summary (row count, column count, file size, format, encoding).
  3. A quality checks table listing every check, its status, what was observed, the threshold, and a plain-English explanation. Failures and warnings appear first.
  4. A custom checks table (if configured).
  5. A version comparison section showing what changed since the previous delivery, with a plain-English narrative summary at the top.
  6. A historical trend section showing row count and check outcomes across the last 10 runs (appears once at least two runs have been recorded).
  7. An appendix with a full column-level statistics table.

SQLite database — a single database shared across all datasets. Records summary statistics for every run, enabling long-term trend queries without reading the source files again.


Part C — Technical Specification

C.1 Package structure

dqcheckr/
├── R/
│   ├── run_check.R          # main entry point: run_dq_check()
│   ├── ingest.R             # read CSV / FWF; trim whitespace
│   ├── snapshot.R           # SQLite read/write
│   ├── checks_generic.R     # QC-01..14 and SC-01..02
│   ├── checks_custom.R      # source and call custom_checks(df)
│   ├── compare.R            # CP-01..08
│   ├── report.R             # rmarkdown::render() wrapper
│   ├── utils.R              # dq_result(), load_config(), infer_col_type(), helpers
│   └── dqcheckr-package.R   # package-level documentation
├── inst/
│   ├── templates/report.Rmd
│   ├── config/
│   │   ├── dqcheckr.yml
│   │   └── example_dataset.yml
│   └── demonstrations/
│       ├── demo.R           # named-file mode demo (CSV + FWF)
│       ├── demo2.R          # folder-scan mode demo with version comparison
│       ├── README.md
│       ├── config/          # global + starwars_csv + starwars_fwf configs
│       ├── config2/         # global + starwars_folder configs
│       ├── custom2/         # starwars_custom.R (human-specific checks)
│       ├── data/            # starwars.csv, starwars.fwf
│       └── data2/           # starwars_v1.csv (original), starwars_v2.csv (perturbed)
├── tests/testthat/
│   ├── test-utils.R
│   ├── test-checks.R
│   ├── test-compare.R
│   ├── test-ingest.R
│   ├── test-snapshot.R
│   └── test-integration.R
├── vignettes/
│   ├── dqcheckr.Rmd         # user guide
│   └── specification.Rmd    # this document
├── DESCRIPTION
└── NAMESPACE

C.2 Configuration

Global config — dqcheckr.yml

snapshot_db: "data/snapshots.sqlite"
report_output_dir: "reports/"

default_rules:
  # Single-snapshot thresholds
  max_missing_rate: 0.05            # FAIL if column missing rate exceeds this
  max_non_numeric_rate: 0.01        # FAIL if non-numeric rate in numeric column exceeds this
  min_row_count: 0                  # FAIL if row count below this (0 = disabled)

  # Version comparison thresholds
  max_row_count_change_pct: 0.10
  max_numeric_mean_shift_pct: 0.20
  max_missing_rate_change_pp: 2.0   # percentage points
  max_non_numeric_rate_change_pp: 1.0

  # Schema flags
  flag_new_columns: true            # set false to suppress new-column warnings in CP-02
  flag_dropped_columns: true        # set false to suppress dropped-column warnings in CP-02
  flag_type_changes: true           # set false to suppress type-change warnings in CP-02
  flag_column_order_change: true    # set false to skip CP-08 entirely (WARN CSV / FAIL FWF)

Per-dataset config — <dataset_name>.yml

dataset_name: "customer_accounts"
folder: "data/incoming/customer_accounts/"
format: csv                 # csv | fwf
encoding: "UTF-8"
delimiter: ","              # csv only
fwf_widths: ~               # fwf only
fwf_col_names: ~            # fwf only
fwf_skip: 0                 # fwf only — header rows to skip (default 0)

# Optional: explicit file paths override folder-scan version detection
# current_file: "data/incoming/customer_accounts/20260410.csv"
# previous_file: "data/incoming/customer_accounts/20260401.csv"

# Optional: schema contract
expected_columns:
  - id
  - name
  - country_code
  - account_status
  - account_balance
  - created_date

# Optional: key columns that must be unique
key_columns:
  - id

# Optional rule overrides (merged over global defaults)
rule_overrides:
  max_missing_rate: 0.02
  max_row_count_change_pct: 0.05
  max_non_numeric_rate: 0.005
  min_row_count: 1000

# Optional per-column rules
column_rules:
  country_code:
    allowed_values: ["GB", "US", "DE", "FR"]
    pattern: "^[A-Z]{2}$"
  account_status:
    allowed_values: ["ACTIVE", "CLOSED", "SUSPENDED"]
  account_balance:
    min_value: 0
    max_value: 1000000

# Optional custom checks
custom_checks_file: "custom/customer_accounts_checks.R"

C.3 File version detection

Two modes:

  1. Folder scan (default): the two most recently modified files in folder are used. Newest = current, second-newest = previous.
  2. Explicit naming: if current_file (and optionally previous_file) are set in the dataset config, those paths are used directly. Useful when file naming conventions do not reflect delivery order, or when re-running a check for a specific pair of files.

If only a current file is available (either via folder scan or explicit config), the run proceeds in single-file mode: version comparison and vs-previous schema checks are skipped.

C.4 Ingestion and whitespace trimming

  1. File is read entirely as character columns via readr::read_delim() or readr::read_fwf().
  2. Leading and trailing whitespace is trimmed from every value immediately after reading.
  3. The trimmed data frame is used for all subsequent checks and comparisons. The raw file is never modified.

C.5 Type inference

Applied to each column of the trimmed data frame. Rules applied in order:

Priority Condition Inferred type
1 All non-null/non-empty values parse as a known date format date
2 ≥ 90% of non-null/non-empty values coerce to numeric numeric
3 Otherwise character
All values null or empty unknown

Known date formats tested: %Y-%m-%d, %d/%m/%Y, %m/%d/%Y, %Y%m%d, %d-%m-%Y.

The 90% threshold means a column with up to 10% bad (non-numeric) values is still classified as numeric, making those bad values visible to QC-11.

C.6 Quality checks

Each check returns one or more dq_result objects:

dq_result(check_id, check_name, column = NA, status, observed, threshold = NA, message)
# status: "PASS" | "WARN" | "FAIL" | "INFO"

Single-snapshot checks (QC series)

ID Name Applies to FAIL condition WARN condition
QC-01 Missing rate Every column missing_rate > max_missing_rate
QC-02 Empty column Every column Column is 100% empty
QC-03 Duplicate rows Table Any fully-duplicate rows exist
QC-04 Row count Table — (INFO)
QC-05 Column count Table — (INFO)
QC-06 Inferred type Every column — (INFO)
QC-07 Numeric stats Numeric columns — (INFO: min/max/mean/SD)
QC-08 Distinct value count Character columns — (INFO)
QC-09 Allowed values Configured columns Any value outside allowed_values
QC-10 Numeric bounds Configured columns Any value outside min_value/max_value
QC-11 Non-numeric values Numeric columns non_numeric_rate > max_non_numeric_rate Any non-numeric values present
QC-12 Key uniqueness Configured key_columns Any duplicate value in key column
QC-13 Pattern / regex Configured columns Any value not matching pattern
QC-14 Minimum row count Table row_count < min_row_count (if > 0)

Schema contract checks (SC series) — only when expected_columns is configured

ID Name FAIL condition
SC-01 Unexpected column Column in file but not in expected_columns
SC-02 Missing expected column Column in expected_columns but absent from file

Version comparison checks (CP series) — skipped in single-file mode

ID Name Flag condition Severity
CP-01 Row count change abs(pct_change) > max_row_count_change_pct WARN
CP-02 Schema diff Columns added, dropped, or type-changed (each independently suppressible via flag_new_columns, flag_dropped_columns, flag_type_changes) WARN
CP-03 Missing rate change Increase > max_missing_rate_change_pp pp per column WARN
CP-04 Numeric mean shift Shift > max_numeric_mean_shift_pct per column WARN
CP-05 New distinct values Values in current not in previous (categorical columns) INFO
CP-06 Dropped distinct values Values in previous not in current (categorical columns) INFO
CP-07 Non-numeric rate change Increase > max_non_numeric_rate_change_pp pp per column WARN
CP-08 Column order change Column order differs from previous (suppressible via flag_column_order_change: false) WARN (CSV) / FAIL (FWF)

CP-02 is the single source of truth for schema differences. The new_cols_vs_previous and missing_cols_vs_previous columns in the snapshots table are derived directly from the CP-02 result — they are not computed separately.

C.7 Custom checks

C.8 SQLite schema

Database and tables created automatically on first run via init_snapshot_db().

Table: snapshots

Column Type Notes
id INTEGER PK Auto-increment
dataset_name TEXT
run_timestamp TEXT YYYY-MM-DD HH:MM:SS
file_name TEXT Basename of current file
row_count INTEGER
col_count INTEGER
check_pass_count INTEGER
check_warn_count INTEGER
check_fail_count INTEGER
check_info_count INTEGER
overall_status TEXT FAIL > WARN > PASS > INFO
new_cols_vs_previous TEXT Comma-separated; NULL if none or no previous
missing_cols_vs_previous TEXT Comma-separated; NULL if none or no previous
new_cols_vs_schema TEXT Comma-separated; NULL if none or expected_columns not set
missing_cols_vs_schema TEXT Comma-separated; NULL if none or expected_columns not set

Table: column_snapshots

Column Type Notes
id INTEGER PK Auto-increment
snapshot_id INTEGER FK snapshots.id
column_name TEXT Column name, or check_id for custom checks
dq_check TEXT Stat name or custom check_id
value TEXT Always stored as text
threshold TEXT NULL if no threshold applies
severity_on_breach TEXT FAIL, WARN, or NULL if informational

Rows written per column type:

dq_check Written for threshold severity_on_breach
inferred_type All columns NULL NULL
missing_count All columns NULL NULL
missing_rate All columns max_missing_rate FAIL
distinct_count All columns NULL NULL
numeric_mean Numeric only NULL NULL
numeric_sd Numeric only NULL NULL
numeric_min Numeric only NULL NULL
numeric_max Numeric only NULL NULL
non_numeric_count Numeric only NULL NULL
non_numeric_rate Numeric only max_non_numeric_rate FAIL

Custom check rows: one row per dq_result with a non-NA column value. dq_check = check_id, value = observed, threshold = threshold, severity_on_breach = status if WARN/FAIL, else NULL.

C.9 Main entry point

run_dq_check(
  dataset_name,        # matches <dataset_name>.yml in config_dir
  config_dir = ".",    # contains dqcheckr.yml and <dataset_name>.yml
  open_report = TRUE   # open in browser when interactive
)
# Returns invisibly: list(status, report_path, snapshot_id)

Report filename: <dataset_name>_<YYYYMMDD_HHMMSS>.html

Console output:

[dqcheckr] customer_accounts: FAIL - 0 warning(s), 2 failure(s). Report: reports/customer_accounts_20260410_143022.html

C.10 Error handling

Situation Behaviour
Folder does not exist Stop; no output written
No files in folder Stop; no output written
Explicit file path not found Stop; no output written
Only one file available Single-file mode; comparison and vs-previous schema skipped
File cannot be parsed Stop with filename and parse error
SC-01 / SC-02 violations FAIL recorded; run continues; report and snapshot written
Custom checks file missing Stop with message
custom_checks() not defined Stop with message
custom_checks() runtime error Stop with R error detail
SQLite write fails Warning emitted; HTML report still written
Report render fails Stop; snapshot already written
snapshot_db path missing Created automatically by init_snapshot_db()

C.11 Dependencies

Package Role
readr CSV / FWF ingestion
DBI + RSQLite Snapshot store
rmarkdown + knitr Report rendering
kableExtra Styled HTML tables in report
ggplot2 + gridExtra Trend charts in report
yaml Config parsing
dplyr Data manipulation
tidyr Data reshaping for trend charts
rlang Structured error conditions

All on CRAN. pointblank and dataCompareR were evaluated and not used — all logic is implemented natively.


Part D — Unresolved and Deferred Issues

D.1 Needs design before implementation

# Issue Detail
D-01 Explicit file naming — partial spec The dataset config allows current_file and previous_file to override folder-scan detection, but the behaviour when one is set and the other is not has not been fully specified. Implemented as: current_file set but previous_file absent runs in single-file mode.
D-02 CP-08 column order — WARN vs FAIL severity CP-08 is WARN for CSV and FAIL for FWF. There is currently no way to override the severity per dataset. A column_order_change_severity per-dataset override may be needed.
D-03 QC-03 duplicate rows — no key scope QC-03 checks for fully-identical rows only. Multi-column key uniqueness (two rows with the same id but different amount) is not yet implemented.

D.2 Backlog — agreed

# Feature
B-01 Cross-column consistency rules — conditional rules between columns (e.g. if status = CLOSED then close_date must not be null)
B-02 Date column recency check — flag if the most recent value in a date column is older than N days
B-03 Multi-column key uniqueness — QC-12 currently checks single columns only
B-04 post_run_hook parameter — user-supplied function for email/Slack/S3 delivery after a run
B-05 batch_run(config_dir) — run all configured datasets in sequence
B-06 Additional input formats — Parquet (arrow), Excel, database tables via DBI
B-07 CP-03 severity — make WARN vs FAIL configurable (currently always WARN)
B-08 Report themes — organisation branding via a theme parameter

D.3 Known limitations

# Limitation
L-01 Type inference uses a fixed 90% threshold — resolved in v0.1.1; configurable via type_inference_threshold in rule_overrides
L-02 Date formats tested are hardcoded to five common patterns; exotic formats will be misclassified as character
L-03 The report appendix reads from in-memory col_stats, not from SQLite. If the SQLite write fails but the report succeeds, the appendix and the database can show slightly different values in edge cases
L-04 Performance target (< 60 seconds for 1 M rows) has not been benchmarked; very wide files (many columns) may be slower than expected