dqcheckr — Software Specification

Part C — Technical Specification

C.1 Package structure

dqcheckr/
├── R/
│   ├── run_check.R          # main entry point: run_dq_check()
│   ├── ingest.R             # read CSV / FWF; trim whitespace
│   ├── snapshot.R           # SQLite read/write
│   ├── checks_generic.R     # QC-01..14 and SC-01..02
│   ├── checks_custom.R      # source and call custom_checks(df)
│   ├── compare.R            # CP-01..08
│   ├── report.R             # rmarkdown::render() wrapper
│   ├── utils.R              # dq_result(), load_config(), infer_col_type(), helpers
│   └── dqcheckr-package.R   # package-level documentation
├── inst/
│   ├── templates/report.Rmd
│   ├── config/
│   │   ├── dqcheckr.yml
│   │   └── example_dataset.yml
│   └── demonstrations/
│       ├── demo.R           # named-file mode demo (CSV + FWF)
│       ├── demo2.R          # folder-scan mode demo with version comparison
│       ├── README.md
│       ├── config/          # global + starwars_csv + starwars_fwf configs
│       ├── config2/         # global + starwars_folder configs
│       ├── custom2/         # starwars_custom.R (human-specific checks)
│       ├── data/            # starwars.csv, starwars.fwf
│       └── data2/           # starwars_v1.csv (original), starwars_v2.csv (perturbed)
├── tests/testthat/
│   ├── test-utils.R
│   ├── test-checks.R
│   ├── test-compare.R
│   ├── test-ingest.R
│   ├── test-snapshot.R
│   └── test-integration.R
├── vignettes/
│   ├── dqcheckr.Rmd         # user guide
│   └── specification.Rmd    # this document
├── DESCRIPTION
└── NAMESPACE

C.2 Configuration

Global config — dqcheckr.yml

snapshot_db: "data/snapshots.sqlite"
report_output_dir: "reports/"

default_rules:
  # Single-snapshot thresholds
  max_missing_rate: 0.05            # FAIL if column missing rate exceeds this
  max_non_numeric_rate: 0.01        # FAIL if non-numeric rate in numeric column exceeds this
  min_row_count: 0                  # FAIL if row count below this (0 = disabled)

  # Version comparison thresholds
  max_row_count_change_pct: 0.10
  max_numeric_mean_shift_pct: 0.20
  max_missing_rate_change_pp: 2.0   # percentage points
  max_non_numeric_rate_change_pp: 1.0

  # Schema flags
  flag_new_columns: true            # set false to suppress new-column warnings in CP-02
  flag_dropped_columns: true        # set false to suppress dropped-column warnings in CP-02
  flag_type_changes: true           # set false to suppress type-change warnings in CP-02
  flag_column_order_change: true    # set false to skip CP-08 entirely (WARN CSV / FAIL FWF)

Per-dataset config — <dataset_name>.yml

dataset_name: "customer_accounts"
folder: "data/incoming/customer_accounts/"
format: csv                 # csv | fwf
encoding: "UTF-8"
delimiter: ","              # csv only
fwf_widths: ~               # fwf only
fwf_col_names: ~            # fwf only
fwf_skip: 0                 # fwf only — header rows to skip (default 0)

# Optional: explicit file paths override folder-scan version detection
# current_file: "data/incoming/customer_accounts/20260410.csv"
# previous_file: "data/incoming/customer_accounts/20260401.csv"

# Optional: schema contract
expected_columns:
  - id
  - name
  - country_code
  - account_status
  - account_balance
  - created_date

# Optional: key columns that must be unique
key_columns:
  - id

# Optional rule overrides (merged over global defaults)
rule_overrides:
  max_missing_rate: 0.02
  max_row_count_change_pct: 0.05
  max_non_numeric_rate: 0.005
  min_row_count: 1000

# Optional per-column rules
column_rules:
  country_code:
    allowed_values: ["GB", "US", "DE", "FR"]
    pattern: "^[A-Z]{2}$"
  account_status:
    allowed_values: ["ACTIVE", "CLOSED", "SUSPENDED"]
  account_balance:
    min_value: 0
    max_value: 1000000

# Optional custom checks
custom_checks_file: "custom/customer_accounts_checks.R"

C.3 File version detection

Two modes:

Folder scan (default): the two most recently modified files in folder are used. Newest = current, second-newest = previous.
Explicit naming: if current_file (and optionally previous_file) are set in the dataset config, those paths are used directly. Useful when file naming conventions do not reflect delivery order, or when re-running a check for a specific pair of files.

If only a current file is available (either via folder scan or explicit config), the run proceeds in single-file mode: version comparison and vs-previous schema checks are skipped.

C.4 Ingestion and whitespace trimming

File is read entirely as character columns via readr::read_delim() or readr::read_fwf().
Leading and trailing whitespace is trimmed from every value immediately after reading.
The trimmed data frame is used for all subsequent checks and comparisons. The raw file is never modified.

C.5 Type inference

Applied to each column of the trimmed data frame. Rules applied in order:

Priority	Condition	Inferred type
1	All non-null/non-empty values parse as a known date format	`date`
2	≥ 90% of non-null/non-empty values coerce to numeric	`numeric`
3	Otherwise	`character`
—	All values null or empty	`unknown`

Known date formats tested: %Y-%m-%d, %d/%m/%Y, %m/%d/%Y, %Y%m%d, %d-%m-%Y.

The 90% threshold means a column with up to 10% bad (non-numeric) values is still classified as numeric, making those bad values visible to QC-11.

C.6 Quality checks

Each check returns one or more dq_result objects:

dq_result(check_id, check_name, column = NA, status, observed, threshold = NA, message)
# status: "PASS" | "WARN" | "FAIL" | "INFO"

Single-snapshot checks (QC series)

ID	Name	Applies to	FAIL condition	WARN condition
QC-01	Missing rate	Every column	`missing_rate > max_missing_rate`	—
QC-02	Empty column	Every column	Column is 100% empty	—
QC-03	Duplicate rows	Table	—	Any fully-duplicate rows exist
QC-04	Row count	Table	—	— (INFO)
QC-05	Column count	Table	—	— (INFO)
QC-06	Inferred type	Every column	—	— (INFO)
QC-07	Numeric stats	Numeric columns	—	— (INFO: min/max/mean/SD)
QC-08	Distinct value count	Character columns	—	— (INFO)
QC-09	Allowed values	Configured columns	Any value outside `allowed_values`	—
QC-10	Numeric bounds	Configured columns	Any value outside `min_value`/`max_value`	—
QC-11	Non-numeric values	Numeric columns	`non_numeric_rate > max_non_numeric_rate`	Any non-numeric values present
QC-12	Key uniqueness	Configured `key_columns`	Any duplicate value in key column	—
QC-13	Pattern / regex	Configured columns	Any value not matching `pattern`	—
QC-14	Minimum row count	Table	`row_count < min_row_count` (if > 0)	—

Schema contract checks (SC series) — only when expected_columns is configured

ID	Name	FAIL condition
SC-01	Unexpected column	Column in file but not in `expected_columns`
SC-02	Missing expected column	Column in `expected_columns` but absent from file

Version comparison checks (CP series) — skipped in single-file mode

ID	Name	Flag condition	Severity
CP-01	Row count change	`abs(pct_change) > max_row_count_change_pct`	WARN
CP-02	Schema diff	Columns added, dropped, or type-changed (each independently suppressible via `flag_new_columns`, `flag_dropped_columns`, `flag_type_changes`)	WARN
CP-03	Missing rate change	Increase > `max_missing_rate_change_pp` pp per column	WARN
CP-04	Numeric mean shift	Shift > `max_numeric_mean_shift_pct` per column	WARN
CP-05	New distinct values	Values in current not in previous (categorical columns)	INFO
CP-06	Dropped distinct values	Values in previous not in current (categorical columns)	INFO
CP-07	Non-numeric rate change	Increase > `max_non_numeric_rate_change_pp` pp per column	WARN
CP-08	Column order change	Column order differs from previous (suppressible via `flag_column_order_change: false`)	WARN (CSV) / FAIL (FWF)

CP-02 is the single source of truth for schema differences. The new_cols_vs_previous and missing_cols_vs_previous columns in the snapshots table are derived directly from the CP-02 result — they are not computed separately.

C.7 Custom checks

custom_checks_file in dataset config points to an .R file.
Sourced into an isolated environment (not the global env).
Must define custom_checks(df) returning a list of dq_result objects.
Receives the trimmed data frame.
Errors (missing file, missing function, runtime error) stop the run.
Results appear in the “Custom Checks” report section and are written to column_snapshots (see C.8).

C.8 SQLite schema

Database and tables created automatically on first run via init_snapshot_db().

Table: snapshots

Column	Type	Notes
`id`	INTEGER PK	Auto-increment
`dataset_name`	TEXT
`run_timestamp`	TEXT	`YYYY-MM-DD HH:MM:SS`
`file_name`	TEXT	Basename of current file
`row_count`	INTEGER
`col_count`	INTEGER
`check_pass_count`	INTEGER
`check_warn_count`	INTEGER
`check_fail_count`	INTEGER
`check_info_count`	INTEGER
`overall_status`	TEXT	`FAIL` > `WARN` > `PASS` > `INFO`
`new_cols_vs_previous`	TEXT	Comma-separated; NULL if none or no previous
`missing_cols_vs_previous`	TEXT	Comma-separated; NULL if none or no previous
`new_cols_vs_schema`	TEXT	Comma-separated; NULL if none or `expected_columns` not set
`missing_cols_vs_schema`	TEXT	Comma-separated; NULL if none or `expected_columns` not set

Table: column_snapshots

Column	Type	Notes
`id`	INTEGER PK	Auto-increment
`snapshot_id`	INTEGER FK	→ `snapshots.id`
`column_name`	TEXT	Column name, or check_id for custom checks
`dq_check`	TEXT	Stat name or custom check_id
`value`	TEXT	Always stored as text
`threshold`	TEXT	NULL if no threshold applies
`severity_on_breach`	TEXT	`FAIL`, `WARN`, or NULL if informational

Rows written per column type:

`dq_check`	Written for	`threshold`	`severity_on_breach`
`inferred_type`	All columns	NULL	NULL
`missing_count`	All columns	NULL	NULL
`missing_rate`	All columns	`max_missing_rate`	`FAIL`
`distinct_count`	All columns	NULL	NULL
`numeric_mean`	Numeric only	NULL	NULL
`numeric_sd`	Numeric only	NULL	NULL
`numeric_min`	Numeric only	NULL	NULL
`numeric_max`	Numeric only	NULL	NULL
`non_numeric_count`	Numeric only	NULL	NULL
`non_numeric_rate`	Numeric only	`max_non_numeric_rate`	`FAIL`

Custom check rows: one row per dq_result with a non-NA column value. dq_check = check_id, value = observed, threshold = threshold, severity_on_breach = status if WARN/FAIL, else NULL.

C.9 Main entry point

run_dq_check(
  dataset_name,        # matches <dataset_name>.yml in config_dir
  config_dir = ".",    # contains dqcheckr.yml and <dataset_name>.yml
  open_report = TRUE   # open in browser when interactive
)
# Returns invisibly: list(status, report_path, snapshot_id)

Report filename: <dataset_name>_<YYYYMMDD_HHMMSS>.html

Console output:

[dqcheckr] customer_accounts: FAIL - 0 warning(s), 2 failure(s). Report: reports/customer_accounts_20260410_143022.html

C.10 Error handling

Situation	Behaviour
Folder does not exist	Stop; no output written
No files in folder	Stop; no output written
Explicit file path not found	Stop; no output written
Only one file available	Single-file mode; comparison and vs-previous schema skipped
File cannot be parsed	Stop with filename and parse error
SC-01 / SC-02 violations	FAIL recorded; run continues; report and snapshot written
Custom checks file missing	Stop with message
`custom_checks()` not defined	Stop with message
`custom_checks()` runtime error	Stop with R error detail
SQLite write fails	Warning emitted; HTML report still written
Report render fails	Stop; snapshot already written
`snapshot_db` path missing	Created automatically by `init_snapshot_db()`

C.11 Dependencies

Package	Role
`readr`	CSV / FWF ingestion
`DBI` + `RSQLite`	Snapshot store
`rmarkdown` + `knitr`	Report rendering
`kableExtra`	Styled HTML tables in report
`ggplot2` + `gridExtra`	Trend charts in report
`yaml`	Config parsing
`dplyr`	Data manipulation
`tidyr`	Data reshaping for trend charts
`rlang`	Structured error conditions

All on CRAN. pointblank and dataCompareR were evaluated and not used — all logic is implemented natively.

#	Issue	Detail
D-01	Explicit file naming — partial spec	The dataset config allows `current_file` and `previous_file` to override folder-scan detection, but the behaviour when one is set and the other is not has not been fully specified. Implemented as: `current_file` set but `previous_file` absent runs in single-file mode.
D-02	CP-08 column order — WARN vs FAIL severity	CP-08 is WARN for CSV and FAIL for FWF. There is currently no way to override the severity per dataset. A `column_order_change_severity` per-dataset override may be needed.
D-03	QC-03 duplicate rows — no key scope	QC-03 checks for fully-identical rows only. Multi-column key uniqueness (two rows with the same `id` but different `amount`) is not yet implemented.

#	Feature
B-01	Cross-column consistency rules — conditional rules between columns (e.g. if `status = CLOSED` then `close_date` must not be null)
B-02	Date column recency check — flag if the most recent value in a date column is older than N days
B-03	Multi-column key uniqueness — QC-12 currently checks single columns only
B-04	`post_run_hook` parameter — user-supplied function for email/Slack/S3 delivery after a run
B-05	`batch_run(config_dir)` — run all configured datasets in sequence
B-06	Additional input formats — Parquet (`arrow`), Excel, database tables via `DBI`
B-07	CP-03 severity — make WARN vs FAIL configurable (currently always WARN)
B-08	Report themes — organisation branding via a `theme` parameter

#	Limitation
L-01	~~Type inference uses a fixed 90% threshold~~ — resolved in v0.1.1; configurable via `type_inference_threshold` in `rule_overrides`
L-02	Date formats tested are hardcoded to five common patterns; exotic formats will be misclassified as character
L-03	The report appendix reads from in-memory `col_stats`, not from SQLite. If the SQLite write fails but the report succeeds, the appendix and the database can show slightly different values in edge cases
L-04	Performance target (< 60 seconds for 1 M rows) has not been benchmarked; very wide files (many columns) may be slower than expected

dqcheckr — Software Specification

Mick Mioduszewski

2026-05-10

Part A — Purpose and Intent

Why this exists

What we are building

Design principles

Part B — Business-Level Description

B.1 How it is used

B.2 What is checked

B.3 What is compared to the previous delivery

B.4 What is produced