Getting started with dqcheckr

dqcheckr automates the verification of recurring external dataset deliveries. For each new file arrival it runs a battery of quality checks, compares the file to the previous delivery, writes a self-contained HTML report, and records summary statistics in a local SQLite database so that quality trends can be tracked over time.

How it works

Your data file                  Your YAML config
      │                                │
      ▼                                ▼
  read_dataset()            load_config()
  (all columns as           (global defaults +
   character, trimmed)       per-dataset rules)
      │                                │
      └──────────────┬─────────────────┘
                     ▼
          run_qc_checks()        ← single-snapshot checks (QC / SC series)
          run_comparison_checks()← vs previous delivery   (CP series)
          run_custom_checks()    ← your own rules
                     │
                     ▼
          write_snapshot()       ← SQLite: run summary + per-column stats
          render_report()        ← self-contained HTML
                     │
                     ▼
          run_dq_check()         ← the one function you call

A data officer runs a single command for each arriving dataset:

run_dq_check("customer_accounts", config_dir = "path/to/configs")

This prints a one-line console summary, writes an HTML report, and returns list(status, report_path, snapshot_id) invisibly.

Installation

dqcheckr is a first-party package — install it directly from source:

devtools::install("path/to/dqr/dqcheckr")
library(dqcheckr)

All dependencies are on CRAN: readr, DBI, RSQLite, rmarkdown, knitr, kableExtra, ggplot2, gridExtra, yaml, dplyr, tidyr, rlang.

Configuration

Two YAML files control every run.

Global config — `dqcheckr.yml`

Placed in your config directory. Sets default thresholds used by all datasets unless a dataset overrides them.

snapshot_db:       "data/snapshots.sqlite"
report_output_dir: "reports/"

default_rules:
  # Proportion of non-empty values that must parse as numeric for a column to
  # be classified as numeric (affects type inference across many checks)
  type_inference_threshold: 0.90

  # FAIL if any column's missing-value rate exceeds this (0-1)
  max_missing_rate: 0.05

  # FAIL if non-numeric values in a numeric column exceed this rate
  max_non_numeric_rate: 0.01

  # FAIL if row count drops below this (0 = disabled)
  min_row_count: 0

  # Version-comparison thresholds (used when a previous file is available)
  max_row_count_change_pct:       0.10
  max_numeric_mean_shift_pct:     0.20
  max_missing_rate_change_pp:     2.0    # percentage points
  max_non_numeric_rate_change_pp: 1.0

  flag_new_columns:         true   # set false to suppress new-column warnings in CP-02
  flag_dropped_columns:     true   # set false to suppress dropped-column warnings in CP-02
  flag_type_changes:        true   # set false to suppress type-change warnings in CP-02
  flag_column_order_change: true   # set false to skip CP-08 entirely (WARN CSV / FAIL FWF)

Per-dataset config — `<dataset_name>.yml`

One file per dataset. Only the keys you specify are needed — everything else inherits from the global defaults.

dataset_name: "customer_accounts"   # for human reference only — the functional
                                     # identifier is the YAML filename, which must
                                     # match the dataset_name argument to run_dq_check()

# --- File location ---
# Option A: folder scan (two most recently modified files are used)
folder: "data/incoming/customer_accounts/"

# Option B: explicit paths (set current_file only → single-file mode,
#           no version comparison)
# current_file:  "data/incoming/customer_accounts/20260410.csv"
# previous_file: "data/incoming/customer_accounts/20260401.csv"

format:    csv
encoding:  "UTF-8"
delimiter: ","

# --- Schema contract (optional) ---
# SC-01 flags extra columns; SC-02 flags missing columns.
expected_columns:
  - id
  - name
  - country_code
  - account_status
  - account_balance
  - created_date

# --- Key columns (optional) ---
# QC-12 checks each for duplicate values.
key_columns:
  - id

# --- Rule overrides (optional) ---
# These replace the matching keys in default_rules for this dataset only.
rule_overrides:
  max_missing_rate:        0.02
  max_row_count_change_pct: 0.05
  min_row_count:           1000

# --- Per-column rules (optional) ---
column_rules:
  country_code:
    allowed_values: ["GB", "US", "DE", "FR"]   # QC-09
    pattern: "^[A-Z]{2}$"                       # QC-13
  account_status:
    allowed_values: ["ACTIVE", "CLOSED", "SUSPENDED"]
  account_balance:
    min_value: 0                                 # QC-10
    max_value: 1000000

# --- Custom checks (optional) ---
# custom_checks_file: "custom/customer_accounts_checks.R"

# --- Output paths (optional, override global) ---
# snapshot_db:       "data/snapshots.sqlite"
# report_output_dir: "reports/"

What is required vs optional

The only fields the package requires to run are the file location (folder or current_file) and the format. Every other config key is optional — the corresponding checks are simply skipped when the key is absent.

Config key	Skipped / default when absent
`key_columns`	QC-12 (key uniqueness) skipped
`expected_columns`	SC-01 and SC-02 (schema contract) skipped
`column_rules.allowed_values`	QC-09 (allowed values) skipped
`column_rules.min_value` / `max_value`	QC-10 (numeric bounds) skipped
`column_rules.pattern`	QC-13 (pattern / regex) skipped
`custom_checks_file`	All custom checks skipped
`previous_file` / second file in folder	All CP-01–CP-08 (version comparison) skipped
`min_row_count` set to 0	QC-14 (minimum row count) skipped
`type_inference_threshold`	Defaults to 0.90 (affects QC-06, QC-07, QC-08, QC-11, CP-02, CP-04, CP-05, CP-06, CP-07)
`flag_new_columns`	Defaults to `true`; set to `false` to suppress new-column warnings in CP-02 (still tracked in SQLite)
`flag_dropped_columns`	Defaults to `true`; set to `false` to suppress dropped-column warnings in CP-02 (still tracked in SQLite)
`flag_type_changes`	Defaults to `true`; set to `false` to suppress type-change warnings in CP-02
`flag_column_order_change`	Defaults to `true`; set to `false` to skip CP-08 entirely

A minimal dataset config that runs basic checks with no column-level rules:

dataset_name: "my_dataset"           # for reference; must match the YAML filename
current_file: "data/incoming/my_dataset.csv"
format: csv

Fixed-width files

For fixed-width files, set format: fwf and supply column widths and names. If the file has a header row, set fwf_skip: 1 to skip it.

format: fwf
fwf_skip: 1
fwf_widths:    [10, 20, 3, 10, 12, 11]
fwf_col_names: [id, name, country_code, account_status, account_balance, created_date]

The quality checks

Each check returns one or more dq_result objects with four key fields: check_id, status (PASS / WARN / FAIL / INFO), observed, and message. The overall run status is the worst single status across all checks.

Single-snapshot checks (QC series)

These run on the current file alone.

ID	Name	Triggers on	Severity
QC-01	Missing rate	Column missing rate > `max_missing_rate`	FAIL
QC-02	Empty column	Column is 100% empty	FAIL
QC-03	Duplicate rows	Any fully-identical rows	WARN
QC-04	Row count	Always	INFO
QC-05	Column count	Always	INFO
QC-06	Inferred type	Always (date / numeric / character / unknown)	INFO
QC-07	Numeric stats	Numeric columns (min, max, mean, SD)	INFO
QC-08	Distinct count	Character columns	INFO
QC-09	Allowed values	Values outside `allowed_values` list	FAIL
QC-10	Numeric bounds	Values outside `min_value` / `max_value`	FAIL
QC-11	Non-numeric values	Non-numeric values in a numeric column	WARN / FAIL
QC-12	Key uniqueness	Duplicate values in a `key_columns` column	FAIL
QC-13	Pattern / regex	Values not matching `pattern`	FAIL
QC-14	Minimum row count	Row count below `min_row_count`	FAIL

Schema contract checks (SC series)

Only run when expected_columns is set.

ID	Name	Triggers on	Severity
SC-01	Unexpected column	Column in file not in `expected_columns`	FAIL
SC-02	Missing expected column	Column in `expected_columns` absent from file	FAIL

Version comparison checks (CP series)

Only run when a previous file is available. Skipped automatically in single-file mode.

ID	Name	Triggers on	Severity
CP-01	Row count change	Change > `max_row_count_change_pct`	WARN
CP-02	Schema diff	Columns added, dropped, or type-changed	WARN
CP-03	Missing rate change	Increase > `max_missing_rate_change_pp` pp	WARN
CP-04	Numeric mean shift	Shift > `max_numeric_mean_shift_pct`	WARN
CP-05	New distinct values	Values in current not seen in previous	INFO
CP-06	Dropped distinct values	Values in previous absent from current	INFO
CP-07	Non-numeric rate change	Increase > `max_non_numeric_rate_change_pp` pp	WARN
CP-08	Column order change	Column order differs from previous	WARN (CSV) / FAIL (FWF)

Type inference

infer_col_type() classifies each column after whitespace trimming. Rules are applied in priority order:

date — all non-empty values parse as a known date format (%Y-%m-%d, %d/%m/%Y, %m/%d/%Y, %Y%m%d, %d-%m-%Y)
numeric — ≥ type_inference_threshold of non-empty values coerce to numeric
character — everything else
unknown — all values are empty or NA

The threshold defaults to 90%, meaning a column with up to 10% bad values is still classified as numeric, making those bad values visible to QC-11 (non-numeric contamination). It is configurable per dataset via type_inference_threshold in default_rules or rule_overrides:

# raise threshold — stricter, fewer columns classified as numeric
rule_overrides:
  type_inference_threshold: 0.95

# lower threshold — more lenient, columns with more noise still classified numeric
rule_overrides:
  type_inference_threshold: 0.80

Changing the threshold affects QC-06, QC-07, QC-08, QC-11, CP-02, CP-04, CP-05, CP-06, and CP-07. Note that changing it between runs may cause CP-02 to report a spurious type change on columns that sit close to the old and new thresholds, even if the underlying data did not change.

infer_col_type(c("2024-01-15", "2024-06-30"))  # "date"
infer_col_type(c("100", "200", "N/A"))          # "character" (only 67% numeric)
infer_col_type(c("100", "200", "N/A", rep("50", 17)))  # "numeric" (95% numeric)
infer_col_type(c(NA, "", NA))                   # "unknown"
# with a custom threshold:
infer_col_type(c(rep("1", 17), "a", "b", "c"), threshold = 0.80)  # "numeric" (85%)

Running a check

result <- run_dq_check(
  dataset_name = "customer_accounts",
  config_dir   = "path/to/configs",   # contains dqcheckr.yml + customer_accounts.yml
  open_report  = TRUE                 # open HTML in browser when interactive
)

# Console output (always printed):
# [dqcheckr] customer_accounts: FAIL - 1 warning(s), 2 failure(s). Report: reports/...html

result$status       # "PASS", "WARN", or "FAIL"
result$report_path  # full path to the HTML file
result$snapshot_id  # integer row ID in snapshots table

The HTML report contains:

Header — dataset name, file name, run time, overall status badge
File summary — row count, column count, file size, format, encoding
Quality checks table — every check result, failures first
Custom checks table — if configured
Version comparison — what changed since the previous delivery
Historical trend — row count and check outcomes across the last 10 runs (appears once at least two runs have been recorded)
Appendix — full column-level statistics table

Calling individual checks

You can call any check function directly, without going through the full pipeline. Every check takes (df, config) and returns a list of dq_result objects.

library(dqcheckr)

# Build a minimal config programmatically
cfg <- list(
  rules = list(
    max_missing_rate     = 0.05,
    max_non_numeric_rate = 0.01,
    min_row_count        = 0
  ),
  column_rules     = list(
    status = list(allowed_values = c("ACTIVE", "CLOSED"))
  ),
  key_columns      = "id",
  expected_columns = NULL
)

df <- read.csv("data/customer_accounts.csv", colClasses = "character")
df[] <- lapply(df, trimws)

# Run a single check
results <- check_missing_rate(df, cfg)

# Inspect
results[[1]]$status    # "PASS" or "FAIL"
results[[1]]$observed  # e.g. "3.2% missing (4 of 125)"

# Run all single-snapshot checks at once
all_results <- run_qc_checks(df, cfg)

# Overall status
overall_status(all_results)  # "FAIL" > "WARN" > "PASS" > "INFO"

Custom checks

Create a plain .R file that defines custom_checks(df). It receives the trimmed data frame and must return a list of dq_result objects.

# File: custom/customer_accounts_checks.R

custom_checks <- function(df) {
  results <- list()

  # Rule: ACTIVE accounts must not have a zero balance
  active_zero <- df[df$account_status == "ACTIVE" &
                    !is.na(df$account_balance) &
                    df$account_balance == "0", ]
  n <- nrow(active_zero)
  results <- c(results, list(dq_result(
    check_id   = "CUST-01",
    check_name = "No zero-balance active accounts",
    column     = "account_balance",   # enables per-column storage in SQLite
    status     = if (n > 0) "FAIL" else "PASS",
    observed   = sprintf("%d ACTIVE account(s) with balance 0", n),
    message    = if (n > 0)
      sprintf("%d ACTIVE account(s) have a zero balance.", n)
    else
      "No ACTIVE accounts have a zero balance."
  )))

  results
}

Point to it in the dataset config:

custom_checks_file: "custom/customer_accounts_checks.R"

The file is sourced in an isolated environment (not the global session). dq_result() is available inside it. Any error — missing file, undefined function, runtime failure — stops the run with a clear message.

Tip: always set column = when your check is scoped to a specific column. Results with column = NA (the default) appear in the HTML report and contribute to the overall status, but are not written to the column_snapshots SQLite table and therefore do not appear in per-column trend queries.

The snapshot database

Every run writes two rows to a shared SQLite database: one row in snapshots (run-level summary) and one row per column-stat in column_snapshots.

library(DBI)
library(RSQLite)

con <- dbConnect(SQLite(), "data/snapshots.sqlite")

# Recent runs for one dataset
dbGetQuery(con,
  "SELECT id, file_name, overall_status, check_fail_count, run_timestamp
   FROM snapshots
   WHERE dataset_name = 'customer_accounts'
   ORDER BY id DESC
   LIMIT 10")

# Column-level stats for the most recent run
dbGetQuery(con,
  "SELECT column_name, dq_check, value, threshold
   FROM column_snapshots
   WHERE snapshot_id = (
     SELECT MAX(id) FROM snapshots WHERE dataset_name = 'customer_accounts'
   )
   ORDER BY column_name, dq_check")

dbDisconnect(con)

Because all datasets share one database, you can query across them:

dbGetQuery(con,
  "SELECT dataset_name, COUNT(*) AS runs,
          SUM(check_fail_count) AS total_failures
   FROM snapshots
   GROUP BY dataset_name")

The output/ directory (and database file) are created automatically on the first run if they do not exist.

Worked example — Star Wars dataset

The following uses dplyr::starwars exported to CSV and fixed-width files.

# Config directory contains:
#   dqcheckr.yml         — global thresholds
#   starwars_csv.yml     — CSV dataset config
#   starwars_fwf.yml     — FWF dataset config

# Run checks on both formats
result_csv <- run_dq_check("starwars_csv", config_dir = "config", open_report = TRUE)
result_fwf <- run_dq_check("starwars_fwf", config_dir = "config", open_report = TRUE)

The Star Wars config sets max_missing_rate: 0.60 because vehicles (87%) and starships (77%) are naturally sparse — most characters have none. Both columns FAIL, which is the correct finding: they are genuinely incomplete and a consumer should be aware before using them.

Expected console output:

[dqcheckr] starwars_csv: FAIL - 0 warning(s), 2 failure(s). Report: output/reports/starwars_csv_....html
[dqcheckr] starwars_fwf: FAIL - 0 warning(s), 2 failure(s). Report: output/reports/starwars_fwf_....html

Situation	Behaviour
Config file not found	Stop immediately; no output written
Data file not found	Stop immediately; no output written
Only one file available	Single-file mode; CP checks skipped
File cannot be parsed	Stop with filename and parse error
SC-01 / SC-02 violations	FAIL recorded; run continues; report written
Custom checks file missing	Stop with message
`custom_checks()` not defined	Stop with message
`custom_checks()` runtime error	Stop with R error detail
SQLite write fails	Warning emitted; HTML report still written

Getting started with dqcheckr

How it works

Installation

Configuration

Global config — `dqcheckr.yml`

Per-dataset config — `<dataset_name>.yml`

What is required vs optional

Fixed-width files

The quality checks

Single-snapshot checks (QC series)

Schema contract checks (SC series)

Version comparison checks (CP series)

Type inference

Running a check

Calling individual checks

Custom checks

The snapshot database

Worked example — Star Wars dataset

Error handling

Design principles

Getting started with dqcheckr

How it works

Installation

Configuration

Global config — dqcheckr.yml

Per-dataset config — <dataset_name>.yml

What is required vs optional

Fixed-width files

The quality checks

Single-snapshot checks (QC series)

Schema contract checks (SC series)

Version comparison checks (CP series)

Type inference

Running a check

Calling individual checks

Custom checks

The snapshot database

Worked example — Star Wars dataset

Error handling

Design principles

Global config — `dqcheckr.yml`

Per-dataset config — `<dataset_name>.yml`