Getting started with dqcheckr

dqcheckr automates the verification of recurring external dataset deliveries. For each new file arrival it runs a battery of quality checks, compares the file to the previous delivery, writes a self-contained HTML report, and records summary statistics in a local SQLite database so that quality trends can be tracked over time.

How it works

Your data file                  Your YAML config
      │                                │
      ▼                                ▼
  read_dataset()            load_config()
  (all columns as           (global defaults +
   character, trimmed)       per-dataset rules)
      │                                │
      └──────────────┬─────────────────┘
                     ▼
          run_qc_checks()        ← single-snapshot checks (QC / SC series)
          run_comparison_checks()← vs previous delivery   (CP series)
          run_custom_checks()    ← your own rules
                     │
                     ▼
          write_snapshot()       ← SQLite: run summary + per-column stats
          render_report()        ← self-contained HTML
                     │
                     ▼
          run_dq_check()         ← the one function you call

A data officer runs a single command for each arriving dataset:

run_dq_check("customer_accounts", config_dir = "path/to/configs")

This prints a one-line console summary, writes an HTML report, and returns list(status, report_path, snapshot_id) invisibly.


Installation

dqcheckr is a first-party package — install it directly from source:

devtools::install("path/to/dqr/dqcheckr")
library(dqcheckr)

All dependencies are on CRAN: readr, DBI, RSQLite, rmarkdown, knitr, kableExtra, ggplot2, gridExtra, yaml, dplyr, tidyr, rlang.


Configuration

Two YAML files control every run.

Global config — dqcheckr.yml

Placed in your config directory. Sets default thresholds used by all datasets unless a dataset overrides them.

snapshot_db:       "data/snapshots.sqlite"
report_output_dir: "reports/"

default_rules:
  # Proportion of non-empty values that must parse as numeric for a column to
  # be classified as numeric (affects type inference across many checks)
  type_inference_threshold: 0.90

  # FAIL if any column's missing-value rate exceeds this (0-1)
  max_missing_rate: 0.05

  # FAIL if non-numeric values in a numeric column exceed this rate
  max_non_numeric_rate: 0.01

  # FAIL if row count drops below this (0 = disabled)
  min_row_count: 0

  # Version-comparison thresholds (used when a previous file is available)
  max_row_count_change_pct:       0.10
  max_numeric_mean_shift_pct:     0.20
  max_missing_rate_change_pp:     2.0    # percentage points
  max_non_numeric_rate_change_pp: 1.0

  flag_new_columns:         true   # set false to suppress new-column warnings in CP-02
  flag_dropped_columns:     true   # set false to suppress dropped-column warnings in CP-02
  flag_type_changes:        true   # set false to suppress type-change warnings in CP-02
  flag_column_order_change: true   # set false to skip CP-08 entirely (WARN CSV / FAIL FWF)

Per-dataset config — <dataset_name>.yml

One file per dataset. Only the keys you specify are needed — everything else inherits from the global defaults.

dataset_name: "customer_accounts"   # for human reference only — the functional
                                     # identifier is the YAML filename, which must
                                     # match the dataset_name argument to run_dq_check()

# --- File location ---
# Option A: folder scan (two most recently modified files are used)
folder: "data/incoming/customer_accounts/"

# Option B: explicit paths (set current_file only → single-file mode,
#           no version comparison)
# current_file:  "data/incoming/customer_accounts/20260410.csv"
# previous_file: "data/incoming/customer_accounts/20260401.csv"

format:    csv
encoding:  "UTF-8"
delimiter: ","

# --- Schema contract (optional) ---
# SC-01 flags extra columns; SC-02 flags missing columns.
expected_columns:
  - id
  - name
  - country_code
  - account_status
  - account_balance
  - created_date

# --- Key columns (optional) ---
# QC-12 checks each for duplicate values.
key_columns:
  - id

# --- Rule overrides (optional) ---
# These replace the matching keys in default_rules for this dataset only.
rule_overrides:
  max_missing_rate:        0.02
  max_row_count_change_pct: 0.05
  min_row_count:           1000

# --- Per-column rules (optional) ---
column_rules:
  country_code:
    allowed_values: ["GB", "US", "DE", "FR"]   # QC-09
    pattern: "^[A-Z]{2}$"                       # QC-13
  account_status:
    allowed_values: ["ACTIVE", "CLOSED", "SUSPENDED"]
  account_balance:
    min_value: 0                                 # QC-10
    max_value: 1000000

# --- Custom checks (optional) ---
# custom_checks_file: "custom/customer_accounts_checks.R"

# --- Output paths (optional, override global) ---
# snapshot_db:       "data/snapshots.sqlite"
# report_output_dir: "reports/"

What is required vs optional

The only fields the package requires to run are the file location (folder or current_file) and the format. Every other config key is optional — the corresponding checks are simply skipped when the key is absent.

Config key Skipped / default when absent
key_columns QC-12 (key uniqueness) skipped
expected_columns SC-01 and SC-02 (schema contract) skipped
column_rules.allowed_values QC-09 (allowed values) skipped
column_rules.min_value / max_value QC-10 (numeric bounds) skipped
column_rules.pattern QC-13 (pattern / regex) skipped
custom_checks_file All custom checks skipped
previous_file / second file in folder All CP-01–CP-08 (version comparison) skipped
min_row_count set to 0 QC-14 (minimum row count) skipped
type_inference_threshold Defaults to 0.90 (affects QC-06, QC-07, QC-08, QC-11, CP-02, CP-04, CP-05, CP-06, CP-07)
flag_new_columns Defaults to true; set to false to suppress new-column warnings in CP-02 (still tracked in SQLite)
flag_dropped_columns Defaults to true; set to false to suppress dropped-column warnings in CP-02 (still tracked in SQLite)
flag_type_changes Defaults to true; set to false to suppress type-change warnings in CP-02
flag_column_order_change Defaults to true; set to false to skip CP-08 entirely

A minimal dataset config that runs basic checks with no column-level rules:

dataset_name: "my_dataset"           # for reference; must match the YAML filename
current_file: "data/incoming/my_dataset.csv"
format: csv

Fixed-width files

For fixed-width files, set format: fwf and supply column widths and names. If the file has a header row, set fwf_skip: 1 to skip it.

format: fwf
fwf_skip: 1
fwf_widths:    [10, 20, 3, 10, 12, 11]
fwf_col_names: [id, name, country_code, account_status, account_balance, created_date]

The quality checks

Each check returns one or more dq_result objects with four key fields: check_id, status (PASS / WARN / FAIL / INFO), observed, and message. The overall run status is the worst single status across all checks.

Single-snapshot checks (QC series)

These run on the current file alone.

ID Name Triggers on Severity
QC-01 Missing rate Column missing rate > max_missing_rate FAIL
QC-02 Empty column Column is 100% empty FAIL
QC-03 Duplicate rows Any fully-identical rows WARN
QC-04 Row count Always INFO
QC-05 Column count Always INFO
QC-06 Inferred type Always (date / numeric / character / unknown) INFO
QC-07 Numeric stats Numeric columns (min, max, mean, SD) INFO
QC-08 Distinct count Character columns INFO
QC-09 Allowed values Values outside allowed_values list FAIL
QC-10 Numeric bounds Values outside min_value / max_value FAIL
QC-11 Non-numeric values Non-numeric values in a numeric column WARN / FAIL
QC-12 Key uniqueness Duplicate values in a key_columns column FAIL
QC-13 Pattern / regex Values not matching pattern FAIL
QC-14 Minimum row count Row count below min_row_count FAIL

Schema contract checks (SC series)

Only run when expected_columns is set.

ID Name Triggers on Severity
SC-01 Unexpected column Column in file not in expected_columns FAIL
SC-02 Missing expected column Column in expected_columns absent from file FAIL

Version comparison checks (CP series)

Only run when a previous file is available. Skipped automatically in single-file mode.

ID Name Triggers on Severity
CP-01 Row count change Change > max_row_count_change_pct WARN
CP-02 Schema diff Columns added, dropped, or type-changed WARN
CP-03 Missing rate change Increase > max_missing_rate_change_pp pp WARN
CP-04 Numeric mean shift Shift > max_numeric_mean_shift_pct WARN
CP-05 New distinct values Values in current not seen in previous INFO
CP-06 Dropped distinct values Values in previous absent from current INFO
CP-07 Non-numeric rate change Increase > max_non_numeric_rate_change_pp pp WARN
CP-08 Column order change Column order differs from previous WARN (CSV) / FAIL (FWF)

Type inference

infer_col_type() classifies each column after whitespace trimming. Rules are applied in priority order:

  1. date — all non-empty values parse as a known date format (%Y-%m-%d, %d/%m/%Y, %m/%d/%Y, %Y%m%d, %d-%m-%Y)
  2. numeric — ≥ type_inference_threshold of non-empty values coerce to numeric
  3. character — everything else
  4. unknown — all values are empty or NA

The threshold defaults to 90%, meaning a column with up to 10% bad values is still classified as numeric, making those bad values visible to QC-11 (non-numeric contamination). It is configurable per dataset via type_inference_threshold in default_rules or rule_overrides:

# raise threshold — stricter, fewer columns classified as numeric
rule_overrides:
  type_inference_threshold: 0.95

# lower threshold — more lenient, columns with more noise still classified numeric
rule_overrides:
  type_inference_threshold: 0.80

Changing the threshold affects QC-06, QC-07, QC-08, QC-11, CP-02, CP-04, CP-05, CP-06, and CP-07. Note that changing it between runs may cause CP-02 to report a spurious type change on columns that sit close to the old and new thresholds, even if the underlying data did not change.

infer_col_type(c("2024-01-15", "2024-06-30"))  # "date"
infer_col_type(c("100", "200", "N/A"))          # "character" (only 67% numeric)
infer_col_type(c("100", "200", "N/A", rep("50", 17)))  # "numeric" (95% numeric)
infer_col_type(c(NA, "", NA))                   # "unknown"
# with a custom threshold:
infer_col_type(c(rep("1", 17), "a", "b", "c"), threshold = 0.80)  # "numeric" (85%)

Running a check

result <- run_dq_check(
  dataset_name = "customer_accounts",
  config_dir   = "path/to/configs",   # contains dqcheckr.yml + customer_accounts.yml
  open_report  = TRUE                 # open HTML in browser when interactive
)

# Console output (always printed):
# [dqcheckr] customer_accounts: FAIL - 1 warning(s), 2 failure(s). Report: reports/...html

result$status       # "PASS", "WARN", or "FAIL"
result$report_path  # full path to the HTML file
result$snapshot_id  # integer row ID in snapshots table

The HTML report contains:

  1. Header — dataset name, file name, run time, overall status badge
  2. File summary — row count, column count, file size, format, encoding
  3. Quality checks table — every check result, failures first
  4. Custom checks table — if configured
  5. Version comparison — what changed since the previous delivery
  6. Historical trend — row count and check outcomes across the last 10 runs (appears once at least two runs have been recorded)
  7. Appendix — full column-level statistics table

Calling individual checks

You can call any check function directly, without going through the full pipeline. Every check takes (df, config) and returns a list of dq_result objects.

library(dqcheckr)

# Build a minimal config programmatically
cfg <- list(
  rules = list(
    max_missing_rate     = 0.05,
    max_non_numeric_rate = 0.01,
    min_row_count        = 0
  ),
  column_rules     = list(
    status = list(allowed_values = c("ACTIVE", "CLOSED"))
  ),
  key_columns      = "id",
  expected_columns = NULL
)

df <- read.csv("data/customer_accounts.csv", colClasses = "character")
df[] <- lapply(df, trimws)

# Run a single check
results <- check_missing_rate(df, cfg)

# Inspect
results[[1]]$status    # "PASS" or "FAIL"
results[[1]]$observed  # e.g. "3.2% missing (4 of 125)"

# Run all single-snapshot checks at once
all_results <- run_qc_checks(df, cfg)

# Overall status
overall_status(all_results)  # "FAIL" > "WARN" > "PASS" > "INFO"

Custom checks

Create a plain .R file that defines custom_checks(df). It receives the trimmed data frame and must return a list of dq_result objects.

# File: custom/customer_accounts_checks.R

custom_checks <- function(df) {
  results <- list()

  # Rule: ACTIVE accounts must not have a zero balance
  active_zero <- df[df$account_status == "ACTIVE" &
                    !is.na(df$account_balance) &
                    df$account_balance == "0", ]
  n <- nrow(active_zero)
  results <- c(results, list(dq_result(
    check_id   = "CUST-01",
    check_name = "No zero-balance active accounts",
    column     = "account_balance",   # enables per-column storage in SQLite
    status     = if (n > 0) "FAIL" else "PASS",
    observed   = sprintf("%d ACTIVE account(s) with balance 0", n),
    message    = if (n > 0)
      sprintf("%d ACTIVE account(s) have a zero balance.", n)
    else
      "No ACTIVE accounts have a zero balance."
  )))

  results
}

Point to it in the dataset config:

custom_checks_file: "custom/customer_accounts_checks.R"

The file is sourced in an isolated environment (not the global session). dq_result() is available inside it. Any error — missing file, undefined function, runtime failure — stops the run with a clear message.

Tip: always set column = when your check is scoped to a specific column. Results with column = NA (the default) appear in the HTML report and contribute to the overall status, but are not written to the column_snapshots SQLite table and therefore do not appear in per-column trend queries.


The snapshot database

Every run writes two rows to a shared SQLite database: one row in snapshots (run-level summary) and one row per column-stat in column_snapshots.

library(DBI)
library(RSQLite)

con <- dbConnect(SQLite(), "data/snapshots.sqlite")

# Recent runs for one dataset
dbGetQuery(con,
  "SELECT id, file_name, overall_status, check_fail_count, run_timestamp
   FROM snapshots
   WHERE dataset_name = 'customer_accounts'
   ORDER BY id DESC
   LIMIT 10")

# Column-level stats for the most recent run
dbGetQuery(con,
  "SELECT column_name, dq_check, value, threshold
   FROM column_snapshots
   WHERE snapshot_id = (
     SELECT MAX(id) FROM snapshots WHERE dataset_name = 'customer_accounts'
   )
   ORDER BY column_name, dq_check")

dbDisconnect(con)

Because all datasets share one database, you can query across them:

dbGetQuery(con,
  "SELECT dataset_name, COUNT(*) AS runs,
          SUM(check_fail_count) AS total_failures
   FROM snapshots
   GROUP BY dataset_name")

The output/ directory (and database file) are created automatically on the first run if they do not exist.


Worked example — Star Wars dataset

The following uses dplyr::starwars exported to CSV and fixed-width files.

# Config directory contains:
#   dqcheckr.yml         — global thresholds
#   starwars_csv.yml     — CSV dataset config
#   starwars_fwf.yml     — FWF dataset config

# Run checks on both formats
result_csv <- run_dq_check("starwars_csv", config_dir = "config", open_report = TRUE)
result_fwf <- run_dq_check("starwars_fwf", config_dir = "config", open_report = TRUE)

The Star Wars config sets max_missing_rate: 0.60 because vehicles (87%) and starships (77%) are naturally sparse — most characters have none. Both columns FAIL, which is the correct finding: they are genuinely incomplete and a consumer should be aware before using them.

Expected console output:

[dqcheckr] starwars_csv: FAIL - 0 warning(s), 2 failure(s). Report: output/reports/starwars_csv_....html
[dqcheckr] starwars_fwf: FAIL - 0 warning(s), 2 failure(s). Report: output/reports/starwars_fwf_....html

Error handling

Situation Behaviour
Config file not found Stop immediately; no output written
Data file not found Stop immediately; no output written
Only one file available Single-file mode; CP checks skipped
File cannot be parsed Stop with filename and parse error
SC-01 / SC-02 violations FAIL recorded; run continues; report written
Custom checks file missing Stop with message
custom_checks() not defined Stop with message
custom_checks() runtime error Stop with R error detail
SQLite write fails Warning emitted; HTML report still written

The package never modifies, moves, or deletes source files.


Design principles

Owned and evolvable. Every check is plain R — readable, modifiable, and not hidden behind a third-party API.

Plain English outputs. Reports are written for data management staff, not engineers. No R code, no package names, no jargon.

Fail loudly. Configuration errors, missing files, and custom-check failures all stop the run with a clear message. Nothing is silently skipped.

Non-destructive. The package reads files and writes reports and database records. Source files are never touched.