Audit Trail Walkthrough

library(tidyaudit)
library(dplyr)

What is an audit trail?

When building data pipelines, it’s easy to lose track of what happens at each step. How many rows were dropped by that filter? Did the join introduce duplicates? Which columns have missing values now?

tidyaudit’s audit trail captures metadata-only snapshots at each step of a pipe (row counts, column counts, NA totals, numeric summaries) without storing the data itself. This gives you a lightweight, structured record of your pipeline’s behavior. Taps also accept custom functions, so you can record domain-specific diagnostics alongside the built-in ones.
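To make "metadata-only" concrete, here is a minimal base R sketch of the kind of summary a snapshot might hold. The helper snapshot_meta() and the toy data frame are hypothetical and are not part of tidyaudit; the package's actual snapshot structure may differ.

```r
# Hypothetical helper (not a tidyaudit function): summarise a data frame
# without keeping any of its rows.
snapshot_meta <- function(df) {
  is_num <- vapply(df, is.numeric, logical(1))
  list(
    rows      = nrow(df),
    cols      = ncol(df),
    nas       = sum(is.na(df)),
    num_means = vapply(df[is_num], mean, numeric(1), na.rm = TRUE)
  )
}

# Toy input: one numeric column with a missing value, one character column
toy  <- data.frame(x = c(1, 2, NA), y = c("a", "b", "c"))
meta <- snapshot_meta(toy)
str(meta)
```

The result is a handful of numbers regardless of how many rows the input has, which is why snapshots stay cheap to take.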

Building a basic trail

Start by creating a trail object and inserting audit_tap() calls into your pipeline. Each tap records a snapshot and passes the data through unchanged.

# Sample data
orders <- data.frame(
  id       = 1:20,
  customer = rep(c("Alice", "Bob", "Carol", "Dan", "Eve"), 4),
  amount   = c(150, 200, 50, 300, 75, 120, 400, 90, 250, 60,
               180, 210, 45, 320, 85, 130, 380, 95, 270, 55),
  status   = rep(c("complete", "pending", "complete", "cancelled", "complete"), 4)
)

trail <- audit_trail("order_pipeline")

result <- orders |>
  audit_tap(trail, "raw") |>
  filter(status == "complete") |>
  audit_tap(trail, "complete_only") |>
  mutate(tax = amount * 0.1) |>
  audit_tap(trail, "with_tax")

Now print the trail to see the snapshot timeline:

print(trail)
#> 
#> ── Audit Trail: "order_pipeline" ───────────────────────────────────────────────
#> Created: 2026-02-22 09:37:05
#> Snapshots: 3
#> 
#>   #  Label          Rows  Cols  NAs  Type
#>   ─  ─────────────  ────  ────  ───  ────
#>   1  raw              20     4    0  tap 
#>   2  complete_only    12     4    0  tap 
#>   3  with_tax         12     5    0  tap
#> 
#> Changes:
#>   raw → complete_only: -8 rows, = cols, = NAs
#>   complete_only → with_tax: = rows, +1 cols, = NAs

The timeline shows row counts, column counts, NA totals, and change summaries between consecutive steps. You can see exactly how many rows each filter removed and when columns were added.
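The "Changes" lines are differences between consecutive snapshots. As a rough illustration, the same deltas can be derived in base R from the counts printed above. The snaps and changes data frames are illustrative names only, not tidyaudit objects.

```r
# Counts copied from the printed trail for "order_pipeline"
snaps <- data.frame(
  label = c("raw", "complete_only", "with_tax"),
  rows  = c(20, 12, 12),
  cols  = c(4, 4, 5),
  nas   = c(0, 0, 0)
)

# Difference between each snapshot and the one before it
changes <- data.frame(
  from   = head(snaps$label, -1),
  to     = tail(snaps$label, -1),
  d_rows = diff(snaps$rows),
  d_cols = diff(snaps$cols),
  d_nas  = diff(snaps$nas)
)
print(changes)
```

A zero here corresponds to the "=" shown in the trail output.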

Operation-aware taps

Plain audit_tap() records what the data looks like, but it can’t tell you why it changed. Operation-aware taps such as left_join_tap() and filter_tap() perform the operation and record enriched diagnostics about it in the same step.

Join taps

Replace a dplyr::left_join() followed by audit_tap() with a single left_join_tap() call to capture match rates, relationship type, and duplicate key information:

customers <- data.frame(
  customer = c("Alice", "Bob", "Carol", "Dan"),
  region   = c("East", "West", "East", "North")
)

trail2 <- audit_trail("join_pipeline")

result2 <- orders |>
  audit_tap(trail2, "raw") |>
  left_join_tap(customers, by = "customer",
                .trail = trail2, .label = "with_region")

print(trail2)
#> 
#> ── Audit Trail: "join_pipeline" ────────────────────────────────────────────────
#> Created: 2026-02-22 09:37:05
#> Snapshots: 2
#> 
#>   #  Label        Rows  Cols  NAs  Type                                
#>   ─  ───────────  ────  ────  ───  ────────────────────────────────────
#>   1  raw            20     4    0  tap                                 
#>   2  with_region    20     5    4  left_join (many-to-one, 80% matched)
#> 
#> Changes:
#>   raw → with_region: = rows, +1 cols, +4 NAs

The Type column now shows the join type, relationship, and match rate — all without leaving the pipe.
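The figures in that row can be checked by hand. The following base R sketch recomputes them from the join keys alone, assuming that "matched" means the share of left-hand rows with a partner on the right; how tidyaudit derives them internally is not shown here.

```r
# Join keys from the two tables in the example above
order_keys    <- rep(c("Alice", "Bob", "Carol", "Dan", "Eve"), 4)
customer_keys <- c("Alice", "Bob", "Carol", "Dan")

# Share of left-hand rows that find a partner on the right
match_rate <- mean(order_keys %in% customer_keys)

# Keys repeat on the left but are unique on the right: many-to-one
left_dup  <- anyDuplicated(order_keys) > 0
right_dup <- anyDuplicated(customer_keys) > 0

# Each unmatched row gets NA in the single joined column (region)
unmatched <- sum(!order_keys %in% customer_keys)

cat("match rate:", match_rate, "| unmatched rows:", unmatched, "\n")
```

The four unmatched rows are Eve's orders, which explains the +4 NAs in the trail.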

All six dplyr join types are supported: left_join_tap(), right_join_tap(), inner_join_tap(), full_join_tap(), anti_join_tap(), semi_join_tap().

Filter taps

filter_tap() keeps matching rows (like dplyr::filter()) while recording how many rows were dropped:

trail3 <- audit_trail("filter_pipeline")

result3 <- orders |>
  audit_tap(trail3, "raw") |>
  filter_tap(status == "complete",
             .trail = trail3, .label = "complete_only") |>
  filter_tap(amount > 100,
             .trail = trail3, .label = "high_value",
             .stat = amount)
#> ℹ filter_tap: status == "complete"
#> Dropped 8 of 20 rows (40.0%)
#> ℹ filter_tap: amount > 100
#> Dropped 8 of 12 rows (66.7%)
#> Stat amount: dropped 555 of 1,135

print(trail3)
#> 
#> ── Audit Trail: "filter_pipeline" ──────────────────────────────────────────────
#> Created: 2026-02-22 09:37:05
#> Snapshots: 3
#> 
#>   #  Label          Rows  Cols  NAs  Type                          
#>   ─  ─────────────  ────  ────  ───  ──────────────────────────────
#>   1  raw              20     4    0  tap                           
#>   2  complete_only    12     4    0  filter (dropped 8 rows, 40%)  
#>   3  high_value        4     4    0  filter (dropped 8 rows, 66.7%)
#> 
#> Changes:
#>   raw → complete_only: -8 rows, = cols, = NAs
#>   complete_only → high_value: -8 rows, = cols, = NAs

The .stat argument tracks a numeric column through the filter, reporting how much of the total was dropped — useful for financial pipelines where you want to know the monetary impact of each filter.
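As a sanity check on the "Stat amount" line, the same totals can be recomputed in base R for the second filter. This is only the arithmetic behind the message, not tidyaudit's implementation.

```r
amount <- c(150, 200, 50, 300, 75, 120, 400, 90, 250, 60,
            180, 210, 45, 320, 85, 130, 380, 95, 270, 55)
status <- rep(c("complete", "pending", "complete", "cancelled", "complete"), 4)

# Rows that reach the second filter
complete <- amount[status == "complete"]

# Total of the tracked column before the filter, and the part it removes
total_before <- sum(complete)
dropped      <- sum(complete[!(complete > 100)])
share        <- dropped / total_before

cat(sprintf("dropped %s of %s\n",
            format(dropped, big.mark = ","),
            format(total_before, big.mark = ",")))
```

Here the filter removes two thirds of the rows but a little under half of the total amount, which is the distinction .stat is meant to surface.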

filter_out_tap() works the same way but drops matching rows (the inverse).

Comparing snapshots

audit_diff() gives you a detailed before/after comparison between any two snapshots in the trail:

audit_diff(trail3, "raw", "high_value")
#> 
#> ── Audit Diff: "raw" → "high_value" ──
#> 
#>   Metric  Before  After  Delta
#>   ──────  ──────  ─────  ─────
#>   Rows        20      4    -16
#>   Cols         4      4      =
#>   NAs          0      0      =
#> 
#> ℹ No columns added or removed
#> 
#> Numeric shifts (common columns):
#>     Column  Mean before  Mean after   Shift
#>     ──────  ───────────  ──────────  ──────
#>     id            10.50         8.5      -2
#>     amount       173.25       145.0  -28.25

This shows row, column, and NA deltas, any columns added or removed, and the shift in the mean of each numeric column present in both snapshots.
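The Shift column is the mean after minus the mean before. A short base R sketch for the amount column, using the same data and filters as the trail, illustrates the calculation:

```r
amount <- c(150, 200, 50, 300, 75, 120, 400, 90, 250, 60,
            180, 210, 45, 320, 85, 130, 380, 95, 270, 55)
status <- rep(c("complete", "pending", "complete", "cancelled", "complete"), 4)

# Mean at the "raw" snapshot and at the "high_value" snapshot
before <- mean(amount)
after  <- mean(amount[status == "complete" & amount > 100])

# Negative shift: the surviving rows have a lower average amount
shift <- after - before
cat("before:", before, "after:", after, "shift:", shift, "\n")
```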

Full audit report

audit_report() prints the complete trail summary plus all consecutive diffs in one call:

audit_report(trail3)
#> ── Audit Report: "filter_pipeline" ─────────────────────────────────────────────
#> Created: 2026-02-22 09:37:05
#> Total snapshots: 3
#> 
#> ── Audit Trail: "filter_pipeline" ──────────────────────────────────────────────
#> Created: 2026-02-22 09:37:05
#> Snapshots: 3
#> 
#>   #  Label          Rows  Cols  NAs  Type                          
#>   ─  ─────────────  ────  ────  ───  ──────────────────────────────
#>   1  raw              20     4    0  tap                           
#>   2  complete_only    12     4    0  filter (dropped 8 rows, 40%)  
#>   3  high_value        4     4    0  filter (dropped 8 rows, 66.7%)
#> 
#> Changes:
#>   raw → complete_only: -8 rows, = cols, = NAs
#>   complete_only → high_value: -8 rows, = cols, = NAs
#> 
#> ── Detailed Diffs ──────────────────────────────────────────────────────────────
#> 
#> ── Audit Diff: "raw" → "complete_only" ──
#> 
#>   Metric  Before  After  Delta
#>   ──────  ──────  ─────  ─────
#>   Rows        20     12     -8
#>   Cols         4      4      =
#>   NAs          0      0      =
#> 
#> ℹ No columns added or removed
#> 
#> Numeric shifts (common columns):
#>     Column  Mean before  Mean after   Shift
#>     ──────  ───────────  ──────────  ──────
#>     id            10.50       10.50       0
#>     amount       173.25       94.58  -78.67
#> 
#> ── Audit Diff: "complete_only" → "high_value" ──
#> 
#>   Metric  Before  After  Delta
#>   ──────  ──────  ─────  ─────
#>   Rows        12      4     -8
#>   Cols         4      4      =
#>   NAs          0      0      =
#> 
#> ℹ No columns added or removed
#> 
#> Numeric shifts (common columns):
#>     Column  Mean before  Mean after   Shift
#>     ──────  ───────────  ──────────  ──────
#>     id            10.50         8.5      -2
#>     amount        94.58       145.0  +50.42
#> 
#> ── Final Snapshot Profile ──────────────────────────────────────────────────────
#> 
#> high_value (4 rows x 4 cols)
#> Column types: 2 character, 1 integer, 1 numeric
#> ✔ No missing values
#> 
#> Numeric summary:
#>     Column  Min   Mean  Median  Max
#>     ──────  ───  ─────  ──────  ───
#>     id        1    8.5     8.5   16
#>     amount  120  145.0   140.0  180
#> 
#> ────────────────────────────────────────────────────────────────────────────────

Tips

Custom diagnostics

Pass a named list of functions via .fns to compute custom diagnostics at any tap:

trail4 <- audit_trail("custom_example")
result4 <- orders |>
  audit_tap(trail4, "raw", .fns = list(
    mean_amount = ~mean(.x$amount),
    n_customers = ~length(unique(.x$customer))
  ))

audit_report(trail4)
#> ── Audit Report: "custom_example" ──────────────────────────────────────────────
#> Created: 2026-02-22 09:37:06
#> Total snapshots: 1
#> 
#> ── Audit Trail: "custom_example" ───────────────────────────────────────────────
#> Created: 2026-02-22 09:37:06
#> Snapshots: 1
#> 
#>   #  Label  Rows  Cols  NAs  Type
#>   ─  ─────  ────  ────  ───  ────
#>   1  raw      20     4    0  tap
#> 
#> ── Custom Diagnostics ──────────────────────────────────────────────────────────
#> 
#> raw:
#> mean_amount: 173.25
#> n_customers: 5
#> 
#> ── Final Snapshot Profile ──────────────────────────────────────────────────────
#> 
#> raw (20 rows x 4 cols)
#> Column types: 2 character, 1 integer, 1 numeric
#> ✔ No missing values
#> 
#> Numeric summary:
#>     Column  Min    Mean  Median  Max
#>     ──────  ───  ──────  ──────  ───
#>     id        1   10.50    10.5   20
#>     amount   45  173.25   140.0  400
#> 
#> ────────────────────────────────────────────────────────────────────────────────
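For readers unfamiliar with the ~ shorthand, the two diagnostics above compute the same values as the ordinary functions below. This base R sketch applies a named list of functions to the data by hand. It illustrates the idea only: the names fns and diagnostics are made up for this example, and whether .fns also accepts plain functions is not confirmed here.

```r
orders <- data.frame(
  customer = rep(c("Alice", "Bob", "Carol", "Dan", "Eve"), 4),
  amount   = c(150, 200, 50, 300, 75, 120, 400, 90, 250, 60,
               180, 210, 45, 320, 85, 130, 380, 95, 270, 55)
)

# Plain-function equivalents of the two formulas passed to .fns
fns <- list(
  mean_amount = function(d) mean(d$amount),
  n_customers = function(d) length(unique(d$customer))
)

# Apply each function to the data and keep the named results
diagnostics <- lapply(fns, function(f) f(orders))
str(diagnostics)
```

Each function receives the whole data frame and returns a single value, which keeps the stored diagnostics small.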

NULL-trail mode

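All tap functions work without a trail. When .trail = NULL (the default), a tap called without diagnostic arguments behaves like the plain dplyr verb, while a tap called with a diagnostic argument such as .stat prints its diagnostics directly to the console: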

# Plain filter — no diagnostics
orders |> filter_tap(amount > 100) |> nrow()
#> [1] 12

# Diagnostics without a trail
orders |> filter_tap(amount > 100, .stat = amount) |> invisible()
#> filter_keep(.data, amount > 100)
#> Dropped 8 of 20 rows (40.00%).
#> Dropped 555 of 3,465 for amount (16.02%).

This makes it easy to add quick diagnostics to any pipeline without setting up a full trail.