doctr
is an R package that helps you check the consistency and the quality of data.
The goal of the package is, in other words, automating as much as possible the task of verifying if everything is ok with a dataset. Like a real doctor, it has functions for examining, diagnosing and assessing the progress of its “patients’”.
Since doctr
was created with the Tidy Tools Manifesto in mind, it works perfectly alongiside the tidyverse.
One of doctr
’s main fuctions is diagnose()
, which runs tests (nicknamed “exams”) on a table to check if its variables pass certain standards and fit certain assumptions.
After running diagnose()
, we can use the issues()
function to get a report about the results of the exams.
Let’s see how this works with an example dataset: ggplot2::mpg
.
# Runninng exams on table
diagnostics <- diagnose(mpg)
Now the diagnostics
object contains all the errors found while diagnosing the mpg
dataset. By using issues()
we can get human-readable reports on these errors.
# Getting summary of diagnostics
issues(diagnostics)
## No issues found in 'manufacturer'
## No issues found in 'model'
## No issues found in 'displ'
## No issues found in 'year'
## No issues found in 'cyl'
## No issues found in 'trans'
## No issues found in 'drv'
## No issues found in 'cty'
## No issues found in 'hwy'
## No issues found in 'fl'
## No issues found in 'class'
Since mpg
is already very well-formed, no issues were found. I’m going to artificially break the table so we can see what issues look like (I’m also turning verbose
on so the function shows exactly what the issues were).
# Manually breaking mpg
mpg2 <- mpg %>%
mutate(year = as.Date(year, origin = "1970-01-01"))
# Getting summary of diagnostics
mpg2 %>% diagnose() %>% issues(verbose = TRUE)
## No issues found in 'manufacturer'
## No issues found in 'model'
## No issues found in 'displ'
## Issues found in 'year'
## Data isn't of type factor
## No issues found in 'cyl'
## No issues found in 'trans'
## No issues found in 'drv'
## No issues found in 'cty'
## No issues found in 'hwy'
## No issues found in 'fl'
## No issues found in 'class'
As we can see, diagnose()
was able to parse year
, but it alerted us that it isn’t a character variable.
diagnose()
by default uses a function called guess_exams()
to generate the exams it is going to run on a given table. This special function grabs a sample of the table and tries to assign each of its variables to one of the types below (from most to least restrictive):
percentage
: must be only numeric values between 0 and 1money
: must be positive values and have at most 2 decimal placescount
: must be positive integersquantity
: must be positive valuescontinuous
: must be numeric valuescategorical
: must be factorscharacter
: must be textIf you run guess_exams()
by yourself, you can customize the exams it generates and then pass them as an argument to diganose()
so that it uses you custom exams.
Let’s see how this works in practice.
exams <- guess_exams(mpg)
cols | funs | max_na | min_val | max_val | max_dec_places | min_unq | max_unq | least_frec_cls |
---|---|---|---|---|---|---|---|---|
manufacturer | character | |||||||
model | character | |||||||
displ | quantity | |||||||
year | count | |||||||
cyl | count | |||||||
trans | character | |||||||
drv | character | |||||||
cty | count | |||||||
hwy | count | |||||||
fl | character | |||||||
class | character |
Each columns in exams
can be filled with a parameter that is going to be used by diagnose()
to find problems in mpg
. These are the meanings of these parameters and to what variable types they apply (for more information on types, run vignette(doctr_examine)
):
parameter | numeric | text | factor | description |
---|---|---|---|---|
funs |
x | x | x | which type should be used for base exams (percentage , money , etc.) |
max_na |
x | x | x | maximum % of NAs |
min_val , max_val |
x | minimum and maximum values | ||
max_dec_places |
x | maximum number of decimal places | ||
min_unq , max_unq |
x | x | minimum and maximum number of unique classes | |
least_freq_cls |
x | x | minimum % of the total a class can represent |
Let’s customize these exams and use them with diagnose()
.
# Setting some arbritraty maximum and minimum values
exams$max_val[8] <- 30
exams$min_val[9] <- 15
# Setting least frequent class
exams$least_frec_cls[10] <- 0.2
# Setting maximum unique classes
exams$max_unq[1] <- 10
# Use custom exams to diagnose table
mpg %>% diagnose(exams) %>% issues()
## Issues found in 'manufacturer'
## No issues found in 'model'
## No issues found in 'displ'
## No issues found in 'year'
## No issues found in 'cyl'
## No issues found in 'trans'
## No issues found in 'drv'
## Issues found in 'cty'
## Issues found in 'hwy'
## Issues found in 'fl'
## No issues found in 'class'
Using the i
parameter of issues()
paired with verbose
, we can pass the name or index of a column in order to get only the issues associated with it.
# Use custom exams to diagnose table
diagnostics <- diagnose(mpg, exams)
# Get results for 1st column
issues(diagnostics, i = 1, verbose = TRUE)
## Issues found in 'manufacturer'
## There are more than 10 unique classes
# Get results for fl column
issues(diagnostics, i = "fl", verbose = TRUE)
## Issues found in 'fl'
## There are 3 classes that represent less than 20% of the total