doctr is an R package that helps you check the consistency and quality of data.
The goal of the package is, in other words, to automate as much as possible the task of verifying that everything is ok with a dataset. Like a real doctor, it has functions for examining, diagnosing and assessing the progress of its “patients”.
Since doctr was created with the Tidy Tools Manifesto in mind, it works well alongside the tidyverse.
One of doctr’s main functions is examine(), which gets the summary statistics for every column of a table, varying the summarization strategy depending on the type of variable.
After running examine(), we can use the report_*() family of functions to get the different types of reports back: report_num() is used for numeric variables, report_txt() for text variables and report_fct() for factor variables.
Let’s see how this works with an example dataset: ggplot2::mpg. For the sake of this example, I’m going to transform the class column into a factor.
library(doctr)
mpg <- ggplot2::mpg
# Converting class to factor
mpg$class <- as.factor(mpg$class)
Now we have 3 main types of variables represented in this table: numeric, text and factor. When we run examine(), the function treats each column differently depending on which of these groups it fits into; if it can’t classify the column, examine() always defaults to text.
# Creating the EDA
eda <- examine(mpg)
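To make the per-type treatment more concrete, here is a rough base R sketch of how columns could be sorted into the three groups, with text as the fallback. The helper classify_column() and the small cars table are hypothetical and only stand in for the idea; this is not how doctr is implemented internally.

```r
# Hypothetical helper (not part of doctr): assign a column to one of the
# three groups, falling back to "text" when it is neither factor nor numeric.
classify_column <- function(x) {
  if (is.factor(x)) {
    "factor"
  } else if (is.numeric(x)) {
    "numeric"
  } else {
    "text"
  }
}

# Small made-up table standing in for mpg
cars <- data.frame(
  displ = c(1.8, 2.0, 5.7),
  model = c("a4", "a4", "corvette"),
  class = factor(c("compact", "compact", "2seater")),
  stringsAsFactors = FALSE
)

# One label per column
column_types <- vapply(cars, classify_column, character(1))
print(column_types)
```

Under this sketch a date or logical column would also land in the text group, which mirrors the "defaults to text" behaviour described above only by assumption.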
With the eda object we can get all 3 exploratory analyses.
# Getting report of numeric variables
report_num(eda)
## # A tibble: 5 × 26
## name len min max `1%` `5%` `10%` `20%` `30%` `40%`
## <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 displ 234 1.6 7 1.6 1.8 2.0 2.2 2.5 2.8
## 2 year 234 1999.0 2008 1999.0 1999.0 1999.0 1999.0 1999.0 1999.0
## 3 cyl 234 4.0 8 4.0 4.0 4.0 4.0 4.0 6.0
## 4 cty 234 9.0 35 9.0 11.0 11.0 13.0 14.0 15.0
## 5 hwy 234 12.0 44 12.0 15.0 16.3 17.0 19.0 22.0
## # ... with 16 more variables: `50%` <dbl>, `60%` <dbl>, `70%` <dbl>,
## # `80%` <dbl>, `90%` <dbl>, `95%` <dbl>, `99%` <dbl>, mean <dbl>,
## # sd <dbl>, na <dbl>, val <dbl>, neg <dbl>, zero <dbl>, pos <dbl>,
## # unq <int>, mdp <dbl>
# Getting report of text variables
report_txt(eda)
## # A tibble: 5 × 25
## name len min max `1%` `5%` `10%` `20%` `30%` `40%` `50%`
## <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 manufacturer 234 4 10 4 4.0 4 4 5 5 6
## 2 model 234 2 22 2 4.3 5 6 7 10 11
## 3 trans 234 8 10 8 8.0 8 8 8 8 8
## 4 drv 234 1 1 1 1.0 1 1 1 1 1
## 5 fl 234 1 1 1 1.0 1 1 1 1 1
## # ... with 14 more variables: `60%` <dbl>, `70%` <dbl>, `80%` <dbl>,
## # `90%` <dbl>, `95%` <dbl>, `99%` <dbl>, mean <dbl>, sd <dbl>, na <dbl>,
## # val <dbl>, unq <int>, asc <dbl>, ltr <dbl>, num <dbl>
# Getting report of factor variables
report_fct(eda)
## # A tibble: 7 × 4
## name data cnt frq
## <chr> <fctr> <int> <dbl>
## 1 class 2seater 5 0.02136752
## 2 class compact 47 0.20085470
## 3 class midsize 41 0.17521368
## 4 class minivan 11 0.04700855
## 5 class pickup 33 0.14102564
## 6 class subcompact 35 0.14957265
## 7 class suv 62 0.26495726
The tables produced are very wide, so I won’t show them here in their entirety. The names of the columns in the reports are codes for each summary statistic; here’s what each of them means and in which reports it comes up:
column | numeric | text | factor | description |
---|---|---|---|---|
name | x | x | x | name of the variable |
min, max | x | x | | minimum and maximum value/length |
1%, …, 99% | x | x | | value/length percentiles |
mean | x | x | | mean value/length |
sd | x | x | | value/length standard deviation |
na, val | x | x | | percentage of missing and non-missing entries |
neg, zero, pos | x | | | percentage of negative, zero and positive values |
unq | x | x | | count of unique values/texts |
mdp | x | | | maximum number of decimal places |
asc | | x | | equals 1 if the text is identified as ASCII |
ltr, num | | x | | percentage of text that is identified as letters and numbers |
data | | | x | each factor level |
cnt, frq | | | x | count and frequency of each level |
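As a rough illustration of what some of these codes stand for, the base R sketch below computes a few of them by hand for a toy vector. The helper summarise_numeric() is hypothetical and is not doctr's implementation; in particular, it is an assumption that na and val are shares of all entries, that neg, zero and pos are shares of the non-missing entries, and that shares are expressed between 0 and 1.

```r
# Hypothetical helper (not doctr's code): a few of the numeric statistics
# from the table, computed with base R only.
summarise_numeric <- function(x) {
  present <- x[!is.na(x)]
  c(
    len  = length(x),               # number of entries
    min  = min(present),
    max  = max(present),
    mean = mean(present),
    sd   = sd(present),
    na   = mean(is.na(x)),          # share of missing entries (assumed 0-1 scale)
    val  = mean(!is.na(x)),         # share of non-missing entries
    neg  = mean(present < 0),       # shares among non-missing values (assumption)
    zero = mean(present == 0),
    pos  = mean(present > 0),
    unq  = length(unique(present))  # count of distinct non-missing values
  )
}

x <- c(-2, 0, 0, 3.5, 7, NA)
stats <- summarise_numeric(x)
print(stats)

# Percentile columns such as 1%, 50% and 99% correspond to quantiles
print(quantile(x, probs = c(0.01, 0.5, 0.99), na.rm = TRUE))

# For text columns the same statistics are taken over string lengths
text_lengths <- nchar(c("audi", "volkswagen", "a4"))
print(range(text_lengths))
```

The last lines match what the text report above shows for manufacturer, whose min and max of 4 and 10 are the lengths of the shortest and longest names.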
As with a group_by() statement, it is also possible to divide the table before getting the EDA. We do this with the group argument of the examine() function and then collect the results with the same argument of the report_*() family.
# Creating the EDA (grouped by the class variable)
eda <- examine(mpg, group = "class")
For examine(), group receives the name or index of a column. When collecting the reports, group receives the level of the grouped variable for which we want the results.
# Getting report of numeric variables for compact cars
report_num(eda, group = "compact")
## # A tibble: 5 × 26
## name len min max `1%` `5%` `10%` `20%` `30%` `40%`
## <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 displ 47 1.8 3.3 1.80 1.8 1.8 1.92 2 2
## 2 year 47 1999.0 2008.0 1999.00 1999.0 1999.0 1999.00 1999 1999
## 3 cyl 47 4.0 6.0 4.00 4.0 4.0 4.00 4 4
## 4 cty 47 15.0 33.0 15.00 16.0 16.6 18.00 18 19
## 5 hwy 47 23.0 44.0 23.46 24.3 25.0 25.20 26 27
## # ... with 16 more variables: `50%` <dbl>, `60%` <dbl>, `70%` <dbl>,
## # `80%` <dbl>, `90%` <dbl>, `95%` <dbl>, `99%` <dbl>, mean <dbl>,
## # sd <dbl>, na <dbl>, val <dbl>, neg <dbl>, zero <dbl>, pos <dbl>,
## # unq <int>, mdp <dbl>
# Getting report of text variables for SUVs
report_txt(eda, group = "suv")
## # A tibble: 5 × 25
## name len min max `1%` `5%` `10%` `20%` `30%` `40%` `50%`
## <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 manufacturer 62 4 10 4 4 4 4 5 6 6
## 2 model 62 11 22 11 11 11 11 12 12 13
## 3 trans 62 8 10 8 8 8 8 8 8 8
## 4 drv 62 1 1 1 1 1 1 1 1 1
## 5 fl 62 1 1 1 1 1 1 1 1 1
## # ... with 14 more variables: `60%` <dbl>, `70%` <dbl>, `80%` <dbl>,
## # `90%` <dbl>, `95%` <dbl>, `99%` <dbl>, mean <dbl>, sd <dbl>, na <dbl>,
## # val <dbl>, unq <int>, asc <dbl>, ltr <dbl>, num <dbl>
# Getting report of factor variables for midsize cars
report_fct(eda, group = "midsize")
## # A tibble: 1 × 4
## name data cnt frq
## <chr> <fctr> <int> <dbl>
## 1 class midsize 41 1
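The split-then-summarize idea behind the group argument can be imitated in base R, which may help when reading the grouped reports. This is only a sketch on a made-up cars table, not a description of what doctr does internally.

```r
# Made-up table standing in for mpg
cars <- data.frame(
  class = factor(c("compact", "compact", "suv", "suv", "suv")),
  hwy   = c(29, 31, 20, 17, 19)
)

# Divide the table by the levels of the grouping column
by_class <- split(cars, cars$class)

# Summarize each piece separately (here: minimum and maximum of hwy)
hwy_range <- lapply(by_class, function(d) range(d$hwy))

# Pick the results for a single level, much like group = "compact"
print(hwy_range[["compact"]])
```

Within a single group the grouping column has only one level left, which is consistent with the midsize factor report showing one row with a frequency of 1.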