| Title: | A Set of Tools for Exploratory Data Analysis |
| Version: | 0.1.0 |
| Description: | Functions to profile a dataset, identify anomalies (special values, outliers, and inliers, defined as data values that are repeated unusually often), and compare data subsets with respect to either numerical or categorical variable distributions. |
| License: | MIT + file LICENSE |
| Encoding: | UTF-8 |
| RoxygenNote: | 7.2.3 |
| Suggests: | testthat (≥ 3.0.0) |
| Config/testthat/edition: | 3 |
| Imports: | data.table, graphics, PropCIs |
| Depends: | R (≥ 2.10) |
| LazyData: | true |
| NeedsCompilation: | no |
| Packaged: | 2026-02-17 17:12:49 UTC; ronal |
| Author: | Ronald Pearson [aut, cre] |
| Maintainer: | Ronald Pearson <ronald.k.pearson@gmail.com> |
| Repository: | CRAN |
| Date/Publication: | 2026-02-20 08:30:13 UTC |
Synthetic accounting dataset example, from Excel
Description
Small dataset illustrating various unexpected data formats arising from the accounting data format in an Excel spreadsheet. Variables that appear to be numeric based on their names are represented as character strings with embedded commas, dollar signs, percent signs, and parentheses to indicate negative numbers.
Usage
AccountingExample
Format
AccountingExample
data frame with 8 rows and 6 columns:
- Year
Four-digit integer year, with missing values coded NA
- Quarter
Two-character quarter designation, Q1 through Q4
- CurrentTotal
Dollar amount with dollar signs, commas, and decimal points
- PriorYearTotal
Dollar amount with dollar signs, commas, and decimal points
- YOYchange
Dollar amount with dollar signs, commas, decimal points and parentheses to indicate negative values
- PctChange
Ratio of YOYchange to CurrentTotal, converted to a percentage, with percent sign
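Strings in these accounting formats must be cleaned before they can be treated as numbers. The following is a minimal base R sketch of one way to do that; the helper name parseAccounting and the sample strings are hypothetical and are not taken from the package or from the dataset.

```r
# Hypothetical helper (not part of the package): convert accounting-style
# strings such as "$1,234.50", "($200.00)" or "12.5%" to numeric values.
parseAccounting <- function(s) {
  s <- trimws(s)
  # Surrounding parentheses denote a negative amount
  isNeg <- grepl("^\\(.*\\)$", s)
  # Drop dollar signs, commas, percent signs, and parentheses
  cleaned <- gsub("[$,%()]", "", s)
  val <- suppressWarnings(as.numeric(cleaned))
  # Percentages are left in percentage points (12.5% becomes 12.5)
  ifelse(isNeg, -val, val)
}

# Illustrative strings only; these are not rows of AccountingExample
out <- parseAccounting(c("$1,234.50", "($200.00)", "12.5%", NA))
out
```

Anything that still fails to parse after cleaning comes back as NA, so it is worth counting NA values before and after conversion.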
Compute binomial probabilities over categorical variable levels
Description
Compute binomial probabilities over categorical variable levels
Usage
BinomialCIsByCategorical(
DF,
binVar,
catVar,
targetLevel,
keepNA = "ifany",
keepLevels = NULL,
cLevel = 0.95
)
Arguments
DF |
A data frame containing |
binVar |
Binary variable for binomial probabilities |
catVar |
Categorical variable over which binomial probabilities are computed |
targetLevel |
Positive response level for |
keepNA |
Missing data handling option: |
keepLevels |
Optional subset of |
cLevel |
Confidence level for binomial probabilities (default 0.95) |
Value
Data frame with one row for each catVar level in the analysis and these 6 columns:
- Level
the catVar level
- nWith
the number of records with catVar equal to Level and binVar equal to targetLevel
- nTotal
the total number of records with catVar equal to Level
- pEst
the estimated probability that binVar equals targetLevel
- loCI
the lower cLevel confidence limit for pEst
- upCI
the upper cLevel confidence limit for pEst
Examples
catVar <- c(rep("A", 100), rep("B", 100), rep("C", 100))
binVar <- c(rep(0,80),rep(1,20), rep(0,50),rep(1,50), rep(0,20),rep(1,80))
DF <- data.frame(catVar = catVar, binVar = binVar)
BinomialCIsByCategorical(DF, "binVar", "catVar", 1)
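To make the structure of the returned columns concrete, here is a rough base R sketch that builds a similar table. It uses the exact (Clopper-Pearson) interval from stats::binom.test, which is an assumption made for illustration: this page does not say which interval the package computes, so the limits may differ from the package output. The function name binomCISketch is hypothetical, and the sketch assumes there are no missing values.

```r
# Hypothetical sketch, not the package implementation: per-level binomial
# estimates with Clopper-Pearson limits from stats::binom.test (assumed).
binomCISketch <- function(binVec, catVec, targetLevel, cLevel = 0.95) {
  lvls <- sort(unique(catVec))
  rows <- lapply(lvls, function(lv) {
    inLevel <- catVec == lv
    nWith <- sum(binVec[inLevel] == targetLevel)
    nTotal <- sum(inLevel)
    ci <- stats::binom.test(nWith, nTotal, conf.level = cLevel)$conf.int
    data.frame(Level = lv, nWith = nWith, nTotal = nTotal,
               pEst = nWith / nTotal, loCI = ci[1], upCI = ci[2])
  })
  do.call(rbind, rows)
}

catVec <- c(rep("A", 100), rep("B", 100), rep("C", 100))
binVec <- c(rep(0, 80), rep(1, 20), rep(0, 50), rep(1, 50), rep(0, 20), rep(1, 80))
res <- binomCISketch(binVec, catVec, 1)
res
```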
Compare categorical level distribution between subsets
Description
Given two data subsets, defined by indexA and indexB, and a
categorical variable catVar, compute the probability that each
level of catVar appears in each subset and the Agresti-Caffo
confidence interval for the difference in these probabilities,
based on the PropCIs::wald2ci function.
Usage
CompareCategoricalLevels(
DF,
catVar,
indexA,
indexB = NULL,
cLevel = 0.95,
includeNA = "ifany"
)
Arguments
DF |
A data frame containing |
catVar |
Categorical variable whose distribution is compared between two subsets |
indexA |
Defines records in the first subset |
indexB |
Defines records in the second subset; default NULL uses all records not in the first subset |
cLevel |
Confidence level for estimated probability differences |
includeNA |
Missing data handling option: |
Value
Data frame with one row for each catVar level and these 10 columns:
- Level
the catVar level
- xA
the number of times Level appears in the first subset
- nA
the total records in the first subset
- xB
the number of times Level appears in the second subset
- nB
the total records in the second subset
- pA
the estimated probability that Level appears in the first subset
- pB
the estimated probability that Level appears in the second subset
- loCI
the lower confidence limit on the difference pA - pB
- upCI
the upper confidence limit on the difference pA - pB
- signif
a logical indicator of whether pA - pB is significantly different from zero
Examples
catVar <- c(rep("a", 100), rep("b", 100), rep("c", 100))
auxVar <- c(rep("Set1", 30), rep("Set2", 70),
rep("Set1", 50), rep("Set2", 50),
rep("Set1", 90), rep("Set2", 10))
DF <- data.frame(catVar = catVar, auxVar = auxVar)
indexA <- which(DF$auxVar == "Set1")
CompareCategoricalLevels(DF, "catVar", indexA)
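The Agresti-Caffo interval adds one success and one failure to each subset and then applies the usual Wald formula to the adjusted proportions. The base R sketch below writes that out directly rather than calling PropCIs; the function name agrestiCaffoSketch is hypothetical. The counts are those of level "a" in the example above (30 of 170 records in Set1, 70 of 130 in Set2).

```r
# Hypothetical sketch of the Agresti-Caffo interval for pA - pB:
# add 1 success and 1 failure to each subset, then use the Wald formula.
agrestiCaffoSketch <- function(xA, nA, xB, nB, cLevel = 0.95) {
  pA <- (xA + 1) / (nA + 2)
  pB <- (xB + 1) / (nB + 2)
  se <- sqrt(pA * (1 - pA) / (nA + 2) + pB * (1 - pB) / (nB + 2))
  z <- stats::qnorm(1 - (1 - cLevel) / 2)
  d <- pA - pB
  c(loCI = d - z * se, upCI = d + z * se)
}

# Level "a": 30 of the 170 Set1 records, 70 of the 130 Set2 records
ci <- agrestiCaffoSketch(30, 170, 70, 130)
ci
```

An interval that excludes zero is the natural reading of a significant difference, although the exact rule behind the signif column is not stated on this page.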
Compare distributions of numerical variables between subsets
Description
Sets up and calls WelchRankTest to compare the distributions of a set of numerical variables between two record subsets. If the set of numerical variables contains a single element, this function effectively reduces to WelchRankTest.
Usage
CompareNumericSets(DF, IndexA, numVars, IndexB = NULL, cLevel = 0.95)
Arguments
DF |
data frame containing all variables in |
IndexA |
record index defining the first record subset to be compared |
numVars |
vector of numerical variable names from |
IndexB |
record index defining the second record subset to be compared (default NULL means the second set contains all records not included in the first) |
cLevel |
confidence level for the Welch rank test (default = 0.95) |
Value
data frame with one row for each element of numVars and columns
containing the numVars element name and all columns from WelchRankTest
for that variable
Examples
x <- seq(-1, 1, length = 200)
a <- rep(c("a", "b"), 100)
offset <- rep(c(0, 0.2), 100)
xMod <- x + offset
DF <- data.frame(numVar = x, numVar2 = xMod, setVar = a)
indexA <- which(DF$setVar == "a")
CompareNumericSets(DF, indexA, c("numVar", "numVar2"))
Compute outlier limits by three methods
Description
Compute upper and lower outlier limits by three detection rules: the 3-sigma edit rule, the Hampel identifier, or the boxplot rule
Usage
ComputeOutlierLimits(x, method, t = NULL)
Arguments
x |
numerical vector in which outliers are to be detected |
method |
single character specifying the outlier rule (T, H, or B) |
t |
threshold parameter (default NULL, gives 3 for T and H rules, 1.5 for B rule) |
Value
named numerical vector with these 4 elements:
- nRec
the number of elements in x
- nonMiss
the number of non-missing elements in x
- loLim
the lower outlier threshold for x elements
- upLim
the upper outlier threshold for x elements
Examples
x <- seq(-1, 1, length = 100)
x[1:10] <- 10
ComputeOutlierLimits(x, "T")
ComputeOutlierLimits(x, "H")
ComputeOutlierLimits(x, "B")
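For reference, the three rules are commonly defined as follows: mean plus or minus t standard deviations (three-sigma), median plus or minus t times the scaled MAD (Hampel), and the quartiles extended by t times the interquartile range (boxplot). The sketch below implements these textbook definitions in base R. It is not the package code: the name outlierLimitsSketch is hypothetical, it returns only the two limits, and the package's exact scale estimates and quantile type are not confirmed here.

```r
# Hypothetical sketch of the three outlier rules using textbook definitions.
outlierLimitsSketch <- function(x, method, t = NULL) {
  x <- x[!is.na(x)]
  # Default thresholds as documented: 3 for T and H, 1.5 for B
  if (is.null(t)) t <- if (method == "B") 1.5 else 3
  lims <- switch(method,
    T = mean(x) + c(-1, 1) * t * stats::sd(x),
    H = stats::median(x) + c(-1, 1) * t * stats::mad(x),
    B = {
      q <- stats::quantile(x, c(0.25, 0.75), names = FALSE)
      c(q[1] - t * diff(q), q[2] + t * diff(q))
    },
    stop("method must be \"T\", \"H\", or \"B\""))
  names(lims) <- c("loLim", "upLim")
  lims
}

x <- seq(-1, 1, length.out = 100)
x[1:10] <- 10
limT <- outlierLimitsSketch(x, "T")
limH <- outlierLimitsSketch(x, "H")
limB <- outlierLimitsSketch(x, "B")
rbind(limT, limH, limB)
```

With ten contaminated points out of 100, the mean and standard deviation are inflated enough that the three-sigma upper limit lies above 10, while the Hampel and boxplot limits stay well below it.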
Detect inliers based on unusual frequency of occurrence
Description
Returns an index to elements of a numerical vector whose frequency is unusually large relative to most elements, applying the three-sigma edit rule to counts of individual values. Inliers often represent data values that are incorrect but consistent with the overall data distribution, as in the case of numerically coded disguised missing data.
Usage
FindInliers(x, t = 3)
Arguments
x |
numerical vector in which inliers are to be detected |
t |
threshold parameter for detecting outlying counts (default value 3) |
Value
index to elements of x that occur unusually often, if any
Examples
x <- seq(-1, 1, length = 100)
x[45:54] <- 0
FindInliers(x)
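The idea of applying the three-sigma rule to value counts can be sketched in a few lines of base R: count how often each distinct value occurs, then flag values whose count exceeds the mean count by more than t standard deviations. The function name findInliersSketch is hypothetical, and details such as the handling of missing values are assumptions rather than the package's documented behavior.

```r
# Hypothetical sketch: flag values whose frequency is an upper outlier
# among the counts of all distinct values (three-sigma rule on counts).
findInliersSketch <- function(x, t = 3) {
  ux <- unique(x[!is.na(x)])
  # With fewer than two distinct values there is nothing to compare against
  if (length(ux) < 2) return(integer(0))
  counts <- tabulate(match(x, ux), nbins = length(ux))
  flagged <- ux[counts > mean(counts) + t * stats::sd(counts)]
  which(x %in% flagged)
}

x <- seq(-1, 1, length.out = 100)
x[45:54] <- 0
idx <- findInliersSketch(x)
idx
```

Here zero occurs ten times while every other value occurs once, so positions 45 through 54 are returned.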
Find outliers by three methods
Description
Returns an index into outlying points, if any, identified by one of three outlier detection rules: the three-sigma edit rule, the Hampel identifier, or the boxplot rule
Usage
FindOutliers(x, method, t = NULL)
Arguments
x |
numerical vector in which outliers are to be detected |
method |
single character specifying the outlier rule (T, H, or B) |
t |
threshold parameter (default NULL, gives 3 for T and H rules, 1.5 for B rule) |
Value
index into elements of x identified as outliers
Examples
x <- seq(-1, 1, length = 100)
x[1:10] <- 10
Tindex <- FindOutliers(x, "T")
x[Tindex] # Example where the three-sigma rule fails
Hindex <- FindOutliers(x, "H")
x[Hindex]
Bindex <- FindOutliers(x, "B")
x[Bindex]
Dataset with missing values and other special cases
Description
Small dataset illustrating standard missing values (NA and NaN) and other values sometimes used to represent missing data (e.g., blanks, spaces, and zeros).
Usage
FirstAnomalyDataFrame
Format
FirstAnomalyDataFrame
a data frame with 5 rows and 6 columns:
- NumVar1
numerical variable with positive, zero, negative and missing (NA) values
- NumVar2
numerical variable with positive, zero, and missing (NA) values
- NumVar3
the ratio of NumVar1 to NumVar2
- CatVar1
categorical variable with missing data represented as NA
- CatVar2
categorical variable with missing data represented with blanks or spaces
- CatVar3
categorical variable with missing data represented with multiple spaces
Profile a data frame
Description
Given the data frame DF, create a new data frame with one row for
each column of DF that characterizes that column in terms of the
number and fraction of missing values, the most frequent value and
its frequency and other characteristics like the Shannon homogeneity
measure computed by the ShannonHomogeneity() function.
Usage
ProfileDataFrame(DF, dgts = 3, charMax = 20)
Arguments
DF |
data frame to be characterized |
dgts |
digits retained for numerical characterizations like fractions (default = 3) |
charMax |
maximum number of characters retained in representing the most frequent value for a variable (default = 20) |
Value
data frame with one row for each column of DF and these columns:
- Variable
the name of the column from DF being characterized
- Type
the class of Variable (e.g., numeric, integer, character, etc.)
- nMiss
the number of missing (NA) or blank Variable records
- fracMiss
the fraction of total records represented by nMiss
- nLevels
the number of distinct values Variable exhibits
- topValue
the most frequently occurring Variable value, truncated to charMax characters
- topChars
the actual number of characters required to represent topValue
- topFreq
the number of times topValue occurs
- topFrac
the fraction of total records represented by topFreq
- Homog
the Shannon homogeneity measure for Variable
Examples
ProfileDataFrame(ChickWeight)
Compute the Shannon homogeneity for a vector
Description
Computes the Shannon homogeneity (normalized Shannon entropy) for a vector, typically categorical but the procedure also works with numerical vectors. Returns a value in the range from 0 (for a highly inhomogeneous vector, concentrated entirely on one of L > 1 levels) to 1 (for a completely homogeneous vector). By convention, vectors of length 0 or 1 return homogeneity values of 1.
Usage
ShannonHomogeneity(x, dgts = 3)
Arguments
x |
the vector to be characterized |
dgts |
number of digits in the return value (default = 3) |
Value
a numerical homogeneity measure between 0 and 1
Examples
x <- rep(c("a", "b", "c", "d", "e"), 200)
y <- c(rep("a", 497), rep("b", 497), rep("c", 2), rep("d", 2), rep("e", 2))
z <- c(rep("a", 996), "b", "c", "d", "e")
ShannonHomogeneity(x)
ShannonHomogeneity(y)
ShannonHomogeneity(z)
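Normalized Shannon entropy divides the entropy of the observed level proportions by log(L), its maximum for L levels. The base R sketch below follows that definition. The name shannonSketch is hypothetical, and two choices are assumptions not stated above: missing values are dropped, and a vector with only one possible level returns 1.

```r
# Hypothetical sketch of normalized Shannon entropy, -sum(p * log(p)) / log(L).
shannonSketch <- function(x, dgts = 3) {
  x <- x[!is.na(x)]            # assumption: ignore missing values
  counts <- table(x)           # for factors this includes unused levels
  L <- length(counts)
  # Documented convention for length 0 or 1; L <= 1 is an added assumption
  if (length(x) <= 1 || L <= 1) return(1)
  p <- as.numeric(counts) / sum(counts)
  p <- p[p > 0]                # empty levels contribute nothing to the sum
  round(-sum(p * log(p)) / log(L), dgts)
}

x <- rep(c("a", "b", "c", "d", "e"), 200)
y <- c(rep("a", 497), rep("b", 497), rep("c", 2), rep("d", 2), rep("e", 2))
z <- c(rep("a", 996), "b", "c", "d", "e")
hx <- shannonSketch(x)
hy <- shannonSketch(y)
hz <- shannonSketch(z)
c(hx, hy, hz)
```

The uniform vector x attains the maximum of 1, and the measure decreases as the distribution concentrates on fewer levels, so z scores lower than y.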
Summarize elements of a numerical vector that occur unusually often
Description
Applies the three-sigma edit rule to the frequencies of distinct values
of a numerical vector, finding those that occur unusually often and
identifying them either by record number or an associated identifying
characteristic specified by label. Inliers often represent data
values that are incorrect but consistent with the overall data distribution,
as in the case of numerically-coded disguised missing data
Usage
SummarizeInliers(x, label = NULL, labelName = NULL, t = 3)
Arguments
x |
numerical vector in which inliers are to be detected |
label |
optional identifying tag for inliers (default NULL gives
an index into the elements of |
labelName |
optional name for the |
t |
detection threshold for the three-sigma edit rule applied to record counts (default value 3) |
Value
Data frame with one row for each inlier detected and two columns:
- Record (or labelName value)
identifying or characterizing each inlier
- Value
the numerical value that occurs unusually often
Note that this data frame is empty (0 rows) if no inliers are detected.
Examples
x <- seq(-1, 1, length = 100)
x[45:54] <- 0
SummarizeInliers(x)
Summarize outliers detected by three methods
Description
Generates a summary of outliers detected by the three-sigma edit rule, the Hampel identifier, and the boxplot rule, including an optional label to identify the outlying points
Usage
SummarizeOutliers(x, label = NULL, labelName = NULL, thresh = c(3, 3, 1.5))
Arguments
x |
numerical vector in which outliers are to be detected |
label |
optional identifying tag for outliers (default NULL gives
an index into the elements of |
labelName |
optional name for the |
thresh |
vector of threshold values for each outlier detection rule (default = c(3, 3, 1.5)) |
Value
Data frame with one row for each outlier detected by any of the three methods and these 5 columns:
- Record (or labelName)
giving the location or label for each outlier
- Value
the value detected as an outlier by at least one method
- ThreeSigma
1 if the outlier is detected by the three-sigma rule, 0 otherwise
- Hampel
1 if the outlier is detected by the Hampel identifier, 0 otherwise
- Boxplot
1 if the outlier is detected by the boxplot rule, 0 otherwise
Note that this data frame is empty (0 rows) if no outliers are detected by any method.
Examples
x <- seq(-1, 1, length = 100)
x[1:10] <- 10
SummarizeOutliers(x)
Tabulate special values often representing missing data
Description
Generates a summary of counts and fractions of records from the
variables listed in xVars that are missing (using the standard R
designation NA), blank (0 length, common in character data),
spaces (one or more, also common in character data), and zeros
or negative values in numerical data (sometimes indicative of
range errors or disguised missing data)
Usage
TabulateSpecialValues(DF, xVars = NULL, subsetIndex = NULL, dgts = 3)
Arguments
DF |
data frame containing all variables in the |
xVars |
character vector of the names of variables to be examined
(default NULL means characterize all variables in data frame |
subsetIndex |
index into record subset in |
dgts |
number of digits in frequency results (default = 3) |
Value
data frame with one row for each variable in xVars list and
these columns:
- Variable
an element of the xVars list
- nMiss
number of records exhibiting the missing value NA (or NaN)
- fracMiss
fraction of records represented by nMiss
- nBlank
number of records listing the value blank (0 length character string)
- fracBlank
fraction of records represented by nBlank
- nSpaces
number of records consisting only of one or more spaces
- fracSpaces
fraction of records represented by nSpaces
- nZero
number of records listing the numerical value zero
- fracZero
fraction of records represented by nZero
- nNeg
number of records listing a negative numerical value
- fracNeg
fraction of records represented by nNeg
Examples
FirstAnomalyDataFrame
TabulateSpecialValues(FirstAnomalyDataFrame)
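The counts described above can be approximated for a single vector with base R alone. The sketch below is illustrative only: the function name specialCountsSketch is hypothetical, the two test vectors are invented and are not columns of FirstAnomalyDataFrame, and it assumes that the blank and space counts apply only to character or factor data and the zero and negative counts only to numeric data.

```r
# Hypothetical sketch: count special values in one vector.
specialCountsSketch <- function(v) {
  isChar <- is.character(v) || is.factor(v)
  isNum <- is.numeric(v)
  s <- as.character(v)
  c(nMiss   = sum(is.na(v)),                                # NA and NaN
    nBlank  = if (isChar) sum(s == "", na.rm = TRUE) else 0, # zero-length strings
    nSpaces = if (isChar) sum(grepl("^ +$", s)) else 0,      # only spaces
    nZero   = if (isNum) sum(v == 0, na.rm = TRUE) else 0,
    nNeg    = if (isNum) sum(v < 0, na.rm = TRUE) else 0)
}

# Invented vectors for illustration
chrCounts <- specialCountsSketch(c("a", "", "  ", NA))
numCounts <- specialCountsSketch(c(1, 0, -2, NA, NaN))
rbind(chrCounts, numCounts)
```

Dividing each count by the vector length gives the corresponding fraction columns.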
Compare two numerical data subsets
Description
Uses the Welch rank-test (a robust alternative to the classical t-test, with better resistance to outliers and asymmetry) to compare the distributions of two subsets of the same numerical variable. The result characterizes the subsets in terms of their median values, and a small p-value (traditionally less than 0.05) implies significant distributional differences between the two subsets.
Usage
WelchRankTest(DF, xVar, indexA, indexB = NULL, cLevel = 0.95)
Arguments
DF |
data frame containing |
xVar |
numerical variable whose subsets are to be compared |
indexA |
record index defining the first subset of |
indexB |
record index defining the second subset of |
cLevel |
confidence level for the test (default = 0.95) |
Value
a named vector with these 5 elements:
- nA
the number of records in the first xVar subset
- nB
the number of records in the second xVar subset
- medianA
the median xVar value in the first subset
- medianB
the median xVar value in the second subset
- pValue
the p-value returned by the Welch rank test
Examples
x <- seq(-1, 1, length = 200)
a <- rep(c("a", "b"), 100)
DF <- data.frame(numVar = x, setVar = a)
indexA <- which(DF$setVar == "a")
WelchRankTest(DF, "numVar", indexA) # No difference in distribution
offset <- rep(c(0, 0.2), 100)
DF$numVar2 <- x + offset
WelchRankTest(DF, "numVar2", indexA) # Significant difference
xMod <- x
xMod[indexA[1:4]] <- x[indexA[1:4]] + 10
DF$numVar3 <- xMod
WelchRankTest(DF, "numVar3", indexA) # No difference even with outliers
stats::t.test(DF[indexA, "numVar3"], DF[-indexA, "numVar3"]) # Compare t-test
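One common construction that goes by this name replaces the pooled observations with their ranks and then applies Welch's unequal-variance t-test to the ranks. The base R sketch below follows that construction. Whether the package computes its p-value in exactly this way is an assumption, and the function name welchRankSketch is hypothetical; missing values are assumed absent.

```r
# Hypothetical sketch: Welch's unequal-variance t-test applied to pooled ranks.
welchRankSketch <- function(x, indexA, indexB = NULL) {
  # Default second subset: all records not in the first
  if (is.null(indexB)) indexB <- setdiff(seq_along(x), indexA)
  xa <- x[indexA]
  xb <- x[indexB]
  r <- rank(c(xa, xb))                     # average ranks for ties
  ra <- r[seq_along(xa)]
  rb <- r[length(xa) + seq_along(xb)]
  tt <- stats::t.test(ra, rb, var.equal = FALSE)
  c(nA = length(xa), nB = length(xb),
    medianA = stats::median(xa), medianB = stats::median(xb),
    pValue = tt$p.value)
}

x <- seq(-1, 1, length.out = 200)
indexA <- seq(1, 199, by = 2)
same <- welchRankSketch(x, indexA)              # interleaved subsets
shifted <- welchRankSketch(c(1:50, 51:100), 1:50)  # fully separated subsets
rbind(same, shifted)
```

Because ranks are bounded, a handful of extreme values in one subset moves its mean rank only slightly, which is the source of the resistance to outliers noted in the description.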
Plot binomial confidence intervals
Description
Plot method for the S3 object class BinomCIframe generated by the
BinomialCIsByCategorical() function
Usage
## S3 method for class 'BinomCIframe'
plot(x, ..., CIrange = NULL, addRef = TRUE)
Arguments
x |
an S3 object of class BinomCIframe |
... |
optional named parameters to be passed to |
CIrange |
two-element vector giving the minimum and maximum y-axis
values to plot (default NULL uses minimum lower confidence limit and
maximum upper confidence limit from |
addRef |
logical: add a reference line at the average probability of a positive response? (default = TRUE) |
Value
None: this method generates a plot from x
Examples
catVar <- c(rep("A", 100), rep("B", 100), rep("C", 100))
binVar <- c(rep(0,80),rep(1,20), rep(0,50),rep(1,50), rep(0,20),rep(1,80))
DF <- data.frame(catVar = catVar, binVar = binVar)
CIframe <- BinomialCIsByCategorical(DF, "binVar", "catVar", 1)
plot(CIframe)
Plot significant categorical level differences between data subsets
Description
Plot method for S3 objects of class CatDiffs generated by the
CompareCategoricalLevels() function, creating a horizontal barplot
of categorical variable level frequencies that differ significantly
between two data subsets
Usage
## S3 method for class 'CatDiffs'
plot(x, ..., labelA, labelB, nMax = 20, levelFrac = 0.5, xLims = NULL)
Arguments
x |
an S3 object of class CatDiffs |
... |
optional named parameters to be passed to |
labelA |
plot label identifying the first data subset |
labelB |
plot label identifying the second data subset |
nMax |
maximum number of levels to include in the barplot (default = 20) |
levelFrac |
relative position of the level labels on the barplot (default = 0.5) |
xLims |
two-element vector of x-axis limits for the barplot (default sets the range from 0 to 1.2 times the length of the longest bar on the plot) |
Value
None: this method generates a plot from x
Examples
catVar <- c(rep("a", 100), rep("b", 100), rep("c", 100))
auxVar <- c(rep("Set1", 30), rep("Set2", 70),
rep("Set1", 50), rep("Set2", 50),
rep("Set1", 90), rep("Set2", 10))
DF <- data.frame(catVar = catVar, auxVar = auxVar)
indexA <- which(DF$auxVar == "Set1")
CatDiffObj <- CompareCategoricalLevels(DF, "catVar", indexA)
plot(CatDiffObj, labelA = "Set1", labelB = "Set2")