| Type: | Package |
| Title: | Cluster Analysis Using Thresholding After Random Projections (n-TARP) |
| Version: | 0.1.0 |
| Description: | Implements the high-dimensional clustering technique Thresholding After Random Projections (n-TARP). Provides functionality to iteratively decompose larger datasets using contextual variables or within-cluster sum of squares. See Tarun & Boutin (2018) <doi:10.48550/arXiv.1806.05297> and Tarun & Boutin (2018) <doi:10.4231/R74B2ZJV> for the original method and applications. |
| License: | MIT + file LICENSE |
| Encoding: | UTF-8 |
| RoxygenNote: | 7.3.3 |
| Imports: | stats |
| Suggests: | knitr, rmarkdown, HDclassif |
| VignetteBuilder: | knitr |
| NeedsCompilation: | no |
| Packaged: | 2026-03-17 15:50:24 UTC; reepi |
| Author: | David Reeping [aut, cre], Yunmeng Han [aut], Nahal Rashedi [rev] |
| Maintainer: | David Reeping <reepindp@ucmail.uc.edu> |
| Repository: | CRAN |
| Date/Publication: | 2026-03-20 12:40:02 UTC |
Combine nTARP BestCluster solutions into one table and assign stable final cluster IDs
Description
Consolidates a set of nTARP "best cluster" solutions (branching runs) that contain 'LabeledClusters' into a single ID-level data frame. For each run, the function adds a per-run cluster assignment column, constructs a concatenated 'ClusterPath', and assigns a stable numeric 'FinalClusterID' based on unique 'ClusterPath' values.
Usage
build_solution_from_labeled_clusters(
nTARP_best_clusters,
ids = NULL,
contextual_variables_df = NULL
)
Arguments
nTARP_best_clusters |
A list of run objects. Each run object must contain 'LabeledClusters', a named list with '"Cluster 1"' and '"Cluster 2"' entries containing IDs. Naming the outer list is recommended, because its names are used as run identifiers (column suffixes). If the list is unnamed, runs are labeled sequentially ('run1', 'run2', ...). |
ids |
A vector of IDs (character or coercible) that defines the universe of rows in the output. |
contextual_variables_df |
Optional 'data.frame' of contextual variables to merge in. If supplied, it must contain a column named 'id_name' (default '"mcid"'). The merge is a left join on the provided 'ids' vector, so every ID in 'ids' appears in the output. |
Value
A 'data.frame' with one row per ID in 'ids', optionally merged with contextual variables, plus per-run cluster columns, 'ClusterPath', and 'FinalClusterID'.
Examples
data <- data.frame(
  X1 = c(0.5, -0.2, 0.1, 0.3, -0.1, 0.2, 5.2, 4.8, 5.1, 5.0,
         -4.5, -5.2, -4.8, -5.1, -4.9, -5.3, 0.0, 0.2, 5.3, -5.0),
  X2 = c(0.3, -0.1, 0.2, 0.1, 0.0, 0.2, 5.0, 4.9, 5.3, 5.1,
         5.0, 5.2, 4.7, 4.9, 5.1, 4.8, -0.2, 0.0, 5.2, -4.9),
  X3 = c(0.4, 0.0, 0.1, -0.1, 0.2, 0.0, 5.1, 4.7, 5.2, 5.0,
         -5.0, -4.8, -5.3, -5.1, -4.9, -5.2, 0.1, 0.3, 5.0, -5.1)
)
# Bisect the data, then combine the best clusters into one ID-level table
nTARP_result <- nTARP_bisecting(
  data = data,
  number_of_projections = 100,
  withinss_threshold = 0.36
)
result <- build_solution_from_labeled_clusters(
  nTARP_best_clusters = nTARP_result$BestClusters,
  ids = 1:10,
  contextual_variables_df = data
)
str(result)
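The consolidation performed by 'build_solution_from_labeled_clusters' can be pictured with a small base R sketch. This is not the package's implementation: the toy run list, the 'Cluster_<run>' column names, and the "-" separator used to build 'ClusterPath' are assumptions made only for illustration.

```r
# Hypothetical toy input: two runs, each holding 'LabeledClusters' with IDs.
ids <- 1:6
runs <- list(
  run1 = list(LabeledClusters = list("Cluster 1" = c(1, 2, 3),
                                     "Cluster 2" = c(4, 5, 6))),
  run2 = list(LabeledClusters = list("Cluster 1" = c(1, 2),
                                     "Cluster 2" = 3))
)

# One row per ID; add one assignment column per run (NA if the ID is absent).
out <- data.frame(id = ids)
for (run_name in names(runs)) {
  labeled <- runs[[run_name]]$LabeledClusters
  assignment <- rep(NA_integer_, length(ids))
  assignment[ids %in% labeled[["Cluster 1"]]] <- 1L
  assignment[ids %in% labeled[["Cluster 2"]]] <- 2L
  out[[paste0("Cluster_", run_name)]] <- assignment  # assumed column name
}

# Concatenate the per-run assignments into a path, skipping missing runs.
run_cols <- setdiff(names(out), "id")
out$ClusterPath <- unname(apply(
  out[run_cols], 1,
  function(x) paste(x[!is.na(x)], collapse = "-")
))

# Number the unique paths in order of first appearance.
out$FinalClusterID <- as.integer(
  factor(out$ClusterPath, levels = unique(out$ClusterPath))
)
print(out)
```

In this toy case, IDs that follow the same sequence of branches share a 'ClusterPath' and therefore the same 'FinalClusterID'.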
Merge clusters as post-processing
Description
This function combines user-specified clusters after 'nTARP' has run. The input to the function is the output of the 'build_solution_from_labeled_clusters' function, along with the two cluster labels the user would like to merge. The output is returned in the same format as 'build_solution_from_labeled_clusters', with the final solution labels renamed to reflect the merging.
Usage
consolidate_clusters(
cluster_path_matrix,
first_cluster_to_combine,
second_cluster_to_combine
)
Arguments
cluster_path_matrix |
Data frame — output of 'build_solution_from_labeled_clusters' showing which branch each observation belongs to from the 'nTARP' clustering |
first_cluster_to_combine |
Numeric — label of the first cluster to merge |
second_cluster_to_combine |
Numeric — label of the second cluster to merge |
Value
A data frame in the same format as the first argument, with the final column showing the cluster IDs relabeled based on the chosen merge.
Examples
data <- data.frame(
  X1 = c(0.5, -0.2, 0.1, 0.3, -0.1, 0.2, 5.2, 4.8, 5.1, 5.0,
         -4.5, -5.2, -4.8, -5.1, -4.9, -5.3, 0.0, 0.2, 5.3, -5.0),
  X2 = c(0.3, -0.1, 0.2, 0.1, 0.0, 0.2, 5.0, 4.9, 5.3, 5.1,
         5.0, 5.2, 4.7, 4.9, 5.1, 4.8, -0.2, 0.0, 5.2, -4.9),
  X3 = c(0.4, 0.0, 0.1, -0.1, 0.2, 0.0, 5.1, 4.7, 5.2, 5.0,
         -5.0, -4.8, -5.3, -5.1, -4.9, -5.2, 0.1, 0.3, 5.0, -5.1)
)
nTARP_result <- nTARP_bisecting(
  data = data,
  number_of_projections = 100,
  withinss_threshold = 0.36,
  minimum_cluster_size_percent = 30
)
result <- build_solution_from_labeled_clusters(
  nTARP_best_clusters = nTARP_result$BestClusters,
  ids = 1:20,
  contextual_variables_df = data
)
str(result)
# Merge final clusters 1 and 2, then inspect the relabeled solution
result <- consolidate_clusters(result, first_cluster_to_combine = 1, second_cluster_to_combine = 2)
str(result)
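The relabeling step behind a merge can be sketched on a bare vector of cluster IDs. The vector below is hypothetical, and renumbering the remaining labels consecutively is an assumed convention; the package may relabel differently.

```r
# Hypothetical final cluster IDs for six observations (four clusters).
final_cluster_id <- c(1, 1, 2, 3, 3, 4)
first_cluster_to_combine <- 1
second_cluster_to_combine <- 2

# Fold the second label into the first.
merged <- ifelse(final_cluster_id == second_cluster_to_combine,
                 first_cluster_to_combine,
                 final_cluster_id)

# Renumber so the remaining labels are consecutive (assumed convention).
relabeled <- as.integer(factor(merged))
print(relabeled)
# [1] 1 1 1 2 2 3
```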
Thresholding After Random Projections (n-TARP) Clustering
Description
Implements the n-TARP clustering technique by projecting the data into a one-dimensional space and performing k-means. The data can be either unlabeled or labeled. The only required parameters are the number of projections and the within-cluster sum of squares threshold. Suggested starting values: 'number_of_projections = 1000' and 'withinss_threshold = 0.36'.
Usage
nTARP(data, number_of_projections, withinss_threshold, ids = NULL)
Arguments
data |
Numeric matrix — dataset to be clustered using 'nTARP' |
number_of_projections |
Numeric — number of random projections for 'nTARP' to try for each run |
withinss_threshold |
Numeric — maximum value defining what a "quality cluster" is, based on the solution's normalized within-cluster sum of squares (typically 0.36) |
ids |
Numeric or character vector — identifying labels for individuals in the clusters |
Value
A list containing results and supporting data from the k-means clustering analysis:
(1) 'OptimalSolution': the optimal clustering solution, including cluster assignments and centroids.
(2) 'OptimalProjection': the projection vector associated with the optimal solution.
(3) 'Threshold': the threshold used for determining cluster membership or filtering.
(4) 'Direction': indicates where a new data point should be placed if using the result as a classifier.
(5) 'OptimalWithinss': the within-cluster sum of squares for the optimal solution.
(6) 'AllWithinss': the within-cluster sum of squares for all candidate solutions.
(7) 'Clusterings': all clustering solutions generated during analysis.
(8) 'OriginalData': the original dataset used for clustering.
(9) 'OriginalIDs': the identifiers of the original observations.
References
Tarun, Y.; Boutin, M. (2018). n-TARP Binary Clustering Code. Purdue University Research Repository. doi:10.4231/R74B2ZJV
Examples
data <- data.frame(
  X1 = c(0.5, -0.2, 0.1, 5.2, 4.8, 5.1, -4.5, -5.2, -4.8, -5.1),
  X2 = c(0.3, -0.1, 0.2, 5.0, 4.9, 5.3, 5.0, 5.2, 4.7, 4.9),
  X3 = c(0.4, 0.0, 0.1, 5.1, 4.7, 5.2, -5.0, -4.8, -5.3, -5.1)
)
result <- nTARP(data = data, number_of_projections = 100, withinss_threshold = 0.36)
str(result)
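The idea of projecting to one dimension and running 2-means can be sketched in base R. This is a minimal illustration, not the package's code: drawing Gaussian projection vectors, normalizing the within-cluster sum of squares as 'tot.withinss / totss', and taking the midpoint between the two centers as the threshold are all assumptions.

```r
set.seed(42)  # projections are random, so fix the seed for reproducibility

X <- cbind(
  X1 = c(0.5, -0.2, 0.1, 5.2, 4.8, 5.1, -4.5, -5.2, -4.8, -5.1),
  X2 = c(0.3, -0.1, 0.2, 5.0, 4.9, 5.3, 5.0, 5.2, 4.7, 4.9),
  X3 = c(0.4, 0.0, 0.1, 5.1, 4.7, 5.2, -5.0, -4.8, -5.3, -5.1)
)

best <- list(norm_wss = Inf)
for (i in seq_len(100)) {
  w <- rnorm(ncol(X))                       # random projection vector
  projected <- as.vector(X %*% w)           # data reduced to one dimension
  km <- stats::kmeans(projected, centers = 2, nstart = 5)
  norm_wss <- km$tot.withinss / km$totss    # assumed normalization
  if (norm_wss < best$norm_wss) {
    best <- list(
      norm_wss   = norm_wss,
      projection = w,
      cluster    = km$cluster,
      threshold  = mean(km$centers)         # midpoint between the two centers
    )
  }
}

# Does the best projection pass a quality threshold such as 0.36?
best$norm_wss < 0.36
```

Keeping only the lowest-scoring projection mirrors the role that 'OptimalProjection' and 'OptimalWithinss' play in the returned list, although the field names in this sketch are hypothetical.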
Run nTARP repeatedly in a bisecting fashion
Description
Repeatedly applies 'nTARP' to iteratively bisect a dataset until a minimum cluster size threshold is reached.
Usage
nTARP_bisecting(
data,
number_of_projections,
withinss_threshold,
ids = NULL,
minimum_cluster_size_percent = 20,
contextual_variable = NULL
)
Arguments
data |
Numeric matrix — dataset to be clustered using 'nTARP' |
number_of_projections |
Numeric — number of random projections for 'nTARP' to try for each run |
withinss_threshold |
Numeric — maximum value defining what a "quality cluster" is, based on the solution's normalized within-cluster sum of squares (typically 0.36) |
ids |
Numeric or character vector — identifying labels for individuals in the clusters |
minimum_cluster_size_percent |
Numeric — minimum size allowable for a cluster (expressed as a percentage) |
contextual_variable |
Vector of integers or characters — variable to use as the basis for comparing clusters. This is 'NULL' by default, in which case splits are selected with the within-cluster compactness criterion (option (1) in Details). |
Details
This function supports two strategies for selecting the optimal split at each step:
(1) Within-Cluster Compactness Criterion: The optimal solution is selected based on the normalized within-cluster sum of squares (WSS). The split that minimizes normalized WSS is retained.
(2) Contextual Purity Criterion: The optimal solution is selected using a contextual variable. Inspired by decision tree learning, the algorithm evaluates candidate splits based on improvements in class purity (i.e., Gini reduction) with respect to the contextual variable. The split that maximizes purity gain is retained.
The process continues recursively (bisecting the largest eligible cluster) until no resulting cluster meets the user-defined minimum size threshold.
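The contextual purity criterion can be illustrated with a standard Gini impurity calculation. The helper names 'gini_impurity' and 'gini_gain' are hypothetical and the package's exact gain formula may differ; this sketch uses the usual decision-tree definition (parent impurity minus the size-weighted impurity of the two children).

```r
# Gini impurity of a vector of class labels.
gini_impurity <- function(labels) {
  p <- as.numeric(table(labels)) / length(labels)
  1 - sum(p^2)
}

# Purity gain of a two-way split with respect to the labels.
gini_gain <- function(labels, assignment) {
  children <- split(labels, assignment)
  weights <- lengths(children) / length(labels)
  child_impurity <- vapply(children, gini_impurity, numeric(1))
  gini_impurity(labels) - sum(weights * child_impurity)
}

latent_group <- c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2,
                  3, 3, 3, 3, 3, 3, 1, 1, 2, 3)

# A split that isolates group 1 versus one that ignores the groups.
informative_split <- ifelse(latent_group == 1, 1, 2)
uninformative_split <- rep(c(1, 2), times = 10)

gini_gain(latent_group, informative_split)
gini_gain(latent_group, uninformative_split)
```

A candidate split that separates the contextual groups well yields a larger gain, which is why it would be preferred under option (2).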
Value
A list containing: (1) Complete solutions (i.e., outputs from the 'nTARP' function), (2) Clusters with the best gains identified using the 'pull_best_solution_and_gain' function, (3) Within-cluster sum of squares for each solution, (4) Gains for each solution (if a contextual variable is used).
Examples
# 20-point example dataset
data <- data.frame(
  X1 = c(0.5, -0.2, 0.1, 0.3, -0.1, 0.2, 5.2, 4.8, 5.1, 5.0,
         -4.5, -5.2, -4.8, -5.1, -4.9, -5.3, 0.0, 0.2, 5.3, -5.0),
  X2 = c(0.3, -0.1, 0.2, 0.1, 0.0, 0.2, 5.0, 4.9, 5.3, 5.1,
         5.0, 5.2, 4.7, 4.9, 5.1, 4.8, -0.2, 0.0, 5.2, -4.9),
  X3 = c(0.4, 0.0, 0.1, -0.1, 0.2, 0.0, 5.1, 4.7, 5.2, 5.0,
         -5.0, -4.8, -5.3, -5.1, -4.9, -5.2, 0.1, 0.3, 5.0, -5.1)
)
# Run nTARP without contextual variable
result1 <- nTARP_bisecting(
  data = data,
  number_of_projections = 10,
  withinss_threshold = 0.36
)
str(result1)
# Add a latent group as contextual variable
latent_group <- c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2,
                  3, 3, 3, 3, 3, 3, 1, 1, 2, 3)
# Run nTARP with contextual variable
result2 <- nTARP_bisecting(
  data = data,
  number_of_projections = 10,
  withinss_threshold = 0.36,
  contextual_variable = latent_group
)
str(result2)
Run nTARP repeatedly in a bisecting fashion (using normalized within sum of squares)
Description
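Internal function. Runs the bisecting procedure with splits selected by normalized within-cluster sum of squares; see Details.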
Usage
nTARP_complete_solution_no_contextual_variable(
data,
number_of_projections,
withinss_threshold,
ids,
minimum_cluster_size_percent
)
Arguments
data |
Numeric matrix — dataset to be clustered using 'nTARP' |
number_of_projections |
Numeric — number of random projections for 'nTARP' to try for each run (usually 1000 to start) |
withinss_threshold |
Numeric — maximum value defining what a "quality cluster" is, based on the solution's normalized within-cluster sum of squares (typically 0.36) |
ids |
Numeric or character vector — identifying labels for individuals in the clusters |
minimum_cluster_size_percent |
Numeric — minimum size allowable for a cluster to be further bisected (as a percentage) |
Details
Repeatedly applies 'nTARP' to iteratively bisect a dataset until a minimum cluster size threshold is reached, using within-cluster compactness to select optimal splits.
At each step, the algorithm evaluates candidate splits based on the normalized within-cluster sum of squares (WSS). The split that minimizes normalized WSS is retained.
The process continues recursively, bisecting the largest eligible cluster, until no resulting cluster meets the user-defined minimum size threshold.
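The recursive procedure can be sketched as a loop that keeps splitting the largest eligible cluster. This is an illustration under stated assumptions, not the internal function itself: 'split_once' and 'bisect_sketch' are hypothetical helpers, the normalization 'tot.withinss / totss' is assumed, and a cluster is treated as eligible while it holds at least 'min_size_percent' of all rows.

```r
set.seed(123)

# Lowest normalized-WSS two-way split found over several random projections.
split_once <- function(X, n_projections) {
  best_wss <- Inf
  best_cluster <- NULL
  for (i in seq_len(n_projections)) {
    projected <- as.vector(X %*% rnorm(ncol(X)))
    km <- stats::kmeans(projected, centers = 2, nstart = 5)
    wss <- km$tot.withinss / km$totss  # assumed normalization
    if (wss < best_wss) {
      best_wss <- wss
      best_cluster <- km$cluster
    }
  }
  best_cluster
}

bisect_sketch <- function(X, n_projections = 50, min_size_percent = 20) {
  # Assumed rule: eligible while at least this many rows remain
  # (never fewer than 3, so that a 2-means split is possible).
  min_size <- max(3, nrow(X) * min_size_percent / 100)
  path <- rep("0", nrow(X))  # branch label per row, extended at each split
  repeat {
    sizes <- lengths(split(seq_along(path), path))
    eligible <- sizes[sizes >= min_size]
    if (length(eligible) == 0) break
    target <- names(eligible)[which.max(eligible)]  # largest eligible cluster
    rows <- which(path == target)
    child <- split_once(X[rows, , drop = FALSE], n_projections)
    path[rows] <- paste(target, child, sep = "-")
  }
  path
}

X <- cbind(
  X1 = c(0.5, -0.2, 0.1, 0.3, -0.1, 0.2, 5.2, 4.8, 5.1, 5.0,
         -4.5, -5.2, -4.8, -5.1, -4.9, -5.3, 0.0, 0.2, 5.3, -5.0),
  X2 = c(0.3, -0.1, 0.2, 0.1, 0.0, 0.2, 5.0, 4.9, 5.3, 5.1,
         5.0, 5.2, 4.7, 4.9, 5.1, 4.8, -0.2, 0.0, 5.2, -4.9),
  X3 = c(0.4, 0.0, 0.1, -0.1, 0.2, 0.0, 5.1, 4.7, 5.2, 5.0,
         -5.0, -4.8, -5.3, -5.1, -4.9, -5.2, 0.1, 0.3, 5.0, -5.1)
)
paths <- bisect_sketch(X, n_projections = 20, min_size_percent = 30)
table(paths)
```

Each split strictly shrinks the cluster it acts on, so the loop stops once every remaining cluster is below the size limit.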
Value
A list containing: (1) Complete solutions (i.e., outputs from the 'nTARP' function), (2) Clusters with the best gains identified using the 'pull_best_solution_and_gain' function, (3) Within-cluster sum of squares for each solution.
Run nTARP repeatedly in a bisecting fashion (using contextual variable)
Description
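Internal function. Runs the bisecting procedure with splits selected by purity gain on a contextual variable; see Details.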
Usage
nTARP_complete_solution_with_contextual_variable(
data,
number_of_projections,
withinss_threshold,
ids,
contextual_variable,
minimum_cluster_size_percent
)
Arguments
data |
Numeric matrix — dataset to be clustered using 'nTARP' |
number_of_projections |
Numeric — number of random projections for 'nTARP' to try for each run (usually 1000 to start) |
withinss_threshold |
Numeric — maximum value defining what a "quality cluster" is, based on the solution's normalized within-cluster sum of squares (typically 0.36) |
ids |
Numeric or character vector — identifying labels for individuals in the clusters |
contextual_variable |
Vector of integers or characters — variable to use as the basis for comparing clusters |
minimum_cluster_size_percent |
Numeric — minimum size allowable for a cluster to be further bisected (as a percentage) |
Details
Repeatedly applies 'nTARP' to iteratively bisect a dataset until a minimum cluster size threshold is reached, using a contextual variable to select optimal splits.
At each step, the algorithm evaluates candidate splits based on improvements in class purity of the contextual variable (e.g., Gini reduction). The split that maximizes purity gain is retained.
The process continues recursively, bisecting the largest eligible cluster, until no resulting cluster meets the user-defined minimum size threshold.
Value
A list containing: (1) Complete solutions (i.e., outputs from the 'nTARP' function), (2) Clusters with the best gains identified using the 'pull_best_solution_and_gain' function, (3) Within-cluster sum of squares for each solution, (4) Gains for each solution.