This document catalogs all evaluation risks that BORG detects, organized by severity and mechanism.
BORG classifies risks into two categories based on their impact on evaluation validity:
| Category | Impact | BORG Response |
|---|---|---|
| Hard Violation | Results are invalid | Blocks evaluation, requires fix |
| Soft Inflation | Results are biased | Warns, allows with caution |
## Hard Violations
These make your evaluation results invalid. Any metrics computed with these violations are unreliable.
### Index Overlap
What: Same row indices appear in both training and test sets.
Why it matters: The model has seen the exact data it’s being tested on. This is the most basic form of leakage.
Detection: Set intersection of `train_idx` and `test_idx`.
data <- data.frame(x = 1:100, y = rnorm(100))
# Accidental overlap
result <- borg_inspect(data, train_idx = 1:60, test_idx = 51:100)
result
#> BorgRisk Assessment
#> ===================
#>
#> Status: INVALID (1 hard violation) — Resistance is futile
#> Hard violations: 1
#> Soft inflations: 0
#> Train indices: 60 rows
#> Test indices: 50 rows
#> Inspected at: 2026-03-17 13:44:35
#>
#> --- HARD VIOLATIONS (must fix) ---
#>
#> [1] index_overlap
#> Train and test indices overlap (10 shared indices). This invalidates evaluation.
#> Source: train_idx/test_idx
#>     Affected: 10 indices (first 5: 51, 52, 53, 54, 55)

Fix: Ensure indices are mutually exclusive. Use `setdiff()` to create non-overlapping sets.
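The fix can be sketched in base R, reusing the overlapping split from the example above:

```r
# Overlapping split from the example: rows 51-60 appear in both sets
train_idx <- 1:60
test_idx  <- 51:100

# Drop any shared indices from the test set so the two are disjoint
test_idx <- setdiff(test_idx, train_idx)

length(intersect(train_idx, test_idx))  # 0
```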
### Duplicate Rows

What: Test set contains rows identical to training rows.
Why it matters: Model may have memorized these exact patterns. Even without index overlap, identical feature values constitute leakage.
Detection: Row hashing and comparison (C++ backend for numeric data).
# Data with duplicate rows
dup_data <- rbind(
data.frame(x = 1:5, y = 1:5),
data.frame(x = 1:5, y = 1:5) # Duplicates
)
result <- borg_inspect(dup_data, train_idx = 1:5, test_idx = 6:10)
result
#> BorgRisk Assessment
#> ===================
#>
#> Status: INVALID (1 hard violation) — Resistance is futile
#> Hard violations: 1
#> Soft inflations: 0
#> Train indices: 5 rows
#> Test indices: 5 rows
#> Inspected at: 2026-03-17 13:44:35
#>
#> --- HARD VIOLATIONS (must fix) ---
#>
#> [1] duplicate_rows
#> Test set contains 5 rows identical to training rows (memorization risk)
#> Source: data.frame
#>     Affected: 6, 7, 8, 9, 10

Fix: Remove duplicate rows before splitting, or ensure splits respect duplicates (keep all copies in the same set).
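Either fix can be sketched in base R (the row key used below is illustrative):

```r
# Duplicate rows: the second block repeats the first exactly
dup_data <- rbind(
  data.frame(x = 1:5, y = 1:5),
  data.frame(x = 1:5, y = 1:5)
)

# Option 1: deduplicate before splitting
deduped <- dup_data[!duplicated(dup_data), ]
nrow(deduped)  # 5

# Option 2: split on distinct row values so every copy of a row
# lands in the same set
key <- apply(dup_data, 1, paste, collapse = "|")
train_keys <- unique(key)[1:3]
train_idx  <- which(key %in% train_keys)   # both copies of rows 1-3
test_idx   <- which(!key %in% train_keys)  # both copies of rows 4-5
```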
### Preprocessing Leakage

What: Normalization, imputation, or dimensionality reduction fitted on full data before splitting.
Why it matters: Test set statistics influenced the preprocessing parameters applied to training data. Information flows backwards from test to train.
Detection: Recompute statistics on train-only data and compare to stored parameters. Discrepancy indicates leakage.
Supported objects:
| Object Type | Parameters Checked |
|---|---|
| `caret::preProcess` | `$mean`, `$std` |
| `recipes::recipe` | Step parameters after `prep()` |
| `prcomp` | `$center`, `$scale`, rotation matrix |
| `scale()` attributes | `center`, `scale` |
# BAD: Scale fitted on all data
scaled_data <- scale(data) # Uses all rows!
train <- scaled_data[1:70, ]
test <- scaled_data[71:100, ]
# BORG detects this
borg_inspect(scaled_data, train_idx = 1:70, test_idx = 71:100)

Fix: Fit preprocessing on training data only, then apply to test:
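A minimal sketch of the fix using base R's `scale()` (column names and the split are illustrative):

```r
set.seed(1)
data <- data.frame(x = rnorm(100), y = rnorm(100))

# GOOD: compute centering/scaling parameters on the training rows only
train_raw <- data[1:70, ]
test_raw  <- data[71:100, ]

train_scaled <- scale(train_raw)

# Apply the training means and SDs to the test rows
test_scaled <- scale(test_raw,
                     center = attr(train_scaled, "scaled:center"),
                     scale  = attr(train_scaled, "scaled:scale"))
```

No test-set statistic ever influences the parameters applied to either split, so the information flow is strictly train-to-test.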
### Direct Target Leakage

What: Feature has absolute correlation > 0.99 with target.
Why it matters: Feature is almost certainly derived from the outcome. Examples:

- `days_since_diagnosis` when predicting `has_disease`
- `total_spent` when predicting `is_customer`
- Aggregated future values leaked into current features
Detection: Compute Pearson correlation of each numeric feature with target on training data.
# Simulate target leakage
leaky <- data.frame(
x = rnorm(100),
outcome = rnorm(100)
)
leaky$leaked <- leaky$outcome + rnorm(100, sd = 0.01) # Near-perfect correlation
result <- borg_inspect(leaky, train_idx = 1:70, test_idx = 71:100, target = "outcome")
result
#> BorgRisk Assessment
#> ===================
#>
#> Status: INVALID (1 hard violation) — Resistance is futile
#> Hard violations: 1
#> Soft inflations: 0
#> Train indices: 70 rows
#> Test indices: 30 rows
#> Inspected at: 2026-03-17 13:44:35
#>
#> --- HARD VIOLATIONS (must fix) ---
#>
#> [1] target_leakage_direct
#> Feature 'leaked' has correlation 1.000 with target 'outcome'. Likely derived from outcome.
#>     Source: data.frame$leaked

Fix: Remove or investigate the leaky feature. If it’s a legitimate predictor, document why correlation > 0.99 is expected.
### Group Leakage

What: Same group (patient, site, species) appears in both train and test.
Why it matters: Observations within a group tend to be similar. If the same patient appears in train and test, the model can exploit patient-specific patterns that won’t exist for new patients.
Detection: Set intersection of group membership values.
# Clinical data with patient IDs
clinical <- data.frame(
patient_id = rep(1:10, each = 10),
measurement = rnorm(100)
)
# Random split ignoring patients
set.seed(123)
all_idx <- sample(100)
train_idx <- all_idx[1:70]
test_idx <- all_idx[71:100]
result <- borg_inspect(clinical, train_idx = train_idx, test_idx = test_idx,
groups = "patient_id")
result
#> BorgRisk Assessment
#> ===================
#>
#> Status: VALID (no hard violations)
#> Hard violations: 0
#> Soft inflations: 0
#> Train indices: 70 rows
#> Test indices: 30 rows
#> Inspected at: 2026-03-17 13:44:35
#>
#> No risks detected.

Fix: Use group-aware splitting:
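One way to sketch a group-aware 70/30 split in base R is to sample whole patients rather than rows:

```r
clinical <- data.frame(
  patient_id = rep(1:10, each = 10),
  measurement = rnorm(100)
)

# Sample whole patients, not rows
set.seed(123)
train_patients <- sample(unique(clinical$patient_id), 7)

train_idx <- which(clinical$patient_id %in% train_patients)
test_idx  <- which(!clinical$patient_id %in% train_patients)

# No patient appears on both sides of the split
length(intersect(clinical$patient_id[train_idx],
                 clinical$patient_id[test_idx]))  # 0
```

Packages such as rsample offer the same idea as a ready-made function (`group_vfold_cv()`), if you prefer not to roll your own.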
### Temporal Leakage

What: Test observations predate training observations.
Why it matters: Model uses future information to predict the past. In deployment, future data won’t be available.
Detection: Compare max training timestamp to min test timestamp.
# Time series data
ts_data <- data.frame(
date = seq(as.Date("2020-01-01"), by = "day", length.out = 100),
value = cumsum(rnorm(100))
)
# Wrong: random split ignores time
set.seed(42)
random_idx <- sample(100)
train_idx <- random_idx[1:70]
test_idx <- random_idx[71:100]
result <- borg_inspect(ts_data, train_idx = train_idx, test_idx = test_idx,
time = "date")
result
#> BorgRisk Assessment
#> ===================
#>
#> Status: VALID (no hard violations)
#> Hard violations: 0
#> Soft inflations: 0
#> Train indices: 70 rows
#> Test indices: 30 rows
#> Inspected at: 2026-03-17 13:44:35
#>
#> No risks detected.

Fix: Use chronological splits where all test data comes after training:
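A chronological split in base R: order by the time column, then cut at the desired proportion.

```r
ts_data <- data.frame(
  date = seq(as.Date("2020-01-01"), by = "day", length.out = 100),
  value = cumsum(rnorm(100))
)

# Order by time, then take the earliest 70% as training
ord <- order(ts_data$date)
train_idx <- ord[1:70]
test_idx  <- ord[71:100]

# Every test date is strictly after the last training date
max(ts_data$date[train_idx]) < min(ts_data$date[test_idx])  # TRUE
```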
### CV Contamination

What: Cross-validation folds contain test indices, or folds overlap incorrectly.
Why it matters: Nested CV requires the outer test set to be completely held out from all inner training.
Detection: Check if any fold’s training indices intersect with held-out test set.
Supported objects:

- `caret::trainControl` - checks `$index` and `$indexOut`
- `rsample::vfold_cv` and other `rset` objects
- `rsample::rsplit` objects
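The core check can be reproduced in a few lines of base R (the fold contents here are illustrative):

```r
# Held-out test set
test_idx <- 81:100

# fold2's training indices reach into the held-out test set
folds <- list(fold1 = 1:40, fold2 = 41:85)

# Flag any fold whose indices intersect the test set
contaminated <- vapply(folds,
                       function(f) length(intersect(f, test_idx)) > 0,
                       logical(1))
names(folds)[contaminated]  # "fold2"
```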
### Model Scope Violation

What: Model was trained on more rows than claimed training set.
Why it matters: Model saw test data during training, even if indirectly (e.g., through hyperparameter tuning on full data).
Detection: Compare `nrow(trainingData)` or `length(fitted.values)` to `length(train_idx)`.

Supported objects: `lm`, `glm`, `ranger`, `caret::train`, parsnip models, workflows.
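The row-count check is easy to reproduce by hand for an `lm` fit (the data here is synthetic):

```r
set.seed(1)
d <- data.frame(x = rnorm(100), y = rnorm(100))
train_idx <- 1:70

# BAD: model fitted on all 100 rows despite a 70-row training set
fit_all <- lm(y ~ x, data = d)
length(fitted(fit_all)) == length(train_idx)  # FALSE -> scope violation

# GOOD: fit on the training rows only
fit_train <- lm(y ~ x, data = d[train_idx, ])
length(fitted(fit_train)) == length(train_idx)  # TRUE
```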
## Soft Inflations

These bias results but may not completely invalidate them. Model ranking might be preserved even if absolute metrics are optimistic.
### Proxy Target Leakage

What: Feature has correlation 0.95-0.99 with target.
Why warning not error: May be a legitimate strong predictor. Requires domain knowledge to judge.
Detection: Same as direct leakage, different threshold.
# Strong but not extreme correlation
proxy <- data.frame(
x = rnorm(100),
outcome = rnorm(100)
)
proxy$strong_predictor <- proxy$outcome + rnorm(100, sd = 0.3) # r ~ 0.96
result <- borg_inspect(proxy, train_idx = 1:70, test_idx = 71:100, target = "outcome")
result
#> BorgRisk Assessment
#> ===================
#>
#> Status: VALID (no hard violations)
#> Hard violations: 0
#> Soft inflations: 1
#> Train indices: 70 rows
#> Test indices: 30 rows
#> Inspected at: 2026-03-17 13:44:35
#>
#> --- SOFT INFLATIONS (warnings) ---
#>
#> [1] target_leakage_proxy
#> Feature 'strong_predictor' has correlation 0.959 with target 'outcome'. May be a proxy for outcome.
#>     Source: data.frame$strong_predictor

Action: Review whether the feature should be available at prediction time in production.
### Spatial Proximity

What: Test points are very close to training points in geographic space.
Why it matters: Spatial autocorrelation means nearby points share variance. Model learns local patterns that don’t generalize to distant locations.
Detection: Compute minimum distance from each test point to nearest training point. Flag if < 1% of spatial spread.
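The distance check itself can be reproduced in base R (the data below is synthetic and separate from the example that follows):

```r
set.seed(42)
train_pts <- matrix(runif(140, 0, 100), ncol = 2)  # 70 training points
test_pts  <- matrix(runif(60, 0, 100),  ncol = 2)  # 30 test points

# Pairwise distances; rows 71-100 are test, columns 1-70 are train
d <- as.matrix(dist(rbind(train_pts, test_pts)))
min_d <- apply(d[71:100, 1:70], 1, min)  # nearest training point per test point

# Values below 1% of the training spread would trigger the warning
spread <- max(dist(train_pts))
sum(min_d < 0.01 * spread)
```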
set.seed(42)
spatial <- data.frame(
lon = runif(100, 0, 100),
lat = runif(100, 0, 100),
value = rnorm(100)
)
# Random split intermixes nearby points
train_idx <- sample(100, 70)
test_idx <- setdiff(1:100, train_idx)
result <- borg_inspect(spatial, train_idx = train_idx, test_idx = test_idx,
coords = c("lon", "lat"))
result
#> BorgRisk Assessment
#> ===================
#>
#> Status: VALID (no hard violations)
#> Hard violations: 0
#> Soft inflations: 1
#> Train indices: 70 rows
#> Test indices: 30 rows
#> Inspected at: 2026-03-17 13:44:35
#>
#> --- SOFT INFLATIONS (warnings) ---
#>
#> [1] spatial_overlap
#> 93% of test points fall within the training region convex hull. Consider spatial blocking.
#>     Source: data.frame

Fix: Use spatial blocking:
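A minimal spatial-blocking sketch in base R: assign each point to a coarse grid cell and hold out whole cells, so test points are never surrounded by training neighbors.

```r
set.seed(42)
spatial <- data.frame(
  lon = runif(100, 0, 100),
  lat = runif(100, 0, 100),
  value = rnorm(100)
)

# Assign each point to a cell of a coarse 2 x 2 grid
cell <- interaction(cut(spatial$lon, breaks = 2),
                    cut(spatial$lat, breaks = 2),
                    drop = TRUE)

# Hold out one entire spatial block as the test set
test_cells <- levels(cell)[1]
test_idx   <- which(cell %in% test_cells)
train_idx  <- which(!cell %in% test_cells)
```

Dedicated packages (e.g. blockCV, spatialsample) implement more sophisticated versions of the same idea.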
### Spatial Overlap

What: Test region falls inside training region’s convex hull.
Why it matters: Interpolation is easier than extrapolation. Model performance on “surrounded” test points overestimates performance on truly new regions.
Detection: Compute convex hull of training points, count test points inside.
Threshold: Warning if > 50% of test points fall inside training hull.
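The hull check can be reproduced with `chull()` plus a small ray-casting point-in-polygon helper (all names and data below are illustrative):

```r
# Ray-casting point-in-polygon test for a single point (px, py)
# against polygon vertices (vx, vy)
point_in_poly <- function(px, py, vx, vy) {
  n <- length(vx)
  inside <- FALSE
  j <- n
  for (i in seq_len(n)) {
    if (((vy[i] > py) != (vy[j] > py)) &&
        (px < (vx[j] - vx[i]) * (py - vy[i]) / (vy[j] - vy[i]) + vx[i])) {
      inside <- !inside
    }
    j <- i
  }
  inside
}

set.seed(42)
train <- data.frame(lon = runif(70, 0, 100), lat = runif(70, 0, 100))
test  <- data.frame(lon = runif(30, 20, 80), lat = runif(30, 20, 80))

hull <- chull(train$lon, train$lat)  # vertex indices of the training hull
inside <- mapply(point_in_poly, test$lon, test$lat,
                 MoreArgs = list(vx = train$lon[hull], vy = train$lat[hull]))

mean(inside)  # fraction of test points inside the training hull
```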
### Inappropriate CV Scheme

What: Using random k-fold CV when data has spatial, temporal, or group structure.
Why it matters: Random folds break dependencies artificially, leading to optimistic error estimates.
# Diagnose data dependencies
spatial <- data.frame(
lon = runif(200, 0, 100),
lat = runif(200, 0, 100),
response = rnorm(200)
)
diagnosis <- borg_diagnose(spatial, coords = c("lon", "lat"), target = "response",
verbose = FALSE)
diagnosis@recommended_cv
#> [1] "random"

Fix: Use `borg()` to generate appropriate blocked CV folds.
## Summary
| Risk Type | Severity | Detection Method | Fix |
|---|---|---|---|
| `index_overlap` | Hard | Index intersection | Use `setdiff()` |
| `duplicate_rows` | Hard | Row hashing | Deduplicate or group |
| `preprocessing_leak` | Hard | Parameter comparison | Fit on train only |
| `target_leakage` | Hard | Correlation > 0.99 | Remove feature |
| `group_leakage` | Hard | Group intersection | Group-aware split |
| `temporal_leak` | Hard | Timestamp comparison | Chronological split |
| `cv_contamination` | Hard | Fold index check | Rebuild folds |
| `model_scope` | Hard | Row count | Refit on train only |
| `proxy_leakage` | Soft | Correlation 0.95-0.99 | Domain review |
| `spatial_proximity` | Soft | Distance check | Spatial blocking |
| `spatial_overlap` | Soft | Convex hull | Geographic split |
## Working with Results
# Create result with violations
result <- borg_inspect(
data.frame(x = 1:100, y = rnorm(100)),
train_idx = 1:60,
test_idx = 51:100
)
# Summary
cat("Valid:", result@is_valid, "\n")
#> Valid: FALSE
cat("Hard violations:", result@n_hard, "\n")
#> Hard violations: 1
cat("Soft warnings:", result@n_soft, "\n")
#> Soft warnings: 0
# Individual risks
for (risk in result@risks) {
cat("\n", risk$type, "(", risk$severity, "):\n", sep = "")
cat(" ", risk$description, "\n")
if (!is.null(risk$affected)) {
cat(" Affected:", head(risk$affected, 5), "...\n")
}
}
#>
#> index_overlap(hard_violation):
#> Train and test indices overlap (10 shared indices). This invalidates evaluation.
#> Affected: 51 52 53 54 55 ...
# Tabular format
as.data.frame(result)
#> type severity
#> 1 index_overlap hard_violation
#> description
#> 1 Train and test indices overlap (10 shared indices). This invalidates evaluation.
#> source_object n_affected
#> 1 train_idx/test_idx         10

## See Also

- `vignette("quickstart")` - Basic usage
- `vignette("frameworks")` - Framework integration