This document catalogs all evaluation risks that BORG detects, organized by severity and mechanism.
BORG classifies risks into two categories based on their impact on evaluation validity:
| Category | Impact | BORG Response |
|---|---|---|
| Hard Violation | Results are invalid | Blocks evaluation, requires fix |
| Soft Inflation | Results are biased | Warns, allows with caution |
## Hard Violations
These make your evaluation results invalid. Any metrics computed with these violations are unreliable.
### Index Overlap
What: Same row indices appear in both training and test sets.
Why it matters: The model has seen the exact data it’s being tested on. This is the most basic form of leakage.
Detection: Set intersection of `train_idx` and `test_idx`.
data <- data.frame(x = 1:100, y = rnorm(100))
# Accidental overlap
result <- borg_inspect(data, train_idx = 1:60, test_idx = 51:100)
result
#> BorgRisk Assessment
#> ===================
#>
#> Status: INVALID (1 hard violation) — Resistance is futile
#> Hard violations: 1
#> Soft inflations: 0
#> Train indices: 60 rows
#> Test indices: 50 rows
#> Inspected at: 2026-03-17 13:44:35
#>
#> --- HARD VIOLATIONS (must fix) ---
#>
#> [1] index_overlap
#> Train and test indices overlap (10 shared indices). This invalidates evaluation.
#> Source: train_idx/test_idx
#>     Affected: 10 indices (first 5: 51, 52, 53, 54, 55)

Fix: Ensure indices are mutually exclusive. Use `setdiff()` to create non-overlapping sets.
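The fix can be sketched in base R, reusing the overlapping split from the example above:

```r
# Overlapping split from the example: rows 51-60 appear in both sets
train_idx <- 1:60
test_idx  <- 51:100

# Drop any shared indices from the test set so the two are disjoint
test_idx <- setdiff(test_idx, train_idx)

length(intersect(train_idx, test_idx))  # 0
```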
### Duplicate Rows

What: Test set contains rows identical to training rows.
Why it matters: Model may have memorized these exact patterns. Even without index overlap, identical feature values constitute leakage.
Detection: Row hashing and comparison (C++ backend for numeric data).
# Data with duplicate rows
dup_data <- rbind(
data.frame(x = 1:5, y = 1:5),
data.frame(x = 1:5, y = 1:5) # Duplicates
)
result <- borg_inspect(dup_data, train_idx = 1:5, test_idx = 6:10)
result
#> BorgRisk Assessment
#> ===================
#>
#> Status: INVALID (1 hard violation) — Resistance is futile
#> Hard violations: 1
#> Soft inflations: 0
#> Train indices: 5 rows
#> Test indices: 5 rows
#> Inspected at: 2026-03-17 13:44:35
#>
#> --- HARD VIOLATIONS (must fix) ---
#>
#> [1] duplicate_rows
#> Test set contains 5 rows identical to training rows (memorization risk)
#> Source: data.frame
#>     Affected: 6, 7, 8, 9, 10

Fix: Remove duplicate rows before splitting, or ensure splits respect duplicates (keep all copies in the same set).
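Either fix can be sketched in base R (the row key used below is illustrative):

```r
# Duplicate rows: the second block repeats the first exactly
dup_data <- rbind(
  data.frame(x = 1:5, y = 1:5),
  data.frame(x = 1:5, y = 1:5)
)

# Option 1: deduplicate before splitting
deduped <- dup_data[!duplicated(dup_data), ]
nrow(deduped)  # 5

# Option 2: split on distinct row values so every copy of a row
# lands in the same set
key <- apply(dup_data, 1, paste, collapse = "|")
train_keys <- unique(key)[1:3]
train_idx  <- which(key %in% train_keys)   # both copies of rows 1-3
test_idx   <- which(!key %in% train_keys)  # both copies of rows 4-5
```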
### Preprocessing Leakage

What: Normalization, imputation, or dimensionality reduction fitted on full data before splitting.
Why it matters: Test set statistics influenced the preprocessing parameters applied to training data. Information flows backwards from test to train.
Detection: Recompute statistics on train-only data and compare to stored parameters. Discrepancy indicates leakage.
Supported objects:
| Object Type | Parameters Checked |
|---|---|
| `caret::preProcess` | `$mean`, `$std` |
| `recipes::recipe` | Step parameters after `prep()` |
| `prcomp` | `$center`, `$scale`, rotation matrix |
| `scale()` attributes | `center`, `scale` |
# BAD: Scale fitted on all data
scaled_data <- scale(data) # Uses all rows!
train <- scaled_data[1:70, ]
test <- scaled_data[71:100, ]
# BORG detects this
borg_inspect(scaled_data, train_idx = 1:70, test_idx = 71:100)

Fix: Fit preprocessing on training data only, then apply to test:
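A minimal sketch of the fix using base R's `scale()` (column names and the split are illustrative):

```r
set.seed(1)
data <- data.frame(x = rnorm(100), y = rnorm(100))

# GOOD: compute centering/scaling parameters on the training rows only
train_raw <- data[1:70, ]
test_raw  <- data[71:100, ]

train_scaled <- scale(train_raw)

# Apply the training means and SDs to the test rows
test_scaled <- scale(test_raw,
                     center = attr(train_scaled, "scaled:center"),
                     scale  = attr(train_scaled, "scaled:scale"))
```

No test-set statistic ever influences the parameters applied to either split, so the information flow is strictly train-to-test.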
### Direct Target Leakage

What: Feature has absolute correlation > 0.99 with target.
Why it matters: Feature is almost certainly derived from the outcome. Examples:

- `days_since_diagnosis` when predicting `has_disease`
- `total_spent` when predicting `is_customer`
- Aggregated future values leaked into current features
Detection: Compute Pearson correlation of each numeric feature with target on training data.
# Simulate target leakage
leaky <- data.frame(
x = rnorm(100),
outcome = rnorm(100)
)
leaky$leaked <- leaky$outcome + rnorm(100, sd = 0.01) # Near-perfect correlation
result <- borg_inspect(leaky, train_idx = 1:70, test_idx = 71:100, target = "outcome")
result
#> BorgRisk Assessment
#> ===================
#>
#> Status: INVALID (1 hard violation) — Resistance is futile
#> Hard violations: 1
#> Soft inflations: 0
#> Train indices: 70 rows
#> Test indices: 30 rows
#> Inspected at: 2026-03-17 13:44:35
#>
#> --- HARD VIOLATIONS (must fix) ---
#>
#> [1] target_leakage_direct
#> Feature 'leaked' has correlation 1.000 with target 'outcome'. Likely derived from outcome.
#>     Source: data.frame$leaked

Fix: Remove or investigate the leaky feature. If it’s a legitimate predictor, document why correlation > 0.99 is expected.
### Group Leakage

What: Same group (patient, site, species) appears in both train and test.
Why it matters: Observations within a group tend to be similar. If the same patient appears in train and test, the model can exploit patient-specific patterns that won’t exist for new patients.
Detection: Set intersection of group membership values.
# Clinical data with patient IDs
clinical <- data.frame(
patient_id = rep(1:10, each = 10),
measurement = rnorm(100)
)
# Random split ignoring patients
set.seed(123)
all_idx <- sample(100)
train_idx <- all_idx[1:70]
test_idx <- all_idx[71:100]
result <- borg_inspect(clinical, train_idx = train_idx, test_idx = test_idx,
groups = "patient_id")
result
#> BorgRisk Assessment
#> ===================
#>
#> Status: VALID (no hard violations)
#> Hard violations: 0
#> Soft inflations: 0
#> Train indices: 70 rows
#> Test indices: 30 rows
#> Inspected at: 2026-03-17 13:44:35
#>
#> No risks detected.

Fix: Use group-aware splitting:
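One way to sketch a group-aware 70/30 split in base R is to sample whole patients rather than rows:

```r
clinical <- data.frame(
  patient_id = rep(1:10, each = 10),
  measurement = rnorm(100)
)

# Sample whole patients, not rows
set.seed(123)
train_patients <- sample(unique(clinical$patient_id), 7)

train_idx <- which(clinical$patient_id %in% train_patients)
test_idx  <- which(!clinical$patient_id %in% train_patients)

# No patient appears on both sides of the split
length(intersect(clinical$patient_id[train_idx],
                 clinical$patient_id[test_idx]))  # 0
```

Packages such as rsample offer the same idea as a ready-made function (`group_vfold_cv()`), if you prefer not to roll your own.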
### Temporal Leakage

What: Test observations predate training observations.
Why it matters: Model uses future information to predict the past. In deployment, future data won’t be available.
Detection: Compare max training timestamp to min test timestamp.
# Time series data
ts_data <- data.frame(
date = seq(as.Date("2020-01-01"), by = "day", length.out = 100),
value = cumsum(rnorm(100))
)
# Wrong: random split ignores time
set.seed(42)
random_idx <- sample(100)
train_idx <- random_idx[1:70]
test_idx <- random_idx[71:100]
result <- borg_inspect(ts_data, train_idx = train_idx, test_idx = test_idx,
time = "date")
result
#> BorgRisk Assessment
#> ===================
#>
#> Status: VALID (no hard violations)
#> Hard violations: 0
#> Soft inflations: 0
#> Train indices: 70 rows
#> Test indices: 30 rows
#> Inspected at: 2026-03-17 13:44:35
#>
#> No risks detected.

Fix: Use chronological splits where all test data comes after training:
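A chronological split in base R: order by the time column, then cut at the desired proportion.

```r
ts_data <- data.frame(
  date = seq(as.Date("2020-01-01"), by = "day", length.out = 100),
  value = cumsum(rnorm(100))
)

# Order by time, then take the earliest 70% as training
ord <- order(ts_data$date)
train_idx <- ord[1:70]
test_idx  <- ord[71:100]

# Every test date is strictly after the last training date
max(ts_data$date[train_idx]) < min(ts_data$date[test_idx])  # TRUE
```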
### CV Contamination

What: Cross-validation folds contain test indices, or folds overlap incorrectly.
Why it matters: Nested CV requires the outer test set to be completely held out from all inner training.
Detection: Check if any fold’s training indices intersect with held-out test set.
Supported objects:

- `caret::trainControl` - checks `$index` and `$indexOut`
- `rsample::vfold_cv` and other `rset` objects
- `rsample::rsplit` objects
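The core check can be reproduced in a few lines of base R (the fold contents here are illustrative):

```r
# Held-out test set
test_idx <- 81:100

# fold2's training indices reach into the held-out test set
folds <- list(fold1 = 1:40, fold2 = 41:85)

# Flag any fold whose indices intersect the test set
contaminated <- vapply(folds,
                       function(f) length(intersect(f, test_idx)) > 0,
                       logical(1))
names(folds)[contaminated]  # "fold2"
```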
### Model Scope Violation

What: Model was trained on more rows than claimed training set.
Why it matters: Model saw test data during training, even if indirectly (e.g., through hyperparameter tuning on full data).
Detection: Compare `nrow(trainingData)` or `length(fitted.values)` to `length(train_idx)`.

Supported objects: `lm`, `glm`, `ranger`, `caret::train`, parsnip models, workflows.
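The row-count check is easy to reproduce by hand for an `lm` fit (the data here is synthetic):

```r
set.seed(1)
d <- data.frame(x = rnorm(100), y = rnorm(100))
train_idx <- 1:70

# BAD: model fitted on all 100 rows despite a 70-row training set
fit_all <- lm(y ~ x, data = d)
length(fitted(fit_all)) == length(train_idx)  # FALSE -> scope violation

# GOOD: fit on the training rows only
fit_train <- lm(y ~ x, data = d[train_idx, ])
length(fitted(fit_train)) == length(train_idx)  # TRUE
```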
## Soft Inflations

These bias results but may not completely invalidate them. Model ranking might be preserved even if absolute metrics are optimistic.
### Proxy Target Leakage

What: Feature has correlation 0.95-0.99 with target.
Why warning not error: May be a legitimate strong predictor. Requires domain knowledge to judge.
Detection: Same as direct leakage, different threshold.
# Strong but not extreme correlation
proxy <- data.frame(
x = rnorm(100),
outcome = rnorm(100)
)
proxy$strong_predictor <- proxy$outcome + rnorm(100, sd = 0.3) # r ~ 0.96
result <- borg_inspect(proxy, train_idx = 1:70, test_idx = 71:100, target = "outcome")
result
#> BorgRisk Assessment
#> ===================
#>
#> Status: VALID (no hard violations)
#> Hard violations: 0
#> Soft inflations: 1
#> Train indices: 70 rows
#> Test indices: 30 rows
#> Inspected at: 2026-03-17 13:44:35
#>
#> --- SOFT INFLATIONS (warnings) ---
#>
#> [1] target_leakage_proxy
#> Feature 'strong_predictor' has correlation 0.959 with target 'outcome'. May be a proxy for outcome.
#>     Source: data.frame$strong_predictor

Action: Review whether the feature should be available at prediction time in production.
### Spatial Proximity

What: Test points are very close to training points in geographic space.
Why it matters: Spatial autocorrelation means nearby points share variance. Model learns local patterns that don’t generalize to distant locations.
Detection: Compute minimum distance from each test point to nearest training point. Flag if < 1% of spatial spread.
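The distance check itself can be reproduced in base R (the data below is synthetic and separate from the example that follows):

```r
set.seed(42)
train_pts <- matrix(runif(140, 0, 100), ncol = 2)  # 70 training points
test_pts  <- matrix(runif(60, 0, 100),  ncol = 2)  # 30 test points

# Pairwise distances; rows 71-100 are test, columns 1-70 are train
d <- as.matrix(dist(rbind(train_pts, test_pts)))
min_d <- apply(d[71:100, 1:70], 1, min)  # nearest training point per test point

# Values below 1% of the training spread would trigger the warning
spread <- max(dist(train_pts))
sum(min_d < 0.01 * spread)
```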
set.seed(42)
spatial <- data.frame(
lon = runif(100, 0, 100),
lat = runif(100, 0, 100),
value = rnorm(100)
)
# Random split intermixes nearby points
train_idx <- sample(100, 70)
test_idx <- setdiff(1:100, train_idx)
result <- borg_inspect(spatial, train_idx = train_idx, test_idx = test_idx,
coords = c("lon", "lat"))
result
#> BorgRisk Assessment
#> ===================
#>
#> Status: VALID (no hard violations)
#> Hard violations: 0
#> Soft inflations: 1
#> Train indices: 70 rows
#> Test indices: 30 rows
#> Inspected at: 2026-03-17 13:44:35
#>
#> --- SOFT INFLATIONS (warnings) ---
#>
#> [1] spatial_overlap
#> 93% of test points fall within the training region convex hull. Consider spatial blocking.
#>     Source: data.frame

Fix: Use spatial blocking:
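A minimal spatial-blocking sketch in base R: assign each point to a coarse grid cell and hold out whole cells, so test points are never surrounded by training neighbors.

```r
set.seed(42)
spatial <- data.frame(
  lon = runif(100, 0, 100),
  lat = runif(100, 0, 100),
  value = rnorm(100)
)

# Assign each point to a cell of a coarse 2 x 2 grid
cell <- interaction(cut(spatial$lon, breaks = 2),
                    cut(spatial$lat, breaks = 2),
                    drop = TRUE)

# Hold out one entire spatial block as the test set
test_cells <- levels(cell)[1]
test_idx   <- which(cell %in% test_cells)
train_idx  <- which(!cell %in% test_cells)
```

Dedicated packages (e.g. blockCV, spatialsample) implement more sophisticated versions of the same idea.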
### Spatial Overlap

What: Test region falls inside training region’s convex hull.
Why it matters: Interpolation is easier than extrapolation. Model performance on “surrounded” test points overestimates performance on truly new regions.
Detection: Compute convex hull of training points, count test points inside.
Threshold: Warning if > 50% of test points fall inside training hull.
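The hull check can be reproduced with `chull()` plus a small ray-casting point-in-polygon helper (all names and data below are illustrative):

```r
# Ray-casting point-in-polygon test for a single point (px, py)
# against polygon vertices (vx, vy)
point_in_poly <- function(px, py, vx, vy) {
  n <- length(vx)
  inside <- FALSE
  j <- n
  for (i in seq_len(n)) {
    if (((vy[i] > py) != (vy[j] > py)) &&
        (px < (vx[j] - vx[i]) * (py - vy[i]) / (vy[j] - vy[i]) + vx[i])) {
      inside <- !inside
    }
    j <- i
  }
  inside
}

set.seed(42)
train <- data.frame(lon = runif(70, 0, 100), lat = runif(70, 0, 100))
test  <- data.frame(lon = runif(30, 20, 80), lat = runif(30, 20, 80))

hull <- chull(train$lon, train$lat)  # vertex indices of the training hull
inside <- mapply(point_in_poly, test$lon, test$lat,
                 MoreArgs = list(vx = train$lon[hull], vy = train$lat[hull]))

mean(inside)  # fraction of test points inside the training hull
```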
### Inappropriate CV Scheme

What: Using random k-fold CV when data has spatial, temporal, or group structure.
Why it matters: Random folds break dependencies artificially, leading to optimistic error estimates.
# Diagnose data dependencies
spatial <- data.frame(
lon = runif(200, 0, 100),
lat = runif(200, 0, 100),
response = rnorm(200)
)
diagnosis <- borg_diagnose(spatial, coords = c("lon", "lat"), target = "response",
verbose = FALSE)
diagnosis@recommended_cv
#> [1] "random"

Fix: Use `borg()` to generate appropriate blocked CV folds.
## Summary
| Risk Type | Severity | Detection Method | Fix |
|---|---|---|---|
| `index_overlap` | Hard | Index intersection | Use `setdiff()` |
| `duplicate_rows` | Hard | Row hashing | Deduplicate or group |
| `preprocessing_leak` | Hard | Parameter comparison | Fit on train only |
| `target_leakage` | Hard | Correlation > 0.99 | Remove feature |
| `group_leakage` | Hard | Group intersection | Group-aware split |
| `temporal_leak` | Hard | Timestamp comparison | Chronological split |
| `cv_contamination` | Hard | Fold index check | Rebuild folds |
| `model_scope` | Hard | Row count | Refit on train only |
| `proxy_leakage` | Soft | Correlation 0.95-0.99 | Domain review |
| `spatial_proximity` | Soft | Distance check | Spatial blocking |
| `spatial_overlap` | Soft | Convex hull | Geographic split |
## Working with Results
# Create result with violations
result <- borg_inspect(
data.frame(x = 1:100, y = rnorm(100)),
train_idx = 1:60,
test_idx = 51:100
)
# Summary
cat("Valid:", result@is_valid, "\n")
#> Valid: FALSE
cat("Hard violations:", result@n_hard, "\n")
#> Hard violations: 1
cat("Soft warnings:", result@n_soft, "\n")
#> Soft warnings: 0
# Individual risks
for (risk in result@risks) {
cat("\n", risk$type, "(", risk$severity, "):\n", sep = "")
cat(" ", risk$description, "\n")
if (!is.null(risk$affected)) {
cat(" Affected:", head(risk$affected, 5), "...\n")
}
}
#>
#> index_overlap(hard_violation):
#> Train and test indices overlap (10 shared indices). This invalidates evaluation.
#> Affected: 51 52 53 54 55 ...
# Tabular format
as.data.frame(result)
#> type severity
#> 1 index_overlap hard_violation
#> description
#> 1 Train and test indices overlap (10 shared indices). This invalidates evaluation.
#> source_object n_affected
#> 1 train_idx/test_idx         10

## See Also

- `vignette("quickstart")` - Basic usage
- `vignette("frameworks")` - Framework integration