The ebrahim.gof package implements the Ebrahim-Farrington goodness-of-fit test for logistic regression models. This test is particularly effective for binary data and sparse datasets, providing an improved alternative to the traditional Hosmer-Lemeshow test.
Goodness-of-fit testing is crucial in logistic regression to assess whether the fitted model adequately describes the data. The most commonly used test is the Hosmer-Lemeshow test, but it has several limitations, among them that its result depends on the grouping choice and that it can have low power.
The Ebrahim-Farrington test addresses these limitations by using a modified Pearson chi-square statistic based on Farrington’s (1996) theoretical framework, but simplified for practical implementation with binary data.
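For reference, the unmodified Pearson chi-square statistic that the test builds on can be computed by hand. The sketch below uses only base R on simulated data; the variable names and simulation settings are illustrative assumptions, and it does not reproduce the Ebrahim-Farrington modification itself.

```r
# Illustrative data (assumed settings, not taken from the package)
set.seed(1)
n <- 500
x <- rnorm(n)
y <- rbinom(n, 1, plogis(0.5 + 1.2 * x))

fit <- glm(y ~ x, family = binomial())
p_hat <- fitted(fit)

# Ordinary Pearson chi-square: squared residuals scaled by the binomial variance
pearson_x2 <- sum((y - p_hat)^2 / (p_hat * (1 - p_hat)))

# Cross-check against the Pearson residuals reported by glm
pearson_check <- sum(residuals(fit, type = "pearson")^2)

c(by_hand = pearson_x2, from_residuals = pearson_check, n = n)
```

With ungrouped binary responses this statistic tends to sit close to n, which limits its use as a fit measure for sparse data and motivates a modified version.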
The main function ef.gof() performs the goodness-of-fit test:
# Simulate binary data
set.seed(123)
n <- 500
x <- rnorm(n)
linpred <- 0.5 + 1.2 * x
prob <- plogis(linpred) # Convert to probabilities
y <- rbinom(n, 1, prob)
# Fit logistic regression
model <- glm(y ~ x, family = binomial())
predicted_probs <- fitted(model)
# Perform Ebrahim-Farrington test
result <- ef.gof(y, predicted_probs, G = 10)
print(result)
#> Test Test_Statistic p_value
#> 1 Ebrahim-Farrington -1.250567 0.9344997
For binary data with automatic grouping, the Ebrahim-Farrington test statistic is:
\[Z_{EF} = \frac{T_{EF} - (G - 2)}{\sqrt{2(G-2)}}\]
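Where:
- \(T_{EF}\) is the modified Pearson chi-square statistic
- \(G\) is the number of groups
- under \(H_0\), \(Z_{EF}\) is approximately standard normal; as of 2.0.0 the default p-value instead refers \(T_{EF}\) to a chi-square distribution with \(G - 2\) degrees of freedom, which has the same mean \(G - 2\) and variance \(2(G-2)\)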
The null hypothesis is that the model fits the data adequately.
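The standardization can be checked against the printed output for G = 10. The sketch below, in base R, recovers \(T_{EF}\) from the printed test statistic and computes the p-value under both reference distributions. The use of \(G - 2\) degrees of freedom for the chi-square reference is an inference from the printed values, not something read from the package source.

```r
G <- 10
z_ef <- -1.250567   # Test_Statistic printed by ef.gof(y, predicted_probs, G = 10)

# Invert Z_EF = (T_EF - (G - 2)) / sqrt(2 * (G - 2))
t_ef <- z_ef * sqrt(2 * (G - 2)) + (G - 2)               # about 2.9977

# Upper-tail p-values under the two reference distributions
p_chisq  <- pchisq(t_ef, df = G - 2, lower.tail = FALSE)  # about 0.9345
p_normal <- pnorm(z_ef, lower.tail = FALSE)               # about 0.8945

c(T_EF = t_ef, p_chisq = p_chisq, p_normal = p_normal)
```

The chi-square value agrees with the p-value printed by ef.gof(), and the normal value agrees with the EF-normal row of the run.all.gof() output.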
The number of groups \(G\) can affect the test’s performance:
# Test with different numbers of groups
group_sizes <- c(4, 8, 10, 15, 20)
results <- data.frame(
  Groups = group_sizes,
  P_value = sapply(group_sizes, function(g) {
    ef.gof(y, predicted_probs, G = g)$p_value
  })
)
print(results)
#> Groups P_value
#> 1 4 0.7449740
#> 2 8 0.3317745
#> 3 10 0.9344997
#> 4 15 0.7347885
#> 5 20 0.3532473
Let’s compare the Ebrahim-Farrington test with the traditional Hosmer-Lemeshow test:
# Hosmer-Lemeshow test (requires ResourceSelection package)
if (requireNamespace("ResourceSelection", quietly = TRUE)) {
  library(ResourceSelection)
  # Perform both tests
  ef_result <- ef.gof(y, predicted_probs, G = 10)
  hl_result <- hoslem.test(y, predicted_probs, g = 10)
  # Compare results
  comparison <- data.frame(
    Test = c("Ebrahim-Farrington", "Hosmer-Lemeshow"),
    P_value = c(ef_result$p_value, hl_result$p.value),
    Test_Statistic = c(ef_result$Test_Statistic, hl_result$statistic)
  )
  print(comparison)
} else {
  cat("ResourceSelection package not available for comparison\n")
}
#> ResourceSelection 0.3-6 2023-06-27
#> Test P_value Test_Statistic
#> Ebrahim-Farrington 0.9344997 -1.250567
#> X-squared Hosmer-Lemeshow 0.9431075 2.855296
Version 2.0.0 turns the package into a full goodness-of-fit toolkit. The Directed EF (DEF) test concentrates power on calibration-curve shape directions, def.ensemble.gof() combines the DEF bases via the Cauchy combination test, and run.all.gof() runs a whole battery of GOF tests at once.
# Directed Ebrahim-Farrington test (takes the fitted model)
def.gof(model) # default poly3 basis
#> Test Basis Test_Statistic df Method
#> 1 Directed Ebrahim-Farrington poly3 0.1333555 2.058835 satterthwaite
#> p_value
#> 1 0.9394923
def.gof(model, basis = "ensemble") # combine all three bases (Cauchy)
#> Test Combiner Components k p_value
#> 1 DEF ensemble cct poly2+poly3+stukel 3 0.8921128
# Ensemble of the three DEF bases
def.ensemble.gof(model)
#> Test Combiner Components k p_value
#> 1 DEF ensemble cct poly2+poly3+stukel 3 0.8921128
def.ensemble.gof(model, add_ef = TRUE) # add the omnibus EF
#> Test Combiner Components k p_value
#> 1 DEF ensemble cct poly2+poly3+stukel+EF 4 0.9070102
run.all.gof() returns one tidy data frame, one row per test:
run.all.gof(model)
#> Test Family Statistic df p_value
#> 1 Pearson Global 500.189779915 498.000000 0.46398355
#> 2 Deviance Global 548.361194487 498.000000 0.05868791
#> 3 Osius-Rojek Standardized 0.124150785 NA 0.90119589
#> 4 Copas-RSS Standardized 0.001485027 1.000000 0.99881512
#> 5 Information-Matrix Global 0.017967854 2.000000 0.99105631
#> 6 HL Partition 2.855296455 8.000000 0.94310750
#> 7 HL-equalwidth Partition 4.860031216 8.000000 0.77242612
#> 8 Pigeon-Heyse Partition 2.877438508 9.000000 0.96895634
#> 9 EF Standardized -1.250566694 8.000000 0.93449972
#> 10 EF-normal Standardized -1.250566694 8.000000 0.89445370
#> 11 DEF.poly2 Directed 0.061762126 1.076818 0.82265331
#> 12 DEF.poly3 Directed 0.133355497 2.058835 0.93949233
#> 13 DEF.stukel Directed 0.319113754 2.002282 0.83134362
#> 14 Stukel Directed 0.030567470 2.000000 0.98483247
#> 15 Tsiatis Covariate-space 5.778041795 9.000000 0.76191087
#> 16 Xie Covariate-space 8.311608210 8.500000 0.45352565
#> 17 Pulkstenis-Robinson Covariate-space NA NA NA
#> 18 Ensemble.Vote(3DEF) Ensemble NA NA 0.89211275
#> 19 Ensemble.Univ(3DEF+EF) Ensemble NA NA 0.90701023
#> Note
#> 1
#> 2
#> 3
#> 4
#> 5
#> 6
#> 7
#> 8
#> 9
#> 10 normal reference (thesis)
#> 11
#> 12
#> 13
#> 14
#> 15
#> 16
#> 17 no categorical covariate
#> 18 CCT
#> 19 CCT
Add include_slow = TRUE to also run the opt-in slow tests (le Cessie, the GAM-based tests, Stute-Zhu, eHL, BAGofT, and the Lai & Liu standardized-power test), or pass tests = c("EF", "DEF.poly3", "HL") to run a chosen subset.
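The two ensemble rows marked CCT can be reproduced from the component p-values in the table. The sketch below implements an equal-weight Cauchy combination test in base R. The helper name cauchy_combine() is hypothetical and not part of the package, and equal weights are an assumption that happens to match the printed output.

```r
# Hypothetical helper: equal-weight Cauchy combination of p-values
cauchy_combine <- function(p) {
  stopifnot(all(p > 0 & p < 1))
  t_stat <- mean(tan((0.5 - p) * pi))  # average of Cauchy-transformed p-values
  0.5 - atan(t_stat) / pi              # upper tail of the standard Cauchy
}

# Component p-values copied from the run.all.gof() output
p_def <- c(poly2 = 0.82265331, poly3 = 0.93949233, stukel = 0.83134362)
p_ef  <- 0.93449972

p_vote <- cauchy_combine(p_def)            # about 0.8921, Ensemble.Vote(3DEF)
p_univ <- cauchy_combine(c(p_def, p_ef))   # about 0.9070, Ensemble.Univ(3DEF+EF)

c(vote = p_vote, univ = p_univ)
```

Because the transform maps small p-values to very large positive values, a single strongly significant component can dominate the combined result, while large p-values like the ones here contribute moderately negative terms.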
Most GOF tests for logistic regression are partition-based (they group the data and compare observed with expected counts), and that is the family ef.gof() and def.gof() belong to. A key property of the directed tests is that they gain power without inflating the type I error rate. In a Monte Carlo study (n = 500, 1000 replications, α = 0.05), the partition tests compare as follows:
| Test | Size (null) | Power: quadratic | Power: wrong link |
|---|---|---|---|
| Hosmer–Lemeshow (decile) | 0.060 | 0.588 | 0.179 |
| Hosmer–Lemeshow (equal-width) | 0.053 | 0.332 | 0.244 |
| Pigeon–Heyse | 0.035 | 0.535 | 0.133 |
| EF (omnibus) | 0.058 | 0.480 | 0.218 |
| Tsiatis | 0.056 | 0.574 | 0.162 |
| Xie | 0.042 | 0.557 | 0.147 |
| DEF (poly3) | 0.060 | 0.709 | 0.404 |
| DEF (ensemble, vote) | 0.066 | 0.767 | 0.468 |
DEF and its vote ensemble are the most powerful tests in the family, with estimated sizes (0.060 and 0.066) that stay close to the nominal 0.05, and on the wrong-link misfit they have more than twice the power of decile Hosmer–Lemeshow, Tsiatis, and Xie.
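To judge how close an estimated size is to the nominal level, it helps to know the Monte Carlo error of the study. The sketch below is a back-of-the-envelope check in base R, assuming independent replications and a normal approximation to the binomial; the sizes are copied from the table.

```r
alpha  <- 0.05
n_reps <- 1000

# Monte Carlo standard error of a rejection rate when the true size is alpha
mc_se <- sqrt(alpha * (1 - alpha) / n_reps)        # about 0.0069

# Approximate 95% band for an estimated size under exact size control
band <- alpha + c(-1, 1) * qnorm(0.975) * mc_se    # about 0.036 to 0.064

# Estimated sizes from the table
size <- c(HL_decile = 0.060, HL_equalwidth = 0.053, Pigeon_Heyse = 0.035,
          EF = 0.058, Tsiatis = 0.056, Xie = 0.042,
          DEF_poly3 = 0.060, DEF_ensemble = 0.066)

within_band <- size >= band[1] & size <= band[2]
data.frame(size = size, within_band = within_band)
```

Under these assumptions the Pigeon–Heyse estimate (0.035) falls slightly below the band and the DEF ensemble estimate (0.066) slightly above it, while the other six lie inside.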
Partition-based tests are intuitive and work for sparse data and continuous covariates (where the Pearson and deviance chi-square tests fail), but their result depends on the grouping choice and the simpler members (HL) can have low power. DEF keeps the intuitive fitted-probability grouping but directs the test at calibration-curve shapes, which is why it tops the table without losing size control.
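The grouping idea behind the partition family can be sketched in a few lines of base R. The example below computes a decile-of-risk Hosmer-Lemeshow statistic by hand on simulated data. The quantile-based grouping and the simulation settings are assumptions for illustration, and the grouping used internally by ef.gof() or hoslem.test() may differ in detail.

```r
# Illustrative data (assumed settings)
set.seed(1)
n <- 500
x <- rnorm(n)
y <- rbinom(n, 1, plogis(0.5 + 1.2 * x))
p_hat <- fitted(glm(y ~ x, family = binomial()))

# Partition observations into G groups by quantiles of the fitted probability
G <- 10
breaks <- quantile(p_hat, probs = seq(0, 1, length.out = G + 1))
grp <- cut(p_hat, breaks = breaks, include.lowest = TRUE)

# Observed and expected event counts per group
n_g  <- tapply(y, grp, length)
obs  <- tapply(y, grp, sum)
expd <- tapply(p_hat, grp, sum)

# Hosmer-Lemeshow statistic and its chi-square p-value on G - 2 df
hl_stat <- sum((obs - expd)^2 / (expd * (1 - expd / n_g)))
p_value <- pchisq(hl_stat, df = G - 2, lower.tail = FALSE)

data.frame(n = as.vector(n_g), observed = as.vector(obs),
           expected = round(as.vector(expd), 2))
```

Changing G or the break points changes the groups and therefore the statistic, which is the grouping dependence described above.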
Note: as of 2.0.0, ef.gof() defaults to the chi-square reference (method = "chisq"); use method = "normal" for the version 1.0.0 behaviour.
Let’s examine the power of the test to detect model misspecification:
# Function to simulate power under model misspecification
simulate_power <- function(n, beta_quad = 0.1, n_sims = 100, G = 10) {
  rejections_ef <- 0
  rejections_hl <- 0
  for (i in 1:n_sims) {
    # Generate data with quadratic term (true model)
    x <- runif(n, -2, 2)
    linpred_true <- 0 + x + beta_quad * x^2
    prob_true <- plogis(linpred_true)
    y <- rbinom(n, 1, prob_true)
    # Fit misspecified linear model (omitting quadratic term)
    model_mis <- glm(y ~ x, family = binomial())
    pred_probs <- fitted(model_mis)
    # Ebrahim-Farrington test
    ef_test <- ef.gof(y, pred_probs, G = G)
    if (ef_test$p_value < 0.05) rejections_ef <- rejections_ef + 1
    # Hosmer-Lemeshow test (if available)
    if (requireNamespace("ResourceSelection", quietly = TRUE)) {
      hl_test <- ResourceSelection::hoslem.test(y, pred_probs, g = G)
      if (hl_test$p.value < 0.05) rejections_hl <- rejections_hl + 1
    }
  }
  power_ef <- rejections_ef / n_sims
  power_hl <- if (requireNamespace("ResourceSelection", quietly = TRUE)) {
    rejections_hl / n_sims
  } else {
    NA
  }
  return(list(power_ef = power_ef, power_hl = power_hl))
}
# Calculate power for different sample sizes
sample_sizes <- c(200, 500, 1000)
power_results <- data.frame(
  n = sample_sizes,
  EbrahimFarrington_Power = sapply(sample_sizes, function(n) {
    simulate_power(n, beta_quad = 0.15, n_sims = 50)$power_ef
  })
)
if (requireNamespace("ResourceSelection", quietly = TRUE)) {
  power_results$HosmerLemeshow_Power <- sapply(sample_sizes, function(n) {
    simulate_power(n, beta_quad = 0.15, n_sims = 50)$power_hl
  })
}
print(power_results)
#> n EbrahimFarrington_Power HosmerLemeshow_Power
#> 1 200 0.06 0.12
#> 2 500 0.14 0.14
#> 3 1000 0.20 0.22
For datasets with grouped observations (multiple trials per covariate pattern), you can use the original Farrington test:
# Simulate grouped data
set.seed(456)
n_groups <- 30
m_trials <- sample(5:20, n_groups, replace = TRUE)
x_grouped <- rnorm(n_groups)
prob_grouped <- plogis(0.2 + 0.8 * x_grouped)
y_grouped <- rbinom(n_groups, m_trials, prob_grouped)
# Create data frame and fit model
data_grouped <- data.frame(
  successes = y_grouped,
  trials = m_trials,
  x = x_grouped
)
model_grouped <- glm(
  cbind(successes, trials - successes) ~ x,
  data = data_grouped,
  family = binomial()
)
predicted_probs_grouped <- fitted(model_grouped)
# Original Farrington test for grouped data
result_grouped <- ef.gof(
  y_grouped,
  predicted_probs_grouped,
  model = model_grouped,
  m = m_trials,
  G = NULL # No automatic grouping for original test
)
print(result_grouped)
#> Test Test_Statistic p_value
#> 1 Farrington-Original -1.476122 0.9300444
G specified):
m provided, G = NULL):
Farrington, C. P. (1996). On Assessing Goodness of Fit of Generalized Linear Models to Sparse Data. Journal of the Royal Statistical Society. Series B (Methodological), 58(2), 349-360.
Ebrahim, Khaled Ebrahim (2025). Goodness-of-Fits Tests and Calibration Machine Learning Algorithms for Logistic Regression Model with Sparse Data. Master’s Thesis, Alexandria University.
Hosmer, D. W., & Lemeshow, S. (2000). Applied Logistic Regression, Second Edition. New York: Wiley.
Hosmer, D. W., & Lemeshow, S. (1980). A goodness-of-fit test for the multiple logistic regression model. Communications in Statistics - Theory and Methods, 9(10), 1043–1069. https://doi.org/10.1080/03610928008827941
The Ebrahim-Farrington test provides a powerful and practical tool for assessing goodness-of-fit in logistic regression, particularly for binary data and sparse datasets. Its simplified implementation makes it accessible for routine use while maintaining strong theoretical foundations.