The DHSr package offers advanced tools designed for repeated regression analysis, particularly well-suited for demographic and health survey datasets, though it can be applied to other datasets as well. It includes capabilities for compound statistical analyses using Stein’s estimate and features a unique approach to spatial data analysis, enabling the identification of clusters within hot-hot and cold-cold regions. The package also includes a basic map visualization feature, primarily intended for visualizing calculations. For users seeking more control over mapping, the derived variables from the package can be used with other mapping packages for enhanced visualization options
You can install the released version of DHSr
from CRAN
with: install.packages(“DHSr”)
The Replm2
function allows you to run linear regressions
across all locations in a dataset. Below is a basic example using a
simple formula.
here are the parameters :
data
: The dataset containing the variables you want to
analyze.
formula
:This specifies the model you want to fit, using
standard R formula syntax (e.g., years_education ~ gender_female +
household_wealth).
location_var
: The name of the variable in your dataset
that identifies different locations or groups (e.g., district_code). The
function will run a separate regression for each unique value in this
variable.This parameter is mandatory. If the data is not spatial data,
but if the data has other grouping varible that also can be passed here
to perform repeated regression for each group
response_distribution
: Specifies the type of regression
model to fit. Commonly, “normal” is used for linear regression. For
logistic or other types of models, you would specify an appropriate
distribution.
family
: (Optional) Used when you are performing a
generalized linear model (GLM), such as logistic regression. For linear
regression with a normal distribution, this can be set to NULL.
library(DHSr)
#> DHSr package loaded with all dependencies.
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
set.seed(123)
dummy_data <- data.frame(
years_education = rnorm(100, 12, 3), # Represents years of education
gender_female = rbinom(100, 1, 0.5), # 1 = Female, 0 = Male
household_wealth = sample(1:5, 100, replace = TRUE), # Wealth index from 1 to 5
district_code = sample(1:10, 100, replace = TRUE) # Represents district codes
)%>% arrange(district_code)
shapefile_path <- "C:/Users/91911/Documents/India%3A_State_Boundary_2021_.shp"
# Define a simple regression formula
formula <- years_education ~ gender_female + household_wealth + household_wealth:gender_female
# Run the regression across all locations (districts)
results1 <- Replm2(dummy_data, formula, "district_code", "normal")
# Print the results
print(head(results1))
#> # A tibble: 6 × 16
#> location `estimate_(Intercept)` estimate_gender_female estimate_household_we…¹
#> <int> <dbl> <dbl> <dbl>
#> 1 1 22.4 -11.6 -2.81
#> 2 2 13.0 2.29 -0.229
#> 3 3 12.4 1.89 -0.389
#> 4 4 12.1 -0.116 -0.0620
#> 5 5 19.5 -3.46 -2.00
#> 6 6 11.3 1.67 0.517
#> # ℹ abbreviated name: ¹estimate_household_wealth
#> # ℹ 12 more variables: `estimate_gender_female:household_wealth` <dbl>,
#> # `estimate_R-squared` <dbl>, `std_error_(Intercept)` <dbl>,
#> # std_error_gender_female <dbl>, std_error_household_wealth <dbl>,
#> # `std_error_gender_female:household_wealth` <dbl>,
#> # `std_error_R-squared` <dbl>, `p_value_(Intercept)` <dbl>,
#> # p_value_gender_female <dbl>, p_value_household_wealth <dbl>, …
The Replmre2
function is designed to run mixed-effects
models across all locations defined by a categorical variable in your
dataset. It supports linear mixed-effects models (LMEMs), making it
suitable for continuous response variables and normally distributed
errors.
random_effect_var
: Specifies the grouping variable for
random effects. Note : one common issue ‘singularity error’ occurs with
complex random effects; to avoid try dropping that particular location
from the data.
set.seed(123)
# Create HHid (Household ID), grouping every 3-4 records, and convert to character
dummy_data$HHid <- as.character(rep(1:20, each = 5, length.out = nrow(dummy_data)))
# Test the function with full_rank dataset
data <- dummy_data
# Define a simple regression formula
formula <- years_education ~ gender_female + household_wealth:gender_female
location_var <- "district_code"
random_effect_var <- "HHid"
results <- DHSr::Replmre2(data, formula, location_var, random_effect_var)
#> Warning: 'r.squaredGLMM' now calculates a revised statistic. See the help page.
print(head(results))
#> # A tibble: 6 × 16
#> location `estimate_(Intercept)` estimate_gender_female estimate_gender_femal…¹
#> <int> <dbl> <dbl> <dbl>
#> 1 1 12.6 -1.73 0.522
#> 2 2 12.3 3.03 -1.08
#> 3 3 11.1 3.27 -0.110
#> 4 4 11.9 0.0698 0.124
#> 5 5 12.0 4.04 -1.06
#> 6 6 12.9 0.906 -0.660
#> # ℹ abbreviated name: ¹`estimate_gender_female:household_wealth`
#> # ℹ 12 more variables: `estimate_Marginal R-squared` <dbl>,
#> # `estimate_Conditional R-squared` <dbl>, `std_error_(Intercept)` <dbl>,
#> # std_error_gender_female <dbl>,
#> # `std_error_gender_female:household_wealth` <dbl>,
#> # `std_error_Marginal R-squared` <dbl>,
#> # `std_error_Conditional R-squared` <dbl>, `p_value_(Intercept)` <dbl>, …
Repglmre2
: Extends Replmre2 to non-normal regressions
like logistic or Poisson
# Create a binary outcome variable for years of education
dummy_data$education_binary <- ifelse(dummy_data$years_education > 11, 1, 0)
# Define a logistic regression formula
formula <- education_binary ~ gender_female + household_wealth:gender_female
location_var <- "district_code"
random_effect_var <- "HHid"
# Run the logistic mixed-effects model across all locations (districts)
results <- DHSr::Repglmre2(data = dummy_data, formula = formula,
location_var = location_var, random_effect_var = random_effect_var,
family = binomial())
# Print the results
print(head(results))
#> # A tibble: 6 × 16
#> location `estimate_(Intercept)` estimate_gender_female estimate_gender_femal…¹
#> <int> <dbl> <dbl> <dbl>
#> 1 1 0.75 -0.228 2.17e- 2
#> 2 2 0.800 0.200 -7.77e-12
#> 3 3 0.462 0.449 -1.96e-17
#> 4 4 0.714 0.514 -1.43e- 1
#> 5 5 0.750 0.114 -4.55e- 2
#> 6 6 0.715 0.131 -1.35e- 1
#> # ℹ abbreviated name: ¹`estimate_gender_female:household_wealth`
#> # ℹ 12 more variables: `estimate_Marginal R-squared` <dbl>,
#> # `estimate_Conditional R-squared` <dbl>, `std_error_(Intercept)` <dbl>,
#> # std_error_gender_female <dbl>,
#> # `std_error_gender_female:household_wealth` <dbl>,
#> # `std_error_Marginal R-squared` <dbl>,
#> # `std_error_Conditional R-squared` <dbl>, `p_value_(Intercept)` <dbl>, …
listw
: Essential for cluster map creation and Stein’s
beta; optional if weights exist.
spdeplisa
: Computes LISA values using native spdep
functions. It’s recommended to run this before cluster_map for
consistency, though it may not be needed if LISA signs and indicators
are precomputed
cluster_map
: Produces clusters from spatial data, with
cluster IDs and admin boundary listings (e.g., districts or states). The
min_ parameter sets the minimum number of admin boundaries per cluster,
and min_area sets the minimum cluster area. The lisa_cutoff adjusts the
threshold for clustering, useful for continuous clusters. The function
provides a basic map; for more dynamic mapping, users can use otyher
packages ( such as ggplot2::geom_sf) after obtaining cluster IDs. The
lisa_label and label specify the type of clusters (e.g., High-High).
##cluster_map parameter:
dataset
: Input spatial dataset ( “sf” “data.frame”
object) lisa_value
: Variable used for LISA analysis
lisa_label
: Variable indicating LISA significance
label
: Specific label for clustering (e.g., “High-High”)
lisa_cutoff
: Threshold for LISA values
location_var
: Column name for location variable
location_name
: Column name for location name
level2
: Optional second level of geography
id_start
: Starting ID for cluster numbering
comparison
: Comparison operator for cutoff
min_area
: Minimum area for valid clusters
min_
: Minimum number of districts per cluster
title
: Title for the resulting plot subtitle
:
Subtitle for the resulting plot footnote
: Footnote for the
resulting plot legend_position
: Position of the legend on
the plot color_scheme
: Color scheme for the plot
# Load the sf package if not already loaded
library(sf)
#> Warning: package 'sf' was built under R version 4.3.3
#> Linking to GEOS 3.11.2, GDAL 3.8.2, PROJ 9.3.1; sf_use_s2() is TRUE
# Group by district_code and calculate mean household_wealth
mean_wealth <- dummy_data %>%
group_by(district_code) %>%
summarize(mean_household_wealth = mean(household_wealth, na.rm = TRUE))
# Use the custom listw function with loc_shape and loc_data parameters
listw <- listw(
shapefile_path = shapefile_path, # Use the predefined shapefile path
data = mean_wealth,
loc_shape = "objectid", # The key column to be used for merging in the shapefile
loc_data = "district_code", # The column in data that will be used if loc_shape doesn't exist
weight_function = function(d) exp(-d / 0.2)
)
#> Reading layer `India%3A_State_Boundary_2021_' from data source
#> `C:\Users\91911\Documents\India%3A_State_Boundary_2021_.shp'
#> using driver `ESRI Shapefile'
#> Simple feature collection with 36 features and 19 fields
#> Geometry type: MULTIPOLYGON
#> Dimension: XY
#> Bounding box: xmin: 68.12382 ymin: 6.754078 xmax: 97.4116 ymax: 37.08835
#> Geodetic CRS: WGS 84
#> Warning in nb2listw(res$neighbours, glist = res$weights, style = style, : zero
#> sum general weights
# Read the shapefile
shape_data<- st_read(shapefile_path)
#> Reading layer `India%3A_State_Boundary_2021_' from data source
#> `C:\Users\91911\Documents\India%3A_State_Boundary_2021_.shp'
#> using driver `ESRI Shapefile'
#> Simple feature collection with 36 features and 19 fields
#> Geometry type: MULTIPOLYGON
#> Dimension: XY
#> Bounding box: xmin: 68.12382 ymin: 6.754078 xmax: 97.4116 ymax: 37.08835
#> Geodetic CRS: WGS 84
# Ensure uniqueness by REGCODE
shape_data <- shape_data %>% distinct(objectid, .keep_all = TRUE)
#objectid is the location-shape(loc_shape) in listw
# Merge shapefile data with the provided dataset
mean_wealth_geom<- merge(shape_data, mean_wealth, by.x = "objectid", by.y="district_code", all.y = TRUE)
lisa2 <- Spdeplisa(mean_wealth_geom, "mean_household_wealth",listw)
result <- cluster_map(lisa2, "lisa_I", "sign_combination3", "High-High", 0.4, "objectid", "name", "objectid", id_start = 0, comparison = ">", min_area = 0, min_ = 1,
title = "Wealth Clusters ", subtitle = "State Level", footnote = NULL, legend_position = "right", color_scheme = "D")
plot<- result$plot
clusters <-result$dataset_with_clusters
#merge the OLS beta's for each location
cluster_beta <- merge(clusters, results1, by.x = "objectid", by.y = "location")
# Example usage
results_with_stein_beta <- DHSr::stein_beta(
data = cluster_beta,
cluster_id = "cluster_id",
beta = "estimate_gender_female" # Replace with the actual column name for beta in your data
)