DHSr: Advanced Regression, Compound Statistical Analysis, and Spatial Data Tools

The DHSr package offers advanced tools designed for repeated regression analysis, particularly well-suited for demographic and health survey datasets, though it can be applied to other datasets as well. It includes capabilities for compound statistical analyses using Stein’s estimate and features a unique approach to spatial data analysis, enabling the identification of clusters within hot-hot and cold-cold regions. The package also includes a basic map visualization feature, primarily intended for visualizing calculations. For users seeking more control over mapping, the derived variables from the package can be used with other mapping packages for enhanced visualization options

Installation

You can install the released version of DHSr from CRAN with: install.packages(“DHSr”)

Example: Replm2

Running Regressions Across All Locations

The Replm2 function allows you to run linear regressions across all locations in a dataset. Below is a basic example using a simple formula.

here are the parameters :

data: The dataset containing the variables you want to analyze.

formula:This specifies the model you want to fit, using standard R formula syntax (e.g., years_education ~ gender_female + household_wealth).

location_var: The name of the variable in your dataset that identifies different locations or groups (e.g., district_code). The function will run a separate regression for each unique value in this variable.This parameter is mandatory. If the data is not spatial data, but if the data has other grouping varible that also can be passed here to perform repeated regression for each group

response_distribution : Specifies the type of regression model to fit. Commonly, “normal” is used for linear regression. For logistic or other types of models, you would specify an appropriate distribution.

family: (Optional) Used when you are performing a generalized linear model (GLM), such as logistic regression. For linear regression with a normal distribution, this can be set to NULL.

library(DHSr)
#> DHSr package loaded with all dependencies.
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
set.seed(123)
dummy_data <- data.frame(
  years_education = rnorm(100, 12, 3),    # Represents years of education
  gender_female = rbinom(100, 1, 0.5),    # 1 = Female, 0 = Male
  household_wealth = sample(1:5, 100, replace = TRUE),  # Wealth index from 1 to 5
  district_code = sample(1:10, 100, replace = TRUE) # Represents district codes
)%>% arrange(district_code)

shapefile_path <- "C:/Users/91911/Documents/India%3A_State_Boundary_2021_.shp"
# Define a simple regression formula
formula <- years_education ~ gender_female + household_wealth + household_wealth:gender_female

# Run the regression across all locations (districts)
results1 <- Replm2(dummy_data, formula, "district_code", "normal")

# Print the results
print(head(results1))
#> # A tibble: 6 × 16
#>   location `estimate_(Intercept)` estimate_gender_female estimate_household_we…¹
#>      <int>                  <dbl>                  <dbl>                   <dbl>
#> 1        1                   22.4                -11.6                   -2.81  
#> 2        2                   13.0                  2.29                  -0.229 
#> 3        3                   12.4                  1.89                  -0.389 
#> 4        4                   12.1                 -0.116                 -0.0620
#> 5        5                   19.5                 -3.46                  -2.00  
#> 6        6                   11.3                  1.67                   0.517 
#> # ℹ abbreviated name: ¹estimate_household_wealth
#> # ℹ 12 more variables: `estimate_gender_female:household_wealth` <dbl>,
#> #   `estimate_R-squared` <dbl>, `std_error_(Intercept)` <dbl>,
#> #   std_error_gender_female <dbl>, std_error_household_wealth <dbl>,
#> #   `std_error_gender_female:household_wealth` <dbl>,
#> #   `std_error_R-squared` <dbl>, `p_value_(Intercept)` <dbl>,
#> #   p_value_gender_female <dbl>, p_value_household_wealth <dbl>, …

Example: Replmre2

The Replmre2 function is designed to run mixed-effects models across all locations defined by a categorical variable in your dataset. It supports linear mixed-effects models (LMEMs), making it suitable for continuous response variables and normally distributed errors.

random_effect_var : Specifies the grouping variable for random effects. Note : one common issue ‘singularity error’ occurs with complex random effects; to avoid try dropping that particular location from the data.

set.seed(123)
# Create HHid (Household ID), grouping every 3-4 records, and convert to character
dummy_data$HHid <- as.character(rep(1:20, each = 5, length.out = nrow(dummy_data)))

# Test the function with full_rank dataset
data <- dummy_data
# Define a simple regression formula
formula <- years_education ~ gender_female + household_wealth:gender_female

location_var <- "district_code"
random_effect_var <- "HHid"

results <- DHSr::Replmre2(data, formula, location_var, random_effect_var)
#> Warning: 'r.squaredGLMM' now calculates a revised statistic. See the help page.
print(head(results))
#> # A tibble: 6 × 16
#>   location `estimate_(Intercept)` estimate_gender_female estimate_gender_femal…¹
#>      <int>                  <dbl>                  <dbl>                   <dbl>
#> 1        1                   12.6                -1.73                     0.522
#> 2        2                   12.3                 3.03                    -1.08 
#> 3        3                   11.1                 3.27                    -0.110
#> 4        4                   11.9                 0.0698                   0.124
#> 5        5                   12.0                 4.04                    -1.06 
#> 6        6                   12.9                 0.906                   -0.660
#> # ℹ abbreviated name: ¹`estimate_gender_female:household_wealth`
#> # ℹ 12 more variables: `estimate_Marginal R-squared` <dbl>,
#> #   `estimate_Conditional R-squared` <dbl>, `std_error_(Intercept)` <dbl>,
#> #   std_error_gender_female <dbl>,
#> #   `std_error_gender_female:household_wealth` <dbl>,
#> #   `std_error_Marginal R-squared` <dbl>,
#> #   `std_error_Conditional R-squared` <dbl>, `p_value_(Intercept)` <dbl>, …

Example: Repglmre2

Repglmre2 : Extends Replmre2 to non-normal regressions like logistic or Poisson

# Create a binary outcome variable for years of education
dummy_data$education_binary <- ifelse(dummy_data$years_education > 11, 1, 0)

# Define a logistic regression formula
formula <- education_binary ~ gender_female + household_wealth:gender_female

location_var <- "district_code"
random_effect_var <- "HHid"

# Run the logistic mixed-effects model across all locations (districts)
results <- DHSr::Repglmre2(data = dummy_data, formula = formula, 
                           location_var = location_var, random_effect_var = random_effect_var, 
                           family = binomial())

# Print the results
print(head(results))
#> # A tibble: 6 × 16
#>   location `estimate_(Intercept)` estimate_gender_female estimate_gender_femal…¹
#>      <int>                  <dbl>                  <dbl>                   <dbl>
#> 1        1                  0.75                  -0.228                2.17e- 2
#> 2        2                  0.800                  0.200               -7.77e-12
#> 3        3                  0.462                  0.449               -1.96e-17
#> 4        4                  0.714                  0.514               -1.43e- 1
#> 5        5                  0.750                  0.114               -4.55e- 2
#> 6        6                  0.715                  0.131               -1.35e- 1
#> # ℹ abbreviated name: ¹`estimate_gender_female:household_wealth`
#> # ℹ 12 more variables: `estimate_Marginal R-squared` <dbl>,
#> #   `estimate_Conditional R-squared` <dbl>, `std_error_(Intercept)` <dbl>,
#> #   std_error_gender_female <dbl>,
#> #   `std_error_gender_female:household_wealth` <dbl>,
#> #   `std_error_Marginal R-squared` <dbl>,
#> #   `std_error_Conditional R-squared` <dbl>, `p_value_(Intercept)` <dbl>, …

Example: listw, spdeplisa

listw : Essential for cluster map creation and Stein’s beta; optional if weights exist.

spdeplisa : Computes LISA values using native spdep functions. It’s recommended to run this before cluster_map for consistency, though it may not be needed if LISA signs and indicators are precomputed

cluster_map : Produces clusters from spatial data, with cluster IDs and admin boundary listings (e.g., districts or states). The min_ parameter sets the minimum number of admin boundaries per cluster, and min_area sets the minimum cluster area. The lisa_cutoff adjusts the threshold for clustering, useful for continuous clusters. The function provides a basic map; for more dynamic mapping, users can use otyher packages ( such as ggplot2::geom_sf) after obtaining cluster IDs. The lisa_label and label specify the type of clusters (e.g., High-High).

##cluster_map parameter:

dataset: Input spatial dataset ( “sf” “data.frame” object) lisa_value: Variable used for LISA analysis lisa_label: Variable indicating LISA significance label: Specific label for clustering (e.g., “High-High”) lisa_cutoff: Threshold for LISA values location_var: Column name for location variable location_name: Column name for location name level2: Optional second level of geography id_start: Starting ID for cluster numbering comparison: Comparison operator for cutoff min_area: Minimum area for valid clusters min_: Minimum number of districts per cluster title: Title for the resulting plot subtitle: Subtitle for the resulting plot footnote: Footnote for the resulting plot legend_position: Position of the legend on the plot color_scheme: Color scheme for the plot

# Load the sf package if not already loaded
library(sf)
#> Warning: package 'sf' was built under R version 4.3.3
#> Linking to GEOS 3.11.2, GDAL 3.8.2, PROJ 9.3.1; sf_use_s2() is TRUE

# Group by district_code and calculate mean household_wealth
mean_wealth <- dummy_data %>%
  group_by(district_code) %>%
  summarize(mean_household_wealth = mean(household_wealth, na.rm = TRUE))




# Use the custom listw function with loc_shape and loc_data parameters
listw <- listw(
  shapefile_path = shapefile_path,  # Use the predefined shapefile path
  data = mean_wealth, 
  loc_shape = "objectid",            # The key column to be used for merging in the shapefile
  loc_data = "district_code",       # The column in data that will be used if loc_shape doesn't exist
  weight_function = function(d) exp(-d / 0.2)
)
#> Reading layer `India%3A_State_Boundary_2021_' from data source 
#>   `C:\Users\91911\Documents\India%3A_State_Boundary_2021_.shp' 
#>   using driver `ESRI Shapefile'
#> Simple feature collection with 36 features and 19 fields
#> Geometry type: MULTIPOLYGON
#> Dimension:     XY
#> Bounding box:  xmin: 68.12382 ymin: 6.754078 xmax: 97.4116 ymax: 37.08835
#> Geodetic CRS:  WGS 84
#> Warning in nb2listw(res$neighbours, glist = res$weights, style = style, : zero
#> sum general weights


# Read the shapefile
shape_data<- st_read(shapefile_path)
#> Reading layer `India%3A_State_Boundary_2021_' from data source 
#>   `C:\Users\91911\Documents\India%3A_State_Boundary_2021_.shp' 
#>   using driver `ESRI Shapefile'
#> Simple feature collection with 36 features and 19 fields
#> Geometry type: MULTIPOLYGON
#> Dimension:     XY
#> Bounding box:  xmin: 68.12382 ymin: 6.754078 xmax: 97.4116 ymax: 37.08835
#> Geodetic CRS:  WGS 84

# Ensure uniqueness by REGCODE
shape_data <- shape_data %>% distinct(objectid, .keep_all = TRUE) 
#objectid is the location-shape(loc_shape) in listw

# Merge shapefile data with the provided dataset
mean_wealth_geom<- merge(shape_data, mean_wealth, by.x = "objectid", by.y="district_code", all.y = TRUE)


lisa2 <- Spdeplisa(mean_wealth_geom, "mean_household_wealth",listw)

result <- cluster_map(lisa2, "lisa_I", "sign_combination3", "High-High", 0.4, "objectid", "name", "objectid", id_start = 0, comparison = ">", min_area = 0, min_ = 1,
                             title = "Wealth Clusters ", subtitle = "State Level", footnote = NULL, legend_position = "right", color_scheme = "D")

plot<- result$plot
clusters <-result$dataset_with_clusters

#merge the OLS beta's for each location 
cluster_beta <- merge(clusters, results1, by.x = "objectid", by.y = "location")

# Example usage
results_with_stein_beta <- DHSr::stein_beta(
  data = cluster_beta,
  cluster_id = "cluster_id",
  beta = "estimate_gender_female"  # Replace with the actual column name for beta in your data
)