Bayesian variable selection with geomc.vs

Vivekananda Roy

2026-02-27

Introduction

Bayesian variable selection is an approach to identifying important predictors in regression models, particularly when the number of predictors \(p\) is large relative to the number of observations \(n\). The geomc.vs() function implements geometric MCMC (Roy 2024) for high-dimensional Bayesian variable selection in linear regression. In particular, geomc.vs implements the geometric MH algorithm of Roy (2024), with specific choices for \(f\) and \(g\), for sampling from a popular ‘spike and slab’ Bayesian variable selection model. The details of the variable selection model and the MH algorithm can be found in Roy (2024); the appendix of this document also gives a description of the model. This vignette illustrates the use of geomc.vs.

The Variable Selection Problem

Given:

  • Response vector \(\mathbf{y} \in \mathbb{R}^n\)
  • Predictor matrix \(\mathbf{X} \in \mathbb{R}^{n \times p}\)

We want to identify which subset of the \(p\) predictors are truly associated with the response. This is especially challenging when \(p \gg n\) (high-dimensional setting).

Why Bayesian Variable Selection?

Advantages:

  • Provides marginal (posterior) inclusion probabilities for each predictor rather than a single selected model
  • Quantifies uncertainty about which variables belong in the model
  • Yields several model summaries (median probability model, weighted average model, coefficient estimates)
  • Scales to high-dimensional settings with \(p \gg n\)

library(geommc)

Basic Usage

The geomc.vs() function requires minimal inputs:

result <- geomc.vs(
  X = X,              # n x p predictor matrix
  y = y               # n-vector response
)

Example 1: Small Sparse Model

Let’s start with a simple example where only 3 out of 100 predictors are truly important.

Data Generation

# Set parameters
n <- 50          # Sample size
p <- 100         # Number of predictors
nonzero <- 3     # Number of true predictors
trueidx <- 1:3   # Indices of true predictors
nonzero.value <- 4  # Effect size

# True regression coefficients
TrueBeta <- numeric(p)
TrueBeta[trueidx] <- nonzero.value

# Generate correlated predictors
rho <- 0.5
set.seed(3)
xone <- matrix(rnorm(n * p), n, p)
X <- sqrt(1 - rho) * xone + sqrt(rho) * rnorm(n)

# Generate response
intercept <- 0.5
sigma_true <- 1
y <- intercept + X %*% TrueBeta + rnorm(n, 0, sigma_true)

cat("Data dimensions:\n")
#> Data dimensions:
cat("  n (observations):", n, "\n")
#>   n (observations): 50
cat("  p (predictors):", p, "\n")
#>   p (predictors): 100
cat("  Number of true non-zero predictors:", nonzero, "\n")
#>   Number of true non-zero predictors: 3
cat("  True coefficient indices:", paste(trueidx, collapse = ", "), "\n")
#>   True coefficient indices: 1, 2, 3
cat("  True coefficient values:", nonzero.value, "\n")
#>   True coefficient values: 4

Run geomc.vs for Variable Selection

# Run geomc.vs
set.seed(123)
result <- geomc.vs(
  X = X,
  y = y,
  model.summary = TRUE # Compute model summaries
)

Examining the Results

If model.summary is set to TRUE, the geomc.vs() function returns a comprehensive list of results:

names(result)
#> [1] "samples"         "acceptance.rate" "log.p"           "mip"            
#> [5] "median.model"    "beta.mean"       "wmip"            "wam"            
#> [9] "beta.wam"

Key components:

  • samples: the MCMC samples of the inclusion indicators \(\mathbf{\gamma}\), one sampled model per row
  • acceptance.rate: the empirical MH acceptance rate
  • log.p: the log-posterior values (up to an additive constant) at the sampled models
  • mip: the marginal inclusion probability of each variable
  • median.model: the variables with mip above model.threshold
  • beta.mean: an estimate of the posterior mean of the regression coefficients (the first element is the intercept)
  • wmip: the weighted marginal inclusion probabilities
  • wam: the weighted average model, i.e., the variables with wmip above model.threshold
  • beta.wam: a weighted estimate of the posterior mean of the regression coefficients

On the other hand, if model.summary is set to FALSE (the default), geomc.vs returns only samples, acceptance.rate, and log.p.
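
For instance, a minimal call with the defaults returns only the raw MCMC output (a sketch; res_min is just an illustrative object name):

set.seed(123)
res_min <- geomc.vs(X = X, y = y)   # model.summary = FALSE by default
names(res_min)                      # expected: "samples", "acceptance.rate", "log.p"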

Marginal Inclusion Probabilities (MIP)

The MIP for variable \(j\) is the posterior probability that variable \(j\) should be included in the model:

\[\text{MIP}_j = P(\beta_j \neq 0 | \mathbf{y}, \mathbf{X})\] The formula used by geomc.vs to compute MIP is given in the appendix.
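
Because mip is just the Monte Carlo average of the inclusion indicators, it can also be recomputed directly from the returned samples matrix. A minimal sketch, assuming result$samples stores the sampled models as a 0/1 (or logical) matrix with one iteration per row; small discrepancies from result$mip are possible depending on how burn-in is handled internally:

gamma_mat <- as.matrix(result$samples)   # sampled inclusion indicators
mip_manual <- colMeans(gamma_mat)        # proportion of sampled models including each variable
mip_manual[trueidx]                      # compare with result$mip[trueidx]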

# Top 10 variables by MIP
top_vars <- order(result$mip, decreasing = TRUE)[1:10]

cat("Top 10 variables by Marginal Inclusion Probability:\n\n")
#> Top 10 variables by Marginal Inclusion Probability:
mip_df <- data.frame(
  Variable = top_vars,
  MIP = round(result$mip[top_vars], 4),
  True = ifelse(top_vars %in% trueidx, "Yes", "No")
)
print(mip_df, row.names = FALSE)
#>  Variable  MIP True
#>         3 0.94  Yes
#>         2 0.92  Yes
#>         1 0.90  Yes
#>        79 0.12   No
#>        95 0.12   No
#>         9 0.04   No
#>        67 0.04   No
#>        29 0.02   No
#>        86 0.02   No
#>         4 0.00   No

Visualizing Marginal Inclusion Probabilities

par(mfrow = c(1, 1))

# Plot MIPs
plot(1:p, result$mip, 
     type = "h", lwd = 2, col = "gray",
     xlab = "Variable Index", 
     ylab = "MIP",
     main = "Marginal Inclusion Probabilities",
     ylim = c(0, 1))

# Highlight true variables
points(trueidx, result$mip[trueidx], 
       col = "red", pch = 19, cex = 2)

# Add threshold line
abline(h = 0.5, col = "blue", lty = 2, lwd = 2)

legend("topright", 
       legend = c("True variables", "MIP > 0.5 threshold"),
       col = c("red", "blue"), 
       pch = c(19, NA), 
       lty = c(NA, 2),
       lwd = 2)

The Median Probability Model

The returned median probability model includes all variables with mip > model.threshold (default: 0.5):

cat("Median Probability Model:\n")
#> Median Probability Model:
cat("  Number of variables selected:", length(result$median.model), "\n")
#>   Number of variables selected: 3
cat("  Variable indices:", paste(result$median.model, collapse = ", "), "\n\n")
#>   Variable indices: 1, 2, 3
# Check if we recovered the true model
correctly_identified <- all(trueidx %in% result$median.model)
false_positives <- setdiff(result$median.model, trueidx)
false_negatives <- setdiff(trueidx, result$median.model)

cat("Model Recovery:\n")
#> Model Recovery:
cat("  All true variables identified:", correctly_identified, "\n")
#>   All true variables identified: TRUE
cat("  Number of false positives:", length(false_positives), "\n")
#>   Number of false positives: 0
cat("  Number of false negatives:", length(false_negatives), "\n")
#>   Number of false negatives: 0
if (length(false_positives) > 0) {
  cat("  False positive indices:", paste(false_positives, collapse = ", "), "\n")
}
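
Equivalently, the median probability model can be reconstructed by thresholding the MIPs. A minimal sketch, assuming the default model.threshold of 0.5:

which(result$mip > 0.5)   # should match result$median.model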

Posterior Mean of Coefficients

# Compare estimated vs true coefficients for true variables
cat("Coefficient Estimates (True Variables):\n\n")
#> Coefficient Estimates (True Variables):
coef_comparison <- data.frame(
  Variable = trueidx,
  True_Beta = TrueBeta[trueidx],
  Posterior_Mean = round(result$beta.mean[(1+trueidx)], 3),
  Beta_WAM = round(result$beta.wam[(1+trueidx)], 3),
  MIP = round(result$mip[trueidx], 4)
)
print(coef_comparison)
#>   Variable True_Beta Posterior_Mean Beta_WAM  MIP
#> 1        1         4          3.491    3.894 0.90
#> 2        2         4          3.818    4.128 0.92
#> 3        3         4          3.844    4.046 0.94
intercept_comparison<- data.frame(
  True_intercept =intercept,
  Posterior_Mean = round(result$beta.mean[1], 3),
  Beta_WAM = round(result$beta.wam[1], 3)
)
print(intercept_comparison)
#>   True_intercept Posterior_Mean Beta_WAM
#> 1            0.5          0.418     0.49

The formulas used by geomc.vs to compute beta.mean and beta.wam are given in the appendix.

Model Space Exploration

# Model sizes visited
model_sizes <- apply(result$samples, 1, sum)

hist(model_sizes, breaks = 20, 
     main = "Distribution of Model Sizes Visited",
     xlab = "Number of Variables in Model",
     col = "lightblue", border = "white")
abline(v = nonzero, col = "red", lwd = 3, lty = 2)
legend("topright", 
       legend = "True model size",
       col = "red", lty = 2, lwd = 2)
cat("\nModel Size Summary:\n")
#> 
#> Model Size Summary:
cat("  Mean model size:", round(mean(model_sizes), 2), "\n")
#>   Mean model size: 3.12
cat("  Median model size:", median(model_sizes), "\n")
#>   Median model size: 3
cat("  True model size:", nonzero, "\n")
#>   True model size: 3

Example 2: Sparse Data Structures and Tuning Parameters

This example demonstrates how to use geomc.vs() for variable selection with sparse matrices, such as the document-term matrices commonly found in text mining applications. The tuning parameters allowed by geomc.vs() are also discussed. In particular, we’ll show how to:

  • supply the design matrix as a sparse dgCMatrix and compare its memory footprint with that of a dense matrix
  • run geomc.vs() with the default symmetric random walk base (symm = TRUE)
  • run geomc.vs() with an asymmetric random walk base (symm = FALSE) and user-specified move probabilities (move.prob)

Data Generation

Generate Sparse Data

We’ll create a document-term matrix where most entries are zero (sparse), simulating a text classification problem.

# Problem dimensions
n <- 80        # Number of documents
p <- 200       # Vocabulary size (number of words)
sparsity <- 0.95  # 95% of entries are zero

# True important words (only 5 words matter)
true_words <- c(10, 25, 50, 100, 150)
true_effects <- c(2.5, -2.0, 3.0, -1.8, 2.2)

# Create sparse design matrix (document-term matrix)
# Most words don't appear in most documents
X_dense <- matrix(0, n, p)

# Add word counts (Poisson distributed when word appears)
set.seed(3)
for (i in 1:n) {
  # Each document has about 5% of words present
  present_words <- sample(1:p, size = round(p * (1 - sparsity)))
  X_dense[i, present_words] <- rpois(length(present_words), lambda = 2)
}

# Ensure true words appear with higher frequency
for (word in true_words) {
  appears_in <- sample(1:n, size = round(0.7 * n))
  X_dense[appears_in, word] <- rpois(length(appears_in), lambda = 3)
}

# Remove zero-variance columns
col_vars <- apply(X_dense, 2, var)
nonzero_cols <- which(col_vars > 0)

cat("Original number of columns:", p, "\n")
#> Original number of columns: 200
cat("Columns with zero variance:", sum(col_vars == 0), "\n")
#> Columns with zero variance: 4
cat("Columns retained:", length(nonzero_cols), "\n\n")
#> Columns retained: 196
# Keep mapping from new to old indices
old_to_new <- rep(NA, p)
old_to_new[nonzero_cols] <- 1:length(nonzero_cols)

# Update X_dense to only include non-zero variance columns
X_dense <- X_dense[, nonzero_cols, drop = FALSE]

# Update true_words indices to reflect new column positions
true_words_old <- true_words
true_words <- match(true_words_old, nonzero_cols)
true_words <- true_words[!is.na(true_words)]  # Keep only those that survived

cat("Original true words:", paste(true_words_old, collapse = ", "), "\n")
#> Original true words: 10, 25, 50, 100, 150
cat("New true words indices:", paste(true_words, collapse = ", "), "\n")
#> New true words indices: 9, 24, 49, 98, 147
cat("True words retained:", length(true_words), "out of", length(true_words_old), "\n\n")
#> True words retained: 5 out of 5
# Update true_effects to match retained true words
true_effects <- true_effects[!is.na(match(true_words_old, nonzero_cols))]

# Update p to reflect actual number of columns
p <- ncol(X_dense)
cat("Updated p:", p, "\n\n")
#> Updated p: 196
library(Matrix)
# Convert to sparse matrix (dgCMatrix)
X_sparse <- Matrix(X_dense, sparse = TRUE)

# Check sparsity
actual_sparsity <- 1 - (nnzero(X_sparse) / (n * p))
cat("Actual sparsity:", round(actual_sparsity * 100, 1), "%\n")
#> Actual sparsity: 93.8 %
cat("Non-zero elements:", nnzero(X_sparse), "\n")
#> Non-zero elements: 968
cat("Matrix class:", class(X_sparse), "\n")
#> Matrix class: dgCMatrix

Visualize Sparse Matrix Structure

# Visualize the sparse matrix pattern
image(
    x = 1:min(100, p),
    y = 1:min(50, n),
    z = as.matrix(t(X_sparse[1:min(50, n), 1:min(100, p)])),
    main = "Sparse Matrix Pattern (50 docs × 100 words)",
    ylab = "Documents",
    xlab = "Words",
    col = colorRampPalette(c("white", "blue", "darkblue"))(100)
)

Memory Comparison

One of the main advantages of sparse matrices is memory efficiency:

# Compare memory usage
size_sparse <- object.size(X_sparse)
size_dense <- object.size(X_dense)

cat("Memory usage:\n")
#> Memory usage:
cat("  Sparse matrix:", format(size_sparse, units = "Kb"), "\n")
#>   Sparse matrix: 13.6 Kb
cat("  Dense matrix:", format(size_dense, units = "Kb"), "\n")
#>   Dense matrix: 122.7 Kb
cat("  Reduction:", 
    round(100 * (1 - as.numeric(size_sparse) / as.numeric(size_dense)), 1), 
    "%\n")
#>   Reduction: 88.9 %

Generate Response Variable

# Create coefficient vector (now with updated p)
beta_true <- numeric(p)
beta_true[true_words] <- true_effects

# Generate response (document sentiment/category)
set.seed(3)
linear_predictor <- X_sparse %*% beta_true
y <- as.numeric(linear_predictor) + rnorm(n, sd = 1.5)

cat("Response variable summary:\n")
#> Response variable summary:
print(summary(y))
#>     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
#> -15.4385  -0.7504   6.3207   6.8944  14.3634  29.8603

Variable Selection using geomc.vs with a Symmetric Random Walk Base

By default (symm=TRUE), geomc.vs uses a symmetric random walk base pmf (for details, see Roy (2024)). We now apply geomc.vs() to the sparse matrix:

# Run variable selection
set.seed(123)
result <- geomc.vs(
  X = X_sparse,         # Sparse dgCMatrix
  y = y,
  model.summary = TRUE # Compute model summaries (default: FALSE)
)

Understanding the Output

MCMC Samples

The samples matrix contains binary indicators for which variables are in the model at each iteration:

# Dimensions
cat("Sample matrix dimensions:", dim(result$samples), "\n")
#> Sample matrix dimensions: 50 196
cat("(rows = iterations, columns = variables)\n\n")
#> (rows = iterations, columns = variables)
# First few iterations
cat("First 5 iterations (first 10 variables):\n")
#> First 5 iterations (first 10 variables):
print(result$samples[1:5, 1:20])
#> 5 x 20 sparse Matrix of class "lgCMatrix"
#>                                             
#> [1,] . . . . . . . . . . . . . . . . . . . .
#> [2,] . . . . . . . . . . . . . . . . . . . .
#> [3,] . . . . . . . . . . . . . . . . . . . .
#> [4,] . . . . . . . . . . . . . . . . . . . .
#> [5,] . . . . . . . . . . . . . . . . . . . .
# Each row is a model
cat("\nIteration 1 includes variables:", which(result$samples[1, ] == 1), "\n")
#> 
#> Iteration 1 includes variables:
cat("Iteration 2 includes variables:", which(result$samples[2, ] == 1), "\n")
#> Iteration 2 includes variables: 147
# The proportion of accepted proposals
cat("\nAcceptance rate:", round(result$acceptance.rate, 3), "\n")
#> 
#> Acceptance rate: 0.14

Log Posterior Values

The log.p vector contains the log posterior probability for each sampled model:

# Trace plot of log posterior
plot(result$log.p, type = "l",
     xlab = "Iteration", ylab = "Log Posterior",
     main = "Trace of Log Posterior Probability")
cat("\nLog posterior summary:\n")
#> 
#> Log posterior summary:
cat("  Min:", round(min(result$log.p), 2), "\n")
#>   Min: -172.59
cat("  Max:", round(max(result$log.p), 2), "\n")
#>   Max: -55.03
cat("  Mean:", round(mean(result$log.p), 2), "\n")
#>   Mean: -69.62

Results Analysis

Marginal Inclusion Probabilities

As mentioned before, the MIP of a variable is the proportion of sampled models that include it.

# Identify top words by MIP
top_n <- 20
top_indices <- order(result$mip, decreasing = TRUE)[1:top_n]

cat("Top", top_n, "words by Marginal Inclusion Probability:\n\n")
#> Top 20 words by Marginal Inclusion Probability:
cat("(Note: Word indices are in the reduced matrix space)\n\n")
#> (Note: Word indices are in the reduced matrix space)
mip_table <- data.frame(
  Rank = 1:top_n,
  Word_Index = top_indices,
  MIP = round(result$mip[top_indices], 4),
  True_Word = ifelse(top_indices %in% true_words, "Yes", "No"),
  True_Effect = ifelse(
    top_indices %in% true_words,
    sprintf("%.2f", beta_true[top_indices]),
    ""
  )
)
print(mip_table, row.names = FALSE)
#>  Rank Word_Index  MIP True_Word True_Effect
#>     1        147 0.98       Yes        2.20
#>     2         24 0.94       Yes       -2.00
#>     3         49 0.90       Yes        3.00
#>     4          9 0.88       Yes        2.50
#>     5         98 0.86       Yes       -1.80
#>     6          1 0.02        No            
#>     7          2 0.00        No            
#>     8          3 0.00        No            
#>     9          4 0.00        No            
#>    10          5 0.00        No            
#>    11          6 0.00        No            
#>    12          7 0.00        No            
#>    13          8 0.00        No            
#>    14         10 0.00        No            
#>    15         11 0.00        No            
#>    16         12 0.00        No            
#>    17         13 0.00        No            
#>    18         14 0.00        No            
#>    19         15 0.00        No            
#>    20         16 0.00        No
# Check recovery
cat("\nRecovery of true words:\n")
#> 
#> Recovery of true words:
for (i in seq_along(true_words)) {
  word_idx <- true_words[i]
  cat(sprintf("  Word %3d (original %3d): MIP = %.4f, Rank = %2d\n",
              word_idx,
              true_words_old[i],
              result$mip[word_idx],
              which(order(result$mip, decreasing = TRUE) == word_idx)))
}
#>   Word   9 (original  10): MIP = 0.8800, Rank =  4
#>   Word  24 (original  25): MIP = 0.9400, Rank =  2
#>   Word  49 (original  50): MIP = 0.9000, Rank =  3
#>   Word  98 (original 100): MIP = 0.8600, Rank =  5
#>   Word 147 (original 150): MIP = 0.9800, Rank =  1

Visualizing MIPs

par(mfrow = c(1, 1), mar = c(4, 4, 3, 1))

# Plot all MIPs
plot(
  1:p, result$mip,
  type = "h", col = "gray70", lwd = 1,
  xlab = "Word Index (in reduced matrix)",
  ylab = "MIP",
  main = "Marginal Inclusion Probabilities (All Words)",
  ylim = c(0, 1)
)
points(true_words, result$mip[true_words], 
       col = "red", pch = 19, cex = 2)
abline(h = 0.5, col = "blue", lty = 2, lwd = 2)
legend("topright",
       legend = c("True word", "MIP > 0.5"),
       col = c("red", "blue"),
       pch = c(19, NA),
       lty = c(NA, 2),
       lwd = 2)

Visualizing WMIPs

geomc.vs also returns the weighted marginal inclusion probabilities (wmip).

par(mfrow = c(1, 1), mar = c(4, 4, 3, 1))

# Plot all WMIPs
plot(
  1:p, result$wmip,
  type = "h", col = "gray70", lwd = 1,
  xlab = "Word Index (in reduced matrix)",
  ylab = "WMIP",
  main = "Weighted Marginal Inclusion Probabilities (All Words)",
  ylim = c(0, 1)
)
points(true_words, result$wmip[true_words], 
       col = "red", pch = 19, cex = 2)
abline(h = 0.5, col = "blue", lty = 2, lwd = 2)
legend("topright",
       legend = c("True word", "WMIP > 0.5"),
       col = c("red", "blue"),
       pch = c(19, NA),
       lty = c(NA, 2),
       lwd = 2)

Model Selection Performance

# Median probability model
selected_words <- result$median.model

cat("Median Probability Model:\n")
#> Median Probability Model:
cat("  Words selected:", length(selected_words), "\n")
#>   Words selected: 5
cat("  Word indices (in reduced matrix):", paste(selected_words, collapse = ", "), "\n\n")
#>   Word indices (in reduced matrix): 9, 24, 49, 98, 147
# Confusion matrix
true_positive <- sum(true_words %in% selected_words)
false_positive <- length(selected_words) - true_positive
false_negative <- length(true_words) - true_positive
true_negative <- p - length(true_words) - false_positive

cat("Classification Performance:\n")
#> Classification Performance:
cat("  True Positives:", true_positive, "/", length(true_words), "\n")
#>   True Positives: 5 / 5
cat("  False Positives:", false_positive, "\n")
#>   False Positives: 0
cat("  False Negatives:", false_negative, "\n")
#>   False Negatives: 0
cat("  Sensitivity:", round(true_positive / length(true_words), 3), "\n")
#>   Sensitivity: 1
cat("  Precision:", 
    round(true_positive / max(1, length(selected_words)), 3), "\n")
#>   Precision: 1

Weighted Average Model

geomc.vs also returns the weighted average model (wam), which includes all variables with wmip > model.threshold (default: 0.5).

# Weighted average model
selected_words <- result$wam

cat("Weighted Average Model:\n")
#> Weighted Average Model:
cat("  Words selected:", length(selected_words), "\n")
#>   Words selected: 5
cat("  Word indices (in reduced matrix):", paste(selected_words, collapse = ", "), "\n\n")
#>   Word indices (in reduced matrix): 9, 24, 49, 98, 147
# Confusion matrix
true_positive <- sum(true_words %in% selected_words)
false_positive <- length(selected_words) - true_positive
false_negative <- length(true_words) - true_positive
true_negative <- p - length(true_words) - false_positive

cat("Classification Performance:\n")
#> Classification Performance:
cat("  True Positives:", true_positive, "/", length(true_words), "\n")
#>   True Positives: 5 / 5
cat("  False Positives:", false_positive, "\n")
#>   False Positives: 0
cat("  False Negatives:", false_negative, "\n")
#>   False Negatives: 0
cat("  Sensitivity:", round(true_positive / length(true_words), 3), "\n")
#>   Sensitivity: 1
cat("  Precision:", 
    round(true_positive / max(1, length(selected_words)), 3), "\n")
#>   Precision: 1

Coefficient Estimates

cat("Coefficient Estimates for True Words:\n\n")
#> Coefficient Estimates for True Words:
coef_table <- data.frame(
  Word_New = true_words,
  Word_Original = true_words_old[1:length(true_words)],
  True_Effect = true_effects,
  Post_Mean = round(result$beta.mean[(1+true_words)], 3),
  Beta_WAM = round(result$beta.wam[(1+true_words)], 3),
  MIP = round(result$mip[true_words], 4),
  WMIP = round(result$wmip[true_words], 4),
  Selected = ifelse(true_words %in% selected_words, "Yes", "")
)
print(coef_table, row.names = FALSE)
#>  Word_New Word_Original True_Effect Post_Mean Beta_WAM  MIP WMIP Selected
#>         9            10         2.5     2.164    2.464 0.88    1      Yes
#>        24            25        -2.0    -1.874   -1.983 0.94    1      Yes
#>        49            50         3.0     2.690    3.004 0.90    1      Yes
#>        98           100        -1.8    -1.597   -1.857 0.86    1      Yes
#>       147           150         2.2     2.174    2.163 0.98    1      Yes

Model Space Exploration

# Sizes of visited models
model_sizes <- apply(result$samples, 1, sum)

hist(
  model_sizes,
  breaks = 30,
  main = "Distribution of Model Sizes Visited",
  xlab = "Number of Words in Model",
  ylab = "Frequency",
  col = "lightblue",
  border = "white"
)
abline(v = length(true_words), col = "red", lwd = 3, lty = 2)
abline(v = median(model_sizes), col = "blue", lwd = 2, lty = 2)
legend("topleft",
       legend = c(
         paste("True size =", length(true_words)),
         paste("Median =", median(model_sizes))
       ),
       col = c("red", "blue"),
       lty = 2,
       lwd = 2)
cat("\nModel size summary:\n")
#> 
#> Model size summary:
cat("  Mean:", round(mean(model_sizes), 2), "\n")
#>   Mean: 4.58
cat("  Median:", median(model_sizes), "\n")
#>   Median: 5
cat("  Range:", range(model_sizes), "\n")
#>   Range: 0 6

Key Takeaways

  • geomc.vs() works directly with a sparse dgCMatrix design matrix, which can substantially reduce memory use.
  • With the default symmetric random walk base, both the median probability model and the weighted average model recovered all 5 true words with no false positives.
  • The coefficient estimates beta.mean and beta.wam are close to the true effects.

Variable Selection using geomc.vs with an Asymmetric Random Walk Base

geomc.vs can also implement an asymmetric random walk base. This is specified by setting symm=FALSE and indicating the vector of (‘addition’, ‘deletion’, ‘swap’) move probabilities via the argument move.prob.

# Run variable selection
set.seed(123)
result_asym <- geomc.vs(
  X = X_sparse,         # Sparse dgCMatrix
  y = y,
  symm = FALSE,
  move.prob = c(0.4,0.4,0.2),
  model.summary = TRUE # Compute model summaries (default: FALSE)
)
cat("\nAcceptance rate:", round(result_asym$acceptance.rate, 3), "\n")
#> 
#> Acceptance rate: 0.14
# Identify top words by MIP
top_n <- 20
top_indices <- order(result_asym$mip, decreasing = TRUE)[1:top_n]

cat("Top", top_n, "words by Marginal Inclusion Probability:\n\n")
#> Top 20 words by Marginal Inclusion Probability:
cat("(Note: Word indices are in the reduced matrix space)\n\n")
#> (Note: Word indices are in the reduced matrix space)
mip_table <- data.frame(
  Rank = 1:top_n,
  Word_Index = top_indices,
  MIP = round(result_asym$mip[top_indices], 4),
  True_Word = ifelse(top_indices %in% true_words, "Yes", "No"),
  True_Effect = ifelse(
    top_indices %in% true_words,
    sprintf("%.2f", beta_true[top_indices]),
    ""
  )
)
print(mip_table, row.names = FALSE)
#>  Rank Word_Index  MIP True_Word True_Effect
#>     1        147 0.98       Yes        2.20
#>     2         24 0.96       Yes       -2.00
#>     3         49 0.94       Yes        3.00
#>     4          9 0.90       Yes        2.50
#>     5         98 0.86       Yes       -1.80
#>     6        154 0.04        No            
#>     7          1 0.00        No            
#>     8          2 0.00        No            
#>     9          3 0.00        No            
#>    10          4 0.00        No            
#>    11          5 0.00        No            
#>    12          6 0.00        No            
#>    13          7 0.00        No            
#>    14          8 0.00        No            
#>    15         10 0.00        No            
#>    16         11 0.00        No            
#>    17         12 0.00        No            
#>    18         13 0.00        No            
#>    19         14 0.00        No            
#>    20         15 0.00        No
# Check recovery
cat("\nRecovery of true words:\n")
#> 
#> Recovery of true words:
for (i in seq_along(true_words)) {
  word_idx <- true_words[i]
  cat(sprintf("  Word %3d (original %3d): MIP = %.4f, Rank = %2d\n",
              word_idx,
              true_words_old[i],
              result_asym$mip[word_idx],
              which(order(result_asym$mip, decreasing = TRUE) == word_idx)))
}
#>   Word   9 (original  10): MIP = 0.9000, Rank =  4
#>   Word  24 (original  25): MIP = 0.9600, Rank =  2
#>   Word  49 (original  50): MIP = 0.9400, Rank =  3
#>   Word  98 (original 100): MIP = 0.8600, Rank =  5
#>   Word 147 (original 150): MIP = 0.9800, Rank =  1

Visualizing MIPs

par(mfrow = c(1, 1), mar = c(4, 4, 3, 1))

# Plot all MIPs (asymmetric base)
plot(
  1:p, result_asym$mip,
  type = "h", col = "gray70", lwd = 1,
  xlab = "Word Index (in reduced matrix)",
  ylab = "MIP",
  main = "Marginal Inclusion Probabilities (All Words)",
  ylim = c(0, 1)
)
points(true_words, result_asym$mip[true_words], 
       col = "red", pch = 19, cex = 2)
abline(h = 0.5, col = "blue", lty = 2, lwd = 2)
legend("topright",
       legend = c("True words", "MIP > 0.5"),
       col = c("red", "blue"),
       pch = c(19, NA),
       lty = c(NA, 2),
       lwd = 2)

Model Selection Performance

# Median probability model
selected_words <- result_asym$median.model

cat("Median Probability Model:\n")
#> Median Probability Model:
cat("  Words selected:", length(selected_words), "\n")
#>   Words selected: 5
cat("  Word indices (in reduced matrix):", paste(selected_words, collapse = ", "), "\n\n")
#>   Word indices (in reduced matrix): 9, 24, 49, 98, 147
# Confusion matrix
true_positive <- sum(true_words %in% selected_words)
false_positive <- length(selected_words) - true_positive
false_negative <- length(true_words) - true_positive
true_negative <- p - length(true_words) - false_positive

cat("Classification Performance:\n")
#> Classification Performance:
cat("  True Positives:", true_positive, "/", length(true_words), "\n")
#>   True Positives: 5 / 5
cat("  False Positives:", false_positive, "\n")
#>   False Positives: 0
cat("  False Negatives:", false_negative, "\n")
#>   False Negatives: 0
cat("  Sensitivity:", round(true_positive / length(true_words), 3), "\n")
#>   Sensitivity: 1
cat("  Precision:", 
    round(true_positive / max(1, length(selected_words)), 3), "\n")
#>   Precision: 1

Coefficient Estimates

cat("Coefficient Estimates for True Words:\n\n")
#> Coefficient Estimates for True Words:
coef_table <- data.frame(
  Word_New = true_words,
  Word_Original = true_words_old[1:length(true_words)],
  True_Effect = true_effects,
  Post_Mean = round(result_asym$beta.mean[(1+true_words)], 3),
  Beta_WAM = round(result_asym$beta.wam[(1+true_words)], 3),
  MIP = round(result_asym$mip[true_words], 4),
  WMIP = round(result_asym$wmip[true_words], 4),
  Selected = ifelse(true_words %in% selected_words, "Yes", "")
)
print(coef_table, row.names = FALSE)
#>  Word_New Word_Original True_Effect Post_Mean Beta_WAM  MIP WMIP Selected
#>         9            10         2.5     2.210    2.464 0.90    1      Yes
#>        24            25        -2.0    -1.908   -1.983 0.96    1      Yes
#>        49            50         3.0     2.796    3.004 0.94    1      Yes
#>        98           100        -1.8    -1.597   -1.857 0.86    1      Yes
#>       147           150         2.2     2.165    2.163 0.98    1      Yes

Model Space Exploration

# Sizes of visited models
model_sizes <- apply(result_asym$samples, 1, sum)

hist(
  model_sizes,
  breaks = 30,
  main = "Distribution of Model Sizes Visited",
  xlab = "Number of Words in Model",
  ylab = "Frequency",
  col = "lightblue",
  border = "white"
)
abline(v = length(true_words), col = "red", lwd = 3, lty = 2)
abline(v = median(model_sizes), col = "blue", lwd = 2, lty = 2)
legend("topleft",
       legend = c(
         paste("True size =", length(true_words)),
         paste("Median =", median(model_sizes))
       ),
       col = c("red", "blue"),
       lty = 2,
       lwd = 2)
cat("\nModel size summary:\n")
#> 
#> Model size summary:
cat("  Mean:", round(mean(model_sizes), 2), "\n")
#>   Mean: 4.68
cat("  Median:", median(model_sizes), "\n")
#>   Median: 5
cat("  Range:", range(model_sizes), "\n")
#>   Range: 0 6

We see that geomc.vs, with either of the two random walk base pmfs, successfully recovers all true variables while controlling false positives. For a detailed analysis of the performance of geomc.vs in several ultra-high dimensional simulated and real data examples, see Roy (2024).

Additional Parameters

The geomc.vs() function accepts several additional arguments:

  • initial: the initial model (the set of active variables) for the Markov chain (default: the null model).
  • n.iter: the number of MCMC samples to draw (default: 50).
  • burnin: the number of burn-in samples; the median probability model and the other model summaries are computed from the post-burn-in samples only (default: 1).
  • eps: the value of the epsilon perturbation (default: 0.5).
  • lam0: the precision parameter for the normal prior on \(\beta_0\) (default: 0, corresponding to the improper uniform prior).
  • a0, b0: the shape and scale parameters of the Inverse-Gamma prior on \(\sigma^2\) (defaults: 0 and 0, corresponding to the prior \(1/\sigma^2\)).
  • lam: the slab precision parameter for the normal prior on the \(\beta_i\)’s (default: \(n/p^2\), as suggested by the theoretical results of Li, Dutta, and Roy (2023)).
  • w: the prior inclusion probability of each variable (default: \(\sqrt{n}/p\), as suggested by Li, Dutta, and Roy (2023)).

Key Parameters

result <- geomc.vs(
  X = X,
  y = y,
  initial = c(1,2),         # Initial model (the set of active variables)
  n.iter = 100,        # Number of MCMC iterations
  burnin = 2,           # Burn-in period (default: 1)
  eps = 0.5,            # Perturbation parameter (default: 0.5)
  symm = FALSE,          # Symmetric random walk base? (default: TRUE)
  move.prob = c(.3, .3, .4),  # Probabilities for add/delete/swap moves
  lam0 = 0,             # Prior parameter (default: 0)
  a0 = 0,               # Prior shape parameter (default: 0)
  b0 = 0,               # Prior scale parameter (default: 0)
  lam = n / p^2,        # Prior parameter (default: n/p^2)
  w = sqrt(n) / p,      # Prior inclusion probability (default: sqrt(n)/p)
  model.summary = TRUE, # Compute model summaries (default: FALSE)
  model.threshold = 0.5, # Threshold for median model (default: 0.5)
  show.progress = TRUE # Show progress during sampling
)

Summary

  1. geomc.vs() performs Bayesian variable selection using geometric MCMC
  2. Returns marginal inclusion probabilities (MIPs) for each variable
  3. Provides multiple model summaries: median model, WAM, posterior means
  4. Scales to high-dimensional problems (\(p \gg n\))
  5. Quantifies uncertainty about variable selection

Appendix: The Model

Roy (2024) considers the following linear regression for variable selection:

\[y_i | \mathbf{X}, \beta_0,\mathbf{\beta},\mathbf{\gamma},\sigma^2,w,\lambda \stackrel{ind}{\sim} N(\beta_0 + \sum_{j=1}^p \gamma_j \beta_j x_{ij}, \sigma^2), i=1,\dots, n,\] or in vector notation: \[\mathbf{y} | \mathbf{X}, \beta_0,\mathbf{\beta},\mathbf{\gamma},\sigma^2,w,\lambda \sim N(\beta_0\mathbf{1} + \mathbf{X}_{\mathbf{\gamma}}\mathbf{\beta}_{\mathbf{\gamma}},\sigma^2I_n),\]

where \(\mathbf{\gamma} = (\gamma_1, \ldots, \gamma_p) \in \{0, 1\}^p\) are binary inclusion indicators, \(\mathbf{\beta} = (\beta_1, \ldots, \beta_p)\) is the vector of regression coefficients, \(\mathbf{X}_{\mathbf{\gamma}}\) is the \(n \times |\mathbf{\gamma}|\) submatrix of \(\mathbf{X}\) consisting of those columns of \(\mathbf{X}\) for which \(\gamma_j=1\), and, similarly, \(\mathbf{\beta}_{\mathbf{\gamma}}\) is the \(|\mathbf{\gamma}|\)-dimensional subvector of \(\mathbf{\beta}\) corresponding to \(\mathbf{\gamma}\), with \(|\mathbf{\gamma}| = \sum_{j=1}^p \gamma_j\) being the model size.

Next, the hierarchical Gaussian linear model in Roy (2024) places priors on the regression coefficients as well as on the model space as follows:

Prior Specification

  • Coefficient prior: \(\beta_i|\beta_0,\mathbf{\gamma},\sigma^2,w,\lambda \stackrel{ind}{\sim} N(0, \gamma_i\sigma^2/\lambda),~i=1,\ldots,p,\)

  • Intercept prior: \(\beta_0|\mathbf{\gamma},\sigma^2,w,\lambda \sim N(0, \sigma^2/\lambda_0)\)

  • Error variance: \(\sigma^2|\mathbf{\gamma},w,\lambda \sim \text{Inverse-Gamma}(a_0, b_0)\)

  • Model prior: \(\gamma_i|w,\lambda \stackrel{iid}{\sim} \text{Bernoulli} (w), i=1,\dots,p\)

The density \(\pi(\sigma^2)\) of \(\sigma^2 \sim\) Inverse-Gamma\((a_0, b_0)\) has the form \[\pi(\sigma^2) \propto (\sigma^2)^{-a_0-1} \exp(-b_0/\sigma^2).\] geomc.vs also allows the non-informative prior \[\pi(\beta_{0}, \sigma^2|\mathbf{\gamma}, w) \propto 1 / \sigma^{2},\] which is obtained by setting \(\lambda_0=a_0=b_0=0\).
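
In terms of the geomc.vs() arguments, this non-informative prior corresponds to setting lam0, a0, and b0 to zero, which are also the package defaults. A minimal sketch (the object name res_flat is illustrative):

# lam0 = a0 = b0 = 0 gives the non-informative prior
# pi(beta_0, sigma^2 | gamma, w) proportional to 1/sigma^2
res_flat <- geomc.vs(X = X, y = y, lam0 = 0, a0 = 0, b0 = 0)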

Posterior Computation

The posterior is:

\[P(\mathbf{\gamma} | \mathbf{y}) \propto P(\mathbf{y} | \mathbf{\gamma}) P(\mathbf{\gamma}).\]

geomc.vs explores this posterior using the geometric MCMC proposed in Roy (2024). geomc.vs returns the Markov chain samples, the empirical MH acceptance rate, and the log-posterior values (up to an additive constant) at the observed MCMC samples.

If model.summary is set to TRUE, geomc.vs also returns other model summaries. In particular, it returns the marginal inclusion probabilities (mip) computed by the Monte Carlo average as well as the weighted marginal inclusion probabilities (wmip) computed with weights \[w_k = P(\gamma^{(k)}|\mathbf{y})\Big/\sum_{l=1}^K P(\gamma^{(l)}|\mathbf{y}),\; k=1,2,\ldots,K,\] where \(\gamma^{(k)}, k=1,2,\ldots,K\) are the distinct models sampled. Thus, if \(N_k\) is the number of times the \(k\)th distinct model \(\gamma^{(k)}\) is repeated in the MCMC samples, the mip for the \(j\)th variable is \[\texttt{mip}_j =\sum_{k=1}^{K} N_k I(\gamma^{(k)}_j = 1)/\texttt{n.iter}\] and the wmip for the \(j\)th variable is \[\texttt{wmip}_j = \sum_{k=1}^K w_k I(\gamma^{(k)}_j = 1).\]
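
The sketch below illustrates how these quantities could be recomputed by hand from the returned output, assuming result$samples holds the sampled inclusion indicators (one iteration per row) and result$log.p holds the matching log-posterior values:

S <- as.matrix(result$samples)                 # 0/1 inclusion indicators
key <- apply(S, 1, paste, collapse = "")       # label each sampled model
first <- !duplicated(key)                      # one row per distinct model
N_k <- as.vector(table(key)[key[first]])       # repetition counts N_k
# posterior weights w_k from the log-posterior values (known up to a constant);
# subtract the maximum before exponentiating for numerical stability
logp_k <- result$log.p[first]
w_k <- exp(logp_k - max(logp_k))
w_k <- w_k / sum(w_k)
S_k <- S[first, , drop = FALSE]                # the distinct models
mip_manual  <- colSums(N_k * S_k) / nrow(S)    # Monte Carlo average
wmip_manual <- colSums(w_k * S_k)              # weighted average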

The median.model is the model containing the variables \(j\) with \[\texttt{mip}_j >\texttt{model.threshold},\] and the wam is the model containing the variables \(j\) with \[\texttt{wmip}_j > \texttt{model.threshold}.\] Note that \(E(\mathbf{\beta}|\mathbf{\gamma}, \mathbf{y})\), the conditional posterior mean of \(\mathbf{\beta}\) given a model \(\mathbf{\gamma}\), is available in closed form (see Li, Dutta, and Roy (2023) for details). geomc.vs returns two estimates (beta.mean, beta.wam) of the posterior mean of \(\mathbf{\beta}\), computed as \[ \texttt{beta.mean} = \sum_{k=1}^{K} N_k E(\mathbf{\beta}|\gamma^{(k)},\mathbf{y})/\texttt{n.iter}\] and \[\texttt{beta.wam} = \sum_{k=1}^K w_k E(\mathbf{\beta}|\gamma^{(k)},\mathbf{y}),\] respectively.

References

Li, Dongjin, Somak Dutta, and Vivekananda Roy. 2023. “Model Based Screening Embedded Bayesian Variable Selection for Ultra-High Dimensional Settings.” Journal of Computational and Graphical Statistics 32 (1): 61–73.
Roy, Vivekananda. 2024. “A Geometric Approach to Informed MCMC Sampling.” arXiv Preprint arXiv:2406.09010.