Most spatial modeling approaches work under the assumption that the data used for modeling are stationary. That is to say, the mean and variance of the response variable are constant in space. This assumption is often violated, especially when modeling large areas. Sometimes, a nonstationary process can be transformed into a stationary one by modeling external trends; however, this is not possible to achieve if the external trends are not constant over space. In such cases, an alternative approach is to partition the space into stationary sub-regions and model each region separately. The problem with this approach is that continuous response variables will have discontinuities in the prediction surface at the borders of the regions.
The package remap
is an implementation of a regional
modeling with border smoothing method that results in a continuous
global model. Border smoothing is accomplished by taking a weighted
average of regional model predictions near region borders. The
remap
function is also a convenient way to build
independent models in a partitioned space, even if no border smoothing
is required for the problem. See also “remap: Regionalized Models with
Spatially Smooth Predictions” published in the R Journal https://doi.org/10.32614/RJ-2023-004.
library(dplyr) # For light data wrangling
library(ggplot2) # For plots
library(maps) # For a polygon of the state of Utah
library(sf) # For spatial data manipulation
library(mgcv) # For GAM modeling
library(remap)
data(utsnow)
data(utws)
To introduce the functionality of remap, we will look at a modeling
problem for estimating snow water content in the state of Utah using
water equivalent of snow density (WESD) measurements. The
utsnow
data that is part of the package remap
contains WESD in mm water measured on April 1st, 2011 at 394 locations
within and near the state of Utah. The utws
data in
remap
is a set of polygons representing watersheds defined
by the US Geological Survey. These watersheds are defined by a hierarchy
of hydrologic unit cods (HUC) with a two-digit designation for
continental scale watersheds (HUC2). We will build a regionalized model
with remap
using HUC2 watershed regions.
utstate <- maps::map("state", region = "utah", plot = FALSE, fill = TRUE) |>
sf::st_as_sf() |>
sf::st_transform(crs = 4326)
ggplot(utws, aes(fill = HUC2)) +
geom_sf(alpha = 0.5) +
geom_sf(data = utstate, fill = "NA", size = 1) +
geom_sf(data = utsnow) +
ggtitle("Modeling Data and Regions",
"HUC2 regions are made up of smaller HUC4 regions.") +
theme_void()
\(~\)
\(~\)
\(~\)
The remap
function requires the following 4
parameters:
data
- Observations with a spatial component used for
modeling.regions
- Polygons describing the regions used to build
each model.model_function
- A function that takes a subset of the
data
and returns a model.buffer
- Observations are located within and near a
region are used to build each model. The buffer
parameter
dictates how near an observation must be to a location to be included in
that region’s model.The following parameters are optional:
region_id
- A column name of regions
which
describes which polygons to include in each region. This is helpful if
any region has polygons described in multiple rows in the
sf
object. The region_id
allows for
combinations of smaller polygons into larger regions.min_n
- The minimum number of observations required to
build a model in each region. If a region does not contain enough
observations within its border and buffer zone, the nearest
min_n
observations will be used for modeling.In this section, we use some simple linear models to model snow water
content in Utah. First a global model using all data, then a regional
model using remap
. The WESD measurement in our example data
commonly shares a log-linear relationship with elevation in mountainous
western states. Many locations in Utah have a value of 0 for WESD on
April first. Since zero snow values in log transformed variables add
another level of complexity to the modeling process, we remove them for
now and make a new dataset called utsnz
.
The relationship between the log transformed WESD and elevation of
utsnz
is visualized as:
ggplot(utsnz, aes(x = ELEVATION, y = WESD)) +
geom_point() +
geom_smooth(method = "lm") +
scale_y_log10() +
theme_minimal()
#> `geom_smooth()` using formula = 'y ~ x'
The resubstitution mean squared error (MSE) for such a model is:
The utsnz
data describes which watersheds each location
falls in with the HUC2
columns. (Note that
remap
does not require these columns in the data when
building models using the utws
regions.) Here is what the
relationship between log(WESD) and elevation looks like for each of the
HUC2 regions:
ggplot(utsnz, aes(x = ELEVATION, y = WESD)) +
facet_grid(HUC2 ~ .) +
geom_point() +
geom_smooth(method = "lm") +
scale_y_log10() +
theme_minimal()
#> `geom_smooth()` using formula = 'y ~ x'
The linear models for each HUC2 region seem to fit a little better
than the global linear model; however, it looks like HUC2 region 15 does
not have enough data to build a very confident model. Using
remap
to build models for each region, some of the nearest
observations to HUC2 region 15 are added to build a better model by
using the min_n
parameter to set the minimum number of
observations per region to 10.
For this modeling task, any observation within 20km of a region is to
be included in that region’s model using the buffer
parameter. The lm
function requires a formula
,
so formula
is added as an extra parameter in remap. Since
remap
makes smooth predictions over the entire surface, a
smooth
parameter is required in the predict
function that dictates the distance from region borders where the
smoothing process will start. We set smooth
to 10km for
this example.
t1 <- Sys.time()
lm_huc2 <- remap::remap(
utsnz, utws,
buffer = 20, min_n = 10,
model_function = lm,
formula = log(WESD) ~ ELEVATION
)
lm_huc2_mse <- mean((utsnz$WESD - exp(predict(lm_huc2, utsnz, smooth = 10)))^2)
t2 <- Sys.time()
# mse
lm_huc2_mse
#> [1] 85725.65
# runtime
round(difftime(t2, t1), 1)
#> Time difference of 0.2 secs
The output of remap
returns a list that contains a list
of models built in each region and an sf
data frame storing
the region polygons. The models for each region can be accessed
directly. For example, the coefficients for each model can be accessed
with the following code:
sapply(lm_huc2$models, function(x) x$coefficients)
#> 1 2 3 4
#> (Intercept) 0.557864695 -2.292085274 0.564719359 1.8300546
#> ELEVATION 0.001914744 0.003114987 0.002140786 0.0018686
The resulting remap model has a 29.4% reduction in resubstitution MSE using separate linear models for each of the HUC2 regions rather than the global linear model. This increase in accuracy is likely to be even more drastic when problems span greater areas, such as a model for an entire continent.
\(~\)
\(~\)
\(~\)
Many modeling techniques that partition space depend on the distance
between prediction locations and the center of each region.
However, the center is a poor characterization of irregularly shaped
regions. The border smoothing method in remap
uses the
distances to region borders rather than region centers. This
allows for regions of any shape being used for modeling. Keep in mind
that remap
requires distance calculations between each
observation and the region boundaries in both the modeling and
prediction steps. This tends to be computationally expensive with large
numbers of points and/or complex region boundaries. This section
outlines the steps taken to ameliorate the computational burden of
remap
which allow remap
to scale to large
problems.
Many spatial processes require continuous polygons (i.e. no gaps in region boundaries), which limits the degree of polygon simplification that can be achieved. One desirable feature of remap is the ability of each model to make smooth predict outside of polygons within a smoothing zone, which smooths over any small gaps in polygons that occurs in an aggressive geometrical simplification. Thus, by only preserving the macro structure of the input polygons, we can greatly speed up distance computations without losing fidelity in model predictions.
Distance calculation time can be dramatically reduced by simplifying
the polygons passed to remap
using the sf
package st_simplify
function. The function gives a warning
that can be ignored for our purposes, we are only trying to preserve the
macro structure and the regions are not near any poles. Gaps at region
borders appear, but remap has the ability to predict outside of regions
and predictions will remain smooth as long as the gaps aren’t wider than
two times the smooth
parameter. Notice how the simplified
polygons retain the basic structure of the original regions:
utws_simp <- utws |> sf::st_simplify(dTolerance = 5000)
rbind(
utws |> dplyr::mutate(TYPE = "Original Watershed Polygons"),
utws_simp |> dplyr::mutate(TYPE = "Simplified Watershed Polygons")
) |>
ggplot() +
facet_grid(.~TYPE) +
geom_sf() +
theme_void()
Simplifying the polygons doesn’t drastically change the computation time in this particular example, but some regions can contain massive polygons with details that are unnecessary for regional modeling.
redist
The remap
and predict
functions both
internally call a function called redist
to calculate
distances from points to polygons. The user can directly use the
function redist
to pre-compute distances from points to
polygons and use the pre-computed distances as direct inputs in the
remap
function. The pre-computing step greatly reduces
computational costs if multiple regional models must be made with the
same input data and polygons.
Distances from prediction locations need only be computed to polygon
boundaries for which the prediction location falls within their
smoothing zone. Buffered polygons can be used to quickly determine
candidate observations for distance calculations in each region. The
max_dist
parameter of redist
can be used to
make these buffered polygons. This drastically reduces the number of
points for which distance calculations must be performed and greatly
improves computation times.
huc2_dist_nz <- remap::redist(utsnz, utws_simp, region_id = HUC2)
head(huc2_dist_nz)
#> Units: [km]
#> 14 15 16 17
#> [1,] 0.00000 333.2641 273.7183 438.11364
#> [2,] 0.00000 430.5133 247.8148 357.75421
#> [3,] 0.00000 240.2121 285.3032 527.65980
#> [4,] 75.49279 387.8302 0.0000 64.65256
#> [5,] 80.50456 325.9622 0.0000 152.99741
#> [6,] 97.42136 175.9546 0.0000 287.59656
The newly created distance matrix can be sent to remap
and predict
as a parameter. Notice how much faster the
remap
and predict
functions run when the
distances are pre-calculated.
t1 <- Sys.time()
lm_huc2 <- remap(
utsnz, utws_simp, region_id = HUC2,
buffer = 20, min_n = 10,
model_function = lm,
formula = log(WESD) ~ ELEVATION,
distances = huc2_dist_nz
)
lm_huc2_mse <- mean(
(utsnz$WESD -
exp(predict(lm_huc2, utsnz, smooth = 10,
distances = huc2_dist_nz)))^2
)
t2 <- Sys.time()
# mse
lm_huc2_mse
#> [1] 87730.86
# runtime
round(difftime(t2, t1), 1)
#> Time difference of 0 secs
The MSE for this model is slightly different than previous regional linear model since the simplified polygons are being used. This is a small problem so simplifying polygons and precomputing distances might be unnecessary; however, these steps can make a large modeling and mapping problem feasible with limited computing resources.
Since calculating distances to polygons is independent for each
polygon, redist
can be run in parallel by setting the
cores
to a number greater than one. Models are built and
make predictions independent of one another so the remap
function and predict
method also have a cores
parameter for parallel computing. This means that distance calculations,
modeling, and predicting processes can all be run in parallel.
\(~\)
\(~\)
\(~\)
While linear models are great for illustration, many spatial modeling
approaches will require more complex regional model forms. The
remap
function is flexible enough to handle arbitrary model
inputs and outputs, with the only requirement being that the model
output must be compatible for use with the generic predict
function and return a numeric output.
The remap
function takes a model_function
as a parameter. The model_function
must have the following
two features:
sf
data point object,Note that named function arguments can be supplied at the end of the
remap function, as was the case with the formula
argument
in the linear model example shown previously. The key is that the first
unnamed parameter be dedicated to data input.
Suppose we want to make a generalized additive model (GAM) that is
limited to only making positive predictions. We also don’t want to
predict a number greater than the highest observed response in each
region to avoid over extrapolation. This can be accomplished through
wrapper functions for both gam
and predict.gam
that restrict predicted values to a specified range. We can also make
the option to return standard errors rather than predictions.
remap
has an option to combine standard errors on region
borders to find an upper bound of combined standard errors.
gam_limit <- function(data, fml) {
g_model <- mgcv::gam(fml, data = data)
upper_lim <- max(data$WESD)
out <- list(g_model = g_model, upper_lim = upper_lim)
class(out) <- "gam_limit"
return(out)
}
predict.gam_limit <- function(object, newobs, se.fit = FALSE) {
if (nrow(newobs) != 0) {
if (se.fit) {
return(predict(object$g_model, newobs, se.fit = TRUE)$se.fit)
} else {
preds <- predict(object$g_model, newobs)
preds[preds < 0] <- 0
preds[preds > object$upper_lim] <- object$upper_lim
return(preds)
}
}
return(NULL)
}
The following code tests a GAM model where elevation and splines on the sphere are used as predictors. We can use the functions written to do a GAM with remap to easily do cross validation on a global model:
# Create vector for cross-validation
set.seed(42)
cv_k <- sample(1:10, nrow(utsnow), replace = TRUE)
# Initialize predictions
gam_global_preds <- rep(as.numeric(NA), nrow(utsnow))
# Formula for global GAM
global_fml <- WESD ~ s(ELEVATION, k = 5) + s(LATITUDE, LONGITUDE, bs = "sos", k = 50)
# Build and test models with 10 fold cross-validation
for (i in 1:10) {
index <- cv_k == i
gam_global <- gam_limit(utsnow[!index, ], fml = global_fml)
gam_global_preds[index] <- predict(gam_global, utsnow[index, ])
}
# Calculate MSE
gam_global_mse <- mean((utsnow$WESD - gam_global_preds)^2)
gam_global_mse
#> [1] 23875.94
This model is much better than either of the basic linear models, even though the GAM accuracy is measured on cross-validation rather than resubstitution error and the GAM model is also modeling the zero valued observations.
First, the distances are pre-calculated so the distance calculations
aren’t repeated 10 different times when doing cross validation. The
distances return a matrix where each row corresponds to each observation
in the data. The cross validation only uses a subset of the data, so the
corresponding subset of distances should be passed to remap
during each modeling step.
HUC2 regions are used to build a regionalized GAM models with
remap
. We will reduce the knots on the splines on the
sphere from 50 to 25 so we don’t need so many degrees of freedom for
each model. The min_n
can be set to 35 to allow at least 5
degrees of freedom per model.
# Initialize predictions
gam_huc2_preds <- rep(as.numeric(NA), nrow(utsnow))
# Formula for regional GAMs
gam_huc2_fml <- WESD ~ s(ELEVATION, k = 5) +
s(LATITUDE, LONGITUDE, bs = "sos", k = 25)
# Build and test models with 10 fold cross-validation
for (i in 1:10) {
index <- cv_k == i
gam_huc2 <- remap::remap(
utsnow[!index, ], utws, region_id = HUC2,
model_function = gam_limit,
buffer = 20, min_n = 35,
distances = huc2_dist[!index, ],
fml = gam_huc2_fml
)
gam_huc2_preds[index] <- predict(
gam_huc2, utsnow[index, ],
smooth = 10,
distances = huc2_dist[index, ]
)
}
# Calculate MSE
gam_huc2_mse <- mean((utsnow$WESD - gam_huc2_preds)^2)
gam_huc2_mse
#> [1] 16620.73
The HUC2 regionalized GAM has 30.4% better MSE than the global GAM model. With the custom functions that we wrote, we can get both smooth predictions and smoothed combined standard errors.
gam_huc2 <- remap::remap(
utsnow, utws, region_id = HUC2,
model_function = gam_limit,
buffer = 20, min_n = 35,
distances = huc2_dist,
fml = gam_huc2_fml
)
predict(gam_huc2, utsnow[1:3, ], smooth = 25)
#> [1] 7.629856 248.959425 8.694482
predict(gam_huc2, utsnow[1:3, ], smooth = 25, se = TRUE, se.fit = TRUE)
#> Upper bound for standard error calculated at each location.
#> Reminder: make sure that the predict function outputs a vector of standard error values for each regional model in your remap object.
#> [1] 35.12763 42.04082 33.69042
\(~\)
\(~\)
\(~\)
A toy model is best used to show how smooth predictions work since the Utah snow water content models have extreme values and sharp changes with elevation. The toy model has 3 regional models with contrived response variables that consists of an affine combination of longitude and latitude values. The left region predicts \(lat - lon\), the bottom region predicts \(lon - lat - 0.4\), and the right region predicts \(lat - lon + 0.3\)
# Make regions
toy_regions <- sf::st_sf(
id = c("a", "b", "c"),
geometry = sf::st_sfc(
sf::st_polygon(list(matrix(c(0, 0, 2, 0, 6, 3, 4, 10, 0, 10, 0, 0)*.1,
ncol = 2, byrow = TRUE))),
sf::st_polygon(list(matrix(c(2, 0, 10, 0, 10, 4, 6, 3, 2, 0)*.1,
ncol = 2, byrow = TRUE))),
sf::st_polygon(list(matrix(c(4, 10, 6, 3, 10, 4, 10, 10, 4, 10)*.1,
ncol = 2, byrow = TRUE)))),
crs = 4326)
# Manually make a toy remap model
make_toy <- function(x) {
class(x) <- "toy_model"
return(x)
}
remap_toy_model <- list(
models = list("a" = make_toy("a"),
"b" = make_toy("b"),
"c" = make_toy("c")),
regions = toy_regions,
region_id = "id"
)
class(remap_toy_model) <- "remap"
# Make a prediction method for toy_model
predict.toy_model <- function(object, data) {
x <- sf::st_coordinates(data)[, "X"]
y <- sf::st_coordinates(data)[, "Y"]
if (object == "a") {
y - x
} else if (object == "b") {
x - y - 0.4
} else {
y - x + 0.3
}
}
# Make a grid over the regions for predictions
grd <- sf::st_make_grid(toy_regions, cellsize = .01, what = "corners") |>
sf::st_sf()
The regions cover the following area:
ggplot2::ggplot(toy_regions, aes(fill = id)) +
geom_sf(color = "black", size = 1) +
ggtitle("Toy Regions") +
theme_bw()
The remap_toy_model
object can now be used to make
predictions on the grd
object. There are 10201 points in
the grd
object but the regions are simple, so it will not
take long to find distances. Two predictions will be made, the
SHARP
predictions will have a smoothing parameter of zero
and the SMOOTH
predictions will have a smoothing parameter
set to 30km.
grd_pred <- grd |>
dplyr::mutate(SHARP = predict(remap_toy_model, grd, smooth = 0),
SMOOTH = predict(remap_toy_model, grd, smooth = 30),
LON = sf::st_coordinates(grd)[, "X"],
LAT = sf::st_coordinates(grd)[, "Y"])
The smooth predictions from the remap
object start to
become a weighted average of regional predictions when the predictions
are within 30km of a region border. The following plots show what is
happening with both the SHARP
and SMOOTH
predictions at predicted values at all locations and specific plots
along the 0.8 degree N line. Notice how the predictions at the borders
of the toy regions are smoothed:
ggplot(toy_regions) +
geom_sf() +
geom_tile(data = grd_pred, aes(x = LON, y = LAT, fill = SHARP)) +
scale_fill_viridis_c(limits = c(-0.3, 1)) +
geom_hline(yintercept = 0.8) +
ggtitle("Sharp Predictions", "Black line corresponds to x-axis of the next plot.") +
xlab("") + ylab("") +
theme_bw()
ggplot(grd_pred |> dplyr::filter(LAT == 0.8),
aes(x = LON, y = SHARP)) +
geom_line(linewidth = 1) +
ggtitle("Sharp Predictions at 0.8 degrees N") +
theme_minimal()
ggplot(toy_regions) +
geom_sf() +
geom_tile(data = grd_pred, aes(x = LON, y = LAT, fill = SMOOTH)) +
scale_fill_viridis_c(limits = c(-0.3, 1)) +
geom_hline(yintercept = 0.8) +
ggtitle("Smooth Predictions", "Black line corresponds to x-axis of the next plot.") +
xlab("") + ylab("") +
theme_bw()
ggplot(grd_pred |> dplyr::filter(LAT == 0.8),
aes(x = LON, y = SMOOTH)) +
geom_line(linewidth = 1) +
ggtitle("Smooth Predictions at 0.8 degrees N") +
theme_minimal()
The remap
package provides a way to build regional
spatial models given a set of observations and a set of regions. The
resulting model can make predictions that have no discontinuities at
region borders and scales well to large problems.