---
title: "2. Environmental thinning"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{2. Environmental thinning}
  %\VignetteEncoding{UTF-8}
  %\VignetteEngine{knitr::rmarkdown}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.width = 7,
  fig.height = 5,
  out.width = "90%"
)
has_ggplot2 <- requireNamespace("ggplot2", quietly = TRUE)
```

```{r setup}
library(bean)
data(origin_dat_prepared, package = "bean")
env_vars <- c("bio_1", "bio_4", "bio_12", "bio_15")
```

## Choosing an objective grid resolution

### The intuition

A grid cell is the unit of "redundancy" in environmental space. If two
occurrences sit in the same cell, the package treats them as carrying
the same environmental information and keeps only one. Pick the cell
too small, and almost every point survives — thinning does nothing. Pick
it too large, and you wipe out genuine ecological variation. We need a
data-driven choice that sits between those two extremes.

### Why a kernel-density bandwidth?

A kernel density estimator (KDE) replaces every observation with a small
"bump" of width *h* (the **bandwidth**) and adds the bumps together to
estimate the probability density of the variable. The bandwidth is the
scale at which the estimator stops resolving individual points and
starts producing a smooth curve. Operationally:

- features finer than *h* are treated as sampling noise,
- features coarser than *h* are treated as real density structure.

That is exactly the dividing line a thinning grid should respect, so
`find_env_resolution()` uses *h* (computed independently per variable)
as the suggested edge length of each cell.

### Which selector?

Three selectors with established statistical properties are available
via the `method` argument:

- **Sheather–Jones plug-in** (`"sheather-jones"`, the default) — solves
  an equation for the bandwidth that minimises the asymptotic mean
  integrated squared error of the density estimate. It uses the data
  itself (rather than a Gaussian assumption) to estimate the curvature
  of the unknown density, and is the modern standard recommendation
  (Sheather & Jones, 1991; Jones, Marron & Sheather, 1996).

- **Silverman's rule of thumb** (`"silverman"`) — the closed form
  $$h = 0.9 \, \min\!\left(\hat\sigma,\, IQR/1.34\right) \, n^{-1/5}.$$
  Cheap, stable, and a good fallback. Uses the IQR-based scale to
  resist mild outliers, but assumes the underlying density is
  approximately Gaussian. Tends to over-smooth multimodal data
  (Silverman, 1986).

- **Scott's rule** (`"scott"`) — the closed form
  $$h = 1.06 \, \hat\sigma \, n^{-1/5}.$$
  Optimal when the density is exactly Gaussian; otherwise less
  defensible than the other two (Scott, 1992).

All three rules shrink at the canonical rate \(n^{-1/5}\): more data
buys you a finer resolution, but only slowly. This is the right
behaviour for SDM data — doubling your sample size should let you see
slightly finer structure, not arbitrarily finer.

### Computing the resolution

```{r}
res <- find_env_resolution(
  data = origin_dat_prepared,
  env_vars = env_vars,
  method = "sheather-jones"
)
res
```

### Visualising the bandwidth

Each panel below shows the per-variable kernel density. The red bar at
the bottom of each panel has length equal to the chosen bandwidth — the
cell width that will be used for thinning.

```{r, fig.width = 8, fig.height = 6}
plot(res)
```

### Sensitivity to the selector

It's worth checking that the three rules give comparable answers; if
they don't, the data is far from Gaussian and you may want to inspect
the densities yourself.

```{r}
sapply(c("sheather-jones", "silverman", "scott"), function(m) {
  find_env_resolution(origin_dat_prepared, env_vars, method = m)$suggested_resolution
})
```

## Stochastic thinning

`thin_env_nd()` randomly retains exactly one occurrence per occupied
grid cell. A `seed` makes the selection reproducible without disturbing
the global random state.

```{r}
thinned_stochastic <- thin_env_nd(
  data = origin_dat_prepared,
  env_vars = env_vars,
  grid_resolution = res$suggested_resolution,
  seed = 1
)
thinned_stochastic
```

## Deterministic thinning

`thin_env_center()` replaces each occupied cell with a single point at
the geometric centre of the cell — no randomness involved.

```{r}
thinned_deterministic <- thin_env_center(
  data = origin_dat_prepared,
  env_vars = env_vars,
  grid_resolution = c(0.5, 0.5, 0.5, 0.5)
)
thinned_deterministic
```

## Comparing the two thinned datasets

```{r, eval = has_ggplot2, fig.width = 8, fig.height = 6}
library(ggplot2)
plot_compare <- rbind(
  data.frame(origin_dat_prepared[, env_vars], Status = "Original"),
  data.frame(thinned_stochastic$thinned_data[, env_vars], Status = "Stochastic"),
  data.frame(thinned_deterministic$thinned_points[, env_vars], Status = "Deterministic")
)
plot_compare$Status <- factor(plot_compare$Status,
                              levels = c("Original", "Stochastic", "Deterministic"))

ggplot(plot_compare, aes(bio_1, bio_12, colour = Status)) +
  geom_point(alpha = 0.5, size = 3) +
  facet_wrap(~Status, nrow = 1) +
  scale_colour_manual(values = c(Original = "#ef476f",
                                 Stochastic = "#118ab2",
                                 Deterministic = "#06d6a0"),
                      guide = "none") +
  theme_classic() +
  labs(title = "Occurrences in environmental space",
       x = "bio_1 (scaled)", y = "bio_12 (scaled)")
```

The stochastic plot preserves *actual* observations (one per cell), so
its points reflect the empirical distribution within each cell. The
deterministic plot replaces each cell's observations with the cell's
centre, so its points sit on a regular lattice.

## Pairs view with `plot_bean()`

```{r, fig.width = 8, fig.height = 8}
plot_bean(
  original_data = origin_dat_prepared,
  thinned_object = thinned_stochastic,
  env_vars = env_vars
)
```

```{r, fig.width = 8, fig.height = 8}
plot_bean(
  original_data = origin_dat_prepared,
  thinned_object = thinned_deterministic,
  env_vars = env_vars
)
```

The next vignette uses these two thinned datasets to fit niche
ellipsoids and project suitability across the landscape.