---
title: "Planning a MAIHDA analysis"
author: "Hamid Bulut"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Planning a MAIHDA analysis}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.width = 7,
  fig.height = 4.5
)
```

## Before you fit

This vignette walks through the design decisions to make before fitting a MAIHDA
model, with small runnable checks you can apply to your own data first to evaluate
the choice of dimensions, the number of strata, and the analytic sample.

```{r lib}
library(MAIHDA)
data("maihda_health_data")
```

## Is MAIHDA the right tool?

MAIHDA is for questions of the form *"how much of the variation in an outcome lies
between people's intersectional social positions, and how much of that is more than
the sum of its parts?"* It is well suited when:

- you have several **categorical** social dimensions (gender, race/ethnicity,
  education, class, ...) whose **joint** categories define the strata;
- the outcome is measured at the **individual** level;
- you have enough individuals to populate the cells (see below).

## The central tradeoff: more dimensions means emptier cells

Strata are the **cross-product** of the dimensions, so cell counts fall off fast as
you add dimensions. `make_strata()` builds the strata and returns a `strata_info`
table of counts you can inspect *before* modelling:

```{r strata2}
s2 <- make_strata(maihda_health_data, vars = c("Gender", "Race"))
nrow(s2$strata_info)                   # number of strata
summary(s2$strata_info$n)              # cell-size distribution
```

Add education and the same sample splits into many more, smaller cells:

```{r strata3}
s3 <- make_strata(maihda_health_data, vars = c("Gender", "Race", "Education"))
nrow(s3$strata_info)
summary(s3$strata_info$n)
sum(s3$strata_info$n < 10)             # how many strata have < 10 people
```

Each extra dimension multiplies the number of strata and divides the people among
them. Small cells are not fatal (partial-pooling shrinkage is exactly what
protects MAIHDA against noisy small strata), but they do have consequences,
covered in the next section. A useful rule: choose the fewest dimensions that
answer your question, and look at the cell-size distribution before committing.
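
The multiplication can be sketched in base R before touching any data. The level
counts and sample size below are hypothetical placeholders, not the values in
`maihda_health_data`; substitute the counts from your own variables.

```{r growth-sketch}
# Hypothetical numbers of categories per dimension (assumed for illustration)
n_levels <- c(Gender = 2, Race = 4, Education = 3, AgeGroup = 4)
n_people <- 2000  # hypothetical analytic sample size

# Upper bound on the number of strata as each dimension is added in turn
max_strata <- unname(cumprod(n_levels))

# Average cell size if people were spread perfectly evenly across strata
growth <- data.frame(
  added_dimension = names(n_levels),
  max_strata = max_strata,
  mean_cell_size = n_people / max_strata
)
growth
```

Real cell sizes are less even than this average, so the smallest cells will be
smaller still. The `strata_info` counts from the actual data remain the decisive
check.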

## What sparse cells do: singular fits

When cells get very small, the maximum-likelihood (`lme4`) estimate of the
between-stratum variance can collapse to the boundary (a singular fit) and
report a VPC of (near) zero with no uncertainty. The package records this and
surfaces it in a "Fit diagnostics" note rather than letting it pass silently:

```{r singular}
over <- fit_maihda(
  BMI ~ 1 + (1 | Gender:Race:Education),
  data = maihda_health_data[1:60, ]       # deliberately too few people per stratum
)
over
```

If you see a singular-fit note, do not read the VPC as a clean zero. Either
collapse dimensions or categories (fewer, larger cells), or use
`engine = "brms"`, whose weakly informative priors regularise the variance away
from the boundary and return a posterior interval. The
[Bayesian sparse vignette](bayesian_sparse_maihda.html) covers that approach.
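
Collapsing categories can be tried out in base R before refitting. The sketch
below uses simulated data with hypothetical variable names and level labels (none
of them come from `maihda_health_data`) and compares cell counts before and after
merging a five-level education variable into three levels.

```{r collapse-sketch}
set.seed(1)
# Simulated stand-in data; variable names and levels are hypothetical
d <- data.frame(
  gender = sample(c("Woman", "Man"), 200, replace = TRUE),
  education = sample(
    c("None", "Primary", "Secondary", "Vocational", "University"),
    200, replace = TRUE
  )
)

# Merge the five education levels into three broader ones
d$education3 <- factor(d$education)
levels(d$education3) <- list(
  Low = c("None", "Primary"),
  Middle = c("Secondary", "Vocational"),
  High = "University"
)

# Cell sizes before and after collapsing (empty cells dropped)
cells_before <- table(interaction(d$gender, d$education, drop = TRUE))
cells_after <- table(interaction(d$gender, d$education3, drop = TRUE))
rbind(
  before = c(strata = length(cells_before), smallest = min(cells_before)),
  after = c(strata = length(cells_after), smallest = min(cells_after))
)
```

Which categories can be merged is a substantive decision, not a statistical one:
merge only levels whose social meaning is close enough for your question.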

## Continuous variables and the analytic sample

- **Keep continuous variables out of the strata.** A continuous variable in the
  grouping term produces one stratum per distinct value. `make_strata()` will
  auto-bin a numeric dimension into tertiles (with a `message()`), but a
  continuous *covariate* belongs in the fixed part of the formula, not in the
  strata.
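
Both topics in this section's heading can be checked in base R before fitting:
restrict the data to rows that are complete on every model variable, and, if a
numeric variable really is meant to be a stratum dimension, bin it explicitly so
the cut points are visible. The data and variable names below are simulated and
hypothetical, and the `cut()`/`quantile()` tertile split is only one common way
to bin; it is not necessarily the rule `make_strata()` applies internally.

```{r sample-sketch}
set.seed(2)
# Simulated stand-in data; all variable names are hypothetical
d <- data.frame(
  bmi = rnorm(300, mean = 27, sd = 4),
  gender = sample(c("Woman", "Man"), 300, replace = TRUE),
  age = round(runif(300, min = 18, max = 80))
)
d$bmi[sample(300, 25)] <- NA   # inject some missing outcomes
d$age[sample(300, 15)] <- NA   # and some missing ages

# Analytic sample: rows complete on every variable the model will use
model_vars <- c("bmi", "gender", "age")
analytic <- d[complete.cases(d[, model_vars]), ]
c(original = nrow(d), analytic = nrow(analytic))

# Explicit tertiles, in case age is meant to be a stratum dimension
breaks <- quantile(analytic$age, probs = c(0, 1/3, 2/3, 1))
analytic$age_tertile <- cut(
  analytic$age,
  breaks = breaks,
  include.lowest = TRUE,
  labels = c("Low", "Middle", "High")
)
table(analytic$age_tertile)
```

Fitting every model on the same `analytic` data frame keeps comparisons between
models from being driven by different rows dropping out.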

## What the summaries can and cannot tell you

| Quantity | Answers | Does *not* answer |
|---|---|---|
| **VPC/ICC** | share of variance between strata | the *amount* of between-stratum variation (a share can rise just because the residual fell) |
| **PCV** | additive share of the between-stratum variance | a causal decomposition; a negative PCV is not proof of hidden inequality |
| **Discriminatory accuracy (AUC/MOR)** | how well strata predict the *individual* outcome | how large the *group* differences are (a high VPC can go with modest AUC) |
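
The caveat in the VPC row follows from the usual definition of the VPC for a
continuous outcome, $\sigma^2_u / (\sigma^2_u + \sigma^2_e)$, where $\sigma^2_u$
is the between-stratum variance and $\sigma^2_e$ the residual variance. A small
base-R sketch with made-up variance components (not estimates from any data)
holds the between-stratum variance fixed and changes only the residual:

```{r vpc-sketch}
# VPC for a continuous outcome: between-stratum share of the total variance
vpc <- function(var_between, var_residual) {
  var_between / (var_between + var_residual)
}

# Made-up variance components: the between-stratum variance is 2 in both cases
vpc_a <- vpc(var_between = 2, var_residual = 18)  # 2 / 20
vpc_b <- vpc(var_between = 2, var_residual = 8)   # 2 / 10
c(scenario_a = vpc_a, scenario_b = vpc_b)
```

The share doubles although the amount of between-stratum variation is unchanged,
which is why it helps to report the variance components alongside the VPC.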

## Which engine, which design?

- **`lme4` (default)** -- fast frequentist fits for adequately-sized cells.
- **`brms`** -- Bayesian; preferred when cells are sparse or dimensions have few
  levels (regularising priors, posterior intervals).

For extensions beyond the cross-sectional case, see the
[crossed random effects](cross_classified.html) (dimensions/contexts) and
[longitudinal](longitudinal.html) vignettes.

## A suggested learning path

1. [Introduction to MAIHDA](introduction.html) -- the end-to-end workflow.
2. [Interpreting MAIHDA plots and diagnostics](interpreting_plots.html).
3. [Finding interaction patterns](finding_interactions.html).
4. [Reporting MAIHDA results](reporting_results.html) -- tidy output and tables.
5. Specialised designs: [binary outcomes](binary_outcomes.html),
   [group comparison](group_comparison.html),
   survey weights, and [Bayesian / sparse](bayesian_sparse_maihda.html).

## References

- Evans, C. R., Leckie, G., Subramanian, S. V., Bell, A., & Merlo, J. (2024). A
  tutorial for conducting intersectional multilevel analysis of individual
  heterogeneity and discriminatory accuracy (MAIHDA). *SSM - Population Health*, 26,
  101664.