---
title: "Data Preparation and Integration"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Data Preparation and Integration}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE, purl = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  eval = FALSE,
  message = FALSE,
  warning = FALSE
)
```

## Overview

Before fitting any model in mlumr, you need to:

1. Set up the IPD with `set_ipd()`
2. Set up the AgD with `set_agd()`
3. Combine them with `combine_data()`
4. (For ML-UMR only) Add integration points with `add_integration()`

## Setting up IPD

IPD should be a data frame with one row per patient, containing:

- A treatment indicator column
- An outcome column (binary 0/1, continuous numeric, or non-negative integer
  counts depending on `family`)
- Covariate columns
- For Poisson family: an exposure column

```{r, eval = TRUE}
library(mlumr)
set.seed(2026)

# Example: Trial A data (index treatment)
trial_a <- data.frame(
  patient_id = 1:500,
  treatment = "Drug_A",
  response = rbinom(500, 1, 0.6),
  age_group = rbinom(500, 1, 0.4),   # 0 = young, 1 = old
  sex = rbinom(500, 1, 0.55)          # 0 = male, 1 = female
)

ipd <- set_ipd(
  data = trial_a,
  treatment = "treatment",
  outcome = "response",
  covariates = c("age_group", "sex")
)

ipd
```

### Continuous outcomes

For continuous outcomes, use `family = "normal"`. The IPD frame needs a
numeric outcome column:

```{r, eval = TRUE}
trial_a_normal <- data.frame(
  patient_id = 1:500,
  treatment = "Drug_A",
  score = rnorm(500, mean = 3.0, sd = 1.2),
  age_group = rbinom(500, 1, 0.4),
  sex = rbinom(500, 1, 0.55)
)

ipd_normal <- set_ipd(
  data = trial_a_normal,
  treatment = "treatment",
  outcome = "score",
  covariates = c("age_group", "sex"),
  family = "normal"
)
```

### Count outcomes

For Poisson count data, use `family = "poisson"` and supply an `exposure`
column:

```{r, eval = TRUE}
trial_a_poisson <- data.frame(
  patient_id = 1:500,
  treatment = "Drug_A",
  events = rpois(500, lambda = 0.8),
  person_years = runif(500, 0.5, 2.0),
  age_group = rbinom(500, 1, 0.4),
  sex = rbinom(500, 1, 0.55)
)

ipd_poisson <- set_ipd(
  data = trial_a_poisson,
  treatment = "treatment",
  outcome = "events",
  covariates = c("age_group", "sex"),
  family = "poisson",
  exposure = "person_years"
)
```

### Validation

`set_ipd()` checks that:

- All specified columns exist in the data
- The outcome matches the family (binary for binomial, numeric for normal,
  non-negative integers for Poisson)
- Missing values are handled (rows dropped with a warning)

```{r, eval = TRUE, error = TRUE, purl = FALSE}
# This will error: non-binary outcome with binomial family
bad_data <- data.frame(trt = "A", outcome = c(0, 1, 2), x = rnorm(3))
set_ipd(bad_data, "trt", "outcome", "x")
```

## Setting up AgD

AgD should be a data frame with one row per study/subgroup, containing:

- Treatment indicator
- Outcome summaries (depend on `family`):
  - **Binomial**: sample size (`outcome_n`) and number of events (`outcome_r`)
  - **Normal**: mean outcome (`outcome_mean`) and standard error (`outcome_se`),
    optionally sample size (`outcome_n`)
  - **Poisson**: number of events (`outcome_r`) and total exposure (`outcome_E`)
- Covariate means/proportions

```{r, eval = TRUE}
# Example: Trial B data (comparator treatment)
trial_b <- data.frame(
  study = "Trial_B",
  treatment = "Drug_B",
  n_total = 400,
  n_events = 160,
  age_group_mean = 0.35,   # proportion in "old" group
  sex_prop = 0.50           # proportion female
)

agd <- set_agd(
  data = trial_b,
  treatment = "treatment",
  outcome_n = "n_total",
  outcome_r = "n_events",
  cov_means = c("age_group_mean", "sex_prop"),
  cov_types = c("binary", "binary")
)
```

### AgD for continuous outcomes

```{r, eval = TRUE}
agd_normal <- set_agd(
  data = data.frame(
    trt = "Drug_B", y_mean = 3.2, se = 0.15, n = 400,
    age_group_mean = 0.35, sex_prop = 0.50
  ),
  treatment = "trt",
  family = "normal",
  outcome_mean = "y_mean",
  outcome_se = "se",
  outcome_n = "n",
  cov_means = c("age_group_mean", "sex_prop"),
  cov_types = c("binary", "binary")
)
```

**Scale requirement for normal outcomes.** `set_agd()` assumes the
AgD mean (`outcome_mean`) and SE (`outcome_se`) are on the
**arithmetic (original, untransformed) scale**, regardless of
whether `mlumr()` is later called with `link = "identity"` or
`link = "log"`. If a publication reports only a log-scale mean and
SD, or a geometric mean, back-transform before calling `set_agd()`
and propagate uncertainty via the delta method; passing log-scale
summaries silently biases the posterior because the Stan likelihood
is `normal(E[exp(eta)], se_agd)` under the log link, expecting the
AgD quantities already on the outcome scale.

### AgD for count outcomes

```{r, eval = TRUE}
agd_poisson <- set_agd(
  data = data.frame(
    trt = "Drug_B", n_events = 120, person_years = 800,
    age_group_mean = 0.35, sex_prop = 0.50
  ),
  treatment = "trt",
  family = "poisson",
  outcome_r = "n_events",
  outcome_E = "person_years",
  cov_means = c("age_group_mean", "sex_prop"),
  cov_types = c("binary", "binary")
)
```

### Covariate naming

`set_agd()` automatically strips `_mean` and `_prop` suffixes from covariate
names to match the IPD covariate names:

- `"age_group_mean"` -> covariate name `"age_group"`
- `"sex_prop"` -> covariate name `"sex"`

Make sure the resulting names match those in your IPD.

### Continuous covariates

For continuous covariates, also supply standard deviations:

```{r, eval = TRUE}
agd_continuous <- set_agd(
  data = data.frame(
    trt = "B", n_total = 400, n_events = 160,
    bmi_mean = 25.3, bmi_sd = 4.2
  ),
  treatment = "trt",
  outcome_n = "n_total",
  outcome_r = "n_events",
  cov_means = "bmi_mean",
  cov_sds = "bmi_sd",
  cov_types = "continuous"
)
```

## Combining data

```{r, eval = TRUE}
dat <- combine_data(ipd, agd)
print(dat)
```

## Adding integration points

Integration points are needed for ML-UMR's numerical integration over the
comparator population's covariate distribution. They are generated using
Sobol quasi-Monte Carlo sequences with a Gaussian copula to respect
covariate correlations.

### Basic usage

```{r, eval = TRUE}
dat <- add_integration(
  dat,
  n_int = 64,
  age_group = distr(qbern, prob = age_group_mean),
  sex = distr(qbern, prob = sex_mean)
)
```

Key points:

- **`n_int`**: Number of integration points (powers of 2 recommended: 32,
  64, 128, 256). More points = better accuracy but slower Stan fitting.
- **`distr()`**: Wraps a quantile function with parameters. Parameters can
  reference columns in the AgD data (e.g., `age_group_mean`).
- **`qbern`**: Bernoulli quantile function (provided by mlumr) for binary
  covariates. Use `qnorm` for continuous covariates.

### Distribution types

```{r, eval = FALSE, purl = FALSE}
# Binary covariate
age = distr(qbern, prob = age_mean)

# Continuous covariate (normal)
bmi = distr(qnorm, mean = bmi_mean, sd = bmi_sd)

# Continuous covariate (gamma)
biomarker = distr(qgamma, shape = 2, rate = 0.5)
```

### Correlation handling

By default, correlations between covariates are estimated from the IPD
using Spearman rank correlation, then adjusted for the Gaussian copula.
For binary and mixed discrete/continuous covariate pairs, this adjustment
is a pragmatic approximation — exact calibration of the latent Gaussian
correlation depends on marginal prevalences, which are not used in the
current transform. For most practical datasets this provides adequate
integration-point distributions, but users with strong prior information
on covariate correlations can supply their own matrix via the `cor`
argument.

```{r, eval = TRUE}
# Default: Spearman correlation from IPD
dat <- add_integration(dat, n_int = 64,
                       age_group = distr(qbern, prob = age_group_mean),
                       sex = distr(qbern, prob = sex_mean))

# Supply your own correlation matrix
my_cor <- matrix(c(1, 0.3, 0.3, 1), 2, 2)
dat <- add_integration(dat, n_int = 64, cor = my_cor,
                       age_group = distr(qbern, prob = age_group_mean),
                       sex = distr(qbern, prob = sex_mean))

# No correlation adjustment
dat <- add_integration(dat, n_int = 64, cor_adjust = "none",
                       age_group = distr(qbern, prob = age_group_mean),
                       sex = distr(qbern, prob = sex_mean))
```

### Inspecting integration points

```{r, eval = TRUE}
# Expand to long format
int_df <- unnest_integration(dat)
head(int_df)
```

Use `check_integration()` to compare the generated integration points against
the requested AgD marginal distributions and correlation structure:

```{r, eval = TRUE, purl = FALSE}
check_integration(
  dat,
  age_group = distr(qbern, prob = age_group_mean),
  sex = distr(qbern, prob = sex_mean)
)
```

This is especially important when there are several covariates, strong
correlations, binary covariates with prevalences near 0 or 1, or a small
number of integration points.

## Notes on STC and naive

The STC and naive methods do **not** require integration points. The STC
uses integration points if available (for better marginalization), but
falls back to AgD covariate means if they are not. You can call `stc()`
and `naive()` immediately after `combine_data()`.

```{r, eval = TRUE}
# These work without add_integration()
dat_no_int <- combine_data(ipd, agd)
naive_result <- naive(dat_no_int)
stc_result <- stc(dat_no_int)
```