% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/sim.R
\name{simdat}
\alias{simdat}
\title{Simulate data with varying degrees of selection and confounding bias}
\usage{
simdat(
  N = 1e+06,
  p = 1,
  q = 0,
  n_strat = 1,
  n_clust = 1,
  sigma_strat = 1,
  sigma_clust = 1,
  X_fam = c("gaussian", "binary"),
  tau_0 = 0,
  tau_A = 1,
  tau_X = rep(1, p),
  tau_X12 = 0,
  beta_0 = 0,
  beta_A = 1,
  beta_X = rep(1, p),
  beta_U = rep(1, q),
  Y_fam = c("gaussian", "binary", "poisson"),
  alpha_0 = 0,
  alpha_A = 1,
  alpha_X = rep(1, p),
  alpha_AX = 0
)
}
\arguments{
\item{N}{int - Number of observations to be generated. Defaults to 1000000.}

\item{p}{int - Number of covariates to be generated. Defaults to 1.}

\item{q}{int - Number of additional covariates, \code{U}, that affect only
selection. Defaults to 0.}

\item{n_strat}{int - Number of strata in the population to be generated.
Defaults to 1.}

\item{n_clust}{int - Number of clusters within each stratum in the
population to be generated. Defaults to 1.}

\item{sigma_strat}{double - Standard deviation of covariate means across
strata. Defaults to 1.}

\item{sigma_clust}{double - Standard deviation of covariate means across
clusters. Defaults to 1.}

\item{X_fam}{string - Distribution of the covariates, \code{X}. Defaults
to "gaussian", a multivariate normal distribution with mean equal to the sum
of the cluster and stratum means, and an identity covariance matrix. If
"binary", the continuous covariates are discretized at their median values.}

\item{tau_0}{double - Intercept for propensity model. Defaults to 0.}

\item{tau_A}{double - Scaling factor for group assignment. Defaults to 1.}

\item{tau_X}{double - Coefficients for \code{X} in propensity model.
Defaults to a vector of ones of length \code{p}.}

\item{tau_X12}{double - Coefficient for the \code{X1*X2} interaction in
propensity model, used only if \code{p > 1}. Defaults to 0.}

\item{beta_0}{double - Intercept for selection model. Defaults to 0.}

\item{beta_A}{double - Coefficient for \code{A} in selection model.
Defaults to 1.}

\item{beta_X}{double - Coefficients for \code{X} in selection model.
Defaults to a vector of ones of length \code{p}.}

\item{beta_U}{double - Coefficients for \code{U} (additional covariates
affecting only selection) in selection model. Defaults to a vector of ones
of length \code{q}.}

\item{Y_fam}{string - Distribution of the outcome variable, \code{Y}.
Defaults to "gaussian" for a normally distributed outcome. Other options
include "binary" for a Bernoulli-distributed outcome and "poisson" for a
Poisson-distributed outcome.}

\item{alpha_0}{double - Intercept for outcome model. Defaults to 0.}

\item{alpha_A}{double - Coefficient for \code{A} in outcome model.
Defaults to 1.}

\item{alpha_X}{double - Coefficients for \code{X} in outcome model.
Defaults to a vector of ones of length \code{p}.}

\item{alpha_AX}{double - Coefficient for interaction between \code{A} and
\code{X} in outcome model. Defaults to 0.}
}
\value{
A \code{data.frame} with \code{N} observations and the following variables:
\describe{
   \item{Strata}{Stratum index (integer)}
   \item{Cluster}{Cluster index (integer)}
   \item{X1, X2, ..., Xp}{Confounding covariates (continuous or binary,
   depending on \code{X_fam})}
   \item{pA}{True probability of A = 1 conditional on X (continuous)}
   \item{A}{Group assignment (binary)}
   \item{pS}{True probability of selection conditional on A and X
   (continuous)}
   \item{Y0}{Potential outcome under A = 0 (continuous, binary, or count
   depending on \code{Y_fam})}
   \item{Y1}{Potential outcome under A = 1 (continuous, binary, or count
   depending on \code{Y_fam})}
   \item{Y}{Observed outcome, based on treatment assignment (continuous,
   binary, or count depending on \code{Y_fam})}
   \item{CDIFF}{True controlled difference in outcomes by comparison group
   (double, computed as \code{mean(Y1 - Y0)})}
}
}
\description{
Function to simulate data based on specified relationships between the
generated outcome, group variable, confounder(s), and selection mechanism.
}
\details{
The function generates data in a hierarchical structure with stratified
clusters. The data generation process follows these steps:

1. \strong{Stratum and Cluster Means:} For each of the \code{n_strat}
   strata, a vector of stratum-level means for the \code{p} covariates is
   generated from a normal distribution with standard deviation
   \code{sigma_strat}. Similarly, for each of the \code{n_clust} clusters
   within each stratum, a vector of cluster-level means is generated from a
   normal distribution with standard deviation \code{sigma_clust}.

2. \strong{Covariate Generation:} Within each cluster, covariates,
   \code{X}, for \code{N / (n_strat * n_clust)} individuals are generated
   from a multivariate normal distribution with mean equal to the sum of
   the cluster and stratum means, and an identity covariance matrix.

3. \strong{Covariate Transformation:} If \code{X_fam} is \code{"binary"},
   each covariate is discretized at its median; otherwise, it remains
   continuous.

4. \strong{Propensity Model:} The group variable, \code{A}, is generated
   using a logistic regression model with intercept \code{tau_0}, covariate
   effects \code{tau_X}, and an interaction effect between the first two
   covariates with coefficient \code{tau_X12}. The group membership
   probability, \code{pA}, is defined by the logistic model.

5. \strong{Selection Model:} The probability of selection, \code{pS}, is
   generated using a logistic regression model with intercept \code{beta_0},
   group effect \code{beta_A}, covariate effects \code{beta_X}, and, when
   \code{q > 0}, effects \code{beta_U} of the additional covariates
   \code{U}. Gaussian noise is added to the linear predictor.

6. \strong{Outcome Model:} The outcome, \code{Y}, is generated based on a
   chosen outcome distribution, \code{Y_fam}. The linear predictor includes
   an intercept, \code{alpha_0}, group effect, \code{alpha_A}, covariate
   effects, \code{alpha_X}, and an optional interaction effect,
   \code{alpha_AX}, between the group variable and covariates.

7. \strong{Controlled Difference:} The true controlled difference in the
   outcome between groups is calculated as the mean of \code{Y1 - Y0} and
   returned as \code{CDIFF}.

The output is a data frame containing the generated outcome, group variable,
covariates, and selection probabilities.
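
As a reading aid, the propensity and controlled-difference steps above can
be written out as follows. The first expression is a sketch inferred from
the description in step 4, not taken from the source code, and it sets
aside \code{tau_A}, whose role is not spelled out in the steps above; the
interaction term applies only when \code{p > 1}. The second restates the
documented \code{mean(Y1 - Y0)}.

\deqn{\mathrm{logit}(pA_i) = \tau_0 + \sum_{j=1}^{p} \tau_{X,j} X_{ij} + \tau_{X12} X_{i1} X_{i2}}{logit(pA[i]) = tau_0 + sum_j tau_X[j] * X[i, j] + tau_X12 * X[i, 1] * X[i, 2]}

\deqn{\mathrm{CDIFF} = \frac{1}{N} \sum_{i=1}^{N} (Y1_i - Y0_i)}{CDIFF = (1 / N) * sum_i (Y1[i] - Y0[i])}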
}
\examples{

N <- 100000

dat <- simdat(N)

head(dat)
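
# The lines below are an illustrative sketch, not a documented workflow.
# They rely only on the columns listed under Value.

# True controlled difference, stored in CDIFF and documented as
# mean(Y1 - Y0)
unique(dat$CDIFF)
mean(dat$Y1 - dat$Y0)

# Unadjusted difference in mean observed outcomes between groups; with
# confounding by X this will generally differ from CDIFF
mean(dat$Y[dat$A == 1]) - mean(dat$Y[dat$A == 0])

# One way to mimic a selected sample (an assumption about how pS might be
# used): draw a selection indicator from pS and keep the selected rows
S <- rbinom(nrow(dat), size = 1, prob = dat$pS)
samp <- dat[S == 1, ]
nrow(samp)

# Non-default arguments, with values chosen purely for illustration: two
# binary covariates and a binary outcome in 5 strata of 10 clusters each
dat2 <- simdat(N = 10000, p = 2, n_strat = 5, n_clust = 10,
               X_fam = "binary", Y_fam = "binary")
table(dat2$A, dat2$Y)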

}
