Simulation Methods
Let Yit be the binary or multinomial response for subject i (i=1,…,N) at measurement occasion t (t=1,…,T), and let xit be the associated covariates vector. We assume that Yit∈{0,1} for binary responses and Yit∈{1,2,…,J≥3} for multinomial responses.
Correlated nominal responses
The function rmult.bcl
simulates nominal responses under the marginal baseline-category logit model
\begin{equation}
\log \left[\frac{\Pr(Y_{it}=j |\mathbf {x}_{it})}{\Pr(Y_{it}=J |\mathbf {x}_{it})}\right]=(\beta_{tj0}-\beta_{tJ0})+(\boldsymbol {\beta}_{tj}-\boldsymbol{\beta}_{tJ})^{\prime} \mathbf {x}_{it}=\beta^{\ast}_{tj0}+\boldsymbol{\beta}^{\ast\prime}_{tj}\mathbf {x}_{it},
\tag{3.1}
\end{equation}
where \beta_{tj0} is the j-th category-specific intercept at measurement occasion t and \boldsymbol{\beta}_{tj} is the j-th category-specific parameter vector associated with the covariates at measurement occasion t. The popular identifiability constraints \beta_{tJ0}=0 and \boldsymbol{\beta}_{tJ}=\mathbf {0} for all t, imply that \beta^{\ast}_{tj0}=\beta_{tj0} and \boldsymbol {\beta}^{\ast}_{tj}=\boldsymbol{\beta}_{tj} for all t=1,\ldots,T and j=1,\ldots,J-1. The threshold
Y_{it}=j \Leftrightarrow U^{NO}_{itj}=\max \{U^{NO}_{it1},\ldots,U^{NO}_{itJ}\}
generates clustered nominal responses that satisfy the marginal baseline-category logit model (3.1), where
U^{NO}_{itj}=\beta_{tj0}+\boldsymbol{\beta}_{tj}^{\prime} \mathbf {x}_{it}+e^{NO}_{itj},
and where the random variables \{e^{NO}_{itj}:i=1,\ldots,N \text{, } t=1,\ldots,T \text{ and } j=1,\ldots,J\} satisfy the following conditions:
- e^{NO}_{itj} follows the standard extreme value distribution for all i, t and j (mean =\gamma \approx 0.5772, where \gamma is Euler’s constant, and variance =\pi^2/6).
- e^{NO}_{i_1t_1j_1} and e^{NO}_{i_2t_2j_2} are independent random variables provided that i_1 \neq i_2.
- e^{NO}_{itj_1} and e^{NO}_{itj_2} are independent random variables provided that j_1\neq j_2.
For each subject i, the association structure among the clustered nominal responses \{Y_{it}:t=1,\ldots,T\} depends on the joint distribution and correlation matrix of \{e^{NO}_{itj}:t=1,\ldots,T \text{ and } j=1,\ldots,J\}. If the random variables \{e^{NO}_{itj}:t=1,\ldots,T \text{ and } j=1,\ldots,J\} are independent then so are \{Y_{it}:t=1,\ldots,T\}.
Example 3.1 (Simulation of clustered nominal responses using the NORTA method) Suppose the aim is to simulate nominal responses from the marginal baseline-category logit model
\begin{equation*}
\log \left[\frac{\Pr(Y_{it}=j |\mathbf {x}_{it})}{\Pr(Y_{it}=4 |\mathbf {x}_{it})}\right]=\beta_{j0}+ \beta_{j1} {x}_{i1}+ \beta_{j2} {x}_{it2} \end{equation*}
where N=500, T=3, (\beta_{10},\beta_{11},\beta_{12},\beta_{20},\beta_{21},\beta_{22},\beta_{30},\beta_{31},\beta_{32})=(1, 3, 2, 1.25, 3.25, 1.75, 0.75, 2.75, 2.25) and \mathbf {x}_{it}=(x_{i1},x_{it2})^{\prime} for all i and t, with x_{i1}\overset{iid}{\sim} N(0,1) and x_{it2}\overset{iid}{\sim} N(0,1). For the dependence structure, suppose that the correlation matrix \mathbf{R} in the NORTA method has elements
\mathbf{R}_{t_1j_1,t_2j_2}=\begin{cases}
1 & \text{if } t_1=t_2 \text{ and } j_1=j_2\\
0.95 & \text{if } t_1 \neq t_2 \text{ and } j_1=j_2\\
0 & \text{otherwise }\\
\end{cases}
for all i=1,\ldots,500.
betas <- c(1, 3, 2, 1.25, 3.25, 1.75, 0.75, 2.75, 2.25, 0, 0, 0)
sample_size <- 500
categories_no <- 4
cluster_size <- 3
set.seed(1)
x1 <- rep(rnorm(sample_size), each = cluster_size)
x2 <- rnorm(sample_size * cluster_size)
xdata <- data.frame(x1, x2)
set.seed(321)
library("SimCorMultRes")
equicorrelation_matrix <- toeplitz(c(1, rep(0.95, cluster_size - 1)))
identity_matrix <- diag(categories_no)
latent_correlation_matrix <- kronecker(equicorrelation_matrix, identity_matrix)
simulated_nominal_dataset <- rmult.bcl(clsize = cluster_size, ncategories = categories_no,
betas = betas, xformula = ~x1 + x2, xdata = xdata, cor.matrix = latent_correlation_matrix)
suppressPackageStartupMessages(library("multgee"))
nominal_gee_model <- nomLORgee(y ~ x1 + x2, data = simulated_nominal_dataset$simdata,
id = id, repeated = time, LORstr = "time.exch")
round(coef(nominal_gee_model), 2)
Correlated ordinal responses
Simulation of clustered ordinal responses is feasible under either a marginal cumulative link model or a marginal continuation-ratio model.
Marginal cumulative link model
The function rmult.clm
simulates ordinal responses under the marginal cumulative link model
\begin{equation}
\Pr(Y_{it}\le j |\mathbf {x}_{it})=F(\beta_{tj0} +\boldsymbol {\beta}^{\prime}_t \mathbf {x}_{it})
\tag{3.2}
\end{equation}
where F is a cumulative distribution function (cdf), \beta_{tj0} is the j-th category-specific intercept at measurement occasion t and \boldsymbol \beta_t is the regression parameter vector associated with the covariates at measurement occasion t. The category-specific intercepts at each measurement occasion t are assumed to be monotone increasing, that is
-\infty=\beta_{t00} <\beta_{t10} < \beta_{t20} < \cdots < \beta_{t(J-1)0}< \beta_{tJ0}=\infty
for all t. Using the threshold
Y_{it}=j \Leftrightarrow \beta_{t(j-1)0} < U^{O1}_{it} \leq \beta_{tj0}
clustered ordinal responses that satisfy the marginal cumulative link model (3.2) are generated, where
U^{O1}_{it}=-\boldsymbol{\beta}^{\prime}_t \mathbf {x}_{it}+e^{O1}_{it},
and where \{e^{O1}_{it}:i=1,\ldots,N \text{ and } t=1,\ldots,T\} are random variables such that:
- e^{O1}_{it} \sim F for all i and t.
- e^{O1}_{i_1t_1} and e^{O1}_{i_2t_2} are independent random variables provided that i_1 \neq i_2.
For each subject i, the association structure among the clustered ordinal responses \{Y_{it}:t=1,\ldots,T\} depends on the pairwise bivariate distributions and correlation matrix of \{e^{O1}_{it}:t=1,\ldots,T\}. If the random variables \{e^{O1}_{it}:t=1,\ldots,T\} are independent then so are \{Y_{it}:t=1,\ldots,T\}.
Example 3.2 (Simulation of clustered ordinal responses conditional on a marginal cumulative probit model with time-varying regression parameters) Suppose the goal is to simulate correlated ordinal responses from the marginal cumulative probit model
\begin{equation*}
\Pr(Y_{it}\le j |\mathbf x_{it})=\Phi(\beta_{j0} + \beta_{t1} {x}_{i})
\end{equation*}
where \Phi denotes the cdf of the standard normal distribution (mean =0 and variance =1), N=500, T=4, (\beta_{10},\beta_{20},\beta_{30},\beta_{40})=(-1.5,-0.5,0.5,1.5), (\beta_{11},\beta_{21},\beta_{31},\beta_{41})=(1,2,3,4) and \mathbf x_{it}=x_{i}\overset{iid}{\sim} N(0,1) for all i and t. For the dependence structure, assume that \mathbf{e}_i^{O1}=(e^{O1}_{i1},e^{O1}_{i2},e^{O1}_{i3},e^{O1}_{i4})^{\prime} are iid random vectors from a tetra-variate normal distribution with mean vector the zero vector and covariance matrix the correlation matrix
\left( {\begin{array}{*{20}c}
1.00 & 0.85 & 0.50 & 0.15 \\
0.85 & 1.00 & 0.85 & 0.50 \\
0.50 & 0.15 & 1.00 & 0.85 \\
0.15 & 0.85 & 0.50 & 1.00
\end{array} } \right).
set.seed(12345)
sample_size <- 500
cluster_size <- 4
beta_intercepts <- c(-1.5, -0.5, 0.5, 1.5)
beta_coefficients <- matrix(c(1, 2, 3, 4), 4, 1)
x <- rep(rnorm(sample_size), each = cluster_size)
latent_correlation_matrix <- toeplitz(c(1, 0.85, 0.5, 0.15))
simulated_ordinal_dataset <- rmult.clm(clsize = cluster_size, intercepts = beta_intercepts,
betas = beta_coefficients, xformula = ~x, cor.matrix = latent_correlation_matrix,
link = "probit")
head(simulated_ordinal_dataset$simdata, n = 8)
Marginal continuation-ratio model
The function rmult.crm
simulates clustered ordinal responses under the marginal continuation-ratio model
\begin{equation}
\Pr(Y_{it}=j |Y_{it} \ge j,\mathbf {x}_{it})=F(\beta_{tj0} +\boldsymbol {\beta}^{'}_t \mathbf {x}_{it})
\tag{3.3}
\end{equation}
where \beta_{tj0} is the j-th category-specific intercept at measurement occasion t, \boldsymbol \beta_t is the regression parameter vector associated with the covariates at measurement occasion t and F is a cdf. This is accomplished by utilizing the threshold
Y_{it}=j, \text{ given } Y_{it} \geq j \Leftrightarrow U^{O2}_{itj} \leq \beta_{tj0}
where
U^{O2}_{itj}=-\boldsymbol {\beta}^{\prime}_t \mathbf {x}_{it}+e^{O2}_{itj},
and where \{e^{O2}_{itj}:i=1,\ldots,N \text{ , } t=1,\ldots,T \text{ and } j=1,\ldots,J-1\} satisfy the following three conditions:
- e^{O2}_{itj} \sim F for all i, t and j.
- e^{O2}_{i_1t_1j_1} and e^{O2}_{i_2t_2j_2} are independent random variables provided that i_1 \neq i_2.
- e^{O2}_{itj_1} and e^{O2}_{itj_2} are independent random variables provided that j_1\neq j_2.
For each subject i, the association structure among the clustered ordinal responses \{Y_{it}:t=1,\ldots,T\} depends on the joint distribution and correlation matrix of \{e^{O2}_{itj}:j=1,\ldots,J \text{ and } t=1,\ldots,T\}. If the random variables \{e^{O2}_{itj}:j=1,\ldots,J \text{ and } t=1,\ldots,T\} are independent then so are \{Y_{it}:t=1,\ldots,T\}.
Example 3.3 (Simulation of clustered ordinal responses conditional on a marginal continuation-ratio probit model) Suppose simulation of clustered ordinal responses under the marginal continuation-ratio probit model
\begin{equation*}
\Pr(Y_{it}=j |Y_{it} \ge j,\mathbf{x}_{it})=\Phi(\beta_{j0} + \beta {x}_{it})
\end{equation*}
with N=500, T=4, (\beta_{10},\beta_{20},\beta_{30},\beta_{40},\beta)=(-1.5,-0.5,0.5,1.5,1) and \mathbf{x}_{it}=x_{it}\overset{iid}{\sim} N(0,1) for all i and t is desired. For the dependence structure, assume that \left\{\mathbf{e}_i^{O2}=\left(e^{O2}_{i11},\ldots,e^{O1}_{i44}\right)^{\prime}:i=1,\ldots,N\right\} are iid random vectors from a multivariate normal distribution with mean vector the zero vector and covariance matrix the 16 \times 16 correlation matrix with elements
\text{corr}(e^{O2}_{it_1j_1},e^{O2}_{it_2j_2}) = \begin{cases}
1 & \text{for } j_1 = j_2 \text{ and } t_1 = t_2\\
0.24 & \text{for } t_1 \neq t_2\\
0 & \text{otherwise.}\\
\end{cases}
set.seed(1)
sample_size <- 500
cluster_size <- 4
beta_intercepts <- c(-1.5, -0.5, 0.5, 1.5)
beta_coefficients <- 1
x <- rnorm(sample_size * cluster_size)
categories_no <- 5
latent_correlation_matrix <- diag(1, (categories_no - 1) * cluster_size) + kronecker(toeplitz(c(0,
rep(0.24, categories_no - 2))), matrix(1, cluster_size, cluster_size))
simulated_ordinal_dataset <- rmult.crm(clsize = cluster_size, intercepts = beta_intercepts,
betas = beta_coefficients, xformula = ~x, cor.matrix = latent_correlation_matrix,
link = "probit")
head(simulated_ordinal_dataset$Ysim)
Marginal adjacent-category logit model
The function rmult.acl
simulates clustered ordinal responses under the marginal adjacent-category logit model
\begin{equation}
\log\left[\frac{\Pr(Y_{it}=j |\mathbf {x}_{it})}{\Pr(Y_{it}=j+1 |\mathbf {x}_{it})}\right]=\beta_{tj0} +\boldsymbol {\beta}^{'}_t \mathbf {x}_{it}
\tag{3.4}
\end{equation}
where \beta_{tj0} is the j-th category-specific intercept at measurement occasion t, \boldsymbol \beta_t is the regression parameter vector associated with the covariates at measurement occasion t.
Generation of clustered ordinal responses relies upon utilizing the connection between baseline-category logit models and adjacent-category logit models. In particular, the threshold
Y_{it}=j \Leftrightarrow U^{O3}_{itj}=\max \{U^{O3}_{it1},\ldots,U^{O3}_{itJ}\}
generates clustered nominal responses that satisfy the marginal adjacent-category logit model (3.4), where
U^{O3}_{itj}=\sum_{k=j}^J\beta_{tk0}+(J-j)\boldsymbol{\beta}_{t}^{\prime} \mathbf {x}_{it}+e^{O3}_{itj},
and where the random variables \{e^{O3}_{itj}:i=1,\ldots,N \text{, } t=1,\ldots,T \text{ and } j=1,\ldots,J\} satisfy the following conditions:
- e^{O3}_{itj} follows the standard extreme value distribution for all i, t and j.
- e^{O3}_{i_1t_1j_1} and e^{O3}_{i_2t_2j_2} are independent random variables provided that i_1 \neq i_2.
- e^{O3}_{itj_1} and e^{O3}_{itj_2} are independent random variables provided that j_1\neq j_2.
For each subject i, the association structure among the clustered ordinal responses \{Y_{it}:t=1,\ldots,T\} depends on the joint distribution and correlation matrix of \{e^{O3}_{itj}:t=1,\ldots,T \text{ and } j=1,\ldots,J\}. If the random variables \{e^{O3}_{itj}:t=1,\ldots,T \text{ and } j=1,\ldots,J\} are independent then so are \{Y_{it}:t=1,\ldots,T\}.
Example 3.4 (Simulation of clustered ordinal responses conditional on a marginal adjacent-category logit model using the NORTA method) Suppose the aim is to simulate ordinal responses from the marginal adjacent-category logit model
\begin{equation*}
\log \left[\frac{\Pr(Y_{it}=j |\mathbf {x}_{it})}{\Pr(Y_{it}=j+1 |\mathbf {x}_{it})}\right]=\beta_{j0}+ \beta_{1} {x}_{i1}+ \beta_{2} {x}_{it2} \end{equation*}
where N=500, T=3, (\beta_{10},\beta_{20},\beta_{30})=(3, 2, 1), (\beta_{1},\beta_{2})=(1, 1) and \mathbf {x}_{it}=(x_{i1},x_{it2})^{\prime} for all i and t, with x_{i1}\overset{iid}{\sim} N(0,1) and x_{it2}\overset{iid}{\sim} N(0,1). For the dependence structure, suppose that the correlation matrix \mathbf{R} in the NORTA method has elements
\mathbf{R}_{t_1j_1,t_2j_2}=\begin{cases}
1 & \text{if } t_1=t_2 \text{ and } j_1=j_2\\
0.95 & \text{if } t_1 \neq t_2 \text{ and } j_1=j_2\\
0 & \text{otherwise }\\
\end{cases}
for all i=1,\ldots,500.
beta_intercepts <- c(3, 2, 1)
beta_coefficients <- c(1, 1)
sample_size <- 500
cluster_size <- 3
set.seed(321)
x1 <- rep(rnorm(sample_size), each = cluster_size)
x2 <- rnorm(sample_size * cluster_size)
xdata <- data.frame(x1, x2)
equicorrelation_matrix <- toeplitz(c(1, rep(0.95, cluster_size - 1)))
identity_matrix <- diag(4)
latent_correlation_matrix <- kronecker(equicorrelation_matrix, identity_matrix)
simulated_ordinal_dataset <- rmult.acl(clsize = cluster_size, intercepts = beta_intercepts,
betas = beta_coefficients, xformula = ~x1 + x2, xdata = xdata, cor.matrix = latent_correlation_matrix)
suppressPackageStartupMessages(library("multgee"))
ordinal_gee_model <- ordLORgee(y ~ x1 + x2, data = simulated_ordinal_dataset$simdata,
id = id, repeated = time, LORstr = "time.exch", link = "acl")
round(coef(ordinal_gee_model), 2)
Correlated binary responses
The function rbin
simulates binary responses under the marginal model specification
\begin{equation}
\Pr(Y_{it}=1 |\mathbf {x}_{it})=F(\beta_{t0} +\boldsymbol {\beta}^{\prime}_{t} \mathbf {x}_{it})
\tag{3.5}
\end{equation}
where \beta_{t0} is the intercept at measurement occasion t, \boldsymbol \beta_t is the regression parameter vector associated with the covariates at measurement occasion t and F is a cdf. The threshold
Y_{it}=1 \Leftrightarrow U^{B}_{it} \leq \beta_{t0} + 2 \boldsymbol {\beta}^{\prime}_t \mathbf {x}_{it},
generates clustered binary responses that satisfy the marginal model (3.5), where
\begin{equation}
U^{B}_{it}=\boldsymbol {\beta}^{\prime}_t \mathbf {x}_{it}+e^{B}_{it},
\tag{3.6}
\end{equation}
and where \{e^{B}_{it}:i=1,\ldots,N \text{ and } t=1,\ldots,T\} are random variables such that:
- e^{B}_{it} \sim F for all i and t.
- e^{B}_{i_1t_1} and e^{B}_{i_2t_2} are independent random variables provided that i_1 \neq i_2.
For each subject i, the association structure among the clustered binary responses \{Y_{it}:t=1,\ldots,T\} depends on the pairwise bivariate distributions and correlation matrix of \{e^{B}_{it}:t=1,\ldots,T\}. If the random variables \{e^{B}_{it}:t=1,\ldots,T\} are independent then so are \{Y_{it}:t=1,\ldots,T\}.
Example 3.5 (Simulation of clustered binary responses conditional on a marginal probit model using NORTA method) Suppose the goal is to simulate clustered binary responses from the marginal probit model
\Pr(Y_{it}=1 |\mathbf{x}_{it})=\Phi(0.2x_i)
where N=100, T=4 and \mathbf{x}_{it}=x_i\overset{iid}{\sim} N(0,1) for all i and t. For the association structure, assume that the random variables \mathbf{e}_i^{B}=(e^{B}_{i1},e^{B}_{i2},e^{B}_{i3},e^{B}_{i4})^{\prime} in (3.6) are iid random vectors from the tetra-variate normal distribution with mean vector the zero vector and covariance matrix the correlation matrix \mathbf{R} given by
\begin{equation}
\mathbf{R}=\left( {\begin{array}{*{20}c}
1.00 & 0.90 & 0.90 & 0.90 \\
0.90 & 1.00 & 0.90 & 0.90 \\
0.90 & 0.90 & 1.00 & 0.90 \\
0.90 & 0.90 & 0.90 & 1.00
\end{array} } \right).
\tag{3.7}
\end{equation}
This association configuration defines an exchangeable correlation matrix for the clustered binary responses, i.e. \text{corr}(Y_{it_1},Y_{it_2})=\rho_i for all i and t. The strength of the correlation (\rho_i) is decreasing as the absolute value of the time-stationary covariate x_i increases. For example, \rho_i=0.7128 when x_{i}=0 and \rho_i=0.7 when x_i=3 or x_i=-3. Therefore, a strong exchangeable correlation pattern for each subject that does not differ much across subjects is implied with this configuration.
set.seed(123)
sample_size <- 100
cluster_size <- 4
beta_intercepts <- 0
beta_coefficients <- 0.2
latent_correlation_matrix <- toeplitz(c(1, 0.9, 0.9, 0.9))
x <- rep(rnorm(sample_size), each = cluster_size)
simulated_binary_dataset <- rbin(clsize = cluster_size, intercepts = beta_intercepts,
betas = beta_coefficients, xformula = ~x, cor.matrix = latent_correlation_matrix,
link = "probit")
library("gee")
binary_gee_model <- gee(y ~ x, family = binomial("probit"), id = id, data = simulated_binary_dataset$simdata)
summary(binary_gee_model)$coefficients
Example 3.6 (Simulation of clustered binary responses under a conditional marginal logit model without utilizing the NORTA method) Consider now simulation of correlated binary responses from the marginal logit model
\begin{equation*}
\Pr(Y_{it}=1 |\mathbf{x}_{it})=F(0.2x_i)
\end{equation*}
where F is the cdf of the standard logistic distribution (mean =0 and variance =\pi^2/3), N=100, T=4 and \mathbf{x}_{it}=x_i\overset{iid}{\sim} N(0,1) for all i and t. This is similar to the marginal model configuration in Example 3.5 except from the link function. For the dependence structure, assume that the correlation matrix of \mathbf{e}_i^{B}=(e^{B}_{i1},e^{B}_{i2},e^{B}_{i3},e^{B}_{i4})^{\prime} in (3.6) is equal to the correlation matrix \mathbf{R} defined in (3.7). To simulate \mathbf{e}_i^{B} without utilizing the NORTA method, one can employ the tetra-variate extreme value distribution (Gumbel 1958). In particular, this is accomplished by setting \mathbf{e}_i^{B}=\mathbf{U}_i-\mathbf{V}_i for all i, where \mathbf{U}_i and \mathbf{V}_i are independent random vectors from the tetra-variate extreme value distribution with dependence parameter equal to 0.9, that is
\Pr\left(U_{i1}\leq u_{i1},U_{i2}\leq u_{i2},U_{i3}\leq u_{i3},U_{i4}\leq u_{i4}\right)=\exp\left\{-\left[\sum_{t=1}^4 \exp{\left(-\frac{u_{it}}{0.9}\right)}\right]^{0.9}\right\}
and
\Pr\left(V_{i1}\leq v_{i1},V_{i2}\leq v_{i2},V_{i3}\leq v_{i3},V_{i4}\leq v_{i4}\right)=\exp\left\{-\left[\sum_{t=1}^4 \exp{\left(-\frac{v_{it}}{0.9}\right)}\right]^{0.9}\right\}.
It follows that e_{it}^{B}\sim F for all i and t and \textrm{corr}(\mathbf{e}_i^{B})=\mathbf{R} for all i.
set.seed(8)
library("evd")
simulated_latent_variables1 <- rmvevd(sample_size, dep = sqrt(1 - 0.9), model = "log",
d = cluster_size)
simulated_latent_variables2 <- rmvevd(sample_size, dep = sqrt(1 - 0.9), model = "log",
d = cluster_size)
simulated_latent_variables <- simulated_latent_variables1 - simulated_latent_variables2
simulated_binary_dataset <- rbin(clsize = cluster_size, intercepts = beta_intercepts,
betas = beta_coefficients, xformula = ~x, rlatent = simulated_latent_variables)
binary_gee_model <- gee(y ~ x, family = binomial("logit"), id = id, data = simulated_binary_dataset$simdata)
summary(binary_gee_model)$coefficients
No marginal model specification
To achieve simulation of clustered binary, ordinal and nominal responses under no marginal model specification, perform the following intercepts:
Based on the marginal probabilities calculate the intercept of a marginal probit model for binary responses (see Example 3.7) or the category-specific intercepts of a cumulative probit model (see third example in Touloumis 2016) or of a baseline-category logit model for multinomial responses (see Example 3.8).
Create a pseudo-covariate say x
of length equal to the number of cluster size (clsize
) times the desired number of clusters of simulated responses (say R
), that is x = clsize * R
. This step is required in order to identify the desired number of clustered responses.
Set betas = 0
in the core functions rbin
(see Example 3.7) or rmult.clm
, or set 0 all values of the beta
argument that correspond to the category-specific parameters in the core function rmult.bcl
(see Example 3.8).
set xformula = ~ x
.
Run the core function to obtain realizations of the simulated clustered responses.
Example 3.7 (Simulation of clustered binary responses without covariates) Suppose the goal is to simulate 5000 clustered binary responses with \Pr(Y_{t}=1)=0.8 for all t=1,\ldots,4. For simplicity, assume that the clustered binary responses are independent.
set.seed(123)
sample_size <- 5000
cluster_size <- 4
beta_intercepts <- qnorm(0.8)
x <- rep(0, each = cluster_size * sample_size)
beta_coefficients <- 0
latent_correlation_matrix <- diag(cluster_size)
simulated_binary_dataset <- rbin(clsize = cluster_size, intercepts = beta_intercepts,
betas = beta_coefficients, xformula = ~x, cor.matrix = latent_correlation_matrix,
link = "probit")
colMeans(simulated_binary_dataset$Ysim)
Example 3.8 (Simulation of clustered nominal responses without covariates) Suppose the aim is to simulate N=5000 clustered nominal responses with
\Pr(Y_{t}=1)=0.1, \Pr(Y_{t}=2)=0.2, \Pr(Y_{t}=3)=0.3 and \Pr(Y_{t}=4)=0.4, for all i and t=1,\ldots,3. For the sake of simplicity, we assume that the clustered responses are independent.
sample_size <- 5000
cluster_size <- 3
x <- rep(0, each = cluster_size * sample_size)
betas <- c(log(0.1/0.4), 0, log(0.2/0.4), 0, log(0.3/0.4), 0, 0, 0)
categories_no <- 4
set.seed(1)
latent_correlation_matrix <- kronecker(diag(cluster_size), diag(categories_no))
simulated_nominal_dataset <- rmult.bcl(clsize = cluster_size, ncategories = categories_no,
betas = betas, xformula = ~x, cor.matrix = latent_correlation_matrix)
apply(simulated_nominal_dataset$Ysim, 2, table)/sample_size
A note on the correlation matrix
In SimCorMultRes
, the user provides the correlation matrix (denoted as \mathbf{R}) for the multivariate normal distribution used in the intermediate step of the NORTA method, rather than the correlation matrix of the latent responses used in the corresponding threshold approach. This choice is motivated by the observation that when all the marginal distributions of the correlated latent responses follow a logistic distribution, the correlation matrix \mathbf{R} and the correlation matrix of the latent responses are expected to be similar, as noted by Touloumis (2016). Therefore, in SimCorMultRes
, this approximation is employed irrespective of the marginal distribution of the latent responses
.
To evaluate the validity of this approximation for the threshold approaches employed in SimCorMultRes
, a simulation study was conducted. For a fixed sample size N and a correlation parameter \rho, N independent bivariate random vectors \{\mathbf y_{i}: i = 1, \ldots, N \} from a bivariate normal distribution with mean vector the zero vector and covariance matrix the correlation matrix
\mathbf R = \begin{bmatrix}
1 & \rho\\
\rho & 1
\end{bmatrix}
were drawn. The sample correlation was used to estimate \rho. Next, the NORTA method was applied to obtain bivariate random vectors \{\mathbf z_{i}: i = 1, \ldots, N \} so that their marginal distribution is the logistic distribution. Their correlation parameter, say \rho_{z}, was estimated using the corresponding sample correlation. Then, the NORTA method was applied to obtain bivariate random vectors \{\mathbf w_{i}: i = 1, \ldots, N \} so that their marginal distribution is the Gumbel distribution. Their correlation parameter, say \rho_{w}, was estimated using their sample correlation.
For a fixed value of \rho, this procedure was replicated 10,000,000 times to reduce the simulation error and with N=10,000 to reduce the sampling error. The three correlation parameters \rho, \rho_z and \rho_w were estimated using their corresponding Monte Carlo counterparts, denoted by \widehat{\rho}, \widehat{\rho}_z and \widehat{\rho}_w, respectively. We let \rho \in \{ 0, 0.01,0.02,\ldots, 0.99\}.
The dataframe simulation
contains the true correlation parameter \rho (rho
) and the Monte Carlo estimates \widehat{\rho} (normal
), \widehat{\rho}_z (logistic
) and \widehat{\rho}_w (gumbel
) from the simulation study described above. As expected, \widehat{\rho} \approx \rho regardless of the strength of the correlation parameter. For the case of logistic marginal distributions, the average difference between \rho and \widehat{\rho}_z is 0.002, taking the maximum value of 0.0031 at \rho = 0.58. Therefore \rho appears to approximate \rho_{z} to 2 decimal points. For the case of Gumbel marginal distributions, the average difference between \rho and \widehat{\rho}_w is 0.0103, taking the maximum value of 0.0154 at \rho = 0.51. Although \rho appears to approximate well \rho_{w}, there is some accuracy loss compared to \rho_{z}. The plot below shows the differences between the true correlation coefficient of the bivariate normal distribution and the simulated correlations for \rho, \rho_z or \rho_w.
Overall, there is little accuracy loss by specifying the correlation matrix of the multivariate normal distribution in the intermediate step of the NORTA distribution instead of the correlation matrix of correlated latent responses regardless of whether their marginal distributions are either logistic or Gumbel distributions. Users can treat the correlation matrix passed on to the core functions of SimCorMultRes
as the correlation matrix of the latent variables.