This vignette provides examples demonstrating usage of the simsurv package. For a technical vignette describing the methods underpinning the package, please see Technical background to the simsurv package.
Note that this package is modelled on the survsim Stata package. For comparability, the majority of the following examples are based on Crowther and Lambert (2012), which is the supporting paper for the survsim Stata package.
This first example shows how the simsurv package can be used to generate event times under a relatively standard Weibull proportional hazards model. This will be demonstrated as part of a simple simulation study.
The simulated event times will be generated under the following conditions:
The objective of the simulation study will be to assess the bias and coverage of the estimated treatment effect. This will be achieved by:
The code for performing the simulation study and the results are shown below.
# Define a function for analysing one simulated dataset
function() {
sim_run <-# Create a data frame with the subject IDs and treatment covariate
data.frame(id = 1:200,
cov <-trt = rbinom(200, 1, 0.5))
# Simulate the event times
simsurv(lambdas = 0.1,
dat <-gammas = 1.5,
betas = c(trt = -0.5),
x = cov,
maxt = 5)
# Merge the simulated event times onto covariate data frame
merge(cov, dat)
dat <-
# Fit a Weibull proportional hazards model
flexsurv::flexsurvspline(Surv(eventtime, status) ~ trt, data = dat)
mod <-
# Obtain estimates, standard errors and 95% CI limits
mod$coefficients[["trt"]]
est <- sqrt(diag(mod$cov))[["trt"]]
ses <- est + qnorm(.025) * ses
cil <- est + qnorm(.975) * ses
ciu <-
# Return bias and coverage indicator for treatment effect
c(bias = est - (-0.5),
coverage = ((-0.5 > cil) && (-0.5 < ciu)))
}
# Set seed for simulations
set.seed(908070)
# Perform 100 replicates in simulation study
rowMeans(replicate(100, sim_run()))
## bias coverage
## 0.02842414 0.90000000
Here we see that there is very little bias in the estimates of the log hazard ratio for the treatment effect, and the 95% confidence intervals are near their intended level of coverage.
Next, we will simulate event times under a slightly more complex parametric survival model that incorporates a flexible baseline hazard.
In this example we will use the publically accessible German breast cancer dataset. This dataset is included with the simsurv R package (see help(simsurv::brcancer)
for a description of the dataset). Let us look at the first few rows of the dataset:
data("brcancer")
head(brcancer)
## id hormon rectime censrec
## 1 1 0 1814 1
## 2 2 1 2018 1
## 3 3 1 712 1
## 4 4 1 1807 1
## 5 5 0 772 1
## 6 6 0 448 1
Now let us fit two parametric survival models to the breast cancer data:
The flexible parametric survival model will be based on the method of Royston and Parmar (2002); i.e. restricted cubic splines are used to approximate the log cumulative baseline hazard. This model can be estimated using the flexsurvspline
function from the flexsurv package (Jackson (2016)).
We will use three internal knots (i.e. four degrees of freedom) for the restricted cubic splines with the knot points placed at evenly spaced percentiles of the distribution of observed event time (obtained by specifying the argument k = 3
in the code below). We can also estimate the Weibull proportional hazards model using the flexsurvspline
function from the flexsurv package, by specifying no internal knots (i.e. specifying k = 0
).
# Fit the Weibull survival model
flexsurv::flexsurvspline(Surv(rectime, censrec) ~ hormon,
mod_weib <-data = brcancer, k = 0)
# Fit the flexible parametric survival model
flexsurv::flexsurvspline(Surv(rectime, censrec) ~ hormon,
mod_flex <-data = brcancer, k = 3)
Now let us compare the fit of the two models by plotting each of the fitted survival functions on top of the Kaplan-Meier survival curve.
par(mfrow = c(1,2), cex = 0.85) # graphics parameters
plot(mod_weib,
main = "Weibull model",
ylab = "Survival probability",
xlab = "Time")
plot(mod_flex,
main = "Flexible parametric model",
ylab = "Survival probability",
xlab = "Time")
There is evidence in the plots that the flexible parametric model fits the data better than the standard Weibull model. Therefore, if we wanted to simulate event times from a data generating process similar to that of the breast cancer data, then using a Weibull distribution may not be adequate. Rather, it would be more appropriate to simulate event times under the flexible parametric model. We will demonstrate how the simsurv package can be used to do this. The estimated parameters from the flexible parametric model will be used as the “true” parameters for the simulated event times.
The event times can be generated under a user-specified log cumulative hazard function that is equivalent to the Royston and Parmar specification used by the flexsurv package. First, the log cumulative hazard function for this model needs to be defined as a function in the R session. The user-defined function passed to simsurv
must always have the following three arguments:
t
: scalar specifying the current time at which to evaluate the hazardx
: a named list with the covariate databetas
: a named list with the “true” parametersEach of these arguments provide information that is used in evaluating the hazard hi(t), log hazard loghi(t), cumulative hazard Hi(t), or log cumulative hazard logHi(t) (depending on which type of user-specified function is being provided). These three arguments (t
, x
, betas
) can then be followed in the function signature by any additional arguments that may be necessary. For example, in the function definition below, the first three arguments are followed by an additional argument knots
, which allows the calculation of the log cumulative hazard at time t to depend on the knot locations for the splines.
# Define a function returning the log cum hazard at time t
function(t, x, betas, knots) {
logcumhaz <-
# Obtain the basis terms for the spline-based log
# cumulative hazard (evaluated at time t)
flexsurv::basis(knots, log(t))
basis <-
# Evaluate the log cumulative hazard under the
# Royston and Parmar specification
res <- betas[["gamma0"]] * basis[[1]] +
betas[["gamma1"]] * basis[[2]] +
betas[["gamma2"]] * basis[[3]] +
betas[["gamma3"]] * basis[[4]] +
betas[["gamma4"]] * basis[[5]] +
betas[["hormon"]] * x[["hormon"]]
# Return the log cumulative hazard at time t
res }
Next, we will show how to use the simsurv
function to simulate event times under the flexible parametric model. To demonstrate this, we will again generate the event times as part of a simulation study. The objective of the simulation study will be to assess the bias and coverage of the estimated log hazard ratio for hormone therapy. This will be achieved by:
# Fit the model to the brcancer dataset to obtain the "true"
# parameter values that will be used in our simulation study
flexsurv::flexsurvspline(Surv(rectime, censrec) ~ hormon,
true_mod <-data = brcancer, k = 3)
# Define a function to generate one simulated dataset, fit
# our two models (Weibull and flexible) to the simulated data
# and then return the bias in the estimated effect of hormone
# therapy under each fitted model
function(true_mod) {
sim_run <-# Create a data frame with the subject IDs and treatment covariate
data.frame(id = 1:200, hormon = rbinom(200, 1, 0.5))
cov <-
# Simulate the event times
simsurv(betas = true_mod$coefficients, # "true" parameter values
dat <-x = cov, # covariate data for 200 individuals
knots = true_mod$knots, # knot locations for splines
logcumhazard = logcumhaz, # definition of log cum hazard
maxt = NULL, # no right-censoring
interval = c(1E-8,100000)) # interval for root finding
# Merge the simulated event times onto covariate data frame
merge(cov, dat)
dat <-
# Fit a Weibull proportional hazards model
flexsurv::flexsurvspline(Surv(eventtime, status) ~ hormon,
weib_mod <-data = dat, k = 0)
# Fit a flexible parametric proportional hazards model
flexsurv::flexsurvspline(Surv(eventtime, status) ~ hormon,
flex_mod <-data = dat, k = 3)
# Obtain estimates, standard errors and 95% CI limits for hormone effect
true_mod$coefficients[["hormon"]]
true_loghr <- weib_mod$coefficients[["hormon"]]
weib_loghr <- flex_mod$coefficients[["hormon"]]
flex_loghr <-
# Return bias and coverage indicator for hormone effect
c(weib_bias = weib_loghr - true_loghr,
flex_bias = flex_loghr - true_loghr)
}
# Set a seed for the simulations
set.seed(543543)
# Perform the simulation study using 100 replicates
rowMeans(replicate(100, sim_run(true_mod = true_mod)))
## weib_bias flex_bias
## -0.029244642 -0.008417815
This short example shows how to simulate data under a standard Weibull survival model that incorporates a time-dependent effect (i.e. non-proportional hazards). For the time-dependent effect we will include a single binary covariate (e.g. a treatment indicator) with a protective effect (i.e. a negative log hazard ratio), but we will allow the effect of the covariate to diminish over time. The data generating model will be
hi(t)=γλ(tγ−1)exp(β0Xi+β1Xi×log(t))
where Xi is the binary treatment indicator for individual i, λ and γ are the scale and shape parameters for the Weibull baseline hazard, β0 is the log hazard ratio for treatment when t=1 (i.e. when log(t)=0), and β1 quantifies the amount by which the log hazard ratio for treatment changes for each one unit increase in log(t). Here we are assuming the time-dependent effect is induced by interacting the log hazard ratio with log time, but we could have used some other function of time (for example linear time, t, or time squared, t2, if we had wanted to).
We will simulate data for N=5000 individuals under this model, with a maximum follow up time of five years, and using the following “true” parameter values for the data generating model:
data.frame(id = 1:5000, trt = rbinom(5000, 1, 0.5))
covs <- simsurv(dist = "weibull", lambdas = 0.1, gammas = 1.5, betas = c(trt = -0.5),
simdat <-x = covs, tde = c(trt = 0.15), tdefunction = "log", maxt = 5)
merge(simdat, covs)
simdat <-head(simdat)
## id eventtime status trt
## 1 1 2.091009 1 1
## 2 2 5.000000 0 1
## 3 3 5.000000 0 1
## 4 4 5.000000 0 1
## 5 5 5.000000 0 1
## 6 6 4.930287 1 0
Then let us fit a flexible parametric model with two internal knots (i.e. 3 degrees of freedom) for the baseline hazard, and a time-dependent hazard ratio for the treatment effect. For the time-dependent hazard ratio we will use an interaction with log time (the same as used in the data generating model); this can be easily achieved using the stpm2
function from the rstpm2 package (Clements and Liu (2017)) and specifying the tvc
option. Note that the rstpm2 package and flexsurv packages can both be used to fit the Royston and Parmar flexible parametric survival model, however, they differ slightly in their post-estimation functionality and other possible extensions. Here, we use the rstpm2 package because it allows us to easily specify time-dependent effects and then plot the time-dependent hazard ratio after fitting the model (as shown in the code below).
The model with the time-dependent effect for treatment can be estimated using the following code
rstpm2::stpm2(Surv(eventtime, status) ~ trt,
mod_tvc <-data = simdat, tvc = list(trt = 1))
And for comparison we can fit the corresponding model, but without the time-dependent effect for treatment (i.e. assuming proportional hazards instead)
rstpm2::stpm2(Surv(eventtime, status) ~ trt,
mod_ph <-data = simdat)
Now, we can plot the time-dependent hazard ratio and the time-fixed hazard ratio on the same plot region using the following code
plot(mod_tvc, newdata = data.frame(trt = 0), type = "hr",
var = "trt", ylim = c(0,1), ci = TRUE, rug = FALSE,
main = "Time dependent hazard ratio",
ylab = "Hazard ratio", xlab = "Time")
plot(mod_ph, newdata = data.frame(trt = 0), type = "hr",
var = "trt", ylim = c(0,1), add = TRUE, ci = FALSE, lty = 2)
From the plot we can see the diminishing effect of treatment under the model with the time-dependent hazard ratio; as time increases the hazard ratio approaches a value of 1. Moreover, note that the hazard ratio is approximately equal to a value of 0.6 (i.e. exp(−0.5)) when t=1, which is what we specified in the data generating model.
This example shows how the simsurv package can be used to simulate event times under a shared parameter joint model for longitudinal and survival data.
We will simulate event times according to the following model formulation for the longitudinal submodel
Yi(t)∼N(μi(t),σ2y)
μi(t)=β0i+β1it+β2x1i+β3x2i β0i=β00+b0i β1i=β10+b1i (b0i,b1i)T∼N(0,Σ)
and the event submodel
hi(t)=δ(tδ−1)exp(γ0+γ1x1i+γ2x2i+αμi(t))
where x1i is an indicator variable for a binary covariate, x2i is a continuous covariate, b0i and b1i are individual-level parameters (i.e. random effects) for the intercept and slope for individual i, the β and γ terms are population-level parameters (i.e. fixed effects), and δ is the shape parameter for the Weibull baseline hazard.
This specification allows for an individual-specific linear trajectory for the longitudinal submodel, a Weibull baseline hazard in the event submodel, a current value association structure, and the effects of a binary and a continuous covariate in both the longitudinal and event submodels.
To simulate from this model using simsurv, we need to first explicitly define the hazard function. The code defining a function that returns the hazard for this joint model is
# First we define the hazard function to pass to simsurv
# (NB this is a Weibull proportional hazards regression submodel
# from a joint longitudinal and survival model with a "current
# value" association structure)
function(t, x, betas, ...) {
haz <-"delta"]] * (t ^ (betas[["delta"]] - 1)) * exp(
betas[["gamma_0"]] +
betas[[ betas[["gamma_1"]] * x[["x1"]] +
betas[["gamma_2"]] * x[["x2"]] +
betas[["alpha"]] * (
"beta_0i"]] +
betas[[ betas[["beta_1i"]] * t +
betas[["beta_2"]] * x[["x1"]] +
betas[["beta_3"]] * x[["x2"]]
)
) }
The next step is to define the “true” parameter values and covariate data for each individual. This is achieved by specifying two data frames: one for the parameter values, and one for the covariate data. Each row of the data frame will correspond to a different individual. The R code to achieve this is
# Then we construct data frames with the true parameter
# values and the covariate data for each individual
set.seed(5454) # set seed before simulating data
200 # number of individuals
N <-
# Population (fixed effect) parameters
data.frame(
betas <-delta = rep(2, N),
gamma_0 = rep(-11.9,N),
gamma_1 = rep(0.6, N),
gamma_2 = rep(0.08, N),
alpha = rep(0.03, N),
beta_0 = rep(90, N),
beta_1 = rep(2.5, N),
beta_2 = rep(-1.5, N),
beta_3 = rep(1, N)
)
# Individual-specific (random effect) parameters
matrix(c(1, 0.5, 0.5, 1), 2, 2)
b_corrmat <- c(20, 3)
b_sds <- rep(0, 2)
b_means <- MASS::mvrnorm(n = N, mu = b_means, Sigma = b_corrmat)
b_z <- sapply(1:length(b_sds),
b <-FUN = function(x) b_sds[x] * b_z[,x])
$beta_0i <- betas$beta_0 + b[,1]
betas$beta_1i <- betas$beta_1 + b[,2]
betas
# Covariate data
data.frame(
covdat <-x1 = stats::rbinom(N, 1, 0.45), # a binary covariate
x2 = stats::rnorm(N, 44, 8.5) # a continuous covariate
)
The final step is to then generate the simulated event times using a call to the simsurv
function. The only arguments that need to be specified are the user-defined hazard function, the true parameter values, and the covariate data. In this example we will also specify a maximum follow up time of ten units (for example, ten years, after which individuals will be censored if they have not yet experienced the event).
The code to generate the simulated event times is
# Set seed for simulations
set.seed(546546)
# Then simulate the survival times based on the user-defined
# hazard function, covariates data, and true parameter values
simsurv(hazard = haz, x = covdat, betas = betas, maxt = 10) times <-
We can them examine the first few rows of the resulting data frame, to see the simulated event times and event indicator
head(times)
## id eventtime status
## 1 1 4.0795945 1
## 2 2 10.0000000 0
## 3 3 4.1868793 1
## 4 4 0.2630766 1
## 5 5 7.5213303 1
## 6 6 4.0806461 1
## id eventtime status
## 1 4.813339 1
## 2 9.763900 1
## 3 5.913436 1
## 4 2.823562 1
## 5 2.315488 1
## 6 10.000000 0
Of course, we have only simulated the event times here; we haven’t simulated any observed values for the longitudinal outcome. Moreover, although the simsurv package can be used for simulating joint longitudinal and time-to-event data, it did take a bit of work and several lines of code to achieve. Therefore, it is worth noting that the simjm package (https://github.com/sambrilleman/simjm), which acts as a wrapper for simsurv, is designed specifically for this purpose. It can make the process a lot easier, since it shields the user from much of the work described in this example. Instead, the user can simulate joint longitudinal and time-to-event data using one function call to simjm::simjm
and a number of optional arguments are available to alter the exact specification of the shared parameter joint model.
Clements M, Liu X. (2017) rstpm2: Generalized Survival Models. R package version 1.4.1. https://CRAN.R-project.org/package=rstpm2
Crowther MJ, Lambert PC. Simulating complex survival data. Stata J 2012;12(4):674-687.
Jackson C. flexsurv: A platform for parametric survival modeling in R. Journal of Statistical Software 2016;70(8):1-33.
Royston P, Parmar MK. Flexible parametric proportional-hazards and proportional-odds models for censored survival data, with application to prognostic modelling and estimation of treatment effects. Stat Med 2002;21(15):2175-2197.