---
title: "How to use the simsurv package"
author: "Sam Brilleman"
date: "`r Sys.Date()`"
output: 
  html_vignette:
    toc: yes
vignette: >
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteIndexEntry{How to use the simsurv package}
  \usepackage[utf8]{inputenc}
---

```{r setup, include=FALSE, message=FALSE}
knitr::opts_chunk$set(echo = TRUE,
                      fig.width = 7.5, 
                      fig.height = 5)

# Load packages
library(flexsurv)
library(rstpm2)
library(simsurv)
```

## Preamble

This vignette provides examples demonstrating usage of the **simsurv** package. For a technical vignette describing the methods underpinning the package, please see [Technical background to the simsurv package](simsurv_technical.html). 

Note that this package is modelled on the **survsim** Stata package. For comparability, the majority of the following examples are based on Crowther and Lambert (2012), which is the supporting paper for the **survsim** Stata package. 

## Usage examples

### Example 1: Simulating under a standard parametric survival model

This first example shows how the **simsurv** package can be used to generate event times under a relatively standard Weibull proportional hazards model. This will be demonstrated as part of a simple simulation study. 

The simulated event times will be generated under the following conditions: 

* a monotonically increasing baseline hazard function, achieved by specifying a Weibull baseline hazard with a $\gamma$ parameter of 1.5; 
* the effect of a protective treatment obtained by specifying a binary covariate with log hazard ratio of -0.5; 
* a maximum follow up time by censoring any individuals with a simulated survival time larger than five years. 

The objective of the simulation study will be to assess the bias and coverage of the estimated treatment effect. This will be achieved by: 

* generating 100 simulated datasets (ideally it should be more than 100 datasets, but we don't want the vignette to take forever to build!), each containing $N = 200$ individuals; 
* fitting a Weibull proportional hazards model to each simulated dataset using the **flexsurv** package; 
* calculating mean bias and mean coverage (of the estimated treatment effect) across the 100 simulated datasets. 

The code for performing the simulation study and the results are shown below.

```{r, example1}
# Define a function for analysing one simulated dataset
sim_run <- function() {
  # Create a data frame with the subject IDs and treatment covariate
  cov <- data.frame(id = 1:200,
                    trt = rbinom(200, 1, 0.5))
  
  # Simulate the event times
  dat <- simsurv(lambdas = 0.1, 
                 gammas = 1.5, 
                 betas = c(trt = -0.5), 
                 x = cov, 
                 maxt = 5)
  
  # Merge the simulated event times onto covariate data frame
  dat <- merge(cov, dat)
  
  # Fit a Weibull proportional hazards model
  mod <- flexsurv::flexsurvspline(Surv(eventtime, status) ~ trt, data = dat)
  
  # Obtain estimates, standard errors and 95% CI limits
  est <- mod$coefficients[["trt"]]
  ses <- sqrt(diag(mod$cov))[["trt"]]
  cil <- est + qnorm(.025) * ses
  ciu <- est + qnorm(.975) * ses
  
  # Return bias and coverage indicator for treatment effect
  c(bias = est - (-0.5), 
    coverage = ((-0.5 > cil) && (-0.5 < ciu)))
}

# Set seed for simulations
set.seed(908070)

# Perform 100 replicates in simulation study
rowMeans(replicate(100, sim_run()))
```

Here we see that there is very little bias in the estimates of the log hazard ratio for the treatment effect, and the 95% confidence intervals are near their intended level of coverage.

### Example 2: Simulating under a flexible parametric survival model

Next, we will simulate event times under a slightly more complex parametric survival model that incorporates a flexible baseline hazard.

In this example we will use the publically accessible German breast cancer dataset. This dataset is included with the **simsurv** R package (see `help(simsurv::brcancer)` for a description of the dataset). Let us look at the first few rows of the dataset:

```{r, brcancer_data}
data("brcancer")
head(brcancer)
```

Now let us fit two parametric survival models to the breast cancer data:

* one Weibull survival model; and 
* one flexible parametric survival model

The flexible parametric survival model will be based on the method of Royston and Parmar (2002); i.e. restricted cubic splines are used to approximate the log cumulative baseline hazard. This model can be estimated using the `flexsurvspline` function from the **flexsurv** package (Jackson (2016)). 

We will use three internal knots (i.e. four degrees of freedom) for the restricted cubic splines with the knot points placed at evenly spaced percentiles of the distribution of observed event time (obtained by specifying the argument `k = 3` in the code below). We can also estimate the Weibull proportional hazards model using the `flexsurvspline` function from the **flexsurv** package, by specifying no internal knots (i.e. specifying `k = 0`).

```{r, brcancer_models}
# Fit the Weibull survival model
mod_weib <- flexsurv::flexsurvspline(Surv(rectime, censrec) ~ hormon, 
                                     data = brcancer, k = 0)

# Fit the flexible parametric survival model
mod_flex <- flexsurv::flexsurvspline(Surv(rectime, censrec) ~ hormon, 
                                     data = brcancer, k = 3)
```

Now let us compare the fit of the two models by plotting each of the fitted survival functions on top of the Kaplan-Meier survival curve.

```{r, brcancer_plots}
par(mfrow = c(1,2), cex = 0.85) # graphics parameters
plot(mod_weib,
     main = "Weibull model",
     ylab = "Survival probability",
     xlab = "Time") 
plot(mod_flex,
     main = "Flexible parametric model",
     ylab = "Survival probability",
     xlab = "Time")
```

There is evidence in the plots that the flexible parametric model fits the data better than the standard Weibull model. Therefore, if we wanted to simulate event times from a data generating process similar to that of the breast cancer data, then using a Weibull distribution may not be adequate. Rather, it would be more appropriate to simulate event times under the flexible parametric model. We will demonstrate how the **simsurv** package can be used to do this. The estimated parameters from the flexible parametric model will be used as the "true" parameters for the simulated event times.

The event times can be generated under a user-specified log cumulative hazard function that is equivalent to the Royston and Parmar specification used by the **flexsurv** package. First, the log cumulative hazard function for this model needs to be defined as a function in the R session. The **user-defined function passed to `simsurv` must always have the following three arguments:** 

* `t`: scalar specifying the current time at which to evaluate the hazard 
* `x`: a named list with the covariate data
* `betas`: a named list with the "true" parameters

Each of these arguments provide information that is used in evaluating the hazard $h_i(t)$, log hazard $\log h_i(t)$, cumulative hazard $H_i(t)$, or log cumulative hazard $\log H_i(t)$ (depending on which type of user-specified function is being provided). These three arguments (`t`, `x`, `betas`) can then be followed in the function signature by any additional arguments that may be necessary. For example, in the function definition below, the first three arguments are followed by an additional argument `knots`, which allows the calculation of the log cumulative hazard at time $t$ to depend on the knot locations for the splines.

```{r, define_logcumhaz}
# Define a function returning the log cum hazard at time t
logcumhaz <- function(t, x, betas, knots) {
  
  # Obtain the basis terms for the spline-based log
  # cumulative hazard (evaluated at time t)
  basis <- flexsurv::basis(knots, log(t))
  
  # Evaluate the log cumulative hazard under the
  # Royston and Parmar specification
  res <- 
    betas[["gamma0"]] * basis[[1]] + 
    betas[["gamma1"]] * basis[[2]] +
    betas[["gamma2"]] * basis[[3]] +
    betas[["gamma3"]] * basis[[4]] +
    betas[["gamma4"]] * basis[[5]] +
    betas[["hormon"]] * x[["hormon"]]
  
  # Return the log cumulative hazard at time t
  res
}
```

Next, we will show how to use the `simsurv` function to simulate event times under the flexible parametric model. To demonstrate this, we will again generate the event times as part of a simulation study. The objective of the simulation study will be to assess the bias and coverage of the estimated log hazard ratio for hormone therapy. This will be achieved by: 

* generating 100 simulated datasets (ideally it should be more than 100 datasets, but we don't want the vignette to take forever to build!), each containing $N = 200$ individuals. The simulated event times will be generated under our flexible parametric model (with the "true" parameter values taken from fitting a model to the German breast cancer data);
* fitting both a Weibull model and a flexible parametric model to each simulated dataset; 
* calculating the mean bias (across the 100 simulated datasets) in the log hazard ratio for hormone therapy under the Weibull model and the flexible parametric models. 

```{r, example2}
# Fit the model to the brcancer dataset to obtain the "true"
# parameter values that will be used in our simulation study
true_mod <- flexsurv::flexsurvspline(Surv(rectime, censrec) ~ hormon, 
                                     data = brcancer, k = 3)

# Define a function to generate one simulated dataset, fit
# our two models (Weibull and flexible) to the simulated data
# and then return the bias in the estimated effect of hormone
# therapy under each fitted model
sim_run <- function(true_mod) {
  # Create a data frame with the subject IDs and treatment covariate
  cov <- data.frame(id = 1:200, hormon = rbinom(200, 1, 0.5))

  # Simulate the event times
  dat <- simsurv(betas = true_mod$coefficients, # "true" parameter values
                 x = cov,                   # covariate data for 200 individuals
                 knots = true_mod$knots,    # knot locations for splines
                 logcumhazard = logcumhaz,  # definition of log cum hazard
                 maxt = NULL,               # no right-censoring
                 interval = c(1E-8,100000)) # interval for root finding
  
  # Merge the simulated event times onto covariate data frame
  dat <- merge(cov, dat)

  # Fit a Weibull proportional hazards model
  weib_mod <- flexsurv::flexsurvspline(Surv(eventtime, status) ~ hormon, 
                                       data = dat, k = 0)

  # Fit a flexible parametric proportional hazards model
  flex_mod <- flexsurv::flexsurvspline(Surv(eventtime, status) ~ hormon, 
                                       data = dat, k = 3)
  
  # Obtain estimates, standard errors and 95% CI limits for hormone effect
  true_loghr <- true_mod$coefficients[["hormon"]]
  weib_loghr <- weib_mod$coefficients[["hormon"]]
  flex_loghr <- flex_mod$coefficients[["hormon"]]
 
  # Return bias and coverage indicator for hormone effect
  c(weib_bias = weib_loghr - true_loghr, 
    flex_bias = flex_loghr - true_loghr)
}

# Set a seed for the simulations
set.seed(543543)

# Perform the simulation study using 100 replicates
rowMeans(replicate(100, sim_run(true_mod = true_mod)))
```

### Example 3: Simulating under a Weibull model with time-dependent effects

This short example shows how to simulate data under a standard Weibull survival model that incorporates a time-dependent effect (i.e. non-proportional hazards). For the time-dependent effect we will include a single binary covariate (e.g. a treatment indicator) with a protective effect (i.e. a negative log hazard ratio), but we will allow the effect of the covariate to diminish over time. The data generating model will be

$$
  h_i(t) = \gamma \lambda (t ^{\gamma - 1}) \exp(\beta_0 X_i + \beta_1 X_i\times \log(t))
$$

where $X_i$ is the binary treatment indicator for individual $i$, $\lambda$ and $\gamma$ are the scale and shape parameters for the Weibull baseline hazard, $\beta_0$ is the log hazard ratio for treatment when $t = 1$ (i.e. when $\log(t) = 0$), and $\beta_1$ quantifies the amount by which the log hazard ratio for treatment changes for each one unit increase in $\log(t)$. Here we are assuming the time-dependent effect is induced by interacting the log hazard ratio with log time, but we could have used some other function of time (for example linear time, $t$, or time squared, $t^2$, if we had wanted to).

We will simulate data for $N = 5000$ individuals under this model, with a maximum follow up time of five years, and using the following "true" parameter values for the data generating model:

* $\beta_0 = -0.5$
* $\beta_1 = 0.15$
* $\lambda = 0.1$
* $\gamma = 1.5$

```{r, tdeexample_sim}
covs <- data.frame(id = 1:5000, trt = rbinom(5000, 1, 0.5))
simdat <- simsurv(dist = "weibull", lambdas = 0.1, gammas = 1.5, betas = c(trt = -0.5),
                  x = covs, tde = c(trt = 0.15), tdefunction = "log", maxt = 5)
simdat <- merge(simdat, covs)
head(simdat)
```

Then let us fit a flexible parametric model with two internal knots (i.e. 3 degrees of freedom) for the baseline hazard, and a time-dependent hazard ratio for the treatment effect. For the time-dependent hazard ratio we will use an interaction with log time (the same as used in the data generating model); this can be easily achieved using the `stpm2` function from the **rstpm2** package (Clements and Liu (2017)) and specifying the `tvc` option. Note that the **rstpm2** package and **flexsurv** packages can both be used to fit the Royston and Parmar flexible parametric survival model, however, they differ slightly in their post-estimation functionality and other possible extensions. Here, we use the **rstpm2** package because it allows us to easily specify time-dependent effects and then plot the time-dependent hazard ratio after fitting the model (as shown in the code below). 

The model with the time-dependent effect for treatment can be estimated using the following code

```{r, tdeexample_mod1}
mod_tvc <- rstpm2::stpm2(Surv(eventtime, status) ~ trt, 
                         data = simdat, tvc = list(trt = 1))
```

And for comparison we can fit the corresponding model, but without the time-dependent effect for treatment (i.e. assuming proportional hazards instead)

```{r, tdeexample_mod2}
mod_ph <- rstpm2::stpm2(Surv(eventtime, status) ~ trt, 
                        data = simdat)
```

Now, we can plot the time-dependent hazard ratio and the time-fixed hazard ratio on the same plot region using the following code

```{r, tdeexample_plothr}
plot(mod_tvc, newdata = data.frame(trt = 0), type = "hr", 
     var = "trt", ylim = c(0,1), ci = TRUE, rug = FALSE,
     main = "Time dependent hazard ratio",
     ylab = "Hazard ratio", xlab = "Time")
plot(mod_ph,  newdata = data.frame(trt = 0), type = "hr", 
     var = "trt", ylim = c(0,1), add = TRUE, ci = FALSE, lty = 2)
```

From the plot we can see the diminishing effect of treatment under the model with the time-dependent hazard ratio; as time increases the hazard ratio approaches a value of 1. Moreover, note that the hazard ratio is approximately equal to a value of 0.6 (i.e. $\exp(-0.5)$) when $t = 1$, which is what we specified in the data generating model.

### Example 4: Simulating under a joint model for longitudinal and survival data

This example shows how the **simsurv** package can be used to simulate event times under a shared parameter joint model for longitudinal and survival data. 

We will simulate event times according to the following model formulation for the longitudinal submodel

$$ Y_i(t) \sim N(\mu_i(t), \sigma_y^2) $$

$$ \mu_i(t) = \beta_{0i} + \beta_{1i} t + \beta_2 x_{1i} + \beta_3 x_{2i} $$
$$ \beta_{0i} = \beta_{00} + b_{0i} $$
$$ \beta_{1i} = \beta_{10} + b_{1i} $$
$$ (b_{0i}, b_{1i})^T \sim N(0, \Sigma) $$

and the event submodel

$$ 
  h_i(t) = \delta (t^{\delta-1}) 
    \exp (\gamma_0 + \gamma_1 x_{1i} + \gamma_2 x_{2i} + \alpha \mu_i(t)) 
$$

where $x_{1i}$ is an indicator variable for a binary covariate, $x_{2i}$ is a continuous covariate, $b_{0i}$ and $b_{1i}$ are individual-level parameters (i.e. random effects) for the intercept and slope for individual $i$, the $\beta$ and $\gamma$ terms are population-level parameters (i.e. fixed effects), and $\delta$ is the shape parameter for the Weibull baseline hazard. 

This specification allows for an individual-specific linear trajectory for the longitudinal submodel, a Weibull baseline hazard in the event submodel, a current value association structure, and the effects of a binary and a continuous covariate in both the longitudinal and event submodels. 

To simulate from this model using **simsurv**, we need to first explicitly define the hazard function. The code defining a function that returns the hazard for this joint model is

```{r, jmexample_define_hazard}
# First we define the hazard function to pass to simsurv
# (NB this is a Weibull proportional hazards regression submodel
# from a joint longitudinal and survival model with a "current
# value" association structure)
haz <- function(t, x, betas, ...) {
  betas[["delta"]] * (t ^ (betas[["delta"]] - 1)) * exp(
    betas[["gamma_0"]] +
    betas[["gamma_1"]] * x[["x1"]] +
    betas[["gamma_2"]] * x[["x2"]] +
    betas[["alpha"]] * (
      betas[["beta_0i"]] +
      betas[["beta_1i"]] * t +
      betas[["beta_2"]]  * x[["x1"]] +
      betas[["beta_3"]]  * x[["x2"]]
    )
  )
}
```

The next step is to define the “true” parameter values and covariate data for each individual. This is achieved by specifying two data frames: one for the parameter values, and one for the covariate data. Each row of the data frame will correspond to a different individual. The R code to achieve this is

```{r, jmexample_pars_and_covs} 
# Then we construct data frames with the true parameter
# values and the covariate data for each individual
set.seed(5454) # set seed before simulating data
N <- 200       # number of individuals

# Population (fixed effect) parameters
betas <- data.frame(
  delta   = rep(2,    N),
  gamma_0 = rep(-11.9,N),
  gamma_1 = rep(0.6,  N),
  gamma_2 = rep(0.08, N),
  alpha   = rep(0.03, N),
  beta_0  = rep(90,   N),
  beta_1  = rep(2.5,  N),
  beta_2  = rep(-1.5, N),
  beta_3  = rep(1,    N)
)

# Individual-specific (random effect) parameters
b_corrmat <- matrix(c(1, 0.5, 0.5, 1), 2, 2)
b_sds     <- c(20, 3)
b_means   <- rep(0, 2)
b_z       <- MASS::mvrnorm(n = N, mu = b_means, Sigma = b_corrmat)
b         <- sapply(1:length(b_sds), 
                    FUN = function(x) b_sds[x] * b_z[,x])
betas$beta_0i <- betas$beta_0 + b[,1]
betas$beta_1i <- betas$beta_1 + b[,2]

# Covariate data
covdat <- data.frame(
  x1 = stats::rbinom(N, 1, 0.45), # a binary covariate
  x2 = stats::rnorm(N, 44, 8.5)   # a continuous covariate
)
```

The final step is to then generate the simulated event times using a call to the `simsurv` function. The only arguments that need to be specified are the user-defined hazard function, the true parameter values, and the covariate data. In this example we will also specify a maximum follow up time of ten units (for example, ten years, after which individuals will be censored if they have not yet experienced the event). 

The code to generate the simulated event times is 

```{r, jmexample_sim}
# Set seed for simulations
set.seed(546546)

# Then simulate the survival times based on the user-defined
# hazard function, covariates data, and true parameter values
times <- simsurv(hazard = haz, x = covdat, betas = betas, maxt = 10)
```

We can them examine the first few rows of the resulting data frame, to see the simulated event times and event indicator

```{r, jmexample_view_data}
head(times)
## id eventtime status
##  1  4.813339      1
##  2  9.763900      1
##  3  5.913436      1
##  4  2.823562      1
##  5  2.315488      1
##  6 10.000000      0
```

Of course, we have only simulated the event times here; we haven't simulated any observed values for the longitudinal outcome. Moreover, although the **simsurv** package can be used for simulating joint longitudinal and time-to-event data, it did take a bit of work and several lines of code to achieve. Therefore, it is worth noting that the **simjm** package (https://github.com/sambrilleman/simjm), which acts as a wrapper for **simsurv**, is designed specifically for this purpose. It can make the process a lot easier, since it shields the user from much of the work described in this example. Instead, the user can simulate joint longitudinal and time-to-event data using one function call to `simjm::simjm` and a number of optional arguments are available to alter the exact specification of the shared parameter joint model.

## References

Clements M, Liu X. (2017) rstpm2: Generalized Survival Models. R package version 1.4.1. https://CRAN.R-project.org/package=rstpm2

Crowther MJ, Lambert PC. Simulating complex survival data. *Stata J* 2012;**12**(4):674-687.

Jackson C. flexsurv: A platform for parametric survival modeling in R. *Journal of Statistical  Software* 2016;**70**(8):1-33. \doi{10.18637/jss.v070.i08}

Royston P, Parmar MK. Flexible parametric proportional-hazards and proportional-odds models for censored survival data, with application to prognostic modelling and estimation of treatment effects. *Stat Med* 2002;**21**(15):2175-2197. \doi{10.1002/sim.1203}