vignettes/cooking-survival-data.Rmd
cooking-survival-data.Rmd
Simulation of survival data is important for both theoretical and practical work. In a practical setting we might wish to validate that standard errors are valid even in a rather small sample, or validate that a more complicated procedure is doing as intended. Therefore it is useful to have simple tools for generating survival data that looks as much as possible like particular data. In a theoretical setting we often are interested in evaluating the finite sample properties of a new procedure in different settings that often are motivated by a specific practical problem. The aim is provide such tools.
Bender et al. in a nice recent paper also discussed how to generate survival data based on the Cox model, and restricted attention to some of the many useful parametric survival models (weibull, exponential).
Different survival models can be cooked, and we here give recipes for hazard and cumulative incidence based simulations. More recipes are given in vignette about recurrent events.
Given a survival time with cumulative hazard , it follows that with (exponential with rate 1), that will have the same distribution as .
This provides the basis for simulations of survival times with a given hazard and is a consequence of this simple calculation
Similarly if given have hazard on Cox form where is a -dimensional regression coefficient and a baseline hazard funcion, then it is useful to observe also that with has the same distribution as given .
Therefore if the inverse of the cumulative hazard can be computed we can generate survival with a specified hazard function. One useful observation is note that for a piecewise linear continuous cumulative hazard on an interval it is easy to compute the inverse.
Further, we can approximate any cumulative hazard with a piecewise linear continous cumulative hazard and then simulate data according to this approximation. Recall that fitting the Cox model to data will give a piecewise constant cumulative hazard and the regression coefficients so with these at hand we can first approximate the piecewise constant “Breslow”-estimator with a linear upper (or lower bound) by simply connecting the values by straight lines.
If
given
have hazard on Cox form
and we wish to generate data according
to this hazard for those that are alive at time
,
that is draw from the distribution of
given
(all given
), then we note that
with
and with
has the distributiion we are after.
This is again a consequence of a simple calculation
The engine is to simulate data with a given linear cumulative hazard.
nsim <- 100
chaz <- c(0,1,1.5,2,2.1)
breaks <- c(0,10, 20, 30, 40)
cumhaz <- cbind(breaks,chaz)
X <- rbinom(nsim,1,0.5)
beta <- 0.2
rrcox <- exp(X * beta)
pctime <- rchaz(cumhaz,n=nsim)
pctimecox <- rchaz(cumhaz,rrcox)
Now we generate data that resemble Cox models for the bmt data
data(bmt);
cox1 <- phreg(Surv(time,cause==1)~tcell+platelet,data=bmt)
cox2 <- phreg(Surv(time,cause==2)~tcell+platelet,data=bmt)
X1 <- bmt[,c("tcell","platelet")]
n <- nsim
xid <- sample(1:nrow(X1),n,replace=TRUE)
Z1 <- X1[xid,]
Z2 <- X1[xid,]
rr1 <- exp(as.matrix(Z1) %*% cox1$coef)
rr2 <- exp(as.matrix(Z2) %*% cox2$coef)
d <- rcrisk(cox1$cum,cox2$cum,rr1,rr2)
dd <- cbind(d,Z1)
scox1 <- phreg(Surv(time,status==1)~tcell+platelet,data=dd)
scox2 <- phreg(Surv(time,status==2)~tcell+platelet,data=dd)
par(mfrow=c(1,2))
plot(cox1); plot(scox1,add=TRUE,col=2)
plot(cox2); plot(scox2,add=TRUE,col=2)
cbind(cox1$coef,scox1$coef,cox2$coef,scox2$coef)
#> [,1] [,2] [,3] [,4]
#> tcell -0.4232606 -0.1004993 0.3991068 0.7390719
#> platelet -0.5654438 -0.4042542 -0.2461474 -0.2459575
Now model with no covariates and specific call of sim.base function
data(sTRACE)
dtable(sTRACE,~chf+diabetes)
#>
#> diabetes 0 1
#> chf
#> 0 223 16
#> 1 230 31
coxs <- phreg(Surv(time,status==9)~strata(diabetes,chf),data=sTRACE)
strata <- sample(0:3,nsim,replace=TRUE)
simb <- sim.base(coxs$cumhaz,nsim,stratajump=coxs$strata.jumps,strata=strata)
cc <- phreg(Surv(time,status)~strata(strata),data=simb)
plot(coxs,col=1); plot(cc,add=TRUE,col=2)
More Cox games
cox <- coxph(Surv(time,status==9)~vf+chf+wmi,data=sTRACE)
sim1 <- sim.cox(cox,nsim,data=sTRACE)
cc <- coxph(Surv(time,status)~vf+chf+wmi,data=sim1)
cbind(cox$coef,cc$coef)
#> [,1] [,2]
#> vf 0.2970218 -0.2279416
#> chf 0.8018334 0.6783073
#> wmi -0.8920005 -1.4844731
cor(sim1[,c("vf","chf","wmi")])
#> vf chf wmi
#> vf 1.0000000 0.2067080 -0.0977774
#> chf 0.2067080 1.0000000 -0.5084292
#> wmi -0.0977774 -0.5084292 1.0000000
cor(sTRACE[,c("vf","chf","wmi")])
#> vf chf wmi
#> vf 1.00000000 0.1346711 -0.08966805
#> chf 0.13467109 1.0000000 -0.37464791
#> wmi -0.08966805 -0.3746479 1.00000000
cox <- phreg(Surv(time, status==9)~vf+chf+wmi,data=sTRACE)
sim3 <- sim.cox(cox,nsim,data=sTRACE)
cc <- phreg(Surv(time, status)~vf+chf+wmi,data=sim3)
cbind(cox$coef,cc$coef)
#> [,1] [,2]
#> vf 0.2970218 -10.5088431
#> chf 0.8018334 0.9662815
#> wmi -0.8920005 -0.8485974
plot(cox,se=TRUE); plot(cc,add=TRUE,col=2)
coxs <- phreg(Surv(time,status==9)~strata(chf,vf)+wmi,data=sTRACE)
sim3 <- sim.phreg(coxs,nsim,data=sTRACE)
cc <- phreg(Surv(time, status)~strata(chf,vf)+wmi,data=sim3)
cbind(coxs$coef,cc$coef)
#> [,1] [,2]
#> wmi -0.8683355 -0.1482848
plot(coxs,col=1); plot(cc,add=TRUE,col=2)
More Cox games with cause specific hazards
data(bmt)
# coxph
cox1 <- coxph(Surv(time,cause==1)~tcell+platelet,data=bmt)
cox2 <- coxph(Surv(time,cause==2)~tcell+platelet,data=bmt)
coxs <- list(cox1,cox2)
dd <- sim.cause.cox(coxs,nsim,data=bmt)
scox1 <- coxph(Surv(time,status==1)~tcell+platelet,data=dd)
scox2 <- coxph(Surv(time,status==2)~tcell+platelet,data=dd)
cbind(cox1$coef,scox1$coef)
#> [,1] [,2]
#> tcell -0.4231551 -0.3833041
#> platelet -0.5646181 -0.5517000
cbind(cox2$coef,scox2$coef)
#> [,1] [,2]
#> tcell 0.3991911 -0.3693576
#> platelet -0.2456203 0.4904355
Stratified Cox models using phreg
## stratified with phreg
cox0 <- phreg(Surv(time,cause==0)~tcell+platelet,data=bmt)
cox1 <- phreg(Surv(time,cause==1)~tcell+platelet,data=bmt)
cox2 <- phreg(Surv(time,cause==2)~strata(tcell)+platelet,data=bmt)
coxs <- list(cox0,cox1,cox2)
dd <- sim.cause.cox(coxs,nsim,data=bmt)
scox0 <- phreg(Surv(time,status==1)~tcell+platelet,data=dd)
scox1 <- phreg(Surv(time,status==2)~tcell+platelet,data=dd)
scox2 <- phreg(Surv(time,status==3)~strata(tcell)+platelet,data=dd)
cbind(cox0$coef,scox0$coef)
#> [,1] [,2]
#> tcell 0.1912407 0.1017686
#> platelet 0.1563789 -0.4795122
cbind(cox1$coef,scox1$coef)
#> [,1] [,2]
#> tcell -0.4232606 -0.8654778
#> platelet -0.5654438 -0.3978295
cbind(cox2$coef,scox2$coef)
#> [,1] [,2]
#> platelet -0.2271912 -0.004601463
par(mfrow=c(1,3))
plot(cox0); plot(scox0,add=TRUE,col=2);
plot(cox1); plot(scox1,add=TRUE,col=2);
plot(cox2); plot(scox2,add=TRUE,col=2);
cox1 <- phreg(Surv(time,cause==1)~strata(tcell)+platelet,data=bmt)
cox2 <- phreg(Surv(time,cause==2)~tcell+strata(platelet),data=bmt)
coxs <- list(cox1,cox2)
dd <- sim.cause.cox(coxs,nsim,data=bmt)
scox1 <- phreg(Surv(time,status==1)~strata(tcell)+platelet,data=dd)
scox2 <- phreg(Surv(time,status==2)~tcell+strata(platelet),data=dd)
cbind(cox1$coef,scox1$coef)
#> [,1] [,2]
#> platelet -0.5658612 -0.5197637
cbind(cox2$coef,scox2$coef)
#> [,1] [,2]
#> tcell 0.4153706 0.2740442
par(mfrow=c(1,2))
plot(cox1); plot(scox1,add=TRUE);
plot(cox2); plot(scox2,add=TRUE);
library(mets)
n <- 100
data(bmt)
bmt$bmi <- rnorm(408)
dcut(bmt) <- gage~age
data <- bmt
cox1 <- phreg(Surv(time,cause==1)~strata(tcell)+platelet,data=bmt)
cox2 <- phreg(Surv(time,cause==2)~strata(gage)+tcell+platelet,data=bmt)
cox3 <- phreg(Surv(time,cause==0)~strata(platelet)+bmi,data=bmt)
coxs <- list(cox1,cox2,cox3)
dd <- sim.phregs(coxs,1000,data=bmt,extend=0.002)
scox1 <- phreg(Surv(time,status==1)~strata(tcell)+platelet,data=dd)
scox2 <- phreg(Surv(time,status==2)~strata(gage)+tcell+platelet,data=dd)
scox3 <- phreg(Surv(time,status==3)~strata(platelet)+bmi,data=dd)
cbind(coef(cox1),coef(scox1), coef(cox2),coef(scox2), coef(cox3),coef(scox3))
#> [,1] [,2] [,3] [,4] [,5] [,6]
#> tcell -0.5658612 -0.4683225 0.3034864 0.4545260 -0.0837082 -0.08223882
#> platelet -0.5658612 -0.4683225 -0.2159670 -0.4755456 -0.0837082 -0.08223882
par(mfrow=c(1,3))
plot(scox1,col=2); plot(cox1,add=TRUE,col=1)
plot(scox2,col=2); plot(cox2,add=TRUE,col=1)
plot(scox3,col=2); plot(cox3,add=TRUE,col=1)
Using a hazard based simulation with delayed entry we can then simulate data from for example the general illness-death model. Here the cumulative hazards need to be specified.
First we set up some cumulative hazards, then we simulate some data and re-estimate the cumulative baselines
data(base1cumhaz)
data(base4cumhaz)
data(drcumhaz)
dr <- drcumhaz
dr2 <- drcumhaz
dr2[,2] <- 1.5*drcumhaz[,2]
base1 <- base1cumhaz
base4 <- base4cumhaz
cens <- rbind(c(0,0),c(2000,0.5),c(5110,3))
iddata <- simMultistate(nsim,base1,base1,dr,dr2,cens=cens)
dlist(iddata,.~id|id<3,n=0)
#> id: 1
#> time status entry death from to start stop
#> 1 123.8971 3 0 1 1 3 0 123.8971
#> ------------------------------------------------------------
#> id: 2
#> time status entry death from to start stop
#> 2 6.802368 2 0.000000 0 1 2 0.000000 6.802368
#> 101 408.278262 1 6.802368 0 2 1 6.802368 408.278262
#> 154 1354.051677 2 408.278262 0 1 2 408.278262 1354.051677
#> 195 1455.213519 1 1354.051677 0 2 1 1354.051677 1455.213519
#> 231 1820.282337 2 1455.213519 0 1 2 1455.213519 1820.282337
#> 259 2129.840288 1 1820.282337 0 2 1 1820.282337 2129.840288
#> 278 2396.584549 2 2129.840288 0 1 2 2129.840288 2396.584549
#> 292 3388.354776 1 2396.584549 0 2 1 2396.584549 3388.354776
#> 305 3990.632013 2 3388.354776 0 1 2 3388.354776 3990.632013
#> 315 3991.456202 1 3990.632013 0 2 1 3990.632013 3991.456202
#> 324 5110.000000 0 3991.456202 0 1 0 3991.456202 5110.000000
### estimating rates from simulated data
c0 <- phreg(Surv(start,stop,status==0)~+1,iddata)
c3 <- phreg(Surv(start,stop,status==3)~+strata(from),iddata)
c1 <- phreg(Surv(start,stop,status==1)~+1,subset(iddata,from==2))
c2 <- phreg(Surv(start,stop,status==2)~+1,subset(iddata,from==1))
###
par(mfrow=c(2,2))
bplot(c0)
lines(cens,col=2)
bplot(c3,main="rates 1-> 3 , 2->3")
lines(dr,col=1,lwd=2)
lines(dr2,col=2,lwd=2)
###
bplot(c1,main="rate 1->2")
lines(base1,lwd=2)
###
bplot(c2,main="rate 2->1")
lines(base1,lwd=2)
In this section we discuss how to simulate competing risks data that have a specfied cumulative incidence function. We consider for simplicity a competing risks model with two causes and denote the cumulative incidence curves as and .
To generate data with the required cumulative incidence functions a simple approach is to first figure out if the subject dies and then from what cause, then finally draw the survival time according to the conditional distribution.
For simplicity we consider survival times in a fixed interval , and first flip a coin with and probabilities to decide if the subject is a survivor or dies. If the subject dies we then flip a coin with probabilities and to decide if or , and finally draw a with and is a uniform.
We again note that if and are piecewise linear continuous functions then the inverses are easy to compute.
We here simulate two causes of death with two binary covariates
cif1 <- cbind(c(0,10,20,100),c(0,0.1,0.15,0.2))
cif2 <- cbind(c(0,10,20,100),c(0,0.4,0.45,0.5))
n <- 100; lrr1=c(0.2,0.1); lrr2=c(0.2,0.1); cens=NULL
### A binary, L binary
A <- rbinom(n,1,0.5)
L <- rbinom(n,1,0.5)
###
rr1 <- exp(cbind(A,L) %*% lrr1)
rr2 <- exp(cbind(A,L) %*% lrr2)
## model is fine
mmm<-max(rr1)*max(cif1[,2])+max(rr2)*max(cif2[,2])
mcif1 <- max(cif1[,2])
mcif2 <- max(cif2[,2])
if (mmm>1) warning(" models not satisfying sum <=1\n")
### here log-link model
T1 <- simsubdist(cif1,rr1,type="cif")
T2 <- simsubdist(cif2,rr2,type="cif")
###
dies <- rbinom(n,1,rr1*mcif1+rr2*mcif2)
sel1 <- rbinom(n,1,mcif2/(mcif1+mcif2))+1
epsilon <- dies*(sel1)
T1$epsilon <- epsilon
###
T1$A <- A; T1$L <- L
## times given
T1$time <- T1$timecause
T1$time2 <- T2$timecause
T1$status <- epsilon
T1 <- dtransform(T1,time=100,epsilon==0)
T1 <- dtransform(T1,status=0,epsilon==0)
###
T1 <- dtransform(T1,time=time2,epsilon==2)
T1 <- dtransform(T1,status=2,epsilon==2)
dtable(T1,~status)
#>
#> status
#> 0 1 2
#> 25 21 54
par(mfrow=c(1,2))
lrr1=c(0.2,0.1);lrr2=c(0.2,0.1)
pcif1 <- cif(Event(time,status)~strata(A,L),T1,cause=1)
pcif2 <- cif(Event(time,status)~strata(A,L),T1,cause=2)
###
newd <- data.frame(expand.grid(A=0:1,L=0:1))
rr1 <- c(exp(as.matrix(newd) %*% lrr1))
rr2 <- c(exp(as.matrix(newd) %*% lrr2))
###
cifm1 <- cbind(cif1[,1],cif1[,2] %o% rr1)
cifm2 <- cbind(cif2[,1],cif2[,2] %o% rr2)
###
par(mfrow=c(1,2))
plot(pcif1,ylim=c(0,0.3));
matlines(cifm1[,1],cifm1[,-1],col=1,lwd=2)
###
plot(pcif2,ylim=c(0,0.7))
matlines(cifm2[,1],cifm2[,-1],col=1,lwd=2)
Now assume that given covariates and are two cumulative incidence functions that satistifes the needed constraints.
Possibly
given estimators of
and
and
and
.
We can obtain a piecewise linear continuous approximation,
by linearly connecting estimates
.
Now with these at hand
and
we can generate data with these cumulative incidence functions.
Here both the cumulative incidence are on the specified form if the restriction is not important. Using sim.cifs but sim.cifs enforces the restriction. Here will be on the specified form, and not.
data(bmt)
################################################################
# simulating several causes with specific cumulatives
################################################################
cif1 <- cifreg(Event(time,cause)~tcell+age,data=bmt,cause=1)
cif2 <- cifreg(Event(time,cause)~tcell+age,data=bmt,cause=2)
## dd <- sim.cifs(list(cif1,cif2),nsim,data=bmt)
dds <- sim.cifsRestrict(list(cif1,cif2),nsim,data=bmt)
scif1 <- cifreg(Event(time,cause)~tcell+age,data=dds,cause=1)
scif2 <- cifreg(Event(time,cause)~tcell+age,data=dds,cause=2)
cbind(cif1$coef,scif1$coef)
#> [,1] [,2]
#> tcell -0.7959291 0.003453075
#> age 0.4160844 0.582797245
cbind(cif2$coef,scif2$coef)
#> [,1] [,2]
#> tcell 0.66730051 1.7314537
#> age -0.03250819 -0.5701337
par(mfrow=c(1,2))
plot(cif1); plot(scif1,add=TRUE,col=2)
plot(cif2); plot(scif2,add=TRUE,col=2)
We assumed that
with
and
,
and that the other cause was given by
with
and
,
a parametrization that satisfies the constraint
.
set.seed(100)
rho1 <- 0.2; rho2 <- 10
n <- nsim
beta=c(0.0,-0.1,-0.5,0.3)
dats <- simul.cifs(n,rho1,rho2,beta,rc=0.2)
dtable(dats,~status)
#>
#> status
#> 0 1 2
#> 6 13 81
dsort(dats) <- ~time
fg <- cifreg(Event(time,status)~Z1+Z2,data=dats,cause=1,propodds=NULL)
summary(fg)
#>
#> n events
#> 100 13
#>
#> 100 clusters
#> coeffients:
#> Estimate S.E. dU^-1/2 P-value
#> Z1 -0.25559 0.27563 0.28698 0.3538
#> Z2 0.43883 0.55113 0.57407 0.4259
#>
#> exp(coeffients):
#> Estimate 2.5% 97.5%
#> Z1 0.77446 0.45121 1.3293
#> Z2 1.55089 0.52658 4.5677
Now assume that given covariates
and
are two cumulative incidence functions that satistifes the needed
constraints. We wish to generate data that follows these two piecewise
linear cumulative indidence functions with delayed entry at time
.
We should thus generate data that follows the cumulative incidence
functions
and
this can be done according to the
recipe in the previous section.
To be specific (ignoring the
in the formula)
where
is a uniform, will have distribution given by
.
See also recurrent events vignette
data(base1cumhaz)
data(base4cumhaz)
data(drcumhaz)
dr <- drcumhaz
base1 <- base1cumhaz
base4 <- base4cumhaz
n <- 100
rr <- simRecurrent(n,base1,death.cumhaz=dr)
###
par(mfrow=c(1,3))
showfitsim(causes=1,rr,dr,base1,base1,which=1:2)
rr <- simRecurrentII(n,base1,base4,death.cumhaz=dr)
dtable(rr,~death+status)
#>
#> status 0 1 2
#> death
#> 0 17 299 41
#> 1 83 0 0
par(mfrow=c(2,2))
showfitsim(causes=2,rr,dr,base1,base4,which=1:2)
cumhaz <- list(base1,base1,base4)
drl <- list(dr,base4)
rr <- simRecurrentIII(n,cumhaz,death.cumhaz=drl)
dtable(rr,~death+status)
#>
#> status 0 1 2 3
#> death
#> 0 6 191 183 17
#> 1 72 0 0 0
#> 2 22 0 0 0
showfitsimIII(rr,cumhaz,drl)
sessionInfo()
#> R version 4.4.2 (2024-10-31)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.1 LTS
#>
#> Matrix products: default
#> BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
#>
#> locale:
#> [1] LC_CTYPE=C.UTF-8 LC_NUMERIC=C LC_TIME=C.UTF-8
#> [4] LC_COLLATE=C.UTF-8 LC_MONETARY=C.UTF-8 LC_MESSAGES=C.UTF-8
#> [7] LC_PAPER=C.UTF-8 LC_NAME=C LC_ADDRESS=C
#> [10] LC_TELEPHONE=C LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: UTC
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] mets_1.3.5 timereg_2.0.6 survival_3.7-0
#>
#> loaded via a namespace (and not attached):
#> [1] cli_3.6.3 knitr_1.49 rlang_1.1.4
#> [4] xfun_0.50 textshaping_0.4.1 jsonlite_1.8.9
#> [7] listenv_0.9.1 future.apply_1.11.3 lava_1.8.0
#> [10] htmltools_0.5.8.1 ragg_1.3.3 sass_0.4.9
#> [13] rmarkdown_2.29 grid_4.4.2 evaluate_1.0.3
#> [16] jquerylib_0.1.4 fastmap_1.2.0 mvtnorm_1.3-3
#> [19] numDeriv_2016.8-1.1 yaml_2.3.10 lifecycle_1.0.4
#> [22] compiler_4.4.2 codetools_0.2-20 fs_1.6.5
#> [25] Rcpp_1.0.13-1 future_1.34.0 systemfonts_1.1.0
#> [28] lattice_0.22-6 digest_0.6.37 R6_2.5.1
#> [31] parallelly_1.41.0 parallel_4.4.2 splines_4.4.2
#> [34] bslib_0.8.0 Matrix_1.7-1 tools_4.4.2
#> [37] globals_0.16.3 pkgdown_2.1.1 cachem_1.1.0
#> [40] desc_1.4.3