
Binomial Regression for Survival and Competing Risks Data
Klaus Holst & Thomas Scheike
2026-05-23
Source:vignettes/binreg.Rmd
binreg.RmdBinomial Regression for censored data
The binreg function fits a logistic link model with IPCW
adjustment for a specific time-point, and can thus be used for both
survival and competing risks data. Computation is linear in data size,
and influence functions are computed and available for downstream
calculations.
Key features:
- the censoring weights can be stratum-dependent
- predictions can be computed with standard errors
- computation time is linear in data size, including standard errors
- cluster-corrected standard errors are available via the
clustersargument
Details
The binreg function does direct binomial regression for one time-point, t, fitting the model \begin{align*} P(T \leq t, \epsilon=1 | X ) & = \mbox{expit}( X^T \beta) = F_1(t,X,\beta) \end{align*} based on an IPCW adjusted estimating equation (EE) with response Y(t)=I(T \leq t, \epsilon=1 ) \begin{align*} U(\beta,\hat G_c) = & X ( Y(t) \frac{ \Delta(t) }{\hat G_c(T_i \wedge t)} - \mbox{expit}( X^T \beta)) = 0, \end{align*} with G_c(t)=P(C>t), the censoring survival distribution, and with \Delta(t) = I( C_i > T_i \wedge t) the indicator of being uncensored at time t (type=“I”).
The default type=“II” is to augment with a censoring term, that is solve \begin{align*} & U(\beta,\hat G_c) + \int_0^t X \frac{\hat E(Y(t)| T>u)}{\hat G_c(u)} d\hat M_c(u) =0 \end{align*} where M_c(u) is the censoring martingale, this typically improves the performance. This is equivalent to the pseudo-value approach (see Overgaard (2025)).
The influence function for the type=“II” estimator is \begin{align*} U(\beta,G_c) + \int_0^t X \frac{E(Y| T>u)}{G_c(u)} d M_c(u) - \int_0^t \frac{E(X| T>u) E(Y| T>u)}{G_c(u)} d M_c(u) - \int_0^t \frac{E( X Y| T>u)}{G_c(u)} d M_c(u) \end{align*} and for type=“I” \begin{align*} & U(\beta) + \int_0^t \frac{E( X Y| T>u)}{G_c(u)} d M_c(u). \end{align*} The means E(X Y(t) | T>u) and E(Y(t)| T>u) are estimated by IPCW estimators among survivors to get estimates of the influence functions.
The function logitIPCW instead considers \begin{align*} U^{glm}(\beta,\hat G_c) = & \frac{ \Delta(t) }{\hat G_c(T_i \wedge t)} X ( Y(t) - \mbox{expit}( X^T \beta)) = 0. \end{align*} This score equation is quite similar to those of the binreg, and exactly the same when the censoring model is fully-nonparametric.
The logitIPCW has influence function \begin{align*} & U^{glm}(\beta,G_c) + \int_0^t \frac{E( X ( Y - F_1(t,\beta)) | T>u)}{G_c(u)} d M_c(u) \end{align*}
Which estimator performs the best depends on the censoring distribution and it seems that the binreg with type=“II” performs overall quite nicely (see Blanche et al (2023) and Overgaard (2025)). For the full estimated censoring model all estimators have the same influence function (see Blanche et al (2023)).
Additional functions logitATE, and binregATE computes the average treatment effect based on propensity and outcome modelling. We demonstrate this in another vignette.
The functions logitATE/binregATE can be used when there is no censoring and we thus have simple binary outcome.
The variance is based on a sandwich formula with IPCW adjustment
(using the influence functions), and naive.var is the
variance under a known censoring model. The influence functions are
stored in the output. Clusters can be specified to obtain
cluster-corrected standard errors.
Examples
library(mets)
options(warn=-1)
set.seed(1000) # to control output in random noise just below.
data(bmt)
bmt$time <- bmt$time+runif(nrow(bmt))*0.01
bmt$id <- 1:408
# logistic regression with IPCW binomial regression
out <- binreg(Event(time,cause)~tcell+platelet,bmt,time=50)
summary(out)
#> n events
#> 408 160
#>
#> 408 clusters
#> coeffients:
#> Estimate Std.Err 2.5% 97.5% P-value
#> (Intercept) -0.180338 0.126748 -0.428760 0.068084 0.1548
#> tcell -0.418545 0.345480 -1.095675 0.258584 0.2257
#> platelet -0.437644 0.240978 -0.909952 0.034665 0.0694
#>
#> exp(coeffients):
#> Estimate 2.5% 97.5%
#> (Intercept) 0.83499 0.65132 1.0705
#> tcell 0.65800 0.33431 1.2951
#> platelet 0.64556 0.40254 1.0353We can also compute predictions using the estimates
predict(out,data.frame(tcell=c(0,1),platelet=c(1,1)),se=TRUE)
#> pred se lower upper
#> 1 0.3502406 0.04847582 0.2552280 0.4452533
#> 2 0.2618207 0.06969334 0.1252217 0.3984196Further the censoring model can depend on strata
outs <- binreg(Event(time,cause)~tcell+platelet,bmt,time=50,cens.model=~strata(tcell,platelet))
summary(outs)
#> n events
#> 408 160
#>
#> 408 clusters
#> coeffients:
#> Estimate Std.Err 2.5% 97.5% P-value
#> (Intercept) -0.180697 0.127414 -0.430424 0.069030 0.1561
#> tcell -0.365928 0.350632 -1.053154 0.321299 0.2967
#> platelet -0.433494 0.240270 -0.904415 0.037428 0.0712
#>
#> exp(coeffients):
#> Estimate 2.5% 97.5%
#> (Intercept) 0.83469 0.65023 1.0715
#> tcell 0.69355 0.34884 1.3789
#> platelet 0.64824 0.40478 1.0381Absolute risk differences and ratio
Now for illustrations we consider the absolute risk difference depending on tcell
outs <- binreg(Event(time,cause)~tcell,bmt,time=50,cens.model=~strata(tcell))
summary(outs)
#> n events
#> 408 160
#>
#> 408 clusters
#> coeffients:
#> Estimate Std.Err 2.5% 97.5% P-value
#> (Intercept) -0.30054 0.11153 -0.51914 -0.08194 0.0070
#> tcell -0.51741 0.33981 -1.18342 0.14860 0.1278
#>
#> exp(coeffients):
#> Estimate 2.5% 97.5%
#> (Intercept) 0.74042 0.59503 0.9213
#> tcell 0.59606 0.30623 1.1602the risk difference is
ps <- predict(outs,data.frame(tcell=c(0,1)),se=TRUE)
ps
#> pred se lower upper
#> 1 0.4254253 0.02726306 0.3719897 0.4788609
#> 2 0.3061988 0.06819019 0.1725461 0.4398516
sum( c(1,-1) * ps[,1])
#> [1] 0.1192264Getting the standard errors is straightforward since the two groups are independent. If we had additionally adjusted for other covariates, however, we would need to apply the delta theorem using the relevant covariances, along the lines of:
dd <- data.frame(tcell=c(0,1))
p <- predict(outs,dd)
riskdifratio <- function(p,contrast=c(1,-1)) {
outs$coef <- p
p <- predict(outs,dd)[,1]
pd <- sum(contrast*p)
r1 <- p[1]/p[2]
r2 <- p[2]/p[1]
return(c(pd,r1,r2))
}
estimate(outs,f=riskdifratio,dd,null=c(0,1,1))
#> Estimate Std.Err 2.5% 97.5% P-value
#> p1 0.1192 0.07344 -0.02471 0.2632 0.10448
#> p2 1.3894 0.32197 0.75833 2.0204 0.22652
#> p3 0.7197 0.16679 0.39284 1.0467 0.09291
#> ────────────────────────────────────────────────────────────
#> Null Hypothesis:
#> [p1] = 0
#> [p2] = 1
#> [p3] = 1
#>
#> chisq = 12.0249, df = 2, p-value = 0.002448same as
SessionInfo
sessionInfo()
#> R version 4.6.0 (2026-04-24)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.4 LTS
#>
#> Matrix products: default
#> BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
#>
#> locale:
#> [1] LC_CTYPE=C.UTF-8 LC_NUMERIC=C LC_TIME=C.UTF-8
#> [4] LC_COLLATE=C.UTF-8 LC_MONETARY=C.UTF-8 LC_MESSAGES=C.UTF-8
#> [7] LC_PAPER=C.UTF-8 LC_NAME=C LC_ADDRESS=C
#> [10] LC_TELEPHONE=C LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: UTC
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] mets_1.3.10
#>
#> loaded via a namespace (and not attached):
#> [1] cli_3.6.6 knitr_1.51 rlang_1.2.0
#> [4] xfun_0.57 textshaping_1.0.5 jsonlite_2.0.0
#> [7] listenv_0.10.1 future.apply_1.20.2 lava_1.9.1
#> [10] htmltools_0.5.9 ragg_1.5.2 sass_0.4.10
#> [13] rmarkdown_2.31 grid_4.6.0 evaluate_1.0.5
#> [16] jquerylib_0.1.4 fastmap_1.2.0 numDeriv_2016.8-1.1
#> [19] yaml_2.3.12 mvtnorm_1.3-7 lifecycle_1.0.5
#> [22] timereg_2.0.7 compiler_4.6.0 codetools_0.2-20
#> [25] fs_2.1.0 htmlwidgets_1.6.4 Rcpp_1.1.1-1.1
#> [28] future_1.70.0 lattice_0.22-9 systemfonts_1.3.2
#> [31] digest_0.6.39 R6_2.6.1 parallelly_1.47.0
#> [34] parallel_4.6.0 splines_4.6.0 Matrix_1.7-5
#> [37] bslib_0.11.0 tools_4.6.0 RcppArmadillo_15.2.6-1
#> [40] globals_0.19.1 survival_3.8-6 pkgdown_2.2.0
#> [43] cachem_1.1.0 desc_1.4.3