The purpose of this vignette is to show how {icpack} can be used to fit certain parametric models, namely models with constant hazards, Gompertz models and Weibull models. We will use the ovarian cancer data, because with right-censored data we can easily check the results of {icpack} against other methods. Those other methods do not extend to interval-censored data, while {icpack} does. For information on the ovarian cancer data we refer to the vignette “Smooth modeling of right-censored survival data with {icpack}”, or to the help file. First the package needs to be loaded.
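library(icpack)
library(ggplot2)  # assumed to be attached as well; ggplot() and scale_y_log10() are used below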
Using icfit it is possible to fit proportional hazards models to right-censored and interval-censored data with constant baseline hazards. It is of interest to start with a model without covariates and right-censored data, since then we know the maximum likelihood estimate of the underlying constant hazard, and we also have a straightforward expression for its confidence interval. Here is how the model is fit using icfit.
ic_fit_cst <- icfit(Surv(time, d) ~ 1, data = Ova,
bdeg=1, pord=1, lambda=1e+06, update_lambda=FALSE)
ic_fit_cst
## Object of class 'icfit'
## Call:
## icfit(formula = Surv(time, d) ~ 1, data = Ova, lambda = 1e+06,
## bdeg = 1, pord = 1, update_lambda = FALSE)
##
## Bins summary: tmin = 0, tmax = 2756, number of bins = 100, bin width = 27.56
## P-splines summary: number of segments = 20, degree = 1, penalty order = 1
##
## Parameter estimates:
##
## Effective baseline dimension: 0.04605, log-likelihood: -1299, AIC: 2598
## Smoothness parameter lambda: 1e+06
## Number of iterations: 4
## n = 358, number of events = 266
We use bdeg=1 to use B-splines of degree 1, pord=1 to penalize first order differences of the B-spline coefficients, lambda=1e+06 to use an extremely high penalty, and update_lambda=FALSE to make sure that lambda is not updated in the process. The high penalty on first order differences forces the B-spline coefficients to be (nearly) equal, so the log-hazard, and hence the hazard, is constant; accordingly, the effective dimension of the log-hazard is essentially one. A plot of the hazard shows a horizontal straight line.
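The plot can be made with the plot method; the call below mirrors the one used later in this vignette (the title is our own choice).
plot(ic_fit_cst, xlab="Days since randomisation", ylab="Hazard",
  title="Constant hazard")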
The value of the hazard and its 95% confidence interval can be seen by printing the result of predict.
pic_fit_cst <- predict(ic_fit_cst)[, c("time", "haz", "sehaz", "lowerhaz", "upperhaz")]
head(pic_fit_cst)
## time haz sehaz lowerhaz upperhaz
## 1 0.00000 0.0007470698 4.582517e-05 0.0006624431 0.0008425075
## 2 5.51258 0.0007470698 4.582473e-05 0.0006624439 0.0008425065
## 3 11.02516 0.0007470699 4.582430e-05 0.0006624447 0.0008425057
## 4 16.53774 0.0007470700 4.582390e-05 0.0006624455 0.0008425049
## 5 22.05032 0.0007470700 4.582352e-05 0.0006624462 0.0008425041
## 6 27.56290 0.0007470701 4.582316e-05 0.0006624469 0.0008425033
tail(pic_fit_cst)
## time haz sehaz lowerhaz upperhaz
## 496 2728.727 0.0007468735 4.586438e-05 0.0006621800 0.0008423994
## 497 2734.240 0.0007468735 4.586478e-05 0.0006621793 0.0008424003
## 498 2739.752 0.0007468735 4.586520e-05 0.0006621785 0.0008424012
## 499 2745.265 0.0007468735 4.586564e-05 0.0006621778 0.0008424022
## 500 2750.777 0.0007468735 4.586609e-05 0.0006621770 0.0008424032
## 501 2756.290 0.0007468735 4.586657e-05 0.0006621761 0.0008424042
We see that the hazard is estimated as $\hat{\lambda} = 0.000747$, with a 95% confidence interval of about 0.000662 to 0.000842. The estimate and confidence interval are obtained using the general theory outlined in Putter, Gampe & Eilers (2023). Reassuringly, they coincide closely with standard theory, which says that $\hat{\lambda}_{\textrm{MLE}} = D/T$, where $D = \sum_{i=1}^n d_i$ is the total number of events and $T = \sum_{i=1}^n t_i$ is the total follow-up time in the data. (Here $(t_i, d_i)$ for subject $i = 1, \ldots, n$ are the usual time and status indicators in right-censored time-to-event data.) Moreover, the variance of $\log(\hat{\lambda}_{\textrm{MLE}})$ is given by $1/D$, so a 95% confidence interval for $\lambda$ is given by $\exp\bigl( \log(\hat{\lambda}_{\textrm{MLE}}) \pm 1.96/\sqrt{D} \bigr)$. In our data this would yield
D <- sum(Ova$d)
T <- sum(Ova$time)
rate <- D / T
lograte <- log(rate)
selograte <- 1 / sqrt(D)
lowerlograte <- lograte - qnorm(0.975) * selograte
upperlograte <- lograte + qnorm(0.975) * selograte
data.frame(D=D, T=T, rate=rate, lower=exp(lowerlograte), upper=exp(upperlograte))
## D T rate lower upper
## 1 266 351135 0.0007575434 0.0006717644 0.0008542757
We see that this is quite close to what icfit has given us.
We can go one step further with icfit and add covariate effects to the constant baseline hazard.
ic_fit_cst <- icfit(Surv(time, d) ~ Diameter + FIGO + Karnofsky, data = Ova,
bdeg=1, pord=1, lambda=1e+06, update_lambda=FALSE)
ic_fit_cst
## Object of class 'icfit'
## Call:
## icfit(formula = Surv(time, d) ~ Diameter + FIGO + Karnofsky,
## data = Ova, lambda = 1e+06, bdeg = 1, pord = 1, update_lambda = FALSE)
##
## Bins summary: tmin = 0, tmax = 2756, number of bins = 100, bin width = 27.56
## P-splines summary: number of segments = 20, degree = 1, penalty order = 1
##
## Parameter estimates:
## coef SE HR lower upper pvalue
## Diameter 0.21013 0.04931 1.23384 1.12017 1.359 2.03e-05 ***
## FIGO 0.56814 0.13461 1.76498 1.35569 2.298 2.44e-05 ***
## Karnofsky -0.17083 0.05325 0.84296 0.75942 0.936 0.00134 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Effective baseline dimension: 0.04602, log-likelihood: -1264, AIC: 2528
## Smoothness parameter lambda: 1e+06
## Number of iterations: 7
## n = 358, number of events = 266
This model, too, can be fitted in an alternative way, namely by direct Poisson regression: d is taken as the outcome variable and the logarithm of the follow-up time is added as an offset. Here is the code to run it.
Ova$logtime <- log(Ova$time)
pois_fit <- glm(d ~ Diameter + FIGO + Karnofsky + offset(logtime), family=poisson(), data = Ova)
spois_fit <- summary(pois_fit)
spois_fit
##
## Call:
## glm(formula = d ~ Diameter + FIGO + Karnofsky + offset(logtime),
## family = poisson(), data = Ova)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -6.33203 0.53325 -11.874 < 2e-16 ***
## Diameter 0.21167 0.04932 4.291 1.77e-05 ***
## FIGO 0.57594 0.13456 4.280 1.87e-05 ***
## Karnofsky -0.17387 0.05332 -3.261 0.00111 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 595.55 on 357 degrees of freedom
## Residual deviance: 523.70 on 354 degrees of freedom
## AIC: 1063.7
##
## Number of Fisher Scoring iterations: 6
We see that the estimates of the covariate effects are quite similar to the estimates from ic_fit_cst. Moreover, the (Intercept) term is the logarithm of the (constant) baseline hazard at covariate values zero. If we exponentiate it, the corresponding baseline hazard and its 95% confidence interval are given by
lograte <- spois_fit$coefficients[1, 1]
selograte <- spois_fit$coefficients[1, 2]
lowerlograte <- lograte - qnorm(0.975) * selograte
upperlograte <- lograte + qnorm(0.975) * selograte
data.frame(rate=exp(lograte), lower=exp(lowerlograte), upper=exp(upperlograte))
## rate lower upper
## 1 0.001778422 0.0006253608 0.005057535
The result from icfit is visualised as follows.
plot(ic_fit_cst, xlab="Days since randomisation", ylab="Hazard",
title="Constant baseline hazard") + ylim(0, 0.0052)
Naturally, direct Poisson regression using glm is preferable to icfit in the case of constant hazards, both in terms of speed and accuracy, but it is comforting to see that icfit also gives rapid and reliable results here. Moreover, Poisson regression cannot be used directly with interval-censored data.
Precision in icfit can be increased (at the cost of extra computation time) by increasing nt. If we remove the covariates again, set nt to 500, and look at the hazard and its 95% confidence interval, again obtained by printing the result of predict, we see that the results are quite a bit closer to the theoretical occurrence over exposure values.
ic_fit_cst <- icfit(Surv(time, d) ~ 1, data = Ova, nt=500,
bdeg=1, pord=1, lambda=1e+06, update_lambda=FALSE)
pic_fit_cst <- predict(ic_fit_cst)[, c("time", "haz", "sehaz", "lowerhaz", "upperhaz")]
head(pic_fit_cst)
## time haz sehaz lowerhaz upperhaz
## 1 0.00000 0.0007554777 4.634085e-05 0.0006698987 0.0008519894
## 2 5.51258 0.0007554778 4.634041e-05 0.0006698995 0.0008519885
## 3 11.02516 0.0007554779 4.633998e-05 0.0006699003 0.0008519876
## 4 16.53774 0.0007554779 4.633958e-05 0.0006699011 0.0008519868
## 5 22.05032 0.0007554780 4.633919e-05 0.0006699018 0.0008519860
## 6 27.56290 0.0007554780 4.633882e-05 0.0006699025 0.0008519852
tail(pic_fit_cst)
## time haz sehaz lowerhaz upperhaz
## 496 2728.727 0.0007552829 4.638097e-05 0.0006696355 0.0008518848
## 497 2734.240 0.0007552829 4.638137e-05 0.0006696348 0.0008518857
## 498 2739.752 0.0007552829 4.638179e-05 0.0006696340 0.0008518866
## 499 2745.265 0.0007552829 4.638223e-05 0.0006696333 0.0008518876
## 500 2750.777 0.0007552829 4.638270e-05 0.0006696325 0.0008518886
## 501 2756.290 0.0007552829 4.638318e-05 0.0006696316 0.0008518897
Note that only with right-censored data can we use the occurrence over exposure theory and the Poisson GLM. Importantly, {icpack} can also be used to fit models (with and without covariates) with constant (baseline) hazards for interval-censored data, for which there is no standard alternative.
When we set the degree of the B-splines to 1 and the order of the penalty to 2, and take a very large value of lambda, we get a model for which the log-hazard is a straight line. The corresponding hazard is of the form $h(t) = a e^{bt}$. This corresponds to the hazard of a Gompertz model, provided $b > 0$. Gompertz rates are often used to represent rates of mortality; when $b < 0$, very large, or indeed infinite, survival times are not uncommon, since then $\int_0^\infty h(t) \, dt$ is finite. For modeling human mortality this obviously is a problem, but for many other practical purposes we may ignore the issue. The fact that icfit implicitly allows negative $b$ is the reason for referring to “Gompertz” rather than Gompertz models. Here is the fit of the “Gompertz” model using icfit.
ic_fit_gompertz <- icfit(Surv(time, d) ~ 1, data = Ova,
bdeg=1, pord=2, lambda=1e+06, update_lambda=FALSE)
ic_fit_gompertz
## Object of class 'icfit'
## Call:
## icfit(formula = Surv(time, d) ~ 1, data = Ova, lambda = 1e+06,
## bdeg = 1, pord = 2, update_lambda = FALSE)
##
## Bins summary: tmin = 0, tmax = 2756, number of bins = 100, bin width = 27.56
## P-splines summary: number of segments = 20, degree = 1, penalty order = 2
##
## Parameter estimates:
##
## Effective baseline dimension: 1.048, log-likelihood: -1291, AIC: 2585
## Smoothness parameter lambda: 1e+06
## Number of iterations: 7
## n = 358, number of events = 266
We see that the effective dimension of the log-hazard is 2, so the log-hazard is a straight line. Below is a plot of the hazard; the fact that the log-hazard is a straight line is easier to see after log-transforming the y-axis.
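A call along the following lines produces such a plot; since the plot method returns a ggplot object, the y-axis can be log-transformed by adding scale_y_log10() (the title is our own choice).
plot(ic_fit_gompertz, xlab="Days since randomisation", ylab="Hazard",
  title="Gompertz baseline hazard") + scale_y_log10()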
Note that the “Gompertz” hazard rate is obtained only implicitly, by setting the order of the penalty to 2. From the result of icfit the values of $a$ and $b$ and their standard errors can only be derived indirectly.
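For instance, since the fitted log-hazard is (essentially) linear in time, $a$ and $b$ can be recovered by regressing the predicted log-hazard on time; this is a sketch, not functionality offered by icpack itself.
pg <- predict(ic_fit_gompertz)
# log h(t) = log(a) + b * t, so intercept and slope give log(a) and b
fit_ab <- lm(log(haz) ~ time, data = pg)
a <- exp(unname(coef(fit_ab)[1]))
b <- unname(coef(fit_ab)[2])
c(a = a, b = b)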
Weibull models can also be fitted with icfit in the same indirect way, by setting bdeg, pord, and lambda, with one additional trick, namely time warping. The hazard of a Weibull distribution is of the form $h(t) = \alpha \lambda t^{\alpha - 1}$, with $\lambda > 0$. If we take pord=2, lambda=1e+06, and work with $\log(t)$ instead of $t$ itself, the form of the log-hazard, as a function of the original time $t$, would be $\beta_0 + \beta_1 \log(t)$, which corresponds to a Weibull hazard with $\alpha = \beta_1 + 1$ and $\lambda = e^{\beta_0}/(\beta_1 + 1)$.
Here is the fit of the Weibull model with icfit. This works fine with our present way of measuring time in days; time=1 corresponds to 1 day after randomisation. When time is measured in years, the logarithm of time would be negative for all events and censorings before one year. In that case adding a constant, or alternatively defining logtime below as log(Ova$time * 365.25), would solve the issue.
Ova$logtime <- log(Ova$time)
ic_fit_weibull <- icfit(Surv(logtime, d) ~ 1, data = Ova,
bdeg=1, pord=2, lambda=1e+06, update_lambda=FALSE)
ic_fit_weibull
## Object of class 'icfit'
## Call:
## icfit(formula = Surv(logtime, d) ~ 1, data = Ova, lambda = 1e+06,
## bdeg = 1, pord = 2, update_lambda = FALSE)
##
## Bins summary: tmin = 0, tmax = 7.991, number of bins = 100, bin width = 0.07991
## P-splines summary: number of segments = 20, degree = 1, penalty order = 2
##
## Parameter estimates:
##
## Effective baseline dimension: 1.047, log-likelihood: -1244, AIC: 2490
## Smoothness parameter lambda: 1e+06
## Number of iterations: 50
## n = 358, number of events = 266
Here is a plot of the hazard with a log-transformed y-axis. Time on the x-axis refers to the natural logarithm of the original time, so for instance 6.9 on the x-axis corresponds to about t = 1000 days.
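Such a plot can be obtained along the same lines as before (a sketch; the axis labels and title are our own).
plot(ic_fit_weibull, xlab="Log(days since randomisation)", ylab="Hazard",
  title="Weibull fit on the log-time scale") + scale_y_log10()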
To see everything on the original time scale and to facilitate comparison with a direct parametric fit using survreg later on, let us extract the data elements of the plot and explicitly transform time back to the original scale.
We need to be careful though. The estimated hazard with transformed time measures the expected number of events per transformed time unit, not per original time unit. So we need to back-transform not only time, but also the hazard itself. If $\tilde{h}(u)$ denotes the hazard in transformed time $u = g(t)$, then on the original time scale $t = f(u) = g^{-1}(u)$ we have $h(t) = \tilde{h}(u)/f'(u)$. With $u = g(t) = \log(t)$ and $t = f(u) = \exp(u)$, we get $h(t) = \tilde{h}(u)/\exp(u) = \tilde{h}(\log(t))/t$.
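We collect the predictions on the warped (log-time) scale in a data frame pw; the chunk below assumes the same predict-based extraction as used earlier in this vignette.
pw <- predict(ic_fit_weibull)[, c("time", "haz", "sehaz", "lowerhaz", "upperhaz")]
head(pw)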
## time haz sehaz lowerhaz upperhaz
## 1 0.00000000 0.001471256 0.0004404592 0.0008181982 0.002645561
## 2 0.01598161 0.001492235 0.0004456057 0.0008311023 0.002679291
## 3 0.03196323 0.001513513 0.0004508096 0.0008442097 0.002713451
## 4 0.04794484 0.001535095 0.0004560715 0.0008575237 0.002748048
## 5 0.06392646 0.001556984 0.0004613919 0.0008710474 0.002783086
## 6 0.07990807 0.001579186 0.0004667716 0.0008847842 0.002818572
tail(pw)
## time haz sehaz lowerhaz upperhaz
## 496 7.910899 1.622971 0.1755261 1.312962 2.006176
## 497 7.926881 1.646086 0.1790758 1.329999 2.037294
## 498 7.942863 1.669530 0.1826939 1.347251 2.068901
## 499 7.958844 1.693308 0.1863816 1.364723 2.101006
## 500 7.974826 1.717424 0.1901403 1.382417 2.133616
## 501 7.990807 1.741885 0.1939712 1.400335 2.166739
pw$u <- pw$time              # the warped time u = log(t)
pw$time <- exp(pw$u)         # back to the original time scale: t = exp(u)
pw$haz <- pw$haz / pw$time   # h(t) = h~(log(t)) / t
pw$lowerhaz <- pw$lowerhaz / pw$time
pw$upperhaz <- pw$upperhaz / pw$time   # note: sehaz is left on the warped scale
head(pw)[, 1:5]
## time haz sehaz lowerhaz upperhaz
## 1 1.000000 0.001471256 0.0004404592 0.0008181982 0.002645561
## 2 1.016110 0.001468576 0.0004456057 0.0008179255 0.002636812
## 3 1.032480 0.001465901 0.0004508096 0.0008176527 0.002628092
## 4 1.049113 0.001463231 0.0004560715 0.0008173799 0.002619402
## 5 1.066014 0.001460566 0.0004613919 0.0008171069 0.002610741
## 6 1.083187 0.001457906 0.0004667716 0.0008168339 0.002602109
## time haz sehaz lowerhaz upperhaz
## 496 2726.842 0.0005951833 0.1755261 0.0004814956 0.0007357143
## 497 2770.771 0.0005940893 0.1790758 0.0004800103 0.0007352804
## 498 2815.408 0.0005929974 0.1826939 0.0004785278 0.0007348494
## 499 2860.765 0.0005919074 0.1863816 0.0004770483 0.0007344211
## 500 2906.851 0.0005908194 0.1901403 0.0004755718 0.0007339956
## 501 2953.681 0.0005897335 0.1939712 0.0004740984 0.0007335726
Here is a plot of the result.
plt <- ggplot(pw) +
geom_ribbon(aes(x = pw$time, ymin = pw$lowerhaz, ymax = pw$upperhaz),
fill = 'lightgrey') +
geom_line(aes(x = pw$time, y = pw$lowerhaz), colour = 'blue') +
geom_line(aes(x = pw$time, y = pw$upperhaz), colour = 'blue') +
geom_line(aes(x = pw$time, y = pw$haz), size = 1, colour = 'blue', lty = 1) +
ylim(0, 0.002) +
ggtitle("Weibull hazard fit") + xlab("Days since randomisation") +
ylab("Hazard") + theme_bw() + theme(plot.title = element_text(hjust = 0.5))
plt
Let us now compare this with survreg. It is most convenient to use the wrapper flexsurvreg (from the flexsurv package), because it uses a parametrisation close to ours. The parametrisation in flexsurvreg is of the form $S(t) = \exp(-(t/\mu')^{\alpha'})$, compared to our $S(t) = \exp(-\lambda t^{\alpha})$, from which we see that $\alpha = \alpha'$ and $\lambda = 1/(\mu')^{\alpha'}$. We add the fit from flexsurvreg with a dashed line to the original plot.
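The call below reproduces the fit shown in the printed output (it assumes the flexsurv package is installed).
library(flexsurv)
fsr <- flexsurvreg(Surv(time, d) ~ 1, data = Ova, dist = "weibull")
fsr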
## Call:
## flexsurvreg(formula = Surv(time, d) ~ 1, data = Ova, dist = "weibull")
##
## Estimates:
## est L95% U95% se
## shape 9.20e-01 8.31e-01 1.02e+00 4.81e-02
## scale 1.32e+03 1.16e+03 1.51e+03 8.81e+01
##
## N = 358, Events: 266, Censored: 92
## Total time at risk: 351135
## Log-likelihood = -2176.017, df = 2
## AIC = 4356.034
alp <- exp(fsr$coef[1])   # shape alpha' (flexsurv stores log(shape))
mu <- exp(fsr$coef[2])    # scale mu' (flexsurv stores log(scale))
lam <- 1 / mu^alp         # our lambda = 1 / (mu')^alpha'
hw <- function(t, alp, lam) alp * lam * t^(alp - 1)  # Weibull hazard
maxt <- max(Ova$time)
pw$haz_survreg <- hw(pw$time, alp, lam)
plt <- plt +
geom_line(aes(x = pw$time, y = pw$haz_survreg), size = 1, lty = 2)
plt
The fit is quite similar, although not entirely equivalent.
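As a check, the Weibull parameters implied by the icfit fit can be recovered indirectly from the back-transformed hazard in pw, e.g. by regressing its logarithm on log time (a sketch, analogous to the Gompertz case above).
# log h(t) = log(alpha * lambda) + (alpha - 1) * log(t) for a Weibull hazard
fit_w <- lm(log(haz) ~ log(time), data = pw)
alp_ic <- unname(coef(fit_w)[2]) + 1
lam_ic <- exp(unname(coef(fit_w)[1])) / alp_ic
c(alpha = alp_ic, lambda = lam_ic)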
Putter, H., Gampe, J., Eilers, P. H. (2023). Smooth hazard regression for interval censored data. Submitted for publication.