Chapter 13: Describing continuous-time event occurrence data
Source: vignettes/articles/chapter-13.Rmd
library(alda)
library(dplyr)
library(tidyr)
library(purrr)
library(stringr)
library(ggplot2)
library(patchwork)
library(survival)
library(muhaz)
library(broom)
library(gt)
13.1 A framework for characterizing the distribution of continuous-time event data
In Section 13.1 Singer and Willett (2003) discuss the salient features of continuous-time event data using a subset of data from Diekmann and colleagues (1996), who measured time to horn honking in a sample of 57 motorists who were purposefully blocked at a green light by a Volkswagen Jetta at a busy intersection near the centre of Munich, West Germany on two busy afternoons (Sunday and Monday) in 1986. Motorists were followed until they honked their horns or took an alternative action (flashing their headlights or changing lanes).
For this example we use the `honking` data set, a person-level data frame with 57 rows and 3 columns:

- `id`: Participant ID.
- `seconds`: Number of seconds to horn honking or an alternative action.
- `censor`: Censoring status (0 = honked; 1 = took an alternative action).
honking
#> # A tibble: 57 × 3
#> id seconds censor
#> <fct> <dbl> <dbl>
#> 1 1 2.88 0
#> 2 2 4.63 1
#> 3 3 2.36 1
#> 4 4 2.68 0
#> 5 5 2.5 0
#> 6 6 4.3 1
#> 7 7 1.86 0
#> 8 8 4.01 1
#> 9 9 1.41 0
#> 10 10 9.59 0
#> # ℹ 47 more rows
The defining feature of continuous time is that it is infinitely divisible—there exist an infinite number of possible instants when the target event can occur, which can be measured with ever-increasing precision by using finer metrics for clocking time. As Singer and Willett (2003) discuss, because continuous time is infinitely divisible, the (theoretical) distribution of event times has two salient features:
- The probability of observing any particular event time is infinitesimally small.
- The probability that two or more individuals will share the same event time is infinitesimally small.
Although we do not expect to observe these properties exactly in real data due to measurement limitations, such as rounding, we can observe that both of these properties are almost manifested in the event times of the `honking` data, which has only one tie at 1.41 seconds.
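As a quick check (this snippet is ours, not from the text), we can count the duplicated event times directly:

# Event times shared by more than one motorist; only 1.41 appears twice.
honking |>
  count(seconds) |>
  filter(n > 1)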
# Similar to Table 13.1:
honking |>
arrange(seconds) |>
mutate(
seconds = ifelse(censor == 0, seconds, paste0(seconds, "*")),
.keep = "none"
) |>
pull(seconds)
#> [1] "1.41" "1.41*" "1.51" "1.67" "1.68" "1.86" "2.12" "2.19"
#> [9] "2.36*" "2.48" "2.5" "2.53" "2.54" "2.56" "2.62" "2.68"
#> [17] "2.76*" "2.78*" "2.83" "2.88" "2.89" "2.92" "2.98" "3.05*"
#> [25] "3.14" "3.17" "3.21" "3.22" "3.24" "3.46*" "3.56" "3.57"
#> [33] "3.58" "3.78" "4.01*" "4.1" "4.18" "4.3*" "4.44" "4.51"
#> [41] "4.52" "4.63*" "4.71*" "4.96" "5.12*" "5.39" "5.73" "5.88*"
#> [49] "6.03" "6.21*" "6.3" "6.6*" "7.2" "9.59" "12.29" "13.18"
#> [57] "17.15*"
These features are a significant departure from the distributional features of discrete event times, where the probability of event occurrence in at least some periods is non-zero and ties are pervasive. Because of this departure, we must introduce new continuous time definitions of the survivor and hazard functions:
- The continuous-time survivor function is the cumulative probability that the event time, $T_i$, of the $i$th individual will exceed $t_j$, any one of the infinite number of possible instants when the target event could occur:

  $$S(t_{ij}) = \Pr[T_i > t_j],$$

  which is essentially identical to the discrete-time survival function in definition and meaning, aside from the different metric for clocking time.
- The continuous-time hazard function is an instantaneous rate assessing the conditional probability that the $i$th individual will experience the target event in the next small interval of time, $\Delta t$, given that they did not experience it in any prior interval, divided by the length of that interval:

  $$h(t_{ij}) = \lim_{\Delta t \to 0} \left\{ \frac{\Pr[T_i \in (t_j, t_j + \Delta t) \mid T_i \geq t_j]}{\Delta t} \right\},$$

  which means that unlike discrete-time hazard probabilities, continuous-time hazard rates do not have an upper bound and can exceed values of 1.0 (see the illustration below).
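To make the rate interpretation concrete, consider a hypothetical illustration (ours, not from the text): if the conditional probability of honking in the next tenth of a second were .5, the corresponding hazard rate would be .5 / .1 = 5.0, well above 1.0.

# Hazard rate = conditional probability / interval width; rates can exceed 1.
prob <- 0.5
interval_width <- 0.1
prob / interval_width  # .5 / .1 = 5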
13.2 Grouped methods for estimating continuous-time survivor and hazard functions
In Section 13.2 Singer and Willett (2003) discuss two strategies for estimating continuous-time survival and hazard functions from the grouped life table, which partitions continuous time into a manageable number of contiguous (possibly unequal width) intervals:
- The discrete-time method estimates continuous-time survival and hazard functions by applying the discrete-time principles of Section 10.2 to the grouped life table.
- The actuarial method (aka the life table method) estimates continuous-time survival and hazard functions based on data assumed to be available at each interval’s midpoint, using a similar strategy to the discrete-time method.
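Concretely, in our own notation (these symbols are ours: $n_j$ individuals at risk entering interval $j$, $d_j$ events, $c_j$ censored observations, and interval width $w_j$, corresponding to the n.risk, n.event, n.censor, and time_width columns computed below), the discrete-time method estimates the conditional probability of event occurrence in interval $j$ as $d_j / n_j$, while the actuarial method removes half of the censored observations from the risk set and, for the hazard, half of the events as well:

$$\hat{p}_j^{\text{discrete}} = \frac{d_j}{n_j}, \qquad \hat{p}_j^{\text{actuarial}} = \frac{d_j}{n_j - c_j/2}, \qquad \hat{h}_j^{\text{actuarial}} = \frac{d_j}{\left( n_j - c_j/2 - d_j/2 \right) w_j}.$$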
We begin by creating a grouped life table for the `honking` data, which we divide into eight contiguous intervals: seven 1-second intervals, and one 10-second interval.
honking_lifetable <- honking |>
group_by(time_interval = cut(seconds, breaks = c(1:8, 18), right = FALSE)) |>
# All grouping needs to be dropped after summarizing in order to calculate the
# number at risk correctly in the next step.
summarise(
n = n(),
n.event = sum(censor == 0),
n.censor = sum(censor == 1),
.groups = "drop"
) |>
mutate(
n.risk = sum(n) - lag(cumsum(n), default = 0),
prob.estimate = n.event / n.risk
) |>
select(time_interval, n.risk, n.event:prob.estimate)
honking_lifetable
#> # A tibble: 8 × 5
#> time_interval n.risk n.event n.censor prob.estimate
#> <fct> <int> <int> <int> <dbl>
#> 1 [1,2) 57 5 1 0.0877
#> 2 [2,3) 51 14 3 0.275
#> 3 [3,4) 34 9 2 0.265
#> 4 [4,5) 23 6 4 0.261
#> 5 [5,6) 13 2 2 0.154
#> 6 [6,7) 9 2 2 0.222
#> 7 [7,8) 5 1 0 0.2
#> 8 [8,18) 4 3 1 0.75
Then we can estimate the continuous-time survival and hazard functions “by hand” using the discrete-time and actuarial methods.
honking_lifetable <- honking_lifetable |>
mutate(
# Temporary variables
time_start = as.numeric(str_extract(time_interval, "[[:digit:]]+(?=,)")),
time_end = as.numeric(str_extract(time_interval, "(?<=,)[[:digit:]]+")),
time_width = time_end - time_start,
prob.estimate.actuarial = n.event / (n.risk - (n.censor / 2)),
# Discrete-time method estimates
surv.discrete = cumprod(1 - prob.estimate),
haz.discrete = prob.estimate / time_width,
# Actuarial method estimates
surv.actuarial = cumprod(1 - prob.estimate.actuarial),
haz.actuarial = (
n.event / (n.risk - (n.censor / 2) - (n.event / 2)) / time_width
)
) |>
select(-c(time_start:prob.estimate.actuarial))
# Table 13.2, page 477:
honking_lifetable |>
gt() |>
fmt_number(columns = prob.estimate:haz.actuarial, decimals = 4)
| time_interval | n.risk | n.event | n.censor | prob.estimate | surv.discrete | haz.discrete | surv.actuarial | haz.actuarial |
|---|---|---|---|---|---|---|---|---|
| [1,2) | 57 | 5 | 1 | 0.0877 | 0.9123 | 0.0877 | 0.9115 | 0.0926 |
| [2,3) | 51 | 14 | 3 | 0.2745 | 0.6619 | 0.2745 | 0.6537 | 0.3294 |
| [3,4) | 34 | 9 | 2 | 0.2647 | 0.4867 | 0.2647 | 0.4754 | 0.3158 |
| [4,5) | 23 | 6 | 4 | 0.2609 | 0.3597 | 0.2609 | 0.3396 | 0.3333 |
| [5,6) | 13 | 2 | 2 | 0.1538 | 0.3044 | 0.1538 | 0.2830 | 0.1818 |
| [6,7) | 9 | 2 | 2 | 0.2222 | 0.2367 | 0.2222 | 0.2122 | 0.2857 |
| [7,8) | 5 | 1 | 0 | 0.2000 | 0.1894 | 0.2000 | 0.1698 | 0.2222 |
| [8,18) | 4 | 3 | 1 | 0.7500 | 0.0473 | 0.0750 | 0.0243 | 0.1500 |
Finally we can plot the estimates of both methods. Following Singer and Willett (2003), we plot the actuarial estimates as a step function associating each survival probability and hazard rate with its entire time interval.
# Baseline plot
honking_plot <- honking_lifetable |>
mutate(
seconds = as.numeric(str_extract(time_interval, "(?<=,)[[:digit:]]+"))
) |>
add_row(seconds = 0:1, surv.discrete = 1, surv.actuarial = 1) |>
ggplot(aes(x = seconds))
# Survival functions
honking_surv.discrete <- honking_plot + geom_line(aes(y = surv.discrete))
honking_surv.actuarial <- honking_plot +
geom_step(aes(y = surv.actuarial), direction = "vh")
# Hazard functions
honking_haz.discrete <- honking_plot +
geom_line(aes(y = haz.discrete)) +
coord_cartesian(ylim = c(0, .35))
honking_haz.actuarial <- honking_plot +
geom_step(aes(y = haz.actuarial), direction = "vh") +
coord_cartesian(ylim = c(0, .35))
# Figure 13.1, page 479:
honking_surv.discrete + honking_surv.actuarial +
honking_haz.discrete + honking_haz.actuarial +
plot_layout(nrow = 2, axes = "collect_x")
13.3 The Kaplan-Meier method of estimating the continuous-time survivor function
In Section 13.3 Singer and Willett (2003) introduce the Kaplan-Meier method (aka the product-limit method) of estimating the continuous-time survival function, which is a simple extension of the discrete-time method that constructs time intervals using the raw event times such that each interval contains just one observed event time.
We can fit a survival function with the Kaplan-Meier method using the `survfit()` function from the survival package (which we briefly discussed in Section 10.1). The model formula for the `survfit()` function takes the form `response ~ terms`, where the response must be a "survival object" created by the `Surv()` function. For right-censored data, the survival object can be created by supplying two unnamed arguments to the `Surv()` function corresponding to `time` and `event` variables, in that order. Note that we can recode a `censor` variable into an `event` variable by reversing its values. For 0-1 coded data, we can write the event status as `event = 1 - censor`.
honking_survfit <- survfit(Surv(seconds, 1 - censor) ~ 1, data = honking)
honking_survfit
#> Call: survfit(formula = Surv(seconds, 1 - censor) ~ 1, data = honking)
#>
#> n events median 0.95LCL 0.95UCL
#> [1,] 57 42 3.58 3.17 4.96
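Note that the reported median is the earliest observed event time at which the Kaplan-Meier survival estimate drops to .5 or below, which we can verify with a quick check of our own:

# The earliest time where the survival estimate falls to .5 or below (3.58).
honking_survfit |>
  tidy() |>
  filter(estimate <= 0.5) |>
  slice_min(time)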
We can view the grouped life table from the `survfit` object using either the `summary()` function, or the `tidy()` function from the broom package.
summary(honking_survfit)
#> Call: survfit(formula = Surv(seconds, 1 - censor) ~ 1, data = honking)
#>
#> time n.risk n.event survival std.err lower 95% CI upper 95% CI
#> 1.41 57 1 0.9825 0.0174 0.94896 1.000
#> 1.51 55 1 0.9646 0.0246 0.91758 1.000
#> 1.67 54 1 0.9467 0.0299 0.88985 1.000
#> 1.68 53 1 0.9289 0.0343 0.86405 0.999
#> 1.86 52 1 0.9110 0.0380 0.83950 0.989
#> 2.12 51 1 0.8931 0.0412 0.81587 0.978
#> 2.19 50 1 0.8753 0.0441 0.79296 0.966
#> 2.48 48 1 0.8570 0.0468 0.77004 0.954
#> 2.50 47 1 0.8388 0.0492 0.74765 0.941
#> 2.53 46 1 0.8206 0.0514 0.72572 0.928
#> 2.54 45 1 0.8023 0.0534 0.70418 0.914
#> 2.56 44 1 0.7841 0.0552 0.68299 0.900
#> 2.62 43 1 0.7659 0.0569 0.66212 0.886
#> 2.68 42 1 0.7476 0.0584 0.64154 0.871
#> 2.83 39 1 0.7285 0.0599 0.61996 0.856
#> 2.88 38 1 0.7093 0.0614 0.59868 0.840
#> 2.89 37 1 0.6901 0.0626 0.57769 0.824
#> 2.92 36 1 0.6710 0.0637 0.55695 0.808
#> 2.98 35 1 0.6518 0.0647 0.53648 0.792
#> 3.14 33 1 0.6320 0.0657 0.51549 0.775
#> 3.17 32 1 0.6123 0.0666 0.49477 0.758
#> 3.21 31 1 0.5925 0.0673 0.47429 0.740
#> 3.22 30 1 0.5728 0.0679 0.45405 0.723
#> 3.24 29 1 0.5530 0.0684 0.43404 0.705
#> 3.56 27 1 0.5325 0.0688 0.41338 0.686
#> 3.57 26 1 0.5121 0.0692 0.39297 0.667
#> 3.58 25 1 0.4916 0.0694 0.37282 0.648
#> 3.78 24 1 0.4711 0.0694 0.35291 0.629
#> 4.10 22 1 0.4497 0.0695 0.33217 0.609
#> 4.18 21 1 0.4283 0.0694 0.31172 0.588
#> 4.44 19 1 0.4057 0.0693 0.29028 0.567
#> 4.51 18 1 0.3832 0.0690 0.26919 0.545
#> 4.52 17 1 0.3606 0.0686 0.24847 0.523
#> 4.96 14 1 0.3349 0.0683 0.22451 0.500
#> 5.39 12 1 0.3070 0.0681 0.19875 0.474
#> 5.73 11 1 0.2791 0.0674 0.17386 0.448
#> 6.03 9 1 0.2481 0.0666 0.14651 0.420
#> 6.30 7 1 0.2126 0.0659 0.11585 0.390
#> 7.20 5 1 0.1701 0.0650 0.08044 0.360
#> 9.59 4 1 0.1276 0.0611 0.04991 0.326
#> 12.29 3 1 0.0851 0.0535 0.02478 0.292
#> 13.18 2 1 0.0425 0.0403 0.00665 0.272
Note that these two methods return different standard errors: The `summary()` function returns the standard error for the survival function; however, the `tidy()` function returns the standard error for the cumulative hazard (see Section 13.4), so we need to transform it to get the standard error for the survival function, as we check below.
tidy(honking_survfit)
#> # A tibble: 56 × 8
#> time n.risk n.event n.censor estimate std.error conf.high conf.low
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1.41 57 1 1 0.982 0.0177 1 0.949
#> 2 1.51 55 1 0 0.965 0.0255 1 0.918
#> 3 1.67 54 1 0 0.947 0.0316 1 0.890
#> 4 1.68 53 1 0 0.929 0.0369 0.999 0.864
#> 5 1.86 52 1 0 0.911 0.0417 0.989 0.840
#> 6 2.12 51 1 0 0.893 0.0462 0.978 0.816
#> 7 2.19 50 1 0 0.875 0.0504 0.966 0.793
#> 8 2.36 49 0 1 0.875 0.0504 0.966 0.793
#> 9 2.48 48 1 0 0.857 0.0546 0.954 0.770
#> 10 2.5 47 1 0 0.839 0.0587 0.941 0.748
#> # ℹ 46 more rows
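A minimal check of this transformation (ours, not from the text): multiplying `tidy()`'s cumulative-hazard standard errors by the survival estimates recovers the survival-scale standard errors reported by `summary()`.

# Rescale std.error to the survival scale; filtering to event rows makes the
# result line up with summary()'s default output.
honking_survfit |>
  tidy() |>
  filter(n.event > 0) |>
  mutate(std.error.surv = estimate * std.error) |>
  select(time, estimate, std.error, std.error.surv)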
They also return differing numbers of rows: By default, the `summary()` function returns rows only for intervals where an event occurred (this can be changed with the `censored` argument); whereas the `tidy()` function always returns rows for both uncensored and censored intervals. However, because survival estimates do not change during censored intervals, an easy way to match the result of `tidy()` with `summary()` is to group the data by estimate values, then summarize.
honking_lifetable_km <- honking_survfit |>
survfit0() |>
tidy() |>
group_by(pick(estimate:conf.low)) |>
summarise(
time_start = first(time),
n.risk = first(n.risk),
across(c(n.event, n.censor), sum),
.groups = "drop"
) |>
arrange(time_start) |>
mutate(
std.error = estimate * std.error,
interval = 1:n() - 1,
time_end = lead(time_start, default = Inf),
time_width = time_end - time_start,
prob.estimate = n.event / n.risk,
# Note that although there is no Kaplan-Meier-type hazard estimate, this
# statistic forms the basis of other descriptive methods discussed below.
# The values depicted here differ slightly from the text because Singer and
# Willett (2003) rounded the probability estimates before estimating the
# hazard.
haz.estimate = prob.estimate / time_width
) |>
select(
interval,
time_start,
time_end,
n.risk:n.censor,
prob.estimate,
surv.estimate = estimate,
std.error,
time_width,
haz.estimate
)
# Table 13.3, page 484:
honking_lifetable_km |>
gt() |>
fmt_number(
columns = c(prob.estimate, surv.estimate, std.error, haz.estimate),
decimals = 4
)
| interval | time_start | time_end | n.risk | n.event | n.censor | prob.estimate | surv.estimate | std.error | time_width | haz.estimate |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00 | 1.41 | 57 | 0 | 0 | 0.0000 | 1.0000 | 0.0000 | 1.41 | 0.0000 |
| 1 | 1.41 | 1.51 | 57 | 1 | 1 | 0.0175 | 0.9825 | 0.0174 | 0.10 | 0.1754 |
| 2 | 1.51 | 1.67 | 55 | 1 | 0 | 0.0182 | 0.9646 | 0.0246 | 0.16 | 0.1136 |
| 3 | 1.67 | 1.68 | 54 | 1 | 0 | 0.0185 | 0.9467 | 0.0299 | 0.01 | 1.8519 |
| 4 | 1.68 | 1.86 | 53 | 1 | 0 | 0.0189 | 0.9289 | 0.0343 | 0.18 | 0.1048 |
| 5 | 1.86 | 2.12 | 52 | 1 | 0 | 0.0192 | 0.9110 | 0.0380 | 0.26 | 0.0740 |
| 6 | 2.12 | 2.19 | 51 | 1 | 0 | 0.0196 | 0.8931 | 0.0412 | 0.07 | 0.2801 |
| 7 | 2.19 | 2.48 | 50 | 1 | 1 | 0.0200 | 0.8753 | 0.0441 | 0.29 | 0.0690 |
| 8 | 2.48 | 2.50 | 48 | 1 | 0 | 0.0208 | 0.8570 | 0.0468 | 0.02 | 1.0417 |
| 9 | 2.50 | 2.53 | 47 | 1 | 0 | 0.0213 | 0.8388 | 0.0492 | 0.03 | 0.7092 |
| 10 | 2.53 | 2.54 | 46 | 1 | 0 | 0.0217 | 0.8206 | 0.0514 | 0.01 | 2.1739 |
| 11 | 2.54 | 2.56 | 45 | 1 | 0 | 0.0222 | 0.8023 | 0.0534 | 0.02 | 1.1111 |
| 12 | 2.56 | 2.62 | 44 | 1 | 0 | 0.0227 | 0.7841 | 0.0552 | 0.06 | 0.3788 |
| 13 | 2.62 | 2.68 | 43 | 1 | 0 | 0.0233 | 0.7659 | 0.0569 | 0.06 | 0.3876 |
| 14 | 2.68 | 2.83 | 42 | 1 | 2 | 0.0238 | 0.7476 | 0.0584 | 0.15 | 0.1587 |
| 15 | 2.83 | 2.88 | 39 | 1 | 0 | 0.0256 | 0.7285 | 0.0599 | 0.05 | 0.5128 |
| 16 | 2.88 | 2.89 | 38 | 1 | 0 | 0.0263 | 0.7093 | 0.0614 | 0.01 | 2.6316 |
| 17 | 2.89 | 2.92 | 37 | 1 | 0 | 0.0270 | 0.6901 | 0.0626 | 0.03 | 0.9009 |
| 18 | 2.92 | 2.98 | 36 | 1 | 0 | 0.0278 | 0.6710 | 0.0637 | 0.06 | 0.4630 |
| 19 | 2.98 | 3.14 | 35 | 1 | 1 | 0.0286 | 0.6518 | 0.0647 | 0.16 | 0.1786 |
| 20 | 3.14 | 3.17 | 33 | 1 | 0 | 0.0303 | 0.6320 | 0.0657 | 0.03 | 1.0101 |
| 21 | 3.17 | 3.21 | 32 | 1 | 0 | 0.0312 | 0.6123 | 0.0666 | 0.04 | 0.7812 |
| 22 | 3.21 | 3.22 | 31 | 1 | 0 | 0.0323 | 0.5925 | 0.0673 | 0.01 | 3.2258 |
| 23 | 3.22 | 3.24 | 30 | 1 | 0 | 0.0333 | 0.5728 | 0.0679 | 0.02 | 1.6667 |
| 24 | 3.24 | 3.56 | 29 | 1 | 1 | 0.0345 | 0.5530 | 0.0684 | 0.32 | 0.1078 |
| 25 | 3.56 | 3.57 | 27 | 1 | 0 | 0.0370 | 0.5325 | 0.0688 | 0.01 | 3.7037 |
| 26 | 3.57 | 3.58 | 26 | 1 | 0 | 0.0385 | 0.5121 | 0.0692 | 0.01 | 3.8462 |
| 27 | 3.58 | 3.78 | 25 | 1 | 0 | 0.0400 | 0.4916 | 0.0694 | 0.20 | 0.2000 |
| 28 | 3.78 | 4.10 | 24 | 1 | 1 | 0.0417 | 0.4711 | 0.0694 | 0.32 | 0.1302 |
| 29 | 4.10 | 4.18 | 22 | 1 | 0 | 0.0455 | 0.4497 | 0.0695 | 0.08 | 0.5682 |
| 30 | 4.18 | 4.44 | 21 | 1 | 1 | 0.0476 | 0.4283 | 0.0694 | 0.26 | 0.1832 |
| 31 | 4.44 | 4.51 | 19 | 1 | 0 | 0.0526 | 0.4057 | 0.0693 | 0.07 | 0.7519 |
| 32 | 4.51 | 4.52 | 18 | 1 | 0 | 0.0556 | 0.3832 | 0.0690 | 0.01 | 5.5556 |
| 33 | 4.52 | 4.96 | 17 | 1 | 2 | 0.0588 | 0.3606 | 0.0686 | 0.44 | 0.1337 |
| 34 | 4.96 | 5.39 | 14 | 1 | 1 | 0.0714 | 0.3349 | 0.0683 | 0.43 | 0.1661 |
| 35 | 5.39 | 5.73 | 12 | 1 | 0 | 0.0833 | 0.3070 | 0.0681 | 0.34 | 0.2451 |
| 36 | 5.73 | 6.03 | 11 | 1 | 1 | 0.0909 | 0.2791 | 0.0674 | 0.30 | 0.3030 |
| 37 | 6.03 | 6.30 | 9 | 1 | 1 | 0.1111 | 0.2481 | 0.0666 | 0.27 | 0.4115 |
| 38 | 6.30 | 7.20 | 7 | 1 | 1 | 0.1429 | 0.2126 | 0.0659 | 0.90 | 0.1587 |
| 39 | 7.20 | 9.59 | 5 | 1 | 0 | 0.2000 | 0.1701 | 0.0650 | 2.39 | 0.0837 |
| 40 | 9.59 | 12.29 | 4 | 1 | 0 | 0.2500 | 0.1276 | 0.0611 | 2.70 | 0.0926 |
| 41 | 12.29 | 13.18 | 3 | 1 | 0 | 0.3333 | 0.0851 | 0.0535 | 0.89 | 0.3745 |
| 42 | 13.18 | Inf | 2 | 1 | 1 | 0.5000 | 0.0425 | 0.0403 | Inf | 0.0000 |
Finally, we can plot the Kaplan-Meier estimates of the continuous-time survivor function, and compare them with the discrete-time and actuarial estimates. Note that for plotting, we prefer the grouped life table with uncensored and censored intervals. This is because, if the largest event time is censored (as it is here), it allows us to extend the last interval of the step function out to that largest censored value rather than going to infinity. Therefore, here we re-tidy the `honking_survfit` object rather than using the `honking_lifetable_km` grouped life table.
honking_surv.km <- honking_survfit |>
survfit0() |>
tidy() |>
rename(seconds = time, surv.estimate = estimate) |>
mutate(
method = factor(
"Kaplan-Meier", levels = c("Kaplan-Meier", "discrete-time", "actuarial")
)
) |>
ggplot(aes(x = seconds)) +
geom_step(
aes(y = surv.estimate, linetype = method), direction = "hv"
) +
guides(linetype = "none") +
coord_cartesian(xlim = c(0, 20))
honking_survs <- honking_surv.km +
geom_line(
aes(y = surv.discrete, linetype = "discrete-time"),
data = honking_plot$data,
alpha = .33
) +
geom_step(
aes(y = surv.actuarial, linetype = "actuarial"),
data = honking_plot$data,
direction = "vh",
alpha = .33
) +
scale_linetype_manual(values = 1:3) +
guides(linetype = guide_legend())
# Figure 13.2, page 485:
honking_surv.km + honking_survs +
plot_layout(ncol = 1, guides = "collect", axes = "collect")
13.4 The cumulative hazard function
In Section 13.4 Singer and Willett (2003) introduce the cumulative hazard function, which is the integral of the hazard function between the integration limits of time 0 and time $t_j$ (i.e., the "accumulation" of the hazard over time), given by:

$$H(t_{ij}) = \int_{0}^{t_j} h(t_i) \, dt.$$
There are two simple ways to estimate the cumulative hazard function:
- The Nelson-Aalen method accumulates the estimated total amount of hazard during all instants in each interval, given by the product of the $j$th Kaplan-Meier-type hazard estimate and its interval's width (we check this by hand below):

  $$\hat{H}(t_j) = \sum_{i \leq j} \hat{h}(t_i) \, \text{width}_i.$$
- The negative log survivor function method exploits a well-known mathematical relationship between the cumulative hazard and survivor functions, which is that the population cumulative hazard function is identical to the negative log of the population survivor function:

  $$H(t_{ij}) = -\log S(t_{ij}).$$
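Because each Kaplan-Meier-type hazard estimate is just the conditional probability n.event / n.risk divided by its interval's width, the Nelson-Aalen estimate reduces to a running sum of those conditional probabilities. As a by-hand check of our own (the survfit object stores its Nelson-Aalen estimate in the cumhaz element, which the code further below also uses):

# Accumulating n.event / n.risk reproduces survfit's Nelson-Aalen estimate.
honking_survfit |>
  tidy() |>
  mutate(
    cumhaz.by_hand = cumsum(n.event / n.risk),
    cumhaz.survfit = honking_survfit$cumhaz
  ) |>
  select(time, cumhaz.by_hand, cumhaz.survfit)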
Here we will use both of these methods to plot cumulative hazard functions from the `honking_survfit` object. Notice that the estimates are most similar during early event times, when the risk set is large, diverging as the size of the risk set decreases.
honking_lifetable_km_2 <- honking_survfit |>
survfit0() |>
tidy() |>
mutate(
time_end = lead(time, default = Inf),
width = time_end - time,
prob.estimate = n.event / n.risk,
haz.estimate = prob.estimate / width,
cumhaz.nelson_aalen = survfit0(honking_survfit)[["cumhaz"]],
cumhaz.neglogsurv = -log(estimate)
) |>
rename(seconds = time, surv.estimate = estimate)
# Figure 13.4, page 493:
ggplot(honking_lifetable_km_2, aes(x = seconds)) +
geom_step(
aes(y = cumhaz.neglogsurv, linetype = "Negative log"),
direction = "vh"
) +
geom_step(
aes(y = cumhaz.nelson_aalen, linetype = "Nelson-Aalen"),
direction = "vh"
) +
labs(
y = "cumhaz.estimate",
linetype = "method"
)
When examining plots of the cumulative hazard function, Singer and Willett (2003) suggest studying how the rate of increase in the cumulative hazard function changes over time to deduce the shape of the underlying hazard function:
- A linear rate of increase suggests a constant hazard (illustrated below).
- An accelerating rate of increase suggests an increasing hazard.
- A decelerating rate of increase suggests a decreasing hazard.
- An alternating rate of increase (acceleration followed by deceleration, or vice versa) suggests that the hazard function has reached either a peak or a trough.
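As a point of reference for the first of these heuristics, we can simulate data where the hazard is known to be constant (an illustration of ours, not from the text): event times drawn from an exponential distribution should produce an estimated cumulative hazard function that rises at a roughly linear rate.

# Simulate 500 uncensored exponential event times (constant hazard of 0.2);
# the negative log survivor function should be approximately a straight line.
set.seed(13)
survfit(Surv(rexp(500, rate = 0.2), rep(1, 500)) ~ 1) |>
  tidy() |>
  # Drop the final time point, where the survival estimate reaches zero.
  filter(estimate > 0) |>
  ggplot(aes(x = time, y = -log(estimate))) +
  geom_step()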
13.5 Kernel-smoothed estimates of the hazard function
In Section 13.5 Singer and Willett (2003) discuss how to visualize the underlying shape of the continuous-time hazard function using kernel-smoothed estimates, which estimate the “average” population value of hazard at many focal points in time by aggregating together all the point estimates in their temporal vicinity.
The `muhaz()` function from the muhaz package can be used to estimate the kernel-smoothed hazard function from right-censored data. For details on how the `muhaz()` function works, consult the function documentation.
Here we will estimate and plot kernel-smoothed hazard functions for the `honking` data with bandwidths of 1, 2, and 3. Notice that as the bandwidth increases, the shape of the function becomes smoother. As Singer and Willett (2003) discuss, this suggests a need to explore a range of bandwidths to strike a balance between smooth and precise realizations of the underlying shape of the hazard function.
honking_smoothhaz <- map_df(
set_names(1:3),
\(.bandwidth) {
# muhaz() estimates the hazard function from right-censored data using
# kernel-based methods, using the vector of survival and event times.
kernel_smoothed_hazard <- muhaz(
honking$seconds,
1 - honking$censor,
# Narrow the temporal region the smoothed function describes, given the
# bandwidth and the minimum and maximum observed event times.
min.time = min(honking$seconds[honking$censor == 0]) + .bandwidth,
max.time = max(honking$seconds[honking$censor == 0]) - .bandwidth,
bw.grid = .bandwidth,
bw.method = "global",
b.cor = "none",
kern = "epanechnikov"
)
kernel_smoothed_hazard |>
tidy() |>
rename(smoothhaz.estimate = estimate)
},
.id = "bandwidth"
)
# Figure 13.5, page 496:
ggplot(honking_smoothhaz, aes(x = time, y = smoothhaz.estimate)) +
geom_line() +
scale_x_continuous(limits = c(0, 20)) +
facet_wrap(vars(bandwidth), ncol = 1, labeller = label_both)
13.6 Developing an intuition about continuous-time survivor, cumulative hazard, and kernel-smoothed hazard functions
In Section 13.6 Singer and Willett (2003) examine and describe the survival functions, cumulative hazard functions, and kernel-smoothed hazard functions from four studies that differ by their type of target event, metric for clocking time, and underlying profile of risk:
- `alcohol_relapse`: A person-level data frame with 89 rows and 3 columns containing a subset of data from Cooney and colleagues (1991), who measured weeks to first "heavy drinking" day in a sample of 89 recently treated alcoholics. Individuals were followed for up to two years (around 104.286 weeks) or until they relapsed.

alcohol_relapse
#> # A tibble: 89 × 3
#>    id    weeks censor
#>    <fct> <dbl>  <dbl>
#>  1 1     0.714      0
#>  2 2     0.714      0
#>  3 3     1.14       0
#>  4 4     1.43       0
#>  5 5     1.71       0
#>  6 6     1.71       0
#>  7 7     2.14       0
#>  8 8     2.71       0
#>  9 9     3.86       0
#> 10 10    4.14       0
#> # ℹ 79 more rows
- `judges`: A person-level data frame with 109 rows and 7 columns containing data from Zorn and Van Winkle (2000) measuring how long all 107 justices appointed to the U.S. Supreme Court between 1789 and 1980 remained in their position.

judges
#> # A tibble: 109 × 7
#>    id    tenure  dead retired left_appointment appointment_age appointment_year
#>    <fct>  <dbl> <dbl>   <dbl>            <dbl>           <dbl>            <dbl>
#>  1 1          1     0       1                1              50             1789
#>  2 2          6     0       1                1              57             1789
#>  3 3          6     0       1                1              44             1789
#>  4 4          9     1       0                1              47             1789
#>  5 5         21     1       0                1              57             1789
#>  6 6          9     1       0                1              38             1790
#>  7 7          1     0       1                1              59             1791
#>  8 8         13     1       0                1              47             1793
#>  9 9          4     0       1                1              51             1796
#> 10 10        15     1       0                1              55             1796
#> # ℹ 99 more rows
- `first_depression_2`: A person-level data frame with 2974 rows and 3 columns containing data from Sorenson, Rutter, and Aneshensel (1991), who measured age of first depressive episode in a sample of 2974 adults. Age of first depressive episode was measured by asking respondents whether and, if so, at what age they first experienced a depressive episode.

first_depression_2
#> # A tibble: 2,974 × 3
#>    id      age censor
#>    <fct> <dbl>  <dbl>
#>  1 1         4      0
#>  2 2         6      0
#>  3 3         8      0
#>  4 4         8      0
#>  5 5         9      0
#>  6 6         9      0
#>  7 7        10      0
#>  8 8        10      0
#>  9 9        11      0
#> 10 10       12      0
#> # ℹ 2,964 more rows
- `health_workers`: A person-level data frame with 2074 rows and 3 columns containing a subset of data from Singer and colleagues (1998), who measured length of employment in a sample of 2074 health care workers hired by community and migrant health centres. Health care workers were followed for up to 33 months or until termination of employment.

health_workers
#> # A tibble: 2,074 × 3
#>    id    weeks censor
#>    <fct> <dbl>  <dbl>
#>  1 1      0.14      0
#>  2 2      0.42      1
#>  3 3      1         0
#>  4 4      1.28      1
#>  5 5      1.28      1
#>  6 6      1.71      1
#>  7 7      1.85      1
#>  8 8      1.85      1
#>  9 9      1.85      1
#> 10 10     2         1
#> # ℹ 2,064 more rows
First we will estimate survival and cumulative hazard functions for each of the studies.
alcohol_relapse_survfit <- survfit(
Surv(weeks, 1 - censor) ~ 1, data = alcohol_relapse
)
judges_survfit <- survfit(
Surv(tenure, dead) ~ 1, data = judges
)
first_depression_survfit <- survfit(
Surv(age, 1 - censor) ~ 1, data = first_depression_2
)
health_workers_survfit <- survfit(
Surv(weeks, 1 - censor) ~ 1, data = health_workers
)
study_survfit <- list(
alcohol_relapse = alcohol_relapse_survfit,
judges = judges_survfit,
first_depression = first_depression_survfit,
health_workers = health_workers_survfit
)
study_lifetables <- study_survfit |>
map(
\(.survfit) {
.survfit |>
survfit0() |>
tidy() |>
mutate(cumhaz.estimate = -log(estimate)) |>
select(time, surv.estimate = estimate, cumhaz.estimate)
}
)
study_lifetables
#> $alcohol_relapse
#> # A tibble: 65 × 3
#> time surv.estimate cumhaz.estimate
#> <dbl> <dbl> <dbl>
#> 1 0 1 0
#> 2 0.714 0.978 0.0227
#> 3 1.14 0.966 0.0343
#> 4 1.43 0.955 0.0460
#> 5 1.71 0.933 0.0698
#> 6 2.14 0.921 0.0819
#> 7 2.71 0.910 0.0942
#> 8 3.86 0.899 0.107
#> 9 4.14 0.888 0.119
#> 10 4.57 0.876 0.132
#> # ℹ 55 more rows
#>
#> $judges
#> # A tibble: 34 × 3
#> time surv.estimate cumhaz.estimate
#> <dbl> <dbl> <dbl>
#> 1 0 1 0
#> 2 1 1 0
#> 3 2 0.981 0.0190
#> 4 3 0.981 0.0190
#> 5 4 0.972 0.0288
#> 6 5 0.932 0.0700
#> 7 6 0.902 0.103
#> 8 7 0.870 0.140
#> 9 8 0.859 0.152
#> 10 9 0.825 0.192
#> # ℹ 24 more rows
#>
#> $first_depression
#> # A tibble: 85 × 3
#> time surv.estimate cumhaz.estimate
#> <dbl> <dbl> <dbl>
#> 1 0 1 0
#> 2 4 1.00 0.000336
#> 3 6 0.999 0.000673
#> 4 8 0.999 0.00135
#> 5 9 0.998 0.00202
#> 6 10 0.997 0.00269
#> 7 11 0.997 0.00303
#> 8 12 0.996 0.00371
#> 9 13 0.994 0.00607
#> 10 14 0.993 0.00742
#> # ℹ 75 more rows
#>
#> $health_workers
#> # A tibble: 649 × 3
#> time surv.estimate cumhaz.estimate
#> <dbl> <dbl> <dbl>
#> 1 0 1 0
#> 2 0.14 1.00 0.000482
#> 3 0.42 1.00 0.000482
#> 4 1 0.999 0.000965
#> 5 1.28 0.999 0.000965
#> 6 1.71 0.999 0.000965
#> 7 1.85 0.999 0.000965
#> 8 2 0.999 0.000965
#> 9 2.14 0.999 0.000965
#> 10 2.28 0.999 0.000965
#> # ℹ 639 more rows
Second, we will estimate the kernel-smoothed hazard functions.
study_smoothhaz <- pmap(
list(
list(
alcohol_relapse = alcohol_relapse$weeks,
judges = judges$tenure,
first_depression = first_depression_2$age,
health_workers = health_workers$weeks
),
list(
1 - alcohol_relapse$censor,
judges$dead,
1 - first_depression_2$censor,
1 - health_workers$censor
),
list(12, 5, 7, 7)
),
\(survival_time, event, bandwidth) {
kernel_smoothed_hazard <- muhaz(
survival_time,
event,
# Subset to observed (uncensored) event times: events are coded 1.
min.time = min(survival_time[event == 1]) + bandwidth,
max.time = max(survival_time[event == 1]) - bandwidth,
bw.grid = bandwidth,
bw.method = "global",
b.cor = "none",
kern = "epanechnikov"
)
kernel_smoothed_hazard |>
tidy() |>
rename(smoothhaz.estimate = estimate)
}
)
study_smoothhaz
#> $alcohol_relapse
#> # A tibble: 101 × 2
#> time smoothhaz.estimate
#> <dbl> <dbl>
#> 1 12.7 0.0343
#> 2 13.5 0.0344
#> 3 14.2 0.0344
#> 4 15.0 0.0344
#> 5 15.8 0.0340
#> 6 16.5 0.0335
#> 7 17.3 0.0329
#> 8 18.0 0.0320
#> 9 18.8 0.0309
#> 10 19.6 0.0295
#> # ℹ 91 more rows
#>
#> $judges
#> # A tibble: 101 × 2
#> time smoothhaz.estimate
#> <dbl> <dbl>
#> 1 7 0.0242
#> 2 7.22 0.0241
#> 3 7.44 0.0240
#> 4 7.66 0.0237
#> 5 7.88 0.0234
#> 6 8.1 0.0232
#> 7 8.32 0.0232
#> 8 8.54 0.0230
#> 9 8.76 0.0228
#> 10 8.98 0.0223
#> # ℹ 91 more rows
#>
#> $first_depression
#> # A tibble: 101 × 2
#> time smoothhaz.estimate
#> <dbl> <dbl>
#> 1 11 0.00113
#> 2 11.4 0.00122
#> 3 11.7 0.00131
#> 4 12.1 0.00139
#> 5 12.4 0.00149
#> 6 12.8 0.00158
#> 7 13.2 0.00167
#> 8 13.5 0.00176
#> 9 13.9 0.00184
#> 10 14.2 0.00193
#> # ℹ 91 more rows
#>
#> $health_workers
#> # A tibble: 101 × 2
#> time smoothhaz.estimate
#> <dbl> <dbl>
#> 1 7.14 0.00213
#> 2 8.38 0.00240
#> 3 9.63 0.00255
#> 4 10.9 0.00262
#> 5 12.1 0.00262
#> 6 13.4 0.00261
#> 7 14.6 0.00258
#> 8 15.9 0.00247
#> 9 17.1 0.00236
#> 10 18.3 0.00230
#> # ℹ 91 more rows
Third, we will join the survival and cumulative hazard function data frames with the kernel-smoothed hazard function data frames.
study_functions <- map2(
study_lifetables, study_smoothhaz,
\(.x, .y) {
.x |>
full_join(.y) |>
arrange(time)
}
)
study_functions
#> $alcohol_relapse
#> # A tibble: 165 × 4
#> time surv.estimate cumhaz.estimate smoothhaz.estimate
#> <dbl> <dbl> <dbl> <dbl>
#> 1 0 1 0 NA
#> 2 0.714 0.978 0.0227 NA
#> 3 1.14 0.966 0.0343 NA
#> 4 1.43 0.955 0.0460 NA
#> 5 1.71 0.933 0.0698 NA
#> 6 2.14 0.921 0.0819 NA
#> 7 2.71 0.910 0.0942 NA
#> 8 3.86 0.899 0.107 NA
#> 9 4.14 0.888 0.119 NA
#> 10 4.57 0.876 0.132 NA
#> # ℹ 155 more rows
#>
#> $judges
#> # A tibble: 132 × 4
#> time surv.estimate cumhaz.estimate smoothhaz.estimate
#> <dbl> <dbl> <dbl> <dbl>
#> 1 0 1 0 NA
#> 2 1 1 0 NA
#> 3 2 0.981 0.0190 NA
#> 4 3 0.981 0.0190 NA
#> 5 4 0.972 0.0288 NA
#> 6 5 0.932 0.0700 NA
#> 7 6 0.902 0.103 NA
#> 8 7 0.870 0.140 0.0242
#> 9 7.22 NA NA 0.0241
#> 10 7.44 NA NA 0.0240
#> # ℹ 122 more rows
#>
#> $first_depression
#> # A tibble: 181 × 4
#> time surv.estimate cumhaz.estimate smoothhaz.estimate
#> <dbl> <dbl> <dbl> <dbl>
#> 1 0 1 0 NA
#> 2 4 1.00 0.000336 NA
#> 3 6 0.999 0.000673 NA
#> 4 8 0.999 0.00135 NA
#> 5 9 0.998 0.00202 NA
#> 6 10 0.997 0.00269 NA
#> 7 11 0.997 0.00303 0.00113
#> 8 11.4 NA NA 0.00122
#> 9 11.7 NA NA 0.00131
#> 10 12 0.996 0.00371 NA
#> # ℹ 171 more rows
#>
#> $health_workers
#> # A tibble: 749 × 4
#> time surv.estimate cumhaz.estimate smoothhaz.estimate
#> <dbl> <dbl> <dbl> <dbl>
#> 1 0 1 0 NA
#> 2 0.14 1.00 0.000482 NA
#> 3 0.42 1.00 0.000482 NA
#> 4 1 0.999 0.000965 NA
#> 5 1.28 0.999 0.000965 NA
#> 6 1.71 0.999 0.000965 NA
#> 7 1.85 0.999 0.000965 NA
#> 8 2 0.999 0.000965 NA
#> 9 2.14 0.999 0.000965 NA
#> 10 2.28 0.999 0.000965 NA
#> # ℹ 739 more rows
Finally, we can plot the functions for each of the studies.
study_plots <- pmap(
list(
study_functions,
list(c(0, 100), c(0, 35), c(0, 100), c(0, 130)),
list("weeks", "years", "age", "weeks")
),
\(.study, .xlim, .time) {
study_surv <- .study |>
select(time, surv.estimate) |>
na.omit() |>
ggplot(aes(x = time, y = surv.estimate)) +
geom_step() +
coord_cartesian(xlim = .xlim, ylim = c(0, 1)) +
labs(x = .time)
study_cumhaz <- .study |>
select(time, cumhaz.estimate) |>
na.omit() |>
ggplot(aes(x = time, y = cumhaz.estimate)) +
geom_step() +
coord_cartesian(xlim = .xlim) +
labs(x = .time)
study_smoothhaz <- .study |>
select(time, smoothhaz.estimate) |>
na.omit() |>
ggplot(aes(x = time, y = smoothhaz.estimate)) +
geom_line() +
coord_cartesian(xlim = .xlim) +
labs(x = .time)
study_surv / study_cumhaz / study_smoothhaz +
plot_layout(axes = "collect")
}
)
First, the `alcohol_relapse` data.
# Figure 13.6, page 499:
study_plots$alcohol_relapse
Second, the `judges` data.
# Figure 13.6, page 499:
study_plots$judges
Third, the `first_depression` data.
# Figure 13.6, page 499:
study_plots$first_depression
Finally, the `health_workers` data.
# Figure 13.6, page 499:
study_plots$health_workers