
12.1 Alternative specifications for the “main effect” of time

In Section 12.1 Singer and Willett (2003) discuss alternative functional forms for the shape of the discrete-time hazard function using data from Gamse and Conger (1997), who measured the number of years to receiving tenure in a sample of 260 semifinalists and fellowship recipients in the National Academy of Education–Spencer Foundation Post-Doctoral Fellowship Program who took an academic job after earning a doctorate. Academics were followed for up to nine years or until they received tenure.

For this example we use the tenure data set, a person-level data frame with 260 rows and 3 columns:

  • id: Participant ID.
  • years: Number of years to receiving tenure.
  • censor: Censoring status.
tenure
#> # A tibble: 260 × 3
#>    id    years censor
#>    <fct> <dbl>  <dbl>
#>  1 111       5      0
#>  2 211       6      0
#>  3 311       8      1
#>  4 411       9      0
#>  5 511       3      0
#>  6 611       6      0
#>  7 711       5      0
#>  8 811       5      1
#>  9 911       7      0
#> 10 1011      8      0
#> # ℹ 250 more rows

When introducing the basic discrete-time hazard model in Chapter 11, Singer and Willett (2003) suggested beginning with a model that placed no constraints on the shape of the hazard function over time (i.e., a completely general functional form). This resulted in a model that was easily interpretable, informative about the temporal shape of hazard, and consistent with life table estimates. However, use of a completely general functional form for time is not an integral feature of the model—any time predictor can be treated as a continuous variable, allowing any number of alternative functional forms for the shape of the discrete-time hazard function.

As Singer and Willett (2003) discuss, specification for time in the discrete-time hazard model should be motivated by a combination of theory, previous research, and exploratory analysis, and serious consideration of alternative functional forms should be given when:

  • A study involves many discrete time periods.
  • Hazard is expected to be near zero in some time periods.
  • Some time periods have small risk sets.

As in choosing the functional form of an individual growth model, the specification for time in the discrete-time hazard model can take on any number of functional forms, including (but not limited to):

  • Discontinuities (Section 6.1).
  • Transformations in the ladder of powers (Section 6.2).
  • Polynomial forms (Section 6.3).
  • Truly nonlinear trajectories (Section 6.4).

Following Singer and Willett (2003), here we will explore the use of an ordered set of polynomials as the specification for time in the discrete-time hazard model—with a focus on choosing a “final” polynomial trajectory. Before fitting these models, we first need to convert the person-level tenure data set to a person-period data set.

tenure_pp <- tenure |>
  group_by(id) |>
  reframe(
    year = 1:max(years),
    event = if_else(year == years & censor == 0, 1, 0)
  )

tenure_pp
#> # A tibble: 1,474 × 3
#>    id     year event
#>    <fct> <int> <dbl>
#>  1 111       1     0
#>  2 111       2     0
#>  3 111       3     0
#>  4 111       4     0
#>  5 111       5     1
#>  6 211       1     0
#>  7 211       2     0
#>  8 211       3     0
#>  9 211       4     0
#> 10 211       5     0
#> # ℹ 1,464 more rows

Here we will fit seven discrete-time hazard models to the tenure_pp person-period data set: six models of increasing polynomial order (starting from a zero-order polynomial), and a model with a completely general functional form for the “main effect” of time.

tenure_fit_constant  <- glm(event ~ 1, family = "binomial", data = tenure_pp)
tenure_fit_linear    <- update(tenure_fit_constant, . ~ year)
tenure_fit_quadratic <- update(tenure_fit_linear, . ~ . + I(year^2))
tenure_fit_cubic     <- update(tenure_fit_quadratic, . ~ . + I(year^3))
tenure_fit_order_4   <- update(tenure_fit_cubic, . ~ . + I(year^4))
tenure_fit_order_5   <- update(tenure_fit_order_4, . ~ . + I(year^5))
tenure_fit_general   <- update(tenure_fit_constant, . ~ 0 + factor(year))

tenure_fits <- list(
  "Constant" = tenure_fit_constant,
  "Linear" = tenure_fit_linear,
  "Quadratic" = tenure_fit_quadratic,
  "Cubic" = tenure_fit_cubic,
  "Fourth order" = tenure_fit_order_4,
  "Fifth order" = tenure_fit_order_5,
  "General" = tenure_fit_general
)

Comparing alternative specifications for the “main effect” of time

To select a “final” model among these alternatives, Singer and Willett (2003) recommend comparing the fit of competing specifications with the following approach:

  • Compare the models graphically by plotting their fitted logit hazard functions together. Models based on alternative specifications for time should be compared against the completely general specification, identifying which (if any) of these models can parsimoniously reproduce the general functional form.
  • Compare deviance statistics across models of increasing temporal complexity, identifying the most parsimonious alternative specifications for time.
  • Compare deviance statistics for each alternative specification to the completely general specification, identifying which of the most parsimonious alternative specifications for time fit better than the completely general specification.
  • Examine AIC and BIC statistics to supplement the comparisons of deviance statistics.

Working through this process, if you identify an alternative specification for time that fits nearly as well as the completely general one, appreciably better than all simpler ones, and no worse than all more complex ones, consider selecting it as your “final” model. However, if none of the alternative specifications for time meet these criteria, consider retaining the completely general specification as your “final” model.

We begin by plotting the fitted logit hazard functions together. Following Singer and Willett (2003), we omit the fitted functions for the fourth order and fifth order polynomials because their deviance statistics suggest we cannot distinguish them from the cubic polynomial.

tenure_preds <- tenure_fits |>
  discard_at(c("Fourth order", "Fifth order")) |>
  map(\(.fit) augment(.fit, newdata = tibble(year = 1:9))) |>
  list_rbind(names_to = "time") |>
  mutate(
    time = factor(
      time,
      levels = names(discard_at(tenure_fits, c("Fourth order", "Fifth order")))
    )
  )

# Figure 12.1, page 414 (top):
ggplot(tenure_preds, aes(x = year, y = .fitted, colour = time)) +
  geom_line() +
  scale_color_brewer(palette = "Dark2") +
  scale_x_continuous(breaks = 1:9)

Next we compare goodness-of-fit statistics of the alternative and completely general models.

# Compare models --------------------------------------------------------------
tenure_anovas_poly <- with(
  tenure_fits[1:6],
  do.call(anova, c(map(names(tenure_fits[1:6]), as.name), test = "LRT"))
)

tenure_anovas_general <- map(
  tenure_fits[1:6],
  \(.fit) anova(.fit, tenure_fit_general, test = "LRT")
)

# Make table ------------------------------------------------------------------
tenure_anovas_poly_tidy <- tenure_anovas_poly |>
  tidy() |>
  select(
    deviance = residual.deviance,
    previous_model = deviance,
    p.previous = p.value
  ) |>
  mutate(time = names(tenure_fits[1:6]), .before = deviance)

tenure_anovas_general_tidy <- tenure_anovas_general |>
  map(tidy) |>
  list_rbind(names_to = "time") |>
  na.omit() |>
  select(time, general_model = deviance, p.general = p.value)

# Table 12.2, page 413:
tenure_anovas_poly_tidy |>
  left_join(tenure_anovas_general_tidy) |>
  add_row(time = "General", deviance = deviance(tenure_fit_general)) |>
  mutate(
    AIC = map_dbl(tenure_fits, AIC),
    BIC = map_dbl(tenure_fits, BIC),
    across(where(is.numeric), \(.x) round(.x, digits = 2))
  )
#> # A tibble: 7 × 8
#>   time    deviance previous_model p.previous general_model p.general   AIC   BIC
#>   <chr>      <dbl>          <dbl>      <dbl>         <dbl>     <dbl> <dbl> <dbl>
#> 1 Consta…    1038.          NA         NA           206.        0    1040. 1045.
#> 2 Linear      867.         170.         0            36.3       0     871.  882.
#> 3 Quadra…     836.          31.2        0             5.1       0.53  842.  858.
#> 4 Cubic       833.           3.13       0.08          1.97      0.85  841.  862.
#> 5 Fourth…     833.           0.43       0.51          1.54      0.82  843.  869.
#> 6 Fifth …     833.           0.01       0.92          1.53      0.68  845.  877.
#> 7 General     831.          NA         NA            NA        NA     849.  897.

Following the approach recommended by Singer and Willett (2003), we would select the quadratic polynomial model as our “final” model. We can see that this model has a similar but more parsimonious functional form by comparing the fitted hazard and survival functions of the quadratic polynomial model to the completely general model.

First we will get the fitted hazard and survival functions for both models.

tenure_preds_final <- tenure_preds |>
  filter(time %in% c("Quadratic", "General")) |>
  group_by(time) |>
  mutate(
    haz.fitted = 1 / (1 + exp(-.fitted)),
    surv.fitted = cumprod(1 - haz.fitted)
  ) |>
  group_modify(
    \(.group, ...) add_row(.group, year = 0, surv.fitted = 1, .before = 1)
  )

tenure_preds_final
#> # A tibble: 20 × 5
#> # Groups:   time [2]
#>    time       year .fitted haz.fitted surv.fitted
#>    <fct>     <dbl>   <dbl>      <dbl>       <dbl>
#>  1 Quadratic     0  NA       NA             1    
#>  2 Quadratic     1  -5.86     0.00286       0.997
#>  3 Quadratic     2  -4.37     0.0125        0.985
#>  4 Quadratic     3  -3.13     0.0419        0.943
#>  5 Quadratic     4  -2.15     0.105         0.845
#>  6 Quadratic     5  -1.41     0.196         0.679
#>  7 Quadratic     6  -0.927    0.283         0.486
#>  8 Quadratic     7  -0.695    0.333         0.325
#>  9 Quadratic     8  -0.714    0.329         0.218
#> 10 Quadratic     9  -0.984    0.272         0.159
#> 11 General       0  NA       NA             1    
#> 12 General       1  -4.86     0.00769       0.992
#> 13 General       2  -4.13     0.0158        0.977
#> 14 General       3  -3.38     0.0328        0.945
#> 15 General       4  -2.28     0.0925        0.857
#> 16 General       5  -1.47     0.187         0.697
#> 17 General       6  -0.693    0.333         0.465
#> 18 General       7  -0.785    0.313         0.319
#> 19 General       8  -0.633    0.347         0.208
#> 20 General       9  -1.34     0.208         0.165

Then we can plot them.

tenure_fits_final_haz <- tenure_preds_final |>
  ggplot(aes(x = year, y = haz.fitted, colour = time)) +
    geom_line() +
    scale_colour_manual(values = pal_brewer(palette = "Dark2")(5)[c(3, 5)]) +
    scale_x_continuous(breaks = 0:9)

tenure_fits_final_surv <- tenure_fits_final_haz + aes(y = surv.fitted)

# Figure 12.1, page 414 (bottom):
tenure_fits_final_haz + tenure_fits_final_surv +
  plot_layout(guides = "collect")

12.2 Using the complementary log-log link to specify the discrete-time hazard model

In Section 12.2 Singer and Willett (2003) discuss the complementary log-log (aka clog-log) transformation as an alternative link function for the discrete-time hazard model using a subset of data from Capaldi, Crosby, and Stoolmiller (1996), who measured the grade year of first sexual intercourse in a sample of 180 at-risk heterosexual adolescent males. Adolescent males were followed from Grade 7 up to Grade 12 or until they reported having had sexual intercourse for the first time.

For this example we return to the first_sex data set introduced in Chapter 10, a person-level data frame with 180 rows and 5 columns:

  • id: Participant ID.
  • grade: Grade year of first sexual intercourse.
  • censor: Censoring status.
  • parental_transition: Binary indicator for whether the adolescent experienced a parental transition (where their parents separated or repartnered).
  • parental_antisociality: Composite score across four indicators measuring parents’ level of antisocial behaviour during the child’s formative years.
first_sex
#> # A tibble: 180 × 5
#>    id    grade censor parental_transition parental_antisociality
#>    <fct> <dbl>  <dbl>               <dbl>                  <dbl>
#>  1 1         9      0                   0                  1.98 
#>  2 2        12      1                   1                 -0.545
#>  3 3        12      1                   0                 -1.40 
#>  4 5        12      0                   1                  0.974
#>  5 6        11      0                   0                 -0.636
#>  6 7         9      0                   1                 -0.243
#>  7 9        12      1                   0                 -0.869
#>  8 10       11      0                   0                  0.454
#>  9 11       12      1                   1                  0.802
#> 10 12       11      0                   1                 -0.746
#> # ℹ 170 more rows

As usual, we begin by converting the person-level first_sex data set to a person-period data set to facilitate subsequent exploratory analysis and model fitting.

first_sex_pp <- first_sex |>
  rename(grades = grade) |>
  group_by(id) |>
  reframe(
    grade = 7:max(grades),
    event = if_else(grade == grades & censor == 0, 1, 0),
    parental_transition,
    parental_antisociality
  )

first_sex_pp
#> # A tibble: 822 × 5
#>    id    grade event parental_transition parental_antisociality
#>    <fct> <int> <dbl>               <dbl>                  <dbl>
#>  1 1         7     0                   0                  1.98 
#>  2 1         8     0                   0                  1.98 
#>  3 1         9     1                   0                  1.98 
#>  4 2         7     0                   1                 -0.545
#>  5 2         8     0                   1                 -0.545
#>  6 2         9     0                   1                 -0.545
#>  7 2        10     0                   1                 -0.545
#>  8 2        11     0                   1                 -0.545
#>  9 2        12     0                   1                 -0.545
#> 10 3         7     0                   0                 -1.40 
#> # ℹ 812 more rows

Next, we will explore sample hazard functions for the first_sex data using life table methods on the logit and complementary log-log scales. Following from the Chapter 11 examples for the first_sex data, our research question again centres on modelling the relationship between hazard and the time-invariant parental_transition predictor—therefore, we will create a life table stratified by values of parental_transition.

first_sex_lifetable <- first_sex_pp |>
  group_by(parental_transition, grade) |>
  summarise(
    n.risk = n(),
    n.event = sum(event == 1),
    n.censor = sum(event == 0),
    haz.estimate = n.event / n.risk
  )

first_sex_lifetable
#> # A tibble: 12 × 6
#> # Groups:   parental_transition [2]
#>    parental_transition grade n.risk n.event n.censor haz.estimate
#>                  <dbl> <int>  <int>   <int>    <int>        <dbl>
#>  1                   0     7     72       2       70       0.0278
#>  2                   0     8     70       2       68       0.0286
#>  3                   0     9     68       8       60       0.118 
#>  4                   0    10     60       8       52       0.133 
#>  5                   0    11     52      10       42       0.192 
#>  6                   0    12     42       8       34       0.190 
#>  7                   1     7    108      13       95       0.120 
#>  8                   1     8     95       5       90       0.0526
#>  9                   1     9     90      16       74       0.178 
#> 10                   1    10     74      21       53       0.284 
#> 11                   1    11     53      15       38       0.283 
#> 12                   1    12     38      18       20       0.474

We can define functions for each of these transformations to convert the estimated hazard probabilities in the first_sex life table to different scales.

odds <- function(x) x / (1 - x)
log_odds <- function(x) log(odds(x))
cloglog <- function(x) log(-log(1 - x))

first_sex_lifetable <- first_sex_lifetable |>
  mutate(
    logit.estimate = log_odds(haz.estimate),
    cloglog.estimate = cloglog(haz.estimate)
  )

first_sex_lifetable
#> # A tibble: 12 × 8
#> # Groups:   parental_transition [2]
#>    parental_transition grade n.risk n.event n.censor haz.estimate logit.estimate
#>                  <dbl> <int>  <int>   <int>    <int>        <dbl>          <dbl>
#>  1                   0     7     72       2       70       0.0278         -3.56 
#>  2                   0     8     70       2       68       0.0286         -3.53 
#>  3                   0     9     68       8       60       0.118          -2.01 
#>  4                   0    10     60       8       52       0.133          -1.87 
#>  5                   0    11     52      10       42       0.192          -1.44 
#>  6                   0    12     42       8       34       0.190          -1.45 
#>  7                   1     7    108      13       95       0.120          -1.99 
#>  8                   1     8     95       5       90       0.0526         -2.89 
#>  9                   1     9     90      16       74       0.178          -1.53 
#> 10                   1    10     74      21       53       0.284          -0.926
#> 11                   1    11     53      15       38       0.283          -0.930
#> 12                   1    12     38      18       20       0.474          -0.105
#> # ℹ 1 more variable: cloglog.estimate <dbl>

As Singer and Willett (2003) discuss, like the logit transformation, the complementary log-log transformation maps probabilities onto a scale with no upper or lower bounds. However, they differ in that:

  • The logit transformation yields the logarithm of the odds of event occurrence.
  • The complementary log-log transformation yields the logarithm of the negated logarithm of the probability of event nonoccurrence.
  • The logit transformation is symmetric on either side of the “median” hazard probability of 0.5, whereas the complementary log-log transformation is asymmetric.

The practical consequence of these differences is that the two transformations provide similar estimates when hazard is small, but consistently diverge as hazard increases, which we can see by plotting sample hazard functions for the first_sex data on both scales.
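
To see this divergence numerically, we can apply the log_odds() and cloglog() functions defined above to a small grid of illustrative hazard probabilities. The two scales nearly coincide when hazard is below about .10, then grow steadily apart as hazard approaches .5.

tibble(haz = c(0.01, 0.05, 0.1, 0.2, 0.3, 0.5)) |>
  mutate(logit = log_odds(haz), cloglog = cloglog(haz))
#> # A tibble: 6 × 3
#>     haz  logit cloglog
#>   <dbl>  <dbl>   <dbl>
#> 1  0.01 -4.60   -4.60 
#> 2  0.05 -2.94   -2.97 
#> 3  0.1  -2.20   -2.25 
#> 4  0.2  -1.39   -1.50 
#> 5  0.3  -0.847  -1.03 
#> 6  0.5   0      -0.367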

# Figure 12.3, page 423:
first_sex_lifetable |>
  select(parental_transition, grade, logit.estimate, cloglog.estimate) |>
  pivot_longer(
    cols = c(logit.estimate, cloglog.estimate),
    names_to = "transform",
    names_pattern = "(.*)\\.estimate",
    values_to = "haz.estimate"
  ) |>
  mutate(parental_transition = as.logical(parental_transition)) |>
  ggplot(aes(x = grade, y = haz.estimate)) +
    geom_line(aes(colour = parental_transition, linetype = transform)) +
    scale_colour_brewer(palette = "Dark2")

Fitting the discrete-time hazard model using a complementary log-log link proceeds in the same way as when a logit link is used, invoking the same assumptions about the behaviour of complementary log-log hazard as we would about logit hazard. However, they differ in their proportionality assumptions:

  • The discrete-time hazard model with logit link is a proportional odds model whose antilogged slope parameter estimates are odds ratios, which we assume are proportional over time across all possible values of the predictor (or combinations of predictors).
  • The discrete-time hazard model with complementary log-log link is a proportional hazards model whose antilogged slope parameter estimates are hazard ratios, which we assume are proportional over time across all possible values of the predictor (or combinations of predictors).

The significance of this difference is that the proportional hazards assumption invoked by the complementary log-log link yields a discrete-time hazard model that is directly analogous to the continuous-time hazard model. Given this, Singer and Willett (2003) recommend the following when choosing a link function for the discrete-time hazard model:

  • When data are collected in truly discrete time, the complementary log-log link has no particular advantage over the logit link. Parameter estimates and goodness-of-fit statistics tend to be similar when fitting identical models with alternate link functions, and the primary difference between these two specifications is the effect size metric for the antilogged slope parameter estimates (odds ratios or hazard ratios).
  • When the underlying metric for time is truly continuous—but the data are collected in discrete time due to measurement difficulties—the complementary log-log link is preferable to the logit link. This is the case when analyzing interval-censored data, wherein events unfold in continuous time, but information about event occurrence is restricted to discrete-time intervals.

Here we will fit two discrete-time hazard models to the first_sex_pp person-period data set: one with a complementary log-log link, and another with a logit link.

# Define inverse transformation functions -------------------------------------
log_odds_inv <- function(x) 1 / (1 + exp(-x))
cloglog_inv <- function(x) 1 - exp(-exp(x))

# Fit models ------------------------------------------------------------------
first_sex_fits <- map(
  list(cloglog = "cloglog", logit = "logit"),
  \(.link) {
    glm(
      event ~ -1 + factor(grade) + parental_transition,
      family = binomial(link = .link),
      data = first_sex_pp
    )
  }
)

# Make table ------------------------------------------------------------------

# Table 12.3, page 424:
first_sex_fits |>
  map(tidy) |>
  list_cbind() |>
  unnest_wider(col = c(cloglog, logit), names_sep = ".") |>
  select(term = cloglog.term, cloglog.estimate, logit.estimate) |>
  mutate(
    # Compute the fitted hazards before renaming terms, so that the raw
    # coefficient name "parental_transition" can be excluded from the
    # calculation.
    cloglog.haz = if_else(
      term != "parental_transition", cloglog_inv(cloglog.estimate), NA
    ),
    logit.haz = if_else(
      term != "parental_transition", log_odds_inv(logit.estimate), NA
    ),
    term = coef_rename(term, factor_name = FALSE)
  ) |>
  add_row(
    term = "Deviance",
    cloglog.estimate = deviance(first_sex_fits$cloglog),
    logit.estimate = deviance(first_sex_fits$logit)
  )
#> # A tibble: 8 × 5
#>   term                cloglog.estimate logit.estimate cloglog.haz logit.haz
#>   <chr>                          <dbl>          <dbl>       <dbl>     <dbl>
#> 1 Grade 7                       -2.97          -2.99       0.0498    0.0477
#> 2 Grade 8                       -3.66          -3.70       0.0254    0.0241
#> 3 Grade 9                       -2.32          -2.28       0.0940    0.0927
#> 4 Grade 10                      -1.90          -1.82       0.139     0.139 
#> 5 Grade 11                      -1.76          -1.65       0.158     0.161 
#> 6 Grade 12                      -1.34          -1.18       0.230     0.235 
#> 7 Parental Transition            0.785          0.874     NA        NA
#> 8 Deviance                     635.           635.        NA        NA
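
Consistent with the table, antilogging each model's parental_transition coefficient returns the effect size in the corresponding metric: a hazard ratio for the complementary log-log model and an odds ratio for the logit model.

# Hazard ratio for parental_transition (about 2.19, given the rounded
# estimate in the table above):
exp(coef(first_sex_fits$cloglog)["parental_transition"])

# Odds ratio for parental_transition (about 2.40):
exp(coef(first_sex_fits$logit)["parental_transition"])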

12.3 Time-varying predictors

In Section 12.3 Singer and Willett (2003) demonstrate how to fit the discrete-time hazard model for data with time-varying predictors using a subset of data from Wheaton, Rozell, and Hall (1997), who measured the relation between age of first depressive episode and several childhood and adult traumatic stressors in a random sample of 1393 adults living in metropolitan Toronto, Ontario. Age of first depressive episode and traumatic stressors were determined through a structured interview.

For this example we use the first_depression_1 data set, a person-period data frame with 36997 rows and 7 columns:

  • id: Participant ID.
  • age: Age each record corresponds to.
  • event: Binary indicator for whether and when the adult experienced a depressive episode.
  • censor: Censoring status.
  • parental_divorce: Binary indicator for whether the adult’s parents divorced at this or any previous age.
  • female: Binary indicator for whether the adult is a female.
  • siblings: Number of siblings.
first_depression_1
#> # A tibble: 36,997 × 7
#>    id      age event censor parental_divorce female siblings
#>    <fct> <dbl> <dbl>  <dbl>            <dbl>  <dbl>    <dbl>
#>  1 1         4     0      1                0      1        6
#>  2 1         5     0      1                0      1        6
#>  3 1         6     0      1                0      1        6
#>  4 1         7     0      1                0      1        6
#>  5 1         8     0      1                0      1        6
#>  6 1         9     0      1                0      1        6
#>  7 1        10     0      1                0      1        6
#>  8 1        11     0      1                0      1        6
#>  9 1        12     0      1                0      1        6
#> 10 1        13     0      1                0      1        6
#> # ℹ 36,987 more rows

No special strategies are needed to fit the discrete-time hazard model with time-varying predictors. However, as Singer and Willett (2003) discuss, the inclusion of time-varying predictors in a model requires restatements of the three assumptions inherent to basic discrete-time hazard models presented in Chapter 11. We now assume that:

  • There is a postulated value of logit hazard for each value of the time-varying predictor (or for each combination of predictor values for models with more than one predictor) at each point in time. Therefore, the effect of a time-varying predictor compares different groups of people at different times, rather than static groups of people at different times.
  • Joining consecutive postulated values of logit hazard will yield hazard functions that have an identical shape for constant values of the time-varying predictor(s). Similar to time-varying predictors for the multilevel model for change (Section 5.3), the most extreme lower and upper constant values of the time-varying predictor(s) can be thought of as an envelope representing the complete set of population logit hazard functions for all possible temporal patterns of the time-varying predictor.
  • The distance between each of these logit hazard functions is identical in every time period. Therefore, the effect of a time-varying predictor is identical over time.

To clarify these assumptions, here we will fit and plot a model with a cubic polynomial form for age and the time-varying predictor parental_divorce to the first_depression_1 data, along with sample (logit) hazard functions from the life table.

# Make life table -------------------------------------------------------------
first_depression_lifetable <- first_depression_1 |>
  mutate(parental_divorce = as.logical(parental_divorce)) |>
  group_by(parental_divorce, age) |>
  summarise(
    n.risk = n(),
    n.event = sum(event == 1),
    n.censor = sum(event == 0),
    haz.estimate = n.event / n.risk,
    haz.estimate = if_else(haz.estimate == 0, NA, haz.estimate),
    log_odds.estimate = log_odds(haz.estimate),
    .groups = "drop_last"
  )

# Fit model -------------------------------------------------------------------
first_depression_fit_1 <- glm(
  event ~ poly(I(age - 18), 3, raw = TRUE) + parental_divorce,
  family = binomial(link = "logit"),
  data = first_depression_1
)

# When a predictor enters the model as part of a matrix of covariates, such as
# with stats::poly(), it is represented in augment() as a matrix column. A simple
# workaround to get the predictor on its original scale as a vector is to pass
# the original data to augment().
first_depression_preds_1 <- first_depression_fit_1 |>
  augment(data = first_depression_1, type.predict = "response") |>
  rename(haz.estimate = .fitted) |>
  mutate(
    log_odds.estimate = log_odds(haz.estimate),
    parental_divorce = as.logical(parental_divorce)
  )

# Plot ------------------------------------------------------------------------
first_depression_fit_1_haz <- ggplot() +
  aes(x = age, y = haz.estimate, colour = parental_divorce) +
  geom_point(data = first_depression_lifetable) +
  geom_line(data = first_depression_preds_1) +
  scale_colour_brewer(palette = "Dark2") +
  scale_x_continuous(breaks = seq(0, 40, by = 5)) +
  coord_cartesian(xlim = c(0, 40), ylim = c(0, 0.06))

first_depression_fit_1_logit <- first_depression_fit_1_haz +
  aes(y = log_odds.estimate) + 
  scale_y_continuous(breaks = seq(-8, -2, by = 1)) +
  coord_cartesian(xlim = c(0, 40), ylim = c(-8, -2))

# Figure 12.4, page 432:
first_depression_fit_1_haz / first_depression_fit_1_logit +
  plot_layout(guides = "collect", axes = "collect")

Plotting the effects of time-varying predictors

When plotting fitted hazard and survivor functions for prototypical individuals, Singer and Willett (2003) suggest the following strategies for selecting prototypical values for time-varying predictors:

  • For time-varying predictors whose values can remain constant over time, choose time-invariant values that represent an envelope of all possible temporal patterns of the time-varying predictor.
  • For time-varying predictors whose values cannot remain constant over time, choose time-varying values that reflect substantively interesting temporal patterns.

Here we will demonstrate the first of these strategies using an updated version of the first_depression_fit_1 model that includes a predictor for sex.

# Fit model -------------------------------------------------------------------
first_depression_fit_2 <- update(first_depression_fit_1, . ~ . + female)

# Make predictions ------------------------------------------------------------
first_depression_preds_2 <- first_depression_fit_2 |>
  augment(
    newdata = crossing(
      age = 4:39, parental_divorce = c(0, 1), female = c(0, 1)
    ),
    type.predict = "response"
  ) |>
  rename(haz.estimate = .fitted) |>
  group_by(parental_divorce, female) |>
  mutate(
    parental_divorce = as.logical(parental_divorce),
    female = as.logical(female),
    surv.estimate = cumprod(1 - haz.estimate)
  ) |>
  group_modify(
    \(.group, ...) add_row(.group, age = 3, surv.estimate = 1, .before = 1)
  )

# Plot ------------------------------------------------------------------------
first_depression_fit_2_haz <- first_depression_preds_2 |>
  ggplot(aes(x = age, y = haz.estimate)) + 
    geom_line(aes(linetype = parental_divorce, colour = female)) +
    scale_colour_brewer(palette = "Dark2") +
    scale_x_continuous(breaks = seq(0, 40, by = 5)) +
    coord_cartesian(xlim = c(0, 40), ylim = c(0, 0.04))

first_depression_fit_2_surv <- first_depression_fit_2_haz +
  aes(y = surv.estimate) +
  coord_cartesian(xlim = c(0, 40), ylim = c(0, 1))

# Figure 12.5, page 437:
first_depression_fit_2_haz / first_depression_fit_2_surv +
  plot_layout(guides = "collect", axes = "collect")
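
Although we only demonstrate the first strategy here, the second follows the same workflow. As a minimal sketch, fitted hazard for a hypothetical male adult whose parents divorced when he was 10 can be obtained by letting the time-varying parental_divorce predictor switch from 0 to 1 at that age in the prediction data.

first_depression_fit_2 |>
  augment(
    newdata = tibble(
      age = 4:39,
      # A substantively interesting temporal pattern: parental divorce at
      # age 10.
      parental_divorce = if_else(age < 10, 0, 1),
      female = 0
    ),
    type.predict = "response"
  ) |>
  rename(haz.estimate = .fitted)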

12.4 The linear additivity assumption: Uncovering violations and simple solutions

In Section 12.4 Singer and Willett (2003) introduce practical strategies for diagnosing and correcting violations of the discrete-time hazard model’s linear additivity assumption, which stipulates that unit differences in the value of a (time-invariant or time-varying) predictor correspond to fixed differences in logit hazard. In other words, we postulate that a predictor’s effect is:

  • Additive: It does not depend upon the values of other predictors in the model (i.e., there are no interactions between substantive predictors).
  • Linear: It does not depend upon the position of the unit difference along its scale (i.e., there are no nonlinear effects).

Interactions between substantive predictors

Singer and Willett (2003) illustrate strategies for identifying possible interactions between substantive predictors using data from Keiley and Martin (2002), who measured the effect of child abuse on the risk of first juvenile arrest in a sample of 1553 adolescents aged 8 to 18. Adolescents were followed up to age 18 or until they were arrested.

For this example we use the first_arrest data set, a person-period data frame with 15834 rows and 6 columns:

  • id: Participant ID.
  • age: Age each record corresponds to.
  • event: Binary indicator for whether the adolescent was arrested.
  • censor: Censoring status.
  • abused: Binary indicator for whether the adolescent was abused.
  • black: Binary indicator for whether the adolescent is black.
first_arrest
#> # A tibble: 15,834 × 6
#>    id      age event censor abused black
#>    <fct> <dbl> <dbl>  <dbl>  <dbl> <dbl>
#>  1 1         8     0      0      0     0
#>  2 1         9     0      0      0     0
#>  3 1        10     0      0      0     0
#>  4 1        11     0      0      0     0
#>  5 1        12     0      0      0     0
#>  6 1        13     0      0      0     0
#>  7 1        14     0      0      0     0
#>  8 1        15     1      0      0     0
#>  9 2         8     0      1      1     0
#> 10 2         9     0      1      1     0
#> # ℹ 15,824 more rows

Singer and Willett (2003) recommend using two strategies to identify possible interactions between substantive predictors:

  • Examine plots of sample (logit) hazard functions for evidence that the magnitude of the differential in hazard varies across groups.
  • Compare goodness-of-fit statistics of nested models that include and exclude an interaction term.

To demonstrate the first strategy, we begin by constructing a life table for the first_arrest data, which we will use to explore the possibility of an interaction between the abused and black predictors.

first_arrest_lifetable <- first_arrest |>
  mutate(across(c(abused, black), as.logical)) |>
  group_by(abused, black, age) |>
  summarise(
    n.risk = n(),
    n.event = sum(event == 1),
    n.censor = sum(event == 0),
    haz.estimate = n.event / n.risk,
    haz.estimate = if_else(haz.estimate == 0, NA, haz.estimate),
    log_odds.estimate = log_odds(haz.estimate),
    .groups = "drop_last"
  ) |>
  na.omit()

first_arrest_lifetable
#> # A tibble: 39 × 8
#> # Groups:   abused, black [4]
#>    abused black   age n.risk n.event n.censor haz.estimate log_odds.estimate
#>    <lgl>  <lgl> <dbl>  <int>   <int>    <int>        <dbl>             <dbl>
#>  1 FALSE  FALSE     8    434       1      433      0.00230             -6.07
#>  2 FALSE  FALSE    10    433       2      431      0.00462             -5.37
#>  3 FALSE  FALSE    11    431       5      426      0.0116              -4.45
#>  4 FALSE  FALSE    13    426       7      419      0.0164              -4.09
#>  5 FALSE  FALSE    14    419       9      410      0.0215              -3.82
#>  6 FALSE  FALSE    15    410      19      391      0.0463              -3.02
#>  7 FALSE  FALSE    16    391      10      381      0.0256              -3.64
#>  8 FALSE  FALSE    17    381      10      371      0.0262              -3.61
#>  9 FALSE  FALSE    18    368       3      365      0.00815             -4.80
#> 10 FALSE  TRUE      9    232       1      231      0.00431             -5.44
#> # ℹ 29 more rows

Now we can plot the sample (logit) hazard functions. Notice that the magnitude of the differential in (logit) hazard associated with abuse is visibly larger for black adolescents compared to white adolescents.

# Figure 12.6, page 445 (top):
first_arrest_lifetable |>
  ggplot(aes(x = age, y = log_odds.estimate, colour = black, linetype = abused)) +
  geom_line() +
  scale_colour_brewer(palette = "Dark2") +
  scale_x_continuous(breaks = 7:19) +
  coord_cartesian(xlim = c(7, 19), ylim = c(-7, -2)) +
  facet_wrap(vars(black), labeller = label_both) +
  ggtitle("Sample (logit) hazard functions") + 
  theme(strip.background = element_blank(), strip.text.x = element_blank())

We can test for this interaction by comparing goodness-of-fit statistics of nested models that include and exclude an interaction term. Equivalently, as we demonstrate here, we can also examine a sequential analysis of deviance table for the model that includes an interaction term, rather than fitting two models (so long as the interaction is the final term in the model formula).

first_arrest_fit <- glm(
  event ~ 0 + factor(age) + abused + black + abused:black,
  family = binomial(link = "logit"),
  data = first_arrest
)

anova(first_arrest_fit, test = "LRT")
#> Analysis of Deviance Table
#> 
#> Model: binomial, link: logit
#> 
#> Response: event
#> 
#> Terms added sequentially (first to last)
#> 
#> 
#>              Df Deviance Resid. Df Resid. Dev  Pr(>Chi)    
#> NULL                         15834    21950.6              
#> factor(age)  11  18850.0     15823     3100.6 < 2.2e-16 ***
#> abused        1     22.0     15822     3078.6 2.774e-06 ***
#> black         1     25.2     15821     3053.4 5.287e-07 ***
#> abused:black  1      4.0     15820     3049.4   0.04421 *  
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

As usual, we can interpret interaction effects by examining combined parameter estimates or fitted hazard (and survivor) functions for prototypical individuals.

First the combined parameter estimates. A simple way to obtain these estimates is to use the type.predict = "terms" argument in the augment() function, which returns a matrix giving the value of each coefficient multiplied by the prototypical values. Summing each row of the matrix and antilogging the totals results in an odds ratio for each prototypical individual.

first_arrest_fit |>
  augment(
    newdata = crossing(age = NA, abused = c(0, 1), black = c(0, 1)),
    type.predict = "terms"
  ) |>
  mutate(.fitted = exp(rowSums(.fitted, na.rm = TRUE))) |>
  select(abused, black, odds_ratio = .fitted)
#> # A tibble: 4 × 3
#>   abused black odds_ratio
#>    <dbl> <dbl>      <dbl>
#> 1      0     0       1   
#> 2      0     1       1.28
#> 3      1     0       1.43
#> 4      1     1       2.96

Then plots of the fitted (logit) hazard functions.

first_arrest_preds <- first_arrest_fit |>
  augment(
    newdata = crossing(age = 8:18, abused = c(0, 1), black = c(0, 1)),
    type.predict = "link"
  ) |>
  rename(log_odds.estimate = .fitted) |>
  mutate(across(c(abused, black), as.logical))

# Figure 12.6, page 445 (bottom):
first_arrest_preds |>
  ggplot(aes(x = age, y = log_odds.estimate, colour = black, linetype = abused)) +
  geom_line() +
  scale_colour_brewer(palette = "Dark2") +
  scale_x_continuous(breaks = 7:19) +
  coord_cartesian(xlim = c(7, 19), ylim = c(-7, -2)) +
  ggtitle("Fitted (logit) hazard functions")

Nonlinear effects

Singer and Willett (2003) suggest two general strategies for exploring the linearity assumption and identifying an appropriate nonlinear functional form for any predictors:

  • Transform predictors so that relationships become linear on the transformed scale, but nonlinear on the raw scale (a la Section 6.2).
  • Categorize each continuous predictor into a small number of groups represented by dummy variables, then examine the pattern of parameter estimates for consecutive dummy variables to deduce the appropriate (nonlinear) functional form. Equally spaced parameter estimates suggest a linear effect, and unequally spaced parameter estimates suggest a nonlinear effect.

We return to the first_depression_fit_2 model from Section 12.3 to demonstrate the second strategy, examining the effect of a third predictor, siblings. Here we will fit four discrete-time hazard models to the first_depression_1 data: a model that adds siblings in its raw form (Model A), a model that categorizes siblings into a system of dummy variables (Model B), a model that splits siblings into two groups based on the pattern of parameter estimates in Model B (Model C), and a model that uses a cubic specification for siblings (Model D).

# Fit models ------------------------------------------------------------------
first_depression_fit_A <- update(first_depression_fit_2, . ~ . + siblings)
first_depression_fit_B <- update(
  first_depression_fit_2, 
  . ~ . +
    between(siblings, 1, 2) +
    between(siblings, 3, 4) +
    between(siblings, 5, 6) +
    between(siblings, 7, 8) +
    between(siblings, 9, Inf)
)
first_depression_fit_C <- update(first_depression_fit_2, . ~ . + I(siblings >= 5))
first_depression_fit_D <- update(
  first_depression_fit_2, . ~ . + poly(siblings, 3, raw = TRUE)
)

first_depression_fits <- list(
  "Model A" = first_depression_fit_A,
  "Model B" = first_depression_fit_B,
  "Model C" = first_depression_fit_C,
  "Model D" = first_depression_fit_D
)

# Make table ------------------------------------------------------------------
options(modelsummary_get = "all")

# Table 12.4, page 449:
first_depression_fits |>
  modelsummary(
    fmt = 4,
    coef_rename = coef_rename,
    gof_map = c("deviance", "AIC"),
    output = "gt"
  )
                                  Model A    Model B    Model C    Model D
(Intercept)                       -4.3587    -4.5001    -4.4828    -4.3943
                                  (0.1216)   (0.2067)   (0.1087)   (0.1513)
Poly((Age - 18), 3, Raw = TRUE)1   0.0611     0.0615     0.0614     0.0613
                                  (0.0117)   (0.0117)   (0.0117)   (0.0117)
Poly((Age - 18), 3, Raw = TRUE)2  -0.0073    -0.0073    -0.0073    -0.0073
                                  (0.0012)   (0.0012)   (0.0012)   (0.0012)
Poly((Age - 18), 3, Raw = TRUE)3   0.0002     0.0002     0.0002     0.0002
                                  (0.0001)   (0.0001)   (0.0001)   (0.0001)
Parental Divorce                   0.3726     0.3727     0.3710     0.3731
                                  (0.1624)   (0.1625)   (0.1623)   (0.1624)
Female                             0.5587     0.5596     0.5581     0.5655
                                  (0.1095)   (0.1095)   (0.1095)   (0.1097)
Siblings                          -0.0814
                                  (0.0223)
Between(Siblings, 1, 2)TRUE                   0.0209
                                             (0.1976)
Between(Siblings, 3, 4)TRUE                   0.0108
                                             (0.2100)
Between(Siblings, 5, 6)TRUE                  -0.4942
                                             (0.2545)
Between(Siblings, 7, 8)TRUE                  -0.7754
                                             (0.3437)
Between(Siblings, 9, Inf)TRUE                -0.6585
                                             (0.3440)
(Siblings >= 5)TRUE                                     -0.6108
                                                        (0.1446)
Poly(Siblings, 3, Raw = TRUE)1                                     -0.0375
                                                                   (0.0782)
Poly(Siblings, 3, Raw = TRUE)2                                     -0.0109
                                                                   (0.0117)
Poly(Siblings, 3, Raw = TRUE)3                                      0.0005
                                                                   (0.0004)
Deviance                          4124.29    4117.98    4118.78    4120.62
AIC                               4138.3     4140.0     4132.8     4138.6

Notice that the parameter estimates for the five siblings dummy variables in Model B are not equidistant, suggesting that Model A may be misleading due to the number of siblings having a nonlinear effect on risk of depression. Model C attempts to capture this nonlinear effect by splitting siblings into two broad groups, based on the observation that small- to mid-size families appear to have negligible influence on risk of depression, whereas large families have much lower risk; and Model D attempts to capture this nonlinear effect using a cubic polynomial.
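
We can make this concrete by differencing the consecutive dummy-variable estimates from Model B (taking 0 as the implicit estimate for the reference group with no siblings). Equally spaced estimates would yield roughly constant differences, which is clearly not the case here.

diff(c(0, 0.0209, 0.0108, -0.4942, -0.7754, -0.6585))
#> [1]  0.0209 -0.0101 -0.5050 -0.2812  0.1169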

As Singer and Willett (2003) conclude, for the first_depression_1 data, the nonlinear effect of siblings was jagged and disjunctive rather than smooth and continuous; therefore, the categorical specification in Model C fit much better than the continuous specification in Model D. However, they caution against routinely categorizing continuous variables to capture nonlinear effects. As usual, it is preferable to keep continuous predictors continuous whenever possible, categorizing continuous variables only as a last resort.

12.5 The proportionality assumption: Uncovering violations and simple solutions

In Section 12.5 Singer and Willett (2003) discuss strategies for investigating and relaxing the discrete-time hazard model’s proportionality assumption using data from Graham (1997), who measured the relation between mathematics course-taking and gender identity in a sample of 3790 tenth grade high school students. Students were followed for up to 5 terms (eleventh grade, twelfth grade, and the first three semesters of college) or until they stopped enrolling in mathematics courses.

For this example we use the math_dropout data set, a person-period data frame with 9558 rows and 5 columns:

  • id: Participant ID.
  • term: Academic term each record corresponds to.
  • event: Binary indicator for whether the student stopped enrolling in mathematics courses at a given term.
  • censor: Censoring status.
  • woman: Binary indicator for whether the student identified as a woman.
math_dropout
#> # A tibble: 9,558 × 5
#>    id      term event censor woman
#>    <fct>  <dbl> <dbl>  <dbl> <dbl>
#>  1 201303     1     0      0     1
#>  2 201303     2     1      0     1
#>  3 201304     1     0      0     1
#>  4 201304     2     0      0     1
#>  5 201304     3     1      0     1
#>  6 201305     1     0      1     1
#>  7 201305     2     0      1     1
#>  8 201311     1     0      1     0
#>  9 201311     2     0      1     0
#> 10 201311     3     0      1     0
#> # ℹ 9,548 more rows

When introducing the basic discrete-time hazard model in Chapter 11, and extending it in Section 12.2, Singer and Willett (2003) stipulated discrete-time hazard models that assumed the effect of each predictor was constant over time, resulting in logit or complementary log-log hazard functions that have an identical distance between them in every time period, and proportional odds or hazard ratios. However, there are at least three ways in which a predictor’s effect may vary over time:

  • It may increase over time.
  • It may decrease over time.
  • It may be particularly pronounced in some time periods, differing erratically over time.

As Singer and Willett (2003) discuss, the existence of time-dependent effects suggests the need for a nonproportional odds model (or nonproportional hazards model) that includes interactions between predictors and time.

Here we will fit three models to the math_dropout data: a proportional odds model that adds woman as a predictor (Model A); a nonproportional odds model that adds a completely general interaction between woman and term, allowing the gender differential to differ in each term (Model B); and a nonproportional odds model that adds a linear interaction between woman and term - 1, allowing the gender differential to smoothly increase or decrease over time (Model C).

# Fit models ------------------------------------------------------------------
math_dropout_fit_A <- glm(
  event ~ 0 + factor(term) + woman,
  family = binomial(link = "logit"),
  data = math_dropout
)
math_dropout_fit_B <- glm(
  event ~ 0 + factor(term) + factor(term):woman,
  family = binomial(link = "logit"),
  data = math_dropout
)
math_dropout_fit_C <- update(math_dropout_fit_A, . ~ . + woman:I(term - 1))

math_dropout_fits <- list(
  "Model A" = math_dropout_fit_A,
  "Model B" = math_dropout_fit_B,
  "Model C" = math_dropout_fit_C
)

# Make table ------------------------------------------------------------------

# Table 12.5, page 459:
math_dropout_fits |>
  modelsummary(
    fmt = 4,
    coef_rename = \(.x) coef_rename(.x, factor_name = FALSE),
    gof_map = c("deviance", "AIC"),
    output = "gt"
  )
                  Model A    Model B    Model C
Term 1            -2.1308    -2.0077    -2.0459
                  (0.0567)   (0.0715)   (0.0646)
Term 2            -0.9425    -0.9643    -0.9255
                  (0.0479)   (0.0585)   (0.0482)
Term 3            -1.4495    -1.4824    -1.4966
                  (0.0634)   (0.0847)   (0.0665)
Term 4            -0.6176    -0.7100    -0.7178
                  (0.0757)   (0.1007)   (0.0861)
Term 5            -0.7716    -0.8690    -0.9166
                  (0.1428)   (0.1908)   (0.1557)
Woman              0.3786                0.2275
                  (0.0501)              (0.0774)
Term 1:woman                  0.1568
                             (0.0978)
Term 2:woman                  0.4187
                             (0.0792)
Term 3:woman                  0.4407
                             (0.1158)
Term 4:woman                  0.5707
                             (0.1445)
Term 5:woman                  0.6008
                             (0.2857)
Woman:(Term - 1)                         0.1198
                                        (0.0470)
Deviance          9804.31    9796.27    9797.81
AIC               9816.3     9816.3     9811.8

We can test the proportionality assumption by using likelihood ratio tests to compare the deviance statistics between proportional odds models and nonproportional odds models.

As Singer and Willett (2003) discuss, although Model B fits slightly better than Model A, the likelihood ratio test suggests this improvement is insufficient to justify four extra parameters. However, the systematic increase in the factor(term):woman coefficients of Model B suggests a more parsimonious model that adds a linear interaction between woman and term, Model C, which does fit better than Model A (and no worse than Model B). We therefore choose Model C as our “final” model, in lieu of Model A with its untenable proportionality assumption.

math_dropout_anovas <- list(
  "Model A vs Model B" = anova(
    math_dropout_fit_A, math_dropout_fit_B, test = "LRT"
  ),
  "Model A vs Model C" = anova(
    math_dropout_fit_A, math_dropout_fit_C, test = "LRT"
  ),
  "Model C vs Model B" = anova(
    math_dropout_fit_C, math_dropout_fit_B, test = "LRT"
  )
)

math_dropout_anovas |>
  map(tidy) |>
  list_rbind(names_to = "test") |>
  mutate(
    term = str_remove(term, "0 \\+|- 1$"),
    across(c(df.residual, df), as.integer)
  ) |>
  group_by(test) |>
  gt() |>
  cols_label(term = "comparison") |>
  fmt_number(columns = where(is.double), decimals = 2) |>
  sub_missing(missing_text = "")
comparison                                         df.residual  residual.deviance  df  deviance  p.value
Model A vs Model B
  event ~ factor(term) + woman                            9552           9,804.31
  event ~ factor(term) + factor(term):woman               9548           9,796.27   4      8.04     0.09
Model A vs Model C
  event ~ factor(term) + woman                            9552           9,804.31
  event ~ factor(term) + woman + woman:I(term - 1)        9551           9,797.81   1      6.50     0.01
Model C vs Model B
  event ~ factor(term) + woman + woman:I(term - 1)        9551           9,797.81
  event ~ factor(term) + factor(term):woman               9548           9,796.27   3      1.54     0.67

Finally, we can plot fitted (logit) hazard functions, along with sample (logit) hazard functions from the life table, from each of these models to visualize their behaviour. Notice that the fitted (logit) hazard functions of Model B are identical to the sample (logit) hazard functions from the life table; as Singer and Willett (2003) explain, for categorical predictors, fitted hazard functions from a general interaction with time model reproduce within-group sample hazard functions (just as fitted hazard functions from a model with no substantive predictors reproduce sample hazard functions).

# Make life table -------------------------------------------------------------
math_dropout_lifetable <- math_dropout |>
  mutate(woman = as.logical(woman)) |>
  group_by(woman, term) |>
  summarise(
    model = "Sample",
    n.risk = n(),
    n.event = sum(event == 1),
    n.censor = sum(event == 0),
    haz.estimate = n.event / n.risk,
    haz.estimate = if_else(haz.estimate == 0, NA, haz.estimate),
    log_odds.estimate = log_odds(haz.estimate),
    .groups = "drop_last"
  )

# Make predictions ------------------------------------------------------------
math_dropout_preds <- math_dropout_fits |>
  map(
    \(.fit) {
      .fit |>
        augment(
          newdata = crossing(term = 1:5, woman = c(0, 1)),
          type.predict = "link"
        )
    }
  ) |>
  list_rbind(names_to = "model") |>
  mutate(woman = as.logical(woman)) |>
  rename(log_odds.estimate = .fitted)

# Plot ------------------------------------------------------------------------

# Figure 12.8, page 458:
math_dropout_lifetable |>
  bind_rows(math_dropout_preds) |>
  mutate(model = factor(model, levels = unique(model))) |>
  ggplot(aes(x = term, y = log_odds.estimate, colour = woman)) +
    geom_line() +
    scale_color_brewer(palette = "Dark2") +
    facet_wrap(vars(model)) +
    coord_cartesian(ylim = c(-2, 0))

12.6 The no unobserved heterogeneity assumption: No simple solution

In Section 12.6 Singer and Willett (2003) briefly discuss an additional assumption common to all the discrete-time and continuous-time hazard models presented throughout the book: That all variation in hazard profiles across individuals is assumed to depend on observed variation in the predictors, such that individuals who share identical predictor profiles will have identical population hazard functions. In other words, these models are assumed to have no unobserved heterogeneity, and are conceptually similar to the complete pooling model discussed in Section 3.2.

As Singer and Willett (2003) discuss, ignoring unobserved heterogeneity can have the following consequences:

  • It always leads to hazard functions that appear to decline over time, because the composition of the risk set changes over time as high-risk populations experience the target event at high rates, medium-risk populations at moderate rates, and low-risk populations at low rates. Therefore, the shape of the population hazard function may fail to describe the pattern of risk for the intended population under study, due to it being a blended average of the risk profiles of several heterogeneous populations (whose shapes may be entirely different from the blended average).
  • Parameter estimates can only be interpreted as population averaged coefficients, and not as person-specific coefficients.

As with the multilevel model for change, a multilevel discrete-time hazard model (or a multilevel continuous-time hazard model) can be used to explicitly model heterogeneity by adding a random effects term, such that each group has its own hazard function. Note that the survival analysis literature often refers to multilevel hazard models with only a person-specific random intercept as frailty models, and with only a cluster-specific random intercept as shared frailty models. For further discussion of these models, see Tutz and Schmid (2016) and Austin (2017).

For this example we return to the teachers data set introduced in Chapter 9, a person-level data frame with 3941 rows and 3 columns:

  • id: Teacher ID.
  • years: The number of years between a teacher’s dates of hire and departure from the Michigan public schools.
  • censor: Censoring status.
teachers
#> # A tibble: 3,941 × 3
#>    id    years censor
#>    <fct> <dbl>  <dbl>
#>  1 1         1      0
#>  2 2         2      0
#>  3 3         1      0
#>  4 4         1      0
#>  5 5        12      1
#>  6 6         1      0
#>  7 7        12      1
#>  8 8         1      0
#>  9 9         2      0
#> 10 10        2      0
#> # ℹ 3,931 more rows

We first need to convert the person-level teachers data set to a person-period data set.

teachers_pp <- teachers |>
  group_by(id) |>
  reframe(
    year = 1:years,
    event = if_else(year == years & censor == 0, true = 1, false = 0)
  )

teachers_pp
#> # A tibble: 24,875 × 3
#>    id     year event
#>    <fct> <int> <dbl>
#>  1 1         1     1
#>  2 2         1     0
#>  3 2         2     1
#>  4 3         1     1
#>  5 4         1     1
#>  6 5         1     0
#>  7 5         2     0
#>  8 5         3     0
#>  9 5         4     0
#> 10 5         5     0
#> # ℹ 24,865 more rows

We can fit a multilevel discrete-time hazard model using either the glmer() function from the lme4 package or the glmmTMB() function from the glmmTMB package, which are both functions for fitting generalized linear mixed models. Here we will fit two models to the teachers_pp person-period data set: a basic discrete-time hazard model (Model A), and a frailty model (Model B).

# Fit models ------------------------------------------------------------------

# We are fitting this model with the glmmTMB() function instead of glm() so we
# can compare its fit with the frailty model using the anova() function.
teachers_fit_A <- glmmTMB(
  event ~ 0 + factor(year),
  data = teachers_pp,
  family = binomial(link = "cloglog")
)

teachers_fit_B <- glmmTMB(
  event ~ 0 + factor(year) + (1 | id),
  data = teachers_pp,
  family = binomial(link = "cloglog")
)

teachers_fits <- list(
  "Model A" = teachers_fit_A,
  "Model B" = teachers_fit_B
)
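
For reference, the frailty model can also be fit with the glmer() function; here is a minimal sketch (the object name is just illustrative, and estimates should be similar, though lme4 and glmmTMB use different estimation routines).

# An lme4 analogue of Model B:
teachers_fit_B_lme4 <- lme4::glmer(
  event ~ 0 + factor(year) + (1 | id),
  data = teachers_pp,
  family = binomial(link = "cloglog")
)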

# Make table ------------------------------------------------------------------
teachers_fits |>
  modelsummary(
    fmt = 4,
    scales = c("vcov", NA),
    coef_rename = \(.x) coef_rename(.x, factor_name = FALSE),
    gof_map = c("logLik", "deviance", "AIC", "BIC"),
    output = "gt"
  ) |>
  tab_row_group(label = "Goodness-of-Fit", rows = 26:29) |>
  tab_row_group(label = "Variance Components", rows = 25) |>
  tab_row_group(label = "Fixed Effects", rows = 1:24)
                    Model A      Model B
Fixed Effects
Year 1              -2.0958      -2.6822
                    (0.0469)     (0.0821)
Year 2              -2.1478      -2.5142
                    (0.0511)     (0.0667)
Year 3              -2.0953      -2.2858
                    (0.0528)     (0.0608)
Year 4              -2.1731      -2.2137
                    (0.0583)     (0.0635)
Year 5              -2.3718      -2.2948
                    (0.0678)     (0.0727)
Year 6              -2.4516      -2.2805
                    (0.0737)     (0.0794)
Year 7              -2.7801      -2.5357
                    (0.0902)     (0.0960)
Year 8              -3.0097      -2.7126
                    (0.1125)     (0.1184)
Year 9              -3.1439      -2.8045
                    (0.1374)     (0.1434)
Year 10             -3.2802      -2.9046
                    (0.1690)     (0.1751)
Year 11             -3.6889      -3.2857
                    (0.2500)     (0.2552)
Year 12             -4.3529      -3.9332
                    (0.4472)     (0.4512)
Variance Components
Var (Intercept)                   1.4420
Goodness-of-Fit
Log.Lik.            -7291.871    -7248.334
Deviance            14583.74     11044.42
AIC                 14607.7      14522.7
BIC                 14705.2      14628.2

Comparing goodness-of-fit statistics between the nested models suggests that the frailty model may be closer to the underlying data-generating process, and that the monotonically decreasing sample hazard function for the teachers data observed in Chapter 10 (Figure 10.1) may have been an artifact of unobserved heterogeneity.

anova(teachers_fit_A, teachers_fit_B)
#> Data: teachers_pp
#> Models:
#> teachers_fit_A: event ~ 0 + factor(year), zi=~0, disp=~1
#> teachers_fit_B: event ~ 0 + factor(year) + (1 | id), zi=~0, disp=~1
#>                Df   AIC   BIC  logLik deviance  Chisq Chi Df Pr(>Chisq)    
#> teachers_fit_A 12 14608 14705 -7291.9    14584                             
#> teachers_fit_B 13 14523 14628 -7248.3    14497 87.074      1  < 2.2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
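
As a check on the anova() output, the likelihood-ratio statistic is simply twice the difference in the two models' log-likelihoods:

# 2 * (-7248.334 - (-7291.871)) = 87.074, on one degree of freedom.
2 * (as.numeric(logLik(teachers_fit_B)) - as.numeric(logLik(teachers_fit_A)))

Because the null hypothesis places the variance component on the boundary of its parameter space, the p-value from this test is conservative.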

Plotting coincident hazard functions from the frailty model (individual hazard functions summarized by the number of individuals who share the same hazard profile) seems to support this assertion. Notice that (1) the individual hazard functions do not follow a monotonically decreasing pattern, and (2) the composition of the risk set changes over time, with high-risk teachers leaving early and low-risk teachers either leaving much later or not at all.

# Make predictions
teachers_preds <- teachers_fit_B |>
  augment(data = teachers_pp, type.predict = "response") |>
  rename(haz.estimate = .fitted)

# Wrangle
teachers_cotraj <- teachers_preds |>
  select(id, year, haz.estimate) |>
  pivot_wider(
    names_from = year,
    names_prefix = "year_",
    values_from = haz.estimate
  ) |>
  group_by(pick(starts_with("year"))) |>
  summarise(coincident_trajectories = n(), .groups = "drop") |>
  mutate(trajectory_id = 1:n(), .before = everything()) |>
  pivot_longer(
    cols = starts_with("year"),
    names_to = "year",
    names_prefix = "year_",
    names_transform = as.integer,
    values_to = "haz.estimate"
  )

# Plot
ggplot(teachers_cotraj, aes(x = year, y = haz.estimate)) +
  geom_line(aes(group = trajectory_id, linewidth = coincident_trajectories))  +
  scale_linewidth_continuous(range = c(.25, 2)) +
  scale_x_continuous(breaks = 0:13) +
  coord_cartesian(xlim = c(0, 13), ylim = c(0, .21))

Although this does not wholly contradict Singer and Willett's (2003) original conclusion that novice teachers (or those with only a few years of experience) are at greatest risk of leaving teaching, and that this risk declines once they gain experience, it adds a layer of nuance not captured by the basic discrete-time hazard model. We can see this by plotting the population hazard functions from both models (for the frailty model, setting re.form = NA so that predictions are for a typical teacher whose random effect is zero).

# Make predictions
teachers_haz <- teachers_fits |>
  map(
    \(.fit) {
      augment(
        .fit,
        newdata = tibble(year = 1:12),
        type.predict = "response",
        re.form = NA
      )
    }
  ) |>
  list_rbind(names_to = "model") |>
  rename(haz.estimate = .fitted)

# Plot
ggplot(teachers_haz, aes(x = year, y = haz.estimate)) +
  geom_line() +
  scale_x_continuous(breaks = 0:13) +
  coord_cartesian(xlim = c(0, 13), ylim = c(0, .15)) +
  facet_wrap(vars(model))

12.7 Residual analysis

In Section 12.7 Singer and Willett (2003) return to the first_sex data to discuss two strategies for examining deviance residuals of the discrete-time hazard model:

  • Using index plots (sequential plots by ID) to examine deviance residuals on a case-by-case basis.
  • Summarizing each individual’s deviance residuals into a sum of squared deviance residuals to examine their contribution to the model’s deviance statistic.

When examining index plots like these, Singer and Willett (2003) note that most of the deviance residuals will be negative (except for those records where the target event occurred) and suggest looking for observations with extreme residuals, which indicate poor model fit. Index plots of each individual's sum of squared deviance residuals can likewise be used to identify individuals whose outcomes are poorly predicted, again by looking for extreme aggregated values.

Here we will fit a basic discrete-time hazard model to the first_sex_pp person-period data, which follows the same specification as Model D of Table 11.3.

first_sex_fit <- glm(
  event ~ 0 + factor(grade) + parental_transition + parental_antisociality,
  family = binomial(link = "logit"),
  data = first_sex_pp
)

summary(first_sex_fit)
#> 
#> Call:
#> glm(formula = event ~ 0 + factor(grade) + parental_transition + 
#>     parental_antisociality, family = binomial(link = "logit"), 
#>     data = first_sex_pp)
#> 
#> Coefficients:
#>                        Estimate Std. Error z value Pr(>|z|)    
#> factor(grade)7          -2.8932     0.3206  -9.024  < 2e-16 ***
#> factor(grade)8          -3.5848     0.4230  -8.474  < 2e-16 ***
#> factor(grade)9          -2.1502     0.2775  -7.750 9.20e-15 ***
#> factor(grade)10         -1.6932     0.2646  -6.398 1.58e-10 ***
#> factor(grade)11         -1.5177     0.2757  -5.504 3.71e-08 ***
#> factor(grade)12         -1.0099     0.2811  -3.592 0.000328 ***
#> parental_transition      0.6605     0.2367   2.790 0.005266 ** 
#> parental_antisociality   0.2964     0.1254   2.364 0.018089 *  
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> (Dispersion parameter for binomial family taken to be 1)
#> 
#>     Null deviance: 1139.53  on 822  degrees of freedom
#> Residual deviance:  629.15  on 814  degrees of freedom
#> AIC: 645.15
#> 
#> Number of Fisher Scoring iterations: 5

We can use the augment() function from the broom package to get deviance residuals for each person-period observation by setting the type.residuals argument to "deviance".

first_sex_fit_residuals <- first_sex_fit |>
  augment(data = first_sex_pp, type.residuals = "deviance") |>
  mutate(
    ss.deviance = sum(.resid^2),
    censor = if_else(any(event == 1), 0, 1),
    .by = id
  ) |>
  select(id:parental_antisociality, censor, deviance = .resid, ss.deviance, -event)

first_sex_fit_residuals
#> # A tibble: 822 × 7
#>    id    grade parental_transition parental_antisociality censor deviance
#>    <fct> <int>               <dbl>                  <dbl>  <dbl>    <dbl>
#>  1 1         7                   0                  1.98       0   -0.436
#>  2 1         8                   0                  1.98       0   -0.312
#>  3 1         9                   0                  1.98       0    1.87 
#>  4 2         7                   1                 -0.545      1   -0.418
#>  5 2         8                   1                 -0.545      1   -0.299
#>  6 2         9                   1                 -0.545      1   -0.592
#>  7 2        10                   1                 -0.545      1   -0.727
#>  8 2        11                   1                 -0.545      1   -0.785
#>  9 2        12                   1                 -0.545      1   -0.969
#> 10 3         7                   0                 -1.40       1   -0.268
#> # ℹ 812 more rows
#> # ℹ 1 more variable: ss.deviance <dbl>
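
The .resid values returned by augment() can also be computed by hand from the fitted hazard probabilities using the formula for binomial deviance residuals; a minimal sketch (the deviance_by_hand name is our own):

# For each person-period record, the deviance residual is
#   sqrt(-2 * log(fitted hazard))       when the event occurred, and
#   -sqrt(-2 * log(1 - fitted hazard))  when it did not.
first_sex_fit |>
  augment(data = first_sex_pp, type.predict = "response") |>
  mutate(
    deviance_by_hand = if_else(
      event == 1,
      sqrt(-2 * log(.fitted)),
      -sqrt(-2 * log(1 - .fitted))
    )
  ) |>
  select(id, grade, event, .fitted, deviance_by_hand)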

Following Singer and Willett (2003), we begin by presenting deviance residuals for eight adolescents from the first_sex_fit model.

# Table 12.6, page 465:
c(22, 112, 166, 89, 102, 87, 67, 212) |>
  map(\(.id) filter(first_sex_fit_residuals, id == .id)) |>
  list_rbind() |>
  pivot_wider(
    names_from = grade,
    names_prefix = "grade_",
    values_from = deviance
  ) |>
  relocate(ss.deviance, .after = everything()) |>
  gt() |>
  sub_missing(missing_text = "")

id   parental_transition  parental_antisociality  censor     grade_7     grade_8     grade_9    grade_10    grade_11    grade_12  ss.deviance
22                     1             -0.64964826       0  -0.4117311  -0.2944395  -0.5840395  -0.7176492  -0.7747741   1.4144633     3.713321
112                    1             -0.66093105       1  -0.4110716  -0.2939581  -0.5831423  -0.7165919  -0.7736556  -0.9562858     2.621976
166                    1              2.78141260       0  -0.6614533  -0.4806816  -0.9108230  -1.0903113   1.1913978                4.106381
89                     0             -0.07515719       0  -0.3248377  -0.2313901  -0.4644566  -0.5751775   1.8623690                4.174028
102                    1              0.60493168       0  -0.4913390   2.3694820                                                    5.855859
87                     1              2.67790076       0   1.8176297                                                                3.303778
67                     1              2.27465175       0  -0.6180257  -0.4476515  -0.8559141  -1.0294308  -1.1007321   1.0429686     4.674059
212                    0             -0.96179113       1  -0.2857033  -0.2032141  -0.4097550  -0.5090181  -0.5524318  -0.6958426     1.339299

Finally, we can examine the deviance residuals and sums of squared deviance residuals using index plots. Note that we convert id from a factor to numeric only to get a cleaner-looking x-axis.

first_sex_fit_ss.deviance <- first_sex_fit_residuals |>
  mutate(id = as.numeric(as.character(id))) |>
  ggplot(aes(x = id, y = ss.deviance)) +
    geom_point() +
    scale_x_continuous(breaks = seq(0, 220, by = 20)) +
    coord_cartesian(xlim = c(0, 220), ylim = c(1, 8))

first_sex_fit_deviance <- first_sex_fit_ss.deviance +
  aes(y = deviance) +
  geom_hline(yintercept = 0, alpha = .25) +
  coord_cartesian(xlim = c(0, 220), ylim = c(-2, 3))

# Figure 12.8, page 467:
first_sex_fit_deviance / first_sex_fit_ss.deviance +
  plot_layout(axes = "collect")
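
After inspecting the index plots, a natural follow-up is to pull out the individuals with the most extreme aggregated residuals; a minimal sketch (the cutoff of 6 is an arbitrary value chosen for illustration, not from the text):

# List individuals with an extreme sum of squared deviance residuals,
# using an arbitrary illustrative cutoff.
first_sex_fit_residuals |>
  distinct(id, ss.deviance) |>
  filter(ss.deviance > 6) |>
  arrange(desc(ss.deviance))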