2.1 Creating a longitudinal data set

In Section 2.1 Singer and Willett (2003) introduce two distinct formats of data organization for longitudinal data—the person-level format and the person-period format—using a subset of data from the National Youth Survey (NYS) measuring the development of tolerance towards deviant behaviour in adolescents over time in relation to self-reported sex and exposure to deviant peers (Raudenbush & Chan, 1992). Adolescents’ tolerance towards deviant behaviour was based on a 9-item scale measuring attitudes tolerant of deviant behaviour. The scale was administered each year from age 11 to 15 and so is a time-varying variable. However, adolescents’ self-reported sex and exposure to deviant peers were only recorded at the beginning of the study period and so are time-invariant variables.

For this example we illustrate the difference between the two formats using the deviant_tolerance_pl and deviant_tolerance_pp data sets, which correspond to the adolescent tolerance of deviant behaviour data organized in the person-level and person-period formats, respectively.

The person-level data set

In the person-level format (also known as wide or multivariate format), each person has only one row of data and multiple columns containing data from each measurement occasion for any time-varying variables. This is demonstrated in the deviant_tolerance_pl data set, a person-level data frame with 16 rows and 8 columns:

  • id: Participant ID.
  • tolerance_11, tolerance_12, tolerance_13, tolerance_14, tolerance_15: Average score across a 9-item scale assessing attitudes favourable to deviant behaviour at ages 11, 12, 13, 14, and 15. Each item used a four-point scale (1 = very wrong, 2 = wrong, 3 = a little bit wrong, 4 = not wrong at all).
  • male: Binary indicator for whether the adolescent is male.
  • exposure: Average score across a 9-item scale assessing level of exposure to deviant peers. Each item used a five-point Likert scale (ranging from 0 = none to 4 = all).
deviant_tolerance_pl
#> # A tibble: 16 × 8
#>    id    tolerance_11 tolerance_12 tolerance_13 tolerance_14 tolerance_15  male
#>    <fct>        <dbl>        <dbl>        <dbl>        <dbl>        <dbl> <dbl>
#>  1 9             2.23         1.79         1.9          2.12         2.66     0
#>  2 45            1.12         1.45         1.45         1.45         1.99     1
#>  3 268           1.45         1.34         1.99         1.79         1.34     1
#>  4 314           1.22         1.22         1.55         1.12         1.12     0
#>  5 442           1.45         1.99         1.45         1.67         1.9      0
#>  6 514           1.34         1.67         2.23         2.12         2.44     1
#>  7 569           1.79         1.9          1.9          1.99         1.99     0
#>  8 624           1.12         1.12         1.22         1.12         1.22     1
#>  9 723           1.22         1.34         1.12         1            1.12     0
#> 10 918           1            1            1.22         1.99         1.22     0
#> 11 949           1.99         1.55         1.12         1.45         1.55     1
#> 12 978           1.22         1.34         2.12         3.46         3.32     1
#> 13 1105          1.34         1.9          1.99         1.9          2.12     1
#> 14 1542          1.22         1.22         1.99         1.79         2.12     0
#> 15 1552          1            1.12         2.23         1.55         1.55     0
#> 16 1653          1.11         1.11         1.34         1.55         2.12     0
#> # ℹ 1 more variable: exposure <dbl>

Although the person-level format is common in cross-sectional research, it has four disadvantages that make it ill-suited for longitudinal data analysis:

  1. It restricts data analysis to examining rank order wave-to-wave relationships, leading to non-informative summaries that do not tell us how each person changes over time, nor even the direction of change.
  2. It omits an explicit time-indicator variable, rendering time unavailable for data analysis.
  3. It requires adding an additional variable to the data set for each unique measurement occasion, making it inefficient or useless when the number or spacing of measurement occasions varies across individuals.
  4. It requires adding an additional set of columns for each time-varying predictor (one column per measurement occasion), rendering it unable to easily handle the presence of time-varying predictors.

Singer and Willett (2003) exemplify the first of these disadvantages by postulating how one might analyze the person-level tolerance towards deviant behaviour data set. A natural approach would be to summarize the wave-to-wave relationships among tolerance_11 through tolerance_15 using bivariate correlations and bivariate plots; however, doing so does not tell us anything about how adolescent tolerance towards deviant behaviour changed over time for either individuals or groups. Rather, the weak positive correlation between measurement occasions merely tells us that the rank order of tolerance towards deviant behaviour remained relatively stable across occasions—that is, adolescents who were more tolerant towards deviant behaviour at one measurement occasion tended to be more tolerant at the next.

# Table 2.1, page 20:
deviant_tolerance_pl |>
  select(starts_with("tolerance")) |>
  correlate(diagonal = 1) |>
  shave() |>
  fashion()
#>           term tolerance_11 tolerance_12 tolerance_13 tolerance_14 tolerance_15
#> 1 tolerance_11         1.00                                                    
#> 2 tolerance_12          .66         1.00                                       
#> 3 tolerance_13          .06          .25         1.00                          
#> 4 tolerance_14          .14          .21          .59         1.00             
#> 5 tolerance_15          .26          .39          .57          .83         1.00

The first disadvantage is also apparent when examining bivariate plots between measurement occasions: There is again no way to tell how adolescent tolerance towards deviant behaviour changed over time for either individuals or groups. Moreover, because we lack an explicit time-indicator variable, it isn’t possible to plot the person-level data set in a more meaningful way, such as a time series plot organized by id.

deviant_tolerance_pl |>
  select(starts_with("tolerance")) |>
  pairs()

Considered together, these disadvantages make the person-level format ill-suited for most longitudinal data analyses. Fortunately, each of the disadvantages of the person-level format can be addressed by a simple conversion to the person-period format.

The person-period data set

In the person-period format (also known as long or univariate format), each person has one row of data for each measurement occasion, with a participant identifier variable for each person, and a time-indicator variable for each measurement occasion. In this format, time-invariant variables have identical values across each measurement occasion; whereas time-varying variables have potentially differing values. This is demonstrated in the deviant_tolerance_pp data set, a person-period data frame with 80 rows and 5 columns:

  • id: Participant ID.
  • age: Adolescent age in years.
  • tolerance: Average score across a 9-item scale assessing attitudes favourable to deviant behaviour. Each item used a four-point scale (1 = very wrong, 2 = wrong, 3 = a little bit wrong, 4 = not wrong at all).
  • male: Binary indicator for whether the adolescent is male.
  • exposure: Average score across a 9-item scale assessing level of exposure to deviant peers. Each item used a five-point Likert scale (ranging from 0 = none to 4 = all).
deviant_tolerance_pp
#> # A tibble: 80 × 5
#>    id      age tolerance  male exposure
#>    <fct> <dbl>     <dbl> <dbl>    <dbl>
#>  1 9        11      2.23     0     1.54
#>  2 9        12      1.79     0     1.54
#>  3 9        13      1.9      0     1.54
#>  4 9        14      2.12     0     1.54
#>  5 9        15      2.66     0     1.54
#>  6 45       11      1.12     1     1.16
#>  7 45       12      1.45     1     1.16
#>  8 45       13      1.45     1     1.16
#>  9 45       14      1.45     1     1.16
#> 10 45       15      1.99     1     1.16
#> # ℹ 70 more rows

Although the person-period data set contains the same information as the person-level data set, its organization makes it more amenable to longitudinal data analysis, specifically:

  • It includes an explicit participant identifier variable, enabling the data to be sorted into person-specific subsets.
  • It includes an explicit time-indicator variable, rendering time available for data analysis, and accommodating research designs in which the number or spacing of measurement occasions varies across individuals.
  • It needs only a single column for each variable in the data set—whether time-varying or time-invariant, outcome or predictor—making it trivial to handle any number of variables.

Indeed, most R functions are designed to work with data in the person-period format—which falls under the larger umbrella of the tidy data format—due to R’s vectorized nature. As Wickham, Çetinkaya-Rundel, and Grolemund (2023) explain, there are three interrelated rules that make a data set tidy:

  1. Each variable must have its own column.
  2. Each observation must have its own row.
  3. Each value must have its own cell.

Thus, the person-period format is simply a special case of the tidy data format, which distinguishes itself through its longitudinal nature and requirements for explicit participant identifier and time-indicator variables.
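One practical upshot of these rules is that a person-period data set can be validated programmatically. As a quick sketch (assuming the dplyr package is loaded), counting rows per participant should confirm that every adolescent contributes one row per measurement occasion:

```r
# Count rows per participant: in a balanced person-period data set,
# every id should appear once per measurement occasion (here, 5 times).
library(dplyr)

deviant_tolerance_pp |>
  count(id) |>
  distinct(n)
```

A single distinct count confirms the balanced structure; multiple counts would signal dropped or duplicated waves.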

Converting between person-level and person-period data sets

Unfortunately, longitudinal data is often initially stored as a person-level data set, meaning that most real analyses will require at least a little tidying to get your data into a person-period format. There are a few reasons for this:

  • Many people aren’t familiar with the principles of tidy data—nor its special cases like the person-period format—and it’s hard to derive them yourself without spending a lot of time working with longitudinal data.
  • The person-level format closely resembles the familiar cross-sectional data-set format, making it a seemingly sensible default for inexperienced analysts.
  • Data is often organized to facilitate non-analytical goals, such as data entry, rather than data analysis.

Thus, an essential skill for the aspiring longitudinal data analyst is to be able to convert a person-level data set into a person-period data set. The tidyr package provides two functions that can easily convert a longitudinal data set from one format to the other: pivot_longer() and pivot_wider().

To convert a person-level data set into a person-period data set use pivot_longer():

# Figure 2.1, page 18:
pivot_longer(
  deviant_tolerance_pl,
  cols = starts_with("tolerance_"),
  names_to = "age",
  names_prefix = "tolerance_",
  names_transform = as.integer,
  values_to = "tolerance"
)
#> # A tibble: 80 × 5
#>    id     male exposure   age tolerance
#>    <fct> <dbl>    <dbl> <int>     <dbl>
#>  1 9         0     1.54    11      2.23
#>  2 9         0     1.54    12      1.79
#>  3 9         0     1.54    13      1.9 
#>  4 9         0     1.54    14      2.12
#>  5 9         0     1.54    15      2.66
#>  6 45        1     1.16    11      1.12
#>  7 45        1     1.16    12      1.45
#>  8 45        1     1.16    13      1.45
#>  9 45        1     1.16    14      1.45
#> 10 45        1     1.16    15      1.99
#> # ℹ 70 more rows

After the person-level data, there are five key arguments:

  • cols specifies which columns need to be pivoted into longer format—for longitudinal data, this will always be the columns corresponding to any time-varying variables. This argument uses tidy selection, a small data science language for selecting columns in a data frame (?tidyr_tidy_select), making it simple to select each column of a time-varying variable based on its naming pattern.
  • names_to names the new column (or columns) to create from the information stored in the column names specified by cols. We named the new column age.
  • names_prefix removes the matching text from the start of each column name—for longitudinal data, this will always be the prefix for any time-varying variables separating the variable name from the measurement occasion. This argument uses a regular expression to select the matching text.
  • names_transform applies a function to the new column (or columns). We converted the new column age from type character to type integer.
  • values_to names the new column (or columns) to create from the data stored in cell values. We named the new column tolerance.

Note that "age" and "tolerance" are quoted in the call to pivot_longer() because they represent the column names of the new variables we’re creating, rather than already-existing variables in the data.

Although most longitudinal data analyses will begin by getting your data into a person-period format, it can occasionally be useful to go in the opposite direction. Some computations can be made easier using a person-period data set, and certain functions and analyses expect a person-period data set; therefore, it’s helpful to know how to untidy, transform, and re-tidy your data as needed.

To convert a person-period data set to a person-level data set use tidyr::pivot_wider():

pivot_wider(
  deviant_tolerance_pp,
  names_from = age,
  names_prefix = "tolerance_",
  values_from = tolerance
)
#> # A tibble: 16 × 8
#>    id     male exposure tolerance_11 tolerance_12 tolerance_13 tolerance_14
#>    <fct> <dbl>    <dbl>        <dbl>        <dbl>        <dbl>        <dbl>
#>  1 9         0     1.54         2.23         1.79         1.9          2.12
#>  2 45        1     1.16         1.12         1.45         1.45         1.45
#>  3 268       1     0.9          1.45         1.34         1.99         1.79
#>  4 314       0     0.81         1.22         1.22         1.55         1.12
#>  5 442       0     1.13         1.45         1.99         1.45         1.67
#>  6 514       1     0.9          1.34         1.67         2.23         2.12
#>  7 569       0     1.99         1.79         1.9          1.9          1.99
#>  8 624       1     0.98         1.12         1.12         1.22         1.12
#>  9 723       0     0.81         1.22         1.34         1.12         1   
#> 10 918       0     1.21         1            1            1.22         1.99
#> 11 949       1     0.93         1.99         1.55         1.12         1.45
#> 12 978       1     1.59         1.22         1.34         2.12         3.46
#> 13 1105      1     1.38         1.34         1.9          1.99         1.9 
#> 14 1542      0     1.44         1.22         1.22         1.99         1.79
#> 15 1552      0     1.04         1            1.12         2.23         1.55
#> 16 1653      0     1.25         1.11         1.11         1.34         1.55
#> # ℹ 1 more variable: tolerance_15 <dbl>

After the person-period data, there are three key arguments:

  • names_from specifies which column (or columns) to get the name of the output columns from—for longitudinal data, this will always be the columns corresponding to any time-indicator variables.
  • names_prefix adds the specified string to the start of each output column name—for longitudinal data, this will always be the prefix for any time-varying variables separating the variable name from the measurement occasion.
  • values_from specifies which column (or columns) to get the cell values from—for longitudinal data, this will always be the columns corresponding to any time-varying variables.

To learn more about the principles of tidy data and how pivoting works, see the Data Tidying chapter of R for Data Science.

2.2 Descriptive analysis of individual change over time

In Section 2.2 Singer and Willett (2003) use the deviant_tolerance_pp data set to demonstrate how the person-period format facilitates exploratory analyses that describe how individuals in the data set change over time, revealing the nature and idiosyncrasies of each person’s temporal pattern of change.

Empirical growth plots

Empirical growth plots show, for each individual, the sequence of change in a time-varying variable. Here change can be evaluated in either absolute terms against the scale of the variable of interest, or in relative terms in comparison to other sample members. Singer and Willett (2003) identify several questions that are helpful to answer when examining empirical growth plots:

  • Who is increasing?
  • Who is decreasing?
  • Who is increasing the most? The least?
  • Who is decreasing the most? The least?
  • Does anyone increase and then decrease?
  • Does anyone decrease and then increase?

To construct an empirical growth plot with the ggplot2 package, put your time-indicator on the x-axis, your time-varying variable on the y-axis, and facet each individual into a separate panel.

# Figure 2.2, page 25:
deviant_tolerance_empgrowth <- deviant_tolerance_pp |>
  ggplot(aes(x = age, y = tolerance)) +
    geom_point() +
    coord_cartesian(ylim = c(0, 4)) +
    facet_wrap(vars(id), labeller = label_both)

deviant_tolerance_empgrowth

If your data set is large, Singer and Willett (2003) suggest constructing empirical growth plots for a randomly selected subsample of individuals—perhaps stratified into groups defined by the values of important predictors—rather than using the entire sample.

This task can be easily accomplished using the filter() function from the dplyr package prior to plotting. For example, here we sample four random adolescents. Note the use of the set.seed() function prior to sampling, which sets the state of R’s random number generator; we do this so that the results of the random sample are reproducible.

set.seed(345)

deviant_tolerance_pp |>
  filter(id %in% sample(unique(id), size = 4))
#> # A tibble: 20 × 5
#>    id      age tolerance  male exposure
#>    <fct> <dbl>     <dbl> <dbl>    <dbl>
#>  1 268      11      1.45     1     0.9 
#>  2 268      12      1.34     1     0.9 
#>  3 268      13      1.99     1     0.9 
#>  4 268      14      1.79     1     0.9 
#>  5 268      15      1.34     1     0.9 
#>  6 442      11      1.45     0     1.13
#>  7 442      12      1.99     0     1.13
#>  8 442      13      1.45     0     1.13
#>  9 442      14      1.67     0     1.13
#> 10 442      15      1.9      0     1.13
#> 11 569      11      1.79     0     1.99
#> 12 569      12      1.9      0     1.99
#> 13 569      13      1.9      0     1.99
#> 14 569      14      1.99     0     1.99
#> 15 569      15      1.99     0     1.99
#> 16 1105     11      1.34     1     1.38
#> 17 1105     12      1.9      1     1.38
#> 18 1105     13      1.99     1     1.38
#> 19 1105     14      1.9      1     1.38
#> 20 1105     15      2.12     1     1.38

This approach can also be extended to randomly select a subsample of individuals within different strata by combining the group_split() function from the dplyr package to split the data into a list of different groups, and the map() function from the purrr package to apply the filter() call from the previous example to each group. For example, here we sample two random adolescent males and two random adolescent females, then combine the filtered data frames in the list back together using the list_rbind() function from the purrr package.

set.seed(123)

deviant_tolerance_pp |>
  group_split(male) |>
  map(\(.group) filter(.group, id %in% sample(unique(id), size = 2))) |>
  list_rbind()
#> # A tibble: 20 × 5
#>    id      age tolerance  male exposure
#>    <fct> <dbl>     <dbl> <dbl>    <dbl>
#>  1 442      11      1.45     0     1.13
#>  2 442      12      1.99     0     1.13
#>  3 442      13      1.45     0     1.13
#>  4 442      14      1.67     0     1.13
#>  5 442      15      1.9      0     1.13
#>  6 918      11      1        0     1.21
#>  7 918      12      1        0     1.21
#>  8 918      13      1.22     0     1.21
#>  9 918      14      1.99     0     1.21
#> 10 918      15      1.22     0     1.21
#> 11 268      11      1.45     1     0.9 
#> 12 268      12      1.34     1     0.9 
#> 13 268      13      1.99     1     0.9 
#> 14 268      14      1.79     1     0.9 
#> 15 268      15      1.34     1     0.9 
#> 16 514      11      1.34     1     0.9 
#> 17 514      12      1.67     1     0.9 
#> 18 514      13      2.23     1     0.9 
#> 19 514      14      2.12     1     0.9 
#> 20 514      15      2.44     1     0.9

Using a trajectory to summarize each person’s empirical growth record

Each person’s empirical growth record can be summarized by applying two standard approaches:

  • The nonparametric approach uses nonparametric smooths to summarize each person’s pattern of change over time graphically without imposing a specific functional form. The primary advantage of the nonparametric approach is that it requires no assumptions.
  • The parametric approach uses separate parametric models fit to each person’s data to summarize their pattern of change over time. Each model uses a common functional form for the trajectories (e.g., a straight line, a quadratic curve, etc.). The primary advantage of the parametric approach is that it provides numeric summaries of the trajectories that can be used for further exploration.

Singer and Willett (2003) recommend using both approaches—beginning with the nonparametric approach—as examining the smoothed trajectories will help you select a common functional form for the trajectories in the parametric approach.

The nonparametric approach

The stat_smooth() function can be used to add a nonparametric smooth layer to the empirical growth record plot. The choice of a particular smoothing algorithm is primarily a matter of convenience, so we’ll use the default loess smoother. The span argument controls the amount of smoothing for the default loess smoother, with smaller numbers producing wigglier lines and larger numbers producing smoother lines; here we choose a value that creates a smooth similar to the textbook figure.

# Figure 2.3, page 27:
deviant_tolerance_empgrowth +
  stat_smooth(method = "loess", se = FALSE, span = .9)

Singer and Willett (2003) recommend focusing on the elevation, shape, and tilt of the smoothed trajectories by answering questions like:

  • Do the scores hover at the low, medium, or high end of the scale?
  • Does everyone change over time or do some people remain the same?
  • Do the trajectories have an inflection point or plateau?
  • Is the rate of change steep or shallow?
  • What is the overall functional form of the trajectory at the group level? Is it linear or curvilinear? Smooth or step-like?

Answering the last question is particularly important, as it will help you select a common functional form for the trajectories in the parametric approach.

The parametric approach

For the parametric approach, Singer and Willett (2003) suggest using the following three-step process:

  1. Estimate a within-person linear model for each person in the data set.
  2. Collect summary statistics from each within-person linear model into a single data set.
  3. Add each person’s fitted trajectory to the empirical growth record plot.

To begin, we’ll use the lmList() function from the lme4 package to fit a common linear model for each adolescent in the data set. The model formula for the lmList() function takes the form response ~ terms | group. Here we select a straight line as the common functional form for the trajectories, with age centred at age 11.

deviant_tolerance_fit <- lmList(
  tolerance ~ I(age - 11) | id,
  pool = FALSE,
  data = deviant_tolerance_pp
)

# Table 2.2, page 30:
summary(deviant_tolerance_fit)
#> Call:
#>   Model: tolerance ~ I(age - 11) | NULL 
#>    Data: deviant_tolerance_pp 
#> 
#> Coefficients:
#>    (Intercept) 
#>      Estimate Std. Error   t value     Pr(>|t|)
#> 9       1.902 0.25194841  7.549165 4.819462e-03
#> 45      1.144 0.13335666  8.578499 3.329579e-03
#> 268     1.536 0.26038049  5.899059 9.725771e-03
#> 314     1.306 0.15265648  8.555156 3.356044e-03
#> 442     1.576 0.20786534  7.581832 4.759898e-03
#> 514     1.430 0.13794927 10.366130 1.915399e-03
#> 569     1.816 0.02572936 70.580844 6.267530e-06
#> 624     1.120 0.04000000 28.000000 1.000014e-04
#> 723     1.268 0.08442748 15.018806 6.407318e-04
#> 918     1.000 0.30444376  3.284679 4.626268e-02
#> 949     1.728 0.24118043  7.164760 5.600382e-03
#> 978     1.028 0.31995000  3.213002 4.884420e-02
#> 1105    1.538 0.15115555 10.174949 2.022903e-03
#> 1542    1.194 0.18032748  6.621287 7.015905e-03
#> 1552    1.184 0.37355321  3.169562 5.049772e-02
#> 1653    0.954 0.13925516  6.850734 6.366647e-03
#>    I(age - 11) 
#>      Estimate Std. Error    t value   Pr(>|t|)
#> 9       0.119 0.10285751  1.1569404 0.33105320
#> 45      0.174 0.05444263  3.1960249 0.04948216
#> 268     0.023 0.10629989  0.2163690 0.84257784
#> 314    -0.030 0.06232175 -0.4813729 0.66318168
#> 442     0.058 0.08486067  0.6834733 0.54336337
#> 514     0.265 0.05631755  4.7054602 0.01816360
#> 569     0.049 0.01050397  4.6649040 0.01859462
#> 624     0.020 0.01632993  1.2247449 0.30806801
#> 723    -0.054 0.03446738 -1.5666989 0.21516994
#> 918     0.143 0.12428864  1.1505476 0.33330784
#> 949    -0.098 0.09846150 -0.9953129 0.39294486
#> 978     0.632 0.13061904  4.8384983 0.01683776
#> 1105    0.156 0.06170899  2.5279945 0.08557441
#> 1542    0.237 0.07361839  3.2193045 0.04861002
#> 1552    0.153 0.15250246  1.0032625 0.38965538
#> 1653    0.246 0.05685068  4.3271249 0.02275586

Next we’ll collect summary statistics from each within-person linear model into a single data set using the tidy() function from the broom package. However, because lmList() returns a list of models, we need to apply the tidy() call to each model prior to collecting the summary statistics into a single data set. Ironically, we also need to tidy the result of tidy() to prepare the data for plotting.

deviant_tolerance_tidy <- deviant_tolerance_fit |>
  map(tidy) |>
  list_rbind(names_to = "id") |>
  mutate(
    id = as.factor(id),
    term = case_when(
      term == "(Intercept)" ~ "intercept",
      term == "I(age - 11)" ~ "slope"
    )
  )

deviant_tolerance_abline <- deviant_tolerance_tidy |>
  select(id:estimate) |>
  pivot_wider(names_from = term, values_from = estimate)

deviant_tolerance_abline
#> # A tibble: 16 × 3
#>    id    intercept   slope
#>    <fct>     <dbl>   <dbl>
#>  1 9         1.90   0.119 
#>  2 45        1.14   0.174 
#>  3 268       1.54   0.0230
#>  4 314       1.31  -0.0300
#>  5 442       1.58   0.0580
#>  6 514       1.43   0.265 
#>  7 569       1.82   0.0490
#>  8 624       1.12   0.0200
#>  9 723       1.27  -0.0540
#> 10 918       1      0.143 
#> 11 949       1.73  -0.0980
#> 12 978       1.03   0.632 
#> 13 1105      1.54   0.156 
#> 14 1542      1.19   0.237 
#> 15 1552      1.18   0.153 
#> 16 1653      0.954  0.246

Finally, we can add each person’s fitted trajectory to the empirical growth record plot using the geom_abline() function. However, because we centred age in our linear model, we need to transform the scale of the x-axis of the empirical growth plot to be centred as well—otherwise ggplot2 will not be able to align the fitted trajectories correctly. To do so, we must create a custom transformation object using the new_transform() function from the scales package, which takes a name for the new transformation along with the transformation function and its inverse.

transform_centre <- function(subtract) {
  new_transform(
    "centre",
    transform = \(x) x - subtract,
    inverse = \(x) x + subtract
  )
}

# Figure 2.5, page 32:
deviant_tolerance_empgrowth +
  geom_abline(
    aes(intercept = intercept, slope = slope),
    data = deviant_tolerance_abline
  ) +
  scale_x_continuous(transform = transform_centre(11))

Alternatively, if you only plan to examine the parametric trajectories graphically, the three-step process suggested by Singer and Willett (2003) can be skipped altogether by using the stat_smooth() function with the "lm" method. This approach also fits a within-person linear model for each person in the data set; the only drawback is that it makes it awkward (though not impossible) to access the summary statistics from each model.

deviant_tolerance_empgrowth +
  stat_smooth(method = "lm", se = FALSE)

2.3 Exploring differences in change across people

Having explored how each individual changes over time, in Section 2.3 Singer and Willett (2003) continue with the deviant_tolerance_pp data set to demonstrate three strategies for exploring interindividual differences in change:

  1. Plotting the entire set of individual trajectories together, along with an average change trajectory for the entire group. Individual trajectories can either be compared with one another to examine similarities and differences in changes across people, or with the average change trajectory to compare individual change with group change.
  2. Conducting descriptive analyses of key model parameters, such as the estimated intercepts and slopes of the individual change trajectory models.
  3. Exploring the relationship between change and time-invariant predictors. This relationship can be explored through both plots and statistical modelling.

Plotting the entire set of trajectories together

The purpose of the first strategy is to answer generic questions about change, such as:

  • Is the direction and rate of change similar or different across people?
  • How does individual change compare to the group-averaged change trajectory?

For this strategy, Singer and Willett (2003) suggest using both the nonparametric and parametric approaches, as certain patterns in the data may be somewhat easier to interpret using one approach or the other.

deviant_tolerance_grptraj <- map(
  list("loess", "lm"),
  \(.method) {
    deviant_tolerance_pp |>
      mutate(method = .method) |>
      ggplot(mapping = aes(x = age, y = tolerance)) +
        stat_smooth(
          aes(linewidth = "individual", group = id),
          method = .method, se = FALSE, span = .9
        ) +
        stat_smooth(
          aes(linewidth = "average"),
          method = .method, se = FALSE, span = .9
        ) +
        scale_linewidth_manual(values = c(2, .25)) +
        coord_cartesian(ylim = c(0, 4)) +
        facet_wrap(vars(method), labeller = label_both) +
        labs(linewidth = "trajectory")
  }
)

# Figure 2.6, page 34:
wrap_plots(deviant_tolerance_grptraj) +
  plot_layout(guides = "collect", axes = "collect")

Conducting descriptive analyses of key model parameters

The purpose of the second strategy is to answer specific questions about the behaviour of key parameters in the individual change trajectory models, such as:

  • What is the average initial status and the average annual rate of change?
  • What is the observed variability in initial status and annual rate of change?
  • What is the relationship between initial status and annual rate of change?

For this strategy, Singer and Willett (2003) suggest examining the estimated intercepts and slopes from the fitted linear models with the following three summary statistics:

  1. The sample mean, which summarizes the average initial status (intercept) and annual rate of change (slope) across the sample.
  2. The sample variance or standard deviation, which summarize the amount of observed interindividual heterogeneity in initial status and annual rate of change.
  3. The sample correlation, which summarizes the strength and direction of the relationship between initial status and annual rate of change.

The sample mean, variance, and standard deviation can be computed together from the tidied model fits we saved earlier using a combination of the group_by() and summarise() functions from the dplyr package.

# Table 2.3, page 37:
deviant_tolerance_tidy |>
  group_by(term) |>
  summarise(
    mean = mean(estimate),
    var = var(estimate),
    sd = sd(estimate)
  )
#> # A tibble: 2 × 4
#>   term       mean    var    sd
#>   <chr>     <dbl>  <dbl> <dbl>
#> 1 intercept 1.36  0.0887 0.298
#> 2 slope     0.131 0.0297 0.172

The sample correlation must be computed in a separate step and requires some additional transformations of the tidied model fits. Here we use the correlate() function from the corrr package—which is part of the tidymodels universe of packages—because its API is designed with data pipelines in mind.

deviant_tolerance_tidy |>
  select(id, term, estimate) |>
  pivot_wider(names_from = term, values_from = estimate) |>
  select(-id) |>
  correlate() |>
  stretch(na.rm = TRUE, remove.dups = TRUE)
#> # A tibble: 1 × 3
#>   x         y          r
#>   <chr>     <chr>  <dbl>
#> 1 intercept slope -0.448

Exploring the relationship between change and time-invariant predictors

The purpose of the final strategy is to answer questions about systematic interindividual differences in change, such as:

  • Do the observed (average) initial status and (average) annual rate of change differ across the levels or values of any time-invariant predictors?
  • Is there a relationship between initial status or annual rate of change and any time-invariant predictors?

For this strategy, Singer and Willett (2003) suggest using two approaches:

  1. Plotting (smoothed) individual growth trajectories, displayed separately for groups distinguished by important values of time-invariant predictors. For categorical predictors, each level of the predictor can be used. For continuous predictors, the values can be temporarily categorized for the purpose of display.
  2. Conducting exploratory analyses of the relationship between change and time-invariant predictors, investigating whether the estimated intercepts and slopes of the individual change trajectory models vary systematically with different predictors.

The plotting approach

For the plotting approach, we can adapt the code we used earlier to plot the entire set of trajectories together, simply changing the variable we’ll facet by. Here we facet by the categorical predictor male and the continuous predictor exposure, which we split at its median for the purposes of display.

deviant_tolerance_grptraj_by <- map(
  list(male = "male", exposure = "exposure"),
  \(.by) {
    deviant_tolerance_pp |>
      mutate(
        exposure = if_else(exposure < median(exposure), "low", "high"),
        exposure = factor(exposure, levels = c("low", "high"))
      ) |>
      ggplot(aes(x = age, y = tolerance)) +
        stat_smooth(
          aes(linewidth = "individual", group = id),
          method = "lm", se = FALSE, span = .9
        ) +
        stat_smooth(
          aes(linewidth = "average"),
          method = "lm", se = FALSE, span = .9
        ) +
        scale_linewidth_manual(values = c(2, .25)) +
        coord_cartesian(ylim = c(0, 4)) +
        facet_wrap(.by, labeller = label_both) +
        labs(linewidth = "trajectory")
  }
)

# Figure 2.7, page 38:
wrap_plots(deviant_tolerance_grptraj_by, ncol = 1, guides = "collect")

When examining plots like these, Singer and Willett (2003) recommend looking for systematic patterns in the trajectories to answer questions like:

  • Do the observed trajectories differ across groups?
  • Do observed differences appear more in the intercepts or in the slopes?
  • Are the observed trajectories of some groups more heterogeneous than others?

If we also wished to conduct descriptive analyses of the key model parameters for each of these groups, we could use the update() function to refit our common linear model to different subsets of the data. Here we store the model fits for each subgroup in a list so they're easier to iterate over together.

tolerance_fit_sex <- list(
  male = update(deviant_tolerance_fit, subset = male == 1),
  female = update(deviant_tolerance_fit, subset = male == 0)
)

tolerance_fit_exposure <- list(
  low = update(deviant_tolerance_fit, subset = exposure < 1.145),
  high = update(deviant_tolerance_fit, subset = exposure >= 1.145)
)

For example, here is a descriptive analysis of the intercepts and slopes for males and females.

tolerance_fit_sex |>
  map(
    \(.fit_sex) {
      .fit_sex |>
        map(tidy) |>
        list_rbind(names_to = "id") |>
        group_by(term) |>
        summarise(
          mean = mean(estimate),
          sd = sd(estimate)
        )
    }
  ) |>
  list_rbind(names_to = "sex")
#> # A tibble: 4 × 4
#>   sex    term         mean    sd
#>   <chr>  <chr>       <dbl> <dbl>
#> 1 male   (Intercept) 1.36  0.264
#> 2 male   I(age - 11) 0.167 0.238
#> 3 female (Intercept) 1.36  0.338
#> 4 female I(age - 11) 0.102 0.106
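
A parallel summary for the low and high exposure groups, not shown in the text, can be obtained by swapping in tolerance_fit_exposure; the code mirrors the sex analysis above:

tolerance_fit_exposure |>
  map(
    \(.fit_exposure) {
      .fit_exposure |>
        map(tidy) |>
        list_rbind(names_to = "id") |>
        group_by(term) |>
        summarise(
          mean = mean(estimate),
          sd = sd(estimate)
        )
    }
  ) |>
  list_rbind(names_to = "exposure")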

The exploratory analysis approach

For the exploratory analysis approach, Singer and Willett (2003) recommend restricting ourselves to the simplest of approaches for examining the relationship between change and time-invariant predictors—bivariate scatter plots and sample correlations. Their reasoning for this restriction is twofold:

  1. The statistical models presented in this chapter are intended for descriptive and exploratory purposes only; their estimates have known biases that make them imperfect measures of each person’s true initial status and true rate of change.
  2. These models will soon be replaced with the multilevel model for change in Chapter 3, which is better-suited for modelling longitudinal data.

Before doing any plotting or computations, we first need to add each adolescent's male and exposure values to the deviant_tolerance_tidy data frame. This is easily done using the left_join() function from the dplyr package, which performs a mutating join that adds columns from one data frame to another, matching observations based on their keys. Here we join a selection of columns from the person-level deviant_tolerance_pl data set: the id column, which exists in both data frames and will thus be used for joining, and our two time-invariant predictors, male and exposure. We'll also create a new sex variable, which we can use instead of male for plotting.

deviant_tolerance_tidy_2 <- deviant_tolerance_tidy |>
  left_join(select(deviant_tolerance_pl, id, male, exposure)) |>
  mutate(sex = if_else(male == 0, "female", "male"))

deviant_tolerance_tidy_2
#> # A tibble: 32 × 9
#>    id    term      estimate std.error statistic p.value  male exposure sex   
#>    <fct> <chr>        <dbl>     <dbl>     <dbl>   <dbl> <dbl>    <dbl> <chr> 
#>  1 9     intercept   1.90      0.252      7.55  0.00482     0     1.54 female
#>  2 9     slope       0.119     0.103      1.16  0.331       0     1.54 female
#>  3 45    intercept   1.14      0.133      8.58  0.00333     1     1.16 male  
#>  4 45    slope       0.174     0.0544     3.20  0.0495      1     1.16 male  
#>  5 268   intercept   1.54      0.260      5.90  0.00973     1     0.9  male  
#>  6 268   slope       0.0230    0.106      0.216 0.843       1     0.9  male  
#>  7 314   intercept   1.31      0.153      8.56  0.00336     0     0.81 female
#>  8 314   slope      -0.0300    0.0623    -0.481 0.663       0     0.81 female
#>  9 442   intercept   1.58      0.208      7.58  0.00476     0     1.13 female
#> 10 442   slope       0.0580    0.0849     0.683 0.543       0     1.13 female
#> # ℹ 22 more rows

Now we can create the bivariate scatter plots. Note that we use the .data pronoun inside our call to aes()—the .data pronoun is a special construct in the tidyverse that lets us refer to data frame columns indirectly using a character vector of variable names, so they work in the expected way for arguments that use non-standard evaluation. To learn more about the .data pronoun, see the dplyr package's Programming with dplyr vignette.

deviant_tolerance_biplot <- map(
  list(sex = "sex", exposure = "exposure"),
  \(.x) {
    ggplot(deviant_tolerance_tidy_2, aes(x = .data[[.x]], y = estimate)) +
      geom_point() +
      facet_wrap(vars(term), ncol = 1, scales = "free_y")
  }
)

# Figure 2.8, page 40:
wrap_plots(deviant_tolerance_biplot) +
  plot_layout(axes = "collect")

Finally, we can compute the correlation between the intercepts and slopes of the individual change trajectory models with the time-invariant predictors male and exposure. Here we use the cor() function rather than corrr::correlate() since we just want to return the correlation value, not a correlation data frame.

# Correlation values shown in Figure 2.8, page 40:
deviant_tolerance_tidy_2 |>
  group_by(term) |>
  summarise(
    male_cor = cor(estimate, male),
    exposure_cor = cor(estimate, exposure)
  )
#> # A tibble: 2 × 3
#>   term      male_cor exposure_cor
#>   <chr>        <dbl>        <dbl>
#> 1 intercept  0.00863        0.232
#> 2 slope      0.194          0.442