Chapter 17 (Normal) Hierarchical Models With Predictors

In Chapter 15 we convinced you that there’s a need for hierarchical models. In Chapter 16 we built our first hierarchical model – a Normal hierarchical model of \(Y\) with no predictors \(X\). Here we’ll take the next natural step by building a Normal hierarchical regression model of \(Y\) with predictors \(X\). Going full circle, let’s return to the Cherry Blossom 10 mile running race analysis that was featured in Chapter 15. Our goal is to better understand variability in running times: To what extent do some people run faster than others? How are running times associated with age, and to what extent does this differ from person to person?

In answering these questions, we’ll utilize the cherry_blossom_sample data from the bayesrules package, shortened to running here. This data records multiple net running times (in minutes) for each of 36 different runners in their 50’s and 60’s that entered the 10-mile race in multiple years.

# Load packages
library(bayesrules)
library(tidyverse)
library(rstanarm)
library(bayesplot)
library(tidybayes)
library(broom.mixed)

# Load data
data(cherry_blossom_sample)
running <- cherry_blossom_sample

But it turns out that the running data is missing some net race times. Since functions such as prediction_summary(), add_fitted_draws(), and add_predicted_draws() require complete information on each race, we’ll omit the rows with incomplete observations. In doing so, it’s important to use na.omit() after selecting our variables of interest so that we don’t throw out observations that have complete information on these variables just because they have incomplete information on variables we don’t care about.

# Remove NAs
running <- running %>% 
  select(runner, age, net) %>% 
  na.omit()

With multiple observations per runner, this data is hierarchical or grouped. To acknowledge this grouping structure, let \(Y_{ij}\) denote the net running time and \(X_{ij}\) the age for runner \(j\) in their \(i\)th race. In modeling \(Y_{ij}\) by \(X_{ij}\), Chapter 15 previewed that it would be a mistake to ignore the data’s grouped structure: a complete pooling approach ignores the fact that we have multiple dependent observations on each runner, and a no pooling approach assumes that data on one runner can’t tell us anything about another runner (thus that our data also can’t tell us anything about the general running population). Instead, our analysis of the relationship between running times and age will combine two Bayesian modeling paradigms we’ve developed throughout the book:

  • regression models of \(Y\) by predictors \(X\) when our data does not have a group structure (Chapters 9 through 14); and
  • hierarchical models of \(Y\) which account for group-structured data but do not use predictors \(X\) (Chapter 16).
  • Build hierarchical regression models of response variable \(Y\) by predictors \(X\).
  • Evaluate and compare hierarchical and non-hierarchical models.
  • Use hierarchical models for posterior prediction.

17.1 First steps: complete pooling

To explore the association between running times (\(Y_{ij}\)) and age (\(X_{ij}\)), let’s start by ignoring the data’s grouped structure. This isn’t the right approach, but it provides a good point of comparison and a building block to a better model. To this end, the complete pooled regression model of \(Y_{ij}\) from Chapter 15 assumes an age specific linear relationship \(\mu_i = \beta_0 + \beta_1 X_{ij}\) with weakly informative priors:

\[\begin{equation} \begin{split} Y_{ij} | \beta_0, \beta_1, \sigma & \sim N(\mu_i, \sigma^2) \;\; \text{ where } \mu_i = \beta_0 + \beta_1 X_{ij} \\ \beta_{0c} & \sim N(0, 35^2) \\ \beta_1 & \sim N(0, 15^2) \\ \sigma & \sim \text{Exp}(0.072) \\ \end{split} \tag{17.1} \end{equation}\]

This model depends upon three global parameters: intercept \(\beta_0\), age coefficient \(\beta_1\), and variability from the regression model \(\sigma\). Our posterior simulation results for these parameters from Section 15.1, stored in complete_pooled_model, are summarized below.

# Posterior summary statistics
model_summary <- tidy(complete_pooled_model, 
                      conf.int = TRUE, conf.level = 0.80)
model_summary
# A tibble: 2 x 5
  term        estimate std.error conf.low conf.high
  <chr>          <dbl>     <dbl>    <dbl>     <dbl>
1 (Intercept)   75.2      24.6     43.7     106.   
2 age            0.268     0.446   -0.302     0.842

The vibe of this complete pooled model is captured by Figure 17.1: it lumps together all race results and describes the relationship between running time and age by a common global model. Lumped together in this way, a scatterplot of net running times versus age exhibit a weak relationship with a posterior median model of 75.2 \(+\) 0.268age.

# Posterior median model
B0 <- model_summary$estimate[1]
B1 <- model_summary$estimate[2]
ggplot(running, aes(x = age, y = net)) + 
  geom_point() + 
  geom_abline(aes(intercept = B0, slope = B1))
This is an untitled chart with no subtitle or caption. It has x-axis 'age' with labels 50.0, 52.5, 55.0, 57.5 and 60.0. It has y-axis 'net' with labels 60, 80, 100 and 120. It has 2 layers. Layer 1 is a set of 185 points. Layer 2 is an abline graph that VI can not process.

FIGURE 17.1: A scatterplot of net running times versus age along with the posterior median model from the complete pooled model.

This posterior median estimate of the age coefficient \(\beta_1\) suggests that running times tend to increase by a mere 0.27 minutes for each year in age. Further, with an 80% posterior credible interval for \(\beta_1\) which straddles 0, (-0.3, 0.84), our complete pooled regression model suggests there’s not a significant relationship between running time and age. Our intuition (and personal experience) says this is wrong – as adults age they tend to slow down. This intuition is correct.

17.2 Hierarchical model with varying intercepts

17.2.1 Model building

Chapter 15 revealed that it would indeed be a mistake to stop our runner analysis with the complete pooled regression model (17.1). Thus our next goal is to incorporate the data’s grouped structure while maintaining our age predictor \(X_{ij}\). Essentially, we want to combine the regression principles from the complete pooled regression model with the grouping principles from the simple Normal hierarchical model without a predictor from Chapter 16:

\[\begin{equation} \begin{array}{rll} Y_{ij} | \mu_{j}, \sigma_y & \sim N(\mu_{j}, \sigma_y^2) & \text{model of running times WITHIN runner $j$} \\ \mu_{j} | \mu, \sigma_\mu & \stackrel{ind}{\sim} N(\mu, \sigma_\mu^2) & \text{model of how typical running times vary BETWEEN runners} \\ \mu, \sigma_y, \sigma_\mu & \sim \; ... & \text{prior models on global parameters} \\ \end{array} \tag{17.2} \end{equation}\]

This hierarchical regression model will build from (17.2) and unfold in three similar layers.

17.2.1.1 Layer 1: variability within runner

The first layer of the simple Normal hierarchical model (17.2) assumes that each runner’s net running times \(Y_{ij}\) vary normally around their own mean time \(\mu_j\), with no consideration of their age \(X_{ij}\):

\[Y_{ij} | \mu_{j}, \sigma_y \sim N(\mu_{j}, \sigma_y^2) \; .\]

To incorporate information about age into our understanding of running times within runners, we can replace the \(\mu_{j}\) with runner-specific means \(\mu_{ij}\) which depend upon the runner’s age in their \(i\)th race, \(X_{ij}\). There’s more than one approach, but we’ll start with the following:

\[\mu_{ij} = \beta_{0j} + \beta_1 X_{ij} \; .\]

Thus the first layer of our hierarchical model describes the relationship between net times and age within each runner \(j\) by:

\[\begin{equation} Y_{ij} | \beta_{0j}, \beta_1, \sigma_y \sim N(\mu_{ij}, \sigma_y^2) \;\; \text{ where } \; \mu_{ij} = \beta_{0j} + \beta_1 X_{ij} \; . \tag{17.3} \end{equation}\]

For each runner \(j\), (17.3) assumes that their running times are Normally distributed around an age- and runner-specific mean, \(\beta_{0j} + \beta_1 X_{ij}\), with standard deviation \(\sigma_y\). This model depends upon three parameters: \(\beta_{0j}\), \(\beta_1\), and \(\sigma_y\). Paying special attention to subscripts, only \(\beta_{0j}\) depends upon \(j\), thus is runner or group-specific:

  • \(\beta_{0j}\) = intercept of the regression model for runner \(j\).

The other parameters are global, or shared across all runners \(j\):

  • \(\beta_1\) = global age coefficient, i.e. the typical change in a runner’s net time per one year increase in age; and
  • \(\sigma_y\) = within group variability around the mean regression model \(\beta_{0j} + \beta_1 X_{ij}\), hence a measure of the strength of the relationship between an individual runner’s time and their age.

Putting this together, the first layer of our hierarchical model (17.3) assumes that relationships between running time and age randomly vary from runner to runner, having different intercepts \(\beta_{0j}\) but a shared age coefficient \(\beta_1\). Figure 17.2 illustrates these assumptions and helps us translate them within our context: though some runners tend to be faster or slower than others, runners experience roughly the same increase in running times as they age.

This is an untitled chart with no subtitle or caption. It has x-axis 'age' with labels 0, 20, 40 and 60. It has y-axis 'net' with labels 0, 50 and 100. It has 7 layers. Layer 1 is a set of 36 points. Layer 1 has size set to 0.75. Layer 2 is a segment graph that VI can not process. Layer 2 has size set to 0.25. Layer 3 is a set of 185 points. Layer 3 has colour set to vivid violet. Layer 3 has size set to 2. Layer 4 is a text graph that VI can not process. Layer 4 has colour set to vivid violet. Layer 5 is a segment graph that VI can not process. Layer 5 has colour set to vivid violet. Layer 6 is a segment graph that VI can not process. Layer 6 has colour set to vivid violet. Layer 7 is a text graph that VI can not process. Layer 7 has colour set to vivid violet.

FIGURE 17.2: Hypothetical mean regression models, \(\beta_{0j} + \beta_1 X\), for our 36 sampled runners under (17.3). The black dots represent the runner-specific intercepts \(\beta_{0j}\) which vary normally around \(\beta_0\) with standard deviation \(\sigma_0\).

17.2.1.2 Layer 2: variability between runners

As with the simple hierarchical model (17.2), the first layer of our hierarchical regression model (17.3) captured the relationship between running time and age within runners. It’s in the next layer that we must capture how the relationships between running time and age vary from runner to runner, i.e. between runners.

Which of our current model parameters, \((\beta_{0j}, \beta_1, \sigma_y)\), will we model in the next layer of the hierarchical model? What is a reasonable structure for this model?

Since it’s the only regression feature that we’re assuming can vary from runner to runner, the next layer will model variability in the intercept parameters \(\beta_{0j}\). It’s important to recognize here that our 36 sample runners are drawn from the same broader population of runners. Thus instead of taking a no pooling approach which assigns each \(\beta_{0j}\) a unique prior, hence assumes that one runner \(j\) can’t tell us about another, these intercepts should share a prior. To this end we’ll assume that the runner-specific intercept parameters, hence baseline running speeds, vary normally around some mean \(\beta_0\) with standard deviation \(\sigma_0\):

\[\begin{equation} \beta_{0j} | \beta_0, \sigma_0 \stackrel{ind}{\sim} N(\beta_0, \sigma_0^2) \; . \tag{17.4} \end{equation}\]

Figure 17.2 adds context to this layer of the hierarchical model which depends upon two new parameters:

  • \(\beta_0\) = the global average intercept across all runners, i.e. the average runner’s baseline speed; and
  • \(\sigma_0\) = between group variability in intercepts \(\beta_{0j}\), i.e. the extent to which baseline speeds vary from runner to runner.

17.2.1.3 Layer 3: global priors

We’ve now completed the first two layers of our hierarchical regression model which reflect the relationships of running time and age, within and between runners. As with any Bayesian model, the last step is to specify our priors.

For which model parameters must we specify priors in the final layer of our hierarchical regression model?

Building from (17.3) and (17.4), the final layer of our hierarchical regression model must specify priors for the global parameters: \(\beta_0\), \(\beta_1\), \(\sigma_y\), and \(\sigma_0\). It’s these global parameters that describe the entire population of runners, not just those in our sample. As usual, we’ll utilize independent Normal priors for regression coefficients \(\beta_0\) and \(\beta_1\), where our prior understanding of baseline \(\beta_0\) is expressed through the centered intercept \(\beta_{0c}\). Further, we’ll utilize independent Exponential priors for standard deviation terms \(\sigma_y\) and \(\sigma_0\). The final hierarchical regression model thus combines information about the relationship between running times \(Y_{ij}\) and age \(X_{ij}\) within runners (17.3) with information about how baseline speeds \(\beta_{0j}\) vary between runners (17.4) with our prior understanding of the broader global population of runners. As embodied by Figure 17.2, this particular model is often referred to as a hierarchical random intercepts model:

\[\begin{equation} \begin{array}{rll} Y_{ij} | \beta_{0j}, \beta_1, \sigma_y & \sim N(\mu_{ij}, \sigma_y^2) \;\; \text{ with } \;\; \mu_{ij} = \beta_{0j} + \beta_1 X_{ij} & \text{(regression model} \\ && \text{ WITHIN runner $j$)} \\ \beta_{0j} | \beta_0, \sigma_0 & \stackrel{ind}{\sim} N(\beta_0, \sigma_0^2) & \text{(variability in baseline speeds} \\ && \text{ BETWEEN runners)}\\ \beta_{0c} & \sim N(m_0, s_0^2) & \text{(priors on global parameters)} \\ \beta_1 & \sim N(m_1, s_1^2) & \\ \sigma_y & \sim \text{Exp}(l_y) & \\ \sigma_0 & \sim \text{Exp}(l_0) & \\ \end{array} \tag{17.5} \end{equation}\]

The complete set of model assumptions is summarized below.

Normal hierarchical regression assumptions

Let \(Y_{ij}\) and \(X_{ij}\) denote observations for the \(i\)th observation in group \(j\). The appropriateness of the Bayesian Normal hierarchical regression model (17.5) of \(Y_{ij}\) by \(X_{ij}\) depends upon the following assumptions.

  • Structure of the data
    Conditioned on predictor \(X_{ij}\), the outcomes \(Y_{ij}\) on any one group \(j\) are independent of those on another group \(k\), \(Y_{ik}\). However, different data points within the same group \(j\), \(Y_{ij}\) and \(Y_{hj}\), are correlated.
  • Structure of the relationship
    Within any group \(j\), the typical outcome of \(Y_{ij}\) (\(\mu_{ij}\)) can be written as a linear function of predictor \(X_{ij}\).
  • Structure of the variability within groups
    Within any group \(j\) and at any predictor value \(X_{ij}\), the observed values of \(Y_{ij}\) will vary normally around mean \(\mu_{ij}\) with consistent standard deviation \(\sigma_y\).
  • Structure of the variability between groups
    The group-specific baselines or intercepts, \(\beta_{0j}\), vary normally around a global intercept \(\beta_0\) with standard deviation \(\sigma_0\).

Connecting the hierarchical and complete pooled models

The complete pooled model (17.1) is a special case of (17.5). These two models are equivalent when \(\sigma_0 = 0\), i.e. when the intercepts do not differ from group to group.

17.2.2 Another way to think about it

Recall from Chapter 16 that there are two ways to think about group-specific parameters. First, in our hierarchical regression model of running times (17.5), we incorporated runner-specific intercept parameters \(\beta_{0j}\) to indicate that the baseline speed can vary from runner to runner \(j\). We thought of these runner-specific intercepts as normal deviations from some global intercept \(\beta_0\) with standard deviation \(\sigma_0\): \(\beta_{0j} \sim N(\beta_0, \sigma_0^2)\). Equivalently, we can think of the runner-specific intercepts as random tweaks or adjustments \(b_{0j}\) to \(\beta_0\),

\[\beta_{0j} = \beta_0 + b_{0j} \;\]

where these tweaks are normal deviations from 0 with standard deviation \(\sigma_0\):

\[b_{0j} \sim N(0, \sigma_0^2) \; .\]

Consider an example. Suppose that some runner \(j\) has a baseline running speed of \(\beta_{0j} = 24\) minutes, whereas the average baseline speed across all runners is \(\beta_0 = 19\) minutes. Thus at any age, runner \(j\) tends to run 5 minutes slower than average. That is, \(b_{0j} = 5\) and

\[\beta_{0j} = \beta_0 + b_{0j} = 19 + 5 = 24 \; .\]

In general then, we can reframe Layers 1 and 2 of our hierarchical model (17.5) as follows:

\[\begin{equation} \begin{split} Y_{ij} | \beta_{0j}, \beta_1, \sigma_y & \sim N(\mu_{ij}, \sigma_y^2) \;\; \text{ with } \;\; \mu_{ij} = (\beta_0 + b_{0j}) + \beta_1 X_{ij} \\ b_{0j} | \sigma_0 & \stackrel{ind}{\sim} N(0, \sigma_0^2) \\ \beta_{0c} & \sim N(m_0, s_0^2) \\ \beta_1 & \sim N(m_1, s_1^2) \\ \sigma_y & \sim \text{Exp}(l_y) \\ \sigma_0 & \sim \text{Exp}(l_0) \\ \end{split} \tag{17.6} \end{equation}\]

17.2.3 Tuning the prior

With the hierarchical regression model framework in place (17.5), let’s tune the priors to match our prior understanding in this running context. Pretending that we didn’t already see data in Chapter 15, our prior understanding is as follows:

  • The typical runner in this age group runs somewhere between an 8 minute mile and a 12 minute mile during a 10 mile race, thus has a net time somewhere between 80 and 120 minutes for the entire race. As such we’ll set the prior model for the centered global intercept to \(\beta_{0c} \sim N(100, 10^2)\). (This centered intercept is much easier to think about than the raw intercept \(\beta_0\), the typical net time for a 0-year-old runner!)
  • We’re pretty sure that the typical runner’s net time in the 10 mile race will, on average, increase over time. We’re not very sure about the rate of this increase, but think it’s likely between 0.5 and 4.5 minutes per year. Thus we’ll set our prior for the global age coefficient to \(\beta_1 \sim N(2.5, 1^2)\).
  • Beyond the typical net time for the typical runner, we do not have a clear prior understanding of the variability between runners (\(\sigma_0\)), nor of the degree to which a runner’s net times might fluctuate from their regression trend (\(\sigma_y\)). Thus we’ll utilize weakly informative priors on these standard deviation parameters.

Our final tuning of the hierarchical random intercepts model follows, where the priors on \(\sigma_y\) and \(\sigma_0\) are assigned by the stan_glmer() simulation below:

\[\begin{equation} \begin{split} Y_{ij} | \beta_{0j}, \beta_1, \sigma_y & \sim N(\mu_{ij}, \sigma_y^2) \;\; \text{ with } \;\; \mu_{ij} = \beta_{0j} + \beta_1 X_{ij} \\ \beta_{0j} | \beta_0, \sigma_0 & \stackrel{ind}{\sim} N(\beta_0, \sigma_0^2) \\ \beta_{0c} & \sim N(100, 10^2) \\ \beta_1 & \sim N(2.5, 1^2) \\ \sigma_y & \sim \text{Exp}(0.072) \\ \sigma_0 & \sim \text{Exp}(1) \\ \end{split} \tag{17.7} \end{equation}\]

To get a sense for the combined meaning of our prior models, we simulate 20,000 prior parameter sets using stan_glmer() with the following special arguments:

  • We specify the model of net times by age by the formula net ~ age + (1 | runner). This essentially combines a non-hierarchical regression formula (net ~ age) with that for a hierarchical model with no predictor (net ~ (1 | runner)).
  • We specify prior_PD = TRUE to indicate that we wish to simulate parameters from the prior, not posterior, models.
running_model_1_prior <- stan_glmer(
  net ~ age + (1 | runner), 
  data = running, family = gaussian,
  prior_intercept = normal(100, 10),
  prior = normal(2.5, 1), 
  prior_aux = exponential(1, autoscale = TRUE),
  prior_covariance = decov(reg = 1, conc = 1, shape = 1, scale = 1),
  chains = 4, iter = 5000*2, seed = 84735, 
  prior_PD = TRUE)

The simulation results describe 20,000 prior plausible scenarios for the relationship between running time and age, within and between our 36 runners. Though we encourage you to plot many more on your own, we show just 4 prior plausible scenarios of what the mean regression models, \(\beta_{0j} + \beta_1 X\), might look like for our 36 runners (Figure 17.3 left). The variety across these prior scenarios reflects our general uncertainty about running. Though each scenario is consistent with our sense that runners likely slow down over time, the rate of increase ranges quite a bit. Further, in examining the distances between the runner-specific regression lines, some scenarios reflect a plausibility that there’s little difference between runners, whereas others suggest that some runners might be much faster than others.

Finally, we also simulate 100 datasets of race outcomes from the prior model, across a variety of runners and ages. The 100 density plots in Figure 17.3 (right) reflect the distribution of the net times in these simulated datasets. There is, again, quite a range in these simulations. Though some span ridiculous outcomes (eg: negative net running times), the variety in the simulations and the general set of values they cover, adequately reflect our prior understanding and uncertainty. For example, since a 25-30 minute mile is a good walking (not running) pace, the upper values near 250-300 minutes for the entire 10 mile race seem reasonable.

set.seed(84735)
running %>% 
  add_fitted_draws(running_model_1_prior, n = 4) %>%
  ggplot(aes(x = age, y = net)) +
    geom_line(aes(y = .value, group = paste(runner, .draw))) + 
    facet_wrap(~ .draw)

running %>%
  add_predicted_draws(running_model_1_prior, n = 100) %>%
  ggplot(aes(x = net)) +
    geom_density(aes(x = .prediction, group = .draw)) +
    xlim(-100,300)
This is an untitled chart with no subtitle or caption. The chart is comprised of 4 panels containing sub-charts, arranged horizontally. The panels represent different values of .draw. Each sub-chart has x-axis 'age' with labels 50.0, 52.5, 55.0, 57.5 and 60.0. Each sub-chart has y-axis 'net' with labels 50, 100 and 150. Panel 1 represents data for .draw = 6137. Panel 1 is a set of 36 lines. It has size set to 0.1. Panel 2 represents data for .draw = 14136. Panel 2 is a set of 36 lines. It has size set to 0.1. Panel 3 represents data for .draw = 16233. Panel 3 is a set of 36 lines. It has size set to 0.1. Panel 4 represents data for .draw = 18518. Panel 4 is a set of 36 lines. It has size set to 0.1. This is an untitled chart with no subtitle or caption. It has x-axis 'net' with labels -100, 0, 100, 200 and 300. It has y-axis 'density' with labels 0.000, 0.025, 0.050, 0.075 and 0.100. The chart is a density graph that VI can not process. It has size set to 0.05.

FIGURE 17.3: Simulated scenarios under the prior models of the hierarchical regression model (17.7). At left are 4 prior plausible sets of 36 runner-specific relationships between running time and age, \(\beta_{0j} + \beta_1 X\). At right are density plots of 100 prior plausible sets of net running time data.

17.2.4 Posterior simulation & analysis

After all of the build up, let’s counter our prior understanding of the relationship between running time and age with some data. We plot the running times by age for each of our 36 runners below. This reaffirms our model building process and some of our prior hunches. Most runners’ times do tend to increase with age, and there is variability between the runners themselves – some tend to be faster than others.

ggplot(running, aes(x = age, y = net)) + 
  geom_point() + 
  facet_wrap(~ runner)
This is an untitled chart with no subtitle or caption. The chart is comprised of 36 panels containing sub-charts, arranged horizontally. The panels represent different values of runner. Each sub-chart has x-axis 'age' with labels 50, 55 and 60. Each sub-chart has y-axis 'net' with labels 60, 80, 100 and 120. Panel 1 represents data for runner = 1. Panel 1 is a set of 5 points. The points are at: (53, 83.98),  (54, 74.3),  (55, 75.15),  (56, 74.22) and  (57, 74.25) Panel 2 represents data for runner = 2. Panel 2 is a set of 5 points. The points are at: (52, 82.67),  (53, 80.03),  (54, 88.12),  (55, 80.93) and  (56, 88.1) Panel 3 represents data for runner = 3. Panel 3 is a set of 6 points. The points are at: (51, 89.38),  (52, 89.23),  (53, 89.72),  (54, 88.68),  (55, 97.55) and  (56, 85.05) Panel 4 represents data for runner = 4. Panel 4 is a set of 5 points. The points are at: (53, 98.45),  (54, 101.58),  (55, 109.05),  (56, 100.05) and  (58, 106.1) Panel 5 represents data for runner = 5. Panel 5 is a set of 6 points. The points are at: (51, 76.23),  (52, 71.28),  (53, 76.87),  (54, 68.53),  (56, 82.02) and  (57, 83.78) Panel 6 represents data for runner = 6. Panel 6 is a set of 4 points. The points are at: (55, 72.87),  (58, 77.5),  (59, 80.93) and  (61, 78.32) Panel 7 represents data for runner = 7. Panel 7 is a set of 5 points. The points are at: (51, 96.27),  (52, 99.05),  (53, 104.38),  (54, 107.9) and  (55, 100.5) Panel 8 represents data for runner = 8. Panel 8 is a set of 4 points. The points are at: (53, 99.5),  (54, 104.17),  (55, 102.73) and  (58, 111.08) Panel 9 represents data for runner = 9. Panel 9 is a set of 5 points. The points are at: (55, 96.48),  (56, 94.18),  (57, 94.98),  (58, 113.27) and  (61, 105.03) Panel 10 represents data for runner = 10. Panel 10 is a set of 5 points. The points are at: (54, 121),  (55, 107.35),  (56, 110.67),  (57, 118.05) and  (59, 131) Panel 11 represents data for runner = 11. Panel 11 is a set of 5 points. The points are at: (52, 79.87),  (53, 83.6),  (54, 87.13),  (55, 97.75) and  (56, 98.92) Panel 12 represents data for runner = 12. Panel 12 is a set of 5 points. The points are at: (53, 67.5),  (54, 68.45),  (55, 70.18),  (56, 80.93) and  (57, 71.78) Panel 13 represents data for runner = 13. Panel 13 is a set of 5 points. The points are at: (53, 77.43),  (54, 82.1),  (55, 80.18),  (56, 79.5) and  (57, 91.9) Panel 14 represents data for runner = 14. Panel 14 is a set of 7 points. The points are at: (53, 98.4),  (54, 102.58),  (55, 97.75),  (56, 94.57),  (57, 89.95),  (58, 96.22) and  (59, 97.48) Panel 15 represents data for runner = 15. Panel 15 is a set of 5 points. The points are at: (52, 96.8),  (53, 88.37),  (54, 100.18),  (55, 96.52) and  (56, 98.43) Panel 16 represents data for runner = 16. Panel 16 is a set of 5 points. The points are at: (52, 98.3),  (53, 92.32),  (54, 89.85),  (55, 96.68) and  (57, 92.35) Panel 17 represents data for runner = 17. Panel 17 is a set of 5 points. The points are at: (52, 78.42),  (53, 110.23),  (54, 91.78),  (55, 123.98) and  (57, 95.87) Panel 18 represents data for runner = 18. Panel 18 is a set of 4 points. The points are at: (52, 96.88),  (53, 106.12),  (54, 99.47) and  (56, 103.92) Panel 19 represents data for runner = 19. Panel 19 is a set of 4 points. The points are at: (54, 81.9),  (55, 86.97),  (56, 79.87) and  (58, 83.87) Panel 20 represents data for runner = 20. Panel 20 is a set of 5 points. The points are at: (52, 104.12),  (53, 97.48),  (54, 105.2),  (55, 112.88) and  (56, 120.47) Panel 21 represents data for runner = 21. Panel 21 is a set of 6 points. The points are at: (53, 83),  (54, 83.53),  (55, 82.12),  (56, 85.95),  (57, 92.4) and  (58, 89.58) Panel 22 represents data for runner = 22. Panel 22 is a set of 5 points. The points are at: (52, 83.23),  (53, 88.07),  (54, 84.53),  (55, 92.18) and  (56, 91.67) Panel 23 represents data for runner = 23. Panel 23 is a set of 6 points. The points are at: (53, 102.13),  (54, 101.08),  (55, 98.17),  (56, 106.33),  (57, 105) and  (58, 110.98) Panel 24 represents data for runner = 24. Panel 24 is a set of 6 points. The points are at: (55, 99.08),  (56, 103.93),  (57, 103.82),  (58, 105.92),  (59, 111.17) and  (60, 112.97) Panel 25 represents data for runner = 25. Panel 25 is a set of 5 points. The points are at: (51, 98.43),  (52, 109.83),  (55, 105.73),  (56, 98.93) and  (57, 94.48) Panel 26 represents data for runner = 26. Panel 26 is a set of 5 points. The points are at: (51, 83.42),  (52, 82.72),  (53, 84.6),  (54, 87.35) and  (55, 87.77) Panel 27 represents data for runner = 27. Panel 27 is a set of 6 points. The points are at: (50, 91.97),  (51, 92.3),  (52, 92.47),  (53, 102.5),  (55, 105.53) and  (56, 106.13) Panel 28 represents data for runner = 28. Panel 28 is a set of 5 points. The points are at: (50, 72.15),  (51, 76.13),  (52, 77.92),  (53, 75.1) and  (54, 74.77) Panel 29 represents data for runner = 29. Panel 29 is a set of 6 points. The points are at: (55, 62.72),  (56, 62.47),  (57, 61.52),  (58, 63.28),  (59, 63.98) and  (60, 64.37) Panel 30 represents data for runner = 30. Panel 30 is a set of 6 points. The points are at: (55, 62.28),  (56, 63.63),  (57, 65.97),  (58, 66.15),  (59, 73.48) and  (60, 67.8) Panel 31 represents data for runner = 31. Panel 31 is a set of 5 points. The points are at: (52, 77.32),  (53, 88),  (54, 80.3),  (55, 82.8) and  (56, 84.32) Panel 32 represents data for runner = 32. Panel 32 is a set of 5 points. The points are at: (51, 76.07),  (52, 75.2),  (53, 79.53),  (55, 81.58) and  (57, 80.93) Panel 33 represents data for runner = 33. Panel 33 is a set of 4 points. The points are at: (56, 90.07),  (58, 85.72),  (59, 83.77) and  (60, 83) Panel 34 represents data for runner = 34. Panel 34 is a set of 5 points. The points are at: (54, 70.53),  (55, 74.02),  (56, 73.03),  (57, 80.77) and  (58, 80.77) Panel 35 represents data for runner = 35. Panel 35 is a set of 6 points. The points are at: (53, 85.75),  (54, 85.02),  (55, 83.92),  (56, 90.03),  (58, 89.97) and  (59, 87.15) Panel 36 represents data for runner = 36. Panel 36 is a set of 4 points. The points are at: (53, 105.75),  (54, 103.02),  (56, 105.42) and  (57, 115.77)

FIGURE 17.4: Scatterplots of the net running time by age for each of the 36 runners.

Combining our prior understanding with this data, we take a syntactical shortcut to simulate the posterior random intercepts model (17.5) of net times by age: we update() the running_model_1_prior with prior_PD = FALSE. We encourage you to follow this up with a check of the prior tunings as well as some Markov chain diagnostics:

# Simulate the posterior
running_model_1 <- update(running_model_1_prior, prior_PD = FALSE)

# Check the prior specifications
prior_summary(running_model_1)

# Markov chain diagnostics
mcmc_trace(running_model_1)
mcmc_dens_overlay(running_model_1)
mcmc_acf(running_model_1)
neff_ratio(running_model_1)
rhat(running_model_1)

There are a whopping 40 parameters in our model: 36 runner-specific intercept parameters (\(\beta_{0j}\)) in addition to 4 global parameters (\(\beta_0, \beta_1, \sigma_y, \sigma_0\)). These are labeled as follows in the stan_glmer() simulation results:

  • (Intercept) = \(\beta_0\)
  • age = \(\beta_1\)
  • b[(Intercept) runner:j] = \(b_{0j} = \beta_{0j} - \beta_0\), the difference between runner \(j\)’s baseline speed and the average baseline speed
  • sigma = \(\sigma_y\)
  • Sigma[runner:(Intercept),(Intercept)] = \(\sigma_0^2\)

We’ll keep our focus on the big themes here, first those related to the relationship between running time and age for the typical runner, and then those related to the variability from this average.

17.2.4.1 Posterior analysis of the global relationship

To begin, consider the global relationship between running time and age for the typical runner:

\[\beta_0 + \beta_1 X \; .\]

Posterior summaries for \(\beta_0\) and \(\beta_1\), which are fixed across runners, are shown below.

tidy_summary_1 <- tidy(running_model_1, effects = "fixed",
                       conf.int = TRUE, conf.level = 0.80)
tidy_summary_1
# A tibble: 2 x 5
  term        estimate std.error conf.low conf.high
  <chr>          <dbl>     <dbl>    <dbl>     <dbl>
1 (Intercept)    19.1     11.9       3.74     34.5 
2 age             1.30     0.213     1.02      1.58

Accordingly, there’s an 80% chance that the typical runner tends to slow down somewhere between 1.02 and 1.58 minutes per year. The fact that this range is entirely and comfortably above 0 provides significant evidence that the typical runner tends to slow down with age. This assertion is visually supported by the 200 posterior plausible global model lines below, superimposed with their posterior median, all of which exhibit positive associations between running time and age. In plotting these model lines, note that we use add_fitted_draws() with re_formula = NA to specify that we are interested in the global, not group-specific, model of running times:

B0 <- tidy_summary_1$estimate[1]
B1 <- tidy_summary_1$estimate[2]
running %>%
  add_fitted_draws(running_model_1, n = 200, re_formula = NA) %>%
  ggplot(aes(x = age, y = net)) +
    geom_line(aes(y = .value, group = .draw), alpha = 0.1) +
    geom_abline(intercept = B0, slope = B1, color = "blue") +
    lims(y = c(75, 110))
This is an untitled chart with no subtitle or caption. It has x-axis 'age' with labels 50.0, 52.5, 55.0, 57.5 and 60.0. It has y-axis 'net' with labels 80, 90, 100 and 110. It has 2 layers. Layer 1 is a set of 200 lines. Layer 1 has alpha set to 0.05. Layer 2 is an abline graph that VI can not process. Layer 2 has colour set to vivid violet.

FIGURE 17.5: 200 posterior plausible global model lines, \(\beta_0 + \beta_1 X\), for the relationship between running time and age.

Statistical vs practical significance

When we use the term “significant” in the discussion above, we’re referring to statistical significance. This merely indicates that a positive relationship exists. In this case, we also believe that a 1.02 to 1.58 minute per year slow down is of a meaningful magnitude that has real-world implications. That is, this positive relationship is also practically significant.

Don’t let the details distract you from the important punchline here! By incorporating the group structure of our data, our hierarchical random intercepts model has what the complete pooled model (17.1) lacked: the power to detect a significant relationship in the relationship between running time and age. Our discussion in Chapter 15 revealed why this happens in our running analysis: pooling all runners’ data together masks the fact that most individual runners slow down with age.

17.2.4.2 Posterior analysis of group-specific relationships

In our next step, let’s examine what the hierarchical random intercepts model reveals about the runner-specific relationships between net running time and age,

\[\beta_{0j} + \beta_1 X_{ij} \; = (\beta_0 + b_{0j}) + \beta_1 X_{ij} \; .\]

We’ll do so by combining what we learned about the global age parameter \(\beta_1\) above, with information on the runner-specific intercept terms \(\beta_{0j}\). The latter will require the specialized syntax we built up in Chapter 16, thus some patience. First, the b[(Intercept) runner:j] chains correspond to the difference in the runner-specific and global intercepts \(b_{0j}\). Thus we obtain MCMC chains for each \(\beta_{0j} = \beta_0 + b_{0j}\) by adding the (Intercept) chain to the b[(Intercept) runner:j] chains via spread_draws() and mutate(). We then use median_qi() to obtain posterior summaries of the \(\beta_{0j}\) chain for each runner \(j\):

# Posterior summaries of runner-specific intercepts
runner_summaries_1 <- running_model_1 %>%
  spread_draws(`(Intercept)`, b[,runner]) %>% 
  mutate(runner_intercept = `(Intercept)` + b) %>% 
  select(-`(Intercept)`, -b) %>% 
  median_qi(.width = 0.80) %>% 
  select(runner, runner_intercept, .lower, .upper)

Consider the results for runners 4 and 5. With a posterior median intercept of 30.8 minutes vs 6.7 minutes, runner 4 seems to have a slower baseline speed than runner 5. Thus at any shared age, we would expect runner 4 to run roughly 24.1 minutes slower than runner 5 (\(30.8 - 6.7\)):

runner_summaries_1 %>% 
  filter(runner %in% c("runner:4", "runner:5"))
# A tibble: 2 x 4
  runner   runner_intercept .lower .upper
  <chr>               <dbl>  <dbl>  <dbl>
1 runner:4            30.8   15.3    46.3
2 runner:5             6.66  -8.42   21.9

These observations are echoed in the plots below which display 100 posterior plausible models of net time by age for runners 4 and 5:

# 100 posterior plausible models for runners 4 & 5
running_45 %>%
  filter(runner %in% c("4", "5")) %>% 
  add_fitted_draws(running_model_1, n = 100) %>%
  ggplot(aes(x = age, y = net)) +
    geom_line(
      aes(y = .value, group = paste(runner, .draw), color = runner),
      alpha = 0.1) +
    geom_point(data = running_45, aes(color = runner))
This is an untitled chart with no subtitle or caption. It has x-axis 'age' with labels 52, 54, 56 and 58. It has y-axis 'net' with labels 70, 80, 90, 100 and 110. There is a legend indicating colour is used to show runner, with 2 levels: 4 shown as black colour and  5 shown as brilliant blue colour. It has 2 layers. Layer 1 is a set of 200 lines. Layer 1 has alpha set to 0.1. Layer 2 is a set of 11 points.

FIGURE 17.6: 100 posterior plausible models of running time by age, \(\beta_{0j} + \beta_1 X\), for subjects \(j \in \{4,5\}\).

We can similarly explore the models for all 36 runners, \(\beta_{0j} + \beta_1 X_{ij}\). For a quick comparison, the runner-specific posterior median models are plotted below and superimposed with the posterior median global model, \(\beta_0 + \beta_1 X_{ij}\). This drives home the point that the global model represents the relationship between running time and age for the most average runner. The individual runner models vary around this global average, some with faster baseline speeds (\(\beta_{0j} < \beta_0\)) and some with slower baseline speeds (\(\beta_{0j} > \beta_0\)).

# Plot runner-specific models with the global model
ggplot(running, aes(y = net, x = age, group = runner)) + 
  geom_abline(data = runner_summaries_1, color = "gray",
              aes(intercept = runner_intercept, slope = B1)) + 
  geom_abline(intercept = B0, slope = B1, color = "blue") + 
  lims(x = c(50, 61), y = c(50, 135))
This is an untitled chart with no subtitle or caption. It has x-axis 'age' with labels 50.0, 52.5, 55.0, 57.5 and 60.0. It has y-axis 'net' with labels 60, 80, 100 and 120. It has 2 layers. Layer 1 is an abline graph that VI can not process. Layer 1 has colour set to light gray. Layer 2 is an abline graph that VI can not process. Layer 2 has colour set to vivid violet.

FIGURE 17.7: The posterior median models for our 36 runners \(j\) as calculated from the hierarchical random intercepts model (gray), with the posterior median global model (blue).

17.2.4.3 Posterior analysis of within and between group variability

All of this talk about the variability in runner-specific models brings us to a posterior consideration of the final remaining model parameters, \(\sigma_y\) and \(\sigma_0\). Whereas \(\sigma_y\) measures the variability from the mean regression model within each runner, \(\sigma_0\) measures the variability in baseline running speeds between the runners. The simulated datasets in Figure 17.8 provide some intuition. In scenario (a), the variability from the mean model within both groups (\(\sigma_y\)) is quite small relative to the variability in the models between the groups (\(\sigma_0\)), leading to a great distinction between these two groups. In scenario (b), \(\sigma_y\) is larger than \(\sigma_0\), leading to little distinction between the groups.

This is an untitled chart with no subtitle or caption. The chart is comprised of 2 panels containing sub-charts, arranged horizontally. The panels represent different values of sims. Each sub-chart has x-axis 'x' with labels -1, 0, 1, 2 and 3. Each sub-chart has y-axis 'y' with labels -20, 0, 20 and 40. In this chart colour is used to show runner. The legend that would normally indicate this has been hidden. Each sub-chart has 2 layers. Panel 1 represents data for sims = (a). Layer 1 of panel 1 is a set of 100 points. Layer 2 of panel 1 is an abline graph that VI can not process. Panel 2 represents data for sims = (b). Layer 1 of panel 2 is a set of 100 points. Layer 2 of panel 2 is an abline graph that VI can not process.

FIGURE 17.8: Simulated output for the relationship between response variable \(Y\) and predictor \(X\) when \(\sigma_y < \sigma_0\) (a) and \(\sigma_y > \sigma_0\) (b).

Posterior tidy() summaries for our variance parameters suggest that the running analysis is more like scenario (a) than scenario (b). For a given runner \(j\), we estimate that their observed running time at any age will deviate from their mean regression model by roughly 5.25 minutes (\(\sigma_y\)). By the authors’ assessment (none of us professional runners!), this deviation is rather small in the context of a long 10-mile race, suggesting a rather strong relationship between running times and age within runners. In contrast, we expect that baseline speeds vary by roughly 13.3 minutes from runner to runner (\(\sigma_0\)).

tidy_sigma <- tidy(running_model_1, effects = "ran_pars")
tidy_sigma
# A tibble: 2 x 3
  term                    group    estimate
  <chr>                   <chr>       <dbl>
1 sd_(Intercept).runner   runner      13.3 
2 sd_Observation.Residual Residual     5.25

Comparatively then, the posterior results suggest that \(\sigma_y < \sigma_0\) – there’s greater variability in the models between runners than variability from the model within runners. Think about this another way. As with the simple hierarchical model in Section 16.3.3, we can decompose the total variability in race times across all runners and races into that explained by the variability between runners and that explained by the variability within each runner (16.8):

\[\text{Var}(Y_{ij}) = \sigma_0^2 + \sigma_y^2 \; .\]

Thus proportionally (16.9), differences between runners account for roughly 86.62% (the majority!) of the total variability in racing times, with fluctuations among individual races within runners explaining the other 13.38%:

sigma_0 <- tidy_sigma[1,3]
sigma_y <- tidy_sigma[2,3]
sigma_0^2 / (sigma_0^2 + sigma_y^2)
  estimate
1   0.8662
sigma_y^2 / (sigma_0^2 + sigma_y^2)
  estimate
1   0.1338

17.3 Hierarchical model with varying intercepts & slopes

Since you’ve gotten this far in the book, you know that once in a while it’s important to stand back from the details and ask: can we do even better? A plot of the data for just four of our runners, Figure 17.9, suggests that the hierarchical random intercepts model (17.5) might oversimplify reality. Though this model recognizes that some runners tend to be faster than others, it assumes that the change in running time with age (\(\beta_1\)) is the same for each runner. In reality, whereas some runners do slow down at similar rates (eg: runners 4 and 5), some slow down quicker (eg: runner 20) and some barely at all (eg: runner 29).

# Plot runner-specific models in the data
running %>% 
  filter(runner %in% c("4", "5", "20", "29")) %>% 
  ggplot(., aes(x = age, y = net)) + 
    geom_point() + 
    geom_smooth(method = "lm", se = FALSE) + 
    facet_grid(~ runner)
This is an untitled chart with no subtitle or caption. The chart is comprised of 4 panels containing sub-charts, arranged horizontally. The panels represent different values of runner. Each sub-chart has x-axis 'age' with labels 52.5, 55.0, 57.5 and 60.0. Each sub-chart has y-axis 'net' with labels 60, 80, 100 and 120. Each sub-chart has 2 layers. Panel 1 represents data for runner = 4. Layer 1 of panel 1 is a set of 5 points. The points are at: (53, 98.45),  (54, 101.58),  (55, 109.05),  (56, 100.05) and  (58, 106.1) Layer 2 of panel 1 is a 'lm' smoothed curve.Panel 2 represents data for runner = 5. Layer 1 of panel 2 is a set of 6 points. The points are at: (51, 76.23),  (52, 71.28),  (53, 76.87),  (54, 68.53),  (56, 82.02) and  (57, 83.78) Layer 2 of panel 2 is a 'lm' smoothed curve.Panel 3 represents data for runner = 20. Layer 1 of panel 3 is a set of 5 points. The points are at: (52, 104.12),  (53, 97.48),  (54, 105.2),  (55, 112.88) and  (56, 120.47) Layer 2 of panel 3 is a 'lm' smoothed curve.Panel 4 represents data for runner = 29. Layer 1 of panel 4 is a set of 6 points. The points are at: (55, 62.72),  (56, 62.47),  (57, 61.52),  (58, 63.28),  (59, 63.98) and  (60, 64.37) Layer 2 of panel 4 is a 'lm' smoothed curve.

FIGURE 17.9: Scatterplots and observed trends in running times vs age for four subjects.

A snapshot of the observed trends for all 36 runners provides a more complete picture of just how much the change in net time with age might vary by runner:

ggplot(running, aes(x = age, y = net, group = runner)) + 
  geom_smooth(method = "lm", se = FALSE, size = 0.5)
This is an untitled chart with no subtitle or caption. It has x-axis 'age' with labels 50.0, 52.5, 55.0, 57.5 and 60.0. It has y-axis 'net' with labels 60, 80, 100 and 120. The chart is a 'lm' smoothed curve.It has size set to 0.5.

FIGURE 17.10: The observed trends in running times vs age for all 36 subjects.

How can we modify the random intercepts model (17.5) to recognize that the rate at which running time changes with age might vary from runner to runner?

17.3.1 Model building

Inspired by Figure 17.10, our goal is to build a model which recognizes that in the relationship between running time and age, both the intercept (i.e. baseline speed) and slope (i.e. rate at which speed changes with age) might vary from runner to runner. To this end, we can replace the global age coefficient \(\beta_1\) in (17.5) by a runner-specific coefficient \(\beta_{1j}\). Thus the model of the relationship between running time and age within each runner \(j\) becomes:

\[\begin{equation} Y_{ij} | \beta_{0j}, \beta_{1j}, \sigma_y \sim N(\mu_{ij}, \sigma_y^2) \;\; \text{ where } \mu_{ij} = \beta_{0j} + \beta_{1j} X_{ij} \; . \tag{17.8} \end{equation}\]

Accordingly, just as we assumed in (17.5) that the runner-specific intercepts \(\beta_{0j}\) are Normally distributed around some global intercept \(\beta_0\) with standard deviation \(\sigma_0\), we now also assume that the runner-specific age coefficients \(\beta_{1j}\) are Normally distributed around some global age coefficient \(\beta_1\) with standard deviation \(\sigma_1\):

\[\begin{equation} \beta_{0j} | \beta_0, \sigma_0 \sim N(\beta_0, \sigma_0^2) \;\;\;\; \text{ and } \;\;\;\; \beta_{1j} | \beta_1, \sigma_1 \sim N(\beta_1, \sigma_1^2) \tag{17.9} \end{equation}\]

But these priors aren’t yet complete – \(\beta_{0j}\) and \(\beta_{1j}\) work together to describe the model for runner \(j\), thus are correlated. Let \(\rho \in [-1,1]\) represent the correlation between \(\beta_{0j}\) and \(\beta_{1j}\). To reflect this correlation, we represent the joint Normal model of \(\beta_{0j}\) and \(\beta_{1j}\) by

\[\begin{equation} \left(\begin{array}{c} \beta_{0j} \\ \beta_{1j} \\ \end{array}\right) \vert \beta_0, \beta_1, \sigma_0, \sigma_1 \;\; \sim \;\; N\left( \left(\begin{array}{c} \beta_0 \\ \beta_1 \\ \end{array}\right), \; \Sigma \right) \tag{17.10} \end{equation}\]

where \((\beta_0, \beta_1)\) is the joint mean and \(\Sigma\) is the 2x2 covariance matrix which encodes the variability and correlation amongst \(\beta_{0j}\) and \(\beta_{1j}\):

\[\begin{equation} \Sigma = \left(\begin{array}{cc} \sigma_0^2 & \rho \sigma_0 \sigma_1 \\ \rho \sigma_0 \sigma_1 & \sigma_1^2 \\ \end{array}\right) \; . \tag{17.11} \end{equation}\]

Though this notation looks overwhelming, it simply indicates that \(\beta_{0j}\) and \(\beta_{1j}\) are both marginally Normal (17.9) and have correlation \(\rho\). The correlation \(\rho\) between the runner-specific intercepts and slopes, \(\beta_{0j}\) and \(\beta_{1j}\), isn’t just a tedious mathematical detail. It’s an interesting feature of the hierarchical model! Figure 17.11 provides some insight. Plot (a) illustrates the scenario in which there’s a strong negative correlation between \(\beta_{0j}\) and \(\beta_{1j}\) – models that start out lower (with small \(\beta_{0j}\)) tend to increase at a more rapid rate (with higher \(\beta_{1j}\)). In plot (c) there’s a strong positive correlation between \(\beta_{0j}\) and \(\beta_{1j}\) – models that start out higher (with larger \(\beta_{0j}\)) also tend to increase at a more rapid rate (with higher \(\beta_{1j}\)). In between these two extremes, plot (b) illustrates the scenario in which there’s no correlation between the group-specific intercepts and slopes.

This chart has title '(a)'. It has x-axis 'x' with labels 0, 10, 20 and 30. It has y-axis 'y' with labels 0, 25, 50, 75 and 100. It has 3 layers. Layer 1 is an abline graph that VI can not process. Layer 1 has size set to 0.2. Layer 2 is 1 horizontal line as follows: Line at y position 0. Layer 3 is a vline graph that VI can not process. This chart has title '(b)'. It has x-axis 'x' with labels 0, 1, 2, 3, 4 and 5. It has y-axis 'y' with labels 0, 25, 50 and 75. It has 3 layers. Layer 1 is a segment graph that VI can not process. Layer 1 has size set to 0.2. Layer 2 is 1 horizontal line as follows: Line at y position 0. Layer 3 is a vline graph that VI can not process. This chart has title '(c)'. It has x-axis 'x' with labels 0, 10, 20 and 30. It has y-axis 'y' with labels 0, 25, 50, 75 and 100. It has 3 layers. Layer 1 is an abline graph that VI can not process. Layer 1 has size set to 0.2. Layer 2 is 1 horizontal line as follows: Line at y position 0. Layer 3 is a vline graph that VI can not process.

FIGURE 17.11: Simulated output for the relationship between response variable \(Y\) and predictor \(X\) when \(\rho < 0\) (a), \(\rho = 0\) (b), and \(\rho > 0\) (c).

Consider the implications of this correlation in the context of our running analysis.73

  1. In our running example, what would it mean for \(\beta_{0j}\) and \(\beta_{1j}\) to be negatively correlated?
    1. Runners that start out slower (i.e. with a higher baseline), also tend to slow down at a more rapid rate.
    2. The rate at which runners slow down over time isn’t associated with how fast they start out.
    3. Runners that start out faster (i.e. with a lower baseline), tend to slow down at a more rapid rate.
  2. Similarly, what would it mean for \(\beta_{0j}\) and \(\beta_{1j}\) to be positively correlated?

The completed hierarchical model (17.12) pulls together (17.8) and (17.10) with priors for the global parameters. For reasons you might imagine, this is often referred to as a hierarchical random intercepts and slopes model:

\[\begin{equation} \begin{array}{rl} Y_{ij} | \beta_{0j}, \beta_{1j}, \sigma_y & \sim N(\mu_{ij}, \sigma_y^2) \;\; \text{ where } \; \mu_{ij} = \beta_{0j} + \beta_{1j} X_{ij} \\ & \\ \left(\begin{array}{c} \beta_{0j} \\ \beta_{1j} \\ \end{array}\right) \vert \beta_0, \beta_1, \sigma_0, \sigma_1 & \sim N\left( \left(\begin{array}{c} \beta_0 \\ \beta_1 \\ \end{array}\right), \; \Sigma \right) \\ & \\ \beta_{0c} & \sim N(100, 10^2) \\ \beta_1 & \sim N(2.5, 1^2) \\ \sigma_y & \sim \text{Exp}(0.072) \\ \Sigma & \sim \text{(decomposition of covariance)} \\ \end{array} \tag{17.12} \end{equation}\]

Equivalently, we can re-express the random intercepts and slopes as random tweaks to the global intercept and slope: \(\mu_{ij} = (\beta_0 + b_{0j}) + (\beta_1 + b_{1j}) X_{ij}\) with

\[\left(\begin{array}{c} b_{0j} \\ b_{1j} \\ \end{array}\right) \vert \; \sigma_0, \sigma_1 \sim N\left( \left(\begin{array}{c} 0 \\ 0 \\ \end{array}\right), \; \Sigma \right) \; .\]

Connecting our hierarchical models

The random intercepts model (17.5) is a special case of (17.12). When \(\sigma_1 = 0\), i.e. when the group-specific age coefficients do not differ from group to group, these two models are equivalent.

Most of the pieces in this model are familiar. For global parameters \(\beta_0\) and \(\beta_1\) we use the tuned Normal priors from (17.7). For \(\sigma_y\) we use a weakly informative prior. Yet there is one big new piece. We need a joint prior model to express our understanding of how the combined \(\sigma_0\), \(\sigma_1\), and \(\rho\) parameters define covariance matrix \(\Sigma\) (17.11). At the writing of this book, the stan_glmer() function allows users to define this prior through a decomposition of covariance, or decov(), model. Very generally speaking, this model decomposes our prior model for the covariance matrix into prior information about three separate pieces:

  1. the correlation between the group-specific intercepts and slopes, \(\rho\) (Figure 17.11);

  2. the combined degree to which the intercepts and slopes vary by group, \(\sigma_0^2 + \sigma_1^2\); and74

  3. the relative proportion of this variability between groups that’s due to differing intercepts vs differing slopes,

    \[\pi_0 = \frac{\sigma_0^2}{\sigma_0^2 + \sigma_1^2} \;\; \text{ vs } \;\; \pi_1 = \frac{\sigma_1^2}{\sigma_0^2 + \sigma_1^2} \; .\]

Figure 17.12 provides some context on this third piece, displaying a few scenarios for the relationship between \(\pi_0\) and \(\pi_1\). In general, \(\pi_0\) and \(\pi_1\) always sum to 1, thus have a push and pull relationship. For example, when \(\pi_0 \approx 1\) and \(\pi_1 \approx 0\), the variability in intercepts (\(\sigma_0^2\)) is large in comparison to the variability in slopes (\(\sigma_1^2\)). Thus the majority of the variability between group-specific models is explained by differences in intercepts (plot a). In contrast, when \(\pi_0 \approx 0\) and \(\pi_1 \approx 1\), the majority of the variability between group-specific models is explained by differences in slopes (plot c). In between these extremes, when \(\pi_0\) and \(\pi_1\) are both approximately 0.5, roughly half of the variability between groups can be explained by differing intercepts and the other half by differing slopes (plot b).

This chart has title '(a)'. It has x-axis 'x' with labels 0, 5, 10 and 15. It has y-axis 'y' with labels -10, 0, 10, 20 and 30. It has 3 layers. Layer 1 is a segment graph that VI can not process. Layer 1 has size set to 0.15. Layer 2 is 1 horizontal line as follows: Line at y position 0. Layer 3 is a vline graph that VI can not process. This chart has title '(b)'. It has x-axis 'x' with labels -2, 0, 2 and 4. It has y-axis 'y' with labels -10, 0, 10, 20 and 30. It has 3 layers. Layer 1 is a segment graph that VI can not process. Layer 1 has size set to 0.15. Layer 2 is 1 horizontal line as follows: Line at y position 0. Layer 3 is a vline graph that VI can not process. This chart has title '(c)'. It has x-axis 'x' with labels -2.5, 0.0, 2.5, 5.0, 7.5 and 10.0. It has y-axis 'y' with labels 0, 25, 50, 75 and 100. It has 3 layers. Layer 1 is a segment graph that VI can not process. Layer 1 has size set to 0.15. Layer 2 is 1 horizontal line as follows: Line at y position 0. Layer 3 is a vline graph that VI can not process.

FIGURE 17.12: Simulated output for the relationship between response variable \(Y\) and predictor \(X\) when \(\pi_0 = 1\) and \(\pi_1 = 0\) (a), \(\pi_0 = \pi_1 = 0.5\) (b), and \(\pi_0 = 0\) and \(\pi_1 = 1\) (c).

In our analysis, we’ll utilize the weakly informative default setting for the hierarchical random intercepts and slopes model: decov(reg = 1, conc = 1, shape = 1, scale = 1) in rstanarm notation. This makes the following prior assumptions regarding the three pieces above:

  1. the correlation \(\rho\) is equally likely to be anywhere between -1 and 1;
  2. we have weakly informative prior information about the total degree to which the intercepts and slopes vary by runner; and
  3. the relative proportion of the variability between runners that’s due to differing intercepts is equally likely to be anywhere between 0 and 1, i.e. we’re not at all sure if there’s more, less, or the same level of variability in the baseline speeds from runner to runner, \(\beta_{0j}\), than in the rate at which their speeds change over time, \(\beta_{1j}\).

We’ll utilize these default assumptions for the covariance prior in this book. Beyond the defaults, specifying and tuning the decomposition of covariance prior requires the acquisition of two new probability models. We present more optional detail in the next section and refer the curious reader to Gabry and Goodrich (2020a) for a more mathematical treatment that scales up to models beyond those considered here.

17.3.2 Optional: the decomposition of covariance model

Let’s take a closer look at the decomposition of covariance model. To begin, we define the three pieces that are important to specifying our prior understanding of the random intercepts and slopes covariance matrix \(\Sigma\) (17.11), thus the \((\sigma_0, \sigma_1, \rho)\) parameters by which it’s defined. These pieces are numbered in accordance to their corresponding interpretations above:

\[\begin{array}{rlr} R & = \left(\begin{array}{cc} 1 & \rho \\ \rho & 1 \end{array}\right) & \text{(1)} \\ \tau & = \sqrt{\sigma_0^2 + \sigma_1^2} & \text{(2)}\\ \pi & = \left(\begin{array}{c} \pi_0 \\ \pi_1 \end{array}\right) = \left(\begin{array}{c} \frac{\sigma_0^2}{\sigma_0^2 + \sigma_1^2} \\ \frac{\sigma_1^2}{\sigma_0^2 + \sigma_1^2} \end{array}\right) & \text{(3)}\\ \end{array}\]

We can decompose \(\Sigma\) into a product which depends on \(R\), \(\tau\), and \(\pi\). If you know some linear algebra, you can confirm this result, though the fact that we can rewrite \(\Sigma\) is what’s important here:

\[\begin{split} \left(\begin{array}{cc} \sigma_0^2 & \rho\sigma_0\sigma_1 \\ \rho\sigma_0\sigma_1 & \sigma_1^2 \end{array}\right) & = \left(\begin{array}{cc} \sigma_0 & 0 \\ 0 & \sigma_1 \end{array}\right) \left(\begin{array}{cc} 1 & \rho \\ \rho & 1 \end{array}\right)\left(\begin{array}{cc} \sigma_0 & 0 \\ 0 & \sigma_1 \end{array}\right) = \text{diag}(\sigma_0, \sigma_1) \; R \; \text{diag}(\sigma_0, \sigma_1) \\ \end{split}\]

We can further decompose \((\sigma_0, \sigma_1)\) into the product of \(\tau\) and \(\sqrt{\pi}\):

\[\left(\begin{array}{c} \sigma_0 \\ \sigma_1 \end{array}\right) = \tau \sqrt{\pi} \; .\]

And since we can rewrite \(\Sigma\) using \(R\), \(\tau\), and \(\pi\), we can also express our prior understanding of \(\Sigma\) by our combined prior understanding of these three pieces. This joint prior, which we simply expressed above by

\[\Sigma \sim \text{(decomposition of covariance)}\]

is actually defined by three individual priors:

\[\begin{split} R & \sim \text{LKJ}(\eta) \\ \tau & \sim \text{Gamma}(s, r) \\ \pi & \sim \text{Dirichlet}(2, \delta) \\ \end{split}\]

Let’s begin with the “LKJ” prior model on the correlation matrix \(R\) with regularization hyperparameter \(\eta > 0\). In our model (17.12), \(R\) depends only on the correlation \(\rho\) between the group-specific intercepts \(\beta_{0j}\) and slopes \(\beta_{1j}\). Thus the LKJ prior model simplifies to a prior model on \(\rho\) with pdf

\[f(\rho) = \left[2^{1-2\eta}\frac{\Gamma(2\eta)}{\Gamma(\eta)\Gamma(\eta)} \right] (1 - \rho^2)^{\eta - 1} \;\;\;\; \text{ for } \rho \in [-1,1] \;. \]

Figure 17.13 displays the LKJ pdf under a variety of regularization parameters \(\eta\), illustrating the important comparison of \(\eta\) to 1:

  • Setting \(\eta < 1\) indicates a prior understanding that the group-specific intercepts and slopes are most likely strongly correlated, though we’re not sure if this correlation is negative or positive.
  • When \(\eta = 1\), the LKJ model is uniform from -1 to 1, indicating that the correlation between the intercepts and slopes is equally likely to be anywhere in this range – we’re not really sure.
  • Setting \(\eta > 1\) indicates a prior understanding that the group-specific intercepts and slopes are most likely weakly correlated (\(\rho \approx 0\)). The greater \(\eta\) is, the tighter and tighter the LKJ hugs values of \(\rho\) near 0.
This chart has title 'LKJ(0.5)'. It has x-axis 'rho' with labels -1.0, -0.5, 0.0, 0.5 and 1.0. It has y-axis 'paste(&quot;f(&quot;, rho, &quot;)&quot;, sep = &quot;&quot;)' with labels 0.0, 0.4, 0.8 and 1.2. The chart is a function graph that VI can not process. This chart has title 'LKJ(1)'. It has x-axis 'rho' with labels -1.0, -0.5, 0.0, 0.5 and 1.0. It has y-axis 'paste(&quot;f(&quot;, rho, &quot;)&quot;, sep = &quot;&quot;)' with labels 0.0, 0.4, 0.8 and 1.2. The chart is a function graph that VI can not process. This chart has title 'LKJ(4)'. It has x-axis 'rho' with labels -1.0, -0.5, 0.0, 0.5 and 1.0. It has y-axis 'paste(&quot;f(&quot;, rho, &quot;)&quot;, sep = &quot;&quot;)' with labels 0.0, 0.4, 0.8 and 1.2. The chart is a function graph that VI can not process.

FIGURE 17.13: The LKJ pdf under three regularization parameters.

Next, for the total standard deviation in the intercepts and slopes, \(\tau = \sqrt{\sigma_0^2 + \sigma_1^2}\), we utilize the “usual” Gamma prior (or its Exponential special case). Finally, consider the prior for the \(\pi_0\) and \(\pi_1\) parameters. Recall that \(\pi_0\) and \(\pi_1\) measure the relative proportion of the variability between groups that’s due to differing intercepts vs differing slopes, respectively. Thus \(\pi_0\) and \(\pi_1\) are both restricted to values between 0 and 1 and must sum to 1:

\[\pi_0 + \pi_1 = \frac{\sigma_0^2}{\sigma_0^2 + \sigma_1^2} + \frac{\sigma_1^2}{\sigma_0^2 + \sigma_1^2} = 1\;.\]

Accordingly, we can utilize a joint symmetric Dirichlet\((2,\delta)\) prior with concentration hyperparameter \(\delta > 0\) for \((\pi_0, \pi_1)\). The symmetric Dirichlet pdf defines the relative prior plausibility of valid \((\pi_0, \pi_1)\) pairs:

\[f(\pi_0,\pi_1) = \frac{\Gamma(2\delta)}{\Gamma(\delta)\Gamma(\delta)} (\pi_0 \pi_1)^{\delta - 1} \;\;\;\; \text{ for } \; \pi_0,\pi_1 \in [0,1] \; \text{ and } \; \pi_0 + \pi_1 = 1 \;.\]

In fact, in the special case when we have only two group-specific parameters, \(\beta_{0j}\) and \(\beta_{1j}\), the symmetric Dirichlet model for \((\pi_0, \pi_1)\) is equivalently expressed by:

\[\pi_0 \sim \text{Beta}(\delta, \delta) \;\; \text{ and } \;\;\pi_1 = 1 - \pi_0 \; .\]

Figure 17.14 displays the marginal symmetric Dirichlet pdf, i.e. Beta pdf, for \(\pi_0\) under a variety of concentration parameters \(\delta\), illustrating the important comparison of \(\delta\) to 1:

  • Setting \(\delta < 1\) places more prior weight on proportions \(\pi_0\) near 0 or 1. This indicates a prior understanding that relatively little (\(\pi_0 \approx 0\)) or relatively much (\(\pi_0 \approx 1\)) of the variability between groups is explained by differences in intercepts rather than differences in slopes (Figure 17.12 plots a and c).
  • When \(\delta = 1\), the marginal pdf on \(\pi_0\) is uniform from -1 to 1, indicating that the variability in intercepts explains anywhere between 0 and all of the variability between groups – we’re not really sure.
  • Setting \(\delta > 1\) indicates a prior understanding that roughly half of the variability between groups is explained by differences in intercepts and the other half by differences in slopes (Figure 17.12 plot b). The greater \(\delta\) is, the tighter and tighter this prior hugs values of \(\pi_0\) near 0.5.
This chart has title 'Dirichlet(0.5)'. It has x-axis 'pi[0]' with labels 0.00, 0.25, 0.50, 0.75 and 1.00. It has y-axis 'paste(&quot;f(&quot;, pi[0], &quot;)&quot;, sep = &quot;&quot;)' with labels 0.0, 0.5, 1.0, 1.5, 2.0 and 2.5. The chart is a function graph that VI can not process. This chart has title 'Dirichlet(1)'. It has x-axis 'pi[0]' with labels 0.00, 0.25, 0.50, 0.75 and 1.00. It has y-axis 'paste(&quot;f(&quot;, pi[0], &quot;)&quot;, sep = &quot;&quot;)' with labels 0.0, 0.5, 1.0, 1.5, 2.0 and 2.5. The chart is a function graph that VI can not process. This chart has title 'Dirichlet(4)'. It has x-axis 'pi[0]' with labels 0.00, 0.25, 0.50, 0.75 and 1.00. It has y-axis 'paste(&quot;f(&quot;, pi[0], &quot;)&quot;, sep = &quot;&quot;)' with labels 0.0, 0.5, 1.0, 1.5, 2.0 and 2.5. The chart is a function graph that VI can not process.

FIGURE 17.14: The marginal symmetric Dirichlet pdf under three concentration parameters.

When our models have both group-specific intercepts and slopes, we’ll use the following default decomposition of variance priors which indicate general uncertainty about the correlation between group-specific intercepts and slopes, the overall variability in group-specific model, and the relative degree to which this variability is explained by differing intercepts vs differing slopes:

\[\begin{split} R & \sim \text{LKJ}(1) \\ \tau & \sim \text{Gamma(1,1)}, \text{ i.e. } \text{Exp}(1) \\ \pi & \sim \text{Dirichlet}(2, 1) \\ \end{split}\]

In rstanarm notation, this prior is expressed by decov(reg = 1, conc = 1, shape = 1, scale = 1). To tune this prior, you can change the LKJ regularization parameter \(\eta\), the Gamma shape and scale parameters, and the Dirichlet concentration parameter \(\delta\).

17.3.3 Posterior simulation & analysis

Finally, let’s simulate the posterior of our hierarchical random intercepts and slopes model of running times (17.12). This requires one minor tweak to our stan_glmer() call: instead of using the formula net ~ age + (1 | runner) we use net ~ age + (age | runner).

running_model_2 <- stan_glmer(
  net ~ age + (age | runner),
  data = running, family = gaussian,
  prior_intercept = normal(100, 10),
  prior = normal(2.5, 1), 
  prior_aux = exponential(1, autoscale = TRUE),
  prior_covariance = decov(reg = 1, conc = 1, shape = 1, scale = 1),
  chains = 4, iter = 5000*2, seed = 84735, adapt_delta = 0.99999
)

# Confirm the prior model specifications
prior_summary(running_model_2)

This simulation will be sloooooooow.

Notice the additional argument in our stan_glmer() syntax: adapt_delta = 0.99999. Simply put, adapt_delta is a tuning parameter for the underlying MCMC algorithm, the technical details of which are outside the scope of this book. Prior to this example, we’ve been running our simulations using the default adapt_delta value of 0.95. However, in this particular example, the default produces a warning: There were 1 divergent transitions after warmup. This warning indicates that the MCMC algorithm had a tough time exploring the posterior plausible range of our parameter values. When encountering this issue, one strategy is to increase adapt_delta to some value closer to 1. Doing so produces a much slower, but more stable, simulation. For more technical details on divergent transitions and how to address them, we recommend Modrak (2019).

17.3.3.1 Posterior analysis of the global and group-specific parameters

Remember thinking that the 40 parameters in the random intercepts model was a lot? This new model has 78 parameters: 36 runner-specific intercept parameters \(\beta_{0j}\), 36 runner-specific age coefficients \(\beta_{1j}\), and 6 global parameters (\(\beta_0, \beta_1, \sigma_y, \sigma_0, \sigma_1, \rho\)). Let’s examine these piece by piece, starting with the global model of the relationship between running time and age,

\[\beta_0 + \beta_1 X \;.\]

The results here for the random intercepts and slopes model (17.12) are quite similar to those for the random intercepts model (17.5): the posterior median model is 18.5 + 1.32 age.

# Quick summary of global regression parameters
tidy(running_model_2, effects = "fixed", conf.int = TRUE, conf.level = 0.80)
# A tibble: 2 x 5
  term        estimate std.error conf.low conf.high
  <chr>          <dbl>     <dbl>    <dbl>     <dbl>
1 (Intercept)    18.5     11.6       3.61     33.6 
2 age             1.32     0.217     1.04      1.59

Since the global mean model \(\beta_0 + \beta_1 X\) captures the relationship between running time and age for the average runner, we shouldn’t be surprised that our two hierarchical models produced similar assessments. Where these two models start to differ is in their assessments of the runner-specific relationships. Obtaining the MCMC chains for the runner-specific intercepts and slopes gets quite technical. We encourage you to pick through the code below, line by line. Here are some important details to pick up on:

  • Line 2 uses b[term, runner] to grab the chains for all runner-specific parameters. As usual now, these chains correspond to \(b_{0j}\) and \(b_{1j}\), the differences between the runner-specific vs global intercepts and age coefficients.
  • Line 3 creates separate columns for each of the \(b_{0j}\) and \(b_{1j}\) chains and names these b_(Intercept) and b_age.
  • Line 4 obtains the runner-specific intercepts \(\beta_{0j} = \beta_0 + b_{0j}\), named runner_intercept, by summing the global (Intercept) and runner-specific adjustments b_(Intercept). The runner-specific \(\beta_{1j}\) coefficients, runner_age, are created similarly.
# Get MCMC chains for the runner-specific intercepts & slopes
runner_chains_2 <- running_model_2 %>%
  spread_draws(`(Intercept)`, b[term, runner], `age`) %>% 
  pivot_wider(names_from = term, names_glue = "b_{term}",
              values_from = b) %>% 
  mutate(runner_intercept = `(Intercept)` + `b_(Intercept)`,
         runner_age = age + b_age)

From these chains, we can obtain the posterior medians for each runner-specific intercept and age coefficient. Since we’re only obtaining posterior medians here, we use summarize() in combination with group_by() instead of using the median_qi() function:

# Posterior medians of runner-specific models
runner_summaries_2 <- runner_chains_2 %>% 
  group_by(runner) %>% 
  summarize(runner_intercept = median(runner_intercept),
            runner_age = median(runner_age))

# Check it out
head(runner_summaries_2, 3)
# A tibble: 3 x 3
  runner    runner_intercept runner_age
  <chr>                <dbl>      <dbl>
1 runner:1              18.6       1.06
2 runner:10             18.5       1.75
3 runner:11             18.5       1.32

Figure 17.15 plots the posterior median models for all 36 runners.

ggplot(running, aes(y = net, x = age, group = runner)) + 
  geom_abline(data = runner_summaries_2, color = "gray",
              aes(intercept = runner_intercept, slope = runner_age)) + 
  lims(x = c(50, 61), y = c(50, 135))
This is an untitled chart with no subtitle or caption. It has x-axis 'age' with labels 50.0, 52.5, 55.0, 57.5 and 60.0. It has y-axis 'net' with labels 60, 80, 100 and 120. The chart is an abline graph that VI can not process. It has colour set to light gray.

FIGURE 17.15: The posterior median models for the 36 runners, as calculated from the hierarchical random intercepts and slopes model.

Hmph. Are you surprised? We were slightly surprised. The slopes do differ, but not as drastically as we expected. But then we remembered – shrinkage! Consider sample runners 1 and 10. Their posteriors suggest that, on average, runner 10’s running time increases by just 1.06 minute per year whereas runner 1’s increases by 1.75 minutes per year:

runner_summaries_2 %>% 
  filter(runner %in% c("runner:1", "runner:10"))
# A tibble: 2 x 3
  runner    runner_intercept runner_age
  <chr>                <dbl>      <dbl>
1 runner:1              18.6       1.06
2 runner:10             18.5       1.75
This is an untitled chart with no subtitle or caption. The chart is comprised of 2 panels containing sub-charts, arranged horizontally. The panels represent different values of runner. Each sub-chart has x-axis 'age' with labels 54, 56 and 58. Each sub-chart has y-axis 'net' with labels 70, 80, 90, 100, 110 and 120. Each sub-chart has 3 layers. Panel 1 represents data for runner = 1. Layer 1 of panel 1 is a 'lm' smoothed curve.Layer 2 of panel 1 is an abline graph that VI can not process. Layer 3 of panel 1 is an abline graph that VI can not process. Layer 3 has linetype set to dashed. Panel 2 represents data for runner = 10. Layer 1 of panel 2 is a 'lm' smoothed curve.Layer 2 of panel 2 is an abline graph that VI can not process. Layer 3 of panel 2 is an abline graph that VI can not process. Layer 3 has linetype set to dashed.

FIGURE 17.16: For runners 1 and 10, the posterior median relationships between running time and age from the hierarchical random intercepts and slopes model (dashed) are contrasted by the observed no pooled models (blue) and the complete pooled model (black).

Figure 17.16 contrasts these posterior median models for runners 1 and 10 (dashed lines) by the complete pooled posterior models (black) and no pooled posterior models (blue). As usual, the hierarchical models strike a balance between these two extremes. Like the no pooled models, the hierarchical models do vary between the two runners. Yet the difference is not as stark. The hierarchical models are drawn away from the no pooled models and toward the complete pooled models. Though this shrinkage is subtle for runner 10, the association between running time and age switches from negative to positive for runner 1. This is to be expected. Unlike the no pooled approach which models runner-specific relationships using only runner-specific data, our hierarchical model assumes that one runner’s behavior can tell us about another’s. Further, we have very few data points on each runner – at most 7 races. With so few observations, the other runners’ information has ample influence on our posterior understanding for any one individual (as it should). In the case of runner 1, the other 35 runners’ data is enough to make us think that this runner, too, will eventually slow down.

17.3.3.2 Posterior analysis of within and between group variability

Stepping back, we should also ask ourselves: is it worth it? Incorporating the random runner-specific age coefficients introduced 37 parameters into our model of running time by age. Yet at least visually, there doesn’t appear to be much variation among the slopes of the runner-specific models. For a numerical assessment of this variation, we can examine the posterior trends in \(\sigma_1\) (sd_age.runner). While we’re at it, we’ll also check out \(\sigma_0\) (sd_(Intercept).runner), \(\rho\) (cor_(Intercept).age.runner), and \(\sigma_y\) (sd_Observation.Residual):

tidy(running_model_2, effects = "ran_pars")
# A tibble: 4 x 3
  term                       group    estimate
  <chr>                      <chr>       <dbl>
1 sd_(Intercept).runner      runner     1.34  
2 sd_age.runner              runner     0.251 
3 cor_(Intercept).age.runner runner    -0.0955
4 sd_Observation.Residual    Residual   5.17  

Consider some highlights of this output:

  • The standard deviation \(\sigma_1\) in the age coefficients \(\beta_{1j}\) is likely around 0.251 minutes per year. On the scale of a 10 mile race, this indicates very little variability between the runners when it comes to the rate at which running times change with age.

  • Per the output for \(\sigma_y\), an individual runner’s net times tend to deviate from their own mean model by roughly 5.17 minutes.

  • There’s a weak negative correlation of roughly -0.0955 between the runner-specific \(\beta_{0j}\) and \(\beta_{1j}\) parameters. Thus it seems that, ever so slightly, runners that start off faster tend to slow down at a faster rate.

17.4 Model evaluation & selection

We have now built a sequence of three models of running time by age: a complete pooled model (17.1), a hierarchical random intercepts model (17.5), and a hierarchical random intercepts and slopes model (17.12). Figure 17.17 provides reminder of the general assumptions behind these three models.

This is an untitled chart with no subtitle or caption. It has x-axis 'x' with labels 50.0, 52.5, 55.0, 57.5 and 60.0. It has y-axis 'y' with labels 60, 80, 100 and 120. The chart is a 'lm' smoothed curve.It has colour set to black. It has size set to 0.4. This is an untitled chart with no subtitle or caption. It has x-axis 'x' with labels 50.0, 52.5, 55.0, 57.5 and 60.0. It has y-axis 'y' with labels 60, 80, 100 and 120. The chart is a set of 36 lines. The lines are broken or missing where NA values appear or where points exceed the plot boundaries. It has size set to 0.4. This is an untitled chart with no subtitle or caption. It has x-axis 'x' with labels 50.0, 52.5, 55.0, 57.5 and 60.0. It has y-axis 'y' with labels 60, 80, 100 and 120. The chart is a 'lm' smoothed curve.It has colour set to black. It has size set to 0.4.

FIGURE 17.17: The idea behind three different approaches to modeling \(Y\) by \(X\) when utilizing group-structured data: complete pooling (left), hierarchical random intercepts (middle), and hierarchical random intercepts and slopes (right).

So which one should we use? To answer this question, we can compare our three models using the framework of Chapters 10 and 11, asking: (1) How fair is each model?; (2) How wrong is each model?; and (3) How accurate are each model’s posterior predictions? Consider question (1). The context and data collection procedure is the same for each model. Since the data has been anonymized and runners are aware that race results will be public, we think this data collection process is fair. Further, though the models produce slightly different conclusions about the relationship between running time and age (eg: the hierarchical models conclude this relationship is significant), none of these conclusions seem poised to have a negative impact on society or individuals. Thus our three models are equally fair.

Next, consider question (2). Posterior predictive checks suggest that the complete pooled model comparatively underestimates the variability in running times – datasets of running time simulated from the complete pooled posterior tend to exhibit a slightly narrower range than the running times we actually observed. Thus the complete pooled model is more wrong than the hierarchical models.

pp_check(complete_pooled_model) + 
  labs(x = "net", title = "complete pooled model")
pp_check(running_model_1) + 
  labs(x = "net", title = "running model 1")
pp_check(running_model_2) + 
  labs(x = "net", title = "running model 2")
This chart has title 'complete pooled model'. It has x-axis 'net' with labels 50, 75, 100, 125 and 150. It has y-axis '' with labels 0.00, 0.01, 0.02 and 0.03. There is a legend indicating colour is used to show colour, with 2 levels: y shown as dark blue colour and  yrep shown as very pale blue colour. It has 2 layers. Layer 1 is a set of 50 lines. Layer 1 has size set to 0.25. Layer 1 has alpha set to 0.7. Layer 2 is a set of 1 line. Line 1 connects 1024 points. Layer 2 has size set to 1. This chart has title 'running model 1'. It has x-axis 'net' with labels 50, 75, 100, 125 and 150. It has y-axis '' with labels 0.00, 0.01, 0.02 and 0.03. There is a legend indicating colour is used to show colour, with 2 levels: y shown as dark blue colour and  yrep shown as very pale blue colour. It has 2 layers. Layer 1 is a set of 50 lines. Layer 1 has size set to 0.25. Layer 1 has alpha set to 0.7. Layer 2 is a set of 1 line. Line 1 connects 1024 points. Layer 2 has size set to 1. This chart has title 'running model 2'. It has x-axis 'net' with labels 50, 75, 100, 125 and 150. It has y-axis '' with labels 0.00, 0.01, 0.02 and 0.03. There is a legend indicating colour is used to show colour, with 2 levels: y shown as dark blue colour and  yrep shown as very pale blue colour. It has 2 layers. Layer 1 is a set of 50 lines. Layer 1 has size set to 0.25. Layer 1 has alpha set to 0.7. Layer 2 is a set of 1 line. Line 1 connects 1024 points. Layer 2 has size set to 1.

FIGURE 17.18: Posterior predictive checks of the complete pooled model (left), random intercepts model (middle), and random intercepts and slopes model (right).

In fact, we know that the complete pooled model is wrong. By ignoring the data’s grouped structure, it incorrectly assumes that each race observation is independent of the others. Depending upon the trade-offs, we might live with this wrong but simplifying assumption in some analyses. Yet at least two signs point to this being a mistake for our running analysis.

  1. The complete pooled model isn’t powerful enough to detect the significant relationship between running time and age.
  2. Not only have we seen visual evidence that some runners tend to be significantly faster or slower than others, the posterior prediction summaries in Section 17.2.4 suggest that there’s significant variability between runners (\(\sigma_0\)).

In light of this discussion, let’s drop the complete pooled model from consideration. In choosing between running_model_1 and running_model_2, consider question (3): what’s the predictive accuracy of these models? Recall some approaches to answering this question from Chapter 11: posterior prediction summaries and ELPD. To begin, we use the prediction_summary() function from the bayesrules package to compare how well these two models predict the running outcomes of the 36 runners that were part of our sample.

# Calculate prediction summaries
set.seed(84735)
prediction_summary(model = running_model_1, data = running)
    mae mae_scaled within_50 within_95
1 2.626      0.456    0.6865     0.973
prediction_summary(model = running_model_2, data = running)
   mae mae_scaled within_50 within_95
1 2.53     0.4424    0.7027     0.973

By all metrics, running_model_1 and running_model_2 produce similarly accurate posterior predictions. For both models, the observed net running times tend to be 2.63 and 2.53 minutes, or 0.46 and 0.44 standard deviations, from their posterior mean predictions. The posterior predictive models also have similar coverage in terms of the percent of observed running times that fall within their 50% and 95% prediction intervals.

Thinking beyond our own sample of runners, we could also utilize prediction_summary_cv() to obtain cross-validated metrics of posterior predictive accuracy. The idea is the same as for non-hierarchical models, but the details change to reflect the grouped structure of our data. To explore how well our models predict the running behavior of runners that weren’t included in our sample data, we divide the runners, not the individual race outcomes, into distinct folds. For example, for a 10-fold cross validation with 36 runners, each fold would include data on 3 or 4 of the sample runners. Thus we would train each of 10 models using data on 32 or 33 of our sample runners and test it on the other 3 or 4. We include code for the curious but do not run it here.

prediction_summary_cv(model = running_model_1, data = running,
                      k = 10, group = "runner")

For hierarchical models, the prediction_summary_cv() function divides groups, not individual outcomes \(Y\), into distinct folds. Thus if we have 10 groups, 10-fold cross-validation will build each training model using data on 9 groups and test it on the 10th group. This approach makes sense when we want to assess how well our model generalizes to new groups outside our sample for which we have no data. But this isn’t always our goal. Instead, we might want to assess how well our model predicts the new outcomes of groups for which we have at least some data. For example, instead of evaluating how well our model predicts the net times for new runners, we might wish to evaluate how well it predicts the next net time of a runner for whom we have data on past races. In short, prediction_summary_cv() is not a catch-all. For a discussion of the various approaches to cross-validation for hierarchical models, as well as how these can be implemented from scratch, we recommend Vehtari (2019).

Finally, consider one last comparison of our two hierarchical models: the cross-validated expected log-predictive densities (ELPD). The estimated ELPD for running_model_1 is lower (worse) than, though within two standard errors of, the running_model_2 ELPD. Hence, by this metric, there is not a significant difference in the posterior predictive accuracy of our two hierarchical models.

# Calculate ELPD for the 2 models
elpd_hierarchical_1 <- loo(running_model_1)
elpd_hierarchical_2 <- loo(running_model_2)


# Compare the ELPD
loo_compare(elpd_hierarchical_1, elpd_hierarchical_2)
                elpd_diff se_diff
running_model_2  0.0       0.0   
running_model_1 -1.6       1.2   

After reflecting upon our model evaluation, we’re ready to make a final determination: we choose running_model_1. The choice of running_model_1 over the complete_pooled_model was pretty clear: the latter was wrong and didn’t have the power to detect a relationship between running time and age. The choice of running_model_1 over running_model_2 comes down to this: the complexity introduced by the additional random age coefficients in running_model_2 produced little apparent change or benefit. Thus, the additional complexity simply isn’t worth it (at least not to us).

17.5 Posterior prediction

Finally, let’s use our preferred model, running_model_1, to make some posterior predictions. Suppose we want to predict the running time that three different runners will achieve when they’re 61 years old: runner 1, runner 10, and Miles. Though Miles’ running prowess is a mystery, we observed runners 1 and 10 in our sample. Should their trends continue, we expect that runner 10’s time will be slower than that of runner 1 when they’re both 61:

# Plot runner-specific trends for runners 1 & 10
running %>% 
  filter(runner %in% c("1", "10")) %>% 
  ggplot(., aes(x = age, y = net)) + 
    geom_point() + 
    facet_grid(~ runner) + 
    lims(x = c(54, 61))
This is an untitled chart with no subtitle or caption. The chart is comprised of 2 panels containing sub-charts, arranged horizontally. The panels represent different values of runner. Each sub-chart has x-axis 'age' with labels 54, 56, 58 and 60. Each sub-chart has y-axis 'net' with labels 80, 90, 100, 110, 120 and 130. Panel 1 represents data for runner = 1. Panel 1 is a set of 4 points. The points are at: (54, 74.3),  (55, 75.15),  (56, 74.22) and  (57, 74.25) Panel 2 represents data for runner = 10. Panel 2 is a set of 5 points. The points are at: (54, 121),  (55, 107.35),  (56, 110.67),  (57, 118.05) and  (59, 131)

FIGURE 17.19: The observed net running times by age for runners 1 and 10.

In general, let \(Y_{new,j}\) denote a new observation on an observed runner \(j\), specifically runner \(j\)’s running time at age 61. As in Chapter 16, we can approximate the posterior predictive model for \(Y_{new,j}\) by simulating a prediction from the first layer of (17.3), that which describes the variability in race times \(Y_{ij}\), evaluated at each of the 20,000 parameter sets \(\left\lbrace \beta_{0j}^{(i)}, \beta_1^{(i)},\sigma_y^{(i)}\right\rbrace\) in our MCMC simulation:

\[Y_{\text{new},j}^{(i)} | \beta_{0j}, \beta_1, \sigma_y \; \sim \; N\left(\mu_{ij}^{(i)}, \left(\sigma_y^{(i)}\right)^2\right) \;\; \text{ where } \; \mu_{ij}^{(i)} = \beta_{0j}^{(i)} + \beta_1^{(i)} \cdot 61 \; .\]

The resulting posterior predictive model will reflect two sources of uncertainty in runner \(j\)’s race time: the within group sampling variability \(\sigma_y\) (we can’t perfectly predict runner \(j\)’s time from their mean model); and posterior variability in \(\beta_{0j}\), \(\beta_1\), and \(\sigma_y\) (the parameters defining runner \(j\)’s relationship between running time and age are unknown and random). Since we don’t have any data on the baseline speed for our new runner, Miles, there’s a third source of uncertainty in predicting his race time: between group sampling variability \(\sigma_0\) (baseline speeds vary from runner to runner). Though we recommend doing these simulations “by hand” to connect with the concepts of posterior prediction (as we did in Chapter 16), we’ll use the posterior_predict() shortcut function to simulate the posterior predictive models for our three runners:

# Simulate posterior predictive models for the 3 runners
set.seed(84735)
predict_next_race <- posterior_predict(
  running_model_1, 
  newdata = data.frame(runner = c("1", "Miles", "10"),
                       age = c(61, 61, 61)))

These posterior predictive models are plotted in Figure 17.20. As anticipated from their previous trends, our posterior expectation is that runner 10 will have a slower time than runner 1 when they’re 61 years old. Our posterior predictive model for Miles’ net time is somewhere in between these two extremes. The posterior median prediction is just under 100 minutes, similar to what we’d get if we plugged an age of 61 into the global posterior median model for the average runner:

B0 + B1 * 61
[1] 98.54

That is, without any information about Miles, our default assumption is that he’s an average runner. Our uncertainty in this assumption is reflected by the relatively wide posterior predictive model. Naturally, having observed data on runners 1 and 10, we’re more certain about how fast they will be when they’re 61. But Miles is a wild card – he could be really fast or really slow!

# Posterior predictive model plots
mcmc_areas(predict_next_race, prob = 0.8) +
 ggplot2::scale_y_discrete(labels = c("runner 1", "Miles", "runner 10"))
This is an untitled chart with no subtitle or caption. It has x-axis '' with labels 40, 80, 120 and 160. It has y-axis '' with labels runner 1, Miles and runner 10. It has 8 layers. Layer 1 is a blank graph that VI can not process. Layer 2 is a ridgeline graph that VI can not process. Layer 2 has colour set to deep blue. Layer 2 has fill set to very pale blue. Layer 3 is a ridgeline graph that VI can not process. Layer 3 has colour set to white. Layer 3 has fill set to strong blue. Layer 4 is a ridgeline graph that VI can not process. Layer 4 has fill set to white. Layer 4 has colour set to deep blue. Layer 5 is a ridgeline graph that VI can not process. Layer 5 has fill set to white. Layer 5 has colour set to white. Layer 6 is a segment graph that VI can not process. Layer 6 has colour set to deep blue. Layer 7 is a blank graph that VI can not process. Layer 8 is a blank graph that VI can not process.

FIGURE 17.20: Posterior predictive models for the net running times at age 61 for sample runners 1 and 10, as well as Miles, a runner that wasn’t in our original sample.

17.6 Details: longitudinal data

The running data on net times by age is longitudinal. We observe each runner over time, where this time (or aging) is of primary interest. Though our hierarchical models of this relationship account for the correlation in running times within any runner, they make a simplifying assumption about this correlation: it’s the same across all ages. In contrast, you might imagine that observations at one age are more strongly correlated with observations at similar ages. For example, a runner’s net time at age 60 is likely more strongly correlated with their net time at age 59 than at age 50. It’s beyond the scope of this book, but we can adjust the structure of our hierarchical models to account for a longitudinal correlation structure. We encourage the interested reader to check out the bayeslongitudinal R package (Carreño and Cuervo 2017) and the foundational paper by Laird and Ware (1982).

17.7 Example: danceability

Let’s implement our hierarchical regression tools in a different context, by revisiting our spotify data from Chapter 16. There we studied a hierarchical model of song popularity, accounting for the fact that we had grouped data with multiple songs per sampled artist. Here we’ll switch our focus to a song’s danceability and how this might be explained by two features: its genre and valence or mood. The danceability and valence of a song are both measured on a scale from 0 (low) to 100 (high). Thus lower valence scores are assigned to negative / sad / angry songs and higher scores to positive / happy / euphoric songs.

Starting out, we have a vague sense that the typical song has a danceability rating around 50, yet we don’t have any strong prior understanding of the possible relationship between danceability, genre, and valence, nor how this might differ from artist to artist. Thus we’ll utilize weakly informative priors throughout our analysis. Moving on from the priors, load the data necessary to this analysis:

# Import and wrangle the data
data(spotify)
spotify <- spotify %>% 
  select(artist, title, danceability, valence, genre)

Figure 17.21 illustrates both some global and artist-specific patterns in the relationships among these variables:

ggplot(spotify, aes(y = danceability, x = genre)) + 
  geom_boxplot()
ggplot(spotify, aes(y = danceability, x = valence)) + 
  geom_point()
ggplot(spotify, aes(y = danceability, x = valence, group = artist)) + 
  geom_smooth(method = "lm", se = FALSE, size = 0.5)
This is an untitled chart with no subtitle or caption. It has x-axis 'genre' with labels edm, latin, pop, r&amp;b, rap and rock. It has y-axis 'danceability' with labels 40, 60 and 80. The chart is a boxplot comprised of 6 boxes with whiskers. There is a box at x=edm. It has median 71.45. The box goes from 61.6 to 75.45, and the whiskers extend to 47.5 and 90.7. There are 1 outliers for this boxplot. There is a box at x=latin. It has median 73.1. The box goes from 66.4 to 75.85, and the whiskers extend to 64.3 and 87.9. There are 4 outliers for this boxplot. There is a box at x=pop. It has median 72.1. The box goes from 57.15 to 75.7, and the whiskers extend to 34.5 and 95.1. There are 0 outliers for this boxplot. There is a box at x=r&amp;b. It has median 65.5. The box goes from 50.4 to 74.4, and the whiskers extend to 24 and 91.3. There are 0 outliers for this boxplot. There is a box at x=rap. It has median 74.6. The box goes from 61.22 to 82.7, and the whiskers extend to 45.8 and 93.4. There are 1 outliers for this boxplot. There is a box at x=rock. It has median 53.05. The box goes from 45.22 to 60.68, and the whiskers extend to 39.7 and 73.6. There are 0 outliers for this boxplot. This is an untitled chart with no subtitle or caption. It has x-axis 'valence' with labels 25, 50, 75 and 100. It has y-axis 'danceability' with labels 40, 60 and 80. The chart is a set of 350 points. This is an untitled chart with no subtitle or caption. It has x-axis 'valence' with labels 25, 50, 75 and 100. It has y-axis 'danceability' with labels 40, 60 and 80. The chart is a 'lm' smoothed curve.It has size set to 0.5.

FIGURE 17.21: The relationship of danceability with genre (left) and valence (middle) for individual songs. The relationship of danceability with valence by artist (right).

When pooling all songs together, notice that rock songs tend to be the least danceable and rap songs the most (by a slight margin). Further, danceability seems to have a weak but positive association with valence. Makes sense. Sad songs tend not to inspire dance. Taking this all with a grain of salt, recall that the global models might mask what’s actually going on. To this end, the artist-specific models in the final plot paint a more detailed picture. The two key themes to pick up on here are that (1) some artists’ songs tend to be more danceable than others; and (2) the association between danceability and valence might differ among artists, though it’s typically positive.

To model these relationships, let’s define some notation. For song \(i\) within artist \(j\), let \(Y_{ij}\) denote danceability and \(X_{ij1}\) denote valence. Next, note that there are 6 different genres, edm being the baseline or reference. Thus we must define 5 additional predictors, one for each non-edm genre. Specifically let \((X_{ij2}, X_{ij3}, \ldots, X_{ij6})\) be indicators of whether a song falls in the latin, pop, r&b, rap, and rock genres, respectively. For example,

\[X_{ij2} = \begin{cases} 1 & \text{ latin genre} \\ 0 & \text{ otherwise} \\ \end{cases}\]

Thus for an edm song, all genre indicators are 0. We’ll consider two possible models of danceability by valence and genre. The first layer of Model 1 assumes that \(Y_{ij} | (\beta_{0j}, \beta_1, \beta_2, \ldots, \beta_6, \sigma_y) \sim N(\mu_{ij}, \sigma_y^2)\) with

\[\mu_{ij} = \beta_{0j} + \beta_1 X_{ij1} + \beta_2 X_{ij2} + \cdots + \beta_6 X_{ij6} \; .\]

The global coefficients \((\beta_1, \beta_2, \ldots, \beta_6)\) reflect an assumption that the relationships between danceability, valence, and genre are similar for each artist. Yet the artist-specific intercepts \(\beta_{0j}\) assume that, when holding constant a song’s valence and genre, some artists’ songs tend to be more danceable than others’.

The first layer of Model 2 incorporates additional artist-specific valence coefficients, assuming \(Y_{ij} | (\beta_{0j}, \beta_{1j}, \beta_2, \ldots, \beta_6, \sigma_y) \sim N(\mu_{ij}, \sigma_y^2)\) with

\[\mu_{ij} = \beta_{0j} + \beta_{1j} X_{ij1} + \beta_2 X_{ij2} + \cdots + \beta_6 X_{ij6} \; .\]

Thus unlike Model 1, Model 2 assumes that the relationship between danceability and valence might differ by artist. This is consistent with the artist-specific models above – for most artists, danceability increased with the happiness of a song. For others, it decreased. Both models are simulated utilizing weakly informative priors below:

spotify_model_1 <- stan_glmer(
  danceability ~ valence + genre + (1 | artist),
  data = spotify, family = gaussian,
  prior_intercept = normal(50, 2.5, autoscale = TRUE),
  prior = normal(0, 2.5, autoscale = TRUE), 
  prior_aux = exponential(1, autoscale = TRUE),
  prior_covariance = decov(reg = 1, conc = 1, shape = 1, scale = 1),
  chains = 4, iter = 5000*2, seed = 84735)

spotify_model_2 <- stan_glmer(
  danceability ~ valence + genre + (valence | artist), 
  data = spotify, family = gaussian,
  prior_intercept = normal(50, 2.5, autoscale = TRUE),
  prior = normal(0, 2.5, autoscale = TRUE), 
  prior_aux = exponential(1, autoscale = TRUE),
  prior_covariance = decov(reg = 1, conc = 1, shape = 1, scale = 1),
  chains = 4, iter = 5000*2, seed = 84735)

# Check out the prior specifications
prior_summary(spotify_model_1)
prior_summary(spotify_model_2)

Posterior predictive checks of these two models produce similar results, both models producing posterior simulated datasets of song danceability that are consistent with the main features in the original song data.

pp_check(spotify_model_1) +
  xlab("danceability")
pp_check(spotify_model_2) +
  xlab("danceability")
This is an untitled chart with no subtitle or caption. It has x-axis 'danceability' with labels 30, 60, 90 and 120. It has y-axis '' with labels 0.01, 0.02 and 0.03. There is a legend indicating colour is used to show colour, with 2 levels: y shown as dark blue colour and  yrep shown as very pale blue colour. It has 2 layers. Layer 1 is a set of 50 lines. Layer 1 has size set to 0.25. Layer 1 has alpha set to 0.7. Layer 2 is a set of 1 line. Line 1 connects 1024 points. Layer 2 has size set to 1. This is an untitled chart with no subtitle or caption. It has x-axis 'danceability' with labels 25, 50, 75, 100 and 125. It has y-axis '' with labels 0.01, 0.02 and 0.03. There is a legend indicating colour is used to show colour, with 2 levels: y shown as dark blue colour and  yrep shown as very pale blue colour. It has 2 layers. Layer 1 is a set of 50 lines. Layer 1 has size set to 0.25. Layer 1 has alpha set to 0.7. Layer 2 is a set of 1 line. Line 1 connects 1024 points. Layer 2 has size set to 1.

FIGURE 17.22: Posterior predictive checks of Spotify Model 1 (left) and Model 2 (right).

Yet upon a comparison of their ELPDs, we think it’s best to go forward with Model 1. The quality of the two models do not significantly differ, and Model 1 is substantially simpler. Without more data per artist, it’s difficult to know if the artist-specific valence coefficients are insignificant due to the fact that the relationship between danceability and valence doesn’t vary by artist, or if we simply don’t have enough data per artist to determine that it does.

# Calculate ELPD for the 2 models
elpd_spotify_1 <- loo(spotify_model_1)
elpd_spotify_2 <- loo(spotify_model_2)

# Compare the ELPD
loo_compare(elpd_spotify_1, elpd_spotify_2)
                elpd_diff se_diff
spotify_model_2  0.0       0.0   
spotify_model_1 -5.3       3.0   

Digging into Model 1, first consider posterior summaries for the global model parameters (\(\beta_0, \beta_1, \dots, \beta_6)\). For the average artist in any genre, we’d expect danceability to increase by between 2.16 and 3 points for every 10 point increase on the valence scale – a statistically significant but fairly marginal bump. Among genres, it appears that when controlling for valence, only rock is significantly less danceable than edm. Its 80% credible interval is the only one to lie entirely above or below 0, suggesting that for the average artist, the typical danceability of a rock song is between 1.42 and 12 points lower than that of an edm song with the same valence.

tidy(spotify_model_1, effects = "fixed",
     conf.int = TRUE, conf.level = 0.80)
# A tibble: 7 x 5
  term        estimate std.error conf.low conf.high
  <chr>          <dbl>     <dbl>    <dbl>     <dbl>
1 (Intercept)   53.3      3.02     49.4      57.1  
2 valence        0.258    0.0330    0.216     0.300
3 genrelatin     0.362    3.58     -4.15      5.02 
4 genrepop       0.466    2.67     -2.96      3.89 
5 genrer&b      -2.43     2.85     -6.05      1.22 
6 genrerap       1.68     3.15     -2.29      5.68 
7 genrerock     -6.73     4.14    -12.0      -1.42 

In interpreting these summaries, keep in mind that the genre coefficients directly compare each genre to edm alone and not, say, rock to r&b. In contrast, mcmc_areas() offers a useful visual comparison of all genre posteriors. Other than the rock coefficient, 0 is a fairly posterior plausible value for the other genre coefficients, reaffirming that these genres aren’t significantly more or less danceable than edm. There’s also quite a bit of overlap in the posteriors. As such, though there’s evidence that some of these genres are more danceable than others (eg: rap vs r&b), the difference isn’t substantial.

# Plot the posterior models of the genre coefficients
mcmc_areas(spotify_model_1, pars = vars(starts_with("genre")), prob = 0.8) + 
  geom_vline(xintercept = 0)
This is an untitled chart with no subtitle or caption. It has x-axis '' with labels -20, -10, 0 and 10. It has y-axis '' with labels genrerock, genrerap, genrer&amp;b, genrepop and genrelatin. It has 9 layers. Layer 1 is a vline graph that VI can not process. Layer 1 has colour set to purplish white. Layer 1 has size set to 0.5. Layer 2 is a ridgeline graph that VI can not process. Layer 2 has colour set to deep blue. Layer 2 has fill set to very pale blue. Layer 3 is a ridgeline graph that VI can not process. Layer 3 has colour set to white. Layer 3 has fill set to strong blue. Layer 4 is a ridgeline graph that VI can not process. Layer 4 has fill set to white. Layer 4 has colour set to deep blue. Layer 5 is a ridgeline graph that VI can not process. Layer 5 has fill set to white. Layer 5 has colour set to white. Layer 6 is a segment graph that VI can not process. Layer 6 has colour set to deep blue. Layer 7 is a blank graph that VI can not process. Layer 8 is a blank graph that VI can not process. Layer 9 is a vline graph that VI can not process.

FIGURE 17.23: Posterior models for the genre related coefficients in Spotify Model 1.

Finally, consider some posterior results for two artists in our sample: Missy Elliott and Camilo. The below tidy() summary compares the typical danceability levels of these artists to the average artist, \(b_{0j} = \beta_{0j} - \beta_0\). When controlling for valence and genre, Elliott’s songs tend to be significantly more danceable than the average artist’s, whereas Camilo’s tend to be less danceable. For example, there’s an 80% posterior chance that Elliott’s typical song danceability is between 2.78 and 14.8 points higher than average.

tidy(spotify_model_1, effects = "ran_vals",
     conf.int = TRUE, conf.level = 0.80) %>% 
  filter(level %in% c("Camilo", "Missy_Elliott")) %>% 
  select(level, estimate, conf.low, conf.high)
# A tibble: 2 x 4
  level         estimate conf.low conf.high
  <chr>            <dbl>    <dbl>     <dbl>
1 Camilo           -2.00    -7.29      3.44
2 Missy_Elliott     8.69     2.78     14.8 

To predict the danceability of their next songs, our hierarchical regression model takes into consideration the artists’ typical danceability levels as well as the song’s valence and genre. Suppose that both artists’ next songs have a valence score of 80, but true to their genres, Elliot’s is a rap song and Camilo’s is in the Latin genre. Figure 17.24 plots both artists’ posterior predictive models along with that of Mohsen Beats, a rock artist that wasn’t in our sample but also releases a song with a valence level of 80. As we’d expect, the danceability of Elliott’s song is likely to be the highest among these three. Further, though Camilo’s typical danceability is lower than average, we expect Mohsen Beats’s song to be even less danceable since it’s of the least danceable genre.

# Simulate posterior predictive models for the 3 artists
set.seed(84735)
predict_next_song <- posterior_predict(
  spotify_model_1,
  newdata = data.frame(
    artist = c("Camilo", "Mohsen Beats", "Missy Elliott"), 
    valence = c(80, 60, 90), genre = c("latin", "rock", "rap")))

# Posterior predictive model plots
mcmc_areas(predict_next_song, prob = 0.8) +
 ggplot2::scale_y_discrete(
   labels = c("Camilo", "Mohsen Beats", "Missy Elliott"))
This is an untitled chart with no subtitle or caption. It has x-axis '' with labels 0, 50 and 100. It has y-axis '' with labels Camilo, Mohsen Beats and Missy Elliott. It has 8 layers. Layer 1 is a vline graph that VI can not process. Layer 1 has colour set to purplish white. Layer 1 has size set to 0.5. Layer 2 is a ridgeline graph that VI can not process. Layer 2 has colour set to deep blue. Layer 2 has fill set to very pale blue. Layer 3 is a ridgeline graph that VI can not process. Layer 3 has colour set to white. Layer 3 has fill set to strong blue. Layer 4 is a ridgeline graph that VI can not process. Layer 4 has fill set to white. Layer 4 has colour set to deep blue. Layer 5 is a ridgeline graph that VI can not process. Layer 5 has fill set to white. Layer 5 has colour set to white. Layer 6 is a segment graph that VI can not process. Layer 6 has colour set to deep blue. Layer 7 is a blank graph that VI can not process. Layer 8 is a blank graph that VI can not process.

FIGURE 17.24: Posterior predictive models for the popularity of the next songs by Missy Elliott, Mohsen Beats, and Camilo.

One last thing

If you’re paying close attention, you might notice a flaw in our model. It produces posterior predictions of danceability that exceed the maximum danceability level of 100. This is a consequence of using the Normal to model a response variable with a limited range (here 0 - 100). The Normal assumption was “good enough” here, but we might consider replacing it with, say, a Beta model.

17.8 Chapter summary

In Chapter 17 we explored how to build a hierarchical model of \(Y\) by predictors \(X\) when working with group structured or hierarchical data. Suppose we have multiple data points on each of \(m\) groups. For each group \(j \in \{1,2,\ldots,m\}\), let \(Y_{ij}\) and \((X_{ij1}, X_{ij2}, \ldots, X_{ijp})\) denote the \(i\)th set of observed data on response variable \(Y\) and \(p\) different predictors \(X\). Then a Normal hierarchical regression model of \(Y\) vs \(X\) consists of three layers: it combines information about the relationship between \(Y\) and \(X\) within groups, with information about how these relationships vary between groups, with our prior understanding of the broader global population. Letting \(\beta_j\) denote a set of group-specific parameters and (\(\beta, \sigma_y, \sigma\)) a set of global parameters:

\[\begin{array}{rll} Y_{ij} | \beta_j, \sigma_y & \sim N(\mu_{ij}, \sigma_y^2) & \text{regression model within group $j$} \\ \beta_j | \beta, \sigma & \sim N(\beta, \sigma^2) & \text{variability in regression parameters between groups} \\ \beta, \sigma_y, \sigma, ... & \sim \cdots & \text{prior models on global parameters} \\ \end{array}\]

Where we have some choices to make is in the definition of the regression mean \(\mu_{ij}\). In the simplest case of a random intercepts model, we assume that groups might have unique baselines \(\beta_{0j}\), yet share a common relationship between \(Y\) and \(X\):

\[\begin{split} \mu_{ij} & = \beta_{0j} + \beta_1 X_{ij1} + \beta_2 X_{ij2} + \cdots + \beta_p X_{ijp} \\ & = (\beta_0 + b_{0j}) + \beta_1 X_{ij1} + \beta_2 X_{ij2} + \cdots + \beta_p X_{ijp} \\ \end{split}\]

In the most complicated case of a random intercepts and slopes model, we assume that groups have unique baselines and unique relationships between \(Y\) and each \(X\):

\[\begin{split} \mu_{ij} & = \beta_{0j} + \beta_{1j} X_{ij1} + \beta_{2j} X_{ij2} + \cdots + \beta_{pj} X_{ijp} \\ & = (\beta_0 + b_{0j}) + (\beta_1 + b_{1j}) X_{ij1} + (\beta_2 + b_{2j}) X_{ij2} + \cdots + (\beta_p + b_{pj}) X_{ijp} \\ \end{split}\]

In between, we might assume that some predictors need group-specific coefficients and others don’t:

\[\begin{split} \mu_{ij} & = \beta_{0j} + \beta_{1j} X_{ij1} + \beta_2 X_{ij2} + \cdots + \beta_p X_{ijp} \\ & = (\beta_0 + b_{0j}) + (\beta_1 + b_{1j}) X_{ij1} + \beta_2 X_{ij2} + \cdots + \beta_p X_{ijp} \\ \end{split}\]

17.9 Exercises

17.9.1 Conceptual Exercises

Exercise 17.1 (Translating assumptions into model notation) To test the relationship between reaction times and sleep deprivation, researchers enlisted 3 people in a 10 day study. Let \(Y_{ij}\) denote the reaction time (in ms) to a given stimulus and \(X_{ij}\) the number of days of sleep deprivation for the \(i\)th observation on subject \(j\). For each set of assumptions below, use mathematical notation to represent an appropriate Bayesian hierarchical model of \(Y_{ij}\) vs \(X_{ij}\).
  1. Not only do some people tend to react more quickly than others, sleep deprivation might impact some people’s reaction times more than others.
  2. Though some people tend to react more quickly than others, the impact of sleep deprivation on reaction time is the same for all.
  3. Nobody has inherently faster reaction times, though sleep deprivation might impact some people’s reaction times more than others.
Exercise 17.2 (Sketch the assumption: Part 1) Continuing with the sleep study, suppose we model the relationship between reaction time \(Y_{ij}\) and days of sleep deprivation \(X_{ij}\) using the random intercepts model (17.5).
  1. Sketch a plot of data that we might see if \(\sigma_y > \sigma_0\).
  2. Explain what \(\sigma_y > \sigma_0\) would mean in the context of the sleep study.
  3. Repeat part a assuming \(\sigma_y < \sigma_0\).
  4. Repeat part b assuming \(\sigma_y < \sigma_0\).
Exercise 17.3 (Sketch the assumption: Part 2) Suppose instead that we model the relationship between reaction time \(Y_{ij}\) and days of sleep deprivation \(X_{ij}\) using the random intercepts and slopes model (17.12) (with different priors).
  1. Sketch a plot of subject-specific trends that we might see if the correlation parameter were positive \(\rho > 0\).
  2. Explain what \(\rho > 0\) would mean in the context of the sleep study.
  3. Repeat part a for \(\rho < 0\).
  4. Repeat part b for \(\rho < 0\).
Exercise 17.4 (Making meaning of models) To study the relationship between weight and height among pug puppies, you collect data on 10 different litters, each containing 4 to 6 puppies born to the same mother. Let \(Y_{ij}\) and \(X_{ij}\) denote the weight and height, respectively, for puppy \(i\) in litter \(j\).
  1. Write out formal model notation for model 1, a random intercepts model of \(Y_{ij}\) vs \(X_{ij}\).
  2. Write out formal model notation for model 2, a random intercepts and slopes model of \(Y_{ij}\) vs \(X_{ij}\).
  3. Summarize the key differences in the assumptions behind models 1 and 2. Root this discussion in the puppy context.
Exercise 17.5 (Translating models to code) Suppose we had weight and height data for the puppy study. Write out appropriate stan_glmer() model code for models 1 and 2 from Exercise 17.4.

17.9.2 Applied Exercises

Exercise 17.6 (Sleep: Setting up the model) In the above conceptual exercises, you considered a sleep study of the relationship between reaction time and the number of days of sleep deprivation. You will explore this relationship in more depth here. To this end, suppose researchers tell us that on a typical day, the average person should have a reaction time of roughly 250ms to the stimulus used in the sleep study. Beyond this baseline, we’ll balance weakly informative priors with the sleepstudy data from the lme4 package to better understand reaction times. Specifically, consider two possible models as expressed by stan_glmer() syntax:
model formula
1 Reaction ~ Days + (1 | Subject)
2 Reaction ~ Days + (Days | Subject)
  1. What’s the grouping variable in the sleepstudy data and why is it important to incorporate this grouping structure into our analysis?
  2. Use formal notation to define the hierarchical regression structure of models 1 and 2. (You will tune the priors in the next exercise.)
  3. Summarize the key differences between models 1 and 2. Root this discussion in the sleep study.
  4. Using the sleepstudy data, construct and discuss a plot that helps you explore which model is more appropriate: 1 or 2.
Exercise 17.7 (Sleep: Simulating the model) Continuing with the sleep analysis, let’s simulate and dig into the hierarchical posteriors.
  1. Simulate the posteriors of models 1 and 2. Remember to use a baseline reaction time of 250ms, and weakly informative priors otherwise.
  2. For model 2, report the global posterior median regression model of Reaction time.
  3. For model 2, construct and interpret 80% credible intervals for the Days regression coefficient.
  4. For model 2, calculate and interpret the posterior medians of \(\sigma_y\), \(\sigma_0\), \(\sigma_1\), and \(\rho\).
Exercise 17.8 (Sleep: Group-specific inference) Next, let’s dig into what Model 2 indicates about the individuals that participated in the sleep study.
  1. Use your posterior simulation to identify the person for whom reaction time changes the least with sleep deprivation. Write out their posterior median regression model.
  2. Repeat part a, this time for the person for whom reaction time changes the most with sleep deprivation.
  3. Use your posterior simulation to identify the person that has the slowest baseline reaction time. Write out their posterior median regression model.
  4. Repeat part c, this time for the person that has the fastest baseline reaction time.
  5. Simulate, plot, and discuss the posterior predictive model of reaction time after 5 days of sleep deprivation for two subjects: you and Subject 308. You’re encouraged to try this from scratch before relying on the posterior_predict() shortcut.
Exercise 17.9 (Sleep: Which model?)
  1. Evaluate the two models of reaction time: are they wrong? are they fair? how accurate are their posterior predictions?
  2. Which of the two models do you prefer and what does this indicate about the relationship between reaction time and sleep deprivation? Justify your answer with posterior evidence.
Exercise 17.10 (Voices: Setting up the model) Does one’s voice pitch change depending on attitude? To address this question, Winter and Grawunder (2012) conducted a study in which each subject participated in various role-playing dialogs. These dialogs spanned different contexts (eg: asking for a favor) and were approached with different attitudes (polite vs informal). In the next exercises you’ll explore a hierarchical regression analysis of \(Y_{ij}\), the average voice pitch in subject \(j\)’s \(i\)th dialog session (measured in Hz), by \(X_{ij}\), whether or not the dialog was polite (vs informal). Beyond a baseline understanding that the typical voice pitch is around 200 Hz, you should utilize weakly informative priors.
  1. Using formal notation, define the hierarchical regression model of \(Y_{ij}\) vs \(X_{ij1}\). In doing so, assume that baseline voice pitch differs from subject to subject, but that the impact of attitude on voice pitch is similar among all subjects.
  2. Compare and contrast the meanings of model parameters \(\beta_{0j}\) and \(\beta_0\) in the context of this voice pitch study. NOTE: Remember that \(X_{ij}\) is a categorical indicator variable.
  3. Compare and contrast the meanings of model parameters \(\sigma_y\) and \(\sigma_0\) in the context of this voice pitch study.
Exercise 17.11 (Voices: Check out some data) To balance our weakly informative priors for the model of pitch by attitude, check out some data.
  1. Load the voices data from the bayesrules package. How many study subjects are included in this sample? In how many dialogs did each subject participate?
  2. Construct and discuss a plot which illustrates the relationship between voice pitch and attitude both within and between subjects.
Exercise 17.12 (Voices: Simulating the model) Continuing with the voice pitch analysis, in this exercise you will simulate and dig into the hierarchical posterior of your model parameters.
  1. Simulate the hierarchical posterior model of voice pitch by attitude. Construct and discuss trace plots, density plots, autocorrelation plots, and a pp_check() of the chain output.
  2. Construct and interpret a 95% credible interval for \(\beta_0\).
  3. Construct and interpret a 95% credible interval for \(\beta_1\).
  4. Is there ample evidence that, for the average subject, voice pitch differs depending on attitude (polite vs informal)? Explain.
Exercise 17.13 (Voices: Focusing on the individual) Continuing with the voice pitch analysis, in this exercise you will focus on specific subjects.
  1. Report the global posterior median model of the relationship between voice pitch and attitude.
  2. Report and contrast the posterior median models for two subjects in our data: A and F.
  3. Using posterior_predict(), simulate posterior predictive models of voice pitch in a new polite dialog for three different subjects: A, F, and you. Illustrate your simulation results using mcmc_areas() and discuss your findings.

17.9.3 Open-ended exercises

Exercise 17.14 (Open exercise: Coffee) What makes some coffee beans more winning than others? In this open-ended exercise, construct, interpret, and evaluate a hierarchical regression model of the professional rating or “total cup points” awarded to a batch of coffee beans. Do so using the coffee_ratings_small data from the bayesrules package which contains information for multiple batches of coffee beans from each of 27 farms. You get to choose which predictors to use.
Exercise 17.15 (Sleep: Different priors) In our earlier sleep analysis, we utilized weakly informative priors. Pretending that you haven’t already seen the data, specify a model of Reaction time by Days of sleep deprivation using priors that you tune yourself. Use prior simulation to illustrate your prior understanding.

References

Carreño, Edwin Javier Castillo, and Edilberto Cepeda Cuervo. 2017. Bayeslongitudinal: Adjust Longitudinal Regression Models Using Bayesian Methodology. https://CRAN.R-project.org/package=bayeslongitudinal.
Gabry, Jonah, and Ben Goodrich. 2020a. “Estimating Generalized Linear Models with Group-Specific Terms with Rstanarm.” https://mc-stan.org/rstanarm/articles/glmer.html.
Laird, N., and J. Ware. 1982. “Random-Effects Models for Longitudinal Data.” Biometrics 38 4: 963–74.
Modrak, Martin. 2019. “Divergent Transitions – a Primer.” https://discourse.mc-stan.org/t/divergent-transitions-a-primer/17099.
Vehtari, Aki. 2019. “Cross-Validation for Hierarchical Models.” https://avehtari.github.io/modelselection/rats_kcv.html.
Winter, Bodo, and Sven Grawunder. 2012. “The Phonetic Profile of Korean Formal and Informal Speech Registers.” Journal of Phonetics 40: 808–15.

  1. Answers: 1 = c; 2 = a↩︎

  2. Technically, this is the combined variability in group-specific intercepts and slopes when assuming they are uncorrelated.↩︎