Chapter 17 (Normal) Hierarchical Models with Predictors

In Chapter 15 we convinced you that there’s a need for hierarchical models. In Chapter 16 we built our first hierarchical model – a Normal hierarchical model of $Y$ with no predictors $X$ . Here we’ll take the next natural step by building a Normal hierarchical regression model of $Y$ with predictors $X$ . Going full circle, let’s return to the Cherry Blossom 10 mile running race analysis that was featured in Chapter 15. Our goal is to better understand variability in running times: To what extent do some people run faster than others? How are running times associated with age, and to what extent does this differ from person to person?

In answering these questions, we’ll utilize the cherry_blossom_sample data from the bayesrules package, shortened to running here. This data records multiple net running times (in minutes) for each of 36 different runners in their 50s and 60s that entered the 10-mile race in multiple years.

# Load packages
library(bayesrules)
library(tidyverse)
library(rstanarm)
library(bayesplot)
library(tidybayes)
library(broom.mixed)

# Load data
data(cherry_blossom_sample)
running <- cherry_blossom_sample

But it turns out that the running data is missing some net race times. Since functions such as prediction_summary(), add_fitted_draws(), and add_predicted_draws() require complete information on each race, we’ll omit the rows with incomplete observations. In doing so, it’s important to use na.omit() after selecting our variables of interest so that we don’t throw out observations that have complete information on these variables just because they have incomplete information on variables we don’t care about.

# Remove NAs
running <- running %>% 
  select(runner, age, net) %>% 
  na.omit()

With multiple observations per runner, this data is hierarchical or grouped. To acknowledge this grouping structure, let $Y_{i j}$ denote the net running time and $X_{i j}$ the age for runner $j$ in their $i$ th race. In modeling $Y_{i j}$ by $X_{i j}$ , Chapter 15 previewed that it would be a mistake to ignore the data’s grouped structure: a complete pooling approach ignores the fact that we have multiple dependent observations on each runner, and a no pooling approach assumes that data on one runner can’t tell us anything about another runner (thus that our data also can’t tell us anything about the general running population). Instead, our analysis of the relationship between running times and age will combine two Bayesian modeling paradigms we’ve developed throughout the book:

regression models of $Y$ by predictors $X$ when our data does not have a group structure (Chapters 9 through 14); and
hierarchical models of $Y$ which account for group-structured data but do not use predictors $X$ (Chapter 16).

Build hierarchical regression models of response variable $Y$ by predictors $X$ .
Evaluate and compare hierarchical and non-hierarchical models.
Use hierarchical models for posterior prediction.

17.1 First steps: Complete pooling

To explore the association between running times ( $Y_{i j}$ ) and age ( $X_{i j}$ ), let’s start by ignoring the data’s grouped structure. This isn’t the right approach, but it provides a good point of comparison and a building block to a better model. To this end, the complete pooled regression model of $Y_{i j}$ from Chapter 15 assumes an age-specific linear relationship $μ_{i} = β_{0} + β_{1} X_{i j}$ with weakly informative priors:

$\begin{matrix} (17.1) & \begin{aligned} Y_{i j} | β_{0}, β_{1}, σ & \sim N (μ_{i}, σ^{2}) where μ_{i} = β_{0} + β_{1} X_{i j} \\ β_{0 c} & \sim N (0, 35^{2}) \\ β_{1} & \sim N (0, 15^{2}) \\ σ & \sim Exp (0.072) \end{aligned} \end{matrix}$

This model depends upon three global parameters: intercept $β_{0}$ , age coefficient $β_{1}$ , and variability from the regression model $σ$ . Our posterior simulation results for these parameters from Section 15.1, stored in complete_pooled_model, are summarized below.

# Posterior summary statistics
model_summary <- tidy(complete_pooled_model, 
                      conf.int = TRUE, conf.level = 0.80)
model_summary
# A tibble: 2 x 5
  term        estimate std.error conf.low conf.high
  <chr>          <dbl>     <dbl>    <dbl>     <dbl>
1 (Intercept)   75.2      24.6     43.7     106.   
2 age            0.268     0.446   -0.302     0.842

The vibe of this complete pooled model is captured by Figure 17.1: it lumps together all race results and describes the relationship between running time and age by a common global model. Lumped together in this way, a scatterplot of net running times versus age exhibit a weak relationship with a posterior median model of 75.2 $+$ 0.268 age.

# Posterior median model
B0 <- model_summary$estimate[1]
B1 <- model_summary$estimate[2]
ggplot(running, aes(x = age, y = net)) + 
  geom_point() + 
  geom_abline(aes(intercept = B0, slope = B1))

A scatterplot of net times (y-axis) vs age (x-axis). Each of the 252 points represent a single race outcome for a single runner. There is a lot of variability in the points and no discernible relationship between net times and age. This relationship is represented by a near-flat, positively sloped model line.

FIGURE 17.1: A scatterplot of net running times versus age along with the posterior median model from the complete pooled model.

This posterior median estimate of the age coefficient $β_{1}$ suggests that running times tend to increase by a mere 0.27 minutes for each year in age. Further, with an 80% posterior credible interval for $β_{1}$ which straddles 0, (-0.3, 0.84), our complete pooled regression model suggests there’s not a significant relationship between running time and age. Our intuition (and personal experience) says this is wrong – as adults age they tend to slow down. This intuition is correct.

17.2 Hierarchical model with varying intercepts

17.2.1 Model building

Chapter 15 revealed that it would indeed be a mistake to stop our runner analysis with the complete pooled regression model (17.1). Thus, our next goal is to incorporate the data’s grouped structure while maintaining our age predictor $X_{i j}$ . Essentially, we want to combine the regression principles from the complete pooled regression model with the grouping principles from the simple Normal hierarchical model without a predictor from Chapter 16:

$\begin{matrix} (17.2) & \begin{aligned} Y_{i j} | μ_{j}, σ_{y} & \sim N (μ_{j}, σ_{y}^{2}) & model of running times WITHIN runner j \\ μ_{j} | μ, σ_{μ} & \overset{i n d}{\sim} N (μ, σ_{μ}^{2}) & model of how typical running times vary BETWEEN runners \\ μ, σ_{y}, σ_{μ} & \sim . . . & prior models on global parameters \end{aligned} \end{matrix}$

This hierarchical regression model will build from (17.2) and unfold in three similar layers.

17.2.1.1 Layer 1: Variability within runner

The first layer of the simple Normal hierarchical model (17.2) assumes that each runner’s net running times $Y_{i j}$ vary normally around their own mean time $μ_{j}$ , with no consideration of their age $X_{i j}$ :

$Y_{i j} | μ_{j}, σ_{y} \sim N (μ_{j}, σ_{y}^{2}) .$

To incorporate information about age into our understanding of running times within runners, we can replace the $μ_{j}$ with runner-specific means $μ_{i j}$ , which depend upon the runner’s age in their $i$ th race, $X_{i j}$ . There’s more than one approach, but we’ll start with the following:

$μ_{i j} = β_{0 j} + β_{1} X_{i j} .$

Thus, the first layer of our hierarchical model describes the relationship between net times and age within each runner $j$ by:

$\begin{matrix} (17.3) & Y_{i j} | β_{0 j}, β_{1}, σ_{y} \sim N (μ_{i j}, σ_{y}^{2}) where μ_{i j} = β_{0 j} + β_{1} X_{i j} . \end{matrix}$

For each runner $j$ , (17.3) assumes that their running times are Normally distributed around an age- and runner-specific mean, $β_{0 j} + β_{1} X_{i j}$ , with standard deviation $σ_{y}$ . This model depends upon three parameters: $β_{0 j}$ , $β_{1}$ , and $σ_{y}$ . Paying special attention to subscripts, only $β_{0 j}$ depends upon $j$ , and thus is runner- or group-specific:

$β_{0 j}$ = intercept of the regression model for runner $j$ .

The other parameters are global, or shared across all runners $j$ :

$β_{1}$ = global age coefficient, i.e., the typical change in a runner’s net time per one year increase in age; and
$σ_{y}$ = within-group variability around the mean regression model $β_{0 j} + β_{1} X_{i j}$ , and hence a measure of the strength of the relationship between an individual runner’s time and their age.

Putting this together, the first layer of our hierarchical model (17.3) assumes that relationships between running time and age randomly vary from runner to runner, having different intercepts $β_{0 j}$ but a shared age coefficient $β_{1}$ . Figure 17.2 illustrates these assumptions and helps us translate them within our context: though some runners tend to be faster or slower than others, runners experience roughly the same increase in running times as they age.

This is a plot of net running time (y-axis) by age (x-axis). There are 36 parallel, positively-sloped model lines, 1 per runner. The intercepts range from -10 to 50. The center of the intercepts is marked by beta 0 and the spread in intercepts is marked by sigma 0.

FIGURE 17.2: Hypothetical mean regression models, $β_{0 j} + β_{1} X$ , for our 36 sampled runners under (17.3). The black dots represent the runner-specific intercepts $β_{0 j}$ which vary normally around $β_{0}$ with standard deviation $σ_{0}$ .

17.2.1.2 Layer 2: Variability between runners

As with the simple hierarchical model (17.2), the first layer of our hierarchical regression model (17.3) captured the relationship between running time and age within runners. It’s in the next layer that we must capture how the relationships between running time and age vary from runner to runner, i.e., between runners.

Which of our current model parameters, $(β_{0 j}, β_{1}, σ_{y})$ , will we model in the next layer of the hierarchical model? What is a reasonable structure for this model?

Since it’s the only regression feature that we’re assuming can vary from runner to runner, the next layer will model variability in the intercept parameters $β_{0 j}$ . It’s important to recognize here that our 36 sample runners are drawn from the same broader population of runners. Thus, instead of taking a no pooling approach, which assigns each $β_{0 j}$ a unique prior, and hence assumes that one runner $j$ can’t tell us about another, these intercepts should share a prior. To this end we’ll assume that the runner-specific intercept parameters, and hence baseline running speeds, vary normally around some mean $β_{0}$ with standard deviation $σ_{0}$ :

$\begin{matrix} (17.4) & β_{0 j} | β_{0}, σ_{0} \overset{i n d}{\sim} N (β_{0}, σ_{0}^{2}) . \end{matrix}$

Figure 17.2 adds context to this layer of the hierarchical model which depends upon two new parameters:

$β_{0}$ = the global average intercept across all runners, i.e., the average runner’s baseline speed; and
$σ_{0}$ = between-group variability in intercepts $β_{0 j}$ , i.e., the extent to which baseline speeds vary from runner to runner.

17.2.1.3 Layer 3: Global priors

We’ve now completed the first two layers of our hierarchical regression model which reflect the relationships of running time and age, within and between runners. As with any Bayesian model, the last step is to specify our priors.

For which model parameters must we specify priors in the final layer of our hierarchical regression model?

Building from (17.3) and (17.4), the final layer of our hierarchical regression model must specify priors for the global parameters: $β_{0}$ , $β_{1}$ , $σ_{y}$ , and $σ_{0}$ . It’s these global parameters that describe the entire population of runners, not just those in our sample. As usual, we’ll utilize independent Normal priors for regression coefficients $β_{0}$ and $β_{1}$ , where our prior understanding of baseline $β_{0}$ is expressed through the centered intercept $β_{0 c}$ . Further, we’ll utilize independent Exponential priors for standard deviation terms $σ_{y}$ and $σ_{0}$ . The final hierarchical regression model thus combines information about the relationship between running times $Y_{i j}$ and age $X_{i j}$ within runners (17.3) with information about how baseline speeds $β_{0 j}$ vary between runners (17.4) with our prior understanding of the broader global population of runners. As embodied by Figure 17.2, this particular model is often referred to as a hierarchical random intercepts model:

$\begin{matrix} (17.5) & \begin{aligned} Y_{i j} | β_{0 j}, β_{1}, σ_{y} & \sim N (μ_{i j}, σ_{y}^{2}) with μ_{i j} = β_{0 j} + β_{1} X_{i j} & (regression model \\ WITHIN runner j) \\ β_{0 j} | β_{0}, σ_{0} & \overset{i n d}{\sim} N (β_{0}, σ_{0}^{2}) & (variability in baseline speeds \\ BETWEEN runners) \\ β_{0 c} & \sim N (m_{0}, s_{0}^{2}) & (priors on global parameters) \\ β_{1} & \sim N (m_{1}, s_{1}^{2}) \\ σ_{y} & \sim Exp (l_{y}) \\ σ_{0} & \sim Exp (l_{0}) . \end{aligned} \end{matrix}$

The complete set of model assumptions is summarized below.

Normal hierarchical regression assumptions

Let $Y_{i j}$ and $X_{i j}$ denote observations for the $i$ th observation in group $j$ . The appropriateness of the Bayesian Normal hierarchical regression model (17.5) of $Y_{i j}$ by $X_{i j}$ depends upon the following assumptions.

Structure of the data
Conditioned on predictor $X_{i j}$ , the outcomes $Y_{i j}$ on any one group $j$ are independent of those on another group $k$ , $Y_{i k}$ . However, different data points within the same group $j$ , $Y_{i j}$ and $Y_{h j}$ , are correlated.
Structure of the relationship
Within any group $j$ , the typical outcome of $Y_{i j}$ ( $μ_{i j}$ ) can be written as a linear function of predictor $X_{i j}$ .
Structure of the variability within groups
Within any group $j$ and at any predictor value $X_{i j}$ , the observed values of $Y_{i j}$ will vary normally around mean $μ_{i j}$ with consistent standard deviation $σ_{y}$ .
Structure of the variability between groups
The group-specific baselines or intercepts, $β_{0 j}$ , vary normally around a global intercept $β_{0}$ with standard deviation $σ_{0}$ .

Connecting the hierarchical and complete pooled models

The complete pooled model (17.1) is a special case of (17.5). These two models are equivalent when $σ_{0} = 0$ , i.e., when the intercepts do not differ from group to group.

17.2.2 Another way to think about it

Recall from Chapter 16 that there are two ways to think about group-specific parameters. First, in our hierarchical regression model of running times (17.5), we incorporated runner-specific intercept parameters $β_{0 j}$ to indicate that the baseline speed can vary from runner to runner $j$ . We thought of these runner-specific intercepts as normal deviations from some global intercept $β_{0}$ with standard deviation $σ_{0}$ : $β_{0 j} \sim N (β_{0}, σ_{0}^{2})$ . Equivalently, we can think of the runner-specific intercepts as random tweaks or adjustments $b_{0 j}$ to $β_{0}$ ,

$β_{0 j} = β_{0} + b_{0 j}$

where these tweaks are normal deviations from 0 with standard deviation $σ_{0}$ :

$b_{0 j} \sim N (0, σ_{0}^{2}) .$

Consider an example. Suppose that some runner $j$ has a baseline running speed of $β_{0 j} = 24$ minutes, whereas the average baseline speed across all runners is $β_{0} = 19$ minutes. Thus, at any age, runner $j$ tends to run 5 minutes slower than average. That is, $b_{0 j} = 5$ and

$β_{0 j} = β_{0} + b_{0 j} = 19 + 5 = 24 .$

In general, then, we can reframe Layers 1 and 2 of our hierarchical model (17.5) as follows:

$\begin{matrix} (17.6) & \begin{aligned} Y_{i j} | β_{0 j}, β_{1}, σ_{y} & \sim N (μ_{i j}, σ_{y}^{2}) with μ_{i j} = (β_{0} + b_{0 j}) + β_{1} X_{i j} \\ b_{0 j} | σ_{0} & \overset{i n d}{\sim} N (0, σ_{0}^{2}) \\ β_{0 c} & \sim N (m_{0}, s_{0}^{2}) \\ β_{1} & \sim N (m_{1}, s_{1}^{2}) \\ σ_{y} & \sim Exp (l_{y}) \\ σ_{0} & \sim Exp (l_{0}) . \end{aligned} \end{matrix}$

17.2.3 Tuning the prior

With the hierarchical regression model framework in place (17.5), let’s tune the priors to match our prior understanding in this running context. Pretending that we didn’t already see data in Chapter 15, our prior understanding is as follows:

The typical runner in this age group runs somewhere between an 8-minute mile and a 12-minute mile during a 10-mile race, and thus has a net time somewhere between 80 and 120 minutes for the entire race. As such we’ll set the prior model for the centered global intercept to $β_{0 c} \sim N (100, 10^{2})$ . (This centered intercept is much easier to think about than the raw intercept $β_{0}$ , the typical net time for a 0-year-old runner!)
We’re pretty sure that the typical runner’s net time in the 10-mile race will, on average, increase over time. We’re not very sure about the rate of this increase, but think it’s likely between 0.5 and 4.5 minutes per year. Thus, we’ll set our prior for the global age coefficient to $β_{1} \sim N (2.5, 1^{2})$ .
Beyond the typical net time for the typical runner, we do not have a clear prior understanding of the variability between runners ( $σ_{0}$ ), nor of the degree to which a runner’s net times might fluctuate from their regression trend ( $σ_{y}$ ). Thus, we’ll utilize weakly informative priors on these standard deviation parameters.

Our final tuning of the hierarchical random intercepts model follows, where the priors on $σ_{y}$ and $σ_{0}$ are assigned by the stan_glmer() simulation below:

$\begin{matrix} (17.7) & \begin{aligned} Y_{i j} | β_{0 j}, β_{1}, σ_{y} & \sim N (μ_{i j}, σ_{y}^{2}) with μ_{i j} = β_{0 j} + β_{1} X_{i j} \\ β_{0 j} | β_{0}, σ_{0} & \overset{i n d}{\sim} N (β_{0}, σ_{0}^{2}) \\ β_{0 c} & \sim N (100, 10^{2}) \\ β_{1} & \sim N (2.5, 1^{2}) \\ σ_{y} & \sim Exp (0.072) \\ σ_{0} & \sim Exp (1) . \end{aligned} \end{matrix}$

To get a sense for the combined meaning of our prior models, we simulate 20,000 prior parameter sets using stan_glmer() with the following special arguments:

We specify the model of net times by age by the formula net ~ age + (1 | runner). This essentially combines a non-hierarchical regression formula (net ~ age) with that for a hierarchical model with no predictor (net ~ (1 | runner)).
We specify prior_PD = TRUE to indicate that we wish to simulate parameters from the prior, not posterior, models.

running_model_1_prior <- stan_glmer(
  net ~ age + (1 | runner), 
  data = running, family = gaussian,
  prior_intercept = normal(100, 10),
  prior = normal(2.5, 1), 
  prior_aux = exponential(1, autoscale = TRUE),
  prior_covariance = decov(reg = 1, conc = 1, shape = 1, scale = 1),
  chains = 4, iter = 5000*2, seed = 84735, 
  prior_PD = TRUE)

The simulation results describe 20,000 prior plausible scenarios for the relationship between running time and age, within and between our 36 runners. Though we encourage you to plot many more on your own, we show just 4 prior plausible scenarios of what the mean regression models, $β_{0 j} + β_{1} X$ , might look like for our 36 runners (Figure 17.3 left). The variety across these prior scenarios reflects our general uncertainty about running. Though each scenario is consistent with our sense that runners likely slow down over time, the rate of increase ranges quite a bit. Further, in examining the distances between the runner-specific regression lines, some scenarios reflect a plausibility that there’s little difference between runners, whereas others suggest that some runners might be much faster than others.

Finally, we also simulate 100 datasets of race outcomes from the prior model, across a variety of runners and ages. The 100 density plots in Figure 17.3 (right) reflect the distribution of the net times in these simulated datasets. There is, again, quite a range in these simulations. Though some span ridiculous outcomes (e.g., negative net running times), the variety in the simulations and the general set of values they cover, adequately reflect our prior understanding and uncertainty. For example, since a 25- to 30-minute mile is a good walking (not running) pace, the upper values near 250-300 minutes for the entire 10-mile race seem reasonable.

set.seed(84735)
running %>% 
  add_fitted_draws(running_model_1_prior, n = 4) %>%
  ggplot(aes(x = age, y = net)) +
    geom_line(aes(y = .value, group = paste(runner, .draw))) + 
    facet_wrap(~ .draw)

running %>%
  add_predicted_draws(running_model_1_prior, n = 100) %>%
  ggplot(aes(x = net)) +
    geom_density(aes(x = .prediction, group = .draw)) +
    xlim(-100,300)

There are 2 plots. The left has four separate plots of net time (y-axis) by age (x-axis). Within each plot are 36 parallel lines. In 2 of the plots, these lines are nearly flat and have very similar intercepts. In the other 2 plots, these lines have a steeper slope and a wider variety of intercepts. The right plot has 100 density curves of net running time. Though most are bell-shaped and centered near 100 minutes, there is a lot of variability in spread. Some curves range from -100 to 300 minutes, whereas others range from 70 to 130 minutes.

FIGURE 17.3: Simulated scenarios under the prior models of the hierarchical regression model (17.7). At left are 4 prior plausible sets of 36 runner-specific relationships between running time and age, $β_{0 j} + β_{1} X$ . At right are density plots of 100 prior plausible sets of net running time data.

17.2.4 Posterior simulation & analysis

After all of the buildup, let’s counter our prior understanding of the relationship between running time and age with some data. We plot the running times by age for each of our 36 runners below. This reaffirms our model-building process and some of our prior hunches. Most runners’ times do tend to increase with age, and there is variability between the runners themselves – some tend to be faster than others.

ggplot(running, aes(x = age, y = net)) + 
  geom_point() + 
  facet_wrap(~ runner)

There are 36 scatterplots of net time (y-axis) vs age (x-axis), 1 for each runner. The net times range from 60 to 120 minutes. The ages range from 50 to 62 years. Each runner has multiple data points. The behavior in these points varies from runner to runner, both in their typical net times and in the change in net time with age.

FIGURE 17.4: Scatterplots of the net running time by age for each of the 36 runners.

Combining our prior understanding with this data, we take a syntactical shortcut to simulate the posterior random intercepts model (17.5) of net times by age: we update() the running_model_1_prior with prior_PD = FALSE. We encourage you to follow this up with a check of the prior tunings as well as some Markov chain diagnostics:

# Simulate the posterior
running_model_1 <- update(running_model_1_prior, prior_PD = FALSE)

# Check the prior specifications
prior_summary(running_model_1)

# Markov chain diagnostics
mcmc_trace(running_model_1)
mcmc_dens_overlay(running_model_1)
mcmc_acf(running_model_1)
neff_ratio(running_model_1)
rhat(running_model_1)

There are a whopping 40 parameters in our model: 36 runner-specific intercept parameters ( $β_{0 j}$ ) in addition to 4 global parameters ( $β_{0}, β_{1}, σ_{y}, σ_{0}$ ). These are labeled as follows in the stan_glmer() simulation results:

(Intercept) = $β_{0}$
age = $β_{1}$
b[(Intercept) runner:j] = $b_{0 j} = β_{0 j} - β_{0}$ , the difference between runner $j$ ’s baseline speed and the average baseline speed
sigma = $σ_{y}$
Sigma[runner:(Intercept),(Intercept)] = $σ_{0}^{2}$

We’ll keep our focus on the big themes here, first those related to the relationship between running time and age for the typical runner, and then those related to the variability from this average.

17.2.4.1 Posterior analysis of the global relationship

To begin, consider the global relationship between running time and age for the typical runner:

$β_{0} + β_{1} X .$

Posterior summaries for $β_{0}$ and $β_{1}$ , which are fixed across runners, are shown below.

tidy_summary_1 <- tidy(running_model_1, effects = "fixed",
                       conf.int = TRUE, conf.level = 0.80)
tidy_summary_1
# A tibble: 2 x 5
  term        estimate std.error conf.low conf.high
  <chr>          <dbl>     <dbl>    <dbl>     <dbl>
1 (Intercept)    19.1     11.9       3.74     34.5 
2 age             1.30     0.213     1.02      1.58

Accordingly, there’s an 80% chance that the typical runner tends to slow down somewhere between 1.02 and 1.58 minutes per year. The fact that this range is entirely and comfortably above 0 provides significant evidence that the typical runner tends to slow down with age. This assertion is visually supported by the 200 posterior plausible global model lines below, superimposed with their posterior median, all of which exhibit positive associations between running time and age. In plotting these model lines, note that we use add_fitted_draws() with re_formula = NA to specify that we are interested in the global, not group-specific, model of running times:

B0 <- tidy_summary_1$estimate[1]
B1 <- tidy_summary_1$estimate[2]
running %>%
  add_fitted_draws(running_model_1, n = 200, re_formula = NA) %>%
  ggplot(aes(x = age, y = net)) +
    geom_line(aes(y = .value, group = .draw), alpha = 0.1) +
    geom_abline(intercept = B0, slope = B1, color = "blue") +
    lims(y = c(75, 110))

There are 201 model lines of net time (y-axis) vs age (x-axis). The 200 black lines have similar slopes, yet the intercepts span a roughly 15 minute range. A singular blue model line cuts through the center of the 200 black lines.

FIGURE 17.5: 200 posterior plausible global model lines, $β_{0} + β_{1} X$ , for the relationship between running time and age.

Statistical vs practical significance

When we use the term “significant” in the discussion above, we’re referring to statistical significance. This merely indicates that a positive relationship exists. In this case, we also believe that a per-year slowdown of 1.02 to 1.58 minutes is of a meaningful magnitude that has real-world implications. That is, this positive relationship is also practically significant.

Don’t let the details distract you from the important punchline here! By incorporating the group structure of our data, our hierarchical random intercepts model has what the complete pooled model (17.1) lacked: the power to detect a significant relationship between running time and age. Our discussion in Chapter 15 revealed why this happens in our running analysis: pooling all runners’ data together masks the fact that most individual runners slow down with age.

17.2.4.2 Posterior analysis of group-specific relationships

In our next step, let’s examine what the hierarchical random intercepts model reveals about the runner-specific relationships between net running time and age,

$β_{0 j} + β_{1} X_{i j} = (β_{0} + b_{0 j}) + β_{1} X_{i j} .$

We’ll do so by combining what we learned about the global age parameter $β_{1}$ above, with information on the runner-specific intercept terms $β_{0 j}$ . The latter will require the specialized syntax we built up in Chapter 16, and thus some patience. First, the b[(Intercept) runner:j] chains correspond to the difference in the runner-specific and global intercepts $b_{0 j}$ . Thus, we obtain MCMC chains for each $β_{0 j} = β_{0} + b_{0 j}$ by adding the (Intercept) chain to the b[(Intercept) runner:j] chains via spread_draws() and mutate(). We then use median_qi() to obtain posterior summaries of the $β_{0 j}$ chain for each runner $j$ :

# Posterior summaries of runner-specific intercepts
runner_summaries_1 <- running_model_1 %>%
  spread_draws(`(Intercept)`, b[,runner]) %>% 
  mutate(runner_intercept = `(Intercept)` + b) %>% 
  select(-`(Intercept)`, -b) %>% 
  median_qi(.width = 0.80) %>% 
  select(runner, runner_intercept, .lower, .upper)

Consider the results for runners 4 and 5. With a posterior median intercept of 30.8 minutes vs 6.7 minutes, runner 4 seems to have a slower baseline speed than runner 5. Thus, at any shared age, we would expect runner 4 to run roughly 24.1 minutes slower than runner 5 ( $30.8 - 6.7$ ):

runner_summaries_1 %>% 
  filter(runner %in% c("runner:4", "runner:5"))
# A tibble: 2 x 4
  runner   runner_intercept .lower .upper
  <chr>               <dbl>  <dbl>  <dbl>
1 runner:4            30.8   15.3    46.3
2 runner:5             6.66  -8.42   21.9

These observations are echoed in the plots below, which display 100 posterior plausible models of net time by age for runners 4 and 5:

# 100 posterior plausible models for runners 4 & 5
running %>%
  filter(runner %in% c("4", "5")) %>% 
  add_fitted_draws(running_model_1, n = 100) %>%
  ggplot(aes(x = age, y = net)) +
    geom_line(
      aes(y = .value, group = paste(runner, .draw), color = runner),
      alpha = 0.1) +
    geom_point(aes(color = runner))

The are 200 model lines of net time (y-axis) vs age (x-axis), 100 for each of 2 runners, labeled 4 and 5. Within both runners, the lines have similar slopes and moderately increase with age. However, the lines for runner 4 tend to fall above 100 minutes, thus are higher than those for runner 5 which tend to fall below 85 minutes.

FIGURE 17.6: 100 posterior plausible models of running time by age, $β_{0 j} + β_{1} X$ , for subjects $j \in {4, 5}$ .

We can similarly explore the models for all 36 runners, $β_{0 j} + β_{1} X_{i j}$ . For a quick comparison, the runner-specific posterior median models are plotted below and superimposed with the posterior median global model, $β_{0} + β_{1} X_{i j}$ . This drives home the point that the global model represents the relationship between running time and age for the most average runner. The individual runner models vary around this global average, some with faster baseline speeds ( $β_{0 j} < β_{0}$ ) and some with slower baseline speeds ( $β_{0 j} > β_{0}$ ).

# Plot runner-specific models with the global model
ggplot(running, aes(y = net, x = age, group = runner)) + 
  geom_abline(data = runner_summaries_1, color = "gray",
              aes(intercept = runner_intercept, slope = B1)) + 
  geom_abline(intercept = B0, slope = B1, color = "blue") + 
  lims(x = c(50, 61), y = c(50, 135))

There are 36 gray model lines of net time (y-axis) vs age (x-axis), 1 for each runner. These all have the same slope, yet different intercepts which span a range of roughly 50 minutes. There is a blue model line that cuts through the center of the gray lines.

FIGURE 17.7: The posterior median models for our 36 runners $j$ as calculated from the hierarchical random intercepts model (gray), with the posterior median global model (blue).

17.2.4.3 Posterior analysis of within- and between-group variability

All of this talk about the variability in runner-specific models brings us to a posterior consideration of the final remaining model parameters, $σ_{y}$ and $σ_{0}$ . Whereas $σ_{y}$ measures the variability from the mean regression model within each runner, $σ_{0}$ measures the variability in baseline running speeds between the runners. The simulated datasets in Figure 17.8 provide some intuition. In scenario (a), the variability from the mean model within both groups ( $σ_{y}$ ) is quite small relative to the variability in the models between the groups ( $σ_{0}$ ), leading to a great distinction between these two groups. In scenario (b), $σ_{y}$ is larger than $σ_{0}$ , leading to little distinction between the groups.

There are 2 scatterplots of y vs x, labeled a and b. Both plots have 50 blue points and 50 black points. Both also have 2 parallel, upward-sloping model lines, one for the blue points and one for the black. In plot a, the blue points and black points don't overlap and are scattered tightly around their respective model line. In plot b, there's a lot of overlap in the blue points and black points which are quite scattered around their respective model line.

FIGURE 17.8: Simulated output for the relationship between response variable $Y$ and predictor $X$ when $σ_{y} < σ_{0}$ (a) and $σ_{y} > σ_{0}$ (b).

Posterior tidy() summaries for our variance parameters suggest that the running analysis is more like scenario (a) than scenario (b). For a given runner $j$ , we estimate that their observed running time at any age will deviate from their mean regression model by roughly 5.25 minutes ( $σ_{y}$ ). By the authors’ assessment (none of us professional runners!), this deviation is rather small in the context of a long 10-mile race, suggesting a rather strong relationship between running times and age within runners. In contrast, we expect that baseline speeds vary by roughly 13.3 minutes from runner to runner ( $σ_{0}$ ).

tidy_sigma <- tidy(running_model_1, effects = "ran_pars")
tidy_sigma
# A tibble: 2 x 3
  term                    group    estimate
  <chr>                   <chr>       <dbl>
1 sd_(Intercept).runner   runner      13.3 
2 sd_Observation.Residual Residual     5.25

Comparatively then, the posterior results suggest that $σ_{y} < σ_{0}$ – there’s greater variability in the models between runners than variability from the model within runners. Think about this another way. As with the simple hierarchical model in Section 16.3.3, we can decompose the total variability in race times across all runners and races into that explained by the variability between runners and that explained by the variability within each runner (16.8):

$Var (Y_{i j}) = σ_{0}^{2} + σ_{y}^{2} .$

Thus, proportionally (16.9), differences between runners account for roughly 86.62% (the majority!) of the total variability in racing times, with fluctuations among individual races within runners explaining the other 13.38%:

sigma_0 <- tidy_sigma[1,3]
sigma_y <- tidy_sigma[2,3]
sigma_0^2 / (sigma_0^2 + sigma_y^2)
  estimate
1   0.8662
sigma_y^2 / (sigma_0^2 + sigma_y^2)
  estimate
1   0.1338

17.3 Hierarchical model with varying intercepts & slopes

Since you’ve gotten this far in the book, you know that once in a while it’s important to stand back from the details and ask: can we do even better? A plot of the data for just four of our runners, Figure 17.9, suggests that the hierarchical random intercepts model (17.5) might oversimplify reality. Though this model recognizes that some runners tend to be faster than others, it assumes that the change in running time with age ( $β_{1}$ ) is the same for each runner. In reality, whereas some runners do slow down at similar rates (e.g., runners 4 and 5), some slow down quicker (e.g., runner 20) and some barely at all (e.g., runner 29).

# Plot runner-specific models in the data
running %>% 
  filter(runner %in% c("4", "5", "20", "29")) %>% 
  ggplot(., aes(x = age, y = net)) + 
    geom_point() + 
    geom_smooth(method = "lm", se = FALSE) + 
    facet_grid(~ runner)

There are 4 scatterplots of net running time (y-axis) vs age (x-axis), labeled 4, 5, 20, and 29. Each has 5 data points. In plots 4 and 5, the data points have a moderate upward trend. There is a steeper upward trend in plot 20 and a fairly flat trend in plot 29. Comparitively, the data points in plots 4 and 20 tend to be higher (above 100 minutes) than those in plot 29 (which fall below 70 minutes).

FIGURE 17.9: Scatterplots and observed trends in running times vs age for four subjects.

A snapshot of the observed trends for all 36 runners provides a more complete picture of just how much the change in net time with age might vary by runner:

ggplot(running, aes(x = age, y = net, group = runner)) + 
  geom_smooth(method = "lm", se = FALSE, size = 0.5)

A plot of 36 model lines for net time (y-axis) vs age (x-axis), 1 for each runner. Almost all of these lines have positive slopes, though very different intercepts. For example, some runners have lines that stay below 70 minutes across age, whereas others have lines that stay above 110 minutes.

FIGURE 17.10: Observed trends in running time versus age for the 36 subjects (gray) along with the posterior median model (blue).

How can we modify the random intercepts model (17.5) to recognize that the rate at which running time changes with age might vary from runner to runner?

17.3.1 Model building

Inspired by Figure 17.10, our goal is to build a model which recognizes that in the relationship between running time and age, both the intercept (i.e., baseline speed) and slope (i.e., rate at which speed changes with age) might vary from runner to runner. To this end, we can replace the global age coefficient $β_{1}$ in (17.5) by a runner-specific coefficient $β_{1 j}$ . Thus, the model of the relationship between running time and age within each runner $j$ becomes:

$\begin{matrix} (17.8) & Y_{i j} | β_{0 j}, β_{1 j}, σ_{y} \sim N (μ_{i j}, σ_{y}^{2}) where μ_{i j} = β_{0 j} + β_{1 j} X_{i j} . \end{matrix}$

Accordingly, just as we assumed in (17.5) that the runner-specific intercepts $β_{0 j}$ are Normally distributed around some global intercept $β_{0}$ with standard deviation $σ_{0}$ , we now also assume that the runner-specific age coefficients $β_{1 j}$ are Normally distributed around some global age coefficient $β_{1}$ with standard deviation $σ_{1}$ :

$\begin{matrix} (17.9) & β_{0 j} | β_{0}, σ_{0} \sim N (β_{0}, σ_{0}^{2}) and β_{1 j} | β_{1}, σ_{1} \sim N (β_{1}, σ_{1}^{2}) \end{matrix}$

But these priors aren’t yet complete – $β_{0 j}$ and $β_{1 j}$ work together to describe the model for runner $j$ , and thus are correlated. Let $ρ \in [- 1, 1]$ represent the correlation between $β_{0 j}$ and $β_{1 j}$ . To reflect this correlation, we represent the joint Normal model of $β_{0 j}$ and $β_{1 j}$ by

$\begin{matrix} (17.10) & (\begin{matrix} β_{0 j} \\ β_{1 j} \end{matrix}) | β_{0}, β_{1}, σ_{0}, σ_{1} \sim N ((\begin{matrix} β_{0} \\ β_{1} \end{matrix}), Σ) \end{matrix}$

where $(β_{0}, β_{1})$ is the joint mean and $Σ$ is the 2x2 covariance matrix which encodes the variability and correlation amongst $β_{0 j}$ and $β_{1 j}$ :

$\begin{matrix} (17.11) & Σ = (\begin{array}{cc} σ_{0}^{2} & ρ σ_{0} σ_{1} \\ ρ σ_{0} σ_{1} & σ_{1}^{2} \end{array}) . \end{matrix}$

Though this notation looks overwhelming, it simply indicates that $β_{0 j}$ and $β_{1 j}$ are both marginally Normal (17.9) and have correlation $ρ$ . The correlation $ρ$ between the runner-specific intercepts and slopes, $β_{0 j}$ and $β_{1 j}$ , isn’t just a tedious mathematical detail. It’s an interesting feature of the hierarchical model! Figure 17.11 provides some insight. Plot (a) illustrates the scenario in which there’s a strong negative correlation between $β_{0 j}$ and $β_{1 j}$ – models that start out lower (with small $β_{0 j}$ ) tend to increase at a more rapid rate (with higher $β_{1 j}$ ). In plot (c) there’s a strong positive correlation between $β_{0 j}$ and $β_{1 j}$ – models that start out higher (with larger $β_{0 j}$ ) also tend to increase at a more rapid rate (with higher $β_{1 j}$ ). In between these two extremes, plot (b) illustrates the scenario in which there’s no correlation between the group-specific intercepts and slopes.

There are 3 plots, each with 10 model lines of y vs x which have different intercepts and slopes. In plot a, the lines with smaller intercepts tend to have bigger slopes. In plot b, there is no apparent relationship between intercepts and slopes. In plot c, the lines with bigger intercepts tend to have bigger slopes, creating a fan shape in the lines.

FIGURE 17.11: Simulated output for the relationship between response variable $Y$ and predictor $X$ when $ρ < 0$ (a), $ρ = 0$ (b), and $ρ > 0$ (c).

Consider the implications of this correlation in the context of our running analysis.⁷⁴

In our running example, what would it mean for $β_{0 j}$ and $β_{1 j}$ to be negatively correlated?
1. Runners that start out slower (i.e., with a higher baseline), also tend to slow down at a more rapid rate.
2. The rate at which runners slow down over time isn’t associated with how fast they start out.
3. Runners that start out faster (i.e., with a lower baseline), tend to slow down at a more rapid rate.
Similarly, what would it mean for $β_{0 j}$ and $β_{1 j}$ to be positively correlated?

The completed hierarchical model (17.12) pulls together (17.8) and (17.10) with priors for the global parameters. For reasons you might imagine, this is often referred to as a hierarchical random intercepts and slopes model:

$\begin{matrix} (17.12) & \begin{aligned} Y_{i j} | β_{0 j}, β_{1 j}, σ_{y} & \sim N (μ_{i j}, σ_{y}^{2}) where μ_{i j} = β_{0 j} + β_{1 j} X_{i j} \\ (\begin{matrix} β_{0 j} \\ β_{1 j} \end{matrix}) | β_{0}, β_{1}, σ_{0}, σ_{1} & \sim N ((\begin{matrix} β_{0} \\ β_{1} \end{matrix}), Σ) \\ β_{0 c} & \sim N (100, 10^{2}) \\ β_{1} & \sim N (2.5, 1^{2}) \\ σ_{y} & \sim Exp (0.072) \\ Σ & \sim (decomposition of covariance) . \end{aligned} \end{matrix}$

Equivalently, we can re-express the random intercepts and slopes as random tweaks to the global intercept and slope: $μ_{i j} = (β_{0} + b_{0 j}) + (β_{1} + b_{1 j}) X_{i j}$ with

$(\begin{matrix} b_{0 j} \\ b_{1 j} \end{matrix}) | σ_{0}, σ_{1} \sim N ((\begin{matrix} 0 \\ 0 \end{matrix}), Σ) .$

Connecting our hierarchical models

The random intercepts model (17.5) is a special case of (17.12). When $σ_{1} = 0$ , i.e., when the group-specific age coefficients do not differ from group to group, these two models are equivalent.

Most of the pieces in this model are familiar. For global parameters $β_{0}$ and $β_{1}$ we use the tuned Normal priors from (17.7). For $σ_{y}$ we use a weakly informative prior. Yet there is one big new piece. We need a joint prior model to express our understanding of how the combined $σ_{0}$ , $σ_{1}$ , and $ρ$ parameters define covariance matrix $Σ$ (17.11). At the writing of this book, the stan_glmer() function allows users to define this prior through a decomposition of covariance, or decov(), model. Very generally speaking, this model decomposes our prior model for the covariance matrix into prior information about three separate pieces:

the correlation between the group-specific intercepts and slopes, $ρ$ (Figure 17.11);
the combined degree to which the intercepts and slopes vary by group, $σ_{0}^{2} + σ_{1}^{2}$ ; and⁷⁵
the relative proportion of this variability between groups that’s due to differing intercepts vs differing slopes,

$π_{0} = \frac{σ_{0}^{2}}{σ_{0}^{2} + σ_{1}^{2}} vs π_{1} = \frac{σ_{1}^{2}}{σ_{0}^{2} + σ_{1}^{2}} .$

Figure 17.12 provides some context on this third piece, displaying a few scenarios for the relationship between $π_{0}$ and $π_{1}$ . In general, $π_{0}$ and $π_{1}$ always sum to 1, and thus have a push-and-pull relationship. For example, when $π_{0} \approx 1$ and $π_{1} \approx 0$ , the variability in intercepts ( $σ_{0}^{2}$ ) is large in comparison to the variability in slopes ( $σ_{1}^{2}$ ). Thus, the majority of the variability between group-specific models is explained by differences in intercepts (plot a). In contrast, when $π_{0} \approx 0$ and $π_{1} \approx 1$ , the majority of the variability between group-specific models is explained by differences in slopes (plot c). In between these extremes, when $π_{0}$ and $π_{1}$ are both approximately 0.5, roughly half of the variability between groups can be explained by differing intercepts and the other half by differing slopes (plot b).

There are 3 plots, each with 20 model lines of y vs x. In plot a, the lines are parallel with different intercepts but equal slopes. In plot b, the lines have different intercepts and different slopes. In plot c, the lines share the same intercepts but have different slopes.

FIGURE 17.12: Simulated output for the relationship between response variable $Y$ and predictor $X$ when $π_{0} = 1$ and $π_{1} = 0$ (a), $π_{0} = π_{1} = 0.5$ (b), and $π_{0} = 0$ and $π_{1} = 1$ (c).

In our analysis, we’ll utilize the weakly informative default setting for the hierarchical random intercepts and slopes model: decov(reg = 1, conc = 1, shape = 1, scale = 1) in rstanarm notation. This makes the following prior assumptions regarding the three pieces above:

the correlation $ρ$ is equally likely to be anywhere between -1 and 1;
we have weakly informative prior information about the total degree to which the intercepts and slopes vary by runner; and
the relative proportion of the variability between runners that’s due to differing intercepts is equally likely to be anywhere between 0 and 1, i.e., we’re not at all sure if there’s more, less, or the same level of variability in the baseline speeds from runner to runner, $β_{0 j}$ , than in the rate at which their speeds change over time, $β_{1 j}$ .

We’ll utilize these default assumptions for the covariance prior in this book. Beyond the defaults, specifying and tuning the decomposition of covariance prior requires the acquisition of two new probability models. We present more optional detail in the next section and refer the curious reader to Gabry and Goodrich (2020a) for a more mathematical treatment that scales up to models beyond those considered here.

17.3.2 Optional: The decomposition of covariance model

Let’s take a closer look at the decomposition of covariance model. To begin, we define the three pieces that are important to specifying our prior understanding of the random intercepts and slopes covariance matrix $Σ$ (17.11), and thus the $(σ_{0}, σ_{1}, ρ)$ parameters by which it’s defined. These pieces are numbered in accordance to their corresponding interpretations above:

$\begin{aligned} R & = (\begin{array}{cc} 1 & ρ \\ ρ & 1 \end{array}) & (1) \\ τ & = \sqrt{σ_{0}^{2} + σ_{1}^{2}} & (2) \\ π & = (\begin{matrix} π_{0} \\ π_{1} \end{matrix}) = (\begin{matrix} \frac{σ_{0}^{2}}{σ_{0}^{2} + σ_{1}^{2}} \\ \frac{σ_{1}^{2}}{σ_{0}^{2} + σ_{1}^{2}} \end{matrix}) & (3) \end{aligned}$

We can decompose $Σ$ into a product which depends on $R$ , $τ$ , and $π$ . If you know some linear algebra, you can confirm this result, though the fact that we can rewrite $Σ$ is what’s important here:

$\begin{aligned} (\begin{array}{cc} σ_{0}^{2} & ρ σ_{0} σ_{1} \\ ρ σ_{0} σ_{1} & σ_{1}^{2} \end{array}) & = (\begin{array}{cc} σ_{0} & 0 \\ 0 & σ_{1} \end{array}) (\begin{array}{cc} 1 & ρ \\ ρ & 1 \end{array}) (\begin{array}{cc} σ_{0} & 0 \\ 0 & σ_{1} \end{array}) = diag (σ_{0}, σ_{1}) R diag (σ_{0}, σ_{1}) . \end{aligned}$

We can further decompose $(σ_{0}, σ_{1})$ into the product of $τ$ and $\sqrt{π}$ :

$(\begin{matrix} σ_{0} \\ σ_{1} \end{matrix}) = τ \sqrt{π} .$

And since we can rewrite $Σ$ using $R$ , $τ$ , and $π$ , we can also express our prior understanding of $Σ$ by our combined prior understanding of these three pieces. This joint prior, which we simply expressed above by

$Σ \sim (decomposition of covariance)$

is actually defined by three individual priors:

$\begin{aligned} R & \sim LKJ (η) \\ τ & \sim Gamma (s, r) \\ π & \sim Dirichlet (2, δ) \end{aligned}$

Let’s begin with the “LKJ” prior model on the correlation matrix $R$ with regularization hyperparameter $η > 0$ . In our model (17.12), $R$ depends only on the correlation $ρ$ between the group-specific intercepts $β_{0 j}$ and slopes $β_{1 j}$ . Thus, the LKJ prior model simplifies to a prior model on $ρ$ with pdf

$f (ρ) = [2^{1 - 2 η} \frac{Γ (2 η)}{Γ (η) Γ (η)}] (1 - ρ^{2})^{η - 1} for ρ \in [- 1, 1] .$

Figure 17.13 displays the LKJ pdf under a variety of regularization parameters $η$ , illustrating the important comparison of $η$ to 1:

Setting $η < 1$ indicates a prior understanding that the group-specific intercepts and slopes are most likely strongly correlated, though we’re not sure if this correlation is negative or positive.
When $η = 1$ , the LKJ model is uniform from -1 to 1, indicating that the correlation between the intercepts and slopes is equally likely to be anywhere in this range – we’re not really sure.
Setting $η > 1$ indicates a prior understanding that the group-specific intercepts and slopes are most likely weakly correlated ( $ρ \approx 0$ ). The greater $η$ is, the tighter and tighter the LKJ hugs values of $ρ$ near 0.

There are 3 density plots of rho, labeled LKJ(0.5), LKJ(1), and LKJ(4). The rho values range from -1 to 1. The LKJ(0.5) density curve is symmetric around 0. It's highest at the extreme rho values of -1 and 1, and flat in the middle. The LKJ(2) density curve is flat with a height of 0.5. The LKJ(4) curve is bell-shaped, centered at 0, and ranges from roughly -0.95 to 0.95.

FIGURE 17.13: The LKJ pdf under three regularization parameters.

Next, for the total standard deviation in the intercepts and slopes, $τ = \sqrt{σ_{0}^{2} + σ_{1}^{2}}$ , we utilize the “usual” Gamma prior (or its Exponential special case). Finally, consider the prior for the $π_{0}$ and $π_{1}$ parameters. Recall that $π_{0}$ and $π_{1}$ measure the relative proportion of the variability between groups that’s due to differing intercepts vs differing slopes, respectively. Thus, $π_{0}$ and $π_{1}$ are both restricted to values between 0 and 1 and must sum to 1:

$π_{0} + π_{1} = \frac{σ_{0}^{2}}{σ_{0}^{2} + σ_{1}^{2}} + \frac{σ_{1}^{2}}{σ_{0}^{2} + σ_{1}^{2}} = 1.$

Accordingly, we can utilize a joint symmetric Dirichlet $(2, δ)$ prior with concentration hyperparameter $δ > 0$ for $(π_{0}, π_{1})$ . The symmetric Dirichlet pdf defines the relative prior plausibility of valid $(π_{0}, π_{1})$ pairs:

$f (π_{0}, π_{1}) = \frac{Γ (2 δ)}{Γ (δ) Γ (δ)} (π_{0} π_{1})^{δ - 1} for π_{0}, π_{1} \in [0, 1] and π_{0} + π_{1} = 1 .$

In fact, in the special case when we have only two group-specific parameters, $β_{0 j}$ and $β_{1 j}$ , the symmetric Dirichlet model for $(π_{0}, π_{1})$ is equivalently expressed by:

$π_{0} \sim Beta (δ, δ) and π_{1} = 1 - π_{0} .$

Figure 17.14 displays the marginal symmetric Dirichlet pdf, i.e., Beta pdf, for $π_{0}$ under a variety of concentration parameters $δ$ , illustrating the important comparison of $δ$ to 1:

Setting $δ < 1$ places more prior weight on proportions $π_{0}$ near 0 or 1. This indicates a prior understanding that relatively little ( $π_{0} \approx 0$ ) or relatively much ( $π_{0} \approx 1$ ) of the variability between groups is explained by differences in intercepts rather than differences in slopes (Figure 17.12 plots a and c).
When $δ = 1$ , the marginal pdf on $π_{0}$ is uniform from 0 to 1, indicating that the variability in intercepts explains anywhere between 0 and all of the variability between groups – we’re not really sure.
Setting $δ > 1$ indicates a prior understanding that roughly half of the variability between groups is explained by differences in intercepts and the other half by differences in slopes (Figure 17.12 plot b). The greater $δ$ is, the tighter and tighter this prior hugs values of $π_{0}$ near 0.5.

There are 3 density plots of pi 0, labeled Dirichlet(0.5), Dirichlet(1), and Dirichlet(4). The pi 0 values range from 0 to 1. The Dirichlet(0.5) density curve is symmetric around 0.5. It's highest at the extreme pi 0 values of 0 and 1, and flat in the middle. The Dirichlet(1) density curve is flat with a height of 1. The Dirichlet(4) curve is bell-shaped, centered at 0.5, and ranges from roughly 0.05 to 0.95.

FIGURE 17.14: The marginal symmetric Dirichlet pdf under three concentration parameters.

When our models have both group-specific intercepts and slopes, we’ll use the following default decomposition of variance priors which indicate general uncertainty about the correlation between group-specific intercepts and slopes, the overall variability in group-specific model, and the relative degree to which this variability is explained by differing intercepts vs differing slopes:

$\begin{aligned} R & \sim LKJ (1) \\ τ & \sim Gamma(1,1), i.e., Exp (1) \\ π & \sim Dirichlet (2, 1) \end{aligned}$

In rstanarm notation, this prior is expressed by decov(reg = 1, conc = 1, shape = 1, scale = 1). To tune this prior, you can change the LKJ regularization parameter $η$ , the Gamma shape and scale parameters, and the Dirichlet concentration parameter $δ$ .

17.3.3 Posterior simulation & analysis

Finally, let’s simulate the posterior of our hierarchical random intercepts and slopes model of running times (17.12). This requires one minor tweak to our stan_glmer() call: instead of using the formula net ~ age + (1 | runner) we use net ~ age + (age | runner).

running_model_2 <- stan_glmer(
  net ~ age + (age | runner),
  data = running, family = gaussian,
  prior_intercept = normal(100, 10),
  prior = normal(2.5, 1), 
  prior_aux = exponential(1, autoscale = TRUE),
  prior_covariance = decov(reg = 1, conc = 1, shape = 1, scale = 1),
  chains = 4, iter = 5000*2, seed = 84735, adapt_delta = 0.99999
)

# Confirm the prior model specifications
prior_summary(running_model_2)

This simulation will be sloooooooow. Notice the additional argument in our stan_glmer() syntax: adapt_delta = 0.99999. Simply put, adapt_delta is a tuning parameter for the underlying MCMC algorithm, the technical details of which are outside the scope of this book. Prior to this example, we’ve been running our simulations using the default adapt_delta value of 0.95. However, in this particular example, the default produces a warning: There were 1 divergent transitions after warmup. This warning indicates that the MCMC algorithm had a tough time exploring the posterior plausible range of our parameter values. When encountering this issue, one strategy is to increase adapt_delta to some value closer to 1. Doing so produces a much slower, but more stable, simulation. For more technical details on divergent transitions and how to address them, we recommend Modrak (2019).

17.3.3.1 Posterior analysis of the global and group-specific parameters

Remember thinking that the 40 parameters in the random intercepts model was a lot? This new model has 78 parameters: 36 runner-specific intercept parameters $β_{0 j}$ , 36 runner-specific age coefficients $β_{1 j}$ , and 6 global parameters ( $β_{0}, β_{1}, σ_{y}, σ_{0}, σ_{1}, ρ$ ). Let’s examine these piece by piece, starting with the global model of the relationship between running time and age,

$β_{0} + β_{1} X .$

The results here for the random intercepts and slopes model (17.12) are quite similar to those for the random intercepts model (17.5): the posterior median model is 18.5 + 1.32 age.

# Quick summary of global regression parameters
tidy(running_model_2, effects = "fixed", conf.int = TRUE, conf.level = 0.80)
# A tibble: 2 x 5
  term        estimate std.error conf.low conf.high
  <chr>          <dbl>     <dbl>    <dbl>     <dbl>
1 (Intercept)    18.5     11.6       3.61     33.6 
2 age             1.32     0.217     1.04      1.59

Since the global mean model $β_{0} + β_{1} X$ captures the relationship between running time and age for the average runner, we shouldn’t be surprised that our two hierarchical models produced similar assessments. Where these two models start to differ is in their assessments of the runner-specific relationships. Obtaining the MCMC chains for the runner-specific intercepts and slopes gets quite technical. We encourage you to pick through the code below, line by line. Here are some important details to pick up on:

spread_draws() uses b[term, runner] to grab the chains for all runner-specific parameters. As usual now, these chains correspond to $b_{0 j}$ and $b_{1 j}$ , the differences between the runner-specific vs global intercepts and age coefficients.
pivot_wider() creates separate columns for each of the $b_{0 j}$ and $b_{1 j}$ chains and names these b_(Intercept) and b_age.
mutate() obtains the runner-specific intercepts $β_{0 j} = β_{0} + b_{0 j}$ , named runner_intercept, by summing the global (Intercept) and runner-specific adjustments b_(Intercept). The runner-specific $β_{1 j}$ coefficients, runner_age, are created similarly.

# Get MCMC chains for the runner-specific intercepts & slopes
runner_chains_2 <- running_model_2 %>%
  spread_draws(`(Intercept)`, b[term, runner], `age`) %>% 
  pivot_wider(names_from = term, names_glue = "b_{term}",
              values_from = b) %>% 
  mutate(runner_intercept = `(Intercept)` + `b_(Intercept)`,
         runner_age = age + b_age)

From these chains, we can obtain the posterior medians for each runner-specific intercept and age coefficient. Since we’re only obtaining posterior medians here, we use summarize() in combination with group_by() instead of using the median_qi() function:

# Posterior medians of runner-specific models
runner_summaries_2 <- runner_chains_2 %>% 
  group_by(runner) %>% 
  summarize(runner_intercept = median(runner_intercept),
            runner_age = median(runner_age))

# Check it out
head(runner_summaries_2, 3)
# A tibble: 3 x 3
  runner    runner_intercept runner_age
  <chr>                <dbl>      <dbl>
1 runner:1              18.6       1.06
2 runner:10             18.5       1.75
3 runner:11             18.5       1.32

Figure 17.15 plots the posterior median models for all 36 runners.

ggplot(running, aes(y = net, x = age, group = runner)) + 
  geom_abline(data = runner_summaries_2, color = "gray",
              aes(intercept = runner_intercept, slope = runner_age)) + 
  lims(x = c(50, 61), y = c(50, 135))

There are 36 gray model lines of net time (y-axis) vs age (x-axis). They have different intercepts which span a range of roughly 40 minutes. They all have moderately positive slopes that slightly differ from runner to runner.

FIGURE 17.15: The posterior median models for the 36 runners, as calculated from the hierarchical random intercepts and slopes model.

Hmph. Are you surprised? We were slightly surprised. The slopes do differ, but not as drastically as we expected. But then we remembered – shrinkage! Consider sample runners 1 and 10. Their posteriors suggest that, on average, runner 10’s running time increases by just 1.06 minute per year, whereas runner 1’s increases by 1.75 minutes per year:

runner_summaries_2 %>% 
  filter(runner %in% c("runner:1", "runner:10"))
# A tibble: 2 x 3
  runner    runner_intercept runner_age
  <chr>                <dbl>      <dbl>
1 runner:1              18.6       1.06
2 runner:10             18.5       1.75

There are 2 plots of net running time (y-axis) vs age (x-axis), labeled 1 and 10. In each plot is a nearly flat black line that falls at the vertical centers. Each also has a blue line and a dashed line. The blue and dashed lines fall below the black line in plot 1 and above the black line in plot 10. The dashed lines have similar, moderate slopes which are greater than the slope of the black line. However, the slopes of the blue lines differ -- it's negative in plot 1 and positive in plot 10.

FIGURE 17.16: For runners 1 and 10, the posterior median relationships between running time and age from the hierarchical random intercepts and slopes model (dashed) are contrasted by the observed no pooled models (blue) and the complete pooled model (black).

Figure 17.16 contrasts these posterior median models for runners 1 and 10 (dashed lines) by the complete pooled posterior models (black) and no pooled posterior models (blue). As usual, the hierarchical models strike a balance between these two extremes. Like the no pooled models, the hierarchical models do vary between the two runners. Yet the difference is not as stark. The hierarchical models are drawn away from the no pooled models and toward the complete pooled models. Though this shrinkage is subtle for runner 10, the association between running time and age switches from negative to positive for runner 1. This is to be expected. Unlike the no pooled approach, which models runner-specific relationships using only runner-specific data, our hierarchical model assumes that one runner’s behavior can tell us about another’s. Further, we have very few data points on each runner – at most 7 races. With so few observations, the other runners’ information has ample influence on our posterior understanding for any one individual (as it should). In the case of runner 1, the other 35 runners’ data is enough to make us think that this runner, too, will eventually slow down.

17.3.3.2 Posterior analysis of within- and between-group variability

Stepping back, we should also ask ourselves: Is it worth it? Incorporating the random runner-specific age coefficients introduced 37 parameters into our model of running time by age. Yet at least visually, there doesn’t appear to be much variation among the slopes of the runner-specific models. For a numerical assessment of this variation, we can examine the posterior trends in $σ_{1}$ (sd_age.runner). While we’re at it, we’ll also check out $σ_{0}$ (sd_(Intercept).runner), $ρ$ (cor_(Intercept).age.runner), and $σ_{y}$ (sd_Observation.Residual):

tidy(running_model_2, effects = "ran_pars")
# A tibble: 4 x 3
  term                       group    estimate
  <chr>                      <chr>       <dbl>
1 sd_(Intercept).runner      runner     1.34  
2 sd_age.runner              runner     0.251 
3 cor_(Intercept).age.runner runner    -0.0955
4 sd_Observation.Residual    Residual   5.17

Consider some highlights of this output:

The standard deviation $σ_{1}$ in the age coefficients $β_{1 j}$ is likely around 0.251 minutes per year. On the scale of a 10-mile race, this indicates very little variability between the runners when it comes to the rate at which running times change with age.
Per the output for $σ_{y}$ , an individual runner’s net times tend to deviate from their own mean model by roughly 5.17 minutes.
There’s a weak negative correlation of roughly -0.0955 between the runner-specific $β_{0 j}$ and $β_{1 j}$ parameters. Thus, it seems that, ever so slightly, runners that start off faster tend to slow down at a faster rate.

17.4 Model evaluation & selection

We have now built a sequence of three models of running time by age: a complete pooled model (17.1), a hierarchical random intercepts model (17.5), and a hierarchical random intercepts and slopes model (17.12). Figure 17.17 provides a reminder of the general assumptions behind these three models.

There are 3 plots of net time (y-axis) vs age (x-axis). The left plot has a singular line with a nearly flat, but positive slope. The middle plot has 36 parallel, upward-sloping lines. The right plot has 36 lines with different intercepts and slopes.

FIGURE 17.17: The idea behind three different approaches to modeling $Y$ by $X$ when utilizing group-structured data: complete pooling (left), hierarchical random intercepts (middle), and hierarchical random intercepts and slopes (right).

So which one should we use? To answer this question, we can compare our three models using the framework of Chapters 10 and 11, and asking these questions: (1) How fair is each model? (2) How wrong is each model? (3) How accurate are each model’s posterior predictions? Consider question (1). The context and data collection procedure is the same for each model. Since the data has been anonymized and runners are aware that race results will be public, we think this data collection process is fair. Further, though the models produce slightly different conclusions about the relationship between running time and age (e.g., the hierarchical models conclude this relationship is significant), none of these conclusions seem poised to have a negative impact on society or individuals. Thus, our three models are equally fair.

Next, consider question (2). Posterior predictive checks suggest that the complete pooled model comparatively underestimates the variability in running times – datasets of running time simulated from the complete pooled posterior tend to exhibit a slightly narrower range than the running times we actually observed. Thus, the complete pooled model is more wrong than the hierarchical models.

pp_check(complete_pooled_model) + 
  labs(x = "net", title = "complete pooled model")
pp_check(running_model_1) + 
  labs(x = "net", title = "running model 1")
pp_check(running_model_2) + 
  labs(x = "net", title = "running model 2")

There are 3 plots, labeled complete pooled model, running model 1, and running model 2. In each there are 50 light blue density curves of simulated net running time data and the same 1 dark blue density curve of the observed net running time data. This dark blue density curve is roughly bell-shaped, with two peaks near 80 and 100 minutes, and ranges from roughly 50 to 125 minutes. In each plot, the 50 light blue density curves are similar to each other and similar to the dark blue curve. The slight exception is in the complete pooled model where the 50 light blue density curves are slightly narrower than the dark blue curve.

FIGURE 17.18: Posterior predictive checks of the complete pooled model (left), random intercepts model (middle), and random intercepts and slopes model (right).

In fact, we know that the complete pooled model is wrong. By ignoring the data’s grouped structure, it incorrectly assumes that each race observation is independent of the others. Depending upon the trade-offs, we might live with this wrong but simplifying assumption in some analyses. Yet at least two signs point to this being a mistake for our running analysis.

The complete pooled model isn’t powerful enough to detect the significant relationship between running time and age.
Not only have we seen visual evidence that some runners tend to be significantly faster or slower than others, the posterior prediction summaries in Section 17.2.4 suggest that there’s significant variability between runners ( $σ_{0}$ ).

In light of this discussion, let’s drop the complete pooled model from consideration. In choosing between running_model_1 and running_model_2, consider question (3): what’s the predictive accuracy of these models? Recall some approaches to answering this question from Chapter 11: posterior prediction summaries and ELPD. To begin, we use the prediction_summary() function from the bayesrules package to compare how well these two models predict the running outcomes of the 36 runners that were part of our sample.

# Calculate prediction summaries
set.seed(84735)
prediction_summary(model = running_model_1, data = running)
    mae mae_scaled within_50 within_95
1 2.626      0.456    0.6865     0.973
prediction_summary(model = running_model_2, data = running)
   mae mae_scaled within_50 within_95
1 2.53     0.4424    0.7027     0.973

By all metrics, running_model_1 and running_model_2 produce similarly accurate posterior predictions. For both models, the observed net running times tend to be 2.63 and 2.53 minutes, or 0.46 and 0.44 standard deviations, from their posterior mean predictions. The posterior predictive models also have similar coverage in terms of the percent of observed running times that fall within their 50% and 95% prediction intervals.

Thinking beyond our own sample of runners, we could also utilize prediction_summary_cv() to obtain cross-validated metrics of posterior predictive accuracy. The idea is the same as for non-hierarchical models, but the details change to reflect the grouped structure of our data. To explore how well our models predict the running behavior of runners that weren’t included in our sample data, we divide the runners, not the individual race outcomes, into distinct folds. For example, for a 10-fold cross validation with 36 runners, each fold would include data on 3 or 4 of the sample runners. Thus, we would train each of 10 models using data on 32 or 33 of our sample runners and test it on the other 3 or 4. We include code for the curious but do not run it here.

prediction_summary_cv(model = running_model_1, data = running,
                      k = 10, group = "runner")

For hierarchical models, the prediction_summary_cv() function divides groups, not individual outcomes $Y$ , into distinct folds. Thus, if we have 10 groups, 10-fold cross-validation will build each training model using data on 9 groups and test it on the 10th group. This approach makes sense when we want to assess how well our model generalizes to new groups outside our sample for which we have no data. But this isn’t always our goal. Instead, we might want to assess how well our model predicts the new outcomes of groups for which we have at least some data. For example, instead of evaluating how well our model predicts the net times for new runners, we might wish to evaluate how well it predicts the next net time of a runner for whom we have data on past races. In short, prediction_summary_cv() is not a catch-all. For a discussion of the various approaches to cross-validation for hierarchical models, as well as how these can be implemented from scratch, we recommend Vehtari (2019).

Finally, consider one last comparison of our two hierarchical models: the cross-validated expected log-predictive densities (ELPD). The estimated ELPD for running_model_1 is lower (worse) than, though within two standard errors of, the running_model_2 ELPD. Hence, by this metric, there is not a significant difference in the posterior predictive accuracy of our two hierarchical models.

# Calculate ELPD for the 2 models
elpd_hierarchical_1 <- loo(running_model_1)
elpd_hierarchical_2 <- loo(running_model_2)


# Compare the ELPD
loo_compare(elpd_hierarchical_1, elpd_hierarchical_2)
                elpd_diff se_diff
running_model_2  0.0       0.0   
running_model_1 -1.6       1.2

After reflecting upon our model evaluation, we’re ready to make a final determination: we choose running_model_1. The choice of running_model_1 over the complete_pooled_model was pretty clear: the latter was wrong and didn’t have the power to detect a relationship between running time and age. The choice of running_model_1 over running_model_2 comes down to this: the complexity introduced by the additional random age coefficients in running_model_2 produced little apparent change or benefit. Thus, the additional complexity simply isn’t worth it (at least not to us).

17.5 Posterior prediction

Finally, let’s use our preferred model, running_model_1, to make some posterior predictions. Suppose we want to predict the running time that three different runners will achieve when they’re 61 years old: runner 1, runner 10, and Miles. Though Miles’ running prowess is a mystery, we observed runners 1 and 10 in our sample. Should their trends continue, we expect that runner 10’s time will be slower than that of runner 1 when they’re both 61:

# Plot runner-specific trends for runners 1 & 10
running %>% 
  filter(runner %in% c("1", "10")) %>% 
  ggplot(., aes(x = age, y = net)) + 
    geom_point() + 
    facet_grid(~ runner) + 
    lims(x = c(54, 61))

There are 2 scatterplots of net running time (y-axis) vs age (x-axis), labeled 1 and 10, with 4 and 5 data points respectively. For plot 1, the 4 data points all fall below 80 minutes and remain fairly flat across a span of 4 years. For plot 10, the 5 data points all fall below 100 minutes and increase to 130 minutes of the span of 6 years.

FIGURE 17.19: The observed net running times by age for runners 1 and 10.

In general, let $Y_{n e w, j}$ denote a new observation on an observed runner $j$ , specifically runner $j$ ’s running time at age 61. As in Chapter 16, we can approximate the posterior predictive model for $Y_{n e w, j}$ by simulating a prediction from the first layer of (17.3), that which describes the variability in race times $Y_{i j}$ , evaluated at each of the 20,000 parameter sets ${β_{0 j}^{(i)}, β_{1}^{(i)}, σ_{y}^{(i)}}$ in our MCMC simulation:

$Y_{new, j}^{(i)} | β_{0 j}, β_{1}, σ_{y} \sim N (μ_{i j}^{(i)}, {(σ_{y}^{(i)})}^{2}) where μ_{i j}^{(i)} = β_{0 j}^{(i)} + β_{1}^{(i)} \cdot 61 .$

The resulting posterior predictive model will reflect two sources of uncertainty in runner $j$ ’s race time: the within-group sampling variability $σ_{y}$ (we can’t perfectly predict runner $j$ ’s time from their mean model); and posterior variability in $β_{0 j}$ , $β_{1}$ , and $σ_{y}$ (the parameters defining runner $j$ ’s relationship between running time and age are unknown and random). Since we don’t have any data on the baseline speed for our new runner, Miles, there’s a third source of uncertainty in predicting his race time: between-group sampling variability $σ_{0}$ (baseline speeds vary from runner to runner). Though we recommend doing these simulations “by hand” to connect with the concepts of posterior prediction (as we did in Chapter 16), we’ll use the posterior_predict() shortcut function to simulate the posterior predictive models for our three runners:

# Simulate posterior predictive models for the 3 runners
set.seed(84735)
predict_next_race <- posterior_predict(
  running_model_1, 
  newdata = data.frame(runner = c("1", "Miles", "10"),
                       age = c(61, 61, 61)))

These posterior predictive models are plotted in Figure 17.20. As anticipated from their previous trends, our posterior expectation is that runner 10 will have a slower time than runner 1 when they’re 61 years old. Our posterior predictive model for Miles’ net time is somewhere in between these two extremes. The posterior median prediction is just under 100 minutes, similar to what we’d get if we plugged an age of 61 into the global posterior median model for the average runner:

B0 + B1 * 61
[1] 98.54

That is, without any information about Miles, our default assumption is that he’s an average runner. Our uncertainty in this assumption is reflected by the relatively wide posterior predictive model. Naturally, having observed data on runners 1 and 10, we’re more certain about how fast they will be when they’re 61. But Miles is a wild card – he could be really fast or really slow!

# Posterior predictive model plots
mcmc_areas(predict_next_race, prob = 0.8) +
 ggplot2::scale_y_discrete(labels = c("runner 1", "Miles", "runner 10"))

There are 3 bell-shaped density plots of net running time (x-axis), 1 for each runner: runner 1, Miles, and runner 10. The density plot for runner 1 is centered around 85 minutes and ranges from roughly 65 to 105 minutes. The density plot for runner 10 is much higher and has little overlap with that for runner 1. It is centered around 125 minutes and ranges from roughly 105 to 145 minutes. The density plot for Miles falls in the center and is much wider, thus overlaps with the plots for runners 1 and 10. Miles' density plot is centered around 100 minutes and ranges from roughly 40 to 160 minutes.

FIGURE 17.20: Posterior predictive models for the net running times at age 61 for sample runners 1 and 10, as well as Miles, a runner that wasn’t in our original sample.

17.6 Details: Longitudinal data

The running data on net times by age is longitudinal. We observe each runner over time, where this time (or aging) is of primary interest. Though our hierarchical models of this relationship account for the correlation in running times within any runner, they make a simplifying assumption about this correlation: it’s the same across all ages. In contrast, you might imagine that observations at one age are more strongly correlated with observations at similar ages. For example, a runner’s net time at age 60 is likely more strongly correlated with their net time at age 59 than at age 50. It’s beyond the scope of this book, but we can adjust the structure of our hierarchical models to account for a longitudinal correlation structure. We encourage the interested reader to check out the bayeslongitudinal R package (Carreño and Cuervo 2017) and the foundational paper by Laird and Ware (1982).

17.7 Example: Danceability

Let’s implement our hierarchical regression tools in a different context, by revisiting our spotify data from Chapter 16. There we studied a hierarchical model of song popularity, accounting for the fact that we had grouped data with multiple songs per sampled artist. Here we’ll switch our focus to a song’s danceability and how this might be explained by two features: its genre and valence or mood. The danceability and valence of a song are both measured on a scale from 0 (low) to 100 (high). Thus, lower valence scores are assigned to negative / sad / angry songs and higher scores to positive / happy / euphoric songs.

Starting out, we have a vague sense that the typical song has a danceability rating around 50, yet we don’t have any strong prior understanding of the possible relationship between danceability, genre, and valence, nor how this might differ from artist to artist. Thus, we’ll utilize weakly informative priors throughout our analysis. Moving on from the priors, load the data necessary to this analysis:

# Import and wrangle the data
data(spotify)
spotify <- spotify %>% 
  select(artist, title, danceability, valence, genre)

Figure 17.21 illustrates both some global and artist-specific patterns in the relationships among these variables:

ggplot(spotify, aes(y = danceability, x = genre)) + 
  geom_boxplot()
ggplot(spotify, aes(y = danceability, x = valence)) + 
  geom_point()
ggplot(spotify, aes(y = danceability, x = valence, group = artist)) + 
  geom_smooth(method = "lm", se = FALSE, size = 0.5)

There are 3 plots of danceability (y-axis) which ranges from 20 to 90. The left plot has boxplots of danceability for each of 6 genres. There is a lot of overlaps in the boxplots, having similar centers and spreads. The boxplot is slightly higher for the rap genre, and substantially lower for the rock genre. The middle plot is a scatterplot of danceability vs valence (x-axis) which ranges from 0 to 100. There are 350 data points that exhibit a moderate, positive relationship between danceability and valence. The right plot has 44 model lines of danceability vs valence (x-axis), 1 per artist. These have different intercepts, with some lines falling below the popularity of others. They also have different slopes, most of which are positive.

FIGURE 17.21: The relationship of danceability with genre (left) and valence (middle) for individual songs. The relationship of danceability with valence by artist (right).

When pooling all songs together, notice that rock songs tend to be the least danceable and rap songs the most (by a slight margin). Further, danceability seems to have a weak but positive association with valence. Makes sense. Sad songs tend not to inspire dance. Taking this all with a grain of salt, recall that the global models might mask what’s actually going on. To this end, the artist-specific models in the final plot paint a more detailed picture. The two key themes to pick up on here are that (1) some artists’ songs tend to be more danceable than others, and (2) the association between danceability and valence might differ among artists, though it’s typically positive.

To model these relationships, let’s define some notation. For song $i$ within artist $j$ , let $Y_{i j}$ denote danceability and $X_{i j 1}$ denote valence. Next, note that there are 6 different genres, edm being the baseline or reference. Thus, we must define 5 additional predictors, one for each non-edm genre. Specifically, let $(X_{i j 2}, X_{i j 3}, \dots, X_{i j 6})$ be indicators of whether a song falls in the latin, pop, r&b, rap, and rock genres, respectively. For example,

$X_{i j 2} = {\begin{cases} 1 & Latin genre \\ 0 & otherwise . \end{cases}$

Thus, for an edm song, all genre indicators are 0. We’ll consider two possible models of danceability by valence and genre. The first layer of Model 1 assumes that $Y_{i j} | (β_{0 j}, β_{1}, β_{2}, \dots, β_{6}, σ_{y}) \sim N (μ_{i j}, σ_{y}^{2})$ with

$μ_{i j} = β_{0 j} + β_{1} X_{i j 1} + β_{2} X_{i j 2} + \dots + β_{6} X_{i j 6} .$

The global coefficients $(β_{1}, β_{2}, \dots, β_{6})$ reflect an assumption that the relationships between danceability, valence, and genre are similar for each artist. Yet the artist-specific intercepts $β_{0 j}$ assume that, when holding constant a song’s valence and genre, some artists’ songs tend to be more danceable than other artists’ songs.

The first layer of Model 2 incorporates additional artist-specific valence coefficients, assuming $Y_{i j} | (β_{0 j}, β_{1 j}, β_{2}, \dots, β_{6}, σ_{y}) \sim N (μ_{i j}, σ_{y}^{2})$ with

$μ_{i j} = β_{0 j} + β_{1 j} X_{i j 1} + β_{2} X_{i j 2} + \dots + β_{6} X_{i j 6} .$

Thus, unlike Model 1, Model 2 assumes that the relationship between danceability and valence might differ by artist. This is consistent with the artist-specific models above – for most artists, danceability increased with the happiness of a song. For others, it decreased. Both models are simulated utilizing the following weakly informative priors:

spotify_model_1 <- stan_glmer(
  danceability ~ valence + genre + (1 | artist),
  data = spotify, family = gaussian,
  prior_intercept = normal(50, 2.5, autoscale = TRUE),
  prior = normal(0, 2.5, autoscale = TRUE), 
  prior_aux = exponential(1, autoscale = TRUE),
  prior_covariance = decov(reg = 1, conc = 1, shape = 1, scale = 1),
  chains = 4, iter = 5000*2, seed = 84735)

spotify_model_2 <- stan_glmer(
  danceability ~ valence + genre + (valence | artist), 
  data = spotify, family = gaussian,
  prior_intercept = normal(50, 2.5, autoscale = TRUE),
  prior = normal(0, 2.5, autoscale = TRUE), 
  prior_aux = exponential(1, autoscale = TRUE),
  prior_covariance = decov(reg = 1, conc = 1, shape = 1, scale = 1),
  chains = 4, iter = 5000*2, seed = 84735)

# Check out the prior specifications
prior_summary(spotify_model_1)
prior_summary(spotify_model_2)

Posterior predictive checks of these two models are similar, both models producing posterior simulated datasets of song danceability that are consistent with the main features in the original song data.

pp_check(spotify_model_1) +
  xlab("danceability")
pp_check(spotify_model_2) +
  xlab("danceability")

There are 2 plots, one corresponding to spotify_model_1 and the other to spotify_model_2. They appear to be very similar. Each has 51 density plots of danceability, 50 are light blue and 1 is dark blue. All are similar -- roughly bell-shaped, centered around 65, and ranging from 30 to 100. The dark blue line is slightly more left skewed with a slightly higher center.

FIGURE 17.22: Posterior predictive checks of Spotify Model 1 (left) and Model 2 (right).

Yet upon a comparison of their ELPDs, we think it’s best to go forward with Model 1. The quality of the two models do not significantly differ, and Model 1 is substantially simpler. Without more data per artist, it’s difficult to know if the artist-specific valence coefficients are insignificant due to the fact that the relationship between danceability and valence doesn’t vary by artist, or if we simply don’t have enough data per artist to determine that it does.

# Calculate ELPD for the 2 models
elpd_spotify_1 <- loo(spotify_model_1)
elpd_spotify_2 <- loo(spotify_model_2)

# Compare the ELPD
loo_compare(elpd_spotify_1, elpd_spotify_2)
                elpd_diff se_diff
spotify_model_2  0.0       0.0   
spotify_model_1 -5.3       3.0

Digging into Model 1, first consider posterior summaries for the global model parameters ( $β_{0}, β_{1}, \dots, β_{6})$ . For the average artist in any genre, we’d expect danceability to increase by between 2.16 and 3 points for every 10-point increase on the valence scale – a statistically significant but fairly marginal bump. Among genres, it appears that when controlling for valence, only rock is significantly less danceable than edm. Its 80% credible interval is the only one to lie entirely above or below 0, suggesting that for the average artist, the typical danceability of a rock song is between 1.42 and 12 points lower than that of an edm song with the same valence.

tidy(spotify_model_1, effects = "fixed",
     conf.int = TRUE, conf.level = 0.80)
# A tibble: 7 x 5
  term        estimate std.error conf.low conf.high
  <chr>          <dbl>     <dbl>    <dbl>     <dbl>
1 (Intercept)   53.3      3.02     49.4      57.1  
2 valence        0.258    0.0330    0.216     0.300
3 genrelatin     0.362    3.58     -4.15      5.02 
4 genrepop       0.466    2.67     -2.96      3.89 
5 genrer&b      -2.43     2.85     -6.05      1.22 
6 genrerap       1.68     3.15     -2.29      5.68 
7 genrerock     -6.73     4.14    -12.0      -1.42

In interpreting these summaries, keep in mind that the genre coefficients directly compare each genre to edm alone and not, say, rock to r&b. In contrast, mcmc_areas() offers a useful visual comparison of all genre posteriors. Other than the rock coefficient, 0 is a fairly posterior plausible value for the other genre coefficients, reaffirming that these genres aren’t significantly more or less danceable than edm. There’s also quite a bit of overlap in the posteriors. As such, though there’s evidence that some of these genres are more danceable than others (e.g., rap vs r&b), the difference isn’t substantial.

# Plot the posterior models of the genre coefficients
mcmc_areas(spotify_model_1, pars = vars(starts_with("genre")), prob = 0.8) + 
  geom_vline(xintercept = 0)

There are 5 bell-shaped density plots, one for each of the 5 genre-related coefficients, and a vertical line at a coefficient value of 0. There is a lot of overlap in the densities. The rock genre has the leftmost density, centered near -6 and ranging from roughly -15 to 3. It is the only density curve that falls mostly below 0. The other density plots are centered around values closer to 0 and are similar to one another.

FIGURE 17.23: Posterior models for the genre-related coefficients in Spotify Model 1.

Finally, consider some posterior results for two artists in our sample: Missy Elliott and Camilo. The below tidy() summary compares the typical danceability levels of these artists to the average artist, $b_{0 j} = β_{0 j} - β_{0}$ . When controlling for valence and genre, Elliott’s songs tend to be significantly more danceable than the average artist’s, whereas Camilo’s tend to be less danceable. For example, there’s an 80% posterior chance that Elliott’s typical song danceability is between 2.78 and 14.8 points higher than average.

tidy(spotify_model_1, effects = "ran_vals",
     conf.int = TRUE, conf.level = 0.80) %>% 
  filter(level %in% c("Camilo", "Missy_Elliott")) %>% 
  select(level, estimate, conf.low, conf.high)
# A tibble: 2 x 4
  level         estimate conf.low conf.high
  <chr>            <dbl>    <dbl>     <dbl>
1 Camilo           -2.00    -7.29      3.44
2 Missy_Elliott     8.69     2.78     14.8

To predict the danceability of their next songs, our hierarchical regression model takes into consideration the artists’ typical danceability levels as well as the song’s valence and genre. Suppose that both artists’ next songs have a valence score of 80, but true to their genres, Elliot’s is a rap song and Camilo’s is in the Latin genre. Figure 17.24 plots both artists’ posterior predictive models along with that of Mohsen Beats, a rock artist that wasn’t in our sample but also releases a song with a valence level of 80. As we’d expect, the danceability of Elliott’s song is likely to be the highest among these three. Further, though Camilo’s typical danceability is lower than average, we expect Mohsen Beats’s song to be even less danceable since it’s of the least danceable genre.

# Simulate posterior predictive models for the 3 artists
set.seed(84735)
predict_next_song <- posterior_predict(
  spotify_model_1,
  newdata = data.frame(
    artist = c("Camilo", "Mohsen Beats", "Missy Elliott"), 
    valence = c(80, 60, 90), genre = c("latin", "rock", "rap")))

# Posterior predictive model plots
mcmc_areas(predict_next_song, prob = 0.8) +
 ggplot2::scale_y_discrete(
   labels = c("Camilo", "Mohsen Beats", "Missy Elliott"))

There are 3 bell-shaped density plots of danceability, one for each of 3 artists: Camilo, Mohsen Beats, and Missy Elliott. There is moderate overlap in the densities. Mohsen Beats has the leftmost density, centered near 63 and ranging from roughly 13 to 113. Camilo's density plot is slightly higher and narrower, centered near 70 and ranging from roughly 35 to 105. Missy Elliott's density plot is higher yet, centered near 87 and ranging from roughly 57 to 117.

FIGURE 17.24: Posterior predictive models for the popularity of the next songs by Missy Elliott, Mohsen Beats, and Camilo.

One last thing

If you’re paying close attention, you might notice a flaw in our model. It produces posterior predictions of danceability that exceed the maximum danceability level of 100. This is a consequence of using the Normal to model a response variable with a limited range (here 0 - 100). The Normal assumption was “good enough” here, but we might consider replacing it with, say, a Beta model.

17.8 Chapter summary

In Chapter 17 we explored how to build a hierarchical model of $Y$ by predictors $X$ when working with group structured or hierarchical data. Suppose we have multiple data points on each of $m$ groups. For each group $j \in {1, 2, \dots, m}$ , let $Y_{i j}$ and $(X_{i j 1}, X_{i j 2}, \dots, X_{i j p})$ denote the $i$ th set of observed data on response variable $Y$ and $p$ different predictors $X$ . Then a Normal hierarchical regression model of $Y$ vs $X$ consists of three layers: it combines information about the relationship between $Y$ and $X$ within groups, with information about how these relationships vary between groups, with our prior understanding of the broader global population. Letting $β_{j}$ denote a set of group-specific parameters and ( $β, σ_{y}, σ$ ) a set of global parameters:

$\begin{aligned} Y_{i j} | β_{j}, σ_{y} & \sim N (μ_{i j}, σ_{y}^{2}) & regression model within group j \\ β_{j} | β, σ & \sim N (β, σ^{2}) & variability in regression parameters between groups \\ β, σ_{y}, σ, . . . & \sim \dots & prior models on global parameters \end{aligned}$

Where we have some choices to make is in the definition of the regression mean $μ_{i j}$ . In the simplest case of a random intercepts model, we assume that groups might have unique baselines $β_{0 j}$ , yet share a common relationship between $Y$ and $X$ :

$\begin{aligned} μ_{i j} & = β_{0 j} + β_{1} X_{i j 1} + β_{2} X_{i j 2} + \dots + β_{p} X_{i j p} \\ = (β_{0} + b_{0 j}) + β_{1} X_{i j 1} + β_{2} X_{i j 2} + \dots + β_{p} X_{i j p} . \end{aligned}$

In the most complicated case of a random intercepts and slopes model, we assume that groups have unique baselines and unique relationships between $Y$ and each $X$ :

$\begin{aligned} μ_{i j} & = β_{0 j} + β_{1 j} X_{i j 1} + β_{2 j} X_{i j 2} + \dots + β_{p j} X_{i j p} \\ = (β_{0} + b_{0 j}) + (β_{1} + b_{1 j}) X_{i j 1} + (β_{2} + b_{2 j}) X_{i j 2} + \dots + (β_{p} + b_{p j}) X_{i j p} . \end{aligned}$

In between, we might assume that some predictors need group-specific coefficients and others don’t:

$\begin{aligned} μ_{i j} & = β_{0 j} + β_{1 j} X_{i j 1} + β_{2} X_{i j 2} + \dots + β_{p} X_{i j p} \\ = (β_{0} + b_{0 j}) + (β_{1} + b_{1 j}) X_{i j 1} + β_{2} X_{i j 2} + \dots + β_{p} X_{i j p} . \end{aligned}$

17.9 Exercises

17.9.1 Conceptual exercises

Exercise 17.1 (Translating assumptions into model notation) To test the relationship between reaction times and sleep deprivation, researchers enlisted 3 people in a 10-day study. Let

Y_{i j}

denote the reaction time (in ms) to a given stimulus and

X_{i j}

the number of days of sleep deprivation for the

i

th observation on subject

j

. For each set of assumptions below, use mathematical notation to represent an appropriate Bayesian hierarchical model of

Y_{i j}

X_{i j}

Not only do some people tend to react more quickly than others, sleep deprivation might impact some people’s reaction times more than others.
Though some people tend to react more quickly than others, the impact of sleep deprivation on reaction time is the same for all.
Nobody has inherently faster reaction times, though sleep deprivation might impact some people’s reaction times more than others.

Exercise 17.2 (Sketch the assumption: Part 1) Continuing with the sleep study, suppose we model the relationship between reaction time

Y_{i j}

and days of sleep deprivation

X_{i j}

using the random intercepts model (17.5).

Sketch a plot of data that we might see if $σ_{y} > σ_{0}$ .
Explain what $σ_{y} > σ_{0}$ would mean in the context of the sleep study.
Repeat part a assuming $σ_{y} < σ_{0}$ .
Repeat part b assuming $σ_{y} < σ_{0}$ .

Exercise 17.3 (Sketch the assumption: Part 2) Suppose instead that we model the relationship between reaction time

Y_{i j}

and days of sleep deprivation

X_{i j}

using the random intercepts and slopes model (17.12) (with different priors).

Sketch a plot of subject-specific trends that we might see if the correlation parameter were positive $ρ > 0$ .
Explain what $ρ > 0$ would mean in the context of the sleep study.
Repeat part a for $ρ < 0$ .
Repeat part b for $ρ < 0$ .

Exercise 17.4 (Making meaning of models) To study the relationship between weight and height among pug puppies, you collect data on 10 different litters, each containing 4 to 6 puppies born to the same mother. Let

Y_{i j}

and

X_{i j}

denote the weight and height, respectively, for puppy

i

in litter

j

Write out formal model notation for model 1, a random intercepts model of $Y_{i j}$ vs $X_{i j}$ .
Write out formal model notation for model 2, a random intercepts and slopes model of $Y_{i j}$ vs $X_{i j}$ .
Summarize the key differences in the assumptions behind models 1 and 2. Root this discussion in the puppy context.

Exercise 17.5 (Translating models to code) Suppose we had weight and height data for the puppy study. Write out appropriate stan_glmer() model code for models 1 and 2 from Exercise 17.4.

17.9.2 Applied exercises

Exercise 17.6 (Sleep: setting up the model) In the above conceptual exercises, you considered a sleep study of the relationship between reaction time and the number of days of sleep deprivation. You will explore this relationship in more depth here. To this end, suppose researchers tell us that on a typical day, the average person should have a reaction time of roughly 250ms to the stimulus used in the sleep study. Beyond this baseline, we’ll balance weakly informative priors with the sleepstudy data from the lme4 package to better understand reaction times. Specifically, consider two possible models as expressed by stan_glmer() syntax:

model	formula
1	`Reaction ~ Days + (1 \| Subject)`
2	`Reaction ~ Days + (Days \| Subject)`

What’s the grouping variable in the sleepstudy data and why is it important to incorporate this grouping structure into our analysis?
Use formal notation to define the hierarchical regression structure of models 1 and 2. (You will tune the priors in the next exercise.)
Summarize the key differences between models 1 and 2. Root this discussion in the sleep study.
Using the sleepstudy data, construct and discuss a plot that helps you explore which model is more appropriate: 1 or 2.

Exercise 17.7 (Sleep: simulating the model) Continuing with the sleep analysis, let’s simulate and dig into the hierarchical posteriors.

Simulate the posteriors of models 1 and 2. Remember to use a baseline reaction time of 250ms, and weakly informative priors otherwise.
For model 2, report the global posterior median regression model of Reaction time.
For model 2, construct and interpret 80% credible intervals for the Days regression coefficient.
For model 2, calculate and interpret the posterior medians of $σ_{y}$ , $σ_{0}$ , $σ_{1}$ , and $ρ$ .

Exercise 17.8 (Sleep: group-specific inference) Next, let’s dig into what Model 2 indicates about the individuals that participated in the sleep study.

Use your posterior simulation to identify the person for whom reaction time changes the least with sleep deprivation. Write out their posterior median regression model.
Repeat part a, this time for the person for whom reaction time changes the most with sleep deprivation.
Use your posterior simulation to identify the person that has the slowest baseline reaction time. Write out their posterior median regression model.
Repeat part c, this time for the person that has the fastest baseline reaction time.
Simulate, plot, and discuss the posterior predictive model of reaction time after 5 days of sleep deprivation for two subjects: you and Subject 308. You’re encouraged to try this from scratch before relying on the posterior_predict() shortcut.

Exercise 17.9 (Sleep: Which model?)

Evaluate the two models of reaction time: Are they wrong? Are they fair? How accurate are their posterior predictions?
Which of the two models do you prefer and what does this indicate about the relationship between reaction time and sleep deprivation? Justify your answer with posterior evidence.

Exercise 17.10 (Voices: setting up the model) Does one’s voice pitch change depending on attitude? To address this question, Winter and Grawunder (2012) conducted a study in which each subject participated in various role-playing dialogs. These dialogs spanned different contexts (e.g., asking for a favor) and were approached with different attitudes (polite vs informal). In the next exercises you’ll explore a hierarchical regression analysis of

Y_{i j}

, the average voice pitch in subject

j

’s

i

th dialog session (measured in Hz), by

X_{i j}

, whether or not the dialog was polite (vs informal). Beyond a baseline understanding that the typical voice pitch is around 200 Hz, you should utilize weakly informative priors.

Using formal notation, define the hierarchical regression model of $Y_{i j}$ vs $X_{i j 1}$ . In doing so, assume that baseline voice pitch differs from subject to subject, but that the impact of attitude on voice pitch is similar among all subjects.
Compare and contrast the meanings of model parameters $β_{0 j}$ and $β_{0}$ in the context of this voice pitch study. NOTE: Remember that $X_{i j}$ is a categorical indicator variable.
Compare and contrast the meanings of model parameters $σ_{y}$ and $σ_{0}$ in the context of this voice pitch study.

Exercise 17.11 (Voices: check out some data) To balance our weakly informative priors for the model of pitch by attitude, check out some data.

Load the voices data from the bayesrules package. How many study subjects are included in this sample? In how many dialogs did each subject participate?
Construct and discuss a plot which illustrates the relationship between voice pitch and attitude both within and between subjects.

Exercise 17.12 (Voices: simulating the model) Continuing with the voice pitch analysis, in this exercise you will simulate and dig into the hierarchical posterior of your model parameters.

Simulate the hierarchical posterior model of voice pitch by attitude. Construct and discuss trace plots, density plots, autocorrelation plots, and a pp_check() of the chain output.
Construct and interpret a 95% credible interval for $β_{0}$ .
Construct and interpret a 95% credible interval for $β_{1}$ .
Is there ample evidence that, for the average subject, voice pitch differs depending on attitude (polite vs informal)? Explain.

Exercise 17.13 (Voices: focusing on the individual) Continuing with the voice pitch analysis, in this exercise you will focus on specific subjects.

Report the global posterior median model of the relationship between voice pitch and attitude.
Report and contrast the posterior median models for two subjects in our data: A and F.
Using posterior_predict(), simulate posterior predictive models of voice pitch in a new polite dialog for three different subjects: A, F, and you. Illustrate your simulation results using mcmc_areas() and discuss your findings.

17.9.3 Open-ended exercises

Exercise 17.14 (Open exercise: coffee) What makes some coffee beans more winning than others? In this open-ended exercise, construct, interpret, and evaluate a hierarchical regression model of the professional rating or “total cup points” awarded to a batch of coffee beans. Do so using the coffee_ratings_small data from the bayesrules package which contains information for multiple batches of coffee beans from each of 27 farms. You get to choose which predictors to use.

Exercise 17.15 (Sleep: different priors) In our earlier sleep analysis, we utilized weakly informative priors. Pretending that you haven’t already seen the data, specify a model of Reaction time by Days of sleep deprivation using priors that you tune yourself. Use prior simulation to illustrate your prior understanding.

References

Carreño, Edwin Javier Castillo, and Edilberto Cepeda Cuervo. 2017. Bayeslongitudinal: Adjust Longitudinal Regression Models Using Bayesian Methodology. https://CRAN.R-project.org/package=bayeslongitudinal.

Gabry, Jonah, and Ben Goodrich. 2020a. “Estimating Generalized Linear Models with Group-Specific Terms with Rstanarm.” https://mc-stan.org/rstanarm/articles/glmer.html.

Laird, N., and J. Ware. 1982. “Random-Effects Models for Longitudinal Data.” Biometrics 38 4: 963–74.

Modrak, Martin. 2019. “Divergent Transitions – a Primer.” https://discourse.mc-stan.org/t/divergent-transitions-a-primer/17099.

Vehtari, Aki. 2019. “Cross-Validation for Hierarchical Models.” https://avehtari.github.io/modelselection/rats_kcv.html.

Winter, Bodo, and Sven Grawunder. 2012. “The Phonetic Profile of Korean Formal and Informal Speech Registers.” Journal of Phonetics 40: 808–15.

Answers: 1 = c; 2 = a↩︎
Technically, this is the combined variability in group-specific intercepts and slopes when assuming they are uncorrelated.↩︎