Chapter 6 Approximating the Posterior

Welcome to Unit 2!

In Unit 1, you learned to think like a Bayesian and to build some fundamental Bayesian models in this spirit. Further, by cranking these models through Bayes’ Rule, you were able to mathematically specify the corresponding posteriors. Those days are over. Though merely hypothetical for now, some day (starting in Chapter ??!) the models you’ll be interested in analyzing will get too complicated to mathematically specify.

Never fear – data analysts are not known to throw up their hands in the face of the unknown. When we can’t know or specify something, we approximate it. In Unit 2 you will learn to use Markov chain Monte Carlo simulation techniques to approximate otherwise out of reach posterior models as well as how to use the resulting output for Bayesian posterior analysis. As such, Unit 2 serves as a critical bridge to applying the fundamental concepts from Unit 1 in the more sophisticated model settings of Unit 3 and beyond.

Learning requires the occasional leap. You’ve already taken a few. From Chapter 2 to Chapter 3, you took the leap from using simple discrete priors to using continuous Beta priors for a proportion \(\pi\). From Chapter 3 to Chapter 5, you took the leap from engineering the Beta-Binomial model to a family of Bayesian models that can be applied in a wider variety of settings. With each leap, your Bayesian toolkit became more flexible and powerful, but at the cost of the underlying math becoming a bit more complicated. As you continue to generalize your Bayesian methods in more sophisticated settings, this complexity will continue to grow.

Consider Michelle’s run for president. In Section 2.2 you built a simple Bayesian model of the event that Michelle would secure her party’s nomination given that she won the Iowa caucus. In Chapter 3 you took your analysis one step further, building a model of Michelle’s support in Minnesota based on polling data in that state. You could continue to refine your analysis of Michelle’s chances of becoming president. To begin, you could model Michelle’s support in all fifty states and Washington, D.C.. Better yet, this model might incorporate data on past state-level voting trends and demographics. The trade-off is that increasing your model’s flexibility also makes it more complicated. Whereas your Minnesota-only model depended upon only one parameter \(\pi\), Michelle’s level of support in that state, the new model depends upon dozens of parameters. Here, let \(\theta = (\theta_1, \theta_2, \ldots, \theta_k)\) denote a generic set of \(k \ge 1\) parameters upon which a Bayesian model depends. In your growing Bayesian model of Michelle’s election chances, \(\theta\) includes 51 parameters that represent her support in each state as well as multiple parameters that define the relationships between Michelle’s support among voters, state-level demographics, and past voting trends. Our posterior analysis thus relies on building the posterior pdf of \(\theta\) given a set of observed data \(y\) on state-level polls, demographics, and voting trends,

\[f(\theta | y) = \frac{f(\theta)L(\theta | y)}{f(y)} \propto f(\theta)L(\theta | y) \; .\]

Though this formula looks familiar, complexity lurks beneath. Since \(\theta\) is a vector with \(k\) elements, \(f(\theta)\) and \(L(\theta|y)\) are complicated multivariate functions. And, if we value our time, we can forget about calculating the normalizing constant \(f(y)\) across all possible \(\theta\), an intractable multiple integral (or multiple sum) for which a closed form solution might not exist:

\[f(y) = \int_{\theta_1}\int_{\theta_2} \cdots \int_{\theta_k} f(\theta)L(\theta | y) d\theta_k \cdots d\theta_2 d\theta_1 \; .\]

Fortunately, when a Bayesian posterior is either impossible or prohibitively difficult to specify, we’re not out of luck. We must simply change our strategy: instead of specifying the posterior, we can approximate the posterior via simulation. We’ll explore two simulation techniques: grid approximation and Markov chain Monte Carlo (MCMC). Both techniques produce a sample of \(N\) \(\theta\) values,

\[\left\lbrace \theta^{(1)}, \theta^{(2)}, \ldots, \theta^{(N)} \right\rbrace \; ,\]

the properties of which reflect those of the posterior model for \(\theta\). In Chapter 6, we’ll explore these simulation techniques in the context of the familiar Beta-Binomial and Gamma-Poisson models. Though these models don’t require simulation (we can and did specify their posteriors in Unit 1), exploring simulation in these familiar settings will help us build intuition for the process and give us peace of mind that it actually works when we eventually do need it.

  • Implement and examine the limitations of using grid approximation to simulate a posterior model.
  • Explore the fundamental properties of MCMC posterior simulation techniques and how these differ from grid approximation.
  • Implement MCMC simulation in R.
  • Learn several Markov chain diagnostics for examining the quality of an MCMC posterior approximation.

But first, load some packages that we’ll be utilizing throughout the remainder of this chapter (and book!). Among these, rstan is quite unique, thus be sure to revisit the Preface for directions on installing this package.

# Load packages
library(tidyverse)
library(janitor)
library(rstan)
library(bayesplot)

6.1 Grid approximation

Imagine there’s an image that you can’t view in its entirety - you only observe snippets along a grid that sweeps from left to right across the image. The finer the grid, the clearer the image. And if the grid is fine enough, the result is an imperfect but excellent approximation of the complete image:

This is the big picture idea behind Bayesian grid approximation, in which case the target “image” is posterior pdf \(f(\theta | y)\). We needn’t observe \(f(\theta | y)\) at every possible \(\theta\) to get a sense of its structure. Rather, we can evaluate \(f(\theta | y)\) at a finite, discrete grid of possible \(\theta\) values. Subsequently, we can take random samples from this discretized pdf to approximate the full posterior pdf \(f(\theta | y)\). We formalize these ideas here and apply them below.

Grid approximation

Grid approximation produces a sample of \(N\) independent \(\theta\) values, \(\left\lbrace \theta^{(1)}, \theta^{(2)}, \ldots, \theta^{(N)} \right\rbrace\), from a discretized approximation of posterior pdf \(f(\theta|y)\). This algorithm evolves in four steps:

  1. Define a discrete grid of possible \(\theta\) values.
  2. Evaluate the prior pdf \(f(\theta)\) and likelihood function \(L(\theta|y)\) at each \(\theta\) grid value.
  3. Obtain a discrete approximation of posterior pdf \(f(\theta |y)\) by: (a) calculating the product \(f(\theta)L(\theta|y)\) at each \(\theta\) grid value; and then (b) normalizing the products so that they sum to 1 across all \(\theta\).
  4. Randomly sample \(N\) \(\theta\) grid values with respect to their corresponding normalized posterior probabilities.

6.1.1 A Beta-Binomial example

To bring the grid approximation technique to life, let’s explore the following generic Beta-Binomial model (one without a corresponding data story):

\[\begin{equation} \begin{split} Y|\pi & \sim \text{Bin}(10, \pi) \\ \pi & \sim \text{Beta}(2, 2) \; . \\ \end{split} \tag{6.1} \end{equation}\]

We can interpret \(Y\) here as the number of successes in 10 independent trials. Each trial has probability of success \(\pi\) where our prior understanding about \(\pi\) is captured by a \(\text{Beta}(2, 2)\) model. Suppose we observe \(Y = 9\) successes. Then by our work in Chapter 3, we know that the updated posterior model of \(\pi\) is Beta with parameters 11 (\(Y + \alpha = 9 + 2\)) and 3 (\(n - Y + \beta = 10 - 9 + 2\)):

\[\pi | (Y = 9) \sim \text{Beta}(11, 3) \; .\]

We’re now going to ask you to forget that we were able to specify this posterior. Instead, we’ll explore how we can approximate the posterior using grid approximation. As Step 1, we need to split the continuum of possible \(\pi\) values on 0 to 1 into a finite grid. We’ll start with a course grid of only 6 \(\pi\) values, \(\pi \in \{0, 0.2, 0.4, 0.6, 0.8, 1\}\):

# Step 1: Define a grid of 6 pi values
pi_grid   <- seq(from = 0, to = 1, length = 6)
grid_data <- data.frame(pi_grid)

In Step 2 we evaluate the \(\text{Beta}(2,2)\) prior pdf and \(\text{Bin}(10, \pi)\) likelihood function with \(Y = 9\) at each \(\pi\) in pi_grid:

# Step 2: Evaluate the prior & likelihood at each pi
grid_data <- grid_data %>% 
  mutate(prior = dbeta(pi_grid, 2, 2)) %>% 
  mutate(likelihood = dbinom(9, 10, pi_grid))

In Step 3, we calculate the product of the likelihood and prior at each grid value. This provides an unnormalized discrete approximation of the posterior pdf (that doesn’t sum to one) which we subsequently normalize:

# Step 3: Approximate the posterior
grid_data <- grid_data %>% 
  mutate(unnormalized = likelihood * prior) %>% 
  mutate(posterior = unnormalized / sum(unnormalized))

# Confirm that the posterior approximation sums to 1
grid_data %>% 
  summarize(sum(unnormalized), sum(posterior))
  sum(unnormalized) sum(posterior)
1             0.318              1

The resulting discretized posterior pdf, rounded to 2 decimal places and plotted below, provides a very rigid glimpse into the actual posterior pdf. It places a roughly combined 99% chance on \(\pi\) being either 0.6 or 0.8, a 1% chance on \(\pi\) being 0.4, and a near 0% chance on the other 3 \(\pi\) grid values:

# Examine the grid approximated posterior
round(grid_data, 2)
  pi_grid prior likelihood unnormalized posterior
1     0.0  0.00       0.00         0.00      0.00
2     0.2  0.96       0.00         0.00      0.00
3     0.4  1.44       0.00         0.00      0.01
4     0.6  1.44       0.04         0.06      0.18
5     0.8  0.96       0.27         0.26      0.81
6     1.0  0.00       0.00         0.00      0.00

# Plot the grid approximated posterior
ggplot(grid_data, aes(x = pi_grid, y = posterior)) + 
  geom_point() + 
  geom_segment(aes(x = pi_grid, xend = pi_grid, 
    y = 0, yend = posterior))

You might anticipate what will happen when we use this approximation to simulate samples from the posterior in Step 4. Each sample draw has only 6 possible outcomes and is highly likely to be 0.6 or 0.8. Let’s try it: use sample_n() to take a sample of size = 10000 values from the 6-length grid_data, with replacement and using the discretized posterior probabilities as sample weights.

# Set the seed
set.seed(84735)

# Step 4: sample from the discretized posterior
post_sample <- sample_n(grid_data, size = 10000, 
  weight = posterior, replace = TRUE)

As expected, most of our 10,000 sample values of \(\pi\) were 0.6 or 0.8, few were 0.4, and none were below 0.4 or above 0.8:

# A table of the 10000 sample values
post_sample %>% 
  tabyl(pi_grid) %>% 
  adorn_totals("row")
 pi_grid     n percent
     0.4    69  0.0069
     0.6  1885  0.1885
     0.8  8046  0.8046
   Total 10000  1.0000

This is an extremely oversimplified approximation of the true \(\text{Beta}(11, 3)\) posterior. If you need any more convincing, the plot below superimposes the true target posterior pdf \(f(\pi|y)\) on a scaled histogram of the 10,000 sample values. If this were a good approximation, the histogram would mimic the shape, location, and spread of the smooth pdf. It does not.

# Histogram of the grid simulation with posterior pdf
ggplot(post_sample, aes(x = pi_grid)) + 
  geom_histogram(aes(y = ..density..), color = "white") + 
  stat_function(fun = dbeta, args = list(11, 3)) + 
  lims(x = c(0, 1))

Remember the rainbow image and how we got a more complete picture by viewing snippets along a finer grid? Similarly, instead of chopping up the 0-to-1 continuum of possible \(\pi\) values into a grid of only 6 values, let’s try a more reasonable grid of 101 values: \(\pi \in \{0, 0.01, 0.02, \ldots, 0.99, 1\}\). The first 3 steps of the grid approximation using this refined grid are performed below:

# Step 1: Define a grid of 101 pi values
pi_grid    <- seq(from = 0, to = 1, length = 101)
grid_data  <- data.frame(pi_grid)

# Step 2: Evaluate the prior & likelihood at each pi
grid_data <- grid_data %>% 
  mutate(prior = dbeta(pi_grid, 2, 2)) %>% 
  mutate(likelihood = dbinom(9, 10, pi_grid))

# Step 3: Approximate the posterior
grid_data <- grid_data %>% 
  mutate(unnormalized = likelihood * prior) %>% 
  mutate(posterior = unnormalized / sum(unnormalized))

The discretized posterior pdf is quite smooth, especially in comparison to the rigid approximation when we only used 6 grid values:

ggplot(grid_data, aes(x = pi_grid, y = posterior)) + 
  geom_point() + 
  geom_segment(aes(x = pi_grid, xend = pi_grid, 
    y = 0, yend = posterior))

Finally, in step 4 of the grid approximation, sample and plot 10,000 draws from the discretized posterior pdf:

# Set the seed
set.seed(84735)

# Step 4: sample from the discretized posterior
post_sample <- sample_n(grid_data, size = 10000, 
  weight = posterior, replace = TRUE)

# Histogram of the grid simulation with posterior pdf 
ggplot(post_sample, aes(x = pi_grid)) + 
  geom_histogram(aes(y = ..density..), color = "white", 
    binwidth = 0.05) + 
  stat_function(fun = dbeta, args = list(11, 3)) + 
  lims(x = c(0, 1))

The shape, location, and spread of these sample values mimic those of the true target posterior pdf \(f(\pi|y)\), represented by the smooth curve. Exciting! Take a step back to appreciate what we’ve just accomplished. We’ve taken an independent sample from a discretized approximation of the posterior pdf \(f(\pi|y)\) that’s defined only on a grid of \(\pi\) values. The algorithm is pretty intuitive and didn’t require advanced programming skills to implement. Most of all, the result was a sample that behaves much like a random sample taken directly from \(f(\pi|y)\) itself. Thus grid approximation can be a lifeline in scenarios in which the posterior is intractable, providing an approximation that’s nearly indistinguishable from the “real thing.”

6.1.2 A Gamma-Poisson example

For practice, let’s apply the grid approximation technique to simulate the posterior of another Bayesian model, this time a Gamma-Poisson. Let \(Y\) be the number of events that occur in a one hour period, where events occur at an average rate of \(\lambda\) per hour. Further, suppose we collect two data points (\(Y_1,Y_2\)) and place a \(\text{Gamma}(3, 1)\) prior on \(\lambda\):

\[\begin{equation} \begin{split} Y_i|\lambda & \stackrel{ind}{\sim} \text{Pois}(\lambda) \\ \lambda & \sim \text{Gamma}(3, 1) \; . \\ \end{split} \tag{6.2} \end{equation}\]

If we observe \(Y_1 = 2\) events in the first one-hour observation period and \(Y_2 = 8\) in the next, then by our work in Chapter 5, the updated posterior model of \(\lambda\) is Gamma with parameters 13 (\(s + \sum Y = 3 + 10\)) and 3 (\(r + n = 1 + 2\)):

\[\lambda | ((Y_1,Y_2) = (2,3)) \sim \text{Gamma}(13, 3) \; .\]

Let’s approximate this posterior using grid approximation. In Step 1, we must specify a discrete grid of reasonable \(\lambda\) values. Unlike the \(\pi\) parameter in the Beta-Binomial which is restricted to the finite 0-to-1 interval, the Poisson parameter \(\lambda\) can technically take on any non-negative real value (\(\lambda \in [0,\infty)\)). However, in a plot of the Gamma prior pdf and Poisson likelihood function it appears that, though possible, values of \(\lambda\) beyond 15 are implausible:

With this in mind, we’ll set up a discrete grid of \(\lambda\) values between 0 and 15, essentially truncating the tail of the posterior. Before completing the corresponding grid approximation together, we encourage you to challenge yourself by first trying this on your own.

Fill in the code below to construct a grid approximation of the Gamma-Poisson posterior corresponding to (6.2). In doing so, use a grid of 501 \(\lambda\) values between 0 and 15.

# Step 1: Define a grid of 501 lambda values
lambda_grid <- seq(from = ___, to = ___, length = ___)
grid_data   <- data.frame(lambda_grid)

# Step 2: Evaluate the prior & likelihood at each lambda
grid_data <- grid_data %>% 
  ___(prior = dgamma(___, ___, ___)) %>% 
  ___(likelihood = dpois(___, ___) * dpois(___, ___))

# Step 3: Approximate the posterior
grid_data <- grid_data %>% 
  ___(unnormalized = ___) %>% 
  ___(posterior = ___)

# Set the seed
set.seed(84735)

# Step 4: sample from the discretized posterior
post_sample <- sample_n(___, size = ___, 
  weight = ___, replace = ___)

Once you’ve quizzed yourself, check out the complete code below. Notice that much of this is the same as it was for the Beta-Binomial model. There are two key differences. First, we use dgamma() and dpois() instead of dbeta() and dbinom() to evaluate the prior and likelihood. Second, since we have a sample of two data points (\(Y_1,Y_2\)), the Poisson joint likelihood function of \(\lambda\) is the product of the marginal likelihoods, \(L(\lambda | Y_1,Y_2) = L(\lambda | Y_1) L(\lambda | Y_2)\).

# Step 1: Define a grid of 501 lambda values
lambda_grid <- seq(from = 0, to = 15, length = 501)
grid_data   <- data.frame(lambda_grid)

# Step 2: Evaluate the prior & likelihood at each lambda
grid_data <- grid_data %>% 
  mutate(prior = dgamma(lambda_grid, 3, 1)) %>% 
  mutate(likelihood = dpois(2, lambda_grid) * dpois(8, lambda_grid))

# Step 3: Approximate the posterior
grid_data <- grid_data %>% 
  mutate(unnormalized = likelihood * prior) %>% 
  mutate(posterior = unnormalized / sum(unnormalized))

# Set the seed
set.seed(84735)

# Step 4: sample from the discretized posterior
post_sample <- sample_n(grid_data, size = 10000, 
  weight = posterior, replace = TRUE)

Most importantly, grid approximation again produced a decent approximation of our target posterior:

# Histogram of the grid simulation with posterior pdf 
ggplot(post_sample, aes(x = lambda_grid)) + 
  geom_histogram(aes(y = ..density..), color = "white") + 
  stat_function(fun = dgamma, args = list(13, 3)) + 
  lims(x = c(0, 15))

6.1.3 Limitations

Some say that all good things must come to an end. Though we don’t agree with this saying in general, it happens to be true in the case of grid approximation. Limitations in the grid approximation method quickly present themselves as our models get more complicated. Mainly, when our model has multiple (possibly thousands) of model parameters \(\theta = (\theta_1, \theta_2, \ldots, \theta_k)\), grid approximation suffers from the curse of dimensionality. Let’s return to our image approximation for some intuition. Above we assumed that we could only see snippets of the image along a grid that sweeps from left to right along the x-axis. This is analogous to using grid approximation to simulate a model with one parameter. Suppose instead that our model has two parameters. Or, in the case of the image approximation, we can only see snippets of the image along a grid that sweeps from left to right along the x-axis and from top to bottom along the y-axis:

When we chop both the x- and y-axes into grids, there are bigger gaps in the image approximation. To achieve a more refined approximation, we need a finer grid than when we only chopped the x-axis into a grid. Analogously, when using grid approximation to simulate multivariate posteriors, we need to divide the multidimensional sample space of \(\theta = (\theta_1, \theta_2, \ldots, \theta_k)\) into a very, very fine grid in order to prevent big gaps in our approximation. In practice, this might not be feasible. When evaluated on finer and finer grids, the grid approximation method becomes computationally expensive. You can’t merely start the simulation, get up for a cup of coffee, come back, and poof the simulation is done. You might have to start the simulation and go off to a month-long meditation retreat (to practice the patience you’ll need for grid approximation). MCMC methods provide a more flexible alternative.

6.2 Markov chains via RStan

MCMC methods and their coinage hold some historical significance. The Markov chain component of MCMC is named for the Russian mathematician Andrey Markov (1856 - 1922). The etymology of the Monte Carlo component is more nuanced and dubious. As part of their top secret nuclear weapons project in the 1940’s, Stanislav Ulam, John von Neumann, and their collaborators at the Los Alamos National Laboratory used Markov chains to simulate and better understand neutron travel (Eckhardt 1987). One perk of such top secret government work is getting to pick a code name for your project. The Los Alamos team chose “Monte Carlo,” a code name which is said to be inspired by the opulent Monte Carlo casino in the equally opulent Principality of Monaco on the French Riviera. What inspired the link between random simulation methods and casinos? We can’t help you there.

A billboard at a uranium processing plant in Oak Ridge, TN, a sister site of Los Alamos. Image: https://commons.wikimedia.org/wiki/File:Oak_Ridge_Wise_Monkeys.jpg

FIGURE 6.1: A billboard at a uranium processing plant in Oak Ridge, TN, a sister site of Los Alamos. Image: https://commons.wikimedia.org/wiki/File:Oak_Ridge_Wise_Monkeys.jpg

Today, the application of Markov chains to simulate probability models using methods pioneered by the Monte Carlo project is known as “Markov chain Monte Carlo.” As desired, MCMC simulation methods scale up for more complicated Bayesian models. But this flexibility doesn’t come for free. Like grid approximation samples, MCMC samples are not taken directly from the posterior pdf \(f(\theta|y)\). Yet unlike grid approximation samples, MCMC samples aren’t even independent - each subsequent sample value depends directly upon the previous value. This feature is known as the Markov property and is evoked by the “chain” terminology. Specifically, let \(\left\lbrace \theta^{(1)}, \theta^{(2)}, \ldots, \theta^{(N)} \right\rbrace\) be an \(N\)-length Markov chain. Then the model from which the \((i+1)\)st chain value is drawn depends upon the preceding chain values only through the most recent value:

\[f\left(\theta^{(i + 1)} \; | \; \theta^{(1)}, \theta^{(2)}, \ldots, \theta^{(i)}, y\right) = f\left(\theta^{(i + 1)} \; | \; \theta^{(i)}, y\right) \; .\]

This dependence reiterates the fact that each chain value is drawn from some pdf other than the target posterior:

\[f\left(\theta^{(i + 1)} \; | \; \theta^{(i)}, y\right) \ne f\left(\theta^{(i + 1)} \; | \; y\right) \; .\]

Markov chain Monte Carlo

MCMC simulation produces a chain of \(N\) dependent \(\theta\) values, \(\left\lbrace \theta^{(1)}, \theta^{(2)}, \ldots, \theta^{(N)} \right\rbrace\), which are not drawn from the posterior pdf \(f(\theta|y)\).

It likely seems strange to approximate the posterior using a dependent sample that’s not even taken from the posterior. But mathemagically, with reasonable MCMC algorithms, it can work. The rub is that these algorithms have a steeper learning curve than the grid approximation technique. If you’re curious about the details, we encourage you to explore Chapter 7. If you’re not curious or are short on time, you can get through the remainder of the book without the details. Though a firm grasp of these algorithms would be ideal, there’s a growing number of MCMC computing resources that can do the heavy lifting for us. In this book, we’ll utilize RStan, which combines the power of R with the Stan engine. You’ll get into the nitty gritty of RStan in Chapter 6, building up the code line by line and thereby gaining familiarity with the key elements of an RStan simulation. Starting in Chapter ??, you will utilize the complementary RStanArm package which provides shortcuts for simulating a broad framework of Bayesian applied regression models (Arm).

6.2.1 A Beta-Binomial example

There are two essential steps to all RStan analyses: (1) define the Bayesian model structure in RStan notation; and (2) simulate the posterior. Let’s examine these steps in the context of the Beta-Binomial model defined by (6.1), starting with step 1. In defining the structure of this Beta-Binomial model, we need to specify the three aspects upon which it depends:

  • data
    Data \(Y\) is the observed number of successes in 10 trials. RStan isn’t a mind reader – we must specify that \(Y\) is an integer between 0 and 10.

  • parameters
    The model depends upon parameter \(\pi\), or pi in RStan notation. We must specify that \(\pi\) can be any real number between 0 and 1.

  • model
    The model is defined by the Bin(10,\(\pi\)) likelihood for \(Y\) and the Beta(2,2) prior for \(\pi\). We specify these using binomial() and beta().

Below we translate this Beta-Binomial structure into RStan syntax and store it as the character string bb_model. We encourage you to pause and examine the code, noting how it matches up with the three aspects above.

# STEP 1: DEFINE the model
bb_model <- "
  data {
    int<lower=0, upper=10> Y;
  }
  parameters {
    real<lower=0, upper=1> pi;
  }
  model {
    Y ~ binomial(10, pi);
    pi ~ beta(2, 2);
  }
"

In step 2, we simulate the posterior using the stan() function. Very loosely speaking, stan() designs and runs an MCMC algorithm to produce an approximate sample from the Beta-Binomial posterior.

The simulation will be quite slow for each new model.

# STEP 2: SIMULATE the posterior
bb_sim <- stan(
  model_code = bb_model, data = list(Y = 9), 
  chains = 4, iter = 5000*2, seed = 84735)

Note that stan() requires two types of arguments. First, we must specify the model information by:

  • model_code = the character string defining the model (here bb_model)
  • data = a list of the observed data (here Y = 9).

Second, we must specify the desired Markov chain information using three additional arguments:

  • The chains argument specifies how many parallel Markov chains to run. We run four chains here, a choice we discuss in Section 6.3.

  • The iter argument specifies the desired number of iterations in each Markov chain. By default, the first half of these iterations are thrown out as “burn-in” or “warm-up” samples, the idea being that it takes some time before a Markov chain starts producing values that mimic a random sample from the posterior. The second half is kept as the final Markov chain sample. Thus we choose the number of iterations to be double our desired Markov chain sample size.

  • To set the random number generating seed for an RStan simulation, we utilize the seed argument within the stan() function.

The result, stored in bb_sim, is a stanfit object. This object includes four parallel Markov chains run for 10,000 iterations each. After tossing out the first 5,000 iterations of all four chains, we end up with four separate Markov chain samples of size 5,000, or a combined Markov chain sample size of 20,000. The first 4 \(\pi\) values for each chain are extracted and shown here:

as.array(bb_sim, pars = "pi") %>% 
  head(4)
, , parameters = pi

          chains
iterations chain:1 chain:2 chain:3 chain:4
      [1,]  0.9403  0.8777  0.3803  0.6649
      [2,]  0.9301  0.9802  0.8186  0.6501
      [3,]  0.9012  0.9383  0.8458  0.7001
      [4,]  0.9224  0.9540  0.7336  0.5902

It’s important to remember that these Markov chain values are NOT a random sample from the posterior and are NOT independent. Rather, each of the four parallel chains forms a dependent 5,000 length Markov chain of \(\pi\) values, \(\left(\pi^{(1)}, \pi^{(2)}, \ldots, \pi^{(5000)} \right)\). For example, in iteration 1, chain:1 starts at a value of roughly \(\pi^{(1)}\) = 0.9403. The value at iteration 2, \(\pi^{(2)}\), depends upon \(\pi^{(1)}\). In this case the chain moves from 0.9403 to 0.9301. Similarly, the chain moves from 0.9301 to 0.9012, from 0.9012 to 0.9224, and so on. Thus the chain traverses the sample space or range of posterior plausible \(\pi\) values. A Markov chain trace plot illustrates this traversal, plotting the \(\pi\) value (y-axis) in each iteration (x-axis). We use the mcmc_trace() function in the bayesplot package to construct the trace plot of all four Markov chains:

mcmc_trace(bb_sim, pars = "pi", size = 0.5)
Trace plots of the four parallel Markov chains for the Beta-Binomial model.

FIGURE 6.2: Trace plots of the four parallel Markov chains for the Beta-Binomial model.

Figure 6.3 zooms in on the trace plot of chain 1. In the first 20 iterations (left), the chain largely explores values between 0.57 and 0.94. After 200 iterations (right), the Markov chain has started to explore new territory, traversing a slightly wider range of values between 0.49 and 0.96. Both trace plots also exhibit evidence of the dependence among the Markov chain values, places where the chain tends to float up for multiple iterations and then down for multiple iterations.

A trace plot of the first 20 iterations (left) and 200 iterations (right) of the first Beta-Binomial Markov chain.

FIGURE 6.3: A trace plot of the first 20 iterations (left) and 200 iterations (right) of the first Beta-Binomial Markov chain.

The trace plots in Figure 6.3 illuminate the longitudinal behavior of the Markov chains, marking each value of each subsequent iteration. We also want to examine the distribution of the values these chains visit along their journey, ignoring the order of these visits. The histogram and density plot below provide a snapshot of this distribution for the combined 20,000 chain values, 5,000 from each of the four separate chains. Notice the important punchline here: the distribution of the Markov chain values produces an excellent approximation of the target Beta(11, 3) posterior model of \(\pi\) (superimposed in red). That’s a relief – that was the whole point!

# Histogram of the Markov chain values
mcmc_hist(bb_sim, pars = "pi")

# Density plot of the Markov chain values
mcmc_dens(bb_sim, pars = "pi")

6.2.2 A Gamma-Poisson example

For more rstan practice, let’s apply these tools to approximate the Gamma-Poisson posterior corresponding to (6.2) upon observing data \((Y_1,Y_2) = (2,8)\). Recall that this involves two steps. In step 1, we define the structure of the Gamma-Poisson model, specifying the three aspects upon which it depends: the data, parameters, and model. In step 2, we simulate the posterior. We encourage you to challenge yourself by trying this on your own first. Your code will involve two terms you haven’t yet seen, but might guess how to use: poisson() and gamma(). Further, your code will incorporate a vector of \((Y_1,Y_2)\) variables and observations as opposed to a single variable \(Y\).

Fill in the code below to construct an MCMC approximation of the Gamma-Poisson posterior corresponding to (6.2). In doing so, run four parallel chains for 10,000 iterations each (resulting in a sample size of 5,000 per chain).

# STEP 1: DEFINE the model
gp_model <- "
  data {
    int<___> Y[2];
  }
  parameters {
    real<___> lambda;
  }
  model {
    Y ~ ___(___);
    lambda ~ ___(___, ___);
  }
"

# STEP 2: SIMULATE the posterior
gp_sim <- ___(
  model_code = ___, data = list(___), 
  chains = ___, iter = ___, seed = 84735)

We now present the solution. In defining the model in step 1, take note of the three aspects upon which it depends.

  • data
    Data Y[2] represents the vector of the event counts, \((Y_1,Y_2)\), where the counts can be any non-negative integers in \(\{0,1,2,\ldots\}\).

  • parameters
    The model depends upon parameter \(\lambda\) which can be any non-negative real number in \([0,\infty)\).

  • model
    The model is defined by the Pois(\(\lambda\)) likelihood for \(Y\) and the Gamma(3,1) prior for \(\lambda\). We specify these using poisson() and gamma().

Further, in step 2, we must feed in the vector of both data points, Y = c(2,8). It follows that:

# STEP 1: DEFINE the model
gp_model <- "
  data {
    int<lower=0> Y[2];
  }
  parameters {
    real<lower=0> lambda;
  }
  model {
    Y ~ poisson(lambda);
    lambda ~ gamma(3, 1);
  }
"

# STEP 2: SIMULATE the posterior
gp_sim <- stan(
  model_code = gp_model, data = list(Y = c(2,8)), 
  chains = 4, iter = 5000*2, seed = 84735)

Trace plots illustrate the paths these chains take through the \(\lambda\) sample space. To this end, note that the \(\lambda\) chain values range from roughly 0 to 10, but mostly remain below 7.5:

# Construct trace plots
mcmc_trace(gp_sim, pars = "lambda", size = 0.5)

A histogram and density plot provide a better look into the overall distribution of \(\lambda\) Markov chain values. This distribution provides an excellent approximation of the Gamma(13,3) posterior (superimposed in red):

# Histogram of the Markov chain values
mcmc_hist(gp_sim, pars = "lambda")

# Density plot of the Markov chain values
mcmc_dens(gp_sim, pars = "lambda")

6.3 Markov chain diagnostics

Simulation is fantastic! Even when we’re able to derive a Bayesian posterior, we can use simulated samples from the posterior to verify our work and provide some intuition. More importantly, when we’re not able to derive a posterior, simulated samples provide a crucial approximation. In the MCMC examples above, we saw that the Markov chains traverse the sample space of our parameter (\(\pi\) or \(\lambda\)) and, in the end, mimic a random sample that converges to the posterior. “Approximation” and “converges” are key words here - simulations aren’t perfect. This begs the following questions, stated here for MCMC simulations though similar questions pertain to grid approximation:

  • What does a good Markov chain look like?
  • How accurate is the Markov chain approximation of the posterior?
  • How big should our Markov chain sample size be? For how many iterations should we run the Markov chain?

Answering these questions is both an art and science. There are no one-size-fits-all magic formulas that provide definitive answers here. Rather, it’s through experience that you get a feel for what “good” Markov chains look like and what you can do to fix a “bad” Markov chain. In this section, we’ll focus on a couple visual diagnostic tools that will get us started: trace plots and parallel chains. These can be followed up with and supplemented by two numerical diagnostics: effective sample size and R-hat (\(\hat{R}\)). Utilizing these diagnostics should be done holistically. Since no single visual or numerical diagnostic is one-size-fits-all, they provide a fuller picture of Markov chain quality when considered together.

6.3.1 Examining trace plots

Reexamine the trace plots for the Beta-Binomial simulation in Figure 6.2. These are textbook examples of what we want trace plots to look like. Mainly, they look like a bunch of white noise with no discernible trends or notable phenomena. This nothingness implies that the chains are stable. In contrast, the hypothetical Markov chains for the same Beta-Binomial model shown in Figure 6.4 illustrate potential chain behavior that should give us pause.

Trace plots (left) and corresponding density plots (right) of two hypothetical Markov chains. These provide examples of what “bad” Markov chains might look like. The superimposed black lines (right) represent the target Beta(11,3) posterior pdf.

FIGURE 6.4: Trace plots (left) and corresponding density plots (right) of two hypothetical Markov chains. These provide examples of what “bad” Markov chains might look like. The superimposed black lines (right) represent the target Beta(11,3) posterior pdf.

First, the trend of the top chain has not yet stabilized after 5,000 iterations, suggesting that it has not yet “found” or does not yet know how to explore the range of posterior plausible \(\pi\) values. It is mixing slowly. The downward trend also hints at strong correlation among the draws. All of this is bad. Though Markov chains are inherently dependent, the more they behave like independent samples, the smaller the error in the resulting posterior approximation (roughly). The bottom chain exhibits a different problem. It tends to get stuck when it visits smaller values of \(\pi\).

Both of these goofy-looking chains result in a serious issue: they produce poor approximations of the Beta(11,3) posterior (superimposed in black). After 5,000 iterations, we can see that the top chain hasn’t yet explored the tails of the posterior and that, upon getting stuck, the bottom chain over-samples some values in the left tail of \(\pi\) (hence the various spikes in the density plot). In turn, these poor posterior approximations would lead to misleading conclusions about our posterior model.

In practice, we run rstan simulations when we can’t specify, thus want to approximate the posterior. This means that we don’t typically have the privilege of being able to compare our simulation results to the “real” posterior. This is why diagnostics are so important. If we see bad trace plots like those in Figure 6.4, there are two immediate steps we can take:

  1. Check the model. Are there mistakes in the code? Are there mistakes in the model itself, ie. are the assumed prior and likelihood models appropriate?
  2. Run the chain for more iterations. Some undesirable short-term chain trends might iron out in the long-term.

We’ll get practice with the more nuanced Step 1 throughout the book. Step 2 is easy, though requires extra computation time.

6.3.2 Comparing parallel chains

Recall that our stan() simulation for the Beta-Binomial model produced four parallel Markov chains. Not only do we want to see stability in each individual chain (as discussed above), we want to see consistency across the four chains. Mainly, though we expect different chains take different paths, they should enjoy similar features and produce similar approximations of the posterior. For example, the trace plots of the four parallel chains in Figure 6.2 appear similar in their randomness. Further, in the density plots below, we see that these four chains produce nearly indistinguishable approximations of the Beta(11,3) posterior. This provides evidence that our simulation is stable and sufficiently long – running the chains for more iterations likely wouldn’t produce drastically different or improved approximations of the posterior.

# Density plots of individual chains
mcmc_dens_overlay(bb_sim, pars = "pi")

For a point of comparison, let’s implement a shorter Markov chain simulation for the same model. Instead of running four parallel chains for 10,000 iterations and a resulting sample size of 5,000 each, run four parallel chains for only 100 iterations and a resulting sample size of 50 each:

# STEP 2: SIMULATE the posterior
bb_sim_short <- stan(
  model_code = bb_model, data = list(Y = 9), 
  chains = 4, iter = 50*2, seed = 84735)

The trace plots and corresponding density plots of the short Markov chains are shown below. Though the chains’ trace plots exhibit similar random behavior, their corresponding density plots differ, hence they produce discrepant approximations of the posterior. In the face of such instability, it would be a mistake to stop our simulation after only 100 iterations.

# Trace plots of short chains
mcmc_trace(bb_sim_short, pars = "pi")

# Density plots of individual short chains
mcmc_dens_overlay(bb_sim_short, pars = "pi")

6.3.3 Calculating effective sample size

Recall from Section 6.3.1 that the more a dependent Markov chain behaves like an independent sample, the smaller the error in the resulting posterior approximation (loosely speaking). Though trace plots provide some visual insight into this behavior, supplementary numerical assessments can provide more nuanced information. To begin, recall that our Markov chain simulation bb_sim produced a combined 20,000 dependent values of \(\pi\), sampled from models other than the posterior. We then used this chain to approximate the posterior model of \(\pi\). Knowing that the error in this approximation is likely larger than if we had used 20,000 independent sample values drawn directly from the posterior begs the following question: How many independent sample values would it take to produce an equivalently accurate posterior approximation? The effective sample size provides an answer.

Effective sample size

Let \(N\) denote the actual sample size or length of a dependent Markov chain. The effective sample size of this chain, \(N_{eff}\), quantifies the number of independent samples it would take to produce an equivalently accurate posterior approximation. The greater the \(N_{eff}\) the better, yet it’s typically true that the approximation accuracy of our Markov chain is only as good as that of a smaller independent sample, ie.

\[N_{eff} < N.\]

Though there’s no magic rule for interpreting \(N_{eff}\), and it should be utilized alongside other diagnostics such as the trace plot, we might be suspicious of a Markov chain for which \(N_{eff}\) is less than 10% of the actual sample size \(N\).

A quick call to the summary() function provides the estimated effective sample size, n_eff, of our Markov chain simulation (we’ll dig into the other summary statistics later in the book):

# Summarize the combined values of the 4 long chains
summary(bb_sim, pars = c("pi"))$summary
    mean  se_mean     sd   2.5%    25% 50%    75%
pi 0.786 0.001291 0.1059 0.5477 0.7198 0.8 0.8658
    97.5% n_eff Rhat
pi 0.9489  6721    1

The reported n_eff indicates that the accuracy in using our 20,000 length Markov chain to approximate the posterior is only as great as if we had used a much smaller independent sample of size 6721.

6.3.4 Calculating R-hat

Just as the effective sample size provides a numerical supplement to the visual trace plot diagnostic in assessing the degree to which a Markov chain behaves like a random sample, the split-\(\hat{R}\) metric (or “R-hat” for short) provides a numerical supplement to the visual comparison of parallel chains. Recall from Section 6.3.2 that, not only do we want to each individual Markov chain in our simulation to be stable, we want there to be consistency across the parallel chains. R-hat addresses this consistency by comparing the combined variability in sampled \(\pi\) values across all chains to the variability within each individual chain. Before presenting a more formal definition, take the following self quiz to tap into your intuition.50

Figure 6.5 provides simulation results for bb_sim (top row) along with a bad hypothetical alternative (bottom row). Based on the patterns in these plots, what do you think is a marker of a “good” Markov chain simulation?

  1. The variability across all chains is greater than the variability within any individual chain.
  2. The variability across all chains is comparable to the variability within any individual chain.
Simulation results for bb_sim (top row) and a hypothetical alternative (bottom row). These simulation results include trace plots of the four parallel chains (left), density plots for each individual chain (middle), and a density plot of the combined chains (right).

FIGURE 6.5: Simulation results for bb_sim (top row) and a hypothetical alternative (bottom row). These simulation results include trace plots of the four parallel chains (left), density plots for each individual chain (middle), and a density plot of the combined chains (right).

To answer this quiz, let’s dig into Figure 6.5. Based on what we learned in Section 6.3.2, we can see that bb_sim is superior to the alternative – its parallel chains exhibit similar features and produce similar approximations of the posterior. In particular, the variability in \(\pi\) values is nearly identical within each chain (top middle plot). As a consequence, the variability in \(\pi\) values across all chains combined (top right plot) is similar to that of the individual chains. In contrast, notice that the four parallel chains in the alternative simulation produce conflicting approximations of the posterior (which is bad!). As a consequence, the variability in \(\pi\) values across all chains combined (bottom right plot) is much larger / wider than the variability in \(\pi\) within any individual chain (bottom middle plot) That is, there’s much more consistency within any individual chain than there is when we combine them. This discrepancy between the chains hints at a lack of stability in the alternative simulation.

Bringing this analysis together, we’ve intuited the importance of the relationship between the variability across and within all parallel chains. Specifically:

  • in a “good” Markov chain simulation, the variability across parallel chains will be roughly comparable to the variability within any individual chain;

  • in a “bad” Markov chain simulation, the variability across parallel chains might exceed the typical variability within each chain.

We can quantify the relationship between the combined chain variability and within chain variability using R-hat. We provide an intuitive definition of R-hat here. For a more detailed definition, see Vehtari et al. (2020). And to learn more about the exciting connection between effective sample size and R-hat, see Vats and Knudson (2018)!

R-hat

Consider a Markov chain simulation of parameter \(\theta\) which utilizes four parallel chains. Let \(\text{Var}_\text{combined}\) denote the combined variability in \(\theta\) across all four chains and \(\text{Var}_\text{within}\) denote the typical variability within any individual chain. The R-hat metric calculates the ratio between these two sources of variability:

\[\text{R-hat} \approx \sqrt{\frac{\text{Var}_\text{combined}}{\text{Var}_\text{within}}} \; .\]

Ideally, \(\text{R-hat} \approx 1\), reflecting stability across the parallel chains. In contrast, \(\text{R-hat} > 1\) indicates instability, with the variability across chains exceeding that within the chains. Though no golden rule exists, an R-hat ratio greater than 1.05 raises some red flags about the stability of the simulation.

You may not have noticed, but the Rhat value is included in our simulation summary():

summary(bb_sim, pars = "pi")$summary
    mean  se_mean     sd   2.5%    25% 50%    75%
pi 0.786 0.001291 0.1059 0.5477 0.7198 0.8 0.8658
    97.5% n_eff Rhat
pi 0.9489  6721    1

Reflecting our observation that the variability across and within our four parallel chains is comparable, bb_sim has an R-hat value that’s effectively equal to 1. In contrast, our bad hypothetical simulation exhibited in Figure 6.5 has an R-hat value of 5.35. That is, the variability across all chain values is more than 5 times the typical variability within each chains. This well exceeds the 1.05 red flag marker, providing ample evidence that the hypothetical parallel chains do not produce consistent posterior approximations, thus the simulation is unstable.

6.4 Chapter summary

As our Bayesian models get more sophisticated, their posteriors will become too difficult, if not impossible, to specify. In Chapter 6, you learned two simulation techniques that can be utilized to approximate the posterior in such scenarios: grid approximation and Markov chain Monte Carlo. Both techniques produce a sample of \(N\) \(\theta\) values,

\[\left\lbrace \theta^{(1)}, \theta^{(2)}, \ldots, \theta^{(N)} \right\rbrace \; .\]

However, the properties of these samples differ:

  • Grid approximation takes an independent sample of \(\theta^{(i)}\) from a discretized approximation of posterior pdf \(f(\theta|y)\). This approach is nicely straightforward but breaks down in more complicated model settings.
  • Markov chain Monte Carlo offers a more flexible alternative. Though MCMC produces a dependent sample of \(\theta^{(i)}\) which are not drawn from the posterior pdf \(f(\theta|y)\), this sample will mimic the posterior model so long as the chain length \(N\) is big enough.

Finally, you learned some MCMC diagnostics that provide a check of the resulting simulation quality. In short, we can visually examine trace plots and density plots of multiple parallel chains for stability in our simulation:

# Diagnostic plots
mcmc_trace(my_sim, pars = "___")
mcmc_dens_overlay(my_sim, pars = "___")

And we can supplement these visual diagnostics with numerical diagnostics such as effective sample size and R-hat:

summary(my_sim, pars = "___")$summary

6.5 Exercises

6.5.1 Conceptual exercises

Exercise 6.1 (Steps for grid approximation)
  1. Identify the steps for the grid approximation of a posterior model.
  2. Which step(s) would you change to make the approximation more accurate?
  3. How would you change those steps to make the grid approximation more accurate?
Exercise 6.2 (Trace plot diagnostics) Different MCMC simulation scenarios are described below. Sketch by hand what a single chain trace plot might look like for each simulation.
  1. The chain is mixing too slowly.
  2. The chain has high correlation.
  3. The chain has a tendency to get “stuck.”
  4. The chain has no problems!
Exercise 6.3 (MCMC woes) Different MCMC simulation scenarios are described below. For each, describe how the scenario could impact the simulated posterior model.
  1. The chain is mixing too slowly.
  2. The chain has high correlation.
  3. The chain has a tendency to get “stuck.”
Exercise 6.4 (MCMC simulation: Thank you for being a friend) Your friend missed class this week and they are allergic to reading textbooks (a common affliction). Since you are a true friend (who has read the book chapter) you decide to help them out:
  1. Explain to your friend why it is important to look at MCMC diagnostics.
  2. Explain why MCMC simulations are helpful.
  3. Explain the benefits of using RStan.
  4. Tell your friend something you don’t understand about the chapter.

6.5.2 Practice: grid approximation

Exercise 6.5 (Beta-Binomial grid approximation) Consider the Beta-Binomial model for \(\pi\) with \(Y | \pi \sim \text{Bin}(n, \pi)\) and \(\pi \sim \text{Beta}(3,8)\). Suppose that in \(n = 10\) independent trials, you observe \(Y = 2\) successes. Utilize grid approximation with grid values \(\pi \in \{0, 0.25,0.5, 0.75, 1\}\) to approximate the posterior model of \(\pi\).
Exercise 6.6 (Beta-Binomial grid approximation: more grid) Repeat the exercise above, but this time use a grid of 201 equally spaced values between 0 and 1.
Exercise 6.7 (Gamma-Poisson grid approximation) Consider the Gamma-Poisson model for \(\lambda\) with \(Y_i | \lambda \stackrel{ind}{\sim} \text{Pois}(\lambda)\) and \(\lambda \sim \text{Gamma}(20,5)\). Suppose you observe \(n=3\) independent data points \((Y_1,Y_2,Y_3) = (0,1,0)\). Utilize grid approximation with grid values \(\lambda \in \{0, 1, 2, \ldots, 8\}\) to approximate the posterior model of \(\lambda\).
Exercise 6.8 (Gamma-Poisson grid approximation: more grid) Repeat the exercise above, but this time use a grid of 201 equally spaced values between 0 and 8.
Exercise 6.9 (Normal-Normal grid approximation) Consider the Normal-Normal model for \(\mu\) with \(Y_i | \mu \stackrel{ind}{\sim} N(\mu, 1.3^2)\) and \(\mu \sim N(10, 1.2^2)\). Suppose that on \(n = 4\) independent observations, you observe data \((Y_1,Y_2,Y_3,Y_4) = (7.1, 8.9, 8.4, 8.6)\). Utilize grid approximation with grid values \(\mu \in \{5, 6, 7, \ldots, 15\}\) to approximate the posterior model of \(\mu\).
Exercise 6.10 (Normal-Normal grid approximation: more grid) Repeat the exercise above, but this time use a grid of 201 equally spaced values between 5 and 15.
Exercise 6.11 (The Curse of Dimensionality) As we note in this chapter, grid approximation suffers from the curse of dimensionality.
  1. Describe a situation in which we would want to have inference for multiple parameters (ie. high dimensional Bayesian models).
  2. In your own words, explain how dimensionality can affect grid approximation and why this is a curse.

6.5.3 Practice: MCMC

Exercise 6.12 (Comparing MCMC to Grid Approximation)
  1. What drawback(s) do MCMC and grid approximation share?
  2. What advantage(s) do MCMC and grid approximation share?
  3. What is an advantage of grid approximation over MCMC?
  4. What is an advantage of MCMC over grid approximation?
Exercise 6.13 (Is it a Markov Chain?) Below are examples of “chains” \(\left\lbrace \theta^{(1)}, \theta^{(2)},\ldots,\theta^{(N)} \right\rbrace\), for different probability parameters \(\theta\). For each example, determine whether the given chain is a Markov chain. Explain.
  1. You go out to eat \(N\) nights in a row and \(\theta^{(i)}\) is the probability you go to a Thai restaurant on day i.
  2. You play the lottery \(N\) days in a row and \(\theta^{(i)}\) is the probability you win the lottery on day i.
  3. You play your roommate in chess for \(N\) games in a row and \(\theta^{(i)}\) is the probability you win game i against your roommate.
Exercise 6.14 (MCMC with RStan: Step 1) Use the given information to define the Bayesian model structure using the correct RStan syntax. You don’t need to run the code just provide the syntax.
  1. \(Y|\pi \sim \text{Bin}(20, \pi)\) with \(\pi \sim\text{Beta}(1, 1)\).
  2. \(Y|\lambda \sim \text{Pois}(\lambda)\) with \(\lambda \sim\text{Gamma}(4, 2)\).
  3. \(Y|\mu \sim \text{Norm}(\mu, 1)\) with \(\mu\sim\text{Norm}(0, 100)\).
Exercise 6.15 (MCMC with RStan: Steps 1 and 2) Use the given information to (1) define the Bayesian model structure; and (2) simulate the posterior using the correct RStan syntax. You don’t need to run the code, just provide the syntax.
  1. \(Y|\pi \sim \text{Bin}(20, \pi)\) and \(\pi \sim\text{Beta}(1, 1)\) with \(Y = 12\).
  2. \(Y|\lambda \sim \text{Pois}(\lambda)\) and \(\lambda \sim\text{Gamma}(4, 2)\) with \(Y = 3\).
  3. \(Y|\mu \sim \text{Norm}(\mu, 1)\) and \(\mu\sim\text{Norm}(0, 100)\) with \(Y = 12.2\).
Exercise 6.16 (MCMC with RStan: Beta-Binomial) Consider the Beta-Binomial model for \(\pi\) with \(Y | \pi \sim \text{Bin}(n, \pi)\) and \(\pi \sim \text{Beta}(3,8)\). Suppose that in \(n = 10\) independent trials, you observe \(Y = 2\) successes.
  1. Simulate the posterior model of \(\pi\) with RStan using 3 chains and 12000 iterations per chain.
  2. Produce trace plots for each of the three chains.
  3. What is the range of values on the trace plot x-axis? Why is the maximum value of this range not 12000?
  4. Create a density plot of the values for each of the three chains.
  5. Hearkening back to Chapter 5, specify the posterior model of \(\pi\). How does your MCMC approximation compare?
Exercise 6.17 (MCMC with RStan: Beta-Binomial once more with feeling) Consider the Beta-Binomial model for \(\pi\) with \(Y | \pi \sim \text{Bin}(n, \pi)\) and \(\pi \sim \text{Beta}(4,3)\). Suppose that in \(n = 12\) independent trials, you observe \(Y = 4\) successes.
  1. Simulate the posterior model of \(\pi\) with RStan using 4 chains and 10000 iterations per chain.
  2. Produce trace plots for all four chains.
  3. What is the range of values on the trace plot x-axis? Why is the maximum value of this range not 10000?
  4. Create a density plot of the values for each of the four chains.
Exercise 6.18 (MCMC with RStan: Gamma-Poisson) Consider the Gamma-Poisson model for \(\lambda\) with \(Y_i | \lambda \stackrel{ind}{\sim} \text{Pois}(\lambda)\) and \(\lambda \sim \text{Gamma}(20,5)\). Suppose that you observe \(n=3\) independent data points \((Y_1,Y_2,Y_3) = (0,1,0)\).
  1. Simulate the posterior model of \(\lambda\) with RStan using 4 chains and 10000 iterations per chain.
  2. Produce trace plots for all four chains.
  3. Create a density plot of the values for each of the four chains.
  4. From the density plots, what seems to be the most posterior plausible value of \(\lambda\)?
  5. Hearkening back to Chapter 5, specify the posterior model of \(\lambda\). How does your MCMC approximation compare?
Exercise 6.19 (MCMC with RStan: Gamma-Poisson again) Repeat exercise ?? for the Gamma-Poisson model of \(\lambda\) with \(Y_i | \lambda \stackrel{ind}{\sim} \text{Pois}(\lambda)\) and \(\lambda \sim \text{Gamma}(5,5)\). Suppose that you observe \(n=3\) independent data points \((Y_1,Y_2,Y_3) = (0,1,0)\).
Exercise 6.20 (MCMC with RStan: Normal-Normal) Repeat exercise ?? for the Normal-Normal model of \(\mu\) with \(Y_i | \mu \stackrel{ind}{\sim} N(\mu, 1.3^2)\) and \(\mu \sim N(10, 1.2^2)\). Suppose that on \(n = 4\) independent observations, you observe data \((Y_1,Y_2,Y_3,Y_4) = (7.1, 8.9, 8.4, 8.6)\).
Exercise 6.21 (MCMC with RStan: Normal-Normal part deux) Repeat exercise ?? for the Normal-Normal model of \(\mu\) with \(Y_i | \mu \stackrel{ind}{\sim} N(\mu, 8^2)\) and \(\mu \sim N(-14, 2^2)\). Suppose that on \(n = 5\) independent observations, you observe data \((Y_1,Y_2,Y_3,Y_4,Y_5) = (-10.1, 5.5, 0.1, -1.4, 11.5)\).

  1. Answer: b↩︎