# Chapter 6 Approximating the Posterior

**Welcome to Unit 2!**

In Unit 1, you learned to think like a Bayesian and to build some fundamental Bayesian models in this spirit.
Further, by cranking these models through Bayes’ Rule, you were able to mathematically *specify* the corresponding posteriors.
Those days are over.
Though merely hypothetical for now, *some* day (starting in Chapter **??**!) the models you’ll be interested in analyzing will get too complicated to mathematically specify.

Never fear – data analysts are not known to throw up their hands in the face of the unknown.
When we can’t *know* or specify something, we *approximate* it.
In Unit 2 you will learn to use **Markov chain Monte Carlo simulation** techniques to *approximate* otherwise out of reach posterior models as well as how to use the resulting output for Bayesian **posterior analysis**.
As such, Unit 2 serves as a critical bridge to applying the fundamental concepts from Unit 1 in the more sophisticated model settings of Unit 3 and beyond.

Learning requires the occasional leap. You’ve already taken a few. From Chapter 2 to Chapter 3, you took the leap from using simple discrete priors to using continuous Beta priors for a proportion \(\pi\). From Chapter 3 to Chapter 5, you took the leap from engineering the Beta-Binomial model to a family of Bayesian models that can be applied in a wider variety of settings. With each leap, your Bayesian toolkit became more flexible and powerful, but at the cost of the underlying math becoming a bit more complicated. As you continue to generalize your Bayesian methods in more sophisticated settings, this complexity will continue to grow.

Consider Michelle’s run for president.
In Section 2.2 you built a simple Bayesian model of the event that Michelle would secure her party’s nomination given that she won the Iowa caucus.
In Chapter 3 you took your analysis one step further, building a model of Michelle’s support in Minnesota based on polling data in that state.
You could *continue* to refine your analysis of Michelle’s chances of becoming president.
To begin, you could model Michelle’s support in *all* fifty states and Washington, D.C..
Better yet, this model might incorporate data on past state-level voting trends and demographics.
The trade-off is that increasing your model’s flexibility also makes it more complicated.
Whereas your Minnesota-only model depended upon only one parameter \(\pi\), Michelle’s level of support in that state, the new model depends upon *dozens* of parameters.
Here, let \(\theta = (\theta_1, \theta_2, \ldots, \theta_k)\) denote a generic set of \(k \ge 1\) parameters upon which a Bayesian model depends.
In your growing Bayesian model of Michelle’s election chances, \(\theta\) includes 51 parameters that represent her support in each state as well as multiple parameters that define the *relationships* between Michelle’s support among voters, state-level demographics, and past voting trends.
Our posterior analysis thus relies on building the posterior pdf of \(\theta\) given a set of observed data \(y\) on state-level polls, demographics, and voting trends,

\[f(\theta | y) = \frac{f(\theta)L(\theta | y)}{f(y)} \propto f(\theta)L(\theta | y) \; .\]

Though this formula looks familiar, complexity lurks beneath.
Since \(\theta\) is a vector with \(k\) elements, \(f(\theta)\) and \(L(\theta|y)\) are complicated *multivariate* functions.
And, if we value our time, we can forget about calculating the normalizing constant \(f(y)\) across all possible \(\theta\), an intractable multiple integral (or multiple sum) for which a closed form solution might not exist:

\[f(y) = \int_{\theta_1}\int_{\theta_2} \cdots \int_{\theta_k} f(\theta)L(\theta | y) d\theta_k \cdots d\theta_2 d\theta_1 \; .\]

Fortunately, when a Bayesian posterior is either impossible or prohibitively difficult to specify, we’re not out of luck.
We must simply change our strategy: instead of *specifying* the posterior, we can **approximate the posterior via simulation**.
We’ll explore two simulation techniques: **grid approximation** and **Markov chain Monte Carlo (MCMC)**.
Both techniques produce a **sample** of \(N\) \(\theta\) values,

\[\left\lbrace \theta^{(1)}, \theta^{(2)}, \ldots, \theta^{(N)} \right\rbrace \; ,\]

the properties of which reflect those of the posterior model for \(\theta\).
In Chapter 6, we’ll explore these simulation techniques in the context of the familiar Beta-Binomial and Gamma-Poisson models.
Though these models don’t *require* simulation (we can and did specify their posteriors in Unit 1), exploring simulation in these familiar settings will help us build intuition for the process and give us peace of mind that it actually works when we eventually *do* need it.

- Implement and examine the limitations of using
**grid approximation**to simulate a posterior model. - Explore the fundamental properties of
**MCMC**posterior simulation techniques and how these differ from grid approximation. - Implement MCMC simulation in R.
- Learn several
**Markov chain diagnostics**for examining the*quality*of an MCMC posterior approximation.

But first, load some packages that we’ll be utilizing throughout the remainder of this chapter (and book!).
Among these, **rstan** is quite unique, thus be sure to revisit the Preface for directions on installing this package.

```
# Load packages
library(tidyverse)
library(janitor)
library(rstan)
library(bayesplot)
```

## 6.1 Grid approximation

Imagine there’s an image that you can’t view in its entirety - you only observe snippets along a grid that sweeps from left to right across the image. The finer the grid, the clearer the image. And if the grid is fine enough, the result is an imperfect but excellent approximation of the complete image:

This is the big picture idea behind Bayesian grid approximation, in which case the target “image” is posterior pdf \(f(\theta | y)\).
We needn’t observe \(f(\theta | y)\) at *every* possible \(\theta\) to get a sense of its structure.
Rather, we can evaluate \(f(\theta | y)\) at a *finite, discrete grid* of possible \(\theta\) values.
Subsequently, we can take random samples from this *discretized* pdf to approximate the full posterior pdf \(f(\theta | y)\).
We formalize these ideas here and apply them below.

**Grid approximation**

Grid approximation produces a sample of \(N\) **independent** \(\theta\) values, \(\left\lbrace \theta^{(1)}, \theta^{(2)}, \ldots, \theta^{(N)} \right\rbrace\), from a **discretized approximation** of posterior pdf \(f(\theta|y)\). This algorithm evolves in four steps:

- Define a discrete grid of possible \(\theta\) values.
- Evaluate the prior pdf \(f(\theta)\) and likelihood function \(L(\theta|y)\) at each \(\theta\) grid value.
- Obtain a discrete approximation of posterior pdf \(f(\theta |y)\) by: (a) calculating the product \(f(\theta)L(\theta|y)\) at each \(\theta\) grid value; and then (b)
*normalizing*the products so that they sum to 1 across all \(\theta\). - Randomly sample \(N\) \(\theta\) grid values with respect to their corresponding normalized posterior probabilities.

### 6.1.1 A Beta-Binomial example

To bring the grid approximation technique to life, let’s explore the following generic Beta-Binomial model (one without a corresponding data story):

\[\begin{equation} \begin{split} Y|\pi & \sim \text{Bin}(10, \pi) \\ \pi & \sim \text{Beta}(2, 2) \; . \\ \end{split} \tag{6.1} \end{equation}\]

We can interpret \(Y\) here as the number of successes in 10 independent trials. Each trial has probability of success \(\pi\) where our prior understanding about \(\pi\) is captured by a \(\text{Beta}(2, 2)\) model. Suppose we observe \(Y = 9\) successes. Then by our work in Chapter 3, we know that the updated posterior model of \(\pi\) is Beta with parameters 11 (\(Y + \alpha = 9 + 2\)) and 3 (\(n - Y + \beta = 10 - 9 + 2\)):

\[\pi | (Y = 9) \sim \text{Beta}(11, 3) \; .\]

We’re now going to ask you to *forget* that we were able to specify this posterior.
Instead, we’ll explore how we can *approximate* the posterior using grid approximation.
As **Step 1**, we need to split the continuum of possible \(\pi\) values on 0 to 1 into a finite grid.
We’ll start with a course grid of only 6 \(\pi\) values, \(\pi \in \{0, 0.2, 0.4, 0.6, 0.8, 1\}\):

```
# Step 1: Define a grid of 6 pi values
<- seq(from = 0, to = 1, length = 6)
pi_grid <- data.frame(pi_grid) grid_data
```

In **Step 2** we evaluate the \(\text{Beta}(2,2)\) prior pdf and \(\text{Bin}(10, \pi)\) likelihood function with \(Y = 9\) at each \(\pi\) in `pi_grid`

:

```
# Step 2: Evaluate the prior & likelihood at each pi
<- grid_data %>%
grid_data mutate(prior = dbeta(pi_grid, 2, 2)) %>%
mutate(likelihood = dbinom(9, 10, pi_grid))
```

In **Step 3**, we calculate the product of the `likelihood`

and `prior`

at each grid value.
This provides an *unnormalized* discrete approximation of the posterior pdf (that doesn’t sum to one) which we subsequently **normalize**:

```
# Step 3: Approximate the posterior
<- grid_data %>%
grid_data mutate(unnormalized = likelihood * prior) %>%
mutate(posterior = unnormalized / sum(unnormalized))
# Confirm that the posterior approximation sums to 1
%>%
grid_data summarize(sum(unnormalized), sum(posterior))
sum(unnormalized) sum(posterior)
1 0.318 1
```

The resulting discretized posterior pdf, rounded to 2 decimal places and plotted below, provides a very rigid glimpse into the *actual* posterior pdf.
It places a roughly combined 99% chance on \(\pi\) being either 0.6 or 0.8, a 1% chance on \(\pi\) being 0.4, and a near 0% chance on the other 3 \(\pi\) grid values:

```
# Examine the grid approximated posterior
round(grid_data, 2)
pi_grid prior likelihood unnormalized posterior1 0.0 0.00 0.00 0.00 0.00
2 0.2 0.96 0.00 0.00 0.00
3 0.4 1.44 0.00 0.00 0.01
4 0.6 1.44 0.04 0.06 0.18
5 0.8 0.96 0.27 0.26 0.81
6 1.0 0.00 0.00 0.00 0.00
# Plot the grid approximated posterior
ggplot(grid_data, aes(x = pi_grid, y = posterior)) +
geom_point() +
geom_segment(aes(x = pi_grid, xend = pi_grid,
y = 0, yend = posterior))
```

You might anticipate what will happen when we use this approximation to simulate samples from the posterior in **Step 4**.
Each sample draw has only 6 possible outcomes and is highly likely to be 0.6 or 0.8.
Let’s try it: use `sample_n()`

to take a sample of `size = 10000`

values from the 6-length `grid_data`

, with replacement and using the discretized `posterior`

probabilities as sample weights.

```
# Set the seed
set.seed(84735)
# Step 4: sample from the discretized posterior
<- sample_n(grid_data, size = 10000,
post_sample weight = posterior, replace = TRUE)
```

As expected, most of our 10,000 sample values of \(\pi\) were 0.6 or 0.8, few were 0.4, and *none* were below 0.4 or above 0.8:

```
# A table of the 10000 sample values
%>%
post_sample tabyl(pi_grid) %>%
adorn_totals("row")
pi_grid n percent0.4 69 0.0069
0.6 1885 0.1885
0.8 8046 0.8046
10000 1.0000 Total
```

This is an *extremely* oversimplified approximation of the true \(\text{Beta}(11, 3)\) posterior.
If you need any more convincing, the plot below superimposes the true target posterior pdf \(f(\pi|y)\) on a scaled histogram of the 10,000 sample values.
If this *were* a good approximation, the histogram would mimic the shape, location, and spread of the smooth pdf.
It does not.

```
# Histogram of the grid simulation with posterior pdf
ggplot(post_sample, aes(x = pi_grid)) +
geom_histogram(aes(y = ..density..), color = "white") +
stat_function(fun = dbeta, args = list(11, 3)) +
lims(x = c(0, 1))
```

Remember the rainbow image and how we got a more complete picture by viewing snippets along a finer grid?
Similarly, instead of chopping up the 0-to-1 continuum of possible \(\pi\) values into a grid of only 6 values, let’s try a more reasonable grid of **101** values: \(\pi \in \{0, 0.01, 0.02, \ldots, 0.99, 1\}\).
The first 3 steps of the grid approximation using this refined grid are performed below:

```
# Step 1: Define a grid of 101 pi values
<- seq(from = 0, to = 1, length = 101)
pi_grid <- data.frame(pi_grid)
grid_data
# Step 2: Evaluate the prior & likelihood at each pi
<- grid_data %>%
grid_data mutate(prior = dbeta(pi_grid, 2, 2)) %>%
mutate(likelihood = dbinom(9, 10, pi_grid))
# Step 3: Approximate the posterior
<- grid_data %>%
grid_data mutate(unnormalized = likelihood * prior) %>%
mutate(posterior = unnormalized / sum(unnormalized))
```

The discretized posterior pdf is quite smooth, especially in comparison to the rigid approximation when we only used 6 grid values:

```
ggplot(grid_data, aes(x = pi_grid, y = posterior)) +
geom_point() +
geom_segment(aes(x = pi_grid, xend = pi_grid,
y = 0, yend = posterior))
```

Finally, in step 4 of the grid approximation, sample and plot 10,000 draws from the discretized posterior pdf:

```
# Set the seed
set.seed(84735)
# Step 4: sample from the discretized posterior
<- sample_n(grid_data, size = 10000,
post_sample weight = posterior, replace = TRUE)
# Histogram of the grid simulation with posterior pdf
ggplot(post_sample, aes(x = pi_grid)) +
geom_histogram(aes(y = ..density..), color = "white",
binwidth = 0.05) +
stat_function(fun = dbeta, args = list(11, 3)) +
lims(x = c(0, 1))
```

The shape, location, and spread of these sample values mimic those of the true target posterior pdf \(f(\pi|y)\), represented by the smooth curve.
Exciting!
Take a step back to appreciate what we’ve just accomplished.
We’ve taken an independent *sample* from a *discretized approximation* of the posterior pdf \(f(\pi|y)\) that’s defined only on a *grid* of \(\pi\) values.
The algorithm is pretty intuitive and didn’t require advanced programming skills to implement.
Most of all, the result was a sample that behaves much like a random sample taken directly from \(f(\pi|y)\) itself.
Thus grid approximation can be a lifeline in scenarios in which the posterior is intractable, providing an approximation that’s nearly indistinguishable from the “real thing.”

### 6.1.2 A Gamma-Poisson example

For practice, let’s apply the grid approximation technique to simulate the posterior of another Bayesian model, this time a Gamma-Poisson. Let \(Y\) be the number of events that occur in a one hour period, where events occur at an average rate of \(\lambda\) per hour. Further, suppose we collect two data points (\(Y_1,Y_2\)) and place a \(\text{Gamma}(3, 1)\) prior on \(\lambda\):

\[\begin{equation} \begin{split} Y_i|\lambda & \stackrel{ind}{\sim} \text{Pois}(\lambda) \\ \lambda & \sim \text{Gamma}(3, 1) \; . \\ \end{split} \tag{6.2} \end{equation}\]

If we observe \(Y_1 = 2\) events in the first one-hour observation period and \(Y_2 = 8\) in the next, then by our work in Chapter 5, the updated posterior model of \(\lambda\) is Gamma with parameters 13 (\(s + \sum Y = 3 + 10\)) and 3 (\(r + n = 1 + 2\)):

\[\lambda | ((Y_1,Y_2) = (2,3)) \sim \text{Gamma}(13, 3) \; .\]

Let’s *approximate* this posterior using grid approximation.
In **Step 1**, we must specify a discrete grid of reasonable \(\lambda\) values.
Unlike the \(\pi\) parameter in the Beta-Binomial which is restricted to the finite 0-to-1 interval, the Poisson parameter \(\lambda\) can technically take on any non-negative real value (\(\lambda \in [0,\infty)\)).
However, in a plot of the Gamma prior pdf and Poisson likelihood function it appears that, though possible, values of \(\lambda\) beyond 15 are implausible:

With this in mind, we’ll set up a discrete grid of \(\lambda\) values between 0 and 15, essentially truncating the tail of the posterior. Before completing the corresponding grid approximation together, we encourage you to challenge yourself by first trying this on your own.

Fill in the code below to construct a grid approximation of the Gamma-Poisson posterior corresponding to (6.2). In doing so, use a grid of 501 \(\lambda\) values between 0 and 15.

```
# Step 1: Define a grid of 501 lambda values
<- seq(from = ___, to = ___, length = ___)
lambda_grid <- data.frame(lambda_grid)
grid_data
# Step 2: Evaluate the prior & likelihood at each lambda
<- grid_data %>%
grid_data ___(prior = dgamma(___, ___, ___)) %>%
___(likelihood = dpois(___, ___) * dpois(___, ___))
# Step 3: Approximate the posterior
<- grid_data %>%
grid_data ___(unnormalized = ___) %>%
___(posterior = ___)
# Set the seed
set.seed(84735)
# Step 4: sample from the discretized posterior
<- sample_n(___, size = ___,
post_sample weight = ___, replace = ___)
```

Once you’ve quizzed yourself, check out the complete code below.
Notice that much of this is the same as it was for the Beta-Binomial model.
There are two key differences.
First, we use `dgamma()`

and `dpois()`

instead of `dbeta()`

and `dbinom()`

to evaluate the prior and likelihood.
Second, since we have a sample of two data points (\(Y_1,Y_2\)), the Poisson joint likelihood function of \(\lambda\) is the product of the marginal likelihoods, \(L(\lambda | Y_1,Y_2) = L(\lambda | Y_1) L(\lambda | Y_2)\).

```
# Step 1: Define a grid of 501 lambda values
<- seq(from = 0, to = 15, length = 501)
lambda_grid <- data.frame(lambda_grid)
grid_data
# Step 2: Evaluate the prior & likelihood at each lambda
<- grid_data %>%
grid_data mutate(prior = dgamma(lambda_grid, 3, 1)) %>%
mutate(likelihood = dpois(2, lambda_grid) * dpois(8, lambda_grid))
# Step 3: Approximate the posterior
<- grid_data %>%
grid_data mutate(unnormalized = likelihood * prior) %>%
mutate(posterior = unnormalized / sum(unnormalized))
# Set the seed
set.seed(84735)
# Step 4: sample from the discretized posterior
<- sample_n(grid_data, size = 10000,
post_sample weight = posterior, replace = TRUE)
```

Most importantly, grid approximation again produced a decent approximation of our target posterior:

```
# Histogram of the grid simulation with posterior pdf
ggplot(post_sample, aes(x = lambda_grid)) +
geom_histogram(aes(y = ..density..), color = "white") +
stat_function(fun = dgamma, args = list(13, 3)) +
lims(x = c(0, 15))
```

### 6.1.3 Limitations

Some say that all good things must come to an end.
Though we don’t agree with this saying in general, it happens to be true in the case of grid approximation.
Limitations in the grid approximation method quickly present themselves as our models get more complicated.
Mainly, when our model has multiple (possibly *thousands*) of model parameters \(\theta = (\theta_1, \theta_2, \ldots, \theta_k)\), grid approximation suffers from the **curse of dimensionality**.
Let’s return to our image approximation for some intuition.
Above we assumed that we could only see snippets of the image along a grid that sweeps from left to right along the x-axis.
This is analogous to using grid approximation to simulate a model with *one* parameter.
Suppose instead that our model has *two* parameters.
Or, in the case of the image approximation, we can only see snippets of the image along a grid that sweeps from left to right along the x-axis *and* from top to bottom along the y-axis:

When we chop *both* the x- and y-axes into grids, there are bigger *gaps* in the image approximation.
To achieve a more refined approximation, we need a finer grid than when we only chopped the x-axis into a grid.
Analogously, when using grid approximation to simulate multivariate posteriors, we need to divide the multidimensional sample space of \(\theta = (\theta_1, \theta_2, \ldots, \theta_k)\) into a very, very fine grid in order to prevent big gaps in our approximation.
In practice, this might not be feasible.
When evaluated on finer and finer grids, the grid approximation method becomes computationally expensive.
You can’t merely start the simulation, get up for a cup of coffee, come back, and *poof* the simulation is done.
You might have to start the simulation and go off to a month-long meditation retreat (to practice the patience you’ll need for grid approximation).
MCMC methods provide a more flexible alternative.

## 6.2 Markov chains via RStan

MCMC methods and their coinage hold some historical significance.
The *Markov chain* component of MCMC is named for the Russian mathematician Andrey Markov (1856 - 1922).
The etymology of the *Monte Carlo* component is more nuanced and dubious.
As part of their top secret nuclear weapons project in the 1940’s, Stanislav Ulam, John von Neumann, and their collaborators at the Los Alamos National Laboratory used Markov chains to simulate and better understand neutron travel (Eckhardt 1987).
One perk of such top secret government work is getting to pick a code name for your project.
The Los Alamos team chose “Monte Carlo,” a code name which is said to be inspired by the opulent Monte Carlo casino in the equally opulent Principality of Monaco on the French Riviera.
What inspired the link between random simulation methods and casinos?
We can’t help you there.

Today, the application of Markov chains to simulate probability models using methods pioneered by the Monte Carlo project is known as “Markov chain Monte Carlo.”
As desired, MCMC simulation methods scale up for more complicated Bayesian models.
But this flexibility doesn’t come for free.
Like grid approximation samples, **MCMC samples are not taken directly from the posterior pdf \(f(\theta|y)\)**.
Yet unlike grid approximation samples,

**MCMC samples aren’t even independent**- each subsequent sample value depends directly upon the previous value. This feature is known as the

**Markov property**and is evoked by the “chain” terminology. Specifically, let \(\left\lbrace \theta^{(1)}, \theta^{(2)}, \ldots, \theta^{(N)} \right\rbrace\) be an \(N\)-length

**Markov chain**. Then the model from which the \((i+1)\)st chain value is drawn depends upon the preceding chain values only through the most recent value:

\[f\left(\theta^{(i + 1)} \; | \; \theta^{(1)}, \theta^{(2)}, \ldots, \theta^{(i)}, y\right) = f\left(\theta^{(i + 1)} \; | \; \theta^{(i)}, y\right) \; .\]

This dependence reiterates the fact that each chain value is drawn from some pdf *other than* the target posterior:

\[f\left(\theta^{(i + 1)} \; | \; \theta^{(i)}, y\right) \ne f\left(\theta^{(i + 1)} \; | \; y\right) \; .\]

**Markov chain Monte Carlo**

MCMC simulation produces a *chain* of \(N\) **dependent** \(\theta\) values, \(\left\lbrace \theta^{(1)}, \theta^{(2)}, \ldots, \theta^{(N)} \right\rbrace\), which are **not** drawn from the posterior pdf \(f(\theta|y)\).

It likely seems strange to approximate the posterior using a *dependent* sample that’s not even taken from the posterior.
But mathemagically, with reasonable MCMC *algorithms*, it can work.
The rub is that these algorithms have a steeper learning curve than the grid approximation technique.
If you’re curious about the details, we encourage you to explore Chapter 7.
If you’re *not* curious or are short on time, you can get through the remainder of the book without the details.
Though a firm grasp of these algorithms would be ideal, there’s a growing number of MCMC computing resources that can do the heavy lifting for us.
In this book, we’ll utilize RStan, which combines the power of R with the Stan engine.
You’ll get into the nitty gritty of RStan in Chapter 6, building up the code line by line and thereby gaining familiarity with the key elements of an RStan simulation.
Starting in Chapter **??**, you will utilize the complementary RStanArm package which provides shortcuts for simulating a broad framework of Bayesian *applied regression models* (`Arm`

).

### 6.2.1 A Beta-Binomial example

There are two essential steps to all RStan analyses: (1) **define the Bayesian model structure** in RStan notation; and (2) **simulate the posterior**.
Let’s examine these steps in the context of the Beta-Binomial model defined by (6.1), starting with **step 1**.
In defining the structure of this Beta-Binomial model, we need to specify the three aspects upon which it depends:

`data`

Data \(Y\) is the observed number of successes in 10 trials. RStan isn’t a mind reader – we must specify that \(Y\) is an*integer*between 0 and 10.`parameters`

The model depends upon parameter \(\pi\), or`pi`

in RStan notation. We must specify that \(\pi\) can be any*real*number between 0 and 1.`model`

The model is defined by the Bin(10,\(\pi\)) likelihood for \(Y\) and the Beta(2,2) prior for \(\pi\). We specify these using`binomial()`

and`beta()`

.

Below we translate this Beta-Binomial structure into RStan syntax and store it as the *character string* `bb_model`

.
We encourage you to pause and examine the code, noting how it matches up with the three aspects above.

```
# STEP 1: DEFINE the model
<- "
bb_model data {
int<lower=0, upper=10> Y;
}
parameters {
real<lower=0, upper=1> pi;
}
model {
Y ~ binomial(10, pi);
pi ~ beta(2, 2);
}
"
```

In **step 2**, we *simulate* the posterior using the `stan()`

function.
*Very* loosely speaking, `stan()`

designs and runs an MCMC algorithm to produce an approximate sample from the Beta-Binomial posterior.

The simulation will be quite slow for each new model.

```
# STEP 2: SIMULATE the posterior
<- stan(
bb_sim model_code = bb_model, data = list(Y = 9),
chains = 4, iter = 5000*2, seed = 84735)
```

Note that `stan()`

requires two types of arguments.
First, we must specify the **model information** by:

`model_code`

= the character string defining the model (here`bb_model`

)`data`

= a list of the observed data (here`Y = 9`

).

Second, we must specify the desired **Markov chain information** using three additional arguments:

The

`chains`

argument specifies how many**parallel Markov chains**to run. We run four chains here, a choice we discuss in Section 6.3.The

`iter`

argument specifies the desired number of**iterations**in each Markov chain. By default, the first half of these iterations are thrown out as “burn-in” or “warm-up” samples, the idea being that it takes some time before a Markov chain starts producing values that mimic a random sample from the posterior. The second half is kept as the final Markov chain sample. Thus we choose the number of iterations to be double our desired**Markov chain sample size**.To set the random number generating seed for an RStan simulation, we utilize the

`seed`

argument*within*the`stan()`

function.

The result, stored in `bb_sim`

, is a `stanfit`

object.
This object includes four parallel Markov chains run for 10,000 iterations each.
After tossing out the first 5,000 iterations of all four chains, we end up with four separate Markov chain samples of size 5,000, or a combined Markov chain sample size of 20,000.
The first 4 \(\pi\) values for each chain are extracted and shown here:

```
as.array(bb_sim, pars = "pi") %>%
head(4)
= pi
, , parameters
chains:1 chain:2 chain:3 chain:4
iterations chain1,] 0.9403 0.8777 0.3803 0.6649
[2,] 0.9301 0.9802 0.8186 0.6501
[3,] 0.9012 0.9383 0.8458 0.7001
[4,] 0.9224 0.9540 0.7336 0.5902 [
```

It’s important to remember that these **Markov chain values are NOT a random sample from the posterior and are NOT independent**.
Rather, each of the four parallel chains forms a dependent 5,000 length Markov *chain* of \(\pi\) values, \(\left(\pi^{(1)}, \pi^{(2)}, \ldots, \pi^{(5000)} \right)\).
For example, in iteration 1, `chain:1`

starts at a value of roughly \(\pi^{(1)}\) = 0.9403.
The value at iteration 2, \(\pi^{(2)}\), *depends upon* \(\pi^{(1)}\).
In this case the chain moves from 0.9403 to 0.9301.
Similarly, the chain moves from 0.9301 to 0.9012, from 0.9012 to 0.9224, and so on.
Thus the chain traverses the **sample space** or range of posterior plausible \(\pi\) values.
A Markov chain **trace plot** illustrates this traversal, plotting the \(\pi\) value (y-axis) in each iteration (x-axis).
We use the `mcmc_trace()`

function in the **bayesplot** package to construct the trace plot of all four Markov chains:

`mcmc_trace(bb_sim, pars = "pi", size = 0.5)`

Figure 6.3 zooms in on the trace plot of chain 1. In the first 20 iterations (left), the chain largely explores values between 0.57 and 0.94. After 200 iterations (right), the Markov chain has started to explore new territory, traversing a slightly wider range of values between 0.49 and 0.96. Both trace plots also exhibit evidence of the dependence among the Markov chain values, places where the chain tends to float up for multiple iterations and then down for multiple iterations.

The trace plots in Figure 6.3 illuminate the **longitudinal** behavior of the Markov chains, marking each value of each subsequent iteration.
We also want to examine the **distribution** of the values these chains visit along their journey, ignoring the order of these visits.
The histogram and density plot below provide a snapshot of this distribution for the combined 20,000 chain values, 5,000 from each of the four separate chains.
Notice the important punchline here: the distribution of the Markov chain values produces an excellent *approximation* of the target Beta(11, 3) posterior model of \(\pi\) (superimposed in red).
That’s a relief – that was the whole point!

```
# Histogram of the Markov chain values
mcmc_hist(bb_sim, pars = "pi")
# Density plot of the Markov chain values
mcmc_dens(bb_sim, pars = "pi")
```

### 6.2.2 A Gamma-Poisson example

For more **rstan** practice, let’s apply these tools to approximate the Gamma-Poisson posterior corresponding to (6.2) upon observing data \((Y_1,Y_2) = (2,8)\).
Recall that this involves two steps.
In **step 1**, we define the structure of the Gamma-Poisson model, specifying the three aspects upon which it depends: the `data`

, `parameters`

, and `model`

.
In **step 2**, we simulate the posterior.
We encourage you to challenge yourself by trying this on your own first.
Your code will involve two terms you haven’t yet seen, but might guess how to use: `poisson()`

and `gamma()`

.
Further, your code will incorporate a *vector* of \((Y_1,Y_2)\) variables and observations as opposed to a single variable \(Y\).

Fill in the code below to construct an MCMC approximation of the Gamma-Poisson posterior corresponding to (6.2). In doing so, run four parallel chains for 10,000 iterations each (resulting in a sample size of 5,000 per chain).

```
# STEP 1: DEFINE the model
<- "
gp_model data {
int<___> Y[2];
}
parameters {
real<___> lambda;
}
model {
Y ~ ___(___);
lambda ~ ___(___, ___);
}
"
# STEP 2: SIMULATE the posterior
<- ___(
gp_sim model_code = ___, data = list(___),
chains = ___, iter = ___, seed = 84735)
```

We now present the solution. In defining the model in step 1, take note of the three aspects upon which it depends.

`data`

Data`Y[2]`

represents the*vector*of the event counts, \((Y_1,Y_2)\), where the counts can be any non-negative*integers*in \(\{0,1,2,\ldots\}\).`parameters`

The model depends upon parameter \(\lambda\) which can be any non-negative*real*number in \([0,\infty)\).`model`

The model is defined by the Pois(\(\lambda\)) likelihood for \(Y\) and the Gamma(3,1) prior for \(\lambda\). We specify these using`poisson()`

and`gamma()`

.

Further, in step 2, we must feed in the vector of both `data`

points, `Y = c(2,8)`

.
It follows that:

```
# STEP 1: DEFINE the model
<- "
gp_model data {
int<lower=0> Y[2];
}
parameters {
real<lower=0> lambda;
}
model {
Y ~ poisson(lambda);
lambda ~ gamma(3, 1);
}
"
# STEP 2: SIMULATE the posterior
<- stan(
gp_sim model_code = gp_model, data = list(Y = c(2,8)),
chains = 4, iter = 5000*2, seed = 84735)
```

Trace plots illustrate the paths these chains take through the \(\lambda\) sample space. To this end, note that the \(\lambda\) chain values range from roughly 0 to 10, but mostly remain below 7.5:

```
# Construct trace plots
mcmc_trace(gp_sim, pars = "lambda", size = 0.5)
```

A histogram and density plot provide a better look into the overall distribution of \(\lambda\) Markov chain values. This distribution provides an excellent approximation of the Gamma(13,3) posterior (superimposed in red):

```
# Histogram of the Markov chain values
mcmc_hist(gp_sim, pars = "lambda")
# Density plot of the Markov chain values
mcmc_dens(gp_sim, pars = "lambda")
```

## 6.3 Markov chain diagnostics

Simulation is fantastic!
Even when we’re able to derive a Bayesian posterior, we can use simulated samples from the posterior to verify our work and provide some intuition.
More importantly, when we’re *not* able to derive a posterior, simulated samples provide a crucial *approximation*.
In the MCMC examples above, we saw that the Markov chains traverse the sample space of our parameter (\(\pi\) or \(\lambda\)) and, in the end, *mimic* a random sample that *converges* to the posterior.
“Approximation” and “converges” are key words here - **simulations aren’t perfect**.
This begs the following questions, stated here for MCMC simulations though similar questions pertain to grid approximation:

- What does a
*good*Markov chain look like? - How
*accurate*is the Markov chain approximation of the posterior? - How
*big*should our Markov chain sample size be? For how many*iterations*should we run the Markov chain?

Answering these questions is both an art and science.
There are no one-size-fits-all magic formulas that provide definitive answers here.
Rather, it’s through experience that you get a feel for what “good” Markov chains look like and what you can do to fix a “bad” Markov chain.
In this section, we’ll focus on a couple visual diagnostic tools that will get us started: **trace plots** and **parallel chains**.
These can be followed up with and supplemented by two numerical diagnostics: **effective sample size** and **R-hat (\(\hat{R}\))**.
Utilizing these diagnostics should be done *holistically*.
Since no single visual or numerical diagnostic is one-size-fits-all, they provide a fuller picture of Markov chain quality when considered together.

### 6.3.1 Examining trace plots

Reexamine the trace plots for the Beta-Binomial simulation in Figure 6.2.
These are textbook examples of what we *want* trace plots to look like.
Mainly, they look like a bunch of white noise with no discernible trends or notable phenomena.
This nothingness implies that the chains are *stable*.
In contrast, the hypothetical Markov chains for the same Beta-Binomial model shown in Figure 6.4 illustrate potential chain behavior that should give us pause.

First, the trend of the top chain has not yet stabilized after 5,000 iterations, suggesting that it has not yet “found” or does not yet know how to explore the range of posterior plausible \(\pi\) values.
It is **mixing slowly**.
The downward trend also hints at strong correlation among the draws.
All of this is bad.
Though Markov chains are inherently dependent, the more they *behave* like independent samples, the smaller the error in the resulting posterior approximation (roughly).
The bottom chain exhibits a different problem.
It tends to get *stuck* when it visits smaller values of \(\pi\).

Both of these goofy-looking chains result in a serious issue: they produce poor approximations of the Beta(11,3) posterior (superimposed in black). After 5,000 iterations, we can see that the top chain hasn’t yet explored the tails of the posterior and that, upon getting stuck, the bottom chain over-samples some values in the left tail of \(\pi\) (hence the various spikes in the density plot). In turn, these poor posterior approximations would lead to misleading conclusions about our posterior model.

*In practice*, we run `rstan`

simulations when we can’t specify, thus want to approximate the posterior.
This means that we don’t typically have the privilege of being able to compare our simulation results to the “real” posterior.
This is why diagnostics are so important.
If we see bad trace plots like those in Figure 6.4, there are two immediate steps we can take:

- Check the model. Are there mistakes in the code? Are there mistakes in the model itself, ie. are the assumed prior and likelihood models appropriate?
- Run the chain for more iterations. Some undesirable short-term chain trends might iron out in the long-term.

We’ll get practice with the more nuanced Step 1 throughout the book. Step 2 is easy, though requires extra computation time.

### 6.3.2 Comparing parallel chains

Recall that our `stan()`

simulation for the Beta-Binomial model produced four parallel Markov chains.
Not only do we want to see stability in each individual chain (as discussed above), we want to see consistency *across* the four chains.
Mainly, though we expect different chains take different paths, they should enjoy similar features and produce similar approximations of the posterior.
For example, the trace plots of the four parallel chains in Figure 6.2 appear similar in their randomness.
Further, in the density plots below, we see that these four chains produce nearly indistinguishable approximations of the Beta(11,3) posterior.
This provides evidence that our simulation is stable and sufficiently long – running the chains for more iterations likely wouldn’t produce drastically different or improved approximations of the posterior.

```
# Density plots of individual chains
mcmc_dens_overlay(bb_sim, pars = "pi")
```

For a point of comparison, let’s implement a *shorter* Markov chain simulation for the same model.
Instead of running four parallel chains for 10,000 iterations and a resulting sample size of 5,000 each, run four parallel chains for only 100 iterations and a resulting sample size of 50 each:

```
# STEP 2: SIMULATE the posterior
<- stan(
bb_sim_short model_code = bb_model, data = list(Y = 9),
chains = 4, iter = 50*2, seed = 84735)
```

The trace plots and corresponding density plots of the short Markov chains are shown below. Though the chains’ trace plots exhibit similar random behavior, their corresponding density plots differ, hence they produce discrepant approximations of the posterior. In the face of such instability, it would be a mistake to stop our simulation after only 100 iterations.

```
# Trace plots of short chains
mcmc_trace(bb_sim_short, pars = "pi")
# Density plots of individual short chains
mcmc_dens_overlay(bb_sim_short, pars = "pi")
```

### 6.3.3 Calculating effective sample size

Recall from Section 6.3.1 that the more a dependent Markov chain *behaves* like an independent sample, the smaller the error in the resulting posterior approximation (loosely speaking).
Though trace plots provide some visual insight into this behavior, supplementary numerical assessments can provide more nuanced information.
To begin, recall that our Markov chain simulation `bb_sim`

produced a combined 20,000 *dependent* values of \(\pi\), sampled from models *other* than the posterior.
We then used this chain to approximate the posterior model of \(\pi\).
Knowing that the *error* in this approximation is likely larger than if we had used 20,000 *independent* sample values drawn directly *from the posterior* begs the following question: *How many independent sample values would it take to produce an equivalently accurate posterior approximation?*
The **effective sample size** provides an answer.

**Effective sample size**

Let \(N\) denote the *actual* sample size or length of a *dependent* Markov chain.
The *effective* sample size of this chain, \(N_{eff}\), quantifies the number of *independent* samples it would take to produce an equivalently accurate posterior approximation.
The greater the \(N_{eff}\) the better, yet it’s typically true that the approximation accuracy of our Markov chain is only as good as that of a smaller independent sample, ie.

\[N_{eff} < N.\]

Though there’s no magic rule for interpreting \(N_{eff}\), and it should be utilized alongside other diagnostics such as the trace plot, we might be suspicious of a Markov chain for which \(N_{eff}\) is less than 10% of the actual sample size \(N\).

A quick call to the `summary()`

function provides the estimated effective sample size, `n_eff`

, of our Markov chain simulation (we’ll dig into the other summary statistics later in the book):

```
# Summarize the combined values of the 4 long chains
summary(bb_sim, pars = c("pi"))$summary
2.5% 25% 50% 75%
mean se_mean sd 0.786 0.001291 0.1059 0.5477 0.7198 0.8 0.8658
pi 97.5% n_eff Rhat
0.9489 6721 1 pi
```

The reported `n_eff`

indicates that the accuracy in using our 20,000 length Markov chain to approximate the posterior is only as great as if we had used a much smaller *independent* sample of size 6721.

### 6.3.4 Calculating R-hat

Just as the effective sample size provides a numerical supplement to the visual trace plot diagnostic in assessing the degree to which a Markov chain behaves like a random sample, the split-\(\hat{R}\) metric (or “R-hat” for short) provides a numerical supplement to the visual comparison of parallel chains.
Recall from Section 6.3.2 that, not only do we want to each *individual* Markov chain in our simulation to be stable, we want there to be consistency *across* the parallel chains.
R-hat addresses this consistency by comparing the combined variability in sampled \(\pi\) values *across* all chains to the variability *within* each individual chain.
Before presenting a more formal definition, take the following self quiz to tap into your intuition.^{50}

Figure 6.5 provides simulation results for `bb_sim`

(top row) along with a *bad* hypothetical alternative (bottom row).
Based on the patterns in these plots, what do you think is a marker of a “good” Markov chain simulation?

- The variability
*across*all chains is**greater**than the variability*within*any individual chain. - The variability
*across*all chains is**comparable**to the variability*within*any individual chain.

To answer this quiz, let’s dig into Figure 6.5.
Based on what we learned in Section 6.3.2, we can see that `bb_sim`

is superior to the alternative – its parallel chains exhibit similar features and produce similar approximations of the posterior.
In particular, the variability in \(\pi\) values is nearly identical *within* each chain (top middle plot).
As a consequence, the variability in \(\pi\) values *across* all chains combined (top right plot) is similar to that of the individual chains.
In contrast, notice that the four parallel chains in the alternative simulation produce conflicting approximations of the posterior (which is bad!).
As a consequence, the variability in \(\pi\) values across all chains combined (bottom right plot) is much larger / wider than the variability in \(\pi\) within any individual chain (bottom middle plot)
That is, there’s much more consistency within any individual chain than there is when we combine them.
This discrepancy between the chains hints at a lack of stability in the alternative simulation.

Bringing this analysis together, we’ve intuited the importance of the relationship between the variability *across* and *within* all parallel chains.
Specifically:

in a “good” Markov chain simulation, the variability across parallel chains will be roughly comparable to the variability within any individual chain;

in a “bad” Markov chain simulation, the variability across parallel chains might exceed the typical variability within each chain.

We can *quantify* the relationship between the combined chain variability and within chain variability using R-hat.
We provide an intuitive definition of R-hat here.
For a more detailed definition, see Vehtari et al. (2020).
And to learn more about the exciting connection between effective sample size and R-hat, see Vats and Knudson (2018)!

**R-hat**

Consider a Markov chain simulation of parameter \(\theta\) which utilizes four parallel chains. Let \(\text{Var}_\text{combined}\) denote the combined variability in \(\theta\) across all four chains and \(\text{Var}_\text{within}\) denote the typical variability within any individual chain. The R-hat metric calculates the ratio between these two sources of variability:

\[\text{R-hat} \approx \sqrt{\frac{\text{Var}_\text{combined}}{\text{Var}_\text{within}}} \; .\]

Ideally, \(\text{R-hat} \approx 1\), reflecting stability across the parallel chains. In contrast, \(\text{R-hat} > 1\) indicates instability, with the variability across chains exceeding that within the chains. Though no golden rule exists, an R-hat ratio greater than 1.05 raises some red flags about the stability of the simulation.

You may not have noticed, but the `Rhat`

value is included in our simulation `summary()`

:

```
summary(bb_sim, pars = "pi")$summary
2.5% 25% 50% 75%
mean se_mean sd 0.786 0.001291 0.1059 0.5477 0.7198 0.8 0.8658
pi 97.5% n_eff Rhat
0.9489 6721 1 pi
```

Reflecting our observation that the variability across and within our four parallel chains is comparable, `bb_sim`

has an R-hat value that’s effectively equal to 1.
In contrast, our bad hypothetical simulation exhibited in Figure 6.5 has an R-hat value of 5.35.
That is, the variability across all chain values is more than 5 *times* the typical variability within each chains.
This *well* exceeds the 1.05 red flag marker, providing ample evidence that the hypothetical parallel chains do *not* produce consistent posterior approximations, thus the simulation is unstable.

## 6.4 Chapter summary

As our Bayesian models get more sophisticated, their posteriors will become too difficult, if not impossible, to specify.
In Chapter 6, you learned two **simulation techniques** that can be utilized to **approximate the posterior** in such scenarios: grid approximation and Markov chain Monte Carlo.
Both techniques produce a **sample** of \(N\) \(\theta\) values,

\[\left\lbrace \theta^{(1)}, \theta^{(2)}, \ldots, \theta^{(N)} \right\rbrace \; .\]

However, the properties of these samples differ:

**Grid approximation**takes an**independent**sample of \(\theta^{(i)}\) from a*discretized approximation*of posterior pdf \(f(\theta|y)\). This approach is nicely straightforward but breaks down in more complicated model settings.**Markov chain Monte Carlo**offers a more flexible alternative. Though MCMC produces a**dependent**sample of \(\theta^{(i)}\) which are**not**drawn from the posterior pdf \(f(\theta|y)\), this sample will mimic the posterior model so long as the chain length \(N\) is big enough.

Finally, you learned some MCMC **diagnostics** that provide a check of the resulting simulation quality. In short, we can visually examine **trace plots** and density plots of multiple **parallel chains** for **stability** in our simulation:

```
# Diagnostic plots
mcmc_trace(my_sim, pars = "___")
mcmc_dens_overlay(my_sim, pars = "___")
```

And we can supplement these visual diagnostics with numerical diagnostics such as **effective sample size** and **R-hat**:

`summary(my_sim, pars = "___")$summary`

## 6.5 Exercises

### 6.5.1 Conceptual exercises

**Exercise 6.1 (Steps for grid approximation)**

- Identify the steps for the grid approximation of a posterior model.
*Which*step(s) would you change to make the approximation more accurate?*How*would you change those steps to make the grid approximation more accurate?

**Exercise 6.2 (Trace plot diagnostics)**Different MCMC simulation scenarios are described below. Sketch by hand what a single chain trace plot might look like for each simulation.

- The chain is mixing too slowly.
- The chain has high correlation.
- The chain has a tendency to get “stuck.”
- The chain has no problems!

**Exercise 6.3 (MCMC woes)**Different MCMC simulation scenarios are described below. For each, describe how the scenario could impact the simulated posterior model.

- The chain is mixing too slowly.
- The chain has high correlation.
- The chain has a tendency to get “stuck.”

**Exercise 6.4 (MCMC simulation: Thank you for being a friend)**Your friend missed class this week and they are allergic to reading textbooks (a common affliction). Since you are a true friend (who has read the book chapter) you decide to help them out:

- Explain to your friend why it is important to look at MCMC diagnostics.
- Explain why MCMC simulations are helpful.
- Explain the benefits of using RStan.
- Tell your friend something you don’t understand about the chapter.

### 6.5.2 Practice: grid approximation

**Exercise 6.5 (Beta-Binomial grid approximation)**Consider the Beta-Binomial model for \(\pi\) with \(Y | \pi \sim \text{Bin}(n, \pi)\) and \(\pi \sim \text{Beta}(3,8)\). Suppose that in \(n = 10\) independent trials, you observe \(Y = 2\) successes. Utilize grid approximation with grid values \(\pi \in \{0, 0.25,0.5, 0.75, 1\}\) to approximate the posterior model of \(\pi\).

**Exercise 6.6 (Beta-Binomial grid approximation: more grid)**Repeat the exercise above, but this time use a grid of 201 equally spaced values between 0 and 1.

**Exercise 6.7 (Gamma-Poisson grid approximation)**Consider the Gamma-Poisson model for \(\lambda\) with \(Y_i | \lambda \stackrel{ind}{\sim} \text{Pois}(\lambda)\) and \(\lambda \sim \text{Gamma}(20,5)\). Suppose you observe \(n=3\) independent data points \((Y_1,Y_2,Y_3) = (0,1,0)\). Utilize grid approximation with grid values \(\lambda \in \{0, 1, 2, \ldots, 8\}\) to approximate the posterior model of \(\lambda\).

**Exercise 6.8 (Gamma-Poisson grid approximation: more grid)**Repeat the exercise above, but this time use a grid of 201 equally spaced values between 0 and 8.

**Exercise 6.9 (Normal-Normal grid approximation)**Consider the Normal-Normal model for \(\mu\) with \(Y_i | \mu \stackrel{ind}{\sim} N(\mu, 1.3^2)\) and \(\mu \sim N(10, 1.2^2)\). Suppose that on \(n = 4\) independent observations, you observe data \((Y_1,Y_2,Y_3,Y_4) = (7.1, 8.9, 8.4, 8.6)\). Utilize grid approximation with grid values \(\mu \in \{5, 6, 7, \ldots, 15\}\) to approximate the posterior model of \(\mu\).

**Exercise 6.10 (Normal-Normal grid approximation: more grid)**Repeat the exercise above, but this time use a grid of 201 equally spaced values between 5 and 15.

**Exercise 6.11 (The Curse of Dimensionality)**As we note in this chapter, grid approximation suffers from

**the curse of dimensionality**.

- Describe a situation in which we would want to have inference for multiple parameters (ie. high dimensional Bayesian models).
- In your own words, explain how dimensionality can affect grid approximation and why this is a curse.

### 6.5.3 Practice: MCMC

**Exercise 6.12 (Comparing MCMC to Grid Approximation)**

- What drawback(s) do MCMC and grid approximation share?
- What advantage(s) do MCMC and grid approximation share?
- What is an advantage of grid approximation over MCMC?
- What is an advantage of MCMC over grid approximation?

**Exercise 6.13 (Is it a Markov Chain?)**Below are examples of “chains” \(\left\lbrace \theta^{(1)}, \theta^{(2)},\ldots,\theta^{(N)} \right\rbrace\), for different probability parameters \(\theta\). For each example, determine whether the given chain is a

*Markov*chain. Explain.

- You go out to eat \(N\) nights in a row and \(\theta^{(i)}\) is the probability you go to a Thai restaurant on day
*i*. - You play the lottery \(N\) days in a row and \(\theta^{(i)}\) is the probability you win the lottery on day
*i*. - You play your roommate in chess for \(N\) games in a row and \(\theta^{(i)}\) is the probability you win game
*i*against your roommate.

**Exercise 6.14 (MCMC with RStan: Step 1)**Use the given information to define the Bayesian model structure using the correct RStan syntax. You don’t need to run the code just provide the syntax.

- \(Y|\pi \sim \text{Bin}(20, \pi)\) with \(\pi \sim\text{Beta}(1, 1)\).
- \(Y|\lambda \sim \text{Pois}(\lambda)\) with \(\lambda \sim\text{Gamma}(4, 2)\).
- \(Y|\mu \sim \text{Norm}(\mu, 1)\) with \(\mu\sim\text{Norm}(0, 100)\).

**Exercise 6.15 (MCMC with RStan: Steps 1 and 2)**Use the given information to (1) define the Bayesian model structure; and (2) simulate the posterior using the correct RStan syntax. You don’t need to run the code, just provide the syntax.

- \(Y|\pi \sim \text{Bin}(20, \pi)\) and \(\pi \sim\text{Beta}(1, 1)\) with \(Y = 12\).
- \(Y|\lambda \sim \text{Pois}(\lambda)\) and \(\lambda \sim\text{Gamma}(4, 2)\) with \(Y = 3\).
- \(Y|\mu \sim \text{Norm}(\mu, 1)\) and \(\mu\sim\text{Norm}(0, 100)\) with \(Y = 12.2\).

**Exercise 6.16 (MCMC with RStan: Beta-Binomial)**Consider the Beta-Binomial model for \(\pi\) with \(Y | \pi \sim \text{Bin}(n, \pi)\) and \(\pi \sim \text{Beta}(3,8)\). Suppose that in \(n = 10\) independent trials, you observe \(Y = 2\) successes.

- Simulate the posterior model of \(\pi\) with RStan using 3 chains and 12000 iterations per chain.
- Produce trace plots for each of the three chains.
- What is the range of values on the trace plot x-axis? Why is the maximum value of this range not 12000?
- Create a density plot of the values for each of the three chains.
- Hearkening back to Chapter 5, specify the posterior model of \(\pi\). How does your MCMC approximation compare?

**Exercise 6.17 (MCMC with RStan: Beta-Binomial once more with feeling)**Consider the Beta-Binomial model for \(\pi\) with \(Y | \pi \sim \text{Bin}(n, \pi)\) and \(\pi \sim \text{Beta}(4,3)\). Suppose that in \(n = 12\) independent trials, you observe \(Y = 4\) successes.

- Simulate the posterior model of \(\pi\) with RStan using 4 chains and 10000 iterations per chain.
- Produce trace plots for all four chains.
- What is the range of values on the trace plot x-axis? Why is the maximum value of this range not 10000?
- Create a density plot of the values for each of the four chains.

**Exercise 6.18 (MCMC with RStan: Gamma-Poisson)**Consider the Gamma-Poisson model for \(\lambda\) with \(Y_i | \lambda \stackrel{ind}{\sim} \text{Pois}(\lambda)\) and \(\lambda \sim \text{Gamma}(20,5)\). Suppose that you observe \(n=3\) independent data points \((Y_1,Y_2,Y_3) = (0,1,0)\).

- Simulate the posterior model of \(\lambda\) with RStan using 4 chains and 10000 iterations per chain.
- Produce trace plots for all four chains.
- Create a density plot of the values for each of the four chains.
- From the density plots, what seems to be the most posterior plausible value of \(\lambda\)?
- Hearkening back to Chapter 5, specify the posterior model of \(\lambda\). How does your MCMC approximation compare?

**Exercise 6.19 (MCMC with RStan: Gamma-Poisson again)**Repeat exercise

**??**for the Gamma-Poisson model of \(\lambda\) with \(Y_i | \lambda \stackrel{ind}{\sim} \text{Pois}(\lambda)\) and \(\lambda \sim \text{Gamma}(5,5)\). Suppose that you observe \(n=3\) independent data points \((Y_1,Y_2,Y_3) = (0,1,0)\).

**Exercise 6.20 (MCMC with RStan: Normal-Normal)**Repeat exercise

**??**for the Normal-Normal model of \(\mu\) with \(Y_i | \mu \stackrel{ind}{\sim} N(\mu, 1.3^2)\) and \(\mu \sim N(10, 1.2^2)\). Suppose that on \(n = 4\) independent observations, you observe data \((Y_1,Y_2,Y_3,Y_4) = (7.1, 8.9, 8.4, 8.6)\).

**Exercise 6.21 (MCMC with RStan: Normal-Normal part deux)**Repeat exercise

**??**for the Normal-Normal model of \(\mu\) with \(Y_i | \mu \stackrel{ind}{\sim} N(\mu, 8^2)\) and \(\mu \sim N(-14, 2^2)\). Suppose that on \(n = 5\) independent observations, you observe data \((Y_1,Y_2,Y_3,Y_4,Y_5) = (-10.1, 5.5, 0.1, -1.4, 11.5)\).

Answer: b↩︎