Chapter 5 Conjugate families

In the novel Anna Karenina, Tolstoy wrote “Happy families are all alike; every unhappy family is unhappy in its own way.” In this chapter we will learn about conjugate families, which are all alike in the sense that they make the authors very happy. Read on to learn why!

  • Practice building posterior models.
    You will build posterior models by practicing how to recognize kernels and make use of proportionality.
  • Familiarize yourself with conjugacy.
    You will learn about what makes a prior conjugate and why this is a helpful property. In brief, conjugate priors make it easier to build posterior models. Conjugate priors spark joy!

To get started, load the following packages which we’ll use throughout the chapter:

# Load packages
library(bayesrules)
library(tidyverse)

5.1 Revisiting choice of prior

How do we choose a prior? In Chapters 3 and 4 we used the flexibility of the Beta model to appropriately model our prior understanding of a proportion parameter \(\pi \in [0,1]\). There are other criteria to consider when choosing a prior model:

  • Computational ease
    Especially if we don’t have access to computing power, it is helpful if the posterior model is easy to build.
  • Interpretability
    We’ve seen that posterior models are a compromise between the data and the prior model. A posterior model is interpretable, thus more useful, when you can look at the formulation of the posterior model and identify the contribution of the data relative to the contribution of the prior.

The Beta-Binomial has both of these criteria covered. With the Beta-Binomial model, calculation is easy. Once we know the parameters for the Beta(\(\alpha, \beta\)) prior model and the observed data \(Y = y\) for the Bin(\(n, \pi\)) likelihood, the posterior model Beta(\(\alpha+y, \beta+n-y\)) follows. This posterior reflects the influence of the data, through the values of \(y\) and \(n\), relative to the prior hyperparameters \(\alpha\) and \(\beta\). If \(\alpha\) and \(\beta\) are larger relative to the sample size \(n\), then the posterior will not budge that much from the prior. However, if the sample size \(n\) is large relative to \(\alpha\) and \(\beta\), then the data will take over and be more influential in the posterior model. In fact, the Beta-Binomial belongs to a larger class of prior-data combinations called conjugate families that allow for both computational ease and interpretable posteriors. Recall a general definition similar to that from Chapter 3.
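
For instance, here is a small sketch of this update using summarize_beta_binomial() from the bayesrules package, with an illustrative Beta(2,2) prior and hypothetical data of \(y = 9\) successes in \(n = 20\) trials:

# A quick sketch of the Beta-Binomial update:
# a Beta(2, 2) prior with y = 9 successes in n = 20 trials
# yields a Beta(2 + 9, 2 + 20 - 9) = Beta(11, 13) posterior
summarize_beta_binomial(alpha = 2, beta = 2, y = 9, n = 20)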

Conjugate prior

Let the prior model for parameter \(\theta\) have pdf \(f(\theta)\) and the model of data \(Y\) conditioned on \(\theta\) have likelihood function \(L(\theta|y)\). If the resulting posterior model with pdf \(f(\theta|y) \propto f(\theta)L(\theta|y)\) is of the same model family as the prior, then we say this is a conjugate prior.

Hyperparameter

A hyperparameter is a parameter used in a prior model. For example, if the prior model for parameter \(\pi\) is Beta(\(\alpha, \beta\)), then \(\alpha\) and \(\beta\) are hyperparameters for \(\pi\).

5.1.1 A non-conjugate prior for Binomial data

To emphasize the utility (and fun) of conjugate priors, it can be helpful to consider a non-conjugate prior. Let parameter \(\pi\) be a proportion between 0 and 1 and suppose we plan to collect data \(Y\) where, conditional on \(\pi\), \(Y|\pi \sim \text{Bin}(n,\pi)\). Instead of our conjugate \(\text{Beta}(\alpha,\beta)\) prior for \(\pi\), let’s try out a non-conjugate prior with pdf \(f(\pi)\), plotted in Figure 5.1:

\[\begin{equation} f(\pi)=e-e^\pi\; \text{ for } \pi \in [0,1]. \tag{5.1} \end{equation}\]

Though not a Beta pdf, \(f(\pi)\) is indeed a valid pdf since it meets both of the following conditions: \(f(\pi)\) is non-negative on the support of \(\pi\), and \(f(\pi)\) integrates to 1 across that support, \(\int_0^1 f(\pi) \, d\pi = 1\).
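
If you'd like a quick sanity check of that second condition, R's built-in integrate() function confirms it numerically:

# Numerically confirm that f(pi) = e - e^pi integrates to 1 over [0, 1]
integrate(function(pi) exp(1) - exp(pi), lower = 0, upper = 1)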

FIGURE 5.1: A non-conjugate prior for \(\pi\).

Next, suppose we observe \(Y = 10\) successes from \(n = 50\) independent trials, all having the same probability of success \(\pi\). The resulting Binomial likelihood function of \(\pi\) is:

\[L(\pi | (y=10)) = {50 \choose 10} \pi^{10} (1-\pi)^{40} \; \; \text{ for } \pi \in [0,1] \; .\]

Recall from Chapter 3 that, when we put the prior and the likelihood together, we are already on our path to finding the posterior model:

\[f(\pi | (y = 10)) \propto f(\pi) L(\pi | (y = 10)) = (e-e^\pi) \cdot \binom{50}{10} \pi^{10} (1-\pi)^{40}.\]

As we did in Chapter 3, we will drop all constants that do not depend on \(\pi\) since we are only specifying \(f(\pi|y)\) up to a proportionality constant:

\[f(\pi | y ) \propto (e-e^\pi) \pi^{10} (1-\pi)^{40}.\]

Notice here that our non-Beta prior didn’t produce a neat and clean answer for the exact posterior model (fully specified and not just up to a proportionality constant). We cannot squeeze this posterior pdf kernel into a Beta box, nor any other familiar model for that matter. That is, we cannot rewrite \((e-e^\pi) \pi^{10} (1-\pi)^{40}\) so that it shares the same structure as a Beta kernel, \(\pi^{\blacksquare - 1}(1-\pi)^{\blacksquare-1}\). Instead, we will need to integrate this kernel in order to compute the normalizing constant and, in turn, fully specify the posterior:

\[\begin{equation} f(\pi|y=10)= \frac{(e-e^\pi) \pi^{10} (1-\pi)^{40}}{\int_0^1(e-e^\pi) \pi^{10} (1-\pi)^{40}d\pi} \; \; \text{ for } \pi \in [0,1]. \tag{5.2} \end{equation}\]

This is where we really start to feel the pain of not having a conjugate prior! Since this is a particularly unpleasant integral to evaluate, we will leave ourselves with (5.2) as the final posterior. Yikes, what a mess! This is a valid posterior pdf – it’s non-negative and integrates to 1 across \(\pi \in [0,1]\). But it doesn’t have much else going for it. Consider a few characteristics about the posterior model that result from this particular non-conjugate prior (5.1):

  • The calculation for this posterior was messy and unpleasant.
  • It is difficult to derive any intuition about the balance between the prior information and the data we observed from this posterior model.
  • It would be difficult to specify the corresponding posterior mean, mode, and variance (a process which would require even more integration).
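
To be concrete, about the best we could do with (5.2) is evaluate its normalizing constant numerically. A minimal sketch in R (the grid and object names below are just for illustration):

# The posterior kernel from (5.2)
kernel <- function(pi) (exp(1) - exp(pi)) * pi^10 * (1 - pi)^40

# Approximate the normalizing constant by numerical integration
norm_constant <- integrate(kernel, lower = 0, upper = 1)$value

# Evaluate the normalized posterior pdf on a grid of pi values
pi_grid <- seq(0, 1, length = 501)
posterior_pdf <- kernel(pi_grid) / norm_constant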

This leaves us with the question: could we use the conjugate Beta prior and still capture much of the same information as the messy non-conjugate prior (5.1)? If so, then we solve the problems of messy calculations and indecipherable posterior models!

Which of the following Beta models would most closely approximate the non-conjugate prior for \(\pi\) (Figure 5.1)?

  a. Beta(3,1)
  b. Beta(1,3)
  c. Beta(2,1)
  d. Beta(1,2)

The answer to this quiz is d. One way to find the answer is by comparing the plot of the non-conjugate prior in Figure 5.1 with plots of the possible Beta prior models using plot_beta(). For example, the prior information expressed by (5.1) is pretty well captured by the Beta(1,2):

plot_beta(alpha = 1, beta = 2)
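
To see the resemblance for yourself, one option is to overlay the two pdfs with ggplot2 (loaded as part of the tidyverse). This sketch plots the non-conjugate prior (5.1) alongside the Beta(1,2) pdf:

# Overlay the non-conjugate prior f(pi) = e - e^pi and the Beta(1,2) pdf
ggplot(data.frame(pi = c(0, 1)), aes(x = pi)) +
  stat_function(aes(color = "non-conjugate"), fun = function(x) exp(1) - exp(x)) +
  stat_function(aes(color = "Beta(1, 2)"), fun = dbeta,
                args = list(shape1 = 1, shape2 = 2)) +
  labs(x = "pi", y = "density", color = "prior")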

5.2 Gamma-Poisson conjugate family

Stop callin’ / Stop callin’ / I don’t wanna think anymore / I got my head and my heart on the dance floor
- Lady Gaga featuring Beyoncé. Lyric from Telephone.

In 2019, one of the co-authors of this book got fed up with the number of fraud risk phone calls they were receiving. They set out with a goal of modeling rate \(\lambda\), the typical number of fraud risk calls received per day. Prior to collecting any data, the co-author’s guess was that this rate was most likely around 5 calls per day, but could also reasonably range between 2 and 7 calls per day. To learn more, they planned to record the number of fraud risk phone calls on each of \(n\) sampled days, \((Y_1,Y_2,\ldots,Y_n)\).

Before moving forward, it’s important to recognize here that our investigation of \(\lambda\) will not fit into the familiar Beta-Binomial framework. First, \(\lambda\) is not a proportion limited to be between 0 and 1, but rather a rate parameter that can take on any positive value (e.g., we can’t receive -7 calls per day). Further, each data point \(Y_i\) is a count that can technically take on any non-negative integer in \(\{0,1,2,\ldots\}\), with no fixed number of trials to cap it, and thus is not Binomial. Not to fret. Our study of \(\lambda\) will introduce us to a new conjugate family, the Gamma-Poisson.

5.2.1 Poisson likelihood

The spirit of our analysis starts with a prior understanding of \(\lambda\), the daily rate of fraud risk phone calls. Yet before choosing a prior model structure and tuning it to match our prior understanding, we start our Bayesian modeling process by choosing a way to model the dependence of our daily phone call count data \(Y_i\) on the typical daily rate of such calls \(\lambda\). Upon identifying a reasonable likelihood model for our data, we can identify a prior model which can be tuned to match our prior understanding while also mathematically complementing the likelihood. Keeping in mind that each data point \(Y_i\) is a random count that can go from 0 to a really big number (\(Y_i \in \{0,1,2,\cdots\}\)), the Poisson model, described in its general form below, is a reasonable candidate for its likelihood model.

The Poisson model

Let random variable \(Y\) be the number of independent events that occur in a fixed amount of time or space, where \(\lambda > 0\) is the rate at which these events occur. Then the dependence of \(Y\) on parameter \(\lambda\) can be modeled by the Poisson. In mathematical notation:

\[Y | \lambda \sim \text{Pois}(\lambda) \]

Correspondingly, the Poisson model is specified by a conditional pmf:

\[\begin{equation} f(y|\lambda) = \frac{\lambda^y e^{-\lambda}}{y!}\;\; \text{ for } y \in \{0,1,2,\ldots\} \tag{5.3} \end{equation}\]

where \(f(y|\lambda)\) sums to one across \(y\), \(\sum_{y=0}^\infty f(y|\lambda) = 1\). Further, a Poisson random variable \(Y\) is assumed to have equal mean and variance,

\[\begin{equation} E(Y|\lambda) = \text{Var}(Y|\lambda) = \lambda \; . \tag{5.4} \end{equation}\]
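
If you'd like to see this equal-mean-and-variance property in action, here's a quick numerical check for \(\lambda = 5\) (truncating the infinite sums at a suitably large value of \(y\)):

# Check E(Y) = Var(Y) = lambda for a Pois(5) variable
lambda <- 5
y <- 0:100
p <- dpois(y, lambda)
sum(y * p)                # mean: approximately 5
sum((y - lambda)^2 * p)   # variance: approximately 5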

Let \((Y_1, Y_2, \cdots, Y_n)\) denote the number of fraud risk calls we observed on each of the \(n\) days in our data collection period. Though the observed values of fraud phone calls \(Y_i\) might differ from day to day, we can assume that each can be independently modeled by the Poisson:

\[Y_i | \lambda \stackrel{ind}{\sim} \text{Pois}(\lambda)\]

with unique pmfs

\[f(y_i|\lambda) = \frac{\lambda^{y_i}e^{-\lambda}}{y_i!}\;\; \text{ for } y_i \in \{0,1,2,\ldots\} \; .\]

Using this data information, we can construct a likelihood function of \(\lambda\) by which to assess the compatibility of different possible \(\lambda\) values with our observed collection of sample data \((Y_1, Y_2, \cdots, Y_n) = (y_1, y_2, \cdots, y_n)\). If we were working with just one data point, say \(Y_1\), the likelihood function would simply be equivalent to the conditional pmf of \(Y_1\):

\[L(\lambda | y_1) = f(y_1 | \lambda) \; .\]

Yet in our phone call study we have a sample of \(n\) data points, not one. Thus we need a joint likelihood function which measures how well different \(\lambda\) values agree with the joint information in this sample.

Joint likelihood function

Let \((Y_1,Y_2,\ldots,Y_n)\) be an independent sample of random variables and \(\vec{y} = (y_1,y_2,\ldots,y_n)\) be the corresponding vector of observed values. Further, let likelihood function \(L(\lambda | y_i)\) measure the compatibility of parameter \(\lambda\) with each individual observed data point \(Y_i = y_i\). Then by the assumption of independence, the joint likelihood function is

\[\begin{equation} L(\lambda | \vec{y}) = \prod_{i=1}^n L(\lambda | y_i) = L(\lambda | y_1) \cdot L(\lambda | y_2) \cdots L(\lambda | y_n) \; . \tag{5.5} \end{equation}\]

The greater the joint likelihood, the greater the compatibility of \(\lambda\) with the observed collection of data points (\(y_1,y_2,\ldots,y_n\)).

The formula of the joint likelihood function for our fraud risk call sample follows by applying the general definition (5.5) to our Poisson pmfs:

\[\begin{equation} L(\lambda | \vec{y}) = \prod_{i=1}^{n}f(y_i | \lambda) = \prod_{i=1}^{n}\frac{\lambda^{y_i}e^{-\lambda}}{y_i!} \;\; \text{ for } \; \lambda > 0 \; . \tag{5.6} \end{equation}\]

This looks like a mess, but it can be simplified. In this simplification, it’s important to recognize that we have \(n\) unique data points \(y_i\), not \(n\) copies of the same data point \(y\). That is, we need to pay careful attention to the \(i\) subscripts:

\[\begin{split} L(\lambda | \vec{y}) & = \prod_{i=1}^{n}\frac{\lambda^{y_i}e^{-\lambda}}{y_i!} \\ & = \frac{\lambda^{y_1}e^{-\lambda}}{y_1!} \cdot \frac{\lambda^{y_2}e^{-\lambda}}{y_2!} \cdots \frac{\lambda^{y_n}e^{-\lambda}}{y_n!} \\ & =\frac{\lambda^{\sum y_i}e^{-n\lambda}}{\prod_{i=1}^n y_i!} \\ \end{split}\]

It is desirable to represent the likelihood function up to a proportionality constant here, especially since \(\prod y_i!\) will be cumbersome to calculate when \(n\) is large, and what we really care about in the likelihood is \(\lambda\). As such, we can further simplify our likelihood function by dropping the constants that don’t depend upon \(\lambda\):

\[L(\lambda | \vec{y})\propto \lambda^{\sum y_i} e^{-n\lambda}\; .\]

Returning to our fraud risk calls, on four separate days in the second week of August 2019, we received \(\vec{y} = (y_1,y_2,y_3,y_4) = (6,2,2,1)\) such calls. Thus we have a sample of \(n = 4\) data points with a total of 11 fraud risk calls,

\[\sum_{i=1}^4y_i = 6 + 2 + 2 + 1 = 11\]

and an average of 2.75 phone calls per day,

\[\overline{y} = \frac{\sum_{i=1}^4y_i}{4} = 2.75 \; .\] Using this data, the joint likelihood function for \(\lambda\) is:

\[L(\lambda | \vec{y}) = \frac{\lambda^{11}e^{-4\lambda}}{6!\times2!\times2!\times1!} \propto \lambda^{11}e^{-4\lambda} \; .\]

Note that when we express the likelihood up to a proportionality constant, the sum of the data points (\(\sum y_i = 11\)) and the number of data points (\(n = 4\)) are all the information we need from the data. We don’t need to know the value of each individual data point.
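
As a quick check of this point, the exact joint likelihood and its kernel differ only by a constant multiple across any grid of \(\lambda\) values (a sketch, with an arbitrary grid):

# Compare the exact joint likelihood to its kernel across a grid of lambda values
lambda_grid <- seq(0.1, 10, by = 0.1)
exact  <- sapply(lambda_grid, function(l) prod(dpois(c(6, 2, 2, 1), l)))
kernel <- lambda_grid^11 * exp(-4 * lambda_grid)

# The ratio is the same for every lambda: 1 / (6! * 2! * 2! * 1!)
range(exact / kernel)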

Let’s visualize the portion of likelihood function \(L(\lambda | \vec{y})\) for \(\lambda\) between 0 and 10. Why can’t we visualize the whole likelihood? Because \(\lambda \in (0, \infty)\) and this book would be pretty expensive if we had infinite pages. For this visualization we can use the plot_poisson_likelihood() function from the bayesrules package where y is the vector of data values, and lambda_upper_bound is the maximum value of \(\lambda\) to view on the x-axis:

plot_poisson_likelihood(y = c(6, 2, 2, 1), lambda_upper_bound = 10)

FIGURE 5.2: Poisson likelihood function for the phone call data.

So what can we learn from this likelihood function? Well, our four days of data collection suggest that the rate of fraud risk calls likely ranges from one to five calls per day. Further, values of \(\lambda\) near \(\overline{y} = 2.75\), the observed average number of daily calls in our four day sample, are the most compatible with our data.

5.2.2 Potential priors

The Poisson likelihood model provides one of two key pieces for our Bayesian analysis of \(\lambda\), the daily rate of fraud risk calls. The other key piece is a prior model for \(\lambda\). Before we collected any data, our guess was that this rate was most likely around 5 calls per day, but could also reasonably range between 2 and 7 calls per day. In order to tune a prior to match these ideas about \(\lambda\), we first have to identify a reasonable probability model structure.

Remember here that \(\lambda\) is a positive and continuous rate, meaning that \(\lambda\) does not have to be a whole number. Accordingly, a reasonable prior probability model will also have continuous and positive support, ie. be defined on \(\lambda > 0\). There are several named probability models with this property, including the \(F\), Weibull, and Gamma models as well as common special cases of the Gamma: the \(\chi^2\) and Exponential models. To make the construction of the posterior model of \(\lambda\) more straightforward, it would be convenient to choose a conjugate prior.

Suppose we have a random sample of Poisson random variables \((Y_1, Y_2, \ldots, Y_n)\) with joint likelihood function \(L(\lambda | \vec{y})\propto \lambda^{\sum y_i}e^{-n\lambda}\) for \(\lambda > 0\). Which of the following do you think would provide a convenient conjugate prior model for \(\lambda\)? Why?

  a. Gamma(\(s,r\)) with pdf \(f(\lambda) \propto \lambda^{s - 1} e^{-r \lambda}\)
  b. Weibull(\(s,r\)) with pdf \(f(\lambda) \propto \lambda^{s - 1} e^{-(r \lambda)^s}\)
  c. a special case of the F(\(s,r\)) with \(s = r\) and pdf \(f(\lambda) \propto \lambda^{\frac{s}{2} - 1}\left( 1 + \lambda\right)^{-s}\)

The answer is a. The Gamma model will provide a conjugate prior for \(\lambda\) when our data has a Poisson likelihood. You might have guessed this from the section title. You might also have guessed this from the shared features of the Poisson joint likelihood function \(L(\lambda|\vec{y})\) and the Gamma pdf \(f(\lambda)\) – both are proportional to

\[\lambda^{\blacksquare}e^{-\blacksquare\lambda}\]

with differing \(\blacksquare\). In fact, we’ll prove that combining the prior and joint likelihood produces a posterior pdf with this same structure. That is, the posterior will be of the same Gamma model family as the prior. We’ll later add some clarity to this abstraction. First, let’s learn more about the Gamma model.

5.2.3 Gamma prior

The Gamma & Exponential models

Let \(\lambda\) be a random variable which can take any positive value, ie. \(\lambda > 0\). Then the variability in \(\lambda\) might be well modeled by a Gamma model with shape parameter \(s > 0\) and rate parameter \(r > 0\):

\[\lambda \sim \text{Gamma}(s, r) \; .\]

The Gamma model is specified by continuous pdf

\[\begin{equation} f(\lambda) = \frac{r^s}{\Gamma(s)} \lambda^{s-1} e^{-r\lambda} \;\; \text{ for } \lambda > 0 \tag{5.7} \end{equation}\]

where constant \(\Gamma(s) = \int_0^\infty z^{s - 1} e^{-z}dz\). When \(s\) is a positive integer, \(s \in \{1,2,3,\ldots\}\), this constant simplifies to \(\Gamma(s) = (s - 1)!\). Further, the trend and variability in \(\lambda\) are measured by:

\[\begin{equation} \begin{split} E(\lambda) & = \frac{s}{r} \\ \text{Mode}(\lambda) & = \frac{s - 1}{r} \;\; \text{ for } s \ge 1 \\ \text{Var}(\lambda) & = \frac{s}{r^2} \\ \end{split} \tag{5.8} \end{equation}\]

The Exponential model,

\[\lambda \sim \text{Exp}(r)\]

is a special case of the Gamma with shape parameter \(s = 1\), \(\text{Gamma}(1,r)\).

Figure 5.3 illustrates how different shape and rate parameters impact the Gamma pdf (5.7).

FIGURE 5.3: Gamma models with different tuning parameters. The dashed and solid vertical lines represent the modes and means, respectively.

  1. How would you describe the trend of a Gamma(\(s,r\)) variable \(\lambda\) when \(s > r\) (eg: Gamma(2,1))?

    a. Right-skewed with a mean greater than 1.
    b. Right-skewed with a mean less than 1.
    c. Symmetric with a mean around 1.
    d. Left-skewed with a mean greater than 1.
  2. Using the same options as above, how would you describe the trend of a Gamma(\(s,r\)) variable \(\lambda\) when \(s < r\) (eg: Gamma(1,2))?

  3. For which model is there greater variability in the plausible values of \(\lambda\), Gamma(20,20) or Gamma(20,100)?

The answers to this quiz are in the footnote below.46 To confirm these answers, we can support the visual properties of the \(\lambda \sim \text{Gamma}(s,r)\) variables in Figure 5.3 with quantitative measures of trend and variability (5.8). For example, notice that the mean of \(\lambda\), \(E(\lambda) = s/r\), is greater than 1 when \(s > r\) and less than 1 when \(s < r\). Further, the variability in \(\lambda\), \(\text{Var}(\lambda) = s / r^2\), decreases as \(r\) increases.
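
For instance, applying (5.8) to the two models in question 3:

# Mean (s / r) and variance (s / r^2) for Gamma(20, 20) and Gamma(20, 100)
c(mean = 20 / 20, var = 20 / 20^2)    # Gamma(20, 20): mean 1, variance 0.05
c(mean = 20 / 100, var = 20 / 100^2)  # Gamma(20, 100): mean 0.2, variance 0.002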

Now that we have some intuition for how the Gamma(\(s,r\)) model works, we can tune the shape and rate parameters \(s\) and \(r\) to reflect our prior information about the daily rate of fraud risk phone calls \(\lambda\). Before looking at our data, we guessed that \(\lambda\) is about 5, and most likely somewhere between 2 and 7. Our Gamma(\(s,r\)) prior should have similar trends and variability. For example, we want to pick \(s\) and \(r\) for which \(\lambda\) tends to be around 5,

\[E(\lambda) = \frac{s}{r} \approx 5 \; .\]

This can be achieved by setting \(s\) to be 5 times \(r\), \(s = 5r\). Next we want to make sure that most of the values of our Gamma(\(s, r\)) prior are between 2 and 7. Through some trial and error within these constraints, and plotting various Gamma models using plot_gamma() in the bayesrules package, we find that the features of the Gamma(10,2) model closely match the trend and variability in our prior understanding:

plot_gamma(shape = 10, rate = 2)

FIGURE 5.4: The pdf of the Gamma(10,2) prior.

Thus a reasonable prior model for the daily rate of fraud risk phone calls is

\[\lambda \sim \text{Gamma}(10,2)\]

with prior pdf \(f(\lambda)\) following from plugging \(s = 10\) and \(r = 2\) into (5.7).

\[f(\lambda) = \frac{2^{10}}{\Gamma(10)} \lambda^{10-1} e^{-2\lambda} \;\; \text{ for } \lambda > 0\;.\]
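
As a final check on this tuning, R's pgamma() function confirms that the vast majority of the Gamma(10,2) prior probability sits between 2 and 7 calls per day:

# Prior probability that lambda falls between 2 and 7 under Gamma(10, 2)
pgamma(7, shape = 10, rate = 2) - pgamma(2, shape = 10, rate = 2)
# roughly 0.88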

5.2.4 Gamma-Poisson conjugacy

As we discussed at the start of this chapter, conjugate families can come in handy. Fortunately for us, using a Gamma prior for a rate parameter \(\lambda\) and a Poisson likelihood for corresponding count data \(Y\) is another example of a conjugate family. This means that, spoiler, the posterior model for \(\lambda\) will also be a Gamma model with updated parameters. We’ll state and prove this in the general setting before applying the results to our phone call situation.

The Gamma-Poisson Bayesian model

Let \(\lambda > 0\) be an unknown rate parameter and \((Y_1,Y_2,\ldots,Y_n)\) be an independent \(\text{Pois}(\lambda)\) sample. The Gamma-Poisson Bayesian model complements the Poisson structure of data \(Y\) with a Gamma prior on \(\lambda\):

\[\begin{split} Y_i | \lambda & \stackrel{ind}{\sim} \text{Pois}(\lambda) \\ \lambda & \sim \text{Gamma}(s, r) \\ \end{split}\]

Upon observing data \(\vec{y} = (y_1,y_2,\ldots,y_n)\), the posterior model of \(\lambda\) is also a Gamma with updated parameters:

\[\begin{equation} \lambda|\vec{y} \; \sim \; \text{Gamma}(s + \sum y_i, \; r + n) \; . \tag{5.9} \end{equation}\]

Let’s prove this result. In general, recall that the posterior pdf of \(\lambda\) is proportional to the product of the prior pdf and joint likelihood function defined by (5.7) and (5.6), respectively:

\[f(\lambda|\vec{y}) \propto f(\lambda)L(\lambda|\vec{y}) = \frac{r^s}{\Gamma(s)} \lambda^{s-1} e^{-r\lambda} \cdot \frac{\lambda^{\sum y_i}e^{-n\lambda}}{\prod y_i!} \;\;\; \text{ for } \lambda > 0.\]

Next, remember that any non-\(\lambda\) multiplicative constant in the above equation can be “proportional-ed” out. This includes multiplicative terms that depend only on \(s\), \(r\), and the \(y\)’s. Thus, boiling the prior pdf and likelihood function down to their kernels, we get:

\[\begin{split} f(\lambda|\vec{y}) & \propto \lambda^{s-1} e^{-r\lambda} \cdot \lambda^{\sum y_i}e^{-n\lambda} \\ & = \lambda^{s + \sum y_i - 1} e^{-(r+n)\lambda} \\ \end{split}\]

where the final line follows by combining like terms. What we’re left with here is the kernel of the posterior pdf. This particular kernel corresponds to the pdf of a Gamma model with shape parameter \(s + \sum y_i\) and rate parameter \(r + n\). Thus we’ve proven that

\[ \lambda|\vec{y} \; \sim \; \text{Gamma}\bigg(s + \sum y_i, r + n \bigg) \; .\]

Let’s apply this result to our fraud risk calls example. There we have a Gamma(10,2) prior for \(\lambda\), the daily rate of fraud risk calls. Further, across \(n = 4\) days of data collection, we observed a total of \(\sum y_i = 11\) fraud risk calls for an average of \(\overline{y} = 2.75\) calls per day. It follows from (5.9) that the posterior model of \(\lambda\) is a Gamma with an updated shape parameter of 21 (\(s + \sum y_i = 10 + 11\)) and rate parameter of 6 (\(r + n = 2 + 4\)):

\[\lambda|\vec{y} \; \sim \; \text{Gamma}(21, 6) \; .\]

We can visualize the prior, scaled likelihood, and posterior models for \(\lambda\) all in a single plot with the plot_gamma_poisson() function! How magical! For this function to work, we must specify a few things: the prior shape and rate parameters as well as the information from our data, the observed total number of phone calls sum_y \((\sum y_i)\) and the sample size n:

plot_gamma_poisson(shape = 10, rate = 2, sum_y = 11, n = 4)

FIGURE 5.5: The prior, likelihood, and posterior of the Gamma-Poisson model for our fraud risk phone call analysis.

Our posterior notion about the daily rate of fraud calls is a compromise between our vague prior and the observed phone call data. Since our prior notion was more variable than the likelihood, the posterior model of \(\lambda\) has more in common with the likelihood. Specifically, utilizing the properties of the Gamma(10,2) prior and Gamma(21,6) posterior, notice that our posterior understanding of the typical daily rate of phone calls dropped from 5 to 3.5 per day:

\[E(\lambda) = \frac{10}{2} = 5 \;\; \text{ and } \;\; E(\lambda | \vec{y}) = \frac{21}{6} = 3.5 \; .\]

Though a compromise between the prior mean and data mean, this posterior mean is closer to the data mean of \(\overline{y} = 2.75\) calls per day.
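
To see this compromise in action, consider a hypothetical larger sample with the same daily average of 2.75 calls, say a total of sum_y = 110 calls across n = 40 days. Re-running plot_gamma_poisson() with these values shows the posterior hugging the data mean even more closely:

# A hypothetical larger sample with the same 2.75 calls-per-day average
plot_gamma_poisson(shape = 10, rate = 2, sum_y = 110, n = 40)

Here the posterior mean, \((10 + 110)/(2 + 40) \approx 2.86\), is pulled nearly all the way to \(\overline{y} = 2.75\).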

Hot tip

The posterior mean will always be between the prior mean and the data mean. If your posterior mean falls outside that range, it indicates that you made an error and should retrace some steps.

Further, with the additional information about \(\lambda\) from the data, the variability in our understanding of \(\lambda\) drops more than fourfold, from 2.5 to 0.583 squared calls per day:

\[\text{Var}(\lambda) = \frac{10}{2^2} = 2.5 \;\; \text{ and } \;\; \text{Var}(\lambda | \vec{y}) = \frac{21}{6^2} = 0.583 \; .\]

There is another convenient function in the bayesrules package that helps us contrast the prior and posterior models and confirms the results above: summarize_gamma_poisson() which uses the same parameters as plot_gamma_poisson(). Let’s use that here:

summarize_gamma_poisson(shape = 10, rate = 2, sum_y = 11, n = 4)
      model shape rate mean  mode    var
1     prior    10    2  5.0 4.500 2.5000
2 posterior    21    6  3.5 3.333 0.5833

5.3 Normal-Normal conjugate family

We now have two conjugate families in our toolkit: the Beta-Binomial and the Gamma-Poisson. But there are many more conjugate families that exist! It’s impossible to cover them all, but there is a third conjugate family that’s especially helpful to know: the Normal-Normal.

Consider a data story. As scientists learn more about brain health, the dangers of contact sports (particularly American football) are gaining greater attention (Bachynski 2019). American football is a popular sport in which players often sustain repeated concussions. Among all American college football players who have incurred concussions, we are interested in \(\mu\), the average volume (in cubic centimeters) of a specific part of the brain: the hippocampus. Though we don’t have prior information about football players specifically, Wikipedia tells us that among the general population of human adults, each half of the hippocampus has a volume between 3.0 and 3.5 cubic centimeters.47 Thus the total hippocampal volume of both sides of the brain is between 6 and 7 cubic centimeters. Using this as a starting point, we’ll assume that the mean hippocampal volume among football players, \(\mu\), is also somewhere between 6 and 7 cubic centimeters, with an average of 6.5.

To learn more about \(\mu\), consider some data. Rashmi Singh and colleagues performed a cross-sectional study of hippocampal volumes among 75 subjects (Singh et al. 2014). Their data is provided by the FootballBrain dataset in the Lock5Data package (Lock et al. 2016). We provide a smaller version of this dataset as football in the bayesrules package. For our analysis, we’ll focus on the hippocampal volumes for the \(n = 25\) study subjects who played collegiate American football and who have been diagnosed with concussions:

# Load the data
data(football)

# Filter and scale the data from microliters to cubic cm
football <- football %>%
  filter(group == "fb_concuss") %>% 
  mutate(volume = hipp / 1000)

Let \((Y_1,Y_2,\ldots,Y_{25})\) denote the hippocampal volumes for these 25 players. In this sample, the average hippocampal volume is roughly \(\overline{y}\) = 5.735 cubic centimeters with a standard deviation of 0.593. Further, the hippocampal volumes range from roughly 4.5 to 7 cubic centimeters:

football %>%
  summarize(mean(volume), sd(volume))
  mean(volume) sd(volume)
1        5.735     0.5934
ggplot(football, aes(x = volume)) + 
  geom_density()

5.3.1 Normal likelihood

The first step in our analysis of \(\mu\), the average hippocampal volume of all American college football players with a history of concussions, is to model the dependence of each sampled player’s volume \(Y_i\) upon \(\mu\). Since hippocampal volume \(Y_i\) is measured on a continuous scale, there are many possible common models for describing the variability in \(Y_i\) from player to player: Beta, Exponential, Gamma, Normal, Chi-square, F, etc. We can immediately toss out the Beta model – it assumes that \(Y_i \in [0,1]\) whereas we observed above that our players’ hippocampal volumes range from roughly 4.5 to 7 cubic centimeters. Among the remaining options, the Normal model is quite reasonable. Consistent with the observed data in the density plot above, the Normal model assumes that the players’ hippocampal volumes are symmetrically distributed around some global average volume (\(\mu\)!).

As you’ll see over and over throughout this book, reasonable doesn’t mean perfect. Though the Normal model does a great job of capturing the main features of our sample data \((Y_1,Y_2,\ldots,Y_{25})\), it technically assumes that each player’s hippocampal volume can range from \(-\infty\) to \(\infty\). Since we can tune the Normal model to put negligible weight on unreasonable values of hippocampal volume, we’re not too worried about this incorrect assumption here.

The Normal model

Let \(Y\) be a random variable which can take any value between \(-\infty\) and \(\infty\), ie. \(Y \in (-\infty,\infty)\). Then the variability in \(Y\) might be well represented by a Normal model with mean parameter \(\mu \in (-\infty, \infty)\) and standard deviation parameter \(\sigma > 0\):

\[Y \sim N(\mu, \sigma^2)\]

The Normal model is specified by continuous pdf

\[\begin{equation} f(y) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\bigg[{-\frac{(y-\mu)^2}{2\sigma^2}}\bigg] \;\; \text{ for } y \in (-\infty,\infty) \tag{5.10} \end{equation}\]

and has the following features:

\[\begin{split} E(Y) & = \text{ Mode}(Y) = \mu \\ \text{Var}(Y) & = \sigma^2 \\ \end{split}\]

Further, \(\sigma\) provides a sense of scale for \(Y\). Roughly 95% of \(Y\) values will be within 2 standard deviations of \(\mu\):

\[\begin{equation} \mu \pm 2\sigma \; . \tag{5.11} \end{equation}\]
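
This 95% rule of thumb is easy to confirm with R's pnorm() function (standardizing to a N(0,1) variable):

# Probability that a Normal variable lands within 2 standard deviations of its mean
pnorm(2) - pnorm(-2)
# roughly 0.954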

FIGURE 5.6: Normal models with varying means and variances.

The formula behind the Normal pdfs in Figure 5.6 is specified by (5.10). We can see the impact that both the mean parameter, \(\mu\), and the standard deviation parameter, \(\sigma\), have on the Normal model of \(Y\). As \(\mu\) gets larger, the center of the model shifts over to the right. As \(\sigma\) gets larger, the model becomes more spread out. To play around some more, you can plot Normal models using the plot_normal(mean, sd) function from the bayesrules package.

Returning to our football analysis, we assume that the hippocampal volumes of our \(n = 25\) players, \((Y_1,Y_2,\ldots,Y_{25})\), are independent and Normally distributed around a mean volume \(\mu\) with standard deviation \(\sigma\):

\[Y_i | \mu \; \sim \; N(\mu, \sigma^2) \;.\]

Notice here that the \(Y_i\) depend on standard deviation \(\sigma\) as well as \(\mu\), our parameter of interest. For now we assume the standard deviation to be fixed and known as \(\sigma\) = 0.593 cubic centimeters, the sample standard deviation in hippocampal volumes that we observed among our 25 players.48

With the Normal model in place, we’re ready to build the joint likelihood function of \(\mu\) and, in turn, assess the compatibility of different possible \(\mu\) with our observed sample data \(\vec{y} = (y_1,y_2,\ldots,y_{n})\) on \(n = 25\) players. This joint likelihood function is the product of the likelihood functions \(L(\mu|y_i) = f(y_i|\mu)\) defined by (5.10). For \(\mu \in (-\infty, \infty)\),

\[L(\mu |\vec{y})= \prod_{i=1}^{n}L(\mu | y_i) = \prod_{i=1}^{n}\frac{1}{\sqrt{2\pi\sigma^2}} \exp\bigg[{-\frac{(y_i-\mu)^2}{2\sigma^2}}\bigg].\]

Simplifying this up to a proportionality constant:

\[L(\mu |\vec{y}) \propto \prod_{i=1}^{n} \exp\bigg[{-\frac{(y_i-\mu)^2}{2\sigma^2}}\bigg] = \exp\bigg[{-\frac{\sum_{i=1}^n (y_i-\mu)^2}{2\sigma^2}}\bigg] \; .\]

Through a bit more rearranging (which we encourage you to verify), we can make this even easier to digest by using the sample mean \(\bar{y}\) and sample size \(n\) to summarize our data values:

\[\begin{equation} L(\mu | \vec{y}) \propto \exp\bigg[{-\frac{(\bar{y}-\mu)^2}{2\sigma^2/n}}\bigg] \;\;\;\; \text{ for } \; \mu \in (-\infty, \infty). \tag{5.12} \end{equation}\]
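
If you'd like to verify this step, the trick is to split each squared deviation around the sample mean, using \(\sum_{i=1}^n (y_i-\mu)^2 = \sum_{i=1}^n (y_i-\bar{y})^2 + n(\bar{y}-\mu)^2\). The first summand doesn't depend on \(\mu\), so it too can be proportional-ed out:

\[L(\mu | \vec{y}) \propto \exp\bigg[{-\frac{\sum_{i=1}^n (y_i-\bar{y})^2 + n(\bar{y}-\mu)^2}{2\sigma^2}}\bigg] \propto \exp\bigg[{-\frac{(\bar{y}-\mu)^2}{2\sigma^2/n}}\bigg] \; .\]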

Recall that among our \(n = 25\) sampled football players, we observed a mean hippocampal volume of \(\overline{y}\) = 5.735 cubic centimeters and are assuming that the standard deviation in volume from player to player is \(\sigma\) = 0.593. The proportional likelihood function of \(\mu\) evaluated at these data values,

\[L(\mu | \vec{y}) \propto \exp\bigg[{-\frac{(5.735-\mu)^2}{2(0.593^2/25)}}\bigg] \;\;\;\; \text{ for } \; \mu \in (-\infty, \infty),\]

is represented in Figure 5.7.

FIGURE 5.7: The joint Normal likelihood function for the mean hippocampal volume.

Based on our sample data alone, it seems unlikely that the mean hippocampal volume across all players, \(\mu\), is less than 5.25 or greater than 6.25 cubic centimeters. Further, the most likely value of \(\mu\) is the mean from our observed sample of 25 players: 5.735 cubic centimeters. This shouldn’t come as a surprise: the likelihood is based entirely on what we observe in our data, so it identifies the sample mean as the most likely value of the population mean \(\mu\).

5.3.2 Normal prior

With the likelihood function in place, let’s formalize a prior model for \(\mu\), the mean hippocampal volume among all football players that have had concussions. The first step is to identify a prior probability model structure. To this end, a Normal prior makes a reasonable choice. Specifically, we’ll assume that \(\mu\) itself is Normally distributed around some mean \(\theta\) with standard deviation \(\tau\),49

\[\mu \sim N(\theta, \tau^2) \; , \]

with prior pdf

\[\begin{equation} f(\mu) = \frac{1}{\sqrt{2\pi\tau^2}} \exp\bigg[{-\frac{(\mu - \theta)^2}{2\tau^2}}\bigg] \;\; \text{ for } \mu \in (-\infty,\infty) \; . \tag{5.13} \end{equation}\]

Not only does a Normal prior match the technical Normal likelihood assumption that \(\mu \in (-\infty,\infty)\), we’ll prove below that this is a conjugate prior. You might anticipate this result from the fact that the joint likelihood function \(L(\mu | \vec{y})\) (5.12) and prior pdf \(f(\mu)\) (5.13) are both proportional to

\[\exp\bigg[{-\frac{(\mu - \blacksquare)^2}{2\blacksquare^2}}\bigg]\]

with different \(\blacksquare\).

Using our understanding of a Normal model, let’s now tune the prior hyperparameters \(\theta\) and \(\tau\) to reflect our prior understanding about \(\mu\). Based on our rigorous Wikipedia research, recall that we would expect \(\mu\) to be between 6 and 7 cubic centimeters. Thus we set the Normal prior mean \(\theta\) to 6.5, the average of this range. Our Normal prior can also account for our uncertainty about \(\mu\). We aren’t very sure about what is going on here, so let’s set the Normal prior standard deviation to \(\tau = 0.5\). In other words, by (5.11), roughly 95% of our prior model for \(\mu\) will be between 5.5 and 7.5 cubic centimeters (\(6.5 \pm 2*0.5\)). This is a little wider than what Wikipedia indicated, because we are neither sure that we trust the Wikipedia article nor confident that the features of the typical adult translate to American football players with a history of concussions. Putting this together, our tuned prior model for \(\mu\) is:

\[\mu \sim N(6.5, 0.5^2) \;.\]
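
Before plotting, a quick qnorm() check confirms that the middle 95% of this prior does indeed span roughly 5.5 to 7.5 cubic centimeters:

# Middle 95% of the N(6.5, 0.5^2) prior for mu
qnorm(c(0.025, 0.975), mean = 6.5, sd = 0.5)
# roughly 5.52 and 7.48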

To verify that it accurately reflects our prior knowledge, we plot our prior model below using plot_normal(). Note: remember that you are specifying the standard deviation and not the variance in plot_normal(). This distinction makes a big difference!

plot_normal(mean = 6.5, sd = 0.5)

FIGURE 5.8: Normal prior model with mean 6.5 and standard deviation 0.5.

5.3.3 Normal-Normal conjugacy

To obtain our posterior model of \(\mu\) we must combine our prior information and our likelihood. Again, we’re quite fortunate – the Normal-Normal is another conjugate family! In other words, the posterior model for \(\mu\) will also be Normal with updated hyperparameters that are informed by the prior and our observed data.

The Normal-Normal Bayesian model

Let \(\mu \in (-\infty,\infty)\) be an unknown mean parameter and \((Y_1,Y_2,\ldots,Y_n)\) be an independent \(N(\mu,\sigma^2)\) sample where \(\sigma\) is assumed to be known. The Normal-Normal Bayesian model complements the Normal structure of the data with a Normal prior on \(\mu\):

\[\begin{split} Y_i | \mu & \stackrel{ind}{\sim} N(\mu, \sigma^2) \\ \mu & \sim N(\theta, \tau^2) \\ \end{split}\]

Upon observing data \(\vec{y} = (y_1,y_2,\ldots,y_n)\), the posterior model of \(\mu\) is also a Normal with updated parameters:

\[\begin{equation} \mu|\vec{y} \; \sim \; N\bigg(\frac{\theta\sigma^2/n + \bar{y}\tau^2}{\tau^2+\sigma^2/n}, \; \frac{\tau^2\sigma^2/n}{\tau^2+\sigma^2/n}\bigg) \; . \tag{5.14} \end{equation}\]

Whooo, that is a mouthful! We provide an optional proof of this result below. But first, let’s apply this result to build a posterior model for \(\mu\), the average hippocampal volume among American football players with a history of concussions. Recall the necessary pieces that we need to plug into (5.14):

  • our Normal prior model of \(\mu\) had mean \(\theta = 6.5\) and standard deviation \(\tau = 0.5\);
  • our sample of \(n = 25\) football players had a sample mean volume \(\bar{y}=5.735\);
  • we assumed a known standard deviation among individual hippocampal volumes of \(\sigma = 0.593\).

It follows that the posterior model of \(\mu\) is:

\[\mu | \vec{y} \; \sim \; N\bigg(\frac{6.5*0.593^2/25 + 5.735*0.5^2}{0.5^2+0.593^2/25}, \; \frac{0.5^2*0.593^2/25}{0.5^2+0.593^2/25}\bigg)\;,\]

or, further simplified,

\[\mu | \vec{y} \; \sim \; N\bigg(5.776, 0.115^2 \bigg) \; .\]
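
If you'd like to double check the arithmetic behind these updated parameters, here's a quick sketch in R using the prior hyperparameters and data summaries above:

# Posterior mean and standard deviation for the Normal-Normal update
theta <- 6.5
tau   <- 0.5
sigma <- 0.593
y_bar <- 5.735
n     <- 25
post_mean <- (theta * sigma^2 / n + y_bar * tau^2) / (tau^2 + sigma^2 / n)
post_var  <- (tau^2 * sigma^2 / n) / (tau^2 + sigma^2 / n)
c(post_mean, sqrt(post_var))
# roughly 5.776 and 0.115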

We can plot and summarize our Normal-Normal analysis of \(\mu\) using plot_normal_normal() and summarize_normal_normal() in the bayesrules package:

plot_normal_normal(mean = 6.5, sd = 0.5, sigma = 0.593, 
  y_bar = 5.735, n = 25)


summarize_normal_normal(mean = 6.5, sd = 0.5, sigma = 0.593, 
  y_bar = 5.735, n = 25)
      model  mean  mode     var
1     prior 6.500 6.500 0.25000
2 posterior 5.776 5.776 0.01332

Our resulting posterior understanding is that the mean hippocampal volume among football players who have had concussions, \(\mu\), is most likely around 5.776 cubic centimeters but might reasonably range from 5.546 to 6.006 (\(5.776 \pm 2*0.115\)). Though the posterior mean is a compromise between the prior mean (\(\theta = 6.5\)) and sample mean volume (\(\bar{y} = 5.735\)), notice that the posterior is heavily weighted towards the data and likelihood. We also see that our posterior has a smaller variance than the prior – thanks to our data, we are much more sure about \(\mu\) than we were before.

5.3.4 Optional: proving Normal-Normal conjugacy

For the sake of completeness, we prove here that the Normal-Normal model produces posterior model (5.14). If derivations are not your thing, that’s totally fine! Feel free to skip ahead to the next section and know that you can use this Normal-Normal model with a light heart. Let’s get right into it. The posterior pdf of \(\mu\) is proportional to the product of the Normal prior pdf (5.13) and the joint likelihood function (5.12). For \(\mu \in (-\infty, \infty)\):

\[f(\mu|\vec{y}) \propto f(\mu)L(\mu|\vec{y}) \propto \exp\bigg[{\frac{-(\mu - \theta)^2}{2\tau^2}}\bigg] \cdot \exp\bigg[{-\frac{(\bar{y}-\mu)^2}{2\sigma^2/n}}\bigg] \; .\]

Next, we can expand the squares in the exponents and sweep under the rug of proportionality both the \(\theta^2\) in the numerator of the first exponent and the \(\bar{y}^2\) in the numerator of the second exponent:

\[\begin{split} f(\mu|\vec{y}) & \propto \exp\Bigg[{\frac{-\mu^2+2\mu\theta-\theta^2}{2\tau^2}}\Bigg]\exp\Bigg[{\frac{-\mu^2+2\mu\bar{y}-\bar{y}^2}{2\sigma^2/n}}\Bigg] \\ & \propto \exp\Bigg[{\frac{-\mu^2+2\mu\theta}{2\tau^2}}\Bigg]\exp\Bigg[{\frac{-\mu^2+2\mu\bar{y}}{2\sigma^2/n}}\Bigg] \\ \end{split}\]

Next, we will give the exponents common denominators and combine them into a single exponent:

\[\begin{split} f(\mu|\vec{y}) & \propto \exp\Bigg[{\frac{(-\mu^2+2\mu\theta)\sigma^2/n}{2\tau^2\sigma^2/n}}\Bigg]\exp\Bigg[{\frac{(-\mu^2+2\mu\bar{y})\tau^2}{2\tau^2\sigma^2/n}}\Bigg] \\ & \propto \exp\Bigg[{\frac{(-\mu^2+2\mu\theta)\sigma^2/n +(-\mu^2+2\mu\bar{y})\tau^2}{2\tau^2\sigma^2/n}}\Bigg] \\ \end{split}\]

Now let’s combine \(\mu\) terms and rearrange so that \(\mu^2\) is by itself:

\[\begin{split} f(\mu|\vec{y}) & \propto \exp\Bigg[{\frac{-\mu^2(\tau^2+\sigma^2/n)+2\mu(\theta\sigma^2/n+ \bar{y}\tau^2) }{2\tau^2\sigma^2/n}}\Bigg] \\ & \propto \exp\Bigg[{\frac{-\mu^2+2\mu\frac{\theta\sigma^2/n+ \bar{y}\tau^2}{\tau^2+\sigma^2/n} }{2(\tau^2\sigma^2/n) /(\tau^2+\sigma^2/n)}}\Bigg] \\ \end{split}\]

This may seem like too much to deal with, but if you look closely, you can see that we can bring back some constants which do not depend upon \(\mu\) to complete the square in the numerator:

\[\begin{split} f(\mu|\vec{y}) & \propto \exp\Bigg[{\frac{-\bigg(\mu - \frac{\theta\sigma^2/n+ \bar{y}\tau^2}{\tau^2+\sigma^2/n}\bigg)^2 }{2(\tau^2\sigma^2/n) /(\tau^2+\sigma^2/n)}}\Bigg] \\ \end{split}\]

This may still seem messy, but once we complete the square, we actually have the kernel of a Normal pdf for \(\mu\), \(\exp\bigg[{-\frac{(\mu - \blacksquare)^2}{2\blacksquare^2}}\bigg]\). By identifying the missing pieces \(\blacksquare\), we can thus conclude that

\[ \mu|\vec{y} \; \sim \; N\left(\frac{\theta\sigma^2/n+ \bar{y}\tau^2}{\tau^2+\sigma^2/n},{\frac{\tau^2\sigma^2/n }{\tau^2+\sigma^2/n}} \right).\]

5.4 Why no simulation in this chapter?

As you may have gathered, we love some simulations! There is something reassuring about the reality check that a simulation can provide. Yet we are at a crossroads. The Gamma-Poisson and Normal-Normal models we’ve studied here are tough to simulate using the techniques we’ve learned thus far. Letting \(\theta\) represent some parameter of interest, recall the steps we’ve used for past simulations:

  1. Simulate, say, 10000 values of \(\theta\) from the prior model.
  2. Simulate a set of sample data \(Y\) from each of the simulated \(\theta\) values.
  3. Keep only those of the 10000 simulated sets of \((\theta,Y)\) for which the simulated \(Y\) data matches the data we actually observed.
  4. Use the remaining \(\theta\) values to approximate the posterior model of \(\theta\).

The issue with extending this simulation technique to the Gamma-Poisson and Normal-Normal examples in this chapter comes with step 3. In both of our examples, we had a sample size greater than one, \((Y_1,Y_2,\ldots, Y_n)\). Further, in the Normal-Normal example, our data values \(Y_i\) are continuous. In both of these scenarios, it’s very likely that no simulated sets of \((\theta,Y)\) will perfectly match our observed sample data. This is true even if we simulate millions of \((\theta,Y)\) pairs. Spoiler alert! There are other ways to approximate the posterior model which we will learn in Unit 2.
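
To make this concrete, here's a sketch of what that simulation strategy would look like for our Gamma-Poisson phone call analysis (the object names and seed below are arbitrary). Even with 10000 simulated pairs, don't expect more than a handful, if any, to survive the matching step:

# Attempt the simulation strategy for the Gamma-Poisson phone call model
set.seed(84735)
phone_sim <- data.frame(lambda = rgamma(10000, shape = 10, rate = 2)) %>%
  mutate(y1 = rpois(10000, lambda), y2 = rpois(10000, lambda),
         y3 = rpois(10000, lambda), y4 = rpois(10000, lambda))

# Keep only the simulations that exactly match the observed data (6, 2, 2, 1)
phone_sim %>%
  filter(y1 == 6, y2 == 2, y3 == 2, y4 == 1) %>%
  nrow()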

5.5 Critiques of conjugate family models

Before we end the chapter, we want to acknowledge that conjugate family models also have drawbacks. In particular:

  • Conjugate family models do not always allow you to have a flat prior. While the Beta prior can be tuned to be flat (when \(\alpha=\beta=1\)), neither the Normal nor the Gamma prior (nor any other proper model with infinite support) can ever be tuned to be totally flat.
  • Sometimes the prior model that is conjugate isn’t flexible enough to fit your prior understanding. For example, a Normal model is always symmetric around the mean \(\mu\) and is unimodal. So if your prior understanding is not symmetric or is not unimodal, then the Normal prior will not be able to reflect this.

5.6 Chapter summary

In Chapter 5, you learned about conjugacy and applied it to a few different situations. Our main takeaways for this chapter are:

  • Using conjugate priors gives us posterior models that are easy to derive and readily interpretable.
  • We can use the Beta-Binomial, Gamma-Poisson, and Normal-Normal conjugate families to derive posterior models.
  • We have learned how to use several functions from the bayesrules package including: plot_poisson_likelihood(), plot_gamma(), plot_gamma_poisson(), summarize_gamma_poisson(), plot_normal(), plot_normal_normal(), and summarize_normal_normal().

We hope that you now appreciate the utility of conjugate priors!



5.7 Exercises

5.7.1 Practice: Gamma-Poisson

Exercise 5.1 (Tuning the Gamma prior) For each situation below, use the given prior information to tune an appropriate Gamma(\(s, r\)) prior model for \(\lambda\).
  1. The most common value of \(\lambda\) is 4, and the mean is 7.
  2. The most common value of \(\lambda\) is 10 and the mean is 12.
  3. The mean of \(\lambda\) is 4 and the variance is 12.
Exercise 5.2 (Tuning the Gamma prior: Redux) For each situation below, use the given prior information to tune an appropriate Gamma(\(s, r\)) prior model for \(\lambda\).
  1. The mean of \(\lambda\) is 22 and the variance is 3.
  2. The most common value of \(\lambda\) is 5, and the variance is 3.
  3. The most common value of \(\lambda\) is 14, and the variance is 6.
Exercise 5.3 (Finding the Poisson likelihood) For each situation below, we observe the outcomes for a random sample of Poisson variables, \(Y_i | \lambda \stackrel{ind}{\sim} \text{Pois}(\lambda)\). Specify the joint likelihood function of \(\lambda\).
  1. \((y_1,y_2,y_3) = (3,7,19)\)
  2. \((y_1,y_2,y_3,y_4) = (12,12,12,0)\)
  3. \(y_1 = 12\)
  4. \((y_1,y_2,y_3,y_4,y_5) = (16,10,17,11,11)\)
Exercise 5.4 (Visualizing the Poisson likelihood) In parts a-d of this exercise, use plot_poisson_likelihood() to visualize the joint likelihood function for each scenario in Exercise 5.3.
Exercise 5.5 (Finding the Gamma-Poisson posterior) Assume a prior model of \(\lambda \sim \text{Gamma}(24, 2)\). In parts a-d of this exercise, specify the posterior models of \(\lambda\) corresponding to the data in each scenario of Exercise 5.3.
Exercise 5.6 (Finding the Gamma-Poisson posterior with a different prior) Assume a prior model of \(\lambda \sim \text{Gamma}(2, 2)\). In parts a-d of this exercise, specify the posterior models of \(\lambda\) corresponding to the data in each scenario of Exercise 5.3.
Exercise 5.7 (Text messages) Let random variable \(\lambda\) represent the rate of text messages people receive in an hour. At first, you believe that the typical number of messages per hour is 5 with a standard deviation of 0.25 messages.
  1. Tune an appropriate Gamma(\(s,r\)) prior model for \(\lambda\).
  2. Plot the prior model using plot_gamma().
  3. What is the prior probability that the rate of text messages per hour is larger than 10? Hint: pgamma().
Exercise 5.8 (Text messages with data) Continuing with the previous exercise, you decide to collect data from six of your friends. They received 7, 3, 8, 9, 10, 12 text messages in the previous hour.
  1. Plot the resulting joint likelihood function of \(\lambda\) using plot_poisson_likelihood().
  2. Use plot_gamma_poisson() to plot the prior, likelihood, and the posterior of \(\lambda\).
  3. Use summarize_gamma_poisson() to calculate descriptive statistics for the prior and the posterior models.
  4. Comment on how your understanding about \(\lambda\) changed from the prior (in the previous exercise) to the posterior based on the data you collected from your friends.

5.7.2 Practice: Normal-Normal

Exercise 5.9 (Finding and visualizing the Normal likelihood) In each situation below, we observe the outcomes for a Normal random sample, \(Y_i | \mu \stackrel{ind}{\sim} N(\mu,\sigma^2)\) with known \(\sigma\). Specify the corresponding joint likelihood function of \(\mu\) and visualize this function using plot_normal_likelihood().
  1. \((y_1,y_2,y_3) = (-4.3,0.7,-19.4)\) and \(\sigma = 10\)
  2. \((y_1,y_2,y_3,y_4) = (-12,1.2,-4.5,0.6)\) and \(\sigma = 6\)
  3. \((y_1,y_2) = (12.4,6.1)\) and \(\sigma = 5\)
  4. \((y_1,y_2,y_3,y_4,y_5) = (1.6,0.09,1.7,1.1,1.1)\) and \(\sigma = 0.6\)
Exercise 5.10 (Investing in stock) You just bought stock in FancyTech. Let random variable \(\mu\) be the average dollar amount that your FancyTech stock goes up or down in a one day period. At first, you believe that \(\mu\) is 7.2 dollars with a standard deviation of 2.6 dollars.
  1. Tune and, using plot_normal(), plot an appropriate Normal prior model for \(\mu\).
  2. According to your plot, does it seem plausible that the FancyTech stock would increase by an average of 7.6 dollars in a day?
  3. Does it seem plausible that the FancyTech stock would increase by an average of 4 dollars in a day?
  4. What is the prior probability that, on average, you lose money? Hint: pnorm()
  5. What is the prior probability that, on average, your stock price goes up by more than 8 dollars per day?
Exercise 5.11 (Investing in stock with data) Continuing with the previous exercise, suppose it’s reasonable to assume that the daily changes in FancyTech stock value are Normally distributed around a mean of \(\mu\) (the unknown daily average change) with a known standard deviation of \(\sigma = 2\) dollars. On a random sample of 4 days, you observe changes in stock value of -0.7, 1.2, 4.5, and -4 dollars.
  1. Plot the corresponding likelihood function of \(\mu\) using plot_normal_likelihood().
  2. Use plot_normal_normal() to plot the prior, likelihood, and the posterior models for \(\mu\).
  3. Use summarize_normal_normal() to calculate descriptive statistics for the prior and the posterior models.
  4. Comment on how your understanding about \(\mu\) evolved from the prior (in the previous exercise) to the posterior based on the observed data.
  5. What is the posterior probability that, on average, you lose money? Hint: pnorm().
  6. What is the posterior probability that, on average, your stock price goes up by more than 8 dollars per day?
Exercise 5.12 (Normal-Normal calculation) Prof. Abebe and Prof. Morales both recently finished their PhDs and are teaching their first statistics classes at Bayesian University. Their colleagues told them that the average final exam score across all students, \(\mu\), varies Normally from year to year with a mean of 80 points and a standard deviation of 4. Further, individual students’ scores \(Y\) vary Normally around \(\mu\) with a known standard deviation of 3 points.
  1. Prof. Abebe conducts the final exam and observes that his 32 students scored an average of 84 points with a standard deviation of 3 points. Calculate the posterior mean and variance of \(\mu\) using the data from Prof. Abebe’s class.
  2. Prof. Morales conducts the final exam and observes that her 32 students scored an average of 82 points with a standard deviation of 2 points. Calculate the posterior mean and variance of \(\mu\) using the data from Prof. Morales’ class.
  3. Next, use Prof. Abebe and Prof. Morales’ combined exams to calculate the posterior mean and variance of \(\mu\).
Exercise 5.13 (Football players without concussions) In this chapter we used the Normal-Normal model to analyze the mean hippocampal volume among American football players who have incurred concussions. In this exercise we will explore the mean volume, \(\mu\), for football players who have not been diagnosed with a concussion. We’ll start with the same prior information as before: the average hippocampal volume \(\mu\) has a prior mean of \(\theta = 6.5\) cubic centimeters and prior standard deviation \(\tau = 0.5\).
  1. Use the football dataset to calculate the sample mean hippocampal volume, sample standard deviation in volume from player to player (which we will use as our \(\sigma\)), and sample size of the football players who have not been diagnosed with a concussion.
  2. Identify the posterior model of \(\mu\).
  3. Use summarize_normal_normal() to verify your answer.
  4. Use plot_normal_normal() to visualize the prior, likelihood, and posterior models of \(\mu\). Describe the influence of the data on the posterior model in one sentence.
Exercise 5.14 (Healthy control brains) Repeat the previous exercise, this time for the average hippocampal volume \(\mu\) for the control group, ie. people that are not football players and have not had concussions.
Exercise 5.15 (Normal-Normal Simulation)
  1. Your friend Alex has read through Chapter 4 of this book, but did not read Chapter 5. Without using jargon, explain to Alex in 4 sentences or fewer why it’s difficult to simulate a Normal-Normal posterior using the simulation methods we have learned thus far.

  2. To prove your point, try (and fail) to simulate the posterior model of \(\mu\) for the following model upon observing a single data point \(Y = 1.1\):

    \[\begin{split} Y | \mu & \sim N(\mu, 1^2) \\ \mu & \sim N(0, 1^2) \\ \end{split}\]

5.7.3 General practice

Exercise 5.16 (Which model?) Below are kernels for Normal, Poisson, Gamma, Beta, and Binomial models. Identify the appropriate model with specific parameter values.
  1. \(f(\theta) \propto 0.3^\theta 0.7^{(16-\theta)}\) for \(\theta\in \{0,1,2,\ldots,16\}\)
  2. \(f(\theta) \propto 1 /\theta!\) for \(\theta\in \{0,1,2,\ldots,\infty\}\)
  3. \(f(\theta) \propto \theta^4 (1-\theta)^7\) for \(\theta \in (0, 1)\)
  4. \(f(\theta) \propto e^{-\theta^2}\) for \(\theta \in (-\infty, \infty)\)
Exercise 5.17 (Which model: Back for more!) Below are kernels for Normal, Poisson, Gamma, Beta, and Binomial models. Identify the appropriate model with specific parameter values.
  1. \(f(\theta) \propto e^{-\theta^2}\theta^{15}\) for \(\theta \in (0, \infty)\)
  2. \(f(\theta) \propto e^{\frac{-(\theta-12)^2}{18}}\) for \(\theta \in (-\infty, \infty)\)
  3. \(f(\theta) \propto 0.3^\theta /\theta!\) for \(\theta \in \{0,1,2,\ldots,\infty\}\)
Exercise 5.18 (Weighted average) In Chapter 4, we expressed the Beta-Binomial posterior mean of proportion \(\pi\) as a weighted average of the prior mean of \(\pi\) and the observed sample proportion. Similarly:
  1. Express the Gamma-Poisson posterior mean of rate \(\lambda\), \(E(\lambda | \vec{y})\), as a weighted average of the prior mean \(E(\lambda)\) and sample mean \(\overline{y}\).
  2. Express the Normal-Normal posterior mean of \(\mu\), \(E(\mu | \vec{y})\), as a weighted average of the prior mean \(E(\mu)\) and sample mean \(\overline{y}\).
Exercise 5.19 (Counting insects) You have been hired as a research assistant to a biology professor. Your job is to count the number of a specific insect in a \(20 m^2\) area. Your prior idea about the density of this insect is well captured by a Gamma model with an expected value of 0.5 insects per \(m^2\) and a variance of 0.0625 insects\(^2\) per \(m^2\). You go out into the field and count 3, 2, 5, 1, 2 insects in the first five square meters respectively and then do not encounter the insect in the rest of the field (the other 15 square meters).
  1. What likelihood model, Normal or Poisson, should you use to model the dependence of your data on the insect density? Explain why.
  2. Plot the prior, likelihood, and posterior model of insect density using the appropriate function, either plot_gamma_poisson() or plot_normal_normal(). Comment on whether the posterior model more closely resembles the data or the prior model. Explain why this makes sense.
  3. What is the posterior mean and variance of the insect density?
  4. Describe your posterior conclusions in context, in a way that your biology professor would find helpful.

Exercise 5.20 (The Beta-Geometric Bayesian model) Consider the following new Bayesian model:

\[\begin{split} Y|\theta & \sim \text{Geometric}(\theta) \\ \theta & \sim \text{Beta}(\alpha, \beta) \\ \end{split}\]

where \(f(y|\theta) = \theta (1 - \theta)^{y - 1}\) for \(y \in \{1,2,\ldots\}\).
  1. Derive the posterior model for \(\theta\) given observed data \(Y = y\). If possible, identify the name of the posterior model and its parameters.
  2. Is the Beta model a conjugate prior for the Geometric likelihood? Explain.

  46. 1:a, 2:b, 3: Gamma(20,20)

  47. https://en.wikipedia.org/wiki/Hippocampus

  48. You’ll have a chance to relax this silly-ish assumption that we know \(\sigma\) but don’t know \(\mu\) in later chapters.

  49. Note that we could use whatever Greek (or other) letters we wish for the prior mean and variance parameters other than \(\mu\) and \(\sigma\), which are already being used, and thus have different meanings, in our model.