Chapter 5 Conjugate families

In the novel Anna Karenina, Tolstoy wrote “Happy families are all alike; every unhappy family is unhappy in its own way.” In this chapter we will learn about conjugate families, which are all alike in the sense that they make the authors very happy. Read on to learn why!

  • Practice building Bayesian models.
    You will build Bayesian models by practicing how to recognize kernels and make use of proportionality.
  • Familiarize yourself with conjugacy.
    You will learn about what makes a prior conjugate and why this is a helpful property. In brief, conjugate priors make it easier to build posterior models. Conjugate priors spark joy!

To get started, load the following packages:

# Load packages
library(bayesrules)
library(tidyverse)

5.1 Revisiting choice of prior

How do we choose a prior? In Chapters 3 and 4 we used the flexibility of the Beta model to reflect our prior understanding of a proportion parameter \(\pi \in [0,1]\). There are other criteria to consider when choosing a prior model:

  • Computational ease
    Especially if we don’t have access to computing power, it is helpful if the posterior model is easy to build.
  • Interpretability
    We’ve seen that posterior models are a compromise between the data and the prior model. A posterior model is interpretable, thus more useful, when you can look at its formulation and identify the contribution of the data relative to that of the prior.

The Beta-Binomial has both of these criteria covered. Its calculation is easy. Once we know the Beta(\(\alpha, \beta\)) prior parameters and the observed data \(Y = y\) for the Bin(\(n, \pi\)) likelihood, the Beta(\(\alpha+y, \beta+n-y\)) posterior model follows. This posterior reflects the influence of the data, through the values of \(y\) and \(n\), relative to the prior hyperparameters \(\alpha\) and \(\beta\). If \(\alpha\) and \(\beta\) are large relative to the sample size \(n\), then the posterior will not budge that much from the prior. However, if the sample size \(n\) is large relative to \(\alpha\) and \(\beta\), then the data will take over and be more influential in the posterior model. In fact, the Beta-Binomial belongs to a larger class of prior-data combinations called conjugate families that allow for both computational ease and interpretable posteriors. Recall a general definition similar to that from Chapter 3.

Conjugate prior

Let the prior model for parameter \(\theta\) have pdf \(f(\theta)\) and the model of data \(Y\) conditioned on \(\theta\) have likelihood function \(L(\theta|y)\). If the resulting posterior model with pdf \(f(\theta|y) \propto f(\theta)L(\theta|y)\) is of the same model family as the prior, then we say this is a conjugate prior.
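
For a concrete (and purely hypothetical) refresher, we can let the bayesrules package do this Beta-Binomial bookkeeping for us. The numbers below are our own, chosen just for illustration: a Beta(2,2) prior updated with \(y = 9\) successes in \(n = 20\) trials.

# A hypothetical Beta(2, 2) prior updated with y = 9 successes in n = 20 trials.
# By the update rule above, the posterior is Beta(2 + 9, 2 + 20 - 9) = Beta(11, 13).
summarize_beta_binomial(alpha = 2, beta = 2, y = 9, n = 20)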

5.1.1 A non-conjugate prior for Binomial data

To emphasize the utility (and fun) of conjugate priors, it can be helpful to consider a non-conjugate prior. Let parameter \(\pi\) be a proportion between 0 and 1 and suppose we plan to collect data \(Y\) where, conditional on \(\pi\), \(Y|\pi \sim \text{Bin}(n,\pi)\). Instead of our conjugate \(\text{Beta}(\alpha,\beta)\) prior for \(\pi\), let’s try out a non-conjugate prior with pdf \(f(\pi)\), plotted in Figure 5.1:

\[\begin{equation} f(\pi)=e-e^\pi\; \text{ for } \pi \in [0,1]. \tag{5.1} \end{equation}\]

Though not a Beta pdf, \(f(\pi)\) is indeed a valid pdf since: \(f(\pi)\) is non-negative on the support of \(\pi\); and the area under the pdf is 1, i.e. \(\int_0^1 f(\pi) \, d\pi = 1\).


FIGURE 5.1: A non-conjugate prior for \(\pi\).

Next, suppose we observe \(Y = 10\) successes from \(n = 50\) independent trials, all having the same probability of success \(\pi\). The resulting Binomial likelihood function of \(\pi\) is:

\[L(\pi | (y=10)) = \left(\!\begin{array}{c} 50 \\ 10 \end{array}\!\right) \pi^{10} (1-\pi)^{40} \; \; \text{ for } \pi \in [0,1] \; .\]

Recall from Chapter 3 that, when we put the prior and the likelihood together, we are already on our path to finding the posterior model:

\[f(\pi | (y = 10)) \propto f(\pi) L(\pi | (y = 10)) = (e-e^\pi) \cdot \binom{50}{10} \pi^{10} (1-\pi)^{40}.\]

As we did in Chapter 3, we will drop all constants that do not depend on \(\pi\) since we are only specifying \(f(\pi|y)\) up to a proportionality constant:

\[f(\pi | y ) \propto (e-e^\pi) \pi^{10} (1-\pi)^{40}.\]

Notice here that our non-Beta prior didn’t produce a neat and clean answer for the exact posterior model (fully specified and not merely up to a proportionality constant). We cannot squeeze this posterior pdf kernel into a Beta box nor any other familiar model for that matter. That is, we cannot rewrite \((e-e^\pi) \pi^{10} (1-\pi)^{40}\) so that it shares the same structure as a Beta kernel, \(\pi^{\blacksquare - 1}(1-\pi)^{\blacksquare-1}\). Instead we will need to integrate this kernel in order to compute the normalizing constant and hence complete the posterior specification:

\[\begin{equation} f(\pi|y=10)= \frac{(e-e^\pi) \pi^{10} (1-\pi)^{40}}{\int_0^1(e-e^\pi) \pi^{10} (1-\pi)^{40}d\pi} \; \; \text{ for } \pi \in [0,1]. \tag{5.2} \end{equation}\]

This is where we really start to feel the pain of not having a conjugate prior! Since this is a particularly unpleasant integral to evaluate, and we’ve been trying to avoid doing any integration altogether, we will leave ourselves with (5.2) as the final posterior. Yikes, what a mess! This is a valid posterior pdf – it’s non-negative and integrates to 1 across \(\pi \in [0,1]\). But it doesn’t have much else going for it. Consider a few characteristics about the posterior model that result from this particular non-conjugate prior (5.1):

  • The calculation for this posterior was messy and unpleasant.
  • It is difficult to derive any intuition about the balance between the prior information and the data we observed from this posterior model.
  • It would be difficult to specify the corresponding posterior mean, mode, and standard deviation, a process which would require even more integration.
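
That said, if we truly needed posterior (5.2), we could at least hand the integration over to the computer. Below is a minimal sketch using base R’s integrate() function; the kernel() and posterior_pdf() objects are our own names, not part of the bayesrules package:

# The posterior kernel from (5.2)
kernel <- function(pi) (exp(1) - exp(pi)) * pi^10 * (1 - pi)^40

# Approximate the normalizing constant by integrating the kernel over [0, 1]
normalizing_constant <- integrate(kernel, lower = 0, upper = 1)$value

# The (numerically) normalized posterior pdf
posterior_pdf <- function(pi) kernel(pi) / normalizing_constant

# Sanity check: the posterior pdf integrates to roughly 1
integrate(posterior_pdf, lower = 0, upper = 1)$value

Numerical integration gives us numbers, but not much insight – the interpretability concerns above remain.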

This leaves us with the question: could we use the conjugate Beta prior and still capture the broader information of the messy non-conjugate prior (5.1)? If so, then we solve the problems of messy calculations and indecipherable posterior models!

Which Beta model would most closely approximate the non-conjugate prior for \(\pi\) (Figure 5.1)?

  1. Beta(3,1)
  2. Beta(1,3)
  3. Beta(2,1)
  4. Beta(1,2)

The answer to this quiz is option 4, the Beta(1,2). One way to find the answer is by comparing the non-conjugate prior in Figure 5.1 with plots of the possible Beta prior models. For example, the non-conjugate prior information is pretty well captured by the Beta(1,2):

plot_beta(alpha = 1, beta = 2)


5.2 Gamma-Poisson conjugate family

Stop callin’ / Stop callin’ / I don’t wanna think anymore / I got my head and my heart on the dance floor
- Lady Gaga featuring Beyoncé. Lyric from Telephone.

In 2019, one of this book’s authors got fed up with the number of fraud risk phone calls they were receiving. They set out with a goal of modeling rate \(\lambda\), the typical number of fraud risk calls received per day. Prior to collecting any data, the author’s guess was that this rate was most likely around 5 calls per day, but could also reasonably range between 2 and 7 calls per day. To learn more, they planned to record the number of fraud risk phone calls on each of \(n\) sampled days, \((Y_1,Y_2,\ldots,Y_n)\).

Before moving forward, it’s important to recognize here that our investigation of \(\lambda\) will not fit into the familiar Beta-Binomial framework. First, \(\lambda\) is not a proportion limited to be between 0 and 1, but rather a rate parameter that can take on any positive value (e.g., we can’t receive -7 calls per day). Further, each data point \(Y_i\) is a count that can technically take on any non-negative integer in \(\{0,1,2,\ldots\}\), thus is not limited by some number of trials \(n\) as is true for a Binomial. Not to fret. Our study of \(\lambda\) will introduce us to a new conjugate family, the Gamma-Poisson.

5.2.1 Poisson likelihood

The spirit of our analysis starts with a prior understanding of \(\lambda\), the daily rate of fraud risk phone calls. Yet before choosing a prior model structure and tuning this to match our prior understanding, we start our Bayesian analysis by modeling the dependence of our daily phone call count data \(Y_i\) on the typical daily rate of such calls \(\lambda\). Upon identifying a reasonable likelihood model for our data, we can identify a prior model which can be tuned to match our prior understanding while also mathematically complementing the likelihood. Keeping in mind that each data point \(Y_i\) is a random count that can go from 0 to a really big number, \(Y_i \in \{0,1,2,\ldots\}\), the Poisson model, described in its general form below, makes a reasonable candidate for its likelihood model.

The Poisson model

Let discrete random variable \(Y\) be the number of independent events that occur in a fixed amount of time or space, where \(\lambda > 0\) is the rate at which these events occur. Then the dependence of \(Y\) on parameter \(\lambda\) can be modeled by the Poisson. In mathematical notation:

\[Y | \lambda \sim \text{Pois}(\lambda) \]

Correspondingly, the Poisson model is specified by pmf

\[\begin{equation} f(y|\lambda) = \frac{\lambda^y e^{-\lambda}}{y!}\;\; \text{ for } y \in \{0,1,2,\ldots\} \tag{5.3} \end{equation}\]

where \(f(y|\lambda)\) sums to one across \(y\), \(\sum_{y=0}^\infty f(y|\lambda) = 1\). Further, a Poisson random variable \(Y\) has equal mean and variance,

\[\begin{equation} E(Y|\lambda) = \text{Var}(Y|\lambda) = \lambda \; . \tag{5.4} \end{equation}\]

Figure 5.2 illustrates the Poisson pmf (5.3) under different rate parameters.

FIGURE 5.2: Poisson pmfs with different rate parameters: Pois(1), Pois(2), and Pois(5).

Let \((Y_1, Y_2, \ldots, Y_n)\) denote the number of fraud risk calls we observed on each of the \(n\) days in our data collection period. Though the observed values of fraud phone calls \(Y_i\) might differ from day to day, we can assume that each can be independently modeled by the Poisson:

\[Y_i | \lambda \stackrel{ind}{\sim} \text{Pois}(\lambda)\]

with unique pmfs

\[f(y_i|\lambda) = \frac{\lambda^{y_i}e^{-\lambda}}{y_i!}\;\; \text{ for } y_i \in \{0,1,2,\ldots\} \; .\]

Yet in weighing the evidence of the phone call data, we won’t want to analyze each call, one by one. Rather, we’ll need to process the collective or joint information in our \(n\) data points. This information is captured by the joint probability mass function.

Joint probability mass function

Let \((Y_1,Y_2,\ldots,Y_n)\) be an independent sample of random variables and \(\vec{y} = (y_1,y_2,\ldots,y_n)\) be the corresponding vector of observed values. Further, let \(f(y_i | \lambda)\) denote the pmf of an individual observed data point \(Y_i = y_i\). Then by the assumption of independence, the following joint pmf specifies the randomness in the collective sample:

\[\begin{equation} f(\vec{y} | \lambda) = \prod_{i=1}^n f(y_i | \lambda) = f(y_1 | \lambda) \cdot f(y_2 | \lambda) \cdot \cdots \cdot f(y_n | \lambda) \; . \tag{5.5} \end{equation}\]

The joint pmf for our fraud risk call sample follows by applying the general definition (5.5) to our Poisson pmfs:

\[f(\vec{y} | \lambda) = \prod_{i=1}^{n}f(y_i | \lambda) = \prod_{i=1}^{n}\frac{\lambda^{y_i}e^{-\lambda}}{y_i!} \;\; \text{ for } \; \lambda > 0 \; .\]

This looks like a mess, but it can be simplified. In this simplification, it’s important to recognize that we have \(n\) unique data points \(y_i\), not \(n\) copies of the same data point \(y\). Thus we need to pay careful attention to the \(i\) subscripts:

\[f(\vec{y} | \lambda) = \frac{\lambda^{y_1}e^{-\lambda}}{y_1!} \cdot \frac{\lambda^{y_2}e^{-\lambda}}{y_2!} \cdots \frac{\lambda^{y_n}e^{-\lambda}}{y_n!} =\frac{\lambda^{\sum y_i}e^{-n\lambda}}{\prod_{i=1}^n y_i!}\]
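
If you’d like a quick sanity check of this simplification, here’s a small sketch with made-up counts (the data values and rate below are our own, chosen purely for illustration):

# Made-up data and rate, just to check the algebra
y <- c(3, 1, 4)
lambda <- 2

# The product of the individual Poisson pmfs...
prod(dpois(y, lambda))

# ...matches the simplified formula
lambda^sum(y) * exp(-length(y) * lambda) / prod(factorial(y))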

Once we observe actual sample data, we can flip this joint pmf on its head to define the likelihood function of \(\lambda\). The Poisson likelihood function is equivalent in formula to the joint pmf \(f(\vec{y}|\lambda)\), yet is a function of \(\lambda\) which helps us assess the compatibility of different possible \(\lambda\) values with our observed collection of sample data \(\vec{y}\):

\[\begin{equation} L(\lambda | \vec{y}) = \frac{\lambda^{\sum y_i}e^{-n\lambda}}{\prod_{i=1}^n y_i!} \propto \lambda^{\sum y_i}e^{-n\lambda} \; . \tag{5.6} \end{equation}\]

It is convenient to represent the likelihood function up to a proportionality constant here, especially since \(\prod y_i!\) will be cumbersome to calculate when \(n\) is large, and what we really care about in the likelihood is \(\lambda\). And when we express the likelihood up to a proportionality constant, note that the sum of the data points (\(\sum y_i\)) and the number of data points (\(n\)) is all the information that is required from the data. We don’t need to know the value of each individual data point \(y_i\). Taking this for a spin with real data points later in our analysis will provide some clarity.

5.2.2 Potential priors

The Poisson likelihood model provides one of two key pieces for our Bayesian analysis of \(\lambda\), the daily rate of fraud risk calls. The other key piece is a prior model for \(\lambda\). Our original guess was that this rate is most likely around 5 calls per day, but could also reasonably range between 2 and 7 calls per day. In order to tune a prior to match these ideas about \(\lambda\), we first have to identify a reasonable probability model structure.

Remember here that \(\lambda\) is a positive and continuous rate, meaning that \(\lambda\) does not have to be a whole number. Accordingly, a reasonable prior probability model will also have continuous and positive support, i.e. be defined on \(\lambda > 0\). There are several named and studied probability models with this property, including the \(F\), Weibull, and Gamma. We don’t dig into all of these in this book. Rather, to make the \(\lambda\) posterior model construction more straightforward and convenient, we’ll focus on identifying a conjugate prior model.

Suppose we have a random sample of Poisson random variables \((Y_1, Y_2, \ldots, Y_n)\) with likelihood function \(L(\lambda | \vec{y})\propto \lambda^{\sum y_i}e^{-n\lambda}\) for \(\lambda > 0\). What do you think would provide a convenient conjugate prior model for \(\lambda\)? Why?

  1. a “Gamma” model with pdf \(f(\lambda) \propto \lambda^{s - 1} e^{-r \lambda}\)
  2. a “Weibull” model with pdf \(f(\lambda) \propto \lambda^{s - 1} e^{-(r \lambda)^s}\)
  3. a special case of the “F” model with pdf \(f(\lambda) \propto \lambda^{\frac{s}{2} - 1}\left( 1 + \lambda\right)^{-s}\)

The answer is option 1, the Gamma model. The Gamma model will provide a conjugate prior for \(\lambda\) when our data has a Poisson model. You might have guessed this from the section title. You might also have guessed this from the shared features of the Poisson likelihood function \(L(\lambda|\vec{y})\) and the Gamma pdf \(f(\lambda)\) – both are proportional to

\[\lambda^{\blacksquare}e^{-\blacksquare\lambda}\]

with differing \(\blacksquare\). In fact, we’ll prove that combining the prior and likelihood produces a posterior pdf with this same structure. That is, the posterior will be of the same Gamma model family as the prior. First, let’s learn more about the Gamma model.

5.2.3 Gamma prior

The Gamma & Exponential models

Let \(\lambda\) be a continuous random variable which can take any positive value, i.e. \(\lambda > 0\). Then the variability in \(\lambda\) might be well modeled by a Gamma model with shape hyperparameter \(s > 0\) and rate hyperparameter \(r > 0\):

\[\lambda \sim \text{Gamma}(s, r) \; .\]

The Gamma model is specified by continuous pdf

\[\begin{equation} f(\lambda) = \frac{r^s}{\Gamma(s)} \lambda^{s-1} e^{-r\lambda} \;\; \text{ for } \lambda > 0 \tag{5.7} \end{equation}\]

Further, the trend and variability in \(\lambda\) are measured by:

\[\begin{equation} \begin{split} E(\lambda) & = \frac{s}{r} \\ \text{Mode}(\lambda) & = \frac{s - 1}{r} \;\; \text{ for } s \ge 1 \\ \text{Var}(\lambda) & = \frac{s}{r^2} \\ \end{split} \tag{5.8} \end{equation}\]

The Exponential model,

\[\lambda \sim \text{Exp}(r)\]

is a special case of the Gamma with shape hyperparameter \(s = 1\), \(\text{Gamma}(1,r)\).

Figure 5.3 illustrates how different shape and rate hyperparameters impact the Gamma pdf (5.7).

FIGURE 5.3: Gamma models with different tuning hyperparameters: Gamma(1,1), Gamma(2,1), Gamma(4,1), Gamma(1,2), Gamma(2,2), and Gamma(4,2). The dashed and solid vertical lines represent the modes and means, respectively.

  1. How would you describe the trend of a Gamma(\(s,r\)) variable \(\lambda\) when \(s > r\) (e.g., Gamma(2,1))?

    1. Right-skewed with a mean greater than 1.
    2. Right-skewed with a mean less than 1.
    3. Symmetric with a mean around 1.
    4. Left-skewed with a mean greater than 1.
  2. Using the same options as above, how would you describe the trend of a Gamma(\(s,r\)) variable \(\lambda\) when \(s < r\) (e.g., Gamma(1,2))?

  3. For which model is there greater variability in the plausible values of \(\lambda\), Gamma(20,20) or Gamma(20,100)?

The answers to this quiz are in the footnote below.43 To confirm these answers, we can support the visual properties of the \(\lambda \sim \text{Gamma}(s,r)\) variables in Figure 5.3 with quantitative measures of trend and variability (5.8). For example, notice that the mean of \(\lambda\), \(E(\lambda) = s/r\), is greater than 1 when \(s > r\) and less than 1 when \(s < r\). Further, the variability in \(\lambda\), \(\text{Var}(\lambda) = s / r^2\), decreases as \(r\) increases.
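
For instance, here is one way to double-check the third quiz question, a sketch using base R’s Gamma functions (which, like the formulas above, are parameterized by shape \(s\) and rate \(r\)):

# By (5.8), Var(lambda) = s / r^2
20 / 20^2    # Gamma(20,20):  0.05
20 / 100^2   # Gamma(20,100): 0.002

# Or, approximately, by simulation
set.seed(84735)
sd(rgamma(100000, shape = 20, rate = 20))    # roughly 0.22
sd(rgamma(100000, shape = 20, rate = 100))   # roughly 0.045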

Now that we have some intuition for how the Gamma(\(s,r\)) model works, we can tune the shape and rate hyperparameters \(s\) and \(r\) to reflect our prior information about the daily rate of fraud risk phone calls \(\lambda\). Before looking at our data, we guessed that \(\lambda\) is about 5, and most likely somewhere between 2 and 7. Our Gamma(\(s,r\)) prior should have similar trends and variability. For example, we want to pick \(s\) and \(r\) for which \(\lambda\) tends to be around 5,

\[E(\lambda) = \frac{s}{r} \approx 5 \; .\]

This can be achieved by setting \(s\) to be 5 times \(r\), \(s = 5r\). Next we want to make sure that most values of our Gamma(\(s, r\)) prior are between 2 and 7. Through some trial and error within these constraints, and plotting various Gamma models using plot_gamma() in the bayesrules package, we find that the Gamma(10,2) features closely match the trend and variability in our prior understanding:

plot_gamma(shape = 10, rate = 2)

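As an extra check on this choice, a quick sketch with base R’s pgamma() function (again parameterized by shape and rate) confirms that the Gamma(10,2) prior places most of its mass on rates between 2 and 7 calls per day:

# Prior probability that lambda falls between 2 and 7 under a Gamma(10, 2)
pgamma(7, shape = 10, rate = 2) - pgamma(2, shape = 10, rate = 2)
# roughly 0.88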

Thus a reasonable prior model for the daily rate of fraud risk phone calls is

\[\lambda \sim \text{Gamma}(10,2)\]

with prior pdf \(f(\lambda)\) following from plugging \(s = 10\) and \(r = 2\) into (5.7).

\[f(\lambda) = \frac{2^{10}}{\Gamma(10)} \lambda^{10-1} e^{-2\lambda} \;\; \text{ for } \lambda > 0\;.\]

5.2.4 Gamma-Poisson conjugacy

As we discussed at the start of this chapter, conjugate families can come in handy. Fortunately for us, using a Gamma prior for a rate parameter \(\lambda\) and a Poisson model for corresponding count data \(Y\) is another example of a conjugate family. This means that, spoiler, the posterior model for \(\lambda\) will also be a Gamma model with updated parameters. We’ll state and prove this in the general setting before applying the results to our phone call situation.

The Gamma-Poisson Bayesian model

Let \(\lambda > 0\) be an unknown rate parameter and \((Y_1,Y_2,\ldots,Y_n)\) be an independent \(\text{Pois}(\lambda)\) sample. The Gamma-Poisson Bayesian model complements the Poisson structure of data \(Y\) with a Gamma prior on \(\lambda\):

\[\begin{split} Y_i | \lambda & \stackrel{ind}{\sim} \text{Pois}(\lambda) \\ \lambda & \sim \text{Gamma}(s, r) \\ \end{split}\]

Upon observing data \(\vec{y} = (y_1,y_2,\ldots,y_n)\), the posterior model of \(\lambda\) is also a Gamma with updated parameters:

\[\begin{equation} \lambda|\vec{y} \; \sim \; \text{Gamma}(s + \sum y_i, \; r + n) \; . \tag{5.9} \end{equation}\]

Let’s prove this result. In general, recall that the posterior pdf of \(\lambda\) is proportional to the product of the prior pdf and likelihood function defined by (5.7) and (5.6), respectively:

\[f(\lambda|\vec{y}) \propto f(\lambda)L(\lambda|\vec{y}) = \frac{r^s}{\Gamma(s)} \lambda^{s-1} e^{-r\lambda} \cdot \frac{\lambda^{\sum y_i}e^{-n\lambda}}{\prod y_i!} \;\;\; \text{ for } \lambda > 0.\]

Next, remember that any non-\(\lambda\) multiplicative constant in the above equation can be “proportional-ed” out. Thus, boiling the prior pdf and likelihood function down to their kernels, we get:

\[\begin{split} f(\lambda|\vec{y}) & \propto \lambda^{s-1} e^{-r\lambda} \cdot \lambda^{\sum y_i}e^{-n\lambda} \\ & = \lambda^{s + \sum y_i - 1} e^{-(r+n)\lambda} \\ \end{split}\]

where the final line follows by combining like terms. What we’re left with here is the kernel of the posterior pdf. This particular kernel corresponds to the pdf of a Gamma model with shape parameter \(s + \sum y_i\) and rate parameter \(r + n\). Thus we’ve proven that

\[ \lambda|\vec{y} \; \sim \; \text{Gamma}\bigg(s + \sum y_i, r + n \bigg) \; .\]

To this end, let’s return to our fraud risk calls. There we have a Gamma(10,2) prior for \(\lambda\), the daily rate of fraud risk calls. Further, on four separate days in the second week of August 2019, we received \(\vec{y} = (y_1,y_2,y_3,y_4) = (6,2,2,1)\) such calls. Thus we have a sample of \(n = 4\) data points with a total of 11 fraud risk calls,

\[\sum_{i=1}^4y_i = 6 + 2 + 2 + 1 = 11\]

and an average of 2.75 phone calls per day,

\[\overline{y} = \frac{\sum_{i=1}^4y_i}{4} = 2.75 \; .\]

Plugging this data into (5.6), the resulting Poisson likelihood function of \(\lambda\) is

\[L(\lambda | \vec{y}) = \frac{\lambda^{11}e^{-4\lambda}}{6!\times2!\times2!\times1!} \propto \lambda^{11}e^{-4\lambda} \; .\]

We visualize a portion of likelihood function \(L(\lambda | \vec{y})\) for \(\lambda\) between 0 and 10 using the plot_poisson_likelihood() function in the bayesrules package where y is the vector of data values, and lambda_upper_bound is the maximum value of \(\lambda\) to view on the x-axis. (Why can’t we visualize the whole likelihood? Because \(\lambda \in (0, \infty)\) and this book would be pretty expensive if we had infinite pages.)

plot_poisson_likelihood(y = c(6, 2, 2, 1), lambda_upper_bound = 10)

The punchline is this. Underlying rates of between one to five fraud risk calls per day are most consistent with our four days of data. Further, values of \(\lambda\) near \(\overline{y} = 2.75\), the observed average number of daily calls in our four day sample, are the most compatible with this data.

Combining what we learned from the data with our prior understanding of \(\lambda\), it follows from (5.9) that the posterior model of \(\lambda\) is a Gamma with an updated shape parameter of 21 (\(s + \sum y_i = 10 + 11\)) and rate parameter of 6 (\(r + n = 2 + 4\)):

\[\lambda|\vec{y} \; \sim \; \text{Gamma}(21, 6) \; .\]
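
Before handing things over to the bayesrules package, it’s worth sketching this update by hand. The object names below are our own:

# The Gamma-Poisson update (5.9), by hand
s <- 10                      # prior shape
r <- 2                       # prior rate
y <- c(6, 2, 2, 1)           # observed fraud risk call counts

shape_post <- s + sum(y)     # 10 + 11 = 21
rate_post  <- r + length(y)  # 2 + 4  = 6

# Posterior mean and variance of lambda
shape_post / rate_post       # 3.5
shape_post / rate_post^2     # roughly 0.583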

We can visualize the prior, scaled likelihood, and posterior models for \(\lambda\) all in a single plot with the plot_gamma_poisson() function in the bayesrules package. How magical! For this function to work, we must specify a few things: the prior shape and rate hyperparameters as well as the information from our data, the observed total number of phone calls sum_y \((\sum y_i)\) and the sample size n:

plot_gamma_poisson(shape = 10, rate = 2, sum_y = 11, n = 4)


Our posterior notion about the daily rate of fraud calls is a compromise between our vague prior and the observed phone call data. Since our prior notion was quite variable in comparison to the strength in our sample data, the posterior model of \(\lambda\) is more in sync with the data. Specifically, utilizing the properties of the Gamma(10,2) prior and Gamma(21,6) posterior, notice that our posterior understanding of the typical daily rate of phone calls dropped from 5 to 3.5 per day:

\[E(\lambda) = \frac{10}{2} = 5 \;\; \text{ and } \;\; E(\lambda | \vec{y}) = \frac{21}{6} = 3.5 \; .\]

Though a compromise between the prior mean and data mean, this posterior mean is closer to the data mean of \(\overline{y} = 2.75\) calls per day.

Hot tip

The posterior mean will always be between the prior mean and the data mean. If your posterior mean falls outside that range, it indicates that you made an error and should retrace some steps.

Further, with the additional information about \(\lambda\) from the data, the variability in our understanding of \(\lambda\) drops more than fourfold, from 2.5 to 0.583 squared calls per day:

\[\text{Var}(\lambda) = \frac{10}{2^2} = 2.5 \;\; \text{ and } \;\; \text{Var}(\lambda | \vec{y}) = \frac{21}{6^2} = 0.583 \; .\]

The convenient summarize_gamma_poisson() function in the bayesrules package, which uses the same arguments as plot_gamma_poisson(), helps us contrast the prior and posterior models and confirms the results above:

summarize_gamma_poisson(shape = 10, rate = 2, sum_y = 11, n = 4)
      model shape rate mean  mode    var
1     prior    10    2  5.0 4.500 2.5000
2 posterior    21    6  3.5 3.333 0.5833

5.3 Normal-Normal conjugate family

We now have two conjugate families in our toolkit: the Beta-Binomial and the Gamma-Poisson. But many more conjugate families exist! It’s impossible to cover them all, but there is a third conjugate family that’s especially helpful to know: the Normal-Normal.

Consider a data story. As scientists learn more about brain health, the dangers of contact sports are gaining greater attention (Bachynski 2019). For example, American football players often sustain repeated concussions. Among all American college football players who have incurred concussions, we are interested in \(\mu\), the average volume (in cubic centimeters) of a specific part of the brain: the hippocampus. Though we don’t have prior information about football players specifically, Wikipedia tells us that among the general population of human adults, both halves of the hippocampus have a volume between 3.0 and 3.5 cubic centimeters.44 Thus the total hippocampal volume of both sides of the brain is between 6 and 7 cubic centimeters. Using this as a starting point, we’ll assume that the mean hippocampal volume among football players, \(\mu\), is also somewhere between 6 and 7 cubic centimeters, with an average of 6.5.

To learn more about \(\mu\), consider some data. Rashmi Singh and colleagues performed a cross-sectional study of hippocampal volumes among 75 subjects (Singh et al. 2014). Their data is provided by the FootballBrain dataset in the Lock5Data package (Lock et al. 2016). We provide a smaller version of this dataset as football in the bayesrules package. For our analysis, we’ll focus on the hippocampal volumes (in cubic centimeters) for the \(n = 25\) study subjects who played collegiate American football and who have been diagnosed with concussions:

# Load the data
data(football)
football <- football %>%
  filter(group == "fb_concuss")

Let \((Y_1,Y_2,\ldots,Y_{25})\) denote the hippocampal volumes for these 25 players. In this sample, the average hippocampal volume is roughly \(\overline{y}\) = 5.735 cubic centimeters with a standard deviation of 0.593. Further, the hippocampal volumes range from roughly 4.5 to 7 cubic centimeters:

football %>%
  summarize(mean(volume), sd(volume))
  mean(volume) sd(volume)
1        5.735     0.5934

ggplot(football, aes(x = volume)) + 
  geom_density()


5.3.1 Normal likelihood

The first step in our analysis of \(\mu\), the average hippocampal volume of all American college football players with a history of concussions, is to model the dependence of each sampled player’s volume \(Y_i\) upon \(\mu\). Since hippocampal volume \(Y_i\) is measured on a continuous scale, there are many possible common models for describing the variability in \(Y_i\) from player to player: Beta, Exponential, Gamma, Normal, Chi-square, F, etc. We can immediately toss out the Beta model – it assumes that \(Y_i \in [0,1]\) whereas we observed above that our players’ hippocampal volumes range from roughly 4.5 to 7 cubic centimeters. Among the remaining options, the Normal model is quite reasonable. Consistent with the observed data in the density plot above, the Normal model assumes that the players’ hippocampal volumes are symmetrically distributed around some global average volume, “\(\mu\).”

As you’ll see over and over throughout this book, reasonable doesn’t mean perfect. Though the Normal model does a great job of capturing the main features of our sample data \((Y_1,Y_2,\ldots,Y_{25})\), it technically assumes that each player’s hippocampal volume can range from \(-\infty\) to \(\infty\). Since we can tune the Normal model to put negligible weight on unreasonable values of hippocampal volume, we’re not too worried about this incorrect assumption here.

The Normal model

Let \(Y\) be a random variable which can take any value between \(-\infty\) and \(\infty\), i.e. \(Y \in (-\infty,\infty)\). Then the variability in \(Y\) might be well represented by a Normal model with mean parameter \(\mu \in (-\infty, \infty)\) and standard deviation parameter \(\sigma > 0\):

\[Y \sim N(\mu, \sigma^2)\]

The Normal model is specified by continuous pdf

\[\begin{equation} f(y) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\bigg[{-\frac{(y-\mu)^2}{2\sigma^2}}\bigg] \;\; \text{ for } y \in (-\infty,\infty) \tag{5.10} \end{equation}\]

and has the following features:

\[\begin{split} E(Y) & = \text{ Mode}(Y) = \mu \\ \text{Var}(Y) & = \sigma^2 \\ \end{split}\]

Further, \(\sigma\) provides a sense of scale for \(Y\). Roughly 95% of \(Y\) values will be within 2 standard deviations of \(\mu\):

\[\begin{equation} \mu \pm 2\sigma \; . \tag{5.11} \end{equation}\]

FIGURE 5.4: Normal models with varying means and variances: N(2, 0.25), N(2, 1), and N(4, 4).

The formula behind the Normal pdfs in Figure 5.4 is specified by (5.10). We can see the impact that both the mean parameter, \(\mu\), and the standard deviation parameter, \(\sigma\), have on the Normal model of \(Y\). As \(\mu\) gets larger, the model shifts to the right. As \(\sigma\) gets larger, the model becomes more spread out. To play around some more, you can plot Normal models using the plot_normal() function from the bayesrules package.
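
For example, here’s one way to reproduce the right-most N(4,4) model in Figure 5.4. Since N(4,4) has variance 4, we hand plot_normal() a standard deviation of 2:

# N(4, 4) has mean 4 and variance 4, hence standard deviation 2
plot_normal(mean = 4, sd = 2)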

Returning to our football analysis, we assume that the hippocampal volumes of our \(n = 25\) players, \((Y_1,Y_2,\ldots,Y_{25})\), are independent and Normally distributed around a mean volume \(\mu\) with standard deviation \(\sigma\):

\[Y_i | \mu \; \sim \; N(\mu, \sigma^2) \;.\]

Notice here that the \(Y_i\) depend on standard deviation \(\sigma\) as well as \(\mu\), our parameter of interest. For now we assume the standard deviation to be fixed and known as \(\sigma = 0.5\) cubic centimeters, a figure gathered from the fact that hippocampal volumes between 6 and 7 cubic centimeters are common.45

With the Normal model in place, we’re ready to build the joint likelihood function of \(\mu\) and, in turn, assess the compatibility of different possible \(\mu\) with our observed sample data \(\vec{y} = (y_1,y_2,\ldots,y_{n})\) on \(n = 25\) players. This joint likelihood function is the product of the likelihood functions \(L(\mu|y_i) = f(y_i|\mu)\) defined by (5.10). For \(\mu \in (-\infty, \infty)\),

\[L(\mu |\vec{y})= \prod_{i=1}^{n}L(\mu | y_i) = \prod_{i=1}^{n}\frac{1}{\sqrt{2\pi\sigma^2}} \exp\bigg[{-\frac{(y_i-\mu)^2}{2\sigma^2}}\bigg].\]

Simplifying this up to a proportionality constant:

\[L(\mu |\vec{y}) \propto \prod_{i=1}^{n} \exp\bigg[{-\frac{(y_i-\mu)^2}{2\sigma^2}}\bigg] = \exp\bigg[{-\frac{\sum_{i=1}^n (y_i-\mu)^2}{2\sigma^2}}\bigg] \; .\]

Through a bit more rearranging (which we encourage you to verify), we can make this even easier to digest by using the sample mean \(\bar{y}\) and sample size \(n\) to summarize our data values:

\[\begin{equation} L(\mu | \vec{y}) \propto \exp\bigg[{-\frac{(\bar{y}-\mu)^2}{2\sigma^2/n}}\bigg] \;\;\;\; \text{ for } \; \mu \in (-\infty, \infty). \tag{5.12} \end{equation}\]
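
If you’d like to verify that rearrangement, the key step is to split each squared term around the sample mean \(\bar{y}\):

\[\sum_{i=1}^n (y_i-\mu)^2 = \sum_{i=1}^n \left[(y_i-\bar{y}) + (\bar{y}-\mu)\right]^2 = \sum_{i=1}^n (y_i-\bar{y})^2 + n(\bar{y}-\mu)^2 \; ,\]

where the cross term \(2(\bar{y}-\mu)\sum_{i=1}^n(y_i-\bar{y})\) vanishes since \(\sum_{i=1}^n(y_i-\bar{y}) = 0\). The first sum doesn’t depend on \(\mu\), so it can be dropped as a proportionality constant, leaving us with

\[L(\mu | \vec{y}) \propto \exp\bigg[{-\frac{n(\bar{y}-\mu)^2}{2\sigma^2}}\bigg] = \exp\bigg[{-\frac{(\bar{y}-\mu)^2}{2\sigma^2/n}}\bigg] \; .\]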

Recall that among our \(n = 25\) sampled football players, we observed a mean hippocampal volume of \(\overline{y}\) = 5.735 cubic centimeters and are assuming that the standard deviation in volume from player to player is \(\sigma = 0.5\). The proportional likelihood function of \(\mu\) evaluated at these data values,

\[L(\mu | \vec{y}) \propto \exp\bigg[{-\frac{(5.735-\mu)^2}{2(0.5^2/25)}}\bigg] \;\;\;\; \text{ for } \; \mu \in (-\infty, \infty),\]

is represented in Figure 5.5.


FIGURE 5.5: The joint Normal likelihood function for mean hippocampal volume.


Based on our sample data alone, it seems unlikely that the mean hippocampal volume across all players \(\mu\) is less than 5.3 or greater than 6.1 cubic centimeters. Further, the most likely value for \(\mu\) is the mean from our observed sample of 25 players: 5.735 cubic centimeters. This shouldn’t come as a surprise: the likelihood is based entirely on what we observe in our data, so it identifies the sample mean as the most likely value of the population mean \(\mu\).

5.3.2 Normal prior

With the likelihood in place, let’s formalize a prior model for \(\mu\), the mean hippocampal volume among all football players that have had concussions. To this end, a Normal prior makes a reasonable choice. Specifically, we’ll assume that \(\mu\) itself is Normally distributed around some mean \(\theta\) with standard deviation \(\tau\),46

\[\mu \sim N(\theta, \tau^2) \; , \]

with prior pdf

\[\begin{equation} f(\mu) = \frac{1}{\sqrt{2\pi\tau^2}} \exp\bigg[{-\frac{(\mu - \theta)^2}{2\tau^2}}\bigg] \;\; \text{ for } \mu \in (-\infty,\infty) \; . \tag{5.13} \end{equation}\]

Not only does a Normal prior match the technical Normal likelihood assumption that \(\mu \in (-\infty,\infty)\), we’ll prove below that this is a conjugate prior. You might anticipate this result from the fact that the joint likelihood function \(L(\mu | \vec{y})\) (5.12) and prior pdf \(f(\mu)\) (5.13) are both proportional to

\[\exp\bigg[{-\frac{(\mu - \blacksquare)^2}{2\blacksquare^2}}\bigg]\]

with different \(\blacksquare\).

Using our understanding of a Normal model, let’s now tune the prior hyperparameters \(\theta\) and \(\tau\) to reflect our prior understanding about \(\mu\). Based on our rigorous Wikipedia research, recall that we would expect \(\mu\) to be between 6 and 7 cubic centimeters. Thus we set the Normal prior mean \(\theta\) to the average of 6.5. Our Normal prior can also account for our uncertainty about \(\mu\). We aren’t very sure about what is going on here, so let’s set the Normal prior standard deviation to \(\tau = 0.4\). In other words, by (5.11), roughly 95% of our prior model for \(\mu\) will be between 5.7 and 7.3 cubic centimeters (\(6.5 \pm 2*0.4\)). This is wider than what Wikipedia indicated, because we are neither sure about the Wikipedia sources nor confident that the features of the typical adult translate to American football players with a history of concussions. Putting this together, our tuned prior model for \(\mu\) is:

\[\mu \sim N(6.5, 0.4^2) \;.\]
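
As a quick numerical check on this tuning (a sketch using base R’s pnorm(), which takes the mean and standard deviation), we can see how much prior probability the \(N(6.5, 0.4^2)\) model places on the 6 to 7 cubic centimeter range suggested by Wikipedia:

# Prior probability that mu falls between 6 and 7 cubic centimeters
pnorm(7, mean = 6.5, sd = 0.4) - pnorm(6, mean = 6.5, sd = 0.4)
# roughly 0.79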

To verify that it accurately reflects our prior knowledge, we plot our prior model below using plot_normal(). Note: remember that you are specifying the standard deviation and not the variance in plot_normal(). This distinction makes a big difference!

plot_normal(mean = 6.5, sd = 0.4)


5.3.3 Normal-Normal conjugacy

To obtain our posterior model of \(\mu\) we must combine our prior information and our likelihood. Again, we’re quite fortunate – the Normal-Normal is another conjugate family! Thus the posterior model for \(\mu\) will also be Normal with updated hyperparameters that are informed by the prior and observed data.

The Normal-Normal Bayesian model

Let \(\mu \in (-\infty,\infty)\) be an unknown mean parameter and \((Y_1,Y_2,\ldots,Y_n)\) be an independent \(N(\mu,\sigma^2)\) sample where \(\sigma\) is assumed to be known. The Normal-Normal Bayesian model complements the Normal data structure with a Normal prior on \(\mu\):

\[\begin{split} Y_i | \mu & \stackrel{ind}{\sim} N(\mu, \sigma^2) \\ \mu & \sim N(\theta, \tau^2) \\ \end{split}\]

Upon observing data \(\vec{y} = (y_1,y_2,\ldots,y_n)\) with mean \(\overline{y}\), the posterior model of \(\mu\) is also a Normal with updated parameters:

\[\begin{equation} \mu|\vec{y} \; \sim \; N\bigg(\frac{\theta\sigma^2/n + \bar{y}\tau^2}{\tau^2+\sigma^2/n}, \; \frac{\tau^2\sigma^2/n}{\tau^2+\sigma^2/n}\bigg) \; . \tag{5.14} \end{equation}\]

Whooo, that is a mouthful! We provide an optional proof of this result below. But first, let’s apply this result to build a posterior model for \(\mu\), the average hippocampal volume among American football players with a history of concussions. Recall the necessary pieces that we need to plug into (5.14):

  • our Normal prior model of \(\mu\) had mean \(\theta = 6.5\) and standard deviation \(\tau = 0.4\);
  • our sample of \(n = 25\) football players had a sample mean volume \(\bar{y}=5.735\);
  • we assumed a known standard deviation among individual hippocampal volumes of \(\sigma = 0.5\).

It follows that the posterior model of \(\mu\) is:

\[\mu | \vec{y} \; \sim \; N\bigg(\frac{6.5*0.5^2/25 + 5.735*0.4^2}{0.4^2+0.5^2/25}, \; \frac{0.4^2*0.5^2/25}{0.4^2+0.5^2/25}\bigg)\;,\]

or, further simplified,

\[\mu | \vec{y} \; \sim \; N\bigg(5.78, 0.097^2 \bigg) \; .\]
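
As a sketch, here’s that arithmetic carried out in R with object names of our own; the plot_normal_normal() and summarize_normal_normal() functions below will confirm it:

# The Normal-Normal update (5.14), by hand
theta <- 6.5     # prior mean
tau   <- 0.4     # prior standard deviation
sigma <- 0.5     # assumed known data standard deviation
n     <- 25      # sample size
y_bar <- 5.735   # observed sample mean volume

post_mean <- (theta * sigma^2 / n + y_bar * tau^2) / (tau^2 + sigma^2 / n)
post_var  <- (tau^2 * sigma^2 / n) / (tau^2 + sigma^2 / n)

c(post_mean, post_var, sqrt(post_var))
# roughly 5.78, 0.0094, and 0.097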

We can plot and summarize our Normal-Normal analysis of \(\mu\) using plot_normal_normal() and summarize_normal_normal() in the bayesrules package:

plot_normal_normal(mean = 6.5, sd = 0.4, sigma = 0.5, 
  y_bar = 5.735, n = 25)

summarize_normal_normal(mean = 6.5, sd = 0.4, sigma = 0.5, 
  y_bar = 5.735, n = 25)


      model mean mode      var
1     prior 6.50 6.50 0.160000
2 posterior 5.78 5.78 0.009412

Our resulting posterior understanding is that the mean hippocampal volume among football players that have had concussions, \(\mu\), is most likely around 5.78 cubic centimeters but might reasonably range from 5.586 to 5.974 (\(5.78 \pm 2*0.097\)). Though the posterior mean is a compromise between the prior mean (\(\theta = 6.5\)) and sample mean volume (\(\bar{y} = 5.735\)), notice that the posterior is heavily weighted towards the data and likelihood. We also see that our posterior has a smaller variance than the prior – thanks to our data, we are much more sure about \(\mu\) than we were before.

5.3.4 Optional: proving Normal-Normal conjugacy

For completeness’ sake, we prove here that the Normal-Normal model produces posterior model (5.14). If derivations are not your thing, that’s totally fine! Feel free to skip ahead to the next section and know that you can use this Normal-Normal model with a light heart. Let’s get right into it. The posterior pdf of \(\mu\) is proportional to the product of the Normal prior pdf (5.13) and the joint likelihood function (5.12). For \(\mu \in (-\infty, \infty)\):

\[f(\mu|\vec{y}) \propto f(\mu)L(\mu|\vec{y}) \propto \exp\bigg[{\frac{-(\mu - \theta)^2}{2\tau^2}}\bigg] \cdot \exp\bigg[{-\frac{(\bar{y}-\mu)^2}{2\sigma^2/n}}\bigg] \; .\]

Next, we can expand the squares in the exponents and sweep under the rug of proportionality both the \(\theta^2\) in the numerator of the first exponent and the \(\bar{y}^2\) in the numerator of the second exponent:

\[\begin{split} f(\mu|\vec{y}) & \propto \exp\Bigg[{\frac{-\mu^2+2\mu\theta-\theta^2}{2\tau^2}}\Bigg]\exp\Bigg[{\frac{-\mu^2+2\mu\bar{y}-\bar{y}^2}{2\sigma^2/n}}\Bigg] \\ & \propto \exp\Bigg[{\frac{-\mu^2+2\mu\theta}{2\tau^2}}\Bigg]\exp\Bigg[{\frac{-\mu^2+2\mu\bar{y}}{2\sigma^2/n}}\Bigg] \\ \end{split}\]

Next, we will give the exponents common denominators and combine them into a single exponent:

\[\begin{split} f(\mu|\vec{y}) & \propto \exp\Bigg[{\frac{(-\mu^2+2\mu\theta)\sigma^2/n}{2\tau^2\sigma^2/n}}\Bigg]\exp\Bigg[{\frac{(-\mu^2+2\mu\bar{y})\tau^2}{2\tau^2\sigma^2/n}}\Bigg] \\ & \propto \exp\Bigg[{\frac{(-\mu^2+2\mu\theta)\sigma^2/n +(-\mu^2+2\mu\bar{y})\tau^2}{2\tau^2\sigma^2/n}}\Bigg] \\ \end{split}\]

Now let’s combine \(\mu\) terms and rearrange so that \(\mu^2\) is by itself:

\[\begin{split} f(\mu|\vec{y}) & \propto \exp\Bigg[{\frac{-\mu^2(\tau^2+\sigma^2/n)+2\mu(\theta\sigma^2/n+ \bar{y}\tau^2) }{2\tau^2\sigma^2/n}}\Bigg] \\ & \propto \exp\Bigg[{\frac{-\mu^2+2\mu\frac{\theta\sigma^2/n+ \bar{y}\tau^2}{\tau^2+\sigma^2/n} }{2(\tau^2\sigma^2/n) /(\tau^2+\sigma^2/n)}}\Bigg] \\ \end{split}\]

This may seem like too much to deal with, but if you look closely, you can see that we can bring back some constants which do not depend upon \(\mu\) to complete the square in the numerator:

\[\begin{split} f(\mu|\vec{y}) & \propto \exp\Bigg[{\frac{-\bigg(\mu - \frac{\theta\sigma^2/n+ \bar{y}\tau^2}{\tau^2+\sigma^2/n}\bigg)^2 }{2(\tau^2\sigma^2/n) /(\tau^2+\sigma^2/n)}}\Bigg] \\ \end{split}\]

This may still seem messy, but once we complete the square, we actually have the kernel of a Normal pdf for \(\mu\), \(\exp\bigg[{-\frac{(\mu - \blacksquare)^2}{2\blacksquare^2}}\bigg]\). By identifying the missing pieces \(\blacksquare\), we can thus conclude that

\[ \mu|\vec{y} \; \sim \; N\left(\frac{\theta\sigma^2/n+ \bar{y}\tau^2}{\tau^2+\sigma^2/n},{\frac{\tau^2\sigma^2/n }{\tau^2+\sigma^2/n}} \right).\]

5.4 Why no simulation in this chapter?

As you may have gathered, we love some simulations! There is something reassuring about the reality check that a simulation can provide. Yet we are at a crossroads. The Gamma-Poisson and Normal-Normal models we’ve studied here are tough to simulate using the techniques we’ve learned thus far. Letting \(\theta\) represent some parameter of interest, recall the steps we’ve used for past simulations:

  1. Simulate, say, 10000 values of \(\theta\) from the prior model.
  2. Simulate a set of sample data \(Y\) from each simulated \(\theta\) value.
  3. Filter out only those of the 10000 simulated sets of \((\theta,Y)\) for which the simulated \(Y\) data matches the data we actually observed.
  4. Use the remaining \(\theta\) values to approximate the posterior of \(\theta\).

The issue with extending this simulation technique to the Gamma-Poisson and Normal-Normal examples in this chapter comes with step 3. In both of our examples, we had a sample size greater than one, \((Y_1,Y_2,\ldots, Y_n)\). Further, in the Normal-Normal example, our data values \(Y_i\) are continuous. In both of these scenarios, it’s very likely that no simulated sets of \((\theta,Y)\) will perfectly match our observed sample data, even if we carry out millions of simulations. Spoiler alert! There are other ways to approximate the posterior model which we will learn in Unit 2.
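
To see the problem concretely, here’s a minimal sketch in the spirit of those four steps, simplified to a single Normal data point \(Y\) per simulated \(\mu\) (the details are ours, purely for illustration):

# Steps 1 and 2: simulate mu from the prior and one data point Y per mu
set.seed(84735)
mu_sim <- rnorm(10000, mean = 6.5, sd = 0.4)
y_sim  <- rnorm(10000, mean = mu_sim, sd = 0.5)

# Step 3: how many simulated Y values exactly match an observed volume of 5.735?
sum(y_sim == 5.735)
# almost certainly 0: with continuous data, exact matches essentially never occur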

5.5 Critiques of conjugate family models

Before we end the chapter, we want to acknowledge that conjugate family models also have drawbacks. In particular:

  • Conjugate family models do not always allow you to have a flat prior. While the Beta prior can be tuned to be flat (when \(\alpha=\beta=1\)), neither the Normal nor the Gamma prior (nor any proper model with infinite support) can ever be tuned to be totally flat.
  • Sometimes the prior model that is conjugate isn’t flexible enough to fit your prior understanding. For example, a Normal model is always symmetric around the mean \(\mu\) and is unimodal. So if your prior understanding is not symmetric or is not unimodal, then the Normal prior will not be able to reflect this.

5.6 Chapter summary

In Chapter 5, you learned about conjugacy and applied it to a few different situations. Our main takeaways for this chapter are:

  • Using conjugate priors gives us posterior models that are easy to derive and readily interpretable.
  • We can use the Beta-Binomial, Gamma-Poisson, and Normal-Normal conjugate families to derive posterior models.
  • We can use several functions from the bayesrules package to explore these conjugate families: plot_poisson_likelihood(), plot_gamma(), plot_gamma_poisson(), summarize_gamma_poisson(), plot_normal(), plot_normal_normal(), and summarize_normal_normal().

We hope that you now appreciate the utility of conjugate priors!

5.7 Exercises

5.7.1 Practice: Gamma-Poisson

Exercise 5.1 (Tuning a Gamma prior) For each situation below, tune an appropriate Gamma(\(s, r\)) prior model for \(\lambda\).
  1. The most common value of \(\lambda\) is 4, and the mean is 7.
  2. The most common value of \(\lambda\) is 10 and the mean is 12.
  3. The most common value of \(\lambda\) is 5, and the variance is 3.
  4. The most common value of \(\lambda\) is 14, and the variance is 6.
  5. The mean of \(\lambda\) is 4 and the variance is 12.
  6. The mean of \(\lambda\) is 22 and the variance is 3.
Exercise 5.2 (Finding the Poisson likelihood) For each situation below, we observe the outcomes for a random sample of Poisson variables, \(Y_i | \lambda \stackrel{ind}{\sim} \text{Pois}(\lambda)\). Specify and plot the joint likelihood function of \(\lambda\).
  1. \((y_1,y_2,y_3) = (3,7,19)\)
  2. \((y_1,y_2,y_3,y_4) = (12,12,12,0)\)
  3. \(y_1 = 12\)
  4. \((y_1,y_2,y_3,y_4,y_5) = (16,10,17,11,11)\)
Exercise 5.3 (Finding the Gamma-Poisson posterior) Assume a prior model of \(\lambda \sim \text{Gamma}(24, 2)\). Specify the posterior models of \(\lambda\) corresponding to the data in each scenario of Exercise 5.2.
Exercise 5.4 (Finding the Gamma-Poisson posterior with a different prior) Assume a prior model of \(\lambda \sim \text{Gamma}(2, 2)\). Specify the posterior models of \(\lambda\) corresponding to the data in each scenario of Exercise 5.2.
Exercise 5.5 (Text messages) Let random variable \(\lambda\) represent the rate of text messages people receive in an hour. At first, you believe that the typical number of messages per hour is 5 with a standard deviation of 0.25 messages.
  1. Tune and plot an appropriate Gamma(\(s,r\)) prior model for \(\lambda\).
  2. What is the prior probability that the rate of text messages per hour is larger than 10? Hint: learn about pgamma().
Exercise 5.6 (Text messages with data) Continuing with Exercise 5.5, you collect data from six friends. They received 7, 3, 8, 9, 10, 12 text messages in the previous hour.
  1. Plot the resulting joint likelihood function of \(\lambda\).
  2. Plot the prior, likelihood, and the posterior of \(\lambda\).
  3. Use summarize_gamma_poisson() to calculate descriptive statistics for the prior and the posterior models.
  4. Comment on how your understanding about \(\lambda\) changed from the prior (in the previous exercise) to the posterior based on the data you collected from your friends.

5.7.2 Practice: Normal-Normal

Exercise 5.7 (Finding and visualizing the Normal likelihood) In each situation below, we observe the outcomes for a Normal random sample, \(Y_i | \mu \stackrel{ind}{\sim} N(\mu,\sigma^2)\) with known \(\sigma\). Specify and plot the corresponding joint likelihood function of \(\mu\).
  1. \((y_1,y_2,y_3) = (-4.3,0.7,-19.4)\) and \(\sigma = 10\)
  2. \((y_1,y_2,y_3,y_4) = (-12,1.2,-4.5,0.6)\) and \(\sigma = 6\)
  3. \((y_1,y_2) = (12.4,6.1)\) and \(\sigma = 5\)
  4. \((y_1,y_2,y_3,y_4,y_5) = (1.6,0.09,1.7,1.1,1.1)\) and \(\sigma = 0.6\)
Exercise 5.8 (Investing in stock) You just bought stock in FancyTech. Let random variable \(\mu\) be the average dollar amount that your FancyTech stock goes up or down in a one day period. At first, you believe that \(\mu\) is 7.2 dollars with a standard deviation of 2.6 dollars.
  1. Tune and plot an appropriate Normal prior model for \(\mu\).
  2. According to your plot, does it seem plausible that the FancyTech stock would increase by an average of 7.6 dollars in a day?
  3. Does it seem plausible that the FancyTech stock would increase by an average of 4 dollars in a day?
  4. What is the prior probability that, on average, you lose money? Hint: pnorm()
  5. What is the prior probability that, on average, your stock price goes up by more than 8 dollars per day?
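One way you might translate the pnorm() hints into code, using the N(7.2, 2.6\(^2\)) prior stated above:

# Part 4: prior probability of losing money on average, P(mu < 0)
pnorm(0, mean = 7.2, sd = 2.6)
# Part 5: prior probability of gaining more than 8 dollars per day on average, P(mu > 8)
pnorm(8, mean = 7.2, sd = 2.6, lower.tail = FALSE)
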
Exercise 5.9 (Investing in stock with data) Continuing with Exercise 5.8, it’s reasonable to assume that the daily changes in FancyTech stock value are Normally distributed around an unknown mean \(\mu\) with a known standard deviation of \(\sigma = 2\) dollars. On a random sample of 4 days, you observe changes in stock value of -0.7, 1.2, 4.5, and -4 dollars.
  1. Plot the corresponding likelihood function of \(\mu\).
  2. Plot the prior, likelihood, and the posterior models for \(\mu\).
  3. Use summarize_normal_normal() to calculate descriptive statistics for the prior and the posterior models.
  4. Comment on how your understanding about \(\mu\) evolved from the prior (in the previous exercise) to the posterior based on the observed data.
  5. What is the posterior probability that, on average, you lose money? Hint: pnorm().
  6. What is the posterior probability that, on average, your stock price goes up by more than 8 dollars per day?
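A possible summarize_normal_normal() sketch for part 3, where y_bar is the mean of the four observed changes in stock value:

# y_bar = (-0.7 + 1.2 + 4.5 - 4) / 4 = 0.25
summarize_normal_normal(mean = 7.2, sd = 2.6, sigma = 2, y_bar = 0.25, n = 4)
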
Exercise 5.10 (Normal-Normal calculation) Prof. Abebe and Prof. Morales both recently finished their PhDs and are teaching their first statistics classes at Bayesian University. Their colleagues told them that the average final exam score across all students, \(\mu\), varies Normally from year to year with a mean of 80 points and a standard deviation of 4 points. Further, individual students’ scores \(Y\) vary Normally around \(\mu\) with a known standard deviation of 3 points.
  1. Prof. Abebe conducts the final exam and observes that his 32 students scored an average of 84 points with a standard deviation of 3 points. Calculate the posterior mean and variance of \(\mu\) using the data from Prof. Abebe’s class.
  2. Prof. Morales conducts the final exam and observes that her 32 students scored an average of 82 points with a standard deviation of 2 points. Calculate the posterior mean and variance of \(\mu\) using the data from Prof. Morales’ class.
  3. Next, use Prof. Abebe and Prof. Morales’ combined exams to calculate the posterior mean and variance of \(\mu\).
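As a reminder from earlier in this chapter, for a \(N(\theta, \tau^2)\) prior on \(\mu\) and data \(Y_i | \mu \stackrel{ind}{\sim} N(\mu, \sigma^2)\) with sample mean \(\overline{y}\) and sample size \(n\), the Normal-Normal posterior is

\[\mu | \vec{y} \; \sim \; N\!\left(\theta\frac{\sigma^2}{n\tau^2 + \sigma^2} + \overline{y}\frac{n\tau^2}{n\tau^2 + \sigma^2}, \;\; \frac{\tau^2\sigma^2}{n\tau^2 + \sigma^2}\right).\]
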
Exercise 5.11 (Football players without concussions) In this chapter we used the Normal-Normal model to analyze the mean hippocampal volume among American football players who have incurred concussions. In this exercise we will explore the mean volume, \(\mu\), for football players who have not been diagnosed with a concussion. We’ll start with the same prior information as before: the average hippocampal volume \(\mu\) has a prior mean of \(\theta = 6.5\) cubic centimeters and prior standard deviation \(\tau = 0.5\).
  1. Use the football data to calculate the sample mean hippocampal volume, the sample standard deviation in volume from player to player (which we will use as our known \(\sigma\)), and the sample size of the football players who have not been diagnosed with a concussion.
  2. Identify the posterior model of \(\mu\) and verify your answer using summarize_normal_normal().
  3. Plot the prior, likelihood, and posterior models of \(\mu\). Describe the influence of the data on the posterior model in one sentence.
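A tidyverse sketch for part 1, assuming the football data codes non-concussed players with the group label "fb_no_concuss" and stores volumes in a volume column; check names(football) and the levels of the grouping variable before relying on these names:

# Load packages and data
library(bayesrules)
library(tidyverse)
data("football")

# Sample mean, standard deviation, and size for non-concussed players
# (the group label below is an assumption -- verify it in your data)
football %>%
  filter(group == "fb_no_concuss") %>%
  summarize(mean_volume = mean(volume), sd_volume = sd(volume), n = n())
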
Exercise 5.12 (Healthy control brains) Repeat Exercise 5.11, this time for the average hippocampal volume \(\mu\) of the control group, i.e., people who are not football players and have not had concussions.
Exercise 5.13 (Normal-Normal Simulation)
  1. Your friend Alex has read Chapter 4 of this book, but not Chapter 5. Explain to Alex why it’s difficult to simulate a Normal-Normal posterior using the simulation methods we have learned thus far.

  2. To prove your point, try (and fail) to simulate the posterior of \(\mu\) for the following model upon observing a single data point \(Y = 1.1\):

    \[\begin{split} Y | \mu & \sim N(\mu, 1^2) \\ \mu & \sim N(0, 1^2) \\ \end{split}\]
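A minimal sketch of one such doomed attempt, in the spirit of the Chapter 4 simulations (the seed and the sample size of 10,000 are arbitrary):

# Simulate (mu, Y) pairs from the model, then try to keep only
# those pairs with Y == 1.1
library(tidyverse)
set.seed(84735)
sim <- tibble(mu = rnorm(10000, mean = 0, sd = 1)) %>%
  mutate(y = rnorm(10000, mean = mu, sd = 1))
sim %>% filter(y == 1.1)   # essentially zero rows: Y is continuous, so
                           # exact matches to 1.1 almost never occur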

5.7.3 General practice

Exercise 5.14 (Which model?) Below are kernels for Normal, Poisson, Gamma, Beta, and Binomial models. Identify the appropriate model with specific parameter values.
  1. \(f(\theta) \propto 0.3^\theta 0.7^{(16-\theta)}\) for \(\theta\in \{0,1,2,\ldots,16\}\)
  2. \(f(\theta) \propto 1 /\theta!\) for \(\theta\in \{0,1,2,\ldots,\infty\}\)
  3. \(f(\theta) \propto \theta^4 (1-\theta)^7\) for \(\theta \in (0, 1)\)
  4. \(f(\theta) \propto e^{-\theta^2}\) for \(\theta \in (-\infty, \infty)\)
Exercise 5.15 (Which model: Back for more!) Below are kernels for Normal, Poisson, Gamma, Beta, and Binomial models. Identify the appropriate model with specific parameter values.
  1. \(f(\theta) \propto e^{-\theta^2}\theta^{15}\) for \(\theta \in (0, \infty)\)
  2. \(f(\theta) \propto e^{\frac{-(\theta-12)^2}{18}}\) for \(\theta \in (-\infty, \infty)\)
  3. \(f(\theta) \propto 0.3^\theta /\theta!\) for \(\theta \in \{0,1,2,\ldots,\infty\}\)
Exercise 5.16 (Weighted average) In Chapter 4, we expressed the Beta-Binomial posterior mean of proportion \(\pi\) as a weighted average of the prior mean of \(\pi\) and the observed sample proportion. Similarly:
  1. Express the Gamma-Poisson posterior mean of rate \(\lambda\), \(E(\lambda | \vec{y})\), as a weighted average of the prior mean \(E(\lambda)\) and sample mean \(\overline{y}\).
  2. Express the Normal-Normal posterior mean of \(\mu\), \(E(\mu | \vec{y})\), as a weighted average of the prior mean \(E(\mu)\) and sample mean \(\overline{y}\).
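To get started, recall from this chapter that the Gamma-Poisson posterior is

\[\lambda | \vec{y} \; \sim \; \text{Gamma}\!\left(s + \sum_{i=1}^n y_i, \; r + n\right),\]

so \(E(\lambda | \vec{y}) = \left(s + \sum_{i=1}^n y_i\right) / (r + n)\); the goal of part 1 is to rearrange this into weighted-average form, and part 2 asks the same of the Normal-Normal posterior mean shown after Exercise 5.10.
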
Exercise 5.17 (Counting insects) You have been hired as a research assistant to a biology professor. Your job is to count the number of a specific insect in a \(20 m^2\) area. Your prior idea about the density of this insect is well captured by a Gamma model with an expected value of 0.5 insects per \(m^2\) and a variance of 0.0625 insects\(^2\) per \(m^2\). You go out into the field and count 3, 2, 5, 1, 2 insects in the first five square meters respectively and then do not encounter the insect in the rest of the field (the other 15 square meters).
  1. What likelihood model, Normal or Poisson, should you use to model the dependence of your data on the underlying insect density? Explain why.
  2. Plot the prior, likelihood, and posterior model of insect density. Comment on whether the posterior model more closely resembles the data or the prior model. Explain why this makes sense.
  3. What is the posterior mean and variance of the insect density?
  4. Describe your posterior conclusions in context, in a way that your biology professor would find helpful.
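Once you have tuned Gamma(\(s, r\)) hyperparameters from the stated prior mean and variance (the values below are placeholders, not the answer), a plotting and summarizing sketch might be:

# 20 one-square-meter plots: counts 3, 2, 5, 1, 2, then fifteen zeros,
# so sum_y = 13 and n = 20
s <- 2   # placeholder shape -- replace with your tuned value
r <- 4   # placeholder rate  -- replace with your tuned value
plot_gamma_poisson(shape = s, rate = r, sum_y = 13, n = 20)
summarize_gamma_poisson(shape = s, rate = r, sum_y = 13, n = 20)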

Exercise 5.18 (The Beta-Geometric model) Consider the following new Bayesian model:

\[\begin{split} Y|\theta & \sim \text{Geometric}(\theta) \\ \theta & \sim \text{Beta}(\alpha, \beta) \\ \end{split}\]

where \(f(y|\theta) = \theta (1 - \theta)^{y - 1}\) for \(y \in \{1,2,\ldots\}\).
  1. Derive the posterior model for \(\theta\) given observed data \(Y = y\). If possible, identify the name of the posterior model and its parameters.
  2. Is the Beta model a conjugate prior for the Geometric likelihood?

  1. 1:a, 2:b, 3: Gamma(20,20)

  2. https://en.wikipedia.org/wiki/Hippocampus

  3. You’ll have a chance to relax this silly-ish assumption that we know \(\sigma\) but don’t know \(\mu\) in later chapters.

  4. Note that we could use whatever Greek (or other) letters we wish for the prior mean and variance parameters other than \(\mu\) and \(\sigma\), which are already in use (and thus carry different meanings) in our model.