Chapter 5 Conjugate Families
In the novel Anna Karenina, Tolstoy wrote “Happy families are all alike; every unhappy family is unhappy in its own way.” In this chapter we will learn about conjugate families, which are all alike in the sense that they make the authors very happy. Read on to learn why.
- Practice building Bayesian models. You will build Bayesian models by practicing how to recognize kernels and make use of proportionality.
- Familiarize yourself with conjugacy. You will learn about what makes a prior conjugate and why this is a helpful property. In brief, conjugate priors make it easier to build posterior models. Conjugate priors spark joy!
Getting started
To prepare for this chapter, note that we’ll be using some new Greek letters throughout our analysis: λ = lambda, μ = mu or “mew,” σ = sigma, τ = tau, and θ = theta. Further, load the packages below.
library(bayesrules)
library(tidyverse)
5.1 Revisiting choice of prior
How do we choose a prior? In Chapters 3 and 4 we used the flexibility of the Beta model to reflect our prior understanding of a proportion parameter π ∈ [0,1]. There are other criteria to consider when choosing a prior model:
- Computational ease
  Especially if we don’t have access to computing power, it is helpful if the posterior model is easy to build.
- Interpretability
  We’ve seen that posterior models are a compromise between the data and the prior model. A posterior model is interpretable, and thus more useful, when you can look at its formulation and identify the contribution of the data relative to that of the prior.
The Beta-Binomial has both of these criteria covered. Its calculation is easy. Once we know the Beta(α, β) prior hyperparameters and the observed data Y = y for the Bin(n, π) model, the Beta(α + y, β + n − y) posterior model follows. This posterior reflects the influence of the data, through the values of y and n, relative to the prior hyperparameters α and β. If α and β are large relative to the sample size n, then the posterior will not budge that much from the prior. However, if the sample size n is large relative to α and β, then the data will take over and be more influential in the posterior model. In fact, the Beta-Binomial belongs to a larger class of prior-data combinations called conjugate families that enjoy both computational ease and interpretable posteriors. Recall a general definition similar to that from Chapter 3.
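For instance, here is a minimal sketch of this update in R, assuming a hypothetical Beta(2, 2) prior and an observed y = 9 successes in n = 20 trials. Under the conjugate update the posterior is Beta(2 + 9, 2 + 20 − 9) = Beta(11, 13), which the summarize_beta_binomial() function from the bayesrules package can confirm:

# A hypothetical Beta(2, 2) prior updated with y = 9 successes in n = 20 trials
summarize_beta_binomial(alpha = 2, beta = 2, y = 9, n = 20)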
Conjugate prior
Let the prior model for parameter θ have pdf f(θ) and the model of data Y conditioned on θ have likelihood function L(θ|y). If the resulting posterior model with pdf f(θ|y) ∝ f(θ)L(θ|y) is of the same model family as the prior, then we say this is a conjugate prior.
To emphasize the utility (and fun!) of conjugate priors, it can be helpful to consider a non-conjugate prior. Let parameter π be a proportion between 0 and 1 and suppose we plan to collect data Y where, conditional on π, Y|π ∼ Bin(n, π). Instead of our conjugate Beta(α, β) prior for π, let’s try out a non-conjugate prior with pdf f(π), plotted in Figure 5.1:

$$f(\pi) = e - e^\pi \;\; \text{ for } \pi \in [0,1]. \tag{5.1}$$

Though not a Beta pdf, f(π) is indeed a valid pdf since f(π) is non-negative on the support of π and the area under the pdf is 1, i.e., $\int_0^1 f(\pi) \, d\pi = 1$.

FIGURE 5.1: A non-conjugate prior for π.
Next, suppose we observe Y = 10 successes from n = 50 independent trials, all having the same probability of success π. The resulting Binomial likelihood function of π is:

$$L(\pi | y=10) = \binom{50}{10} \pi^{10} (1-\pi)^{40} \;\; \text{ for } \pi \in [0,1].$$
Recall from Chapter 3 that, when we put the prior and the likelihood together, we are already on our path to finding the posterior model with pdf
$$f(\pi | y=10) \propto f(\pi) L(\pi | y=10) = (e - e^\pi) \cdot \binom{50}{10} \pi^{10} (1-\pi)^{40}.$$

As we did in Chapter 3, we will drop all constants that do not depend on π since we are only specifying f(π|y) up to a proportionality constant:

$$f(\pi | y) \propto (e - e^\pi) \, \pi^{10} (1-\pi)^{40}.$$
Notice here that our non-Beta prior didn’t produce a neat and clean answer for the exact posterior model (fully specified and not merely known up to a proportionality constant). We cannot squeeze this posterior pdf kernel into a Beta box or any other familiar model for that matter. That is, we cannot rewrite $(e - e^\pi)\,\pi^{10}(1-\pi)^{40}$ so that it shares the same structure as a Beta kernel, $\pi^{\blacksquare - 1}(1-\pi)^{\blacksquare - 1}$. Instead we will need to integrate this kernel in order to calculate the normalizing constant, and hence complete the posterior specification:
$$f(\pi | y=10) = \frac{(e - e^\pi) \, \pi^{10} (1-\pi)^{40}}{\int_0^1 (e - e^\pi) \, \pi^{10} (1-\pi)^{40} \, d\pi} \;\; \text{ for } \pi \in [0,1]. \tag{5.2}$$

This is where we really start to feel the pain of not having a conjugate prior! Since this is a particularly unpleasant integral to evaluate, and we’ve been trying to avoid doing any integration altogether, we will leave ourselves with (5.2) as the final posterior. Yikes, what a mess. This is a valid posterior pdf – it’s non-negative and integrates to 1 across π ∈ [0,1]. But it doesn’t have much else going for it. Consider a few characteristics about the posterior model that result from this particular non-conjugate prior (5.1):
- The calculation for this posterior was messy and unpleasant.
- It is difficult to derive any intuition about the balance between the prior information and the data we observed from this posterior model.
- It would be difficult to specify features such as the posterior mean, mode, and standard deviation, a process which would require even more integration.
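Though we will not wrestle with this integral analytically, we can at least confirm numerically that the non-conjugate prior (5.1) is a valid pdf and evaluate the normalizing constant in (5.2). The small sketch below uses base R’s integrate(); note that the resulting constant is just a number with no interpretable structure, which is part of the problem:

# Check that the non-conjugate prior f(pi) = e - e^pi integrates to 1 on [0, 1]
prior_pdf <- function(p) exp(1) - exp(p)
integrate(prior_pdf, lower = 0, upper = 1)

# Numerically evaluate the normalizing constant in (5.2) for y = 10 and n = 50
posterior_kernel <- function(p) (exp(1) - exp(p)) * p^10 * (1 - p)^40
integrate(posterior_kernel, lower = 0, upper = 1)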
This leaves us with the question: could we use the conjugate Beta prior and still capture the broader information of the messy non-conjugate prior (5.1)? If so, then we solve the problems of messy calculations and indecipherable posterior models.
Which Beta model would most closely approximate the non-conjugate prior for π (Figure 5.1)?
- Beta(3,1)
- Beta(1,3)
- Beta(2,1)
- Beta(1,2)
The answer to this quiz is d. One way to find the answer is by comparing the non-conjugate prior in Figure 5.1 with plots of the possible Beta prior models. For example, the non-conjugate prior information is pretty well captured by the Beta(1,2) (Figure 5.2):
plot_beta(alpha = 1, beta = 2)

FIGURE 5.2: A conjugate Beta(1,2) prior model for π.
5.2 Gamma-Poisson conjugate family
Stop callin’ / Stop callin’ / I don’t wanna think anymore / I got my head and my heart on the dance floor
—Lady Gaga featuring Beyoncé. Lyrics from the song “Telephone.”
Last year, one of this book’s authors got fed up with the number of fraud risk phone calls they were receiving. They set out with a goal of modeling rate λ, the typical number of fraud risk calls received per day. Prior to collecting any data, the author’s guess was that this rate was most likely around 5 calls per day, but could also reasonably range between 2 and 7 calls per day. To learn more, they planned to record the number of fraud risk phone calls on each of n sampled days, $(Y_1, Y_2, \ldots, Y_n)$.
In moving forward with our investigation of λ, it’s important to recognize that it will not fit into the familiar Beta-Binomial framework. First, λ is not a proportion limited to be between 0 and 1, but rather a rate parameter that can take on any positive value (e.g., we can’t receive -7 calls per day). Thus, a Beta prior for λ won’t work. Further, each data point $Y_i$ is a count that can technically take on any non-negative integer in {0, 1, 2, …}, and thus is not limited by some number of trials n as is true for the Binomial. Not to fret. Our study of λ will introduce us to a new conjugate family, the Gamma-Poisson.
5.2.1 The Poisson data model
The spirit of our analysis starts with a prior understanding of λ, the daily rate of fraud risk phone calls. Yet before choosing a prior model structure and tuning this to match our prior understanding, it’s beneficial to identify a model for the dependence of our daily phone call count data $Y_i$ on the typical daily rate of such calls λ. Upon identifying a reasonable data model, we can identify a prior model which can be tuned to match our prior understanding while also mathematically complementing the data model’s corresponding likelihood function. Keeping in mind that each data point $Y_i$ is a random count that can go from 0 to a really big number, Y ∈ {0, 1, 2, …}, the Poisson model, described in its general form below, makes a reasonable candidate for modeling this data.
The Poisson model
Let discrete random variable Y be the number of independent events that occur in a fixed amount of time or space, where λ > 0 is the rate at which these events occur. Then the dependence of Y on parameter λ can be modeled by the Poisson. In mathematical notation:

$$Y | \lambda \sim \text{Pois}(\lambda).$$
Correspondingly, the Poisson model is specified by pmf
$$f(y|\lambda) = \frac{\lambda^y e^{-\lambda}}{y!} \;\; \text{ for } y \in \{0, 1, 2, \ldots\} \tag{5.3}$$

where f(y|λ) sums to one across y, $\sum_{y=0}^\infty f(y|\lambda) = 1$. Further, a Poisson random variable Y has equal mean and variance,

$$E(Y|\lambda) = \text{Var}(Y|\lambda) = \lambda. \tag{5.4}$$
Figure 5.3 illustrates the Poisson pmf (5.3) under different rate parameters λ. In general, as the rate of events λ increases, the typical number of events increases, the variability increases, and the skew decreases. For example, when events occur at a rate of λ = 1, the model is heavily skewed toward observing a small number of events – we’re most likely to observe 0 or 1 events, and rarely more than 3. In contrast, when events occur at a higher rate of λ = 5, the model is roughly symmetric and more variable – we’re most likely to observe 4 or 5 events, though have a reasonable chance of observing anywhere between 1 and 10 events.

FIGURE 5.3: Poisson pmfs with different rate parameters.
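To see these patterns numerically rather than visually, we can evaluate the Poisson pmf (5.3) directly with base R’s dpois(). This is just a quick check of the claims above for the same two rates:

# Poisson probabilities of counts 0 through 5 when lambda = 1 (right-skewed)
round(dpois(0:5, lambda = 1), 3)

# Poisson probabilities of counts 0 through 10 when lambda = 5 (more symmetric, more variable)
round(dpois(0:10, lambda = 5), 3)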
Let $(Y_1, Y_2, \ldots, Y_n)$ denote the number of fraud risk calls we observed on each of the n days in our data collection period. We assume that the daily number of calls might differ from day to day and can be independently modeled by the Poisson. Thus, on each day i,

$$Y_i | \lambda \stackrel{ind}{\sim} \text{Pois}(\lambda)$$
with unique pmf
$$f(y_i|\lambda) = \frac{\lambda^{y_i} e^{-\lambda}}{y_i!} \;\; \text{ for } y_i \in \{0, 1, 2, \ldots\}.$$

Yet in weighing the evidence of the phone call data, we won’t want to analyze each individual day. Rather, we’ll need to process the collective or joint information in our n data points. This information is captured by the joint probability mass function.
Joint probability mass function
Let $(Y_1, Y_2, \ldots, Y_n)$ be an independent sample of random variables and $\vec{y} = (y_1, y_2, \ldots, y_n)$ be the corresponding vector of observed values. Further, let $f(y_i|\lambda)$ denote the pmf of an individual observed data point $Y_i = y_i$. Then by the assumption of independence, the following joint pmf specifies the randomness in and plausibility of the collective sample:

$$f(\vec{y}|\lambda) = \prod_{i=1}^n f(y_i|\lambda) = f(y_1|\lambda) \cdot f(y_2|\lambda) \cdots f(y_n|\lambda). \tag{5.5}$$
Connecting concepts
This product is analogous to the joint probability of independent events being the product of the marginal probabilities, P(A ∩ B) = P(A)P(B).
The joint pmf for our fraud risk call sample follows by applying the general definition (5.5) to our Poisson pmfs. Letting the number of calls on each day i be $y_i \in \{0, 1, 2, \ldots\}$,

$$f(\vec{y}|\lambda) = \prod_{i=1}^n f(y_i|\lambda) = \prod_{i=1}^n \frac{\lambda^{y_i} e^{-\lambda}}{y_i!}. \tag{5.6}$$

This looks like a mess, but it can be simplified. In this simplification, it’s important to recognize that we have n unique data points $y_i$, not n copies of the same data point y. Thus, we need to pay careful attention to the i subscripts. It follows that

$$\begin{split}
f(\vec{y}|\lambda) & = \frac{\lambda^{y_1} e^{-\lambda}}{y_1!} \cdot \frac{\lambda^{y_2} e^{-\lambda}}{y_2!} \cdots \frac{\lambda^{y_n} e^{-\lambda}}{y_n!} \\
& = \frac{\left[\lambda^{y_1} \lambda^{y_2} \cdots \lambda^{y_n}\right]\left[e^{-\lambda} e^{-\lambda} \cdots e^{-\lambda}\right]}{y_1! \, y_2! \cdots y_n!} \\
& = \frac{\lambda^{\sum y_i} e^{-n\lambda}}{\prod_{i=1}^n y_i!}
\end{split}$$
where we’ve simplified the products in the final line by appealing to the properties below.
Simplifying products
Let (x, y, a, b) be a set of constants. Then we can utilize the following facts when simplifying products involving exponents:

$$x^a x^b = x^{a+b} \;\; \text{ and } \;\; x^a y^a = (xy)^a \tag{5.7}$$
Once we observe actual sample data, we can flip this joint pmf on its head to define the likelihood function of λ. The Poisson likelihood function is equivalent in formula to the joint pmf $f(\vec{y}|\lambda)$, yet is a function of λ which helps us assess the compatibility of different possible λ values with our observed collection of sample data $\vec{y}$:

$$L(\lambda|\vec{y}) = \frac{\lambda^{\sum y_i} e^{-n\lambda}}{\prod_{i=1}^n y_i!} \propto \lambda^{\sum y_i} e^{-n\lambda} \;\; \text{ for } \lambda > 0. \tag{5.6}$$
It is convenient to represent the likelihood function up to a proportionality constant here, especially since $\prod y_i!$ will be cumbersome to calculate when n is large, and what we really care about in the likelihood is λ. And when we express the likelihood up to a proportionality constant, note that the sum of the data points ($\sum y_i$) and the number of data points (n) are all the information that is required from the data. We don’t need to know the value of each individual data point $y_i$. Taking this for a spin with real data points later in our analysis will provide some clarity.
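Here is a small sketch of that point, using two hypothetical four-day samples that happen to share the same total number of calls. Since both samples have the same $\sum y_i$ and n, their likelihood functions are proportional to one another – they carry exactly the same information about λ:

# Two hypothetical samples with the same total (10 calls) and size (n = 4)
y_a <- c(3, 1, 4, 2)
y_b <- c(2, 2, 2, 4)

# Evaluate each full Poisson likelihood over a grid of lambda values
lambda_grid <- seq(0.5, 10, by = 0.5)
like_a <- sapply(lambda_grid, function(l) prod(dpois(y_a, l)))
like_b <- sapply(lambda_grid, function(l) prod(dpois(y_b, l)))

# The ratio is constant in lambda, so the two samples are equally informative
range(like_a / like_b)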
5.2.2 Potential priors
The Poisson data model provides one of two key pieces for our Bayesian analysis of λ, the daily rate of fraud risk calls. The other key piece is a prior model for λ. Our original guess was that this rate is most likely around 5 calls per day, but could also reasonably range between 2 and 7 calls per day. In order to tune a prior to match these ideas about λ, we first have to identify a reasonable probability model structure.
Remember here that λ is a positive and continuous rate, meaning that λ does not have to be a whole number. Accordingly, a reasonable prior probability model will also have continuous and positive support, i.e., be defined on λ > 0. There are several named and studied probability models with this property, including the F, Weibull, and Gamma. We don’t dig into all of these in this book. Rather, to make the λ posterior model construction more straightforward and convenient, we’ll focus on identifying a conjugate prior model.
Suppose we have a random sample of Poisson random variables $(Y_1, Y_2, \ldots, Y_n)$ with likelihood function $L(\lambda|\vec{y}) \propto \lambda^{\sum y_i} e^{-n\lambda}$ for λ > 0. What do you think would provide a convenient conjugate prior model for λ? Why?
- A “Gamma” model with pdf $f(\lambda) \propto \lambda^{s-1} e^{-r\lambda}$
- A “Weibull” model with pdf $f(\lambda) \propto \lambda^{s-1} e^{-(r\lambda)^s}$
- A special case of the “F” model with pdf $f(\lambda) \propto \lambda^{s/2 - 1} (1 + \lambda)^{-s}$
The answer is a. The Gamma model will provide a conjugate prior for λ when our data has a Poisson model. You might have guessed this from the section title (clever). You might also have guessed this from the shared features of the Poisson likelihood function $L(\lambda|\vec{y})$ (5.6) and the Gamma pdf f(λ). Both are proportional to

$$\lambda^{\blacksquare} e^{-\blacksquare \lambda}$$

with differing $\blacksquare$. In fact, we’ll prove that combining the prior and likelihood produces a posterior pdf with this same structure. That is, the posterior will be of the same Gamma model family as the prior. First, let’s learn more about the Gamma model.
5.2.3 Gamma prior
The Gamma & Exponential models
Let λ be a continuous random variable which can take any positive value, i.e., λ > 0. Then the variability in λ might be well modeled by a Gamma model with shape hyperparameter s > 0 and rate hyperparameter r > 0:

$$\lambda \sim \text{Gamma}(s, r).$$
The Gamma model is specified by continuous pdf
$$f(\lambda) = \frac{r^s}{\Gamma(s)} \lambda^{s-1} e^{-r\lambda} \;\; \text{ for } \lambda > 0. \tag{5.8}$$
Further, the central tendency and variability in λ are measured by:

$$E(\lambda) = \frac{s}{r}, \quad \text{Mode}(\lambda) = \frac{s-1}{r} \text{ for } s \ge 1, \quad \text{Var}(\lambda) = \frac{s}{r^2}. \tag{5.9}$$

The Exponential model is a special case of the Gamma with shape s = 1, Gamma(1, r):

$$\lambda \sim \text{Exp}(r).$$
Notice that the Gamma model depends upon two hyperparameters, r and s. Assess your understanding of how these hyperparameters impact the Gamma model properties in the following quiz.
Figure 5.4 illustrates how different shape and rate hyperparameters impact the Gamma pdf (5.8). Based on these plots:
How would you describe the typical behavior of a Gamma(s, r) variable λ when s > r (e.g., Gamma(2,1))?
- Right-skewed with a mean greater than 1.
- Right-skewed with a mean less than 1.
- Symmetric with a mean around 1.
Using the same options as above, how would you describe the typical behavior of a Gamma(s, r) variable λ when s < r (e.g., Gamma(1,2))?
For which model is there greater variability in the plausible values of λ, Gamma(20,20) or Gamma(20,100)?

FIGURE 5.4: Gamma models with different hyperparameters. The dashed and solid vertical lines represent the modes and means, respectively.
In general, Figure 5.4 illustrates that λ ∼ Gamma(s, r) variables are positive and right skewed. Further, the general shape and rate of decrease in the skew are controlled by hyperparameters s and r. The quantitative measures of central tendency and variability in (5.9) provide some insight. For example, notice that the mean of λ, E(λ) = s/r, is greater than 1 when s > r and less than 1 when s < r. Further, as s increases relative to r, the skew in λ decreases and the variability, Var(λ) = s/r², increases.
Now that we have some intuition for how the Gamma(s, r) model works, we can tune it to reflect our prior information about the daily rate of fraud risk phone calls λ. Recall our earlier assumption that λ is about 5, and most likely somewhere between 2 and 7. Our Gamma(s, r) prior should have similar patterns. For example, we want to pick s and r for which λ tends to be around 5,

$$E(\lambda) = \frac{s}{r} \approx 5.$$

This can be achieved by setting s to be 5 times r, s = 5r.
Next we want to make sure that most values of our Gamma(s, r) prior are between 2 and 7. Through some trial and error within these constraints, and plotting various Gamma models using plot_gamma() in the bayesrules package, we find that the Gamma(10,2) features closely match the central tendency and variability in our prior understanding (Figure 5.5).
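As a quick numerical check on this choice (a small sketch using base R’s pgamma()), the Gamma(10,2) prior does place the bulk of its mass between 2 and 7 calls per day:

# Prior probability that lambda falls between 2 and 7 under a Gamma(10, 2) prior
pgamma(7, shape = 10, rate = 2) - pgamma(2, shape = 10, rate = 2)   # roughly 0.89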
Thus, a reasonable prior model for the daily rate of fraud risk phone calls is
$$\lambda \sim \text{Gamma}(10, 2)$$

with prior pdf f(λ) following from plugging s = 10 and r = 2 into (5.8):

$$f(\lambda) = \frac{2^{10}}{\Gamma(10)} \lambda^{10-1} e^{-2\lambda} \;\; \text{ for } \lambda > 0.$$
# Plot the Gamma(10, 2) prior
plot_gamma(shape = 10, rate = 2)

FIGURE 5.5: The pdf of a Gamma(10,2) prior for λ, the daily rate of fraud risk calls.
5.2.4 Gamma-Poisson conjugacy
As we discussed at the start of this chapter, conjugate families can come in handy. Fortunately for us, using a Gamma prior for a rate parameter λ and a Poisson model for corresponding count data Y is another example of a conjugate family. This means that, spoiler, the posterior model for λ will also be a Gamma model with updated parameters. We’ll state and prove this in the general setting before applying the results to our phone call situation.
The Gamma-Poisson Bayesian model
Let λ > 0 be an unknown rate parameter and $(Y_1, Y_2, \ldots, Y_n)$ be an independent Pois(λ) sample. The Gamma-Poisson Bayesian model complements the Poisson structure of data Y with a Gamma prior on λ:

$$\begin{split} Y_i | \lambda & \stackrel{ind}{\sim} \text{Pois}(\lambda) \\ \lambda & \sim \text{Gamma}(s, r). \end{split}$$

Upon observing data $\vec{y} = (y_1, y_2, \ldots, y_n)$, the posterior model of λ is also a Gamma with updated parameters:

$$\lambda | \vec{y} \sim \text{Gamma}\!\left(s + \sum y_i, \; r + n\right). \tag{5.10}$$
Let’s prove this result. In general, recall that the posterior pdf of λ is proportional to the product of the prior pdf and likelihood function defined by (5.8) and (5.6), respectively:

$$f(\lambda|\vec{y}) \propto f(\lambda) L(\lambda|\vec{y}) = \frac{r^s}{\Gamma(s)} \lambda^{s-1} e^{-r\lambda} \cdot \frac{\lambda^{\sum y_i} e^{-n\lambda}}{\prod y_i!} \;\; \text{ for } \lambda > 0.$$

Next, remember that any non-λ multiplicative constant in the above equation can be “proportional-ed” out. Thus, boiling the prior pdf and likelihood function down to their kernels, we get

$$f(\lambda|\vec{y}) \propto \lambda^{s-1} e^{-r\lambda} \cdot \lambda^{\sum y_i} e^{-n\lambda} = \lambda^{s + \sum y_i - 1} e^{-(r+n)\lambda}$$

where the final line follows by combining like terms. What we’re left with here is the kernel of the posterior pdf. This particular kernel corresponds to the pdf of a Gamma model (5.8), with shape parameter $s + \sum y_i$ and rate parameter r + n. Thus, we’ve proven that

$$\lambda | \vec{y} \sim \text{Gamma}\!\left(s + \sum y_i, \; r + n\right).$$
Let’s apply this result to our fraud risk calls. There we have a Gamma(10,2) prior for λ, the daily rate of calls. Further, on four separate days in the second week of August, we received $\vec{y} = (y_1, y_2, y_3, y_4) = (6, 2, 2, 1)$ such calls. Thus, we have a sample of n = 4 data points with a total of 11 fraud risk calls and an average of 2.75 phone calls per day:

$$\sum_{i=1}^4 y_i = 6 + 2 + 2 + 1 = 11 \;\; \text{ and } \;\; \bar{y} = \frac{\sum_{i=1}^4 y_i}{4} = 2.75.$$

Plugging this data into (5.6), the resulting Poisson likelihood function of λ is

$$L(\lambda|\vec{y}) = \frac{\lambda^{11} e^{-4\lambda}}{6! \times 2! \times 2! \times 1!} \propto \lambda^{11} e^{-4\lambda} \;\; \text{ for } \lambda > 0.$$
We visualize a portion of $L(\lambda|\vec{y})$ for λ between 0 and 10 using the plot_poisson_likelihood() function in the bayesrules package. Here, y is the vector of data values and lambda_upper_bound is the maximum value of λ to view on the x-axis. (Why can’t we visualize the whole likelihood? Because λ ∈ (0, ∞) and this book would be pretty expensive if we had infinite pages.)
plot_poisson_likelihood(y = c(6, 2, 2, 1), lambda_upper_bound = 10)

FIGURE 5.6: The likelihood function of λ, the daily rate of fraud risk calls, given a four-day sample of phone call data.
The punchline is this. Underlying rates λ of between one and five fraud risk calls per day are consistent with our phone call data. And across this spectrum, rates near 2.75 are the most compatible with this data. This makes sense. The Poisson data model assumes that λ is the underlying average daily phone call count, $E(Y_i|\lambda) = \lambda$. As such, we’re most likely to observe a sample with an average daily phone call rate of $\bar{y} = 2.75$ when the underlying rate λ is also 2.75.
Combining these observations with our Gamma(10,2) prior model of λ, it follows from (5.10) that the posterior model of λ is a Gamma with an updated shape parameter of 21 ($s + \sum y_i = 10 + 11$) and rate parameter of 6 (r + n = 2 + 4):

$$\lambda | \vec{y} \sim \text{Gamma}(21, 6).$$
We can visualize the prior pdf, scaled likelihood function, and posterior pdf for λ all in a single plot with the plot_gamma_poisson() function in the bayesrules package. How magical. For this function to work, we must specify a few things: the prior shape and rate hyperparameters as well as the information from our data, the observed total number of phone calls sum_y ($\sum y_i$) and the sample size n:
plot_gamma_poisson(shape = 10, rate = 2, sum_y = 11, n = 4)

FIGURE 5.7: The Gamma-Poisson model of λ, the daily rate of fraud risk calls.
Our posterior notion about the daily rate of fraud calls is, of course, a compromise between our vague prior and the observed phone call data. Since our prior notion was quite variable in comparison to the strength in our sample data, the posterior model of λ is more in sync with the data. Specifically, utilizing the properties of the Gamma(10,2) prior and Gamma(21,6) posterior as defined by (5.9), notice that our posterior understanding of the typical daily rate of phone calls dropped from 5 to 3.5 per day:

$$E(\lambda) = \frac{10}{2} = 5 \;\; \text{ and } \;\; E(\lambda|\vec{y}) = \frac{21}{6} = 3.5.$$

Though a compromise between the prior mean and data mean, this posterior mean is closer to the data mean of $\bar{y} = 2.75$ calls per day.
Hot tip
The posterior mean will always be between the prior mean and the data mean. If your posterior mean falls outside that range, it indicates that you made an error and should retrace some steps.
Further, with the additional information about λ from the data, the variability in our understanding of λ drops by more than half, from a standard deviation of 1.581 to 0.764 calls per day:

$$\text{SD}(\lambda) = \sqrt{\frac{10}{2^2}} = 1.581 \;\; \text{ and } \;\; \text{SD}(\lambda|\vec{y}) = \sqrt{\frac{21}{6^2}} = 0.764.$$
The convenient summarize_gamma_poisson() function in the bayesrules package, which uses the same arguments as plot_gamma_poisson(), helps us contrast the prior and posterior models and confirms the results above:
summarize_gamma_poisson(shape = 10, rate = 2, sum_y = 11, n = 4)
      model shape rate mean  mode    var     sd
1     prior    10    2  5.0 4.500 2.5000 1.5811
2 posterior    21    6  3.5 3.333 0.5833 0.7638
5.3 Normal-Normal conjugate family
We now have two conjugate families in our toolkit: the Beta-Binomial and the Gamma-Poisson. But many more conjugate families exist! It’s impossible to cover them all, but there is a third conjugate family that’s especially helpful to know: the Normal-Normal. Consider a data story. As scientists learn more about brain health, the dangers of concussions (hence of activities in which participants sustain repeated concussions) are gaining greater attention (Bachynski 2019). Among all people who have a history of concussions, we are interested in μ, the average volume (in cubic centimeters) of a specific part of the brain: the hippocampus. Though we don’t have prior information about this group in particular, Wikipedia tells us that among the general population of human adults, both halves of the hippocampus have a volume between 3.0 and 3.5 cubic centimeters. Thus, the total hippocampal volume of both sides of the brain is between 6 and 7 cm³. Using this as a starting point, we’ll assume that the mean hippocampal volume among people with a history of concussions, μ, is also somewhere between 6 and 7 cm³, with an average of 6.5. We’ll balance this prior understanding with data on the hippocampal volumes of n = 25 subjects, $(Y_1, Y_2, \ldots, Y_n)$, using the Normal-Normal Bayesian model.
5.3.1 The Normal data model
Again, the spirit of our Bayesian analysis starts with our prior understanding of μ. Yet the specification of an appropriate prior model structure for μ (which we can then tune) can be guided by first identifying a model for the dependence of our data $Y_i$ upon μ. Since hippocampal volumes $Y_i$ are measured on a continuous scale, there are many possible common models of the variability in $Y_i$ from person to person: Beta, Exponential, Gamma, Normal, F, etc. From this list, we can immediately eliminate the Beta model – it assumes that $Y_i \in [0, 1]$, whereas hippocampal volumes tend to be around 6.5 cm³. Among the remaining options, the Normal model is quite reasonable – biological measurements like hippocampal volume are often symmetrically or Normally distributed around some global average, here μ.
The Normal model
Let Y be a continuous random variable which can take any value between −∞ and ∞, i.e., Y ∈ (−∞, ∞). Then the variability in Y might be well represented by a Normal model with mean parameter μ ∈ (−∞, ∞) and standard deviation parameter σ > 0:

$$Y \sim N(\mu, \sigma^2).$$
The Normal model is specified by continuous pdf
$$f(y) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left[\frac{-(y - \mu)^2}{2\sigma^2}\right] \;\; \text{ for } y \in (-\infty, \infty) \tag{5.11}$$
and has the following features:
$$E(Y) = \text{Mode}(Y) = \mu, \quad \text{Var}(Y) = \sigma^2, \quad \text{SD}(Y) = \sigma.$$

Further, σ provides a sense of scale for Y. Roughly 95% of Y values will be within 2 standard deviations of μ:

$$\mu \pm 2\sigma. \tag{5.12}$$

FIGURE 5.8: Normal pdfs with varying mean and standard deviation parameters.
Figure 5.8 illustrates the Normal model under a variety of mean and standard deviation parameter values, μ and σ. No matter the parameters, the Normal model is bell-shaped and symmetric around μ – thus as μ gets larger, the model shifts to the right along with it. Further, σ controls the variability of the Normal model – as σ gets larger, the model becomes more spread out. Finally, though a Normal variable Y can technically range from −∞ to ∞, the Normal model assigns negligible plausibility to Y values that are more than 3 standard deviations σ from the mean μ.
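We can confirm these rules of thumb with a quick sketch using base R’s pnorm() on the standard Normal scale:

# Probability that a Normal variable lands within 2 vs 3 standard deviations of mu
pnorm(2) - pnorm(-2)   # roughly 0.954
pnorm(3) - pnorm(-3)   # roughly 0.997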
To play around some more, you can plot Normal models using the plot_normal() function from the bayesrules package.
Returning to our brain analysis, we can reasonably assume that the hippocampal volumes of our n = 25 subjects, $(Y_1, Y_2, \ldots, Y_n)$, are independent and Normally distributed around a mean volume μ with standard deviation σ. Further, to keep our focus on μ, we’ll assume throughout our analysis that the standard deviation is known to be σ = 0.5 cm³. This choice of σ suggests that most people have hippocampal volumes within 2σ = 1 cm³ of the average. Thus, the dependence of $Y_i$ on the unknown mean μ is:

$$Y_i | \mu \sim N(\mu, \sigma^2).$$
Reasonable doesn’t mean perfect. Though we’ll later see that our hippocampal volume data does exhibit Normal behavior, the Normal model technically assumes that each subject’s hippocampal volume can range from −∞ to ∞. However, we’re not too worried about this incorrect assumption here. Per our earlier discussion of Figure 5.8, the Normal model will put negligible weight on unreasonable values of hippocampal volume. In general, not letting perfect be the enemy of good will be a theme throughout this book (mainly because there is no perfect).
Accordingly, the joint pdf which describes the collective randomness in our n = 25 subjects’ hippocampal volumes, $(Y_1, Y_2, \ldots, Y_n)$, is the product of the unique Normal pdfs $f(y_i|\mu)$ defined by (5.11),

$$f(\vec{y}|\mu) = \prod_{i=1}^n f(y_i|\mu) = \prod_{i=1}^n \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left[\frac{-(y_i - \mu)^2}{2\sigma^2}\right].$$
Once we observe our sample data $\vec{y}$, we can flip the joint pdf on its head to obtain the Normal likelihood function of μ, $L(\mu|\vec{y}) = f(\vec{y}|\mu)$. Remembering that we’re assuming σ is a known constant, we can simplify the likelihood up to a proportionality constant by dropping the terms that don’t depend upon μ. Then for μ ∈ (−∞, ∞),

$$L(\mu|\vec{y}) \propto \prod_{i=1}^n \exp\!\left[\frac{-(y_i - \mu)^2}{2\sigma^2}\right] = \exp\!\left[\frac{-\sum_{i=1}^n (y_i - \mu)^2}{2\sigma^2}\right].$$

Through a bit more rearranging (which we encourage you to verify if, like us, you enjoy algebra), we can make this even easier to digest by using the sample mean $\bar{y}$ and sample size n to summarize our data values:

$$L(\mu|\vec{y}) \propto \exp\!\left[\frac{-(\bar{y} - \mu)^2}{2\sigma^2/n}\right] \;\; \text{ for } \mu \in (-\infty, \infty). \tag{5.13}$$

Don’t forget the whole point of this exercise! Specifying a model for the data along with its corresponding likelihood function provides the tools we’ll need to assess the compatibility of our data $\vec{y}$ with different values of μ (once we actually collect that data).
5.3.2 Normal prior
With the likelihood in place, let’s formalize a prior model for μ, the mean hippocampal volume among people that have a history of concussions. By the properties of the $Y_i | \mu \sim N(\mu, \sigma^2)$ data model, the Normal mean parameter μ can technically take any value between −∞ and ∞. Thus, a Normal prior for μ, which is also defined for μ ∈ (−∞, ∞), makes a reasonable choice. Specifically, we’ll assume that μ itself is Normally distributed around some mean θ with standard deviation τ:

$$\mu \sim N(\theta, \tau^2),$$

where μ has prior pdf

$$f(\mu) = \frac{1}{\sqrt{2\pi\tau^2}} \exp\!\left[\frac{-(\mu - \theta)^2}{2\tau^2}\right] \;\; \text{ for } \mu \in (-\infty, \infty). \tag{5.14}$$
Not only does the Normal prior assumption that μ ∈ (−∞, ∞) match the same assumption of the Normal data model, we’ll prove below that this is a conjugate prior. You might anticipate this result from the fact that the likelihood function $L(\mu|\vec{y})$ (5.13) and prior pdf f(μ) (5.14) are both proportional to

$$\exp\!\left[\frac{-(\mu - \blacksquare)^2}{2\blacksquare^2}\right]$$

with different $\blacksquare$.
Using our understanding of a Normal model, we can now tune the prior hyperparameters θ and τ to reflect our prior understanding and uncertainty about the average hippocampal volume among people that have a history of concussions, μ. Based on our rigorous Wikipedia research that hippocampal volumes tend to be between 6 and 7 cm³, we’ll set the Normal prior mean θ to the midpoint, 6.5. Further, we’ll set the Normal prior standard deviation to τ = 0.4. In other words, by (5.12), we think there’s a 95% chance that μ is somewhere between 5.7 and 7.3 cm³ (6.5 ± 2·0.4). This range is wider, and hence more conservative, than what Wikipedia indicated. Our uncertainty here reflects the fact that we didn’t vet the Wikipedia sources, we aren’t confident that the features of the typical adult translate to people with a history of concussions, and we generally aren’t sure what’s going on here (i.e., we’re not brain experts). Putting this together, our tuned prior model for μ is:
$$\mu \sim N(6.5, 0.4^2).$$
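As a sanity check on this tuning (a quick sketch with base R’s qnorm()), the middle 95% of the N(6.5, 0.4²) prior does span roughly 5.7 to 7.3 cm³:

# Middle 95% of the N(6.5, 0.4^2) prior for mu
qnorm(c(0.025, 0.975), mean = 6.5, sd = 0.4)   # roughly 5.72 and 7.28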
plot_normal(mean = 6.5, sd = 0.4)

FIGURE 5.9: A Normal prior model for μ, with mean 6.5 and standard deviation 0.4.
5.3.3 Normal-Normal conjugacy
To obtain our posterior model of μ we must combine the information from our prior and our data. Again, we were clever to pick a Normal prior model – the Normal-Normal is another convenient conjugate family! Thus, the posterior model for μ will also be Normal with updated parameters that are informed by the prior and observed data.
The Normal-Normal Bayesian model
Let μ ∈ (−∞, ∞) be an unknown mean parameter and $(Y_1, Y_2, \ldots, Y_n)$ be an independent $N(\mu, \sigma^2)$ sample where σ is assumed to be known. The Normal-Normal Bayesian model complements the Normal data structure with a Normal prior on μ:

$$\begin{split} Y_i | \mu & \stackrel{ind}{\sim} N(\mu, \sigma^2) \\ \mu & \sim N(\theta, \tau^2) \end{split}$$

Upon observing data $\vec{y} = (y_1, y_2, \ldots, y_n)$ with mean $\bar{y}$, the posterior model of μ is also Normal with updated parameters:

$$\mu | \vec{y} \sim N\!\left(\theta\frac{\sigma^2}{n\tau^2 + \sigma^2} + \bar{y}\frac{n\tau^2}{n\tau^2 + \sigma^2}, \; \frac{\tau^2\sigma^2}{n\tau^2 + \sigma^2}\right). \tag{5.15}$$
Whooo, that is a mouthful! We provide an optional proof of this result in Section 5.3.4. Even without that proof, we can observe the balance that the Normal posterior (5.15) strikes between the prior and the data. First, the posterior mean is a weighted average of the prior mean E(μ) = θ and the sample mean $\bar{y}$. Second, the posterior variance is informed by the prior variability τ and variability in the data σ. Both are impacted by sample size n. First, as n increases, the posterior mean places less weight on the prior mean and more weight on sample mean $\bar{y}$:

$$\frac{\sigma^2}{n\tau^2 + \sigma^2} \to 0 \;\; \text{ and } \;\; \frac{n\tau^2}{n\tau^2 + \sigma^2} \to 1.$$

Further, as n increases, the posterior variance decreases:

$$\frac{\tau^2\sigma^2}{n\tau^2 + \sigma^2} \to 0.$$
That is, the more data we have, the greater our posterior certainty about μ and the more closely the posterior syncs up with the data.
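To make this concrete, here is a small sketch that computes the weight the posterior mean places on the data mean, nτ²/(nτ² + σ²), for a few sample sizes, plugging in the τ = 0.4 and σ = 0.5 values from our hippocampal analysis:

# Weight on the data mean for increasing sample sizes n (tau = 0.4, sigma = 0.5)
tau   <- 0.4
sigma <- 0.5
n     <- c(1, 5, 25, 100)
round(n * tau^2 / (n * tau^2 + sigma^2), 3)   # 0.390 0.762 0.941 0.985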
Let’s apply and examine this result in our analysis of μ, the average hippocampal volume among people that have a history of concussions. We’ve already built our prior model of μ, $\mu \sim N(6.5, 0.4^2)$. Next, consider some data. The football data in bayesrules, a subset of the FootballBrain data in the Lock5Data package (Lock et al. 2016), includes results for a cross-sectional study of hippocampal volumes among 75 subjects (Singh et al. 2014): 25 collegiate football players with a history of concussions (fb_concuss), 25 collegiate football players that do not have a history of concussions (fb_no_concuss), and 25 control subjects. For our analysis, we’ll focus on the n = 25 subjects with a history of concussions (fb_concuss):
# Load the data
data(football)
concussion_subjects <- football %>%
  filter(group == "fb_concuss")
These subjects have an average hippocampal volume of $\bar{y}$ = 5.735 cm³:
concussion_subjects %>%
  summarize(mean(volume))
  mean(volume)
1        5.735
Further, the hippocampal volumes appear to vary normally from subject to subject, ranging from roughly 4.5 to 7 cm³. That is, our assumed Normal data model about individual hippocampal volumes, $Y_i | \mu \sim N(\mu, \sigma^2)$ with an assumed standard deviation of σ = 0.5, seems reasonable:
ggplot(concussion_subjects, aes(x = volume)) +
geom_density()

FIGURE 5.10: A density plot of the hippocampal volumes (in cubic centimeters) among 25 subjects that have experienced concussions.
Plugging this information from the data (n = 25, $\bar{y}$ = 5.735, and σ = 0.5) into (5.13) defines the Normal likelihood function of μ:

$$L(\mu|\vec{y}) \propto \exp\!\left[\frac{-(5.735 - \mu)^2}{2(0.5^2/25)}\right] \;\; \text{ for } \mu \in (-\infty, \infty).$$
We plot this likelihood function using plot_normal_likelihood(), providing our observed volume data and the data standard deviation σ = 0.5 (Figure 5.11). This likelihood illustrates the compatibility of our observed hippocampal data with different μ values. To this end, the hippocampal patterns observed in our data would most likely have arisen if the mean hippocampal volume across all people with a history of concussions, μ, were between 5.3 and 6.1 cm³. Further, we’re most likely to have observed a mean volume of $\bar{y}$ = 5.735 among our 25 sample subjects if the underlying population mean μ were also 5.735.
plot_normal_likelihood(y = concussion_subjects$volume, sigma = 0.5)

FIGURE 5.11: The Normal likelihood function for mean hippocampal volume μ.
We now have all necessary pieces to plug into (5.15), and hence to specify the posterior model of μ:
- our Normal prior model of μ had mean θ = 6.5 and standard deviation τ = 0.4;
- our n = 25 sample subjects had a sample mean volume $\bar{y} = 5.735$;
- we assumed a known standard deviation among individual hippocampal volumes of σ = 0.5.
It follows that the posterior model of μ is:

$$\mu | \vec{y} \sim N\!\left(6.5 \cdot \frac{0.5^2}{25 \cdot 0.4^2 + 0.5^2} + 5.735 \cdot \frac{25 \cdot 0.4^2}{25 \cdot 0.4^2 + 0.5^2}, \; \frac{0.4^2 \cdot 0.5^2}{25 \cdot 0.4^2 + 0.5^2}\right).$$
Further simplified,
$$\mu | \vec{y} \sim N(5.78, 0.097^2)$$

where the posterior mean places roughly 94% of its weight on the data mean ($\bar{y} = 5.735$) and only 6% of its weight on the prior mean (E(μ) = 6.5):

$$E(\mu|\vec{y}) = 6.5 \cdot 0.0588 + 5.735 \cdot 0.9412 = 5.78.$$
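We can verify this arithmetic with a quick sketch that plugs our prior hyperparameters and data summaries directly into (5.15):

# Normal-Normal posterior mean and variance from (5.15)
theta <- 6.5; tau <- 0.4                  # prior hyperparameters
sigma <- 0.5; n <- 25; y_bar <- 5.735     # data summaries
post_mean <- theta * sigma^2 / (n * tau^2 + sigma^2) +
  y_bar * n * tau^2 / (n * tau^2 + sigma^2)
post_var  <- tau^2 * sigma^2 / (n * tau^2 + sigma^2)
c(post_mean, post_var, sqrt(post_var))    # roughly 5.78, 0.0094, 0.097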
Bringing all of these pieces together, we plot and summarize our Normal-Normal analysis of μ using plot_normal_normal() and summarize_normal_normal() in the bayesrules package. Though a compromise between the prior and data, our posterior understanding of μ is more heavily influenced by the latter. In light of our data, we are much more certain about the mean hippocampal volume among people with a history of concussions, and believe that this figure is somewhere in the range from 5.586 to 5.974 cm³ (5.78 ± 2·0.097).
plot_normal_normal(mean = 6.5, sd = 0.4, sigma = 0.5,
y_bar = 5.735, n = 25)

FIGURE 5.12: The Normal-Normal model of μ, average hippocampal volume.
summarize_normal_normal(mean = 6.5, sd = 0.4, sigma = 0.5,
y_bar = 5.735, n = 25)
      model mean mode      var      sd
1     prior 6.50 6.50 0.160000 0.40000
2 posterior 5.78 5.78 0.009412 0.09701
5.3.4 Optional: Proving Normal-Normal conjugacy
For completeness’ sake, we prove here that the Normal-Normal model produces posterior model (5.15). If derivations are not your thing, that’s totally fine. Feel free to skip ahead to the next section and know that you can use this Normal-Normal model with a light heart. Let’s get right into it. The posterior pdf of μ is proportional to the product of the Normal prior pdf (5.14) and the likelihood function (5.13). For μ ∈ (−∞, ∞):
$$f(\mu|\vec{y}) \propto f(\mu) L(\mu|\vec{y}) \propto \exp\!\left[\frac{-(\mu - \theta)^2}{2\tau^2}\right] \cdot \exp\!\left[\frac{-(\bar{y} - \mu)^2}{2\sigma^2/n}\right].$$

Next, we can expand the squares in the exponents and sweep under the rug of proportionality both the θ² in the numerator of the first exponent and the $\bar{y}^2$ in the numerator of the second exponent:

$$\begin{split}
f(\mu|\vec{y}) & \propto \exp\!\left[\frac{-\mu^2 + 2\mu\theta - \theta^2}{2\tau^2}\right] \exp\!\left[\frac{-\mu^2 + 2\mu\bar{y} - \bar{y}^2}{2\sigma^2/n}\right] \\
& \propto \exp\!\left[\frac{-\mu^2 + 2\mu\theta}{2\tau^2}\right] \exp\!\left[\frac{-\mu^2 + 2\mu\bar{y}}{2\sigma^2/n}\right].
\end{split}$$
Next, we give the exponents common denominators and combine them into a single exponent:
$$\begin{split}
f(\mu|\vec{y}) & \propto \exp\!\left[\frac{(-\mu^2 + 2\mu\theta)\sigma^2/n}{2\tau^2\sigma^2/n}\right] \exp\!\left[\frac{(-\mu^2 + 2\mu\bar{y})\tau^2}{2\tau^2\sigma^2/n}\right] \\
& \propto \exp\!\left[\frac{(-\mu^2 + 2\mu\theta)\sigma^2 + (-\mu^2 + 2\mu\bar{y})n\tau^2}{2\tau^2\sigma^2}\right].
\end{split}$$

Now let’s combine μ terms and rearrange so that μ² is by itself:

$$\begin{split}
f(\mu|\vec{y}) & \propto \exp\!\left[\frac{-\mu^2(n\tau^2 + \sigma^2) + 2\mu(\theta\sigma^2 + \bar{y}n\tau^2)}{2\tau^2\sigma^2}\right] \\
& \propto \exp\!\left[\frac{-\mu^2 + 2\mu\left(\frac{\theta\sigma^2 + \bar{y}n\tau^2}{n\tau^2 + \sigma^2}\right)}{2(\tau^2\sigma^2)/(n\tau^2 + \sigma^2)}\right].
\end{split}$$

This may seem like too much to deal with, but if you look closely, you can see that we can bring back some constants which do not depend upon μ to complete the square in the numerator:

$$f(\mu|\vec{y}) \propto \exp\!\left[\frac{-\left(\mu - \frac{\theta\sigma^2 + \bar{y}n\tau^2}{n\tau^2 + \sigma^2}\right)^2}{2(\tau^2\sigma^2)/(n\tau^2 + \sigma^2)}\right].$$

This may still seem messy, but once we complete the square, we actually have the kernel of a Normal pdf for μ, $\exp\!\left[\frac{-(\mu - \blacksquare)^2}{2\blacksquare^2}\right]$. By identifying the missing pieces $\blacksquare$, we can thus conclude that

$$\mu | \vec{y} \sim N\!\left(\frac{\theta\sigma^2 + \bar{y}n\tau^2}{n\tau^2 + \sigma^2}, \; \frac{\tau^2\sigma^2}{n\tau^2 + \sigma^2}\right)$$
where we can reorganize the posterior mean as a weighted average of the prior mean θ and data mean $\bar{y}$:

$$\frac{\theta\sigma^2 + \bar{y}n\tau^2}{n\tau^2 + \sigma^2} = \theta\frac{\sigma^2}{n\tau^2 + \sigma^2} + \bar{y}\frac{n\tau^2}{n\tau^2 + \sigma^2}.$$
5.4 Why no simulation in this chapter?
As you may have gathered, we love some simulations. There is something reassuring about the reality check that a simulation can provide. Yet we are at a crossroads. The Gamma-Poisson and Normal-Normal models we’ve studied here are tough to simulate using the techniques we’ve learned thus far. Letting θ represent some parameter of interest, recall the steps we’ve used for past simulations:
- Simulate, say, 10000 values of θ from the prior model.
- Simulate a set of sample data Y from each simulated θ value.
- Filter out only those of the 10000 simulated sets of (θ, Y) for which the simulated Y data matches the data we actually observed.
- Use the remaining θ values to approximate the posterior of θ.
The issue with extending this simulation technique to the Gamma-Poisson and Normal-Normal examples in this chapter comes with step 3. In both of our examples, we had a sample size greater than one, $(Y_1, Y_2, \ldots, Y_n)$. Further, in the Normal-Normal example, our data values $Y_i$ are continuous. In both of these scenarios, it’s very likely that no simulated sets of (θ, Y) will perfectly match our observed sample data. This is true even if we carry out millions of simulations. Spoiler alert! There are other ways to approximate the posterior model which we will learn in Unit 2.
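To see the problem in action, here is a small sketch using a toy model with a single continuous data point: a N(0, 1²) prior for μ, a N(μ, 1²) data model, and an observed value of Y = 1.1 (the same setup that reappears in the chapter exercises). It is essentially certain that none of the simulated Y values will exactly equal 1.1, so the filtering step leaves us with nothing to work with:

# Simulate (mu, y) pairs from the prior and data model
set.seed(84735)
normal_sim <- data.frame(mu = rnorm(10000, mean = 0, sd = 1)) %>%
  mutate(y = rnorm(10000, mean = mu, sd = 1))

# Try to keep only pairs whose simulated y exactly matches the observed 1.1
normal_sim %>%
  filter(y == 1.1) %>%
  nrow()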
5.5 Critiques of conjugate family models
Before we end the chapter, we want to acknowledge that conjugate family models also have drawbacks. In particular:
- A conjugate prior model isn’t always flexible enough to fit your prior understanding. For example, a Normal model is always unimodal and symmetric around the mean μ. So if your prior understanding is not symmetric or is not unimodal, then the Normal prior might not be the best tool for the job.
- Conjugate family models do not always allow you to have an entirely flat prior. While we can tune a flat Beta prior by setting α = β = 1, neither the Normal nor Gamma priors (or any proper models with infinite support) can be tuned to be totally flat. The best we can do is tune the priors to have very high variance, so that they’re almost flat.
5.6 Chapter summary
In Chapter 5, you learned about conjugacy and applied it to a few different situations. Our main takeaways for this chapter are:
- Using conjugate priors allows us to have easy-to-derive and readily interpretable posterior models.
- The Beta-Binomial, Gamma-Poisson, and Normal-Normal conjugate families allow us to analyze data Y in different scenarios. The Beta-Binomial is convenient when our data Y is the number of successes in a set of n trials, the Gamma-Poisson when Y is a count with no upper limit, and the Normal-Normal when Y is continuous.
- We can use several functions from the bayesrules package to explore these conjugate families: plot_poisson_likelihood(), plot_gamma(), plot_gamma_poisson(), summarize_gamma_poisson(), plot_normal(), plot_normal_normal(), and summarize_normal_normal().
We hope that you now appreciate the utility of conjugate priors!
5.7 Exercises
5.7.1 Practice: Gamma-Poisson
- The most common value of λ is 4, and the mean is 7.
- The most common value of λ is 10 and the mean is 12.
- The most common value of λ is 5, and the variance is 3.
- The most common value of λ is 14, and the variance is 6.
- The mean of λ is 4 and the variance is 12.
- The mean of λ is 22 and the variance is 3.
- $(y_1, y_2, y_3) = (3, 7, 19)$
- $(y_1, y_2, y_3, y_4) = (12, 12, 12, 0)$
- $y_1 = 12$
- $(y_1, y_2, y_3, y_4, y_5) = (16, 10, 17, 11, 11)$
- Tune and plot an appropriate Gamma(s, r) prior model for λ.
- What is the prior probability that the rate of text messages per hour is larger than 10? Hint: learn about pgamma().
- Plot the resulting likelihood function of λ.
- Plot the prior pdf, likelihood function, and the posterior pdf of λ.
- Use summarize_gamma_poisson() to calculate descriptive statistics for the prior and the posterior models of λ.
- Comment on how your understanding about λ changed from the prior (in the previous exercise) to the posterior based on the data you collected from your friends.
$$\begin{split} Y_i | \lambda & \stackrel{ind}{\sim} \text{Pois}(\lambda) \\ \lambda & \sim \text{Gamma}(1, 0.25) \end{split}$$

Plot and summarize our prior understanding of λ.
Why is the Poisson model a reasonable choice for our data $Y_i$?
The wwc_2019_matches data in the fivethirtyeight package includes the number of goals scored by the two teams in each 2019 Women’s World Cup match. Define, plot, and discuss the total number of goals scored per game:

library(fivethirtyeight)
data("wwc_2019_matches")
wwc_2019_matches <- wwc_2019_matches %>%
  mutate(total_goals = score1 + score2)
Identify the posterior model of λ and verify your answer using summarize_gamma_poisson().
Plot the prior pdf, likelihood function, and posterior pdf of λ. Describe the evolution in your understanding of λ from the prior to the posterior.
5.7.2 Practice: Normal-Normal
- $(y_1, y_2, y_3) = (-4.3, 0.7, -19.4)$ and σ = 10
- $(y_1, y_2, y_3, y_4) = (-12, 1.2, -4.5, 0.6)$ and σ = 6
- $(y_1, y_2) = (12.4, 6.1)$ and σ = 5
- $(y_1, y_2, y_3, y_4, y_5) = (1.6, 0.09, 1.7, 1.1, 1.1)$ and σ = 0.6
- Tune and plot an appropriate Normal prior model for μ.
- According to your plot, does it seem plausible that the FancyTech stock would increase by an average of 7.6 dollars in a day?
- Does it seem plausible that the FancyTech stock would increase by an average of 4 dollars in a day?
- What is the prior probability that, on average, the stock price goes down? Hint: pnorm().
- What is the prior probability that, on average, your stock price goes up by more than 8 dollars per day?
- Plot the corresponding likelihood function of μ.
- Plot the prior pdf, likelihood function, and the posterior pdf for μ.
- Use summarize_normal_normal() to calculate descriptive statistics for the prior and the posterior models.
- Comment on how your understanding about μ evolved from the prior (in the previous exercise) to the posterior based on the observed data.
- What is the posterior probability that, on average, the stock price goes down? Hint: pnorm().
- What is the posterior probability that, on average, your stock price goes up by more than 8 dollars per day?
- Prof. Abebe conducts the final exam and observes that his 32 students scored an average of 86 points. Calculate the posterior mean and variance of μ using the data from Prof. Abebe’s class.
- Prof. Morales conducts the final exam and observes that her 32 students scored an average of 82 points. Calculate the posterior mean and variance of μ using the data from Prof. Morales’ class.
- Next, use Prof. Abebe and Prof. Morales’ combined exams to calculate the posterior mean and variance of μ.
- Use the football data to calculate the sample mean hippocampal volume and sample size of the control subjects who have not been diagnosed with a concussion.
- Identify the posterior model of μ and verify your answer using summarize_normal_normal().
- Plot the prior pdf, likelihood function, and posterior pdf of μ. Describe the evolution in your understanding of μ from the prior to the posterior.
- Tune and plot a Normal prior for μ that reflects your friend’s understanding.
- The weather_perth data in the bayesrules package includes 1000 daily observations of 3 p.m. temperatures in Perth (temp3pm). Plot this data and discuss whether it’s reasonable to assume a Normal model for the temperature data.
- Identify the posterior model of μ and verify your answer using summarize_normal_normal().
- Plot the prior pdf, likelihood function, and posterior pdf of μ. Describe the evolution in your understanding of μ from the prior to the posterior.
Your friend Alex has read Chapter 4 of this book, but not Chapter 5. Explain to Alex why it’s difficult to simulate a Normal-Normal posterior using the simulation methods we have learned thus far.
To prove your point, try (and fail) to simulate the posterior of μ for the following model upon observing a single data point Y = 1.1:

$$\begin{split} Y | \mu & \sim N(\mu, 1^2) \\ \mu & \sim N(0, 1^2) \end{split}$$
5.7.3 General practice exercises
- $f(\theta) \propto 0.3^\theta \, 0.7^{16-\theta}$ for $\theta \in \{0, 1, 2, \ldots, 16\}$
- $f(\theta) \propto 1/\theta!$ for $\theta \in \{0, 1, 2, \ldots, \infty\}$
- $f(\theta) \propto \theta^4 (1-\theta)^7$ for $\theta \in [0, 1]$
- $f(\theta) \propto e^{-\theta^2}$ for $\theta \in (-\infty, \infty)$
- $f(\theta) \propto e^{-2\theta} \, \theta^{15}$ for $\theta > 0$
- $f(\theta) \propto e^{-(\theta - 12)^2 / 18}$ for $\theta \in (-\infty, \infty)$
- $f(\theta) \propto 0.3^\theta / \theta!$ for $\theta \in \{0, 1, 2, \ldots, \infty\}$
- What model, Normal or Poisson, should you use to model the dependence of your insect count data on the underlying insect density θ? Explain why.
- Plot the prior pdf, likelihood function, and posterior pdf of insect density θ. Comment on whether the posterior model is more in sync with the data or prior. Explain why this makes sense.
- What is the posterior mean and standard deviation of the insect density?
- Describe your posterior conclusions in context, in a way that a biologist would find helpful.
Exercise 5.19 (The Beta-Geometric model) Consider the following new Bayesian model:
$$\begin{split} Y | \theta & \sim \text{Geometric}(\theta) \\ \theta & \sim \text{Beta}(\alpha, \beta) \end{split}$$

where the Geometric model has pmf $f(y|\theta) = \theta(1-\theta)^{y-1}$ for $y \in \{1, 2, \ldots\}$.
- Derive the posterior model for θ given observed data Y = y. If possible, identify the name of the posterior model and its parameters.
- Is the Beta model a conjugate prior for the Geometric data model?