Chapter 2 Bayes’ Rule

The Collins Dictionary named “fake news” the 2017 term of the year. And for good reason. Fake, misleading, and biased news has proliferated along with online news and social media platforms which allow users to post articles with little quality control. It’s then increasingly important to develop tools that help readers flag news articles as “real” or “fake.” In the current chapter, you’ll explore how the Bayesian philosophy you played around with in Chapter 1 can help us distinguish the real from the fake. To this end, we’ll examine a sample of 150 articles, stored as fake_news in the bayesrules package, which were posted on Facebook and fact checked by five BuzzFeed journalists (Shu et al. 2017). Let’s load this dataset along with a set of packages we’ll use throughout the chapter:

# Load packages
library(bayesrules)
library(tidyverse)
library(janitor)

# Import article data
data(fake_news)

The fake_news dataset contains the full text for actual news articles, both real and fake. As such, some of these articles contain disturbing language or cover disturbing topics. Though we believe it’s important to provide our original resources here (as opposed to providing article metadata), you do not need to read the articles in order to do the analysis ahead. Please keep this in mind if you do dig deeper into this data.

In this particular collection, 40% of the articles are fake and 60% are real:

fake_news %>% 
  tabyl(type) %>% 
  adorn_totals("row")
  type   n percent
  fake  60     0.4
  real  90     0.6
 Total 150     1.0

Using this information alone, we could build a very simple news filter which uses the following rule: since most articles are real, we should read and believe all articles. This filter would certainly solve the problem of mistakenly disregarding real articles, but at the cost of reading a lot of fake news. It also only takes into account the overall rates of, not the typical features of, real and fake news. For example, suppose that the most recent article posted to a social media platform is titled: “The president has a funny secret!” Some features of this title probably set off some red flags. For example, the usage of an exclamation point might seem like an odd choice for a real news article. Our data backs up this instinct – in our article collection, roughly 28.33% (17 of 60) of fake news titles but only roughly 2.22% (2 of 90) of real news titles use an exclamation point:

# Tabulate exclamation usage and article type
fake_news %>% 
  tabyl(title_has_excl, type) %>% 
  adorn_totals("row")
 title_has_excl fake real
          FALSE   43   88
           TRUE   17    2
          Total   60   90

Thus we have two pieces of contradictory information. Our prior information suggested that any incoming article is most likely to be real. However, the exclamation point data is more consistent with fake news. Thinking like Bayesians, we know that balancing both pieces of information is important in developing a posterior understanding of whether the article is fake (Figure 2.1). Put your own Bayesian thinking to use in a quick self-quiz of your current intuition about the posterior chances that the most recent article is fake.

FIGURE 2.1: Bayesian knowledge building diagram for whether the article is fake or not

Which of the following best describes your updated, posterior understanding of whether the article is fake?

  1. The chance that this article is fake drops from 40% to 20%. The exclamation point in the title might simply indicate that the article’s author is enthusiastic about their work.
  2. The chance that this article is fake jumps from 40% to roughly 90%. Though exclamation points are more common among fake articles, let’s not forget that only 40% of articles are fake.
  3. The chance that this article is fake jumps from 40% to roughly 98%. Given that so few real articles use exclamation points, this article is most certainly fake.

The correct answer is given in the footnotes below.13 But if your intuition was incorrect, please don’t fret! By the end of Chapter 2, you will have learned how to support Bayesian thinking with rigorous Bayesian calculations using Bayes’ Rule, the aptly named foundation of Bayesian statistics. Also, a fair heads up: Chapter 2 introduces more Bayesian concepts, notation, and vocabulary than any other chapter in this book. No matter your level of previous probability experience, you should expect that you’ll need to read this chapter multiple times before it all sinks in.

  • Explore foundational probability tools such as marginal, conditional, and joint probability models and the Binomial model.

  • Conduct your first formal Bayesian analysis! You will construct your first prior and likelihood models and, from these, construct your first posterior models via Bayes’ Rule.

  • Practice your Bayesian grammar. Imagnie how dif ficult it would beto reed this bok if the authers didnt spellcheck or use proper grammar and! punctuation. In this spirit, you’ll practice the formal notation and terminology central to Bayesian grammar.

  • Simulate Bayesian models. Simulation is integral to building intuition for and supporting Bayesian analyses. You’ll conduct your first simulation, using the R statistical software, in this chapter.

2.1 Building a Bayesian model for events

The development of our simple fake news filter boils down to the study of two variables: an article’s fake vs real status and its use of exclamation points. These features can vary from article to article. Some are fake, some aren’t. Some use exclamation points, some don’t. We can represent the randomness in these variables using probability models. In this section we will build: a prior probability model for our prior understanding of whether the most recent article is fake; a likelihood model for interpreting the exclamation point data; and, eventually, a posterior probability model which summarizes the posterior plausibility that the article is fake.

2.1.1 Prior probability model

As a first step in our Bayesian analysis, we’ll formalize our prior understanding of whether the new article is fake. We earlier determined that 40% of articles are fake and 60% are real. That is, before even reading the new article, there’s a 0.4 prior probability that it’s fake and a 0.6 prior probability it’s not. We can represent this information using mathematical notation. Letting \(B\) denote the event that an article is fake and \(B^c\) (“\(B\) complement” or “B not”) denote the event that it’s not fake, we have

\[P(B) = 0.40 \;\; \text{ and } \;\; P(B^c) = 0.60 \; .\]

As a collection, \(P(B)\) and \(P(B^c)\) specify the simple prior model of fake news (Table 2.1). As a valid probability model must, it (1) accounts for all possible events (all articles must be fake or real); and (2) assigns prior probabilities to each event where these probabilities sum to one.

TABLE 2.1: Prior model of fake news.
event \(B\) \(B^c\) Total
probability 0.4 0.6 1

2.1.2 Conditional probability and likelihood

In the second step of our Bayesian analysis, we will summarize the insights from our data. Specifically, we will formalize our observation that the exclamation point data collected on the new article is more compatible with fake news than with real news. Recall that if an article is fake, then there’s a roughly 28.33% chance it uses exclamation points in the title. In contrast, if an article is real, then there’s only a roughly 2.22% chance it uses exclamation points. When stated this way, it’s clear that the occurrence of exclamation points (which we’ll label as event \(A\)) depends upon, or is conditioned upon whether the article is fake (event \(B\)). This dependence is specified by the following conditional probabilities where we read \(P(A|B)\) as the conditional probability that an article uses exclamation points (\(A\)) given that it is fake (\(B\)):

\[P(A | B) = 0.2833 \;\; \text{ and } \;\; P(A | B^c) = 0.0222.\]
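As a quick check, these two conditional probabilities can be recovered in R directly from the counts tabulated earlier (17 of 60 fake titles and 2 of 90 real titles use an exclamation point). This is only a sketch in base R; the object names are illustrative and are not part of the bayesrules package.

```r
# Counts taken from the exclamation point table above
excl_counts <- c(fake = 17, real = 2)    # titles with an exclamation point
type_counts <- c(fake = 60, real = 90)   # all titles of each type

# Conditional probability of an exclamation point given the article type:
# P(A | B) for fake articles and P(A | B^c) for real articles
p_excl_given_type <- excl_counts / type_counts
round(p_excl_given_type, 4)
```

Dividing within each article type, instead of by all 150 articles, is what makes these probabilities conditional.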

Conditional probabilities are fundamental to Bayesian analyses, so it’s worth taking a quick pause to absorb this concept. In general, the comparison of the conditional probability of \(A\) given \(B\), \(P(A|B)\), to the unconditional probability of \(A\), \(P(A)\), reveals the extent to which information about \(B\) informs our understanding of \(A\). In some cases, the certainty of an event might increase in light of additional knowledge. For example, if somebody practices the clarinet every day, then their probability of joining an orchestra’s clarinet section is higher than that among the general population14:

\[P(\text{orchestra} \; | \; \text{practice}) > P(\text{orchestra}) \; .\]

Conversely, the certainty of an event might decrease in light of additional information. For example, if you’re a fastidious hand washer, then you’re less likely to get the flu:

\[P(\text{flu} \; | \; \text{wash hands}) < P(\text{flu}) \; .\]

The order of conditioning is also important. Since they measure two different phenomena, it’s typically the case that \(P(A|B) \ne P(B|A)\). For instance, roughly 100% of puppies are adorable. Thus if the next object you pass on the street is a puppy, \(P(\text{adorable} \; | \; \text{puppy}) = 1\). However, certainly not every adorable object is a puppy, ie. \(P(\text{puppy} \; | \; \text{adorable}) < 1\).

Let’s reexamine our fake news example with these conditional concepts in place. The conditional probabilities we derived above, \(P(A | B) = 0.2833\) and \(P(A | B^c) = 0.0222\), indicate that exclamation point usage is more common among fake news than real news. Thus, in light of its use of exclamation points (\(A\)), the article is more likely to be fake (\(B\)).

It’s important to congratulate yourself here. Not only did you use conditional probability to rigorously evaluate the exclamation point data, you did some subtle mental gymnastics to do so. On its face, the conditional probability \(P(A|B)\) measures the uncertainty in event \(A\) given we know event \(B\) occurs. However, we find ourselves in the opposite situation. We know that the incoming article used exclamation points (\(A\)). What we don’t know is whether the article is fake or not (\(B\) or \(B^c\)). To distinguish their use in weighing known evidence \(A\) when \(B\) is uncertain from their use in measuring the uncertainty in \(A\) when \(B\) is known, we’ll refer to the conditional probability terms as likelihoods and utilize the following \(L(\cdot | A)\) notation to help reiterate this distinction:

\[L(B|A) := P(A|B) \;\; \text{ and } \;\; L(B^c|A) := P(A|B^c) \; .\]

When calculated, \(L(B|A)\) and \(L(B^c|A)\) provide insight into the relative likelihoods of different possible outcomes (\(B\) or \(B^c\)) given the observed data \(A\).

Probability vs likelihood

When \(B\) is known, the conditional probability function \(P(\cdot | B)\) allows us to compare the probabilities of an unknown event, \(A\) or \(A^c\), occurring with \(B\):

\[P(A|B) \; \text{ vs } \; P(A^c|B) \; .\]

When \(A\) is known, the likelihood function \(L( \cdot | A) := P(A | \cdot)\) allows us to compare the likelihoods of different unknown scenarios, \(B\) or \(B^c\), producing data \(A\):

\[L(B|A) \; \text{ vs } \; L(B^c|A) \; .\]

Thus the likelihood function provides the tool we need to evaluate the relative compatibility of events \(B\) or \(B^c\) with data \(A\).

NOTE: Please be patient with yourself here. The distinction between likelihoods and conditional probabilities is quite subtle, especially since people use these terms interchangeably in casual conversation. You will get more practice with these concepts throughout the book.
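One way to see the distinction drawn above is to notice that a conditional probability function sums to one across the unknown event, whereas a likelihood function need not. The sketch below illustrates this with the fake news numbers; the object names are our own, chosen for illustration.

```r
# Conditional probabilities from the text
p_A_given_B  <- 0.2833   # P(A | B): exclamation point given fake
p_A_given_Bc <- 0.0222   # P(A | B^c): exclamation point given real

# Conditional probability function P(. | B): B is fixed, A vs A^c is unknown
prob_given_B <- c(A = p_A_given_B, Ac = 1 - p_A_given_B)
sum(prob_given_B)   # a valid probability model, so this sums to 1

# Likelihood function L(. | A): A is fixed, B vs B^c is unknown
lik_given_A <- c(B = p_A_given_B, Bc = p_A_given_Bc)
sum(lik_given_A)    # not a probability model, so this need not sum to 1

# Relative compatibility of "fake" vs "real" with the observed exclamation point
lik_ratio <- unname(lik_given_A["B"] / lik_given_A["Bc"])
lik_ratio
```

Since the likelihoods are not probabilities of \(B\) and \(B^c\), only their relative sizes are meaningful at this stage.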

2.1.3 Normalizing constant

Above we developed a formal understanding of exclamation point usage in real vs fake news. The marginal probability of observing exclamation points across all news articles, \(P(A)\), provides an important point of comparison and sense of scale. In our quest to calculate this normalizing constant15, we’ll first use our prior and likelihood models to fill in the table below. This table summarizes the possible joint occurrences of the fake news and exclamation point variables. We encourage you to take a crack at this before reading on, utilizing the information we’ve gathered on exclamation points.

          \(B\)   \(B^c\)   Total
\(A\)
\(A^c\)
Total     0.4     0.6       1

First, focus on the \(B\) column which splits fake articles into two groups: (1) those that are fake and use exclamation points (\(A \cap B\)); and (2) those that are fake and don’t use exclamation points (\(A^c \cap B\)).16 To determine the probabilities of these joint events, note that 40% of articles are fake (\(P(B) = 0.40\)) and 28.33% of fake articles use exclamation points (\(P(A|B) = 0.2833\)). It follows that 28.33% of 40%, or 11.33%, of all articles are fake with exclamation points. That is, the joint probability of observing both \(A\) and \(B\) is

\[P(A \cap B) = P(A|B)P(B) = 0.2833 \cdot 0.40 = 0.1133 \; .\]

Further, note that since 28.33% of fake articles use exclamation points, 71.67% do not:

\[P(A^c|B) = 1 - P(A|B) = 1 - 0.2833 = 0.7167 \; .\]

Thus, 71.67% of 40% (28.67%) of all articles are fake without exclamations:

\[P(A^c \cap B) = P(A^c|B)P(B) = 0.7167 \cdot 0.40 = 0.2867 \; .\]

We can similarly break down real articles into those that do and those that don’t use exclamation points. To this end, across all articles, only 1.33% (2.22% of 60%) are real and use exclamation points whereas 58.67% (97.78% of 60%) are real without exclamation points:

\[\begin{split} P(A \cap B^c) & = P(A|B^c)P(B^c) = 0.0222 \cdot 0.60 = 0.0133 \\ P(A^c \cap B^c) & = P(A^c|B^c)P(B^c) = 0.9778 \cdot 0.60 = 0.5867.\\ \end{split}\]

In these calculations, we intuited a general formula for calculating \(P(A \cap B)\) and, through rearranging, a formula for \(P(A|B)\).

Calculating joint and conditional probabilities

For any events \(A\) and \(B\), the joint probability of \(A \cap B\) can be calculated by weighting the conditional probability of \(A\) given \(B\) by the marginal probability of \(B\):

\[\begin{equation} P(A \cap B) = P(A | B)P(B) \tag{2.1} \end{equation}\]

Dividing both sides of (2.1) by \(P(B)\), and assuming \(P(B) \ne 0\), reveals the definition of the conditional probability of \(A\) given \(B\):

\[\begin{equation} P(A | B) = \frac{P(A \cap B)}{P(B)} \; . \tag{2.2} \end{equation}\]

Thus to evaluate the chance that \(A\) occurs in light of information \(B\), we can consider the chance that they occur together (\(P(A \cap B)\)) relative to the chance that \(B\) occurs at all (\(P(B)\)).

Table 2.2 summarizes our new understanding of the joint behavior of our two article variables. The fact that the grand total of this table is one confirms that our calculations are reasonable! Table 2.2 also provides the point of comparison we sought: across all news articles, the probability of observing exclamation points is \(P(A) \approx 0.127\) (within rounding).

TABLE 2.2: A joint probability model of the fake status and exclamation point usage across all articles.
          \(B\)    \(B^c\)   Total
\(A\)     0.1133   0.0133    0.1266
\(A^c\)   0.2867   0.5867    0.8734
Total     0.4      0.6       1
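If you’d like to check the joint probability table by computer, it can be rebuilt in a few lines of base R by applying (2.1) to each cell. This is a minimal sketch using the rounded prior and conditional probabilities from the text; the object names are illustrative.

```r
# Prior model and conditional probabilities from the text
prior      <- c(B = 0.40, Bc = 0.60)        # P(B), P(B^c)
likelihood <- c(B = 0.2833, Bc = 0.0222)    # P(A | B), P(A | B^c)

# Joint probabilities via (2.1): P(A and B) = P(A | B) P(B), and so on
joint <- rbind(A  = likelihood * prior,
               Ac = (1 - likelihood) * prior)

# Append the row and column totals
joint_table <- cbind(joint, Total = rowSums(joint))
joint_table <- rbind(joint_table, Total = colSums(joint_table))
round(joint_table, 4)
```

The Total entry in the A row is the marginal probability \(P(A)\), and the bottom row recovers the prior model.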

Let’s take a step back and consider the theory behind the calculation of \(P(A)\). We started by recognizing the two ways that an article can use exclamation points: if it is fake (\(A \cap B\)) and if it is not fake (\(A \cap B^c\)). Thus the total probability of observing \(A\) is the combined probability of its parts:

\[P(A) = P(A \cap B) + P(A \cap B^c).\]

By (2.1), we can compute the two pieces of this puzzle using the information we have about exclamation point usage among fake and real news (\(P(A|B)\) and \(P(A|B^c)\)) as well as the prior probabilities of fake and real news (\(P(B)\) and \(P(B^c)\)):

\[\begin{equation} \begin{split} P(A) & = P(A \cap B) + P(A \cap B^c) \\ & = P(A|B)P(B) + P(A|B^c)P(B^c) \; . \end{split} \tag{2.3} \end{equation}\]

Finally, plugging in, we can confirm that roughly 12.7% of all articles use exclamation points: \(P(A) = 0.2833 \cdot 0.40 + 0.0222 \cdot 0.60 \approx 0.127\). The formula we’ve built to calculate \(P(A)\) here is a special case of the aptly named Law of Total Probability (LTP).

2.1.4 Posterior probability model (via Bayes’ Rule!)

We’re now in a position to answer the ultimate question: What’s the probability that the latest article is fake? Formally speaking, we aim to calculate the posterior probability that the article is fake given that it uses exclamation points, \(P(B|A)\). To build some intuition, let’s revisit Table 2.2. Since our article uses exclamation points, we can zoom in on the 12.66% of articles that fall into the \(A\) row. Among these articles, proportionally 89.5% (0.1133 / 0.1266) are fake and 10.5% (0.0133 / 0.1266) are real. This provides us with the answer we were seeking: there’s an 89.5% posterior chance that this latest article is fake.

Stepping back from the details that led up to this final calculation, we accomplished something big: we built Bayes’ Rule from scratch! In short, Bayes’ Rule provides the framework for defining a posterior model for an event \(B\) from two pieces: the prior probability of \(B\) and the likelihood of \(B\) in light of data \(A\).

Bayes’ Rule for events

For events \(A\) and \(B\), the posterior probability of \(B\) given \(A\) follows by combining (2.2) with (2.1):

\[\begin{equation} P(B |A) = \frac{P(A \cap B)}{P(A)} = \frac{P(B)L(B|A)}{P(A)} \tag{2.4} \end{equation}\]

where \(L(B|A) = P(A|B)\) and by the Law of Total Probability (2.3),

\[P(A) = P(A|B)P(B) + P(A|B^c)P(B^c) \; .\]

More generally,

\[\text{posterior} = \frac{\text{prior } \cdot \text{ likelihood}}{\text{normalizing constant}} \; .\]

Bayes’ Rule provides the mechanism we need to put our Bayesian thinking into practice. To convince ourselves, let’s directly apply Bayes’ Rule to our news analysis. Into (2.4), we can plug in the prior information that 40% of articles are fake, the 28.33% likelihood that a fake article would use exclamation points, and the 12.66% marginal probability of observing exclamation points across all articles. By this calculation, the posterior probability that the incoming article is fake is roughly 0.895 (just as we calculated from Table 2.2):

\[P(B|A) = \frac{P(B)L(B|A)}{P(A)} = \frac{ 0.40 \cdot 0.2833}{0.1266} \approx 0.895 \;.\]

The complete posterior model is summarized in Table 2.3 along with the prior model for comparison. This table reveals the journey that has been our news analysis. We started this analysis with a prior understanding that there’s only a 40% chance that the incoming article, titled “The president has a funny secret!”, is fake. Yet upon observing the use of an exclamation point in the title, a feature that’s more common to fake news, our posterior understanding evolved quite a bit – the chance that the article is fake jumped to 89.5%.

TABLE 2.3: Prior and posterior models of fake news.
event \(B\) \(B^c\) Total
prior probability 0.4 0.6 1
posterior probability 0.895 0.105 1
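The posterior calculation can also be sketched as a small R function that combines Bayes’ Rule (2.4) with the Law of Total Probability (2.3). Note that bayes_rule() is a hypothetical helper written for this illustration; it is not a function from the bayesrules package, and it handles only the two-event case (\(B\) vs \(B^c\)).

```r
# Posterior probability of B given A for a two-event model (B vs B^c).
# bayes_rule() is a hypothetical helper, not part of any package.
bayes_rule <- function(prior_B, lik_B, lik_Bc) {
  # Normalizing constant P(A) via the Law of Total Probability (2.3)
  p_A <- lik_B * prior_B + lik_Bc * (1 - prior_B)
  # Bayes' Rule (2.4): prior * likelihood / normalizing constant
  prior_B * lik_B / p_A
}

# Fake news example with the rounded inputs used in the text
posterior_fake <- bayes_rule(prior_B = 0.40, lik_B = 0.2833, lik_Bc = 0.0222)
round(c(B = posterior_fake, Bc = 1 - posterior_fake), 3)

# The same calculation with exact fractions from the article counts
posterior_exact <- bayes_rule(prior_B = 60 / 150, lik_B = 17 / 60, lik_Bc = 2 / 90)
posterior_exact
17 / 19   # share of the 19 exclamation point titles that are fake
```

With exact fractions, the posterior matches the share of exclamation point titles in the sample that are fake (17 of 19), which is what we’d expect since the prior and likelihoods here were computed from that same collection of articles.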

2.1.5 Posterior simulation

It’s important to keep in mind that the probability models we built for our news analysis above are just that - models. They provide theoretical representations of what we observe in practice. To build some intuition for the connection between the articles that might actually be posted to social media and their corresponding models, let’s run a quick simulation. First, define the possible article type (real or fake) and their corresponding prior probabilities:

# Define possible articles
article <- data.frame(type = c("real", "fake"))

# Define the prior model
prior <- c(0.6, 0.4)

To simulate the articles that might be posted to your social media, we can use the sample_n() function in the dplyr package (Wickham et al. 2020) to randomly sample rows from the article data frame. In doing so, we must specify the sample size and that the sample should be taken with replacement (replace = TRUE). Sampling with replacement ensures that we start with a fresh set of possibilities for each article – any article can either be fake or real. Finally, we set weight = prior to specify that there’s a 60% chance an article is real and a 40% chance it’s fake. To try this out, “run” the following code multiple times, each time simulating three articles.

# Simulate 3 articles
sample_n(article, size = 3, weight = prior, replace = TRUE)

Notice that you can get different results every time you run the code above. That’s because simulation, like articles, is random! Specifically, behind the R curtain is a random number generator (RNG) that’s in charge of producing random samples. Every time we ask for a new sample, the RNG “starts” at a new place: the random seed. Starting at different seeds can thus produce different samples. This is a great thing in general – random samples should be random. However, within a single analysis, we want to be able to reproduce our random simulation results (ie. we don’t want the fine points of our results to change every time we re-run our code or for different readers to get different results). We can achieve this reproducibility by setting the seed, or specifying a starting point for the RNG, using the set.seed() function applied to a positive integer (here 84735). Try running the below code a few times and notice that the results are always the same – the first two articles are fake and the third is real:17

# Set the seed. Simulate 3 articles.
set.seed(84735)
sample_n(article, size = 3, weight = prior, replace = TRUE)
1 fake
2 fake
3 real
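To convince yourself that the seed is what drives reproducibility, you can draw the same sample twice and compare. The sketch below uses base R’s sample() on a vector of article types in place of sample_n(), so the particular articles drawn here need not match the output above.

```r
# Article types and their prior probabilities
types <- c("real", "fake")
prior <- c(0.6, 0.4)

# Same seed, same sample
set.seed(84735)
draw_1 <- sample(types, size = 3, replace = TRUE, prob = prior)
set.seed(84735)
draw_2 <- sample(types, size = 3, replace = TRUE, prob = prior)
identical(draw_1, draw_2)   # TRUE

# Without resetting the seed, the RNG simply continues from where it left off,
# so the next draw is a fresh sample that may or may not differ
draw_3 <- sample(types, size = 3, replace = TRUE, prob = prior)
```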

Now that we understand how to simulate a few articles, let’s dream bigger: simulate 10,000 articles and store the results in article_sim.

# Simulate 10000 articles.
set.seed(84735)
article_sim <- sample_n(article, 
  size = 10000, weight = prior, replace = TRUE)

The 10,000 simulated articles are summarized in the bar graph below. Reflecting the model from which these articles were generated, roughly 40% are fake.

# Plot the outcomes
ggplot(article_sim, aes(x = type)) + 
  geom_bar()

Specifically, \(4031\) of the 10,000 simulated articles were fake:

# Tally the outcomes
article_sim %>% 
  tabyl(type) %>% 
  adorn_totals("row")
  type     n percent
  fake  4031  0.4031
  real  5969  0.5969
 Total 10000  1.0000

Next, let’s simulate the exclamation point usage among these 10,000 articles. The likelihood variable defined below specifies that there’s a 28.33% chance that any fake article and a 2.22% chance that any real article uses exclamation points:

article_sim <- article_sim %>% 
  mutate(likelihood = case_when(
    type == "fake" ~ 0.2833,
    type == "real" ~ 0.0222))

glimpse(article_sim)

Rows: 10,000
Columns: 2
$ type       <chr> "fake", "fake", "real", "fake", "…
$ likelihood <dbl> 0.2833, 0.2833, 0.0222, 0.2833, 0…

From this likelihood, we simulate exclamation point data for each article. This syntax is a bit more complicated. First, the group_by() statement specifies that the exclamation point simulation is to be performed separately for each of the 10,000 articles. Second, we use sample() to simulate the exclamation point data, no or yes, based on the likelihood and store the results as usage. Note that sample() is similar to sample_n() but samples values from vectors instead of rows from data frames.

# Define whether there are exclamation points
data <- c("no", "yes")

# Simulate exclamation point usage 
article_sim <- article_sim %>%
  group_by(1:n()) %>% 
  mutate(usage = sample(data, size = 1, 
    prob = c(1 - likelihood, likelihood)))

The article_sim data frame now contains 10,000 simulated articles with different features:

ggplot(article_sim, aes(x = usage)) + 
  geom_bar() + 
  facet_wrap(~ type)

The patterns in the plot above reflect the underlying likelihood model. Specifically, roughly 28% (1120 / 4031) of fake articles and 2% (136 / 5969) of real articles use exclamation points. Our 10,000 simulated articles now reflect the features of our prior and likelihood. In turn, we can use them to approximate the posterior probability that the latest article is fake. We start by filtering out just those simulated articles that match our data, ie. those that use exclamation points:

article_sim %>% 
  filter(usage == "yes") %>% 
  tabyl(type) %>% 
  adorn_totals("row")
  type    n percent
  fake 1120  0.8917
  real  136  0.1083
 Total 1256  1.0000

Among the 1256 simulated articles that use exclamation points, roughly \(89.2\)% are fake. This approximation is quite close to the actual posterior probability of roughly 0.895! Of course, our posterior assessment of this latest article would change if we had seen different data, ie. if the article didn’t use exclamation points. The plot below reveals a simple rule: If an article uses exclamation points, it’s most likely fake. Otherwise, it’s most likely real (and we should read it!).

ggplot(article_sim, aes(x = type)) + 
  geom_bar() + 
  facet_wrap(~ usage)
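For readers who want to rerun the whole simulation without any packages, here is one possible base R sketch of the same three steps: simulate from the prior, simulate data from the likelihood, then keep only the simulated articles that match the observed data. It swaps in rbinom() for the row-by-row sample() call above, so its counts will not match the 1120 and 136 reported in the text, though the approximate posterior should land near 0.895.

```r
set.seed(84735)
n_sim <- 10000

# Step 1: simulate article types from the prior model
type <- sample(c("real", "fake"), size = n_sim, replace = TRUE,
               prob = c(0.6, 0.4))

# Step 2: simulate exclamation point usage from the likelihood,
# using a different success probability for fake and real articles
p_excl <- ifelse(type == "fake", 0.2833, 0.0222)
usage  <- ifelse(rbinom(n_sim, size = 1, prob = p_excl) == 1, "yes", "no")

# Step 3: among simulated articles that use exclamation points,
# the proportion of each type approximates the posterior model
posterior_sim <- prop.table(table(type[usage == "yes"]))
posterior_sim
```

Because rbinom() accepts a vector of probabilities, each article gets its own chance of using an exclamation point without any grouping step.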

2.2 Example: Iowa caucuses

Let’s put Bayes’ Rule into action in another example. Every four years, Americans go to the polls to cast their vote for President of the United States. Yet there’s a lot of action before this final vote. Through a series of state-level primaries and caucuses that span several months, each political party (eg: Democrat, Republican, Green) narrows down a pool of potential candidates to just one nominee. The first caucuses are held in the state of Iowa and are watched very closely as a harbinger of things to come.

Consider the following scenario. “Michelle” has decided to run for president in the next election and is competing with ten other candidates for her party’s nomination. Going into the election, 30% of voters in a national poll supported Michelle. Yet counter to these low prior expectations, she went on to win the Iowa caucuses. (Congratulations, Michelle!) What are we to think now? That is, weighing the prior election information with the data that Michelle won in Iowa, what’s the chance that she goes on to earn her party’s nomination? We need a bit more detail in order to answer this question. To begin, let \(N\) be the event that Michelle earns her party’s nomination and \(I\) be the event that she wins the Iowa caucus. Our goal then is to calculate the posterior probability \(P(N|I)\) which, by Bayes’ Rule (2.4), is

\[\begin{equation} P(N | I) = \frac{P(N)L(N|I)}{P(I)} \; . \tag{2.5} \end{equation}\]

Consider \(P(N)\), the prior probability that Michelle goes on to secure her party’s nomination (prior to observing her win in Iowa). There are several approaches to defining \(P(N)\). For one, if we didn’t have any information about Michelle, it would be reasonable to say that she has a roughly 9% chance of being nominated since she is one of eleven candidates. Or we might utilize the fact that Michelle’s national support hovers at 30%. We’ll explore trade-offs in building priors later in the book, but there is no right or wrong answer here. For the purposes of this example, let’s go with the pollsters’ prior probability:

\[P(N) = 0.3 \; . \]

Next, consider \(L(N|I) = P(I|N)\), the likelihood that Michelle goes on to win the nomination knowing that she won the Iowa caucus. We need a little Google help on this one. A quick search18 reveals that in the presidential elections from 1980 to 2020, 16 of the 24 candidates that went on to win their party’s nomination had started with a win in Iowa. In contrast, only 8 of the 73 major candidates that lost their party’s nomination had won in Iowa. From this information, we can specify the likelihood that Michelle will go on to win or lose given that she won in Iowa as

\[\begin{split} L(N|I) & := P(I|N) = \frac{16}{24} \approx 0.667 \\ L(N^c|I) & := P(I|N^c) = \frac{8}{73} \approx 0.110 \; . \end{split}\]

Thus her win in Iowa is far more compatible with Michelle eventually winning the nomination (likelihood 0.667) than with her losing it (likelihood 0.110). We are now poised to update our prior understanding of her chances with her observed win in Iowa. Before we do this together, we encourage you to check in with your intuition or, better yet, take a crack at calculating the posterior probability \(P(N|I)\) on your own.

What’s the posterior probability that Michelle will secure her party’s nomination?

  1. Roughly 75%. Michelle’s win bodes very well.
  2. Roughly 50%. Michelle’s chances inched up just a bit.
  3. Roughly 30%. Michelle’s win doesn’t change your mind.

Let’s check your work. By the Law of Total Probability (2.3), the marginal probability that Michelle would win Iowa across both nomination scenarios (win or lose) is roughly 28%:

\[\begin{split} P(I) & = P(I|N)P(N) + P(I|N^c)P(N^c) \\ & = \frac{16}{24} \cdot 0.30 + \frac{8}{73} \cdot 0.70 \\ & \approx 0.277 \; . \\ \end{split}\]

It follows by plugging into (2.5) that there’s a roughly 72% posterior chance that Michelle goes on to win her party’s nomination:

\[P(N | I) = \frac{0.30 \cdot \frac{16}{24}}{\frac{16}{24} \cdot 0.30 + \frac{8}{73} \cdot 0.70}\; \approx 0.723 \; .\]

Soak it in. Upon observing Michelle’s win in Iowa, which has been a historical precursor to a nomination, her chances of securing the nomination more than doubled from a prior probability of 30% to a posterior probability of 72%. Things are really looking up for Michelle!
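The two calculations above are easy to reproduce in R. This is a minimal sketch using the prior and likelihoods stated in this example; the object names are our own.

```r
# Prior and likelihoods from the Iowa caucuses example
prior_N <- 0.30      # P(N): prior probability of winning the nomination
lik_N   <- 16 / 24   # L(N | I) = P(I | N)
lik_Nc  <- 8 / 73    # L(N^c | I) = P(I | N^c)

# Normalizing constant P(I) by the Law of Total Probability (2.3)
p_I <- lik_N * prior_N + lik_Nc * (1 - prior_N)

# Posterior P(N | I) by Bayes' Rule (2.5)
posterior_N <- prior_N * lik_N / p_I

round(c(p_I = p_I, posterior_N = posterior_N), 3)
```

Changing prior_N is a quick way to explore how sensitive the posterior is to the choice of prior.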

2.3 Building a Bayesian model for random variables

In our Bayesian analyses above, we constructed posterior models for categorical variables. In the fake news analysis, we examined the categorical status of an article: fake or real. In Michelle’s case, we examined the categorical outcome of her nomination: win or lose. However, it’s often the case in a Bayesian analysis that our outcomes of interest are numerical. Though some of the details will change, the same Bayes’ Rule principles we built above generalize to the study of numerical random variables.

2.3.1 Prior probability model

In 1996, world chess champion (and human!) Garry Kasparov played a much anticipated six-game chess match against an IBM supercomputer named Deep Blue. Of the six games, Kasparov won three, drew (tied in) two, and lost one. Thus Kasparov won the overall match, preserving the notion that machines don’t perform as well as humans when it comes to chess. Yet Kasparov and Deep Blue were to meet again for a six-game match in 1997. Let \(\pi\) denote Kasparov’s (numerical) chances of winning any particular game in the re-match, thus a measure of his overall skill relative to Deep Blue.19 Given the complexity of chess, machines, and humans, \(\pi\) is unknown and can vary or fluctuate over time. Or, in short, \(\pi\) is a random variable.

As usual, our analysis of random variable \(\pi\) will start with a prior model which (1) identifies what values \(\pi\) can take; and (2) assigns a prior weight to each. Consider the prior model defined in Table 2.4. We’ll get into how we might build such a prior in later chapters. For now, let’s focus on interpreting and utilizing the given prior.

TABLE 2.4: Prior model of \(\pi\), Kasparov’s chance of beating Deep Blue.
\(\pi\) 0.2 0.5 0.8 Total
\(f(\pi)\) 0.10 0.25 0.65 1

The first thing you might notice is that this model greatly simplifies reality.20 Though Kasparov’s win probability \(\pi\) can technically be any number from zero to one, this prior assumes that \(\pi\) has a discrete set of possibilities: Kasparov’s win probability is either 20%, 50%, or 80%. Next, examine the probability mass function (pmf) \(f(\cdot)\) which specifies the prior probability of each possible \(\pi\) value. This pmf reflects the prior understanding that Kasparov learned a lot about Deep Blue’s game strategy in 1996, and so will most likely improve in 1997. Specifically, this pmf places a 65% chance on Kasparov’s win probability jumping to \(\pi = 0.8\) (\(f(\pi = 0.8) = 0.65\)) and only a 10% chance on his win probability dropping to \(\pi = 0.2\) (\(f(\pi = 0.2) = 0.10\)).

Discrete probability models

Let \(Y\) be a discrete random variable with pmf \(f(y)\). Then the pmf defines the probability of any given \(y\), \(f(y) = P(Y = y)\), and has the following properties:

  • \(\sum_{\text{all } y} f(y) = 1\)
  • \(0 \le f(y) \le 1\) for all values of \(y\) in the range of \(Y\)
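As a small illustration, we can confirm in R that the prior model of \(\pi\) in Table 2.4 satisfies both pmf properties. The object names below are illustrative.

```r
# Prior model of pi from Table 2.4
pi_values <- c(0.2, 0.5, 0.8)
prior_pmf <- c(0.10, 0.25, 0.65)

# Property 1: the probabilities sum to one (up to floating point error)
sums_to_one <- isTRUE(all.equal(sum(prior_pmf), 1))

# Property 2: each probability lies between 0 and 1
in_range <- all(prior_pmf >= 0 & prior_pmf <= 1)

c(sums_to_one = sums_to_one, in_range = in_range)

# The a priori most plausible value of pi
pi_values[which.max(prior_pmf)]
```

Using all.equal() in place of == guards against tiny rounding errors when adding decimals.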

2.3.2 A Binomial likelihood

In the second step of our Bayesian analysis, we’ll collect and process data which can inform our understanding of \(\pi\), Kasparov’s skill level relative to that of Deep Blue. Here, our data \(Y\) is the number of the six games in the 1997 re-match that Kasparov wins.21 Since the chess match outcome isn’t predetermined, \(Y\) is a random variable that can take any value in \(\{0,1,...,6\}\). Further, \(Y\) inherently depends upon Kasparov’s win probability \(\pi\). If \(\pi\) were 0.80, Kasparov’s victories \(Y\) would also tend to be high. If \(\pi\) were 0.20, \(Y\) would tend to be low. For our formal Bayesian analysis, we must model this dependence of \(Y\) on \(\pi\). That is, we must develop a conditional probability model of how \(Y\) depends upon or is conditioned upon the value of \(\pi\).

Conditional probability model

Let \(Y\) be a discrete random variable and \(\pi\) be a parameter upon which \(Y\) depends. Then the conditional probability model of \(Y\) given \(\pi\) is specified by conditional pmf \(f(y|\pi)\). Here, \(f(y|\pi)\) is the conditional probability of observing \(y\) given \(\pi\), \(f(y|\pi) = P((Y = y) | \pi)\), and has the following properties:

  • \(\sum_{\text{all } y} f(y|\pi) = 1\)
  • \(0 \le f(y|\pi) \le 1\) for all values of \(y\) in the range of \(Y\)

In modeling the dependence of \(Y\) on \(\pi\) in our chess example, we first make two assumptions about the chess match: the outcome of any one game doesn’t influence the outcome of another, i.e., games are independent; and Kasparov has an equal probability, \(\pi\), of winning each game in the match. This is a common framework in statistical analysis, one which can be represented by the Binomial model. We summarize the Binomial model below.22

The Binomial model

Let random variable \(Y\) be the number of successes in a fixed number of trials \(n\). Assume that the trials are independent and that the probability of success in each trial is \(\pi\). Then the conditional dependence of \(Y\) on \(\pi\) can be modeled by the Binomial model with parameters \(n\) and \(\pi\). In mathematical notation:

\[Y | \pi \sim \text{Bin}(n,\pi) \]

where “\(\sim\)” can be read as “modeled by.” Correspondingly, the Binomial model is specified by conditional pmf

\[\begin{equation} f(y|\pi) = {n \choose y} \pi^y (1-\pi)^{n-y} \;\; \text{ for } y \in \{0,1,2,\ldots,n\} \tag{2.6} \end{equation}\]

where \({n \choose y} = \frac{n!}{y!(n-y)!}\).

The dependence of Kasparov’s victories \(Y\) in \(n = 6\) games on his win probability \(\pi\) follows a Binomial model,

\[Y | \pi \sim \text{Bin}(6,\pi)\]

with conditional pmf

\[\begin{equation} f(y|\pi) = {6 \choose y} \pi^y (1 - \pi)^{6 - y} \;\; \text{ for } y \in \{0,1,2,3,4,5,6\} \; . \tag{2.7} \end{equation}\]

The conditional pmf \(f(y|\pi)\) summarizes the conditional probability of observing any number of wins \(Y = y\) for any given win probability \(\pi\). For example, if Kasparov’s underlying chance of beating Deep Blue were \(\pi = 0.8\), then there’s a roughly 26% chance he’d win all six games:

\[f((y = 6) | (\pi = 0.8)) = {6 \choose 6} 0.8^6 (1 - 0.8)^{6 - 6} = 1 \cdot 0.8^6 \cdot 1 \approx 0.26 \; .\]
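
If you'd like to confirm this calculation in R, here is a minimal sketch. It evaluates (2.7) by hand and with the base R function `dbinom()`, and then checks that the conditional pmf for \(\pi = 0.8\) sums to one across \(y\). The object names are our own.

```r
# f(y = 6 | pi = 0.8) from (2.7), calculated by hand
by_hand <- choose(6, 6) * 0.8^6 * (1 - 0.8)^(6 - 6)

# The same probability using the built-in Binomial pmf
built_in <- dbinom(6, size = 6, prob = 0.8)

# The entire conditional pmf f(y | pi = 0.8) for y = 0, 1, ..., 6
pmf_08 <- dbinom(0:6, size = 6, prob = 0.8)

by_hand       # roughly 0.26
sum(pmf_08)   # the masses sum to one
```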

Figure 2.2 plots the conditional pmfs \(f(y|\pi)\), thus the random trends in \(Y\), under each possible value of Kasparov’s win probability \(\pi\). These plots confirm our intuition that Kasparov’s victories \(Y\) would tend to be low if Kasparov’s win probability \(\pi\) were low (far left) and high if \(\pi\) were high (far right).

FIGURE 2.2: The pmf of a Bin(6, \(\pi\)) model is plotted for each possible value of \(\pi \in \{0.2, 0.5, 0.8\}\). The masses marked by the blue dashed lines correspond to the eventual observed data, \(Y = 1\) win.

In the end, Kasparov only won one of the six games against Deep Blue in 1997 (\(Y = 1\)). The next step in our Bayesian analysis is to determine how compatible the various possible \(\pi\) are with this data. Put another way, we want to evaluate the likelihood of each possible \(\pi\) given that Kasparov won \(Y = 1\) game. It turns out that the likelihood is staring us straight in the face! Extracting only the masses in Figure 2.2 that correspond to our observed data, \(Y = 1\), reveals the likelihood function of \(\pi\) (Figure 2.3).

FIGURE 2.3: The likelihood function \(L(\pi|(y = 1))\) of observing \(Y = 1\) win in six games for any win probability \(\pi \in \{0.2, 0.5, 0.8\}\).

The formula for the likelihood function follows from evaluating the conditional pmf \(f(y|\pi)\) in (2.7) at the observed data \(Y = 1\). For \(\pi \in \{0.2,0.5,0.8\}\),

\[L(\pi | (y = 1)) = f((y=1) | \pi) = {6 \choose 1} \pi^1 (1-\pi)^{6-1} = 6\pi(1-\pi)^5 \; . \]

For example, given that Kasparov only won one game, there’s only a 0.0015 likelihood that he’s the superior player (i.e., that \(\pi = 0.8\)):

\[L((\pi = 0.8) | (y = 1)) = 6\cdot 0.8 \cdot (1-0.8)^5 \approx 0.0015\; . \]

Table 2.5 summarizes the likelihood evaluated at each possible value of \(\pi\). There are some not-to-miss details here. First, though it is equivalent in formula to the conditional pmf \(f((y=1) | \pi)\), we denote the likelihood function of \(\pi\) given the observed data \(Y = 1\) as \(L(\pi | (y = 1))\). This emphasizes that the likelihood is a function of the unknown win probability \(\pi\). Further, note that the likelihood function does not sum to one across \(\pi\), thus is not a probability model. (Mental gymnastics!) Rather, it provides a mechanism by which to compare the compatibility of different \(\pi\) with the observed data \(Y = 1\).

TABLE 2.5: Likelihood function of \(\pi\) given Kasparov won \(Y=1\) of six games.
\(\pi\) 0.2 0.5 0.8
\(L(\pi | (y=1))\) 0.3932 0.0938 0.0015
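
The entries of Table 2.5 can be reproduced with a single call to `dbinom()`, holding the data fixed at \(y = 1\) and varying \(\pi\). This is a sketch in base R with object names of our choosing.

```r
# Possible win probabilities
pi_vals <- c(0.2, 0.5, 0.8)

# Likelihood L(pi | y = 1): the Bin(6, pi) pmf evaluated at y = 1 for each pi
likelihood <- dbinom(1, size = 6, prob = pi_vals)

data.frame(pi = pi_vals, likelihood = likelihood)

# Unlike a pmf, the likelihood does not sum to one across pi
sum(likelihood)
```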

Putting this all together, the likelihood function summarized in Figure 2.3 and Table 2.5 illustrates that Kasparov being the weaker player (\(\pi = 0.2\)) is the most likely scenario in light of his one game win. In fact, it’s nearly impossible that Kasparov would only win one game if his win probability against Deep Blue were as high as \(\pi = 0.8\). Thus if we ignored the prior information, our best guess would be that Deep Blue is now the dominant player. But of course, as Bayesians, we’ll weigh the balance between our prior and likelihood.

Probability mass functions vs likelihood functions

When \(\pi\) is known, the conditional pmf \(f(\cdot | \pi)\) allows us to compare the probabilities of different possible values of data \(Y\) (e.g., \(y_1\) or \(y_2\)) occurring under that \(\pi\):

\[f(y_1|\pi) \; \text{ vs } \; f(y_2|\pi) \; .\]

When \(Y=y\) is known, the likelihood function \(L(\cdot | y) = f(y | \cdot)\) allows us to compare the relative likelihoods of different possible values of \(\pi\) (e.g., \(\pi_1\) or \(\pi_2\)) given that we observed data \(y\):

\[L(\pi_1|y) \; \text{ vs } \; L(\pi_2|y) \; .\]

Thus \(L(\cdot | y)\) provides the tool we need to evaluate the relative compatibility of data \(Y=y\) with variable \(\pi\).
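
One way to see the distinction is to lay out \(f(y|\pi)\) for the chess example as a grid, with one row per \(\pi\) and one column per \(y\). In the sketch below (base R, object names our own), each row is a conditional pmf and sums to one, whereas a column, which is a likelihood function, generally does not.

```r
pi_vals <- c(0.2, 0.5, 0.8)

# 3 x 7 grid of f(y | pi): rows index pi, columns index y = 0, ..., 6
f <- t(sapply(pi_vals, function(p) dbinom(0:6, size = 6, prob = p)))
dimnames(f) <- list(pi = pi_vals, y = 0:6)

rowSums(f)   # conditional pmfs: each row sums to one
colSums(f)   # likelihood functions: columns need not sum to one
f[, "1"]     # the likelihood function L(pi | y = 1)
```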

2.3.3 Normalizing constant

Bayes’ Rule, our mechanism for combining information from the prior model of \(\pi\) with evidence about \(\pi\) from the data or likelihood, requires three pieces of information: the prior, likelihood, and a normalizing constant which summarizes the marginal probability of the data. We’ve taken care of the first two, now let’s take care of the third. To this end, we must determine the total probability that Kasparov would win \(Y = 1\) game across all possible win probabilities \(\pi\), \(f(y = 1)\). Just as we did in our other examples, we can appeal to the Law of Total Probability (LTP). The idea is this. Kasparov’s one win outcome (\(Y = 1\)) is the sum of its parts: the three possible scenarios in which we observe \(Y = 1\) with a win probability \(\pi\) that’s either 0.2, 0.5, or 0.8 weighted by the prior probabilities of these \(\pi\) values. Taking a little leap from (2.3), a special case of the LTP for events, this means that

\[f(y = 1) = \sum_{\pi \in \{0.2,0.5,0.8\}} f((y=1) | \pi) f(\pi)\]

or, expanding the summation \(\Sigma\),

\[\begin{equation} \begin{split} f(y = 1) & = f((y=1) | (\pi = 0.2)) f(\pi = 0.2) + f((y=1) | (\pi = 0.5)) f(\pi = 0.5) \\ & \hspace{.2in} + f((y=1) | (\pi = 0.8)) f(\pi = 0.8) \\ & \approx 0.3932 \cdot 0.10 + 0.0938 \cdot 0.25 + 0.0015 \cdot 0.65 \\ & \approx 0.0637 \; . \\ \end{split} \tag{2.8} \end{equation}\]

Thus, across all possible \(\pi\), there’s only a roughly 6% chance that Kasparov would have won only one game. It would, of course, be great if this all clicked. But if it doesn’t, please don’t let this calculation discourage you from moving forward. We’ll see later in the chapter that a magical shortcut allows us to altogether bypass this calculation.
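
For those who want to check (2.8) numerically, the following base R sketch computes the same weighted sum. Note that it uses unrounded likelihoods, so the result (about 0.06376) differs very slightly from the 0.0637 obtained above from the rounded entries of Table 2.5.

```r
pi_vals <- c(0.2, 0.5, 0.8)
prior   <- c(0.10, 0.25, 0.65)

# Likelihood of each pi given y = 1 win in 6 games
likelihood <- dbinom(1, size = 6, prob = pi_vals)

# Law of Total Probability: weight each likelihood by its prior probability
normalizing_constant <- sum(likelihood * prior)
normalizing_constant
```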

2.3.4 Posterior probability model

Figure 2.4 summarizes what we know thus far and where we have yet to go. Heading into their 1997 re-match, our prior model suggested that Kasparov’s win probability against Deep Blue was high (left plot). But! Then he only won one of six games. In light of this result, it’s most likely that Kasparov’s win probability is low (middle plot). Our updated, posterior model (right plot) of Kasparov’s win probability will balance this prior and likelihood. Specifically, our formal calculations below will verify that Kasparov’s chances of beating Deep Blue most likely dipped to \(\pi = 0.20\) between 1996 and 1997. It’s also relatively possible that his 1997 losing streak was a fluke, and that he’s more evenly matched with Deep Blue (\(\pi = 0.50\)). In contrast, it’s highly unlikely that Kasparov is the superior player (\(\pi = 0.80\)).

FIGURE 2.4: The prior model of \(\pi\) (left), likelihood function of \(\pi\) (middle), and posterior model of \(\pi\) (right). The y-axis scales are omitted for ease of comparison across scales.

The posterior model plotted in Figure 2.4 is specified by the conditional posterior pmf

\[f(\pi | (y = 1)) \; .\]

Conceptually, \(f(\pi | (y=1))\) is the posterior probability of a given win probability \(\pi\) given that Kasparov only won one of six games against Deep Blue. Thus defining the posterior \(f(\pi | (y = 1))\) isn’t much different than it was in our previous examples. Just as you might hope, Bayes’ Rule still holds:

\[\text{ posterior } = \; \frac{\text{ prior } \cdot \text{ likelihood }}{\text{normalizing constant}} \; .\]

In the chess setting, we can translate this as

\[\begin{equation} f(\pi | (y=1)) = \frac{f(\pi)L(\pi|(y=1))}{f(y = 1)} \;\; \text{ for } \pi \in \{0.2,0.5,0.8\} \; . \tag{2.9} \end{equation}\]

All that remains is a little “plug-and-chug”: the prior \(f(\pi)\) is defined by Table 2.4, the likelihood \(L(\pi|(y=1))\) by Table 2.5, and the normalizing constant \(f(y=1)\) by (2.8). The posterior probabilities follow:

\[\begin{equation} \begin{split} f((\pi = 0.2) | (y = 1)) & = \frac{0.10 \cdot 0.3932}{0.0637} \approx 0.617 \\ f((\pi = 0.5) | (y = 1)) & = \frac{0.25 \cdot 0.0938}{0.0637} \approx 0.368 \\ f((\pi = 0.8) | (y = 1)) & = \frac{0.65 \cdot 0.0015}{0.0637} \approx 0.015 \\ \end{split} \tag{2.10} \end{equation}\]

This posterior probability model is summarized in Table 2.6 along with the prior probability model for contrast. These details confirm the trends in the plots with which we started this section. Mainly, though we were fairly confident that Kasparov’s performance would have improved from 1996 to 1997, after winning only one game, the chances that Kasparov’s win probability was 0.8 dropped from 0.65 to 0.015. In fact, the scenario with the greatest posterior support is that Kasparov’s win probability dropped to only 0.2. Good news for machines. Bad news for humans.

TABLE 2.6: Prior and posterior probability models of \(\pi\), Kasparov’s chance of beating Deep Blue in chess.
\(\pi\) 0.2 0.5 0.8 Total
\(f(\pi)\) 0.10 0.25 0.65 1
\(f(\pi | (y=1))\) 0.617 0.368 0.015 1

We close this section by generalizing the tools we built for the chess analysis.

Bayes’ Rule for variables

For any variables \(\pi\) and \(Y\), let \(f(\pi)\) denote the prior pmf of \(\pi\) and \(L(\pi|y)\) denote the likelihood of \(\pi\) given observed data \(Y=y\). Then the posterior pmf of \(\pi\) given data \(Y=y\) is

\[\begin{equation} f(\pi | y) = \frac{\text{ prior } \cdot \text{ likelihood }}{\text{ normalizing constant }} = \frac{f(\pi)L(\pi|y)}{f(y)} \tag{2.11} \end{equation}\]

where, by the Law of Total Probability, the overall probability of observing data \(Y=y\) across all possible \(\pi\) is

\[\begin{equation} f(y) = \sum_{\text{all } \pi} f(\pi)L(\pi|y). \tag{2.12} \end{equation}\]
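
To make (2.11) and (2.12) concrete, here is a minimal sketch of a helper that applies them to any discrete prior. The function name `discrete_posterior` is hypothetical (it isn't part of any package), and we apply it to the chess example using unrounded likelihoods, so the third posterior probability may differ from Table 2.6 in the last digit.

```r
# Hypothetical helper implementing (2.11) and (2.12) for a discrete prior
discrete_posterior <- function(prior, likelihood) {
  stopifnot(length(prior) == length(likelihood))
  f_y <- sum(prior * likelihood)   # normalizing constant, (2.12)
  prior * likelihood / f_y         # posterior pmf, (2.11)
}

# Chess example: prior from Table 2.4, data y = 1 win in 6 games
pi_vals    <- c(0.2, 0.5, 0.8)
prior      <- c(0.10, 0.25, 0.65)
likelihood <- dbinom(1, size = 6, prob = pi_vals)

posterior <- discrete_posterior(prior, likelihood)
data.frame(pi = pi_vals, prior = prior, posterior = round(posterior, 3))
```

The same helper would apply to any of the discrete priors in this chapter, provided the likelihood vector is evaluated at the same \(\pi\) values as the prior.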

2.3.5 Posterior shortcut

We now make good on our promise that, moving forward, we needn’t continue calculating the normalizing constant. To begin, notice in (2.10) that \(f(y = 1) = 0.0637\) appears in the denominator of \(f(\pi|(y=1))\) for each \(\pi \in \{0.2,0.5,0.8\}\). This explains the term normalizing constant – its only purpose is to normalize the posterior probabilities so that they sum to one:

\[f((\pi = 0.2) | (y = 1)) + f((\pi = 0.5) | (y = 1)) + f((\pi = 0.8) | (y = 1)) = 1 \;. \]

We needn’t actually calculate \(f(y=1)\) to specify the posterior probabilities. Instead, we can simply note that \(f(y=1)\) is some constant \(1/c\), thus

\[\begin{split} f((\pi = 0.2) | (y = 1)) & = c \cdot 0.10 \cdot 0.3932 \propto 0.039320 \\ f((\pi = 0.5) | (y = 1)) & = c \cdot 0.25 \cdot 0.0938 \propto 0.023450 \\ f((\pi = 0.8) | (y = 1)) & = c \cdot 0.65 \cdot 0.0015 \propto 0.000975 \\ \end{split}\]

where \(\propto\) denotes “proportional to.” Though these three unnormalized posterior probabilities don’t add up to one (thus we can’t stop here), Figure 2.5 demonstrates that they preserve the proportional relationships of the normalized posterior probabilities.

FIGURE 2.5: The normalized posterior pmf of \(\pi\) (left) and the unnormalized posterior pmf of \(\pi\) (right) with different y-axis scales.

To normalize these unnormalized probabilities while preserving their relative relationships, we can compare each relative to the whole. Specifically, we can divide each unnormalized probability by their sum. For example:

\[f((\pi = 0.2) | (y = 1)) = \frac{0.039320}{0.039320 + 0.023450 + 0.000975} \approx 0.617 \; .\]

Though we’ve just intuited this result, it also follows mathematically by combining (2.11) and (2.12):

\[f(\pi | y) = \frac{f(\pi)L(\pi|y)}{f(y)} = \frac{f(\pi)L(\pi|y)}{\sum_{\text{all } \pi} f(\pi)L(\pi|y)} .\]

We state the general form of this proportionality result below and will get plenty of practice with this concept in the coming chapters.


Proportionality

Since the normalizing constant \(f(y)\) is merely a constant which does not depend on \(\pi\), the posterior pmf \(f(\pi|y)\) is proportional to the product of the prior and likelihood:

\[f(\pi | y) = \frac{f(\pi)L(\pi|y)}{f(y)} \propto f(\pi)L(\pi|y) \; .\]

That is,

\[\text{ posterior } \propto \text{ prior } \cdot \text{ likelihood }.\]

The significance of this proportionality is that we need not calculate the normalizing constant in order to identify the posterior model. We need only the prior and likelihood.
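
A consequence worth seeing in action: any factor that doesn't depend on \(\pi\) can be dropped before normalizing. The base R sketch below (object names our own) drops the Binomial coefficient \({6 \choose 1} = 6\) from the chess likelihood, normalizes by dividing by the sum, and compares the result to the full calculation.

```r
pi_vals <- c(0.2, 0.5, 0.8)
prior   <- c(0.10, 0.25, 0.65)

# Likelihood without the constant 6: pi * (1 - pi)^5
kernel <- pi_vals * (1 - pi_vals)^5

# Unnormalized posterior, then divide each term by the sum
unnormalized       <- prior * kernel
posterior_shortcut <- unnormalized / sum(unnormalized)

# Full calculation with the complete Binomial likelihood, for comparison
full           <- prior * dbinom(1, size = 6, prob = pi_vals)
posterior_full <- full / sum(full)

all.equal(posterior_shortcut, posterior_full)   # TRUE
```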

2.3.6 Posterior simulation

We’ll conclude this section with a simulation that provides insight into and supports our Bayesian analysis of Kasparov’s chess skills. Ultimately, we’ll simulate 10,000 scenarios of the six-game chess series. To begin, set up the possible values of win probability \(\pi\) and the corresponding prior model \(f(\pi)\):

# Define possible win probabilities
chess <- data.frame(pi = c(0.2, 0.5, 0.8))

# Define prior model
prior <- c(0.10, 0.25, 0.65)

Next, simulate 10,000 possible outcomes of \(\pi\) from the prior model and store the results in the chess_sim data frame.

# Simulate 10000 values of pi from the prior
chess_sim <- sample_n(chess, 
  size = 10000, weight = prior, replace = TRUE)

From each of the 10,000 prior plausible values pi, we can simulate six games and record Kasparov’s number of wins, y. Since the dependence of y on pi follows a Binomial model, we can directly simulate y using the rbinom() function with size = 6 and prob = pi.

# Simulate 10000 match outcomes
chess_sim <- chess_sim %>% 
  mutate(y = rbinom(10000, size = 6, prob = pi))

These 10,000 simulated pairs of win probabilities \(\pi\) and data \(Y\) provide insight into the two central pieces of our Bayesian model of \(\pi\), the prior and likelihood. The collection of 10,000 simulated values of pi closely approximates the prior model \(f(\pi)\):

# Plot the prior
ggplot(chess_sim, aes(x = pi)) + 
  geom_bar()

# Summarize the prior
chess_sim %>% 
  tabyl(pi) %>% 
  adorn_totals("row")

    pi     n percent
   0.2  1017  0.1017
   0.5  2521  0.2521
   0.8  6462  0.6462
 Total 10000  1.0000

Further, the 10,000 simulated match outcomes y illuminate the dependence of these outcomes on Kasparov’s win probability pi, closely mimicking the conditional pmfs \(f(y|\pi)\) from Figure 2.2.

# Plot y by pi
ggplot(chess_sim, aes(x = y)) + 
  stat_count(aes(y = after_stat(prop))) + 
  facet_wrap(~ pi)

Finally, let’s focus on the simulated outcomes that match the observed data that Kasparov won one game. Among these simulations, the majority (60.4%) corresponded to the scenario in which Kasparov’s win probability \(\pi\) was 0.2 and very few (1.8%) corresponded to the scenario in which \(\pi\) was 0.8. These observations very closely approximate the posterior model of \(\pi\) which we formally built above.

# Focus on simulations with y = 1
win_one <- chess_sim %>% 
  filter(y == 1)

# Plot the posterior approximation
ggplot(win_one, aes(x = pi)) + 
  geom_bar()

# Summarize the posterior approximation
win_one %>% 
  tabyl(pi) %>% 
  adorn_totals("row")

    pi   n percent
   0.2 404 0.60389
   0.5 253 0.37818
   0.8  12 0.01794
 Total 669 1.00000
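
To compare the simulated and exact posteriors side by side, the sketch below repeats the simulation in base R only (no extra packages) and sets a random number seed so the results are reproducible. The seed value and object names are arbitrary choices of ours, so the simulated counts will not match the output above exactly.

```r
set.seed(84735)   # arbitrary seed, chosen only for reproducibility
n_sim   <- 10000
pi_vals <- c(0.2, 0.5, 0.8)
prior   <- c(0.10, 0.25, 0.65)

# Simulate pi from the prior, then a 6-game match outcome for each pi
pi_sim <- sample(pi_vals, size = n_sim, replace = TRUE, prob = prior)
y_sim  <- rbinom(n_sim, size = 6, prob = pi_sim)

# Approximate posterior: distribution of pi among simulations with y = 1
# (factor() keeps all three pi values even if one never appears)
pi_given_one     <- factor(pi_sim[y_sim == 1], levels = pi_vals)
approx_posterior <- prop.table(table(pi_given_one))

# Exact posterior from Bayes' Rule
exact           <- prior * dbinom(1, size = 6, prob = pi_vals)
exact_posterior <- exact / sum(exact)

round(rbind(simulated = as.numeric(approx_posterior),
            exact     = exact_posterior), 3)
```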

2.4 Chapter summary

In Chapter 2, you learned Bayes’ Rule and that Bayes Rules! Every Bayesian analysis consists of three common steps.

  1. Construct a prior model for your variable of interest, \(\pi\). The prior model specifies two important pieces of information: the possible values of \(\pi\) and the relative prior plausibility of each.

  2. Summarize the dependence of data \(Y\) on \(\pi\) via a conditional pmf \(f(y|\pi)\). Upon observing data \(Y = y\), define the likelihood function \(L(\pi|y) := f(y|\pi)\) which encodes the relative likelihood of different \(\pi\) values in light of data \(Y = y\).

  3. Build the posterior model of \(\pi\) via Bayes’ Rule which balances the prior and likelihood:

    \[\begin{equation*} \text{posterior} = \frac{\text{prior} \cdot \text{likelihood}}{\text{normalizing constant}} \propto \text{prior} \cdot \text{likelihood} \end{equation*}\]

    More technically,

    \[f(\pi|y) = \frac{f(\pi)L(\pi|y)}{f(y)} \propto f(\pi)L(\pi|y)\]

2.5 Exercises

2.5.1 Practice: Building up to Bayes’ Rule

Exercise 2.1 (Comparing the prior and posterior) For each scenario below, use mathematical notation to represent the two events of interest, \(A\) and \(B\). Further, explain what you believe to be the relationship between the posterior and prior probabilities of \(B\): \(P(B|A) > P(B)\) or \(P(B|A) < P(B)\). We provide a solution to the first scenario as an example.
  1. You just finished reading author Nicole Dennis-Benn’s first novel, and you enjoyed it! What’s the posterior chance that you’ll enjoy their newest novel and how does this compare to the prior? Solution: Let \(B\) be the event that you enjoy Dennis-Benn’s newest novel and \(A\) be the event that you enjoyed their first novel. Then \(P(B|A) > P(B)\) since liking one of Dennis-Benn’s novels is a good indication that you’ll like another.
  2. What’s the posterior probability that your friend gets a job offer at the local market if they forgot to send a thank you note to the hiring manager? And how does this compare to the prior?
  3. Suppose it’s 0 degrees Fahrenheit in Minnesota on a January day. What’s the posterior probability that it will be 60 degrees tomorrow and how does this compare to the prior?
  4. Suppose the authors only got 3 hours of sleep last night. What’s the posterior probability that they make several typos in their writing today and how does this compare to the prior?
  5. Your friend includes three hashtags in their tweet. What’s the posterior chance it will get retweeted and how does this compare to the prior?
Exercise 2.2 (Marginal, conditional, or joint?) Define the following events for a resident of a fictional town: \(A\) = drives 10 miles per hour above the speed limit, \(B\) = gets a speeding ticket, \(C\) = took statistics at the local college, \(D\) = has used R, \(E\) = likes the music of Prince, and \(F\) = is Minnesotan. Several facts about these events are listed below. Specify each of these facts using probability notation, paying special attention to whether it’s a marginal, conditional, or joint probability.
  1. 73% of people that drive 10 miles per hour above the speed limit get pulled over by the police.
  2. 20% of residents drive 10 miles per hour above the speed limit.
  3. 15% of residents have used R.
  4. 91% of statistics students at the local college have used R.
  5. 38% of residents are Minnesotans that like the music of Prince.
  6. 95% of the Minnesotan residents like the music of Prince.
Exercise 2.3 (Binomial practice) For each variable \(Y\) below, determine whether \(Y\) is Binomial. If yes, use notation to specify this model and its parameters. If not, explain why the Binomial model is not appropriate for \(Y\).
  1. At a certain hospital, an average of 6 babies are born each hour. Let \(Y\) be the number of babies born between 9am and 10am tomorrow.
  2. Tulips planted in fall have a 90% chance of blooming in spring. You plant 27 tulips this year. Let \(Y\) be the number that bloom.
  3. Each time they try out for the television show RuPaul’s Drag Race, Alaska has a 17% probability of succeeding. Let \(Y\) be the number of times Alaska has to try out until they’re successful.
  4. \(Y\) is the amount of time that Henry is late to your lunch date.
  5. \(Y\) is the probability that your friends will throw you a surprise birthday party even though you said you hate being the center of attention and just want to go out to eat.
  6. You invite 60 people to your “\(\pi\) day” party, none of whom know each other, and each of whom has an 80% chance of showing up. Let \(Y\) be the total number of guests at your party.

2.5.2 Apply: Bayes’ Rule for events

Exercise 2.4 (Vampires?) Edward is trying to prove to Bella that vampires exist. Bella thinks there is a 0.05 probability that vampires exist. She also believes that the probability that someone can sparkle like a diamond if vampires exist is 0.7, and the probability that someone can sparkle like a diamond if vampires don’t exist is 0.03. Edward then goes into a meadow and shows Bella that he can sparkle like a diamond. Given that Edward sparkled like a diamond, what is the probability that vampires exist?
Exercise 2.5 (Sick trees) A local arboretum contains a variety of tree species, including elms, maples, and others. Unfortunately, 18% of all trees in the arboretum are infected with mold. Among the infected trees, 15% are elms, 80% are maples, and 5% are other species. Among the uninfected trees, 20% are elms, 10% are maples, and 70% are other species. In monitoring the spread of mold, an arboretum employee randomly selects a tree to test.
  1. What’s the prior probability that the selected tree has mold?
  2. The tree happens to be a maple. What’s the probability that the employee would have selected a maple?
  3. What’s the posterior probability that the selected maple tree has mold?
  4. Compare the prior and posterior probability of the tree having mold. How did your understanding of whether or not the tree had mold change in light of the fact that it’s a maple?
Exercise 2.6 (Restaurant ratings) The probability that Sandra will like a restaurant is 0.7. Among the restaurants that she likes, 20% have five stars on Yelp, 50% have four stars, and 30% have less than four stars. What other information do we need if we want to find the posterior probability that Sandra likes a restaurant given that it has less than four stars on Yelp?
Exercise 2.7 (Dating app) Matt is on a dating app looking for love. Matt swipes right on 8% of the profiles he views. Of the people that Matt swipes right on, 40% are men, 30% are women, 20% are non-binary, and 10% identify in another way. Of the people that Matt does not swipe right on, 45% are men, 40% are women, 10% are non-binary, and 5% identify in some other way.
  1. What’s the probability that a randomly chosen person on this dating app is non-binary?
  2. Given that Matt is looking at the profile of someone who is non-binary, what’s the posterior probability that he swipes right?
Exercise 2.8 (Flight delays) For a certain airline, 30% of the flights depart in the morning, 30% depart in the afternoon, and 40% depart in the evening. Frustratingly, 15% of all flights have a delayed departure. Of the delayed flights, 40% are morning flights, 50% are afternoon flights, and 10% are evening flights. Alicia and Mine are taking separate flights on the airline to attend a conference.
  1. Alicia’s flight is not delayed. What’s the probability that she’s on a morning flight?
  2. Mine is on a morning flight. What’s the probability that her flight will be delayed?
Exercise 2.9 (Good mood, bad mood) Your roommate has two moods, either good or bad. Having lived with your roommate for so long, you’ve noticed a pattern: their moods are highly related to how many text messages they receive the day before. Your roommate has a 10% chance of getting zero texts, an 85% chance of getting between 1 and 45 texts, and a 5% chance of getting more than 45 texts. Their probability of being in a good mood is 20% if they get 0 texts, 40% if they get between 1 and 45 texts, and 90% if they get more than 45 texts.
good mood bad mood Total
0 texts
1-45 texts
46+ texts
Total 1
  1. Use the information above to fill in the table above.
  2. Today’s a new day. What’s the probability that your roommate is in a good mood? What part of the Bayes’ Rule equation is this: the prior, likelihood, normalizing constant, or posterior?
  3. You surreptitiously took a peek at your roommate’s phone (we are attempting to withhold judgment of this dastardly maneuver) and see that your roommate received 43 text messages yesterday. How likely are they to have received this many texts if they’re in a good mood today? What part of the Bayes’ Rule equation is this?
  4. What is the posterior probability that your roommate is in a good mood given that they received 43 text messages yesterday?
Exercise 2.10 (LGBTQ students: rural and urban) A recent study of 415,000 Californian public middle school and high school students found that 8.5% live in rural areas and 91.5% in urban areas.23 Of the students living in rural areas, 10% identified as Lesbian, Gay, Bisexual, Transgender, or Queer (LGBTQ) while 10.5% of students living in urban areas report an LGBTQ identity.
  1. What’s the probability that a student from this study identifies as LGBTQ?
  2. If you know that a student in the study identifies as LGBTQ, what’s the probability that they live in a rural area?
  3. If you know that a student in the study does not identify as LGBTQ, what is the probability that they live in an urban area?

2.5.3 Apply: Bayes’ Rule for random variables

Exercise 2.11 (Internship) Muhammad applies for six equally competitive data science internships. He has the following prior model for his chances of getting into any given internship (\(\pi\)):
\(\pi\) 0.3 0.4 0.5 Total
\(f(\pi)\) 0.25 0.60 0.15 1
  1. Let \(Y\) be the number of internship offers that Muhammad gets. Specify the model for the dependence of \(Y\) on \(\pi\) and the corresponding pmf, \(f(y|\pi)\).
  2. Muhammad got some pretty amazing news. He was offered four of the six internships! How likely would this be if \(\pi = 0.3\)?
  3. Construct the posterior model of \(\pi\) in light of Muhammad’s internship news.
Exercise 2.12 (Making mugs) Miles is learning how to make a mug in his ceramics class. A difficult part of the process is creating or “pulling” the handle. Miles is unsure about the probability that one of his handles will actually be good enough for a mug. Let’s call this probability \(\pi\). Miles’ guess is represented by the following prior model:
\(\pi\) 0.1 0.25 0.4 Total
\(f(\pi)\) 0.2 0.45 0.35 1
  1. Miles has enough clay for 7 handles. Let \(Y\) be the number of handles that will be good enough for a mug. Specify the model for the dependence of \(Y\) on \(\pi\) and the corresponding pmf, \(f(y|\pi)\).
  2. Miles pulls 7 handles and only 1 of them is good enough for a mug. What is the posterior pmf of \(\pi\), \(f(\pi|(y=1))\)?
  3. Compare the posterior model to the prior model of \(\pi\). How would you characterize the differences between them?
  4. Miles’ instructor Kris had a different prior for Miles’ ability to pull a handle (below). Find Kris’ posterior \(f(\pi|(y=1))\) and compare it to Miles’ posterior.
    \(\pi\) 0.1 0.25 0.4 Total
    \(f(\pi)\) 0.15 0.15 0.7 1
Exercise 2.13 (Lactose intolerant) Lactose intolerance, an inability to digest milk often resulting in an upset stomach, is a fairly common trait in adults. Fatima wants to learn more about the proportion of people who are lactose intolerant (\(\pi\)). Her prior model for \(\pi\) is:
\(\pi\) 0.4 0.5 0.6 0.7 Total
\(f(\pi)\) 0.1 0.2 0.44 0.26 1
  1. Fatima surveys a random sample of 80 adults and 47 are lactose intolerant. Without doing any math, make a guess at the posterior model of \(\pi\), and explain your reasoning.
  2. Calculate the posterior model. How does this compare to your guess in part a?
  3. If Fatima had instead collected a sample of 800 adults and 470 (keeping the sample proportion the same as above) are lactose intolerant, how does that change the posterior model?
Exercise 2.14 (Late bus) Li Qiang takes the 8:30am bus to work every morning. If the bus is late, Li Qiang will be late to work. To learn about the probability that her bus will be late (\(\pi\)), Li Qiang first surveys 20 other commuters: 3 think \(\pi\) is 0.15, 3 think \(\pi\) is 0.25, 8 think \(\pi\) is 0.5, 3 think \(\pi\) is 0.75, and 3 think \(\pi\) is 0.85.
  1. Convert the information from the 20 commuters that Li Qiang surveyed into a prior model for \(\pi\).
  2. Li Qiang wants to update that prior model with the data she collected: in 13 days, the 8:30am bus was late 3 times. Find the posterior model for \(\pi\).
  3. Compare and comment on the prior and posterior models. What did Li Qiang learn about the bus?
Exercise 2.15 (Cuckoo birds) Cuckoo birds are brood parasites, meaning that they lay their eggs in the nests of other birds (hosts), so that the host birds will raise the cuckoo bird hatchlings. Lisa is an ornithologist studying the success rate, \(\pi\), of cuckoo bird hatchlings that survive at least one week. She is taking over the project from a previous researcher who speculated in their notes the following prior model for \(\pi\):
\(\pi\) 0.6 0.65 0.7 0.75 Total
\(f(\pi)\) 0.3 0.4 0.2 0.1 1

Starting from this prior, Lisa collects some data. Among the 15 hatchlings she studied, 10 survived for at least one week.

  1. If the previous researcher had been more sure that the hatchlings would survive, how would the prior model be different?
  2. If the previous researcher had been less sure that the hatchlings would survive, how would the prior model be different?
  3. What is the posterior model for \(\pi\)?
  4. Lisa needs to explain the posterior model for \(\pi\) in a research paper for ornithologists, and she can’t assume they understand Bayesian statistics. Write two or three sentences explaining the posterior model in context.
Exercise 2.16 (Fake art) An article in The Daily Beast reports differing opinions on the proportion (\(\pi\)) of museum artworks that are fake or forged.[24]
  1. After reading the article, define your own prior model for \(\pi\) and provide evidence from the article to justify your choice.

  2. Compare your prior to the following:











    What is similar? What is different?

  3. Suppose you randomly choose 10 artworks. Assuming the prior from part 2, what is the minimum number of artworks that would need to be forged for \(f(\pi=0.6|Y=y)>0.4\)?

2.5.4 Practice: simulation

Exercise 2.17 (Lactose intolerant redux) Repeat the “Lactose intolerant” exercise utilizing simulation to approximate the posterior model of \(\pi\) corresponding to Fatima’s survey data. Specifically, simulate data for 10,000 people and remember to set your random number seed.
Exercise 2.18 (Cuckoo birds redux) Repeat the “Cuckoo birds” exercise utilizing simulation, with a simulation sample size of 10,000, to approximate the posterior model of \(\pi\).
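One way to set up a simulation like those requested above is sketched here in base R. The idea: draw a value of \(\pi\) from the prior for each simulated scenario, simulate Binomial data under that \(\pi\), keep only the scenarios whose simulated data match what was observed, and tabulate the surviving \(\pi\) values. The prior, the data (\(Y = 1\) in 6 trials), and the seed are toy assumptions chosen for illustration, not the values from either exercise.

```r
# Minimal simulation sketch (base R only) with toy inputs
set.seed(84735)   # the seed value is arbitrary; set one so results are reproducible
n_sim   <- 10000
pi_grid <- c(0.2, 0.5, 0.8)      # toy candidate values of pi (an assumption)
prior   <- c(0.10, 0.25, 0.65)   # toy prior probabilities (an assumption)

# Step 1: draw a value of pi for each simulated scenario, according to the prior
pi_sim <- sample(pi_grid, size = n_sim, replace = TRUE, prob = prior)

# Step 2: simulate data under each drawn pi, here Y ~ Bin(6, pi)
y_sim <- rbinom(n_sim, size = 6, prob = pi_sim)

# Step 3: keep only the scenarios that reproduce the observed data, here Y = 1
kept <- pi_sim[y_sim == 1]

# Step 4: the relative frequencies of pi among the kept scenarios
# approximate the posterior model
posterior_approx <- prop.table(table(factor(kept, levels = pi_grid)))
print(round(posterior_approx, 3))
```

Because only the matching scenarios are kept, the approximation rests on far fewer than 10,000 draws, so expect some simulation error relative to the exact posterior.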
Exercise 2.19 (Cat image recognition) Is your social media filled with cat memes? Whether you like it or not, cats have taken over the internet. In fact, in 2015 the Museum of the Moving Image in New York had an exhibition called How Cats Took Over the Internet.[25] Thus, companies that can identify cat images on the internet can benefit. For her first data science internship, Zainab has written an algorithm to detect cat images. When given a cat image, 80% of the time the algorithm correctly identifies it as a cat. When given a non-cat image, the algorithm falsely identifies it as a cat 50% of the time. Zainab tests her algorithm further with a new set of images, of which 8% are cats. What is the probability that an image is actually a cat if the algorithm identifies it as a cat? Answer this question by simulating data for 10,000 images.
Exercise 2.20 (Medical tests) A medical test is designed to detect a disease that about 3% of the population has. For 93% of those who have the disease, the test yields a positive result. In addition, the test falsely yields a positive result for 7% of those without the disease. What is the probability that a person has the disease given that they have tested positive? Answer this question by simulating data for 10,000 people.
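The two exercises above share one structure: a prior probability for an event, plus the probability of the data under each scenario. As a hedged illustration of how such a simulation might be organized, the base R sketch below uses the fake news numbers from the start of this chapter rather than either exercise's values: 40% of articles are fake, and an exclamation point appears in 17 of 60 fake titles but only 2 of 90 real titles. Variable names and the seed are our own choices.

```r
# Minimal simulation sketch (base R only) using the chapter's fake news numbers
set.seed(84735)   # the seed value is arbitrary
n_sim <- 10000

# Step 1: simulate each article's type from the prior, P(fake) = 0.4
type <- sample(c("fake", "real"), size = n_sim, replace = TRUE, prob = c(0.4, 0.6))

# Step 2: simulate the data given the type: does the title use an exclamation point?
p_excl <- ifelse(type == "fake", 17 / 60, 2 / 90)
excl   <- rbinom(n_sim, size = 1, prob = p_excl) == 1

# Step 3: among simulated articles with an exclamation point,
# the proportion that are fake approximates P(fake | exclamation point)
post_fake <- mean(type[excl] == "fake")
print(round(post_fake, 3))
```

For comparison, Bayes' Rule with these same inputs gives \(0.4 \cdot (17/60) \, / \, [0.4 \cdot (17/60) + 0.6 \cdot (2/90)] = 17/19 \approx 0.895\), so the simulated proportion should land near that value, up to simulation error.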
Exercise 2.21 (Titanic) The R titanic package provides data on 891 of roughly 2400 Titanic passengers. Among these passengers, 216 had first class tickets, 184 had second class tickets, and 491 had third class tickets. Among those with first class tickets, 136 survived and among those with third class tickets, 119 survived. Utilize simulation to calculate the following probabilities of interest.
  1. What is the probability that a passenger survived given that they had a first class ticket?
  2. What is the probability that a passenger survived given that they had a third class ticket?
Exercise 2.22 (Bayesian professor) A professor gives her students a multiple choice test. Each question has four choices. The professor suspects that students who do not study should get a score of 25%. Those who study a little could potentially get 60%. Those who study hard could get 90%. Even though other test scores are possible, the professor wants her life to be simple for now. She thinks that about 1/10 of her students should get a score of 25%, another 8/10 should get 60%, and the other 1/10 should get 90%. The professor grades the exams and sees that Jose has answered 83 of the 100 questions correctly. Based on this test result, should the professor assign a score of 25%, 60%, or 90% to Jose? Support your answer with a simulation.

  1. Correct answer = b.↩︎

  2. We can’t cite any rigorous research article here, but imagine what orchestras would sound like if this weren’t true.↩︎

  3. This term might be a bit mysterious now, but will make sense by the end of this chapter.↩︎

  4. We read “\(\cap\)” as “and” or the “intersection” of two events.↩︎

  5. If you get different random samples than those printed here, it likely means that you are using a different version of R, and you need to update to the most recent R version.↩︎


  7. Greek letters are conventionally used to denote our primary variables of interest.↩︎

  8. As we keep progressing with Bayes, we’ll get the chance to make our models more nuanced and realistic.↩︎

  9. Capital letters toward the end of the alphabet (e.g., \(X, Y, Z\)) are conventionally used to denote random variables related to our data.↩︎

  10. If you’re interested in learning more about this model and how to actually build its pmf (which is encouraged but certainly not necessary to moving forward), we recommend Chapter 3 of Blitzstein and Hwang (2019).↩︎