INFSCI 2595 · Week 4

Applied ML · Probability Basics & Bayes

Interactive Study Guide — Lecture Notes Companion
Week 4 Overview
// why probability matters · Bayesian motivation · lecture roadmap
Why Probability in Machine Learning? [Motivation]

Statistics and probability play an important role in many machine learning algorithms. The core Bayesian insight is that model parameters are not fixed unknown constants, but random variables with their own probability distributions. This means we can express genuine uncertainty about parameters — not just uncertainty from sampling noise, but inherent uncertainty that we update as we observe data.

Bayesian: Parameters are random variables. Uncertainty is represented as a probability distribution over parameter values — a prior before data, a posterior after.

Frequentist (MLE): Parameters are fixed unknowns. The only uncertainty comes from sampling noise — we get a point estimate (e.g. μ̂ = m/N) and quantify variability with confidence intervals based on repeated sampling.
What This Lecture Is (and Isn't) [Important]

In this lecture we are NOT training a predictive model. Instead we are using the rules and concepts of probability to describe behavior — collecting observations of an event and asking what probability best explains what we observed.

This is the probabilistic foundation that makes logistic regression and all Bayesian ML work. We need to understand distributions and likelihoods before we can fit models.
Lecture 4 Roadmap [Structure]

# | Topic | Key takeaway
1 | Probability Basics | Event probability, joint, marginal, conditional, independence — the Panda/Duck example
2 | Bayes' Theorem | Posterior = Likelihood × Prior / Evidence; diagnostic test worked example
3 | Bernoulli Distribution | PMF for binary outcomes; the Star Wars movie example
4 | Likelihood & MLE | Product of N independent Bernoullis → log-likelihood → μ̂ = m/N
5 | Binomial Distribution | Counting sequences, combinations, p(m|N,μ), R's dbinom(); small-data problem
6 | Bayesian Approach & MAP | Prior distributions, Beta conjugate prior, posterior derivation, MAP estimate μ̂ = (m+a−1)/(N+a+b−2)
7 | Credible vs. Confidence Intervals | Bayesian credible intervals vs. frequentist confidence intervals; prior prevents unrealistic values
Probability Basics
// event probability · joint · marginal (sum rule) · conditional · independence · chain rule
Probability of an Event [Core]

The probability of an event is the proportion of times that event occurs out of the total number of trials.

Binary response y — class probabilities
p(y = 1) = probability of the EVENT class, denoted μ
p(y = 0) = probability of the NON-EVENT class = 1 − μ
The sample probability of each class is the count of that class in the dataset divided by the total size. This is the simplest estimate of p(y).
Joint Probabilities — The Panda/Duck Example [Example]

We have two variables: Animal (Panda or Duck) and Color (Red or Black). A random sample of 16 objects gives the following counts:

                 | Red | Black | Marginal (Animal)
Panda            |  4  |   3   |  7
Duck             |  6  |   3   |  9
Marginal (Color) | 10  |   6   | 16

Joint probabilities — e.g. p(Panda, Red) = 4/16 = 0.25: the probability of being both a Panda and Red simultaneously.
The Sum Rule — Marginal from Joint [Sum Rule]

The marginal probability of a variable is calculated by summing across its intersections with all levels of a second variable.

Discrete B
p(A = a) = Σ_b p(A = a, B = b)
Continuous B
p(A = a) = ∫ p(a, b) db
Animal example:
p(Panda) = p(P,Red) + p(P,Black) = 4/16 + 3/16 = 7/16
p(Duck)  = p(D,Red) + p(D,Black) = 6/16 + 3/16 = 9/16
p(Red)   = p(R,Panda) + p(R,Duck) = 4/16 + 6/16 = 10/16
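The sum rule is easy to verify numerically. A minimal R sketch — the `counts` matrix simply transcribes the Panda/Duck table above:

```r
# Joint counts from the Panda/Duck sample (rows = Animal, cols = Color)
counts <- matrix(c(4, 3,
                   6, 3),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(c("Panda", "Duck"), c("Red", "Black")))

joint <- counts / sum(counts)   # joint probabilities p(Animal, Color)

# Sum rule: marginals are row / column sums of the joint table
p_animal <- rowSums(joint)      # p(Panda) = 7/16, p(Duck) = 9/16
p_color  <- colSums(joint)      # p(Red) = 10/16, p(Black) = 6/16
```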
Independent Events [Definition]

Events A and B are independent if observing A does not affect the probability of B. The joint probability becomes the product of the marginals:

Independence — formal definition
p(A, B) = p(A) · p(B) if and only if A and B are independent
✓ Independent: Flip a coin twice. The 2nd flip is 50/50 regardless of the 1st — knowing the 1st gives no information about the 2nd.
✓ Independent: "A coin lands heads" and "it is not raining outside." These events are unrelated — one does not influence the other.
The independence assumption is critical for the likelihood function later: assuming the N observations are independent lets us factor the joint distribution into a product of N simple Bernoulli distributions.
Conditional Probability & the Chain Rule [Core]

Split the 16 animals into two groups. Group 1 has 12 animals, Group 2 has 4. The probability of selecting each animal now depends on which group you're in:

         | Group 1 | Group 2 | Marginal
Duck     |    6    |    3    |  9
Panda    |    6    |    1    |  7
Marginal |   12    |    4    | 16
Group 1 (n=12)
p(Duck | G=1) = 6/12 = 0.5
p(Panda | G=1) = 6/12 = 0.5
Group 2 (n=4)
p(Duck | G=2) = 3/4 = 0.75
p(Panda | G=2) = 1/4 = 0.25
Formal definition and chain rule
p(A | G) = p(A, G) / p(G)

↔ p(A, G) = p(A | G) · p(G) ← chain rule
The chain rule rearranges the conditional probability definition. Setting p(A,G) = p(G,A) and applying the chain rule in both directions is exactly how Bayes' theorem is derived.
Summary: Three Types of Probability [Summary]

Type        | Notation | Meaning
Marginal    | p(A)     | Probability of A irrespective of any other variable.
Joint       | p(A, B)  | Probability of events A and B occurring together.
Conditional | p(A | B) | Probability of A given that B has already occurred.
Bayes' Theorem
// derivation · posterior · likelihood · prior · evidence · diagnostic test example
Deriving Bayes' Theorem [Core]
1
Symmetry of joint probability
p(A, B) = p(B, A)
2
Apply the chain rule in both directions
p(A|B) · p(B) = p(B|A) · p(A)
3
Solve for p(A|B) → Bayes' Theorem
p(A | B) = p(B | A) · p(A) / p(B)
This lets us reverse the conditioning — compute p(A|B) from p(B|A).
The Four Named Components [Anatomy]

p(A | B) = p(B | A) · p(A) / p(B)
Posterior = Likelihood · Prior / Evidence

Name       | Symbol | Role
Posterior  | p(A|B) | Updated belief about A after observing evidence B. This is what we want.
Likelihood | p(B|A) | How probable is evidence B if A were true? Comes from the model / data.
Prior      | p(A)   | Our belief about A before any evidence. Encodes background knowledge.
Evidence   | p(B)   | Total probability of the evidence, summed across all outcomes. Normalizing constant.
The posterior represents updated beliefs about the prior based on new evidence.
Animal Example: Applying Bayes [Walk-through]

We randomly sample one animal from one of two groups. We observe it is a Duck. What is the probability it came from Group 2?

         | Group 1 | Group 2 | Marginal
Duck     |    6    |    3    |  9
Panda    |    6    |    1    |  7
Marginal |   12    |    4    | 16
1
Set up the known quantities
p(G=1) = 12/16 = 3/4  ·  p(G=2) = 4/16 = 1/4
p(Duck|G=1) = 6/12 = 1/2  ·  p(Duck|G=2) = 3/4
2
Compute the evidence p(Duck) via total probability
p(Duck) = p(Duck|G=1)·p(G=1) + p(Duck|G=2)·p(G=2) = (1/2)(3/4) + (3/4)(1/4) = 3/8 + 3/16 = 9/16
3
Apply Bayes' Theorem
p(G=2|Duck) = p(Duck|G=2)·p(G=2)/p(Duck) = (3/4·1/4)/(9/16) = (3/16)/(9/16) = 1/3 ≈ 0.333
Although Group 2 is small (prior = 1/4), seeing a Duck raises its probability because ducks are proportionally more common in Group 2.
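The three steps above can be reproduced in a few lines of R:

```r
# p(G = 2 | Duck) via Bayes' theorem
p_g      <- c(12/16, 4/16)   # priors p(G=1), p(G=2)
p_duck_g <- c(6/12, 3/4)     # likelihoods p(Duck|G=1), p(Duck|G=2)

evidence  <- sum(p_duck_g * p_g)         # total probability p(Duck) = 9/16
posterior <- p_duck_g * p_g / evidence   # p(G=1|Duck), p(G=2|Duck)

posterior[2]   # 1/3: the Duck observation raises Group 2 from 1/4 to 1/3
```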
Diagnostic Test for a Disease [Base Rate Neglect]

The classic Bayesian example. Dis and Test are binary variables for disease state and test result.

Quantity                         | Symbol            | Value
Sensitivity (true positive rate) | p(Test=1 | Dis=1) | 0.95
False positive rate              | p(Test=1 | Dis=0) | 0.01
Prevalence (prior)               | p(Dis=1)          | 1/100,000 = 0.00001
Goal: posterior                  | p(Dis=1 | Test=1) | ?
1
Evidence: p(Test=1) via total probability
= 0.95×0.00001 + 0.01×0.99999 = 0.0000095 + 0.0099999 ≈ 0.01
2
Posterior via Bayes' Theorem
p(Dis=1|Test=1) = (0.95 × 0.00001) / 0.01 ≈ 0.001 = 0.1%
Key lesson — Base Rate Neglect: Even with a 95%-sensitive test, a positive result only means ~0.1% probability of having the disease — because the disease is extremely rare (1 in 100,000). The prior dominates. Most people intuitively over-estimate this probability by ignoring the base rate.
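The same two steps in R, using the table's numbers:

```r
sens <- 0.95    # sensitivity p(Test=1 | Dis=1)
fpr  <- 0.01    # false positive rate p(Test=1 | Dis=0)
prev <- 1e-5    # prevalence p(Dis=1)

evidence  <- sens * prev + fpr * (1 - prev)   # p(Test=1), about 0.01
posterior <- sens * prev / evidence           # p(Dis=1 | Test=1)
posterior   # just under 0.001 — about 0.1%, despite the 95% sensitivity
```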
Bernoulli Distribution
// binary outcome encoding · PMF · parameter μ · the movie example · visualizing p(y|μ)
The Running Example: Star Wars [Setup]

We ask people: Did you like Star Wars Episode VIII — The Last Jedi? The answer is binary, encoded exactly as in Week 3:

Event: liked the movie
y = 1
Non-event: did NOT like it
y = 0
We denote the probability someone liked the movie — p(y=1) — by the parameter μ. It follows that p(y=0) = 1 − μ. Since μ is a probability: 0 ≤ μ ≤ 1.
The Bernoulli PMF [Core]

The probability mass function over both values y ∈ {0, 1} can be written compactly in a single expression — named after Jacob Bernoulli:

Bernoulli PMF
p(y | μ) = μ^y · (1 − μ)^(1−y)
Verify when y = 1
p(1|μ) = μ^1 · (1−μ)^0 = μ · 1 = μ
Verify when y = 0
p(0|μ) = μ^0 · (1−μ)^1 = 1 · (1−μ) = 1−μ
The Bernoulli distribution is applicable for any binary outcome problem: sports (win/lose), engineering (pass/fail), manufacturing (defect/no defect), medicine (disease/healthy), etc.
Visualizing the Bernoulli PMF for Different μ Values [Intuition]

The PMF has only two bars — at y=0 and y=1. Their heights depend entirely on μ:

μ    | p(y=0) = 1−μ | p(y=1) = μ   | Interpretation
0.15 | 0.85 (tall)  | 0.15 (short) | Event is infrequent — 0s are more common
0.45 | 0.55         | 0.45         | Near-equal — high uncertainty
0.75 | 0.25 (short) | 0.75 (tall)  | Event is common — mostly 1s
As μ increases from 0 to 1, the bar at y=0 shrinks and the bar at y=1 grows. When μ=0.5, both bars are equal — maximum uncertainty.
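In R the Bernoulli PMF is `dbinom()` with `size = 1`; a quick check of the three μ values in the table:

```r
# Bernoulli PMF p(y | mu) = mu^y * (1 - mu)^(1 - y): dbinom with size = 1
for (mu in c(0.15, 0.45, 0.75)) {
  pmf <- dbinom(0:1, size = 1, prob = mu)   # heights of the y = 0 and y = 1 bars
  cat("mu =", mu, "-> p(y=0) =", pmf[1], ", p(y=1) =", pmf[2], "\n")
}
```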
Likelihood Function & MLE
// independence assumption · product of Bernoullis · log-likelihood derivation · μ̂ = m/N
Extending to N Observations [Core]

We ask 4 people. We observe the sequence: Person 1=No, 2=No, 3=Yes, 4=No. What is the probability of this sequence?

The joint probability p(x₁=0, x₂=0, x₃=1, x₄=0 | μ) seems hard — but assuming people respond independently, we can factor it into a product of 4 Bernoullis.
Independence: joint = product of marginals
p(A, B) = p(A|B) · p(B) = p(A) · p(B) when A ⊥ B
Likelihood function — N independent Bernoullis
p(x | μ) = ∏ₙ₌₁ᴺ Bernoulli(xₙ | μ) = ∏ₙ₌₁ᴺ μ^xₙ · (1−μ)^(1−xₙ)
4-person example: p(0,0,1,0|μ) = (1−μ)(1−μ)μ(1−μ) = μ¹·(1−μ)³
Maximum Likelihood Estimation (MLE) [Core]

When μ is unknown, we find the value that best explains the data — the value that maximizes the likelihood:

MLE objective
μ̂_MLE = argmax_{μ ∈ [0,1]} p(x | μ)
Rather than maximizing the likelihood directly (a product of many small numbers), we maximize the log-likelihood instead. Since log is monotone increasing, the maximizer is the same.
Log-Likelihood Derivation — Step by Step [Walk-through]
1
Product → Sum (log of product = sum of logs)
log[p(x|μ)] = Σₙ₌₁ᴺ log[μ^xₙ · (1−μ)^(1−xₙ)]
2
Apply log(aᵇ) = b·log(a)
log[p(x|μ)] = Σₙ ( xₙ·log[μ] + (1−xₙ)·log[1−μ] )
3
Since μ does not depend on the index n: pull log(μ) and log(1−μ) outside the sums
= log(μ) · Σxₙ + log(1−μ) · Σ(1−xₙ)
= log(μ) · m + log(1−μ) · (N − m)

where m = number of events (people who said Yes) and N − m = number of non-events

Final compact form:
log[p(x|μ)] = log(μ) × m + log(1−μ) × (N − m)
Solving for μ̂_MLE [Result]
1
Take the derivative with respect to μ
d/dμ log[p(x|μ)] = m/μ − (N−m)/(1−μ)
2
Set derivative to zero and solve
m/μ − (N−m)/(1−μ) = 0  →  (1−μ)·m = μ·(N−m)
m − μ·m = μ·N − μ·m  →  m = μ·N
3
The MLE is just counting!
μ̂_MLE = m / N = (1/N) · Σₙ xₙ
The MLE for the Bernoulli parameter is the sample proportion of events.
⚠️ Small-data warning: With only N=4 people and μ_TRUE=0.2, observing 0 Yes (μ̂=0.0) has ~40% probability. Observing 2 Yes (μ̂=0.5) has ~15% probability. The MLE can be far from the truth purely by chance with small samples.
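The closed-form result μ̂ = m/N can be checked against a numerical maximizer of the log-likelihood; a small R sketch:

```r
# Log-likelihood of m events in N independent Bernoulli trials
log_lik <- function(mu, m, N) m * log(mu) + (N - m) * log(1 - mu)

m <- 1; N <- 4
opt <- optimize(log_lik, interval = c(1e-6, 1 - 1e-6),
                m = m, N = N, maximum = TRUE)
opt$maximum   # numerically close to 0.25, matching the closed form m / N
```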
Binomial Distribution
// counting sequences · combinations · PMF derivation · interactive explorer · R's dbinom()
A Different Question [Motivation]

Bernoulli answers: what is p(y=1) for a single trial?
Binomial answers: what is the probability the event occurs exactly m times out of N trials?

Movie example shift: instead of "will this one person like the movie?", we ask "if we ask 4 people, what is p(exactly 1 out of 4 says Yes)?"
Multiple Sequences Have the Same Count [Key Insight]

The sequence (No, No, Yes, No) gives 1 Yes out of 4 — and in total there are 4 orderings that give exactly 1 Yes:

Person 1 | Person 2 | Person 3 | Person 4 | Probability
Yes      | No       | No       | No       | μ(1−μ)³
No       | Yes      | No       | No       | μ(1−μ)³
No       | No       | Yes      | No       | μ(1−μ)³
No       | No       | No       | Yes      | μ(1−μ)³
Each sequence has the same probability μ(1−μ)³. Total p(exactly 1 Yes in 4 trials) = 4 × μ(1−μ)³.
Counting Combinations C(N,m) [Math]

How many distinct orderings of N trials contain exactly m events? This is the combination "N choose m" — order doesn't matter, only which m positions are "Yes":

Combinations — "N choose m"
C(N, m) = N! / (m! · (N − m)!)

m events in N=4 trials | C(4, m) | Interpretation
m = 0 | 1 | Only 1 way: (No,No,No,No)
m = 1 | 4 | 4 ways: Yes can be in any 1 of 4 positions
m = 2 | 6 | 6 ways to choose which 2 of 4 positions are Yes
m = 3 | 4 | 4 ways: the single No can be in any 1 of 4 positions
m = 4 | 1 | Only 1 way: (Yes,Yes,Yes,Yes)
Total: 1+4+6+4+1 = 16 = 2⁴ ✓ — all possible binary sequences of length 4 are accounted for.
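R's `choose()` computes C(N, m) directly:

```r
choose(4, 0:4)        # 1 4 6 4 1 — matches the table
sum(choose(4, 0:4))   # 16 = 2^4: every binary sequence of length 4
```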
The Binomial PMF [Core]
Binomial distribution PMF
p(m | N, μ) = C(N, m) · μ^m · (1−μ)^(N−m)

m ∈ {0, 1, 2, …, N}
Full table for N=4:
p(0|4,μ) = 1·μ⁰·(1−μ)⁴
p(1|4,μ) = 4·μ¹·(1−μ)³
p(2|4,μ) = 6·μ²·(1−μ)²
p(3|4,μ) = 4·μ³·(1−μ)¹
p(4|4,μ) = 1·μ⁴·(1−μ)⁰
Bernoulli as a special case: Setting N=1 recovers the Bernoulli PMF — m ∈ {0,1} and C(1,m)=1. The Bernoulli is a Binomial with N=1.
Interactive: Binomial PMF Explorer [Interactive]

(Interactive chart in the original page.) Adjusting N and μ shows how the distribution of the "number of events m" changes; each bar shows p(m | N, μ).
Movie Example: p(m | N=4, μ_TRUE=0.2) [Concrete Numbers]

Assuming the TRUE probability of liking the movie is 0.2:

m (# who said Yes) | p(m | N=4, μ=0.2) | Note
0 | ≈ 0.41 (41%) | Most likely single outcome (tied with m=1)
1 | ≈ 0.41 (41%) | Equally likely — also very common
2 | ≈ 0.15 (15%) | Small but non-negligible
3 | ≈ 0.026      | Unlikely
4 | ≈ 0.002      | Very unlikely
If we observed m=0 → MLE = 0.0. If we observed m=2 → MLE = 0.5. Both are far from the truth (0.2), yet both happen with meaningful probability. MLE is unreliable with small data.
What's the solution? Incorporate additional information about plausible values of μ before seeing data — that is, a prior distribution. This is where Bayesian inference picks up in the next lecture.
Binomial in R [R Code]
# dbinom(x, size, prob) — Binomial PMF
# x ↔ m (number of events)
# size ↔ N (number of trials)
# prob ↔ μ (event probability)

# P(exactly 1 Yes in 4 trials) with μ=0.2
dbinom(x = 1, size = 4, prob = 0.2)
# → 0.4096

# Full PMF for N=8, μ=0.2
m_vals <- 0:8
probs  <- dbinom(m_vals, size = 8, prob = 0.2)

# Plot it
barplot(probs, names.arg = m_vals,
        xlab = "m", ylab = "p(m | μ)",
        main = "Binomial PMF — N=8, μ=0.2")
Bayesian Approach & MAP
// prior distributions · Beta conjugate prior · posterior derivation · MAP estimate · credible intervals
Why a Prior? The Problem With MLE on Small / Biased Data [Motivation]

MLE gives μ̂ = m/N — fine with lots of data, but badly misleading with small or biased samples. Example from the lecture:

Source            | Like (m) | Total (N) | μ̂ MLE
StarFanatic.com   |    23    |    25     | 0.92
IHateStarWars.com |     1    |    45     | 0.02
Both estimates are far from the true value (~0.30). The solution is to introduce a prior distribution over μ that encodes our initial beliefs — before seeing any (biased) data.
"About half the people I know like TLJ… I don't really think IHateStarWars seems very accurate…"

This illustrates an important feature of Bayesian inference: the prior is inherently subjective — two analysts could choose different priors and reach different posteriors from the same data. This is sometimes seen as a limitation, but also a strength: it forces you to make your assumptions explicit. Crucially, as data accumulates the likelihood dominates and the posterior converges regardless of the starting prior — the data ultimately guides us toward the truth.
The Bayesian Formulation [Core]

Start with an initial hypothesis, then update it based on data. The key equation:

p(μ | x) = p(x | μ) · p(μ) / p(x)
Posterior = Likelihood · Prior / Evidence
Proportional form (ignoring normalizing constant)
p(μ | x) ∝ p(x | μ) · p(μ)
The evidence p(x) is a constant with respect to μ — it doesn't affect the argmax. So for optimization (MAP), we can ignore it and work with just likelihood × prior.
The Beta Distribution — Conjugate Prior for the Binomial [Core]

When the likelihood is Binomial/Bernoulli, the natural choice of prior is the Beta distribution. It is a continuous PDF bounded between 0 and 1 — exactly the domain of μ.

Beta PDF
Beta(μ | a, b) = Γ(a+b) / [Γ(a)·Γ(b)] · μ^(a−1) · (1−μ)^(b−1)

Parameter | Interpretation | Example shape
a | Prior "pseudo-count" of events (successes)    | a=1, b=1 → Uniform (flat prior)
b | Prior "pseudo-count" of non-events (failures) | a=3, b=3 → symmetric hill at 0.5
Conjugate prior: Because Beta is conjugate to the Binomial, the posterior is also a Beta distribution — just with updated parameters. This means no numerical integration is needed!
Beta Mean
E[μ | a, b] = a / (a + b)
Beta Mode (peak of the distribution)
Mode = (a − 1) / (a + b − 2)
Note: the mode of the Beta distribution will become the MAP estimate once we plug in the updated posterior parameters a_new and b_new — that derivation comes in the next card.
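`dbeta()` evaluates the Beta PDF in R; for a symmetric Beta(3, 3), the numeric peak of the density agrees with the mode formula:

```r
a <- 3; b <- 3
mu_grid <- seq(0.001, 0.999, by = 0.001)
dens <- dbeta(mu_grid, a, b)        # Beta(3, 3): symmetric hill at 0.5

beta_mean <- a / (a + b)            # 0.5
beta_mode <- (a - 1) / (a + b - 2)  # 0.5
mu_grid[which.max(dens)]            # numeric peak, agrees with the mode
```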
Deriving the Posterior — Beta × Binomial [Walk-through]
1
Multiply likelihood and prior
p(μ|m,N) ∝ Binomial(m|N,μ) × Beta(μ|a,b)
∝ μ^m · (1−μ)^(N−m) · μ^(a−1) · (1−μ)^(b−1)
2
Combine powers of like bases
∝ μ^(m+a−1) · (1−μ)^(N−m+b−1)
3
Recognize the Beta form — define updated hyperparameters
a_new = m + a     b_new = N − m + b
Posterior = Beta(μ | a_new, b_new)

The prior parameters act like "imaginary" prior observations — they shift and concentrate the posterior.
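The conjugate update is one line of arithmetic. A sketch with hypothetical numbers — a Beta(2, 2) prior and 1 Yes in 4 trials, not the lecture's values:

```r
a <- 2; b <- 2   # hypothetical prior pseudo-counts (assumed, not from the lecture)
m <- 1; N <- 4   # observed: 1 event in 4 trials

a_new <- m + a       # 3
b_new <- N - m + b   # 5
# Posterior is Beta(3, 5); its density is dbeta(mu, a_new, b_new) —
# no numerical integration needed
```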

The MAP Estimate — Solving for μ̂ [Core]

MAP = Maximum A Posteriori: find μ that maximizes the posterior. Because Evidence is constant, this means maximizing Likelihood × Prior.

MAP Objective
μ̂_MAP = argmax_μ p(x|μ) · p(μ)
1
Take log-posterior
log p(μ|x) = (a_new − 1)·log μ + (b_new − 1)·log(1 − μ) + const
2
Differentiate and set to zero
d/dμ = (a_new − 1)/μ − (b_new − 1)/(1 − μ) = 0
3
Solve for μ → MAP Estimate
μ̂_MAP = (a_new − 1) / (a_new + b_new − 2)
= (m + a − 1) / (N + a + b − 2)

This is a weighted average of the MLE (m/N) and the prior mode ((a−1)/(a+b−2)). With more data, it approaches the MLE.

Key Insight: The MAP estimate is a compromise between your prior beliefs and the data. As N → ∞, the data dominates and μ̂_MAP → μ̂_MLE.
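The convergence of MAP to MLE can be seen by holding the sample proportion fixed and growing N. The Beta(2, 5) prior below is a hypothetical choice, not from the lecture:

```r
map_estimate <- function(m, N, a, b) (m + a - 1) / (N + a + b - 2)

a <- 2; b <- 5   # hypothetical prior (assumed values)
for (N in c(4, 40, 400)) {
  m <- N / 4     # sample proportion held fixed at 0.25
  cat("N =", N, ": MLE =", m / N,
      ", MAP =", round(map_estimate(m, N, a, b), 3), "\n")
}
# The MAP estimates drift toward the MLE (0.25) as N grows
```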
Prior vs. Likelihood Influence [Intuition]
Small dataset (N=45): Posterior is pulled strongly toward the prior. Biased site IHateStarWars (1/45) is overridden by RT prior (199/485).
Small data example (a=199, b=286, m=1, N=45)
μ̂_MAP ≈ 0.218
Large dataset (N=18,923): Posterior looks like the likelihood. Metacritic dominates the RT prior.
Large data (Metacritic as likelihood)
Posterior ≈ Likelihood
The more data you have, the less the prior matters. With infinite data, the prior is irrelevant.
Credible Intervals vs. Confidence Intervals [Comparison]

Both methods quantify uncertainty, but they differ philosophically and practically:

Property | Bayesian — Credible Interval | Frequentist — Confidence Interval
Interpretation | There is X% probability that μ lies in this range, given the observed data | If we repeated the experiment many times, X% of the intervals would contain the true μ
Incorporates prior? | Yes — the prior prevents extreme estimates | No — can give unrealistic values (e.g., 0 or 1) with small N
Small N behavior | Stays reasonable thanks to the prior (e.g., m=0 gives mean ≈ 0.2, not 0) | Can collapse to a point or span implausible values
In R (Binomial) | qbeta() on the posterior Beta(a_new, b_new) | binom.test() — Clopper-Pearson method
The Bayesian approach prevents unrealistic values due to the influence of the prior! Classical 90% CI for m=0 out of N=4 extends all the way to [0, 0.49], while the Bayesian credible interval stays sensibly bounded.
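Both intervals for the m = 0, N = 4 case can be computed in R. The Beta(2, 4) prior here is a hypothetical stand-in, since the lecture's exact prior is not reproduced above:

```r
m <- 0; N <- 4

# Frequentist 90% confidence interval (Clopper-Pearson)
ci <- binom.test(m, N, conf.level = 0.90)$conf.int

# Bayesian 90% credible interval under a hypothetical Beta(2, 4) prior
a <- 2; b <- 4
cred <- qbeta(c(0.05, 0.95), a + m, b + N - m)

ci; cred   # compare the two upper ends
```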
How Prior Choice Affects the Posterior [Sensitivity]

Four types of prior are commonly considered. The posterior behaves differently depending on how informative the prior is:

Prior Type | Beta Params | Effect on Posterior
Informative | Large a, small b (e.g., a=100, b=3) | Tightly concentrated near its mean a/(a+b) (≈ 0.97 here); strongly resists the data pulling the posterior away
Less than 50% | a < b (e.g., a=4, b=11) | Probability mass mostly below 0.5; moderately informative
Uniform | a=1, b=1 | Flat — all values of μ equally likely; MAP equals MLE
Vague | a<1, b<1 (e.g., a=0.1, b=0.1) | U-shaped — probability mass piles at the extremes (0 and 1); anti-regularizing
With more trials (e.g., N=40 vs N=20), the posterior becomes tighter and less sensitive to the choice of prior. All four priors converge as N grows.
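A quick sensitivity check in R, computing the posterior mean (a + m)/(a + b + N) under each of the four priors for one hypothetical dataset (m = 8 events in N = 40 trials — assumed numbers, not the lecture's):

```r
m <- 8; N <- 40   # hypothetical data (assumed)
priors <- list(informative = c(100, 3),
               below_half  = c(4, 11),
               uniform     = c(1, 1),
               vague       = c(0.1, 0.1))

# Posterior mean (a + m) / (a + b + N) under each prior
post_means <- sapply(priors, function(p) (p[1] + m) / (p[1] + p[2] + N))
round(post_means, 3)   # the informative prior still dominates at N = 40
```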
Posterior Summary Statistics [Reading Results]

Rather than reporting just a point estimate, Bayesian analysis reports the full posterior distribution. Key summaries:

Central tendency
Median = 50th quantile of Beta
Mean = a_new / (a_new + b_new)
Uncertainty (credible intervals)
50% CI → 25th to 75th quantile
90% CI → 5th to 95th quantile
In the lecture's m=0 out of N=4 example: the posterior mean ≈ 0.2 (not 0!), and the MLE of 0 lies below the posterior's 5th percentile — consistent with our prior belief that μ exceeds 0.05. The prior saves us from the nonsensical MLE of 0.
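These summaries are all quantiles or moments of the posterior Beta. A sketch for m = 0, N = 4 under a hypothetical Beta(2, 4) prior — chosen so the posterior mean is 0.2; the lecture's exact prior is not stated above:

```r
a <- 2; b <- 4; m <- 0; N <- 4      # Beta(2, 4) prior is an assumed example
a_new <- a + m; b_new <- b + N - m  # posterior is Beta(2, 8)

post_mean   <- a_new / (a_new + b_new)       # 0.2 — not 0, unlike the MLE
post_median <- qbeta(0.50, a_new, b_new)
ci50 <- qbeta(c(0.25, 0.75), a_new, b_new)   # 50% credible interval
ci90 <- qbeta(c(0.05, 0.95), a_new, b_new)   # 90% credible interval
```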