INFSCI 2595 · Week 4

Applied ML · Probability Basics & Bayes

Interactive Study Guide — Lecture Notes Companion
Week 4 Overview
// why probability matters · Bayesian motivation · lecture roadmap
Why Probability in Machine Learning? [Motivation]

Statistics and probability play an important role in many machine learning algorithms. The core Bayesian insight is that model parameters are not fixed unknown constants, but random variables with their own probability distributions. This means we can express genuine uncertainty about parameters — not just uncertainty from sampling noise, but inherent uncertainty that we update as we observe data.

Bayesian: Parameters are random variables. Uncertainty is represented as a probability distribution over parameter values — a prior before data, a posterior after.

Frequentist (MLE): Parameters are fixed unknowns. The only uncertainty comes from sampling noise — we get a point estimate (e.g. μ̂ = m/N) and quantify variability with confidence intervals based on repeated sampling.
What This Lecture Is (and Isn't) [Important]

In this lecture we are NOT training a predictive model. Instead we are using the rules and concepts of probability to describe behavior — collecting observations of an event and asking what probability best explains what we observed.

This is the probabilistic foundation that makes logistic regression and all Bayesian ML work. We need to understand distributions and likelihoods before we can fit models.
Lecture 4 Roadmap [Structure]

# | Topic | Key takeaway
1 | Probability Basics | Event probability, joint, marginal, conditional, independence — the Panda/Duck example
2 | Bayes' Theorem | Posterior = Likelihood × Prior / Evidence; diagnostic test worked example
3 | Bernoulli Distribution | PMF for binary outcomes; the Star Wars movie example
4 | Likelihood & MLE | Product of N independent Bernoullis → log-likelihood → μ̂ = m/N
5 | Binomial Distribution | Counting sequences, combinations, p(m|N,μ), R's dbinom(); small-data problem
6 | Bayesian Approach & MAP | Prior distributions, Beta conjugate prior, posterior derivation, MAP estimate μ̂ = (m+a−1)/(N+a+b−2)
7 | Credible vs. Confidence Intervals | Bayesian credible intervals vs. frequentist confidence intervals; prior prevents unrealistic values
Probability Basics
// event probability · joint · marginal (sum rule) · conditional · independence · chain rule
Probability of an Event [Core]

The probability of an event is the proportion of times that event occurs out of the total number of trials.

Binary response y — class probabilities
p(y = 1) = probability of the EVENT class, denoted μ
p(y = 0) = probability of the NON-EVENT class = 1 − μ
The sample probability of each class is the count of that class in the dataset divided by the total size. This is the simplest estimate of p(y).
Joint Probabilities — The Panda/Duck Example [Example]

We have two variables: Animal (Panda or Duck) and Color (Red or Black). A random sample of 16 objects gives the following counts:

                 | Red | Black | Marginal (Animal)
Panda            |  4  |   3   |  7
Duck             |  6  |   3   |  9
Marginal (Color) | 10  |   6   | 16

Joint probabilities — e.g. p(Panda, Red) = 4/16 = 0.25: the probability of being both a Panda and Red simultaneously.
The Sum Rule — Marginal from Joint [Sum Rule]

The marginal probability of a variable is calculated by summing across its intersections with all levels of a second variable.

Discrete B
p(A = a) = Σ_b p(A = a, B = b)
Continuous B
p(A = a) = ∫ p(a, b) db
Animal example:
p(Panda) = p(P,Red) + p(P,Black) = 4/16 + 3/16 = 7/16
p(Duck)  = p(D,Red) + p(D,Black) = 6/16 + 3/16 = 9/16
p(Red)   = p(R,Panda) + p(R,Duck) = 4/16 + 6/16 = 10/16
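The sum rule is easy to verify numerically. A minimal R sketch — the `counts` matrix simply transcribes the Panda/Duck table above:

```r
# Joint counts from the Panda/Duck sample (rows = Animal, cols = Color)
counts <- matrix(c(4, 3,
                   6, 3),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(c("Panda", "Duck"), c("Red", "Black")))

joint <- counts / sum(counts)   # joint probabilities p(Animal, Color)

# Sum rule: marginals are row / column sums of the joint table
p_animal <- rowSums(joint)      # p(Panda) = 7/16, p(Duck) = 9/16
p_color  <- colSums(joint)      # p(Red) = 10/16, p(Black) = 6/16
```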
Independent Events [Definition]

Events A and B are independent if observing A does not affect the probability of B. The joint probability becomes the product of the marginals:

Independence — formal definition
p(A, B) = p(A) · p(B) if and only if A and B are independent
✓ Independent: Flip a coin twice. The 2nd flip is 50/50 regardless of the 1st — knowing the 1st gives no information about the 2nd.
✓ Independent: "A coin lands heads" and "it is not raining outside." These events are unrelated — one does not influence the other.
The independence assumption is critical for the likelihood function later: assuming the N observations are independent lets us factor the joint distribution into a product of N simple Bernoulli distributions.
Conditional Probability & the Chain Rule [Core]

Split the 16 animals into two groups. Group 1 has 12 animals, Group 2 has 4. The probability of selecting each animal now depends on which group you're in:

         | Group 1 | Group 2 | Marginal
Duck     |    6    |    3    |  9
Panda    |    6    |    1    |  7
Marginal |   12    |    4    | 16
Group 1 (n=12)
p(Duck | G=1) = 6/12 = 0.5
p(Panda | G=1) = 6/12 = 0.5
Group 2 (n=4)
p(Duck | G=2) = 3/4 = 0.75
p(Panda | G=2) = 1/4 = 0.25
Formal definition and chain rule
p(A | G) = p(A, G) / p(G)

↔ p(A, G) = p(A | G) · p(G) ← chain rule
The chain rule rearranges the conditional probability definition. Setting p(A,G) = p(G,A) and applying the chain rule in both directions is exactly how Bayes' theorem is derived.
Summary: Three Types of Probability [Summary]

Type        | Notation | Meaning
Marginal    | p(A)     | Probability of A irrespective of any other variable.
Joint       | p(A, B)  | Probability of events A and B occurring together.
Conditional | p(A | B) | Probability of A given that B has already occurred.
Bayes' Theorem
// derivation · posterior · likelihood · prior · evidence · diagnostic test example
Deriving Bayes' Theorem [Core]
1
Symmetry of joint probability
p(A, B) = p(B, A)
2
Apply the chain rule in both directions
p(A|B) · p(B) = p(B|A) · p(A)
3
Solve for p(A|B) → Bayes' Theorem
p(A | B) = p(B | A) · p(A) / p(B)
This lets us reverse the conditioning — compute p(A|B) from p(B|A).
The Four Named Components [Anatomy]

p(A | B) = p(B | A) · p(A) / p(B)
Posterior = Likelihood · Prior / Evidence

Name       | Symbol | Role
Posterior  | p(A|B) | Updated belief about A after observing evidence B. This is what we want.
Likelihood | p(B|A) | How probable is evidence B if A were true? Comes from the model / data.
Prior      | p(A)   | Our belief about A before any evidence. Encodes background knowledge.
Evidence   | p(B)   | Total probability of the evidence, summed across all outcomes. Normalizing constant.
The posterior represents updated beliefs about the prior based on new evidence.
Animal Example: Applying Bayes [Walk-through]

We randomly sample one animal from one of two groups. We observe it is a Duck. What is the probability it came from Group 2?

         | Group 1 | Group 2 | Marginal
Duck     |    6    |    3    |  9
Panda    |    6    |    1    |  7
Marginal |   12    |    4    | 16
1
Set up the known quantities
p(G=1) = 12/16 = 3/4  ·  p(G=2) = 4/16 = 1/4
p(Duck|G=1) = 6/12 = 1/2  ·  p(Duck|G=2) = 3/4
2
Compute the evidence p(Duck) via total probability
p(Duck) = p(Duck|G=1)·p(G=1) + p(Duck|G=2)·p(G=2) = (1/2)(3/4) + (3/4)(1/4) = 3/8 + 3/16 = 9/16
3
Apply Bayes' Theorem
p(G=2|Duck) = p(Duck|G=2)·p(G=2)/p(Duck) = (3/4·1/4)/(9/16) = (3/16)/(9/16) = 1/3 ≈ 0.333
Although Group 2 is small (prior = 1/4), seeing a Duck raises its probability because ducks are proportionally more common in Group 2.
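The three steps above can be reproduced in a few lines of R:

```r
# p(G = 2 | Duck) via Bayes' theorem
p_g      <- c(12/16, 4/16)   # priors p(G=1), p(G=2)
p_duck_g <- c(6/12, 3/4)     # likelihoods p(Duck|G=1), p(Duck|G=2)

evidence  <- sum(p_duck_g * p_g)         # total probability p(Duck) = 9/16
posterior <- p_duck_g * p_g / evidence   # p(G=1|Duck), p(G=2|Duck)

posterior[2]   # 1/3: the Duck observation raises Group 2 from 1/4 to 1/3
```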
Diagnostic Test for a Disease [Base Rate Neglect]

The classic Bayesian example. Dis and Test are binary variables for disease state and test result.

Quantity                         | Symbol            | Value
Sensitivity (true positive rate) | p(Test=1 | Dis=1) | 0.95
False positive rate              | p(Test=1 | Dis=0) | 0.01
Prevalence (prior)               | p(Dis=1)          | 1/100,000 = 0.00001
Goal: posterior                  | p(Dis=1 | Test=1) | ?
1
Evidence: p(Test=1) via total probability
= 0.95×0.00001 + 0.01×0.99999 = 0.0000095 + 0.0099999 ≈ 0.01
2
Posterior via Bayes' Theorem
p(Dis=1|Test=1) = (0.95 × 0.00001) / 0.01 ≈ 0.001 = 0.1%
Key lesson — Base Rate Neglect: Even with a 95%-sensitive test, a positive result only means ~0.1% probability of having the disease — because the disease is extremely rare (1 in 100,000). The prior dominates. Most people intuitively over-estimate this probability by ignoring the base rate.
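The same two steps in R, using the table's numbers:

```r
sens <- 0.95    # sensitivity p(Test=1 | Dis=1)
fpr  <- 0.01    # false positive rate p(Test=1 | Dis=0)
prev <- 1e-5    # prevalence p(Dis=1)

evidence  <- sens * prev + fpr * (1 - prev)   # p(Test=1), about 0.01
posterior <- sens * prev / evidence           # p(Dis=1 | Test=1)
posterior   # just under 0.001 — about 0.1%, despite the 95% sensitivity
```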
Bernoulli Distribution
// binary outcome encoding · PMF · parameter μ · the movie example · visualizing p(y|μ)
The Running Example: Star Wars [Setup]

We ask people: Did you like Star Wars Episode VIII — The Last Jedi? The answer is binary, encoded exactly as in Week 3:

Event: liked the movie
y = 1
Non-event: did NOT like it
y = 0
We denote the probability someone liked the movie — p(y=1) — by the parameter μ. It follows that p(y=0) = 1 − μ. Since μ is a probability: 0 ≤ μ ≤ 1.
The Bernoulli PMF [Core]

The probability mass function over both values y ∈ {0, 1} can be written compactly in a single expression — named after Jacob Bernoulli:

Bernoulli PMF
p(y | μ) = μ^y · (1 − μ)^(1−y)
Verify when y = 1
p(1|μ) = μ^1 · (1−μ)^0 = μ · 1 = μ
Verify when y = 0
p(0|μ) = μ^0 · (1−μ)^1 = 1 · (1−μ) = 1−μ
The Bernoulli distribution is applicable for any binary outcome problem: sports (win/lose), engineering (pass/fail), manufacturing (defect/no defect), medicine (disease/healthy), etc.
Visualizing the Bernoulli PMF for Different μ Values [Intuition]

The PMF has only two bars — at y=0 and y=1. Their heights depend entirely on μ:

μ    | p(y=0) = 1−μ | p(y=1) = μ   | Interpretation
0.15 | 0.85 (tall)  | 0.15 (short) | Event is infrequent — 0s are more common
0.45 | 0.55         | 0.45         | Near-equal — high uncertainty
0.75 | 0.25 (short) | 0.75 (tall)  | Event is common — mostly 1s
As μ increases from 0 to 1, the bar at y=0 shrinks and the bar at y=1 grows. When μ=0.5, both bars are equal — maximum uncertainty.
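In R the Bernoulli PMF is `dbinom()` with `size = 1`; a quick check of the three μ values in the table:

```r
# Bernoulli PMF p(y | mu) = mu^y * (1 - mu)^(1 - y): dbinom with size = 1
for (mu in c(0.15, 0.45, 0.75)) {
  pmf <- dbinom(0:1, size = 1, prob = mu)   # heights of the y = 0 and y = 1 bars
  cat("mu =", mu, "-> p(y=0) =", pmf[1], ", p(y=1) =", pmf[2], "\n")
}
```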
Likelihood Function & MLE
// independence assumption · product of Bernoullis · log-likelihood derivation · μ̂ = m/N
Extending to N Observations [Core]

We ask 4 people. We observe the sequence: Person 1=No, 2=No, 3=Yes, 4=No. What is the probability of this sequence?

The joint probability p(x₁=0, x₂=0, x₃=1, x₄=0 | μ) seems hard — but assuming people respond independently, we can factor it into a product of 4 Bernoullis.
Independence: joint = product of marginals
p(A, B) = p(A|B) · p(B) = p(A) · p(B) when A ⊥ B
Likelihood function — N independent Bernoullis
p(x | μ) = ∏ₙ₌₁ᴺ Bernoulli(xₙ | μ) = ∏ₙ₌₁ᴺ μ^xₙ · (1−μ)^(1−xₙ)
4-person example: p(0,0,1,0|μ) = (1−μ)(1−μ)μ(1−μ) = μ¹·(1−μ)³
Maximum Likelihood Estimation (MLE) [Core]

When μ is unknown, we find the value that best explains the data — the value that maximizes the likelihood:

MLE objective
μ̂_MLE = argmax_{μ ∈ [0,1]} p(x | μ)
Rather than maximizing the likelihood directly (a product of many small numbers), we maximize the log-likelihood instead. Since log is monotone increasing, the maximizer is the same.
Log-Likelihood Derivation — Step by Step [Walk-through]
1
Product → Sum (log of product = sum of logs)
log[p(x|μ)] = Σₙ₌₁ᴺ log[μ^xₙ · (1−μ)^(1−xₙ)]
2
Apply log(aᵇ) = b·log(a)
log[p(x|μ)] = Σₙ ( xₙ·log[μ] + (1−xₙ)·log[1−μ] )
3
Since μ does not depend on the index n: pull log(μ) and log(1−μ) outside the sums
= log(μ) · Σxₙ + log(1−μ) · Σ(1−xₙ)
= log(μ) · m + log(1−μ) · (N − m)

where m = number of events (people who said Yes) and N − m = number of non-events

Final compact form:
log[p(x|μ)] = log(μ) × m + log(1−μ) × (N − m)
Solving for μ̂_MLE [Result]
1
Take the derivative with respect to μ
d/dμ log[p(x|μ)] = m/μ − (N−m)/(1−μ)
2
Set derivative to zero and solve
m/μ − (N−m)/(1−μ) = 0  →  (1−μ)·m = μ·(N−m)
m − μ·m = μ·N − μ·m  →  m = μ·N
3
The MLE is just counting!
μ̂_MLE = m / N = (1/N) · Σₙ xₙ
The MLE for the Bernoulli parameter is the sample proportion of events.
⚠️ Small-data warning: With only N=4 people and μ_TRUE=0.2, observing 0 Yes (μ̂=0.0) has ~40% probability. Observing 2 Yes (μ̂=0.5) has ~15% probability. The MLE can be far from the truth purely by chance with small samples.
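The closed-form result μ̂ = m/N can be checked against a numerical maximizer of the log-likelihood; a small R sketch:

```r
# Log-likelihood of m events in N independent Bernoulli trials
log_lik <- function(mu, m, N) m * log(mu) + (N - m) * log(1 - mu)

m <- 1; N <- 4
opt <- optimize(log_lik, interval = c(1e-6, 1 - 1e-6),
                m = m, N = N, maximum = TRUE)
opt$maximum   # numerically close to 0.25, matching the closed form m / N
```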
Binomial Distribution
// counting sequences · combinations · PMF derivation · interactive explorer · R's dbinom()
A Different Question [Motivation]

Bernoulli answers: what is p(y=1) for a single trial?
Binomial answers: what is the probability the event occurs exactly m times out of N trials?

Movie example shift: instead of "will this one person like the movie?", we ask "if we ask 4 people, what is p(exactly 1 out of 4 says Yes)?"
Multiple Sequences Have the Same Count [Key Insight]

The sequence (No, No, Yes, No) gives 1 Yes out of 4 — and in total there are 4 orderings that give exactly 1 Yes:

Person 1 | Person 2 | Person 3 | Person 4 | Probability
Yes      | No       | No       | No       | μ(1−μ)³
No       | Yes      | No       | No       | μ(1−μ)³
No       | No       | Yes      | No       | μ(1−μ)³
No       | No       | No       | Yes      | μ(1−μ)³
Each sequence has the same probability μ(1−μ)³. Total p(exactly 1 Yes in 4 trials) = 4 × μ(1−μ)³.
Counting Combinations C(N,m) [Math]

How many distinct orderings of N trials contain exactly m events? This is the combination "N choose m" — order doesn't matter, only which m positions are "Yes":

Combinations — "N choose m"
C(N, m) = N! / (m! · (N − m)!)

m events in N=4 trials | C(4, m) | Interpretation
m = 0 | 1 | Only 1 way: (No,No,No,No)
m = 1 | 4 | 4 ways: Yes can be in any 1 of 4 positions
m = 2 | 6 | 6 ways to choose which 2 of 4 positions are Yes
m = 3 | 4 | 4 ways: the single No can be in any 1 of 4 positions
m = 4 | 1 | Only 1 way: (Yes,Yes,Yes,Yes)
Total: 1+4+6+4+1 = 16 = 2⁴ ✓ — all possible binary sequences of length 4 are accounted for.
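R's `choose()` computes C(N, m) directly:

```r
choose(4, 0:4)        # 1 4 6 4 1 — matches the table
sum(choose(4, 0:4))   # 16 = 2^4: every binary sequence of length 4
```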
The Binomial PMF [Core]
Binomial distribution PMF
p(m | N, μ) = C(N, m) · μ^m · (1−μ)^(N−m)

m ∈ {0, 1, 2, …, N}
Full table for N=4:
p(0|4,μ) = 1·μ⁰·(1−μ)⁴
p(1|4,μ) = 4·μ¹·(1−μ)³
p(2|4,μ) = 6·μ²·(1−μ)²
p(3|4,μ) = 4·μ³·(1−μ)¹
p(4|4,μ) = 1·μ⁴·(1−μ)⁰
Bernoulli as a special case: Setting N=1 recovers the Bernoulli PMF — m ∈ {0,1} and C(1,m)=1. The Bernoulli is a Binomial with N=1.
Interactive: Binomial PMF Explorer [Interactive]

(Interactive chart in the original page.) Adjusting N and μ shows how the distribution of the "number of events m" changes; each bar shows p(m | N, μ).
Movie Example: p(m | N=4, μ_TRUE=0.2) [Concrete Numbers]

Assuming the TRUE probability of liking the movie is 0.2:

m (# who said Yes) | p(m | N=4, μ=0.2) | Note
0 | ≈ 0.41 (41%) | Most likely single outcome (tied with m=1)
1 | ≈ 0.41 (41%) | Equally likely — also very common
2 | ≈ 0.15 (15%) | Small but non-negligible
3 | ≈ 0.026      | Unlikely
4 | ≈ 0.002      | Very unlikely
If we observed m=0 → MLE = 0.0. If we observed m=2 → MLE = 0.5. Both are far from the truth (0.2), yet both happen with meaningful probability. MLE is unreliable with small data.
What's the solution? Incorporate additional information about plausible values of μ before seeing data — that is, a prior distribution. This is where Bayesian inference picks up in the next lecture.
Binomial in R [R Code]
# dbinom(x, size, prob) — Binomial PMF
# x ↔ m (number of events)
# size ↔ N (number of trials)
# prob ↔ μ (event probability)

# P(exactly 1 Yes in 4 trials) with μ=0.2
dbinom(x = 1, size = 4, prob = 0.2)
# → 0.4096

# Full PMF for N=8, μ=0.2
m_vals <- 0:8
probs  <- dbinom(m_vals, size = 8, prob = 0.2)

# Plot it
barplot(probs, names.arg = m_vals,
        xlab = "m", ylab = "p(m | μ)",
        main = "Binomial PMF — N=8, μ=0.2")
Bayesian Approach & MAP
// prior distributions · Beta conjugate prior · posterior derivation · MAP estimate · credible intervals
Why a Prior? The Problem With MLE on Small / Biased Data [Motivation]

MLE gives μ̂ = m/N — fine with lots of data, but badly misleading with small or biased samples. Example from the lecture:

Source            | Like (m) | Total (N) | μ̂ MLE
StarFanatic.com   |    23    |    25     | 0.92
IHateStarWars.com |     1    |    45     | 0.02
Both estimates are far from the true value (~0.30). The solution is to introduce a prior distribution over μ that encodes our initial beliefs — before seeing any (biased) data.
"About half the people I know like TLJ… I don't really think IHateStarWars seems very accurate…"

This illustrates an important feature of Bayesian inference: the prior is inherently subjective — two analysts could choose different priors and reach different posteriors from the same data. This is sometimes seen as a limitation, but also a strength: it forces you to make your assumptions explicit. Crucially, as data accumulates the likelihood dominates and the posterior converges regardless of the starting prior — the data ultimately guides us toward the truth.
The Bayesian Formulation [Core]

Start with an initial hypothesis, then update it based on data. The key equation:

p(μ | x) = p(x | μ) · p(μ) / p(x)
Posterior = Likelihood · Prior / Evidence
Proportional form (ignoring normalizing constant)
p(μ | x) ∝ p(x | μ) · p(μ)
The evidence p(x) is a constant with respect to μ — it doesn't affect the argmax. So for optimization (MAP), we can ignore it and work with just likelihood × prior.
The Beta Distribution — Conjugate Prior for the Binomial [Core]

When the likelihood is Binomial/Bernoulli, the natural choice of prior is the Beta distribution. It is a continuous PDF bounded between 0 and 1 — exactly the domain of μ.

Beta PDF
Beta(μ | a, b) = Γ(a+b) / [Γ(a)·Γ(b)] · μ^(a−1) · (1−μ)^(b−1)

Parameter | Interpretation | Example shape
a | Prior "pseudo-count" of events (successes)    | a=1, b=1 → Uniform (flat prior)
b | Prior "pseudo-count" of non-events (failures) | a=3, b=3 → symmetric hill at 0.5
Conjugate prior: Because Beta is conjugate to the Binomial, the posterior is also a Beta distribution — just with updated parameters. This means no numerical integration is needed!
Beta Mean
E[μ | a, b] = a / (a + b)
Beta Mode (peak of the distribution)
Mode = (a − 1) / (a + b − 2)
Note: the mode of the Beta distribution will become the MAP estimate once we plug in the updated posterior parameters a_new and b_new — that derivation comes in the next card.
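`dbeta()` evaluates the Beta PDF in R; for a symmetric Beta(3, 3), the numeric peak of the density agrees with the mode formula:

```r
a <- 3; b <- 3
mu_grid <- seq(0.001, 0.999, by = 0.001)
dens <- dbeta(mu_grid, a, b)        # Beta(3, 3): symmetric hill at 0.5

beta_mean <- a / (a + b)            # 0.5
beta_mode <- (a - 1) / (a + b - 2)  # 0.5
mu_grid[which.max(dens)]            # numeric peak, agrees with the mode
```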
Deriving the Posterior — Beta × Binomial [Walk-through]
1
Multiply likelihood and prior
p(μ|m,N) ∝ Binomial(m|N,μ) × Beta(μ|a,b)
∝ μ^m · (1−μ)^(N−m) · μ^(a−1) · (1−μ)^(b−1)
2
Combine powers of like bases
∝ μ^(m+a−1) · (1−μ)^(N−m+b−1)
3
Recognize the Beta form — define updated hyperparameters
a_new = m + a     b_new = N − m + b
Posterior = Beta(μ | a_new, b_new)

The prior parameters act like "imaginary" prior observations — they shift and concentrate the posterior.
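The conjugate update is one line of arithmetic. A sketch with hypothetical numbers — a Beta(2, 2) prior and 1 Yes in 4 trials, not the lecture's values:

```r
a <- 2; b <- 2   # hypothetical prior pseudo-counts (assumed, not from the lecture)
m <- 1; N <- 4   # observed: 1 event in 4 trials

a_new <- m + a       # 3
b_new <- N - m + b   # 5
# Posterior is Beta(3, 5); its density is dbeta(mu, a_new, b_new) —
# no numerical integration needed
```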

The MAP Estimate — Solving for μ̂ [Core]

MAP = Maximum A Posteriori: find μ that maximizes the posterior. Because Evidence is constant, this means maximizing Likelihood × Prior.

MAP Objective
μ̂_MAP = argmax_μ p(x|μ) · p(μ)
1
Take log-posterior
log p(μ|x) = (a_new − 1)·log μ + (b_new − 1)·log(1 − μ) + const
2
Differentiate and set to zero
d/dμ = (a_new − 1)/μ − (b_new − 1)/(1 − μ) = 0
3
Solve for μ → MAP Estimate
μ̂_MAP = (a_new − 1) / (a_new + b_new − 2)
= (m + a − 1) / (N + a + b − 2)

This is a weighted average of the MLE (m/N) and the prior mode ((a−1)/(a+b−2)). With more data, it approaches the MLE.

Key Insight: The MAP estimate is a compromise between your prior beliefs and the data. As N → ∞, the data dominates and μ̂_MAP → μ̂_MLE.
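The convergence of MAP to MLE can be seen by holding the sample proportion fixed and growing N. The Beta(2, 5) prior below is a hypothetical choice, not from the lecture:

```r
map_estimate <- function(m, N, a, b) (m + a - 1) / (N + a + b - 2)

a <- 2; b <- 5   # hypothetical prior (assumed values)
for (N in c(4, 40, 400)) {
  m <- N / 4     # sample proportion held fixed at 0.25
  cat("N =", N, ": MLE =", m / N,
      ", MAP =", round(map_estimate(m, N, a, b), 3), "\n")
}
# The MAP estimates drift toward the MLE (0.25) as N grows
```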
Prior vs. Likelihood Influence [Intuition]
Small dataset (N=45): Posterior is pulled strongly toward the prior. Biased site IHateStarWars (1/45) is overridden by RT prior (199/485).
Small data example (a=199, b=286, m=1, N=45)
μ̂_MAP ≈ 0.218
Large dataset (N=18,923): Posterior looks like the likelihood. Metacritic dominates the RT prior.
Large data (Metacritic as likelihood)
Posterior ≈ Likelihood
The more data you have, the less the prior matters. With infinite data, the prior is irrelevant.
Credible Intervals vs. Confidence Intervals [Comparison]

Both methods quantify uncertainty, but they differ philosophically and practically:

Property | Bayesian — Credible Interval | Frequentist — Confidence Interval
Interpretation | There is X% probability that μ lies in this range, given the observed data | If we repeated the experiment many times, X% of the intervals would contain the true μ
Incorporates prior? | Yes — the prior prevents extreme estimates | No — can give unrealistic values (e.g., 0 or 1) with small N
Small N behavior | Stays reasonable thanks to the prior (e.g., m=0 gives mean ≈ 0.2, not 0) | Can collapse to a point or span implausible values
In R (Binomial) | qbeta() on the posterior Beta(a_new, b_new) | binom.test() — Clopper-Pearson method
The Bayesian approach prevents unrealistic values due to the influence of the prior! Classical 90% CI for m=0 out of N=4 extends all the way to [0, 0.49], while the Bayesian credible interval stays sensibly bounded.
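Both intervals for the m = 0, N = 4 case can be computed in R. The Beta(2, 4) prior here is a hypothetical stand-in, since the lecture's exact prior is not reproduced above:

```r
m <- 0; N <- 4

# Frequentist 90% confidence interval (Clopper-Pearson)
ci <- binom.test(m, N, conf.level = 0.90)$conf.int

# Bayesian 90% credible interval under a hypothetical Beta(2, 4) prior
a <- 2; b <- 4
cred <- qbeta(c(0.05, 0.95), a + m, b + N - m)

ci; cred   # compare the two upper ends
```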
How Prior Choice Affects the Posterior [Sensitivity]

Four types of prior are commonly considered. The posterior behaves differently depending on how informative the prior is:

Prior Type | Beta Params | Effect on Posterior
Informative | Large a, small b (e.g., a=100, b=3) | Tightly concentrated near its mean a/(a+b) (≈ 0.97 here); strongly resists the data pulling the posterior away
Less than 50% | a < b (e.g., a=4, b=11) | Probability mass mostly below 0.5; moderately informative
Uniform | a=1, b=1 | Flat — all values of μ equally likely; MAP equals MLE
Vague | a<1, b<1 (e.g., a=0.1, b=0.1) | U-shaped — probability mass piles at the extremes (0 and 1); anti-regularizing
With more trials (e.g., N=40 vs N=20), the posterior becomes tighter and less sensitive to the choice of prior. All four priors converge as N grows.
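A quick sensitivity check in R, computing the posterior mean (a + m)/(a + b + N) under each of the four priors for one hypothetical dataset (m = 8 events in N = 40 trials — assumed numbers, not the lecture's):

```r
m <- 8; N <- 40   # hypothetical data (assumed)
priors <- list(informative = c(100, 3),
               below_half  = c(4, 11),
               uniform     = c(1, 1),
               vague       = c(0.1, 0.1))

# Posterior mean (a + m) / (a + b + N) under each prior
post_means <- sapply(priors, function(p) (p[1] + m) / (p[1] + p[2] + N))
round(post_means, 3)   # the informative prior still dominates at N = 40
```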
Posterior Summary Statistics [Reading Results]

Rather than reporting just a point estimate, Bayesian analysis reports the full posterior distribution. Key summaries:

Central tendency
Median = 50th quantile of Beta
Mean = a_new / (a_new + b_new)
Uncertainty (credible intervals)
50% CI → 25th to 75th quantile
90% CI → 5th to 95th quantile
In the lecture's m=0 out of N=4 example: the posterior mean ≈ 0.2 (not 0!), and the MLE of 0 lies below the posterior's 5th percentile — consistent with our prior belief that μ exceeds 0.05. The prior saves us from the nonsensical MLE of 0.
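These summaries are all quantiles or moments of the posterior Beta. A sketch for m = 0, N = 4 under a hypothetical Beta(2, 4) prior — chosen so the posterior mean is 0.2; the lecture's exact prior is not stated above:

```r
a <- 2; b <- 4; m <- 0; N <- 4      # Beta(2, 4) prior is an assumed example
a_new <- a + m; b_new <- b + N - m  # posterior is Beta(2, 8)

post_mean   <- a_new / (a_new + b_new)       # 0.2 — not 0, unlike the MLE
post_median <- qbeta(0.50, a_new, b_new)
ci50 <- qbeta(c(0.25, 0.75), a_new, b_new)   # 50% credible interval
ci90 <- qbeta(c(0.05, 0.95), a_new, b_new)   # 90% credible interval
```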