Statistics and probability play an important role in many machine learning algorithms. The core Bayesian insight is that model parameters are not fixed unknown constants, but random variables with their own probability distributions. This means we can express genuine uncertainty about parameters — not just uncertainty from sampling noise, but inherent uncertainty that we update as we observe data.
Frequentist (MLE): Parameters are fixed unknowns. The only uncertainty comes from sampling noise — we get a point estimate (e.g. μ̂ = m/N) and quantify variability with confidence intervals based on repeated sampling.
In this lecture we are NOT training a predictive model. Instead we are using the rules and concepts of probability to describe behavior — collecting observations of an event and asking what probability best explains what we observed.
| # | Topic | Key takeaway |
|---|---|---|
| 1 | Probability Basics | Event probability, joint, marginal, conditional, independence — the Panda/Duck example |
| 2 | Bayes' Theorem | Posterior = Likelihood × Prior / Evidence; diagnostic test worked example |
| 3 | Bernoulli Distribution | PMF for binary outcomes; the Star Wars movie example |
| 4 | Likelihood & MLE | Product of N independent Bernoullis → log-likelihood → μ̂ = m/N |
| 5 | Binomial Distribution | Counting sequences, combinations, p(m|N,μ), R's dbinom(); small-data problem |
| 6 | Bayesian Approach & MAP | Prior distributions, Beta conjugate prior, posterior derivation, MAP estimate μ̂ = (m+a−1)/(N+a+b−2) |
| 7 | Credible vs. Confidence Intervals | Bayesian credible intervals vs. frequentist confidence intervals; prior prevents unrealistic values |
The probability of an event is the proportion of times that event occurs out of the total number of trials.
p(y = 1) = probability of the EVENT class = μ
p(y = 0) = probability of the NON-EVENT class = 1 − μ
We have two variables: Animal (Panda or Duck) and Color (Red or Black). A random sample of 16 objects gives the following counts:
| | Red | Black | Marginal (Animal) |
|---|---|---|---|
| Panda | 4 | 3 | 7 |
| Duck | 6 | 3 | 9 |
| Marginal (Color) | 10 | 6 | 16 |
The marginal probability of a variable is obtained by summing its joint probabilities over all levels of the other variable:
p(Panda) = p(P,Red) + p(P,Black) = 4/16 + 3/16 = 7/16
p(Duck) = p(D,Red) + p(D,Black) = 6/16 + 3/16 = 9/16
p(Red) = p(R,Panda) + p(R,Duck) = 4/16 + 6/16 = 10/16
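These marginal sums can be checked with a quick sketch in R, using the counts from the table above:

```r
# Joint counts from the Panda/Duck table (N = 16)
counts <- matrix(c(4, 3,    # Panda: Red, Black
                   6, 3),   # Duck:  Red, Black
                 nrow = 2, byrow = TRUE,
                 dimnames = list(c("Panda", "Duck"), c("Red", "Black")))

joint <- counts / sum(counts)   # joint probabilities p(Animal, Color)
rowSums(joint)                  # marginals p(Animal): 7/16, 9/16
colSums(joint)                  # marginals p(Color): 10/16, 6/16
```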
Events A and B are independent if observing A does not affect the probability of B. The joint probability then factors into the product of the marginals: p(A, B) = p(A) · p(B).
Split the 16 animals into two groups. Group 1 has 12 animals, Group 2 has 4. The probability of selecting each animal now depends on which group you're in:
| | Group 1 | Group 2 | Marginal |
|---|---|---|---|
| Duck | 6 | 3 | 9 |
| Panda | 6 | 1 | 7 |
| Marginal | 12 | 4 | 16 |
p(Panda | G=1) = 6/12 = 0.5
p(Panda | G=2) = 1/4 = 0.25
p(A, G) = p(A | G) · p(G)   (chain rule)
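The chain rule can be verified numerically from the group counts above (a quick sketch, no new data assumed):

```r
# p(Panda, G=1) read directly from the joint counts: 6 of 16
p_joint <- 6 / 16

# Chain rule pieces
p_cond  <- 6 / 12    # p(Panda | G=1)
p_group <- 12 / 16   # p(G=1)

p_cond * p_group     # equals p(Panda, G=1) = 6/16 = 0.375
```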
| Type | Notation | Meaning |
|---|---|---|
| Marginal | p(A) | Probability of A irrespective of any other variable. |
| Joint | p(A, B) | Probability of events A and B occurring together. |
| Conditional | p(A \| B) | Probability of A given that B has already occurred. |
| Name | Symbol | Role |
|---|---|---|
| Posterior | p(A \| B) | Updated belief about A after observing evidence B. This is what we want. |
| Likelihood | p(B \| A) | How probable is evidence B if A were true? Comes from the model / data. |
| Prior | p(A) | Our belief about A before any evidence. Encodes background knowledge. |
| Evidence | p(B) | Total probability of evidence, summed across all outcomes. Normalizing constant. |
We randomly sample one animal from one of two groups. We observe it is a Duck. What is the probability it came from Group 2?
| Group 1 | Group 2 | Marginal | |
|---|---|---|---|
| Duck | 6 | 3 | 9 |
| Panda | 6 | 1 | 7 |
| Marginal | 12 | 4 | 16 |
p(Duck | G=1) = 6/12 = 1/2 and p(Duck | G=2) = 3/4

By Bayes' theorem, with prior p(G=2) = 4/16 and evidence p(Duck) = 9/16:

p(G=2 | Duck) = p(Duck | G=2) · p(G=2) / p(Duck) = (3/4 · 4/16) / (9/16) = (3/16) / (9/16) = 1/3
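Putting the pieces into Bayes' theorem, as a sketch in R using the counts above:

```r
p_duck_g2 <- 3 / 4    # likelihood p(Duck | G=2)
p_g2      <- 4 / 16   # prior p(G=2)
p_duck    <- 9 / 16   # evidence p(Duck)

posterior <- p_duck_g2 * p_g2 / p_duck
posterior   # → 1/3 (indeed: 3 of the 9 ducks are in Group 2)
```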
The classic Bayesian example. Dis and Test are binary variables for disease state and test result.
| Quantity | Symbol | Value |
|---|---|---|
| Sensitivity (true positive rate) | p(Test=1 \| Dis=1) | 0.95 |
| False positive rate | p(Test=1 \| Dis=0) | 0.01 |
| Prevalence (prior) | p(Dis=1) | 1/100,000 = 0.00001 |
| Goal: posterior | p(Dis=1 \| Test=1) | ? |
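The posterior can be computed directly from the table; a quick sketch in R:

```r
sens <- 0.95   # p(Test=1 | Dis=1)
fpr  <- 0.01   # p(Test=1 | Dis=0)
prev <- 1e-5   # p(Dis=1), the prior

# Evidence: total probability of a positive test
p_pos <- sens * prev + fpr * (1 - prev)

# Posterior: p(Dis=1 | Test=1)
posterior <- sens * prev / p_pos
posterior   # ≈ 0.00095 — still under 0.1% despite the positive test
```

The rare-disease prior dominates: almost all positives are false positives.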
We ask people: Did you like Star Wars Episode VIII — The Last Jedi? The answer is binary, encoded exactly as in Week 3:
The probability mass function over both values y ∈ {0, 1} can be written compactly in a single expression — named after Jacob Bernoulli: p(y | μ) = μ^y · (1−μ)^(1−y), which recovers p(1 | μ) = μ and p(0 | μ) = 1 − μ.
The PMF has only two bars — at x=0 and x=1. Their heights depend entirely on μ:
| μ | p(y=0) = 1−μ | p(y=1) = μ | Interpretation |
|---|---|---|---|
| 0.15 | 0.85 (tall) | 0.15 (short) | Event does not happen too frequently — 0s are more common |
| 0.45 | 0.55 | 0.45 | Near-equal — high uncertainty |
| 0.75 | 0.25 (short) | 0.75 (tall) | Event is common — mostly 1s |
We ask 4 people. We observe the sequence: Person 1=No, 2=No, 3=Yes, 4=No. What is the probability of this sequence? Assuming independent trials, it is the product of the individual Bernoulli probabilities: (1−μ) · (1−μ) · μ · (1−μ) = μ(1−μ)³.
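With a hypothetical μ = 0.2 (the value used later in the lecture), the sequence probability works out as:

```r
mu <- 0.2
y  <- c(0, 0, 1, 0)   # No, No, Yes, No

# Product of independent Bernoulli terms: mu^y * (1-mu)^(1-y)
prod(mu^y * (1 - mu)^(1 - y))   # = mu * (1-mu)^3 = 0.1024
```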
When μ is unknown, we find the value that best explains the data — the value that maximizes the likelihood. Taking the log of the product of N independent Bernoulli terms gives the log-likelihood:

log p(x | μ) = log(μ) · m + log(1−μ) · (N − m)

where m = number of events (people who said Yes) and N − m = number of non-events.

Setting the derivative with respect to μ to zero and solving:

m/μ − (N − m)/(1−μ) = 0 → m(1−μ) = μ(N − m) → m − μ·m = μ·N − μ·m → m = μ·N → μ̂ = m/N
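A numeric check of the result: maximizing the log-likelihood with R's optimize() recovers μ̂ = m/N. A sketch with m = 1 Yes in N = 4 trials:

```r
m <- 1; N <- 4

# Bernoulli log-likelihood as a function of mu
loglik <- function(mu) m * log(mu) + (N - m) * log(1 - mu)

fit <- optimize(loglik, interval = c(0.001, 0.999), maximum = TRUE)
fit$maximum   # ≈ 0.25 = m/N
```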
Bernoulli answers: what is p(y=1) for a single trial?
Binomial answers: what is the probability the event occurs exactly m times out of N trials?
The sequence (No, No, Yes, No) gives 1 Yes out of 4 — but there are 4 possible orderings that also give exactly 1 Yes:
| Person 1 | Person 2 | Person 3 | Person 4 | Probability |
|---|---|---|---|---|
| Yes | No | No | No | μ(1−μ)³ |
| No | Yes | No | No | μ(1−μ)³ |
| No | No | Yes | No | μ(1−μ)³ |
| No | No | No | Yes | μ(1−μ)³ |
How many distinct orderings of N trials contain exactly m events? This is the combination C(N, m) = N! / (m! (N − m)!) — order doesn't matter, only which m positions are "Yes":
| m events in N=4 trials | C(4, m) | Interpretation |
|---|---|---|
| m = 0 | 1 | Only 1 way: (No,No,No,No) |
| m = 1 | 4 | 4 ways: Yes can be in any 1 of 4 positions |
| m = 2 | 6 | 6 ways to choose which 2 of 4 positions are Yes |
| m = 3 | 4 | 4 ways |
| m = 4 | 1 | Only 1 way: (Yes,Yes,Yes,Yes) |
m ∈ {0, 1, 2, …, N}
p(0|4,μ) = 1·μ⁰·(1−μ)⁴
p(1|4,μ) = 4·μ¹·(1−μ)³
p(2|4,μ) = 6·μ²·(1−μ)²
p(3|4,μ) = 4·μ³·(1−μ)¹
p(4|4,μ) = 1·μ⁴·(1−μ)⁰
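These five expressions can be checked against choose() and dbinom() — a quick sketch with μ = 0.2:

```r
mu <- 0.2; N <- 4
m  <- 0:4

# Hand-built PMF: C(N, m) * mu^m * (1-mu)^(N-m)
pmf <- choose(N, m) * mu^m * (1 - mu)^(N - m)

all.equal(pmf, dbinom(m, size = N, prob = mu))  # TRUE — matches R's built-in
sum(pmf)                                        # 1 — the probabilities sum to one
```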
[Interactive plot: each bar shows p(m | N, μ); adjusting N and μ changes the distribution of the event count m.]
Assuming the TRUE probability of liking the movie is 0.2:
| m (# who said Yes) | p(m \| N=4, μ=0.2) | Note |
|---|---|---|
| 0 | ≈ 0.41 (41%) | Tied for most likely outcome |
| 1 | ≈ 0.41 (41%) | Equally likely (both are exactly 0.4096) |
| 2 | ≈ 0.15 (15%) | Small but non-negligible |
| 3 | ≈ 0.026 | Unlikely |
| 4 | ≈ 0.002 | Very unlikely |
```r
# dbinom(x, size, prob) — Binomial PMF
#   x    ↔ m (number of events)
#   size ↔ N (number of trials)
#   prob ↔ μ (event probability)

# P(exactly 1 Yes in 4 trials) with μ = 0.2
dbinom(x = 1, size = 4, prob = 0.2)  # → 0.4096

# Full PMF for N = 8, μ = 0.2
m_vals <- 0:8
probs  <- dbinom(m_vals, size = 8, prob = 0.2)

# Plot it
barplot(probs, names.arg = m_vals,
        xlab = "m", ylab = "p(m | μ)",
        main = "Binomial PMF — N=8, μ=0.2")
```
MLE gives μ̂ = m/N — fine with lots of data, but badly misleading with small or biased samples. Example from the lecture:
| Source | Like (m) | Total (N) | μ̂ MLE |
|---|---|---|---|
| StarFanatic.com | 23 | 25 | 0.92 |
| IHateStarWars.com | 1 | 45 | 0.02 |
This illustrates an important feature of Bayesian inference: the prior is inherently subjective — two analysts could choose different priors and reach different posteriors from the same data. This is sometimes seen as a limitation, but also a strength: it forces you to make your assumptions explicit. Crucially, as data accumulates the likelihood dominates and the posterior converges regardless of the starting prior — the data ultimately guides us toward the truth.
Start with an initial hypothesis, then update it based on data. The key equation is Bayes' theorem applied to the parameter: p(μ | x) = p(x | μ) · p(μ) / p(x), i.e. Posterior ∝ Likelihood × Prior.
When the likelihood is Binomial/Bernoulli, the natural choice of prior is the Beta distribution. It is a continuous PDF bounded between 0 and 1 — exactly the domain of μ.
| Parameter | Interpretation | Example shape |
|---|---|---|
| a | Prior "pseudo-count" of events (successes) | a=1, b=1 → Uniform (flat prior) |
| b | Prior "pseudo-count" of non-events (failures) | a=3, b=3 → symmetric hill at 0.5 |
p(μ | x) ∝ Likelihood × Prior ∝ μ^m · (1−μ)^(N−m) · μ^(a−1) · (1−μ)^(b−1) = μ^(m+a−1) · (1−μ)^(N−m+b−1)

Posterior = Beta(μ | a_new, b_new), with a_new = a + m and b_new = b + N − m
The prior parameters act like "imaginary" prior observations — they shift and concentrate the posterior.
MAP = Maximum A Posteriori: find μ that maximizes the posterior. Because Evidence is constant, this means maximizing Likelihood × Prior.
μ̂_MAP = (m + a − 1) / (N + a + b − 2)
This is a weighted average between the MLE (m/N) and the prior mode ((a−1)/(a+b−2)). As N grows, the data term dominates and the MAP estimate approaches the MLE.
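A sketch of the MAP formula in action, with hypothetical data m = 1, N = 4 and a Beta(3, 3) prior:

```r
m <- 1; N <- 4   # 1 Yes in 4 trials
a <- 3; b <- 3   # symmetric prior centered at 0.5

mle <- m / N                           # 0.25
map <- (m + a - 1) / (N + a + b - 2)   # 3/8 = 0.375

c(MLE = mle, MAP = map)   # MAP is pulled from 0.25 toward the prior
```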
Both methods quantify uncertainty, but they differ philosophically and practically:
| Property | Bayesian — Credible Interval | Frequentist — Confidence Interval |
|---|---|---|
| Interpretation | There is X% probability that μ lies in this range given the observed data | If we repeated the experiment many times, X% of intervals would contain the true μ |
| Incorporates prior? | Yes — prior prevents extreme estimates | No — can give unrealistic values (e.g., 0 or 1) with small N |
| Small N behavior | Stays reasonable thanks to prior (e.g., m=0 gives mean ≈ 0.2, not 0) | Can collapse to a point or span impossible values |
| In R (Binomial) | qbeta() on posterior Beta(a_new, b_new) | binom.test() — Clopper-Pearson method |
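Both intervals can be computed in R. A sketch continuing the hypothetical m = 1, N = 4 example with a Beta(3, 3) prior, using a 90% level for both:

```r
m <- 1; N <- 4
a <- 3; b <- 3
a_new <- a + m       # 4
b_new <- b + N - m   # 6

# 90% Bayesian credible interval: quantiles of the posterior Beta(4, 6)
credible <- qbeta(c(0.05, 0.95), a_new, b_new)

# 90% frequentist confidence interval (Clopper-Pearson)
confidence <- binom.test(x = m, n = N, conf.level = 0.90)$conf.int

credible; confidence
```

The credible interval stays away from the extremes because the prior contributes pseudo-counts, while the Clopper-Pearson interval for tiny N can reach all the way down to 0 (e.g., when m = 0).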
Four types of prior are commonly considered. The posterior behaves differently depending on how informative the prior is:
| Prior Type | Beta Params | Effect on Posterior |
|---|---|---|
| Informative | Large a, small b (e.g., a=100, b=3) | Strongly concentrates near high μ (prior mean a/(a+b) ≈ 0.97); resists data pulling it away |
| Less than 50% | a < b (e.g., a=4, b=11) | Probability mass mostly below 0.5; moderately informative |
| Uniform | a=1, b=1 | Flat — all values of μ equally likely; MAP equals MLE |
| Vague | a<1, b<1 (e.g., a=0.1, b=0.1) | U-shaped — probability mass at extremes (0 and 1); anti-regularizing |
Rather than reporting just a point estimate, Bayesian analysis reports the full posterior distribution. Key summaries:
Mean = a_new / (a_new + b_new)
90% CI → 5th to 95th quantile