We extend maximum likelihood and Bayesian inference from the Bernoulli/Binomial setting to the continuous Gaussian (Normal) distribution. The running example is weighing a 25-pound dumbbell multiple times to estimate the true mean μ.
Both MLE and Bayes let us learn μ from data. The fundamental difference: MLE returns a single point estimate μ̂, while the Bayesian approach returns a full posterior distribution p(μ | x) that quantifies our remaining uncertainty.
We lift a 25-pound dumbbell N times, recording weight measurements x₁, x₂, …, x_N. Each measurement contains noise because no scale is perfectly repeatable. We model the process as:
| Parameter | Meaning | In example |
|---|---|---|
| μ | True population mean weight | Unknown — what we're estimating |
| σ | Noise / measurement error | Known: σ = 1 pound (manufacturer spec) |
The Gaussian (or Normal) is a continuous probability distribution — the famous "bell curve." It is fully defined by two parameters: the mean μ and the standard deviation σ.
Note: the PDF gives a density, not a probability. It can exceed 1. Probabilities are obtained by integrating over intervals.
Adjust μ and σ to see how the Gaussian shape changes in real-time.
A fixed percentage of probability mass lies within each band around μ: roughly 68% within ±1σ, 95% within ±2σ, and 99.7% within ±3σ.
Any Gaussian can be expressed in terms of the standard normal via the z-score z = (x − μ)/σ:
This means normal(x | μ, σ) is equivalently defined as: start with z ~ normal(z | 0, 1), then set x = σ·z + μ. This change-of-variables idea is fundamental and appears throughout our models.
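A quick numerical check of this change of variables (a pure-Python sketch; the sample size and seed are arbitrary choices, not from the lecture):

```python
import random

random.seed(0)

mu, sigma = 25.0, 1.0  # dumbbell example: true mean and known noise

# Change of variables: draw z ~ normal(0, 1), then set x = sigma*z + mu.
samples = [sigma * random.gauss(0.0, 1.0) + mu for _ in range(100_000)]

mean = sum(samples) / len(samples)
var = sum((x - mean) ** 2 for x in samples) / len(samples)
print(round(mean, 2), round(var, 2))  # should land close to mu=25, sigma^2=1
```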
We observe N dumbbell weight measurements. Each is assumed conditionally independent given (μ, σ), so the joint likelihood factors into a product:
The sample mean x̄ is the sufficient statistic for μ, and it is exactly the maximum-likelihood estimate: μ̂_MLE = x̄ = (1/N)·Σᵢ xᵢ.
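As a sanity check, here is a small sketch (simulated measurements, arbitrary seed) confirming that the sample mean maximises the Gaussian log-likelihood:

```python
import math
import random

random.seed(1)

mu_true, sigma = 25.0, 1.0
data = [random.gauss(mu_true, sigma) for _ in range(50)]

# MLE for mu is the sample mean (the sufficient statistic).
mu_mle = sum(data) / len(data)

def log_lik(mu):
    """Gaussian log-likelihood of the data at a candidate mu (sigma known)."""
    return sum(-0.5 * math.log(2 * math.pi * sigma**2)
               - (x - mu) ** 2 / (2 * sigma**2) for x in data)

# Nudging mu in either direction can only lower the log-likelihood.
assert log_lik(mu_mle) >= log_lik(mu_mle + 0.01)
assert log_lik(mu_mle) >= log_lik(mu_mle - 0.01)
print(round(mu_mle, 2))
```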
Instead of a point estimate, the Bayesian approach gives us a posterior distribution over μ — capturing our full uncertainty.
If the prior has the same functional form as the likelihood (viewed as a function of the parameter), the posterior has the same distributional family as the prior — this is called a conjugate prior.
| Symbol | Meaning |
|---|---|
| μ₀ | Prior mean — our best guess for μ before seeing data |
| τ₀ | Prior standard deviation — how uncertain we are about μ a priori |
Taking the log of (likelihood × prior) and collecting μ-dependent terms:
This is a sum of two quadratics in μ. Completing the square reveals the posterior is itself a Gaussian — the conjugate property at work. The full algebra is on Canvas.
Precision is defined as the inverse of variance: λ = 1/σ². It measures how "sharp" or confident a distribution is.
After observing N measurements x = {x₁,…,x_N}, the posterior on μ is a Gaussian:
Posterior precision = prior precision + data precision: 1/τ_N² = 1/τ₀² + N/σ².
Weighted average of prior mean and sample mean.
The posterior mean can be written in multiple equivalent ways: as a precision-weighted average, μ_N = τ_N²·(μ₀/τ₀² + N·x̄/σ²), or as the sample mean shrunk toward the prior, μ_N = x̄ + (μ₀ − x̄)·σ²/(σ² + N·τ₀²).
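The update equations fit in a few lines. A minimal sketch, using made-up numbers for the dumbbell example (ten measurements averaging 25.4 lbs is an assumption, not data from the lecture):

```python
def posterior(mu0, tau0, xbar, n, sigma):
    """Conjugate Gaussian update for mu when the noise sigma is known."""
    prior_prec = 1.0 / tau0**2
    data_prec = n / sigma**2
    post_prec = prior_prec + data_prec            # precisions add
    # Posterior mean: precision-weighted average of mu0 and xbar.
    mu_n = (prior_prec * mu0 + data_prec * xbar) / post_prec
    tau_n = post_prec ** -0.5
    return mu_n, tau_n

# Prior normal(25, 1), sigma = 1, ten measurements averaging 25.4.
mu_n, tau_n = posterior(mu0=25.0, tau0=1.0, xbar=25.4, n=10, sigma=1.0)
print(round(mu_n, 3), round(tau_n, 3))  # → 25.364 0.302
```

Note how the posterior mean sits between the prior mean 25 and the sample mean 25.4, much closer to the data because the data precision (10) dominates the prior precision (1).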
Set the parameters and see the posterior update live.
The plot shows three curves: the prior (gold), the likelihood as a function of μ (violet), and the posterior (teal). The posterior is always between the prior and the likelihood — a precision-weighted compromise.
| Slider change | What happens to the posterior |
|---|---|
| ↑ N (more data) | Likelihood narrows and dominates; posterior moves toward x̄ and sharpens |
| ↑ τ₀ (wider prior) | Prior flattens; posterior moves toward x̄ and more closely tracks the likelihood |
| ↓ τ₀ (tighter prior) | Prior sharpens; posterior is pulled strongly toward μ₀, especially when N is small |
| Move x̄ away from μ₀ | Posterior shifts between them; high-precision source wins |
| Move μ₀ | Prior shifts; posterior tracks it when N is small, ignores it when N is large |
Notice that the N individual observations appear in the posterior formulas only through two quantities: the count N and the sample mean x̄ (the sufficient statistics).
As the number of observations grows without bound, the data overwhelm the prior:
The prior mean μ₀'s contribution vanishes: writing μ_N = x̄ + (μ₀ − x̄)·σ²/(σ² + N·τ₀²), the second term goes to 0, so μ_N → x̄.
Precision → ∞. Perfect certainty about μ.
A prior standard deviation τ₀ → ∞ represents a completely "diffuse" prior — we have essentially no prior knowledge:
Prior precision → 0, so prior has no weight.
The posterior standard deviation reduces to σ/√N, the classical standard error of the mean.
| Situation | Prior Influence | Posterior ≈ |
|---|---|---|
| Small N, informative prior | High | Weighted blend, pulled toward prior |
| Small N, diffuse prior | Low | Mostly data-driven, high uncertainty |
| Large N (any prior) | Negligible | ≈ normal(x̄, σ/√N) |
| N → ∞ (any prior) | Zero | Point mass at x̄ (MLE) |
For the dumbbell experiment, we trust the manufacturer's label. We set a tight prior: μ ~ normal(25, 1), i.e. μ₀ = 25, τ₀ = 1.
This says: before collecting data, we believe μ is centred at 25 pounds and we'd be surprised if it were outside [22, 28] (a ±3τ₀ band). The prior is constraining: values of μ far outside that range are treated as implausible.
What if we have almost no prior knowledge about μ? We use a very wide prior, e.g. μ ~ normal(25, 25) with τ₀ = 25:
| Property | Informative (τ₀=1) | Diffuse (τ₀=25) |
|---|---|---|
| Prior precision | High (1/τ₀² = 1.0) | Low (1/τ₀² = 0.0016) |
| Prior influence at N=1 | Strong | Negligible |
| Posterior mean at N=1 | Pulled toward μ₀ = 25 | ≈ x₁ (first observation) |
| Posterior τ_N after 50 observations | Essentially identical | Essentially identical |
| Allows unphysical values? | No (essentially) | Yes — negative μ possible |
A beautiful property of conjugate models: today's posterior becomes tomorrow's prior. You can update sequentially, one observation at a time, and get the same result as batch processing all N observations at once.
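A sketch of that equivalence under the known-σ model (simulated measurements; the seed and sample size are arbitrary): feeding observations one at a time, with each posterior becoming the next prior, lands on exactly the batch answer.

```python
import random

def update(mu0, tau0, x, sigma):
    """One-observation conjugate update: returns the new (mean, sd) for mu."""
    prec = 1.0 / tau0**2 + 1.0 / sigma**2
    mu1 = (mu0 / tau0**2 + x / sigma**2) / prec
    return mu1, prec ** -0.5

random.seed(2)
sigma = 1.0
data = [random.gauss(25.0, sigma) for _ in range(20)]

# Sequential: today's posterior is tomorrow's prior, one observation at a time.
mu_s, tau_s = 25.0, 1.0
for x in data:
    mu_s, tau_s = update(mu_s, tau_s, x, sigma)

# Batch: process all N observations at once via the sufficient stats (N, xbar).
n, xbar = len(data), sum(data) / len(data)
prec = 1.0 / 1.0**2 + n / sigma**2
mu_b = (25.0 / 1.0**2 + n * xbar / sigma**2) / prec
tau_b = prec ** -0.5

assert abs(mu_s - mu_b) < 1e-9 and abs(tau_s - tau_b) < 1e-9
```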
Previously we assumed the measurement noise σ was known. Now we drop that assumption — both μ and σ are unknown. This changes everything:
|  | Known σ (earlier) | Unknown σ (now) |
|---|---|---|
| Noise | σ fixed (manufacturer spec) | No σ assumption needed |
| Posterior | p(μ \| x, σ) — 1D | p(μ, σ \| x) — 2D |
We now target the joint distribution of both unknowns conditioned on all data:
One tractable approach: assume μ and σ are a-priori independent, p(μ, σ) = p(μ)·p(σ), with a Gaussian prior on μ and a uniform prior on σ:
Even if we specify independent priors, the posterior will in general create a relationship between μ and σ — the data couple the two parameters through the likelihood.
Example: μ₀=250, τ₀=2
Example: ℓ=0.5, u=5.5 (pounds)
The uniform prior on σ reflects our bounded uncertainty about measurement repeatability:
| Setting | Value | Meaning |
|---|---|---|
| μ₀ | 250 | Prior belief: we weigh ~250 lbs |
| τ₀ | 2 | ≈99% probability weight < 256 lbs a-priori |
| ℓ (lower) | 0.5 | Scale at least somewhat noisy |
| u (upper) | 5.5 | Scale not wildly inaccurate (±16 lbs at 3σ) |
The unnormalised joint posterior multiplies the likelihood by both priors: p(μ, σ | x) ∝ [∏ᵢ normal(xᵢ | μ, σ)] · normal(μ | μ₀, τ₀) · uniform(σ | ℓ, u).
With only two unknowns (μ, σ) we can visualise the posterior as a 2D surface. The colour encodes the un-normalised log-posterior value (viridis scale):
| Colour | Log-posterior value | Meaning |
|---|---|---|
| Bright yellow | High (near maximum) | Most plausible (μ, σ) combinations — MAP region |
| Green / teal | Moderate | Somewhat plausible |
| Blue / dark blue | Low | Less plausible but within prior bounds |
| Dark purple | Very low | Highly implausible — almost ruled out by data |
| Grey (masked) | −∞ | Outside prior bounds — impossible by assumption |
Posterior surface (N=10, μ₀=250, τ₀=2, σ ~ Uniform(0.5, 5.5)):
The Maximum A Posteriori (MAP) estimate is the pair (μ, σ) that maximises the posterior — analogous to MLE but including the prior:
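The MAP can be found by brute force over a grid, mirroring the heatmap's brightest cell. The sketch below uses illustrative measurements (assumed, not the lecture's dataset), so its MAP lands near, but not exactly at, the values quoted for the lecture's data:

```python
import math

# Illustrative measurements for the bodyweight example (assumed data);
# the prior settings follow the table above.
data = [252.6, 257.4, 251.8, 257.0, 253.4, 256.2, 251.4, 258.2, 254.2, 257.8]
mu0, tau0, lo, hi = 250.0, 2.0, 0.5, 5.5

def log_post(mu, sigma):
    """Unnormalised log-posterior: log-likelihood plus log-priors."""
    if not (lo <= sigma <= hi):
        return -math.inf                        # outside the uniform bounds
    ll = sum(-math.log(sigma) - (x - mu) ** 2 / (2 * sigma**2) for x in data)
    lp = -(mu - mu0) ** 2 / (2 * tau0**2)       # Gaussian prior on mu
    return ll + lp                              # uniform prior on sigma: constant

# Exhaustive search over a (mu, sigma) grid: the argmax is the MAP estimate.
best, mu_map, sigma_map = -math.inf, None, None
for i in range(201):                            # mu in [250, 260], step 0.05
    for j in range(101):                        # sigma in [0.5, 5.5], step 0.05
        mu, sigma = 250 + 0.05 * i, 0.5 + 0.05 * j
        lp = log_post(mu, sigma)
        if lp > best:
            best, mu_map, sigma_map = lp, mu, sigma
print(mu_map, sigma_map)
```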
On the surface plot the MAP sits at σ̂_MAP ≈ 2.5; the vertical extent of the bright region indicates the remaining uncertainty in σ.
Looking at 1D slices of the log-posterior reveals important structure:
How does the joint posterior evolve as we add observations one by one?
| N | μ contours | σ contours | Key observation |
|---|---|---|---|
| 0 (prior) | Vertical lines | Flat (uniform) | σ is completely unconstrained a-priori |
| 1 | No longer vertical | σ MAP pushed to upper bound | First obs > 260 — prior on μ is too narrow; needs high σ to explain it |
| 5–7 | Inner ring appears | σ MAP moves inward | More data resolve both μ and σ simultaneously |
| 10 | Compact ellipse | σ̂ ≈ 2.5 | Posterior well-localised near (μ≈255, σ≈2.5) |
| 30+ | Stable, shrinking | Stable, shrinking | Diminishing returns — surface shape barely changes |
Heatmaps of the un-normalised log-posterior. Yellow = high (plausible), dark blue = low, grey = outside prior bounds. Black rings = iso-probability contours.
As N grows, the posterior contours become increasingly elliptical — approaching the shape of a bivariate Gaussian centred on the MAP. This observation motivates the Laplace (Normal) approximation, which we will cover in the next lecture.
The Multivariate Normal (MVN) generalises the Gaussian to D-dimensional vectors x = (x₁, x₂, …, x_D):
| Symbol | Name | Role |
|---|---|---|
| μ | Mean vector (D×1) | Location centre of the distribution |
| Σ | Covariance matrix (D×D) | Shape, scale, and orientation of the "cloud" |
| |Σ| | Determinant of Σ | Normalisation constant |
| Σ⁻¹ | Precision matrix | Inverse covariance — appears in exponent |
The term in the exponent generalises the familiar 1D squared z-score: Δ² = (x − μ)ᵀ Σ⁻¹ (x − μ), the squared Mahalanobis distance from x to the mean.
For two variables x₁ and x₂ with correlation coefficient ρ:
Σ = [[σ₁², ρσ₁σ₂],
[ρσ₁σ₂, σ₂² ]]
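A small sketch evaluating this bivariate density directly from σ₁, σ₂, ρ (the 2×2 inverse has a closed form), with a factorisation check at ρ = 0:

```python
import math

def mvn2_pdf(x1, x2, m1, m2, s1, s2, rho):
    """Bivariate normal density via the explicit 2x2 covariance inverse."""
    z1, z2 = (x1 - m1) / s1, (x2 - m2) / s2
    # Mahalanobis term (x-mu)^T Sigma^{-1} (x-mu) for the 2D case:
    quad = (z1**2 - 2 * rho * z1 * z2 + z2**2) / (1 - rho**2)
    det = (s1 * s2) ** 2 * (1 - rho**2)          # |Sigma|
    return math.exp(-0.5 * quad) / (2 * math.pi * math.sqrt(det))

def norm_pdf(x, m, s):
    """Ordinary 1D Gaussian density."""
    return math.exp(-0.5 * ((x - m) / s) ** 2) / (s * math.sqrt(2 * math.pi))

# Sanity check: with rho = 0 the joint density factorises into two 1D normals.
p_joint = mvn2_pdf(1.0, -0.5, 0.0, 0.0, 1.0, 2.0, 0.0)
p_indep = norm_pdf(1.0, 0.0, 1.0) * norm_pdf(-0.5, 0.0, 2.0)
assert abs(p_joint - p_indep) < 1e-12
```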
Two powerful closure properties of the MVN:
Marginalisation: x₁ ~ normal(μ₁, σ₁) and x₂ ~ normal(μ₂, σ₂).
This holds in any number of dimensions — every subset of variables in an MVN is also MVN.
x₂ | x₁ ~ normal(μ₂ + ρ(σ₂/σ₁)(x₁−μ₁), (1−ρ²)σ₂²)
Knowing x₁ shifts the conditional mean toward x₁ (weighted by ρ) and reduces variance.
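A Monte-Carlo sketch of that conditioning formula (the parameter values are arbitrary): sample the pair jointly, keep draws with x₁ near a chosen value, and compare the empirical conditional mean against the formula.

```python
import random

random.seed(3)
m1, m2, s1, s2, rho = 0.0, 0.0, 1.0, 2.0, 0.8

def cond_params(x1):
    """Mean and sd of x2 | x1 for a bivariate normal."""
    mean = m2 + rho * (s2 / s1) * (x1 - m1)
    sd = ((1 - rho**2) * s2**2) ** 0.5
    return mean, sd

# Sample (x1, x2) jointly and keep pairs with x1 close to 1.0.
kept = []
for _ in range(200_000):
    z1, z2 = random.gauss(0, 1), random.gauss(0, 1)
    x1 = m1 + s1 * z1
    x2 = m2 + s2 * (rho * z1 + (1 - rho**2) ** 0.5 * z2)  # correlated draw
    if abs(x1 - 1.0) < 0.02:
        kept.append(x2)

mean_hat = sum(kept) / len(kept)
mean_th, sd_th = cond_params(1.0)                # theoretical mean is 1.6
print(round(mean_hat, 2), round(mean_th, 2))
```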
The correlation coefficient ρ ∈ (−1, 1) controls how "tilted" the elliptical contours are:
| ρ value | Contour shape | Interpretation |
|---|---|---|
| ρ = 0 | Circles (if σ₁ = σ₂) or axis-aligned ellipses | Variables independent — knowing x₁ gives no info about x₂ |
| ρ > 0 | Ellipses tilted ↗ (positive slope) | High x₁ → expect high x₂; knowing one reduces uncertainty in other |
| ρ < 0 | Ellipses tilted ↘ (negative slope) | High x₁ → expect low x₂ |
| |ρ| → 1 | Very thin, elongated ellipses | Nearly perfect linear relationship; near-singular Σ |