We extend maximum likelihood and Bayesian inference from the Bernoulli/Binomial setting to the continuous Gaussian (Normal) distribution. The running example is weighing a 25-pound dumbbell multiple times to estimate the true mean μ.
Both MLE and Bayes let us learn μ from data. The fundamental difference: MLE returns a single point estimate μ̂, while the Bayesian approach returns a full posterior distribution p(μ | x) that quantifies our remaining uncertainty.
We lift a 25-pound dumbbell N times, recording weight measurements x₁, x₂, …, x_N. Each measurement contains noise because no scale is perfectly repeatable. We model the process as:
| Parameter | Meaning | In example |
|---|---|---|
| μ | True population mean weight | Unknown — what we're estimating |
| σ | Noise / measurement error | Known: σ = 1 pound (manufacturer spec) |
The Gaussian (or Normal) is a continuous probability distribution — the famous "bell curve." It is fully defined by two parameters: the mean μ and the standard deviation σ.
Note: the PDF gives a density, not a probability. It can exceed 1. Probabilities are obtained by integrating over intervals.
Adjust μ and σ to see how the Gaussian shape changes in real-time.
A fixed percentage of probability mass lies within each band around μ: roughly 68% within ±1σ, 95% within ±2σ, and 99.7% within ±3σ.
Any Gaussian can be expressed in terms of the standard normal via the z-score z = (x − μ)/σ:
This means normal(x | μ, σ) is equivalently defined as: start with z ~ normal(z | 0, 1), then set x = σ·z + μ. This change-of-variables idea is fundamental and appears throughout our models.
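A quick numerical check of this change of variables (a pure-Python sketch; the sample size and seed are arbitrary choices, not from the lecture):

```python
import random

random.seed(0)

mu, sigma = 25.0, 1.0  # dumbbell example: true mean and known noise

# Change of variables: draw z ~ normal(0, 1), then set x = sigma*z + mu.
samples = [sigma * random.gauss(0.0, 1.0) + mu for _ in range(100_000)]

mean = sum(samples) / len(samples)
var = sum((x - mean) ** 2 for x in samples) / len(samples)
print(round(mean, 2), round(var, 2))  # should land close to mu=25, sigma^2=1
```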
We observe N dumbbell weight measurements. Each is assumed conditionally independent given (μ, σ), so the joint likelihood factors into a product:
The sample mean x̄ is the sufficient statistic for μ, and it is exactly the maximum-likelihood estimate: μ̂_MLE = x̄ = (1/N)·Σᵢ xᵢ.
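As a sanity check, here is a small sketch (simulated measurements, arbitrary seed) confirming that the sample mean maximises the Gaussian log-likelihood:

```python
import math
import random

random.seed(1)

mu_true, sigma = 25.0, 1.0
data = [random.gauss(mu_true, sigma) for _ in range(50)]

# MLE for mu is the sample mean (the sufficient statistic).
mu_mle = sum(data) / len(data)

def log_lik(mu):
    """Gaussian log-likelihood of the data at a candidate mu (sigma known)."""
    return sum(-0.5 * math.log(2 * math.pi * sigma**2)
               - (x - mu) ** 2 / (2 * sigma**2) for x in data)

# Nudging mu in either direction can only lower the log-likelihood.
assert log_lik(mu_mle) >= log_lik(mu_mle + 0.01)
assert log_lik(mu_mle) >= log_lik(mu_mle - 0.01)
print(round(mu_mle, 2))
```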
Instead of a point estimate, the Bayesian approach gives us a posterior distribution over μ — capturing our full uncertainty.
If the prior has the same functional form as the likelihood (viewed as a function of the parameter), the posterior has the same distributional family as the prior — this is called a conjugate prior.
| Symbol | Meaning |
|---|---|
| μ₀ | Prior mean — our best guess for μ before seeing data |
| τ₀ | Prior standard deviation — how uncertain we are about μ a priori |
Taking the log of (likelihood × prior) and collecting μ-dependent terms:
This is a sum of two quadratics in μ. Completing the square reveals the posterior is itself a Gaussian — the conjugate property at work. The full algebra is on Canvas.
Precision is defined as the inverse of variance: λ = 1/σ². It measures how "sharp" or confident a distribution is.
After observing N measurements x = {x₁,…,x_N}, the posterior on μ is a Gaussian:
Posterior precision = prior precision + data precision: 1/τ_N² = 1/τ₀² + N/σ².
Weighted average of prior mean and sample mean.
The posterior mean can be written in multiple equivalent ways: as a precision-weighted average, μ_N = τ_N²·(μ₀/τ₀² + N·x̄/σ²), or as the sample mean shrunk toward the prior, μ_N = x̄ + (μ₀ − x̄)·σ²/(σ² + N·τ₀²).
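The update equations fit in a few lines. A minimal sketch, using made-up numbers for the dumbbell example (ten measurements averaging 25.4 lbs is an assumption, not data from the lecture):

```python
def posterior(mu0, tau0, xbar, n, sigma):
    """Conjugate Gaussian update for mu when the noise sigma is known."""
    prior_prec = 1.0 / tau0**2
    data_prec = n / sigma**2
    post_prec = prior_prec + data_prec            # precisions add
    # Posterior mean: precision-weighted average of mu0 and xbar.
    mu_n = (prior_prec * mu0 + data_prec * xbar) / post_prec
    tau_n = post_prec ** -0.5
    return mu_n, tau_n

# Prior normal(25, 1), sigma = 1, ten measurements averaging 25.4.
mu_n, tau_n = posterior(mu0=25.0, tau0=1.0, xbar=25.4, n=10, sigma=1.0)
print(round(mu_n, 3), round(tau_n, 3))  # → 25.364 0.302
```

Note how the posterior mean sits between the prior mean 25 and the sample mean 25.4, much closer to the data because the data precision (10) dominates the prior precision (1).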
Set the parameters and see the posterior update live.
The plot shows three curves: the prior (gold), the likelihood as a function of μ (violet), and the posterior (teal). The posterior is always between the prior and the likelihood — a precision-weighted compromise.
| Slider change | What happens to the posterior |
|---|---|
| ↑ N (more data) | Likelihood narrows and dominates; posterior moves toward x̄ and sharpens |
| ↑ τ₀ (wider prior) | Prior flattens; posterior moves toward x̄ and more closely tracks the likelihood |
| ↓ τ₀ (tighter prior) | Prior sharpens; posterior is pulled strongly toward μ₀, especially when N is small |
| Move x̄ away from μ₀ | Posterior shifts between them; high-precision source wins |
| Move μ₀ | Prior shifts; posterior tracks it when N is small, ignores it when N is large |
Notice that the N individual observations appear in the posterior formulas only through two quantities: the count N and the sample mean x̄ (the sufficient statistics).
As the number of observations grows without bound, the data overwhelm the prior:
The prior mean μ₀'s contribution vanishes: writing μ_N = x̄ + (μ₀ − x̄)·σ²/(σ² + N·τ₀²), the second term goes to 0, so μ_N → x̄.
Precision → ∞. Perfect certainty about μ.
A prior standard deviation τ₀ → ∞ represents a completely "diffuse" prior — we have essentially no prior knowledge:
Prior precision → 0, so prior has no weight.
The posterior standard deviation reduces to σ/√N, the classical standard error of the mean.
| Situation | Prior Influence | Posterior ≈ |
|---|---|---|
| Small N, informative prior | High | Weighted blend, pulled toward prior |
| Small N, diffuse prior | Low | Mostly data-driven, high uncertainty |
| Large N (any prior) | Negligible | ≈ normal(x̄, σ/√N) |
| N → ∞ (any prior) | Zero | Point mass at x̄ (MLE) |
For the dumbbell experiment, we trust the manufacturer's label. We set a tight prior: μ ~ normal(25, 1), i.e. μ₀ = 25, τ₀ = 1.
This says: before collecting data, we believe μ is centred at 25 pounds and we'd be surprised if it were outside [22, 28] (a ±3τ₀ band). The prior is constraining: values of μ far outside that range are treated as implausible.
What if we have almost no prior knowledge about μ? We use a very wide prior, e.g. μ ~ normal(25, 25) with τ₀ = 25:
| Property | Informative (τ₀=1) | Diffuse (τ₀=25) |
|---|---|---|
| Prior precision | High (1/τ₀² = 1.0) | Low (1/τ₀² = 0.0016) |
| Prior influence at N=1 | Strong | Negligible |
| Posterior mean at N=1 | Pulled toward μ₀ = 25 | ≈ x₁ (first observation) |
| Posterior τ_N after 50 observations | Essentially identical | Essentially identical |
| Allows unphysical values? | No (essentially) | Yes — negative μ possible |
A beautiful property of conjugate models: today's posterior becomes tomorrow's prior. You can update sequentially, one observation at a time, and get the same result as batch processing all N observations at once.
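A sketch of that equivalence under the known-σ model (simulated measurements; the seed and sample size are arbitrary): feeding observations one at a time, with each posterior becoming the next prior, lands on exactly the batch answer.

```python
import random

def update(mu0, tau0, x, sigma):
    """One-observation conjugate update: returns the new (mean, sd) for mu."""
    prec = 1.0 / tau0**2 + 1.0 / sigma**2
    mu1 = (mu0 / tau0**2 + x / sigma**2) / prec
    return mu1, prec ** -0.5

random.seed(2)
sigma = 1.0
data = [random.gauss(25.0, sigma) for _ in range(20)]

# Sequential: today's posterior is tomorrow's prior, one observation at a time.
mu_s, tau_s = 25.0, 1.0
for x in data:
    mu_s, tau_s = update(mu_s, tau_s, x, sigma)

# Batch: process all N observations at once via the sufficient stats (N, xbar).
n, xbar = len(data), sum(data) / len(data)
prec = 1.0 / 1.0**2 + n / sigma**2
mu_b = (25.0 / 1.0**2 + n * xbar / sigma**2) / prec
tau_b = prec ** -0.5

assert abs(mu_s - mu_b) < 1e-9 and abs(tau_s - tau_b) < 1e-9
```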
Previously we assumed the measurement noise σ was known. Now we drop that assumption — both μ and σ are unknown. This changes everything:
|  | Known σ (earlier) | Unknown σ (now) |
|---|---|---|
| Noise | σ fixed (manufacturer spec) | No σ assumption needed |
| Posterior | p(μ \| x, σ) — 1D | p(μ, σ \| x) — 2D |
We now target the joint distribution of both unknowns conditioned on all data:
One tractable approach: assume μ and σ are a-priori independent, p(μ, σ) = p(μ)·p(σ), with a Gaussian prior on μ and a uniform prior on σ:
Even if we specify independent priors, the posterior will in general create a relationship between μ and σ — the data couple the two parameters through the likelihood.
Example: μ₀=250, τ₀=2
Example: ℓ=0.5, u=5.5 (pounds)
The uniform prior on σ reflects our bounded uncertainty about measurement repeatability:
| Setting | Value | Meaning |
|---|---|---|
| μ₀ | 250 | Prior belief: we weigh ~250 lbs |
| τ₀ | 2 | ≈99% probability weight < 256 lbs a-priori |
| ℓ (lower) | 0.5 | Scale at least somewhat noisy |
| u (upper) | 5.5 | Scale not wildly inaccurate (±16 lbs at 3σ) |
The unnormalised joint posterior multiplies the likelihood by both priors: p(μ, σ | x) ∝ [∏ᵢ normal(xᵢ | μ, σ)] · normal(μ | μ₀, τ₀) · uniform(σ | ℓ, u).
With only two unknowns (μ, σ) we can visualise the posterior as a 2D surface. The colour encodes the un-normalised log-posterior value (viridis scale):
| Colour | Log-posterior value | Meaning |
|---|---|---|
| Bright yellow | High (near maximum) | Most plausible (μ, σ) combinations — MAP region |
| Green / teal | Moderate | Somewhat plausible |
| Blue / dark blue | Low | Less plausible but within prior bounds |
| Dark purple | Very low | Highly implausible — almost ruled out by data |
| Grey (masked) | −∞ | Outside prior bounds — impossible by assumption |
Posterior surface (N=10, μ₀=250, τ₀=2, σ ~ Uniform(0.5, 5.5)):
The Maximum A Posteriori (MAP) estimate is the pair (μ, σ) that maximises the posterior — analogous to MLE but including the prior:
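The MAP can be found by brute force over a grid, mirroring the heatmap's brightest cell. The sketch below uses illustrative measurements (assumed, not the lecture's dataset), so its MAP lands near, but not exactly at, the values quoted for the lecture's data:

```python
import math

# Illustrative measurements for the bodyweight example (assumed data);
# the prior settings follow the table above.
data = [252.6, 257.4, 251.8, 257.0, 253.4, 256.2, 251.4, 258.2, 254.2, 257.8]
mu0, tau0, lo, hi = 250.0, 2.0, 0.5, 5.5

def log_post(mu, sigma):
    """Unnormalised log-posterior: log-likelihood plus log-priors."""
    if not (lo <= sigma <= hi):
        return -math.inf                        # outside the uniform bounds
    ll = sum(-math.log(sigma) - (x - mu) ** 2 / (2 * sigma**2) for x in data)
    lp = -(mu - mu0) ** 2 / (2 * tau0**2)       # Gaussian prior on mu
    return ll + lp                              # uniform prior on sigma: constant

# Exhaustive search over a (mu, sigma) grid: the argmax is the MAP estimate.
best, mu_map, sigma_map = -math.inf, None, None
for i in range(201):                            # mu in [250, 260], step 0.05
    for j in range(101):                        # sigma in [0.5, 5.5], step 0.05
        mu, sigma = 250 + 0.05 * i, 0.5 + 0.05 * j
        lp = log_post(mu, sigma)
        if lp > best:
            best, mu_map, sigma_map = lp, mu, sigma
print(mu_map, sigma_map)
```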
On the surface plot the MAP sits at σ̂_MAP ≈ 2.5; the vertical extent of the bright region indicates the remaining uncertainty in σ.
Looking at 1D slices of the log-posterior reveals important structure:
How does the joint posterior evolve as we add observations one by one?
| N | μ contours | σ contours | Key observation |
|---|---|---|---|
| 0 (prior) | Vertical lines | Flat (uniform) | σ is completely unconstrained a-priori |
| 1 | No longer vertical | σ MAP pushed to upper bound | First obs > 260 — prior on μ is too narrow; needs high σ to explain it |
| 5–7 | Inner ring appears | σ MAP moves inward | More data resolve both μ and σ simultaneously |
| 10 | Compact ellipse | σ̂ ≈ 2.5 | Posterior well-localised near (μ≈255, σ≈2.5) |
| 30+ | Stable, shrinking | Stable, shrinking | Diminishing returns — surface shape barely changes |
Heatmaps of the un-normalised log-posterior. Yellow = high (plausible), dark blue = low, grey = outside prior bounds. Black rings = iso-probability contours.
As N grows, the posterior contours become increasingly elliptical — approaching the shape of a bivariate Gaussian centred on the MAP. This observation motivates the Laplace (Normal) approximation, which we will cover in the next lecture.
The Multivariate Normal (MVN) generalises the Gaussian to D-dimensional vectors x = (x₁, x₂, …, x_D):
| Symbol | Name | Role |
|---|---|---|
| μ | Mean vector (D×1) | Location centre of the distribution |
| Σ | Covariance matrix (D×D) | Shape, scale, and orientation of the "cloud" |
| |Σ| | Determinant of Σ | Normalisation constant |
| Σ⁻¹ | Precision matrix | Inverse covariance — appears in exponent |
The term in the exponent generalises the familiar 1D squared z-score: Δ² = (x − μ)ᵀ Σ⁻¹ (x − μ), the squared Mahalanobis distance from x to the mean.
For two variables x₁ and x₂ with correlation coefficient ρ:
Σ = [[σ₁², ρσ₁σ₂],
[ρσ₁σ₂, σ₂² ]]
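A small sketch evaluating this bivariate density directly from σ₁, σ₂, ρ (the 2×2 inverse has a closed form), with a factorisation check at ρ = 0:

```python
import math

def mvn2_pdf(x1, x2, m1, m2, s1, s2, rho):
    """Bivariate normal density via the explicit 2x2 covariance inverse."""
    z1, z2 = (x1 - m1) / s1, (x2 - m2) / s2
    # Mahalanobis term (x-mu)^T Sigma^{-1} (x-mu) for the 2D case:
    quad = (z1**2 - 2 * rho * z1 * z2 + z2**2) / (1 - rho**2)
    det = (s1 * s2) ** 2 * (1 - rho**2)          # |Sigma|
    return math.exp(-0.5 * quad) / (2 * math.pi * math.sqrt(det))

def norm_pdf(x, m, s):
    """Ordinary 1D Gaussian density."""
    return math.exp(-0.5 * ((x - m) / s) ** 2) / (s * math.sqrt(2 * math.pi))

# Sanity check: with rho = 0 the joint density factorises into two 1D normals.
p_joint = mvn2_pdf(1.0, -0.5, 0.0, 0.0, 1.0, 2.0, 0.0)
p_indep = norm_pdf(1.0, 0.0, 1.0) * norm_pdf(-0.5, 0.0, 2.0)
assert abs(p_joint - p_indep) < 1e-12
```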
Two powerful closure properties of the MVN:
Marginalisation: x₁ ~ normal(μ₁, σ₁) and x₂ ~ normal(μ₂, σ₂).
This holds in any number of dimensions — every subset of variables in an MVN is also MVN.
x₂ | x₁ ~ normal(μ₂ + ρ(σ₂/σ₁)(x₁−μ₁), (1−ρ²)σ₂²)
Knowing x₁ shifts the conditional mean toward x₁ (weighted by ρ) and reduces variance.
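A Monte-Carlo sketch of that conditioning formula (the parameter values are arbitrary): sample the pair jointly, keep draws with x₁ near a chosen value, and compare the empirical conditional mean against the formula.

```python
import random

random.seed(3)
m1, m2, s1, s2, rho = 0.0, 0.0, 1.0, 2.0, 0.8

def cond_params(x1):
    """Mean and sd of x2 | x1 for a bivariate normal."""
    mean = m2 + rho * (s2 / s1) * (x1 - m1)
    sd = ((1 - rho**2) * s2**2) ** 0.5
    return mean, sd

# Sample (x1, x2) jointly and keep pairs with x1 close to 1.0.
kept = []
for _ in range(200_000):
    z1, z2 = random.gauss(0, 1), random.gauss(0, 1)
    x1 = m1 + s1 * z1
    x2 = m2 + s2 * (rho * z1 + (1 - rho**2) ** 0.5 * z2)  # correlated draw
    if abs(x1 - 1.0) < 0.02:
        kept.append(x2)

mean_hat = sum(kept) / len(kept)
mean_th, sd_th = cond_params(1.0)                # theoretical mean is 1.6
print(round(mean_hat, 2), round(mean_th, 2))
```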
The correlation coefficient ρ ∈ (−1, 1) controls how "tilted" the elliptical contours are:
| ρ value | Contour shape | Interpretation |
|---|---|---|
| ρ = 0 | Circles (if σ₁ = σ₂) or axis-aligned ellipses | Variables independent — knowing x₁ gives no info about x₂ |
| ρ > 0 | Ellipses tilted ↗ (positive slope) | High x₁ → expect high x₂; knowing one reduces uncertainty in other |
| ρ < 0 | Ellipses tilted ↘ (negative slope) | High x₁ → expect low x₂ |
| |ρ| → 1 | Very thin, elongated ellipses | Nearly perfect linear relationship; near-singular Σ |