INFSCI 2595 · Machine Learning · Fall 2025

Week 6 — Laplace Approximation

Lecture 6 · Laplace Approximation & Optimization · University of Pittsburgh
The Laplace Approximation
// Why do we need it · Big picture · The approximation result · Limitations
🎯 The Core Problem KEY IDEA

In Bayesian machine learning, we often need to work with the posterior distribution of parameters θ given data 𝒟. For complex models, this posterior cannot be written in closed form.

The Laplace Approximation solves this by fitting a Multivariate Normal (MVN) distribution centered at the posterior mode (MAP estimate) to approximate the true posterior.

In the weight example, θ = (μ, σ) — the mean and noise standard deviation of a Normal likelihood. We cannot integrate this posterior analytically, so we approximate.

📋 The Final Result FORMULA
Laplace Approximation
p(θ | 𝒟) ≈ MVN( θ̂, −(H|θ=θ̂)⁻¹ )
Symbol · Meaning
θ̂: The MAP estimate (posterior mode) — the peak of the log-posterior surface
H: The Hessian matrix of second derivatives of the log-posterior, evaluated at θ̂
−H⁻¹: The approximate covariance matrix (the inverse of the negative Hessian, i.e. the inverse of the "observed information matrix")
✅ When It Works Well GOOD
Unimodal posterior · Large N (> ~30) · Near-Gaussian shape · Unbounded parameters

With N=35 observations, the contours around the MAP are nearly elliptical and the approximation closely matches the true posterior near the peak.

⚠️ Limitations CAUTION
MVN is symmetric · Poor with small N · Ignores parameter bounds · Fails for multi-modal posteriors

The true posterior for σ is skewed (bounded below by 0). With only N=5, the Gaussian approximation breaks down badly in the tails.

🏔️ Why This Approximation Works: Intuition

Think of the log-posterior as a landscape with a hill. Near the top of the hill, any smooth hill looks roughly like an upside-down bowl — which is exactly the shape of a Gaussian log-density. The Laplace Approximation locally matches that bowl shape at the peak using only first and second derivative information.

"An approximate answer to the right question is better than an exact answer to the wrong one." — Lecture quote
🔑 Two-Step Process
1
Find the MAP (posterior mode) via optimization
Use an iterative algorithm (gradient ascent / Newton-Raphson) to find θ̂ — the parameter values that maximize the log-posterior.
2
Compute the Hessian at θ̂
Evaluate the matrix of second derivatives at θ̂. Its negative inverse is the approximate covariance matrix of the MVN posterior.
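The two steps can be sketched end to end on a toy conjugate model — a hypothetical Python example, not the lecture's weight data (the lecture uses R): xᵢ ~ Normal(μ, 1) with a Normal(0, 1) prior on μ, so the true posterior is exactly Gaussian and the Laplace result is exact.

```python
import numpy as np

# Hypothetical toy model: x_i ~ Normal(mu, 1) with prior mu ~ Normal(0, 1).
# The posterior is exactly Gaussian, so the Laplace approximation is exact here.
x = np.array([2.1, 1.9, 2.4, 2.2, 1.8])
n, S = len(x), x.sum()

def grad(mu):               # d/dmu of the log-posterior
    return S - (n + 1) * mu

def hess(mu):               # d^2/dmu^2 of the log-posterior (constant here)
    return -(n + 1)

# Step 1: find the MAP via Newton-Raphson iterations.
mu = 0.0
for _ in range(25):
    mu = mu - grad(mu) / hess(mu)
mu_hat = mu                 # converges to S / (n + 1)

# Step 2: covariance of the MVN = negative inverse Hessian at the MAP.
var_approx = -1.0 / hess(mu_hat)    # = 1 / (n + 1)
print(mu_hat, var_approx)
```

Because the log-posterior here is quadratic, a single Newton step already lands on the mode; the loop simply mirrors the general recipe.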
Taylor Series & MVN Connection
// Second-order expansion · Gradient · Hessian · Log-MVN link
📐 Second-Order Taylor Series Expansion MATH

We expand the log-posterior around the posterior mode θ̂:

log p(θ|𝒟) ≈ log p(θ̂|𝒟) + (θ − θ̂)ᵀ g|θ̂ + ½(θ − θ̂)ᵀ H|θ̂ (θ − θ̂)

Three terms appear: a constant, a gradient term, and a Hessian term.

At the MAP θ̂, the gradient g = 0 by definition (it's a stationary point / mode). The gradient term vanishes!
log p(θ|𝒟) ≈ log p(θ̂|𝒟) + ½(θ − θ̂)ᵀ H|θ̂ (θ − θ̂)

This simplified form has only two terms — a constant and a quadratic in (θ − θ̂).
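A quick numerical check of this expansion, using a made-up single-mode log-density f(θ) = 3 log θ − θ (mode at θ = 3): the quadratic matches closely near the mode, and the mismatch grows out in the tails.

```python
import math

# Hypothetical log-posterior-like function with one interior mode at theta = 3:
# f(theta) = 3*log(theta) - theta, so f'(theta) = 3/theta - 1 and f'' = -3/theta^2.
def f(theta):
    return 3.0 * math.log(theta) - theta

theta_hat = 3.0                    # the mode: f'(3) = 0
hess = -3.0 / theta_hat**2         # curvature at the mode: -1/3

def quad(theta):                   # second-order Taylor expansion at the mode
    return f(theta_hat) + 0.5 * hess * (theta - theta_hat) ** 2

near = abs(f(3.1) - quad(3.1))     # tiny error close to the mode
far  = abs(f(5.0) - quad(5.0))     # much larger error in the tail
print(near, far)
```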

🔗 Connection to the Log-MVN Density LINK

Recall the log of a general MVN density (up to constants):

log p(θ | μ, Σ) ∝ −½ (θ − μ)ᵀ Σ⁻¹ (θ − μ)
Taylor Expansion term
½(θ − θ̂)ᵀ H (θ − θ̂)

H is the Hessian (negative definite at the mode, since the surface curves downward)

Log-MVN term
−½ (θ − μ)ᵀ Σ⁻¹ (θ − μ)

Σ⁻¹ is the precision matrix

Matching terms: μ = θ̂ and Σ = −H⁻¹. The negative Hessian is the observed information matrix, and its inverse is the approximate covariance.
📊 Gradient Vector g

For the weight example with parameters (μ, σ), the gradient is a 2-element vector of partial derivatives:

g = [ ∂/∂μ log p(μ,σ|𝒟), ∂/∂σ log p(μ,σ|𝒟) ]ᵀ

The gradient points in the direction of steepest ascent of the log-posterior surface. At the MAP, g = 0.

📊 Hessian Matrix H

For the weight example, the Hessian is a 2×2 matrix of second partial derivatives:

H = [ ∂²L/∂μ²    ∂²L/∂μ∂σ ]
    [ ∂²L/∂σ∂μ   ∂²L/∂σ²  ]    where L = log p(μ, σ | 𝒟)

The Hessian captures the local curvature of the log-posterior surface. Steep curvature → small variance in that direction. Flat curvature → large variance.

The Hessian is computationally expensive to calculate — requires many function evaluations when using finite differences.
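To see where that cost comes from, here is a Python sketch (the lecture's code is R) of a central-difference Hessian on a made-up quadratic surface with a known Hessian; this naive loop spends four function evaluations on every matrix entry.

```python
import numpy as np

# Central-difference Hessian of a 2-parameter log-posterior-like function.
# The surface below is a made-up quadratic, so its true Hessian is known:
# [[-2.0, -0.5], [-0.5, -1.0]].
def log_post(theta):
    mu, sigma = theta
    return -0.5 * (2.0 * mu**2 + sigma**2 + mu * sigma)

def fd_hessian(f, theta, h=1e-5):
    k = len(theta)
    H = np.zeros((k, k))
    for i in range(k):
        for j in range(k):
            e_i = np.zeros(k); e_i[i] = h
            e_j = np.zeros(k); e_j[j] = h
            # 4 function evaluations per entry -> 4*k^2 calls in total
            H[i, j] = (f(theta + e_i + e_j) - f(theta + e_i - e_j)
                       - f(theta - e_i + e_j) + f(theta - e_i - e_j)) / (4 * h * h)
    return H

H = fd_hessian(log_post, np.array([0.0, 0.0]))
print(H)
```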
Finding the MAP: Optimization
// Gradient ascent · Newton-Raphson · Step size · Convergence · optim() in R
🏔️ The "Hill Climbing" Intuition

To find the MAP, we must find the peak of the log-posterior surface. We do this iteratively:

1
Start from an initial guess θ₀ (iteration k = 0)
Place an arbitrary starting point anywhere in parameter space (e.g., the prior mean).
2
Calculate the gradient
Compute g at the current location — this points uphill toward the peak.
3
Move in the gradient direction (scaled by step size)
Update θ: θk+1 = θk + αk · dk, where dk is the uphill direction (for gradient ascent, dk = g|k) and αk is the step size
4
Repeat until convergence (g ≈ 0)
Stop when the gradient is effectively zero — we're at the peak.
⬆️ Gradient Ascent / Descent METHOD 1

The simplest approach: move a fixed fraction of the gradient at each step.

θk+1 = θk − η · g|k

(Minus sign for minimization; plus sign for maximization)

Symbol · Name · Effect
η: Learning rate. Small η means more steps are needed; large η risks overshooting the peak
g|k: Gradient at step k. The direction of steepest ascent/descent
Problem: Step size is constant, so gradient ascent may zigzag or take too many steps near the optimum. It ignores how the surface curves.
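A minimal Python sketch of that behavior, on a hypothetical 1-D surface f(θ) = −(θ − 2)² with its peak at θ = 2: the fixed learning rate forces many small steps near the flat top.

```python
# Plain gradient ascent with a fixed learning rate on a hypothetical
# 1-D surface f(theta) = -(theta - 2)^2, whose peak is at theta = 2.
def grad(theta):
    return -2.0 * (theta - 2.0)

theta, eta, steps = 0.0, 0.1, 0
while abs(grad(theta)) > 1e-8 and steps < 10_000:
    theta += eta * grad(theta)      # always a fixed fraction of the gradient
    steps += 1
print(theta, steps)                 # reaches the peak, but needs many steps
```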
🔄 Newton-Raphson Method METHOD 2

Uses curvature information (the Hessian) to choose a smarter step direction and size:

dk = −H|k⁻¹ · g|k
θk+1 = θk + αk · dk = θk − αk · H|k⁻¹ · g|k
Symbol · Role
H|k⁻¹: Inverse Hessian; scales the step by the local curvature of the surface
αk: Step size multiplier ∈ (0, 1]. α = 1 is the "full Newton step"
Key insight: Where the surface is sharply curved (large-magnitude Hessian), Newton-Raphson takes a smaller step. Where it is nearly flat, it takes a larger step. The step adapts automatically!
Maximization → Minimization trick: Multiply the log-posterior by −1. Now we minimize the negative log-posterior, which is equivalent to finding the maximum.
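For comparison with plain gradient ascent, here is a Python sketch of Newton-Raphson on a hypothetical 1-D surface f(θ) = 3 log θ − θ (unique mode at θ = 3): the curvature-scaled step converges in a handful of iterations.

```python
# Newton-Raphson on a hypothetical 1-D surface f(theta) = 3*log(theta) - theta,
# which has a unique mode at theta = 3. The Hessian scales every step.
def grad(theta):
    return 3.0 / theta - 1.0

def hess(theta):
    return -3.0 / theta**2

theta, steps = 2.0, 0
while abs(grad(theta)) > 1e-8 and steps < 100:
    theta -= grad(theta) / hess(theta)   # full Newton step (alpha = 1)
    steps += 1
print(theta, steps)                      # converges in a handful of steps
```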
🔁 Convergence
Converged

The gradient g ≈ 0. We're at a flat point (the peak). Successive θ values change by less than some tolerance.

Not Converged

The gradient is still non-zero. We're still climbing. Keep iterating until g ≈ 0.

For the weight example, the log-posterior has one unique maximum — so regardless of where we start, the algorithm converges to the same MAP. This is not guaranteed for neural networks!
🛠️ Using optim() in R R CODE

In practice, we let R handle the optimization. optim() uses finite differences to approximate derivatives — no manual differentiation required!

# Define the negative log-posterior (minimizing it maximizes the log-posterior)
neg_log_post <- function(theta, x) {
  mu <- theta[1]; sigma <- theta[2]
  if (sigma <= 0) return(Inf)
  # Negative log-likelihood + negative log-prior
  -(sum(dnorm(x, mu, sigma, log=TRUE)) +
    dnorm(mu, 250, 2, log=TRUE) +
    dunif(sigma, 0.5, 5.5, log=TRUE))
}

# Run optimization with Hessian computation
result <- optim(
  par = c(250, 3),       # initial guess
  fn  = neg_log_post,
  x   = my_data,
  hessian = TRUE          # get H for Laplace approx
)

theta_hat <- result$par
H         <- result$hessian
Sigma_approx <- solve(H)  # H is for the NEGATIVE log-posterior, so no extra minus sign
Finite differences approximate derivatives by perturbing each parameter by a tiny Δ and computing (f(θ+Δ) − f(θ)) / Δ. No analytic calculus needed!
Parameter Transformations
// Why transforms are needed · Logit transform · Change of variables · Back-transforming via sampling
⚠️ The Problem with Bounded Parameters ISSUE

The MVN distribution places mass on all of ℝ (−∞ to +∞). But many parameters have natural constraints:

σ must be > 0 · σ has a uniform prior on (0.5, 5.5) · MVN ignores these bounds!

With N=5, the Laplace Approximation without transforms assigns probability mass to impossible values (σ < 0). This is a serious problem!

🔄 The Fix: Change of Variables SOLUTION

Transform the bounded parameter σ to an unbounded parameter φ, apply the Laplace Approximation in the unbounded space, then back-transform samples.

1
Choose a transformation φ = T(σ)
Map the bounded σ → (a, b) to an unbounded φ ∈ (−∞, +∞).
2
Apply Laplace Approximation in (μ, φ) space
Both μ and φ are now unbounded. The MVN is an appropriate approximation.
3
Sample from the approximate MVN
Draw many (μ, φ) pairs from the approximate posterior.
4
Back-transform φ → σ
Apply σ = T⁻¹(φ) to each sample. The bounds on σ are automatically respected!
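Steps 3 and 4 can be sketched like this (Python for illustration; the MVN mode and covariance below are made up, standing in for the output of a Laplace fit in (μ, φ) space):

```python
import numpy as np

rng = np.random.default_rng(0)
a, b = 0.5, 5.5                      # prior bounds on sigma (from the lecture)

def logistic(phi):
    return 1.0 / (1.0 + np.exp(-phi))

def to_sigma(phi):                   # back-transform sigma = T^{-1}(phi)
    return a + (b - a) * logistic(phi)

# Pretend the Laplace step already produced an MVN for (mu, phi).
# Mode and covariance here are made up, for illustration only.
mean = np.array([250.0, 1.2])
cov  = np.array([[0.8, 0.1],
                 [0.1, 0.5]])

samples = rng.multivariate_normal(mean, cov, size=10_000)
mu_draws    = samples[:, 0]
sigma_draws = to_sigma(samples[:, 1])   # back-transform phi -> sigma

# Every back-transformed draw respects the (a, b) bounds automatically.
print(sigma_draws.min(), sigma_draws.max())
```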
📜 Change of Variables Theorem for PDFs THEOREM

When we apply a transformation φ = T(σ) to a random variable σ, the probability density of the new variable φ is not simply the old density re-labeled. The PDF must be corrected by the Jacobian to preserve total probability mass.

If σ ~ pσ(σ) and φ = T(σ) is a monotone, differentiable transformation, then the density of φ is:
pφ(φ) = pσ( T⁻¹(φ) ) · |dσ/dφ|
Term · Role
T⁻¹(φ): Inverse transformation; plug back into the original density
|dσ/dφ|: Absolute value of the Jacobian; accounts for the "stretching" or "squishing" of the variable

Intuition: Probability mass must be conserved. If T squishes an interval of σ into a smaller interval of φ (i.e. |dσ/dφ| < 1), the density of φ must be lower in that region to compensate — and vice versa.

In the original σ-space

pσ(σ) dσ = probability mass in the interval [σ, σ+dσ]

In the transformed φ-space

pφ(φ) dφ = same probability mass, just re-expressed in [φ, φ+dφ]

In log-posterior terms: log p(μ, φ | 𝒟) = log p(μ, T⁻¹(φ) | 𝒟) + log|dσ/dφ|. The log-Jacobian is simply added to the log-posterior — easy to implement in R!
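A numerical sanity check of the theorem, on a made-up example: push σ ~ Uniform(1, e) through the monotone transform φ = log σ. The Jacobian-corrected density of φ must still integrate to 1.

```python
import math

# Change-of-variables check on a made-up example:
# sigma ~ Uniform(1, e) and phi = T(sigma) = log(sigma).
# Then sigma = T^{-1}(phi) = exp(phi) and |d sigma / d phi| = exp(phi).
E = math.e

def p_sigma(s):                     # original uniform density on (1, e)
    return 1.0 / (E - 1.0) if 1.0 < s < E else 0.0

def p_phi(phi):                     # transformed density on (0, 1)
    return p_sigma(math.exp(phi)) * math.exp(phi)

# Midpoint-rule integral over (0, 1): mass must be conserved.
N = 100_000
total_mass = sum(p_phi((i + 0.5) / N) for i in range(N)) / N
print(total_mass)
```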
📐 The Logit Transformation FORMULA

For a bounded variable σ ∈ (a, b), the logit transformation maps it to φ ∈ (−∞, +∞):

φ = T(σ) = logit( (σ − a) / (b − a) ) = log( p / (1 − p) ), where p = (σ − a) / (b − a)
σ = T⁻¹(φ) = a + (b − a) · logistic(φ) = a + (b − a) · 1/(1 + e⁻φ)
dσ/dφ = (b − a) · logistic(φ) · (1 − logistic(φ)) ← Jacobian term
The derivative term (Jacobian) must be included in the log-posterior when working in the transformed space. optim() handles numerical derivatives, but you must add this correction to your objective function manually.
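The Jacobian formula can be checked against a finite difference (a small Python sketch, using the lecture's bounds a = 0.5 and b = 5.5):

```python
import math

# Check the logit-transform Jacobian d sigma / d phi = (b - a) * s * (1 - s)
# against a central finite difference, at a few arbitrary phi values.
a, b = 0.5, 5.5

def logistic(phi):
    return 1.0 / (1.0 + math.exp(-phi))

def inv_T(phi):                       # sigma = T^{-1}(phi)
    return a + (b - a) * logistic(phi)

def jac(phi):                         # analytic Jacobian from the formula
    s = logistic(phi)
    return (b - a) * s * (1.0 - s)

h = 1e-6
for phi in (-2.0, 0.0, 1.5):
    fd = (inv_T(phi + h) - inv_T(phi - h)) / (2 * h)
    assert abs(fd - jac(phi)) < 1e-6
print("Jacobian formula matches finite differences")
```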
📊 Transformed Posterior in (μ, φ) Space

The joint posterior in the transformed space includes a Jacobian correction term:

p(μ, φ | 𝒟) ∝ ∏ᵢ Normal(xᵢ | μ, T⁻¹(φ)) · Normal(μ | μ₀, σ₀) · Uniform(T⁻¹(φ) | a, b) · |dσ/dφ|

By working in this transformed space, the Laplace Approximation gives an MVN for (μ, φ). Samples are then back-transformed to get (μ, σ) with bounds fully respected.

✅ Result with N=5 After Transformation VISUAL

Both panels show the marginal approximate posterior for σ. The prior restricts σ ∈ (0.5, 5.5). With only N=5 observations, the MAP sits near 4.5 — close to the upper bound — making the skew problem severe.

⚠ Without Transform

The symmetric MVN bleeds outside the valid bounds, assigning probability mass where the prior is zero.

✓ With Logit Transform

Back-transformed samples respect bounds exactly and reveal the true skewed shape concentrated near the upper limit.

Figure legend: approximate posterior density · invalid region (outside prior) · bounds σ = 0.5 and σ = 5.5
The logit transformation is also key in classification — it maps probabilities (0,1) to log-odds (−∞, +∞). We'll revisit it later in the semester!