In Bayesian machine learning, we often need to work with the posterior distribution of parameters θ given data 𝒟. For complex models, this posterior cannot be written in closed form.
In the weight example, θ = (μ, σ) — the mean and the noise standard deviation of a Normal likelihood. We cannot integrate this posterior analytically, so we approximate it.
| Symbol | Meaning |
|---|---|
| θ̂ | The MAP (posterior mode) — the peak of the log-posterior surface |
| H | The Hessian matrix of second derivatives of the log-posterior, evaluated at θ̂ |
| −H⁻¹ | The approximate covariance matrix (negative inverse Hessian = "observed information matrix" inverse) |
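Putting these pieces together, the Laplace Approximation replaces the posterior with a Gaussian centered at the mode:

$$
p(\theta \mid \mathcal{D}) \;\approx\; \mathcal{N}\!\big(\theta \mid \hat{\theta},\; -H^{-1}\big)
$$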
With N=35 observations, the contours around the MAP are nearly elliptical and the approximation closely matches the true posterior near the peak.
The true posterior for σ is skewed (bounded below by 0). With only N=5, the Gaussian approximation breaks down badly in the tails.
Think of the log-posterior as a landscape with a hill. Near the top of the hill, any smooth hill looks roughly like an upside-down bowl — which is exactly the shape of a Gaussian log-density. The Laplace Approximation locally matches that bowl shape at the peak using only first and second derivative information.
We expand the log-posterior around the posterior mode θ̂:
Three terms appear: a constant, a gradient term, and a Hessian term.
At the mode θ̂ the gradient is zero, so the linear term vanishes: the simplified form has only two terms — a constant and a quadratic in (θ − θ̂).
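In symbols, with g and H the gradient and Hessian of the log-posterior at θ̂:

$$
\log p(\theta \mid \mathcal{D}) \;\approx\;
\underbrace{\log p(\hat{\theta} \mid \mathcal{D})}_{\text{constant}}
\;+\; \underbrace{g^{\top}(\theta - \hat{\theta})}_{=\,0 \text{ at the mode}}
\;+\; \underbrace{\tfrac{1}{2}(\theta - \hat{\theta})^{\top} H\, (\theta - \hat{\theta})}_{\text{quadratic}}
$$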
Recall the log of a general MVN density (up to constants):
| Quadratic form | Interpretation |
|---|---|
| ½(θ − θ̂)ᵀ H (θ − θ̂) | Taylor term — H is the Hessian (negative: curvature points downward at the mode) |
| −½ (θ − μ)ᵀ Σ⁻¹ (θ − μ) | MVN log-density term — Σ⁻¹ is the precision matrix |
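Matching the two quadratic forms term by term pins down the Gaussian's parameters:

$$
\mu = \hat{\theta}, \qquad \Sigma^{-1} = -H \;\;\Longleftrightarrow\;\; \Sigma = -H^{-1}
$$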
For the weight example with parameters (μ, σ), the gradient is a 2-element vector of partial derivatives:
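Writing ℓ(θ) = log p(θ | 𝒟):

$$
g = \nabla \ell = \begin{pmatrix} \partial \ell / \partial \mu \\[1mm] \partial \ell / \partial \sigma \end{pmatrix}
$$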
The gradient points in the direction of steepest ascent of the log-posterior surface. At the MAP, g = 0.
For the weight example, the Hessian is a 2×2 matrix of second partial derivatives:
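With ℓ(θ) = log p(θ | 𝒟):

$$
H = \begin{pmatrix}
\dfrac{\partial^2 \ell}{\partial \mu^2} & \dfrac{\partial^2 \ell}{\partial \mu\,\partial \sigma} \\[2mm]
\dfrac{\partial^2 \ell}{\partial \sigma\,\partial \mu} & \dfrac{\partial^2 \ell}{\partial \sigma^2}
\end{pmatrix}
$$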
The Hessian captures the local curvature of the log-posterior surface. Steep curvature → small variance in that direction. Flat curvature → large variance.
To find the MAP, we must locate the peak of the log-posterior surface. We do this iteratively:
The simplest approach: move a fixed fraction of the gradient at each step.
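In symbols, with learning rate η and gradient gₖ at step k:

$$
\theta_{k+1} = \theta_k \pm \eta\, g_k
$$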
(Minus sign for minimization; plus sign for maximization)
| Symbol | Name | Effect |
|---|---|---|
| η | Learning rate | Small η → more steps needed; large η → risk overshooting |
| gₖ | Gradient at step k | Direction of steepest ascent/descent |
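As an illustrative sketch (not from the source — `log_post`, `num_grad`, and the toy numbers are made up), a few hundred iterations of gradient ascent on a one-parameter log-posterior, with the gradient approximated by finite differences:

```r
# Toy log-posterior: an unnormalized Gaussian with its mode at theta = 2
log_post <- function(theta) -0.5 * (theta - 2)^2

# Central finite-difference approximation to the gradient (illustrative helper)
num_grad <- function(f, theta, h = 1e-5) (f(theta + h) - f(theta - h)) / (2 * h)

theta <- 0      # starting point
eta   <- 0.1    # learning rate
for (k in 1:200) {
  g <- num_grad(log_post, theta)
  theta <- theta + eta * g    # plus sign: we are maximizing
}
theta   # converges toward the mode at 2
```

With a small η the iterates creep toward the peak; too large an η would overshoot it, exactly as the table above warns.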
Uses curvature information (the Hessian) to choose a smarter step direction and size:
| Symbol | Role |
|---|---|
| H⁻¹ | Inverse Hessian — scales the step by the local curvature of the surface |
| αₖ | Step size multiplier ∈ (0,1]. αₖ = 1 is the "Full Newton step" |
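Each Newton iteration therefore takes the step (the same formula serves minimization and maximization, because H⁻¹ flips sign with the curvature):

$$
\theta_{k+1} = \theta_k - \alpha_k\, H_k^{-1}\, g_k
$$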
The gradient g ≈ 0. We're at a flat point (the peak). Successive θ values change by less than some tolerance.
The gradient is still non-zero. We're still climbing. Keep iterating until g ≈ 0.
In practice, we let R handle the optimization. optim() approximates the required derivatives numerically via finite differences — no manual differentiation required!
```r
# Negative log-posterior: minimizing it = maximizing the log-posterior
neg_log_post <- function(theta, x) {
  mu <- theta[1]; sigma <- theta[2]
  if (sigma <= 0) return(Inf)
  # Negative (log-likelihood + log-prior)
  -(sum(dnorm(x, mu, sigma, log = TRUE)) +
      dnorm(mu, 250, 2, log = TRUE) +
      dunif(sigma, 0.5, 5.5, log = TRUE))
}

# Run optimization with Hessian computation
result <- optim(
  par = c(250, 3),   # initial guess
  fn  = neg_log_post,
  x   = my_data,
  hessian = TRUE     # get H for the Laplace approximation
)

theta_hat    <- result$par
H            <- result$hessian   # Hessian of the NEGATIVE log-posterior
Sigma_approx <- solve(H)         # ≈ posterior covariance (= −H⁻¹ of the log-posterior)
```
The MVN distribution places mass on all of ℝ (−∞ to +∞). But many parameters have natural constraints:
With N=5, the Laplace Approximation without transforms assigns probability mass to impossible values (σ < 0). This is a serious problem!
Transform the bounded parameter σ to an unbounded parameter φ, apply the Laplace Approximation in the unbounded space, then back-transform samples.
When we apply a transformation φ = T(σ) to a random variable σ, the probability density of the new variable φ is not simply the old density re-labeled. The PDF must be corrected by the Jacobian to preserve total probability mass.
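In formula form, for a monotone transform T:

$$
p_{\varphi}(\varphi) = p_{\sigma}\!\big(T^{-1}(\varphi)\big)\,\left|\frac{d\sigma}{d\varphi}\right|
$$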
| Term | Role |
|---|---|
| T⁻¹(φ) | Inverse transformation — plug back into the original density |
| |dσ/dφ| | Absolute value of the Jacobian — accounts for the "stretching" or "squishing" of the variable |
Intuition: Probability mass must be conserved. If T stretches an interval of σ into a larger interval of φ (i.e. |dσ/dφ| < 1), the density of φ must be lower in that region to compensate — and vice versa.
pσ(σ) dσ = probability mass in the interval [σ, σ+dσ]
pφ(φ) dφ = same probability mass, just re-expressed in [φ, φ+dφ]
For a bounded variable σ ∈ (a, b), the logit transformation maps it to φ ∈ (−∞, +∞):
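One standard choice (assuming this generalized logit is the transform intended) is:

$$
\varphi = T(\sigma) = \log\frac{\sigma - a}{b - \sigma},
\qquad
\sigma = T^{-1}(\varphi) = a + \frac{b - a}{1 + e^{-\varphi}},
\qquad
\frac{d\sigma}{d\varphi} = \frac{(\sigma - a)(b - \sigma)}{b - a}
$$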
optim() handles numerical derivatives, but you must add this correction to your objective function manually.
The joint posterior in the transformed space includes a Jacobian correction term:
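With σ(φ) = T⁻¹(φ), the log-posterior in the transformed space is:

$$
\log p(\mu, \varphi \mid \mathcal{D}) = \log p\big(\mu, \sigma(\varphi) \mid \mathcal{D}\big) + \log\left|\frac{d\sigma}{d\varphi}\right|
$$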
By working in this transformed space, the Laplace Approximation gives an MVN for (μ, φ). Samples are then back-transformed to get (μ, σ) with bounds fully respected.
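A sketch of this pipeline, assuming the same priors as the earlier optim() snippet and bounds (a, b) = (0.5, 5.5) from the uniform prior; the five data values are made up for illustration, and MASS::mvrnorm is assumed available for MVN sampling:

```r
library(MASS)   # for mvrnorm

my_data <- c(251.8, 253.1, 254.9, 255.3, 256.0)   # illustrative N = 5 observations

a <- 0.5; b <- 5.5                                 # bounds on sigma from the prior
to_sigma <- function(phi) a + (b - a) / (1 + exp(-phi))   # inverse generalized-logit

# Negative log-posterior in the unbounded space (mu, phi),
# including the log-Jacobian correction log|d sigma / d phi|
neg_log_post_t <- function(theta, x) {
  mu <- theta[1]; phi <- theta[2]
  sigma   <- to_sigma(phi)
  log_jac <- log((sigma - a) * (b - sigma) / (b - a))
  -(sum(dnorm(x, mu, sigma, log = TRUE)) +
      dnorm(mu, 250, 2, log = TRUE) +
      dunif(sigma, a, b, log = TRUE) +
      log_jac)
}

result <- optim(c(250, 0), neg_log_post_t, x = my_data, hessian = TRUE)

# Sample the MVN in the unbounded space, then back-transform sigma
draws   <- mvrnorm(4000, mu = result$par, Sigma = solve(result$hessian))
samples <- data.frame(mu = draws[, 1], sigma = to_sigma(draws[, 2]))
```

Because to_sigma maps all of ℝ into (a, b), every back-transformed σ sample respects the bounds by construction — no mass can leak below 0.5 or above 5.5.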
Both panels show the marginal approximate posterior for σ. The prior restricts σ ∈ (0.5, 5.5). With only N=5 observations, the MAP sits near 4.5 — close to the upper bound — making the skew problem severe.
The symmetric MVN bleeds outside the valid bounds, assigning probability mass where the prior is zero.
Back-transformed samples respect bounds exactly and reveal the true skewed shape concentrated near the upper limit.