INFSCI 2595 · Fall 2025 · Week 8

Bayesian Linear Models

Informative Priors · Posterior Precision · Design Matrices · Unknown σ · Basis Functions
Week 8 Overview
// What this lecture adds on top of Week 7
🗺️ The Big Picture START HERE

Week 7 derived the MLE for β and introduced the Bayesian posterior under the assumptions that σ is known and the prior is infinitely diffuse. Week 8 extends this in three directions:

1
Informative / MVN Priors
Replace the flat prior with a proper Multivariate Normal prior. The posterior is still Normal — but now the posterior mean is a precision-weighted average of the prior and the data, and the posterior precision adds the prior precision to the data precision.
2
Unknown σ
Drop the assumption that σ is known. In the classical (MLE) view: σ̂² = MSE. In the Bayesian view: σ becomes another unknown with its own prior, and we seek the joint posterior p(β, σ | y, X).
3
Linear Basis Models
Show that "linear" refers to linearity in the parameters, not in x. By applying basis functions (polynomials, sine, splines) to x first, the same framework captures highly non-linear input–output relationships.
📐 Model Recap DEFINITION

The full probability model for a single-input linear model with Gaussian noise:

Likelihood
yₙ | μₙ, σ ~ Normal(yₙ | μₙ, σ)
Deterministic Mean Trend
μₙ = β₀ + β₁xₙ = xₙ,: · β
Prior (Informative)
β | μ₀, B₀ ~ MVN(β | μ₀, B₀)
With a Normal prior on β and a Gaussian likelihood, the posterior is also Normal — this is the Normal–Normal conjugate relationship extended to vectors.
🔑 Key Formulas at a Glance MATH
Quantity | Formula | What it means
Posterior precision | V_N⁻¹ = B₀⁻¹ + (1/σ²) XᵀX | Prior precision + data precision
Posterior mean | m_N = V_N (B₀⁻¹μ₀ + (1/σ²) Xᵀy) | Precision-weighted average of prior and data
MLE for σ² | σ̂² = (1/N) (y − Xβ̂)ᵀ(y − Xβ̂) | Mean Squared Error (MSE)
Diffuse prior recovery | B₀⁻¹ → 0 ⟹ m_N → β̂_MLE | Diffuse prior collapses to MLE
📚 Topics in This Lecture
MVN Prior · Posterior Precision Matrix · Precision-Weighted Mean · Design Matrix · Sum-of-Squares Matrix · Informative vs Diffuse Prior · Sequential Updating · Unknown σ / MLE for σ² · Basis Functions · Linear in Parameters · Polynomial Basis
Posterior with an Informative Prior
// MVN prior · conjugacy · posterior precision · precision-weighted mean
🔄 From Diffuse to Informative KEY IDEA

Week 7 used a nearly flat prior p(β) ∝ 1. Now we replace it with a proper Multivariate Normal (MVN) prior:

β | μ₀, B₀ ~ MVN(β | μ₀, B₀)

Where μ₀ is the prior mean vector and B₀ is the prior covariance matrix. The key insight is:

Normal likelihood × Normal prior = Normal posterior (conjugacy). The posterior is still MVN — just with updated mean and covariance.
⚡ Posterior Precision Matrix FORMULA

The precision matrix is the inverse of the covariance matrix. For the posterior:

Posterior Precision = Prior Precision + Data Precision
V_N⁻¹ = B₀⁻¹ + (1/σ²) XᵀX
B₀⁻¹ — Prior Precision

Encodes how confident we are in the prior. A tight prior (small B₀ variance) → large B₀⁻¹ → prior contributes heavily.

(1/σ²) XᵀX — Data Precision

Grows with every observation added to XᵀX. A small noise σ → large (1/σ²) → data contributes heavily.

Think of it this way: the prior adds "fake" data to the diagonal of XᵀX, regularizing the estimate away from extreme values.
⚖️ Posterior Mean — Precision-Weighted Average FORMULA

The posterior mean m_N is a precision-weighted average of the prior belief and what the data say:

Posterior Mean
m_N = V_N ( B₀⁻¹ μ₀ + (1/σ²) Xᵀy )

Breaking this down:

Term | Meaning
B₀⁻¹ μ₀ | Prior precision × prior mean — "the prior's vote"
(1/σ²) Xᵀy | Data precision × data summary — "the data's vote"
V_N = (V_N⁻¹)⁻¹ | Normalizing factor (inverse of total precision)
Intuition: Each "voter" (prior, data) contributes proportionally to its precision. More data → data vote dominates. Tight prior → prior vote dominates.
🔍 Recovering the MLE with a Diffuse Prior INSIGHT

As the prior becomes infinitely diffuse (B₀ → ∞I), the prior precision B₀⁻¹ → 0. Plugging this in:

1
Posterior precision simplifies
V_N⁻¹ → (1/σ²) XᵀX
2
Posterior covariance recovers the Week 7 result
V_N → σ²(XᵀX)⁻¹
3
Posterior mean recovers the MLE
m_N → (XᵀX)⁻¹Xᵀy = β̂_MLE
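This limit is easy to check numerically. A minimal sketch (my own synthetic data and prior settings, with μ₀ = 0): as τ grows, the posterior mean converges to the MLE.

```r
# Sketch (synthetic data assumed): as the prior grows more diffuse,
# the posterior mean approaches the MLE.
set.seed(1)
N <- 25; sigma <- 0.5
x <- runif(N, -2, 2)
X <- unname(cbind(1, x))                  # design matrix with intercept column
y <- as.vector(X %*% c(-0.25, 1.15) + rnorm(N, 0, sigma))

beta_mle <- solve(t(X) %*% X, t(X) %*% y)

# Posterior mean with prior mean 0 and B0 = tau^2 * I
post_mean <- function(tau) {
  B0_inv <- diag(2) / tau^2
  VN <- solve(B0_inv + t(X) %*% X / sigma^2)
  VN %*% (t(X) %*% y / sigma^2)           # B0_inv %*% mu0 vanishes since mu0 = 0
}

max(abs(post_mean(1)    - beta_mle))      # visible shrinkage toward 0
max(abs(post_mean(1000) - beta_mle))      # essentially the MLE
```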
🎲 Simplified Independent Prior SPECIAL CASE

If we assume all regression coefficients are a-priori independent with the same prior standard deviation τ_β and same prior mean μ_β:

p(β) = Normal(β₀ | μ_β, τ_β) × Normal(β₁ | μ_β, τ_β)

The prior covariance matrix becomes a diagonal matrix (off-diagonal = 0):

B₀ = τ_β² · I

Its inverse is trivially:

B₀⁻¹ = (1/τ_β²) · I

So the posterior precision becomes:

V_N⁻¹ = (1/τ_β²) I + (1/σ²) XᵀX
The prior adds 1/τ_β² to every diagonal element of (1/σ²) XᵀX. This is exactly like Ridge Regression regularization!
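The Ridge connection can be verified directly. A sketch (my own synthetic data; μ₀ = 0 assumed): the posterior mean equals the ridge closed-form solution with penalty λ = σ²/τ_β².

```r
# Sketch: with mu0 = 0, the Bayesian posterior mean equals the ridge
# closed form with penalty lambda = sigma^2 / tau_beta^2 (synthetic data).
set.seed(6)
x <- runif(15, -2, 2)
X <- unname(cbind(1, x))
y <- as.vector(-0.25 + 1.15 * x + rnorm(15, 0, 0.5))
sigma <- 0.5; tau_beta <- 1

VN_inv <- diag(2) / tau_beta^2 + t(X) %*% X / sigma^2
mN <- solve(VN_inv, t(X) %*% y / sigma^2)           # posterior mean, mu0 = 0

lambda <- sigma^2 / tau_beta^2
ridge <- solve(t(X) %*% X + lambda * diag(2), t(X) %*% y)

all.equal(mN, ridge)                                # TRUE
```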
🛠️ R Code: Posterior with Informative Prior R CODE
# Given: Xmat (design matrix), y (response vector), sigma_true (known noise sd)

# Define prior: independent standard normals (τ_β = 1)
B0 <- diag(2)   # 2x2 identity = prior covariance with τ_β = 1
mu0 <- c(0, 0) # prior mean vector

# Posterior precision: prior precision + data precision
VN_inv <- solve(B0) + (t(Xmat) %*% Xmat) / sigma_true^2

# Posterior covariance
VN <- solve(VN_inv)

# Posterior mean
mN <- VN %*% (solve(B0) %*% mu0 + (t(Xmat) %*% y) / sigma_true^2)

# Posterior standard deviations (square root of diagonal)
post_sd <- sqrt(diag(VN))

# Posterior correlation matrix
post_cor <- cov2cor(VN)

Note: Use drop=FALSE when subsetting rows from a matrix (e.g., Xmat[1, , drop=FALSE]) to keep the result as a matrix instead of a vector.

Design Matrix & Sum-of-Squares
// Design matrix · sum-of-squares matrix · sequential accumulation · R code
📋 The Design Matrix X DEFINITION

The design matrix X is an N × (D+1) matrix. Each row n is the n-th observation's input vector, with a leading 1 encoding the intercept:

X for D=1 (single input), N=3 observations:

      ⎡ 1  x₁,₁ ⎤
X  =  ⎢ 1  x₂,₁ ⎥
      ⎣ 1  x₃,₁ ⎦
   intercept  input
    column    column
The first column is always all 1s. This is the "fake variable" x_{n,0} = 1 that allows β₀ to be treated like any other coefficient inside the sum: μₙ = Σ_d β_d x_{n,d}.
➕ The Sum-of-Squares Matrix XᵀX FORMULA

The sum-of-squares (or "data precision") matrix is:

XᵀX = Σₙ xₙ,:ᵀ xₙ,:

For the single-input case, each observation n contributes a 2×2 outer product. After 2 observations:

⎡ 1      x₁,₁  ⎤   ⎡ 1      x₂,₁  ⎤   ⎡ 2             x₁,₁ + x₂,₁   ⎤
⎢              ⎥ + ⎢              ⎥ = ⎢                              ⎥
⎣ x₁,₁   x₁,₁² ⎦   ⎣ x₂,₁   x₂,₁² ⎦   ⎣ x₁,₁ + x₂,₁   x₁,₁² + x₂,₁² ⎦
Key property: XᵀX always has dimensions (D+1) × (D+1) regardless of how many observations N you have. Adding more observations changes the values but not the size.
📈 Sequential Accumulation KEY IDEA

Each new observation adds its outer product to the running sum-of-squares:

XᵀX (after N obs) = XᵀX (after N-1 obs) + xₙ,:ᵀ xₙ,:
N | Contribution from observation N | Dimensions of XᵀX
1 | [1, x₁,₁; x₁,₁, x₁,₁²] | 2 × 2
2 | + [1, x₂,₁; x₂,₁, x₂,₁²] | still 2 × 2
N | + [1, xₙ,₁; xₙ,₁, xₙ,₁²] | always 2 × 2
Singularity warning: With fewer observations than parameters (N < D+1), XᵀX is singular (not invertible). Without a prior, you cannot compute the posterior covariance — R will throw an error. An informative prior fixes this by adding B₀⁻¹ to the diagonal.
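The accumulation above can be sketched in R (toy x values assumed): adding each observation's rank-1 outer product reproduces t(X) %*% X exactly.

```r
# Sketch of sequential accumulation: rank-1 updates reproduce t(X) %*% X
x <- c(0.3, -1.2, 0.8, 2.0)
X <- unname(cbind(1, x))

XtX_seq <- matrix(0, 2, 2)
for (n in seq_len(nrow(X))) {
  xn <- X[n, , drop = FALSE]          # 1 x 2 row, kept as a matrix
  XtX_seq <- XtX_seq + t(xn) %*% xn   # add observation n's outer product
}

all.equal(XtX_seq, t(X) %*% X)        # TRUE -- and always 2 x 2
```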
🛠️ Building the Design Matrix in R R CODE
# model.matrix() creates design matrix FROM THE FORMULA, not from the data columns
# It automatically adds the intercept column of 1s
Xmat <- model.matrix(y ~ x, data = train_df)

# train_df may have extra columns (obs_id, mu, etc.) — model.matrix ignores them!
# It only uses variables mentioned in the formula: y (response) and x (predictor)

# Access specific rows:  use drop=FALSE to stay a matrix (not drop to vector)
Xmat[1, , drop=FALSE]     # first row — still a 1×2 matrix
Xmat[1:2, , drop=FALSE]  # first 2 rows

# Sum of squares matrix for first N observations
N <- 10
XtX <- t(Xmat[1:N, , drop=FALSE]) %*% Xmat[1:N, , drop=FALSE]

# Posterior covariance (diffuse prior, known sigma)
post_cov_diffuse <- sigma_true^2 * solve(XtX)

# Posterior standard deviations
sqrt(diag(post_cov_diffuse))
Informative vs Diffuse Priors
// Sequential updating · posterior contour evolution · what the prior controls
📏 Three Types of Prior COMPARISON
Prior Type | Prior SD (τ_β) | B₀ | Effect
Truly diffuse | ∞ | ∞ · I | B₀⁻¹ = 0. Cannot compute the posterior with N < D+1. Posterior = likelihood shape.
Vague / weakly informative | 20 | 400 · I | Almost no regularization. Posterior closely tracks the likelihood. High posterior correlation after 1 observation (≈ −0.999).
Informative | 1 | I | Pulls parameters toward 0 and rules out extreme values. Posterior correlation is more moderate (≈ −0.73). Works even with N = 1, because the prior precision B₀⁻¹ contributes to the precision matrix before any data arrive, acting as "virtual" prior observations that make V_N⁻¹ invertible.
🔭 Sequential Updating — Posterior Contours VISUALIZATION

The lecture steps through the joint posterior contour over (β₁, β₀) as observations are added one at a time, from N = 0 to N = 10, under each prior type.

True values (β₁=1.15, β₀=−0.25)
With large N, both posteriors converge to the same region near the true parameter values. The prior matters most when N is small.
❓ Why Diagonal Lines After 1 Observation? INSIGHT

With a diffuse prior and only 1 observation, the posterior is essentially just the likelihood of that one data point:

μₙ₌₁ = β₀ + β₁ · x₁ = constant

This is the equation of a line in (β₀, β₁) space. Any point on that line gives the same predicted mean. Increasing β₁ while decreasing β₀ keeps μ₁ constant — hence the diagonal contour lines.

A second observation at a different x value gives a second constraint line. The two lines intersect at a unique point — this pins down both β₀ and β₁, producing elliptical contours.
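The two-constraint intersection can be sketched with toy values (my own numbers): two observations at different x values pin down both coefficients exactly.

```r
# Sketch (toy values assumed): two observations at different x values give
# two constraint lines that intersect at a unique (beta0, beta1).
X2 <- rbind(c(1, -1),    # observation at x = -1
            c(1,  2))    # observation at x =  2
y2 <- c(0.5, 2.0)

beta_exact <- solve(X2, y2)   # both coefficients pinned down exactly
X2 %*% beta_exact             # reproduces y2 -- zero residuals
```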
💡 What Informative Priors Do CONCEPTUAL
Informative Prior (sd = 1)
  • ✅ Rules out extreme (β₀, β₁) combinations
  • ✅ Posterior is identifiable from 1 observation
  • ✅ Reduces posterior correlation
  • ✅ Acts as regularization (like Ridge)
  • ⚠️ May bias posterior if prior is wrong
Diffuse Prior (sd = 20)
  • ✅ Lets the data speak
  • ✅ Posterior ≈ likelihood with sufficient data
  • ⚠️ High posterior correlation with small N
  • ⚠️ Cannot invert XᵀX with N < D+1 if truly flat
  • ⚠️ Slow to locate the posterior mode
Unknown σ
// MLE for σ² · joint posterior · Exponential prior · Laplace approximation
❓ What Changes When σ Is Unknown? KEY IDEA

Everything covered so far assumed σ is known. When σ is also unknown, we have two options:

Classical / MLE Approach

Treat σ as a parameter and maximize the log-likelihood jointly over β and σ. This gives a closed-form MLE for σ².

Bayesian Approach

Place a prior on σ. The posterior is now the joint posterior p(β, σ | y, X). In general, this does not have a simple closed form.

📐 MLE for σ² FORMULA

Starting from the log-likelihood (keeping the σ terms this time):

Log-likelihood with σ explicit
log p(y | X, β, σ) = −(N/2) log σ² − (1/(2σ²)) (y − Xβ)ᵀ(y − Xβ) + constant

Evaluating at the MLE β̂, then taking d/d(σ²) = 0 and solving:

MLE for σ² (= MSE)
σ̂² = (1/N) (y − Xβ̂)ᵀ (y − Xβ̂) = (1/N) · SSE
This is the Mean Squared Error (MSE) — the average squared residual. Statistics texts often use (N − D − 1) instead of N in the denominator (degrees-of-freedom correction), but the concept is the same.
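The N versus N − D − 1 distinction is easy to see numerically. A sketch (my own synthetic data): the MLE divides the SSE by N, while lm() uses the corrected denominator N − 2 for a single-input model.

```r
# Sketch (synthetic data assumed): sigma^2 MLE = SSE / N, while lm()
# reports the degrees-of-freedom-corrected SSE / (N - 2).
set.seed(2)
N <- 50
x <- runif(N, -2, 2)
X <- unname(cbind(1, x))
y <- as.vector(X %*% c(-0.25, 1.15) + rnorm(N, 0, 0.33))

beta_hat   <- solve(t(X) %*% X, t(X) %*% y)
resid      <- y - as.vector(X %*% beta_hat)
sigma2_mle <- sum(resid^2) / N             # MLE = MSE

fit <- lm(y ~ x)
sigma2_unbiased <- sum(residuals(fit)^2) / (N - 2)

sigma2_mle / sigma2_unbiased               # (N - 2) / N = 0.96
```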
🔗 Bayesian Joint Posterior DEFINITION

In the Bayesian framework, σ is just another unknown parameter. Assuming σ and β are a-priori independent:

p(β, σ | y, X) ∝ p(y | X, β, σ) × p(β) × p(σ)

A common choice of prior for σ (a positive parameter) is the Exponential distribution:

σ | λ = 1 ~ Exp(σ | λ = 1)
The joint posterior over (β₀, β₁, σ) no longer has a simple closed form. We use the Laplace Approximation: find the joint MAP (mode) and approximate the posterior as a multivariate Normal around it using the Hessian.
🗺️ Complete Model Specification

The full probability model with unknown σ and informative priors:

yₙ | μₙ, σ ~ Normal(yₙ | μₙ, σ)
μₙ = β₀ + β₁xₙ
β₀ ~ Normal(β₀ | 0, 1) β₁ ~ Normal(β₁ | 0, 1)
σ | λ=1 ~ Exp(σ | λ=1)

The log-posterior is: log p(β, σ | y, X) = log p(y|X,β,σ) + log p(β₀) + log p(β₁) + log p(σ) + constant

To apply the Laplace Approximation: (1) take partial derivatives of the log-posterior w.r.t. β₀, β₁, and σ; (2) find the joint MAP; (3) evaluate the Hessian; (4) use −H⁻¹ as the approximate posterior covariance.
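These steps can be sketched with optim() (data simulated here; variable names are my own). Working with φ = log σ keeps the optimization unconstrained, which requires adding the log-Jacobian term φ to the log-posterior:

```r
# Sketch of the Laplace approximation via optim() (synthetic data assumed)
set.seed(3)
N <- 40
x <- runif(N, -2, 2)
y <- -0.25 + 1.15 * x + rnorm(N, 0, 0.33)
X <- unname(cbind(1, x))

log_post <- function(theta) {        # theta = (beta0, beta1, phi = log sigma)
  beta  <- theta[1:2]
  sigma <- exp(theta[3])
  sum(dnorm(y, as.vector(X %*% beta), sigma, log = TRUE)) +  # likelihood
    sum(dnorm(beta, 0, 1, log = TRUE)) +                     # Normal(0,1) priors
    dexp(sigma, rate = 1, log = TRUE) +                      # Exp(1) prior on sigma
    theta[3]                         # log-Jacobian from sigma = exp(phi)
}

fit <- optim(c(0, 0, 0), log_post, hessian = TRUE,
             method = "BFGS", control = list(fnscale = -1))

map_estimate <- fit$par              # joint MAP of (beta0, beta1, log sigma)
post_cov <- solve(-fit$hessian)      # approximate posterior covariance
```

fnscale = -1 tells optim() to maximize, so the returned Hessian is negative definite at the mode and −H⁻¹ gives a valid covariance matrix.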
Linear Basis Functions
// What "linear" really means · basis expansions · polynomial · sine · non-linear transforms
❓ What Makes a Model Linear? KEY IDEA

A model is called linear not because of the relationship between y and x, but because of the relationship between y and the unknown parameters β.

Rule: If the mean response μₙ can be written as a linear combination of the unknown parameters μₙ = Σ_d β_d φ_d(x) where each φ_d is a known function of x, then the model is linear in the parameters — and all the Bayesian/MLE machinery applies directly.
📋 Which Are Linear Models? QUIZ YOURSELF
Model | Linear? | Why
μ = β₀ + β₁x | ✓ Yes | Basic case
μ = β₀ + β₁x₁ + β₂x₂ | ✓ Yes | Multiple inputs, still linear in β
μ = β₀ + β₁x₁ + β₂x₁² | ✓ Yes | x₁² is a known function of x₁ — define φ₂(x) = x²
μ = β₀ + β₁x₁x₂ | ✓ Yes | Interaction term — x₁x₂ is a known function
μ = β₀ + β₁ sin(x) | ✓ Yes | sin(x) is a known function of x — define φ₁(x) = sin(x)
μ = β₀ exp(β₁x) | ✗ No | β₁ appears inside the exponent — non-linear in the parameters
The only non-linear model in the lecture is μ = β₀ exp(β₁x) because β₁ multiplies x inside a non-linear function. However, taking a log transforms it to a linear model: log μ = log β₀ + β₁x.
🔧 Basis Functions — The General Idea DEFINITION

A basis function φ_d(x) transforms the raw input x into a new feature. The mean trend becomes:

μₙ = β₀ + Σ_{d=1}^{D} β_d · φ_d(xₙ)

The design matrix, as we will see the following week, just replaces raw inputs with basis function evaluations:

Xₙ,d = φ_d(xₙ) for d ≥ 1
Xₙ,₀ = 1 (intercept column, always)
Once you build the design matrix this way, everything else is identical: the gradient, Hessian, MLE, posterior, covariance matrix — all the same formulas apply. Basis functions are a free extension of the linear model framework!
📊 Common Basis Expansions EXAMPLES
Name | Basis functions φ_d(x) | Use case
Polynomial | φ_d(x) = x^d for d = 1, 2, …, D | Smooth curves, peaks, valleys
Sine/Cosine | φ(x) = sin(x) or cos(x) | Periodic / oscillatory relationships
Splines | Piecewise polynomials joined at knots | Flexible smooth curves with local control
Kernels | φ(x) = K(x, cₖ) centred at cₖ | Radial basis / Gaussian process connections
Interaction | φ(x) = x₁ · x₂ | Effect of x₁ depends on x₂
📐 Polynomial Basis Example FORMULA

A 3rd-degree (cubic) polynomial model:

μₙ = β₀ + β₁xₙ + β₂xₙ² + β₃xₙ³

This looks like it uses 4 "inputs" but they are all derived from a single variable x. The design matrix becomes:

⎡ 1  x₁  x₁²  x₁³ ⎤
⎣ 1  x₂  x₂²  x₂³ ⎦
All the x², x³ values are calculated from the known data before fitting. The model never has to "learn" what x² means — it just sees them as four independent columns in X and fits linear coefficients to each.
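In R, the same model.matrix() approach from earlier builds this design matrix. A sketch with made-up data (the data frame and response are my own):

```r
# Sketch: cubic-polynomial design matrix via model.matrix() (toy data assumed)
train_df <- data.frame(x = seq(-2, 2, length.out = 9))
train_df$y <- 0.5 + train_df$x - 0.8 * train_df$x^3 + rnorm(9, 0, 0.1)

# I() protects the powers so the formula treats them as separate columns
Xmat <- model.matrix(y ~ x + I(x^2) + I(x^3), data = train_df)
dim(Xmat)        # 9 x 4: intercept, x, x^2, x^3
```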
🌊 Sine Basis Example WALKTHROUGH

The lecture demonstrates a sine model: μₙ = β₀ + β₁ sin(xₙ), true values β₀ = 0, β₁ = 1, σ = 0.15.

1
Define the basis: φ(x) = sin(x)
This is a known function of x. Compute sin(xₙ) for every observation.
2
Build the design matrix
Column 0: all 1s. Column 1: sin(xₙ) values.
3
Apply all the usual machinery
MLE: β̂ = (XᵀX)⁻¹Xᵀy. Posterior covariance: σ²(XᵀX)⁻¹. The model learns β₀ and β₁, which scale the sine wave's offset and amplitude.
With only 9 input locations, the Bayesian posterior still captures the sine shape well, thanks to the informative prior constraining the offset and amplitude.
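The walkthrough can be sketched as follows (data simulated here using the stated true values; input locations are my own choice):

```r
# Sketch of the sine-basis fit (true beta0 = 0, beta1 = 1, sigma = 0.15)
set.seed(4)
x <- seq(-2 * pi, 2 * pi, length.out = 9)   # 9 input locations
train_df <- data.frame(x = x,
                       y = 0 + 1 * sin(x) + rnorm(9, 0, 0.15))

# Column 0: all 1s. Column 1: sin(x) -- model.matrix applies the basis
Xmat <- model.matrix(y ~ sin(x), data = train_df)
beta_hat <- solve(t(Xmat) %*% Xmat, t(Xmat) %*% train_df$y)
beta_hat                                    # close to the true (0, 1)
```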
🔄 Handling a Non-Linear Model via Transformation INSIGHT

The one non-linear model in the lecture, μ = β₀ exp(β₁x), can be linearized by taking a log:

log μ = log β₀ + β₁ · x

Define: ỹ = log y (transformed response), β̃₀ = log β₀ (new intercept). Then:

ỹ = β̃₀ + β₁ · x
This is now a standard linear model on the log-transformed response! The same design matrix approach and all the Bayesian machinery apply directly.
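A sketch of the transformed fit (my own simulated data, with multiplicative log-normal noise so the log model is exact):

```r
# Sketch (synthetic data assumed): fit the linearized model on log(y)
set.seed(5)
x <- runif(30, 0, 2)
y <- 2 * exp(0.7 * x) * exp(rnorm(30, 0, 0.05))   # multiplicative noise, y > 0

fit <- lm(log(y) ~ x)                # ordinary linear model on log(y)
beta1_hat <- unname(coef(fit)["x"])  # estimates beta1 directly
beta0_hat <- exp(coef(fit)[[1]])     # back-transform the intercept
c(beta0_hat, beta1_hat)              # close to the true (2, 0.7)
```

Note this assumes the noise is multiplicative on the original scale; with additive noise the log transform changes the error structure.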
Important caveat: not every non-linear model can be rescued by a transformation. For example, the logistic growth model μ = L / (1 + exp(−β₁(x − β₀))) has parameters that appear in ways no algebraic transformation can untangle — β₀ shifts the inflection point and β₁ controls steepness, and both are buried inside a nested expression. Models like these must be handled with genuinely non-linear methods (e.g., non-linear least squares, or a Bayesian approach with numerical posterior approximation).