Week 7 derived the MLE for β and introduced the Bayesian posterior under the assumptions that σ is known and the prior is infinitely diffuse. Week 8 extends this in three directions:
The full probability model for a single-input linear model with Gaussian noise:
| Quantity | Formula | What it means |
|---|---|---|
| Posterior precision | V_N⁻¹ = B₀⁻¹ + (1/σ²) XᵀX | Prior precision + data precision |
| Posterior mean | m_N = V_N (B₀⁻¹μ₀ + (1/σ²) Xᵀy) | Precision-weighted average of prior and data |
| MLE for σ² | σ̂² = (1/N) (y − Xβ̂)ᵀ(y − Xβ̂) | Mean squared error (MSE) |
| Diffuse prior recovery | B₀⁻¹ → 0 ⟹ m_N → β̂_MLE | Diffuse prior collapses to MLE |
Week 7 used a nearly flat prior p(β) ∝ 1. Now we replace it with a proper Multivariate Normal (MVN) prior, β ~ MVN(μ₀, B₀), where μ₀ is the prior mean vector and B₀ is the prior covariance matrix. The key insight is:
The precision matrix is the inverse of the covariance matrix. The posterior precision V_N⁻¹ = B₀⁻¹ + (1/σ²) XᵀX combines two terms:

- **Prior precision B₀⁻¹**: encodes how confident we are in the prior. A tight prior (small variance in B₀) → large B₀⁻¹ → the prior contributes heavily.
- **Data precision (1/σ²) XᵀX**: grows with every observation added to XᵀX. Small noise σ → large 1/σ² → the data contribute heavily.
The posterior mean m_N is a precision-weighted average of the prior belief and what the data say:
Breaking this down:
| Term | Meaning |
|---|---|
| B₀⁻¹ μ₀ | Prior precision × prior mean — "the prior's vote" |
| (1/σ²) Xᵀy | Data precision × data summary — "the data's vote" |
| V_N = (V_N⁻¹)⁻¹ | Normalizing factor (inverse of total precision) |
As the prior becomes infinitely diffuse (B₀ → ∞·I), the prior precision B₀⁻¹ → 0. Plugging this in: V_N⁻¹ → (1/σ²) XᵀX, so m_N → σ²(XᵀX)⁻¹ · (1/σ²) Xᵀy = (XᵀX)⁻¹ Xᵀy = β̂_MLE.
If we assume all regression coefficients are a priori independent with the same prior standard deviation τ_β and the same prior mean μ_β, the prior covariance matrix becomes a diagonal matrix (off-diagonals = 0): B₀ = τ_β² I. Its inverse is trivially B₀⁻¹ = (1/τ_β²) I, so the posterior precision becomes V_N⁻¹ = (1/τ_β²) I + (1/σ²) XᵀX.
```r
# Given: Xmat (design matrix), y (response vector), sigma_true (known noise SD)

# Define prior: independent standard normals (τ_β = 1)
B0  <- diag(2)   # 2x2 identity = prior covariance with τ_β = 1
mu0 <- c(0, 0)   # prior mean vector

# Posterior precision
VN_inv <- solve(B0) + (t(Xmat) %*% Xmat) / sigma_true^2

# Posterior covariance
VN <- solve(VN_inv)

# Posterior mean
mN <- VN %*% (solve(B0) %*% mu0 + (t(Xmat) %*% y) / sigma_true^2)

# Posterior standard deviations (square root of the diagonal)
post_sd <- sqrt(diag(VN))

# Posterior correlation matrix
post_cor <- cov2cor(VN)
```
Note: Use drop=FALSE when subsetting rows from a matrix (e.g., Xmat[1, , drop=FALSE]) to keep the result as a matrix instead of a vector.
The design matrix X is an N × (D+1) matrix. Each row n is the n-th observation's input vector xₙᵀ, with a leading 1 encoding the intercept. The sum-of-squares (or "data precision") matrix is XᵀX = Σₙ xₙ xₙᵀ.
For the single-input case, each observation n contributes a 2×2 outer product xₙxₙᵀ, and each new observation adds its outer product to the running sum-of-squares:
| N | XᵀX contribution from obs N | Dimensions of XᵀX |
|---|---|---|
| 1 | [1, x₁,₁; x₁,₁, x₁,₁²] | 2 × 2 |
| 2 | + [1, x₂,₁; x₂,₁, x₂,₁²] | Still 2 × 2 |
| N | + [1, xₙ,₁; xₙ,₁, xₙ,₁²] | Always 2 × 2 |
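The running-sum view is easy to verify numerically. A minimal sketch with made-up input values (`X_toy` and its x values are illustrative, not from the lecture):

```r
# Toy single-input design matrix: intercept column plus illustrative x values
x_vals <- c(0.5, -1.2, 2.0)
X_toy  <- cbind(1, x_vals)

# Accumulate each observation's 2x2 outer product
XtX_sum <- matrix(0, 2, 2)
for (n in 1:nrow(X_toy)) {
  xn <- X_toy[n, , drop = FALSE]      # 1x2 row, kept as a matrix
  XtX_sum <- XtX_sum + t(xn) %*% xn   # add the 2x2 outer product
}

# Matches the one-shot sum-of-squares computation
all.equal(XtX_sum, t(X_toy) %*% X_toy, check.attributes = FALSE)
```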
```r
# model.matrix() creates the design matrix FROM THE FORMULA, not from the data columns.
# It automatically adds the intercept column of 1s.
Xmat <- model.matrix(y ~ x, data = train_df)

# train_df may have extra columns (obs_id, mu, etc.) — model.matrix ignores them!
# It only uses variables mentioned in the formula: y (response) and x (predictor).

# Access specific rows: use drop=FALSE to stay a matrix (not drop to a vector)
Xmat[1, , drop = FALSE]    # first row — still a 1×2 matrix
Xmat[1:2, , drop = FALSE]  # first 2 rows

# Sum-of-squares matrix for the first N observations
N   <- 10
XtX <- t(Xmat[1:N, , drop = FALSE]) %*% Xmat[1:N, , drop = FALSE]

# Posterior covariance (diffuse prior, known sigma)
post_cov_diffuse <- sigma_true^2 * solve(XtX)

# Posterior standard deviations
sqrt(diag(post_cov_diffuse))
```
| Prior Type | Prior SD (τ_β) | B₀ | Effect |
|---|---|---|---|
| Truly Diffuse | ∞ | ∞ · I | B₀⁻¹ = 0. Cannot compute posterior with N < D+1. Posterior = likelihood shape. |
| Vague / Weakly Informative | 20 | 400 · I | Almost no regularization. Posterior closely tracks likelihood. High posterior correlation after 1 obs (≈ −0.999). |
| Informative | 1 | I | Pulls parameters toward 0. Rules out extreme values. Posterior correlation more moderate (≈ −0.73). Works with N = 1 — because the prior precision B₀⁻¹ contributes to the precision matrix even before any data, effectively acting as "virtual" prior observations that make V_N⁻¹ invertible. |
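The "works with N = 1" claim is easy to check numerically. A minimal sketch, where the single input value and the unit noise SD are illustrative assumptions:

```r
x1 <- 1.5                          # one illustrative input value
X1 <- matrix(c(1, x1), nrow = 1)   # 1x2 design matrix: intercept + x
sigma <- 1                         # illustrative known noise SD

XtX <- t(X1) %*% X1      # rank-1 outer product
det(XtX)                 # 0: singular, so the diffuse-prior posterior
                         # covariance sigma^2 * solve(XtX) does not exist

# An informative prior (B0 = I) adds prior precision, making V_N invertible
VN_inv <- diag(2) + XtX / sigma^2
VN <- solve(VN_inv)      # well-defined posterior covariance with N = 1
cov2cor(VN)[1, 2]        # negative posterior correlation, strictly above -1
```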
The lecture shows the joint posterior contour over (β₁, β₀) as observations are added one at a time. Use the slider to step through N = 0 to 10 and toggle between prior types.
With a diffuse prior and only 1 observation, the posterior is essentially just the likelihood of that one data point, which is maximized wherever β₀ + β₁x₁,₁ = y₁.
This is the equation of a line in (β₀, β₁) space. Any point on that line gives the same predicted mean. Increasing β₁ while decreasing β₀ keeps μ₁ constant — hence the diagonal contour lines.
**Informative prior:**

- ✅ Rules out extreme (β₀, β₁) combinations
- ✅ Posterior is identifiable from 1 observation
- ✅ Reduces posterior correlation
- ✅ Acts as regularization (like Ridge)
- ⚠️ May bias posterior if prior is wrong
**Diffuse prior:**

- ✅ Lets the data speak
- ✅ Posterior ≈ likelihood with sufficient data
- ⚠️ High posterior correlation with small N
- ⚠️ Cannot invert XᵀX with N < D+1 if truly flat
- ⚠️ Slow to locate the posterior mode
Everything covered so far assumed σ is known. When σ is also unknown, we have two options:
1. **Maximum likelihood:** treat σ as a parameter and maximize the log-likelihood jointly over β and σ. This gives a closed-form MLE for σ².
2. **Fully Bayesian:** place a prior on σ. The posterior is now the joint posterior p(β, σ | y, X). In general, this does not have a simple closed form.
Starting from the log-likelihood (keeping the σ terms this time):
Evaluating at the MLE β̂, then taking d/d(σ²) = 0 and solving:
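For reference, the standard Gaussian derivation written out — the log-likelihood, then the stationarity condition in σ²:

```latex
\log p(y \mid X, \beta, \sigma)
  = -\frac{N}{2}\log(2\pi\sigma^2)
  - \frac{1}{2\sigma^2}\,(y - X\beta)^\top (y - X\beta)

\frac{\partial \log p}{\partial \sigma^2}\bigg|_{\beta = \hat\beta}
  = -\frac{N}{2\sigma^2}
  + \frac{1}{2\sigma^4}\,(y - X\hat\beta)^\top (y - X\hat\beta) = 0
  \quad\Longrightarrow\quad
  \hat\sigma^2 = \frac{1}{N}\,(y - X\hat\beta)^\top (y - X\hat\beta)
```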
In the Bayesian framework, σ is just another unknown parameter. Assuming σ and β are a priori independent, the joint prior factorizes: p(β, σ) = p(β) p(σ). A common choice of prior for σ (a positive parameter) is the Exponential distribution: σ ~ Exponential(λ), with density p(σ) = λ e^(−λσ) for σ > 0.
The full probability model with unknown σ and informative priors:
The log-posterior is: log p(β, σ | y, X) ∝ log p(y|X,β,σ) + log p(β₀) + log p(β₁) + log p(σ)
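That sum translates directly into R. In the sketch below, the prior hyperparameters (μ_β = 0, τ_β = 1, Exponential rate λ = 1) and the simulated data are illustrative choices, not values from the lecture:

```r
# Unnormalized log-posterior for (beta0, beta1, sigma) with independent priors.
# Prior hyperparameters are illustrative assumptions.
log_post <- function(beta0, beta1, sigma, x, y,
                     mu_beta = 0, tau_beta = 1, lambda = 1) {
  if (sigma <= 0) return(-Inf)                       # sigma must be positive
  mu <- beta0 + beta1 * x                            # mean trend
  sum(dnorm(y, mean = mu, sd = sigma, log = TRUE)) + # log-likelihood
    dnorm(beta0, mu_beta, tau_beta, log = TRUE) +    # log p(beta0)
    dnorm(beta1, mu_beta, tau_beta, log = TRUE) +    # log p(beta1)
    dexp(sigma, rate = lambda, log = TRUE)           # log p(sigma)
}

# Example: evaluate at two candidate parameter settings
set.seed(1)
x <- runif(20, -2, 2)
y <- 0.5 + 1.2 * x + rnorm(20, sd = 0.3)
log_post(0.5, 1.2, 0.3, x, y)   # near the truth: higher log-posterior
log_post(5, -3, 0.3, x, y)      # far from the truth: much lower
```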
A model is called linear not because of the relationship between y and x, but because of the relationship between y and the unknown parameters β.
μₙ = Σ_d β_d φ_d(x) where each φ_d is a known function of x, then the model is linear in the parameters — and all the Bayesian/MLE machinery applies directly.
| Model | Linear? | Why |
|---|---|---|
| μ = β₀ + β₁x | ✓ Yes | Basic case |
| μ = β₀ + β₁x₁ + β₂x₂ | ✓ Yes | Multiple inputs, still linear in β |
| μ = β₀ + β₁x₁ + β₂x₁² | ✓ Yes | x₁² is a known function of x — define φ₂(x) = x² |
| μ = β₀ + β₁x₁x₂ | ✓ Yes | Interaction term — x₁x₂ is a known function |
| μ = β₀ + β₁ sin(x) | ✓ Yes | sin(x) is a known function of x — define φ₁(x) = sin(x) |
| μ = β₀ exp(β₁x) | ✗ No | β₁ appears inside the exponent — non-linear in parameters |
The model μ = β₀ exp(β₁x) is non-linear because β₁ multiplies x inside a non-linear function. However, taking a log transforms it to a linear model: log μ = log β₀ + β₁x.
A basis function φ_d(x) transforms the raw input x into a new feature. The mean trend becomes μₙ = Σ_d β_d φ_d(xₙ).
The design matrix, as we will see in the following week, simply replaces raw inputs with basis function evaluations:
| Name | Basis Functions φ_d(x) | Use Case |
|---|---|---|
| Polynomial | φ_d(x) = x^d for d = 1, 2, …, D | Smooth curves, peaks, valleys |
| Sine/Cosine | φ(x) = sin(x) or cos(x) | Periodic / oscillatory relationships |
| Splines | Piecewise polynomials at knots | Flexible smooth curves with local control |
| Kernels | φ(x) = K(x, cₖ) centred at cₖ | Radial basis / Gaussian process connections |
| Interaction | φ(x) = x₁ · x₂ | Effect of x₁ depends on x₂ |
A 3rd-degree (cubic) polynomial model: μₙ = β₀ + β₁xₙ + β₂xₙ² + β₃xₙ³.
This looks like it uses 4 "inputs" but they are all derived from a single variable x. The design matrix becomes:
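A minimal R sketch of building that cubic design matrix (the data frame `poly_df` and its values are illustrative):

```r
# Toy data: a single raw input x (values are illustrative)
poly_df <- data.frame(x = seq(-1, 1, length.out = 5))
poly_df$y <- rnorm(5)   # placeholder response so the formula has a left-hand side

# I() protects the powers, so the formula builds x, x^2, x^3 columns
Xmat_cubic <- model.matrix(y ~ x + I(x^2) + I(x^3), data = poly_df)

dim(Xmat_cubic)   # 5 x 4: intercept plus three columns derived from one input
```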
The lecture demonstrates a sine model: μₙ = β₀ + β₁ sin(xₙ), true values β₀ = 0, β₁ = 1, σ = 0.15.
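Because sin(x) is a known function, this model fits with ordinary `lm()`. A sketch simulating data under the stated true values (the sample size and x range are illustrative assumptions):

```r
set.seed(42)
x <- runif(50, 0, 2 * pi)
y <- 0 + 1 * sin(x) + rnorm(50, sd = 0.15)   # true beta0 = 0, beta1 = 1, sigma = 0.15

# Linear in the parameters: regress y on the basis function sin(x)
fit_sine <- lm(y ~ sin(x))
coef(fit_sine)   # estimates should land near (0, 1)
```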
The one non-linear model in the lecture, μ = β₀ exp(β₁x), can be linearized by taking a log:
Define ỹ = log y (transformed response) and β̃₀ = log β₀ (new intercept). Then the model is linear in the parameters: ỹ = β̃₀ + β₁x.
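A sketch of the log-transform fit. The simulated data use multiplicative noise (so the log-scale errors are Gaussian), and the true parameter values are illustrative:

```r
set.seed(7)
x <- runif(40, 0, 2)
y <- 2 * exp(0.8 * x) * exp(rnorm(40, sd = 0.1))   # true beta0 = 2, beta1 = 0.8

# Linear model on the log scale: log y = log(beta0) + beta1 * x
fit_log <- lm(log(y) ~ x)
exp(coef(fit_log)[1])   # back-transform the intercept: estimate of beta0
coef(fit_log)[2]        # estimate of beta1
```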
A logistic curve, μ = L / (1 + exp(−β₁(x − β₀))), has parameters that appear in ways no algebraic transformation can untangle: β₀ shifts the inflection point and β₁ controls steepness, and both are buried inside a nested expression. Models like these must be handled with genuinely non-linear methods (e.g., non-linear least squares, or a Bayesian approach with numerical posterior approximation).