Week 7 derived the MLE for β and introduced the Bayesian posterior under the assumptions that σ is known and the prior is infinitely diffuse. Week 8 extends this in three directions:
The full probability model for a single-input linear model with Gaussian noise:
| Quantity | Formula | What it means |
|---|---|---|
| Posterior precision | V_N⁻¹ = B₀⁻¹ + (1/σ²) XᵀX | Prior precision + data precision |
| Posterior mean | m_N = V_N (B₀⁻¹μ₀ + (1/σ²) Xᵀy) | Precision-weighted average of prior and data |
| MLE for σ² | σ̂² = (1/N) (y − Xβ̂)ᵀ(y − Xβ̂) | Mean squared error (MSE) |
| Diffuse prior recovery | B₀⁻¹ → 0 ⟹ m_N → β̂_MLE | Diffuse prior collapses to MLE |
Week 7 used a nearly flat prior p(β) ∝ 1. Now we replace it with a proper Multivariate Normal (MVN) prior, β ~ MVN(μ₀, B₀), where μ₀ is the prior mean vector and B₀ is the prior covariance matrix. The key insight is:
The precision matrix is the inverse of the covariance matrix. The posterior precision V_N⁻¹ = B₀⁻¹ + (1/σ²) XᵀX combines two terms:

- **Prior precision B₀⁻¹**: encodes how confident we are in the prior. A tight prior (small variance in B₀) → large B₀⁻¹ → the prior contributes heavily.
- **Data precision (1/σ²) XᵀX**: grows with every observation added to XᵀX. Small noise σ → large 1/σ² → the data contribute heavily.
The posterior mean m_N is a precision-weighted average of the prior belief and what the data say:
Breaking this down:
| Term | Meaning |
|---|---|
| B₀⁻¹ μ₀ | Prior precision × prior mean — "the prior's vote" |
| (1/σ²) Xᵀy | Data precision × data summary — "the data's vote" |
| V_N = (V_N⁻¹)⁻¹ | Normalizing factor (inverse of total precision) |
As the prior becomes infinitely diffuse (B₀ → ∞·I), the prior precision B₀⁻¹ → 0. Plugging this in: V_N⁻¹ → (1/σ²) XᵀX, so m_N → σ²(XᵀX)⁻¹ · (1/σ²) Xᵀy = (XᵀX)⁻¹ Xᵀy = β̂_MLE.
If we assume all regression coefficients are a priori independent with the same prior standard deviation τ_β and the same prior mean μ_β, the prior covariance matrix becomes a diagonal matrix (off-diagonals = 0): B₀ = τ_β² I. Its inverse is trivially B₀⁻¹ = (1/τ_β²) I, so the posterior precision becomes V_N⁻¹ = (1/τ_β²) I + (1/σ²) XᵀX.
```r
# Given: Xmat (design matrix), y (response vector), sigma_true (known noise SD)

# Define prior: independent standard normals (τ_β = 1)
B0  <- diag(2)   # 2x2 identity = prior covariance with τ_β = 1
mu0 <- c(0, 0)   # prior mean vector

# Posterior precision
VN_inv <- solve(B0) + (t(Xmat) %*% Xmat) / sigma_true^2

# Posterior covariance
VN <- solve(VN_inv)

# Posterior mean
mN <- VN %*% (solve(B0) %*% mu0 + (t(Xmat) %*% y) / sigma_true^2)

# Posterior standard deviations (square root of the diagonal)
post_sd <- sqrt(diag(VN))

# Posterior correlation matrix
post_cor <- cov2cor(VN)
```
Note: Use drop=FALSE when subsetting rows from a matrix (e.g., Xmat[1, , drop=FALSE]) to keep the result as a matrix instead of a vector.
The design matrix X is an N × (D+1) matrix. Each row n is the n-th observation's input vector xₙᵀ, with a leading 1 encoding the intercept. The sum-of-squares (or "data precision") matrix is XᵀX = Σₙ xₙ xₙᵀ.
For the single-input case, each observation n contributes a 2×2 outer product xₙxₙᵀ, and each new observation adds its outer product to the running sum-of-squares:
| N | XᵀX contribution from obs N | Dimensions of XᵀX |
|---|---|---|
| 1 | [1, x₁,₁; x₁,₁, x₁,₁²] | 2 × 2 |
| 2 | + [1, x₂,₁; x₂,₁, x₂,₁²] | Still 2 × 2 |
| N | + [1, xₙ,₁; xₙ,₁, xₙ,₁²] | Always 2 × 2 |
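The running-sum view is easy to verify numerically. A minimal sketch with made-up input values (`X_toy` and its x values are illustrative, not from the lecture):

```r
# Toy single-input design matrix: intercept column plus illustrative x values
x_vals <- c(0.5, -1.2, 2.0)
X_toy  <- cbind(1, x_vals)

# Accumulate each observation's 2x2 outer product
XtX_sum <- matrix(0, 2, 2)
for (n in 1:nrow(X_toy)) {
  xn <- X_toy[n, , drop = FALSE]      # 1x2 row, kept as a matrix
  XtX_sum <- XtX_sum + t(xn) %*% xn   # add the 2x2 outer product
}

# Matches the one-shot sum-of-squares computation
all.equal(XtX_sum, t(X_toy) %*% X_toy, check.attributes = FALSE)
```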
```r
# model.matrix() creates the design matrix FROM THE FORMULA, not from the data columns.
# It automatically adds the intercept column of 1s.
Xmat <- model.matrix(y ~ x, data = train_df)

# train_df may have extra columns (obs_id, mu, etc.) — model.matrix ignores them!
# It only uses variables mentioned in the formula: y (response) and x (predictor).

# Access specific rows: use drop=FALSE to stay a matrix (not drop to a vector)
Xmat[1, , drop = FALSE]    # first row — still a 1×2 matrix
Xmat[1:2, , drop = FALSE]  # first 2 rows

# Sum-of-squares matrix for the first N observations
N   <- 10
XtX <- t(Xmat[1:N, , drop = FALSE]) %*% Xmat[1:N, , drop = FALSE]

# Posterior covariance (diffuse prior, known sigma)
post_cov_diffuse <- sigma_true^2 * solve(XtX)

# Posterior standard deviations
sqrt(diag(post_cov_diffuse))
```
| Prior Type | Prior SD (τ_β) | B₀ | Effect |
|---|---|---|---|
| Truly Diffuse | ∞ | ∞ · I | B₀⁻¹ = 0. Cannot compute posterior with N < D+1. Posterior = likelihood shape. |
| Vague / Weakly Informative | 20 | 400 · I | Almost no regularization. Posterior closely tracks likelihood. High posterior correlation after 1 obs (≈ −0.999). |
| Informative | 1 | I | Pulls parameters toward 0. Rules out extreme values. Posterior correlation more moderate (≈ −0.73). Works with N = 1 — because the prior precision B₀⁻¹ contributes to the precision matrix even before any data, effectively acting as "virtual" prior observations that make V_N⁻¹ invertible. |
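The "works with N = 1" claim is easy to check numerically. A minimal sketch, where the single input value and the unit noise SD are illustrative assumptions:

```r
x1 <- 1.5                          # one illustrative input value
X1 <- matrix(c(1, x1), nrow = 1)   # 1x2 design matrix: intercept + x
sigma <- 1                         # illustrative known noise SD

XtX <- t(X1) %*% X1      # rank-1 outer product
det(XtX)                 # 0: singular, so the diffuse-prior posterior
                         # covariance sigma^2 * solve(XtX) does not exist

# An informative prior (B0 = I) adds prior precision, making V_N invertible
VN_inv <- diag(2) + XtX / sigma^2
VN <- solve(VN_inv)      # well-defined posterior covariance with N = 1
cov2cor(VN)[1, 2]        # negative posterior correlation, strictly above -1
```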
The lecture shows the joint posterior contour over (β₁, β₀) as observations are added one at a time. Use the slider to step through N = 0 to 10 and toggle between prior types.
With a diffuse prior and only 1 observation, the posterior is essentially just the likelihood of that one data point, which is maximized wherever β₀ + β₁x₁,₁ = y₁.
This is the equation of a line in (β₀, β₁) space. Any point on that line gives the same predicted mean. Increasing β₁ while decreasing β₀ keeps μ₁ constant — hence the diagonal contour lines.
**Informative prior:**

- ✅ Rules out extreme (β₀, β₁) combinations
- ✅ Posterior is identifiable from 1 observation
- ✅ Reduces posterior correlation
- ✅ Acts as regularization (like Ridge)
- ⚠️ May bias posterior if prior is wrong
**Diffuse prior:**

- ✅ Lets the data speak
- ✅ Posterior ≈ likelihood with sufficient data
- ⚠️ High posterior correlation with small N
- ⚠️ Cannot invert XᵀX with N < D+1 if truly flat
- ⚠️ Slow to locate the posterior mode
Everything covered so far assumed σ is known. When σ is also unknown, we have two options:
1. **Maximum likelihood:** treat σ as a parameter and maximize the log-likelihood jointly over β and σ. This gives a closed-form MLE for σ².
2. **Fully Bayesian:** place a prior on σ. The posterior is now the joint posterior p(β, σ | y, X). In general, this does not have a simple closed form.
Starting from the log-likelihood (keeping the σ terms this time):
Evaluating at the MLE β̂, then taking d/d(σ²) = 0 and solving:
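For reference, the standard Gaussian derivation written out — the log-likelihood, then the stationarity condition in σ²:

```latex
\log p(y \mid X, \beta, \sigma)
  = -\frac{N}{2}\log(2\pi\sigma^2)
  - \frac{1}{2\sigma^2}\,(y - X\beta)^\top (y - X\beta)

\frac{\partial \log p}{\partial \sigma^2}\bigg|_{\beta = \hat\beta}
  = -\frac{N}{2\sigma^2}
  + \frac{1}{2\sigma^4}\,(y - X\hat\beta)^\top (y - X\hat\beta) = 0
  \quad\Longrightarrow\quad
  \hat\sigma^2 = \frac{1}{N}\,(y - X\hat\beta)^\top (y - X\hat\beta)
```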
In the Bayesian framework, σ is just another unknown parameter. Assuming σ and β are a priori independent, the joint prior factorizes: p(β, σ) = p(β) p(σ). A common choice of prior for σ (a positive parameter) is the Exponential distribution: σ ~ Exponential(λ), with density p(σ) = λ e^(−λσ) for σ > 0.
The full probability model with unknown σ and informative priors:
The log-posterior is: log p(β, σ | y, X) ∝ log p(y|X,β,σ) + log p(β₀) + log p(β₁) + log p(σ)
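That sum translates directly into R. In the sketch below, the prior hyperparameters (μ_β = 0, τ_β = 1, Exponential rate λ = 1) and the simulated data are illustrative choices, not values from the lecture:

```r
# Unnormalized log-posterior for (beta0, beta1, sigma) with independent priors.
# Prior hyperparameters are illustrative assumptions.
log_post <- function(beta0, beta1, sigma, x, y,
                     mu_beta = 0, tau_beta = 1, lambda = 1) {
  if (sigma <= 0) return(-Inf)                       # sigma must be positive
  mu <- beta0 + beta1 * x                            # mean trend
  sum(dnorm(y, mean = mu, sd = sigma, log = TRUE)) + # log-likelihood
    dnorm(beta0, mu_beta, tau_beta, log = TRUE) +    # log p(beta0)
    dnorm(beta1, mu_beta, tau_beta, log = TRUE) +    # log p(beta1)
    dexp(sigma, rate = lambda, log = TRUE)           # log p(sigma)
}

# Example: evaluate at two candidate parameter settings
set.seed(1)
x <- runif(20, -2, 2)
y <- 0.5 + 1.2 * x + rnorm(20, sd = 0.3)
log_post(0.5, 1.2, 0.3, x, y)   # near the truth: higher log-posterior
log_post(5, -3, 0.3, x, y)      # far from the truth: much lower
```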
A model is called linear not because of the relationship between y and x, but because of the relationship between y and the unknown parameters β.
μₙ = Σ_d β_d φ_d(x) where each φ_d is a known function of x, then the model is linear in the parameters — and all the Bayesian/MLE machinery applies directly.
| Model | Linear? | Why |
|---|---|---|
| μ = β₀ + β₁x | ✓ Yes | Basic case |
| μ = β₀ + β₁x₁ + β₂x₂ | ✓ Yes | Multiple inputs, still linear in β |
| μ = β₀ + β₁x₁ + β₂x₁² | ✓ Yes | x₁² is a known function of x — define φ₂(x) = x² |
| μ = β₀ + β₁x₁x₂ | ✓ Yes | Interaction term — x₁x₂ is a known function |
| μ = β₀ + β₁ sin(x) | ✓ Yes | sin(x) is a known function of x — define φ₁(x) = sin(x) |
| μ = β₀ exp(β₁x) | ✗ No | β₁ appears inside the exponent — non-linear in parameters |
The model μ = β₀ exp(β₁x) is non-linear because β₁ multiplies x inside a non-linear function. However, taking a log transforms it to a linear model: log μ = log β₀ + β₁x.
A basis function φ_d(x) transforms the raw input x into a new feature. The mean trend becomes μₙ = Σ_d β_d φ_d(xₙ).
The design matrix, as we will see in the following week, simply replaces raw inputs with basis function evaluations:
| Name | Basis Functions φ_d(x) | Use Case |
|---|---|---|
| Polynomial | φ_d(x) = x^d for d = 1, 2, …, D | Smooth curves, peaks, valleys |
| Sine/Cosine | φ(x) = sin(x) or cos(x) | Periodic / oscillatory relationships |
| Splines | Piecewise polynomials at knots | Flexible smooth curves with local control |
| Kernels | φ(x) = K(x, cₖ) centred at cₖ | Radial basis / Gaussian process connections |
| Interaction | φ(x) = x₁ · x₂ | Effect of x₁ depends on x₂ |
A 3rd-degree (cubic) polynomial model: μₙ = β₀ + β₁xₙ + β₂xₙ² + β₃xₙ³.
This looks like it uses 4 "inputs" but they are all derived from a single variable x. The design matrix becomes:
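A minimal R sketch of building that cubic design matrix (the data frame `poly_df` and its values are illustrative):

```r
# Toy data: a single raw input x (values are illustrative)
poly_df <- data.frame(x = seq(-1, 1, length.out = 5))
poly_df$y <- rnorm(5)   # placeholder response so the formula has a left-hand side

# I() protects the powers, so the formula builds x, x^2, x^3 columns
Xmat_cubic <- model.matrix(y ~ x + I(x^2) + I(x^3), data = poly_df)

dim(Xmat_cubic)   # 5 x 4: intercept plus three columns derived from one input
```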
The lecture demonstrates a sine model: μₙ = β₀ + β₁ sin(xₙ), true values β₀ = 0, β₁ = 1, σ = 0.15.
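Because sin(x) is a known function, this model fits with ordinary `lm()`. A sketch simulating data under the stated true values (the sample size and x range are illustrative assumptions):

```r
set.seed(42)
x <- runif(50, 0, 2 * pi)
y <- 0 + 1 * sin(x) + rnorm(50, sd = 0.15)   # true beta0 = 0, beta1 = 1, sigma = 0.15

# Linear in the parameters: regress y on the basis function sin(x)
fit_sine <- lm(y ~ sin(x))
coef(fit_sine)   # estimates should land near (0, 1)
```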
The one non-linear model in the lecture, μ = β₀ exp(β₁x), can be linearized by taking a log:
Define ỹ = log y (transformed response) and β̃₀ = log β₀ (new intercept). Then the model is linear in the parameters: ỹ = β̃₀ + β₁x.
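A sketch of the log-transform fit. The simulated data use multiplicative noise (so the log-scale errors are Gaussian), and the true parameter values are illustrative:

```r
set.seed(7)
x <- runif(40, 0, 2)
y <- 2 * exp(0.8 * x) * exp(rnorm(40, sd = 0.1))   # true beta0 = 2, beta1 = 0.8

# Linear model on the log scale: log y = log(beta0) + beta1 * x
fit_log <- lm(log(y) ~ x)
exp(coef(fit_log)[1])   # back-transform the intercept: estimate of beta0
coef(fit_log)[2]        # estimate of beta1
```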
A logistic curve, μ = L / (1 + exp(−β₁(x − β₀))), has parameters that appear in ways no algebraic transformation can untangle: β₀ shifts the inflection point and β₁ controls steepness, and both are buried inside a nested expression. Models like these must be handled with genuinely non-linear methods (e.g., non-linear least squares, or a Bayesian approach with numerical posterior approximation).