INFSCI 2595 · Fall 2025 · Week 10

Model Complexity & Information Criteria

Confidence vs. Prediction Intervals · BIC · AIC · Bayesian Evidence · Categorical Features
Week 10 at a Glance
Two focused topics: the geometry of uncertainty, and penalizing model complexity.
Topic 1 The Two Kinds of Uncertainty

A Bayesian model contains two sources of uncertainty: uncertainty about the mean trend (captured by the confidence/credible interval) and uncertainty about individual responses (captured by the prediction interval). Confusing these is one of the most common mistakes in regression.

Topic 2 Penalizing Complexity

Training-set performance always favors more complex models. BIC and AIC approximate how a model will perform on new data by subtracting a penalty proportional to the number of parameters — so complexity must earn its keep.

The Big Picture: Why Both Topics Arise Together

We are fitting natural spline models of varying complexity (1–30 DOF). As complexity increases:

As DOF ↑ | Mean Trend | Noise σ | Confidence Interval | Prediction Interval
Low DOF | Smooth, stable | High (under-fits) | Narrow | Wide (noise dominates)
Mid DOF | Balanced | Near true σ | Moderate | Moderate
High DOF | Wiggly, uncertain | Low (over-fits) | Wide (β uncertain) | Wide (CI takes large fraction)
💡
BIC/AIC give us a way to automatically identify the right complexity level — the 8 DOF spline in the lecture example — without needing to look at any test data.
Confidence vs. Prediction Intervals
Both are uncertainty ribbons, but they answer fundamentally different questions.
Confidence / Credible Interval

Uncertainty about where the mean trend μ is.

  • Comes from uncertainty in the β-parameters
  • We draw S = 10⁴ β samples — collect them as B [S × J]. The prediction design matrix Φ* [M × J] encodes the M new input points in basis space. Then U* = Φ* Bᵀ [M × S] — column s is one full mean trend curve
  • Summarise with the 5th–95th percentile of all mean trend curves
  • Widens when β is uncertain (complex models, collinear inputs)
  • Even if we knew σ exactly, this interval would still exist
Prediction Interval

Uncertainty about the value of a new observed response y.

  • Comes from β uncertainty and noise σ in the likelihood
  • yn | μn, σ ~ Normal(μn, σ) — response scatters around the mean
  • Generated by drawing an [M × S] matrix Z of i.i.d. N(0,1) values and adding σ_s × Z[m,s] to each mean trend entry U*[m,s]
  • Always wider than the confidence interval
  • Even if we knew β perfectly, this interval would still exist
Formula How each interval is computed
Step 1 – set up the matrices, then compute mean trend samples (= the confidence interval)
Φ* : [M × J] — prediction design matrix  (M test points, J basis columns)
B  : [S × J] — posterior β samples     (S draws, each a J-dim row vector)

U* = Φ* × Bᵀ  : [M × S] — column s is the full mean trend curve for sample s
Step 2 – draw a standard-normal matrix, scale by σ, add to U* (= the prediction interval)
Z  : [M × S] — i.i.d. draws, Z[m,s] ~ Normal(0, 1)

Y*[m, s] = U*[m, s]  +  σ_s × Z[m, s]
σ_s is the s-th posterior noise sample; Z is drawn independently of everything else
The prediction interval is the CI plus the noise layer. It can never be narrower than the CI.
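The two steps above can be sketched numerically. This is a minimal NumPy illustration (Python rather than the lecture's R) with made-up dimensions and stand-in posterior samples — the shapes and quantile logic, not the numbers, are the point:

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up dimensions for illustration (not the lecture's data)
M, J, S = 50, 5, 10_000                # test points, basis columns, posterior draws

Phi_star = rng.normal(size=(M, J))     # stand-in prediction design matrix [M x J]
B = rng.normal(size=(S, J))            # stand-in posterior beta samples   [S x J]
sigma = np.abs(rng.normal(1.0, 0.1, size=S))   # stand-in posterior sigma samples [S]

# Step 1: mean trend samples -- column s is one full mean trend curve
U_star = Phi_star @ B.T                # [M x S]

# Confidence (credible) interval: 5th-95th percentiles of the mean trend samples
ci_lo, ci_hi = np.percentile(U_star, [5, 95], axis=1)

# Step 2: the z-score trick -- add the noise layer
Z = rng.standard_normal((M, S))        # i.i.d. N(0, 1) draws
Y_star = U_star + sigma * Z            # sigma[s] scales column s (broadcast on last axis)

# Prediction interval: 5th-95th percentiles of the posterior predictive samples
pi_lo, pi_hi = np.percentile(Y_star, [5, 95], axis=1)

# The PI is the CI plus the noise layer, so on average it must be wider
assert (pi_hi - pi_lo).mean() > (ci_hi - ci_lo).mean()
```

With real posterior samples, only `Phi_star`, `B`, and `sigma` change; the quantile computation is identical.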
Intuition How the two intervals behave across model complexity
🎯
Simple model (e.g., 4 DOF spline):
The mean trend is constrained and stable — very little variance across β samples. The confidence interval is narrow. However, the model can't fit the true shape, so the residuals are large → σ is over-estimated → prediction interval is very wide. The PI is dominated by noise, not parameter uncertainty.
[Diagram] Low DOF → σ dominates → PI (wide) ≫ CI (narrow)
⚖️
Balanced model (e.g., 8–9 DOF spline):
The mean trend captures the true signal — σ shrinks toward its true value. The confidence interval moderately widens (more β parameters, more uncertainty), but the prediction interval also narrows because σ is small. Both intervals are proportional and interpretable. This is the "sweet spot" the information criteria identify.
[Diagram] Balanced DOF → CI and PI proportional
🌊
Complex model (e.g., 25 DOF spline):
The model "uses" many β parameters to chase noise. σ shrinks (noise is absorbed into the mean), but the β parameters become very uncertain → confidence interval widens dramatically. The prediction interval is wide again, but this time the CI takes up a large fraction of the PI — parameter uncertainty, not noise, is the main driver.
High DOF → CI takes up most of PI
Key Distinction Frequentist vs. Bayesian Language
Term | Framework | Meaning
Confidence Interval | Frequentist | 95% of similarly constructed intervals would contain the true parameter. Does NOT mean "95% probability the parameter is in here."
Credible Interval | Bayesian | The posterior probability that the parameter lies in this range IS 95%. A direct probability statement about the parameter.
Prediction Interval | Both | Interval expected to contain a new, unobserved response value with a specified probability. Always wider than the CI/credible interval.
⚠️
In the lecture, the term "confidence interval" is used loosely for the Bayesian uncertainty band on the mean trend. In strict Bayesian language, this is a credible interval — it really does give a probability statement about where μ lies.
BIC & AIC
Information criteria — approximations for how well a model will generalise to new data.
Motivation Why Can't We Just Use Training Set Performance?
  • Higher complexity → better training fit — always. The 25 DOF spline will win every training metric.
  • Training R² and RMSE are biased estimators of generalization performance.
  • We need a metric that penalizes using extra parameters.
The ideal tool is the Bayesian Evidence (Marginal Likelihood): p(y|Φ) = ∫ p(y|Φ,θ) p(θ) dθ. It automatically balances fit against complexity. But it's intractable for most models — so we approximate it.
Derivation From Laplace Approximation → BIC

The Laplace Approximation provides an estimate to the log Evidence:

Laplace log-Evidence estimate
log p(y|Φ) ≈ log p(y|Φ, θ̂) - ½ log|H(θ̂)| + log p(θ̂) + P/2 · log(2π)
Term | What it represents | Effect
log p(y|Φ, θ̂) | Log-likelihood at the posterior mode | ↑ as data fit improves
−½ log|H(θ̂)| | Log-determinant of the Hessian (curvature) | Penalty — more parameters → larger Hessian → smaller Evidence
log p(θ̂) | Log-prior at the posterior mode | ↑ when the posterior mode is consistent with the prior

If we assume a diffuse prior and approximate log|H| ≈ P log(N) + const, the expression simplifies to:

Simplified log-Evidence (basis for BIC)
log p(y|Φ) ≈ log p(y|Φ, θ̂) − ½ P log(N)
Formulas BIC and AIC — Standard Definitions
📌
The standard definitions negate the above (multiply by −2), so that lower BIC/AIC = better model.
Bayesian Information Criterion (BIC)
BIC = P × log(N) − 2 × log p(y|Φ, θ̂)
Akaike Information Criterion (AIC)
AIC = 2P − 2 × log p(y|Φ, θ̂)
Symbol | Meaning
P | Number of free parameters in the model (e.g., J + 2 for a J-DOF spline: J + 1 β's plus σ)
N | Number of training observations
log p(y|Φ, θ̂) | Log-likelihood evaluated at the MLE / posterior mode — measures training fit
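As a concrete illustration, here is a small Python sketch that computes BIC and AIC from the Gaussian log-likelihood of a least-squares fit. The straight-line data set is made up for the example; the formulas match the definitions above:

```python
import numpy as np

def gaussian_loglik(y, mu, sigma):
    """Gaussian log-likelihood of responses y around fitted means mu."""
    n = y.size
    return -0.5 * n * np.log(2 * np.pi * sigma**2) - np.sum((y - mu)**2) / (2 * sigma**2)

def bic(loglik, P, N):
    return P * np.log(N) - 2 * loglik

def aic(loglik, P):
    return 2 * P - 2 * loglik

# Made-up example: straight-line fit to simulated data
rng = np.random.default_rng(1)
N = 100
x = np.linspace(0, 1, N)
y = 1.0 + 2.0 * x + rng.normal(0, 0.3, N)

X = np.column_stack([np.ones(N), x])            # design matrix: intercept + slope
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
mu_hat = X @ beta_hat
sigma_hat = np.sqrt(np.mean((y - mu_hat)**2))   # MLE of the noise sd

P = X.shape[1] + 1                              # two betas plus sigma
ll = gaussian_loglik(y, mu_hat, sigma_hat)
print('BIC:', bic(ll, P, N), 'AIC:', aic(ll, P))   # lower = better
```

With N = 100 here, log(N) ≈ 4.6 > 2, so BIC penalizes each parameter more harshly than AIC — exactly the contrast tabulated below.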
Key Differences BIC vs. AIC
BIC
  • Penalty = P × log(N)
  • Penalty grows with sample size
  • For N ≥ 8, BIC penalizes more harshly than AIC (since log(N) > 2)
  • Interpreted as finding the true model in the candidate set (model selection consistency)
  • Derived from Bayesian Evidence (Laplace Approx.)
AIC
  • Penalty = 2P
  • Penalty fixed — does not depend on N
  • Less harsh penalty for additional parameters
  • Interpreted as minimizing prediction error on new data (asymptotic efficiency)
  • Derived from information theory (Kullback-Leibler divergence)
🧠
The lecture notes: "Even though BIC has 'Bayesian' in the name, it's not actually Bayesian" — because the prior was dropped in the simplification. It is nonetheless a very practical and widely used information criterion.
In Practice Posterior Model Weights

When comparing K candidate models using the Laplace log-Evidence (not just BIC), compute posterior model weights:

w_k = exp(log p(y|Φ, M_k)) / Σ_{k'} exp(log p(y|Φ, M_{k'}))
  • Weights are between 0 and 1 and sum to 1.
  • In the lecture example (30 candidate splines), the 8 DOF spline received the highest weight — most plausible model.
  • If the true sine wave basis is included in the candidate set, all splines collapse to near-zero weight.
  • Critical rule: All models must be compared on the exact same data set.
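A numerically safe way to compute these weights is to shift by the maximum log-Evidence before exponentiating (the log-sum-exp trick); otherwise large negative log-Evidence values underflow to zero. A Python sketch with hypothetical log-Evidence values:

```python
import numpy as np

def model_weights(log_evidence):
    """Posterior model weights from log-Evidence values (equal prior model probabilities assumed)."""
    log_ev = np.asarray(log_evidence, dtype=float)
    # Shift by the max before exponentiating (log-sum-exp trick) so that
    # large negative log-Evidence values don't underflow to 0/0
    w = np.exp(log_ev - log_ev.max())
    return w / w.sum()

# Hypothetical log-Evidence values for three candidate models
w = model_weights([-1052.3, -1047.8, -1049.1])
print(w.round(3))   # between 0 and 1, sum to 1; the second model is most plausible
```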
Quick Comparison
Side-by-side summaries for exam review.
Confidence Interval vs. Prediction Interval — Master Table
Property | Confidence / Credible Interval | Prediction Interval
What does it describe? | Uncertainty about the mean trend μ | Uncertainty about a new response y*
Source of uncertainty | Uncertainty in the β-parameters only | Uncertainty in β AND noise σ
Mathematical object | Quantiles of U* = Φ* Bᵀ [M × S], where Φ* is [M × J] and B is [S × J] | Quantiles of Y* = U* + σ·Z [M × S], where Z[m,s] ~ N(0,1)
Width comparison | Always narrower | Always wider (PI ⊇ CI)
Effect of model complexity | Widens as DOF increases (β uncertain) | Widens at both extremes (noise or β uncertainty)
Would exist if σ = 0? | Yes (still β uncertainty) | No extra width (reduces to the CI)
Would exist if β were known exactly? | No (collapses to a line) | Yes (still noise σ > 0)
Bayesian term | Credible interval | Posterior predictive interval
BIC vs. AIC — Master Table
Property | BIC | AIC
Formula | P·log(N) − 2·log L̂ | 2P − 2·log L̂
Penalty per parameter | log(N) | 2
Penalty depends on N? | Yes — grows with sample size | No — fixed at 2
Harsher when | N ≥ 8 (log N > 2) | N ≤ 7 (log N < 2)
Goal / interpretation | Find the true model | Minimize prediction error (KL divergence)
Bayesian origin? | Approximately (Laplace) — but the prior is dropped | Information theory (Kullback–Leibler)
Best model has | Lowest BIC value | Lowest AIC value
Comparable across different datasets? | ⚠️ NO — all models must be scored on the exact same dataset | ⚠️ NO — same rule
The Uncertainty Hierarchy
PREDICTION INTERVAL = CI uncertainty + noise σ uncertainty
  ⊃ CONFIDENCE / CREDIBLE INTERVAL = β-parameter uncertainty
    ⊃ MEAN TREND μ = Φ*β (deterministic)
Categorical Features
Dummy variables & one-hot encoding — how string inputs become numeric features.
The Problem You Can't Multiply a Slope by a String

All the linear model machinery (β × feature) requires numeric inputs. A categorical variable like x ∈ {a, b, c, d} can't be directly multiplied by a coefficient.

Solution: represent each category as a binary (0/1) numeric column — a dummy variable. The feature space expands, but the math stays identical.
Dummy Variables The Default R / lm() Approach

For a 4-level categorical input {a, b, c, d}, R automatically creates L − 1 = 3 dummy columns, leaving one level (alphabetically first: a) as the reference:

model.matrix(y ~ x, data = sim2)

x (raw) | (Intercept) | xb | xc | xd
a       | 1 | 0 | 0 | 0   ← REFERENCE
b       | 1 | 1 | 0 | 0
c       | 1 | 0 | 1 | 0
d       | 1 | 0 | 0 | 1
🔑
Key rule: For L levels, you get L − 1 dummy columns. The missing level is the reference level — it is represented when all dummies are 0. In R, the reference is the first level alphabetically.
Interpretation What Do the Coefficients Mean?
Mean trend model with dummy variables
μ_n = β₀ + β_b·x_b,n + β_c·x_c,n + β_d·x_d,n
Level | Dummies active | Mean trend simplifies to | Interpretation
x = a | all 0 | μ = β₀ | β₀ = average response at the reference level
x = b | x_b = 1 | μ = β₀ + β_b | β_b = how much higher/lower level b is vs. level a
x = c | x_c = 1 | μ = β₀ + β_c | β_c = effect of level c relative to the reference
x = d | x_d = 1 | μ = β₀ + β_d | β_d = effect of level d relative to the reference
The intercept = average response at the reference level. Each dummy coefficient = the difference between that level's average and the reference. Not the absolute average — the relative shift.
One-Hot Encoding Include All Levels, Drop the Intercept

By specifying y ~ x - 1 in R (the -1 removes the intercept), all L levels get their own column — no reference category needed:

model.matrix(y ~ x - 1, data = sim2) # one-hot: xa xb xc xd
Dummy Variable (default)
  • L − 1 columns (reference omitted)
  • Intercept = reference level mean
  • Each β = relative effect vs. reference
  • Better for testing: "Is level b significantly different from a?"
  • R default: lm(y ~ x)
One-Hot Encoding
  • L columns (one per level, no intercept)
  • Each β = absolute mean for that level
  • Coefficient estimates directly equal group averages
  • Better for reading: "What is the mean for each group?"
  • R syntax: lm(y ~ x - 1)
⚠️
Critical warning: If you use one-hot encoding AND include an intercept, the design matrix becomes linearly dependent (perfect multicollinearity). Most functions won't error — they'll silently set the intercept to NA or 0. Always remove the intercept when using one-hot encoding.
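The two encodings can be checked numerically: fit the same data both ways and confirm that one-hot coefficients are absolute group means while dummy coefficients are shifts from the reference. A Python/NumPy sketch with made-up group means (the lecture's R call `model.matrix` builds the same matrices):

```python
import numpy as np

rng = np.random.default_rng(2)

# Made-up data: 4-level categorical input with known group means
levels = np.array(['a', 'b', 'c', 'd'])
x = np.repeat(levels, 25)                                  # 100 observations
true_means = {'a': 1.0, 'b': 3.0, 'c': 2.0, 'd': 5.0}
y = np.array([true_means[v] for v in x]) + rng.normal(0, 0.1, x.size)

# One-hot encoding: L columns, NO intercept
onehot = (x[:, None] == levels[None, :]).astype(float)     # [N x 4]
beta_onehot, *_ = np.linalg.lstsq(onehot, y, rcond=None)   # each beta = group mean

# Dummy coding: intercept + L-1 columns, level 'a' is the reference
dummy = np.column_stack([np.ones(x.size), onehot[:, 1:]])  # [N x 4]
beta_dummy, *_ = np.linalg.lstsq(dummy, y, rcond=None)
# beta_dummy[0]  = mean of reference level 'a'
# beta_dummy[1:] = shifts of b, c, d relative to 'a'

# Same model, two parameterisations: the fitted values agree exactly
assert np.allclose(beta_onehot[0], beta_dummy[0])
assert np.allclose(beta_onehot[1:] - beta_onehot[0], beta_dummy[1:])
```

The two design matrices span the same column space, which is why the parameterisations are interchangeable — and why combining them (one-hot plus intercept) makes the matrix rank-deficient.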
Mixed Inputs Categorical + Continuous Together

The same logic extends when mixing categorical x₁ (4 levels: A, B, C, D) and continuous x₂:

Additive model mean trend (y ~ x1 + x2)
μ_n = β₀ + β_B·x1_B,n + β_C·x1_C,n + β_D·x1_D,n + β₂·x2,n
When x1 = ? | Mean trend | Effective intercept
A (reference) | β₀ + β₂·x2 | β₀ = avg response when x2 = 0, x1 = A
B | (β₀ + β_B) + β₂·x2 | β₀ + β_B = avg when x2 = 0, x1 = B
C | (β₀ + β_C) + β₂·x2 | β₀ + β_C = avg when x2 = 0, x1 = C
D | (β₀ + β_D) + β₂·x2 | β₀ + β_D = avg when x2 = 0, x1 = D
In an additive model, categorical variables shift the intercept for each group — the slope on x₂ is the same across all groups. To allow slopes to differ, you'd need an interaction term (x1 × x2).
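A quick numerical check of the additive structure — one shared slope, group-specific intercepts — on simulated data (the group shifts and slope below are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)
N = 200
levels = np.array(['A', 'B', 'C', 'D'])
x1 = rng.choice(levels, size=N)            # categorical input
x2 = rng.uniform(0, 1, size=N)             # continuous input

# Made-up truth: group-specific intercept shifts, one shared slope of 2 on x2
shifts = {'A': 0.0, 'B': 1.5, 'C': -0.5, 'D': 2.0}
y = 1.0 + np.array([shifts[v] for v in x1]) + 2.0 * x2 + rng.normal(0, 0.1, N)

# Additive design matrix: intercept, dummies for B/C/D (A = reference), x2
dummies = (x1[:, None] == levels[None, 1:]).astype(float)
X = np.column_stack([np.ones(N), dummies, x2])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

# One shared slope on x2 (beta[-1]); each group gets its own intercept:
#   A: beta[0], B: beta[0] + beta[1], C: beta[0] + beta[2], D: beta[0] + beta[3]
```

To let the slope itself vary by group, the design matrix would also need interaction columns (dummy × x2).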
Parameter Count Why Categorical Inputs Are "Expensive"

Each categorical variable with L levels adds L − 1 parameters to the model (one per non-reference level). This matters for BIC/AIC — more parameters = bigger complexity penalty.

Input type | Parameters added | Example
Continuous | 1 | x₂ → adds β₂
Binary categorical | 1 | {yes, no} → adds β_yes
4-level categorical | 3 | {a, b, c, d} → adds β_b, β_c, β_d
L-level categorical | L − 1 | General rule
📌
In the sim2 example: one categorical input with 4 levels → you must learn 4 β-parameters (intercept + 3 dummy slopes), even though there is only 1 input variable.
Week 10 Glossary
Key terms from the lecture.
Confidence Interval (CI) / Credible Interval
An uncertainty band on the mean trend. In Bayesian regression, this is properly called a credible interval — it represents the middle X% of the posterior distribution of μ. It arises solely from uncertainty in the β parameters.
Prediction Interval (PI)
An uncertainty band on a new response observation y*. It accounts for both β uncertainty (like the CI) and noise σ from the likelihood. Always wider than the corresponding CI.
Posterior Predictive Distribution
The distribution over new responses y*, obtained by averaging (marginalising) the likelihood over the posterior: p(y*|y, Φ) = ∫ p(y*|θ) p(θ|y,Φ) dθ. Approximated in practice by sampling.
Evidence / Marginal Likelihood
p(y|Φ) = ∫ p(y|Φ,θ) p(θ) dθ. The integral of the likelihood over the prior. The denominator in Bayes' Theorem. Used to compare models — the model with higher Evidence is more plausible. Intractable in general; approximated by Laplace.
Laplace Approximation (to log Evidence)
An approximation to the log Evidence using the posterior mode θ̂ and the curvature (Hessian) at that mode: log p(y|Φ) ≈ log p(y|Φ,θ̂) + log p(θ̂) − ½ log|H(θ̂)| + P/2 · log(2π).
Bayesian Information Criterion (BIC)
BIC = P·log(N) − 2·log p(y|Φ,θ̂). A simplification of the Laplace log-Evidence approximation obtained by ignoring the prior and approximating the Hessian. Lower is better. Penalizes complexity based on both the number of parameters P and the sample size N.
Akaike Information Criterion (AIC)
AIC = 2P − 2·log p(y|Φ,θ̂). Similar to BIC but with a fixed penalty of 2 per parameter (independent of N). Derived from information theory; targets prediction accuracy (KL divergence minimisation). Lower is better.
Bayes Factor
BF = p(y|Φ, M₀) / p(y|Φ, M₁). The ratio of Evidence between two candidate models. If BF ≫ 1, Model 0 is more plausible than Model 1.
Posterior Model Weights
w_k = exp(log p(y|M_k)) / Σ_k' exp(log p(y|M_k')). Normalised Evidence scores across K candidate models. Values between 0 and 1 summing to 1. Allows direct comparison of many models simultaneously.
Hessian Matrix
The matrix of second partial derivatives of the log-posterior with respect to all parameters, evaluated at the posterior mode. Its determinant captures the "curvature" of the posterior — sharper peaks (better-identified parameters) produce larger determinants, which act as a complexity penalty in the Laplace approximation.
Overfitting
A model that performs very well on training data but poorly on new data, because it has learned noise rather than signal. In spline models: very high DOF → very low training RMSE → very low σ estimate → but poor generalisation.
z-score Trick
Generating samples from Normal(μ, σ) as μ + σ×z, where z ~ Normal(0,1). Used in the lecture to efficiently generate prediction interval samples: Y*[m,s] = U*[m,s] + σ_s × z[m,s].