INFSCI 2595 · Fall 2025 · Week 10

Model Complexity & Information Criteria

Confidence vs. Prediction Intervals · BIC · AIC · Bayesian Evidence · Categorical Features
Week 10 at a Glance
Two focused topics: the geometry of uncertainty, and penalizing model complexity.
Topic 1 The Two Kinds of Uncertainty

A Bayesian model contains two sources of uncertainty: uncertainty about the mean trend (captured by the confidence/credible interval) and uncertainty about individual responses (captured by the prediction interval). Confusing these is one of the most common mistakes in regression.

Topic 2 Penalizing Complexity

Training-set performance always favors more complex models. BIC and AIC approximate how a model will perform on new data by subtracting a penalty proportional to the number of parameters — so complexity must earn its keep.

The Big Picture: Why Both Topics Arise Together

We are fitting natural spline models of varying complexity (1–30 DOF). As complexity increases:

As DOF ↑ | Mean Trend | Noise σ | Confidence Interval | Prediction Interval
Low DOF | Smooth, stable | High (under-fits) | Narrow | Wide (noise dominates)
Mid DOF | Balanced | Near true σ | Moderate | Moderate
High DOF | Wiggly, uncertain | Low (over-fits) | Wide (β uncertain) | Wide (CI takes large fraction)
💡
BIC/AIC give us a way to automatically identify the right complexity level — the 8 DOF spline in the lecture example — without needing to look at any test data.
Confidence vs. Prediction Intervals
Both are uncertainty ribbons, but they answer fundamentally different questions.
Confidence / Credible Interval

Uncertainty about where the mean trend μ is.

  • Comes from uncertainty in the β-parameters
  • We draw S = 10⁴ β samples — collect them as B [S × J]. The prediction design matrix Φ* [M × J] encodes the M new input points in basis space. Then U* = Φ* Bᵀ [M × S] — column s is one full mean trend curve
  • Summarise with the 5th–95th percentile of all mean trend curves
  • Widens when β is uncertain (complex models, collinear inputs)
  • Even if we knew σ exactly, this interval would still exist
Prediction Interval

Uncertainty about the value of a new observed response y.

  • Comes from β uncertainty and noise σ in the likelihood
  • yn | μn, σ ~ Normal(μn, σ) — response scatters around the mean
  • Generated by drawing an [M × S] matrix Z of i.i.d. N(0,1) values and adding σ_s × Z[m,s] to each mean trend entry U*[m,s]
  • Always wider than the confidence interval
  • Even if we knew β perfectly, this interval would still exist
Formula How each interval is computed
Step 1 – set up the matrices, then compute mean trend samples (= the confidence interval)
Φ* : [M × J] — prediction design matrix  (M test points, J basis columns)
B  : [S × J] — posterior β samples     (S draws, each a J-dim row vector)

U* = Φ* × Bᵀ  : [M × S] — column s is the full mean trend curve for sample s
Step 2 – draw a standard-normal matrix, scale by σ, add to U* (= the prediction interval)
Z  : [M × S] — i.i.d. draws, Z[m,s] ~ Normal(0, 1)

Y*[m, s] = U*[m, s]  +  σ_s × Z[m, s]
σ_s is the s-th posterior noise sample; Z is drawn independently of everything else
The prediction interval is the CI plus the noise layer. It can never be narrower than the CI.
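The two steps above can be sketched numerically. This is a minimal NumPy illustration (Python rather than the lecture's R) with made-up dimensions and stand-in posterior samples — the shapes and quantile logic, not the numbers, are the point:

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up dimensions for illustration (not the lecture's data)
M, J, S = 50, 5, 10_000                # test points, basis columns, posterior draws

Phi_star = rng.normal(size=(M, J))     # stand-in prediction design matrix [M x J]
B = rng.normal(size=(S, J))            # stand-in posterior beta samples   [S x J]
sigma = np.abs(rng.normal(1.0, 0.1, size=S))   # stand-in posterior sigma samples [S]

# Step 1: mean trend samples -- column s is one full mean trend curve
U_star = Phi_star @ B.T                # [M x S]

# Confidence (credible) interval: 5th-95th percentiles of the mean trend samples
ci_lo, ci_hi = np.percentile(U_star, [5, 95], axis=1)

# Step 2: the z-score trick -- add the noise layer
Z = rng.standard_normal((M, S))        # i.i.d. N(0, 1) draws
Y_star = U_star + sigma * Z            # sigma[s] scales column s (broadcast on last axis)

# Prediction interval: 5th-95th percentiles of the posterior predictive samples
pi_lo, pi_hi = np.percentile(Y_star, [5, 95], axis=1)

# The PI is the CI plus the noise layer, so on average it must be wider
assert (pi_hi - pi_lo).mean() > (ci_hi - ci_lo).mean()
```

With real posterior samples, only `Phi_star`, `B`, and `sigma` change; the quantile computation is identical.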
Intuition How the two intervals behave across model complexity
🎯
Simple model (e.g., 4 DOF spline):
The mean trend is constrained and stable — very little variance across β samples. The confidence interval is narrow. However, the model can't fit the true shape, so the residuals are large → σ is over-estimated → prediction interval is very wide. The PI is dominated by noise, not parameter uncertainty.
[Diagram] Low DOF → σ dominates → PI (wide) ≫ CI (narrow)
⚖️
Balanced model (e.g., 8–9 DOF spline):
The mean trend captures the true signal — σ shrinks toward its true value. The confidence interval moderately widens (more β parameters, more uncertainty), but the prediction interval also narrows because σ is small. Both intervals are proportional and interpretable. This is the "sweet spot" the information criteria identify.
[Diagram] Balanced DOF → CI and PI proportional
🌊
Complex model (e.g., 25 DOF spline):
The model "uses" many β parameters to chase noise. σ shrinks (noise is absorbed into the mean), but the β parameters become very uncertain → confidence interval widens dramatically. The prediction interval is wide again, but this time the CI takes up a large fraction of the PI — parameter uncertainty, not noise, is the main driver.
High DOF → CI takes up most of PI
Key Distinction Frequentist vs. Bayesian Language
Term | Framework | Meaning
Confidence Interval | Frequentist | 95% of similarly constructed intervals would contain the true parameter. Does NOT mean "95% probability the parameter is in here."
Credible Interval | Bayesian | The posterior probability that the parameter lies in this range IS 95%. A direct probability statement about the parameter.
Prediction Interval | Both | Interval expected to contain a new, unobserved response value with a specified probability. Always wider than the CI/credible interval.
⚠️
In the lecture, the term "confidence interval" is used loosely for the Bayesian uncertainty band on the mean trend. In strict Bayesian language, this is a credible interval — it really does give a probability statement about where μ lies.
BIC & AIC
Information criteria — approximations for how well a model will generalise to new data.
Motivation Why Can't We Just Use Training Set Performance?
  • Higher complexity → better training fit — always. The 25 DOF spline will win every training metric.
  • Training R² and RMSE are biased estimators of generalization performance.
  • We need a metric that penalizes using extra parameters.
The ideal tool is the Bayesian Evidence (Marginal Likelihood): p(y|Φ) = ∫ p(y|Φ,θ) p(θ) dθ. It automatically balances fit against complexity. But it's intractable for most models — so we approximate it.
Derivation From Laplace Approximation → BIC

The Laplace Approximation provides an estimate to the log Evidence:

Laplace log-Evidence estimate
log p(y|Φ) ≈ log p(y|Φ, θ̂) - ½ log|H(θ̂)| + log p(θ̂) + P/2 · log(2π)
Term | What it represents | Effect
log p(y|Φ, θ̂) | Log-likelihood at the posterior mode | ↑ as data fit improves
−½ log|H(θ̂)| | Log-determinant of the Hessian (curvature) | Penalty — more parameters → larger Hessian → smaller Evidence
log p(θ̂) | Log-prior at the posterior mode | ↑ when the posterior mode is consistent with the prior

If we assume a diffuse prior and approximate log|H| ≈ P log(N) + const, the expression simplifies to:

Simplified log-Evidence (basis for BIC)
log p(y|Φ) ≈ log p(y|Φ, θ̂) − ½ P log(N)
Formulas BIC and AIC — Standard Definitions
📌
The standard definitions negate the above (multiply by −2), so that lower BIC/AIC = better model.
Bayesian Information Criterion (BIC)
BIC = P × log(N) − 2 × log p(y|Φ, θ̂)
Akaike Information Criterion (AIC)
AIC = 2P − 2 × log p(y|Φ, θ̂)
Symbol | Meaning
P | Number of free parameters in the model (e.g., J + 2 for a J-DOF spline: J + 1 β's plus σ)
N | Number of training observations
log p(y|Φ, θ̂) | Log-likelihood evaluated at the MLE / posterior mode — measures training fit
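As a concrete illustration, here is a small Python sketch that computes BIC and AIC from the Gaussian log-likelihood of a least-squares fit. The straight-line data set is made up for the example; the formulas match the definitions above:

```python
import numpy as np

def gaussian_loglik(y, mu, sigma):
    """Gaussian log-likelihood of responses y around fitted means mu."""
    n = y.size
    return -0.5 * n * np.log(2 * np.pi * sigma**2) - np.sum((y - mu)**2) / (2 * sigma**2)

def bic(loglik, P, N):
    return P * np.log(N) - 2 * loglik

def aic(loglik, P):
    return 2 * P - 2 * loglik

# Made-up example: straight-line fit to simulated data
rng = np.random.default_rng(1)
N = 100
x = np.linspace(0, 1, N)
y = 1.0 + 2.0 * x + rng.normal(0, 0.3, N)

X = np.column_stack([np.ones(N), x])            # design matrix: intercept + slope
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
mu_hat = X @ beta_hat
sigma_hat = np.sqrt(np.mean((y - mu_hat)**2))   # MLE of the noise sd

P = X.shape[1] + 1                              # two betas plus sigma
ll = gaussian_loglik(y, mu_hat, sigma_hat)
print('BIC:', bic(ll, P, N), 'AIC:', aic(ll, P))   # lower = better
```

With N = 100 here, log(N) ≈ 4.6 > 2, so BIC penalizes each parameter more harshly than AIC — exactly the contrast tabulated below.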
Key Differences BIC vs. AIC
BIC
  • Penalty = P × log(N)
  • Penalty grows with sample size
  • For N ≥ 8, BIC penalizes more harshly than AIC (since log(N) > 2)
  • Interpreted as finding the true model in the candidate set (model selection consistency)
  • Derived from Bayesian Evidence (Laplace Approx.)
AIC
  • Penalty = 2P
  • Penalty fixed — does not depend on N
  • Less harsh penalty for additional parameters
  • Interpreted as minimizing prediction error on new data (asymptotic efficiency)
  • Derived from information theory (Kullback-Leibler divergence)
🧠
The lecture notes: "Even though BIC has 'Bayesian' in the name, it's not actually Bayesian" — because the prior was dropped in the simplification. It is nonetheless a very practical and widely used information criterion.
In Practice Posterior Model Weights

When comparing K candidate models using the Laplace log-Evidence (not just BIC), compute posterior model weights:

w_k = exp(log p(y|Φ, M_k)) / Σ_{k'} exp(log p(y|Φ, M_{k'}))
  • Weights are between 0 and 1 and sum to 1.
  • In the lecture example (30 candidate splines), the 8 DOF spline received the highest weight — most plausible model.
  • If the true sine wave basis is included in the candidate set, all splines collapse to near-zero weight.
  • Critical rule: All models must be compared on the exact same data set.
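A numerically safe way to compute these weights is to shift by the maximum log-Evidence before exponentiating (the log-sum-exp trick); otherwise large negative log-Evidence values underflow to zero. A Python sketch with hypothetical log-Evidence values:

```python
import numpy as np

def model_weights(log_evidence):
    """Posterior model weights from log-Evidence values (equal prior model probabilities assumed)."""
    log_ev = np.asarray(log_evidence, dtype=float)
    # Shift by the max before exponentiating (log-sum-exp trick) so that
    # large negative log-Evidence values don't underflow to 0/0
    w = np.exp(log_ev - log_ev.max())
    return w / w.sum()

# Hypothetical log-Evidence values for three candidate models
w = model_weights([-1052.3, -1047.8, -1049.1])
print(w.round(3))   # between 0 and 1, sum to 1; the second model is most plausible
```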
Quick Comparison
Side-by-side summaries for exam review.
Confidence Interval vs. Prediction Interval — Master Table
Property | Confidence / Credible Interval | Prediction Interval
What does it describe? | Uncertainty about the mean trend μ | Uncertainty about a new response y*
Source of uncertainty | Uncertainty in the β-parameters only | Uncertainty in β AND noise σ
Mathematical object | Quantiles of U* = Φ* Bᵀ [M × S], where Φ* is [M × J] and B is [S × J] | Quantiles of Y* = U* + σ·Z [M × S], where Z[m,s] ~ N(0,1)
Width comparison | Always narrower | Always wider (PI ⊇ CI)
Effect of model complexity | Widens as DOF increases (β uncertain) | Widens at both extremes (noise or β uncertainty)
Would exist if σ = 0? | Yes (still β uncertainty) | No extra width (reduces to the CI)
Would exist if β were known exactly? | No (collapses to a line) | Yes (still noise σ > 0)
Bayesian term | Credible interval | Posterior predictive interval
BIC vs. AIC — Master Table
Property | BIC | AIC
Formula | P·log(N) − 2·log L̂ | 2P − 2·log L̂
Penalty per parameter | log(N) | 2
Penalty depends on N? | Yes — grows with sample size | No — fixed at 2
Harsher when | N ≥ 8 (log N > 2) | N ≤ 7 (log N < 2)
Goal / interpretation | Find the true model | Minimize prediction error (KL divergence)
Bayesian origin? | Approximately (Laplace) — but the prior is dropped | Information theory (Kullback–Leibler)
Best model has | Lowest BIC value | Lowest AIC value
Comparable across different datasets? | ⚠️ NO — all models must be scored on the exact same dataset | ⚠️ NO — same rule
The Uncertainty Hierarchy
PREDICTION INTERVAL = CI uncertainty + noise σ uncertainty
  ⊃ CONFIDENCE / CREDIBLE INTERVAL = β-parameter uncertainty
    ⊃ MEAN TREND μ = Φ*β (deterministic)
Categorical Features
Dummy variables & one-hot encoding — how string inputs become numeric features.
The Problem You Can't Multiply a Slope by a String

All the linear model machinery (β × feature) requires numeric inputs. A categorical variable like x ∈ {a, b, c, d} can't be directly multiplied by a coefficient.

Solution: represent each category as a binary (0/1) numeric column — a dummy variable. The feature space expands, but the math stays identical.
Dummy Variables The Default R / lm() Approach

For a 4-level categorical input {a, b, c, d}, R automatically creates L − 1 = 3 dummy columns, leaving one level (alphabetically first: a) as the reference:

model.matrix(y ~ x, data = sim2)

x (raw) | (Intercept) | xb | xc | xd
a       | 1 | 0 | 0 | 0   ← REFERENCE
b       | 1 | 1 | 0 | 0
c       | 1 | 0 | 1 | 0
d       | 1 | 0 | 0 | 1
🔑
Key rule: For L levels, you get L − 1 dummy columns. The missing level is the reference level — it is represented when all dummies are 0. In R, the reference is the first level alphabetically.
Interpretation What Do the Coefficients Mean?
Mean trend model with dummy variables
μ_n = β₀ + β_b·x_b,n + β_c·x_c,n + β_d·x_d,n
Level | Dummies active | Mean trend simplifies to | Interpretation
x = a | all 0 | μ = β₀ | β₀ = average response at the reference level
x = b | x_b = 1 | μ = β₀ + β_b | β_b = how much higher/lower level b is vs. level a
x = c | x_c = 1 | μ = β₀ + β_c | β_c = effect of level c relative to the reference
x = d | x_d = 1 | μ = β₀ + β_d | β_d = effect of level d relative to the reference
The intercept = average response at the reference level. Each dummy coefficient = the difference between that level's average and the reference. Not the absolute average — the relative shift.
One-Hot Encoding Include All Levels, Drop the Intercept

By specifying y ~ x - 1 in R (the -1 removes the intercept), all L levels get their own column — no reference category needed:

model.matrix(y ~ x - 1, data = sim2) # one-hot: xa xb xc xd
Dummy Variable (default)
  • L − 1 columns (reference omitted)
  • Intercept = reference level mean
  • Each β = relative effect vs. reference
  • Better for testing: "Is level b significantly different from a?"
  • R default: lm(y ~ x)
One-Hot Encoding
  • L columns (one per level, no intercept)
  • Each β = absolute mean for that level
  • Coefficient estimates directly equal group averages
  • Better for reading: "What is the mean for each group?"
  • R syntax: lm(y ~ x - 1)
⚠️
Critical warning: If you use one-hot encoding AND include an intercept, the design matrix becomes linearly dependent (perfect multicollinearity). Most functions won't error — they'll silently set the intercept to NA or 0. Always remove the intercept when using one-hot encoding.
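The two encodings can be checked numerically: fit the same data both ways and confirm that one-hot coefficients are absolute group means while dummy coefficients are shifts from the reference. A Python/NumPy sketch with made-up group means (the lecture's R call `model.matrix` builds the same matrices):

```python
import numpy as np

rng = np.random.default_rng(2)

# Made-up data: 4-level categorical input with known group means
levels = np.array(['a', 'b', 'c', 'd'])
x = np.repeat(levels, 25)                                  # 100 observations
true_means = {'a': 1.0, 'b': 3.0, 'c': 2.0, 'd': 5.0}
y = np.array([true_means[v] for v in x]) + rng.normal(0, 0.1, x.size)

# One-hot encoding: L columns, NO intercept
onehot = (x[:, None] == levels[None, :]).astype(float)     # [N x 4]
beta_onehot, *_ = np.linalg.lstsq(onehot, y, rcond=None)   # each beta = group mean

# Dummy coding: intercept + L-1 columns, level 'a' is the reference
dummy = np.column_stack([np.ones(x.size), onehot[:, 1:]])  # [N x 4]
beta_dummy, *_ = np.linalg.lstsq(dummy, y, rcond=None)
# beta_dummy[0]  = mean of reference level 'a'
# beta_dummy[1:] = shifts of b, c, d relative to 'a'

# Same model, two parameterisations: the fitted values agree exactly
assert np.allclose(beta_onehot[0], beta_dummy[0])
assert np.allclose(beta_onehot[1:] - beta_onehot[0], beta_dummy[1:])
```

The two design matrices span the same column space, which is why the parameterisations are interchangeable — and why combining them (one-hot plus intercept) makes the matrix rank-deficient.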
Mixed Inputs Categorical + Continuous Together

The same logic extends when mixing categorical x₁ (4 levels: A, B, C, D) and continuous x₂:

Additive model mean trend (y ~ x1 + x2)
μ_n = β₀ + β_B·x1_B,n + β_C·x1_C,n + β_D·x1_D,n + β₂·x2,n
When x1 = ? | Mean trend | Effective intercept
A (reference) | β₀ + β₂·x2 | β₀ = avg response when x2 = 0, x1 = A
B | (β₀ + β_B) + β₂·x2 | β₀ + β_B = avg when x2 = 0, x1 = B
C | (β₀ + β_C) + β₂·x2 | β₀ + β_C = avg when x2 = 0, x1 = C
D | (β₀ + β_D) + β₂·x2 | β₀ + β_D = avg when x2 = 0, x1 = D
In an additive model, categorical variables shift the intercept for each group — the slope on x₂ is the same across all groups. To allow slopes to differ, you'd need an interaction term (x1 × x2).
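A quick numerical check of the additive structure — one shared slope, group-specific intercepts — on simulated data (the group shifts and slope below are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)
N = 200
levels = np.array(['A', 'B', 'C', 'D'])
x1 = rng.choice(levels, size=N)            # categorical input
x2 = rng.uniform(0, 1, size=N)             # continuous input

# Made-up truth: group-specific intercept shifts, one shared slope of 2 on x2
shifts = {'A': 0.0, 'B': 1.5, 'C': -0.5, 'D': 2.0}
y = 1.0 + np.array([shifts[v] for v in x1]) + 2.0 * x2 + rng.normal(0, 0.1, N)

# Additive design matrix: intercept, dummies for B/C/D (A = reference), x2
dummies = (x1[:, None] == levels[None, 1:]).astype(float)
X = np.column_stack([np.ones(N), dummies, x2])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

# One shared slope on x2 (beta[-1]); each group gets its own intercept:
#   A: beta[0], B: beta[0] + beta[1], C: beta[0] + beta[2], D: beta[0] + beta[3]
```

To let the slope itself vary by group, the design matrix would also need interaction columns (dummy × x2).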
Parameter Count Why Categorical Inputs Are "Expensive"

Each categorical variable with L levels adds L − 1 parameters to the model (one per non-reference level). This matters for BIC/AIC — more parameters = bigger complexity penalty.

Input type | Parameters added | Example
Continuous | 1 | x₂ → adds β₂
Binary categorical | 1 | {yes, no} → adds β_yes
4-level categorical | 3 | {a, b, c, d} → adds β_b, β_c, β_d
L-level categorical | L − 1 | General rule
📌
In the sim2 example: one categorical input with 4 levels → you must learn 4 β-parameters (intercept + 3 dummy slopes), even though there is only 1 input variable.
Week 10 Glossary
Key terms from the lecture.
Confidence Interval (CI) / Credible Interval
An uncertainty band on the mean trend. In Bayesian regression, this is properly called a credible interval — it represents the middle X% of the posterior distribution of μ. It arises solely from uncertainty in the β parameters.
Prediction Interval (PI)
An uncertainty band on a new response observation y*. It accounts for both β uncertainty (like the CI) and noise σ from the likelihood. Always wider than the corresponding CI.
Posterior Predictive Distribution
The distribution over new responses y*, obtained by averaging (marginalising) the likelihood over the posterior: p(y*|y, Φ) = ∫ p(y*|θ) p(θ|y,Φ) dθ. Approximated in practice by sampling.
Evidence / Marginal Likelihood
p(y|Φ) = ∫ p(y|Φ,θ) p(θ) dθ. The integral of the likelihood over the prior. The denominator in Bayes' Theorem. Used to compare models — the model with higher Evidence is more plausible. Intractable in general; approximated by Laplace.
Laplace Approximation (to log Evidence)
An approximation to the log Evidence using the posterior mode θ̂ and the curvature (Hessian) at that mode: log p(y|Φ) ≈ log p(y|Φ,θ̂) + log p(θ̂) − ½ log|H(θ̂)| + P/2 · log(2π).
Bayesian Information Criterion (BIC)
BIC = P·log(N) − 2·log p(y|Φ,θ̂). A simplification of the Laplace log-Evidence approximation obtained by ignoring the prior and approximating the Hessian. Lower is better. Penalizes complexity based on both the number of parameters P and the sample size N.
Akaike Information Criterion (AIC)
AIC = 2P − 2·log p(y|Φ,θ̂). Similar to BIC but with a fixed penalty of 2 per parameter (independent of N). Derived from information theory; targets prediction accuracy (KL divergence minimisation). Lower is better.
Bayes Factor
BF = p(y|Φ, M₀) / p(y|Φ, M₁). The ratio of Evidence between two candidate models. If BF ≫ 1, Model 0 is more plausible than Model 1.
Posterior Model Weights
w_k = exp(log p(y|M_k)) / Σ_k' exp(log p(y|M_k')). Normalised Evidence scores across K candidate models. Values between 0 and 1 summing to 1. Allows direct comparison of many models simultaneously.
Hessian Matrix
The matrix of second partial derivatives of the log-posterior with respect to all parameters, evaluated at the posterior mode. Its determinant captures the "curvature" of the posterior — sharper peaks (better-identified parameters) produce larger determinants, which act as a complexity penalty in the Laplace approximation.
Overfitting
A model that performs very well on training data but poorly on new data, because it has learned noise rather than signal. In spline models: very high DOF → very low training RMSE → very low σ estimate → but poor generalisation.
z-score Trick
Generating samples from Normal(μ, σ) as μ + σ×z, where z ~ Normal(0,1). Used in the lecture to efficiently generate prediction interval samples: Y*[m,s] = U*[m,s] + σ_s × z[m,s].