A Bayesian model contains two sources of uncertainty: uncertainty about the mean trend (captured by the confidence/credible interval) and uncertainty about individual responses (captured by the prediction interval). Confusing these is one of the most common mistakes in regression.
Training-set performance always favors more complex models. BIC and AIC approximate how a model will perform on new data by subtracting a penalty proportional to the number of parameters — so complexity must earn its keep.
We are fitting natural spline models of varying complexity (1–30 DOF). As complexity increases:
| As DOF ↑ | Mean Trend | Noise σ | Confidence Interval | Prediction Interval |
|---|---|---|---|---|
| Low DOF | Smooth, stable | High (under-fits) | Narrow | Wide (noise dominates) |
| Mid DOF | Balanced | Near true σ | Moderate | Moderate |
| High DOF | Wiggly, uncertain | Low (over-fits) | Wide (β uncertain) | Wide (CI takes large fraction) |
Uncertainty about where the mean trend μ lies.
- Comes from uncertainty in the β-parameters
- We draw S = 10⁴ β samples and collect them as B [S × J]. The prediction design matrix Φ* [M × J] encodes the M new input points in basis space. Then U* = Φ* Bᵀ [M × S]; column s is one full mean trend curve
- Summarise with the 5th–95th percentile of all mean trend curves
- Widens when β is uncertain (complex models, collinear inputs)
- Even if we knew σ exactly, this interval would still exist
Uncertainty about the value of a new, unobserved response y*.
- Comes from β uncertainty and noise σ in the likelihood
- yₙ | μₙ, σ ~ Normal(μₙ, σ): the response scatters around the mean trend
- Generated by drawing an [M × S] matrix Z of i.i.d. N(0,1) values and adding σₛ × Z[m, s] to each mean trend entry U*[m, s]
- Always wider than the confidence interval
- Even if we knew β perfectly, this interval would still exist
B : [S × J], posterior β samples (S draws, each a J-dim row vector)
U* = Φ* × Bᵀ : [M × S], column s is the full mean trend curve for sample s
Y*[m, s] = U*[m, s] + σₛ × Z[m, s]
σₛ is the s-th posterior noise sample; Z is drawn independently of everything else
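The construction above can be sketched end to end in a few lines. A minimal NumPy sketch; the polynomial basis, the fixed seed, and the simulated posterior draws are illustrative assumptions (a real analysis would use the spline basis and draws from an actual sampler):

```python
import numpy as np

rng = np.random.default_rng(1)
S, J, M = 10_000, 3, 50            # posterior draws, basis size, new points

# Simulated posterior draws: beta [S x J] and noise sigma [S]
B = np.array([1.0, 2.0, -0.5]) + 0.1 * rng.standard_normal((S, J))
sigma = 0.5 + 0.01 * rng.standard_normal(S)

# Prediction design matrix Phi* [M x J]: a simple polynomial basis here
x_new = np.linspace(0, 1, M)
Phi_star = np.column_stack([np.ones(M), x_new, x_new**2])

U_star = Phi_star @ B.T            # [M x S]: column s = one mean trend curve
Z = rng.standard_normal((M, S))    # i.i.d. N(0,1)
Y_star = U_star + sigma * Z        # sigma[s] scales column s (broadcasting)

ci_lo, ci_hi = np.percentile(U_star, [5, 95], axis=1)   # credible interval
pi_lo, pi_hi = np.percentile(Y_star, [5, 95], axis=1)   # prediction interval
print(np.all(pi_hi - pi_lo > ci_hi - ci_lo))            # True: PI wider
```

The broadcast `sigma * Z` applies σₛ to column s, exactly the Y*[m, s] = U*[m, s] + σₛ × Z[m, s] recipe above.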
The mean trend is constrained and stable — very little variance across β samples. The confidence interval is narrow. However, the model can't fit the true shape, so the residuals are large → σ is over-estimated → prediction interval is very wide. The PI is dominated by noise, not parameter uncertainty.
The mean trend captures the true signal — σ shrinks toward its true value. The confidence interval moderately widens (more β parameters, more uncertainty), but the prediction interval also narrows because σ is small. Both intervals are proportional and interpretable. This is the "sweet spot" the information criteria identify.
The model "uses" many β parameters to chase noise. σ shrinks (noise is absorbed into the mean), but the β parameters become very uncertain → confidence interval widens dramatically. The prediction interval is wide again, but this time the CI takes up a large fraction of the PI — parameter uncertainty, not noise, is the main driver.
| Term | Framework | Meaning |
|---|---|---|
| Confidence Interval | Frequentist | 95% of similarly constructed intervals would contain the true parameter. Does NOT mean "95% probability the parameter is in here." |
| Credible Interval | Bayesian | The posterior probability that the parameter lies in this range IS 95%. A direct probability statement about the parameter. |
| Prediction Interval | Both | Interval expected to contain a new, unobserved response value with a specified probability. Always wider than CI/credible interval. |
- Higher complexity → better training fit — always. The 25 DOF spline will win every training metric.
- Training R² and RMSE are biased estimators of generalization performance.
- We need a metric that penalizes using extra parameters.
The Laplace Approximation provides an estimate of the log Evidence (up to an additive constant):

log p(y|Φ) ≈ log p(y|Φ, θ̂) − ½·log|H(θ̂)| + log p(θ̂)
| Term | What it represents | Effect |
|---|---|---|
| log p(y|Φ, θ̂) | Log-likelihood at posterior mode | ↑ as data fit improves |
| - ½ log|H(θ̂)| | Log-determinant of Hessian (curvature) | Penalty — more parameters → larger Hessian → smaller Evidence |
| log p(θ̂) | Log-prior at posterior mode | ↑ when posterior mode is consistent with prior |
If we assume a diffuse prior and approximate log|H| ≈ P·log(N) + const, then −2 times the log Evidence simplifies to:

BIC = P·log(N) − 2·log p(y|Φ, θ̂)
| Symbol | Meaning |
|---|---|
| P | Number of free parameters in the model (e.g., J+2 for a J-DOF spline: J+1 β's plus σ) |
| N | Number of training observations |
| log p(y|Φ, θ̂) | Log-likelihood evaluated at the MLEs / posterior mode; measures training fit |
- Penalty = P × log(N)
- Penalty grows with sample size
- For N ≥ 8, BIC penalizes more harshly than AIC (since log(N) > 2)
- Interpreted as finding the true model in the candidate set (model selection consistency)
- Derived from Bayesian Evidence (Laplace Approx.)
- Penalty = 2P
- Penalty fixed — does not depend on N
- Less harsh penalty for additional parameters
- Interpreted as minimizing prediction error on new data (asymptotic efficiency)
- Derived from information theory (Kullback-Leibler divergence)
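To make the two penalties concrete, here is a minimal NumPy sketch that fits polynomials of increasing degree to noisy sine data and scores each fit with both criteria; the data, seed, and chosen degrees are illustrative assumptions, not the lecture's spline example:

```python
import numpy as np

rng = np.random.default_rng(2)
N = 100
x = np.linspace(0, 1, N)
y = np.sin(2 * np.pi * x) + 0.3 * rng.standard_normal(N)

def info_criteria(d):
    """Fit a degree-d polynomial by least squares; return (BIC, AIC)."""
    Phi = np.vander(x, d + 1)                  # [N x (d+1)] design matrix
    beta, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    resid = y - Phi @ beta
    sigma2 = resid @ resid / N                 # MLE of the noise variance
    log_lik = -0.5 * N * (np.log(2 * np.pi * sigma2) + 1)
    P = d + 2                                  # d+1 betas plus sigma
    return P * np.log(N) - 2 * log_lik, 2 * P - 2 * log_lik

scores = {d: info_criteria(d) for d in (1, 5, 15)}
for d, (bic, aic) in scores.items():
    print(f"degree {d:2d}  BIC {bic:7.1f}  AIC {aic:7.1f}")
```

Under-fitting (degree 1) is punished through the likelihood term, over-fitting (degree 15) through the P·log(N) or 2P penalty, so the mid-complexity model typically scores best on both criteria.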
When comparing K candidate models using the Laplace log-Evidence (not just BIC), compute posterior model weights w_k = exp(log E_k) / Σ_j exp(log E_j), assuming equal prior model probabilities:
- Weights are between 0 and 1 and sum to 1.
- In the lecture example (30 candidate splines), the 8 DOF spline received the highest weight — most plausible model.
- If the true sine wave basis is included in the candidate set, all splines collapse to near-zero weight.
- Critical rule: All models must be compared on the exact same data set.
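A minimal sketch of the weight computation (the log-Evidence numbers are made up for illustration); subtracting the maximum before exponentiating keeps exp() from overflowing or underflowing:

```python
import numpy as np

def model_weights(log_evidence):
    """Posterior model weights from log-Evidence values (equal model priors).

    Subtracting the max first is the standard log-sum-exp trick for
    numerical stability; it cancels in the normalisation.
    """
    log_evidence = np.asarray(log_evidence, dtype=float)
    w = np.exp(log_evidence - log_evidence.max())
    return w / w.sum()

# Hypothetical log-Evidence values for three candidate models
w = model_weights([-105.2, -101.7, -110.9])
print(w.round(3))          # weights lie in [0, 1] and sum to 1
```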
| Property | Confidence / Credible Interval | Prediction Interval |
|---|---|---|
| What does it describe? | Uncertainty about the mean trend μ | Uncertainty about a new response y* |
| Source of uncertainty | Uncertainty in β-parameters only | Uncertainty in β AND noise σ |
| Mathematical object | Quantiles of U* = Φ*Bᵀ [M×S], where Φ* is [M×J] and B is [S×J] | Quantiles of Y* = U* + σ·Z [M×S], where Z[m,s] ~ N(0,1) |
| Width comparison | Always narrower | Always wider (PI ⊇ CI) |
| Effect of model complexity | Widens as DOF increases (β uncertain) | Widens at both extremes (noise or β uncertainty) |
| Would exist if σ = 0? | Yes (still β uncertainty) | No (reduces to CI) |
| Would exist if β were known exactly? | No (collapses to a line) | Yes (still noise σ > 0) |
| Bayesian term | Credible interval | Posterior predictive interval |
| Property | BIC | AIC |
|---|---|---|
| Formula | P·log(N) − 2·log L̂ | 2P − 2·log L̂ |
| Penalty per parameter | log(N) | 2 |
| Penalty depends on N? | Yes — grows with sample size | No — fixed at 2 |
| Harsher when | N ≥ 8 (log N > 2) | N ≤ 7 (log N < 2) |
| Goal / interpretation | Find the true model | Minimize prediction error (KL divergence) |
| Bayesian origin? | Approximately (Laplace) — but prior dropped | Information theory (Kullback-Leibler) |
| Best model has | Lowest BIC value | Lowest AIC value |
| Can values be compared across different datasets? | ⚠️ No: compare only on the exact same dataset | ⚠️ No: compare only on the exact same dataset |
All the linear model machinery (β × feature) requires numeric inputs. A categorical variable like x ∈ {a, b, c, d} can't be directly multiplied by a coefficient.
For a 4-level categorical input {a, b, c, d}, R automatically creates L − 1 = 3 dummy columns, leaving one level (alphabetically first: a) as the reference:
| Level | Dummies active | Mean trend simplifies to | Interpretation |
|---|---|---|---|
| x = a | all 0 | μ = β₀ | β₀ = average response at the reference level |
| x = b | x_b = 1 | μ = β₀ + β_b | β_b = how much higher/lower level b is vs. level a |
| x = c | x_c = 1 | μ = β₀ + β_c | β_c = effect of level c relative to reference |
| x = d | x_d = 1 | μ = β₀ + β_d | β_d = effect of level d relative to reference |
By specifying y ~ x - 1 in R (the -1 removes the intercept), all L levels get their own column — no reference category needed:
model.matrix(y ~ x - 1, data = sim2) # one-hot: xa xb xc xd
- L − 1 columns (reference omitted)
- Intercept = reference level mean
- Each β = relative effect vs. reference
- Better for testing: "Is level b significantly different from a?"
- R default:
lm(y ~ x)
- L columns (one per level, no intercept)
- Each β = absolute mean for that level
- Coefficient estimates directly equal group averages
- Better for reading: "What is the mean for each group?"
- R syntax:
lm(y ~ x - 1)
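The claim that the no-intercept coding returns group averages can be checked numerically. A minimal NumPy sketch mirroring lm(y ~ x - 1); the levels, group means, and seed are illustrative assumptions, not the lecture's sim2 data:

```python
import numpy as np

rng = np.random.default_rng(3)
levels = np.array(["a", "b", "c", "d"])
x = rng.choice(levels, size=400)
true_means = {"a": 1.0, "b": 3.0, "c": -2.0, "d": 0.5}
y = np.array([true_means[l] for l in x]) + 0.2 * rng.standard_normal(x.size)

# One-hot design matrix (the y ~ x - 1 parameterisation): one column per level
X = (x[:, None] == levels[None, :]).astype(float)   # [N x 4]

beta, *_ = np.linalg.lstsq(X, y, rcond=None)

# Each coefficient equals the corresponding group average
group_means = np.array([y[x == l].mean() for l in levels])
print(np.allclose(beta, group_means))   # True
```

Because the one-hot columns are mutually orthogonal, least squares decouples into four independent averages.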
The same logic extends when mixing categorical x₁ (4 levels: A, B, C, D) and continuous x₂:
| When x1 = ? | Mean trend | Effective intercept |
|---|---|---|
| A (reference) | β₀ + β₂·x2 | β₀ = avg response when x2 = 0, x1 = A |
| B | (β₀ + β_B) + β₂·x2 | β₀ + β_B = avg when x2 = 0, x1 = B |
| C | (β₀ + β_C) + β₂·x2 | β₀ + β_C = avg when x2 = 0, x1 = C |
| D | (β₀ + β_D) + β₂·x2 | β₀ + β_D = avg when x2 = 0, x1 = D |
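The parallel-lines structure in the table above can be sketched numerically (dummy coding with reference level A plus a continuous x₂); the true coefficient values and seed are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(4)
N = 200
levels = np.array(["A", "B", "C", "D"])
x1 = rng.choice(levels, size=N)
x2 = rng.uniform(0, 10, size=N)

# True model: one common slope, level-specific intercept shifts (A = reference)
shifts = {"A": 0.0, "B": 2.0, "C": -1.0, "D": 4.0}
y = (1.5 + np.array([shifts[l] for l in x1]) + 0.8 * x2
     + 0.1 * rng.standard_normal(N))

# Design matrix: intercept, x2, and L - 1 = 3 dummy columns (for B, C, D)
dummies = (x1[:, None] == levels[None, 1:]).astype(float)
X = np.column_stack([np.ones(N), x2, dummies])

b0, b2, bB, bC, bD = np.linalg.lstsq(X, y, rcond=None)[0]
print(f"intercept {b0:.2f}  slope {b2:.2f}  shifts {bB:.2f} {bC:.2f} {bD:.2f}")
```

The fit recovers one shared slope β₂ and a per-level intercept β₀ + β_level, i.e. four parallel lines.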
Each categorical variable with L levels adds L − 1 parameters to the model (one per non-reference level). This matters for BIC/AIC — more parameters = bigger complexity penalty.
| Input type | Parameters added | Example |
|---|---|---|
| Continuous | 1 | x₂ → adds β₂ |
| Binary categorical | 1 | {yes, no} → adds β_yes |
| 4-level categorical | 3 | {a,b,c,d} → adds β_b, β_c, β_d |
| L-level categorical | L − 1 | General rule |