- We can engineer many features: polynomials, splines, interactions, dummy variables.
- Complex models fit training data well but overfit — poor out-of-sample performance.
- Cross-validation and log-Evidence help detect overfitting after the fact.
- Regularization is a tool for controlling overfitting directly, even in a complex model.
In the noisy quadratic demo, higher-order polynomial fits had extremely large coefficient values compared to the true quadratic. The 7th-degree model had coefficients ranging into the tens or twenties, while the quadratic's stayed near ±2.
Weak (diffuse) prior: the prior has almost no influence — the likelihood (data) dominates. Coefficients can grow large. The 7th-degree model interpolates the noise, producing wild oscillations. Training error is low; hold-out error is very high.
Moderate (informative) prior: a good balance. The prior prevents extreme values but still lets the data speak. The 7th-degree model's predictions look nearly quadratic — the prior has effectively smoothed out the noise chasing. Quadratic is still selected as best by log-Evidence.
Very strong prior: the prior overwhelms the data. All coefficients are forced near zero. Even the quadratic model behaves like a flat line (constant). The log-Evidence now incorrectly selects the 6th-degree model as best — the prior has broken model selection.
[Schematic: prior-strength spectrum — very strong prior → underfitting; moderate prior → sweet spot (generalization); weak prior → data dominates (overfitting).]
| Prior Strength | τ_β | Effect on Coefficients | Typical Outcome |
|---|---|---|---|
| Very Strong | < 0.1 | All pushed hard to zero | Underfitting — no trends visible |
| Informative | 1 – 5 | Moderate constraint | Correct model often identified |
| Weak / Diffuse | > 10 | Converge to MLEs | Overfitting risk in complex models |
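The table's three regimes can be reproduced with a minimal sketch (assumed setup: a noisy quadratic, a 7th-degree polynomial fit, and the MAP estimate computed in closed form as ridge with λ = σ²/τ_β²; the data-generating coefficients below are illustrative):

```python
# Sketch of the prior-strength demo: noisy quadratic data, degree-7 fit,
# posterior mode computed in closed form (ridge with lam = sigma^2 / tau^2).
import numpy as np

rng = np.random.default_rng(0)
sigma = 1.0
x = rng.uniform(-1, 1, 30)
y = 1.0 + 2.0 * x - 2.0 * x**2 + rng.normal(0, sigma, x.size)  # true quadratic

X = np.vander(x, 8, increasing=True)  # columns 1, x, ..., x^7

def map_coefficients(tau):
    """Posterior mode under beta_j ~ Normal(0, tau^2)."""
    lam = sigma**2 / tau**2
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

for tau in (0.05, 2.0, 50.0):  # very strong, informative, diffuse
    beta = map_coefficients(tau)
    print(f"tau = {tau:5.2f}  max |beta_j| = {np.abs(beta).max():.3f}")
```

With a very strong prior the coefficients are pinned near zero; with a diffuse prior they approach the (large) MLEs.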
Start from Bayes' rule. The un-normalized log-posterior on β (assuming σ known) is:

log p(β | y) ∝ log p(y | β) + log p(β)

With independent Gaussian priors centered at zero, βⱼ ~ Normal(0, τ_β²):

log p(β) = −(1 / (2τ_β²)) Σⱼ βⱼ² + const

The log-likelihood is proportional to the negative SSE:

log p(y | β) = −(1 / (2σ²)) Σᵢ (yᵢ − xᵢᵀβ)² + const = −SSE / (2σ²) + const

Combine and factor out 1/σ²:

log p(β | y) ∝ −(1 / (2σ²)) [ SSE + (σ² / τ_β²) Σⱼ βⱼ² ] = −(1 / (2σ²)) [ SSE + λ Σⱼ βⱼ² ],  where λ = σ² / τ_β²
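A quick numerical check of the identification λ = σ²/τ_β² (a sketch with made-up data, not the lecture's): the gradient of the log-posterior vanishes exactly at the ridge solution of the normal equations.

```python
# Check that the posterior mode equals the ridge estimate when
# lam = sigma^2 / tau^2 (synthetic data; values are illustrative).
import numpy as np

rng = np.random.default_rng(1)
n, p, sigma, tau = 40, 3, 0.5, 2.0
X = rng.normal(size=(n, p))
y = X @ np.array([1.0, -0.5, 0.0]) + rng.normal(0, sigma, n)

# Setting the log-posterior gradient to zero gives the ridge normal
# equations (X'X + lam I) beta = X'y with lam = sigma^2 / tau^2.
lam = sigma**2 / tau**2
beta_map = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Gradient of the negative log-posterior at beta_map: should vanish.
grad = -X.T @ (y - X @ beta_map) / sigma**2 + beta_map / tau**2
print(np.max(np.abs(grad)))  # numerically ~0
```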
Even if the model drives SSE toward zero (perfect training fit), the penalty term λ Σ βⱼ² increases. The optimization must balance these two competing goals:
Fit the training data as closely as possible. A very complex model can drive SSE → 0 by interpolating every point, but its coefficients become extreme.
Keep all coefficients small (near the prior mean of zero). This prevents extreme values but may force the model to ignore real trends in the data.
| Bayesian view | Frequentist view |
|---|---|
| Choose τ_β (prior std. dev. on coefficients) | Tune λ via cross-validation |
| τ_β → ∞ ⟹ prior disappears; posterior = MLE | λ → 0 ⟹ no penalty; solution = OLS |
| τ_β → 0 ⟹ prior overwhelms data | λ → ∞ ⟹ all coefficients → 0 |
| Posterior mode = penalized estimate | Argmin of SSE + λ‖β‖² |
- Setting τ_β = 0.04 means the 95% interval on any coefficient is only (−0.08, +0.08). That rules out nearly all meaningful slopes.
- Under such a prior, even the true quadratic model is forced flat — it looks indistinguishable from a constant.
- The log-Evidence then incorrectly favors high-degree polynomials, because only they (with many small coefficients) can still produce some curvature.
- Bayes itself is not broken: comparing across priors, the correctly specified prior and model still win. But within a single bad prior, model selection breaks.
- Penalty = sum of squared coefficient values (L2 norm squared).
- Bayesian equivalent: independent Gaussian priors with mean 0 and std. dev. τ_β = σ/√λ.
- Coefficients are shrunk toward zero but never set exactly to zero.
- λ is tuned via cross-validation in the non-Bayesian setting.
- Penalty = sum of absolute coefficient values (L1 norm).
- Bayesian equivalent: independent Double Exponential (Laplace) priors on each βⱼ.
- Key property: LASSO can set coefficients exactly to zero — it performs automatic feature selection.
- λ is tuned via cross-validation.
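The Ridge-vs-LASSO contrast is easy to see in a few lines of scikit-learn (a Python sketch with synthetic data; the lecture uses R's glmnet — note that sklearn's `alpha` argument is the penalty strength λ, not the elastic-net mixing ratio):

```python
# With several irrelevant features, LASSO zeros some coefficients out
# exactly; Ridge only shrinks them toward zero.
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(0)
n, p = 200, 10
X = rng.normal(size=(n, p))
beta_true = np.zeros(p)
beta_true[:2] = [3.0, -2.0]          # only 2 of 10 features matter
y = X @ beta_true + rng.normal(0, 1.0, n)

ridge = Ridge(alpha=1.0).fit(X, y)   # alpha here = penalty strength lambda
lasso = Lasso(alpha=0.1).fit(X, y)

print("Ridge zeros:", int(np.sum(ridge.coef_ == 0)))  # typically 0
print("LASSO zeros:", int(np.sum(lasso.coef_ == 0)))  # several exact zeros
```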
- Density f(x) = (1 / 2b) · exp(−|x − μ| / b); mean = μ, variance = 2b².
- Has a sharp peak at zero — far more density concentrated at the center than the Gaussian.
- Also has heavier tails than the Gaussian with the same variance.
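Both properties are easy to verify numerically (a sketch; the scale b = 1/√2 matches the Laplace variance 2b² to the standard normal's variance of 1):

```python
# Compare Laplace(0, b) and Normal(0, 1) densities at equal variance.
import numpy as np

b = 1 / np.sqrt(2)  # Laplace scale giving variance 2*b^2 = 1
laplace = lambda x: np.exp(-np.abs(x) / b) / (2 * b)
normal = lambda x: np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)

print(laplace(0.0) > normal(0.0))  # True: sharper peak at zero
print(laplace(4.0) > normal(4.0))  # True: heavier tails
```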
- α = 0 → pure Ridge (L2 penalty only).
- α = 1 → pure LASSO (L1 penalty only).
- 0 < α < 1 → blend of both; inherits feature selection from LASSO and stability from Ridge.
- Both λ (overall strength) and α (mixing ratio) must be tuned — typically via cross-validation or packages like caret/tidymodels.
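As a Python analogue of the glmnet + caret workflow (a sketch with synthetic data), scikit-learn's ElasticNetCV cross-validates the overall strength for each candidate mixing ratio:

```python
# Tune both the overall strength and the L1/L2 mix by cross-validation.
# In sklearn, `alpha` = overall strength (lambda) and `l1_ratio` = mixing
# ratio (the alpha of the slide's notation).
import numpy as np
from sklearn.linear_model import ElasticNetCV

rng = np.random.default_rng(0)
X = rng.normal(size=(150, 8))
y = X[:, 0] * 2 - X[:, 1] + rng.normal(0, 1.0, 150)

model = ElasticNetCV(l1_ratio=[0.1, 0.5, 0.9, 1.0], cv=5).fit(X, y)
print("chosen strength:", model.alpha_, "chosen mix:", model.l1_ratio_)
```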
The glmnet package implements all three. It provides a path of solutions over a sequence of λ values for a fixed α. Use caret or tidymodels to search over both λ and α simultaneously.

| Property | Ridge (L2) | LASSO (L1) |
|---|---|---|
| Penalty term | λ Σ βⱼ² | λ Σ \|βⱼ\| |
| Bayesian prior | Gaussian (Normal) | Double Exponential (Laplace) |
| Coefficients reach exactly zero? | No — only approach 0 | Yes — sparse solutions |
| Feature selection? | No | Yes (automatic) |
| Best when… | Many small, relevant effects | Few strong effects; many irrelevant features |
| R package | glmnet (α = 0) | glmnet (α = 1) |
Consider predicting home-run distance from three inputs with very different scales:
| Input | Typical Value | 1-unit change means… |
|---|---|---|
| Air Pressure | ≈ 100,000 Pa | Negligible physical change |
| Ball Launch Speed | ≈ 100 mph | Moderate change |
| Wind Speed | ≈ 5 mph | Large change in outcome |
Apply standardization to both inputs and the response: z = (x − x̄) / s_x, where x̄ and s_x are the sample mean and standard deviation.
In the standardized scale, if β_j = 1, then a +1 standard deviation change in x_j is associated with a +1 standard deviation change in the mean response. This gives all coefficients the same unit of measurement.
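A minimal sketch of the effect (illustrative variable names and effect sizes, not the actual home-run data):

```python
# After standardizing inputs and response, coefficients share one unit:
# std devs of y per std dev of x, so they are directly comparable.
import numpy as np

rng = np.random.default_rng(0)
pressure = rng.normal(100_000, 500, 100)  # Pa — huge raw scale
speed = rng.normal(100, 5, 100)           # mph — small raw scale
y = 0.001 * pressure + 2.0 * speed + rng.normal(0, 5, 100)

def standardize(v):
    return (v - v.mean()) / v.std()

X = np.column_stack([standardize(pressure), standardize(speed)])
z = standardize(y)
beta = np.linalg.lstsq(X, z, rcond=None)[0]
print(beta)  # both coefficients now on the same, comparable scale
```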
| Prior Std. Dev. (τ_β) | Implied ±2 std dev change in x causes… |
|---|---|
| τ_β = 0.1 | ≈ ±0.4 std dev change in mean y — small effect |
| τ_β = 1 | ≈ ±4 std dev change in mean y — moderate effect |
| τ_β = 10 | ≈ ±40 std dev change in mean y — enormous, rarely plausible |
- Modern Bayesian practice: use weakly informative priors in the standardized scale.
- τ_β between 1 and 10 is typical; values of 3–5 are common defaults.
- Infinitely diffuse ("flat") priors are not recommended — they allow arbitrarily large effects.
- τ_β < 1 only if you are confident the feature should have a negligible effect.
In the lecture, τ_β was swept from 0.02 to 50 for the 7th-degree polynomial and the posterior mean of each coefficient βⱼ was tracked. The resulting chart has log(τ_β) on the x-axis and the coefficient posterior mean on the y-axis, with the MLE shown as an orange dashed reference line. The pattern breaks into four regimes:
- τ_β < 0.1: all coefficients pinned near zero regardless of what the data say.
- 0.1 < τ_β < 1: a few coefficients with strong data support start to move away from zero first.
- 1 < τ_β < 10: large coefficient values are still prevented; the posterior balances prior and data.
- τ_β > 10: prior becomes negligible; posterior means converge to the MLEs.
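The sweep can be sketched with the closed-form ridge/MAP estimate (assumed setup: noisy quadratic data and degree-7 features; the distance to the MLE shrinks as τ_β grows):

```python
# Trace posterior means over a grid of prior std devs tau; each is the
# ridge estimate with lam = sigma^2 / tau^2.
import numpy as np

rng = np.random.default_rng(0)
sigma = 1.0
x = rng.uniform(-1, 1, 30)
y = 1 + 2 * x - 2 * x**2 + rng.normal(0, sigma, 30)
X = np.vander(x, 8, increasing=True)        # degree-7 polynomial features

mle = np.linalg.lstsq(X, y, rcond=None)[0]  # the dashed reference line

for tau in (0.02, 0.5, 5.0, 50.0):
    lam = sigma**2 / tau**2
    beta = np.linalg.solve(X.T @ X + lam * np.eye(8), X.T @ y)
    print(f"tau={tau:5.2f}  ||beta - MLE|| = {np.linalg.norm(beta - mle):8.3f}")
# distance to the MLE shrinks as tau grows; at tau=0.02 beta is pinned near 0
```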
The lecture also plotted training set RMSE (blue) and hold-out set RMSE (red) for the 7th-degree polynomial as a function of the log-prior precision log(τ_β⁻²) — equivalently, increasing regularization as you move right. 100 data points were generated; 30 used for training, 70 held out for testing.
- Far left (weak prior): Training RMSE is very low — the 7th-degree model fits the training points almost perfectly. But hold-out RMSE is enormous, revealing severe overfitting. The gap between training and hold-out RMSE is the visual signature of overfitting.
- Middle (moderate prior): As regularization increases, coefficients are pulled toward zero, the model smooths out, and hold-out RMSE drops sharply. The sweet spot occurs where hold-out RMSE is minimized.
- Far right (strong prior): Both training and hold-out RMSE plateau at a higher value — the model is underfitting because the prior forces all coefficients toward zero.
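The RMSE sweep can be sketched under the lecture's split (100 points, 30 train / 70 hold-out, degree-7 fit; the data-generating curve below is an assumed stand-in):

```python
# Train / hold-out RMSE as regularization strength varies; the MAP fit is
# the closed-form ridge estimate with lam = sigma^2 / tau^2.
import numpy as np

rng = np.random.default_rng(0)
sigma = 1.0
x = rng.uniform(-1, 1, 100)
y = 1 + 2 * x - 2 * x**2 + rng.normal(0, sigma, 100)
Xtr, ytr = np.vander(x[:30], 8, increasing=True), y[:30]   # 30 training
Xte, yte = np.vander(x[30:], 8, increasing=True), y[30:]   # 70 hold-out

def rmse(X, y, beta):
    return np.sqrt(np.mean((y - X @ beta) ** 2))

for tau in (0.02, 0.2, 2.0, 20.0, 200.0):  # strong prior -> diffuse prior
    lam = sigma**2 / tau**2
    beta = np.linalg.solve(Xtr.T @ Xtr + lam * np.eye(8), Xtr.T @ ytr)
    print(f"tau={tau:7.2f}  train={rmse(Xtr, ytr, beta):.3f}"
          f"  holdout={rmse(Xte, yte, beta):.3f}")
```

Training RMSE falls steadily as the prior weakens, while hold-out RMSE traces the U-shape described above.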
| Region | Prior | Coefficients | Training RMSE | Hold-out RMSE |
|---|---|---|---|---|
| Very strong reg. | τ_β ≪ 1 | All ≈ 0; prior dominates | High (underfit) | High (underfit) |
| Moderate reg. | τ_β ~ 1–5 | Small but non-zero | Moderate | Low ✓ (sweet spot) |
| Weak reg. | τ_β ~ 10–25 | Moderate values | Low | Moderate |
| No reg. (OLS) | τ_β → ∞ | Equal to MLEs (large) | Very low | Very high (overfit) |