INFSCI 2595 · Fall 2025 · Week 11

Lecture 11 — Regularization

Priors as penalties · Ridge · LASSO · Elastic Net · Standardization
Overview — What is Regularization?
lecture 11 — tying Bayesian priors to classical penalized regression
Context The Problem We're Solving
  • We can engineer many features: polynomials, splines, interactions, dummy variables.
  • Complex models fit training data well but overfit — poor out-of-sample performance.
  • Cross-validation and log-Evidence help detect overfitting after the fact.
  • Regularization is a tool for controlling overfitting directly, even in a complex model.
Key insight: overfitting is associated with large coefficient values. If we constrain the coefficients to be small, the model cannot chase noise — even if it has many parameters.
Signature Observation Large Coefficients → Overfit

In the noisy quadratic demo, higher-order polynomial fits had extremely large coefficient values compared to the true quadratic. The 7th-degree model had coefficients ranging into the tens or twenties, while the quadratic's stayed near ±2.

[Figure: coefficient values for the quadratic model (β₀–β₂) vs. the 7th-degree model (β₀–β₇) — the overfit model's coefficients reach extreme values]
💡
The 7th-degree polynomial with a diffuse prior has some coefficients near ±25. With an informative prior (τ_β = 1), those same coefficients are constrained near ±1 and the predictions look much more quadratic.
Key Takeaway Effect of Prior Strength on the 7th-Degree Model
🌊
Weak / Diffuse Prior (τ_β = 25):
The prior has almost no influence — the likelihood (data) dominates. Coefficients can grow large. The 7th-degree model interpolates the noise, producing wild oscillations. Training error is low; hold-out error is very high.
⚖️
Informative Prior (τ_β = 1):
A good balance. The prior prevents extreme values but still lets the data speak. The 7th-degree model's predictions look nearly quadratic — the prior has effectively smoothed out the noise chasing. Quadratic is still selected as best by log-Evidence.
🔒
Very Strong Prior (τ_β = 0.04):
The prior overwhelms the data. All coefficients are forced near zero. Even the quadratic model behaves like a flat line (constant). The log-Evidence now incorrectly selects the 6th-degree model as best — the prior has broken model selection.
Big Picture Regularization as a Spectrum
Prior dominates
(underfitting)
Sweet spot
(generalization)
Data dominates
(overfitting)
Prior Strength  | τ_β   | Effect on Coefficients  | Typical Outcome
Very Strong     | < 0.1 | All pushed hard to zero | Underfitting — no trends visible
Informative     | 1 – 5 | Moderate constraint     | Correct model often identified
Weak / Diffuse  | > 10  | Converge to MLEs        | Overfitting risk in complex models
Prior as Regularizer
showing that a Gaussian prior on β is equivalent to Ridge penalization
Derivation The Log-Posterior with a Gaussian Prior

Start from Bayes' rule. The un-normalized log-posterior on β (assuming σ known) is:

Log-posterior decomposition
log p(β | y, Φ, σ) ∝ log p(y | Φ, β, σ) + log p(β)

With independent Gaussian priors centered at zero:

Independent Gaussian prior (prior mean = 0)
log p(β) ∝ −(1 / (2τ_β²)) Σⱼ βⱼ²

The log-likelihood is proportional to the negative SSE:

Log-likelihood contribution
log p(y | Φ, β, σ) ∝ −(1 / (2σ²)) SSE(β)

Combine and factor out 1/σ²:

Full log-posterior (factored)
log p(β | …) ∝ −(1 / (2σ²)) [ SSE(β) + (σ²/τ_β²) Σⱼ βⱼ² ]
Defining λ = σ² / τ_β², the term inside the brackets becomes SSE(β) + λ Σⱼ βⱼ². Maximizing the posterior is equivalent to minimizing this penalized objective.
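This equivalence can be checked numerically. The sketch below (Python with NumPy/SciPy, using synthetic data rather than the lecture's example) computes the ridge closed-form solution with λ = σ²/τ_β² and compares it to the posterior mode found by minimizing the negative log-posterior directly; the two estimates should agree.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)

# Hypothetical design matrix and response, just to check the algebra numerically
Phi = rng.normal(size=(30, 4))
beta_true = np.array([1.5, -2.0, 0.0, 0.5])
sigma, tau_beta = 1.0, 0.5
y = Phi @ beta_true + sigma * rng.normal(size=30)

# Ridge closed form with lambda = sigma^2 / tau_beta^2:
#   beta = (Phi'Phi + lambda * I)^{-1} Phi'y
lam = sigma**2 / tau_beta**2
beta_ridge = np.linalg.solve(Phi.T @ Phi + lam * np.eye(4), Phi.T @ y)

# Posterior mode: minimize the negative log-posterior directly
def neg_log_post(b):
    sse = np.sum((y - Phi @ b) ** 2)
    return sse / (2 * sigma**2) + np.sum(b**2) / (2 * tau_beta**2)

beta_map = minimize(neg_log_post, np.zeros(4)).x
print(np.round(beta_ridge, 4))
print(np.round(beta_map, 4))
```

The two printed vectors match to numerical precision: maximizing the Gaussian-prior posterior and minimizing the ridge objective are the same computation.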
Insight The Prior Acts as a Floor on the SSE

Even if the model drives SSE toward zero (perfect training fit), the penalty term λ Σ βⱼ² increases. The optimization must balance these two competing goals:

Minimize SSE

Fit the training data as closely as possible. A very complex model can drive SSE → 0 by interpolating every point, but its coefficients become extreme.

Minimize Σ βⱼ²

Keep all coefficients small (near the prior mean of zero). This prevents extreme values but may force the model to ignore real trends in the data.

🎚️
The regularization strength λ (or equivalently τ_β) controls the balance. Larger λ (smaller τ_β) → stronger penalty → more shrinkage toward zero.
Key Relationship λ, σ, and τ_β
Regularization parameter in terms of Bayesian quantities
λ = σ² / τ_β²
Bayesian view | Frequentist view
Choose τ_β (prior std. dev. on coefficients) | Tune λ via cross-validation
τ_β → ∞ ⟹ prior disappears; posterior mode = MLE | λ → 0 ⟹ no penalty; solution = OLS
τ_β → 0 ⟹ prior overwhelms data | λ → ∞ ⟹ all coefficients → 0
Posterior mode = penalized estimate | Argmin of SSE + λ‖β‖²
Watch Out When the Prior Is Too Strong
  • Setting τ_β = 0.04 means the 95% interval on any coefficient is only (−0.08, +0.08). That rules out nearly all meaningful slopes.
  • Under such a prior, even the true quadratic model is forced flat — it looks indistinguishable from a constant.
  • The log-Evidence then incorrectly favors high-degree polynomials, because only they (with many small coefficients) can still produce some curvature.
  • Bayes is not broken — comparing across priors shows the correctly specified prior + model still wins — but within a bad prior the model selection breaks.
Ridge, LASSO & Elastic Net
non-Bayesian penalized regression — and their Bayesian interpretations
Definition Ridge Regression
Ridge objective (L2 penalty)
argmin_β [ SSE(β) + λ Σⱼ βⱼ² ]
  • Penalty = sum of squared coefficient values (L2 norm squared).
  • Bayesian equivalent: independent Gaussian priors with mean 0 and std. dev. τ_β = σ/√λ.
  • Coefficients are shrunk toward zero but never set exactly to zero.
  • λ is tuned via cross-validation in the non-Bayesian setting.
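As a sketch of the shrinkage behavior (Python with scikit-learn rather than the lecture's R/glmnet; the noisy-quadratic data below are synthetic), the snippet fits a 7th-degree polynomial by OLS and by ridge. Note scikit-learn names the penalty strength `alpha`, which plays the role of λ here.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(1)
x = np.linspace(-1, 1, 30)
y = 0.5 + 2 * x - 1.5 * x**2 + rng.normal(scale=0.3, size=30)  # noisy quadratic

# 7th-degree polynomial features (drop the constant column; the model fits an intercept)
Phi = np.vander(x, 8, increasing=True)[:, 1:]

ols = LinearRegression().fit(Phi, y)
ridge = Ridge(alpha=10.0).fit(Phi, y)   # alpha here plays the role of lambda

# Ridge shrinks the coefficient vector relative to OLS, but no entry is exactly zero
print("OLS   ||beta|| =", round(np.linalg.norm(ols.coef_), 3))
print("Ridge ||beta|| =", round(np.linalg.norm(ridge.coef_), 3))
```

The ridge coefficient norm is guaranteed to be no larger than the OLS norm (otherwise the OLS solution would beat it on both terms of the penalized objective), yet every coefficient stays nonzero — shrinkage without selection.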
Definition LASSO Regression
LASSO objective (L1 penalty)
argmin_β [ SSE(β) + λ Σⱼ |βⱼ| ]
  • Penalty = sum of absolute coefficient values (L1 norm).
  • Bayesian equivalent: independent Double Exponential (Laplace) priors on each βⱼ.
  • Key property: LASSO can set coefficients exactly to zero — it performs automatic feature selection.
  • λ is tuned via cross-validation.
LASSO is capable of turning off features entirely. Ridge only shrinks them. This makes LASSO especially useful when you suspect many features are irrelevant.
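The selection behavior can be seen in a small sketch (Python/scikit-learn analog of the R workflow; the data and the choice of α = 0.5 are illustrative assumptions): ten features, only two of which truly matter.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(2)
# 10 candidate features, but only the first two actually drive the response
X = rng.normal(size=(100, 10))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=100)

# scikit-learn's alpha is the L1 penalty strength (the lambda of the objective above)
lasso = Lasso(alpha=0.5).fit(X, y)

print("coefficients:", lasso.coef_.round(2))
print("features set exactly to zero:", int(np.sum(lasso.coef_ == 0)))
</n```

The irrelevant coefficients are exactly 0.0, not merely small: the L1 penalty has turned those features off, while the two real effects survive (shrunk toward zero).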
Distribution The Double Exponential (Laplace) Distribution
Double Exponential pdf
p(x | μ, b) = (1/2b) exp(−|x − μ| / b)
  • Mean = μ  |  Variance = 2b²
  • Has a sharp peak at zero — far more density concentrated at the center than the Gaussian.
  • Also has heavier tails than the Gaussian with the same variance.
[Figure: Gaussian vs. Double Exponential densities centered at 0 — the DE peak at zero is sharper than the Gaussian peak]
🔑
The Double Exponential's concentrated mass at zero is why LASSO can lock coefficients exactly to zero. Gaussian priors (Ridge) spread probability more smoothly, so coefficients only approach — never hit — zero.
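The peak-and-tails claim is easy to verify numerically. A minimal sketch using SciPy, matching the variances (Laplace variance = 2b², so b = 1/√2 gives variance 1, the same as a standard normal):

```python
import numpy as np
from scipy.stats import laplace, norm

# Match the variances: Laplace variance = 2 b^2, so b = 1/sqrt(2) gives variance 1
b = 1 / np.sqrt(2)
x = np.array([0.0, 3.0])

p_laplace = laplace.pdf(x, loc=0, scale=b)
p_normal = norm.pdf(x)   # standard normal, variance 1

# Sharper peak at zero AND heavier tail at |x| = 3, despite equal variance
print("at 0:", p_laplace[0].round(4), "vs", p_normal[0].round(4))
print("at 3:", p_laplace[1].round(4), "vs", p_normal[1].round(4))
```

The Laplace density is higher than the Gaussian at both points: more mass piled at zero (favoring exact zeros) and heavier tails (tolerating the occasional large coefficient).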
Extension Elastic Net — Blending Ridge and LASSO
Elastic Net objective
argmin_β [ SSE(β) + λ [ (1−α) Σⱼ βⱼ² + α Σⱼ |βⱼ| ] ]
  • α = 0 → pure Ridge (L2 penalty only).
  • α = 1 → pure LASSO (L1 penalty only).
  • 0 < α < 1 → blend of both; inherits feature selection from LASSO and stability from Ridge.
  • Both λ (overall strength) and α (mixing ratio) must be tuned — typically via cross-validation or packages like caret / tidymodels.
📦
In R, the glmnet package implements all three. It provides a path of solutions over a sequence of λ values for a fixed α. Use caret or tidymodels to search over both λ and α simultaneously.
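For readers working in Python, a rough analog of the glmnet + caret workflow is scikit-learn's `ElasticNetCV` (a sketch with synthetic data; note the naming swap — scikit-learn's `l1_ratio` corresponds to the mixing parameter α above, and its `alpha` to the overall strength λ):

```python
import numpy as np
from sklearn.linear_model import ElasticNetCV

rng = np.random.default_rng(3)
X = rng.normal(size=(120, 8))
y = 2 * X[:, 0] - X[:, 1] + rng.normal(scale=0.5, size=120)

# Cross-validation searches a grid over both the mixing ratio (l1_ratio ~ alpha
# in the lecture's notation) and the penalty strength (alpha ~ lambda)
model = ElasticNetCV(l1_ratio=[0.1, 0.5, 0.9, 1.0], cv=5).fit(X, y)
print("chosen mixing ratio:", model.l1_ratio_)
print("chosen penalty strength:", round(model.alpha_, 4))
```

Like glmnet, `ElasticNetCV` computes a full path of solutions over a λ sequence for each candidate mixing ratio, then picks the pair with the best cross-validated error.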
Side-by-Side Ridge vs. LASSO — Quick Reference
Property                         | Ridge (L2)                   | LASSO (L1)
Penalty term                     | λ Σ βⱼ²                      | λ Σ |βⱼ|
Bayesian prior                   | Gaussian (Normal)            | Double Exponential (Laplace)
Coefficients reach exactly zero? | No — only approach 0         | Yes — sparse solutions
Feature selection?               | No                           | Yes (automatic)
Best when…                       | Many small, relevant effects | Few strong effects; many irrelevant features
R package                        | glmnet (α = 0)               | glmnet (α = 1)
Standardizing Variables
why we center and scale inputs — and how it enables a common prior specification
Motivation The Scale Problem

Consider predicting home-run distance from three inputs with very different scales:

Input             | Typical Value | 1-unit change means…
Air Pressure      | ≈ 100,000 Pa  | Negligible physical change
Ball Launch Speed | ≈ 100 mph     | Moderate change
Wind Speed        | ≈ 5 mph       | Large change in outcome
A prior of τ_β = 1 might be "strong" for Wind Speed yet effectively flat for Air Pressure, where a 1-unit (1 Pa) change means almost nothing, so the plausible coefficient scales differ by orders of magnitude. Sharing a single prior across raw features is therefore meaningless.
Procedure Center and Scale

Apply to both inputs and the response:

Step 1 — Center (subtract the sample mean)
x_centered = x_raw − mean(x_raw)
y_centered = y_raw − mean(y_raw)
Step 2 — Scale (divide by the sample std dev)
x_std = x_centered / sd(x_raw)
y_std = y_centered / sd(y_raw)
📐
After standardizing, all inputs and the response have mean ≈ 0 and std ≈ 1. Slopes now represent how many standard deviations of response change per standard deviation of input — a comparable unit for every feature.
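The two-step procedure is a one-liner in practice. A minimal sketch in Python/NumPy (the air-pressure values are made up for illustration; `ddof=1` gives the sample standard deviation):

```python
import numpy as np

# Hypothetical air-pressure readings in Pa, wildly off the unit scale
x_raw = np.array([101325.0, 100800.0, 102100.0, 99950.0])

# Step 1: center (subtract the sample mean); Step 2: scale (divide by the sample sd)
x_std = (x_raw - x_raw.mean()) / x_raw.std(ddof=1)

print("mean:", round(x_std.mean(), 10))
print("sd:  ", round(x_std.std(ddof=1), 10))
```

After the transformation the mean is 0 and the sample standard deviation is 1 regardless of the original units, which is what makes a single prior width τ_β meaningful across features.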
Interpretation Slopes in the Standardized Scale

In the standardized scale, if β_j = 1, then a +1 standard deviation change in x_j is associated with a +1 standard deviation change in the mean response. This gives all coefficients the same unit of measurement.

Prior Std. Dev. (τ_β) | Effect of a ±2 SD change in x (with β at its ±2τ_β prior limit)
τ_β = 0.1  | ≈ ±0.4 std dev change in mean y — small effect
τ_β = 1    | ≈ ±4 std dev change in mean y — moderate effect
τ_β = 10   | ≈ ±40 std dev change in mean y — enormous, rarely plausible
A prior of τ_β = 10 allows the mean response to swing 40× its own standard deviation from a single 2-SD input shift. That's almost always implausibly large — which is why very diffuse priors invite overfitting.
Practical Guidance Recommended Prior Specifications
  • Modern Bayesian practice: use weakly informative priors in the standardized scale.
  • τ_β between 1 and 10 is typical; values of 3–5 are common defaults.
  • Infinitely diffuse ("flat") priors are not recommended — they allow arbitrarily large effects.
  • τ_β < 1 only if you are confident the feature should have a negligible effect.
📏
Standardizing the response also anchors the prior on σ. An intercept-only model (no trend) would have σ ≈ 1 in the standardized scale. A prior that expects σ ≈ 1 is therefore asking "can the model beat a flat-line prediction?" — a sensible baseline.
Regularization Path
watching what happens to coefficients and errors as prior strength varies
Concept The Regularization Path — Coefficient View

In the lecture, τ_β was swept from 0.02 to 50 for the 7th-degree polynomial and the posterior mean of each coefficient βⱼ was tracked. The resulting chart has log(τ_β) on the x-axis and the coefficient posterior mean on the y-axis, with the MLE shown as an orange dashed reference line. The plot below is a schematic of that pattern:

[Figure (schematic): coefficient posterior mean βⱼ vs. log(τ_β), with reference marks at τ = 0.1, 1, 10. Where the prior dominates (left), βⱼ ≈ 0; coefficients start rising as τ_β grows, and the posterior means converge to the MLEs (OLS), shown as dashed lines, as the prior weakens and data influence grows.]
📊
How to read this plot: Each colored line is one coefficient βⱼ. On the far left (small τ_β, strong prior), every line hugs zero — the prior dominates and the data cannot move the coefficients. As τ_β grows (moving right), coefficients gradually escape toward their MLE values (shown as dashed orange lines). The lecture showed this for all 8 βⱼ simultaneously — some coefficients had large MLEs and needed stronger regularization to be controlled, while others had near-zero MLEs and barely moved at all.
  • τ_β < 0.1: all coefficients pinned near zero regardless of what the data say.
  • 0.1 < τ_β < 1: a few coefficients with strong data support start to move away from zero first.
  • 1 < τ_β < 10: large coefficient values are still prevented; the posterior balances prior and data.
  • τ_β > 10: prior becomes negligible; posterior means converge to the MLEs.
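The same sweep can be sketched in a few lines (Python/scikit-learn stand-in for the lecture's Bayesian fits; the noisy-quadratic data are synthetic and σ is assumed known at its true value of 0.3). Each τ_β is converted to a ridge penalty via λ = σ²/τ_β²:

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(4)
x = np.linspace(-1, 1, 30)
y = 0.5 + 2 * x - 1.5 * x**2 + rng.normal(scale=0.3, size=30)
Phi = np.vander(x, 8, increasing=True)[:, 1:]   # 7th-degree polynomial features

sigma = 0.3      # assumed known noise level
path = {}
for tau in [0.02, 1.0, 50.0]:
    lam = sigma**2 / tau**2          # lambda = sigma^2 / tau_beta^2
    path[tau] = Ridge(alpha=lam).fit(Phi, y).coef_
    print(f"tau_beta = {tau:>5}: ||beta|| = {np.linalg.norm(path[tau]):.3f}")
```

The coefficient norm grows monotonically with τ_β: tiny under the strong prior (τ_β = 0.02), moderate at τ_β = 1, and largest (near the MLEs) at τ_β = 50, reproducing the left-to-right pattern of the path plot.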
Bias-Variance Tradeoff RMSE vs. Regularization Strength

The lecture also plotted training set RMSE (blue) and hold-out set RMSE (red) for the 7th-degree polynomial as a function of the log-prior precision log(τ_β⁻²) — equivalently, increasing regularization as you move right. 100 data points were generated; 30 used for training, 70 held out for testing.

[Figure (schematic): training RMSE and hold-out RMSE vs. regularization strength, from weak regularization (large τ_β, left) to strong regularization (right). At the weak end a large gap between the curves signals overfitting; hold-out RMSE stays above training RMSE throughout, and the sweet spot is where hold-out RMSE bottoms out.]
📉
How to read this plot: The x-axis runs from weak regularization on the left to strong regularization on the right.
  • Far left (weak prior): Training RMSE is very low — the 7th-degree model fits the training points almost perfectly. But hold-out RMSE is enormous, revealing severe overfitting. The gap between training and hold-out RMSE is the visual signature of overfitting.
  • Middle (moderate prior): As regularization increases, coefficients are pulled toward zero, the model smooths out, and hold-out RMSE drops sharply. The sweet spot occurs where hold-out RMSE is minimized.
  • Far right (strong prior): Both training and hold-out RMSE plateau at a higher value — the model is underfitting because the prior forces all coefficients toward zero.
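The three regimes can be reproduced in a short sketch (Python/scikit-learn stand-in for the lecture's experiment; the data, split sizes, and the three λ values are illustrative, chosen to land in the weak, moderate, and strong regimes):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(5)
x = rng.uniform(-1, 1, size=100)
y = 0.5 + 2 * x - 1.5 * x**2 + rng.normal(scale=0.3, size=100)
Phi = np.vander(x, 8, increasing=True)[:, 1:]   # 7th-degree polynomial features

train, hold = slice(0, 30), slice(30, 100)      # 30 training / 70 hold-out, as in the lecture
results = {}
for lam in [1e-8, 1.0, 1e4]:                    # weak, moderate, strong regularization
    m = Ridge(alpha=lam).fit(Phi[train], y[train])
    rmse = lambda s: mean_squared_error(y[s], m.predict(Phi[s])) ** 0.5
    results[lam] = (rmse(train), rmse(hold))
    print(f"lambda = {lam:g}: train RMSE = {results[lam][0]:.3f}, "
          f"hold-out RMSE = {results[lam][1]:.3f}")
```

With almost no penalty, training RMSE is lowest but the hold-out RMSE sits well above it (the overfitting gap); as the penalty grows, training RMSE can only rise, and at the strong-penalty extreme both errors are inflated by underfitting.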
Summary Four Regions of the Regularization Path
Region           | Prior       | Coefficients             | Training RMSE   | Hold-out RMSE
Very strong reg. | τ_β ≪ 1     | All ≈ 0; prior dominates | High (underfit) | High (underfit)
Moderate reg.    | τ_β ~ 1–5   | Small but non-zero       | Moderate        | Low ✓ (sweet spot)
Weak reg.        | τ_β ~ 10–25 | Moderate values          | Low             | Moderate
No reg. (OLS)    | τ_β → ∞     | Equal to MLEs (large)    | Very low        | Very high (overfit)
Glossary
key terms and definitions from lecture 11
Regularization
A technique that penalizes large coefficient values to prevent overfitting. Implemented through a prior in Bayesian models, or through an explicit penalty term added to the objective in non-Bayesian methods.
Ridge Regression
Penalized regression with an L2 (squared) penalty: argmin SSE(β) + λ Σ βⱼ². Coefficients are shrunk toward zero but never exactly zero. Bayesian equivalent uses Gaussian priors with mean 0.
LASSO Regression
Penalized regression with an L1 (absolute value) penalty: argmin SSE(β) + λ Σ |βⱼ|. Can set coefficients exactly to zero, performing automatic feature selection. Bayesian equivalent uses Double Exponential (Laplace) priors.
Elastic Net
A blend of Ridge and LASSO: argmin SSE(β) + λ[(1−α)Σβⱼ² + αΣ|βⱼ|]. The mixing parameter α controls the balance. α=0 is pure Ridge; α=1 is pure LASSO.
Regularization Parameter (λ)
The scalar that controls the overall strength of penalization. Defined as λ = σ²/τ_β² in the Bayesian view. Larger λ → stronger penalty → more shrinkage. Tuned via cross-validation in non-Bayesian settings.
Prior Standard Deviation (τ_β)
The standard deviation of the Gaussian prior placed on regression coefficients. Controls how spread out the prior belief about coefficients is. Small τ_β = strong regularization; large τ_β = weak regularization (approaches MLE).
Standardization (Z-scoring)
Transforming a variable by subtracting its mean and dividing by its standard deviation: z = (x − mean) / sd. After standardization, all variables have mean 0 and std 1, enabling a common prior specification across features with different scales.
Double Exponential Distribution (Laplace)
A symmetric distribution with a sharper peak at zero and heavier tails compared to a Gaussian of the same variance. Its pdf is (1/2b)·exp(−|x−μ|/b). Used as the Bayesian prior corresponding to LASSO regression.
Regularization Path
The sequence of coefficient estimates traced out as the regularization strength λ (or τ_β) is varied from very strong to very weak. Coefficients start near zero and converge to the OLS/MLE estimates as the prior weakens.
Maximum Likelihood Estimate (MLE)
The parameter estimate that maximizes the likelihood (equivalently minimizes SSE for Gaussian noise). Corresponds to the posterior mode when a completely flat/diffuse prior is used. Can produce extreme coefficient values in overparameterized models.
Weakly Informative Prior
A prior that encodes broad knowledge about plausible parameter ranges without being strongly opinionated. In standardized scale, τ_β in the range 1–10 is considered weakly informative and is the recommended default in modern Bayesian practice.
glmnet (R package)
An R package implementing Ridge, LASSO, and Elastic Net regression. Provides a full regularization path over a grid of λ values for a fixed α. Combined with caret or tidymodels for tuning both λ and α via cross-validation.