INFSCI 2595 · Fall 2025 · Week 11

Lecture 11 — Regularization

Priors as penalties · Ridge · LASSO · Elastic Net · Standardization
Overview — What is Regularization?
lecture 11 — tying Bayesian priors to classical penalized regression
Context The Problem We're Solving
  • We can engineer many features: polynomials, splines, interactions, dummy variables.
  • Complex models fit training data well but overfit — poor out-of-sample performance.
  • Cross-validation and log-Evidence help detect overfitting after the fact.
  • Regularization is a tool for controlling overfitting directly, even in a complex model.
Key insight: overfitting is associated with large coefficient values. If we constrain the coefficients to be small, the model cannot chase noise — even if it has many parameters.
Signature Observation Large Coefficients → Overfit

In the noisy quadratic demo, higher-order polynomial fits had extremely large coefficient values compared to the true quadratic. The 7th-degree model had coefficients ranging into the tens or twenties, while the quadratic's stayed near ±2.

[Figure: coefficient values for the quadratic model (β₀–β₂) vs. the 7th-degree model (β₀–β₇) — the overfit model's coefficients reach extreme values]
💡
The 7th-degree polynomial with a diffuse prior has some coefficients near ±25. With an informative prior (τ_β = 1), those same coefficients are constrained near ±1 and the predictions look much more quadratic.
Key Takeaway Effect of Prior Strength on the 7th-Degree Model
🌊
Weak / Diffuse Prior (τ_β = 25):
The prior has almost no influence — the likelihood (data) dominates. Coefficients can grow large. The 7th-degree model interpolates the noise, producing wild oscillations. Training error is low; hold-out error is very high.
⚖️
Informative Prior (τ_β = 1):
A good balance. The prior prevents extreme values but still lets the data speak. The 7th-degree model's predictions look nearly quadratic — the prior has effectively smoothed out the noise chasing. Quadratic is still selected as best by log-Evidence.
🔒
Very Strong Prior (τ_β = 0.04):
The prior overwhelms the data. All coefficients are forced near zero. Even the quadratic model behaves like a flat line (constant). The log-Evidence now incorrectly selects the 6th-degree model as best — the prior has broken model selection.
Big Picture Regularization as a Spectrum
Prior dominates
(underfitting)
Sweet spot
(generalization)
Data dominates
(overfitting)
Prior Strength  | τ_β   | Effect on Coefficients  | Typical Outcome
Very Strong     | < 0.1 | All pushed hard to zero | Underfitting — no trends visible
Informative     | 1 – 5 | Moderate constraint     | Correct model often identified
Weak / Diffuse  | > 10  | Converge to MLEs        | Overfitting risk in complex models
Prior as Regularizer
showing that a Gaussian prior on β is equivalent to Ridge penalization
Derivation The Log-Posterior with a Gaussian Prior

Start from Bayes' rule. The un-normalized log-posterior on β (assuming σ known) is:

Log-posterior decomposition
log p(β | y, Φ, σ) ∝ log p(y | Φ, β, σ) + log p(β)

With independent Gaussian priors centered at zero:

Independent Gaussian prior (prior mean = 0)
log p(β) ∝ −(1 / (2τ_β²)) Σⱼ βⱼ²

The log-likelihood is proportional to the negative SSE:

Log-likelihood contribution
log p(y | Φ, β, σ) ∝ −(1 / (2σ²)) SSE(β)

Combine and factor out 1/σ²:

Full log-posterior (factored)
log p(β | …) ∝ −(1 / (2σ²)) [ SSE(β) + (σ²/τ_β²) Σⱼ βⱼ² ]
Defining λ = σ² / τ_β², the term inside the brackets becomes SSE(β) + λ Σⱼ βⱼ². Maximizing the posterior is equivalent to minimizing this penalized objective.
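This equivalence can be checked numerically. The sketch below (Python with NumPy/SciPy, using synthetic data rather than the lecture's example) computes the ridge closed-form solution with λ = σ²/τ_β² and compares it to the posterior mode found by minimizing the negative log-posterior directly; the two estimates should agree.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)

# Hypothetical design matrix and response, just to check the algebra numerically
Phi = rng.normal(size=(30, 4))
beta_true = np.array([1.5, -2.0, 0.0, 0.5])
sigma, tau_beta = 1.0, 0.5
y = Phi @ beta_true + sigma * rng.normal(size=30)

# Ridge closed form with lambda = sigma^2 / tau_beta^2:
#   beta = (Phi'Phi + lambda * I)^{-1} Phi'y
lam = sigma**2 / tau_beta**2
beta_ridge = np.linalg.solve(Phi.T @ Phi + lam * np.eye(4), Phi.T @ y)

# Posterior mode: minimize the negative log-posterior directly
def neg_log_post(b):
    sse = np.sum((y - Phi @ b) ** 2)
    return sse / (2 * sigma**2) + np.sum(b**2) / (2 * tau_beta**2)

beta_map = minimize(neg_log_post, np.zeros(4)).x
print(np.round(beta_ridge, 4))
print(np.round(beta_map, 4))
```

The two printed vectors match to numerical precision: maximizing the Gaussian-prior posterior and minimizing the ridge objective are the same computation.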
Insight The Prior Acts as a Floor on the SSE

Even if the model drives SSE toward zero (perfect training fit), the penalty term λ Σ βⱼ² increases. The optimization must balance these two competing goals:

Minimize SSE

Fit the training data as closely as possible. A very complex model can drive SSE → 0 by interpolating every point, but its coefficients become extreme.

Minimize Σ βⱼ²

Keep all coefficients small (near the prior mean of zero). This prevents extreme values but may force the model to ignore real trends in the data.

🎚️
The regularization strength λ (or equivalently τ_β) controls the balance. Larger λ (smaller τ_β) → stronger penalty → more shrinkage toward zero.
Key Relationship λ, σ, and τ_β
Regularization parameter in terms of Bayesian quantities
λ = σ² / τ_β²
Bayesian view | Frequentist view
Choose τ_β (prior std. dev. on coefficients) | Tune λ via cross-validation
τ_β → ∞ ⟹ prior disappears; posterior mode = MLE | λ → 0 ⟹ no penalty; solution = OLS
τ_β → 0 ⟹ prior overwhelms data | λ → ∞ ⟹ all coefficients → 0
Posterior mode = penalized estimate | Argmin of SSE + λ‖β‖²
Watch Out When the Prior Is Too Strong
  • Setting τ_β = 0.04 means the 95% interval on any coefficient is only (−0.08, +0.08). That rules out nearly all meaningful slopes.
  • Under such a prior, even the true quadratic model is forced flat — it looks indistinguishable from a constant.
  • The log-Evidence then incorrectly favors high-degree polynomials, because only they (with many small coefficients) can still produce some curvature.
  • Bayes is not broken — comparing across priors shows the correctly specified prior + model still wins — but within a bad prior the model selection breaks.
Ridge, LASSO & Elastic Net
non-Bayesian penalized regression — and their Bayesian interpretations
Definition Ridge Regression
Ridge objective (L2 penalty)
argmin_β [ SSE(β) + λ Σⱼ βⱼ² ]
  • Penalty = sum of squared coefficient values (L2 norm squared).
  • Bayesian equivalent: independent Gaussian priors with mean 0 and std. dev. τ_β = σ/√λ.
  • Coefficients are shrunk toward zero but never set exactly to zero.
  • λ is tuned via cross-validation in the non-Bayesian setting.
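As a sketch of the shrinkage behavior (Python with scikit-learn rather than the lecture's R/glmnet; the noisy-quadratic data below are synthetic), the snippet fits a 7th-degree polynomial by OLS and by ridge. Note scikit-learn names the penalty strength `alpha`, which plays the role of λ here.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(1)
x = np.linspace(-1, 1, 30)
y = 0.5 + 2 * x - 1.5 * x**2 + rng.normal(scale=0.3, size=30)  # noisy quadratic

# 7th-degree polynomial features (drop the constant column; the model fits an intercept)
Phi = np.vander(x, 8, increasing=True)[:, 1:]

ols = LinearRegression().fit(Phi, y)
ridge = Ridge(alpha=10.0).fit(Phi, y)   # alpha here plays the role of lambda

# Ridge shrinks the coefficient vector relative to OLS, but no entry is exactly zero
print("OLS   ||beta|| =", round(np.linalg.norm(ols.coef_), 3))
print("Ridge ||beta|| =", round(np.linalg.norm(ridge.coef_), 3))
```

The ridge coefficient norm is guaranteed to be no larger than the OLS norm (otherwise the OLS solution would beat it on both terms of the penalized objective), yet every coefficient stays nonzero — shrinkage without selection.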
Definition LASSO Regression
LASSO objective (L1 penalty)
argmin_β [ SSE(β) + λ Σⱼ |βⱼ| ]
  • Penalty = sum of absolute coefficient values (L1 norm).
  • Bayesian equivalent: independent Double Exponential (Laplace) priors on each βⱼ.
  • Key property: LASSO can set coefficients exactly to zero — it performs automatic feature selection.
  • λ is tuned via cross-validation.
LASSO is capable of turning off features entirely. Ridge only shrinks them. This makes LASSO especially useful when you suspect many features are irrelevant.
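The selection behavior can be seen in a small sketch (Python/scikit-learn analog of the R workflow; the data and the choice of α = 0.5 are illustrative assumptions): ten features, only two of which truly matter.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(2)
# 10 candidate features, but only the first two actually drive the response
X = rng.normal(size=(100, 10))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=100)

# scikit-learn's alpha is the L1 penalty strength (the lambda of the objective above)
lasso = Lasso(alpha=0.5).fit(X, y)

print("coefficients:", lasso.coef_.round(2))
print("features set exactly to zero:", int(np.sum(lasso.coef_ == 0)))
</n```

The irrelevant coefficients are exactly 0.0, not merely small: the L1 penalty has turned those features off, while the two real effects survive (shrunk toward zero).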
Distribution The Double Exponential (Laplace) Distribution
Double Exponential pdf
p(x | μ, b) = (1/2b) exp(−|x − μ| / b)
  • Mean = μ  |  Variance = 2b²
  • Has a sharp peak at zero — far more density concentrated at the center than the Gaussian.
  • Also has heavier tails than the Gaussian with the same variance.
[Figure: Gaussian vs. Double Exponential densities centered at 0 — the DE peak at zero is sharper than the Gaussian peak]
🔑
The Double Exponential's concentrated mass at zero is why LASSO can lock coefficients exactly to zero. Gaussian priors (Ridge) spread probability more smoothly, so coefficients only approach — never hit — zero.
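The peak-and-tails claim is easy to verify numerically. A minimal sketch using SciPy, matching the variances (Laplace variance = 2b², so b = 1/√2 gives variance 1, the same as a standard normal):

```python
import numpy as np
from scipy.stats import laplace, norm

# Match the variances: Laplace variance = 2 b^2, so b = 1/sqrt(2) gives variance 1
b = 1 / np.sqrt(2)
x = np.array([0.0, 3.0])

p_laplace = laplace.pdf(x, loc=0, scale=b)
p_normal = norm.pdf(x)   # standard normal, variance 1

# Sharper peak at zero AND heavier tail at |x| = 3, despite equal variance
print("at 0:", p_laplace[0].round(4), "vs", p_normal[0].round(4))
print("at 3:", p_laplace[1].round(4), "vs", p_normal[1].round(4))
```

The Laplace density is higher than the Gaussian at both points: more mass piled at zero (favoring exact zeros) and heavier tails (tolerating the occasional large coefficient).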
Extension Elastic Net — Blending Ridge and LASSO
Elastic Net objective
argmin_β [ SSE(β) + λ [ (1−α) Σⱼ βⱼ² + α Σⱼ |βⱼ| ] ]
  • α = 0 → pure Ridge (L2 penalty only).
  • α = 1 → pure LASSO (L1 penalty only).
  • 0 < α < 1 → blend of both; inherits feature selection from LASSO and stability from Ridge.
  • Both λ (overall strength) and α (mixing ratio) must be tuned — typically via cross-validation or packages like caret / tidymodels.
📦
In R, the glmnet package implements all three. It provides a path of solutions over a sequence of λ values for a fixed α. Use caret or tidymodels to search over both λ and α simultaneously.
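For readers working in Python, a rough analog of the glmnet + caret workflow is scikit-learn's `ElasticNetCV` (a sketch with synthetic data; note the naming swap — scikit-learn's `l1_ratio` corresponds to the mixing parameter α above, and its `alpha` to the overall strength λ):

```python
import numpy as np
from sklearn.linear_model import ElasticNetCV

rng = np.random.default_rng(3)
X = rng.normal(size=(120, 8))
y = 2 * X[:, 0] - X[:, 1] + rng.normal(scale=0.5, size=120)

# Cross-validation searches a grid over both the mixing ratio (l1_ratio ~ alpha
# in the lecture's notation) and the penalty strength (alpha ~ lambda)
model = ElasticNetCV(l1_ratio=[0.1, 0.5, 0.9, 1.0], cv=5).fit(X, y)
print("chosen mixing ratio:", model.l1_ratio_)
print("chosen penalty strength:", round(model.alpha_, 4))
```

Like glmnet, `ElasticNetCV` computes a full path of solutions over a λ sequence for each candidate mixing ratio, then picks the pair with the best cross-validated error.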
Side-by-Side Ridge vs. LASSO — Quick Reference
Property                         | Ridge (L2)                   | LASSO (L1)
Penalty term                     | λ Σ βⱼ²                      | λ Σ |βⱼ|
Bayesian prior                   | Gaussian (Normal)            | Double Exponential (Laplace)
Coefficients reach exactly zero? | No — only approach 0         | Yes — sparse solutions
Feature selection?               | No                           | Yes (automatic)
Best when…                       | Many small, relevant effects | Few strong effects; many irrelevant features
R package                        | glmnet (α = 0)               | glmnet (α = 1)
Standardizing Variables
why we center and scale inputs — and how it enables a common prior specification
Motivation The Scale Problem

Consider predicting home-run distance from three inputs with very different scales:

Input             | Typical Value | 1-unit change means…
Air Pressure      | ≈ 100,000 Pa  | Negligible physical change
Ball Launch Speed | ≈ 100 mph     | Moderate change
Wind Speed        | ≈ 5 mph       | Large change in outcome
A prior of τ_β = 1 might be "strong" for Wind Speed yet effectively flat for Air Pressure, where a 1-unit (1 Pa) change means almost nothing, so the plausible coefficient scales differ by orders of magnitude. Sharing a single prior across raw features is therefore meaningless.
Procedure Center and Scale

Apply to both inputs and the response:

Step 1 — Center (subtract the sample mean)
x_centered = x_raw − mean(x_raw)
y_centered = y_raw − mean(y_raw)
Step 2 — Scale (divide by the sample std dev)
x_std = x_centered / sd(x_raw)
y_std = y_centered / sd(y_raw)
📐
After standardizing, all inputs and the response have mean ≈ 0 and std ≈ 1. Slopes now represent how many standard deviations of response change per standard deviation of input — a comparable unit for every feature.
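The two-step procedure is a one-liner in practice. A minimal sketch in Python/NumPy (the air-pressure values are made up for illustration; `ddof=1` gives the sample standard deviation):

```python
import numpy as np

# Hypothetical air-pressure readings in Pa, wildly off the unit scale
x_raw = np.array([101325.0, 100800.0, 102100.0, 99950.0])

# Step 1: center (subtract the sample mean); Step 2: scale (divide by the sample sd)
x_std = (x_raw - x_raw.mean()) / x_raw.std(ddof=1)

print("mean:", round(x_std.mean(), 10))
print("sd:  ", round(x_std.std(ddof=1), 10))
```

After the transformation the mean is 0 and the sample standard deviation is 1 regardless of the original units, which is what makes a single prior width τ_β meaningful across features.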
Interpretation Slopes in the Standardized Scale

In the standardized scale, if β_j = 1, then a +1 standard deviation change in x_j is associated with a +1 standard deviation change in the mean response. This gives all coefficients the same unit of measurement.

Prior Std. Dev. (τ_β) | Effect of a ±2 SD change in x (with β at its ±2τ_β prior limit)
τ_β = 0.1  | ≈ ±0.4 std dev change in mean y — small effect
τ_β = 1    | ≈ ±4 std dev change in mean y — moderate effect
τ_β = 10   | ≈ ±40 std dev change in mean y — enormous, rarely plausible
A prior of τ_β = 10 allows the mean response to swing 40× its own standard deviation from a single 2-SD input shift. That's almost always implausibly large — which is why very diffuse priors invite overfitting.
Practical Guidance Recommended Prior Specifications
  • Modern Bayesian practice: use weakly informative priors in the standardized scale.
  • τ_β between 1 and 10 is typical; values of 3–5 are common defaults.
  • Infinitely diffuse ("flat") priors are not recommended — they allow arbitrarily large effects.
  • τ_β < 1 only if you are confident the feature should have a negligible effect.
📏
Standardizing the response also anchors the prior on σ. An intercept-only model (no trend) would have σ ≈ 1 in the standardized scale. A prior that expects σ ≈ 1 is therefore asking "can the model beat a flat-line prediction?" — a sensible baseline.
Regularization Path
watching what happens to coefficients and errors as prior strength varies
Concept The Regularization Path — Coefficient View

In the lecture, τ_β was swept from 0.02 to 50 for the 7th-degree polynomial and the posterior mean of each coefficient βⱼ was tracked. The resulting chart has log(τ_β) on the x-axis and the coefficient posterior mean on the y-axis, with the MLE shown as an orange dashed reference line. The plot below is a schematic of that pattern:

[Figure (schematic): coefficient posterior mean βⱼ vs. log(τ_β), with reference marks at τ = 0.1, 1, 10. Where the prior dominates (left), βⱼ ≈ 0; coefficients start rising as τ_β grows, and the posterior means converge to the MLEs (OLS), shown as dashed lines, as the prior weakens and data influence grows.]
📊
How to read this plot: Each colored line is one coefficient βⱼ. On the far left (small τ_β, strong prior), every line hugs zero — the prior dominates and the data cannot move the coefficients. As τ_β grows (moving right), coefficients gradually escape toward their MLE values (shown as dashed orange lines). The lecture showed this for all 8 βⱼ simultaneously — some coefficients had large MLEs and needed stronger regularization to be controlled, while others had near-zero MLEs and barely moved at all.
  • τ_β < 0.1: all coefficients pinned near zero regardless of what the data say.
  • 0.1 < τ_β < 1: a few coefficients with strong data support start to move away from zero first.
  • 1 < τ_β < 10: large coefficient values are still prevented; the posterior balances prior and data.
  • τ_β > 10: prior becomes negligible; posterior means converge to the MLEs.
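The same sweep can be sketched in a few lines (Python/scikit-learn stand-in for the lecture's Bayesian fits; the noisy-quadratic data are synthetic and σ is assumed known at its true value of 0.3). Each τ_β is converted to a ridge penalty via λ = σ²/τ_β²:

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(4)
x = np.linspace(-1, 1, 30)
y = 0.5 + 2 * x - 1.5 * x**2 + rng.normal(scale=0.3, size=30)
Phi = np.vander(x, 8, increasing=True)[:, 1:]   # 7th-degree polynomial features

sigma = 0.3      # assumed known noise level
path = {}
for tau in [0.02, 1.0, 50.0]:
    lam = sigma**2 / tau**2          # lambda = sigma^2 / tau_beta^2
    path[tau] = Ridge(alpha=lam).fit(Phi, y).coef_
    print(f"tau_beta = {tau:>5}: ||beta|| = {np.linalg.norm(path[tau]):.3f}")
```

The coefficient norm grows monotonically with τ_β: tiny under the strong prior (τ_β = 0.02), moderate at τ_β = 1, and largest (near the MLEs) at τ_β = 50, reproducing the left-to-right pattern of the path plot.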
Bias-Variance Tradeoff RMSE vs. Regularization Strength

The lecture also plotted training set RMSE (blue) and hold-out set RMSE (red) for the 7th-degree polynomial as a function of the log-prior precision log(τ_β⁻²) — equivalently, increasing regularization as you move right. 100 data points were generated; 30 used for training, 70 held out for testing.

[Figure (schematic): training RMSE and hold-out RMSE vs. regularization strength, from weak regularization (large τ_β, left) to strong regularization (right). At the weak end a large gap between the curves signals overfitting; hold-out RMSE stays above training RMSE throughout, and the sweet spot is where hold-out RMSE bottoms out.]
📉
How to read this plot: The x-axis runs from weak regularization on the left to strong regularization on the right.
  • Far left (weak prior): Training RMSE is very low — the 7th-degree model fits the training points almost perfectly. But hold-out RMSE is enormous, revealing severe overfitting. The gap between training and hold-out RMSE is the visual signature of overfitting.
  • Middle (moderate prior): As regularization increases, coefficients are pulled toward zero, the model smooths out, and hold-out RMSE drops sharply. The sweet spot occurs where hold-out RMSE is minimized.
  • Far right (strong prior): Both training and hold-out RMSE plateau at a higher value — the model is underfitting because the prior forces all coefficients toward zero.
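The three regimes can be reproduced in a short sketch (Python/scikit-learn stand-in for the lecture's experiment; the data, split sizes, and the three λ values are illustrative, chosen to land in the weak, moderate, and strong regimes):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(5)
x = rng.uniform(-1, 1, size=100)
y = 0.5 + 2 * x - 1.5 * x**2 + rng.normal(scale=0.3, size=100)
Phi = np.vander(x, 8, increasing=True)[:, 1:]   # 7th-degree polynomial features

train, hold = slice(0, 30), slice(30, 100)      # 30 training / 70 hold-out, as in the lecture
results = {}
for lam in [1e-8, 1.0, 1e4]:                    # weak, moderate, strong regularization
    m = Ridge(alpha=lam).fit(Phi[train], y[train])
    rmse = lambda s: mean_squared_error(y[s], m.predict(Phi[s])) ** 0.5
    results[lam] = (rmse(train), rmse(hold))
    print(f"lambda = {lam:g}: train RMSE = {results[lam][0]:.3f}, "
          f"hold-out RMSE = {results[lam][1]:.3f}")
```

With almost no penalty, training RMSE is lowest but the hold-out RMSE sits well above it (the overfitting gap); as the penalty grows, training RMSE can only rise, and at the strong-penalty extreme both errors are inflated by underfitting.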
Summary Four Regions of the Regularization Path
Region           | Prior       | Coefficients             | Training RMSE   | Hold-out RMSE
Very strong reg. | τ_β ≪ 1     | All ≈ 0; prior dominates | High (underfit) | High (underfit)
Moderate reg.    | τ_β ~ 1–5   | Small but non-zero       | Moderate        | Low ✓ (sweet spot)
Weak reg.        | τ_β ~ 10–25 | Moderate values          | Low             | Moderate
No reg. (OLS)    | τ_β → ∞     | Equal to MLEs (large)    | Very low        | Very high (overfit)
Glossary
key terms and definitions from lecture 11
Regularization
A technique that penalizes large coefficient values to prevent overfitting. Implemented through a prior in Bayesian models, or through an explicit penalty term added to the objective in non-Bayesian methods.
Ridge Regression
Penalized regression with an L2 (squared) penalty: argmin SSE(β) + λ Σ βⱼ². Coefficients are shrunk toward zero but never exactly zero. Bayesian equivalent uses Gaussian priors with mean 0.
LASSO Regression
Penalized regression with an L1 (absolute value) penalty: argmin SSE(β) + λ Σ |βⱼ|. Can set coefficients exactly to zero, performing automatic feature selection. Bayesian equivalent uses Double Exponential (Laplace) priors.
Elastic Net
A blend of Ridge and LASSO: argmin SSE(β) + λ[(1−α)Σβⱼ² + αΣ|βⱼ|]. The mixing parameter α controls the balance. α=0 is pure Ridge; α=1 is pure LASSO.
Regularization Parameter (λ)
The scalar that controls the overall strength of penalization. Defined as λ = σ²/τ_β² in the Bayesian view. Larger λ → stronger penalty → more shrinkage. Tuned via cross-validation in non-Bayesian settings.
Prior Standard Deviation (τ_β)
The standard deviation of the Gaussian prior placed on regression coefficients. Controls how spread out the prior belief about coefficients is. Small τ_β = strong regularization; large τ_β = weak regularization (approaches MLE).
Standardization (Z-scoring)
Transforming a variable by subtracting its mean and dividing by its standard deviation: z = (x − mean) / sd. After standardization, all variables have mean 0 and std 1, enabling a common prior specification across features with different scales.
Double Exponential Distribution (Laplace)
A symmetric distribution with a sharper peak at zero and heavier tails compared to a Gaussian of the same variance. Its pdf is (1/2b)·exp(−|x−μ|/b). Used as the Bayesian prior corresponding to LASSO regression.
Regularization Path
The sequence of coefficient estimates traced out as the regularization strength λ (or τ_β) is varied from very strong to very weak. Coefficients start near zero and converge to the OLS/MLE estimates as the prior weakens.
Maximum Likelihood Estimate (MLE)
The parameter estimate that maximizes the likelihood (equivalently minimizes SSE for Gaussian noise). Corresponds to the posterior mode when a completely flat/diffuse prior is used. Can produce extreme coefficient values in overparameterized models.
Weakly Informative Prior
A prior that encodes broad knowledge about plausible parameter ranges without being strongly opinionated. In standardized scale, τ_β in the range 1–10 is considered weakly informative and is the recommended default in modern Bayesian practice.
glmnet (R package)
An R package implementing Ridge, LASSO, and Elastic Net regression. Provides a full regularization path over a grid of λ values for a fixed α. Combined with caret or tidymodels for tuning both λ and α via cross-validation.