INFSCI 2595 · Machine Learning · Fall 2025

Week 7 — Linear Models

Lecture 7 · Introduction to Linear Models · University of Pittsburgh
The Linear Model
// Notation · two-part structure · likelihood · conditional independence
🎯 What is a Linear Model? KEY IDEA

A linear model predicts a continuous response y from an input x using a straight-line mean trend. We have N observed input–output pairs: (xₙ, yₙ) for n = 1, …, N.

This is a regression problem — the response is continuous (a real number), not a discrete class.
📋 Preferred Notation FORMULA

Rather than writing yₙ = β₀ + β₁xₙ + εₙ (which hides the model structure), the lecture uses the likelihood notation:

Complete probabilistic model for the n-th observation
yₙ | xₙ, β₀, β₁, σ ~ Normal( yₙ | μₙ, σ )
μₙ = β₀ + β₁xₙ ← the linear predictor
Why is this preferred? It explicitly shows the likelihood — the probability of observing yₙ given the inputs and parameters — and cleanly separates the two model components.
⚙️ Two Distinct Parts of the Model
Deterministic Part
μₙ = β₀ + β₁xₙ

The linear predictor — the mean trend. Given β₀, β₁, and xₙ, the value of μₙ is always the same. No randomness here.

Stochastic Part
yₙ | μₙ, σ ~ Normal(μₙ, σ)

The noise — how observations scatter around the mean. σ controls the width of this scatter. Higher σ = more noise.

Key insight: the mean μₙ changes with the input xₙ. Each observation has its own Normal distribution centered on a different μₙ. The variability σ around that mean is constant across all xₙ.
🔗 Conditional Independence & the Joint Likelihood

We assume observations are conditionally independent given the input and parameters. This means the joint probability of all N responses factors into a product of N individual likelihoods:

p(y | x, β, σ) = ∏ₙ p(yₙ | xₙ, β, σ) = ∏ₙ Normal(yₙ | μₙ, σ)

Each factor is the n-th observation's likelihood. Multiplying them together gives the full data likelihood over all N observations.
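To make the factorization concrete, here is a minimal NumPy sketch (the lecture's own code is in R; the seed, data, and parameter values here are illustrative) confirming that the product of per-observation likelihoods equals the exponentiated sum of per-observation log-likelihoods:

```python
import numpy as np

# Illustrative parameter choices (not from the lecture data set).
rng = np.random.default_rng(0)
beta0, beta1, sigma = 1.0, 1.0, 0.5
x = rng.normal(size=5)
mu = beta0 + beta1 * x                    # deterministic part
y = rng.normal(loc=mu, scale=sigma)       # stochastic part

def normal_logpdf(y, mu, sigma):
    # Log of the Normal density with mean mu and standard deviation sigma.
    return -0.5 * np.log(2 * np.pi * sigma**2) - (y - mu)**2 / (2 * sigma**2)

# Conditional independence: the joint log-likelihood is the SUM of the
# per-observation log-likelihoods, so the joint likelihood is the PRODUCT
# of the per-observation likelihoods.
log_lik = normal_logpdf(y, mu, sigma).sum()
lik_product = np.prod(np.exp(normal_logpdf(y, mu, sigma)))
assert np.isclose(np.exp(log_lik), lik_product)
```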

📍 Where Does the Model Sit?

Week 7 begins the supervised learning deep-dive. Linear models are the entry point to a hierarchy:

1. Linear Models (this week): straight-line mean trend, Gaussian likelihood, OLS / MLE connection.
2. Generalized Linear Models: non-Gaussian likelihoods (e.g. Binomial for classification); the mean trend is linked to the linear predictor via a link function.
3. Non-linear Methods: neural networks, trees, Gaussian processes; the mean trend is no longer constrained to be linear in the parameters.
Parameters & Interactive Explorer
// β₀ intercept · β₁ slope · σ noise · ±1σ / ±2σ / ±3σ ribbons
🎛️ Linear Model Explorer INTERACTIVE

Adjust the sliders to see how each parameter shapes the mean trend (white line) and the observation ribbons. The ribbon shows where ~68 / 95 / 99.7% of observations are expected to fall.

📍 Intercept β₀ WHAT IT DOES

The intercept β₀ is the value of the mean μ when x = 0.

μ|ₓ₌₀ = β₀ + β₁·0 = β₀

Changing β₀ shifts the entire line up or down without rotating it. Lines with different β₀ but the same β₁ are perfectly parallel.

📐 Slope β₁ WHAT IT DOES

The slope β₁ determines how much μ changes per unit increase in x.

Δμ = β₁ · Δx
  • β₁ = 0 → flat line, x has no effect on y
  • β₁ > 0 → positive relationship
  • β₁ < 0 → negative (inverse) relationship
🔊 Noise σ STOCHASTIC PART

σ controls the spread of observations around the mean trend. It is the standard deviation of the Normal likelihood. σ does not change the mean trend — it only changes how tightly observations cluster around it.

Interval   Coverage (Normal)   What it shows
μ ± 1σ     ≈ 68.3%             Inner ribbon: most observations land here
μ ± 2σ     ≈ 95.4%             Middle ribbon
μ ± 3σ     ≈ 99.7%             Outer ribbon: almost all observations
Important: the ribbon width is constant across all values of x. The variability σ does not grow or shrink with the input. This is the homoscedasticity assumption.
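The coverage numbers in the table can be checked by simulation. A NumPy sketch (illustrative seed and parameter values; the lecture itself uses R) draws many observations from the model and counts how often they land inside each ribbon:

```python
import numpy as np

# Simulate observations from the linear model and measure how often they
# fall inside the +/- 1, 2, 3 sigma ribbons around the mean trend.
rng = np.random.default_rng(1)
beta0, beta1, sigma = 1.0, 1.0, 0.5       # illustrative values
x = rng.uniform(-3, 3, size=200_000)
mu = beta0 + beta1 * x
y = rng.normal(mu, sigma)

for k, target in [(1, 0.683), (2, 0.954), (3, 0.997)]:
    coverage = np.mean(np.abs(y - mu) <= k * sigma)
    print(f"+/-{k} sigma: {coverage:.3f} (theory {target})")
```

Because σ is constant across x (homoscedasticity), the same coverage holds at every input location.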
💡 The Observation Distribution Changes with x

At each input value xₙ, the response follows a different Normal distribution:

yₙ | xₙ = c ~ Normal( β₀ + β₁·c, σ )

Example from the lecture: at xₙ = −1.5, β₀=1, β₁=1, σ=0.5:

yₙ | xₙ=−1.5 ~ Normal(1 + 1×(−1.5), 0.5) = Normal(−0.5, 0.5)
The violin plots in the lecture visualize this: at each input location, the density of observations is Gaussian. The violins have the same width everywhere (same σ) but are centered on different means.
Model Fitting
// MLE · log-likelihood · SSE · OLS equivalence
🎯 The Learning Objective KEY IDEA

We want to find the values of β = (β₀, β₁) that best explain the data. With σ known, we maximize the likelihood (or equivalently, the log-likelihood) over β:

β̂_MLE = argmax_β log p(y | x, β, σ)
This is an optimization problem — same idea as the parameter estimation in Week 4, but now we're optimizing over the linear predictor parameters β₀ and β₁.
📊 Deriving the Log-Likelihood DERIVATION
Step 1: Start with the joint log-likelihood
log p(y | x, β, σ) = Σₙ log Normal(yₙ | μₙ, σ)
Step 2: Expand the Gaussian log-density
= Σₙ [ −½ log σ² − ½ log 2π − 1/(2σ²) (yₙ − μₙ)² ]
Step 3: Drop constants (they don't affect the argmax)
log p(y | x, β, σ) ∝ − 1/(2σ²) · Σₙ (yₙ − μₙ)²
Step 4: Recognize the Sum of Squared Errors (SSE)
SSE = RSS = Σₙ (yₙ − μₙ)² = Σₙ εₙ²
where εₙ = yₙ − μₙ is the n-th residual.
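The derivation can be checked numerically. This NumPy sketch (illustrative data; slope evaluated on a grid with the intercept held at its true value, an assumption made just to keep the picture one-dimensional) shows that the slope maximizing the log-likelihood is exactly the slope minimizing the SSE:

```python
import numpy as np

rng = np.random.default_rng(2)
sigma = 0.5
x = rng.normal(size=50)
y = rng.normal(1.0 + 1.0 * x, sigma)      # true beta0 = beta1 = 1 (illustrative)

# Evaluate a grid of candidate slopes, intercept fixed at 1.0.
b1_grid = np.linspace(-2, 4, 601)
mu_grid = 1.0 + np.outer(b1_grid, x)      # each row: mean trend for one slope
sse = ((y - mu_grid) ** 2).sum(axis=1)
log_lik = -0.5 * len(y) * np.log(2 * np.pi * sigma**2) - sse / (2 * sigma**2)

# Maximizing the log-likelihood picks the same slope as minimizing the SSE.
assert np.argmax(log_lik) == np.argmin(sse)
```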
⚡ The MLE = OLS Result CRITICAL

Since log p ∝ −SSE, maximizing the log-likelihood is equivalent to minimizing the SSE:

Maximum Likelihood (MLE)
argmax log p(y|x,β,σ)

Statistical / probabilistic framing: find β that makes the data most probable.

Ordinary Least Squares (OLS)
argmin Σₙ (yₙ − μₙ)²

Geometric framing: find β that minimizes squared prediction errors.

The MLE estimates for β₀ and β₁ under a Gaussian likelihood are exactly the same as the OLS estimates. These are the familiar β̂ values you get from lm() in R.
📌 What is a Residual? DEFINITION

The residual εₙ is the difference between the observed response and the model's predicted mean:

εₙ = yₙ − μₙ = yₙ − (β₀ + β₁xₙ)
The SSE sums up the squared residuals. Minimizing SSE means finding the line that makes the squared vertical distances between observations and the line as small as possible.
🛠️ Fitting in R R CODE
# Generate data (true β₀=-0.25, β₁=1.15, σ=0.5)
set.seed(42)
x <- rnorm(n = 100)
mu_true <- -0.25 + 1.15 * x
y <- rnorm(n = length(x), mean = mu_true, sd = 0.5)

# Fit linear model (OLS = MLE under Gaussian likelihood)
my_df <- data.frame(x = x, y = y)
mod <- lm(y ~ x, data = my_df)
summary(mod)   # shows β̂₀, β̂₁ and their uncertainty
coef(mod)      # extract coefficient estimates
Bayesian Formulation
// Prior · posterior · informative vs diffuse · sequential updating · confidence intervals
🔄 Bayesian Setup KEY IDEA

In the Bayesian framework we treat β = (β₀, β₁) as unknown random variables and place priors on them. The posterior combines the prior belief with the data via the likelihood:

p(β | y, x, σ) ∝ p(y | x, β, σ) × p(β)

We assume σ is known for now. The unknowns are only β₀ and β₁.

📏 Informative vs Diffuse Priors COMPARISON
Informative Prior (sd = 1)
p(β) = Normal(β₀|0,1) × Normal(β₁|0,1)

Parameters are pulled toward 0. Rules out extreme values even with little data. With just 1 observation, the prior constrains the trade-off.

Diffuse Prior (sd = 20)
p(β) = Normal(β₀|0,20) × Normal(β₁|0,20)

Nearly flat — the posterior follows the likelihood. With 1 observation all we learn is the intercept–slope trade-off.

As N increases, both posteriors converge to the same result. The prior matters most with small N. With large N, the data overwhelm the prior.
📊 The Intercept–Slope Trade-off INSIGHT

With only 1 observation and a diffuse prior, the posterior contours appear as diagonal parallel lines. Why?

μₙ₌₁ = β₀ + β₁xₙ₌₁ = constant

Many different (β₀, β₁) pairs produce the same value for μₙ₌₁. Increasing the slope while decreasing the intercept by the right amount keeps μ fixed. The likelihood is constant along that diagonal.

Only when a second observation at a different x is added do we get two intersecting constraints that pin down both β₀ and β₁ — the contours tighten into ellipses.
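A tiny NumPy sketch makes the trade-off tangible (the observation location, response, and σ are illustrative): two very different (β₀, β₁) pairs that produce the same μ for the single observation have identical likelihood values.

```python
import numpy as np

# One observation at x1 with response y1; sigma known. Any (b0, b1) pair
# with b0 + b1*x1 equal to the same mu has the same likelihood.
x1, y1, sigma = 2.0, 1.5, 0.5             # illustrative values

def log_lik(b0, b1):
    mu = b0 + b1 * x1
    return -0.5 * np.log(2 * np.pi * sigma**2) - (y1 - mu)**2 / (2 * sigma**2)

# Move along the trade-off diagonal: raise the slope, lower the intercept.
base = log_lik(0.0, 0.75)                 # mu = 0.0 + 0.75*2.0 = 1.5
shifted = log_lik(-2.0, 1.75)             # mu = -2.0 + 1.75*2.0 = 1.5 again
assert np.isclose(base, shifted)          # same likelihood along the diagonal
```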
🔄 Sequential Bayesian Updating

The lecture walks through adding the first 10 observations one at a time, watching the joint posterior over (β₀, β₁) evolve:

N      Posterior shape            What's happening
0      Circular (= prior)         No data, just the prior
1      Diagonal parallel lines    One constraint: μₙ₌₁ is known; β₀↔β₁ trade-off
2      Elongated ellipse          Two constraints begin pinning down both parameters
3–5    Tighter ellipse            Ellipse rotates and shrinks; center near true β
10     Tight circular blob        Converged near true (β₀ = −0.25, β₁ = 1.15)
After just a single observation, we have already updated our beliefs about two unknowns simultaneously. An informative prior acts like a regularizer — it constrains extreme or unlikely parameter combinations from the start. When we use a diffuse prior, the only thing we learn is the constraint between the two parameters that the single data point introduces.
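The updating can be sketched with the standard conjugate Gaussian result (posterior precision = prior precision + data precision). This NumPy version uses the lecture's true coefficients but an illustrative seed, prior, and sample; it tracks the trace of the posterior covariance as each point arrives:

```python
import numpy as np

rng = np.random.default_rng(42)
sigma, prior_sd = 0.5, 1.0                # "informative" prior, sd = 1
x = rng.normal(size=10)
y = rng.normal(-0.25 + 1.15 * x, sigma)   # true beta from the lecture

S_inv = np.eye(2) / prior_sd**2           # prior precision (mean-zero prior)
rhs = np.zeros(2)                         # accumulates X'y / sigma^2
traces = [np.trace(np.linalg.inv(S_inv))]
for xn, yn in zip(x, y):
    xrow = np.array([1.0, xn])            # design row [1, x_n]
    S_inv = S_inv + np.outer(xrow, xrow) / sigma**2
    rhs = rhs + xrow * yn / sigma**2
    traces.append(np.trace(np.linalg.inv(S_inv)))

post_mean = np.linalg.solve(S_inv, rhs)   # posterior mean after 10 points
# Posterior uncertainty shrinks with every single observation added.
assert all(b < a for a, b in zip(traces, traces[1:]))
```

Each rank-one precision update corresponds to one row of the table above: the first observation collapses the circular prior along one diagonal direction, and later observations tighten the remaining directions.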
🔑 Posterior → Confidence Interval KEY TERM

The joint posterior over (β₀, β₁) induces a distribution over the mean trend μ(x) at every x. We can summarize it as a ribbon. The visualization below shows the key difference between the confidence interval (uncertainty about the mean line itself) and the observation noise bands from the explorer tab (where σ lives).

Ribbon                    Source of uncertainty                              Behaviour as N→∞
Confidence interval       Uncertainty in β (where is the true mean line?)    Shrinks to zero: data pins down β
Observation (±2σ) band    Irreducible noise σ around the mean                Stays the same: σ is a property of the world
This ribbon is the Confidence Interval — it captures uncertainty about the mean trend, not about individual observations. As N grows, the ribbon shrinks.
📐 Why is the CI narrower near the data? FORMULA

The variance of the predicted mean trend at any input value x has a closed form:

Var( μ̂(x) ) = σ² · [ 1/N + (x − x̄)² / Sₓₓ ]
Term           Meaning
x̄              Mean of the observed input values
Sₓₓ            Sum of squared deviations of inputs: Σₙ(xₙ − x̄)²
1/N            Uncertainty from estimating the overall level (intercept)
(x−x̄)²/Sₓₓ     Extra uncertainty that grows as we predict further from the data center
The CI is narrowest at x = x̄ (where the second term vanishes) and fans out symmetrically on both sides. Intuitively: the data pins down the mean trend most precisely right where the observations are concentrated. The further you extrapolate from x̄, the more the slope uncertainty compounds — small errors in β₁ translate into large errors in μ̂(x) when |x − x̄| is large.
Contrast with a Prediction Interval — that includes both the uncertainty in β AND the observation noise σ, so it is always wider than the CI.
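The closed form above agrees with the matrix expression σ²·x₀(XᵀX)⁻¹x₀ᵀ for a single-input model, and its minimum sits at x̄. A NumPy check (illustrative data and σ):

```python
import numpy as np

rng = np.random.default_rng(3)
sigma = 0.5
x = rng.normal(size=40)
X = np.column_stack([np.ones_like(x), x])
xbar = x.mean()
Sxx = ((x - xbar) ** 2).sum()

def var_mu_closed_form(x0):
    # Var(mu_hat(x0)) = sigma^2 * (1/N + (x0 - xbar)^2 / Sxx)
    return sigma**2 * (1 / len(x) + (x0 - xbar) ** 2 / Sxx)

def var_mu_matrix(x0):
    # Equivalent matrix form: sigma^2 * x0_row (X'X)^{-1} x0_row'
    xrow = np.array([1.0, x0])
    return sigma**2 * xrow @ np.linalg.inv(X.T @ X) @ xrow

# The two expressions agree, and the variance is smallest at x = xbar.
assert np.isclose(var_mu_closed_form(2.0), var_mu_matrix(2.0))
assert var_mu_closed_form(xbar) < var_mu_closed_form(xbar + 1.0)
```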
🔑 Prior's Role as Regularization INSIGHT

The informative prior (sd = 1) acts to constrain parameters away from extreme values. After just 1 observation:

  • Very large β₀ AND large β₁ ruled out
  • Negative slopes less plausible
  • Mode near the true values
  • Easier to find the MAP
This mirrors regularization in classical ML: the prior "penalizes" large parameter values. With a diffuse prior and small N, the model is underdetermined. With an informative prior, parameter uncertainty is reduced even before looking at the data.
Matrix Derivations
// Design matrix · μ = Xβ · gradient · normal equations · Hessian · posterior covariance
🎯 Why Generalize? MOTIVATION

With a single input the mean trend is μₙ = β₀ + β₁xₙ. But real problems have dozens or hundreds of inputs. We need a notation that scales cleanly to D inputs without writing out every term.

The trick: organise everything into a design matrix X and a coefficient vector β, then write the entire linear model as a single matrix multiplication: μ = Xβ.
📐 Subscript Notation NOTATION
Case                      Mean trend                           Index meaning
Single input              μₙ = β₀ + β₁xₙ                       n = observation index
Multiple inputs (D = 3)   μₙ = β₀ + β₁xₙ,₁ + β₂xₙ,₂ + β₃xₙ,₃   n = observation, d = input feature
With multiple inputs, xₙ,d means the n-th observation of the d-th input. Think of n as the row index and d as the column index of a matrix.
∑ From Sum to Inner Product STEP 1

For D inputs, the mean trend is a sum of D coefficient–input products:

μₙ = β₀ + Σ_{d=1}^{D} βd · xₙ,d

To pull the intercept inside the sum, introduce a fake variable xₙ,₀ ≡ 1:

μₙ = Σ_{d=0}^{D} βd · xₙ,d (the summation now starts at 0)
β₀ is just the d = 0 coefficient, multiplied by xₙ,₀ = 1. The intercept is now treated symmetrically with all other coefficients.
🗂️ The Design Matrix X KEY OBJECT

Stack all N observations row-by-row. Each row is the n-th observation's input vector (including the fake intercept column):

X =
[ 1   x₁,₁   x₁,₂   ···   x₁,D ]
[ 1   x₂,₁   x₂,₂   ···   x₂,D ]
[ ⋮     ⋮      ⋮     ⋱      ⋮  ]
[ 1   xN,₁   xN,₂   ···   xN,D ]
Dimension     Meaning
N × (D+1)     N observations, D inputs plus the intercept column of ones
Row xₙ,:      The n-th observation's full input row vector, 1 × (D+1)
Column x:,d   All N values of the d-th input feature
The first column of X is always a column of ones: this is the fake variable xₙ,₀ = 1 that encodes the intercept term.
⚡ The Key Result: μ = Xβ STEP 2

Organise the D+1 coefficients into a column vector β. The mean trend for the n-th observation is a row–column inner product:

μₙ = xₙ,: · β    (dimensions: [1×(D+1)] · [(D+1)×1] = scalar)

Stacking all N inner products gives the full N×1 vector of mean trends in one matrix multiplication:

μ = X β    (dimensions: [N×(D+1)] · [(D+1)×1] = [N×1])
Verify for the single-input case: xₙ,: = [1, xₙ,₁], β = [β₀, β₁]ᵀ, so xₙ,:·β = β₀·1 + β₁·xₙ,₁ = β₀ + β₁xₙ ✓
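The same verification in a NumPy sketch, reusing the lecture's example point x = −1.5 with β₀ = 1, β₁ = 1 (the other input values are illustrative):

```python
import numpy as np

# Build the single-input design matrix and check that X @ beta reproduces
# beta0 + beta1 * x element-wise.
x = np.array([-1.5, 0.0, 2.0])
X = np.column_stack([np.ones_like(x), x])  # first column of ones = intercept
beta = np.array([1.0, 1.0])                # [beta0, beta1]

mu = X @ beta
assert np.allclose(mu, 1.0 + 1.0 * x)
# Matches the lecture example: mu = -0.5 at x = -1.5.
assert np.isclose(mu[0], -0.5)
```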
📊 Log-Likelihood in Matrix Form STEP 3

The scalar SSE sum becomes a compact quadratic form:

Σₙ (yₙ − μₙ)² = Σₙ (yₙ − xₙ,:β)² = (y − Xβ)ᵀ(y − Xβ)

So the full log-posterior (with diffuse prior) in matrix notation is:

log p(β | y, X, σ) ∝ − 1/(2σ²) · (y − Xβ)ᵀ(y − Xβ) = − SSE / (2σ²)
This single compact expression handles any number of inputs D. The scalar summation and the matrix quadratic form are mathematically identical.
∇ The Gradient Vector g STEP 4

Differentiate the log-likelihood with respect to β (using matrix calculus — chain rule analog for vectors):

g = ∂L/∂β = (1/σ²) ( Xᵀy − XᵀXβ )
Xᵀy — "cross term"

Projection of the response onto the input space. Size: (D+1)×1.

XᵀX — "sum of squares"

Captures relationships between inputs. Size: (D+1)×(D+1). If inputs are centered, XᵀX ≈ N × Cov(inputs).

🏁 Normal Equations → Closed-Form MLE STEP 5

Set the gradient to zero (g = 0) to find the mode β̂:

Step 1: Set g = 0
0 = (1/σ²)(Xᵀy − XᵀXβ̂)
Step 2: Rearrange into the Normal Equations
XᵀXβ̂ = Xᵀy
Step 3: Invert XᵀX to get the closed-form solution
β̂ = (XᵀX)⁻¹ Xᵀy
Two important notes: (1) σ cancels out — the noise level does not affect where the coefficients land. (2) XᵀX must be invertible — requires no perfectly collinear inputs and N ≥ D+1.
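A NumPy sketch of the closed-form solution (illustrative seed; the lecture's true coefficients β₀ = −0.25, β₁ = 1.15 are used to generate the data), cross-checked against the library least-squares solver:

```python
import numpy as np

rng = np.random.default_rng(4)
sigma = 0.5
x = rng.normal(size=100)
y = rng.normal(-0.25 + 1.15 * x, sigma)
X = np.column_stack([np.ones_like(x), x])

# Closed-form MLE from the normal equations: (X'X)^{-1} X'y.
# np.linalg.solve is preferred over forming the inverse explicitly.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Cross-check against the library least-squares solver.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
assert np.allclose(beta_hat, beta_lstsq)
```

Note that σ appears nowhere in the computation of β̂, matching point (1) above.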
📐 The Hessian H STEP 6

The Hessian is the matrix of second derivatives of L with respect to β. Take the derivative of g with respect to β:

H = ∂g/∂β = − (1/σ²) XᵀX
Critical insight: H does not depend on β. It is a constant matrix. This means the log-likelihood is an exactly quadratic function of β — not just approximately quadratic near the mode.

This is why the Newton–Raphson method converges in exactly one step for the linear model:

β_{k=1} = β_{k=0} − H⁻¹ g = β_{k=0} + (XᵀX)⁻¹(Xᵀy − XᵀXβ_{k=0}) = (XᵀX)⁻¹Xᵀy = β̂
The β_{k=0} terms cancel exactly, leaving β̂ regardless of the starting point. One Newton step jumps directly to the global optimum.
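The one-step convergence can be demonstrated directly. This NumPy sketch (illustrative data and a deliberately bad starting point) takes a single Newton step with the g and H defined above and lands exactly on the MLE:

```python
import numpy as np

rng = np.random.default_rng(5)
sigma = 0.5
x = rng.normal(size=60)
y = rng.normal(-0.25 + 1.15 * x, sigma)
X = np.column_stack([np.ones_like(x), x])

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)   # closed-form MLE

# One Newton-Raphson step from an arbitrary starting point:
# beta_1 = beta_0 - H^{-1} g, with g and H as in the text.
beta_0 = np.array([10.0, -7.0])                # far from the optimum
g = (X.T @ y - X.T @ X @ beta_0) / sigma**2
H = -(X.T @ X) / sigma**2
beta_1 = beta_0 - np.linalg.solve(H, g)

assert np.allclose(beta_1, beta_hat)           # lands on the MLE in one step
```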
📊 Posterior Covariance Matrix STEP 7

Using the Laplace Approximation, the approximate posterior covariance is the negative inverse Hessian evaluated at the mode:

Cov(β,β) = −H⁻¹|β=β̂ = − ( −(1/σ²) XᵀX )⁻¹
Cov(β,β) = σ² (XᵀX)⁻¹
  • Does NOT depend on y (the response)
  • Scales with σ² (more noise → more uncertainty)
  • Requires XᵀX to be invertible
  • Input covariance drives coefficient covariance
🎯 Full MVN Posterior RESULT

With a diffuse prior, the Laplace Approximation gives the exact posterior (because the log-likelihood is exactly quadratic):

β | X, y, σ ~ MVN( β | (XᵀX)⁻¹Xᵀy, σ²(XᵀX)⁻¹ )

This is an exact result for the linear model — not an approximation — because the log-posterior is a perfect quadratic in β.
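A NumPy sketch of the MVN posterior (illustrative seeds and a second, arbitrary coefficient vector for the comparison): two different response vectors move the posterior mean but share the same posterior covariance, since σ²(XᵀX)⁻¹ depends only on the inputs.

```python
import numpy as np

rng = np.random.default_rng(6)
sigma = 0.5
x = rng.normal(size=80)
X = np.column_stack([np.ones_like(x), x])

# Posterior covariance depends only on the inputs, never on y.
cov = sigma**2 * np.linalg.inv(X.T @ X)

y1 = rng.normal(X @ np.array([-0.25, 1.15]), sigma)  # lecture's true beta
y2 = rng.normal(X @ np.array([3.0, -2.0]), sigma)    # arbitrary other beta
mean1 = np.linalg.solve(X.T @ X, X.T @ y1)
mean2 = np.linalg.solve(X.T @ X, X.T @ y2)
assert not np.allclose(mean1, mean2)                 # means differ...
# ...but both posteriors use the SAME covariance matrix `cov`.

# Draw samples from the MVN posterior for the first data set.
draws = rng.multivariate_normal(mean1, cov, size=5000)
assert np.allclose(np.cov(draws.T), cov, atol=5e-4)
```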

✅ Special Case: Intercept-Only Model VERIFICATION

Set D=0. The design matrix is just a column of ones: X = [1, 1, …, 1]ᵀ (N×1). Let's verify the general formula recovers what we already know:

MLE: β̂ = (XᵀX)⁻¹Xᵀy

XᵀX = [1,…,1]·[1,…,1]ᵀ = N

Xᵀy = [1,…,1]·y = Σyₙ = Nȳ

β̂₀ = N⁻¹ · Nȳ = ȳ

The MLE is just the sample mean ✓

Posterior variance: σ²(XᵀX)⁻¹

= σ² · N⁻¹ = σ²/N

Standard deviation = σ/√N

This is the standard error formula ✓

β₀ | X, y, σ ~ Normal( β₀ | ȳ, σ/√N )
This is exactly the Normal–Normal model from earlier in the course! The intercept-only linear model is the simplest possible linear model — it's just the unknown constant mean problem in disguise.
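Both collapses can be verified numerically. A NumPy sketch (the true mean 2.0, σ, and seed are illustrative choices) applies the general matrix formulas to a column-of-ones design matrix:

```python
import numpy as np

rng = np.random.default_rng(7)
sigma = 0.5
y = rng.normal(2.0, sigma, size=100)
X = np.ones((len(y), 1))                   # intercept-only design matrix

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)   # general MLE formula
post_var = sigma**2 * np.linalg.inv(X.T @ X)   # general posterior covariance

# The general formulas collapse to the sample mean and sigma^2 / N.
assert np.isclose(beta_hat[0], y.mean())
assert np.isclose(post_var[0, 0], sigma**2 / len(y))
```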
📋 Key Results at a Glance
Object            Formula                              Notes
Mean trends       μ = Xβ                               N×1 vector; X is N×(D+1)
Log-likelihood    L ∝ −(y−Xβ)ᵀ(y−Xβ) / (2σ²)           Quadratic in β
Gradient          g = (1/σ²)(Xᵀy − XᵀXβ)               (D+1)×1 vector
Hessian           H = −XᵀX / σ²                        Constant: does not depend on β
MLE / MAP         β̂ = (XᵀX)⁻¹Xᵀy                       σ cancels out
Post. covariance  Σ = σ²(XᵀX)⁻¹                        Does not depend on y
Full posterior    β | X,y,σ ~ MVN(β̂, σ²(XᵀX)⁻¹)        Exact (not approximate)