INFSCI 2595 · Machine Learning · Fall 2025

Week 7 — Linear Models

Lecture 7 · Introduction to Linear Models · University of Pittsburgh
The Linear Model
// Notation · two-part structure · likelihood · conditional independence
🎯 What is a Linear Model? KEY IDEA

A linear model predicts a continuous response y from an input x using a straight-line mean trend. We have N observed input–output pairs: (xₙ, yₙ) for n = 1, …, N.

This is a regression problem — the response is continuous (a real number), not a discrete class.
📋 Preferred Notation FORMULA

Rather than writing yₙ = β₀ + β₁xₙ + εₙ (which hides the model structure), the lecture uses the likelihood notation:

Complete probabilistic model for the n-th observation
yₙ | xₙ, β₀, β₁, σ ~ Normal( yₙ | μₙ, σ )
μₙ = β₀ + β₁xₙ ← the linear predictor
Why is this preferred? It explicitly shows the likelihood — the probability of observing yₙ given the inputs and parameters — and cleanly separates the two model components.
⚙️ Two Distinct Parts of the Model
Deterministic Part
μₙ = β₀ + β₁xₙ

The linear predictor — the mean trend. Given β₀, β₁, and xₙ, the value of μₙ is always the same. No randomness here.

Stochastic Part
yₙ | μₙ, σ ~ Normal(μₙ, σ)

The noise — how observations scatter around the mean. σ controls the width of this scatter. Higher σ = more noise.

Key insight: the mean μₙ changes with the input xₙ. Each observation has its own Normal distribution centered on a different μₙ. The variability σ around that mean is constant across all xₙ.
🔗 Conditional Independence & the Joint Likelihood

We assume observations are conditionally independent given the input and parameters. This means the joint probability of all N responses factors into a product of N individual likelihoods:

p(y | x, β, σ) = ∏ₙ p(yₙ | xₙ, β, σ) = ∏ₙ Normal(yₙ | μₙ, σ)

Each factor is the n-th observation's likelihood. Multiplying them together gives the full data likelihood over all N observations.
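To make the factorization concrete, here is a minimal NumPy sketch (the lecture's own code is in R; the seed, data, and parameter values here are illustrative) confirming that the product of per-observation likelihoods equals the exponentiated sum of per-observation log-likelihoods:

```python
import numpy as np

# Illustrative parameter choices (not from the lecture data set).
rng = np.random.default_rng(0)
beta0, beta1, sigma = 1.0, 1.0, 0.5
x = rng.normal(size=5)
mu = beta0 + beta1 * x                    # deterministic part
y = rng.normal(loc=mu, scale=sigma)       # stochastic part

def normal_logpdf(y, mu, sigma):
    # Log of the Normal density with mean mu and standard deviation sigma.
    return -0.5 * np.log(2 * np.pi * sigma**2) - (y - mu)**2 / (2 * sigma**2)

# Conditional independence: the joint log-likelihood is the SUM of the
# per-observation log-likelihoods, so the joint likelihood is the PRODUCT
# of the per-observation likelihoods.
log_lik = normal_logpdf(y, mu, sigma).sum()
lik_product = np.prod(np.exp(normal_logpdf(y, mu, sigma)))
assert np.isclose(np.exp(log_lik), lik_product)
```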

📍 Where Does the Model Sit?

Week 7 begins the supervised learning deep-dive. Linear models are the entry point to a hierarchy:

1. Linear Models (this week): straight-line mean trend, Gaussian likelihood, OLS / MLE connection.
2. Generalized Linear Models: non-Gaussian likelihoods (e.g. Binomial for classification); the mean trend is linked to the linear predictor via a link function.
3. Non-linear Methods: neural networks, trees, Gaussian processes; the mean trend is no longer constrained to be linear in the parameters.
Parameters & Interactive Explorer
// β₀ intercept · β₁ slope · σ noise · ±1σ / ±2σ / ±3σ ribbons
🎛️ Linear Model Explorer INTERACTIVE

Adjust the sliders to see how each parameter shapes the mean trend (white line) and the observation ribbons. The ribbon shows where ~68 / 95 / 99.7% of observations are expected to fall.

📍 Intercept β₀ WHAT IT DOES

The intercept β₀ is the value of the mean μ when x = 0.

μ|ₓ₌₀ = β₀ + β₁·0 = β₀

Changing β₀ shifts the entire line up or down without rotating it. Lines with different β₀ but the same β₁ are perfectly parallel.

📐 Slope β₁ WHAT IT DOES

The slope β₁ determines how much μ changes per unit increase in x.

Δμ = β₁ · Δx
  • β₁ = 0 → flat line, x has no effect on y
  • β₁ > 0 → positive relationship
  • β₁ < 0 → negative (inverse) relationship
🔊 Noise σ STOCHASTIC PART

σ controls the spread of observations around the mean trend. It is the standard deviation of the Normal likelihood. σ does not change the mean trend — it only changes how tightly observations cluster around it.

Interval   Coverage (Normal)   What it shows
μ ± 1σ     ≈ 68.3%             Inner ribbon: most observations land here
μ ± 2σ     ≈ 95.4%             Middle ribbon
μ ± 3σ     ≈ 99.7%             Outer ribbon: almost all observations
Important: the ribbon width is constant across all values of x. The variability σ does not grow or shrink with the input. This is the homoscedasticity assumption.
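The coverage numbers in the table can be checked by simulation. A NumPy sketch (illustrative seed and parameter values; the lecture itself uses R) draws many observations from the model and counts how often they land inside each ribbon:

```python
import numpy as np

# Simulate observations from the linear model and measure how often they
# fall inside the +/- 1, 2, 3 sigma ribbons around the mean trend.
rng = np.random.default_rng(1)
beta0, beta1, sigma = 1.0, 1.0, 0.5       # illustrative values
x = rng.uniform(-3, 3, size=200_000)
mu = beta0 + beta1 * x
y = rng.normal(mu, sigma)

for k, target in [(1, 0.683), (2, 0.954), (3, 0.997)]:
    coverage = np.mean(np.abs(y - mu) <= k * sigma)
    print(f"+/-{k} sigma: {coverage:.3f} (theory {target})")
```

Because σ is constant across x (homoscedasticity), the same coverage holds at every input location.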
💡 The Observation Distribution Changes with x

At each input value xₙ, the response follows a different Normal distribution:

yₙ | xₙ = c ~ Normal( β₀ + β₁·c, σ )

Example from the lecture: at xₙ = −1.5, β₀=1, β₁=1, σ=0.5:

yₙ | xₙ=−1.5 ~ Normal(1 + 1×(−1.5), 0.5) = Normal(−0.5, 0.5)
The violin plots in the lecture visualize this: at each input location, the density of observations is Gaussian. The violins have the same width everywhere (same σ) but are centered on different means.
Model Fitting
// MLE · log-likelihood · SSE · OLS equivalence
🎯 The Learning Objective KEY IDEA

We want to find the values of β = (β₀, β₁) that best explain the data. With σ known, we maximize the likelihood (or equivalently, the log-likelihood) over β:

β̂_MLE = argmax_β log p(y | x, β, σ)
This is an optimization problem — same idea as the parameter estimation in Week 4, but now we're optimizing over the linear predictor parameters β₀ and β₁.
📊 Deriving the Log-Likelihood DERIVATION
Step 1: Start with the joint log-likelihood
log p(y | x, β, σ) = Σₙ log Normal(yₙ | μₙ, σ)
Step 2: Expand the Gaussian log-density
= Σₙ [ −½ log σ² − ½ log 2π − 1/(2σ²) (yₙ − μₙ)² ]
Step 3: Drop constants (they don't affect the argmax)
log p(y | x, β, σ) ∝ − 1/(2σ²) · Σₙ (yₙ − μₙ)²
Step 4: Recognize the Sum of Squared Errors (SSE)
SSE = RSS = Σₙ (yₙ − μₙ)² = Σₙ εₙ²
where εₙ = yₙ − μₙ is the n-th residual.
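The derivation can be checked numerically. This NumPy sketch (illustrative data; slope evaluated on a grid with the intercept held at its true value, an assumption made just to keep the picture one-dimensional) shows that the slope maximizing the log-likelihood is exactly the slope minimizing the SSE:

```python
import numpy as np

rng = np.random.default_rng(2)
sigma = 0.5
x = rng.normal(size=50)
y = rng.normal(1.0 + 1.0 * x, sigma)      # true beta0 = beta1 = 1 (illustrative)

# Evaluate a grid of candidate slopes, intercept fixed at 1.0.
b1_grid = np.linspace(-2, 4, 601)
mu_grid = 1.0 + np.outer(b1_grid, x)      # each row: mean trend for one slope
sse = ((y - mu_grid) ** 2).sum(axis=1)
log_lik = -0.5 * len(y) * np.log(2 * np.pi * sigma**2) - sse / (2 * sigma**2)

# Maximizing the log-likelihood picks the same slope as minimizing the SSE.
assert np.argmax(log_lik) == np.argmin(sse)
```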
⚡ The MLE = OLS Result CRITICAL

Since log p ∝ −SSE, maximizing the log-likelihood is equivalent to minimizing the SSE:

Maximum Likelihood (MLE)
argmax log p(y|x,β,σ)

Statistical / probabilistic framing: find β that makes the data most probable.

Ordinary Least Squares (OLS)
argmin Σₙ (yₙ − μₙ)²

Geometric framing: find β that minimizes squared prediction errors.

The MLE estimates for β₀ and β₁ under a Gaussian likelihood are exactly the same as the OLS estimates. These are the familiar β̂ values you get from lm() in R.
📌 What is a Residual? DEFINITION

The residual εₙ is the difference between the observed response and the model's predicted mean:

εₙ = yₙ − μₙ = yₙ − (β₀ + β₁xₙ)
The SSE sums up the squared residuals. Minimizing SSE means finding the line that makes the squared vertical distances between observations and the line as small as possible.
🛠️ Fitting in R R CODE
# Generate data (true β₀=-0.25, β₁=1.15, σ=0.5)
set.seed(42)
x <- rnorm(n = 100)
mu_true <- -0.25 + 1.15 * x
y <- rnorm(n = length(x), mean = mu_true, sd = 0.5)

# Fit linear model (OLS = MLE under Gaussian likelihood)
my_df <- data.frame(x = x, y = y)
mod <- lm(y ~ x, data = my_df)
summary(mod)   # shows β̂₀, β̂₁ and their uncertainty
coef(mod)      # extract coefficient estimates
Bayesian Formulation
// Prior · posterior · informative vs diffuse · sequential updating · confidence intervals
🔄 Bayesian Setup KEY IDEA

In the Bayesian framework we treat β = (β₀, β₁) as unknown random variables and place priors on them. The posterior combines the prior belief with the data via the likelihood:

p(β | y, x, σ) ∝ p(y | x, β, σ) × p(β)

We assume σ is known for now. The unknowns are only β₀ and β₁.

📏 Informative vs Diffuse Priors COMPARISON
Informative Prior (sd = 1)
p(β) = Normal(β₀|0,1) × Normal(β₁|0,1)

Parameters are pulled toward 0. Rules out extreme values even with little data. With just 1 observation, the prior constrains the trade-off.

Diffuse Prior (sd = 20)
p(β) = Normal(β₀|0,20) × Normal(β₁|0,20)

Nearly flat — the posterior follows the likelihood. With 1 observation all we learn is the intercept–slope trade-off.

As N increases, both posteriors converge to the same result. The prior matters most with small N. With large N, the data overwhelm the prior.
📊 The Intercept–Slope Trade-off INSIGHT

With only 1 observation and a diffuse prior, the posterior contours appear as diagonal parallel lines. Why?

μₙ₌₁ = β₀ + β₁xₙ₌₁ = constant

Many different (β₀, β₁) pairs produce the same value for μₙ₌₁. Increasing the slope while decreasing the intercept by the right amount keeps μ fixed. The likelihood is constant along that diagonal.

Only when a second observation at a different x is added do we get two intersecting constraints that pin down both β₀ and β₁ — the contours tighten into ellipses.
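A tiny NumPy sketch makes the trade-off tangible (the observation location, response, and σ are illustrative): two very different (β₀, β₁) pairs that produce the same μ for the single observation have identical likelihood values.

```python
import numpy as np

# One observation at x1 with response y1; sigma known. Any (b0, b1) pair
# with b0 + b1*x1 equal to the same mu has the same likelihood.
x1, y1, sigma = 2.0, 1.5, 0.5             # illustrative values

def log_lik(b0, b1):
    mu = b0 + b1 * x1
    return -0.5 * np.log(2 * np.pi * sigma**2) - (y1 - mu)**2 / (2 * sigma**2)

# Move along the trade-off diagonal: raise the slope, lower the intercept.
base = log_lik(0.0, 0.75)                 # mu = 0.0 + 0.75*2.0 = 1.5
shifted = log_lik(-2.0, 1.75)             # mu = -2.0 + 1.75*2.0 = 1.5 again
assert np.isclose(base, shifted)          # same likelihood along the diagonal
```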
🔄 Sequential Bayesian Updating

The lecture walks through adding the first 10 observations one at a time, watching the joint posterior over (β₀, β₁) evolve:

N      Posterior shape            What's happening
0      Circular (= prior)         No data, just the prior
1      Diagonal parallel lines    One constraint: μₙ₌₁ is known; β₀↔β₁ trade-off
2      Elongated ellipse          Two constraints begin pinning down both parameters
3–5    Tighter ellipse            Ellipse rotates and shrinks; center near true β
10     Tight circular blob        Converged near true (β₀ = −0.25, β₁ = 1.15)
After just a single observation, we have already updated our beliefs about two unknowns simultaneously. An informative prior acts like a regularizer — it constrains extreme or unlikely parameter combinations from the start. When we use a diffuse prior, the only thing we learn is the constraint between the two parameters that the single data point introduces.
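The updating can be sketched with the standard conjugate Gaussian result (posterior precision = prior precision + data precision). This NumPy version uses the lecture's true coefficients but an illustrative seed, prior, and sample; it tracks the trace of the posterior covariance as each point arrives:

```python
import numpy as np

rng = np.random.default_rng(42)
sigma, prior_sd = 0.5, 1.0                # "informative" prior, sd = 1
x = rng.normal(size=10)
y = rng.normal(-0.25 + 1.15 * x, sigma)   # true beta from the lecture

S_inv = np.eye(2) / prior_sd**2           # prior precision (mean-zero prior)
rhs = np.zeros(2)                         # accumulates X'y / sigma^2
traces = [np.trace(np.linalg.inv(S_inv))]
for xn, yn in zip(x, y):
    xrow = np.array([1.0, xn])            # design row [1, x_n]
    S_inv = S_inv + np.outer(xrow, xrow) / sigma**2
    rhs = rhs + xrow * yn / sigma**2
    traces.append(np.trace(np.linalg.inv(S_inv)))

post_mean = np.linalg.solve(S_inv, rhs)   # posterior mean after 10 points
# Posterior uncertainty shrinks with every single observation added.
assert all(b < a for a, b in zip(traces, traces[1:]))
```

Each rank-one precision update corresponds to one row of the table above: the first observation collapses the circular prior along one diagonal direction, and later observations tighten the remaining directions.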
🔑 Posterior → Confidence Interval KEY TERM

The joint posterior over (β₀, β₁) induces a distribution over the mean trend μ(x) at every x. We can summarize it as a ribbon. The visualization below shows the key difference between the confidence interval (uncertainty about the mean line itself) and the observation noise bands from the explorer tab (where σ lives).

Ribbon                    Source of uncertainty                              Behaviour as N→∞
Confidence interval       Uncertainty in β (where is the true mean line?)    Shrinks to zero: data pins down β
Observation (±2σ) band    Irreducible noise σ around the mean                Stays the same: σ is a property of the world
This ribbon is the Confidence Interval — it captures uncertainty about the mean trend, not about individual observations. As N grows, the ribbon shrinks.
📐 Why is the CI narrower near the data? FORMULA

The variance of the predicted mean trend at any input value x has a closed form:

Var( μ̂(x) ) = σ² · [ 1/N + (x − x̄)² / Sₓₓ ]
Term           Meaning
x̄              Mean of the observed input values
Sₓₓ            Sum of squared deviations of inputs: Σₙ(xₙ − x̄)²
1/N            Uncertainty from estimating the overall level (intercept)
(x−x̄)²/Sₓₓ     Extra uncertainty that grows as we predict further from the data center
The CI is narrowest at x = x̄ (where the second term vanishes) and fans out symmetrically on both sides. Intuitively: the data pins down the mean trend most precisely right where the observations are concentrated. The further you extrapolate from x̄, the more the slope uncertainty compounds — small errors in β₁ translate into large errors in μ̂(x) when |x − x̄| is large.
Contrast with a Prediction Interval — that includes both the uncertainty in β AND the observation noise σ, so it is always wider than the CI.
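The closed form above agrees with the matrix expression σ²·x₀(XᵀX)⁻¹x₀ᵀ for a single-input model, and its minimum sits at x̄. A NumPy check (illustrative data and σ):

```python
import numpy as np

rng = np.random.default_rng(3)
sigma = 0.5
x = rng.normal(size=40)
X = np.column_stack([np.ones_like(x), x])
xbar = x.mean()
Sxx = ((x - xbar) ** 2).sum()

def var_mu_closed_form(x0):
    # Var(mu_hat(x0)) = sigma^2 * (1/N + (x0 - xbar)^2 / Sxx)
    return sigma**2 * (1 / len(x) + (x0 - xbar) ** 2 / Sxx)

def var_mu_matrix(x0):
    # Equivalent matrix form: sigma^2 * x0_row (X'X)^{-1} x0_row'
    xrow = np.array([1.0, x0])
    return sigma**2 * xrow @ np.linalg.inv(X.T @ X) @ xrow

# The two expressions agree, and the variance is smallest at x = xbar.
assert np.isclose(var_mu_closed_form(2.0), var_mu_matrix(2.0))
assert var_mu_closed_form(xbar) < var_mu_closed_form(xbar + 1.0)
```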
🔑 Prior's Role as Regularization INSIGHT

The informative prior (sd = 1) acts to constrain parameters away from extreme values. After just 1 observation:

  • Very large β₀ AND large β₁ ruled out
  • Negative slopes less plausible
  • Mode near the true values
  • Easier to find the MAP
This mirrors regularization in classical ML: the prior "penalizes" large parameter values. With a diffuse prior and small N, the model is underdetermined. With an informative prior, parameter uncertainty is reduced even before looking at the data.
Matrix Derivations
// Design matrix · μ = Xβ · gradient · normal equations · Hessian · posterior covariance
🎯 Why Generalize? MOTIVATION

With a single input the mean trend is μₙ = β₀ + β₁xₙ. But real problems have dozens or hundreds of inputs. We need a notation that scales cleanly to D inputs without writing out every term.

The trick: organise everything into a design matrix X and a coefficient vector β, then write the entire linear model as a single matrix multiplication: μ = Xβ.
📐 Subscript Notation NOTATION
Case                      Mean trend                           Index meaning
Single input              μₙ = β₀ + β₁xₙ                       n = observation index
Multiple inputs (D = 3)   μₙ = β₀ + β₁xₙ,₁ + β₂xₙ,₂ + β₃xₙ,₃   n = observation, d = input feature
With multiple inputs, xₙ,d means the n-th observation of the d-th input. Think of n as the row index and d as the column index of a matrix.
∑ From Sum to Inner Product STEP 1

For D inputs, the mean trend is a sum of D coefficient–input products:

μₙ = β₀ + Σ_{d=1}^{D} βd · xₙ,d

To pull the intercept inside the sum, introduce a fake variable xₙ,₀ ≡ 1:

μₙ = Σ_{d=0}^{D} βd · xₙ,d (the summation now starts at 0)
β₀ is just the d = 0 coefficient, multiplied by xₙ,₀ = 1. The intercept is now treated symmetrically with all other coefficients.
🗂️ The Design Matrix X KEY OBJECT

Stack all N observations row-by-row. Each row is the n-th observation's input vector (including the fake intercept column):

X =
[ 1   x₁,₁   x₁,₂   ···   x₁,D ]
[ 1   x₂,₁   x₂,₂   ···   x₂,D ]
[ ⋮     ⋮      ⋮     ⋱      ⋮  ]
[ 1   xN,₁   xN,₂   ···   xN,D ]
Dimension     Meaning
N × (D+1)     N observations, D inputs plus the intercept column of ones
Row xₙ,:      The n-th observation's full input row vector, 1 × (D+1)
Column x:,d   All N values of the d-th input feature
The first column of X is always a column of ones: this is the fake variable xₙ,₀ = 1 that encodes the intercept term.
⚡ The Key Result: μ = Xβ STEP 2

Organise the D+1 coefficients into a column vector β. The mean trend for the n-th observation is a row–column inner product:

μₙ = xₙ,: · β    (dimensions: [1×(D+1)] · [(D+1)×1] = scalar)

Stacking all N inner products gives the full N×1 vector of mean trends in one matrix multiplication:

μ = X β    (dimensions: [N×(D+1)] · [(D+1)×1] = [N×1])
Verify for the single-input case: xₙ,: = [1, xₙ,₁], β = [β₀, β₁]ᵀ, so xₙ,:·β = β₀·1 + β₁·xₙ,₁ = β₀ + β₁xₙ ✓
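The same verification in a NumPy sketch, reusing the lecture's example point x = −1.5 with β₀ = 1, β₁ = 1 (the other input values are illustrative):

```python
import numpy as np

# Build the single-input design matrix and check that X @ beta reproduces
# beta0 + beta1 * x element-wise.
x = np.array([-1.5, 0.0, 2.0])
X = np.column_stack([np.ones_like(x), x])  # first column of ones = intercept
beta = np.array([1.0, 1.0])                # [beta0, beta1]

mu = X @ beta
assert np.allclose(mu, 1.0 + 1.0 * x)
# Matches the lecture example: mu = -0.5 at x = -1.5.
assert np.isclose(mu[0], -0.5)
```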
📊 Log-Likelihood in Matrix Form STEP 3

The scalar SSE sum becomes a compact quadratic form:

Σₙ (yₙ − μₙ)² = Σₙ (yₙ − xₙ,:β)² = (y − Xβ)ᵀ(y − Xβ)

So the full log-posterior (with diffuse prior) in matrix notation is:

log p(β | y, X, σ) ∝ − 1/(2σ²) · (y − Xβ)ᵀ(y − Xβ) = − SSE / (2σ²)
This single compact expression handles any number of inputs D. The scalar summation and the matrix quadratic form are mathematically identical.
∇ The Gradient Vector g STEP 4

Differentiate the log-likelihood with respect to β (using matrix calculus — chain rule analog for vectors):

g = ∂L/∂β = (1/σ²) ( Xᵀy − XᵀXβ )
Xᵀy — "cross term"

Projection of the response onto the input space. Size: (D+1)×1.

XᵀX — "sum of squares"

Captures relationships between inputs. Size: (D+1)×(D+1). If inputs are centered, XᵀX ≈ N × Cov(inputs).

🏁 Normal Equations → Closed-Form MLE STEP 5

Set the gradient to zero (g = 0) to find the mode β̂:

Step 1: Set g = 0
0 = (1/σ²)(Xᵀy − XᵀXβ̂)
Step 2: Rearrange into the Normal Equations
XᵀXβ̂ = Xᵀy
Step 3: Invert XᵀX to get the closed-form solution
β̂ = (XᵀX)⁻¹ Xᵀy
Two important notes: (1) σ cancels out — the noise level does not affect where the coefficients land. (2) XᵀX must be invertible — requires no perfectly collinear inputs and N ≥ D+1.
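A NumPy sketch of the closed-form solution (illustrative seed; the lecture's true coefficients β₀ = −0.25, β₁ = 1.15 are used to generate the data), cross-checked against the library least-squares solver:

```python
import numpy as np

rng = np.random.default_rng(4)
sigma = 0.5
x = rng.normal(size=100)
y = rng.normal(-0.25 + 1.15 * x, sigma)
X = np.column_stack([np.ones_like(x), x])

# Closed-form MLE from the normal equations: (X'X)^{-1} X'y.
# np.linalg.solve is preferred over forming the inverse explicitly.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Cross-check against the library least-squares solver.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
assert np.allclose(beta_hat, beta_lstsq)
```

Note that σ appears nowhere in the computation of β̂, matching point (1) above.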
📐 The Hessian H STEP 6

The Hessian is the matrix of second derivatives of L with respect to β. Take the derivative of g with respect to β:

H = ∂g/∂β = − (1/σ²) XᵀX
Critical insight: H does not depend on β. It is a constant matrix. This means the log-likelihood is an exactly quadratic function of β — not just approximately quadratic near the mode.

This is why the Newton–Raphson method converges in exactly one step for the linear model:

β_{k=1} = β_{k=0} − H⁻¹ g = β_{k=0} + (XᵀX)⁻¹(Xᵀy − XᵀXβ_{k=0}) = (XᵀX)⁻¹Xᵀy = β̂
The β_{k=0} terms cancel exactly, leaving β̂ regardless of the starting point. One Newton step jumps directly to the global optimum.
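The one-step convergence can be demonstrated directly. This NumPy sketch (illustrative data and a deliberately bad starting point) takes a single Newton step with the g and H defined above and lands exactly on the MLE:

```python
import numpy as np

rng = np.random.default_rng(5)
sigma = 0.5
x = rng.normal(size=60)
y = rng.normal(-0.25 + 1.15 * x, sigma)
X = np.column_stack([np.ones_like(x), x])

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)   # closed-form MLE

# One Newton-Raphson step from an arbitrary starting point:
# beta_1 = beta_0 - H^{-1} g, with g and H as in the text.
beta_0 = np.array([10.0, -7.0])                # far from the optimum
g = (X.T @ y - X.T @ X @ beta_0) / sigma**2
H = -(X.T @ X) / sigma**2
beta_1 = beta_0 - np.linalg.solve(H, g)

assert np.allclose(beta_1, beta_hat)           # lands on the MLE in one step
```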
📊 Posterior Covariance Matrix STEP 7

Using the Laplace Approximation, the approximate posterior covariance is the negative inverse Hessian evaluated at the mode:

Cov(β,β) = −H⁻¹|β=β̂ = − ( −(1/σ²) XᵀX )⁻¹
Cov(β,β) = σ² (XᵀX)⁻¹
  • Does NOT depend on y (the response)
  • Scales with σ² (more noise → more uncertainty)
  • Requires XᵀX to be invertible
  • Input covariance drives coefficient covariance
🎯 Full MVN Posterior RESULT

With a diffuse prior, the Laplace Approximation gives the exact posterior (because the log-likelihood is exactly quadratic):

β | X, y, σ ~ MVN( β | (XᵀX)⁻¹Xᵀy, σ²(XᵀX)⁻¹ )

This is an exact result for the linear model — not an approximation — because the log-posterior is a perfect quadratic in β.
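A NumPy sketch of the MVN posterior (illustrative seeds and a second, arbitrary coefficient vector for the comparison): two different response vectors move the posterior mean but share the same posterior covariance, since σ²(XᵀX)⁻¹ depends only on the inputs.

```python
import numpy as np

rng = np.random.default_rng(6)
sigma = 0.5
x = rng.normal(size=80)
X = np.column_stack([np.ones_like(x), x])

# Posterior covariance depends only on the inputs, never on y.
cov = sigma**2 * np.linalg.inv(X.T @ X)

y1 = rng.normal(X @ np.array([-0.25, 1.15]), sigma)  # lecture's true beta
y2 = rng.normal(X @ np.array([3.0, -2.0]), sigma)    # arbitrary other beta
mean1 = np.linalg.solve(X.T @ X, X.T @ y1)
mean2 = np.linalg.solve(X.T @ X, X.T @ y2)
assert not np.allclose(mean1, mean2)                 # means differ...
# ...but both posteriors use the SAME covariance matrix `cov`.

# Draw samples from the MVN posterior for the first data set.
draws = rng.multivariate_normal(mean1, cov, size=5000)
assert np.allclose(np.cov(draws.T), cov, atol=5e-4)
```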

✅ Special Case: Intercept-Only Model VERIFICATION

Set D=0. The design matrix is just a column of ones: X = [1, 1, …, 1]ᵀ (N×1). Let's verify the general formula recovers what we already know:

MLE: β̂ = (XᵀX)⁻¹Xᵀy

XᵀX = [1,…,1]·[1,…,1]ᵀ = N

Xᵀy = [1,…,1]·y = Σyₙ = Nȳ

β̂₀ = N⁻¹ · Nȳ = ȳ

The MLE is just the sample mean ✓

Posterior variance: σ²(XᵀX)⁻¹

= σ² · N⁻¹ = σ²/N

Standard deviation = σ/√N

This is the standard error formula ✓

β₀ | X, y, σ ~ Normal( β₀ | ȳ, σ/√N )
This is exactly the Normal–Normal model from earlier in the course! The intercept-only linear model is the simplest possible linear model — it's just the unknown constant mean problem in disguise.
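Both collapses can be verified numerically. A NumPy sketch (the true mean 2.0, σ, and seed are illustrative choices) applies the general matrix formulas to a column-of-ones design matrix:

```python
import numpy as np

rng = np.random.default_rng(7)
sigma = 0.5
y = rng.normal(2.0, sigma, size=100)
X = np.ones((len(y), 1))                   # intercept-only design matrix

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)   # general MLE formula
post_var = sigma**2 * np.linalg.inv(X.T @ X)   # general posterior covariance

# The general formulas collapse to the sample mean and sigma^2 / N.
assert np.isclose(beta_hat[0], y.mean())
assert np.isclose(post_var[0, 0], sigma**2 / len(y))
```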
📋 Key Results at a Glance
Object            Formula                              Notes
Mean trends       μ = Xβ                               N×1 vector; X is N×(D+1)
Log-likelihood    L ∝ −(y−Xβ)ᵀ(y−Xβ) / (2σ²)           Quadratic in β
Gradient          g = (1/σ²)(Xᵀy − XᵀXβ)               (D+1)×1 vector
Hessian           H = −XᵀX / σ²                        Constant: does not depend on β
MLE / MAP         β̂ = (XᵀX)⁻¹Xᵀy                       σ cancels out
Post. covariance  Σ = σ²(XᵀX)⁻¹                        Does not depend on y
Full posterior    β | X,y,σ ~ MVN(β̂, σ²(XᵀX)⁻¹)        Exact (not approximate)