A linear model predicts a continuous response y from an input x using a straight-line mean trend. We have N observed input–output pairs: (xₙ, yₙ) for n = 1, …, N.
Rather than writing yₙ = β₀ + β₁xₙ + εₙ (which hides the model structure), the lecture uses the likelihood notation:
μₙ = β₀ + β₁xₙ
The linear predictor — the mean trend. Given β₀, β₁, and xₙ, the value of μₙ is always the same. No randomness here.
yₙ | μₙ, σ ~ Normal(μₙ, σ)
The noise — how observations scatter around the mean. σ controls the width of this scatter. Higher σ = more noise.
We assume observations are conditionally independent given the input and parameters. This means the joint probability of all N responses factors into a product of N individual likelihoods:
Each factor is the n-th observation's likelihood. Multiplying them together gives the full data likelihood over all N observations.
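Written out in LaTeX notation, the factorized likelihood described above is:

```latex
p(y_1,\dots,y_N \mid x, \beta_0, \beta_1, \sigma)
  = \prod_{n=1}^{N} \mathrm{Normal}\!\left(y_n \mid \mu_n, \sigma\right),
  \qquad \mu_n = \beta_0 + \beta_1 x_n .
```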
Week 7 begins the supervised learning deep-dive. Linear models are the entry point to a hierarchy of supervised models.
Adjust the sliders to see how each parameter shapes the mean trend (white line) and the observation ribbons. The ribbon shows where ~68 / 95 / 99.7% of observations are expected to fall.
The intercept β₀ is the value of the mean μ when x = 0.
Changing β₀ shifts the entire line up or down without rotating it. Lines with different β₀ but the same β₁ are perfectly parallel.
The slope β₁ determines how much μ changes per unit increase in x.
- β₁ = 0 → flat line, x has no effect on y
- β₁ > 0 → positive relationship
- β₁ < 0 → negative (inverse) relationship
σ controls the spread of observations around the mean trend. It is the standard deviation of the Normal likelihood. σ does not change the mean trend — it only changes how tightly observations cluster around it.
| Interval | Coverage (Normal) | What it shows |
|---|---|---|
| μ ± 1σ | ≈ 68.3% | Inner ribbon — most observations land here |
| μ ± 2σ | ≈ 95.4% | Middle ribbon |
| μ ± 3σ | ≈ 99.7% | Outer ribbon — almost all observations |
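The coverage percentages in the table can be checked directly from the standard Normal CDF; a minimal Python sketch using only the standard library's `erf`:

```python
from math import erf, sqrt

def normal_coverage(k):
    """P(|Z| <= k) for a standard Normal Z, via the error function."""
    return erf(k / sqrt(2))

# Coverage of the mu +/- k*sigma ribbons from the table
for k in (1, 2, 3):
    print(f"mu +/- {k} sigma: {100 * normal_coverage(k):.1f}%")
```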
At each input value xₙ, the response follows a different Normal distribution:
Example from the lecture: at xₙ = −1.5 with β₀ = 1, β₁ = 1, σ = 0.5, the mean is μₙ = 1 + 1·(−1.5) = −0.5, so yₙ ~ Normal(−0.5, 0.5).
We want to find the values of β = (β₀, β₁) that best explain the data. With σ known, we maximize the likelihood (or equivalently, the log-likelihood) over β:
Since log p(y|x,β,σ) = const − SSE/(2σ²), maximizing the log-likelihood is equivalent to minimizing the sum of squared errors (SSE). The same optimum has two framings:

- Statistical / probabilistic framing: argmaxᵦ log p(y|x,β,σ). Find the β that makes the data most probable.
- Geometric framing: argminᵦ Σₙ (yₙ − μₙ)². Find the β that minimizes squared prediction errors.

Both are solved by `lm()` in R.
The residual εₙ is the difference between the observed response and the model's predicted mean: εₙ = yₙ − μₙ = yₙ − (β₀ + β₁xₙ).
```r
# Generate data (true β₀ = -0.25, β₁ = 1.15, σ = 0.5)
set.seed(42)
x <- rnorm(n = 100)
mu_true <- -0.25 + 1.15 * x
y <- rnorm(n = length(x), mean = mu_true, sd = 0.5)

# Fit linear model (OLS = MLE under a Gaussian likelihood)
my_df <- data.frame(x = x, y = y)
mod <- lm(y ~ x, data = my_df)
summary(mod)  # shows β̂₀, β̂₁ and their uncertainty
coef(mod)     # extract coefficient estimates
```
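The same fit can be reproduced from the normal equations. A Python/numpy sketch of the same generative story (the random draws differ from R's, so only closeness to the true values, and agreement with numpy's own least-squares fit, is checked):

```python
import numpy as np

rng = np.random.default_rng(42)

# Same generative story as the R snippet: true beta0 = -0.25, beta1 = 1.15, sigma = 0.5
x = rng.standard_normal(100)
y = -0.25 + 1.15 * x + 0.5 * rng.standard_normal(100)

# Design matrix with an intercept column of ones
X = np.column_stack([np.ones_like(x), x])

# MLE / OLS via the normal equations: beta_hat = (X^T X)^{-1} X^T y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat)  # close to the true (-0.25, 1.15)
```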
In the Bayesian framework we treat β = (β₀, β₁) as unknown random variables and place priors on them. The posterior combines the prior belief with the data via the likelihood:
We assume σ is known for now. The unknowns are only β₀ and β₁.
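In symbols, with σ fixed the posterior over β = (β₀, β₁) is:

```latex
p(\beta \mid y, x, \sigma)
  \;\propto\; p(y \mid x, \beta, \sigma)\, p(\beta)
  \;=\; \left[\prod_{n=1}^{N} \mathrm{Normal}(y_n \mid \beta_0 + \beta_1 x_n,\ \sigma)\right] p(\beta).
```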
p(β) = Normal(β₀|0,1) × Normal(β₁|0,1)
Parameters are pulled toward 0. Rules out extreme values even with little data. With just 1 observation, the prior constrains the trade-off.
p(β) = Normal(β₀|0,20) × Normal(β₁|0,20)
Nearly flat — the posterior follows the likelihood. With 1 observation all we learn is the intercept–slope trade-off.
With only 1 observation and a diffuse prior, the posterior contours appear as diagonal parallel lines. Why?
Many different (β₀, β₁) pairs produce the same value for μₙ₌₁. Increasing the slope while decreasing the intercept by the right amount keeps μ fixed. The likelihood is constant along that diagonal.
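A tiny numeric illustration of this ridge. The observation (x₁ = −1.5, y₁ = 2) and the two coefficient pairs below are made up for illustration; they are chosen so both pairs give the same μ:

```python
from math import exp, pi, sqrt

def normal_pdf(y, mu, sigma):
    """Density of Normal(mu, sigma) at y."""
    return exp(-0.5 * ((y - mu) / sigma) ** 2) / (sigma * sqrt(2 * pi))

# One observation (illustrative values) and a known sigma
x1, y1, sigma = -1.5, 2.0, 0.5

# Two different (beta0, beta1) pairs with identical beta0 + beta1 * x1
mu_a = 1.0 + (-0.5) * x1   # beta0 = 1.0, beta1 = -0.5  ->  mu = 1.75
mu_b = 2.5 + 0.5 * x1      # beta0 = 2.5, beta1 =  0.5  ->  mu = 1.75

# Same mu => identical likelihood: the likelihood is constant along this diagonal
print(normal_pdf(y1, mu_a, sigma) == normal_pdf(y1, mu_b, sigma))  # True
```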
The lecture walks through adding the first 10 observations one at a time, watching the joint posterior over (β₀, β₁) evolve:
| N | Posterior shape | What's happening |
|---|---|---|
| 0 | Circular (= prior) | No data — just the prior |
| 1 | Diagonal parallel lines | One constraint: μₙ₌₁ is known; β₀↔β₁ trade-off |
| 2 | Elongated ellipse | Two constraints begin pinning down both parameters |
| 3–5 | Tighter ellipse | Ellipse rotates and shrinks; center near true β |
| 10 | Tight circular blob | Converged near true (β₀=−0.25, β₁=1.15) |
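This evolution can be sketched with the conjugate Gaussian update for the posterior covariance. Everything below (seed, draws) is illustrative; it assumes the diffuse Normal(0, 20) priors and σ = 0.5 from the notes:

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed setup mirroring the notes: known sigma, independent Normal(0, 20) priors,
# true beta = (-0.25, 1.15). The data draws are illustrative, not the lecture's.
sigma, prior_sd = 0.5, 20.0
beta_true = np.array([-0.25, 1.15])

x = rng.standard_normal(10)
X = np.column_stack([np.ones_like(x), x])
y = X @ beta_true + sigma * rng.standard_normal(10)

prior_prec = np.eye(2) / prior_sd**2
traces = []
for n in range(11):                        # n = 0 is the prior itself
    Xn = X[:n]
    post_prec = prior_prec + Xn.T @ Xn / sigma**2
    post_cov = np.linalg.inv(post_prec)    # exact Gaussian posterior covariance
    traces.append(np.trace(post_cov))

# Posterior uncertainty only ever shrinks as observations arrive
print([round(t, 3) for t in traces])
```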
The joint posterior over (β₀, β₁) induces a distribution over the mean trend μ(x) at every x. We can summarize it as a ribbon. The visualization below shows the key difference between the confidence interval (uncertainty about the mean line itself) and the observation noise bands from the explorer tab (where σ lives).
| Ribbon | Source of uncertainty | Behaviour as N→∞ |
|---|---|---|
| Confidence Interval | Uncertainty in β (where is the true mean line?) | Shrinks to zero — data pins down β |
| Observation (±2σ) band | Irreducible noise σ around the mean | Stays the same — σ is a property of the world |
The variance of the predicted mean trend at any input value x has a closed form: Var[μ̂(x)] = σ²(1/N + (x − x̄)²/Sₓₓ).
| Term | Meaning |
|---|---|
| x̄ | Mean of the observed input values |
| Sₓₓ | Sum of squared deviations of inputs: Σ(xₙ − x̄)² |
| 1/N | Uncertainty from estimating the overall level (intercept) |
| (x−x̄)²/Sₓₓ | Extra uncertainty that grows as we predict further from the data center |
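A quick numeric sketch of this formula (the five input values are made up): the variance is smallest at the data centre x̄ and grows quadratically as we extrapolate.

```python
import numpy as np

sigma = 0.5
x_obs = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])   # illustrative inputs, xbar = 0
N, xbar = len(x_obs), x_obs.mean()
Sxx = np.sum((x_obs - xbar) ** 2)               # sum of squared deviations = 10

def var_mu_hat(x):
    """sigma^2 * (1/N + (x - xbar)^2 / Sxx): uncertainty about the mean line at x."""
    return sigma**2 * (1.0 / N + (x - xbar) ** 2 / Sxx)

for x in (0.0, 2.0, 4.0):
    print(x, var_mu_hat(x))
```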
The informative prior (sd = 1) constrains the parameters away from extreme values. After just 1 observation, it already turns the diffuse-prior ridge into a bounded region of plausible (β₀, β₁) values.
With a single input the mean trend is μₙ = β₀ + β₁xₙ. But real problems have dozens or hundreds of inputs. We need a notation that scales cleanly to D inputs without writing out every term.
μ = Xβ.
| Case | Mean trend | Index meaning |
|---|---|---|
| Single input | μₙ = β₀ + β₁xₙ | n = observation index |
| Multiple inputs (D=3) | μₙ = β₀ + β₁xₙ,₁ + β₂xₙ,₂ + β₃xₙ,₃ | n = observation, d = input feature |
xₙ,d means the n-th observation of the d-th input. Think of n as the row index and d as the column index of a matrix.
For D inputs, the mean trend is the intercept plus a sum of D coefficient–input products: μₙ = β₀ + β₁xₙ,₁ + ⋯ + βDxₙ,D.
To pull the intercept inside the sum, introduce a fake variable xₙ,₀ ≡ 1: then μₙ = β₀xₙ,₀ + β₁xₙ,₁ + ⋯ + βDxₙ,D, a single sum over d = 0, …, D.
Stack all N observations row-by-row. Each row is the n-th observation's input vector (including the fake intercept column):
| 1 | x1,1 | x1,2 | ··· | x1,D |
| 1 | x2,1 | x2,2 | ··· | x2,D |
| ⋮ | ⋮ | ⋮ | ⋱ | ⋮ |
| 1 | xN,1 | xN,2 | ··· | xN,D |
| Dimension | Meaning |
|---|---|
| N × (D+1) | N observations, D inputs plus the intercept column of ones |
| Row xₙ,: | The n-th observation's full input row vector (1 × (D+1)) |
| Column x:,d | All N values of the d-th input feature |
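Constructing such a design matrix is one line in most array libraries; a Python/numpy sketch with toy numbers (N = 3 observations, D = 2 inputs):

```python
import numpy as np

# Raw inputs: N = 3 rows (observations), D = 2 feature columns
raw = np.array([[ 0.5, -1.0],
                [ 1.5,  2.0],
                [-0.5,  0.0]])

# Prepend the intercept column of ones to get the N x (D+1) design matrix
X = np.column_stack([np.ones(len(raw)), raw])

print(X.shape)   # (3, 3), i.e. N x (D+1)
print(X[0])      # row x_{1,:} including the fake intercept entry
```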
Organise the D+1 coefficients into a column vector β. The mean trend for the n-th observation is a row–column inner product: μₙ = xₙ,: β.
Stacking all N inner products gives the full N×1 vector of mean trends in one matrix multiplication: μ = Xβ.
The scalar SSE sum becomes a compact quadratic form: SSE = Σₙ (yₙ − μₙ)² = (y − Xβ)ᵀ(y − Xβ).
So the full log-posterior (with a diffuse prior) in matrix notation is L(β) = −(y − Xβ)ᵀ(y − Xβ)/(2σ²) + const.
Differentiate the log-likelihood with respect to β (using matrix calculus, the chain-rule analog for vectors): g = ∂L/∂β = (1/σ²)(Xᵀy − XᵀXβ).
Xᵀy: the projection of the response onto the input space. Size: (D+1)×1.
XᵀX: captures relationships between inputs. Size: (D+1)×(D+1). If inputs are centered, XᵀX ≈ N × Cov(inputs).
Set the gradient to zero (g = 0) to find the mode β̂: Xᵀy = XᵀXβ̂, so β̂ = (XᵀX)⁻¹Xᵀy (note that σ² cancels).
The Hessian is the matrix of second derivatives of L with respect to β. Taking the derivative of g with respect to β gives H = −XᵀX/σ², which is constant in β.
This is why the Newton–Raphson method converges in exactly one step for the linear model: starting from any β⁽⁰⁾, the update β⁽¹⁾ = β⁽⁰⁾ − H⁻¹g(β⁽⁰⁾) = (XᵀX)⁻¹Xᵀy lands exactly on β̂.
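A numeric check of the one-step property (data and starting point are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)

# For the linear model the Hessian is constant, so one Newton step from ANY
# starting point lands on beta_hat = (X^T X)^{-1} X^T y.
sigma = 0.5
x = rng.standard_normal(50)
X = np.column_stack([np.ones_like(x), x])
y = X @ np.array([-0.25, 1.15]) + sigma * rng.standard_normal(50)

beta0 = np.array([10.0, -10.0])               # deliberately bad starting point
g = (X.T @ y - X.T @ X @ beta0) / sigma**2    # gradient at beta0
H = -(X.T @ X) / sigma**2                     # constant Hessian
beta1 = beta0 - np.linalg.solve(H, g)         # one Newton step

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)  # closed-form MLE
print(np.allclose(beta1, beta_hat))           # True: converged in one step
```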
Using the Laplace Approximation, the approximate posterior covariance is the inverse of the negative Hessian evaluated at the mode: Σ = (−H)⁻¹ = σ²(XᵀX)⁻¹.
With a diffuse prior, the Laplace Approximation gives the exact posterior (because the log-likelihood is exactly quadratic): β | X, y, σ ~ MVN(β̂, σ²(XᵀX)⁻¹).
This is an exact result for the linear model — not an approximation — because the log-posterior is a perfect quadratic in β.
Set D=0. The design matrix is just a column of ones: X = [1, 1, …, 1]ᵀ (N×1). Let's verify the general formula recovers what we already know:
XᵀX = [1,…,1]·[1,…,1]ᵀ = N
Xᵀy = [1,…,1]·y = Σyₙ = Nȳ
β̂₀ = N⁻¹ · Nȳ = ȳ
The MLE is just the sample mean ✓
Var(β̂₀) = σ²(XᵀX)⁻¹ = σ² · N⁻¹ = σ²/N
Standard deviation = σ/√N
This is the standard error formula ✓
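The same reduction can be verified numerically; a Python sketch with illustrative data:

```python
import numpy as np

# Sanity check of the D = 0 special case: X is a single column of ones,
# so the general formulas must reduce to the sample mean and sigma/sqrt(N).
sigma, N = 0.5, 100
rng = np.random.default_rng(3)
y = 1.0 + sigma * rng.standard_normal(N)

X = np.ones((N, 1))
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)          # (X^T X)^{-1} X^T y
post_sd = np.sqrt(sigma**2 * np.linalg.inv(X.T @ X))  # sqrt of sigma^2 (X^T X)^{-1}

print(beta_hat[0], y.mean())            # agree to floating-point precision
print(post_sd[0, 0], sigma / np.sqrt(N))  # both equal the standard error
```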
| Object | Formula | Notes |
|---|---|---|
| Mean trends | μ = Xβ | N×1 vector; X is N×(D+1) |
| Log-likelihood | L ∝ −(y−Xβ)ᵀ(y−Xβ) / 2σ² | Quadratic in β |
| Gradient | g = (1/σ²)(Xᵀy − XᵀXβ) | (D+1)×1 vector |
| Hessian | H = −XᵀX / σ² | Constant — does not depend on β |
| MLE / MAP | β̂ = (XᵀX)⁻¹Xᵀy | σ cancels out |
| Post. covariance | Σ = σ²(XᵀX)⁻¹ | Does not depend on y |
| Full posterior | β \| X,y,σ ~ MVN(β̂, σ²(XᵀX)⁻¹) | Exact (not approximate) |