Week 8 established that "linear models" describe the relationship between coefficients and the mean trend — not between the input and the output. Week 9 exploits this distinction to model highly non-linear input-output patterns while keeping all the elegant math of linear regression intact.
The general form of a linear basis model for a single input x:

μₙ = β₀ + Σⱼ βⱼ·φⱼ(xₙ),  with the sum running over basis functions j = 1, …, J

where each φⱼ is a fixed, possibly non-linear transformation of the input.
Wherever you used the design matrix X in Weeks 7–8, you can now substitute the basis design matrix Φ. Nothing else changes.
| Model Name | Mean Trend | Linear? | Why |
|---|---|---|---|
| Simple linear | μ = β₀ + β₁x | ✅ Yes | β's multiply features directly |
| Quadratic | μ = β₀ + β₁x + β₂x² | ✅ Yes | x² is a feature; β₂ still multiplies it linearly |
| Sine wave | μ = β₀ + β₁sin(x) | ✅ Yes | sin(x) is the feature; β₁ enters linearly |
| Sigmoid | μ = 1 / (1 + e^(−βx)) | ❌ No | β appears inside the exponential, so it is not linear in β |
Linearity in the parameters means we can always write the mean trend as a matrix multiplication:

μ = Φβ

This single equation unlocks everything: the MLE, the posterior, and the predictive machinery from Weeks 7–8 all carry over once Φ replaces X.
The basis design matrix Φ has N rows (one per observation) and J+1 columns (one per basis feature plus intercept). Each entry is the j-th basis function evaluated at observation n:

Φₙ,ⱼ = φⱼ(xₙ)
The j-th polynomial feature is simply the input raised to the j-th power: φⱼ(x) = xʲ.
A degree-1 polynomial (J=1) is just standard linear regression. Degree-2 adds a curvature term. The lecture compared 1st through 9th degree polynomials in an earlier homework.
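As a quick illustration (toy x values, not course data), R's `poly()` with `raw = TRUE` builds exactly these power features:

```r
# Sketch: degree-3 polynomial design matrix Phi (toy x values)
x <- c(0.5, 1.0, 1.5, 2.0)
Phi <- model.matrix(~ poly(x, degree = 3, raw = TRUE))
dim(Phi)   # 4 observations x 4 columns: intercept, x, x^2, x^3
Phi[2, ]   # row for x = 1: every power of 1 equals 1
```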
When data has periodic structure, a sinusoidal basis is highly effective with just two parameters:

μₙ = β₀ + β₁sin(xₙ)

The design matrix evaluates sin(xₙ) at each observation, giving the row Φₙ = [1  sin(xₙ)].
`model.matrix(y ~ sin(x), data = df)` automatically builds this matrix with an intercept column and a `sin(x)` column.
Once Φ is constructed, the mean trend vector is simply:

μ = Φβ

Expanded for the sine wave example:

[1 sin(x₁)]        [β₀ + β₁sin(x₁)]
[1 sin(x₂)] [β₀] = [β₀ + β₁sin(x₂)]
[1 sin(x₃)] [β₁]   [β₀ + β₁sin(x₃)]
[⋮    ⋮   ]        [       ⋮      ]
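A minimal sketch of this computation, with made-up x values and coefficients:

```r
# Sketch: mean trend mu = Phi beta for the sine-wave basis (hypothetical numbers)
x    <- c(0, pi/2, pi)
beta <- c(2, 0.5)                # hypothetical beta0 = 2, beta1 = 0.5
Phi  <- cbind(1, sin(x))         # N x 2 design matrix: intercept and sin(x)
mu   <- as.vector(Phi %*% beta)  # each entry is beta0 + beta1*sin(x_n)
round(mu, 3)                     # 2.0, 2.5, 2.0
```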
The polynomial and sine bases we've seen so far work beautifully — when you already have a hypothesis about the shape. A periodic signal? Use sin(x). A smooth global curve? Try a polynomial. But in practice, data rarely comes with a label attached that tells you the underlying functional form.
Consider a response variable that rises, then plateaus, then dips — or a relationship that behaves differently in different regions of x. A single sine wave won't capture it. A high-degree polynomial will oscillate wildly at the edges. What you need is a basis flexible enough to discover the shape from the data itself, without committing to one up front.
A spline is a piecewise polynomial that "stitches together" low-order polynomial segments. The joining points are called knots.
The degrees of freedom of a natural spline control how many distinct polynomial pieces — and therefore how flexible — the model is.
| DOF | Columns in Φ | Coefficients estimated | Behavior |
|---|---|---|---|
| 4 | 5 | 5 | Smooth, possibly underfit |
| 9 | 10 | 10 | Good middle ground |
| 25 | 26 | 26 | Very flexible, risk of overfit |
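To see the column counts in the table, one can build the natural spline design matrix directly (toy data; `splines` ships with base R):

```r
# Sketch: natural spline basis with DOF = 9 gives 9 columns plus the intercept
library(splines)
x   <- seq(0, 10, length.out = 50)
Phi <- model.matrix(~ ns(x, df = 9))
dim(Phi)   # 50 x 10: one row per observation, intercept + 9 spline features
```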
The spline model does not simply add the features together. Each feature is multiplied by its learned weight β:
μₙ ≠ s₁(xₙ) + s₂(xₙ) + ··· + sⱼ(xₙ)
This just sums all the polynomial basis curves and gives a rigid S-shaped trend.
μₙ = β₀ + β₁s₁(xₙ) + β₂s₂(xₙ) + ··· + βⱼsⱼ(xₙ)
The learned β weights amplify or suppress each polynomial piece, creating flexible non-linear fits.
Increasing DOF adds flexibility but introduces the classic bias-variance tradeoff: too few degrees of freedom underfit (high bias), while too many chase noise (high variance, overfitting).
Many spline flavors exist. The course focuses on natural splines via `splines::ns()` in R, but the linear basis framework applies to all of them.
Reference: ISL Section 7.4 for construction details. All are linear basis models under the hood.
When we have two continuous inputs x₁ and x₂, the simplest model is linear additive:

μᵢ = β₀ + β₁·xᵢ,₁ + β₂·xᵢ,₂
In this model, x₂ has no effect on the slope relating x₁ to μ. It can only shift the trend up or down. No matter what x₂ is, the lines of μ vs. x₁ are perfectly parallel — x₂ just changes their intercept. The same is true in reverse: x₁ simply shifts the trend with respect to x₂.
An interaction is the statistics term for multiplication. We add a product term x₁·x₂ to the model:

μᵢ = β₀ + β₁·xᵢ,₁ + β₂·xᵢ,₂ + β₃·xᵢ,₁·xᵢ,₂
The interaction term β₃·x₁·x₂ is still a linear model — β₃ enters the mean trend as a simple multiplicative constant. The product x₁·x₂ is just another feature column in the design matrix, derived from the two inputs.
The interaction term creates a slope on x₁ that depends on x₂. Rearrange the model by grouping the x₁ terms:

μᵢ = (β₀ + β₂·xᵢ,₂) + (β₁ + β₃·xᵢ,₂)·xᵢ,₁
Define the effective slope on x₁:

β̃₁ = β₁ + β₃·x₂
This means: the higher (or lower) the value of x₂, the steeper (or shallower) the slope linking x₁ to μ. The lines of μ vs. x₁ are no longer parallel — they fan out from a pivot point. The position of that pivot is:

x₁ = −β₂ / β₃

the value of x₁ at which the x₂ terms cancel, so μ no longer depends on x₂.
One useful way to think about interactions is as a hierarchical model — the slope itself has a model:
β̃₁ = β₁ + β₃ · xᵢ,₂
This is fully equivalent to writing out the product β₃·x₁·x₂ directly. The benefit of the hierarchical view is interpretability: β₁ is the slope on x₁ when x₂ = 0, and β₃ is how much that slope changes per unit increase in x₂.
When you visualize the mean trend μ as a surface over (x₁, x₂):
| Sign of β₃ | Effect on slope of x₁ | Visual behavior |
|---|---|---|
| β₃ > 0 | High x₂ → steeper positive slope; low x₂ → shallower (or negative) slope | Lines fan out rightward — larger x₂ lines are steeper going up to the right |
| β₃ < 0 | High x₂ → shallower (or negative) slope; low x₂ → steeper positive slope | Lines fan out leftward — larger x₂ lines tilt more negatively |
| β₃ = 0 | x₂ has no effect on the slope of x₁ — pure additive model | Perfectly parallel lines regardless of x₂ |
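A tiny numeric check of the fan-out behavior, using hypothetical coefficients with β₃ > 0:

```r
# Sketch: effective slope on x1 depends on x2 (hypothetical betas, beta3 > 0)
b0 <- 1; b1 <- 2; b2 <- -1; b3 <- 0.5
mu <- function(x1, x2) b0 + b1*x1 + b2*x2 + b3*x1*x2
slope_at <- function(x2) mu(1, x2) - mu(0, x2)   # rise over a unit step in x1
slope_at(0)    # 2: just b1
slope_at(4)    # 4: b1 + b3*4, steeper for larger x2
slope_at(-4)   # 0: shallower, consistent with the beta3 > 0 row
```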
Interactions are not limited to linear features; you can combine any basis features.
An interaction with spline features would multiply a basis feature from x₁ with one from x₂. In all cases, the result is still a linear model in the parameters — just with more feature columns in the design matrix.
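In R's formula interface, the product column is generated automatically; a sketch with toy data:

```r
# Sketch: ~ x1 * x2 expands to main effects plus the x1:x2 product column (toy data)
df <- data.frame(x1 = c(1, 2), x2 = c(3, 4))
X  <- model.matrix(~ x1 * x2, data = df)
colnames(X)      # "(Intercept)" "x1" "x2" "x1:x2"
X[, "x1:x2"]     # 3, 8: the elementwise product of the inputs
```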
When the two inputs x₁ and x₂ are linearly related — say x₂ = a + b·x₁ — we can substitute this relationship into the mean trend and simplify:

μᵢ = β₀ + β₁·xᵢ,₁ + β₂·(a + b·xᵢ,₁)
   = (β₀ + β₂a) + (β₁ + β₂b)·xᵢ,₁
   = β̃₀ + β̃₁ · xᵢ,₁
The model collapses to just two unknowns (β̃₀ and β̃₁) even though we are trying to estimate three (β₀, β₁, β₂). We are only ever learning a weighted combination of the original coefficients — not the individual values.
When ρ ≠ 0 but |ρ| < 1, the MLE still technically exists and is unique. At perfect linear dependence (x₂ = a + b·x₁ exactly), however, there are infinitely many combinations of β₁ and β₂ that produce the same effective slope β̃₁ = β₁ + β₂·b, and the data cannot choose among them. For example, if b = 1:
β₁ = 5, β₂ = −2 → 5 − 2 = 3
β₁ = 0, β₂ = 3 → 0 + 3 = 3
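These two coefficient settings really do produce identical mean trends when x₂ = x₁ exactly; a quick check (the intercept of 1 is arbitrary):

```r
# Sketch: with x2 = x1 (a = 0, b = 1), the two beta pairs are indistinguishable
x1 <- c(0, 1, 2, 3)
x2 <- x1
mu_a <- 1 + 5*x1 - 2*x2    # beta1 = 5, beta2 = -2
mu_b <- 1 + 0*x1 + 3*x2    # beta1 = 0, beta2 = 3
all.equal(mu_a, mu_b)      # both equal 1 + 3*x1
```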
The posterior covariance matrix (assuming known noise σ and diffuse priors) is:

Σ_post = σ² (XᵀX)⁻¹
The off-diagonal entries of (XᵀX)⁻¹ give the posterior covariances between parameters; rescaled to correlations (e.g. with `cov2cor()`), they yield a key empirical result from the lecture:
| Input correlation ρ | Posterior correlation between β₁ and β₂ | Intuition |
|---|---|---|
| ρ ≈ 0 | ≈ 0 (near zero) | Inputs are independent; parameters can be learned separately |
| ρ = +0.9 | ≈ −0.88 (strongly negative) | x₁↑ moves with x₂↑; to keep predictions stable, β₁ and β₂ must move in opposite directions |
| ρ = −0.9 | ≈ +0.88 (strongly positive) | x₁↑ as x₂↓; parameters move together to compensate |
Correlated inputs don't just change the direction of uncertainty — they inflate its magnitude. From the lecture's simulation (100 observations, σ = 1):
| ρ | Posterior SD of β₁ | Posterior SD of β₂ |
|---|---|---|
| 0 | ~0.095 | ~0.102 |
| +0.9 | ~0.222 | ~0.214 |
| −0.9 | ~0.214 | ~0.222 |
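A sketch along the lines of the lecture's simulation (my own seed and data-generating choices, so exact numbers will differ from the table):

```r
# Sketch: correlated inputs inflate posterior SDs and induce negative posterior correlation
# Assumes sigma = 1 and diffuse priors, so posterior covariance = (X^T X)^{-1}
set.seed(1)                                   # arbitrary seed, not from the lecture
n   <- 100
rho <- 0.9
x1  <- rnorm(n)
x2  <- rho * x1 + sqrt(1 - rho^2) * rnorm(n)  # cor(x1, x2) is approximately rho
X   <- cbind(1, x1, x2)
post_cov <- solve(t(X) %*% X)
sqrt(diag(post_cov))[2:3]                     # posterior SDs of beta1, beta2 (inflated)
cov2cor(post_cov)[2, 3]                       # strongly negative, near -rho
```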
Steps to detect and understand multicollinearity in practice:

1. Inspect the input correlation matrix:

```r
cor(data[, input_cols])
```

Large off-diagonal values (|r| > 0.7–0.8) signal potential problems.

2. Build the design matrix and compute the posterior covariance and correlation:

```r
X <- model.matrix(~ x1 + x2, data)
SSmat <- t(X) %*% X
post_cov <- solve(SSmat)  # assuming σ = 1
cov2cor(post_cov)         # posterior correlation matrix
```

High off-diagonal entries in the posterior correlation confirm that individual parameter estimates are unreliable.

3. Check the posterior standard deviations:

```r
sqrt(diag(solve(SSmat)))
```

Much larger SDs than you'd expect from sample size alone indicate multicollinearity is inflating uncertainty.
If you only care about predicting the output, multicollinearity often isn't a crisis. The combined prediction can still be accurate — the uncertainty is in attributing the effect to x₁ vs. x₂.
If you need to understand which input causes the outcome, multicollinearity is a serious problem. You cannot reliably separate β₁ from β₂ — the data simply don't contain that information.
|ρ(x₁, x₂)| → 1 ⟹ Posterior SD of β₁, β₂ → ∞
XᵀX is rank-deficient ⟺ perfect linear dependence among columns of X
The geometry: correlated inputs mean the data only "explore" a thin ridge in the (x₁, x₂) input space, so you can only estimate the effective slope along that ridge — not the individual contributions perpendicular to it.