Week 8 established that "linear models" describe the relationship between coefficients and the mean trend — not between the input and the output. Week 9 exploits this distinction to model highly non-linear input-output patterns while keeping all the elegant math of linear regression intact.
The general form of a linear basis model for a single input x:

μₙ = β₀ + Σⱼ βⱼ·φⱼ(xₙ),  with the sum running over basis functions j = 1, …, J

where each φⱼ is a fixed, possibly non-linear transformation of the input.
Wherever you used the design matrix X in Weeks 7–8, you can now substitute the basis design matrix Φ. Nothing else changes.
| Model Name | Mean Trend | Linear? | Why |
|---|---|---|---|
| Simple linear | μ = β₀ + β₁x | ✅ Yes | β's multiply features directly |
| Quadratic | μ = β₀ + β₁x + β₂x² | ✅ Yes | x² is a feature; β₂ still multiplies it linearly |
| Sine wave | μ = β₀ + β₁sin(x) | ✅ Yes | sin(x) is the feature; β₁ enters linearly |
| Sigmoid | μ = 1 / (1 + e^(−βx)) | ❌ No | β appears inside the exponential, so it is not linear in β |
Linearity in the parameters means we can always write the mean trend as a matrix multiplication:

μ = Φβ

This single equation unlocks everything: the MLE, the posterior, and the predictive machinery from Weeks 7–8 all carry over once Φ replaces X.
The basis design matrix Φ has N rows (one per observation) and J+1 columns (one per basis feature plus intercept). Each entry is the j-th basis function evaluated at observation n:

Φₙ,ⱼ = φⱼ(xₙ)
The j-th polynomial feature is simply the input raised to the j-th power: φⱼ(x) = xʲ.
A degree-1 polynomial (J=1) is just standard linear regression. Degree-2 adds a curvature term. The lecture compared 1st through 9th degree polynomials in an earlier homework.
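As a quick illustration (toy x values, not course data), R's `poly()` with `raw = TRUE` builds exactly these power features:

```r
# Sketch: degree-3 polynomial design matrix Phi (toy x values)
x <- c(0.5, 1.0, 1.5, 2.0)
Phi <- model.matrix(~ poly(x, degree = 3, raw = TRUE))
dim(Phi)   # 4 observations x 4 columns: intercept, x, x^2, x^3
Phi[2, ]   # row for x = 1: every power of 1 equals 1
```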
When data has periodic structure, a sinusoidal basis is highly effective with just two parameters:

μₙ = β₀ + β₁sin(xₙ)

The design matrix evaluates sin(xₙ) at each observation, giving the row Φₙ = [1  sin(xₙ)].
`model.matrix(y ~ sin(x), data = df)` automatically builds this matrix with an intercept column and a `sin(x)` column.
Once Φ is constructed, the mean trend vector is simply:

μ = Φβ

Expanded for the sine wave example:

[1 sin(x₁)]        [β₀ + β₁sin(x₁)]
[1 sin(x₂)] [β₀] = [β₀ + β₁sin(x₂)]
[1 sin(x₃)] [β₁]   [β₀ + β₁sin(x₃)]
[⋮    ⋮   ]        [       ⋮      ]
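A minimal sketch of this computation, with made-up x values and coefficients:

```r
# Sketch: mean trend mu = Phi beta for the sine-wave basis (hypothetical numbers)
x    <- c(0, pi/2, pi)
beta <- c(2, 0.5)                # hypothetical beta0 = 2, beta1 = 0.5
Phi  <- cbind(1, sin(x))         # N x 2 design matrix: intercept and sin(x)
mu   <- as.vector(Phi %*% beta)  # each entry is beta0 + beta1*sin(x_n)
round(mu, 3)                     # 2.0, 2.5, 2.0
```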
The polynomial and sine bases we've seen so far work beautifully — when you already have a hypothesis about the shape. A periodic signal? Use sin(x). A smooth global curve? Try a polynomial. But in practice, data rarely comes with a label attached that tells you the underlying functional form.
Consider a response variable that rises, then plateaus, then dips — or a relationship that behaves differently in different regions of x. A single sine wave won't capture it. A high-degree polynomial will oscillate wildly at the edges. What you need is a basis flexible enough to discover the shape from the data itself, without committing to one up front.
A spline is a piecewise polynomial that "stitches together" low-order polynomial segments. The joining points are called knots.
The degrees of freedom of a natural spline control how many distinct polynomial pieces — and therefore how flexible — the model is.
| DOF | Columns in Φ | Coefficients estimated | Behavior |
|---|---|---|---|
| 4 | 5 | 5 | Smooth, possibly underfit |
| 9 | 10 | 10 | Good middle ground |
| 25 | 26 | 26 | Very flexible, risk of overfit |
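To see the column counts in the table, one can build the natural spline design matrix directly (toy data; `splines` ships with base R):

```r
# Sketch: natural spline basis with DOF = 9 gives 9 columns plus the intercept
library(splines)
x   <- seq(0, 10, length.out = 50)
Phi <- model.matrix(~ ns(x, df = 9))
dim(Phi)   # 50 x 10: one row per observation, intercept + 9 spline features
```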
The spline model does not simply add the features together. Each feature is multiplied by its learned weight β:
μₙ ≠ s₁(xₙ) + s₂(xₙ) + ··· + sⱼ(xₙ)
This just sums all the polynomial basis curves and gives a rigid S-shaped trend.
μₙ = β₀ + β₁s₁(xₙ) + β₂s₂(xₙ) + ··· + βⱼsⱼ(xₙ)
The learned β weights amplify or suppress each polynomial piece, creating flexible non-linear fits.
Increasing DOF adds flexibility but introduces the classic bias-variance tradeoff: too few degrees of freedom underfit (high bias), while too many chase noise (high variance, overfitting).
Many spline flavors exist. The course focuses on natural splines via `splines::ns()` in R, but the linear basis framework applies to all of them.
Reference: ISL Section 7.4 for construction details. All are linear basis models under the hood.
When we have two continuous inputs x₁ and x₂, the simplest model is linear additive:

μᵢ = β₀ + β₁·xᵢ,₁ + β₂·xᵢ,₂
In this model, x₂ has no effect on the slope relating x₁ to μ. It can only shift the trend up or down. No matter what x₂ is, the lines of μ vs. x₁ are perfectly parallel — x₂ just changes their intercept. The same is true in reverse: x₁ simply shifts the trend with respect to x₂.
An interaction is the statistics term for multiplication. We add a product term x₁·x₂ to the model:

μᵢ = β₀ + β₁·xᵢ,₁ + β₂·xᵢ,₂ + β₃·xᵢ,₁·xᵢ,₂
The interaction term β₃·x₁·x₂ is still a linear model — β₃ enters the mean trend as a simple multiplicative constant. The product x₁·x₂ is just another feature column in the design matrix, derived from the two inputs.
The interaction term creates a slope on x₁ that depends on x₂. Rearrange the model by grouping the x₁ terms:

μᵢ = (β₀ + β₂·xᵢ,₂) + (β₁ + β₃·xᵢ,₂)·xᵢ,₁
Define the effective slope on x₁:

β̃₁ = β₁ + β₃·x₂
This means: the higher (or lower) the value of x₂, the steeper (or shallower) the slope linking x₁ to μ. The lines of μ vs. x₁ are no longer parallel — they fan out from a pivot point. The position of that pivot is:

x₁ = −β₂ / β₃

the value of x₁ at which the x₂ terms cancel, so μ no longer depends on x₂.
One useful way to think about interactions is as a hierarchical model — the slope itself has a model:
β̃₁ = β₁ + β₃ · xᵢ,₂
This is fully equivalent to writing out the product β₃·x₁·x₂ directly. The benefit of the hierarchical view is interpretability: β₁ is the slope on x₁ when x₂ = 0, and β₃ is how much that slope changes per unit increase in x₂.
When you visualize the mean trend μ as a surface over (x₁, x₂):
| Sign of β₃ | Effect on slope of x₁ | Visual behavior |
|---|---|---|
| β₃ > 0 | High x₂ → steeper positive slope; low x₂ → shallower (or negative) slope | Lines fan out rightward — larger x₂ lines are steeper going up to the right |
| β₃ < 0 | High x₂ → shallower (or negative) slope; low x₂ → steeper positive slope | Lines fan out leftward — larger x₂ lines tilt more negatively |
| β₃ = 0 | x₂ has no effect on the slope of x₁ — pure additive model | Perfectly parallel lines regardless of x₂ |
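A tiny numeric check of the fan-out behavior, using hypothetical coefficients with β₃ > 0:

```r
# Sketch: effective slope on x1 depends on x2 (hypothetical betas, beta3 > 0)
b0 <- 1; b1 <- 2; b2 <- -1; b3 <- 0.5
mu <- function(x1, x2) b0 + b1*x1 + b2*x2 + b3*x1*x2
slope_at <- function(x2) mu(1, x2) - mu(0, x2)   # rise over a unit step in x1
slope_at(0)    # 2: just b1
slope_at(4)    # 4: b1 + b3*4, steeper for larger x2
slope_at(-4)   # 0: shallower, consistent with the beta3 > 0 row
```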
Interactions are not limited to linear features; you can combine any basis features.
An interaction with spline features would multiply a basis feature from x₁ with one from x₂. In all cases, the result is still a linear model in the parameters — just with more feature columns in the design matrix.
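In R's formula interface, the product column is generated automatically; a sketch with toy data:

```r
# Sketch: ~ x1 * x2 expands to main effects plus the x1:x2 product column (toy data)
df <- data.frame(x1 = c(1, 2), x2 = c(3, 4))
X  <- model.matrix(~ x1 * x2, data = df)
colnames(X)      # "(Intercept)" "x1" "x2" "x1:x2"
X[, "x1:x2"]     # 3, 8: the elementwise product of the inputs
```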
When the two inputs x₁ and x₂ are linearly related — say x₂ = a + b·x₁ — we can substitute this relationship into the mean trend and simplify:

μᵢ = β₀ + β₁·xᵢ,₁ + β₂·(a + b·xᵢ,₁)
   = (β₀ + β₂a) + (β₁ + β₂b)·xᵢ,₁
   = β̃₀ + β̃₁ · xᵢ,₁
The model collapses to just two unknowns (β̃₀ and β̃₁) even though we are trying to estimate three (β₀, β₁, β₂). We are only ever learning a weighted combination of the original coefficients — not the individual values.
When ρ ≠ 0 but |ρ| < 1, the MLE still technically exists and is unique. At perfect linear dependence (x₂ = a + b·x₁ exactly), however, there are infinitely many combinations of β₁ and β₂ that produce the same effective slope β̃₁ = β₁ + β₂·b, and the data cannot choose among them. For example, if b = 1:
β₁ = 5, β₂ = −2 → 5 − 2 = 3
β₁ = 0, β₂ = 3 → 0 + 3 = 3
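These two coefficient settings really do produce identical mean trends when x₂ = x₁ exactly; a quick check (the intercept of 1 is arbitrary):

```r
# Sketch: with x2 = x1 (a = 0, b = 1), the two beta pairs are indistinguishable
x1 <- c(0, 1, 2, 3)
x2 <- x1
mu_a <- 1 + 5*x1 - 2*x2    # beta1 = 5, beta2 = -2
mu_b <- 1 + 0*x1 + 3*x2    # beta1 = 0, beta2 = 3
all.equal(mu_a, mu_b)      # both equal 1 + 3*x1
```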
The posterior covariance matrix (assuming known noise σ and diffuse priors) is:

Σ_post = σ² (XᵀX)⁻¹
The off-diagonal entries of (XᵀX)⁻¹ give the posterior covariances between parameters; rescaled to correlations (e.g. with `cov2cor()`), they yield a key empirical result from the lecture:
| Input correlation ρ | Posterior correlation between β₁ and β₂ | Intuition |
|---|---|---|
| ρ ≈ 0 | ≈ 0 (near zero) | Inputs are independent; parameters can be learned separately |
| ρ = +0.9 | ≈ −0.88 (strongly negative) | x₁↑ moves with x₂↑; to keep predictions stable, β₁ and β₂ must move in opposite directions |
| ρ = −0.9 | ≈ +0.88 (strongly positive) | x₁↑ as x₂↓; parameters move together to compensate |
Correlated inputs don't just change the direction of uncertainty — they inflate its magnitude. From the lecture's simulation (100 observations, σ = 1):
| ρ | Posterior SD of β₁ | Posterior SD of β₂ |
|---|---|---|
| 0 | ~0.095 | ~0.102 |
| +0.9 | ~0.222 | ~0.214 |
| −0.9 | ~0.214 | ~0.222 |
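A sketch along the lines of the lecture's simulation (my own seed and data-generating choices, so exact numbers will differ from the table):

```r
# Sketch: correlated inputs inflate posterior SDs and induce negative posterior correlation
# Assumes sigma = 1 and diffuse priors, so posterior covariance = (X^T X)^{-1}
set.seed(1)                                   # arbitrary seed, not from the lecture
n   <- 100
rho <- 0.9
x1  <- rnorm(n)
x2  <- rho * x1 + sqrt(1 - rho^2) * rnorm(n)  # cor(x1, x2) is approximately rho
X   <- cbind(1, x1, x2)
post_cov <- solve(t(X) %*% X)
sqrt(diag(post_cov))[2:3]                     # posterior SDs of beta1, beta2 (inflated)
cov2cor(post_cov)[2, 3]                       # strongly negative, near -rho
```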
Steps to detect and understand multicollinearity in practice:

1. Inspect the input correlation matrix:

```r
cor(data[, input_cols])
```

Large off-diagonal values (|r| > 0.7–0.8) signal potential problems.

2. Build the design matrix and compute the posterior covariance and correlation:

```r
X <- model.matrix(~ x1 + x2, data)
SSmat <- t(X) %*% X
post_cov <- solve(SSmat)  # assuming σ = 1
cov2cor(post_cov)         # posterior correlation matrix
```

High off-diagonal entries in the posterior correlation confirm that individual parameter estimates are unreliable.

3. Check the posterior standard deviations:

```r
sqrt(diag(solve(SSmat)))
```

Much larger SDs than you'd expect from sample size alone indicate multicollinearity is inflating uncertainty.
If you only care about predicting the output, multicollinearity often isn't a crisis. The combined prediction can still be accurate — the uncertainty is in attributing the effect to x₁ vs. x₂.
If you need to understand which input causes the outcome, multicollinearity is a serious problem. You cannot reliably separate β₁ from β₂ — the data simply don't contain that information.
|ρ(x₁, x₂)| → 1 ⟹ Posterior SD of β₁, β₂ → ∞
XᵀX is rank-deficient ⟺ perfect linear dependence among columns of X
The geometry: correlated inputs mean the data only "explore" a thin ridge in the (x₁, x₂) input space, so you can only estimate the effective slope along that ridge — not the individual contributions perpendicular to it.