INFSCI 2595 · Week 2

Applied ML · Regression

Interactive Study Guide — Lecture Notes Companion
What is Regression?
// Core concepts · continuous outputs · synthetic data · noise vs truth
Regression — The Big Idea Definition

Regression deals with predicting continuous responses (outputs). Think of a continuous variable as a floating-point number — the response can take any real value within a range.

We want to learn an approximate function y ≈ f(x) from data. Because we learn from noisy data, the relationship is always approximate — expect error!
The Lecture Used Synthetic ("Toy") Data Important Setup

Rather than using a real-world dataset, the lecture generated all data artificially using random number generators in R. This is a deliberate teaching choice — it means we get to know the ground truth, which is impossible in any real application.

Step 1 — Define the TRUE function
y* = β₀* + β₁*x + β₂*x²
β₀*=0.33, β₁*=1.15, β₂*=−2.25

This is a parabola. We set the exact coefficient values ourselves — something never possible in a real problem.

Step 2 — Simulate noisy observations
yₙ = y*(xₙ) + εₙ
ε ~ random noise, N = 30 points

Random noise ε is added to the true signal to simulate what real measurements look like. The result is N=30 synthetic input-output pairs {(xₙ, yₙ)}.
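The two steps above can be sketched in R. The seed, the x-range, and the noise standard deviation below are assumptions for illustration; the lecture specifies only the β values and N = 30.

```r
# Sketch of the lecture's data-generating process.
# Assumed: seed, x in [-1.5, 1.5], Gaussian noise with sd = 1.
set.seed(2595)
N <- 30
b0 <- 0.33; b1 <- 1.15; b2 <- -2.25        # the TRUE coefficients

x      <- runif(N, min = -1.5, max = 1.5)  # inputs
y_true <- b0 + b1 * x + b2 * x^2           # clean parabola (the truth)
y      <- y_true + rnorm(N, sd = 1)        # noisy observations

my_train <- data.frame(x = x, y = y)
```

From here on we pretend `y_true` is unknown and work only with `my_train`.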

Why does this matter? In a real problem, the true function is permanently hidden — we can never check our model against it. By generating data synthetically, the lecture lets us verify exactly how well our methods work, diagnose failures, and build intuition before applying these tools to real data.
TRUTH vs. NOISY DATA Key Distinction

Even though we generated the data, we then deliberately pretend we don't know the truth — just as we would in a real ML project.

  • True signal y* — a clean parabola (red curve)
  • Observed data — 30 points scattered around it
  • Question: can our methods recover the parabola from noise alone?
Why a Toy Problem? Insight

Knowing the ground truth lets us:

  • Compare estimated β's to their TRUE values
  • See exactly when and why models fail
  • Diagnose overfitting vs underfitting visually
  • Validate evaluation methods like cross-validation
In real problems, you never know the truth — you only have data.
The Complete Workflow Big Picture
1. Obtain Data (real or synthetic)
   In this lecture: generate N=30 noisy input-output pairs using R's random number generators. In real life: collect measurements from experiments or sensors.
2. Split Data
   Reserve ~20% as a hold-out test set before touching anything. Use the remaining 80% for training and cross-validation.
3. Fit Multiple Models
   Train candidate models of varying complexity (polynomial degrees 0–8) on the training data using lm() in R.
4. Evaluate via Cross-Validation
   Use k-fold CV to estimate how each model generalizes to new data. Averaging RMSE across folds gives a stable performance estimate.
5. Select Best Model (1-SE Rule)
   Pick the simplest model whose CV performance is within 1 standard error of the overall best. This avoids accidentally selecting a needlessly complex model.
6. Final Evaluation on Hold-out Set
   Report the selected model's performance on the untouched test set: an unbiased estimate of real-world generalization error.
Models & R Code
// lm() · formula interface · polynomial models · predictions
Linear Model R Code
y = β₀ + β₁x + error
# Fit a simple linear (degree-1) model
mod1 <- lm(y ~ x, data = my_train)

# Formula reads: "y is a function of x"
# β₀ = Intercept,  β₁ = slope on x
summary(mod1)    # text summary

library(coefplot)  # coefplot() comes from the coefplot package
coefplot(mod1)     # visualize coefficients + CIs

The coefplot() function from the coefplot package plots coefficient estimates with confidence intervals. If zero is inside the CI for β₁, the slope is not statistically significant.

Quadratic (Degree-2) Model R Code
y = β₀ + β₁x + β₂x² + error
# x² must be wrapped in I() so ^ is not misinterpreted
mod2 <- lm(y ~ x + I(x^2), data = my_train)

# Or using poly() for orthogonal polynomials (same fitted values,
# different coefficient parameterization)
mod2_poly <- lm(y ~ poly(x, 2), data = my_train)

# The formula y ~ x + I(x^2) is an additive model with two predictors: x and x^2
In this toy demo, the quadratic model RECOVERS the TRUE coefficients — the estimates are close and the true values fall inside the confidence intervals.
Higher-Degree Polynomials (0th – 8th) Model Family

The lecture fits 9 models: degree 0 (intercept only) through degree 8.

Degree | Formula                     | Coefficients | Notes
0      | y ~ 1                       | 1            | Constant / mean only
1      | y ~ x                       | 2            | Linear
2      | y ~ x + I(x^2)              | 3            | ← TRUE degree
3      | y ~ x + I(x^2) + I(x^3)     | 4            | Cubic
8      | y ~ x + I(x^2) + … + I(x^8) | 9            | Severely overfit
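The whole family can be fit in a loop. A sketch, where `fit_degree` is a hypothetical helper and the simulated toy data stand in for the lecture's `my_train`:

```r
# Fit every candidate model, degree 0 through 8.
set.seed(1)
my_train <- data.frame(x = runif(30, -1.5, 1.5))
my_train$y <- 0.33 + 1.15 * my_train$x - 2.25 * my_train$x^2 + rnorm(30)

fit_degree <- function(d, data) {
  if (d == 0) lm(y ~ 1, data = data)                  # intercept-only model
  else        lm(y ~ poly(x, d, raw = TRUE), data = data)
}

models <- lapply(0:8, fit_degree, data = my_train)
names(models) <- paste0("degree_", 0:8)

# Each degree-d model estimates d + 1 coefficients
n_coef <- sapply(models, function(m) length(coef(m)))
```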
Making Predictions in R R Code
# Create a "fine" grid of x values for visualization
library(tibble)  # tibble() comes from the tibble package
test_viz <- tibble(x = seq(-1.5, 1.5, length.out = 51))

# Predict with a single model
preds <- predict(mod2, test_viz)

# Predict WITH uncertainty intervals
conf_int <- predict(mod2, test_viz, interval = "confidence")
pred_int <- predict(mod2, test_viz, interval = "prediction")
Confidence Interval — uncertainty in the mean trend
Prediction Interval — uncertainty for a single new observation (always wider)

⚠️ High-degree polynomials (7th, 8th) have enormous confidence and prediction intervals: the model is highly uncertain even about its own mean trend.

Performance Metrics
// MSE · RMSE · MAE · R² · predicted vs observed
Why Do We Need Metrics? Motivation

In a real problem, we cannot compare coefficient estimates to "true" values because we don't know the truth. We need an objective measure of how well the model explains the data.

Metric                     | Formula           | Units            | Best Value
MSE (Mean Squared Error)   | Σ(yᵢ − ŷᵢ)² / N   | units of y²      | 0 (lower = better)
RMSE (Root MSE)            | √MSE              | same units as y  | 0 (lower = better)
MAE (Mean Absolute Error)  | Σ|yᵢ − ŷᵢ| / N    | same units as y  | 0 (lower = better)
R² (R-squared)             | corr(y, ŷ)²       | unitless, [0, 1] | 1 (higher = better)
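The formulas in the table can be computed directly in base R. The predictions below are invented stand-ins purely to exercise the formulas:

```r
# Metric formulas computed by hand on stand-in values.
set.seed(3)
y    <- rnorm(30)                 # observed responses (invented)
yhat <- y + rnorm(30, sd = 0.3)   # stand-in model predictions

mse  <- mean((y - yhat)^2)        # units of y^2
rmse <- sqrt(mse)                 # same units as y; never below MAE
mae  <- mean(abs(y - yhat))       # same units as y
r2   <- cor(y, yhat)^2            # unitless, in [0, 1]
```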
R-Squared (R²) Explained Definition

R² is the squared correlation between the model predictions (ŷ) and the observed responses (y).

On the predicted vs. observed plot, the closer all points lie along the 45° diagonal line, the higher the R². A perfect model would have all points on the diagonal → R² = 1.0.
# R code — using the modelr package
library(modelr)
rsquare(mod2, my_train)   # training R²
rsquare(mod8, my_train)   # 8th degree R² on training

⚠️ On the training set, R² keeps increasing as degree increases — but this is misleading! The 8th degree model is overfit.

Calculating RMSE in R R Code
# modelr::rmse(model_object, data_frame)
rmse(mod1, my_train)   # linear model training RMSE
rmse(mod3, my_train)   # cubic model training RMSE

# Evaluate on hold-out test set
rmse(mod3, test_split)  # generalization error
Key observation: On the training set, the 8th-degree polynomial has the lowest RMSE. But it generalizes poorly — it has memorized noise.
Model Complexity
// Bias-variance trade-off · overfitting · underfitting · Goldilocks
The Bias-Variance Trade-off Core Concept
  • 📉 Underfit: too simple; ignores the training data. High bias, low variance.
  • Just Right: captures the true pattern; generalizes well.
  • 🌊 Overfit: too complex; memorizes noise. Low bias, high variance.
The 0th-degree (constant) model is the most underfit. The 7th-8th degree polynomials are severely overfit — their predicted trends swing wildly depending on which training points were used in each CV fold.
Signs of Overfitting Warning Signs
  • Training RMSE keeps decreasing, but test/CV RMSE increases
  • Coefficient estimates are enormous (e.g. ±20 for degree-8)
  • Confidence intervals on coefficients are extremely wide
  • Prediction intervals span huge y-axis ranges
  • Model predictions change drastically when the training set changes slightly (high variance across CV folds)
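One of these symptoms is guaranteed by construction: because lower-degree polynomials are nested inside higher-degree ones, training RMSE can never increase as the degree grows. A sketch on simulated data standing in for the lecture's training set:

```r
# Training RMSE is monotone non-increasing in polynomial degree.
set.seed(4)
train <- data.frame(x = runif(30, -1.5, 1.5))
train$y <- 0.33 + 1.15 * train$x - 2.25 * train$x^2 + rnorm(30)

train_rmse <- function(d) {
  m <- if (d == 0) lm(y ~ 1, data = train)
       else        lm(y ~ poly(x, d, raw = TRUE), data = train)
  sqrt(mean(residuals(m)^2))      # RMSE on the data the model was fit to
}
rmse_by_degree <- sapply(0:8, train_rmse)
```

This is exactly why training RMSE alone cannot be used for model selection.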
Complexity = Number of Coefficients Definition

For polynomial regression, complexity is measured by the polynomial degree. Each additional degree adds one more coefficient (parameter) to estimate from the data.

# coefficients = polynomial degree + 1

More coefficients → more flexibility → potentially captures more signal but also more noise.

The Goldilocks Principle Model Selection

We want to strike a balance between underfit and overfit — a model that is "just right." This is the essence of the bias-variance trade-off and model selection.

  • Underfit (bias): too simple, misses the real pattern
  • Just Right: captures signal, ignores noise
  • Overfit (variance): too complex, memorizes noise
Resampling & Cross-Validation
// Train-test split · k-fold CV · 1-SE rule · LOO · time series
Why Not Just Evaluate on Training Data? Problem

Training set performance is always optimistic — the model has already "seen" that data. The 8th-degree polynomial gets the best training RMSE yet is clearly wrong. We need a way to estimate performance on new, unseen data.

Train / Test Split First Step

Split the dataset once before doing anything else. Common rule of thumb: 80% training, 20% hold-out test.

# Simple random split in R
idx <- sample(seq_len(nrow(my_data)), size = floor(0.8 * nrow(my_data)))
train_split <- my_data[idx, ]
test_split  <- my_data[-idx, ]
⚠️ A single random split can be unlucky. In the lecture, the lone test split favored the 7th-degree model, not the true quadratic! This motivates resampling.
K-Fold Cross-Validation Core Technique

Partition the training data into k folds. Each observation appears in a test set exactly once.

[Diagram: 5-fold CV fold assignment. Each of the 5 rows holds out a different fold as the test set (blue = training, white = held-out test).]
1. Partition into k folds
   Randomly assign each observation to one of k groups.
2. For each fold: train → test
   Train the model on the other k−1 folds, then evaluate on the held-out fold. Never let observations from the held-out fold leak into training.
3. Average performance across folds
   Each model gets k RMSE values. Their average estimates the expected performance on new data.
4. Compare models
   Plot fold-averaged RMSE ± standard error for all candidate models.
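The four steps above can be written as a short from-scratch loop. A minimal sketch for the quadratic model, with simulated data standing in for the training set:

```r
# From-scratch 5-fold cross-validation of the quadratic model.
set.seed(5)
train <- data.frame(x = runif(30, -1.5, 1.5))
train$y <- 0.33 + 1.15 * train$x - 2.25 * train$x^2 + rnorm(30)

k <- 5
fold_id <- sample(rep(1:k, length.out = nrow(train)))  # step 1: assign folds

fold_rmse <- sapply(1:k, function(f) {
  fit  <- lm(y ~ x + I(x^2), data = train[fold_id != f, ])  # train on k-1 folds
  held <- train[fold_id == f, ]                             # the held-out fold
  sqrt(mean((held$y - predict(fit, held))^2))               # step 2: test RMSE
})

cv_rmse <- mean(fold_rmse)   # step 3: average across folds
```

Repeating this for every candidate degree gives the fold-averaged RMSE curve used in step 4.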
CV Variants Comparison
Method          | k     | Test size per fold | Trade-off
5-fold CV       | 5     | 20% of data        | Fewer folds, more points per test set, less variance in estimates
10-fold CV      | 10    | 10% of data        | More folds, smaller test sets, more computation
LOO-CV          | N     | 1 point            | Exact but very slow for large datasets
Repeated k-fold | k × r | varies             | Runs k-fold r times with different splits; more stable
Practical note: Choice of k is mainly driven by compute time. 5-fold and 10-fold are most popular because they are fast and give stable estimates.
The One-Standard-Error (1-SE) Rule Model Selection

Even after CV, two models may appear statistically equivalent within the margin of error. Use the 1-SE Rule:

Select the simplest model whose fold-averaged RMSE is within 1 standard error of the overall best-performing model.

In the lecture, the quadratic and cubic models both perform within 1-SE of each other. By the 1-SE rule we prefer the quadratic — which happens to be the true model!
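A minimal sketch of the rule. The fold-averaged RMSE and SE values below are invented for illustration, and "simplest" is taken to mean lowest polynomial degree:

```r
# 1-SE rule on hypothetical CV results for degrees 0 through 4.
cv_mean <- c(2.10, 1.40, 0.95, 0.97, 1.05)   # fold-averaged RMSE (invented)
cv_se   <- c(0.20, 0.15, 0.10, 0.11, 0.12)   # standard errors (invented)

best      <- which.min(cv_mean)              # overall best model
threshold <- cv_mean[best] + cv_se[best]     # best mean + 1 SE

# models are ordered simplest-first, so the first index under the
# threshold is the simplest acceptable model
chosen        <- which(cv_mean <= threshold)[1]
chosen_degree <- chosen - 1
```

With these numbers the best raw RMSE belongs to degree 2, and the rule also selects degree 2; when a simpler model falls inside the 1-SE band, the rule picks it instead.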

Train → Validate → Test Pipeline Best Practice
A. Hold out 20% as a final test set
   Never touch this during model building.
B. Run k-fold CV on the remaining 80%
   Compare candidate models; select the best using the 1-SE rule.
C. Report performance on the hold-out test set
   This is your unbiased estimate of generalization error.
Time series caveat: Random partitioning breaks temporal structure. Use time-series cross-validation (rolling window) when forecasting future events.
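A sketch of rolling-origin splits, assuming the observations are already in time order; `initial` and `horizon` are illustrative choices, not values from the lecture:

```r
# Rolling-origin split indices: training data always precedes test data.
n       <- 20   # number of time-ordered observations (assumed)
initial <- 10   # size of the first training window (assumed)
horizon <- 1    # forecast one step ahead (assumed)

origins <- seq(initial, n - horizon)
splits  <- lapply(origins, function(o)
  list(train = 1:o,                  # everything up to the origin
       test  = (o + 1):(o + horizon)))  # the next time step(s)
```

Each split grows the training window by one step and tests only on the future, so no information leaks backwards in time.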