INFSCI 2595 · Week 2

Applied ML · Regression

Interactive Study Guide — Lecture Notes Companion
What is Regression?
// Core concepts · continuous outputs · synthetic data · noise vs truth
Regression — The Big Idea Definition

Regression deals with predicting continuous responses (outputs). Think of a continuous variable as a floating-point number — the response can take any real value within a range.

We want to learn an approximate function y ≈ f(x) from data. Because we learn from noisy data, the relationship is always approximate — expect error!
The Lecture Used Synthetic ("Toy") Data Important Setup

Rather than using a real-world dataset, the lecture generated all data artificially using random number generators in R. This is a deliberate teaching choice — it means we get to know the ground truth, which is impossible in any real application.

Step 1 — Define the TRUE function
y* = β₀* + β₁*x + β₂*x²
β₀*=0.33, β₁*=1.15, β₂*=−2.25

This is a parabola. We set the exact coefficient values ourselves — something never possible in a real problem.

Step 2 — Simulate noisy observations
yₙ = y*(xₙ) + εₙ
ε ~ random noise, N = 30 points

Random noise ε is added to the true signal to simulate what real measurements look like. The result is N=30 synthetic input-output pairs {(xₙ, yₙ)}.
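The two steps above can be sketched in R. The seed, the x-range, and the noise standard deviation below are assumptions for illustration; the lecture specifies only the β values and N = 30.

```r
# Sketch of the lecture's data-generating process.
# Assumed: seed, x in [-1.5, 1.5], Gaussian noise with sd = 1.
set.seed(2595)
N <- 30
b0 <- 0.33; b1 <- 1.15; b2 <- -2.25        # the TRUE coefficients

x      <- runif(N, min = -1.5, max = 1.5)  # inputs
y_true <- b0 + b1 * x + b2 * x^2           # clean parabola (the truth)
y      <- y_true + rnorm(N, sd = 1)        # noisy observations

my_train <- data.frame(x = x, y = y)
```

From here on we pretend `y_true` is unknown and work only with `my_train`.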

Why does this matter? In a real problem, the true function is permanently hidden — we can never check our model against it. By generating data synthetically, the lecture lets us verify exactly how well our methods work, diagnose failures, and build intuition before applying these tools to real data.
TRUTH vs. NOISY DATA Key Distinction

Even though we generated the data, we then deliberately pretend we don't know the truth — just as we would in a real ML project.

  • True signal y* — a clean parabola (red curve)
  • Observed data — 30 points scattered around it
  • Question: can our methods recover the parabola from noise alone?
Why a Toy Problem? Insight

Knowing the ground truth lets us:

  • Compare estimated β's to their TRUE values
  • See exactly when and why models fail
  • Diagnose overfitting vs underfitting visually
  • Validate evaluation methods like cross-validation
In real problems, you never know the truth — you only have data.
The Complete Workflow Big Picture
1. Obtain Data (real or synthetic)
   In this lecture: generate N=30 noisy input-output pairs using R's random number generators. In real life: collect measurements from experiments or sensors.
2. Split Data
   Reserve ~20% as a hold-out test set before touching anything. Use the remaining 80% for training and cross-validation.
3. Fit Multiple Models
   Train candidate models of varying complexity (polynomial degrees 0–8) on the training data using lm() in R.
4. Evaluate via Cross-Validation
   Use k-fold CV to estimate how each model generalizes to new data. Averaging RMSE across folds gives a stable performance estimate.
5. Select Best Model (1-SE Rule)
   Pick the simplest model whose CV performance is within 1 standard error of the overall best. This avoids accidentally selecting a needlessly complex model.
6. Final Evaluation on Hold-out Set
   Report the selected model's performance on the untouched test set: an unbiased estimate of real-world generalization error.
Models & R Code
// lm() · formula interface · polynomial models · predictions
Linear Model R Code
y = β₀ + β₁x + error
# Fit a simple linear (degree-1) model
mod1 <- lm(y ~ x, data = my_train)

# Formula reads: "y is a function of x"
# β₀ = Intercept,  β₁ = slope on x
summary(mod1)    # text summary

library(coefplot)  # coefplot() comes from the coefplot package
coefplot(mod1)     # visualize coefficients + CIs

The coefplot() function from the coefplot package plots coefficient estimates with confidence intervals. If zero is inside the CI for β₁, the slope is not statistically significant.

Quadratic (Degree-2) Model R Code
y = β₀ + β₁x + β₂x² + error
# x² must be wrapped in I() so ^ is not misinterpreted
mod2 <- lm(y ~ x + I(x^2), data = my_train)

# Or using poly() for orthogonal polynomials (same fitted values,
# different coefficient parameterization)
mod2_poly <- lm(y ~ poly(x, 2), data = my_train)

# The formula y ~ x + I(x^2) is an additive model with two predictors: x and x^2
In this toy demo, the quadratic model RECOVERS the TRUE coefficients — the estimates are close and the true values fall inside the confidence intervals.
Higher-Degree Polynomials (0th – 8th) Model Family

The lecture fits 9 models: degree 0 (intercept only) through degree 8.

Degree | Formula                     | Coefficients | Notes
0      | y ~ 1                       | 1            | Constant / mean only
1      | y ~ x                       | 2            | Linear
2      | y ~ x + I(x^2)              | 3            | ← TRUE degree
3      | y ~ x + I(x^2) + I(x^3)     | 4            | Cubic
8      | y ~ x + I(x^2) + … + I(x^8) | 9            | Severely overfit
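The whole family can be fit in a loop. A sketch, where `fit_degree` is a hypothetical helper and the simulated toy data stand in for the lecture's `my_train`:

```r
# Fit every candidate model, degree 0 through 8.
set.seed(1)
my_train <- data.frame(x = runif(30, -1.5, 1.5))
my_train$y <- 0.33 + 1.15 * my_train$x - 2.25 * my_train$x^2 + rnorm(30)

fit_degree <- function(d, data) {
  if (d == 0) lm(y ~ 1, data = data)                  # intercept-only model
  else        lm(y ~ poly(x, d, raw = TRUE), data = data)
}

models <- lapply(0:8, fit_degree, data = my_train)
names(models) <- paste0("degree_", 0:8)

# Each degree-d model estimates d + 1 coefficients
n_coef <- sapply(models, function(m) length(coef(m)))
```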
Making Predictions in R R Code
# Create a "fine" grid of x values for visualization
library(tibble)  # tibble() comes from the tibble package
test_viz <- tibble(x = seq(-1.5, 1.5, length.out = 51))

# Predict with a single model
preds <- predict(mod2, test_viz)

# Predict WITH uncertainty intervals
conf_int <- predict(mod2, test_viz, interval = "confidence")
pred_int <- predict(mod2, test_viz, interval = "prediction")
Confidence Interval — uncertainty in the mean trend
Prediction Interval — uncertainty for a single new observation (always wider)

⚠️ High-degree polynomials (7th, 8th) have enormous confidence and prediction intervals: the model is highly uncertain even about its own mean trend.

Performance Metrics
// MSE · RMSE · MAE · R² · predicted vs observed
Why Do We Need Metrics? Motivation

In a real problem, we cannot compare coefficient estimates to "true" values because we don't know the truth. We need an objective measure of how well the model explains the data.

Metric                     | Formula           | Units            | Best Value
MSE (Mean Squared Error)   | Σ(yᵢ − ŷᵢ)² / N   | units of y²      | 0 (lower = better)
RMSE (Root MSE)            | √MSE              | same units as y  | 0 (lower = better)
MAE (Mean Absolute Error)  | Σ|yᵢ − ŷᵢ| / N    | same units as y  | 0 (lower = better)
R² (R-squared)             | corr(y, ŷ)²       | unitless, [0, 1] | 1 (higher = better)
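The formulas in the table can be computed directly in base R. The predictions below are invented stand-ins purely to exercise the formulas:

```r
# Metric formulas computed by hand on stand-in values.
set.seed(3)
y    <- rnorm(30)                 # observed responses (invented)
yhat <- y + rnorm(30, sd = 0.3)   # stand-in model predictions

mse  <- mean((y - yhat)^2)        # units of y^2
rmse <- sqrt(mse)                 # same units as y; never below MAE
mae  <- mean(abs(y - yhat))       # same units as y
r2   <- cor(y, yhat)^2            # unitless, in [0, 1]
```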
R-Squared (R²) Explained Definition

R² is the squared correlation between the model predictions (ŷ) and the observed responses (y).

On the predicted vs. observed plot, the closer all points lie along the 45° diagonal line, the higher the R². A perfect model would have all points on the diagonal → R² = 1.0.
# R code — using the modelr package
library(modelr)
rsquare(mod2, my_train)   # training R²
rsquare(mod8, my_train)   # 8th degree R² on training

⚠️ On the training set, R² keeps increasing as degree increases — but this is misleading! The 8th degree model is overfit.

Calculating RMSE in R R Code
# modelr::rmse(model_object, data_frame)
rmse(mod1, my_train)   # linear model training RMSE
rmse(mod3, my_train)   # cubic model training RMSE

# Evaluate on hold-out test set
rmse(mod3, test_split)  # generalization error
Key observation: On the training set, the 8th-degree polynomial has the lowest RMSE. But it generalizes poorly — it has memorized noise.
Model Complexity
// Bias-variance trade-off · overfitting · underfitting · Goldilocks
The Bias-Variance Trade-off Core Concept
  • 📉 Underfit: too simple; ignores the training data. High bias, low variance.
  • Just Right: captures the true pattern; generalizes well.
  • 🌊 Overfit: too complex; memorizes noise. Low bias, high variance.
The 0th-degree (constant) model is the most underfit. The 7th-8th degree polynomials are severely overfit — their predicted trends swing wildly depending on which training points were used in each CV fold.
Signs of Overfitting Warning Signs
  • Training RMSE keeps decreasing, but test/CV RMSE increases
  • Coefficient estimates are enormous (e.g. ±20 for degree-8)
  • Confidence intervals on coefficients are extremely wide
  • Prediction intervals span huge y-axis ranges
  • Model predictions change drastically when the training set changes slightly (high variance across CV folds)
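One of these symptoms is guaranteed by construction: because lower-degree polynomials are nested inside higher-degree ones, training RMSE can never increase as the degree grows. A sketch on simulated data standing in for the lecture's training set:

```r
# Training RMSE is monotone non-increasing in polynomial degree.
set.seed(4)
train <- data.frame(x = runif(30, -1.5, 1.5))
train$y <- 0.33 + 1.15 * train$x - 2.25 * train$x^2 + rnorm(30)

train_rmse <- function(d) {
  m <- if (d == 0) lm(y ~ 1, data = train)
       else        lm(y ~ poly(x, d, raw = TRUE), data = train)
  sqrt(mean(residuals(m)^2))      # RMSE on the data the model was fit to
}
rmse_by_degree <- sapply(0:8, train_rmse)
```

This is exactly why training RMSE alone cannot be used for model selection.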
Complexity = Number of Coefficients Definition

For polynomial regression, complexity is measured by the polynomial degree. Each additional degree adds one more coefficient (parameter) to estimate from the data.

# coefficients = polynomial degree + 1

More coefficients → more flexibility → potentially captures more signal but also more noise.

The Goldilocks Principle Model Selection

We want to strike a balance between underfit and overfit — a model that is "just right." This is the essence of the bias-variance trade-off and model selection.

  • Underfit (bias): too simple, misses the real pattern
  • Just Right: captures signal, ignores noise
  • Overfit (variance): too complex, memorizes noise
Resampling & Cross-Validation
// Train-test split · k-fold CV · 1-SE rule · LOO · time series
Why Not Just Evaluate on Training Data? Problem

Training set performance is always optimistic — the model has already "seen" that data. The 8th-degree polynomial gets the best training RMSE yet is clearly wrong. We need a way to estimate performance on new, unseen data.

Train / Test Split First Step

Split the dataset once before doing anything else. Common rule of thumb: 80% training, 20% hold-out test.

# Simple random split in R
idx <- sample(seq_len(nrow(my_data)), size = floor(0.8 * nrow(my_data)))
train_split <- my_data[idx, ]
test_split  <- my_data[-idx, ]
⚠️ A single random split can be unlucky. In the lecture, the lone test split favored the 7th-degree model, not the true quadratic! This motivates resampling.
K-Fold Cross-Validation Core Technique

Partition the training data into k folds. Each observation appears in a test set exactly once.

[Diagram: 5-fold CV fold assignment. Each of the 5 rows holds out a different fold as the test set (blue = training, white = held-out test).]
1. Partition into k folds
   Randomly assign each observation to one of k groups.
2. For each fold: train → test
   Train the model on the other k−1 folds, then evaluate on the held-out fold. Never let observations from the held-out fold leak into training.
3. Average performance across folds
   Each model gets k RMSE values. Their average estimates the expected performance on new data.
4. Compare models
   Plot fold-averaged RMSE ± standard error for all candidate models.
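The four steps above can be written as a short from-scratch loop. A minimal sketch for the quadratic model, with simulated data standing in for the training set:

```r
# From-scratch 5-fold cross-validation of the quadratic model.
set.seed(5)
train <- data.frame(x = runif(30, -1.5, 1.5))
train$y <- 0.33 + 1.15 * train$x - 2.25 * train$x^2 + rnorm(30)

k <- 5
fold_id <- sample(rep(1:k, length.out = nrow(train)))  # step 1: assign folds

fold_rmse <- sapply(1:k, function(f) {
  fit  <- lm(y ~ x + I(x^2), data = train[fold_id != f, ])  # train on k-1 folds
  held <- train[fold_id == f, ]                             # the held-out fold
  sqrt(mean((held$y - predict(fit, held))^2))               # step 2: test RMSE
})

cv_rmse <- mean(fold_rmse)   # step 3: average across folds
```

Repeating this for every candidate degree gives the fold-averaged RMSE curve used in step 4.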
CV Variants Comparison
Method          | k     | Test size per fold | Trade-off
5-fold CV       | 5     | 20% of data        | Fewer folds, more points per test set, less variance in estimates
10-fold CV      | 10    | 10% of data        | More folds, smaller test sets, more computation
LOO-CV          | N     | 1 point            | Exact but very slow for large datasets
Repeated k-fold | k × r | varies             | Runs k-fold r times with different splits; more stable
Practical note: Choice of k is mainly driven by compute time. 5-fold and 10-fold are most popular because they are fast and give stable estimates.
The One-Standard-Error (1-SE) Rule Model Selection

Even after CV, two models may appear statistically equivalent within the margin of error. Use the 1-SE Rule:

Select the simplest model whose fold-averaged RMSE is within 1 standard error of the overall best-performing model.

In the lecture, the quadratic and cubic models both perform within 1-SE of each other. By the 1-SE rule we prefer the quadratic — which happens to be the true model!
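A minimal sketch of the rule. The fold-averaged RMSE and SE values below are invented for illustration, and "simplest" is taken to mean lowest polynomial degree:

```r
# 1-SE rule on hypothetical CV results for degrees 0 through 4.
cv_mean <- c(2.10, 1.40, 0.95, 0.97, 1.05)   # fold-averaged RMSE (invented)
cv_se   <- c(0.20, 0.15, 0.10, 0.11, 0.12)   # standard errors (invented)

best      <- which.min(cv_mean)              # overall best model
threshold <- cv_mean[best] + cv_se[best]     # best mean + 1 SE

# models are ordered simplest-first, so the first index under the
# threshold is the simplest acceptable model
chosen        <- which(cv_mean <= threshold)[1]
chosen_degree <- chosen - 1
```

With these numbers the best raw RMSE belongs to degree 2, and the rule also selects degree 2; when a simpler model falls inside the 1-SE band, the rule picks it instead.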

Train → Validate → Test Pipeline Best Practice
A. Hold out 20% as a final test set
   Never touch this during model building.
B. Run k-fold CV on the remaining 80%
   Compare candidate models; select the best using the 1-SE rule.
C. Report performance on the hold-out test set
   This is your unbiased estimate of generalization error.
Time series caveat: Random partitioning breaks temporal structure. Use time-series cross-validation (rolling window) when forecasting future events.
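A sketch of rolling-origin splits, assuming the observations are already in time order; `initial` and `horizon` are illustrative choices, not values from the lecture:

```r
# Rolling-origin split indices: training data always precedes test data.
n       <- 20   # number of time-ordered observations (assumed)
initial <- 10   # size of the first training window (assumed)
horizon <- 1    # forecast one step ahead (assumed)

origins <- seq(initial, n - horizon)
splits  <- lapply(origins, function(o)
  list(train = 1:o,                  # everything up to the origin
       test  = (o + 1):(o + horizon)))  # the next time step(s)
```

Each split grows the training window by one step and tests only on the future, so no information leaks backwards in time.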