INFSCI 2595 · Week 3

Applied ML · Classification

Interactive Study Guide — Lecture Notes Companion
From Regression to Classification
// binary outcomes · encoding · why regression fails · synthetic toy data
Everything from Last Week Still Applies Recap

Overfitting, resampling, k-fold cross-validation, and the 1-SE rule all carry over directly to classification. This lecture introduces the new problem type and its specific performance metrics. The generalization concerns are identical.

This lecture focuses on measuring classification performance on the training set — resampling for classification follows the same logic as for regression.
What Makes Classification Different? Definition

In regression, the response y is a continuous numeric value. In classification, the response is categorical — it belongs to one of a discrete set of classes.

Regression output
ŷ ∈ ℝ (any real number)
e.g. 3.14, −0.07, 22.5
Classification output
ŷ ∈ {EVENT, NON-EVENT}
e.g. Hit/Out, Yes/No, 1/0

This lecture focuses on the binary case — exactly two possible outcomes. Real-world examples include: hit/out in baseball, default/no-default on a loan, cancer/no cancer, stay/leave on a website.

Can We Use Linear Regression for Classification? The Argument

The lecture walks through this question step by step — and the answer is more nuanced than a flat "no":

1
First objection: we can't compute the error
If y is a text label like "HIT", we cannot compute "HIT − f(x)". That's a valid concern — but it's immediately solved in the next step.
2
Solution: encode the response as 0 and 1
Set EVENT=1 and NON-EVENT=0. Now we can compute numeric errors like (1 − f(x)) and (0 − f(x)). So regression is technically possible with this encoding.
3
Deeper problem: only 2 unique response values
Even with encoding, y can only ever be 0 or 1 — there are no intermediate values. The squared-error loss used in regression is designed for continuous responses, not binary ones. The loss function should be fundamentally different.
4
Consequence: we end up predicting event probability
The appropriate loss function for binary data leads naturally to a model that predicts the probability of the event — a number in [0,1]. This is what binary classifiers like logistic regression do. Linear regression, applied naively, does not constrain its output to [0,1].
The Toy Problem: Synthetic Data (Again) Setup

Just like in Week 2, the lecture uses artificially generated data so we can know the ground truth and verify our methods. Here is exactly how the data was created:

1
Define the TRUE event probability function
A set of TRUE parameters β* was chosen to define the true probability curve μ = f(x, β*). This produces an S-shaped curve (more on why in a later lecture).
2
Generate random input values x
N=115 input values were randomly generated across the input range.
3
Compute the true probability at each x
For each generated xₙ, the TRUE event probability μₙ = f(xₙ, β*) is calculated from the known function.
4
Randomly generate binary outcomes
Each binary label yₙ ∈ {0, 1} is drawn randomly based on the true probability μₙ. When μₙ is high, we usually get yₙ=1 — but NOT always. Randomness means exceptions occur!
The key insight: even when the true probability is 0.9, we will occasionally observe yₙ=0. And when it's 0.1, we may still observe yₙ=1. This irreducible randomness is why binary classification is fundamentally probabilistic.
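The four generation steps above can be sketched in a few lines of Python (the lecture's actual TRUE parameters β* are not given, so the coefficients and input range below are hypothetical):

```python
import math
import random

# Sketch of the toy-data recipe with hypothetical TRUE parameters beta*.
random.seed(2595)

b0_true, b1_true = -1.0, 2.0   # made-up TRUE parameters beta*
N = 115                        # sample size used in the lecture

def true_prob(x):
    """Step 1: TRUE event probability mu = f(x, beta*), an S-shaped curve."""
    return 1.0 / (1.0 + math.exp(-(b0_true + b1_true * x)))

x  = [random.uniform(-3, 3) for _ in range(N)]       # step 2: random inputs
mu = [true_prob(xi) for xi in x]                     # step 3: true probabilities
y  = [1 if random.random() < mi else 0 for mi in mu] # step 4: random binary draws

# Even where mu is high, some y are 0 -- irreducible randomness.
```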
Modeling Probability
// event probability · loss function · glm() · S-curve · encoding
What Does a Classification Model Actually Predict? Core Idea
Most binary classification models do NOT directly predict the class label. They predict the EVENT PROBABILITY — a number between 0 and 1.
Binary classification model output
EVENT PROBABILITY = f(x, β)
where 0 ≤ f(x, β) ≤ 1 always

The model is fit by minimizing a loss function based on the likelihood of the observed response (y=1 or y=0) given the modeled probability. This is fundamentally different from the squared error used in regression.

Encoding the Binary Response Definition

To make the response numeric, we apply a simple encoding before fitting:

Outcome | Encoding | Error formulation
EVENT occurs | y = 1 | 1 − f(x) = error
NON-EVENT occurs | y = 0 | 0 − f(x) = −f(x) = error
Even though y is now numeric (0 or 1), there are still only 2 unique values. The binary loss function is therefore different from the continuous regression loss — we use likelihood / cross-entropy (details in a later lecture).
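The likelihood-based loss mentioned above can be sketched directly. This Python snippet computes the binary cross-entropy (negative log-likelihood) of a single 0/1 label under a modeled probability:

```python
import math

# Sketch of the likelihood-based loss for 0/1 responses (binary
# cross-entropy / negative log-likelihood), replacing squared error.

def bce(y, p):
    """Negative log-likelihood of label y in {0, 1} under probability p."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

# Confident and correct -> small loss; confident and wrong -> large loss.
print(bce(1, 0.9))  # small
print(bce(1, 0.1))  # large
```

Unlike squared error, this loss grows without bound as a confident prediction turns out wrong, which is exactly the behavior wanted for probabilities.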
The S-Shaped Probability Curve Logistic Shape

The true event probability curve in the toy problem (and in logistic regression generally) has an S-shape (sigmoid). You will learn the mathematical reason later in the semester. For now, understand its behavior:

Low x values
Probability approaches 0. Events are rare.
Middle x values
Probability near 0.5. Outcomes are uncertain.
High x values
Probability approaches 1. Events are common.
In the toy problem: x values giving probability < 0.25 → event rarely observed. x values giving probability > 0.75 → event usually observed. But exceptions always occur due to randomness!
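The three regimes described above fall out of the logistic (sigmoid) function. A small Python illustration (the coefficients are hypothetical, chosen only to show the shape):

```python
import math

# The logistic (sigmoid) function produces the S-shaped probability curve.

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def event_prob(x, b0=0.0, b1=1.5):
    """Hypothetical S-shaped event probability curve."""
    return sigmoid(b0 + b1 * x)

print(event_prob(-4))  # near 0: events rare
print(event_prob(0))   # exactly 0.5: maximally uncertain
print(event_prob(4))   # near 1: events common
```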
Fitting the Model in R with glm() R Code
# Classification uses glm() instead of lm()
# Must specify the family (sets the loss function)
mod_logistic <- glm(y ~ x,
                    data   = my_data,
                    family = binomial())

# Same formula interface as lm():
# response ~ predictor(s)
summary(mod_logistic)
glm() = Generalised Linear Model. A broader family that includes logistic regression via family = binomial().
family= sets the loss function. binomial() triggers the likelihood-based loss appropriate for binary 0/1 responses.
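What family = binomial() asks glm() to do can be sketched from scratch: minimize the negative log-likelihood of the 0/1 labels. Here is a hypothetical Python illustration using gradient descent (the data, step size, and iteration count are all made up; glm() itself uses a different, faster fitting algorithm):

```python
import math

# From-scratch sketch of the fit glm(..., family = binomial()) performs:
# minimize the negative log-likelihood by gradient descent.

x = [-2.0, -1.0, -0.5, 0.0, 0.5, 1.0, 2.0]   # hypothetical inputs
y = [0, 0, 0, 1, 0, 1, 1]                    # hypothetical 0/1 labels

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

b0, b1 = 0.0, 0.0
lr = 0.1
for _ in range(5000):
    # Gradient of the negative log-likelihood: sum of (p_i - y_i) * feature
    g0 = sum(sigmoid(b0 + b1 * xi) - yi for xi, yi in zip(x, y))
    g1 = sum((sigmoid(b0 + b1 * xi) - yi) * xi for xi, yi in zip(x, y))
    b0 -= lr * g0
    b1 -= lr * g1

# The fitted slope is positive: higher x -> higher event probability.
```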
The Decision Threshold
// converting probability to class · threshold choice · trade-offs
From Probability to Class Label Core Concept

The model outputs a probability. To get a predicted class, we compare it to a threshold:

Classification rule
if μ ≥ threshold → predict EVENT (ŷ = 1)
if μ < threshold → predict NON-EVENT (ŷ = 0)
The default threshold is 0.5 — intuitive because it predicts whichever class is more probable. But this is a decision that can and should be changed depending on the costs of each error type.
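The classification rule above is a one-liner. A minimal Python sketch (the probabilities are made up):

```python
# Minimal sketch of the classification rule: compare each predicted
# event probability to a chosen threshold.

def classify(probs, threshold=0.5):
    """Map event probabilities to class labels (1 = EVENT, 0 = NON-EVENT)."""
    return [1 if mu >= threshold else 0 for mu in probs]

probs = [0.10, 0.45, 0.50, 0.80]
print(classify(probs))        # default threshold 0.5 -> [0, 0, 1, 1]
print(classify(probs, 0.3))   # lower threshold -> more EVENTs: [0, 1, 1, 1]
```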
The Two Types of Error Error Types
False Positive (Type I Error)

A NON-EVENT is incorrectly predicted as an EVENT. Like a false accusation.

Costly when: spam filters block important emails, plagiarism detectors flag clean work, criminal risk tools falsely flag innocent people.

False Negative (Type II Error)

An EVENT is incorrectly predicted as a NON-EVENT. A missed detection.

Costly when: disease screening misses cancer, fault detection misses failures, fraud detection misses real fraud.

Lowering the threshold makes it easier to predict EVENT → more True Positives and more False Positives. Raising it has the opposite effect. There is no free lunch.
Classification Metrics
// confusion matrix · accuracy · sensitivity · specificity · FPR
The Confusion Matrix Core Tool

With binary classification there are 4 possible combinations of predicted and observed class. The confusion matrix counts how many observations fall into each cell.

Observed \ Predicted | Predicted EVENT (1) | Predicted NON-EVENT (0)
Observed EVENT (1) | TP: True Positive ✓ | FN: False Negative ✗ (missed event)
Observed NON-EVENT (0) | FP: False Positive ✗ (false alarm) | TN: True Negative ✓

Main diagonal (TP + TN) = correct predictions. Off-diagonal (FP + FN) = errors.

All Metrics Derived from the Confusion Matrix Formulas
Metric | Formula | Focuses on
Accuracy | (TP + TN) / (TP + TN + FP + FN) | Overall correct rate
Sensitivity (TPR) | TP / (TP + FN) | Observed EVENTs only
Specificity (TNR) | TN / (TN + FP) | Observed NON-EVENTs only
False Positive Rate | FP / (FP + TN) = 1 − Specificity | NON-EVENTs called events
Why Accuracy Alone Is Insufficient Limitation

Accuracy collapses all four cells into a single number. It does NOT tell you:

  • Whether failures are False Positives or False Negatives
  • How the model performs within each class separately
  • How performance changes as the threshold changes
Example: A model that always predicts NON-EVENT achieves 95% accuracy on a dataset where events are only 5% of observations — yet it catches zero actual events. Accuracy here is completely misleading.
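The 95%-accuracy trap in the example above is easy to reproduce. A Python sketch with a hypothetical 100-observation dataset:

```python
# A model that always predicts NON-EVENT on data where only 5% of
# observations are events looks accurate but catches nothing.

y_true = [1] * 5 + [0] * 95   # 5% events (hypothetical)
y_pred = [0] * 100            # degenerate "always NON-EVENT" model

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
sensitivity = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred)) / 5

print(accuracy)     # 0.95 -- looks great
print(sensitivity)  # 0.0  -- catches zero actual events
```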
Sensitivity vs. Specificity — The Trade-off Key Insight
Sensitivity (TPR)
TP / (TP + FN)

Out of all actual events, how many did we catch? Sensitive to events.

Specificity (TNR)
TN / (TN + FP)

Out of all actual non-events, how many did we correctly call non-event?

The trade-off: Lowering the threshold → more events predicted → Sensitivity ↑ but Specificity ↓ (more false alarms). Raising it does the reverse. You cannot improve both simultaneously without a better model.
ROC Curve & AUC
// receiver operating characteristic · area under curve · threshold sweep
The ROC Curve — What and Why Core Concept

Instead of evaluating the model at a single threshold, we sweep through all possible thresholds and plot the resulting (FPR, TPR) pairs. This is the Receiver Operating Characteristic (ROC) curve.

The ROC curve plots Sensitivity (TPR) on the y-axis vs. 1 − Specificity (FPR) on the x-axis. The threshold is the hidden third dimension driving both values.
Low threshold
Nearly everything classified as EVENT → TPR high, FPR high. Upper-right corner.
High threshold
Nearly everything classified as NON-EVENT → TPR low, FPR low. Lower-left corner.
The model cannot deviate from its ROC trajectory — choosing a threshold locks you to one point on the curve. The shape of the curve is fixed by the model itself.
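The threshold sweep can be sketched directly. This Python illustration (hypothetical labels and probabilities) records the (FPR, TPR) point at each threshold:

```python
# Sketch of the ROC construction: sweep thresholds and record the
# (FPR, TPR) pair produced at each one.

def roc_points(y_true, probs, thresholds):
    pos = sum(y_true)
    neg = len(y_true) - pos
    pts = []
    for t in thresholds:
        y_pred = [1 if p >= t else 0 for p in probs]
        tp = sum(yt == 1 and yp == 1 for yt, yp in zip(y_true, y_pred))
        fp = sum(yt == 0 and yp == 1 for yt, yp in zip(y_true, y_pred))
        pts.append((fp / neg, tp / pos))   # (FPR, TPR)
    return pts

y_true = [0, 0, 1, 0, 1, 1]
probs  = [0.1, 0.3, 0.4, 0.6, 0.7, 0.9]
pts = roc_points(y_true, probs, [0.0, 0.5, 1.01])
# threshold 0.0 -> everything EVENT: (1.0, 1.0) upper right
# threshold > 1 -> everything NON-EVENT: (0.0, 0.0) lower left
```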
Ideal vs. Useless Models Benchmarks
Perfect Model — AUC = 1.0

ROC curve goes straight up the y-axis to (0, 1) then right along the top. At some threshold, TPR = 1 and FPR = 0 simultaneously. Real models never achieve this.

Random Model — AUC = 0.5

ROC curve follows the diagonal from (0,0) to (1,1). Lowering the threshold increases FPR and TPR equally — no discrimination ability whatsoever.

AUC of 0.5 corresponds to random guessing and is the effective baseline: a model scoring consistently below 0.5 could be improved simply by inverting its predictions. A good model should have AUC well above 0.5. Another interpretation: AUC = the probability that the model ranks a random EVENT higher than a random NON-EVENT.
AUC — The Area Under the Curve Key Metric

The ROC curve can be summarised by a single number: the area under it, the AUC. This gives a threshold-independent measure of the model's discrimination ability.

AUC Range | Interpretation
0.5 | No better than random guessing
0.5 – 0.7 | Poor discrimination
0.7 – 0.8 | Acceptable
0.8 – 0.9 | Excellent
0.9 – 1.0 | Outstanding
Probabilistic interpretation: AUC = 0.82 means the model has an 82% chance of correctly ranking a randomly drawn EVENT above a randomly drawn NON-EVENT. AUC = 0.5 means it does no better than flipping a coin.
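The probabilistic interpretation gives a direct way to compute AUC without integrating the curve: count the fraction of (EVENT, NON-EVENT) pairs where the EVENT receives the higher predicted probability. A Python sketch on hypothetical data:

```python
# AUC as a ranking probability: the fraction of (EVENT, NON-EVENT)
# pairs where the EVENT gets the higher predicted probability
# (ties count one half).

def auc_by_ranking(y_true, probs):
    events     = [p for y, p in zip(y_true, probs) if y == 1]
    non_events = [p for y, p in zip(y_true, probs) if y == 0]
    wins = 0.0
    for pe in events:
        for pn in non_events:
            if pe > pn:
                wins += 1.0
            elif pe == pn:
                wins += 0.5
    return wins / (len(events) * len(non_events))

y_true = [0, 0, 1, 0, 1, 1]
probs  = [0.1, 0.3, 0.4, 0.6, 0.7, 0.9]
print(auc_by_ranking(y_true, probs))  # 8 of 9 pairs ranked correctly
```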
Reading the ROC Curve — What Good Looks Like Insight

An ideal ROC curve should:

  • Rise steeply toward TPR=1 while FPR remains near 0 (upper-left bulge)
  • Remain close to 1 on the TPR axis as FPR grows
  • Show a large area between the curve and the diagonal
In the toy demo, the fitted model ROC curve closely matches the TRUE ROC curve (possible only because we know the ground truth). The slight deviation from a perfect step function is caused by the inherent randomness in the observed binary outcomes — even a perfect probability model cannot achieve AUC = 1 when outcomes are stochastic.
Model Calibration
// calibration curves · predicted probability vs empirical frequency · bin width
Beyond Point-Wise Accuracy Motivation

Accuracy and ROC/AUC compare predicted classes to observed classes — they are point-wise metrics. But the model actually predicts probabilities. A different question emerges:

When the model predicts a 0.33 event probability, do events actually occur about 1/3 of the time in the real data? If yes, the model is well-calibrated.
Calibration is about long-run behavior. A weather forecaster is well-calibrated if it rains on ~70% of the days they forecast "70% chance of rain."
How Calibration Curves Are Built Method
1
Bin the predicted probabilities
Divide [0,1] into bins (e.g. 10 bins of width 0.1: [0,0.1), [0.1,0.2), …). Each bin is represented by its midpoint (0.05, 0.15, 0.25, …).
2
Count observations per bin
For each observation, place it in the bin matching its predicted probability.
3
Calculate empirical event frequency per bin
Within each bin, count how many observations actually had y=1 (events). Divide by total observations in that bin → empirical frequency.
4
Plot: predicted probability (x) vs empirical frequency (y)
A perfectly calibrated model lies on the 45° diagonal. Points above the line = model under-predicts probability. Points below = model over-predicts.
(Figure: calibration curve plotting the model's empirical event frequency per bin against predicted probability, with the perfect-calibration diagonal for reference.)
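The four-step recipe above can be sketched in Python (the predictions and labels here are hypothetical, and the bin count is a parameter you would tune as discussed next):

```python
# Bin predicted probabilities, then compare each bin's midpoint to the
# empirical event frequency inside the bin.

def calibration_bins(probs, y, n_bins=10):
    counts = [0] * n_bins
    events = [0] * n_bins
    for p, yi in zip(probs, y):
        b = min(int(p * n_bins), n_bins - 1)   # bin index; p = 1.0 -> last bin
        counts[b] += 1
        events[b] += yi
    midpoints = [(b + 0.5) / n_bins for b in range(n_bins)]
    freqs = [events[b] / counts[b] if counts[b] else None
             for b in range(n_bins)]           # None marks an empty bin
    return midpoints, freqs

probs = [0.05, 0.08, 0.12, 0.18, 0.55, 0.58, 0.93, 0.95, 0.97, 0.99]
y     = [0,    0,    0,    1,    1,    0,    1,    1,    1,    1]
mid, freq = calibration_bins(probs, y)
# A well-calibrated model keeps freq[b] close to mid[b] in occupied bins.
```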
The Bin Width Challenge Trade-off
Narrow bins (e.g. 0.05)
More resolution, but fewer observations per bin. Empirical frequencies are noisy estimates — unreliable with small samples.
Wide bins (e.g. 0.2)
More observations per bin → more stable estimates. But fewer bins means less detail about model behaviour across the probability range.
The lecture shows both 10-bin and 5-bin calibration curves for the same model. There is no universally correct bin width — it depends on dataset size and how much detail you need.
Calibration vs. Accuracy — Key Differences Comparison
Dimension | Accuracy / ROC / AUC | Calibration
What it measures | Point-wise class discrimination | Agreement between predicted probabilities & observed frequencies
Requires threshold? | Yes (Accuracy); No (AUC) | No
Can be computed in real problems? | Yes | Yes
View of model behaviour | Single-prediction accuracy | Long-run aggregate behaviour
A model can have high AUC but poor calibration (predictions rank well but the probabilities are systematically biased). Both perspectives matter, especially when the predicted probability is used directly (e.g. risk scoring).
The TOY Demo Advantage (Again) Toy Problem Insight

In a real problem, you can never directly compare the model's predicted probability to the "true" probability — because the true probability is unknown. The only comparison available is to the observed 0/1 outcomes.

In the toy demo, we can overlay the model's predicted probability curve against the TRUE probability curve. This is an exclusive luxury of synthetic data — and it shows the model is doing a good job of recovering the true S-curve.

An important insight from this comparison: when the model predicts ~0.5, we should expect a near 50/50 mix of 0s and 1s — and we should not be surprised by misclassifications in that region. They are statistically appropriate, not model failures.
