INFSCI 2595 · Week 1

Introduction to Machine Learning

Interactive Study Guide — Lecture Notes Companion
What is Machine Learning?
// pattern recognition · applied math · course structure · what we'll learn
The One-Line Definition Core
Machine Learning = tools for recognizing patterns in data
…essentially, applied linear algebra & probability

Every ML algorithm — from a simple linear regression to a neural network — is built on the same mathematical bedrock. The models, languages, and tools change constantly; the foundations do not.

Even the most advanced "AI" is based on basic principles of mathematics and statistics. Understanding those principles is what separates practitioners who just run code from those who truly understand what they are doing.
What the Course Covers Big Picture
Algorithms & Theory
  • Linear models & regularization
  • Generalized linear models (logistic regression)
  • Bayesian linear models
  • Neural networks
  • Decision trees, random forests, boosting
  • Unsupervised: PCA & clustering
Concepts & Practice
  • Bias-variance trade-off
  • Cross-validation & resampling
  • Maximum likelihood (MLE)
  • Bayesian inference (MAP)
Course Structure — 4 Primary Modules Roadmap
  1. Applied ML (now)
  2. Distribution Fitting
  3. Supervised Learning Deep Dive
  4. Unsupervised Learning

The course deliberately starts with applied ML (regression & classification hands-on) before building up the statistical theory. This lets you see the big picture first, then understand why everything works.

Why not jump straight to neural networks? The course builds understanding from fundamental linear models upward. Neural networks then become natural extensions of basic principles, not mysterious black boxes.
Why R? Tools

The course uses R because the focus is on statistics, not software engineering. R makes the statistical fundamentals front-and-center.

R offers powerful packages for visualization (ggplot2), data manipulation (dplyr/tidyverse), modeling (modelr, caret, tidymodels), and more.
Homework uses R Markdown — code + written explanation in one document, submitted as rendered HTML.

⚡ If you understand the statistics and linear algebra, you have the foundation to work in any programming language once you know the syntax. The language is a tool; the math is the skill.

The Core ML Framework
// y ≈ f(x) · loss functions · parameters · training vs testing · supervised learning
The Fundamental Approximation Foundation
The supervised learning goal
y ≈ f(x) + error

We want to learn a function f(x) that maps inputs x to outputs y. Because we learn from noisy, finite data, the relationship is always approximate — there will always be error.
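To make y = f(x) + error concrete, here is a minimal illustrative sketch (not from the lecture; the course itself uses R, but plain Python keeps the example self-contained): we simulate noisy data from a known linear f, then recover an approximate f_hat by ordinary least squares. The numbers (slope 2, intercept 1, noise level 0.5) are made up for illustration.

```python
# Illustrative sketch: a noisy linear "truth" y = f(x) + error,
# and a learned approximation f_hat fit by least squares.
import random

random.seed(1)

def f(x):                       # the true (unknown) relationship
    return 2.0 * x + 1.0

# Finite, noisy observations: y = f(x) + error
xs = [i / 10 for i in range(50)]
ys = [f(x) + random.gauss(0, 0.5) for x in xs]

# Learn f_hat via the closed-form least-squares solution (one input)
n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n
slope_num = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
slope_den = sum((x - x_bar) ** 2 for x in xs)
slope = slope_num / slope_den
intercept = y_bar - slope * x_bar

print(f"f_hat(x) = {slope:.2f} * x + {intercept:.2f}")  # close to 2x + 1, never exact
```

Because the data are noisy and finite, the learned coefficients land near, but never exactly on, the true values: the relationship is always approximate.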

Inputs (features)

Denoted x (scalar), x (vector), or X (matrix). What we observe and feed into the model.

Output (response / target)

Denoted y or t. What we want to predict. Can be continuous, categorical, count, etc.

The Three Pillars of Every ML Algorithm Structure
1
Loss / Objective Function
Every ML algorithm requires specifying what is to be learned: the goal. All ML seeks to minimize this loss by adjusting the model's parameters. Example: YouTube's recommender optimizes for watch time; a regression model minimizes prediction error.
2
Tunable Parameters (β / coefficients)
Every model has a set of parameters that control what patterns it can identify. Learning = finding the parameter values that minimize the loss function. More parameters → more complex model.
3
Training Data + Testing Data
Training data is used to learn the optimal parameters. Testing data evaluates whether the model generalizes to new samples. A model that only performs well on training data may have memorized noise — it hasn't truly learned.
Learning is supervised because the known output labels guide (supervise) the learning of the unknown model parameters. The parameters are learned by minimizing the model error.
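All three pillars fit in a few lines. The illustrative Python sketch below (toy numbers, one made-up parameter beta) shows a loss function, a tunable parameter learned by minimizing that loss on training data, and a held-out test set used to check generalization.

```python
# The three pillars for a toy one-parameter model y ≈ beta * x:
# (1) a loss function, (2) a tunable parameter beta, (3) train vs. test data.
import random

random.seed(2)
data = [(x, 3.0 * x + random.gauss(0, 1.0)) for x in [i / 50 for i in range(100)]]
random.shuffle(data)
train, test = data[:80], data[80:]          # pillar 3: train/test split

def loss(beta, pairs):                      # pillar 1: mean squared error
    return sum((y - beta * x) ** 2 for x, y in pairs) / len(pairs)

# pillar 2: learning = adjusting beta to minimize the training loss
beta, lr = 0.0, 0.1
for _ in range(500):
    grad = sum(-2 * x * (y - beta * x) for x, y in train) / len(train)
    beta -= lr * grad

print(f"learned beta = {beta:.2f}")
print(f"train loss = {loss(beta, train):.2f}, test loss = {loss(beta, test):.2f}")
```

The test loss, not the training loss, is the honest estimate of how well the model has learned the underlying pattern rather than the training noise.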
The Role of Probability Distributions Why It Matters

Model assumptions are related to probability distributions. Understanding the probability distributions behind popular model formulations allows us to:

Understand more about the data generating process
Understand what the model thinks can happen
Quantify and communicate uncertainty

Two frameworks for finding optimal parameters: Maximum Likelihood Estimation (MLE) — frequentist statistics, and Maximum a Posteriori (MAP) — Bayesian inference. Both will be covered in depth.
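A toy numeric preview of the MLE/MAP distinction (illustrative assumptions, not from the lecture: Gaussian data with known variance and a Gaussian prior on the mean): the MLE is just the sample mean, while the MAP estimate is a precision-weighted compromise between the data and the prior.

```python
# Toy MLE vs. MAP for the mean of Gaussian data.
# Assumptions (made up for illustration): known data variance sigma2,
# prior on the mean: Normal(mu0, tau2).
import random

random.seed(3)
sigma2 = 1.0                     # known data variance
mu0, tau2 = 0.0, 0.5             # prior mean and prior variance

data = [random.gauss(2.0, sigma2 ** 0.5) for _ in range(10)]
n = len(data)

mle = sum(data) / n              # MLE: the sample mean

# MAP with a Gaussian prior: precision-weighted average of prior and data
map_est = (mu0 / tau2 + sum(data) / sigma2) / (1 / tau2 + n / sigma2)

print(f"MLE = {mle:.2f}")
print(f"MAP = {map_est:.2f}  (shrunk toward the prior mean {mu0})")
```

With little data the prior pulls the MAP estimate noticeably toward mu0; as n grows, the data term dominates and MAP converges to MLE.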

Understanding Assumptions = Understanding the Model Insight

When we understand a model's assumptions, we can:

  • Understand what controls the model behavior
  • Identify the strengths and weaknesses of different methods
  • Recognize when a method is NOT appropriate
  • Interpret the model's learned parameters
Reality check: You can apply methods without understanding them, and sometimes a failure is obvious enough that you can diagnose and fix it. The real danger is when the model is silently wrong and you never notice.
ML Problem Types
// supervised vs unsupervised · regression · classification · other response types
Supervised vs. Unsupervised Learning Top-Level Split
Supervised Learning

We have labelled input-output pairs. The output supervises learning of the model parameters. Goal: learn a mapping from input to output.

e.g. Predicting house price from features, classifying emails as spam/not-spam.

Unsupervised Learning

No output labels. Goal: discover interesting patterns and structure in the input data without being told what to look for.

e.g. Clustering customers, PCA for dimensionality reduction.

This course focuses primarily on supervised learning, with coverage of unsupervised methods (PCA and clustering) in the final module. The key question in supervised learning: does a new observation correctly trigger the learned pattern?
Supervised Learning: Response Types Taxonomy
Regression

Continuous response — any real number.

House price · stock price · temperature next week · expected video views
Binary Classification

Exactly 2 possible classes.

Loan default / no default · pass / fail · disease present / absent
Multi-class Classification

3 or more possible classes.

Image classification · which team wins the Super Bowl · next song to recommend
Other Response Types

Also exist — covered later.

Count data (calls/hour, goals/game) · Survival/reliability (time to failure)
This course focuses primarily on binary classification, but will cover the multiclass generalization — the fundamental ideas transfer directly.
Unsupervised Learning ("Data Discovery") Definition

Observe variables without a distinction between inputs and responses. Explore the inputs without regard for any label (or when no label exists).

Find relationships between variables — which features co-vary?
Find relationships between observations — which samples cluster together?
Especially useful in high-dimensional settings where manual exploration is impossible

In this course: K-means and hierarchical clustering for grouping observations; Principal Component Analysis (PCA) for dimensionality reduction and discovering latent structure.
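As a preview of the clustering idea, here is a minimal, illustrative K-means sketch in plain Python (the course will use R; all data and starting values are made up): alternate between assigning points to their nearest centroid and moving each centroid to its cluster's mean.

```python
# Minimal K-means sketch: two well-separated 2-D blobs, k = 2.
import random

random.seed(4)
# two obvious blobs, centered near (0, 0) and (5, 5)
points = [(random.gauss(0, 0.5), random.gauss(0, 0.5)) for _ in range(30)] + \
         [(random.gauss(5, 0.5), random.gauss(5, 0.5)) for _ in range(30)]

def dist2(p, q):
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2

centroids = [(0.0, 1.0), (4.0, 4.0)]          # crude initial guesses
for _ in range(10):
    # assignment step: each point joins its nearest centroid
    clusters = [[], []]
    for p in points:
        k = min(range(2), key=lambda k: dist2(p, centroids[k]))
        clusters[k].append(p)
    # update step: each centroid moves to its cluster's mean
    centroids = [(sum(p[0] for p in c) / len(c), sum(p[1] for p in c) / len(c))
                 for c in clusters]

print("centroids:", [(round(x, 1), round(y, 1)) for x, y in centroids])
```

No labels were used anywhere: the grouping emerges purely from the structure of the inputs, which is exactly what "data discovery" means.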

Other Types of Learning Broader Landscape

Supervised and unsupervised are the two main paradigms, but others exist:

  • Semi-supervised: uses a mix of labelled and unlabelled data
  • Self-supervised: generates its own supervision signal from the data structure (e.g. predicting masked words)
  • Active learning: the model queries for labels on the most informative samples
  • Online learning: the model updates continuously as new data arrives
  • Reinforcement learning: an agent learns via trial and error, receiving reward/penalty signals
The ML Workflow
// data access · EDA · cleaning · preprocessing · training · resampling · model selection
The Complete Supervised Learning Pipeline Big Picture

Real ML projects follow this pipeline. Note that EDA appears at multiple points — understanding the data is an ongoing activity, not a one-time step.

Data Access → EDA → Contextualize & Clean → EDA → Identify Models & Preprocess → Fit on Training Data → Resampling → Identify Best Model
⟲ This process is iterative — model results often send you back to re-examine or re-clean the data
Each Stage Explained Details
1
Data Access
Acquiring suitable data can be complicated — especially ethically when human subjects are involved. Raw data is typically spread across multiple sources and not organized conveniently. Sources must be merged using common "keys."
2
EDA — Exploratory Data Analysis
Create figures and describe the data to motivate further analysis. Identify relationships between inputs and outputs, spot outliers, understand distributions, and decide on preprocessing steps. EDA happens both before and after cleaning.
3
Contextualize & Clean
Remove duplicate rows, correct erroneous values, handle missing data. This is the most time-consuming part of real ML projects — data access, contextualization, and cleaning can take 60–80% of total project time. The most sophisticated model is worthless on invalid data.
4
Identify Candidate Models & Preprocess
Choose models appropriate for the data structure, response type, and domain. Apply preprocessing (standardization, normalization, feature selection, transformations) based on the assumptions of the selected models. Preprocessing choices depend on model choice.
5
Fit Models to Training Data
Train each candidate model on the training set. This is where the "cool stuff" lives, but it is only a small fraction of the total effort. Analogy: scoring 100% on a practice exam does not mean you will ace the real exam; training performance alone does not prove learning.
6
Resampling (Cross-Validation)
Evaluate models on data they haven't seen to estimate true generalization performance. Prevents overfitting to the training set. ML models must be tested on unseen data to determine whether they've truly learned a general pattern.
7
Compare Models & Identify Best
Compare candidates using held-out performance metrics. Select the best model considering both performance and complexity. The 1-SE rule (from Week 2) helps avoid selecting unnecessarily complex models.
Data Format: The Rectangular Table Structure

The ideal data format is a flat rectangular table (a data frame or tibble in R):

Observation | Input 1 | Input 2 | Response 1 (continuous) | Response 2 (binary)
1           | 5.2     | green   | 43.1                    | TRUE
2           | 6.1     | green   | 57.4                    | FALSE
3           | 2.0     | yellow  | 18.9                    | FALSE
Each row = one observation/sample. Each column = one input feature or response variable. In real projects, data usually starts scattered across multiple sources and must be merged before reaching this tidy format.
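Merging scattered sources on a common key can be sketched in a few lines. The example below is illustrative (hypothetical tables and column names; in R this is dplyr's inner_join, in Python it is pandas' merge): two raw sources are joined on an "id" key to produce one rectangular table.

```python
# Illustrative inner join on a common key ("id"), producing a tidy table.
measurements = [{"id": 1, "input_1": 5.2}, {"id": 2, "input_1": 6.1},
                {"id": 3, "input_1": 2.0}]
outcomes = [{"id": 1, "response": True}, {"id": 2, "response": False},
            {"id": 3, "response": False}]

# index one source by its key, then combine matching rows from the other
by_id = {row["id"]: row for row in outcomes}
table = [{**m, **by_id[m["id"]]} for m in measurements if m["id"] in by_id]

for row in table:
    print(row)
```

Rows whose key appears in only one source are dropped by an inner join; other join types (left, right, full) keep them with missing values, which then become a cleaning decision.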

Why Most ML Projects Fail Reality Check

Despite enormous hype, a very large fraction of "big data" projects fail (estimates around 80%), and only a small fraction of proof-of-concept projects ever make it into production.

Root cause: Data that is not suitable for analysis. Shortcuts taken during data preparation. Problems that simply aren't amenable to ML.
The fix: Understand and prepare the data before throwing complex models at the problem. Data quality determines the ceiling on model performance.
Why Math Matters
// the danger zone · understanding assumptions · "WHY" questions · what math enables
Avoiding the "Danger Zone" Motivation

Data science sits at the intersection of three skill areas. The "Danger Zone" occurs when you have hacking skills and domain knowledge but no math & statistics understanding — you can produce confident-looking but completely wrong results without noticing.

[Figure: data science Venn diagram. Three circles: Hacking Skills, Math & Statistics Knowledge, Substantive Expertise. Pairwise overlaps: Machine Learning (hacking + math), Traditional Research (math + expertise), Danger Zone (hacking + expertise). The center of all three is Data Science.]
The Danger Zone is the overlap of Hacking Skills + Substantive Expertise without Math & Statistics. You can get code to run, results to look plausible, and confidently report conclusions that are entirely wrong.
What Mathematical Understanding Enables The Goal

After building solid foundations, you will be able to confidently answer questions like:

🔍
WHY did it work?
Trace the result back to the model assumptions and data properties
WHY didn't it work?
Identify violated assumptions, data quality issues, or mismatched loss functions
⚙️
WHY that setup?
Justify modeling choices — not just copy a tutorial blindly
🔄
WHY different results?
Diagnose differences between your implementation and a reference
📊
WHY does it matter?
Communicate findings clearly to stakeholders in research, business, or policy
🚀
Adapt easily
Pick up new languages and frameworks quickly — the foundations transfer everywhere
The Practical Philosophy Mindset
Tools, languages, and specific techniques are constantly changing. With a solid mathematical foundation, you can adapt easily to anything new.

The most important part of machine learning is understanding the statistics behind the model. Math may seem intimidating at first, but visual and intuitive explanations — like those found in good online resources — make it far more approachable than it might seem.

As a data scientist, your value comes from applying models to real use cases, finding insights about key factors driving behavior, and communicating those findings effectively. The math is what makes that possible reliably — not just occasionally by luck.
ML in the Real World
// applications across fields · what ML can do · generative AI · hype vs reality
Fields Transformed by ML Impact
🏥
Healthcare
Disease detection, drug discovery, patient risk scoring, public health surveillance
💰
Finance
Credit risk, fraud detection, algorithmic trading, loan default prediction
🏭
Engineering
Predictive maintenance, quality control, fault detection, reliability analysis
📱
Tech & Products
Recommendation systems, search ranking, content moderation, ad targeting
🔬
Science & Research
Seismic signal classification, genomics, climate modeling, particle physics
📚
Humanities & Social Science
Text corpus analysis for literary & legal studies, social media analysis

ML is being used by companies across every industry: Microsoft Azure, GE, Amazon Web Services (AWS), Google (TensorFlow), Airbnb, Coca-Cola, Netflix, the NFL, and countless others.

What ML Actually Does Capabilities
  • Find patterns automatically: discover structure in data without being told what to look for
  • Model relationships: link observed measurements/traits to outcomes (purchases, health, capacity)
  • Predict outcomes: how likely is a new customer to buy a cordless drill, given similar customers' behavior?
  • Adapt to feedback: improve performance relative to environment and user feedback (reinforcement learning)
  • Generate outputs: create new content based on some class of input (chatbots, generative art, code)
Generative AI & LLMs in Context Perspective

Recent advances in large language models and generative AI have created enormous hype. Two important things to keep in mind:

LLMs hallucinate. There is a good chance an LLM will give you the wrong answer to technical ML questions — especially on the specific mathematical details you will be working with in this course.
All of it is math. The most impressive generative models are fundamentally extensions of the same principles — linear algebra, probability, optimization — that this course is built around.
The tension illustrated in the lecture: a model prompted to paint "Harvey Milk" produces a stylistically compelling but factually incorrect portrait. High capability, genuine risk. Understanding the foundations helps you use these tools critically.
Why ML Became Popular Now History

The mathematical ideas behind ML are decades old. What changed recently:

  • Ubiquitous digital data collection (surveillance, sensors, social media)
  • Data centers and supercomputers capable of processing it all
  • Proliferation of internet-connected devices (IoT)
  • Growth of ML/data science university programs
  • Open-source algorithms and frameworks (TensorFlow, scikit-learn, R, etc.)
Quick Glossary
// All key terms from the Week 1 lectures