10-315 - Fall 2023

Introduction to Machine Learning




Key Information

Lectures: Sundays, Mondays, Tuesdays, Wednesdays, 4:00pm - 5:15pm, Room 2049

Units: 12.0

Grading: 25% Final, 15% Midterm, 38% Homework, 18% Multiple-choice Quizzes, 4% Participation

Prerequisites: (15122) and (21127 or 21128 or 15151) and (21325 or 36217 or 36218 or 36225 or 15359). In general, a solid background in CS, calculus, and probability theory is needed to deal successfully with the course challenges.

Teaching Assistant: Zhijie Xu


Overview

Machine learning is a subfield of computer science with the goal of exploring, studying, and developing learning systems, methods, and algorithms that can improve their performance by learning from data. The course is designed to give undergraduate students a one-semester introduction to the main principles, algorithms, and applications of machine learning.


After completing the course, students will be able to:

  • select and apply an appropriate supervised learning algorithm for classification problems (e.g., naive Bayes, perceptron, support vector machine, logistic regression);

  • select and apply an appropriate supervised learning algorithm for regression problems (e.g., linear regression, ridge regression);
  • recognize different types of unsupervised learning problems, and select and apply appropriate algorithms (e.g., density estimation, clustering, linear and nonlinear dimensionality reduction);

  • work with probabilities (Bayes rule, conditioning, expectations, independence), linear algebra (vector and matrix operations, eigenvectors, SVD), and calculus (gradients, Jacobians) to derive machine learning methods such as linear regression, naive Bayes, and principal components analysis;

  • understand machine learning principles such as model selection, overfitting, and underfitting, and techniques such as cross-validation and regularization;

  • implement machine learning algorithms such as logistic regression via stochastic gradient descent, linear regression, perceptron, SVMs, boosting, k-means clustering;

  • run appropriate supervised and unsupervised learning algorithms on real and synthetic data sets and interpret the results.


The course is organized as follows:

  • The course will be based on lectures. As a general rule, lectures will be given on Sundays and Mondays, while Tuesdays and Wednesdays will be used either as lecture or as recitation classes. Recitation classes will be aimed at revising and/or expanding concepts introduced in the lectures, working out example cases, and covering aspects that may not be in the students' background. In any case, based on the week/topic, class days will be dynamically allocated for their best use :)
    Students are expected to attend all classes and to actively participate with questions and by answering in-class quizzes. Class attendance and participation will count for 4% of the final grade.

  • For each one of the different topics, the course will present relevant techniques, discuss formal results, and show the application to problems of practical and theoretical interest.

  • Homework will include both questions to be answered and programming assignments. Written questions will involve working through different algorithms, deriving and proving mathematical results, and critically analyzing the material presented in class. Programming assignments will mainly involve writing code to implement and test algorithms in relevant scenarios.

  • Quizzes, in the form of multiple-choice questions (hereafter referred to as QNAs), will be used to check student progress and to encourage continuous review of the lecture material.

Prerequisites

Having successfully passed the following courses is necessary: (15122) and (21127 or 21128 or 15151) and (21325 or 36217 or 36218 or 36225 or 15359).

In general, familiarity with Python programming and a solid background in general CS, calculus, and probability theory are needed to deal successfully with the course challenges. Some basic concepts in CS, calculus, and probability will be briefly revised (but not re-explained from scratch).

Talk to the teacher if you are unsure whether your background is suitable for the course.

Grading

Course grading will be assigned based on the following weighting:

  • 38% Homework
  • 25% Final exam
  • 15% Midterm exam
  • 18% QNAs
  • 4% Participation

There will be about five homework assignments and six QNAs. The final exam will include questions about all the topics covered in the course, with an emphasis on the topics introduced after the midterm exam. QNAs will consist of multiple-choice questions aimed at keeping up with the topics of the course in between homeworks.

Note that grades will NOT be curved. The mapping between scores and letter grades will roughly follow the scheme below. However, final course scores will be converted to letter grades based on grade boundaries that will be precisely determined at the end of the semester, accounting for a number of aspects, such as participation in lecture and recitation, exam performance, and overall grade trends. Note that precise grade cutoffs will not be discussed at any point during or after the semester.

  • A: score ≥ 90%
  • B: 80% ≤ score < 90%
  • C: 70% ≤ score < 80%
  • D: 60% ≤ score < 70%


Textbooks

In addition to the lecture handouts (which will be made available after each lecture), the instructor will provide additional material during the course to cover specific topics.

A number of (optional) textbooks can be consulted to ease the understanding of the different topics (the relevant chapters will be pointed out by the teacher):

  • Machine Learning, Tom Mitchell (in the library)
  • Pattern Recognition and Machine Learning, Christopher Bishop, available online
  • Machine Learning: A Probabilistic Perspective, Kevin P. Murphy, available online
  • A Course in Machine Learning, Hal Daumé, available online
  • Mathematics for Machine Learning, Marc Peter Deisenroth, A. Aldo Faisal, Cheng Soon Ong, Cambridge University Press, available online
  • Pattern Classification, Richard Duda, Peter Hart, David Stork, 2nd ed., partially online (in the library)
  • Deep Learning, Ian Goodfellow, Yoshua Bengio, Aaron Courville, available online
  • Kernel Methods for Pattern Analysis, John Shawe-Taylor, Nello Cristianini, available online

Schedule

(Subject to changes)

Dates   Topics   Slides / References   Assignments
8/20 Introduction, General overview of ML - Course aims and logistics; introduction to ML; core concepts; taxonomy of main ML tasks and paradigms; introduction to supervised, unsupervised, and reinforcement learning. pdf
8/21 SL workflow, Empirical risk, Generalization, Canonical SL problem - SL workflow and design choices; classification example; hypothesis class; loss function; generalization; empirical risk minimization; canonical SL problem. pdf
  • Murphy Chapter 1.4
  • Bishop Chapter 1
8/22 Loss functions, Overfitting, Model selection and Generalization - Examples of loss functions for classification and regression; regression example, overfitting and model selection; generalization concepts. pdf
8/23 A first model: K-Nearest Neighbors (k-NN) - Instance-based and non-parametric methods; basic concepts of k-NN; k-NN classifier vs. optimal classifier; decision regions and decision boundaries; 1-NN decision boundary and Voronoi tessellation of the feature space; small vs. large K; k-NN regression; non-parametric vs. parametric methods. pdf

  • Self-Review of basic concepts: calculus, linear algebra, probability theory; Tutorial on Numpy
8/24: QNA1 out
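
As a flavor of the kind of implementation work the homework asks for, here is a minimal NumPy sketch of a k-NN classifier (an illustrative example only, not course-provided code; the function name and toy data are made up):

    import numpy as np

    def knn_predict(X_train, y_train, X_query, k=3):
        """Predict labels for X_query by majority vote among the k nearest training points."""
        preds = []
        for x in X_query:
            # Euclidean distances from the query point to every training point
            dists = np.linalg.norm(X_train - x, axis=1)
            # Indices of the k closest training points
            nearest = np.argsort(dists)[:k]
            # Majority vote among their labels
            labels, counts = np.unique(y_train[nearest], return_counts=True)
            preds.append(labels[np.argmax(counts)])
        return np.array(preds)

    # Tiny synthetic example: two clusters in 2D
    X = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
    y = np.array([0, 0, 1, 1])
    print(knn_predict(X, y, np.array([[0.2, 0.1], [1.0, 0.9]]), k=3))  # expected [0 1]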

8/27 Decision Trees 1: SL based on the Divide-and-Conquer Model - Learning by asking questions; structure of decision trees; expressiveness of DTs; hypothesis space and NP-hardness of finding the simplest consistent hypothesis; recursive dataset decomposition, divide-and-conquer; axes-parallel decision boundaries; greedy top-down heuristics for decision trees: ID3, C4.5. pdf
  • Mitchell Chapters 1, 2, 6.1 - 6.3
  • Bishop Chapters 1, 2
8/28 Decision Trees 2 - Entropy and information gain; purity of a labeled set; maximal gain for choosing the next attribute; overfitting issues and countermeasures; discrete vs. continuous features; continuous features, ranges, binary splits; decision tree regression, measures of purity. pdf
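
To make the entropy and information-gain computations from this lecture concrete, here is a small illustrative NumPy sketch (the function names and toy labels are hypothetical, not taken from the course material):

    import numpy as np

    def entropy(labels):
        """Shannon entropy (in bits) of a label array."""
        _, counts = np.unique(labels, return_counts=True)
        p = counts / counts.sum()
        return -np.sum(p * np.log2(p))

    def information_gain(labels, feature):
        """Entropy reduction obtained by splitting `labels` on a discrete `feature`."""
        gain = entropy(labels)
        for v in np.unique(feature):
            mask = feature == v
            gain -= mask.mean() * entropy(labels[mask])
        return gain

    # Toy example: a binary feature that perfectly separates the labels has maximal gain
    y = np.array([0, 0, 1, 1])
    x = np.array(['a', 'a', 'b', 'b'])
    print(entropy(y), information_gain(y, x))  # 1.0 and 1.0
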
8/29 Model selection and Cross-Validation - Overfitting, estimation of generalization error, and model selection; hold-out method; cross-validation methods: k-fold CV, leave-one-out CV, random subsampling; design issues in CV; general scheme for model optimization and selection using validation sets. pdf
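
The k-fold cross-validation scheme from this lecture can be outlined in a few lines; the sketch below is purely illustrative (the helper names and the trivial "predict the training mean" model are made up for the example):

    import numpy as np

    def k_fold_indices(n, k, seed=0):
        """Split indices 0..n-1 into k roughly equal validation folds."""
        rng = np.random.default_rng(seed)
        return np.array_split(rng.permutation(n), k)

    def cross_val_error(train_and_eval, X, y, k=5):
        """Average validation error over k folds; train_and_eval(X_tr, y_tr, X_va, y_va) returns an error."""
        errors = []
        for fold in k_fold_indices(len(y), k):
            mask = np.zeros(len(y), dtype=bool)
            mask[fold] = True
            errors.append(train_and_eval(X[~mask], y[~mask], X[mask], y[mask]))
        return float(np.mean(errors))

    # Example with a trivial "predict the training mean" regressor (hypothetical model)
    rng = np.random.default_rng(0)
    X, y = rng.normal(size=(40, 3)), rng.normal(size=40)
    mean_model = lambda X_tr, y_tr, X_va, y_va: np.mean((y_va - y_tr.mean()) ** 2)
    print(cross_val_error(mean_model, X, y, k=5))
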
8/30 Recitation - Decision trees, k-NN, cross-validation. 8/31: QNA1 due; HW1 out

9/3 Linear models, Linear regression 1, Feature transformations - General form and properties of linear models for classification and for regression; geometry of decision boundaries and regression functions; formulation of a linear model, bias, scalar product; basic geometrical properties; regression problems, examples and taxonomy; empirical predictor; hypothesis class and loss functions; linear regression with squared losses: Ordinary Least Squares (OLS); problem formulation and solution approaches; predictions as a weighted linear combination of labels; linear regression for non-linear data; feature spaces; basis functions as features; solution using feature functions; examples with polynomial regression. pdf
9/4 Linear regression 2, Regularization - Issues related to solving OLS: matrix inversion, computations, numerical instabilities; normal equations vs. SGD vs. algebraic methods; controlling model complexity (and avoiding singularities) using regularization; effects of different regularization approaches; Ridge regression as constrained optimization, shrinking of weights; Lasso regression as constrained optimization, shrinking and zeroing of weights; comparison among Lp-norm regularizations; comparative analysis over a number of test scenarios with linear and polynomial features; Elastic Net regression; a real-world regression scenario from data wrangling to model selection. pdf
  • Bishop 3.1.4
  • Daumé Chapters 7.2, 7.3
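
As a quick illustration of the closed-form solutions mentioned in the 9/3-9/4 lectures, here is a minimal NumPy sketch of (regularized) least squares; the names and toy data are illustrative assumptions, not course-provided code:

    import numpy as np

    def fit_ridge(X, y, lam=0.0):
        """Closed-form regularized least squares: w = (X^T X + lam*I)^{-1} X^T y.
        With lam=0 this reduces to ordinary least squares (OLS)."""
        n_features = X.shape[1]
        A = X.T @ X + lam * np.eye(n_features)
        return np.linalg.solve(A, X.T @ y)   # solving the system is preferable to an explicit inverse

    # Toy 1D example with a bias column: y ~ 1 + 2x plus a little noise
    rng = np.random.default_rng(0)
    x = rng.uniform(-1, 1, size=50)
    X = np.column_stack([np.ones_like(x), x])        # [1, x] features
    y = 1.0 + 2.0 * x + 0.1 * rng.normal(size=50)
    print(fit_ridge(X, y))            # roughly [1, 2]
    print(fit_ridge(X, y, lam=10.0))  # regularization shrinks the weights
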
9/5 Gradient methods for optimization - Concave and convex functions, local and global optima; partial derivatives and their calculation; gradient vectors; geometric properties; general framework for iterative optimization; gradient descent / ascent; design choices: step size, convergence check, starting point; zig-zag behavior of gradients and function conditioning properties; sum functions and stochastic approximations of gradients; batch, incremental, and mini-batch GD; properties of stochastic GD. pdf
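
A minimal sketch of the batch gradient-descent loop described in this lecture, applied to a simple quadratic objective (purely illustrative; the objective, step size, and iteration count are made up):

    import numpy as np

    def gradient_descent(grad, w0, step=0.1, n_iters=100):
        """Generic batch gradient descent: repeatedly move against the gradient."""
        w = np.asarray(w0, dtype=float)
        for _ in range(n_iters):
            w = w - step * grad(w)
        return w

    # Minimize f(w) = (w0 - 3)^2 + 2*(w1 + 1)^2, whose gradient is easy to write down
    grad_f = lambda w: np.array([2 * (w[0] - 3), 4 * (w[1] + 1)])
    print(gradient_descent(grad_f, w0=[0.0, 0.0]))  # approaches [3, -1]
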
9/6 Recitation - Linear regression, Regularization, Gradient methods. Notebook for gradient descent Sep 7: HW1 due, QNA2 out

9/10 Break, no classes
9/11 Break, no classes
9/12 Non-parametric / Kernel Regression - Closed-form solutions for prediction from linear models, weighted linear combination of labels; smoothing kernels; examples with Gaussian and other basis functions; localization and weighted averages of observations; bias-variance tradeoff; kernel regression as non-parametric regression; Nadaraya-Watson estimator; examples of kernels; role and impact of bandwidth; k-NN as another non-parametric estimator; kernel regression and least squares formulations. pdf
  • Bishop Chapter 3.3, 3.2, 2.5, 6.2, 6.3.1
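
The Nadaraya-Watson estimator from the 9/12 lecture amounts to a kernel-weighted average of the training labels; a small illustrative NumPy sketch follows (the toy data and bandwidth are arbitrary choices):

    import numpy as np

    def nadaraya_watson(x_train, y_train, x_query, bandwidth=0.2):
        """Kernel regression: prediction is a kernel-weighted average of the training labels."""
        # Gaussian kernel weights between every query point and every training point
        diffs = x_query[:, None] - x_train[None, :]
        weights = np.exp(-0.5 * (diffs / bandwidth) ** 2)
        return (weights @ y_train) / weights.sum(axis=1)

    # Noisy sine data; a small bandwidth follows the data closely, a large one oversmooths
    rng = np.random.default_rng(1)
    x = np.sort(rng.uniform(0, 2 * np.pi, 100))
    y = np.sin(x) + 0.1 * rng.normal(size=100)
    x_new = np.array([0.5, np.pi, 5.0])
    print(nadaraya_watson(x, y, x_new, bandwidth=0.2))  # roughly sin(x_new)
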
9/13 Recitation - Non parametric methods, more on gradient methods QNA2 due, HW2 out

9/17 Bayes Optimal Classifier, Decision Boundaries - Probabilistic view, elements of decision theory, first review of probability concepts, Bayes rule, decision boundaries and classification errors, generative and discriminative modeling, introduction to probability distribution estimation pdf
9/18 Estimating Probabilities 1, MLE/MAP - Probability estimation and Bayes classifiers; importance of, and challenges in, estimating probabilities from data; frequentist vs. Bayesian approach to modeling probabilities; definition and properties of MLE, MAP, and Full Bayes approaches for parameter estimation; priors and conjugate probability distributions; examples with Bernoulli data and Beta priors. pdf
9/19 Estimating Probabilities 2, MLE/MAP - Review of concepts, overview of conjugate priors for continuous and discrete distributions; practical examples; Bernoulli-Beta, Binomial-Beta, Multinomial-Dirichlet, Gaussian-Gaussian; MLE vs. MAP vs. Full Bayes, pros and cons. pdf
  • Bishop Chapters 2, 4.2, 4.3
  • Murphy Chapters 3, 5
  • Duda, Hart, Stork, Chapter 3.3, 3.4, 3.5
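
A tiny worked example of the MLE vs. MAP estimates for Bernoulli data with a Beta prior, in the spirit of the 9/18-9/19 lectures (the coin-flip counts and prior parameters are made up for illustration):

    import numpy as np

    # Coin-flip data: 7 heads out of 10 tosses (hypothetical numbers)
    data = np.array([1, 1, 1, 1, 1, 1, 1, 0, 0, 0])
    n_heads, n_tails = data.sum(), (1 - data).sum()

    # MLE for the Bernoulli parameter: relative frequency of heads
    theta_mle = n_heads / (n_heads + n_tails)

    # MAP with a Beta(a, b) prior: the prior acts like (a-1) extra heads and (b-1) extra tails
    a, b = 2.0, 2.0
    theta_map = (n_heads + a - 1) / (n_heads + n_tails + a + b - 2)

    print(theta_mle, theta_map)  # 0.7 vs. 0.666..., the prior pulls the estimate toward 0.5
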
9/20 Recitation - Review of linear algebra; matrix-vector notation; quadratic forms; multivariate Gaussians; isocontours; notions related to covariance matrices; making predictions using estimated probabilities and MLE/MAP/Bayes. Sep 21: HW2 due, QNA3 out

9/24 Prediction and Classification using Estimated Probabilities - Classification using estimated probabilities and MLE/MAP/Bayes; quadratic and linear decision boundaries using Gaussians; complexity challenges and feature dependencies; introduction to Naive Bayes models. pdf
9/25 Classification with Naive Bayes models - Naive Bayes models; discrete and continuous features, MLE and MAP approaches; simplification rules for MAP estimation using smoothing parameters; case study for discrete features: text classification, bags of words; case study for continuous features: image classification. pdf
9/26 From Generative to Discriminative Classifiers, Logistic Regression (LR) - From Generative to Discriminative Classifiers; logistic regression as linear probabilistic classifier; decision boundaries; M(C)LE and M(C)AP models for probabilistic parameter estimation for LR; concave optimization problem; gradient ascent for LR-MLE; gradient ascent for LR-MAP; MCAP case for gradient ascent with Gaussian priors; logistic regression with more than two classes, softmax function; decision boundaries for different classifiers; linear vs. non-linear boundaries; number of parameters in LR vs. Naive Bayes; asymptotic results for LR vs. NB; overall comparison between LR and NB. pdf
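
As an illustration of the gradient-based MLE training of logistic regression mentioned above, here is a minimal NumPy sketch using stochastic gradient ascent on the log-likelihood (toy data and hyperparameters are invented for the example):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def logreg_sgd(X, y, step=0.1, n_epochs=200, seed=0):
        """MLE for logistic regression via stochastic gradient ascent on the log-likelihood."""
        rng = np.random.default_rng(seed)
        w = np.zeros(X.shape[1])
        for _ in range(n_epochs):
            for i in rng.permutation(len(y)):
                # Per-example gradient of the log-likelihood: (y_i - p_i) * x_i
                p = sigmoid(X[i] @ w)
                w += step * (y[i] - p) * X[i]
        return w

    # Linearly separable toy data with a bias feature
    X = np.array([[1, 0.0], [1, 0.5], [1, 2.0], [1, 2.5]])
    y = np.array([0, 0, 1, 1])
    w = logreg_sgd(X, y)
    print(sigmoid(X @ w))  # low probabilities for the first two points, high for the last two
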
9/27 Recitation - Logistic regression, Naive Bayes QNA3 due, HW3 out

10/1 Ensemble methods, Bagging, Random Forests, Boosting - Ensemble models, Bagging, Boosting, Random forests: General ideas behind combining models; voting/averaging vs. stacking models; bagging and boosting as forms of combining different experts; bagging: construction of the datasets by bootstrapping, properties of the base model, variance reduction goals, aggregation by averaging; random forests as bagging with randomization of the features of each model; boosting: sequential generation of the weighted datasets, base model as a weak learner, goals of combining multiple weak learners, how to compute voting weights in AdaBoost; decision stumps as weak classifiers; analysis and properties of AdaBoost; robustness to overfitting. pdf
10/2 Neural networks 1: Perceptrons, MLPs - Linear units and perceptron; perceptron algorithm and properties; from perceptrons to artificial neural networks, biological analogy; structure of a unit; multi-layered feed-forward architectures (MLP); recurrent network models; sigmoid units; other activation functions; hidden layers and hierarchical feature learning and propagation; matrices and network parameters; basic overview of properties, design choices. pdf
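
The mistake-driven perceptron update from this lecture fits in a few lines; the sketch below is illustrative only (the toy data, epoch count, and bias-column convention are assumptions):

    import numpy as np

    def perceptron(X, y, n_epochs=20):
        """Classic perceptron algorithm; assumes labels y in {-1, +1} and a bias column in X."""
        w = np.zeros(X.shape[1])
        for _ in range(n_epochs):
            for xi, yi in zip(X, y):
                # Update only on mistakes: nudge w toward correctly classifying (xi, yi)
                if yi * (xi @ w) <= 0:
                    w = w + yi * xi
        return w

    # Linearly separable toy data (bias feature in the first column)
    X = np.array([[1, 0.0, 0.0], [1, 0.0, 1.0], [1, 2.0, 2.0], [1, 3.0, 2.0]])
    y = np.array([-1, -1, 1, 1])
    w = perceptron(X, y)
    print(np.sign(X @ w))  # [-1. -1.  1.  1.]
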
10/3 Recitation - Ensemble methods
10/4 Recitation - Review for midterm QNA4 out

10/8 Fall break, no classes
10/9 Fall break, no classes
10/10 Fall break, no classes
10/11 Fall break, no classes

10/15 Neural networks 2: MLPs - NN as composite functions; functional form of a NN; concepts about overfitting and complexity; visualization of the output surface; loss minimization problem in the weight space; non-convex optimization landscape for the error surface; stochastic and batch gradient descent; backpropagation and the chain rule; backpropagation for a logistic unit; backpropagation for a network of logistic units; forward and backward passes in the general case; properties and issues of backpropagation; design choices (momentum, learning rate, epochs); weight initialization. pdf
10/16 Midterm Exam
10/17 Neural networks 3: CNNs - Overfitting and generalization issues; approaches to regularization; overview of model selection and validation approaches using NN; design choices; cross-entropy loss; softmax activation layer; number of trainable parameters; SGD and epochs; issues with the choice of the activation function, sigmoid/tanh units and vanishing gradients; limitations of fully connected MLPs; general ideas about exploiting structure and locality in input data in convolutional neural networks; receptive fields; convolutional filters and feature maps; weight sharing; invariant properties of features; pooling layers for subsampling; incremental and hierarchical feature extraction by convolutional and pooling layers; output layer and softmax function; examples of CNN architectures. pdf
10/18 Recitation - Neural networks HW3 due, Oct 20: QNA4 due, HW4 out

10/22 Neural networks 4: CNNs, Applications - Convolutional filters and feature maps; weight sharing; invariant properties of features; pooling layers for subsampling; incremental and hierarchical feature extraction by convolutional and pooling layers; output layer and softmax function; examples of CNN architectures; application to image and text data; recommender systems. pdf
10/23 Neural networks 5: Autoencoders, Transformers, GANs - Concepts and implementation of autoencoders for dimensionality reduction, compression, denoising; general ideas about transfer learning and generative networks; examples of transformers, ChatGPT; Generative Adversarial Networks.
10/24 Recitation - Neural networks
10/25 Recitation - Neural networks

10/29 Unsupervised learning 1: Dimensionality Reduction (Principal Component Analysis, Autoencoders) - Overview of unsupervised learning tasks; large feature spaces and curse of dimensionality; dimensionality reduction by feature selection and latent features; simple general model for dimensionality reduction / compression; subspace spanned by a vector basis; key ideas behind principal component analysis (PCA); representation of data; variance captured by projections; definition of minimum variance directions; PCA algorithm; examples; limitations of PCA, Kernel PCA; dimensionality reduction using Autoencoders (recap from previous lecture). pdf
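
A compact illustrative sketch of PCA via eigendecomposition of the sample covariance matrix, in the spirit of this lecture (the toy data and component count are made up; this is not the course's reference implementation):

    import numpy as np

    def pca(X, n_components=2):
        """PCA via eigendecomposition of the sample covariance matrix.
        Returns the top principal directions and the projected (compressed) data."""
        X_centered = X - X.mean(axis=0)
        cov = np.cov(X_centered, rowvar=False)
        eigvals, eigvecs = np.linalg.eigh(cov)             # eigh: the covariance matrix is symmetric
        order = np.argsort(eigvals)[::-1][:n_components]   # largest-variance directions first
        components = eigvecs[:, order]
        return components, X_centered @ components

    # Toy data that varies mostly along one direction in 3D
    rng = np.random.default_rng(0)
    t = rng.normal(size=(100, 1))
    X = np.hstack([t, 0.5 * t, 0.1 * rng.normal(size=(100, 1))])
    components, Z = pca(X, n_components=1)
    print(components.ravel())  # roughly proportional to [1, 0.5, 0] (up to sign)
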
10/30 Unsupervised learning 2: Data Clustering - Characterization of clustering tasks; types of clustering; (flat) K-means clustering problem; role of centroids, cluster assignments, Voronoi diagrams; (naive) K-means algorithm, examples; computational complexity; convergence and local minima; K-means loss function; alternating optimization (expectation-maximization); assumptions and limitations of K-means; illustration of failing cases; kernel K-means; soft clustering; relationship to vector quantization and use of clustering for lossy compression; hierarchical clustering, linkage methods, assumptions, computational complexity. pdf
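
The (naive) K-means alternation between assignment and centroid-update steps mentioned in this lecture can be sketched as follows; this is an illustrative outline only (initialization, stopping rule, and toy data are assumptions):

    import numpy as np

    def kmeans(X, k, n_iters=100, seed=0):
        """Naive K-means: alternate between assigning points to the closest centroid
        and recomputing each centroid as the mean of its assigned points."""
        rng = np.random.default_rng(seed)
        centroids = X[rng.choice(len(X), size=k, replace=False)]
        for _ in range(n_iters):
            # Assignment step: index of the closest centroid for every point
            dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
            labels = dists.argmin(axis=1)
            # Update step: move each centroid to the mean of its cluster
            new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
            if np.allclose(new_centroids, centroids):
                break
            centroids = new_centroids
        return centroids, labels

    # Two well-separated blobs in 2D
    rng = np.random.default_rng(1)
    X = np.vstack([rng.normal(0, 0.3, size=(50, 2)), rng.normal(3, 0.3, size=(50, 2))])
    centroids, labels = kmeans(X, k=2)
    print(centroids)  # roughly [0, 0] and [3, 3] (in some order)
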
10/31 Unsupervised learning 3: Mixture models / GMMS, Latent data models - Probabilistic clustering and limitations of hard partitioning methods; mixture models and density estimation; modeling with latent variables; Gaussian Mixture Models (GMMs); MLE for parameter estimation in GMMs; GMMs solutions with complete data, form of decision boundaries; relationships between K-Means solutions and GMMs solutions; from complete data to latent data; MLE for latent data and parameter estimation in GMMs, problem formulation. pdf
  • Bishop Chapters 9.2-9.4
  • Deisenroth Chapters 11.1, 11.2
11/1 Unsupervised learning 4: Expectation-Maximization (EM), EM for GMMs - MLE for latent data and parameter estimation in GMMs; concepts and properties of Expectation-Maximization (EM) as iterative alternating optimization; EM for GMMs and probabilistic clustering; general form of EM for likelihood function optimization in latent variable models; Q function as lower bound of likelihood; formalism and concepts behind the EM approach; properties and limitations. pdf
  • Bishop Chapters 9.2-9.4
  • Deisenroth Chapters 11.3, 11.4

11/5 Recitation - PCA, K-means, GMMs HW4 due, HW5 out
11/6 Unsupervised learning 5: Nonparametric Density Estimation - Density estimation problem; parametric vs. non-parametric approaches; histogram density estimation; role of bin width; bias-variance tradeoff; general form of the local approximator; fixing the width: kernel methods; Parzen windows; smooth kernels; finite vs. infinite support; fixing the number of points: k-NN methods; comparison between the approaches; role of the bandwidth and bias-variance tradeoff. pdf
  • Bishop Chapter 2.5
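
A minimal sketch of Gaussian kernel density estimation (Parzen windows) as discussed in the 11/6 lecture; the bandwidth and data are arbitrary illustrative choices, not course material:

    import numpy as np

    def kde_gaussian(x_train, x_query, bandwidth=0.3):
        """Kernel density estimate: average of Gaussian bumps centered at the training points."""
        diffs = (x_query[:, None] - x_train[None, :]) / bandwidth
        kernel = np.exp(-0.5 * diffs ** 2) / np.sqrt(2 * np.pi)
        return kernel.mean(axis=1) / bandwidth

    # Samples from a standard normal; the estimate at 0 should be near 1/sqrt(2*pi) ~ 0.40
    rng = np.random.default_rng(0)
    samples = rng.normal(size=1000)
    print(kde_gaussian(samples, np.array([0.0, 2.0])))  # roughly [0.40, 0.05]
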
11/7 Recitation - EM, Non-parametric density estimation
11/8 Break, no classes

11/12 No class
11/13 Support Vector Machines (SVM) 1 - Linear classifiers (deterministic); review of score and functional margin, geometric margin; max-margin classifiers; linearly separable case and hard-margin SVM optimization problem; general concepts about constrained optimization. pdf
11/14 Support Vector Machines (SVM) 2 - Support vectors and relationship with the weight vector; non-linearly separable case and use of slack variables for elastic problem formulation; margin and non margin support vectors; penalty / tradeoff parameter; solution of the SVM optimization problem; relaxations; Lagrangian function and dual problem; Lagrange multipliers and their interpretation; weak and strong duality concepts. pdf
11/15 Support Vector Machines (SVM) 3 - SVM optimization problem, primal and dual; solution of the dual for the hard-margin case; functional relations between multipliers and SVM parameters; solving the non-linearly separable case (soft-margin); hinge loss and soft-margin SVM; regularized hinge loss; properties of linear classifiers. pdf
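
As an illustration of the "regularized hinge loss" view of the soft-margin SVM from this lecture, here is a hedged sketch using stochastic (sub)gradient descent; the hyperparameters and toy data are made up, and a real solver would typically work with the dual formulation instead:

    import numpy as np

    def svm_hinge_sgd(X, y, lam=0.01, step=0.01, n_epochs=200, seed=0):
        """Soft-margin linear SVM trained by (sub)gradient descent on the regularized hinge loss:
           (1/n) * sum_i max(0, 1 - y_i * w.x_i) + (lam/2) * ||w||^2,  with y in {-1, +1}."""
        rng = np.random.default_rng(seed)
        w = np.zeros(X.shape[1])
        for _ in range(n_epochs):
            for i in rng.permutation(len(y)):
                margin = y[i] * (X[i] @ w)
                # Subgradient of the hinge term is -y_i * x_i when the margin is violated, else 0
                grad = lam * w - (y[i] * X[i] if margin < 1 else 0)
                w -= step * grad
        return w

    # Linearly separable toy data with a bias feature; labels in {-1, +1}
    X = np.array([[1, 0.0], [1, 0.5], [1, 2.0], [1, 2.5]])
    y = np.array([-1, -1, 1, 1])
    w = svm_hinge_sgd(X, y)
    print(np.sign(X @ w))  # expected [-1. -1.  1.  1.]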

11/19 Feature transformations, Kernel Methods, Kernelization pdf HW5 due, QNA5 out
11/20 Recitation - SVMs, Kernels and kernelization of linear algorithms
11/21 Recitation - SVMs, Kernels and kernelization of linear algorithms
11/22 Learning Theory 1 - The need for bounds on generalization errors; PAC model bounds; sample complexity; consistent but bad hypotheses; derivation of the PAC Haussler bound; use of a PAC bound; limitations of Haussler's bound; Hoeffding's bound for a hypothesis which is not consistent; PAC bound and bias-variance tradeoff; computing the sample complexity; sample complexity for the case of decision trees; DTs of fixed width vs. number of leaves; sample complexity and number of points that allow consistent classification. pdf
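
A quick worked use of the Haussler-style PAC bound for a finite hypothesis class and a consistent learner, m >= (1/eps) * (ln|H| + ln(1/delta)), as an example of the sample-complexity computations mentioned above (the numbers plugged in are arbitrary):

    import numpy as np

    def haussler_sample_complexity(H_size, eps, delta):
        """Number of examples sufficient so that, with probability at least 1-delta,
        every hypothesis consistent with the data has true error at most eps."""
        return int(np.ceil((np.log(H_size) + np.log(1.0 / delta)) / eps))

    # Hypothetical finite class with 2^16 hypotheses, eps = 0.1, delta = 0.05
    print(haussler_sample_complexity(H_size=2**16, eps=0.1, delta=0.05))  # 141 examples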

11/26 Learning Theory 2 - PAC bounds on continuous hypothesis spaces; set shattering; VC dimension; VC dimension for linear models, decision stumps, axis-aligned rectangles, circles, ellipses; generalization error bound and VC dimension; tightness of the bound; bias-variance and VC dimension; limitations of the VC dimension. pdf Nov 23: QNA6 out
11/27 Recitation - Learning theory, SVMs, Kernelization
11/28 Recitation - Course review
11/29 Future and Ethics of AI, Q&A

Assignments

Topic Files Dates
QNA 1: General concepts, ML models and workflow, k-NN - Out: Aug 24 - Due: Aug 31
Homework 1: Decision trees, k-NN, Cross-validation - Out: Aug 31 - Due: Sep 7
QNA 2: Linear regression, Regularization, Gradient methods - Out: Sep 7 - Due: Sep 13
Homework 2: Linear and non-linear regression models, Non-parametric regression, Gradient methods for optimization - Out: Sep 13 - Due: Sep 21
QNA 3: Probabilistic models, Decision boundaries, MLE and MAP approaches - Out: Sep 21 - Due: Sep 27
Homework 3: Probabilistic models, MLE/MAP/Bayes, Naive Bayes, Logistic regression - Out: Sep 27 - Due: Oct 7
QNA 4: Ensemble methods (Bagging, Boosting, Random Forest) - Out: Oct 3 - Due: Oct 18
Homework 4: Neural networks models and applications, Deep learning - Out: Oct 18 - Due: Nov 1
Homework 5: Unsupervised Learning (Dimensionality reduction, Clustering, Mixture models, Non-parametric density estimation) - Out: Nov 1 - Due: Nov 15
QNA 5: SVMs, Kernels and kernelization - Out: Nov 15 - Due: Nov 23
QNA 6: Learning theory - Out: Nov 23 - Due: Nov 30


Assignments Policies

  • Each assignment, either a homework or a QNA, is due on Gradescope by the posted deadline. Assignments submitted past the deadline will incur the use of late days.

  • You have 4 late days in total, but cannot use more than 1 late day per homework or QNA. No credit will be given for an assignment submitted more than 1 day after the due date. After your 4 late days have been used you will receive 20% off for each additional day late.

  • You can discuss the exercises with your classmates, but you should write up your own solutions, both for the theory and programming questions.

  • Using any external sources of code, algorithms, or complete solutions in any way requires approval from the instructor before submitting the work. For example, you must get instructor approval before using an algorithm you found online for implementing a function in a programming assignment.

  • Violations of the above policies will be reported as an academic integrity violation. In general, for both assignments and exams, CMU's directives for academic integrity apply and must be duly followed. Information about academic integrity at CMU may be found at https://www.cmu.edu/academic-integrity. Please contact the instructor if you ever have any questions regarding academic integrity or these collaboration policies.

Exam dates and policies

The class includes both a midterm and a final exam. Both exams will include theory and, possibly, pseudo-programming questions. During exams students are only allowed to consult a 2-page cheatsheet (written in any desired format) as well as the lecture slides (previously downloaded offline). No other material is allowed, including textbooks, computers/smartphones, or general consultation of Internet repositories. Any violation of these policies will result in a grade of zero on the exam.

The midterm exam is set for October 16.

The final exam is set for TBD.

Office Hours

Name Email Hours Location
Gianni Di Caro gdicaro@cmu.edu TBD + pass by my office at any time ... M 1007
Zhijie Xu zhijie@andrew.cmu.edu TBD ARC