Supervised Learning Models and Algorithms
An overview of supervised learning models, algorithms, and key concepts such as regression, classification, loss functions, and optimization techniques.
Supervised Learning Fundamentals
Supervised learning is a machine learning paradigm where models learn from labeled data to make predictions. The goal is to build a model that learns to predict an outcome variable $y$ from input features $x$.
Types of Predictive Models
| | Regression | Classification |
| --- | --- | --- |
| Outcome | Continuous | Categorical (class) |
| Examples | Linear regression | Logistic regression, SVM, Naive Bayes |
Types of Machine Learning Models
| | Discriminative Model | Generative Model |
| --- | --- | --- |
| Goal | Directly estimate $P(y \mid x)$ | Estimate $P(x \mid y)$ to deduce $P(y \mid x)$ |
| What's Learned | Decision boundary | Probability distributions of the data |
| Examples | Regressions, SVMs | GDA, Naive Bayes |

Generative models first learn how the data is generated by estimating $P(x \mid y)$, then use Bayes' rule to estimate $P(y \mid x)$.
Core Concepts in Supervised Learning
- Hypothesis ($h_\theta$): This is the chosen model. For a given input $x$, the model's prediction is $h_\theta(x)$.
- Loss Function ($L(z, y)$): A function that quantifies the difference between a predicted value $z$ and the real data value $y$.

| Squared Loss | Logistic Loss | Hinge Loss | Cross-entropy Loss |
| --- | --- | --- | --- |
| $\frac{1}{2}(y - z)^2$ | $\log(1 + \exp(-yz))$ | $\max(0, 1 - yz)$ | $-[y \log(z) + (1 - y) \log(1 - z)]$ |
| Used in Linear Regression | Used in Logistic Regression | Used in SVM | Used in Neural Networks |

- Cost Function ($J(\theta)$): Commonly used to assess model performance, defined as the sum of loss over all training examples: $J(\theta) = \sum_{i=1}^{m} L(h_\theta(x^{(i)}), y^{(i)})$
- Gradient Descent: An optimization algorithm to minimize the cost function. With learning rate $\alpha \in \mathbb{R}$, the update rule is: $\theta \leftarrow \theta - \alpha \nabla J(\theta)$
Stochastic Gradient Descent (SGD) updates parameters based on each training example. Batch Gradient Descent updates based on a batch of training examples.
- Likelihood ($L(\theta)$): Used to find optimal parameters by maximizing the likelihood of the model given the parameters. The optimal $\theta$ is found by: $\theta^{\text{opt}} = \arg\max_\theta L(\theta)$. In practice, the log-likelihood $\ell(\theta) = \log L(\theta)$ is often optimized for simplicity.
- Newton's Algorithm: A numerical method to find $\theta$ such that $\ell'(\theta) = 0$. The update rule is: $\theta \leftarrow \theta - \frac{\ell'(\theta)}{\ell''(\theta)}$. The multidimensional generalization (Newton-Raphson method) uses: $\theta \leftarrow \theta - \left(\nabla_\theta^2 \ell(\theta)\right)^{-1} \nabla_\theta \ell(\theta)$
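The batch gradient descent update above can be sketched for the squared-error cost of linear regression; the toy data, learning rate, and iteration count are illustrative assumptions, not from the source.

```python
import numpy as np

def batch_gradient_descent(X, y, alpha=0.1, n_iters=1000):
    """Minimize the squared-error cost J(theta) for linear regression."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(n_iters):
        gradient = X.T @ (X @ theta - y) / m  # gradient of J(theta)
        theta -= alpha * gradient             # theta <- theta - alpha * grad
    return theta

# Toy dataset generated by y = 2x; the fit should recover theta ~ 2
X = np.array([[1.0], [2.0], [3.0]])
y = np.array([2.0, 4.0, 6.0])
theta = batch_gradient_descent(X, y)
```

With a small enough learning rate the update contracts toward the minimizer; here it recovers the slope 2 to high precision.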
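Newton's update rule can likewise be sketched in one dimension; the concave example objective $\ell(\theta) = -(\theta - 3)^2$ is an assumption chosen so the optimum is known.

```python
def newton_1d(dl, ddl, theta0, tol=1e-12, max_iter=100):
    """Find theta with dl(theta) = 0 via theta <- theta - dl(theta)/ddl(theta)."""
    theta = theta0
    for _ in range(max_iter):
        step = dl(theta) / ddl(theta)
        theta -= step
        if abs(step) < tol:
            break
    return theta

# Maximize l(theta) = -(theta - 3)^2: l'(theta) = -2(theta - 3), l''(theta) = -2
root = newton_1d(lambda t: -2 * (t - 3), lambda t: -2.0, theta0=0.0)
```

Because the objective is quadratic, Newton's method lands on the stationary point $\theta = 3$ in a single step.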
Regression Models
Regression models predict continuous outcome variables.
Linear Regression
Assumes the outcome $y$ given $x$ and $\theta$ follows a Normal distribution: $y \mid x; \theta \sim \mathcal{N}(\mu, \sigma^2)$.
- Normal Equations: For a design matrix $X$, the $\theta$ that minimizes the cost function has a closed-form solution: $\theta = (X^\top X)^{-1} X^\top y$
- LMS Algorithm (Widrow-Hoff): An iterative update rule for a training set of $m$ data points with learning rate $\alpha$: $\forall j, \quad \theta_j \leftarrow \theta_j + \alpha \sum_{i=1}^{m} \left(y^{(i)} - h_\theta(x^{(i)})\right) x_j^{(i)}$. This is a special case of gradient ascent.
- Locally Weighted Regression (LWR): A variant where each training example is weighted in the cost function by $w^{(i)}(x)$, defined as: $w^{(i)}(x) = \exp\left(-\frac{(x^{(i)} - x)^2}{2\tau^2}\right)$, where $\tau \in \mathbb{R}$ is a bandwidth parameter.
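The normal-equation solution $\theta = (X^\top X)^{-1} X^\top y$ can be checked numerically. This sketch assumes a small toy dataset generated by $y = 1 + 2x$, and solves the linear system rather than forming an explicit inverse, which is numerically safer.

```python
import numpy as np

# Design matrix with an intercept column; targets from y = 1 + 2x (no noise)
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])

# Closed-form solution: solve (X^T X) theta = X^T y
theta = np.linalg.solve(X.T @ X, X.T @ y)
```

On this noiseless data the solution recovers the intercept 1 and slope 2 exactly (up to floating-point error).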
Classification Models
Classification models predict categorical outcome variables.
Logistic Regression
Used for binary classification; assumes $y \mid x; \theta \sim \text{Bernoulli}(\phi)$.
- Sigmoid Function ($g$): Also known as the logistic function, it maps any real-valued number to a value between 0 and 1: $g(z) = \frac{1}{1 + e^{-z}} \in (0, 1)$
- Hypothesis: The probability of $y = 1$ given $x$ and $\theta$ is: $\phi = p(y = 1 \mid x; \theta) = \frac{1}{1 + e^{-\theta^\top x}} = g(\theta^\top x)$
There is no closed-form solution for logistic regression, requiring iterative optimization methods like gradient descent.
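A minimal sketch of fitting logistic regression by gradient ascent on the log-likelihood; the toy data, learning rate, and iteration count are assumptions for illustration.

```python
import numpy as np

def sigmoid(z):
    """g(z) = 1 / (1 + e^{-z})."""
    return 1.0 / (1.0 + np.exp(-z))

def logistic_regression(X, y, alpha=0.5, n_iters=2000):
    """Fit theta by gradient ascent on the log-likelihood."""
    theta = np.zeros(X.shape[1])
    for _ in range(n_iters):
        h = sigmoid(X @ theta)                    # h_theta(x) = g(theta^T x)
        theta += alpha * X.T @ (y - h) / len(y)   # ascent step
    return theta

# Toy 1-D data (first column is the intercept): class 1 for positive x
X = np.array([[1, -2.0], [1, -1.0], [1, 1.0], [1, 2.0]])
y = np.array([0, 0, 1, 1])
theta = logistic_regression(X, y)
preds = (sigmoid(X @ theta) >= 0.5).astype(int)
```

On this separable toy set the fitted model classifies all four training points correctly.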
Softmax Regression
A generalization of logistic regression for more than two outcome classes, also called multiclass logistic regression. By convention, $\theta_K = 0$. The Bernoulli parameter $\phi_i$ for each class $i$ is: $\phi_i = \frac{\exp(\theta_i^\top x)}{\sum_{j=1}^{K} \exp(\theta_j^\top x)}$
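The class probabilities above can be computed with a short softmax function; the scores are hypothetical values of $\theta_i^\top x$ for three classes, and the max-subtraction is a standard numerical-stability trick.

```python
import numpy as np

def softmax(scores):
    """phi_i = exp(score_i) / sum_j exp(score_j), computed stably."""
    shifted = scores - np.max(scores)  # subtract max to avoid overflow
    exp = np.exp(shifted)
    return exp / exp.sum()

scores = np.array([2.0, 1.0, 0.0])  # hypothetical theta_i^T x for 3 classes
phi = softmax(scores)
```

The outputs sum to one and preserve the ordering of the scores, as required of class probabilities.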
Generalized Linear Models (GLMs)
GLMs provide a flexible way to model response variables that follow various distributions, not just the normal distribution, by relating the linear predictor to the response variable via a link function.
Exponential Family Distributions
A distribution belongs to the exponential family if it can be written as: $p(y; \eta) = b(y) \exp\left(\eta^\top T(y) - a(\eta)\right)$ Where:
- $\eta$: Natural parameter (also called canonical parameter or link function).
- $T(y)$: Sufficient statistic (often $T(y) = y$).
- $a(\eta)$: Log-partition function.
- $b(y)$: Base measure.
The term $e^{-a(\eta)}$ acts as a normalization constant ensuring probabilities sum to one.
Common Exponential Distributions
| Distribution | $\eta$ | $T(y)$ | $a(\eta)$ | $b(y)$ |
| --- | --- | --- | --- | --- |
| Bernoulli | $\log\frac{\phi}{1 - \phi}$ | $y$ | $\log(1 + e^{\eta})$ | $1$ |
| Gaussian | $\mu$ | $y$ | $\frac{\eta^2}{2}$ | $\frac{1}{\sqrt{2\pi}} \exp\left(-\frac{y^2}{2}\right)$ |
| Poisson | $\log \lambda$ | $y$ | $e^{\eta}$ | $\frac{1}{y!}$ |
| Geometric | $\log(1 - \phi)$ | $y$ | $\log\frac{e^{\eta}}{1 - e^{\eta}}$ | $1$ |
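As a quick sanity check of the Bernoulli row, the exponential-family form $b(y)\exp(\eta T(y) - a(\eta))$ should reproduce the usual pmf $\phi^y (1-\phi)^{1-y}$; the value of $\phi$ here is arbitrary.

```python
import numpy as np

phi = 0.3                         # Bernoulli parameter (arbitrary choice)
eta = np.log(phi / (1 - phi))     # natural parameter
a = np.log(1 + np.exp(eta))       # log-partition function

for y in (0, 1):
    p_exp_family = 1.0 * np.exp(eta * y - a)      # b(y) = 1, T(y) = y
    p_direct = phi**y * (1 - phi)**(1 - y)        # standard Bernoulli pmf
    assert abs(p_exp_family - p_direct) < 1e-12
```

Both forms give $P(y=1) = 0.3$ and $P(y=0) = 0.7$, confirming the table entries for the Bernoulli distribution.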
Assumptions of GLMs
GLMs aim to predict a random variable $y$ as a function of $x \in \mathbb{R}^n$ based on three key assumptions:
- $y \mid x; \theta \sim \text{ExpFamily}(\eta)$ (the conditional distribution of $y$ given $x$ belongs to the exponential family).
- $h_\theta(x) = E[y \mid x; \theta]$ (the model predicts the expected value of $y$).
- $\eta = \theta^\top x$ (the natural parameter is a linear function of the input features).
Ordinary Least Squares (Linear Regression) and Logistic Regression are special cases of Generalized Linear Models.
Support Vector Machines (SVMs)
The goal of SVMs is to find the hyperplane that maximizes the margin: the minimum distance from the decision boundary to the closest training examples of any class.
Optimal Margin Classifier
The optimal margin classifier is given by: $h(x) = \text{sign}(w^\top x - b)$
Where $w, b$ are the parameters obtained by solving the following optimization problem: $\min \frac{1}{2}\|w\|^2 \quad \text{such that} \quad y^{(i)}(w^\top x^{(i)} - b) \geq 1$
The decision boundary is defined by $w^\top x - b = 0$.
Hinge Loss
The hinge loss function, used in SVMs, is defined as: $L(z, y) = [1 - yz]_+ = \max(0, 1 - yz)$
Kernel Function
Given a feature mapping $\phi$, the kernel $K$ is defined as the inner product of the mapped features: $K(x, z) = \phi(x)^\top \phi(z)$
A commonly used kernel is the Gaussian kernel: $K(x, z) = \exp\left(-\frac{\|x - z\|^2}{2\sigma^2}\right)$.
The "kernel trick" allows us to compute the cost function using the kernel without explicitly knowing the often complex feature mapping . This is crucial for handling non-linear separability by mapping data into higher-dimensional spaces.
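A minimal sketch of the Gaussian kernel; the test points and bandwidth $\sigma = 1$ are illustrative assumptions.

```python
import numpy as np

def gaussian_kernel(x, z, sigma=1.0):
    """K(x, z) = exp(-||x - z||^2 / (2 * sigma^2))."""
    diff = np.asarray(x) - np.asarray(z)
    return np.exp(-np.dot(diff, diff) / (2 * sigma**2))

k_same = gaussian_kernel([1.0, 2.0], [1.0, 2.0])  # identical points -> 1.0
k_far = gaussian_kernel([0.0, 0.0], [3.0, 4.0])   # distance 5 -> near 0
```

The kernel value acts as a similarity measure: it is 1 for identical points and decays toward 0 as points move apart.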
Lagrangian
The Lagrangian $\mathcal{L}(w, b)$ is used in optimization with constraints $h_i(w) = 0$: $\mathcal{L}(w, b) = f(w) + \sum_{i=1}^{l} \beta_i h_i(w)$
The coefficients $\beta_i$ are known as Lagrange multipliers.
Generative Learning Algorithms
Generative models learn the joint probability distribution $P(x, y)$, or equivalently $P(x \mid y)$ and $P(y)$, which then allows them to infer $P(y \mid x)$ using Bayes' rule: $P(y \mid x) = \frac{P(x \mid y) P(y)}{P(x)}$
Gaussian Discriminant Analysis (GDA)
GDA assumes specific distributions for $y$ and $x \mid y$.
- Setting: $y \sim \text{Bernoulli}(\phi)$, with $x \mid y = 0 \sim \mathcal{N}(\mu_0, \Sigma)$ and $x \mid y = 1 \sim \mathcal{N}(\mu_1, \Sigma)$.
- Estimation: Parameters $\phi, \mu_0, \mu_1, \Sigma$ are found by maximizing the likelihood.
Naive Bayes
A generative model that simplifies the estimation by assuming feature independence.
- Assumption: Features are conditionally independent given $y$: $P(x \mid y) = \prod_{i=1}^{n} P(x_i \mid y)$
- Solutions: Maximizing the log-likelihood yields the following estimations for $k \in \{0, 1\}$ and feature value $l$: $P(y = k) = \frac{1}{m} \times \#\{j : y^{(j)} = k\}$ and $P(x_i = l \mid y = k) = \frac{\#\{j : y^{(j)} = k \text{ and } x_i^{(j)} = l\}}{\#\{j : y^{(j)} = k\}}$
Naive Bayes is widely used in text classification and spam detection due to its simplicity and effectiveness.
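The count-based estimates above can be sketched for binary labels and discrete features. The toy data is an assumption, and Laplace (add-one) smoothing is added here, beyond the plain MLE counts, to avoid zero probabilities.

```python
import numpy as np

def train_naive_bayes(X, y, n_values):
    """Count-based estimates with Laplace smoothing for discrete features."""
    m, n = X.shape
    priors = {k: float(np.mean(y == k)) for k in (0, 1)}  # P(y = k)
    cond = {}                                             # P(x_i = l | y = k)
    for k in (0, 1):
        Xk = X[y == k]
        for i in range(n):
            for l in range(n_values):
                cond[(i, l, k)] = (np.sum(Xk[:, i] == l) + 1) / (len(Xk) + n_values)
    return priors, cond

def predict(x, priors, cond):
    """Pick the class maximizing log P(y=k) + sum_i log P(x_i | y=k)."""
    scores = {}
    for k in (0, 1):
        log_p = np.log(priors[k])
        for i, l in enumerate(x):
            log_p += np.log(cond[(i, int(l), k)])  # independence assumption
        scores[k] = log_p
    return max(scores, key=scores.get)

# Toy data: feature 0 tracks the label, feature 1 is uninformative
X = np.array([[0, 1], [0, 0], [1, 1], [1, 0]])
y = np.array([0, 0, 1, 1])
priors, cond = train_naive_bayes(X, y, n_values=2)
```

On this toy set the classifier correctly follows feature 0: `predict([1, 0], priors, cond)` returns 1 and `predict([0, 1], priors, cond)` returns 0.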
Tree-Based Methods
These methods can be used for both regression and classification problems.
- CART (Classification and Regression Trees): Commonly known as decision trees, these are binary trees. They are highly interpretable.
- Random Forest: A tree-based ensemble method that builds a large number of decision trees from randomly selected subsets of features and data. It typically offers good performance but is less interpretable than single decision trees.
Boosting Methods
Boosting combines several "weak learners" to form a stronger learner.
| Adaptive Boosting (AdaBoost) | Gradient Boosting |
| --- | --- |
| Assigns higher weights to misclassified examples in subsequent boosting steps. | Weak learners are trained on the residuals (remaining errors) of previous learners. |
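The residual-fitting idea behind gradient boosting can be sketched with depth-1 regression stumps as weak learners; the toy data, learning rate, and number of rounds are assumptions, and this is a bare-bones illustration rather than a production implementation.

```python
import numpy as np

def fit_stump(X, r):
    """Depth-1 regression tree: best single-feature threshold split on r."""
    best = None
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            left, right = r[X[:, j] <= t], r[X[:, j] > t]
            if len(left) == 0 or len(right) == 0:
                continue
            sse = ((left - left.mean())**2).sum() + ((right - right.mean())**2).sum()
            if best is None or sse < best[0]:
                best = (sse, j, t, left.mean(), right.mean())
    _, j, t, lv, rv = best
    return lambda X: np.where(X[:, j] <= t, lv, rv)

def gradient_boost(X, y, n_rounds=20, lr=0.5):
    """Each weak learner is fit to the residuals of the current ensemble."""
    pred = np.full(len(y), y.mean())
    for _ in range(n_rounds):
        residuals = y - pred          # errors left by previous learners
        stump = fit_stump(X, residuals)
        pred += lr * stump(X)         # shrink each stump's contribution
    return pred

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])
pred = gradient_boost(X, y)
```

Because each round removes a fixed fraction of the remaining residual, the ensemble's training predictions converge toward the targets.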
K-Nearest Neighbors (K-NN)
The $k$-NN algorithm is a non-parametric approach where the response of a new data point is determined by the majority class (for classification) or the average (for regression) of its $k$ closest neighbors in the training set.
The choice of $k$ influences the model's bias and variance:
- A higher $k$ leads to higher bias (smoother decision boundaries, less sensitive to noise).
- A lower $k$ leads to higher variance (more complex decision boundaries, more sensitive to noise).
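A minimal $k$-NN classifier by majority vote; the toy dataset and the choice $k = 3$ are illustrative assumptions.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    """Classify x by majority vote among its k nearest training points."""
    dists = np.linalg.norm(X_train - x, axis=1)  # Euclidean distances
    nearest = np.argsort(dists)[:k]              # indices of k closest points
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Two clusters: class 0 near the origin, class 1 near (1, 1)
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [0.9, 1.0], [1.0, 0.8]])
y_train = np.array([0, 0, 1, 1])
label = knn_predict(X_train, y_train, np.array([0.95, 0.9]), k=3)
```

A query near the (1, 1) cluster is assigned class 1, since two of its three nearest neighbors carry that label.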
Learning Theory
Learning theory provides theoretical foundations for understanding how and why algorithms learn effectively.
- Union Bound: For events $A_1, \dots, A_k$, the probability of at least one event occurring is bounded: $P(A_1 \cup \dots \cup A_k) \leq P(A_1) + \dots + P(A_k)$
- Hoeffding Inequality (Chernoff Bound): For $m$ i.i.d. Bernoulli variables $Z_1, \dots, Z_m$ with mean $\phi$ and sample mean $\hat{\phi}$, for any fixed $\gamma > 0$: $P(|\phi - \hat{\phi}| > \gamma) \leq 2 \exp(-2\gamma^2 m)$
- Training Error ($\hat{\epsilon}(h)$): Also known as empirical risk or empirical error, it measures the proportion of mistakes a classifier $h$ makes on the training set: $\hat{\epsilon}(h) = \frac{1}{m} \sum_{i=1}^{m} 1\{h(x^{(i)}) \neq y^{(i)}\}$
- Probably Approximately Correct (PAC): A framework for analyzing the theoretical learnability of a concept. Its assumptions are:
  - Training and test sets follow the same distribution.
  - Training examples are drawn independently.
- Shattering: A set of classifiers $S$ shatters a set of $d$ data points if, for any labeling $\{y^{(1)}, \dots, y^{(d)}\}$, there exists $h \in S$ such that $h(x^{(i)}) = y^{(i)}$ for all $i$.
- Upper Bound Theorem: For a finite hypothesis class $\mathcal{H}$ of size $k = |\mathcal{H}|$, with probability at least $1 - \delta$, the true error of the learned hypothesis $\hat{h}$ is bounded by: $\epsilon(\hat{h}) \leq \left(\min_{h \in \mathcal{H}} \epsilon(h)\right) + 2\sqrt{\frac{1}{2m} \log\frac{2k}{\delta}}$
- VC Dimension (Vapnik-Chervonenkis Dimension): For an infinite hypothesis class $\mathcal{H}$, $\text{VC}(\mathcal{H})$ is the size of the largest set that can be shattered by $\mathcal{H}$.
For example, the VC dimension of linear classifiers in 2 dimensions is 3.
- Vapnik's Theorem: For a hypothesis class $\mathcal{H}$ with $\text{VC}(\mathcal{H}) = d$ and $m$ training examples, with probability at least $1 - \delta$: $\epsilon(\hat{h}) \leq \left(\min_{h \in \mathcal{H}} \epsilon(h)\right) + O\left(\sqrt{\frac{d}{m} \log\frac{m}{d}} + \frac{1}{m} \log\frac{1}{\delta}\right)$ This theorem provides a generalization bound for infinite hypothesis classes.
Key Takeaways
- Supervised learning involves building models to predict outcomes from labeled data.
- Models are categorized as regression (continuous outcomes) or classification (categorical outcomes), and as discriminative or generative.
- Fundamental concepts include hypothesis, loss functions, cost functions, and optimization algorithms like gradient descent and Newton's method.
- Popular algorithms include Linear Regression, Logistic Regression, SVMs, GDA, Naive Bayes, Decision Trees, Random Forests, and K-NN.
- Generalized Linear Models offer a unifying framework for various exponential family distributions.
- Learning theory provides bounds and insights into model generalization through concepts like PAC learning, shattering, and VC dimension.