Supervised Learning Models and Algorithms

39 cards

An overview of supervised learning models, algorithms, and key concepts such as regression, classification, loss functions, and optimization techniques.

Question
What is a 'hypothesis' in machine learning?
Answer
In machine learning, a hypothesis, denoted h_θ, is the chosen model that predicts an output h_θ(x^(i)) for a given input x^(i). It represents the learned function.
Question
How is a 'loss function' defined?
Answer
A loss function, L : (z, y) ∈ ℝ × Y ⟼ L(z, y) ∈ ℝ, quantifies the difference between a predicted value z and the true data value y. It serves as a measure of error.
Question
What is the purpose of a 'cost function'?
Answer
The cost function J evaluates a model's overall performance by summing the losses over all training examples. It is defined as J(θ) = Σ_{i=1}^m L(h_θ(x^(i)), y^(i)).
Question
Explain 'gradient descent' in machine learning.
Answer
Gradient descent is an optimization algorithm that updates the model parameters θ by iteratively moving in the direction opposite to the gradient of the cost function J. The update rule is θ ← θ − α∇J(θ), where α is the learning rate.
Question
What is the 'likelihood' of a model?
Answer
The likelihood L(θ) of a model indicates how probable the observed data is given the model's parameters θ. Maximizing the likelihood yields the optimal parameters θ_opt.
Question
What is the log-likelihood?
Answer
The log-likelihood, ℓ(θ) = log(L(θ)), is the natural logarithm of the likelihood function. It is often optimized instead of the likelihood itself because it converts products into sums, simplifying calculations.
Question
What is 'Newton's algorithm' used for?
Answer
Newton's algorithm is a numerical method for finding the optimal parameter θ by locating where the derivative of the log-likelihood, ℓ'(θ), is zero. Its update rule is θ ← θ − ℓ'(θ)/ℓ''(θ).
Question
Describe the 'Newton-Raphson method'.
Answer
The Newton-Raphson method is the multidimensional generalization of Newton's algorithm. Its update rule for the parameters θ is θ ← θ − (∇²_θ ℓ(θ))⁻¹ ∇_θ ℓ(θ), involving the inverse of the Hessian matrix.
Question
What are 'Normal equations' in linear regression?
Answer
The normal equations give a closed-form solution for the parameter θ that minimizes the cost function in linear regression. Given the design matrix X, the solution is θ = (XᵀX)⁻¹Xᵀy.
Question
Explain the 'LMS algorithm'.
Answer
The Least Mean Squares (LMS) algorithm, or Widrow-Hoff learning rule, is an iterative update rule for the parameters θ_j in linear regression. It adjusts θ_j based on the sum of errors across all training examples: θ_j ← θ_j + α Σ_{i=1}^m [y^(i) − h_θ(x^(i))] x_j^(i).
Question
What is 'Locally Weighted Regression' (LWR)?
Answer
LWR is a variant of linear regression in which each training example's contribution to the cost function is weighted by its proximity to the query point x. The weight w^(i)(x) is an exponential function of the squared distance between x^(i) and x, controlled by the bandwidth parameter τ.
Question
What is the 'sigmoid function'?
Answer
The sigmoid function, also known as the logistic function, is defined as g(z) = 1/(1 + e^(−z)). It maps any real number z to a value between 0 and 1, making it suitable for modeling probabilities.
Question
How is 'logistic regression' defined?
Answer
Logistic regression models the probability of a binary outcome y = 1 given x and parameters θ. It assumes y|x; θ follows a Bernoulli distribution, with p(y = 1|x; θ) = 1/(1 + exp(−θᵀx)) = g(θᵀx).
Question
What is 'Softmax regression'?
Answer
Softmax regression generalizes logistic regression to more than two outcome classes. For each class i, the class probability φ_i is given by φ_i = exp(θ_iᵀx) / Σ_{j=1}^K exp(θ_jᵀx), with θ_K = 0 by convention.
Question
What defines an 'exponential family' distribution?
Answer
A distribution belongs to the exponential family if its probability density function can be written as p(y; η) = b(y) exp(ηT(y) − a(η)), where η is the natural parameter, T(y) is the sufficient statistic, and a(η) is the log-partition function.
Question
What are the assumptions of 'Generalized Linear Models' (GLMs)?
Answer
GLMs rely on three assumptions: (1) y|x; θ follows an exponential family distribution, (2) the hypothesis h_θ(x) is the expected value of y given x and θ, and (3) the natural parameter η is a linear combination of the features, η = θᵀx.
Question
Can ordinary least squares and logistic regression be GLMs?
Answer
Yes, ordinary least squares (linear regression) and logistic regression are both special cases of Generalized Linear Models (GLMs). This highlights the broad applicability and unifying framework of GLMs.
Question
What is the goal of 'Support Vector Machines' (SVMs)?
Answer
The goal of Support Vector Machines (SVMs) is to find an optimal separating hyperplane that maximizes the minimum distance (the margin) between data points of different classes. This typically leads to better generalization.
Question
What is an 'optimal margin classifier'?
Answer
The optimal margin classifier h(x) = sign(wᵀx − b) in SVMs is found by solving an optimization problem: minimize (1/2)||w||² subject to the constraint y^(i)(wᵀx^(i) − b) ≥ 1 for all data points.
Question
What is 'hinge loss'?
Answer
Hinge loss is the loss function used in SVMs. It is defined as L(z, y) = [1 − yz]₊ = max(0, 1 − yz), penalizing predictions that are not confidently classified correctly.
Question
How is a 'kernel' defined in SVMs?
Answer
Given a feature mapping φ, the kernel K is defined as the inner product of the mapped features: K(x, z) = φ(x)ᵀφ(z). It allows SVMs to operate in a high-dimensional feature space without explicitly computing the mapping.
Question
What is the 'Gaussian kernel'?
Answer
The Gaussian kernel, also known as the Radial Basis Function (RBF) kernel, is a commonly used kernel in SVMs. It is defined as K(x, z) = exp(−||x − z||² / (2σ²)), where σ is a parameter controlling the width.
Question
What is the 'kernel trick'?
Answer
The kernel trick allows the cost function in SVMs to be computed using the kernel K(x, z) directly, without explicitly knowing the often complex feature mapping φ. This makes computations efficient in high-dimensional spaces.
Question
What is a 'Lagrangian' in optimization?
Answer
In optimization, the Lagrangian L(w, b) incorporates the constraints into the objective function. It is defined as L(w, b) = f(w) + Σ_{i=1}^l β_i h_i(w), where the β_i are Lagrange multipliers.
Question
What is a 'generative model'?
Answer
A generative model aims to learn the underlying distribution of the data, P(x|y), which describes how the data is generated. This can then be combined with Bayes' rule to estimate the conditional probability P(y|x).
Question
What are the assumptions of 'Gaussian Discriminant Analysis' (GDA)?
Answer
GDA assumes that the labels y follow a Bernoulli distribution, and that the features x conditioned on each class y = 0 or y = 1 follow a multivariate normal distribution with means μ₀, μ₁ and a shared covariance matrix Σ.
Question
How are parameters estimated in GDA?
Answer
In GDA, the parameters φ (for y's Bernoulli distribution), μ_j (mean of x|y = j), and Σ (shared covariance matrix of x|y) are estimated by maximizing the likelihood of the observed data.
Question
What is the core assumption of the 'Naive Bayes' model?
Answer
The Naive Bayes model assumes that all features x_i are conditionally independent given the class label y. This simplifies the joint probability P(x|y) to the product of individual conditional probabilities: P(x|y) = Π_{i=1}^n P(x_i|y).
Question
How are parameters P(y=k) and P(xi=l|y=k) estimated in Naive Bayes?
Answer
In Naive Bayes, P(y = k) is estimated as the frequency of class k in the training set, and P(x_i = l|y = k) as the frequency of feature x_i taking value l within class k.
Question
For what applications is Naive Bayes widely used?
Answer
Naive Bayes is widely used for text classification tasks, such as spam detection, sentiment analysis, and categorizing documents. Its simplicity and effectiveness make it a popular choice in these areas.
Question
What are 'CART' trees?
Answer
CART, or Classification and Regression Trees, are commonly known as decision trees. They are represented as binary trees and are highly interpretable models used for both classification and regression problems.
Question
What is a 'Random forest'?
Answer
A Random Forest is an ensemble learning method that builds a large number of decision trees using randomly selected subsets of features. While less interpretable than single decision trees, random forests generally offer high performance and robustness.
Question
What is 'Boosting' in machine learning?
Answer
Boosting is an ensemble technique that combines several 'weak learners' into a 'stronger learner'. It iteratively trains models to correct the errors of previous models, gradually improving overall prediction accuracy.
Question
Explain 'Adaptive boosting' (AdaBoost).
Answer
Adaptive boosting, or AdaBoost, improves performance by assigning higher weights to misclassified examples at each boosting step. Subsequent weak learners concentrate more on these difficult-to-classify data points.
Question
What is 'Gradient boosting'?
Answer
Gradient boosting is a boosting method where weak learners, typically decision trees, are trained on the residual errors (gradients) of the previous models. This iterative process allows the ensemble to progressively reduce the overall error.
Question
What is the 'k-nearest neighbors' (k-NN) algorithm?
Answer
The k-nearest neighbors (k-NN) algorithm is a non-parametric method used for classification and regression. It predicts the response of a data point from the majority class (for classification) or the average (for regression) of its k closest neighbors in the training set.
Question
How does the parameter k affect k-NN?
Answer
In k-NN, a higher value of k leads to higher bias and lower variance, making the model smoother and less sensitive to noise. Conversely, a lower k results in lower bias and higher variance, making it more susceptible to noise.
Question
What is the 'Union bound' inequality?
Answer
The union bound states that for any k events A_1, ..., A_k, the probability that at least one of them occurs is at most the sum of their individual probabilities: P(∪_{i=1}^k A_i) ≤ Σ_{i=1}^k P(A_i).
Question
What is 'Hoeffding's inequality'?
Answer
Hoeffding's inequality bounds the probability that the sample mean φ̂ of m i.i.d. Bernoulli(φ) variables deviates from the true mean φ by more than γ: P(|φ − φ̂| > γ) ≤ 2 exp(−2γ²m).

Supervised Learning Fundamentals

Supervised learning is a machine learning paradigm where models learn from labeled data to make predictions. The goal is to build a classifier that can learn to predict an outcome variable y from input features x.

Types of Predictive Models

              Regression           Classification
Outcome       Continuous           Categorical (class)
Examples      Linear regression    Logistic regression, SVM, Naive Bayes

Types of Machine Learning Models

                Discriminative model         Generative model
Goal            Directly estimate $P(y|x)$   Estimate $P(x|y)$ to deduce $P(y|x)$
What's learned  Decision boundary            Probability distributions of the data
Examples        Regressions, SVMs            GDA, Naive Bayes

Generative models first learn how the data is generated by estimating $P(x|y)$, then use Bayes' rule to estimate $P(y|x)$.

Core Concepts in Supervised Learning

  • Hypothesis ($h_\theta$): This is the chosen model. For a given input $x^{(i)}$, the model's prediction is $h_\theta(x^{(i)})$.
  • Loss Function ($L$): A function that quantifies the difference between a predicted value $z$ and the real data value $y$. Common losses:

                  Squared loss            Logistic loss           Hinge loss        Cross-entropy loss
    Definition    $\frac{1}{2}(y-z)^2$    $\log(1+\exp(-yz))$     $\max(0,1-yz)$    $-[y\log(z)+(1-y)\log(1-z)]$
    Used in       Linear regression       Logistic regression     SVM               Neural networks

  • Cost Function ($J(\theta)$): Commonly used to assess model performance, defined as the sum of the loss over all training examples: $J(\theta) = \sum_{i=1}^{m} L(h_\theta(x^{(i)}), y^{(i)})$
  • Gradient Descent: An optimization algorithm to minimize the cost function. With learning rate $\alpha$, the update rule is: $\theta \leftarrow \theta - \alpha \nabla J(\theta)$

    Stochastic Gradient Descent (SGD) updates the parameters after each training example, while Batch Gradient Descent updates them using the whole training set.

  • Likelihood ($L(\theta)$): Used to find the optimal parameters $\theta$ by maximizing the likelihood of the model given the data: $\theta^{\mathrm{opt}} = \arg\max_\theta L(\theta)$. In practice, the log-likelihood $\ell(\theta) = \log(L(\theta))$ is often optimized for simplicity.
  • Newton's Algorithm: A numerical method to find $\theta$ where $\ell'(\theta) = 0$. The update rule is: $\theta \leftarrow \theta - \frac{\ell'(\theta)}{\ell''(\theta)}$ The multidimensional generalization (Newton-Raphson method) uses: $\theta \leftarrow \theta - (\nabla_\theta^2 \ell(\theta))^{-1} \nabla_\theta \ell(\theta)$
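As a concrete illustration of the gradient-descent update rule above, here is a minimal sketch for a squared-loss cost on a toy 1-D dataset; the data, learning rate, and step count are all illustrative choices, not from the source:

```python
import numpy as np

def cost(theta, X, y):
    # J(theta): sum of squared losses over all training examples
    return 0.5 * np.sum((X @ theta - y) ** 2)

def gradient_descent(X, y, alpha, steps):
    theta = np.zeros(X.shape[1])
    for _ in range(steps):
        grad = X.T @ (X @ theta - y)   # gradient of J at the current theta
        theta = theta - alpha * grad   # theta <- theta - alpha * grad J(theta)
    return theta

# Toy dataset: intercept column plus one feature, generated by y = 1 + 2x
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([1.0, 3.0, 5.0])
theta = gradient_descent(X, y, alpha=0.1, steps=2000)
print(theta)  # converges to [1, 2]
```

Because the toy data is exactly linear, the iterates converge to the exact minimizer; on real data the loop would stop at a tolerance instead of a fixed step count.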

Regression Models

Regression models predict continuous outcome variables.

Linear Regression

Assumes the outcome $y$ given $x$ and $\theta$ follows a normal distribution: $y|x;\theta \sim \mathcal{N}(\mu, \sigma^2)$.

  • Normal Equations: For a design matrix $X$, the $\theta$ that minimizes the cost function has a closed-form solution: $\theta = (X^T X)^{-1} X^T y$
  • LMS Algorithm (Widrow-Hoff): An iterative update rule for a training set of $m$ data points with learning rate $\alpha$: $\forall j,\ \theta_j \leftarrow \theta_j + \alpha \sum_{i=1}^{m} \left[y^{(i)} - h_\theta(x^{(i)})\right] x_j^{(i)}$ This is a special case of gradient ascent.
  • Locally Weighted Regression (LWR): A variant where each training example is weighted in the cost function by $w^{(i)}(x) = \exp\left(-\frac{(x^{(i)} - x)^2}{2\tau^2}\right)$, where $\tau$ is a bandwidth parameter.
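The closed-form and iterative solutions above can be compared directly. A small sketch, assuming an illustrative toy dataset (the values of `X`, `y`, the learning rate, and the step count are not from the source):

```python
import numpy as np

# Toy design matrix (first column = intercept) and noisy targets near y = 1 + 2x
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([0.9, 3.1, 5.0, 7.0])

# Normal equations: solve (X^T X) theta = X^T y instead of forming the inverse
theta_closed = np.linalg.solve(X.T @ X, X.T @ y)

# Batch LMS (Widrow-Hoff): theta_j <- theta_j + alpha * sum_i [y_i - h(x_i)] x_ij
theta = np.zeros(2)
for _ in range(5000):
    theta = theta + 0.01 * X.T @ (y - X @ theta)

print(theta_closed, theta)  # both approach the least-squares solution
```

Using `np.linalg.solve` rather than an explicit matrix inverse is the standard, numerically safer way to evaluate $(X^T X)^{-1} X^T y$.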

Classification Models

Classification models predict categorical outcome variables.

Logistic Regression

Used for binary classification; assumes $y|x;\theta \sim \mathrm{Bernoulli}(\phi)$.

  • Sigmoid Function ($g$): Also known as the logistic function, it maps any real-valued number to a value between 0 and 1: $\forall z \in \mathbb{R},\ g(z) = \frac{1}{1 + e^{-z}} \in\, ]0,1[$
  • Hypothesis: The probability of $y = 1$ given $x$ and $\theta$ is: $\phi = p(y = 1|x; \theta) = \frac{1}{1 + \exp(-\theta^T x)} = g(\theta^T x)$

    There is no closed-form solution for logistic regression; it requires iterative optimization methods such as gradient descent.
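One such iterative method is gradient ascent on the log-likelihood, whose update takes the same error-times-input form as LMS. A minimal sketch on an illustrative, linearly separable toy dataset (data and hyperparameters are assumptions for the example):

```python
import numpy as np

def sigmoid(z):
    # g(z) = 1 / (1 + e^(-z))
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, alpha=0.1, steps=5000):
    # Gradient ascent on the log-likelihood:
    # theta <- theta + alpha * X^T (y - g(X theta))
    theta = np.zeros(X.shape[1])
    for _ in range(steps):
        theta = theta + alpha * X.T @ (y - sigmoid(X @ theta))
    return theta

# Toy 1-D data with an intercept column; labels split around x = 2
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 3.0], [1.0, 4.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])
theta = fit_logistic(X, y)
preds = (sigmoid(X @ theta) >= 0.5).astype(float)
print(preds)  # matches y
```

On perfectly separable data the parameters keep growing with more steps; in practice regularization or early stopping keeps them bounded.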

Softmax Regression

A generalization of logistic regression for more than two outcome classes, also called multiclass logistic regression. By convention, $\theta_K = 0$. The class probability $\phi_i$ for each class $i$ is: $\phi_i = \frac{\exp(\theta_i^T x)}{\sum_{j=1}^{K} \exp(\theta_j^T x)}$
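The class-probability formula can be sketched directly; the parameter matrix and input below are illustrative, with the last row zero per the $\theta_K = 0$ convention:

```python
import numpy as np

def softmax_probs(Theta, x):
    # phi_i = exp(theta_i^T x) / sum_j exp(theta_j^T x)
    logits = Theta @ x
    logits = logits - logits.max()  # standard numerical-stability shift
    e = np.exp(logits)
    return e / e.sum()

Theta = np.array([[1.0, -1.0],   # theta_1
                  [0.5,  0.5],   # theta_2
                  [0.0,  0.0]])  # theta_K = 0 by convention
x = np.array([1.0, 2.0])
phi = softmax_probs(Theta, x)
print(phi, phi.sum())  # valid probabilities summing to 1
```

Subtracting the maximum logit leaves the probabilities unchanged but prevents overflow in `exp` for large inputs.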

Generalized Linear Models (GLMs)

GLMs provide a flexible way to model response variables that follow various distributions, not just the normal distribution, by relating the linear predictor to the response variable via a link function.

Exponential Family Distributions

A distribution belongs to the exponential family if it can be written as: $p(y; \eta) = b(y) \exp(\eta T(y) - a(\eta))$ Where:

  • $\eta$: Natural parameter (also called the canonical parameter).
  • $T(y)$: Sufficient statistic (often $T(y) = y$).
  • $a(\eta)$: Log-partition function.
  • $b(y)$: Base measure.

The term $\exp(-a(\eta))$ acts as a normalization constant, ensuring that the distribution integrates (or sums) to one.

Common Exponential Distributions

Distribution   $\eta$                           $T(y)$   $a(\eta)$                           $b(y)$
Bernoulli      $\log(\frac{\phi}{1-\phi})$      $y$      $\log(1+e^{\eta})$                  $1$
Gaussian       $\mu$                            $y$      $\frac{\eta^2}{2}$                  $\frac{1}{\sqrt{2\pi}}\exp(-\frac{y^2}{2})$
Poisson        $\log(\lambda)$                  $y$      $e^{\eta}$                          $\frac{1}{y!}$
Geometric      $\log(1-\phi)$                   $y$      $\log(\frac{e^{\eta}}{1-e^{\eta}})$ $1$
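The Bernoulli row of the table can be checked numerically: writing the pmf in exponential-family form with the listed $\eta$, $T(y)$, $a(\eta)$, and $b(y)$ should reproduce $\phi$ and $1-\phi$. A small sketch (the value of $\phi$ is arbitrary):

```python
import math

def bernoulli_expfam(y, phi):
    # Bernoulli in exponential-family form:
    # eta = log(phi/(1-phi)), T(y) = y, a(eta) = log(1+e^eta), b(y) = 1
    eta = math.log(phi / (1 - phi))
    return math.exp(eta * y - math.log(1 + math.exp(eta)))

phi = 0.3
print(bernoulli_expfam(1, phi), bernoulli_expfam(0, phi))  # 0.3 and 0.7
```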

Assumptions of GLMs

GLMs aim to predict a random variable $y$ as a function of $x \in \mathbb{R}^{n+1}$ based on three key assumptions:

  1. $y|x;\theta \sim \mathrm{ExpFamily}(\eta)$ (the conditional distribution of $y$ given $x$ belongs to the exponential family).
  2. $h_\theta(x) = E[y|x;\theta]$ (the model predicts the expected value of $y$).
  3. $\eta = \theta^T x$ (the natural parameter is a linear function of the input features).
Ordinary Least Squares (Linear Regression) and Logistic Regression are special cases of Generalized Linear Models.

Support Vector Machines (SVMs)

The goal of SVMs is to find an optimal hyperplane that maximizes the minimum distance to the closest training examples from any class (the margin).

Optimal Margin Classifier

The optimal margin classifier $h(x)$ is given by:

$h(x) = \operatorname{sign}(w^T x - b)$

where $(w, b) \in \mathbb{R}^n \times \mathbb{R}$ are the parameters obtained by solving the following optimization problem:

$\min \frac{1}{2}\|w\|^2 \quad \text{such that} \quad y^{(i)}(w^T x^{(i)} - b) \geq 1$

The decision boundary is defined by $w^T x - b = 0$.

Hinge Loss

The hinge loss function, specific to SVMs, is defined as:

$L(z, y) = [1 - yz]_+ = \max(0, 1 - yz)$

Kernel Function

Given a feature mapping $\phi$, the kernel $K$ is defined as the inner product of the mapped features:

$K(x, z) = \phi(x)^T \phi(z)$

A commonly used kernel is the Gaussian kernel: $K(x, z) = \exp\left(-\frac{\|x - z\|^2}{2\sigma^2}\right)$.

The "kernel trick" allows us to compute the cost function using the kernel $K(x, z)$ without explicitly knowing the often complex feature mapping $\phi$. This is crucial for handling non-linear separability by implicitly mapping data into higher-dimensional spaces.
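The kernel trick in practice means working with the Gram matrix of pairwise kernel values rather than explicit features. A minimal sketch with the Gaussian kernel on an illustrative toy dataset:

```python
import numpy as np

def gaussian_kernel(x, z, sigma=1.0):
    # K(x, z) = exp(-||x - z||^2 / (2 sigma^2))
    return np.exp(-np.sum((x - z) ** 2) / (2 * sigma ** 2))

# Gram matrix K[i, j] = K(x_i, x_j) for a toy dataset; the (effectively
# infinite-dimensional) feature map phi is never formed explicitly.
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
K = np.array([[gaussian_kernel(a, b) for b in X] for a in X])
print(K)  # symmetric, with ones on the diagonal
```

A valid kernel yields a symmetric positive semi-definite Gram matrix, which is exactly what kernelized training algorithms rely on.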

Lagrangian

The Lagrangian $\mathcal{L}(w, b)$ is used in optimization with constraints:

$\mathcal{L}(w, b) = f(w) + \sum_{i=1}^{l} \beta_i h_i(w)$

The coefficients $\beta_i$ are known as Lagrange multipliers.

Generative Learning Algorithms

Generative models learn the joint probability distribution $P(x, y)$, or equivalently $P(x|y)$ and $P(y)$, which then allows them to infer $P(y|x)$ using Bayes' rule.

Gaussian Discriminant Analysis (GDA)

GDA assumes specific distributions for $y$ and $x|y$.

  • Setting:
    • $y \sim \mathrm{Bernoulli}(\phi)$
    • $x \mid y = 0 \sim \mathcal{N}(\mu_0, \Sigma)$
    • $x \mid y = 1 \sim \mathcal{N}(\mu_1, \Sigma)$
  • Estimation: Parameters are found by maximizing the likelihood:
    • $\phi = \frac{1}{m}\sum_{i=1}^{m} 1\{y^{(i)} = 1\}$
    • $\mu_j = \frac{\sum_{i=1}^{m} 1\{y^{(i)} = j\}\, x^{(i)}}{\sum_{i=1}^{m} 1\{y^{(i)} = j\}}$ for $j = 0, 1$
    • $\Sigma = \frac{1}{m}\sum_{i=1}^{m} (x^{(i)} - \mu_{y^{(i)}})(x^{(i)} - \mu_{y^{(i)}})^T$
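The three estimators above are just class-wise means and a pooled covariance, which makes them easy to sketch. A minimal version on an illustrative two-cluster toy dataset:

```python
import numpy as np

def gda_fit(X, y):
    # Maximum-likelihood estimates for GDA with a shared covariance matrix.
    m = len(y)
    phi = np.mean(y == 1)                 # fraction of positive labels
    mu0 = X[y == 0].mean(axis=0)          # class-0 mean
    mu1 = X[y == 1].mean(axis=0)          # class-1 mean
    centered = X - np.where(y[:, None] == 1, mu1, mu0)
    Sigma = centered.T @ centered / m     # pooled covariance
    return phi, mu0, mu1, Sigma

# Toy data: two small clusters, labels 0 and 1
X = np.array([[0.0, 0.0], [0.2, -0.1], [2.0, 2.0], [1.8, 2.1]])
y = np.array([0, 0, 1, 1])
phi, mu0, mu1, Sigma = gda_fit(X, y)
print(phi, mu0, mu1)
```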

Naive Bayes

A generative model that simplifies the estimation by assuming feature independence.

  • Assumption: Features $x_1, x_2, \dots$ are conditionally independent given $y$: $P(x|y) = P(x_1, x_2, \dots|y) = \prod_{i=1}^{n} P(x_i|y)$
  • Solutions: Maximizing the log-likelihood yields the following estimates for $k \in \{0, 1\}$ and feature value $l \in [\![1, L]\!]$: $P(y = k) = \frac{1}{m} \times \#\{j \mid y^{(j)} = k\}$ and $P(x_i = l \mid y = k) = \frac{\#\{j \mid y^{(j)} = k \text{ and } x_i^{(j)} = l\}}{\#\{j \mid y^{(j)} = k\}}$

    Naive Bayes is widely used in text classification and spam detection due to its simplicity and effectiveness.
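Since the Naive Bayes estimates are pure frequency counts, they can be sketched with a few dictionaries. An illustrative toy dataset with two binary features (names and data are assumptions for the example; real implementations usually add Laplace smoothing for unseen values):

```python
from collections import Counter

def nb_fit(X, y):
    # Frequency estimates of P(y=k) and P(x_i = l | y = k) by counting.
    m = len(y)
    prior = {k: c / m for k, c in Counter(y).items()}
    cond = {}
    for k in prior:
        rows = [x for x, label in zip(X, y) if label == k]
        for i in range(len(X[0])):
            counts = Counter(row[i] for row in rows)
            for l, c in counts.items():
                cond[(i, l, k)] = c / len(rows)  # P(x_i = l | y = k)
    return prior, cond

# Toy data: two binary features, binary label
X = [(1, 0), (1, 1), (0, 1), (0, 0)]
y = [1, 1, 0, 0]
prior, cond = nb_fit(X, y)
print(prior[1], cond[(0, 1, 1)])  # 0.5 and 1.0
```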

Tree-Based Methods

These methods can be used for both regression and classification problems.

  • CART (Classification and Regression Trees): Commonly known as decision trees, these are binary trees. They are highly interpretable.
  • Random Forest: A tree-based ensemble method that builds a large number of decision trees from randomly selected subsets of features and data. It typically offers good performance but is less interpretable than single decision trees.

Boosting Methods

Boosting combines several "weak learners" to form a stronger learner.

  • Adaptive Boosting (AdaBoost): Assigns higher weights to misclassified examples in subsequent boosting steps.
  • Gradient Boosting: Weak learners are trained on the residuals (remaining errors) of previous learners.

K-Nearest Neighbors (K-NN)

The $k$-NN algorithm is a non-parametric approach where the response of a new data point is determined by the majority class (for classification) or the average (for regression) of its $k$ closest neighbors in the training set.

The choice of $k$ influences the model's bias and variance:
  • A higher $k$ leads to higher bias (smoother decision boundaries, less sensitive to noise).
  • A lower $k$ leads to higher variance (more complex decision boundaries, more sensitive to noise).
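The classification variant described above fits in a few lines: compute distances, take the $k$ nearest, and vote. A minimal sketch on an illustrative two-cluster toy dataset:

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k):
    # Majority vote among the k nearest training points (Euclidean distance).
    dists = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(dists)[:k]
    return Counter(y_train[i] for i in nearest).most_common(1)[0][0]

X_train = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]])
y_train = ["a", "a", "b", "b"]
print(knn_predict(X_train, y_train, np.array([0.5, 0.5]), k=3))  # "a"
```

Choosing an odd $k$ avoids ties in binary classification; for regression the vote would simply be replaced by a mean over the neighbors.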

Learning Theory

Learning theory provides theoretical foundations for understanding how and why algorithms learn effectively.

  • Union Bound: For $k$ events $A_1, \dots, A_k$, the probability of at least one of these events occurring is bounded: $P\left(\bigcup_{i=1}^{k} A_i\right) \le \sum_{i=1}^{k} P(A_i)$
  • Hoeffding Inequality (Chernoff Bound): For $m$ i.i.d. Bernoulli variables $Z_1, \dots, Z_m$ with mean $\phi$, and sample mean $\widehat{\phi}$, for any $\gamma > 0$: $P\left(\left|\phi - \widehat{\phi}\right| > \gamma\right) \le 2\exp\left(-2\gamma^2 m\right)$
  • Training Error ($\widehat{\epsilon}(h)$): Also known as empirical risk or empirical error, it measures the proportion of mistakes a classifier $h$ makes on the training set: $\widehat{\epsilon}(h) = \frac{1}{m}\sum_{i=1}^{m} 1_{\{h(x^{(i)}) \neq y^{(i)}\}}$
  • Probably Approximately Correct (PAC): A framework for analyzing the theoretical learnability of a concept. Its assumptions are:
    • Training and test sets follow the same distribution.
    • Training examples are drawn independently.
  • Shattering: A hypothesis class $\mathcal{H}$ shatters a set of data points $S = \{x^{(1)}, \dots, x^{(d)}\}$ if, for any possible labeling $\{y^{(1)}, \dots, y^{(d)}\}$, there is a classifier that fits it exactly: $\exists h \in \mathcal{H},\ \forall i \in [\![1, d]\!],\ h(x^{(i)}) = y^{(i)}$
  • Upper Bound Theorem: For a finite hypothesis class $\mathcal{H}$ of size $k$, with probability at least $1 - \delta$, the true error $\epsilon(\widehat{h})$ of the learned hypothesis $\widehat{h}$ is bounded by: $\epsilon(\widehat{h}) \le \left(\min_{h \in \mathcal{H}} \epsilon(h)\right) + 2\sqrt{\frac{1}{2m}\log\left(\frac{2k}{\delta}\right)}$
  • VC Dimension (Vapnik-Chervonenkis Dimension): For an infinite hypothesis class $\mathcal{H}$, $\mathrm{VC}(\mathcal{H})$ is the size of the largest set that can be shattered by $\mathcal{H}$.

    For example, the VC dimension of linear classifiers in 2 dimensions is 3.

  • Vapnik's Theorem: For a hypothesis class $\mathcal{H}$ with VC dimension $d$ and $m$ training examples, with probability at least $1 - \delta$: $\epsilon(\widehat{h}) \le \left(\min_{h \in \mathcal{H}} \epsilon(h)\right) + O\left(\sqrt{\frac{d}{m}\log\left(\frac{m}{d}\right) + \frac{1}{m}\log\left(\frac{1}{\delta}\right)}\right)$ This theorem provides a generalization bound for infinite hypothesis classes.
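The Hoeffding inequality above can be sanity-checked by simulation: the empirical frequency of large deviations should stay below the bound $2\exp(-2\gamma^2 m)$. A small sketch with illustrative values of $\phi$, $m$, and $\gamma$:

```python
import math
import random

def deviation_prob(phi, m, gamma, trials=5000, seed=0):
    # Empirical estimate of P(|phi - phi_hat| > gamma), where phi_hat is the
    # sample mean of m i.i.d. Bernoulli(phi) draws (toy Monte Carlo check).
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        phi_hat = sum(rng.random() < phi for _ in range(m)) / m
        if abs(phi - phi_hat) > gamma:
            hits += 1
    return hits / trials

phi, m, gamma = 0.5, 100, 0.1
bound = 2 * math.exp(-2 * gamma ** 2 * m)  # Hoeffding: 2 exp(-2 gamma^2 m)
emp = deviation_prob(phi, m, gamma)
print(emp, "<=", bound)
```

For these values the bound is about 0.27, while the observed deviation frequency is much smaller, illustrating that Hoeffding is a worst-case (and often loose) guarantee.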

Key Takeaways

  • Supervised learning involves building models to predict outcomes from labeled data.
  • Models are categorized as regression (continuous outcomes) or classification (categorical outcomes), and as discriminative or generative.
  • Fundamental concepts include hypothesis, loss functions, cost functions, and optimization algorithms like gradient descent and Newton's method.
  • Popular algorithms include Linear Regression, Logistic Regression, SVMs, GDA, Naive Bayes, Decision Trees, Random Forests, and K-NN.
  • Generalized Linear Models offer a unifying framework for various exponential family distributions.
  • Learning theory provides bounds and insights into model generalization through concepts like PAC learning, shattering, and VC dimension.
