Supervised Learning Models and Algorithms

39 cards

An overview of supervised learning models, algorithms, and key concepts such as regression, classification, loss functions, and optimization techniques.

Question
What is a 'hypothesis' in machine learning?
Answer
In machine learning, a hypothesis, denoted h_θ, is the chosen model that predicts an output h_θ(x^(i)) for a given input x^(i). It represents the learned function.
Question
How is a 'loss function' defined?
Answer
A loss function, L : (z, y) ∈ ℝ × Y ⟼ L(z, y) ∈ ℝ, quantifies the difference between a predicted value z and the true data value y. It serves as a measure of error.
Question
What is the purpose of a 'cost function'?
Answer
The cost function J evaluates a model's overall performance by summing the losses over all training examples. It is defined as J(θ) = Σ_{i=1}^m L(h_θ(x^(i)), y^(i)).
Question
Explain 'gradient descent' in machine learning.
Answer
Gradient descent is an optimization algorithm that updates the model parameters θ by iteratively moving in the direction opposite to the gradient of the cost function J. The update rule is θ ← θ − α∇J(θ), where α is the learning rate.
Question
What is the 'likelihood' of a model?
Answer
The likelihood L(θ) of a model indicates how probable the observed data is given the model's parameters θ. Maximizing the likelihood yields the optimal parameters θ_opt.
Question
What is the log-likelihood?
Answer
The log-likelihood, ℓ(θ) = log(L(θ)), is the natural logarithm of the likelihood function. It is often optimized instead of the likelihood itself because it converts products into sums, simplifying calculations.
Question
What is 'Newton's algorithm' used for?
Answer
Newton's algorithm is a numerical method for finding the optimal parameter θ by locating where the derivative of the log-likelihood, ℓ'(θ), is zero. Its update rule is θ ← θ − ℓ'(θ)/ℓ''(θ).
Question
Describe the 'Newton-Raphson method'.
Answer
The Newton-Raphson method is the multidimensional generalization of Newton's algorithm. Its update rule for the parameters θ is θ ← θ − (∇²_θ ℓ(θ))⁻¹ ∇_θ ℓ(θ), involving the inverse of the Hessian matrix.
Question
What are 'Normal equations' in linear regression?
Answer
The normal equations give a closed-form solution for the parameter θ that minimizes the cost function in linear regression. Given the design matrix X, the solution is θ = (XᵀX)⁻¹Xᵀy.
Question
Explain the 'LMS algorithm'.
Answer
The Least Mean Squares (LMS) algorithm, or Widrow-Hoff learning rule, is an iterative update rule for the parameters θ_j in linear regression. It adjusts θ_j based on the sum of errors across all training examples: θ_j ← θ_j + α Σ_{i=1}^m [y^(i) − h_θ(x^(i))] x_j^(i).
Question
What is 'Locally Weighted Regression' (LWR)?
Answer
LWR is a variant of linear regression in which each training example's contribution to the cost function is weighted by its proximity to the query point x. The weight w^(i)(x) is an exponential function of the squared distance between x^(i) and x, controlled by the bandwidth parameter τ.
Question
What is the 'sigmoid function'?
Answer
The sigmoid function, also known as the logistic function, is defined as g(z) = 1/(1 + e^(−z)). It maps any real number z to a value between 0 and 1, making it suitable for modeling probabilities.
Question
How is 'logistic regression' defined?
Answer
Logistic regression models the probability of a binary outcome y = 1 given x and parameters θ. It assumes y|x; θ follows a Bernoulli distribution, with p(y = 1|x; θ) = 1/(1 + exp(−θᵀx)) = g(θᵀx).
Question
What is 'Softmax regression'?
Answer
Softmax regression generalizes logistic regression to more than two outcome classes. For each class i, the class probability φ_i is given by φ_i = exp(θ_iᵀx) / Σ_{j=1}^K exp(θ_jᵀx), with θ_K = 0 by convention.
Question
What defines an 'exponential family' distribution?
Answer
A distribution belongs to the exponential family if its probability density function can be written as p(y; η) = b(y) exp(ηT(y) − a(η)), where η is the natural parameter, T(y) is the sufficient statistic, and a(η) is the log-partition function.
Question
What are the assumptions of 'Generalized Linear Models' (GLMs)?
Answer
GLMs rely on three assumptions: (1) y|x; θ follows an exponential family distribution, (2) the hypothesis h_θ(x) is the expected value of y given x and θ, and (3) the natural parameter η is a linear combination of the features, η = θᵀx.
Question
Can ordinary least squares and logistic regression be GLMs?
Answer
Yes, ordinary least squares (linear regression) and logistic regression are both special cases of Generalized Linear Models (GLMs). This highlights the broad applicability and unifying framework of GLMs.
Question
What is the goal of 'Support Vector Machines' (SVMs)?
Answer
The goal of Support Vector Machines (SVMs) is to find an optimal separating hyperplane that maximizes the minimum distance (the margin) between data points of different classes. This typically leads to better generalization.
Question
What is an 'optimal margin classifier'?
Answer
The optimal margin classifier h(x) = sign(wᵀx − b) in SVMs is found by solving an optimization problem: minimize (1/2)||w||² subject to the constraint y^(i)(wᵀx^(i) − b) ≥ 1 for all data points.
Question
What is 'hinge loss'?
Answer
Hinge loss is the loss function used in SVMs. It is defined as L(z, y) = [1 − yz]₊ = max(0, 1 − yz), penalizing predictions that are not confidently classified correctly.
Question
How is a 'kernel' defined in SVMs?
Answer
Given a feature mapping φ, the kernel K is defined as the inner product of the mapped features: K(x, z) = φ(x)ᵀφ(z). It allows SVMs to operate in a high-dimensional feature space without explicitly computing the mapping.
Question
What is the 'Gaussian kernel'?
Answer
The Gaussian kernel, also known as the Radial Basis Function (RBF) kernel, is a commonly used kernel in SVMs. It is defined as K(x, z) = exp(−||x − z||² / (2σ²)), where σ is a parameter controlling the width.
Question
What is the 'kernel trick'?
Answer
The kernel trick allows the cost function in SVMs to be computed using the kernel K(x, z) directly, without explicitly knowing the often complex feature mapping φ. This makes computations efficient in high-dimensional spaces.
Question
What is a 'Lagrangian' in optimization?
Answer
In optimization, the Lagrangian L(w, b) incorporates the constraints into the objective function. It is defined as L(w, b) = f(w) + Σ_{i=1}^l β_i h_i(w), where the β_i are Lagrange multipliers.
Question
What is a 'generative model'?
Answer
A generative model aims to learn the underlying distribution of the data, P(x|y), which describes how the data is generated. This can then be combined with Bayes' rule to estimate the conditional probability P(y|x).
Question
What are the assumptions of 'Gaussian Discriminant Analysis' (GDA)?
Answer
GDA assumes that the labels y follow a Bernoulli distribution, and that the features x conditioned on each class y = 0 or y = 1 follow a multivariate normal distribution with means μ₀, μ₁ and a shared covariance matrix Σ.
Question
How are parameters estimated in GDA?
Answer
In GDA, the parameters φ (for y's Bernoulli distribution), μ_j (mean of x|y = j), and Σ (shared covariance matrix of x|y) are estimated by maximizing the likelihood of the observed data.
Question
What is the core assumption of the 'Naive Bayes' model?
Answer
The Naive Bayes model assumes that all features x_i are conditionally independent given the class label y. This simplifies the joint probability P(x|y) to the product of individual conditional probabilities: P(x|y) = Π_{i=1}^n P(x_i|y).
Question
How are parameters P(y=k) and P(xi=l|y=k) estimated in Naive Bayes?
Answer
In Naive Bayes, P(y = k) is estimated as the frequency of class k in the training set, and P(x_i = l|y = k) as the frequency of feature x_i taking value l within class k.
Question
For what applications is Naive Bayes widely used?
Answer
Naive Bayes is widely used for text classification tasks, such as spam detection, sentiment analysis, and categorizing documents. Its simplicity and effectiveness make it a popular choice in these areas.
Question
What are 'CART' trees?
Answer
CART, or Classification and Regression Trees, are commonly known as decision trees. They are represented as binary trees and are highly interpretable models used for both classification and regression problems.
Question
What is a 'Random forest'?
Answer
A Random Forest is an ensemble learning method that builds a large number of decision trees using randomly selected subsets of features. While less interpretable than single decision trees, random forests generally offer high performance and robustness.
Question
What is 'Boosting' in machine learning?
Answer
Boosting is an ensemble technique that combines several 'weak learners' into a 'stronger learner'. It iteratively trains models to correct the errors of previous models, gradually improving overall prediction accuracy.
Question
Explain 'Adaptive boosting' (AdaBoost).
Answer
Adaptive boosting, or AdaBoost, improves performance by assigning higher weights to misclassified examples at each boosting step. Subsequent weak learners concentrate more on these difficult-to-classify data points.
Question
What is 'Gradient boosting'?
Answer
Gradient boosting is a boosting method where weak learners, typically decision trees, are trained on the residual errors (gradients) of the previous models. This iterative process allows the ensemble to progressively reduce the overall error.
Question
What is the 'k-nearest neighbors' (k-NN) algorithm?
Answer
The k-nearest neighbors (k-NN) algorithm is a non-parametric method used for classification and regression. It predicts the response of a data point from the majority class (for classification) or the average (for regression) of its k closest neighbors in the training set.
Question
How does the parameter k affect k-NN?
Answer
In k-NN, a higher value of k leads to higher bias and lower variance, making the model smoother and less sensitive to noise. Conversely, a lower k results in lower bias and higher variance, making it more susceptible to noise.
Question
What is the 'Union bound' inequality?
Answer
The union bound states that for any k events A_1, ..., A_k, the probability that at least one of them occurs is at most the sum of their individual probabilities: P(∪_{i=1}^k A_i) ≤ Σ_{i=1}^k P(A_i).
Question
What is 'Hoeffding's inequality'?
Answer
Hoeffding's inequality bounds the probability that the sample mean φ̂ of m i.i.d. Bernoulli(φ) variables deviates from the true mean φ by more than γ: P(|φ − φ̂| > γ) ≤ 2 exp(−2γ²m).

Supervised Learning Fundamentals

Supervised learning is a machine learning paradigm where models learn from labeled data to make predictions. The goal is to build a classifier that can learn to predict an outcome variable y from input features x.

Types of Predictive Models

              Regression           Classification
Outcome       Continuous           Categorical (class)
Examples      Linear regression    Logistic regression, SVM, Naive Bayes

Types of Machine Learning Models

                Discriminative model         Generative model
Goal            Directly estimate $P(y|x)$   Estimate $P(x|y)$ to deduce $P(y|x)$
What's learned  Decision boundary            Probability distributions of the data
Examples        Regressions, SVMs            GDA, Naive Bayes

Generative models first learn how the data is generated by estimating $P(x|y)$, then use Bayes' rule to estimate $P(y|x)$.

Core Concepts in Supervised Learning

  • Hypothesis ($h_\theta$): This is the chosen model. For a given input $x^{(i)}$, the model's prediction is $h_\theta(x^{(i)})$.
  • Loss Function ($L$): A function that quantifies the difference between a predicted value $z$ and the real data value $y$. Common losses:

                  Squared loss            Logistic loss           Hinge loss        Cross-entropy loss
    Definition    $\frac{1}{2}(y-z)^2$    $\log(1+\exp(-yz))$     $\max(0,1-yz)$    $-[y\log(z)+(1-y)\log(1-z)]$
    Used in       Linear regression       Logistic regression     SVM               Neural networks

  • Cost Function ($J(\theta)$): Commonly used to assess model performance, defined as the sum of the loss over all training examples: $J(\theta) = \sum_{i=1}^{m} L(h_\theta(x^{(i)}), y^{(i)})$
  • Gradient Descent: An optimization algorithm to minimize the cost function. With learning rate $\alpha$, the update rule is: $\theta \leftarrow \theta - \alpha \nabla J(\theta)$

    Stochastic Gradient Descent (SGD) updates the parameters after each training example, while Batch Gradient Descent updates them using the whole training set.

  • Likelihood ($L(\theta)$): Used to find the optimal parameters $\theta$ by maximizing the likelihood of the model given the data: $\theta^{\mathrm{opt}} = \arg\max_\theta L(\theta)$. In practice, the log-likelihood $\ell(\theta) = \log(L(\theta))$ is often optimized for simplicity.
  • Newton's Algorithm: A numerical method to find $\theta$ where $\ell'(\theta) = 0$. The update rule is: $\theta \leftarrow \theta - \frac{\ell'(\theta)}{\ell''(\theta)}$ The multidimensional generalization (Newton-Raphson method) uses: $\theta \leftarrow \theta - (\nabla_\theta^2 \ell(\theta))^{-1} \nabla_\theta \ell(\theta)$
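As a concrete illustration of the gradient-descent update rule above, here is a minimal sketch for a squared-loss cost on a toy 1-D dataset; the data, learning rate, and step count are all illustrative choices, not from the source:

```python
import numpy as np

def cost(theta, X, y):
    # J(theta): sum of squared losses over all training examples
    return 0.5 * np.sum((X @ theta - y) ** 2)

def gradient_descent(X, y, alpha, steps):
    theta = np.zeros(X.shape[1])
    for _ in range(steps):
        grad = X.T @ (X @ theta - y)   # gradient of J at the current theta
        theta = theta - alpha * grad   # theta <- theta - alpha * grad J(theta)
    return theta

# Toy dataset: intercept column plus one feature, generated by y = 1 + 2x
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([1.0, 3.0, 5.0])
theta = gradient_descent(X, y, alpha=0.1, steps=2000)
print(theta)  # converges to [1, 2]
```

Because the toy data is exactly linear, the iterates converge to the exact minimizer; on real data the loop would stop at a tolerance instead of a fixed step count.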

Regression Models

Regression models predict continuous outcome variables.

Linear Regression

Assumes the outcome $y$ given $x$ and $\theta$ follows a normal distribution: $y|x;\theta \sim \mathcal{N}(\mu, \sigma^2)$.

  • Normal Equations: For a design matrix $X$, the $\theta$ that minimizes the cost function has a closed-form solution: $\theta = (X^T X)^{-1} X^T y$
  • LMS Algorithm (Widrow-Hoff): An iterative update rule for a training set of $m$ data points with learning rate $\alpha$: $\forall j,\ \theta_j \leftarrow \theta_j + \alpha \sum_{i=1}^{m} \left[y^{(i)} - h_\theta(x^{(i)})\right] x_j^{(i)}$ This is a special case of gradient ascent.
  • Locally Weighted Regression (LWR): A variant where each training example is weighted in the cost function by $w^{(i)}(x) = \exp\left(-\frac{(x^{(i)} - x)^2}{2\tau^2}\right)$, where $\tau$ is a bandwidth parameter.
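The closed-form and iterative solutions above can be compared directly. A small sketch, assuming an illustrative toy dataset (the values of `X`, `y`, the learning rate, and the step count are not from the source):

```python
import numpy as np

# Toy design matrix (first column = intercept) and noisy targets near y = 1 + 2x
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([0.9, 3.1, 5.0, 7.0])

# Normal equations: solve (X^T X) theta = X^T y instead of forming the inverse
theta_closed = np.linalg.solve(X.T @ X, X.T @ y)

# Batch LMS (Widrow-Hoff): theta_j <- theta_j + alpha * sum_i [y_i - h(x_i)] x_ij
theta = np.zeros(2)
for _ in range(5000):
    theta = theta + 0.01 * X.T @ (y - X @ theta)

print(theta_closed, theta)  # both approach the least-squares solution
```

Using `np.linalg.solve` rather than an explicit matrix inverse is the standard, numerically safer way to evaluate $(X^T X)^{-1} X^T y$.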

Classification Models

Classification models predict categorical outcome variables.

Logistic Regression

Used for binary classification; assumes $y|x;\theta \sim \mathrm{Bernoulli}(\phi)$.

  • Sigmoid Function ($g$): Also known as the logistic function, it maps any real-valued number to a value between 0 and 1: $\forall z \in \mathbb{R},\ g(z) = \frac{1}{1 + e^{-z}} \in\, ]0,1[$
  • Hypothesis: The probability of $y = 1$ given $x$ and $\theta$ is: $\phi = p(y = 1|x; \theta) = \frac{1}{1 + \exp(-\theta^T x)} = g(\theta^T x)$

    There is no closed-form solution for logistic regression; it requires iterative optimization methods such as gradient descent.
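One such iterative method is gradient ascent on the log-likelihood, whose update takes the same error-times-input form as LMS. A minimal sketch on an illustrative, linearly separable toy dataset (data and hyperparameters are assumptions for the example):

```python
import numpy as np

def sigmoid(z):
    # g(z) = 1 / (1 + e^(-z))
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, alpha=0.1, steps=5000):
    # Gradient ascent on the log-likelihood:
    # theta <- theta + alpha * X^T (y - g(X theta))
    theta = np.zeros(X.shape[1])
    for _ in range(steps):
        theta = theta + alpha * X.T @ (y - sigmoid(X @ theta))
    return theta

# Toy 1-D data with an intercept column; labels split around x = 2
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 3.0], [1.0, 4.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])
theta = fit_logistic(X, y)
preds = (sigmoid(X @ theta) >= 0.5).astype(float)
print(preds)  # matches y
```

On perfectly separable data the parameters keep growing with more steps; in practice regularization or early stopping keeps them bounded.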

Softmax Regression

A generalization of logistic regression for more than two outcome classes, also called multiclass logistic regression. By convention, $\theta_K = 0$. The class probability $\phi_i$ for each class $i$ is: $\phi_i = \frac{\exp(\theta_i^T x)}{\sum_{j=1}^{K} \exp(\theta_j^T x)}$
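The class-probability formula can be sketched directly; the parameter matrix and input below are illustrative, with the last row zero per the $\theta_K = 0$ convention:

```python
import numpy as np

def softmax_probs(Theta, x):
    # phi_i = exp(theta_i^T x) / sum_j exp(theta_j^T x)
    logits = Theta @ x
    logits = logits - logits.max()  # standard numerical-stability shift
    e = np.exp(logits)
    return e / e.sum()

Theta = np.array([[1.0, -1.0],   # theta_1
                  [0.5,  0.5],   # theta_2
                  [0.0,  0.0]])  # theta_K = 0 by convention
x = np.array([1.0, 2.0])
phi = softmax_probs(Theta, x)
print(phi, phi.sum())  # valid probabilities summing to 1
```

Subtracting the maximum logit leaves the probabilities unchanged but prevents overflow in `exp` for large inputs.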

Generalized Linear Models (GLMs)

GLMs provide a flexible way to model response variables that follow various distributions, not just the normal distribution, by relating the linear predictor to the response variable via a link function.

Exponential Family Distributions

A distribution belongs to the exponential family if it can be written as: $p(y; \eta) = b(y) \exp(\eta T(y) - a(\eta))$ Where:

  • $\eta$: Natural parameter (also called the canonical parameter).
  • $T(y)$: Sufficient statistic (often $T(y) = y$).
  • $a(\eta)$: Log-partition function.
  • $b(y)$: Base measure.

The term $\exp(-a(\eta))$ acts as a normalization constant, ensuring that the distribution integrates (or sums) to one.

Common Exponential Distributions

Distribution   $\eta$                           $T(y)$   $a(\eta)$                           $b(y)$
Bernoulli      $\log(\frac{\phi}{1-\phi})$      $y$      $\log(1+e^{\eta})$                  $1$
Gaussian       $\mu$                            $y$      $\frac{\eta^2}{2}$                  $\frac{1}{\sqrt{2\pi}}\exp(-\frac{y^2}{2})$
Poisson        $\log(\lambda)$                  $y$      $e^{\eta}$                          $\frac{1}{y!}$
Geometric      $\log(1-\phi)$                   $y$      $\log(\frac{e^{\eta}}{1-e^{\eta}})$ $1$
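The Bernoulli row of the table can be checked numerically: writing the pmf in exponential-family form with the listed $\eta$, $T(y)$, $a(\eta)$, and $b(y)$ should reproduce $\phi$ and $1-\phi$. A small sketch (the value of $\phi$ is arbitrary):

```python
import math

def bernoulli_expfam(y, phi):
    # Bernoulli in exponential-family form:
    # eta = log(phi/(1-phi)), T(y) = y, a(eta) = log(1+e^eta), b(y) = 1
    eta = math.log(phi / (1 - phi))
    return math.exp(eta * y - math.log(1 + math.exp(eta)))

phi = 0.3
print(bernoulli_expfam(1, phi), bernoulli_expfam(0, phi))  # 0.3 and 0.7
```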

Assumptions of GLMs

GLMs aim to predict a random variable $y$ as a function of $x \in \mathbb{R}^{n+1}$ based on three key assumptions:

  1. $y|x;\theta \sim \mathrm{ExpFamily}(\eta)$ (the conditional distribution of $y$ given $x$ belongs to the exponential family).
  2. $h_\theta(x) = E[y|x;\theta]$ (the model predicts the expected value of $y$).
  3. $\eta = \theta^T x$ (the natural parameter is a linear function of the input features).
Ordinary Least Squares (Linear Regression) and Logistic Regression are special cases of Generalized Linear Models.

Support Vector Machines (SVMs)

The goal of SVMs is to find an optimal hyperplane that maximizes the minimum distance to the closest training examples from any class (the margin).

Optimal Margin Classifier

The optimal margin classifier $h(x)$ is given by:

$h(x) = \operatorname{sign}(w^T x - b)$

where $(w, b) \in \mathbb{R}^n \times \mathbb{R}$ are the parameters obtained by solving the following optimization problem:

$\min \frac{1}{2}\|w\|^2 \quad \text{such that} \quad y^{(i)}(w^T x^{(i)} - b) \geq 1$

The decision boundary is defined by $w^T x - b = 0$.

Hinge Loss

The hinge loss function, specific to SVMs, is defined as:

$L(z, y) = [1 - yz]_+ = \max(0, 1 - yz)$

Kernel Function

Given a feature mapping $\phi$, the kernel $K$ is defined as the inner product of the mapped features:

$K(x, z) = \phi(x)^T \phi(z)$

A commonly used kernel is the Gaussian kernel: $K(x, z) = \exp\left(-\frac{\|x - z\|^2}{2\sigma^2}\right)$.

The "kernel trick" allows us to compute the cost function using the kernel $K(x, z)$ without explicitly knowing the often complex feature mapping $\phi$. This is crucial for handling non-linear separability by implicitly mapping data into higher-dimensional spaces.
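The kernel trick in practice means working with the Gram matrix of pairwise kernel values rather than explicit features. A minimal sketch with the Gaussian kernel on an illustrative toy dataset:

```python
import numpy as np

def gaussian_kernel(x, z, sigma=1.0):
    # K(x, z) = exp(-||x - z||^2 / (2 sigma^2))
    return np.exp(-np.sum((x - z) ** 2) / (2 * sigma ** 2))

# Gram matrix K[i, j] = K(x_i, x_j) for a toy dataset; the (effectively
# infinite-dimensional) feature map phi is never formed explicitly.
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
K = np.array([[gaussian_kernel(a, b) for b in X] for a in X])
print(K)  # symmetric, with ones on the diagonal
```

A valid kernel yields a symmetric positive semi-definite Gram matrix, which is exactly what kernelized training algorithms rely on.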

Lagrangian

The Lagrangian $\mathcal{L}(w, b)$ is used in optimization with constraints:

$\mathcal{L}(w, b) = f(w) + \sum_{i=1}^{l} \beta_i h_i(w)$

The coefficients $\beta_i$ are known as Lagrange multipliers.

Generative Learning Algorithms

Generative models learn the joint probability distribution $P(x, y)$, or equivalently $P(x|y)$ and $P(y)$, which then allows them to infer $P(y|x)$ using Bayes' rule.

Gaussian Discriminant Analysis (GDA)

GDA assumes specific distributions for $y$ and $x|y$.

  • Setting:
    • $y \sim \mathrm{Bernoulli}(\phi)$
    • $x \mid y = 0 \sim \mathcal{N}(\mu_0, \Sigma)$
    • $x \mid y = 1 \sim \mathcal{N}(\mu_1, \Sigma)$
  • Estimation: Parameters are found by maximizing the likelihood:
    • $\phi = \frac{1}{m}\sum_{i=1}^{m} 1\{y^{(i)} = 1\}$
    • $\mu_j = \frac{\sum_{i=1}^{m} 1\{y^{(i)} = j\}\, x^{(i)}}{\sum_{i=1}^{m} 1\{y^{(i)} = j\}}$ for $j = 0, 1$
    • $\Sigma = \frac{1}{m}\sum_{i=1}^{m} (x^{(i)} - \mu_{y^{(i)}})(x^{(i)} - \mu_{y^{(i)}})^T$
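The three estimators above are just class-wise means and a pooled covariance, which makes them easy to sketch. A minimal version on an illustrative two-cluster toy dataset:

```python
import numpy as np

def gda_fit(X, y):
    # Maximum-likelihood estimates for GDA with a shared covariance matrix.
    m = len(y)
    phi = np.mean(y == 1)                 # fraction of positive labels
    mu0 = X[y == 0].mean(axis=0)          # class-0 mean
    mu1 = X[y == 1].mean(axis=0)          # class-1 mean
    centered = X - np.where(y[:, None] == 1, mu1, mu0)
    Sigma = centered.T @ centered / m     # pooled covariance
    return phi, mu0, mu1, Sigma

# Toy data: two small clusters, labels 0 and 1
X = np.array([[0.0, 0.0], [0.2, -0.1], [2.0, 2.0], [1.8, 2.1]])
y = np.array([0, 0, 1, 1])
phi, mu0, mu1, Sigma = gda_fit(X, y)
print(phi, mu0, mu1)
```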

Naive Bayes

A generative model that simplifies the estimation by assuming feature independence.

  • Assumption: Features $x_1, x_2, \dots$ are conditionally independent given $y$: $P(x|y) = P(x_1, x_2, \dots|y) = \prod_{i=1}^{n} P(x_i|y)$
  • Solutions: Maximizing the log-likelihood yields the following estimates for $k \in \{0, 1\}$ and feature value $l \in [\![1, L]\!]$: $P(y = k) = \frac{1}{m} \times \#\{j \mid y^{(j)} = k\}$ and $P(x_i = l \mid y = k) = \frac{\#\{j \mid y^{(j)} = k \text{ and } x_i^{(j)} = l\}}{\#\{j \mid y^{(j)} = k\}}$

    Naive Bayes is widely used in text classification and spam detection due to its simplicity and effectiveness.
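Since the Naive Bayes estimates are pure frequency counts, they can be sketched with a few dictionaries. An illustrative toy dataset with two binary features (names and data are assumptions for the example; real implementations usually add Laplace smoothing for unseen values):

```python
from collections import Counter

def nb_fit(X, y):
    # Frequency estimates of P(y=k) and P(x_i = l | y = k) by counting.
    m = len(y)
    prior = {k: c / m for k, c in Counter(y).items()}
    cond = {}
    for k in prior:
        rows = [x for x, label in zip(X, y) if label == k]
        for i in range(len(X[0])):
            counts = Counter(row[i] for row in rows)
            for l, c in counts.items():
                cond[(i, l, k)] = c / len(rows)  # P(x_i = l | y = k)
    return prior, cond

# Toy data: two binary features, binary label
X = [(1, 0), (1, 1), (0, 1), (0, 0)]
y = [1, 1, 0, 0]
prior, cond = nb_fit(X, y)
print(prior[1], cond[(0, 1, 1)])  # 0.5 and 1.0
```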

Tree-Based Methods

These methods can be used for both regression and classification problems.

  • CART (Classification and Regression Trees): Commonly known as decision trees, these are binary trees. They are highly interpretable.
  • Random Forest: A tree-based ensemble method that builds a large number of decision trees from randomly selected subsets of features and data. It typically offers good performance but is less interpretable than single decision trees.

Boosting Methods

Boosting combines several "weak learners" to form a stronger learner.

  • Adaptive Boosting (AdaBoost): Assigns higher weights to misclassified examples in subsequent boosting steps.
  • Gradient Boosting: Weak learners are trained on the residuals (remaining errors) of previous learners.

K-Nearest Neighbors (K-NN)

The $k$-NN algorithm is a non-parametric approach where the response of a new data point is determined by the majority class (for classification) or the average (for regression) of its $k$ closest neighbors in the training set.

The choice of $k$ influences the model's bias and variance:
  • A higher $k$ leads to higher bias (smoother decision boundaries, less sensitive to noise).
  • A lower $k$ leads to higher variance (more complex decision boundaries, more sensitive to noise).
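The classification variant described above fits in a few lines: compute distances, take the $k$ nearest, and vote. A minimal sketch on an illustrative two-cluster toy dataset:

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k):
    # Majority vote among the k nearest training points (Euclidean distance).
    dists = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(dists)[:k]
    return Counter(y_train[i] for i in nearest).most_common(1)[0][0]

X_train = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]])
y_train = ["a", "a", "b", "b"]
print(knn_predict(X_train, y_train, np.array([0.5, 0.5]), k=3))  # "a"
```

Choosing an odd $k$ avoids ties in binary classification; for regression the vote would simply be replaced by a mean over the neighbors.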

Learning Theory

Learning theory provides theoretical foundations for understanding how and why algorithms learn effectively.

  • Union Bound: For $k$ events $A_1, \dots, A_k$, the probability of at least one of these events occurring is bounded: $P\left(\bigcup_{i=1}^{k} A_i\right) \le \sum_{i=1}^{k} P(A_i)$
  • Hoeffding Inequality (Chernoff Bound): For $m$ i.i.d. Bernoulli variables $Z_1, \dots, Z_m$ with mean $\phi$, and sample mean $\widehat{\phi}$, for any $\gamma > 0$: $P\left(\left|\phi - \widehat{\phi}\right| > \gamma\right) \le 2\exp\left(-2\gamma^2 m\right)$
  • Training Error ($\widehat{\epsilon}(h)$): Also known as empirical risk or empirical error, it measures the proportion of mistakes a classifier $h$ makes on the training set: $\widehat{\epsilon}(h) = \frac{1}{m}\sum_{i=1}^{m} 1_{\{h(x^{(i)}) \neq y^{(i)}\}}$
  • Probably Approximately Correct (PAC): A framework for analyzing the theoretical learnability of a concept. Its assumptions are:
    • Training and test sets follow the same distribution.
    • Training examples are drawn independently.
  • Shattering: A hypothesis class $\mathcal{H}$ shatters a set of data points $S = \{x^{(1)}, \dots, x^{(d)}\}$ if, for any possible labeling $\{y^{(1)}, \dots, y^{(d)}\}$, there is a classifier that fits it exactly: $\exists h \in \mathcal{H},\ \forall i \in [\![1, d]\!],\ h(x^{(i)}) = y^{(i)}$
  • Upper Bound Theorem: For a finite hypothesis class $\mathcal{H}$ of size $k$, with probability at least $1 - \delta$, the true error $\epsilon(\widehat{h})$ of the learned hypothesis $\widehat{h}$ is bounded by: $\epsilon(\widehat{h}) \le \left(\min_{h \in \mathcal{H}} \epsilon(h)\right) + 2\sqrt{\frac{1}{2m}\log\left(\frac{2k}{\delta}\right)}$
  • VC Dimension (Vapnik-Chervonenkis Dimension): For an infinite hypothesis class $\mathcal{H}$, $\mathrm{VC}(\mathcal{H})$ is the size of the largest set that can be shattered by $\mathcal{H}$.

    For example, the VC dimension of linear classifiers in 2 dimensions is 3.

  • Vapnik's Theorem: For a hypothesis class $\mathcal{H}$ with VC dimension $d$ and $m$ training examples, with probability at least $1 - \delta$: $\epsilon(\widehat{h}) \le \left(\min_{h \in \mathcal{H}} \epsilon(h)\right) + O\left(\sqrt{\frac{d}{m}\log\left(\frac{m}{d}\right) + \frac{1}{m}\log\left(\frac{1}{\delta}\right)}\right)$ This theorem provides a generalization bound for infinite hypothesis classes.
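The Hoeffding inequality above can be sanity-checked by simulation: the empirical frequency of large deviations should stay below the bound $2\exp(-2\gamma^2 m)$. A small sketch with illustrative values of $\phi$, $m$, and $\gamma$:

```python
import math
import random

def deviation_prob(phi, m, gamma, trials=5000, seed=0):
    # Empirical estimate of P(|phi - phi_hat| > gamma), where phi_hat is the
    # sample mean of m i.i.d. Bernoulli(phi) draws (toy Monte Carlo check).
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        phi_hat = sum(rng.random() < phi for _ in range(m)) / m
        if abs(phi - phi_hat) > gamma:
            hits += 1
    return hits / trials

phi, m, gamma = 0.5, 100, 0.1
bound = 2 * math.exp(-2 * gamma ** 2 * m)  # Hoeffding: 2 exp(-2 gamma^2 m)
emp = deviation_prob(phi, m, gamma)
print(emp, "<=", bound)
```

For these values the bound is about 0.27, while the observed deviation frequency is much smaller, illustrating that Hoeffding is a worst-case (and often loose) guarantee.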

Key Takeaways

  • Supervised learning involves building models to predict outcomes from labeled data.
  • Models are categorized as regression (continuous outcomes) or classification (categorical outcomes), and as discriminative or generative.
  • Fundamental concepts include hypothesis, loss functions, cost functions, and optimization algorithms like gradient descent and Newton's method.
  • Popular algorithms include Linear Regression, Logistic Regression, SVMs, GDA, Naive Bayes, Decision Trees, Random Forests, and K-NN.
  • Generalized Linear Models offer a unifying framework for various exponential family distributions.
  • Learning theory provides bounds and insights into model generalization through concepts like PAC learning, shattering, and VC dimension.
