Supervised Learning Models and Algorithms

39 cards

An overview of supervised learning models, algorithms, and key concepts such as regression, classification, loss functions, and optimization techniques.

Question
What is a 'hypothesis' in machine learning?
Answer
In machine learning, a hypothesis, denoted h_θ, is the chosen model that predicts an output h_θ(x^(i)) for a given input x^(i). It represents the learned function.
Question
How is a 'loss function' defined?
Answer
A loss function, L : (z, y) ∈ ℝ × Y ↦ L(z, y) ∈ ℝ, quantifies the difference between a predicted value z and the true value y. It serves as a measure of error.
Question
What is the purpose of a 'cost function'?
Answer
The cost function J evaluates a model's overall performance by summing the losses over all training examples: J(θ) = Σ_{i=1}^m L(h_θ(x^(i)), y^(i)).
Question
Explain 'gradient descent' in machine learning.
Answer
Gradient descent is an optimization algorithm that updates the model parameters θ by iteratively moving in the direction opposite to the gradient of the cost function J. The update rule is θ ← θ − α∇J(θ), where α is the learning rate.
Question
What is the 'likelihood' of a model?
Answer
The likelihood L(θ) of a model indicates how probable the observed data is given the model's parameters θ. Maximizing the likelihood yields the optimal parameters: θ_opt = argmax_θ L(θ).
Question
What is the log-likelihood?
Answer
The log-likelihood, ℓ(θ) = log(L(θ)), is the natural logarithm of the likelihood function. It is often optimized instead of the likelihood itself because it converts products into sums, simplifying calculations.
Question
What is 'Newton's algorithm' used for?
Answer
Newton's algorithm is a numerical method for finding the optimal parameters θ by locating the point where the derivative of the log-likelihood, ℓ'(θ), is zero. Its update rule is θ ← θ − ℓ'(θ)/ℓ''(θ).
Question
Describe the 'Newton-Raphson method'.
Answer
The Newton-Raphson method is the multidimensional generalization of Newton's algorithm. Its update rule is θ ← θ − (∇²_θ ℓ(θ))⁻¹ ∇_θ ℓ(θ), involving the inverse of the Hessian of the log-likelihood.
Question
What are 'Normal equations' in linear regression?
Answer
Normal equations provide a closed-form solution for the parameter θ that minimizes the cost function in linear regression. Given the design matrix X, the solution is θ = (XᵀX)⁻¹Xᵀy.
Question
Explain the 'LMS algorithm'.
Answer
The Least Mean Squares (LMS) algorithm, or Widrow-Hoff learning rule, is an iterative update rule for the parameters θ_j in linear regression. It adjusts each θ_j based on the sum of errors over the training examples: θ_j ← θ_j + α Σ_{i=1}^m [y^(i) − h_θ(x^(i))] x_j^(i).
Question
What is 'Locally Weighted Regression' (LWR)?
Answer
LWR is a variant of linear regression in which each training example's contribution to the cost function is weighted by its proximity to the query point x. The weight w^(i)(x) is an exponential function of the squared distance between x^(i) and x, controlled by the bandwidth parameter τ.
Question
What is the 'sigmoid function'?
Answer
The sigmoid function, also known as the logistic function, is defined as g(z) = 1/(1 + e^(−z)). It maps any real number z to a value between 0 and 1, making it suitable for modeling probabilities.
Question
How is 'logistic regression' defined?
Answer
Logistic regression models the probability of a binary outcome y = 1 given x and parameters θ. It assumes y|x;θ follows a Bernoulli distribution, with p(y=1|x;θ) = 1/(1 + exp(−θᵀx)) = g(θᵀx).
Question
What is 'Softmax regression'?
Answer
Softmax regression generalizes logistic regression to more than two outcome classes. For each class i, the parameter φ_i is given by φ_i = exp(θ_iᵀx) / Σ_{j=1}^K exp(θ_jᵀx), with θ_K = 0 by convention.
Question
What defines an 'exponential family' distribution?
Answer
A distribution belongs to the exponential family if its probability density function can be written as p(y;η) = b(y) exp(η T(y) − a(η)), where η is the natural parameter, T(y) is the sufficient statistic, a(η) is the log-partition function, and b(y) is the base measure.
Question
What are the assumptions of 'Generalized Linear Models' (GLMs)?
Answer
GLMs rely on three assumptions: (1) y|x;θ follows an exponential family distribution, (2) the hypothesis h_θ(x) is the expected value of y given x and θ, and (3) the natural parameter is a linear combination of the features: η = θᵀx.
Question
Can ordinary least squares and logistic regression be GLMs?
Answer
Yes, ordinary least squares (linear regression) and logistic regression are both considered special cases of Generalized Linear Models (GLMs). This highlights the broad applicability and unifying framework of GLMs.
Question
What is the goal of 'Support Vector Machines' (SVMs)?
Answer
The goal of Support Vector Machines (SVMs) is to find an optimal separating hyperplane that maximizes the minimum distance (the margin) between data points of different classes. This typically leads to better generalization.
Question
What is an 'optimal margin classifier'?
Answer
An optimal margin classifier h(x) = sign(wᵀx − b) in SVMs is found by solving an optimization problem: minimize (1/2)||w||² subject to the constraint y^(i)(wᵀx^(i) − b) ≥ 1 for all data points.
Question
What is 'hinge loss'?
Answer
Hinge loss is a loss function specifically used in SVMs. It is defined as L(z, y) = [1 − yz]₊ = max(0, 1 − yz), penalizing predictions that are not confidently correct.
Question
How is a 'kernel' defined in SVMs?
Answer
Given a feature mapping φ, the kernel K is defined as the inner product of the mapped features: K(x, z) = φ(x)ᵀφ(z). It allows SVMs to operate in a high-dimensional feature space without explicitly computing the mapping.
Question
What is the 'Gaussian kernel'?
Answer
The Gaussian kernel, also known as the Radial Basis Function (RBF) kernel, is a commonly used kernel in SVMs. It is defined as K(x, z) = exp(−||x − z||²/(2σ²)), where σ controls the kernel's width.
Question
What is the 'kernel trick'?
Answer
The kernel trick allows calculating the cost function in SVMs using the kernel K(x,z) directly, without needing to explicitly know the often complex feature mapping φ. This makes computations efficient in high-dimensional spaces.
Question
What is a 'Lagrangian' in optimization?
Answer
In optimization, the Lagrangian is a function that incorporates constraints into the objective. It is defined as L(w, b) = f(w) + Σ_{i=1}^l β_i h_i(w), where the coefficients β_i are the Lagrange multipliers.
Question
What is a 'generative model'?
Answer
A generative model aims to learn the underlying distribution of the data, P(x|y), which describes how the data is generated. This can then be used to estimate the conditional probability P(y|x) using Bayes' rule.
Question
What are the assumptions of 'Gaussian Discriminant Analysis' (GDA)?
Answer
GDA assumes that the labels y follow a Bernoulli distribution, and that the features x conditioned on each class y=0 or y=1 follow a multivariate normal distribution with means μ0, μ1 and a shared covariance matrix Σ.
Question
How are parameters estimated in GDA?
Answer
In GDA, parameters like φ (for y's Bernoulli distribution), μj (mean for x|y=j), and Σ (covariance matrix for x|y) are estimated by maximizing the likelihood of the observed data.
Question
What is the core assumption of the 'Naive Bayes' model?
Answer
The Naive Bayes model assumes that all features x_i are conditionally independent given the class label y. This simplifies the joint probability P(x|y) to a product of individual conditional probabilities: P(x|y) = Π_{i=1}^n P(x_i|y).
Question
How are parameters P(y=k) and P(xi=l|y=k) estimated in Naive Bayes?
Answer
In Naive Bayes, P(y=k) is estimated as the frequency of class k in the training set, and P(x_i=l|y=k) is estimated as the frequency of feature x_i taking the value l among the examples of class k.
Question
For what applications is Naive Bayes widely used?
Answer
Naive Bayes is widely used for text classification tasks, such as spam detection, sentiment analysis, and categorizing documents. Its simplicity and effectiveness make it a popular choice in these areas.
Question
What are 'CART' trees?
Answer
CART, or Classification and Regression Trees, are commonly known as decision trees. They are represented as binary trees and are highly interpretable models used for both classification and regression problems.
Question
What is a 'Random forest'?
Answer
A Random Forest is an ensemble learning method that builds a large number of decision trees using randomly selected subsets of features. While less interpretable than single decision trees, they generally offer high performance and robustness.
Question
What is 'Boosting' in machine learning?
Answer
Boosting is an ensemble technique that combines several 'weak learners' into a 'stronger learner'. It iteratively trains models to correct the errors of previous models, gradually improving overall prediction accuracy.
Question
Explain 'Adaptive boosting' (AdaBoost).
Answer
Adaptive boosting, or AdaBoost, focuses on improving performance by assigning higher weights to misclassified examples in each boosting step. Subsequent weak learners concentrate more on these difficult-to-classify data points.
Question
What is 'Gradient boosting'?
Answer
Gradient boosting is a boosting method where weak learners, typically decision trees, are trained on the residual errors (gradients) of the previous models. This iterative process allows the ensemble to progressively reduce the overall error.
Question
What is the 'k-nearest neighbors' (k-NN) algorithm?
Answer
The k-nearest neighbors (k-NN) algorithm is a non-parametric method used for classification and regression. It predicts the response of a data point based on the majority class (for classification) or average (for regression) of its k closest neighbors in the training set.
Question
How does the parameter k affect k-NN?
Answer
In k-NN, a higher value of k leads to higher bias and lower variance, making the model smoother and less sensitive to noise. Conversely, a lower k results in lower bias and higher variance, making it more susceptible to noise.
Question
What is the 'Union bound' inequality?
Answer
The union bound states that for any k events A_1, ..., A_k, the probability that at least one of them occurs is at most the sum of their individual probabilities: P(A_1 ∪ ... ∪ A_k) ≤ Σ_{i=1}^k P(A_i).
Question
What is 'Hoeffding's inequality'?
Answer
Hoeffding's inequality bounds the probability that the sample mean φ̂ of m i.i.d. Bernoulli variables deviates from the true mean φ by more than γ: P(|φ − φ̂| > γ) ≤ 2 exp(−2γ²m).

Supervised Learning Fundamentals

Supervised learning is a machine learning paradigm where models learn from labeled data to make predictions. The goal is to build a classifier that can learn to predict an outcome variable y from input features x.

Types of Predictive Models

|  | Regression | Classification |
|---|---|---|
| Outcome | Continuous | Categorical (class) |
| Examples | Linear regression | Logistic regression, SVM, Naive Bayes |

Types of Machine Learning Models

|  | Discriminative Model | Generative Model |
|---|---|---|
| Goal | Directly estimate $P(y \mid x)$ | Estimate $P(x \mid y)$ to deduce $P(y \mid x)$ |
| What's learned | Decision boundary | Probability distributions of the data |
| Examples | Regressions, SVMs | GDA, Naive Bayes |

Generative models first learn how the data is generated by estimating $P(x \mid y)$, then use Bayes' rule to estimate $P(y \mid x)$.

Core Concepts in Supervised Learning

  • Hypothesis ($h_\theta$): This is the chosen model. For a given input $x^{(i)}$, the model's prediction is $h_\theta(x^{(i)})$.
  • Loss function ($L$): A function that quantifies the difference between a predicted value $z$ and the real data value $y$.

    | Squared loss | Logistic loss | Hinge loss | Cross-entropy loss |
    |---|---|---|---|
    | $\frac{1}{2}(y-z)^2$ | $\log(1+\exp(-yz))$ | $\max(0,1-yz)$ | $-[y\log(z)+(1-y)\log(1-z)]$ |
    | Linear regression | Logistic regression | SVM | Neural networks |

  • Cost function ($J(\theta)$): Commonly used to assess model performance, defined as the sum of the loss over all training examples: $J(\theta) = \sum_{i=1}^{m} L(h_\theta(x^{(i)}), y^{(i)})$
  • Gradient descent: An optimization algorithm to minimize the cost function. With learning rate $\alpha$, the update rule is: $\theta \leftarrow \theta - \alpha \nabla J(\theta)$

    Stochastic gradient descent (SGD) updates the parameters from one training example at a time; batch gradient descent computes each update from the full batch of training examples.

  • Likelihood ($L(\theta)$): Used to find the optimal parameters $\theta$ by maximizing the likelihood of the model: $\theta^{\mathrm{opt}} = \arg\max_\theta L(\theta)$. In practice, the log-likelihood $\ell(\theta) = \log(L(\theta))$ is often optimized for simplicity.
  • Newton's algorithm: A numerical method to find $\theta$ such that $\ell'(\theta) = 0$. The update rule is: $\theta \leftarrow \theta - \frac{\ell'(\theta)}{\ell''(\theta)}$. The multidimensional generalization (Newton-Raphson method) uses: $\theta \leftarrow \theta - \left(\nabla_\theta^2 \ell(\theta)\right)^{-1} \nabla_\theta \ell(\theta)$
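
The gradient descent update above can be sketched in a few lines of NumPy. This is a minimal illustration, not a production implementation: the toy data, learning rate, and iteration count are arbitrary choices, and the cost is the squared-loss $J(\theta) = \frac{1}{2}\sum_i (h_\theta(x^{(i)}) - y^{(i)})^2$ for a linear hypothesis.

```python
import numpy as np

# Toy data generated from y = 1 + 2x, with an intercept column in X.
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])

theta = np.zeros(2)
alpha = 0.1  # learning rate

# Batch gradient descent: theta <- theta - alpha * grad J(theta)
for _ in range(1000):
    grad = X.T @ (X @ theta - y)  # gradient of the squared-loss cost
    theta -= alpha * grad

print(theta)  # converges to approximately [1, 2]
```

Because the data is noiseless, the iterates converge to the exact minimizer; with a learning rate that is too large, the same loop would diverge.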

Regression Models

Regression models predict continuous outcome variables.

Linear Regression

Assumes the outcome $y$ given $x$ and $\theta$ follows a Normal distribution: $y \mid x; \theta \sim \mathcal{N}(\mu, \sigma^2)$.

  • Normal equations: For a design matrix $X$, the $\theta$ that minimizes the cost function has a closed-form solution: $\theta = (X^T X)^{-1} X^T y$
  • LMS algorithm (Widrow-Hoff): An iterative update rule for a training set of $m$ data points with learning rate $\alpha$: $\forall j, \quad \theta_j \leftarrow \theta_j + \alpha \sum_{i=1}^{m} \left[y^{(i)} - h_\theta(x^{(i)})\right] x_j^{(i)}$. This is a special case of gradient ascent.
  • Locally weighted regression (LWR): A variant in which each training example is weighted in the cost function by $w^{(i)}(x) = \exp\left(-\frac{(x^{(i)} - x)^2}{2\tau^2}\right)$, where $\tau$ is a parameter.
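
The normal equations can be checked numerically. The sketch below builds a random design matrix (noiseless targets, so the recovery is exact) and solves $(X^T X)\theta = X^T y$; `np.linalg.solve` is used rather than forming the inverse explicitly, which is the numerically safer equivalent.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 50, 3
# Design matrix with an intercept column of ones.
X = np.hstack([np.ones((m, 1)), rng.normal(size=(m, n))])
true_theta = np.array([1.0, -2.0, 0.5, 3.0])
y = X @ true_theta  # noiseless targets for the illustration

# Normal equations: solve (X^T X) theta = X^T y
theta = np.linalg.solve(X.T @ X, X.T @ y)
print(theta)  # recovers true_theta
```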

Classification Models

Classification models predict categorical outcome variables.

Logistic Regression

Used for binary classification; assumes $y \mid x; \theta \sim \mathrm{Bernoulli}(\phi)$.

  • Sigmoid function ($g$): Also known as the logistic function, it maps any real number to a value between 0 and 1: $\forall z \in \mathbb{R}, \quad g(z) = \frac{1}{1 + e^{-z}} \in\ ]0,1[$
  • Hypothesis: The probability of $y = 1$ given $x$ and $\theta$ is: $\phi = p(y = 1 \mid x; \theta) = \frac{1}{1 + \exp(-\theta^T x)} = g(\theta^T x)$

    There is no closed-form solution for logistic regression; the parameters are found with iterative optimization methods such as gradient descent.

Softmax Regression

A generalization of logistic regression for more than two outcome classes, also called multiclass logistic regression. By convention, $\theta_K = 0$. The Bernoulli parameter $\phi_i$ for each class $i$ is: $\phi_i = \frac{\exp(\theta_i^T x)}{\sum_{j=1}^{K} \exp(\theta_j^T x)}$
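
Both functions are short enough to transcribe directly. A minimal sketch (the max-subtraction in the softmax is a standard numerical-stability trick, not part of the formula itself); note that a two-class softmax with scores $[z, 0]$ reduces to the sigmoid of $z$:

```python
import numpy as np

def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def softmax(scores):
    """phi_i = exp(s_i) / sum_j exp(s_j), shifted by max(s) for stability."""
    shifted = scores - np.max(scores)
    e = np.exp(shifted)
    return e / e.sum()

print(sigmoid(0.0))                        # 0.5
print(softmax(np.array([2.0, 1.0, 0.1])))  # probabilities summing to 1
```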

Generalized Linear Models (GLMs)

GLMs provide a flexible way to model response variables that follow various distributions, not just the normal distribution, by relating the linear predictor to the response variable via a link function.

Exponential Family Distributions

A distribution belongs to the exponential family if it can be written as: $p(y; \eta) = b(y) \exp(\eta T(y) - a(\eta))$ where:

  • $\eta$: natural parameter (also called the canonical parameter).
  • $T(y)$: sufficient statistic (often $T(y) = y$).
  • $a(\eta)$: log-partition function.
  • $b(y)$: base measure.

The term $\exp(-a(\eta))$ acts as a normalization constant, ensuring the probabilities sum to one.

Common Exponential Distributions

| Distribution | $\eta$ | $T(y)$ | $a(\eta)$ | $b(y)$ |
|---|---|---|---|---|
| Bernoulli | $\log\frac{\phi}{1-\phi}$ | $y$ | $\log(1+e^{\eta})$ | $1$ |
| Gaussian | $\mu$ | $y$ | $\frac{\eta^2}{2}$ | $\frac{1}{\sqrt{2\pi}}\exp\left(-\frac{y^2}{2}\right)$ |
| Poisson | $\log(\lambda)$ | $y$ | $e^{\eta}$ | $\frac{1}{y!}$ |
| Geometric | $\log(1-\phi)$ | $y$ | $\log\left(\frac{e^{\eta}}{1-e^{\eta}}\right)$ | $1$ |
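
The Bernoulli row can be verified numerically: plugging $\eta = \log\frac{\phi}{1-\phi}$, $T(y)=y$, $a(\eta)=\log(1+e^{\eta})$, and $b(y)=1$ into the exponential-family form reproduces the usual pmf $\phi^y(1-\phi)^{1-y}$. A small sketch, with $\phi = 0.3$ chosen arbitrarily:

```python
import numpy as np

phi = 0.3                      # Bernoulli parameter (arbitrary)
eta = np.log(phi / (1 - phi))  # natural parameter
a = np.log(1 + np.exp(eta))    # log-partition function; b(y) = 1

for y in (0, 1):
    p_expfam = np.exp(eta * y - a)          # exponential-family form
    p_direct = phi**y * (1 - phi)**(1 - y)  # standard Bernoulli pmf
    print(y, p_expfam, p_direct)            # the two agree
```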

Assumptions of GLMs

GLMs aim to predict a random variable $y$ as a function of $x \in \mathbb{R}^{n+1}$, based on three key assumptions:

  1. $y \mid x; \theta \sim \operatorname{ExpFamily}(\eta)$ (the conditional distribution of $y$ given $x$ belongs to the exponential family).
  2. $h_\theta(x) = E[y \mid x; \theta]$ (the model predicts the expected value of $y$).
  3. $\eta = \theta^T x$ (the natural parameter is a linear function of the input features).

Ordinary least squares (linear regression) and logistic regression are special cases of Generalized Linear Models.

Support Vector Machines (SVMs)

The goal of SVMs is to find an optimal hyperplane that maximizes the minimum distance to the closest training examples from any class (the margin).

Optimal Margin Classifier

The optimal margin classifier $h(x)$ is given by:

$h(x) = \operatorname{sign}(w^T x - b)$

where $(w, b) \in \mathbb{R}^n \times \mathbb{R}$ are the parameters obtained by solving the optimization problem:

$\min \frac{1}{2}\|w\|^2 \quad \text{such that} \quad y^{(i)}(w^T x^{(i)} - b) \geq 1$

The decision boundary is defined by $w^T x - b = 0$.

Hinge Loss

The hinge loss function, specific to SVMs, is defined as:

$L(z, y) = [1 - yz]_+ = \max(0, 1 - yz)$
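
A direct transcription of the hinge loss, assuming labels $y \in \{-1, +1\}$ and a real-valued score $z$:

```python
def hinge_loss(z, y):
    """L(z, y) = max(0, 1 - y*z) for labels y in {-1, +1}."""
    return max(0.0, 1.0 - y * z)

print(hinge_loss(2.0, 1))    # 0.0: confidently correct, no penalty
print(hinge_loss(0.5, 1))    # 0.5: correct but inside the margin
print(hinge_loss(-1.0, 1))   # 2.0: misclassified
```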

Kernel Function

Given a feature mapping ϕ\phi, the kernel KK is defined as the inner product of the mapped features:

$K(x, z) = \phi(x)^T \phi(z)$

A commonly used kernel is the Gaussian kernel: $K(x, z) = \exp\left(-\frac{\|x - z\|^2}{2\sigma^2}\right)$.

The "kernel trick" allows the cost function to be computed using the kernel $K(x, z)$ without explicitly knowing the often complex feature mapping $\phi$. This is crucial for handling non-linear separability by mapping data into higher-dimensional spaces.
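
The trick can be illustrated with a degree-2 polynomial kernel, $K(x, z) = (x^T z)^2$ (a different kernel from the Gaussian one above, chosen because its feature map is easy to write out): evaluating the kernel directly in $O(n)$ agrees with the inner product of the explicit $O(n^2)$-dimensional feature vectors.

```python
import numpy as np

def phi(x):
    """Explicit feature map for K(x, z) = (x^T z)^2:
    all pairwise products x_i * x_j."""
    return np.outer(x, x).ravel()

x = np.array([1.0, 2.0, 3.0])
z = np.array([0.5, -1.0, 2.0])

k_trick = (x @ z) ** 2          # kernel trick: O(n) work
k_explicit = phi(x) @ phi(z)    # explicit mapping: O(n^2) features
print(k_trick, k_explicit)      # both equal 20.25
```

For the Gaussian kernel the corresponding $\phi$ is infinite-dimensional, so the explicit route is not even possible; only the kernel evaluation is.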

Lagrangian

The Lagrangian L(w,b)\mathcal{L}(w,b) is used in optimization with constraints:

$\mathcal{L}(w, b) = f(w) + \sum_{i=1}^{l} \beta_i h_i(w)$

The coefficients βi\beta_i are known as Lagrange multipliers.

Generative Learning Algorithms

Generative models learn the joint probability distribution $P(x, y)$, or $P(x \mid y)$ and $P(y)$, which then allows them to infer $P(y \mid x)$ using Bayes' rule.

Gaussian Discriminant Analysis (GDA)

GDA assumes specific distributions for yy and xyx|y.

  • Setting:
    • $y \sim \operatorname{Bernoulli}(\phi)$
    • $x \mid y = 0 \sim \mathcal{N}(\mu_0, \Sigma)$
    • $x \mid y = 1 \sim \mathcal{N}(\mu_1, \Sigma)$
  • Estimation: Parameters are found by maximizing the likelihood:

    | $\phi$ | $\mu_j \ (j = 0, 1)$ | $\Sigma$ |
    |---|---|---|
    | $\frac{1}{m}\sum_{i=1}^{m} 1\{y^{(i)}=1\}$ | $\frac{\sum_{i=1}^{m} 1\{y^{(i)}=j\}\, x^{(i)}}{\sum_{i=1}^{m} 1\{y^{(i)}=j\}}$ | $\frac{1}{m}\sum_{i=1}^{m} (x^{(i)}-\mu_{y^{(i)}})(x^{(i)}-\mu_{y^{(i)}})^T$ |
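
A quick simulation-based sketch of these maximum-likelihood estimates: data is drawn from the GDA model with arbitrary true parameters, and the class frequency, class means, and pooled covariance are recovered by counting and averaging.

```python
import numpy as np

rng = np.random.default_rng(1)
m = 2000
# Simulate the GDA setting: y ~ Bernoulli(0.4), x | y=j ~ N(mu_j, I).
y = (rng.random(m) < 0.4).astype(int)
mu_true = np.array([[0.0, 0.0], [3.0, 3.0]])
x = mu_true[y] + rng.normal(size=(m, 2))

# Maximum-likelihood estimates
phi = y.mean()                                      # class frequency
mu0 = x[y == 0].mean(axis=0)                        # mean of class 0
mu1 = x[y == 1].mean(axis=0)                        # mean of class 1
centered = x - np.where(y[:, None] == 0, mu0, mu1)  # subtract own class mean
Sigma = centered.T @ centered / m                   # shared covariance

print(phi, mu0, mu1)  # close to 0.4, [0, 0], [3, 3]
```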

Naive Bayes

A generative model that simplifies the estimation by assuming feature independence.

  • Assumption: Features $x_1, x_2, \dots$ are conditionally independent given $y$: $P(x \mid y) = P(x_1, x_2, \dots \mid y) = \prod_{i=1}^{n} P(x_i \mid y)$
  • Solutions: Maximizing the log-likelihood yields the following estimates for $k \in \{0, 1\}$ and feature value $l \in [1, L]$: $P(y = k) = \frac{1}{m} \times \#\{j \mid y^{(j)} = k\}$ and $P(x_i = l \mid y = k) = \frac{\#\{j \mid y^{(j)} = k \text{ and } x_i^{(j)} = l\}}{\#\{j \mid y^{(j)} = k\}}$

    Naive Bayes is widely used in text classification and spam detection due to its simplicity and effectiveness.
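
The count-based estimates can be sketched on a toy dataset with binary features (the data is made up purely for illustration):

```python
from collections import Counter

# Tiny dataset: each row is ((x1, x2), y).
data = [
    ((1, 0), 1), ((1, 1), 1), ((0, 1), 0),
    ((0, 0), 0), ((1, 0), 1), ((0, 1), 0),
]
m = len(data)

# P(y = k): class frequencies in the training set.
class_counts = Counter(y for _, y in data)
p_y = {k: c / m for k, c in class_counts.items()}

# P(x_i = l | y = k): frequency of x_i = l among examples of class k.
def p_xi_given_y(i, l, k):
    in_class = [x for x, y in data if y == k]
    return sum(x[i] == l for x in in_class) / len(in_class)

print(p_y[1])                  # 0.5: three of six examples have y = 1
print(p_xi_given_y(0, 1, 1))   # 1.0: x1 = 1 in every y = 1 example
```

In practice these counts are usually smoothed (e.g. Laplace smoothing) so that unseen feature values do not get probability zero.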

Tree-Based Methods

These methods can be used for both regression and classification problems.

  • CART (Classification and Regression Trees): Commonly known as decision trees, these are binary trees. They are highly interpretable.
  • Random Forest: A tree-based ensemble method that builds a large number of decision trees from randomly selected subsets of features and data. It typically offers good performance but is less interpretable than single decision trees.

Boosting Methods

Boosting combines several "weak learners" to form a stronger learner.

| Adaptive Boosting (AdaBoost) | Gradient Boosting |
|---|---|
| Assigns higher weights to misclassified examples in subsequent boosting steps. | Weak learners are trained on the residuals (remaining errors) of previous learners. |

K-Nearest Neighbors (K-NN)

The $k$-NN algorithm is a non-parametric approach where the response of a new data point is determined by the majority class (for classification) or the average (for regression) of its $k$ closest neighbors in the training set.

The choice of $k$ influences the model's bias and variance:
  • A higher $k$ leads to higher bias (smoother decision boundaries, less sensitive to noise).
  • A lower $k$ leads to higher variance (more complex decision boundaries, more sensitive to noise).
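
A minimal k-NN classifier by majority vote, using Euclidean distance (the toy training set is purely illustrative):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k):
    """Classify x by majority vote among its k nearest training points."""
    dists = np.linalg.norm(X_train - x, axis=1)   # distances to all points
    nearest = np.argsort(dists)[:k]               # indices of k closest
    return Counter(y_train[nearest]).most_common(1)[0][0]

X_train = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [5.0, 5.0], [5.0, 6.0]])
y_train = np.array([0, 0, 0, 1, 1])

print(knn_predict(X_train, y_train, np.array([0.2, 0.2]), k=3))  # 0
print(knn_predict(X_train, y_train, np.array([5.0, 5.5]), k=3))  # 1
```

Varying `k` in this sketch makes the bias/variance trade-off above concrete: with `k = 5` every query is classified as the overall majority class, regardless of position.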

Learning Theory

Learning theory provides theoretical foundations for understanding how and why algorithms learn effectively.

  • Union bound: For $k$ events $A_1, \dots, A_k$, the probability that at least one occurs is bounded: $P\left(\bigcup_{i=1}^{k} A_i\right) \le \sum_{i=1}^{k} P(A_i)$
  • Hoeffding inequality (Chernoff bound): For $m$ i.i.d. Bernoulli variables $Z_1, \dots, Z_m$ with mean $\phi$ and sample mean $\widehat{\phi}$, for any $\gamma > 0$: $P\left(\left|\phi - \widehat{\phi}\right| > \gamma\right) \le 2\exp\left(-2\gamma^2 m\right)$
  • Training error ($\widehat{\epsilon}(h)$): Also known as empirical risk or empirical error, it measures the proportion of mistakes a classifier $h$ makes on the training set: $\widehat{\epsilon}(h) = \frac{1}{m}\sum_{i=1}^{m} 1\{h(x^{(i)}) \ne y^{(i)}\}$
  • Probably Approximately Correct (PAC): A framework for analyzing the theoretical learnability of a concept. Its assumptions are:
    • Training and test sets follow the same distribution.
    • Training examples are drawn independently.
  • Shattering: A hypothesis class $\mathcal{H}$ shatters a set of data points $S = \{x^{(1)}, \dots, x^{(d)}\}$ if it can realize every possible labeling of these points: for any labels $\{y^{(1)}, \dots, y^{(d)}\}$, $\exists h \in \mathcal{H}, \quad \forall i \in [1, d], \quad h(x^{(i)}) = y^{(i)}$
  • Upper bound theorem: For a finite hypothesis class $\mathcal{H}$ of size $k$, with probability at least $1 - \delta$, the true error $\epsilon(\widehat{h})$ of the learned hypothesis $\widehat{h}$ is bounded by: $\epsilon(\widehat{h}) \le \left(\min_{h \in \mathcal{H}} \epsilon(h)\right) + 2\sqrt{\frac{1}{2m}\log\left(\frac{2k}{\delta}\right)}$
  • VC dimension (Vapnik-Chervonenkis dimension): For an infinite hypothesis class $\mathcal{H}$, $VC(\mathcal{H})$ is the size of the largest set that can be shattered by $\mathcal{H}$.

    For example, the VC dimension of linear classifiers in 2 dimensions is 3.

  • Vapnik's theorem: For a hypothesis class $\mathcal{H}$ with VC dimension $d$ and $m$ training examples, with probability at least $1 - \delta$: $\epsilon(\widehat{h}) \le \left(\min_{h \in \mathcal{H}} \epsilon(h)\right) + O\left(\sqrt{\frac{d}{m}\log\left(\frac{m}{d}\right) + \frac{1}{m}\log\left(\frac{1}{\delta}\right)}\right)$ This theorem provides a generalization bound for infinite hypothesis classes.
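
Hoeffding's inequality can be sanity-checked by simulation: the empirical frequency of large deviations of the sample mean stays below the bound $2\exp(-2\gamma^2 m)$. A sketch with arbitrarily chosen $\phi$, $m$, and $\gamma$:

```python
import numpy as np

rng = np.random.default_rng(0)
phi, m, gamma = 0.5, 100, 0.1  # arbitrary parameters for the illustration
trials = 10000

# For each trial, draw m Bernoulli(phi) samples and take the sample mean.
samples = rng.random((trials, m)) < phi
phi_hat = samples.mean(axis=1)

# Fraction of trials where the sample mean deviates by more than gamma.
empirical = np.mean(np.abs(phi_hat - phi) > gamma)
bound = 2 * np.exp(-2 * gamma**2 * m)

print(empirical, bound)  # the empirical rate stays below the bound
```

The bound here is about 0.27 while the empirical deviation rate is far smaller, which is typical: Hoeffding is a worst-case, distribution-free guarantee.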

Key Takeaways

  • Supervised learning involves building models to predict outcomes from labeled data.
  • Models are categorized as regression (continuous outcomes) or classification (categorical outcomes), and as discriminative or generative.
  • Fundamental concepts include hypothesis, loss functions, cost functions, and optimization algorithms like gradient descent and Newton's method.
  • Popular algorithms include Linear Regression, Logistic Regression, SVMs, GDA, Naive Bayes, Decision Trees, Random Forests, and K-NN.
  • Generalized Linear Models offer a unifying framework for various exponential family distributions.
  • Learning theory provides bounds and insights into model generalization through concepts like PAC learning, shattering, and VC dimension.
