ML

Machine learning is broadly divided into three distinct classes of techniques: supervised learning, unsupervised learning, and deep learning/reinforcement learning.


Supervised Learning

Supervised learning involves ML algorithms that infer patterns between a set of inputs (the X’s) and the desired output (Y). The inferred pattern is then used to map a given input set into a predicted output. Supervised learning requires a labeled dataset, one that contains matched sets of observed inputs and the associated output. Applying the ML algorithm to this dataset to infer the pattern between the inputs and output is called “training” the algorithm. Once the algorithm has been trained, the inferred pattern can be used to predict output values based on new inputs (i.e., ones not in the training dataset).

Multiple regression is an example of supervised learning. A regression model takes matched data (X’s, Y) and uses it to estimate parameters that characterize the relationship between Y and the X’s. The estimated parameters can then be used to predict Y on a new, different set of X’s. The difference between the predicted and actual Y is used to evaluate how well the regression model predicts out-of-sample (i.e., using new data).
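To make this train-then-predict workflow concrete, here is a minimal sketch in Python (an illustration added to this write-up, not part of the original text), assuming scikit-learn is available; the synthetic data and coefficients are assumptions for the example:

```python
# Minimal sketch: train a multiple regression model, then predict out-of-sample.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X = rng.normal(size=(500, 3))                                          # features (the X's)
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.5, size=500)   # target (Y)

# "Training" = inferring the pattern from labeled (X, Y) data
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = LinearRegression().fit(X_train, y_train)

# Out-of-sample evaluation: compare predicted vs. actual Y on new data
y_pred = model.predict(X_test)
print("out-of-sample R^2:", model.score(X_test, y_test))
```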

The terminology used with ML algorithms differs from that used in regression. The graph below provides a visual of the supervised learning model training process and a translation between regression and ML terminologies.

Exhibit 1: The Supervised Learning Model Training Process

In supervised machine learning, the dependent variable (Y) is the target and the independent variables (X’s) are known as features. The labeled data (training dataset) are used to train the supervised ML algorithm to infer a pattern-based prediction rule. The fit of the ML model is evaluated using labeled test data, in which the predicted targets (Y_predicted) are compared to the actual targets (Y_actual).

An example of supervised learning is the case in which ML algorithms are used to predict whether credit card transactions are fraudulent or legitimate. In the credit card example, the target is a binary variable with a value of 1 for “fraudulent” or 0 for “non-fraudulent.” The features are the transaction characteristics. The chosen ML algorithm uses these data elements to train a model to predict the likelihood of fraud more accurately in new transactions. The ML program “learns from experience” if the percentage of correctly predicted credit card transactions increases as the amount of input from a growing credit card database increases. One possible ML algorithm to use would be to fit a logistic regression model to the data to provide an estimate of the probability a transaction is fraudulent.
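As a sketch of this example (illustrative only, not from the source text), the code below fits scikit-learn's LogisticRegression to synthetic transaction-like data; the two features and the data-generating rule are hypothetical stand-ins for real transaction characteristics:

```python
# Sketch: logistic regression as a binary fraud classifier on synthetic data.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 1000
X = rng.normal(size=(n, 2))   # hypothetical features, e.g., scaled amount, time-of-day score
y = (X[:, 0] + rng.normal(scale=0.8, size=n) > 1.5).astype(int)   # 1 = fraudulent

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)
clf = LogisticRegression().fit(X_train, y_train)

# predict_proba gives the estimated probability that a new transaction is fraudulent
print("P(fraud) for first test case:", clf.predict_proba(X_test)[0, 1])
print("accuracy on new transactions:", clf.score(X_test, y_test))
```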

Supervised learning can be divided into two categories of problems—regression and classification—with the distinction between them being determined by the nature of the target (Y) variable. If the target variable is continuous, then the task is one of regression (even if the ML technique used is not “regression”; note this nuance of ML terminology). If the target variable is categorical or ordinal (i.e., a ranked category), then it is a classification problem. Regression and classification use different ML techniques.

Regression focuses on making predictions of continuous target variables. Most learners are already familiar with multiple linear regression (e.g., ordinary least squares) models, but other supervised learning techniques exist, including non-linear models. These non-linear models are useful for problems involving large datasets with large numbers of features, many of which may be correlated. Some examples of problems belonging to the regression category are using historical stock market returns to forecast stock price performance or using historical corporate financial ratios to forecast the probability of bond default.

Classification focuses on sorting observations into distinct categories. In a regression problem, when the dependent variable (target) is categorical, the model relating the outcome to the independent variables (features) is called a “classifier.” You may already be familiar with logistic regression as a type of classifier. Many classification models are binary classifiers, as in the case of fraud detection for credit card transactions. Multi-category classification is not uncommon, as in the case of classifying firms into multiple credit rating categories. In assigning ratings, the outcome variable is ordinal, meaning the categories have a distinct order or ranking (e.g., from low to high creditworthiness). Ordinal variables are intermediate between categorical variables and continuous variables on a scale of measurement.

Unsupervised Learning

Unsupervised learning is machine learning that does not make use of labeled data. More formally, in unsupervised learning, we have inputs (X’s) that are used for analysis without any target (Y) being supplied. In unsupervised learning, because the ML algorithm is not given labeled training data, the algorithm seeks to discover structure within the data themselves. As such, unsupervised learning is useful for exploring new datasets because it can provide human experts with insights into a dataset too big or too complex to visualize.

Two important types of problems that are well suited to unsupervised machine learning are reducing the dimension of data and sorting data into clusters, known as dimension reduction and clustering, respectively.

Dimension reduction focuses on reducing the number of features while retaining variation across observations to preserve the information contained in that variation. Dimension reduction may have several purposes. It may be applied to data with a large number of features to produce a lower dimensional representation (i.e., with fewer features) that can fit, for example, on a computer screen. Dimension reduction is also used in many quantitative investment and risk management applications where it is critical to identify the most predictive factors underlying asset price movements.
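A minimal dimension reduction sketch, assuming scikit-learn's PCA and a synthetic dataset in which 50 correlated features are driven by 5 underlying factors (the sizes are assumptions for illustration):

```python
# Sketch: reduce many (possibly correlated) features to a few components.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
base = rng.normal(size=(300, 5))
X = base @ rng.normal(size=(5, 50))   # 50 correlated features driven by 5 factors

pca = PCA(n_components=5).fit(X)      # unsupervised: no target Y supplied
X_reduced = pca.transform(X)          # lower-dimensional representation
print("variance retained:", pca.explained_variance_ratio_.sum())
```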

Clustering focuses on sorting observations into groups (clusters) such that observations in the same cluster are more similar to each other than they are to observations in other clusters. Groups are formed based on a set of criteria that may or may not be prespecified (such as the number of groups). Clustering has been used by asset managers to sort companies into groupings driven by data (e.g., based on their financial statement data or corporate characteristics) rather than conventional groupings (e.g., based on sectors or countries).
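A minimal clustering sketch, assuming scikit-learn's KMeans; the feature matrix is synthetic (rows standing in for companies, columns for hypothetical financial ratios), and the number of clusters (k = 3) is a prespecified criterion here:

```python
# Sketch: group observations by their characteristics rather than by given labels.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)
# Three loose groups of 40 "companies," each described by 4 hypothetical ratios
X = np.vstack([rng.normal(loc=c, size=(40, 4)) for c in (0.0, 3.0, 6.0)])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print("cluster sizes:", np.bincount(kmeans.labels_))
```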

Deep Learning and Reinforcement Learning

More broadly in the field of artificial intelligence, additional categories of machine learning algorithms are distinguished. In deep learning, sophisticated algorithms address complex tasks, such as image classification, face recognition, speech recognition, and natural language processing. Deep learning is based on neural networks (NNs), also called artificial neural networks (ANNs)—highly flexible ML algorithms that have been successfully applied to a variety of supervised and unsupervised tasks characterized by large datasets, non-linearities, and interactions among features. In reinforcement learning, a computer learns from interacting with itself or data generated by the same algorithm. Deep learning and reinforcement learning principles have been combined to create efficient algorithms for solving a range of highly complex problems in robotics, health care, and finance.

Summary of ML Algorithms and How to Choose among Them

Below, you will find a guide to the various machine learning algorithms organized by algorithm type (supervised or unsupervised) and by type of variables (continuous, categorical, or both). We will not cover linear or logistic regression. The extensions of linear regression, such as penalized regression and least absolute shrinkage and selection operator (LASSO), as well as the other ML algorithms shown below, will be covered in this course.

Guide to ML Algorithms

Supervised ML algorithms (target variable provided)
• Continuous target: Regression (Linear; Penalized Regression/LASSO; Logistic; Classification and Regression Tree, or CART; Random Forest)
• Categorical target: Classification (Logistic; Support Vector Machine, or SVM; K-Nearest Neighbor, or KNN; CART)
• Continuous or categorical target: Neural Networks; Deep Learning; Reinforcement Learning

Unsupervised ML algorithms (no target variable)
• Continuous variables: Dimension Reduction (Principal Components Analysis, or PCA); Clustering (K-Means; Hierarchical)
• Categorical variables: Dimension Reduction (PCA); Clustering (K-Means; Hierarchical)
• Continuous or categorical variables: Neural Networks; Deep Learning; Reinforcement Learning



Overview of Evaluating ML Algorithm Performance

In this lesson, you will learn how to describe overfitting and identify methods of addressing it.

Machine learning algorithms promise several advantages relative to a structured statistical approach in exploring and analyzing the structure of very large datasets. ML algorithms have the ability to uncover complex interactions between feature variables and the target variable, and they can process massive amounts of data quickly. Moreover, many ML algorithms can easily capture non-linear relationships and may be able to recognize and predict structural changes between features and the target. These advantages mainly derive from the non-parametric and non-linear models that allow more flexibility when inferring relationships.

The flexibility of ML algorithms comes with a price, however. ML algorithms can produce overly complex models with results that are difficult to interpret, may be sensitive to noise or particulars of the data, and may fit the training data too well. An ML algorithm that fits the training data too well will typically not predict well using new data. This problem is known as overfitting, and it means that the fitted algorithm does not generalize well to new data. A model that generalizes well is a model that retains its explanatory power when predicting using out-of-sample (i.e., new) data. An overfit model has incorporated the noise or random fluctuations in the training data into its learned relationship. The problem is that these aspects often do not apply to new data the algorithm receives and so will negatively impact the model’s ability to generalize, therefore reducing its overall predictive value. The evaluation of any ML algorithm thus focuses on its prediction error on new data rather than on its goodness of fit on the data with which the algorithm was fitted (i.e., trained).

Generalization is an objective in model building, so the problem of overfitting is a challenge to attaining that objective. These two concepts are the focus of the following section.

Generalization and Overfitting

To properly describe generalization and overfitting of an ML model, it is important to note the partitioning of the dataset to which the model will be applied. The dataset is typically divided into three non-overlapping samples: (1) training sample used to train the model, (2) validation sample for validating and tuning the model, and (3) test sample for testing the model’s ability to predict well on new data. The training and validation samples are often referred to as being “in-sample,” and the test sample is commonly referred to as being “out-of-sample.” We will return shortly to the topic of partitioning the dataset.

To be valid and useful, any supervised machine learning model must generalize well beyond the training data. The model should retain its explanatory power when tested out-of-sample. As mentioned, one common reason for failure to generalize is overfitting. Think of overfitting as tailoring a custom suit that fits only one person. Continuing the analogy, underfitting is similar to making a baggy suit that fits no one, whereas robust fitting, the desired result, is similar to fashioning a universal suit that fits all people of similar dimensions.

Exhibit 3: Underfitting, Overfitting, and Good Fitting


Errors and Overfitting

To capture these effects and calibrate the degree of fit, data scientists compare error rates in- and out-of-sample as a function of both the data and the algorithm. Total in-sample errors (E_in) are generated by the predictions of the fitted relationship relative to actual target outcomes on the training sample. Total out-of-sample errors (E_out) come from either the validation or test samples. A low or zero in-sample error combined with a large out-of-sample error indicates poor generalization. Data scientists decompose the total out-of-sample error into three sources (summarized in the formula after this list):

Bias error, or the degree to which a model fits the training data. Algorithms with erroneous assumptions produce high bias with poor approximation, causing underfitting and high in-sample error.

Variance error, or how much the model’s results change in response to new data from validation and test samples. Unstable models pick up noise and produce high variance, causing overfitting and high out-of-sample error.

Base error due to randomness in the data.
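For squared-error loss, this decomposition is conventionally written as the standard statistical identity below (not spelled out in the text above), where f is the true relationship, f̂ the fitted model, and σ² the irreducible noise; note that bias error enters as a squared term:

```latex
E_{\text{out}}
= \mathbb{E}\!\left[(\hat{f}(x)-y)^2\right]
= \underbrace{\left(\mathbb{E}[\hat{f}(x)]-f(x)\right)^{2}}_{\text{bias}^2}
+ \underbrace{\mathbb{E}\!\left[\left(\hat{f}(x)-\mathbb{E}[\hat{f}(x)]\right)^{2}\right]}_{\text{variance}}
+ \underbrace{\sigma^{2}}_{\text{base error}}
```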

A learning curve plots the accuracy rate (= 1 – error rate) in the validation or test samples (i.e., out-of-sample) against the amount of data in the training sample, so it is useful for describing under- and overfitting as a function of bias and variance errors. If the model is robust, out-of-sample accuracy increases as the training sample size increases. This implies that the error rates in the validation or test samples (E_out) and in the training sample (E_in) converge toward each other and toward a desired error rate (or, alternatively, the base error).

In an underfitted model with high bias error, shown in the left panel, high error rates cause convergence below the desired accuracy rate. Adding more training samples will not improve the model to the desired performance level. In an overfitted model with high variance error, shown in the middle panel, the validation sample and training sample error rates fail to converge. In building models, data scientists try to simultaneously minimize both bias and variance errors while selecting an algorithm with good predictive or classifying power, as seen in the right panel.

Exhibit 4: Learning Curves (Accuracy in Validation and Training Samples)
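The sketch below computes the points such a curve plots, using scikit-learn's learning_curve on synthetic data (the model and dataset are assumptions for illustration): in- and out-of-sample accuracy as the training sample grows.

```python
# Sketch: accuracy vs. training-set size (the data behind a learning curve).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(max_iter=1000), X, y,
    train_sizes=np.linspace(0.1, 1.0, 5), cv=5, scoring="accuracy")

# If the model is robust, in- and out-of-sample accuracy converge as data grow
for n, tr, va in zip(sizes, train_scores.mean(axis=1), val_scores.mean(axis=1)):
    print(f"n={n:5d}  train acc={tr:.3f}  validation acc={va:.3f}")
```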

Out-of-sample error rates are also a function of model complexity. As model complexity increases, the error rate on the training set (E_in) falls and bias error shrinks. On the test set, however, error rates (E_out) rise with increasing complexity, and variance error rises with them. Typically, linear functions are more susceptible to bias error and underfitting, while non-linear functions are more prone to variance error and overfitting. Therefore, an optimal point of model complexity exists where the bias and variance error curves intersect and in- and out-of-sample error rates are minimized. A fitting curve, which plots in- and out-of-sample error rates (E_in and E_out) against model complexity, illustrates this trade-off and is shown below.

Exhibit 5: Fitting Curve Showing the Trade-Off between Bias and Variance Errors and Model Complexity
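One illustrative way to trace a fitting curve numerically (a sketch, not the curriculum's method) is to let polynomial degree stand in for model complexity on a synthetic one-feature problem:

```python
# Sketch: in- vs. out-of-sample error as model complexity (polynomial degree) grows.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(3)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=200)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for degree in (1, 4, 15):   # low, moderate, high complexity
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression()).fit(X_tr, y_tr)
    e_in = mean_squared_error(y_tr, model.predict(X_tr))    # falls as degree rises
    e_out = mean_squared_error(y_te, model.predict(X_te))   # eventually rises again
    print(f"degree={degree:2d}  E_in={e_in:.3f}  E_out={e_out:.3f}")
```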

Finding the optimal point (managing overfitting risk)—the point just before the total error rate starts to rise (due to increasing variance error)—is a core part of the machine learning process and the key to successful generalization. Data scientists express the trade-off between overfitting and generalization as a trade-off between cost (the difference between in- and out-of-sample error rates) and complexity. They use the trade-off between cost and complexity to calibrate and visualize under- and overfitting and to optimize their models.

Preventing Overfitting in Supervised Machine Learning

We have seen that overfitting impairs generalization, but overfitting potential is endemic to the supervised machine learning process due to the presence of noise. So, how do data scientists combat this risk? Two common methods are used to reduce overfitting:

The first method is complexity reduction (or regularization). This strategy comes from Occam’s razor, the problem-solving principle that the simplest solution tends to be the correct one. In supervised machine learning, it means limiting the number of features and penalizing algorithms that are too complex or too flexible by constraining them to include only parameters that reduce out-of-sample error.
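Penalized regression such as LASSO (covered later in this course) is one concrete form of this strategy. The sketch below, assuming scikit-learn's Lasso and synthetic data in which only 2 of 20 features matter, shows the penalty shrinking irrelevant coefficients to zero:

```python
# Sketch: LASSO's penalty limits complexity by zeroing out unneeded coefficients.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(4)
X = rng.normal(size=(200, 20))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=200)  # 2 real features

lasso = Lasso(alpha=0.1).fit(X, y)   # alpha controls the complexity penalty
print("nonzero coefficients:", np.sum(lasso.coef_ != 0), "of", X.shape[1])
```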

The second method is proper data sampling through cross-validation. Simply reserving part of the data as a holdout sample (i.e., a data sample not used to train the model) shrinks the training set. To mitigate the problem of such holdout samples reducing the training set size too much, modelers use special cross-validation techniques. One such technique is k-fold cross-validation, in which the data (excluding the test sample and fresh data) are shuffled randomly and then divided into k equal sub-samples, with k – 1 sub-samples used as training samples and one sub-sample, the kth, used as a validation sample. Note that k is typically set at 5 or 10. This process is then repeated k times, which helps minimize both bias and variance by ensuring that each data point is used in the training set k – 1 times and in the validation set once. The average of the k validation errors (mean E_val) is then taken as a reasonable estimate of the model’s out-of-sample error (E_out). A limitation of k-fold cross-validation is that it cannot be used with time-series data, where only the most recent data can reasonably be used for model validation.
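A minimal k-fold cross-validation sketch (k = 5), assuming scikit-learn; the estimator and synthetic data are illustrative assumptions:

```python
# Sketch: k-fold cross-validation; the mean validation score estimates
# out-of-sample performance (mean accuracy = 1 - mean E_val).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = make_classification(n_samples=500, n_features=8, random_state=0)
kf = KFold(n_splits=5, shuffle=True, random_state=0)   # shuffle, then split into k folds

scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=kf)
print("validation accuracy per fold:", np.round(scores, 3))
print("mean validation accuracy:", scores.mean())
```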

In sum, mitigating overfitting risk by avoiding excessive out-of-sample error is critical to creating a supervised machine learning model that generalizes well to fresh datasets drawn from the same distribution. The main techniques used to mitigate overfitting risk in model construction are complexity reduction (or regularization) and cross-validation.



