Machine Learning (General)#

Machine learning project check list#

1. Frame the problem & look at the big picture#

  • Supervised/unsupervised/reinforcement problem?

  • Regression/classification/clustering etc.?

  • Select a performance measure

    • Regression: RMSE/MAE

    • Classification: Precision/recall; sensitivity/specificity; log-loss; Gini; entropy

2. EDA#

  • Visualizing the data

    • distribution

    • correlation

    • feature engineering (can also be part of the next step, data preparation)

  • Missing data

3. Data preparation/pre-processing#

  • Cleaning: missing data (imputation?), label encoding

  • Handling categorical data: e.g. one-hot encoding

  • Feature scaling

  • Feature engineering: create new features

  • Train/test data splitting

    • e.g. 80% vs 20% or 75% vs 25%

    • Make sure that the two parts are comparable in terms of the outcome variable, e.g. a similar % of positive cases if it’s binary (stratified sampling; see the sketch below)
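
  A minimal sketch of a stratified split with scikit-learn; the toy dataset and parameter values are made up for illustration:

    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split

    # Toy imbalanced binary data (illustrative only)
    X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)

    # stratify=y keeps the % of positive cases similar in both parts
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42, stratify=y
    )
    print(y_train.mean(), y_test.mean())  # similar positive rates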

4. Model training & hyperparameter tuning#

  • Use cross-validation as the means to evaluate the overall performance of a model

    from sklearn.model_selection import cross_val_score
    
  • Grid search: GridSearchCV in the model_selection module

  • Random search: RandomizedSearchCV

  • Ensemble methods: combine the models that perform best

  • Analyze the best models and their errors

  • It’s recommended to save the hyperparameters and the trained model parameters, as well as the CV scores and the predictions, e.g. with the joblib package (previously available as sklearn.externals.joblib)

  • Usually there will be multiple candidate models to consider; use the same criteria to compare their performance (see the sketch below)
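
  A minimal sketch of this step with scikit-learn; the model choice, parameter grid, and file name are illustrative assumptions, not prescriptions:

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import GridSearchCV, cross_val_score
    import joblib

    X, y = make_classification(n_samples=500, random_state=42)

    # Cross-validated score for one candidate model
    model = RandomForestClassifier(random_state=42)
    cv_scores = cross_val_score(model, X, y, cv=5, scoring="roc_auc")

    # Hyperparameter tuning over a small grid
    param_grid = {"n_estimators": [100, 300], "max_depth": [None, 5]}
    search = GridSearchCV(model, param_grid, cv=5, scoring="roc_auc")
    search.fit(X, y)

    # Save the hyperparameters, the trained parameters (the fitted model) and the CV scores
    joblib.dump({"best_params": search.best_params_,
                 "model": search.best_estimator_,
                 "cv_scores": cv_scores},
                "model_artifacts.joblib")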

5. Evaluate the final model on the test data#

Model performance metrics vs model’s cost function#

  • Cost function: used in estimating the model parameters; minimizing the cost/loss function leads to the optimized model specification given the data, e.g. MSE, Gini impurity

  • Performance measures: used in evaluating the prediction performance of a model identified from the data; e.g. AUC, log-loss, F1 score.

  • Different ML models use different cost/loss functions for parameter estimation; the loss is a “built-in” feature of a specific ML algorithm. Once models are trained, their predictive capabilities are compared under the same performance measure/metric.

Model performance metrics: classification#

log-loss vs. AUC vs. F1 score#

  • AUC is computed for binary classification by varying the decision threshold (area under the ROC curve)

  • Log-loss takes the “certainty” of classification into account; it conceptually goes beyond AUC and is especially relevant with imbalanced data or unequally distributed error costs

    \[-(y\log(p) + (1-y)\log(1-p))\]
    • a measure of accuracy that incorporates the idea of probabilistic confidence; it rewards well-calibrated probability estimates

    \[F_1 = \frac{2}{\text{recall}^{-1} + \text{precision}^{-1}},\]

    where recall = sensitivity (true positive rate, TPR) and precision = positive predictive value (PPV, i.e. true positives/predicted positives)


  • If you care about the absolute probabilistic difference, go with log-loss

  • If you only care about the final class prediction and don’t want to tune a threshold, go with AUC

  • F1 is sensitive to threshold

  • If data is imbalanced

    • Go with F1 if there are few positive cases, e.g. fraud detection

    • Go with AUC if you don’t care about the actual predicted class
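
  A minimal sketch comparing the three metrics with scikit-learn; the labels and probabilities below are made-up numbers:

    from sklearn.metrics import f1_score, log_loss, roc_auc_score

    y_true = [0, 0, 0, 1, 1]
    y_prob = [0.1, 0.3, 0.4, 0.35, 0.8]         # predicted P(y = 1)
    y_pred = [int(p >= 0.5) for p in y_prob]    # class prediction at a 0.5 threshold

    print("log-loss:", log_loss(y_true, y_prob))       # uses the probabilities themselves
    print("AUC     :", roc_auc_score(y_true, y_prob))  # threshold-free ranking quality
    print("F1      :", f1_score(y_true, y_pred))       # depends on the chosen threshold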

Multiclass vs. multilabel vs. multioutput classification#

  • Multiclass

    • OVA (one vs all): \(K\) models if there are \(K\) classes; preferred for most binary classifiers

    • OVO (one vs one): \(\frac{K(K-1)}{2}\) models if there are \(K\) classes in total; each model is trained only on the data for its two classes, so the individual training sets are small and faster to train (see the sketch after this list)

    Random forest can directly classify instances into multiple classes

  • Multilabel: each instance has more than one label to predict, e.g. a photo with multiple people in it

  • Multioutput: multiclass + multilabel, i.e. each of the multiple outputs is itself a multiclass prediction task
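
  A minimal sketch of OVA vs. OVO with scikit-learn; the toy data and the base classifier are illustrative choices:

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier

    # Toy problem with K = 4 classes (illustrative only)
    X, y = make_classification(n_samples=400, n_classes=4, n_informative=6,
                               random_state=42)

    ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, y)
    ovo = OneVsOneClassifier(LogisticRegression(max_iter=1000)).fit(X, y)
    print(len(ovr.estimators_), len(ovo.estimators_))  # K = 4 vs. K(K-1)/2 = 6 models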

Gini impurity vs. entropy#

  • Gini impurity for node \(i\):

    \[G_i = 1 - \sum_{k=1}^K p_{i, k}^2,\]

    where \(k\) is the class index and \(p_{i,k}\) is the proportion of class-\(k\) instances in node \(i\)

  • Entropy for node \(i\):

    \[H_i = -\sum_{k=1, p_{i,k}\ne0}^K p_{i, k}\log(p_{i, k})\]
  • Most of the time, the choice between \(G_i\) and \(H_i\) doesn’t make a big difference, while \(G_i\) is slightly faster to compute

  • Gini tends to isolate the most frequent class in its own branch, while entropy tends to produce slightly more balanced trees
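
  A minimal sketch of the two impurity measures for a single node; the class proportions below are made up. (In scikit-learn the measure is chosen via DecisionTreeClassifier(criterion="gini") or criterion="entropy".)

    import numpy as np

    def gini(p):
        """Gini impurity G_i = 1 - sum_k p_k^2 for one node's class proportions."""
        p = np.asarray(p)
        return 1.0 - np.sum(p ** 2)

    def entropy(p):
        """Entropy H_i = -sum_k p_k * log(p_k), skipping classes with p_k = 0."""
        p = np.asarray(p)
        p = p[p > 0]
        return -np.sum(p * np.log(p))

    props = [0.7, 0.2, 0.1]   # made-up class proportions in node i
    print(gini(props), entropy(props))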

Data leakage in ML#

  • Data leakage: training data contains info about the target, but similar data won’t be available when the model is used for prediction.

    • Target leakage: timing or chronological order of x and y

    • Train-test leakage

  • Target leakage: some “predictor” x may actually be collected/updated after y (the outcome) is observed in the training data; such variables should not be used in model training

  • Train-test contamination: when validation data affects the preprocessing behavior

    • e.g. running a preprocessing step (like fitting an imputer for missing values) before calling train_test_split()

    • If the validation is based on a simple train-test split, exclude the validation data from any type of fitting, including the fitting of preprocessing steps. When using CV, it’s even more critical that you do your preprocessing inside the pipeline (see the sketch below)!
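
  A minimal sketch of leakage-free preprocessing with a scikit-learn pipeline; the toy data, imputer, and model are illustrative choices. Because the imputer sits inside the pipeline, it is re-fit on each training fold only, so the validation folds never influence the preprocessing.

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.impute import SimpleImputer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import make_pipeline

    X, y = make_classification(n_samples=300, random_state=42)
    X[::17, 0] = np.nan   # inject some missing values for illustration

    # Imputation happens inside the CV loop, fitted on the training folds only
    pipe = make_pipeline(SimpleImputer(strategy="median"),
                         LogisticRegression(max_iter=1000))
    print(cross_val_score(pipe, X, y, cv=5, scoring="roc_auc"))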