Overview
Motivation
- Estimate predictions for small datasets
- Evaluate variable or feature importance on all subpopulations of the data
- Generate estimates of prediction uncertainty and variance
- Develop classifiers that are evaluated on unseen data
Main advantages
- Predictions are generated on many different training/validation data splits
- Predictor or feature importance with respect to the dependent variable is generalized across many subpopulations of the data
- No data leakage: predictions are made only on observations not included during training
Procedural overview
1. A Monte Carlo simulation splits the data into training and validation sets
2. K-fold cross-validation (10 folds by default) on the training set is used to estimate “good” model parameters
3. The model with the “good” parameters is refit on the entire training set
4. The refitted model predicts the yet-to-be-seen validation set
5. Performance metrics are generated from bootstrap resamples (sampling with replacement) of the predicted observation probabilities (see the sketch below)
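Below is a minimal sketch of the full procedure using scikit-learn. The estimator (logistic regression), the parameter grid, the 70/30 split ratio, the number of Monte Carlo splits, the number of bootstrap resamples, and the AUC metric are all illustrative assumptions, not details taken from this package.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

n_splits = 50   # number of Monte Carlo training/validation splits (assumed)
n_boot = 20     # bootstrap resamples of the validation predictions per split (assumed)
param_grid = {"C": [0.01, 0.1, 1.0, 10.0]}  # hypothetical tuning grid

auc_samples = []
for i in range(n_splits):
    # 1. Monte Carlo split into training and validation sets
    X_tr, X_val, y_tr, y_val = train_test_split(
        X, y, test_size=0.3, stratify=y, random_state=i
    )

    # 2. Inner 10-fold cross-validation on the training set to pick "good" parameters
    search = GridSearchCV(LogisticRegression(max_iter=1000), param_grid, cv=10)
    search.fit(X_tr, y_tr)

    # 3. Refit with the selected parameters on the entire training set
    #    (GridSearchCV does this automatically because refit=True by default)
    best_model = search.best_estimator_

    # 4. Predict the held-out validation set (no leakage: these rows were never trained on)
    proba = best_model.predict_proba(X_val)[:, 1]

    # 5. Bootstrap (resample with replacement) the validation probabilities
    #    to build an empirical distribution of the performance metric
    for _ in range(n_boot):
        idx = rng.integers(0, len(y_val), len(y_val))
        if len(np.unique(y_val[idx])) < 2:
            continue  # AUC is undefined when only one class is drawn
        auc_samples.append(roc_auc_score(y_val[idx], proba[idx]))

auc_samples = np.asarray(auc_samples)
print(f"AUC: {auc_samples.mean():.3f} "
      f"(2.5-97.5 pct: {np.percentile(auc_samples, 2.5):.3f}-"
      f"{np.percentile(auc_samples, 97.5):.3f})")
```

Bootstrapping the validation-set predictions, rather than refitting the model on every bootstrap sample, keeps the per-split cost low while still yielding an empirical distribution (and hence interval estimates) for each performance metric.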