Specify Fixed ML Pipelines
This notebook is part of the CaTabRa GitHub repository.
This short example illustrates how a fixed ML pipeline can be specified in CaTabRa.
Fixed pipelines (without hyperparameter optimization) can be useful for quickly training and evaluating baseline models, like simple logistic regression.
For the related question of how to add a new full-fledged AutoML backend (with hyperparameter optimization), or extend the default auto-sklearn backend, refer to this example.
Compose Pipeline
We compose a simple pipeline, consisting of elementary preprocessing steps (scaling, imputation) followed by a logistic regression.
[2]:
from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression
[3]:
# preprocessing pipeline
preprocessing = make_pipeline(
    MinMaxScaler(),                                     # min-max scale all features to [0, 1] interval
    SimpleImputer(strategy='constant', fill_value=-1),  # impute missing values with -1
    'passthrough'                                       # no estimator in preprocessing pipeline
)
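To see what this preprocessing does to actual data, the following minimal sketch applies an identically composed pipeline to a small made-up array. The names toy and toy_preprocessing are illustrative and not part of the notebook. MinMaxScaler disregards missing values when fitting and keeps them when transforming, so the subsequent imputer can replace them with -1.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler

# made-up data: two features, one missing value in the second feature
toy = np.array([
    [0.0, 10.0],
    [5.0, np.nan],
    [10.0, 30.0],
])

# same composition as the preprocessing pipeline above
toy_preprocessing = make_pipeline(
    MinMaxScaler(),
    SimpleImputer(strategy='constant', fill_value=-1),
    'passthrough'
)

transformed = toy_preprocessing.fit_transform(toy)
print(transformed)
# observed values are scaled to [0, 1]; the missing entry becomes -1
```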
[4]:
# final estimator
estimator = LogisticRegression()
NOTE: catabra.automl.fixed_pipeline.standard_preprocessing() is a convenient built-in implementation of the above preprocessing pipeline. In addition, it also one-hot encodes categorical features.
We can now register the fixed pipeline as a new AutoML backend (strictly speaking, the term “AutoML” is not appropriate in this case, but never mind):
[5]:
from catabra.automl import fixed_pipeline
fixed_pipeline.register_backend(
    'logreg',
    preprocessing=preprocessing,
    estimator=estimator
)
NOTE: The preprocessing object must implement fit_transform() and transform(), and the estimator object must implement fit(), predict() and, if used for classification, predict_proba(). Both should subclass sklearn.base.BaseEstimator (https://scikit-learn.org/stable/modules/generated/sklearn.base.BaseEstimator.html) to be able to get/set hyperparameters with get_params() and set_params(), respectively. preprocessing is optional and can be set to None.
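As a rough illustration of the interface described in the note, here is a minimal sketch of a custom classifier that provides fit(), predict() and predict_proba() and inherits get_params()/set_params() from BaseEstimator. The class PriorClassifier and its smoothing parameter are hypothetical and only serve as an example; whether CaTabRa expects anything beyond these methods is not covered here.

```python
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin


class PriorClassifier(ClassifierMixin, BaseEstimator):
    """Hypothetical baseline: always predicts the (smoothed) class frequencies."""

    def __init__(self, smoothing=1.0):
        # hyperparameters are stored under the same name as the argument,
        # which is what get_params() / set_params() rely on
        self.smoothing = smoothing

    def fit(self, X, y):
        self.classes_, counts = np.unique(y, return_counts=True)
        smoothed = counts + self.smoothing
        self.class_prior_ = smoothed / smoothed.sum()
        return self

    def predict_proba(self, X):
        # same probabilities for every sample, one row per sample
        return np.tile(self.class_prior_, (len(X), 1))

    def predict(self, X):
        return self.classes_[np.argmax(self.predict_proba(X), axis=1)]


# made-up data: the features are irrelevant for this baseline
X_toy = np.zeros((4, 2))
y_toy = np.array([0, 1, 1, 1])

clf = PriorClassifier(smoothing=0.0).fit(X_toy, y_toy)
print(clf.get_params())              # hyperparameters exposed via BaseEstimator
print(clf.predict_proba(X_toy[:2]))  # class frequencies, repeated per sample
print(clf.predict(X_toy[:2]))        # most frequent class
```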
Utilize Pipeline
"logreg" can be used in CaTabRa’s data analysis workflow just as any other AutoML backend.
[6]:
# load dataset
from sklearn.datasets import load_breast_cancer
X, y = load_breast_cancer(as_frame=True, return_X_y=True)
[7]:
# add target labels to DataFrame
X['diagnosis'] = y
[8]:
# split into training and test sets by adding a column with the corresponding values
# the name of the column is arbitrary; CaTabRa tries to "guess" which samples belong to which set based on the column name and its values
X['train'] = X.index <= 0.8 * len(X)
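The split above is purely positional: all rows whose index is at most 80% of the number of rows are marked as training samples. The following sketch reproduces the expression on a fresh copy of the dataset to count how many samples end up on each side. The variable names data, is_train, n_train and n_test are illustrative.

```python
from sklearn.datasets import load_breast_cancer

data, target = load_breast_cancer(as_frame=True, return_X_y=True)

# same expression as above: True for training samples, False for test samples
is_train = data.index <= 0.8 * len(data)

n_train = int(is_train.sum())
n_test = int((~is_train).sum())
print(n_train, n_test)  # 456 113
```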
When analyzing the data, we inform CaTabRa that we want to use the "logreg" backend by adjusting the config dict:
[9]:
from catabra.analysis import analyze
analyze(
    X,
    classify='diagnosis',   # name of column containing classification target
    split='train',          # name of column containing information about the train-test split (optional)
    time=None,              # specifying a time budget has no effect on fixed pipelines
    out='logreg_example',
    config={
        'automl': 'logreg',     # name of the "AutoML" backend (in this case it's a fixed pipeline)
        'binary_classification_metrics': ['accuracy', 'roc_auc'],
    }
)
[CaTabRa] ### Analysis started at 2023-03-08 14:34:44.562167
[CaTabRa] Saving descriptive statistics completed
[CaTabRa] Using AutoML-backend logreg for binary_classification
[CaTabRa warning] Could not set number of jobs of Pipeline preprocessing to 1.
[CaTabRa] Final training statistics:
n_models_trained: 1
[CaTabRa] Creating shap explainer
[CaTabRa] Initialized out-of-distribution detector of type BinsDetector
[CaTabRa] Fitting out-of-distribution detector...
[CaTabRa] Out-of-distribution detector fitted.
[CaTabRa] ### Analysis finished at 2023-03-08 14:34:45.339168
[CaTabRa] ### Elapsed time: 0 days 00:00:00.777001
[CaTabRa] ### Output saved in /mnt/c/Users/amaletzk/Documents/CaTabRa/catabra/examples/logreg_example
[CaTabRa] ### Evaluation started at 2023-03-08 14:34:45.385179
[CaTabRa] Saving descriptive statistics completed
[CaTabRa] Predicting out-of-distribution samples.
[CaTabRa] Saving descriptive statistics completed
[CaTabRa] Predicting out-of-distribution samples.
[CaTabRa] Evaluation results for train:
accuracy @ 0.5: 0.9758771929824561
roc_auc: 0.9944444444444445
[CaTabRa] Evaluation results for not_train:
accuracy @ 0.5: 0.9734513274336283
roc_auc: 0.9991158267020337
[CaTabRa] ### Evaluation finished at 2023-03-08 14:34:50.289666
[CaTabRa] ### Elapsed time: 0 days 00:00:04.904487
[CaTabRa] ### Output saved in /mnt/c/Users/amaletzk/Documents/CaTabRa/catabra/examples/logreg_example/eval
Once the fixed pipeline has been implemented in a few lines of code, CaTabRa takes care of everything else: calculating descriptive statistics, splitting the data into training and test sets, training a classifier and an out-of-distribution (OOD) detector, and evaluating the classifier on both the training and the test set (including visualizations).
The classifier can furthermore be explained without further ado:
[10]:
from catabra.explanation import explain
explain(
    X,
    folder='logreg_example',
    from_invocation='logreg_example/invocation.json',
    out='logreg_example/explain'
)
[CaTabRa] ### Explanation started at 2023-03-08 14:39:12.430560
[CaTabRa] *** Split train
Sample batches: 100%|########################################| 15/15 [00:00<00:00, 570.05it/s]
[CaTabRa] *** Split not_train
Sample batches: 100%|########################################| 4/4 [00:00<00:00, 455.20it/s]
[CaTabRa] ### Explanation finished at 2023-03-08 14:39:14.921295
[CaTabRa] ### Elapsed time: 0 days 00:00:02.490735
[CaTabRa] ### Output saved in /mnt/c/Users/amaletzk/Documents/CaTabRa/catabra/examples/logreg_example/explain
Configure Pipeline
Although fixed pipelines are, well, fixed in the sense that hyperparameters are not automatically optimized, it is still possible to configure hyperparameters through the config dict.
Find out which hyperparameters there are:
[12]:
preprocessing.get_params()
[12]:
{'memory': None,
'steps': [('minmaxscaler', MinMaxScaler()),
('simpleimputer', SimpleImputer(fill_value=-1, strategy='constant')),
('passthrough', 'passthrough')],
'verbose': False,
'minmaxscaler': MinMaxScaler(),
'simpleimputer': SimpleImputer(fill_value=-1, strategy='constant'),
'passthrough': 'passthrough',
'minmaxscaler__clip': False,
'minmaxscaler__copy': True,
'minmaxscaler__feature_range': (0, 1),
'simpleimputer__add_indicator': False,
'simpleimputer__copy': True,
'simpleimputer__fill_value': -1,
'simpleimputer__missing_values': nan,
'simpleimputer__strategy': 'constant',
'simpleimputer__verbose': 0}
[13]:
estimator.get_params()
[13]:
{'C': 1.0,
'class_weight': None,
'dual': False,
'fit_intercept': True,
'intercept_scaling': 1,
'l1_ratio': None,
'max_iter': 100,
'multi_class': 'auto',
'n_jobs': None,
'penalty': 'l2',
'random_state': None,
'solver': 'lbfgs',
'tol': 0.0001,
'verbose': 0,
'warm_start': False}
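The double-underscore keys in the preprocessing output follow scikit-learn's convention for nested parameters, <step name>__<parameter name>. As a minimal sketch of that convention in plain scikit-learn (independent of CaTabRa; the names pipe and clf are illustrative), parameters can be changed with set_params() and read back with get_params():

```python
from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression

pipe = make_pipeline(
    MinMaxScaler(),
    SimpleImputer(strategy='constant', fill_value=-1),
    'passthrough'
)

# nested parameter: step name, double underscore, parameter name
pipe.set_params(simpleimputer__strategy='mean')
print(pipe.named_steps['simpleimputer'].strategy)  # mean

# a plain estimator has no nesting, so the parameter name is used directly
clf = LogisticRegression()
clf.set_params(max_iter=500)
print(clf.get_params()['max_iter'])  # 500
```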
Hyperparameters can be configured by adding corresponding entries to the config dict. Keys must be prefixed by "logreg_preprocessing__" and "logreg_estimator__", respectively:
[16]:
analyze(
    X,
    classify='diagnosis',   # name of column containing classification target
    split='train',          # name of column containing information about the train-test split (optional)
    time=None,              # specifying a time budget has no effect on fixed pipelines
    out='logreg_example_configured',
    config={
        'automl': 'logreg',     # name of the "AutoML" backend (in this case it's a fixed pipeline)
        'binary_classification_metrics': ['accuracy', 'roc_auc'],
        'logreg_preprocessing__simpleimputer__strategy': 'mean',    # impute missing values with feature-wise mean
        'logreg_estimator__penalty': 'none',                        # don't regularize (newer scikit-learn versions expect None instead of 'none')
        'logreg_estimator__max_iter': 500                           # increase number of iterations
    }
)
Output folder "/mnt/c/Users/amaletzk/Documents/CaTabRa/catabra/examples/logreg_example_configured" already exists. Delete? [y/n] y
[CaTabRa] ### Analysis started at 2023-03-08 15:00:22.444121
[CaTabRa] Saving descriptive statistics completed
[CaTabRa] Using AutoML-backend logreg for binary_classification
[CaTabRa warning] Could not set number of jobs of Pipeline preprocessing to 1.
[CaTabRa] Final training statistics:
n_models_trained: 1
[CaTabRa] Creating shap explainer
[CaTabRa] Initialized out-of-distribution detector of type BinsDetector
[CaTabRa] Fitting out-of-distribution detector...
[CaTabRa] Out-of-distribution detector fitted.
[CaTabRa] ### Analysis finished at 2023-03-08 15:00:24.329333
[CaTabRa] ### Elapsed time: 0 days 00:00:01.885212
[CaTabRa] ### Output saved in /mnt/c/Users/amaletzk/Documents/CaTabRa/catabra/examples/logreg_example_configured
[CaTabRa] ### Evaluation started at 2023-03-08 15:00:24.334280
[CaTabRa] Saving descriptive statistics completed
[CaTabRa] Predicting out-of-distribution samples.
[CaTabRa] Saving descriptive statistics completed
[CaTabRa] Predicting out-of-distribution samples.
[CaTabRa] Evaluation results for train:
accuracy @ 0.5: 1.0
roc_auc: 1.0
[CaTabRa] Evaluation results for not_train:
accuracy @ 0.5: 0.9557522123893806
roc_auc: 0.9712643678160919
[CaTabRa] ### Evaluation finished at 2023-03-08 15:00:28.631267
[CaTabRa] ### Elapsed time: 0 days 00:00:04.296987
[CaTabRa] ### Output saved in /mnt/c/Users/amaletzk/Documents/CaTabRa/catabra/examples/logreg_example_configured/eval
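When many hyperparameters are configured, writing out the prefixes by hand becomes repetitive. A small helper can build the prefixed keys instead. The function prefixed_params below is hypothetical and not part of CaTabRa; it only assembles a plain dict following the key format described above.

```python
def prefixed_params(prefix, **params):
    """Hypothetical helper: prepend '<prefix>__' to every hyperparameter name."""
    return {f'{prefix}__{name}': value for name, value in params.items()}


config = {
    'automl': 'logreg',
    'binary_classification_metrics': ['accuracy', 'roc_auc'],
    **prefixed_params('logreg_preprocessing', simpleimputer__strategy='mean'),
    **prefixed_params('logreg_estimator', max_iter=500),
}

for key, value in config.items():
    print(key, '=', value)
```

The resulting dict contains the same kind of keys as the config passed to analyze() above and could be used in its place.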
Bottom Line
Although it would be technically possible to incorporate hyperparameter optimization into fixed pipelines by utilizing sklearn.model_selection.GridSearchCV and related concepts, we strongly recommend implementing a proper AutoML backend instead. Refer to Add New AutoML Backend for information on how this works.
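For illustration only, this is roughly what the GridSearchCV approach mentioned above looks like in plain scikit-learn. The sketch is not tested with CaTabRa; that the resulting object could be passed as estimator to register_backend() without adjustments is an assumption. The grid values and variable names are made up, and scaling the whole dataset up front is done purely to keep the example short.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import MinMaxScaler

features, labels = load_breast_cancer(return_X_y=True)
features = MinMaxScaler().fit_transform(features)  # demonstration only

# GridSearchCV is itself an estimator that wraps another estimator
search = GridSearchCV(
    LogisticRegression(max_iter=500),
    param_grid={'C': [0.1, 1.0, 10.0]},  # made-up grid
    cv=3,
)
search.fit(features, labels)

print(search.best_params_)                # selected value of C
print(search.predict_proba(features[:5]))  # probabilities from the refitted best model
```

Since the search object exposes fit(), predict(), predict_proba(), get_params() and set_params(), it matches the interface required of estimator objects, but the recommendation to implement a proper AutoML backend still applies.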