Add New AutoML Backend

This notebook is part of the CaTabRa GitHub repository.

This short example demonstrates how a new AutoML backend can be added to CaTabRa, i.e.,

how it can be implemented, and
how it can be utilized in CaTabRa’s data analysis workflow.

It also briefly explains how the existing auto-sklearn backend can be extended without having to add a new backend from scratch.

For the related question of how to conveniently use a fixed ML pipeline (without hyperparameter optimization), refer to this example.

Implement Random Search

We implement a simple random search over a fixed, non-configurable parameter grid.
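As a minimal, library-free sketch of what "random search over a fixed parameter grid" means: enumerate all combinations of the grid, draw a few of them at random, and keep the best-scoring one. The grid, the helper names `sample_configurations` and `random_search`, and the toy objective below are made up for illustration; the backend implemented in this example delegates this work to scikit-learn instead.

```python
import itertools
import random

# Hypothetical toy grid; names and values are illustrative only.
param_grid = {
    'n_estimators': [10, 50, 100],
    'max_depth': [None, 4, 10],
    'criterion': ['gini', 'entropy'],
}


def sample_configurations(grid: dict, n_iter: int, seed: int = 0) -> list:
    """Draw up to `n_iter` distinct configurations uniformly at random from `grid`."""
    keys = sorted(grid)
    all_configs = [
        dict(zip(keys, values))
        for values in itertools.product(*(grid[k] for k in keys))
    ]
    rng = random.Random(seed)
    return rng.sample(all_configs, min(n_iter, len(all_configs)))


def random_search(grid: dict, score, n_iter: int, seed: int = 0) -> tuple:
    """Return the best sampled configuration together with its score."""
    candidates = sample_configurations(grid, n_iter, seed)
    best = max(candidates, key=score)
    return best, score(best)


def toy_score(config: dict) -> float:
    """Toy objective standing in for a cross-validated score (an assumption)."""
    depth_penalty = 0.0 if config['max_depth'] is None else 0.01
    return config['n_estimators'] / 100 - depth_penalty


# 18 iterations cover the whole 3 x 3 x 2 grid, so the optimum is always found.
best_config, best_score = random_search(param_grid, toy_score, n_iter=18)
print(best_config, best_score)
```

With fewer iterations than grid points, the search only sees a random subset and may miss the optimum, which is the usual trade-off between search budget and result quality.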

ATTENTION! This is an extremely reduced example that only serves demonstration purposes. It lacks many capabilities normally expected from CaTabRa AutoML backends, like

supporting different prediction tasks (not just binary and multiclass classification),
handling numerical and categorical features,
handling unlabeled samples,
supporting grouped splitting for internal validation,
taking time and memory constraints into account,
taking different optimization objectives into account,
logging the training process,
building ensembles,
etc.

If you intend to actually add a new AutoML backend, have a look at the implementation of the default auto-sklearn backend in `catabra.automl.askl.backend <https://github.com/risc-mi/catabra/tree/main/catabra/automl/askl/backend.py>`__.

[1]:

from typing import Optional
import numpy as np
import pandas as pd
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestClassifier

from catabra.automl.base import FittedEnsemble, AutoMLBackend

AutoML backends need to implement the abstract base class `catabra.automl.base.AutoMLBackend <https://github.com/risc-mi/catabra/tree/main/catabra/automl/base.py>`__. The main methods of interest are fit(), predict() and predict_proba().

[2]:

class RandomSearchBackend(AutoMLBackend):

    @property
    def name(self) -> str:
        return 'random_search'

    @property
    def model_ids_(self) -> list:
        return [0]    # there is only a single model, with ID 0

    def summary(self) -> dict:
        return {0: [' '.join(repr(s[1]).replace('\n', ' ').split()) for s in self.random_search_.best_estimator_.steps]}

    def training_history(self) -> pd.DataFrame:
        hist = pd.DataFrame(self.random_search_.cv_results_)
        hist.rename({'mean_test_score': 'val_score'}, axis=1, inplace=True)   # for plotting
        return hist

    def fitted_ensemble(self, ensemble_only: bool = True) -> FittedEnsemble:
        pip = self.random_search_.best_estimator_
        return FittedEnsemble(
            task=self.task,
            models={
                0: dict(preprocessing=pip.steps[0][1], estimator=pip.steps[1][1])
            }
        )

    def fit(self, x_train: pd.DataFrame, y_train: pd.DataFrame, groups: Optional[np.ndarray] = None,
            sample_weights: Optional[np.ndarray] = None, time: Optional[int] = None, jobs: Optional[int] = None,
            dataset_name: Optional[str] = None, monitor=None) -> 'RandomSearchBackend':

        assert self.task in ('binary_classification', 'multiclass_classification')
        assert y_train.notna().all().all()
        assert groups is None
        assert sample_weights is None

        metrics = self.config.get(self.task + '_metrics', [])
        assert len(metrics) == 0 or metrics[0] == 'accuracy'

        pip = Pipeline(
            [
                ('imputer', SimpleImputer()),
                ('classifier', RandomForestClassifier())
            ]
        )

        param_dist = {
            'imputer__strategy': ['mean', 'median', 'most_frequent', 'constant'],
            'imputer__add_indicator': [True, False],
            'classifier__n_estimators': [10, 20, 50, 80, 100, 150, 200],
            'classifier__criterion': ['gini', 'entropy'],
            'classifier__max_depth': [None, 4, 10],
            'classifier__class_weight': [None, 'balanced', 'balanced_subsample'],
        }

        if time is None:
            time = 1

        # abuse `time` as number of iterations
        n_iter = time

        self.random_search_ = RandomizedSearchCV(pip, param_distributions=param_dist, n_iter=n_iter, refit=True)
        self.random_search_.fit(x_train.values, y_train.values[:, 0])

        return self

    def predict(self, x: pd.DataFrame, jobs: Optional[int] = None, batch_size: Optional[int] = None,
                model_id=None, calibrated: bool = 'auto') -> np.ndarray:
        return self.random_search_.predict(x.values)    # fit() saw a plain array, so pass one here, too

    def predict_proba(self, x: pd.DataFrame, jobs: Optional[int] = None, batch_size: Optional[int] = None,
                      model_id=None, calibrated: bool = 'auto') -> np.ndarray:
        return self.random_search_.predict_proba(x.values)    # fit() saw a plain array, so pass one here, too

    def predict_all(self, x: pd.DataFrame, jobs: Optional[int] = None, batch_size: Optional[int] = None) -> dict:
        return {0: self.predict(x, jobs=jobs, batch_size=batch_size)}

    def predict_proba_all(self, x: pd.DataFrame, jobs: Optional[int] = None, batch_size: Optional[int] = None) -> dict:
        return {0: self.predict_proba(x, jobs=jobs, batch_size=batch_size)}

    def get_versions(self) -> dict:
        return {}

[3]:

AutoMLBackend.register('random_search', RandomSearchBackend)
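To illustrate the idea behind registering a backend under a name, here is a self-contained sketch of a name-to-class registry. It is not CaTabRa's actual implementation: the classes `BackendBase` and `ToyBackend` and the lookup helper `get()` are hypothetical, and only the call pattern `register(name, cls)` mirrors the cell above.

```python
from typing import Dict, Optional, Type


class BackendBase:
    """Hypothetical stand-in for an AutoML backend base class (not CaTabRa's code)."""

    _registry: Dict[str, Type['BackendBase']] = {}

    def __init__(self, task: Optional[str] = None):
        self.task = task

    @classmethod
    def register(cls, name: str, backend: Type['BackendBase']) -> None:
        """Make `backend` available under the given name."""
        cls._registry[name] = backend

    @classmethod
    def get(cls, name: str, **kwargs) -> 'BackendBase':
        """Instantiate the backend registered under `name` (hypothetical helper)."""
        try:
            backend = cls._registry[name]
        except KeyError:
            known = sorted(cls._registry)
            raise ValueError(f'Unknown backend "{name}"; known: {known}') from None
        return backend(**kwargs)


class ToyBackend(BackendBase):
    name = 'toy'


BackendBase.register('toy', ToyBackend)
backend = BackendBase.get('toy', task='binary_classification')
print(type(backend).__name__, backend.task)  # ToyBackend binary_classification
```

A registry of this kind lets a configuration value, such as the string passed under the 'automl' key later in this example, select an implementation without the calling code importing it directly.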

Utilize Random Search

[4]:

# load dataset
from sklearn.datasets import load_breast_cancer
X, y = load_breast_cancer(as_frame=True, return_X_y=True)

[5]:

# add target labels to DataFrame
X['diagnosis'] = y

[6]:

# split into train and test sets by adding a boolean column
# the name of the column is arbitrary; CaTabRa tries to "guess" which samples belong to which set based on the column name and values
X['train'] = X.index <= 0.8 * len(X)
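The effect of this index-based rule can be checked with plain Python. The sketch assumes the DataFrame still has its default integer index 0, 1, ..., 568, which is the case when the 569-sample breast cancer dataset is loaded as above and not re-indexed.

```python
n_samples = 569                    # rows in scikit-learn's breast cancer dataset
index = range(n_samples)           # default integer index 0, 1, ..., 568

# Same rule as in the cell above: a sample is in the training set
# if its index does not exceed 80% of the number of samples.
is_train = [i <= 0.8 * n_samples for i in index]

n_train = sum(is_train)
n_test = n_samples - n_train
print(n_train, n_test)  # 456 113
```

Note that the split is purely positional: the first 456 rows form the training set and the remaining 113 rows the test set, without shuffling or stratification.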

When analyzing the data, we inform CaTabRa that we want to use the "random_search" backend by adjusting the config dict:

[7]:

from catabra.analysis import analyze

analyze(
    X,
    classify='diagnosis',     # name of column containing classification target
    split='train',            # name of column containing information about the train-test split (optional)
    time=20,                  # ONLY IN THIS CASE: number of random search iterations
    out='random_search_example',
    config={
        'automl': 'random_search',     # name of the AutoML backend
        'binary_classification_metrics': ['accuracy', 'roc_auc'],
    }
)

[CaTabRa] ### Analysis started at 2023-02-09 09:30:27.137817
[CaTabRa] Saving descriptive statistics completed
[CaTabRa] Using AutoML-backend random_search for binary_classification
[CaTabRa] Final training statistics:
    n_models_trained: 20
[CaTabRa] Creating shap explainer
[CaTabRa] Initialized out-of-distribution detector of type Autoencoder
[CaTabRa] Fitting out-of-distribution detector...
Iteration 1, loss = 0.06783438
Iteration 2, loss = 0.03997528
Iteration 3, loss = 0.02633058
Iteration 4, loss = 0.01948884
Iteration 5, loss = 0.01487442
Iteration 6, loss = 0.01228704
Iteration 7, loss = 0.01144362
Iteration 8, loss = 0.01063012
Iteration 9, loss = 0.00981005
Iteration 10, loss = 0.00913160
Iteration 11, loss = 0.00833614
Iteration 12, loss = 0.00764720
Iteration 13, loss = 0.00714880
Iteration 14, loss = 0.00660951
Iteration 15, loss = 0.00632128
Iteration 16, loss = 0.00613749
Iteration 17, loss = 0.00583286
Iteration 18, loss = 0.00577213
Iteration 19, loss = 0.00582528
Iteration 20, loss = 0.00698503
Iteration 21, loss = 0.00653891
Iteration 22, loss = 0.00593587
Iteration 23, loss = 0.00592284
Iteration 24, loss = 0.00581431
Iteration 25, loss = 0.00573134
Iteration 26, loss = 0.00559525
Iteration 27, loss = 0.00543705
Iteration 28, loss = 0.00539677
Iteration 29, loss = 0.00539616
Iteration 30, loss = 0.00539520
Iteration 31, loss = 0.00533531
Iteration 32, loss = 0.00531575
Iteration 33, loss = 0.00529221
Iteration 34, loss = 0.00524817
Iteration 35, loss = 0.00523697
Iteration 36, loss = 0.00521718
Iteration 37, loss = 0.00521684
Iteration 38, loss = 0.00520618
Iteration 39, loss = 0.00520097
Iteration 40, loss = 0.00520765
Iteration 41, loss = 0.00520148
Iteration 42, loss = 0.00519837
Iteration 43, loss = 0.00518533
Iteration 44, loss = 0.00518255
Iteration 45, loss = 0.00518003
Iteration 46, loss = 0.00517438
Iteration 47, loss = 0.00517886
Iteration 48, loss = 0.00518837
Iteration 49, loss = 0.00516961
Iteration 50, loss = 0.00519560
Iteration 51, loss = 0.00516057
Iteration 52, loss = 0.00517097
Iteration 53, loss = 0.00515444
Iteration 54, loss = 0.00515273
Iteration 55, loss = 0.00514750
Iteration 56, loss = 0.00514033
Iteration 57, loss = 0.00514505
Iteration 58, loss = 0.00514759
Iteration 59, loss = 0.00515513
Iteration 60, loss = 0.00514548
Iteration 61, loss = 0.00513684
Iteration 62, loss = 0.00512496
Iteration 63, loss = 0.00513075
Iteration 64, loss = 0.00512894
Iteration 65, loss = 0.00511904
Iteration 66, loss = 0.00512427
Iteration 67, loss = 0.00512083
Iteration 68, loss = 0.00512339
Iteration 69, loss = 0.00511534
Iteration 70, loss = 0.00510811
Iteration 71, loss = 0.00514462
Iteration 72, loss = 0.00512639
Iteration 73, loss = 0.00512267
Iteration 74, loss = 0.00514646
Iteration 75, loss = 0.00511554
Iteration 76, loss = 0.00511430
Iteration 77, loss = 0.00511921
Iteration 78, loss = 0.00511720
Training loss did not improve more than tol=0.000100 for 50 consecutive epochs. Stopping.
[CaTabRa] Out-of-distribution detector fitted.
[CaTabRa] ### Analysis finished at 2023-02-09 09:30:41.778586
[CaTabRa] ### Elapsed time: 0 days 00:00:14.640769
[CaTabRa] ### Output saved in /mnt/c/Users/amaletzk/Documents/CaTabRa/catabra/examples/random_search_example
[CaTabRa] ### Evaluation started at 2023-02-09 09:30:41.827327
[CaTabRa] Predicting out-of-distribution samples.
[CaTabRa] Saving descriptive statistics completed
[CaTabRa] Saving descriptive statistics completed
[CaTabRa] Evaluation results for train:
    accuracy @ 0.5: 0.9978070175438597
    roc_auc: 0.9999999999999999
[CaTabRa] Evaluation results for not_train:
    accuracy @ 0.5: 0.9734513274336283
    roc_auc: 0.9982316534040672
[CaTabRa] ### Evaluation finished at 2023-02-09 09:30:46.508033
[CaTabRa] ### Elapsed time: 0 days 00:00:04.680706
[CaTabRa] ### Output saved in /mnt/c/Users/amaletzk/Documents/CaTabRa/catabra/examples/random_search_example/eval
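The "accuracy @ 0.5" entries in the log presumably denote accuracy after thresholding the predicted positive-class probabilities at 0.5. That reading of the label, the helper name `accuracy_at_threshold`, and the use of `>=` rather than `>` are assumptions made for this sketch; the data are made up.

```python
def accuracy_at_threshold(y_true, proba_positive, threshold=0.5):
    """Fraction of samples whose thresholded probability matches the true label."""
    # Assumption: a probability equal to the threshold counts as positive.
    y_pred = [int(p >= threshold) for p in proba_positive]
    n_correct = sum(int(t == p) for t, p in zip(y_true, y_pred))
    return n_correct / len(y_true)


# Made-up labels and probabilities, for illustration only.
y_true = [0, 1, 1, 0, 1]
proba = [0.2, 0.9, 0.4, 0.6, 0.7]

print(accuracy_at_threshold(y_true, proba))        # 0.6
print(accuracy_at_threshold(y_true, proba, 0.35))  # 0.8
```

In contrast, a ranking metric like ROC-AUC does not depend on any single threshold, which is why it is reported without one.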

After implementing the (simplistic) new AutoML backend in a few lines of code, CaTabRa takes care of everything else: calculating descriptive statistics, splitting the data into a training and a test set, training a classifier and an OOD detector, and evaluating the classifier on both the training and the test set (including visualizations).

We can inspect the training history and the model summary:

[8]:

from catabra.util import io
training_history = io.read_df('random_search_example/training_history.xlsx')
model_summary = io.load('random_search_example/model_summary.json')

[15]:

training_history.drop('Unnamed: 0', axis=1).sort_values('rank_test_score').head()

[15]:

	mean_fit_time	std_fit_time	mean_score_time	std_score_time	param_imputer__strategy	param_imputer__add_indicator	param_classifier__n_estimators	param_classifier__max_depth	param_classifier__criterion	param_classifier__class_weight	params	split0_test_score	split1_test_score	split2_test_score	split3_test_score	split4_test_score	val_score	std_test_score	rank_test_score
10	0.102886	0.002096	0.007032	0.000338	median	False	100	NaN	entropy	NaN	{'imputer__strategy': 'median', 'imputer__add_...	0.945652	0.956044	0.967033	0.956044	1.000000	0.964955	0.018782	1
18	0.124079	0.002483	0.006806	0.000152	median	False	100	NaN	entropy	balanced_subsample	{'imputer__strategy': 'median', 'imputer__add_...	0.945652	0.956044	0.978022	0.967033	0.967033	0.962757	0.011020	2
5	0.082439	0.000397	0.005633	0.000063	most_frequent	False	80	NaN	gini	balanced	{'imputer__strategy': 'most_frequent', 'impute...	0.945652	0.956044	0.967033	0.967033	0.967033	0.960559	0.008583	3
14	0.025500	0.000564	0.001812	0.000055	median	True	20	10.0	gini	balanced_subsample	{'imputer__strategy': 'median', 'imputer__add_...	0.956522	0.945055	0.967033	0.956044	0.978022	0.960535	0.011171	4
9	0.082382	0.001896	0.005584	0.000116	median	True	80	NaN	entropy	balanced	{'imputer__strategy': 'median', 'imputer__add_...	0.923913	0.956044	0.967033	0.978022	0.967033	0.958409	0.018596	5

[10]:

model_summary

[10]:

{'0': ["SimpleImputer(strategy='median')",
  "RandomForestClassifier(criterion='entropy')"]}

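The one-line strings in the model summary come from the whitespace normalization in `summary()`: the `repr` of an estimator can span several lines, and joining the whitespace-separated tokens with single spaces collapses it. The multi-line string below is a hand-written stand-in for such a `repr`, not actual output.

```python
# Hand-written example of a multi-line estimator repr (illustrative only).
multi_line_repr = (
    "RandomForestClassifier(criterion='entropy',\n"
    "                       n_estimators=150)"
)

# Same expression as in summary(): replace newlines, split on whitespace, re-join.
one_line = ' '.join(multi_line_repr.replace('\n', ' ').split())
print(one_line)
# RandomForestClassifier(criterion='entropy', n_estimators=150)
```

Since `str.split()` without arguments already splits on any run of whitespace, including newlines, the `replace()` call is not strictly needed; it does no harm either.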
The classifier can be explained without further ado:

[14]:

from catabra.explanation import explain

explain(
    X,
    folder='random_search_example',
    from_invocation='random_search_example/invocation.json',
    out='random_search_example/explain'
)

[CaTabRa] ### Explanation started at 2023-02-09 09:40:19.095028
[CaTabRa] *** Split train
Sample batches: 100%|########################################| 15/15 [00:00<00:00, 275.13it/s]
[CaTabRa] *** Split not_train
Sample batches: 100%|########################################| 4/4 [00:00<00:00, 152.09it/s]
[CaTabRa] ### Explanation finished at 2023-02-09 09:40:21.726130
[CaTabRa] ### Elapsed time: 0 days 00:00:02.631102
[CaTabRa] ### Output saved in /mnt/c/Users/amaletzk/Documents/CaTabRa/catabra/examples/random_search_example/explain

Extend Existing Auto-Sklearn Backend

The existing auto-sklearn backend can be easily extended with new components, for instance, for data preprocessing, feature engineering, and predictive modeling. This is independent of CaTabRa and documented on the official auto-sklearn website, with examples. Additionally, you can check out `catabra.automl.askl.addons.xgb <https://github.com/risc-mi/catabra/tree/main/catabra/automl/askl/addons/xgb.py>`__ for details about how CaTabRa adds XGBoost classifiers and regressors to auto-sklearn.