CaTabRa Workflow


This notebook is part of the CaTabRa GitHub repository.

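This tutorial demonstrates CaTabRa’s main workflow, in particular how it can be used to analyze data and train a classifier (Step 1), calibrate the classifier (Step 2), and evaluate it (Step 3).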

Prerequisites

[15]:
# generic package imports
from catabra.util import io
[16]:
# output directory (where all generated artifacts, like statistics, models, etc. are saved)
output_dir = 'workflow'

Step 0: Prepare Data

We are going to work with the breast cancer dataset, a well-known binary classification dataset.

CaTabRa assumes a table in the usual \(samples \times attributes\) format as input, where the attributes encompass features, target labels, and possibly additional information like a predefined train-test split.

[17]:
# load dataset
from sklearn.datasets import load_breast_cancer
X, y = load_breast_cancer(as_frame=True, return_X_y=True)
[18]:
# add target labels to DataFrame
X['diagnosis'] = y
[19]:
# split into train- and test set by adding column with corresponding values
# the name of the column is arbitrary; CaTabRa tries to "guess" which samples belong to which set based on the column name and -values
X['train'] = X.index <= 0.8 * len(X)
[20]:
X.head()
[20]:
mean radius mean texture mean perimeter mean area mean smoothness mean compactness mean concavity mean concave points mean symmetry mean fractal dimension ... worst perimeter worst area worst smoothness worst compactness worst concavity worst concave points worst symmetry worst fractal dimension diagnosis train
0 17.99 10.38 122.80 1001.0 0.11840 0.27760 0.3001 0.14710 0.2419 0.07871 ... 184.60 2019.0 0.1622 0.6656 0.7119 0.2654 0.4601 0.11890 0 True
1 20.57 17.77 132.90 1326.0 0.08474 0.07864 0.0869 0.07017 0.1812 0.05667 ... 158.80 1956.0 0.1238 0.1866 0.2416 0.1860 0.2750 0.08902 0 True
2 19.69 21.25 130.00 1203.0 0.10960 0.15990 0.1974 0.12790 0.2069 0.05999 ... 152.50 1709.0 0.1444 0.4245 0.4504 0.2430 0.3613 0.08758 0 True
3 11.42 20.38 77.58 386.1 0.14250 0.28390 0.2414 0.10520 0.2597 0.09744 ... 98.87 567.7 0.2098 0.8663 0.6869 0.2575 0.6638 0.17300 0 True
4 20.29 14.34 135.10 1297.0 0.10030 0.13280 0.1980 0.10430 0.1809 0.05883 ... 152.20 1575.0 0.1374 0.2050 0.4000 0.1625 0.2364 0.07678 0 True

5 rows × 32 columns

NOTE: The column specifying the train-test split may contain more than two values. For instance, the values "train", "val" and "test" would yield a three-way split with one training set and two test sets. Just make sure that the column name and values clearly indicate which set is the training set; the names of the remaining sets are arbitrary. Prediction models are evaluated on each set (including the training set) separately.
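As a minimal sketch of the three-way split described in the note, the following builds such a column on a toy table. The table, the column name "split" and the 60/20/20 proportions are assumptions made for illustration; they are not taken from the tutorial.

```python
import numpy as np
import pandas as pd

# Toy table standing in for X (hypothetical data, 10 samples).
df = pd.DataFrame({"feature": np.arange(10, dtype=float), "label": [0, 1] * 5})

# Assumed proportions: first 6 rows -> "train", next 2 -> "val", last 2 -> "test".
n_train, n_val = 6, 2
position = np.arange(len(df))
df["split"] = np.select(
    [position < n_train, position < n_train + n_val],
    ["train", "val"],
    default="test",
)

print(df["split"].value_counts().to_dict())
```

The resulting column name could then be passed as the split argument, in the same way "train" is passed in this tutorial.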

Step 1: Analyze Data and Train Classifier

Analyze the prepared data X. Only one simple function call is required to produce descriptive statistics, a high-quality classifier with automatically tuned hyperparameters, and an Out-of-Distribution detector.

The corresponding command in CaTabRa’s command-line interface is called catabra analyze ....

[21]:
from catabra.analysis import analyze

analyze(
    X,                        # table to analyze; can also be the path to a CSV/Excel/HDF5 file
    classify='diagnosis',     # name of column containing classification target
    split='train',            # name of column containing information about the train-test split (optional)
    time=3,                   # time budget for hyperparameter tuning, in minutes (optional)
    out=output_dir
)
Output folder "/mnt/c/Users/skaltenl/Documents/catabra_2023/develop/catabra/examples/workflow" already exists. Delete?
[CaTabRa] ### Analysis started at 2023-04-13 15:00:41.878108
[CaTabRa] Saving descriptive statistics completed
/mnt/c/Users/skaltenl/Documents/catabra_2023/develop/catabra/catabra/util/statistics.py:213: FutureWarning: The default value of numeric_only in DataFrame.corr is deprecated. In a future version, it will default to False. Select only valid columns or specify the value of numeric_only to silence this warning.
  return dict_stat, dict_non_num_stat, (df.corr() if df.shape[1] <= corr_threshold else None)
[CaTabRa] Using AutoML-backend auto-sklearn for binary_classification
[CaTabRa] Successfully loaded the following auto-sklearn add-on module(s): xgb
[CaTabRa] Using auto-sklearn 2.0.
/home/skaltenl/anaconda3/envs/test2/lib/python3.9/site-packages/autosklearn/experimental/selector.py:24: FutureWarning: iteritems is deprecated and will be removed in a future version. Use .items instead.
  for col, series in prediction.iteritems():
/home/skaltenl/anaconda3/envs/test2/lib/python3.9/site-packages/smac/intensification/parallel_scheduling.py:153: UserWarning: SuccessiveHalving is executed with 1 workers only. Consider to use pynisher to use all available workers.
  warnings.warn(
[CaTabRa] New ensemble fitted:
    ensemble_val_roc_auc: 0.986260
    n_constituent_models: 1
    total_elapsed_time: 00:03
[CaTabRa] New model #1 trained:
    val_roc_auc: 0.989845
    val_accuracy: 0.947368
    val_balanced_accuracy: 0.946356
    train_roc_auc: 1.000000
    type: gradient_boosting
    total_elapsed_time: 00:03
[CaTabRa] New ensemble fitted:
    ensemble_val_roc_auc: 0.986260
    n_constituent_models: 1
    total_elapsed_time: 00:06
[CaTabRa] New model #2 trained:
    val_roc_auc: 0.945430
    val_accuracy: 0.921053
    val_balanced_accuracy: 0.924134
    train_roc_auc: 1.000000
    type: gradient_boosting
    total_elapsed_time: 00:05
[CaTabRa] New ensemble fitted:
    ensemble_val_roc_auc: 0.986260
    n_constituent_models: 1
    total_elapsed_time: 00:08
[CaTabRa] New model #3 trained:
    val_roc_auc: 0.971416
    val_accuracy: 0.921053
    val_balanced_accuracy: 0.919952
    train_roc_auc: 0.993877
    type: gradient_boosting
    total_elapsed_time: 00:08
[CaTabRa] New ensemble fitted:
    ensemble_val_roc_auc: 0.987834
    n_constituent_models: 3
    total_elapsed_time: 00:11
[CaTabRa] New model #4 trained:
    val_roc_auc: 0.968250
    val_accuracy: 0.929825
    val_balanced_accuracy: 0.926523
    train_roc_auc: 0.995034
    type: gradient_boosting
    total_elapsed_time: 00:10
[CaTabRa] New ensemble fitted:
    ensemble_val_roc_auc: 0.996953
    n_constituent_models: 1
    total_elapsed_time: 00:14
[CaTabRa] New model #5 trained:
    val_roc_auc: 0.997073
    val_accuracy: 0.971491
    val_balanced_accuracy: 0.970072
    train_roc_auc: 0.999985
    type: mlp
    total_elapsed_time: 00:14
[CaTabRa] New ensemble fitted:
    ensemble_val_roc_auc: 0.996953
    n_constituent_models: 1
    total_elapsed_time: 00:16
[CaTabRa] New model #6 trained:
    val_roc_auc: 0.955048
    val_accuracy: 0.914474
    val_balanced_accuracy: 0.915233
    train_roc_auc: 0.986036
    type: gradient_boosting
    total_elapsed_time: 00:16
[CaTabRa] New ensemble fitted:
    ensemble_val_roc_auc: 0.996953
    n_constituent_models: 1
    total_elapsed_time: 00:19
[CaTabRa] New model #7 trained:
    val_roc_auc: 0.990054
    val_accuracy: 0.949561
    val_balanced_accuracy: 0.946535
    train_roc_auc: 1.000000
    type: gradient_boosting
    total_elapsed_time: 00:18
[CaTabRa] New ensemble fitted:
    ensemble_val_roc_auc: 0.997172
    n_constituent_models: 2
    total_elapsed_time: 00:21
[CaTabRa] New model #8 trained:
    val_roc_auc: 0.995579
    val_accuracy: 0.949561
    val_balanced_accuracy: 0.954898
    train_roc_auc: 0.996864
    type: mlp
    total_elapsed_time: 00:21
[CaTabRa] New ensemble fitted:
    ensemble_val_roc_auc: 0.997172
    n_constituent_models: 2
    total_elapsed_time: 00:24
[CaTabRa] New model #9 trained:
    val_roc_auc: 0.990352
    val_accuracy: 0.969298
    val_balanced_accuracy: 0.967384
    train_roc_auc: 0.999701
    type: mlp
    total_elapsed_time: 00:23
[CaTabRa] New ensemble fitted:
    ensemble_val_roc_auc: 0.997172
    n_constituent_models: 2
    total_elapsed_time: 00:26
[CaTabRa] New ensemble fitted:
    ensemble_val_roc_auc: 0.997172
    n_constituent_models: 2
    total_elapsed_time: 00:28
[CaTabRa] New model #10 trained:
    val_roc_auc: 0.988949
    val_accuracy: 0.936404
    val_balanced_accuracy: 0.937933
    train_roc_auc: 0.999955
    type: mlp
    total_elapsed_time: 00:28
[CaTabRa] New ensemble fitted:
    ensemble_val_roc_auc: 0.997172
    n_constituent_models: 2
    total_elapsed_time: 00:31
[CaTabRa] New model #11 trained:
    val_roc_auc: 0.992861
    val_accuracy: 0.964912
    val_balanced_accuracy: 0.962007
    train_roc_auc: 1.000000
    type: extra_trees
    total_elapsed_time: 00:30
[CaTabRa] New ensemble fitted:
    ensemble_val_roc_auc: 0.997172
    n_constituent_models: 2
    total_elapsed_time: 00:33
[CaTabRa] New model #12 trained:
    val_roc_auc: 0.991756
    val_accuracy: 0.953947
    val_balanced_accuracy: 0.951912
    train_roc_auc: 1.000000
    type: gradient_boosting
    total_elapsed_time: 00:33
[CaTabRa] New ensemble fitted:
    ensemble_val_roc_auc: 0.997172
    n_constituent_models: 2
    total_elapsed_time: 00:35
[CaTabRa] New ensemble fitted:
    ensemble_val_roc_auc: 0.997172
    n_constituent_models: 2
    total_elapsed_time: 00:39
[CaTabRa] New model #13 trained:
    val_roc_auc: 0.995639
    val_accuracy: 0.964912
    val_balanced_accuracy: 0.961171
    train_roc_auc: 0.999044
    type: mlp
    total_elapsed_time: 00:38
[CaTabRa] New ensemble fitted:
    ensemble_val_roc_auc: 0.997172
    n_constituent_models: 2
    total_elapsed_time: 00:41
[CaTabRa] New model #14 trained:
    val_roc_auc: 0.993429
    val_accuracy: 0.967105
    val_balanced_accuracy: 0.964695
    train_roc_auc: 0.999836
    type: extra_trees
    total_elapsed_time: 00:40
[CaTabRa] New ensemble fitted:
    ensemble_val_roc_auc: 0.997172
    n_constituent_models: 2
    total_elapsed_time: 00:45
[CaTabRa] New model #15 trained:
    val_roc_auc: 0.958393
    val_accuracy: 0.888158
    val_balanced_accuracy: 0.889665
    train_roc_auc: 0.973305
    type: gradient_boosting
    total_elapsed_time: 00:47
[CaTabRa] New model #16 trained:
    val_roc_auc: 0.989934
    val_accuracy: 0.947368
    val_balanced_accuracy: 0.944683
    train_roc_auc: 0.997961
    type: gradient_boosting
    total_elapsed_time: 00:49
[CaTabRa] New model #17 trained:
    val_roc_auc: 0.969355
    val_accuracy: 0.866228
    val_balanced_accuracy: 0.863620
    train_roc_auc: 0.969444
    type: mlp
    total_elapsed_time: 00:52
[CaTabRa] New ensemble fitted:
    ensemble_val_roc_auc: 0.997172
    n_constituent_models: 2
    total_elapsed_time: 00:54
[CaTabRa] New model #18 trained:
    val_roc_auc: 0.990382
    val_accuracy: 0.942982
    val_balanced_accuracy: 0.940143
    train_roc_auc: 0.999806
    type: gradient_boosting
    total_elapsed_time: 00:54
[CaTabRa] New model #19 trained:
    val_roc_auc: 0.920311
    val_accuracy: 0.875000
    val_balanced_accuracy: 0.850956
    train_roc_auc: 0.924418
    type: mlp
    total_elapsed_time: 00:56
[CaTabRa] New model #20 trained:
    val_roc_auc: 0.962873
    val_accuracy: 0.903509
    val_balanced_accuracy: 0.902628
    train_roc_auc: 0.974208
    type: gradient_boosting
    total_elapsed_time: 00:58
[CaTabRa] New ensemble fitted:
    ensemble_val_roc_auc: 0.997172
    n_constituent_models: 2
    total_elapsed_time: 01:00
[CaTabRa] New model #21 trained:
    val_roc_auc: 0.992294
    val_accuracy: 0.949561
    val_balanced_accuracy: 0.946535
    train_roc_auc: 1.000000
    type: gradient_boosting
    total_elapsed_time: 01:00
[CaTabRa] New ensemble fitted:
    ensemble_val_roc_auc: 0.997172
    n_constituent_models: 2
    total_elapsed_time: 01:03
[CaTabRa] New model #22 trained:
    val_roc_auc: 0.992234
    val_accuracy: 0.945175
    val_balanced_accuracy: 0.941995
    train_roc_auc: 1.000000
    type: gradient_boosting
    total_elapsed_time: 01:03
[CaTabRa] New model #23 trained:
    val_roc_auc: 0.990024
    val_accuracy: 0.945175
    val_balanced_accuracy: 0.947013
    train_roc_auc: 0.999895
    type: mlp
    total_elapsed_time: 01:05
[CaTabRa] New model #24 trained:
    val_roc_auc: 0.979271
    val_accuracy: 0.763158
    val_balanced_accuracy: 0.799164
    train_roc_auc: 0.980526
    type: sgd
    total_elapsed_time: 01:07
[CaTabRa] New ensemble fitted:
    ensemble_val_roc_auc: 0.997172
    n_constituent_models: 2
    total_elapsed_time: 01:10
[CaTabRa] New model #25 trained:
    val_roc_auc: 0.994982
    val_accuracy: 0.951754
    val_balanced_accuracy: 0.941697
    train_roc_auc: 0.996789
    type: mlp
    total_elapsed_time: 01:10
[CaTabRa] New ensemble fitted:
    ensemble_val_roc_auc: 0.997172
    n_constituent_models: 2
    total_elapsed_time: 01:13
[CaTabRa] New model #26 trained:
    val_roc_auc: 0.994385
    val_accuracy: 0.967105
    val_balanced_accuracy: 0.966368
    train_roc_auc: 0.999940
    type: mlp
    total_elapsed_time: 01:13
[CaTabRa] New model #27 trained:
    val_roc_auc: 0.991099
    val_accuracy: 0.962719
    val_balanced_accuracy: 0.960155
    train_roc_auc: 1.000000
    type: extra_trees
    total_elapsed_time: 01:16
[CaTabRa] New model #28 trained:
    val_roc_auc: 0.928614
    val_accuracy: 0.620614
    val_balanced_accuracy: 0.673775
    train_roc_auc: 0.927225
    type: mlp
    total_elapsed_time: 01:18
[CaTabRa] New model #29 trained:
    val_roc_auc: 0.981123
    val_accuracy: 0.912281
    val_balanced_accuracy: 0.909200
    train_roc_auc: 0.992891
    type: random_forest
    total_elapsed_time: 01:20
[CaTabRa] New model #30 trained:
    val_roc_auc: 0.935364
    val_accuracy: 0.894737
    val_balanced_accuracy: 0.890203
    train_roc_auc: 0.950508
    type: mlp
    total_elapsed_time: 01:22
[CaTabRa] New model #31 trained:
    val_roc_auc: 0.991517
    val_accuracy: 0.951754
    val_balanced_accuracy: 0.950060
    train_roc_auc: 1.000000
    type: gradient_boosting
    total_elapsed_time: 01:28
[CaTabRa] New ensemble fitted:
    ensemble_val_roc_auc: 0.997172
    n_constituent_models: 2
    total_elapsed_time: 01:32
[CaTabRa] New model #32 trained:
    val_roc_auc: 0.984648
    val_accuracy: 0.921053
    val_balanced_accuracy: 0.917443
    train_roc_auc: 0.996364
    type: random_forest
    total_elapsed_time: 01:36
[CaTabRa] New model #33 trained:
    val_roc_auc: 0.991428
    val_accuracy: 0.953947
    val_balanced_accuracy: 0.953584
    train_roc_auc: 1.000000
    type: mlp
    total_elapsed_time: 01:42
[CaTabRa] New model #34 trained:
    val_roc_auc: 0.963501
    val_accuracy: 0.962719
    val_balanced_accuracy: 0.963501
    train_roc_auc: 0.977121
    type: mlp
    total_elapsed_time: 01:46
[CaTabRa] New model #35 trained:
    val_roc_auc: 0.948566
    val_accuracy: 0.848684
    val_balanced_accuracy: 0.860514
    train_roc_auc: 0.950060
    type: mlp
    total_elapsed_time: 01:50
[CaTabRa] New model #36 trained:
    val_roc_auc: 0.984020
    val_accuracy: 0.914474
    val_balanced_accuracy: 0.911888
    train_roc_auc: 0.995550
    type: random_forest
    total_elapsed_time: 01:52
[CaTabRa] New model #37 trained:
    val_roc_auc: 0.500000
    val_accuracy: 0.469298
    val_balanced_accuracy: 0.500000
    train_roc_auc: 0.500000
    type: mlp
    total_elapsed_time: 01:55
[CaTabRa] New model #38 trained:
    val_roc_auc: 0.994295
    val_accuracy: 0.967105
    val_balanced_accuracy: 0.964695
    train_roc_auc: 0.999910
    type: mlp
    total_elapsed_time: 01:59
[CaTabRa] New model #39 trained:
    val_roc_auc: 0.986141
    val_accuracy: 0.938596
    val_balanced_accuracy: 0.933931
    train_roc_auc: 0.987963
    type: extra_trees
    total_elapsed_time: 02:02
[CaTabRa] New model #40 trained:
    val_roc_auc: 0.996894
    val_accuracy: 0.967105
    val_balanced_accuracy: 0.964695
    train_roc_auc: 0.999955
    type: mlp
    total_elapsed_time: 02:05
[CaTabRa] New model #41 trained:
    val_roc_auc: 0.988411
    val_accuracy: 0.969298
    val_balanced_accuracy: 0.966547
    train_roc_auc: 0.999925
    type: mlp
    total_elapsed_time: 02:09
[CaTabRa] New model #42 trained:
    val_roc_auc: 0.987276
    val_accuracy: 0.929825
    val_balanced_accuracy: 0.931541
    train_roc_auc: 1.000000
    type: gradient_boosting
    total_elapsed_time: 02:11
[CaTabRa] New model #43 trained:
    val_roc_auc: 0.981481
    val_accuracy: 0.947368
    val_balanced_accuracy: 0.948029
    train_roc_auc: 0.999627
    type: mlp
    total_elapsed_time: 02:18
[CaTabRa] New model #44 trained:
    val_roc_auc: 0.982796
    val_accuracy: 0.936404
    val_balanced_accuracy: 0.932915
    train_roc_auc: 0.983244
    type: mlp
    total_elapsed_time: 02:22
[CaTabRa] New model #45 trained:
    val_roc_auc: 0.985603
    val_accuracy: 0.912281
    val_balanced_accuracy: 0.898327
    train_roc_auc: 0.981825
    type: extra_trees
    total_elapsed_time: 02:25
[CaTabRa] New model #46 trained:
    val_roc_auc: 0.984767
    val_accuracy: 0.929825
    val_balanced_accuracy: 0.932378
    train_roc_auc: 0.986499
    type: passive_aggressive
    total_elapsed_time: 02:27
[CaTabRa] New model #47 trained:
    val_roc_auc: 0.989247
    val_accuracy: 0.942982
    val_balanced_accuracy: 0.943489
    train_roc_auc: 0.997805
    type: gradient_boosting
    total_elapsed_time: 02:29
[CaTabRa] New model #48 trained:
    val_roc_auc: 0.989904
    val_accuracy: 0.947368
    val_balanced_accuracy: 0.948029
    train_roc_auc: 0.994430
    type: extra_trees
    total_elapsed_time: 02:34
[CaTabRa] New model #49 trained:
    val_roc_auc: 0.986918
    val_accuracy: 0.927632
    val_balanced_accuracy: 0.920490
    train_roc_auc: 0.988023
    type: mlp
    total_elapsed_time: 02:38
[CaTabRa] New ensemble fitted:
    ensemble_val_roc_auc: 0.997172
    n_constituent_models: 2
    total_elapsed_time: 02:42
[CaTabRa] New model #50 trained:
    val_roc_auc: 0.996535
    val_accuracy: 0.964912
    val_balanced_accuracy: 0.962007
    train_roc_auc: 0.999164
    type: mlp
    total_elapsed_time: 02:42
[CaTabRa] New ensemble fitted:
    ensemble_val_roc_auc: 0.997212
    n_constituent_models: 2
    total_elapsed_time: 02:46
[CaTabRa] New model #51 trained:
    val_roc_auc: 0.996595
    val_accuracy: 0.971491
    val_balanced_accuracy: 0.969235
    train_roc_auc: 0.999970
    type: mlp
    total_elapsed_time: 02:46
/home/skaltenl/anaconda3/envs/test2/lib/python3.9/site-packages/sklearn/preprocessing/_data.py:3237: RuntimeWarning: divide by zero encountered in log
  loglike = -n_samples / 2 * np.log(x_trans.var())
/home/skaltenl/anaconda3/envs/test2/lib/python3.9/site-packages/sklearn/neural_network/_multilayer_perceptron.py:614: ConvergenceWarning: Stochastic Optimizer: Maximum iterations (32) reached and the optimization hasn't converged yet.
  warnings.warn(
/home/skaltenl/anaconda3/envs/test2/lib/python3.9/site-packages/sklearn/impute/_base.py:49: FutureWarning: Unlike other reduction functions (e.g. `skew`, `kurtosis`), the default behavior of `mode` typically preserves the axis it acts along. In SciPy 1.11.0, this behavior will change: the default value of `keepdims` will become False, the `axis` over which the statistic is taken will be eliminated, and the value None will no longer be accepted. Set `keepdims` to True or False to avoid this warning.
  mode = stats.mode(array)
/home/skaltenl/anaconda3/envs/test2/lib/python3.9/site-packages/sklearn/preprocessing/_data.py:3237: RuntimeWarning: divide by zero encountered in log
  loglike = -n_samples / 2 * np.log(x_trans.var())
/home/skaltenl/anaconda3/envs/test2/lib/python3.9/site-packages/sklearn/neural_network/_multilayer_perceptron.py:614: ConvergenceWarning: Stochastic Optimizer: Maximum iterations (32) reached and the optimization hasn't converged yet.
  warnings.warn(
[CaTabRa] Final training statistics:
    n_models_trained: 51
    ensemble_val_roc_auc: 0.9972122660294704
[CaTabRa] Creating shap explainer
[CaTabRa] Initialized out-of-distribution detector of type BinsDetector
[CaTabRa] Fitting out-of-distribution detector...
[CaTabRa] Out-of-distribution detector fitted.
[CaTabRa] ### Analysis finished at 2023-04-13 15:03:41.403394
[CaTabRa] ### Elapsed time: 0 days 00:02:59.525286
[CaTabRa] ### Output saved in /mnt/c/Users/skaltenl/Documents/catabra_2023/develop/catabra/examples/workflow
[CaTabRa] ### Evaluation started at 2023-04-13 15:03:41.421730
[CaTabRa] Saving descriptive statistics completed
[CaTabRa] Predicting out-of-distribution samples.
The default value of numeric_only in DataFrame.corr is deprecated. In a future version, it will default to False. Select only valid columns or specify the value of numeric_only to silence this warning.
[CaTabRa] Saving descriptive statistics completed
[CaTabRa] Predicting out-of-distribution samples.
[CaTabRa] Evaluation results for train:
    roc_auc: 0.9994623655913979
    accuracy @ 0.5: 0.9868421052631579
    balanced_accuracy @ 0.5: 0.9863799283154122
The default value of numeric_only in DataFrame.corr is deprecated. In a future version, it will default to False. Select only valid columns or specify the value of numeric_only to silence this warning.
[CaTabRa] Evaluation results for not_train:
    roc_auc: 0.9991158267020337
    accuracy @ 0.5: 0.9469026548672567
    balanced_accuracy @ 0.5: 0.9655172413793103
[CaTabRa] ### Evaluation finished at 2023-04-13 15:03:44.241915
[CaTabRa] ### Elapsed time: 0 days 00:00:02.820185
[CaTabRa] ### Output saved in /mnt/c/Users/skaltenl/Documents/catabra_2023/develop/catabra/examples/workflow/eval

By specifying a train-test split, CaTabRa not only trains a classifier (on the training set) but also evaluates it (on both sets). The last few lines of the above logging output inform about the performance of the classifier on “train” and “not_train”. More detailed results are available as well, as we will see in Step 3.

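The newly created directory specified by output_dir contains all results generated during data analysis, including descriptive statistics, a summary of the final model, and the training history, each of which is described below.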

Descriptive Statistics

Descriptive statistics are calculated separately for numeric and non-numeric (categorical) features and saved in statistics/statistics_numeric.xlsx and statistics/statistics_non_numeric.xlsx. These files are easiest to view in Excel, but they can also be loaded as pandas DataFrames.

CaTabRa provides convenience functions for loading tables in arbitrary formats, implemented in module catabra.util.io (https://github.com/risc-mi/catabra/tree/main/catabra/util/io.py): read_df() loads a single table, and read_dfs() loads all tables stored in a file. In classification tasks, descriptive statistics are computed both for the entire dataset and for each class individually and written to two different tables, so we use read_dfs() to load both of them:

[22]:
stats = io.read_dfs(output_dir + '/statistics/statistics_numeric.xlsx')
[23]:
# overall statistics
stats['overall'].head()
[23]:
   Unnamed: 0       count        mean         std        min        25%        50%       75%        max
0  mean radius        569   14.127292    3.524049    6.98100   11.70000   13.37000   15.7800    28.1100
1  mean texture       569   19.289649    4.301036    9.71000   16.17000   18.84000   21.8000    39.2800
2  mean perimeter     569   91.969033   24.298981   43.79000   75.17000   86.24000  104.1000   188.5000
3  mean area          569  654.889104  351.914129  143.50000  420.30000  551.10000  782.7000  2501.0000
4  mean smoothness    569    0.096360    0.014064    0.05263    0.08637    0.09587    0.1053     0.1634
[24]:
# statistics per class
# forward-fill the feature names, which are only set in the first row of each feature
stats['diagnosis']['Feature'] = stats['diagnosis']['Feature'].ffill()
stats['diagnosis'].head()
[24]:
   Feature         diagnosis  count        mean        std     min      25%      50%      75%     max  mann_whitney_u
0  mean radius             0    212   17.462830   3.203971  10.950  15.0750   17.325   19.590   28.11    2.692943e-68
1  mean radius             1    357   12.146524   1.780512   6.981  11.0800   12.200   13.370   17.85    2.692943e-68
2  mean texture            0    212   21.604906   3.779470  10.380  19.3275   21.460   23.765   39.28    3.428627e-28
3  mean texture            1    357   17.914762   3.995125   9.710  15.1500   17.390   19.760   33.81    3.428627e-28
4  mean perimeter          0    212  115.365377  21.854653  71.900  98.7450  114.200  129.925  188.50    3.553870e-71

In the above per-class statistics, a Mann-Whitney U test is performed to detect statistically significant differences in the distribution of a feature between the different classes, and the resulting p-values are reported in column mann_whitney_u.
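To illustrate what such a test measures, here is a minimal, standard-library-only sketch of a two-sided Mann-Whitney U test. It uses the normal approximation without tie or continuity correction, so it is a simplification and not necessarily how CaTabRa computes the reported p-values. The function name and the sample values are made up for illustration.

```python
import math


def mann_whitney_u_pvalue(a, b):
    """Two-sided p-value via the normal approximation (no tie/continuity correction)."""
    n_a, n_b = len(a), len(b)
    # U counts the pairs (x, y) with x > y; tied pairs count one half.
    u = sum((x > y) + 0.5 * (x == y) for x in a for y in b)
    mean_u = n_a * n_b / 2
    std_u = math.sqrt(n_a * n_b * (n_a + n_b + 1) / 12)
    z = (u - mean_u) / std_u
    return math.erfc(abs(z) / math.sqrt(2))


# Hypothetical feature values for two classes (not taken from the dataset).
class_0 = [17.9, 20.5, 19.6, 20.2, 18.3, 16.1, 15.8, 19.1]
class_1 = [12.4, 13.5, 11.4, 12.0, 13.0, 9.5, 14.2, 12.8]

p_different = mann_whitney_u_pvalue(class_0, class_1)
p_identical = mann_whitney_u_pvalue(class_0, class_0)
print(p_different, p_identical)
```

A small p-value, as for the two clearly separated samples here, suggests that the feature is distributed differently in the two classes; comparing a sample with itself yields a p-value of 1.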

For more information about the descriptive statistics computed by CaTabRa by default, refer to Statistics.

Descriptive statistics can be computed manually as well; see module catabra.util.statistics (https://github.com/risc-mi/catabra/tree/main/catabra/util/statistics) for details.

Model Summary

The final prediction model is summarized in model_summary.json. This file contains a dict with information about the individual constituent models (if the model is an ensemble), the used preprocessing steps, and the selected hyperparameter values. The exact format depends on the used AutoML backend, but for the default auto-sklearn backend the main information is contained in the list under the "models" key, as can be seen below:

[25]:
io.load(output_dir + '/model_summary.json')
[25]:
{'automl': 'auto-sklearn',
 'task': 'binary_classification',
 'models': [{'model_id': 6,
   'rank': 1,
   'cost': 0.00561529271206688,
   'ensemble_weight': 0.0,
   'data_preprocessor': "FeatTypeSplit(column_transformer=ColumnTransformer(sparse_threshold=0.0, transformers=[('numerical_transformer', NumericalPreprocessingPipeline(config=Configuration(values={ 'imputation:strategy': 'median', 'rescaling:__choice__': 'power_transformer', }) , dataset_properties={'signed': False, 'sparse': False}, exclude={}, include={}, init_params={}, steps=[('imput... 'symmetry error': 'numerical', 'texture error': 'numerical', 'worst area': 'numerical', 'worst compactness': 'numerical', 'worst concave points': 'numerical', 'worst concavity': 'numerical', 'worst fractal dimension': 'numerical', 'worst perimeter': 'numerical', 'worst radius': 'numerical', 'worst smoothness': 'numerical', 'worst symmetry': 'numerical', 'worst texture': 'numerical'}, init_params={})",
   'balancing': 'Balancing(random_state=42)',
   'feature_preprocessor': 'NoPreprocessing(<unknown params>)',
   'classifier': 'MLPClassifier(alpha=0.07979356062608887, beta_1=0.999, beta_2=0.9, early_stopping=True, hidden_layer_sizes=(257, 257, 257), learning_rate_init=0.001829312822950054, max_iter=32, n_iter_no_change=32, random_state=42, verbose=0, warm_start=True)'},
  {'model_id': 60,
   'rank': 2,
   'cost': 0.0034050179211469276,
   'ensemble_weight': 0.7,
   'data_preprocessor': "FeatTypeSplit(column_transformer=ColumnTransformer(sparse_threshold=0.0, transformers=[('numerical_transformer', NumericalPreprocessingPipeline(config=Configuration(values={ 'imputation:strategy': 'most_frequent', 'rescaling:__choice__': 'power_transformer', }) , dataset_properties={'signed': False, 'sparse': False}, exclude={}, include={}, init_params={}, steps=[... 'symmetry error': 'numerical', 'texture error': 'numerical', 'worst area': 'numerical', 'worst compactness': 'numerical', 'worst concave points': 'numerical', 'worst concavity': 'numerical', 'worst fractal dimension': 'numerical', 'worst perimeter': 'numerical', 'worst radius': 'numerical', 'worst smoothness': 'numerical', 'worst symmetry': 'numerical', 'worst texture': 'numerical'}, init_params={})",
   'balancing': 'Balancing(random_state=42)',
   'feature_preprocessor': 'NoPreprocessing(<unknown params>)',
   'classifier': 'MLPClassifier(alpha=2.9638327738166795e-05, beta_1=0.999, beta_2=0.9, early_stopping=True, hidden_layer_sizes=(241,), learning_rate_init=0.008555948122763763, max_iter=32, n_iter_no_change=32, random_state=42, verbose=0, warm_start=True)'}]}
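Once loaded, the summary is an ordinary dict and can be inspected with plain Python. The sketch below uses an abridged, hand-copied version of the output above (only the keys model_id, rank and ensemble_weight are kept) and lists the models that actually contribute to the ensemble. The variable names are illustrative.

```python
# Abridged copy of the summary printed above; in practice this dict would
# come from io.load(output_dir + '/model_summary.json').
summary = {
    "automl": "auto-sklearn",
    "task": "binary_classification",
    "models": [
        {"model_id": 6, "rank": 1, "ensemble_weight": 0.0},
        {"model_id": 60, "rank": 2, "ensemble_weight": 0.7},
    ],
}

# Keep only the constituent models with a non-zero ensemble weight.
active_ids = [m["model_id"] for m in summary["models"] if m["ensemble_weight"] > 0]
print(active_ids)
```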

Training History

Information about each model trained during hyperparameter optimization is contained in training_history.xlsx and visualized in training_history.pdf:

[26]:
io.read_df(output_dir + '/training_history.xlsx').drop('Unnamed: 0', axis=1, errors='ignore').head()
[26]:
model_id timestamp total_elapsed_time type val_roc_auc val_accuracy val_balanced_accuracy train_roc_auc duration ensemble_weight ensemble_val_roc_auc
0 2 2023-04-13 15:00:46.003 0 days 00:00:03.044101953 gradient_boosting 0.989845 0.947368 0.946356 1.000000 2.391810 0.0 0.986260
1 3 2023-04-13 15:00:48.393 0 days 00:00:05.434068441 gradient_boosting 0.945430 0.921053 0.924134 1.000000 2.265227 0.0 0.986260
2 4 2023-04-13 15:00:51.003 0 days 00:00:08.043761730 gradient_boosting 0.971416 0.921053 0.919952 0.993877 2.473614 0.0 0.986260
3 5 2023-04-13 15:00:53.396 0 days 00:00:10.437424898 gradient_boosting 0.968250 0.929825 0.926523 0.995034 2.244282 0.0 0.987834
4 6 2023-04-13 15:00:56.929 0 days 00:00:13.970171690 mlp 0.997073 0.971491 0.970072 0.999985 3.378461 0.0 0.996953

Step 2: Calibrate Classifier

Classifiers can be calibrated to ensure that the probability estimates they return correspond to the “true” confidence of the model. As in the initial data analysis and model construction, one simple function call suffices to calibrate a classifier in CaTabRa.

Worth noting is the use of the from_invocation keyword argument, which automatically sets all unspecified arguments to the values stored in the given JSON file; this applies, for example, to split. Setting subset to True means that the classifier is calibrated only on those samples whose value in the train-test-split column "train" is True (i.e., the training set). Normally, though, classifiers should not be calibrated on the training set. After calibration, model.joblib is replaced by the new, calibrated model.

The corresponding command in CaTabRa’s command-line interface is catabra calibrate ....

[27]:
from catabra.calibration import calibrate

calibrate(
    X,
    folder=output_dir,    # directory containing trained classifier (= output directory of previous call to `analyze()`)
    from_invocation=output_dir + '/invocation.json',
    subset=True,
    out=output_dir + '/calib'
)
[CaTabRa] ### Calibration started at 2023-04-13 15:03:44.475144
[CaTabRa] Restricting table to calibration subset train = True (456 entries)
[CaTabRa] ### Calibration finished at 2023-04-13 15:03:45.766419
[CaTabRa] ### Elapsed time: 0 days 00:00:01.291275
[CaTabRa] ### Output saved in /mnt/c/Users/skaltenl/Documents/catabra_2023/develop/catabra/examples/workflow/calib

Step 3: Evaluate Classifier

Prediction models can be evaluated on (labeled) data that have the same format as the data they were initially trained on, as passed to function catabra.analysis.analyze() (https://github.com/risc-mi/catabra/tree/main/catabra/analysis/main.py). Again, one simple function call is sufficient. If the data is split into two or more disjoint subsets via argument split (implicit in from_invocation below), the model is evaluated on each of these subsets separately.

Bootstrapping can be used to estimate the variance, confidence intervals, etc. of the classifier's performance. We activate it by setting bootstrapping_repetitions to the desired number of repetitions.
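The idea behind bootstrapping can be sketched in a few lines of plain Python: the evaluation set is resampled with replacement many times, the metric is recomputed on each resample, and the spread of the results gives a confidence interval. This is a generic illustration with made-up labels and a percentile interval; the function name, the data and the interval method are assumptions and need not match CaTabRa's internals.

```python
import random


def bootstrap_accuracy(y_true, y_pred, repetitions=1000, seed=0):
    """Return (mean, lower, upper) of accuracy over bootstrap resamples (95% percentile interval)."""
    rng = random.Random(seed)
    n = len(y_true)
    scores = []
    for _ in range(repetitions):
        # Draw n sample indices with replacement and recompute accuracy on them.
        idx = [rng.randrange(n) for _ in range(n)]
        scores.append(sum(y_true[i] == y_pred[i] for i in idx) / n)
    scores.sort()
    lower = scores[int(0.025 * repetitions)]
    upper = scores[int(0.975 * repetitions) - 1]
    return sum(scores) / repetitions, lower, upper


# Hypothetical labels: 40 samples, the first 4 predictions are wrong (accuracy 0.9).
y_true = [0, 1] * 20
y_pred = [1 - y if i < 4 else y for i, y in enumerate(y_true)]

mean_acc, lower, upper = bootstrap_accuracy(y_true, y_pred)
print(f"accuracy: {mean_acc:.3f} (95% interval: {lower:.3f} to {upper:.3f})")
```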

Since the desired output directory has been created by function analyze() already, we are asked whether it should be replaced.

The corresponding command in CaTabRa’s command-line interface is catabra evaluate ....

[28]:
from catabra.evaluation import evaluate

evaluate(
    X,
    folder=output_dir,    # directory containing trained classifier (= output directory of previous call to `analyze()`)
    from_invocation=output_dir + '/invocation.json',
    bootstrapping_repetitions=1000,   # number of bootstrapping repetitions to perform; set to 0 to disable bootstrapping
    out=output_dir + '/eval'
)
Evaluation folder "/mnt/c/Users/skaltenl/Documents/catabra_2023/develop/catabra/examples/workflow/eval" already exists. Delete?
[CaTabRa] ### Evaluation started at 2023-04-13 15:03:45.781536
[CaTabRa] Saving descriptive statistics completed
[CaTabRa] Predicting out-of-distribution samples.
The default value of numeric_only in DataFrame.corr is deprecated. In a future version, it will default to False. Select only valid columns or specify the value of numeric_only to silence this warning.
[CaTabRa] Saving descriptive statistics completed
[CaTabRa] Predicting out-of-distribution samples.
The default value of numeric_only in DataFrame.corr is deprecated. In a future version, it will default to False. Select only valid columns or specify the value of numeric_only to silence this warning.
[CaTabRa] Evaluation results for train:
    roc_auc: 0.9994623655913979
    accuracy @ 0.5: 0.9868421052631579
    balanced_accuracy @ 0.5: 0.9863799283154122
[CaTabRa] Evaluation results for not_train:
    roc_auc: 0.9991158267020337
    accuracy @ 0.5: 0.9469026548672567
    balanced_accuracy @ 0.5: 0.9655172413793103
[CaTabRa] ### Evaluation finished at 2023-04-13 15:04:01.642427
[CaTabRa] ### Elapsed time: 0 days 00:00:15.860891
[CaTabRa] ### Output saved in /mnt/c/Users/skaltenl/Documents/catabra_2023/develop/catabra/examples/workflow/eval

Note how accuracy and balanced accuracy changed compared to the initial data analysis. This is because of model calibration, which potentially affects thresholded metrics (like accuracy and balanced accuracy) but leaves threshold-independent metrics, like ROC-AUC, unchanged.
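The effect can be reproduced on a toy example. The sketch below assumes that calibration acts as a strictly increasing map on the predicted scores (here a square root, chosen purely for illustration; the calibrator actually used by CaTabRa is not shown in this tutorial). Such a map preserves the ranking of the samples, so ROC-AUC stays the same, while the predictions at a fixed threshold of 0.5 may change. All scores are hypothetical.

```python
import numpy as np

def roc_auc(y_true, scores):
    """ROC-AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 1/2."""
    y_true = np.asarray(y_true, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    diff = scores[y_true][:, None] - scores[~y_true][None, :]
    return ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size

def accuracy_at(y_true, scores, threshold=0.5):
    """Accuracy when predicting the positive class for scores >= threshold."""
    y_true = np.asarray(y_true, dtype=bool)
    return float(np.mean((np.asarray(scores) >= threshold) == y_true))

y = np.array([0, 0, 1, 0, 1, 1])
raw = np.array([0.05, 0.10, 0.15, 0.20, 0.40, 0.90])  # hypothetical uncalibrated scores
calibrated = np.sqrt(raw)  # strictly increasing stand-in for a calibrator

print("ROC-AUC :", roc_auc(y, raw), roc_auc(y, calibrated))
print("accuracy:", accuracy_at(y, raw), accuracy_at(y, calibrated))
```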

Performance Metrics (Non-Bootstrapped)

Among the main evaluation results produced by CaTabRa are tables with detailed information on model performance, together with corresponding visualizations. In our case, they are contained in subdirectories eval/train/ and eval/not_train/.

Non-bootstrapped performance metrics are saved in metrics.xlsx. In binary classification, this file consists of the three tables "overall", "thresholded" and "calibration".

[29]:
metrics = io.read_dfs(output_dir + '/eval/not_train/metrics.xlsx')

Table "overall" contains non-thresholded performance metrics, like ROC-AUC, average precision, etc.:

[30]:
metrics['overall']
[30]:
Unnamed: 0 pos_label n n_pos roc_auc average_precision pr_auc brier_loss hinge_loss log_loss
0 diagnosis 1 113 87 0.999116 0.999742 0.99974 0.03369 0.283123 0.125328

Table "thresholded" contains all performance metrics that depend on a specific decision threshold (a.k.a. cut-off point), like accuracy, balanced accuracy, F1-score, etc. These metrics are evaluated at different decision thresholds.

[31]:
metrics['thresholded'].drop('Unnamed: 0', axis=1).head()
[31]:
threshold accuracy balanced_accuracy f1 sensitivity specificity positive_predictive_value negative_predictive_value cohen_kappa hamming_loss jaccard true_positive true_negative false_positive false_negative
0 0.012333 0.769912 0.500000 0.870000 1.0 0.000000 0.769912 1.0 0.000000 0.230088 0.769912 87 0 26 0
1 0.012333 0.796460 0.557692 0.883249 1.0 0.115385 0.790909 1.0 0.167254 0.203540 0.790909 87 3 23 0
2 0.012333 0.814159 0.596154 0.892308 1.0 0.192308 0.805556 1.0 0.268270 0.185841 0.805556 87 5 21 0
3 0.012333 0.823009 0.615385 0.896907 1.0 0.230769 0.813084 1.0 0.315981 0.176991 0.813084 87 6 20 0
4 0.012333 0.831858 0.634615 0.901554 1.0 0.269231 0.820755 1.0 0.361961 0.168142 0.820755 87 7 19 0
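All metrics in this table are functions of the four confusion-matrix counts in the last columns. As a plausibility check, the sketch below recomputes them from the counts of the row with index 1 using the standard textbook definitions. The function name is hypothetical and this is not CaTabRa's implementation. In particular, undefined ratios (zero denominator) are returned as NaN here, whereas the table above reports 1.0 for negative_predictive_value in row 0.

```python
import math

def thresholded_metrics(tp, tn, fp, fn):
    """Standard binary classification metrics derived from confusion-matrix counts."""
    def ratio(num, den):
        return num / den if den else math.nan

    n = tp + tn + fp + fn
    accuracy = ratio(tp + tn, n)
    sensitivity = ratio(tp, tp + fn)
    specificity = ratio(tn, tn + fp)
    # agreement expected by chance, needed for Cohen's kappa
    expected = ratio((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp), n * n)
    return {
        "accuracy": accuracy,
        "balanced_accuracy": (sensitivity + specificity) / 2,
        "f1": ratio(2 * tp, 2 * tp + fp + fn),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "positive_predictive_value": ratio(tp, tp + fp),
        "negative_predictive_value": ratio(tn, tn + fn),
        "cohen_kappa": ratio(accuracy - expected, 1 - expected),
        "hamming_loss": 1 - accuracy,
        "jaccard": ratio(tp, tp + fp + fn),
    }

# counts taken from the row with index 1 of the table above
row1 = thresholded_metrics(tp=87, tn=3, fp=23, fn=0)
for name, value in row1.items():
    print(f"{name}: {value:.6f}")
```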

Table "calibration" contains the fraction of positive samples for different threshold intervals. The intervals are constructed such that each of them contains roughly the same number of samples.

[32]:
metrics['calibration'].drop('Unnamed: 0', axis=1).head()
[32]:
threshold_lower threshold_upper pos_fraction
0 0.012333 0.012333 0.0
1 0.012333 0.012333 0.0
2 0.012333 0.012333 0.0
3 0.012333 0.012333 0.0
4 0.012333 0.012333 0.0
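One way to build such a table is to sort the samples by predicted probability, cut them into bins of (roughly) equal size, and compute the fraction of positive samples per bin. The sketch below follows this idea with NumPy. It is an illustration under that assumption, since the exact binning procedure used by CaTabRa is not described here; the function name and the data are hypothetical.

```python
import numpy as np

def calibration_table(y_true, proba, n_bins=4):
    """Equal-frequency bins over predicted probabilities with the positive fraction per bin."""
    order = np.argsort(proba, kind="stable")
    y_sorted = np.asarray(y_true)[order]
    p_sorted = np.asarray(proba, dtype=float)[order]
    rows = []
    # array_split yields bins whose sizes differ by at most one sample
    for idx in np.array_split(np.arange(len(p_sorted)), n_bins):
        if len(idx) == 0:
            continue
        rows.append((float(p_sorted[idx[0]]),    # threshold_lower
                     float(p_sorted[idx[-1]]),   # threshold_upper
                     float(y_sorted[idx].mean())))  # pos_fraction
    return rows

# hypothetical labels and predicted probabilities of the positive class
y = [0, 0, 0, 1, 0, 1, 1, 1]
p = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
table = calibration_table(y, p, n_bins=4)
for lower, upper, pos_fraction in table:
    print(lower, upper, pos_fraction)
```

For a well-calibrated model, pos_fraction in each row should lie close to the probabilities covered by that interval.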

Bootstrapped Performance

Since we activated bootstrapping by setting bootstrapping_repetitions to a positive number, file bootstrapping.xlsx was generated. It contains two tables: "summary" holds summary statistics over all bootstrapping runs, and "details" holds the results of the individual runs.

[33]:
bootstrapping = io.read_dfs(output_dir + '/eval/not_train/bootstrapping.xlsx')
[34]:
bootstrapping['summary']
[34]:
Unnamed: 0 roc_auc accuracy balanced_accuracy __threshold
0 count 1000.000000 1000.000000 1000.000000 1000.0
1 mean 0.999107 0.946177 0.965097 0.5
2 std 0.001177 0.021432 0.013757 0.0
3 min 0.990909 0.876106 0.922222 0.5
4 25% 0.998557 0.929204 0.956044 0.5
5 50% 0.999532 0.946903 0.966292 0.5
6 75% 1.000000 0.964602 0.975904 0.5
7 max 1.000000 1.000000 1.000000 0.5

Table "details" reports the performance metrics for each single run, together with the random seed used for resampling the data.

[35]:
bootstrapping['details'].drop('Unnamed: 0', axis=1, errors='ignore').head()
[35]:
roc_auc accuracy balanced_accuracy __seed
0 1.000000 0.955752 0.970238 2854880344
1 1.000000 0.938053 0.963158 1506600952
2 1.000000 0.893805 0.931034 3277809138
3 0.997895 0.946903 0.960000 3141104837
4 1.000000 0.964602 0.977011 2847344748
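Conceptually, each bootstrapping run resamples the evaluated samples with replacement and recomputes the metrics on the resampled set. The following sketch shows this idea for accuracy with NumPy. It is not CaTabRa's implementation; the function names are hypothetical, and the predictions are synthetic, constructed so that 107 of 113 samples are classified correctly (the 6 errors being false negatives), which matches the non-bootstrapped accuracy and balanced accuracy of not_train reported above.

```python
import numpy as np

def bootstrap_metric(y_true, y_pred, metric, repetitions=1000, seed=0):
    """Evaluate `metric` on `repetitions` resamples drawn with replacement."""
    rng = np.random.default_rng(seed)
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    n = len(y_true)
    values = np.empty(repetitions)
    for i in range(repetitions):
        idx = rng.integers(0, n, size=n)  # sample indices with replacement
        values[i] = metric(y_true[idx], y_pred[idx])
    return values

def accuracy(y_true, y_pred):
    return float(np.mean(y_true == y_pred))

# synthetic stand-in for the test split: 87 positives, 26 negatives, 6 false negatives
y_true = np.array([1] * 87 + [0] * 26)
y_pred = y_true.copy()
y_pred[:6] = 0

values = bootstrap_metric(y_true, y_pred, accuracy, repetitions=1000)
lower, upper = np.percentile(values, [2.5, 97.5])  # percentile interval
print("mean:", values.mean(), "std:", values.std(ddof=1))
print("95% percentile interval:", lower, upper)
```

The exact numbers depend on the random seed and will therefore differ from those in bootstrapping.xlsx, but mean and standard deviation should be of similar magnitude.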

Sample-Wise Predictions

Finally, the model output for each individual sample is saved in predictions.xlsx.

[36]:
predictions = io.read_df(output_dir + '/eval/not_train/predictions.xlsx')

The table contains the true label (column "diagnosis") and the predicted probabilities of the negative and positive class, respectively. Note that in our case the two classes are simply called 0 and 1, which is why the corresponding columns are called "0_proba" and "1_proba".

[37]:
predictions.head()
[37]:
Unnamed: 0 diagnosis 0_proba 1_proba
0 456 1 0.007257 0.992743
1 457 1 0.224164 0.775836
2 458 1 0.727156 0.272844
3 459 1 0.005909 0.994091
4 460 0 0.987667 0.012333
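Hard class predictions can be derived from such a table by thresholding the probability of the positive class. The sketch below rebuilds the five rows shown above as a DataFrame (so that it runs without the Excel file) and flags misclassified samples at a threshold of 0.5. Treating probabilities equal to the threshold as positive is an assumption made for this illustration.

```python
import pandas as pd

# the five rows displayed above; the index holds the original sample IDs
predictions = pd.DataFrame(
    {
        "diagnosis": [1, 1, 1, 1, 0],
        "0_proba": [0.007257, 0.224164, 0.727156, 0.005909, 0.987667],
        "1_proba": [0.992743, 0.775836, 0.272844, 0.994091, 0.012333],
    },
    index=[456, 457, 458, 459, 460],
)

threshold = 0.5
# predict the positive class whenever its probability reaches the threshold
predictions["predicted"] = (predictions["1_proba"] >= threshold).astype(int)
misclassified = predictions[predictions["predicted"] != predictions["diagnosis"]]
print(misclassified)
```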

Out-of-Distribution Detection

In addition to the output of the prediction model, we can also inspect the likelihood of samples (or the whole training or test set) being out-of-distribution (OOD). Predictions for samples with high OOD likelihood should be treated with care, as these samples might differ significantly from all samples the model has seen during training.

[38]:
ood = io.read_df(output_dir + '/eval/not_train/ood.xlsx')
[39]:
ood.head()
[39]:
Unnamed: 0 proba decision
0 0 0 False
1 1 0 False
2 2 0 False
3 3 0 False
4 4 0 False

Step 4: Explain Classifier

Prediction models can be explained on data that have the same format as the data they were initially trained on, as passed to function analyze(). As before, one simple function call is sufficient. If the data is split into two or more disjoint subsets via argument split (implicit in from_invocation below), the model is explained on each of these subsets separately.

If the final model is an ensemble of several base models, each of them is explained separately.

By default, SHAP is used for generating local (i.e., sample-wise) explanations in terms of feature importance scores. These scores are saved as HDF5 tables and visualized in so-called beeswarm plots, and can be found in the specified output directory.

In addition to SHAP, CaTabRa also provides a ready-to-use implementation of permutation importance. The advantage of permutation importance over SHAP is that it can generally be computed much faster. We use it here by setting explainer="permutation" in the command below. You can try SHAP by setting explainer="shap" or by simply omitting the keyword argument.

The corresponding command in CaTabRa’s command-line interface is catabra explain ....

[41]:
from catabra.explanation import explain

explain(
    X,
    folder=output_dir,    # directory containing trained classifier (= output directory of previous call to `analyze()`)
    from_invocation=output_dir + '/invocation.json',
    out=output_dir + '/explain_permutation',
    explainer='permutation'
)
[CaTabRa] ### Explanation started at 2023-04-13 15:20:32.120711
[CaTabRa] *** Split train
Features: 100%|########################################| 30/30 [00:06<00:00, 4.93it/s]
[CaTabRa] *** Split not_train
Features: 100%|########################################| 30/30 [00:03<00:00, 11.82it/s]
[CaTabRa] ### Explanation finished at 2023-04-13 15:20:43.127964
[CaTabRa] ### Elapsed time: 0 days 00:00:11.007253
[CaTabRa] ### Output saved in /mnt/c/Users/skaltenl/Documents/catabra_2023/develop/catabra/examples/workflow/explain_permutation

Permutation importance generates global (i.e., feature-wise) explanations. The corresponding importance scores are saved as HDF5 tables and visualized in bar plots.
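The idea behind permutation importance is to shuffle one feature at a time and measure how much a performance metric degrades. The sketch below implements this generic idea with NumPy on synthetic data. It is not CaTabRa's implementation, and the function names, the toy model and the data are all hypothetical.

```python
import numpy as np

def permutation_importance(predict, X, y, metric, n_repeats=5, seed=0):
    """Mean drop of `metric` after shuffling each column of X in turn."""
    rng = np.random.default_rng(seed)
    baseline = metric(y, predict(X))
    importances = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        drops = []
        for _ in range(n_repeats):
            X_perm = X.copy()
            # shuffling breaks the association between feature j and the target
            X_perm[:, j] = rng.permutation(X_perm[:, j])
            drops.append(baseline - metric(y, predict(X_perm)))
        importances[j] = np.mean(drops)
    return importances

def accuracy(y_true, y_pred):
    return float(np.mean(y_true == y_pred))

# synthetic data: only the first of three features determines the label
rng = np.random.default_rng(42)
X_toy = rng.normal(size=(200, 3))
y_toy = (X_toy[:, 0] > 0).astype(int)

def toy_model(data):
    # stand-in for a trained classifier that relies on feature 0 only
    return (data[:, 0] > 0).astype(int)

importances = permutation_importance(toy_model, X_toy, y_toy, accuracy)
print(importances)
```

Features the model does not use receive an importance of zero, whereas shuffling the decisive feature reduces accuracy substantially.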

Refer to Explanations for more information about model explanations.

Step 5: Apply Classifier to New Data

Finally, the trained classifier can be applied to new data of the same format as the data it was initially trained on, possibly without the label column. For demonstration purposes we apply the classifier to the same data X we have been using throughout, although this would not make sense in a real-world use case.

The corresponding command in CaTabRa’s command-line interface is catabra apply ....

[42]:
from catabra.application import apply

apply(
    X.drop('diagnosis', axis=1),   # data to apply the model to; column containing ground-truth labels is not needed (but would not harm either)
    folder=output_dir,    # directory containing trained classifier (= output directory of previous call to `analyze()`)
    from_invocation=output_dir + '/invocation.json',
    out=output_dir + '/apply'
)
[CaTabRa] ### Application started at 2023-04-13 15:20:43.169793
[CaTabRa] Predicting out-of-distribution samples.
[CaTabRa] ### Application finished at 2023-04-13 15:20:43.760727
[CaTabRa] ### Elapsed time: 0 days 00:00:00.590934
[CaTabRa] ### Output saved in /mnt/c/Users/skaltenl/Documents/catabra_2023/develop/catabra/examples/workflow/apply

The results are saved in predictions.xlsx and contain the predicted probabilities of the two classes, for every sample. OOD scores are saved again in ood.xlsx.

[43]:
predictions = io.read_df(output_dir + '/apply/predictions.xlsx')
[44]:
predictions.head()
[44]:
Unnamed: 0 0_proba 1_proba
0 0 0.987667 0.012333
1 1 0.987667 0.012333
2 2 0.987667 0.012333
3 3 0.987667 0.012333
4 4 0.987667 0.012333

Load Classifier into Python

Prediction models generated with CaTabRa can be easily loaded into a Python session. The most straightforward way to do this is through the catabra.util.io.CaTabRaLoader class (https://github.com/risc-mi/catabra/tree/main/catabra/util/io.py), which only needs to be instantiated with the directory containing the model:

[45]:
loader = io.CaTabRaLoader(output_dir)

The resulting class instance provides easy access to all sorts of artifacts generated by the functions above, in particular the trained classifier:

[46]:
model = loader.get_model()

Investigating the Model

The type of the loaded model object depends on the AutoML backend used for training it, in this case auto-sklearn:

[47]:
type(model)
[47]:
catabra.automl.askl.backend.AutoSklearnBackend

If we want a uniform representation of the model independent of the AutoML backend, we can convert it into a catabra.automl.fitted_ensemble.FittedEnsemble (https://github.com/risc-mi/catabra/tree/main/catabra/automl/fitted_ensemble.py):

[48]:
fe = model.fitted_ensemble()

A FittedEnsemble is, as its name suggests, an ensemble consisting of individual base models and a meta-estimator that combines the predictions of the base models into a single output. The base models can be accessed via the models_ attribute, which is a dict mapping model IDs to instances of class FittedModel:

[49]:
fe.models_
[49]:
{6: FittedModel(
     preprocessing=[ColumnTransformer(sparse_threshold=0.0,
                   transformers=[('numerical_transformer',
                                  Pipeline(steps=[('imputation',
                                                   SimpleImputer(copy=False,
                                                                 strategy='median')),
                                                  ('variance_threshold',
                                                   VarianceThreshold()),
                                                  ('rescaling',
                                                   PowerTransformer(copy=False)),
                                                  ('dummy', 'passthrough')]),
                                  [True, True, True, True, True, True, True,
                                   True, True, True, True, True, True, True,
                                   True, True, True, True, True, True, True,
                                   True, True, True, True, True, True, True,
                                   True, True])])],
     estimator=MLPClassifier(alpha=0.07979356062608887, beta_1=0.999, beta_2=0.9,
               early_stopping=True, hidden_layer_sizes=(257, 257, 257),
               learning_rate_init=0.001829312822950054, max_iter=32,
               n_iter_no_change=32, random_state=42, verbose=0, warm_start=True)),
 60: FittedModel(
     preprocessing=[ColumnTransformer(sparse_threshold=0.0,
                   transformers=[('numerical_transformer',
                                  Pipeline(steps=[('imputation',
                                                   SimpleImputer(copy=False,
                                                                 strategy='most_frequent')),
                                                  ('variance_threshold',
                                                   VarianceThreshold()),
                                                  ('rescaling',
                                                   PowerTransformer(copy=False)),
                                                  ('dummy', 'passthrough')]),
                                  [True, True, True, True, True, True, True,
                                   True, True, True, True, True, True, True,
                                   True, True, True, True, True, True, True,
                                   True, True, True, True, True, True, True,
                                   True, True])])],
     estimator=MLPClassifier(alpha=2.9638327738166795e-05, beta_1=0.999, beta_2=0.9,
               early_stopping=True, hidden_layer_sizes=(241,),
               learning_rate_init=0.008555948122763763, max_iter=32,
               n_iter_no_change=32, random_state=42, verbose=0, warm_start=True))}
[50]:
list(fe.models_.values())[0]
[50]:
FittedModel(
    preprocessing=[ColumnTransformer(sparse_threshold=0.0,
                  transformers=[('numerical_transformer',
                                 Pipeline(steps=[('imputation',
                                                  SimpleImputer(copy=False,
                                                                strategy='median')),
                                                 ('variance_threshold',
                                                  VarianceThreshold()),
                                                 ('rescaling',
                                                  PowerTransformer(copy=False)),
                                                 ('dummy', 'passthrough')]),
                                 [True, True, True, True, True, True, True,
                                  True, True, True, True, True, True, True,
                                  True, True, True, True, True, True, True,
                                  True, True, True, True, True, True, True,
                                  True, True])])],
    estimator=MLPClassifier(alpha=0.07979356062608887, beta_1=0.999, beta_2=0.9,
              early_stopping=True, hidden_layer_sizes=(257, 257, 257),
              learning_rate_init=0.001829312822950054, max_iter=32,
              n_iter_no_change=32, random_state=42, verbose=0, warm_start=True))
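As a rough illustration of how the outputs of several base models can be combined into a single prediction, the sketch below computes a weighted average of class probabilities (soft voting) with NumPy. This is an assumption made for illustration only: the actual meta-estimator of a FittedEnsemble is not shown in this tutorial, and the function name, probabilities and weights are hypothetical.

```python
import numpy as np

def weighted_soft_vote(probas, weights):
    """Weighted average of per-model class probabilities.

    probas: array of shape (n_models, n_samples, n_classes)
    weights: one non-negative weight per model (normalized internally)
    """
    probas = np.asarray(probas, dtype=float)
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()
    # contract the model axis: result has shape (n_samples, n_classes)
    return np.tensordot(weights, probas, axes=1)

# hypothetical outputs of two base models for two samples
model_a = [[0.9, 0.1], [0.2, 0.8]]
model_b = [[0.7, 0.3], [0.4, 0.6]]
combined = weighted_soft_vote([model_a, model_b], weights=[3, 1])
print(combined)
```

Because the weights are normalized to sum to one, each row of the combined output is again a probability distribution over the classes.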

NOTE: Predictions returned by fe may deviate slightly from those of model due to a known bug in auto-sklearn.

Applying the Model

If we want to apply the model to new data, we first need to load the encoder that was constructed jointly with the model. Again, the loader object comes in handy:

[51]:
encoder = loader.get_encoder()
[52]:
model.predict_proba(encoder.transform(x=X))
[52]:
array([[0.98766681, 0.01233319],
       [0.98766681, 0.01233319],
       [0.98766681, 0.01233319],
       ...,
       [0.98766655, 0.01233345],
       [0.98766681, 0.01233319],
       [0.00594657, 0.99405343]])