CaTabRa Workflow
This notebook is part of the CaTabRa GitHub repository.
This tutorial demonstrates CaTabRa’s main workflow step by step.
Prerequisites
[15]:
# generic package imports
from catabra.util import io
[16]:
# output directory (where all generated artifacts, like statistics, models, etc. are saved)
output_dir = 'workflow'
Step 0: Prepare Data
We are going to work with the breast cancer dataset, a well-known binary classification dataset.
CaTabRa assumes a table in the usual \(samples \times attributes\) format as input, where the attributes encompass features, target labels, and possibly additional information like a predefined train-test split.
[17]:
# load dataset
from sklearn.datasets import load_breast_cancer
X, y = load_breast_cancer(as_frame=True, return_X_y=True)
[18]:
# add target labels to DataFrame
X['diagnosis'] = y
[19]:
# split into training and test set by adding a column with corresponding values
# the name of the column is arbitrary; CaTabRa tries to "guess" which samples belong to which set based on the column name and values
X['train'] = X.index <= 0.8 * len(X)
[20]:
X.head()
[20]:
| | mean radius | mean texture | mean perimeter | mean area | mean smoothness | mean compactness | mean concavity | mean concave points | mean symmetry | mean fractal dimension | ... | worst perimeter | worst area | worst smoothness | worst compactness | worst concavity | worst concave points | worst symmetry | worst fractal dimension | diagnosis | train |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 17.99 | 10.38 | 122.80 | 1001.0 | 0.11840 | 0.27760 | 0.3001 | 0.14710 | 0.2419 | 0.07871 | ... | 184.60 | 2019.0 | 0.1622 | 0.6656 | 0.7119 | 0.2654 | 0.4601 | 0.11890 | 0 | True |
| 1 | 20.57 | 17.77 | 132.90 | 1326.0 | 0.08474 | 0.07864 | 0.0869 | 0.07017 | 0.1812 | 0.05667 | ... | 158.80 | 1956.0 | 0.1238 | 0.1866 | 0.2416 | 0.1860 | 0.2750 | 0.08902 | 0 | True |
| 2 | 19.69 | 21.25 | 130.00 | 1203.0 | 0.10960 | 0.15990 | 0.1974 | 0.12790 | 0.2069 | 0.05999 | ... | 152.50 | 1709.0 | 0.1444 | 0.4245 | 0.4504 | 0.2430 | 0.3613 | 0.08758 | 0 | True |
| 3 | 11.42 | 20.38 | 77.58 | 386.1 | 0.14250 | 0.28390 | 0.2414 | 0.10520 | 0.2597 | 0.09744 | ... | 98.87 | 567.7 | 0.2098 | 0.8663 | 0.6869 | 0.2575 | 0.6638 | 0.17300 | 0 | True |
| 4 | 20.29 | 14.34 | 135.10 | 1297.0 | 0.10030 | 0.13280 | 0.1980 | 0.10430 | 0.1809 | 0.05883 | ... | 152.20 | 1575.0 | 0.1374 | 0.2050 | 0.4000 | 0.1625 | 0.2364 | 0.07678 | 0 | True |
5 rows × 32 columns
NOTE The column specifying the train-test split may contain more than two values. For instance, values "train", "val" and "test" would yield a three-way split with one training set and two test sets. Just make sure that the column name and values clearly indicate what the training set is meant to be; the names of the remaining sets are arbitrary. Prediction models are evaluated on each set (including the training set) separately.
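A three-way split column as described in the note could be built along the following lines. This is only a sketch: the helper `three_way_split_labels`, the fractions, and the column name `split` are illustrative assumptions, and the split is purely positional (not shuffled), like the two-way split above.

```python
def three_way_split_labels(n_samples, train_frac=0.6, val_frac=0.2):
    """Return a list of 'train'/'val'/'test' labels, assigned by position.

    Hypothetical helper for illustration; not part of CaTabRa.
    """
    if train_frac <= 0 or val_frac < 0 or train_frac + val_frac > 1:
        raise ValueError("fractions must be non-negative and sum to at most 1")
    n_train = int(train_frac * n_samples)
    n_val = int(val_frac * n_samples)
    n_test = n_samples - n_train - n_val  # remainder goes to the test set
    return ['train'] * n_train + ['val'] * n_val + ['test'] * n_test


# 569 is the number of samples in the breast cancer dataset
labels = three_way_split_labels(569)

# With the DataFrame from above, the column could be added like this
# (the column name 'split' is an arbitrary choice):
# X['split'] = three_way_split_labels(len(X))
```

The resulting column would then be passed via the `split` argument in the same way as the `train` column in Step 1.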
Step 1: Analyze Data and Train Classifier
Analyze the prepared data X. Only one simple function call is required to produce descriptive statistics, a high-quality classifier with automatically tuned hyperparameters, and an Out-of-Distribution detector.
The corresponding command in CaTabRa’s command-line interface is called catabra analyze ....
[21]:
from catabra.analysis import analyze
analyze(
X, # table to analyze; can also be the path to a CSV/Excel/HDF5 file
classify='diagnosis', # name of column containing classification target
split='train', # name of column containing information about the train-test split (optional)
time=3, # time budget for hyperparameter tuning, in minutes (optional)
out=output_dir
)
Output folder "/mnt/c/Users/skaltenl/Documents/catabra_2023/develop/catabra/examples/workflow" already exists. Delete?
[CaTabRa] ### Analysis started at 2023-04-13 15:00:41.878108
[CaTabRa] Saving descriptive statistics completed
/mnt/c/Users/skaltenl/Documents/catabra_2023/develop/catabra/catabra/util/statistics.py:213: FutureWarning: The default value of numeric_only in DataFrame.corr is deprecated. In a future version, it will default to False. Select only valid columns or specify the value of numeric_only to silence this warning.
return dict_stat, dict_non_num_stat, (df.corr() if df.shape[1] <= corr_threshold else None)
[CaTabRa] Using AutoML-backend auto-sklearn for binary_classification
[CaTabRa] Successfully loaded the following auto-sklearn add-on module(s): xgb
[CaTabRa] Using auto-sklearn 2.0.
/home/skaltenl/anaconda3/envs/test2/lib/python3.9/site-packages/autosklearn/experimental/selector.py:24: FutureWarning: iteritems is deprecated and will be removed in a future version. Use .items instead.
for col, series in prediction.iteritems():
/home/skaltenl/anaconda3/envs/test2/lib/python3.9/site-packages/smac/intensification/parallel_scheduling.py:153: UserWarning: SuccessiveHalving is executed with 1 workers only. Consider to use pynisher to use all available workers.
warnings.warn(
[CaTabRa] New ensemble fitted:
ensemble_val_roc_auc: 0.986260
n_constituent_models: 1
total_elapsed_time: 00:03
[CaTabRa] New model #1 trained:
val_roc_auc: 0.989845
val_accuracy: 0.947368
val_balanced_accuracy: 0.946356
train_roc_auc: 1.000000
type: gradient_boosting
total_elapsed_time: 00:03
[CaTabRa] New ensemble fitted:
ensemble_val_roc_auc: 0.986260
n_constituent_models: 1
total_elapsed_time: 00:06
[CaTabRa] New model #2 trained:
val_roc_auc: 0.945430
val_accuracy: 0.921053
val_balanced_accuracy: 0.924134
train_roc_auc: 1.000000
type: gradient_boosting
total_elapsed_time: 00:05
[CaTabRa] New ensemble fitted:
ensemble_val_roc_auc: 0.986260
n_constituent_models: 1
total_elapsed_time: 00:08
[CaTabRa] New model #3 trained:
val_roc_auc: 0.971416
val_accuracy: 0.921053
val_balanced_accuracy: 0.919952
train_roc_auc: 0.993877
type: gradient_boosting
total_elapsed_time: 00:08
[CaTabRa] New ensemble fitted:
ensemble_val_roc_auc: 0.987834
n_constituent_models: 3
total_elapsed_time: 00:11
[CaTabRa] New model #4 trained:
val_roc_auc: 0.968250
val_accuracy: 0.929825
val_balanced_accuracy: 0.926523
train_roc_auc: 0.995034
type: gradient_boosting
total_elapsed_time: 00:10
[CaTabRa] New ensemble fitted:
ensemble_val_roc_auc: 0.996953
n_constituent_models: 1
total_elapsed_time: 00:14
[CaTabRa] New model #5 trained:
val_roc_auc: 0.997073
val_accuracy: 0.971491
val_balanced_accuracy: 0.970072
train_roc_auc: 0.999985
type: mlp
total_elapsed_time: 00:14
[CaTabRa] New ensemble fitted:
ensemble_val_roc_auc: 0.996953
n_constituent_models: 1
total_elapsed_time: 00:16
[CaTabRa] New model #6 trained:
val_roc_auc: 0.955048
val_accuracy: 0.914474
val_balanced_accuracy: 0.915233
train_roc_auc: 0.986036
type: gradient_boosting
total_elapsed_time: 00:16
[CaTabRa] New ensemble fitted:
ensemble_val_roc_auc: 0.996953
n_constituent_models: 1
total_elapsed_time: 00:19
[CaTabRa] New model #7 trained:
val_roc_auc: 0.990054
val_accuracy: 0.949561
val_balanced_accuracy: 0.946535
train_roc_auc: 1.000000
type: gradient_boosting
total_elapsed_time: 00:18
[CaTabRa] New ensemble fitted:
ensemble_val_roc_auc: 0.997172
n_constituent_models: 2
total_elapsed_time: 00:21
[CaTabRa] New model #8 trained:
val_roc_auc: 0.995579
val_accuracy: 0.949561
val_balanced_accuracy: 0.954898
train_roc_auc: 0.996864
type: mlp
total_elapsed_time: 00:21
[CaTabRa] New ensemble fitted:
ensemble_val_roc_auc: 0.997172
n_constituent_models: 2
total_elapsed_time: 00:24
[CaTabRa] New model #9 trained:
val_roc_auc: 0.990352
val_accuracy: 0.969298
val_balanced_accuracy: 0.967384
train_roc_auc: 0.999701
type: mlp
total_elapsed_time: 00:23
[CaTabRa] New ensemble fitted:
ensemble_val_roc_auc: 0.997172
n_constituent_models: 2
total_elapsed_time: 00:26
[CaTabRa] New ensemble fitted:
ensemble_val_roc_auc: 0.997172
n_constituent_models: 2
total_elapsed_time: 00:28
[CaTabRa] New model #10 trained:
val_roc_auc: 0.988949
val_accuracy: 0.936404
val_balanced_accuracy: 0.937933
train_roc_auc: 0.999955
type: mlp
total_elapsed_time: 00:28
[CaTabRa] New ensemble fitted:
ensemble_val_roc_auc: 0.997172
n_constituent_models: 2
total_elapsed_time: 00:31
[CaTabRa] New model #11 trained:
val_roc_auc: 0.992861
val_accuracy: 0.964912
val_balanced_accuracy: 0.962007
train_roc_auc: 1.000000
type: extra_trees
total_elapsed_time: 00:30
[CaTabRa] New ensemble fitted:
ensemble_val_roc_auc: 0.997172
n_constituent_models: 2
total_elapsed_time: 00:33
[CaTabRa] New model #12 trained:
val_roc_auc: 0.991756
val_accuracy: 0.953947
val_balanced_accuracy: 0.951912
train_roc_auc: 1.000000
type: gradient_boosting
total_elapsed_time: 00:33
[CaTabRa] New ensemble fitted:
ensemble_val_roc_auc: 0.997172
n_constituent_models: 2
total_elapsed_time: 00:35
[CaTabRa] New ensemble fitted:
ensemble_val_roc_auc: 0.997172
n_constituent_models: 2
total_elapsed_time: 00:39
[CaTabRa] New model #13 trained:
val_roc_auc: 0.995639
val_accuracy: 0.964912
val_balanced_accuracy: 0.961171
train_roc_auc: 0.999044
type: mlp
total_elapsed_time: 00:38
[CaTabRa] New ensemble fitted:
ensemble_val_roc_auc: 0.997172
n_constituent_models: 2
total_elapsed_time: 00:41
[CaTabRa] New model #14 trained:
val_roc_auc: 0.993429
val_accuracy: 0.967105
val_balanced_accuracy: 0.964695
train_roc_auc: 0.999836
type: extra_trees
total_elapsed_time: 00:40
[CaTabRa] New ensemble fitted:
ensemble_val_roc_auc: 0.997172
n_constituent_models: 2
total_elapsed_time: 00:45
[CaTabRa] New model #15 trained:
val_roc_auc: 0.958393
val_accuracy: 0.888158
val_balanced_accuracy: 0.889665
train_roc_auc: 0.973305
type: gradient_boosting
total_elapsed_time: 00:47
[CaTabRa] New model #16 trained:
val_roc_auc: 0.989934
val_accuracy: 0.947368
val_balanced_accuracy: 0.944683
train_roc_auc: 0.997961
type: gradient_boosting
total_elapsed_time: 00:49
[CaTabRa] New model #17 trained:
val_roc_auc: 0.969355
val_accuracy: 0.866228
val_balanced_accuracy: 0.863620
train_roc_auc: 0.969444
type: mlp
total_elapsed_time: 00:52
[CaTabRa] New ensemble fitted:
ensemble_val_roc_auc: 0.997172
n_constituent_models: 2
total_elapsed_time: 00:54
[CaTabRa] New model #18 trained:
val_roc_auc: 0.990382
val_accuracy: 0.942982
val_balanced_accuracy: 0.940143
train_roc_auc: 0.999806
type: gradient_boosting
total_elapsed_time: 00:54
[CaTabRa] New model #19 trained:
val_roc_auc: 0.920311
val_accuracy: 0.875000
val_balanced_accuracy: 0.850956
train_roc_auc: 0.924418
type: mlp
total_elapsed_time: 00:56
[CaTabRa] New model #20 trained:
val_roc_auc: 0.962873
val_accuracy: 0.903509
val_balanced_accuracy: 0.902628
train_roc_auc: 0.974208
type: gradient_boosting
total_elapsed_time: 00:58
[CaTabRa] New ensemble fitted:
ensemble_val_roc_auc: 0.997172
n_constituent_models: 2
total_elapsed_time: 01:00
[CaTabRa] New model #21 trained:
val_roc_auc: 0.992294
val_accuracy: 0.949561
val_balanced_accuracy: 0.946535
train_roc_auc: 1.000000
type: gradient_boosting
total_elapsed_time: 01:00
[CaTabRa] New ensemble fitted:
ensemble_val_roc_auc: 0.997172
n_constituent_models: 2
total_elapsed_time: 01:03
[CaTabRa] New model #22 trained:
val_roc_auc: 0.992234
val_accuracy: 0.945175
val_balanced_accuracy: 0.941995
train_roc_auc: 1.000000
type: gradient_boosting
total_elapsed_time: 01:03
[CaTabRa] New model #23 trained:
val_roc_auc: 0.990024
val_accuracy: 0.945175
val_balanced_accuracy: 0.947013
train_roc_auc: 0.999895
type: mlp
total_elapsed_time: 01:05
[CaTabRa] New model #24 trained:
val_roc_auc: 0.979271
val_accuracy: 0.763158
val_balanced_accuracy: 0.799164
train_roc_auc: 0.980526
type: sgd
total_elapsed_time: 01:07
[CaTabRa] New ensemble fitted:
ensemble_val_roc_auc: 0.997172
n_constituent_models: 2
total_elapsed_time: 01:10
[CaTabRa] New model #25 trained:
val_roc_auc: 0.994982
val_accuracy: 0.951754
val_balanced_accuracy: 0.941697
train_roc_auc: 0.996789
type: mlp
total_elapsed_time: 01:10
[CaTabRa] New ensemble fitted:
ensemble_val_roc_auc: 0.997172
n_constituent_models: 2
total_elapsed_time: 01:13
[CaTabRa] New model #26 trained:
val_roc_auc: 0.994385
val_accuracy: 0.967105
val_balanced_accuracy: 0.966368
train_roc_auc: 0.999940
type: mlp
total_elapsed_time: 01:13
[CaTabRa] New model #27 trained:
val_roc_auc: 0.991099
val_accuracy: 0.962719
val_balanced_accuracy: 0.960155
train_roc_auc: 1.000000
type: extra_trees
total_elapsed_time: 01:16
[CaTabRa] New model #28 trained:
val_roc_auc: 0.928614
val_accuracy: 0.620614
val_balanced_accuracy: 0.673775
train_roc_auc: 0.927225
type: mlp
total_elapsed_time: 01:18
[CaTabRa] New model #29 trained:
val_roc_auc: 0.981123
val_accuracy: 0.912281
val_balanced_accuracy: 0.909200
train_roc_auc: 0.992891
type: random_forest
total_elapsed_time: 01:20
[CaTabRa] New model #30 trained:
val_roc_auc: 0.935364
val_accuracy: 0.894737
val_balanced_accuracy: 0.890203
train_roc_auc: 0.950508
type: mlp
total_elapsed_time: 01:22
[CaTabRa] New model #31 trained:
val_roc_auc: 0.991517
val_accuracy: 0.951754
val_balanced_accuracy: 0.950060
train_roc_auc: 1.000000
type: gradient_boosting
total_elapsed_time: 01:28
[CaTabRa] New ensemble fitted:
ensemble_val_roc_auc: 0.997172
n_constituent_models: 2
total_elapsed_time: 01:32
[CaTabRa] New model #32 trained:
val_roc_auc: 0.984648
val_accuracy: 0.921053
val_balanced_accuracy: 0.917443
train_roc_auc: 0.996364
type: random_forest
total_elapsed_time: 01:36
[CaTabRa] New model #33 trained:
val_roc_auc: 0.991428
val_accuracy: 0.953947
val_balanced_accuracy: 0.953584
train_roc_auc: 1.000000
type: mlp
total_elapsed_time: 01:42
[CaTabRa] New model #34 trained:
val_roc_auc: 0.963501
val_accuracy: 0.962719
val_balanced_accuracy: 0.963501
train_roc_auc: 0.977121
type: mlp
total_elapsed_time: 01:46
[CaTabRa] New model #35 trained:
val_roc_auc: 0.948566
val_accuracy: 0.848684
val_balanced_accuracy: 0.860514
train_roc_auc: 0.950060
type: mlp
total_elapsed_time: 01:50
[CaTabRa] New model #36 trained:
val_roc_auc: 0.984020
val_accuracy: 0.914474
val_balanced_accuracy: 0.911888
train_roc_auc: 0.995550
type: random_forest
total_elapsed_time: 01:52
[CaTabRa] New model #37 trained:
val_roc_auc: 0.500000
val_accuracy: 0.469298
val_balanced_accuracy: 0.500000
train_roc_auc: 0.500000
type: mlp
total_elapsed_time: 01:55
[CaTabRa] New model #38 trained:
val_roc_auc: 0.994295
val_accuracy: 0.967105
val_balanced_accuracy: 0.964695
train_roc_auc: 0.999910
type: mlp
total_elapsed_time: 01:59
[CaTabRa] New model #39 trained:
val_roc_auc: 0.986141
val_accuracy: 0.938596
val_balanced_accuracy: 0.933931
train_roc_auc: 0.987963
type: extra_trees
total_elapsed_time: 02:02
[CaTabRa] New model #40 trained:
val_roc_auc: 0.996894
val_accuracy: 0.967105
val_balanced_accuracy: 0.964695
train_roc_auc: 0.999955
type: mlp
total_elapsed_time: 02:05
[CaTabRa] New model #41 trained:
val_roc_auc: 0.988411
val_accuracy: 0.969298
val_balanced_accuracy: 0.966547
train_roc_auc: 0.999925
type: mlp
total_elapsed_time: 02:09
[CaTabRa] New model #42 trained:
val_roc_auc: 0.987276
val_accuracy: 0.929825
val_balanced_accuracy: 0.931541
train_roc_auc: 1.000000
type: gradient_boosting
total_elapsed_time: 02:11
[CaTabRa] New model #43 trained:
val_roc_auc: 0.981481
val_accuracy: 0.947368
val_balanced_accuracy: 0.948029
train_roc_auc: 0.999627
type: mlp
total_elapsed_time: 02:18
[CaTabRa] New model #44 trained:
val_roc_auc: 0.982796
val_accuracy: 0.936404
val_balanced_accuracy: 0.932915
train_roc_auc: 0.983244
type: mlp
total_elapsed_time: 02:22
[CaTabRa] New model #45 trained:
val_roc_auc: 0.985603
val_accuracy: 0.912281
val_balanced_accuracy: 0.898327
train_roc_auc: 0.981825
type: extra_trees
total_elapsed_time: 02:25
[CaTabRa] New model #46 trained:
val_roc_auc: 0.984767
val_accuracy: 0.929825
val_balanced_accuracy: 0.932378
train_roc_auc: 0.986499
type: passive_aggressive
total_elapsed_time: 02:27
[CaTabRa] New model #47 trained:
val_roc_auc: 0.989247
val_accuracy: 0.942982
val_balanced_accuracy: 0.943489
train_roc_auc: 0.997805
type: gradient_boosting
total_elapsed_time: 02:29
[CaTabRa] New model #48 trained:
val_roc_auc: 0.989904
val_accuracy: 0.947368
val_balanced_accuracy: 0.948029
train_roc_auc: 0.994430
type: extra_trees
total_elapsed_time: 02:34
[CaTabRa] New model #49 trained:
val_roc_auc: 0.986918
val_accuracy: 0.927632
val_balanced_accuracy: 0.920490
train_roc_auc: 0.988023
type: mlp
total_elapsed_time: 02:38
[CaTabRa] New ensemble fitted:
ensemble_val_roc_auc: 0.997172
n_constituent_models: 2
total_elapsed_time: 02:42
[CaTabRa] New model #50 trained:
val_roc_auc: 0.996535
val_accuracy: 0.964912
val_balanced_accuracy: 0.962007
train_roc_auc: 0.999164
type: mlp
total_elapsed_time: 02:42
[CaTabRa] New ensemble fitted:
ensemble_val_roc_auc: 0.997212
n_constituent_models: 2
total_elapsed_time: 02:46
[CaTabRa] New model #51 trained:
val_roc_auc: 0.996595
val_accuracy: 0.971491
val_balanced_accuracy: 0.969235
train_roc_auc: 0.999970
type: mlp
total_elapsed_time: 02:46
/home/skaltenl/anaconda3/envs/test2/lib/python3.9/site-packages/sklearn/preprocessing/_data.py:3237: RuntimeWarning: divide by zero encountered in log
loglike = -n_samples / 2 * np.log(x_trans.var())
/home/skaltenl/anaconda3/envs/test2/lib/python3.9/site-packages/sklearn/neural_network/_multilayer_perceptron.py:614: ConvergenceWarning: Stochastic Optimizer: Maximum iterations (32) reached and the optimization hasn't converged yet.
warnings.warn(
/home/skaltenl/anaconda3/envs/test2/lib/python3.9/site-packages/sklearn/impute/_base.py:49: FutureWarning: Unlike other reduction functions (e.g. `skew`, `kurtosis`), the default behavior of `mode` typically preserves the axis it acts along. In SciPy 1.11.0, this behavior will change: the default value of `keepdims` will become False, the `axis` over which the statistic is taken will be eliminated, and the value None will no longer be accepted. Set `keepdims` to True or False to avoid this warning.
mode = stats.mode(array)
/home/skaltenl/anaconda3/envs/test2/lib/python3.9/site-packages/sklearn/preprocessing/_data.py:3237: RuntimeWarning: divide by zero encountered in log
loglike = -n_samples / 2 * np.log(x_trans.var())
/home/skaltenl/anaconda3/envs/test2/lib/python3.9/site-packages/sklearn/neural_network/_multilayer_perceptron.py:614: ConvergenceWarning: Stochastic Optimizer: Maximum iterations (32) reached and the optimization hasn't converged yet.
warnings.warn(
[CaTabRa] Final training statistics:
n_models_trained: 51
ensemble_val_roc_auc: 0.9972122660294704
[CaTabRa] Creating shap explainer
[CaTabRa] Initialized out-of-distribution detector of type BinsDetector
[CaTabRa] Fitting out-of-distribution detector...
[CaTabRa] Out-of-distribution detector fitted.
[CaTabRa] ### Analysis finished at 2023-04-13 15:03:41.403394
[CaTabRa] ### Elapsed time: 0 days 00:02:59.525286
[CaTabRa] ### Output saved in /mnt/c/Users/skaltenl/Documents/catabra_2023/develop/catabra/examples/workflow
[CaTabRa] ### Evaluation started at 2023-04-13 15:03:41.421730
[CaTabRa] Saving descriptive statistics completed
[CaTabRa] Predicting out-of-distribution samples.
The default value of numeric_only in DataFrame.corr is deprecated. In a future version, it will default to False. Select only valid columns or specify the value of numeric_only to silence this warning.
[CaTabRa] Saving descriptive statistics completed
[CaTabRa] Predicting out-of-distribution samples.
[CaTabRa] Evaluation results for train:
roc_auc: 0.9994623655913979
accuracy @ 0.5: 0.9868421052631579
balanced_accuracy @ 0.5: 0.9863799283154122
The default value of numeric_only in DataFrame.corr is deprecated. In a future version, it will default to False. Select only valid columns or specify the value of numeric_only to silence this warning.
[CaTabRa] Evaluation results for not_train:
roc_auc: 0.9991158267020337
accuracy @ 0.5: 0.9469026548672567
balanced_accuracy @ 0.5: 0.9655172413793103
[CaTabRa] ### Evaluation finished at 2023-04-13 15:03:44.241915
[CaTabRa] ### Elapsed time: 0 days 00:00:02.820185
[CaTabRa] ### Output saved in /mnt/c/Users/skaltenl/Documents/catabra_2023/develop/catabra/examples/workflow/eval
Because we specified a train-test split, CaTabRa not only trains a classifier (on the training set) but also evaluates it (on both sets). The last few lines of the above logging output report the performance of the classifier on “train” and “not_train”. More detailed results are available as well, as we will see in Step 3.
The newly created directory specified by output_dir contains all results generated during data analysis, including:
- a copy of the used configuration: config.json,
- the arguments passed to function analyze(): invocation.json,
- descriptive statistics of the analyzed data: statistics/,
- the trained prediction model: model.joblib,
- information about the constituents of the prediction model and their hyperparameters: model_summary.json,
- the training history: training_history.xlsx and training_history.pdf,
- the OOD-detector: ood.joblib, and
- evaluation results (because we specified a train-test split): eval/.
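To verify quickly that an output directory contains the artifacts listed above, a small check like the following can help. It is a sketch using only the standard library; the helper `missing_artifacts` is hypothetical, and the demo runs on a temporary directory rather than on a real CaTabRa output folder.

```python
import tempfile
from pathlib import Path

# File and directory names as listed above.
EXPECTED_ARTIFACTS = [
    'config.json', 'invocation.json', 'statistics', 'model.joblib',
    'model_summary.json', 'training_history.xlsx', 'training_history.pdf',
    'ood.joblib', 'eval',
]


def missing_artifacts(directory, expected=EXPECTED_ARTIFACTS):
    """Return the names from `expected` that do not exist inside `directory`."""
    root = Path(directory)
    return [name for name in expected if not (root / name).exists()]


# Demo on a temporary directory that holds only two of the artifacts.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / 'config.json').write_text('{}')
    (Path(tmp) / 'eval').mkdir()
    demo_missing = missing_artifacts(tmp)
```

Calling `missing_artifacts(output_dir)` on the real output directory is expected to return an empty list if all of the listed artifacts were written.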
Descriptive Statistics
Descriptive statistics are calculated for numeric and non-numeric (categorical) features separately and saved in statistics/statistics_numeric.xlsx and statistics/statistics_non_numeric.xlsx. It is easiest to simply view these files in Excel, but they can of course be loaded as pandas DataFrames, too.
CaTabRa provides convenience functions for loading tables in arbitrary formats, implemented in module catabra.util.io (https://github.com/risc-mi/catabra/tree/main/catabra/util/io.py): read_df() for loading a single table and read_dfs() for loading all tables stored in a file. In classification tasks, descriptive statistics are computed both for the entire dataset and for each class individually and written to two different tables, so we use read_dfs() to load both of them:
[22]:
stats = io.read_dfs(output_dir + '/statistics/statistics_numeric.xlsx')
[23]:
# overall statistics
stats['overall'].head()
[23]:
| | Unnamed: 0 | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|---|
| 0 | mean radius | 569 | 14.127292 | 3.524049 | 6.98100 | 11.70000 | 13.37000 | 15.7800 | 28.1100 |
| 1 | mean texture | 569 | 19.289649 | 4.301036 | 9.71000 | 16.17000 | 18.84000 | 21.8000 | 39.2800 |
| 2 | mean perimeter | 569 | 91.969033 | 24.298981 | 43.79000 | 75.17000 | 86.24000 | 104.1000 | 188.5000 |
| 3 | mean area | 569 | 654.889104 | 351.914129 | 143.50000 | 420.30000 | 551.10000 | 782.7000 | 2501.0000 |
| 4 | mean smoothness | 569 | 0.096360 | 0.014064 | 0.05263 | 0.08637 | 0.09587 | 0.1053 | 0.1634 |
[24]:
# statistics per class
# feature names are only given in the first row of each group; forward-fill them
stats['diagnosis']['Feature'] = stats['diagnosis']['Feature'].ffill()
stats['diagnosis'].head()
[24]:
| | Feature | diagnosis | count | mean | std | min | 25% | 50% | 75% | max | mann_whitney_u |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | mean radius | 0 | 212 | 17.462830 | 3.203971 | 10.950 | 15.0750 | 17.325 | 19.590 | 28.11 | 2.692943e-68 |
| 1 | mean radius | 1 | 357 | 12.146524 | 1.780512 | 6.981 | 11.0800 | 12.200 | 13.370 | 17.85 | 2.692943e-68 |
| 2 | mean texture | 0 | 212 | 21.604906 | 3.779470 | 10.380 | 19.3275 | 21.460 | 23.765 | 39.28 | 3.428627e-28 |
| 3 | mean texture | 1 | 357 | 17.914762 | 3.995125 | 9.710 | 15.1500 | 17.390 | 19.760 | 33.81 | 3.428627e-28 |
| 4 | mean perimeter | 0 | 212 | 115.365377 | 21.854653 | 71.900 | 98.7450 | 114.200 | 129.925 | 188.50 | 3.553870e-71 |
In the above per-class statistics, a Mann-Whitney U test is performed to detect statistically significant differences in the distribution of a feature between the different classes, and the resulting p-values are reported in column mann_whitney_u.
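To make the meaning of these p-values more tangible, here is a simplified sketch of a two-sided Mann-Whitney U test in pure Python. It uses the normal approximation without tie or continuity correction, so it is not necessarily how CaTabRa computes the reported values, and its p-values can differ from those of statistical packages; the function name is an illustrative assumption.

```python
import math


def mann_whitney_u(x, y):
    """Two-sided Mann-Whitney U test via the normal approximation.

    Simplified sketch: no tie correction and no continuity correction.
    """
    n, m = len(x), len(y)
    # U counts the pairs (a, b) with a > b; ties count one half.
    u = sum((a > b) + 0.5 * (a == b) for a in x for b in y)
    mean_u = n * m / 2
    std_u = math.sqrt(n * m * (n + m + 1) / 12)
    z = (u - mean_u) / std_u
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail probability
    return u, p_value


# Two clearly separated toy samples yield a small p-value.
u_stat, p = mann_whitney_u([1, 2, 3, 4, 5], [6, 7, 8, 9, 10])

# Identical samples yield U = n * m / 2 and a p-value of 1.
u_same, p_same = mann_whitney_u([1, 2, 3], [1, 2, 3])
```

A small p-value indicates that values of the feature tend to be larger in one class than in the other, which is the case for all features shown in the table above.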
For more information about the descriptive statistics computed by CaTabRa by default, refer to Statistics.
Descriptive statistics can be computed manually as well; see module catabra.util.statistics (https://github.com/risc-mi/catabra/tree/main/catabra/util/statistics) for details.
Model Summary
The final prediction model is summarized in model_summary.json. This file contains a dict with information about the individual constituent models (if the model is an ensemble), the used preprocessing steps, and the selected hyperparameter values. The exact format depends on the used AutoML backend, but for the default auto-sklearn backend the main information is contained in the list under the "models" key, as can be seen below:
[25]:
io.load(output_dir + '/model_summary.json')
[25]:
{'automl': 'auto-sklearn',
'task': 'binary_classification',
'models': [{'model_id': 6,
'rank': 1,
'cost': 0.00561529271206688,
'ensemble_weight': 0.0,
'data_preprocessor': "FeatTypeSplit(column_transformer=ColumnTransformer(sparse_threshold=0.0, transformers=[('numerical_transformer', NumericalPreprocessingPipeline(config=Configuration(values={ 'imputation:strategy': 'median', 'rescaling:__choice__': 'power_transformer', }) , dataset_properties={'signed': False, 'sparse': False}, exclude={}, include={}, init_params={}, steps=[('imput... 'symmetry error': 'numerical', 'texture error': 'numerical', 'worst area': 'numerical', 'worst compactness': 'numerical', 'worst concave points': 'numerical', 'worst concavity': 'numerical', 'worst fractal dimension': 'numerical', 'worst perimeter': 'numerical', 'worst radius': 'numerical', 'worst smoothness': 'numerical', 'worst symmetry': 'numerical', 'worst texture': 'numerical'}, init_params={})",
'balancing': 'Balancing(random_state=42)',
'feature_preprocessor': 'NoPreprocessing(<unknown params>)',
'classifier': 'MLPClassifier(alpha=0.07979356062608887, beta_1=0.999, beta_2=0.9, early_stopping=True, hidden_layer_sizes=(257, 257, 257), learning_rate_init=0.001829312822950054, max_iter=32, n_iter_no_change=32, random_state=42, verbose=0, warm_start=True)'},
{'model_id': 60,
'rank': 2,
'cost': 0.0034050179211469276,
'ensemble_weight': 0.7,
'data_preprocessor': "FeatTypeSplit(column_transformer=ColumnTransformer(sparse_threshold=0.0, transformers=[('numerical_transformer', NumericalPreprocessingPipeline(config=Configuration(values={ 'imputation:strategy': 'most_frequent', 'rescaling:__choice__': 'power_transformer', }) , dataset_properties={'signed': False, 'sparse': False}, exclude={}, include={}, init_params={}, steps=[... 'symmetry error': 'numerical', 'texture error': 'numerical', 'worst area': 'numerical', 'worst compactness': 'numerical', 'worst concave points': 'numerical', 'worst concavity': 'numerical', 'worst fractal dimension': 'numerical', 'worst perimeter': 'numerical', 'worst radius': 'numerical', 'worst smoothness': 'numerical', 'worst symmetry': 'numerical', 'worst texture': 'numerical'}, init_params={})",
'balancing': 'Balancing(random_state=42)',
'feature_preprocessor': 'NoPreprocessing(<unknown params>)',
'classifier': 'MLPClassifier(alpha=2.9638327738166795e-05, beta_1=0.999, beta_2=0.9, early_stopping=True, hidden_layer_sizes=(241,), learning_rate_init=0.008555948122763763, max_iter=32, n_iter_no_change=32, random_state=42, verbose=0, warm_start=True)'}]}
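Since the summary is a plain dict, it can be processed with ordinary Python. The following sketch extracts the constituent models that actually contribute to the ensemble, using only the keys visible in the output above; the helper `weighted_constituents` and the abridged `example_summary` are illustrative, and other AutoML backends may use a different layout.

```python
def weighted_constituents(summary):
    """Return (model_id, ensemble_weight) pairs with non-zero weight, heaviest first."""
    models = summary.get('models', [])
    pairs = [
        (m['model_id'], m['ensemble_weight'])
        for m in models
        if m.get('ensemble_weight', 0) > 0
    ]
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


# Abridged copy of the summary shown above (only the keys used here).
example_summary = {
    'automl': 'auto-sklearn',
    'task': 'binary_classification',
    'models': [
        {'model_id': 6, 'rank': 1, 'ensemble_weight': 0.0},
        {'model_id': 60, 'rank': 2, 'ensemble_weight': 0.7},
    ],
}

constituents = weighted_constituents(example_summary)
```

On the real file, the same function could be applied to the result of `io.load(output_dir + '/model_summary.json')`.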
Training History
Information about each model trained during hyperparameter optimization is contained in training_history.xlsx and visualized in training_history.pdf:
[26]:
io.read_df(output_dir + '/training_history.xlsx').drop('Unnamed: 0', axis=1, errors='ignore').head()
[26]:
| | model_id | timestamp | total_elapsed_time | type | val_roc_auc | val_accuracy | val_balanced_accuracy | train_roc_auc | duration | ensemble_weight | ensemble_val_roc_auc |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | 2023-04-13 15:00:46.003 | 0 days 00:00:03.044101953 | gradient_boosting | 0.989845 | 0.947368 | 0.946356 | 1.000000 | 2.391810 | 0.0 | 0.986260 |
| 1 | 3 | 2023-04-13 15:00:48.393 | 0 days 00:00:05.434068441 | gradient_boosting | 0.945430 | 0.921053 | 0.924134 | 1.000000 | 2.265227 | 0.0 | 0.986260 |
| 2 | 4 | 2023-04-13 15:00:51.003 | 0 days 00:00:08.043761730 | gradient_boosting | 0.971416 | 0.921053 | 0.919952 | 0.993877 | 2.473614 | 0.0 | 0.986260 |
| 3 | 5 | 2023-04-13 15:00:53.396 | 0 days 00:00:10.437424898 | gradient_boosting | 0.968250 | 0.929825 | 0.926523 | 0.995034 | 2.244282 | 0.0 | 0.987834 |
| 4 | 6 | 2023-04-13 15:00:56.929 | 0 days 00:00:13.970171690 | mlp | 0.997073 | 0.971491 | 0.970072 | 0.999985 | 3.378461 | 0.0 | 0.996953 |
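Once loaded, the training history is an ordinary pandas DataFrame and can be explored as such. The sketch below rebuilds the five rows shown above (selected columns only) so that it runs on its own; the derived column `roc_auc_gap` is an illustrative addition and not part of CaTabRa's output.

```python
import pandas as pd

# First rows of the training history shown above (selected columns only).
history = pd.DataFrame({
    'model_id': [2, 3, 4, 5, 6],
    'type': ['gradient_boosting'] * 4 + ['mlp'],
    'val_roc_auc': [0.989845, 0.945430, 0.971416, 0.968250, 0.997073],
    'train_roc_auc': [1.000000, 1.000000, 0.993877, 0.995034, 0.999985],
    'duration': [2.391810, 2.265227, 2.473614, 2.244282, 3.378461],
})

# Gap between training and validation ROC-AUC as a rough overfitting indicator.
history['roc_auc_gap'] = history['train_roc_auc'] - history['val_roc_auc']

# Model with the highest validation ROC-AUC among the rows above.
best = history.loc[history['val_roc_auc'].idxmax()]

# Mean validation ROC-AUC per model type.
per_type = history.groupby('type')['val_roc_auc'].mean()
```

In practice, `history` would be the DataFrame returned by `io.read_df(output_dir + '/training_history.xlsx')`.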
Step 2: Calibrate Classifier
Classifiers can be calibrated to ensure that the probability estimates they return correspond to the “true” confidence of the model. As in the initial data analysis and model construction, one simple function call suffices to calibrate a classifier in CaTabRa.
Worth noting is the from_invocation keyword argument, which automatically sets all unspecified arguments to the values stored in the given JSON file; this applies, for example, to split. Setting subset to True has the effect that the classifier is calibrated only on those samples whose value in the train-test-split column "train" is True (i.e., the training set). Normally, though, classifiers should not be calibrated on the training set. After calibration, model.joblib is replaced by the new, calibrated model.
The corresponding command in CaTabRa’s command-line interface is catabra calibrate ....
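What it means for probability estimates to be well calibrated can be quantified, for example, with the expected calibration error (ECE): predictions are grouped into probability bins, and within each bin the mean predicted probability is compared with the observed fraction of positive samples. The sketch below is a generic pure-Python illustration of this diagnostic under the stated binning scheme; it is not taken from CaTabRa.

```python
def expected_calibration_error(probs, labels, n_bins=10):
    """Weighted mean absolute gap between predicted probability and observed frequency."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        index = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes into the last bin
        bins[index].append((p, y))
    total = len(probs)
    error = 0.0
    for members in bins:
        if not members:
            continue  # empty bins do not contribute
        mean_prob = sum(p for p, _ in members) / len(members)
        frac_pos = sum(y for _, y in members) / len(members)
        error += len(members) / total * abs(frac_pos - mean_prob)
    return error


# Overconfident toy model: predicts 0.9 four times, but only 3 of 4 samples are positive.
ece_overconfident = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [1, 1, 1, 0])

# Perfectly calibrated toy model.
ece_perfect = expected_calibration_error([0.0, 0.0, 1.0, 1.0], [0, 0, 1, 1])
```

An ECE close to 0 indicates that predicted probabilities match observed frequencies; in the overconfident toy case the gap is 0.9 - 0.75 = 0.15.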
[27]:
from catabra.calibration import calibrate
calibrate(
X,
folder=output_dir, # directory containing trained classifier (= output directory of previous call to `analyze()`)
from_invocation=output_dir + '/invocation.json',
subset=True,
out=output_dir + '/calib'
)
[CaTabRa] ### Calibration started at 2023-04-13 15:03:44.475144
[CaTabRa] Restricting table to calibration subset train = True (456 entries)
[CaTabRa] ### Calibration finished at 2023-04-13 15:03:45.766419
[CaTabRa] ### Elapsed time: 0 days 00:00:01.291275
[CaTabRa] ### Output saved in /mnt/c/Users/skaltenl/Documents/catabra_2023/develop/catabra/examples/workflow/calib
Step 3: Evaluate Classifier
Prediction models can be evaluated on (labeled) data that have the same format as the data they were initially trained on, as passed to function catabra.analysis.analyze() (https://github.com/risc-mi/catabra/tree/main/catabra/analysis/main.py). Again, one simple function call is sufficient. If the data are split into two or more disjoint subsets via argument split (implicit in from_invocation below), the model is evaluated on each of these subsets separately.
Bootstrapping can be used to obtain estimates of the variance, confidence interval, etc. of the performance of our classifier. We activate it by simply setting bootstrapping_repetitions to the desired number of repetitions.
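The idea behind bootstrapping can be sketched in a few lines of pure Python: the evaluated samples are repeatedly resampled with replacement, the metric is recomputed on every resample, and the spread of the resulting values yields a confidence interval. The function below is a generic illustration using percentile intervals for accuracy; its name, parameters, and toy data are assumptions and do not reflect CaTabRa's internal implementation.

```python
import random


def bootstrap_accuracy(y_true, y_pred, repetitions=1000, alpha=0.05, seed=0):
    """Point estimate and percentile confidence interval of accuracy."""
    rng = random.Random(seed)
    n = len(y_true)
    correct = [int(t == p) for t, p in zip(y_true, y_pred)]
    point = sum(correct) / n
    estimates = []
    for _ in range(repetitions):
        # draw n samples with replacement and recompute the metric
        sample = [correct[rng.randrange(n)] for _ in range(n)]
        estimates.append(sum(sample) / n)
    estimates.sort()
    lower = estimates[int(alpha / 2 * repetitions)]
    upper = estimates[min(int((1 - alpha / 2) * repetitions), repetitions - 1)]
    return point, lower, upper


# Toy labels and predictions: 8 out of every 10 predictions are correct.
y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1] * 5
y_pred = [1, 0, 1, 0, 0, 1, 0, 1, 1, 1] * 5
acc, ci_low, ci_high = bootstrap_accuracy(y_true, y_pred)
```

More repetitions give more stable interval bounds at the cost of longer run time, which is the trade-off controlled by bootstrapping_repetitions.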
Since the desired output directory has been created by function analyze() already, we are asked whether it should be replaced.
The corresponding command in CaTabRa’s command-line interface is catabra evaluate ....
[28]:
from catabra.evaluation import evaluate
evaluate(
X,
folder=output_dir, # directory containing trained classifier (= output directory of previous call to `analyze()`)
from_invocation=output_dir + '/invocation.json',
bootstrapping_repetitions=1000, # number of bootstrapping repetitions to perform; set to 0 to disable bootstrapping
out=output_dir + '/eval'
)
Evaluation folder "/mnt/c/Users/skaltenl/Documents/catabra_2023/develop/catabra/examples/workflow/eval" already exists. Delete?
[CaTabRa] ### Evaluation started at 2023-04-13 15:03:45.781536
[CaTabRa] Saving descriptive statistics completed
[CaTabRa] Predicting out-of-distribution samples.
The default value of numeric_only in DataFrame.corr is deprecated. In a future version, it will default to False. Select only valid columns or specify the value of numeric_only to silence this warning.
[CaTabRa] Saving descriptive statistics completed
[CaTabRa] Predicting out-of-distribution samples.
The default value of numeric_only in DataFrame.corr is deprecated. In a future version, it will default to False. Select only valid columns or specify the value of numeric_only to silence this warning.
[CaTabRa] Evaluation results for train:
roc_auc: 0.9994623655913979
accuracy @ 0.5: 0.9868421052631579
balanced_accuracy @ 0.5: 0.9863799283154122
[CaTabRa] Evaluation results for not_train:
roc_auc: 0.9991158267020337
accuracy @ 0.5: 0.9469026548672567
balanced_accuracy @ 0.5: 0.9655172413793103
[CaTabRa] ### Evaluation finished at 2023-04-13 15:04:01.642427
[CaTabRa] ### Elapsed time: 0 days 00:00:15.860891
[CaTabRa] ### Output saved in /mnt/c/Users/skaltenl/Documents/catabra_2023/develop/catabra/examples/workflow/eval
Note how accuracy and balanced accuracy changed compared to the initial data analysis. This is because of model calibration, which potentially affects thresholded metrics (like accuracy and balanced accuracy) but leaves threshold-independent metrics, like ROC-AUC, unchanged.
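The following minimal sketch illustrates why this happens. It is not CaTabRa's calibration routine: the labels, the scores and the mapping `calibrate()` are made up for illustration. The only assumption is that calibration acts as a strictly increasing map on the scores, so the ranking of samples (and hence ROC-AUC) is preserved while scores may move across the 0.5 threshold.

```python
def roc_auc(labels, scores):
    """ROC-AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def accuracy_at(labels, scores, threshold=0.5):
    """Accuracy when every score >= threshold is predicted as the positive class."""
    return sum((s >= threshold) == (y == 1) for y, s in zip(labels, scores)) / len(labels)


def calibrate(score):
    # hypothetical, strictly increasing recalibration map (NOT what CaTabRa uses)
    return score ** 0.5


# toy data, made up for illustration
labels = [0, 0, 1, 0, 1, 1]
raw_scores = [0.05, 0.10, 0.15, 0.20, 0.30, 0.90]
calibrated_scores = [calibrate(s) for s in raw_scores]

print('roc_auc        raw / calibrated:', roc_auc(labels, raw_scores), roc_auc(labels, calibrated_scores))
print('accuracy @ 0.5 raw / calibrated:', accuracy_at(labels, raw_scores), accuracy_at(labels, calibrated_scores))
```

In this toy setting the two ROC-AUC values coincide, whereas accuracy at threshold 0.5 differs between raw and recalibrated scores.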
Performance Metrics (Non-Bootstrapped)
Among the main evaluation results produced by CaTabRa are tables with detailed information on model performance, together with corresponding visualizations. In our case, they are contained in subdirectories eval/train/ and eval/not_train/.
Non-bootstrapped performance metrics are saved in metrics.xlsx. In binary classification, this file consists of the three tables "overall", "thresholded" and "calibration".
[29]:
metrics = io.read_dfs(output_dir + '/eval/not_train/metrics.xlsx')
Table "overall" contains non-thresholded performance metrics, like ROC-AUC, average precision, etc.:
[30]:
metrics['overall']
[30]:
| | Unnamed: 0 | pos_label | n | n_pos | roc_auc | average_precision | pr_auc | brier_loss | hinge_loss | log_loss |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | diagnosis | 1 | 113 | 87 | 0.999116 | 0.999742 | 0.99974 | 0.03369 | 0.283123 | 0.125328 |
Table "thresholded" contains all performance metrics that depend on a specific decision threshold (a.k.a. cut-off point), like accuracy, balanced accuracy, F1-score, etc. These metrics are evaluated at different decision thresholds.
[31]:
metrics['thresholded'].drop('Unnamed: 0', axis=1).head()
[31]:
| | threshold | accuracy | balanced_accuracy | f1 | sensitivity | specificity | positive_predictive_value | negative_predictive_value | cohen_kappa | hamming_loss | jaccard | true_positive | true_negative | false_positive | false_negative |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.012333 | 0.769912 | 0.500000 | 0.870000 | 1.0 | 0.000000 | 0.769912 | 1.0 | 0.000000 | 0.230088 | 0.769912 | 87 | 0 | 26 | 0 |
| 1 | 0.012333 | 0.796460 | 0.557692 | 0.883249 | 1.0 | 0.115385 | 0.790909 | 1.0 | 0.167254 | 0.203540 | 0.790909 | 87 | 3 | 23 | 0 |
| 2 | 0.012333 | 0.814159 | 0.596154 | 0.892308 | 1.0 | 0.192308 | 0.805556 | 1.0 | 0.268270 | 0.185841 | 0.805556 | 87 | 5 | 21 | 0 |
| 3 | 0.012333 | 0.823009 | 0.615385 | 0.896907 | 1.0 | 0.230769 | 0.813084 | 1.0 | 0.315981 | 0.176991 | 0.813084 | 87 | 6 | 20 | 0 |
| 4 | 0.012333 | 0.831858 | 0.634615 | 0.901554 | 1.0 | 0.269231 | 0.820755 | 1.0 | 0.361961 | 0.168142 | 0.820755 | 87 | 7 | 19 | 0 |
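Most columns of this table follow directly from the four confusion-matrix counts in the last columns. The sketch below recomputes a few of them using the standard textbook definitions; the helper `thresholded_metrics()` is a hypothetical name introduced here and is not part of CaTabRa. It is fed with the counts of row 1 of the table above.

```python
def thresholded_metrics(tp, tn, fp, fn):
    """Derive common thresholded metrics from the four confusion-matrix counts.

    Degenerate inputs (e.g. no positive samples) would raise ZeroDivisionError;
    this sketch does not handle them.
    """
    sensitivity = tp / (tp + fn)  # true positive rate
    specificity = tn / (tn + fp)  # true negative rate
    return {
        'accuracy': (tp + tn) / (tp + tn + fp + fn),
        'balanced_accuracy': (sensitivity + specificity) / 2,
        'f1': 2 * tp / (2 * tp + fp + fn),
        'sensitivity': sensitivity,
        'specificity': specificity,
        'positive_predictive_value': tp / (tp + fp),
    }


# counts taken from row 1 of the table above
row1 = thresholded_metrics(tp=87, tn=3, fp=23, fn=0)
for name, value in row1.items():
    print(f'{name}: {value:.6f}')
```

Rounded to six decimals, the printed values should agree with the corresponding entries of row 1.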
Table "calibration" contains the fraction of positive samples for different threshold intervals. The intervals are constructed such that each of them contains roughly the same number of samples.
[32]:
metrics['calibration'].drop('Unnamed: 0', axis=1).head()
[32]:
| | threshold_lower | threshold_upper | pos_fraction |
|---|---|---|---|
| 0 | 0.012333 | 0.012333 | 0.0 |
| 1 | 0.012333 | 0.012333 | 0.0 |
| 2 | 0.012333 | 0.012333 | 0.0 |
| 3 | 0.012333 | 0.012333 | 0.0 |
| 4 | 0.012333 | 0.012333 | 0.0 |
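A table of this shape can be sketched as follows. This is a simplified illustration under the assumption that samples are sorted by predicted probability and cut into equally sized chunks; CaTabRa's actual binning may differ in detail. The function name `calibration_table()` and the toy data are made up.

```python
def calibration_table(labels, probas, n_bins=5):
    """Split samples, sorted by predicted probability, into bins of roughly equal
    size and report the fraction of positive samples in each bin."""
    pairs = sorted(zip(probas, labels))
    n = len(pairs)
    rows = []
    for b in range(n_bins):
        chunk = pairs[b * n // n_bins:(b + 1) * n // n_bins]
        if not chunk:
            continue  # more bins than samples
        rows.append({
            'threshold_lower': chunk[0][0],
            'threshold_upper': chunk[-1][0],
            'pos_fraction': sum(y for _, y in chunk) / len(chunk),
        })
    return rows


# toy data, made up for illustration
probas = [0.01, 0.02, 0.05, 0.10, 0.30, 0.55, 0.70, 0.80, 0.95, 0.99]
labels = [0, 0, 0, 0, 1, 0, 1, 1, 1, 1]

rows = calibration_table(labels, probas, n_bins=5)
for row in rows:
    print(row)
```

With equal-count bins, a bin's lower and upper bound coincide whenever all samples in it share the same predicted probability, which would explain rows like the ones shown above.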
Bootstrapped Performance
Since we activated bootstrapping by setting bootstrapping_repetitions to a positive number, file bootstrapping.xlsx was generated. It contains two tables "summary" and "details" with summary statistics over all bootstrapping runs and the runs themselves, respectively.
[33]:
bootstrapping = io.read_dfs(output_dir + '/eval/not_train/bootstrapping.xlsx')
[34]:
bootstrapping['summary']
[34]:
| | Unnamed: 0 | roc_auc | accuracy | balanced_accuracy | __threshold |
|---|---|---|---|---|---|
| 0 | count | 1000.000000 | 1000.000000 | 1000.000000 | 1000.0 |
| 1 | mean | 0.999107 | 0.946177 | 0.965097 | 0.5 |
| 2 | std | 0.001177 | 0.021432 | 0.013757 | 0.0 |
| 3 | min | 0.990909 | 0.876106 | 0.922222 | 0.5 |
| 4 | 25% | 0.998557 | 0.929204 | 0.956044 | 0.5 |
| 5 | 50% | 0.999532 | 0.946903 | 0.966292 | 0.5 |
| 6 | 75% | 1.000000 | 0.964602 | 0.975904 | 0.5 |
| 7 | max | 1.000000 | 1.000000 | 1.000000 | 0.5 |
Table "details" reports the performance metrics for each single run, together with the random seed used for resampling the data.
[35]:
bootstrapping['details'].drop('Unnamed: 0', axis=1, errors='ignore').head()
[35]:
| | roc_auc | accuracy | balanced_accuracy | __seed |
|---|---|---|---|---|
| 0 | 1.000000 | 0.955752 | 0.970238 | 2854880344 |
| 1 | 1.000000 | 0.938053 | 0.963158 | 1506600952 |
| 2 | 1.000000 | 0.893805 | 0.931034 | 3277809138 |
| 3 | 0.997895 | 0.946903 | 0.960000 | 3141104837 |
| 4 | 1.000000 | 0.964602 | 0.977011 | 2847344748 |
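Conceptually, each bootstrapping run resamples the evaluated samples with replacement and recomputes the metrics on the resample. The sketch below shows the idea for accuracy only. It is an illustration, not CaTabRa's implementation: the function name, the toy data and the use of a single seeded generator (instead of one seed per run, as the "__seed" column suggests) are assumptions.

```python
import random
import statistics


def bootstrap_accuracy(labels, predictions, repetitions=1000, seed=0):
    """Resample (label, prediction) pairs with replacement and compute the
    accuracy of every resample."""
    rng = random.Random(seed)
    n = len(labels)
    scores = []
    for _ in range(repetitions):
        idx = [rng.randrange(n) for _ in range(n)]
        scores.append(sum(labels[i] == predictions[i] for i in idx) / n)
    return scores


# toy data, made up for illustration: 20 samples, 2 of them misclassified
labels = [0] * 8 + [1] * 12
predictions = [0] * 7 + [1] * 12 + [0]

scores = sorted(bootstrap_accuracy(labels, predictions))
summary = {
    'mean': statistics.mean(scores),
    'std': statistics.stdev(scores),
    # simple 95% percentile interval over the sorted bootstrap scores
    'ci_lower': scores[int(0.025 * len(scores))],
    'ci_upper': scores[int(0.975 * len(scores)) - 1],
}
print(summary)
```

The spread of the per-run scores is what the "std" and percentile rows of the summary table describe.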
Sample-Wise Predictions
Finally, the model output for each individual sample is saved in predictions.xlsx.
[36]:
predictions = io.read_df(output_dir + '/eval/not_train/predictions.xlsx')
The table contains the true label (column "diagnosis") and the predicted probabilities of the negative and positive class, respectively. Note that in our case the two classes are simply called 0 and 1, which is why the corresponding columns are called "0_proba" and "1_proba".
[37]:
predictions.head()
[37]:
| | Unnamed: 0 | diagnosis | 0_proba | 1_proba |
|---|---|---|---|---|
| 0 | 456 | 1 | 0.007257 | 0.992743 |
| 1 | 457 | 1 | 0.224164 | 0.775836 |
| 2 | 458 | 1 | 0.727156 | 0.272844 |
| 3 | 459 | 1 | 0.005909 | 0.994091 |
| 4 | 460 | 0 | 0.987667 | 0.012333 |
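Predicted class labels are not stored explicitly in this table, but they can be derived by thresholding column "1_proba". The sketch below does this in plain Python for the five rows shown above, so that it runs without the generated files. The threshold of 0.5 matches the one reported in the evaluation log; the key "sample" is a name chosen here for the values of column "Unnamed: 0", which are assumed to be the original row indices of X.

```python
# the five rows displayed above
rows = [
    {'sample': 456, 'diagnosis': 1, '0_proba': 0.007257, '1_proba': 0.992743},
    {'sample': 457, 'diagnosis': 1, '0_proba': 0.224164, '1_proba': 0.775836},
    {'sample': 458, 'diagnosis': 1, '0_proba': 0.727156, '1_proba': 0.272844},
    {'sample': 459, 'diagnosis': 1, '0_proba': 0.005909, '1_proba': 0.994091},
    {'sample': 460, 'diagnosis': 0, '0_proba': 0.987667, '1_proba': 0.012333},
]

threshold = 0.5
for row in rows:
    # predict the positive class if its probability reaches the threshold
    row['predicted'] = int(row['1_proba'] >= threshold)
    row['correct'] = row['predicted'] == row['diagnosis']

misclassified = [row['sample'] for row in rows if not row['correct']]
print('misclassified samples:', misclassified)
```

With a pandas DataFrame as returned by io.read_df(), the same thresholding would be a vectorized comparison on column "1_proba".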
Out-of-Distribution Detection
In addition to the output of the prediction model, we can also inspect the likelihood of samples (or the whole training or test set) being out-of-distribution (OOD). Predictions for samples with high OOD likelihood should be treated with care, as these samples might differ significantly from all samples the model has seen during training.
[38]:
ood = io.read_df(output_dir + '/eval/not_train/ood.xlsx')
[39]:
ood.head()
[39]:
| | Unnamed: 0 | proba | decision |
|---|---|---|---|
| 0 | 0 | 0 | False |
| 1 | 1 | 0 | False |
| 2 | 2 | 0 | False |
| 3 | 3 | 0 | False |
| 4 | 4 | 0 | False |
Step 4: Explain Classifier
Prediction models can be explained on data that have the same format as the data they were initially trained on, as passed to function analyze(). As before, one simple function call is sufficient. If the data is split into two or more disjoint subsets via argument split (implicit in from_invocation below), the model is explained on each of these subsets separately.
If the final model is an ensemble of several base models, each of them is explained separately.
By default, SHAP is used for generating local (i.e., sample-wise) explanations in terms of feature importance scores. These scores are saved as HDF5 tables and visualized in so-called beeswarm plots, and can be found in the specified output directory.
In addition to SHAP, CaTabRa also provides a ready-to-use implementation of permutation importance. The advantage of permutation importance over SHAP is that it can generally be computed much faster. We use it here by setting explainer="permutation" in the command below. You can try SHAP by setting explainer="shap" or simply omitting the keyword argument.
The corresponding command in CaTabRa’s command-line interface is catabra explain ....
[41]:
from catabra.explanation import explain
explain(
X,
folder=output_dir, # directory containing trained classifier (= output directory of previous call to `analyze()`)
from_invocation=output_dir + '/invocation.json',
out=output_dir + '/explain_permutation',
explainer='permutation'
)
[CaTabRa] ### Explanation started at 2023-04-13 15:20:32.120711
[CaTabRa] *** Split train
Features: 100%|########################################| 30/30 [00:06<00:00, 4.93it/s]
[CaTabRa] *** Split not_train
Features: 100%|########################################| 30/30 [00:03<00:00, 11.82it/s]
[CaTabRa] ### Explanation finished at 2023-04-13 15:20:43.127964
[CaTabRa] ### Elapsed time: 0 days 00:00:11.007253
[CaTabRa] ### Output saved in /mnt/c/Users/skaltenl/Documents/catabra_2023/develop/catabra/examples/workflow/explain_permutation
Permutation importance generates global (i.e., feature-wise) explanations. The corresponding importance scores are saved as HDF5 tables and visualized in bar plots.
Refer to Explanations for more information about model explanations.
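The idea behind permutation importance can be sketched in a few lines: shuffle one feature column at a time and measure how much the model's performance drops. The code below is a generic illustration with a made-up toy model and accuracy as the metric; it is not CaTabRa's implementation, and all names in it are hypothetical.

```python
import random


def accuracy(model, X, y):
    """Fraction of samples for which the model predicts the true label."""
    return sum(model(row) == label for row, label in zip(X, y)) / len(y)


def permutation_importance(model, X, y, n_repeats=10, seed=0):
    """Importance of feature j = mean drop in accuracy after shuffling column j."""
    rng = random.Random(seed)
    baseline = accuracy(model, X, y)
    importances = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)
            # copy of X in which only column j is replaced by its shuffled version
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            drops.append(baseline - accuracy(model, X_perm, y))
        importances.append(sum(drops) / n_repeats)
    return importances


# toy setup, made up for illustration: the label depends on feature 0 only
data_rng = random.Random(1)
X = [[data_rng.random(), data_rng.random()] for _ in range(200)]
y = [int(row[0] > 0.5) for row in X]


def model(row):
    # stand-in for a trained classifier that uses feature 0 and ignores feature 1
    return int(row[0] > 0.5)


importances = permutation_importance(model, X, y)
print(importances)
```

Shuffling the informative feature destroys most of the accuracy, whereas shuffling the ignored feature changes nothing, so its importance is exactly zero.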
Step 5: Apply Classifier to New Data
Finally, the trained classifier can be applied to new data of the same format as the data it was initially trained on, possibly without the label column. For demonstration purposes we apply the classifier to the same data X we have been using throughout, although in a real-world use case this would not make sense.
The corresponding command in CaTabRa’s command-line interface is catabra apply ....
[42]:
from catabra.application import apply
apply(
X.drop('diagnosis', axis=1), # data to apply the model to; column containing ground-truth labels is not needed (but would not harm either)
folder=output_dir, # directory containing trained classifier (= output directory of previous call to `analyze()`)
from_invocation=output_dir + '/invocation.json',
out=output_dir + '/apply'
)
[CaTabRa] ### Application started at 2023-04-13 15:20:43.169793
[CaTabRa] Predicting out-of-distribution samples.
[CaTabRa] ### Application finished at 2023-04-13 15:20:43.760727
[CaTabRa] ### Elapsed time: 0 days 00:00:00.590934
[CaTabRa] ### Output saved in /mnt/c/Users/skaltenl/Documents/catabra_2023/develop/catabra/examples/workflow/apply
The results are saved in predictions.xlsx and contain the predicted probabilities of the two classes, for every sample. OOD scores are saved again in ood.xlsx.
[43]:
predictions = io.read_df(output_dir + '/apply/predictions.xlsx')
[44]:
predictions.head()
[44]:
| | Unnamed: 0 | 0_proba | 1_proba |
|---|---|---|---|
| 0 | 0 | 0.987667 | 0.012333 |
| 1 | 1 | 0.987667 | 0.012333 |
| 2 | 2 | 0.987667 | 0.012333 |
| 3 | 3 | 0.987667 | 0.012333 |
| 4 | 4 | 0.987667 | 0.012333 |
Load Classifier into Python
Prediction models generated with CaTabRa can easily be loaded into a Python session. The most straightforward way to do this is through the catabra.util.io.CaTabRaLoader class (https://github.com/risc-mi/catabra/tree/main/catabra/util/io.py), which only needs to be instantiated with the directory containing the model:
[45]:
loader = io.CaTabRaLoader(output_dir)
The resulting class instance provides easy access to all sorts of artifacts generated by the functions above, in particular the trained classifier:
[46]:
model = loader.get_model()
Investigating the Model
The type of the loaded model object depends on the AutoML backend used for training it, in this case auto-sklearn:
[47]:
type(model)
[47]:
catabra.automl.askl.backend.AutoSklearnBackend
If we want a uniform representation of the model independent of the AutoML backend, we can convert it into a catabra.automl.fitted_ensemble.FittedEnsemble (https://github.com/risc-mi/catabra/tree/main/catabra/automl/fitted_ensemble.py):
[48]:
fe = model.fitted_ensemble()
A FittedEnsemble is, as its name suggests, an ensemble consisting of individual base models and a meta-estimator combining the predictions of the base models into a single output. These base models can be accessed via the models_ attribute, which is a dict mapping model-IDs to instances of class FittedModel:
[49]:
fe.models_
[49]:
{6: FittedModel(
preprocessing=[ColumnTransformer(sparse_threshold=0.0,
transformers=[('numerical_transformer',
Pipeline(steps=[('imputation',
SimpleImputer(copy=False,
strategy='median')),
('variance_threshold',
VarianceThreshold()),
('rescaling',
PowerTransformer(copy=False)),
('dummy', 'passthrough')]),
[True, True, True, True, True, True, True,
True, True, True, True, True, True, True,
True, True, True, True, True, True, True,
True, True, True, True, True, True, True,
True, True])])],
estimator=MLPClassifier(alpha=0.07979356062608887, beta_1=0.999, beta_2=0.9,
early_stopping=True, hidden_layer_sizes=(257, 257, 257),
learning_rate_init=0.001829312822950054, max_iter=32,
n_iter_no_change=32, random_state=42, verbose=0, warm_start=True)),
60: FittedModel(
preprocessing=[ColumnTransformer(sparse_threshold=0.0,
transformers=[('numerical_transformer',
Pipeline(steps=[('imputation',
SimpleImputer(copy=False,
strategy='most_frequent')),
('variance_threshold',
VarianceThreshold()),
('rescaling',
PowerTransformer(copy=False)),
('dummy', 'passthrough')]),
[True, True, True, True, True, True, True,
True, True, True, True, True, True, True,
True, True, True, True, True, True, True,
True, True, True, True, True, True, True,
True, True])])],
estimator=MLPClassifier(alpha=2.9638327738166795e-05, beta_1=0.999, beta_2=0.9,
early_stopping=True, hidden_layer_sizes=(241,),
learning_rate_init=0.008555948122763763, max_iter=32,
n_iter_no_change=32, random_state=42, verbose=0, warm_start=True))}
[50]:
list(fe.models_.values())[0]
[50]:
FittedModel(
preprocessing=[ColumnTransformer(sparse_threshold=0.0,
transformers=[('numerical_transformer',
Pipeline(steps=[('imputation',
SimpleImputer(copy=False,
strategy='median')),
('variance_threshold',
VarianceThreshold()),
('rescaling',
PowerTransformer(copy=False)),
('dummy', 'passthrough')]),
[True, True, True, True, True, True, True,
True, True, True, True, True, True, True,
True, True, True, True, True, True, True,
True, True, True, True, True, True, True,
True, True])])],
estimator=MLPClassifier(alpha=0.07979356062608887, beta_1=0.999, beta_2=0.9,
early_stopping=True, hidden_layer_sizes=(257, 257, 257),
learning_rate_init=0.001829312822950054, max_iter=32,
n_iter_no_change=32, random_state=42, verbose=0, warm_start=True))
NOTE: Predictions returned by fe may deviate slightly from those of model due to a known bug in auto-sklearn.
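To make the role of the meta-estimator more tangible, the following sketch combines the class probabilities of two base models by weighted averaging. This is only one possible combination rule and is assumed here purely for illustration; the function `combine()`, the weights and the probabilities are made up, and only the model-IDs 6 and 60 are borrowed from the output above.

```python
def combine(base_probas, weights):
    """Weighted average of the per-sample class probabilities of all base models."""
    total = sum(weights.values())
    first = next(iter(base_probas.values()))
    n_samples, n_classes = len(first), len(first[0])
    return [
        [sum(weights[m] * base_probas[m][i][c] for m in base_probas) / total
         for c in range(n_classes)]
        for i in range(n_samples)
    ]


# made-up probabilities of two base models for two samples and two classes
base_probas = {
    6: [[0.9, 0.1], [0.2, 0.8]],
    60: [[0.7, 0.3], [0.4, 0.6]],
}
# made-up ensemble weights
weights = {6: 0.75, 60: 0.25}

ensemble_probas = combine(base_probas, weights)
print(ensemble_probas)
```

Because the weights are normalized by their sum, every row of the combined output is again a probability distribution over the classes.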
Applying the Model
If we want to apply the model to new data, we first need to load the encoder that was constructed jointly with the model. Again, the loader object comes in handy:
[51]:
encoder = loader.get_encoder()
[52]:
model.predict_proba(encoder.transform(x=X))
[52]:
array([[0.98766681, 0.01233319],
[0.98766681, 0.01233319],
[0.98766681, 0.01233319],
...,
[0.98766655, 0.01233345],
[0.98766681, 0.01233319],
[0.00594657, 0.99405343]])