Add New Explanation Backend

This notebook is part of the CaTabRa GitHub repository.

This notebook demonstrates how a new explanation backend can be added to CaTabRa, i.e.,

how it can be implemented, and
how it can be utilized in CaTabRa’s data analysis workflow.

Implement Random Explainer

We implement a new dummy explanation backend that returns random feature importance scores. If you intend to actually add a new explanation backend, don’t forget to have a look at the implementation of the existing SHAP and permutation importance backends in `catabra.explanation._shap.backend <https://github.com/risc-mi/catabra/tree/main/catabra/explanation/_shap/backend.py>`__ and `catabra.explanation._permutation.backend <https://github.com/risc-mi/catabra/tree/main/catabra/explanation/_permutation/backend.py>`__, respectively.

[2]:

from typing import Optional
import numpy as np
import pandas as pd

from catabra.explanation.base import TransformationExplainer, IdentityTransformationExplainer, EnsembleExplainer
from catabra.automl.base import FittedEnsemble

Explanation backends need to implement the abstract base class `catabra.explanation.base.EnsembleExplainer <https://github.com/risc-mi/catabra/tree/main/catabra/explanation/base.py>`__. The main methods of interest are explain() and explain_global() for producing local (sample-wise) and global explanations, respectively.

[3]:

class RandomEnsembleExplainer(EnsembleExplainer):

    @property
    def name(self) -> str:
        return 'random_explainer'

    @property
    def behavior(self) -> dict:
        return dict(
            supports_local=True,
            requires_y=False,
            global_accepts_x=True,
            global_requires_x=True,
            global_is_mean_of_local=True
        )

    def __init__(self, ensemble: FittedEnsemble = None, config: Optional[dict] = None,
             feature_names: Optional[list] = None, target_names: Optional[list] = None,
             x: Optional[pd.DataFrame] = None, y: Optional[pd.DataFrame] = None, params=None):
        super(RandomEnsembleExplainer, self).__init__(ensemble=ensemble, config=config, feature_names=feature_names,
                                                      target_names=target_names, x=x, y=y, params=params)

        # `ensemble` is the ensemble of one or more models we will explain,
        #    in the uniform `FittedEnsemble` representation.
        # Note: If the explainer needs to know the prediction task `ensemble` was trained for
        #    (binary/multiclass/multilabel classification, regression),
        #    it can be accessed via `ensemble.task`.
        self._ensemble = ensemble

        if params is None:
            assert x is not None
            if feature_names is None:
                self._feature_names = list(range(x.shape[1]))
            else:
                assert len(feature_names) == x.shape[1]
                self._feature_names = feature_names
            self._params = dict(feature_names=self._feature_names)
        else:
            self._params = params
            self._feature_names = self._params['feature_names']

    @property
    def params_(self) -> dict:
        return self._params

    def explain(self, x: pd.DataFrame, y: Optional[pd.DataFrame]=None, jobs: int = 1, batch_size: Optional[int] = None,
                model_id=None, mapping: Optional[dict] = None, show_progress: bool = False) -> dict:
        if model_id is None:
            keys = ['__ensemble__']
        elif not isinstance(model_id, (list, set)):
            keys = [model_id]
        else:
            keys = model_id

        # explain models with IDs in `keys` on data `x`,
        # and return DataFrames with one row per sample and one column per feature

        return {
            k: pd.DataFrame(
                index=x.index,
                columns=self._feature_names,
                data=np.random.uniform(-1, 1, size=(len(x), len(self._feature_names)))
            )
            for k in keys
        }

    def explain_global(self, x: Optional[pd.DataFrame] = None, sample_weight: Optional[np.ndarray] = None,
                       jobs: int = 1, batch_size: Optional[int] = None, model_id=None,
                       mapping: Optional[dict] = None, show_progress: bool = False) -> dict:
        local_explanation = self.explain(x, jobs=jobs, batch_size=batch_size, model_id=model_id,
                                         mapping=mapping, show_progress=show_progress)
        return {k: e.mean(axis=0) for k, e in local_explanation.items()}

    def get_versions(self) -> dict:
        return {}

Finally, the new backend needs to be registered:

[4]:

EnsembleExplainer.register('random_explainer', RandomEnsembleExplainer)

Utilize Random Explainer

We now utilize the new "random_explainer" in a simple classification problem. Everthing shown works analogously for all other prediction tasks.

[5]:

# load dataset
from sklearn.datasets import load_breast_cancer
X, y = load_breast_cancer(as_frame=True, return_X_y=True)

[6]:

# add target labels to DataFrame
X['diagnosis'] = y

[7]:

# split into train- and test set by adding column with corresponding values
# the name of the column is arbitrary; CaTabRa tries to "guess" which samples belong to which set based on the column name and -values
X['train'] = X.index <= 0.8 * len(X)

When analyzing the data, we inform CaTabRa that we want to use the "random_explainer" backend by adjusting the config dict (strictly speaking, this is only necessary if the explainer needs to be fit to training data):

[8]:

from catabra.analysis import analyze

analyze(
    X,
    classify='diagnosis',     # name of column containing classification target
    split='train',            # name of column containing information about the train-test split (optional)
    time=1,                   # time budget for hyperparameter tuning, in minutes (optional)
    out='random_explainer_example',
    config={
        'explainer': 'random_explainer'     # name of the explanation backend
    }
)

[CaTabRa] ### Analysis started at 2023-04-13 12:17:10.550203
[CaTabRa] Saving descriptive statistics completed

/mnt/c/Users/skaltenl/Documents/catabra_2023/develop/catabra/catabra/util/statistics.py:213: FutureWarning: The default value of numeric_only in DataFrame.corr is deprecated. In a future version, it will default to False. Select only valid columns or specify the value of numeric_only to silence this warning.
  return dict_stat, dict_non_num_stat, (df.corr() if df.shape[1] <= corr_threshold else None)

[CaTabRa] Using AutoML-backend auto-sklearn for binary_classification
[CaTabRa] Successfully loaded the following auto-sklearn add-on module(s): xgb
[CaTabRa] Using auto-sklearn 2.0.

/home/skaltenl/anaconda3/envs/test2/lib/python3.9/site-packages/autosklearn/experimental/selector.py:24: FutureWarning: iteritems is deprecated and will be removed in a future version. Use .items instead.
  for col, series in prediction.iteritems():
/home/skaltenl/anaconda3/envs/test2/lib/python3.9/site-packages/smac/intensification/parallel_scheduling.py:153: UserWarning: SuccessiveHalving is executed with 1 workers only. Consider to use pynisher to use all available workers.
  warnings.warn(

[CaTabRa] New ensemble fitted:
    ensemble_val_roc_auc: 0.986260
    n_constituent_models: 1
    total_elapsed_time: 00:07
[CaTabRa] New model #1 trained:
    val_roc_auc: 0.989845
    val_accuracy: 0.947368
    val_balanced_accuracy: 0.946356
    train_roc_auc: 1.000000
    type: gradient_boosting
    total_elapsed_time: 00:07
[CaTabRa] New ensemble fitted:
    ensemble_val_roc_auc: 0.986260
    n_constituent_models: 1
    total_elapsed_time: 00:09
[CaTabRa] New model #2 trained:
    val_roc_auc: 0.945430
    val_accuracy: 0.921053
    val_balanced_accuracy: 0.924134
    train_roc_auc: 1.000000
    type: gradient_boosting
    total_elapsed_time: 00:09
[CaTabRa] New ensemble fitted:
    ensemble_val_roc_auc: 0.986260
    n_constituent_models: 1
    total_elapsed_time: 00:12
[CaTabRa] New model #3 trained:
    val_roc_auc: 0.971416
    val_accuracy: 0.921053
    val_balanced_accuracy: 0.919952
    train_roc_auc: 0.993877
    type: gradient_boosting
    total_elapsed_time: 00:11
[CaTabRa] New ensemble fitted:
    ensemble_val_roc_auc: 0.987834
    n_constituent_models: 3
    total_elapsed_time: 00:14
[CaTabRa] New model #4 trained:
    val_roc_auc: 0.968250
    val_accuracy: 0.929825
    val_balanced_accuracy: 0.926523
    train_roc_auc: 0.995034
    type: gradient_boosting
    total_elapsed_time: 00:14
[CaTabRa] New ensemble fitted:
    ensemble_val_roc_auc: 0.996953
    n_constituent_models: 1
    total_elapsed_time: 00:17
[CaTabRa] New model #5 trained:
    val_roc_auc: 0.997073
    val_accuracy: 0.971491
    val_balanced_accuracy: 0.970072
    train_roc_auc: 0.999985
    type: mlp
    total_elapsed_time: 00:17
[CaTabRa] New ensemble fitted:
    ensemble_val_roc_auc: 0.996953
    n_constituent_models: 1
    total_elapsed_time: 00:19
[CaTabRa] New model #6 trained:
    val_roc_auc: 0.955048
    val_accuracy: 0.914474
    val_balanced_accuracy: 0.915233
    train_roc_auc: 0.986036
    type: gradient_boosting
    total_elapsed_time: 00:19
[CaTabRa] New ensemble fitted:
    ensemble_val_roc_auc: 0.996953
    n_constituent_models: 1
    total_elapsed_time: 00:22
[CaTabRa] New model #7 trained:
    val_roc_auc: 0.990054
    val_accuracy: 0.949561
    val_balanced_accuracy: 0.946535
    train_roc_auc: 1.000000
    type: gradient_boosting
    total_elapsed_time: 00:22
[CaTabRa] New ensemble fitted:
    ensemble_val_roc_auc: 0.997172
    n_constituent_models: 2
    total_elapsed_time: 00:24
[CaTabRa] New model #8 trained:
    val_roc_auc: 0.995579
    val_accuracy: 0.949561
    val_balanced_accuracy: 0.954898
    train_roc_auc: 0.996864
    type: mlp
    total_elapsed_time: 00:24
[CaTabRa] New ensemble fitted:
    ensemble_val_roc_auc: 0.997172
    n_constituent_models: 2
    total_elapsed_time: 00:27
[CaTabRa] New model #9 trained:
    val_roc_auc: 0.990352
    val_accuracy: 0.969298
    val_balanced_accuracy: 0.967384
    train_roc_auc: 0.999701
    type: mlp
    total_elapsed_time: 00:27
[CaTabRa] New ensemble fitted:
    ensemble_val_roc_auc: 0.997172
    n_constituent_models: 2
    total_elapsed_time: 00:29
[CaTabRa] New ensemble fitted:
    ensemble_val_roc_auc: 0.997172
    n_constituent_models: 2
    total_elapsed_time: 00:31
[CaTabRa] New model #10 trained:
    val_roc_auc: 0.988949
    val_accuracy: 0.936404
    val_balanced_accuracy: 0.937933
    train_roc_auc: 0.999955
    type: mlp
    total_elapsed_time: 00:31
[CaTabRa] New ensemble fitted:
    ensemble_val_roc_auc: 0.997172
    n_constituent_models: 2
    total_elapsed_time: 00:34
[CaTabRa] New model #11 trained:
    val_roc_auc: 0.992861
    val_accuracy: 0.964912
    val_balanced_accuracy: 0.962007
    train_roc_auc: 1.000000
    type: extra_trees
    total_elapsed_time: 00:34
[CaTabRa] New ensemble fitted:
    ensemble_val_roc_auc: 0.997172
    n_constituent_models: 2
    total_elapsed_time: 00:36
[CaTabRa] New model #12 trained:
    val_roc_auc: 0.991756
    val_accuracy: 0.953947
    val_balanced_accuracy: 0.951912
    train_roc_auc: 1.000000
    type: gradient_boosting
    total_elapsed_time: 00:36
[CaTabRa] New ensemble fitted:
    ensemble_val_roc_auc: 0.997172
    n_constituent_models: 2
    total_elapsed_time: 00:38
[CaTabRa] New ensemble fitted:
    ensemble_val_roc_auc: 0.997172
    n_constituent_models: 2
    total_elapsed_time: 00:42
[CaTabRa] New model #13 trained:
    val_roc_auc: 0.995639
    val_accuracy: 0.964912
    val_balanced_accuracy: 0.961171
    train_roc_auc: 0.999044
    type: mlp
    total_elapsed_time: 00:41
[CaTabRa] New ensemble fitted:
    ensemble_val_roc_auc: 0.997172
    n_constituent_models: 2
    total_elapsed_time: 00:44
[CaTabRa] New model #14 trained:
    val_roc_auc: 0.993429
    val_accuracy: 0.967105
    val_balanced_accuracy: 0.964695
    train_roc_auc: 0.999836
    type: extra_trees
    total_elapsed_time: 00:44
[CaTabRa] New ensemble fitted:
    ensemble_val_roc_auc: 0.997172
    n_constituent_models: 2
    total_elapsed_time: 00:48
[CaTabRa] New model #15 trained:
    val_roc_auc: 0.958393
    val_accuracy: 0.888158
    val_balanced_accuracy: 0.889665
    train_roc_auc: 0.973305
    type: gradient_boosting
    total_elapsed_time: 00:50
[CaTabRa] New model #16 trained:
    val_roc_auc: 0.989934
    val_accuracy: 0.947368
    val_balanced_accuracy: 0.944683
    train_roc_auc: 0.997961
    type: gradient_boosting
    total_elapsed_time: 00:52

/home/skaltenl/anaconda3/envs/test2/lib/python3.9/site-packages/sklearn/preprocessing/_data.py:3237: RuntimeWarning: divide by zero encountered in log
  loglike = -n_samples / 2 * np.log(x_trans.var())
/home/skaltenl/anaconda3/envs/test2/lib/python3.9/site-packages/sklearn/neural_network/_multilayer_perceptron.py:614: ConvergenceWarning: Stochastic Optimizer: Maximum iterations (32) reached and the optimization hasn't converged yet.
  warnings.warn(
/home/skaltenl/anaconda3/envs/test2/lib/python3.9/site-packages/sklearn/neural_network/_multilayer_perceptron.py:614: ConvergenceWarning: Stochastic Optimizer: Maximum iterations (32) reached and the optimization hasn't converged yet.
  warnings.warn(

[CaTabRa] Final training statistics:
    n_models_trained: 16
    ensemble_val_roc_auc: 0.9971724412584628
[CaTabRa] Creating random_explainer explainer
[CaTabRa] Initialized out-of-distribution detector of type BinsDetector
[CaTabRa] Fitting out-of-distribution detector...
[CaTabRa] Out-of-distribution detector fitted.
[CaTabRa] ### Analysis finished at 2023-04-13 12:18:10.490961
[CaTabRa] ### Elapsed time: 0 days 00:00:59.940758
[CaTabRa] ### Output saved in /mnt/c/Users/skaltenl/Documents/catabra_2023/develop/catabra/examples/random_explainer_example
[CaTabRa] ### Evaluation started at 2023-04-13 12:18:10.512848
[CaTabRa] Saving descriptive statistics completed
[CaTabRa] Predicting out-of-distribution samples.

/mnt/c/Users/skaltenl/Documents/catabra_2023/develop/catabra/catabra/util/statistics.py:213: FutureWarning: The default value of numeric_only in DataFrame.corr is deprecated. In a future version, it will default to False. Select only valid columns or specify the value of numeric_only to silence this warning.
  return dict_stat, dict_non_num_stat, (df.corr() if df.shape[1] <= corr_threshold else None)

[CaTabRa] Saving descriptive statistics completed
[CaTabRa] Predicting out-of-distribution samples.
[CaTabRa] Evaluation results for train:
    roc_auc: 0.9990442054958184
    accuracy @ 0.5: 0.9824561403508771
    balanced_accuracy @ 0.5: 0.9818399044205496

/mnt/c/Users/skaltenl/Documents/catabra_2023/develop/catabra/catabra/util/statistics.py:213: FutureWarning: The default value of numeric_only in DataFrame.corr is deprecated. In a future version, it will default to False. Select only valid columns or specify the value of numeric_only to silence this warning.
  return dict_stat, dict_non_num_stat, (df.corr() if df.shape[1] <= corr_threshold else None)

[CaTabRa] Evaluation results for not_train:
    roc_auc: 0.9995579133510168
    accuracy @ 0.5: 0.9734513274336283
    balanced_accuracy @ 0.5: 0.9827586206896552
[CaTabRa] ### Evaluation finished at 2023-04-13 12:18:13.383042
[CaTabRa] ### Elapsed time: 0 days 00:00:02.870194
[CaTabRa] ### Output saved in /mnt/c/Users/skaltenl/Documents/catabra_2023/develop/catabra/examples/random_explainer_example/eval

The rest proceeds as usual:

[9]:

from catabra.explanation import explain

explain(
    X,
    folder='random_explainer_example',
    from_invocation='random_explainer_example/invocation.json',
    out='random_explainer_example/explain'
)

[CaTabRa] ### Explanation started at 2023-04-13 12:18:13.398007
[CaTabRa] *** Split train
[CaTabRa] *** Split not_train
[CaTabRa] ### Explanation finished at 2023-04-13 12:18:16.667998
[CaTabRa] ### Elapsed time: 0 days 00:00:03.269991
[CaTabRa] ### Output saved in /mnt/c/Users/skaltenl/Documents/catabra_2023/develop/catabra/examples/random_explainer_example/explain

[10]:

from catabra.util.io import read_df
ex = read_df('random_explainer_example/explain/not_train/__ensemble__.h5')

As can be seen, the feature importance scores are completely random:

[11]:

from catabra.explanation import plot_beeswarms
plot_beeswarms(ex, features=X[~X['train']])

[11]:

../_images/jupyter_explanation_extension_22_0.png