Add New Out-of-Distribution Detector


This notebook is part of the CaTabRa GitHub repository.

This notebook demonstrates how a new Out-of-Distribution (OOD) detector can be added to CaTabRa.

Implement Random OOD-Detector

We implement a dummy OOD detector that assigns a random OOD probability to each sample.

If you intend to actually add a proper new OOD detector, have a look at the implementation of one of the default detectors, like `catabra.ood.pyod <https://github.com/risc-mi/catabra/tree/main/catabra/ood/pyod.py>`__.

[1]:
import numpy as np
import pandas as pd

from catabra.ood.base import OODDetector

OOD detectors must subclass the abstract base class `catabra.ood.base.OODDetector <https://github.com/risc-mi/catabra/tree/main/catabra/ood/base.py>`__. The main methods of interest are _fit_transformed(), which fits the detector on training data, and _predict_proba_transformed(), which applies the fitted detector to unseen samples.

[2]:
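class RandomOODDetector(OODDetector):
    """Dummy detector that ignores the data and returns random OOD probabilities."""

    def _fit_transformer(self, X: pd.DataFrame):
        # no preprocessing transformer needs to be fitted
        pass

    def _transform(self, X: pd.DataFrame):
        # identity transformation: pass the data through unchanged
        return X

    def _fit_transformed(self, X: pd.DataFrame, y: pd.Series):
        # nothing to learn from the training data
        pass

    def _predict_transformed(self, X):
        # binary decisions; note that this draws fresh random numbers on every call
        return self._predict_proba_transformed(X) >= 0.5

    def _predict_proba_transformed(self, X):
        # one uniformly distributed value in [0, 1) per sample
        return np.random.uniform(0, 1, size=len(X))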
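
To illustrate what the fit and predict-proba steps would do in a detector that actually looks at the data, here is a minimal sketch. It is deliberately stand-alone: the class ZScoreSketch and its method names are hypothetical, it does not subclass OODDetector, and the z-score rule is just one simple choice picked for illustration, not one of CaTabRa's default detectors. The returned values are scores in [0, 1), not calibrated probabilities.

```python
import numpy as np


class ZScoreSketch:
    """Hypothetical stand-alone class; it does NOT subclass catabra's OODDetector."""

    def fit(self, X: np.ndarray) -> "ZScoreSketch":
        # Remember per-feature mean and standard deviation of the training data.
        self.mean_ = X.mean(axis=0)
        self.std_ = X.std(axis=0) + 1e-12  # small constant avoids division by zero
        return self

    def predict_proba(self, X: np.ndarray) -> np.ndarray:
        # Largest absolute z-score per sample, squashed into [0, 1).
        z = np.abs((X - self.mean_) / self.std_).max(axis=1)
        return 1.0 - np.exp(-z / 3.0)

    def predict(self, X: np.ndarray) -> np.ndarray:
        # Decisions are derived from the same scores, so both outputs agree.
        return self.predict_proba(X) >= 0.5


rng = np.random.default_rng(0)
train = rng.normal(size=(200, 3))  # synthetic training data
detector = ZScoreSketch().fit(train)

inlier = np.zeros((1, 3))         # close to the training mean
outlier = np.full((1, 3), 10.0)   # far away from the training data
scores = detector.predict_proba(np.vstack([inlier, outlier]))
print(scores)
```

In a real CaTabRa detector, the logic of fit() would go into _fit_transformed() and the logic of predict_proba() into _predict_proba_transformed().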

Utilize Random OOD-Detector

[3]:
# load dataset
from sklearn.datasets import load_breast_cancer
X, y = load_breast_cancer(as_frame=True, return_X_y=True)
[4]:
# add target labels to DataFrame
X['diagnosis'] = y
[5]:
# split into training and test set by adding a boolean column
# the column name is arbitrary; CaTabRa tries to "guess" which samples belong to which set based on the column name and values
X['train'] = X.index <= 0.8 * len(X)
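
As a quick sanity check of the split expression above, the following sketch reproduces it on a plain integer range instead of the DataFrame index. It assumes the default 0..n-1 index and the 569 rows of scikit-learn's breast cancer dataset.

```python
import numpy as np

n_samples = 569               # number of rows in the breast cancer dataset
index = np.arange(n_samples)  # mirrors the default 0..n-1 index of the DataFrame

# Same comparison as in the notebook cell: 0.8 * 569 = 455.2
is_train = index <= 0.8 * n_samples

n_train = int(is_train.sum())
n_test = int((~is_train).sum())
first_test_index = int(index[~is_train][0])
print(n_train, n_test, first_test_index)  # prints "456 113 456"
```

Because the comparison is inclusive and starts at index 0, the training set contains 456 rows and the test set starts at index 456.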

When analyzing the data, we inform CaTabRa that we want to use the new dummy OOD-detector by adjusting the config dict:

[6]:
from catabra.analysis import analyze

analyze(
    X,
    classify='diagnosis',     # name of column containing classification target
    split='train',            # name of column containing information about the train-test split (optional)
    out='random_ood_example',
    config={
        'automl': None,                     # deactivate model building
        'ood_source': 'external',           # set to "external" for custom detectors
        'ood_class': '__main__.RandomOODDetector'    # name (and module) of the OODDetector subclass
    }
)
Output folder "/mnt/c/Users/amaletzk/Documents/CaTabRa/catabra/examples/random_ood_example" already exists. Delete? [y/n] y
[CaTabRa] ### Analysis started at 2023-02-13 08:52:47.970930
[CaTabRa] Saving descriptive statistics completed
[CaTabRa] ### Analysis finished at 2023-02-13 08:52:50.348169
[CaTabRa] ### Elapsed time: 0 days 00:00:02.377239
[CaTabRa] ### Output saved in /mnt/c/Users/amaletzk/Documents/CaTabRa/catabra/examples/random_ood_example
[CaTabRa] ### Evaluation started at 2023-02-13 08:52:50.400599
[CaTabRa] Predicting out-of-distribution samples.
[CaTabRa] Saving descriptive statistics completed
[CaTabRa] Saving descriptive statistics completed
[CaTabRa] ### Evaluation finished at 2023-02-13 08:52:51.023008
[CaTabRa] ### Elapsed time: 0 days 00:00:00.622409
[CaTabRa] ### Output saved in /mnt/c/Users/amaletzk/Documents/CaTabRa/catabra/examples/random_ood_example/eval
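
The 'ood_class' entry is a dotted path of the form "<module>.<class>". The sketch below shows one common way such a string can be turned into a class object with the standard library. The helper resolve_class is hypothetical and only illustrates the idea; CaTabRa's own lookup may work differently.

```python
import importlib


def resolve_class(dotted_path: str) -> type:
    """Hypothetical helper: turn 'package.module.ClassName' into the class object."""
    module_name, _, class_name = dotted_path.rpartition(".")
    if not module_name:
        raise ValueError(f"expected '<module>.<class>', got {dotted_path!r}")
    module = importlib.import_module(module_name)
    return getattr(module, class_name)


# Demonstrated with a standard-library class so the snippet runs anywhere.
cls = resolve_class("collections.OrderedDict")
print(cls.__name__)
```

Classes defined in a notebook cell live in the `__main__` module, which is why the config above refers to '__main__.RandomOODDetector'.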

Although we deactivated model building by setting "automl" to None, CaTabRa still creates an eval/ directory. It contains descriptive statistics of the training and test sets, as well as the OOD probabilities:

[6]:
from catabra.util import io
[7]:
io.read_df('random_ood_example/eval/not_train/ood.xlsx').set_index('Unnamed: 0').head()
[7]:
               proba  decision
Unnamed: 0
456         0.149550      True
457         0.300978      True
458         0.676700      True
459         0.118917     False
460         0.641352      True
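
Note that the decisions in this table do not follow from the probabilities (for example, 0.149550 is paired with True). This is plausible for the dummy detector: _predict_transformed() calls _predict_proba_transformed() again and therefore draws new random numbers. The sketch below reproduces the effect with plain NumPy, assuming the two columns come from separate calls (an assumption about how the outputs are produced, not verified against CaTabRa's source).

```python
import numpy as np

rng = np.random.default_rng(42)
n = 10_000

# Two independent draws, as happens when probabilities and decisions
# come from separate calls to a random generator.
proba = rng.uniform(0, 1, size=n)
decision = rng.uniform(0, 1, size=n) >= 0.5

# Fraction of samples where the decision matches "proba >= 0.5".
agreement = float(np.mean(decision == (proba >= 0.5)))
print(f"agreement with independent draws: {agreement:.2f}")

# Deriving the decision from the stored probabilities always agrees.
consistent = proba >= 0.5
print(f"agreement when derived from proba: {np.mean(consistent == (proba >= 0.5)):.2f}")
```

With independent draws the two columns agree only about half of the time. A proper detector avoids this by computing decisions from deterministic scores.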

The OOD detector can be applied to unseen samples using the apply() function, as usual:

[8]:
from catabra.application import apply

apply(
    X,
    folder='random_ood_example',
    from_invocation='random_ood_example/invocation.json',
    out='random_ood_example/apply'
)
Application folder "/mnt/c/Users/amaletzk/Documents/CaTabRa/catabra/examples/random_ood_example/apply" already exists. Delete? [y/n] y
[CaTabRa] ### Application started at 2023-02-13 09:07:09.017188
[CaTabRa] Predicting out-of-distribution samples.
[CaTabRa] ### Application finished at 2023-02-13 09:07:10.864471
[CaTabRa] ### Elapsed time: 0 days 00:00:01.847283
[CaTabRa] ### Output saved in /mnt/c/Users/amaletzk/Documents/CaTabRa/catabra/examples/random_ood_example/apply