Add New Out-of-Distribution Detector
This notebook is part of the CaTabRa GitHub repository.
This notebook demonstrates how a new Out-of-Distribution (OOD) detector can be added to CaTabRa.
Implement Random OOD-Detector
We implement a new dummy OOD-detector that assigns a random OOD probability to each sample.
If you intend to actually add a proper new OOD detector, have a look at the implementation of one of the default detectors, like `catabra.ood.pyod <https://github.com/risc-mi/catabra/tree/main/catabra/ood/pyod.py>`__.
[1]:
import numpy as np
import pandas as pd
from catabra.ood.base import OODDetector
OOD-detectors need to implement the abstract base class `catabra.ood.base.OODDetector <https://github.com/risc-mi/catabra/tree/main/catabra/ood/base.py>`__. The main methods of interest are _fit_transformed() and _predict_proba_transformed() for fitting the detector on training data and applying it to unseen samples, respectively.
[2]:
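class RandomOODDetector(OODDetector):

    def _fit_transformer(self, X: pd.DataFrame):
        # no preprocessing step needs to be fitted
        pass

    def _transform(self, X: pd.DataFrame):
        # pass the data through unchanged
        return X

    def _fit_transformed(self, X: pd.DataFrame, y: pd.Series):
        # nothing to learn: the detector is purely random
        pass

    def _predict_transformed(self, X):
        # binary decision, obtained by thresholding a fresh random draw
        return self._predict_proba_transformed(X) >= 0.5

    def _predict_proba_transformed(self, X):
        # one uniform random number in [0, 1) per sample
        return np.random.uniform(0, 1, size=len(X))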
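To make the role of the five hooks more concrete, here is a minimal, self-contained sketch of how a base class might chain them. `SketchOODDetector` and `SeededRandomDetector` are hypothetical stand-ins written for illustration only; the public method names `fit`, `predict` and `predict_proba` and the order of the calls are assumptions, not CaTabRa's actual implementation, which lives in `catabra.ood.base`. The seeded variant additionally shows one possible way to make decisions agree with probabilities across calls.

```python
import abc

import numpy as np
import pandas as pd


class SketchOODDetector(abc.ABC):
    """Hypothetical stand-in base class; NOT catabra.ood.base.OODDetector."""

    def fit(self, X: pd.DataFrame, y=None):
        self._fit_transformer(X)        # 1. fit the preprocessing step
        Xt = self._transform(X)         # 2. apply the preprocessing step
        self._fit_transformed(Xt, y)    # 3. fit the detector on preprocessed data
        return self

    def predict(self, X: pd.DataFrame):
        return self._predict_transformed(self._transform(X))

    def predict_proba(self, X: pd.DataFrame):
        return self._predict_proba_transformed(self._transform(X))

    @abc.abstractmethod
    def _fit_transformer(self, X): ...

    @abc.abstractmethod
    def _transform(self, X): ...

    @abc.abstractmethod
    def _fit_transformed(self, X, y): ...

    @abc.abstractmethod
    def _predict_transformed(self, X): ...

    @abc.abstractmethod
    def _predict_proba_transformed(self, X): ...


class SeededRandomDetector(SketchOODDetector):
    """Random detector whose output is reproducible for a given seed."""

    def __init__(self, seed: int = 0):
        self.seed = seed

    def _fit_transformer(self, X):
        pass

    def _transform(self, X):
        return X

    def _fit_transformed(self, X, y):
        pass

    def _predict_proba_transformed(self, X):
        # re-seeding on every call makes repeated calls return identical values
        rng = np.random.default_rng(self.seed)
        return rng.uniform(0, 1, size=len(X))

    def _predict_transformed(self, X):
        return self._predict_proba_transformed(X) >= 0.5


X_demo = pd.DataFrame({'a': [1.0, 2.0, 3.0, 4.0]})
detector = SeededRandomDetector(seed=42).fit(X_demo)
proba = detector.predict_proba(X_demo)
decision = detector.predict(X_demo)
```

Because the generator is re-seeded in every call, `decision` here equals `proba >= 0.5`. The `RandomOODDetector` above gives no such guarantee, since it draws new numbers each time it is called.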
Utilize Random OOD-Detector
[3]:
# load dataset
from sklearn.datasets import load_breast_cancer
X, y = load_breast_cancer(as_frame=True, return_X_y=True)
[4]:
# add target labels to DataFrame
X['diagnosis'] = y
[5]:
# split into train and test sets by adding a column with corresponding values
# the name of the column is arbitrary; CaTabRa tries to "guess" which samples belong to which set based on the column name and values
X['train'] = X.index <= 0.8 * len(X)
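As a quick sanity check of the split expression, the sketch below reproduces it on an empty frame with the same number of rows as the breast cancer dataset (569). It assumes the default integer index 0 to 568, which is what `load_breast_cancer(as_frame=True)` produces; the name `demo` is only used for this illustration.

```python
import pandas as pd

n_samples = 569  # number of rows in the breast cancer dataset
demo = pd.DataFrame(index=pd.RangeIndex(n_samples))

# same expression as above: True for the first ~80% of the rows
demo['train'] = demo.index <= 0.8 * n_samples

n_train = int(demo['train'].sum())
n_test = int((~demo['train']).sum())
first_test_index = int(demo[~demo['train']].index[0])

print(n_train, n_test, first_test_index)  # 456 113 456
```

Since the split is purely positional, it only behaves like a random split if the rows are not ordered in a meaningful way.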
When analyzing the data, we inform CaTabRa that we want to use the new dummy OOD-detector by adjusting the config dict:
[6]:
from catabra.analysis import analyze
analyze(
    X,
    classify='diagnosis',     # name of column containing classification target
    split='train',            # name of column containing information about the train-test split (optional)
    out='random_ood_example',
    config={
        'automl': None,                             # deactivate model building
        'ood_source': 'external',                   # set to "external" for custom detectors
        'ood_class': '__main__.RandomOODDetector'   # name (and module) of the OODDetector subclass
    }
)
Output folder "/mnt/c/Users/amaletzk/Documents/CaTabRa/catabra/examples/random_ood_example" already exists. Delete? [y/n] y
[CaTabRa] ### Analysis started at 2023-02-13 08:52:47.970930
[CaTabRa] Saving descriptive statistics completed
[CaTabRa] ### Analysis finished at 2023-02-13 08:52:50.348169
[CaTabRa] ### Elapsed time: 0 days 00:00:02.377239
[CaTabRa] ### Output saved in /mnt/c/Users/amaletzk/Documents/CaTabRa/catabra/examples/random_ood_example
[CaTabRa] ### Evaluation started at 2023-02-13 08:52:50.400599
[CaTabRa] Predicting out-of-distribution samples.
[CaTabRa] Saving descriptive statistics completed
[CaTabRa] Saving descriptive statistics completed
[CaTabRa] ### Evaluation finished at 2023-02-13 08:52:51.023008
[CaTabRa] ### Elapsed time: 0 days 00:00:00.622409
[CaTabRa] ### Output saved in /mnt/c/Users/amaletzk/Documents/CaTabRa/catabra/examples/random_ood_example/eval
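The value of `'ood_class'` is a dotted path made up of a module name and a class name. Classes defined in a notebook cell live in the module `__main__`, hence `'__main__.RandomOODDetector'`. The sketch below shows one common way such a string can be resolved to a class using the standard library; `resolve_class` is a hypothetical helper, and CaTabRa's own loading code may work differently.

```python
import collections
import importlib


def resolve_class(path: str):
    """Split 'module.ClassName' at the last dot and return the class object."""
    module_name, _, class_name = path.rpartition('.')
    module = importlib.import_module(module_name)
    return getattr(module, class_name)


# demonstrated with a standard-library class so that the example runs anywhere
cls = resolve_class('collections.OrderedDict')
print(cls is collections.OrderedDict)  # True
```

For a detector that should be reusable outside the notebook, placing the class in an importable module and referencing it as `'my_module.MyDetector'` (hypothetical names) would follow the same pattern.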
Although we deactivated model building by setting "automl" to None, there is still an eval/ directory containing descriptive statistics of the training and test sets, as well as OOD probabilities:
[6]:
from catabra.util import io
[7]:
io.read_df('random_ood_example/eval/not_train/ood.xlsx').set_index('Unnamed: 0').head()
[7]:
| Unnamed: 0 | proba | decision |
|---|---|---|
| 456 | 0.149550 | True |
| 457 | 0.300978 | True |
| 458 | 0.676700 | True |
| 459 | 0.118917 | False |
| 460 | 0.641352 | True |
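Note that the decisions in the table do not follow from thresholding `proba` at 0.5. This is consistent with the dummy detector, whose `_predict_transformed()` draws new random numbers instead of reusing the probabilities, assuming the two columns are obtained through separate calls (plausible, but not verified here). The small check below rebuilds the five displayed rows by hand and lists the indices where the two disagree.

```python
import pandas as pd

# the five rows displayed above, entered manually
ood = pd.DataFrame(
    {
        'proba': [0.149550, 0.300978, 0.676700, 0.118917, 0.641352],
        'decision': [True, True, True, False, True],
    },
    index=[456, 457, 458, 459, 460],
)

# what the decision would be if it were derived from the stored probability
thresholded = ood['proba'] >= 0.5
mismatch = ood.index[(thresholded != ood['decision']).to_numpy()].tolist()

print(mismatch)  # [456, 457]
```

A proper detector would normally derive its decision from the same score it reports, so such mismatches should not occur there.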
The OOD detector can be applied to unseen samples using the apply() function, as usual:
[8]:
from catabra.application import apply
apply(
    X,
    folder='random_ood_example',
    from_invocation='random_ood_example/invocation.json',
    out='random_ood_example/apply'
)
Application folder "/mnt/c/Users/amaletzk/Documents/CaTabRa/catabra/examples/random_ood_example/apply" already exists. Delete? [y/n] y
[CaTabRa] ### Application started at 2023-02-13 09:07:09.017188
[CaTabRa] Predicting out-of-distribution samples.
[CaTabRa] ### Application finished at 2023-02-13 09:07:10.864471
[CaTabRa] ### Elapsed time: 0 days 00:00:01.847283
[CaTabRa] ### Output saved in /mnt/c/Users/amaletzk/Documents/CaTabRa/catabra/examples/random_ood_example/apply