Out-of-Distribution

Configuration

Configuration for the command-line tools should be specified in the following format:

    "ood_class": "autoencoder",
    "ood_source": "internal",
    "ood_kwargs": {}
  • "ood_source" defines the origin of the detector. The following values are accepted:

    • "internal": detector implemented directly in CaTabRa.

    • "pyod": detector from the PyOD library.

    • "external": detector implemented by an external module.

  • "ood_class" is the name/path of an OOD detector:

    • if "ood_source" is "internal": name of one of the modules in catabra.ood.internal (e.g. "soft_brownian_offset")

    • if "ood_source" is "pyod": name of one of the modules in pyod.models (e.g. "kde")

    • if "ood_source" is "external": full import path consisting of modules and class (e.g. custom.module.CustomOOD)

    • if the value is None, no OOD detection is performed

  • "ood_kwargs" is a dictionary of optional parameters for specific OOD detectors in the form {"parameter-name": value, ...}, e.g. for the autoencoder {"target_dim_factor": 0.25, "reduction_factor": 0.9}.
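
For example, the following sketch selects PyOD's kernel density estimation detector and passes its contamination parameter through "ood_kwargs" (assuming, as described above, that these keyword arguments are forwarded to the detector; the surrounding configuration file contains further settings not shown here):

    "ood_class": "kde",
    "ood_source": "pyod",
    "ood_kwargs": {"contamination": 0.01}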

Classes

Internal

BinsDetector

A simple detector that divides each feature into equally sized bins and checks whether any bins have no corresponding instances in the training set. A new value that falls within such an empty bin is considered OOD.

The test returns for each sample whether it is out-of-distribution. Values returned by predict_proba are the distances to the bin edges, normalized by the bin width, if the value falls within an empty bin, and 0 otherwise. This is calculated for each column and only the maximum is selected.
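
The core idea can be sketched in a few lines of NumPy (a simplified illustration only, not CaTabRa's actual implementation; the bin count and function names are invented for this example):

    import numpy as np

    def fit_bins(X_train, n_bins=10):
        # Compute bin edges per feature and remember which bins contain
        # no training instances.
        edges, empty = [], []
        for col in X_train.T:
            e = np.linspace(col.min(), col.max(), n_bins + 1)
            counts, _ = np.histogram(col, bins=e)
            edges.append(e)
            empty.append(counts == 0)
        return edges, empty

    def is_ood(x, edges, empty):
        # A sample is flagged as OOD if any of its values falls into an empty bin.
        for value, e, emp in zip(x, edges, empty):
            idx = np.clip(np.searchsorted(e, value, side="right") - 1, 0, len(emp) - 1)
            if emp[idx]:
                return True
        return False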

SoftBrownianOffset

Based on: F. Möller et al.: Out-of-distribution Detection and Generation using Soft Brownian Offset Sampling and Autoencoders. arXiv:2105.02965, May 2021. Accessed: Apr. 06, 2022.

Generates synthetic out-of-distribution samples by selecting a sample from the original dataset and transforming it iteratively until the point’s minimum distance to the dataset exceeds a set boundary. These synthetic samples are combined with the original data set in order to train a classifier that differentiates between in-distribution and out-of-distribution samples.

The test returns for each sample whether it is out-of-distribution. Values returned by predict_proba are the probabilities generated by the trained classifier.
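
The sampling step can be illustrated as follows (a strongly simplified sketch of the approach from the paper, not CaTabRa's implementation; d_min and the Gaussian step size are placeholder parameters):

    import numpy as np

    def soft_brownian_offset(X, d_min=0.5, step=0.1, rng=None):
        # Start from a random training sample and apply Gaussian offsets
        # until its minimum distance to the training set exceeds d_min.
        rng = rng or np.random.default_rng()
        x = X[rng.integers(len(X))].copy()
        while np.min(np.linalg.norm(X - x, axis=1)) < d_min:
            x += rng.normal(scale=step, size=x.shape)
        return x  # synthetic out-of-distribution sample

Repeating this procedure yields a set of synthetic OOD samples that, labeled accordingly, are mixed with the original data to train the binary classifier.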

Autoencoder

An autoencoder is a neural network that consists of an encoder and a decoder part. The encoder learns to reduce the input data to a lower-dimensional space. The decoder learns to reconstruct the original points from the compressed data. In the context of out-of-distribution detection it is assumed that in-distribution data results in a lower reconstruction error than out-of-distribution data. A sample can be defined as OOD if its reconstruction error is above a certain threshold. Refer to this example application.

The detector returns for each sample whether it is out-of-distribution. Values returned by predict_proba are the difference between the maximum reconstruction error on the validation set and the reconstruction error of the sample. This is calculated for each column and only the maximum is selected. Values are clipped to the range [0, 1].
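
A minimal sketch of this reconstruction-error approach, using scikit-learn's MLPRegressor as a stand-in autoencoder (CaTabRa's detector and its parameters differ; X_train, X_val and X_test are placeholder arrays):

    import numpy as np
    from sklearn.neural_network import MLPRegressor

    # An autoencoder learns to reproduce its input through a narrow hidden layer.
    ae = MLPRegressor(hidden_layer_sizes=(8, 2, 8), max_iter=2000)
    ae.fit(X_train, X_train)

    # Use the maximum reconstruction error on the validation set as threshold.
    val_err = np.mean((ae.predict(X_val) - X_val) ** 2, axis=1)
    threshold = val_err.max()

    # Samples reconstructed worse than anything in the validation set are OOD.
    test_err = np.mean((ae.predict(X_test) - X_test) ** 2, axis=1)
    ood_mask = test_err > threshold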

KSTest

Refer to: Comparing sample distributions with the Kolmogorov-Smirnov (KS) test.

The Kolmogorov-Smirnov test is a statistical test that answers the question “How likely is it that we would see two sets of samples like this if they were drawn from the same (but unknown) probability distribution?” It is applied separately to each column.

The detector returns for each column whether it is out-of-distribution. Values returned by predict_proba are 1 minus the p-value returned by the test. Generally, 0.05 is the threshold below which the p-value leads to rejecting the null hypothesis that both samples stem from the same distribution. To conform with the other detectors, where higher values suggest a higher likelihood of being OOD, the reversed value 0.95 (1 - 0.05) is chosen as the default threshold. This means only very high “proba” values indicate that a column is OOD.
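
A per-column check along these lines can be written with SciPy (a sketch only; X_train and X_new are placeholder DataFrames, and 0.95 is the default threshold described above):

    from scipy.stats import ks_2samp

    # Compare each column of the new data against the training data.
    for col in X_train.columns:
        stat, p = ks_2samp(X_train[col], X_new[col])
        if 1 - p > 0.95:  # equivalent to p < 0.05
            print(f"column {col!r} appears out-of-distribution (p={p:.4g})")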

PyOD

PyOD is a Python library for anomaly detection that includes well-established algorithms. Algorithms are generally only considered for inclusion if they were published at least two years ago, have 50+ citations, and are of demonstrated usefulness.

Many anomaly detection algorithms can double as OOD detectors if the expected percentage of outliers is set to a very low value (see the usage sketch after this list). Examples of such detectors are:

  • Isolation Forests (iforest)

  • Kernel Density Estimation (kde)

  • k-Nearest-Neighbours (knn)
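
Used directly, outside of CaTabRa's configuration, such a detector looks roughly like this (a sketch with placeholder data; contamination=0.01 encodes the very low expected percentage of outliers mentioned above):

    from pyod.models.iforest import IForest

    # A low contamination rate turns the anomaly detector into an OOD detector:
    # only samples far outside the training distribution are flagged.
    detector = IForest(contamination=0.01)
    detector.fit(X_train)

    labels = detector.predict(X_new)       # 1 = OOD, 0 = in-distribution
    proba = detector.predict_proba(X_new)  # per-sample [inlier, outlier] probabilities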