Utilities

IO

make_path(p: str | Path, absolute: bool = False) Path

Convert a path-like object into a proper path object, i.e., an instance of class Path.

Parameters:
  • p (str | Path) – Path-like object. If an instance of Path and absolute is False, p is returned unchanged.

  • absolute (bool, default=False) – Whether to make sure that the output is an absolute path. If False, the path may be relative.

Returns:

Path object.

Return type:

Path
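The conversion described above can be sketched with the standard library alone. The helper name make_path_sketch is hypothetical, and the use of Path.absolute() (rather than, say, Path.resolve()) is an assumption, not the library's confirmed behavior.

```python
from pathlib import Path
from typing import Union


def make_path_sketch(p: Union[str, Path], absolute: bool = False) -> Path:
    """Illustrative stand-in for make_path(); not the library's code."""
    if not isinstance(p, Path):
        p = Path(p)  # convert strings into Path objects
    if absolute:
        p = p.absolute()  # anchor relative paths at the current working directory
    return p


relative = make_path_sketch("data/table.csv")
absolute = make_path_sketch("data/table.csv", absolute=True)
unchanged = Path("data")
```

Passing an existing Path with absolute=False hands back the very same object, which mirrors the "returned unchanged" remark in the parameter description.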

read_df(fn: str | Path, key: str | Iterable[str] = 'table') DataFrame

Read a DataFrame from a CSV, Excel, HDF5, Pickle or Parquet file. The file type is determined from the file extension of the given file.

Parameters:
  • fn (str | Path) – The file to read.

  • key (str | Iterable[str], default='table') – The key(s) in the HDF5 file, if fn is an HDF5 file. Defaults to “table”. If an iterable, all keys are read and concatenated along the row axis.

Returns:

A DataFrame.

Return type:

DataFrame

read_dfs(fn: str | Path) Dict[str, DataFrame]

Read multiple DataFrames from a single file.

  • If an Excel file, all sheets are read and returned.

  • If an HDF5 file, all top-level keys are read and returned.

  • If any other file, the singleton dict {“table”: df} is returned, where df is the single DataFrame contained in the file.

Parameters:

fn (str | Path) – The file to read.

Returns:

A dict mapping keys to DataFrames, possibly empty.

Return type:

Dict[str, DataFrame]

write_df(df: DataFrame, fn: str | Path, key: str = 'table', mode: str = 'w')

Write a DataFrame to file. The file type is determined from the file extension of the given file.

Parameters:
  • df (DataFrame) – The DataFrame to write.

  • fn (str | Path) – The target file name.

  • key (str, default='table') – The key in the HDF5 file, if fn is an HDF5 file. If None, fn may contain only one table.

  • mode (str, default='w') – The mode in which the HDF5 file shall be opened, if fn is an HDF5 file. Ignored otherwise.

write_dfs(dfs: Dict[str, DataFrame], fn: str | Path, mode: str = 'w')

Write a dict of DataFrames to file. The file type is determined from the file extension of the given file. Unless fn is an Excel or HDF5 file, dfs must be empty or a singleton.

Parameters:
  • dfs (dict) – The DataFrames to write. If empty and mode differs from “a”, the file is deleted.

  • fn (str | Path) – The target file name.

  • mode (str, default='w') – The mode in which the file shall be opened, if fn is an Excel or HDF5 file. Ignored otherwise.

load(fn: str | Path)

Load a Python object from disk. The object can be stored in JSON, Pickle or joblib format. The format is automatically determined based on the given file extension:

  • “.json” => JSON

  • “.pkl”, “.pickle” => Pickle

  • “.joblib” => joblib

Parameters:

fn (str | Path) – The file to load.

Returns:

The loaded object.

Return type:

Any

dump(obj, fn: str | Path)

Dump a Python object to disk, either as a JSON, Pickle or joblib file. The format is determined automatically based on the given file extension:

  • “.json” => JSON

  • “.pkl”, “.pickle” => Pickle

  • “.joblib” => joblib

Parameters:
  • obj – The object to dump.

  • fn (str | Path) – The file.

Notes

When dumping objects as JSON, calling to_json() beforehand might be necessary to ensure compliance with the JSON standard. joblib is preferred over Pickle, as it is more efficient if the object contains large Numpy arrays.
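The extension-based dispatch used by load() and dump() can be illustrated as follows. This is a minimal sketch, not the library's code: the names dump_sketch and load_sketch are hypothetical, the joblib branch is left out to keep the example within the standard library, and raising ValueError for unknown extensions is an assumption.

```python
import json
import pickle
import tempfile
from pathlib import Path


def dump_sketch(obj, fn):
    """Choose a serializer from the file extension (JSON and Pickle only)."""
    fn = Path(fn)
    suffix = fn.suffix.lower()
    if suffix == ".json":
        with open(fn, "w", encoding="utf-8") as f:
            json.dump(obj, f)
    elif suffix in (".pkl", ".pickle"):
        with open(fn, "wb") as f:
            pickle.dump(obj, f)
    else:
        # Assumption: unsupported extensions are rejected.
        raise ValueError(f"Unsupported file extension: {suffix!r}")


def load_sketch(fn):
    """Inverse of dump_sketch(), using the same extension rules."""
    fn = Path(fn)
    suffix = fn.suffix.lower()
    if suffix == ".json":
        with open(fn, "r", encoding="utf-8") as f:
            return json.load(f)
    if suffix in (".pkl", ".pickle"):
        with open(fn, "rb") as f:
            return pickle.load(f)
    raise ValueError(f"Unsupported file extension: {suffix!r}")


with tempfile.TemporaryDirectory() as tmp:
    config = {"metric": "accuracy", "folds": 5}
    dump_sketch(config, Path(tmp) / "config.json")
    restored_json = load_sketch(Path(tmp) / "config.json")
    dump_sketch(config, Path(tmp) / "config.pkl")
    restored_pickle = load_sketch(Path(tmp) / "config.pkl")
```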

to_json(x)

Returns a JSON-compliant representation of the given object.

Parameters:

x – Arbitrary object.

Returns:

Representation of x that can be serialized as JSON.

Return type:

Any

convert_rows_to_str(d: dict | DataFrame, rowindex_to_convert: list, inplace: bool = True, skip: list = []) dict | DataFrame

Convert rows (indexed via rowindex_to_convert) to str. Mainly used when saving DataFrames, to avoid missing values in .xlsx files for data types such as timedelta.

Parameters:
  • d (dict | DataFrame) – Single DataFrame or dictionary of dataframes

  • rowindex_to_convert (list) – List of row indices (e.g., features), that should be converted to str

  • inplace (bool, default=True) – Determines if changes will be made to input data or a deep-copy of it

  • skip (list, default=[]) – List of column(s) that should not be converted to string

Returns:

Modified (str-converted rows) single DataFrame or dictionary of DataFrames.

Return type:

DataFrame | dict

class CaTabRaLoader(path: str | Path, check_exists: bool = True)

Bases: object

CaTabRaLoader for conveniently accessing artifacts generated by analyzing tables, like trained models, configs, encoders, etc.

Parameters:
  • path (str | Path) – Path to the CaTabRa directory.

  • check_exists (bool, default=True) – Check whether the directory pointed to by path exists.

get_fitted_ensemble(from_model: bool = False) FittedEnsemble | None

Get the trained prediction model as a FittedEnsemble object.

Parameters:

from_model (bool, default=False) – Whether to convert a plain model of type AutoMLBackend into a FittedEnsemble object, if such an object does not exist in the directory.

get_explainer(explainer: str | None = None, fitted_ensemble: FittedEnsemble | None = None) EnsembleExplainer | None

Get the explainer object.

Parameters:
  • explainer (str, optional) – Name of the explainer to load. If None, the first explainer specified in config param “explainer” is loaded.

  • fitted_ensemble (FittedEnsemble) – Pre-loaded FittedEnsemble object. If None, method get_fitted_ensemble() is used for loading it.

get_train_data() DataFrame | None

Get the training data copied into the directory, “train_data.h5”. In contrast to get_table(), this is only the data actually used for training.

get_table(keep_singleton: bool = False) DataFrame | List[DataFrame] | None

Get the table(s) originally passed to analyze(), if they still reside in their original location.

Parameters:

keep_singleton (bool, default=False) – Whether to keep singleton lists. If False, a single DataFrame is returned in that case.

Logging

prompt(msg: str, accepted: List[str] | None = None, allow_headless: bool = True) str

Prompt the user for input.

Parameters:
  • msg (str) – The message to be printed.

  • accepted (list, optional) – List of accepted inputs. Must be lower-case. If None, all inputs are accepted.

  • allow_headless (bool, default=True) – What to do in headless mode. If True, the first element in accepted is returned if accepted is a list and “” is returned if accepted is None. If False, a RuntimeError is raised.

Returns:

The input of the user, an element of accepted if accepted is a list, or arbitrary if accepted is None.

Return type:

str

progress_bar(iterable, desc: str | None = None, total: int | None = None, disable: bool = False, meter_width: int = 40)

Show a simple progress bar when iterating over a given iterable. This works similarly to the tqdm package but, in contrast to tqdm, also works when mirroring messages to a file.

Parameters:
  • iterable – The iterable.

  • desc (str, optional) – Description to add to the beginning of the progress bar.

  • total (int, optional) – Total number of elements in iterable if iterable does not implement the __len__() method.

  • disable (bool, default=False) – Whether to disable the progress bar. If True, the behavior is equivalent to not calling this function at all.

  • meter_width (int, default=40) – The width of the meter, in characters. Should not be too long to make the whole progress bar fit into a single line. Might have to be decreased if desc is a long text.

class LogMirror(log_path: str, mode: str = 'w')

Bases: object

Used to temporarily mirror both stderr and stdout to a log file. Based on [1] and [2].

Examples

>>> with LogMirror("log.txt"):
...     log("writing to log.txt and the console")
...     err("works with errors as well")
...     warn("and in case you need warnings")
...     print("no need to use the custom log functions")

References

Common

fresh_name(name, lst: Iterable)

Create a fresh name based on name, i.e., a name that does not appear in lst.

Parameters:
  • name – An arbitrary object. If a list, tuple or set, all elements of name are processed individually, and they are guaranteed to be distinct from each other.

  • lst (Iterable) – A list-like structure.

Returns:

If name does not appear in lst, name is returned as-is. Otherwise, a numeric suffix is added to the string representation of name.

Return type:

Any
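The suffixing rule can be sketched for the single-name case as follows. The function name fresh_name_sketch and the exact suffix format ("_1", "_2", ...) are assumptions made for illustration; the documentation only states that a numeric suffix is added.

```python
def fresh_name_sketch(name, lst):
    """Return name if unused, else name plus the first free numeric suffix."""
    taken = list(lst)
    if name not in taken:
        return name  # already fresh, returned as-is
    i = 1
    # Assumed suffix format: "<name>_<i>"
    while f"{name}_{i}" in taken:
        i += 1
    return f"{name}_{i}"


columns = ["id", "value", "value_1"]
fresh_unused = fresh_name_sketch("label", columns)
fresh_clash = fresh_name_sketch("value", columns)
```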

repr_list(lst: list | tuple, limit: int | None = 50, delim: str = ', ', brackets: bool = True) str

Return a string representation of some list, limiting the displayed items to a certain number.

Parameters:
  • lst (list | tuple) – The list.

  • limit (int, default=50) – The maximum number of displayed items.

  • delim (str, default=', ') – The item delimiter.

  • brackets (bool, default=True) – Whether to add brackets.

Returns:

String representation of lst.

Return type:

str

repr_timedelta(delta, subsecond_resolution: int = 0) str

Return a string representation of some time delta. Minutes and seconds are always displayed, hours and days only if needed. Format is “d days hh:mm:ss”.

Parameters:
  • delta – Time delta to represent, either a float or an object with a total_seconds() method (e.g., a pandas Timedelta instance). Floats are assumed to be given in seconds.

  • subsecond_resolution (int, default=0) – The subsecond resolution to display, i.e., number of decimal places.

Returns:

String representation of delta.

Return type:

str
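A possible reading of the “d days hh:mm:ss” format is sketched below. The name repr_timedelta_sketch is hypothetical, and details such as zero-padding and the handling of negative values are assumptions; the library's actual output may differ.

```python
import datetime


def repr_timedelta_sketch(delta, subsecond_resolution: int = 0) -> str:
    """Format a duration; hours and days appear only when needed."""
    seconds = delta.total_seconds() if hasattr(delta, "total_seconds") else float(delta)
    sign = "-" if seconds < 0 else ""
    # Round first so that, e.g., 59.7 s carries over into the minutes.
    seconds = round(abs(seconds), subsecond_resolution)
    days, rem = divmod(seconds, 86400)
    hours, rem = divmod(rem, 3600)
    minutes, secs = divmod(rem, 60)
    width = 2 if subsecond_resolution == 0 else 3 + subsecond_resolution
    out = f"{int(minutes):02d}:{secs:0{width}.{subsecond_resolution}f}"
    if days > 0 or hours > 0:
        out = f"{int(hours):02d}:{out}"
    if days > 0:
        out = f"{int(days)} days {out}"
    return sign + out


short = repr_timedelta_sketch(75)
with_hours = repr_timedelta_sketch(3661)
with_days = repr_timedelta_sketch(90061.5, subsecond_resolution=1)
from_object = repr_timedelta_sketch(datetime.timedelta(minutes=2))
```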

Plotting

save(fig, fn: str | Path, png: bool = False)

Save a figure or a dict of figures to disk.

Parameters:
  • fig – The figure(s) to save. May be a Matplotlib figure object, a plotly figure object, or a dict whose values are such figure objects.

  • fn (str | Path) – The file or directory. It is recommended to leave the file extension unspecified and simply pass “/path/to/figure” instead of “/path/to/figure.png”. The file extension is then determined automatically depending on the type of fig and on the value of png. If fig is a dict, fn refers to the parent directory.

  • png (bool, default=False) – Whether to save Matplotlib figures as PNG or as PDF. Ignored if a file extension is specified in fn or if fig is a plotly figure, which are always saved as HTML.

Metrics

class averageable(func, accepts_global: bool = False)

Bases: _OperatorBase

Return an averageable variant of a given metric, i.e., a function that accepts parameter average with possible values None, “binary”, “micro”, “macro”, “weighted”, “samples” and, optionally, “global”. Apart from “binary” and “global”, averaging is taken care of by the new metric; the original metric only needs to handle binary classification tasks.

Parameters:
  • func (callable) – Metric to make averageable, callable that accepts y_true and y_pred and returns a scalar value.

  • accepts_global (bool, optional) – What to do if average is set to “global”: if true, func is simply called on the provided arguments; otherwise, a ValueError is raised.

class no_average(func)

Bases: _OperatorBase

Return the “no-average” variant of a given classification metric, i.e., a new metric that returns class- or label-wise results, without any averaging.

Parameters:

func (callable) – Base metric, callable that accepts y_true and y_pred and returns a scalar value.

Notes

This corresponds to metric(…, average=None).

class micro_average(func)

Bases: _OperatorBase

Return the micro-averaged variant of a given classification metric, i.e., a new metric that returns micro-averaged results.

Micro-averaging amounts to counting the total number of true and false positives and negatives across all classes, and computing the metric value with respect to these numbers.

Parameters:

func (callable) – Base metric, callable that accepts y_true and y_pred and returns a scalar value.

Notes

This corresponds to metric(…, average="micro").

class macro_average(func)

Bases: _OperatorBase

Return the macro-averaged variant of a given classification metric, i.e., a new metric that returns macro-averaged results.

Macro-averaging amounts to computing the metric value for each class/label individually, and then returning the unweighted mean of these values.

Parameters:

func (callable) – Base metric, callable that accepts y_true and y_pred and returns a scalar value.

Notes

This corresponds to metric(…, average="macro").

class weighted_average(func)

Bases: _OperatorBase

Return the weighted-averaged variant of a given classification metric, i.e., a new metric that returns weighted-averaged results.

Weighted-averaging amounts to computing the metric value for each class/label individually, and then returning the weighted mean of these values. Weights correspond to class/label support.

Parameters:

func (callable) – Base metric, callable that accepts y_true and y_pred and returns a scalar value.

Notes

This corresponds to metric(…, average="weighted").
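To make the difference between the no-average, micro, macro and weighted variants concrete, here is a small worked example that computes precision by hand from a confusion matrix. The numbers are made up for illustration, and the code does not use the classes documented here.

```python
# Confusion matrix of a hypothetical 3-class problem
# (rows = ground truth, columns = predictions).
cm = [
    [8, 1, 1],
    [2, 2, 1],
    [0, 1, 4],
]
n = len(cm)
tp = [cm[i][i] for i in range(n)]
fp = [sum(cm[r][i] for r in range(n)) - tp[i] for i in range(n)]
support = [sum(row) for row in cm]  # number of true samples per class


def precision(tp_, fp_):
    return tp_ / (tp_ + fp_) if tp_ + fp_ > 0 else 0.0


# No averaging: one value per class.
per_class = [precision(tp[i], fp[i]) for i in range(n)]
# Micro: pool the counts over all classes first.
micro = precision(sum(tp), sum(fp))
# Macro: unweighted mean of the class-wise values.
macro = sum(per_class) / n
# Weighted: mean of the class-wise values, weighted by class support.
weighted = sum(p * s for p, s in zip(per_class, support)) / sum(support)
```

Here the class-wise precisions are 0.8, 0.5 and 2/3, so the three averages come out at 0.7 (micro), about 0.656 (macro) and about 0.692 (weighted).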

class samples_average(func)

Bases: _OperatorBase

Return the samples-averaged variant of a given classification metric, i.e., a new metric that returns samples-averaged results.

Samples-averaging amounts to computing the metric value for each sample individually, and then returning the (weighted) mean of these values.

Parameters:

func (callable) – Base metric, callable that accepts y_true and y_pred and returns a scalar value.

Notes

This corresponds to metric(…, average="samples"), and is only defined for multilabel tasks.

to_score(func, errors: str = 'ignore')

Convenience function for converting a metric into a (possibly different) metric that returns scores (i.e., higher values correspond to better results). That means, if the given metric returns scores already, it is returned unchanged. Otherwise, it is negated.

Parameters:
  • func (callable) – The metric to convert, e.g., accuracy, balanced_accuracy, etc. Note that in case of classification metrics, both thresholded and non-thresholded metrics are accepted.

  • errors (str, default="ignore") –

    What to do if the polarity of func cannot be determined:

    • “ignore”: return func.

    • “negate”: return -func.

    • “raise”: raise a ValueError.

Return type:

Either func itself or -func.
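The polarity logic can be sketched as below. How the library actually determines whether a metric is a score or a loss is not stated here; the two name registries, the helper to_score_sketch and the two toy metrics are hypothetical stand-ins.

```python
# Hypothetical registries of metrics with known polarity.
HIGHER_IS_BETTER = {"accuracy"}
LOWER_IS_BETTER = {"mean_squared_error"}


def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_squared_error(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def to_score_sketch(func, errors="ignore"):
    """Return func if it is a score, its negation if it is a loss."""
    name = getattr(func, "__name__", "")

    def negated(*args, **kwargs):
        return -func(*args, **kwargs)

    if name in HIGHER_IS_BETTER:
        return func
    if name in LOWER_IS_BETTER:
        return negated
    # Polarity unknown: follow the documented `errors` policy.
    if errors == "ignore":
        return func
    if errors == "negate":
        return negated
    raise ValueError(f"Cannot determine polarity of {name!r}")


score_mse = to_score_sketch(mean_squared_error)
neg_mse = score_mse([1, 2], [1, 4])
```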

get(name)

Retrieve a metric function given by its name.

Parameters:

name (str | callable) –

The name of the requested metric function. It must be of the form

“name [@ threshold] [(agg : n_reps)]”

where name is the name of a recognized metric and the threshold and agg/n_reps parts are optional.

If threshold is specified, name must be the name of a thresholded classification metric (e.g., “accuracy”) and threshold must be either a specific numerical threshold or the name of a thresholding strategy; see function thresholded() for details.

If agg and n_reps are specified, the bootstrapped metric with n_reps repetitions and aggregation agg is returned.

If both a threshold and bootstrapping are specified, the threshold must be specified first.

Note that some synonyms are recognized as well, most notably “precision” for “positive_predictive_value” and “recall” for “sensitivity”.

Return type:

Metric function (callable).
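The name syntax can be illustrated with a small parser. This is only a sketch of the grammar described above; the regular expression and the function parse_metric_spec are hypothetical and not part of the library, and the metric lookup itself is not shown.

```python
import re

# name [@ threshold] [(agg : n_reps)], with the threshold part first
_SPEC = re.compile(
    r"^\s*(?P<name>[A-Za-z_]\w*)"
    r"\s*(?:@\s*(?P<threshold>[^()\s]+))?"
    r"\s*(?:\(\s*(?P<agg>\w+)\s*:\s*(?P<n_reps>\d+)\s*\))?\s*$"
)


def parse_metric_spec(spec: str) -> dict:
    """Split a metric specification into its optional parts."""
    m = _SPEC.match(spec)
    if m is None:
        raise ValueError(f"Invalid metric specification: {spec!r}")
    threshold = m.group("threshold")
    if threshold is not None:
        try:
            threshold = float(threshold)  # numerical threshold
        except ValueError:
            pass  # otherwise keep the name of a thresholding strategy
    n_reps = m.group("n_reps")
    return {
        "name": m.group("name"),
        "threshold": threshold,
        "agg": m.group("agg"),
        "n_reps": int(n_reps) if n_reps is not None else None,
    }


full = parse_metric_spec("accuracy @ 0.5 (mean : 100)")
strategy = parse_metric_spec("accuracy @ balance")
plain = parse_metric_spec("roc_auc")
```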

bootstrapped(func, n_repetitions: int = 100, agg='mean', seed=None, replace: bool = True, size: int | float = 1.0, **kwargs)

Convenience function for converting a metric into its bootstrapped version.

Parameters:
  • func (callable) – The metric to convert, e.g., roc_auc, accuracy, mean_squared_error, etc.

  • n_repetitions (int, default=100) – Number of bootstrapping repetitions to perform. If 0, func is returned unchanged.

  • agg (str | callable, default='mean') – Aggregation to compute of bootstrapping results.

  • seed (int, optional) – Random seed.

  • replace (bool, default=True) – Whether to resample with replacement. If False, this does not actually correspond to bootstrapping.

  • size (int | float, default=1.) – The size of the resampled data. If <= 1, it is multiplied by the number of samples in the given data. Bootstrapping normally assumes that resampled data have the same number of samples as the original data, so this parameter should be set to 1.

  • **kwargs – Additional keyword arguments that are passed to func upon application. Note that only arguments that do not need to be resampled can be passed here; in particular, this excludes sample_weight.

Returns:

New metric that, when applied to y_true and y_hat, resamples the data, evaluates the metric on each resample, and returns some aggregation (typically average) of the results thus obtained.
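The resample-evaluate-aggregate loop can be sketched with the standard library. The helper bootstrapped_sketch is a simplified, hypothetical stand-in: it always resamples with replacement at the original size and ignores the replace, size and **kwargs parameters.

```python
import random
import statistics


def accuracy(y_true, y_hat):
    return sum(t == p for t, p in zip(y_true, y_hat)) / len(y_true)


def bootstrapped_sketch(func, n_repetitions=100, agg=statistics.mean, seed=None):
    """Wrap func so that it is evaluated on bootstrap resamples."""
    if n_repetitions == 0:
        return func  # documented special case: nothing to bootstrap

    def metric(y_true, y_hat):
        rng = random.Random(seed)
        n = len(y_true)
        results = []
        for _ in range(n_repetitions):
            # Draw n indices with replacement; resample both arrays jointly.
            idx = [rng.randrange(n) for _ in range(n)]
            results.append(func([y_true[i] for i in idx], [y_hat[i] for i in idx]))
        return agg(results)

    return metric


y_true = [0, 1, 1, 0, 1]
y_hat = [0, 1, 0, 0, 1]
boot_accuracy = bootstrapped_sketch(accuracy, n_repetitions=200, seed=42)
estimate = boot_accuracy(y_true, y_hat)
```

Because y_true and y_hat are indexed with the same resampled indices, any additional per-sample argument would have to be resampled in the same way, which is why sample_weight cannot simply be forwarded through **kwargs.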

balance_score_threshold(y_true, y_score, sample_weight: ndarray | None = None) Tuple[float, float]

Compute the balance score and balance threshold of a binary classification problem.

Parameters:
  • y_true (array-like) – Ground truth, with 0 representing the negative class and 1 representing the positive class. Must not contain NaN.

  • y_score (array-like) – Predicted scores, i.e., the higher a score the more confident the model is that the sample belongs to the positive class. Range is arbitrary.

  • sample_weight (array-like, optional) – Sample weights.

Returns:

  • balance_score (float) – Sensitivity at balance_threshold, which by definition is approximately equal to specificity and can furthermore be shown to be approximately equal to accuracy and balanced accuracy, too.

  • balance_threshold (float) – Decision threshold that minimizes the difference between sensitivity and specificity, i.e., it is defined as

    \[\arg\min_t \left| \text{sensitivity}(y_\text{true}, y_\text{score} \geq t) - \text{specificity}(y_\text{true}, y_\text{score} \geq t) \right|\]
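A brute-force version of this definition, without sample weights, might look as follows. The function names are hypothetical, the tie-breaking rule (first minimum wins) is an assumption, and the library may use a more efficient algorithm.

```python
def sens_spec(y_true, y_score, t):
    """Sensitivity and specificity when predicting positive iff score >= t."""
    tp = sum(1 for y, s in zip(y_true, y_score) if y == 1 and s >= t)
    tn = sum(1 for y, s in zip(y_true, y_score) if y == 0 and s < t)
    n_pos = sum(y_true)
    return tp / n_pos, tn / (len(y_true) - n_pos)


def balance_score_threshold_sketch(y_true, y_score):
    best = None
    for t in sorted(set(y_score)):  # try every observed score as threshold
        sens, spec = sens_spec(y_true, y_score, t)
        gap = abs(sens - spec)
        if best is None or gap < best[0]:
            best = (gap, sens, t)
    return best[1], best[2]


y_true = [0, 0, 0, 1, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8, 0.3, 0.7]
balance_score, balance_threshold = balance_score_threshold_sketch(y_true, y_score)
```

At the threshold 0.4, sensitivity and specificity both equal 2/3, so the gap is zero and the balance score is 2/3.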

prevalence_score_threshold(y_true, y_score, sample_weight: ndarray | None = None) Tuple[float, float]

Compute the prevalence score and prevalence threshold of a binary classification problem.

Parameters:
  • y_true (array-like) – Ground truth, with 0 representing the negative class and 1 representing the positive class. Must not contain NaN.

  • y_score (array-like) – Predicted scores, i.e., the higher a score the more confident the model is that the sample belongs to the positive class. Range is arbitrary.

  • sample_weight (array-like, optional) – Sample weights.

Returns:

  • prevalence_score (float) – Sensitivity at prevalence_threshold, which can be shown to be approximately equal to positive predictive value and F1-score.

  • prevalence_threshold (float) – Decision threshold that minimizes the difference between the number of positive samples in y_true (m) and the number of predicted positives. In other words, the threshold is set to the m-th largest value in y_score. If sample_weight is given, the threshold minimizes the difference between the total weight of all positive samples and the total weight of all samples predicted positive.
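For the unweighted case, the "m-th largest value" rule can be written down directly. The function name is hypothetical and the sketch assumes at least one positive sample; it also shows why sensitivity and positive predictive value coincide when the number of predicted positives equals m.

```python
def prevalence_score_threshold_sketch(y_true, y_score):
    m = sum(y_true)  # number of positive samples
    threshold = sorted(y_score, reverse=True)[m - 1]  # m-th largest score
    y_pred = [s >= threshold for s in y_score]
    tp = sum(1 for y, p in zip(y_true, y_pred) if y == 1 and p)
    sensitivity = tp / m
    ppv = tp / sum(y_pred)
    return sensitivity, threshold, ppv


y_true = [0, 0, 0, 1, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8, 0.3, 0.7]
prevalence_score, prevalence_threshold, ppv = prevalence_score_threshold_sketch(y_true, y_score)
```

With three positives, the threshold is the third-largest score (0.4). Three samples are predicted positive, two of them correctly, so sensitivity and positive predictive value are both 2/3.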

zero_one_threshold(y_true, y_score, sample_weight: ndarray | None = None, specificity_weight: float = 1.0) float

Compute the threshold corresponding to the (0,1)-criterion [1] of a binary classification problem. Although the (0,1)-criterion is a popular strategy for selecting decision thresholds, [1] advocates maximizing informedness (aka Youden index) instead, which is equivalent to maximizing balanced accuracy.

Parameters:
  • y_true (array-like) – Ground truth, with 0 representing the negative class and 1 representing the positive class. Must not contain NaN.

  • y_score (array-like) – Predicted scores, i.e., the higher a score the more confident the model is that the sample belongs to the positive class. Range is arbitrary.

  • sample_weight (array-like, optional) – Sample weights.

  • specificity_weight (float, default=1.) – The relative weight of specificity with respect to sensitivity. 1 means that sensitivity and specificity are weighted equally, a value < 1 means that sensitivity is weighted more strongly than specificity, and a value > 1 means that specificity is weighted more strongly than sensitivity. See the formula below for details.

Returns:

threshold – Decision threshold that minimizes the Euclidean distance between the point (0, 1) and the point (1 - specificity, sensitivity), i.e.,

    \[\arg\min_t \; (1 - \text{sensitivity}(y_\text{true}, y_\text{score} \geq t))^2 + \text{specificity\_weight} \cdot (1 - \text{specificity}(y_\text{true}, y_\text{score} \geq t))^2\]

Return type:

float
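The criterion can be evaluated by brute force over all observed scores. This unweighted sketch uses a hypothetical function name and is meant to show the effect of specificity_weight, not to reproduce the library's implementation.

```python
def zero_one_threshold_sketch(y_true, y_score, specificity_weight=1.0):
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos

    def cost(t):
        tp = sum(1 for y, s in zip(y_true, y_score) if y == 1 and s >= t)
        tn = sum(1 for y, s in zip(y_true, y_score) if y == 0 and s < t)
        sens, spec = tp / n_pos, tn / n_neg
        # (Weighted) squared distance to the ideal point (0, 1) in ROC space.
        return (1 - sens) ** 2 + specificity_weight * (1 - spec) ** 2

    return min(sorted(set(y_score)), key=cost)


y_true = [0, 0, 0, 1, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8, 0.3, 0.7]
t_equal = zero_one_threshold_sketch(y_true, y_score)
t_sensitive = zero_one_threshold_sketch(y_true, y_score, specificity_weight=0.1)
```

With equal weights the threshold 0.7 is selected (sensitivity 2/3, specificity 1). Lowering specificity_weight to 0.1 favors sensitivity and moves the threshold down to 0.3 (sensitivity 1, specificity 1/3).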

References

argmax_score_threshold(func, y_true, y_score, sample_weight: ndarray | None = None, discretize=100, **kwargs) Tuple[float, float]

Compute the decision threshold that maximizes a given binary classification metric or callable. Since in most built-in classification metrics larger values indicate better results, there is no analogous argmin_score_threshold().

Parameters:
  • func (callable) – The metric or function to maximize. If a string, function get() is called on it.

  • y_true (array-like) – Ground truth, with 0 representing the negative class and 1 representing the positive class. Must not contain NaN.

  • y_score (array-like) – Predicted scores, i.e., the higher a score the more confident the model is that the sample belongs to the positive class. Range is arbitrary.

  • sample_weight (array-like, optional) – Sample weights.

  • discretize (int, default=100) – Discretization steps for limiting the number of calls to func. If None, no discretization happens, i.e., all unique values in y_score are tried.

  • **kwargs – Additional keyword arguments passed to func.

Returns:

  • score (float) – Value of func at threshold.

  • threshold (float) – Decision threshold that maximizes func, i.e.,

    \[\arg\max_t \; \text{func}(y_\text{true}, y_\text{score} \geq t).\]
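Without discretization, the search amounts to trying every unique score as a threshold. The following sketch, with hypothetical names and no sample weights, illustrates this for accuracy; ties are resolved in favor of the smallest threshold, which is an assumption.

```python
def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def argmax_score_threshold_sketch(func, y_true, y_score):
    best_score, best_t = None, None
    for t in sorted(set(y_score)):
        y_pred = [int(s >= t) for s in y_score]  # binarize at threshold t
        score = func(y_true, y_pred)
        if best_score is None or score > best_score:
            best_score, best_t = score, t
    return best_score, best_t


y_true = [0, 0, 0, 1, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8, 0.3, 0.7]
score, threshold = argmax_score_threshold_sketch(accuracy, y_true, y_score)
```

The discretize parameter of the documented function limits how many thresholds are tried; this sketch corresponds to discretize=None.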

get_thresholding_strategy(name: str)

Retrieve a thresholding strategy for binary classification, given by its name.

Parameters:

name (str) – The name of the thresholding strategy, like “balance”, “prevalence” or “zero_one”.

Returns:

Thresholding strategy (callable) that can be applied to y_true, y_score and sample_weight, and that returns a single scalar threshold.

calibration_curve(y_true, y_score, sample_weight: ndarray | None = None, thresholds: ndarray | None = None) Tuple[ndarray, ndarray]

Compute the calibration curve of a binary classification problem. The predicted class probabilities are binned and, for each bin, the fraction of positive samples is determined. These fractions can then be plotted against the midpoints of the respective bins. Ideally, the resulting curve will be monotonically increasing.

Parameters:
  • y_true (array-like) – Ground truth, array of shape (n,) with values 0 or 1. Values must not be NaN.

  • y_score (array-like) – Predicted probabilities of the positive class, array of shape (n,) with arbitrary non-NaN values; in particular, the values do not necessarily need to correspond to probabilities or confidences.

  • sample_weight (array-like, optional) – Sample weight.

  • thresholds (array-like, optional) – The thresholds used for binning y_score. If None, suitable thresholds are determined automatically.

Returns:

  • fractions (ndarray) – Fractions of positive samples in each bin defined by thresholds, array of shape (m - 1,). Note that the i-th bin corresponds to the half-open interval [thresholds[i], thresholds[i + 1]) if i < m - 2, and to the closed interval [thresholds[i], thresholds[i + 1]] otherwise (in other words: the last bin is closed).

  • thresholds (ndarray) – Thresholds, array of shape (m,).
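The binning rule (half-open bins, with the last bin closed) can be sketched for explicitly given thresholds. The function name is hypothetical, sample weights are omitted, and returning NaN for empty bins is an assumption.

```python
def calibration_curve_sketch(y_true, y_score, thresholds):
    """Fraction of positives per bin; only the last bin includes its upper edge."""
    fractions = []
    last = len(thresholds) - 2
    for i in range(len(thresholds) - 1):
        lo, hi = thresholds[i], thresholds[i + 1]
        in_bin = [
            y for y, s in zip(y_true, y_score)
            if lo <= s and (s < hi or (i == last and s == hi))
        ]
        # Assumption: empty bins are reported as NaN.
        fractions.append(sum(in_bin) / len(in_bin) if in_bin else float("nan"))
    return fractions


y_true = [0, 0, 1, 1, 1]
y_score = [0.1, 0.4, 0.5, 0.9, 1.0]
fractions = calibration_curve_sketch(y_true, y_score, [0.0, 0.5, 1.0])
```

The score 0.5 falls into the second bin, since the first bin is [0.0, 0.5), while the score 1.0 is still counted because the last bin [0.5, 1.0] is closed.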

roc_pr_curve(y_true, y_score, *, pos_label: int | str | None = None, sample_weight: ndarray | None = None, drop_intermediate: bool = True) Tuple[ndarray, ndarray, ndarray, ndarray, ndarray, ndarray]

Convenience function for computing ROC and precision-recall curves simultaneously, with only one call to function _binary_clf_curve().

Parameters:
  • y_true (array-like) – Same as in sklearn.metrics.roc_curve() and sklearn.metrics.precision_recall_curve().

  • y_score (array-like) – Same as in sklearn.metrics.roc_curve() and sklearn.metrics.precision_recall_curve().

  • pos_label (int | str, optional) – Same as in sklearn.metrics.roc_curve() and sklearn.metrics.precision_recall_curve().

  • sample_weight (array-like, optional) – Same as in sklearn.metrics.roc_curve() and sklearn.metrics.precision_recall_curve().

  • drop_intermediate (bool, default=True) – Same as in sklearn.metrics.roc_curve().

Returns:

6-tuple (fpr, tpr, thresholds_roc, precision, recall, thresholds_pr), i.e., the concatenation of the return values of functions sklearn.metrics.roc_curve() and sklearn.metrics.precision_recall_curve().

See also

sklearn.metrics.roc_curve, sklearn.metrics.precision_recall_curve

Notes

The output of this function may differ from the output of sklearn.metrics.roc_curve and sklearn.metrics.precision_recall_curve, because the implementation of the latter changed over time. For instance, early versions of scikit-learn set the first threshold in the output of roc_curve to 1 + the second threshold, whereas later this was changed to +inf. Similarly, early versions of precision_recall_curve only returned precision and recall until full recall was attained, whereas more recent versions return precision and recall for all thresholds.

get_thresholds(y: ndarray, n_max: int = 100, add_half_one: bool | None = None, ensure: list | None = None, sample_weight: ndarray | None = None) list

Return equally-spaced thresholds for a given array of classification scores or class probabilities.

Parameters:
  • y (array-like) – Values used for determining the thresholds, typically (but not necessarily) the scores or class probabilities returned by a binary classification model. Must be a 1D-array of floats; may contain NaN and infinite values, which are tacitly ignored.

  • n_max (int, default=100) – Maximum number of thresholds to return. Note that ensure takes precedence over this parameter, i.e., if ensure is given, the output may contain more than n_max elements.

  • add_half_one (bool, optional) – Ensure 0.5 and 1.0 in the resulting list of thresholds. If None, 0.5 and 1.0 are added iff all elements of y are in the [0, 1] interval, i.e., correspond to class probabilities.

  • ensure (list, optional) – Thresholds to ensure. If given, all of its elements appear in the final list.

  • sample_weight (array-like, optional) – Sample weights. Thresholds are chosen such that the total sample weights in each bin are roughly equal.

Returns:

Thresholds, ascending list of floats with length >= 2.

Return type:

list of float

multiclass_proba_to_pred(y) ndarray

Translate multiclass class probabilities into actual predictions, by returning the class with the highest probability. If two or more classes have the same highest probabilities, the last one is returned. This behavior is consistent with binary classification problems, where the positive class is returned if both classes have equal probabilities and the default threshold of 0.5 is used.

Parameters:

y (array-like) – Class probabilities, of shape (n_classes,) or (n_samples, n_classes). The values of y can be arbitrary, they don’t need to be between 0 and 1. n_classes must be >= 1.

Return type:

Predicted class indices, either single integer or array of shape (n_samples,).
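The tie-breaking rule (the last of several equally large entries wins) can be sketched in plain Python. The names below are hypothetical, and the sketch works on nested lists rather than arrays.

```python
def last_argmax(row):
    """Index of the maximum; among ties, the last index is returned."""
    best = 0
    for i, v in enumerate(row):
        if v >= row[best]:  # ">=" lets later ties overwrite earlier ones
            best = i
    return best


def multiclass_proba_to_pred_sketch(y):
    if y and isinstance(y[0], (list, tuple)):
        return [last_argmax(row) for row in y]  # shape (n_samples, n_classes)
    return last_argmax(y)  # shape (n_classes,)


single = multiclass_proba_to_pred_sketch([0.2, 0.5, 0.5])
batch = multiclass_proba_to_pred_sketch([[0.7, 0.2, 0.1], [0.5, 0.5]])
```

For the row [0.5, 0.5] the result is class 1, which matches the binary convention of predicting the positive class when its probability is at least 0.5.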

class thresholded(func, threshold: float | str = 0.5, **kwargs)

Bases: _OperatorBase

Convenience class for converting a classification metric that can only be applied to class predictions into a metric that can be applied to probabilities. This proceeds by specifying a fixed decision threshold.

Parameters:
  • func (callable) – The metric to convert, e.g., accuracy, balanced_accuracy, etc.

  • threshold (float | str, default=0.5) – The decision threshold. In binary classification this can also be the name of a thresholding strategy that is accepted by function get_thresholding_strategy().

  • **kwargs – Additional keyword arguments that are passed to func upon application.

Returns:

New metric that, when applied to y_true and y_score, returns func(y_true, y_score >= threshold) in case of binary or multilabel classification, and func(y_true, multiclass_proba_to_pred(y_score)) in case of multiclass classification.
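For the binary case with a numerical threshold, the wrapping step can be sketched as a closure. The function thresholded_sketch is hypothetical; threshold names such as “balance” and the multiclass branch are not covered.

```python
def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def thresholded_sketch(func, threshold=0.5, **kwargs):
    """Turn a metric on class predictions into a metric on scores."""
    def metric(y_true, y_score):
        y_pred = [int(s >= threshold) for s in y_score]  # binarize the scores
        return func(y_true, y_pred, **kwargs)
    return metric


y_true = [0, 1, 1, 0]
y_score = [0.2, 0.6, 0.5, 0.7]
acc_default = thresholded_sketch(accuracy)(y_true, y_score)
acc_strict = thresholded_sketch(accuracy, threshold=0.65)(y_true, y_score)
```

With the default threshold of 0.5 three of four samples are classified correctly; raising the threshold to 0.65 leaves only one correct prediction.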

classmethod make(func, threshold: float | str = 0.5, **kwargs)

Convenience function for converting a classification metric into its “thresholded” version IF NECESSARY. That means, if the given metric can be applied to class probabilities, it is returned unchanged. Otherwise, thresholded(func, threshold) is returned.

Parameters:
  • func (callable) – The metric to convert, e.g., accuracy, balanced_accuracy, etc.

  • threshold (float | str, default=0.5) – The decision threshold.

  • **kwargs – Additional keyword arguments that shall be passed to func upon application.

Return type:

Either func itself or thresholded(func, threshold).

maybe_thresholded(func, threshold: float | str = 0.5, **kwargs)

Convenience function for converting a classification metric into its “thresholded” version IF NECESSARY. That means, if the given metric can be applied to class probabilities, it is returned unchanged. Otherwise, thresholded(func, threshold) is returned.

Parameters:
  • func (callable) – The metric to convert, e.g., accuracy, balanced_accuracy, etc.

  • threshold (float | str, default=0.5) – The decision threshold.

  • **kwargs – Additional keyword arguments that shall be passed to func upon application.

Return type:

Either func itself or thresholded(func, threshold).

confusion_matrix(y_true, y_pred, multilabel='auto', normalize: str | None = None, samplewise: bool = False, **kwargs) ndarray

Compute confusion matrix to evaluate the accuracy of a classification.

In the binary and multiclass case, the result is an array of shape (n_classes, n_classes) whose ij-th entry is the number of samples belonging to class i and classified as class j. In short: rows = ground truth, columns = predictions.

In the multilabel case, the result is an array of shape (n_labels, 2, 2), with a binary confusion matrix for each label.

Parameters:
  • y_true (array-like) – Ground-truth (correct) target values, array-like of shape (n_samples,) or (n_samples, n_labels).

  • y_pred (array-like) – Predictions, array-like with the same shape as y_true.

  • multilabel (str | bool, default="auto") –

    Whether to return a binary/multiclass confusion matrix, or a multilabel confusion matrix:

    • True: Return a multilabel confusion matrix, even if the input is binary/multiclass. Multiclass data will be treated as if binarized under a one-vs-rest transformation.

    • False: Return a binary/multiclass confusion matrix. Raises a ValueError if the input is multilabel.

    • “auto” (default): Automatically detect the confusion matrix type to return: multilabel if the input is multilabel, binary/multiclass otherwise.

  • normalize (str, optional) – Normalize the confusion matrix over the rows (“true”), columns (“pred”) conditions or the whole population (“all”). If None, the confusion matrix will not be normalized. For multilabel input, each of the 2x2 confusion matrices is normalized separately.

  • samplewise (bool, default=False) – In the multilabel case, this calculates a confusion matrix per sample.

Returns:

Confusion matrix, array of shape (n_classes, n_classes) if multilabel is False, or (n_classes, 2, 2) if multilabel is True.

See also

sklearn.metrics.confusion_matrix, sklearn.metrics.multilabel_confusion_matrix

Notes

This implementation combines both sklearn.metrics.confusion_matrix and sklearn.metrics.multilabel_confusion_matrix. Setting multilabel to False is equivalent to the former, setting it to True is equivalent to the latter.

hamming_loss(y_true, y_pred, *, sample_weight=None, **kwargs)

Hamming loss is 1 - accuracy, but their multilabel default averaging policy differs: accuracy returns subset accuracy by default (i.e., all labels must match), whereas hamming loss returns label-wise macro average by default.
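The difference between the two multilabel policies shows up on a tiny example. The helper functions below are written from the definitions and are not the library's implementations.

```python
y_true = [[1, 0, 1], [0, 1, 0]]
y_pred = [[1, 0, 0], [0, 1, 0]]


def subset_accuracy(y_true, y_pred):
    """A sample counts as correct only if all of its labels match."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def hamming_loss_labelwise(y_true, y_pred):
    """Error rate per label, averaged over labels (macro)."""
    n_labels = len(y_true[0])
    per_label = [
        sum(t[j] != p[j] for t, p in zip(y_true, y_pred)) / len(y_true)
        for j in range(n_labels)
    ]
    return sum(per_label) / n_labels


subset_acc = subset_accuracy(y_true, y_pred)
h_loss = hamming_loss_labelwise(y_true, y_pred)
```

One of the two samples has a mismatching label, so subset accuracy is 0.5, while only one of six individual labels is wrong, giving a Hamming loss of 1/6. Hence 1 - subset accuracy and Hamming loss differ in the multilabel case.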

informedness(y_true, y_pred, *, average='binary', sample_weight=None)

Informedness (aka Youden index or Youden’s J statistic) is the sum of sensitivity and specificity, minus 1.

markedness(y_true, y_pred, *, average='binary', sample_weight=None)

Markedness is the sum of positive- and negative predictive value, minus 1.
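A minimal sketch of the informedness and markedness definitions given above, computed directly from binary confusion-matrix counts. The function names are hypothetical, and zero denominators are not handled.

```python
def informedness_from_counts(tp, fp, tn, fn):
    """Sensitivity + specificity - 1 (hypothetical helper)."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return sensitivity + specificity - 1

def markedness_from_counts(tp, fp, tn, fn):
    """Positive predictive value + negative predictive value - 1 (hypothetical helper)."""
    ppv = tp / (tp + fp)
    npv = tn / (tn + fn)
    return ppv + npv - 1

inf = informedness_from_counts(tp=40, fp=10, tn=45, fn=5)
mark = markedness_from_counts(tp=40, fp=10, tn=45, fn=5)
print(inf, mark)
```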

hamming_loss_micro(y_true, y_pred, *, sample_weight=None, **kwargs)

Hamming loss is 1 - accuracy, but their multilabel default averaging policy differs: accuracy returns subset accuracy by default (i.e., all labels must match), whereas hamming loss returns label-wise macro average by default.

hamming_loss_macro(y_true, y_pred, *, sample_weight=None, **kwargs)

Hamming loss is 1 - accuracy, but their multilabel default averaging policy differs: accuracy returns subset accuracy by default (i.e., all labels must match), whereas hamming loss returns label-wise macro average by default.

hamming_loss_samples(y_true, y_pred, *, sample_weight=None, **kwargs)

Hamming loss is 1 - accuracy, but their multilabel default averaging policy differs: accuracy returns subset accuracy by default (i.e., all labels must match), whereas hamming loss returns label-wise macro average by default.

hamming_loss_weighted(y_true, y_pred, *, sample_weight=None, **kwargs)

Hamming loss is 1 - accuracy, but their multilabel default averaging policy differs: accuracy returns subset accuracy by default (i.e., all labels must match), whereas hamming loss returns label-wise macro average by default.

informedness_micro(y_true, y_pred, *, average='micro', sample_weight=None)

Informedness (aka Youden index or Youden’s J statistic) is the sum of sensitivity and specificity, minus 1.

informedness_macro(y_true, y_pred, *, average='macro', sample_weight=None)

Informedness (aka Youden index or Youden’s J statistic) is the sum of sensitivity and specificity, minus 1.

informedness_samples(y_true, y_pred, *, average='samples', sample_weight=None)

Informedness (aka Youden index or Youden’s J statistic) is the sum of sensitivity and specificity, minus 1.

informedness_weighted(y_true, y_pred, *, average='weighted', sample_weight=None)

Informedness (aka Youden index or Youden’s J statistic) is the sum of sensitivity and specificity, minus 1.

markedness_micro(y_true, y_pred, *, average='micro', sample_weight=None)

Markedness is the sum of positive- and negative predictive value, minus 1.

markedness_macro(y_true, y_pred, *, average='macro', sample_weight=None)

Markedness is the sum of positive- and negative predictive value, minus 1.

markedness_samples(y_true, y_pred, *, average='samples', sample_weight=None)

Markedness is the sum of positive- and negative predictive value, minus 1.

markedness_weighted(y_true, y_pred, *, average='weighted', sample_weight=None)

Markedness is the sum of positive- and negative predictive value, minus 1.

precision_recall_fscore_support_cm(*, cm=None, tp=None, fp=None, tn=None, fn=None, beta: float = 1.0, average: str | None = None, zero_division: float | str = 'warn')

Compute precision, recall, F-measure and support for each class. This is the confusion-matrix based variant of precision_recall_fscore_support.

Parameters:
  • cm (array-like, optional) – Confusion matrix, array-like of shape (n_classes, n_classes) or (n_labels, 2, 2). Mutually exclusive with tp, fp, tn and fn.

  • tp (scalar | array-like, optional) – Number/fraction of true positives, scalar or array-like of shape (n_labels,). If given, fp, tn and fn must be given, too. Mutually exclusive with cm.

  • fp (scalar | array-like, optional) – Number/fraction of false positives, scalar or array-like of shape (n_labels,). If given, tp, tn and fn must be given, too. Mutually exclusive with cm.

  • tn (scalar | array-like, optional) – Number/fraction of true negatives, scalar or array-like of shape (n_labels,). If given, tp, fp and fn must be given, too. Mutually exclusive with cm.

  • fn (scalar | array-like, optional) – Number/fraction of false negatives, scalar or array-like of shape (n_labels,). If given, tp, fp and tn must be given, too. Mutually exclusive with cm.

  • beta (float, default=1) – The strength of recall versus precision in the F-score.

  • average (str, optional) – Averaging to perform for multiclass and multilabel input.

  • zero_division (float | str, default="warn") – The value to return if there is a division by zero.

Returns:

  • precision (float | ndarray) – Precision score (positive predictive value).

  • recall (float | ndarray) – Recall score (sensitivity).

  • f_score (float | ndarray) – F-beta score.

  • support (float | ndarray) – The support of each class/label. None unless average is None.

See also

precision_recall_fscore_support
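For orientation, the quantities returned here can be derived from per-label counts as in the following NumPy sketch. The helper prf_from_counts is hypothetical; it ignores averaging and zero_division handling, and it assumes that support is the number of true instances per label (tp + fn).

```python
import numpy as np

def prf_from_counts(tp, fp, fn, beta=1.0):
    """Hypothetical helper: per-label precision, recall, F-beta and support."""
    tp, fp, fn = (np.asarray(a, dtype=float) for a in (tp, fp, fn))
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta ** 2
    # F-beta weights recall beta times as much as precision.
    f_score = (1 + b2) * precision * recall / (b2 * precision + recall)
    support = tp + fn  # number of true instances per label (assumption)
    return precision, recall, f_score, support

p, r, f, s = prf_from_counts(tp=[8, 3], fp=[2, 1], fn=[2, 3])
print(p, r, f, s)
```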

precision_cm(*, cm=None, tp=None, fp=None, tn=None, fn=None, average: str | None = 'binary', swap_pos_neg: bool = False, zero_division: float | str = 'warn') float | int | ndarray

Compute the precision (positive predictive value) from a given confusion matrix. This is the confusion-matrix based variant of positive_predictive_value.

Parameters:
  • cm (array-like, optional) – Confusion matrix, array-like of shape (n_classes, n_classes) or (n_labels, 2, 2). Mutually exclusive with tp, fp, tn and fn.

  • tp (scalar | array-like, optional) – Number/fraction of true positives, scalar or array-like of shape (n_labels,). If given, fp, tn and fn must be given, too. Mutually exclusive with cm.

  • fp (scalar | array-like, optional) – Number/fraction of false positives, scalar or array-like of shape (n_labels,). If given, tp, tn and fn must be given, too. Mutually exclusive with cm.

  • tn (scalar | array-like, optional) – Number/fraction of true negatives, scalar or array-like of shape (n_labels,). If given, tp, fp and fn must be given, too. Mutually exclusive with cm.

  • fn (scalar | array-like, optional) – Number/fraction of false negatives, scalar or array-like of shape (n_labels,). If given, tp, fp and tn must be given, too. Mutually exclusive with cm.

  • swap_pos_neg (bool, default=False) – Swap positive and negative class. If True, negative predictive value is computed instead.

  • average (str, optional) – Averaging to perform for multiclass and multilabel input.

  • zero_division (float | str, default="warn") – The value to return if there is a division by zero.

Returns:

Precision score (positive predictive value).

See also

positive_predictive_value, negative_predictive_value, positive_predictive_value_cm, negative_predictive_value_cm

recall_cm(*, cm=None, tp=None, fp=None, tn=None, fn=None, average: str | None = 'binary', zero_division: float | str = 'warn', swap_pos_neg: bool = False) float | int | ndarray

Compute the recall (sensitivity) from a given confusion matrix. This is the confusion-matrix based variant of sensitivity.

Parameters:
  • cm (array-like, optional) – Confusion matrix, array-like of shape (n_classes, n_classes) or (n_labels, 2, 2). Mutually exclusive with tp, fp, tn and fn.

  • tp (scalar | array-like, optional) – Number/fraction of true positives, scalar or array-like of shape (n_labels,). If given, fp, tn and fn must be given, too. Mutually exclusive with cm.

  • fp (scalar | array-like, optional) – Number/fraction of false positives, scalar or array-like of shape (n_labels,). If given, tp, tn and fn must be given, too. Mutually exclusive with cm.

  • tn (scalar | array-like, optional) – Number/fraction of true negatives, scalar or array-like of shape (n_labels,). If given, tp, fp and fn must be given, too. Mutually exclusive with cm.

  • fn (scalar | array-like, optional) – Number/fraction of false negatives, scalar or array-like of shape (n_labels,). If given, tp, fp and tn must be given, too. Mutually exclusive with cm.

  • swap_pos_neg (bool, default=False) – Swap positive and negative class. If True, specificity is computed instead.

  • average (str, optional) – Averaging to perform for multiclass and multilabel input.

  • zero_division (float | str, default="warn") – The value to return if there is a division by zero.

Returns:

Recall score (sensitivity).

See also

sensitivity, specificity, sensitivity_cm, specificity_cm

accuracy_cm(*, cm=None, tp=None, fp=None, tn=None, fn=None, average: str | None = 'global', normalize: bool = True) float | int | ndarray

Compute the accuracy from a given confusion matrix. This is the confusion-matrix based variant of accuracy.

Parameters:
  • cm (array-like, optional) – Confusion matrix, array-like of shape (n_classes, n_classes) or (n_labels, 2, 2). Mutually exclusive with tp, fp, tn and fn.

  • tp (scalar | array-like, optional) – Number/fraction of true positives, scalar or array-like of shape (n_labels,). If given, fp, tn and fn must be given, too. Mutually exclusive with cm.

  • fp (scalar | array-like, optional) – Number/fraction of false positives, scalar or array-like of shape (n_labels,). If given, tp, tn and fn must be given, too. Mutually exclusive with cm.

  • tn (scalar | array-like, optional) – Number/fraction of true negatives, scalar or array-like of shape (n_labels,). If given, tp, fp and fn must be given, too. Mutually exclusive with cm.

  • fn (scalar | array-like, optional) – Number/fraction of false negatives, scalar or array-like of shape (n_labels,). If given, tp, fp and tn must be given, too. Mutually exclusive with cm.

  • normalize (bool, default=True) – Return the fraction of correctly classified samples. Otherwise, return the number of correctly classified samples.

  • average (str, optional) – Averaging to perform for multiclass and multilabel input.

Returns:

Accuracy score.

See also

accuracy
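The effect of normalize can be sketched for a multiclass confusion matrix as follows. The matrix is made-up illustration data and the computation uses NumPy directly, not the library function.

```python
import numpy as np

# Rows are true classes, columns are predicted classes (illustrative data).
cm = np.array([[5, 1, 0],
               [2, 3, 1],
               [0, 0, 4]])

n_correct = np.trace(cm)        # correctly classified samples (diagonal)
accuracy = n_correct / cm.sum() # fraction of correctly classified samples

print(n_correct, accuracy)
```

With normalize=True the fraction would be reported, otherwise the raw count of correct samples.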

balanced_accuracy_cm(*, cm=None, tp=None, fp=None, tn=None, fn=None, average: str | None = 'global', adjusted: bool = False) float | int | ndarray

Compute the balanced accuracy from a given confusion matrix. This is the confusion-matrix based variant of balanced_accuracy.

Parameters:
  • cm (array-like, optional) – Confusion matrix, array-like of shape (n_classes, n_classes) or (n_labels, 2, 2). Mutually exclusive with tp, fp, tn and fn.

  • tp (scalar | array-like, optional) – Number/fraction of true positives, scalar or array-like of shape (n_labels,). If given, fp, tn and fn must be given, too. Mutually exclusive with cm.

  • fp (scalar | array-like, optional) – Number/fraction of false positives, scalar or array-like of shape (n_labels,). If given, tp, tn and fn must be given, too. Mutually exclusive with cm.

  • tn (scalar | array-like, optional) – Number/fraction of true negatives, scalar or array-like of shape (n_labels,). If given, tp, fp and fn must be given, too. Mutually exclusive with cm.

  • fn (scalar | array-like, optional) – Number/fraction of false negatives, scalar or array-like of shape (n_labels,). If given, tp, fp and tn must be given, too. Mutually exclusive with cm.

  • adjusted (bool, default=False) – Adjust the result for chance, so that random performance would score 0, while keeping perfect performance at a score of 1.

  • average (str, optional) – Averaging to perform for multiclass and multilabel input.

Returns:

Balanced accuracy score.

See also

balanced_accuracy
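A short sketch of balanced accuracy and its chance adjustment for a multiclass confusion matrix. It assumes the adjustment used by sklearn.metrics.balanced_accuracy_score, where chance level is 1 / n_classes; whether the library uses exactly this formula is an assumption.

```python
import numpy as np

# Rows are true classes, columns are predicted classes (illustrative data).
cm = np.array([[5, 1, 0],
               [2, 3, 1],
               [0, 0, 4]])

per_class_recall = np.diag(cm) / cm.sum(axis=1)
balanced = per_class_recall.mean()

# Chance adjustment: random performance maps to 0, perfect performance stays at 1.
chance = 1 / cm.shape[0]
adjusted = (balanced - chance) / (1 - chance)

print(balanced, adjusted)
```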

fbeta_cm(*, cm=None, tp=None, fp=None, tn=None, fn=None, average: str | None = 'binary', zero_division: str | float = 'warn', beta: float = 1.0) float | int | ndarray

Compute the F-beta score from a given confusion matrix.

Parameters:
  • cm (array-like, optional) – Confusion matrix, array-like of shape (n_classes, n_classes) or (n_labels, 2, 2). Mutually exclusive with tp, fp, tn and fn.

  • tp (scalar | array-like, optional) – Number/fraction of true positives, scalar or array-like of shape (n_labels,). If given, fp, tn and fn must be given, too. Mutually exclusive with cm.

  • fp (scalar | array-like, optional) – Number/fraction of false positives, scalar or array-like of shape (n_labels,). If given, tp, tn and fn must be given, too. Mutually exclusive with cm.

  • tn (scalar | array-like, optional) – Number/fraction of true negatives, scalar or array-like of shape (n_labels,). If given, tp, fp and fn must be given, too. Mutually exclusive with cm.

  • fn (scalar | array-like, optional) – Number/fraction of false negatives, scalar or array-like of shape (n_labels,). If given, tp, fp and tn must be given, too. Mutually exclusive with cm.

  • beta (float, default=1) – The strength of recall versus precision.

  • average (str, optional) – Averaging to perform for multiclass and multilabel input.

  • zero_division (float | str, default="warn") – The value to return if there is a division by zero.

Returns:

F-beta score.

cohen_kappa_cm(*, cm=None, tp=None, fp=None, tn=None, fn=None, average: str | None = 'global', weights: str | None = None) float | int | ndarray

Compute Cohen’s kappa from a given confusion matrix. This is the confusion-matrix based variant of cohen_kappa.

Parameters:
  • cm (array-like, optional) – Confusion matrix, array-like of shape (n_classes, n_classes) or (n_labels, 2, 2). Mutually exclusive with tp, fp, tn and fn.

  • tp (scalar | array-like, optional) – Number/fraction of true positives, scalar or array-like of shape (n_labels,). If given, fp, tn and fn must be given, too. Mutually exclusive with cm.

  • fp (scalar | array-like, optional) – Number/fraction of false positives, scalar or array-like of shape (n_labels,). If given, tp, tn and fn must be given, too. Mutually exclusive with cm.

  • tn (scalar | array-like, optional) – Number/fraction of true negatives, scalar or array-like of shape (n_labels,). If given, tp, fp and fn must be given, too. Mutually exclusive with cm.

  • fn (scalar | array-like, optional) – Number/fraction of false negatives, scalar or array-like of shape (n_labels,). If given, tp, fp and tn must be given, too. Mutually exclusive with cm.

  • weights (str, optional) – Weighting type to calculate the score. None means not weighted; “linear” means linear weighting; “quadratic” means quadratic weighting.

  • average (str, optional) – Averaging to perform for multiclass and multilabel input.

Returns:

Kappa statistic, float or array of floats between -1 and 1. The maximum value means complete agreement; zero or lower means chance agreement.

See also

cohen_kappa
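For the unweighted case (weights=None), the kappa statistic can be sketched from a confusion matrix as observed agreement corrected by expected chance agreement. The matrix below is illustrative, and the code uses NumPy directly rather than the library function.

```python
import numpy as np

cm = np.array([[20, 5],
               [10, 15]], dtype=float)

n = cm.sum()
p_observed = np.trace(cm) / n
# Expected agreement under independence of row and column marginals.
p_expected = (cm.sum(axis=0) * cm.sum(axis=1)).sum() / n ** 2
kappa = (p_observed - p_expected) / (1 - p_expected)

print(kappa)
```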

matthews_correlation_coefficient_cm(*, cm=None, tp=None, fp=None, tn=None, fn=None, average: str | None = 'global') float | int | ndarray

Compute the Matthews correlation coefficient (MCC) from a given confusion matrix. This is the confusion-matrix based variant of matthews_correlation_coefficient.

Parameters:
  • cm (array-like, optional) – Confusion matrix, array-like of shape (n_classes, n_classes) or (n_labels, 2, 2). Mutually exclusive with tp, fp, tn and fn.

  • tp (scalar | array-like, optional) – Number/fraction of true positives, scalar or array-like of shape (n_labels,). If given, fp, tn and fn must be given, too. Mutually exclusive with cm.

  • fp (scalar | array-like, optional) – Number/fraction of false positives, scalar or array-like of shape (n_labels,). If given, tp, tn and fn must be given, too. Mutually exclusive with cm.

  • tn (scalar | array-like, optional) – Number/fraction of true negatives, scalar or array-like of shape (n_labels,). If given, tp, fp and fn must be given, too. Mutually exclusive with cm.

  • fn (scalar | array-like, optional) – Number/fraction of false negatives, scalar or array-like of shape (n_labels,). If given, tp, fp and tn must be given, too. Mutually exclusive with cm.

  • average (str, optional) – Averaging to perform for multiclass and multilabel input.

Returns:

Matthews correlation coefficient (+1 represents a perfect prediction, 0 an average random prediction and -1 an inverse prediction).

See also

matthews_correlation_coefficient
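The three reference values mentioned above can be reproduced with the standard binary MCC formula. The helper mcc_from_counts is hypothetical, and returning 0 for a zero denominator is an assumption borrowed from scikit-learn's convention.

```python
import math

def mcc_from_counts(tp, fp, tn, fn):
    """Hypothetical helper: binary Matthews correlation coefficient."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Assumption: return 0 when the coefficient is undefined.
    return (tp * tn - fp * fn) / denom if denom else 0.0

perfect = mcc_from_counts(tp=10, fp=0, tn=10, fn=0)
inverse = mcc_from_counts(tp=0, fp=10, tn=0, fn=10)
random_like = mcc_from_counts(tp=5, fp=5, tn=5, fn=5)
print(perfect, inverse, random_like)
```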

jaccard_cm(*, cm=None, tp=None, fp=None, tn=None, fn=None, average: str | None = 'binary', zero_division: str | float = 'warn') float | int | ndarray

Compute the Jaccard score (intersection over union, IoU) from a given confusion matrix. This is the confusion-matrix based variant of jaccard.

Parameters:
  • cm (array-like, optional) – Confusion matrix, array-like of shape (n_classes, n_classes) or (n_labels, 2, 2). Mutually exclusive with tp, fp, tn and fn.

  • tp (scalar | array-like, optional) – Number/fraction of true positives, scalar or array-like of shape (n_labels,). If given, fp, tn and fn must be given, too. Mutually exclusive with cm.

  • fp (scalar | array-like, optional) – Number/fraction of false positives, scalar or array-like of shape (n_labels,). If given, tp, tn and fn must be given, too. Mutually exclusive with cm.

  • tn (scalar | array-like, optional) – Number/fraction of true negatives, scalar or array-like of shape (n_labels,). If given, tp, fp and fn must be given, too. Mutually exclusive with cm.

  • fn (scalar | array-like, optional) – Number/fraction of false negatives, scalar or array-like of shape (n_labels,). If given, tp, fp and tn must be given, too. Mutually exclusive with cm.

  • average (str, optional) – Averaging to perform for multiclass and multilabel input.

  • zero_division (float | str, default="warn") – The value to return if there is a division by zero.

Returns:

Jaccard score.

See also

jaccard
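In the binary case, the Jaccard score depends only on tp, fp and fn, and it is a monotone function of the F1 score. The sketch below uses hypothetical helper names and does not handle zero denominators.

```python
def jaccard_from_counts(tp, fp, fn):
    """Intersection over union; true negatives play no role."""
    return tp / (tp + fp + fn)

def f1_from_counts(tp, fp, fn):
    """F1 score written in terms of counts."""
    return 2 * tp / (2 * tp + fp + fn)

j = jaccard_from_counts(tp=6, fp=2, fn=4)
f1 = f1_from_counts(tp=6, fp=2, fn=4)
# Known identity: J = F1 / (2 - F1).
print(j, f1 / (2 - f1))
```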

hamming_loss_cm(*, cm=None, tp=None, fp=None, tn=None, fn=None, **kwargs) float | int | ndarray

Hamming loss is 1 - accuracy, but their multilabel default averaging policy differs: accuracy returns subset accuracy by default (i.e., all labels must match), whereas hamming loss returns label-wise macro average by default.

f1_cm(*, cm=None, tp=None, fp=None, tn=None, fn=None, average: str | None = 'binary', zero_division: str | float = 'warn', beta: float = 1) float | int | ndarray

Compute the F-beta score from a given confusion matrix. With the default beta=1, this is the F1 score.

Parameters:
  • cm (array-like, optional) – Confusion matrix, array-like of shape (n_classes, n_classes) or (n_labels, 2, 2). Mutually exclusive with tp, fp, tn and fn.

  • tp (scalar | array-like, optional) – Number/fraction of true positives, scalar or array-like of shape (n_labels,). If given, fp, tn and fn must be given, too. Mutually exclusive with cm.

  • fp (scalar | array-like, optional) – Number/fraction of false positives, scalar or array-like of shape (n_labels,). If given, tp, tn and fn must be given, too. Mutually exclusive with cm.

  • tn (scalar | array-like, optional) – Number/fraction of true negatives, scalar or array-like of shape (n_labels,). If given, tp, fp and fn must be given, too. Mutually exclusive with cm.

  • fn (scalar | array-like, optional) – Number/fraction of false negatives, scalar or array-like of shape (n_labels,). If given, tp, fp and tn must be given, too. Mutually exclusive with cm.

  • beta (float, default=1) – The strength of recall versus precision.

  • average (str, optional) – Averaging to perform for multiclass and multilabel input.

  • zero_division (float | str, default="warn") – The value to return if there is a division by zero.

Returns:

F-beta score.

sensitivity_cm(*, cm=None, tp=None, fp=None, tn=None, fn=None, average: str | None = 'binary', zero_division: float | str = 0, swap_pos_neg: bool = False) float | int | ndarray

Compute the recall (sensitivity) from a given confusion matrix. This is the confusion-matrix based variant of sensitivity.

Parameters:
  • cm (array-like, optional) – Confusion matrix, array-like of shape (n_classes, n_classes) or (n_labels, 2, 2). Mutually exclusive with tp, fp, tn and fn.

  • tp (scalar | array-like, optional) – Number/fraction of true positives, scalar or array-like of shape (n_labels,). If given, fp, tn and fn must be given, too. Mutually exclusive with cm.

  • fp (scalar | array-like, optional) – Number/fraction of false positives, scalar or array-like of shape (n_labels,). If given, tp, tn and fn must be given, too. Mutually exclusive with cm.

  • tn (scalar | array-like, optional) – Number/fraction of true negatives, scalar or array-like of shape (n_labels,). If given, tp, fp and fn must be given, too. Mutually exclusive with cm.

  • fn (scalar | array-like, optional) – Number/fraction of false negatives, scalar or array-like of shape (n_labels,). If given, tp, fp and tn must be given, too. Mutually exclusive with cm.

  • swap_pos_neg (bool, default=False) – Swap positive and negative class. If True, specificity is computed instead.

  • average (str, optional) – Averaging to perform for multiclass and multilabel input.

  • zero_division (float | str, default=0) – The value to return if there is a division by zero.

Returns:

Recall score (sensitivity).

See also

sensitivity, specificity, sensitivity_cm, specificity_cm

specificity_cm(*, cm=None, tp=None, fp=None, tn=None, fn=None, average: str | None = 'binary', zero_division: float | str = 0, swap_pos_neg: bool = True) float | int | ndarray

Compute the specificity (true negative rate) from a given confusion matrix. This is the confusion-matrix based variant of specificity. It accepts the same parameters as sensitivity_cm, but swap_pos_neg defaults to True.

Parameters:
  • cm (array-like, optional) – Confusion matrix, array-like of shape (n_classes, n_classes) or (n_labels, 2, 2). Mutually exclusive with tp, fp, tn and fn.

  • tp (scalar | array-like, optional) – Number/fraction of true positives, scalar or array-like of shape (n_labels,). If given, fp, tn and fn must be given, too. Mutually exclusive with cm.

  • fp (scalar | array-like, optional) – Number/fraction of false positives, scalar or array-like of shape (n_labels,). If given, tp, tn and fn must be given, too. Mutually exclusive with cm.

  • tn (scalar | array-like, optional) – Number/fraction of true negatives, scalar or array-like of shape (n_labels,). If given, tp, fp and fn must be given, too. Mutually exclusive with cm.

  • fn (scalar | array-like, optional) – Number/fraction of false negatives, scalar or array-like of shape (n_labels,). If given, tp, fp and tn must be given, too. Mutually exclusive with cm.

  • swap_pos_neg (bool, default=True) – Swap positive and negative class. If True (the default), specificity is computed; if False, sensitivity is computed instead.

  • average (str, optional) – Averaging to perform for multiclass and multilabel input.

  • zero_division (float | str, default=0) – The value to return if there is a division by zero.

Returns:

Specificity score (true negative rate), or the recall score (sensitivity) if swap_pos_neg is False.

See also

sensitivity, specificity, sensitivity_cm, specificity_cm
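The role of swap_pos_neg can be illustrated with binary counts: exchanging the positive and negative class swaps tp with tn and fp with fn, which turns recall into specificity. The helper below is a hypothetical sketch, not the library's code.

```python
def recall_from_counts(tp, fp, tn, fn, swap_pos_neg=False):
    """Hypothetical helper: recall, or specificity if classes are swapped."""
    if swap_pos_neg:
        # Treat the negative class as positive and vice versa.
        tp, tn = tn, tp
        fp, fn = fn, fp
    return tp / (tp + fn)

sensitivity = recall_from_counts(tp=40, fp=10, tn=45, fn=5)
specificity = recall_from_counts(tp=40, fp=10, tn=45, fn=5, swap_pos_neg=True)
print(sensitivity, specificity)
```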

positive_predictive_value_cm(*, cm=None, tp=None, fp=None, tn=None, fn=None, average: str | None = 'binary', swap_pos_neg: bool = False, zero_division: float | str = 1) float | int | ndarray

Compute the precision (positive predictive value) from a given confusion matrix. This is the confusion-matrix based variant of positive_predictive_value.

Parameters:
  • cm (array-like, optional) – Confusion matrix, array-like of shape (n_classes, n_classes) or (n_labels, 2, 2). Mutually exclusive with tp, fp, tn and fn.

  • tp (scalar | array-like, optional) – Number/fraction of true positives, scalar or array-like of shape (n_labels,). If given, fp, tn and fn must be given, too. Mutually exclusive with cm.

  • fp (scalar | array-like, optional) – Number/fraction of false positives, scalar or array-like of shape (n_labels,). If given, tp, tn and fn must be given, too. Mutually exclusive with cm.

  • tn (scalar | array-like, optional) – Number/fraction of true negatives, scalar or array-like of shape (n_labels,). If given, tp, fp and fn must be given, too. Mutually exclusive with cm.

  • fn (scalar | array-like, optional) – Number/fraction of false negatives, scalar or array-like of shape (n_labels,). If given, tp, fp and tn must be given, too. Mutually exclusive with cm.

  • swap_pos_neg (bool, default=False) – Swap positive and negative class. If True, negative predictive value is computed instead.

  • average (str, optional) – Averaging to perform for multiclass and multilabel input.

  • zero_division (float | str, default=1) – The value to return if there is a division by zero.

Returns:

Precision score (positive predictive value).

See also

positive_predictive_value, negative_predictive_value, positive_predictive_value_cm, negative_predictive_value_cm

negative_predictive_value_cm(*, cm=None, tp=None, fp=None, tn=None, fn=None, average: str | None = 'binary', swap_pos_neg: bool = True, zero_division: float | str = 1) float | int | ndarray

Compute the negative predictive value from a given confusion matrix. This is the confusion-matrix based variant of negative_predictive_value. It accepts the same parameters as positive_predictive_value_cm, but swap_pos_neg defaults to True.

Parameters:
  • cm (array-like, optional) – Confusion matrix, array-like of shape (n_classes, n_classes) or (n_labels, 2, 2). Mutually exclusive with tp, fp, tn and fn.

  • tp (scalar | array-like, optional) – Number/fraction of true positives, scalar or array-like of shape (n_labels,). If given, fp, tn and fn must be given, too. Mutually exclusive with cm.

  • fp (scalar | array-like, optional) – Number/fraction of false positives, scalar or array-like of shape (n_labels,). If given, tp, tn and fn must be given, too. Mutually exclusive with cm.

  • tn (scalar | array-like, optional) – Number/fraction of true negatives, scalar or array-like of shape (n_labels,). If given, tp, fp and fn must be given, too. Mutually exclusive with cm.

  • fn (scalar | array-like, optional) – Number/fraction of false negatives, scalar or array-like of shape (n_labels,). If given, tp, fp and tn must be given, too. Mutually exclusive with cm.

  • swap_pos_neg (bool, default=True) – Swap positive and negative class. If True (the default), negative predictive value is computed; if False, positive predictive value is computed instead.

  • average (str, optional) – Averaging to perform for multiclass and multilabel input.

  • zero_division (float | str, default=1) – The value to return if there is a division by zero.

Returns:

Negative predictive value, or the precision score (positive predictive value) if swap_pos_neg is False.

See also

positive_predictive_value, negative_predictive_value, positive_predictive_value_cm, negative_predictive_value_cm

informedness_cm(*, cm=None, tp=None, fp=None, tn=None, fn=None, average: str | None = 'binary')

Informedness (aka Youden index or Youden’s J statistic) is the sum of sensitivity and specificity, minus 1.

markedness_cm(*, cm=None, tp=None, fp=None, tn=None, fn=None, average: str | None = 'binary')

Markedness is the sum of positive- and negative predictive value, minus 1.

Statistics

create_non_numeric_statistics(df: DataFrame, target: list, name_: str = '') DataFrame[source]

Calculate descriptive statistics for the non-numeric features of a given DataFrame.

Parameters:
  • df (DataFrame) – The main DataFrame.

  • target (list) – The target labels, given as a list.

  • name_ (str, default='') – Name of the label. Used for naming the columns.

Returns:

DataFrame with statistics for the non-numeric features.

Return type:

DataFrame

calc_non_numeric_statistics(df: DataFrame, target: list, classify: bool) dict[source]

Calculate descriptive statistics for non-numeric features.

Parameters:
  • df (DataFrame) – The main DataFrame.

  • target (list) – The target labels, given as a list.

  • classify (bool) – True for a classification task, False for a regression task.

Returns:

Dictionary with descriptive statistics (for non-numeric features) for the overall dataset, each target and (in case of classification) each label.

Return type:

dict

calc_numeric_statistics(df: DataFrame, target: list, classify: bool) dict[source]

Calculate descriptive statistics for the numeric features of a given DataFrame.

Parameters:
  • df (DataFrame) – The main DataFrame.

  • target (list) – The target labels, given as a list.

  • classify (bool) – True for a classification task, False for a regression task.

Returns:

Dictionary with statistics (for numeric features) for the entire dataset, each target and (in case of classification) each label.

Return type:

dict

calc_descriptive_statistics(df: DataFrame, target: list, classify: bool, corr_threshold: int = 200) Tuple[dict, dict, DataFrame | None][source]

Calculate and return descriptive statistics, including correlation information.

Parameters:
  • df (DataFrame) – The main DataFrame.

  • target (list) – The target labels, given as a list.

  • classify (bool) – True for a classification task, False for a regression task.

  • corr_threshold (int, default=200) – Maximum number of columns for which a correlation DataFrame is computed.

Returns:

Tuple of numeric statistics, non-numeric statistics (separate dictionaries) and the correlation DataFrame.

Return type:

tuple

save_descriptive_statistics(df: DataFrame, target: list, classify: bool, fn: str | Path, corr_threshold: int = 200)[source]

Calculate descriptive statistics, including correlation information, and save them to disk.

Parameters:
  • df (DataFrame) – The main DataFrame.

  • target (list) – The target labels, given as a list.

  • classify (bool) – True for a classification task, False for a regression task.

  • fn (str | Path) – The directory in which to save the statistics files.

  • corr_threshold (int, default=200) – Maximum number of columns for which a correlation DataFrame is computed.

mann_whitney_u(x, y, nan_policy: str = 'omit', **kwargs) float[source]

Mann-Whitney U test for testing whether two independent samples are drawn from the same distribution (more precisely: have equal medians). Only applicable to numerical observations; categorical observations should be treated with the chi-square test.

Parameters:
  • x (array-like) – First sample, array-like with numerical values.

  • y (array-like) – Second sample, array-like with numerical values.

  • nan_policy (str, default="omit") –

    Specifies how to handle NaN values:

    • “omit”: Perform the test on all non-NaN values.

    • “propagate”: Return NaN if at least one input value is NaN.

    • “raise”: Raise a ValueError if at least one input value is NaN.

  • **kwargs – Keyword arguments passed to scipy.stats.mannwhitneyu().

Returns:

p_value – P-value. Smaller values mean that x and y are distributed differently.

Return type:

float

See also

scipy.stats.mannwhitneyu

Notes

This test is symmetric between x and y if alternative is set to “two-sided” (default), i.e., mann_whitney_u(x, y) equals mann_whitney_u(y, x).

The Mann-Whitney U test is a special case of the Kruskal-Wallis H test, which works for more than two samples.
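The symmetry noted above can be made plausible with the U statistic itself: the statistics for (x, y) and (y, x) always sum to n_x * n_y, so a two-sided test based on them does not depend on the argument order. The sketch below computes U by pair counting with NumPy and drops NaN values per sample, mimicking nan_policy="omit"; the helper name is hypothetical, and the actual p-value is computed by scipy.stats.mannwhitneyu.

```python
import numpy as np

def u_statistic(x, y):
    """Hypothetical helper: number of pairs with x > y, ties counted as 0.5."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Independent samples: NaN values are dropped separately in each sample.
    x = x[~np.isnan(x)]
    y = y[~np.isnan(y)]
    diff = x[:, None] - y[None, :]
    return np.sum(diff > 0) + 0.5 * np.sum(diff == 0)

x = [1.0, 2.0, np.nan, 4.0]
y = [3.0, 5.0, 6.0]
u_xy = u_statistic(x, y)
u_yx = u_statistic(y, x)
print(u_xy, u_yx)  # the two values sum to 3 * 3
```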

chi_square(x, y, nan_policy: str = 'omit', **kwargs) float[source]

Chi square test for testing whether a sample of categorical observations is distributed according to another sample of categorical observations.

Parameters:
  • x (array-like) – First sample, array-like with categorical values.

  • y (array-like) – Second sample, array-like with categorical values.

  • nan_policy (str, default="omit") –

    Specifies how to handle NaN values:

    • “omit”: Perform the test on all non-NaN values.

    • “propagate”: Return NaN if at least one input value is NaN.

    • “raise”: Raise a ValueError if at least one input value is NaN.

  • **kwargs – Keyword arguments passed to scipy.stats.chisquare().

Returns:

p_value – p-value. Smaller values mean that x is distributed differently from y.

Return type:

float

See also

scipy.stats.chisquare

Notes

This test is not symmetric between x and y, i.e., chi_square(x, y) differs from chi_square(y, x) in general.
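The asymmetry can be seen in the chi-square statistic itself. The sketch below assumes that the category counts of x act as observed frequencies and those of y, rescaled to the size of x, act as expected frequencies; this is an assumption about how the two samples are compared, and the helper name is hypothetical. Only the statistic is computed here, not the p-value.

```python
import numpy as np

def chi_square_statistic(x, y):
    """Hypothetical helper: sum((observed - expected)^2 / expected)."""
    categories = sorted(set(x) | set(y))
    observed = np.array([list(x).count(c) for c in categories], dtype=float)
    reference = np.array([list(y).count(c) for c in categories], dtype=float)
    # Assumption: expected counts follow the distribution of y, scaled to len(x).
    expected = reference / reference.sum() * observed.sum()
    return np.sum((observed - expected) ** 2 / expected)

x = ["a"] * 6 + ["b"] * 4
y = ["a"] * 2 + ["b"] * 8
s_xy = chi_square_statistic(x, y)
s_yx = chi_square_statistic(y, x)
print(s_xy, s_yx)  # the two statistics differ
```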

delong_test(y_true, y_hat_1, y_hat_2, sample_weight=None, nan_policy: str = 'omit') float[source]

Compute the p-value of the DeLong test for the null hypothesis that two ROC-AUCs are equal.

Parameters:
  • y_true (array-like) – Ground truth, 1D array-like of shape (n_samples,) with values in {0, 1}.

  • y_hat_1 (array-like) – Predictions of the first classifier, 1D array-like of shape (n_samples,) with arbitrary values. Larger values correspond to a higher predicted probability that a sample belongs to the positive class.

  • y_hat_2 (array-like) – Predictions of the second classifier, 1D array-like of shape (n_samples,) with arbitrary values. Larger values correspond to a higher predicted probability that a sample belongs to the positive class.

  • sample_weight (array-like, optional) – Sample weights. None defaults to uniform weights.

  • nan_policy (str, default="omit") –

    Specifies how to handle NaN values:

    • ”omit”: Perform the test on all non-NaN values. Since this is a paired test, all observations that are NaN in any of the three arrays are dropped.

    • ”propagate”: Return NaN if at least one input value is NaN.

    • ”raise”: Raise a ValueError if at least one input value is NaN.

Returns:

p_value – p-value for the null hypothesis that the ROC-AUCs of the two classifiers are equal. If this value is smaller than a certain pre-defined threshold (e.g., 0.05) the null hypothesis can be rejected, meaning that there is a statistically significant difference between the two ROC-AUCs.

Return type:

float

See also

roc_auc_confidence_interval

Confidence interval for the ROC-AUC of a given classifier.

roc_auc_confidence_interval(y_true, y_hat, alpha: float = 0.95, sample_weight=None, nan_policy: str = 'omit') Tuple[float, float, float][source]

Return the confidence interval and ROC-AUC of given ground-truth and model predictions.

Parameters:
  • y_true (array-like) – Ground truth, 1D array-like of shape (n_samples,) with values in {0, 1}.

  • y_hat (array-like) – Predictions of the classifier, 1D array-like of shape (n_samples,) with arbitrary values. Larger values correspond to a higher predicted probability that a sample belongs to the positive class.

  • alpha (float, default=0.95) – Confidence level, between 0 and 1.

  • sample_weight (array-like, optional) – Sample weights. None defaults to uniform weights.

  • nan_policy (str, default="omit") –

    Specifies how to handle NaN values:

    • ”omit”: Perform the test on all non-NaN values.

    • ”propagate”: Return NaN if at least one input value is NaN.

    • ”raise”: Raise a ValueError if at least one input value is NaN.

Returns:

  • auc (float) – ROC-AUC of the given ground-truth and predictions.

  • ci_left (float) – Left endpoint of the confidence interval.

  • ci_right (float) – Right endpoint of the confidence interval.

Notes

The output always satisfies 0 <= ci_left <= auc <= ci_right <= 1.

See also

delong_test

Statistical test for the null hypothesis that the ROC-AUCs of two classifiers are equal.
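
For intuition, a confidence interval of this kind can be sketched with the variance estimate of DeLong et al. Whether roc_auc_confidence_interval computes its interval this way is an assumption, and the helper auc_ci_sketch below is hypothetical; it ignores sample weights and NaN handling, and needs at least two positive and two negative samples.

```python
from statistics import NormalDist

import numpy as np


def auc_ci_sketch(y_true, y_hat, alpha=0.95):
    """Hypothetical helper returning (auc, ci_left, ci_right)."""
    y_true = np.asarray(y_true)
    y_hat = np.asarray(y_hat, dtype=float)
    pos = y_hat[y_true == 1]
    neg = y_hat[y_true == 0]
    # psi[i, j] = 1 if pos[i] > neg[j], 0.5 on ties, 0 otherwise.
    psi = (pos[:, None] > neg[None, :]) + 0.5 * (pos[:, None] == neg[None, :])
    auc = psi.mean()
    # Structural components: one value per positive / negative sample.
    v10 = psi.mean(axis=1)
    v01 = psi.mean(axis=0)
    var = v10.var(ddof=1) / len(pos) + v01.var(ddof=1) / len(neg)
    # Two-sided normal quantile for confidence level alpha.
    z = NormalDist().inv_cdf(1 - (1 - alpha) / 2)
    half_width = z * np.sqrt(var)
    # Clip so that 0 <= ci_left <= auc <= ci_right <= 1 always holds.
    ci_left = max(0.0, auc - half_width)
    ci_right = min(1.0, auc + half_width)
    return float(auc), float(ci_left), float(ci_right)


auc, ci_left, ci_right = auc_ci_sketch([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

With only four samples the interval is very wide, which illustrates why the endpoints have to be clipped to the unit interval.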

suggest_test(task: str = 'comparison', quantitative: bool = True, paired: bool = False, n_groups: int = 2, normal: bool = False, equal_variance: bool = False) dict[source]

Suggest statistical hypothesis tests for comparing two or more groups (samples). The list of suggested tests is by no means exhaustive, but includes some of the most frequently used tests in practice.

See Notes for some general comments on statistical testing.

Parameters:
  • task (str, default="comparison") –

    The objective of the test, can be either “comparison” or “correlation”:

    • ”comparison”: The objective of the test is to determine whether the given groups were drawn from the same distribution. This usually, but not necessarily, happens by comparing group statistics, like mean, median or variance.

    • ”correlation”: The objective of the test is to determine whether the given (paired) groups are correlated. Groups can be correlated even when drawn from distinct distributions.

  • quantitative (bool, default=True) – The observations in the given groups are quantitative, i.e., drawn from continuous or discrete distributions, such that each observation has a numerical value. The alternative are categorical observations.

  • paired (bool, default=False) – The observations in the given groups are paired, i.e., the i-th observation in the first group corresponds to the i-th observation in the second group. Correspondence can mean, for instance, that observations originate from the same subject, measurement device, etc. Note that this implies that all groups must have the same size. The alternative is independent groups.

  • n_groups (int, default=2) – The number of groups the test should handle. Some tests are restricted to two groups, others can handle arbitrarily many groups.

  • normal (bool, default=False) – The observations are known to be drawn from a normal distribution. Some tests need this assumption to work properly, others (called “non-parametric tests”) can deal with arbitrary underlying distributions.

  • equal_variance (bool, default=False) – The observations are known to be drawn from distributions with equal variance (usually normal distributions). Some tests need this assumption to work properly, others can deal with arbitrary variances. This property is also known as homoscedasticity.

Returns:

Dict of suggested tests (possibly empty), keys are names and values are dicts with main properties. Carefully read the documentation of each test to select the one appropriate for your data.

Return type:

dict
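
To make the role of the parameters more concrete, the following is a heavily simplified, hypothetical lookup for the “comparison” task. It is not the implementation of suggest_test: the function name, the returned property keys and the choice of exactly one textbook test per combination are assumptions made for illustration.

```python
def suggest_test_sketch(quantitative=True, paired=False, n_groups=2,
                        normal=False, equal_variance=False):
    """Hypothetical lookup: map data properties to one textbook test."""
    if not quantitative:
        if not paired:
            name, parametric = "chi-square test", False
        elif n_groups == 2:
            name, parametric = "McNemar's test", False
        else:
            return {}  # nothing suggested in this sketch
    elif n_groups == 2:
        if paired:
            name = "paired t-test" if normal else "Wilcoxon signed-rank test"
        elif normal:
            name = "Student's t-test" if equal_variance else "Welch's t-test"
        else:
            name = "Mann-Whitney U test"
        parametric = normal
    else:
        if paired:
            name = "repeated-measures ANOVA" if normal else "Friedman test"
            parametric = normal
        elif normal and equal_variance:
            name, parametric = "one-way ANOVA", True
        else:
            name, parametric = "Kruskal-Wallis H test", False
    return {name: {"paired": paired, "parametric": parametric}}


default_suggestion = suggest_test_sketch()
three_groups = suggest_test_sketch(n_groups=3)
```

With the default arguments (two independent, quantitative groups without a normality assumption) the sketch lands on the Mann-Whitney U test, and on the Kruskal-Wallis H test for three groups.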

Notes

A statistical test is usually performed by finding evidence _against_ the null hypothesis of the test, e.g., using the t-test to show that two groups have _different_ mean values. The converse is not true, though: if a test does not produce evidence against the null hypothesis, we cannot conclude that the null-hypothesis must be true – only that we have not found any evidence against it. This holds true even if the p-values are close to 1. More concisely: null hypothesis true ==> (relatively) large p-value. Note the implication, not equivalence!

One common assumption of most statistical tests is that all observations in a group are independent, i.e., all are drawn independently from the same underlying distribution (i.i.d. assumption). Whether this property holds true also _between_ groups can be controlled with parameter paired.

There are many resources for finding the right statistical test on the internet, e.g., [1].

References

Preprocessing

class MinMaxScaler(fit_bool: bool = True, **kwargs)

Bases: MinMaxScaler

Transform data by scaling each feature to a given range. The only difference to sklearn.preprocessing.MinMaxScaler is parameter fit_bool that, when set to False, does not fit this scaler on boolean features but rather uses 0 and 1 as fixed minimum and maximum values. This ensures that False is always mapped to feature_range[0] and True is always mapped to feature_range[1]. Otherwise, if the training data only contains True values, True would be mapped to feature_range[0] and False to feature_range[0] - feature_range[1]. The behavior on other numerical data types is not affected by this.

Parameters:
  • fit_bool (bool, default=True) – Whether to fit this scaler on boolean features. If True, the behavior is identical to sklearn.preprocessing.MinMaxScaler.

  • **kwargs – Additional keyword arguments, passed to sklearn.preprocessing.MinMaxScaler.

See also

sklearn.preprocessing.MinMaxScaler

Notes

Note that sklearn.preprocessing.MaxAbsScaler always maps False to 0 and True to 1, so there is no need for an analogous subclass.
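
The problem that fit_bool=False addresses can be reproduced with plain sklearn.preprocessing.MinMaxScaler and the default feature_range of (0, 1). In this sketch the effect of fit_bool=False is only emulated by fitting on the fixed values 0 and 1; how the subclass achieves this internally is not shown here.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

test_column = np.array([[1.0], [0.0]])  # True, False

# Fitted on a column that only contains True: minimum == maximum == 1.
fitted = MinMaxScaler().fit(np.array([[1.0], [1.0]]))
with_fit = fitted.transform(test_column).ravel().tolist()

# Emulate fit_bool=False by fitting on the fixed range {0, 1} instead.
fixed = MinMaxScaler().fit(np.array([[0.0], [1.0]]))
with_fixed_range = fixed.transform(test_column).ravel().tolist()
```

In the first case True ends up at 0 and False at -1, outside the feature range; in the second case False is mapped to 0 and True to 1, as intended.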

partial_fit(X, y=None) MinMaxScaler

Online computation of min and max on X for later scaling.

All of X is processed as a single batch. This is intended for cases when fit() is not feasible due to a very large number of samples or because X is read from a continuous stream.

Parameters:
  • X (array-like of shape (n_samples, n_features)) – The data used to compute the per-feature minimum and maximum used for later scaling along the features axis.

  • y (None) – Ignored.

Returns:

self – Fitted scaler.

Return type:

object

class OneHotEncoder(drop_na: bool = False, drop=None, handle_unknown: str | None = None, min_frequency: int | float | None = None, max_categories: int | None = None, **kwargs)

Bases: OneHotEncoder

Encode categorical features as a one-hot numeric array. The only difference from sklearn.preprocessing.OneHotEncoder is parameter drop_na which, when set to True, drops NaN categories. More precisely, no separate columns representing NaN categories are added upon transformation, resembling the behavior of pandas.get_dummies().

Parameters:
  • drop_na (bool, default=False) – Drop NaN categories. If False, the behavior is identical to sklearn.preprocessing.OneHotEncoder.

  • drop (iterable, optional) – Categories to drop. If drop_na is True, this parameter must be None.

  • handle_unknown (str, optional) – How to handle unknown categories. If drop_na is True, this parameter must be “ignore”. None defaults to “ignore” if drop_na is True and to “error” otherwise.

  • min_frequency (int | float, optional) – Specifies the minimum frequency below which a category will be considered infrequent. If drop_na is True, this parameter must be None.

  • max_categories (int, optional) – Specifies an upper limit to the number of output features for each input feature when considering infrequent categories. If drop_na is True, this parameter must be None.

See also

sklearn.preprocessing.OneHotEncoder

Notes

If drop_na is True, all features containing only NaN values during fit() are removed entirely.
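
The pandas.get_dummies() behavior that drop_na=True resembles can be seen in a short example. It only uses pandas and illustrates the intended output layout, not the encoder itself.

```python
import numpy as np
import pandas as pd

colors = pd.Series(["red", np.nan, "blue"])

# Default: no column is created for the missing value.
without_nan_column = pd.get_dummies(colors)
# dummy_na=True adds a separate indicator column for NaN.
with_nan_column = pd.get_dummies(colors, dummy_na=True)

# The row with the missing value is all zeros when no NaN column is added.
missing_row_is_empty = int(without_nan_column.iloc[1].sum()) == 0
```

Without the NaN column, a missing value is represented by a row of zeros across all categories of the feature.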

fit(X, y=None) OneHotEncoder

Fit OneHotEncoder to X.

Parameters:
  • X (array-like of shape (n_samples, n_features)) – The data to determine the categories of each feature.

  • y (None) – Ignored. This parameter exists only for compatibility with Pipeline.

Return type:

self

class DTypeTransformer(num=None, cat=None, bool=None, timedelta=None, datetime=None, obj=None, default='passthrough', timedelta_resolution: str | pandas.Timedelta | None = None, datetime_resolution: str | pandas.Timedelta | None = None)

Bases: BaseEstimator, TransformerMixin

Transform columns of a pandas DataFrame depending on their data types.

The order of columns may change compared to the input.

Parameters:
  • num (str | BaseEstimator, optional) –

    The transformation to apply to numerical columns, or “passthrough”, “drop”, “num”, “cat”, “bool”, “timedelta”, “datetime”, “obj” or None/”default”:

    • BaseEstimator: Apply the BaseEstimator to all columns with numerical data type. The BaseEstimator must implement fit() and transform(). Class instances are cloned before being fit to data, to ensure that the given instances are left unchanged.

    • ”passthrough”: Pass numerical columns through unchanged.

    • ”drop”: Drop numerical columns.

    • ”num”: Prohibited here, but allowed with cat, bool, timedelta, datetime, obj and default: Treat columns of the respective data type as numerical and apply the transformation specified by num.

    • ”cat”: Treat numerical columns like categorical columns, and apply the transformation specified by cat.

    • ”bool”: Treat numerical columns like boolean columns, and apply the transformation specified by bool.

    • ”timedelta”: Treat numerical columns like timedelta columns, and apply the transformation specified by timedelta.

    • ”datetime”: Treat numerical columns like datetime columns, and apply the transformation specified by datetime.

    • ”obj”: Treat numerical columns like columns with object data type, and apply the transformation specified by obj.

    • None or “default”: Apply the default transformation, specified by default.

  • cat (str | BaseEstimator, optional) – The transformation to apply to categorical columns. Same options as for num.

  • bool (str | BaseEstimator, optional) – The transformation to apply to boolean columns. Same options as for num.

  • timedelta (str | BaseEstimator, optional) – The transformation to apply to timedelta columns. Same options as for num.

  • datetime (str | BaseEstimator, optional) – The transformation to apply to datetime columns. Same options as for num.

  • obj (str | BaseEstimator, optional) – The transformation to apply to columns with object data type. Same options as for num.

  • default (str | BaseEstimator, default="passthrough") – Default behavior for columns with unspecified transformation. Same options as for num, but cannot be None.

  • timedelta_resolution (str | pandas.Timedelta, optional) – Convert timedelta columns to float by dividing by the given temporal resolution. This transformation is applied before any other transformation, and regardless of the value of timedelta. None keeps the data type of timedelta columns.

  • datetime_resolution (str | pandas.Timedelta, optional) – Convert datetime columns to float by dividing by the given temporal resolution. This transformation is applied before any other transformation, and regardless of the value of datetime. None keeps the data type of datetime columns.

See also

sklearn.compose.ColumnTransformer

Notes

This preprocessing transformation is only applicable to pandas DataFrames.

If the transformation specification is recursive, fit() raises a ValueError. Recursive specifications arise when some data type A shall be treated like B, B shall be treated like C, C shall be treated like … like A.
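
How such a recursive specification can be detected is sketched below in plain Python. The function resolve_spec and the dict-based representation of the specification are hypothetical and only illustrate the idea of following “treat like” references until an actual transformation is reached.

```python
DTYPES = {"num", "cat", "bool", "timedelta", "datetime", "obj"}


def resolve_spec(spec, dtype, default="passthrough"):
    """Follow 'treat like' references until a real transformation is found."""
    seen = []
    current = dtype
    while True:
        if current in seen:
            cycle = " -> ".join(seen + [current])
            raise ValueError(f"Recursive specification: {cycle}")
        seen.append(current)
        target = spec.get(current)
        if target is None or target == "default":
            target = default
        if not (isinstance(target, str) and target in DTYPES):
            return target  # "passthrough", "drop" or an estimator
        current = target


# bool columns are treated like numerical columns, which are dropped.
resolved = resolve_spec({"bool": "num", "num": "drop"}, "bool")

# num -> cat -> bool -> num never reaches an actual transformation.
try:
    resolve_spec({"num": "cat", "cat": "bool", "bool": "num"}, "num")
    cycle_detected = False
except ValueError:
    cycle_detected = True
```

The same check also covers the case that is prohibited explicitly above, namely num referring to “num” itself.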

class FeatureFilter(add_missing: bool = True, remove_unknown: bool = True)

Bases: BaseEstimator, TransformerMixin

Simple transformer that ensures that the list of features is identical to the features seen during fit.

Parameters:
  • add_missing (bool, default=True) – Add missing columns in transform(), filling them with NaN. If False, an error is raised instead.

  • remove_unknown (bool, default=True) – Remove unknown columns in transform(). If False, an error is raised instead.

Notes

Data types are ignored, i.e., the output of transform() has the same data types as the input, which may differ from the data types seen during fit.

This preprocessing transformation is only applicable to pandas DataFrames.
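
The default behavior (add_missing=True, remove_unknown=True) is close to what pandas.DataFrame.reindex() does along the column axis. The following sketch uses only pandas; the column names are made up, and the fact that reindex() also restores the column order seen during fit is a property of this sketch, not a documented guarantee of FeatureFilter.

```python
import pandas as pd

fitted_columns = ["age", "height", "weight"]  # columns seen during fit

new_data = pd.DataFrame({"height": [1.8], "age": [42], "shoe_size": [44]})

# reindex() adds "weight" filled with NaN and drops the unknown "shoe_size".
filtered = new_data.reindex(columns=fitted_columns)
```

Columns that are already present keep their values and data types; only the newly added column consists of NaN.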

ordinal_encoder(dtype=<class 'numpy.float64'>, **kwargs) DTypeTransformer

Create a transformation for ordinal-encoding categorical features, while keeping other features unchanged.

Parameters:
  • dtype – Data type of ordinal encoding. Passed to sklearn.preprocessing.OrdinalEncoder.

  • **kwargs – Keyword arguments passed to DTypeTransformer, most notably num, bool etc. for specifying how to treat non-categorical columns. cat cannot be specified.

Return type:

DTypeTransformer instance that can be used for ordinal-encoding categorical columns in pandas DataFrames.

See also

ordinal_encode, sklearn.preprocessing.OrdinalEncoder

ordinal_encode(X: pandas.DataFrame, dtype=<class 'numpy.float64'>, output: str = 'default', **kwargs) pandas.DataFrame | ndarray

Ordinal-encode categorical features in a pandas DataFrame, while keeping other features unchanged.

Internally, this function creates a suitable transformation using ordinal_encoder() and applies its fit_transform() method to the given DataFrame.

Parameters:
  • X (pandas.DataFrame) – DataFrame to process.

  • dtype – Data type of ordinal encoding. Passed to sklearn.preprocessing.OrdinalEncoder.

  • output (str, default="default") – Desired output type, either “default” (Numpy array) or “pandas” (pandas DataFrame).

  • **kwargs – Keyword arguments passed to DTypeTransformer, most notably num, bool etc. for specifying how to treat non-categorical columns. cat cannot be specified.

Return type:

Transformed input, either a DataFrame or an array, depending on output.

See also

ordinal_encoder, sklearn.preprocessing.OrdinalEncoder
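
The effect of ordinal-encoding only the categorical columns can be approximated with sklearn.preprocessing.OrdinalEncoder and pandas alone. This is a minimal sketch under the assumption that categorical columns are selected by their pandas data type; it does not call ordinal_encode() and the example data is made up.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

df = pd.DataFrame({
    "color": pd.Categorical(["red", "blue", "red"]),
    "size": [1.5, 2.0, 3.5],
})

# Select the categorical columns and encode only those.
cat_cols = list(df.select_dtypes(include="category").columns)
encoded = OrdinalEncoder(dtype=np.float64).fit_transform(df[cat_cols])

result = df.copy()
for i, col in enumerate(cat_cols):
    # The encoder sorts the categories: "blue" -> 0.0, "red" -> 1.0.
    result[col] = encoded[:, i]
```

The numerical column size is passed through unchanged, which mirrors the default handling of non-categorical columns described above.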

one_hot_encoder(drop_na: bool = False, drop=None, dtype=<class 'numpy.float64'>, handle_unknown: str | None = None, **kwargs) DTypeTransformer

Create a transformation for one-hot-encoding categorical features, while keeping other features unchanged.

Parameters:
  • drop_na (bool, default=False) – Drop NaN categories. Passed to OneHotEncoder.

  • drop (iterable, optional) – Categories to drop. If drop_na is True, this parameter must be None. Passed to OneHotEncoder.

  • dtype – Data type of one-hot encoding. Passed to OneHotEncoder.

  • handle_unknown (str, optional) – How to handle unknown categories. Passed to OneHotEncoder.

  • **kwargs – Keyword arguments passed to DTypeTransformer, most notably num, bool etc. for specifying how to treat non-categorical columns. cat cannot be specified.

Return type:

DTypeTransformer instance that can be used for one-hot-encoding categorical columns in pandas DataFrames.

Notes

The resulting transformation always returns dense output in a pandas DataFrame.

one_hot_encode(X: pandas.DataFrame, drop_na: bool = False, drop=None, dtype=<class 'numpy.float64'>, handle_unknown: str | None = None, output: str = 'default', **kwargs) pandas.DataFrame | ndarray

One-hot encode categorical features in a pandas DataFrame, while keeping other features unchanged.

Internally, this function creates a suitable transformation using one_hot_encoder() and applies its fit_transform() method to the given DataFrame.

Parameters:
  • X (pandas.DataFrame) – DataFrame to process.

  • drop_na (bool, default=False) – Drop NaN categories. Passed to OneHotEncoder.

  • drop (iterable, optional) – Categories to drop. If drop_na is True, this parameter must be None. Passed to OneHotEncoder.

  • dtype – Data type of one-hot encoding. Passed to OneHotEncoder.

  • output (str, default="default") – Desired output type, either “default” (Numpy array) or “pandas” (pandas DataFrame).

  • **kwargs – Keyword arguments passed to DTypeTransformer, most notably num, bool etc. for specifying how to treat non-categorical columns. cat cannot be specified.

Return type:

Transformed input, either a DataFrame or an array, depending on output.

k_bins_discretizer(n_bins: int = 5, encode: str = 'onehot', strategy: str = 'quantile', timedelta: str = 'num', **kwargs) DTypeTransformer

Create a transformation for k-bins-discretizing numerical features, while keeping other features unchanged.

Parameters:
  • n_bins (int, default=5) – Number of bins to produce. Passed to sklearn.preprocessing.KBinsDiscretizer.

  • encode (str, default="onehot") – Method used to encode the transformed result. Passed to sklearn.preprocessing.KBinsDiscretizer.

  • strategy (str, default="quantile") – Strategy used to define the widths of the bins. Passed to sklearn.preprocessing.KBinsDiscretizer.

  • timedelta (str, default="num") – How to treat timedelta features.

  • **kwargs – Keyword arguments passed to DTypeTransformer, most notably cat, bool etc. for specifying how to treat non-numerical columns. num cannot be specified.

Return type:

DTypeTransformer instance that can be used for k-bins-discretizing numerical columns in pandas DataFrames.

See also

k_bins_discretize, sklearn.preprocessing.KBinsDiscretizer

k_bins_discretize(X: pandas.DataFrame, n_bins: int = 5, encode: str = 'onehot', strategy: str = 'quantile', output: str = 'default', timedelta: str = 'num', **kwargs) pandas.DataFrame | ndarray

K-bins discretize numerical features in a pandas DataFrame, while keeping other features unchanged.

Internally, this function creates a suitable transformation using k_bins_discretizer() and applies its fit_transform() method to the given DataFrame.

Parameters:
  • X (pandas.DataFrame) – DataFrame to process.

  • n_bins (int, default=5) – Number of bins to produce. Passed to sklearn.preprocessing.KBinsDiscretizer.

  • encode (str, default="onehot") – Method used to encode the transformed result. Passed to sklearn.preprocessing.KBinsDiscretizer.

  • strategy (str, default="quantile") – Strategy used to define the widths of the bins. Passed to sklearn.preprocessing.KBinsDiscretizer.

  • output (str, default="default") – Desired output type, either “default” (Numpy array) or “pandas” (pandas DataFrame).

  • timedelta (str, default="num") – How to handle timedelta features.

  • **kwargs – Keyword arguments passed to DTypeTransformer, most notably cat, bool etc. for specifying how to treat non-numerical columns. num cannot be specified.

Return type:

Transformed input, either a DataFrame or an array, depending on output.

See also

k_bins_discretizer, sklearn.preprocessing.KBinsDiscretizer

binarizer(threshold: float = 0, **kwargs) DTypeTransformer

Create a transformation for binarizing numerical features, while keeping other features unchanged.

Parameters:
  • threshold (float, default=0) – Feature values below or equal to this are replaced by 0, above it by 1. Passed to sklearn.preprocessing.Binarizer.

  • **kwargs – Keyword arguments passed to DTypeTransformer, most notably cat, bool etc. for specifying how to treat non-numerical columns. num cannot be specified.

Return type:

DTypeTransformer instance that can be used for binarizing numerical columns in pandas DataFrames.

See also

binarize, sklearn.preprocessing.Binarizer

binarize(X: pandas.DataFrame, threshold: float = 0, output: str = 'default', **kwargs) pandas.DataFrame | ndarray

Binarize numerical features in a pandas DataFrame, while keeping other features unchanged.

Internally, this function creates a suitable transformation using binarizer() and applies its fit_transform() method to the given DataFrame.

Parameters:
  • X (pandas.DataFrame) – DataFrame to process.

  • threshold (float, default=0) – Feature values below or equal to this are replaced by 0, above it by 1. Passed to sklearn.preprocessing.Binarizer.

  • output (str, default="default") – Desired output type, either “default” (Numpy array) or “pandas” (pandas DataFrame).

  • **kwargs – Keyword arguments passed to DTypeTransformer, most notably cat, bool etc. for specifying how to treat non-numerical columns. num cannot be specified.

Return type:

Transformed input, either a DataFrame or an array, depending on output.

See also

binarizer, sklearn.preprocessing.Binarizer

scaler(strategy: str = 'standard', cat=None, bool=None, timedelta='num', datetime=None, obj=None, default='passthrough', timedelta_resolution=None, datetime_resolution=None, fit_bool=None, **kwargs) DTypeTransformer

Create a transformation for scaling numerical features, while keeping other features unchanged.

Parameters:
  • strategy (str, default="standard") –

    Strategy used to scale numerical data:

    • “standard”: Scale data to have zero mean and unit variance, using sklearn.preprocessing.StandardScaler.

    • “robust”: Scale data using statistics that are robust to outliers, using sklearn.preprocessing.RobustScaler.

    • “minmax”: Scale data to have zero minimum and unit maximum, using sklearn.preprocessing.MinMaxScaler.

    • “maxabs”: Scale data to have a maximum absolute value of 1, using sklearn.preprocessing.MaxAbsScaler.

  • cat (optional) – How to handle categorical features.

  • bool (optional) – How to handle boolean features.

  • timedelta (default="num") – How to handle timedelta features.

  • datetime (optional) – How to handle datetime features.

  • obj (optional) – How to handle object features.

  • default (default="passthrough") – How to handle features for which no transformation is specified elsewhere.

  • timedelta_resolution (str | pandas.Timedelta, optional) – Timedelta resolution. If None and timedelta is set to “num” (either explicitly or implicitly), the resolution is automatically set to “s”.

  • datetime_resolution (str | pandas.Timedelta, optional) – Datetime resolution. If None and datetime is set to “num” (either explicitly or implicitly), the resolution is automatically set to “s”.

  • **kwargs – Additional keyword arguments passed to the underlying scikit-learn scaler.

Return type:

DTypeTransformer instance that can be used for scaling numerical columns in pandas DataFrames.

See also

scale, sklearn.preprocessing.StandardScaler, sklearn.preprocessing.RobustScaler, sklearn.preprocessing.MinMaxScaler, sklearn.preprocessing.MaxAbsScaler

scale(X: pandas.DataFrame, strategy: str = 'standard', output: str = 'default', **kwargs) pandas.DataFrame | ndarray

Scale numerical features in a pandas DataFrame, while keeping other features unchanged.

Internally, this function creates a suitable transformation using scaler() and applies its fit_transform() method to the given DataFrame.

Parameters:
  • X (pandas.DataFrame) – DataFrame to process.

  • strategy (str, default="standard") –

    Strategy used to scale numerical data:

    • “standard”: Scale data to have zero mean and unit variance, using sklearn.preprocessing.StandardScaler.

    • “robust”: Scale data using statistics that are robust to outliers, using sklearn.preprocessing.RobustScaler.

    • “minmax”: Scale data to have zero minimum and unit maximum, using sklearn.preprocessing.MinMaxScaler.

    • “maxabs”: Scale data to have a maximum absolute value of 1, using sklearn.preprocessing.MaxAbsScaler.

  • output (str, default="default") – Desired output type, either “default” (Numpy array) or “pandas” (pandas DataFrame).

  • **kwargs – Additional keyword arguments passed to scaler().

Return type:

Transformed input, either a DataFrame or an array, depending on output.

See also

scaler, sklearn.preprocessing.StandardScaler, sklearn.preprocessing.RobustScaler, sklearn.preprocessing.MinMaxScaler, sklearn.preprocessing.MaxAbsScaler
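
What the “standard” strategy does to a mixed-type DataFrame can be approximated with sklearn.preprocessing.StandardScaler applied to the numerical columns only. The sketch below does not call scale(); selecting columns via select_dtypes() and the example data are assumptions for illustration.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({
    "height": [1.0, 2.0, 3.0],
    "group": pd.Categorical(["a", "b", "a"]),
})

# Scale numerical columns only; all other columns are left untouched.
num_cols = list(df.select_dtypes(include="number").columns)
scaled = df.copy()
scaled[num_cols] = StandardScaler().fit_transform(df[num_cols])
```

After scaling, height has zero mean and unit (population) standard deviation, while group is identical to the input.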

Encoding

class Encoder(classify: bool = True)[source]

Bases: BaseEstimator

Encoder for features- and labels DataFrames. Implements the BaseEstimator class of sklearn, with methods fit(), transform() and inverse_transform(), and can easily be dumped to and loaded from disk.

Notes

Encoding ensures that:

  • The data type of every feature column is either float, int, bool, categorical or string (if the installed Pandas version supports it). Time-like columns are converted into float, and object data types raise an exception.

  • The data type of every target column is float.

    • In regression tasks, this is achieved by converting numerical data types (float, int, bool, time-like) into float, and raising exceptions if other data types are found.

    • In binary classification, this is achieved by representing the negative class by 0.0 and the positive class by 1.0. If the original data type is categorical, the negative class corresponds to the first category, whereas the positive class corresponds to the second category. If the original data type is not categorical the positive and negative classes are determined through sklearn’s LabelEncoder.

    • In multiclass classification, this is achieved by representing the i-th class by i.

    • In multilabel classification, this is achieved by representing the presence of a class by 1.0 and its absence by 0.0.

  • Both features and labels may contain NaN values before encoding. These are simply propagated, meaning that encoded data may contain NaN values as well!
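
The binary-classification rule for categorical targets, including NaN propagation, can be illustrated with pandas category codes. This is a sketch of the described outcome, not the Encoder's actual code; the label values are made up.

```python
import numpy as np
import pandas as pd

labels = pd.Series(
    pd.Categorical(["no", "yes", None, "yes"], categories=["no", "yes"])
)

# cat.codes is 0 for the first category, 1 for the second and -1 for NaN.
codes = labels.cat.codes.to_numpy().astype(float)
# Propagate missing labels as NaN instead of keeping the -1 marker.
encoded = np.where(codes < 0, np.nan, codes)
```

The first category (“no”) becomes the negative class 0.0, the second (“yes”) the positive class 1.0, and the missing label stays NaN.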

get_target_or_class_names() list | None[source]

Convenience method for getting the names of the targets or, in case of multiclass classification, the names of the individual classes.

Returns:

List of target- or class names.

Return type:

list

transform(*, inplace: bool = True, **kwargs: DataFrame | None)[source]

Transform features- and/or labels DataFrames.

Parameters:
  • inplace (bool, default=True) – Whether to modify the given data in place.

  • **kwargs (DataFrame, optional) – The data to transform, with keys “x” (features), “y” (labels) or “data” (features+labels).

Returns:

The transformed DataFrame(s), either a single DataFrame if only one of “x” or “y” is passed, or a pair of DataFrames in the same order as in the argument dict. If “data” is passed, returns the pair of encoded features and labels.

Return type:

Any

inverse_transform(*, inplace: bool = True, **kwargs: DataFrame | ndarray | None)[source]

Back-transform features- and/or labels DataFrames, i.e., decode encoded data. In the case of classification, this method can also handle Numpy arrays containing class indices, as returned by predict(), as well as class probabilities, as returned by predict_proba().

Parameters:
  • inplace (bool, default=True) – Whether to modify the given data in place.

  • **kwargs (DataFrame, ndarray, optional) – The data to transform back, with keys “x” (features) or “y” (labels).

Returns:

The back-transformed DataFrame(s), either a single DataFrame if only one of “x” or “y” is passed, or a pair of DataFrames in the same order as in the argument dict.

Return type:

Any

Table

convert_object_dtypes(df: DataFrame, inplace: bool = True, max_categories: int = 100) DataFrame[source]

Convert “object” data types in df into other data types, if possible. In particular, this includes timedelta, datetime, categorical and string types, in that order. String types are not supported in all Pandas versions.

Parameters:
  • df (DataFrame) – The DataFrame.

  • inplace (bool, default=True) – Whether to modify df in place. Note that if no column in df can be converted, it is returned as-is even if inplace is False.

  • max_categories (int, default=100) – The maximum number of allowed categories when converting an object column into a categorical column.

Returns:

DataFrame with converted data types.

Return type:

DataFrame

set_index(df: DataFrame, inplace: bool = True) Tuple[DataFrame, List[str]][source]

Set the row index of the given DataFrame to an ID column, unless it contains IDs already, and return a list of other potential ID columns.

Parameters:
  • df (DataFrame) – The DataFrame.

  • inplace (bool, default=True) – Whether to modify df in place.

Returns:

Pair (df, id_cols), where df is the new DataFrame and id_cols is a list of potential ID columns.

Return type:

tuple

merge_tables(tables: Iterable[DataFrame | str | Path]) Tuple[DataFrame, List[str]][source]

Merge the given tables by left-joining them on ID columns.

Parameters:

tables (Iterable) – The tables to merge, an iterable of DataFrames or paths to tables. Function convert_object_dtypes() is automatically applied to tables read from files.

Returns:

The pair (df, id_cols), where df is the merged DataFrame and id_cols is the list of potential ID columns.

Return type:

tuple

train_test_split(df: DataFrame, by: str) Tuple[Dict[str, ndarray], str | None][source]

Split the given DataFrame into train- and test set(s), by a given column.

Parameters:
  • df (DataFrame) – The DataFrame.

  • by (str) – The name of the column to split by. Must have bool or categorical data type.

Returns:

Pair (split_masks, train_key), where

  • split_masks is a dict mapping string-keys to masks corresponding to non-overlapping portions of df.

  • train_key is the key (in split_masks) containing the training set, or None if the training set could not be determined.

Return type:

tuple

Split

class StratifiedGroupShuffleSplit(n_splits=10, *, test_size=None, train_size=None, random_state=None, method='automatic', n_iter=None)

Bases: GroupShuffleSplit

Stratified grouped split into train- and test set. Ensures that groups in the two sets do not overlap, and tries to distribute samples in such a way that class percentages are roughly maintained in each split.

Parameters:
  • n_splits (int, default=10) – Number of re-shuffling & splitting iterations.

  • test_size (float | int, optional) – If float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the test split. If int, represents the absolute number of test samples. If None, the value is set to the complement of the train size. If train_size is also None, it will be set to 0.1.

  • train_size (float | int, optional) – If float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the train split. If int, represents the absolute number of train samples. If None, the value is automatically set to the complement of the test size.

  • random_state (int | RandomState, optional) – Controls the randomness of the training and testing indices produced. Pass an int for reproducible output across multiple function calls.

  • method (str, default="automatic") – Resampling method to use. Can be “automatic”, “exact” and “brute_force”. If there are many small groups, “brute_force” tends to give reasonable results and is significantly faster than “exact”. Otherwise, if there are only few large groups, method “exact” might be preferable. “automatic” tries to infer the optimal method based on the number of groups.

  • n_iter (int, optional) – Number of brute-force iterations. The larger the number, the more splits are tried, and hence the better the results get. If None, the number of iterations is determined automatically.
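
The idea behind a brute-force search can be sketched in plain Python: draw many random group-wise splits and keep the one whose test set best preserves the overall class proportions. The function brute_force_split is hypothetical and simplified; in particular, test_size is applied to the number of groups rather than samples, and the scoring rule is an assumption, not the criterion used by this class.

```python
import random
from collections import Counter


def brute_force_split(y, groups, test_size=0.25, n_iter=200, seed=0):
    """Hypothetical sketch: best of n_iter random group-wise splits."""
    rng = random.Random(seed)
    unique_groups = sorted(set(groups))
    n_test_groups = max(1, round(test_size * len(unique_groups)))
    overall = Counter(y)
    best_score, best_test_groups = None, None
    for _ in range(n_iter):
        candidate = set(rng.sample(unique_groups, n_test_groups))
        test_labels = [label for label, g in zip(y, groups) if g in candidate]
        counts = Counter(test_labels)
        # Sum of absolute differences between class proportions.
        score = sum(
            abs(counts[c] / len(test_labels) - overall[c] / len(y))
            for c in overall
        )
        if best_score is None or score < best_score:
            best_score, best_test_groups = score, candidate
    test_idx = [i for i, g in enumerate(groups) if g in best_test_groups]
    train_idx = [i for i, g in enumerate(groups) if g not in best_test_groups]
    return train_idx, test_idx


y = [0, 0, 1, 1, 0, 1, 0, 0]
groups = [1, 1, 2, 2, 3, 3, 4, 4]
train_idx, test_idx = brute_force_split(y, groups)
```

In this example group 3 is the only group containing both classes, so it gives the test set whose class proportions are closest to the overall ones. More iterations mean more candidate splits, which matches the description of n_iter above.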

class StratifiedGroupKFold(n_splits=5, shuffle=False, random_state=None, method: str = 'automatic', n_iter: int | None = None)

Bases: _BaseKFold

Copied and adapted from sklearn version 1.0.2 [1], because older versions do not provide this very useful class.

Changelist:

  • Remove warning if some class has fewer than n_splits instances.

  • Do not throw error if all classes have fewer than n_splits instances.

  • Add method “brute_force”.

  • Fix bug in scikit-learn, leading to suboptimal stratifications:

    https://github.com/scikit-learn/scikit-learn/issues/24656.

Parameters:
  • n_splits (int, default=5) – Number of folds.

  • shuffle (bool, default=False) – Whether to shuffle samples before splitting.

  • random_state (int or RandomState, optional) – Controls the randomness of the training and testing indices produced. Pass an int for reproducible output across multiple function calls.

  • method (str, default="automatic") – Resampling method to use. One of “automatic”, “exact” or “brute_force”. If there are many small groups, “brute_force” tends to give reasonable results and is significantly faster than “exact”. If there are only a few large groups, “exact” might be preferable. “automatic” tries to infer the optimal method based on the number of groups. Note that “brute_force” is only possible if shuffle is set to True.

  • n_iter (int, optional) – Number of brute-force iterations. The larger the number, the more splits are tried, and hence the better the results get. If None, the number of iterations is determined automatically.

References

class CustomPredefinedSplit(test_folds=None)

Bases: BaseCrossValidator

Predefined split cross-validator. Provides train/test indices to split data into train/test sets using a predefined scheme specified by explicit test indices.

In contrast to sklearn.model_selection.PredefinedSplit, samples can be in the test set of more than one split.

Parameters:

test_folds (list of array-like) – Indices of test samples for each split. The number of splits equals the length of the list. Note that the test sets do not have to be pairwise disjoint.

See also

sklearn.model_selection.PredefinedSplit

Notes

In methods split() etc., parameters y and groups only exist for compatibility, but are always ignored. X is only needed for obtaining the total number of samples.
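
A minimal sketch of the idea, assuming that the training set of each split is simply the complement of its test indices (this is an assumption, not a statement about the actual implementation). The function name iter_predefined_splits is hypothetical.

```python
import numpy as np


def iter_predefined_splits(n_samples, test_folds):
    """Yield (train, test) index arrays, one pair per entry of test_folds."""
    all_idx = np.arange(n_samples)
    for test in test_folds:
        test = np.asarray(test, dtype=int)
        # Assumed here: everything that is not a test sample is a training sample
        train = np.setdiff1d(all_idx, test)
        yield train, test


# Sample 1 is in the test set of both splits, which
# sklearn.model_selection.PredefinedSplit cannot express
splits = list(iter_predefined_splits(6, [[0, 1], [1, 2, 3]]))
```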

get_n_splits(X=None, y=None, groups=None)

Returns the number of splitting iterations in the cross-validator.

Longitudinal

resample_eav(df: DataFrame | dask.dataframe.DataFrame, windows: DataFrame | dask.dataframe.DataFrame, agg: dict = None, entity_col=None, time_col=None, attribute_col=None, value_col=None, include_start: bool = True, include_stop: bool = False, optimize: str = 'time') DataFrame | dask.dataframe.DataFrame[source]

Resample data in EAV (entity-attribute-value) format wrt. explicitly passed windows of arbitrary (possibly infinite) length.

Parameters:
  • df (pd.DataFrame | dask.dataframe.DataFrame) – The DataFrame to resample, in EAV format. That means, must have columns value_col (contains observed values), time_col (contains observation times), attribute_col (optional; contains attribute identifiers) and entity_col (optional; contains entity identifiers). Must have one column index level. Data types are arbitrary, as long as observation times and entity identifiers can be compared wrt. < and <= (e.g., float, int, time delta, date time). Entity identifiers must not be NA. Observation times may be NA, but such entries are ignored entirely. df can be a Dask DataFrame as well. In that case, however, entity_col must not be None and entities should already be on the row index, with known divisions. Otherwise, the row index is set to entity_col, which can be very costly both in terms of time and memory. Especially if df is known to be sorted wrt. entities already, the calling function should better take care of this; see https://docs.dask.org/en/stable/generated/dask.dataframe.DataFrame.set_index.html.

  • windows (pd.DataFrame | dask.dataframe.DataFrame | callable) – The target windows into which df is resampled. Must have two column index levels and columns (time_col, “start”) (optional; contains start times of each window), (time_col, “stop”) (optional; contains end times of each window), (entity_col, “”) (optional; contains entity identifiers) and (window_group_col, “”) (optional; contains information for creating groups of mutually disjoint windows). Start- and end times may be NA, but such windows are deemed invalid and by definition do not contain any observations. At least one of the two endpoint-columns must be given; if one is missing it is assumed to represent +/- inf. windows can be a Dask DataFrame as well. In that case, however, entity_col must not be None and entities should already be on the row index, with known divisions. Otherwise, the row index is set to entity_col, which can be very costly both in terms of time and memory. Especially if windows is known to be sorted wrt. entities already, the calling function should better take care of this; see https://docs.dask.org/en/stable/generated/dask.dataframe.DataFrame.set_index.html. Alternatively, windows can be a callable that, when applied to a DataFrame and keyword arguments entity_col, time_col, attribute_col and value_col, returns a DataFrame of the form described above. The canonical example of such a callable is the result returned by make_windows(); see the documentation of make_windows() for details.

  • agg (dict) –

    The aggregations to apply. Must be a dict mapping attribute identifiers to lists of aggregation functions, which are applied to all observed values of the respective attribute in each specified window. Supported aggregation functions are:

    • "mean": Empirical mean of observed non-NA values

    • "min": Minimum of observed non-NA values; equivalent to “p0”

    • "max": Maximum of observed non-NA values; equivalent to “p100”

    • "median": Median of observed non-NA values; equivalent to “p50”

    • "std": Empirical standard deviation of observed non-NA values

    • "var": Empirical variance of observed non-NA values

    • "sum": Sum of observed non-NA values

    • "prod": Product of observed non-NA values

    • "skew": Skewness of observed non-NA values

    • "mad": Mean absolute deviation of observed non-NA values

    • "sem": Standard error of the mean of observed non-NA values

    • "size": Number of observations, including NA values

    • "count": Number of non-NA observations

    • "nunique": Number of unique observed non-NA values

    • "mode": Mode of observed non-NA values, i.e., most frequent value; ties are broken randomly but reproducibly

    • "mode_count": Number of occurrences of mode

    • "pxx": Percentile of observed non-NA values; xx is an arbitrary float in the interval [0, 100]

    • "rxx": xx-th observed value (possibly NA), starting from 0; negative indices count from the end

    • "txx": Time of xx-th observed value; negative indices count from the end

    • "callable": Function that takes as input a DataFrame in and returns a new DataFrame out. See Notes for details.

  • entity_col (str, optional) – Name of the column in df and windows containing entity identifiers. If None, all entries are assumed to belong to the same entity. Note that entity identifiers may also be on the row index.

  • time_col (str, optional) – Name of the column in df containing observation times, and also name of column(s) in windows containing start- and end times of the windows. Note that despite its name the data type of the column is arbitrary, as long as it supports the following arithmetic- and order operations: -, /, <, <=.

  • attribute_col (str, optional) – Name of the column in df containing attribute identifiers. If None, all entries are assumed to belong to the same attribute; in that case agg may only contain one single item.

  • value_col (str, optional) – Name of the column in df containing the observed values.

  • include_start (bool, default=True) – Start times of observation windows are part of the windows.

  • include_stop (bool, default=False) – End times of observation windows are part of the windows.

  • optimize (str, default='time') – Optimize runtime or memory requirements. If set to “time”, the function returns faster but requires more memory; if set to “memory”, the runtime is longer but memory consumption is reduced to a minimum. If “time”, global variable MAX_ROWS can be used to adjust the time-memory tradeoff: increasing it increases memory consumption while reducing runtime. Note that this parameter is only relevant for computing non-rank-like aggregations, since rank-like aggregations (“rxx”, “txx”) can be efficiently computed anyway.

Returns:

Resampled data. Like windows, but with one additional column for each requested aggregation. Order of columns is arbitrary, order of rows is exactly as in windows – unless windows is a Dask DataFrame, in which case the order of rows may differ. The output is a (lazy) Dask DataFrame if windows is a Dask DataFrame, and a Pandas DataFrame otherwise, regardless of what df is.

Return type:

pd.DataFrame | dask.dataframe.DataFrame
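
For illustration, a windows DataFrame with the two column index levels described above can be built by hand as follows. This is only a sketch; the column names "subject_id" and "timestamp" and all values are hypothetical.

```python
import pandas as pd

entity_col, time_col = "subject_id", "timestamp"  # hypothetical column names

# Tuple keys produce two column index levels:
# (entity_col, ""), (time_col, "start") and (time_col, "stop")
windows = pd.DataFrame({
    (entity_col, ""): [1, 1, 2],
    (time_col, "start"): pd.to_datetime(
        ["2024-01-01 00:00", "2024-01-02 00:00", "2024-01-01 00:00"]
    ),
    (time_col, "stop"): pd.to_datetime(
        ["2024-01-02 00:00", "2024-01-03 00:00", "2024-01-01 12:00"]
    ),
})
```

Such a DataFrame would then be passed as windows, together with entity_col and time_col set to the same names.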

Notes

When passing a callable to agg, it is expected to take as input a DataFrame in and return a new DataFrame out. in has two columns time_col and value_col (in that order). Its row index specifies which entries belong to the same observation window: entries with the same row index value belong to the same window, entries with different row index values belong to distinct windows. Observation times are guaranteed to be non-N/A, values may be N/A. Note, however, that in is not necessarily sorted wrt. its row index and/or observation times! Also note that the entities the observations in in stem from (if entity_col is specified) are not known to the function. out should have one row per row index value of in (with the same row index value), and an arbitrary number of columns with arbitrary names and dtypes. Columns should be consistent in every invocation of the function. The reason why the function is not applied to each row-index-value group individually is that some aggregations can be implemented efficiently using sorting rather than grouping. The function should be stateless and must not modify in in place.

  • Example 1: A simple aggregation which calculates the fraction of values between 0 and 1 in every window could be passed as

    lambda x: x[value_col].between(0, 1).groupby(level=0).mean().to_frame('frac_between_0_1')
    
  • Example 2: A more sophisticated aggregation which fits a linear regression to the observations in every window and returns the slope of the resulting regression line could be defined as

    def slope(x):
      tmp = pd.DataFrame(
          index=x.index,
          data={time_col: x[time_col].dt.total_seconds(), value_col: x[value_col]}
      )
      return tmp[tmp[value_col].notna()].groupby(level=0).apply(
          lambda g: scipy.stats.linregress(g[time_col], y=g[value_col]).slope
      ).to_frame('slope')
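
To see what such a callable receives and returns, Example 1 can be tried on a hand-made stand-in for the input DataFrame. The data and the column names "timestamp" and "value" below are made up for illustration; in real use the input is constructed by resample_eav().

```python
import pandas as pd

time_col, value_col = "timestamp", "value"  # hypothetical column names

# Stand-in for the input: the row index identifies the observation window,
# rows are not necessarily sorted, and values may be N/A
x = pd.DataFrame(
    {
        time_col: pd.to_timedelta([1, 2, 3, 4, 5], unit="h"),
        value_col: [0.5, 2.0, 0.1, None, 0.9],
    },
    index=[0, 0, 1, 1, 1],
)


def frac_between_0_1(x):
    # N/A values count as "not between 0 and 1"
    return x[value_col].between(0, 1).groupby(level=0).mean().to_frame("frac_between_0_1")


out = frac_between_0_1(x)  # one row per window, indexed by window
```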
    
resample_interval(df: DataFrame | dask.dataframe.DataFrame, windows: DataFrame | dask.dataframe.DataFrame, attributes: list = None, entity_col=None, start_col=None, stop_col=None, attribute_col=None, value_col=None, time_col=None, epsilon=1e-07) DataFrame | dask.dataframe.DataFrame[source]

Resample interval-like data wrt. explicitly passed windows of arbitrary (possibly infinite) length. “Interval-like” means that each observation is characterized by a start- and stop time rather than a singular timestamp (as in EAV data).

Parameters:
  • df (pd.DataFrame | dask.dataframe.DataFrame) – The DataFrame to resample. Must have columns value_col (contains observed values), start_col (optional; contains start times), stop_col (optional; contains end times), attribute_col (optional; contains attribute identifiers) and entity_col (optional; contains entity identifiers). Must have one column index level. Data types are arbitrary, as long as times and entity identifiers can be compared wrt. < and <= (e.g., float, int, time delta, date time). Entity identifiers must not be NA. Values must be numeric (float, int, bool). Observation times and observed values may be NA, but such entries are ignored entirely. Although both start_col and stop_col are optional, at least one must be present. Missing start- and end columns are interpreted as -/+ inf. All intervals are closed, i.e., start- and end times are included. This is especially relevant for entries whose start time equals their end time. df can be a Dask DataFrame as well. In that case, however, entity_col must not be None and entities should already be on the row index, with known divisions. Otherwise, the row index is set to entity_col, which can be very costly both in terms of time and memory. Especially if df is known to be sorted wrt. entities already, the caller should take care of this itself; see https://docs.dask.org/en/stable/generated/dask.dataframe.DataFrame.set_index.html.

  • windows (pd.DataFrame | dask.dataframe.DataFrame | callable) – The target windows into which df is resampled. Must have either one or two column index level(s). If it has one column index level, must have columns start_col (optional; contains start times of each window), stop_col (optional; contains end times of each window) and entity_col (optional; contains entity identifiers). If it has two column index levels, the columns must be (time_col, “start”), (time_col, “stop”) and (entity_col, “”). Start- and end times may be NA, but such windows are deemed invalid and by definition do not overlap with any observation intervals. At least one of the two endpoint-columns must be present; if one is missing it is assumed to represent -/+ inf. All time windows are closed, i.e., start- and end times are included. windows can be a Dask DataFrame as well. In that case, however, entity_col must not be None and entities should already be on the row index, with known divisions. Otherwise, the row index is set to entity_col, which can be very costly both in terms of time and memory. Especially if windows is known to be sorted wrt. entities already, the caller should take care of this itself; see https://docs.dask.org/en/stable/generated/dask.dataframe.DataFrame.set_index.html. Alternatively, windows can be a callable that, when applied to a DataFrame and keyword arguments entity_col, start_col, stop_col, time_col, attribute_col and value_col, returns a DataFrame of the form described above. The canonical example of such a callable is the result returned by make_windows(); see the documentation of make_windows() for details.

  • attributes (list, optional) – The attributes to consider. Must be a list-like of attribute identifiers. None defaults to the list of all such identifiers present in column attribute_col. If attribute_col is None but attributes is not, it must be a singleton list.

  • entity_col (str, optional) – Name of the column in df and windows containing entity identifiers. If None, all entries are assumed to belong to the same entity. Note that entity identifiers may also be on the row index.

  • start_col (str, optional) – Name of the column in df (and windows if it has only one column index level) containing start times. If None, all start times are assumed to be -inf. Note that despite its name the data type of the column is arbitrary, as long as it supports the following arithmetic- and order operations: -, /, <, <=.

  • stop_col (str, optional) – Name of the column in df (and windows if it has only one column index level) containing end times. If None, all end times are assumed to be +inf. Note that despite its name the data type of the column is arbitrary, as long as it supports the following arithmetic- and order operations: -, /, <, <=.

  • attribute_col (str, optional) – Name of the column in df containing attribute identifiers. If None, all entries are assumed to belong to the same attribute.

  • value_col (str, optional) – Name of the column in df containing the observed values.

  • time_col (list | str, optional) – Name of the column(s) in windows containing start- and end times of the windows. Only needed if windows has two column index levels, because otherwise these two columns must be called start_col and stop_col, respectively.

  • epsilon – The value to set \(W_I\) to if \(I\) is infinite and \(W \cap I\) is non-empty and finite; see Notes for details.

Returns:

Resampled data. Like windows, but with one additional column for each attribute, and same number of column index levels. Order of columns is arbitrary, order of rows is exactly as in windows – unless windows is a Dask DataFrame, in which case the order of rows may differ. The output is a (lazy) Dask DataFrame if windows is a Dask DataFrame, and a Pandas DataFrame otherwise, regardless of what df is.

Notes

A typical example of interval-like data are medication records, since medications can be administered over longer time periods.

The only supported resampling aggregation is summing the observed values per time window, scaled by the fraction of the length of the intersection of observation interval and time window divided by the total length of the observation interval: Let \(W = [s, t]\) be a time window and let \(I = [a, b]\) be an observation interval with observed value \(v\). Then \(I\) contributes to \(W\) the value

\(W_I = v * \frac{|W \cap I|}{|I|}\)

The overall value of \(W\) is the sum of \(W_I\) over all intervals. Of course, all this is computed separately for each entity-attribute combination. Some remarks on the above equation are in order:

  • If \(v\) is N/A, \(W_I\) is set to 0.

  • If \(a = b\) both numerator and denominator are 0. In this case the fraction is defined as 1 if \(a \in W\) (i.e., \(s \leq a \leq t\)) and 0 otherwise.

  • If \(I\) is infinite and \(W \cap I\) is non-empty but finite, \(W_I\) is set to \(epsilon * sign(v)\). Note that \(W \cap I\) is non-empty even if it is of the form \([x, x]\). This leads to the slightly counter-intuitive situation that, for such a single-point intersection, \(W_I = epsilon * sign(v)\) if \(I\) is infinite, but \(W_I = 0\) if \(I\) is finite with \(a < b\).

  • If \(I\) and \(W \cap I\) are both infinite, the fraction is defined as 1. This is regardless of whether \(W \cap I\) equals \(I\) or whether it is a proper subset of it.
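
The rules above can be written down for a single window and a single observation interval with numeric endpoints. The following is a sketch that mirrors the formula and its special cases, not the library's vectorized implementation. The function name interval_contribution is hypothetical, and returning 0 for an empty intersection is an assumption that follows from \(|W \cap I| = 0\).

```python
import math


def interval_contribution(v, a, b, s, t, epsilon=1e-7):
    """Contribution W_I of interval I = [a, b] with value v to window W = [s, t]."""
    if v is None or math.isnan(v):
        return 0.0                       # N/A values contribute nothing
    lo, hi = max(a, s), min(b, t)        # endpoints of the intersection
    if lo > hi:
        return 0.0                       # empty intersection (assumed to yield 0)
    if a == b:
        return float(v)                  # point interval inside W: fraction is 1
    if math.isinf(b - a):                # I is infinite
        if math.isinf(hi - lo):
            return float(v)              # intersection infinite as well: fraction is 1
        sign = (v > 0) - (v < 0)
        return epsilon * sign            # finite, non-empty intersection
    return v * (hi - lo) / (b - a)       # regular case


inf = math.inf
regular = interval_contribution(10, 0, 10, 5, 20)      # half of I lies in W
point_in = interval_contribution(3, 4, 4, 0, 10)       # a = b inside W
point_out = interval_contribution(3, 4, 4, 5, 10)      # a = b outside W
inf_finite = interval_contribution(-2, 0, inf, 0, 1)   # infinite I, finite overlap
inf_inf = interval_contribution(2, 0, inf, 5, inf)     # both infinite
touch_finite = interval_contribution(2, 0, 5, 5, 9)    # overlap [5, 5], I finite
touch_inf = interval_contribution(2, -inf, 5, 5, 9)    # overlap [5, 5], I infinite
```

The last two calls reproduce the counter-intuitive case: the same single-point overlap yields 0 for a finite interval and epsilon for an infinite one.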

class make_windows(df: DataFrame | str | None = None, entity=None, start=None, stop=None, start_rel=None, stop_rel=None, duration=None, anchor=None)[source]

Bases: object

Convenience function for easily creating windows that can be passed to functions resample_eav() and resample_interval(). Note that invoking this function does not yet create the actual windows DataFrame. Instead, when the resulting callable is passed to resample_eav() or resample_interval(), it is applied to the DataFrame to be resampled. This makes it possible to refer to that DataFrame implicitly here; see the examples below for specific use-cases.

Parameters:
  • df (pd.DataFrame | str, optional) – Source DataFrame. If None, defaults to the DataFrame to be resampled in resample_eav() or resample_interval(). Can also be a string, which will be evaluated using Python’s eval() function. The string can contain references to the DataFrame to be resampled via variable df, and to column-names entity_col, time_col, start_col and stop_col passed to resample_eav() and resample_interval(). Example: “df.groupby(entity_col)[time_col].max().to_frame()”

  • entity (pd.Series | pd.Index | str | scalar, optional) – Entity of each window. Series are used as-is (possibly after re-ordering rows to match other row indices), strings refer to columns in df, and scalars are replicated to populate every window with the same value. If None, defaults to df[entity_col] if df contains that column.

  • start (pd.Series | pd.Index | str | scalar, optional) – Start time of each window. Series are used as-is (possibly after re-ordering rows to match other row indices), strings refer to columns in df, and scalars are replicated to populate every window with the same value. Note that despite its name the data type of the start times is arbitrary, as long as it supports the following arithmetic- and order operations: -, /, <, <=. start and start_rel are mutually exclusive.

  • stop (pd.Series | pd.Index | str | scalar, optional) – Stop time of each window. Series are used as-is (possibly after re-ordering rows to match other row indices), strings refer to columns in df, and scalars are replicated to populate every window with the same value. Note that despite its name the data type of the stop times is arbitrary, as long as it supports the following arithmetic- and order operations: -, /, <, <=. stop and stop_rel are mutually exclusive.

  • start_rel (pd.Series | pd.Index | str | scalar, optional) – Start time of each window, relative to anchor. Series are used as-is (possibly after re-ordering rows to match other row indices), strings refer to columns in df, and scalars are replicated to populate every window with the same value. If given, anchor must be given, too. start and start_rel are mutually exclusive.

  • stop_rel (pd.Series | pd.Index | str | scalar, optional) – Stop time of each window, relative to anchor. Series are used as-is (possibly after re-ordering rows to match other row indices), strings refer to columns in df, and scalars are replicated to populate every window with the same value. If given, anchor must be given, too. stop and stop_rel are mutually exclusive.

  • duration (pd.Series | pd.Index | str | scalar, optional) – Duration of each window. Series are used as-is (possibly after re-ordering rows to match other row indices), strings refer to columns in df, and scalars are replicated to populate every window with the same value. Durations can only be specified if exactly one endpoint (either start or stop) is specified; the other endpoint is then computed from duration.

  • anchor (pd.Series | pd.Index | str | scalar, optional) – Anchor time start_rel and stop_rel refer to. Series are used as-is (possibly after re-ordering rows to match other row indices), strings refer to columns in df, and scalars are replicated to populate every window with the same value. Ignored unless start_rel or stop_rel is given. If start_rel or stop_rel is given but anchor is None, it defaults to time_col, but a warning message is printed.

Notes

  • The current implementation does not support Dask DataFrames.

  • This function does not check whether windows are non-empty, i.e., whether start times come before end times.

Examples

  • Use-case 1: Create fixed-length windows relative to the time column in the DataFrame to be resampled. Since anchor is required by start_rel but not set explicitly, it defaults to time_col, but a warning message is printed.

    resample_eav(
        df_to_be_resampled,
        make_windows(
            start_rel=pd.Timedelta("-1 day"),
            stop_rel=pd.Timedelta("-1 hour")
        ),
        ...
    )
    
  • Use-case 2: Similar to use-case 1, but only create one window per entity, for the temporally last entry. Note how the DataFrame to be resampled is only passed once directly to function resample_eav(); make_windows() refers to it implicitly via variable name “df” in the string of keyword argument df. Note also that the resulting DataFrame may have entities on its row index.

    resample_eav(
        df_to_be_resampled,
        make_windows(
            df="df.groupby(entity_col)[time_col].max().to_frame()",
            start_rel=pd.Timedelta("-7 days"),
            duration=pd.Timedelta("5 days"),
            anchor="timestamp"
        ),
        time_col="timestamp",
        entity_col=...,
        ...
    )
    
  • Use-case 3: make_windows() can be used with function resample_interval(), too – regardless of whether time_col is passed to resample_interval() or not.

    resample_interval(
        df_to_be_resampled,
        make_windows(
            stop=pd.Series(...),
            duration=pd.Series(...),    # must have the same row index as the Series passed to `stop`
        ),
        start_col=...,
        stop_col=...,
        time_col=...,                   # optional
        ...
    )
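
The arithmetic behind relative endpoints, as in use-case 1, can be sketched in plain pandas. The sketch assumes that start_rel and stop_rel are simply added to the anchor; the anchor values are made up for illustration.

```python
import pandas as pd

# Hypothetical anchor times, e.g. the observation times in time_col
anchor = pd.Series(pd.to_datetime(["2024-01-02 12:00", "2024-01-05 08:00"]))

# Assumed semantics: endpoint = anchor + relative offset
start = anchor + pd.Timedelta("-1 day")
stop = anchor + pd.Timedelta("-1 hour")

# Each window covers the 23 hours ending one hour before its anchor
length = stop - start
```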
    
prev_next_values(df: DataFrame, sort_by=None, group_by=None, columns=None, first_indicator_name=None, last_indicator_name=None, keep_sorted: bool = False, inplace: bool = False) DataFrame[source]

Find the previous/next values of some columns in DataFrame df, for every entry. Additionally, entries can be grouped and previous/next values only searched within each group.

Parameters:
  • df (pd.DataFrame) – The DataFrame.

  • sort_by (list | str, optional) – The column(s) to sort by. Can be the name of a single column or a list of column names and/or row index levels. Strings are interpreted as column names or row index names, integers are interpreted as row index levels. ATTENTION! N/A values in columns to sort by are not ignored; rather, they are treated in the same way as Pandas treats such values in DataFrame.sort_values(), i.e., they are put at the end.

  • group_by (list | str, optional) – Column(s) to group df by, optional. Same values as sort_by.

  • columns (dict) –

    A dict mapping column names to dicts of the form

    {
        "prev_name": <prev_name>,
        "prev_fill": <prev_fill>,
        "next_name": <next_name>,
        "next_fill": <next_fill>
    }
    

    prev_name and next_name are the names of the columns in the result, containing the previous/next values. If any of them is None, the corresponding previous/next values are not computed for that column. prev_fill and next_fill specify which values to assign to the first/last entry in every group, which does not have any previous/next values. Note that column names not present in df are tacitly skipped.

  • first_indicator_name (str, optional) – Name of the column in the result containing boolean indicators whether the corresponding entries come first in their respective groups. If None, no such column is added.

  • last_indicator_name (str, optional) – Name of the column in the result containing boolean indicators whether the corresponding entries come last in their respective groups. If None, no such column is added.

  • keep_sorted (bool, default=False) – Keep the result sorted wrt. group_by and sort_by. If False, the order of rows of the result is identical to that of df.

  • inplace (bool, default=False) – If True, the new columns are added to df.

Returns:

The modified DataFrame if inplace is True, a DataFrame with the requested previous/next values otherwise.

Return type:

pd.DataFrame
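
The underlying idea can be sketched with a grouped shift in plain pandas. This only illustrates the concept and is not the implementation of prev_next_values(); the column names and the fill value -1 are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    "patient": ["a", "b", "a", "b", "a"],
    "t": [3, 1, 1, 2, 2],
    "hr": [70, 80, 60, 85, 65],
})

# Sort within groups, then shift by one position in either direction
ordered = df.sort_values(["patient", "t"])
grouped = ordered.groupby("patient")["hr"]

# assign() aligns on the row index, so the original row order of df is kept
result = df.assign(
    hr_prev=grouped.shift(1).fillna(-1),    # -1 plays the role of prev_fill
    hr_next=grouped.shift(-1).fillna(-1),   # -1 plays the role of next_fill
    is_first=ordered.groupby("patient").cumcount() == 0,
)
```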

Bootstrapping

class Bootstrapping(*args, kwargs: dict | None = None, fn=None, seed=None, replace: bool = True, size: int | float = 1.0)

Bases: object

Class for performing bootstrapping [1], i.e., repeatedly sample with replacement from given data and evaluate statistics on each resample to obtain mean, standard deviation, etc. for more robust estimates.

Parameters:
  • *args (array-like) – Data, non-empty sequence of array-likes with the same length.

  • kwargs (dict, optional) – Additional keyword arguments passed to the function fn computing the statistics. Like args, the values of the dict must be array-likes with the same length as the elements of args.

  • fn (callable | dict | tuple, optional) – The statistics to compute. Must be None, a function that takes the given args as input and returns a scalar/array-like or a (nested) dict/tuple thereof, or a (nested) dict/tuple of such functions.

  • seed (int, optional) – Random seed.

  • replace (bool, default=True) – Whether to resample with replacement. If False, this does not actually correspond to bootstrapping.

  • size (int | float, default=1.) – The size of the resampled data. If <= 1, it is multiplied with the number of samples in the given data. Bootstrapping normally assumes that resampled data have the same number of samples as the original data, so this parameter should be set to 1.
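
As a reminder of what bootstrapping computes, here is a minimal NumPy sketch of the resampling loop for a single statistic (the mean). It does not use the Bootstrapping class; the data and the variable names are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
data = np.arange(100, dtype=float)   # toy data with mean 49.5
n_repetitions = 200

results = []
for _ in range(n_repetitions):
    # Sample indices with replacement, same size as the original data
    idx = rng.integers(0, len(data), size=len(data))
    results.append(data[idx].mean())
results = np.asarray(results)

# Aggregate over repetitions to estimate the statistic and its variability
boot_mean, boot_std = results.mean(), results.std()
```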

References

run(n_repetitions: int = 100, sample_indices: ndarray | None = None) Bootstrapping

Run bootstrapping for a given number of repetitions, and store the results in a list. Results are appended to results from previous runs!

Parameters:
  • n_repetitions (int, default=100) – Number of repetitions.

  • sample_indices (ndarray, optional) – Pre-computed sample indices to use in each repetition. If not None, n_repetitions is ignored and sample_indices must have shape (n, size).

subsample(seed: int | None = None) ndarray

Construct a subsample.

Parameters:

seed (int, optional) – Random seed to use.

Return type:

Array with subsample indices.

get_sample_indices() ndarray

Get sample indices used for resampling the data.

Return type:

Array of shape (n_runs, size).

agg(func)

Compute aggregate statistics of the results of the individual runs, like mean, standard deviation, etc.

Parameters:

func (str | callable) – The aggregation function to apply. If a string, can be the name of a Numpy function (“mean”, “std”, etc.), or “iqr” (interquartile range) or “ci<alpha>” (confidence interval wrt. alpha).

Return type:

Aggregated results.

dataframe(keys=None) pandas.DataFrame | None

Construct a pandas DataFrame with all results, if possible. Only works for (dicts/tuples of) scalar values.

Returns:

DataFrame whose columns correspond to individual metrics and whose rows correspond to runs, or None.

Return type:

pandas.DataFrame

describe(keys=None) pandas.Series | pandas.DataFrame

Describe the results of the individual runs by computing a predefined set of statistics, similar to pandas’ describe() method. Only works for (dicts/tuples of) scalar values.

Returns:

DataFrame or Series with descriptive statistics.

Return type:

pandas.Series | pandas.DataFrame

Summary

summarize_performance(directories: Iterable[str | Path], metrics: Iterable[str | tuple], split: str | Iterable[str] | None = None, path_callback=None) DataFrame[source]

Summarize the performance of multiple prediction models trained and evaluated with CaTabRa. This is a convenient way for quickly comparing them and selecting the best model(s) for a certain task. An implicit assumption of this function is that all models were trained on the same prediction task.

IMPORTANT: Only pre-evaluated metrics in “metrics.xlsx” and “bootstrapping.xlsx” are considered!

Parameters:
  • directories (Iterable[str | Path]) – The directories to consider, an iterable of path-like objects. Each directory must be the output directory of an invocation of catabra.evaluate, or a subdirectory corresponding to a specific split (containing “metrics.xlsx” and maybe also “bootstrapping.xlsx”). A convenient way to specify a couple of directories matching a certain pattern is by using Path(root_path).rglob(pattern).

  • metrics (Iterable[str]) –

    List of metrics to include in the summary, an iterable of strings. Values must match the following pattern:

    "[target:]metric_name[@threshold][(bootstrapping_aggregation)]"
    
    • target is optional and specifies the target (or class in case of multiclass classification); can be “*” to include all available targets, and can be a sequence separated by “,”. Ignored if bootstrapping_aggregation is specified.

    • metric_name is the name of the actual metric, exactly as written in “metrics.xlsx” or “bootstrapping.xlsx”; can be “*” to include all available pre-evaluated metrics, and can be a sequence separated by “,”.

    • threshold is optional and must be a numeral between 0 and 1 (cannot be a string like “balance”), and cannot be “*”. Only relevant for threshold-dependent classification metrics, and mutually exclusive with bootstrapping_aggregation. Note that the given threshold must exactly match one of the thresholds evaluated in “metrics.xlsx”.

    • bootstrapping_aggregation is optional and specifies the bootstrapping aggregation to include, like “mean”, “std”, etc.; can be “*” to include all available pre-evaluated aggregations in “bootstrapping.xlsx”, and can be a sequence separated by “,”.

  • split (Iterable[str], default=None) – If a directory in directories has subdirectories corresponding to data splits that were evaluated separately, only include the splits in split. If None, all splits are included.

  • path_callback (Callable, default=None) – Callback function applied to every path visited. Must return None, True, False or a dict: False drops the current path from the output; True and None are equivalent to {}; a dict adds one column per key to the output DataFrame, filled with the corresponding value.
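The directory collection via Path(root_path).rglob(pattern) and the return-value contract of path_callback can be illustrated with a minimal sketch. It does not call CaTabRa: the helper collect_eval_dirs, the directory layout and the "run" column are hypothetical, and the sketch assumes that the callback receives a single path-like argument.

```python
import tempfile
from pathlib import Path


def collect_eval_dirs(root, pattern="metrics.xlsx"):
    """Return the sorted parent directories of all files under `root` matching `pattern`."""
    return sorted({p.parent for p in Path(root).rglob(pattern)})


def path_callback(path):
    """Hypothetical callback: drop 'debug' runs, tag every other path with its run name."""
    run = Path(path).parent.name
    if run == "debug":
        return False          # False: drop this path from the summary
    return {"run": run}       # dict: adds a column "run" to the output


# Build a throwaway directory tree (made-up layout) to demonstrate the behavior.
with tempfile.TemporaryDirectory() as root:
    for name in ("run_a/eval", "run_b/eval", "debug/eval"):
        d = Path(root, name)
        d.mkdir(parents=True)
        (d / "metrics.xlsx").touch()

    directories = collect_eval_dirs(root)
    found = [d.relative_to(root).as_posix() for d in directories]
    kept = [d.relative_to(root).as_posix() for d in directories
            if path_callback(d) is not False]
```

In real use, a list like directories and a function like path_callback would be passed as the directories and path_callback arguments, respectively.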

Returns:

DataFrame with one row per evaluation and one column per performance metric. If multiple splits are included in the performance summary, each is put into a separate row.

Return type:

DataFrame

Examples

Example metric specifications:

  • “roc_auc”

  • “roc_auc(mean,std)”

  • “accuracy,sensitivity@0.5”

  • “*@0.5”

  • “r2(*)”

  • “*(*)”

  • “target_1:mean_squared_error”

  • “*:mean_squared_error”

  • “*:*(*)”

  • “__threshold(mean,std)”
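To make the specification pattern concrete, the following sketch splits a specification string into its four parts with a regular expression. It is an illustration only, not CaTabRa's actual parser: the function parse_metric_spec, the returned dict layout and the exact character classes are assumptions, and how a threshold combines with a comma-separated metric list is left open.

```python
import re

# Assumed grammar: [target:]metric_name[@threshold][(bootstrapping_aggregation)]
_SPEC = re.compile(
    r"^(?:(?P<target>[^:@()]+):)?"          # optional "target:" prefix
    r"(?P<metric>[^:@()]+)"                 # metric name(s), possibly "*"
    r"(?:@(?P<threshold>[0-9]*\.?[0-9]+))?"  # optional numeric "@threshold"
    r"(?:\((?P<agg>[^()]+)\))?$"            # optional "(aggregation)" suffix
)


def _split(value):
    """Split a comma-separated group into a list; keep None for absent groups."""
    return None if value is None else [v.strip() for v in value.split(",")]


def parse_metric_spec(spec):
    """Parse one metric specification into a dict (hypothetical helper)."""
    m = _SPEC.match(spec.strip())
    if m is None:
        raise ValueError(f"invalid metric specification: {spec!r}")
    if m["threshold"] is not None and m["agg"] is not None:
        raise ValueError("threshold and bootstrapping aggregation are mutually exclusive")
    threshold = None if m["threshold"] is None else float(m["threshold"])
    if threshold is not None and not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must lie between 0 and 1")
    return {
        "targets": _split(m["target"]),
        "metrics": _split(m["metric"]),
        "threshold": threshold,
        "aggregations": _split(m["agg"]),
    }
```

With this sketch, a non-numeric threshold such as “roc_auc@balance” is rejected with a ValueError, in line with the restriction stated for threshold.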

See also

summarize_importance

Summarize feature importance scores.

summarize_importance(directories: Iterable[str | Path], columns: str | Iterable[str] | None = None, new_column_name: str = '{feature} {column}', glob: bool = False, split: str | Iterable[str] | None = None, model_id: str | Iterable[str] | None = None, path_callback=None) DataFrame[source]

Summarize the feature importance of multiple prediction models trained and explained with CaTabRa. This is a convenient way to quickly compare them. The function implicitly assumes that all models were trained on the same prediction task, and that the same feature importance calculation method was applied to generate the importance scores.

IMPORTANT: Only pre-evaluated feature importance scores are considered!

Parameters:
  • directories (Iterable[str | Path]) – The directories to consider, an iterable of path-like objects. Each directory must be the output directory of an invocation of catabra.explain, or a subdirectory corresponding to a specific split (containing HDF5 files with feature importance scores). A convenient way to specify a couple of directories matching a certain pattern is by using Path(root_path).rglob(pattern).

  • columns (str | Iterable[str], default=None) – The columns in global feature importance scores to consider. For instance, if catabra.explanation.average_local_explanations() is used to produce global scores, the 4 columns “>0”, “<0”, “>0 std” and “<0 std” are normally generated. This parameter allows including only a subset of them in the summary. None defaults to all columns.

  • new_column_name (str, default='{feature} {column}') – String pattern specifying the names of the columns in the output DataFrame. May contain two named fields, feature and column, which are filled with the original feature and column names, respectively.

  • glob (bool, default=False) – Whether the feature importance scores in directories are global. If not, catabra.explanation.average_local_explanations() is applied.

  • split (str | Iterable[str], default=None) – If a directory in directories has subdirectories corresponding to data splits that were explained separately, only include the splits in split. If None, all splits are included.

  • model_id (str | Iterable[str], default=None) – Model-IDs to consider, optional. Determines the names of the HDF5 files to be included. None defaults to all found model-IDs.

  • path_callback (Callable, default=None) – Callback function applied to every path visited. Must return None, True, False or a dict: False drops the current path from the output; True and None are equivalent to {}; a dict adds one column per key to the output DataFrame, filled with the corresponding value.
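How new_column_name maps feature-column pairs to output column names can be sketched as follows. The sketch assumes that the pattern is filled with str.format semantics; the helper make_column_names, the feature names and the ordering of the result are hypothetical and not taken from CaTabRa.

```python
def make_column_names(features, columns, new_column_name="{feature} {column}"):
    """Build one output column name per (feature, column) pair.

    Assumption: the pattern is expanded with str.format and the two
    named fields `feature` and `column`.
    """
    return [
        new_column_name.format(feature=feature, column=column)
        for feature in features
        for column in columns
    ]


# Made-up feature names combined with two of the columns mentioned above.
default_names = make_column_names(["age", "bmi"], [">0", "<0"])
custom_names = make_column_names(["age"], [">0"], new_column_name="{column}|{feature}")
```

A pattern that omits one of the two fields maps several feature-column pairs to the same name, so patterns using both fields are the safer choice.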

Returns:

DataFrame with one row per explanation and one column per feature-column pair. If multiple splits are included in the importance summary, each is put into a separate row. If there are multiple targets (multiclass/multilabel classification, multioutput regression) and the feature importance scores for each target are stored in a separate table, each is put into a separate row and an additional column “__target__” is added.

Return type:

DataFrame

See also

summarize_performance

Summarize model performance.