Utilities

IO

make_path(p: str | Path, absolute: bool = False) Path[source]

Convert a path-like object into a proper path object, i.e., an instance of class Path.

Parameters:
  • p (str | Path) – Path-like object. If an instance of Path and absolute is False, p is returned unchanged.

  • absolute (bool, default=False) – Whether to make sure that the output is an absolute path. If False, the path may be relative.

Returns:

Path object.

Return type:

Path

read_df(fn: str | Path, key: str | Iterable[str] = 'table') DataFrame[source]

Read a DataFrame from a CSV, Excel, HDF5, Pickle or Parquet file. The file type is determined from the file extension of the given file.

Parameters:
  • fn (str | Path) – The file to read.

  • key (str | Iterable[str], default='table') – The key(s) in the HDF5 file, if fn is an HDF5 file. Defaults to “table”. If an iterable, all keys are read and concatenated along the row axis.

Returns:

A DataFrame.

Return type:

DataFrame
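
A minimal usage sketch; the import path catabra.util.io is an assumption and may differ in your installation:

    from catabra.util.io import read_df  # assumed import path

    df_csv = read_df("data/measurements.csv")              # format inferred from ".csv"
    df_h5 = read_df("data/measurements.h5", key="table")   # single HDF5 key
    # passing several keys concatenates the corresponding tables along the row axis
    df_all = read_df("data/measurements.h5", key=["train", "test"])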

read_dfs(fn: str | Path) Dict[str, DataFrame][source]

Read multiple DataFrames from a single file.

  • If an Excel file, all sheets are read and returned.

  • If an H5 file, all top-level keys are read and returned.

  • If any other file, the singleton dict {“table”: df} is returned, where df is the single DataFrame contained in the file.

Parameters:

fn (str | Path) – The file to read.

Returns:

A dict mapping keys to DataFrames, possibly empty.

Return type:

Dict[str, DataFrame]

write_df(df: DataFrame, fn: str | Path, key: str = 'table', mode: str = 'w')[source]

Write a DataFrame to file. The file type is determined from the file extension of the given file.

Parameters:
  • df (DataFrame) – The DataFrame to write.

  • fn (str | Path) – The target file name.

  • key (str, default='table') – The key in the HDF5 file, if fn is an HDF5 file. If None, fn may contain only one table.

  • mode (str, default='w') – The mode in which the HDF5 file shall be opened, if fn is an HDF5 file. Ignored otherwise.

write_dfs(dfs: Dict[str, DataFrame], fn: str | Path, mode: str = 'w')[source]

Write a dict of DataFrames to file. The file type is determined from the file extension of the given file. Unless fn is an Excel or HDF5 file, dfs must be empty or a singleton.

Parameters:
  • dfs (dict) – The DataFrames to write. If empty and mode differs from “a”, the file is deleted.

  • fn (str | Path) – The target file name.

  • mode (str, default='w') – The mode in which the file shall be opened, if fn is an Excel- or HDF5 file. Ignored otherwise.

load(fn: str | Path)[source]

Load a Python object from disk. The object can be stored in JSON, Pickle or joblib format. The format is automatically determined based on the given file extension:

  • “.json” => JSON

  • “.pkl”, “.pickle” => Pickle

  • “.joblib” => joblib

Parameters:

fn (str | Path) – The file to load.

Returns:

The loaded object.

Return type:

Any

dump(obj, fn: str | Path)[source]

Dump a Python object to disk, either as a JSON, Pickle or joblib file. The format is determined automatically based on the given file extension:

  • “.json” => JSON

  • “.pkl”, “.pickle” => Pickle

  • “.joblib” => joblib

Parameters:
  • obj – The object to dump.

  • fn (str | Path) – The file.

Notes

When dumping objects as JSON, calling to_json() beforehand might be necessary to ensure compliance with the JSON standard. joblib is preferred over Pickle, as it is more efficient if the object contains large Numpy arrays.
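
A short sketch of the round trip described above; the import path is an assumption, and to_json() is applied before JSON serialization as recommended in the Notes:

    from catabra.util.io import dump, load, to_json  # assumed import path

    stats = {"n_samples": 1200, "auc": 0.87}
    dump(to_json(stats), "stats.json")      # ".json" => JSON; to_json() ensures compliance
    dump(stats, "stats.joblib")             # ".joblib" => joblib, efficient for large Numpy arrays
    restored = load("stats.joblib")         # format determined from the extension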

to_json(x)[source]

Returns a JSON-compliant representation of the given object.

Parameters:

x – Arbitrary object.

Returns:

Representation of x that can be serialized as JSON.

Return type:

Any

convert_rows_to_str(d: dict | DataFrame, rowindex_to_convert: list, inplace: bool = True, skip: list = []) dict | DataFrame[source]

Convert rows (indexed via rowindex_to_convert) to str. Mainly used when saving DataFrames, e.g., to avoid missing values in .xlsx files for data types such as timedelta.

Parameters:
  • d (dict | DataFrame) – Single DataFrame or dictionary of DataFrames.

  • rowindex_to_convert (list) – List of row indices (e.g., features) that should be converted to str.

  • inplace (bool, default=True) – Whether to modify the input data in place or a deep copy of it.

  • skip (list, default=[]) – List of column(s) that should not be converted to str.

Returns:

The single DataFrame or dictionary of DataFrames, with the specified rows converted to str.

Return type:

DataFrame | dict

class CaTabRaLoader(path: str | Path, check_exists: bool = True)[source]

Bases: object

CaTabRaLoader for conveniently accessing artifacts generated by analyzing tables, like trained models, configs, encoders, etc.

Parameters:
  • path (str | Path) – Path to the CaTabRa directory.

  • check_exists (bool, default=True) – Check whether the directory pointed to by path exists.

get_fitted_ensemble(from_model: bool = False) FittedEnsemble | None[source]

Get the trained prediction model as a FittedEnsemble object.

Parameters:

from_model (bool, default=False) – Whether to convert a plain model of type AutoMLBackend into a FittedEnsemble object, if such an object does not exist in the directory.

get_explainer(explainer: str | None = None, fitted_ensemble: FittedEnsemble | None = None) EnsembleExplainer | None[source]

Get the explainer object.

Parameters:
  • explainer (str, optional) – Name of the explainer to load. If None, the first explainer specified in config param “explainer” is loaded.

  • fitted_ensemble (FittedEnsemble, optional) – Pre-loaded FittedEnsemble object. If None, method get_fitted_ensemble() is used for loading it.

get_train_data() DataFrame | None[source]

Get the training data copied into the directory, “train_data.h5”. In contrast to get_table(), this is only the data actually used for training.

get_table(keep_singleton: bool = False) DataFrame | List[DataFrame] | None[source]

Get the table(s) originally passed to analyze(), if they still reside in their original location.

Parameters:

keep_singleton (bool, default=False) – Whether to keep singleton lists. If False, a single DataFrame is returned in that case.

Logging

prompt(msg: str, accepted: List[str] | None = None, allow_headless: bool = True) str[source]

Prompt the user for input.

Parameters:
  • msg (str) – The message to be printed.

  • accepted (list, optional) – List of accepted inputs. Must be lower-case. If None, all inputs are accepted.

  • allow_headless (bool, default=True) – What to do in headless mode. If True, the first element of accepted is returned if accepted is a list, and “” is returned if accepted is None. If False, a RuntimeError is raised.

Returns:

The input of the user, an element of accepted if accepted is a list, or arbitrary if accepted is None.

Return type:

str

progress_bar(iterable, desc: str | None = None, total: int | None = None, disable: bool = False, meter_width: int = 40)[source]

Show a simple progress bar when iterating over a given iterable. This works similarly to the tqdm package but, in contrast to tqdm, also works when mirroring messages to a file.

Parameters:
  • iterable – The iterable.

  • desc (str, optional) – Description to add to the beginning of the progress bar, optional.

  • total (int, optional) – Total number of elements in iterable if iterable does not implement the __len__() method.

  • disable (bool, default=False) – Whether to disable the progress bar. If True, the behavior is equivalent to not calling this function at all.

  • meter_width (int, default=40) – The width of the meter, in characters. Should not be too long to make the whole progress bar fit into a single line. Might have to be decreased if desc is a long text.
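
A minimal sketch of how the function might be used; the import path catabra.util.logging is an assumption:

    from catabra.util.logging import progress_bar  # assumed import path

    total = 0
    for i in progress_bar(range(1000), desc="processing", meter_width=30):
        total += i   # the bar is rendered on the console and in any mirrored log file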

class LogMirror(log_path: str, mode: str = 'w')[source]

Bases: object

Used to temporarily mirror both stderr and stdout to a log file. Based on [1] and [2].

Examples

>>> with LogMirror("log.txt"):
...     log("writing to log.txt and the console")
...     err("works with errors as well")
...     warn("and in case you need warnings")
...     print("no need to use the custom log functions")

References

Common

fresh_name(name, lst: Iterable)[source]

Create a fresh name based on name, i.e., a name that does not appear in lst.

Parameters:
  • name – An arbitrary object. If a list, tuple or set, all elements of name are processed individually, and they are ensured to be distinct from each other.

  • lst (Iterable) – A list-like structure.

Returns:

If name does not appear in lst, name is returned as-is. Otherwise, a numeric suffix is added to the string representation of name.

Return type:

Any
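
A small illustration, assuming the function can be imported from catabra.util.common; the exact suffix format is not specified above:

    from catabra.util.common import fresh_name  # assumed import path

    fresh_name("score", ["id", "label"])    # no clash => "score" is returned as-is
    fresh_name("score", ["id", "score"])    # clash => "score" plus some numeric suffix
    fresh_name(["a", "a"], [])              # list input => elements are made distinct from each other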

repr_list(lst: list | tuple, limit: int | None = 50, delim: str = ', ', brackets: bool = True) str[source]

Return a string representation of some list, limiting the displayed items to a certain number.

Parameters:
  • lst (list | tuple) – The list.

  • limit (int, default=50) – The maximum number of displayed items.

  • delim (str, default=', ') – The item delimiter.

  • brackets (bool, default=True) – Whether to add brackets.

Returns:

String representation of lst.

Return type:

str

repr_timedelta(delta, subsecond_resolution: int = 0) str[source]

Return a string representation of some time delta. Minutes and seconds are always displayed, hours and days only if needed. Format is “d days hh:mm:ss”.

Parameters:
  • delta – Time delta to represent, either a float or an object with a total_seconds() method (e.g., a pandas Timedelta instance). Floats are assumed to be given in seconds.

  • subsecond_resolution (int, default=0) – The subsecond resolution to display, i.e., number of decimal places.

Returns:

String representation of delta.

Return type:

str

Plotting

save(fig, fn: str | Path, png: bool = False)[source]

Save a figure or a list of figures to disk.

Parameters:
  • fig – The figure(s) to save. May be a Matplotlib figure object, a plotly figure object, or a dict whose values are such figure objects.

  • fn (str | Path) – The file or directory. It is recommended to leave the file extension unspecified and simply pass “/path/to/figure” instead of “/path/to/figure.png”. The file extension is then determined automatically depending on the type of fig and on the value of png. If fig is a dict, fn refers to the parent directory.

  • png (bool, default=False) – Whether to save Matplotlib figures as PNG or as PDF. Ignored if a file extension is specified in fn or if fig is a plotly figure, which are always saved as HTML.
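
A hedged sketch; the import path catabra.util.plotting is an assumption:

    import matplotlib.pyplot as plt
    from catabra.util.plotting import save  # assumed import path

    fig, ax = plt.subplots()
    ax.plot([0, 1], [0, 1])
    # leave the extension unspecified; it is chosen based on the figure type and `png`
    save(fig, "output/diagonal", png=True)
    # a dict of figures is saved into the directory "output/figures", one file per key
    save({"diag": fig}, "output/figures")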

Metrics

to_score(func)[source]

Convenience function for converting a metric into a (possibly different) metric that returns scores (i.e., higher values correspond to better results). That means, if the given metric returns scores already, it is returned unchanged. Otherwise, it is negated.

Parameters:

func – The metric to convert, e.g., accuracy, balanced_accuracy, etc. Note that in case of classification metrics, both thresholded and non-thresholded metrics are accepted.

Returns:

Either func itself or -func.

Return type:

Any

get(name)[source]

Retrieve a metric function given by its name.

Parameters:

name – The name of the requested metric function. It may be of the form “name @ threshold”, where name is the name of a thresholded classification metric (e.g., “accuracy”) and threshold is the desired threshold. Furthermore, some synonyms are recognized as well, most notably “precision” for “positive_predictive_value” and “recall” for “sensitivity”. threshold can also be the name of a thresholding strategy; see function thresholded() for details.

Returns:

Metric function (callable).

Return type:

Any
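
A few examples of the naming scheme described above; importing the metrics module as catabra.util.metrics is an assumption:

    from catabra.util import metrics  # assumed import path

    acc_03 = metrics.get("accuracy @ 0.3")       # accuracy at a fixed decision threshold of 0.3
    ppv = metrics.get("precision")               # synonym for "positive_predictive_value"
    acc_bal = metrics.get("accuracy @ balance")  # threshold chosen by the "balance" strategy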

bootstrapped(func, n_repetitions: int = 100, agg='mean', seed=None, replace: bool = True, size: int | float = 1.0, **kwargs)[source]

Convenience function for converting a metric into its bootstrapped version.

Parameters:
  • func – The metric to convert, e.g., roc_auc, accuracy, mean_squared_error, etc.

  • n_repetitions (int, default=100) – Number of bootstrapping repetitions to perform. If 0, func is returned unchanged.

  • agg (default='mean') – Aggregation to compute of bootstrapping results.

  • seed (int, optional) – Random seed.

  • replace (bool, default=True) – Whether to resample with replacement. If False, this does not actually correspond to bootstrapping.

  • size (int | float, default=1.) – The size of the resampled data. If <= 1, it is multiplied by the number of samples in the given data. Bootstrapping normally assumes that resampled data have the same number of samples as the original data, so this parameter should be set to 1.

  • **kwargs – Additional keyword arguments that are passed to func upon application. Note that only arguments that do not need to be resampled can be passed here; in particular, this excludes sample_weight.

Returns:

New metric that, when applied to y_true and y_hat, resamples the data, evaluates the metric on each resample, and returns some aggregation (typically the average) of the results thus obtained.

Return type:

Any
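
A minimal sketch, assuming roc_auc is available in the same metrics module (as the parameter description suggests) and that the module can be imported as below:

    import numpy as np
    from catabra.util import metrics  # assumed import path

    y_true = np.array([0, 1, 1, 0, 1, 0, 1, 1])
    y_score = np.array([0.1, 0.8, 0.7, 0.4, 0.9, 0.2, 0.6, 0.3])

    boot_auc = metrics.bootstrapped(metrics.roc_auc, n_repetitions=200, seed=42)
    print(boot_auc(y_true, y_score))   # mean ROC-AUC over 200 bootstrap resamples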

balance_score_threshold(y_true, y_score, sample_weight: ndarray | None = None) Tuple[float, float][source]

Compute the balance score and balance threshold of a binary classification problem.

Parameters:
  • y_true – Ground truth, with 0 representing the negative class and 1 representing the positive class. Must not contain NaN.

  • y_score – Predicted scores, i.e., the higher a score the more confident the model is that the sample belongs to the positive class. Range is arbitrary.

  • sample_weight (ndarray, optional) – Sample weights.

Returns:

Pair (balance_score, balance_threshold), where balance_threshold is the decision threshold that minimizes the difference between sensitivity and specificity, i.e., it is defined as

\[\arg\min_t |sensitivity(y_true, y_score >= t) - specificity(y_true, y_score >= t)|\]

balance_score is the corresponding sensitivity value, which by definition is approximately equal to specificity and can furthermore be shown to be approximately equal to accuracy and balanced accuracy, too.

Return type:

tuple
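
A small sketch of the intended use (module path assumed):

    import numpy as np
    from catabra.util import metrics  # assumed import path

    y_true = np.array([0, 0, 1, 1, 1, 0])
    y_score = np.array([0.2, 0.4, 0.7, 0.6, 0.9, 0.3])

    score, threshold = metrics.balance_score_threshold(y_true, y_score)
    # `threshold` balances sensitivity and specificity; `score` is the sensitivity at that threshold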

prevalence_score_threshold(y_true, y_score, sample_weight: ndarray | None = None) Tuple[float, float][source]

Compute the prevalence score and prevalence threshold of a binary classification problem.

Parameters:
  • y_true – Ground truth, with 0 representing the negative class and 1 representing the positive class. Must not contain NaN.

  • y_score – Predicted scores, i.e., the higher a score the more confident the model is that the sample belongs to the positive class. Range is arbitrary.

  • sample_weight (ndarray, optional) – Sample weights.

Returns:

Pair (prevalence_score, prevalence_threshold), where prevalence_threshold is the decision threshold that minimizes the difference between the number of positive samples in y_true (m) and the number of predicted positives. In other words, the threshold is set to the m-th largest value in y_score. If sample_weight is given, the threshold minimizes the difference between the total weight of all positive samples and the total weight of all samples predicted positive. prevalence_score is the corresponding sensitivity value, which can be shown to be approximately equal to positive predictive value and F1.

Return type:

tuple

zero_one_threshold(y_true, y_score, sample_weight: ndarray | None = None, specificity_weight: float = 1.0) float[source]

Compute the threshold corresponding to the (0,1)-criterion [1] of a binary classification problem. Although this is a popular strategy for selecting decision thresholds, [1] advocates maximizing informedness (aka Youden index) instead, which is equivalent to maximizing balanced accuracy.

Parameters:
  • y_true – Ground truth, with 0 representing the negative class and 1 representing the positive class. Must not contain NaN.

  • y_score – Predicted scores, i.e., the higher a score the more confident the model is that the sample belongs to the positive class. Range is arbitrary.

  • sample_weight (ndarray, optional) – Sample weights.

  • specificity_weight (float, default=1.) – The relative weight of specificity wrt. sensitivity. 1 means that sensitivity and specificity are weighted equally, a value < 1 means that sensitivity is weighted more strongly than specificity, and a value > 1 means that specificity is weighted more strongly than sensitivity. See the formula below for details.

Returns:

Decision threshold that minimizes the Euclidean distance between the point (0, 1) and the point (1 - specificity, sensitivity), i.e., arg min_t (1 - sensitivity(y_true, y_score >= t)) ** 2 + specificity_weight * (1 - specificity(y_true, y_score >= t)) ** 2

Return type:

float

References

argmax_score_threshold(func, y_true, y_score, sample_weight: ndarray | None = None, discretize=100, **kwargs) Tuple[float, float][source]

Compute the decision threshold that maximizes a given binary classification metric or callable. Since in most built-in classification metrics larger values indicate better results, there is no analogous argmin_score_threshold().

Parameters:
  • func – The metric or function to maximize. If a string, function get() is called on it.

  • y_true – Ground truth, with 0 representing the negative class and 1 representing the positive class. Must not contain NaN.

  • y_score – Predicted scores, i.e., the higher a score the more confident the model is that the sample belongs to the positive class. Range is arbitrary.

  • sample_weight (ndarray, optional) – Sample weights.

  • discretize (default=100) – Discretization steps for limiting the number of calls to func. If None, no discretization happens, i.e., all unique values in y_score are tried.

  • **kwargs – Additional keyword arguments passed to func.

Returns:

Pair (score, threshold), where threshold is the decision threshold that maximizes func, i.e., arg max_t func(y_true, y_score >= t), and score is the corresponding value of func.

Return type:

tuple
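
A hedged sketch; here "accuracy" is resolved via get(), and the module path is an assumption:

    import numpy as np
    from catabra.util import metrics  # assumed import path

    y_true = np.array([0, 0, 1, 1, 1, 0])
    y_score = np.array([0.2, 0.4, 0.7, 0.6, 0.9, 0.3])

    # find the decision threshold that maximizes accuracy on the given data
    score, threshold = metrics.argmax_score_threshold("accuracy", y_true, y_score, discretize=100)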

get_thresholding_strategy(name: str)[source]

Retrieve a thresholding strategy for binary classification, given by its name.

Parameters:

name (str) – The name of the thresholding strategy, like “balance”, “prevalence” or “zero_one”.

Returns:

Thresholding strategy (callable) that can be applied to y_true, y_score and sample_weight, and that returns a single scalar threshold.

Return type:

Any

calibration_curve(y_true: ndarray, y_score: ndarray, sample_weight: ndarray | None = None, thresholds: ndarray | None = None) Tuple[ndarray, ndarray][source]

Compute the calibration curve of a binary classification problem. The predicted class probabilities are binned and, for each bin, the fraction of positive samples is determined. These fractions can then be plotted against the midpoints of the respective bins. Ideally, the resulting curve is monotonically increasing.

Parameters:
  • y_true (ndarray) – Ground truth, array of shape (n,) with values among 0 and 1. Values must not be NaN.

  • y_score (ndarray) – Predicted probabilities of the positive class, array of shape (n,) with arbitrary non-NaN values; in particular, the values do not necessarily need to correspond to probabilities or confidences.

  • sample_weight (ndarray, optional) – Sample weight.

  • thresholds (ndarray, optional) – The thresholds used for binning y_score. If None, suitable thresholds are determined automatically.

Returns:

Pair (fractions, thresholds), where thresholds is the array of thresholds of shape (m,), and fractions is the corresponding array of fractions of positive samples in each bin, of shape (m - 1,). Note that the i-th bin corresponds to the half-open interval [thresholds[i], thresholds[i + 1]) if i < m - 2, and to the closed interval [thresholds[i], thresholds[i + 1]] otherwise (in other words: the last bin is closed).

Return type:

tuple

roc_pr_curve(y_true: ndarray, y_score: ndarray, *, pos_label: int | str | None = None, sample_weight: ndarray | None = None, drop_intermediate: bool = True) Tuple[ndarray, ndarray, ndarray, ndarray, ndarray, ndarray][source]

Convenience function for computing ROC- and precision-recall curves simultaneously, with only one call to function _binary_clf_curve().

Parameters:
  • y_true (ndarray) – Same as in sklearn.metrics.roc_curve() and sklearn.metrics.precision_recall_curve().

  • y_score (ndarray) – Same as in sklearn.metrics.roc_curve() and sklearn.metrics.precision_recall_curve().

  • pos_label (int | str, optional) – Same as in sklearn.metrics.roc_curve() and sklearn.metrics.precision_recall_curve().

  • sample_weight (ndarray, optional) – Same as in sklearn.metrics.roc_curve() and sklearn.metrics.precision_recall_curve().

  • drop_intermediate (bool, default=True) – Same as in sklearn.metrics.roc_curve().

Returns:

6-tuple (fpr, tpr, thresholds_roc, precision, recall, thresholds_pr), i.e., the concatenation of the return values of functions sklearn.metrics.roc_curve() and sklearn.metrics.precision_recall_curve().

Return type:

tuple

multiclass_proba_to_pred(y: ndarray) ndarray[source]

Translate multiclass class probabilities into actual predictions, by returning the class with the highest probability. If two or more classes have the same highest probabilities, the last one is returned. This behavior is consistent with binary classification problems, where the positive class is returned if both classes have equal probabilities and the default threshold of 0.5 is used.

Parameters:

y (ndarray) – Class probabilities, of shape (n_classes,) or (n_samples, n_classes). The values of y can be arbitrary, they don’t need to be between 0 and 1. n_classes must be >= 1.

Returns:

Predicted class indices, either single integer or array of shape (n_samples,).

Return type:

ndarray
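
A brief illustration (module path assumed):

    import numpy as np
    from catabra.util import metrics  # assumed import path

    proba = np.array([[0.2, 0.5, 0.3],
                      [0.4, 0.4, 0.2]])
    metrics.multiclass_proba_to_pred(proba)
    # -> array([1, 1]); in the second row the tie between classes 0 and 1 is broken
    #    in favor of the last class with the highest probability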

class thresholded(func, threshold: float | str = 0.5, **kwargs)[source]

Bases: _OperatorBase

Convenience class for converting a classification metric that can only be applied to class predictions into a metric that can be applied to probabilities. This proceeds by specifying a fixed decision threshold.

Parameters:
  • func – The metric to convert, e.g., accuracy, balanced_accuracy, etc.

  • threshold (float | str, default=0.5) – The decision threshold. In binary classification this can also be the name of a thresholding strategy that is accepted by function get_thresholding_strategy().

  • **kwargs – Additional keyword arguments that are passed to func upon application.

Returns:

New metric that, when applied to y_true and y_score, returns func(y_true, y_score >= threshold) in case of binary- or multilabel classification, and func(y_true, multiclass_proba_to_pred(y_score)) in case of multiclass classification.

classmethod make(func, threshold: float | str = 0.5, **kwargs)[source]

Convenience function for converting a classification metric into its “thresholded” version IF NECESSARY. That means, if the given metric can be applied to class probabilities, it is returned unchanged. Otherwise, thresholded(func, threshold) is returned.

Parameters:
  • func – The metric to convert, e.g., accuracy, balanced_accuracy, etc.

  • threshold (float | str, default=0.5) – The decision threshold.

  • **kwargs – Additional keyword arguments that shall be passed to func upon application.

Returns:

Either func itself or thresholded(func, threshold).

Return type:

Any

maybe_thresholded(func, threshold: float | str = 0.5, **kwargs)[source]

Convenience function for converting a classification metric into its “thresholded” version IF NECESSARY. That means, if the given metric can be applied to class probabilities, it is returned unchanged. Otherwise, thresholded(func, threshold) is returned.

Parameters:
  • func – The metric to convert, e.g., accuracy, balanced_accuracy, etc.

  • threshold (float | str, default=0.5) – The decision threshold.

  • **kwargs – Additional keyword arguments that shall be passed to func upon application.

Returns:

Either func itself or thresholded(func, threshold).

Return type:

Any
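
A sketch contrasting thresholded() and maybe_thresholded(), assuming accuracy and roc_auc are available in the same metrics module (as the parameter descriptions suggest) and that the module can be imported as below:

    import numpy as np
    from catabra.util import metrics  # assumed import path

    y_true = np.array([0, 1, 1, 0, 1, 0])
    y_score = np.array([0.2, 0.7, 0.9, 0.4, 0.6, 0.1])

    acc_03 = metrics.thresholded(metrics.accuracy, threshold=0.3)
    acc_03(y_true, y_score)                   # equivalent to accuracy(y_true, y_score >= 0.3)

    # roc_auc already accepts probabilities, so maybe_thresholded() returns it unchanged
    auc = metrics.maybe_thresholded(metrics.roc_auc)
    auc(y_true, y_score)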

informedness(y_true, y_pred, *, average='binary', sample_weight=None)[source]

Informedness (aka Youden index or Youden’s J statistic) is the sum of sensitivity and specificity, minus 1.

markedness(y_true, y_pred, *, average='binary', sample_weight=None)[source]

Markedness is the sum of positive- and negative predictive value, minus 1.

accuracy_cm(*, tp=None, fp=None, tn=None, fn=None, average: str | None = 'binary', normalize: bool = True) float | int | ndarray[source]

Calculate accuracy from a confusion matrix. ATTENTION! In the multilabel case, this implementation actually corresponds to accuracy_micro etc.

balanced_accuracy_cm(*, tp=None, fp=None, tn=None, fn=None, average: str | None = 'binary', adjusted: bool = False) float | int | ndarray[source]

Calculate balanced accuracy from a confusion matrix. ATTENTION! In the multilabel case, this implementation actually corresponds to balanced_accuracy_micro etc.

Statistics

create_non_numeric_statistics(df: DataFrame, target: list, name_: str = '') DataFrame[source]

Calculate descriptive statistics for non-numeric features of a given DataFrame.

Parameters:
  • df (DataFrame) – The main dataframe.

  • target (list) – The target labels, stored in a list.

  • name_ (str, default='') – Name of the label, used for naming the columns.

Returns:

DataFrame with descriptive statistics for non-numeric features.

Return type:

DataFrame

calc_non_numeric_statistics(df: DataFrame, target: list, classify: bool) dict[source]

Calculate descriptive statistics for non-numeric features.

Parameters:
  • df (DataFrame) – The main dataframe.

  • target (list) – The target labels, stored in a list.

  • classify (bool) – True in case of a classification task, False in case of a regression task.

Returns:

Dictionary with descriptive statistics (for non-numeric features) for overall dataset, each target and (in case of classification) each label.

Return type:

dict

calc_numeric_statistics(df: DataFrame, target: list, classify: bool) dict[source]

Calculate descriptive statistics for numeric features of a given DataFrame.

Parameters:
  • df (DataFrame) – The main dataframe.

  • target (list) – The target labels, stored in a list.

  • classify (bool) – True in case of a classification task, False in case of a regression task.

Returns:

Dictionary with statistics (for numeric features) for the entire dataset, each target and (in case of classification) each label.

Return type:

dict

calc_descriptive_statistics(df: DataFrame, target: list, classify: bool, corr_threshold: int = 200) Tuple[dict, dict, DataFrame | None][source]

Calculate and return descriptive statistics, including correlation information.

Parameters:
  • df (DataFrame) – The main dataframe.

  • target (list) – The target labels, stored in a list.

  • classify (bool) – True in case of a classification task, False in case of a regression task.

  • corr_threshold (int, default=200) – Maximum number of columns for which a correlation-DataFrame is computed.

Returns:

Tuple of numeric and non-numeric statistics (separate dictionaries) and a correlation-DataFrame.

Return type:

tuple

save_descriptive_statistics(df: DataFrame, target: list, classify: bool, fn: str | Path, corr_threshold: int = 200)[source]

Calculate and save descriptive statistics including correlation information to disk.

Parameters:
  • df (DataFrame) – The main dataframe.

  • target (list) – The target labels, stored in a list.

  • classify (bool) – True in case of a classification task, False in case of a regression task.

  • fn (str | Path) – The directory where the statistics files are saved.

  • corr_threshold (int, default=200) – Maximum number of columns for which a correlation-DataFrame is computed.

mann_whitney_u(x: ndarray | Series, y: ndarray | Series, **kwargs) float[source]

Mann-Whitney U test for testing whether two samples are equal (more precisely: have equal median). Only applicable to numerical observations; categorical observations should be treated with the chi square test. The Mann-Whitney U test is a special case of the Kruskal-Wallis H test, which works for more than two samples.

Parameters:
  • x (ndarray | Series) – First sample, array-like with numerical values.

  • y (ndarray | Series) – Second sample, array-like with numerical values.

  • **kwargs – Keyword arguments passed to scipy.stats.mannwhitneyu().

Returns:

P-value. Smaller values mean that x and y are distributed differently. Note that this test is symmetric between x and y.

Return type:

float

chi_square(x: ndarray | Series, y: ndarray | Series, **kwargs) float[source]

Chi square test for testing whether a sample of categorical observations is distributed according to another sample of categorical observations.

Parameters:
  • x (ndarray | Series) – First sample, array-like with categorical values.

  • y (ndarray | Series) – Second sample, array-like with categorical values.

  • **kwargs – Keyword arguments passed to scipy.stats.chisquare().

Returns:

p-value. Smaller values mean that x is distributed differently from y. Note that this test is not symmetric between x and y!

Return type:

float

delong_test(y_true: ndarray, y_hat_1: ndarray, y_hat_2: ndarray, sample_weight=None) float[source]

Compute the p-value of the DeLong test for the null hypothesis that two ROC-AUCs are equal.

Parameters:
  • y_true (np.ndarray) – Ground truth, 1D array of shape (n_samples,) with values in {0, 1}.

  • y_hat_1 (np.ndarray) – Predictions of the first classifier, 1D array of shape (n_samples,) with arbitrary values. Larger values correspond to a higher predicted probability that a sample belongs to the positive class.

  • y_hat_2 (np.ndarray) – Predictions of the second classifier, 1D array of shape (n_samples,) with arbitrary values. Larger values correspond to a higher predicted probability that a sample belongs to the positive class.

  • sample_weight (np.ndarray, optional) – Sample weights. None defaults to uniform weights.

Returns:

p_value – p-value for the null hypothesis that the ROC-AUCs of the two classifiers are equal. If this value is smaller than a certain pre-defined threshold (e.g., 0.05) the null hypothesis can be rejected, meaning that there is a statistically significant difference between the two ROC-AUCs.

Return type:

float

See also

roc_auc_confidence_interval

Confidence interval for the ROC-AUC of a given classifier.

roc_auc_confidence_interval(y_true: ndarray, y_hat: ndarray, alpha: float = 0.95, sample_weight: ndarray | None = None) Tuple[float, float, float][source]

Return the confidence interval and ROC-AUC of given ground-truth and model predictions.

Parameters:
  • y_true (np.ndarray) – Ground truth, 1D array of shape (n_samples,) with values in {0, 1}.

  • y_hat (np.ndarray) – Predictions of the classifier, 1D array of shape (n_samples,) with arbitrary values. Larger values correspond to a higher predicted probability that a sample belongs to the positive class.

  • alpha (float, default=0.95) – Confidence level, between 0 and 1.

  • sample_weight (np.ndarray, optional) – Sample weights. None defaults to uniform weights.

Returns:

  • auc (float) – ROC-AUC of the given ground-truth and predictions.

  • ci_left (float) – Left endpoint of the confidence interval.

  • ci_right (float) – Right endpoint of the confidence interval.

Notes

The output always satisfies 0 <= ci_left <= auc <= ci_right <= 1.

See also

delong_test

Statistical test for the null hypothesis that the ROC-AUCs of two classifiers are equal.
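
A combined sketch of delong_test() and roc_auc_confidence_interval(); the import path catabra.util.statistics is an assumption:

    import numpy as np
    from catabra.util import statistics  # assumed import path

    y_true = np.array([0, 1, 1, 0, 1, 0, 1, 0, 1, 1])
    y_hat_1 = np.array([0.2, 0.8, 0.7, 0.3, 0.9, 0.4, 0.6, 0.1, 0.7, 0.8])
    y_hat_2 = np.array([0.4, 0.6, 0.5, 0.5, 0.7, 0.3, 0.5, 0.2, 0.6, 0.9])

    p_value = statistics.delong_test(y_true, y_hat_1, y_hat_2)
    auc, ci_left, ci_right = statistics.roc_auc_confidence_interval(y_true, y_hat_1, alpha=0.95)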

Preprocessing

class MinMaxScaler(fit_bool: bool = True, **kwargs)[source]

Bases: MinMaxScaler

partial_fit(X: ndarray | DataFrame, y=None) MinMaxScaler[source]

Online computation of min and max on X for later scaling.

All of X is processed as a single batch. This is intended for cases when fit() is not feasible due to a very large number of samples, or because X is read from a continuous stream.

Parameters:
  • X (array-like of shape (n_samples, n_features)) – The data used to compute the per-feature minimum and maximum used for later scaling along the features axis.

  • y (None) – Ignored.

Returns:

self – Fitted scaler.

Return type:

object

class OneHotEncoder(drop_na: bool = False, drop=None, handle_unknown=None, **kwargs)[source]

Bases: OneHotEncoder

fit(X, y=None) OneHotEncoder[source]

Fit OneHotEncoder to X.

Parameters:
  • X (array-like of shape (n_samples, n_features)) – The data to determine the categories of each feature.

  • y (None) – Ignored. This parameter exists only for compatibility with Pipeline.

Return type:

self

class NumCatTransformer(num_transformer=None, cat_transformer=None, bool: str = 'passthrough', obj: str = 'drop', timedelta: str = 'num', timestamp: str = 'num')[source]

Bases: BaseEstimator, TransformerMixin

class FeatureFilter[source]

Bases: BaseEstimator, TransformerMixin

Simple transformer that ensures that the list of features is identical to the features seen during fit. Only applicable to DataFrames.

Encoding

class Encoder(classify: bool = True)[source]

Bases: BaseEstimator

Encoder for features- and labels DataFrames. Implements the BaseEstimator class of sklearn, with methods fit(), transform() and inverse_transform(), and can easily be dumped to and loaded from disk.

Notes

Encoding ensures that:

  • The data type of every feature column is either float, int, bool, categorical or string (if the installed Pandas version supports it). Time-like columns are converted into float, and object data types raise an exception.

  • The data type of every target column is float.

    • In regression tasks, this is achieved by converting numerical data types (float, int, bool, time-like) into float, and raising exceptions if other data types are found.

    • In binary classification, this is achieved by representing the negative class by 0.0 and the positive class by 1.0. If the original data type is categorical, the negative class corresponds to the first category, whereas the positive class corresponds to the second category. If the original data type is not categorical the positive and negative classes are determined through sklearn’s LabelEncoder.

    • In multiclass classification, this is achieved by representing the i-th class by i.

    • In multilabel classification, this is achieved by representing the presence of a class by 1.0 and its absence by 0.0.

  • Both features and labels may contain NaN values before encoding. These are simply propagated, meaning that encoded data may contain NaN values as well!

get_target_or_class_names() list | None[source]

Convenience method for getting the names of the targets or, in case of multiclass classification, the names of the individual classes.

Returns:

List of target- or class names.

Return type:

list

transform(*, inplace: bool = True, **kwargs: DataFrame | None)[source]

Transform features- and/or labels DataFrames.

Parameters:
  • inplace (bool, default=True) – Whether to modify the given data in place.

  • **kwargs (DataFrame, optional) – The data to transform, with keys “x” (features), “y” (labels) or “data” (features+labels).

Returns:

The transformed DataFrame(s): either a single DataFrame if only one of “x” or “y” is passed, or a pair of DataFrames in the same order as in the argument dict. If “data” is passed, the pair of encoded features and labels is returned.

Return type:

Any

inverse_transform(*, inplace: bool = True, **kwargs: DataFrame | ndarray | None)[source]

Back-transform features- and/or labels DataFrames, i.e., decode encoded data. In the case of classification, this method can also handle Numpy arrays containing class indices, as returned by predict(), as well as class probabilities, as returned by predict_proba().

Parameters:
  • inplace (bool, default=True) – Whether to modify the given data in place.

  • **kwargs (DataFrame, ndarray, optional) – The data to transform back, with keys “x” (features) or “y” (labels).

Returns:

The back-transformed DataFrame(s), either a single DataFrame if only one of “x” or “y” is passed, or a pair of DataFrames in the same order as in the argument dict.

Return type:

Any

Table

convert_object_dtypes(df: DataFrame, inplace: bool = True, max_categories: int = 100) DataFrame[source]

Convert “object” data types in df into other data types, if possible. In particular, this includes timedelta, datetime, categorical and string types, in that order. String types are not supported in all Pandas versions.

Parameters:
  • df (DataFrame) – The DataFrame.

  • inplace (bool, default=True) – Whether to modify df in place. Note that if no column in df can be converted, it is returned as-is even if inplace is False.

  • max_categories (int, default=100) – The maximum number of allowed categories when converting an object column into a categorical column.

Returns:

DataFrame with converted data types.

Return type:

DataFrame

set_index(df: DataFrame, inplace: bool = True) Tuple[DataFrame, List[str]][source]

Set the row index of the given DataFrame to an ID column, unless it contains IDs already, and return a list of other potential ID columns.

Parameters:
  • df (DataFrame) – The DataFrame.

  • inplace (bool, default=True) – Whether to modify df in place.

Returns:

Pair (df, id_cols), where df is the new DataFrame and id_cols is a list of potential ID columns.

Return type:

tuple

merge_tables(tables: Iterable[DataFrame | str | Path]) Tuple[DataFrame, List[str]][source]

Merge the given tables by left-joining them on ID columns.

Parameters:

tables (Iterable) – The tables to merge, an iterable of DataFrames or paths to tables. Function convert_object_dtypes() is automatically applied to tables read from files.

Returns:

The pair (df, id_cols), where df is the merged DataFrame and id_cols is the list of potential ID columns.

Return type:

tuple

train_test_split(df: DataFrame, by: str) Tuple[Dict[str, ndarray], str | None][source]

Split the given DataFrame into train- and test set(s), by a given column.

Parameters:
  • df (DataFrame) – The DataFrame.

  • by (str) – The name of the column to split by. Must have bool or categorical data type.

Returns:

Pair (split_masks, train_key), where

  • split_masks is a dict mapping string-keys to masks corresponding to non-overlapping portions of df.

  • train_key is the key (in split_masks) containing the training set, or None if the training set could not be determined.

Return type:

tuple

Split

class StratifiedGroupShuffleSplit(n_splits=10, *, test_size=None, train_size=None, random_state=None, method='automatic', n_iter=None)[source]

Bases: StratifiedShuffleSplit

Stratified grouped split into train- and test set. Ensures that groups in the two sets do not overlap, and tries to distribute samples in such a way that class percentages are roughly maintained in each split.

Parameters:
  • n_splits (int, default=10) – Number of re-shuffling & splitting iterations.

  • test_size (float | int, default=None) – If float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the test split. If int, represents the absolute number of test samples. If None, the value is set to the complement of the train size. If train_size is also None, it will be set to 0.1.

  • train_size (float | int, default=None) – If float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the train split. If int, represents the absolute number of train samples. If None, the value is automatically set to the complement of the test size.

  • random_state (int or RandomState instance, default=None) – Controls the randomness of the training and testing indices produced. Pass an int for reproducible output across multiple function calls.

  • method (str, default="automatic") – Resampling method to use. Can be “automatic”, “exact” and “brute_force”. If there are many small groups, “brute_force” tends to give reasonable results and is significantly faster than “exact”. Otherwise, if there are only few large groups, method “exact” might be preferable. “automatic” tries to infer the optimal method based on the number of groups.

  • n_iter (int, default=None) – Number of brute-force iterations. The larger the number, the more splits are tried, and hence the better the results get. If None, the number of iterations is determined automatically.
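
A usage sketch following the familiar scikit-learn splitter API (inherited from StratifiedShuffleSplit); the import path catabra.util.split is an assumption:

    import numpy as np
    from catabra.util.split import StratifiedGroupShuffleSplit  # assumed import path

    rng = np.random.RandomState(0)
    X = rng.rand(100, 3)
    y = rng.randint(0, 2, size=100)
    groups = rng.randint(0, 20, size=100)   # e.g., patient IDs

    splitter = StratifiedGroupShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
    for train_idx, test_idx in splitter.split(X, y, groups=groups):
        # groups never overlap between train and test; class ratios are roughly preserved
        assert not set(groups[train_idx]) & set(groups[test_idx])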

class StratifiedGroupKFold(n_splits=5, shuffle=False, random_state=None, method: str = 'automatic', n_iter: int | None = None)[source]

Bases: _BaseKFold

Copied and adapted from sklearn version 1.0.2 [1], because auto-sklearn requires an older version without this class.

Changelist:

  • Removed warning if some class has fewer than n_splits instances.

  • Do not throw an error if all classes have fewer than n_splits instances.

  • Added method “brute_force”.

Parameters:
  • n_splits (int, default=5) – Number of folds. Must be at least 2.

  • shuffle (bool, default=False) – Whether to shuffle samples before splitting.

  • random_state (int or RandomState instance, default=None) – Controls the randomness of the training and testing indices produced. Pass an int for reproducible output across multiple function calls.

  • method (str, default="automatic") – Resampling method to use. Can be “automatic”, “exact” and “brute_force”. If there are many small groups, “brute_force” tends to give reasonable results and is significantly faster than “exact”. Otherwise, if there are only few large groups, method “exact” might be preferable. “automatic” tries to infer the optimal method based on the number of groups. Note that “brute_force” is only possible if shuffle is set to True.

  • n_iter (int, default=None) – Number of brute-force iterations. The larger the number, the more splits are tried, and hence the better the results get. If None, the number of iterations is determined automatically.

References

class CustomPredefinedSplit(test_folds=None)[source]

Bases: BaseCrossValidator

Predefined split cross-validator. Provides train/test indices to split data into train/test sets using a predefined scheme specified by explicit test indices.

In contrast to sklearn.model_selection.PredefinedSplit, samples can be in the test set of more than one split.

In methods split() etc., parameters X, y and groups only exist for compatibility, but are always ignored.

Parameters:

test_folds (list of array-like) – Indices of test samples for each split. The number of splits equals the length of the list. Note that the test sets do not have to be mutually disjoint.

get_n_splits(X=None, y=None, groups=None)[source]

Returns the number of splitting iterations in the cross-validator
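
A minimal sketch; note that sample 2 appears in the test set of both splits, which sklearn.model_selection.PredefinedSplit would not allow (import path assumed):

    import numpy as np
    from catabra.util.split import CustomPredefinedSplit  # assumed import path

    cv = CustomPredefinedSplit(test_folds=[np.array([0, 2]), np.array([2, 3, 4])])
    cv.get_n_splits()   # -> 2
    for train_idx, test_idx in cv.split(np.zeros((5, 1))):
        print(train_idx, test_idx)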

Longitudinal

resample_eav(df: DataFrame | dask.dataframe.DataFrame, windows: DataFrame | dask.dataframe.DataFrame, agg: dict = None, entity_col=None, time_col=None, attribute_col=None, value_col=None, include_start: bool = True, include_stop: bool = False, optimize: str = 'time') DataFrame | dask.dataframe.DataFrame[source]

Resample data in EAV (entity-attribute-value) format wrt. explicitly passed windows of arbitrary (possibly infinite) length.

Parameters:
  • df (pd.DataFrame | dask.dataframe.DataFrame) – The DataFrame to resample, in EAV format. That means, must have columns value_col (contains observed values), time_col (contains observation times), attribute_col (optional; contains attribute identifiers) and entity_col (optional; contains entity identifiers). Must have one column index level. Data types are arbitrary, as long as observation times and entity identifiers can be compared wrt. < and <= (e.g., float, int, time delta, date time). Entity identifiers must not be NA. Observation times may be NA, but such entries are ignored entirely. df can be a Dask DataFrame as well. In that case, however, entity_col must not be None and entities should already be on the row index, with known divisions. Otherwise, the row index is set to entity_col, which can be very costly both in terms of time and memory. Especially if df is known to be sorted wrt. entities already, the calling function should better take care of this; see https://docs.dask.org/en/stable/generated/dask.dataframe.DataFrame.set_index.html.

  • windows (pd.DataFrame | dask.dataframe.DataFrame | callable) – The target windows into which df is resampled. Must have two column index levels and columns (time_col, “start”) (optional; contains start times of each window), (time_col, “stop”) (optional; contains end times of each window), (entity_col, “”) (optional; contains entity identifiers) and (window_group_col, “”) (optional; contains information for creating groups of mutually disjoint windows). Start- and end times may be NA, but such windows are deemed invalid and by definition do not contain any observations. At least one of the two endpoint-columns must be given; if one is missing it is assumed to represent +/- inf. windows can be a Dask DataFrame as well. In that case, however, entity_col must not be None and entities should already be on the row index, with known divisions. Otherwise, the row index is set to entity_col, which can be very costly both in terms of time and memory. Especially if windows is known to be sorted wrt. entities already, the calling function should better take care of this; see https://docs.dask.org/en/stable/generated/dask.dataframe.DataFrame.set_index.html. Alternatively, windows can be a callable that, when applied to a DataFrame and keyword arguments entity_col, time_col, attribute_col and value_col, returns a DataFrame of the form described above. The canonical example of such a callable is the result returned by make_windows(); see the documentation of make_windows() for details.

  • agg (dict) –

    The aggregations to apply. Must be a dict mapping attribute identifiers to lists of aggregation functions, which are applied to all observed values of the respective attribute in each specified window. Supported aggregation functions are:

    • "mean": Empirical mean of observed non-NA values

    • "min": Minimum of observed non-NA values; equivalent to “p0”

    • "max": Maximum of observed non-NA values; equivalent to “p100”

    • "median": Median of observed non-NA values; equivalent to “p50”

    • "std": Empirical standard deviation of observed non-NA values

    • "var": Empirical variance of observed non-NA values

    • "sum": Sum of observed non-NA values

    • "prod": Product of observed non-NA values

    • "skew": Skewness of observed non-NA values

    • "mad": Mean absolute deviation of observed non-NA values

    • "sem": Standard error of the mean of observed non-NA values

    • "size": Number of observations, including NA values

    • "count": Number of non-NA observations

    • "nunique": Number of unique observed non-NA values

    • "mode": Mode of observed non-NA values, i.e., most frequent value; ties are broken randomly but reproducibly

    • "mode_count": Number of occurrences of mode

    • "pxx": Percentile of observed non-NA values; xx is an arbitrary float in the interval [0, 100]

    • "rxx": xx-th observed value (possibly NA), starting from 0; negative indices count from the end

    • "txx": Time of xx-th observed value; negative indices count from the end

    • "callable": Function that takes as input a DataFrame in and returns a new DataFrame out. See Notes for details.

  • entity_col (str, optional) – Name of the column in df and windows containing entity identifiers. If None, all entries are assumed to belong to the same entity. Note that entity identifiers may also be on the row index.

  • time_col (str, optional) – Name of the column in df containing observation times, and also name of column(s) in windows containing start- and end times of the windows. Note that despite its name the data type of the column is arbitrary, as long as it supports the following arithmetic- and order operations: -, /, <, <=.

  • attribute_col (str, optional) – Name of the column in df containing attribute identifiers. If None, all entries are assumed to belong to the same attribute; in that case agg may only contain one single item.

  • value_col (str, optional) – Name of the column in df containing the observed values.

  • include_start (bool, default=True) – Whether start times of observation windows are part of the windows.

  • include_stop (bool, default=False) – Whether end times of observation windows are part of the windows.

  • optimize (str, default='time') – Whether to optimize runtime or memory requirements. If set to “time”, the function returns faster but requires more memory; if set to “memory”, the runtime is longer but memory consumption is reduced to a minimum. If “time”, global variable MAX_ROWS can be used to adjust the time-memory tradeoff: increasing it increases memory consumption while reducing runtime. Note that this parameter is only relevant for computing non-rank-like aggregations, since rank-like aggregations (“rxx”, “txx”) can be efficiently computed anyway.

Returns:

Resampled data. Like windows, but with one additional column for each requested aggregation. Order of columns is arbitrary, order of rows is exactly as in windows – unless windows is a Dask DataFrame, in which case the order of rows may differ. The output is a (lazy) Dask DataFrame if windows is a Dask DataFrame, and a Pandas DataFrame otherwise, regardless of what df is.

Return type:

pd.DataFrame | dask.dataframe.DataFrame

Notes

When passing a callable to agg, it is expected to take as input a DataFrame in and return a new DataFrame out. in has two columns time_col and value_col (in that order). Its row index specifies which entries belong to the same observation window: entries with the same row index value belong to the same window, entries with different row index values belong to distinct windows. Observation times are guaranteed to be non-N/A, values may be N/A. Note, however, that in is not necessarily sorted wrt. its row index and/or observation times! Also note that the entities the observations in in stem from (if entity_col is specified) are not known to the function. out should have one row per row index value of in (with the same row index value), and an arbitrary number of columns with arbitrary names and dtypes. Columns should be consistent in every invocation of the function. The reason why the function is not applied to each row-index-value group individually is that some aggregations can be implemented efficiently using sorting rather than grouping. The function should be stateless and must not modify in in place.

  • Example 1: A simple aggregation which calculates the fraction of values between 0 and 1 in every window could be passed as

    lambda x: x[value_col].between(0, 1).groupby(level=0).mean().to_frame('frac_between_0_1')
    
  • Example 2: A more sophisticated aggregation which fits a linear regression to the observations in every window and returns the slope of the resulting regression line could be defined as

    # assumes `import pandas as pd` and `import scipy.stats`, with `time_col` and
    # `value_col` holding the respective column names
    def slope(x):
        tmp = pd.DataFrame(
            index=x.index,
            data={time_col: x[time_col].dt.total_seconds(), value_col: x[value_col]}
        )
        return tmp[tmp[value_col].notna()].groupby(level=0).apply(
            lambda g: scipy.stats.linregress(g[time_col], y=g[value_col]).slope
        ).to_frame('slope')
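
A compact end-to-end sketch of resample_eav() with Pandas inputs, following the column layout described above; the import path is an assumption:

    import pandas as pd
    from catabra.util.longitudinal import resample_eav  # assumed import path

    # observations in EAV format
    df = pd.DataFrame({
        "entity": [1, 1, 1, 2, 2],
        "time": [1.0, 2.0, 5.0, 1.0, 3.0],
        "attribute": ["hr"] * 5,
        "value": [80.0, 85.0, 90.0, 60.0, 65.0],
    })

    # one window per row; two column index levels as required
    windows = pd.DataFrame({
        ("time", "start"): [0.0, 0.0],
        ("time", "stop"): [4.0, 4.0],
        ("entity", ""): [1, 2],
    })

    out = resample_eav(
        df, windows,
        agg={"hr": ["mean", "count"]},
        entity_col="entity", time_col="time",
        attribute_col="attribute", value_col="value",
    )
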
    
resample_interval(df: DataFrame | dask.dataframe.DataFrame, windows: DataFrame | dask.dataframe.DataFrame, attributes: list = None, entity_col=None, start_col=None, stop_col=None, attribute_col=None, value_col=None, time_col=None, epsilon=1e-07) DataFrame | dask.dataframe.DataFrame[source]

Resample interval-like data wrt. explicitly passed windows of arbitrary (possibly infinite) length. “Interval-like” means that each observation is characterized by a start- and stop time rather than a singular timestamp (as in EAV data).

Parameters:
  • df (pd.DataFrame | dask.dataframe.DataFrame) – The DataFrame to resample. Must have columns value_col (contains observed values), start_col (optional; contains start times), stop_col (optional; contains end times), attribute_col (optional; contains attribute identifiers) and entity_col (optional; contains entity identifiers). Must have one column index level. Data types are arbitrary, as long as times and entity identifiers can be compared wrt. < and <= (e.g., float, int, time delta, date time). Entity identifiers must not be NA. Values must be numeric (float, int, bool). Observation times and observed values may be NA, but such entries are ignored entirely. Although both start_col and stop_col are optional, at least one must be present. Missing start- and end columns are interpreted as -/+ inf. All intervals are closed, i.e., start- and end times are included. This is especially relevant for entries whose start time equals their end time. df can be a Dask DataFrame as well. In that case, however, entity_col must not be None and entities should already be on the row index, with known divisions. Otherwise, the row index is set to entity_col, which can be very costly both in terms of time and memory. Especially if df is known to be sorted wrt. entities already, the calling function should better take care of this; see https://docs.dask.org/en/stable/generated/dask.dataframe.DataFrame.set_index.html.

  • windows (pd.DataFrame | dask.dataframe.DataFrame | callable) – The target windows into which df is resampled. Must have either one or two columns index level(s). If it has one column index level, must have columns start_col (optional; contains start times of each window), stop_col (optional; contains end times of each window) and entity_col (optional; contains entity identifiers). If it has two column index levels, the columns must be (time_col, “start”), (time_col, “stop”) and (entity_col, “”). Start- and end times may be NA, but such windows are deemed invalid and by definition do not overlap with any observation intervals. At least one of the two endpoint-columns must be present; if one is missing it is assumed to represent -/+ inf. All time windows are closed, i.e., start- and end times are included. windows can be a Dask DataFrame as well. In that case, however, entity_col must not be None and entities should already be on the row index, with known divisions. Otherwise, the row index is set to entity_col, which can be very costly both in terms of time and memory. Especially if windows is known to be sorted wrt. entities already, the calling function should better take care of this; see https://docs.dask.org/en/stable/generated/dask.dataframe.DataFrame.set_index.html. Alternatively, windows can be a callable that, when applied to a DataFrame and keyword arguments entity_col, start_col, stop_col, time_col, attribute_col and value_col, returns a DataFrame of the form described above. The canonical example of such a callable is the result returned by make_windows(); see the documentation of make_windows() for details.

  • attributes (list, optional) – The attributes to consider. Must be a list-like of attribute identifiers. None defaults to the list of all such identifiers present in column attribute_col. If attribute_col is None but attributes is not, it must be a singleton list.

  • entity_col (str, optional) – Name of the column in df and windows containing entity identifiers. If None, all entries are assumed to belong to the same entity. Note that entity identifiers may also be on the row index.

  • start_col (str, optional) – Name of the column in df (and windows if it has only one column index level) containing start times. If None, all start times are assumed to be -inf. Note that despite its name the data type of the column is arbitrary, as long as it supports the following arithmetic- and order operations: -, /, <, <=.

  • stop_col (str, optional) – Name of the column in df (and windows if it has only one column index level) containing end times. If None, all end times are assumed to be +inf. Note that despite its name the data type of the column is arbitrary, as long as it supports the following arithmetic- and order operations: -, /, <, <=.

  • attribute_col (str, optional) – Name of the column in df containing attribute identifiers. If None, all entries are assumed to belong to the same attribute.

  • value_col (str, optional) – Name of the column in df containing the observed values.

  • time_col (list | str, optional) – Name of the column(s) in windows containing start- and end times of the windows. Only needed if windows has two column index levels, because otherwise these two columns must be called start_col and stop_col, respectively.

  • epsilon – The value to set \(W_I\) to if \(I\) is infinite and \(W \cap I\) is non-empty and finite; see Notes for details.

Returns:

Resampled data. Like windows, but with one additional column for each attribute, and the same number of column index levels. Order of columns is arbitrary, order of rows is exactly as in windows – unless windows is a Dask DataFrame, in which case the order of rows may differ. The output is a (lazy) Dask DataFrame if windows is a Dask DataFrame, and a Pandas DataFrame otherwise, regardless of what df is.

Notes

A typical example of interval-like data are medication records, since medications can be administered over longer time periods.

The only supported resampling aggregation is summing the observed values per time window, each scaled by the ratio of the length of the intersection of observation interval and time window to the total length of the observation interval: Let \(W = [s, t]\) be a time window and let \(I = [a, b]\) be an observation interval with observed value \(v\). Then \(I\) contributes to \(W\) the value

\(W_I = v \cdot \frac{|W \cap I|}{|I|}\)

The overall value of \(W\) is the sum of \(W_I\) over all intervals. Of course, all this is computed separately for each entity-attribute combination. Some remarks on the above equation are in order (a small worked example follows the remarks below):

  • If \(v\) is N/A, \(W_I\) is set to 0.

  • If \(a = b\), both the numerator and the denominator are 0. In this case the fraction is defined as 1 if \(a \in W\) (i.e., \(s \leq a \leq t\)) and 0 otherwise.

  • If \(I\) is infinite and \(W \cap I\) is non-empty but finite, \(W_I\) is set to \(\epsilon \cdot \mathrm{sign}(v)\). Note that \(W \cap I\) is non-empty even if it is of the form \([x, x]\). This leads to the slightly counter-intuitive situation that \(W_I = \epsilon\) if \(I\) is infinite, and \(W_I = 0\) if \(I\) is finite.

  • If \(I\) and \(W \cap I\) are both infinite, the fraction is defined as 1. This is regardless of whether \(W \cap I\) equals \(I\) or whether it is a proper subset of it.
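The following minimal sketch illustrates the aggregation rule on hypothetical data, without calling the library itself: a medication administered from 08:00 to 12:00 with total dose 100, resampled into the window from 10:00 to 14:00, contributes half of its dose to that window.

    import pandas as pd

    # Observation interval I = [a, b] with observed value v (hypothetical medication record).
    a, b = pd.Timestamp("2024-01-01 08:00"), pd.Timestamp("2024-01-01 12:00")
    v = 100.0

    # Time window W = [s, t].
    s, t = pd.Timestamp("2024-01-01 10:00"), pd.Timestamp("2024-01-01 14:00")

    # W_I = v * |W ∩ I| / |I|
    overlap = max(min(b, t) - max(a, s), pd.Timedelta(0))
    w_i = v * (overlap / (b - a))
    print(w_i)   # 50.0, since half of the observation interval falls into the window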

class make_windows(df: DataFrame | str | None = None, entity=None, start=None, stop=None, start_rel=None, stop_rel=None, duration=None, anchor=None)[source]

Bases: object

Convenience function for easily creating windows that can be passed to functions resample_eav() and resample_interval(). Note that invoking this function does not create the actual windows DataFrame yet. Instead, when the resulting callable is passed to resample_eav() or resample_interval(), it is applied to the DataFrame to be resampled. This makes it possible to refer to that DataFrame implicitly here; see the examples below for specific use cases.

Parameters:
  • df (pd.DataFrame | str, optional) – Source DataFrame. If None, defaults to the DataFrame to be resampled in resample_eav() or resample_interval(). Can also be a string, which will be evaluated using Python’s eval() function. The string can contain references to the DataFrame to be resampled via variable df, and to column-names entity_col, time_col, start_col and stop_col passed to resample_eav() and resample_interval(). Example: “df.groupby(entity_col)[time_col].max().to_frame()”

  • entity (pd.Series | pd.Index | str | scalar, optional) – Entity of each window. Series are used as-is (possibly after re-ordering rows to match other row indices), strings refer to columns in df, and scalars are replicated to populate every window with the same value. If None, defaults to df[entity_col] if df contains that column.

  • start (pd.Series | pd.Index | str | scalar, optional) – Start time of each window. Series are used as-is (possibly after re-ordering rows to match other row indices), strings refer to columns in df, and scalars are replicated to populate every window with the same value. Note that despite its name the data type of the start times is arbitrary, as long as it supports the following arithmetic- and order operations: -, /, <, <=. start and start_rel are mutually exclusive.

  • stop (pd.Series | pd.Index | str | scalar, optional) – Stop time of each window. Series are used as-is (possibly after re-ordering rows to match other row indices), strings refer to columns in df, and scalars are replicated to populate every window with the same value. Note that despite its name the data type of the stop times is arbitrary, as long as it supports the following arithmetic- and order operations: -, /, <, <=. stop and stop_rel are mutually exclusive.

  • start_rel (pd.Series | pd.Index | str | scalar, optional) – Start time of each window, relative to anchor. Series are used as-is (possibly after re-ordering rows to match other row indices), strings refer to columns in df, and scalars are replicated to populate every window with the same value. If given, anchor must be given, too. start and start_rel are mutually exclusive.

  • stop_rel (pd.Series | pd.Index | str | scalar, optional) – Stop time of each window, relative to anchor. Series are used as-is (possibly after re-ordering rows to match other row indices), strings refer to columns in df, and scalars are replicated to populate every window with the same value. If given, anchor must be given, too. stop and stop_rel are mutually exclusive.

  • duration (pd.Series | pd.Index | str | scalar, optional) – Duration of each window. Series are used as-is (possibly after re-ordering rows to match other row indices), strings refer to columns in df, and scalars are replicated to populate every window with the same value. Durations can only be specified if exactly one endpoint (either start or stop) is specified; the other endpoint is then computed from duration.

  • anchor (pd.Series | pd.Index | str | scalar, optional) – Anchor time start_rel and stop_rel refer to. Series are used as-is (possibly after re-ordering rows to match other row indices), strings refer to columns in df, and scalars are replicated to populate every window with the same value. Ignored unless start_rel or stop_rel is given. If start_rel or stop_rel is given but anchor is None, it defaults to time_col, but a warning message is printed.

Notes

  • The current implementation does not support Dask DataFrames.

  • This function does not check whether windows are non-empty, i.e., whether start times come before end times.

Examples

  • Use-case 1: Create fixed-length windows relative to the time column in the DataFrame to be resampled. Since anchor is required by start_rel but not set explicitly, it defaults to time_col, but a warning message is printed.

    resample_eav(
        df_to_be_resampled,
        make_windows(
            start_rel=pd.Timedelta("-1 day"),
            stop_rel=pd.Timedelta("-1 hour")
        ),
        ...
    )
    
  • Use-case 2: Similar to use-case 1, but only create one window per entity, for the temporally last entry. Note how the DataFrame to be resampled is only passed once directly to function resample_eav(); make_windows() refers to it implicitly via variable name “df” in the string of keyword argument df. Note also that the resulting DataFrame may have entities on its row index.

    resample_eav(
        df_to_be_resampled,
        make_windows(
            df="df.groupby(entity_col)[time_col].max().to_frame()",
            start_rel=pd.Timedelta("-7 days"),
            duration=pd.Timedelta("5 days"),
            anchor="timestamp"
        ),
        time_col="timestamp",
        entity_col=...,
        ...
    )
    
  • Use-case 3: make_windows() can be used with function resample_interval(), too – regardless of whether time_col is passed to resample_interval() or not.

    resample_interval(
        df_to_be_resampled,
        make_windows(
            stop=pd.Series(...),
        duration=pd.Series(...),    # must have the same row index as the Series passed to `stop`
        ),
        start_col=...,
        stop_col=...,
        time_col=...,                   # optional
        ...
    )
    
group_temporal(df: DataFrame, group_by=None, time_col=None, start_col=None, stop_col=None, distance=None, inclusive: bool = True) Series[source]

Group intervals wrt. their temporal distance to each other. Intervals can also be isolated points, i.e., single-point intervals of the form [x, x].

Parameters:
  • df (DataFrame) – DataFrame with intervals.

  • group_by (optional) – Additional column(s) to group df by, optional. If given, the computed grouping refines the given one, in the sense that any two intervals belonging to the same computed group are guaranteed to belong to the same given group, too. Can be the name of a single column or a list of column names and/or row index levels. Strings are interpreted as column names or row index names, integers are interpreted as row index levels.

  • time_col (str, optional) – Name of the column in df containing both start- and end times of single-point intervals. If given, both start_col and stop_col must be None.

  • start_col (str, optional) – Name of the column in df containing start times of intervals. If given, time_col must be None.

  • stop_col (str, optional) – Name of the column in df containing end times of intervals. If given, time_col must be None. Note that the function tacitly assumes that no interval ends before it starts, although this is not checked. If this assumption is violated, the returned results may not be correct.

  • distance (optional) – Maximum allowed distance between two intervals for being put into the same group. Should be non-negative. The distance between two intervals is the single-linkage distance, i.e., the minimum distance between any two points in the respective intervals. This means, for example, that the distance between overlapping intervals is always 0.

  • inclusive (bool, default=True) – Whether distance is inclusive, i.e., whether two intervals whose distance is exactly equal to distance may still be put into the same group.

Notes

The returned grouping is the reflexive-transitive closure of the proximity relation induced by distance. Formally: Let \(R\) be the binary relation on the set of intervals in df such that \(R(I_1, I_2)\) holds iff the distance between \(I_1\) and \(I_2\) is less than (or equal to) distance (and additionally \(I_1\) and \(I_2\) belong to the same groups specified by group_by). \(R\) is obviously symmetric, so its reflexive-transitive closure \(R^*\) is an equivalence relation on the set of intervals in df. The returned grouping corresponds precisely to this equivalence relation, in the sense that there is one group per equivalence class and vice versa. Note that if two intervals belong to the same group, their distance may still be larger than distance.

Returns:

Series with the same row index as df, in the same order, whose values are group indices.

Return type:

Series
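As an illustration of the grouping semantics described above (not the library's implementation), the following sketch computes the groups for three hypothetical, already sorted intervals by hand; up to renumbering of group indices, group_temporal() with distance=1 and inclusive=True should yield the same grouping.

    import pandas as pd

    # Three intervals, already sorted by start time (hypothetical data).
    df = pd.DataFrame({"start": [1.0, 2.5, 10.0], "stop": [2.0, 3.0, 11.0]})
    distance = 1.0

    # Reference computation of the grouping: a new group starts whenever the gap to all
    # previously seen intervals exceeds `distance` (assumes sorted, well-formed intervals).
    group, groups, max_stop = 0, [], None
    for start, stop in zip(df["start"], df["stop"]):
        if max_stop is not None and start - max_stop > distance:
            group += 1
            max_stop = stop
        else:
            max_stop = stop if max_stop is None else max(max_stop, stop)
        groups.append(group)

    print(pd.Series(groups, index=df.index))   # 0, 0, 1: the first two intervals share a group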

prev_next_values(df: DataFrame, sort_by=None, group_by=None, columns=None, first_indicator_name=None, last_indicator_name=None, keep_sorted: bool = False, inplace: bool = False) DataFrame[source]

Find the previous/next values of some columns in DataFrame df, for every entry. Additionally, entries can be grouped and previous/next values only searched within each group.

Parameters:
  • df (DataFrame) – The DataFrame.

  • sort_by (list | str, optional) – The column(s) to sort by. Can be the name of a single column or a list of column names and/or row index levels. Strings are interpreted as column names or row index names, integers are interpreted as row index levels. ATTENTION! N/A values in columns to sort by are not ignored; rather, they are treated in the same way as Pandas treats such values in DataFrame.sort_values(), i.e., they are put at the end.

  • group_by (list | str, optional) – Column(s) to group df by, optional. Same values as sort_by.

  • columns (dict) –

    A dict mapping column names to dicts of the form

    {
        "prev_name": <prev_name>,
        "prev_fill": <prev_fill>,
        "next_name": <next_name>,
        "next_fill": <next_fill>
    }
    

    prev_name and next_name are the names of the columns in the result, containing the previous/next values. If any of them is None, the corresponding previous/next values are not computed for that column. prev_fill and next_fill specify which values to assign to the first/last entry in every group, which does not have any previous/next values. Note that column names not present in df are tacitly skipped.

  • first_indicator_name (str, optional) – Name of the column in the result containing boolean indicators whether the corresponding entries come first in their respective groups. If None, no such column is added.

  • last_indicator_name (str, optional) – Name of the column in the result containing boolean indicators whether the corresponding entries come last in their respective groups. If None, no such column is added.

  • keep_sorted (bool, default=False) – Whether to keep the result sorted wrt. group_by and sort_by. If False, the order of rows of the result is identical to that of df.

  • inplace (bool, default=False) – If True, the new columns are added to df.

Returns:

The modified DataFrame if inplace is True, a DataFrame with the requested previous/next values otherwise.

Return type:

DataFrame
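A hedged usage sketch illustrating the structure of the columns argument on hypothetical data; the actual call to prev_next_values() is shown as a comment, since the import path of the function is not given here.

    import pandas as pd

    df = pd.DataFrame({
        "patient": [1, 1, 1, 2],
        "time":    pd.to_datetime(["2024-01-01", "2024-01-03", "2024-01-02", "2024-01-05"]),
        "value":   [7.0, 9.0, 8.0, 3.0],
    })

    # For column "value", request both the previous and the next value within each patient,
    # ordered by time; entries without a predecessor/successor are filled with NaN.
    columns = {
        "value": {
            "prev_name": "prev_value",
            "prev_fill": float("nan"),
            "next_name": "next_value",
            "next_fill": float("nan"),
        }
    }

    # result = prev_next_values(df, sort_by="time", group_by="patient", columns=columns,
    #                           first_indicator_name="is_first", last_indicator_name="is_last")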

Bootstrapping

class Bootstrapping(*args: DataFrame | Series | ndarray, kwargs: dict | None = None, fn=None, seed=None, replace: bool = True, size: int | float = 1.0)[source]

Bases: object

Class for performing bootstrapping [1], i.e., repeatedly sample with replacement from given data and evaluate statistics on each resample to obtain mean, standard deviation, etc. for more robust estimates.

Parameters:
  • *args (DataFrame | Series | ndarray) – Data, non-empty sequence of DataFrames, Series or arrays of the same length.

  • kwargs (dict, optional) – Additional keyword arguments passed to the function fn computing the statistics. Like args, the values of the dict must be DataFrames, Series or arrays of the same length as the elements of args.

  • fn (optional) – The statistics to compute. Must be None, a function that takes the given args as input and returns a scalar/array/DataFrame/Series or a (nested) dict/tuple thereof, or a (nested) dict/tuple of such functions.

  • seed (int, optional) – Random seed.

  • replace (bool, default=True) – Whether to resample with replacement. If False, this does not actually correspond to bootstrapping.

  • size (int | float, default=1.) – The size of the resampled data. If <= 1, it is multiplied by the number of samples in the given data. Bootstrapping normally assumes that resampled data have the same number of samples as the original data, so this parameter should usually be left at its default of 1.

References

run(n_repetitions: int = 100, sample_indices: ndarray | None = None) Bootstrapping[source]

Run bootstrapping for a given number of repetitions, and store the results in a list. Results are appended to results from previous runs!

Parameters:
  • n_repetitions (int, default=100) – Number of repetitions.

  • sample_indices (ndarray, optional) – Pre-computed sample indices to use in each repetition. If not None, n_repetitions is ignored and sample_indices must have shape (n, size), where n determines the number of repetitions.

subsample(seed: int | None = None) ndarray[source]

Construct a subsample.

Parameters:

seed (int, optional) – Random seed to use.

Returns:

Array with subsample indices.

Return type:

ndarray

get_sample_indices() ndarray[source]

Get sample indices used for resampling the data.

Returns:

Array of shape (n_runs, size).

Return type:

ndarray

agg(func)[source]

Compute aggregate statistics of the results of the individual runs, like mean, standard deviation, etc.

Parameters:

func – The aggregation function to apply.

Returns:

Aggregated results.

Return type:

Any

dataframe(keys=None) DataFrame | None[source]

Construct a DataFrame with all results, if possible. Only works for (dicts/tuples of) scalar values.

Returns:

DataFrame whose columns correspond to individual metrics and whose rows correspond to runs, or None.

Return type:

DataFrame

describe(keys=None) Series | DataFrame[source]

Describe the results of the individual runs by computing a predefined set of statistics, similar to pandas’ describe() method. Only works for (dicts/tuples of) scalar values.

Returns:

DataFrame or Series with descriptive statistics.

Return type:

Series | DataFrame
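A hedged end-to-end sketch of the workflow described above, on hypothetical data; the constructor, run() and describe() calls are shown as comments because the import path of Bootstrapping is not given here, and the metric function is an illustrative assumption.

    import numpy as np
    import pandas as pd

    # from ... import Bootstrapping   # import path depends on the installation

    rng = np.random.default_rng(0)
    y_true = pd.Series(rng.integers(0, 2, size=200))
    y_score = pd.Series(rng.random(200))

    # `fn` receives the resampled positional arguments and returns a dict of scalars.
    def metrics(t, s):
        return {"mean_score": float(s.mean()), "positive_rate": float(t.mean())}

    # bs = Bootstrapping(y_true, y_score, fn=metrics, seed=42)
    # bs.run(n_repetitions=100)    # 100 resamples with replacement
    # print(bs.describe())         # mean, std, quantiles, ... of each metric across runs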

Summary

summarize_performance(directories: Iterable[str | Path], metrics: Iterable[str | tuple], split: str | Iterable[str] | None = None, path_callback=None) DataFrame[source]

Summarize the performance of multiple prediction models trained and evaluated with CaTabRa. This is a convenient way for quickly comparing them and selecting the best model(s) for a certain task. An implicit assumption of this function is that all models were trained on the same prediction task.

IMPORTANT: Only pre-evaluated metrics in “metrics.xlsx” and “bootstrapping.xlsx” are considered!

Parameters:
  • directories (Iterable[str | Path]) – The directories to consider, an iterable of path-like objects. Each directory must be the output directory of an invocation of catabra.evaluate, or a subdirectory corresponding to a specific split (containing “metrics.xlsx” and maybe also “bootstrapping.xlsx”). A convenient way to specify a set of directories matching a certain pattern is to use Path(root_path).rglob(pattern).

  • metrics (Iterable[str]) –

    List of metrics to include in the summary, an iterable of strings. Values must match the following pattern:

    "[target:]metric_name[@threshold][(bootstrapping_aggregation)]"
    
    • target is optional and specifies the target (or class in case of multiclass classification); can be “*” to include all available targets, and can be a sequence separated by “,”. Ignored if bootstrapping_aggregation is specified.

    • metric_name is the name of the actual metric, exactly as written in “metrics.xlsx” or “bootstrapping.xlsx”; can be “*” to include all available pre-evaluated metrics, and can be a sequence separated by “,”.

    • threshold is optional and must be a number between 0 and 1 (it cannot be a string like “balance”), and cannot be “*”. Only relevant for threshold-dependent classification metrics, and mutually exclusive with bootstrapping_aggregation. Note that the given threshold must exactly match one of the thresholds evaluated in “metrics.xlsx”.

    • bootstrapping_aggregation is optional and specifies the bootstrapping aggregation to include, like “mean”, “std”, etc.; can be “*” to include all available pre-evaluated aggregations in “bootstrapping.xlsx”, and can be a sequence separated by “,”.

  • split (Iterable[str], default=None) – If a directory in directories has subdirectories corresponding to data splits that were evaluated separately, only include the splits in split. If None, all splits are included.

  • path_callback (Callable, default=None) – Callback function applied to every path visited. Must return None, True, False or a dict; False indicates that the current path should be dropped from the output, True and None are aliases for {}, and a dict adds one column per key to the output DataFrame, filled with the corresponding values.

Returns:

DataFrame with one row per evaluation and one column per performance metric. If multiple splits are included in the performance summary, each is put into a separate row.

Return type:

DataFrame

Examples

Example metric specifications:

  • “roc_auc”

  • “roc_auc(mean,std)”

  • “accuracy,sensitivity@0.5”

  • “*@0.5”

  • “r2(*)”

  • “*(*)”

  • “target_1:mean_squared_error”

  • “*:mean_squared_error”

  • “*:*(*)”

  • “__threshold(mean,std)”
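Metric specifications like the ones above might be passed to summarize_performance() roughly as follows; this is a hedged sketch in which the directory layout is hypothetical and the actual call is left as a comment, since the import path is not given here.

    from pathlib import Path

    # from ... import summarize_performance   # import path depends on the installation

    # Collect all evaluation output directories below a hypothetical root folder.
    eval_dirs = Path("experiments").rglob("eval_*")

    # summary = summarize_performance(
    #     eval_dirs,
    #     metrics=["roc_auc(mean,std)", "accuracy,sensitivity@0.5"],
    #     split=["test"],
    # )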

See also

summarize_importance

Summarize feature importance scores.

summarize_importance(directories: Iterable[str | Path], columns: str | Iterable[str] | None = None, new_column_name: str = '{feature} {column}', glob: bool = False, split: str | Iterable[str] | None = None, model_id: str | Iterable[str] | None = None, path_callback=None) DataFrame[source]

Summarize the feature importance of multiple prediction models trained and explained with CaTabRa. This is a convenient way for quickly comparing them. An implicit assumption of this function is that all models were trained on the same prediction task, and that the same feature importance calculation method was applied to generate the importance scores.

IMPORTANT: Only pre-evaluated feature importance scores are considered!

Parameters:
  • directories (Iterable[str | Path]) – The directories to consider, an iterable of path-like objects. Each directory must be the output directory of an invocation of catabra.explain, or a subdirectory corresponding to a specific split (containing HDF5 files with feature importance scores). A convenient way to specify a set of directories matching a certain pattern is to use Path(root_path).rglob(pattern).

  • columns (Iterable[str], default=None) – The columns in global feature importance scores to consider. For instance, if catabra.explanation.average_local_explanations() is used to produce global scores, four columns “>0”, “<0”, “>0 std” and “<0 std” are normally generated. This parameter makes it possible to include only a subset of them in the summary; see the usage sketch at the end of this entry. None defaults to all columns.

  • new_column_name (str) – String pattern specifying the names of the columns in the output DataFrame. May have two named fields feature and column, which are filled with original feature- and column names, respectively.

  • glob (bool) – Whether feature importance scores in directories are global. If not, catabra.explanation.average_local_explanations() is applied.

  • split (Iterable[str], default=None) – If a directory in directories has subdirectories corresponding to data splits that were explained separately, only include the splits in split. If None, all splits are included.

  • model_id (Iterable[str], default=None) – Model-IDs to consider, optional. Determines the names of the HDF5 files to be included. None defaults to all found model-IDs.

  • path_callback (Callable, default=None) – Callback function applied to every path visited. Must return None, True, False or a dict; False indicates that the current path should be dropped from the output, True and None are aliases for {}, and a dict adds one column per key to the output DataFrame, filled with the corresponding values.

Returns:

DataFrame with one row per explanation and one column per feature-column pair. If multiple splits are included in the importance summary, each is put into a separate row. If there are multiple targets (multiclass/multilabel classification, multioutput regression) and the feature importance scores for each target are stored in a separate table, each is put into a separate row and an additional column “__target__” is added.

Return type:

DataFrame
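Analogously to summarize_performance(), a hedged usage sketch on a hypothetical directory layout; the call is left as a comment because the import path is not given here, and the chosen columns and pattern merely illustrate the parameters described above.

    from pathlib import Path

    # from ... import summarize_importance   # import path depends on the installation

    explain_dirs = Path("experiments").rglob("explain_*")   # hypothetical directory layout

    # importance = summarize_importance(
    #     explain_dirs,
    #     columns=[">0", "<0"],                    # include only the ">0" and "<0" columns
    #     new_column_name="{feature} {column}",    # yields output columns such as "age >0"
    #     glob=False,                              # local scores are averaged first
    # )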

See also

summarize_performance

Summarize model performance.