Read a DataFrame from a CSV, Excel, HDF5, Pickle or Parquet file. The file type is determined from the file
extension of the given file.
Parameters:
fn (str | Path) – The file to read.
key (str | Iterable[str], default='table') – The key(s) in the HDF5 file, if fn is an HDF5 file. Defaults to “table”. If an iterable, all keys are read and
concatenated along the row axis.
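The extension-based dispatch described above can be sketched as follows. This is only an illustration, not the library's implementation: the name read_df_sketch, the extension table and the set of recognized HDF5 extensions are assumptions made for the example.

```python
from pathlib import Path

import pandas as pd

# Hypothetical mapping from file extension to a pandas reader (assumption).
_READERS = {
    ".csv": pd.read_csv,
    ".xlsx": pd.read_excel,
    ".pkl": pd.read_pickle,
    ".pickle": pd.read_pickle,
    ".parquet": pd.read_parquet,
}


def read_df_sketch(fn, key="table"):
    """Read a DataFrame, choosing the reader from the file extension."""
    fn = Path(fn)
    ext = fn.suffix.lower()
    if ext in (".h5", ".hdf"):
        # A single key yields one table; several keys are stacked row-wise.
        keys = [key] if isinstance(key, str) else list(key)
        return pd.concat([pd.read_hdf(fn, key=k) for k in keys], axis=0)
    if ext not in _READERS:
        raise ValueError(f"Unsupported file extension: {ext!r}")
    return _READERS[ext](fn)
```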
Write a dict of DataFrames to file. The file type is determined from the file extension of the given file.
Unless fn is an Excel or HDF5 file, dfs must be empty or a singleton.
Parameters:
dfs (dict) – The DataFrames to write. If empty and mode differs from “a”, the file is deleted.
fn (str | Path) – The target file name.
mode (str, default='w') – The mode in which the file shall be opened, if fn is an Excel- or HDF5 file. Ignored otherwise.
Load a Python object from disk. The object can be stored in JSON, Pickle or joblib format. The format is
automatically determined based on the given file extension, following the same extension mapping as for dumping.
Dump a Python object to disk, either as a JSON, Pickle or joblib file. The format is determined automatically based
on the given file extension:
“.json” => JSON
“.pkl”, “.pickle” => Pickle
“.joblib” => joblib
Parameters:
obj – The object to dump.
fn (str | Path) – The file.
Notes
When dumping objects as JSON, calling to_json() beforehand might be necessary to ensure compliance with the JSON
standard. joblib is preferred over Pickle, as it is more efficient if the object contains large Numpy arrays.
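A minimal sketch of how dumping and loading by file extension could work is given below. The names dump_sketch and load_sketch are hypothetical; only the extension mapping is taken from the list above. joblib is a third-party package and is imported lazily, so the JSON and Pickle branches work without it.

```python
import json
import pickle
from pathlib import Path


def dump_sketch(obj, fn):
    """Write obj to fn; the format follows the file extension."""
    fn = Path(fn)
    ext = fn.suffix.lower()
    if ext == ".json":
        with open(fn, "w", encoding="utf-8") as f:
            json.dump(obj, f)
    elif ext in (".pkl", ".pickle"):
        with open(fn, "wb") as f:
            pickle.dump(obj, f)
    elif ext == ".joblib":
        import joblib  # third-party, only needed for this branch
        joblib.dump(obj, fn)
    else:
        raise ValueError(f"Unsupported file extension: {ext!r}")


def load_sketch(fn):
    """Read an object from fn; the format follows the file extension."""
    fn = Path(fn)
    ext = fn.suffix.lower()
    if ext == ".json":
        with open(fn, "r", encoding="utf-8") as f:
            return json.load(f)
    if ext in (".pkl", ".pickle"):
        with open(fn, "rb") as f:
            return pickle.load(f)
    if ext == ".joblib":
        import joblib  # third-party, only needed for this branch
        return joblib.load(fn)
    raise ValueError(f"Unsupported file extension: {ext!r}")
```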
Convert rows (selected via rowindex_to_convert) to str. Mainly used before saving DataFrames, e.g., to avoid
missing values in .xlsx files for data types such as timedelta.
Parameters:
d (dict | DataFrame) – Single DataFrame or dictionary of dataframes
rowindex_to_convert (list) – List of row indices (e.g., features), that should be converted to str
inplace (bool, default=True) – Determines if changes will be made to input data or a deep-copy of it
skip (list, default=[]) – List of column(s) that should not be converted to string
Returns:
Modified (str-converted rows) single DataFrame or dictionary of DataFrames.
Get the trained prediction model as a FittedEnsemble object.
Parameters:
from_model (bool, default=False) – Whether to convert a plain model of type AutoMLBackend into a FittedEnsemble object, if such an object does
not exist in the directory.
accepted (list, optional) – List of accepted inputs. Must be lower-case. If None, all inputs are accepted.
allow_headless (bool, default=True) – What to do in headless mode. If True, the first element in accepted is returned if accepted is a list and
“” is returned if accepted is None. If False, a RuntimeError is raised.
Returns:
The input of the user, an element of accepted if accepted is a list, or arbitrary if accepted is None.
Show a simple progress bar when iterating over a given iterable. This works similarly to the tqdm package, but in
contrast to tqdm it also works when mirroring messages to a file.
Parameters:
iterable – The iterable.
desc (str, optional) – Description to add to the beginning of the progress bar, optional.
total (int, optional) – Total number of elements in iterable if iterable does not implement the __len__() method.
disable (bool, default=False) – Whether to disable the progress bar. If True, the behavior is equivalent to not calling this function at all.
meter_width (int, default=40) – The width of the meter, in characters. Should not be too long to make the whole progress bar fit into a single
line. Might have to be decreased if desc is a long text.
Used to temporarily mirror both stderr and stdout to a log file. Based on [1] and [2].
Examples
>>> with LogMirror("log.txt"):
...     log("writing to log.txt and the console")
...     err("works with errors as well")
...     warn("and in case you need warnings")
...     print("no need to use the custom log functions")
Create a fresh name based on name, i.e., a name that does not appear in lst.
Parameters:
name – An arbitrary object. If a list, tuple or set, all elements of name are processed individually, and they are
ensured to be distinct from each other.
lst (Iterable) – A list-like structure.
Returns:
If name does not appear in lst, name is returned as-is. Otherwise, a numeric suffix is added to the string
representation of name.
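For the scalar case, the behavior can be illustrated with the sketch below. The function name fresh_name_sketch and the exact suffix format ("_1", "_2", ...) are assumptions; the text above only states that a numeric suffix is appended to the string representation.

```python
def fresh_name_sketch(name, lst):
    """Return name if unused, otherwise str(name) plus a numeric suffix."""
    taken = list(lst)
    if name not in taken:
        return name  # returned as-is, without converting to str
    i = 1
    # Hypothetical suffix scheme: try "_1", "_2", ... until the result is unused.
    while f"{name}_{i}" in taken:
        i += 1
    return f"{name}_{i}"
```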
Return a string representation of some time delta.
Minutes and seconds are always displayed, hours and days only if needed. Format is “d days hh:mm:ss”.
Parameters:
delta – Time delta to represent, either a float or an object with a total_seconds() method (e.g., a pandas Timedelta
instance). Floats are assumed to be given in seconds.
subsecond_resolution (int, default=0) – The subsecond resolution to display, i.e., number of decimal places.
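The formatting rule (minutes and seconds always, hours and days only if needed) can be sketched as follows. The name format_timedelta_sketch and the handling of negative values are assumptions of this example; integer arithmetic on rounded "ticks" is used to avoid floating-point artifacts such as a displayed value of 60 seconds.

```python
def format_timedelta_sketch(delta, subsecond_resolution=0):
    """Format a time delta as "d days hh:mm:ss", omitting unneeded leading parts."""
    seconds = delta.total_seconds() if hasattr(delta, "total_seconds") else float(delta)
    scale = 10 ** subsecond_resolution
    ticks = round(abs(seconds) * scale)   # integer number of smallest displayed units
    whole, frac = divmod(ticks, scale)    # full seconds and subsecond remainder
    days, rem = divmod(whole, 86400)
    hours, rem = divmod(rem, 3600)
    minutes, secs = divmod(rem, 60)
    out = f"{minutes:02d}:{secs:02d}"
    if subsecond_resolution > 0:
        out += f".{frac:0{subsecond_resolution}d}"
    if days > 0 or hours > 0:
        out = f"{hours:02d}:" + out
    if days > 0:
        out = f"{days} days " + out
    # Assumption: negative deltas are shown with a leading minus sign.
    return ("-" if seconds < 0 and ticks > 0 else "") + out
```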
fig – The figure(s) to save. May be a Matplotlib figure object, a plotly figure object, or a dict whose values are
such figure objects.
fn (str | Path) – The file or directory. It is recommended to leave the file extension unspecified and simply pass
“/path/to/figure” instead of “/path/to/figure.png”. The file extension is then determined automatically
depending on the type of fig and on the value of png. If fig is a dict, fn refers to the parent
directory.
png (bool, default=False) – Whether to save Matplotlib figures as PNG or as PDF. Ignored if a file extension is specified in fn or if
fig is a plotly figure, which are always saved as HTML.
Convenience function for converting a metric into a (possibly different) metric that returns scores (i.e., higher
values correspond to better results). That means, if the given metric returns scores already, it is returned
unchanged. Otherwise, it is negated.
Parameters:
func – The metric to convert, e.g., accuracy, balanced_accuracy, etc. Note that in case of classification metrics,
both thresholded and non-thresholded metrics are accepted.
name – The name of the requested metric function. It may be of the form “name @ threshold”, where name is the
name of a thresholded classification metric (e.g., “accuracy”) and threshold is the desired threshold.
Furthermore, some synonyms are recognized as well, most notably “precision” for “positive_predictive_value” and
“recall” for “sensitivity”. threshold can also be the name of a thresholding strategy; see function
thresholded() for details.
Convenience function for converting a metric into its bootstrapped version.
Parameters:
func – The metric to convert, e.g., roc_auc, accuracy, mean_squared_error, etc.
n_repetitions (int, default=100) – Number of bootstrapping repetitions to perform. If 0, func is returned unchanged.
agg (default='mean') – Aggregation to compute of bootstrapping results.
seed (int, optional) – Random seed.
replace (bool, default=True) – Whether to resample with replacement. If False, this does not actually correspond to bootstrapping.
size (int | float, default=1.) – The size of the resampled data. If <= 1, it is multiplied with the number of samples in the given data.
Bootstrapping normally assumes that resampled data have the same number of samples as the original data, so
this parameter should be set to 1.
**kwargs – Additional keyword arguments that are passed to func upon application. Note that only arguments that do
not need to be resampled can be passed here; in particular, this excludes sample_weight.
Returns:
New metric that, when applied to y_true and y_hat, resamples the data, evaluates the metric on each
resample, and returns some aggregation (typically the average) of the results thus obtained.
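The idea can be sketched with NumPy as shown below. This is a simplified illustration under stated assumptions: the name bootstrapped_sketch is hypothetical, agg is passed as a callable rather than a string, and the replace and size parameters are omitted (resampling is always with replacement and at the original sample size).

```python
import numpy as np


def bootstrapped_sketch(func, n_repetitions=100, agg=np.mean, seed=None, **kwargs):
    """Wrap func so that it is evaluated on bootstrap resamples and aggregated."""
    if n_repetitions == 0:
        return func  # nothing to do, the metric is returned unchanged

    def metric(y_true, y_hat):
        y_true = np.asarray(y_true)
        y_hat = np.asarray(y_hat)
        rng = np.random.default_rng(seed)
        n = len(y_true)
        results = []
        for _ in range(n_repetitions):
            idx = rng.integers(0, n, size=n)  # n indices drawn with replacement
            results.append(func(y_true[idx], y_hat[idx], **kwargs))
        return agg(results)

    return metric
```

Because the generator is re-created from seed on every call, repeated calls with the same inputs give the same result whenever seed is not None.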
Compute the balance score and -threshold of a binary classification problem.
Parameters:
y_true – Ground truth, with 0 representing the negative class and 1 representing the positive class. Must not contain
NaN.
y_score – Predicted scores, i.e., the higher a score the more confident the model is that the sample belongs to the
positive class. Range is arbitrary.
Returns:
Pair (balance_score, balance_threshold), where balance_threshold is the decision threshold that minimizes
the difference between sensitivity and specificity, i.e., it is defined as arg min_t
|sensitivity(y_true, y_score >= t) - specificity(y_true, y_score >= t)|.
balance_score is the corresponding sensitivity value, which by definition is approximately equal to
specificity and can furthermore be shown to be approximately equal to accuracy and balanced accuracy, too.
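A brute-force sketch of the balance threshold is shown below. It tries every unique score as a threshold and keeps the one with the smallest absolute difference between sensitivity and specificity. The function name is hypothetical, sample weights are not handled, and both classes are assumed to be present.

```python
import numpy as np


def balance_score_threshold_sketch(y_true, y_score):
    """Return (balance_score, balance_threshold) by exhaustive search."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score, dtype=float)
    pos = y_true == 1
    best = None
    for t in np.unique(y_score):          # candidate thresholds in ascending order
        pred = y_score >= t
        sens = pred[pos].mean()           # fraction of positives predicted positive
        spec = (~pred[~pos]).mean()       # fraction of negatives predicted negative
        diff = abs(sens - spec)
        if best is None or diff < best[0]:
            best = (diff, sens, t)
    return float(best[1]), float(best[2])
```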
Compute the prevalence score and -threshold of a binary classification problem.
Parameters:
y_true – Ground truth, with 0 representing the negative class and 1 representing the positive class. Must not contain
NaN.
y_score – Predicted scores, i.e., the higher a score the more confident the model is that the sample belongs to the
positive class. Range is arbitrary.
Returns:
Pair (prevalence_score, prevalence_threshold), where prevalence_threshold is the decision threshold that
minimizes the difference between the number of positive samples in y_true (m) and the number of predicted
positives. In other words, the threshold is set to the m-th largest value in y_score. If sample_weight
is given, the threshold minimizes the difference between the total weight of all positive samples and the total
weight of all samples predicted positive. prevalence_score is the corresponding sensitivity value, which can
be shown to be approximately equal to positive predictive value and F1.
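For the unweighted case, the "m-th largest value" rule can be sketched as follows. The function name is hypothetical, sample_weight is not supported, and y_true is assumed to contain at least one positive sample.

```python
import numpy as np


def prevalence_score_threshold_sketch(y_true, y_score):
    """Return (prevalence_score, prevalence_threshold) for unweighted data."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score, dtype=float)
    pos = y_true == 1
    m = int(pos.sum())                          # number of positive samples
    threshold = np.sort(y_score)[::-1][m - 1]   # m-th largest score
    pred = y_score >= threshold
    sens = pred[pos].mean()                     # sensitivity at that threshold
    return float(sens), float(threshold)
```

Without ties in y_score, exactly m samples are predicted positive, so sensitivity and positive predictive value share the same numerator and denominator.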
Compute the threshold corresponding to the (0,1)-criterion [1] of a binary classification problem.
Although a popular strategy for selecting decision thresholds, [1] advocates maximizing informedness (aka Youden
index) instead, which is equivalent to maximizing balanced accuracy.
Parameters:
y_true – Ground truth, with 0 representing the negative class and 1 representing the positive class. Must not contain
NaN.
y_score – Predicted scores, i.e., the higher a score the more confident the model is that the sample belongs to the
positive class. Range is arbitrary.
specificity_weight (float, default=1.) – The relative weight of specificity wrt. sensitivity. 1 means that sensitivity and specificity are weighted
equally, a value < 1 means that sensitivity is weighted stronger than specificity, and a value > 1 means that
specificity is weighted stronger than sensitivity. See the formula below for details.
Returns:
Decision threshold that minimizes the Euclidean distance between the point (0, 1) and the point
(1 - specificity, sensitivity), i.e., arg min_t (1 - sensitivity(y_true, y_score >= t)) ** 2 +
specificity_weight * (1 - specificity(y_true, y_score >= t)) ** 2
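The formula above translates directly into an exhaustive search over candidate thresholds. The sketch below is an illustration only: the function name is hypothetical, every unique score is used as a candidate, and both classes are assumed to be present.

```python
import numpy as np


def zero_one_threshold_sketch(y_true, y_score, specificity_weight=1.0):
    """Return the threshold minimizing the weighted squared distance to (0, 1)."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score, dtype=float)
    pos = y_true == 1
    thresholds = np.unique(y_score)
    costs = []
    for t in thresholds:
        pred = y_score >= t
        sens = pred[pos].mean()
        spec = (~pred[~pos]).mean()
        costs.append((1 - sens) ** 2 + specificity_weight * (1 - spec) ** 2)
    return float(thresholds[int(np.argmin(costs))])
```

Increasing specificity_weight shifts the selected threshold towards higher specificity, as the second call in the following usage example shows: with y_true = [0, 0, 0, 1, 1] and y_score = [0.1, 0.2, 0.6, 0.5, 0.9], the default weight selects 0.5, whereas a weight of 10 selects 0.9.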
Compute the decision threshold that maximizes a given binary classification metric or callable.
Since in most built-in classification metrics larger values indicate better results, there is no analogous
argmin_score_threshold().
Parameters:
func – The metric or function to maximize. If a string, function get() is called on it.
y_true – Ground truth, with 0 representing the negative class and 1 representing the positive class. Must not contain
NaN.
y_score – Predicted scores, i.e., the higher a score the more confident the model is that the sample belongs to the
positive class. Range is arbitrary.
discretize (default=100) – Discretization steps for limiting the number of calls to func. If None, no discretization happens, i.e., all
unique values in y_score are tried.
**kwargs – Additional keyword arguments passed to func.
Returns:
Pair (score, threshold), where threshold is the decision threshold that maximizes func, i.e., arg max_t
func(y_true, y_score >= t), and score is the corresponding value of func.
Compute the calibration curve of a binary classification problem. The predicted class probabilities are binned and,
for each bin, the fraction of positive samples is determined. These fractions can then be plotted against the
midpoints of the respective bins. Ideally, the resulting curve will be monotonically increasing.
Parameters:
y_true (ndarray) – Ground truth, array of shape (n,) with values among 0 and 1. Values must not be NaN.
y_score (ndarray) – Predicted probabilities of the positive class, array of shape (n,) with arbitrary non-NaN values; in
particular, the values do not necessarily need to correspond to probabilities or confidences.
thresholds (ndarray, optional) – The thresholds used for binning y_score. If None, suitable thresholds are determined automatically.
Returns:
Pair (fractions, thresholds), where thresholds is the array of thresholds of shape (m,), and fractions
is the corresponding array of fractions of positive samples in each bin, of shape (m - 1,). Note that the
i-th bin corresponds to the half-open interval [thresholds[i], thresholds[i + 1]) if i < m - 2, and to the
closed interval [thresholds[i], thresholds[i + 1]] otherwise (in other words: the last bin is closed).
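The binning convention (half-open bins, last bin closed) coincides with that of numpy.histogram, which makes a compact sketch possible. The function name is hypothetical, thresholds must be given explicitly here (automatic selection is not sketched), and empty bins are reported as NaN, which is an assumption of this example.

```python
import numpy as np


def calibration_curve_sketch(y_true, y_score, thresholds):
    """Return (fractions, thresholds) with one fraction of positives per bin."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score, dtype=float)
    thresholds = np.asarray(thresholds, dtype=float)
    # np.histogram uses half-open bins, except for the last one, which is closed.
    total, _ = np.histogram(y_score, bins=thresholds)
    positive, _ = np.histogram(y_score[y_true == 1], bins=thresholds)
    fractions = np.full(len(total), np.nan)
    nonempty = total > 0
    fractions[nonempty] = positive[nonempty] / total[nonempty]
    return fractions, thresholds
```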
Convenience function for computing ROC- and precision-recall curves simultaneously, with only one call to
function _binary_clf_curve().
Parameters:
y_true (ndarray) – Same as in sklearn.metrics.roc_curve() and sklearn.metrics.precision_recall_curve().
y_score (ndarray) – Same as in sklearn.metrics.roc_curve() and sklearn.metrics.precision_recall_curve().
pos_label (int | str, optional) – Same as in sklearn.metrics.roc_curve() and sklearn.metrics.precision_recall_curve().
sample_weight (ndarray, optional) – Same as in sklearn.metrics.roc_curve() and sklearn.metrics.precision_recall_curve().
drop_intermediate (bool, default=True) – Same as in sklearn.metrics.roc_curve().
Returns:
6-tuple (fpr, tpr, thresholds_roc, precision, recall, thresholds_pr), i.e., the concatenation of the return
values of functions sklearn.metrics.roc_curve() and sklearn.metrics.precision_recall_curve().
Translate multiclass class probabilities into actual predictions, by returning the class with the highest
probability. If two or more classes have the same highest probabilities, the last one is returned. This behavior is
consistent with binary classification problems, where the positive class is returned if both classes have equal
probabilities and the default threshold of 0.5 is used.
Parameters:
y (ndarray) – Class probabilities, of shape (n_classes,) or (n_samples, n_classes). The values of y can be arbitrary,
they don’t need to be between 0 and 1. n_classes must be >= 1.
Returns:
Predicted class indices, either single integer or array of shape (n_samples,).
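The tie-breaking rule (the last class with the highest value wins) differs from the default behavior of numpy.argmax, which returns the first maximum. One way to obtain it, sketched here under the hypothetical name multiclass_proba_to_pred_sketch, is to reverse the class axis before taking the argmax.

```python
import numpy as np


def multiclass_proba_to_pred_sketch(y):
    """Return the index of the last maximum along the class axis."""
    y = np.asarray(y)
    n_classes = y.shape[-1]
    # argmax finds the first maximum of the reversed array, i.e. the last
    # maximum of the original one; map the position back to the original index.
    return n_classes - 1 - np.argmax(y[..., ::-1], axis=-1)
```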
Convenience class for converting a classification metric that can only be applied to class predictions into a
metric that can be applied to probabilities. This proceeds by specifying a fixed decision threshold.
Parameters:
func – The metric to convert, e.g., accuracy, balanced_accuracy, etc.
threshold (float | str, default=0.5) – The decision threshold. In binary classification this can also be the name of a thresholding strategy that is
accepted by function get_thresholding_strategy().
**kwargs – Additional keyword arguments that are passed to func upon application.
Returns:
New metric that, when applied to y_true and y_score, returns func(y_true, y_score >= threshold)
in case of binary- or multilabel classification, and func(y_true, multiclass_proba_to_pred(y_score)) in case
of multiclass classification.
Convenience function for converting a classification metric into its “thresholded” version IF NECESSARY.
That means, if the given metric can be applied to class probabilities, it is returned unchanged. Otherwise,
thresholded(func, threshold) is returned.
Parameters:
func – The metric to convert, e.g., accuracy, balanced_accuracy, etc.
threshold (float | str, default=0.5) – The decision threshold.
**kwargs – Additional keyword arguments that shall be passed to func upon application.
Returns:
Either func itself or thresholded(func, threshold).
Calculate accuracy from a confusion matrix.
ATTENTION! In the multilabel case, this implementation actually corresponds to balanced_accuracy_micro etc.
Mann-Whitney U test for testing whether two samples are equal (more precisely: have equal median). Only applicable
to numerical observations; categorical observations should be treated with the chi square test. The Mann-Whitney U
test is a special case of the Kruskal-Wallis H test, which works for more than two samples.
Parameters:
x (ndarray | Series) – First sample, array-like with numerical values.
y (ndarray | Series) – Second sample, array-like with numerical values.
**kwargs – Keyword arguments passed to scipy.stats.mannwhitneyu().
Returns:
P-value. Smaller values mean that x and y are distributed differently. Note that this test is symmetric
between x and y.
Compute the p-value of the DeLong test for the null hypothesis that two ROC-AUCs are equal.
Parameters:
y_true (np.ndarray) – Ground truth, 1D array of shape (n_samples,) with values in {0, 1}.
y_hat_1 (np.ndarray) – Predictions of the first classifier, 1D array of shape (n_samples,) with arbitrary values. Larger values
correspond to a higher predicted probability that a sample belongs to the positive class.
y_hat_2 (np.ndarray) – Predictions of the second classifier, 1D array of shape (n_samples,) with arbitrary values. Larger values
correspond to a higher predicted probability that a sample belongs to the positive class.
Returns:
p_value – p-value for the null hypothesis that the ROC-AUCs of the two classifiers are equal. If this value is smaller
than a certain pre-defined threshold (e.g., 0.05) the null hypothesis can be rejected, meaning that there is a
statistically significant difference between the two ROC-AUCs.
Return the confidence interval and ROC-AUC of given ground-truth and model predictions.
Parameters:
y_true (np.ndarray) – Ground truth, 1D array of shape (n_samples,) with values in {0, 1}.
y_hat (np.ndarray) – Predictions of the classifier, 1D array of shape (n_samples,) with arbitrary values. Larger values
correspond to a higher predicted probability that a sample belongs to the positive class.
alpha (float, default=0.95) – Confidence level, between 0 and 1.
Online computation of min and max on X for later scaling.
All of X is processed as a single batch. This is intended for cases
when fit() is not feasible due to a very large number of samples
(n_samples) or because X is read from a continuous stream.
Parameters:
X (array-like of shape (n_samples, n_features)) – The data used to compute the per-feature minimum and maximum
used for later scaling along the features axis.
Encoder for features- and labels DataFrames. Implements the BaseEstimator class of sklearn, with methods fit(),
transform() and inverse_transform(), and can easily be dumped to and loaded from disk.
Notes
Encoding ensures that:
The data type of every feature column is either float, int, bool, categorical or string (if the installed Pandas
version supports it). Time-like columns are converted into float, and object data types raise an exception.
The data type of every target column is float.
In regression tasks, this is achieved by converting numerical data types (float, int, bool, time-like) into
float, and raising exceptions if other data types are found.
In binary classification, this is achieved by representing the negative class by 0.0 and the positive class by
1.0. If the original data type is categorical, the negative class corresponds to the first category, whereas
the positive class corresponds to the second category. If the original data type is not categorical the
positive and negative classes are determined through sklearn’s LabelEncoder.
In multiclass classification, this is achieved by representing the i-th class by i.
In multilabel classification, this is achieved by representing the presence of a class by 1.0 and its absence
by 0.0.
Both features and labels may contain NaN values before encoding. These are simply propagated, meaning that encoded
data may contain NaN values as well!
Back-transform features and/or labels DataFrames, i.e., decode encoded data. In the case of classification,
this method can also handle NumPy arrays containing class indices, as returned by predict(), as well as class
probabilities, as returned by predict_proba().
Parameters:
inplace (bool, default=True) – Whether to modify the given data in place.
**kwargs (DataFrame, ndarray, optional) – The data to transform back, with keys “x” (features) or “y” (labels).
Returns:
The back-transformed DataFrame(s), either a single DataFrame if only one of “x” or “y” is passed, or a pair
of DataFrames in the same order as in the argument dict.
Convert “object” data types in df into other data types, if possible. In particular, this includes timedelta,
datetime, categorical and string types, in that order. String types are not supported in all Pandas versions.
Parameters:
df (DataFrame) – The DataFrame.
inplace (bool, default=True) – Whether to modify df in place. Note that if no column in df can be converted, it is returned as-is even if
inplace is False.
max_categories (int, default=100) – The maximum number of allowed categories when converting an object column into a categorical column.
Merge the given tables by left-joining them on ID columns.
Parameters:
tables (Iterable) – The tables to merge, an iterable of DataFrames or paths to tables. Function convert_object_dtypes() is
automatically applied to tables read from files.
Returns:
The pair (df, id_cols), where df is the merged DataFrame and id_cols is the list of potential ID columns.
Stratified grouped split into train- and test set. Ensures that groups in the two sets do not overlap, and tries
to distribute samples in such a way that class percentages are roughly maintained in each split.
Parameters:
n_splits (int, default=10) – Number of re-shuffling & splitting iterations.
test_size (float | int, default=None) – If float, should be between 0.0 and 1.0 and represent the proportion
of the dataset to include in the test split. If int, represents the
absolute number of test samples. If None, the value is set to the
complement of the train size. If train_size is also None, it will
be set to 0.1.
train_size (float | int, default=None) – If float, should be between 0.0 and 1.0 and represent the
proportion of the dataset to include in the train split. If
int, represents the absolute number of train samples. If None,
the value is automatically set to the complement of the test size.
random_state (int or RandomState instance, default=None) – Controls the randomness of the training and testing indices produced.
Pass an int for reproducible output across multiple function calls.
method (str, default="automatic") – Resampling method to use. Can be “automatic”, “exact” and “brute_force”.
If there are many small groups, “brute_force” tends to give reasonable
results and is significantly faster than “exact”. Otherwise, if there
are only few large groups, method “exact” might be preferable.
“automatic” tries to infer the optimal method based on the number of
groups.
n_iter (int, default=None) – Number of brute-force iterations. The larger the number, the more
splits are tried, and hence the better the results get. If None, the
number of iterations is determined automatically.
Copied and adapted from sklearn version 1.0.2 [1], because auto-sklearn
requires an older version without this class.
Changelist:
- Removed warning if some class has fewer than n_splits instances.
- Do not throw error if all classes have fewer than n_splits instances.
- Added method “brute_force”.
Parameters:
n_splits (int, default=10) – Number of re-shuffling & splitting iterations.
shuffle (bool, default=False) – Whether to shuffle samples before splitting.
random_state (int or RandomState instance, default=None) – Controls the randomness of the training and testing indices produced.
Pass an int for reproducible output across multiple function calls.
method (str, default="automatic") – Resampling method to use. Can be “automatic”, “exact” and “brute_force”.
If there are many small groups, “brute_force” tends to give reasonable
results and is significantly faster than “exact”. Otherwise, if there
are only few large groups, method “exact” might be preferable.
“automatic” tries to infer the optimal method based on the number of
groups.
Note that “brute_force” is only possible if shuffle is set to True.
n_iter (int, default=None) – Number of brute-force iterations. The larger the number, the more
splits are tried, and hence the better the results get. If None, the
number of iterations is determined automatically.
Predefined split cross-validator. Provides train/test indices to split data into train/test sets using a predefined
scheme specified by explicit test indices.
In contrast to sklearn.model_selection.PredefinedSplit, samples can be in the test set of more than one split.
In methods split() etc., parameters X, y and groups only exist for compatibility, but are always ignored.
Parameters:
test_folds (list of array-like) – Indices of test samples for each split. The number of splits equals the length of the list.
Note that the test sets do not have to be mutually disjoint.
Resample data in EAV (entity-attribute-value) format wrt. explicitly passed windows of arbitrary (possibly
infinite) length.
Parameters:
df (pd.DataFrame | dask.dataframe.DataFrame) – The DataFrame to resample, in EAV format. That means, must have columns value_col (contains observed values),
time_col (contains observation times), attribute_col (optional; contains attribute identifiers) and
entity_col (optional; contains entity identifiers). Must have one column index level. Data types are
arbitrary, as long as observation times and entity identifiers can be compared wrt. < and <= (e.g., float,
int, time delta, date time). Entity identifiers must not be NA. Observation times may be NA, but such entries
are ignored entirely. df can be a Dask DataFrame as well. In that case, however, entity_col must not be
None and entities should already be on the row index, with known divisions. Otherwise, the row index is set to
entity_col, which can be very costly both in terms of time and memory. Especially if df is known to be
sorted wrt. entities already, the caller should take care of this itself; see
https://docs.dask.org/en/stable/generated/dask.dataframe.DataFrame.set_index.html.
windows (pd.DataFrame | dask.dataframe.DataFrame | callable) – The target windows into which df is resampled. Must have two column index levels and columns (time_col,
“start”) (optional; contains start times of each window), (time_col, “stop”) (optional; contains end times of
each window), (entity_col, “”) (optional; contains entity identifiers) and (window_group_col, “”) (optional;
contains information for creating groups of mutually disjoint windows). Start- and end times may be NA, but such
windows are deemed invalid and by definition do not contain any observations. At least one of the two
endpoint-columns must be given; if one is missing it is assumed to represent +/- inf. windows can be a Dask
DataFrame as well. In that case, however, entity_col must not be None and entities should already be on the
row index, with known divisions. Otherwise, the row index is set to entity_col, which can be very costly both
in terms of time and memory. Especially if windows is known to be sorted wrt. entities already, the caller
should take care of this itself; see
https://docs.dask.org/en/stable/generated/dask.dataframe.DataFrame.set_index.html.
Alternatively, windows can be a callable that, when applied to a DataFrame and keyword arguments
entity_col, time_col, attribute_col and value_col, returns a DataFrame of the form described above.
The canonical example of such a callable is the result returned by make_windows(); see the documentation of
make_windows() for details.
agg (dict) –
The aggregations to apply. Must be a dict mapping attribute identifiers to lists of aggregation functions,
which are applied to all observed values of the respective attribute in each specified window. Supported
aggregation functions are:
"mean": Empirical mean of observed non-NA values
"min": Minimum of observed non-NA values; equivalent to “p0”
"max": Maximum of observed non-NA values; equivalent to “p100”
"median": Median of observed non-NA values; equivalent to “p50”
"std": Empirical standard deviation of observed non-NA values
"var": Empirical variance of observed non-NA values
"sum": Sum of observed non-NA values
"prod": Product of observed non-NA values
"skew": Skewness of observed non-NA values
"mad": Mean absolute deviation of observed non-NA values
"sem": Standard error of the mean of observed non-NA values
"size": Number of observations, including NA values
"count": Number of non-NA observations
"nunique": Number of unique observed non-NA values
"mode": Mode of observed non-NA values, i.e., most frequent value; ties are broken randomly but
reproducibly
"mode_count": Number of occurrences of mode
"pxx": Percentile of observed non-NA values; xx is an arbitrary float in the interval [0, 100]
"rxx": xx-th observed value (possibly NA), starting from 0; negative indices count from the end
"txx": Time of xx-th observed value; negative indices count from the end
"callable": Function that takes as input a DataFrame in and returns a new DataFrame out.
See Notes for details.
entity_col (str, optional) – Name of the column in df and windows containing entity identifiers. If None, all entries are assumed to
belong to the same entity. Note that entity identifiers may also be on the row index.
time_col (str, optional) – Name of the column in df containing observation times, and also name of column(s) in windows containing
start- and end times of the windows. Note that despite its name the data type of the column is arbitrary, as
long as it supports the following arithmetic- and order operations: -, /, <, <=.
attribute_col (str, optional) – Name of the column in df containing attribute identifiers. If None, all entries are assumed to belong to the
same attribute; in that case agg may only contain one single item.
value_col (str, optional) – Name of the column in df containing the observed values.
include_start (bool, default=True) – Whether start times of observation windows are part of the windows.
include_stop (bool, default=False) – Whether end times of observation windows are part of the windows.
optimize (str, default='time') – Whether to optimize runtime or memory requirements. If set to “time”, the function returns faster but requires
more memory; if set to “memory”, the runtime is longer but memory consumption is reduced to a minimum. If
“time”, global variable MAX_ROWS can be used to adjust the time-memory tradeoff: increasing it increases
memory consumption while reducing runtime. Note that this parameter is only relevant for computing non-rank-like
aggregations, since rank-like aggregations (“rxx”, “txx”) can be efficiently computed anyway.
Returns:
Resampled data. Like windows, but with one additional column for each requested aggregation.
Order of columns is arbitrary, order of rows is exactly as in windows – unless windows is a Dask DataFrame,
in which case the order of rows may differ. The output is a (lazy) Dask DataFrame if windows is a Dask
DataFrame, and a Pandas DataFrame otherwise, regardless of what df is.
Return type:
pd.DataFrame | dask.dataframe.DataFrame
Notes
When passing a callable to agg, it is expected to take as input a DataFrame in and return a new DataFrame out.
in has two columns time_col and value_col (in that order). Its row index specifies which entries belong to
the same observation window: entries with the same row index value belong to the same window, entries with
different row index values belong to distinct windows. Observation times are guaranteed to be non-N/A, values may
be N/A. Note, however, that in is not necessarily sorted wrt. its row index and/or observation times! Also note
that the entities the observations in in stem from (if entity_col is specified) are not known to the function.
out should have one row per row index value of in (with the same row index value), and an arbitrary number of
columns with arbitrary names and dtypes. Columns should be consistent in every invocation of the function.
The function is applied to all windows at once rather than to each row-index-value group individually because some
aggregations can be implemented more efficiently using sorting rather than grouping. The function should be stateless
and must not modify its input in place.
Example 1: A simple aggregation which calculates the fraction of values between 0 and 1 in every window could be
passed as
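A minimal sketch of such an aggregation function, assuming only the contract described in the Notes above (two columns, times first and values second; window membership encoded in the row index). The function name and the output column name are hypothetical, and N/A values are counted as lying outside the range:

```python
import pandas as pd


def frac_between_0_and_1(x: pd.DataFrame) -> pd.DataFrame:
    # The second column holds the observed values (the first holds the times).
    values = x.iloc[:, 1]
    # True where 0 <= value <= 1; N/A values evaluate to False.
    inside = values.between(0, 1)
    # One output row per row index value, i.e., per observation window.
    return inside.groupby(level=0).mean().to_frame("frac_0_1")
```

The input is not modified, and the output has the same columns in every invocation, as the contract requires.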
Example 2: A more sophisticated aggregation which fits a linear regression to the observations in every window
and returns the slope of the resulting regression line could be defined as
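A minimal sketch of such a function, assuming that observation times are numeric (or can be cast to float) and using the closed-form least-squares slope instead of a per-window model fit. The function name and the output column name are hypothetical. Entries with N/A values are ignored, and windows with fewer than two distinct observation times yield N/A:

```python
import pandas as pd


def regression_slope(x: pd.DataFrame) -> pd.DataFrame:
    t = x.iloc[:, 0].astype(float)  # observation times
    v = x.iloc[:, 1].astype(float)  # observed values
    t = t.where(v.notna())          # drop times of entries without a value

    # Center times and values within each window (row index value).
    dt = t - t.groupby(level=0).transform("mean")
    dv = v - v.groupby(level=0).transform("mean")

    # Least-squares slope: sum(dt * dv) / sum(dt ** 2), computed per window.
    num = (dt * dv).groupby(level=0).sum()
    den = (dt * dt).groupby(level=0).sum()
    return (num / den).to_frame("slope")
```

All windows are processed at once with grouped reductions, so no Python-level loop over windows is needed.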
Resample interval-like data wrt. explicitly passed windows of arbitrary (possibly infinite) length. “Interval-like”
means that each observation is characterized by a start- and stop time rather than a singular timestamp (as in EAV
data).
Parameters:
df (pd.DataFrame | dask.dataframe.DataFrame) – The DataFrame to resample. Must have columns value_col (contains observed values), start_col (optional;
contains start times), stop_col (optional; contains end times), attribute_col (optional; contains attribute
identifiers) and entity_col (optional; contains entity identifiers). Must have one column index level. Data
types are arbitrary, as long as times and entity identifiers can be compared wrt. < and <= (e.g., float,
int, time delta, date time). Entity identifiers must not be NA. Values must be numeric (float, int, bool).
Observation times and observed values may be NA, but such entries are ignored entirely. Although both
start_col and stop_col are optional, at least one must be present. Missing start- and end columns are
interpreted as -/+ inf. All intervals are closed, i.e., start- and end times are included. This is especially
relevant for entries whose start time equals their end time.
df can be a Dask DataFrame as well. In that case, however, entity_col must not be None and entities should
already be on the row index, with known divisions. Otherwise, the row index is set to entity_col, which can be
very costly both in terms of time and memory. Especially if df is already known to be sorted wrt. entities,
the caller should take care of this beforehand; see
https://docs.dask.org/en/stable/generated/dask.dataframe.DataFrame.set_index.html.
windows (pd.DataFrame | dask.dataframe.DataFrame | callable) – The target windows into which df is resampled. Must have either one or two column index level(s). If it has
one column index level, must have columns start_col (optional; contains start times of each window),
stop_col (optional; contains end times of each window) and entity_col (optional; contains entity
identifiers). If it has two column index levels, the columns must be (time_col, “start”),
(time_col, “stop”) and (entity_col, “”). Start- and end times may be NA, but such windows are deemed
invalid and by definition do not overlap with any observation intervals. At least one of the two
endpoint-columns must be present; if one is missing it is assumed to represent -/+ inf. All time windows are
closed, i.e., start- and end times are included. windows can be a Dask DataFrame as well. In that case,
however, entity_col must not be None and entities should already be on the row index, with known divisions.
Otherwise, the row index is set to entity_col, which can be very costly both in terms of time and memory.
Especially if windows is already known to be sorted wrt. entities, the caller should take care of this
beforehand; see
https://docs.dask.org/en/stable/generated/dask.dataframe.DataFrame.set_index.html.
Alternatively, windows can be a callable that, when applied to a DataFrame and keyword arguments
entity_col, start_col, stop_col, time_col, attribute_col and value_col, returns a DataFrame of the
form described above. The canonical example of such a callable is the result returned by make_windows(); see
the documentation of make_windows() for details.
attributes (list, optional) – The attributes to consider. Must be a list-like of attribute identifiers. None defaults to the list of all such
identifiers present in column attribute_col. If attribute_col is None but attributes is not, it must be a
singleton list.
entity_col (str, optional) – Name of the column in df and windows containing entity identifiers. If None, all entries are assumed to
belong to the same entity. Note that entity identifiers may also be on the row index.
start_col (str, optional) – Name of the column in df (and windows if it has only one column index level) containing start times. If
None, all start times are assumed to be -inf. Note that despite its name the data type of the column is
arbitrary, as long as it supports the following arithmetic- and order operations: -, /, <, <=.
stop_col (str, optional) – Name of the column in df (and windows if it has only one column index level) containing end times. If None,
all end times are assumed to be +inf. Note that despite its name the data type of the column is arbitrary, as
long as it supports the following arithmetic- and order operations: -, /, <, <=.
attribute_col (str, optional) – Name of the column in df containing attribute identifiers. If None, all entries are assumed to belong to the
same attribute.
value_col (str, optional) – Name of the column in df containing the observed values.
time_col (list | str, optional) – Name of the column(s) in windows containing start- and end times of the windows. Only needed if windows
has two column index levels, because otherwise these two columns must be called start_col and stop_col,
respectively.
epsilon – The value to set \(W_I\) to if \(I\) is infinite and \(W \cap I\) is non-empty and finite;
see Notes for details.
Returns:
Resampled data. Like windows, but with one additional column for each attribute, and same number of
column index levels.
Order of columns is arbitrary, order of rows is exactly as in windows – unless windows is a Dask DataFrame, in
which case the order of rows may differ.
The output is a (lazy) Dask DataFrame if windows is a Dask DataFrame, and a Pandas DataFrame otherwise,
regardless of what df is.
Notes
A typical example of interval-like data are medication records, since medications can be administered over
longer time periods.
The only supported resampling aggregation is summing the observed values per time window, scaled by the fraction
of the length of the intersection of observation interval and time window divided by the total length of the
observation interval: Let \(W = [s, t]\) be a time window and let \(I = [a, b]\) be an observation interval
with observed value \(v\). Then \(I\) contributes to \(W\) the value
\(W_I = v * \frac{|W \cap I|}{|I|}\)
The overall value of \(W\) is the sum of \(W_I\) over all intervals. Of course, all this is computed
separately for each entity-attribute combination.
Some remarks on the above equation are in order:
If \(v\) is N/A, \(W_I\) is set to 0.
If \(a = b\) both numerator and denominator are 0. In this case the fraction is defined as 1 if
\(a \in W\) (i.e., \(s \leq a \leq t\)) and 0 otherwise.
If \(I\) is infinite and \(W \cap I\) is non-empty but finite, \(W_I\) is set to
\(epsilon * sign(v)\).
Note that \(W \cap I\) is non-empty even if it is of the form \([x, x]\). This leads to the slightly
counter-intuitive situation that, for such a single-point intersection, \(W_I = epsilon * sign(v)\) if \(I\)
is infinite, whereas \(W_I = 0\) if \(I\) is finite with \(a < b\).
If \(I\) and \(W \cap I\) are both infinite, the fraction is defined as 1. This is regardless of whether
\(W \cap I\) equals \(I\) or whether it is a proper subset of it.
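The rules above can be sketched for a single window and a single observation interval with scalar float endpoints. This is an illustration of the stated definitions, not the library's implementation; the function name is hypothetical, and the default value of epsilon is an arbitrary placeholder rather than the library's default:

```python
import math


def contribution(s, t, a, b, v, epsilon=1e-7):
    """Value W_I that interval I = [a, b] with value v contributes to window W = [s, t]."""
    if v is None or math.isnan(v):
        return 0.0                      # N/A values contribute nothing
    lo, hi = max(s, a), min(t, b)       # endpoints of the intersection of W and I
    if lo > hi:
        return 0.0                      # empty intersection
    if a == b:
        return float(v)                 # single-point interval inside W: fraction is 1
    if math.isinf(b - a):               # I is infinite
        if math.isinf(hi - lo):
            return float(v)             # intersection infinite as well: fraction is 1
        sign = (v > 0) - (v < 0)
        return epsilon * sign           # intersection non-empty but finite
    return v * (hi - lo) / (b - a)      # regular case: v * |W ∩ I| / |I|
```

For example, contribution(0, 10, 5, 15, 2.0) covers half of the interval and yields 1.0, while an interval that merely touches the window, as in contribution(0, 10, 10, 20, 2.0), yields 0.0.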
Convenience function for easily creating windows that can be passed to functions resample_eav() and
resample_interval().
Note that internally, invoking this function does not yet create the actual windows DataFrame. Instead, when the
resulting callable is passed to resample_eav() or resample_interval(), it is applied to the DataFrame to be
resampled. This makes it possible to refer to that DataFrame implicitly here; see the examples below for specific use-cases.
Parameters:
df (pd.DataFrame | str, optional) – Source DataFrame. If None, defaults to the DataFrame to be resampled in resample_eav() or
resample_interval().
Can also be a string, which will be evaluated using Python’s eval() function. The string can contain
references to the DataFrame to be resampled via variable df, and to column-names entity_col, time_col,
start_col and stop_col passed to resample_eav() and resample_interval().
Example: “df.groupby(entity_col)[time_col].max().to_frame()”
entity (pd.Series | pd.Index | str | scalar, optional) – Entity of each window. Series are used as-is (possibly after re-ordering rows to match other row indices),
strings refer to columns in df, and scalars are replicated to populate every window with the same value.
If None, defaults to df[entity_col] if df contains that column.
start (pd.Series | pd.Index | str | scalar, optional) – Start time of each window. Series are used as-is (possibly after re-ordering rows to match other row indices),
strings refer to columns in df, and scalars are replicated to populate every window with the same value.
Note that despite its name the data type of the start times is arbitrary, as long as it supports the following
arithmetic- and order operations: -, /, <, <=.
start and start_rel are mutually exclusive.
stop (pd.Series | pd.Index | str | scalar, optional) – Stop time of each window. Series are used as-is (possibly after re-ordering rows to match other row indices),
strings refer to columns in df, and scalars are replicated to populate every window with the same value.
Note that despite its name the data type of the stop times is arbitrary, as long as it supports the following
arithmetic- and order operations: -, /, <, <=.
stop and stop_rel are mutually exclusive.
start_rel (pd.Series | pd.Index | str | scalar, optional) – Start time of each window, relative to anchor. Series are used as-is (possibly after re-ordering rows to
match other row indices), strings refer to columns in df, and scalars are replicated to populate every window
with the same value. If given, anchor must be given, too.
start and start_rel are mutually exclusive.
stop_rel (pd.Series | pd.Index | str | scalar, optional) – Stop time of each window, relative to anchor. Series are used as-is (possibly after re-ordering rows to
match other row indices), strings refer to columns in df, and scalars are replicated to populate every window
with the same value. If given, anchor must be given, too.
stop and stop_rel are mutually exclusive.
duration (pd.Series | pd.Index | str | scalar, optional) – Duration of each window. Series are used as-is (possibly after re-ordering rows to match other row indices),
strings refer to columns in df, and scalars are replicated to populate every window with the same value.
Durations can only be specified if exactly one endpoint (either start or stop) is specified; the other endpoint
is then computed from duration.
anchor (pd.Series | pd.Index | str | scalar, optional) – Anchor time start_rel and stop_rel refer to. Series are used as-is (possibly after re-ordering rows to
match other row indices), strings refer to columns in df, and scalars are replicated to populate every window
with the same value. Ignored unless start_rel or stop_rel is given.
If start_rel or stop_rel is given but anchor is None, it defaults to time_col, but a warning message is
printed.
Notes
The current implementation does not support Dask DataFrames.
This function does not check whether windows are non-empty, i.e., whether start times come before end times.
Examples
Use-case 1: Create fixed-length windows relative to the time column in the DataFrame to be resampled. Since
anchor is required by start_rel but not set explicitly, it defaults to time_col, but a warning message is
printed.
Use-case 2: Similar to use-case 1, but only create one window per entity, for the temporally last entry. Note
how the DataFrame to be resampled is only passed once directly to function resample_eav(); make_windows()
refers to it implicitly via variable name “df” in the string of keyword argument df. Note also that the
resulting DataFrame may have entities on its row index.
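To make the string expression of use-case 2 more concrete, the sketch below evaluates it by hand with plain pandas on toy data. It does not call make_windows() or resample_eav(). The column names, the toy values and the window length are hypothetical, and computing window endpoints as anchor plus relative offset is an assumption about how start_rel and stop_rel are interpreted:

```python
import pandas as pd

entity_col, time_col = "subject_id", "timestamp"  # hypothetical column names
df = pd.DataFrame({
    entity_col: ["a", "a", "b", "b", "b"],
    time_col: [1.0, 4.0, 2.0, 3.0, 9.0],
    "value": [0.1, 0.2, 0.3, 0.4, 0.5],
})

# The expression from the documentation of parameter `df`: one row per entity,
# holding the time of its last entry; entities end up on the row index.
source = df.groupby(entity_col)[time_col].max().to_frame()

# Assumed semantics: window endpoints are anchor + relative offset.
start_rel, stop_rel = -2.0, 0.0
windows = pd.DataFrame({
    "start": source[time_col] + start_rel,
    "stop": source[time_col] + stop_rel,
})
```

The result has one window per entity, ending at the entity's last observation time, with entities on the row index.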
Use-case 3: make_windows() can be used with function resample_interval(), too – regardless of whether
time_col is passed to resample_interval() or not.
resample_interval(
    df_to_be_resampled,
    make_windows(
        stop=pd.Series(...),
        duration=pd.Series(...),  # must have the same row index as the Series passed to `stop`
    ),
    start_col=...,
    stop_col=...,
    time_col=...,  # optional
    ...
)
Group intervals wrt. their temporal distance to each other. Intervals can also be isolated points, i.e.,
single-point intervals of the form [x, x].
Parameters:
df (DataFrame) – DataFrame with intervals.
group_by (optional) – Additional column(s) to group df by, optional. If given, the computed grouping refines the given one, in the
sense that any two intervals belonging to the same computed group are guaranteed to belong to the same given
group, too. Can be the name of a single column or a list of column names and/or row index levels. Strings are
interpreted as column names or row index names, integers are interpreted as row index levels.
time_col (str, optional) – Name of the column in df containing both start- and end times of single-point intervals. If given, both
start_col and stop_col must be None.
start_col (str, optional) – Name of the column in df containing start times of intervals. If given, time_col must be None.
stop_col (str, optional) – Name of the column in df containing end times of intervals. If given, time_col must be None. Note that the
function tacitly assumes that no interval ends before it starts, although this is not checked. If this
assumption is violated, the returned results may not be correct.
distance (optional) – Maximum allowed distance between two intervals for being put into the same group. Should be non-negative.
The distance between two intervals is the single-linkage distance, i.e., the minimum distance between any two
points in the respective intervals. This means, for example, that the distance between overlapping intervals is
always 0.
inclusive (bool, default=False) – Whether distance is inclusive.
Notes
The returned grouping is the reflexive-transitive closure of the proximity relation induced by distance.
Formally: Let \(R\) be the binary relation on the set of intervals in df such that \(R(I_1, I_2)\) holds
iff the distance between \(I_1\) and \(I_2\) is less than (or equal to) distance (and additionally
\(I_1\) and \(I_2\) belong to the same groups specified by group_by). \(R\) is obviously symmetric,
so its reflexive-transitive closure \(R^*\) is an equivalence relation on the set of intervals in df. The
returned grouping corresponds precisely to this equivalence relation, in the sense that there is one group per
equivalence class and vice versa.
Note that if two intervals belong to the same group, their distance may still be larger than distance.
Returns:
Series with the same row index as df, in the same order, whose values are group indices.
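The grouping described in the Notes can be sketched with a single sorted sweep: after sorting intervals by start time, a new group begins whenever the gap to the largest end time seen so far exceeds the allowed distance. This is an illustrative sketch under simplifying assumptions (non-empty df, numeric times, no group_by), not the library's implementation; the function name is hypothetical and the actual group indices returned by the library may differ:

```python
import numpy as np
import pandas as pd


def group_intervals_sketch(df, start_col, stop_col, distance, inclusive=False):
    order = np.argsort(df[start_col].to_numpy(), kind="stable")
    start = df[start_col].to_numpy()[order]
    stop = df[stop_col].to_numpy()[order]

    # Largest end time among all intervals preceding each interval.
    prev_stop = np.maximum.accumulate(stop)[:-1]
    # Single-linkage distance to the preceding intervals; overlaps have distance 0.
    gap = np.maximum(start[1:] - prev_stop, 0)

    # A new group starts where the gap is too large to be bridged.
    new_group = gap > distance if inclusive else gap >= distance
    labels_sorted = np.concatenate([[0], np.cumsum(new_group)])

    # Map group labels back to the original row order.
    labels = np.empty(len(df), dtype=int)
    labels[order] = labels_sorted
    return pd.Series(labels, index=df.index)
```

With intervals [0, 1], [2, 3] and [4, 5], distance=1 and inclusive=True, all three end up in one group although the first and the last are 3 apart, which illustrates the remark that members of the same group may be farther apart than distance.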
Find the previous/next values of some columns in DataFrame df, for every entry. Additionally, entries can be
grouped and previous/next values only searched within each group.
Parameters:
df (DataFrame) – The DataFrame.
sort_by (list | str, optional) – The column(s) to sort by. Can be the name of a single column or a list of column names and/or row index levels.
Strings are interpreted as column names or row index names, integers are interpreted as row index levels.
ATTENTION! N/A values in columns to sort by are not ignored; rather, they are treated in the same way as Pandas
treats such values in DataFrame.sort_values(), i.e., they are put at the end.
group_by (list | str, optional) – Column(s) to group df by, optional. Same values as sort_by.
prev_name and next_name are the names of the columns in the result, containing the previous/next values.
If any of them is None, the corresponding previous/next values are not computed for that column.
prev_fill and next_fill specify which values to assign to the first/last entry in every group, which does
not have any previous/next values.
Note that column names not present in df are tacitly skipped.
first_indicator_name (str, optional) – Name of the column in the result containing boolean indicators whether the corresponding entries come first in
their respective groups. If None, no such column is added.
last_indicator_name (str, optional) – Name of the column in the result containing boolean indicators whether the corresponding entries come last in
their respective groups. If None, no such column is added.
keep_sorted (bool, default=False) – Whether to keep the result sorted wrt. group_by and sort_by. If False, the order of rows of the result is
identical to that of df.
inplace (bool, default=False) – If True, the new columns are added to df.
Returns:
The modified DataFrame if inplace is True, a DataFrame with the requested previous/next values otherwise.
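The core idea can be sketched with plain pandas: sort, group, shift within each group, and restore the original row order. This is a simplified illustration for a single column, assuming a unique row index; the function name and the naming of the output columns are hypothetical and do not reflect the library's signature:

```python
import pandas as pd


def prev_next_sketch(df, sort_by, group_by, column):
    # Sort once; N/A sort keys are put at the end, as in DataFrame.sort_values().
    ordered = df.sort_values(sort_by, kind="stable")
    grouped = ordered.groupby(group_by, sort=False)[column]
    out = pd.DataFrame({
        f"{column}_prev": grouped.shift(1),   # N/A for the first entry of each group
        f"{column}_next": grouped.shift(-1),  # N/A for the last entry of each group
    })
    # Restore the row order of the input (requires a unique row index).
    return out.loc[df.index]
```

Returning the rows in the order of df corresponds to the behavior described for keep_sorted=False.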
Class for performing bootstrapping [1], i.e., repeatedly sample with replacement from given data and evaluate
statistics on each resample to obtain mean, standard deviation, etc. for more robust estimates.
Parameters:
*args (DataFrame | Series | ndarray) – Data, non-empty sequence of DataFrames, Series or arrays of the same length.
kwargs (dict, optional) – Additional keyword arguments passed to the function fn computing the statistics. Like args, the values
of the dict must be DataFrames, Series or arrays of the same length as the elements of args.
fn (optional) – The statistics to compute. Must be None, a function that takes the given args as input and returns a
scalar/array/DataFrame/Series or a (nested) dict/tuple thereof, or a (nested) dict/tuple of such functions.
seed (int, optional) – Random seed.
replace (bool, default=True) – Whether to resample with replacement. If False, this does not actually correspond to bootstrapping.
size (int | float, default=1.) – The size of the resampled data. If <= 1, it is multiplied by the number of samples in the given data.
Bootstrapping normally assumes that resampled data have the same number of samples as the original data, so
this parameter should be set to 1.
Run bootstrapping for a given number of repetitions, and store the results in a list. Results are appended to
results from previous runs!
Parameters:
n_repetitions (int, default=100) – Number of repetitions.
sample_indices (ndarray, optional) – Pre-computed sample indices to use in each repetition. If not None, n_repetitions is ignored and
sample_indices must have shape (n, size).
Describe the results of the individual runs by computing a predefined set of statistics, similar to pandas’
describe() method. Only works for (dicts/tuples of) scalar values.
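The bootstrapping procedure itself can be sketched in a few lines of NumPy. This is a stand-alone illustration for a single array and a single statistic, not the class documented above; the function name is hypothetical, while the parameter names mirror those described above:

```python
import numpy as np


def bootstrap_sketch(x, fn, n_repetitions=100, seed=None, replace=True, size=1.0):
    rng = np.random.default_rng(seed)
    x = np.asarray(x)
    n = len(x)
    # Values <= 1 are interpreted as a fraction of the number of samples.
    k = int(round(size * n)) if size <= 1 else int(size)
    results = []
    for _ in range(n_repetitions):
        idx = rng.choice(n, size=k, replace=replace)  # resample row indices
        results.append(fn(x[idx]))                    # evaluate the statistic
    return np.asarray(results)
```

Summary statistics such as the mean and standard deviation of the returned array then describe the variability of the statistic across resamples.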
Summarize the performance of multiple prediction models trained and evaluated with CaTabRa. This is a convenient
way for quickly comparing them and selecting the best model(s) for a certain task. An implicit assumption of this
function is that all models were trained on the same prediction task.
IMPORTANT: Only pre-evaluated metrics in “metrics.xlsx” and “bootstrapping.xlsx” are considered!
Parameters:
directories (Iterable[str | Path]) – The directories to consider, an iterable of path-like objects. Each directory must be the output directory of
an invocation of catabra.evaluate, or a subdirectory corresponding to a specific split (containing
“metrics.xlsx” and maybe also “bootstrapping.xlsx”). A convenient way to specify a couple of directories
matching a certain pattern is by using Path(root_path).rglob(pattern).
metrics (Iterable[str]) –
List of metrics to include in the summary, an iterable of strings. Values must match the following pattern:
target is optional and specifies the target (or class in case of multiclass classification); can be “*” to
include all available targets, and can be a sequence separated by “,”. Ignored if
bootstrapping_aggregation is specified.
metric_name is the name of the actual metric, exactly as written in “metrics.xlsx” or “bootstrapping.xlsx”;
can be “*” to include all available pre-evaluated metrics, and can be a sequence separated by “,”.
threshold is optional and must be a numeral between 0 and 1 (cannot be a string like “balance”), and cannot
be “*”. Only relevant for threshold-dependent classification metrics, and mutually exclusive with
bootstrapping_aggregation. Note that the given threshold must exactly match one of the thresholds
evaluated in “metrics.xlsx”.
bootstrapping_aggregation is optional and specifies the bootstrapping aggregation to include, like “mean”,
“std”, etc.; can be “*” to include all available pre-evaluated aggregations in “bootstrapping.xlsx”, and
can be a sequence separated by “,”.
split (Iterable[str], default=None) – If a directory in directories has subdirectories corresponding to data splits that were evaluated separately,
only include the splits in split. If None, all splits are included.
path_callback (Callable, default=None) – Callback function applied to every path visited. Must return None, True, False or a dict; False indicates that
the current path should be dropped from the output, True and None are aliases for {}, and a dict adds a
column for every key to the output DataFrame, with the corresponding values in them.
Returns:
DataFrame with one row per evaluation and one column per performance metric. If multiple splits are included in
the performance summary, each is put into a separate row.
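A callback of the kind described for path_callback might look as follows. This is a sketch under the assumption that the callback receives the visited path as its only argument; the function name, the directory name "debug" and the column name "experiment" are hypothetical:

```python
from pathlib import Path


def keep_and_label(path):
    path = Path(path)
    # Returning False drops this path from the output.
    if "debug" in path.parts:
        return False
    # Returning a dict adds one column per key to the output DataFrame.
    return {"experiment": path.name}
```

Returning None or True instead of a dict would keep the path without adding any columns.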
Summarize the feature importance of multiple prediction models trained and explained with CaTabRa. This is a
convenient way for quickly comparing them. An implicit assumption of this function is that all models were trained
on the same prediction task, and that the same feature importance calculation method was applied to generate the
importance scores.
IMPORTANT: Only pre-evaluated feature importance scores are considered!
Parameters:
directories (Iterable[str | Path]) – The directories to consider, an iterable of path-like objects. Each directory must be the output directory of
an invocation of catabra.explain, or a subdirectory corresponding to a specific split (containing HDF5 files
with feature importance scores). A convenient way to specify a couple of directories matching a certain pattern
is by using Path(root_path).rglob(pattern).
columns (Iterable[str], default=None) – The columns in global feature importance scores to consider. For instance, if
catabra.explanation.average_local_explanations() is used to produce global scores, 4 columns “>0”, “<0”,
“>0 std” and “<0 std” are normally generated. This parameter makes it possible to include only a subset in the summary.
None defaults to all columns.
new_column_name (str) – String pattern specifying the names of the columns in the output DataFrame. May have two named fields feature
and column, which are filled with original feature- and column names, respectively.
glob (bool) – Whether feature importance scores in directories are global. If not,
catabra.explanation.average_local_explanations() is applied.
split (Iterable[str], default=None) – If a directory in directories has subdirectories corresponding to data splits that were explained separately,
only include the splits in split. If None, all splits are included.
model_id (Iterable[str], default=None) – Model-IDs to consider, optional. Determines the names of the HDF5 files to be included. None defaults to all
found model-IDs.
path_callback (Callable, default=None) – Callback function applied to every path visited. Must return None, True, False or a dict; False indicates that
the current path should be dropped from the output, True and None are aliases for {}, and a dict adds a
column for every key to the output DataFrame, with the corresponding values in them.
Returns:
DataFrame with one row per explanation and one column per feature-column pair. If multiple splits are included
in the importance summary, each is put into a separate row. If there are multiple targets (multiclass/multilabel
classification, multioutput regression) and the feature importance scores for each target are stored in a
separate table, each is put into a separate row and an additional column “__target__” is added.
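The two named fields of new_column_name can be illustrated with Python's built-in string formatting. The pattern and the feature name below are hypothetical examples, and filling the pattern via str.format() is an assumption about how the function processes it:

```python
# Hypothetical pattern with the two named fields `feature` and `column`.
new_column_name = "{feature} {column}"

# Resulting output column name for feature "age" and importance column ">0".
name = new_column_name.format(feature="age", column=">0")
```

With this pattern, each feature-column pair is mapped to a distinct output column name.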