Read a DataFrame from a CSV, Excel, HDF5, Pickle or Parquet file. The file type is determined from the file
extension of the given file.
Parameters:
fn (str | Path) – The file to read.
key (str | Iterable[str], default='table') – The key(s) in the HDF5 file, if fn is an HDF5 file. Defaults to “table”. If an iterable, all keys are read and
concatenated along the row axis.
Write a dict of DataFrames to file. The file type is determined from the file extension of the given file.
Unless fn is an Excel or HDF5 file, dfs must be empty or a singleton.
Parameters:
dfs (dict) – The DataFrames to write. If empty and mode differs from “a”, the file is deleted.
fn (str | Path) – The target file name.
mode (str, default='w') – The mode in which the file shall be opened, if fn is an Excel- or HDF5 file. Ignored otherwise.
Load a Python object from disk. The object can be stored in JSON, Pickle or joblib format. The format is
automatically determined based on the given file extension:
Dump a Python object to disk, either as a JSON, Pickle or joblib file. The format is determined automatically based
on the given file extension:
“.json” => JSON
“.pkl”, “.pickle” => Pickle
“.joblib” => joblib
Parameters:
obj – The object to dump.
fn (str | Path) – The file.
Notes
When dumping objects as JSON, calling to_json() beforehand might be necessary to ensure compliance with the JSON
standard. joblib is preferred over Pickle, as it is more efficient if the object contains large Numpy arrays.
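The extension-based dispatch described above can be sketched as follows. This is a minimal illustration and not the library's implementation: the names dump_sketch and load_sketch are hypothetical, and only the JSON and Pickle branches are shown (a “.joblib” branch would delegate to joblib.dump and joblib.load in the same way).

```python
import json
import pickle
from pathlib import Path


def dump_sketch(obj, fn):
    """Write obj to fn, choosing the format from the file extension."""
    fn = Path(fn)
    ext = fn.suffix.lower()
    if ext == ".json":
        with open(fn, "w", encoding="utf-8") as f:
            json.dump(obj, f)
    elif ext in (".pkl", ".pickle"):
        with open(fn, "wb") as f:
            pickle.dump(obj, f)
    else:
        raise ValueError(f"Unsupported file extension: {ext!r}")


def load_sketch(fn):
    """Read an object from fn, choosing the format from the file extension."""
    fn = Path(fn)
    ext = fn.suffix.lower()
    if ext == ".json":
        with open(fn, "r", encoding="utf-8") as f:
            return json.load(f)
    if ext in (".pkl", ".pickle"):
        with open(fn, "rb") as f:
            return pickle.load(f)
    raise ValueError(f"Unsupported file extension: {ext!r}")
```

Only load Pickle files from trusted sources, since unpickling can execute arbitrary code.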
Convert rows (indexed via rowindex_to_convert) to str. Mainly used when saving DataFrames, to avoid missing values
in .xlsx files for data types such as timedelta.
Parameters:
d (dict | DataFrame) – Single DataFrame or dictionary of dataframes
rowindex_to_convert (list) – List of row indices (e.g., features), that should be converted to str
inplace (bool, default=True) – Determines if changes will be made to input data or a deep-copy of it
skip (list, default=[]) – List of column(s) that should not be converted to string
Returns:
Modified (str-converted rows) single DataFrame or dictionary of DataFrames.
Get the trained prediction model as a FittedEnsemble object.
Parameters:
from_model (bool, default=False) – Whether to convert a plain model of type AutoMLBackend into a FittedEnsemble object, if such an object does
not exist in the directory.
accepted (list, optional) – List of accepted inputs. Must be lower-case. If None, all inputs are accepted.
allow_headless (bool, default=True) – What to do in headless mode. If True, the first element in accepted is returned if accepted is a list and
“” is returned if accepted is None. If False, a RuntimeError is raised.
Returns:
The input of the user, an element of accepted if accepted is a list, or arbitrary if accepted is None.
Show a simple progress bar when iterating over a given iterable. This works similarly to the tqdm package, but in
contrast to tqdm it also works when mirroring messages to a file.
Parameters:
iterable – The iterable.
desc (str, optional) – Description to add to the beginning of the progress bar, optional.
total (int, optional) – Total number of elements in iterable if iterable does not implement the __len__() method.
disable (bool, default=False) – Whether to disable the progress bar. If True, the behavior is equivalent to not calling this function at all.
meter_width (int, default=40) – The width of the meter, in characters. Should not be too long to make the whole progress bar fit into a single
line. Might have to be decreased if desc is a long text.
Used to temporarily mirror both stderr and stdout to a log file. Based on [1] and [2].
Examples
>>> with LogMirror("log.txt"):
...     log("writing to log.txt and the console")
...     err("works with errors as well")
...     warn("and in case you need warnings")
...     print("no need to use the custom log functions")
Create a fresh name based on name, i.e., a name that does not appear in lst.
Parameters:
name – An arbitrary object. If a list, tuple or set, all elements of name are processed individually, and they are
ensured to be distinct from each other.
lst (Iterable) – A list-like structure.
Returns:
If name does not appear in lst, name is returned as-is. Otherwise, a numeric suffix is added to the string
representation of name.
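A minimal sketch of the behavior described above, for a single (non-collection) name. The function name fresh_name_sketch and the suffix format “_1”, “_2”, ... are assumptions made for illustration; the actual suffix format is not specified here.

```python
def fresh_name_sketch(name, lst):
    """Return name if it does not appear in lst, else name with a numeric suffix."""
    taken = set(lst)  # assumes the elements of lst are hashable
    if name not in taken:
        return name  # returned as-is, without conversion to str
    i = 1
    # Assumed suffix format: "<name>_<i>", using the smallest unused i.
    while f"{name}_{i}" in taken:
        i += 1
    return f"{name}_{i}"
```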
Return a string representation of some time delta.
Minutes and seconds are always displayed, hours and days only if needed. Format is “d days hh:mm:ss”.
Parameters:
delta – Time delta to represent, either a float or an object with a total_seconds() method (e.g., a pandas Timedelta
instance). Floats are assumed to be given in seconds.
subsecond_resolution (int, default=0) – The subsecond resolution to display, i.e., number of decimal places.
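The formatting rule can be illustrated with the following sketch. The name format_delta_sketch is hypothetical, and details such as zero-padding and the treatment of negative values are assumptions rather than documented behavior.

```python
def format_delta_sketch(delta, subsecond_resolution=0):
    """Format a time delta as "d days hh:mm:ss", omitting days and hours if zero."""
    # Accept floats (seconds) as well as objects with a total_seconds() method.
    seconds = delta.total_seconds() if hasattr(delta, "total_seconds") else float(delta)
    sign = "-" if seconds < 0 else ""
    # Round first, so that e.g. 59.999 s is carried over into the minutes.
    seconds = round(abs(seconds), subsecond_resolution)
    days, rem = divmod(seconds, 86400)
    hours, rem = divmod(rem, 3600)
    minutes, secs = divmod(rem, 60)
    width = 2 if subsecond_resolution == 0 else 3 + subsecond_resolution
    text = f"{int(minutes):02d}:{secs:0{width}.{subsecond_resolution}f}"
    if days or hours:
        text = f"{int(hours):02d}:{text}"
    if days:
        text = f"{int(days)} days {text}"
    return sign + text
```

For instance, 75 seconds are rendered as “01:15”, and 90000 seconds as “1 days 01:00:00”.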
fig – The figure(s) to save. May be a Matplotlib figure object, a plotly figure object, or a dict whose values are
such figure objects.
fn (str | Path) – The file or directory. It is recommended to leave the file extension unspecified and simply pass
“/path/to/figure” instead of “/path/to/figure.png”. The file extension is then determined automatically
depending on the type of fig and on the value of png. If fig is a dict, fn refers to the parent
directory.
png (bool, default=False) – Whether to save Matplotlib figures as PNG or as PDF. Ignored if a file extension is specified in fn or if
fig is a plotly figure, which are always saved as HTML.
Return an averageable variant of a given metric, i.e., a function that accepts parameter average with possible
values None, “binary”, “micro”, “macro”, “weighted”, “samples” and, optionally, “global”. Apart from “binary” and
“global”, averaging is taken care of by the new metric; the original metric only needs to handle binary
classification tasks.
Parameters:
func (callable) – Metric to make averageable, callable that accepts y_true and y_pred and returns a scalar value.
accepts_global (bool, optional) – What to do if average is set to “global”: if true, func is simply called on the provided arguments;
otherwise, a ValueError is raised.
Return the micro-averaged variant of a given classification metric, i.e., a new metric that returns
micro-averaged results.
Micro-averaging amounts to counting the total number of true and false positives and negatives across all classes,
and computing the metric value wrt. these numbers.
Parameters:
func (callable) – Base metric, callable that accepts y_true and y_pred and returns a scalar value.
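Micro-averaging can be sketched in plain Python as follows. Note that, for simplicity, the base metric in this sketch takes the four counts (tp, fp, tn, fn) rather than y_true and y_pred; the helper names class_counts, precision and micro_average are hypothetical.

```python
def class_counts(y_true, y_pred, cls):
    """Return (tp, fp, tn, fn) for one class in a one-vs-rest view."""
    tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
    fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
    fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
    tn = len(y_true) - tp - fp - fn
    return tp, fp, tn, fn


def precision(tp, fp, tn, fn):
    # Returning 0.0 for an undefined value is an assumption of this sketch.
    return tp / (tp + fp) if tp + fp else 0.0


def micro_average(metric, y_true, y_pred):
    """Sum the counts over all classes, then evaluate the metric once."""
    totals = [0, 0, 0, 0]
    for cls in sorted(set(y_true) | set(y_pred)):
        for i, count in enumerate(class_counts(y_true, y_pred, cls)):
            totals[i] += count
    return metric(*totals)


y_true = [0, 1, 2, 2, 1, 0]
y_pred = [0, 2, 2, 2, 1, 1]
micro_precision = micro_average(precision, y_true, y_pred)  # 4 TP, 2 FP in total
```

In single-label multiclass problems, micro-averaged precision coincides with accuracy (here 4 of 6 samples are correct).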
Return the weighted-averaged variant of a given classification metric, i.e., a new metric that returns
weighted-averaged results.
Weighted-averaging amounts to computing the metric value for each class/label individually, and then returning the
weighted mean of these values. Weights correspond to class/label support.
Parameters:
func (callable) – Base metric, callable that accepts y_true and y_pred and returns a scalar value.
Notes
This corresponds to metric(…, average="weighted").
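Weighted averaging can be sketched as follows. As a simplification, the base metric takes the counts (tp, fp, tn, fn) of a single class; all helper names are hypothetical.

```python
def class_counts(y_true, y_pred, cls):
    """Return (tp, fp, tn, fn) for one class in a one-vs-rest view."""
    tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
    fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
    fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
    tn = len(y_true) - tp - fp - fn
    return tp, fp, tn, fn


def precision(tp, fp, tn, fn):
    # Returning 0.0 for an undefined value is an assumption of this sketch.
    return tp / (tp + fp) if tp + fp else 0.0


def weighted_average(metric, y_true, y_pred):
    """Evaluate the metric per class and weight each value by class support."""
    total = 0.0
    for cls in sorted(set(y_true)):
        support = sum(t == cls for t in y_true)
        total += support * metric(*class_counts(y_true, y_pred, cls))
    return total / len(y_true)


y_true = [0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 1, 1, 2, 2]
# Per-class precision is 1.0, 0.5, 0.5 with supports 3, 2, 1.
weighted_precision = weighted_average(precision, y_true, y_pred)
```

The result is (3 * 1.0 + 2 * 0.5 + 1 * 0.5) / 6 = 0.75, whereas the unweighted (macro) mean would be 2/3.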
Convenience function for converting a metric into a (possibly different) metric that returns scores (i.e.,
higher values correspond to better results). That means, if the given metric returns scores already, it is returned
unchanged. Otherwise, it is negated.
Parameters:
func (callable) – The metric to convert, e.g., accuracy, balanced_accuracy, etc. Note that in case of classification metrics,
both thresholded and non-thresholded metrics are accepted.
errors (str, default="ignore") –
What to do if the polarity of func cannot be determined:
The name of the requested metric function. It must be of the form
“name [@ threshold] [(agg : n_reps)]”
where name is the name of a recognized metric and the threshold and agg/n_reps parts are optional.
If threshold is specified, name must be the name of a thresholded classification metric (e.g., “accuracy”)
and threshold must be either a specific numerical threshold or the name of a thresholding strategy; see
function thresholded() for details.
If agg and n_reps are specified, the bootstrapped metric with n_reps repetitions and aggregation agg
is returned.
If both a threshold and bootstrapping are specified, the threshold must be specified first.
Note that some synonyms are recognized as well, most notably “precision” for “positive_predictive_value” and
“recall” for “sensitivity”.
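To illustrate the string format, here is a small parser sketch based on a regular expression. It only splits a specification into its parts; it does not resolve metric names or synonyms. The function name parse_metric_spec and the returned dict layout are assumptions made for this example.

```python
import re

# name, optional "@ threshold", optional "(agg : n_reps)", in that order
_SPEC = re.compile(
    r"^\s*(?P<name>[^@(]+?)\s*"
    r"(?:@\s*(?P<threshold>[^(]+?)\s*)?"
    r"(?:\(\s*(?P<agg>[^:()]+?)\s*:\s*(?P<n_reps>\d+)\s*\))?"
    r"\s*$"
)


def parse_metric_spec(spec):
    """Split "name [@ threshold] [(agg : n_reps)]" into its components."""
    match = _SPEC.match(spec)
    if match is None:
        raise ValueError(f"Invalid metric specification: {spec!r}")
    threshold = match.group("threshold")
    if threshold is not None:
        try:
            threshold = float(threshold)  # numerical threshold
        except ValueError:
            pass  # otherwise: name of a thresholding strategy
    n_reps = match.group("n_reps")
    return {
        "name": match.group("name"),
        "threshold": threshold,
        "agg": match.group("agg"),
        "n_reps": int(n_reps) if n_reps is not None else None,
    }
```

For example, “accuracy @ 0.5 (mean : 100)” is split into the name “accuracy”, the threshold 0.5, the aggregation “mean” and 100 repetitions.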
Convenience function for converting a metric into its bootstrapped version.
Parameters:
func (callable) – The metric to convert, e.g., roc_auc, accuracy, mean_squared_error, etc.
n_repetitions (int, default=100) – Number of bootstrapping repetitions to perform. If 0, func is returned unchanged.
agg (str | callable, default='mean') – Aggregation to compute of bootstrapping results.
seed (int, optional) – Random seed.
replace (bool, default=True) – Whether to resample with replacement. If False, this does not actually correspond to bootstrapping.
size (int | float, default=1.) – The size of the resampled data. If <= 1, it is multiplied by the number of samples in the given data.
Bootstrapping normally assumes that resampled data have the same number of samples as the original data, so
this parameter should be set to 1.
**kwargs – Additional keyword arguments that are passed to func upon application. Note that only arguments that do
not need to be resampled can be passed here; in particular, this excludes sample_weight.
Returns:
New metric that, when applied to y_true and y_hat, resamples the data, evaluates the metric on each
resample, and returns some aggregation (typically average) of the results thus obtained.
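The resampling logic can be sketched in plain Python as follows. This is a simplified illustration under stated assumptions: the name bootstrapped_sketch is hypothetical, agg must be a callable, resampling is always done with replacement, and the size and **kwargs parameters are omitted.

```python
import random
import statistics


def bootstrapped_sketch(func, n_repetitions=100, agg=statistics.mean, seed=None):
    """Return a metric that evaluates func on bootstrap resamples and aggregates."""
    if n_repetitions == 0:
        return func  # nothing to do, as described above

    def metric(y_true, y_hat):
        rng = random.Random(seed)
        n = len(y_true)
        results = []
        for _ in range(n_repetitions):
            # Draw n indices with replacement and apply them to both arrays.
            idx = [rng.randrange(n) for _ in range(n)]
            results.append(func([y_true[i] for i in idx], [y_hat[i] for i in idx]))
        return agg(results)

    return metric


def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


boot_accuracy = bootstrapped_sketch(accuracy, n_repetitions=20, seed=0)
```

Metrics that are undefined on some resamples (e.g., ROC-AUC if a resample contains only one class) need extra care, which this sketch does not provide.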
Compute the balance score and balance threshold of a binary classification problem.
Parameters:
y_true (array-like) – Ground truth, with 0 representing the negative class and 1 representing the positive class. Must not contain
NaN.
y_score (array-like) – Predicted scores, i.e., the higher a score the more confident the model is that the sample belongs to the
positive class. Range is arbitrary.
Returns:
balance_score (float) – Sensitivity at balance_threshold, which by definition is approximately equal to specificity and can
furthermore be shown to be approximately equal to accuracy and balanced accuracy, too.
balance_threshold (float) – Decision threshold that minimizes the difference between sensitivity and specificity, i.e.,
arg min_t |sensitivity(y_true, y_score >= t) - specificity(y_true, y_score >= t)|
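A brute-force sketch of the balance threshold, trying every distinct score as a candidate. The function names are hypothetical, sample weights are not supported, and y_true is assumed to contain both classes.

```python
def sensitivity_specificity(y_true, y_score, t):
    """Sensitivity and specificity of the predictions y_score >= t."""
    tp = sum(1 for y, s in zip(y_true, y_score) if y == 1 and s >= t)
    tn = sum(1 for y, s in zip(y_true, y_score) if y == 0 and s < t)
    n_pos = sum(y_true)
    return tp / n_pos, tn / (len(y_true) - n_pos)


def balance_score_threshold_sketch(y_true, y_score):
    """Return (sensitivity, threshold) where |sensitivity - specificity| is minimal."""
    best = None
    for t in sorted(set(y_score)):
        sens, spec = sensitivity_specificity(y_true, y_score, t)
        diff = abs(sens - spec)
        if best is None or diff < best[0]:  # ties: the smallest threshold wins
            best = (diff, sens, t)
    return best[1], best[2]


y_true = [0, 0, 0, 1, 0, 1, 1, 1]
y_score = [0.1, 0.2, 0.3, 0.35, 0.6, 0.7, 0.8, 0.9]
balance_score, balance_threshold = balance_score_threshold_sketch(y_true, y_score)
```

In this example, sensitivity and specificity are both 0.75 at threshold 0.6.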
Compute the prevalence score and prevalence threshold of a binary classification problem.
Parameters:
y_true (array-like) – Ground truth, with 0 representing the negative class and 1 representing the positive class. Must not contain
NaN.
y_score (array-like) – Predicted scores, i.e., the higher a score the more confident the model is that the sample belongs to the
positive class. Range is arbitrary.
Returns:
prevalence_score (float) – Sensitivity at prevalence_threshold, which can be shown to be approximately equal to positive predictive
value and F1-score.
prevalence_threshold (float) – Decision threshold that minimizes the difference between the number of positive samples in y_true (m) and
the number of predicted positives. In other words, the threshold is set to the m-th largest value in
y_score. If sample_weight is given, the threshold minimizes the difference between the total weight of all
positive samples and the total weight of all samples predicted positive.
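In the unweighted case, the prevalence threshold is simply the m-th largest score. A minimal sketch, with a hypothetical function name and assuming at least one positive sample:

```python
def prevalence_score_threshold_sketch(y_true, y_score):
    """Return (sensitivity, threshold), with threshold the m-th largest score."""
    m = sum(y_true)  # number of positive samples, assumed >= 1
    threshold = sorted(y_score, reverse=True)[m - 1]
    tp = sum(1 for y, s in zip(y_true, y_score) if y == 1 and s >= threshold)
    return tp / m, threshold


y_true = [0, 0, 0, 1, 0, 1, 1, 1]
y_score = [0.1, 0.2, 0.3, 0.35, 0.6, 0.7, 0.8, 0.9]
prevalence_score, prevalence_threshold = prevalence_score_threshold_sketch(y_true, y_score)
```

Here m = 4, so the threshold is the fourth-largest score, 0.6; three of the four positive samples lie at or above it.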
Compute the threshold corresponding to the (0,1)-criterion [1] of a binary classification problem.
Although a popular strategy for selecting decision thresholds, [1] advocates maximizing informedness (aka Youden
index) instead, which is equivalent to maximizing balanced accuracy.
Parameters:
y_true (array-like) – Ground truth, with 0 representing the negative class and 1 representing the positive class. Must not contain
NaN.
y_score (array-like) – Predicted scores, i.e., the higher a score the more confident the model is that the sample belongs to the
positive class. Range is arbitrary.
specificity_weight (float, default=1.) – The relative weight of specificity with respect to sensitivity. 1 means that sensitivity and specificity are weighted
equally, a value < 1 means that sensitivity is weighted more strongly than specificity, and a value > 1 means that
specificity is weighted more strongly than sensitivity. See the formula below for details.
Returns:
threshold – Decision threshold that minimizes the Euclidean distance between the point (0, 1) and the point
(1 - specificity, sensitivity), i.e., arg min_t (1 - sensitivity(y_true, y_score >= t)) ** 2 +
specificity_weight * (1 - specificity(y_true, y_score >= t)) ** 2
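The objective above can be evaluated directly for every candidate threshold. The following brute-force sketch uses hypothetical function names and assumes that y_true contains both classes.

```python
def sensitivity_specificity(y_true, y_score, t):
    """Sensitivity and specificity of the predictions y_score >= t."""
    tp = sum(1 for y, s in zip(y_true, y_score) if y == 1 and s >= t)
    tn = sum(1 for y, s in zip(y_true, y_score) if y == 0 and s < t)
    n_pos = sum(y_true)
    return tp / n_pos, tn / (len(y_true) - n_pos)


def zero_one_threshold_sketch(y_true, y_score, specificity_weight=1.0):
    """Return the threshold minimizing the weighted squared distance to (0, 1)."""
    best = None
    for t in sorted(set(y_score)):
        sens, spec = sensitivity_specificity(y_true, y_score, t)
        dist = (1 - sens) ** 2 + specificity_weight * (1 - spec) ** 2
        if best is None or dist < best[0]:  # ties: the smallest threshold wins
            best = (dist, t)
    return best[1]


y_true = [0, 0, 0, 1, 0, 1, 1, 1]
y_score = [0.1, 0.2, 0.3, 0.35, 0.6, 0.7, 0.8, 0.9]
t_favor_specificity = zero_one_threshold_sketch(y_true, y_score, specificity_weight=2.0)
t_favor_sensitivity = zero_one_threshold_sketch(y_true, y_score, specificity_weight=0.5)
```

With a weight of 2 the higher threshold 0.7 (perfect specificity) is selected, with a weight of 0.5 the lower threshold 0.35 (perfect sensitivity).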
Compute the decision threshold that maximizes a given binary classification metric or callable.
Since in most built-in classification metrics larger values indicate better results, there is no analogous
argmin_score_threshold().
Parameters:
func (callable) – The metric or function to maximize. If a string, function get() is called on it.
y_true (array-like) – Ground truth, with 0 representing the negative class and 1 representing the positive class. Must not contain
NaN.
y_score (array-like) – Predicted scores, i.e., the higher a score the more confident the model is that the sample belongs to the
positive class. Range is arbitrary.
discretize (int, default=100) – Discretization steps for limiting the number of calls to func. If None, no discretization happens, i.e., all
unique values in y_score are tried.
**kwargs – Additional keyword arguments passed to func.
Returns:
score (float) – Value of func at threshold.
threshold (float) – Decision threshold that maximizes func, i.e., arg max_t func(y_true, y_score >= t, **kwargs)
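A sketch of the exhaustive search corresponding to discretize=None, i.e., trying all unique values in y_score. The function name is hypothetical, and the tie-breaking rule (smallest threshold wins) is an assumption.

```python
def argmax_score_threshold_sketch(func, y_true, y_score, **kwargs):
    """Return (score, threshold) maximizing func(y_true, y_score >= threshold)."""
    best = None
    for t in sorted(set(y_score)):
        y_pred = [int(s >= t) for s in y_score]
        score = func(y_true, y_pred, **kwargs)
        if best is None or score > best[0]:
            best = (score, t)
    return best


def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


score, threshold = argmax_score_threshold_sketch(accuracy, [0, 0, 1, 1], [0.1, 0.4, 0.6, 0.9])
```

In this perfectly separable example, accuracy 1.0 is attained at threshold 0.6.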
Compute the calibration curve of a binary classification problem. The predicted class probabilities are binned
and, for each bin, the fraction of positive samples is determined. These fractions can then be plotted against the
midpoints of the respective bins. Ideally, the resulting curve will be monotonically increasing.
Parameters:
y_true (array-like) – Ground truth, array of shape (n,) with values among 0 and 1. Values must not be NaN.
y_score (array-like) – Predicted probabilities of the positive class, array of shape (n,) with arbitrary non-NaN values; in
particular, the values do not necessarily need to correspond to probabilities or confidences.
thresholds (array-like, optional) – The thresholds used for binning y_score. If None, suitable thresholds are determined automatically.
Returns:
fractions (ndarray) – Fractions of positive samples in each bin defined by thresholds, array of shape (m - 1,). Note that the
i-th bin corresponds to the half-open interval [thresholds[i], thresholds[i + 1]) if i < m - 2, and to
the closed interval [thresholds[i], thresholds[i + 1]] otherwise (in other words: the last bin is closed).
thresholds (ndarray) – Thresholds, array of shape (m,).
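The binning convention (half-open bins, with the last bin closed) can be sketched as follows for explicitly given thresholds. The function name is hypothetical, and returning NaN for empty bins is an assumption of this sketch.

```python
def calibration_curve_sketch(y_true, y_score, thresholds):
    """Return the fraction of positive samples in each bin defined by thresholds."""
    m = len(thresholds)
    positives = [0] * (m - 1)
    totals = [0] * (m - 1)
    for y, s in zip(y_true, y_score):
        for i in range(m - 1):
            is_last = i == m - 2
            in_bin = thresholds[i] <= s < thresholds[i + 1]
            # The last bin also includes its right endpoint.
            if in_bin or (is_last and s == thresholds[i + 1]):
                totals[i] += 1
                positives[i] += y
                break
    return [p / n if n else float("nan") for p, n in zip(positives, totals)]


fractions = calibration_curve_sketch(
    [0, 1, 1, 1, 0], [0.1, 0.4, 0.5, 0.9, 1.0], thresholds=[0.0, 0.5, 1.0]
)
```

The score 0.5 falls into the second bin [0.5, 1.0], as does the score 1.0 because the last bin is closed.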
The output of this function may differ from the output of sklearn.metrics.roc_curve and
sklearn.metrics.precision_recall_curve, because the implementation of the latter changed over time. For instance,
early versions of scikit-learn set the first threshold in the output of roc_curve to 1 + the second threshold,
whereas later this was changed to +inf. Similarly, early versions of precision_recall_curve only returned
precision and recall until full recall was attained, whereas more recent versions return precision and recall for
all thresholds.
Return equally-spaced thresholds for a given array of classification scores or class probabilities.
Parameters:
y (array-like) – Values used for determining the thresholds, typically (but not necessarily) the scores or class probabilities
returned by a binary classification model. Must be a 1D-array of floats; may contain NaN and infinite values,
which are tacitly ignored.
n_max (int, default=100) – Maximum number of thresholds to return. Note that ensure takes precedence over this parameter, i.e., if
ensure is given, the output may contain more than n_max elements.
add_half_one (bool, optional) – Ensure 0.5 and 1.0 in the resulting list of thresholds. If None, 0.5 and 1.0 are added iff all elements of y
are in the [0, 1] interval, i.e., correspond to class probabilities.
ensure (list, optional) – Thresholds to ensure. If given, all of its elements appear in the final list.
sample_weight (array-like, optional) – Sample weights. Thresholds are chosen such that the total sample weights in each bin are roughly equal.
Returns:
Thresholds, ascending list of floats with length >= 2.
Translate multiclass class probabilities into actual predictions, by returning the class with the highest
probability. If two or more classes have the same highest probabilities, the last one is returned. This behavior is
consistent with binary classification problems, where the positive class is returned if both classes have equal
probabilities and the default threshold of 0.5 is used.
Parameters:
y (array-like) – Class probabilities, of shape (n_classes,) or (n_samples, n_classes). The values of y can be arbitrary,
they don’t need to be between 0 and 1. n_classes must be >= 1.
Return type:
Predicted class indices, either single integer or array of shape (n_samples,).
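The tie-breaking rule can be made explicit with a small pure-Python sketch (the function name is hypothetical; inputs are plain lists rather than arrays):

```python
def multiclass_proba_to_pred_sketch(y):
    """Return the index of the largest value; among ties, the last index wins."""

    def argmax_last(row):
        best = 0
        for i, value in enumerate(row):
            if value >= row[best]:  # ">=" makes later indices win ties
                best = i
        return best

    if y and isinstance(y[0], (list, tuple)):
        return [argmax_last(row) for row in y]  # shape (n_samples, n_classes)
    return argmax_last(y)  # shape (n_classes,)
```

Note that this differs from numpy.argmax, which returns the first index among ties.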
Convenience class for converting a classification metric that can only be applied to class predictions into a
metric that can be applied to probabilities. This proceeds by specifying a fixed decision threshold.
Parameters:
func (callable) – The metric to convert, e.g., accuracy, balanced_accuracy, etc.
threshold (float | str, default=0.5) – The decision threshold. In binary classification this can also be the name of a thresholding strategy that is
accepted by function get_thresholding_strategy().
**kwargs – Additional keyword arguments that are passed to func upon application.
Returns:
New metric that, when applied to y_true and y_score, returns func(y_true, y_score >= threshold) in case of
binary- or multilabel classification, and func(y_true, multiclass_proba_to_pred(y_score)) in case of multiclass
classification.
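For the binary case with a numerical threshold, the conversion amounts to a simple closure. The following sketch uses a hypothetical function name and does not cover thresholding strategies or multiclass input.

```python
def thresholded_sketch(func, threshold=0.5, **kwargs):
    """Turn a metric on class predictions into a metric on scores (binary case)."""

    def metric(y_true, y_score):
        y_pred = [int(s >= threshold) for s in y_score]
        return func(y_true, y_pred, **kwargs)

    return metric


def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


accuracy_at_half = thresholded_sketch(accuracy, threshold=0.5)
value = accuracy_at_half([0, 1, 1, 0], [0.2, 0.5, 0.4, 0.1])
```

A score equal to the threshold is classified as positive, so the predictions here are [0, 1, 0, 0].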
Convenience function for converting a classification metric into its “thresholded” version IF NECESSARY.
That means, if the given metric can be applied to class probabilities, it is returned unchanged. Otherwise,
thresholded(func, threshold) is returned.
Parameters:
func (callable) – The metric to convert, e.g., accuracy, balanced_accuracy, etc.
threshold (float | str, default=0.5) – The decision threshold.
**kwargs – Additional keyword arguments that shall be passed to func upon application.
Return type:
Either func itself or thresholded(func, threshold).
Compute confusion matrix to evaluate the accuracy of a classification.
In the binary and multiclass case, the result is an array of shape (n_classes, n_classes) whose ij-th entry is
the number of samples belonging to class i and classified as class j.
In short: rows = ground truth, columns = predictions.
In the multilabel case, the result is an array of shape (n_labels, 2, 2), with a binary confusion matrix for each
label.
Parameters:
y_true (array-like) – Ground-truth (correct) target values, array-like of shape (n_samples,) or (n_samples, n_labels).
y_pred (array-like) – Predictions, array-like with the same shape as y_true.
multilabel (str | bool, default="auto") –
Whether to return a binary/multiclass confusion matrix, or a multilabel confusion matrix:
True: Return a multilabel confusion matrix, even if the input is binary/multiclass. Multiclass data will be
treated as if binarized under a one-vs-rest transformation.
False: Return a binary/multiclass confusion matrix. Raises a ValueError if the input is multilabel.
“auto” (default): Automatically detect the confusion matrix type to return: multilabel if the input is
multilabel, binary/multiclass otherwise.
normalize (str, optional) – Normalize the confusion matrix over the rows (“true”), over the columns (“pred”), or over the whole population
(“all”). If None, the confusion matrix is not normalized.
For multilabel input, each of the 2x2 confusion matrices is normalized separately.
samplewise (bool, default=False) – In the multilabel case, this calculates a confusion matrix per sample.
Returns:
Confusion matrix, array of shape (n_classes, n_classes) if multilabel is False, or (n_classes, 2, 2) if
multilabel is True.
This implementation combines both sklearn.metrics.confusion_matrix and
sklearn.metrics.multilabel_confusion_matrix. Setting multilabel to False is equivalent to the former, setting
it to True is equivalent to the latter.
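The layout convention for the binary/multiclass case (rows = ground truth, columns = predictions) and row-wise normalization can be sketched as follows. The function names are hypothetical, and class labels are assumed to be the integers 0, ..., n_classes - 1.

```python
def confusion_matrix_sketch(y_true, y_pred, n_classes):
    """Entry [i][j] counts samples of true class i that were classified as class j."""
    cm = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        cm[t][p] += 1
    return cm


def normalize_rows(cm):
    """Normalize over the true conditions, i.e., each non-empty row sums to 1."""
    return [[v / sum(row) if sum(row) else 0.0 for v in row] for row in cm]


cm = confusion_matrix_sketch([0, 0, 1, 2, 2], [0, 1, 1, 2, 0], n_classes=3)
cm_true = normalize_rows(cm)
```

The diagonal of the row-normalized matrix contains the per-class recall.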
Hamming loss is 1 - accuracy, but the default multilabel averaging policies of the two metrics differ:
accuracy returns subset accuracy by default (i.e., all labels must match), whereas Hamming loss returns the label-wise
macro average by default.
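The difference between the two default policies shows up in a small multilabel example. The two functions below are plain-Python sketches of subset accuracy and label-wise Hamming loss, written for this illustration only.

```python
def subset_accuracy(Y_true, Y_pred):
    """Fraction of samples for which all labels match."""
    return sum(t == p for t, p in zip(Y_true, Y_pred)) / len(Y_true)


def hamming_loss(Y_true, Y_pred):
    """Fraction of individual labels that are wrong, averaged over all labels."""
    n_labels = len(Y_true[0])
    wrong = sum(t[j] != p[j] for t, p in zip(Y_true, Y_pred) for j in range(n_labels))
    return wrong / (len(Y_true) * n_labels)


Y_true = [[1, 0, 1], [0, 1, 0]]
Y_pred = [[1, 0, 0], [0, 1, 0]]
acc = subset_accuracy(Y_true, Y_pred)  # one of two samples matches completely
loss = hamming_loss(Y_true, Y_pred)    # one of six labels is wrong
```

Here 1 - acc = 0.5 whereas loss = 1/6, so the identity “Hamming loss = 1 - accuracy” only holds if both metrics use the same averaging.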
Compute precision, recall, F-measure and support for each class. This is the confusion-matrix based variant of
precision_recall_fscore_support.
Parameters:
cm (array-like, optional) – Confusion matrix, array-like of shape (n_classes, n_classes) or (n_labels, 2, 2).
Mutually exclusive with tp, fp, tn and fn.
tp (scalar | array-like, optional) – Number/fraction of true positives, scalar or array-like of shape (n_labels,). If given, fp, tn and fn
must be given, too.
Mutually exclusive with cm.
fp (scalar | array-like, optional) – Number/fraction of false positives, scalar or array-like of shape (n_labels,). If given, tp, tn and fn
must be given, too.
Mutually exclusive with cm.
tn (scalar | array-like, optional) – Number/fraction of true negatives, scalar or array-like of shape (n_labels,). If given, tp, fp and fn
must be given, too.
Mutually exclusive with cm.
fn (scalar | array-like, optional) – Number/fraction of false negatives, scalar or array-like of shape (n_labels,). If given, tp, fp and tn
must be given, too.
Mutually exclusive with cm.
beta (float, default=1) – The strength of recall versus precision in the F-score.
average (str, optional) – Averaging to perform for multiclass and multilabel input.
zero_division (float | str, default="warn") – The value to return if there is a division by zero.
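For a multiclass confusion matrix (rows = ground truth, columns = predictions), the per-class quantities can be derived as in the following sketch. The function name is hypothetical, no averaging is performed, and undefined values are replaced by 0.0, which is an assumption.

```python
def prfs_from_cm_sketch(cm, beta=1.0):
    """Return a list of (precision, recall, f_beta, support) tuples, one per class."""
    n = len(cm)
    result = []
    for k in range(n):
        tp = cm[k][k]
        fp = sum(cm[i][k] for i in range(n)) - tp  # column sum minus diagonal
        fn = sum(cm[k]) - tp                       # row sum minus diagonal
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = beta ** 2 * precision + recall
        f_beta = (1 + beta ** 2) * precision * recall / denom if denom else 0.0
        result.append((precision, recall, f_beta, tp + fn))
    return result


per_class = prfs_from_cm_sketch([[1, 1, 0], [0, 1, 0], [1, 0, 1]])
```

With beta > 1, recall has more influence on the F-score than precision; with beta < 1, less.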
Compute the precision (positive predictive value) from a given confusion matrix.
This is the confusion-matrix based variant of positive_predictive_value.
Parameters:
cm (array-like, optional) – Confusion matrix, array-like of shape (n_classes, n_classes) or (n_labels, 2, 2).
Mutually exclusive with tp, fp, tn and fn.
tp (scalar | array-like, optional) – Number/fraction of true positives, scalar or array-like of shape (n_labels,). If given, fp, tn and fn
must be given, too.
Mutually exclusive with cm.
fp (scalar | array-like, optional) – Number/fraction of false positives, scalar or array-like of shape (n_labels,). If given, tp, tn and fn
must be given, too.
Mutually exclusive with cm.
tn (scalar | array-like, optional) – Number/fraction of true negatives, scalar or array-like of shape (n_labels,). If given, tp, fp and fn
must be given, too.
Mutually exclusive with cm.
fn (scalar | array-like, optional) – Number/fraction of false negatives, scalar or array-like of shape (n_labels,). If given, tp, fp and tn
must be given, too.
Mutually exclusive with cm.
swap_pos_neg (bool, default=False) – Swap positive and negative class. If True, negative predictive value is computed instead.
average (str, optional) – Averaging to perform for multiclass and multilabel input.
zero_division (float | str, default="warn") – The value to return if there is a division by zero.
Compute the recall (sensitivity) from a given confusion matrix.
This is the confusion-matrix based variant of sensitivity.
Parameters:
cm (array-like, optional) – Confusion matrix, array-like of shape (n_classes, n_classes) or (n_labels, 2, 2).
Mutually exclusive with tp, fp, tn and fn.
tp (scalar | array-like, optional) – Number/fraction of true positives, scalar or array-like of shape (n_labels,). If given, fp, tn and fn
must be given, too.
Mutually exclusive with cm.
fp (scalar | array-like, optional) – Number/fraction of false positives, scalar or array-like of shape (n_labels,). If given, tp, tn and fn
must be given, too.
Mutually exclusive with cm.
tn (scalar | array-like, optional) – Number/fraction of true negatives, scalar or array-like of shape (n_labels,). If given, tp, fp and fn
must be given, too.
Mutually exclusive with cm.
fn (scalar | array-like, optional) – Number/fraction of false negatives, scalar or array-like of shape (n_labels,). If given, tp, fp and tn
must be given, too.
Mutually exclusive with cm.
swap_pos_neg (bool, default=False) – Swap positive and negative class. If True, specificity is computed instead.
average (str, optional) – Averaging to perform for multiclass and multilabel input.
zero_division (float | str, default="warn") – The value to return if there is a division by zero.
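The effect of swap_pos_neg can be illustrated with scalar counts: swapping the roles of the positive and negative class exchanges tp with tn and fp with fn. The function names below are hypothetical, and zero-division handling is omitted.

```python
def recall_sketch(tp, fp, tn, fn, swap_pos_neg=False):
    """Sensitivity tp / (tp + fn); with swap_pos_neg, specificity tn / (tn + fp)."""
    if swap_pos_neg:
        tp, fp, tn, fn = tn, fn, tp, fp
    return tp / (tp + fn)


def precision_sketch(tp, fp, tn, fn, swap_pos_neg=False):
    """Positive predictive value; with swap_pos_neg, negative predictive value."""
    if swap_pos_neg:
        tp, fp, tn, fn = tn, fn, tp, fp
    return tp / (tp + fp)


counts = dict(tp=3, fp=1, tn=4, fn=2)
sensitivity = recall_sketch(**counts)
specificity = recall_sketch(**counts, swap_pos_neg=True)
ppv = precision_sketch(**counts)
npv = precision_sketch(**counts, swap_pos_neg=True)
```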
Compute the accuracy from a given confusion matrix. This is the confusion-matrix based variant of accuracy.
Parameters:
cm (array-like, optional) – Confusion matrix, array-like of shape (n_classes, n_classes) or (n_labels, 2, 2).
Mutually exclusive with tp, fp, tn and fn.
tp (scalar | array-like, optional) – Number/fraction of true positives, scalar or array-like of shape (n_labels,). If given, fp, tn and fn
must be given, too.
Mutually exclusive with cm.
fp (scalar | array-like, optional) – Number/fraction of false positives, scalar or array-like of shape (n_labels,). If given, tp, tn and fn
must be given, too.
Mutually exclusive with cm.
tn (scalar | array-like, optional) – Number/fraction of true negatives, scalar or array-like of shape (n_labels,). If given, tp, fp and fn
must be given, too.
Mutually exclusive with cm.
fn (scalar | array-like, optional) – Number/fraction of false negatives, scalar or array-like of shape (n_labels,). If given, tp, fp and tn
must be given, too.
Mutually exclusive with cm.
normalize (bool, default=True) – Return the fraction of correctly classified samples. Otherwise, return the number of correctly classified
samples.
average (str, optional) – Averaging to perform for multiclass and multilabel input.
Compute the balanced accuracy from a given confusion matrix. This is the confusion-matrix based variant of
balanced_accuracy.
Parameters:
cm (array-like, optional) – Confusion matrix, array-like of shape (n_classes, n_classes) or (n_labels, 2, 2).
Mutually exclusive with tp, fp, tn and fn.
tp (scalar | array-like, optional) – Number/fraction of true positives, scalar or array-like of shape (n_labels,). If given, fp, tn and fn
must be given, too.
Mutually exclusive with cm.
fp (scalar | array-like, optional) – Number/fraction of false positives, scalar or array-like of shape (n_labels,). If given, tp, tn and fn
must be given, too.
Mutually exclusive with cm.
tn (scalar | array-like, optional) – Number/fraction of true negatives, scalar or array-like of shape (n_labels,). If given, tp, fp and fn
must be given, too.
Mutually exclusive with cm.
fn (scalar | array-like, optional) – Number/fraction of false negatives, scalar or array-like of shape (n_labels,). If given, tp, fp and tn
must be given, too.
Mutually exclusive with cm.
adjusted (bool, default=False) – Adjust the result for chance, so that random performance would score 0, while keeping perfect performance at a
score of 1.
average (str, optional) – Averaging to perform for multiclass and multilabel input.
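Balanced accuracy is the mean of the per-class recalls. The sketch below also shows a chance adjustment that follows the convention of scikit-learn's balanced_accuracy_score; whether the function documented here uses exactly the same formula is an assumption. The function name is hypothetical, and every class is assumed to have at least one sample.

```python
def balanced_accuracy_from_cm_sketch(cm, adjusted=False):
    """Mean per-class recall of a (n_classes, n_classes) confusion matrix."""
    recalls = [row[k] / sum(row) for k, row in enumerate(cm)]
    score = sum(recalls) / len(recalls)
    if adjusted:
        chance = 1 / len(cm)  # expected score of random predictions
        score = (score - chance) / (1 - chance)
    return score


cm = [[1, 1, 0], [0, 1, 0], [1, 0, 1]]
plain = balanced_accuracy_from_cm_sketch(cm)
adjusted = balanced_accuracy_from_cm_sketch(cm, adjusted=True)
```

The per-class recalls are 0.5, 1.0 and 0.5, so the plain score is 2/3 and the adjusted score is (2/3 - 1/3) / (1 - 1/3) = 0.5.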
Compute the F-beta score from a given confusion matrix.
Parameters:
cm (array-like, optional) – Confusion matrix, array-like of shape (n_classes, n_classes) or (n_labels, 2, 2).
Mutually exclusive with tp, fp, tn and fn.
tp (scalar | array-like, optional) – Number/fraction of true positives, scalar or array-like of shape (n_labels,). If given, fp, tn and fn
must be given, too.
Mutually exclusive with cm.
fp (scalar | array-like, optional) – Number/fraction of false positives, scalar or array-like of shape (n_labels,). If given, tp, tn and fn
must be given, too.
Mutually exclusive with cm.
tn (scalar | array-like, optional) – Number/fraction of true negatives, scalar or array-like of shape (n_labels,). If given, tp, fp and fn
must be given, too.
Mutually exclusive with cm.
fn (scalar | array-like, optional) – Number/fraction of false negatives, scalar or array-like of shape (n_labels,). If given, tp, fp and tn
must be given, too.
Mutually exclusive with cm.
beta (float, default=1) – The strength of recall versus precision.
average (str, optional) – Averaging to perform for multiclass and multilabel input.
zero_division (float | str, default="warn") – The value to return if there is a division by zero.
Compute Cohen’s kappa from a given confusion matrix. This is the confusion-matrix based variant of cohen_kappa.
Parameters:
cm (array-like, optional) – Confusion matrix, array-like of shape (n_classes, n_classes) or (n_labels, 2, 2).
Mutually exclusive with tp, fp, tn and fn.
tp (scalar | array-like, optional) – Number/fraction of true positives, scalar or array-like of shape (n_labels,). If given, fp, tn and fn
must be given, too.
Mutually exclusive with cm.
fp (scalar | array-like, optional) – Number/fraction of false positives, scalar or array-like of shape (n_labels,). If given, tp, tn and fn
must be given, too.
Mutually exclusive with cm.
tn (scalar | array-like, optional) – Number/fraction of true negatives, scalar or array-like of shape (n_labels,). If given, tp, fp and fn
must be given, too.
Mutually exclusive with cm.
fn (scalar | array-like, optional) – Number/fraction of false negatives, scalar or array-like of shape (n_labels,). If given, tp, fp and tn
must be given, too.
Mutually exclusive with cm.
weights (str, optional) – Weighting type to calculate the score. None means not weighted; “linear” means linear weighting; “quadratic”
means quadratic weighting.
average (str, optional) – Averaging to perform for multiclass and multilabel input.
Returns:
Kappa statistic, float or array of floats between -1 and 1. The maximum value means complete agreement; zero or
lower means chance agreement.
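For the unweighted case, Cohen's kappa compares the observed agreement p_o with the agreement p_e expected by chance: kappa = (p_o - p_e) / (1 - p_e). A minimal sketch with a hypothetical function name, without support for weights or averaging:

```python
def cohen_kappa_from_cm_sketch(cm):
    """Unweighted Cohen's kappa of a (n_classes, n_classes) confusion matrix."""
    k = len(cm)
    n = sum(sum(row) for row in cm)
    observed = sum(cm[i][i] for i in range(k)) / n
    # Chance agreement: product of row and column marginals, summed over classes.
    expected = sum(
        sum(cm[i]) * sum(cm[j][i] for j in range(k)) for i in range(k)
    ) / n ** 2
    return (observed - expected) / (1 - expected)


kappa = cohen_kappa_from_cm_sketch([[1, 1, 0], [0, 1, 0], [1, 0, 1]])
```

Here p_o = 3/5 and p_e = 8/25, which gives kappa = 7/17, roughly 0.41.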
Compute the Matthews correlation coefficient (MCC) from a given confusion matrix.
This is the confusion-matrix based variant of matthews_correlation_coefficient.
Parameters:
cm (array-like, optional) – Confusion matrix, array-like of shape (n_classes, n_classes) or (n_labels, 2, 2).
Mutually exclusive with tp, fp, tn and fn.
tp (scalar | array-like, optional) – Number/fraction of true positives, scalar or array-like of shape (n_labels,). If given, fp, tn and fn
must be given, too.
Mutually exclusive with cm.
fp (scalar | array-like, optional) – Number/fraction of false positives, scalar or array-like of shape (n_labels,). If given, tp, tn and fn
must be given, too.
Mutually exclusive with cm.
tn (scalar | array-like, optional) – Number/fraction of true negatives, scalar or array-like of shape (n_labels,). If given, tp, fp and fn
must be given, too.
Mutually exclusive with cm.
fn (scalar | array-like, optional) – Number/fraction of false negatives, scalar or array-like of shape (n_labels,). If given, tp, fp and tn
must be given, too.
Mutually exclusive with cm.
average (str, optional) – Averaging to perform for multiclass and multilabel input.
Returns:
Matthews correlation coefficient (+1 represents a perfect prediction, 0 an average random prediction and -1 and
Compute the Jaccard score (intersection over union, IoU) from a given confusion matrix.
This is the confusion-matrix based variant of jaccard.
Parameters:
cm (array-like, optional) – Confusion matrix, array-like of shape (n_classes, n_classes) or (n_labels, 2, 2).
Mutually exclusive with tp, fp, tn and fn.
tp (scalar | array-like, optional) – Number/fraction of true positives, scalar or array-like of shape (n_labels,). If given, fp, tn and fn
must be given, too.
Mutually exclusive with cm.
fp (scalar | array-like, optional) – Number/fraction of false positives, scalar or array-like of shape (n_labels,). If given, tp, tn and fn
must be given, too.
Mutually exclusive with cm.
tn (scalar | array-like, optional) – Number/fraction of true negatives, scalar or array-like of shape (n_labels,). If given, tp, fp and fn
must be given, too.
Mutually exclusive with cm.
fn (scalar | array-like, optional) – Number/fraction of false negatives, scalar or array-like of shape (n_labels,). If given, tp, fp and tn
must be given, too.
Mutually exclusive with cm.
average (str, optional) – Averaging to perform for multiclass and multilabel input.
zero_division (float | str, default="warn") – The value to return if there is a division by zero.
Hamming loss is 1 - accuracy, but their multilabel default averaging policy differs:
accuracy returns subset accuracy by default (i.e., all labels must match), whereas hamming loss returns label-wise
macro average by default.
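The difference between the two default averaging policies can be checked with a small numeric sketch. It uses plain NumPy rather than the functions documented here, and the arrays are made-up illustration data.

```python
import numpy as np

# Hypothetical multilabel data: 4 samples, 3 labels.
y_true = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 1, 0],
                   [0, 0, 1]])
y_pred = np.array([[1, 0, 0],
                   [0, 1, 0],
                   [1, 0, 0],
                   [0, 0, 1]])

# Subset accuracy: a sample only counts if all of its labels match.
subset_accuracy = np.mean(np.all(y_true == y_pred, axis=1))

# Label-wise accuracy, then macro average over the labels.
labelwise_accuracy = np.mean(y_true == y_pred, axis=0)
macro_accuracy = labelwise_accuracy.mean()

# Hamming loss with label-wise macro averaging.
hamming = 1.0 - macro_accuracy

print(subset_accuracy, hamming)
```

Here two of twelve individual labels are wrong, but they fall into two different samples, so 1 - subset accuracy is larger than the Hamming loss.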
Compute the F-beta score from a given confusion matrix.
Parameters:
cm (array-like, optional) – Confusion matrix, array-like of shape (n_classes, n_classes) or (n_labels, 2, 2).
Mutually exclusive with tp, fp, tn and fn.
tp (scalar | array-like, optional) – Number/fraction of true positives, scalar or array-like of shape (n_labels,). If given, fp, tn and fn
must be given, too.
Mutually exclusive with cm.
fp (scalar | array-like, optional) – Number/fraction of false positives, scalar or array-like of shape (n_labels,). If given, tp, tn and fn
must be given, too.
Mutually exclusive with cm.
tn (scalar | array-like, optional) – Number/fraction of true negatives, scalar or array-like of shape (n_labels,). If given, tp, fp and fn
must be given, too.
Mutually exclusive with cm.
fn (scalar | array-like, optional) – Number/fraction of false negatives, scalar or array-like of shape (n_labels,). If given, tp, fp and tn
must be given, too.
Mutually exclusive with cm.
beta (float, default=1) – The strength of recall versus precision.
average (str, optional) – Averaging to perform for multiclass and multilabel input.
zero_division (float | str, default="warn") – The value to return if there is a division by zero.
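As a rough sketch of how beta trades recall against precision, the F-beta score can be written directly in terms of tp, fp and fn. The helper name `f_beta_from_counts` and its numeric `zero_division` handling are assumptions made for illustration and are not the library function; tn does not enter the formula.

```python
def f_beta_from_counts(tp, fp, fn, beta=1.0, zero_division=0.0):
    """F-beta score from confusion counts (hypothetical helper)."""
    b2 = beta ** 2
    denominator = (1 + b2) * tp + b2 * fn + fp
    if denominator == 0:
        # No positives were present or predicted: return the fallback value.
        return zero_division
    return (1 + b2) * tp / denominator


# Precision is 8/10, recall is 8/12 for these made-up counts.
f1 = f_beta_from_counts(tp=8, fp=2, fn=4)
f2 = f_beta_from_counts(tp=8, fp=2, fn=4, beta=2.0)  # weights recall more
```

Because recall is lower than precision in this example, increasing beta lowers the score.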
Compute the recall (sensitivity) from a given confusion matrix.
This is the confusion-matrix based variant of sensitivity.
Parameters:
cm (array-like, optional) – Confusion matrix, array-like of shape (n_classes, n_classes) or (n_labels, 2, 2).
Mutually exclusive with tp, fp, tn and fn.
tp (scalar | array-like, optional) – Number/fraction of true positives, scalar or array-like of shape (n_labels,). If given, fp, tn and fn
must be given, too.
Mutually exclusive with cm.
fp (scalar | array-like, optional) – Number/fraction of false positives, scalar or array-like of shape (n_labels,). If given, tp, tn and fn
must be given, too.
Mutually exclusive with cm.
tn (scalar | array-like, optional) – Number/fraction of true negatives, scalar or array-like of shape (n_labels,). If given, tp, fp and fn
must be given, too.
Mutually exclusive with cm.
fn (scalar | array-like, optional) – Number/fraction of false negatives, scalar or array-like of shape (n_labels,). If given, tp, fp and tn
must be given, too.
Mutually exclusive with cm.
swap_pos_neg (bool, default=False) – Swap positive and negative class. If True, specificity is computed instead.
average (str, optional) – Averaging to perform for multiclass and multilabel input.
zero_division (float | str, default="warn") – The value to return if there is a division by zero.
Compute the recall (sensitivity) from a given confusion matrix.
This is the confusion-matrix based variant of sensitivity.
Parameters:
cm (array-like, optional) – Confusion matrix, array-like of shape (n_classes, n_classes) or (n_labels, 2, 2).
Mutually exclusive with tp, fp, tn and fn.
tp (scalar | array-like, optional) – Number/fraction of true positives, scalar or array-like of shape (n_labels,). If given, fp, tn and fn
must be given, too.
Mutually exclusive with cm.
fp (scalar | array-like, optional) – Number/fraction of false positives, scalar or array-like of shape (n_labels,). If given, tp, tn and fn
must be given, too.
Mutually exclusive with cm.
tn (scalar | array-like, optional) – Number/fraction of true negatives, scalar or array-like of shape (n_labels,). If given, tp, fp and fn
must be given, too.
Mutually exclusive with cm.
fn (scalar | array-like, optional) – Number/fraction of false negatives, scalar or array-like of shape (n_labels,). If given, tp, fp and tn
must be given, too.
Mutually exclusive with cm.
swap_pos_neg (bool, default=False) – Swap positive and negative class. If True, specificity is computed instead.
average (str, optional) – Averaging to perform for multiclass and multilabel input.
zero_division (float | str, default="warn") – The value to return if there is a division by zero.
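The effect of swap_pos_neg can be illustrated with scalar counts. The following is a minimal sketch under the assumption that swapping the classes simply exchanges tp with tn and fp with fn; `recall_from_counts` is a hypothetical helper, not the library function.

```python
def recall_from_counts(tp, fp, tn, fn, swap_pos_neg=False, zero_division=0.0):
    """Recall from confusion counts (hypothetical helper)."""
    if swap_pos_neg:
        # The former negatives become the positive class and vice versa.
        tp, fp, tn, fn = tn, fn, tp, fp
    denominator = tp + fn
    if denominator == 0:
        return zero_division
    return tp / denominator


sensitivity = recall_from_counts(tp=8, fp=2, tn=6, fn=4)
specificity = recall_from_counts(tp=8, fp=2, tn=6, fn=4, swap_pos_neg=True)
```

With the counts above, sensitivity is tp / (tp + fn) = 8 / 12 and specificity is tn / (tn + fp) = 6 / 8.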
Compute the precision (positive predictive value) from a given confusion matrix.
This is the confusion-matrix based variant of positive_predictive_value.
Parameters:
cm (array-like, optional) – Confusion matrix, array-like of shape (n_classes, n_classes) or (n_labels, 2, 2).
Mutually exclusive with tp, fp, tn and fn.
tp (scalar | array-like, optional) – Number/fraction of true positives, scalar or array-like of shape (n_labels,). If given, fp, tn and fn
must be given, too.
Mutually exclusive with cm.
fp (scalar | array-like, optional) – Number/fraction of false positives, scalar or array-like of shape (n_labels,). If given, tp, tn and fn
must be given, too.
Mutually exclusive with cm.
tn (scalar | array-like, optional) – Number/fraction of true negatives, scalar or array-like of shape (n_labels,). If given, tp, fp and fn
must be given, too.
Mutually exclusive with cm.
fn (scalar | array-like, optional) – Number/fraction of false negatives, scalar or array-like of shape (n_labels,). If given, tp, fp and tn
must be given, too.
Mutually exclusive with cm.
swap_pos_neg (bool, default=False) – Swap positive and negative class. If True, negative predictive value is computed instead.
average (str, optional) – Averaging to perform for multiclass and multilabel input.
zero_division (float | str, default="warn") – The value to return if there is a division by zero.
Compute the precision (positive predictive value) from a given confusion matrix.
This is the confusion-matrix based variant of positive_predictive_value.
Parameters:
cm (array-like, optional) – Confusion matrix, array-like of shape (n_classes, n_classes) or (n_labels, 2, 2).
Mutually exclusive with tp, fp, tn and fn.
tp (scalar | array-like, optional) – Number/fraction of true positives, scalar or array-like of shape (n_labels,). If given, fp, tn and fn
must be given, too.
Mutually exclusive with cm.
fp (scalar | array-like, optional) – Number/fraction of false positives, scalar or array-like of shape (n_labels,). If given, tp, tn and fn
must be given, too.
Mutually exclusive with cm.
tn (scalar | array-like, optional) – Number/fraction of true negatives, scalar or array-like of shape (n_labels,). If given, tp, fp and fn
must be given, too.
Mutually exclusive with cm.
fn (scalar | array-like, optional) – Number/fraction of false negatives, scalar or array-like of shape (n_labels,). If given, tp, fp and tn
must be given, too.
Mutually exclusive with cm.
swap_pos_neg (bool, default=False) – Swap positive and negative class. If True, negative predictive value is computed instead.
average (str, optional) – Averaging to perform for multiclass and multilabel input.
zero_division (float | str, default="warn") – The value to return if there is a division by zero.
Calculate and save descriptive statistics including correlation information to disk.
Parameters:
df (DataFrame) – The main dataframe.
target (list) – The target labels, given as a list.
classify (bool) – True for classification tasks, False for regression tasks.
fn (str | Path) – The directory where to save the statistics files
corr_threshold (int, default=200) – Maximum number of columns for which a correlation-DataFrame is computed.
mann_whitney_u(x, y, nan_policy: str = 'omit', **kwargs) → float
Mann-Whitney U test for testing whether two independent samples are equal (more precisely: have equal median).
Only applicable to numerical observations; categorical observations should be treated with the chi square test.
Parameters:
x (array-like) – First sample, array-like with numerical values.
y (array-like) – Second sample, array-like with numerical values.
nan_policy (str, default="omit") –
Specifies how to handle NaN values:
“omit”: Perform the test on all non-NaN values.
“propagate”: Return NaN if at least one input value is NaN.
“raise”: Raise a ValueError if at least one input value is NaN.
**kwargs – Keyword arguments passed to scipy.stats.mannwhitneyu().
Returns:
p_value – P-value. Smaller values mean that x and y are distributed differently.
Return type:
float
See also
scipy.stats.mannwhitneyu
Notes
This test is symmetric between x and y if alternative is set to “two-sided” (default), i.e.,
mann_whitney_u(x, y) equals mann_whitney_u(y, x).
The Mann-Whitney U test is a special case of the Kruskal-Wallis H test, which works for more than two samples.
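The symmetry stated above can be checked by calling scipy.stats.mannwhitneyu directly. This sketch assumes that the “omit” policy amounts to dropping NaN values from each sample independently before the test; the sample values are made up.

```python
import numpy as np
from scipy.stats import mannwhitneyu

x = np.array([1.2, 3.4, np.nan, 2.2, 5.1, 4.8])
y = np.array([2.9, 6.3, 7.1, np.nan, 8.4])

# Assumed "omit" behavior: remove NaN values from each sample separately.
x_clean = x[~np.isnan(x)]
y_clean = y[~np.isnan(y)]

# The two-sided alternative makes the test symmetric in its arguments.
p_xy = mannwhitneyu(x_clean, y_clean, alternative="two-sided").pvalue
p_yx = mannwhitneyu(y_clean, x_clean, alternative="two-sided").pvalue

print(p_xy, p_yx)
```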
chi_square(x, y, nan_policy: str = 'omit', **kwargs) → float
Chi square test for testing whether a sample of categorical observations is distributed according to another
sample of categorical observations.
Parameters:
x (array-like) – First sample, array-like with categorical values.
y (array-like) – Second sample, array-like with categorical values.
nan_policy (str, default="omit") –
Specifies how to handle NaN values:
“omit”: Perform the test on all non-NaN values.
“propagate”: Return NaN if at least one input value is NaN.
“raise”: Raise a ValueError if at least one input value is NaN.
**kwargs – Keyword arguments passed to scipy.stats.chisquare().
Returns:
p_value – p-value. Smaller values mean that x is distributed differently from y.
Return type:
float
See also
scipy.stats.chisquare
Notes
This test is not symmetric between x and y, i.e., chi_square(x, y) differs from chi_square(y, x)
in general.
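The asymmetry is easy to see from the Pearson chi-square statistic itself, in which one sample supplies the observed counts and the other the expected frequencies. The sketch below is an assumption about how two samples can be compared this way, not necessarily what chi_square() does internally. It only computes the statistic, which scipy.stats.chisquare would then turn into a p-value, and it assumes every category of x also occurs in y.

```python
from collections import Counter


def chi_square_statistic(x, y):
    """Pearson statistic of x against the category frequencies of y
    (hypothetical helper)."""
    observed = Counter(x)
    reference = Counter(y)
    n_x, n_y = len(x), len(y)
    statistic = 0.0
    for category, n_ref in reference.items():
        # Expected count in x if x followed the distribution of y.
        expected = n_x * n_ref / n_y
        statistic += (observed.get(category, 0) - expected) ** 2 / expected
    return statistic


x = ["a"] * 10 + ["b"] * 30
y = ["a"] * 20 + ["b"] * 20

stat_xy = chi_square_statistic(x, y)
stat_yx = chi_square_statistic(y, x)
```

Swapping the roles of the two samples changes the expected counts in the denominators, so the two statistics (and hence the p-values) differ.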
Compute the p-value of the DeLong test for the null hypothesis that two ROC-AUCs are equal.
Parameters:
y_true (array-like) – Ground truth, 1D array-like of shape (n_samples,) with values in {0, 1}.
y_hat_1 (array-like) – Predictions of the first classifier, 1D array-like of shape (n_samples,) with arbitrary values. Larger values
correspond to a higher predicted probability that a sample belongs to the positive class.
y_hat_2 (array-like) – Predictions of the second classifier, 1D array-like of shape (n_samples,) with arbitrary values. Larger values
correspond to a higher predicted probability that a sample belongs to the positive class.
nan_policy (str) – Specifies how to handle NaN values:
“omit”: Perform the test on all non-NaN values. Since this is a paired test, all observations that are NaN in
any of the three arrays are dropped.
“propagate”: Return NaN if at least one input value is NaN.
“raise”: Raise a ValueError if at least one input value is NaN.
Returns:
p_value – p-value for the null hypothesis that the ROC-AUCs of the two classifiers are equal. If this value is smaller
than a certain pre-defined threshold (e.g., 0.05) the null hypothesis can be rejected, meaning that there is a
statistically significant difference between the two ROC-AUCs.
Return the confidence interval and ROC-AUC of given ground-truth and model predictions.
Parameters:
y_true (array-like) – Ground truth, 1D array-like of shape (n_samples,) with values in {0, 1}.
y_hat (array-like) – Predictions of the classifier, 1D array-like of shape (n_samples,) with arbitrary values. Larger values
correspond to a higher predicted probability that a sample belongs to the positive class.
alpha (float, default=0.95) – Confidence level, between 0 and 1.
Suggest statistical hypothesis tests for comparing two or more groups (samples). The list of suggested tests is
by no means exhaustive, but includes some of the most frequently used tests in practice.
See Notes for some general comments on statistical testing.
Parameters:
task (str, default="comparison") –
The objective of the test, can be either “comparison” or “correlation”:
“comparison”: The objective of the test is to determine whether the given groups were drawn from the same
distribution. This usually, but not necessarily, happens by comparing group statistics, like mean, median
or variance.
“correlation”: The objective of the test is to determine whether the given (paired) groups are correlated.
Groups can be correlated even when drawn from distinct distributions.
quantitative (bool, default=True) – The observations in the given groups are quantitative, i.e., drawn from continuous or discrete distributions,
such that each observation has a numerical value.
The alternative is categorical observations.
paired (bool, default=False) – The observations in the given groups are paired, i.e., the i-th observation in the first group corresponds
to the i-th observation in the second group. Correspondence can mean, for instance, that observations
originate from the same subject, measurement device, etc. Note that this implies that all groups must have the
same size.
The alternative is independent groups.
n_groups (int, default=2) – The number of groups the test should handle. Some tests are restricted to two groups, others can handle
arbitrarily many groups.
normal (bool, default=True) – The observations are known to be drawn from a normal distribution. Some tests need this assumption to work
properly, others (called “non-parametric tests”) can deal with arbitrary underlying distributions.
equal_variance (bool, default=False) – The observations are known to be drawn from distributions with equal variance (usually normal distributions).
Some tests need this assumption to work properly, others can deal with arbitrary variances.
This property is also known as homoscedasticity.
Returns:
Dict of suggested tests (possibly empty), keys are names and values are dicts with main properties.
Carefully read the documentation of each test to select the one appropriate for your data.
Return type:
dict
Notes
A statistical test is usually performed by finding evidence _against_ the null hypothesis of the test, e.g., using
the t-test to show that two groups have _different_ mean values. The converse is not true, though: if a test does
not produce evidence against the null hypothesis, we cannot conclude that the null-hypothesis must be true – only
that we have not found any evidence against it. This holds true even if the p-values are close to 1.
More concisely: null hypothesis true ==> (relatively) large p-value. Note the implication, not equivalence!
One common assumption of most statistical tests is that all observations in a group are independent, i.e., all are
drawn independently from the same underlying distribution (i.i.d. assumption). Whether this property holds true
also _between_ groups can be controlled with parameter paired.
There are many resources for finding the right statistical test on the internet, e.g., [1].
Transform data by scaling each feature to a given range. The only difference to
sklearn.preprocessing.MinMaxScaler is parameter fit_bool that, when set to False, does not fit this scaler on
boolean features but rather uses 0 and 1 as fixed minimum and maximum values. This ensures that False is always
mapped to feature_range[0] and True is always mapped to feature_range[1]. Otherwise, if the training data only
contains True values, True would be mapped to feature_range[0] and False to 2 * feature_range[0] - feature_range[1]
(i.e., to -1 for the default range (0, 1)).
The behavior on other numerical data types is not affected by this.
Parameters:
fit_bool (bool, default=True) – Whether to fit this scaler on boolean features. If True, the behavior is identical to
sklearn.preprocessing.MinMaxScaler.
**kwargs – Additional keyword arguments, passed to sklearn.preprocessing.MinMaxScaler.
See also
sklearn.preprocessing.MinMaxScaler
Notes
Note that sklearn.preprocessing.MaxAbsScaler always maps False to 0 and True to 1, so there is no need for an
analogous subclass.
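The problem that fit_bool=False avoids can be reproduced with the plain scikit-learn scaler. This is a small demonstration with made-up data, using only sklearn.preprocessing.MinMaxScaler and the default feature range.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Training data with one boolean feature that happens to be always True.
X_train = np.array([[True], [True], [True]], dtype=float)
X_new = np.array([[True], [False]], dtype=float)

# Fitting on this data makes 1 both the observed minimum and maximum.
scaler = MinMaxScaler(feature_range=(0, 1)).fit(X_train)
scaled = scaler.transform(X_new)

print(scaled.ravel())
```

True ends up at the lower end of the range and False falls outside of it, which is what fixing the minimum and maximum to 0 and 1 prevents.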
Online computation of min and max on X for later scaling.
All of X is processed as a single batch. This is intended for cases
when fit() is not feasible due to very large number of
n_samples or because X is read from a continuous stream.
Parameters:
X (array-like of shape (n_samples, n_features)) – The data used to compute the per-feature minimum and maximum
used for later scaling along the features axis.
Encode categorical features as a one-hot numeric array. The only difference to
sklearn.preprocessing.OneHotEncoder is parameter drop_na that, when set to True, allows dropping NaN categories.
More precisely, no separate columns representing NaN categories are added upon transformation, resembling the
behavior of pandas.get_dummies().
Parameters:
drop_na (bool, default=False) – Drop NaN categories. If False, the behavior is identical to sklearn.preprocessing.OneHotEncoder.
drop (iterable, optional) – Categories to drop. If drop_na is True, this parameter must be None.
handle_unknown (str, optional) – How to handle unknown categories. If drop_na is True, this parameter must be “ignore”. None defaults to
“ignore” if drop_na is True and to “error” otherwise.
min_frequency (int | float, optional) – Specifies the minimum frequency below which a category will be considered infrequent. If drop_na is True,
this parameter must be None.
max_categories (int, optional) – Specifies an upper limit to the number of output features for each input feature when considering infrequent
categories. If drop_na is True, this parameter must be None.
See also
sklearn.preprocessing.OneHotEncoder
Notes
If drop_na is True, all features containing only NaN values during fit() are removed entirely.
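For reference, this is the pandas.get_dummies() behavior that drop_na=True is said to resemble. The sketch uses only pandas and a made-up column; it does not call the encoder documented here.

```python
import numpy as np
import pandas as pd

s = pd.Series(["a", np.nan, "b", "a"], name="color")

# Default: NaN gets no column of its own, and the NaN row is all zeros.
without_nan = pd.get_dummies(s)

# dummy_na=True adds a separate indicator column for NaN.
with_nan = pd.get_dummies(s, dummy_na=True)

print(without_nan.shape, with_nan.shape)
```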
Transform columns of a pandas DataFrame depending on their data types.
The order of columns may change compared to the input.
Parameters:
num (str | BaseEstimator, optional) –
The transformation to apply to numerical columns, or “passthrough”, “drop”, “num”, “cat”, “bool”, “timedelta”,
“datetime”, “obj” or None/“default”:
BaseEstimator: Apply the BaseEstimator to all columns with numerical data type. The BaseEstimator must
implement fit() and transform(). Class instances are cloned before being fit to data, to ensure that
the given instances are left unchanged.
“passthrough”: Pass numerical columns through unchanged.
“drop”: Drop numerical columns.
“num”: Prohibited here, but allowed with cat, bool, timedelta, datetime, obj and default: Treat
columns of the respective data type as numerical and apply the transformation specified by num.
“cat”: Treat numerical columns like categorical columns, and apply the transformation specified by cat.
“bool”: Treat numerical columns like boolean columns, and apply the transformation specified by bool.
“timedelta”: Treat numerical columns like timedelta columns, and apply the transformation specified by
timedelta.
“datetime”: Treat numerical columns like datetime columns, and apply the transformation specified by
datetime.
“obj”: Treat numerical columns like columns with object data type, and apply the transformation specified by
obj.
None or “default”: Apply the default transformation, specified by default.
cat (str | BaseEstimator, optional) – The transformation to apply to categorical columns. Same options as for num.
bool (str | BaseEstimator, optional) – The transformation to apply to boolean columns. Same options as for num.
timedelta (str | BaseEstimator, optional) – The transformation to apply to timedelta columns. Same options as for num.
datetime (str | BaseEstimator, optional) – The transformation to apply to datetime columns. Same options as for num.
obj (str | BaseEstimator, optional) – The transformation to apply to columns with object data type. Same options as for num.
default (str | BaseEstimator, default="passthrough") – Default behavior for columns with unspecified transformation. Same options as for num, but cannot be None.
timedelta_resolution (str | pandas.Timedelta, optional) – Convert timedelta columns to float by dividing by the given temporal resolution. This transformation is
applied before any other transformation, and regardless of the value of timedelta.
None keeps the data type of timedelta columns.
datetime_resolution (str | pandas.Timedelta, optional) – Convert datetime columns to float by dividing by the given temporal resolution. This transformation is
applied before any other transformation, and regardless of the value of datetime.
None keeps the data type of datetime columns.
See also
sklearn.compose.ColumnTransformer
Notes
This preprocessing transformation is only applicable to pandas DataFrames.
If the transformation specification is recursive, fit() raises a ValueError. Recursive specifications arise
when some data type A shall be treated like B, B shall be treated like C, C shall be treated like … like A.
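A recursive specification can be detected by following the "treat like" references and remembering the data types already visited. The sketch below is one possible way to do this and is not taken from the implementation; the function `resolve` and the plain-dict specification are hypothetical, and None/“default” handling is left out for brevity.

```python
DTYPES = ("num", "cat", "bool", "timedelta", "datetime", "obj")


def resolve(spec, dtype):
    """Follow 'treat like' references until a concrete transformation is found.

    spec maps data type names to either another data type name or a
    concrete transformation such as "passthrough" (hypothetical format).
    """
    seen = [dtype]
    target = spec[dtype]
    while target in DTYPES:  # target refers to another data type
        if target in seen:
            chain = " -> ".join(seen + [target])
            raise ValueError("Recursive specification: " + chain)
        seen.append(target)
        target = spec[target]
    return target


ok_spec = {"num": "passthrough", "timedelta": "num", "cat": "drop"}
cyclic_spec = {"num": "cat", "cat": "bool", "bool": "num"}

resolved = resolve(ok_spec, "timedelta")  # timedelta -> num -> "passthrough"
```

Calling `resolve(cyclic_spec, "num")` raises a ValueError, because the chain num -> cat -> bool leads back to num.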
Create a transformation for ordinal-encoding categorical features, while keeping other features unchanged.
Parameters:
dtype – Data type of ordinal encoding. Passed to sklearn.preprocessing.OrdinalEncoder.
**kwargs – Keyword arguments passed to DTypeTransformer, most notably num, bool etc. for specifying how to treat
non-categorical columns. cat cannot be specified.
Return type:
DTypeTransformer instance that can be used for ordinal-encoding categorical columns in pandas DataFrames.
Ordinal-encode categorical features in a pandas DataFrame, while keeping other features unchanged.
Internally, this function creates a suitable transformation using ordinal_encoder() and applies its
fit_transform() method to the given DataFrame.
Parameters:
X (pandas.DataFrame) – DataFrame to process.
dtype – Data type of ordinal encoding. Passed to sklearn.preprocessing.OrdinalEncoder.
output (str, default="default") – Desired output type, either “default” (Numpy array) or “pandas” (pandas DataFrame).
**kwargs – Keyword arguments passed to DTypeTransformer, most notably num, bool etc. for specifying how to treat
non-categorical columns. cat cannot be specified.
Return type:
Transformed input, either a DataFrame or an array, depending on output.
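A comparable effect can be obtained with plain scikit-learn building blocks, which may help to understand what the function does. This sketch uses sklearn.compose.ColumnTransformer with a dtype-based column selector; it is an illustration with made-up data, not the internal implementation of DTypeTransformer.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.preprocessing import OrdinalEncoder

df = pd.DataFrame({
    "age": [30, 40, 50],
    "color": pd.Categorical(["red", "blue", "red"]),
})

# Ordinal-encode categorical columns, pass all other columns through.
transformer = ColumnTransformer(
    [("cat", OrdinalEncoder(), make_column_selector(dtype_include="category"))],
    remainder="passthrough",
)
encoded = transformer.fit_transform(df)

print(encoded)
```

Note that the encoded column comes first in the output although it was the second input column, so the column order can change here as well.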
Create a transformation for one-hot-encoding categorical features, while keeping other features unchanged.
Parameters:
drop_na (bool, default=False) – Drop NaN categories. Passed to OneHotEncoder.
drop (iterable, optional) – Categories to drop. If drop_na is True, this parameter must be None. Passed to OneHotEncoder.
dtype – Data type of one-hot encoding. Passed to OneHotEncoder.
handle_unknown (str, optional) – How to handle unknown categories. Passed to OneHotEncoder.
**kwargs – Keyword arguments passed to DTypeTransformer, most notably num, bool etc. for specifying how to treat
non-categorical columns. cat cannot be specified.
Return type:
DTypeTransformer instance that can be used for one-hot-encoding categorical columns in pandas DataFrames.
One-hot encode categorical features in a pandas DataFrame, while keeping other features unchanged.
Internally, this function creates a suitable transformation using one_hot_encoder() and applies its
fit_transform() method to the given DataFrame.
Parameters:
X (pandas.DataFrame) – DataFrame to process.
drop_na (bool, default=False) – Drop NaN categories. Passed to OneHotEncoder.
drop (iterable, optional) – Categories to drop. If drop_na is True, this parameter must be None. Passed to OneHotEncoder.
dtype – Data type of one-hot encoding. Passed to OneHotEncoder.
output (str, default="default") – Desired output type, either “default” (Numpy array) or “pandas” (pandas DataFrame).
**kwargs – Keyword arguments passed to DTypeTransformer, most notably num, bool etc. for specifying how to treat
non-categorical columns. cat cannot be specified.
Return type:
Transformed input, either a DataFrame or an array, depending on output.
Create a transformation for k-bins-discretizing numerical features, while keeping other features unchanged.
Parameters:
n_bins (int, default=5) – Number of bins to produce. Passed to sklearn.preprocessing.KBinsDiscretizer.
encode (str, default="onehot") – Method used to encode the transformed result. Passed to sklearn.preprocessing.KBinsDiscretizer.
strategy (str, default="quantile") – Strategy used to define the widths of the bins. Passed to sklearn.preprocessing.KBinsDiscretizer.
timedelta (str, default="num") – How to treat timedelta features.
**kwargs – Keyword arguments passed to DTypeTransformer, most notably cat, bool etc. for specifying how to treat
non-numerical columns. num cannot be specified.
Return type:
DTypeTransformer instance that can be used for k-bins-discretizing numerical columns in pandas DataFrames.
K-bins discretize numerical features in a pandas DataFrame, while keeping other features unchanged.
Internally, this function creates a suitable transformation using k_bins_discretizer() and applies its
fit_transform() method to the given DataFrame.
Parameters:
X (pandas.DataFrame) – DataFrame to process.
n_bins (int, default=5) – Number of bins to produce. Passed to sklearn.preprocessing.KBinsDiscretizer.
encode (str, default="onehot") – Method used to encode the transformed result. Passed to sklearn.preprocessing.KBinsDiscretizer.
strategy (str, default="quantile") – Strategy used to define the widths of the bins. Passed to sklearn.preprocessing.KBinsDiscretizer.
output (str, default="default") – Desired output type, either “default” (Numpy array) or “pandas” (pandas DataFrame).
timedelta (str, default="num") – How to handle timedelta features.
**kwargs – Keyword arguments passed to DTypeTransformer, most notably cat, bool etc. for specifying how to treat
non-numerical columns. num cannot be specified.
Return type:
Transformed input, either a DataFrame or an array, depending on output.
Create a transformation for binarizing numerical features, while keeping other features unchanged.
Parameters:
threshold (float, default=0) – Feature values below or equal to this are replaced by 0, above it by 1.
Passed to sklearn.preprocessing.Binarizer.
**kwargs – Keyword arguments passed to DTypeTransformer, most notably cat, bool etc. for specifying how to treat
non-numerical columns. num cannot be specified.
Return type:
DTypeTransformer instance that can be used for binarizing numerical columns in pandas DataFrames.
Binarize numerical features in a pandas DataFrame, while keeping other features unchanged.
Internally, this function creates a suitable transformation using binarizer() and applies its fit_transform()
method to the given DataFrame.
Parameters:
X (pandas.DataFrame) – DataFrame to process.
threshold (float, default=0) – Feature values below or equal to this are replaced by 0, above it by 1.
Passed to sklearn.preprocessing.Binarizer.
output (str, default="default") – Desired output type, either “default” (Numpy array) or “pandas” (pandas DataFrame).
**kwargs – Keyword arguments passed to DTypeTransformer, most notably cat, bool etc. for specifying how to treat
non-numerical columns. num cannot be specified.
Return type:
Transformed input, either a DataFrame or an array, depending on output.
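The threshold semantics (values equal to the threshold map to 0) can be verified with sklearn.preprocessing.Binarizer itself. The input values are made up.

```python
import numpy as np
from sklearn.preprocessing import Binarizer

X = np.array([[-1.5, 0.0, 0.3, 2.0]])

# Values <= threshold become 0, values > threshold become 1.
binarized = Binarizer(threshold=0.0).fit_transform(X)

print(binarized)
```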
Create a transformation for scaling numerical features, while keeping other features unchanged.
Parameters:
strategy (str, default="standard") – Strategy used to scale numerical data:
* “standard”: Scale data to have zero mean and unit variance, using sklearn.preprocessing.StandardScaler.
* “robust”: Scale data using statistics that are robust to outliers, using sklearn.preprocessing.RobustScaler.
* “minmax”: Scale data to have zero minimum and unit maximum, using sklearn.preprocessing.MinMaxScaler.
* “maxabs”: Scale data to have a maximum absolute value of 1, using sklearn.preprocessing.MaxAbsScaler.
cat (optional) – How to handle categorical features.
bool (optional) – How to handle boolean features.
timedelta (default="num") – How to handle timedelta features.
datetime (optional) – How to handle datetime features.
obj (optional) – How to handle object features.
default (default="passthrough") – How to handle features for which no transformation is specified elsewhere.
timedelta_resolution (str | pandas.Timedelta, optional) – Timedelta resolution. If None and timedelta is set to “num” (either explicitly or implicitly), the resolution
is automatically set to “s”.
datetime_resolution (str | pandas.Timedelta, optional) – Datetime resolution. If None and datetime is set to “num” (either explicitly or implicitly), the resolution
is automatically set to “s”.
**kwargs – Additional keyword arguments passed to the underlying scikit-learn scaler.
Return type:
DTypeTransformer instance that can be used for scaling numerical columns in pandas DataFrames.
Scale numerical features in a pandas DataFrame, while keeping other features unchanged.
Internally, this function creates a suitable transformation using scaler() and applies its fit_transform()
method to the given DataFrame.
Parameters:
X (pandas.DataFrame) – DataFrame to process.
strategy (str, default="standard") – Strategy used to scale numerical data:
* “standard”: Scale data to have zero mean and unit variance, using sklearn.preprocessing.StandardScaler.
* “robust”: Scale data using statistics that are robust to outliers, using sklearn.preprocessing.RobustScaler.
* “minmax”: Scale data to have zero minimum and unit maximum, using sklearn.preprocessing.MinMaxScaler.
* “maxabs”: Scale data to have a maximum absolute value of 1, using sklearn.preprocessing.MaxAbsScaler.
output (str, default="default") – Desired output type, either “default” (Numpy array) or “pandas” (pandas DataFrame).
**kwargs – Additional keyword arguments passed to scaler().
Return type:
Transformed input, either a DataFrame or an array, depending on output.
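The four strategies correspond to the following scikit-learn classes. The dictionary `SCALERS` is a hypothetical lookup table written for this sketch, which applies each scaler with default settings to a single made-up numerical column.

```python
import numpy as np
from sklearn.preprocessing import (
    MaxAbsScaler,
    MinMaxScaler,
    RobustScaler,
    StandardScaler,
)

# Hypothetical mapping from strategy name to scikit-learn scaler class.
SCALERS = {
    "standard": StandardScaler,
    "robust": RobustScaler,
    "minmax": MinMaxScaler,
    "maxabs": MaxAbsScaler,
}

X = np.array([[1.0], [2.0], [3.0], [10.0]])

# Fit each scaler on the same column and collect the flattened results.
scaled = {name: cls().fit_transform(X).ravel() for name, cls in SCALERS.items()}

for name, values in scaled.items():
    print(name, values)
```

The outlier 10.0 dominates the "minmax" and "maxabs" results, whereas "robust" centers the data on the median and scales by the interquartile range.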
Encoder for features- and labels DataFrames. Implements the BaseEstimator class of sklearn, with methods fit(),
transform() and inverse_transform(), and can easily be dumped to and loaded from disk.
Notes
Encoding ensures that:
The data type of every feature column is either float, int, bool, categorical or string (if the installed Pandas
version supports it). Time-like columns are converted into float, and object data types raise an exception.
The data type of every target column is float.
In regression tasks, this is achieved by converting numerical data types (float, int, bool, time-like) into
float, and raising exceptions if other data types are found.
In binary classification, this is achieved by representing the negative class by 0.0 and the positive class by
1.0. If the original data type is categorical, the negative class corresponds to the first category, whereas
the positive class corresponds to the second category. If the original data type is not categorical the
positive and negative classes are determined through sklearn’s LabelEncoder.
In multiclass classification, this is achieved by representing the i-th class by i.
In multilabel classification, this is achieved by representing the presence of a class by 1.0 and its absence
by 0.0.
Both features and labels may contain NaN values before encoding. These are simply propagated, meaning that encoded
data may contain NaN values as well!
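For a categorical binary target, the encoding rule and the NaN propagation described above can be sketched with pandas alone. This is an illustration of the stated behavior with made-up labels, not the encoder's actual code.

```python
import numpy as np
import pandas as pd

# Binary target: first category is the negative class, second the positive.
y = pd.Series(pd.Categorical(["no", "yes", np.nan, "yes"],
                             categories=["no", "yes"]))

# Category codes are 0 and 1; pandas marks missing values with -1.
codes = y.cat.codes.astype(float)

# Propagate missing values as NaN instead of -1.
encoded = codes.where(codes >= 0)

print(encoded.tolist())
```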
Back-transform features and/or labels DataFrames, i.e., decode encoded data. In the case of classification,
this method can also handle Numpy arrays containing class indices, as returned by predict(), as well as class
probabilities, as returned by predict_proba().
Parameters:
inplace (bool, default=True) – Whether to modify the given data in place.
**kwargs (DataFrame, ndarray, optional) – The data to transform back, with keys “x” (features) or “y” (labels).
Returns:
The back-transformed DataFrame(s), either a single DataFrame if only one of “x” or “y” is passed, or a pair
of DataFrames in the same order as in the argument dict.
Convert “object” data types in df into other data types, if possible. In particular, this includes timedelta,
datetime, categorical and string types, in that order. String types are not supported in all Pandas versions.
Parameters:
df (DataFrame) – The DataFrame.
inplace (bool, default=True) – Whether to modify df in place. Note that if no column in df can be converted, it is returned as-is even if
inplace is False.
max_categories (int, default=100) – The maximum number of allowed categories when converting an object column into a categorical column.
Merge the given tables by left-joining them on ID columns.
Parameters:
tables (Iterable) – The tables to merge, an iterable of DataFrames or paths to tables. Function convert_object_dtypes() is
automatically applied to tables read from files.
Returns:
The pair (df, id_cols), where df is the merged DataFrame and id_cols is the list of potential ID columns.
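The left-joining step can be pictured with the following pandas sketch. It assumes that the ID columns are already known and passed explicitly, whereas the function documented above determines potential ID columns itself; the helper name `left_join_all` and the toy tables are hypothetical.

```python
from functools import reduce

import pandas as pd


def left_join_all(tables, id_cols):
    """Left-join all tables onto the first one, using the given ID columns."""
    return reduce(
        lambda left, right: left.merge(right, how="left", on=id_cols),
        tables,
    )


patients = pd.DataFrame({"id": [1, 2, 3], "age": [34, 58, 71]})
labs = pd.DataFrame({"id": [1, 3], "crp": [0.4, 12.1]})

# Every row of `patients` is kept; missing lab values become NaN.
merged = left_join_all([patients, labs], id_cols=["id"])
```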
Stratified grouped split into train- and test set. Ensures that groups in the two sets do not overlap, and tries
to distribute samples in such a way that class percentages are roughly maintained in each split.
Parameters:
n_splits (int, default=10) – Number of re-shuffling & splitting iterations.
test_size (float | int, optional) – If float, should be between 0.0 and 1.0 and represent the proportion
of the dataset to include in the test split. If int, represents the
absolute number of test samples. If None, the value is set to the
complement of the train size. If train_size is also None, it will
be set to 0.1.
train_size (float | int, optional) – If float, should be between 0.0 and 1.0 and represent the
proportion of the dataset to include in the train split. If
int, represents the absolute number of train samples. If None,
the value is automatically set to the complement of the test size.
random_state (int | RandomState, optional) – Controls the randomness of the training and testing indices produced.
Pass an int for reproducible output across multiple function calls.
method (str, default="automatic") – Resampling method to use. Can be “automatic”, “exact” and “brute_force”.
If there are many small groups, “brute_force” tends to give reasonable
results and is significantly faster than “exact”. Otherwise, if there
are only few large groups, method “exact” might be preferable.
“automatic” tries to infer the optimal method based on the number of
groups.
n_iter (int, optional) – Number of brute-force iterations. The larger the number, the more
splits are tried, and hence the better the results get. If None, the
number of iterations is determined automatically.
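The idea behind a brute-force search can be sketched as follows: draw several random assignments of whole groups to the test set and keep the one whose class proportions deviate least from the overall proportions. This is a simplified illustration under stated assumptions, not the splitter's actual algorithm. In particular, `test_size` is interpreted here as a fraction of groups rather than of samples, and the function name `brute_force_group_split` is hypothetical.

```python
import numpy as np


def brute_force_group_split(y, groups, test_size=0.25, n_iter=200, seed=0):
    """Return (train_idx, test_idx) with disjoint groups and similar class ratios."""
    y = np.asarray(y)
    groups = np.asarray(groups)
    rng = np.random.default_rng(seed)
    unique_groups = np.unique(groups)
    classes = np.unique(y)
    overall = np.array([(y == c).mean() for c in classes])

    # Number of groups in the test set, clamped so that both parts are non-empty.
    n_test_groups = int(round(test_size * len(unique_groups)))
    n_test_groups = min(max(1, n_test_groups), len(unique_groups) - 1)

    best_mask, best_score = None, np.inf
    for _ in range(n_iter):
        test_groups = rng.choice(unique_groups, size=n_test_groups, replace=False)
        mask = np.isin(groups, test_groups)
        # Sum of absolute deviations from the overall class proportions.
        score = 0.0
        for part in (mask, ~mask):
            dist = np.array([(y[part] == c).mean() for c in classes])
            score += np.abs(dist - overall).sum()
        if score < best_score:
            best_mask, best_score = mask, score
    return np.flatnonzero(~best_mask), np.flatnonzero(best_mask)


# Eight groups of three samples each, with the same class ratio in every group.
groups = np.repeat(np.arange(8), 3)
y = np.tile([0, 0, 1], 8)
train_idx, test_idx = brute_force_group_split(y, groups, test_size=0.25, n_iter=50)
```

More iterations try more candidate splits, which is why larger values of n_iter tend to give better results at the cost of runtime.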
n_splits (int, default=10) – Number of re-shuffling & splitting iterations.
shuffle (bool, default=False) – Whether to shuffle samples before splitting.
random_state (int or RandomState, optional) – Controls the randomness of the training and testing indices produced. Pass an int for reproducible output
across multiple function calls.
method (str, default="automatic") – Resampling method to use. Can be “automatic”, “exact” and “brute_force”. If there are many small groups,
“brute_force” tends to give reasonable results and is significantly faster than “exact”. Otherwise, if there
are only few large groups, method “exact” might be preferable. “automatic” tries to infer the optimal method
based on the number of groups. Note that “brute_force” is only possible if shuffle is set to True.
n_iter (int, optional) – Number of brute-force iterations. The larger the number, the more splits are tried, and hence the better the
results get. If None, the number of iterations is determined automatically.
Predefined split cross-validator. Provides train/test indices to split data into train/test sets using a
predefined scheme specified by explicit test indices.
In contrast to sklearn.model_selection.PredefinedSplit, samples can be in the test set of more than one split.
Parameters:
test_folds (list of array-like) – Indices of test samples for each split. The number of splits equals the length of the list.
Note that the test sets do not have to be pairwise disjoint.
See also
sklearn.model_selection.PredefinedSplit
Notes
In methods split() etc., parameters y and groups only exist for compatibility, but are always ignored.
X is only needed for obtaining the total number of samples.
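A minimal sketch of this splitting scheme is given below. It assumes that the training set of each split is the complement of the corresponding test indices, which is not stated explicitly above; the generator name `predefined_splits` is hypothetical.

```python
import numpy as np


def predefined_splits(n_samples, test_folds):
    """Yield (train_idx, test_idx) pairs, one per entry of test_folds."""
    all_idx = np.arange(n_samples)
    for test in test_folds:
        test = np.asarray(test, dtype=int)
        # Assumption: everything that is not in the test set is used for training.
        train = np.setdiff1d(all_idx, test)
        yield train, test


# Sample 1 appears in the test set of both splits, which
# sklearn.model_selection.PredefinedSplit cannot express.
splits = list(predefined_splits(6, [[0, 1], [1, 2, 3]]))
```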
Resample data in EAV (entity-attribute-value) format wrt. explicitly passed windows of arbitrary (possibly
infinite) length.
Parameters:
df (pd.DataFrame | dask.dataframe.DataFrame) – The DataFrame to resample, in EAV format. That means, must have columns value_col (contains observed values),
time_col (contains observation times), attribute_col (optional; contains attribute identifiers) and
entity_col (optional; contains entity identifiers). Must have one column index level. Data types are
arbitrary, as long as observation times and entity identifiers can be compared wrt. < and <= (e.g., float,
int, time delta, date time). Entity identifiers must not be NA. Observation times may be NA, but such entries
are ignored entirely. df can be a Dask DataFrame as well. In that case, however, entity_col must not be
None and entities should already be on the row index, with known divisions. Otherwise, the row index is set to
entity_col, which can be very costly both in terms of time and memory. Especially if df is known to be
sorted wrt. entities already, the caller should set the row index itself beforehand; see
https://docs.dask.org/en/stable/generated/dask.dataframe.DataFrame.set_index.html.
windows (pd.DataFrame | dask.dataframe.DataFrame | callable) – The target windows into which df is resampled. Must have two column index levels and columns (time_col,
“start”) (optional; contains start times of each window), (time_col, “stop”) (optional; contains end times of
each window), (entity_col, “”) (optional; contains entity identifiers) and (window_group_col, “”) (optional;
contains information for creating groups of mutually disjoint windows). Start- and end times may be NA, but such
windows are deemed invalid and by definition do not contain any observations. At least one of the two
endpoint-columns must be given; if one is missing it is assumed to represent +/- inf. windows can be a Dask
DataFrame as well. In that case, however, entity_col must not be None and entities should already be on the
row index, with known divisions. Otherwise, the row index is set to entity_col, which can be very costly both
in terms of time and memory. Especially if windows is known to be sorted wrt. entities already, the caller
should set the row index itself beforehand; see
https://docs.dask.org/en/stable/generated/dask.dataframe.DataFrame.set_index.html.
Alternatively, windows can be a callable that, when applied to a DataFrame and keyword arguments
entity_col, time_col, attribute_col and value_col, returns a DataFrame of the form described above.
The canonical example of such a callable is the result returned by make_windows(); see the documentation of
make_windows() for details.
agg (dict) –
The aggregations to apply. Must be a dict mapping attribute identifiers to lists of aggregation functions,
which are applied to all observed values of the respective attribute in each specified window. Supported
aggregation functions are:
"mean": Empirical mean of observed non-NA values
"min": Minimum of observed non-NA values; equivalent to “p0”
"max": Maximum of observed non-NA values; equivalent to “p100”
"median": Median of observed non-NA values; equivalent to “p50”
"std": Empirical standard deviation of observed non-NA values
"var": Empirical variance of observed non-NA values
"sum": Sum of observed non-NA values
"prod": Product of observed non-NA values
"skew": Skewness of observed non-NA values
"mad": Mean absolute deviation of observed non-NA values
"sem": Standard error of the mean of observed non-NA values
"size": Number of observations, including NA values
"count": Number of non-NA observations
"nunique": Number of unique observed non-NA values
"mode": Mode of observed non-NA values, i.e., most frequent value; ties are broken randomly but
reproducibly
"mode_count": Number of occurrences of mode
"pxx": Percentile of observed non-NA values; xx is an arbitrary float in the interval [0, 100]
"rxx": xx-th observed value (possibly NA), starting from 0; negative indices count from the end
"txx": Time of xx-th observed value; negative indices count from the end
"callable": Function that takes as input a DataFrame in and returns a new DataFrame out.
See Notes for details.
entity_col (str, optional) – Name of the column in df and windows containing entity identifiers. If None, all entries are assumed to
belong to the same entity. Note that entity identifiers may also be on the row index.
time_col (str, optional) – Name of the column in df containing observation times, and also name of column(s) in windows containing
start- and end times of the windows. Note that despite its name the data type of the column is arbitrary, as
long as it supports the following arithmetic- and order operations: -, /, <, <=.
attribute_col (str, optional) – Name of the column in df containing attribute identifiers. If None, all entries are assumed to belong to the
same attribute; in that case agg may only contain one single item.
value_col (str, optional) – Name of the column in df containing the observed values.
include_start (bool, default=True) – Start times of observation windows are part of the windows.
include_stop (bool, default=False) – End times of observation windows are part of the windows.
optimize (str, default='time') – Optimize runtime or memory requirements. If set to “time”, the function returns faster but requires more
memory; if set to “memory”, the runtime is longer but memory consumption is reduced to a minimum. If “time”,
global variable MAX_ROWS can be used to adjust the time-memory tradeoff: increasing it increases memory
consumption while reducing runtime. Note that this parameter is only relevant for computing non-rank-like
aggregations, since rank-like aggregations (“rxx”, “txx”) can be efficiently computed anyway.
Returns:
Resampled data. Like windows, but with one additional column for each requested aggregation.
Order of columns is arbitrary, order of rows is exactly as in windows – unless windows is a Dask DataFrame,
in which case the order of rows may differ. The output is a (lazy) Dask DataFrame if windows is a Dask
DataFrame, and a Pandas DataFrame otherwise, regardless of what df is.
Return type:
pd.DataFrame | dask.dataframe.DataFrame
Notes
When passing a callable to agg, it is expected to take as input a DataFrame in and return a new DataFrame out.
in has two columns time_col and value_col (in that order). Its row index specifies which entries belong to
the same observation window: entries with the same row index value belong to the same window, entries with
different row index values belong to distinct windows. Observation times are guaranteed to be non-N/A, values may
be N/A. Note, however, that in is not necessarily sorted wrt. its row index and/or observation times! Also note
that the entities the observations in in stem from (if entity_col is specified) are not known to the function.
out should have one row per row index value of in (with the same row index value), and an arbitrary number of
columns with arbitrary names and dtypes. Columns should be consistent in every invocation of the function.
The reason why the function is not applied to each row-index-value group individually is that some aggregations can
be implemented efficiently using sorting rather than grouping. The function should be stateless and must not modify
in in place.
Example 1: A simple aggregation which calculates the fraction of values between 0 and 1 in every window could be
passed as
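A minimal sketch of such a callable, relying only on the contract described above (second column holds the values, row index identifies the window), might look as follows. The function name, the output column name and the toy data are illustrative assumptions. N/A values count as "not between 0 and 1" here, which is a design choice of this sketch.

```python
import pandas as pd


def frac_between_0_1(x: pd.DataFrame) -> pd.DataFrame:
    """Fraction of observed values in [0, 1], per observation window."""
    values = x.iloc[:, 1]             # columns are (time_col, value_col), in that order
    inside = values.between(0, 1)     # N/A values evaluate to False
    # Group by the row index, which identifies the observation window.
    return inside.groupby(level=0).mean().to_frame("frac_between_0_1")


# Two windows (row index values 0 and 1) with two and three observations.
x = pd.DataFrame(
    {"time": [0, 1, 0, 1, 2], "value": [0.5, 2.0, 0.1, 0.2, -1.0]},
    index=[0, 0, 1, 1, 1],
)
out = frac_between_0_1(x)
```

The function does not modify its input and returns one row per row index value, as required.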
Example 2: A more sophisticated aggregation which fits a linear regression to the observations in every window
and returns the slope of the resulting regression line could be defined as
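One possible sketch of such a slope aggregation uses the closed-form least-squares solution, slope = cov(t, v) / var(t), evaluated per window with group-wise operations. It assumes that observation times can be cast to float; the function name `regression_slope` and the output column name are hypothetical.

```python
import pandas as pd


def regression_slope(x: pd.DataFrame) -> pd.DataFrame:
    """Least-squares slope of value over time, per observation window."""
    obs = x.iloc[:, :2].copy()        # copy, so that the input is never modified
    obs.columns = ["t", "v"]
    obs = obs.astype(float).dropna()  # ignore observations with N/A values
    # Center times and values within each window (the row index identifies windows).
    means = obs.groupby(level=0).transform("mean")
    centered = obs - means.to_numpy()
    num = (centered["t"] * centered["v"]).groupby(level=0).sum()
    den = (centered["t"] ** 2).groupby(level=0).sum()
    out = (num / den).to_frame("slope")  # NaN if all times in a window coincide
    # One row per window of the input, including windows without valid observations.
    return out.reindex(x.index.unique())


x = pd.DataFrame(
    {"time": [0, 1, 2, 0, 2], "value": [1.0, 3.0, 5.0, 4.0, 0.0]},
    index=[0, 0, 0, 1, 1],
)
slopes = regression_slope(x)
```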
Resample interval-like data wrt. explicitly passed windows of arbitrary (possibly infinite) length.
“Interval-like” means that each observation is characterized by a start- and stop time rather than a singular
timestamp (as in EAV data).
Parameters:
df (pd.DataFrame | dask.dataframe.DataFrame) – The DataFrame to resample. Must have columns value_col (contains observed values), start_col (optional;
contains start times), stop_col (optional; contains end times), attribute_col (optional; contains attribute
identifiers) and entity_col (optional; contains entity identifiers). Must have one column index level. Data
types are arbitrary, as long as times and entity identifiers can be compared wrt. < and <= (e.g., float,
int, time delta, date time). Entity identifiers must not be NA. Values must be numeric (float, int, bool).
Observation times and observed values may be NA, but such entries are ignored entirely. Although both
start_col and stop_col are optional, at least one must be present. Missing start- and end columns are
interpreted as -/+ inf. All intervals are closed, i.e., start- and end times are included. This is especially
relevant for entries whose start time equals their end time.
df can be a Dask DataFrame as well. In that case, however, entity_col must not be None and entities should
already be on the row index, with known divisions. Otherwise, the row index is set to entity_col, which can be
very costly both in terms of time and memory. Especially if df is known to be sorted wrt. entities already,
the caller should set the row index itself beforehand; see
https://docs.dask.org/en/stable/generated/dask.dataframe.DataFrame.set_index.html.
windows (pd.DataFrame | dask.dataframe.DataFrame | callable) – The target windows into which df is resampled. Must have either one or two column index level(s). If it has
one column index level, must have columns start_col (optional; contains start times of each window),
stop_col (optional; contains end times of each window) and entity_col (optional; contains entity
identifiers). If it has two column index levels, the columns must be (time_col, “start”),
(time_col, “stop”) and (entity_col, “”). Start- and end times may be NA, but such windows are deemed
invalid and by definition do not overlap with any observation intervals. At least one of the two
endpoint-columns must be present; if one is missing it is assumed to represent -/+ inf. All time windows are
closed, i.e., start- and end times are included. windows can be a Dask DataFrame as well. In that case,
however, entity_col must not be None and entities should already be on the row index, with known divisions.
Otherwise, the row index is set to entity_col, which can be very costly both in terms of time and memory.
Especially if windows is known to be sorted wrt. entities already, the caller should set the row index itself
beforehand; see
https://docs.dask.org/en/stable/generated/dask.dataframe.DataFrame.set_index.html.
Alternatively, windows can be a callable that, when applied to a DataFrame and keyword arguments
entity_col, start_col, stop_col, time_col, attribute_col and value_col, returns a DataFrame of the
form described above. The canonical example of such a callable is the result returned by make_windows(); see
the documentation of make_windows() for details.
attributes (list, optional) – The attributes to consider. Must be a list-like of attribute identifiers. None defaults to the list of all such
identifiers present in column attribute_col. If attribute_col is None but attributes is not, it must be a
singleton list.
entity_col (str, optional) – Name of the column in df and windows containing entity identifiers. If None, all entries are assumed to
belong to the same entity. Note that entity identifiers may also be on the row index.
start_col (str, optional) – Name of the column in df (and windows if it has only one column index level) containing start times. If
None, all start times are assumed to be -inf. Note that despite its name the data type of the column is
arbitrary, as long as it supports the following arithmetic- and order operations: -, /, <, <=.
stop_col (str, optional) – Name of the column in df (and windows if it has only one column index level) containing end times. If None,
all end times are assumed to be +inf. Note that despite its name the data type of the column is arbitrary, as
long as it supports the following arithmetic- and order operations: -, /, <, <=.
attribute_col (str, optional) – Name of the column in df containing attribute identifiers. If None, all entries are assumed to belong to the
same attribute.
value_col (str, optional) – Name of the column in df containing the observed values.
time_col (list | str, optional) – Name of the column(s) in windows containing start- and end times of the windows. Only needed if windows
has two column index levels, because otherwise these two columns must be called start_col and stop_col,
respectively.
epsilon – The value to set \(W_I\) to if \(I\) is infinite and \(W \cap I\) is non-empty and finite;
see Notes for details.
Returns:
Resampled data. Like windows, but with one additional column for each attribute, and same number of
column index levels.
Order of columns is arbitrary, order of rows is exactly as in windows – unless windows is a Dask DataFrame, in
which case the order of rows may differ.
The output is a (lazy) Dask DataFrame if windows is a Dask DataFrame, and a Pandas DataFrame otherwise,
regardless of what df is.
Notes
Typical examples of interval-like data are medication records, since medications can be administered over
longer time periods.
The only supported resampling aggregation is summing the observed values per time window, scaled by the fraction
of the length of the intersection of observation interval and time window divided by the total length of the
observation interval: Let \(W = [s, t]\) be a time window and let \(I = [a, b]\) be an observation interval
with observed value \(v\). Then \(I\) contributes to \(W\) the value
\(W_I = v * \frac{|W \cap I|}{|I|}\)
The overall value of \(W\) is the sum of \(W_I\) over all intervals. Of course, all this is computed
separately for each entity-attribute combination.
Some remarks on the above equation are in order:
If \(v\) is N/A, \(W_I\) is set to 0.
If \(a = b\) both numerator and denominator are 0. In this case the fraction is defined as 1 if
\(a \in W\) (i.e., \(s \leq a \leq t\)) and 0 otherwise.
If \(I\) is infinite and \(W \cap I\) is non-empty but finite, \(W_I\) is set to
\(epsilon * sign(v)\).
Note that \(W \cap I\) is non-empty even if it is of the form \([x, x]\). This leads to the slightly
counter-intuitive situation that \(W_I = epsilon\) if \(I\) is infinite, and \(W_I = 0\) if \(I\)
is finite.
If \(I\) and \(W \cap I\) are both infinite, the fraction is defined as 1. This is regardless of whether
\(W \cap I\) equals \(I\) or whether it is a proper subset of it.
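The case distinctions above can be summarized in a small pure-Python sketch that computes \(W_I\) for a single interval and a single window. It only illustrates the rules stated in these Notes and is not the vectorized implementation; the function name `interval_contribution` is hypothetical, infinite endpoints are represented by `math.inf`, and `epsilon` must be passed explicitly because its default value is not stated here.

```python
import math


def interval_contribution(v, a, b, s, t, epsilon):
    """Contribution W_I of interval I = [a, b] with value v to window W = [s, t]."""
    if v is None or math.isnan(v):
        return 0.0                    # N/A values contribute nothing
    lo, hi = max(a, s), min(b, t)     # endpoints of the intersection of W and I
    if lo > hi:
        return 0.0                    # empty intersection
    if a == b:
        return float(v)               # point interval inside W: fraction is 1
    if math.isinf(b - a):             # I is infinite
        if math.isinf(hi - lo):
            return float(v)           # intersection is infinite, too: fraction is 1
        sign = (v > 0) - (v < 0)
        return epsilon * sign         # finite, non-empty intersection
    return v * (hi - lo) / (b - a)    # regular case: scale by the overlap fraction


# Half of the interval [0, 10] overlaps with the window [5, 20].
regular = interval_contribution(10.0, 0.0, 10.0, 5.0, 20.0, epsilon=0.01)
# Intersection [5, 5] with a finite and with an infinite interval.
touching_finite = interval_contribution(10.0, 0.0, 5.0, 5.0, 10.0, epsilon=0.01)
touching_infinite = interval_contribution(10.0, -math.inf, 5.0, 5.0, 10.0, epsilon=0.01)
```

The last two calls reproduce the slightly counter-intuitive situation mentioned above: the single-point intersection yields 0 for the finite interval and epsilon for the infinite one.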
Convenience function for easily creating windows that can be passed to functions resample_eav() and
resample_interval().
Note that internally, invoking this function does not create the actual windows-DataFrame yet. Instead, when
passing the resulting callable to resample_eav() or resample_interval(), it is applied to the DataFrame to be
resampled. This makes it possible to refer to that DataFrame implicitly here; see the examples below for specific use-cases.
Parameters:
df (pd.DataFrame | str, optional) – Source DataFrame. If None, defaults to the DataFrame to be resampled in resample_eav() or
resample_interval().
Can also be a string, which will be evaluated using Python’s eval() function. The string can contain
references to the DataFrame to be resampled via variable df, and to column-names entity_col, time_col,
start_col and stop_col passed to resample_eav() and resample_interval().
Example: “df.groupby(entity_col)[time_col].max().to_frame()”
entity (pd.Series | pd.Index | str | scalar, optional) – Entity of each window. Series are used as-is (possibly after re-ordering rows to match other row indices),
strings refer to columns in df, and scalars are replicated to populate every window with the same value.
If None, defaults to df[entity_col] if df contains that column.
start (pd.Series | pd.Index | str | scalar, optional) – Start time of each window. Series are used as-is (possibly after re-ordering rows to match other row indices),
strings refer to columns in df, and scalars are replicated to populate every window with the same value.
Note that despite its name the data type of the start times is arbitrary, as long as it supports the following
arithmetic- and order operations: -, /, <, <=.
start and start_rel are mutually exclusive.
stop (pd.Series | pd.Index | str | scalar, optional) – Stop time of each window. Series are used as-is (possibly after re-ordering rows to match other row indices),
strings refer to columns in df, and scalars are replicated to populate every window with the same value.
Note that despite its name the data type of the stop times is arbitrary, as long as it supports the following
arithmetic- and order operations: -, /, <, <=.
stop and stop_rel are mutually exclusive.
start_rel (pd.Series | pd.Index | str | scalar, optional) – Start time of each window, relative to anchor. Series are used as-is (possibly after re-ordering rows to
match other row indices), strings refer to columns in df, and scalars are replicated to populate every window
with the same value. If given, anchor must be given, too.
start and start_rel are mutually exclusive.
stop_rel (pd.Series | pd.Index | str | scalar, optional) – Stop time of each window, relative to anchor. Series are used as-is (possibly after re-ordering rows to
match other row indices), strings refer to columns in df, and scalars are replicated to populate every window
with the same value. If given, anchor must be given, too.
stop and stop_rel are mutually exclusive.
duration (pd.Series | pd.Index | str | scalar, optional) – Duration of each window. Series are used as-is (possibly after re-ordering rows to match other row indices),
strings refer to columns in df, and scalars are replicated to populate every window with the same value.
Durations can only be specified if exactly one endpoint (either start or stop) is specified; the other endpoint
is then computed from duration.
anchor (pd.Series | pd.Index | str | scalar, optional) – Anchor time start_rel and stop_rel refer to. Series are used as-is (possibly after re-ordering rows to
match other row indices), strings refer to columns in df, and scalars are replicated to populate every window
with the same value. Ignored unless start_rel or stop_rel is given.
If start_rel or stop_rel is given but anchor is None, it defaults to time_col, but a warning message is
printed.
Notes
The current implementation does not support Dask DataFrames.
This function does not check whether windows are non-empty, i.e., whether start times come before end times.
Examples
Use-case 1: Create fixed-length windows relative to the time column in the DataFrame to be resampled. Since
anchor is required by start_rel but not set explicitly, it defaults to time_col, but a warning message is
printed.
Use-case 2: Similar to use-case 1, but only create one window per entity, for the temporally last entry. Note
how the DataFrame to be resampled is only passed once directly to function resample_eav(); make_windows()
refers to it implicitly via variable name “df” in the string of keyword argument df. Note also that the
resulting DataFrame may have entities on its row index.
Use-case 3: make_windows() can be used with function resample_interval(), too – regardless of whether
time_col is passed to resample_interval() or not.
    resample_interval(
        df_to_be_resampled,
        make_windows(
            stop=pd.Series(...),
            duration=pd.Series(...),  # must have the same row index as the Series passed to `stop`
        ),
        start_col=...,
        stop_col=...,
        time_col=...,  # optional
        ...
    )
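For comparison, a windows DataFrame of the two-level form expected by resample_eav() can also be assembled by hand. The sketch below builds fixed-length windows relative to the observation times, assuming that relative start times are added to the anchor. The column names, the offsets and the toy data are hypothetical, and the sketch does not call make_windows() itself.

```python
import pandas as pd

entity_col, time_col = "subject_id", "timestamp"  # hypothetical column names

# Toy EAV data: three observations of one attribute for two entities.
df = pd.DataFrame(
    {
        entity_col: [1, 1, 2],
        time_col: [10.0, 12.0, 7.0],
        "attribute": ["hr", "hr", "hr"],
        "value": [80.0, 85.0, 70.0],
    }
)

# One window per observation, starting 2 time units before it and lasting 3 units.
start = df[time_col] - 2.0
stop = start + 3.0
windows = pd.DataFrame({"entity": df[entity_col], "start": start, "stop": stop})

# Two column index levels: (entity_col, ""), (time_col, "start"), (time_col, "stop").
windows.columns = pd.MultiIndex.from_tuples(
    [(entity_col, ""), (time_col, "start"), (time_col, "stop")]
)
```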
Find the previous/next values of some columns in DataFrame df, for every entry. Additionally, entries can be
grouped and previous/next values only searched within each group.
Parameters:
df (pd.DataFrame) – The DataFrame.
sort_by (list | str, optional) – The column(s) to sort by. Can be the name of a single column or a list of column names and/or row index levels.
Strings are interpreted as column names or row index names, integers are interpreted as row index levels.
ATTENTION! N/A values in columns to sort by are not ignored; rather, they are treated in the same way as Pandas
treats such values in DataFrame.sort_values(), i.e., they are put at the end.
group_by (list | str, optional) – Column(s) to group df by, optional. Same values as sort_by.
prev_name and next_name are the names of the columns in the result, containing the previous/next values.
If any of them is None, the corresponding previous/next values are not computed for that column.
prev_fill and next_fill specify which values to assign to the first/last entry in every group, which does
not have any previous/next values.
Note that column names not present in df are tacitly skipped.
first_indicator_name (str, optional) – Name of the column in the result containing boolean indicators whether the corresponding entries come first in
their respective groups. If None, no such column is added.
last_indicator_name (str, optional) – Name of the column in the result containing boolean indicators whether the corresponding entries come last in
their respective groups. If None, no such column is added.
keep_sorted (bool, default=False) – Keep the result sorted wrt. group_by and sort_by. If False, the order of rows of the result is identical
to that of df.
inplace (bool, default=False) – If True, the new columns are added to df.
Returns:
The modified DataFrame if inplace is True, a DataFrame with the requested previous/next values otherwise.
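The core of the computation can be pictured as a sort followed by a group-wise shift. The following pandas sketch handles a single column, one sort column and one grouping column, and assumes a unique row index; the function name `prev_next_sketch` and the result column names are hypothetical.

```python
import pandas as pd


def prev_next_sketch(df, sort_by, group_by, column):
    """Previous/next values of `column` within each group, ordered by `sort_by`."""
    ordered = df.sort_values([group_by, sort_by])
    grouped = ordered.groupby(group_by)[column]
    out = pd.DataFrame({"prev": grouped.shift(1), "next": grouped.shift(-1)})
    # Restore the original row order (corresponds to keep_sorted=False).
    return out.reindex(df.index)


df = pd.DataFrame(
    {"g": ["a", "a", "b", "a"], "t": [2, 1, 5, 3], "v": [20, 10, 50, 30]}
)
result = prev_next_sketch(df, sort_by="t", group_by="g", column="v")
```

The first entry of every group has no previous value and the last entry has no next value; in this sketch both are left as NaN rather than being filled.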
Class for performing bootstrapping [1], i.e., repeatedly sample with replacement from given data and evaluate
statistics on each resample to obtain mean, standard deviation, etc. for more robust estimates.
Parameters:
*args (array-like) – Data, non-empty sequence of array-likes with the same length.
kwargs (dict, optional) – Additional keyword arguments passed to the function fn computing the statistics. Like args, the values
of the dict must be array-likes with the same length as the elements of args.
fn (callable | dict | tuple, optional) – The statistics to compute. Must be None, a function that takes the given args as input and returns a
scalar/array-like or a (nested) dict/tuple thereof, or a (nested) dict/tuple of such functions.
seed (int, optional) – Random seed.
replace (bool, default=True) – Whether to resample with replacement. If False, this does not actually correspond to bootstrapping.
size (int | float, default=1.) – The size of the resampled data. If <= 1, it is multiplied with the number of samples in the given data.
Bootstrapping normally assumes that resampled data have the same number of samples as the original data, so
this parameter should be set to 1.
Run bootstrapping for a given number of repetitions, and store the results in a list. Results are appended
to results from previous runs!
Parameters:
n_repetitions (int, default=100) – Number of repetitions.
sample_indices (ndarray, optional) – Pre-computed sample indices to use in each repetition. If not None, n_repetitions is ignored and
sample_indices must have shape (n, size).
Compute aggregate statistics of the results of the individual runs, like mean, standard deviation, etc.
Parameters:
func (str | callable) – The aggregation function to apply. If a string, can be the name of a Numpy function (“mean”, “std”, etc.),
or “iqr” (interquartile range) or “ci<alpha>” (confidence interval wrt. alpha).
Describe the results of the individual runs by computing a predefined set of statistics, similar to pandas’
describe() method. Only works for (dicts/tuples of) scalar values.
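The overall procedure of running and aggregating can be sketched with NumPy as follows. This is a generic illustration of bootstrapping for a single array and a single statistic, not the class documented above; the function name `bootstrap_sketch` and the toy data are hypothetical.

```python
import numpy as np


def bootstrap_sketch(data, fn, n_repetitions=100, seed=None):
    """Evaluate fn on n_repetitions resamples (with replacement) of data."""
    data = np.asarray(data)
    rng = np.random.default_rng(seed)
    n = len(data)
    results = []
    for _ in range(n_repetitions):
        idx = rng.integers(0, n, size=n)  # sample indices with replacement
        results.append(fn(data[idx]))
    return np.asarray(results)


values = np.array([2.0, 4.0, 4.0, 5.0, 7.0, 9.0])
runs = bootstrap_sketch(values, np.mean, n_repetitions=200, seed=42)

# Aggregate the individual runs, e.g. mean, standard deviation and a 95% interval.
ci_low, ci_high = np.quantile(runs, [0.025, 0.975])
summary = {"mean": runs.mean(), "std": runs.std(), "ci95": (ci_low, ci_high)}
```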
Summarize the performance of multiple prediction models trained and evaluated with CaTabRa. This is a convenient
way for quickly comparing them and selecting the best model(s) for a certain task. An implicit assumption of this
function is that all models were trained on the same prediction task.
IMPORTANT: Only pre-evaluated metrics in “metrics.xlsx” and “bootstrapping.xlsx” are considered!
Parameters:
directories (Iterable[str | Path]) – The directories to consider, an iterable of path-like objects. Each directory must be the output directory of
an invocation of catabra.evaluate, or a subdirectory corresponding to a specific split (containing
“metrics.xlsx” and maybe also “bootstrapping.xlsx”). A convenient way to specify a couple of directories
matching a certain pattern is by using Path(root_path).rglob(pattern).
metrics (Iterable[str]) –
List of metrics to include in the summary, an iterable of strings. Values must match the following pattern:
target is optional and specifies the target (or class in case of multiclass classification); can be “*” to
include all available targets, and can be a sequence separated by “,”. Ignored if
bootstrapping_aggregation is specified.
metric_name is the name of the actual metric, exactly as written in “metrics.xlsx” or “bootstrapping.xlsx”;
can be “*” to include all available pre-evaluated metrics, and can be a sequence separated by “,”.
threshold is optional and must be a numeral between 0 and 1 (cannot be a string like “balance”), and cannot
be “*”. Only relevant for threshold-dependent classification metrics, and mutually exclusive with
bootstrapping_aggregation. Note that the given threshold must exactly match one of the thresholds
evaluated in “metrics.xlsx”.
bootstrapping_aggregation is optional and specifies the bootstrapping aggregation to include, like “mean”,
“std”, etc.; can be “*” to include all available pre-evaluated aggregations in “bootstrapping.xlsx”, and
can be a sequence separated by “,”.
split (Iterable[str], default=None) – If a directory in directories has subdirectories corresponding to data splits that were evaluated separately,
only include the splits in split. If None, all splits are included.
path_callback (Callable, default=None) – Callback function applied to every path visited. Must return None, True, False or a dict; False indicates that
the current path should be dropped from the output, True and None are aliases for {}, and a dict adds a
column for every key to the output DataFrame, with the corresponding values in them.
Returns:
DataFrame with one row per evaluation and one column per performance metric. If multiple splits are included in
the performance summary, each is put into a separate row.
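A callback compatible with the return values described for path_callback might look like the sketch below. It assumes that the callback receives a single path-like argument, which is not spelled out above; the directory layout and the key names "experiment" and "run" are hypothetical.

```python
from pathlib import Path


def path_callback(path):
    """Drop debug runs and annotate all other paths with two extra columns."""
    path = Path(path)
    if "debug" in path.parts:
        return False  # drop this path from the output
    # Each key becomes an additional column in the output DataFrame.
    return {"experiment": path.parent.name, "run": path.name}


kept = path_callback("results/exp_a/run_1")
dropped = path_callback("results/debug/run_2")
```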
Summarize the feature importance of multiple prediction models trained and explained with CaTabRa. This is a
convenient way for quickly comparing them. An implicit assumption of this function is that all models were trained
on the same prediction task, and that the same feature importance calculation method was applied to generate the
importance scores.
IMPORTANT: Only pre-evaluated feature importance scores are considered!
Parameters:
directories (Iterable[str | Path]) – The directories to consider, an iterable of path-like objects. Each directory must be the output directory of
an invocation of catabra.explain, or a subdirectory corresponding to a specific split (containing HDF5 files
with feature importance scores). A convenient way to specify a couple of directories matching a certain pattern
is by using Path(root_path).rglob(pattern).
columns (Iterable[str], default=None) – The columns in global feature importance scores to consider. For instance, if
catabra.explanation.average_local_explanations() is used to produce global scores, 4 columns “>0”, “<0”,
“>0 std” and “<0 std” are normally generated. This parameter makes it possible to include only a subset in the summary.
None defaults to all columns.
new_column_name (str) – String pattern specifying the names of the columns in the output DataFrame. May have two named fields feature
and column, which are filled with original feature- and column names, respectively.
glob (bool) – Whether feature importance scores in directories are global. If not,
catabra.explanation.average_local_explanations() is applied.
split (Iterable[str], default=None) – If a directory in directories has subdirectories corresponding to data splits that were explained separately,
only include the splits in split. If None, all splits are included.
model_id (Iterable[str], default=None) – Model-IDs to consider, optional. Determines the names of the HDF5 files to be included. None defaults to all
found model-IDs.
path_callback (Callable, default=None) – Callback function applied to every path visited. Must return None, True, False or a dict; False indicates that
the current path should be dropped from the output, True and None are aliases for {}, and a dict adds a
column for every key to the output DataFrame, with the corresponding values in them.
Returns:
DataFrame with one row per explanation and one column per feature-column pair. If multiple splits are included
in the importance summary, each is put into a separate row. If there are multiple targets (multiclass/multilabel
classification, multioutput regression) and the feature importance scores for each target are stored in a
separate table, each is put into a separate row and an additional column “__target__” is added.