Evaluation

evaluate(*table: str | Path | DataFrame, folder: str | Path | None = None, model_id=None, explain=None, glob: bool | None = False, split: str | None = None, sample_weight: str | None = None, create_stats: bool | None = None, check_ood: bool | None = None, out: str | Path | None = None, jobs: int | None = None, batch_size: int | None = None, threshold: float | str | None = None, bootstrapping_repetitions: int | None = None, bootstrapping_metrics: list | None = None, from_invocation: str | Path | dict | None = None)[source]

Evaluate an existing CaTabRa object (OOD-detector, prediction model, …) on held-out test data.

Parameters:
  • *table – The table(s) to evaluate the CaTabRa object on. If multiple are given, their columns are merged into a single table. Must have the same format as the table(s) initially passed to function analyze().

  • folder (str | Path) – The folder containing the CaTabRa object to evaluate.

  • model_id (optional) – ID of the prediction model to evaluate. If None or “__ensemble__”, the sole trained model or the entire ensemble is evaluated.

  • explain (list | str, optional) – Explain prediction model(s). If “__all__”, all models specified by model_id are explained; otherwise, must be a list of the model ID(s) to explain, which must be a subset of the models that are evaluated.

  • glob (bool, default=False) – Whether to create global instead of local explanations.

  • split (str, optional) – Column used for splitting the data into disjoint subsets. If specified and not “”, each subset is evaluated individually. In contrast to function analyze(), the name/values of the column do not need to carry any semantic information about training and test sets.

  • sample_weight (str, optional) – Column with sample weights. If specified and not “”, must have numeric data type. Sample weights are used both for training and evaluating prediction models.

  • create_stats (bool, optional) – Whether to generate and save descriptive statistics of the given data table.

  • check_ood (bool, optional) – Whether to apply the OOD-detector (if any) to the given data.

  • out (str | Path, optional) – Directory where to save all generated artifacts. Defaults to a directory located in folder, with a name following a fixed naming pattern. If out already exists, the user is prompted to specify whether it should be replaced; otherwise, it is automatically created.

  • jobs (int, optional) – Number of jobs to use. Overwrites the “jobs” config param.

  • batch_size (int, optional) – Batch size used for applying the prediction model.

  • threshold (float | str, optional) – Decision threshold for binary- and multilabel classification tasks. Confusion matrix plots and bootstrapped performance results are reported for this particular threshold. In binary classification this can also be the name of a built-in thresholding strategy, possibly followed by “on” and the split on which to calculate the threshold. Splits must be specified by the name of the subdirectory containing the corresponding evaluation results. See /doc/metrics.md for a list of built-in thresholding strategies.

  • bootstrapping_repetitions (int, optional) – Number of bootstrapping repetitions. Overwrites the “bootstrapping_repetitions” config param.

  • bootstrapping_metrics (list, optional) – Names of metrics for which bootstrapped scores are computed, if bootstrapping is enabled. Defaults to the list of main metrics specified in the default config. Can also be “__all__”, in which case all standard metrics for the current prediction task are computed. Ignored if bootstrapping is disabled.

  • from_invocation (str | Path | dict, optional) – Dict or path to an invocation.json file. All arguments of this function not explicitly specified are taken from this dict; this also includes the table on which to evaluate the CaTabRa object.
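
A minimal usage sketch follows; the import path catabra.evaluation as well as all file paths and column names are illustrative assumptions, not prescribed by this reference.

    from catabra.evaluation import evaluate   # assumed import path

    # Evaluate the whole ensemble stored in an existing CaTabRa folder on a held-out table.
    evaluate(
        "test_data.csv",                      # same format as the table(s) passed to analyze()
        folder="path/to/catabra_object",
        model_id=None,                        # None or "__ensemble__": evaluate the entire ensemble
        split="cohort",                       # evaluate each subset of column "cohort" separately
        threshold=0.5,
        bootstrapping_repetitions=100,
        out="path/to/catabra_object/eval_test",
    )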

class CaTabRaEvaluation(invocation: str | Path | dict | None = None)[source]

Bases: CaTabRaBase

evaluate_split(y_true: DataFrame, y_hat: ndarray, encoder, directory=None, main_metrics: list | None = None, y_true_decoded=None, y_hat_decoded=None, sample_weight: ndarray | None = None, threshold: float | Callable = 0.5, static_plots: bool = True, interactive_plots: bool = False, bootstrapping_repetitions: int = 0, bootstrapping_metrics: list | None = None, split: str | None = None, verbose: bool = False) dict | None[source]

Evaluate a single split, given by ground truth and predictions.

Parameters:
  • y_true (DataFrame) – Ground truth, encoded DataFrame.

  • y_hat (ndarray) – Predictions array.

  • encoder – Encoder used for encoding and decoding.

  • directory (str | Path) – Directory where to save the evaluation results. If None, results are returned in a dict.

  • main_metrics (list) – Main evaluation metrics. If None, defaults to the metrics specified in the default config.

  • y_true_decoded (optional) – Decoded ground truth for creating regression plots. If None, encoder is applied to decode y_true.

  • y_hat_decoded (optional) – Decoded predictions for creating regression plots. If None, encoder is applied to decode y_hat.

  • sample_weight (ndarray, optional) – Sample weights. If None, uniform weights are used.

  • threshold – Decision threshold for binary- and multilabel classification problems.

  • static_plots – Whether to create static plots.

  • interactive_plots – Whether to create interactive plots.

  • bootstrapping_repetitions – Number of bootstrapping repetitions.

  • bootstrapping_metrics – Names of metrics for which bootstrapped scores are computed, if bootstrapping_repetitions is > 0. Defaults to main_metrics. Can also be “__all__”, in which case all standard metrics for the current prediction task are computed. Ignored if bootstrapping_repetitions is 0.

  • split – Name of the current split, or None. Only used for logging.

  • verbose – Whether to log key performance metrics.

Returns:

None if directory is given, else dict with evaluation results.

Return type:

dict | None

calc_metrics(predictions: str | Path | DataFrame, encoder: Encoder, threshold: float = 0.5, bootstrapping_repetitions: int = 0, bootstrapping_metrics: list | None = None, sample_weight: str | ndarray = 'from_predictions') Tuple[dict, dict][source]

Calculate performance metrics from raw sample-wise predictions and corresponding ground-truth.

Parameters:
  • predictions (str | Path | DataFrame) – Sample-wise predictions, as saved in “predictions.xlsx”.

  • encoder (Encoder) – Encoder, as saved in “encoder.json”. Can be conveniently loaded by instantiating a CaTabRaLoader object and calling its get_encoder() method.

  • threshold (float, default=0.5) – Decision threshold for binary- and multilabel classification problems. In binary classification this can also be the name of a built-in thresholding strategy. See /doc/metrics.md for a list of built-in thresholding strategies.

  • bootstrapping_repetitions (int, default=0) – Number of bootstrapping repetitions to perform.

  • bootstrapping_metrics (list, optional) – Names of metrics for which bootstrapped scores are computed, if bootstrapping_repetitions is > 0. Defaults to the list of main metrics specified in the default config. Can also be “__all__”, in which case all standard metrics for the current prediction task are computed. Ignored if bootstrapping_repetitions is 0.

  • sample_weight (str | ndarray, optional) – Sample weights, one of “from_predictions” (to use sample weights stored in predictions), “uniform” (to use uniform sample weights) or an array.

Returns:

Pair (metrics, bootstrapping), where metrics is a dict whose values are DataFrames and corresponds exactly to what is by default saved as “metrics.xlsx” when invoking function evaluate(), and bootstrapping is a dict mapping keys “summary” and “details” to DataFrames or None.

Return type:

Tuple
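
Sketch of recomputing metrics from a previously saved evaluation. The import paths of calc_metrics and CaTabRaLoader, the way the loader is constructed, and all file paths are assumptions.

    from catabra.evaluation import calc_metrics   # assumed import path
    from catabra.util.io import CaTabRaLoader     # assumed import path

    # Assumption: the loader is constructed with the CaTabRa folder containing "encoder.json".
    encoder = CaTabRaLoader("path/to/catabra_object").get_encoder()

    metrics, bootstrapping = calc_metrics(
        "path/to/evaluation/predictions.xlsx",     # sample-wise predictions saved by evaluate()
        encoder,
        threshold=0.5,
        bootstrapping_repetitions=100,             # > 0 populates the "bootstrapping" output
    )
    # metrics corresponds to "metrics.xlsx"; bootstrapping maps "summary"/"details" to DataFrames.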

calc_regression_metrics(y_true: DataFrame, y_hat: DataFrame | ndarray, sample_weight: ndarray | None = None) DataFrame[source]

Calculate all suitable regression metrics for all targets individually, and for their combination.

Parameters:
  • y_true (DataFrame) – Ground truth. All columns must have numerical data type. Entries may be NaN, in which case only non-NaN entries are considered.

  • y_hat (DataFrame, ndarray) – Predictions. Must have the same shape as y_true. Entries may be NaN, in which case only non-NaN entries are considered.

  • sample_weight (ndarray) – Sample weights. If given, must have shape (len(y_true),).

Returns:

DataFrame with one column for each calculated metric, and one row for each column of y_true plus an extra row “__overall__”. Note that “__overall__” is added even if y_true has only one column, in which case the metrics for that column and “__overall__” coincide.

Return type:

DataFrame
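
A small self-contained sketch (the import path is an assumption):

    import numpy as np
    import pandas as pd
    from catabra.evaluation import calc_regression_metrics   # assumed import path

    # Two targets; NaN ground-truth entries are ignored per target.
    y_true = pd.DataFrame({"t1": [1.0, 2.0, np.nan, 4.0], "t2": [0.5, 0.0, 1.5, 2.0]})
    y_hat = pd.DataFrame({"t1": [1.1, 1.8, 3.0, 4.2], "t2": [0.4, 0.2, 1.4, 1.9]})

    metrics = calc_regression_metrics(y_true, y_hat)
    # One row per target plus "__overall__", one column per metric.
    print(metrics)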

calc_binary_classification_metrics(y_true: DataFrame, y_hat: DataFrame | ndarray, sample_weight: ndarray | None = None, thresholds: list | None = None, ensure_thresholds: list | None = None, calibration_thresholds: ndarray | None = None) Tuple[dict, DataFrame, DataFrame, Tuple[ndarray, ndarray, ndarray], Tuple[ndarray, ndarray, ndarray]][source]

Calculate all metrics suitable for binary classification tasks.

Parameters:
  • y_true (DataFrame) – Ground truth. Must have 1 column with float data type and values among 0, 1 and NaN.

  • y_hat (DataFrame | ndarray) – Predictions. Must have the same number of rows as y_true and either 1 or 2 columns.

  • sample_weight (ndarray, optional) – Sample weights. If given, must have shape (len(y_true),).

  • thresholds (list, optional) – List of thresholds to use for thresholded metrics. If None, a default list of thresholds depending on the values of y_hat is constructed.

  • ensure_thresholds (list, optional) – List of thresholds that must appear among the used thresholds, if thresholds is set to None. Ignored if thresholds is a list.

  • calibration_thresholds (ndarray, optional) – Thresholds to use for calibration curves. If None, a default list depending on the values of y_hat is constructed.

Returns:

5-tuple (overall, threshold, calibration, roc_curve, pr_curve):

  • overall is a dict containing the scores of threshold-independent metrics (e.g., ROC-AUC).

  • threshold is a DataFrame with one column for each threshold-dependent metric, and one row for each decision threshold.

  • calibration is a DataFrame with one row for each threshold-bin and three columns with information about the corresponding bin ranges and fraction of positive samples.

  • roc_curve is the receiver operating characteristic curve, as returned by sklearn.metrics.roc_curve(). Although similar information is already contained in threshold[“specificity”] and threshold[“sensitivity”], roc_curve is more fine-grained and better suited for plotting.

  • pr_curve is the precision-recall curve, as returned by sklearn.metrics.precision_recall_curve(). Although similar information is already contained in threshold[“sensitivity”] and threshold[“positive_predictive_value”], pr_curve is more fine-grained and better suited for plotting.

Return type:

Tuple
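
Self-contained sketch; the import path is an assumption, while the column names “sensitivity” and “specificity” are taken from the description above.

    import numpy as np
    import pandas as pd
    from catabra.evaluation import calc_binary_classification_metrics   # assumed import path

    y_true = pd.DataFrame({"label": [0.0, 1.0, 1.0, 0.0, 1.0]})
    # Predicted probability of the positive class, as a single column.
    y_hat = np.array([0.2, 0.9, 0.6, 0.4, 0.7]).reshape(-1, 1)

    overall, threshold, calibration, roc_curve, pr_curve = \
        calc_binary_classification_metrics(y_true, y_hat)
    print(overall)                                       # threshold-independent metrics
    print(threshold[["sensitivity", "specificity"]])     # one row per decision threshold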

calc_multiclass_metrics(y_true: DataFrame, y_hat: DataFrame | ndarray, sample_weight: ndarray | None = None, labels: list | None = None) Tuple[dict, DataFrame, DataFrame][source]

Calculate all metrics suitable for multiclass classification.

Parameters:
  • y_true (DataFrame) – Ground truth. Must have 1 column with float data type and values among NaN, 0, 1, …, n_classes - 1.

  • y_hat (DataFrame | ndarray) – Predicted class probabilities. Must have shape (len(y_true), n_classes) and values between 0 and 1.

  • sample_weight (ndarray) – Sample weights.

  • labels (list) – Class names.

Returns:

Triple (overall, conf_mat, per_class), where overall is a dict with overall performance metrics (accuracy, F1, etc.), conf_mat is the confusion matrix, and per_class is a DataFrame with per-class metrics (one row per class, one column per metric).

Return type:

Tuple
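
Sketch with three classes (import path assumed):

    import numpy as np
    import pandas as pd
    from catabra.evaluation import calc_multiclass_metrics   # assumed import path

    # Classes encoded as 0, 1, 2; y_hat contains class probabilities.
    y_true = pd.DataFrame({"target": [0.0, 2.0, 1.0, 1.0]})
    y_hat = np.array([[0.7, 0.2, 0.1],
                      [0.1, 0.3, 0.6],
                      [0.2, 0.5, 0.3],
                      [0.3, 0.4, 0.3]])

    overall, conf_mat, per_class = calc_multiclass_metrics(
        y_true, y_hat, labels=["a", "b", "c"]
    )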

calc_multilabel_metrics(y_true: DataFrame, y_hat: DataFrame | ndarray, sample_weight: ndarray | None = None, thresholds: list | None = None, ensure_thresholds: list | None = None) Tuple[DataFrame, DataFrame, dict, dict, dict][source]

Calculate all metrics suitable for multilabel classification.

Parameters:
  • y_true (DataFrame) – Ground truth. Must have n_classes columns with float data type and values among 0, 1 and NaN.

  • y_hat (DataFrame | ndarray) – Predicted class probabilities. Must have shape (len(y_true), n_classes) and values between 0 and 1.

  • sample_weight (ndarray, optional) – Sample weights. If given, must have shape (len(y_true),).

  • thresholds (list, optional) – List of thresholds to use for thresholded metrics. If None, a default list of thresholds depending on the values of y_hat is constructed.

  • ensure_thresholds (list, optional) – List of thresholds that must appear among the used thresholds, if thresholds is set to None. Ignored if thresholds is a list.

Returns:

5-tuple (overall, threshold, threshold_per_class, roc_curves, pr_curves):

  • overall is a DataFrame containing non-thresholded metrics per class and for all classes combined (“__micro__”, “__macro__” and “__weighted__”). Weights are the number of positive samples per class.

  • threshold is a DataFrame containing thresholded metrics for different thresholds for all classes combined.

  • threshold_per_class is a dict mapping classes to per-class thresholded metrics.

  • roc_curves is a dict mapping classes to receiver operating characteristic curves, as returned by sklearn.metrics.roc_curve(). Although similar information is already contained in threshold_per_class, roc_curves is more fine-grained and better suited for plotting.

  • pr_curves is a dict mapping classes to precision-recall curves, as returned by sklearn.metrics.precision_recall_curve(). Although similar information is already contained in threshold_per_class, pr_curves is more fine-grained and better suited for plotting.

Return type:

Tuple
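
Sketch with two labels (import path assumed):

    import numpy as np
    import pandas as pd
    from catabra.evaluation import calc_multilabel_metrics   # assumed import path

    # One column per label; NaN ground-truth entries are allowed.
    y_true = pd.DataFrame({"A": [1.0, 0.0, 1.0, 0.0], "B": [0.0, 1.0, np.nan, 1.0]})
    y_hat = np.array([[0.8, 0.1],
                      [0.3, 0.7],
                      [0.6, 0.4],
                      [0.2, 0.9]])

    overall, threshold, threshold_per_class, roc_curves, pr_curves = \
        calc_multilabel_metrics(y_true, y_hat)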

plot_regression(y_true: DataFrame, y_hat: DataFrame, sample_weight: ndarray | None = None, interactive: bool = False) dict[source]

Plot evaluation results of regression tasks.

Parameters:
  • y_true (DataFrame) – Ground truth. May be encoded or decoded, and may contain NaN values.

  • y_hat (DataFrame) – Predictions, with same shape, column names and data types as y_true.

  • sample_weight (ndarray, optional) – Sample weights.

  • interactive (bool, default=False) – Whether to create interactive plots using the plotly backend, or static plots using the Matplotlib backend.

Returns:

Dict mapping names to figures.

Return type:

dict

plot_binary_classification(overall: dict, thresholded: DataFrame, threshold: float = 0.5, calibration: DataFrame | None = None, name: str | None = None, neg_label: str = 'negative', pos_label: str = 'positive', roc_curve=None, pr_curve=None, roc_curve_bs=None, pr_curve_bs=None, calibration_curve_bs=None, interactive: bool = False) dict[source]

Plot evaluation results of binary classification tasks.

Parameters:
  • overall (dict) – Overall, non-thresholded performance metrics, as returned by function calc_binary_classification_metrics().

  • thresholded (DataFrame) – Thresholded performance metrics, as returned by function calc_binary_classification_metrics().

  • threshold (float, default=0.5) – Decision threshold.

  • calibration (DataFrame, optional) – Calibration curve, as returned by function calc_binary_classification_metrics().

  • name (str, optional) – Name of the classified variable.

  • neg_label (str, default='negative') – Name of the negative class.

  • pos_label (str, default='positive') – Name of the positive class.

  • roc_curve (optional) – ROC-curve, triple (fpr, tpr, thresholds) or None.

  • pr_curve (optional) – Precision-recall-curve, triple (precision, recall, thresholds) or None.

  • roc_curve_bs (optional) – ROC-curves obtained via bootstrapping. None or a pair (fpr, tpr), where both components are equal-length lists of arrays of shape (n_thresholds,).

  • pr_curve_bs (optional) – Precision-recall-curves obtained via bootstrapping. None or a pair (precision, recall), where both components are equal-length lists of arrays of shape (n_thresholds,).

  • calibration_curve_bs (optional) – Calibration curves obtained via bootstrapping. None or a single array of shape (n_thresholds, n_repetitions); the thresholds must agree with those in calibration.

  • interactive (bool, default=False) – Whether to create interactive plots using the plotly backend, or static plots using the Matplotlib backend.

Returns:

Dict mapping names to figures.

Return type:

dict
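
Sketch chaining calc_binary_classification_metrics() into this function (import paths assumed):

    import numpy as np
    import pandas as pd
    from catabra.evaluation import (calc_binary_classification_metrics,
                                    plot_binary_classification)   # assumed import paths

    y_true = pd.DataFrame({"label": [0.0, 1.0, 1.0, 0.0, 1.0]})
    y_hat = np.array([0.2, 0.9, 0.6, 0.4, 0.7]).reshape(-1, 1)

    overall, thresholded, calibration, roc_curve, pr_curve = \
        calc_binary_classification_metrics(y_true, y_hat)

    figures = plot_binary_classification(
        overall,
        thresholded,
        threshold=0.5,
        calibration=calibration,
        roc_curve=roc_curve,
        pr_curve=pr_curve,
        name="label",
        interactive=False,          # static Matplotlib figures
    )
    # figures maps plot names to figure objects that can be saved or displayed.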

plot_multiclass(confusion_matrix: DataFrame, interactive: bool = False) dict[source]

Plot evaluation results of multiclass classification tasks.

Parameters:
  • confusion_matrix (DataFrame) – Confusion matrix, as returned by function calc_multiclass_metrics().

  • interactive (bool, default=False) – Whether to create interactive plots using the plotly backend, or static plots using the Matplotlib backend.

Returns:

Dict mapping names to figures.

Return type:

dict

plot_multilabel(overall: DataFrame, thresholded: dict, threshold: float = 0.5, labels=None, roc_curves=None, pr_curves=None, interactive: bool = False) dict[source]

Plot evaluation results of multilabel classification tasks.

Parameters:
  • overall (DataFrame) – Overall, non-thresholded performance metrics, as returned by function calc_multilabel_metrics().

  • thresholded (dict) – Thresholded performance metrics, as returned by function calc_multilabel_metrics().

  • threshold (float, default=0.5) – Decision threshold.

  • labels (optional) – Class names. None or a DataFrame with n_class columns and 2 rows.

  • roc_curves (optional) – ROC-curves, dict mapping classes to triples (fpr, tpr, thresholds) or None.

  • pr_curves (optional) – Precision-recall-curves, dict mapping classes to triples (precision, recall, thresholds) or None.

  • interactive (bool, default=False) – Whether to create interactive plots using the plotly backend, or static plots using the Matplotlib backend.

Returns:

Dict mapping names to figures.

Return type:

dict

plot_results(predictions: str | Path | DataFrame, metrics_: str | Path | DataFrame | dict, encoder: Encoder, interactive: bool = False, threshold: float = 0.5, bootstrapping_repetitions: int = 0) dict[source]

Plot the results of an evaluation. This happens automatically if config params “static_plots” or “interactive_plots” are set to True. This function does not save the resulting plots to disk, but instead returns them in a (nested) dict. This allows one to further modify / fine-tune them before displaying or saving.

Parameters:
  • predictions (str | Path | DataFrame) – Sample-wise predictions, as saved in “predictions.xlsx”.

  • metrics_ (str | Path | DataFrame | dict) – Performance metrics, as saved in “metrics.xlsx”.

  • encoder (Encoder) – Encoder, as saved in “encoder.json”. Can be conveniently loaded by instantiating a CaTabRaLoader object and calling its get_encoder() method.

  • interactive (bool, default=False) – Whether to create static Matplotlib plots or interactive plotly plots.

  • threshold (float, default=0.5) – Decision threshold for binary- and multilabel classification problems.

  • bootstrapping_repetitions (int, default=0) – Number of bootstrapping repetitions to perform for adding confidence intervals to ROC-, PR- and calibration curves in binary classification tasks.

Returns:

(Nested) dict of Matplotlib or plotly figure objects, depending on the value of interactive.

Return type:

dict
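
Sketch of re-creating plots from saved evaluation artifacts; import paths, file paths and the way the encoder is loaded are assumptions.

    from catabra.evaluation import plot_results   # assumed import path
    from catabra.util.io import CaTabRaLoader     # assumed import path

    encoder = CaTabRaLoader("path/to/catabra_object").get_encoder()

    figures = plot_results(
        "path/to/evaluation/predictions.xlsx",
        "path/to/evaluation/metrics.xlsx",
        encoder,
        interactive=False,
        threshold=0.5,
        bootstrapping_repetitions=100,   # confidence intervals for ROC-, PR- and calibration curves
    )
    # Nested dict of Matplotlib figures (plotly figures if interactive=True).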

performance_summary(*args, sample_weight: ndarray | None = None, task: str | None = None, metric_list=None, threshold: float = 0.5, add_na_results: bool = True) dict | Callable[[ndarray, ndarray], dict][source]

Summarize the quality of predictions by evaluating a given set of performance metrics. In contrast to functions calc_regression_metrics() etc., this function is more “light-weight”, for instance in the sense that it only computes binary classification metrics for one particular threshold and only returns aggregate results over all classes in case of multiclass and multilabel classification. It can also be efficiently combined with bootstrapping.

Parameters:
  • *args – Ground truth and predictions, array-like of shape (n_samples,) or (n_samples, n_targets). Should not contain NA values. Predictions must contain class probabilities rather than classes in case of classification tasks. Either none or both must be specified.

  • sample_weight (ndarray, optional) – Sample weight, None or array-like of shape (n_samples,).

  • task (str, optional) – Prediction task.

  • metric_list (list, optional) – List of metrics to evaluate. Metrics that do not fit to the given prediction task are tacitly skipped.

  • threshold (float, default=0.5) – Decision threshold for binary- and multilabel classification problems.

  • add_na_results (bool, default=True) – Whether to add N/A results in the output, e.g., if one of the requested metrics cannot be calculated. If False, these metrics are tacitly skipped.

Returns:

If args is pair (y_true, y_hat): dict whose keys are names of metrics and whose values are the results of the respective metrics evaluated on y_true and y_hat. Otherwise, if args is empty, callable which can be applied to y_true and y_hat (and optionally sample_weight).

Return type:

dict | Callable
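
Sketch of both calling modes; the import path, the task string "binary_classification" and the metric names are assumptions.

    import numpy as np
    from catabra.evaluation import performance_summary   # assumed import path

    y_true = np.array([0, 1, 1, 0, 1])
    y_prob = np.array([0.1, 0.8, 0.6, 0.3, 0.9])   # class probabilities, not hard labels

    # Direct evaluation on (y_true, y_hat).
    scores = performance_summary(
        y_true, y_prob,
        task="binary_classification",               # assumed task name
        metric_list=["roc_auc", "accuracy"],         # assumed metric names
        threshold=0.5,
    )

    # Called without data, a callable is returned that can be applied repeatedly,
    # e.g. to bootstrap resamples of (y_true, y_prob).
    score_fn = performance_summary(task="binary_classification",
                                   metric_list=["roc_auc", "accuracy"])
    scores_again = score_fn(y_true, y_prob)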