Metrics

This section lists all built-in metrics that are available through the catabra-lib.metrics module.

Built-in Regression Metrics

R2

  • Implementation: r2

  • Also known as: coefficient of determination, R squared

  • Range: (-inf, 1]

  • Optimum: 1

  • Documentation: scikit-learn, Wikipedia

Mean Absolute Error

  • Implementation: mean_absolute_error

  • Also known as: MAE

  • Range: [0, inf)

  • Optimum: 0

  • Documentation: scikit-learn, Wikipedia

Mean Squared Error

  • Implementation: mean_squared_error

  • Also known as: MSE

  • Range: [0, inf)

  • Optimum: 0

  • Documentation: scikit-learn, Wikipedia

  • Note: Equivalent to mean_tweedie_deviance with power=0.

Root Mean Squared Error

  • Implementation: root_mean_squared_error

  • Also known as: RMSE, root-mean-square deviation, RMSD

  • Range: [0, inf)

  • Optimum: 0

  • Documentation: scikit-learn, Wikipedia

Mean Squared Logarithmic Error

  • Implementation: mean_squared_log_error

  • Also known as: MSLE

  • Range: [0, inf)

  • Optimum: 0

  • Documentation: scikit-learn

  • Note: Only defined for non-negative inputs.

Median Absolute Error

  • Implementation: median_absolute_error

  • Also known as: MedAE

  • Range: [0, inf)

  • Optimum: 0

  • Documentation: scikit-learn

Mean Absolute Percentage Error

  • Implementation: mean_absolute_percentage_error

  • Also known as: MAPE, mean absolute percentage deviation, MAPD

  • Range: [0, inf)

  • Optimum: 0

  • Documentation: scikit-learn

Max Error

  • Implementation: max_error

  • Also known as: maximum residual error

  • Range: [0, inf)

  • Optimum: 0

  • Documentation: scikit-learn

Explained Variance

  • Implementation: explained_variance

  • Also known as: explained variance regression score

  • Range: (-inf, 1]

  • Optimum: 1

  • Documentation: scikit-learn
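
R2 and explained variance share the same range and optimum, so they are easy to confuse. The following minimal sketch uses the standard textbook definitions in plain Python; the function names are hypothetical and not part of the library. It illustrates that a constant offset in the predictions lowers R2 but leaves explained variance untouched.

```python
def mean(values):
    return sum(values) / len(values)


def r2_sketch(y_true, y_pred):
    # 1 - (residual sum of squares) / (total sum of squares)
    y_bar = mean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - y_bar) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def explained_variance_sketch(y_true, y_pred):
    # 1 - Var(residuals) / Var(y_true); the mean residual is removed first
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    r_bar = mean(residuals)
    y_bar = mean(y_true)
    var_res = mean([(r - r_bar) ** 2 for r in residuals])
    var_true = mean([(t - y_bar) ** 2 for t in y_true])
    return 1.0 - var_res / var_true


y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [2.0, 3.0, 4.0, 5.0]  # every prediction is off by exactly +1

r2_value = r2_sketch(y_true, y_pred)
ev_value = explained_variance_sketch(y_true, y_pred)
print(r2_value, ev_value)
```

Here the residuals are constant, so their variance is zero and explained variance stays at its optimum, while R2 drops to about 0.2.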

Mean Poisson Deviance

  • Implementation: mean_poisson_deviance

  • Range: [0, inf)

  • Optimum: 0

  • Documentation: scikit-learn

  • Note: Equivalent to mean_tweedie_deviance with power=1.

Mean Gamma Deviance

  • Implementation: mean_gamma_deviance

  • Range: [0, inf)

  • Optimum: 0

  • Documentation: scikit-learn

  • Note: Equivalent to mean_tweedie_deviance with power=2.

Mean Tweedie Deviance

  • Implementation: mean_tweedie_deviance

  • Range: [0, inf)

  • Optimum: 0

  • Documentation: scikit-learn

  • Note: Parameter power is the Tweedie power parameter. With power=0 this metric is equivalent to mean_squared_error, with power=1 it is equivalent to mean_poisson_deviance, and with power=2 it is equivalent to mean_gamma_deviance.
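
The equivalences stated in the notes above can be checked with a small plain-Python sketch of the mean Tweedie deviance, following the formula given in the scikit-learn documentation. The function names are hypothetical, and input validation (for example, rejecting unsupported power values or non-positive predictions) is deliberately omitted.

```python
import math


def mean_tweedie_deviance_sketch(y_true, y_pred, power=0.0):
    """Mean unit Tweedie deviance; assumes strictly positive predictions."""
    total = 0.0
    for y, mu in zip(y_true, y_pred):
        if power == 0:
            # Normal distribution: squared error
            dev = (y - mu) ** 2
        elif power == 1:
            # Poisson distribution; y * log(y / mu) is taken as 0 for y == 0
            dev = 2 * ((y * math.log(y / mu) if y > 0 else 0.0) - y + mu)
        elif power == 2:
            # Gamma distribution; requires y > 0
            dev = 2 * (math.log(mu / y) + y / mu - 1)
        else:
            dev = 2 * (
                max(y, 0.0) ** (2 - power) / ((1 - power) * (2 - power))
                - y * mu ** (1 - power) / (1 - power)
                + mu ** (2 - power) / (2 - power)
            )
        total += dev
    return total / len(y_true)


def mean_squared_error_sketch(y_true, y_pred):
    return sum((y - mu) ** 2 for y, mu in zip(y_true, y_pred)) / len(y_true)


y_true = [1.0, 2.0, 4.0]
y_pred = [1.5, 2.0, 3.0]

print(mean_tweedie_deviance_sketch(y_true, y_pred, power=0))  # same value as the MSE
print(mean_squared_error_sketch(y_true, y_pred))
print(mean_tweedie_deviance_sketch(y_true, y_pred, power=1))  # Poisson deviance
print(mean_tweedie_deviance_sketch(y_true, y_pred, power=2))  # gamma deviance
```

The Poisson and gamma branches are the limits of the general expression as power approaches 1 and 2, which is why they need separate formulas.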

Built-in Classification Metrics

Area under Receiver Operating Characteristic Curve

  • Implementation:

    • binary: roc_auc

    • multiclass: roc_auc_ovr, roc_auc_ovr_weighted, roc_auc_ovo, roc_auc_ovo_weighted

    • multilabel: roc_auc_micro, roc_auc_macro, roc_auc_samples, roc_auc_weighted

  • Also known as: ROC-AUC

  • Range: [0, 1]

  • Optimum: 1

  • Accepts probabilities: yes

  • Documentation: scikit-learn, Wikipedia

  • Note: roc_auc_ovr and roc_auc_ovo return macro-averaged values by default.

  • Note: Refer to Section “Averaging” for information about micro-, macro-, samples- and weighted averaging.

Average Precision

  • Implementation:

    • binary: average_precision

    • multilabel: average_precision_micro, average_precision_macro, average_precision_samples, average_precision_weighted

  • Also known as: AP, mean average precision (mAP) in case of average_precision_macro

  • Range: [0, 1]

  • Optimum: 1

  • Accepts probabilities: yes

  • Documentation: scikit-learn, Wikipedia

  • Note: Not equivalent to pr_auc, but similar.

  • Note: Refer to Section “Averaging” for information about micro-, macro-, samples- and weighted averaging.

Area under Precision-Recall Curve

  • Implementation:

    • binary: pr_auc

    • multilabel: pr_auc_micro, pr_auc_macro, pr_auc_samples, pr_auc_weighted

  • Also known as: PR-AUC

  • Range: [0, 1]

  • Optimum: 1

  • Accepts probabilities: yes

  • Documentation: scikit-learn

  • Note: Not equivalent to average_precision, but similar.

  • Note: Refer to Section “Averaging” for information about micro-, macro-, samples- and weighted averaging.
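
To see why average precision and the area under the precision-recall curve are similar but not identical, consider the following plain-Python sketch. It assumes that average precision is the step-wise sum described in the scikit-learn documentation and that the area is obtained with the trapezoidal rule; the latter is an assumption about the implementation, and all function names are hypothetical.

```python
def precision_recall_points(y_true, y_score):
    """(recall, precision) pairs obtained by lowering the threshold one sample at a time.

    Assumes binary labels in {0, 1} and distinct scores, to keep the sketch short.
    """
    order = sorted(range(len(y_true)), key=lambda i: y_score[i], reverse=True)
    n_pos = sum(y_true)
    points = [(0.0, 1.0)]  # conventional start: recall 0, precision 1
    tp = 0
    for k, i in enumerate(order, start=1):
        tp += y_true[i]
        points.append((tp / n_pos, tp / k))
    return points


def average_precision_sketch(points):
    # Step-wise sum: each recall increment is weighted by the precision reached there.
    return sum((r - r_prev) * p for (r_prev, _), (r, p) in zip(points, points[1:]))


def pr_auc_trapezoid_sketch(points):
    # Trapezoidal rule: linear interpolation between neighbouring curve points.
    return sum(
        (r - r_prev) * (p + p_prev) / 2
        for (r_prev, p_prev), (r, p) in zip(points, points[1:])
    )


y_true = [1, 0, 1, 0]
y_score = [0.9, 0.8, 0.7, 0.1]

points = precision_recall_points(y_true, y_score)
ap = average_precision_sketch(points)
area = pr_auc_trapezoid_sketch(points)
print(ap, area)
```

On this toy input the step-wise sum yields 5/6, whereas the trapezoidal area is 19/24, because linear interpolation between curve points treats the drop in precision differently.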

Brier Score

  • Implementation:

    • binary: brier_loss

    • multilabel: brier_loss_micro, brier_loss_macro, brier_loss_samples, brier_loss_weighted

  • Range: [0, 1]

  • Optimum: 0

  • Accepts probabilities: yes

  • Documentation: scikit-learn, Wikipedia

  • Note: Refer to Section “Averaging” for information about micro-, macro-, samples- and weighted averaging.

Hinge Loss

  • Implementation:

    • binary & multiclass: hinge_loss

    • multilabel: hinge_loss_micro, hinge_loss_macro, hinge_loss_samples, hinge_loss_weighted

  • Range: [0, inf)

  • Optimum: 0

  • Accepts probabilities: yes

  • Documentation: scikit-learn, Wikipedia

  • Note: Refer to Section “Averaging” for information about micro-, macro-, samples- and weighted averaging.

Log Loss

  • Implementation:

    • binary & multiclass: log_loss

    • multilabel: log_loss_micro, log_loss_macro, log_loss_samples, log_loss_weighted

  • Also known as: logistic loss, cross-entropy loss

  • Range: [0, inf)

  • Optimum: 0

  • Accepts probabilities: yes

  • Documentation: scikit-learn, Wikipedia

  • Note: Refer to Section “Averaging” for information about micro-, macro-, samples- and weighted averaging.

Calibration Curve

  • Implementation:

    • binary: calibration_curve

  • Accepts probabilities: yes

  • Note: Not actually a metric, but a curve whose x-values correspond to bins of predicted probabilities (thresholds) and whose y-values correspond to the fraction of positive samples in each bin. Ideally, the curve is monotonically increasing.
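
The construction of such a curve can be sketched in a few lines of plain Python. Equal-width bins, bin centers as x-values and the function name are assumptions made for illustration; the library function may choose bins differently.

```python
def calibration_curve_sketch(y_true, y_prob, n_bins=4):
    """Return (bin centers, fraction of positives) for all non-empty bins."""
    counts = [0] * n_bins
    positives = [0] * n_bins
    for y, p in zip(y_true, y_prob):
        # Equal-width bins on [0, 1]; a probability of exactly 1.0 goes into the last bin.
        b = min(int(p * n_bins), n_bins - 1)
        counts[b] += 1
        positives[b] += y
    centers, fractions = [], []
    for b in range(n_bins):
        if counts[b] > 0:
            centers.append((b + 0.5) / n_bins)
            fractions.append(positives[b] / counts[b])
    return centers, fractions


y_true = [0, 0, 0, 1, 1, 0, 1, 1]
y_prob = [0.05, 0.15, 0.3, 0.35, 0.55, 0.6, 0.85, 0.95]

centers, fractions = calibration_curve_sketch(y_true, y_prob)
print(centers)
print(fractions)
```

In this example the fraction of positives never decreases from one bin to the next, which is the behavior one hopes to see.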

Confusion Matrix

  • Implementation:

    • binary, multiclass, multilabel: confusion_matrix

  • Accepts probabilities: no

  • Documentation: scikit-learn, Wikipedia

  • Note: Not actually a metric. Refer to Section “Confusion-Matrix Based Metrics” for how to compute classification metrics directly on confusion matrices instead of ground-truth and predictions.

  • Note: For binary- and multiclass problems, behaves like sklearn.metrics.confusion_matrix; for multilabel problems, behaves like sklearn.metrics.multilabel_confusion_matrix. The behavior can also be controlled via parameter multilabel.

Accuracy

  • Implementation:

    • binary: accuracy

    • multiclass: accuracy, accuracy_micro, accuracy_macro, accuracy_weighted

    • multilabel: accuracy, accuracy_micro, accuracy_macro, accuracy_samples, accuracy_weighted

  • Range: [0, 1]

  • Optimum: 1

  • Accepts probabilities: no

  • Documentation: scikit-learn

  • Note: Not equivalent to jaccard, although this is claimed by the scikit-learn documentation.

  • Note: accuracy is defined for multiclass and multilabel problems even without specifying an averaging policy.

  • Note: Refer to Section “Averaging” for information about micro-, macro-, samples- and weighted averaging.

Balanced Accuracy

  • Implementation:

    • binary: balanced_accuracy

    • multiclass: balanced_accuracy, balanced_accuracy_micro, balanced_accuracy_macro, balanced_accuracy_weighted

    • multilabel: balanced_accuracy_micro, balanced_accuracy_macro, balanced_accuracy_samples, balanced_accuracy_weighted

  • Range: [0, 1]

  • Optimum: 1

  • Accepts probabilities: no

  • Documentation: scikit-learn

  • Note: balanced_accuracy is defined for multiclass problems even without specifying an averaging policy.

  • Note: Refer to Section “Averaging” for information about micro-, macro-, samples- and weighted averaging.

  • Note: Closely related to informedness, which is balanced_accuracy * 2 - 1 in the binary case.

F1

  • Implementation:

    • binary: f1

    • multiclass: f1_micro, f1_macro, f1_weighted

    • multilabel: f1_micro, f1_macro, f1_samples, f1_weighted

  • Also known as: balanced F-score, F-measure

  • Range: [0, 1]

  • Optimum: 1

  • Accepts probabilities: no

  • Documentation: scikit-learn, Wikipedia

  • Note: Special case of the F-beta metric, with beta=1. Harmonic mean of sensitivity and positive predictive value.

  • Note: Refer to Section “Averaging” for information about micro-, macro-, samples- and weighted averaging.
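
The relation between F-beta, F1 and the harmonic mean can be verified with a short plain-Python sketch that works directly on the counts of true positives, false positives and false negatives. The function names are hypothetical, and the sketch assumes at least one true positive so that no denominator vanishes.

```python
def f_beta_sketch(tp, fp, fn, beta=1.0):
    ppv = tp / (tp + fp)          # positive predictive value (precision)
    sensitivity = tp / (tp + fn)  # recall
    b2 = beta ** 2
    return (1 + b2) * ppv * sensitivity / (b2 * ppv + sensitivity)


def harmonic_mean(a, b):
    return 2 * a * b / (a + b)


tp, fp, fn = 6, 2, 4
f1_value = f_beta_sketch(tp, fp, fn, beta=1.0)
harmonic = harmonic_mean(tp / (tp + fp), tp / (tp + fn))
closed_form = 2 * tp / (2 * tp + fp + fn)
print(f1_value, harmonic, closed_form)
```

All three expressions evaluate to the same number, 2/3 for the counts used here.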

Sensitivity

  • Implementation:

    • binary: sensitivity

    • multiclass: sensitivity_micro, sensitivity_macro, sensitivity_weighted

    • multilabel: sensitivity_micro, sensitivity_macro, sensitivity_samples, sensitivity_weighted

  • Also known as: recall, true positive rate, hit rate

  • Range: [0, 1]

  • Optimum: 1

  • Accepts probabilities: no

  • Documentation: scikit-learn, Wikipedia

  • Note: Refer to Section “Averaging” for information about micro-, macro-, samples- and weighted averaging.

Specificity

  • Implementation:

    • binary: specificity

    • multiclass: specificity_micro, specificity_macro, specificity_weighted

    • multilabel: specificity_micro, specificity_macro, specificity_samples, specificity_weighted

  • Also known as: selectivity, true negative rate

  • Range: [0, 1]

  • Optimum: 1

  • Accepts probabilities: no

  • Documentation: scikit-learn, Wikipedia

  • Note: Can be computed in the same way as sensitivity, by exchanging the positive and negative class.

  • Note: Refer to Section “Averaging” for information about micro-, macro-, samples- and weighted averaging.

Positive Predictive Value

  • Implementation:

    • binary: positive_predictive_value

    • multiclass: positive_predictive_value_micro, positive_predictive_value_macro, positive_predictive_value_weighted

    • multilabel: positive_predictive_value_micro, positive_predictive_value_macro, positive_predictive_value_samples, positive_predictive_value_weighted

  • Also known as: precision, PPV

  • Range: [0, 1]

  • Optimum: 1

  • Accepts probabilities: no

  • Documentation: scikit-learn, Wikipedia

  • Note: Refer to Section “Averaging” for information about micro-, macro-, samples- and weighted averaging.

Negative Predictive Value

  • Implementation:

    • binary: negative_predictive_value

    • multiclass: negative_predictive_value_micro, negative_predictive_value_macro, negative_predictive_value_weighted

    • multilabel: negative_predictive_value_micro, negative_predictive_value_macro, negative_predictive_value_samples, negative_predictive_value_weighted

  • Also known as: NPV

  • Range: [0, 1]

  • Optimum: 1

  • Accepts probabilities: no

  • Documentation: scikit-learn, Wikipedia

  • Note: Can be computed in the same way as positive predictive value, by exchanging the positive and negative class.

  • Note: Refer to Section “Averaging” for information about micro-, macro-, samples- and weighted averaging.

Cohen’s Kappa

  • Implementation:

    • binary: cohen_kappa

    • multiclass: cohen_kappa, cohen_kappa_micro, cohen_kappa_macro, cohen_kappa_weighted

    • multilabel: cohen_kappa_micro, cohen_kappa_macro, cohen_kappa_samples, cohen_kappa_weighted

  • Range: [-1, 1]

  • Optimum: 1

  • Accepts probabilities: no

  • Documentation: scikit-learn, Wikipedia

  • Note: cohen_kappa is defined for multiclass problems even without specifying an averaging policy.

  • Note: Refer to Section “Averaging” for information about micro-, macro-, samples- and weighted averaging.

Hamming Loss

  • Implementation:

    • binary: hamming_loss

    • multiclass: hamming_loss, hamming_loss_micro, hamming_loss_macro, hamming_loss_weighted

    • multilabel: hamming_loss, hamming_loss_micro, hamming_loss_macro, hamming_loss_samples, hamming_loss_weighted

  • Also known as: Hamming distance

  • Range: [0, 1]

  • Optimum: 0

  • Accepts probabilities: no

  • Documentation: scikit-learn, Wikipedia

  • Note: hamming_loss is equivalent to 1 - accuracy. For multilabel problems, the averaging policy defaults to "macro", whereas accuracy returns subset accuracy (i.e., all labels must match) by default.

  • Note: hamming_loss is defined for multiclass and multilabel problems even without specifying an averaging policy.

  • Note: Refer to Section “Averaging” for information about micro-, macro-, samples- and weighted averaging.
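
The difference between label-wise accuracy (one minus the Hamming loss) and subset accuracy in multilabel problems is illustrated by the following plain-Python sketch. It is not the library implementation; the function names are hypothetical, and the Hamming loss is computed per label and then macro-averaged, as described in the note above.

```python
def hamming_loss_macro_sketch(Y_true, Y_pred):
    # Fraction of wrong entries per label, followed by an unweighted mean over labels.
    n_samples, n_labels = len(Y_true), len(Y_true[0])
    per_label = [
        sum(Y_true[i][j] != Y_pred[i][j] for i in range(n_samples)) / n_samples
        for j in range(n_labels)
    ]
    return sum(per_label) / n_labels


def subset_accuracy_sketch(Y_true, Y_pred):
    # A sample only counts as correct if all of its labels match.
    return sum(t == p for t, p in zip(Y_true, Y_pred)) / len(Y_true)


Y_true = [[1, 0, 1], [0, 1, 0], [1, 1, 0]]
Y_pred = [[1, 0, 0], [0, 1, 0], [1, 1, 0]]

hamming = hamming_loss_macro_sketch(Y_true, Y_pred)
subset = subset_accuracy_sketch(Y_true, Y_pred)
print(hamming, 1 - hamming, subset)
```

A single wrong label out of nine gives a Hamming loss of 1/9, but it invalidates one of three samples, so subset accuracy is only 2/3.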

Jaccard Index

  • Implementation:

    • binary: jaccard

    • multiclass: jaccard_micro, jaccard_macro, jaccard_weighted

    • multilabel: jaccard_micro, jaccard_macro, jaccard_samples, jaccard_weighted

  • Also known as: Jaccard similarity coefficient, intersection over union, IoU

  • Range: [0, 1]

  • Optimum: 1

  • Accepts probabilities: no

  • Documentation: scikit-learn, Wikipedia

  • Note: Not equivalent to accuracy, although this is claimed by the scikit-learn documentation.

  • Note: Refer to Section “Averaging” for information about micro-, macro-, samples- and weighted averaging.
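
A tiny plain-Python example, using the standard definitions on hypothetical confusion-matrix counts, shows that the Jaccard index and accuracy generally differ in the binary case: the Jaccard index ignores true negatives, whereas accuracy rewards them.

```python
tp, fp, fn, tn = 2, 1, 1, 6  # hypothetical counts of a binary problem

jaccard_value = tp / (tp + fp + fn)                # true negatives play no role
accuracy_value = (tp + tn) / (tp + fp + fn + tn)   # true negatives count as correct
print(jaccard_value, accuracy_value)
```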

Matthews Correlation Coefficient

  • Implementation:

    • binary: matthews_correlation_coefficient

    • multiclass: matthews_correlation_coefficient, matthews_correlation_coefficient_micro, matthews_correlation_coefficient_macro, matthews_correlation_coefficient_weighted

    • multilabel: matthews_correlation_coefficient_micro, matthews_correlation_coefficient_macro, matthews_correlation_coefficient_samples, matthews_correlation_coefficient_weighted

  • Also known as: MCC, phi coefficient

  • Range: [-1, 1]

  • Optimum: 1

  • Accepts probabilities: no

  • Documentation: scikit-learn, Wikipedia

  • Note: matthews_correlation_coefficient is defined for multiclass problems even without specifying an averaging policy.

  • Note: Refer to Section “Averaging” for information about micro-, macro-, samples- and weighted averaging.

Informedness

  • Implementation:

    • binary: informedness

    • multiclass: informedness_micro, informedness_macro, informedness_samples, informedness_weighted

    • multilabel: informedness_micro, informedness_macro, informedness_samples, informedness_weighted

  • Also known as: Youden index, Youden’s J statistic

  • Range: [-1, 1]

  • Optimum: 1

  • Accepts probabilities: no

  • Documentation: Wikipedia

  • Note: Refer to Section “Averaging” for information about micro-, macro-, samples- and weighted averaging.

  • Note: Informedness has a natural generalization to the multiclass case, which is currently not implemented.

  • Note: Closely related to balanced_accuracy, which is (informedness + 1) / 2 in the binary case.

Markedness

  • Implementation:

    • binary: markedness

    • multiclass: markedness_micro, markedness_macro, markedness_samples, markedness_weighted

    • multilabel: markedness_micro, markedness_macro, markedness_samples, markedness_weighted

  • Also known as: deltaP

  • Range: [-1, 1]

  • Optimum: 1

  • Accepts probabilities: no

  • Documentation: Wikipedia

  • Note: Refer to Section “Averaging” for information about micro-, macro-, samples- and weighted averaging.

  • Note: Markedness has a natural generalization to the multiclass case, which is currently not implemented.

True Positives

  • Accepts probabilities: no

  • Note: Not actually a metric, but total number of true positives, i.e., correctly predicted positive samples. (1,1)-th entry of confusion_matrix.

True Negatives

  • Accepts probabilities: no

  • Note: Not actually a metric, but total number of true negatives, i.e., correctly predicted negative samples. (0,0)-th entry of confusion_matrix.

False Positives

  • Accepts probabilities: no

  • Note: Not actually a metric, but total number of false positives, i.e., negative samples wrongly predicted as positive. (0,1)-th entry of confusion_matrix.

False Negatives

  • Accepts probabilities: no

  • Note: Not actually a metric, but total number of false negatives, i.e., positive samples wrongly predicted as negative. (1,0)-th entry of confusion_matrix.
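
The positions of the four counts in a binary confusion matrix, and a few metrics derived from them, are illustrated by the following plain-Python sketch. It follows the scikit-learn convention of rows for true classes and columns for predicted classes; the function name is hypothetical.

```python
def confusion_matrix_sketch(y_true, y_pred):
    # Entry (i, j) counts samples of true class i that were predicted as class j.
    cm = [[0, 0], [0, 0]]
    for t, p in zip(y_true, y_pred):
        cm[t][p] += 1
    return cm


y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]

cm = confusion_matrix_sketch(y_true, y_pred)
tn, fp = cm[0]  # (0,0) and (0,1)
fn, tp = cm[1]  # (1,0) and (1,1)

sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)
ppv = tp / (tp + fp)
npv = tn / (tn + fn)

balanced_accuracy = (sensitivity + specificity) / 2
informedness = sensitivity + specificity - 1
markedness = ppv + npv - 1
print(cm, balanced_accuracy, informedness, markedness)
```

The printed values confirm the relation informedness = balanced_accuracy * 2 - 1 mentioned in the notes on balanced accuracy and informedness.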

Balance Score

  • Implementation:

    • binary: balance_score

  • Accepts probabilities: yes

  • Note: Equal to sensitivity at decision threshold balance_threshold, which by definition is (approximately) equal to specificity at that threshold. Moreover, it can be shown to be (approximately) equal to accuracy and balanced accuracy at that threshold, too.

Prevalence Score

  • Implementation:

    • binary: prevalence_score

  • Accepts probabilities: yes

  • Note: Equal to sensitivity at decision threshold prevalence_threshold, which can be shown to be (approximately) equal to positive predictive value and F1-score at that threshold.
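
The approximate equalities claimed for the balance score and the prevalence score follow from short calculations. The derivation below is a sketch under idealized assumptions: P and N denote the numbers of positive and negative samples, and the respective threshold is assumed to satisfy its defining condition exactly rather than approximately.

```latex
% Balance threshold: sensitivity = specificity = s
\begin{align*}
\text{accuracy} &= \frac{TP + TN}{P + N} = \frac{sP + sN}{P + N} = s, \\
\text{balanced accuracy} &= \frac{\text{sensitivity} + \text{specificity}}{2} = \frac{s + s}{2} = s.
\end{align*}

% Prevalence threshold: TP + FP = TP + FN, hence FP = FN
\begin{align*}
\text{PPV} &= \frac{TP}{TP + FP} = \frac{TP}{TP + FN} = \text{sensitivity}, \\
F_1 &= \frac{2\,TP}{2\,TP + FP + FN} = \frac{2\,TP}{2\,TP + 2\,FN} = \text{sensitivity}.
\end{align*}
```

In practice only finitely many thresholds are available, so the conditions hold approximately and so do the equalities.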

Built-in Classification Thresholding Strategies

Balance Threshold

  • Implementation: balance_threshold

  • Usage in --threshold command-line argument: "balance"

  • Definition: The decision threshold which minimizes the difference between sensitivity and specificity.

Prevalence Threshold

  • Implementation: prevalence_threshold

  • Usage in --threshold command-line argument: "prevalence"

  • Definition: The decision threshold which minimizes the difference between the total number of condition positive samples and the number of predicted positive samples.

(0,1)-Threshold

  • Implementation: zero_one_threshold

  • Usage in --threshold command-line argument: "zero_one", "zero_one(<specificity_weight>)"

  • Definition: The decision threshold which minimizes the Euclidean distance between (0, 1) and (1 - specificity, sensitivity).

Argmax Threshold

  • Implementation: argmax_threshold

  • Usage in --threshold command-line argument: "argmax <metric_name>"

  • Definition: The decision threshold which maximizes a given metric.
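
The four definitions above can be made concrete with a plain-Python sketch. It is not the library implementation: the function names, the use of the observed scores as candidate thresholds, the convention that a sample is predicted positive if its score is greater than or equal to the threshold, and the tie-breaking are all assumptions. The optional specificity weight of the (0,1)-threshold is not modeled.

```python
import math


def confusion_counts(y_true, y_score, threshold):
    # Assumed convention: predict positive when score >= threshold.
    tp = fp = tn = fn = 0
    for y, s in zip(y_true, y_score):
        pred = 1 if s >= threshold else 0
        if y == 1 and pred == 1:
            tp += 1
        elif y == 0 and pred == 1:
            fp += 1
        elif y == 0 and pred == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def sensitivity(tp, fp, tn, fn):
    return tp / (tp + fn)


def specificity(tp, fp, tn, fn):
    return tn / (tn + fp)


def f1(tp, fp, tn, fn):
    return 2 * tp / (2 * tp + fp + fn)


def select_threshold(y_true, y_score, objective):
    """Return the candidate threshold with the smallest objective value."""
    candidates = sorted(set(y_score))
    return min(candidates, key=lambda t: objective(*confusion_counts(y_true, y_score, t)))


def balance_objective(tp, fp, tn, fn):
    return abs(sensitivity(tp, fp, tn, fn) - specificity(tp, fp, tn, fn))


def prevalence_objective(tp, fp, tn, fn):
    # Condition positives (tp + fn) versus predicted positives (tp + fp)
    return abs((tp + fn) - (tp + fp))


def zero_one_objective(tp, fp, tn, fn):
    # Euclidean distance between (0, 1) and (1 - specificity, sensitivity)
    return math.hypot(1 - specificity(tp, fp, tn, fn), 1 - sensitivity(tp, fp, tn, fn))


def argmax_objective(metric):
    # Maximizing a metric is the same as minimizing its negative.
    return lambda tp, fp, tn, fn: -metric(tp, fp, tn, fn)


y_true = [0, 0, 1, 0, 1, 1, 0, 1]
y_score = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]

balance_t = select_threshold(y_true, y_score, balance_objective)
prevalence_t = select_threshold(y_true, y_score, prevalence_objective)
zero_one_t = select_threshold(y_true, y_score, zero_one_objective)
argmax_f1_t = select_threshold(y_true, y_score, argmax_objective(f1))
print(balance_t, prevalence_t, zero_one_t, argmax_f1_t)
```

On this toy data the first three strategies agree on the same threshold, while maximizing F1 selects a lower one that trades specificity for sensitivity.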

Averaging

Regression

If multiple regression targets are specified, regression metrics are computed for each target individually and for all targets combined. The combined value is obtained by calling the corresponding functions on the ground-truth and prediction matrices, relying on scikit-learn’s built-in policy for handling multiple targets. In most cases (though not necessarily all), this amounts to taking the unweighted mean of the per-target metrics.
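
As an illustration, the following plain-Python sketch computes the mean absolute error for two targets separately and then combines them by an unweighted mean, which corresponds to scikit-learn's default multioutput="uniform_average" policy. The function names and the data are hypothetical.

```python
def mean_absolute_error_sketch(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def column(matrix, j):
    return [row[j] for row in matrix]


# Rows are samples, columns are regression targets.
y_true = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
y_pred = [[1.5, 12.0], [2.0, 18.0], [2.5, 33.0]]

per_target = [
    mean_absolute_error_sketch(column(y_true, j), column(y_pred, j))
    for j in range(len(y_true[0]))
]
combined = sum(per_target) / len(per_target)  # unweighted mean over targets
print(per_target, combined)
```

Note that targets on larger scales dominate the combined value of scale-dependent metrics such as the mean absolute error.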

Multiclass Classification

Binary classification metrics that do not naturally apply to multiclass problems, like the F1 score, can be computed per class and then averaged to obtain a single scalar value. To that end, the multiclass problem is cast as a special case of a multilabel problem in which exactly one element in each row of the multilabel indicator matrix is 1. The possible averaging policies are micro, macro, samples and weighted; see below for details. The desired averaging policy can be selected either by using a properly suffixed version of the function, like f1_micro, or by passing a suitable value for parameter average of the non-suffixed function.
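
The cast from a multiclass problem to a multilabel indicator matrix can be sketched in plain Python as follows; the function name and the class labels are hypothetical.

```python
def to_indicator_matrix(labels, classes):
    # One row per sample, one column per class; exactly one entry per row is 1.
    return [[1 if label == c else 0 for c in classes] for label in labels]


classes = ["bird", "cat", "dog"]
y = ["cat", "dog", "bird", "dog"]

indicator = to_indicator_matrix(y, classes)
print(indicator)
```

Each column of the resulting matrix defines a binary one-vs-rest problem, on which a binary metric can be evaluated before averaging.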

Multilabel Classification

Binary classification metrics can be applied to each label of a multilabel problem separately, and then averaged to obtain a single scalar value. Four averaging policies are supported by default, and can be specified either by using a properly suffixed version of the function, or via the average parameter of the original function:

  • micro: Metrics are computed globally by counting the total true positives, true negatives, false positives and false negatives across all classes.

  • macro: Unweighted mean of per-class metric values.

  • samples: Unweighted mean of per-sample metric values; only makes sense for multilabel tasks.

  • weighted: Weighted mean of per-class metric values, with weights corresponding to the number of instances of each class.

In addition, passing average=None returns the metric value for each label separately, in an array of shape (n_labels,).
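
The averaging policies are easiest to compare on a small example. The following plain-Python sketch applies them to the positive predictive value of a multilabel problem with four samples and three labels. It is a minimal illustration with hypothetical function names, not the library implementation, and it defines the metric as 0 when there are no predicted positives.

```python
def ppv(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return tp / (tp + fp) if tp + fp > 0 else 0.0


def column(matrix, j):
    return [row[j] for row in matrix]


def averaged_ppv(Y_true, Y_pred, average):
    n_labels = len(Y_true[0])
    if average == "micro":
        # Pool the counts of all labels by flattening the matrices.
        flat_true = [v for row in Y_true for v in row]
        flat_pred = [v for row in Y_pred for v in row]
        return ppv(flat_true, flat_pred)
    if average == "samples":
        # One metric value per sample (row), then an unweighted mean.
        scores = [ppv(t, p) for t, p in zip(Y_true, Y_pred)]
        return sum(scores) / len(scores)
    per_label = [ppv(column(Y_true, j), column(Y_pred, j)) for j in range(n_labels)]
    if average is None:
        return per_label
    if average == "macro":
        return sum(per_label) / n_labels
    if average == "weighted":
        # Weights are the numbers of positive instances of each label.
        support = [sum(column(Y_true, j)) for j in range(n_labels)]
        return sum(s * v for s, v in zip(support, per_label)) / sum(support)
    raise ValueError(f"unknown averaging policy: {average!r}")


Y_true = [[1, 0, 1], [1, 1, 0], [0, 1, 0], [1, 0, 0]]
Y_pred = [[1, 0, 0], [1, 0, 0], [1, 1, 1], [1, 1, 0]]

results = {avg: averaged_ppv(Y_true, Y_pred, avg) for avg in ("micro", "macro", "samples", "weighted")}
per_label = averaged_ppv(Y_true, Y_pred, None)
print(results, per_label)
```

All four policies give different values on this data, which shows that the choice of averaging policy can matter considerably.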

Note: Some metrics, most prominently accuracy and balanced_accuracy, are defined for multiclass/multilabel problems even without averaging. Nevertheless, metrics.xlsx reports the averaged versions.

Confusion-Matrix Based Metrics

Every classification metric that operates on class predictions (e.g., accuracy or sensitivity) has a corresponding variant that operates directly on confusion matrices, suffixed with _cm. This comes in handy when several such metrics are to be computed and the number of samples is huge: simply compute the confusion matrix once, and then compute the desired metrics on the (small) confusion matrix.

The following statements hold for any metric metric and its confusion-matrix based variant metric_cm:

  • metric_cm generally accepts the same keyword arguments as metric. The only notable exception is sample_weight, which has to be taken into account when constructing the confusion matrix.

  • metric_cm generally accepts the same averaging policies as metric, with only two exceptions:

    • average="samples" is not supported by metric_cm, simply because there is no sample dimension anymore.

    • average="global" is not supported by accuracy_cm in multilabel problems.

    Furthermore, metric_cm and metric use the same default averaging policy.

  • For all y_true and y_hat: metric(y_true, y_hat, sample_weight=sw, **kwargs) is equal to metric_cm(cm=confusion_matrix(y_true, y_hat, sample_weight=sw), **kwargs), unless metric_cm is not defined due to one of the reasons listed above.
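
The idea can be illustrated with a plain-Python sketch for a three-class problem. The functions below are hypothetical stand-ins for the library's metric and metric_cm pairs and only mirror the principle: sample weights enter when the confusion matrix is built, and every metric is then computed from the matrix alone.

```python
def confusion_matrix_sketch(y_true, y_pred, n_classes, sample_weight=None):
    # Entry (i, j) accumulates the weight of samples of true class i predicted as class j.
    if sample_weight is None:
        sample_weight = [1.0] * len(y_true)
    cm = [[0.0] * n_classes for _ in range(n_classes)]
    for t, p, w in zip(y_true, y_pred, sample_weight):
        cm[t][p] += w
    return cm


def accuracy_sketch(y_true, y_pred, sample_weight):
    correct = sum(w for t, p, w in zip(y_true, y_pred, sample_weight) if t == p)
    return correct / sum(sample_weight)


def accuracy_cm_sketch(cm):
    # Weight on the diagonal divided by the total weight.
    return sum(cm[k][k] for k in range(len(cm))) / sum(sum(row) for row in cm)


def sensitivity_macro_cm_sketch(cm):
    # Per-class sensitivity is the diagonal entry divided by the row sum.
    per_class = [cm[k][k] / sum(cm[k]) for k in range(len(cm))]
    return sum(per_class) / len(per_class)


y_true = [0, 1, 2, 2, 1, 0]
y_pred = [0, 2, 2, 2, 1, 1]
sw = [1.0, 2.0, 1.0, 1.0, 0.5, 0.5]

cm = confusion_matrix_sketch(y_true, y_pred, n_classes=3, sample_weight=sw)  # computed once
acc_from_labels = accuracy_sketch(y_true, y_pred, sw)
acc_from_cm = accuracy_cm_sketch(cm)
sens_macro = sensitivity_macro_cm_sketch(cm)
print(acc_from_labels, acc_from_cm, sens_macro)
```

Both accuracy values coincide, and further metrics such as the macro-averaged sensitivity require no additional pass over the samples.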

Calculating Metrics from Raw Predictions

By default, CaTabRa automatically calculates suitable performance metrics when evaluating trained prediction models, and saves them to disk in files called metrics.xlsx and (optionally) bootstrapping.xlsx. These metrics can also be computed manually; all that is required are the sample-wise predictions (as saved in predictions.xlsx) and the corresponding data encoder, which can be obtained from a catabra.util.io.CaTabRaLoader object:

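from catabra.util import io
from catabra import evaluation

# Open the CaTabRa output directory to get access to the data encoder.
loader = io.CaTabRaLoader("CaTabRa_dir")

# Compute the metrics from the saved sample-wise predictions.
# The `...` placeholders must be replaced by the desired bootstrapping settings.
metrics, bootstrapping = evaluation.calc_metrics(
    "predictions.xlsx",
    loader.get_encoder(),
    bootstrapping_repetitions=...,
    bootstrapping_metrics=...
)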