Metrics

Built-in Regression Metrics

This section lists all built-in regression metrics that are implemented in the catabra.util.metrics module.

R²

  • Implementation: r2

  • Also known as: coefficient of determination, R squared

  • Range: (-inf, 1]

  • Optimum: 1

  • Documentation: scikit-learn, Wikipedia

Mean Absolute Error

  • Implementation: mean_absolute_error

  • Also known as: MAE

  • Range: [0, inf)

  • Optimum: 0

  • Documentation: scikit-learn, Wikipedia

Mean Squared Error

  • Implementation: mean_squared_error

  • Also known as: MSE

  • Range: [0, inf)

  • Optimum: 0

  • Documentation: scikit-learn, Wikipedia

  • Note: Equivalent to mean_tweedie_deviance with power=0.

Root Mean Squared Error

  • Implementation: root_mean_squared_error

  • Also known as: RMSE, root-mean-square deviation, RMSD

  • Range: [0, inf)

  • Optimum: 0

  • Documentation: scikit-learn, Wikipedia

Mean Squared Logarithmic Error

  • Implementation: mean_squared_log_error

  • Also known as: MSLE

  • Range: [0, inf)

  • Optimum: 0

  • Documentation: scikit-learn

  • Note: Only defined for non-negative inputs.

Median Absolute Error

  • Implementation: median_absolute_error

  • Also known as: MedAE

  • Range: [0, inf)

  • Optimum: 0

  • Documentation: scikit-learn

Mean Absolute Percentage Error

  • Implementation: mean_absolute_percentage_error

  • Also known as: MAPE, mean absolute percentage deviation, MAPD

  • Range: [0, inf)

  • Optimum: 0

  • Documentation: scikit-learn

Max Error

  • Implementation: max_error

  • Also known as: maximum residual error

  • Range: [0, inf)

  • Optimum: 0

  • Documentation: scikit-learn

Explained Variance

  • Implementation: explained_variance

  • Also known as: explained variance regression score

  • Range: (-inf, 1]

  • Optimum: 1

  • Documentation: scikit-learn

Mean Poisson Deviance

  • Implementation: mean_poisson_deviance

  • Range: [0, inf)

  • Optimum: 0

  • Documentation: scikit-learn

  • Note: Equivalent to mean_tweedie_deviance with power=1.

Mean Gamma Deviance

  • Implementation: mean_gamma_deviance

  • Range: [0, inf)

  • Optimum: 0

  • Documentation: scikit-learn

  • Note: Equivalent to mean_tweedie_deviance with power=2.

Mean Tweedie Deviance

  • Implementation: mean_tweedie_deviance

  • Range: [0, inf)

  • Optimum: 0

  • Documentation: scikit-learn

  • Note: Parameter power is the Tweedie power parameter. With power=0 this metric is equivalent to mean_squared_error, with power=1 it is equivalent to mean_poisson_deviance, and with power=2 it is equivalent to mean_gamma_deviance.
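
To illustrate these equivalences, here is a minimal sketch using scikit-learn's implementations of the same metrics; the data values are arbitrary:

import numpy as np
from sklearn.metrics import (
    mean_tweedie_deviance,
    mean_squared_error,
    mean_poisson_deviance,
    mean_gamma_deviance,
)

y_true = np.array([1.0, 2.5, 4.0, 7.5])
y_pred = np.array([1.2, 2.0, 4.5, 7.0])

# power=0 reproduces the mean squared error ...
assert np.isclose(mean_tweedie_deviance(y_true, y_pred, power=0),
                  mean_squared_error(y_true, y_pred))
# ... power=1 the mean Poisson deviance ...
assert np.isclose(mean_tweedie_deviance(y_true, y_pred, power=1),
                  mean_poisson_deviance(y_true, y_pred))
# ... and power=2 the mean gamma deviance.
assert np.isclose(mean_tweedie_deviance(y_true, y_pred, power=2),
                  mean_gamma_deviance(y_true, y_pred))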

Built-in Classification Metrics

This section lists all built-in classification metrics that are implemented in the catabra.util.metrics module.

Area under Receiver Operating Characteristic Curve

  • Implementation:

    • binary: roc_auc

    • multiclass: roc_auc_ovr, roc_auc_ovr_weighted, roc_auc_ovo, roc_auc_ovo_weighted

    • multilabel: roc_auc_micro, roc_auc_macro, roc_auc_samples, roc_auc_weighted

  • Also known as: ROC-AUC

  • Range: [0, 1]

  • Optimum: 1

  • Accepts probabilities: yes

  • Documentation: scikit-learn, Wikipedia

  • Note: roc_auc_ovr and roc_auc_ovo return macro-averaged values by default.

  • Note: Refer to Section “Averaging” for information about micro-, macro-, samples- and weighted averaging.
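
As a quick usage illustration, the following sketch assumes that the functions in catabra.util.metrics follow scikit-learn's (y_true, y_score) calling convention; the data values are made up:

import numpy as np
from catabra.util.metrics import roc_auc, roc_auc_ovr

# binary case: y_score holds probabilities of the positive class
y_true = np.array([0, 1, 1, 0, 1])
y_score = np.array([0.1, 0.8, 0.6, 0.4, 0.9])
print(roc_auc(y_true, y_score))

# multiclass case: y_score holds one probability column per class
y_true_mc = np.array([0, 2, 1, 2])
y_score_mc = np.array([[0.7, 0.2, 0.1],
                       [0.1, 0.3, 0.6],
                       [0.2, 0.5, 0.3],
                       [0.1, 0.1, 0.8]])
print(roc_auc_ovr(y_true_mc, y_score_mc))  # macro-averaged by default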

Average Precision

  • Implementation:

    • binary: average_precision

    • multilabel: average_precision_micro, average_precision_macro, average_precision_samples, average_precision_weighted

  • Also known as: AP; average_precision_macro is also known as mean average precision (mAP)

  • Range: [0, 1]

  • Optimum: 1

  • Accepts probabilities: yes

  • Documentation: scikit-learn, Wikipedia

  • Note: Not equivalent to pr_auc, but similar.

  • Note: Refer to Section “Averaging” for information about micro-, macro-, samples- and weighted averaging.

Area under Precision-Recall Curve

  • Implementation:

    • binary: pr_auc

    • multilabel: pr_auc_micro, pr_auc_macro, pr_auc_samples, pr_auc_weighted

  • Also known as: PR-AUC

  • Range: [0, 1]

  • Optimum: 1

  • Accepts probabilities: yes

  • Documentation: scikit-learn

  • Note: Not equivalent to average_precision, but similar.

  • Note: Refer to Section “Averaging” for information about micro-, macro-, samples- and weighted averaging.

Brier Score

  • Implementation:

    • binary: brier_loss

    • multilabel: brier_loss_micro, brier_loss_macro, brier_loss_samples, brier_loss_weighted

  • Range: [0, 1]

  • Optimum: 0

  • Accepts probabilities: yes

  • Documentation: scikit-learn, Wikipedia

  • Note: Refer to Section “Averaging” for information about micro-, macro-, samples- and weighted averaging.

Hinge Loss

  • Implementation:

    • binary: hinge_loss

    • multilabel: hinge_loss_micro, hinge_loss_macro, hinge_loss_samples, hinge_loss_weighted

  • Range: [0, inf)

  • Optimum: 0

  • Accepts probabilities: yes

  • Documentation: scikit-learn, Wikipedia

  • Note: Refer to Section “Averaging” for information about micro-, macro-, samples- and weighted averaging.

Log Loss

  • Implementation:

    • binary: log_loss

    • multilabel: log_loss_micro, log_loss_macro, log_loss_samples, log_loss_weighted

  • Also known as: logistic loss, cross-entropy loss

  • Range: [0, inf)

  • Optimum: 0

  • Accepts probabilities: yes

  • Documentation: scikit-learn, Wikipedia

  • Note: Refer to Section “Averaging” for information about micro-, macro-, samples- and weighted averaging.

Calibration Curve

  • Implementation:

    • binary: calibration_curve

  • Accepts probabilities: yes

  • Note: Not actually a metric, but a curve whose x-values correspond to bins of predicted probabilities and whose y-values are the fraction of positive samples in each bin. Ideally, the curve is monotonically increasing.
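
The same idea is implemented by scikit-learn's calibration_curve, shown in the sketch below; catabra's calibration_curve may differ in signature and binning details:

import numpy as np
from sklearn.calibration import calibration_curve

y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0, 1, 1])
y_prob = np.array([0.1, 0.3, 0.7, 0.8, 0.6, 0.2, 0.9, 0.4, 0.75, 0.55])

# fraction_positive[i] is the fraction of positive samples among all samples
# whose predicted probability falls into the i-th bin
fraction_positive, mean_predicted = calibration_curve(y_true, y_prob, n_bins=5)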

Confusion Matrix

  • Implementation:

    • binary & multiclass: confusion_matrix

  • Accepts probabilities: no

  • Documentation: scikit-learn, Wikipedia

  • Note: Not actually a metric.

Accuracy

  • Implementation:

    • binary: accuracy

    • multiclass: accuracy, accuracy_micro, accuracy_macro, accuracy_weighted

    • multilabel: accuracy, accuracy_micro, accuracy_macro, accuracy_samples, accuracy_weighted

  • Range: [0, 1]

  • Optimum: 1

  • Accepts probabilities: no

  • Documentation: scikit-learn

  • Note: Not equivalent to jaccard, although this is claimed by the scikit-learn documentation.

  • Note: accuracy is defined for multiclass and multilabel problems even without specifying an averaging policy.

  • Note: Refer to Section “Averaging” for information about micro-, macro-, samples- and weighted averaging.

Balanced Accuracy

  • Implementation:

    • binary: balanced_accuracy

    • multiclass: balanced_accuracy_micro, balanced_accuracy_macro, balanced_accuracy_weighted

    • multilabel: balanced_accuracy_micro, balanced_accuracy_macro, balanced_accuracy_samples, balanced_accuracy_weighted

  • Range: [0, 1]

  • Optimum: 1

  • Accepts probabilities: no

  • Documentation: scikit-learn

  • Note: Refer to Section “Averaging” for information about micro-, macro-, samples- and weighted averaging.

  • Note: Closely related to informedness, which is balanced_accuracy * 2 - 1 in the binary case.

F1

  • Implementation:

    • binary: f1

    • multiclass: f1_micro, f1_macro, f1_weighted

    • multilabel: f1_micro, f1_macro, f1_samples, f1_weighted

  • Also known as: balanced F-score, F-measure

  • Range: [0, 1]

  • Optimum: 1

  • Accepts probabilities: no

  • Documentation: scikit-learn, Wikipedia

  • Note: Special case of the F-beta metric, with beta=1. Harmonic mean of sensitivity and positive predictive value.

  • Note: Refer to Section “Averaging” for information about micro-, macro-, samples- and weighted averaging.

Sensitivity

  • Implementation:

    • binary: sensitivity

    • multiclass: sensitivity_micro, sensitivity_macro, sensitivity_weighted

    • multilabel: sensitivity_micro, sensitivity_macro, sensitivity_samples, sensitivity_weighted

  • Also known as: recall, true positive rate, hit rate

  • Range: [0, 1]

  • Optimum: 1

  • Accepts probabilities: no

  • Documentation: scikit-learn, Wikipedia

  • Note: Refer to Section “Averaging” for information about micro-, macro-, samples- and weighted averaging.

Specificity

  • Implementation:

    • binary: specificity

    • multiclass: specificity_micro, specificity_macro, specificity_weighted

    • multilabel: specificity_micro, specificity_macro, specificity_samples, specificity_weighted

  • Also known as: selectivity, true negative rate

  • Range: [0, 1]

  • Optimum: 1

  • Accepts probabilities: no

  • Documentation: scikit-learn, Wikipedia

  • Note: Can be computed in the same way as sensitivity, by exchanging the positive and negative class; see the sketch after this list.

  • Note: Refer to Section “Averaging” for information about micro-, macro-, samples- and weighted averaging.
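
A minimal sketch of the class-swapping trick mentioned above, using scikit-learn's recall_score for binary 0/1 labels:

import numpy as np
from sklearn.metrics import recall_score

y_true = np.array([0, 0, 1, 1, 0, 1])
y_pred = np.array([0, 1, 1, 1, 0, 0])

sensitivity = recall_score(y_true, y_pred)               # recall of the positive class
specificity = recall_score(y_true, y_pred, pos_label=0)  # recall of the negative class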

Positive Predictive Value

  • Implementation:

    • binary: positive_predictive_value

    • multiclass: positive_predictive_value_micro, positive_predictive_value_macro, positive_predictive_value_weighted

    • multilabel: positive_predictive_value_micro, positive_predictive_value_macro, positive_predictive_value_samples, positive_predictive_value_weighted

  • Also known as: precision, PPV

  • Range: [0, 1]

  • Optimum: 1

  • Accepts probabilities: no

  • Documentation: scikit-learn, Wikipedia

  • Note: Refer to Section “Averaging” for information about micro-, macro-, samples- and weighted averaging.

Negative Predictive Value

  • Implementation:

    • binary: negative_predictive_value

    • multiclass: negative_predictive_value_micro, negative_predictive_value_macro, negative_predictive_value_weighted

    • multilabel: negative_predictive_value_micro, negative_predictive_value_macro, negative_predictive_value_samples, negative_predictive_value_weighted

  • Also known as: NPV

  • Range: [0, 1]

  • Optimum: 1

  • Accepts probabilities: no

  • Documentation: scikit-learn, Wikipedia

  • Note: Can be computed in the same way as positive predictive value, by exchanging the positive and negative class.

  • Note: Refer to Section “Averaging” for information about micro-, macro-, samples- and weighted averaging.

Cohen’s Kappa

  • Implementation:

    • binary: cohen_kappa

    • multiclass: cohen_kappa, cohen_kappa_micro, cohen_kappa_macro, cohen_kappa_weighted

    • multilabel: cohen_kappa_micro, cohen_kappa_macro, cohen_kappa_samples, cohen_kappa_weighted

  • Range: [-1, 1]

  • Optimum: 1

  • Accepts probabilities: no

  • Documentation: scikit-learn, Wikipedia

  • Note: cohen_kappa is defined for multiclass problems even without specifying an averaging policy.

  • Note: Refer to Section “Averaging” for information about micro-, macro-, samples- and weighted averaging.

Hamming Loss

  • Implementation:

    • binary: hamming_loss

    • multiclass: hamming_loss, hamming_loss_micro, hamming_loss_macro, hamming_loss_weighted

    • multilabel: hamming_loss, hamming_loss_micro, hamming_loss_macro, hamming_loss_samples, hamming_loss_weighted

  • Also known as: Hamming distance

  • Range: [0, 1]

  • Optimum: 0

  • Accepts probabilities: no

  • Documentation: scikit-learn, Wikipedia

  • Note: hamming_loss is equivalent to 1 - accuracy; see the sketch after this list.

  • Note: hamming_loss is defined for multiclass problems even without specifying an averaging policy.

  • Note: Refer to Section “Averaging” for information about micro-, macro-, samples- and weighted averaging.
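
A quick numeric check of the equivalence noted above for a single-label problem, using scikit-learn's implementations:

import numpy as np
from sklearn.metrics import hamming_loss, accuracy_score

y_true = np.array([0, 1, 2, 1, 0, 2])
y_pred = np.array([0, 2, 2, 1, 0, 1])

assert np.isclose(hamming_loss(y_true, y_pred), 1.0 - accuracy_score(y_true, y_pred))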

Jaccard Index

  • Implementation:

    • binary: jaccard

    • multiclass: jaccard_micro, jaccard_macro, jaccard_weighted

    • multilabel: jaccard_micro, jaccard_macro, jaccard_samples, jaccard_weighted

  • Also known as: Jaccard similarity coefficient, intersection over union, IoU

  • Range: [0, 1]

  • Optimum: 1

  • Accepts probabilities: no

  • Documentation: scikit-learn, Wikipedia

  • Note: Not equivalent to accuracy, although this is claimed by the scikit-learn documentation.

  • Note: Refer to Section “Averaging” for information about micro-, macro-, samples- and weighted averaging.

Matthews Correlation Coefficient

  • Implementation:

    • binary: matthews_correlation_coefficient

    • multiclass: matthews_correlation_coefficient, matthews_correlation_coefficient_micro, matthews_correlation_coefficient_macro, matthews_correlation_coefficient_weighted

    • multilabel: matthews_correlation_coefficient_micro, matthews_correlation_coefficient_macro, matthews_correlation_coefficient_samples, matthews_correlation_coefficient_weighted

  • Also known as: MCC, phi coefficient

  • Range: [-1, 1]

  • Optimum: 1

  • Accepts probabilities: no

  • Documentation: scikit-learn, Wikipedia

  • Note: matthews_correlation_coefficient is defined for multiclass problems even without specifying an averaging policy.

  • Note: Refer to Section “Averaging” for information about micro-, macro-, samples- and weighted averaging.

Informedness

  • Implementation:

    • binary: informedness

    • multiclass: informedness_micro, informedness_macro, informedness_weighted

    • multilabel: informedness_micro, informedness_macro, informedness_samples, informedness_weighted

  • Also known as: Youden index, Youden’s J statistic

  • Range: [-1, 1]

  • Optimum: 1

  • Accepts probabilities: no

  • Documentation: Wikipedia

  • Note: Refer to Section “Averaging” for information about micro-, macro-, samples- and weighted averaging.

  • Note: Informedness has a natural generalization to the multiclass case, which is currently not implemented.

  • Note: Closely related to balanced_accuracy, which is (informedness + 1) / 2 in the binary case.
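
A minimal sketch verifying the binary-case relation informedness = sensitivity + specificity - 1 = 2 * balanced_accuracy - 1, using scikit-learn:

import numpy as np
from sklearn.metrics import recall_score, balanced_accuracy_score

y_true = np.array([0, 0, 1, 1, 0, 1])
y_pred = np.array([0, 1, 1, 1, 0, 0])

sensitivity = recall_score(y_true, y_pred)               # recall of class 1
specificity = recall_score(y_true, y_pred, pos_label=0)  # recall of class 0
informedness = sensitivity + specificity - 1.0
assert np.isclose(informedness, 2.0 * balanced_accuracy_score(y_true, y_pred) - 1.0)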

Markedness

  • Implementation:

    • binary: markedness

    • multiclass: markedness_micro, markedness_macro, markedness_weighted

    • multilabel: markedness_micro, markedness_macro, markedness_samples, markedness_weighted

  • Also known as: deltaP

  • Range: [-1, 1]

  • Optimum: 1

  • Accepts probabilities: no

  • Documentation: Wikipedia

  • Note: Refer to Section “Averaging” for information about micro-, macro-, samples- and weighted averaging.

  • Note: Markedness has a natural generalization to the multiclass case, which is currently not implemented.

True Positives

  • Accepts probabilities: no

  • Note: Not actually a metric, but the total number of true positives, i.e., correctly predicted positive samples; the (1,1) entry of confusion_matrix.

True Negatives

  • Accepts probabilities: no

  • Note: Not actually a metric, but the total number of true negatives, i.e., correctly predicted negative samples; the (0,0) entry of confusion_matrix.

False Positives

  • Accepts probabilities: no

  • Note: Not actually a metric, but the total number of false positives, i.e., negative samples wrongly predicted as positive; the (0,1) entry of confusion_matrix.

False Negatives

  • Accepts probabilities: no

  • Note: Not actually a metric, but the total number of false negatives, i.e., positive samples wrongly predicted as negative; the (1,0) entry of confusion_matrix.
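
For binary 0/1 labels, these four counts can be read off scikit-learn's confusion_matrix exactly as the entry positions above suggest; a minimal sketch:

import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([0, 0, 1, 1, 0, 1])
y_pred = np.array([0, 1, 1, 1, 0, 0])

# rows are true classes, columns are predicted classes
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()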

Balance Score

  • Implementation:

    • binary: balance_score

  • Accepts probabilities: yes

  • Note: Equal to the sensitivity at the decision threshold returned by balance_threshold. By definition, this value is (approximately) equal to the specificity at that threshold, and it can be shown to be (approximately) equal to accuracy and balanced accuracy at that threshold, too.

Prevalence Score

  • Implementation:

    • binary: prevalence_score

  • Accepts probabilities: yes

  • Note: Equal to the sensitivity at the decision threshold returned by prevalence_threshold. This value can be shown to be (approximately) equal to the positive predictive value and the F1 score at that threshold.

Built-in Classification Thresholding Strategies

Balance Threshold

  • Implementation: balance_threshold

  • Usage in --threshold command-line argument: "balance"

  • Definition: The decision threshold which minimizes the difference between sensitivity and specificity.

Prevalence Threshold

  • Implementation: prevalence_threshold

  • Usage in --threshold command-line argument: "prevalence"

  • Definition: The decision threshold which minimizes the difference between the total number of condition positive samples and the number of predicted positive samples.

(0,1)-Threshold

  • Implementation: zero_one_threshold

  • Usage in --threshold command-line argument: "zero_one", "zero_one(<specificity_weight>)"

  • Definition: The decision threshold which minimizes the Euclidean distance between (0, 1) and (1 - specificity, sensitivity).

Argmax Threshold

  • Implementation: argmax_threshold

  • Usage in --threshold command-line argument: "argmax <metric_name>"

  • Definition: The decision threshold which maximizes a given metric.
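
The following sketch shows how such strategies can be realized with scikit-learn's roc_curve. It is an illustration only; catabra's built-in implementations may differ in details such as tie-breaking and the specificity_weight parameter of zero_one_threshold:

import numpy as np
from sklearn.metrics import roc_curve

def balance_threshold_sketch(y_true, y_score):
    # threshold where sensitivity (tpr) and specificity (1 - fpr) are closest
    fpr, tpr, thresholds = roc_curve(y_true, y_score)
    return thresholds[np.argmin(np.abs(tpr - (1.0 - fpr)))]

def zero_one_threshold_sketch(y_true, y_score):
    # threshold whose ROC point (1 - specificity, sensitivity) = (fpr, tpr)
    # is closest to the ideal point (0, 1) in Euclidean distance
    fpr, tpr, thresholds = roc_curve(y_true, y_score)
    return thresholds[np.argmin(np.hypot(fpr, 1.0 - tpr))]

def argmax_threshold_sketch(y_true, y_score, metric):
    # threshold maximizing a given metric of hard (thresholded) predictions
    thresholds = np.unique(y_score)
    return max(thresholds, key=lambda t: metric(y_true, y_score >= t))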

Averaging

Regression

If multiple regression targets are specified, regression metrics are computed both for each target individually and for all targets combined. The combined value is obtained by calling the corresponding function on the full ground-truth and prediction matrices, relying on scikit-learn's built-in policy for handling multi-output input. Typically (though not always) this amounts to taking the unweighted mean of the per-target values, as the sketch below illustrates.
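
A minimal sketch of this policy with scikit-learn's r2_score, whose default multioutput='uniform_average' takes the unweighted mean of the per-target values:

import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 31.0]])
y_pred = np.array([[1.1, 11.0], [1.9, 19.0], [3.2, 30.0]])

per_target = r2_score(y_true, y_pred, multioutput='raw_values')  # one value per target
combined = r2_score(y_true, y_pred)  # default: 'uniform_average'
assert np.isclose(combined, per_target.mean())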

Multiclass Classification

Binary classification metrics that do not naturally apply to multiclass problems, like the F1 score, can be computed per class and then averaged to obtain a single scalar value. To that end, the multiclass problem is cast as a special case of a multilabel problem in which exactly one element of each row of the multilabel indicator matrix is 1. The possible averaging policies are micro, macro and weighted; see below for details (samples averaging only makes sense for genuine multilabel problems). The desired averaging policy can be selected either by using a suitably suffixed version of the function, like f1_micro, or by passing an appropriate value for the average parameter of the non-suffixed function, as the sketch below illustrates.
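
A minimal sketch of the two selection mechanisms, assuming (as described above) that the suffixed functions are equivalent to passing the corresponding average value to the base function:

import numpy as np
from catabra.util.metrics import f1, f1_micro

y_true = np.array([0, 1, 2, 1, 0])
y_pred = np.array([0, 2, 2, 1, 0])

# the suffixed variant ...
a = f1_micro(y_true, y_pred)
# ... and the average parameter of the non-suffixed function
b = f1(y_true, y_pred, average='micro')
assert np.isclose(a, b)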

Multilabel Classification

Binary classification metrics can be applied to each class of a multilabel problem separately, and then averaged to obtain a single scalar value. Four averaging policies are supported by default, and can be specified either by using a properly suffixed version of the function or via the average parameter of the original function:

  • micro: Metrics are computed globally by counting the total true positives, true negatives, false positives and false negatives across all classes.

  • macro: Unweighted mean of per-class metric values.

  • samples: Unweighted mean of per-sample metric values; only makes sense for multilabel tasks.

  • weighted: Weighted mean of per-class metric values, with weights corresponding to the number of instances of each class.

Note: Some metrics, like accuracy, are defined for multilabel problems even without averaging; metrics.xlsx nevertheless reports the averaged versions.
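
To make the macro policy concrete, the following sketch computes a multilabel macro-averaged F1 score by hand and compares it with scikit-learn's implementation:

import numpy as np
from sklearn.metrics import f1_score

# multilabel indicator matrices with one column per class
y_true = np.array([[1, 0, 1], [0, 1, 1], [1, 1, 0], [0, 0, 1]])
y_pred = np.array([[1, 0, 0], [0, 1, 1], [1, 0, 0], [0, 1, 1]])

# macro averaging: compute the metric per class, then take the unweighted mean
per_class = [f1_score(y_true[:, c], y_pred[:, c]) for c in range(y_true.shape[1])]
assert np.isclose(f1_score(y_true, y_pred, average='macro'), np.mean(per_class))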

Calculating Metrics from Raw Predictions

By default, CaTabRa automatically calculates suitable performance metrics when evaluating trained prediction models, and saves them to disk in files called metrics.xlsx and (optionally) bootstrapping.xlsx. These metrics can easily be computed manually as well; all that is required are the sample-wise predictions (as saved in predictions.xlsx) and the corresponding data encoder, which can be obtained from a catabra.util.io.CaTabRaLoader object:

from catabra.util import io
from catabra import evaluation

# load the encoder of the trained CaTabRa object from its directory
loader = io.CaTabRaLoader("CaTabRa_dir")

# recompute the contents of metrics.xlsx (and, if bootstrapping is enabled,
# bootstrapping.xlsx) from the saved sample-wise predictions
metrics, bootstrapping = evaluation.calc_metrics(
    "predictions.xlsx",
    loader.get_encoder(),
    bootstrapping_repetitions=...,  # number of bootstrapping repetitions
    bootstrapping_metrics=...       # metrics to compute in each repetition
)