Metrics
Built-in Regression Metrics
This section lists all built-in regression metrics that are implemented in the catabra.util.metrics module.
R²
Implementation:
r2
Also known as: coefficient of determination, R squared
Range: (-inf, 1]
Optimum: 1
Documentation: scikit-learn, Wikipedia
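To make the definition concrete, R² is one minus the ratio of the residual to the total sum of squares. The sketch below computes it by hand on made-up toy values, using plain Python rather than the library:

```python
# Toy illustration of R-squared, computed directly from its definition.
y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5, 0.0, 2.0, 8.0]

mean_y = sum(y_true) / len(y_true)
ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual sum of squares
ss_tot = sum((t - mean_y) ** 2 for t in y_true)             # total sum of squares
r2 = 1 - ss_res / ss_tot
print(round(r2, 4))  # 0.9486
```

A constant prediction equal to the mean of y_true would yield R² = 0, which is why the range extends below zero for models worse than that baseline.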
Mean Absolute Error
Implementation:
mean_absolute_error
Also known as: MAE
Range: [0, inf)
Optimum: 0
Documentation: scikit-learn, Wikipedia
Mean Squared Error
Implementation:
mean_squared_error
Also known as: MSE
Range: [0, inf)
Optimum: 0
Documentation: scikit-learn, Wikipedia
Note: Equivalent to mean_tweedie_deviance with power=0.
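As a quick sanity check of the definition, MSE can be computed by hand on toy values:

```python
# Toy illustration of mean squared error, computed directly from its definition.
y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5, 0.0, 2.0, 8.0]
mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
print(mse)  # 0.375
```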
Root Mean Squared Error
Implementation:
root_mean_squared_error
Also known as: RMSE, root-mean-square deviation, RMSD
Range: [0, inf)
Optimum: 0
Documentation: scikit-learn, Wikipedia
Mean Squared Logarithmic Error
Implementation:
mean_squared_log_error
Also known as: MSLE
Range: [0, inf)
Optimum: 0
Documentation: scikit-learn
Note: Only defined for non-negative inputs.
Median Absolute Error
Implementation:
median_absolute_error
Also known as: MedAE
Range: [0, inf)
Optimum: 0
Documentation: scikit-learn
Mean Absolute Percentage Error
Implementation:
mean_absolute_percentage_error
Also known as: MAPE, mean absolute percentage deviation, MAPD
Range: [0, inf)
Optimum: 0
Documentation: scikit-learn
Max Error
Implementation:
max_error
Also known as: maximum residual error
Range: [0, inf)
Optimum: 0
Documentation: scikit-learn
Explained Variance
Implementation:
explained_variance
Also known as: explained variance regression score
Range: (-inf, 1]
Optimum: 1
Documentation: scikit-learn
Mean Poisson Deviance
Implementation:
mean_poisson_deviance
Range: [0, inf)
Optimum: 0
Documentation: scikit-learn
Note: Equivalent to mean_tweedie_deviance with power=1.
Mean Gamma Deviance
Implementation:
mean_gamma_deviance
Range: [0, inf)
Optimum: 0
Documentation: scikit-learn
Note: Equivalent to mean_tweedie_deviance with power=2.
Mean Tweedie Deviance
Implementation:
mean_tweedie_deviance
Range: [0, inf)
Optimum: 0
Documentation: scikit-learn
Note: Parameter power is the Tweedie power parameter. With power=0 this metric is equivalent to mean_squared_error, with power=1 it is equivalent to mean_poisson_deviance, and with power=2 it is equivalent to mean_gamma_deviance.
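The three special cases can be illustrated with the standard unit-deviance formulas. The function name below is made up for illustration and is not part of the library; it only covers the powers mentioned above:

```python
import math

def mean_tweedie_deviance_sketch(y_true, y_pred, power=0):
    """Hypothetical sketch of the mean Tweedie deviance for powers 0, 1, 2."""
    if power == 0:    # equivalent to mean squared error
        dev = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    elif power == 1:  # equivalent to mean Poisson deviance (requires y_true > 0 here)
        dev = [2 * (t * math.log(t / p) - t + p) for t, p in zip(y_true, y_pred)]
    elif power == 2:  # equivalent to mean gamma deviance (requires positive values)
        dev = [2 * (math.log(p / t) + t / p - 1) for t, p in zip(y_true, y_pred)]
    else:
        raise ValueError("only powers 0, 1 and 2 are sketched here")
    return sum(dev) / len(dev)

print(mean_tweedie_deviance_sketch([1.0, 2.0], [1.5, 1.5], power=0))  # 0.25
```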
Built-in Classification Metrics
This section lists all built-in classification metrics that are implemented in the catabra.util.metrics module.
Area under Receiver Operating Characteristic Curve
Implementation:
binary: roc_auc
multiclass: roc_auc_ovr, roc_auc_ovr_weighted, roc_auc_ovo, roc_auc_ovo_weighted
multilabel: roc_auc_micro, roc_auc_macro, roc_auc_samples, roc_auc_weighted
Also known as: ROC-AUC
Range: [0, 1]
Optimum: 1
Accepts probabilities: yes
Documentation: scikit-learn, Wikipedia
Note: roc_auc_ovr and roc_auc_ovo return macro-averaged values by default.
Note: Refer to Section “Averaging” for information about micro-, macro-, samples- and weighted averaging.
Average Precision
Implementation:
binary: average_precision
multilabel: average_precision_micro, average_precision_macro, average_precision_samples, average_precision_weighted
Also known as: AP, mean average precision (mAP) in case of average_precision_macro
Range: [0, 1]
Optimum: 1
Accepts probabilities: yes
Documentation: scikit-learn, Wikipedia
Note: Not equivalent to pr_auc, but similar.
Note: Refer to Section “Averaging” for information about micro-, macro-, samples- and weighted averaging.
Area under Precision-Recall Curve
Implementation:
binary: pr_auc
multilabel: pr_auc_micro, pr_auc_macro, pr_auc_samples, pr_auc_weighted
Also known as: PR-AUC
Range: [0, 1]
Optimum: 1
Accepts probabilities: yes
Documentation: scikit-learn
Note: Not equivalent to average_precision, but similar.
Note: Refer to Section “Averaging” for information about micro-, macro-, samples- and weighted averaging.
Brier Score
Implementation:
binary: brier_loss
multilabel: brier_loss_micro, brier_loss_macro, brier_loss_samples, brier_loss_weighted
Range: [0, 1]
Optimum: 0
Accepts probabilities: yes
Documentation: scikit-learn, Wikipedia
Note: Refer to Section “Averaging” for information about micro-, macro-, samples- and weighted averaging.
Hinge Loss
Implementation:
binary: hinge_loss
multilabel: hinge_loss_micro, hinge_loss_macro, hinge_loss_samples, hinge_loss_weighted
Range: [0, inf)
Optimum: 0
Accepts probabilities: yes
Documentation: scikit-learn, Wikipedia
Note: Refer to Section “Averaging” for information about micro-, macro-, samples- and weighted averaging.
Log Loss
Implementation:
binary: log_loss
multilabel: log_loss_micro, log_loss_macro, log_loss_samples, log_loss_weighted
Also known as: logistic loss, cross-entropy loss
Range: [0, inf)
Optimum: 0
Accepts probabilities: yes
Documentation: scikit-learn, Wikipedia
Note: Refer to Section “Averaging” for information about micro-, macro-, samples- and weighted averaging.
Calibration Curve
Implementation:
binary: calibration_curve
Accepts probabilities: yes
Note: Not actually a metric, but a curve whose x-values correspond to threshold-bins and whose y-values correspond to the fraction of positive samples in each bin. Ideally, the curve should be monotonically increasing.
Confusion Matrix
Implementation:
binary & multiclass: confusion_matrix
Accepts probabilities: no
Documentation: scikit-learn, Wikipedia
Note: Not actually a metric.
Accuracy
Implementation:
binary: accuracy
multiclass: accuracy, accuracy_micro, accuracy_macro, accuracy_weighted
multilabel: accuracy, accuracy_micro, accuracy_macro, accuracy_samples, accuracy_weighted
Range: [0, 1]
Optimum: 1
Accepts probabilities: no
Documentation: scikit-learn
Note: Not equivalent to jaccard, although this is claimed by the scikit-learn documentation.
Note: accuracy is defined for multiclass and multilabel problems even without specifying an averaging policy.
Note: Refer to Section “Averaging” for information about micro-, macro-, samples- and weighted averaging.
Balanced Accuracy
Implementation:
binary: balanced_accuracy
multiclass: balanced_accuracy_micro, balanced_accuracy_macro, balanced_accuracy_weighted
multilabel: balanced_accuracy_micro, balanced_accuracy_macro, balanced_accuracy_samples, balanced_accuracy_weighted
Range: [0, 1]
Optimum: 1
Accepts probabilities: no
Documentation: scikit-learn
Note: Refer to Section “Averaging” for information about micro-, macro-, samples- and weighted averaging.
Note: Closely related to informedness, which is balanced_accuracy * 2 - 1 in the binary case.
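The stated relationship between balanced accuracy and informedness can be checked on made-up binary confusion counts:

```python
# Made-up binary confusion counts (true/false positives and negatives).
tp, fn, tn, fp = 40, 10, 30, 20

sensitivity = tp / (tp + fn)   # recall of the positive class
specificity = tn / (tn + fp)   # recall of the negative class
balanced_accuracy = (sensitivity + specificity) / 2
informedness = sensitivity + specificity - 1

# Informedness is exactly balanced_accuracy * 2 - 1 in the binary case.
assert abs(informedness - (balanced_accuracy * 2 - 1)) < 1e-12
```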
F1
Implementation:
binary: f1
multiclass: f1_micro, f1_macro, f1_weighted
multilabel: f1_micro, f1_macro, f1_samples, f1_weighted
Also known as: balanced F-score, F-measure
Range: [0, 1]
Optimum: 1
Accepts probabilities: no
Documentation: scikit-learn, Wikipedia
Note: Special case of the F-beta metric, with beta=1. Harmonic mean of sensitivity and positive predictive value.
Note: Refer to Section “Averaging” for information about micro-, macro-, samples- and weighted averaging.
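The harmonic-mean characterization can be verified directly on toy confusion counts:

```python
# Toy counts: F1 as the harmonic mean of precision (PPV) and recall (sensitivity).
tp, fp, fn = 6, 2, 2
precision = tp / (tp + fp)   # 0.75
recall = tp / (tp + fn)      # 0.75
f1 = 2 * precision * recall / (precision + recall)
print(f1)  # 0.75
```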
Sensitivity
Implementation:
binary: sensitivity
multiclass: sensitivity_micro, sensitivity_macro, sensitivity_weighted
multilabel: sensitivity_micro, sensitivity_macro, sensitivity_samples, sensitivity_weighted
Also known as: recall, true positive rate, hit rate
Range: [0, 1]
Optimum: 1
Accepts probabilities: no
Documentation: scikit-learn, Wikipedia
Note: Refer to Section “Averaging” for information about micro-, macro-, samples- and weighted averaging.
Specificity
Implementation:
binary: specificity
multiclass: specificity_micro, specificity_macro, specificity_weighted
multilabel: specificity_micro, specificity_macro, specificity_samples, specificity_weighted
Also known as: selectivity, true negative rate
Range: [0, 1]
Optimum: 1
Accepts probabilities: no
Documentation: scikit-learn, Wikipedia
Note: Can be computed in the same way as sensitivity, by exchanging the positive and negative class.
Note: Refer to Section “Averaging” for information about micro-, macro-, samples- and weighted averaging.
Positive Predictive Value
Implementation:
binary: positive_predictive_value
multiclass: positive_predictive_value_micro, positive_predictive_value_macro, positive_predictive_value_weighted
multilabel: positive_predictive_value_micro, positive_predictive_value_macro, positive_predictive_value_samples, positive_predictive_value_weighted
Also known as: precision, PPV
Range: [0, 1]
Optimum: 1
Accepts probabilities: no
Documentation: scikit-learn, Wikipedia
Note: Refer to Section “Averaging” for information about micro-, macro-, samples- and weighted averaging.
Negative Predictive Value
Implementation:
binary: negative_predictive_value
multiclass: negative_predictive_value_micro, negative_predictive_value_macro, negative_predictive_value_weighted
multilabel: negative_predictive_value_micro, negative_predictive_value_macro, negative_predictive_value_samples, negative_predictive_value_weighted
Also known as: NPV
Range: [0, 1]
Optimum: 1
Accepts probabilities: no
Documentation: scikit-learn, Wikipedia
Note: Can be computed in the same way as positive predictive value, by exchanging the positive and negative class.
Note: Refer to Section “Averaging” for information about micro-, macro-, samples- and weighted averaging.
Cohen’s Kappa
Implementation:
binary: cohen_kappa
multiclass: cohen_kappa, cohen_kappa_micro, cohen_kappa_macro, cohen_kappa_weighted
multilabel: cohen_kappa_micro, cohen_kappa_macro, cohen_kappa_samples, cohen_kappa_weighted
Range: [-1, 1]
Optimum: 1
Accepts probabilities: no
Documentation: scikit-learn, Wikipedia
Note: cohen_kappa is defined for multiclass problems even without specifying an averaging policy.
Note: Refer to Section “Averaging” for information about micro-, macro-, samples- and weighted averaging.
Hamming Loss
Implementation:
binary: hamming_loss
multiclass: hamming_loss, hamming_loss_micro, hamming_loss_macro, hamming_loss_weighted
multilabel: hamming_loss, hamming_loss_micro, hamming_loss_macro, hamming_loss_samples, hamming_loss_weighted
Also known as: Hamming distance
Range: [0, 1]
Optimum: 0
Accepts probabilities: no
Documentation: scikit-learn, Wikipedia
Note: hamming_loss is equivalent to 1 - accuracy.
Note: hamming_loss is defined for multiclass problems even without specifying an averaging policy.
Note: Refer to Section “Averaging” for information about micro-, macro-, samples- and weighted averaging.
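The equivalence to 1 - accuracy is easy to see: the Hamming loss is the fraction of mismatched labels, while accuracy is the fraction of matched ones. A toy check:

```python
# Toy labels: hamming loss is the fraction of mismatches, i.e. 1 - accuracy.
y_true = [1, 0, 1, 1, 0]
y_pred = [1, 1, 1, 0, 0]
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
hamming = sum(t != p for t, p in zip(y_true, y_pred)) / len(y_true)
assert abs(hamming - (1 - accuracy)) < 1e-12
print(accuracy, hamming)  # 0.6 0.4
```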
Jaccard Index
Implementation:
binary: jaccard
multiclass: jaccard_micro, jaccard_macro, jaccard_weighted
multilabel: jaccard_micro, jaccard_macro, jaccard_samples, jaccard_weighted
Also known as: Jaccard similarity coefficient, intersection over union, IoU
Range: [0, 1]
Optimum: 1
Accepts probabilities: no
Documentation: scikit-learn, Wikipedia
Note: Not equivalent to accuracy, although this is claimed by the scikit-learn documentation.
Note: Refer to Section “Averaging” for information about micro-, macro-, samples- and weighted averaging.
Matthews Correlation Coefficient
Implementation:
binary: matthews_correlation_coefficient
multiclass: matthews_correlation_coefficient, matthews_correlation_coefficient_micro, matthews_correlation_coefficient_macro, matthews_correlation_coefficient_weighted
multilabel: matthews_correlation_coefficient_micro, matthews_correlation_coefficient_macro, matthews_correlation_coefficient_samples, matthews_correlation_coefficient_weighted
Also known as: MCC, phi coefficient
Range: [-1, 1]
Optimum: 1
Accepts probabilities: no
Documentation: scikit-learn, Wikipedia
Note: matthews_correlation_coefficient is defined for multiclass problems even without specifying an averaging policy.
Note: Refer to Section “Averaging” for information about micro-, macro-, samples- and weighted averaging.
Informedness
Implementation:
binary: informedness
multiclass: informedness_micro, informedness_macro, informedness_samples, informedness_weighted
multilabel: informedness_micro, informedness_macro, informedness_samples, informedness_weighted
Also known as: Youden index, Youden’s J statistic
Range: [-1, 1]
Optimum: 1
Accepts probabilities: no
Documentation: Wikipedia
Note: Refer to Section “Averaging” for information about micro-, macro-, samples- and weighted averaging.
Note: Informedness has a natural generalization to the multiclass case, which is currently not implemented.
Note: Closely related to balanced_accuracy, which is (informedness + 1) / 2 in the binary case.
Markedness
Implementation:
binary: markedness
multiclass: markedness_micro, markedness_macro, markedness_samples, markedness_weighted
multilabel: markedness_micro, markedness_macro, markedness_samples, markedness_weighted
Also known as: deltaP
Range: [-1, 1]
Optimum: 1
Accepts probabilities: no
Documentation: Wikipedia
Note: Refer to Section “Averaging” for information about micro-, macro-, samples- and weighted averaging.
Note: Markedness has a natural generalization to the multiclass case, which is currently not implemented.
True Positives
Accepts probabilities: no
Note: Not actually a metric, but the total number of true positives, i.e., correctly predicted positive samples. (1,1)-th entry of confusion_matrix.
True Negatives
Accepts probabilities: no
Note: Not actually a metric, but the total number of true negatives, i.e., correctly predicted negative samples. (0,0)-th entry of confusion_matrix.
False Positives
Accepts probabilities: no
Note: Not actually a metric, but the total number of false positives, i.e., negative samples wrongly predicted as positive. (0,1)-th entry of confusion_matrix.
False Negatives
Accepts probabilities: no
Note: Not actually a metric, but the total number of false negatives, i.e., positive samples wrongly predicted as negative. (1,0)-th entry of confusion_matrix.
Balance Score
Implementation:
binary: balance_score
Accepts probabilities: yes
Note: Equal to sensitivity at decision threshold balance_threshold, which by definition is (approximately) equal to specificity at that threshold. Moreover, it can be shown to be (approximately) equal to accuracy and balanced accuracy at that threshold, too.
Prevalence Score
Implementation:
binary: prevalence_score
Accepts probabilities: yes
Note: Equal to sensitivity at decision threshold prevalence_threshold, which can be shown to be (approximately) equal to positive predictive value and F1-score at that threshold.
Built-in Classification Thresholding Strategies
Balance Threshold
Implementation:
balance_threshold
Usage in --threshold command-line argument: "balance"
Definition: The decision threshold which minimizes the difference between sensitivity and specificity.
Prevalence Threshold
Implementation:
prevalence_threshold
Usage in --threshold command-line argument: "prevalence"
Definition: The decision threshold which minimizes the difference between the total number of condition positive samples and the number of predicted positive samples.
(0,1)-Threshold
Implementation:
zero_one_threshold
Usage in --threshold command-line argument: "zero_one", "zero_one(<specificity_weight>)"
Definition: The decision threshold which minimizes the Euclidean distance between (0, 1) and (1 - specificity, sensitivity).
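This definition can be sketched in a few lines by scanning candidate thresholds and picking the one whose ROC point lies closest to the top-left corner. The function name is made up for illustration and ignores the optional specificity weight:

```python
import math

def zero_one_threshold_sketch(y_true, y_score):
    """Hypothetical sketch: threshold whose (FPR, TPR) point is closest to (0, 1)."""
    best_t, best_d = None, float("inf")
    for t in sorted(set(y_score)):
        pred = [1 if s >= t else 0 for s in y_score]
        tp = sum(p == 1 and y == 1 for p, y in zip(pred, y_true))
        fp = sum(p == 1 and y == 0 for p, y in zip(pred, y_true))
        fn = sum(p == 0 and y == 1 for p, y in zip(pred, y_true))
        tn = sum(p == 0 and y == 0 for p, y in zip(pred, y_true))
        tpr = tp / (tp + fn) if tp + fn else 0.0          # sensitivity
        fpr = fp / (fp + tn) if fp + tn else 0.0          # 1 - specificity
        d = math.hypot(fpr, 1 - tpr)                      # distance to (0, 1)
        if d < best_d:
            best_t, best_d = t, d
    return best_t

print(zero_one_threshold_sketch([0, 0, 1, 1], [0.1, 0.4, 0.6, 0.9]))  # 0.6
```

For a perfectly separating classifier, as in the toy data above, the chosen threshold reaches the corner (0, 1) exactly.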
Argmax Threshold
Implementation:
argmax_threshold
Usage in --threshold command-line argument: "argmax <metric_name>"
Definition: The decision threshold which maximizes a given metric.
Averaging
Regression
If multiple regression targets are specified, regression metrics are computed for each target individually and for all targets combined. The latter simply calls the corresponding functions on the ground-truth and prediction matrices and relies on scikit-learn’s built-in policy to handle such cases. Normally (though not necessarily always), this proceeds by simply taking the unweighted mean of the individual metrics.
Multiclass Classification
Binary classification metrics that do not naturally apply to multiclass problems, like the F1 score, can be computed
per-class and then averaged to obtain a single scalar value. To that end, the multiclass problem is cast as a special
case of a multilabel problem where always exactly one element of the multilabel indicator matrix is 1. The possible
averaging policies are micro, macro, samples and weighted; see below for details. The desired averaging
policy can be selected either by using a properly suffixed version of the function, like f1_micro
, or by passing a
suitable value for parameter average
of the non-suffixed function.
Multilabel Classification
Binary classification metrics can be applied to each class of a multilabel problem separately, and then averaged to obtain a single scalar value. The following averaging policies are supported by default, and can be specified either by using a properly suffixed version of the function, or via the average parameter of the original function:
micro: Metrics are computed globally by counting the total true positives, true negatives, false positives and false negatives across all classes.
macro: Unweighted mean of per-class metric values.
samples: Unweighted mean of per-sample metric values; only makes sense for multilabel tasks.
weighted: Weighted mean of per-class metric values, with weights corresponding to the number of instances of each class.
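The difference between the micro and macro policies can be seen on hypothetical per-class confusion counts: macro averages the per-class values, whereas micro pools the counts first, so it is dominated by frequent classes:

```python
# Hypothetical per-class true-positive / false-negative counts for 3 classes,
# illustrating macro- vs micro-averaged sensitivity (recall).
tp = [8, 1, 1]
fn = [2, 1, 9]

# macro: unweighted mean of per-class recalls (0.8, 0.5, 0.1)
macro = sum(t / (t + f) for t, f in zip(tp, fn)) / len(tp)
# micro: recall computed from the pooled counts across all classes
micro = sum(tp) / (sum(tp) + sum(fn))
print(round(macro, 4), round(micro, 4))  # 0.4667 0.4545
```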
Note: Some metrics, like accuracy, are defined for multilabel problems even without averaging. The values reported in metrics.xlsx are still the averaged versions, though.
Calculating Metrics from Raw Predictions
By default, CaTabRa automatically calculates suitable performance metrics when evaluating trained prediction models,
and saves them to disk in files called metrics.xlsx and (optionally) bootstrapping.xlsx. These metrics can easily be
computed manually as well; all that is required are the sample-wise predictions (as saved in predictions.xlsx) and the corresponding data encoder, which can easily be obtained from a catabra.util.io.CaTabRaLoader object:
from catabra.util import io
from catabra import evaluation

loader = io.CaTabRaLoader("CaTabRa_dir")
metrics, bootstrapping = evaluation.calc_metrics(
    "predictions.xlsx",
    loader.get_encoder(),
    bootstrapping_repetitions=...,
    bootstrapping_metrics=...,
)