Metrics
This section lists all built-in metrics that are available through the
catabra-lib.metrics module.
Built-in Regression Metrics
R²
Implementation: r2
Also known as: coefficient of determination, R squared
Range: (-inf, 1]
Optimum: 1
Documentation: scikit-learn, Wikipedia
Mean Absolute Error
Implementation: mean_absolute_error
Also known as: MAE
Range: [0, inf)
Optimum: 0
Documentation: scikit-learn, Wikipedia
Mean Squared Error
Implementation: mean_squared_error
Also known as: MSE
Range: [0, inf)
Optimum: 0
Documentation: scikit-learn, Wikipedia
Note: Equivalent to mean_tweedie_deviance with power=0.
Root Mean Squared Error
Implementation: root_mean_squared_error
Also known as: RMSE, root-mean-square deviation, RMSD
Range: [0, inf)
Optimum: 0
Documentation: scikit-learn, Wikipedia
Mean Squared Logarithmic Error
Implementation: mean_squared_log_error
Also known as: MSLE
Range: [0, inf)
Optimum: 0
Documentation: scikit-learn
Note: Only defined for non-negative inputs.
Median Absolute Error
Implementation: median_absolute_error
Also known as: MedAE
Range: [0, inf)
Optimum: 0
Documentation: scikit-learn
Mean Absolute Percentage Error
Implementation: mean_absolute_percentage_error
Also known as: MAPE, mean absolute percentage deviation, MAPD
Range: [0, inf)
Optimum: 0
Documentation: scikit-learn
Max Error
Implementation: max_error
Also known as: maximum residual error
Range: [0, inf)
Optimum: 0
Documentation: scikit-learn
Explained Variance
Implementation: explained_variance
Also known as: explained variance regression score
Range: (-inf, 1]
Optimum: 1
Documentation: scikit-learn
Mean Poisson Deviance
Implementation: mean_poisson_deviance
Range: [0, inf)
Optimum: 0
Documentation: scikit-learn
Note: Equivalent to mean_tweedie_deviance with power=1.
Mean Gamma Deviance
Implementation: mean_gamma_deviance
Range: [0, inf)
Optimum: 0
Documentation: scikit-learn
Note: Equivalent to mean_tweedie_deviance with power=2.
Mean Tweedie Deviance
Implementation: mean_tweedie_deviance
Range: [0, inf)
Optimum: 0
Documentation: scikit-learn
Note: Parameter power is the Tweedie power parameter. With power=0 this metric is equivalent to mean_squared_error, with power=1 it is equivalent to mean_poisson_deviance, and with power=2 it is equivalent to mean_gamma_deviance.
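The equivalences stated in the note can be checked with a small pure-Python sketch of the Tweedie unit deviance. This is only an illustration of the textbook formula, not the code behind mean_tweedie_deviance. The helper below is a hypothetical stand-in and assumes strictly positive targets and predictions.

```python
import math

def mean_tweedie_deviance(y_true, y_pred, power=0.0):
    """Illustrative re-implementation of the mean Tweedie deviance (not library code)."""
    def unit_deviance(y, mu):
        if power == 0:
            # Squared error.
            return (y - mu) ** 2
        if power == 1:
            # Poisson deviance; y * log(y / mu) is taken as 0 when y == 0.
            log_term = y * math.log(y / mu) if y > 0 else 0.0
            return 2 * (log_term - y + mu)
        if power == 2:
            # Gamma deviance.
            return 2 * (math.log(mu / y) + y / mu - 1)
        # General form for all other powers.
        return 2 * (
            max(y, 0) ** (2 - power) / ((1 - power) * (2 - power))
            - y * mu ** (1 - power) / (1 - power)
            + mu ** (2 - power) / (2 - power)
        )
    return sum(unit_deviance(y, mu) for y, mu in zip(y_true, y_pred)) / len(y_true)

y_true = [2.0, 0.5, 1.0, 4.0]
y_pred = [1.5, 0.5, 2.0, 3.0]

# Plain mean squared error for comparison: (0.25 + 0 + 1 + 1) / 4 = 0.5625.
mse = sum((y - mu) ** 2 for y, mu in zip(y_true, y_pred)) / len(y_true)
tweedie_0 = mean_tweedie_deviance(y_true, y_pred, power=0)  # same value as mse
```

The special cases power=1 and power=2 are the limits of the general expression, which is why they have to be written out separately: the general form divides by (1 - power) and (2 - power).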
Built-in Classification Metrics
Area under Receiver Operating Characteristic Curve
Implementation:
binary: roc_auc
multiclass: roc_auc_ovr, roc_auc_ovr_weighted, roc_auc_ovo, roc_auc_ovo_weighted
multilabel: roc_auc_micro, roc_auc_macro, roc_auc_samples, roc_auc_weighted
Also known as: ROC-AUC
Range: [0, 1]
Optimum: 1
Accepts probabilities: yes
Documentation: scikit-learn, Wikipedia
Note: roc_auc_ovr and roc_auc_ovo return macro-averaged values by default.
Note: Refer to Section “Averaging” for information about micro-, macro-, samples- and weighted averaging.
Average Precision
Implementation:
binary: average_precision
multilabel: average_precision_micro, average_precision_macro, average_precision_samples, average_precision_weighted
Also known as: AP, mean average precision (mAP) in case of average_precision_macro
Range: [0, 1]
Optimum: 1
Accepts probabilities: yes
Documentation: scikit-learn, Wikipedia
Note: Not equivalent to pr_auc, but similar.
Note: Refer to Section “Averaging” for information about micro-, macro-, samples- and weighted averaging.
Area under Precision-Recall Curve
Implementation:
binary: pr_auc
multilabel: pr_auc_micro, pr_auc_macro, pr_auc_samples, pr_auc_weighted
Also known as: PR-AUC
Range: [0, 1]
Optimum: 1
Accepts probabilities: yes
Documentation: scikit-learn
Note: Not equivalent to average_precision, but similar.
Note: Refer to Section “Averaging” for information about micro-, macro-, samples- and weighted averaging.
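Why average precision and the area under the precision-recall curve are similar but not equal can be seen in a small pure-Python sketch. Average precision is computed here as the step-wise sum used by scikit-learn, where each precision value is weighted by the increase in recall. That pr_auc uses trapezoidal integration is an assumption made for this illustration. All helper names are hypothetical.

```python
def precision_recall_points(y_true, y_score):
    """(recall, precision) pairs, one per distinct score, ordered by decreasing threshold."""
    n_pos = sum(y_true)
    points = []
    for t in sorted(set(y_score), reverse=True):
        tp = sum(1 for y, s in zip(y_true, y_score) if s >= t and y == 1)
        fp = sum(1 for y, s in zip(y_true, y_score) if s >= t and y == 0)
        points.append((tp / n_pos, tp / (tp + fp)))
    return points

def average_precision_sketch(points):
    """Step-wise sum: precision weighted by the increase in recall."""
    ap, prev_recall = 0.0, 0.0
    for recall, precision in points:
        ap += (recall - prev_recall) * precision
        prev_recall = recall
    return ap

def pr_auc_trapezoid_sketch(points):
    """Trapezoidal area, starting at the conventional point (recall=0, precision=1)."""
    area, prev_recall, prev_precision = 0.0, 0.0, 1.0
    for recall, precision in points:
        area += (recall - prev_recall) * (precision + prev_precision) / 2
        prev_recall, prev_precision = recall, precision
    return area

y_true = [1, 0, 1, 1, 0, 0]
y_score = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]

points = precision_recall_points(y_true, y_score)
ap = average_precision_sketch(points)          # 29/36, roughly 0.806
pr_auc = pr_auc_trapezoid_sketch(points)       # 55/72, roughly 0.764
```

The step-wise sum does not interpolate between operating points, whereas the trapezoidal rule does, so the two values differ whenever precision changes between consecutive recall levels.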
Brier Score
Implementation:
binary: brier_loss
multilabel: brier_loss_micro, brier_loss_macro, brier_loss_samples, brier_loss_weighted
Range: [0, 1]
Optimum: 0
Accepts probabilities: yes
Documentation: scikit-learn, Wikipedia
Note: Refer to Section “Averaging” for information about micro-, macro-, samples- and weighted averaging.
Hinge Loss
Implementation:
binary & multiclass: hinge_loss
multilabel: hinge_loss_micro, hinge_loss_macro, hinge_loss_samples, hinge_loss_weighted
Range: [0, inf)
Optimum: 0
Accepts probabilities: yes
Documentation: scikit-learn, Wikipedia
Note: Refer to Section “Averaging” for information about micro-, macro-, samples- and weighted averaging.
Log Loss
Implementation:
binary & multiclass: log_loss
multilabel: log_loss_micro, log_loss_macro, log_loss_samples, log_loss_weighted
Also known as: logistic loss, cross-entropy loss
Range: [0, inf)
Optimum: 0
Accepts probabilities: yes
Documentation: scikit-learn, Wikipedia
Note: Refer to Section “Averaging” for information about micro-, macro-, samples- and weighted averaging.
Calibration Curve
Implementation:
binary: calibration_curve
Accepts probabilities: yes
Note: Not actually a metric, but a curve whose x-values correspond to threshold-bins and whose y-values correspond to the fraction of positive samples in each bin. Ideally, the curve should be monotonically increasing.
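The idea behind such a curve can be sketched in a few lines of pure Python. The sketch assumes equal-width probability bins and reports bin centers as x-values. The binning actually used by calibration_curve may differ, and the function name below is hypothetical.

```python
def calibration_curve_sketch(y_true, y_prob, n_bins=5):
    """Return (bin_centers, positive_fractions) for equal-width bins; empty bins are skipped."""
    counts = [0] * n_bins
    positives = [0] * n_bins
    for y, p in zip(y_true, y_prob):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls into the last bin
        counts[idx] += 1
        positives[idx] += y
    centers, fractions = [], []
    for i in range(n_bins):
        if counts[i] > 0:
            centers.append((i + 0.5) / n_bins)
            fractions.append(positives[i] / counts[i])
    return centers, fractions

y_true = [0, 0, 1, 0, 1, 1, 1, 1]
y_prob = [0.05, 0.15, 0.3, 0.35, 0.55, 0.7, 0.85, 0.95]

centers, fractions = calibration_curve_sketch(y_true, y_prob)
# fractions == [0.0, 0.5, 1.0, 1.0, 1.0], i.e. monotonically increasing
```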
Confusion Matrix
Implementation:
binary, multiclass, multilabel: confusion_matrix
Accepts probabilities: no
Documentation: scikit-learn, Wikipedia
Note: Not actually a metric. Refer to Section “Confusion-Matrix Based Metrics” for how to compute classification metrics directly on confusion matrices instead of ground-truth and predictions.
Note: For binary- and multiclass problems, behaves like sklearn.metrics.confusion_matrix; for multilabel problems, behaves like sklearn.metrics.multilabel_confusion_matrix. The behavior can also be controlled via parameter multilabel.
Accuracy
Implementation:
binary: accuracy
multiclass: accuracy, accuracy_micro, accuracy_macro, accuracy_weighted
multilabel: accuracy, accuracy_micro, accuracy_macro, accuracy_samples, accuracy_weighted
Range: [0, 1]
Optimum: 1
Accepts probabilities: no
Documentation: scikit-learn
Note: Not equivalent to jaccard, although this is claimed by the scikit-learn documentation.
Note: accuracy is defined for multiclass and multilabel problems even without specifying an averaging policy.
Note: Refer to Section “Averaging” for information about micro-, macro-, samples- and weighted averaging.
Balanced Accuracy
Implementation:
binary: balanced_accuracy
multiclass: balanced_accuracy, balanced_accuracy_micro, balanced_accuracy_macro, balanced_accuracy_weighted
multilabel: balanced_accuracy_micro, balanced_accuracy_macro, balanced_accuracy_samples, balanced_accuracy_weighted
Range: [0, 1]
Optimum: 1
Accepts probabilities: no
Documentation: scikit-learn
Note: balanced_accuracy is defined for multiclass problems even without specifying an averaging policy.
Note: Refer to Section “Averaging” for information about micro-, macro-, samples- and weighted averaging.
Note: Closely related to informedness, which is balanced_accuracy * 2 - 1 in the binary case.
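The relation between balanced accuracy and informedness in the binary case follows directly from their definitions in terms of sensitivity and specificity. A minimal sketch with made-up confusion-matrix counts (plain Python, not library code):

```python
# Hypothetical counts of a binary classifier.
tp, tn, fp, fn = 40, 45, 5, 10

sensitivity = tp / (tp + fn)   # 0.8
specificity = tn / (tn + fp)   # 0.9

balanced_accuracy = (sensitivity + specificity) / 2   # mean of the two rates
informedness = sensitivity + specificity - 1          # Youden's J statistic

# Up to floating-point rounding, informedness == 2 * balanced_accuracy - 1.
difference = informedness - (2 * balanced_accuracy - 1)
```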
F1
Implementation:
binary: f1
multiclass: f1_micro, f1_macro, f1_weighted
multilabel: f1_micro, f1_macro, f1_samples, f1_weighted
Also known as: balanced F-score, F-measure
Range: [0, 1]
Optimum: 1
Accepts probabilities: no
Documentation: scikit-learn, Wikipedia
Note: Special case of the F-beta metric, with beta=1. Harmonic mean of sensitivity and positive predictive value.
Note: Refer to Section “Averaging” for information about micro-, macro-, samples- and weighted averaging.
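As a small illustration of the note on F-beta, the following pure-Python sketch computes the F-beta score from positive predictive value and sensitivity and confirms that beta=1 yields their harmonic mean. The helper f_beta and the counts are hypothetical and only serve this example.

```python
def f_beta(precision, recall, beta=1.0):
    """F-beta score from positive predictive value (precision) and sensitivity (recall)."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Hypothetical counts of a binary classifier.
tp, fp, fn = 40, 5, 10
ppv = tp / (tp + fp)
sensitivity = tp / (tp + fn)

f1 = f_beta(ppv, sensitivity)                       # 80/95, roughly 0.842
harmonic_mean = 2 / (1 / ppv + 1 / sensitivity)     # same value
```

Values of beta above 1 weight sensitivity more heavily, and values below 1 weight positive predictive value more heavily.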
Sensitivity
Implementation:
binary: sensitivity
multiclass: sensitivity_micro, sensitivity_macro, sensitivity_weighted
multilabel: sensitivity_micro, sensitivity_macro, sensitivity_samples, sensitivity_weighted
Also known as: recall, true positive rate, hit rate
Range: [0, 1]
Optimum: 1
Accepts probabilities: no
Documentation: scikit-learn, Wikipedia
Note: Refer to Section “Averaging” for information about micro-, macro-, samples- and weighted averaging.
Specificity
Implementation:
binary: specificity
multiclass: specificity_micro, specificity_macro, specificity_weighted
multilabel: specificity_micro, specificity_macro, specificity_samples, specificity_weighted
Also known as: selectivity, true negative rate
Range: [0, 1]
Optimum: 1
Accepts probabilities: no
Documentation: scikit-learn, Wikipedia
Note: Can be computed in the same way as sensitivity, by exchanging the positive and negative class.
Note: Refer to Section “Averaging” for information about micro-, macro-, samples- and weighted averaging.
Positive Predictive Value
Implementation:
binary: positive_predictive_value
multiclass: positive_predictive_value_micro, positive_predictive_value_macro, positive_predictive_value_weighted
multilabel: positive_predictive_value_micro, positive_predictive_value_macro, positive_predictive_value_samples, positive_predictive_value_weighted
Also known as: precision, PPV
Range: [0, 1]
Optimum: 1
Accepts probabilities: no
Documentation: scikit-learn, Wikipedia
Note: Refer to Section “Averaging” for information about micro-, macro-, samples- and weighted averaging.
Negative Predictive Value
Implementation:
binary: negative_predictive_value
multiclass: negative_predictive_value_micro, negative_predictive_value_macro, negative_predictive_value_weighted
multilabel: negative_predictive_value_micro, negative_predictive_value_macro, negative_predictive_value_samples, negative_predictive_value_weighted
Also known as: NPV
Range: [0, 1]
Optimum: 1
Accepts probabilities: no
Documentation: scikit-learn, Wikipedia
Note: Can be computed in the same way as positive predictive value, by exchanging the positive and negative class.
Note: Refer to Section “Averaging” for information about micro-, macro-, samples- and weighted averaging.
Cohen’s Kappa
Implementation:
binary: cohen_kappa
multiclass: cohen_kappa, cohen_kappa_micro, cohen_kappa_macro, cohen_kappa_weighted
multilabel: cohen_kappa_micro, cohen_kappa_macro, cohen_kappa_samples, cohen_kappa_weighted
Range: [-1, 1]
Optimum: 1
Accepts probabilities: no
Documentation: scikit-learn, Wikipedia
Note: cohen_kappa is defined for multiclass problems even without specifying an averaging policy.
Note: Refer to Section “Averaging” for information about micro-, macro-, samples- and weighted averaging.
Hamming Loss
Implementation:
binary: hamming_loss
multiclass: hamming_loss, hamming_loss_micro, hamming_loss_macro, hamming_loss_weighted
multilabel: hamming_loss, hamming_loss_micro, hamming_loss_macro, hamming_loss_samples, hamming_loss_weighted
Also known as: Hamming distance
Range: [0, 1]
Optimum: 0
Accepts probabilities: no
Documentation: scikit-learn, Wikipedia
Note: hamming_loss is equivalent to 1 - accuracy. For multilabel problems, the averaging policy defaults to "macro", whereas accuracy returns subset accuracy (i.e., all labels must match) by default.
Note: hamming_loss is defined for multiclass and multilabel problems even without specifying an averaging policy.
Note: Refer to Section “Averaging” for information about micro-, macro-, samples- and weighted averaging.
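The difference between the two multilabel defaults can be illustrated with a small pure-Python sketch on a made-up indicator matrix. It follows the reading of the note above, namely that the Hamming loss is one minus the macro-averaged per-label accuracy, and it does not call the library.

```python
# Four samples (rows), three labels (columns); values are made up for illustration.
y_true = [[1, 0, 1], [0, 1, 0], [1, 1, 0], [0, 0, 1]]
y_pred = [[1, 0, 0], [0, 1, 0], [1, 0, 0], [0, 0, 1]]

n_samples, n_labels = len(y_true), len(y_true[0])

# Subset accuracy: a sample counts only if all of its labels match.
subset_accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / n_samples

# Per-label accuracy, then the unweighted (macro) mean over labels.
per_label_accuracy = [
    sum(t[j] == p[j] for t, p in zip(y_true, y_pred)) / n_samples
    for j in range(n_labels)
]
macro_accuracy = sum(per_label_accuracy) / n_labels

# Fraction of wrongly predicted labels: 2 out of 12 here.
hamming = 1 - macro_accuracy
```

Here the subset accuracy is 0.5 while the Hamming loss is 1/6, so 1 - accuracy with the default (subset) accuracy would not reproduce the Hamming loss.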
Jaccard Index
Implementation:
binary: jaccard
multiclass: jaccard_micro, jaccard_macro, jaccard_weighted
multilabel: jaccard_micro, jaccard_macro, jaccard_samples, jaccard_weighted
Also known as: Jaccard similarity coefficient, intersection over union, IoU
Range: [0, 1]
Optimum: 1
Accepts probabilities: no
Documentation: scikit-learn, Wikipedia
Note: Not equivalent to accuracy, although this is claimed by the scikit-learn documentation.
Note: Refer to Section “Averaging” for information about micro-, macro-, samples- and weighted averaging.
Matthews Correlation Coefficient
Implementation:
binary: matthews_correlation_coefficient
multiclass: matthews_correlation_coefficient, matthews_correlation_coefficient_micro, matthews_correlation_coefficient_macro, matthews_correlation_coefficient_weighted
multilabel: matthews_correlation_coefficient_micro, matthews_correlation_coefficient_macro, matthews_correlation_coefficient_samples, matthews_correlation_coefficient_weighted
Also known as: MCC, phi coefficient
Range: [-1, 1]
Optimum: 1
Accepts probabilities: no
Documentation: scikit-learn, Wikipedia
Note: matthews_correlation_coefficient is defined for multiclass problems even without specifying an averaging policy.
Note: Refer to Section “Averaging” for information about micro-, macro-, samples- and weighted averaging.
Informedness
Implementation:
binary: informedness
multiclass: informedness_micro, informedness_macro, informedness_samples, informedness_weighted
multilabel: informedness_micro, informedness_macro, informedness_samples, informedness_weighted
Also known as: Youden index, Youden’s J statistic
Range: [-1, 1]
Optimum: 1
Accepts probabilities: no
Documentation: Wikipedia
Note: Refer to Section “Averaging” for information about micro-, macro-, samples- and weighted averaging.
Note: Informedness has a natural generalization to the multiclass case, which is currently not implemented.
Note: Closely related to balanced_accuracy, which is (informedness + 1) / 2 in the binary case.
Markedness
Implementation:
binary: markedness
multiclass: markedness_micro, markedness_macro, markedness_samples, markedness_weighted
multilabel: markedness_micro, markedness_macro, markedness_samples, markedness_weighted
Also known as: deltaP
Range: [-1, 1]
Optimum: 1
Accepts probabilities: no
Documentation: Wikipedia
Note: Refer to Section “Averaging” for information about micro-, macro-, samples- and weighted averaging.
Note: Markedness has a natural generalization to the multiclass case, which is currently not implemented.
True Positives
Accepts probabilities: no
Note: Not actually a metric, but total number of true positives, i.e., correctly predicted positive samples. (1,1)-th entry of confusion_matrix.
True Negatives
Accepts probabilities: no
Note: Not actually a metric, but total number of true negatives, i.e., correctly predicted negative samples. (0,0)-th entry of confusion_matrix.
False Positives
Accepts probabilities: no
Note: Not actually a metric, but total number of false positives, i.e., negative samples wrongly predicted as positive. (0,1)-th entry of confusion_matrix.
False Negatives
Accepts probabilities: no
Note: Not actually a metric, but total number of false negatives, i.e., positive samples wrongly predicted as negative. (1,0)-th entry of confusion_matrix.
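The index convention used for these four counts (row = true class, column = predicted class) can be made concrete with a short pure-Python sketch. The helper below is a hypothetical stand-in for illustration, not the library's confusion_matrix.

```python
def binary_confusion_matrix(y_true, y_pred):
    """2x2 matrix with rows indexed by the true class and columns by the predicted class."""
    cm = [[0, 0], [0, 0]]
    for t, p in zip(y_true, y_pred):
        cm[t][p] += 1
    return cm

y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]

cm = binary_confusion_matrix(y_true, y_pred)   # [[3, 1], [1, 3]]

true_negatives = cm[0][0]    # (0,0)-th entry
false_positives = cm[0][1]   # (0,1)-th entry
false_negatives = cm[1][0]   # (1,0)-th entry
true_positives = cm[1][1]    # (1,1)-th entry
```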
Balance Score
Implementation:
binary: balance_score
Accepts probabilities: yes
Note: Equal to sensitivity at decision threshold balance_threshold, which by definition is (approximately) equal to specificity at that threshold. Moreover, it can be shown to be (approximately) equal to accuracy and balanced accuracy at that threshold, too.
Prevalence Score
Implementation:
binary: prevalence_score
Accepts probabilities: yes
Note: Equal to sensitivity at decision threshold prevalence_threshold, which can be shown to be (approximately) equal to positive predictive value and F1-score at that threshold.
Built-in Classification Thresholding Strategies
Balance Threshold
Implementation: balance_threshold
Usage in --threshold command-line argument: "balance"
Definition: The decision threshold which minimizes the difference between sensitivity and specificity.
Prevalence Threshold
Implementation: prevalence_threshold
Usage in --threshold command-line argument: "prevalence"
Definition: The decision threshold which minimizes the difference between the total number of condition positive samples and the number of predicted positive samples.
(0,1)-Threshold
Implementation: zero_one_threshold
Usage in --threshold command-line argument: "zero_one", "zero_one(<specificity_weight>)"
Definition: The decision threshold which minimizes the Euclidean distance between (0, 1) and (1 - specificity, sensitivity).
Argmax Threshold
Implementation: argmax_threshold
Usage in --threshold command-line argument: "argmax <metric_name>"
Definition: The decision threshold which maximizes a given metric.
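The definitions above can be turned into a brute-force search over candidate thresholds. The following pure-Python sketch is not how the built-in strategies are implemented. It assumes that a sample is predicted positive if and only if its score is greater than or equal to the threshold, and that the candidate thresholds are the distinct scores. The helper names are hypothetical, and the optional specificity weight of the (0,1)-threshold is left out.

```python
import math

def rates_at(y_true, y_score, threshold):
    """Sensitivity, specificity and number of predicted positives at the given threshold."""
    tp = sum(1 for y, s in zip(y_true, y_score) if s >= threshold and y == 1)
    fp = sum(1 for y, s in zip(y_true, y_score) if s >= threshold and y == 0)
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    return tp / n_pos, 1 - fp / n_neg, tp + fp

def pick_threshold(y_true, y_score, objective):
    """Candidate threshold (one per distinct score) with the smallest objective value."""
    return min(sorted(set(y_score)), key=lambda t: objective(*rates_at(y_true, y_score, t)))

y_true = [0, 0, 0, 1, 0, 0, 1, 0, 1]
y_score = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
n_pos = sum(y_true)

# Balance: minimize |sensitivity - specificity|.
balance = pick_threshold(y_true, y_score, lambda sens, spec, n_pred: abs(sens - spec))
# Prevalence: minimize |#predicted positives - #condition positives|.
prevalence = pick_threshold(y_true, y_score, lambda sens, spec, n_pred: abs(n_pred - n_pos))
# (0,1): minimize the distance between (0, 1) and (1 - specificity, sensitivity).
zero_one = pick_threshold(y_true, y_score, lambda sens, spec, n_pred: math.hypot(1 - spec, 1 - sens))
```

On this toy data the balance strategy selects 0.6, whereas the prevalence and (0,1) strategies both select 0.7. An argmax strategy fits the same pattern by minimizing the negated metric.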
Averaging
Regression
If multiple regression targets are specified, regression metrics are computed for each target individually and for all targets combined. The latter simply calls the corresponding functions on the ground-truth and prediction matrices and relies on scikit-learn’s built-in policy to handle such cases. Normally (though not necessarily always), this proceeds by simply taking the unweighted mean of the individual metrics.
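For a metric like the mean squared error, the combined value is the unweighted mean of the per-target values. A minimal pure-Python sketch with made-up numbers (the helper mse is defined here only for illustration):

```python
def mse(y_true, y_pred):
    """Mean squared error of a single target."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

# Two regression targets (columns), four samples (rows).
y_true = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]]
y_pred = [[1.5, 12.0], [2.0, 18.0], [2.0, 33.0], [4.5, 40.0]]

per_target = [
    mse([row[j] for row in y_true], [row[j] for row in y_pred])
    for j in range(2)
]                                                # [0.375, 4.25]
combined = sum(per_target) / len(per_target)     # 2.3125
```

Because the mean is unweighted, targets on a larger scale (the second column here) dominate the combined value.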
Multiclass Classification
Binary classification metrics that do not naturally apply to multiclass problems, like the F1 score, can be computed
per-class and then averaged to obtain a single scalar value. To that end, the multiclass problem is cast as a special
case of a multilabel problem where always exactly one element of the multilabel indicator matrix is 1. The possible
averaging policies are micro, macro, samples and weighted; see below for details. The desired averaging
policy can be selected either by using a properly suffixed version of the function, like f1_micro, or by passing a
suitable value for parameter average of the non-suffixed function.
Multilabel Classification
Binary classification metrics can be applied to each label of a multilabel problem separately, and then averaged to
obtain a single scalar value. Four averaging policies are supported by default, and can be specified either by using
a properly suffixed version of the function, or via the average parameter of the original function:
micro: Metrics are computed globally by counting the total true positives, true negatives, false positives and false negatives across all classes.
macro: Unweighted mean of per-class metric values.
samples: Unweighted mean of per-sample metric values; only makes sense for multilabel tasks.
weighted: Weighted mean of per-class metric values, with weights corresponding to the number of instances of each class.
In addition, passing average=None returns the metric value for each label separately, in an array of shape
(n_labels,).
Note: Some metrics, most prominently accuracy and balanced_accuracy, are defined for multiclass/multilabel
problems even without averaging. The values reported in metrics.xlsx are still the averaged versions, though.
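The four averaging policies can be compared on a small multilabel example. The sketch below computes the F1 score in pure Python under each policy. It only illustrates the definitions given above and does not use the library's f1_micro, f1_macro, f1_samples or f1_weighted. All helper names are hypothetical.

```python
def f1_from_counts(tp, fp, fn):
    """F1 score from counts; defined as 0 if there are no positives at all."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom > 0 else 0.0

def counts(true_vec, pred_vec):
    """True positives, false positives and false negatives of two 0/1 vectors."""
    tp = sum(1 for t, p in zip(true_vec, pred_vec) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(true_vec, pred_vec) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(true_vec, pred_vec) if t == 1 and p == 0)
    return tp, fp, fn

# Four samples (rows), three labels (columns).
y_true = [[1, 0, 1], [1, 1, 0], [0, 1, 0], [1, 0, 0]]
y_pred = [[1, 0, 0], [1, 1, 0], [1, 1, 0], [1, 0, 1]]

n_labels = len(y_true[0])
label_counts = [
    counts([row[j] for row in y_true], [row[j] for row in y_pred])
    for j in range(n_labels)
]
per_label = [f1_from_counts(*c) for c in label_counts]   # one value per label
support = [sum(row[j] for row in y_true) for j in range(n_labels)]

# micro: pool the counts over all labels, then compute the metric once.
micro = f1_from_counts(*[sum(c[k] for c in label_counts) for k in range(3)])
# macro: unweighted mean of the per-label values.
macro = sum(per_label) / n_labels
# weighted: per-label values weighted by the number of positive instances.
weighted = sum(f * s for f, s in zip(per_label, support)) / sum(support)
# samples: compute the metric per sample (row), then take the unweighted mean.
samples = sum(f1_from_counts(*counts(t, p)) for t, p in zip(y_true, y_pred)) / len(y_true)
```

On this data the four policies give four different values (roughly 0.769, 0.619, 0.762 and 0.75), which is why the policy should always be reported alongside the metric.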
Confusion-Matrix Based Metrics
Every classification metric that operates on class predictions (e.g., accuracy, sensitivity, etc.) has a
corresponding variant that operates directly on confusion matrices, suffixed with _cm. This comes in handy when
multiple such metrics are to be computed, and the number of samples is huge: simply compute the confusion matrix once,
and then compute the desired metrics on the (small) confusion matrix.
The following statements hold true for some metric metric and its confusion-matrix based variant metric_cm:
metric_cm generally accepts the same keyword arguments as metric. The only notable exception is sample_weight, which has to be taken into account when constructing the confusion matrix.
metric_cm generally accepts the same averaging policies as metric, with only two exceptions:
average="samples" is not supported by metric_cm, simply because there is no sample dimension anymore.
average="global" is not supported by accuracy_cm in multilabel problems.
Furthermore, metric_cm and metric use the same default averaging policy.
For all y_true and y_hat: metric(y_true, y_hat, sample_weight=sw, **kwargs) is equal to metric_cm(cm=confusion_matrix(y_true, y_hat, sample_weight=sw), **kwargs), unless metric_cm is not defined due to one of the reasons listed above.
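The "compute the confusion matrix once, then derive several metrics from it" pattern, including the role of sample weights, can be sketched in pure Python. The functions below carry a _sketch suffix to make clear that they are local stand-ins written for this illustration and not the library's confusion_matrix, accuracy or *_cm functions.

```python
def confusion_matrix_sketch(y_true, y_hat, sample_weight=None, n_classes=2):
    """Entry [i][j] sums the weights of samples with true class i predicted as class j."""
    if sample_weight is None:
        sample_weight = [1.0] * len(y_true)
    cm = [[0.0] * n_classes for _ in range(n_classes)]
    for t, p, w in zip(y_true, y_hat, sample_weight):
        cm[t][p] += w
    return cm

def accuracy_sketch(y_true, y_hat, sample_weight=None):
    """Weighted accuracy computed directly from ground truth and predictions."""
    if sample_weight is None:
        sample_weight = [1.0] * len(y_true)
    correct = sum(w for t, p, w in zip(y_true, y_hat, sample_weight) if t == p)
    return correct / sum(sample_weight)

def accuracy_cm_sketch(cm):
    """Accuracy from a confusion matrix: diagonal mass divided by total mass."""
    return sum(cm[i][i] for i in range(len(cm))) / sum(map(sum, cm))

def sensitivity_cm_sketch(cm):
    """Binary sensitivity from a confusion matrix: TP / (TP + FN)."""
    return cm[1][1] / (cm[1][0] + cm[1][1])

y_true = [1, 1, 0, 0, 1, 0]
y_hat = [1, 0, 0, 1, 1, 0]
sw = [1.0, 2.0, 1.0, 1.0, 2.0, 1.0]

# The weights enter once, when the matrix is built ...
cm = confusion_matrix_sketch(y_true, y_hat, sample_weight=sw)   # [[2.0, 1.0], [2.0, 3.0]]
# ... and every metric derived from the matrix then agrees with the direct computation.
acc_direct = accuracy_sketch(y_true, y_hat, sample_weight=sw)   # 0.625
acc_from_cm = accuracy_cm_sketch(cm)                            # 0.625
sens_from_cm = sensitivity_cm_sketch(cm)                        # 0.6
```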
Calculating Metrics from Raw Predictions
By default, CaTabRa automatically calculates suitable performance metrics when evaluating trained prediction models,
and saves them to disk in files called metrics.xlsx and (optionally) bootstrapping.xlsx. These metrics can easily be
computed manually as well; all that is required are sample-wise predictions (as saved in predictions.xlsx) and the
corresponding data encoder that can be easily obtained from a
catabra.util.io.CaTabRaLoader object:
from catabra.util import io
from catabra import evaluation

loader = io.CaTabRaLoader("CaTabRa_dir")
metrics, bootstrapping = evaluation.calc_metrics(
    "predictions.xlsx",
    loader.get_encoder(),
    bootstrapping_repetitions=...,
    bootstrapping_metrics=...
)