Evaluation Metrics

This notebook provides a practical introduction to the evaluation metrics available in probly. While standard accuracy is important, it doesn’t tell the whole story about a model’s performance, especially for probabilistic models.

We will cover three key types of metrics:

  • Proper Scoring Rules: Metrics like Negative Log-Likelihood (NLL) that evaluate the quality of the entire predicted probability distribution.

  • Calibration Metrics: Metrics like Expected Calibration Error (ECE) that measure how well a model’s predicted confidence aligns with its actual accuracy.

  • Sharpness Metrics: Metrics like efficiency and coverage that evaluate the precision and reliability of set-valued (uncertain) predictions.


1. Proper Scoring Rules

Proper scoring rules are loss functions that evaluate the quality of a predictive distribution. They don’t just care about whether the top-1 prediction is correct; they penalize a model for being confidently wrong and reward it for being accurately uncertain.

The most common proper scoring rule is the Negative Log-Likelihood (NLL), also known as Log Loss.

import numpy as np

from probly.evaluation.metrics import brier_score, log_loss

# Imagine a 3-class problem
# A well-calibrated, correct prediction
calibrated_probs = np.array([[0.8, 0.1, 0.1]])
targets = np.array([0])

# An overconfident, wrong prediction
overconfident_probs = np.array([[0.01, 0.98, 0.01]])

print(f"Log Loss (Calibrated): {log_loss(calibrated_probs, targets):.4f}")
print(f"Log Loss (Overconfident): {log_loss(overconfident_probs, targets):.4f}")

print(f"\nBrier Score (Calibrated): {brier_score(calibrated_probs, targets):.4f}")
print(f"Brier Score (Overconfident): {brier_score(overconfident_probs, targets):.4f}")
Log Loss (Calibrated): 0.2231
Log Loss (Overconfident): 4.6052

Brier Score (Calibrated): 0.0600
Brier Score (Overconfident): 1.9406

Notice how the Log Loss heavily penalizes the overconfident wrong prediction. The Brier Score is another proper scoring rule: it measures the squared error between the predicted probability vector and the one-hot encoded true label, summed over classes and averaged over samples (which is why it can exceed 1 for a badly wrong multi-class prediction).
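Both values are easy to reproduce by hand. The following standalone NumPy sketch (independent of probly; the helper names nll and brier are hypothetical) mirrors the two formulas:

```python
import numpy as np

def nll(probs, targets):
    # Mean negative log-probability assigned to the true class
    return float(np.mean(-np.log(probs[np.arange(len(targets)), targets])))

def brier(probs, targets):
    # Squared error against the one-hot label, summed over classes,
    # averaged over samples (so it can exceed 1 in the multi-class case)
    one_hot = np.eye(probs.shape[1])[targets]
    return float(np.mean(np.sum((probs - one_hot) ** 2, axis=1)))

probs = np.array([[0.8, 0.1, 0.1]])
targets = np.array([0])
print(f"{nll(probs, targets):.4f}")    # 0.2231, matching log_loss above
print(f"{brier(probs, targets):.4f}")  # 0.0600, matching brier_score above
```

Because NLL only looks at the probability of the true class, -log(0.01) ≈ 4.61 explains the steep penalty for the overconfident prediction.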

2. Calibration Metrics

A model is well-calibrated if its predicted probabilities reflect its true accuracy. For example, if a model makes 100 predictions with 80% confidence, we expect it to be correct on 80 of those predictions.

The Expected Calibration Error (ECE) is the standard metric for measuring this. It groups predictions into bins based on their confidence scores and calculates the weighted average of the absolute difference between confidence and accuracy within each bin.

from probly.evaluation.metrics import expected_calibration_error

# A perfectly calibrated model
# Predicts with 80% confidence and is correct 80% of the time
perfect_probs = np.array([[0.8, 0.1, 0.1]] * 10)
perfect_labels = np.array([0] * 8 + [1] * 2)

# An overconfident model
# Predicts with 99% confidence but is only correct 80% of the time
overconfident_probs = np.array([[0.99, 0.005, 0.005]] * 10)
overconfident_labels = np.array([0] * 8 + [1] * 2)

ece_perfect = expected_calibration_error(perfect_probs, perfect_labels, num_bins=5)
ece_overconfident = expected_calibration_error(overconfident_probs, overconfident_labels, num_bins=5)

print(f"ECE (Perfectly Calibrated): {ece_perfect:.4f}")
print(f"ECE (Overconfident): {ece_overconfident:.4f}")
ECE (Perfectly Calibrated): 0.0000
ECE (Overconfident): 0.1900

As expected, the overconfident model has a much higher ECE. A lower ECE score indicates a more reliable and trustworthy model.
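Under the hood, ECE is only a few lines of NumPy. The sketch below is a simplified re-implementation (equal-width bins over the top-class confidence; not probly's actual code) that reproduces the numbers above:

```python
import numpy as np

def ece(probs, labels, num_bins=5):
    conf = probs.max(axis=1)                 # top-class confidence
    correct = (probs.argmax(axis=1) == labels).astype(float)
    edges = np.linspace(0.0, 1.0, num_bins + 1)
    total = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (conf > lo) & (conf <= hi)    # predictions falling in this bin
        if mask.any():
            # |accuracy - confidence| gap, weighted by the bin's share of samples
            total += mask.mean() * abs(correct[mask].mean() - conf[mask].mean())
    return total

probs = np.array([[0.99, 0.005, 0.005]] * 10)
labels = np.array([0] * 8 + [1] * 2)
print(f"{ece(probs, labels):.4f}")  # 0.1900
```

All ten predictions land in the (0.8, 1.0] bin, so the ECE here is simply the gap between 99% confidence and 80% accuracy.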

3. Sharpness Metrics (for Set-Valued Predictions)

For models that output a set of possible predictions (like from an ensemble or a credal set), we need metrics to evaluate the quality of that set.

  • Coverage: How often does the true label fall within the predicted set?

  • Efficiency: How sharp or precise is the predicted set? A smaller set is more efficient.

There is a natural trade-off: a model can achieve 100% coverage by always predicting all possible classes, but this would be completely inefficient.

from probly.evaluation.metrics import coverage, efficiency

# Predictions from a 3-member ensemble for two data points
# Each row is a sample, each column is a class
ensemble_preds = np.array(
    [
        [[0.7, 0.2, 0.1], [0.6, 0.3, 0.1], [0.8, 0.1, 0.1]],  # Instance 1: High agreement (sharp)
        [[0.4, 0.3, 0.3], [0.3, 0.4, 0.3], [0.3, 0.3, 0.4]],  # Instance 2: High disagreement (unsharp)
    ]
)

# For simplicity, assume one-hot encoded targets
# Note: `coverage` can also accept integer labels
targets = np.array([[1, 0, 0], [1, 0, 0]])

cov = coverage(ensemble_preds, targets)
eff = efficiency(ensemble_preds)

print(f"Coverage: {cov:.4f}")
print(f"1 - Efficiency (lower is better): {1 - eff:.4f}")  # print 1-eff so that smaller values mean sharper sets
Coverage: 0.0000
1 - Efficiency (lower is better): 0.1167
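The trade-off itself is easy to demonstrate with a hand-rolled threshold predictor (a sketch for intuition only, unrelated to how probly constructs prediction sets): include every class whose ensemble-mean probability exceeds a threshold, then lower the threshold and watch coverage rise as the sets grow:

```python
import numpy as np

ensemble_preds = np.array(
    [
        [[0.7, 0.2, 0.1], [0.6, 0.3, 0.1], [0.8, 0.1, 0.1]],  # high agreement
        [[0.4, 0.3, 0.3], [0.3, 0.4, 0.3], [0.3, 0.3, 0.4]],  # high disagreement
    ]
)
true_labels = np.array([0, 0])  # integer labels for the same two instances
mean_probs = ensemble_preds.mean(axis=1)  # average over the 3 ensemble members

for threshold in (0.5, 0.3, 0.0):
    # Prediction set per instance: every class above the threshold
    sets = mean_probs > threshold
    cov = sets[np.arange(len(true_labels)), true_labels].mean()
    size = sets.sum(axis=1).mean()
    print(f"threshold={threshold:.1f}  coverage={cov:.2f}  avg set size={size:.1f}")
```

As the threshold drops from 0.5 to 0.0, coverage climbs from 0.5 to 1.0 while the average set size grows from 0.5 to all 3 classes: perfect coverage, but no sharpness left.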