Calibration Metrics

The content and examples in this notebook are based on the paper *On Calibration of Modern Neural Networks* (Guo et al., 2017).

Introduction

In modern machine learning, we often train models that achieve very high accuracy. A deep neural network might classify an image as a “cat” and report a 99% confidence score from its softmax output. But what does that 99% confidence actually mean? Can we trust it? This is the core question of model calibration.

A model is considered well-calibrated if its predicted probabilities match the true likelihood of being correct. In simple terms, if a well-calibrated model makes 100 predictions, each with a reported confidence of 80%, then we should expect about 80 of those predictions to be correct. The confidence should match the real-world accuracy.
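We can make this definition concrete with a minimal numeric sketch. The simulation below is purely illustrative (the model, confidence level, and sample count are invented for this example): we imagine a perfectly calibrated model that always reports 80% confidence, so each of its predictions is correct with probability 0.8, and we check that the empirical accuracy lands near the reported confidence.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical scenario: a perfectly calibrated model reports 80% confidence
# on every prediction, so each prediction is correct with probability 0.80.
n_predictions = 100_000
confidence = 0.80

# Draw correctness outcomes at exactly the reported confidence level.
correct = rng.random(n_predictions) < confidence

empirical_accuracy = correct.mean()
print(f"reported confidence: {confidence:.2f}")
print(f"empirical accuracy:  {empirical_accuracy:.3f}")
```

With 100,000 samples, the empirical accuracy should sit within a fraction of a percentage point of 0.80; a miscalibrated (e.g. overconfident) model would show a systematic gap between these two numbers.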

The groundbreaking 2017 paper, “On Calibration of Modern Neural Networks,” revealed a critical problem: while modern, deep, and highly accurate neural networks (like ResNets) are excellent classifiers, they are often poorly calibrated. These models tend to be systematically overconfident: their predicted probabilities are often much higher than their actual accuracy. This is a significant risk in high-stakes applications like medical diagnosis or autonomous driving, where trusting an overconfident but wrong prediction can have serious consequences.

Fortunately, this miscalibration can be fixed after the model has been trained, using post-hoc calibration methods. These techniques adjust the model’s outputs to make its confidence scores more reliable without sacrificing its predictive accuracy.

In this tutorial, we will explore the concepts introduced in that paper. We will learn how to measure a model’s calibration using Reliability Diagrams and then implement Temperature Scaling, a simple but highly effective technique to correct for overconfidence and make our models more trustworthy.
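To preview the idea behind temperature scaling: the method divides the network's logits by a single scalar temperature T > 1 before the softmax, which softens overconfident probabilities without changing the predicted class (in practice, T is fit on a held-out validation set by minimizing negative log-likelihood). Below is a minimal sketch with invented logits; the `softmax` helper and the example values are assumptions for illustration, not the paper's code.

```python
import numpy as np

def softmax(logits, T=1.0):
    # Temperature scaling: divide logits by a scalar T before the softmax.
    # T = 1 recovers the ordinary softmax; T > 1 softens the probabilities.
    z = np.asarray(logits, dtype=float) / T
    z = z - z.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# Hypothetical logits for one overconfident 3-class prediction.
logits = [5.0, 1.0, 0.5]

p_raw = softmax(logits)          # near-certain top-class probability
p_scaled = softmax(logits, T=2)  # same argmax, softened confidence
print(f"confidence at T=1: {p_raw.max():.3f}")
print(f"confidence at T=2: {p_scaled.max():.3f}")
```

Note that dividing by T is monotonic, so the ranking of classes (and hence the accuracy) is unchanged; only the confidence shrinks toward uniform as T grows.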

TODO: expand notebook content based on the work of group #3