Abstract
A model that digitises a scanned well log returns a softmax score for every pixel, but a score is not yet a decision, and the decision a production digitiser actually needs is binary: accept this depth on the model's word, or set it aside for a person to check. This survey reads how the literature converts a confidence number into that operating rule. We trace the question to its decision-theoretic root, the reject option and the error-reject tradeoff, and follow it through the modern lineage of maximum-softmax baselines, calibration, predictive uncertainty, selective classification, and out-of-distribution detection. Two moves are easy to conflate and worth separating: a threshold is only as good as the ordering it cuts, so the score must rank pixels by genuine reliability; and even a perfect ordering still demands a choice of where to cut on the risk-coverage curve, which is an operating decision, not a modelling one. We read this against a real three-class raster-log baseline whose per-class scores diverge sharply, with the background class at recall 0.97 and precision 0.96 against the two curve classes at recall 0.37 and 0.32 and precision 0.41 and 0.36. The central finding is that under that imbalance a single global cutoff is structurally wrong, because the cutoff that auto-accepts the easy class wholesale is far too generous for the hard ones; the defensible design sets the cutoff per class, accepting the abundant background automatically and escalating the doubtful curve depths, and the published selective-prediction machinery is exactly what makes that split principled rather than ad hoc.
What a confidence number is being asked to do
It helps to be precise about the task before reaching for a method. A petrophysical raster log is a scanned image of curve traces winding down a depth axis, and digitising it means assigning every pixel to a class: background, the first curve, or the second. A trained segmentation network does this by emitting, for each pixel, a probability vector over the three classes, and the argmax of that vector is the prediction. The largest component of the vector, the maximum softmax probability, is the natural candidate for a confidence score, and it is the baseline the field still measures everything else against [2].
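As a minimal sketch of the score just described, the following computes the per-pixel prediction and its maximum softmax probability from raw logits. The array shape, the helper names, and the 0.9 cutoff are assumptions made for illustration; they are not taken from the engagement's code.

```python
import numpy as np

def softmax(logits, axis=-1):
    """Numerically stable softmax along the class axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def predict_with_confidence(logits):
    """Return (predicted class, maximum softmax probability) per pixel.

    logits: array of shape (H, W, C); C = 3 here
    (background, first curve, second curve).
    """
    probs = softmax(logits, axis=-1)
    return probs.argmax(axis=-1), probs.max(axis=-1)

# Hypothetical 1x2 image: one confident background pixel, one ambiguous pixel.
logits = np.array([[[4.0, 0.0, 0.0], [0.5, 0.4, 0.3]]])
pred, msp = predict_with_confidence(logits)
accept = msp >= 0.9  # illustrative cutoff only, not a recommended value
```

Both pixels are predicted as background, but only the first clears the cutoff; the second is the kind of pixel a threshold would route to review.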
The reason a score matters at all is that a digitiser does not operate in a vacuum where every prediction is equal. A geoscientist will review the output, and review time is the scarce resource. If the model could mark the depths it is sure about and the depths it is not, the reviewer could spend their attention where it changes the answer and skim the rest. That is the entire economic case for a confidence threshold: it turns a uniform review burden into a triaged one. The model auto-accepts where it is confident and escalates where it is not, and the only knob is the cutoff that divides the two.
This framing is old. Chow set it out in 1970 as the error-reject tradeoff: a classifier that is allowed to abstain on some fraction of inputs can lower its error on the rest, and there is an optimal threshold at which the marginal cost of a rejection equals the marginal cost of an error [1]. Every modern confidence-thresholding method is a descendant of that idea, dressed in the language of deep networks. What changed is not the question but the difficulty of getting a trustworthy score out of a model that, left alone, tends to be confidently wrong.
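The threshold Chow's argument arrives at can be written out in a few lines. The cost normalisation below (a correct decision costs 0, an error costs 1, a rejection costs d) and the symbols p_k(x) and d are assumptions chosen for this sketch; the block assumes the amsmath package for the align environment.

```latex
% Illustrative normalisation: correct = 0, error = 1, reject = d, with 0 <= d <= 1.
% p_k(x) is the posterior probability of class k given pixel x.
\begin{align*}
  \text{expected cost of accepting } x &= 1 - \max_k p_k(x), \\
  \text{expected cost of rejecting } x &= d, \\
  \text{accept } x &\iff 1 - \max_k p_k(x) \le d
                    \iff \max_k p_k(x) \ge 1 - d .
\end{align*}
```

Under these assumptions, a review that costs one tenth of an error gives d = 0.1 and a cutoff of 0.9. The rule is only optimal when the score is a true posterior, which is why the trustworthiness of the score becomes the central difficulty.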
The ordering problem: is the score worth cutting?
A threshold partitions pixels into accept and escalate by where their score falls relative to the cut. That partition is only useful if the score orders pixels by something real, that is, if higher-scoring pixels are genuinely more likely to be correct than lower-scoring ones. If the ordering is noise, no choice of cutoff helps, because the accepted set is no cleaner than the rejected one.
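One hedged way to test whether an ordering is worth cutting is to ask how often a correct pixel outscores an incorrect one. The sketch below computes that probability (the AUROC of the score as a detector of correct predictions) by direct pairwise comparison; the function name and the toy data are hypothetical, and the pairwise approach is only practical for small samples.

```python
import numpy as np

def ordering_auroc(scores, correct):
    """Probability that a random correct pixel outscores a random incorrect one.

    0.5 means the ordering is noise; 1.0 means every correct pixel
    ranks above every incorrect one. Ties count as half.
    """
    pos = scores[correct]    # scores of correctly classified pixels
    neg = scores[~correct]   # scores of misclassified pixels
    greater = (pos[:, None] > neg[None, :]).mean()
    ties = (pos[:, None] == neg[None, :]).mean()
    return float(greater + 0.5 * ties)

# Hypothetical held-out sample of four pixels.
scores = np.array([0.9, 0.8, 0.6, 0.4])
correct = np.array([True, True, False, True])
auroc = ordering_auroc(scores, correct)
```

Here two of the three correct pixels outrank the single error, so the value is 2/3. A value near 0.5 on held-out data would suggest that no cutoff on this score can help.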
Two strands of the literature attack the ordering directly. The first is calibration, the property that a score of 0.8 corresponds to an empirical 80 percent chance of being right. Modern networks are typically miscalibrated and over-confident, which means a raw 0.9 cannot be read at face value; temperature scaling rescales the logits with a single learned parameter so that the score recovers its probabilistic meaning, which is what lets a threshold expressed as a probability behave as advertised [3]. Temperature scaling never changes a pixel's predicted class, because dividing the logits by a positive constant leaves their argmax where it was, and it is not meant to improve the ranking of pixels either, although with more than two classes it can shift the relative order of some of them. Its contribution is to make the cutoff interpretable, which matters when a domain expert has to choose one.
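A minimal sketch of temperature scaling follows, assuming held-out logits and labels are available. Guo et al. fit the temperature by minimising held-out negative log-likelihood with a gradient-based optimiser [3]; the grid search here is a simpler stand-in, and the synthetic data, which is over-confident by a factor of 3 by construction, is hypothetical.

```python
import numpy as np

def log_softmax(logits):
    """Row-wise log-softmax, shifted for numerical stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

def nll(logits, labels, temperature):
    """Mean negative log-likelihood of the labels after dividing logits by T."""
    logp = log_softmax(logits / temperature)
    return float(-logp[np.arange(len(labels)), labels].mean())

def fit_temperature(logits, labels, grid=None):
    """Pick the T on a log-spaced grid that minimises held-out NLL."""
    if grid is None:
        grid = np.exp(np.linspace(np.log(0.05), np.log(20.0), 400))
    losses = [nll(logits, labels, t) for t in grid]
    return float(grid[int(np.argmin(losses))])

# Synthetic held-out set (hypothetical): labels are drawn from softmax(z),
# but the "model" reports logits 3*z, so it is over-confident by design.
rng = np.random.default_rng(0)
z = rng.normal(size=(5000, 3))
true_probs = np.exp(log_softmax(z))
u = rng.random(len(z))
labels = np.minimum((u[:, None] > true_probs.cumsum(axis=1)).sum(axis=1), 2)
model_logits = 3.0 * z

T = fit_temperature(model_logits, labels)
raw_conf = np.exp(log_softmax(model_logits)).max(axis=1)
calibrated_conf = np.exp(log_softmax(model_logits / T)).max(axis=1)
```

On this synthetic set the fitted temperature should land near 3, and the calibrated confidences are lower on average than the raw ones, which is the correction an over-confident model needs before a probability cutoff can be read literally.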
The second strand improves the ordering itself. The maximum softmax probability is a weak signal because a network can place most of its mass on a wrong class. Deep ensembles average several independently trained models and use their disagreement as an uncertainty estimate, sharpening the separation between reliable and unreliable predictions at the cost of training more than one model [4]. ODIN combines temperature scaling with a small input perturbation to widen the score gap between in-distribution and out-of-distribution inputs, which is the same gap a threshold relies on [6]. Two later proposals abandon the softmax as the thing to threshold at all: the Trust Score sets aside the network's own probability and compares a test point's distance to the nearest class other than the predicted one with its distance to the predicted class [7], and ConfidNet trains a dedicated confidence head whose output orders predictions by true correctness better than the softmax does, specifically for the failure-prediction task that thresholding is [9]. The shared lesson is that the quality of a confidence cut is bounded by the quality of the score it cuts, and that the raw softmax is a floor, not a ceiling.
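To illustrate the ensemble idea, the sketch below averages the softmax outputs of several members and reports a simple disagreement measure. Averaging the predictive distributions follows the deep-ensembles recipe [4]; the vote-fraction disagreement and the toy probabilities are assumptions chosen for illustration, not a measure prescribed by that paper.

```python
import numpy as np

def ensemble_confidence(member_probs):
    """Combine M ensemble members.

    member_probs: array of shape (M, N, C) holding each member's softmax
    output for N pixels over C classes.
    Returns the ensemble prediction, its confidence (max of the averaged
    distribution), and the fraction of members voting for another class.
    """
    mean_probs = member_probs.mean(axis=0)      # (N, C)
    pred = mean_probs.argmax(axis=1)            # (N,)
    confidence = mean_probs.max(axis=1)         # (N,)
    votes = member_probs.argmax(axis=2)         # (M, N)
    disagreement = (votes != pred[None, :]).mean(axis=0)
    return pred, confidence, disagreement

# Hypothetical outputs of three members on two pixels.
member_probs = np.array([
    [[0.90, 0.05, 0.05], [0.6, 0.3, 0.1]],
    [[0.90, 0.05, 0.05], [0.3, 0.6, 0.1]],
    [[0.90, 0.05, 0.05], [0.7, 0.2, 0.1]],
])
pred, confidence, disagreement = ensemble_confidence(member_probs)
```

The members agree on the first pixel, so its averaged confidence stays at 0.9. They split on the second, which pulls its confidence down to about 0.53 and raises its disagreement to one third.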
The operating problem: where to cut
Suppose the ordering is good. There is still a second, distinct decision, and it is the one practitioners under-think: where on the curve to place the cutoff. Raising the threshold accepts fewer pixels but the accepted ones are cleaner; lowering it accepts more at the cost of letting errors through. This is the risk-coverage tradeoff, and Geifman and El-Yaniv gave it its modern formalism, defining a selection function paired with the classifier and a procedure that, for a chosen target risk, certifies with high probability that the risk of the accepted set stays below it while keeping coverage as large as it can [5]. SelectiveNet later folded the same idea into training, learning the classifier and its reject rule jointly at a target coverage so the two are optimised together rather than bolted on afterward [8].
The risk-coverage view reframes the cutoff as a point on a curve rather than a magic number. Each candidate threshold yields a coverage, the fraction of pixels auto-accepted, and a risk, the error rate among those accepted. Plotting risk against coverage traces the whole menu of operating points, and the right one is wherever the cost of a missed escalation crosses the cost of an unnecessary one for the specific deliverable. For a digitiser that feeds a curve into a downstream petrophysical calculation, a single mis-accepted curve pixel can shift a depth by enough to matter, so the acceptable risk is low and the coverage one is willing to surrender to get there is correspondingly high.
The exhibit above sweeps a single cutoff and reads both quantities off it for each of our three classes. The shape that matters is the difference between the curves, not their absolute height: the easy class holds its coverage almost flat as the cutoff climbs, while the two hard classes shed coverage steeply for only a modest gain in retained accuracy. That divergence is the argument for what follows.
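The sweep described in this section can be sketched in a few lines, assuming held-out scores and a per-pixel correctness flag are available. The function name and the six-pixel sample are hypothetical; in practice the same function would be run once per class on that class's held-out pixels.

```python
import numpy as np

def risk_coverage(scores, correct, thresholds):
    """Return (threshold, coverage, risk) for each candidate cutoff.

    coverage: fraction of pixels auto-accepted (score >= threshold).
    risk: error rate among the accepted pixels (NaN if none are accepted).
    """
    rows = []
    for t in thresholds:
        accepted = scores >= t
        coverage = float(accepted.mean())
        if accepted.any():
            risk = float((~correct[accepted]).mean())
        else:
            risk = float("nan")
        rows.append((float(t), coverage, risk))
    return rows

# Hypothetical held-out sample of six pixels.
scores = np.array([0.95, 0.90, 0.80, 0.60, 0.55, 0.40])
correct = np.array([True, True, True, False, True, False])
rows = risk_coverage(scores, correct, thresholds=[0.5, 0.7, 0.9])
```

On this sample, raising the cutoff from 0.5 to 0.7 drops coverage from five sixths to one half and drops risk from 0.2 to zero. Raising it further to 0.9 gives up more coverage for no additional gain, which is the kind of shape an operating point should be read from.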
Reading the literature against our baseline
We have the numbers to make this concrete. Under a Dice loss our multiclass segmenter splits its three classes into one easy and two hard, and the per-class metrics from the handover make the split unambiguous. The background class reaches a recall of 0.97 and a precision of 0.96; the two curve classes sit far below ceiling, with recall 0.37 and 0.32 and precision 0.41 and 0.36. These are real engagement figures used here as a worked example of where the survey's lessons bite, not as a benchmark of competing thresholding methods.
What the imbalance implies for thresholding is immediate. The background class is mostly composed of high-confidence pixels, because the model has learned it thoroughly, so a cutoff placed almost anywhere keeps nearly all of its coverage at near-ceiling accuracy. The two curve classes are the opposite: a large share of their pixels carry middling confidence, because the model is uncertain about exactly the thin traces the task exists to recover, so raising the cutoff strips their coverage quickly while lifting their accuracy only a little. A global cutoff is therefore caught in a contradiction. The threshold that auto-accepts the abundant background wholesale, which is what you want, is far too permissive for the curves, letting through exactly the uncertain curve pixels that most need a second look; and the threshold strict enough to clean up the curves would needlessly escalate huge volumes of background the model already has right.
The resolution the literature supports is a per-class operating point. Nothing in the reject-option framework requires one threshold for all classes; the error-reject tradeoff is defined per decision, and the risk-coverage curve can be drawn class by class. Setting a low cutoff on background and high cutoffs on the two curves accepts the easy depths automatically and routes the doubtful curve depths to review, which is the triage the engagement actually wanted. The out-of-distribution detection work pushes score-based thresholding further into the large-label production regime, where a single global score is least adequate [10].
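A per-class operating point might be implemented along the following lines. This is a sketch under stated assumptions: cutoffs are indexed by predicted class, since that is all that is known at inference time; each cutoff is the lowest grid value whose empirical held-out risk meets a target; and the data, grid, and 0.1 target are hypothetical. It is a plain empirical selection and does not carry the high-probability guarantee of [5].

```python
import numpy as np

def pick_cutoff(scores, correct, target_risk, grid):
    """Lowest cutoff on the grid whose accepted set has risk <= target_risk."""
    for t in sorted(grid):
        accepted = scores >= t
        if accepted.any() and (~correct[accepted]).mean() <= target_risk:
            return float(t)
    return float("inf")  # nothing meets the target: escalate the whole class

def per_class_cutoffs(pred, scores, correct, target_risk, grid, n_classes=3):
    """One cutoff per predicted class, each chosen from that class's pixels."""
    return np.array([
        pick_cutoff(scores[pred == k], correct[pred == k], target_risk, grid)
        for k in range(n_classes)
    ])

def triage(pred, scores, cutoffs):
    """True where a pixel is auto-accepted, False where it is escalated."""
    return scores >= cutoffs[pred]

# Hypothetical held-out pixels: class 0 = background, 1 and 2 = the curves.
pred = np.array([0, 0, 0, 0, 1, 1, 1, 2, 2, 2])
scores = np.array([0.99, 0.97, 0.95, 0.60, 0.90, 0.70, 0.50, 0.85, 0.65, 0.45])
correct = np.array([True, True, True, True, True, False, False,
                    True, False, False])
grid = [0.5, 0.6, 0.7, 0.8, 0.9]

cutoffs = per_class_cutoffs(pred, scores, correct, target_risk=0.1, grid=grid)
accept = triage(pred, scores, cutoffs)
```

On this toy data the background class keeps the lowest cutoff on the grid and all of its coverage, while the two curve classes need cutoffs of 0.8 and 0.7 and give up two thirds of theirs.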
Where this sits in the field, and what it cannot settle
Read together, the literature draws a clean separation that is easy to lose in practice. Calibration governs whether the number attached to a pixel means what it says, and the score-improving methods govern whether the ordering is trustworthy; selective classification governs where to cut a trustworthy ordering. They are complementary, not competing, and a deployment needs both: a sharper score with no operating-point discipline wastes its sharpness, and a careful risk-coverage choice on a noisy score cuts noise. Our baseline shows why the second move cannot be made once for the whole model when the classes are this unequal, and why a confidence threshold for a raster-log digitiser is really a small vector of thresholds, one per class, each chosen from its own risk-coverage curve.
What the survey deliberately does not claim is a single best method. The field has not converged on one, and the right choice depends on budget and stakes: an ensemble buys a better ordering at several times the training cost, a learned confidence head buys some of that gain for one extra head, and temperature scaling buys interpretability for almost nothing. The honest recommendation is procedural rather than prescriptive: measure the risk-coverage curve per class on held-out data, choose the operating point from the cost of a missed escalation against the cost of an unnecessary one, and revisit it when the input distribution drifts.
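The cost comparison in that recommendation can be made concrete with a small sketch. It assumes two hypothetical unit costs, one for a wrong pixel that was auto-accepted (a missed escalation) and one for a right pixel that was sent to review anyway (an unnecessary escalation), and picks the cutoff on a grid that minimises their total on held-out data. The names and numbers are illustrative only.

```python
import numpy as np

def expected_cost(scores, correct, t, cost_missed, cost_unneeded):
    """Average per-pixel cost of running cutoff t on a held-out sample."""
    accepted = scores >= t
    missed = np.sum(accepted & ~correct)     # wrong pixels auto-accepted
    unneeded = np.sum(~accepted & correct)   # right pixels escalated anyway
    return float(cost_missed * missed + cost_unneeded * unneeded) / len(scores)

def cheapest_cutoff(scores, correct, grid, cost_missed, cost_unneeded):
    """Cutoff on the grid with the lowest expected cost."""
    costs = [expected_cost(scores, correct, t, cost_missed, cost_unneeded)
             for t in grid]
    return float(grid[int(np.argmin(costs))])

# Hypothetical held-out sample for one class.
scores = np.array([0.95, 0.90, 0.80, 0.60, 0.55, 0.40])
correct = np.array([True, True, True, False, True, False])
grid = [0.3, 0.5, 0.7, 0.9]

strict = cheapest_cutoff(scores, correct, grid, cost_missed=10, cost_unneeded=1)
lenient = cheapest_cutoff(scores, correct, grid, cost_missed=1, cost_unneeded=10)
```

With errors priced at ten times a review, the cheapest cutoff on this sample is 0.7; with the prices reversed it falls to 0.5. The same scores support different operating points depending on the stakes, which is why the choice belongs to the deliverable and not to the model.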
Limitations
This is a survey of the confidence-thresholding literature and inherits a survey's limits. It synthesises what the published work reports on the reject option, calibration, predictive uncertainty, selective classification, and out-of-distribution detection, and it does not re-implement or benchmark those methods against one another. Where it quotes numbers, the per-class recall and precision are the real Dice-loss multiclass scores of one model from a single engagement, used as a worked illustration of the imbalance regime rather than as a fresh evaluation of thresholding methods, and they are point estimates from one training run without confidence intervals across seeds or splits. The interactive exhibit is anchored on those sourced per-class scores, but the coverage that each class sheds as the cutoff rises and the accuracy it retains are illustrative monotone geometry consistent with selective prediction, not separately measured at every cutoff on held-out pixels; the true coverage and risk at a given cutoff depend on the full score distribution of each class, which was not re-measured across the sweep, so the exhibit should be read as the shape of the tradeoff rather than as a calibrated risk-coverage table. The survey also scopes itself to score-based and selective-prediction approaches as the period's literature frames them and treats Bayesian deep learning, conformal prediction, and learned abstention costs only at the edges; and it stops at the close of its own quarter, so methods the field has explored since are out of frame. A reader should take this as a map of how the literature turns a softmax score into an operating threshold and why imbalance forces that threshold to be per class, not as a substitute for measuring the risk-coverage curve on their own logs.
Findings in brief
- A softmax score is not a decision. The decision a raster-log digitiser needs is binary, auto-accept this depth or escalate it for review, and a confidence threshold is the rule that makes that split. The framing is Chow's 1970 error-reject tradeoff in modern dress.
- A threshold makes two distinct moves that are easy to conflate: it must cut a trustworthy ordering, so the score has to rank pixels by genuine reliability, and it must sit at a defensible point on the risk-coverage curve, which is an operating choice rather than a modelling one.
- The raw maximum softmax probability is the baseline, not the ceiling. Temperature scaling makes the cutoff interpretable, while deep ensembles, the Trust Score, and a learned confidence head improve the ordering the cutoff acts on.
- Our Dice-loss baseline splits into one easy and two hard classes: background at recall 0.97 and precision 0.96 against the two curves at recall 0.37 and 0.32 and precision 0.41 and 0.36. That spread is exactly the regime a single global cutoff cannot serve.
- A per-class operating point resolves the contradiction. Accept the abundant background wholesale at a low cutoff and route the doubtful curve depths to review at high cutoffs, choosing each point from its own risk-coverage curve rather than from one global number.
The practice this survey would install is one sentence long: never set a confidence threshold for a model before you have drawn the risk-coverage curve for each class it predicts, because a digitiser that runs one cutoff over a 0.97 class and a 0.37 class is not triaging its work; it is averaging two problems that have nothing in common and calling the average a policy.
References
[1] Chow, C. K. On optimum recognition error and reject tradeoff. IEEE Transactions on Information Theory, 16(1), 41-46 (1970). The decision-theoretic origin of the reject option and the error-reject tradeoff that every confidence threshold descends from. https://doi.org/10.1109/TIT.1970.1054406
[2] Hendrycks, D., and Gimpel, K. A Baseline for Detecting Misclassified and Out-of-Distribution Examples in Neural Networks. ICLR (2017). Established the maximum softmax probability as the baseline confidence score that thresholding methods are measured against. https://arxiv.org/abs/1610.02136
[3] Guo, C., Pleiss, G., Sun, Y., and Weinberger, K. Q. On Calibration of Modern Neural Networks. ICML (2017). Documented systematic over-confidence and introduced temperature scaling, which rescales scores so a probability threshold means what it claims. https://arxiv.org/abs/1706.04599
[4] Lakshminarayanan, B., Pritzel, A., and Blundell, C. Simple and Scalable Predictive Uncertainty Estimation using Deep Ensembles. NeurIPS (2017). A non-Bayesian uncertainty estimate whose model disagreement sharpens the score a threshold cuts. https://arxiv.org/abs/1612.01474
[5] Geifman, Y., and El-Yaniv, R. Selective Classification for Deep Neural Networks. NeurIPS (2017). Gave the risk-coverage tradeoff its modern formalism, with a selection function and a high-probability guarantee that the risk of the accepted set stays below a chosen target. https://arxiv.org/abs/1705.08500
[6] Liang, S., Li, Y., and Srikant, R. Enhancing The Reliability of Out-of-distribution Image Detection in Neural Networks (ODIN). ICLR (2018). Combined temperature scaling with input perturbation to widen the score gap a threshold exploits. https://arxiv.org/abs/1706.02690
[7] Jiang, H., Kim, B., Guan, M. Y., and Gupta, M. To Trust Or Not To Trust A Classifier. NeurIPS (2018). Proposed the Trust Score, a distance-to-manifold alternative to the softmax for deciding when to defer. https://arxiv.org/abs/1805.11783
[8] Geifman, Y., and El-Yaniv, R. SelectiveNet: A Deep Neural Network with an Integrated Reject Option. ICML (2019). Trained the classifier and its rejection rule jointly at a target coverage rather than bolting the threshold on afterward. https://arxiv.org/abs/1901.09192
[9] Corbiere, C., Thome, N., Bar-Hen, A., Cord, M., and Perez, P. Addressing Failure Prediction by Learning Model Confidence (ConfidNet). NeurIPS (2019). Learned a dedicated confidence head that orders predictions by true correctness better than the raw softmax for the failure-prediction task thresholding performs. https://arxiv.org/abs/1910.04851
[10] Hendrycks, D., Basart, S., Mazeika, M., Zou, A., Kwon, J., Mostajabi, M., Steinhardt, J., and Song, D. Scaling Out-of-Distribution Detection for Real-World Settings. ICML (2022). Pushed score-based thresholding into the large-label, multi-class production regime where a single global score is least adequate. https://arxiv.org/abs/1911.11132