Abstract
When a paper reports that a segmentation model is good, what number is it usually reporting, and does that number measure the thing the reader cares about? This survey reads the published literature on how agreement between a predicted mask and ground truth is quantified, and finds that the field overwhelmingly converges on a single family, region overlap, expressed as intersection over union or as the closely related Dice coefficient. We trace that family to its pre-computing origins, credit the benchmark that made it the default, and then survey the contemporaneous work that documents where it systematically misleads: on small objects, on thin structures, and whenever a frame-averaged score lets one easy, enormous class drown out the classes that matter. We set the boundary-aware and per-class alternatives the field reaches for in those cases against a real three-class raster-log baseline, in which a single prediction earns an intersection over union of 0.94 on the background but only 0.26 and 0.21 on the two well-log curves, and an F1 of 0.97 against 0.37 and 0.32. The central finding is that region overlap is not a neutral measurement but a choice with a built-in bias toward area, and that for thin foregrounds the published correctives (instance-level normalisation, boundary-restricted overlap, and honest per-class reporting) are not refinements but the difference between a score that tracks the task and one that hides its failure.
Why measuring overlap is harder than it looks
A segmentation prediction and its ground truth are two sets of pixels, and the obvious question is how much they agree. The literature's dominant answer is to measure how much the two sets overlap, and the dominant way to express that overlap predates deep learning by most of a century. The Jaccard index, introduced to compare the floral composition of alpine regions, divides the size of the intersection of two sets by the size of their union [1]. The Dice coefficient, introduced to measure ecological association between species, instead divides twice the intersection by the sum of the two set sizes [2]. The two are monotonic transforms of each other, so a ranking by one is a ranking by the other, and in the segmentation literature they reappear under new names: the Jaccard index is intersection over union, and the Dice coefficient is exactly the F1 score, the harmonic mean of precision and recall, when the sets are a predicted and a true foreground.
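The two definitions and the relation between them can be written compactly. In the sketch below, A is the predicted foreground and B the true foreground; the pixel counts TP, FP, and FN (true positives, false positives, false negatives) are notation introduced here for illustration and do not appear in the cited sources.

```latex
J(A,B) = \frac{|A \cap B|}{|A \cup B|} = \frac{TP}{TP + FP + FN},
\qquad
D(A,B) = \frac{2\,|A \cap B|}{|A| + |B|} = \frac{2\,TP}{2\,TP + FP + FN}

D = \frac{2J}{1 + J},
\qquad
J = \frac{D}{2 - D}
```

With precision TP / (TP + FP) and recall TP / (TP + FN), their harmonic mean reduces to 2TP / (2TP + FP + FN), which is why Dice and F1 coincide. The conversion between J and D holds for a single pair of sets; scores that have first been averaged over many images need not satisfy it exactly.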
What turned these two ecology coefficients into the lingua franca of segmentation was a benchmark. The PASCAL Visual Object Classes challenge adopted mean intersection over union as its semantic-segmentation score, computing each class's overlap from pixels pooled over the test images and then averaging across classes, and because the leaderboard was the field's shared yardstick, the metric it used became the metric everyone reported [3]. That is worth stating plainly, because it explains the shape of the literature: intersection over union is not dominant because the community deliberated and judged it the truest measure of segmentation quality, but because a widely used benchmark picked it and the rest of the field standardised on the comparison. The number is excellent for what a benchmark needs: a single, bounded, intuitive scalar that is comparable across methods. Whether it measures what any particular downstream task needs is a separate question, and it is the question the rest of this survey is about.
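A mean intersection over union of this kind is commonly computed from a confusion matrix. The following is a minimal sketch, assuming integer label maps and pixel counts pooled over all images before the per-class ratio is taken; the function names and the toy arrays are hypothetical and are not taken from any benchmark's evaluation code.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, num_classes):
    """Rows index the true class, columns the predicted class."""
    idx = y_true.ravel() * num_classes + y_pred.ravel()
    counts = np.bincount(idx, minlength=num_classes ** 2)
    return counts.reshape(num_classes, num_classes)

def per_class_iou(cm):
    """IoU per class: TP / (TP + FP + FN); NaN for classes that never occur."""
    tp = np.diag(cm).astype(float)
    fp = cm.sum(axis=0) - tp   # predicted as the class, but wrong
    fn = cm.sum(axis=1) - tp   # truly the class, but missed
    denom = tp + fp + fn
    return np.where(denom > 0, tp / np.maximum(denom, 1), np.nan)

# Two toy "images" with three classes (0 = background); values are made up.
gt1 = np.array([[0, 0, 1], [0, 2, 2]])
pr1 = np.array([[0, 0, 0], [0, 2, 1]])
gt2 = np.array([[0, 1, 1], [0, 0, 2]])
pr2 = np.array([[0, 1, 0], [0, 0, 2]])

# Pool the counts over images first, then take one ratio per class.
cm = confusion_matrix(gt1, pr1, 3) + confusion_matrix(gt2, pr2, 3)
ious = per_class_iou(cm)
mean_iou = float(np.nanmean(ious))

print("per-class IoU:", ious)
print("mean IoU:", mean_iou)
```

Averaging per-image scores instead of pooling counts is a different aggregation and can give a different number from the same predictions, which is one reason the aggregation should be stated alongside the score.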
That separate question was raised early and has never gone away. Csurka and colleagues asked directly, in 2013, whether overlap is a good evaluation measure for semantic segmentation, and argued that the per-pixel, area-weighted view that intersection over union encodes does not match human judgements of segmentation quality, particularly around object contours [4]. Their paper is the canonical early statement of the unease this survey traces: that the field's default score and the field's actual goals are not the same thing, and that the gap is largest exactly where the structures are smallest.
How this survey was assembled
The synthesis here is a structured reading of the published measurement literature, not a new experiment, and the procedure was deliberately narrow so its claims stay defensible. We started from the metric the literature reports most often, region overlap, and traced it to its two originating coefficients [1] [2] and to the benchmark that canonised it [3]. We then collected the published work that examines the limits of that metric rather than merely using it, taking three threads that the field treats as the standard correctives: the critique-and-catalogue thread that documents when overlap misleads [4] [5] [8], the normalisation thread that changes how overlap is aggregated so size cannot dominate [6] [7], and the boundary thread that restricts overlap to the contour where region scores saturate [5] [9]. For each thread we extracted what the metric measures, what it is blind to, and the regime its authors claim it for.
To keep the survey anchored to a real task rather than to abstractions, we read it against one concrete reference point drawn from the engagement archive: a three-class raster well-log segmentation problem, background plus two thin curves traced across a scanned log, where the curves are on the order of one to three pixels wide and the background occupies the overwhelming majority of every frame. The numbers we quote from that reference, the per-class intersection over union and F1 of a single trained prediction, are real and used as a worked example of the failure modes the literature describes. They are not a new benchmark, and the survey does not re-measure any published metric; it illustrates the published behaviour against a baseline where the relevant structures are as thin as they get. The interactive exhibit below is built on the same footing: real anchor numbers, with any blend across frame compositions flagged as illustrative.
What region overlap rewards, and what it hides
The reference baseline makes the central problem visible in a single line. One trained prediction, scored against its ground truth, earns an intersection over union of 0.94 on the background, 0.26 on the first curve, and 0.21 on the second; the F1 scores for the same prediction are 0.97, 0.37, and 0.32. The background, which is easy and enormous, is essentially solved. The two curves, which are the entire reason a petrophysicist would run the model, sit between a fifth and a little over a third of the way to a perfect score. Frame-averaged, the same prediction reports an intersection over union of 0.51 and an F1 of 0.55, numbers that read as mediocre-but-working and quietly conceal that the model is close to useless on the only classes the task exists to recover.
The exhibit above renders the mechanism the literature has been describing since Csurka and colleagues raised it [4]. The same prediction is scored three ways. Graded by region overlap, the headline drifts upward as the background majority grows, because a frame average is a coverage-weighted blend and the background's near-perfect score carries the most weight. Graded by set similarity, the Dice or F1 view tells the same inflated story. Only the per-class breakdown refuses to hide anything, and it shows the spread the headline averages away: a roughly four-and-a-half-fold gap between the kindest score the mask earns, 0.94 on background, and the harshest, 0.21 on the second curve. The prediction never changes. The verdict changes entirely with the rule, and the rule the field reports by default is the one most flattered by the easy class.
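The coverage-weighted blend can be sketched in a few lines. The per-class scores below are the reference baseline's; the background fractions in the sweep, and the assumption that the two curves split the remaining coverage evenly, are illustrative choices made here and are not measured frame compositions.

```python
# Per-class IoU of the reference prediction, as quoted in the text.
PER_CLASS_IOU = {"background": 0.94, "curve_1": 0.26, "curve_2": 0.21}

def coverage_weighted_mean(scores, weights):
    """Blend fixed per-class scores using each class's share of the frame."""
    total = sum(weights.values())
    return sum(scores[name] * weights[name] / total for name in scores)

def blend_for_background_fraction(bg_fraction):
    """Headline score when background covers `bg_fraction` of the frame.

    Assumption: the two curves share the remaining coverage equally.
    """
    rest = (1.0 - bg_fraction) / 2.0
    weights = {"background": bg_fraction, "curve_1": rest, "curve_2": rest}
    return coverage_weighted_mean(PER_CLASS_IOU, weights)

# Illustrative sweep: the per-class scores never change, only the weights do.
sweep = {f: blend_for_background_fraction(f) for f in (0.5, 0.8, 0.95, 0.99)}
for fraction, headline in sweep.items():
    print(f"background {fraction:.0%}: headline IoU {headline:.3f}")

# Ratio between the kindest and the harshest per-class score.
spread = max(PER_CLASS_IOU.values()) / min(PER_CLASS_IOU.values())
print(f"per-class spread: {spread:.2f}x")
```

Under these assumptions the headline climbs toward the background's 0.94 as the background share grows, while the per-class scores, and the gap between them, stay exactly where they were.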
This is not a quirk of one engagement, and the published catalogues say so. Reinke and colleagues assembled a visual story of the common failure modes of image-processing metrics, and small and thin structures recur throughout it. The mechanism is geometric: for a one-pixel-wide curve, a prediction shifted by a single pixel can lose its entire intersection while the union doubles, so region overlap punishes a near-perfect trace as if it were a gross error and can score a fattened, smeared prediction higher simply because it covers more true pixels [8]. Taha and Hanbury, surveying the metric landscape for medical segmentation, drew the same boundary between overlap measures that are dominated by region area and distance or boundary measures that are sensitive to the contour, and recommended choosing by the property the task depends on rather than by convention [5]. The thin-curve regime sits precisely where these critiques bite hardest.
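The sensitivity of thin structures to a one-pixel shift is easy to check on a synthetic stripe. The sketch below uses made-up horizontal stripes, not the well-log data, and the helper names are hypothetical.

```python
import numpy as np

def iou(a, b):
    """Intersection over union of two boolean masks."""
    union = np.logical_or(a, b).sum()
    return np.logical_and(a, b).sum() / union if union else 1.0

def stripe(width, top=20, height=64, length=64):
    """A horizontal stripe `width` pixels thick, spanning the full frame."""
    mask = np.zeros((height, length), dtype=bool)
    mask[top:top + width, :] = True
    return mask

def shifted_iou(width, shift=1):
    """IoU between a stripe and the same stripe moved down by `shift` pixels."""
    return iou(stripe(width), stripe(width, top=20 + shift))

# The same one-pixel error costs very different amounts at different widths.
results = {w: shifted_iou(w) for w in (1, 2, 3, 20)}
for width, score in results.items():
    print(f"width {width:2d} px, shifted 1 px: IoU {score:.3f}")

# A three-pixel-wide smear over a one-pixel truth outscores the exact
# trace that is off by one pixel.
truth = stripe(1, top=21)
fattened = iou(truth, stripe(3, top=20))
exact_but_shifted = iou(truth, stripe(1, top=22))
print(f"fattened: {fattened:.3f}, exact but shifted: {exact_but_shifted:.3f}")
```

At one pixel wide the shifted trace scores zero; at twenty pixels wide the same shift costs under a tenth of the score, which is the area bias in miniature.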
Where region overlap stops being the right ruler
Read together, the correctives the field has proposed are not competing replacements for intersection over union but three different ways of putting back what the frame-averaged region score throws away.
The first corrective changes the aggregation so size cannot dominate. The Cityscapes benchmark observed that a pixel-level mean intersection over union is biased toward large instances, since big objects contribute proportionally more pixels to the score, and introduced an instance-level variant that weights each object more evenly so a small but correctly segmented instance is not buried under a large one [6]. The broader version of this point is Maier-Hein and colleagues' demonstration that how scores are aggregated across cases, not only which metric is used, can reorder a competition's rankings, which means an evaluation that does not state its aggregation has not really stated its result [7]. For the well-log task the implication is direct: reporting one frame-averaged number for a frame that is mostly background is choosing the aggregation that flatters the model most.
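The effect of aggregation on the reported number can be illustrated with two objects of very different size. This is a simplified sketch of the size bias, not the Cityscapes instance-level formula, which reweights pixel contributions by instance size; the class name and the pixel counts are invented for the example.

```python
from dataclasses import dataclass

@dataclass
class InstanceCounts:
    """Pixel counts for one object: true positives, false positives, misses."""
    tp: int
    fp: int
    fn: int

    def iou(self):
        return self.tp / (self.tp + self.fp + self.fn)

def pooled_iou(instances):
    """Pool pixels over all objects, then take one ratio (size dominates)."""
    tp = sum(i.tp for i in instances)
    total = sum(i.tp + i.fp + i.fn for i in instances)
    return tp / total

def mean_instance_iou(instances):
    """Score each object separately, then average (each object counts once)."""
    return sum(i.iou() for i in instances) / len(instances)

# Hypothetical counts: one large, well-segmented object and one small,
# badly segmented one.
large = InstanceCounts(tp=9500, fp=300, fn=200)
small = InstanceCounts(tp=10, fp=20, fn=30)

pooled = pooled_iou([large, small])
per_instance = mean_instance_iou([large, small])
print(f"pixel-pooled IoU:  {pooled:.3f}")
print(f"mean instance IoU: {per_instance:.3f}")
```

The predictions are identical in both rows of output; only the aggregation differs, and the pooled figure barely registers that the small object was largely missed.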
The second corrective changes what the overlap is computed over. Boundary intersection over union restricts the overlap calculation to a band around the object contour, so the score measures whether the edge is in the right place rather than whether the bulk area is covered, and the authors show it stays sensitive precisely in the large-object, high-region-IoU regime where the plain score saturates and stops distinguishing good predictions from very good ones [9]. A thin curve is almost all boundary and almost no interior, so a boundary-aware view is close to the natural measure of how well the curve was traced, and it is far less forgiving of the fattened predictions that region overlap quietly rewards.
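A boundary-restricted overlap can be approximated with a plain erosion. The sketch below is a simplified NumPy-only version, assuming a 4-connected erosion and a band width `d` in pixels; it is not the reference implementation from [9], and the function names and test shapes are hypothetical.

```python
import numpy as np

def erode(mask, iterations):
    """4-connected binary erosion; pixels outside the image count as empty."""
    m = mask.astype(bool)
    for _ in range(iterations):
        p = np.pad(m, 1, constant_values=False)
        m = (p[1:-1, 1:-1] & p[:-2, 1:-1] & p[2:, 1:-1]
             & p[1:-1, :-2] & p[1:-1, 2:])
    return m

def boundary_band(mask, d):
    """Mask pixels that lie within roughly d pixels of the mask's own edge."""
    mask = mask.astype(bool)
    return mask & ~erode(mask, d)

def iou(a, b):
    union = np.logical_or(a, b).sum()
    return np.logical_and(a, b).sum() / union if union else 1.0

def boundary_iou(gt, pred, d):
    """Overlap computed only over the two boundary bands."""
    return iou(boundary_band(gt, d), boundary_band(pred, d))

# A large square and the same square shifted down by two pixels.
gt = np.zeros((64, 64), dtype=bool)
gt[10:50, 10:50] = True
pred = np.zeros((64, 64), dtype=bool)
pred[12:52, 10:50] = True

region = iou(gt, pred)
boundary = boundary_iou(gt, pred, d=2)
print(f"region IoU {region:.3f}, boundary IoU {boundary:.3f}")

# Share of each shape that falls inside its own boundary band.
thin = np.zeros((64, 64), dtype=bool)
thin[30:32, :] = True
square_band_share = boundary_band(gt, 2).sum() / gt.sum()
thin_band_share = boundary_band(thin, 2).sum() / thin.sum()
print(f"band share: square {square_band_share:.2f}, thin {thin_band_share:.2f}")
```

On the large square the region score stays above 0.9 while the band score drops to a third, and the two-pixel stripe lies entirely inside its own band, which is the sense in which a thin curve is almost all boundary.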
The third corrective is the least technical and the most often skipped: report the per-class scores and refuse to average them into a single headline. The reference baseline is the argument for it, because the only honest description of that prediction is the disaggregated one, 0.94 and 0.26 and 0.21, not the 0.51 that the frame average produces. This is the recommendation that the most recent of the period's syntheses, the Metrics Reloaded framework, builds into a process: choose the validation metric from the problem, state what each metric is blind to, and treat a single aggregated overlap score as a starting point for inquiry rather than a verdict [10]. The framework is a 2022 statement of an argument the field had been making for the better part of a decade, and it lands on the same conclusion this survey reaches from the other direction.
Where our own work sits relative to this literature is worth marking, since it is the line between this survey and our applied writing. This is a reading of how the public field measures overlap. Our use of these metrics on the raster-log task is downstream of the survey: we report the per-class intersection over union and F1 above precisely because the literature surveyed here establishes that the frame-averaged number would misrepresent a thin-structure result, and our separate decision to grade the model on the reconstructed one-dimensional curve rather than on pixel overlap at all is the same critique carried one step further, to a metric that lives in the deliverable's space instead of the mask's. The survey explains why those choices are not idiosyncratic; the field has been arguing for them since 2013.
Limitations
This is a survey and inherits a survey's limits. It synthesises what the published metric literature reports and does not re-implement or re-measure any of the metrics it discusses; where it quotes numbers, those are the real per-class and frame-averaged scores of a single prediction from one engagement and one architecture, used as a worked illustration rather than as a fresh benchmark of competing metrics. The reference task is deliberately extreme, a three-class problem with one-to-three-pixel curves and a background that swamps the frame, which is exactly the regime where region overlap fails most visibly, so the size of the spread shown here should be read as the high end of the effect, not as a typical gap on tasks with thicker foregrounds or milder imbalance. The interactive exhibit's coverage blend, the way the frame-averaged headline moves as the background fraction is dragged, is the standard coverage-weighted mean of the fixed per-class scores and is flagged as illustrative geometry; the true per-frame composition varies log to log and was not re-measured across the sweep. The survey also scopes itself to the region-overlap family and the three correctives the period's literature treats as canonical, and it stops at the close of its own quarter, so learned, calibration-aware, and uncertainty-aware evaluation measures that the field has since explored are out of frame. A reader should take this as a map of how the field measures overlap and where that measurement misleads, not as a substitute for choosing and validating the metric against their own task.
What to carry from the survey
- Region overlap, expressed as intersection over union (the Jaccard index) or the Dice coefficient (which equals the F1 score), dominates segmentation evaluation because a single widely used benchmark adopted it, not because the field judged it the truest measure of quality.
- The metric is biased toward area. On a real three-class raster-log prediction the same mask earns IoU 0.94 on the easy background but only 0.26 and 0.21 on the two thin curves, and F1 0.97 against 0.37 and 0.32, a roughly four-and-a-half-fold spread that the frame-averaged 0.51 IoU / 0.55 F1 hides entirely.
- The unease is old and documented: Csurka and colleagues questioned overlap as an evaluation measure in 2013, and the picture-story and Metrics Reloaded catalogues show small and thin structures as the recurring failure mode of region scores.
- The field's three correctives are not replacements but restorations of what the frame average throws away: instance-level aggregation so size cannot dominate, boundary-restricted overlap that is sensitive where region IoU saturates, and honest per-class reporting instead of one headline number.
- For thin foregrounds a frame-averaged region-overlap score is a choice that flatters the easy class. The defensible report is the disaggregated one, chosen from the task; grading the actual deliverable rather than the pixel mask carries the same critique one step further.
The single habit this survey would change is small and almost clerical: before quoting an overlap score, name the class it belongs to and the aggregation that produced it. A frame-averaged intersection over union on a frame that is mostly background is not a measurement of the model so much as a measurement of the background, and the gap between those two readings is exactly the width of the thin structure the task was built to find.
References
[1] Jaccard, P. The distribution of the flora in the alpine zone. New Phytologist, 11(2), 37-50 (1912). The origin of the similarity index that the segmentation field reports as intersection over union. https://doi.org/10.1111/j.1469-8137.1912.tb05611.x
[2] Dice, L. R. Measures of the amount of ecologic association between species. Ecology, 26(3), 297-302 (1945). The overlap coefficient later rediscovered as the Dice segmentation loss and, equivalently, the F1 score. https://doi.org/10.2307/1932409
[3] Everingham, M., Van Gool, L., Williams, C. K. I., Winn, J., and Zisserman, A. The PASCAL Visual Object Classes (VOC) Challenge. IJCV, 88(2), 303-338 (2010). The benchmark that canonised mean intersection over union as the segmentation score. https://doi.org/10.1007/s11263-009-0275-4
[4] Csurka, G., Larlus, D., and Perronnin, F. What is a good evaluation measure for semantic segmentation? BMVC (2013). The early, direct interrogation of whether overlap matches human judgements of segmentation quality, especially at contours. https://doi.org/10.5244/C.27.32
[5] Taha, A. A., and Hanbury, A. Metrics for evaluating 3D medical image segmentation: analysis, selection, and tool. BMC Medical Imaging, 15, 29 (2015). A catalogue separating area-dominated overlap measures from boundary and distance measures, with guidance on when each applies. https://doi.org/10.1186/s12880-015-0068-x
[6] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., and Schiele, B. The Cityscapes Dataset for Semantic Urban Scene Understanding. CVPR (2016). Introduced an instance-level intersection over union to counter the pixel-level score's bias toward large objects. https://arxiv.org/abs/1604.01685
[7] Maier-Hein, L., Eisenmann, M., Reinke, A., Onogur, S., Stankovic, M., et al. Why rankings of biomedical image analysis competitions should be interpreted with care. Nature Communications, 9, 5217 (2018). Demonstrated that metric and aggregation choices, not only model quality, reorder leaderboards. https://doi.org/10.1038/s41467-018-07619-7
[8] Reinke, A., Eisenmann, M., Tizabi, M. D., Sudre, C. H., Radsch, T., et al. Common Limitations of Image Processing Metrics: A Picture Story (2021). A visual catalogue of overlap-metric failure modes, with small and thin structures recurring throughout. https://arxiv.org/abs/2104.05642
[9] Cheng, B., Girshick, R., Dollár, P., Berg, A. C., and Kirillov, A. Boundary IoU: Improving Object-Centric Image Segmentation Evaluation. CVPR (2021). A boundary-restricted overlap measure that stays sensitive where region intersection over union saturates. https://arxiv.org/abs/2103.16562
[10] Maier-Hein, L., Reinke, A., Godau, P., Tizabi, M. D., Buettner, F., et al. Metrics Reloaded: Pitfalls and Recommendations for Image Analysis Validation (2022). A framework for selecting validation metrics from the problem and stating what each metric is blind to. https://arxiv.org/abs/2206.01653