Reading IoU and Dice Scores Without Fooling Yourself

A segmentation model ships with one number attached to it, and that number does most of the talking. Someone asks how the digitiser is doing, and the answer comes back as an IoU or a Dice score, a single figure between zero and one that is supposed to summarise a mask. On our raster well-log work the figures that get quoted are a peak IoU of 0.51 and a peak F1 of 0.55 across the loss-function ablation behind VeerNet, the encoder-decoder EarthScan uses to lift curves off scanned paper logs. Read cold, those numbers describe a model that is roughly halfway to good. This note is about why that reading is wrong, and about the one habit that keeps it honest: never accept an averaged overlap score without asking what it averaged.

The problem is not the metric. IoU is the area where the predicted and true regions intersect divided by the area of their union, and Dice is its close cousin: twice the intersection divided by the sum of the two region sizes. The problem is that a task with more than one class does not produce one IoU. It produces one per class, and the headline you get handed is a pooled version of several. The pooling is where the deception lives, because the classes are not equally hard, and on a thin-curve log they are not remotely equally sized.
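To make "one IoU per class" concrete, here is a minimal sketch that computes IoU and Dice separately for each label in a small label map. The function name `per_class_iou_dice` and the toy 4x6 arrays are ours, invented for illustration; they are not taken from the VeerNet code or the engagement archive.

```python
import numpy as np

def per_class_iou_dice(y_true, y_pred, num_classes):
    """Return (iou, dice) arrays with one entry per class label."""
    iou = np.full(num_classes, np.nan)
    dice = np.full(num_classes, np.nan)
    for c in range(num_classes):
        t = (y_true == c)                       # true region for class c
        p = (y_pred == c)                       # predicted region for class c
        inter = np.logical_and(t, p).sum()
        union = np.logical_or(t, p).sum()
        if union > 0:                           # class absent everywhere stays NaN
            iou[c] = inter / union
            dice[c] = 2 * inter / (t.sum() + p.sum())
    return iou, dice

# Toy label map (hypothetical): 0 = background, 1 and 2 = two curve traces.
y_true = np.array([
    [0, 0, 1, 0, 0, 2],
    [0, 0, 1, 0, 0, 2],
    [0, 0, 1, 0, 0, 2],
    [0, 0, 1, 0, 0, 2],
])
# Prediction with one misplaced pixel on each curve.
y_pred = np.array([
    [0, 0, 1, 0, 0, 2],
    [0, 0, 1, 0, 2, 0],
    [0, 1, 0, 0, 0, 2],
    [0, 0, 1, 0, 0, 2],
])

iou, dice = per_class_iou_dice(y_true, y_pred, num_classes=3)
for name, i, d in zip(["background", "curve-1", "curve-2"], iou, dice):
    print(f"{name:10s} IoU={i:.3f} Dice={d:.3f}")
```

Even in this toy, one misplaced pixel per curve costs each curve class far more overlap than it costs the background. For any single mask pair the two scores are tied by Dice = 2 IoU / (1 + IoU), so they rank a class the same way; scores that have been averaged over images need not satisfy that identity exactly.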

The headline is a blend, and the blend is mostly background

Multiclass curve segmentation on a well log has three classes: the background, and two curve traces. Under Dice loss on our multiclass run, the per-class IoU comes out as background 0.94, curve-1 0.26, and curve-2 0.21. Those three numbers describe three completely different situations. The background is a large, contiguous, high-contrast region that any competent network segments almost perfectly. The two curves are one-pixel-wide traces that wander across the frame, and the model is weak on both.

Now pool them. The convention that gives a single figure is mean IoU, the per-class overlap averaged across classes, which is how the PASCAL VOC challenge defined the summary that most of the field inherited [1]. But there is more than one way to pool, and the choice is not cosmetic. Long, Shelhamer, and Darrell reported both a plain mean IoU and a frequency-weighted IoU precisely because weighting the per-class overlap by how many pixels each class occupies produces a different headline from weighting each class equally [2]. On a thin-curve log those two poolings are worlds apart, because the background is almost the entire image and the curves are a sliver of it. Weight by pixel count and the background's 0.94 swallows the average whole. Weight the classes equally and the curves get an honest third of the vote each.

Per-class IoU for the Dice-loss multiclass run, drawn on one axis: background 0.94, curve-1 0.26, curve-2 0.21. The background class is trivial to segment because it is almost all of a thin-curve raster, so when the three per-class scores are collapsed into a single headline the answer depends entirely on how much weight the average hands that easy class. Drag the lever from equal three-class weighting toward pixel-count weighting, where background dominates the frame, and the orange headline rule floats up off the two curve bars toward the background bar. Slide it back to equal three-class weighting and the headline drops to 0.47, still above the honest curve-only reading of 0.235, the mean of the two curve IoUs. Per-class F1 (0.97 background against 0.37 and 0.32 on the curves) tells the same story. The widely quoted peaks (IoU 0.51, F1 0.55, recall 0.97) sit above both curve classes, and the recall in particular is high because on a one-pixel trace the model finds and smears the curve rather than overlapping it cleanly. The orange headline is the only element that argues: the same model reads strong or weak depending only on the weighting the average silently chose. The per-class scores and the peaks are sourced from the engagement archive; the background pixel share that sets the flattering end of the lever is an illustrative input, not a logged number.

The exhibit is the two poolings made into a lever you can drag. The three per-class IoU bars sit on one axis. Slide the weighting toward pixel-count, where the background dominates the frame, and the orange headline rule floats up off the curve bars and toward the background bar. Slide it back to equal three-class weighting and the headline drops to 0.47; the honest curve-only reading, the mean of the two curve IoUs, is lower still at 0.235. Same model, same masks, same run. The only thing that changed is how much credit the average handed the class that was never in doubt.
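The arithmetic behind the lever fits in a few lines. The sketch below pools the three archive per-class IoUs under equal weights and under pixel-share weights. The pixel shares (98% background, 1% per curve) are an assumption made up for this example, in the same spirit as the illustrative input behind the exhibit; they are not logged counts, and `pooled_iou` is a hypothetical helper name.

```python
import numpy as np

# Per-class IoU from the Dice-loss multiclass run: background, curve-1, curve-2.
per_class_iou = np.array([0.94, 0.26, 0.21])

def pooled_iou(per_class, weights):
    """Weighted average of per-class IoU; weights are normalised to sum to 1."""
    w = np.asarray(weights, dtype=float)
    return float(np.dot(w / w.sum(), per_class))

# Equal weighting: every class gets a third of the vote (plain mean IoU).
equal_weight = pooled_iou(per_class_iou, [1, 1, 1])

# Curve-only reading: drop the background and average the two traces.
curve_only = float(per_class_iou[1:].mean())

# ASSUMPTION: illustrative pixel shares for a thin two-curve raster,
# not measured from any specific image.
assumed_pixel_share = [0.98, 0.01, 0.01]
freq_weighted = pooled_iou(per_class_iou, assumed_pixel_share)

print(f"equal-weight mean IoU : {equal_weight:.3f}")   # 0.470
print(f"curve-only mean IoU   : {curve_only:.3f}")     # 0.235
print(f"pixel-share weighted  : {freq_weighted:.3f}")  # 0.926 under the assumed shares
```

Under these assumed shares the same three per-class scores read as 0.926, 0.470, or 0.235 depending only on the weights. A different background share moves the first figure but not the pattern.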

Why F1 and recall do not rescue the reading

The instinct at this point is to reach for a second metric, and the archive has them. The per-class F1 under Dice loss is background 0.97, curve-1 0.37, curve-2 0.32, and the ablation records a peak recall of 0.97. It is tempting to quote the recall and feel better, and this is exactly the trap the honest reading is built to catch.

Recall on a one-pixel trace is easy to earn for a bad reason. A model that predicts a curve two or three pixels wide where the truth is one pixel wide will recover almost every true curve pixel, because the truth sits inside the fat prediction. That inflates recall toward 0.97 while doing nothing for overlap, because IoU divides the intersection by the union, and the union now includes all the extra pixels the model smeared on. So a high recall paired with a curve IoU near 0.24 is not a contradiction. It is the signature of a model that finds the trace and then over-draws it. Read recall alone and you would conclude the model rarely misses a curve, which is true and beside the point. The per-class overlap tells you the shape of the recovered curve is loose, and the shape is what a petrophysicist reads.
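The over-drawing effect can be reproduced on a synthetic mask. The sketch below draws a one-pixel-wide wandering trace, then a prediction that covers the same trace three pixels wide. The sine-shaped trace and the 64x64 frame are arbitrary choices for illustration; this is not VeerNet output.

```python
import numpy as np

h, w = 64, 64
rows = np.arange(h)
# Synthetic one-pixel-wide trace: one true pixel per row, wandering left and right.
cols = (32 + np.round(10 * np.sin(rows / 8))).astype(int)

truth = np.zeros((h, w), dtype=bool)
truth[rows, cols] = True

# Over-drawn prediction: the same trace, but three pixels wide in every row.
pred = np.zeros((h, w), dtype=bool)
for dc in (-1, 0, 1):
    pred[rows, cols + dc] = True

tp = np.logical_and(truth, pred).sum()    # true curve pixels that were found
fp = np.logical_and(~truth, pred).sum()   # extra pixels smeared around the curve
fn = np.logical_and(truth, ~pred).sum()   # true curve pixels that were missed

recall = tp / (tp + fn)
precision = tp / (tp + fp)
iou = tp / (tp + fp + fn)
f1 = 2 * tp / (2 * tp + fp + fn)

print(f"recall={recall:.2f} precision={precision:.2f} IoU={iou:.2f} F1={f1:.2f}")
# recall=1.00 precision=0.33 IoU=0.33 F1=0.50
```

Every true pixel is recovered, so recall is perfect, yet two of every three predicted pixels are smear and the overlap is one third. A real model's errors are messier than a uniform three-pixel band, but the direction of the effect is the same.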

The per-class F1 tells the same story, and for the same reason: background 0.97 against 0.37 and 0.32 on the curves is the easy class carrying a blended headline. No second metric rescues you if you pool it the same careless way. What rescues you is reading it per class, so the gap between the 0.9-something background and the 0.3-something curves is visible instead of averaged into a comfortable middle.

This is a property of thin structures, not of our model

It would be easy to read all this as a confession that the model is bad. That is not the claim, and the metric literature is clear that the effect is structural. Csurka, Larlus, and Perronnin showed that pooled overlap measures reward getting the dominant, easy region right and can be nearly insensitive to errors on small structures, which is the regime a curve trace lives in [3]. Taha and Hanbury made the mechanical point sharper: when the true region is a few pixels thin, a single pixel of boundary slack swings the overlap score hard, because the denominator is tiny [4]. A curve is the thinnest structure there is, one pixel wide by construction, so its IoU is fragile in a way the background's never is. The 0.94 is stable because the region is huge; the 0.26 and 0.21 are jumpy because the regions are slivers.
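A toy geometric check shows how unevenly the same one-pixel error lands on thin and bulky regions. The sketch shifts three vertical bands of different widths sideways by one pixel and scores each shifted copy against the original. The widths and the 100x100 frame are arbitrary illustrations of the mechanism, not an experiment from [3] or [4].

```python
import numpy as np

def iou(a, b):
    """Intersection over union of two boolean masks."""
    return np.logical_and(a, b).sum() / np.logical_or(a, b).sum()

def shift_right(mask, k=1):
    """Return a copy of the mask moved k pixels to the right."""
    out = np.zeros_like(mask)
    out[:, k:] = mask[:, :-k]
    return out

def vertical_band(width, size=100, start=10):
    """Boolean mask with a full-height band `width` pixels wide."""
    mask = np.zeros((size, size), dtype=bool)
    mask[:, start:start + width] = True
    return mask

# Score each band against itself after a one-pixel sideways slip.
results = {}
for width in (1, 3, 80):
    band = vertical_band(width)
    results[f"{width} px"] = float(iou(band, shift_right(band, 1)))

for label, score in results.items():
    print(f"band {label:>5s}: IoU after 1 px shift = {score:.3f}")
# band  1 px: IoU after 1 px shift = 0.000
# band  3 px: IoU after 1 px shift = 0.500
# band 80 px: IoU after 1 px shift = 0.975
```

The identical one-pixel slip costs the wide band about two and a half points of IoU and costs the one-pixel trace everything.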

So the averaged headline is not just optimistic; it is optimistic in a predictable direction every time. On any thin-structure task, the bulk class is near-perfect, the thin signal classes are weak, and any pooling that lets the bulk class vote by pixel count reports a flattering number. The 0.51 peak IoU is not a lie. It is a true average of one number that was never at risk and two that carry all the risk, and reading it as a summary of the whole is how you fool yourself.

The habit

The discipline this leaves us with is small and mechanical. When a segmentation result arrives as a single overlap number, we do not read it until we have split it back into per-class scores, and we look hardest at the classes that are both hard and small, because those are the ones the average was built to hide. On this task that means reading curve-1 and curve-2 IoU directly, treating the background score as a check that the trivial thing is trivial rather than as evidence of anything, and refusing to let a pixel-count-weighted mean stand in for a judgment about the curves. It also means quoting recall and overlap together, never recall alone.

None of this is a new metric or a clever trick. It is the ordinary definitions of IoU and Dice, read at the resolution they were defined at, which is per class [1]. The averaging is a convenience for leaderboards, harmless on a balanced scene. On a thin-curve log it is the difference between a 0.51 that soothes and a 0.24 that tells you where the next month of work goes.

Limitations

The per-class IoU and F1 figures, the peak IoU of 0.51, the peak F1 of 0.55, and the peak recall of 0.97 are the real archive numbers from the Dice-loss multiclass run and the ablation that surrounds it. The instrument's weighting lever, however, is a teaching device: the pixel-count end assumes a background pixel share that is illustrative of a thin two-curve trace rather than a logged count for a specific image, so the exact headline it shows at the pixel-count extreme is a plausible blend, not a measured one. The figures it is anchored to, the equal-weight three-class mean, the curve-only mean, and the per-class bars, are exact, because they follow directly from the archive numbers. This note is also about how to read a metric, not about which metric to prefer; the separate and larger question of whether a mask-overlap score is even the right thing to grade a curve digitiser on, as opposed to the reconstructed curve it exports, is a different argument that this note does not make. And per-class overlap tells you where the model is weak, not why. It does not say whether the weakness is the loss function, the synthetic training distribution, the thinness of the traces, or the label quality, each of which is its own investigation.

References

[1] Everingham, M., Van Gool, L., Williams, C. K. I., Winn, J., and Zisserman, A. The PASCAL Visual Object Classes (VOC) Challenge. International Journal of Computer Vision 88 (2010), pp. 303-338. The reference definition of IoU as a per-class overlap and the mean-IoU convention that averages it across classes. https://link.springer.com/article/10.1007/s11263-009-0275-4

[2] Long, J., Shelhamer, E., and Darrell, T. Fully Convolutional Networks for Semantic Segmentation. CVPR (2015). Reports mean IoU alongside frequency-weighted IoU, making explicit that how you weight per-class overlap when you pool it changes the headline number. https://arxiv.org/abs/1411.4038

[3] Csurka, G., Larlus, D., and Perronnin, F. What is a good evaluation measure for semantic segmentation? BMVC (2013). Shows that pooled overlap measures reward getting the dominant, easy region right and can be nearly insensitive to errors on small structures. https://www.bmva.org/bmvc/2013/Papers/paper0032/index.html

[4] Taha, A. A., and Hanbury, A. Metrics for evaluating 3D medical image segmentation: analysis, selection, and tool. BMC Medical Imaging 15 (2015), article 29. How overlap metrics behave on thin and small structures, where a single boundary pixel of slack swings the score. https://bmcmedimaging.biomedcentral.com/articles/10.1186/s12880-015-0068-x
