Why a 0.51 IoU Can Still Ship a Useful Product

Read the segmentation scoreboard and the model behind VeerNet looks like a miss. Peak intersection over union of 0.51, peak F1 of 0.55: on any grading instinct trained by leaderboards, those are numbers you apologise for. If the mask were the deliverable, the review would be over. But the mask is not the deliverable. VeerNet is the encoder-decoder EarthScan uses to lift well-log curves off scanned paper, and what a petrophysicist opens at the end is not a mask. It is a depth-indexed curve, the recovered trace of a logging tool's reading down a borehole, and on the Tversky curve-1 validation example that recovered curve fits the reference at R-squared 0.9891. The scoreboard says fail. The product says ship. This note is about why both are true at once, and which one you are supposed to believe.

The mask is a means; the curve is the end

A curve-digitisation pipeline has a segmentation model in the middle of it, not at the end. The model paints, per pixel, where on the scanned image a given curve runs. That painted mask is then handed to two more steps the segmentation metric never sees. First, centreline extraction reduces the painted band to a single trace, the one-pixel-wide spine down the middle of the mask, because a curve is a line and the mask is a ribbon. Second, that trace is validated and resampled into a smooth depth-indexed function, across 300 interpolated depth points on the validation notebooks, so the output is a curve a petrophysicist can read at any depth rather than a jagged pixel path. Only after those two steps does anything the product cares about exist.
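The two post-mask steps can be sketched in a few lines. This is a minimal illustration under stated assumptions, not EarthScan's implementation: it assumes depth runs down the image rows, substitutes a per-row mean of painted columns for a real skeletonisation, and uses linear interpolation where the pipeline validates with a spline. The function names and the synthetic sine-shaped curve are invented for the example.

```python
import numpy as np

def centreline(mask):
    """Collapse a painted band to one column position per painted row.

    Assumption: depth runs down the rows, and the per-row mean of painted
    columns stands in for a proper skeletonisation step.
    """
    rows = np.flatnonzero(mask.any(axis=1))
    cols = np.array([np.flatnonzero(mask[r]).mean() for r in rows])
    return rows.astype(float), cols

def resample(rows, cols, n_points=300):
    """Interpolate the trace onto n_points evenly spaced depths.

    Linear interpolation here; a production pipeline would fit a spline.
    """
    depths = np.linspace(rows[0], rows[-1], n_points)
    return depths, np.interp(depths, rows, cols)

def r_squared(reference, recovered):
    """Coefficient of determination of the recovered curve against the reference."""
    ss_res = np.sum((reference - recovered) ** 2)
    ss_tot = np.sum((reference - reference.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# Synthetic example: a sine-shaped curve painted as a 7-pixel-wide band.
n_rows, n_cols = 400, 128
row_idx = np.arange(n_rows)
true_cols = 60 + 25 * np.sin(row_idx / 40)
col_idx = np.arange(n_cols)
mask = np.abs(col_idx[None, :] - np.round(true_cols)[:, None]) <= 3

rows, cols = centreline(mask)
depths, recovered = resample(rows, cols)
reference = 60 + 25 * np.sin(depths / 40)
fit = r_squared(reference, recovered)
```

On this clean synthetic band `fit` lands very close to 1. The point is the shape of the computation: the score is taken on the resampled curve, after the mask has already been discarded.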

Intersection over union scores the first artefact and is blind to the last two. It asks what fraction of the painted ribbon's pixel area agrees with the reference ribbon. Taha and Hanbury, cataloguing overlap-based metrics, are explicit that a spatial-overlap score measures agreement of regions and does not by itself certify that the segmentation is fit for a given downstream use, which depends on what you do with the region afterward [2]. A ribbon two pixels too fat on both sides tanks its IoU, because the fattened area is all false positive, while its centreline, the mid-line of a symmetric band, lands in almost exactly the same place. The overlap penalty and the centreline error are not the same quantity, and it is the centreline that flows to the curve.
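The fat-ribbon case is easy to check numerically. The sketch below is a toy, assuming a straight vertical curve, a 3-pixel reference ribbon, and a per-row mean as the centreline; all names and sizes are made up for the illustration.

```python
import numpy as np

def band(centre, half_width, n_cols=64):
    """Boolean mask: each row is painted within half_width of its centre column."""
    cols = np.arange(n_cols)
    return np.abs(cols[None, :] - centre[:, None]) <= half_width

def iou(pred, ref):
    """Intersection over union of two boolean masks."""
    return np.logical_and(pred, ref).sum() / np.logical_or(pred, ref).sum()

def row_centres(mask):
    """Mean painted column per row (a stand-in for centreline extraction)."""
    cols = np.arange(mask.shape[1])
    return (mask * cols).sum(axis=1) / mask.sum(axis=1)

centre = np.full(100, 32)          # a straight vertical curve at column 32
ref = band(centre, half_width=1)   # reference ribbon, 3 px wide
fat = band(centre, half_width=3)   # 2 px too fat on each side, 7 px wide

overlap = iou(fat, ref)                                        # 3/7, about 0.43
max_shift = np.abs(row_centres(fat) - row_centres(ref)).max()  # 0.0
```

The symmetric over-painting drops the overlap to 3/7 while moving the centreline by nothing at all. Real masks are not perfectly symmetric, so the shift is small rather than zero, but the two quantities still move independently.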

Why the number the task is scored on is the wrong number

This is not a local excuse for a weak model. It is a specific case of a general problem the metrics literature has been sharpening for years: the metric a task is scored on is frequently not the metric the task is for. The Metrics Reloaded framework makes the argument at full strength: metric selection has to be driven by the underlying domain interest rather than by whatever score is conventional for the problem shape, and scoring a task on a convenient overlap metric that does not reflect the actual objective is a documented, recurring pitfall [1]. Our objective is a correct curve. Overlap is the convenient score for a segmentation shape. The two come apart precisely when the error lies along a dimension the product never reads. For a thin curve painted as a band, that dimension is band width, which thinning discards, so they come apart most of the time.

There is a second, blunter reason not to trust the single overlap number: it is fragile even as a summary of segmentation quality on its own terms. Maier-Hein and colleagues, examining biomedical image-analysis competitions, showed how much an algorithm's apparent standing moves when you change the metric or the ranking scheme, and warned against reading a single such number as a stable verdict on a model's worth [3]. If overlap is a shaky basis even for ranking two segmenters, it is a far shakier basis for the decision we actually face, which is not "which mask is better" but "does the curve that comes out the far end of this pipeline match the log."

Exhibit: Two metrics, two jobs. On the left the model is scored on raw pixel overlap: peak IoU 0.51 and peak F1 0.55, both sitting below the line an eye instinctively treats as good, so the mask looks weak. On the right the product is judged on the depth-indexed curve recovered after centreline extraction and spline validation, scored by R-squared, which reaches 0.9891 on the Tversky curve-1 example and clears the 0.85 acceptance arc a petrophysicist would sign off on. The recall lever sets how much continuity the extractor is asked to preserve through the mask; recall up to 0.97 keeps the centreline unbroken, so the orange verdict needle holds in the accept band. Drag recall down and continuity breaks, the recovered fit decays, and the needle drops out of accept. The orange needle is the only element that argues: usefulness tracks the recovered curve, not the overlap the mask was scored on. IoU 0.51, F1 0.55, the 0.9891 R-squared, the 300 interpolated depth points, and the 0.97 recall are sourced from the engagement archive; the 0.75 and 0.85 review lines and the recall-to-R-squared decay are illustrative, anchored on the sourced end points.

What carries the curve through a mediocre mask

If the overlap is 0.51 and the curve fit is 0.9891, something is doing the work between them. The quantity that carries the curve is recall, not overlap. On the binary segmentation the model reached recall up to 0.97, and recall governs continuity: it is the fraction of the true curve pixels the mask actually caught. A mask can have poor union with the reference, because it over-paints and racks up false positives, while still catching almost every pixel of the real curve. That continuity is exactly what centreline extraction needs. The extractor can thin a band that is too wide, and the spline can absorb a trace that is locally noisy, but neither can invent a curve where the mask left a hole. A break forces the spline to bridge a gap it has no evidence for, and that is where recovered curves actually go wrong. High recall keeps the curve unbroken; the fat, low-IoU ribbon is the cosmetic cost of buying that continuity, and the downstream steps are built to pay it.

That is the whole mechanism of the gap. Precision errors, the false-positive pixels that widen the band, cost IoU dearly and cost the recovered curve almost nothing, because thinning removes them. Recall errors, the missed pixels that break the band, cost the recovered curve everything, because bridging fabricates signal. IoU folds both into one number weighted by area; the product cares about only one, weighted by whether it breaks continuity. A model tuned to preserve recall at the expense of precision, which is what a class-weighted loss on a thin target produces, lands a low IoU and a high recovered R-squared at once, on purpose.
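The asymmetry between the two error types can be made concrete with pixel counts. The sketch below compares an over-painted mask (precision error only) with a tight but broken one (recall error only) on a toy straight curve. It also includes the Tversky index as one example of a loss family that can weight missed pixels more heavily than extra ones. The note does not state which loss weights were used, so the 0.1 and 0.9 below are invented purely for illustration, as are all names and sizes.

```python
import numpy as np

def counts(pred, ref):
    """True-positive, false-positive and false-negative pixel counts."""
    tp = int(np.logical_and(pred, ref).sum())
    fp = int(np.logical_and(pred, ~ref).sum())
    fn = int(np.logical_and(~pred, ref).sum())
    return tp, fp, fn

def tversky(tp, fp, fn, alpha, beta):
    """Tversky index: alpha weights false positives, beta weights false negatives.

    alpha = beta = 1 gives IoU; alpha = beta = 0.5 gives Dice/F1.
    """
    return tp / (tp + alpha * fp + beta * fn)

n_rows, n_cols = 100, 64
cols = np.arange(n_cols)
ref = np.tile(np.abs(cols - 32) <= 1, (n_rows, 1))   # 3 px reference ribbon
fat = np.tile(np.abs(cols - 32) <= 3, (n_rows, 1))   # over-painted: precision error only
gapped = ref.copy()
gapped[40:60] = False                                # tight but broken: recall error only

report = {}
for name, pred in {"fat": fat, "gapped": gapped}.items():
    tp, fp, fn = counts(pred, ref)
    report[name] = {
        "recall": tp / (tp + fn),
        "precision": tp / (tp + fp),
        "iou": tversky(tp, fp, fn, 1.0, 1.0),
        # Hypothetical recall-favouring weights, not taken from the engagement.
        "recall_weighted": tversky(tp, fp, fn, 0.1, 0.9),
        # Rows where the extractor would have nothing to thin and must bridge.
        "rows_without_curve": int((~pred.any(axis=1)).sum()),
    }
```

In this toy, IoU ranks the gapped mask (0.80) well above the fat one (about 0.43), even though only the gapped mask leaves a 20-row hole the spline would have to bridge. The recall-weighted score ranks them the other way round, which is the ordering the downstream steps care about.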

Reading the exhibit

The exhibit above sets the two scoreboards side by side and lets you drive the one variable that matters. On the left, the raw overlap bars sit under the line an eye reads as good; on the right, the verdict dial's needle reads the recovered curve R-squared against the teal acceptance band. The lever is preserved recall. Drag it up toward the sourced 0.97 and the centreline stays continuous, the fit holds near 0.9891, and the orange needle sits inside accept: a mediocre-looking mask shipping a usable curve. Drag recall down and continuity breaks, the fit decays as the spline bridges gaps, and the needle falls out of accept. The needle is the only element that argues, and its swing is the whole thesis: usefulness tracks the recovered curve, and the recovered curve tracks recall, not the overlap the mask was graded on.

What we did with this, and what we did not

Concretely, this changed which checkpoint we shipped and which review we trusted. We stopped treating a low IoU as a stop condition and started treating a low recovered R-squared as one, because the recovered curve is what the client receives. We kept IoU and F1 on the dashboard, because a sudden IoU collapse still signals that something upstream broke, but we demoted them from acceptance gate to instrument reading and moved acceptance to the far end of the pipeline, scored on the depth-indexed fit across the 300 validation depth points.
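The split between gate and instrument can be written down as a small review function. This is a hypothetical sketch, not the engagement's tooling: the 0.85 default is the illustrative acceptance line from the exhibit rather than a fixed threshold, and the IoU-drop alarm and its 0.10 default are invented to show what a demoted diagnostic might look like.

```python
from dataclasses import dataclass

@dataclass
class Review:
    accepted: bool    # gate decision, driven only by the recovered-curve fit
    iou_alarm: bool   # diagnostic flag; never blocks acceptance on its own

def review_checkpoint(r2, iou, previous_iou, r2_gate=0.85, iou_drop_alarm=0.10):
    """Gate on recovered-curve R-squared; treat IoU as an instrument reading.

    r2_gate and iou_drop_alarm are illustrative defaults, not sourced thresholds.
    """
    accepted = r2 >= r2_gate
    iou_alarm = (previous_iou - iou) > iou_drop_alarm
    return Review(accepted=accepted, iou_alarm=iou_alarm)

# A low but stable IoU with a strong recovered fit passes the gate quietly.
shipped = review_checkpoint(r2=0.9891, iou=0.51, previous_iou=0.50)

# A sudden IoU collapse raises the alarm without deciding acceptance by itself.
suspicious = review_checkpoint(r2=0.90, iou=0.20, previous_iou=0.50)
```

Keeping the two fields separate is the design point: a reviewer sees both readings, but only one of them can stop a release.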

What we did not do is pretend the 0.51 is good. It is not; it is a fat, imprecise mask with real headroom in tightening precision without sacrificing the recall that carries continuity. The point is narrower than a rationalisation: the 0.51 is not disqualifying, because it is a score on the wrong artefact, and the right artefact clears its own bar. A model can be improvable on the metric it is scored on and shippable on the metric it is for at the same time, and the discipline is to keep those two facts in separate columns instead of letting the first veto the second.

Limitations

The headline numbers are real archive figures: peak IoU 0.51, peak F1 0.55, recall up to 0.97 on the binary task, and the Tversky curve-1 recovered R-squared of 0.9891 across 300 interpolated depth points. But 0.9891 is a best-case example from one curve on one loss, not a distribution: other curves and examples scored lower. This note argues that a low overlap need not veto a good recovered curve, not that every mask at 0.51 recovers to 0.99. The relationship the lever draws between preserved recall and recovered R-squared is illustrative geometry, monotone and anchored on the sourced end points, not a logged recall-versus-fit curve; the real dependence is noisier and curve-specific. The two review lines, the 0.75 overlap instinct and the 0.85 acceptance arc, are stand-ins for a reviewer's judgement, not fixed engagement thresholds. And the argument is scoped to a task where the deliverable is a thin curve recovered from a band; for a task whose product genuinely is the region itself, area segmentation of a reservoir body, say, IoU is much closer to the thing you care about and this gap narrows or closes.

Score the artefact you ship

The habit worth keeping from all of this is to ask, before trusting any metric, which artefact it measures and whether that artefact is the one leaving the building. A segmentation IoU measures a mask. If the mask is the product, score it and mean it. If the mask is a means and a recovered curve is the product, then the IoU is a diagnostic and the curve fit is the verdict, and a 0.51 on the diagnostic does not get to overrule a 0.9891 on the verdict. The number that should gate the ship is the number the customer reads, measured where the customer reads it.

References

[1] Maier-Hein, L., Reinke, A., et al. Metrics Reloaded: Recommendations for Image Analysis Validation. arXiv:2206.01653 (2022). The argument that metric choice must follow the underlying domain interest, and a catalogue of the pitfalls of scoring a task on a metric that does not reflect what the task is for. https://arxiv.org/abs/2206.01653

[2] Taha, A. A., and Hanbury, A. Metrics for Evaluating 3D Medical Image Segmentation: Analysis, Selection, and Tool. BMC Medical Imaging 15, 29 (2015). A systematic analysis of overlap-based segmentation metrics and of where a spatial-overlap score is and is not informative about fitness for a downstream purpose. https://bmcmedimaging.biomedcentral.com/articles/10.1186/s12880-015-0068-x

[3] Maier-Hein, L., Eisenmann, M., Reinke, A., et al. Why Rankings of Biomedical Image Analysis Competitions Should Be Interpreted with Care. Nature Communications 9, 5217 (2018). How sensitive an algorithm's apparent standing is to the metric and ranking scheme chosen, and why a single overlap number is a fragile summary of a model's worth. https://www.nature.com/articles/s41467-018-07619-7
