Tversky versus Dice: Tuning the Precision-Recall Tradeoff in Thin-Curve Segmentation

Most segmentation papers pick a loss in a sentence and spend the rest of the page on the architecture. That ordering is backwards for the problem of pulling a one-to-three-pixel ink trace off a scanned well log, where the network's capacity is rarely the binding constraint and the loss is doing almost all of the work of telling the optimiser what to care about. So when we built VeerNet, the encoder-decoder EarthScan uses to digitise raster logs, we treated the loss as an experiment in its own right and ran all five candidates through one controlled comparison. This piece is about what that comparison showed, and in particular about why Dice and Tversky, which are the same equation until you turn one dial, came apart so decisively once we read them on the metric that actually ships.

The literature on segmentation losses is by now a small field of its own, and the comparison below leans on it rather than reinventing it. The survey by Jadon catalogues the families we drew from [5]; the specific objectives have their own canonical references, which we credit as they come up. What we add is not a new loss but a controlled read of these existing losses on a task with an unusually thin foreground and a regression stage bolted onto the back of the segmentation, where the goodness-of-fit of the recovered curve, not the pixel overlap, is the number a petrophysicist will judge us on.

Five losses, one bench

The five candidates were Dice, Focal, Lovasz-Softmax, Soft Cross-Entropy, and Tversky. They were trained under identical conditions: the same encoder-decoder, the same 15,000-instance synthetic dataset with three classes (background plus two curves), the same 80/20 split, the same 50-epoch budget, the same optimiser. Holding everything but the loss fixed is the whole point. It is the only way to attribute a difference in the output to the objective rather than to a luckier learning rate or a deeper decoder, and it is the discipline that lets you say the loss caused the result rather than merely correlated with it.
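As a minimal sketch of what "everything but the loss fixed" looks like in a run configuration, the snippet below builds five configs that differ in exactly one field. The dataset size, class count, split, and epoch budget come from the description above; the `BenchConfig` name, the loss identifiers, and the `seed` field are assumptions made for illustration, and the optimiser is left out because it is not named here.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class BenchConfig:
    """One training run. Only `loss` is allowed to vary across the bench."""
    loss: str
    n_instances: int = 15_000     # synthetic dataset size
    n_classes: int = 3            # background plus two curves
    train_fraction: float = 0.8   # 80/20 split
    epochs: int = 50
    seed: int = 0                 # assumption: a shared seed, not stated in the text


# Hypothetical identifiers for the five candidate losses.
LOSSES = ("dice", "focal", "lovasz_softmax", "soft_cross_entropy", "tversky")


def make_runs(base: BenchConfig) -> list[BenchConfig]:
    """Clone the base config once per loss, changing nothing else."""
    return [replace(base, loss=name) for name in LOSSES]


def split_sizes(cfg: BenchConfig) -> tuple[int, int]:
    """Return (train, validation) instance counts for the configured split."""
    n_train = round(cfg.n_instances * cfg.train_fraction)
    return n_train, cfg.n_instances - n_train


runs = make_runs(BenchConfig(loss="dice"))
for run in runs:
    print(run.loss, split_sizes(run), run.epochs)
```

Freezing the dataclass makes it hard to change a second field by accident halfway through a sweep, which is the failure this kind of bench is meant to rule out.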

Each loss embodies a different theory of what a good segmentation is. Dice, in the V-Net lineage, scores a prediction by the overlap between its mask and the ground truth, normalised by the combined size of the two masks rather than by their union, which makes it indifferent to the vast background and well-suited to imbalanced foregrounds [1]. Focal reshapes cross-entropy with a modulating factor that down-weights the easy, already-correct pixels so the gradient concentrates on the hard ones [3]. Lovasz-Softmax optimises a tractable surrogate of the intersection-over-union measure directly, so the gradient is aligned with the IoU number you report rather than a per-pixel proxy [4]. Soft Cross-Entropy is the stable, well-understood per-pixel baseline. And Tversky generalises Dice by splitting the penalty for the two ways a prediction can be wrong [2].

That last one is the hinge of the whole comparison, so it is worth being precise about the algebra.

Dice and Tversky are the same equation until you split one term

The Dice coefficient can be written as twice the true positives over twice the true positives plus the false positives plus the false negatives. Tversky puts the true positives alone in the numerator and writes the denominator as the true positives plus a weight alpha on the false positives plus a weight beta on the false negatives [2]. Set alpha and beta both to one-half, multiply top and bottom by two, and the two coincide exactly: Tversky with balanced weights is Dice. The generalisation buys you nothing until you make the weights unequal.
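Written out in symbols, with TP, FP, and FN standing for the true-positive, false-positive, and false-negative counts (or their soft, probability-weighted equivalents during training), the two definitions and the balanced-weight identity are:

```latex
\[
\mathrm{Dice} = \frac{2\,TP}{2\,TP + FP + FN},
\qquad
\mathrm{Tversky}(\alpha, \beta) = \frac{TP}{TP + \alpha\,FP + \beta\,FN}
\]

\[
\mathrm{Tversky}\!\left(\tfrac{1}{2}, \tfrac{1}{2}\right)
  = \frac{TP}{TP + \tfrac{1}{2}\,FP + \tfrac{1}{2}\,FN}
  = \frac{2\,TP}{2\,TP + FP + FN}
  = \mathrm{Dice}
\]
```

The corresponding losses are one minus each index. The two expressions only part ways once you make alpha and beta unequal.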

When you do, the asymmetry becomes a precision-recall dial. A false positive is a pixel the model painted as curve that was actually background, and penalising it harder, by raising alpha, pushes the model toward precision, toward firing only when it is sure. A false negative is a curve pixel the model missed, and penalising it harder, by raising beta, pushes the model toward recall, toward never letting a curve pixel slip. Dice, with its single symmetric denominator, cannot express that preference at all. It weighs a missed curve pixel and a stray background pixel the same, which on a target that is a sliver of the frame is exactly the wrong neutrality. The idea that similarity itself is asymmetric, that the features you weigh depend on what you are comparing to what, goes back to Tversky's original work in cognitive psychology [6]; the loss function is that asymmetry made differentiable.

This is the lever the whole result turns on, and it is the one Dice structurally lacks.
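To make the dial concrete, here is a minimal pure-Python sketch of the soft Tversky index for a single class, alongside Dice for comparison. It is an illustration under stated assumptions, not VeerNet's training code: the function names, the `eps` smoothing term, and the six-pixel toy strip are all hypothetical, and a multiclass setup would typically apply the same formula per class and average.

```python
import math


def soft_counts(probs, target):
    """Soft TP / FP / FN from predicted probabilities and a 0/1 ground truth."""
    tp = sum(p * t for p, t in zip(probs, target))
    fp = sum(p * (1 - t) for p, t in zip(probs, target))
    fn = sum((1 - p) * t for p, t in zip(probs, target))
    return tp, fp, fn


def tversky_index(probs, target, alpha=0.5, beta=0.5, eps=1e-7):
    """TP / (TP + alpha*FP + beta*FN); eps guards against an empty mask."""
    tp, fp, fn = soft_counts(probs, target)
    return (tp + eps) / (tp + alpha * fp + beta * fn + eps)


def tversky_loss(probs, target, alpha=0.5, beta=0.5):
    return 1.0 - tversky_index(probs, target, alpha, beta)


def dice_coefficient(probs, target, eps=1e-7):
    """2TP / (2TP + FP + FN), smoothed consistently with tversky_index."""
    tp, fp, fn = soft_counts(probs, target)
    return (2 * tp + 2 * eps) / (2 * tp + fp + fn + 2 * eps)


# Toy one-dimensional strip: three curve pixels in a row of six.
target = [0, 0, 1, 1, 1, 0]
miss = [0, 0, 1, 0, 1, 0]    # one curve pixel missed (a false negative)
stray = [0, 1, 1, 1, 1, 0]   # one background pixel painted (a false positive)

for alpha, beta in [(0.5, 0.5), (0.3, 0.7)]:
    print(f"alpha={alpha}, beta={beta}: "
          f"loss(miss)={tversky_loss(miss, target, alpha, beta):.4f}, "
          f"loss(stray)={tversky_loss(stray, target, alpha, beta):.4f}")

# Balanced weights reproduce Dice.
assert math.isclose(tversky_index(miss, target), dice_coefficient(miss, target))
```

Moving from balanced weights to beta above alpha raises the loss charged for the miss and lowers the loss charged for the stray, which is the recall tilt described above.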

Figure (interactive in the original): a controlled head-to-head of VeerNet's segmentation losses on the metric that actually ships, the error left on the digitised curve. The small-multiples panel plots mean MAE or MSE for curve-1 and curve-2 under the three losses whose regression-stage error was logged (Tversky, Dice, Focal); Lovasz and Soft-CE were evaluated in the same sweep but their regression error was not logged in this run, so they are shown as evaluated-but-not-measured rather than guessed. A Tversky alpha/beta slider, moved toward recall by raising beta, steps through the sourced R-squared values 0.5461 (curve-2, example 2), 0.8126 (curve-1, example 2), and 0.9891 (curve-1, example 3). The MAE/MSE figures and the three R-squared values are sourced from the engagement archive; the mapping of slider position onto those operating points is illustrative.

Read the loss on the metric that ships

Here is where a digitisation task departs from a benchmark leaderboard. The mask is not the deliverable. The deliverable is a curve, a one-dimensional signal extracted from the predicted mask and written back as a digital log, and the question a geoscientist asks is not what the IoU was but how closely the recovered curve tracks the real one. So we read each loss not on overlap alone but on the regression error left on the two recovered curves: the mean absolute error and the mean squared error against ground truth, per curve.

On that metric the ordering is clear and it is not the ordering you would guess from the segmentation scores. Tversky left the lowest error on curve-1: a mean MAE of 0.0277 against Dice's 0.0367 and Focal's 0.0405, and a mean MSE of 0.0021 against Dice's 0.0091. On curve-2, where every loss left more error, the picture flips, with Dice ahead on MAE at 0.0774 versus Tversky's 0.1241, while their squared errors converge at 0.0269 and 0.0253. That split is itself the lesson: there is no loss that wins every cell of the table, and a single blended number would have hidden the trade entirely. Which loss is best depends on which curve, and which error norm, you have decided to be judged on. MAE treats every residual linearly and is the right lens when occasional large pick errors should not dominate; MSE squares them and is the right lens when a single large excursion in the recovered curve is the failure you fear. The two norms can disagree about the winner, and on curve-2 they nearly do.
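The way the two norms can crown different winners is easy to see on a toy example. The sketch below uses made-up residuals, not the logged VeerNet errors: one hypothetical reconstruction is slightly off everywhere, the other is exact except for a single large excursion.

```python
def mae(pred, truth):
    """Mean absolute error: every residual counts linearly."""
    return sum(abs(p - t) for p, t in zip(pred, truth)) / len(truth)


def mse(pred, truth):
    """Mean squared error: large residuals are amplified by squaring."""
    return sum((p - t) ** 2 for p, t in zip(pred, truth)) / len(truth)


# Synthetic illustration only; these are not values from the study.
truth = [0.0] * 10
steady = [0.1] * 10          # small error at every sample
spiky = [0.0] * 9 + [0.5]    # perfect except for one large excursion

for name, pred in [("steady", steady), ("spiky", spiky)]:
    print(f"{name}: MAE={mae(pred, truth):.4f}  MSE={mse(pred, truth):.4f}")
```

On this toy data the spiky curve has the lower MAE and the steady curve has the lower MSE, so the choice of norm alone decides which one looks better.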

The honest caveat is that the regression-stage error was logged for three of the five losses, Tversky, Dice, and Focal, in this run. Lovasz and Soft-CE were evaluated on the segmentation metrics in the same sweep, but their per-curve MAE and MSE were not captured, so the instrument above shows them as evaluated rather than guessing numbers that do not exist. We would rather leave a cell blank than fabricate it.

What the asymmetry actually buys: the R-squared ladder

The clearest evidence that the alpha-beta dial is doing real work is in the goodness-of-fit of Tversky's recovered curves across examples of increasing cleanliness. With beta tilted above alpha, so that missed curve pixels are penalised harder than stray ones, the reconstructions climbed a clear ladder: an R-squared of 0.5461 on a harder curve-2 example, 0.8126 on a curve-1 example, and 0.9891 on the cleanest curve-1 example we measured. That top figure is the best fit in the entire study, and it came from the recall-tilted Tversky configuration, not from the symmetric overlap loss it generalises.

Read that progression carefully, because it is easy to over-claim from it. The 0.9891 is not the model's average performance; it is its best case, the cleanest example under the most favourable loss configuration, and the 0.5461 at the other end is a reminder that on a noisier curve even the winning loss leaves real error on the table. The ladder is a statement about the ceiling the right loss unlocks, and about the fact that it was the recall tilt, the thing Dice cannot express, that unlocked it. A symmetric loss leaves that headroom unclaimed because it has no way to tell the optimiser that on a thin curve a miss costs more than a stray.
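For readers who want to reproduce the goodness-of-fit number on their own curves, the coefficient of determination is one minus the ratio of residual to total sum of squares. The sketch below is the textbook definition in pure Python; whether the study used exactly this form is an assumption, and the sample values are invented for the demonstration.

```python
def r_squared(pred, truth):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_truth = sum(truth) / len(truth)
    ss_res = sum((t - p) ** 2 for p, t in zip(pred, truth))
    ss_tot = sum((t - mean_truth) ** 2 for t in truth)
    if ss_tot == 0:
        raise ValueError("R-squared is undefined for a constant ground-truth curve")
    return 1.0 - ss_res / ss_tot


# Invented sample values, purely to exercise the function.
truth = [1.0, 2.0, 3.0, 4.0]
close_fit = [1.1, 1.9, 3.2, 3.9]
mean_only = [2.5, 2.5, 2.5, 2.5]   # always predicting the mean scores zero

print(f"close fit: {r_squared(close_fit, truth):.4f}")
print(f"mean only: {r_squared(mean_only, truth):.4f}")
```

R-squared is 1 for a perfect reconstruction, 0 for one no better than the mean of the true curve, and negative for one that is worse than that, which is worth keeping in mind when reading a figure like 0.5461.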

Why precision and recall are not symmetric on a thin curve

The reason the recall tilt helps is geometric, not statistical. A curve in a raster log is one to three pixels wide. If the model is conservative, demanding high confidence before it calls a pixel a curve, it breaks the trace into disconnected fragments, and a fragmented mask yields a jagged, gap-ridden curve that no interpolation rescues cleanly. If the model is generous, firing on the marginal pixels at the edge of the ink, it occasionally paints a background pixel by mistake, but those strays are easy to clean up downstream and the trace stays continuous. On a target that thin, continuity is worth more than purity, because a continuous slightly-fat curve regresses to a good signal while a precise broken curve does not. That asymmetry in the cost of the two error types is exactly what beta-greater-than-alpha encodes, and it is why a recall-tilted Tversky beats a balanced one on the regression metric even when their raw overlap scores are close.
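The geometric argument can be sketched with a toy curve extractor. The code below assumes one simple extraction rule, the centroid column of the curve pixels in each row, which is a hypothetical stand-in and not necessarily how VeerNet's regression stage works; the five-row masks are invented.

```python
def mask_from_columns(columns_per_row, width):
    """Build a 0/1 mask from the set of painted columns in each row."""
    return [[1 if c in cols else 0 for c in range(width)]
            for cols in columns_per_row]


def extract_curve(mask):
    """Per-row centroid of curve pixels; None where the row has no curve pixel."""
    curve = []
    for row in mask:
        cols = [c for c, value in enumerate(row) if value]
        curve.append(sum(cols) / len(cols) if cols else None)
    return curve


truth_cols = [2, 2, 3, 3, 4]   # true one-pixel trace, one column per depth row

# Conservative model: every painted pixel is correct, but two rows are missed.
conservative = mask_from_columns([{2}, set(), {3}, set(), {4}], width=7)
# Generous model: every row is covered, with one stray pixel beside the trace.
generous = mask_from_columns([{c, c + 1} for c in truth_cols], width=7)

broken = extract_curve(conservative)
fat = extract_curve(generous)

print("conservative:", broken)
print("generous:    ", fat)
print("gaps in conservative curve:", broken.count(None))
print("max offset in generous curve:",
      max(abs(f - t) for f, t in zip(fat, truth_cols)))
```

The conservative mask has perfect precision yet leaves holes that must be interpolated, while the generous mask gives a value at every depth with a small, bounded offset that downstream cleanup can reduce.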

This is also why the result does not transfer blindly. Tilt too far toward recall and the model fires everywhere, precision collapses, and the recovered curve fattens into a band. The dial has a sweet spot, and finding it is a small search, not a fixed prescription. What generalises is not the specific alpha and beta but the discipline: identify which error type is more expensive for your downstream task, and pick or tune a loss whose gradient charges more for it.
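The small search can be as plain as a one-dimensional sweep. The sketch below assumes the common convention of tying the weights together as alpha = 1 - beta, and takes the expensive part (train with the given weights, then score the recovered curve) as a caller-supplied function. The `toy_score` stand-in and its peak location are invented so the example runs; they are not measurements.

```python
def search_recall_tilt(score_fn, betas):
    """Sweep beta with alpha = 1 - beta and return the best (alpha, beta, score).

    score_fn(alpha, beta) must return a higher-is-better number, for example
    the R-squared of the recovered curve on a validation set.
    """
    results = []
    for beta in betas:
        alpha = 1.0 - beta
        results.append((alpha, beta, score_fn(alpha, beta)))
    return max(results, key=lambda item: item[2])


def toy_score(alpha, beta):
    """Placeholder for 'train, then score'. The peak at beta = 0.7 is arbitrary."""
    return 1.0 - (beta - 0.7) ** 2


best_alpha, best_beta, best_score = search_recall_tilt(
    toy_score, betas=[0.5, 0.6, 0.7, 0.8, 0.9]
)
print(f"best alpha={best_alpha:.2f}, beta={best_beta:.2f}, score={best_score:.4f}")
```

If the downstream metric is an error such as MAE, negate it inside `score_fn` so that higher still means better.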

The general rule the comparison leaves behind

Step back from the well logs and the lesson is a single sentence: the metric you optimise is the metric you report, so make sure the loss you train on charges for the errors your downstream task actually pays for. Dice optimises overlap and reports overlap, which is fine until your deliverable is a regressed curve whose continuity matters more than its pixel purity. Focal optimises hard-example focus, which helps on imbalance but says nothing about precision versus recall. Lovasz optimises IoU directly, which is the right move when IoU is the deliverable. Tversky is the only one of the five that lets you state, in the gradient itself, that one kind of mistake costs more than the other, and on a task where it genuinely does, that expressiveness is decisive.

None of this is an argument that Tversky is universally best. It is an argument for running the controlled comparison instead of inheriting a default, reading every candidate on the metric that ships rather than the one that is convenient, and recognising that the loss is not a footnote to the architecture but the place where you tell the model what you actually want.

Key takeaways

All five candidate losses for VeerNet (Dice, Focal, Lovasz-Softmax, Soft-CE, Tversky) were trained under identical conditions on the 15,000-instance synthetic multiclass dataset, so any difference in output is attributable to the loss, not the architecture or schedule.
Dice and Tversky are the same equation with balanced weights; Tversky's alpha-beta split turns the loss into a precision-recall dial that Dice structurally cannot express, because Dice penalises a missed curve pixel and a stray background pixel identically.
Read on the metric that ships, the regression error on the recovered curve, Tversky left the lowest curve-1 error (MAE 0.0277 / MSE 0.0021) against Dice (MAE 0.0367 / MSE 0.0091) and Focal (MAE 0.0405); on curve-2 Dice's MAE (0.0774) led Tversky's (0.1241) while their MSE converged. No single loss wins every cell.
With beta tilted above alpha to penalise false negatives, goodness-of-fit climbed an R-squared ladder of 0.5461 (a harder curve-2 example), 0.8126, and 0.9891 (both curve-1 examples). The top value is the best fit in the study, and it came from the recall-tilted Tversky configuration rather than the symmetric overlap loss it generalises.
On a one-to-three-pixel curve, continuity beats purity, so the cost of a miss exceeds the cost of a stray. That asymmetry is what the recall tilt encodes; the rule that generalises is to charge the loss for the error your downstream task actually pays for.

References

[1] Milletari, F., Navab, N., and Ahmadi, S.-A. V-Net: Fully Convolutional Neural Networks for Volumetric Medical Image Segmentation. 3DV (2016). The Dice loss in its now-standard differentiable form. https://arxiv.org/abs/1606.04797

[2] Salehi, S. S. M., Erdogmus, D., and Gholipour, A. Tversky loss function for image segmentation using 3D fully convolutional deep networks. MLMI Workshop, MICCAI (2017). Introduces the alpha-beta false-positive / false-negative weighting that generalises Dice. https://arxiv.org/abs/1706.05721

[3] Lin, T.-Y., Goyal, P., Girshick, R., He, K., and Dollar, P. Focal Loss for Dense Object Detection. ICCV (2017). The modulating factor that down-weights easy examples under foreground-background imbalance. https://arxiv.org/abs/1708.02002

[4] Berman, M., Triki, A. R., and Blaschko, M. B. The Lovasz-Softmax loss: a tractable surrogate for the optimization of the intersection-over-union measure in neural networks. CVPR (2018). Direct optimisation of IoU as a training objective. https://arxiv.org/abs/1705.08790

[5] Jadon, S. A survey of loss functions for semantic segmentation. IEEE CIBCB (2020). The catalogue of loss families this comparison draws from. https://arxiv.org/abs/2006.14822

[6] Tversky, A. Features of similarity. Psychological Review, 84(4), 327-352 (1977). The original asymmetric-similarity argument the loss function makes differentiable. https://psycnet.apa.org/record/1978-09287-001
