Tversky Loss When You Care More About Recall Than Precision

There is a moment in almost every imbalanced-segmentation project where you stop arguing about architectures and start arguing about which mistakes you are willing to make. For us, building VeerNet, the encoder-decoder EarthScan uses to lift a one-to-three-pixel ink trace off a scanned well log, that moment arrived the first time we looked at a precise prediction and realised it was useless. The mask was clean. It fired only where it was sure. And the curve it produced was a dotted line of fragments that no downstream interpolation could honestly join. That is the day the Tversky loss earned its place in the pipeline, not because it is a better loss in the abstract, but because it has the one knob that lets you say, in the gradient itself, that you would rather catch every curve pixel and clean up a few strays than miss any and keep the background pristine.

The loss is not ours and the credit belongs squarely to its authors. Salehi, Erdogmus and Gholipour introduced the alpha-beta weighting for exactly this kind of imbalanced problem [1], generalising the Dice overlap loss that Milletari and colleagues had made standard the year before [2], and the deeper idea that similarity is asymmetric goes all the way back to Tversky's work in cognitive psychology [3]. None of that is what this post is about. There is already a careful field survey that puts all five candidate losses on one bench and reads them against each other. This piece is narrower and more practical: it is about the knob itself, which way to turn it when recall is the thing you care about, and how to reason about how far.

The knob, in one paragraph

Tversky scores a prediction the way Dice does, by how much its mask overlaps the ground truth, but it splits the penalty for being wrong into two weighted terms. Alpha multiplies the false positives, the pixels you painted as curve that were really background. Beta multiplies the false negatives, the curve pixels you missed. Set alpha and beta both to one-half and you have written Dice exactly, no more and no less [1] [2]. The whole expressive power lives in making them unequal. Raise alpha and you punish strays harder, which pushes the model toward precision, toward firing only when it is certain. Raise beta and you punish misses harder, which pushes the model toward recall, toward never letting a curve pixel slip through. Dice cannot say either of those things; with one symmetric denominator it charges the same price for a missed curve pixel and a stray background one, which on a target that is a sliver of the frame is precisely the wrong indifference.
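To make the knob concrete, here is a minimal NumPy sketch of the loss as described above: a soft Tversky index with alpha on the false-positive term and beta on the false-negative term, subtracted from one. The function names, the `eps` smoothing constant, the toy eight-pixel strip, and the 0.3 / 0.7 weights are all illustrative assumptions, not the VeerNet training code or its settings.

```python
import numpy as np

def tversky_loss(pred, target, alpha=0.5, beta=0.5, eps=1e-7):
    """Soft Tversky loss for one binary mask (illustrative sketch).

    pred   : predicted curve probabilities in [0, 1]
    target : ground-truth mask, 1 = curve pixel, 0 = background
    alpha  : weight on false positives (strays)
    beta   : weight on false negatives (misses)
    """
    pred = np.asarray(pred, dtype=np.float64).ravel()
    target = np.asarray(target, dtype=np.float64).ravel()
    tp = np.sum(pred * target)            # curve pixels caught
    fp = np.sum(pred * (1.0 - target))    # background painted as curve
    fn = np.sum((1.0 - pred) * target)    # curve pixels missed
    index = (tp + eps) / (tp + alpha * fp + beta * fn + eps)
    return 1.0 - index

def dice_loss(pred, target, eps=1e-7):
    """Soft Dice loss, for comparison with the alpha = beta = 0.5 case."""
    pred = np.asarray(pred, dtype=np.float64).ravel()
    target = np.asarray(target, dtype=np.float64).ravel()
    inter = np.sum(pred * target)
    return 1.0 - (2.0 * inter + eps) / (pred.sum() + target.sum() + eps)

# Toy strip: four curve pixels in the middle of eight.
target = np.array([0, 0, 1, 1, 1, 1, 0, 0])
gappy  = np.array([0, 0, 1, 0, 0, 1, 0, 0])  # two misses, no strays
fat    = np.array([0, 1, 1, 1, 1, 1, 1, 0])  # no misses, two strays

for alpha, beta in [(0.5, 0.5), (0.3, 0.7), (0.7, 0.3)]:
    print(alpha, beta,
          round(tversky_loss(gappy, target, alpha, beta), 3),
          round(tversky_loss(fat, target, alpha, beta), 3))
```

With the recall-tilted weights the gappy mask is charged far more than the fat one; flip the weights and the ranking reverses, which is the whole point of having two terms instead of one.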

So the practitioner's question is not whether Tversky is good. It is which direction to turn beta, and that depends entirely on what your downstream task pays for.

Why a thin curve pays more for a miss than a stray

The reason we turn the knob toward recall is geometric, and it is worth being concrete about because it is the entire justification for everything that follows. A well-log curve in a raster scan is one to three pixels wide. When the segmentation feeds a regression stage that reads a single value per depth row, the failure modes of the two error types are not symmetric at all.

A false negative, a missed curve pixel, punches a hole in the trace. String enough of them together and the curve breaks into disconnected fragments. There is no good recovery from that: an interpolator asked to bridge a gap it cannot see either invents a value or leaves a void, and a petrophysicist reading the digitised log gets a curve with phantom flat spots where the ink was simply never detected. A false positive, a stray pixel painted just off the true ink, is a different kind of problem and a much smaller one. It fattens the trace by a pixel here and there, and a column-wise argmax or a light morphological clean-up removes most of it before anyone sees it. On a one-pixel target, continuity is worth more than purity, because a slightly fat but unbroken curve regresses to a faithful signal and a perfectly clean but broken one does not.
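The difference between the two failure modes is easy to see in a toy read-out. The sketch below assumes the regression stage picks, for each depth row, the column with the highest curve probability, and that strays score lower than the true ink. The `read_curve` helper, the 0.5 threshold, and the 6-by-7 strip are hypothetical stand-ins, not the production read-out.

```python
import numpy as np

def read_curve(prob_mask, threshold=0.5):
    """Return one column index per depth row, NaN where nothing fires."""
    prob_mask = np.asarray(prob_mask, dtype=np.float64)
    cols = np.argmax(prob_mask, axis=1).astype(np.float64)
    detected = prob_mask.max(axis=1) >= threshold
    cols[~detected] = np.nan   # a row with no detection is a hole in the trace
    return cols

# Toy strip: 6 depth rows, 7 columns, true curve is a vertical line at column 3.
truth = np.zeros((6, 7))
truth[:, 3] = 1.0

fat = truth * 0.9              # confident on the true ink
fat[[1, 4], 4] = 0.6           # weaker strays one pixel to the right

gappy = truth * 0.9
gappy[[2, 3], :] = 0.0         # rows 2 and 3 missed entirely

print(read_curve(fat))         # strays lose the per-row argmax
print(read_curve(gappy))       # missed rows come back as NaN
```

The fat mask reads back as the true column on every row, while the gappy one leaves two rows with nothing to read, and whatever fills them afterwards is invention rather than measurement.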

That asymmetry in the cost of the two mistakes is the whole reason to make the loss asymmetric to match. We turn beta up because a miss genuinely costs us more than a stray, and we want the optimiser to feel that difference in every gradient step.

What it actually buys, with the recall held still

The cleanest way to see what the dial does is to fix the thing you are unwilling to give up and watch everything else move. For VeerNet the non-negotiable was recall: we wanted the model catching essentially every curve pixel, and the binary runs had already shown us that 0.97 recall was reachable on this kind of mask. So we held the recall target at 0.97 and walked the operating point from the balanced Dice setting toward a recall-tilted Tversky one, reading the error left on the recovered curve at each end.

The instrument below is that walk. Drag the dial from the balanced Dice end, where alpha equals beta, toward the recall tilt where beta exceeds alpha, and watch the two curve-1 error bars contract while the goodness-of-fit needle climbs.

[Interactive figure: a dial that walks the operating point from the balanced Dice setting (alpha = beta = 0.5) to the recall-tilted Tversky setting (beta greater than alpha), with the recall target held at 0.97 throughout. It shows curve-1 MAE and MSE as bars and curve-1 R-squared as a gauge. Only the two endpoints are measured, both sourced from the engagement archive: Dice at MAE 0.0367 and MSE 0.0091, Tversky at MAE 0.0277 and MSE 0.0021 with a peak R-squared of 0.9891. Values between the ends are an illustrative monotone read-out. The orange accent marks the recall-tilted Tversky operating point, the setting the digitiser actually ships.]

The endpoints are measured, not asserted. At the balanced Dice setting, curve-1 came back with a mean absolute error of 0.0367 and a mean squared error of 0.0091. At the recall-tilted Tversky setting, with the same architecture, the same 15,000-instance synthetic multiclass dataset, the same 50-epoch budget, and the same 0.97 recall target, curve-1 error fell to a mean MAE of 0.0277 and a mean MSE of 0.0021. That MSE drop is the one to dwell on. MAE fell by about a quarter, but MSE fell by more than three-quarters (0.0091 to 0.0021, roughly 77 percent), and because squared error punishes large excursions, that gap means the recall tilt is not shaving a little off the average residual, it is eliminating the big misses, the kind of single bad pick that throws a regressed curve off a cliff. The values the dial shows between the two ends are an illustrative read-out so the trade stays legible; only the two anchors are sourced.
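The relative size of the two drops can be checked directly from the anchors. This is plain arithmetic on the four figures quoted above; the dictionary and function names are only for illustration.

```python
# The two measured endpoints for curve-1, as quoted in the text.
dice = {"mae": 0.0367, "mse": 0.0091}
tversky = {"mae": 0.0277, "mse": 0.0021}

def relative_drop(before, after):
    """Fraction of the original error removed."""
    return (before - after) / before

mae_drop = relative_drop(dice["mae"], tversky["mae"])
mse_drop = relative_drop(dice["mse"], tversky["mse"])

print(f"MAE drop: {mae_drop:.1%}")  # MAE drop: 24.5%
print(f"MSE drop: {mse_drop:.1%}")  # MSE drop: 76.9%
```

The squared error falls about three times as far as the absolute error in relative terms, which is what you would expect if the improvement is concentrated in the largest residuals.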

Reading the goodness-of-fit ladder honestly

The other thing the dial tracks is curve-1 goodness-of-fit, and here is where it is easy to over-claim, so let me be careful. On the cleanest curve-1 example we measured, the recall-tilted Tversky configuration reached an R-squared of 0.9891, which is the best single fit anywhere in the study. That number is real and we are proud of it, but it is a best case, not an average. On a harder example the same configuration sits lower, and the honest framing is that the recall tilt raises the ceiling rather than lifting every example to it. A balanced overlap loss leaves that ceiling unclaimed because it has no vocabulary for telling the optimiser that a miss on a thin curve is the expensive mistake. The tilt is what unlocks the headroom; whether a given log reaches it still depends on how clean the scan is.
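The gap between a best case and an average is worth keeping visible in whatever you report. A minimal sketch, assuming R-squared is computed per example as one minus the ratio of residual to total sum of squares: the sine curve and the three noise levels below are synthetic stand-ins for clean and dirty scans, not data from the study.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination for one recovered curve."""
    y_true = np.asarray(y_true, dtype=np.float64)
    y_pred = np.asarray(y_pred, dtype=np.float64)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# Synthetic example: one "true" curve, recovered with three noise levels
# standing in for progressively dirtier scans.
rng = np.random.default_rng(0)
depth = np.linspace(0.0, 1.0, 200)
truth = np.sin(2.0 * np.pi * depth)
noise_levels = (0.05, 0.1, 0.2)
scores = [r_squared(truth, truth + rng.normal(0.0, s, truth.size))
          for s in noise_levels]

best = max(scores)
mean = float(np.mean(scores))
print(f"best R^2 = {best:.4f}, mean R^2 = {mean:.4f}")
```

Reporting both numbers side by side is a cheap guard against letting the ceiling stand in for the typical case.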

This is also why the gauge in the instrument shows the Dice end well short of the Tversky peak. The gap between them is not noise. It is the measurable value of the one capability Dice structurally does not have, and it is the reason we keep the loss in the pipeline.

How far to turn it, and the failure on the other side

A practitioner's post owes you the part where the advice stops being free. You cannot simply crank beta to its maximum and walk away, because the recall tilt has a far side and it is ugly. Push beta too high and the model learns that the cheapest way to avoid the now-enormous penalty for a miss is to fire almost everywhere. Recall stays pegged near the top, which looks fine on a dashboard, but precision collapses, the strays stop being strays and become a smear, and the one-pixel trace fattens into a band that the regression stage can no longer centre. At that point you are no longer buying continuity; you are buying a blur with a good recall number attached.
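You can watch the far side arrive on a toy frame. The sketch below compares two fixed predictions, one that fires on every pixel and one that is clean but misses a quarter of the curve, as beta rises. It assumes the common convention that alpha and beta sum to one, which the text above does not require; the 20-by-20 frame and the beta values are illustrative only.

```python
import numpy as np

def tversky_loss(pred, target, alpha, beta, eps=1e-7):
    """Soft Tversky loss: alpha weights false positives, beta false negatives."""
    pred = np.asarray(pred, dtype=np.float64).ravel()
    target = np.asarray(target, dtype=np.float64).ravel()
    tp = np.sum(pred * target)
    fp = np.sum(pred * (1.0 - target))
    fn = np.sum((1.0 - pred) * target)
    return 1.0 - (tp + eps) / (tp + alpha * fp + beta * fn + eps)

# 20x20 frame with a one-pixel-wide vertical curve: 20 curve pixels out of 400.
target = np.zeros((20, 20))
target[:, 10] = 1.0

fire_everywhere = np.ones_like(target)   # recall 1.0, precision 0.05
gappy = target.copy()
gappy[::4, 10] = 0.0                     # clean, but misses 5 of 20 curve pixels

losses = {}
for beta in (0.5, 0.7, 0.9, 0.99):
    alpha = 1.0 - beta                   # assumption: alpha + beta = 1
    losses[beta] = (tversky_loss(fire_everywhere, target, alpha, beta),
                    tversky_loss(gappy, target, alpha, beta))
    print(beta, [round(v, 3) for v in losses[beta]])
```

At the balanced setting the smear is punished heavily and the gappy mask is cheap; by the most extreme tilt the ordering has flipped and painting the whole frame is the lower-loss answer, even though its precision is five percent.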

So the dial has a sweet spot rather than a direction, and finding it is a small search, not a fixed recipe. The procedure we would actually recommend is unglamorous: pin the recall target where your downstream task needs it, start from the balanced setting, raise beta in a few steps, and at each step look not at the recall number but at the recovered curve and its MAE and MSE together. Stop when the squared error stops falling, because that is the signal that you have wrung out the big misses and any further tilt is just inviting the smear. The literature has even folded a focusing term on top of the Tversky asymmetry to handle the hardest examples [4], and the generalised-Dice line of work attacks the same imbalance from the weighting side [5]; both are worth knowing, but neither removes the need to actually look at the curve your loss produces.
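The stopping rule in that procedure is simple enough to write down. The sketch below assumes a hypothetical `evaluate(alpha, beta)` callback that trains (or fine-tunes) at the given weights and returns the curve metrics at your pinned recall target, and again assumes alpha and beta sum to one. The `fake_mse` numbers are placeholders in arbitrary units so the example runs, not measurements.

```python
from typing import Callable, Dict, Optional, Sequence, Tuple

def search_beta(
    evaluate: Callable[[float, float], Dict[str, float]],
    betas: Sequence[float] = (0.5, 0.6, 0.7, 0.8, 0.9),
    min_improvement: float = 0.0,
) -> Tuple[Optional[float], Optional[Dict[str, float]]]:
    """Raise beta step by step; stop once MSE no longer falls."""
    best_beta, best = None, None
    for beta in betas:
        alpha = 1.0 - beta                      # assumption: alpha + beta = 1
        metrics = evaluate(alpha, beta)         # hypothetical train-and-score step
        if best is not None and best["mse"] - metrics["mse"] <= min_improvement:
            break                               # squared error stopped falling
        best_beta, best = beta, metrics
    return best_beta, best

# Placeholder numbers in arbitrary units, NOT measurements from any run.
fake_mse = {0.5: 0.9, 0.6: 0.6, 0.7: 0.4, 0.8: 0.45, 0.9: 0.7}

def fake_evaluate(alpha: float, beta: float) -> Dict[str, float]:
    return {"mse": fake_mse[beta]}

best_beta, best_metrics = search_beta(fake_evaluate)
print(best_beta, best_metrics)
```

This automates only the stopping rule. It does not replace looking at the recovered curve at each step, since a smear can begin before the squared error turns around.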

The one-line version for your next imbalanced run

If you take nothing else from this, take the decision rule, because it outlives the well logs. Before you accept Dice as a default, ask which of your two error types is more expensive once your prediction has flowed all the way downstream to whatever a human finally reads. If a miss is the costly one, as it is for any thin, connected structure you intend to trace, reach for Tversky and turn beta above alpha until the squared error bottoms out, then stop before the precision smear sets in. The knob is Salehi and colleagues' contribution, not ours [1], and the asymmetric-similarity intuition under it is Tversky's [3]. What we can vouch for from a real digitisation pipeline is that turning that knob toward recall, deliberately and not too far, is what carried curve-1 from a fragmented Dice trace to a continuous one good enough to put in front of a geoscientist.

Key takeaways

Tversky generalises Dice by splitting the wrong-prediction penalty into alpha on false positives and beta on false negatives; alpha equal to beta equal to one-half is exactly Dice, and all the expressive power is in making them unequal. The loss is Salehi, Erdogmus and Gholipour's and the credit is theirs.
On a one-to-three-pixel curve the two error types are not symmetric in cost: a miss breaks the trace into fragments no interpolation rescues, while a stray fattens it by a pixel and cleans up easily. Continuity beats purity, so you turn beta up to charge more for misses.
Holding the recall target fixed at 0.97 and walking from the balanced Dice setting to the recall-tilted Tversky one cut curve-1 error from MAE 0.0367 / MSE 0.0091 to 0.0277 / 0.0021. The MSE drop of roughly 77 percent, against about 25 percent for MAE, means the tilt is eliminating the big misses, not just shaving the average residual.
The recall tilt raises the goodness-of-fit ceiling to its peak R-squared of 0.9891 on the cleanest curve-1 example, the best fit in the study; that is a best case the balanced loss leaves unclaimed, not an average every log reaches.
The knob has a sweet spot, not a direction: push beta too far and the model fires everywhere, precision collapses, and the trace smears into a band with a misleadingly good recall number. Raise beta in steps and stop when the squared error stops falling.

References

[1] Salehi, S. S. M., Erdogmus, D., and Gholipour, A. Tversky loss function for image segmentation using 3D fully convolutional deep networks. MLMI Workshop, MICCAI (2017). Introduces the alpha-beta false-positive / false-negative weighting that this whole post turns on. https://arxiv.org/abs/1706.05721

[2] Milletari, F., Navab, N., and Ahmadi, S.-A. V-Net: Fully Convolutional Neural Networks for Volumetric Medical Image Segmentation. 3DV (2016). The Dice loss that Tversky reduces to when alpha equals beta. https://arxiv.org/abs/1606.04797

[3] Tversky, A. Features of similarity. Psychological Review, 84(4), 327-352 (1977). The original asymmetric-similarity argument the loss function makes differentiable. https://psycnet.apa.org/record/1978-09287-001

[4] Abraham, N., and Khan, N. M. A Novel Focal Tversky Loss Function with Improved Attention U-Net for Lesion Segmentation. IEEE ISBI (2019). Folds a focusing term onto the Tversky asymmetry for the hardest examples. https://arxiv.org/abs/1810.07842

[5] Sudre, C. H., Li, W., Vercauteren, T., Ourselin, S., and Cardoso, M. J. Generalised Dice overlap as a deep learning loss function for highly unbalanced segmentations. DLMIA Workshop, MICCAI (2017). Attacks the same imbalance from the class-weighting side. https://arxiv.org/abs/1707.03237
