Case Study

Tversky Loss Pushed Curve-1 R-Squared to 0.99 on Hard Examples

We had a segmenter that read the easy scans well and gave up on the hard ones, and the loss function was why. Dice scores overlap symmetrically, which sounds fair until the foreground you care about is a curve one pixel wide against a page of background. On the hard examples the model quietly learned to leave the thin curve half-drawn, because a missed pixel and a spurious pixel cost the same, and there are far more chances to be wrong by omission. This is our account of swapping Dice for Tversky on the multiclass model: which hard-example curves moved, why the recall-weighted penalty is the right trade for scarce foreground, and how curve-1 goodness-of-fit on the cleanest example landed at R-squared 0.9891 while curve-2 stayed honestly hard.

The number we kept staring at was not the average. Averaged across examples, the multiclass segmenter looked fine, and averages are how you convince yourself a model works when it does not. What told the real story was the spread. On the cleanest examples the digitised curve tracked the ground truth almost exactly, and on the hard ones it fell apart, and the gap between those two was not noise. It was a pattern, and the pattern pointed at the loss function. We were training with Dice, and Dice, for all its virtues on balanced masks, was quietly teaching the model to abandon the very pixels the whole product exists to recover.

The failure had a shape, and the shape was omission

Start with what the model is actually asked to do. A raster well log is mostly white paper. The signal is a curve, or on the multiclass task two curves, each often a single pixel wide, winding down a page that is overwhelmingly background. We had three classes to predict per pixel: background, curve 1, and curve 2. Background is easy because it is everywhere. The curves are hard because they are scarce, thin, and on the difficult scans faint or crossing.

Dice loss scores the overlap between prediction and truth, and it treats a false positive and a false negative as the same kind of mistake [2]. On a balanced mask that symmetry is a feature. On ours it was a trap. When the foreground is a thin curve against a sea of background, the model has vastly more opportunities to err by omission than by commission, because most of the places a curve could go, it does not. The cheapest way for a Dice-trained model to lower its loss on a hard example is to predict a little less curve. Leave the faint stretch blank, skip the pixel where two curves nearly touch, and the symmetric penalty barely notices. The result was a segmenter that drew confident curves where the signal was strong and gave up where it was weak, which is the exact opposite of what a digitiser is for. Nobody needs help reading the clear part of the log.
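
The symmetry is easy to see from pixel counts alone. The sketch below assumes the usual count-based form of the Dice score; the pixel counts are hypothetical and chosen only to show that, with the true-positive count held fixed, swapping misses for false alarms leaves the score unchanged.

```python
def dice_score(tp: float, fp: float, fn: float) -> float:
    """Dice overlap from pixel counts: 2*TP / (2*TP + FP + FN)."""
    denom = 2 * tp + fp + fn
    # An empty prediction against an empty mask is treated as a perfect match.
    return 1.0 if denom == 0 else 2 * tp / denom


# Hypothetical thin curve: 90 pixels recovered in both cases.
ten_missed = dice_score(tp=90, fp=0, fn=10)    # ten curve pixels left undrawn
ten_spurious = dice_score(tp=90, fp=10, fn=0)  # ten marks drawn on blank paper

# FP and FN enter the denominator with the same weight, so the scores match.
print(ten_missed, ten_spurious)
```

Nothing in the score distinguishes the half-drawn curve from the over-drawn one, which is the property the rest of this piece sets out to change.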

We saw the same failure earlier in the binary phase, where a weighted binary cross-entropy with a class weight of 42 pushed recall up to 0.97 but left F1 stranded in the 0.3 range because precision collapsed. That told us the imbalance was real and that brute-force reweighting was not the answer. Cross-entropy weighting scales the whole class; it does not let you say which direction of error you are willing to tolerate. We wanted a knob that penalised the missed curve specifically, not the class as a whole.
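
How far precision fell can be backed out from the two figures above. This is a worked check rather than an archived number: it takes F1 = 0.3 as a stand-in for "the 0.3 range" and solves the F1 definition for precision.

```python
def implied_precision(f1: float, recall: float) -> float:
    """Solve F1 = 2*P*R / (P + R) for the precision P."""
    denom = 2 * recall - f1
    if denom <= 0:
        raise ValueError("no valid precision for this F1/recall pair")
    return f1 * recall / denom


# recall is the figure quoted above; f1=0.3 is an assumed representative value.
precision = implied_precision(f1=0.3, recall=0.97)
print(f"implied precision: {precision:.3f}")
```

Under that assumption precision sits a little under 0.18, meaning that more than four in five predicted curve pixels were false alarms.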

Why Tversky, and not just heavier Dice

The Tversky loss is the fix that keeps the part of Dice that works and adds the part it lacks. It generalises the Dice overlap by putting two separate weights, alpha on false positives and beta on false negatives, in the denominator [1]. Set alpha and beta both to 0.5 and you get Dice back exactly, so Dice is one special case of the family. Tilt beta above alpha and you tell the optimiser that a missed foreground pixel hurts more than a spurious one. That is the entire idea, and it is the right idea for our data, because our foreground is scarce and our failure mode is omission. The formulation comes from Salehi and colleagues, who built it for exactly this problem, small and imbalanced foreground in medical segmentation [1], and it traces back to Tversky's asymmetric account of similarity, where the features one object has and another lacks can weigh differently from the reverse [3]. We did not invent the loss. What we did was recognise that our problem was theirs wearing a different hat, and turn the dial the way scarce thin curves demand.
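
Written out, with TP, FP, and FN standing for the (soft) true-positive, false-positive, and false-negative totals for one class, the index and its loss take the following form. The notation is ours for illustration; the definition follows [1].

```latex
\[
\mathrm{TI}(\alpha,\beta) = \frac{TP}{TP + \alpha\,FP + \beta\,FN},
\qquad
\mathcal{L}_{\mathrm{Tversky}} = 1 - \mathrm{TI}(\alpha,\beta)
\]
\[
\alpha = \beta = \tfrac{1}{2}:\quad
\mathrm{TI} = \frac{TP}{TP + \tfrac{1}{2}FP + \tfrac{1}{2}FN}
            = \frac{2\,TP}{2\,TP + FP + FN}
\quad\text{(Dice)}
\]
```

The Dice case is the one with both weights at one half; with both weights at one the same expression becomes the Jaccard index instead.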

Choosing Tversky over simply cranking Dice's class weights harder was a deliberate call. Reweighting a symmetric loss makes every error on the rare class louder, false positives included, so past a point you trade the model's omission problem for a commission problem: it starts hallucinating curve where there is only stain and grid. The binary run had already shown us that cliff. Tversky lets you move only the term you mean to move. You raise the cost of the miss without equally raising the cost of the false alarm, so the model learns to reach for the faint pixel without being rewarded for inventing pixels that are not there.
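
A minimal sketch of that asymmetry for a single class, in plain Python rather than the training framework. The function name, the flattened-list inputs, and the weights alpha=0.3 and beta=0.7 are illustrative assumptions; this piece does not record the weights used in the actual run.

```python
def tversky_loss(probs, targets, alpha=0.3, beta=0.7, eps=1e-7):
    """Soft Tversky loss for one class over flattened pixels.

    probs:   predicted foreground probabilities in [0, 1]
    targets: ground-truth labels, 1 for curve and 0 for background
    alpha weights false positives; beta weights false negatives.
    """
    tp = sum(p * t for p, t in zip(probs, targets))
    fp = sum(p * (1 - t) for p, t in zip(probs, targets))
    fn = sum((1 - p) * t for p, t in zip(probs, targets))
    return 1.0 - (tp + eps) / (tp + alpha * fp + beta * fn + eps)


# Toy strip of ten pixels: four belong to the curve.
targets = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
missed = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]    # one curve pixel left undrawn
spurious = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]  # one extra pixel drawn

# alpha = beta = 0.5 reproduces the Dice loss.
miss_dice = tversky_loss(missed, targets, alpha=0.5, beta=0.5)
alarm_dice = tversky_loss(spurious, targets, alpha=0.5, beta=0.5)

# Tilting toward recall raises the cost of the miss and lowers
# the cost of the false alarm.
miss_tilt = tversky_loss(missed, targets)
alarm_tilt = tversky_loss(spurious, targets)

print(miss_dice, alarm_dice)
print(miss_tilt, alarm_tilt)
```

On this toy strip the tilt moves the two errors in opposite directions, which is the behaviour class reweighting cannot give: only the term for the miss gets louder.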

What actually moved, curve by curve

We ran the comparison as one controlled swap on the multiclass model, holding everything else fixed, and read it on the metric that ships: the error left on the digitised curve, plus goodness-of-fit against the ground-truth trace. Tversky was one of five losses we evaluated in that sweep, alongside Dice, Focal, Lovasz, and soft cross-entropy, and it is the one this piece is about because it is the one that recovered the hard examples.
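
For reference, goodness-of-fit can be computed as below. This sketch assumes R-squared means the standard coefficient of determination between the ground-truth trace and the digitised one, sampled at the same depths; the exact definition used in the evaluation is not spelled out here, and the sample values are made up.

```python
def r_squared(truth, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_truth = sum(truth) / len(truth)
    ss_res = sum((t - p) ** 2 for t, p in zip(truth, predicted))
    ss_tot = sum((t - mean_truth) ** 2 for t in truth)
    if ss_tot == 0:
        raise ValueError("R-squared is undefined for a constant ground-truth trace")
    return 1.0 - ss_res / ss_tot


# Hypothetical normalised curve values at five depths.
truth = [0.1, 0.4, 0.5, 0.3, 0.7]
close_fit = [0.12, 0.38, 0.5, 0.31, 0.69]
flat_guess = [sum(truth) / len(truth)] * len(truth)

print(r_squared(truth, close_fit))   # near 1: trace lies almost on the truth
print(r_squared(truth, flat_guess))  # 0: no better than predicting the mean
```

Read this way, a value near 1 means the digitised curve explains almost all the variation in the true trace, and a value near 0.5 means about half of it is left unexplained.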

[Interactive exhibit: Hard-example curve-1 recovery, R-squared under the loss swap. Loss toggle on the multiclass segmenter: Dice (balanced overlap) or Tversky (recall-weighted). Mean error on the digitised curve under Tversky: curve-1 MAE 0.0277, curve-2 MAE 0.1241, curve-1 MSE 0.0021, curve-2 MSE 0.0253; curve-1 MSE under Dice was 0.0091 and curve-1 MAE under Dice was 0.0367. Recovery by example, R-squared from easy to hard: curve-1 easy 0.9891, curve-1 hard 0.8126, curve-2 hard 0.5461. Recall-tilt control running from Dice to Tversky. Sourced: the three R-squared values and the mean MAE and MSE per curve under Dice and Tversky. Illustrative: the Dice-era ghost and the tilt sweep.]
Goodness-of-fit on the digitised curve, example by example, under the loss swap that produced it. The recovery ladder on the right plots R-squared against how hard each example is: curve-1 on the cleanest example lands at a sourced 0.9891, curve-1 on the harder example reaches 0.8126, and curve-2 on that harder example only reaches 0.5461, which is where the recall tilt is spending its budget. The Loss toggle switches the mean-error read-outs between the two sourced regimes, Dice and Tversky, both measured on the multiclass segmenter. The Recall-tilt lever sweeps the curve-1 operating point from the balanced Dice mean MAE of 0.0367 toward the recall-weighted Tversky mean MAE of 0.0277, and drives the orange hard-example marker up toward 0.8126. The orange element is the only one that argues: the curve-1 hard-example recovery that reaches R-squared 0.8126 under Tversky and falls to a ghosted Dice-era height when the loss is switched back. The three R-squared points and both mean-error endpoints are sourced from the engagement archive; the per-example Dice-era ghost height and the sweep between the two MAE endpoints are illustrative, and this is one of five losses evaluated.

The headline sits on curve 1. On the cleanest example the Tversky model reached R-squared 0.9891, a curve that lies almost on top of the truth. On a genuinely harder example the same curve-1 recovery held at R-squared 0.8126, which is not the peak but is a real, usable fit on a scan that Dice had been leaving half-drawn. Averaged across examples, curve-1 mean absolute error fell from 0.0367 under Dice to 0.0277 under Tversky, and mean squared error fell from 0.0091 to 0.0021. That MSE drop is the tell. MSE punishes the large, isolated miss far more than the small, diffuse one, so cutting it by more than a factor of four is the quantitative signature of exactly the thing we were chasing: the model stopped dropping whole stretches of curve on the hard pages.
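
The reason MSE is the tell can be shown with two made-up error profiles that share the same MAE. The per-sample errors below are hypothetical and only illustrate the mechanism; the final ratio uses the two mean MSE figures quoted above.

```python
def mae(errors):
    """Mean absolute error over per-sample errors."""
    return sum(abs(e) for e in errors) / len(errors)


def mse(errors):
    """Mean squared error over per-sample errors."""
    return sum(e * e for e in errors) / len(errors)


# Hypothetical per-sample errors on a normalised curve, not archived data.
diffuse = [0.25, 0.25, 0.25, 0.25]  # small error spread along the trace
dropped = [1.0, 0.0, 0.0, 0.0]      # one stretch missed outright

# Same MAE, but the isolated miss carries four times the MSE.
print(mae(diffuse), mae(dropped))
print(mse(diffuse), mse(dropped))

# Ratio of the sourced curve-1 mean MSE values, Dice over Tversky.
mse_ratio = 0.0091 / 0.0021
print(f"curve-1 MSE fell by a factor of {mse_ratio:.2f}")
```

A loss swap that mostly removed large isolated misses would be expected to move MSE much more than MAE, which is the pattern in the curve-1 figures.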

Curve 2 is where the honesty of the trade shows. It is the harder of the two classes, more often the fainter or the more frequently crossed, and Tversky did not make it easy. On the harder example curve-2 goodness-of-fit reached only R-squared 0.5461, and curve-2 mean absolute error under Tversky was 0.1241, higher than Dice's 0.0774 on that class. We could have hidden that by reporting only the average, and we are choosing not to. The recall tilt spends part of its budget, and it spends it on curve 2. Pushing the model to never miss curve 1 makes it slightly more willing to commit on curve 2 in places it should not, and that shows up as more curve-2 error. For this product that is the correct trade. Curve 1 is the primary track the operator needs recovered first, and a curve-1 fit of 0.99 on the clean examples with a real fit on the hard ones was worth accepting a rougher curve 2 that a human reviewer can clean in the loop.
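
The size of the trade can be read directly off the four mean MAE figures. The snippet below only does the arithmetic on the numbers quoted in this piece; the dictionary layout and function name are ours.

```python
# Mean absolute error per curve under each loss, as quoted in the text.
MEAN_MAE = {
    "curve-1": {"dice": 0.0367, "tversky": 0.0277},
    "curve-2": {"dice": 0.0774, "tversky": 0.1241},
}


def mae_change(dice: float, tversky: float):
    """Return the absolute and relative change when moving from Dice to Tversky."""
    delta = tversky - dice
    return delta, delta / dice


changes = {
    curve: mae_change(scores["dice"], scores["tversky"])
    for curve, scores in MEAN_MAE.items()
}

for curve, (delta, relative) in changes.items():
    print(f"{curve}: {delta:+.4f} MAE ({relative:+.1%})")
```

Curve-1 MAE drops by roughly a quarter while curve-2 MAE rises by about 60 percent, so the cost on curve 2 is larger in relative terms than the gain on curve 1, and the case for the trade rests on curve 1 being the track that matters most.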

The precision-recall trade, stated plainly

Every one of these numbers is one operating point on a single trade. Dice sits at the balanced point where a miss and a false alarm cost the same. Tversky lets us walk toward the recall-weighted end, where the miss costs more, and we walked toward it on purpose because the value in a well-log digitiser is asymmetric. A curve the model failed to draw is a hole in the deliverable that someone has to notice is missing, which is the hardest kind of error to catch. A curve the model drew slightly too eagerly is a mark a reviewer can see and erase. Given that asymmetry, tilting the loss toward recall is not a hack to inflate a metric. It is aligning what the model minimises with what the deliverable is worth, and the curve-1 recovery, R-squared 0.9891 on the cleanest example and 0.8126 on a harder scan that Dice had been leaving half-drawn, is what that alignment bought.

Limitations

These figures are per-example and per-curve, not a benchmark. The curve-1 R-squared of 0.9891 is the cleanest example; the 0.8126 is a harder one, and the difficulty axis in the exhibit is a reading aid, not a calibrated scale of example hardness. The Dice-era height the exhibit ghosts for the hard curve-1 point is illustrative, because we archived Dice as mean metrics across examples rather than a matched per-example R-squared for that same case; the sourced facts are the mean errors under each loss and the Tversky per-example R-squared values. The mean errors summarise a specific evaluation set of synthetic multiclass logs and will not transfer unchanged to a different distribution of scans. Curve 2 remaining hard, at R-squared 0.5461 with higher mean absolute error under Tversky than Dice, is a real cost of the recall tilt and not an artefact we tuned away. And the win is a loss-function win only: it says nothing about failure modes upstream of segmentation, such as page reassembly or depth calibration, that a good curve on a mis-assembled image would still get wrong.

References

  1. Salehi, S. S. M., Erdogmus, D., and Gholipour, A. (2017). Tversky loss function for image segmentation using 3D fully convolutional deep networks. MLMI Workshop, MICCAI. https://arxiv.org/abs/1706.05721

  2. Milletari, F., Navab, N., and Ahmadi, S.-A. (2016). V-Net: Fully Convolutional Neural Networks for Volumetric Medical Image Segmentation. 3DV. https://arxiv.org/abs/1606.04797

  3. Tversky, A. (1977). Features of Similarity. Psychological Review, 84(4), 327-352. https://psycnet.apa.org/record/1978-09287-001

© 2026 Copyright. Earthscan