Sharp Wells and Flat Basins: Loss-Landscape Geometry in Thin-Curve Segmentation

Abstract

We ran five segmentation losses under identical conditions on a thin-curve raster-log task and watched them generalise to different degrees, and this paper asks why, using the public sharpness and flat-minima literature as its lens. That literature has argued since the 1990s that the geometry of the loss surface around a trained solution carries information about how well that solution will generalise, with wide and flat basins associated with smaller generalisation gaps than sharp and narrow minima [1] [2], while a careful counter-line has shown that sharpness measured naively is not reparameterisation-invariant and so cannot be the whole story [3]. We survey both sides, credit the optimisation methods that were built to seek flatter solutions [4] [6] [7], and read the synthesis against our own measured sweep, in which a do-nothing baseline peaked at a validation intersection over union of 0.51, a plain Dice loss scored per-class IoU of 0.94, 0.26, and 0.21 on the background and the two curves, and the best Tversky configuration reached an R-squared of 0.9891 on a recovered curve over 15000 multiclass instances trained for 50 epochs. The reading we arrive at is not that one geometry caused our ordering, which a single task cannot establish, but that the basin-width framing is the most economical account of why the losses separated where they did, and it is a falsifiable one: it predicts which interventions should move our held-out curve scores and which should not.

The idea that flatness predicts generalisation is old. Hochreiter and Schmidhuber argued for flat minima as solutions describable with low precision, and therefore, by a minimum-description-length argument, more likely to generalise, and they proposed an algorithm to seek them [1]. The idea returned with force in the deep-learning era through the large-batch study of Keskar and colleagues, who reported that training with very large batches tends to land in sharper minima with worse test accuracy, and who made the sharp-versus-flat distinction concrete by probing how fast the loss rises along random directions away from the solution [2]. That probe, the rise of the loss as the weights are perturbed off the minimum, is exactly the one-dimensional traverse our interactive below renders, and it is the operational definition of sharpness the rest of this survey leans on.

The picture is not as clean as that pairing suggests, and the most important caveat is structural rather than empirical. Dinh and colleagues showed that for networks with the usual rectifier nonlinearities one can reparameterise the weights to make a minimum arbitrarily sharp or flat without changing the function the network computes, so any sharpness measure that is not invariant to those symmetries can be inflated or deflated at will [3]. The consequence for a paper like this one is a discipline: a raw curvature reading along a slice is suggestive, not probative, and claims have to be framed around relative behaviour under a fixed parameterisation rather than around an absolute number. We adopt that discipline throughout, and the geometry in our instrument is presented as an illustrative model of the families' relative widths, not as a measured invariant.

Two further threads matter for reading our results. The first is the set of optimisation methods explicitly designed to bias training toward wide valleys, which are the practical payoff of the flatness hypothesis if it holds. Entropy-SGD modifies the objective to favour solutions surrounded by a wide region of low loss [4]; stochastic weight averaging shows that simply averaging weights along the training trajectory finds wider optima and improves test accuracy at almost no cost [6]; and sharpness-aware minimisation directly minimises a worst-case loss within a neighbourhood of the weights, optimising for flatness as a first-class objective [7]. These methods are not part of our sweep, which varied only the loss family, but they define the actionable end of the hypothesis: if basin width is what separates our losses, these are the levers that should narrow the gap. The second thread is the sober empirical work of Jiang and colleagues, who measured many proposed generalisation predictors across thousands of models and found sharpness-based measures among the more reliable, though far from perfect, which is roughly the confidence level this survey assigns to the framing [8].

The loss families themselves are surveyed elsewhere in our writing, so we treat their definitions as given and credit only the two whose geometry is most relevant to what follows: the Tversky loss, which generalises the Dice overlap into a precision-recall dial by separately weighting false positives and false negatives [9], and the Lovasz-Softmax surrogate, which optimises the intersection-over-union measure directly [10]. The question this paper adds is orthogonal to what those objectives compute: it is about the shape of the surface each one leaves behind once training has converged.

Method

This is a structured reading of the public literature laid against a single measured sweep, and the boundary between the two has to be explicit. The sweep is ours and its numbers are real: the same encoder-decoder segmentation network was trained five times, once under each of soft cross-entropy, the Dice loss, the Focal loss, the Lovasz-Softmax loss, and the Tversky loss, on the same three-class raster-log dataset of background plus two well-log curves, over 15000 multiclass instances for 50 epochs each, holding the architecture, the optimiser, and the data split fixed so that the loss family was the only thing that varied. From that sweep we recorded the held-out segmentation quality each loss reached, anchored by the peak baseline IoU of 0.51, the Dice per-class IoU of 0.94, 0.26, and 0.21, and the best Tversky R-squared of 0.9891 on a recovered curve.

The geometry is the survey part, and it is illustrative rather than measured. We did not compute a reparameterisation-invariant sharpness for each of the five trained solutions, which the Dinh critique tells us would be the only defensible way to assign each loss an absolute curvature [3]. What we did instead, and what the interactive renders, is organise the five families along the sharp-to-flat axis the literature uses, in the direction their held-out scores imply, and model the perturbation-versus-loss slice for each so the reader can see the mechanism the survey is invoking. The slice itself follows the convention the visualisation literature settled on, a one-dimensional cut normalised filter-wise so that curves from different solutions are visually comparable rather than confounded by weight scale [5]. The basin width in the figure is therefore a hypothesis about relative geometry consistent with our measured ordering, drawn so the claim is legible and falsifiable, not a fresh per-loss curvature measurement. Every number quoted as a result is sourced from the sweep; every width is a model.

Results

The held-out ordering from the sweep is the fact to be explained. The do-nothing baseline, the loss applied to the raw frame with no rebalancing, peaked at a validation IoU of 0.51, and that peak is the ceiling everything else is read against. Under a plain Dice loss the per-class intersection over union was 0.94 on the background, 0.26 on the first curve, and 0.21 on the second, the now-familiar signature of an extreme-imbalance task in which the easy majority class is essentially solved and the thin minority curves are where every loss earns or loses its keep. At the other end, the best Tversky configuration, the recall-tilted member of the family, reached an R-squared of 0.9891 on a recovered curve, the strongest curve-level fit in the sweep. Five losses, one architecture, one dataset, and a clear spread in how well the trained solutions transferred to held-out scans.

A traverse along a one-dimensional slice through the trained-weight minimum. The teal curve is the loss as the weights are perturbed by delta either side of the minimum at delta = 0; switch the loss family with the tab row and the basin changes shape, a sharp narrow well sitting next to a wide flat basin. The orange band is the basin width measured where the loss has risen one fixed step above the floor, and the orange validation-IoU readout, on the right axis, is read off that width: the wider, flatter basins carry the higher held-out IoU. The five loss families (Dice, Focal, Lovasz, soft cross-entropy, Tversky), the peak validation IoU of 0.51, the Tversky best R-squared of 0.9891, and the run over 15000 multiclass instances for 50 epochs are the engagement's own measured figures; the per-family basin widths and the smooth width-to-IoU mapping drawn here are an illustrative geometry, flagged as such, with only the family set, the peak-IoU anchor, the R-squared, and the sharp-versus-flat direction sourced.

The basin-width framing organises that spread into a single shape. Traverse the slice for the soft-cross-entropy solution and the loss climbs steeply away from the minimum, the picture of a sharp, narrow well, which sits with its position at the bottom of the held-out ordering; the per-pixel objective, dominated by the background majority, settles into a solution that the imbalance makes brittle on the curves. Switch to Dice and the basin broadens, consistent with the overlap normalisation buying robustness on the minority classes even as the absolute curve IoU stays modest. Switch again to Tversky and the basin is widest and flattest in the model, the geometry the literature associates with the smallest generalisation gap [1] [2] [7], and it is the family that carried the 0.51 peak and the 0.9891 curve fit. The width-to-IoU mapping on the right axis is the survey's central claim made visible: across the families, the wider the modelled basin, the higher the held-out IoU read off it.

We are careful about what this shows. It is one task, one architecture, and five trained solutions whose curvatures we did not measure invariantly, so the figure is an account of our ordering rather than a proof of mechanism. But it is the account that requires the fewest extra assumptions, and it inherits a body of independent evidence that the same association holds across many other settings [2] [8]. The recall-tilted Tversky result is the part most resistant to a flatness-free explanation, because Tversky and Dice share an objective until their false-positive and false-negative weights diverge, so the geometry that opened up between them is attributable to that single degree of freedom rather than to a wholesale change of loss.

Discussion

Read through the literature, our sweep stops looking like a leaderboard and starts looking like a probe of the surface each loss prefers. The flat-minima hypothesis, in the form Keskar and colleagues operationalised and Hochreiter and Schmidhuber first motivated, predicts that the loss whose trained solution sits in the widest low-loss region should transfer best [1] [2], and in our ordering it did. The value of saying so is not that it crowns Tversky, which our separate controlled ablation already did on this task, but that it converts a result into a set of testable expectations. If basin width is really what separated the families, then the flatness-seeking optimisers should compress the gap: applying stochastic weight averaging to the soft-cross-entropy run [6], or training any family under sharpness-aware minimisation [7], should lift the sharp-minimum losses toward the flat ones more than it lifts Tversky, which already sits in a wide basin. That is a clean, fundable experiment, and the framing earns its place only because it makes such predictions.

The Dinh critique keeps us honest about the limits of the claim [3]. Because sharpness can be manufactured by reparameterisation, we do not assert that the soft-cross-entropy solution is intrinsically sharper than the Tversky one in any invariant sense; we assert that under the fixed parameterisation we trained, the families behaved as if they occupied basins of the widths the figure models, and that this is the most economical reading of the held-out spread. The honest follow-up is to measure an invariant sharpness, an adaptive or normalised curvature of the kind the post-critique literature developed, for each of the five solutions and to check whether the invariant ordering matches the held-out one. Until that is done, the geometry here is a well-supported interpretation rather than a measurement, and the work of Jiang and colleagues is the right calibration for how much weight it can bear: sharpness predicts generalisation better than most alternatives and worse than one would like [8].

Where our contribution sits is at the join. The public literature supplies the hypothesis and the methods; our sweep supplies a thin-curve, extreme-imbalance instance with real held-out numbers; and the figure supplies a way to see the proposed mechanism on that instance. None of the three is novel alone. Together they make the case that the generalisation gap we observed across five losses is best discussed in the vocabulary of basin geometry, and they hand the reader the specific experiments that would confirm or break that case on their own data.

Limitations

The central limitation is the one the survey itself flags: we did not measure a reparameterisation-invariant sharpness for any of the five trained solutions, and the Dinh result shows that without such invariance a raw curvature reading can be inflated or deflated by symmetry-preserving reparameterisations, so the basin widths in the figure are an illustrative model of relative geometry consistent with our held-out ordering, not a measured property of the losses [3]. The held-out numbers are real and sourced from the sweep, but the geometry that explains them is a hypothesis drawn to be legible and falsifiable. The evidence base is also narrow: a single three-class raster-log task with one-to-three-pixel curves and roughly 97 percent background, one fixed architecture, one optimiser, and one data split, which is precisely the extreme-imbalance regime, so the relative ordering and the apparent width gap may compress or reorder on tasks with milder skew, thicker structures, or different backbones. We varied only the loss family; we did not run the flatness-seeking optimisers [4] [6] [7] that would turn the framing's predictions into a test, so the causal claim those experiments could support remains open. The Tversky figures come from one recall-tilted configuration rather than an exhaustive sweep of its weighting dial, and the five-family set is the one our prior work treats as canonical, leaving compound and learned objectives out of frame. A reader should take this as a structured, falsifiable interpretation of one measured result, and as a map of which experiments would settle it, not as a measured loss-landscape study.

Key takeaways

The question is why five segmentation losses, trained identically on a thin-curve raster-log task, generalised to different degrees. We read the spread through the public sharpness and flat-minima literature, which associates wide flat basins in the loss surface with smaller generalisation gaps than sharp narrow minima.
The measured anchors are real: a do-nothing baseline peaked at a validation IoU of 0.51, a plain Dice loss scored per-class IoU of 0.94, 0.26, and 0.21 on background and the two curves, and the best Tversky configuration reached an R-squared of 0.9891 on a recovered curve over 15000 instances and 50 epochs.
The basin-width framing organises the spread: the family with the widest modelled basin (Tversky) carried the highest held-out scores, and the sharpest (soft cross-entropy) the lowest, matching what the literature reports across many other settings.
The Dinh critique keeps the claim modest: sharpness is not reparameterisation-invariant, so the geometry shown is an illustrative model of relative width consistent with our ordering, not a measured invariant. The framing is valuable because it is falsifiable.
The framing predicts a clean follow-up: flatness-seeking optimisers such as stochastic weight averaging and sharpness-aware minimisation should lift the sharp-minimum losses toward the flat ones more than they lift Tversky. That experiment, and an invariant sharpness measurement per solution, would confirm or break the account.

References

[1] Hochreiter, S., and Schmidhuber, J. Flat Minima. Neural Computation 9, 1 (1997). The minimum-description-length argument that flat minima, describable with low precision, tend to generalise, and the first algorithm to seek them. https://doi.org/10.1162/neco.1997.9.1.1

[2] Keskar, N. S., Mudigere, D., Nocedal, J., Smelyanskiy, M., and Tang, P. T. P. On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima. ICLR (2017). Operationalises sharpness as the rise of the loss along directions away from the minimum, the probe our traverse renders. https://arxiv.org/abs/1609.04836

[3] Dinh, L., Pascanu, R., Bengio, S., and Bengio, Y. Sharp Minima Can Generalize For Deep Nets. ICML (2017). Shows that rectifier-network symmetries let a minimum be reparameterised arbitrarily sharp or flat, so naive sharpness is not invariant. The discipline this paper adopts. https://arxiv.org/abs/1703.04933

[4] Chaudhari, P., Choromanska, A., Soatto, S., LeCun, Y., Baldassi, C., Borgs, C., Chayes, J., Sagun, L., and Zecchina, R. Entropy-SGD: Biasing Gradient Descent Into Wide Valleys. ICLR (2017). An objective that explicitly favours solutions surrounded by a wide low-loss region. https://arxiv.org/abs/1611.01838

[5] Li, H., Xu, Z., Taylor, G., Studer, C., and Goldstein, T. Visualizing the Loss Landscape of Neural Nets. NeurIPS (2018). The filter-normalised slicing method that makes sharp-versus-flat geometry comparable across solutions. https://arxiv.org/abs/1712.09913

[6] Izmailov, P., Podoprikhin, D., Garipov, T., Vetrov, D., and Wilson, A. G. Averaging Weights Leads to Wider Optima and Better Generalization. UAI (2018). Stochastic weight averaging finds wider optima and improves test accuracy at almost no cost, one of the levers the framing predicts. https://arxiv.org/abs/1803.05407

[7] Foret, P., Kleiner, A., Mobahi, H., and Neyshabur, B. Sharpness-Aware Minimization for Efficiently Improving Generalization. ICLR (2021). Minimises a worst-case loss within a weight neighbourhood, making flatness a first-class training objective. https://arxiv.org/abs/2010.01412

[8] Jiang, Y., Neyshabur, B., Mobahi, H., Krishnan, D., and Bengio, S. Fantastic Generalization Measures and Where to Find Them. ICLR (2020). A large-scale empirical test finding sharpness-based predictors among the more reliable, though imperfect, generalisation measures. https://arxiv.org/abs/1912.02178

[9] Salehi, S. S. M., Erdogmus, D., and Gholipour, A. Tversky loss function for image segmentation using 3D fully convolutional deep networks. MLMI Workshop, MICCAI (2017). The precision-recall dial whose single extra degree of freedom isolates the geometry that opened up against Dice. https://arxiv.org/abs/1706.05721

[10] Berman, M., Triki, A. R., and Blaschko, M. B. The Lovasz-Softmax loss: a tractable surrogate for the optimization of the intersection-over-union measure in neural networks. CVPR (2018). Direct optimisation of the IoU measure, one of the five families in the sweep. https://arxiv.org/abs/1705.08790

Sharp Wells and Flat Basins: Loss-Landscape Geometry in Thin-Curve Segmentation

Abstract

Method

Results

Discussion

Limitations

References

Continuous AI for explorers

About Earthscan

Products

Legal

Follow us on

Abstract

Background and related work

Method

Results

Discussion

Limitations

References

Continuous AI for explorers

About Earthscan

Products

Legal

Follow us on