Binary Segmentation Hit a Wall at F1 0.55, So We Switched to Multiclass

Every model has a number it cannot beat, and ours was F1 0.55. We had built the first segmentation stage of VeerNet, our raster-log digitization model, to read a scanned well-log and mark where each curve runs. It was trained, it was converging, and it was stuck. Across the three curve masks we cared about, the F1 scores read 0.37, 0.26, and 0.55. Recall sat at 0.96 to 0.97 on all three. We poured more class weight into the loss, we generated more synthetic data, we let it train longer. The ceiling held. This is the story of the day we stopped trying to climb the wall and realized we had built it ourselves, by stating the problem wrong.

The first regime: one binary mask per curve

The obvious way to segment curves out of a scanned log is per-pixel binary classification. For each curve you want to extract, you train the network to answer one question at every pixel: is this curve here, yes or no? A log with three curves of interest becomes three independent binary segmentation heads, each producing a mask that is foreground where its curve runs and background everywhere else. The architecture is standard encoder-decoder territory, the lineage that runs from fully convolutional networks [1] through the U-Net skip-connection design [2], and it is the first thing any segmentation practitioner reaches for.

We trained this binary regime on 2,000 synthetic instances. The training ran at a batch size of 1, forced on us by the fact that our synthetic logs vary in size from one example to the next and we had not yet written the machinery to pad them into a uniform batch. Each mask got its own loss. And immediately we hit the structural problem that haunts every segmentation task where the foreground is thin: class imbalance, in its most extreme form.

A curve on a well-log is roughly one pixel wide. The log image around it is overwhelmingly background. When you ask a network to call each pixel foreground or background, and 99-point-something percent of the pixels are background, the loss-minimizing strategy is brutally simple. Predict background almost everywhere, dust a little foreground along the obvious dark traces, and the per-pixel loss barely registers the curves at all. The network is not wrong to do this. We told it to minimize per-pixel error, and predicting background is how you minimize per-pixel error when background is the world.
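
To put a rough number on that imbalance, here is a minimal sketch. It assumes a hypothetical log image 512 pixels wide and 2,048 pixels tall, with a single curve that occupies one pixel per row. Those dimensions are illustrative, not figures from our dataset.

```python
# Hypothetical dimensions: illustrative only, not taken from the VeerNet data.
HEIGHT, WIDTH = 2048, 512


def foreground_fraction(height: int, width: int, curve_pixels_per_row: int = 1) -> float:
    """Fraction of pixels covered by a curve that is one pixel wide in every row."""
    return (height * curve_pixels_per_row) / (height * width)


def all_background_accuracy(height: int, width: int) -> float:
    """Per-pixel accuracy of a model that predicts background everywhere."""
    return 1.0 - foreground_fraction(height, width)


fraction = foreground_fraction(HEIGHT, WIDTH)
accuracy = all_background_accuracy(HEIGHT, WIDTH)
print(f"foreground fraction: {fraction:.4%}")
print(f"accuracy of predicting background everywhere: {accuracy:.4%}")
```

Under these assumed dimensions the curve covers about 0.2 percent of the image, so a model that never predicts a curve is still right on more than 99 percent of pixels.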

Class weight 42, and the wall it built

The textbook answer to foreground starvation is to reweight the loss so a foreground mistake costs far more than a background one. We did exactly that. We pushed the positive-class weight in the binary cross-entropy loss up to 42, telling the optimizer that missing a curve pixel is forty-two times worse than a spurious one. This is the same instinct that motivates focal loss, which reshapes the loss to stop easy background examples from drowning out the rare foreground [3].
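
As a minimal sketch of what that weight does to a single pixel's loss, the function below scales only the foreground term of binary cross-entropy. The function name and the example probabilities are hypothetical, and this is the common positive-weight convention rather than a copy of our training code.

```python
import math

POS_WEIGHT = 42.0  # positive-class weight quoted in the text


def weighted_bce(prob: float, target: int, pos_weight: float = POS_WEIGHT) -> float:
    """Binary cross-entropy for one pixel, with the foreground term scaled by pos_weight."""
    eps = 1e-12
    prob = min(max(prob, eps), 1.0 - eps)  # clamp to keep log() finite
    if target == 1:
        return -pos_weight * math.log(prob)
    return -math.log(1.0 - prob)


# A confident miss on a curve pixel (false negative) ...
false_negative = weighted_bce(prob=0.1, target=1)
# ... versus an equally confident false alarm on a background pixel (false positive).
false_positive = weighted_bce(prob=0.9, target=0)

print(f"false negative loss: {false_negative:.2f}")
print(f"false positive loss: {false_positive:.2f}")
print(f"ratio: {false_negative / false_positive:.1f}")
```

At equal confidence, the missed curve pixel costs 42 times what the false alarm does, which is exactly the asymmetry the weight was chosen to create.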

It worked, in the narrow sense that it forced the network to stop ignoring the curves. Recall climbed to 0.96 and 0.97. The model was now finding nearly every curve pixel. But look at what that recall costs when it is the only thing the loss rewards. A weight of 42 tells the network that a false negative is catastrophic and a false positive is nearly free. So the network does the rational thing: it predicts curve generously, smearing foreground across anything that might plausibly be a trace, because over-predicting is cheap and under-predicting is ruinous. Recall goes to the ceiling. Precision falls through the floor.

That is the wall. F1 is the harmonic mean of precision and recall, and a harmonic mean is dragged toward its smaller term. Working backward from the F1 and recall figures, precision on the three masks was roughly 0.23, 0.15, and 0.38. Even at perfect recall, F1 can be no higher than 2P / (1 + P), so a precision of 0.38 caps F1 at about 0.55. Our best mask reached 0.55 not because the model was half-good, but because precision capped it there. We had a high-recall, low-precision regime, and no amount of additional class weight could fix it, because class weight was the thing producing it. Every notch of weight we added bought a little more recall we did not need and cost a little more precision we could not afford.
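
Precision for the binary masks can be recovered from the reported F1 and recall by solving F1 = 2PR / (P + R) for P. The sketch below does that for the three masks and also computes the highest F1 each precision would allow if recall were perfect. The helper names are ours for illustration.

```python
def precision_from_f1(f1: float, recall: float) -> float:
    """Solve F1 = 2PR / (P + R) for P."""
    return f1 * recall / (2.0 * recall - f1)


def f1_ceiling(precision: float) -> float:
    """Largest F1 reachable at this precision, attained when recall is 1.0."""
    return 2.0 * precision / (1.0 + precision)


# (F1, recall) per binary mask, as reported in the text.
masks = [(0.37, 0.96), (0.26, 0.97), (0.55, 0.97)]

for f1, recall in masks:
    p = precision_from_f1(f1, recall)
    print(
        f"F1={f1:.2f} recall={recall:.2f} -> precision={p:.2f}, "
        f"F1 ceiling at perfect recall={f1_ceiling(p):.2f}"
    )
```

Because recall is already within a few points of 1.0, each mask's F1 sits almost exactly at its ceiling. The only way to raise it is to raise precision.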

Why the binary regime plateaued

A one-pixel-wide curve makes the foreground a tiny fraction of the image, so an unweighted per-pixel loss is minimized by predicting background almost everywhere and the curves vanish.
Pushing the binary cross-entropy class weight to 42 forced recall up to 0.96 to 0.97, but it did so by making the model over-predict curve pixels, which crushed precision.
F1 is capped by precision once recall is near 1 (it can be at most 2P / (1 + P)), so with precision between roughly 0.15 and 0.38 the masks topped out at 0.37, 0.26, and 0.55 no matter how long we trained or how much synthetic data we generated.

Diagnosing it as the wrong problem, not a bad model

The instinct when a model plateaus is to assume the model is the problem. A deeper backbone, a different loss, more regularization, more data. We had been doing all of that. What changed our minds was reading the metrics honestly rather than as a single F1 number to be maximized.

Recall near total with precision far below it is a specific, diagnosable signature. It is not the signature of an undertrained model or an underpowered architecture. It is the signature of a model that has been told the only sin is a false negative, operating on a target where false positives are nearly unconstrained. The binary regime treats every curve as an independent yes-or-no question against an ocean of background, and that framing has no mechanism to make the model pay for spraying foreground around. Each binary head competes only against background, never against the other curves. Nothing in the loss says that a pixel claimed by curve one cannot also be claimed by curve two.

The exhibit below is the diagnosis we kept coming back to. It plots each mask in precision-recall space. In the binary regime the three points sit stranded in the same corner: recall against the right wall, precision far below it. The multiclass view of the same plot shows where the problem lands once the target is reframed.

Exhibit: per-mask precision and recall under the two regimes (an interactive plot in the original). Binary regime, class weight 42, 2,000 instances at batch size 1: F1 of 0.37 / 0.26 / 0.55 and recall of 0.96 / 0.97 / 0.97 across the three masks, which places all three points in the high-recall, low-precision corner. Binary precision is derived from the sourced F1 and recall. Multiclass regime, Dice loss, 15,000 instances at batch size 16: precision of 0.96 / 0.41 / 0.36 and recall of 0.97 / 0.37 / 0.32 for background / curve 1 / curve 2, so the per-class points spread out, with background near the ceiling and the two curve classes showing roughly balanced precision and recall.

Once we saw the binary points clustered in that corner, the fix stopped being a hyperparameter and started being a problem statement. The masks were not failing independently. They were all failing the same way, for the same structural reason, and that reason was the binary framing itself.

The pivot: one three-class softmax instead of a stack of binary masks

The reframe was to stop predicting masks and start predicting classes. Instead of three independent binary heads each asking "is my curve here," we built a single multiclass head that asks, at every pixel, one mutually exclusive question: is this background, curve one, or curve two? Three classes, one softmax, every pixel assigned to exactly one of them.

This is a small change in the output layer and a large change in what the model is forced to learn. Under a softmax the classes compete. The probability mass at each pixel has to sum to one, so claiming a pixel for curve one means not claiming it for curve two or for background. The model can no longer get away with smearing foreground everywhere, because foreground is no longer free: it is mutually exclusive. A pixel spent on a curve is a pixel taken from background, and the loss will punish that if it is wrong. The precision pressure that was missing from the binary regime is built into the multiclass formulation by construction. This is the standard footing for semantic segmentation, where dense per-pixel softmax classification over a fixed label set is the default frame [4].
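
The difference is easy to see on a single pixel. The sketch below compares two independent sigmoid heads with one three-way softmax. The logit values are hypothetical, chosen so that both curves look plausible at that pixel.

```python
import math
from typing import List


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def softmax(logits: List[float]) -> List[float]:
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for one pixel where two traces cross.
curve_logits = [3.0, 2.5]        # independent binary heads: curve 1, curve 2
class_logits = [0.0, 3.0, 2.5]   # one softmax head: background, curve 1, curve 2

binary_probs = [sigmoid(v) for v in curve_logits]
multiclass_probs = softmax(class_logits)

print("independent sigmoids:", [round(p, 3) for p in binary_probs])
print("softmax:", [round(p, 3) for p in multiclass_probs])
print("softmax winner (class index):", multiclass_probs.index(max(multiclass_probs)))
```

With independent heads both curves clear a 0.5 threshold and both claim the pixel. Under the softmax the same evidence has to be split, and only one class can win.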

The change in problem statement also changed the engineering around it. We scaled the synthetic corpus from 2,000 binary instances to 15,000 multiclass instances, because the harder task of separating one curve from another, rather than each curve from background alone, needs more varied crossings and overlaps to learn. We also fixed the batching. The binary stage had run at batch size 1 because of the variable image dimensions; for the multiclass stage we wrote a custom collate function that pads variable-width logs into a single tensor, which let us train at a batch size of 16. The pivot was not only a different head. It was a different dataset scale and a different training pipeline, all flowing from the decision to treat curve separation as a classification problem rather than a stack of detections.
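
The padding idea behind that collate function can be sketched without any framework. The version below works on nested lists rather than tensors, pads on the right and bottom, and returns a validity mask so a loss could ignore the padded region. The name pad_collate, the padding side, the fill value, and the mask are all assumptions made for illustration, not a description of our production code.

```python
from typing import List, Tuple

Image = List[List[float]]  # one grayscale log as rows of pixel values
Mask = List[List[int]]     # 1 = real pixel, 0 = padding


def pad_collate(batch: List[Image], pad_value: float = 0.0) -> Tuple[List[Image], List[Mask]]:
    """Pad every image to the largest height and width found in the batch."""
    max_h = max(len(img) for img in batch)
    max_w = max(len(row) for img in batch for row in img)

    padded: List[Image] = []
    masks: List[Mask] = []
    for img in batch:
        out_img: Image = []
        out_mask: Mask = []
        for r in range(max_h):
            row = img[r] if r < len(img) else []  # rows past the image are all padding
            gap = max_w - len(row)
            out_img.append(list(row) + [pad_value] * gap)
            out_mask.append([1] * len(row) + [0] * gap)
        padded.append(out_img)
        masks.append(out_mask)
    return padded, masks


narrow = [[0.2, 0.4], [0.1, 0.3]]   # 2 rows x 2 columns
wide = [[0.5, 0.6, 0.7, 0.8]]       # 1 row x 4 columns
images, valid = pad_collate([narrow, wide])

for img in images:
    print(len(img), "x", len(img[0]))
```

Every image in the returned batch has the same shape, which is what allows a batch size larger than 1. In a tensor-based pipeline the same logic would run inside the data loader's collate hook.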

What the multiclass regime bought, and what it did not

The honest result is that multiclass broke the ceiling without making the problem easy. Under a per-class Dice loss the multiclass model reached precision of 0.96, 0.41, and 0.36 and recall of 0.97, 0.37, and 0.32 across background, curve one, and curve two, with intersection-over-union of 0.94, 0.26, and 0.21. Read those curve-class numbers without flinching: segmenting a thin, faded, overlapping trace out of a noisy scan is genuinely hard, and the per-curve scores say so.
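
For readers unfamiliar with Dice loss, here is a minimal sketch of a per-class soft Dice computation on a flattened strip of pixels. The smoothing constant, the toy probabilities, and the function name are assumptions. The text above does not specify which Dice variant we trained with, so treat this as one common formulation.

```python
from typing import List

EPS = 1e-6  # smoothing term to avoid division by zero; the value is an assumption


def dice_loss_per_class(probs: List[List[float]], labels: List[int], num_classes: int) -> List[float]:
    """Soft Dice loss for each class.

    probs  -- per-pixel class probabilities, shape [num_pixels][num_classes]
    labels -- per-pixel ground-truth class index, length num_pixels
    """
    losses = []
    for c in range(num_classes):
        # Probability mass the model placed on class c where class c is the truth.
        intersection = sum(p[c] for p, y in zip(probs, labels) if y == c)
        predicted = sum(p[c] for p in probs)
        actual = sum(1 for y in labels if y == c)
        dice = (2.0 * intersection + EPS) / (predicted + actual + EPS)
        losses.append(1.0 - dice)
    return losses


# Hypothetical six-pixel strip: 0 = background, 1 = curve one, 2 = curve two.
labels = [0, 0, 0, 0, 1, 2]
probs = [
    [0.9, 0.05, 0.05],
    [0.9, 0.05, 0.05],
    [0.8, 0.1, 0.1],
    [0.7, 0.2, 0.1],
    [0.3, 0.6, 0.1],
    [0.4, 0.2, 0.4],
]
per_class = dice_loss_per_class(probs, labels, num_classes=3)
mean_loss = sum(per_class) / len(per_class)

print("per-class Dice loss:", [round(v, 3) for v in per_class])
print("mean Dice loss:", round(mean_loss, 3))
```

Each class's term is normalized by that class's own predicted and true mass, so a curve covering one pixel contributes as much to the mean as the background covering four. That is why Dice is a natural fit when the foreground is thin.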

But notice what moved. In the binary regime the curve masks were pinned to the recall axis, precision on the floor, F1 capped by precision. In the multiclass regime the curve classes are off the wall. Precision and recall are now in the same neighborhood as each other, in the 0.3 to 0.4 range, rather than one near 1.0 and the other far below it (roughly 0.15 to 0.38 when derived from the binary F1 and recall). The error budget has been redistributed. The model is no longer buying recall it does not need at the cost of precision it cannot afford. It is making the genuine, hard tradeoff between the two curves and the background, which is the tradeoff the problem actually contains. The background class sits near the ceiling, exactly where a well-posed model should put the easy class, instead of dominating the entire prediction.

That redistribution is the win. The binary regime gave us a high-recall, low-precision model that could not improve, because its framing forbade precision. The multiclass regime gave us a balanced-error model that can improve, because its framing prices both kinds of error. The ceiling we hit at F1 0.55 was a property of the question we were asking, and changing the question is what let the metric move again.

The lesson we took into the rest of the project

The durable lesson here has nothing to do with well-logs specifically. It is that a plateau with a lopsided precision-recall signature is usually a problem-statement bug, not a model bug. We spent real effort tuning the binary regime, pushing class weight, adding data, training longer, and every one of those levers was pulling against a ceiling that the binary framing had welded in place. The moment that mattered was not a better optimizer. It was reading the metrics as a diagnosis, recognizing the high-recall low-precision corner for what it was, and asking whether the target itself was wrong.

It was. Curves on a log are not independent yes-or-no questions against background. They are mutually exclusive classes that compete for the same pixels, and the model only learns to separate them when the output layer forces that competition. Once we framed it that way, with one softmax over background and two curves, the rest of VeerNet's segmentation work, the loss-function ablations and the synthetic-data scaling, became improvements on a well-posed problem rather than attempts to climb a wall we had built. The architecture pivot from binary masks to a multiclass softmax was the decision the whole project turned on.

References

[1] Long, J., Shelhamer, E., and Darrell, T. (2015). Fully Convolutional Networks for Semantic Segmentation. CVPR 2015. https://arxiv.org/abs/1411.4038
[2] Ronneberger, O., Fischer, P., and Brox, T. (2015). U-Net: Convolutional Networks for Biomedical Image Segmentation. MICCAI 2015. https://arxiv.org/abs/1505.04597
[3] Lin, T.-Y., Goyal, P., Girshick, R., He, K., and Dollár, P. (2017). Focal Loss for Dense Object Detection. ICCV 2017. https://arxiv.org/abs/1708.02002
[4] Chen, L.-C., Zhu, Y., Papandreou, G., Schroff, F., and Adam, H. (2018). Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation. ECCV 2018. https://arxiv.org/abs/1802.02611
