Why We Chose Multiclass Over Binary Segmentation

Before you pick a loss function or a class weight for a segmentation model, you make a quieter decision that shapes everything after it: what the network is actually asked to predict at each pixel. It is easy to skip past, because both obvious options train and both produce a loss curve that goes down. This note is about the two shapes we tried for the curve-segmentation target behind VeerNet, the encoder-decoder EarthScan uses to lift curves off scanned paper well logs, and why the shape we kept was the one that separated the two curves rather than the one that scored best on a single metric. It is not the class-imbalance story about how hard to weight the rare pixels; here the weight is held fixed and the only thing that changes is the target.

Two shapes for the same problem

The task is narrow and worth stating plainly. A digitised raster log carries two curves winding down the depth axis, and we want a pixel-level prediction that says which curve, if any, each pixel belongs to. There are two natural ways to phrase that.

The first is a stack of binary masks. Give each curve its own output channel with its own sigmoid, and train each channel as an independent yes-or-no question: is this pixel part of curve one, and separately, is this pixel part of curve two. This is the segmentation analogue of one-vs-all classification, and it is an appealing default because each mask is simple, the masks can be trained and debugged in isolation, and the framing matches how you first think about the problem, one curve at a time. Rifkin and Klautau make the honest case that a well-tuned set of one-vs-all binary classifiers is a genuinely strong baseline and not the strawman it is sometimes treated as [1]. We took that seriously and built it.

The second shape is a single multiclass pass. One head predicts, per pixel, a distribution over three mutually exclusive classes: background, curve one, and curve two. A softmax normalises across those three so their probabilities sum to one, which means the pixel is forced to commit. This has been the standard formulation for dense semantic labelling since fully convolutional networks framed segmentation as a per-pixel softmax over all classes at once [2], and it is the kind of head an encoder-decoder like U-Net trains over foreground and background together [3].

Both shapes fit the same encoder-decoder body. The difference is entirely in the last layer and the loss it feeds, which is exactly why this is a formulation decision and not an architecture one.
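To make the last-layer difference concrete, here is a minimal sketch of the two heads applied to one pixel. The function names and the logit values are hypothetical, chosen only for illustration; this is not VeerNet's code.

```python
import math

def sigmoid_head(logits):
    """Binary-mask stack: one independent sigmoid per curve channel."""
    return [1.0 / (1.0 + math.exp(-z)) for z in logits]

def softmax_head(logits):
    """Multiclass pass: one distribution over mutually exclusive classes."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for one ambiguous pixel lying between the two curves.
binary_probs = sigmoid_head([2.0, 2.0])      # [curve 1, curve 2]
multi_probs = softmax_head([0.0, 2.0, 2.0])  # [background, curve 1, curve 2]

print("sigmoid stack:", binary_probs)
print("softmax pass: ", multi_probs, "sum =", sum(multi_probs))
```

With these inputs both sigmoid channels sit above 0.5 at once, while the softmax has to split the same evidence, so neither curve can exceed 0.5 when the two are tied.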

What the stacked masks actually did

We trained the binary-mask stack with a weighted binary cross-entropy, pushing the positive-class weight up so the model would stop ignoring the thin curve pixels. At a class weight of 42, roughly the inverse of how rare the curve pixels are, the recall told a good story: the masks reached recall of 0.96 and 0.97. The model was finding almost every curve pixel it was supposed to find.
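A weighted binary cross-entropy of this kind can be sketched per pixel as follows. The weight of 42 is the one quoted above. The function name, the clipping constant, and the example probabilities are assumptions made for illustration, and the exact loss implementation used in training is not given here.

```python
import math

def weighted_bce(p, y, pos_weight):
    """Weighted binary cross-entropy for a single pixel.

    p          -- predicted probability that the pixel is on the curve
    y          -- 1 for a curve pixel, 0 for background
    pos_weight -- multiplier applied only to the positive (curve) term
    """
    eps = 1e-7  # clip to avoid log(0)
    p = min(max(p, eps), 1.0 - eps)
    return -(pos_weight * y * math.log(p) + (1 - y) * math.log(1.0 - p))

POS_WEIGHT = 42.0  # class weight quoted in the text

# Two equally wrong predictions with very different costs.
missed_curve = weighted_bce(0.1, 1, POS_WEIGHT)  # curve pixel called background
false_alarm = weighted_bce(0.9, 0, POS_WEIGHT)   # background pixel called curve

print("missed curve pixel:", missed_curve)
print("false alarm:       ", false_alarm)
```

The asymmetry is the point: a missed curve pixel costs 42 times what an equally confident false alarm costs, which pushes the model toward recall and leaves over-firing cheap.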

The F1 told a different story, and it is the one that mattered. On the three binary masks the F1 came in at 0.37, 0.26, and 0.55. The best mask peaked at 0.55 and would not go higher, and two of them sat well below that. High recall with a stuck F1 has only one explanation: because F1 is the harmonic mean of precision and recall, precision must have been weak and stayed weak. The masks were firing generously, catching the real curve pixels and a great many wrong ones alongside them.
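The precision behind those numbers can be backed out from the definition F1 = 2PR / (P + R), which rearranges to P = F1 * R / (2R - F1). The sketch below does that arithmetic. The archive figures do not say which recall value belongs to which F1 value, so it tries both recalls against each F1 rather than assuming a pairing; the function name is hypothetical.

```python
def implied_precision(f1, recall):
    """Solve F1 = 2PR / (P + R) for precision P."""
    return f1 * recall / (2.0 * recall - f1)

F1_VALUES = [0.37, 0.26, 0.55]  # binary-mask F1 values from the text
RECALL_VALUES = [0.96, 0.97]    # recall values from the text

# The pairing of recall to mask is not stated, so every combination is shown.
implied = {
    (f1, r): implied_precision(f1, r)
    for f1 in F1_VALUES
    for r in RECALL_VALUES
}

for (f1, r), p in implied.items():
    print(f"F1={f1:.2f}, recall={r:.2f} -> implied precision {p:.3f}")
```

Whichever pairing is right, the best mask's precision works out to roughly 0.38 to 0.39, meaning most of the pixels it marked as curve were not curve.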

The structural reason is the part worth keeping. Because each mask is an independent binary question, nothing in the formulation stops both masks from answering yes for the same pixel. A pixel sitting between the two curves, or on a spot where the ink is ambiguous, can be claimed by the curve-one mask and the curve-two mask at the same time, and each mask is individually rewarded for its confident yes. There is no term in a stack of independent losses that says these two curves are different things and this pixel belongs to at most one of them. You can raise the weight all day; weight changes how much a missed pixel costs, not whether the two masks are allowed to overlap. That is why the F1 plateaus. The ceiling is a property of the target shape, not of the tuning.
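The overlap is easy to see on a toy row of pixels. In the sketch below the probabilities are invented for illustration and the helper name is hypothetical; it simply lists the pixels that both independent masks claim at a given threshold.

```python
def double_claimed(curve1_probs, curve2_probs, threshold=0.5):
    """Indices of pixels that both independent masks answer yes for."""
    return [
        i
        for i, (p1, p2) in enumerate(zip(curve1_probs, curve2_probs))
        if p1 >= threshold and p2 >= threshold
    ]

# Hypothetical sigmoid outputs along one scan row crossing both curves.
curve1 = [0.95, 0.90, 0.70, 0.20, 0.05]
curve2 = [0.05, 0.30, 0.80, 0.90, 0.95]

overlap = double_claimed(curve1, curve2)
print("pixels claimed by both masks:", overlap)
```

Each channel's loss only ever sees its own output and its own label, so nothing in training penalises a non-empty overlap list, whatever the positive-class weight is.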

The reformulation

Switching to a single three-class softmax head changes precisely the thing that was broken. Because the three class probabilities are normalised to sum to one, a pixel that becomes more likely to be curve one becomes, by the same arithmetic, less likely to be curve two. The commitment is built into the output, not bolted on with a rule. The model can no longer hand the same pixel to both curves, so the between-curve confusion that was dragging precision down has nowhere to live. This is also what lets an overlap-based objective like Dice work cleanly per class, because each class is competing for pixels rather than each mask independently grabbing them [4].
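A per-class overlap objective on top of softmax outputs can be sketched as below. This is one common soft-Dice variant, written on flat lists for readability; it is not necessarily the exact form in [4], which squares the terms in the denominator, nor the loss VeerNet was trained with. The function names and toy data are assumptions.

```python
def soft_dice_per_class(probs, labels, num_classes=3, eps=1e-6):
    """Soft Dice score for each class.

    probs  -- list of per-pixel distributions, e.g. [background, curve 1, curve 2]
    labels -- list of integer class ids, one per pixel
    """
    scores = []
    for c in range(num_classes):
        target = [1.0 if y == c else 0.0 for y in labels]
        pred = [p[c] for p in probs]
        intersection = sum(p * t for p, t in zip(pred, target))
        denom = sum(pred) + sum(target)
        scores.append((2.0 * intersection + eps) / (denom + eps))
    return scores

def dice_loss(probs, labels, num_classes=3):
    """One minus the mean per-class Dice score."""
    scores = soft_dice_per_class(probs, labels, num_classes)
    return 1.0 - sum(scores) / len(scores)

# Toy example: four pixels labelled background, curve 1, curve 2, curve 1.
labels = [0, 1, 2, 1]
perfect = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [0.0, 1.0, 0.0]]
blurred = [[0.2, 0.4, 0.4]] * 4  # cannot tell the two curves apart

perfect_scores = soft_dice_per_class(perfect, labels)
blurred_scores = soft_dice_per_class(blurred, labels)
print("perfect:", perfect_scores, "loss:", dice_loss(perfect, labels))
print("blurred:", blurred_scores, "loss:", dice_loss(blurred, labels))
```

Because every pixel's probabilities sum to one, mass given to curve two is mass taken from curve one, so a prediction that blurs the two curves is penalised in both class scores at once.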

The exhibit below makes the two shapes tangible side by side. On the left, drag the class weight in the binary stack from one up to the sourced 42 and watch the recall bars climb toward 0.96 and 0.97 while the F1 bars refuse to cross the 0.55 line. On the right is the reformulation: the three-class pass drawn as a partition where each depth cell is owned by exactly one class, the mutually exclusive assignment the stacked masks could never produce. The orange partition is the whole argument, and it is the only element on the plate that carries it.

The same two-curve segmentation problem shaped two ways. Panel A stacks a separate binary mask per curve, each trained with a weighted binary cross-entropy; drag the positive-class weight from 1 up to the sourced 42 and the recall bars climb to 0.96 and 0.97 while the F1 bars refuse to cross the 0.55 ceiling, because independent masks are free to claim the same pixels, so precision never recovers. Panel B is the reformulation: one three-class softmax pass over background, curve 1 and curve 2, which is mutually exclusive by construction, so a pixel given to curve 1 cannot also be curve 2. The orange partition on the right is the only element that argues: the single pass separates the two curves the stacked masks kept blurring together. The binary F1 values 0.37 / 0.26 / 0.55, the recall values 0.96 / 0.97 at class_weight 42, and the three-class reframe are sourced from the engagement archive; the intermediate recall and F1 values along the weight sweep are an illustrative interpolation toward those sourced end-points.

The point the instrument is built to land is that the left panel does not have a tuning problem. It has a shape problem. No setting of the weight lever moves the F1 past its ceiling, because the ceiling comes from letting two independent masks answer for the same pixel. The right panel is not a better-tuned version of the left; it is a different question asked of the same network.

Why we did not just weight harder

The tempting response to a stuck F1 is to keep turning knobs on the formulation you already have: push the weight past 42, add a precision term, threshold the masks harder at inference. Each buys a little and none removes the cause, which is that the two curves are modelled as unrelated binary events when they are in fact competing labels for one pixel. Rifkin and Klautau are careful that one-vs-all is strong when the per-class problems really are separable and independent [1]. Two curve traces a few pixels apart on the same scan are not: whether a pixel is curve one is bound up with whether it is curve two. When the labels compete, the joint formulation that encodes the competition is the right shape, and no amount of weighting on the wrong shape substitutes for it.

The multiclass shape is not a free lunch. The single softmax head has to be trained on the whole three-class problem at once rather than as three cheap independent masks, and it is less convenient to debug because you cannot isolate one curve's channel. In exchange, it settles later questions the stack would have kept reopening: the loss can be an overlap objective that competes classes against each other, and there is no inference-time tie-breaking rule for when two masks both claim a pixel, because that case cannot arise. The stack would have forced exactly that hand-tuned rule at serve time; the reformulation removed the need for it by construction.
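The serve-time difference can be sketched as two decoders. The tie-break shown for the stack (the higher probability wins) is a hypothetical rule picked for illustration, not one taken from the engagement; the function names and inputs are likewise assumptions.

```python
def decode_multiclass(probs):
    """Three-class pass: each pixel takes the class with the largest probability."""
    return [max(range(len(p)), key=p.__getitem__) for p in probs]

def decode_stack(curve1_probs, curve2_probs, threshold=0.5):
    """Binary stack: thresholds plus an explicit rule for double claims.

    Returns 0 for background, 1 for curve 1, 2 for curve 2.
    """
    labels = []
    for p1, p2 in zip(curve1_probs, curve2_probs):
        yes1, yes2 = p1 >= threshold, p2 >= threshold
        if yes1 and yes2:
            # Hand-written tie-break (hypothetical): higher probability wins.
            labels.append(1 if p1 >= p2 else 2)
        elif yes1:
            labels.append(1)
        elif yes2:
            labels.append(2)
        else:
            labels.append(0)
    return labels

multi_labels = decode_multiclass([[0.1, 0.6, 0.3], [0.8, 0.1, 0.1]])
stack_labels = decode_stack([0.9, 0.7, 0.2], [0.1, 0.8, 0.3])
print("multiclass:", multi_labels)
print("stack:     ", stack_labels)
```

The multiclass decoder has no branch for a double claim because the output cannot express one; the stack decoder needs both a threshold and a tie-break, and each is another value to tune.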

None of the machinery here is ours. The multiclass softmax formulation for dense labelling is the fully-convolutional and U-Net lineage [2] [3], the overlap objective is V-Net's [4], and the clear-eyed account of when independent binary classifiers are and are not the right call is Rifkin and Klautau's [1]. What we contributed was recognising, on our own numbers, that the 0.55 F1 ceiling was a symptom of target shape and not of tuning, and changing the shape rather than continuing to weight a formulation that could not separate two competing curves.

Limitations

This is one formulation decision on one engagement's data, not a universal ranking of multiclass over binary. The binary-stack F1 values of 0.37, 0.26, and 0.55 and the recall values of 0.96 and 0.97 are the real archive numbers at class weight 42, but the intermediate recall and F1 the instrument draws as you move the weight lever are an illustrative interpolation toward those sourced end-points, not a logged weight sweep; the plateau argument holds at the measured endpoints. The claim is specific to a problem where the classes genuinely compete for the same pixels, as two nearby curve traces do. A task whose foreground classes are spatially disjoint and truly independent is exactly the setting where a one-vs-all binary stack is a strong choice, and this note should not be read as an argument against it there. We also do not claim the three-class pass is good in absolute terms; separating the curves is necessary for a usable digitised log but not sufficient, and the downstream question of whether a separated prediction reconstructs a curve a petrophysicist would accept is a different measurement this note does not make.

The decision under the decision

The habit this left us with is to treat the shape of the prediction as a first-class choice rather than a default inherited from how the problem first occurred to us. A stack of binary masks is the shape you reach for when you think of the curves one at a time. A three-class pass is the shape you reach for when you notice the curves are competing for the same pixels. The numbers said the second was true here, and once the target matched the structure of the problem, the ceiling that no amount of weighting had moved simply was not there anymore.

References

[1] Rifkin, R., and Klautau, A. In Defense of One-Vs-All Classification. Journal of Machine Learning Research 5 (2004), pp. 101-141. The argument that well-tuned one-vs-all binary classifiers are a strong baseline, and the conditions under which a joint multiclass formulation is preferable. https://www.jmlr.org/papers/v5/rifkin04a.html

[2] Long, J., Shelhamer, E., and Darrell, T. Fully Convolutional Networks for Semantic Segmentation. CVPR (2015), pp. 3431-3440. Dense pixel labelling as a single per-pixel softmax over all classes at once. https://openaccess.thecvf.com/content_cvpr_2015/html/Long_Fully_Convolutional_Networks_2015_CVPR_paper.html

[3] Ronneberger, O., Fischer, P., and Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation. MICCAI (2015), pp. 234-241. The encoder-decoder that trains a single softmax head over foreground and background classes together. https://arxiv.org/abs/1505.04597

[4] Milletari, F., Navab, N., and Ahmadi, S.-A. V-Net: Fully Convolutional Neural Networks for Volumetric Medical Image Segmentation. 3DV (2016), pp. 565-571. The overlap-based Dice objective a mutually exclusive multiclass head can optimise per class without the pixel-claiming conflict independent masks allow. https://arxiv.org/abs/1606.04797
