There is a particular kind of model failure that looks exactly like success on the dashboard. You train a segmentation network on scanned well logs, the validation accuracy climbs past 0.95 in the first few epochs, the loss curve drops like a stone, and you feel good about your afternoon. Then you overlay a prediction on a real log and the curves are gone. The model has learned the one thing the data was quietly begging it to learn: paint the whole image as background and collect a near-perfect score. This post is about why that happens, why a raster log is almost a worst-case input for it, and what the fix actually costs in the per-class numbers we measured on our own VeerNet pipeline for a Texas onshore operator.
A log is mostly nothing
Start with the geometry of the input. A digitised raster log is a tall, narrow grayscale strip. The curves a petrophysicist cares about are thin ink traces winding down the depth axis, one to three pixels wide at the resolution we trained on. Everything else is paper, grid, and margin. When you set up multiclass pixel segmentation with three classes (background plus two curves) the label tensor is overwhelmingly one value. Background is on the order of 97% of the pixels in a frame; the two curve classes share the thin remainder between them.
A pixel-wise loss does not know that the two curve classes are the entire point of the exercise. It sees a few hundred thousand background pixels and a few thousand curve pixels, and it optimises the average. The cheapest way to drive that average down is to get the abundant class right and ignore the scarce one. A model that emits "background" for every pixel is wrong only on the thin curves, which is to say it is wrong on almost nothing by pixel count. The gradient that would push it to find the curves is a faint signal buried under an avalanche of easy, already-correct background pixels. This is the textbook foreground-background imbalance that motivated focal loss for dense detection, and a raster log is an unusually pure instance of it [1].
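To make the arithmetic concrete, here is a minimal sketch on a synthetic label strip. The strip size, the two sinusoidal traces, and every name in it are invented for illustration; none of it is the VeerNet data or code. It only counts what an all-background predictor gets right.

```python
import math

HEIGHT, WIDTH = 1000, 100  # hypothetical strip size, not the real log resolution
BACKGROUND, CURVE_A, CURVE_B = 0, 1, 2


def make_label_strip():
    """Synthetic label map: curve A is 2 px wide, curve B is 1 px wide."""
    labels = [[BACKGROUND] * WIDTH for _ in range(HEIGHT)]
    for y in range(HEIGHT):
        col_a = 25 + round(10 * math.sin(y / 50))  # stays within columns 15..35
        col_b = 70 + round(10 * math.cos(y / 30))  # stays within columns 60..80
        labels[y][col_a] = CURVE_A
        labels[y][col_a + 1] = CURVE_A
        labels[y][col_b] = CURVE_B
    return labels


flat = [value for row in make_label_strip() for value in row]
background_fraction = flat.count(BACKGROUND) / len(flat)

# The degenerate model: predict background for every pixel.
predictions = [BACKGROUND] * len(flat)
pixel_accuracy = sum(p == t for p, t in zip(predictions, flat)) / len(flat)
curve_pixels_found = sum(
    p == t for p, t in zip(predictions, flat) if t != BACKGROUND
)

print(f"background fraction:             {background_fraction:.3f}")
print(f"all-background pixel accuracy:   {pixel_accuracy:.3f}")
print(f"curve pixels recovered:          {curve_pixels_found}")
```

On this toy strip the degenerate predictor scores 0.970 pixel accuracy while recovering zero curve pixels.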
The trap is that the usual aggregate metrics reward the degenerate solution. Pixel accuracy is dominated by background. An averaged F1 can look just as respectable if the average is micro or frequency-weighted, because background dominates that too; only a macro average weights the curve classes equally, and it is easy to report the wrong one. The only honest way to see the failure is to stop averaging across classes and look at each class on its own.
The per-class spread is where the truth lives
When we trained the multiclass segmentation model on the 15,000-instance synthetic dataset under a Dice loss and then split the metrics out by class, the imbalance was written across the table in plain numbers. The background class came in at an F1 of 0.97, with precision 0.96 and recall 0.97. The two curve classes, the ones we were actually building the model to recover, landed at an F1 of 0.37 and 0.32 respectively. Precision on the curves was 0.41 and 0.36; recall was 0.37 and 0.32. The Intersection-over-Union told the same story from a different angle: 0.94 on background, 0.26 and 0.21 on the two curves.
That 0.97-against-0.37 gap is the whole argument. The model is excellent at the class that does not matter and barely functional at the two that do. A single blended number would have hidden this completely. The per-class view is what turns "the model scores 0.9-something, ship it" into "the model has learned nothing about curves yet."
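How thoroughly a blended number hides the gap depends on which average is reported. A rough sketch using the per-class F1 values above; the pixel shares are an assumption (the text gives roughly 97% background, and the even split of the remainder between the two curves is purely illustrative):

```python
# Per-class F1 as reported above (background, then the two curve classes).
f1 = {"background": 0.97, "curve_1": 0.37, "curve_2": 0.32}

# ASSUMPTION: share of pixels per class. Only the ~97% background figure comes
# from the text; the 1.5% / 1.5% split between curves is illustrative.
pixel_share = {"background": 0.97, "curve_1": 0.015, "curve_2": 0.015}

# Macro average: every class counts equally.
macro_f1 = sum(f1.values()) / len(f1)

# Frequency-weighted average: every class counts in proportion to its pixels.
weighted_f1 = sum(f1[name] * pixel_share[name] for name in f1)

print(f"macro F1:              {macro_f1:.3f}")
print(f"frequency-weighted F1: {weighted_f1:.3f}")
```

Under these assumed shares the frequency-weighted figure comes out near 0.95 and the macro figure near 0.55. Neither says which class is failing, which is why the table has to be read row by row.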
It is worth being precise about why the curve F1 is low and not merely the IoU. F1 here is the harmonic mean of precision and recall on the curve pixels, and both halves are weak. Low recall means the model misses curve pixels, falling back to its background habit. Low precision means that when it does fire on a curve, it is often a pixel or two off the trace, which on a one-to-three-pixel-wide ink line is the difference between a hit and a miss. The thinness of the target makes the geometry unforgiving: there is almost no margin for a soft or smeared prediction to overlap the ground truth.
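The effect of a one-pixel offset on a thin trace can be sketched with a perfectly straight vertical line. This is a toy geometry with hypothetical helper names, not our evaluation code: the prediction is the ground-truth trace shifted one column to the right, at trace widths of one, two, and three pixels.

```python
def precision_recall_f1_iou(pred, truth):
    """Metrics for one class, with pixels given as sets of (row, col) tuples."""
    tp = len(pred & truth)
    fp = len(pred - truth)
    fn = len(truth - pred)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if tp + fp + fn else 0.0
    iou = tp / (tp + fp + fn) if tp + fp + fn else 0.0
    return precision, recall, f1, iou


def vertical_trace(col, width, rows=100):
    """A straight trace `width` pixels wide starting at column `col`."""
    return {(r, c) for r in range(rows) for c in range(col, col + width)}


results = {}
for width in (1, 2, 3):
    truth = vertical_trace(col=10, width=width)
    shifted = vertical_trace(col=11, width=width)  # same trace, one pixel right
    results[width] = precision_recall_f1_iou(shifted, truth)
    p, r, f1, iou = results[width]
    print(f"width {width}: precision={p:.2f} recall={r:.2f} F1={f1:.2f} IoU={iou:.2f}")
```

At one pixel wide, the shifted prediction scores zero on everything. At three pixels wide the same one-pixel error still leaves F1 at about 0.67 and IoU at 0.50.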
Make the rare pixels expensive
The standard lever for this is to stop treating every pixel as equally important in the loss. If the optimiser will chase whatever lowers the average, then change what the average weighs. In a weighted binary cross-entropy you attach a multiplier to the positive (curve) class so that getting a curve pixel wrong costs far more than getting a background pixel wrong. The all-background shortcut stops being cheap, because now the handful of missed curve pixels carry most of the loss.
On the binary-mask formulation of the problem we pushed the positive-class weight hard: a class_weight of 1 became a class_weight of 42. In effect every curve pixel counts as forty-two background pixels when the gradient is computed. That number is not arbitrary; it is roughly the inverse of the class frequency, which is the natural starting point for a rebalancing weight. Weight the rare class by how rare it is and the loss surface stops sloping toward the trivial solution.
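What a weight of 42 does to the balance of the loss can be sketched in a few lines. The mask below is hypothetical (one curve pixel for every 42 background pixels, chosen so the inverse-frequency ratio lands on 42), the function names are mine, and the "model" is an undecided one that outputs 0.5 everywhere.

```python
import math


def class_loss_totals(probs, targets, pos_weight=1.0):
    """Return (curve_loss, background_loss) summed over pixels for weighted BCE."""
    curve_loss = 0.0
    background_loss = 0.0
    for p, t in zip(probs, targets):
        if t == 1:
            curve_loss += -pos_weight * math.log(p)  # positive term, scaled
        else:
            background_loss += -math.log(1.0 - p)    # negative term, unscaled
    return curve_loss, background_loss


# Hypothetical binary mask: 10 curve pixels, 420 background pixels.
targets = [1] * 10 + [0] * 420
pos_weight = targets.count(0) / targets.count(1)  # background-to-curve ratio

# An undecided model: probability 0.5 for every pixel.
probs = [0.5] * len(targets)


def curve_share(weight):
    """Fraction of the total loss contributed by curve pixels."""
    curve, background = class_loss_totals(probs, targets, weight)
    return curve / (curve + background)


shares = {weight: curve_share(weight) for weight in (1.0, pos_weight)}
for weight, share in shares.items():
    print(f"pos_weight={weight:>4.0f}: curve pixels carry {share:.1%} of the loss")
```

With a weight of 1 the curve pixels carry about 2.3% of the loss in this toy setup; with the inverse-frequency weight they carry exactly half, so the two classes pull on the gradient equally.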
The result of that reweighting is instructive about which half of the problem moves first. On the three binary masks, recall climbed to 0.96, 0.97, and 0.97. The model stopped ignoring curve pixels almost entirely. But the F1 on those same masks sat at 0.37, 0.26, and 0.55, because precision lagged. Pushing the weight up tells the model "never miss a curve," and it obliges by firing generously, which lifts recall fast and drags precision along slowly behind it. You buy recall cheaply and pay for precision later. That is the honest shape of the trade, and it is visible the moment you stop looking at a single blended score.
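The size of the precision lag can be estimated from the figures above by inverting the F1 formula. This is a back-of-envelope sketch, and it assumes each reported F1 is the harmonic mean of the reported recall and a single precision value for the same mask; if the metrics were averaged per image, the identity only holds approximately.

```python
def implied_precision(f1, recall):
    """Solve F1 = 2PR / (P + R) for P, given F1 and R."""
    return f1 * recall / (2 * recall - f1)


# (recall, F1) for the three binary masks, as reported above.
reported = [(0.96, 0.37), (0.97, 0.26), (0.97, 0.55)]

precisions = [implied_precision(f1, recall) for recall, f1 in reported]
for (recall, f1), precision in zip(reported, precisions):
    print(f"recall={recall:.2f} F1={f1:.2f} -> implied precision ~ {precision:.2f}")
```

Under that assumption the implied precision is roughly 0.23, 0.15, and 0.38: the model finds nearly every curve pixel but most of what it marks as curve is not.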
Weighting is one tool among several
Reweighting the cross-entropy is the bluntest instrument for class imbalance, and it is rarely the last word. Two other families are worth naming because we evaluated them on the same problem.
The first is overlap-based losses. Dice and IoU losses score a prediction by how well its mask overlaps the target rather than by per-pixel correctness, which partly sidesteps the frequency problem because a tiny, perfectly overlapped curve scores well even though it is a small fraction of the image. The Tversky loss generalises Dice with separate penalties for false positives and false negatives, so you can tilt it toward recall on the scarce class without touching the data [2]. The Lovász-Softmax loss goes further and optimises a tractable surrogate of the IoU metric directly, so the gradient is aligned with the number you actually report rather than a per-pixel proxy [3]. In our own loss sweep across five candidates these overlap-aware objectives behaved very differently from plain weighted cross-entropy, and the choice of loss mattered as much as the choice of weight.
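The Tversky tilt is easy to see on a toy mask. The sketch below follows the usual formulation, where alpha penalises false positives and beta penalises false negatives, so alpha = beta = 0.5 recovers Dice. The 20-pixel mask, the two hard predictions, and the 0.3 / 0.7 setting are illustrative choices, not values from our sweep.

```python
def tversky_loss(probs, targets, alpha=0.5, beta=0.5, eps=1e-7):
    """1 - Tversky index. alpha weights false positives, beta false negatives."""
    tp = sum(p * t for p, t in zip(probs, targets))
    fp = sum(p * (1 - t) for p, t in zip(probs, targets))
    fn = sum((1 - p) * t for p, t in zip(probs, targets))
    return 1.0 - (tp + eps) / (tp + alpha * fp + beta * fn + eps)


# Toy mask: 4 curve pixels followed by 16 background pixels.
targets = [1, 1, 1, 1] + [0] * 16

# Prediction that misses half the curve: tp=2, fn=2, fp=0.
misses_half = [1, 1, 0, 0] + [0] * 16
# Prediction that finds the whole curve but over-fires: tp=4, fp=4, fn=0.
over_fires = [1, 1, 1, 1] + [1] * 4 + [0] * 12

dice_miss = tversky_loss(misses_half, targets)               # alpha = beta = 0.5
dice_over = tversky_loss(over_fires, targets)
tilt_miss = tversky_loss(misses_half, targets, alpha=0.3, beta=0.7)
tilt_over = tversky_loss(over_fires, targets, alpha=0.3, beta=0.7)

print(f"Dice setting:   miss={dice_miss:.3f}  over-fire={dice_over:.3f}")
print(f"recall-tilted:  miss={tilt_miss:.3f}  over-fire={tilt_over:.3f}")
```

Under the Dice setting the two predictions are penalised equally (about 0.333 each). With beta above alpha, the prediction that misses curve pixels costs about 0.412 and the over-firing one about 0.231, which is the recall tilt expressed in the loss rather than in the data.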
The second family is architectural. An encoder-decoder with a strong decoder recovers fine spatial detail that a coarse classifier throws away, and recovering ink traces one to three pixels wide is exactly a fine-detail problem [4]. Neither family removes the imbalance; each changes how gracefully the model copes with it. The point of the weighting fix is not that it is the best tool but that it is the first diagnostic move. If a high class weight does not lift recall on the rare class, your problem is not imbalance and you should look elsewhere.
What to take to your own logs
The lesson generalises past well logs to any segmentation problem where the thing you care about is a small fraction of the frame: cracks in concrete, vessels in a retina, lane markings on a road. The failure mode is identical and so is the discipline that catches it.
Key takeaways
- A scanned log is roughly 97% background. A pixel-wise loss optimises the average, and the cheapest way to lower it is to predict background everywhere and ignore the thin curves, which scores beautifully on any blended metric.
- Always read per-class metrics. Under Dice loss our background class hit F1 0.97 (IoU 0.94) while the two curve classes sat at F1 0.37 and 0.32 (IoU 0.26 and 0.21). One averaged number would have hidden a model that had learned nothing about curves.
- Reweighting the loss is the first move: a weighted BCE with the positive-class weight pushed from 1 to 42 (roughly the inverse class frequency) makes the all-background shortcut expensive.
- Recall recovers before precision. After reweighting, recall on the binary masks reached 0.96 / 0.97 / 0.97 while F1 lagged at 0.37 / 0.26 / 0.55 because precision is the slower, harder half of the trade.
- Weighting is one tool. Overlap-aware losses (Dice, Tversky, Lovász-Softmax) and an encoder-decoder with a strong decoder each attack imbalance differently; weighting is the diagnostic, not the destination.
Imbalance does not announce itself. It hides inside a good-looking loss curve and a high accuracy and waits for you to ship. The cheapest insurance against it is also the cheapest thing in machine learning: split your metrics by class and look at the one you came for. If the rare class is far below the abundant one, the model has found the background and called it a day, and no amount of further training will fix what the loss is rewarding. Change what the loss weighs, then watch recall move first and precision follow.
References
[1] Lin, T.-Y., Goyal, P., Girshick, R., He, K., and Dollár, P. Focal Loss for Dense Object Detection. ICCV (2017). The canonical treatment of foreground-background imbalance in dense prediction. https://arxiv.org/abs/1708.02002
[2] Salehi, S. S. M., Erdogmus, D., and Gholipour, A. Tversky loss function for image segmentation using 3D fully convolutional deep networks. MLMI Workshop, MICCAI (2017). https://arxiv.org/abs/1706.05721
[3] Berman, M., Triki, A. R., and Blaschko, M. B. The Lovász-Softmax loss: A tractable surrogate for the optimization of the intersection-over-union measure in neural networks. CVPR (2018). https://arxiv.org/abs/1705.08790
[4] Chen, L.-C., Zhu, Y., Papandreou, G., Schroff, F., and Adam, H. Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation (DeepLabv3+). ECCV (2018). https://arxiv.org/abs/1802.02611