For most of the engagement we detected beddings and fractures with one model. The two are the same kind of object in a borehole-image log: a planar feature that unrolls into a sinusoid, described by the same three numbers the network regresses: depth, dip, and azimuth. One model, one label schema, two class flags. That symmetry is real, and it made a shared detector the obvious first design. What it hid is that geologists label beddings and fractures with different priorities, and when those priorities diverge across wells, one detector is asked to fit two incompatible label styles at once. This is the story of the well that made that failure legible, and why we split the task in two.
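To make the shared parameterisation concrete, here is a minimal sketch of how a plane with a given depth, dip, and azimuth traces a sinusoid on the unrolled image. The function name, the default borehole radius, and the sign convention (the trace is deepest at the dip azimuth) are assumptions for illustration, not the project's code.

```python
import numpy as np

def plane_to_sinusoid(depth_m, dip_deg, azimuth_deg,
                      borehole_radius_m=0.108, n_samples=360):
    """Trace of a planar feature on an unrolled borehole image.

    Assumptions (not from the source): depth increases downward, the trace
    is deepest at the dip azimuth, and the borehole is a perfect cylinder.
    The default radius is a hypothetical placeholder.
    """
    theta_deg = np.arange(n_samples) * (360.0 / n_samples)
    # Half the peak-to-trough height of the sinusoid grows with tan(dip).
    amplitude_m = borehole_radius_m * np.tan(np.radians(dip_deg))
    trace_m = depth_m + amplitude_m * np.cos(np.radians(theta_deg - azimuth_deg))
    return theta_deg, trace_m

# A bedding and a fracture go through exactly the same function;
# only the class flag attached to the triple differs.
theta, trace = plane_to_sinusoid(depth_m=1000.0, dip_deg=45.0, azimuth_deg=90.0)
```

Under this convention, depth sets the vertical offset, dip sets the amplitude, and azimuth sets the phase, which is why one regression head can describe both classes.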
The two classes are not the same distribution
A bedding plane and a fracture look alike in the fitting form, but their statistics are opposite. In this Middle East carbonate dataset the average bed is about 0.25 m tall and beds are dense; the average fracture is about 2 m tall and fractures are sparse. Beds are laid down by deposition and tend to be picked consistently well to well. Fractures are picked by whichever expert looked at whichever interval, and how aggressively they pick and where they stop both vary. The fracture label distribution carries the interpreter's attention in a way the bedding distribution does not.
That difference does not show up when you read a single metric. It shows up when you put the two classes on the same axes and watch them fail in opposite directions.
Same axes, opposite failures
Score the beddings-only behaviour and the fracture behaviour on the same four measures of what the model predicts: depth, dip, azimuth accuracy, and azimuth mean error. Beddings hold depth F1 around 70 percent at the 9 cm band and dip accuracy near 96 to 97 percent at a 4-degree tolerance, but their azimuth accuracy sits at about 88 percent at 20 degrees, and the azimuth mean error is 22 to 23 degrees. Fractures invert that profile: depth stabilises around 77 percent, dip is softer at about 91 percent, but azimuth is tight, roughly 92 percent at 15 degrees with a mean error near 9.3 degrees.
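For readers who want to reproduce this kind of scoring, the sketch below shows one plausible way to compute tolerance-band metrics. It is not the project's scorer. It assumes the 9 cm band means an absolute depth difference of at most 0.09 m, that predictions are matched to picks greedily and one-to-one by depth, and that dip and azimuth accuracy are computed over matched pairs only. All function and key names are hypothetical.

```python
import numpy as np

def circular_error_deg(pred_deg, true_deg):
    """Smallest angle between two azimuths, in [0, 180] degrees."""
    diff = np.abs(np.asarray(pred_deg, dtype=float)
                  - np.asarray(true_deg, dtype=float)) % 360.0
    return np.minimum(diff, 360.0 - diff)

def match_by_depth(pred_depth_m, true_depth_m, band_m):
    """Greedy one-to-one matching, closest pairs first, within band_m."""
    pred = np.asarray(pred_depth_m, dtype=float)
    true = np.asarray(true_depth_m, dtype=float)
    dist = np.abs(pred[:, None] - true[None, :])
    pairs, used_pred, used_true = [], set(), set()
    for flat in np.argsort(dist, axis=None):
        i, j = (int(k) for k in np.unravel_index(flat, dist.shape))
        if dist[i, j] > band_m:
            break  # every remaining pair is further apart than the band
        if i not in used_pred and j not in used_true:
            pairs.append((i, j))
            used_pred.add(i)
            used_true.add(j)
    return pairs

def score(pred, true, depth_band_m=0.09, dip_tol_deg=4.0, az_tol_deg=20.0):
    """pred and true are dicts with 'depth', 'dip', 'azimuth' sequences."""
    pairs = match_by_depth(pred["depth"], true["depth"], depth_band_m)
    n_pred, n_true, tp = len(pred["depth"]), len(true["depth"]), len(pairs)
    # F1 = 2TP / (2TP + FP + FN) = 2TP / (n_pred + n_true)
    depth_f1 = 2.0 * tp / (n_pred + n_true) if (n_pred + n_true) else 0.0
    if tp == 0:
        nan = float("nan")
        return {"depth_f1": depth_f1, "dip_acc": nan,
                "az_acc": nan, "az_mean_err_deg": nan}
    pi = [i for i, _ in pairs]
    ti = [j for _, j in pairs]
    dip_err = np.abs(np.asarray(pred["dip"], dtype=float)[pi]
                     - np.asarray(true["dip"], dtype=float)[ti])
    az_err = circular_error_deg(np.asarray(pred["azimuth"], dtype=float)[pi],
                                np.asarray(true["azimuth"], dtype=float)[ti])
    return {"depth_f1": depth_f1,
            "dip_acc": float(np.mean(dip_err <= dip_tol_deg)),
            "az_acc": float(np.mean(az_err <= az_tol_deg)),
            "az_mean_err_deg": float(np.mean(az_err))}
```

The circular error matters for azimuth: a prediction of 5 degrees against a pick of 350 degrees is 15 degrees off, not 345. Scoring each class separately with its own tolerances gives the two columns compared above.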
Read those two columns side by side and the shape of the problem is plain. Beddings are a dip-strong, azimuth-weak class; fractures are an azimuth-strong, dip-softer class. A shared regression head has to find one setting that serves both, and there is no such setting, because the classes want the loss pulled in different directions. The most telling signal was not any single average but this opposition itself: the two classes do not just differ in level, they invert. A dip tolerance that flatters beddings is where fractures are softest; an azimuth tolerance that flatters fractures is where beddings are worst. That mirrored profile is the signature of one objective being pulled two ways at once, not of one class the model has genuinely learned.
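The opposition can be checked mechanically rather than by eye. The sketch below flags a pair of classes whose ranking flips between axes. The numbers are the approximate figures quoted above, using midpoints of the reported ranges, and the helper names are hypothetical. Azimuth mean error is used instead of azimuth accuracy because the two classes were scored at different azimuth tolerances; the dip tolerance for fractures is not stated here, so treat the dip comparison as indicative only.

```python
def split_axes(scores_a, scores_b, higher_is_better):
    """Return (axes where A is better, axes where B is better)."""
    a_better, b_better = [], []
    for axis, hib in higher_is_better.items():
        a, b = scores_a[axis], scores_b[axis]
        if a == b:
            continue  # a tie says nothing about direction
        a_wins = (a > b) == hib
        (a_better if a_wins else b_better).append(axis)
    return a_better, b_better

def profile_inverts(scores_a, scores_b, higher_is_better):
    """True when each class is the better one on at least one axis."""
    a_better, b_better = split_axes(scores_a, scores_b, higher_is_better)
    return bool(a_better) and bool(b_better)

# Approximate values from the text; range midpoints are an assumption.
beddings = {"dip_acc": 0.965, "az_mean_err_deg": 22.5}
fractures = {"dip_acc": 0.91, "az_mean_err_deg": 9.3}
direction = {"dip_acc": True, "az_mean_err_deg": False}

inverted = profile_inverts(beddings, fractures, direction)
```

A flip on its own does not prove the shared objective is at fault, but it is cheap to compute on every evaluation run and is harder to overlook than two separate headline numbers.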
The well that forced the decision
The trigger was a specific well, the 15th we brought in. Its picks prioritised fractures over beddings, and the bedding labels had gaps: in one horizontal well, the sampling-interval check exposed pick gaps of up to 175 m. We went back to the interpreter, and he confirmed it: in that interval he had deliberately focused on fractures over beddings. There was nothing wrong with his picks. They were correct picks of fractures. But dropped into a shared bedding-and-fracture training set, those multi-metre bedding gaps are not missing at random. They are a systematic bias that teaches the shared model that beddings are absent where they are merely unpicked.
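A sampling-interval check of this kind can be very small. The sketch below is an assumption about what such a check might look like, not the one we ran: it sorts one class's pick depths for a well and reports every interval between consecutive picks that exceeds a threshold. The function name and the threshold are hypothetical.

```python
import numpy as np

def find_pick_gaps(pick_depths_m, max_gap_m):
    """Return (top, bottom, length) for each gap between consecutive picks
    longer than max_gap_m. Depths are along-hole depths in metres."""
    depths = np.sort(np.asarray(pick_depths_m, dtype=float))
    spacing = np.diff(depths)
    gap_idx = np.flatnonzero(spacing > max_gap_m)
    return [(float(depths[i]), float(depths[i + 1]), float(spacing[i]))
            for i in gap_idx]

# Toy bedding picks with one long unpicked interval (illustrative values).
bedding_depths = [100.0, 100.3, 100.5, 275.5, 275.8]
gaps = find_pick_gaps(bedding_depths, max_gap_m=5.0)
```

A check like this only sees gaps between picks, so it will not catch an unpicked stretch at the top or bottom of the logged interval unless the interval limits are added to the depth list. A flagged gap is also only a question for the interpreter, since a long interval can be genuinely featureless.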
We had already learned in this engagement that adding a well does not always help, both from a well-count ablation and from an episode where synthetic overload made a larger dataset worse (covered elsewhere). This was a third, different way for more data to hurt: not too few wells, not too much augmentation, but a well whose label style contradicted the others for one of the two classes. A well that helped fractures hurt beddings, because the two classes were sharing one model and one training set, and the picks that were correct for fractures read as absent beddings to the shared objective.
Splitting the task
The fix was to stop making one model serve both label styles. We trained a beddings-only model on the 11 wells whose bedding picks were style-consistent, and a fractures-only model on all 14 wells where the fracture picks were the trustworthy signal. The combined Beddings-plus-Fractures detector carried a class error of 1.537. The beddings-only model on the style-consistent subset came in at 0.645. Same architecture, same loss, same everything except that each model now trains on a class whose labels agree with each other.
The point is not that two models beat one in general. It is that the well-selection decision and the class-split decision are the same decision. Once you accept that fractures and beddings are labelled with different priorities, the training set for each class has a different notion of which wells are clean. Eleven wells are style-consistent for beddings; fourteen are usable for fractures. A single detector cannot honour both memberships, because it has one training set. Splitting the task lets each model be trained only on the wells where its class is labelled the way the model will be judged.
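In code, the idea that each class has its own notion of a clean well reduces to a per-class well list applied before training. The sketch below is illustrative only: the `Pick` record, the well identifiers, and the function name are hypothetical, and the toy example uses three wells rather than the project's eleven and fourteen.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Pick:
    well: str
    cls: str          # "bedding" or "fracture"
    depth_m: float
    dip_deg: float
    azimuth_deg: float

def build_training_sets(picks, clean_wells_by_class):
    """Keep each pick only if its well is trusted for that pick's class."""
    training = {cls: [] for cls in clean_wells_by_class}
    for pick in picks:
        trusted = clean_wells_by_class.get(pick.cls)
        if trusted is not None and pick.well in trusted:
            training[pick.cls].append(pick)
    return training

# Hypothetical membership: well "C" is trusted for fractures only.
clean_wells = {
    "bedding": {"A", "B"},
    "fracture": {"A", "B", "C"},
}
picks = [
    Pick("A", "bedding", 100.0, 5.0, 40.0),
    Pick("C", "bedding", 210.0, 6.0, 45.0),    # dropped: C is not bedding-clean
    Pick("C", "fracture", 212.0, 70.0, 130.0),
]
sets = build_training_sets(picks, clean_wells)
```

A shared detector has no equivalent of `clean_wells`: including well "C" admits its sparse bedding labels, and excluding it discards its fracture picks.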
What we still gave up
Two models cost more than one to train, serve, and keep in sync, and we do not pretend otherwise. A single detector that regressed both classes was the cheaper artefact, and if the two label distributions had stayed consistent across wells, it would have been the right one to keep. The split is a response to a property of the labels, not a universal preference for per-class models.
The deeper caution is that the divergence signal is easy to miss if you only track headline accuracy. It looked, at a glance, like the beddings model was doing fine on its own strong metric. It was the opposition between the two classes on the same axes, beddings holding dip while losing azimuth and fractures doing the reverse, that told us the shared abstraction was wrong before the aggregate class error made it undeniable. When two tasks share a model, watch for the classes failing in opposite ways. That is the tell that they should not be sharing a model at all.
Limitations
Every number here comes from one engagement on one Middle East carbonate reservoir with 14 to 15 vertical wells, a sample small enough that a single well moves aggregate metrics by whole points, which is part of why the 15th well was legible at all. The class-error figures (1.537 shared versus 0.645 beddings-only) are from the project's own error tables and are not independently reproduced. The beddings-only and combined-fracture metrics are reported at specific depth, dip, and azimuth tolerance bands agreed with the client's expert interpreter; other bands give other numbers. Whether label-style divergence between classes is common enough elsewhere to justify a per-class split by default is exactly the kind of claim this single case cannot settle. Treat the split as the right call given labels that diverged, not as a rule to apply before you have looked at whether yours do.
References
[1] Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., and Zagoruyko, S. (2020). End-to-End Object Detection with Transformers. European Conference on Computer Vision (ECCV). The set-prediction detector whose shared regression head has to serve both classes here. https://arxiv.org/abs/2005.12872
[2] Yu, T., Kumar, S., Gupta, A., Levine, S., Hausman, K., and Finn, C. (2020). Gradient Surgery for Multi-Task Learning. Advances in Neural Information Processing Systems (NeurIPS). Formalises the conflicting-gradient problem behind two classes wanting the loss pulled in opposite directions. https://arxiv.org/abs/2001.06782
[3] Standley, T., Zamir, A., Chen, D., Guibas, L., Malik, J., and Savarese, S. (2020). Which Tasks Should Be Learned Together in Multi-Task Learning? International Conference on Machine Learning (ICML). Evidence that jointly training tasks can underperform separate models when the tasks interfere, the empirical case for splitting. https://arxiv.org/abs/1905.07553
[4] Song, H., Kim, M., Park, D., Shin, Y., and Lee, J.-G. (2022). Learning From Noisy Labels With Deep Neural Networks: A Survey. IEEE Transactions on Neural Networks and Learning Systems. Systematises how systematic, non-random label bias, like beddings unpicked where an interpreter focused on fractures, degrades a shared model. https://arxiv.org/abs/2007.08199