The 15th Well Made Our Model Worse: Label Consistency Beats Data Volume

We added a fifteenth well to a fracture-detection training set that was already working, and validation F1 at 5 cm fell from about 60 percent to about 57. The well was not corrupt and the pipeline was not broken. Its ground truth had been picked with a different emphasis, fractures favoured over beddings, with pick gaps running to 175 m, so the extra rows pulled the label distribution off the style the model had settled on across the first fourteen wells. The fix was not more data but separate per-class models trained on style-consistent wells. Past a threshold, how your annotator picks matters more than how many wells you feed the network.

by Narendra Patwardhan · 8 min read
EarthScan insight

Most of the time, one more well of training data helped. Our fracture detector for a mid-sized Middle East carbonate operator got steadily better as wells came in, with the usual shape: sharp gains early, diminishing returns later. Then we added a fifteenth well and validation F1 at a 5 cm depth threshold fell, from about 60 percent to about 57. Nothing had broken. The pixel-range check passed, the labels loaded, the training curve looked normal. A well we had every reason to trust had made the model worse.
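For readers who have not met the metric, here is a minimal sketch of what "F1 at a 5 cm depth threshold" can mean for picks along a borehole. The matching rule is an assumption, not a description of our evaluation code: each predicted pick may match at most one ground-truth pick of the same class if their depths differ by no more than the tolerance, and matching is greedy over depth-sorted lists. The function name and arguments are hypothetical.

```python
def f1_at_tolerance(pred_depths_m, true_depths_m, tol_m=0.05):
    """F1 for one class, counting a prediction as correct when it lies
    within tol_m of a not-yet-matched ground-truth pick (assumed rule)."""
    pred = sorted(pred_depths_m)
    true = sorted(true_depths_m)
    i = j = matched = 0
    # Two-pointer sweep down the well: match when close enough, otherwise
    # advance whichever pick is shallower.
    while i < len(pred) and j < len(true):
        diff = pred[i] - true[j]
        if abs(diff) <= tol_m:
            matched += 1
            i += 1
            j += 1
        elif diff < 0:
            i += 1  # prediction is too shallow to match any remaining pick
        else:
            j += 1  # ground-truth pick was missed
    if matched == 0:
        return 0.0
    precision = matched / len(pred)
    recall = matched / len(true)
    return 2 * precision * recall / (precision + recall)


# Illustrative depths only, not data from the engagement.
example = f1_at_tolerance([100.00, 100.52, 103.0], [100.02, 100.50, 101.0, 102.0])
```

In the illustrative call, two of three predictions match and two of four true picks are found, so precision is 2/3, recall is 1/2, and F1 is 4/7. Under a rule like this, a stretch where the interpreter left beddings unmarked turns correct bedding predictions into false positives, which is one way a label-style difference can show up directly in the score.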

This piece is about why, and about the thing that mattered more than well count once we were past a certain amount of data: the style the ground truth was picked in.

The uptick we had already seen once

The clue was in the well-count ablation we had run months earlier. Adding wells drove class error down hard and then flattened, but the last step went the wrong way: class error fell to a floor around 0.82 at eleven wells and then rose to about 2.54 at fourteen, the largest well count, a small non-monotone bump we noted and moved past. The full curve, and why the last well behaves differently from every well before it, is a separate case study; here it is enough that the trend has a real up-tick at the tail, not a smooth asymptote. (See How Many Wells Is Enough? A Well-Count Ablation for Fracture Detection.)

The fifteenth-well F1 drop was the same phenomenon in a different metric. Once we lined the two up, the question stopped being "is more data always good" and became "what is different about the wells at the tail." The answer was not in the pixels. It was in the picks.

A clean well, picked in a different hand

The image-log data in the fifteenth well was fine. What differed was the human ground truth attached to it. When we pulled the raw picks and checked the sampling interval, the well had long stretches with no annotations at all, gaps running up to 175 m in a single interval. That is not a data-quality defect. The interpreter had deliberately concentrated on fractures in that well and left beddings largely unmarked, which is a reasonable thing for a geologist to do when the fractures are what the study cares about in that section.

For a person reading the log, that emphasis is invisible and harmless. For a network learning a joint distribution over beddings and fractures across all wells, it is a distribution shift. The first fourteen wells had been picked in a broadly consistent style, dense in both classes, so the model had settled its class balance and its notion of a "typical" interval around that style. The fifteenth well arrived with a different balance baked into its labels: heavy on fractures, sparse on beddings, with multi-metre stretches the model reads as "nothing here" even where beddings exist. Averaged into the same objective, that pulled the model off the operating point the earlier wells had found, and the three-point F1 drop landed exactly where the two label styles disagreed.

[Interactive figure: The 15th well, validation F1 at 5 cm: 57%, down 3 points from the 14-well model. Left panel: validation F1 at 5 cm, before vs after (14 wells, one style: 60; 15 wells, two styles: 57). Right panel: well-count class error on a log scale against wells in the training set (floor 0.82, up-tick 2.54). Toggle for handling of the style-different well: pool it into one shared model, or split by class onto style-consistent sets (beddings-only set of 11 wells, fractures-only set of 14 wells). What made the 15th well different: emphasis on fractures over beddings, longest pick gap 175 m.]
The one experiment that inverts the well-count intuition. A fourteen-well fracture model validated at F1@5cm near 60 percent. Adding a fifteenth well dropped it to about 57. The well was not corrupt: its ground truth was picked with a different emphasis, fractures favoured over beddings, with pick gaps running to 175 m, so the added rows pulled the label distribution off the fourteen-well style the network had settled on. The right panel plots the sourced well-count class-error trend, 93.115 / 18.370 / 1.055 / 0.817 / 2.536 at 3 / 6 / 9 / 11 / 14 wells, whose non-monotone up-tick at the last well is the same tell that shows up in validation F1. Toggle the handling of the style-different well: pool it into one model and the orange marker drops the three points; split by class onto style-consistent well sets, the combined and beddings-only sets of 11 wells and the fractures-only set of 14, and the model recovers toward the sourced bedding F1 near 69 at 5 cm. The orange element is the only one that argues: the outcome bar and the up-tick point that fall when styles are mixed. The class-error trend, the two F1 anchors, and the 11 and 14 well split are sourced from the engagement archive; the split-by-class F1 is anchored to the sourced bedding F1 at 5 cm and the fifteenth-well endpoints are the sourced validation regression.

The instrument puts the two facts side by side on purpose. On the left, the before-and-after: a fourteen-well model near 60 F1, and the same model near 57 once the style-different well is pooled in. On the right, the well-count class-error trend with its non-monotone tail. They are the same story told twice. Neither says "more data is bad." Both say the fifteenth well was not the same kind of thing as the first fourteen, and the network could tell even though we could not, until we looked at the picks.

This is a different failure from feeding a model too much synthetic data, which we hit separately when an aggressive augmentation run ballooned one dataset and degraded the result; that mechanism, overlap and augmentation drowning the real signal, is covered in The 92k-Patch Trap. Here the extra data was real, from a real well, and still hurt, because the label style behind it was inconsistent with the rest.

Why averaging punishes the minority style

The mechanism generalises past borehole images. A detection model trained on pooled wells minimises a single loss averaged over every labeled example. When most wells share a picking style, that style is the majority and the loss rewards fitting it. A well picked in a minority style pulls the gradient toward a different class balance and a different sense of what an empty interval means. If the minority is small, the model mostly ignores it and you lose a little. If it is systematically different, as ours was with its fractures-over-beddings emphasis and its 175 m gaps, the pull is enough to move the validation number the wrong way.
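The averaging argument can be written down in a few lines. This is a sketch under assumed notation, not the exact objective from this engagement: the per-example loss is written as a generic function, wells are indexed by w with n_w labeled examples each, and A and B stand for the majority-style and minority-style wells.

```latex
\begin{aligned}
L_{\text{pool}}(\theta)
  &= \frac{1}{N}\sum_{w}\sum_{i=1}^{n_w}
     \ell\big(f_\theta(x_{w,i}),\, y_{w,i}\big)
   = \sum_{w}\frac{n_w}{N}\,L_w(\theta),
  \qquad N=\sum_{w} n_w, \\[4pt]
\nabla_\theta L_{\text{pool}}
  &= \frac{N_A}{N}\,\nabla_\theta L_A
   + \frac{N_B}{N}\,\nabla_\theta L_B .
\end{aligned}
```

Here L_A and L_B are the mean losses over the two style groups. At a stationary point of the pooled loss the two weighted gradients cancel, so unless the minority-style gradient also vanishes there, the pooled optimum is not the majority-style optimum. How far it moves depends on both the minority's share N_B/N and how strongly its labels disagree with the majority's, which is why a single well can matter when the disagreement is systematic.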

There is a threshold hiding in this. Below some amount of data you are starved, and almost any additional well helps because the model needs coverage more than consistency. Above that threshold the model already has enough of the majority style to be well-fit, and the marginal well no longer buys coverage; it buys either reinforcement of the existing style or contamination of it. Past the threshold, annotator-style homogeneity dominates raw well count. That is the sentence we would tell a past version of ourselves.

The fix was to stop pooling

The instinct when a model regresses is to add more data or tune harder. Neither addresses a label-style split, because the problem is not that the model is under-trained; it is that the model is being asked to fit two distributions at once. The fix was to stop pooling and route by class onto style-consistent well sets.

Concretely, we split into separate models. Beddings and the combined dataset were built from the eleven wells whose picks were consistent for both classes. Fractures, where more wells carried usable picks, got their own model on fourteen wells. Each model then saw one label style, and the minority-style contamination went away because there was no longer a shared objective for it to contaminate. The beddings-only model validated at F1 60.63 at a 3 cm threshold and around 69 at 5 cm, in the range the combined model had held before the fifteenth well disturbed it, without the drop. The gain was not from more data. It was from removing the style conflict the pooling had created.
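The routing described above can be sketched as a small piece of bookkeeping. Everything named here is hypothetical: the `Well` record, the `consistent_classes` field (which would come from a human or automated review of the picks), and the well names, which are invented only so the example reproduces the 11 and 14 well counts.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Well:
    name: str
    # Classes whose picks in this well match the pool's picking style.
    consistent_classes: frozenset[str]


def build_training_sets(wells, classes=("bedding", "fracture")):
    """Return one well list per class, plus a 'combined' list containing
    only wells that are style-consistent for every class."""
    sets = {c: [w.name for w in wells if c in w.consistent_classes] for c in classes}
    sets["combined"] = [
        w.name for w in wells if all(c in w.consistent_classes for c in classes)
    ]
    return sets


# Invented assignment: 11 wells consistent for both classes,
# 3 more with usable fracture picks only.
both = frozenset({"bedding", "fracture"})
wells = [Well(f"W{i:02d}", both) for i in range(1, 12)]
wells += [Well(f"W{i:02d}", frozenset({"fracture"})) for i in range(12, 15)]

sets = build_training_sets(wells)
print({name: len(members) for name, members in sets.items()})
```

The point of keeping this as explicit data rather than an ad hoc filter is that the decision about which wells feed which model becomes reviewable, and a new well has to be assigned a style before it can enter any set.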

That is the counter-intuitive part worth carrying. We improved the model by training it on fewer, more homogeneous wells per model rather than more wells in one pool. The separate-model split is not elegant, and it costs you a second training run and a second checkpoint to serve, but it respects a fact about the data that a single pooled objective cannot: two interpreters, or one interpreter in two moods, do not produce interchangeable ground truth, and a network averaging over them pays for the difference.

What we check now before adding a well

The lasting change was procedural. A new well is no longer trusted just because its pixels are clean and its labels load. Before it goes into a training pool we profile its picks: class balance, pick density, largest sampling gap. A well whose profile sits far from the pool's is a candidate for its own model, not the shared one. This is upstream of any accuracy metric, cheap to run, and it would have flagged the fifteenth well before it cost us three F1 points.
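A minimal sketch of that profiling step follows, assuming picks arrive as (depth in metres, class name) pairs for a logged interval. The function names, the `PickProfile` fields, and both thresholds are assumptions made for illustration; they are not the values or code we run, and sensible thresholds would need to be set against your own pool.

```python
from __future__ import annotations

from collections import Counter
from dataclasses import dataclass


@dataclass
class PickProfile:
    class_balance: dict[str, float]  # fraction of picks per class
    picks_per_metre: float
    largest_gap_m: float


def profile_picks(picks, top_m, bottom_m):
    """Profile one well's picks over the logged interval [top_m, bottom_m]."""
    if bottom_m <= top_m:
        raise ValueError("bottom_m must be deeper than top_m")
    depths = sorted(depth for depth, _ in picks)
    counts = Counter(cls for _, cls in picks)
    total = len(picks)
    balance = {cls: n / total for cls, n in counts.items()} if total else {}
    # Include the interval ends so unpicked stretches at top or bottom count.
    edges = [top_m] + depths + [bottom_m]
    largest_gap = max(b - a for a, b in zip(edges, edges[1:]))
    return PickProfile(balance, total / (bottom_m - top_m), largest_gap)


def style_flags(well, pool, max_balance_shift=0.2, max_gap_ratio=5.0):
    """List the ways a well's profile departs from the pool's (assumed thresholds)."""
    flags = []
    classes = set(well.class_balance) | set(pool.class_balance)
    shift = max(
        (abs(well.class_balance.get(c, 0.0) - pool.class_balance.get(c, 0.0))
         for c in classes),
        default=0.0,
    )
    if shift > max_balance_shift:
        flags.append(f"class balance shifted by {shift:.2f}")
    if pool.largest_gap_m > 0 and well.largest_gap_m / pool.largest_gap_m > max_gap_ratio:
        flags.append(f"largest gap {well.largest_gap_m:.1f} m vs pool {pool.largest_gap_m:.1f} m")
    return flags
```

A well that returns any flags would go to review before pooling. Pick density is computed but not thresholded in this sketch; it could be compared against the pool's in the same way as the other two measures.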

None of this argues against interpretation consistency as a goal; the case for AI picks over variable human interpretation still holds, and we make it in Consistent, Auditable Interpretation. The point here is narrower and, for anyone assembling a training set well by well, more immediately useful: once you have enough data, the next well helps only if it was picked in the same style as the rest. Count is not the variable that matters at the tail. Style is.

Limitations

The fifteenth-well regression is a single validation result on one operator's data; the F1 endpoints near 60 and 57 are the sourced validation figures for that experiment, not a repeated measurement across seeds. The recovered per-class numbers are the sourced beddings-only figures, F1 60.63 at 3 cm and around 69 at 5 cm; where this piece describes the split-by-class recovery at 5 cm it is anchored to that bedding figure rather than a separately reported 5 cm number, and the instrument captions it as such. The threshold at which style begins to dominate count is described qualitatively; we did not run a controlled sweep to locate it, and it will vary with class balance, model capacity, and how strongly the minority style diverges. Finally, "picking style" here bundles class emphasis, pick density, and sampling gaps together, which a more careful study would separate; we treat them as one because in this well they moved together.

© 2026 Copyright. Earthscan