The 92k-Patch Trap: When More Synthetic Data Made the Model Worse

There is a seductive move every machine-learning engineer reaches for when a model underperforms and new labels are slow to arrive: turn up the augmentation. Halve the stride, add more crops, flips, and contrast jitter, and watch the training-set count climb from thousands into the tens of thousands. The dataset feels bigger. The epochs take longer. Surely the model gets better. Over a roughly twenty-month engagement with a mid-sized Middle East carbonate operator, we built a Detection-Transformer-based detector for fractures and bedding planes on image logs from two different microresistivity imaging tools. Along the way we ran exactly that experiment to its conclusion, and it taught us a lesson worth more than the compute it cost. Past a point, more synthetic data did not help. It hurt. The fix was not more augmentation; it was more geology.

Where the patches come from

To follow the trap you need to know how the training set is manufactured, because the whole problem lives in the data-engineering layer, not the model. This is the unglamorous half of applied-AI work on scarce data — the patch generator, the augmentation policy, and the split logic are where a subsurface detector is won or lost long before a single transformer layer is tuned.

A processed high-resolution borehole image is enormous: a single well unrolls into an image roughly 690,000 pixels tall, around 1.5 GB on disk. A DETR-style detector trains on fixed-size patches, not whole wells, so we cut the image into patches 800 pixels tall, about 2.2 m of borehole, a height chosen because more than 95% of fracture sinusoids fit inside 2 m. So far, so ordinary. The multiplier is overlap: instead of tiling the well into disjoint 2.2 m windows, you slide the window down by a small stride of 40 pixels, so consecutive patches share 760 of their 800 rows, a 95% overlap. Every fracture is now seen dozens of times, each time at a slightly different vertical offset inside the frame. Stack flips, contrast normalisation, and other train-time transforms on top (an augmentation factor of five or more per patch) and the arithmetic runs away from you fast. In this programme, an original corpus of about 900 image-and-ground-truth pairs grew past 55,000 through overlap and augmentation, a 65-fold inflation. The number on the dataset card looks like it describes a large dataset. It does not. It is ~900 pairs' worth of geology, photocopied.
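The patch arithmetic above can be sketched in a few lines. This is a minimal illustration, not the programme's patch generator: the function names and the 12,000-pixel interval are hypothetical, while the 800-pixel patch height, the 40- and 80-pixel strides, and the augmentation factor of five come from the text.

```python
def patch_starts(image_height_px, patch_px=800, stride_px=40):
    """Top-row offsets of every full patch in a sliding-window tiling."""
    if image_height_px < patch_px:
        return []
    return list(range(0, image_height_px - patch_px + 1, stride_px))


def overlap_fraction(patch_px=800, stride_px=40):
    """Fraction of rows a patch shares with the next patch down."""
    return max(0.0, 1.0 - stride_px / patch_px)


# Hypothetical 12,000-pixel labelled interval (an assumed size, for illustration).
interval_px = 12_000
dense = patch_starts(interval_px, stride_px=40)
coarse = patch_starts(interval_px, stride_px=80)
disjoint = patch_starts(interval_px, stride_px=800)

print(len(disjoint), "disjoint patches")        # 15
print(len(dense), "patches at stride 40")       # 281
print(len(coarse), "patches at stride 80")      # 141
print(round(overlap_fraction(800, 40), 2))      # 0.95
print(len(dense) * 5, "patches after a x5 augmentation factor")  # 1405
```

The interval contains 15 disjoint windows of geology whichever stride is chosen; only the number of views of it changes.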

That distinction — apparent count versus distinct information — is the entire article.

The experiment: four models, one cliff

The model lineage was a clean, disciplined sweep, the kind any production MLOps practice should run before it touches hyperparameters. One model trained on the base set of usable wells. The next added a couple of newly received wells (including one logged with a different imaging tool — a compact microresistivity tool rather than the high-resolution borehole image log used elsewhere — which matters later). Then came the move that looked like a free win: keep the wells, but push the augmentation harder — denser overlap, more transforms — until the training set ballooned to about 92,000 patches. Call it Model 4.

Model 4 was worse. Not catastrophically, not in a way that screams from a single loss curve, but consistently: the blind-set performance — the held-out, continuous 12 m zone with no overlapping patches, the only split that honestly simulates a new well — got softer, and the gains we expected from "more data" simply did not appear. We had pushed the patch count past 90,000 and degraded the product.

The follow-up confirmed the mechanism. The next iteration changed nothing about the wells and instead coarsened the stride from 40 to 80 pixels (less overlap, fewer near-duplicate patches), and it outperformed the model trained on the 92k augmented dataset. Reducing synthetic volume made the model better. When cutting your training set roughly in half improves the held-out metric, the set was never carrying the information its size implied. It was carrying redundancy, and redundancy at ~95% overlap is close to label leakage between train and validation: near-identical 2.2 m windows landing on both sides of the split let the model memorise vertical position rather than learn the sinusoid. The model overfits the texture of its own augmentation, and the blind zone, which is genuinely out-of-sample, pays for it.
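One way to keep near-duplicate windows from landing on both sides of a split is to split by depth rather than by patch. The sketch below is an assumption about how such a guard could look, not the split logic used on this programme; the function names, the 12,000-pixel interval, and the validation zone are hypothetical.

```python
def shared_rows(a_start, b_start, patch_px=800):
    """Number of pixel rows two equally sized patches have in common."""
    return max(0, patch_px - abs(a_start - b_start))


def split_by_depth(starts, val_top_px, val_bottom_px, patch_px=800):
    """Assign patches to train/val so that no pixel row appears in both.

    A patch is validation only if it lies wholly inside the validation zone,
    training only if it lies wholly outside, and is dropped if it straddles
    either boundary.
    """
    train, val, dropped = [], [], []
    for s in starts:
        e = s + patch_px  # exclusive bottom row
        if s >= val_top_px and e <= val_bottom_px:
            val.append(s)
        elif e <= val_top_px or s >= val_bottom_px:
            train.append(s)
        else:
            dropped.append(s)
    return train, val, dropped


# Stride-40 tiling of a hypothetical 12,000-pixel interval.
starts = list(range(0, 12_000 - 800 + 1, 40))

# Neighbouring patches share 760 of 800 rows, so a per-patch random split
# would routinely put near-copies on both sides.
print(shared_rows(starts[0], starts[1]))  # 760

train, val, dropped = split_by_depth(starts, val_top_px=4_000, val_bottom_px=8_000)
worst = max(shared_rows(t, v) for t in train for v in val)
print(len(train), len(val), len(dropped), worst)  # 162 81 38 0
```

Dropping the straddling patches costs some training volume, which is the point: those are exactly the patches that would otherwise carry validation rows into training.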

Exhibit (schematic): out-of-distribution generalisation, the single most under-reported failure mode in published subsurface results. A model trained on one basin's distribution (Gulf of Mexico salt, teal) has no support over a geologically distinct one (pre-salt Brazil, grey). While the deployment point sits inside the training lobe, the model is interpolating; once it crosses the orange OOD cliff, where training support falls to the floor, the model is extrapolating, the 'single universal model' promise fails, and re-training is required. The lobes, support curve and cliff position illustrate the distribution-shift mechanism only; the article sources no benchmark numbers, so none are shown.

The cliff in that exhibit is the right mental model. Augmentation thickens the density of samples inside the region your real wells already cover; it cannot push a single new sample past the edge of that region. A flipped, re-strided, contrast-jittered patch of a fracture is, in feature space, the same fracture. It adds samples, not support. And generalisation is governed by support, not by sample count. Once the model has enough samples to fit the manifold its real wells span, every additional synthetic patch is interpolation practice inside a box it has already mastered — and beyond that box, where the next real well actually lives, it is still extrapolating off the cliff.

What actually moved the needle: real wells

The contrast with real data is stark, and it is the load-bearing result of this piece. Going from 8 real wells to 11 reduced mean absolute error on depth, dip, and azimuth by about 0.007. That is a small absolute number, but it represents real, durable accuracy at the precision band geoscientists care about, and no amount of augmentation on the original wells produced it. Three new wells, each a genuinely independent draw from the field's geology (different formations, different fracture intensities, the contrast quirks of a different logging run), did what tens of thousands of synthetic patches could not.

That is the asymmetry at the heart of the trap. Adding wells widens the distribution; adding augmentation deepens an existing point in it. When your model is data-limited because it has not seen enough kinds of fracture, only the first move helps. The second move, applied to a model that is distribution-limited rather than sample-limited, doesn't just fail to help — it spends your epochs teaching the network to be confident about patches it has effectively already seen, which is precisely how you manufacture overfitting.

It is also why one of those newly received wells was instructive in a second way. It came from a different tool — compact-microresistivity image logs, not the high-resolution borehole image log used elsewhere — and so it was not a denser sampling of the existing distribution but a genuine shift in it. Real wells do not just add support; they occasionally relocate the whole region, which is exactly the kind of variation augmentation can never synthesise because augmentation only knows the transforms you wrote down.

The diagnostic question every team should ask

The practical takeaway is a triage discipline, and it belongs in your training pipeline as a standing check, not a one-off. When a model plateaus and you are tempted to crank augmentation, ask one question first: am I sample-limited or distribution-limited?

You can answer it empirically, cheaply, without new labels. Hold the wells fixed and vary the augmentation intensity — we compared a tight regime (stride 40, augmentation factor 2) against a deliberately aggressive one (stride 160, augmentation factor 7). If pushing augmentation harder keeps lifting your honest held-out metric, you were sample-limited and augmentation is doing real work. If the curve flattens or — as in the 92k case — bends downward on the no-overlap blind set, you have hit the augmentation ceiling. More synthetic volume past that ceiling is not neutral; it is negative, because it shifts the train distribution further from the test distribution and rewards memorisation. At that point the only thing that helps is a new real draw from the field.
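The triage can be written down as a small check on the sweep results. The sketch below assumes a higher-is-better blind-set metric and a hypothetical `min_gain` threshold for what counts as a real improvement; the function name and the scores in the usage example are made up purely for illustration and are not results from this programme.

```python
def diagnose(sweep, min_gain=0.005):
    """Classify a fixed-wells augmentation sweep.

    sweep: list of (augmentation_intensity, blind_score) pairs, where a higher
    blind_score is better and intensity is any monotone measure of synthetic
    volume (for example, total patch count).
    """
    if len(sweep) < 2:
        raise ValueError("need at least two augmentation settings to compare")
    ordered = sorted(sweep)  # ascending augmentation intensity
    last_score = ordered[-1][1]
    best_earlier = max(score for _, score in ordered[:-1])
    if last_score - best_earlier > min_gain:
        return "sample-limited"        # more augmentation is still paying off
    return "distribution-limited"      # flat or bending down: add real wells


# Illustrative, invented scores on a no-overlap blind zone.
still_rising = [(10_000, 0.80), (20_000, 0.84), (40_000, 0.88)]
bending_down = [(23_000, 0.86), (46_000, 0.87), (92_000, 0.85)]

print(diagnose(still_rising))  # sample-limited
print(diagnose(bending_down))  # distribution-limited
```

Comparing the most aggressive setting against the best of the lighter ones, rather than against its immediate neighbour, keeps a late dip from being masked by an earlier rise.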

Two engineering guardrails make this honest. First, augment the training split only — never let a transformed copy of a validation or blind patch leak across the split, or your metric will lie to you in exactly the direction that hides this failure. Second, keep a no-overlap blind zone as the gate that matters: a continuous interval the patcher never tiled with overlapping windows, scored at the offset a petrophysicist trusts. On this programme the production detector landed at roughly 90% recall within a 10 cm depth offset on that blind zone — a number we trust precisely because it was measured on geology the augmentation pipeline could not have memorised.
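A blind-zone gate scored in physical units might look like the following. It is a sketch under stated assumptions: the text does not say how predictions were matched to picks, so the greedy one-to-one nearest-depth matching here is a stand-in, and the function name and the depths in the example are hypothetical. Only the 10 cm tolerance comes from the text.

```python
def recall_within_offset(true_depths_m, pred_depths_m, tol_m=0.10):
    """Fraction of ground-truth features matched by a prediction within tol_m.

    Matching is greedy and one-to-one: each prediction can satisfy at most one
    ground-truth feature, so duplicate detections do not inflate recall.
    """
    if not true_depths_m:
        raise ValueError("need at least one ground-truth feature")
    preds = sorted(pred_depths_m)
    used = [False] * len(preds)
    hits = 0
    for t in sorted(true_depths_m):
        best = None
        for i, p in enumerate(preds):
            if used[i]:
                continue
            d = abs(p - t)
            if d <= tol_m and (best is None or d < abs(preds[best] - t)):
                best = i
        if best is not None:
            used[best] = True
            hits += 1
    return hits / len(true_depths_m)


# Invented depths in metres along a blind interval.
truth = [1000.00, 1001.50, 1003.20, 1005.00]
preds = [1000.04, 1001.65, 1003.15, 1004.93, 1007.00]

# The pick at 1001.50 m is missed (nearest prediction is 15 cm away).
print(recall_within_offset(truth, preds))  # 0.75
```

A production metric would normally also check dip and azimuth before counting a match; depth alone is used here to keep the example short.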

The deeper point about subsurface AI

This is not a quirk of one operator's wells. It is structural to subsurface machine learning, where labelled data is scarce, expensive, and physically bounded — there are only so many wells, and each one is a hard-won, irreplaceable sample of the earth. That scarcity makes augmentation tempting and makes the 92k-patch trap especially easy to fall into. The instinct to manufacture data is correct in spirit and dangerous in excess. It is also why the engineering discipline matters as much as the modelling: a versioned dataset, a reproducible patch-generation step, and an honest split are what let you attribute a regression to the augmentation pipeline rather than blame the network.

The number that should anchor a data strategy is not the size of the augmented set. It is the count of distinct, real wells — and, increasingly, the geological diversity among them. Augmentation buys you robustness inside the envelope your real data defines. It will never enlarge the envelope. Engineering effort spent past the augmentation ceiling is effort not spent on the one thing that reliably moves a subsurface model: another real well from a part of the basin the model has not yet seen.

Key takeaways

Overlapping-patch generation plus train-time transforms inflated ~900 real image-GT pairs into a 55,000+ patch set (65x). The card looks like a large dataset; it is ~900 pairs' worth of geology, multiplied.
Pushing augmentation to ~92k patches (Model 4) made the model WORSE on the no-overlap 12 m blind zone. Coarsening the stride from 40 to 80 — i.e. less synthetic volume — improved on it. At ~95% patch overlap, near-duplicates leak position across the train/val split and the model memorises its own augmentation.
Going from 8 to 11 REAL wells improved depth/dip/azimuth by ~0.007 MAE — a gain no amount of augmentation on the original wells produced. Real wells widen the distribution; augmentation only deepens a point already inside it. Generalisation tracks support, not sample count.
Diagnose before you crank augmentation: hold wells fixed and sweep intensity (e.g. stride 40 / aug 2 vs stride 160 / aug 7). Still rising on the honest held-out metric → sample-limited, augment more. Flat or bending down → distribution-limited; only a new real well helps.
Two non-negotiable guardrails: augment the TRAINING split only (no transformed copies leaking across splits), and gate on a continuous no-overlap blind zone scored in physical units (here, ~90% recall within a 10 cm depth offset).
