Proving It Generalises: A Blind Zone the Model Had Never Seen

Every interpreter who has watched a machine-learning demo asks the same question, and they are right to. It looks good on your validation set — but does it actually generalise to rock it has never seen? In a project where labels are expensive and wells are few, that question is not pedantry. It is the difference between a model a senior geoscientist will sign off against and a curiosity that lives in a notebook. For the fracture-detection work we did with a mid-sized Middle East carbonate operator we partnered with, answering it convincingly — not rhetorically — was the milestone that turned a research result into something the interpretation team would stand behind.

The answer we built was deliberately uncomfortable for the model. We carved a continuous interval out of a vertical well the network had never been trained on, masked it completely, and scored the model only on that strip. The result that earned trust: on the held-out zone, the fracture model recovered roughly 90% of picks within a 10 cm depth offset, and — the part interpreters cared about most — it produced those picks repeatably.

Why a held-out interval, not a held-out well

The honest constraint shaping this engagement was scale: the supervised corpus was 14 vertical wells of a single fractured carbonate play. With that few wells, the textbook move — hold an entire well out as a test set — is statistically self-defeating. Drop a well into the test bucket and you have removed roughly 7% of the geological diversity the model needs precisely to generalise; the well-count ablation on this dataset showed the Hungarian matching loss collapsing from 0.801 at 3 wells to 0.015 at 14, and classification error from 93.1% to a low single digit. Every well you withhold makes the model you are testing measurably worse. Fracture validation recall told the same story from the other side: above 70% once 15 wells were in play, but barely above 50% at 12.

So we partitioned at the interval level instead. Patches were drawn at the well-and-depth level and split so that a chosen, contiguous depth band of one well — its sinusoids, its overlaps, its texture — never appeared in training, validation, or augmentation. The model trained on everything else and was then asked to interpret an interval it had genuinely never seen, on a well held out for exactly this purpose. That is a stricter test than a shuffled random split, because adjacent borehole patches are correlated; a random split leaks. A contiguous masked band does not.

The test, stated honestly

We did not hold out a whole well — with a 14-well corpus, that would have degraded the very model under test. We held out a continuous depth band on an unseen well, with no overlapping patches bridging into training data, and scored only on that band. It is the strictest generalisation test the data budget allowed.

The blind zone was engineered, not improvised. We took a held-out vertical well — the blind well — and defined a single continuous strip of roughly 25 m of measured depth as the judgment zone, sited around 2.6 km down the hole. Crucially, the patches in that strip carried no overlap with any training patch. This matters because the production pipeline tiles each unwrapped log into 2.2-metre patches (800 pixels, the height inside which more than 95% of fracture and bedding sinusoids fit) using an overlapping stride. Overlap is good for coverage and bad for a clean blind test: an overlapping window straddling the boundary would smuggle blind-zone pixels into training. So for the held-out band we explicitly suppressed overlap, guaranteeing the model saw those metres for the first time at inference.

We ran this at two blind-zone lengths by design, fixed in a phase-closure agreement with the operator so the goalposts could not move after the fact: a 12 m continuous zone for the primary model and a tighter 4 m zone for a second, with the held-out band extended to as much as ~25 m of continuous, non-overlapping rock in the fuller evaluation. Two lengths, agreed in advance, is a small thing that buys a lot of credibility — it makes the test pre-registered rather than cherry-picked.

The model under test was the production fracture configuration: a ResNet-10 backbone trained from scratch (no pretrained weights — the ablations showed heavier backbones overfitting this small corpus), a 4-layer transformer encoder and 4-layer decoder, feedforward dimension 1,024, AdamW at learning rate 0.0004, batch size 128, early stopping after 40 epochs without improvement. The loss paired Focal loss for the fracture/bedding/no-object classification (weight 5) with L1 regression on the depth, dip, and azimuth triplet (weight 1); inference kept queries above a 0.5 probability. Nothing about the architecture changed for the blind test. That is the point — we were measuring the deployed model, not a special-cased one.

What "~90% recall within 10 cm" means to an interpreter

A pick is not a bounding box here; it is a sinusoid with a depth, a dip, and an azimuth. So recall has to be scored the way a geoscientist scores a colleague's pick: a prediction counts only if its depth lands inside a tolerance window of the ground-truth sinusoid. On the blind zone, at a 10 cm depth offset, the fracture model recalled roughly 90% of the true fractures. That window is not arbitrarily generous — it sits just above the irreducible floor of the data itself. At the native digital log resolution, one pixel equals about 3 cm, so a ±3 cm depth uncertainty is baked in before the model makes a single decision. Asking for sub-centimetre agreement on a 3-cm-per-pixel image would be measuring noise.

An unwrapped borehole image log: azimuth runs 0–360° across, depth runs down, and every fracture plane intersecting the wellbore traces a sinusoid (amplitude tracks dip, phase tracks azimuth). The DETR model predicts each sinusoid's depth; a pick is a hit only if it falls inside the interpreter's depth-match window. Drag the window — fractures flip recalled (teal) / missed (orange), false flags creep in as it loosens, and sensitivity trades against precision. At the 8 cm interpreter window the model recalls ~85%: validator-ready. Sensitivity, corpus and geometry per the case study; the per-pick errors and live precision are an illustrative detection model around that point.

The instrument above makes the trade tactile: tighten the depth-match window and recall falls as genuine picks drift outside tolerance; loosen it and false flags creep in. The operating point that mattered to this team sat where recall was high and the false-flag set stayed empty — and on never-seen rock, that point held. The broader sensitivity curve corroborated it: the fractures-only model's depth recall stabilised after 6 cm at 82% and reached 88% at its ceiling, with dip accuracy peaking near 98% (92% already at a 3° threshold) and azimuth near 97% (94% at 20°). Those are not blind-zone-specific flukes; they are the same shape of curve the validation set produced, which is exactly what generalisation is supposed to look like.

Repeatability was the real deliverable

Recall on a single run is necessary but not sufficient. The question behind the question is will it give me the same answer tomorrow? A model that recovers 90% of fractures but draws a different 90% each run is useless to a workflow that feeds a reservoir model. So the milestone was not one number — it was demonstrating that the picks on the blind zone were repeatable run to run, the same sinusoids recovered at the same depths with the same dip and azimuth. For a Detection Transformer that emits a set via one-to-one Hungarian matching rather than a thresholded pile of anchor proposals, that stability is structural: there is no non-maximum-suppression threshold whose jitter reshuffles which detections survive. The set the model commits to is the set you get.

Repeatability is also where the human-time argument lands. The manual baseline this replaced — path-opening with a sweep of threshold values — ran roughly 2 minutes per 2-metre interval of careful operator attention, and two operators rarely produced byte-identical picks. A model that is both accurate on unseen rock and deterministic across runs is what let the team treat its output as a first-pass interpretation to validate rather than a draft to redo. In the broader well-to-well rollout, that translated into the productivity and consistency gains the operator measured: on the order of +60% interpreter productivity and +75% interpretation consistency across the wells the tooling touched.

What actually earns interpreter trust

Before

Validation-set accuracy

Strong numbers on shuffled patches the model has effectively seen the neighbours of — answers 'does it fit?' not 'does it generalise?'

After

Blind-zone repeatability

~90% recall within 10 cm on a continuous, non-overlapping interval of an unseen well — and the same picks run to run

From 'looks good in the notebook' to 'a senior interpreter will sign off against it'

The engineering that made the test fair

A blind test is only as honest as the data plumbing behind it, and most of the credibility work was MLOps, not modelling. Three things had to be true and verifiable. First, no leakage: the overlapping-patch tiler had to be told to exclude the blind band, and we confirmed the patch manifests carried no window crossing the boundary — when we tightened the production stride from 40 to 80 pixels in a later model revision, the blind-set partition was rebuilt alongside it so the guarantee survived the change. Second, a frozen split: the blind-zone definition and its two lengths were fixed before evaluation, version-controlled, and reused identically across runs so that "repeatable" meant repeatable against a stable target. Third, negative controls on capacity: one model revision that pushed the augmented corpus to roughly 92k patches actually regressed, a reminder that more data volume without more geological diversity does not buy generalisation — and a result you only catch if your held-out band is genuinely held out.

None of that is glamorous. All of it is why an interpretation team believed the 90%. A generalisation claim is a claim about the test harness as much as the model; when the harness is auditable, the claim is defensible. It is the same applied-AI discipline — data engineering, leakage-free partitioning, and an MLOps-grade evaluation loop — that we carry into subsurface engagements with operators across the Middle East and the United States, where the geology changes but the small-data failure modes rhyme.

The caveats we kept in front of the operator

The result is strong and bounded, and we said so. The blind zone is a single continuous interval on one held-out vertical well of one fractured carbonate play — compelling evidence of generalisation within that distribution, not a guarantee across fields or across well geometries. Horizontal wells, where the average fracture sinusoid is far shorter, are a separate distribution that needs its own held-out validation. The ±3 cm pixel floor caps how tight any depth tolerance can ever be. And the well-count ablation is unambiguous about the next lever: the model is hungry for geological diversity, so the surest way to extend this blind-zone result is more wells, not more augmentation. Those caveats are not hedging — stating them is part of what made the headline number trustworthy.

How a blind zone earned interpreter trust

The credible generalisation test was a continuous, non-overlapping interval on an unseen vertical well — not a shuffled split (which leaks across adjacent patches) and not a held-out whole well (which, at a 14-well corpus, would degrade the model under test).
On that blind zone the fracture model recalled ~90% of picks within a 10 cm depth offset — just above the irreducible ±3 cm digital-log pixel floor — with the same sensitivity curve shape seen in validation, which is what real generalisation looks like.
Repeatability, not a single-run score, was the deliverable: a Hungarian-matched set has no NMS threshold to jitter, so picks were stable run to run — letting the team treat output as a first-pass interpretation to validate rather than redo, and underpinning ~+60% productivity / +75% consistency at rollout.
A blind test is only as honest as its plumbing: leakage-free patch manifests, a frozen pre-registered split with agreed blind-zone lengths (4 m and 12 m), and negative controls (a 92k-patch revision that regressed) are the MLOps that made the 90% defensible.

References

Generalisation, blind-zone, and ablation figures derived from internal validation on a 14-well carbonate dataset of high-resolution borehole image logs acquired with two different microresistivity imaging tools; data and code withheld under operator confidentiality.
Carion et al. (2020). End-to-End Object Detection with Transformers (DETR). ECCV 2020. https://arxiv.org/abs/2005.12872
Native digital-log resolution (~3 cm/pixel) establishes the irreducible depth-tolerance floor referenced throughout; blind-zone depth offsets are reported against that floor.

Proving It Generalises: A Blind Zone the Model Had Never Seen

Why a held-out interval, not a held-out well

Building the blind zone

What "~90% recall within 10 cm" means to an interpreter

Repeatability was the real deliverable

The engineering that made the test fair

The caveats we kept in front of the operator

References

Continuous AI for explorers

About Earthscan

Products

Legal

Follow us on