Validating Well-to-Well Transfer on 118 Public Wells Before Touching a Single Client Log

The most expensive mistake in an applied-AI engagement is not a model that underperforms. It is a model that fails silently on the one dataset you were paid to work on, the client's, after weeks of their scarce, hard-won wireline data have been spent discovering that the architecture was wrong all along. When we set out to build well-to-well (W2W) lithofacies correlation for a mid-sized Middle East carbonate operator, we refused to learn that lesson on their wells. Instead we validated the entire transfer mechanism on a public proxy first: the open FORCE 2020 machine-learning competition dataset of 118 wells from the Norwegian Continental Shelf (Zenodo record 4351156). We proved the mechanism on data nobody could be hurt by, then carried it onto the operator's 14-well carbonate field.

This is a case study about a pattern, not just a model. The pattern — prove the transfer mechanism on a public proxy, then transfer onto confidential field data — is one of the highest-leverage de-risking moves available to a subsurface ML team, and it is under-practised precisely because it looks like a detour. It is not a detour. It is the cheapest insurance you will ever buy.

The problem W2W correlation actually solves

Well-to-well correlation is the geological backbone of any multi-well study. Given conventional wireline logs (gamma ray, resistivity, density, neutron, sonic) down two or more wells, an interpreter divides each log into lithofacies groups and then correlates those groups across wells — tying the same stratigraphic layer in one well to its equivalent in a second well. Do it across a field and you get a 3D picture of the reservoir's layering; get it wrong and every downstream volume, every flow simulation, every infill-drilling decision inherits the error.

Framed as a machine-learning problem, the task has two stages. First, segmentation: detect the depths at which the log signature changes character — the kinks, the boundaries between lithofacies. Second, correlation: match a kink-bounded interval in one well to the geologically equivalent interval in another, which is a similarity-query problem over learned representations. In our pipeline the well is divided into fixed windows of 700 data points for kink detection; the model learns to flag group changes within a window, and a query head then asks which intervals in an unseen well are most similar.
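
The windowing and similarity-query steps can be sketched as follows. This is a minimal illustration under stated assumptions, not the production pipeline: the function names `make_windows` and `most_similar` are hypothetical, non-overlapping windows and cosine similarity are assumed choices that the text does not specify, and the embeddings passed to `most_similar` stand in for whatever representation the trained model produces.

```python
import numpy as np

WINDOW = 700  # samples per window, as stated in the text


def make_windows(logs: np.ndarray, labels: np.ndarray, window: int = WINDOW):
    """Split one well into non-overlapping fixed-size windows (assumed layout).

    logs:   (n_samples, n_curves) array of log readings, ordered by depth.
    labels: (n_samples,) integer lithofacies group per sample.
    Returns (x, kinks): x has shape (n_windows, window, n_curves) and
    kinks has shape (n_windows, window), with 1 where the group changes.
    A trailing remainder shorter than one window is dropped.
    """
    n_windows = len(logs) // window
    usable = n_windows * window
    # A kink is any sample whose group differs from the sample above it.
    kink = np.zeros(len(labels), dtype=np.int8)
    kink[1:] = labels[1:] != labels[:-1]
    x = logs[:usable].reshape(n_windows, window, logs.shape[1])
    kinks = kink[:usable].reshape(n_windows, window)
    return x, kinks


def most_similar(query: np.ndarray, candidates: np.ndarray, top_k: int = 3):
    """Rank candidate interval embeddings by cosine similarity to a query.

    query:      (dim,) embedding of one kink-bounded interval.
    candidates: (n_intervals, dim) embeddings from an unseen well.
    Returns the indices of the top_k candidates and their similarity scores.
    """
    q = query / np.linalg.norm(query)
    c = candidates / np.linalg.norm(candidates, axis=1, keepdims=True)
    scores = c @ q
    order = np.argsort(-scores)[:top_k]
    return order, scores[order]
```

In this sketch the kink targets are derived from interpreter labels, so the segmentation stage is trained to reproduce group changes, while the query stage only ever compares embeddings and never sees depth directly.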

The engineering risk is concentrated entirely in the second stage. A kink detector that works on the wells it trained on tells you almost nothing about whether the correlation will hold when you point it at a well it has never seen. That generalisation gap — train here, deploy there — is the whole game. And it is exactly the failure mode that does not show up until you are already standing on the client's data.

Why a public proxy, and why FORCE 2020 specifically

The operator's field was, by the standards of deep learning, brutally data-poor: 14 vertical wells, two different microresistivity imaging tools logging a fractured carbonate, every log a confidential asset under a strict IP agreement. You cannot afford to learn how to build the transfer mechanism on 14 wells. Every architectural false start, every loss-function dead end, every window-size sweep would burn irreplaceable client data and weeks of timeline — and worse, with so few wells you cannot even tell a robust result from a lucky one.

FORCE 2020 is the right proxy for a precise, defensible set of reasons. It is large — 118 wells on the Norwegian Continental Shelf, more than eight times the operator's well count — so a model that overfits has nowhere to hide; the well-count alone forces honest generalisation. It is the same shape of problem: conventional logs, lithofacies labels, a published train/test split designed by a community competition specifically to reward correlation that transfers across wells. And it is fully open, so the entire transfer experiment — architecture selection, the 700-point windowing, the similarity-query head, the train-on-some / validate-on-held-out protocol — could be iterated in the open, fast, on infrastructure as modest as our 8 GB-per-machine 1080Ti stack, with nobody's confidential data at risk.

The proxy is a different basin on purpose

FORCE 2020 is North Sea clastics; the client field is onshore Middle East carbonates. We were never trying to transfer the model's weights from one to the other — that would fail, and the next section is about why. We were validating that the transfer mechanism — the architecture and protocol that lets a kink/correlation model trained on some wells generalise to unseen wells — is sound. The mechanism is portable; the learned distribution is not.

The cliff we were actually de-risking against

The single most under-reported failure mode in published subsurface ML is out-of-distribution (OOD) collapse: a model trained on one geological distribution has no support over a distinct one, and silently extrapolates when you deploy it there. A correlation model that has only ever seen North Sea log signatures will not magically correlate Middle East carbonate kinks — and, critically, it will not tell you it is guessing. It returns confident nonsense.

Figure (schematic; interactive in the original): the out-of-distribution cliff. A model trained on one basin's distribution (illustrated as Gulf of Mexico salt, teal) has no support over a geologically distinct one (illustrated as pre-salt Brazil, grey). Inside the training lobe the model interpolates; past the orange OOD cliff, where training support falls to the floor, it extrapolates and re-training is required. The lobes, support curve and cliff position are illustrative only; no benchmark numbers are shown.
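
One simple way to make the cliff visible before deployment is to measure how far deployment samples sit from the training data. The sketch below uses a Mahalanobis distance on log features, which is a swapped-in, deliberately crude support check and not the method used in the engagement. The names `fit_support`, `mahalanobis` and `ood_fraction` and the 0.99 quantile are illustrative assumptions, and the code expects at least two feature columns.

```python
import numpy as np


def fit_support(train_features: np.ndarray):
    """Summarise the training distribution by its mean and inverse covariance."""
    mean = train_features.mean(axis=0)
    cov = np.cov(train_features, rowvar=False)
    # A small ridge keeps the covariance invertible for correlated curves.
    cov = cov + 1e-6 * np.eye(cov.shape[0])
    return mean, np.linalg.inv(cov)


def mahalanobis(features: np.ndarray, mean: np.ndarray, inv_cov: np.ndarray):
    """Distance of each row from the training mean, in covariance units."""
    d = features - mean
    return np.sqrt(np.einsum("ij,jk,ik->i", d, inv_cov, d))


def ood_fraction(train_features, deploy_features, quantile: float = 0.99):
    """Fraction of deployment samples beyond a training-distance quantile."""
    mean, inv_cov = fit_support(train_features)
    threshold = np.quantile(mahalanobis(train_features, mean, inv_cov), quantile)
    return float(np.mean(mahalanobis(deploy_features, mean, inv_cov) > threshold))
```

A fraction near the chosen tail mass (about 1% here) suggests the deployment well sits inside the training support; a fraction near 1 suggests the model would be extrapolating. A single-Gaussian summary like this can miss multi-modal structure, so it is best read as a warning light rather than a certificate.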

The mechanism is plain: inside its training distribution the model interpolates; past the edge of that support it extrapolates, and the comfortable "one universal correlation model" promise evaporates. This is the precise risk a proxy-validation strategy is designed to manage. We split the work into two independently verifiable claims:

Is the transfer mechanism sound? Answerable entirely on FORCE 2020: train the kink/correlation model on a subset of the 118 wells, validate on held-out wells, and confirm that correlation quality holds on wells the model never saw. If the mechanism cannot generalise across 118 public wells, it has no business near the client's 14.
Does it survive the basin change? Not answerable on the proxy — and we never pretended it was. Carbonate signatures are their own distribution. So the client wells were always going to require their own training pass; the proxy's job was to guarantee the machinery of that pass was correct before we spent a single client well on debugging it.
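
The held-out-well protocol behind the first claim can be sketched as a split made on well identifiers rather than on individual samples. This is an assumed, generic implementation: `split_wells`, `row_mask`, the 20% hold-out fraction and the seed are hypothetical, and FORCE 2020's own published train/test split could be used instead.

```python
import random


def split_wells(well_ids, holdout_fraction=0.2, seed=0):
    """Split at the well level so no well contributes to both sides.

    well_ids: one well identifier per log sample (repeats expected).
    Returns (train_wells, holdout_wells) as disjoint sets of identifiers.
    """
    unique = sorted(set(well_ids))
    rng = random.Random(seed)
    rng.shuffle(unique)
    n_holdout = max(1, round(len(unique) * holdout_fraction))
    holdout = set(unique[:n_holdout])
    train = set(unique[n_holdout:])
    return train, holdout


def row_mask(well_ids, wells):
    """Boolean mask selecting the samples that belong to the given wells."""
    return [w in wells for w in well_ids]
```

Splitting by well rather than by sample is the point of the design: adjacent samples from the same well are highly correlated, so a sample-level split would let the model see every validation well during training and overstate how well it generalises.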

Conflating those two questions is how teams get burned. Proxy validation forces them apart. The OOD cliff is the reason the proxy can prove the first claim and not the second — and knowing exactly which claim your public data can and cannot settle is the whole discipline.

What we proved on the proxy, and what carried over

On FORCE 2020 the team established the parts of the pipeline that are basin-agnostic engineering rather than geology: the 700-point windowing for kink detection, the segmentation-then-similarity decomposition, the train-on-some / validate-on-held-out protocol, the data-loading and augmentation harness, and the evaluation discipline of judging the model on wells it had never seen. Each of those is a decision you want to make once, correctly, on cheap data — not re-litigate against a confidential well.

Then we transferred the mechanism — not the weights — onto the operator's field and retrained on their wells. Because the machinery was already proven, the client-data phase was about geology and calibration, not about discovering that the architecture was wrong. The W2W correlation that came out the other side was strong enough to fold into the operator's interpretation workflow, where it delivered roughly a 60% lift in interpreter productivity and a 75% improvement in interpretation accuracy, hitting 95% precision on target intervals and 90% stratigraphic-correlation success across the field. In the same engagement, the companion image-log interpreters (the fracture and vug models) ran interpretation about 5× faster than manual picking — the kind of throughput that turns a multi-day picking job into an afternoon.

De-risking strategy: debug on client data vs proxy-first transfer

Before: validate on 14 confidential wells. Architecture, windowing and protocol are debugged directly on irreplaceable client data; overfitting is indistinguishable from skill at n=14, and weeks of scarce data are burned on false starts.

After: validate on 118 public wells, then transfer. The transfer mechanism is proven on FORCE 2020 (Zenodo 4351156); client wells are spent only on geology and calibration, not on debugging. Field result: +60% interpreter productivity, +75% accuracy, 95% target precision, 90% stratigraphic success.

The numbers belong to the Middle East carbonate engagement — they are the result of the transferred pipeline running on the client's own retrained model, never a claim about the public proxy. The proxy never produced a field metric and was never meant to. Its return on investment is counterfactual: the weeks of client data not wasted, the architectural dead ends discovered for free, the OOD failure mode anticipated rather than encountered live in front of the operator's geologists.

The engineering discipline this encodes

Strip away the geology and this is a statement about how to run an applied deep-learning programme under data scarcity and confidentiality — the two constraints that define real subsurface AI. Three principles generalise far beyond W2W correlation:

Separate mechanism from distribution. The portable asset is the architecture and protocol; the non-portable asset is the learned distribution. A public proxy can validate the first and must never be asked to certify the second. Naming the boundary — this the proxy proves, that it cannot — is the core skill.
Make overfitting impossible to hide. At n=14 wells you cannot distinguish a generalising model from a memorising one. At n=118 you can. Choosing a proxy larger than the target field is not a convenience; it is the only way to get an honest read on generalisation before the client data is in play.
Treat client data as a depletable resource. Every confidential well spent debugging infrastructure is a well not spent on the model that ships. Proxy-first development is, at bottom, a resource-allocation discipline: do the cheap, reversible learning in public, reserve the expensive, irreversible learning for where it counts.
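
A rough piece of arithmetic illustrates the second principle. Assuming, purely for illustration, that 20% of wells are held out and that per-well scores are independent with a standard deviation of 0.10, the standard error of the mean held-out score shrinks with the square root of the number of held-out wells. The function name and both parameter values are hypothetical.

```python
import math


def holdout_standard_error(n_wells, holdout_fraction=0.2, per_well_std=0.10):
    """Held-out well count and standard error of the mean per-well score.

    Assumes independent per-well scores with a common standard deviation,
    which is an illustrative simplification rather than a measured value.
    """
    n_holdout = max(1, round(n_wells * holdout_fraction))
    return n_holdout, per_well_std / math.sqrt(n_holdout)


for n_wells in (14, 118):
    n_holdout, se = holdout_standard_error(n_wells)
    print(n_wells, n_holdout, round(se, 3))
```

Under these assumptions the 14-well field leaves 3 held-out wells and a standard error of about 0.058, while 118 wells leave 24 and about 0.020, which is nearly three times tighter. The exact figures depend entirely on the assumed spread; the square-root scaling is what carries over.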

This pattern is not Middle East-specific, and our experience is not single-field. Across subsurface engagements — operators in the Middle East and the United States — the same constraint recurs: rich, confidential, scarce field data on one side, and on the other a wealth of open benchmarks (FORCE 2020, public log libraries, competition splits) that share the shape of the problem if not its distribution. The teams that ship reliably are the ones that exploit that asymmetry deliberately: prove the machinery in the open, transfer it onto the data that matters, and know precisely which question each dataset is allowed to answer.

Why we validate transfer on a public proxy first

The portable asset is the transfer mechanism — architecture, 700-point windowing, segment-then-correlate protocol, held-out-well evaluation — not the learned distribution; a public proxy can certify the mechanism and must never be asked to certify the basin.
FORCE 2020 (118 wells on the Norwegian Continental Shelf, Zenodo 4351156) is the right proxy precisely because it is more than eight times the size of the 14-well client field: at n=14 a memorising model is indistinguishable from a generalising one, so overfitting only becomes visible at scale, in public.
The out-of-distribution cliff is the reason the strategy works: North Sea clastics and Middle East carbonates are distinct distributions, so the proxy proves the machinery while the client wells are reserved for the one thing only they can settle — geology and calibration — yielding +60% productivity, +75% accuracy, 95% target precision and 90% stratigraphic success on the field.

References

Bormann, P., Aursand, P., Dilib, F., Manral, S., & Dischington, P. (2020). FORCE 2020 Well Well Log and Lithofacies Dataset for Machine Learning Competition (118 wells, Norwegian Sea). Zenodo record 4351156. https://doi.org/10.5281/zenodo.4351156
W2W correlation methodology — 700-point kink-detection windowing, segment-then-similarity decomposition, train-on-some / validate-on-held-out protocol — established on FORCE 2020 and transferred onto a 14-well carbonate field; field performance figures derived from internal validation. Client data and code withheld under operator confidentiality.
