A subsurface model is not the file you train. It is the entire apparatus required to retrain it on demand, explain why it predicted what it did, and run it next year after the geoscientist who built it has moved on. Most operators acquire the first thing and assume they have bought the second. They have not. The gap between a model that works in a notebook and a model an in-house team can own is an engineering gap, and it is the gap where multi-year subsurface AI programmes quietly fail.
Why this is an engineering problem, not a research one
Over roughly twenty months, working with a mid-sized Middle East carbonate operator across three phases, we built a formation-evaluation system on high-resolution borehole image logs (resistivity-based images of the borehole wall, used to pick bedding planes, fractures and vugs): bedding-plane and fracture detection, dip and azimuth regression, vug analysis, and well-to-well correlation. The modelling was hard. The modelling was also not the part that determined whether the operator would still have a working system at the end.
The research literature measures success in a metric on a held-out set. An operational programme is measured differently. Can you reproduce last quarter's result to the decimal? Can you tell which of forty checkpoints is the one running on the asset? When a new batch of wells arrives, can the operator retrain without you in the room? These are not modelling questions. They are versioning, provenance, and handover questions — and they are answered by infrastructure, not by a better architecture.
The 15 percent rule
A trained model is roughly fifteen percent of a production subsurface AI system. The other eighty-five percent — ingestion, versioning, experiment tracking, CI/CD, reproducibility scaffolding, serving, and the handover runbooks — is the part that decides whether the fifteen percent survives contact with a second year and a second team.
This whitepaper walks the eighty-five percent. It is written for the IT lead, the CTO, and the MLOps engineer who will inherit a subsurface model and have to keep it alive — and it is organised around the four disciplines that decide whether they can.
Discipline one: version the data, not just the code
Subsurface teams instinctively version code. Almost none version data, and in subsurface ML the data is where the entropy lives. The raw input arrives well by well: apparent dip and azimuth picks, well radius, the binary wireline log file (the binary container format for wireline log data), and the interpreter's PDF. It gets normalised, gets cut into overlapping patches, gets augmented, gets split into train, validation, and test. By the time a model trains, the bytes it sees are four or five transformations removed from anything an engineer can point at. If you cannot name the exact dataset a run consumed, you cannot reproduce the run, and you cannot defend the prediction.
We solved this the way it has to be solved: every dataset was a first-class, immutable, content-addressed artefact. We used DVC for data and model versioning sitting alongside Git for source, so that a commit hash pinned not just the code but the exact processed dataset and the model weights it produced. Each generated dataset was named with a hexadecimal UUID rather than a human label like final_v2, because human labels lie and hashes do not. A run referenced its dataset by that UUID; the UUID referenced a frozen set of patches; the patches traced back through the augmentation recipe to the specific wells and depth intervals that produced them.
This mattered because the dataset itself moved constantly. The core sinusoid dataset grew from roughly 900 image-and-ground-truth pairs to over 55,000 — a 65-fold expansion driven by overlapping-patch generation and augmentation. One reservoir interval alone went from 236 raw patches to 4,212 after augmentation, lifting sinusoid-bearing patches from 19 to 2,046. Without versioning, "the 55,000-pair dataset" is a meaningless phrase, because there were dozens of them. With versioning, every model was permanently bound to the one it actually learned from.
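A minimal sketch of what content addressing could look like, assuming patches are available as raw bytes keyed by name. The function name `dataset_id`, the patch keys, and the recipe fields are hypothetical; the programme itself relied on DVC and hex-UUID dataset names, and this is not a reproduction of that tooling. What it illustrates is that an identifier derived from content changes whenever a patch or the augmentation recipe changes, and stays the same otherwise.

```python
import hashlib
import json


def dataset_id(patches: dict[str, bytes], recipe: dict) -> str:
    """Derive a deterministic hex identifier from patch bytes and the recipe.

    `patches` maps a (hypothetical) patch name to its raw bytes; `recipe`
    holds the augmentation and split parameters that produced them.
    """
    digest = hashlib.sha256()
    # Sort by name so the identifier does not depend on insertion order.
    for name in sorted(patches):
        digest.update(name.encode("utf-8"))
        digest.update(b"\0")  # separator between name and content hash
        digest.update(hashlib.sha256(patches[name]).digest())
    # Canonical JSON keeps the recipe hash stable across key orderings.
    digest.update(json.dumps(recipe, sort_keys=True).encode("utf-8"))
    return digest.hexdigest()[:32]


# Illustrative values only: two tiny "patches" and a made-up recipe.
example_id = dataset_id(
    {"well01_patch000": b"\x00\x01", "well01_patch001": b"\x02"},
    {"overlap": 0.5, "augment": ["flip"]},
)
```

Two datasets that differ by a single byte, or by a single recipe parameter, get different identifiers, which is the property that makes a phrase like "the 55,000-pair dataset" unambiguous.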
Data QC is a versioned gate, not a vibe
Of ten wells received in one intake, two were excluded before training — both carried abnormal static binary-log value ranges that fell outside the normal 0–255 band and defeated normalisation, and one of the two had additionally been acquired with a different image-logging tool whose response was not directly comparable. That exclusion is a data decision with model consequences, so it lives in the dataset's provenance record. A future engineer reading the lineage sees eight wells and the recorded reason for the other two. Silent exclusion is how irreproducibility enters a programme.
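The gate described above can be sketched as a function that never drops a well without writing down why. This is an illustrative sketch, not the programme's actual QC code: `IntakeRecord`, `qc_gate`, and the well names are hypothetical, and the only rule implemented is the 0–255 static value band named in the text.

```python
from dataclasses import dataclass, field

VALID_MIN, VALID_MAX = 0, 255  # normal static value band named in the text


@dataclass
class IntakeRecord:
    """Provenance for one intake: what was accepted, what was not, and why."""
    accepted: list[str] = field(default_factory=list)
    excluded: dict[str, str] = field(default_factory=dict)  # well -> reason


def qc_gate(wells: dict[str, list[float]]) -> IntakeRecord:
    """Check each well's static values against the valid band.

    Assumes every well has at least one value. Excluded wells are kept in
    the record with a reason rather than silently discarded.
    """
    record = IntakeRecord()
    for well, values in wells.items():
        lo, hi = min(values), max(values)
        if lo < VALID_MIN or hi > VALID_MAX:
            record.excluded[well] = (
                f"static value range [{lo}, {hi}] outside "
                f"[{VALID_MIN}, {VALID_MAX}]"
            )
        else:
            record.accepted.append(well)
    return record
```

Storing the returned record alongside the dataset version is what lets a later reader see both the accepted wells and the recorded reason for each exclusion.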
The operational layer on top of this was a data-management dashboard that itself went through twelve numbered versions over the engagement. Versioning is not a one-time setup. It is a habit that has to hold for the life of the programme, and the only way it holds is if the tooling makes the versioned path the easy path.
Discipline two: track every experiment, including the failures
A subsurface model is the survivor of a search, not the product of a recipe. The search space here was large by intent. Backbones spanned ResNet-10 through ResNet-34; optimisers covered SGD, Adam, and AdamW; learning rates were swept from 0.001 down to 0.0005; encoder and decoder depth, feed-forward dimension, loss composition, and class weighting were all in play. Across this space we ran hyperparameter sweeps with Weights & Biases as the experiment tracker, logging every run's configuration, metrics, and artefacts to a single shared project so that no result existed only in someone's terminal scrollback.
The discipline that separates an engineering programme from a research one is logging the failures with the same rigour as the wins. The first supervised runs overfit — constant prediction on the validation set despite a matched class distribution across splits. That is a result. It drove a specific, recorded set of changes: single-channel input, a smaller backbone than ResNet-18, a different sinusoid loss, augmenting only the sinusoid-bearing patches rather than the whole set, and incorporating well angle. None of those decisions would be defensible a year later if the failing runs that motivated them had been deleted. An experiment tracker is a lab notebook that cannot be edited after the fact, and that immutability is the point.
The search converged on a configuration the operator's team could read off a single record: a ResNet-10 backbone trained from scratch with no pretrained weights, four encoder and four decoder layers, feed-forward dimension 1024, AdamW at an optimal learning rate of 0.0004, dropout 0.2, a combined focal-and-L1 loss with class-loss weight 5 against parameter-loss weight 1, an inference probability threshold of 0.5, and early stopping after 40 epochs without improvement. That is not a paragraph of folklore. It is a tracked, reproducible artefact — and when the operator's engineers needed to retrain on new wells, it was the starting point they inherited, not a starting point they had to rediscover.
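As a sketch of what "a single record" might look like in code, the configuration above can be written as a frozen dataclass. The field names, `combined_loss`, and `should_stop` are hypothetical; the values are the ones stated in the text. The sketch assumes the focal term is the class loss and the L1 term is the parameter loss, which is one reading of the text rather than something it spells out.

```python
from dataclasses import asdict, dataclass
import json


@dataclass(frozen=True)
class TrainConfig:
    backbone: str = "resnet10"
    pretrained: bool = False          # trained from scratch
    encoder_layers: int = 4
    decoder_layers: int = 4
    ffn_dim: int = 1024
    optimizer: str = "adamw"
    learning_rate: float = 0.0004
    dropout: float = 0.2
    class_loss_weight: float = 5.0    # weight on the focal (class) term
    param_loss_weight: float = 1.0    # weight on the L1 (parameter) term
    inference_threshold: float = 0.5
    early_stop_patience: int = 40     # epochs without improvement


def combined_loss(focal: float, l1: float, cfg: TrainConfig) -> float:
    """Weighted sum of the two loss terms (assumed form of the combination)."""
    return cfg.class_loss_weight * focal + cfg.param_loss_weight * l1


def should_stop(val_losses: list[float], patience: int) -> bool:
    """True once the best validation loss is at least `patience` epochs old."""
    if not val_losses:
        return False
    best_epoch = val_losses.index(min(val_losses))
    return len(val_losses) - 1 - best_epoch >= patience


cfg = TrainConfig()
config_json = json.dumps(asdict(cfg), sort_keys=True)  # loggable record
```

Serialising the record with sorted keys gives a stable string that can be stored with the run and diffed against the configuration of any later retrain.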
The drift-and-staleness picture above is the reason experiment tracking has to outlive the build phase. A model is not finished when it ships; it begins decaying the moment the formation it sees in production diverges from the formation it trained on. The team that owns the model needs the same tracking discipline the build team used, because the retrain that refreshes a stale model is itself an experiment that has to be logged, compared, and version-pinned. Hand over the model without handing over the tracking habit, and the operator inherits a system they can run exactly once.
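The text does not say how drift was measured, so the following is a swapped-in illustration rather than the programme's method: a population stability index (PSI) comparing the distribution of an input statistic at training time against the same statistic in production. The function name `psi`, the ten-bin default, and the 0–255 range are assumptions, and any alerting threshold would have to be chosen by the team that owns the model.

```python
import math


def psi(reference: list[float], production: list[float],
        bins: int = 10, lo: float = 0.0, hi: float = 255.0) -> float:
    """Population stability index between two samples over fixed bins.

    Returns 0.0 for identical binned distributions and grows as they diverge.
    """
    if not reference or not production:
        raise ValueError("both samples must be non-empty")
    width = (hi - lo) / bins

    def proportions(values: list[float]) -> list[float]:
        counts = [0] * bins
        for v in values:
            # Clamp out-of-range values into the first or last bin.
            idx = min(bins - 1, max(0, int((v - lo) / width)))
            counts[idx] += 1
        # A small floor avoids log(0) when a bin is empty.
        return [max(c / len(values), 1e-6) for c in counts]

    ref, prod = proportions(reference), proportions(production)
    return sum((p - r) * math.log(p / r) for r, p in zip(ref, prod))
```

A check like this would run on each new batch of wells, with the result logged to the same tracker as the training runs so that the decision to retrain is itself recorded.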
Discipline three: CI/CD and the path to serving
Most subsurface ML stops at a checkpoint and a Streamlit demo. Production needs the rung above that: continuous integration that catches a broken pipeline before it reaches the asset, and a serving path that turns a frozen checkpoint into something a geoscientist can run without a data scientist standing behind them. We treated this as an explicit phase rather than an afterthought — the programme's own phase ladder named "model rollback and serving" and "production CI/CD" as Phase 3 deliverables, distinct from the Phase 1 and Phase 2 modelling work.
The lifecycle we built and handed over was the full loop, not a fork of it: data exploration, feature engineering, training and experimentation, evaluation against geoscientist-validated ground truth, a production-ready frozen model, CI/CD, drift and staleness monitoring, and a feedback loop back to data. A model registry sat at the centre, so that "the model running on the asset" was an addressable, promotable object rather than a file path someone remembered. The serving surface was a set of applications running inside the operator's perimeter on a custom MLOps control plane, with the underlying compute provisioned on-premise rather than rented from a cloud that would also be ingesting the operator's subsurface data.
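To make "addressable, promotable object" concrete, here is a minimal in-memory sketch of a registry with promotion and rollback. It is a toy under stated assumptions, not the control plane described above: `ModelVersion`, `Registry`, and their methods are hypothetical names, and a real registry would persist its state and enforce access control.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class ModelVersion:
    checkpoint: str    # frozen checkpoint file name
    dataset_id: str    # identifier of the dataset it was trained on
    threshold: float   # inference threshold shipped with the model


@dataclass
class Registry:
    versions: dict[int, ModelVersion] = field(default_factory=dict)
    history: list[int] = field(default_factory=list)  # promotions, oldest first

    def register(self, model: ModelVersion) -> int:
        """Store a model and return its version number."""
        version = len(self.versions) + 1
        self.versions[version] = model
        return version

    def promote(self, version: int) -> None:
        """Make a registered version the one running in production."""
        if version not in self.versions:
            raise KeyError(f"unknown version {version}")
        self.history.append(version)

    def rollback(self) -> int:
        """Return production to the previously promoted version."""
        if len(self.history) < 2:
            raise RuntimeError("no earlier production version to roll back to")
        self.history.pop()
        return self.history[-1]

    @property
    def production(self) -> Optional[ModelVersion]:
        return self.versions[self.history[-1]] if self.history else None
```

With this shape, "which model is running on the asset" is answered by `registry.production`, and a bad promotion is undone by `registry.rollback()` rather than by someone remembering a file path.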
The funnel above is the attrition every operator should expect and most do not plan for. A pilot that produces a validated metric has cleared one gate of several. Ingestion with provenance, a versioned model registry, a CI pipeline, a serving layer inside the security perimeter, drift monitoring, and a feedback loop are all gates between a working notebook and a workflow on the asset — and a programme that skips any of them ships a model that cannot be operated. The discipline here is to scope CI/CD and serving in from the first phase as line items with acceptance criteria, not to bolt them on once the modelling "is done." Modelling is never done; serving is what makes that survivable.
Discipline four: reproducibility you can defend in a review
Reproducibility in subsurface AI is not an academic nicety. A dip and azimuth prediction feeds a structural model that feeds a drilling decision worth millions. When a reviewer asks why the model placed a fracture at a given depth, "the network said so" is not an answer. The answer has to be reconstructible: this dataset UUID, this frozen checkpoint, this seed, this threshold, this depth-conversion arithmetic.
Several things make subsurface reproducibility harder than it looks, and each has to be engineered around explicitly:
- Irreducible measurement error has to be separated from model error. At the image resolution in play, a single image-log pixel corresponds to about 3 centimetres of depth, so a ±3 cm uncertainty is baked into the input before the model does anything. When a prediction sits 3 cm off a ground-truth pick, that is the instrument, not the network — and the localisation tolerance that defines a true versus false positive has to be chosen and recorded against that physical floor. We swept tolerances at 2, 4 and 6 cm precisely so the metric could be read against the instrument, not in spite of it; report a number without naming the tolerance and the metric is meaningless.
- Frozen checkpoints have to be self-describing. A checkpoint whose filename encodes its own provenance (learning rate, epoch count, backbone, loss, and a timestamp) tells an engineer what it is without a lookup. "The 0.0004 / ResNet-10 / focal-plus-L1 checkpoint trained at this timestamp" is unambiguous; best_model_final.pt is a future incident.
- Thresholds are part of the model. A focal-loss configuration tuned for recall ran at a 0.5 confidence threshold; a cross-entropy configuration tuned for precision-and-recall ran at 0.9. A checkpoint shipped without its threshold is a checkpoint that will be run wrong.
- Determinism has to survive the data growth. Because the dataset itself was non-stationary — growing from 8 to 11 to 14 to 16 wells over the phases, where moving from 8 to 11 wells improved depth, dip, and azimuth error by roughly 0.007 MAE — every reported number is meaningless unless it is pinned to the well-count and dataset version that produced it. Reproducibility and versioning are the same discipline viewed from two angles.
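The tolerance point in the list above can be illustrated with a small matching routine. This is a sketch under assumptions: depths are in centimetres, matching is greedy nearest-neighbour and one-to-one, and the function names and example depths are hypothetical. The programme's actual matching rule is not described in the text.

```python
def match_picks(pred_cm: list[float], truth_cm: list[float],
                tolerance_cm: float) -> tuple[int, int, int]:
    """Greedy one-to-one matching of predicted depths to ground-truth picks.

    Returns (true_positives, false_positives, false_negatives).
    """
    remaining = sorted(truth_cm)
    tp = 0
    for p in sorted(pred_cm):
        if not remaining:
            break
        nearest = min(remaining, key=lambda t: abs(t - p))
        if abs(nearest - p) <= tolerance_cm:
            remaining.remove(nearest)  # each pick can be matched only once
            tp += 1
    return tp, len(pred_cm) - tp, len(truth_cm) - tp


def precision_recall(tp: int, fp: int, fn: int) -> tuple[float, float]:
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Made-up depths: the same predictions scored at the three swept tolerances.
truth = [100.0, 250.0, 400.0]
pred = [103.0, 255.0, 600.0]
by_tolerance = {tol: match_picks(pred, truth, tol) for tol in (2.0, 4.0, 6.0)}
# by_tolerance == {2.0: (0, 3, 3), 4.0: (1, 2, 2), 6.0: (2, 1, 1)}
```

The same three predictions score zero, one, or two true positives depending only on the tolerance, which is why a metric reported without its tolerance cannot be interpreted.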
The reproducibility test
The honest test of a subsurface ML programme is this: hand a new engineer the repository, the dataset UUID, and the checkpoint name, and ask them to reproduce last quarter's headline number to the decimal, without talking to the person who produced it. If they can, the programme is engineered. If they cannot, it is research wearing an operations costume — and it will not survive the handover.
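Since the test starts from a checkpoint name, it helps if that name can be built and parsed mechanically. The sketch below assumes a hypothetical naming scheme with double-underscore separators; the field order, the prefixes, and the example epoch count and timestamp are illustrative, not the programme's actual convention. Including the threshold in the name is an addition suggested by the point that thresholds are part of the model.

```python
from datetime import datetime, timezone

STAMP_FORMAT = "%Y%m%dT%H%M%SZ"


def checkpoint_name(backbone: str, loss: str, lr: float, epochs: int,
                    threshold: float, trained_at: datetime) -> str:
    """Build a file name that carries the checkpoint's own provenance."""
    stamp = trained_at.astimezone(timezone.utc).strftime(STAMP_FORMAT)
    return (f"{backbone}__{loss}__lr{lr:g}__ep{epochs}"
            f"__thr{threshold:g}__{stamp}.pt")


def parse_checkpoint_name(name: str) -> dict:
    """Recover the provenance fields from a name built by checkpoint_name."""
    stem = name.removesuffix(".pt")  # requires Python 3.9+
    backbone, loss, lr, epochs, thr, stamp = stem.split("__")
    return {
        "backbone": backbone,
        "loss": loss,
        "lr": float(lr.removeprefix("lr")),
        "epochs": int(epochs.removeprefix("ep")),
        "threshold": float(thr.removeprefix("thr")),
        "trained_at": datetime.strptime(stamp, STAMP_FORMAT)
                              .replace(tzinfo=timezone.utc),
    }


# Illustrative values: epoch count and timestamp are made up.
name = checkpoint_name("resnet10", "focal-l1", 0.0004, 120, 0.5,
                       datetime(2024, 1, 1, tzinfo=timezone.utc))
# name == "resnet10__focal-l1__lr0.0004__ep120__thr0.5__20240101T000000Z.pt"
```

A new engineer handed such a name can read off the learning rate, backbone, loss, and threshold before opening the tracker, and the tracker entry then confirms rather than supplies that information.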
The final mile: handover to the operator's ICT team
Everything above exists to make one thing possible: handing the system to the operator's in-house ICT team and walking away without the system dying. This is the discipline most programmes skip, and skipping it is why so many "successful" pilots are quietly dead within a year of the consultants leaving.
Handover is not a documentation drop. We treated it as the acceptance test for the whole programme, and structured it three ways. First, deliverable packaging: each of the three production capabilities — vug detection, bedding-and-fracture detection, well-to-well correlation — was handed over as a complete unit comprising the versioned dataset, the frozen model, the architecture, the output format, and the runbooks. A model without its dataset and its operating documentation is not a deliverable; it is a liability with good metrics.
Second, a real choice about operating posture. We did not assume the operator wanted to run everything themselves on day one. We costed three honest scenarios with their genuine trade-offs: a fully self-operated path, where the team retrains independently on their own on-premise GPU hardware on a cadence measured in days; the current managed path, where retraining inside the existing arrangement runs in minutes to hours; and an off-premise managed path, where retraining is handled externally on a two-to-three-week cadence. The point of laying out all three is that capability transfer is a dial, not a switch — and the operator, not the vendor, sets it.
Third, and most important, people, not just artefacts. Software hands over cleanly; the judgement to operate it does not. The programme deliberately built local capability alongside the system, training a cohort of 55 young professionals — 15 of them local nationals, drawn from regional universities as in-country capability building — so that the know-how to run, debug, and extend the platform lived inside the region rather than departing with the delivery team. A handover that transfers files but not judgement produces shelfware with a maintenance contract. The infrastructure disciplines in this whitepaper are what make the judgement transferable: a versioned dataset, a tracked experiment history, a CI/CD pipeline, and a reproducible checkpoint are teachable. A model that only one person understands is not.
The return on all of this is not abstract. The same engineering that makes a system reproducible and handoverable is what frees expert time at scale — interpretation work that took a geoscientist hours per well collapses to a reviewed, agentic workflow, with the expert retained for anomalies and sign-off rather than for mechanical picking. But that dividend only persists if the operator can keep the system running. A productivity gain that evaporates when the build team leaves was never a gain. It was a loan.
What good looks like
For an IT lead or CTO evaluating a subsurface AI programme — one you are buying, one you are running, or one you are trying to rescue — the questions that matter are not about model architecture. They are about whether the programme was engineered to be owned:
- Is every dataset content-addressed and immutable, with data-QC exclusions recorded as provenance rather than applied silently?
- Is every experiment, including the failures, tracked in a shared, immutable record that the operator's team can read and extend?
- Are CI/CD, a model registry, and a serving path inside the security perimeter scoped as first-phase deliverables with acceptance criteria — not bolted on at the end?
- Can a new engineer reproduce a headline number to the decimal from the repository, a dataset UUID, and a checkpoint name alone?
- Does the handover transfer datasets, frozen checkpoints, runbooks, and judgement — with the operating posture chosen by the operator, not assumed by the vendor?
If the answer to all five is yes, the operator owns a system. If the answer to any is no, they own a checkpoint and a countdown.
What this whitepaper argues
- A trained subsurface model is ~15% of a production system; the other 85% is versioning, tracking, CI/CD, reproducibility, and handover.
- Version data, not just code — every dataset content-addressed and immutable (DVC + Git, hex-UUID dataset names), with QC exclusions recorded as provenance.
- Track every experiment including failures in an immutable shared record (W&B); the retrain that refreshes a stale model is itself a tracked experiment.
- Reproducibility means separating instrument error from model error, self-describing frozen checkpoints, shipped thresholds, and numbers pinned to dataset version.
- The handover is the acceptance test for everything before it — transfer datasets, checkpoints, runbooks AND judgement, with operating posture chosen by the operator.
References
International Energy Agency. Energy and AI Special Report (2025). Missing internal expertise identified as the dominant adoption barrier across the energy sector. https://www.iea.org/reports/energy-and-ai
McKinsey & Company. The State of AI (2025). Workflow redesign identified as the single strongest EBIT correlate of AI value capture. https://www.mckinsey.com/
N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, S. Zagoruyko. End-to-End Object Detection with Transformers (DETR). ECCV 2020. Architectural basis for the set-prediction detection approach. https://arxiv.org/abs/2005.12872
D. Sculley et al. Hidden Technical Debt in Machine Learning Systems. NeurIPS 2015. The canonical argument that ML code is a small fraction of a production ML system. https://papers.nips.cc/paper/2015/hash/86df7dcfd896fcaf2674f757a2463eba-Abstract.html