The Data Bottleneck Is the Real Bottleneck in Subsurface AI

There is a comfortable story that gets told about machine learning in the subsurface, and it goes like this: the hard part is the model. Pick the right architecture, tune the right loss, find the right backbone, and accuracy follows. It is a story engineers like because it is the part of the problem they control. In our work with a mid-sized Middle East carbonate operator — a roughly twenty-month engagement building a fracture-detection system on image logs from two different microresistivity imaging tools — that story turned out to be almost exactly backwards. The model mattered. But the variable that moved accuracy by orders of magnitude was not the architecture. It was how many annotated wells actually showed up.

A program scoped for 25 wells, fed 8

The engagement was scoped, on paper, around 25 wells delivered for annotation. That was the number the data-supply plan promised, the number the training schedule assumed, and the number that implicitly underwrote the accuracy targets. What arrived during the early phases, while the model was being built, was 8 usable wells. Ten image-log wells were delivered in the first tranche; two of those were dropped (one for a tool mismatch, one for image-quality problems), leaving 8 to actually train on. The raw-pick interpretations and binary wireline log files for the rest of the scoped wells trickled in over many months, tracked one delivery at a time on a data-management dashboard, well behind the cadence the project needed.

This is not a story about a vendor missing a deadline. It is the normal shape of subsurface data supply, and it is the single most under-modelled risk in operator AI programs. Annotated wells do not arrive on the schedule a Gantt chart assumes. They are gated by an expert interpreter's availability, by which fields have been digitised, by confidentiality and partner sign-off, by the physical reality that a producing asset's priority is production, not feeding a model. A program that treats labelled-well supply as a given has mis-located its own critical path.

The accuracy curve tracked well count, not algorithm changes

Here is the part that should reorganise how a data leader thinks about these programs. We ran a controlled sweep — hold the architecture fixed, vary only the number of training wells, and watch the model's classification error. The fracture-detection model is a Detection-Transformer-derived set predictor: it emits a fixed set of candidate sinusoids per image patch, matches them to ground-truth picks with a Hungarian bipartite assignment, and is scored on how cleanly it separates "fracture here" from "nothing here." That classification error is the cleanest single readout of whether the model has learned the geology or is guessing.
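For readers who have not met set prediction before, the matching step can be sketched in a few lines. This is a minimal illustration, not the team's code: the L1 cost, the normalised (depth, dip, azimuth) tuples, and the names `l1_cost` and `match_predictions` are all assumptions made for the example, and a brute-force search over permutations stands in for the Hungarian algorithm (on a handful of candidates it finds the same minimum-cost assignment).

```python
from itertools import permutations


def l1_cost(pred, truth):
    """Sum of absolute differences between two (depth, dip, azimuth) tuples."""
    return sum(abs(p - t) for p, t in zip(pred, truth))


def match_predictions(preds, truths):
    """Return (assignment, total_cost) for the cheapest one-to-one matching.

    assignment[j] is the index of the prediction matched to truths[j].
    Brute force over permutations: only sensible for small sets, but it
    reaches the same optimum a Hungarian solver would.
    """
    if len(truths) > len(preds):
        raise ValueError("need at least as many predictions as ground-truth picks")
    best_assignment, best_cost = (), float("inf")
    for candidate in permutations(range(len(preds)), len(truths)):
        cost = sum(l1_cost(preds[i], truths[j]) for j, i in enumerate(candidate))
        if cost < best_cost:
            best_assignment, best_cost = candidate, cost
    return best_assignment, best_cost


# Hypothetical normalised (depth, dip, azimuth) values for one image patch.
preds = [
    (0.10, 0.50, 0.20),
    (0.80, 0.30, 0.70),
    (0.45, 0.90, 0.10),
    (0.82, 0.28, 0.65),
]
truths = [(0.81, 0.29, 0.66), (0.12, 0.52, 0.18)]

assignment, total_cost = match_predictions(preds, truths)
# Predictions left unmatched are the ones the model should label "nothing here".
unmatched = sorted(set(range(len(preds))) - set(assignment))
```

In this toy patch the two ground-truth picks pair with predictions 3 and 0, and predictions 1 and 2 are left over as "nothing here" candidates. The classification error discussed in the text is, in effect, a measure of how often the model gets that fracture-versus-empty call wrong.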

The numbers are not subtle. At 3 wells, classification error sat at 93.1% — the model was, functionally, guessing. At 6 wells it fell to 18.4%. At 9 wells it dropped to 1.06%. By 11 wells it reached 0.82%. The same sweep showed the Hungarian matching loss falling from 0.801 at 3 wells to 0.025 at 11. Two orders of magnitude of accuracy improvement, bought with nothing but more annotated wells through the same unchanged pipeline.
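The "two orders of magnitude" claim is easy to check from the four sweep readings. The short calculation below uses only the figures quoted above; the helper names are illustrative.

```python
import math

# Sweep readings quoted in the text: training wells -> classification error (%).
class_error_pct = {3: 93.1, 6: 18.4, 9: 1.06, 11: 0.82}
# Hungarian matching loss at the two ends of the same sweep.
matching_loss = {3: 0.801, 11: 0.025}


def fold_reduction(before, after):
    """How many times smaller `after` is than `before`."""
    return before / after


def orders_of_magnitude(before, after):
    """The same reduction expressed as a base-10 logarithm."""
    return math.log10(before / after)


error_fold = fold_reduction(class_error_pct[3], class_error_pct[11])
error_orders = orders_of_magnitude(class_error_pct[3], class_error_pct[11])
loss_fold = fold_reduction(matching_loss[3], matching_loss[11])

# Reduction contributed by each step of the sweep: (3, 6), (6, 9), (9, 11).
wells = sorted(class_error_pct)
step_folds = {
    (a, b): fold_reduction(class_error_pct[a], class_error_pct[b])
    for a, b in zip(wells, wells[1:])
}
biggest_step = max(step_folds, key=step_folds.get)
```

Classification error falls roughly 114-fold end to end (a little over two orders of magnitude), while the matching loss falls roughly 32-fold. The steps are uneven: about 5-fold from 3 to 6 wells, about 17-fold from 6 to 9, and about 1.3-fold from 9 to 11, so most of the gain arrived between the sixth and ninth well.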

[Interactive figure: data load set against a draggable dashed analyst ceiling, with an orange breach point and an "automation imperative" wedge that recompute as the ceiling moves. The chart reads "within human reach" only if the ceiling is raised above today's 4,000-fold, a budget no team has. Anchors are sourced; the ceiling is an editorial throughput proxy.]

Set that against what the algorithm work bought. Across the same period the team did a great deal of genuine engineering: refining the augmentation pipeline, sweeping backbones, tuning the matching cost, swapping static for dynamic imagery. Those changes were real and some were decisive. But the largest movements on the accuracy curve, the order-of-magnitude ones, lined up with the calendar on which wells were delivered, not the calendar on which code was merged. When the dataset grew from 8 to 11 wells, mean absolute error on depth, dip, and azimuth improved by roughly 0.007, a meaningful gain from data alone that the team's own notes flatly labelled "data beats augmentation." The model was never the bottleneck. The supply of labelled geology was.

Why a small dataset bends every engineering decision

The data shortage did not just lower the ceiling on accuracy — it propagated into the architecture and the MLOps design, and that is the part operators rarely see. With only a handful of wells, you cannot hold out an entire well as a clean test set without starving training; the team was forced to split at the image-patch level rather than the well level, a compromise driven entirely by well scarcity, not by preference. The backbone sweep told the same story from the other side: a deliberately small, from-scratch ResNet-10 beat every deeper variant, posting a 0.50 classification error against 26.76 for ResNet-34. On a data-rich problem you reach for a heavier feature extractor; on this one, a bigger network simply overfit before the set-prediction objective could converge. Small data does not just hurt accuracy. It dictates your model class, your validation strategy, and your entire experimental tempo.
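The validation trade-off is easier to see side by side. The sketch below is an illustration under stated assumptions, not the project's pipeline: the well names, the 100 patches per well, and the 12.5% test fraction are all hypothetical, chosen only so that both splits hold out the same number of patches.

```python
import random


def make_patches(wells, patches_per_well):
    """Represent each image patch as a (well_name, patch_index) pair."""
    return [(well, i) for well in wells for i in range(patches_per_well)]


def split_by_patch(patches, test_fraction, seed=0):
    """Shuffle all patches together, then carve off a test fraction."""
    rng = random.Random(seed)
    shuffled = patches[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def split_by_well(patches, test_wells):
    """Hold out every patch from the named wells."""
    train = [p for p in patches if p[0] not in test_wells]
    test = [p for p in patches if p[0] in test_wells]
    return train, test


def wells_in(patches):
    """Set of wells that contribute at least one patch."""
    return {well for well, _ in patches}


wells = [f"well_{k}" for k in range(1, 9)]            # 8 usable wells, names hypothetical
patches = make_patches(wells, patches_per_well=100)   # patch count hypothetical

patch_train, patch_test = split_by_patch(patches, test_fraction=0.125)
well_train, well_test = split_by_well(patches, test_wells={"well_8"})

# Wells that appear on both sides of each split.
shared_patch_level = wells_in(patch_train) & wells_in(patch_test)
shared_well_level = wells_in(well_train) & wells_in(well_test)
```

Both splits train on 700 patches here, but they answer different questions. The well-level split shares no wells between train and test, at the price of training on only seven wells' worth of geology. The patch-level split lets the model see every well, but its test patches come from wells it has already trained on, so the score it reports is likely to flatter performance on a genuinely unseen well. That is the compromise well scarcity forces.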

It also forced the data engineering to do heavy lifting that more wells would have made unnecessary. Augmentation was not a nicety here — it was load-bearing. With augmentation switched off, classification error pinned at 100%; switched on, it collapsed to 2.6%, with the augmentation pipeline inflating the training set roughly 65-fold and fixing a brutal class imbalance (the raw data had a few dozen sinusoid-bearing patches against thousands of empty ones). That entire sub-system — a synthetic-multiplication layer engineered to manufacture variety the well count could not supply — existed because the data supply fell short. Every hour spent building it was an hour the program paid for the missing wells.
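The text does not say which transforms the augmentation pipeline used, so the following is a generic illustration rather than a description of it. One label-preserving transform that suits image logs is a circular shift along the azimuth axis: the image is an unwrapped borehole wall, so rolling its columns rotates the sinusoid without breaking it. The sketch assumes a patch stored as rows of depth samples with columns running over 0 to 360 degrees of azimuth; the function names and the toy 2 x 8 patch are hypothetical.

```python
def roll_azimuth(patch, shift):
    """Circularly shift every row of a patch to the right by `shift` azimuth bins."""
    n_cols = len(patch[0])
    shift %= n_cols
    if shift == 0:
        return [row[:] for row in patch]
    return [row[-shift:] + row[:-shift] for row in patch]


def shift_label(azimuth_deg, shift, n_cols):
    """Move an azimuth label by the same angle the image was rolled."""
    return (azimuth_deg + shift * 360.0 / n_cols) % 360.0


def augment_positive(patch, azimuth_deg, copies):
    """Return `copies` rolled versions of one sinusoid-bearing patch.

    Shifts are spread evenly around the borehole and never reach a full
    turn; `copies` should stay below the number of azimuth bins so that
    no shift collapses to zero or repeats.
    """
    n_cols = len(patch[0])
    augmented = []
    for k in range(1, copies + 1):
        shift = (k * n_cols) // (copies + 1)
        augmented.append(
            (roll_azimuth(patch, shift), shift_label(azimuth_deg, shift, n_cols))
        )
    return augmented


# Toy patch: 2 depth rows x 8 azimuth bins, with a pick labelled at 90 degrees.
patch = [[0, 1, 2, 3, 4, 5, 6, 7], [10, 11, 12, 13, 14, 15, 16, 17]]
variants = augment_positive(patch, azimuth_deg=90.0, copies=3)
new_labels = [label for _, label in variants]
```

Applying transforms like this only to the rare sinusoid-bearing patches multiplies the minority class while leaving the empty patches alone, which is one plausible way a single pipeline could both inflate the training set and ease the class imbalance.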

The instrument argues the same thing in a different register

The visual above frames a related throughput gap: as data density per survey has climbed roughly a hundred-fold over two decades, the load it places on interpretation has outrun anything a fixed human headcount can clear, which is the structural reason automation is no longer optional. Drag the analyst ceiling and it only flips to "within human reach" at a budget no team has. The mirror image is true on the training side. The work an AI model can do is bounded above by the labelled geology a fixed, slow, expert-gated supply chain can deliver. On one side, too much data for humans to interpret; on the other, too little annotated data for the model to learn from. Both are supply-and-capacity problems wearing different clothes, and both are solved by governing the data flow as deliberately as you govern the model.

What this means for an operator running an AI program

If you are a subsurface or data leader sponsoring one of these programs, the practical conclusions are uncomfortable but clear.

Treat labelled-well supply as the critical path, and instrument it. The team here tracked deliveries on a data-management dashboard precisely because slippage was the dominant risk. Make annotated-well throughput a first-class, weekly-reported metric — wells scoped, wells delivered, wells usable after QC — and forecast accuracy against the wells you will actually have, not the wells in the contract. The gap between 25 and 8 is not a footnote. It is the difference between the model you promised and the model you can build.
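As a concrete shape for that weekly metric, a minimal sketch follows. The `WellSupply` class and its field names are hypothetical; the figures plugged in are the ones from this engagement's first tranche (25 scoped, 10 delivered, 8 usable).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WellSupply:
    """One reporting snapshot of annotated-well supply."""

    scoped: int      # wells promised in the data-supply plan
    delivered: int   # wells actually received so far
    usable: int      # delivered wells that survived QC

    def __post_init__(self):
        if self.scoped <= 0 or not 0 <= self.usable <= self.delivered <= self.scoped:
            raise ValueError("expected 0 <= usable <= delivered <= scoped, with scoped > 0")

    @property
    def delivery_rate(self):
        """Share of the scoped wells that have arrived."""
        return self.delivered / self.scoped

    @property
    def qc_yield(self):
        """Share of delivered wells that passed QC."""
        return self.usable / self.delivered if self.delivered else 0.0

    @property
    def usable_share(self):
        """Share of the scoped wells the model can actually train on."""
        return self.usable / self.scoped

    @property
    def shortfall(self):
        """Wells still missing relative to the plan."""
        return self.scoped - self.usable


first_tranche = WellSupply(scoped=25, delivered=10, usable=8)
report = (
    f"delivered {first_tranche.delivery_rate:.0%} of scope, "
    f"QC yield {first_tranche.qc_yield:.0%}, "
    f"usable {first_tranche.usable_share:.0%} of scope, "
    f"shortfall {first_tranche.shortfall} wells"
)
```

Separating delivery rate from QC yield matters because they have different owners: the first is a supply-chain problem, the second a data-quality problem. Reported together each week, they show at a glance that the program is training on about a third of the wells it was scoped for.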

Budget interpreter time as a program resource, not an afterthought. The rate-limiting step is rarely GPU. It is an expert geoscientist producing ground-truth picks. That person is the real constraint, and their time should be planned, protected, and paid for like the scarce capital it is.

Spend engineering effort where the data is thin, not where it is comfortable. On a small-well program the highest-leverage engineering is not a fancier architecture — it is the augmentation pipeline, the imbalance handling, the patch-level validation discipline, and the data QC that turns 10 delivered wells into 8 trustworthy ones. Match the engineering to the actual bottleneck.

Do not over-build the model for data you do not have. The from-scratch ResNet-10 result is the whole lesson in miniature: on a data-constrained problem, the disciplined small model wins. Reaching for the largest backbone because it tops a public leaderboard is a way to convert a data problem into an overfitting problem.

The encouraging corollary is that the lever works in both directions. If well scarcity is what caps accuracy, then governing well supply is a controllable, high-return investment — arguably a higher-return one than another month of architecture search. In this engagement, the path from a 93% error to a sub-1% error did not run through a cleverer model. It ran through getting from 3 wells to 11. For an operator, that is good news: the most powerful knob on AI accuracy is one you already own.

Key takeaways

A subsurface-AI program scoped around 25 annotated wells was fed only 8 usable wells through the build phase — labelled-well supply, not the model, was the real critical path.
In a controlled sweep with the architecture held fixed, classification error fell from 93.1% at 3 wells to 18.4% at 6, 1.06% at 9, and 0.82% at 11 — two orders of magnitude of accuracy from data alone.
Data scarcity propagated into the engineering: it forced patch-level (not well-level) validation, favoured a small from-scratch ResNet-10 (0.50 class error vs 26.76 for ResNet-34), and made the ~65× augmentation pipeline load-bearing (100% → 2.6% class error).
For operators, data-supply governance is the highest-return lever on AI accuracy: instrument annotated-well throughput weekly, budget expert-interpreter time as scarce capital, and forecast accuracy against the wells you will actually have.
Match engineering effort to the real bottleneck — on a thin-data program, augmentation, imbalance handling, and QC beat architecture search, and over-building the model for data you do not have just converts a data problem into an overfitting one.
