The Economics of Automated Interpretation: Reviewing Published Cost and Throughput Studies

Abstract

How fast, and how cheap, does machine learning actually make subsurface interpretation? The upstream literature answers with confident multipliers: a million times faster than a human geologist, hundreds to thousands of times faster than a reservoir simulator, a hundred times faster at forecasting production, a fifth off the cost of drilling. This review collects the most-cited of these figures, all of which trace through a single widely-read survey of artificial intelligence in upstream oil and gas [1], and asks the only question that matters before any of them enters a budget: what did each number measure? The recurring finding is not fraud but category error. The large speedups are kernel measurements, the time for one accelerated stage rather than the time for the end-to-end job, and quoting a kernel acceleration as an application result is exactly the mistake the performance-measurement literature has warned against for decades [2] [3]. The cost saving is a different currency again, a ratio of money rather than a ratio of time, and the two are routinely added to the same slide as though they were comparable. We propose a short deflation discipline, grounded in the un-accelerated-fraction argument [2], that converts any headline kernel factor into an effective end-to-end figure, and we read the four published numbers through it. The result is sobering and useful in equal measure: a 1,000,000x kernel speedup is real and a 1,000,000x project is a fantasy, and the gap between them is governed not by the model but by the fraction of the workflow the model never touched.

Why the headline numbers travel further than the studies

The figures this review examines are not obscure. They circulate in conference keynotes, in vendor decks, and in the introductions of papers that go on to do careful work, and they almost always arrive detached from the experiment that produced them. The most-cited source for the cluster is a 2021 survey of artificial intelligence across the upstream value chain, which gathered acceleration and cost results from the exploration, drilling, reservoir, and production literatures into one place [1]. That survey did the field a service by collecting them. The problem is what happens after collection, when a single sentence such as "deep learning can be a million times faster than manual geological mapping" is lifted out of its context and pasted into a business case as though it described the time it takes to interpret a well.

There is a well-developed body of thought on how to avoid exactly this. The performance-measurement tradition in computer architecture has insisted for half a century that you measure the speedup of the whole program, not the loop you optimised, because the part you did not touch comes to dominate the runtime the moment the part you did touch gets fast enough [2] [3]. The machine-learning systems literature has made the parallel argument about cost: the model is a small fraction of a deployed system, and a cost figure that counts only the model leaves out the data plumbing, the retraining, the monitoring, and the human review that make up most of the total [5]. A widely-read worked example showed how a per-inference energy cost that looks trivial balloons once the amortised training and architecture search behind it are folded back in [6]. None of this is new. What is new, in the upstream setting, is how rarely the headline figures are read through any of it before they shape a decision.

This review is therefore not an attack on the underlying studies, several of which are sound, and it is not a claim to have re-measured them. It is a reading discipline. We take the four figures the survey is most often quoted for, state plainly what each one measures, and apply a single deflation lens so that a reader can convert a headline into something a budget can use.

Method: what we collected and how we read it

The corpus is deliberately small and concrete. We took the four upstream-AI figures that recur most often in the secondary literature and that all resolve to the 2021 survey [1]: a 1,000,000x speedup for a deep network performing geological assessment versus manual mapping; a 200x to 2000x acceleration for a deep-network reservoir proxy versus a conventional simulator; a 100x speedup for gradient-boosted production forecasting versus a baseline; and a 20% cost saving from machine-learning optimisation of drilling. For each we recorded three things: the quantity it measures (wall-clock time, throughput, or money), the boundary of that measurement (a single kernel, a stage, or the end-to-end job), and the baseline it is compared against.
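The three recorded attributes fit a small record type. The sketch below is one possible encoding in Python, not the coding sheet actually used: the names `Claim`, `Quantity`, `Boundary`, and `CORPUS` are illustrative assumptions, and the boundary and baseline values reflect only what this review says about each figure.

```python
from dataclasses import dataclass
from enum import Enum


class Quantity(Enum):
    TIME = "wall-clock time"
    THROUGHPUT = "throughput"
    MONEY = "money"


class Boundary(Enum):
    KERNEL = "single kernel"
    STAGE = "stage"
    END_TO_END = "end-to-end job"


@dataclass(frozen=True)
class Claim:
    task: str
    headline: str
    quantity: Quantity
    boundary: Boundary
    baseline: str


# Coding follows this review's own reading of each figure. It is not taken
# from the primary studies, which were not audited.
CORPUS = [
    Claim("geological assessment", "1,000,000x",
          Quantity.TIME, Boundary.KERNEL, "manual mapping"),
    Claim("reservoir proxy", "200x to 2000x",
          Quantity.TIME, Boundary.KERNEL, "conventional simulator"),
    Claim("production forecasting", "100x",
          Quantity.TIME, Boundary.KERNEL, "unspecified baseline"),
    # The saving is read here as a budget-level figure, so END_TO_END.
    Claim("drilling optimisation", "20% cost saving",
          Quantity.MONEY, Boundary.END_TO_END, "not stated in this review"),
]

kernel_claims = [c.task for c in CORPUS if c.boundary is Boundary.KERNEL]
money_claims = [c.task for c in CORPUS if c.quantity is Quantity.MONEY]
print("kernel-level:", kernel_claims)
print("money-denominated:", money_claims)
```

Keeping quantity and boundary as separate fields makes it hard to sort or sum claims that do not share both.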

We then applied one deflation lens, drawn directly from the un-accelerated-fraction argument [2]. If a reported factor s accelerates only a fraction p of the end-to-end task, the effective whole-task speedup is not s but 1/((1-p)+p/s). The intuition is the one architects have used for decades: as s grows large, the term p/s vanishes and the effective speedup is capped at 1/(1-p), which depends only on the part you never accelerated [2] [3]. This lens does not need the original study's internals; it needs only the honest admission that interpretation is a pipeline, and that segmentation or simulation is one stage of it. The interactive exhibit below renders exactly this transformation on the four collected figures.
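The deflation lens above can be written as a minimal Python sketch. The function names `effective_speedup` and `ceiling` are illustrative, and p = 0.9 is an assumed value chosen only to show the saturation, not a measured fraction from any study.

```python
import math


def effective_speedup(s: float, p: float) -> float:
    """Whole-task speedup when a fraction p of the task is accelerated by s."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    if s <= 0.0:
        raise ValueError("s must be positive")
    return 1.0 / ((1.0 - p) + p / s)


def ceiling(p: float) -> float:
    """Limit of effective_speedup as s grows without bound: 1 / (1 - p)."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    return math.inf if p == 1.0 else 1.0 / (1.0 - p)


# Assumed accelerated fraction, for illustration only.
p = 0.9
for s in (10, 100, 1_000, 1_000_000):
    print(f"s = {s:>9,}  effective = {effective_speedup(s, p):6.2f}  "
          f"ceiling = {ceiling(p):.2f}")
```

Raising s from 1,000 to 1,000,000 barely moves the effective figure, because the ceiling is set by p alone.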

A note on what we did not do. We did not re-run any of the four experiments, we did not audit the original baselines, and we treat the published factors as given. The deflation is an analytical reading applied on top of the literature's own numbers, not a competing measurement. Where we draw on our own raster-log work, we flag it as such and use it only to ground the argument in a workflow we can account for end to end.

Results: the four figures on one scale, and what survives

Put the four figures on a common axis and the first thing that becomes visible is that they do not belong on the same axis at all. Three are speedups, ratios of time, and span four orders of magnitude from 100x to 1,000,000x. The fourth is a 20% cost saving, a ratio of money, and it has no natural position on a speedup scale. The exhibit draws the three speedups on a shared logarithmic axis and the saving on its own percentage scale precisely to make that incommensurability impossible to miss.
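The position of each speedup on a base-10 logarithmic axis can be computed directly from the headline factors. The snippet below is a small illustrative check; the dictionary keys are labels chosen here, not terms from the survey.

```python
import math

# Headline speedup factors as quoted in this review (time ratios only).
speedups = {
    "production forecasting": 100,
    "reservoir proxy, low end": 200,
    "reservoir proxy, high end": 2_000,
    "geological assessment": 1_000_000,
}

# Position of each factor on a base-10 logarithmic axis.
log_positions = {name: math.log10(s) for name, s in speedups.items()}
span_in_decades = max(log_positions.values()) - min(log_positions.values())

for name, pos in log_positions.items():
    print(f"{name:<26} 10^{pos:.2f}")
print(f"span: {span_in_decades:.1f} orders of magnitude")

# The cost saving is a fraction of money, so it is kept off the log axis
# and stored on its own scale.
cost_saving_fraction = 0.20
```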

Exhibit: a claims-scrutiny chart that puts four published upstream-AI figures on one common scale and then deflates them. Three are kernel speedups drawn on a shared log axis: a 1,000,000x DNN-versus-manual geological-mapping factor, a 200-to-2000x reservoir-simulator proxy speedup, and a 100x production-forecasting speedup. The fourth, a 20% drilling cost saving, is a portfolio cost number rather than a speedup, so it is drawn in orange on its own zero-to-one-hundred-percent scale and flagged as the odd one out. An accelerated-fraction lever p applies Amdahl's law to convert each headline kernel factor s into the effective end-to-end gain 1/((1-p)+p/s). The un-accelerated remainder of the workflow sets a ceiling of 1/(1-p), so even a million-times kernel collapses to a job-level gain of about 10x at p = 0.9 and about 3.3x at p = 0.7. The four figures and the 200-2000x range are the literature's own published numbers, attributed to a 2021 review of AI in upstream oil and gas; the Amdahl deflation and the p lever are an analytical lens for reading the claims, not a re-measurement of any study.

The deflation is where the headline numbers come apart. The 1,000,000x figure is a kernel result: it is the time for a trained network to produce a geological assessment compared against the time for a human to map the same thing by hand, and it says nothing about the data preparation, the quality control, the cases that fall back to a human, or the integration with everything downstream. Treat geological assessment as the whole job and the million stands; treat it as one accelerated stage in a pipeline that is, say, seventy percent automatable, and the effective end-to-end speedup is capped at about 3.3 times, that is 1/(1-0.7), no matter how fast the kernel runs [2]. The reservoir figure behaves the same way. A deep-network proxy that evaluates a reservoir scenario 200 to 2000 times faster than a full physics simulator is a genuine and valuable result, but a reservoir study is not only the forward simulation; it is history matching, uncertainty quantification, and the engineering judgement that frames the runs, and those do not speed up because the inner loop did. The 100x production-forecasting figure is the most modest of the three and, not coincidentally, the most likely to survive contact with a real workflow, because forecasting is closer to being the whole task than mapping or simulation is.
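To make the collapse concrete, the sketch below deflates the published speedups over a few accelerated fractions. The values of p are assumptions for illustration (0.7 mirrors the seventy-percent example above); none is a measured property of any study, and the names `HEADLINES` and `FRACTIONS` are hypothetical.

```python
def effective_speedup(s: float, p: float) -> float:
    """Amdahl deflation: speedup s applied to a fraction p of the task."""
    return 1.0 / ((1.0 - p) + p / s)


# Headline kernel factors as quoted in this review.
HEADLINES = {
    "geological assessment": 1_000_000,
    "reservoir proxy, high end": 2_000,
    "reservoir proxy, low end": 200,
    "production forecasting": 100,
}

# Assumed accelerated fractions, for illustration only.
FRACTIONS = (0.5, 0.7, 0.9, 0.99)

table = {
    name: {p: effective_speedup(s, p) for p in FRACTIONS}
    for name, s in HEADLINES.items()
}

print(f"{'figure':<28}" + "".join(f"p={p:<8}" for p in FRACTIONS))
for name, row in table.items():
    print(f"{name:<28}" + "".join(f"{row[p]:<10.2f}" for p in FRACTIONS))
```

At p = 0.7 all four rows land between roughly 3.26x and 3.33x, even though the headline factors range from 100 to 1,000,000.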

The 20% drilling cost saving is the honest one, and it is honest precisely because it is already expressed at the level that matters. It is a fraction of money off a real budget line, not a multiplier on a sub-step, which is why it can be used in a business case more or less as stated, and why it is the smallest-looking number in the set and yet the most directly bankable. The lesson is not that cost savings are better than speedups. It is that a number quoted at the level of the whole job is worth more to a decision than a spectacular number quoted at the level of a kernel, however impressive the kernel.

It is worth adding one figure from an adjacent study to complete the cautionary set. A carbon-capture engagement-prediction model reported an accuracy of 90.476% [7], a perfectly respectable result, but accuracy is a third currency again, neither time nor money, and it cannot be converted into either without a model of what each error costs. A slide that places a 90% accuracy beside a 1,000,000x speedup beside a 20% saving is comparing three things that share nothing but a number line.
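A minimal sketch of why accuracy needs an error-cost model before it becomes money: the same 90.476% yields very different totals depending on what one error costs. The function name, the number of decisions, and every cost value below are hypothetical, and the model assumes all errors cost the same, which real workflows rarely satisfy.

```python
def expected_error_cost(accuracy: float, n_decisions: int,
                        cost_per_error: float) -> float:
    """Expected money lost to wrong predictions, assuming a uniform error cost."""
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must lie in [0, 1]")
    return (1.0 - accuracy) * n_decisions * cost_per_error


ACCURACY = 0.90476  # the published accuracy figure [7]

# Hypothetical error costs in arbitrary money units.
for cost_per_error in (1.0, 100.0, 10_000.0):
    total = expected_error_cost(ACCURACY, n_decisions=1_000,
                                cost_per_error=cost_per_error)
    print(f"cost per error = {cost_per_error:>9,.0f}  "
          f"expected loss over 1,000 decisions = {total:>12,.0f}")
```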

Discussion: where our own work sits, and how to read the next claim

We have written elsewhere about our own raster-log digitisation system, and it is the right place to ground this review honestly, because we can account for its economics end to end rather than at the level of a kernel. The throughput gain that matters in that system is not the per-image inference time, which is fast and largely irrelevant to the cost, but the fraction of a scanned archive that clears automatically without a human touching it. That is a whole-job number by construction, and it is the kind of figure this review argues should replace kernel speedups in any serious business case. We deliberately do not quote it here as a headline multiplier, because doing so would commit the exact error the review is about. The discipline cuts both ways, and the firm that invented the system has to live by it too.

Reading the field's claims through the deflation lens suggests a short checklist that costs nothing to apply. First, identify the currency: is the number a ratio of time, a ratio of money, or a measure of quality? Those are three different axes, and adding them is meaningless. Second, identify the boundary: did the figure measure a kernel, a stage, or the end-to-end job? Only the last is directly usable [3]. Third, estimate the accelerated fraction p and deflate accordingly, because however large s is, the whole-job gain cannot exceed 1/(1-p) [2]. Fourth, for any cost figure, ask what the model-only number left out, since the surrounding system is usually where the money actually goes [5] [6]. Fifth, demand a fixed task and a fixed quality bar before comparing two throughput figures, the same discipline the benchmarking community imposes on itself [4]. None of these steps requires access to the original study; all of them can be run against a single slide.
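The five steps can be sketched as a small triage function. This is an illustrative encoding, not a tool used in this review: the `Headline` fields, the `read_claim` name, and the choice of p = 0.7 are all assumptions, and steps four and five reduce to reminders because they need human judgement.

```python
from dataclasses import dataclass
from typing import List, Optional

CURRENCIES = {"time", "money", "quality"}
BOUNDARIES = {"kernel", "stage", "end-to-end"}


@dataclass
class Headline:
    text: str
    currency: str
    boundary: str
    factor: Optional[float] = None             # speedup s, for time claims
    includes_surrounding_system: bool = False  # relevant to step 4
    fixed_task_and_quality: bool = False       # relevant to step 5


def read_claim(h: Headline, p: float) -> List[str]:
    """Apply the five-step reading discipline and return the resulting notes."""
    if h.currency not in CURRENCIES:
        raise ValueError(f"unknown currency: {h.currency}")
    if h.boundary not in BOUNDARIES:
        raise ValueError(f"unknown boundary: {h.boundary}")
    notes = [f"1. currency: {h.currency}", f"2. boundary: {h.boundary}"]
    if h.currency == "time" and h.factor is not None and h.boundary != "end-to-end":
        effective = 1.0 / ((1.0 - p) + p / h.factor)
        notes.append(f"3. deflated at p={p}: about {effective:.1f}x end to end")
    if h.currency == "money" and not h.includes_surrounding_system:
        notes.append("4. ask what the model-only cost left out")
    if h.currency == "time" and not h.fixed_task_and_quality:
        notes.append("5. demand a fixed task and quality bar before comparing")
    return notes


mapping = Headline("1,000,000x geological mapping", "time", "kernel",
                   factor=1_000_000)
drilling = Headline("20% drilling cost saving", "money", "end-to-end")

for claim in (mapping, drilling):
    print(claim.text)
    for line in read_claim(claim, p=0.7):  # p = 0.7 is an assumed value
        print("  " + line)
```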

The broader point is that the upstream-AI numbers are mostly real and mostly useless in the form they travel, and the fix is not skepticism but conversion. A 1,000,000x kernel speedup is a true statement about a stage and a misleading statement about a project, and the distance between those two readings is set by the part of the workflow the model never saw.

Limitations

This is a literature reading, and it carries the limits of one. The four figures we examine are taken as published; we did not re-run the underlying experiments, audit their baselines, or verify that the original studies bounded their measurements the way their headline phrasing implies, so our critique is of how the numbers are used downstream rather than of the studies themselves. The deflation lens is a single model, the un-accelerated-fraction argument [2], and it assumes the end-to-end task decomposes cleanly into an accelerated fraction and a serial remainder; real interpretation pipelines have feedback, parallelism, and human-in-the-loop steps that this one-parameter model does not capture, so the effective factors it produces are illustrative bounds, not predictions. The accelerated fraction p in the exhibit is a reader-set parameter, not a measured property of any of the four studies, and different workflows will place it very differently. We also restrict the corpus to four figures from one survey of the secondary literature [1]; a systematic review would widen the sample, code each claim's measurement boundary from the primary source, and weight by how often each is cited in decisions rather than in papers. Finally, the one example we draw from our own work is a single workflow on a single archive, offered to ground the argument rather than to generalise it. A reader should take the deflation discipline as a way to interrogate the next claim they are handed, not as a finished accounting of the field's true economics.

What to carry into the next claim

The most-cited upstream-AI acceleration figures (1,000,000x mapping, 200-2000x reservoir simulation, 100x production forecasting, 20% drilling cost saving) all trace to a single 2021 survey and travel detached from what they measured.
Three of the four are kernel speedups, not job-level results. Quoting a kernel acceleration as an end-to-end outcome is the classic measurement error the performance literature has warned against for fifty years.
Deflate any headline factor with the un-accelerated-fraction argument: a speedup s on only a fraction p of the task yields an effective whole-job speedup of 1/((1-p)+p/s). A million-times kernel is capped at 1/(1-p): about 10x at p = 0.9 and about 3.3x at p = 0.7.
A speedup, a cost saving, and an accuracy are three different currencies (time, money, quality) and cannot share a number line. The 20% saving is the most bankable figure precisely because it is already stated at the level of the whole budget.
A five-step reading discipline (currency, boundary, deflate by p, account for the model-only omission, fix the task and quality bar) converts spectacular but unusable headlines into figures a business case can actually use.

References

[1] Koroteev, D., and Tekic, Z. Artificial intelligence in oil and gas upstream: Trends, challenges, and scenarios for the future. Energy and AI, 3 (2021). The survey that aggregates the four headline figures examined here and the source they all resolve to in the secondary literature. https://doi.org/10.1016/j.egyai.2020.100041

[2] Amdahl, G. M. Validity of the single processor approach to achieving large scale computing capabilities. AFIPS Spring Joint Computer Conference (1967). The un-accelerated-fraction argument that caps achievable speedup, the deflation lens used throughout this review. https://doi.org/10.1145/1465482.1465560

[3] Hennessy, J. L., and Patterson, D. A. Computer Architecture: A Quantitative Approach, 6th edition. Morgan Kaufmann (2017). The standard treatment of honest speedup measurement and the warning against quoting a kernel acceleration as an application result. https://www.sciencedirect.com/book/9780128119051/computer-architecture

[4] Mattson, P., Reddi, V. J., Cheng, C., et al. MLPerf: An Industry Standard Benchmark Suite for Machine Learning Performance. IEEE Micro, 40(2) (2020). Why throughput claims need a fixed task, a fixed quality target, and an end-to-end boundary to be comparable. https://doi.org/10.1109/MM.2020.2974843

[5] Sculley, D., Holt, G., Golovin, D., et al. Hidden Technical Debt in Machine Learning Systems. NeurIPS (2015). The surrounding glue, data plumbing, and maintenance that a model-only cost figure omits from the total cost of ownership. https://papers.nips.cc/paper/2015/hash/86df7dcfd896fcaf2674f757a2463eba-Abstract.html

[6] Strubell, E., Ganesh, A., and McCallum, A. Energy and Policy Considerations for Deep Learning in NLP. ACL (2019). A worked example of a per-inference cost that looks negligible until amortised training and search are folded back in. https://aclanthology.org/P19-1355/

[7] Buah, E., et al. Can Machine Learning Predict Engagement in Carbon Capture and Storage Adoption? Energies, 13(23) (2020). The 90.476% accuracy figure used here to make the point that accuracy is a third currency, neither time nor money. https://doi.org/10.3390/en13236259
