Physics or Data: The Long Debate in Subsurface Modeling, Explained

The argument gets framed as a fight more often than it deserves. On one side, the reservoir engineer with a first-principles simulator that solves the governing equations of flow through rock, and a well-earned suspicion of any model that cannot say why it believes what it believes. On the other, the machine-learning practitioner with a net that never opened a physics textbook and does not need to, because it learned the mapping straight from data. Put in those terms it sounds like a war, and people treat it like one. The framing is wrong, and the cost of the wrong framing is real: teams pick a camp, defend it, and miss that the two approaches answer different questions and win in different regimes. This is a note about reading the subsurface model families as a spectrum instead, and about the one move on that spectrum that buys the most speed for the least compromise. It is deliberately not a walk through VeerNet, our own raster-log digitiser; that architecture has its own paper. This is the map that VeerNet sits on, not the pin.

Two poles, and why neither is wrong

Start with the poles, because each is defensible on its own terms. A physics-first simulator encodes what we actually know: mass balance, Darcy flow, the coupled thermodynamics of a reservoir under production. When the physics is known and the rock properties are constrained, that solver is not just accurate, it is auditable. You can trace an output back to an equation and a boundary condition. The price is that a full-order solve is expensive, and a study that needs thousands of runs (history matching, uncertainty sweeps, optimisation over well placements) can turn a defensible model into a bottleneck.

The other pole gives up the equations entirely. A pure data-driven net fits the input-output relation from examples and asks no permission from first principles. Where the physics is poorly known, or the governing relation is a tangle nobody has written down cleanly, that is exactly the right trade. Our own raster-log digitiser lives here: it turns scanned curve traces back into depth-indexed values with no flow equation anywhere in sight, and on the multiclass validation set its best regression read-out reached an R-squared of 0.9891. There is no physics prior in that number and there does not need to be. Reconstructing a curve off a photograph is a perception problem, not a reservoir-physics one, so the data-driven pole is the honest choice, not a concession.

The mistake is treating those two as the whole story. Between them sits a family that keeps some physics and learns the rest, and the field has a precise name for the continuum: physics-informed machine learning, running from strong physical priors at one end to purely data-driven models at the other, with the right choice set by how much reliable physics and how much data you actually hold [4]. A physics-informed neural network makes the idea concrete, folding a governing partial differential equation into the training loss as a soft constraint so the fit is pulled toward physically admissible solutions rather than merely plausible ones [2]. You do not have to be at a pole.
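To make the soft-constraint idea concrete, here is a minimal sketch, not the formulation of any particular paper or product. It assumes the simplest possible governing equation, steady single-phase flow in a homogeneous one-dimensional medium, where pressure satisfies d²p/dx² = 0. The names `pde_residual`, `physics_informed_loss`, and `weight`, and all the numbers, are hypothetical, and a central finite difference stands in for the automatic differentiation a real physics-informed network would use.

```python
from typing import Callable, Sequence, Tuple

Profile = Callable[[float], float]


def pde_residual(p: Profile, x: float, h: float = 1e-2) -> float:
    """Central-difference estimate of d2p/dx2, which the assumed PDE sets to zero."""
    return (p(x + h) - 2.0 * p(x) + p(x - h)) / (h * h)


def physics_informed_loss(
    p: Profile,
    data: Sequence[Tuple[float, float]],
    collocation: Sequence[float],
    weight: float = 1.0,
) -> float:
    """Data misfit plus a weighted PDE penalty (the soft constraint)."""
    data_term = sum((p(x) - y) ** 2 for x, y in data) / len(data)
    pde_term = sum(pde_residual(p, x) ** 2 for x in collocation) / len(collocation)
    return data_term + weight * pde_term


# Hypothetical measurements: pressure at the two ends of a unit-length core.
data = [(0.0, 10.0), (1.0, 6.0)]
collocation = [0.1 * k for k in range(1, 10)]  # interior points where the PDE is checked


def linear(x: float) -> float:
    """Fits the data and satisfies the PDE (second derivative is zero)."""
    return 10.0 - 4.0 * x


def bent(x: float) -> float:
    """Fits the same two data points, but its second derivative is -6."""
    return 10.0 - 4.0 * x + 3.0 * x * (1.0 - x)


loss_linear = physics_informed_loss(linear, data, collocation)
loss_bent = physics_informed_loss(bent, data, collocation)
print(loss_linear, loss_bent)
```

Both candidates match the two measurements exactly, so a data-only loss cannot tell them apart. The PDE term is what separates the physically admissible profile from the merely plausible one, and `weight` sets how hard the fit is pulled toward it.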

The move that pays: a net standing in for the simulator

The most useful position on the spectrum is not a philosophical compromise, it is an engineering one, and it is older than the neural version. Reduced-order modelling has argued for decades that you can replace an expensive full-order solve with a cheap trained stand-in that preserves the input-output behaviour you care about, and skip the parts you do not [3]. The neural surrogate is that idea with a more flexible function class. You run the expensive physics simulator enough times to generate training pairs, fit a net to imitate it, and then answer thousands of downstream queries against the net at a fraction of the cost. The physics is still in the loop, upstream, where it trained the surrogate. What changed is that you stopped paying for it on every single query.
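The workflow can be sketched in a few lines, under loud assumptions. `toy_simulator` is a made-up smooth response, not a reservoir model, and a piecewise-linear interpolant stands in for the neural net so the example stays dependency-free. The function names, the input range, and the number of runs are all hypothetical.

```python
import bisect
import math
from typing import Callable, List


def toy_simulator(log10_perm: float) -> float:
    """Hypothetical stand-in for an expensive full-order solve (not real physics)."""
    return 1.0 - math.exp(-0.5 * log10_perm)


def build_surrogate(
    simulate: Callable[[float], float], lo: float, hi: float, n_runs: int
) -> Callable[[float], float]:
    # Step 1: pay for the expensive solver once per training point.
    xs: List[float] = [lo + (hi - lo) * i / (n_runs - 1) for i in range(n_runs)]
    ys: List[float] = [simulate(x) for x in xs]

    # Step 2: "fit" the cheap stand-in. Here that is linear interpolation
    # between neighbouring training pairs rather than a trained net.
    def surrogate(x: float) -> float:
        if not lo <= x <= hi:
            raise ValueError("query outside the sampled input range")
        i = min(max(bisect.bisect_right(xs, x), 1), n_runs - 1)
        x0, x1 = xs[i - 1], xs[i]
        t = (x - x0) / (x1 - x0)
        return ys[i - 1] + t * (ys[i] - ys[i - 1])

    return surrogate


# Step 3: answer many downstream queries against the stand-in only.
surrogate = build_surrogate(toy_simulator, 0.0, 3.0, 31)
queries = [3.0 * k / 999 for k in range(1000)]
worst = max(abs(surrogate(q) - toy_simulator(q)) for q in queries)
print(f"31 simulator runs, 1000 surrogate queries, worst abs error {worst:.2e}")
```

The simulator is called 31 times to build the stand-in and never again during the query loop. The comparison against `toy_simulator` at the end is only there to measure the surrogate's error, which a real study would do on a held-out set of runs.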

The numbers are what make this more than a slogan, and the upstream survey by Koroteev and Tekic collects them in one place [1]. A reservoir-engineering DNN surrogate is reported to run 200 to 2000 times faster than a conventional simulator. A production-optimisation model built on gradient boosting delivers a 100-times-plus speedup on well forecasting. A geological-assessment DNN reaches up to a millionfold speedup over manual mapping, and ML-driven drilling optimisation lands around 20 percent in cost savings. Those figures are not measured on the same task and should not be added up, but read together they say something clear about where the acceleration lives on the axis. The surrogate sits in the middle, and it is the middle that wins the day-to-day work.
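A per-query speedup is not the same as a whole-study speedup, because the training runs still have to be paid for. The back-of-envelope model below is a sketch under assumptions of our own: N training runs, Q downstream queries, a full-order run costing t, a per-query speedup s, and the cost of fitting the net itself ignored. The worked numbers (N = 500, Q = 10000) are illustrative, not taken from the survey.

```latex
\begin{align*}
C_{\text{direct}} &= Q\,t, \\
C_{\text{surrogate}} &= N\,t + \frac{Q\,t}{s}, \\
C_{\text{surrogate}} < C_{\text{direct}}
  &\iff Q > \frac{N\,s}{s-1}.
\end{align*}
% Illustrative values: N = 500, s = 200 gives break-even at Q > 500 * 200 / 199, about 502.5.
% At Q = 10000: C_surrogate = (500 + 50) t = 550 t, against C_direct = 10000 t.
```

For large s the break-even point is essentially Q = N: the surrogate pays for itself once the number of downstream queries exceeds the number of simulator runs spent training it. In the illustrative case the whole-study saving is roughly 18x rather than 200x, because the training runs dominate the surrogate's cost.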

Figure: The subsurface-modelling debate read on one axis instead of as two camps. The horizontal axis runs from a physics-first simulator on the left to a pure data-driven net on the right, and each family sits at the position that describes how much physics prior it keeps. The vertical axis, on a log scale, is the acceleration each family buys over its first-principles or hand-built baseline. The argument the chart makes is that the tallest useful bars are not at either pole: the orange band is a data-driven surrogate net standing in for an expensive reservoir simulator, running 200x to 2000x faster while still answering a physics-shaped question, which is the fastest practical win on the axis. A production-forecast gradient-boosting model adds 100x or more, a geological DNN mapping step reaches up to 1,000,000x over manual mapping, and at the far right a pure data-driven raster-log digitiser reaches an R-squared of 0.9891 with no physics prior at all, judged on fit rather than speed. Drag the latency-budget lever to shade out the families whose single run does not clear a given minute box. The four accelerations and the R-squared are sourced from the engagement archive and the Koroteev and Tekic survey; the minute-budget lever and the baseline run cost it divides are illustrative reader inputs, not measured latencies.

The exhibit puts the spectrum on a single axis. Physics-first on the left, pure data-driven on the right, and each family plotted at the position that says how much physics prior it keeps, against the acceleration it buys over its baseline on a log scale. The tallest useful bar is not at either pole. It is the orange one, the surrogate net standing in for the simulator, running 200x to 2000x faster while still answering a physics-shaped question. Drag the budget lever and watch which families a given latency box actually admits: tighten the box and the simulator falls out first, because a single full-order run cannot clear it, while the surrogate and the boosters stay in. That is the whole argument in one interaction. The geological mapping net towers even higher at a claimed millionfold, but it has walked most of the way to the data-driven pole to get there, and the pure digitiser at the far right is judged on fit, its 0.9891, not on speed at all.
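The lever's logic is simple enough to write down. The sketch below is our illustrative reading of it, not the exhibit's actual code: a family is admitted when a shared baseline run time, divided by that family's quoted speedup, fits inside the budget. The function name, the 600-minute baseline, and the 10-minute budget are hypothetical reader inputs. Applying one baseline to every family is a simplification, since the quoted speedups come from different tasks.

```python
from typing import Dict, List

# Speedups quoted in the article. The surrogate uses the low end of its
# 200x-2000x range; the geological figure is an "up to" upper bound.
# The raster-log digitiser is omitted because it is judged on fit, not speed.
SPEEDUP: Dict[str, float] = {
    "physics-first simulator": 1.0,
    "gradient-boosting forecast": 100.0,
    "DNN surrogate": 200.0,
    "geological DNN": 1_000_000.0,
}


def admitted(baseline_minutes: float, budget_minutes: float) -> List[str]:
    """Families whose single run clears the latency budget."""
    return [
        name
        for name, speedup in SPEEDUP.items()
        if baseline_minutes / speedup <= budget_minutes
    ]


# Hypothetical inputs: a 600-minute full-order run and a 10-minute box.
print(admitted(baseline_minutes=600.0, budget_minutes=10.0))
```

With these inputs the simulator needs 600 minutes and drops out, while the forecast model (6 minutes) and the surrogate (3 minutes) stay in.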

Choosing a position instead of a side

Read this way, the practical question stops being which camp is right and becomes which position on the axis the problem sits at. If you have trustworthy physics and can afford the solves, stay near the physics pole and keep the auditability. If the relation is genuinely unknown and you have labels, go to the data pole and stop apologising for it. If you have a simulator that is correct but too slow to use at the cadence the work demands, build the surrogate and take the 200x to 2000x, because that is where the physics you already trust gets to run at data-driven speed. The 20 percent drilling saving and the 100x forecasting speedup are the same lesson in other clothes: a learned model earning its place beside the physics rather than replacing it.
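Written as a rule of thumb, the paragraph above reduces to three questions. The function below is only a sketch of that reading. The name `choose_position` and its boolean inputs are hypothetical, and real projects will rarely answer any of the three with a clean yes or no.

```python
def choose_position(
    physics_trusted: bool, solves_affordable: bool, has_labels: bool
) -> str:
    """Map the three questions in the text to a position on the axis."""
    if physics_trusted and solves_affordable:
        # Trustworthy physics, affordable solves: keep the auditability.
        return "physics-first simulator"
    if physics_trusted:
        # Correct but too slow for the cadence the work demands.
        return "surrogate trained on the simulator"
    if has_labels:
        # Relation genuinely unknown, but examples exist.
        return "pure data-driven model"
    # Neither reliable physics nor labels: the text offers no recipe here.
    return "not covered by this rule of thumb"


print(choose_position(physics_trusted=True, solves_affordable=False, has_labels=True))
```

The order of the checks encodes the article's preference: trusted physics is used wherever it exists, either directly or upstream of a surrogate, and the data pole is the choice only when the governing relation is unknown.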

We land here because our own portfolio spans the axis rather than picking a spot on it. The digitiser is unapologetically data-driven because perception has no PDE. A reservoir study is physics-first because the flow equations are real and known. The surrogate is the bridge, and it is the piece we reach for most, because most of the time the physics is not the thing that is wrong, the physics is just the thing that is slow.

Limitations

The acceleration figures are a caution as much as a headline. They come from a survey that aggregates results across different tasks, datasets, and hardware [1], so the 200x-to-2000x surrogate range, the 100x-plus forecast speedup, and the millionfold geological figure are not comparable in the strict sense and must not be summed or averaged; each describes its own study. The exhibit plots them at illustrative x-positions on the physics-prior axis to make the ordering legible, and those positions are our editorial reading of how much physics each family keeps, not a measured coordinate. The latency-budget lever, and the single-run baseline it divides, are illustrative reader inputs, not logged latencies from any engagement. The R-squared of 0.9891 is a real validation figure from our digitiser, but it is a best-case regression read-out on a specific split and stands in here only as a marker that a pure data-driven model can be excellent without any physics prior, not as a claim about every digitisation run. A surrogate is only as good as the simulator that trained it and the sampling that covered the input space; outside that envelope its speed buys you a confident wrong answer, which is the failure mode the physics pole exists to avoid. And the spectrum framing itself is a lens, not a taxonomy: real systems often stitch several positions together, and the interesting engineering is usually in the seams.
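One cheap guard against the out-of-envelope failure is to refuse, or at least flag, any query that falls outside the range of inputs the surrogate was trained on. The sketch below uses a per-feature bounding box. The function names and the sample values (permeability in millidarcy, porosity as a fraction) are hypothetical. A bounding box is a necessary check, not a sufficient one: a query can sit inside the box and still be far from every training point.

```python
from typing import List, Sequence, Tuple

Box = List[Tuple[float, float]]


def bounding_box(train_inputs: Sequence[Sequence[float]]) -> Box:
    """Per-feature (min, max) over the inputs the surrogate was trained on."""
    columns = list(zip(*train_inputs))
    return [(min(col), max(col)) for col in columns]


def inside_envelope(x: Sequence[float], box: Box) -> bool:
    """True only if every feature of the query lies within its trained range."""
    return all(lo <= value <= hi for value, (lo, hi) in zip(x, box))


# Hypothetical training inputs: (permeability in mD, porosity fraction).
train = [(1.0, 0.10), (50.0, 0.30), (10.0, 0.20)]
box = bounding_box(train)

print(inside_envelope((20.0, 0.25), box))   # within both trained ranges
print(inside_envelope((500.0, 0.20), box))  # permeability beyond the trained range
```

A query that fails the check is a candidate for a full-order solve rather than a surrogate answer, which keeps the physics pole available for exactly the cases where the stand-in cannot be trusted.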

References

[1] Koroteev, D., and Tekic, Z. Artificial intelligence in oil and gas upstream: Trends, challenges, and scenarios for the future. Energy and AI 3 (2021), 100041. The upstream survey behind the acceleration figures: reservoir DNN surrogates 200x to 2000x over a conventional simulator, production optimisation 100x-plus via gradient boosting, geological assessment DNNs up to a millionfold over manual mapping, and about 20 percent drilling cost savings from ML optimisation. https://www.sciencedirect.com/science/article/pii/S2666546820300410

[2] Raissi, M., Perdikaris, P., and Karniadakis, G. E. Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations. Journal of Computational Physics 378 (2019), pp. 686-707. The formal statement of embedding a governing PDE as a soft constraint in the training loss. https://www.sciencedirect.com/science/article/pii/S0021999118307125

[3] Benner, P., Gugercin, S., and Willcox, K. A survey of projection-based model reduction methods for parametric dynamical systems. SIAM Review 57(4) (2015), pp. 483-531. The reduced-order-model lineage that frames the surrogate idea before the neural version: replace an expensive full-order solve with a cheap stand-in that preserves the behaviour that matters. https://epubs.siam.org/doi/10.1137/130932715

[4] Karniadakis, G. E., Kevrekidis, I. G., Lu, L., Perdikaris, P., Wang, S., and Yang, L. Physics-informed machine learning. Nature Reviews Physics 3 (2021), pp. 422-440. The review that names the spectrum from strong physical priors to purely data-driven models and argues the choice is set by how much reliable physics and how much data you hold. https://www.nature.com/articles/s42254-021-00314-5
