Why the archive, not the model, decides where you run
Most write-ups of machine learning for well logs start with the network and end with a metric. This one starts somewhere less glamorous and more binding: the filing cabinet. Before a model can read a subsurface log it needs the log, and in upstream oil and gas the authoritative copy of that log usually does not sit in a private cloud bucket you control. It sits in a public regulator's archive, held under rules that say where the bytes are allowed to live and who is allowed to move them. That single fact reorders the whole engineering problem. The question stops being how good the interpretation is and becomes where the interpretation is legally allowed to happen.
We wrote this as a field survey rather than an architecture note on purpose. Our VeerNet work covers how to turn a scanned raster log into curves; this piece is the companion question of where you are permitted to do that reading at all, because in practice the second question decides the deployment shape of the first.
What a sovereign subsurface archive actually looks like
Take one concrete example. The Texas Railroad Commission, the state oil-and-gas regulator, holds a well-log archive that is public and enormous: 136,771 scanned TIF images and 7,781 LAS files that anyone can request. These are not a private operator's internal dataset. They are a sovereign, publicly held record, curated and released by a government body, and their governance follows from that. The regulator decides the terms of access, the format of release, and the constraints on redistribution, and a project that wants to interpret them at scale inherits those terms whether it likes them or not.
The composition matters as much as the count. The bulk of the archive, the 136,771 TIF files, is raster: pixels of a scanned strip-log, not machine-readable measurements. Only the smaller 7,781 LAS set is already digital curve data. That skew is the reason automated raster interpretation exists as a discipline, and it is also why the archive is so large in bytes: images are heavy. Moving a corpus of that size across a border is not a trivial upload, and where a residency rule forbids the move outright, it is not an upload at all.
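The skew described above is easy to put a number on. The short sketch below computes the raster share by file count from the two figures quoted in the text. It says nothing about the share by bytes, since file sizes are not given here; the function and constant names are ours, chosen for illustration.

```python
# Archive counts quoted in the text for the state regulator's well-log archive.
TIF_RASTER_FILES = 136_771   # scanned strip-log images (pixels, not curves)
LAS_DIGITAL_FILES = 7_781    # already-digital curve data


def raster_share(raster: int, digital: int) -> float:
    """Return the fraction of the archive, by file count, that is raster."""
    total = raster + digital
    if total == 0:
        raise ValueError("archive is empty")
    return raster / total


total_files = TIF_RASTER_FILES + LAS_DIGITAL_FILES
share = raster_share(TIF_RASTER_FILES, LAS_DIGITAL_FILES)
print(f"{total_files} files, {share:.1%} raster by count")
```

By count, roughly nineteen files in twenty are images that need interpretation before they yield a curve.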
Sovereignty is a serving constraint, not a storage footnote
Data sovereignty is the principle that data is subject to the laws of the jurisdiction where it is collected or held. Data residency is its operational edge: the requirement that specific data physically remain within, or not leave, a defined territory. The legal literature has mapped these regimes carefully. Chander and Le catalogue the rise of data-localisation rules and the national-security and regulatory logic behind them [2]. Bygrave sets out the cross-border transfer restrictions that decide when regulated data may move between jurisdictions and when it may not [3]. Hoeren frames how those localisation and residency rules bear directly on where a regulated dataset may be processed, not only where it may be stored [1].
The word "processed" is the one that reaches the model. If a rule says the data may not leave a territory, then any computation over that data must occur inside the territory. Inference is computation over the data. So the residency rule on a national subsurface archive is, in effect, a rule about where the interpretation model may run. It is not a preference expressed by a cautious compliance team. It is a boundary condition on the system architecture, fixed before the first training run.
The exhibit above makes the coupling legible. On the left is the sovereign archive drawn to scale, the 136,771 TIF rasters and the 7,781 LAS files. Drag the lever to set how much of that corpus a deployment must actually read in place, and the right-hand map of serving loci responds: as the required share rises from an open sample toward the whole archive, the legally reachable loci collapse from cross-border cloud, through in-region cloud, down to in-jurisdiction hosting and finally on-premises, air-gapped serving. The archive counts, the cloud GPU economics, the build footprint, and the market figure on the canvas are sourced from the engagement record; the per-locus residency ceilings are an illustrative governance model and are flagged as such, because the exact ceiling for any given archive is a legal question, not a number we measured. The shape of the argument, though, is the sourced part: the more of a sovereign archive you must ingest, the closer to it the model must run.
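The coupling between archive scope and reachable serving locus can be sketched in a few lines. This is a minimal illustration, not the exhibit's code: the four locus names come from the text, but the ceiling values are hypothetical placeholders picked only to reproduce the collapsing behaviour, and `ServingLocus` and `reachable_loci` are names introduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServingLocus:
    name: str
    # HYPOTHETICAL ceiling: largest share of the archive that may be read
    # from this locus. Real ceilings are a legal question, not a constant.
    max_share: float


# Ordered from farthest from the archive to nearest. Values are placeholders.
LOCI = (
    ServingLocus("cross-border cloud", 0.05),
    ServingLocus("in-region cloud", 0.25),
    ServingLocus("in-jurisdiction hosting", 0.75),
    ServingLocus("on-premises, air-gapped", 1.00),
)


def reachable_loci(required_share: float, loci=LOCI) -> list[str]:
    """Return the loci whose ceiling covers the share that must be read."""
    if not 0.0 <= required_share <= 1.0:
        raise ValueError("required_share must be between 0 and 1")
    return [locus.name for locus in loci if required_share <= locus.max_share]


for share in (0.01, 0.20, 0.50, 1.00):
    print(f"{share:.0%} of archive -> {reachable_loci(share)}")
```

Whatever the real ceilings turn out to be, the structure is the same: the set of permitted loci shrinks monotonically as the required share grows.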
The compute economics do not rescue you
The reflex answer to a heavy dataset is cloud serverless GPU: pay per month, scale to zero, let someone else own the hardware. In our own build the rentable tiers ran 750 EUR per month for a high-end card and 1800 EUR per month for an advanced one, and for a served model that reads scans one at a time that rent amortises to a small per-log cost. If the only variable were money, cross-border serverless would win almost every time.
But residency is not a cost you can pay down. A cross-border cloud region can be the cheapest place in the world to run the model and still be the one place you are not allowed to run it, because getting the data there means moving a sovereign archive out of its jurisdiction. When that is the constraint, the compute price of the forbidden option is irrelevant. The relevant question becomes the cost of the compliant option, which is standing up the interpretation capability inside the jurisdiction. Koroteev and Tekic, surveying upstream artificial intelligence, are explicit that data access and infrastructure, not algorithmic novelty, are the recurring bottlenecks in deploying these systems in the field [4]. The residency wall is the sharpest version of that bottleneck.
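The two steps of this argument, amortising rent to a per-log cost and then discarding options the residency rule forbids, can be sketched as follows. The 750 and 1800 EUR monthly figures are from the text; the monthly log volume, the in-jurisdiction monthly cost, and the compliance flags are hypothetical placeholders, and all names are ours.

```python
from typing import NamedTuple, Optional, Sequence


class Option(NamedTuple):
    name: str
    monthly_cost_eur: float
    compliant: bool  # ASSUMPTION: set by legal review, not computed here


def per_log_cost(monthly_cost_eur: float, logs_per_month: int) -> float:
    """Amortise a flat monthly rent over the logs read in that month."""
    if logs_per_month <= 0:
        raise ValueError("logs_per_month must be positive")
    return monthly_cost_eur / logs_per_month


def cheapest_compliant(options: Sequence[Option]) -> Optional[Option]:
    """Filter by compliance first; only then compare price."""
    allowed = [o for o in options if o.compliant]
    return min(allowed, key=lambda o: o.monthly_cost_eur) if allowed else None


LOGS_PER_MONTH = 10_000  # HYPOTHETICAL throughput, for illustration only

options = [
    Option("cross-border serverless, high-end card", 750.0, False),
    Option("cross-border serverless, advanced card", 1800.0, False),
    # HYPOTHETICAL monthly figure for an in-jurisdiction deployment.
    Option("in-jurisdiction hosting", 5000.0, True),
]

choice = cheapest_compliant(options)
for o in options:
    cost = per_log_cost(o.monthly_cost_eur, LOGS_PER_MONTH)
    print(f"{o.name}: {cost:.3f} EUR/log, compliant={o.compliant}")
print("selected:", choice.name if choice else "none")
```

Setting the cross-border prices to zero leaves the selection unchanged, which is the point: price is only compared among options that survive the residency filter.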
What building in-jurisdiction actually costs
Because the compliant path is to bring the model to the data rather than the data to the model, the honest cost of an interpretation capability under sovereignty is a delivery cost, not a rental line. In our engagement the build-in-jurisdiction footprint was 4 to 6 engineers over 16 to 32 weeks, depending on whether the track was standard or accelerated. That is the real number a data-sovereign deployment carries: a team, a schedule, and a delivery that lands the trained model on infrastructure sitting inside the same territory as the archive it reads.
This is not simply cloud being more expensive than it looked. Training remains a one-off; the recurring question is where the served model lives, and under a residency rule that answer is fixed to the jurisdiction regardless of where compute is cheapest. The build cost is the price of admission to a market a cheaper, non-compliant deployment cannot legally enter at all.
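The delivery footprint quoted above converts to effort bounds with one multiplication. The sketch assumes every engineer is engaged full-time for the whole duration, which the text does not state, and it does not assume which team size goes with which track, so it reports only the outer bounds.

```python
def engineer_week_bounds(
    engineers: tuple[int, int], weeks: tuple[int, int]
) -> tuple[int, int]:
    """Return (min, max) engineer-weeks for a team-size and duration range.

    ASSUMPTION: full-time engagement for the entire duration.
    """
    low_team, high_team = engineers
    low_weeks, high_weeks = weeks
    if low_team > high_team or low_weeks > high_weeks:
        raise ValueError("each range must be given as (low, high)")
    return low_team * low_weeks, high_team * high_weeks


# Figures from the text: 4 to 6 engineers over 16 to 32 weeks.
low, high = engineer_week_bounds((4, 6), (16, 32))
print(f"{low} to {high} engineer-weeks")
```

Under that assumption the build sits between 64 and 192 engineer-weeks, a one-off delivery cost rather than a recurring rental line.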
The market is gated by geography, not by accuracy
The reason any of this matters commercially is a market figure. The serviceable market for oil-and-gas technology of the kind that reads and interprets subsurface data is on the order of $6.7B, and access to it is distributed not by who has the best segmentation metric but by who can serve their model where the data is legally allowed to sit. A slightly weaker model that runs in-jurisdiction can serve an archive that a stronger model cannot touch if the stronger model's only deployment is cross-border. Sovereignty turns a technical race into a jurisdictional one.
This is the finding worth carrying out of the survey. National subsurface archives are governed by data-sovereignty rules that dictate where models may run, and regulator-owned raster archives, being both sovereign and heavy, push automated interpretation firmly toward on-premises and in-jurisdiction serving. The engineering choice to build for on-premises deployment is not a taste for old-fashioned infrastructure. It is the direct consequence of where the primary record of the subsurface is legally required to stay.
Limitations
This is a field survey and inherits a survey's limits. The one archive we quantify in detail, the 136,771 TIF and 7,781 LAS files of a single state regulator, is a real and representative example of a sovereign, publicly held subsurface record, but it is one archive in one jurisdiction; the counts and the residency posture of other national repositories differ, and this piece does not enumerate them. The residency ceilings in the interactive exhibit are an illustrative governance model used to show the coupling between archive scope and reachable serving locus; they are not a legal schedule for any specific archive, and the exact rule for any given dataset is a question for counsel in that jurisdiction, not a value we measured. The compute figures, 750 and 1800 EUR per month, and the delivery footprint, 4 to 6 engineers over 16 to 32 weeks, are the real numbers from our own engagement and are used as a worked reference for the cost of the compliant path, not as an industry benchmark. The $6.7B serviceable market is a sourced sizing of the addressable technology segment and is quoted as framing, not as a claim about capture. Finally, this piece deliberately stops at the deployment question and does not re-tell how the interpretation itself works; that is the subject of our VeerNet architecture work, which this survey complements rather than repeats.
References
[1] Hoeren, T. Big Data and Data Quality. In Big Data in Context (SpringerBriefs in Law), Springer (2018). Sets out how data-localisation and residency rules constrain where regulated datasets may be processed and stored. https://doi.org/10.1007/978-3-319-62461-7
[2] Chander, A., and Le, U. P. Data Nationalism. Emory Law Journal, 64(3), 677-739 (2015). A survey of data-localisation regimes and the legal logic that requires certain data to remain within a jurisdiction. https://scholarlycommons.law.emory.edu/elj/vol64/iss3/2/
[3] Bygrave, L. A. Data Privacy Law: An International Perspective. Oxford University Press (2014). Establishes the cross-border transfer restrictions that govern movement of regulated data between jurisdictions. https://global.oup.com/academic/product/data-privacy-law-9780199675555
[4] Koroteev, D., and Tekic, Z. Artificial intelligence in oil and gas upstream: Trends, challenges, and scenarios for the future. Energy and AI, 3, 100041 (2021). Documents the data-access and infrastructure constraints that shape where upstream machine learning can be deployed. https://doi.org/10.1016/j.egyai.2020.100041