Abstract
The best subsurface model for a basin is rarely the one any single operator can train, because the data that would build it is partitioned across companies that compete for the same acreage and will not hand each other their logs. Federated and privacy-preserving learning is the branch of the field built for that partition: it moves model updates instead of data, so each participant trains on its own archive behind its own firewall and contributes only its parameter updates. This survey reads that literature in the cross-silo setting a handful of operators inhabit. We separate the two operations the field keeps conflating, transferring raw data and transferring model updates, trace federated averaging to the work that named it, and set out the two guarantees that make sharing updates safe: secure aggregation, which hides any single participant's update, and differential privacy, which bounds what the shared model can leak about any one scan. We read the synthesis against a real single-operator baseline from our raster well-log work: an archive of 136,771 TIF images and 7,781 LAS files, dense-prediction models trained at a physical batch size of one under a hard GPU-memory ceiling on rented cards priced at 750 and 1,800 EUR per month, and a segmentation target whose goodness-of-fit peaks at an R-squared of 0.9891 on the cleanest curve. The central finding is that federation changes the object an operator optimises over, from the archive it owns to the union of every participant's archive, while every raw scan stays resident, and that for a small, memory-bound, single-GPU project this is the cheapest way to enlarge the effective training corpus without a data-sharing contract no competitor will sign.
Background and related work
Federated learning began as an answer to a phone problem and turned out to answer an oilfield one. McMahan and colleagues introduced federated averaging to train across millions of mobile devices without collecting their data centrally: each device computes an update on its local examples, a server averages those updates into a new global model, and the cycle repeats, so the training data never leaves the device that holds it [1]. The setting they described first was cross-device, but the same protocol covers what a later survey named the cross-silo case, a few institutions each holding a large, stable, private dataset [7]. A federation of operators is cross-silo almost by definition: few participants, each with a substantial archive, each with a legal and competitive reason to keep it.
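The round described above can be sketched in a few lines. This is a minimal toy, assuming a one-dimensional linear model and two hypothetical silos; the function names (`local_update`, `federated_average`, `fedavg_round`) are ours for illustration and are not taken from [1]. Only the weights returned by `local_update` would cross the network; the `examples` lists never leave the function that owns them.

```python
"""Toy federated averaging on a 1-D linear model y = w*x + b (illustrative only)."""
from typing import List, Sequence, Tuple

Weights = List[float]
Example = Tuple[float, float]  # (x, y)


def local_update(weights: Sequence[float], examples: Sequence[Example],
                 lr: float = 0.1) -> Weights:
    """One pass of per-example SGD on squared error, run on the client's own data."""
    w, b = weights
    for x, y in examples:
        err = (w * x + b) - y
        w -= lr * err * x
        b -= lr * err
    return [w, b]


def federated_average(client_weights: Sequence[Weights],
                      client_sizes: Sequence[int]) -> Weights:
    """Server step: average client weights, weighted by local dataset size."""
    total = sum(client_sizes)
    dim = len(client_weights[0])
    return [
        sum(w[i] * n for w, n in zip(client_weights, client_sizes)) / total
        for i in range(dim)
    ]


def fedavg_round(global_weights: Sequence[float],
                 client_datasets: Sequence[Sequence[Example]],
                 lr: float = 0.1) -> Weights:
    """One round: every client starts from the global model, trains locally,
    and the server averages the resulting weights into a new global model."""
    updates = [local_update(global_weights, data, lr) for data in client_datasets]
    sizes = [len(data) for data in client_datasets]
    return federated_average(updates, sizes)


if __name__ == "__main__":
    # Two hypothetical silos whose data both follow y = 2x.
    silos = [[(1.0, 2.0), (2.0, 4.0)], [(3.0, 6.0)]]
    model = [0.0, 0.0]
    for _ in range(2000):
        model = fedavg_round(model, silos)
    print(model)
```

Weighting by dataset size follows the averaging rule in [1]; a cross-silo federation might reasonably choose equal weights per operator instead, which is a governance decision rather than a technical one.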
Two problems sit between that clean picture and a system an operator would trust. The first is communication: sending a full set of model weights every round is expensive, and Konecny and colleagues cut the bytes with structured and sketched updates so a round costs a fraction of the naive transfer [2]. The second, the one that matters for competitors, is that sharing an update is not the same as sharing nothing. Zhu and colleagues demonstrated deep leakage from gradients: given the gradients a client sends, an attacker can under some conditions reconstruct the training inputs that produced them [6]. An operator that shipped raw gradients in the clear would be shipping a reconstructable version of its logs.
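To make the communication saving concrete, here is a minimal sketch in the spirit of the subsampled updates in [2], not a reproduction of that paper's scheme. A client keeps each entry of its update with probability `keep_fraction` and rescales the survivors so the expected value of what it sends equals the full update. The function names and the dictionary encoding are assumptions made for illustration.

```python
"""Randomly subsampled update: send a fraction of the entries, rescaled to stay unbiased."""
import random
from typing import Dict, List, Sequence


def subsample_update(update: Sequence[float], keep_fraction: float,
                     seed: int) -> Dict[int, float]:
    """Keep each coordinate with probability keep_fraction and scale it by
    1 / keep_fraction so the expectation of the sparse update equals the original."""
    rng = random.Random(seed)
    scale = 1.0 / keep_fraction
    return {i: v * scale for i, v in enumerate(update) if rng.random() < keep_fraction}


def densify(sparse: Dict[int, float], dim: int) -> List[float]:
    """Server side: expand the sparse message back to a full-length vector."""
    dense = [0.0] * dim
    for i, v in sparse.items():
        dense[i] = v
    return dense


if __name__ == "__main__":
    full = [1.0] * 1000
    message = subsample_update(full, keep_fraction=0.25, seed=1)
    print(len(message), "of", len(full), "entries sent")
```

Any single subsampled message is a noisy version of the true update; the scheme relies on the server's average over clients and rounds to wash that noise out, which is easier with many clients than with the handful a cross-silo federation has.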
The two guarantees that close that gap are distinct. Secure aggregation, from Bonawitz and colleagues, is a cryptographic protocol that lets the server compute the sum of all clients' updates without seeing any individual client's update, so each operator's contribution is masked and only the aggregate is revealed [3]. Differential privacy, formalised by Dwork and Roth, is an orthogonal promise about the output rather than the channel: it bounds, with a tunable budget, how much any single record can change what the model reveals, and its composition theorems track cumulative leakage across rounds [4]. Abadi and colleagues made it practical for deep learning by clipping each per-example gradient and adding calibrated noise, accounting for the budget as training proceeds [5]. Secure aggregation hides who contributed what; differential privacy limits what the finished model discloses about any contributor. A serious cross-silo deployment wants both, because they defend against different adversaries.
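The masking idea behind secure aggregation can be illustrated with a toy. The sketch below assumes every pair of clients already shares a secret seed and that updates have been quantised to integers; the protocol in [3] establishes those seeds through key agreement and adds secret sharing so the sum survives client dropouts, and both are omitted here. The names `pairwise_mask`, `mask_update`, and `aggregate` are hypothetical.

```python
"""Toy pairwise-mask aggregation: the server sees masked vectors but recovers the exact sum."""
import random
from typing import List, Sequence

MODULUS = 2 ** 32  # all arithmetic is done modulo this value


def pairwise_mask(i: int, j: int, dim: int, round_seed: int) -> List[int]:
    """Mask shared by clients i and j. Deriving it from a public string is a
    stand-in for a secret seed the pair would agree on privately."""
    rng = random.Random(f"{round_seed}:{min(i, j)}:{max(i, j)}")
    return [rng.randrange(MODULUS) for _ in range(dim)]


def mask_update(client_id: int, update: Sequence[int],
                all_ids: Sequence[int], round_seed: int) -> List[int]:
    """Add the mask shared with every higher-numbered peer and subtract the one
    shared with every lower-numbered peer, so each mask cancels in the sum."""
    masked = [v % MODULUS for v in update]
    for other in all_ids:
        if other == client_id:
            continue
        mask = pairwise_mask(client_id, other, len(update), round_seed)
        sign = 1 if client_id < other else -1
        masked = [(m + sign * r) % MODULUS for m, r in zip(masked, mask)]
    return masked


def aggregate(masked_updates: Sequence[Sequence[int]]) -> List[int]:
    """Server step: sum the masked vectors; the pairwise masks cancel."""
    dim = len(masked_updates[0])
    return [sum(u[k] for u in masked_updates) % MODULUS for k in range(dim)]


if __name__ == "__main__":
    ids = [0, 1, 2]
    updates = {0: [1, 2, 3], 1: [10, 20, 30], 2: [100, 200, 300]}
    masked = [mask_update(i, updates[i], ids, round_seed=7) for i in ids]
    print(aggregate(masked))  # the sum of the three updates, modulo MODULUS
```

Each masked vector on its own looks like uniform noise to the server, which is the property the paragraph above describes; the sum is exact because every mask is added once and subtracted once.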
The last piece keeps a federation honest about its own data. Operators do not hold identically distributed logs: basins, tool vintages, and digitisation quality all differ, so the local datasets are statistically heterogeneous, and naive averaging can pull the shared model apart as each client optimises toward its own distribution. Li and colleagues addressed this with FedProx, which adds a proximal term to each client's local objective so local training does not stray too far from the current global model, stabilising convergence when the data is non-identically distributed [8]. This is the failure mode a subsurface federation is most exposed to, and the reason plain federated averaging is a starting point rather than a finished recipe.
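Written out, the proximal correction is a one-term change to what each client minimises. In the notation of [8], with F_k the local objective of client k, w^t the global model at round t, and mu a non-negative coefficient the federation chooses, each client approximately minimises:

```latex
\[
  h_k(w;\, w^t) \;=\; F_k(w) \;+\; \frac{\mu}{2}\,\lVert w - w^t \rVert^2 .
\]
```

Setting mu to zero recovers the unpenalised local objective of plain federated averaging, and larger values hold each client closer to the shared model at the price of slower local progress.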
Method
This is a structured reading of the published literature, not a new experiment, and the scope was kept narrow so the claims stay defensible. We organised the field around the distinction the survey turns on, the transfer of raw data against the transfer of model updates, and read the federated-averaging protocol that defines the second [1] with the communication-efficiency work that makes it affordable [2]. We read the two privacy guarantees as separate layers on that protocol: secure aggregation as the channel-level defence [3], and differential privacy as the output-level bound [4] [5], with the gradient-inversion result as the attack that motivates both [6]. We read the cross-silo framing and the heterogeneity problem as the setting an operator federation occupies [7] [8]. For each method we extracted what crosses the network, what an adversary learns from it, and the regime the authors claim it for.
To keep the survey anchored to a real task, we read it against one single-operator baseline from our engagement archive: the raster well-log corpus behind our digitiser, VeerNet, an archive of 136,771 TIF images and 7,781 LAS files held by one operator, with dense-prediction segmentation trained on synthetic derivations at a physical batch size of one, memory-bound by the variable and very wide log images, on rented GPUs priced at 750 and 1,800 EUR per month. The metric we quote, a peak R-squared of 0.9891 on the cleanest curve, is real and used as a worked baseline for what a lone archive reaches on its own, not a new benchmark. The interactive exhibit below is on the same footing: the single-operator counts, the batch size of one, the GPU tiers, and the R-squared ceiling are sourced, while the number of operators and the shape of the corpus-to-accuracy climb are flagged illustrative federation inputs.
Moving data versus moving the model
The whole survey turns on a distinction a data-sharing agreement obscures. The traditional way to build a basin model across operators is to move the data: negotiate access, copy logs into a shared store, and train centrally. That path is not primarily technical; it is legal and competitive, and where two operators bidding on adjacent acreage are direct rivals, the agreement is the thing that does not get signed. Federated learning moves the model instead. Each operator trains locally on data that never moves, and only the parameter updates cross the network to be averaged into a shared model [1]. The object being optimised is no longer any single archive; it is the union of all of them, reached without any of them being copied.
That inversion makes federation attractive precisely where central pooling is impossible. A single operator holding 136,771 TIF and 7,781 LAS scans has a large archive by any absolute measure, but for a dense-prediction model the effective corpus is bounded by what that one operator has seen, which is one basin's worth of tool vintages and formations. The union across several operators is a strictly larger and more varied corpus, reached without a copy leaving anyone's firewall. The cost is a different one: the participants must agree on an architecture and a training protocol, and they must trust that the updates they exchange do not leak their data back out, which is what the privacy layer is for.
Why sharing an update is not sharing nothing
The naive reading, that because raw data stays put the scheme is automatically private, is wrong. Zhu and colleagues showed that gradients carry enough information to reconstruct their inputs under some conditions, so an operator that shipped raw updates would be shipping a reconstructable shadow of its logs to whoever aggregates them [6]. That result is why the two guarantees exist rather than being optional polish. Secure aggregation answers the channel question: the aggregator learns the sum of every operator's update while learning nothing about any single one, so even a curious server and the other operators see only the blend [3]. Differential privacy answers a different question, about the model rather than the message: clipping and adding noise during training bounds how much the shared model can reveal about any one scan, tracked as a budget across rounds [4] [5]. For competing operators the pairing is the point. Secure aggregation means no participant sees a rival's contribution; differential privacy bounds how much anyone interrogating the finished model can recover about a rival's data.
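The clip-and-noise step can be sketched as follows. This is a minimal illustration of the gradient treatment described in [5], assuming plain Python lists for gradients; it deliberately leaves out the privacy accounting that turns a noise level into a budget, and the function names are ours.

```python
"""Clip each per-example gradient to a fixed L2 norm, sum, add Gaussian noise, average."""
import math
import random
from typing import List, Sequence


def clip(grad: Sequence[float], max_norm: float) -> List[float]:
    """Scale the gradient down so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    factor = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [g * factor for g in grad]


def private_mean_gradient(per_example_grads: Sequence[Sequence[float]],
                          max_norm: float, noise_multiplier: float,
                          rng: random.Random) -> List[float]:
    """Sum the clipped gradients, add noise with standard deviation
    noise_multiplier * max_norm per coordinate, then divide by the count."""
    dim = len(per_example_grads[0])
    total = [0.0] * dim
    for grad in per_example_grads:
        for k, value in enumerate(clip(grad, max_norm)):
            total[k] += value
    sigma = noise_multiplier * max_norm
    noisy = [t + rng.gauss(0.0, sigma) for t in total]
    return [v / len(per_example_grads) for v in noisy]


if __name__ == "__main__":
    grads = [[3.0, 4.0], [0.3, 0.4]]  # hypothetical per-example gradients
    print(private_mean_gradient(grads, max_norm=1.0, noise_multiplier=1.1,
                                rng=random.Random(0)))
```

Clipping caps how far any one example can move the model, and the noise is scaled to that cap, which is what lets the guarantee be stated per record. At a physical batch size of one the batch gradient already is the per-example gradient, so the clipping step needs no extra per-example bookkeeping.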
The exhibit sets the two facts side by side. On the left, each operator's vault stays sealed, its 136,771 TIF and 7,781 LAS scans resident on-premises, and only a thin gradient update crosses into the aggregator. On the right, the payoff: one operator's effective corpus is a short bar, the federated union is the tall one, and the shared-model goodness-of-fit climbs toward the sourced ceiling of R-squared 0.9891 that a lone small archive plateaus below. The single-operator counts, the batch size of one, the GPU tiers, and the ceiling are sourced from the engagement; the federation width and the shape of the climb are flagged illustrative federation inputs, drawn to make the trade legible rather than to assert a multi-operator benchmark our archive does not contain.
What federation changes for a memory-bound run
Our baseline is a small-data, memory-bound project by construction, which is the regime where enlarging the effective corpus matters most. The models train at a physical batch size of one because a single log image nearly fills the rented card, and the card is a commodity at 750 or 1,800 EUR per month rather than a cluster. Under those constraints an operator cannot buy its way to a bigger model, and it cannot easily buy more data because the data it wants belongs to a competitor. Federation is the one lever that enlarges the corpus without enlarging the hardware bill or the legal exposure: each additional operator contributes gradient signal from its own archive, and the shared model trains against the union while every participant keeps paying only for its own single card.
The honest caveat is heterogeneity. Operators in a real federation hold logs from different basins and tool eras, so their distributions differ, and plain federated averaging can stall when clients pull the shared model in different directions [7]. That is the reason the proximal correction exists: FedProx tethers each operator's local training to the current global model, which stabilises convergence and gives a heterogeneous federation a path to a shared model better than any single operator's rather than a muddle worse than all of them [8]. A subsurface federation should expect to need it.
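The tethering effect is easy to see on a toy problem. The sketch below assumes a quadratic local loss whose optimum sits far from the global model, standing in for an operator whose basin differs from the rest; the local gradient step adds mu times the distance from the global model, which is the gradient of the proximal penalty in [8]. The names `fedprox_local_train` and `quadratic_grad`, and all numbers, are hypothetical.

```python
"""Local training with a proximal pull toward the global model (FedProx-style step)."""
from typing import Callable, List, Sequence

GradFn = Callable[[Sequence[float]], List[float]]


def quadratic_grad(center: Sequence[float]) -> GradFn:
    """Gradient of the toy local loss 0.5 * ||w - center||^2."""
    return lambda w: [wi - ci for wi, ci in zip(w, center)]


def fedprox_local_train(w_global: Sequence[float], grad_fn: GradFn, mu: float,
                        lr: float = 0.1, steps: int = 500) -> List[float]:
    """Gradient descent on local loss plus (mu / 2) * ||w - w_global||^2."""
    w = list(w_global)
    for _ in range(steps):
        g = grad_fn(w)
        w = [wi - lr * (gi + mu * (wi - w0)) for wi, gi, w0 in zip(w, g, w_global)]
    return w


if __name__ == "__main__":
    global_model = [0.0]
    local_optimum = quadratic_grad([10.0])  # this silo's data prefers w = 10
    print(fedprox_local_train(global_model, local_optimum, mu=0.0))  # drifts to the local optimum
    print(fedprox_local_train(global_model, local_optimum, mu=1.0))  # held partway back
```

With mu at zero the client runs all the way to its own optimum; with mu at one it settles halfway between that optimum and the global model. Choosing mu is therefore a trade between fitting each archive and keeping the federation together, and on real log data it would have to be tuned rather than read off a toy.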
Discussion
Read together, the literature describes a system an operator federation could adopt, in three layers. The base is federated averaging: train locally, average the updates, never move the data [1], made affordable by communication-efficient updates [2]. The privacy layer is two guarantees against two adversaries: secure aggregation hides any single participant's contribution [3], and differential privacy bounds what the finished model can leak about any record [4] [5], both necessary because raw gradients are reconstructable [6]. The robustness layer is the heterogeneity correction that keeps dissimilar archives converging to a shared model worth having [7] [8]. Each answers a different objection an operator's legal and technical teams would raise, and the stack is only as adoptable as its weakest layer.
Where our own work sits is worth marking, because it is the line between this survey and the applied architecture writing it complements. Our VeerNet baseline is a single-operator, batch-of-one run that reaches R-squared 0.9891 on its own archive, and this survey is not a report on federating it. It is a reading of how the public field lets competitors train together without trusting each other with their data, read against that baseline to show where the value would land: the union of several operators' archives is a larger and more varied corpus than any one holds, reached without a data-sharing agreement, which for a memory-bound project is the cheapest enlargement available. The survey also marks the limit of the free lunch. Federation buys a bigger effective corpus only if the privacy layer is real and the heterogeneity correction is applied, and it does not buy agreement between operators on architecture or on what a good curve looks like, which remains a negotiation no protocol settles.
Limitations
This is a survey and inherits a survey's limits. It synthesises what the published literature reports and does not implement or measure any federated system; the numbers it quotes are the real metrics of a single-operator baseline from one engagement and one architecture, an archive of 136,771 TIF images and 7,781 LAS files and a peak single-curve R-squared of 0.9891 at batch size one on rented GPUs at 750 and 1,800 EUR per month, used as a worked baseline rather than a federated benchmark. We did not run a multi-operator federation on this task, so the corpus-enlargement argument is a prediction from the cited mechanism, not a result we recorded. The exhibit's federation width and the shape of its corpus-to-accuracy climb are illustrative federation inputs, flagged as such on the canvas; only the single-operator counts, the batch size, the GPU tiers, and the R-squared ceiling are sourced. The survey scopes itself to the cross-silo setting a handful of operators occupy, not the cross-device setting of many small participants, and it covers the literature only up to the close of the quarter in which it was written. A reader should take this as a map of when moving model updates rather than data is the right move for competitors who share a basin but not a database, not as a substitute for standing up a federation and measuring it.
References
[1] McMahan, H. B., Moore, E., Ramage, D., Hampson, S., and Aguera y Arcas, B. Communication-Efficient Learning of Deep Networks from Decentralized Data. AISTATS (2017). Introduces federated averaging: clients train locally and a server averages their updates, with raw data never leaving the client. https://arxiv.org/abs/1602.05629
[2] Konecny, J., McMahan, H. B., Yu, F. X., Richtarik, P., Suresh, A. T., and Bacon, D. Federated Learning: Strategies for Improving Communication Efficiency. NeurIPS Workshop (2016). Structured and sketched updates that cut the bytes each round of federated training must send. https://arxiv.org/abs/1610.05492
[3] Bonawitz, K., Ivanov, V., Kreuter, B., Marcedone, A., McMahan, H. B., et al. Practical Secure Aggregation for Privacy-Preserving Machine Learning. ACM CCS (2017). A protocol that lets a server sum client updates without seeing any individual client's update. https://eprint.iacr.org/2017/281
[4] Dwork, C., and Roth, A. The Algorithmic Foundations of Differential Privacy. Foundations and Trends in Theoretical Computer Science (2014). The formal privacy definition and the composition theorems that bound cumulative leakage across many queries. https://doi.org/10.1561/0400000042
[5] Abadi, M., Chu, A., Goodfellow, I., McMahan, H. B., Mironov, I., Talwar, K., and Zhang, L. Deep Learning with Differential Privacy. ACM CCS (2016). Differentially private stochastic gradient descent by clipping per-example gradients and adding noise, with a moments accountant for the privacy budget. https://arxiv.org/abs/1607.00133
[6] Zhu, L., Liu, Z., and Han, S. Deep Leakage from Gradients. NeurIPS (2019). Shows that shared gradients can be inverted to reconstruct the training inputs, which is why sending updates is not automatically private. https://arxiv.org/abs/1906.08935
[7] Kairouz, P., McMahan, H. B., Avent, B., Bellet, A., Bennis, M., et al. Advances and Open Problems in Federated Learning. Foundations and Trends in Machine Learning (2021). A broad survey that names the cross-silo setting of a few organisations and the statistical heterogeneity that setting brings. https://arxiv.org/abs/1912.04977
[8] Li, T., Sahu, A. K., Zaheer, M., Sanjabi, M., Talwalkar, A., and Smith, V. Federated Optimization in Heterogeneous Networks. MLSys (2020). Adds a proximal term to local training so clients with non-identically-distributed data do not drift the shared model apart. https://arxiv.org/abs/1812.06127