Foundation models trained on cat photos do not understand seismic amplitudes. A family of masked autoencoders pretrained on ~30 TB of Norwegian Continental Shelf data shows that basin-targeted pretraining — not global aggregation — is the strongest default for regional interpretation workflows.
Abstract
Automated seismic interpretation has inherited much of its model machinery from computer vision, but the physics that produces a seismic amplitude has little in common with the physics that produces a photographic pixel. That mismatch shows up as brittle transfer: state-of-the-art self-supervised vision encoders, dropped onto migrated post-stack volumes, underperform even modest seismic-specific baselines. This note summarises a family of foundation models pretrained exclusively on approximately 30 TB of 3D seismic from the Norwegian Continental Shelf (NCS), using masked autoencoders (MAE) [1] with Vision Transformer backbones in three tokenization variants — 2D, 2.5D multi-view, and 3D volumetric.
Across four geological interpretation benchmarks, basin-targeted pretraining consistently outperforms both generic vision models and a globally pretrained seismic baseline [2,3]. Gains are largest on amplitude-sensitive tasks such as flatspot mapping, where local acquisition and processing characteristics dominate the signal. The 2.5D multi-view formulation achieves the strongest average accuracy at a fraction of the compute of dense 3D tokenization, making it the practical default at repository scale. Learned embeddings also support interactive similarity search across full 3D cubes in seconds, opening a human-in-the-loop mapping mode with minimal labelling [5]. Pretrained weights are released openly.
Background
Seismic interpretation is a labour-bound bottleneck in subsurface workflows. A 3D survey arrives as a multi-terabyte amplitude cube; an interpreter spends weeks to months picking horizons, faults, and direct hydrocarbon indicators. The field has reached for deep learning to compress that cycle, but most production-grade architectures still trace their lineage to ImageNet-class natural-image backbones. The implicit assumption — that low-level visual primitives transfer cleanly from photographs to migrated amplitudes — is increasingly difficult to defend.
Seismic data is not an image. A pixel encodes reflected light intensity at a sensor; an amplitude sample encodes a band-limited estimate of an acoustic impedance contrast, modulated by acquisition geometry, processing flow, and overburden. The statistics differ too: seismic amplitudes are signed and Gaussian-like, with spatial correlation structured along bedding rather than object boundaries. Recent surveys of foundation models in seismic processing make the gap explicit: natural-image pretraining helps where edges and textures dominate, and fails where amplitude fidelity matters [4].
Masked autoencoder: a self-supervised pretraining objective in which a large fraction of input patches are hidden and the network learns to reconstruct them in pixel space from the visible remainder.
Two responses have emerged. The first is global seismic pretraining: aggregate everything available worldwide and train a single backbone [2,3]. The second, examined here, is basin-targeted pretraining: restrict the corpus to a coherent geological province and let the model specialise to its acquisition, processing, and stratigraphic conventions. The NCS, with decades of open public-domain surveys via the Norwegian Offshore Directorate disclosure regime, is an unusually good testbed for the second strategy.
Method
The pretraining corpus comprises approximately 30 TB of 3D post-stack migrated amplitude volumes from the NCS. Density-aware sampling is applied to mitigate the dominance of a small number of large overlapping surveys, but the corpus remains spatially imbalanced — dense across the North Sea, sparser in the Norwegian Sea, and thin in the Barents Sea. All variants share a Vision Transformer encoder–decoder architecture trained with the MAE objective of He et al. [1].
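The text does not specify how density-aware sampling is implemented. One plausible form, sketched here purely as an assumption, is to bin candidate patch locations into map cells and weight each patch by the inverse of its cell's patch count, so that heavily surveyed areas do not dominate a training epoch. The function name `density_aware_weights`, the `cell_ids` array, and the toy counts are all hypothetical.

```python
import numpy as np


def density_aware_weights(cell_ids: np.ndarray) -> np.ndarray:
    """Sampling probability per candidate patch, inversely proportional
    to the number of candidates that share its map cell (assumed scheme)."""
    _, inverse, counts = np.unique(cell_ids, return_inverse=True, return_counts=True)
    weights = 1.0 / counts[inverse]
    return weights / weights.sum()


# Toy corpus: three map cells with very different numbers of candidate patches,
# standing in for densely and thinly covered areas.
cell_ids = np.array([0] * 900 + [1] * 90 + [2] * 10)
probs = density_aware_weights(cell_ids)

rng = np.random.default_rng(0)
draws = rng.choice(len(cell_ids), size=3000, p=probs)

# Fraction of drawn patches that fall in each cell.
share = np.bincount(cell_ids[draws], minlength=3) / len(draws)
print(share)
```

Under this weighting each cell carries the same total probability, so the drawn shares come out roughly equal even though the raw counts differ by almost two orders of magnitude. A real pipeline would likely temper the weights rather than equalise completely, which is consistent with the residual imbalance the text reports.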
Three tokenization regimes are compared. The 2D variant patchifies inline or crossline sections and treats each as an independent image. The 2.5D variant samples three orthogonal slices through a common centre voxel and concatenates their token sequences before encoding, giving the model a multi-view summary of local 3D structure at modest cost. The 3D variant patchifies a true volumetric block; due to memory and I/O constraints, it is trained with sparse pillar sampling rather than dense volumetric patches, which limits per-update batch diversity.
2.5D multi-view:
- Three orthogonal slices through a common voxel
- Token sequences concatenated before encoding
- Captures local 3D structure at 2D-like cost
- Strongest average benchmark accuracy
3D volumetric:
- True volumetric patchification
- Sparse pillar sampling due to memory/IO limits
- Reduced per-update batch diversity
- Dense regime remains an open question
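The 2.5D tokenization described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the axis order (inline, crossline, time/depth), the window size, the patch size, and the function names `patchify_2d` and `tokens_2p5d` are assumptions made for the example.

```python
import numpy as np


def patchify_2d(section: np.ndarray, patch: int) -> np.ndarray:
    """Split a (H, W) section into flattened, non-overlapping patch tokens."""
    h, w = section.shape
    assert h % patch == 0 and w % patch == 0, "section must tile evenly"
    tokens = section.reshape(h // patch, patch, w // patch, patch)
    tokens = tokens.transpose(0, 2, 1, 3)  # (patch rows, patch cols, patch, patch)
    return tokens.reshape(-1, patch * patch)


def tokens_2p5d(cube: np.ndarray, centre: tuple, size: int, patch: int) -> np.ndarray:
    """Concatenate patch tokens from three orthogonal slices through `centre`."""
    i, j, k = centre
    half = size // 2
    si = slice(i - half, i + half)
    sj = slice(j - half, j + half)
    sk = slice(k - half, k + half)
    inline_view = cube[i, sj, sk]      # fixed index on axis 0
    crossline_view = cube[si, j, sk]   # fixed index on axis 1
    depth_view = cube[si, sj, k]       # fixed index on axis 2
    views = [patchify_2d(v, patch) for v in (inline_view, crossline_view, depth_view)]
    return np.concatenate(views, axis=0)


# Synthetic stand-in for an amplitude cube.
rng = np.random.default_rng(0)
cube = rng.standard_normal((128, 128, 128)).astype(np.float32)

tokens = tokens_2p5d(cube, centre=(64, 64, 64), size=64, patch=16)
print(tokens.shape)  # (48, 256)
```

With a 64-sample window and 16-sample patches, the three views yield 3 × 16 = 48 tokens of 256 values each. A dense 3D block of the same extent cut into 16 × 16 × 16 patches would yield 64 tokens of 4096 values each, which illustrates where the cost difference between the two regimes comes from.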
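An 85% masking ratio is applied to the flattened token sequence across all variants, higher than the 75% default of the original MAE work [1]. The reconstruction objective minimises mean squared error in pixel space over the masked patch positions:
\[ \mathcal{L} = \frac{1}{|\mathcal{M}|} \sum_{i \in \mathcal{M}} \frac{1}{D} \lVert \hat{x}_i - x_i \rVert_2^2 \]
Here, in generic notation, \mathcal{M} is the set of masked token indices, x_i is the original patch at position i flattened to D amplitude values, and \hat{x}_i is the decoder's reconstruction of it.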
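A minimal sketch of random token masking and a masked mean-squared-error loss follows. It assumes tokens are already flattened patches of shape (N, D), and it uses an all-zero array as a placeholder prediction in place of a real decoder. The helper names `random_mask` and `masked_mse` are hypothetical.

```python
import numpy as np


def random_mask(num_tokens: int, mask_ratio: float, rng: np.random.Generator) -> np.ndarray:
    """Boolean mask of shape (num_tokens,); True marks a hidden token."""
    num_masked = int(round(num_tokens * mask_ratio))
    order = rng.permutation(num_tokens)
    mask = np.zeros(num_tokens, dtype=bool)
    mask[order[:num_masked]] = True
    return mask


def masked_mse(target: np.ndarray, prediction: np.ndarray, mask: np.ndarray) -> float:
    """Mean squared error averaged over masked tokens only."""
    per_token = ((prediction - target) ** 2).mean(axis=-1)
    return float(per_token[mask].mean())


rng = np.random.default_rng(0)
tokens = rng.standard_normal((48, 256))   # 48 tokens of 256 amplitude values
mask = random_mask(len(tokens), mask_ratio=0.85, rng=rng)

visible = tokens[~mask]                   # what an encoder would see
placeholder = np.zeros_like(tokens)       # stands in for a decoder output
loss = masked_mse(tokens, placeholder, mask)

print(int(mask.sum()), len(visible))      # 41 7
```

Because the loss indexes only the masked positions, errors on visible tokens do not contribute to it. At an 85% ratio the encoder processes 7 of the 48 tokens in this toy case, which is the source of the pretraining efficiency of the MAE design.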
Pretrained encoders are then evaluated on four downstream geological interpretation benchmarks spanning facies classification, fault identification, salt-body delineation, and flatspot mapping, with linear probing and light fine-tuning protocols. Baselines include a frozen DINOv2 natural-image encoder, a globally pretrained seismic foundation model [2], and a from-scratch supervised ViT.
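Linear probing keeps the encoder frozen and fits only a linear classifier on its embeddings. The sketch below uses a closed-form ridge-regularised least-squares fit on one-hot labels as a simple stand-in; the source does not say which probe optimiser was used, and logistic regression trained by gradient descent is the more common choice. The embeddings are synthetic and the names `fit_linear_probe` and `predict` are hypothetical.

```python
import numpy as np


def fit_linear_probe(embeddings: np.ndarray, labels: np.ndarray,
                     num_classes: int, l2: float = 1e-3) -> np.ndarray:
    """Ridge least-squares fit from frozen embeddings to one-hot labels."""
    x = np.hstack([embeddings, np.ones((len(embeddings), 1))])  # append bias column
    y = np.eye(num_classes)[labels]
    gram = x.T @ x + l2 * np.eye(x.shape[1])
    return np.linalg.solve(gram, x.T @ y)


def predict(weights: np.ndarray, embeddings: np.ndarray) -> np.ndarray:
    """Class with the highest linear score for each embedding."""
    x = np.hstack([embeddings, np.ones((len(embeddings), 1))])
    return (x @ weights).argmax(axis=1)


# Synthetic "frozen encoder" embeddings: two well-separated clusters in 16-D.
rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=400)
centres = np.array([[-2.0] * 16, [2.0] * 16])
embeddings = centres[labels] + rng.standard_normal((400, 16))

weights = fit_linear_probe(embeddings[:300], labels[:300], num_classes=2)
accuracy = float((predict(weights, embeddings[300:]) == labels[300:]).mean())
print(accuracy)
```

Since no encoder weights change, the probe accuracy reflects how linearly separable the pretrained representation already is for the task, which is why it is a common way to compare pretraining regimes.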
Results
Three findings hold across the four benchmarks. First, natural-image foundation models do not transfer reliably to seismic — DINOv2 and comparable self-supervised vision encoders trail seismic-specific baselines on every task, with the gap widest on amplitude-driven problems. Second, seismic-domain pretraining is necessary but not sufficient: a globally aggregated seismic baseline beats all natural-image encoders on average, but is in turn outperformed by basin-targeted pretraining on the NCS test suite. Third, the 2.5D multi-view tokenization delivers the best average accuracy while using a fraction of the compute and memory of the 3D variant.
Figure: pretraining regimes ranked by average benchmark accuracy, highest first: basin-targeted NCS pretraining; global seismic pretraining; natural-image self-supervised; from-scratch supervised ViT.
Gains are largest on flatspot mapping — a direct hydrocarbon indicator task that depends on accurate preservation of local amplitude relationships. This is consistent with the hypothesis that basin-targeted pretraining absorbs region-specific acquisition and processing signatures that a globally pooled corpus averages out. Facies classification and fault identification show smaller but consistent improvements; salt-body delineation, where geometry dominates over amplitude, shows the narrowest margins.
Beyond benchmark scores, the learned embeddings prove useful as a retrieval substrate. Cosine similarity search over indexed embeddings returns geologically analogous patches across full 3D cubes in seconds, supporting an interactive mapping mode in which an interpreter labels a handful of exemplars and propagates them via nearest-neighbour lookup. This is the practical mechanism by which a foundation model becomes a productivity layer rather than a one-shot classifier [5].
Figure: interactive similarity search loop.
- Interpreter labels exemplar: a handful of patches of interest
- Encode to embedding: frozen basin-targeted ViT encoder
- Cosine search over cube: seconds across full 3D volumes
- Propagate + refine: human-in-the-loop geological map
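The retrieval step of this loop reduces to a normalised dot product followed by a top-k selection. The sketch below runs a brute-force cosine search over a synthetic embedding index; the index size, embedding width, and function names are assumptions, and a production system at cube scale would more likely use an approximate nearest-neighbour index.

```python
import numpy as np


def l2_normalise(x: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """Scale vectors along the last axis to unit length."""
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)


def cosine_top_k(index: np.ndarray, query: np.ndarray, k: int):
    """Return (row ids, cosine scores) of the k rows most similar to `query`."""
    scores = l2_normalise(index) @ l2_normalise(query)
    top = np.argpartition(-scores, k - 1)[:k]   # unordered k best
    top = top[np.argsort(-scores[top])]         # order best first
    return top, scores[top]


# Synthetic stand-in for per-patch embeddings of a full cube.
rng = np.random.default_rng(0)
index = rng.standard_normal((10_000, 64)).astype(np.float32)

# An "exemplar" query: a slightly perturbed copy of one indexed patch.
query = index[1234] + 0.01 * rng.standard_normal(64).astype(np.float32)

ids, sims = cosine_top_k(index, query, k=5)
print(ids[0])  # 1234
```

In the interactive setting, the interpreter's labelled exemplars play the role of `query`, and the returned row ids map back to patch locations in the cube for review and refinement.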
Discussion
The headline result — that basin-targeted pretraining beats global aggregation — runs against the prevailing instinct in foundation-model work, which assumes more data is always better. For seismic, more data is better only when it shares the acquisition and processing conventions of the target province. A backbone pretrained across mixed vintages, contractors, and processing flows learns an averaged prior that is robust but blunt; a backbone pretrained on a single basin learns a sharper prior that pays off precisely where amplitude fidelity matters.
That has practical implications for operators. The default architecture for a regional interpretation workflow should be a basin-specialised encoder, not a globally pooled one. For data-rich provinces with public disclosure regimes — the NCS, the UK Continental Shelf, parts of the US Gulf — the corpus already exists. For data-poor regions, the case for federated or transfer-from-analogue strategies becomes the open research question.
Limitations are real and worth stating plainly. The corpus is geographically biased toward the North Sea. Evaluation is restricted to migrated post-stack amplitude volumes, leaving angle stacks and well-log integration to future work. Ground-truth labels embed interpreter subjectivity, which caps achievable scores and may mask small inter-model differences. The 3D variant was trained with sparse pillar sampling, so whether a fully dense 3D regime would close the gap with 2.5D is unresolved. No quantitative scaling law is established: the ranking natural-image → global seismic → basin-targeted is monotonic, but the models may not yet have reached saturation, and the relative contribution of corpus size versus corpus diversity remains open.
A final methodological caveat. Masked pixel reconstruction biases the encoder toward local texture statistics rather than higher-level structural or stratigraphic abstractions. Self-distillation and latent-prediction objectives — which optimise in representation space rather than pixel space — may be better matched to the relational signals that interpretation tasks ultimately demand. The right next step is not a bigger MAE; it is a different objective.
By the numbers
The quantitative spine of this work, in one frame.
- ~30 TB: NCS 3D seismic pretraining corpus
- 85%: MAE token masking ratio
- 3: tokenization variants (2D, 2.5D, 3D)
- 4: geological interpretation benchmarks
- Comparison: basin-targeted vs. global seismic and natural-image baselines
References
[1] He, K., Chen, X., Xie, S., Li, Y., Dollár, P., and Girshick, R. (2022). Masked autoencoders are scalable vision learners. IEEE/CVF CVPR, 15979–15988.
[2] Sheng, H., Wu, X., Si, X., Li, J., Zhang, S., and Duan, X. (2025). Seismic foundation model: A next generation deep-learning model in geophysics. GEOPHYSICS, 90(2), IM59–IM79.
[3] Sansal, A., Lasscock, B., and Valenciano, A. (2025). Scaling seismic foundation models. First Break, 43, 69–74.
[4] Fuchs, F., Fernandez, M.R., Ettrich, N., and Keuper, J. (2025). Foundation models for seismic data processing: An extensive review. arXiv:2503.24166. https://arxiv.org/abs/2503.24166
[5] Waldeland, T.J., Forgaard, L., Ordonez, A., Wade, D., and Bugge, A.J. (2025). Interactive injectite mapping with minimal training data using self-supervised learning. 86th EAGE Conference, Extended Abstracts.