A Tour of Where Subsurface Data Actually Lives

Ask someone new to subsurface machine learning where the data is, and the honest answer is stranger than they expect. It is not in one place and it is not in one shape. Some of it lives in a handful of generous open releases, clean and ready to model. A great deal more lives in government filing cabinets, scanned to images and effectively unreadable by a computer. This is a walking tour of those locations. We want to credit the people and institutions who opened the accessible parts of the field, look honestly at the physical form the data arrives in, and explain why, despite all that openness, the largest single store of well-log information is still a stack of pictures of paper. Nothing in the tour below is ours to claim except the digitiser we built to deal with the last stop on it; the archives belong to the field.

The accessible end: a whole field handed over

The first stop is the most generous, because it is where a major operator simply opened the doors. In 2018 Equinor released the complete data set from the Volve field, a decommissioned oil field in the Norwegian North Sea, for research and education [1]. What makes Volve remarkable is not that it is a tidy benchmark, because it is not one. It is closer to the raw contents of a working subsurface archive: well logs, seismic, production records, reports, the lot. For a newcomer it is the clearest possible demonstration that real, complete, messy field data can be opened to the public at all. Its value is realism and breadth. Its limitation, if you want to call it that, is that it was never packaged as a single scored prediction task, so two groups working on it still have to agree among themselves on exactly what they are predicting and how they will judge it.

The competition stop: everyone solving the same problem

The next stop is where the field acquired a shared yardstick. The FORCE 2020 dataset assembled wells from the Norwegian continental shelf, each carrying a suite of wireline measurements as inputs and an interpreted lithology label as the target, and posed one question to everyone at once: predict the lithology log from the measurement logs [2]. By fixing the inputs, the label, the train and test split, and the scoring rule, it turned a private modelling exercise into a public, comparable one. The competition ran on the Xeek platform, and it is worth keeping the two contributions separate even though people usually say them in one breath. The dataset is the corpus and the labels; the platform supplied the held-out test set, the evaluation, and the leaderboard that made every submission mean the same thing [3]. A labelled corpus with no agreed evaluation is just a private leaderboard waiting to happen, and an evaluation harness with no corpus has nothing to score. The pairing is what mattered.
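What a fixed scoring rule buys can be shown in a few lines. The following is a minimal sketch, not the competition's actual metric: plain accuracy stands in for whatever rule a real contest fixes in advance, and the function name, the held-out labels, and both submissions are invented for illustration.

```python
from typing import Sequence


def score(predicted: Sequence[str], truth: Sequence[str]) -> float:
    """Fraction of depth samples whose predicted lithology matches the label.

    Plain accuracy is an assumption, used here only as a stand-in
    for a scoring rule that is agreed before anyone submits.
    """
    if len(predicted) != len(truth):
        raise ValueError("submission must cover every held-out sample")
    if not truth:
        raise ValueError("held-out set is empty")
    hits = sum(p == t for p, t in zip(predicted, truth))
    return hits / len(truth)


# Hypothetical held-out labels: kept by the platform, hidden from entrants.
HELD_OUT = ["shale", "sandstone", "shale", "limestone", "shale"]

# Two hypothetical submissions, scored against the same answers.
submissions = {
    "team_a": ["shale", "sandstone", "shale", "shale", "shale"],
    "team_b": ["shale", "shale", "sandstone", "limestone", "sandstone"],
}

# Same rule, same answers: the resulting numbers are directly comparable.
leaderboard = sorted(
    ((score(pred, HELD_OUT), team) for team, pred in submissions.items()),
    reverse=True,
)
for result, team in leaderboard:
    print(f"{team}: {result:.2f}")
```

Because both entries are judged by one function against one hidden answer set, neither team can tilt the comparison by choosing its own split or metric, which is the property the corpus and platform pairing supplied.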

A widely used teaching slice makes the shape of this data vivid. A tutorial on handling missing values steps through a subset of 118 Norwegian Sea wells, each described by 22 measurement columns [4]. Read that sentence slowly, because it tells you almost everything about why this data is so workable. It is vector LAS, meaning every curve is already a depth-indexed numeric series rather than a picture of one. It is multi-well, so you can train on some wells and test on others. It is dense and tabular, with a named, manageable set of features per well. That is exactly the form supervised learning consumes most readily, and it is why FORCE on Xeek became the default first stop for petrophysical machine learning.
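The multi-well property is easiest to appreciate in code. Below is a minimal sketch, assuming the logs have already been flattened into one row per depth sample; the helper name `split_by_well`, the well names, and the curve values are hypothetical. The split is done per well rather than per row, so no well contributes samples to both sides.

```python
import random


def split_by_well(rows, test_fraction=0.25, seed=0):
    """Split rows into train and test so that no well appears on both sides.

    `rows` is a list of dicts, each carrying a "well" key plus curve values.
    At least one well is always held out for testing.
    """
    wells = sorted({row["well"] for row in rows})
    rng = random.Random(seed)  # seeded so the split is reproducible
    rng.shuffle(wells)
    n_test = max(1, round(len(wells) * test_fraction))
    test_wells = set(wells[:n_test])
    train = [r for r in rows if r["well"] not in test_wells]
    test = [r for r in rows if r["well"] in test_wells]
    return train, test


# Hypothetical depth-indexed samples: four wells, three depth steps each.
rows = [
    {"well": well, "depth_m": 1000.0 + i * 0.5, "GR": 50.0 + i, "RHOB": 2.3}
    for well in ("W1", "W2", "W3", "W4")
    for i in range(3)
]

train, test = split_by_well(rows)
train_wells = {r["well"] for r in train}
test_wells = {r["well"] for r in test}
print("train wells:", sorted(train_wells))
print("test wells:", sorted(test_wells))
```

Splitting on rows instead would scatter neighbouring depth samples from the same well across train and test, and a model could then score well simply by interpolating between them.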

This kind of openness has roots that predate the headline releases, and they deserve the credit too. An early, influential facies-classification exercise published both its data and its code so that anyone could reproduce the whole thing end to end [5]. It was small, but it set the template the larger competitions later scaled up: real logs, an explicit label, a fixed evaluation, nothing hidden. When you stand at the accessible end of the field today, you are standing on work that started there.

The national archive: where most of it actually is

Now the tour leaves the curated district and walks into the warehouse, because that is where the bulk of the data turns out to live. State and national regulators have required operators to file well records for decades, and those filings accumulate into archives that dwarf any competition dataset. The public records of the Texas regulator are the example we know best, because we built on them [6]. The point of this stop is not the headline scale but the contrast hiding inside it, and that contrast is the whole reason this tour exists.

Figure: a triage funnel for the public Texas regulatory well-log archive, which holds 136,771 scanned TIF raster logs alongside only 7,781 LAS digital files. The funnel runs both populations from the raw archive down to ML-ready records at an adjustable triage strictness. The LAS files are already numeric depth-indexed curves, so they pass nearly whole whatever the setting. The raster scans are pictures of paper logs, so each one must clear a legibility bar and a curve-recovery step before it counts as machine-readable, and its survival falls as the triage gets stricter. The two archive endpoint counts are sourced from the engagement archive; the per-stage survival fractions are illustrative.

The archive holds 136,771 scanned raster well logs in TIF form alongside only 7,781 digital LAS files. Those two numbers describe the same field of wells, yet they are not the same kind of thing at all. The 7,781 LAS files are the FORCE-shaped data: depth-indexed numeric curves a model can read straight away. The 136,771 raster files are pictures, photographs in effect of paper logs, where the curve survives only as ink on a scan and means nothing to a computer until it has been traced back into numbers. The funnel above runs both populations through a triage from the raw archive to machine-ready records. Tighten the triage and the asymmetry jumps out: the vector files glide through nearly untouched, while the scanned files shed count at every stage because each one has to be made legible and then have its curves recovered before it counts at all. The order-of-magnitude gap at the mouth of the funnel, roughly seventeen and a half scanned logs for every digital one, is not a quirk of Texas. It is the ordinary ratio of legacy subsurface data everywhere.
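The arithmetic behind that gap, and the shape of the funnel, can be checked in a few lines. The two archive counts come from the text; the per-stage survival fractions are hypothetical placeholders chosen only to show the mechanics, and the `funnel` helper is ours for illustration.

```python
SCANNED_TIF = 136_771  # scanned raster logs, from the text
DIGITAL_LAS = 7_781    # digital LAS files, from the text

ratio = SCANNED_TIF / DIGITAL_LAS
digital_share = DIGITAL_LAS / (SCANNED_TIF + DIGITAL_LAS)
print(f"scanned logs per digital file: {ratio:.1f}")
print(f"digital share of the archive: {digital_share:.1%}")


def funnel(count, stage_survival):
    """Return the count remaining at the start and after each stage."""
    remaining = [count]
    for fraction in stage_survival:
        remaining.append(round(remaining[-1] * fraction))
    return remaining


# Hypothetical survival fractions for two stages: legibility, curve recovery.
# These are NOT measured values; they only illustrate the asymmetry.
RASTER_STAGES = [0.6, 0.5]
LAS_STAGES = [0.98, 0.98]

print("raster:", funnel(SCANNED_TIF, RASTER_STAGES))
print("LAS:   ", funnel(DIGITAL_LAS, LAS_STAGES))
```

Whatever fractions are plugged in, the structure is the same: the raster population is multiplied by a loss at every stage, while the LAS population is barely reduced.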

Why so much value stays stuck in paper

It is tempting to treat the scanned archive as a backlog that someone simply has not gotten around to, but the reason it stays trapped is more structural than that. The logs were created over many decades, on different instruments, by different service companies, with different scales, tracks, and conventions. They were filed to satisfy a regulator, not to feed a model, so legibility for a human reader was the only bar they ever had to clear. A scan preserves the human-readable picture perfectly and the machine-readable curve not at all. So the value in those 136,771 files is real and it is large, but it is held one conversion step away from anything an algorithm can use, and that step is genuinely hard: thin overlapping curves, faded ink, skew, grid lines that look exactly like data, and a depth axis that has to be reconstructed before any single reading is trustworthy.
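To make the conversion step concrete, here is a deliberately naive sketch of turning ink positions into depth-indexed values. It is not VeerNet and not any production method: it assumes a single curve on a clean binary image, with the depth range and the track scale already known, which are exactly the things a real scan does not hand over. The function name `trace_curve` and all numbers are hypothetical.

```python
def trace_curve(raster, depth_top, depth_bottom, value_left, value_right):
    """Map ink pixels in a binary raster to (depth, value) samples.

    `raster` is a list of rows (top to bottom), each a list of 0/1 pixels,
    with at least two rows and two columns. Rows map linearly to depth and
    columns map linearly to the curve value. Rows with no ink yield None.
    """
    n_rows = len(raster)
    n_cols = len(raster[0])
    samples = []
    for row_index, row in enumerate(raster):
        depth = depth_top + (depth_bottom - depth_top) * row_index / (n_rows - 1)
        ink = [col for col, pixel in enumerate(row) if pixel]
        if not ink:
            samples.append((depth, None))  # faded or missing ink
            continue
        col = sum(ink) / len(ink)  # centre of the ink on this row
        value = value_left + (value_right - value_left) * col / (n_cols - 1)
        samples.append((depth, value))
    return samples


# Hypothetical 5x5 scan: one curve drifting to the right with depth.
raster = [
    [0, 1, 0, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 0, 0, 0],  # nothing legible on this row
    [0, 0, 0, 1, 0],
    [0, 0, 0, 0, 1],
]
samples = trace_curve(raster, 1000.0, 1002.0, 0.0, 150.0)
for depth, value in samples:
    print(depth, value)
```

Every assumption this toy makes fails on a real scan: a grid line would register as ink, two overlapping curves would be averaged into one false reading, and skew would bend the row-to-depth mapping. That gap between the sketch and the archive is the hard part described above.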

This is the stop the rest of the field tends to skip, and skipping it is why the accessible datasets, for all their importance, are not where most of the answers are. The clean vector releases let you ask whether one curve can be predicted from others on data that is already digital. The scanned archive holds the prior question: how do you get a numeric curve out of a picture in the first place, at the scale of a regulatory trove rather than a single log? That is the problem VeerNet, our encoder-decoder for raster-log digitisation, was built to attack, and the only thing on this tour we claim as our own. We are deliberately not calling the scanned archive a benchmark. It ships with no fixed task, no held-out split, and no agreed score, so it is not one in the FORCE sense. It is raw material, sitting at the front of the pipeline, before the curated benchmarks become relevant to it.

Mapping the stops to your own problem

If you are deciding where to begin, the tour gives a clean rule. When your data is already clean vector LAS and your question is curve-to-curve or curve-to-label prediction, start in the curated district: FORCE 2020 on Xeek for a labelled, scored task [2][3], Volve when you want a fuller and more realistic field to wrestle with [1], and the published tutorials and slices that make either approachable [4][5]. When your raw material is scanned paper, you are standing in the warehouse, and the honest next move is to recognise that digitisation comes first and the open benchmarks only matter to you once your images have become curves. The two ends are not rivals; they are different stations on one line, and the open releases are the reason the line has a map at all.

What stays with us after walking the whole route is the inversion at the heart of it. The data that is easiest to model is the rarest, and the data that is most abundant is the hardest to touch. The field has spent its open-data energy, rightly and gratefully, on the accessible district, and we have leaned on that work ourselves. But the warehouse is where the wells mostly are, and a hundred and thirty-six thousand scanned logs will keep their secrets until someone teaches a machine to read them off the page. That, more than any leaderboard, is the frontier this tour points to.

Key takeaways

Subsurface data lives in two very different places. A small, accessible district of open releases (Equinor's Volve field, the FORCE 2020 dataset, the Xeek platform that scored it) holds clean, model-ready data, and we credit those releases for opening the field.
Volve (2018) opened a whole real field for research; FORCE 2020 added a fixed lithology task with a label, inputs, and a split; Xeek supplied the held-out evaluation and public leaderboard. An earlier open facies exercise set that template at smaller scale.
The accessible data is vector LAS: a teaching slice is 118 Norwegian Sea wells with 22 measurement columns, depth-indexed, multi-well, dense, and labelled. That tabular shape is exactly what supervised learning reads most easily.
The national archive is where most of the data actually is. The Texas regulator's public records hold 136,771 scanned raster TIF logs against only 7,781 digital LAS files, an order-of-magnitude gap between scanned and machine-readable subsurface data.
The scanned logs stay trapped because they were filed to be human-readable, not machine-readable: a scan keeps the picture and loses the numeric curve. Recovering a curve from a picture at archive scale is the prior problem VeerNet (ours) was built to solve, before any open benchmark applies.

References

[1] Equinor. Volve field data set. Equinor open data release (2018). The full set of subsurface and production data from the decommissioned Volve field in the Norwegian North Sea, released for research and education. https://www.equinor.com/energy/volve-data-sharing

[2] Bormann, P., Aursand, P., Dilib, F., Manral, S., and Dischington, P. FORCE 2020 well log and lithofacies dataset for machine learning competition. FORCE / GitHub (2020). The labelled North Sea and Norwegian Sea well-log corpus used in the FORCE 2020 lithology prediction contest. https://github.com/bolgebrygg/Force-2020-Machine-Learning-competition

[3] Xeek (Enthought). FORCE 2020: Predict lithology from well logs. Xeek competition platform (2020). The data-science challenge that hosted the FORCE 2020 lithology task and standardised its evaluation and leaderboard. https://xeek.ai/challenges/force-well-logs/overview

[4] McDonald, A. Using the missingno Python library to identify and visualise missing data prior to machine learning. Towards Data Science (2021). A tutorial that walks through 118 Norwegian Sea wells with 22 measurement columns from the FORCE / Xeek slice, making the data's vector-LAS shape concrete. https://towardsdatascience.com/using-the-missingno-python-library-to-identify-and-visualise-missing-data-prior-to-machine-learning-34c8c5b5f009

[5] Hall, B. Facies classification using machine learning. The Leading Edge, 35(10), 906-909 (2016). An early open, reproducible well-log facies-classification exercise that published both data and code and helped set the template the later competitions followed. https://library.seg.org/doi/10.1190/tle35100906.1

[6] Railroad Commission of Texas. Well log and digital records, public well data. Texas RRC (accessed 2022). The state regulatory archive of scanned raster well logs and a smaller set of digital files for onshore Texas wells. https://www.rrc.texas.gov/resource-center/research/data-sets-available-for-download/
