A well log is a record of what a drill bit passed through: a strip of curves, gamma ray and resistivity and density, plotted against depth. For a working geoscientist it is the closest thing there is to ground truth about a rock column that no one will ever see directly. And for a large share of the wells ever drilled, that record exists only as a scanned image. A photograph of a paper strip, saved as a TIF, sitting in a folder or a filing cabinet or a regulator's public database. A computer can display that image. It cannot read it. The curve on the page is not a column of numbers; it is a smear of pixels that happens to look like a line to a human eye.
That gap between "we have the scan" and "we have the data" is the whole subject of this note. We build VeerNet, the encoder-decoder EarthScan uses to lift curves off scanned logs, so we spend our days on the technical half. But the technical half only matters because of a market fact that is easy to state and hard to act on: the stranded archive is enormous, it is valuable, and almost nobody who owns a piece of it is equipped to free it. We will not re-tell how VeerNet works here; the question is who else is in the room, and why the room is emptier than the size of the prize would predict.
The archive is bigger than it sounds, and it is one of many
Start with a number that is real and public. One state regulator's scanned-log archive holds 136,771 TIF images and 7,781 LAS files. That is a single archive, in a single jurisdiction, and it is a public one, which is why we can count it at all. The private archives, the ones inside operators and service companies, are not countable from the outside, but there is no reason to think they are smaller, and every reason from decades of drilling history to think they are larger. The scanned regulator holding is the visible tip; the recoverable total is a multiple of it that no one has a clean census for.
The reason this is a stranded asset rather than an old filing system is that the data inside is not obsolete. A log from 1975 describes the same rock a log from 2015 would, and for many mature basins the old wells are the only wells. A team modelling a reservoir, screening an acreage position, or pricing a transaction works from whatever logs exist, and the ones that exist are disproportionately the old raster ones. The value is not nostalgic. The recovered curve feeds directly into the analyses that decide where money goes, and the surveys of where machine learning already pays in upstream work all assume the input curves exist as numbers, not pictures [1]. Digitisation is the quiet precondition for everything downstream of it.
Why a data moat does not free the archive
Here is the part that surprised us least as engineers and surprises newcomers most. The players sitting on the deepest subsurface data holdings are, by and large, not the ones freeing the raster archive. Regulators hold the largest public scan collections and have essentially no mandate or incentive to convert them. Legacy data vendors have spent decades building proprietary libraries and charge for access to them, but access to a scan is not the same as recovery of the curve inside it, and the incremental engineering to turn one into the other is a different business from the one they run. Operators and national oil companies hold vast private archives, but their scarce technical attention goes to the next well, not to the back catalogue.
The common thread is that owning the data and being able to recover it are separate capabilities people conflate. A moat is about access: who is allowed to see the scan. Recovery is about signal: turning the pixels back into a depth-indexed curve a model can consume. The second is a computer-vision problem, and it does not get easier because you own more scans. If anything the moat cuts against solving it, because the incumbent's position is built on the scan being scarce and gated, not on it being cheap to convert. Even cleaned, digitised datasets are patchy enough that a full curve set is rare; the widely used FORCE 2020 benchmark of 118 Norwegian Sea wells is instructive precisely because it took a coordinated public effort to assemble that much clean log data, and it still arrives with gaps a practitioner has to reckon with before modelling [2]. That is the curated end of the spectrum. The raster archive is several steps behind it.
The map above is the argument in one frame. Put the players on two axes, how deep their data moat runs and how equipped they are to unlock a raster scan, and the high-capability band is nearly empty. The incumbents cluster bottom-right: deep moats, low recovery capability. Generic machine-learning shops drift toward the middle, technically able but without the subsurface-specific tooling or the log corpus to train on. The upper region, high recovery capability aimed squarely at the raster problem, has room in it. That vacancy is not an accident of the diagram; it is the structural reason a focused digitisation challenger has somewhere to stand.
The prize, sized honestly
Being precise about the size of the opportunity matters, because "large market" is a phrase that means nothing on its own. The total addressable market here is the oil and gas transactions market, on the order of 134B USD. That is the whole arena, and no digitisation company touches most of it. The serviceable slice, the oil and gas software and technology market, is roughly 6.7B USD. That is the honest denominator, the part a software product can actually compete for. And the obtainable slice, the piece a specific company plans to win in a defined window, is smaller still: our own working target is 180M USD by the end of year five, about 2.7 percent of the serviceable market rather than a fantasy fraction of the whole arena.
Stacking the numbers this way matters because the credibility of a digitisation play lives in the gap between them. The 134B USD is context. The 6.7B USD is the field you are actually on. The 180M USD is what you claim you can take, and it has to be defended by something other than the size of the archive. In our case the early defence is concrete rather than projected: three pilot customers have signed letters of intent, the smallest honest evidence that the recovered curve is worth paying for to someone not obligated to say so. Three is not a market. It is a signal that the wedge is real.
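The nesting of the three figures above can be checked with a few lines of arithmetic. The sketch below uses only the working numbers quoted in this note; the variable and function names are ours, chosen for illustration, and the result is a ratio check, not a forecast.

```python
# Working figures quoted in this note, in USD.
TAM_USD = 134e9   # oil and gas transactions market (total addressable)
SAM_USD = 6.7e9   # oil and gas software and technology market (serviceable)
SOM_USD = 180e6   # working target by end of year five (obtainable)


def share(part: float, whole: float) -> float:
    """Return part as a percentage of whole."""
    return 100.0 * part / whole


sam_of_tam = share(SAM_USD, TAM_USD)
som_of_sam = share(SOM_USD, SAM_USD)
som_of_tam = share(SOM_USD, TAM_USD)

print(f"Serviceable share of total:      {sam_of_tam:.2f}%")
print(f"Obtainable share of serviceable: {som_of_sam:.2f}%")
print(f"Obtainable share of total:       {som_of_tam:.2f}%")
```

On these numbers the serviceable market is about 5 percent of the whole arena, and the year-five target is under 3 percent of the serviceable market and a little over a tenth of a percent of the arena.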
What actually has to be built
The open quadrant stays open because standing in it is genuinely hard, and the work deserves naming so the vacancy does not read as a free lunch. Recovering a curve from a scan is not optical character recognition with a different label on it. The input is a photograph of a plot: overlapping traces, faded ink, grid lines that look like signal, margin annotations, and a depth axis that has to be recovered as carefully as the curves themselves, because a curve at the wrong depth is worse than no curve. The scans vary in width and resolution across decades of equipment, so the recovery has to be robust to inputs it was never shown. And the output has to be trustworthy enough that a petrophysicist will stake an interpretation on it.
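To make "a curve at the wrong depth" concrete, the sketch below shows only the last and simplest step of recovery: converting already-traced pixel positions into depth and curve values by linear calibration. It is not how VeerNet works. It assumes the trace has been isolated as one pixel column per pixel row, that the depth axis and the track scale are both linear, and that two depth ticks and the two track edges have already been located. Every name and number in it is hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LinearAxis:
    """Maps a pixel coordinate to a physical value via two reference points."""
    pixel_a: float
    value_a: float
    pixel_b: float
    value_b: float

    def to_value(self, pixel: float) -> float:
        slope = (self.value_b - self.value_a) / (self.pixel_b - self.pixel_a)
        return self.value_a + slope * (pixel - self.pixel_a)


def pixels_to_curve(trace, depth_axis, value_axis):
    """Convert (row, column) trace pixels into (depth, value) samples."""
    return [(depth_axis.to_value(row), value_axis.to_value(col))
            for row, col in trace]


# Hypothetical calibration: depth ticks found at rows 100 and 1100, labelled
# 1000 and 1100 in whatever depth unit the log prints; a gamma-ray track
# spanning columns 50..450 and scaled 0..150 API.
depth_axis = LinearAxis(pixel_a=100, value_a=1000.0, pixel_b=1100, value_b=1100.0)
gr_axis = LinearAxis(pixel_a=50, value_a=0.0, pixel_b=450, value_b=150.0)

# Hypothetical traced pixels: (row, column) pairs for one curve.
trace = [(100, 50), (600, 250), (1100, 450)]
curve = pixels_to_curve(trace, depth_axis, gr_axis)

# The same trace with the lower depth tick misplaced by 20 pixel rows.
bad_depth_axis = LinearAxis(pixel_a=100, value_a=1000.0, pixel_b=1120, value_b=1100.0)
shifted = pixels_to_curve(trace, bad_depth_axis, gr_axis)

for (d, v), (d_bad, _) in zip(curve, shifted):
    print(f"value={v:6.1f}  depth={d:7.2f}  miscalibrated depth={d_bad:7.2f}")
```

In this toy setup a 20-row error in locating one tick moves the deepest sample by about two depth units while leaving its value untouched, so the curve looks plausible and is simply in the wrong place. Real scans add what this sketch leaves out: non-linear tracks (resistivity is commonly plotted on a logarithmic scale), overlapping traces, and grid lines that have to be separated from signal before any pixel can be trusted.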
None of that is intractable, which is the point of building VeerNet, but all of it is why the incumbents with the moats have not done it. It is a different discipline from gating access to a library, and it does not benefit from the thing they are good at. A challenger that treats recovery as the whole job, rather than a feature bolted onto a data-access business, competes on the axis that is actually empty. That is the wedge, and it is why a company with a fraction of the incumbents' data can still take a defensible slice of the serviceable market: it solves the problem the archive actually poses, not the one the incumbents are already paid to solve.
Limitations
The market figures here are the working numbers from a specific effort, not an independent audit of the sector. The 134B USD transactions market and the 6.7B USD software market are the denominators we plan against; a different analyst would draw the boundaries somewhat differently, and the 180M USD obtainable target is a plan, not a result. The archive counts, 136,771 TIF and 7,781 LAS, are exact for one public regulator holding and are not an estimate of the global stranded total, which we cannot cleanly measure. The player map is a positioning judgement: the two axes are real distinctions, but where any given player-type sits on them is our reading of the field, not a measured index, and reasonable people would move a marker or two. And the three letters of intent are intent, not booked revenue. The claim this note defends is structural, that recovery capability and data moats are different things and the high-capability band is thin, and it does not depend on any single one of these numbers being the last word.
What the map is really saying
The honest version of the pitch is not that the archive is large, though it is. Plenty of large things are not businesses. The version that holds up is narrower: the archive is large, its value is undiminished by age, freeing it is a data-recovery problem, and the people best positioned by ownership are worst positioned by incentive and skill to do the freeing. Draw the players on the two axes that matter and the room that a focused challenger needs is visibly there. That is the whole case, and it is why we spend our days on the model even though this note was about the market. The market is what makes the model worth building.
References
[1] Koroteev, D., and Tekic, Z. Artificial Intelligence in Oil and Gas Upstream: Trends, Challenges, and Scenarios for the Future. Energy and AI 3 (2021), 100041. A survey of where machine learning already pays in upstream oil and gas, framing why usable subsurface data is worth recovering. https://www.sciencedirect.com/science/article/pii/S2666546820300410
[2] McDonald, A. Using the missingno Python Library to Identify and Visualise Missing Data Prior to Machine Learning. Towards Data Science (2021). A walkthrough on the Xeek/FORCE 2020 dataset of 118 Norwegian Sea wells showing how patchy even digitised, curated log data is before modelling. https://towardsdatascience.com/using-the-missingno-python-library-to-identify-and-visualise-missing-data-prior-to-machine-learning-ba7a9a17b0a2