Indexing a 136,771-Scan Raster Archive for ML Ingestion

A model is the last thing you build, not the first. Before VeerNet could read a single scanned well log, someone had to answer a more basic question: what is actually in the archive, and which of it can a training pipeline even read? For a Texas onshore operator we partnered with, the raw material was the public Texas Railroad Commission record, a corpus of scanned paper logs measured in the hundreds of thousands of files. None of it was addressable. There was no index, no deduplication, and no way to tell an imagery file from its digital twin without opening it. The first deliverable of the engagement was not a network. It was a catalog.

At a glance

Three numbers frame the substrate problem and the imbalance it exposed.

136,771 scanned TIF rasters: the imagery side of the archive
7,781 digital LAS files: the only files with vector ground truth
17.6 : 1 imagery-to-vector ratio: why hand-labelling could never close the gap

The archive is not a dataset until you index it

A pile of files on disk is not a dataset. The Texas RRC holdings we drew from contained 136,771 scanned TIF rasters against 7,781 digital LAS files. A TIF is an image of a printed log, a picture of ink on paper. A LAS file is the same log already digitized into depth-indexed numeric curves. The two formats are not interchangeable. One is what a person reads, the other is what a machine reads, and an ML pipeline needs to know which is which before it can do anything at all.

At the scale of an operator archive, that distinction is not obvious from the outside. Files arrive with inconsistent names, mixed casing, scanner-assigned identifiers, and the occasional format mismatch where a file's extension lies about its contents. A pipeline that trusts the filename will silently feed a corrupt scan into training and never know. So the first job was a census: open every file, confirm its real format, record its dimensions, and assign it a stable identifier that the rest of the pipeline could refer to without ever opening the file again. The output of that pass is the catalog, and the catalog is what makes the corpus addressable.
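
A census pass of this kind can be sketched in a few lines. The version below is a minimal illustration under stated assumptions, not the engagement's actual tooling: it classifies content from its leading bytes rather than its extension, reads width and height from the first image directory of a classic TIFF, and derives an identifier from a SHA-256 content hash. The function names and the 16-character identifier length are hypothetical.

```python
import hashlib
import struct

# Classic TIFF files start with a byte-order mark followed by the number 42.
TIFF_MAGICS = (b"II*\x00", b"MM\x00*")


def sniff_format(data: bytes) -> str:
    """Classify content by its leading bytes, ignoring the file extension."""
    if data[:4] in TIFF_MAGICS:
        return "tif"
    # LAS is plain text; after optional '#' comment lines, the first
    # section header is the version section, which begins with '~V'.
    for line in data[:4096].decode("latin-1").splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            return "las" if line.upper().startswith("~V") else "unknown"
    return "unknown"


def tiff_dimensions(data: bytes):
    """Return (width, height) from the first IFD of a classic TIFF, else None."""
    if data[:4] not in TIFF_MAGICS:
        return None
    endian = "<" if data[:2] == b"II" else ">"
    dims = {}
    try:
        (ifd_offset,) = struct.unpack_from(endian + "I", data, 4)
        (entry_count,) = struct.unpack_from(endian + "H", data, ifd_offset)
        for i in range(entry_count):
            entry = ifd_offset + 2 + 12 * i
            tag, field_type = struct.unpack_from(endian + "HH", data, entry)
            if tag in (256, 257):  # ImageWidth, ImageLength
                # Type 3 is a 16-bit SHORT; otherwise assume a 32-bit LONG.
                code = "H" if field_type == 3 else "I"
                (dims[tag],) = struct.unpack_from(endian + code, data, entry + 8)
    except struct.error:
        return None  # truncated or corrupt header
    if 256 in dims and 257 in dims:
        return dims[256], dims[257]
    return None


def catalog_entry(data: bytes) -> dict:
    """Build one catalog row from raw file bytes (hypothetical schema)."""
    fmt = sniff_format(data)
    return {
        "id": hashlib.sha256(data).hexdigest()[:16],  # stable across renames
        "format": fmt,
        "dimensions": tiff_dimensions(data) if fmt == "tif" else None,
    }
```

Because the identifier is derived from content, renaming a file does not change it, and a file whose extension says `.tif` but whose bytes are LAS text is catalogued as `las`.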

Indexing legacy well-log records is a recognised data-engineering problem in its own right, not a preprocessing afterthought. The morphological and gridline-elimination work on raster logs treats the scanned image as a first-class artefact to be parsed and preserved, rather than a throwaway input to a model [1]. We took the same stance one layer earlier: before extraction, the archive itself has to be inventoried, classified, and made queryable.

Figure: The data-engineering substrate beneath VeerNet. The public Texas RRC archive holds 136,771 scanned TIF rasters against just 7,781 digital LAS files, a roughly 17.6-to-1 imagery-to-vector imbalance. Only a TIF with a matching LAS vector file could ever serve as paired ground truth. The figure's coverage slider sweeps that frontier: even at the true LAS limit, the orange matched-ground-truth band stays a sliver of a teal field. That sliver is the entire reason a 20,000-log synthetic corpus had to be manufactured rather than hand-labelled. The file counts, the ratio, and the synthetic corpus size are the engagement's own numbers; the grid is a proportional abstraction (each cell stands for a block of files), and the cell placement and swept frontier are illustrative geometry.

The 17.6-to-1 imbalance, measured rather than assumed

The headline finding of the indexing pass was the imbalance, and the value of measuring it is that we stopped guessing. The ratio of scanned imagery to digital vector files came out at roughly 17.6 to 1: 136,771 rasters against 7,781 LAS files. That single number reframed the entire project.

Here is why it matters. A supervised digitization model learns from paired data, a raster input and a vector target that says what the correct curves are. The only files that can supply that target are the LAS files, because they already hold the numeric curves. So the absolute ceiling on naturally paired training data is 7,781 examples, and that is the optimistic case where every LAS file has a matching scan, which it does not. The real overlap is smaller. The imbalance is not a nuisance to be cleaned up. It is the governing constraint of the whole effort, and we could only act on it once the index made it visible.
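
The arithmetic behind that ceiling is short enough to check directly. The sketch below uses only the two counts from the census; the variable names are illustrative, and the paired share is an upper bound that assumes every LAS file has exactly one matching scan.

```python
TIF_COUNT = 136_771  # scanned rasters, from the census
LAS_COUNT = 7_781    # digital vector files, from the census

# Imagery-to-vector ratio.
ratio = TIF_COUNT / LAS_COUNT

# Upper bound on the share of rasters that could ever be paired:
# assumes every LAS file has exactly one matching scan.
max_paired_share = LAS_COUNT / TIF_COUNT

print(f"imagery-to-vector ratio: {ratio:.1f} : 1")
print(f"rasters that could ever be paired: at most {max_paired_share:.1%}")
```

Even under that optimistic assumption, roughly 5.7% of the rasters could have a vector target, and the real overlap sits below that bound.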

The grid above is the imbalance made concrete. Each cell stands for a block of scanned rasters, and the orange band is the slice that could ever be paired with a vector file. Sweep the coverage slider to the true ceiling and the paired band is still a thin sliver of the field. That picture is the argument for everything that followed.

Deduplication: the same log wearing three names

A public archive accumulated over decades carries duplicates, and duplicates poison a machine-learning corpus in a specific way. If the same physical log appears three times under three scanner identifiers, and one copy lands in training while another lands in validation, the model gets to study the exam before it sits it. The validation score looks excellent and means nothing. So deduplication is not housekeeping. It is a correctness requirement for any honest evaluation.

Deduplication at this scale is harder than comparing filenames, because the duplicates do not share filenames. A log can be rescanned at a different resolution, cropped differently, or saved with different compression, so two files that are visually the same log are bit-for-bit different on disk. We had to detect duplicates by content and by metadata rather than by name: matching on well identifiers where present, on image fingerprints where not, and flagging near-duplicates for review rather than deleting them blind. The result was a deduplicated index where each distinct log is represented once, with its variant scans linked rather than scattered. Only after that pass could we draw a train/validation split that actually holds out unseen logs.
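
One common way to fingerprint images by content is an average hash, sketched below as an illustration. The source does not say which fingerprint the engagement used, so the technique, the 8x8 grid, and the review threshold are all assumptions. An average hash survives rescaling and recompression reasonably well, but a differently cropped scan would need a more robust method.

```python
def average_hash(pixels, grid=8):
    """Fingerprint a grayscale image as grid*grid bits.

    `pixels` is a list of rows of 0-255 values; the image must be at
    least `grid` pixels in each direction.
    """
    height, width = len(pixels), len(pixels[0])
    cells = []
    for r in range(grid):
        y0, y1 = r * height // grid, (r + 1) * height // grid
        for c in range(grid):
            x0, x1 = c * width // grid, (c + 1) * width // grid
            block = [pixels[y][x] for y in range(y0, y1) for x in range(x0, x1)]
            cells.append(sum(block) / len(block))
    mean = sum(cells) / len(cells)
    bits = 0
    for value in cells:
        # One bit per cell: is this cell brighter than the image average?
        bits = (bits << 1) | int(value > mean)
    return bits


def hamming(a: int, b: int) -> int:
    """Count the bits that differ between two fingerprints."""
    return bin(a ^ b).count("1")


def compare(a: int, b: int, review_threshold: int = 6) -> str:
    """Label a pair; near matches are flagged for review, never auto-deleted.

    The threshold is a hypothetical value that would need tuning on real scans.
    """
    distance = hamming(a, b)
    if distance == 0:
        return "duplicate"
    return "review" if distance <= review_threshold else "distinct"
```

Matching on well identifiers, where they exist, would run before this step; the fingerprint only covers the files that have no usable metadata.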

What the index and dedup pass delivered

A content-verified census: every file opened, its true format and dimensions recorded, and a stable identifier assigned, so the rest of the pipeline refers to files without re-opening them and never trusts a filename that lies.
The 17.6-to-1 imagery-to-vector ratio measured rather than assumed: 136,771 TIF rasters against 7,781 LAS files, which capped naturally paired training data at no more than 7,781 examples.
Content-based deduplication so the same log under different scanner identifiers cannot leak across the train/validation split, which is the difference between an honest validation score and a meaningless one.
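
The leak-free split described in the last item can be sketched as a split keyed on the deduplicated log rather than on the file. This is a minimal illustration, not the engagement's code: the function names, the identifier strings, and the 10% validation fraction are assumptions.

```python
import hashlib


def assign_split(log_id: str, val_fraction: float = 0.1) -> str:
    """Deterministically place a deduplicated log in 'train' or 'val'."""
    digest = hashlib.sha256(log_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return "val" if bucket < val_fraction else "train"


def split_files(file_to_log: dict) -> dict:
    """Give every file the split of the log it belongs to.

    `file_to_log` maps a file identifier to its deduplicated log identifier,
    so all variant scans of one log land on the same side.
    """
    return {file_id: assign_split(log_id) for file_id, log_id in file_to_log.items()}


# Three scans of the same physical log, plus one unrelated log.
splits = split_files({
    "scan-a": "log-001",
    "scan-b": "log-001",
    "scan-c": "log-001",
    "scan-d": "log-002",
})
```

Hashing the log identifier keeps the assignment stable as the archive grows: adding new files never moves an existing log from one side of the split to the other.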

Making a raster ML-ingestible

A scanned TIF that passes the census is still not ready for a training loop. Real archive scans are not uniform. In this corpus the logs varied enormously in geometry, and a pipeline that assumes a fixed input shape breaks on the first oddly sized scan. We recorded the dimensions of every raster during indexing precisely so the downstream loader could plan for the spread instead of crashing on it. The synthetic logs we later rendered to match this archive spanned widths from 3,200 to 12,800 pixels and heights from 480 to 640 pixels, a range taken directly from what the index told us the real scans looked like.

Variable geometry is not a cosmetic detail. It is the reason a naive data loader cannot stack scans into a batch, because you cannot put images of different sizes into a single tensor without a deliberate padding strategy. That constraint, surfaced by the index, propagated all the way into how training batches were assembled. The substrate decisions made at indexing time set the shape of everything above them.
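
One deliberate padding strategy is to pad every image in a batch to the largest height and width present and carry a mask that marks the real pixels. The sketch below shows the idea on plain nested lists so that it runs without dependencies; a real loader would do the same thing with tensors. The function name and the pad value are assumptions, and the source does not specify which padding scheme the training pipeline used.

```python
def pad_batch(images, pad_value=0):
    """Pad 2-D images (lists of rows) to a shared height and width.

    Returns (batch, mask). The mask holds 1 over real pixels and 0 over
    padding, so a loss function can ignore the padded area.
    """
    max_h = max(len(img) for img in images)
    max_w = max(len(img[0]) for img in images)
    batch, mask = [], []
    for img in images:
        h, w = len(img), len(img[0])
        # Pad each row on the right, then add full padding rows at the bottom.
        rows = [list(row) + [pad_value] * (max_w - w) for row in img]
        rows += [[pad_value] * max_w for _ in range(max_h - h)]
        valid = [[1] * w + [0] * (max_w - w) for _ in range(h)]
        valid += [[0] * max_w for _ in range(max_h - h)]
        batch.append(rows)
        mask.append(valid)
    return batch, mask
```

With widths spread across a wide range, padding everything to the widest scan wastes memory, which is why recording dimensions at indexing time matters: the loader can group scans of similar width before padding.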

There is a second, quieter dimension to ingestibility, which is data quality. A LAS file can be present and still be unusable, with missing curves, null-filled depth ranges, or gaps where a tool stopped recording. Before any of these vector files could serve as ground truth, they had to be profiled for completeness, because a target curve riddled with nulls teaches the model the wrong lesson. Profiling missing data before training is a standard discipline in petrophysical machine learning for exactly this reason, and the tooling to visualise where a well-log is incomplete is well established [2]. We applied that profiling to the LAS side of the archive so that the small set of paired examples we did have was at least clean.
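
A completeness profile for a single curve can be as simple as the fraction of null samples and the longest unbroken null run. The sketch below is an assumption-laden illustration rather than the profiling actually used: it takes an already-parsed list of samples and treats -999.25, a common LAS null sentinel, as missing. In practice the null value is declared in each file's header and should be read from there.

```python
import math

LAS_NULL = -999.25  # common sentinel; the real value is declared per file


def profile_curve(values, null_value=LAS_NULL):
    """Summarise missing data in one depth-indexed curve."""
    if not values:
        return {"samples": 0, "null_fraction": 1.0, "longest_null_run": 0}
    is_null = [
        v is None or math.isnan(v) or math.isclose(v, null_value)
        for v in values
    ]
    longest = current = 0
    for flag in is_null:
        # Track the longest consecutive stretch of missing samples.
        current = current + 1 if flag else 0
        longest = max(longest, current)
    return {
        "samples": len(values),
        "null_fraction": sum(is_null) / len(values),
        "longest_null_run": longest,
    }
```

The longest run matters as much as the overall fraction: a curve with scattered single nulls can often be used, while one with a long gap where the tool stopped recording cannot serve as a target over that interval.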

Why the substrate forced the synthetic decision

Everything above leads to one conclusion, and the index is what made it unavoidable. The natural ceiling on paired training data was under 8,000 examples, the real overlap after deduplication was smaller still, and a segmentation model that has to read faded, overlapping, multi-curve scans needs far more labelled examples than that to generalize across the messiness of a real archive. Hand-labelling the imbalance away was never an option, because tracing curves through a degraded multi-thousand-pixel scan is a multi-interpreter-year effort, and the traced labels are themselves noisy and inconsistent.

So we manufactured the ground truth instead. The synthetic-data pipeline that rendered logs with known masks produced a 20,000-log corpus, more training examples than the entire LAS side of the archive could ever have supplied, with labels that are exact by construction. That decision is usually told as a modelling story. It is really a data-engineering one. We did not choose synthetic data because it was elegant. We chose it because the index told us, in a single measured ratio, that the paired data did not exist at the volume the problem demanded.

This is the broader pattern across subsurface AI work, where the gating factor is rarely the model and almost always the data: getting it versioned, getting it clean, and getting it into a form a pipeline can consume [3]. The archive census is where that work starts.

Why every legacy archive hides the same census

The specific numbers belong to one Texas archive, but the substrate is the same for every operator sitting on decades of scanned logs. The pattern recurs without fail. The imagery vastly outnumbers the vectors, the files are not addressable, duplicates lurk under inconsistent names, and the geometry is anything but uniform. None of that is visible until someone builds the index, and none of the modelling that an operator actually wants is possible until it exists.

The honest limit is that an index does not create ground truth. It tells you precisely how little of it you have, which is exactly what an operator needs to know before committing to a labelling campaign that cannot finish or a synthetic pipeline that can. For this engagement the index pointed straight at synthetic data, and the 20,000-log corpus that followed is the direct answer to a 17.6-to-1 ratio that, until we measured it, nobody had put a number on. The catalog is unglamorous work. It is also the work everything else was waiting on.

References

[1] Yuan, B. and Yang, Q. (2019). Digitization of Well-Logging Parameter Graphs Based on Gridlines-Elimination Approach. Journal of Petroleum Exploration and Production Technology. http://www.jsoftware.us/show-409-JSW15423.html
[2] McDonald, A. (2021). Using the missingno Python library to Identify and Visualise Missing Data Prior to Machine Learning. Towards Data Science. https://towardsdatascience.com/using-the-missingno-python-library-to-identify-and-visualise-missing-data-prior-to-machine-learning-34c8c5b5f009
[3] Koroteev, D. and Tekic, Z. (2021). Artificial intelligence in oil and gas upstream: Trends, challenges, and scenarios for the future. Energy and AI. https://www.sciencedirect.com/journal/energy-and-ai
