The most useful hour we spent on the raster-log project was not spent training anything. It was spent counting files. Before a single model touched the data, we scoped the public Texas regulatory well-log archive and wrote down two numbers that ended up shaping every decision after them: the archive holds 136,771 scanned raster logs and only 7,781 machine-readable LAS files [2]. Those two counts describe the same field of wells, and the distance between them is the whole subject of this piece. The data-centric-AI movement taught the field to treat the dataset as the first thing to interrogate rather than the last [1]; this is what that interrogation looked like on a real subsurface corpus, and why the answer it returned made a synthetic dataset unavoidable.
Counting the files before believing the dataset
It is tempting to treat a large public archive as obviously enough data, because large is the word everyone reaches for. A hundred and thirty-six thousand logs sounds like a feast. The discipline the data-centric framing presses on you is to refuse that intuition and ask the narrower question instead: how many of these files are in the form my model can actually consume [1]? For supervised learning on well logs, the consumable form is the LAS file, a depth-indexed numeric series where each curve is already numbers. A widely used teaching slice makes that shape vivid: 118 Norwegian Sea wells described by 22 measurement columns, dense, tabular, and ready to model [3]. Held to that bar, the Texas archive is not a feast at all. The 7,781 LAS files are the feast. The 136,771 TIF scans are pictures of paper, and to a model a picture of a curve is not a curve.
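The counting step itself is small enough to sketch. What follows is a minimal illustration, not the script we ran: it assumes the archive has been mirrored to a local directory, and the suffix-to-modality mapping and the function name `count_by_modality` are hypothetical choices made for the example.

```python
from collections import Counter
from pathlib import Path

# Hypothetical mapping from file suffix to modality. A real archive may use
# other suffixes or naming schemes, so treat this table as an assumption.
MODALITY_BY_SUFFIX = {
    ".las": "vector",   # depth-indexed numeric curves a model can read
    ".tif": "raster",   # scanned images of paper logs
    ".tiff": "raster",
}

def count_by_modality(root):
    """Walk `root` recursively and count files per modality."""
    counts = Counter()
    for path in Path(root).rglob("*"):
        if path.is_file():
            modality = MODALITY_BY_SUFFIX.get(path.suffix.lower(), "other")
            counts[modality] += 1
    return counts
```

Calling `count_by_modality` on a local mirror returns a `Counter` keyed by `"vector"`, `"raster"`, and `"other"`. The design choice that matters is grouping by what the model can consume rather than reporting one headline total.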
So the honest size of the trainable dataset was 7,781, not 136,771, and the ratio between the two is the finding. About 17.6 scanned logs exist for every machine-readable one (136,771 / 7,781 ≈ 17.58). We did not engineer that ratio and it is not a quirk of Texas; it is the ordinary shape of legacy subsurface data, which accumulated over decades of regulatory filings that only ever had to be legible to a human reader. The scan preserves the human-readable picture perfectly and the machine-readable curve not at all.
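The arithmetic behind that ratio fits in a few lines. This is a sketch using only the two counts quoted above; the readable-share figure additionally assumes the two counts are disjoint sets of files, which the inventory does not itself guarantee.

```python
RASTER_SCANS = 136_771  # scanned TIF logs in the archive [2]
LAS_FILES = 7_781       # machine-readable LAS files in the archive [2]

def raster_to_vector_ratio(raster, vector):
    """Number of scanned logs that exist for every machine-readable one."""
    return raster / vector

ratio = raster_to_vector_ratio(RASTER_SCANS, LAS_FILES)

# Assumption: the two counts are disjoint, so their sum is the total file stock.
readable_share = LAS_FILES / (RASTER_SCANS + LAS_FILES)

print(f"{ratio:.2f} scans per LAS file")
print(f"{readable_share:.1%} of files are machine-readable")
```

Under that assumption, only about one file in nineteen is in a form a model can read directly.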
The picture the inventory actually makes
We drew the inventory to scale rather than describing it, because the gap is the kind of thing the eye understands faster than a sentence does. The instrument above sizes each stock of data so that equal area means equal record count, with no log-scaling to flatter the small numbers. The orange block is the LAS files, the real data a model can read, and it is a sliver. The large teal block is the raster scans, real data the model cannot yet read at all. Sweep the scope across and the read-out keeps the live raster-to-vector ratio in view; it sits at about 17.6 to one wherever you stand. The point the drawing argues is the point counting the files forced on us: the abundant data is the unreadable kind, and the readable kind is rare.
That is also why the third block exists. The synthetic corpus we generated, 15,000 curves, is drawn on the same scale beside the two real stocks, and the comparison is the argument of the whole post. We did not generate synthetic logs because we preferred them to real ones. We generated them because there were only 7,781 real readable curves in the entire public archive, and that was not enough labelled signal to train a digitiser that could then go on to digitise the other 136,771.
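Equal area for equal record count has a simple consequence for the drawing: if each stock is drawn as a square, its side must grow with the square root of its count, not with the count itself. A minimal sketch of that sizing rule follows; the square shape, the function name `square_sides`, and the `area_per_record` parameter are assumptions for illustration, not a description of how the figure was built.

```python
import math

COUNTS = {
    "raster scans": 136_771,
    "LAS files": 7_781,
    "synthetic curves": 15_000,
}

def square_sides(counts, area_per_record=1.0):
    """Side length per stock so that square area == count * area_per_record."""
    return {name: math.sqrt(n * area_per_record) for name, n in counts.items()}

sides = square_sides(COUNTS)

# The side ratio is the square root of the count ratio, so the raster square
# is only about four times as wide as the LAS square while covering roughly
# 17.6 times its area.
side_ratio = sides["raster scans"] / sides["LAS files"]
```

Scaling side lengths directly by count would exaggerate the gap, and a log scale would hide it; the square-root rule is what keeps the areas honest.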
Why scarce real data made synthesis the only road
There is a clean version of the reasoning, and it is worth stating without hedging. Recovering a curve from a scan is a dense pixel-labelling problem, and the lineage of networks built for it, the U-Net family of encoder-decoders with skip connections, was designed precisely to learn dense labels from scarce data [6]. But scarce is relative, and even a sample-efficient architecture needs labelled examples that span the variation it will meet in the wild: different scales, tracks, vintages, ink densities, scan qualities, and the grid lines that look maddeningly like data. The 7,781 LAS files are depth-indexed numbers, not raster scans, so they cannot directly teach a model to read a picture; and hand-labelling enough of the 136,771 scans to cover that variation is the very manual, error-prone work that classical curve-extraction pipelines have always struggled to scale [4]. The labelled raster data we would have needed did not exist, and creating it by hand at the required scale was not realistic.
Synthetic data is the move that dissolves the bottleneck, and the survey literature is clear about the condition under which it works: generated data substitutes for collected data only when the generator captures the variation that matters, and stops helping when it does not [5]. So this was not a shortcut. We built a generator that renders synthetic raster logs we control completely, which means we own perfect pixel labels for free, and we widened its variation deliberately so the 15,000-curve corpus would transfer to real scans rather than to a clean cartoon of them. That corpus is roughly twice the size of the entire real LAS holding, and unlike the LAS files it is in the right modality: it is raster images with known curves, the exact thing a digitiser must learn to read.
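To make "perfect pixel labels for free" concrete, here is a toy sketch of the idea, not our generator: it draws one random curve onto a gridded greyscale image and returns the curve mask alongside it. Every name, size, and grey level here is a hypothetical choice, and the variation that matters for transfer (ink density, scan noise, multiple tracks, skew) is deliberately left out.

```python
import numpy as np

def render_synthetic_log(height=256, width=128, grid_step=16, seed=0):
    """Render one toy raster log plus its pixel-exact curve mask.

    Returns (image, mask, values): a uint8 greyscale image, a boolean mask
    marking curve pixels, and the curve values normalised to [0, 1].
    """
    rng = np.random.default_rng(seed)
    depth = np.arange(height)

    # Toy curve: a random walk, rescaled to span the track width.
    walk = np.cumsum(rng.normal(size=height))
    span = walk.max() - walk.min()
    values = (walk - walk.min()) / (span if span > 0 else 1.0)
    cols = np.round(values * (width - 1)).astype(int)

    image = np.full((height, width), 255, dtype=np.uint8)  # white paper
    image[::grid_step, :] = 180  # horizontal grid lines
    image[:, ::grid_step] = 180  # vertical grid lines

    # Because we place the ink ourselves, the label is known exactly.
    # One pixel per depth row; a fuller renderer would join steep segments.
    mask = np.zeros((height, width), dtype=bool)
    mask[depth, cols] = True
    image[mask] = 0  # black ink drawn over the grid
    return image, mask, values
```

Calling `image, mask, values = render_synthetic_log(seed=1)` yields an image a digitiser would see and the mask it should predict; `mask.argmax(axis=1)` recovers the curve column for each depth row. The label costs nothing because the renderer, not an annotator, decided where the ink went.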
The reflex this leaves us with
What the two numbers really taught us is a sequencing rule for any data-scarce problem. Scope the dataset in the modality your model consumes before you decide whether you have a data problem or a model problem, because the headline file count almost always overstates the trainable count. On this archive the overstatement was a factor of about 17.6, and once we had measured it the path stopped being a choice. With only 7,781 readable curves against 136,771 unreadable scans, synthetic data was not the clever option or the cheap option; it was the only supply of labelled raster examples large and varied enough to train on at all. The synthetic corpus did not replace the real archive. It was the bridge we built so a model could finally walk back across it and read the paper.
Key takeaways
- Scoping the public Texas regulatory archive in the modality a model can consume returned two numbers that drove every later decision: 136,771 scanned raster TIF logs but only 7,781 machine-readable LAS files, a raster-to-vector gap of about 17.6 to 1.
- The data-centric-AI argument (credited to Andrew Ng and collaborators) is the habit of interrogating the dataset first; applied here it shows the trainable count was 7,781, not 136,771, because to a model a picture of a curve is not a curve.
- Model-ready well data is depth-indexed vector LAS (a teaching slice is 118 wells by 22 columns); the 136,771 raster scans are pictures of paper, the ordinary shape of legacy subsurface data filed to be human-readable, never machine-readable.
- Even a sample-efficient U-Net-lineage digitiser needs labelled raster examples spanning real variation. The 7,781 LAS files are the wrong modality, and hand-labelling enough scans is the manual work classical curve-extraction pipelines never scaled, so the labelled raster data we needed simply did not exist.
- Synthesis was mandatory, not optional: we generated 15,000 synthetic raster curves with perfect labels, deliberately widened to transfer, roughly twice the real LAS holding and in the right modality. The reflex it leaves: measure the dataset in your model's modality before deciding it is a model problem.
References
[1] Ng, A. A Chat with Andrew on MLOps: From Model-centric to Data-centric AI. DeepLearning.AI (2021). The talk that reframed stuck models as a data problem first, arguing the next gain often sits in the dataset rather than the network. https://www.youtube.com/watch?v=06-AZXmwHjo
[2] Railroad Commission of Texas. Well log and digital records, public well data. Texas RRC (accessed 2023). The state regulatory archive of scanned raster well logs and a smaller set of digital LAS files for onshore Texas wells. https://www.rrc.texas.gov/resource-center/research/data-sets-available-for-download/
[3] McDonald, A. Using the missingno Python library to identify and visualise missing data prior to machine learning. Towards Data Science (2021). A tutorial over 118 Norwegian Sea wells with 22 measurement columns that makes the depth-indexed, vector-LAS shape of model-ready well data concrete. https://towardsdatascience.com/using-the-missingno-python-library-to-identify-and-visualise-missing-data-prior-to-machine-learning-34c8c5b5f009
[4] Yuan, B., and Yang, Q. Digitization of well-logging parameter graphs based on a gridlines-elimination approach. Journal of Petroleum Exploration and Production Technology (2019). A classical image-processing pipeline for pulling curves off scanned well-log graphs, and a reminder of how much manual care raster recovery has always demanded. http://www.jsoftware.us/show-409-JSW15423.html
[5] Nikolenko, S. I. Synthetic Data for Deep Learning. Springer (2021). A survey of when generated data substitutes for collected data, and the conditions under which the substitution stops working. https://arxiv.org/abs/1909.11512
[6] Ronneberger, O., Fischer, P., and Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation. MICCAI (2015). The symmetric encoder-decoder built to learn dense labels from scarce data, the lineage our digitiser sits in. https://arxiv.org/abs/1505.04597