Building a Labelled Corpus When None Exists

Most write-ups of a data problem start from bad labels. Ours started from no labels. The operator's archive, and the public Texas Railroad Commission corpus behind it, held 136,771 scanned TIF well logs, every one a raster image with no annotation attached. For a segmentation model that has to be told, pixel by pixel, which stroke belongs to which curve, that archive was not a small training set or a noisy one. As a supervised corpus its usable size was zero. The images existed; the labels never did.

That distinction changed what kind of project this was. A noisy-label problem is a cleaning problem, solved with better guidelines and more careful interpreters. A no-label problem is a supply problem, and you cannot clean your way out of a supply of nothing. We had already ruled out a hand-labelling campaign: tracing one-pixel curves through decades-old scans is slow, inconsistent between people, and would have burned interpreter-years before a single model trained. The corpus we needed was not going to be found. It was going to have to be built.

The corpus was a supply problem, not a quality problem

Once we framed it that way, the central observation was almost embarrassingly direct: if we can generate a labelled log image from a curve we drew ourselves, corpus size stops being a fixed constraint and becomes a dial. We can make 2,000 instances, or 15,000, or 20,000, and the cost of each additional thousand is render time on a machine, not a week of somebody's attention. The label is exact because we placed the curve; there is no tracing error to clean up afterwards. What had been an intractable annotation project turned into a throughput question: how many labelled instances do we need, and of what kind, before the model we want is trainable?
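To make "the label is exact because we placed the curve" concrete, here is a minimal sketch of a single-curve generator. It is an illustration under stated assumptions, not the generator used in the engagement: the function name `render_binary_instance`, the sinusoid-plus-drift curve model, and the 64 by 32 image size are all hypothetical, and it uses only the Python standard library to stay self-contained.

```python
import math
import random

def render_binary_instance(height=64, width=32, seed=0):
    """Return (image, mask) for one synthetic single-curve log track.

    image: rows of grey levels (255 = paper, 0 = ink).
    mask:  rows of 0/1 labels, written at the same moment as the ink.
    """
    rng = random.Random(seed)
    # Hypothetical curve model: a slow sinusoid plus a bounded random drift.
    phase = rng.uniform(0.0, 2.0 * math.pi)
    period = rng.uniform(20.0, 40.0)
    drift = 0.0
    image = [[255] * width for _ in range(height)]
    mask = [[0] * width for _ in range(height)]
    prev_col = None
    for row in range(height):
        drift = max(-3.0, min(3.0, drift + rng.uniform(-0.5, 0.5)))
        centre = (width / 2
                  + (width / 4) * math.sin(2 * math.pi * row / period + phase)
                  + drift)
        col = max(0, min(width - 1, round(centre)))
        # Fill between consecutive rows so the stroke stays connected.
        start = col if prev_col is None else min(prev_col, col)
        stop = col if prev_col is None else max(prev_col, col)
        for c in range(start, stop + 1):
            image[row][c] = 0   # draw the ink
            mask[row][c] = 1    # and record the label for that same pixel
        prev_col = col
    return image, mask

# Corpus size is just the length of the seed range.
corpus = [render_binary_instance(seed=i) for i in range(5)]
```

Because the ink and the label are written in the same statement pair, the mask cannot drift from the image. Raising the corpus from 5 instances to 2,000 or 20,000 changes only the seed range and the render time.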

This is the data-centric posture the field had been converging on for years. The finding that model performance keeps improving as training data grows, roughly logarithmically on the vision tasks studied, was already well established [3], and the trick of generating training data from a controllable simulator, then randomising it enough that a network trained on the synthetic distribution transfers to real inputs, had a solid track record in vision by the time we reached for it [1] [2]. We were not inventing the method. We were applying it to an archive where the alternative was not a worse dataset but no dataset at all, which is where a data factory earns its keep most clearly.

Climbing the generator, one stage at a time

We did not jump straight to the corpus we ended up with. We climbed to it, and the shape of the climb was the plan.

The first stage was deliberately small and deliberately easy: 2,000 instances of a single curve against a background, framed as a binary problem, one mask, present or absent per pixel. Two thousand is not a large corpus, and it was not meant to be. It was a proof that the generator produced images a network could actually learn from, and that the labels lined up with the pixels the way we claimed. When a model trained on those 2,000 rendered logs segmented a curve at all, the method was validated. The dial worked; now we could turn it.

The second stage was where the volume mattered. A real log track carries more than one curve, so the useful target is not one curve against background but several curves that must be told apart, including where they cross. We moved to a three-class formulation, background plus two curves, and that harder target needed far more varied examples to learn from: more crossings, more overlaps, more combinations of the defects that make a real scan hard to read. We scaled the generator to 15,000 multiclass instances. The jump from 2,000 to 15,000 was not a round-number flourish; it was the volume at which the model had seen enough distinct two-curve arrangements that the multiclass target became learnable rather than a coin toss on the curve classes.
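The three-class target can be sketched the same way. The snippet below is a hedged illustration, not the engagement's generator: the names `curve_columns` and `render_two_curve_mask` are hypothetical, and the rule that the second curve overwrites the first where they cross is an assumption, since a single-label mask has to pick one class per pixel and the text does not say which choice was made.

```python
import math
import random

BACKGROUND, CURVE_A, CURVE_B = 0, 1, 2

def curve_columns(height, width, rng):
    """Column index of one hypothetical sinusoidal curve, for every row."""
    phase = rng.uniform(0.0, 2.0 * math.pi)
    period = rng.uniform(15.0, 40.0)
    amplitude = rng.uniform(width / 8, width / 3)
    cols = []
    for row in range(height):
        centre = width / 2 + amplitude * math.sin(2 * math.pi * row / period + phase)
        cols.append(max(0, min(width - 1, round(centre))))
    return cols

def render_two_curve_mask(height=64, width=32, seed=0):
    """Return (mask, overlap): a three-class mask and the count of pixels
    where curve B was drawn over curve A."""
    rng = random.Random(seed)
    mask = [[BACKGROUND] * width for _ in range(height)]
    overlap = 0
    for label in (CURVE_A, CURVE_B):
        cols = curve_columns(height, width, rng)
        prev = cols[0]
        for row, col in enumerate(cols):
            for c in range(min(prev, col), max(prev, col) + 1):
                if mask[row][c] not in (BACKGROUND, label):
                    overlap += 1      # this pixel already belonged to the other curve
                mask[row][c] = label  # assumption: the later curve wins at crossings
            prev = col
    return mask, overlap

mask, overlap = render_two_curve_mask(seed=3)
```

Tracking `overlap` per instance is one way to check that a generated corpus actually contains the crossings and overlaps the multiclass target needs, rather than assuming that volume alone supplies them.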

The third stage tightened the dataset rather than just enlarging it. We generated 20,000 synthetic logs in a constant two-curve configuration, a cleaner and more uniform corpus aimed at the specific problem the production model had to solve. Twenty thousand exact labels, none of them traced by hand, is a corpus that simply could not have been assembled from the real archive by any amount of manual effort we were willing to spend. It exists only because generation made label volume a compute line item.

Exhibit: The training corpus did not exist until we built it. The archive held 136,771 real raster TIFs, but not one carried a pixel-level label, so its usable size as a supervised training set was zero, drawn here as the flat grey floor. Everything above that floor was manufactured: a procedural generator emitted labelled instances on demand, climbing the sourced staircase from 2,000 binary instances to 15,000 multiclass instances to a final 20,000 two-curve logs. Each teal riser is a generation decision bought with compute rather than interpreter-years. Drag the generator-output lever to raise the volume and watch the marker climb the same axis; the read-out names which sourced stage the current output clears, and the verdict flips once it passes the orange trainability line, the volume where a two-curve, three-class target became learnable. The orange line is the only element that argues: below it the corpus is a proof of concept; above it, the corpus is a multiclass dataset a segmenter can actually be trained on, with an 80/20 train-validation split and zero hand labels. The four stage volumes, the split, the class count, and the real-TIF total are sourced from the engagement archive; the trainability line is a reading of the sourced multiclass stage, not a separate measured number.

The exhibit above is that climb made draggable, and it is built to argue one thing. The flat grey floor is the 136,771 real TIFs, pinned at zero usable labelled instances, because that is what the archive offered. Everything above the floor we manufactured. Drag the generator-output lever and the marker climbs the same axis the sourced stages sit on, and the verdict on the right flips only when the output crosses the orange trainability line, the volume at which the multiclass target became learnable. Below that line the corpus is a proof of concept; above it, it is a dataset a segmenter can train on. The single orange element carries the whole point: the problem was never solved by finding data, it was solved by generating past a threshold.
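The read-out logic described above is small enough to write down. This is a sketch under one explicit assumption: that the trainability line sits at the 15,000-instance multiclass stage, which is how the exhibit reads it, not an independently measured threshold. The names `STAGES`, `TRAINABILITY_LINE`, and `read_out` are hypothetical.

```python
# Sourced stage volumes, smallest first.
STAGES = [
    (2_000, "binary proof of concept"),
    (15_000, "three-class multiclass corpus"),
    (20_000, "constant two-curve production corpus"),
]

# Assumption: the line is read off the multiclass stage, not measured separately.
TRAINABILITY_LINE = 15_000

def read_out(generated):
    """Return (highest stage cleared, verdict) for a generated-instance count."""
    cleared = [name for volume, name in STAGES if generated >= volume]
    stage = cleared[-1] if cleared else "no sourced stage cleared"
    if generated >= TRAINABILITY_LINE:
        verdict = "trainable multiclass dataset"
    else:
        verdict = "proof of concept"
    return stage, verdict

print(read_out(2_000))   # ('binary proof of concept', 'proof of concept')
print(read_out(15_000))  # ('three-class multiclass corpus', 'trainable multiclass dataset')
```

The real archive never enters this function: its 136,771 TIFs contribute zero labelled instances, so only generated volume moves the marker.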

Why the split, and why zero hand labels

We held out a standard 80/20 train-validation split across each generated corpus. That split is worth a sentence because of what it means when the labels are exact. On a hand-labelled set, a disagreement between the model and the label can be the model's error or the annotator's; the validation number is blurred by the noise in the labels themselves. On a generated set the mask is correct by construction, so a validation disagreement is a genuine model error and nothing else. The metrics mean what they say, which is a quieter benefit of manufacturing the corpus but a real one.
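The split arithmetic for the three generated corpora is worth spelling out. The helper names `split_counts` and `split_indices` below are hypothetical, and the seeded shuffle is an assumption about how a split might be drawn, not a record of how ours was.

```python
import random

def split_counts(total, train_percent=80):
    """Integer train/validation counts for a percentage split."""
    train = total * train_percent // 100
    return train, total - train

def split_indices(total, train_percent=80, seed=0):
    """Shuffle instance indices deterministically, then cut them at the split."""
    indices = list(range(total))
    random.Random(seed).shuffle(indices)
    train, _ = split_counts(total, train_percent)
    return indices[:train], indices[train:]

for size in (2_000, 15_000, 20_000):
    print(size, split_counts(size))
# 2000 (1600, 400)
# 15000 (12000, 3000)
# 20000 (16000, 4000)
```

Because every instance is regenerable from its seed, the validation rows are as exactly labelled as the training rows, which is what lets a validation disagreement be read as model error alone.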

The other number that stays fixed through all of it is the count of hand labels required to train: zero. At every stage of the climb, from the 2,000-instance proof to the 20,000-log production set, no interpreter traced a curve for the training data. Hand-labelling moved off the training path entirely and onto the validation path, where a senior petrophysicist checking model output against a handful of real scans is doing the judgement work that actually needs a human. The 136,771 real TIFs were only ever inputs at inference time. They never had to be labelled for the model to learn.

What the data factory did and did not solve

The honest boundary of this method is that it solves supply, not difficulty. Generating 20,000 exact labels turned an unlabellable archive into a trainable dataset, and that was the blocker that stops most raster-log efforts before they begin. It did not, by itself, make a thin, faded, overlapping curve easy to segment. The per-curve accuracy on the multiclass target is where the real difficulty still lives, and that difficulty comes from the images being genuinely hard, not from the corpus being small. Past the volume needed for coverage, the returns shift from generating more instances to making each generated instance a closer match to the texture of the real archive.
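One reason the difficulty shows up per curve rather than overall is that background pixels dominate a log image, so a single pixel-accuracy figure can look healthy while one curve is badly segmented. A per-class score makes that visible. The sketch below uses intersection over union as the per-class measure; that choice is an assumption for illustration, since this account does not name the metric used, and `per_class_iou` is a hypothetical helper.

```python
def per_class_iou(pred, truth, classes=(0, 1, 2)):
    """Intersection over union for each class label in two same-shaped masks.

    Returns None for a class that appears in neither mask.
    """
    scores = {}
    for k in classes:
        inter = 0
        union = 0
        for pred_row, truth_row in zip(pred, truth):
            for p, t in zip(pred_row, truth_row):
                if p == k and t == k:
                    inter += 1
                if p == k or t == k:
                    union += 1
        scores[k] = inter / union if union else None
    return scores

truth = [[0, 1, 2],
         [0, 1, 2]]
pred = [[0, 1, 2],
        [0, 2, 2]]   # one curve-1 pixel mislabelled as curve 2

scores = per_class_iou(pred, truth)
# Background is perfect (1.0) while curve 1 drops to 0.5 and curve 2 to 2/3.
```

In this toy case five of six pixels are correct, yet curve 1 scores only 0.5, which is the gap between overall accuracy and per-curve accuracy in miniature.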

But that is a problem you only get to have once the corpus exists. The reason this engagement could run at all is that we stopped treating a labelled dataset as something we had to locate and started treating it as something we could produce to spec. Corpus size became a lever, we climbed it from 2,000 to 20,000, and somewhere on that climb an unlabelled archive of 136,771 scans turned into a training set. When none exists, you build one.

Limitations

The trainability line in the exhibit is a reading of the sourced stages, not an independently measured threshold: it marks the multiclass volume at which the two-curve, three-class target became learnable in our runs, and it is captioned as such rather than presented as a general constant. The four corpus sizes, the 80/20 split, the three-class formulation, and the 136,771-TIF archive total are sourced from the engagement archive; the exhibit's train-row arithmetic is derived directly from those figures. Transfer to real scans depends on the fidelity of the generated defects, a separate body of work; this account is about corpus supply, and it does not claim that volume alone closes the sim-to-real gap. Accuracy figures for the trained models live in the companion pieces on target design and loss selection.

References

[1] Tobin, J. et al. (2017). Domain Randomization for Transferring Deep Neural Networks from Simulation to the Real World. IROS 2017. https://arxiv.org/abs/1703.06907
[2] Tremblay, J. et al. (2018). Training Deep Networks with Synthetic Data: Bridging the Reality Gap by Domain Randomization. CVPR Workshops 2018. https://arxiv.org/abs/1804.06516
[3] Sun, C. et al. (2017). Revisiting Unreasonable Effectiveness of Data in Deep Learning Era. ICCV 2017. https://arxiv.org/abs/1707.02968
