Building a Synthetic Data Factory When Labels Do Not Exist

The generator behind VeerNet, the encoder-decoder EarthScan uses to lift curves off scanned raster well logs, is often described as a preprocessing step. That framing is wrong in a way that costs you. A preprocessing step is a thing you run once and forget. This generator was never that. It was the production tool the entire model stood on, because there was no labelled corpus of scanned logs to learn from, so every training instance the network ever saw came out of it. When the tool that manufactures all of your training data is a script someone wrote in an afternoon, you find out the hard way, the first time you have to run it again. When it is built like a factory line, with throughput, a fixed build order, stop criteria per stage, and versioned output, you can run it a second and a third time without archaeology. This note is about that second framing.

Two neighbouring questions it does not answer: why the labels were missing, and how the randomisation axes were chosen. Those are design questions about the content of each sample. This is a question about the machine that stamps them out and the order you run it in.

The single machine on the line: render, then emit the mask

Everything else is scaffolding around one loop. For each instance the generator draws a synthetic log image from known equations and, in the same pass, emits the pixel-perfect segmentation mask. It can do that because it drew every curve from an equation it already holds, so it knows the class of every pixel it just placed. That is the whole primitive: render an image, emit its mask, from one call. This collapses two production steps into one. A generator that only produced images would owe a separate annotation pass on every sample, and that pass is exactly the human bottleneck synthetic data exists to remove. Because our loop emits the label as a byproduct of drawing, the marginal cost of one more perfectly labelled instance is one more call, not one more call plus a person. That is what makes throughput a real number rather than a wish.
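A minimal sketch of that primitive, under assumptions we should state plainly: the function name `render_with_mask`, the class ids, the one-pixel stroke, and the Gaussian noise are all illustrative, not the generator's actual code. The point it shows is only the shape of the call: the same row and column indices that place the ink also write the label.

```python
import numpy as np

BACKGROUND = 0  # illustrative class id; curves are labelled 1, 2, ...


def render_with_mask(curve_fns, width, height, rng, noise=0.05):
    """Draw each curve from its equation and label the same pixels in one pass."""
    image = np.ones((height, width), dtype=np.float32)        # blank white page
    mask = np.full((height, width), BACKGROUND, dtype=np.uint8)
    cols = np.arange(width)
    t = cols / max(width - 1, 1)                              # position along the log, 0..1
    for class_id, fn in enumerate(curve_fns, start=1):
        amplitude = np.clip(fn(t), 0.0, 1.0)                  # normalised curve value
        rows = np.round(amplitude * (height - 1)).astype(int)
        image[rows, cols] = 0.0                               # place the ink...
        mask[rows, cols] = class_id                           # ...and its label, same indices
    # Degrade the image only; the mask stays exact. Where curves cross,
    # the later curve wins in both image and mask.
    noisy = image + rng.normal(0.0, noise, size=image.shape)
    return np.clip(noisy, 0.0, 1.0).astype(np.float32), mask


rng = np.random.default_rng(0)
curves = [
    lambda t: 0.25 + 0.10 * np.sin(6 * np.pi * t),   # hypothetical curve equations
    lambda t: 0.75 + 0.10 * np.cos(10 * np.pi * t),
]
image, mask = render_with_mask(curves, width=3200, height=480, rng=rng)
```

A real renderer would draw thicker strokes, grid lines, and scan artefacts, but none of that changes the contract: one call in, one image and one mask out.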

Each unit is an image plus its mask at a deliberately variable size: 3,200 to 12,800 pixels wide and 480 to 640 pixels tall, with a constant 2 curves per log and 3 classes. The variable dimensions force the downstream training code to handle ragged batches, but from the generator's side they are just parameters the loop reads before it draws.
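As a sketch of what "just parameters" can look like, the snippet below draws one unit's specification before anything is rendered. The ranges, curve count, and class count are the figures quoted above; the names `UnitSpec` and `sample_spec`, and the choice of a uniform draw, are assumptions for illustration, since the sampling distribution is not something this note records.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class UnitSpec:
    """Parameters the loop reads before it draws one image-plus-mask unit."""
    width: int
    height: int
    n_curves: int = 2
    n_classes: int = 3


def sample_spec(rng):
    # Uniform sampling is an assumption; only the ranges come from the text.
    return UnitSpec(
        width=rng.randint(3200, 12800),   # randint is inclusive at both ends
        height=rng.randint(480, 640),
    )


spec = sample_spec(random.Random(0))
```

Keeping the spec as an immutable value means it can be logged next to the unit it produced, which is what later lets a batch be rerun rather than reconstructed.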

Build order is a decision, not an accident

A line has stages, and their order is a choice you make on purpose. Ours ran in three, each reusing the machine built for the one before it.

The first stage was a binary run stopped at 2,000 instances, small on purpose. The job of the first batch was not to train a shippable model; it was to shape the loop and prove the central claim, that the mask really does fall out of the same call that draws the curve, on the simplest target of one curve against background. You do not build the whole line and then discover the emitter is subtly misaligned with the render. You build the smallest batch that can falsify that, and you stop.

The second stage moved to multiclass, ran the same loop on the two-curve three-class target, and stopped at 20,000 logs. That was the December 2021 batch. The jump from two thousand to twenty thousand is the moment the tool stops being a proof and starts being a factory: an order of magnitude more output from the same machine, with nothing rewritten except the parameters saying how many classes to draw and how many passes to run.

The third stage produced the corpus that shipped: 15,000 curated multiclass curves, the v2 final set. The shipped number is smaller than the batch before it, and that is not a regression; it is what curation looks like on a line. The 20,000-log batch taught us which rendered configurations were pulling their weight and which were near-duplicates padding the count, and the v2 run emitted a tighter 15,000 that was more useful than the larger set it succeeded. Throughput on a data factory is not a race to the biggest number. It is the ability to run the loop enough times that you can afford to throw some output away.
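The three stages can be written down as data rather than as three programs. The sketch below does that; the counts, curve numbers, and class numbers are the ones given above, while the `Stage` type and the stage names are our own labels for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    """One run of the line: what to draw and when to stop."""
    name: str
    n_curves: int
    n_classes: int
    stop_count: int


# Build order is the order of this tuple; nothing here is code that draws.
BUILD_ORDER = (
    Stage("v0-binary", n_curves=1, n_classes=2, stop_count=2_000),
    Stage("v1-multiclass", n_curves=2, n_classes=3, stop_count=20_000),
    Stage("v2-curated", n_curves=2, n_classes=3, stop_count=15_000),
)

for stage in BUILD_ORDER:
    print(f"{stage.name}: {stage.n_curves} curve(s), "
          f"{stage.n_classes} classes, stop at {stage.stop_count:,}")
```

Read this way, the move from the second stage to the third changes one number and a curation policy, not the machine.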

Exhibit: the synthetic-data generator read as a factory line rather than a one-off script. The left rail is the build order: three ordered stages, each one a render-and-emit-mask loop closed by an explicit stop criterion, and each reusing the emitter written for the stage below it. Stage v0 stops the binary run at 2,000 instances to shape the loop and prove the mask falls out of the same pass that draws the curve; stage v1 stops the two-curve multiclass run at 20,000 logs (the December 2021 output batch); stage v2 stops the curated run at 15,000 multiclass curves, the corpus that shipped. In the interactive version, picking a stage on the rail and dragging the throughput lever runs the loop for more passes. The teal column is the units emitted so far; the orange line is the stage's fixed stop criterion, and the column meeting it is the batch closing and getting versioned. Each unit is one image plus a pixel-perfect mask at 3,200 to 12,800 pixels wide and 480 to 640 tall, with a constant 2 curves and 3 classes per log. The batch counts, dimensions, curve count, and class count are sourced from the engagement archive; the stage ordering and the stop-criteria framing are how the pipeline was actually run as reusable tooling, not new numbers.

Stop criteria are what make a batch a batch

The exhibit makes the part that is easy to skip visible: every stage has an explicit stop count, and that count is what turns an open-ended run into a versioned artifact. Without a stop criterion, a generator just emits samples until someone gets bored and kills it, and the corpus you train on is defined by that arbitrary moment. With one, the batch has a boundary you chose in advance, which gives it an identity: the 2,000-instance binary set, the 20,000-log December set, the 15,000-curve v2 corpus. You can point at each one, rerun it, diff it against the next. That versioning is the difference between a data factory and a data accident. When a later training run behaves strangely, you can ask which corpus it saw, and the answer is a name, not a shrug.
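One way to make "a name, not a shrug" concrete is to have the run itself return a manifest when it hits its stop count. The sketch below is hypothetical throughout: `run_batch`, the manifest fields, the seed, and the `toy_unit` stand-in for a rendered image and mask are all assumptions, not the archive's actual format.

```python
import hashlib
import random


def run_batch(name, stop_count, seed, make_unit):
    """Run the loop until the stop count is met, then stamp a manifest."""
    rng = random.Random(seed)
    digest = hashlib.sha256()
    emitted = 0
    while emitted < stop_count:        # the boundary chosen in advance
        unit = make_unit(rng)          # bytes standing in for image plus mask
        digest.update(unit)            # fold every unit into the batch fingerprint
        emitted += 1
    return {
        "name": name,
        "count": emitted,
        "seed": seed,
        "sha256": digest.hexdigest(),
    }


def toy_unit(rng):
    # Placeholder for a real render-and-emit-mask call.
    return bytes(rng.randrange(256) for _ in range(16))


manifest = run_batch("v0-binary", stop_count=2_000, seed=7, make_unit=toy_unit)
rerun = run_batch("v0-binary", stop_count=2_000, seed=7, make_unit=toy_unit)
print(manifest["count"], manifest["sha256"] == rerun["sha256"])
```

With the same name, seed, and stop count, a rerun produces the same fingerprint, and a different fingerprint is the diff telling you the corpus changed.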

Why the line pays off: the loop is reusable, the batches are not

The quiet return is that the expensive thing, the render-and-emit-mask loop, was written once and reused three times, while the cheap thing, the specific batch, was disposable. A script bakes the batch into the code, so the 2,000-instance run and the 15,000-curve run would be different programs, and moving between them means editing and re-testing the machine every time. A line separates the machine from the run: the machine is a fixed, trusted asset and the run is a set of parameters and a stop count. We changed classes, dimensions, curve count, and batch size across the three stages without rewriting the emitter, because it never encoded any of those. It rendered whatever equations it was handed and emitted the mask for whatever it drew.

That reuse is what would let a fourth stage happen cheaply. If a new field showed up with logs the current corpus did not cover, the line already exists: pick the parameters, set a stop count, run the loop, stamp a version. The generator is built; the remaining work is deciding what the next batch should contain, a data question, not an engineering one. A one-off script gives you none of that. It leaves you with a corpus and a maintenance burden, and the next time you need training data you start over.

The line, in brief

The generator is not a preprocessing step but the production tool the whole model stands on, because with no labelled scans every training instance came from it. Build it as a factory line and you can rerun it; build it as a script and you rebuild it.
The single machine is a render-and-emit-mask loop: it draws each curve from a known equation and emits the pixel-perfect mask in the same call, so the label is a byproduct of drawing and the marginal cost of one labelled instance is one call, not one call plus a person.
Build order is a decision. Ours was a 2,000-instance binary run to prove the loop, then a 20,000-log two-curve multiclass batch in December 2021, then a curated 15,000-curve v2 corpus, each stage reusing the emitter written for the one below it.
Stop counts turn open-ended runs into versioned batches with names you can rerun and diff. The shipped v2 corpus being smaller than the 20,000-log batch is curation, not regression: throughput is what lets you afford to throw output away.
Each unit is an image plus its mask at 3,200 to 12,800 pixels wide and 480 to 640 tall, with a constant 2 curves and 3 classes per log. The loop never encoded those; they are parameters, which is why the same machine served all three stages.

Limitations

This is how one generator was run across one engagement, not a general recipe. The batch counts, the output dimensions, the curve count, and the class count are the real archive figures, but the ladder the exhibit draws between them is a schematic of the build order, not a logged emission-rate trace: the instrument shows cadence, not passes per second. The stop counts were fit to this task and do not transfer as constants; a different digitisation target would want different batch sizes and a different curation ratio. We have also said nothing about whether the rendered content was any good, whether the randomisation covered the real field failure modes, or whether a model trained on this corpus produced usable curves on scans it never saw. Those questions belong to the design of each sample, not to the line that stamped them out.

What the line habit leaves behind

The habit this left us with is to treat any generator that produces all of a model's training data as a production tool from the first commit, not a step to tidy up later. That means a loop you can point at, a build order you chose on purpose, a stop criterion per stage so each batch has a name, and versioned output you can rerun and diff. None of it is elaborate. It is the ordinary discipline of a line: the same machine, run in a known order, stopped on purpose, stamped and kept. What it buys is that the second time you need synthetic data, and there is always a second time, you run the factory instead of rebuilding it.
