Manufacturing Ground Truth: a synthetic-data pipeline for raster well-log digitization

A deep-learning model is only as good as the labels it learns from. For raster well-log digitization the labels barely exist: an operator's archive is a wall of scanned paper, and almost none of it comes with a pixel-level annotation of where each curve runs. So before we could train VeerNet to read a scanned log, we had to manufacture the ground truth ourselves. We built a synthetic-data pipeline that renders realistic scanned-log images with controllable defects and emits pixel-perfect curve masks for free. We then trained on the synthetic data and ran inference on real scans.

At a glance

Three numbers frame the data problem and how we solved it.

136,771 scanned TIF logs in the Texas RRC public dataset
15,000 synthetic multiclass instances, with pixel-perfect masks, rendered not labelled
0 hand-labelled rasters required: synthetic train to real-TIF inference

The problem: hand-labelling raster logs does not scale

The operator, a Texas onshore producer, sits on the same asset every mature operator does: decades of well logs that exist only as scanned paper. The public Texas Railroad Commission archive alone holds 136,771 scanned TIF logs against just 7,781 digital LAS files, and the operator's private holdings follow the same ratio. The information that drives reactivation and infill decisions is on those scans. It is not queryable until someone digitizes it.

Supervised segmentation is the obvious tool. Feed a network a scanned log image, ask it to output a per-pixel mask of where each curve runs, post-process the mask into a depth-value series, export to LAS. The architecture for this is well understood, and encoder-decoder segmentation networks are a mature line of work [2]. The blocker is not the model. The blocker is the training labels.
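
The post-processing step in that workflow, turning a predicted mask into a depth-value series, can be sketched in a few lines. This is a minimal illustration rather than the pipeline's actual code: the function name `mask_to_series`, the column-centroid rule, and the linear row-to-value scaling are all assumptions, as is the convention that depth runs along the image width.

```python
import numpy as np

def mask_to_series(mask, value_min, value_max):
    """Convert a binary curve mask (height, width) into one value per column.

    Assumptions: depth runs along the width, row 0 maps to value_min and
    the last row maps to value_max. Columns with no curve pixels become NaN.
    """
    mask = np.asarray(mask, dtype=bool)
    height, _ = mask.shape
    rows = np.arange(height)[:, None]
    counts = mask.sum(axis=0)
    # Mean row index of the curve pixels in each column (0/0 gives NaN).
    with np.errstate(invalid="ignore", divide="ignore"):
        centroid = (mask * rows).sum(axis=0) / counts
    return value_min + centroid / (height - 1) * (value_max - value_min)

# Tiny hand-built mask: 5 rows, 4 depth columns, last column has no curve.
mask = np.zeros((5, 4), dtype=np.uint8)
mask[0, 0] = 1
mask[4, 1] = 1
mask[1, 2] = mask[3, 2] = 1
series = mask_to_series(mask, 0.0, 100.0)
print(series)
```

Leaving gaps as NaN rather than interpolating keeps the uncertainty visible for whoever validates the series before it is exported to LAS.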

To train a segmentation model you need paired data: the input raster, and a mask that marks every pixel belonging to each curve. Producing that mask by hand means a person opens a scanned log in an annotation tool and traces each curve, pixel by pixel, across the full depth of the log. A single log can be more than ten thousand pixels tall. Curves overlap, cross track gridlines, fade where the ink has worn, and disappear under hand-written annotations. One careful interpreter produces a handful of usable masks a day, and the masks disagree between interpreters because tracing a 1-pixel curve through a noisy scan is a judgement call.

That math does not close. A segmentation model wants thousands of labelled examples to generalize across the messiness of a real archive. Hand-labelling thousands of multi-thousand-pixel rasters is a multi-interpreter-year effort before a single model is trained, and the labels you get are noisy and inconsistent. Classical gridline-elimination approaches [1] sidestep labelling but break on exactly the degraded, multi-curve scans that dominate a real archive. We needed labels at a scale hand-tracing cannot reach, and we needed them to be exact.

The inversion: render the image, get the mask for free

The insight that unlocked the project is simple. When you trace a real scan by hand, you are trying to recover a mask you do not have. But if you generate the log image yourself from a known curve, you already have the mask. You drew it.

So we inverted the pipeline. Instead of starting from a scanned image and labelling it, we start from a synthetic curve, render it into a realistic scanned-log image, and keep the curve's own pixel footprint as the ground-truth mask. The mask is exact by construction. There is no interpreter judgement, no disagreement, no tracing error. Every pixel is labelled correctly because we placed it.

This turns a labelling problem into a rendering problem, and rendering scales in a way hand-labelling never will. Generating ten thousand labelled examples is a matter of compute, not interpreter-years.
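
A minimal sketch of the inversion, assuming NumPy and a grayscale image: the same row indices that draw the curve also populate the mask, so the label cannot disagree with the image. The function name `render_log`, the random-walk signal, and the convention that depth runs along the width are illustrative assumptions, not the production renderer.

```python
import numpy as np

def render_log(values, height=480, line_width=1):
    """Render a 1-D signal into (image, mask) arrays of shape (height, width).

    Assumption: one image column per depth sample, with the curve value
    mapped linearly to a row. Adjacent columns are not connected, which a
    fuller renderer would handle for steep segments.
    """
    values = np.asarray(values, dtype=float)
    width = values.size
    lo, hi = values.min(), values.max()
    span = hi - lo if hi > lo else 1.0
    # Map each value to a row index, leaving a small margin top and bottom.
    usable = height - 2 * line_width - 1
    rows = np.round((values - lo) / span * usable).astype(int) + line_width

    mask = np.zeros((height, width), dtype=np.uint8)
    cols = np.arange(width)
    for dr in range(-(line_width // 2), line_width // 2 + 1):
        mask[np.clip(rows + dr, 0, height - 1), cols] = 1

    # The "scan" starts as white paper with the curve drawn in black ink.
    image = np.full((height, width), 255, dtype=np.uint8)
    image[mask == 1] = 0
    return image, mask

rng = np.random.default_rng(0)
signal = np.cumsum(rng.normal(size=3200))  # hypothetical random-walk "log curve"
image, mask = render_log(signal)
print(image.shape, int(mask.sum()))
```

Because the mask is written before any degradation is applied, whatever is later done to `image` leaves the label untouched.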

Figure (interactive, schematic): the synthetic-log renderer turns one clean signal into many realistic scanned-log images, each carrying a pixel-perfect mask. Switching on the defect layers (grid lines, scan noise, ink erosion, annotations, variable width) dirties the rendered scan, while the ground-truth mask beside it never moves, because the renderer knows exactly where the curve is. Sourced values: 2,000 synthetic instances for binary segmentation and 15,000 for multiclass, an 80/20 train/val split, image widths of 3,200 to 12,800 px, and the defect families the renderer simulates. The dataset-growth curve between the two anchors, the rendered preview, and the mask are illustrative.

The catch is realism. A model trained on clean, synthetic curves will learn to read clean, synthetic curves and then fall apart on a 4th-generation photocopy of a 1980s scan. The whole effort hinges on making the synthetic images look like the real archive, defects and all. That is where most of our engineering went.

Modelling the defects of a real scan

A pristine plot of a curve on a white background is useless as training data. Real scans carry the full accumulated damage of decades of paper, photocopying, and digitization. We built each of those damage modes as a controllable, parameterised step in the rendering pipeline, so we could turn each one up or down and match the texture of the operator's actual archive.

Track gridlines. Every printed log has a grid of depth and value lines. A naive model latches onto these straight, high-contrast lines instead of the curve. We render gridlines at variable spacing, weight, and contrast so the network learns to treat them as background rather than signal.
Scan artefacts. Photocopying and scanning add speckle, banding, skew, and compression noise. We inject these so the model sees the same texture it will meet at inference.
Ink erosion. On old paper the curve fades and breaks. We erode the rendered trace stochastically, leaving gaps and thin spots, so the model learns to bridge a discontinuous curve the way an interpreter does.
Annotations. Real logs are covered in hand-written depth marks, initials, and stamps that cross the curves. We composite synthetic annotations over the trace so the model learns to read through them.
Variable geometry. Real logs are not one size. We render at widths from 3,200 to 12,800 pixels and heights from 480 to 640 pixels, so the model never assumes a fixed input shape. This variability is the reason the binary stage ran at a batch size of 1: with image dimensions changing example to example, we could not stack a batch into a single tensor until we wrote a custom collate function for the multiclass stage.
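
Three of the damage modes above (gridlines, scan speckle, ink erosion) can be sketched as parameterised steps over a rendered image. This is a simplified illustration: the `DefectKnobs` fields, their default values, and the order of the steps are assumptions, and the real renderer's defect models are certainly richer.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class DefectKnobs:
    # All names and defaults here are hypothetical, chosen for illustration.
    grid_spacing: int = 40      # pixels between gridlines
    grid_gray: int = 160        # gridline intensity (0 = black, 255 = white)
    speckle_prob: float = 0.01  # share of pixels turned into dark speckle
    erosion_prob: float = 0.05  # share of curve pixels faded back to paper

def apply_defects(image, mask, knobs, rng):
    """Return a degraded copy of `image`; `mask` is read but never modified."""
    out = image.copy()
    curve = mask.astype(bool)

    # Ink erosion: fade a random share of the curve pixels to paper white.
    eroded = curve & (rng.random(mask.shape) < knobs.erosion_prob)
    out[eroded] = 255

    # Gridlines: gray lines that only darken pixels, never lighten them.
    grid = np.zeros(mask.shape, dtype=bool)
    grid[::knobs.grid_spacing, :] = True
    grid[:, ::knobs.grid_spacing] = True
    out[grid] = np.minimum(out[grid], knobs.grid_gray)

    # Scan speckle: sprinkle dark pixels over the whole sheet.
    speckle = rng.random(mask.shape) < knobs.speckle_prob
    out[speckle] = np.minimum(out[speckle], 60)
    return out

rng = np.random.default_rng(0)
mask = np.zeros((480, 3200), dtype=np.uint8)
mask[250, :] = 1  # a flat placeholder curve
clean = np.where(mask == 1, 0, 255).astype(np.uint8)
dirty = apply_defects(clean, mask, DefectKnobs(), rng)
```

Matching a particular archive then amounts to choosing knob values, for example a higher `erosion_prob` for an archive dominated by faded ink.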

Because every defect is a knob, we are not stuck with whatever a fixed dataset happens to contain. If the operator's archive is heavy on faded ink and light on annotations, we render to match. The synthetic distribution is tunable to the target archive, which is something no fixed corpus of hand-labelled scans can offer.

From 2,000 binary to 15,000 multiclass instances

We grew the pipeline in two stages, and the two stages answer two different questions.

The first stage was binary: one curve against the background, mask present or absent per pixel. We generated 2,000 synthetic instances for this and trained a segmentation model in roughly 2 hours over 50 epochs (about 110 minutes of wall-clock). The binary stage was a proof of the inversion itself. It told us that a model trained purely on rendered logs could in fact segment a curve, and that the defect models were carrying it toward realistic inputs rather than memorizing clean ones.

The second stage was multiclass, and this is where the volume came in. A real log track carries more than one curve, so we moved to three output classes: background plus two curves. The two-curve, three-class problem is genuinely harder, because the model now has to separate one curve from another where they cross, not just curve from background. We scaled the synthetic corpus to 15,000 instances to give the model enough varied crossings, overlaps, and defect combinations to generalize. Training the multiclass model took roughly 550 minutes for 50 epochs (a little over 9 hours), at a batch size of 16 made possible by a custom collate function that padded the variable-width images into a single batch.
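
The padding idea behind such a collate function can be sketched with NumPy alone. The name `pad_collate`, the right-and-bottom padding, and the pad values (paper white for images, background class 0 for masks) are assumptions; the text does not say how the actual function pads.

```python
import numpy as np

def pad_collate(samples, pad_image=255, pad_mask=0):
    """Stack (image, mask) pairs of differing sizes into two batch arrays.

    Each sample is padded on the right and bottom up to the largest
    height and width found in the batch.
    """
    max_h = max(img.shape[0] for img, _ in samples)
    max_w = max(img.shape[1] for img, _ in samples)
    images = np.full((len(samples), max_h, max_w), pad_image, dtype=np.uint8)
    masks = np.full((len(samples), max_h, max_w), pad_mask, dtype=np.uint8)
    for i, (img, msk) in enumerate(samples):
        h, w = img.shape
        images[i, :h, :w] = img
        masks[i, :h, :w] = msk
    return images, masks

# Two samples at the extremes of the size range quoted in the text.
samples = [
    (np.zeros((480, 3200), np.uint8), np.ones((480, 3200), np.uint8)),
    (np.zeros((640, 12800), np.uint8), np.ones((640, 12800), np.uint8)),
]
images, masks = pad_collate(samples)
print(images.shape)  # (2, 640, 12800)
```

If the training loop uses PyTorch (an assumption, since the framework is not named here), a function of this shape that returns tensors can be passed to `torch.utils.data.DataLoader` through its `collate_fn` argument.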

The jump from 2,000 to 15,000 was not arbitrary. The multiclass task has more to learn (curve-versus-curve disambiguation on top of curve-versus-background), and the synthetic pipeline made the extra volume nearly free. The marginal cost of another thousand labelled multiclass instances was render time, not interpreter time. That is the entire economic argument for synthetic ground truth: once the renderer exists, label volume is a compute line item.

Splitting, training, and the volume-versus-accuracy trade

We used a standard 80/20 train-validation split across the synthetic corpus. Because the labels are exact by construction, the validation signal is clean: a disagreement between prediction and mask is a genuine model error, not an artefact of a noisy hand label. That is a quiet but real advantage of synthetic ground truth. When your labels are perfect, your metrics mean what they say.
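
For concreteness, an 80/20 split over instance indices might look like the sketch below. The shuffling and the fixed seed are assumptions made for reproducibility; the text only states the ratio.

```python
import random

def train_val_split(n_instances, val_fraction=0.2, seed=0):
    """Shuffle instance indices and split them into (train, val) lists."""
    indices = list(range(n_instances))
    random.Random(seed).shuffle(indices)
    n_val = int(round(n_instances * val_fraction))
    return indices[n_val:], indices[:n_val]

train_idx, val_idx = train_val_split(15_000)
print(len(train_idx), len(val_idx))  # 12000 3000
```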

Across the multiclass runs we evaluated five segmentation loss functions (Dice, Focal [5], Lovasz-Softmax [4], Soft Cross-Entropy, and Tversky [3]) under otherwise identical conditions, because the choice of loss matters more than practitioners assume when the foreground is a sparse 1-pixel curve. Under Dice loss the multiclass model reached an Intersection-over-Union of 0.94 on the background class, with 0.26 and 0.21 on the two curve classes, and F1 scores of 0.97, 0.37, and 0.32 respectively. Recall on the background class held at 0.97. Read those curve-class numbers honestly: segmenting a thin, faded, overlapping trace out of a noisy scan is hard, and the per-curve scores reflect that. The best individual reconstructions were strong, with an R-squared of 0.9891 on the cleanest multiclass curve example under Tversky loss, but the per-class averages are where the real difficulty lives.
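
The per-class IoU and F1 figures quoted above follow the standard definitions, which can be written out from pixel counts. This is a generic sketch of those definitions, not the evaluation code behind the reported numbers; the function name and the three-class label encoding (0 = background, 1 and 2 = curves) are assumptions.

```python
import numpy as np

def per_class_iou_f1(pred, target, num_classes=3):
    """Return {class: (iou, f1)} from integer label maps of equal shape."""
    scores = {}
    for c in range(num_classes):
        p, t = pred == c, target == c
        tp = np.logical_and(p, t).sum()
        fp = np.logical_and(p, ~t).sum()
        fn = np.logical_and(~p, t).sum()
        # IoU = TP / (TP + FP + FN); F1 = 2TP / (2TP + FP + FN).
        iou = tp / (tp + fp + fn) if (tp + fp + fn) else float("nan")
        f1 = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else float("nan")
        scores[c] = (float(iou), float(f1))
    return scores

target = np.array([[0, 0, 1, 2]])
pred = np.array([[0, 1, 1, 2]])
scores = per_class_iou_f1(pred, target)
print(scores)
```

These definitions also show why thin curves score low: a 1-pixel trace predicted one row away from the true trace has no overlapping pixels, so its IoU is zero even though the reconstructed curve would be nearly right.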

More synthetic volume bought accuracy, but not linearly. Past a point the model stops being limited by how many examples it has seen and starts being limited by how faithfully the synthetic defects match the real ones. A thousand more clean-ish renders move the needle less than getting the ink-erosion and annotation models closer to the operator's actual scans. The trade we kept landing on was volume versus fidelity: beyond the volume needed for coverage, the return came from making each synthetic image harder and more real, not from generating more easy ones.

What the synthetic pipeline bought us

Pixel-perfect masks by construction: rendering the image from a known curve gives an exact label for free, eliminating interpreter tracing error and inter-interpreter disagreement.
Tunable defect distribution: gridlines, scan artefacts, ink erosion, and annotations are each a knob, so the synthetic set can be matched to a specific operator archive rather than to a fixed corpus.
Label volume becomes a compute cost: scaling from 2,000 binary to 15,000 multiclass instances was render time, not interpreter-years, which is the whole economic case for synthetic ground truth.

Sim-to-real: trained on renders, run on real TIFs

The pipeline only earns its keep if a model trained on synthetic logs can read a real one. That is the sim-to-real gap, and it is the standard failure mode of synthetic data: a model fits the quirks of the renderer and never transfers.

Our defence was the defect modelling. By training across rendered logs that already carried gridlines, scan speckle, ink erosion, annotations, and the full 3,200-to-12,800-pixel width range, the model never saw a clean, idealized log during training. The synthetic distribution was built to overlap the real one, so inference on a real TIF was not a distribution shift the model had never encountered. We trained entirely on synthetic instances and ran inference directly on real scanned TIF logs from the archive, with zero hand-labelled rasters in the training set.

This is the part worth holding onto. The operator's archive is the public Texas RRC corpus and its own private holdings: 136,771 scanned TIFs against 7,781 LAS files in the public set alone. None of those TIFs needed a hand-drawn mask for us to train a model that reads them. The masks lived in the synthetic data; the real scans were only ever inputs at inference time. We manufactured the ground truth, and the real archive never had to be labelled at all.

How manufactured labels reset the engagement

The synthetic pipeline is the reason the rest of the digitization work was possible. It removed the one blocker that stops most raster-log ML efforts before they start, which is the absence of labels at usable scale. With ground truth that we could generate on demand, tune to the target archive, and trust completely, the modelling work became a normal segmentation problem rather than an impossible annotation problem.

It also changed the unit economics of starting a new engagement. When a new operator brings a different archive with different scan characteristics, we do not commission a fresh hand-labelling campaign. We tune the defect models to the new archive's texture and render a matched synthetic corpus. The ground truth scales with compute, and the interpreter time we would have spent tracing curves goes instead to validating model output, which is where a senior petrophysicist's judgement actually belongs.

The honest limitation is the same one the metrics show. Synthetic data closes the labelling gap but does not by itself make a thin, faded, overlapping curve easy to segment. The per-curve IoU and F1 numbers are where the next gains have to come from, and they come from two places: richer defect fidelity so the synthetic distribution hugs the real one more tightly, and architecture work on the model that consumes the data. The synthetic pipeline does not finish the problem. It makes the problem tractable, which for raster-log digitization is the step everything else was waiting on.

References

[1] Yuan, B. and Yang, Q. (2019). Digitization of Well-Logging Parameter Graphs Based on Gridlines-Elimination Approach. Journal of Petroleum Exploration and Production Technology. http://www.jsoftware.us/show-409-JSW15423.html
[2] Chen, L.-C. et al. (2018). Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation. ECCV 2018. https://arxiv.org/abs/1802.02611
[3] Salehi, S. S. M. et al. (2017). Tversky loss function for image segmentation using 3D fully convolutional deep networks. MLMI 2017. https://arxiv.org/abs/1706.05721
[4] Berman, M. et al. (2018). The Lovasz-Softmax loss: A tractable surrogate for the optimization of the intersection-over-union measure in neural networks. CVPR 2018. https://arxiv.org/abs/1805.02396
[5] Lin, T.-Y. et al. (2017). Focal Loss for Dense Object Detection. ICCV 2017. https://arxiv.org/abs/1708.02002
