Designing Synthetic Data That Generalises, Not Memorises

A synthetic-data generator is a teacher, and a model cannot learn anything the teacher never puts in front of it. That single fact governs everything about building training data for a curve-segmentation model, the kind EarthScan uses inside VeerNet to lift well-log curves off scanned raster paper. If the generator only ever draws logs that are 4,000 pixels wide with two clean curves and no coffee stains, the model becomes an expert at exactly that and nothing else. It will post a beautiful validation score, because validation is drawn from the same generator, and then it will fail the first time a real scan arrives at 11,000 pixels with a torn margin. The whole problem of synthetic data is that the generator is both the training set and, if you are not careful, the ceiling on what the model can ever recognise.

This note is about the design decision that sets that ceiling: how wide to make the generator's variety, along which axes, and how to tell whether you widened it enough. It deliberately does not walk through the VeerNet pipeline itself, which is documented elsewhere. It takes up the narrower and more transferable question of how to design the diversity of a synthetic corpus so that the model generalises to unseen scans instead of memorising the generator that made its training data. The underlying idea is not ours. The case for widening a generator's variance rather than polishing its realism is domain randomization, set out by Tobin and colleagues for simulation-to-real transfer [2] and sharpened by Tremblay and colleagues for training on synthetic images [1]. What we contribute is a concrete setting of those axes on one oddly shaped subsurface dataset, and a way to read whether it worked.

The validation score is the generator grading its own homework

The first trap is the most seductive, because it looks like success. When you train on synthetic data and hold out a validation split from that same synthetic data, a high validation score tells you the model learned the generator. It does not tell you the model learned the subsurface. The 20% we held back on the 80/20 split was manufactured by the same code that made the 80% we trained on, so it shares every unstated assumption baked into the generator: the same distribution of widths, the same curve shapes, the same absence of artefacts we never thought to add. A model that scores well there has proven it can interpolate inside the generator's world. Whether it can step outside that world is a separate question the split cannot answer, because the split never leaves the world.

This is why we treat the in-generator validation number and the unseen-scan number as two different measurements rather than one. The gap between them is the real object of interest. A small gap means the generator's world is wide enough that a real scan looks, to the model, like just another sample it already knows how to handle, which is precisely the state Tobin and colleagues describe as the goal of domain randomization: make the simulator varied enough that reality becomes one more variation [2]. A large gap means the model memorised a narrow generator. Arpit and colleagues give the mechanism for why that memorisation is so easy to fall into: networks will learn genuine structure first when the data has structure to learn, but a low-diversity dataset gives them little structure to find and a great deal to simply memorise [3].
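Keeping the two measurements apart is easy to express in code. The sketch below is illustrative only: it assumes intersection-over-union as the score, which this note does not specify, and `iou`, `mean_score`, and `generalisation_gap` are hypothetical helper names rather than anything from VeerNet. The toy masks are invented values.

```python
from statistics import mean


def iou(pred, truth):
    """Intersection-over-union of two binary masks given as flat 0/1 sequences.

    IoU is an assumed metric for illustration; the note does not name one.
    """
    inter = sum(1 for p, t in zip(pred, truth) if p and t)
    union = sum(1 for p, t in zip(pred, truth) if p or t)
    # Two empty masks agree perfectly, so score them as 1.0.
    return inter / union if union else 1.0


def mean_score(pairs):
    """Average IoU over (prediction, ground truth) mask pairs."""
    return mean(iou(pred, truth) for pred, truth in pairs)


def generalisation_gap(in_generator_pairs, unseen_pairs):
    """Return (in-generator score, unseen score, gap between them)."""
    in_generator = mean_score(in_generator_pairs)
    unseen = mean_score(unseen_pairs)
    return in_generator, unseen, in_generator - unseen


# Toy masks, invented for the example: perfect inside the generator,
# partly wrong on scans the generator never produced.
validation_split = [([1, 1, 0, 0], [1, 1, 0, 0]), ([0, 1, 1, 0], [0, 1, 1, 0])]
unseen_scans = [([1, 1, 0, 0], [1, 0, 0, 0]), ([0, 1, 1, 0], [0, 0, 1, 1])]

in_generator, unseen, gap = generalisation_gap(validation_split, unseen_scans)
print(f"in-generator {in_generator:.2f}, unseen {unseen:.2f}, gap {gap:.2f}")
```

The function returns all three numbers so the in-generator score is never reported without the unseen score beside it.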

Which axes of variety actually matter

Not every axis of variety buys the same generalisation. The instinct with synthetic data is to make each sample more realistic, more photographic, closer to a real scan. Tremblay and colleagues argue, and we found, that this instinct is often backwards: what forces a network to latch onto the true invariant signal is not realism but variance [1]. If every synthetic log is rendered the same plausible way, the model can lean on incidental rendering cues. If the same log appears at wildly different widths, heights, curve shapes, and artefact levels, those incidental cues stop being reliable, and the only thing that survives across all the variation is the actual curve the model is supposed to segment. Variety, not fidelity, is what starves the shortcuts.

So we chose the axes deliberately. Width was the largest lever, because real scans vary enormously in physical size and scan resolution. We therefore had the generator span widths from 3,200 to 12,800 pixels, a factor of four. Height we varied from 480 to 640 pixels. Curve shape was widened beyond smooth idealised traces to include the kinks, flat intervals, and overlaps that real curves show, and artefacts were added rather than avoided, because a generator that never produces a smudge teaches a model that smudges cannot happen. Scale mattered too, but as an enabler rather than a cure: we grew the corpus from the initial 2,000 binary instances to 15,000 multiclass instances and eventually 20,000 two-curve logs, because a wider set of axes needs more samples to cover, and a small corpus cannot represent a wide generator no matter how varied the generator's intent. More data from a narrow generator only lets the model memorise the narrowness more thoroughly.
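One way to picture these axes is as a parameter sampler that runs before any rendering. The following is a minimal sketch, not the VeerNet generator: only the width and height spans come from this note, while the `LogSpec` fields for shape and artefacts, their ranges, and the choice of independent uniform sampling are all assumptions made for illustration.

```python
import random
from dataclasses import dataclass

# Spans taken from the text above.
WIDTH_PX = (3200, 12800)
HEIGHT_PX = (480, 640)


@dataclass(frozen=True)
class LogSpec:
    """Parameters for one synthetic log; the renderer itself is not shown."""
    width_px: int
    height_px: int
    kink_rate: float       # hypothetical knob: how often the curve kinks
    flat_fraction: float   # hypothetical knob: share of the curve that is flat
    curves_overlap: bool   # hypothetical knob: whether the two curves cross
    artefact_level: float  # hypothetical knob: 0.0 clean, 1.0 heavily marked


def sample_log_spec(rng):
    """Draw every axis independently so no two axes are tied together."""
    return LogSpec(
        width_px=rng.randint(*WIDTH_PX),       # randint includes both endpoints
        height_px=rng.randint(*HEIGHT_PX),
        kink_rate=rng.uniform(0.0, 0.2),       # assumed range
        flat_fraction=rng.uniform(0.0, 0.4),   # assumed range
        curves_overlap=rng.random() < 0.5,     # assumed probability
        artefact_level=rng.uniform(0.0, 1.0),  # assumed range
    )


rng = random.Random(0)  # seeded so the sampled corpus is reproducible
specs = [sample_log_spec(rng) for _ in range(1000)]
print(min(s.width_px for s in specs), max(s.width_px for s in specs))
```

Sampling each axis independently is the design choice that matters here: if wide logs were always the clean ones, the model could learn that correlation instead of the curve.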

Whether a curve-segmentation model learns the world or just the synthetic generator it trained on comes down to one design decision: how wide you make the generator's variety. The teal frontier is the model's score on scans drawn from the same generator (the 80/20 train/validation split of the synthetic corpus); held inside the generator it stays high and barely moves, because there is always more of the same kind of thing to fit. The ghost line is the same model's score on unseen scans, meaning the widths, shapes, and artefacts the generator never produced. A narrow generator makes that line collapse: the model memorised the generator and cannot transfer. Drag the variety lever to widen the generator across its sourced axes (log width from 3,200 to 12,800 pixels, height from 480 to 640 pixels, shape and artefacts, and the corpus scaling from 2,000 to 15,000 and 20,000 instances) and the one orange element, the memorise-to-generalise gap between the two frontiers, closes as the lines converge. The instance counts, pixel spans, and the 80/20 split are sourced from the engagement archive; the two performance curves themselves are illustrative geometry drawn between those anchors to show the direction of the argument, not logged benchmark points.

The instrument above is the argument made tangible. The teal frontier is how the model scores inside the generator, on the 80/20 split; it stays high and barely moves as you widen variety, because the model always has more of the same kind of thing to fit. The ghost line is the same model on unseen scans, and the orange element is the gap between them. Drag the variety lever left toward a narrow generator and the gap yawns open: that is memorisation, a model that aced its own homework and fails the exam. Drag it right and the two frontiers converge. That convergence is the entire definition of generalising, and it is the thing the in-generator validation score alone can never show you, because that score only ever reads the teal line.

The augmentation view of the same idea

There is a familiar version of this argument that does not mention synthetic data at all, and it is the same mechanism. Data augmentation, surveyed by Shorten and Khoshgoftaar, expands the effective diversity of a training set by transforming existing samples, and its whole justification is that a more diverse training distribution shrinks the gap between training and validation performance [4]. Synthetic-data generation is augmentation taken to its end: instead of transforming a small real set, you write the generator that produces the diversity directly. The design questions are identical, and the only difference is that with a generator you own every axis explicitly, which is both the power and the danger, because an axis you forget to vary is an axis the model will assume is constant.

That framing also sets the honest limit on the method. Widening variety along the axes you thought of does nothing for the axes you did not. If real scans have a failure mode our generator never produces, no amount of width and height variation will prepare the model for it, and the in-generator validation score will stay cheerfully high while the blind spot sits unnoticed in the gap to unseen scans. This is the residual risk that survives even a well-randomised generator, and it is why the gap to genuinely unseen scans, not the validation split, is the number we watch.
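A partial guard against forgotten axes is to compare the measured properties of incoming scans with the spans the generator covers. The sketch below assumes each scan is described by a dictionary of measurements; `coverage_report` and the `skew_deg` property are hypothetical, and only the width and height spans come from this note. It can only flag axes that someone thought to measure, so it narrows the blind spot without removing it.

```python
# Spans the generator is known to cover, taken from this note.
GENERATOR_SPANS = {"width_px": (3200, 12800), "height_px": (480, 640)}


def coverage_report(scan, spans=GENERATOR_SPANS):
    """Map each uncovered axis of `scan` to the reason it is not covered.

    `scan` is a dict of measured properties. An empty result means every
    measured axis falls inside a span the generator varies.
    """
    report = {}
    for axis, value in scan.items():
        if axis not in spans:
            report[axis] = "not varied by the generator"
            continue
        low, high = spans[axis]
        if not low <= value <= high:
            report[axis] = f"outside generator span {low}-{high}"
    return report


# Hypothetical real scan: width is covered, height is not, and skew is an
# axis this illustrative generator never varies.
scan = {"width_px": 11000, "height_px": 700, "skew_deg": 2.5}
for axis, reason in coverage_report(scan).items():
    print(f"{axis}: {reason}")
```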

What we actually did, and what we did not

Put plainly: we scaled the corpus from 2,000 to 15,000 to 20,000 instances not to chase a bigger number but to afford the variety we wanted to express, spanned widths across a four-fold range and heights across a narrower one, added shape variation and artefacts on purpose, and then judged the result not by the 80/20 validation score but by how much of that score survived contact with scans the generator had never made. When the two frontiers sat close, we trusted the model. When they diverged, we widened an axis rather than training for longer, because more epochs on a narrow generator only deepen the memorisation.
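That rule is simple enough to write down as a function. This is a sketch under stated assumptions: `next_step` is a hypothetical name, the `tolerance` of 0.05 is invented because this note gives no numeric threshold for when the frontiers count as close, and the example scores are made up.

```python
def next_step(in_generator_score, unseen_score, tolerance=0.05):
    """Choose what to do after an evaluation round.

    `tolerance` is a hypothetical threshold; the note gives no numeric value
    for when the two frontiers count as close.
    """
    gap = in_generator_score - unseen_score
    if gap <= tolerance:
        return "trust the model"
    # Training for longer is deliberately not offered as an option here.
    return "widen a generator axis"


print(next_step(0.95, 0.93))  # frontiers close together
print(next_step(0.95, 0.60))  # frontiers far apart
```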

None of the underlying theory is ours. The idea that variance beats realism comes from Tobin and colleagues [2] and Tremblay and colleagues [1], the account of why low-diversity data gets memorised is Arpit and colleagues' [3], and the augmentation framing is the survey literature's [4]. Our contribution is the setting of the axes on a real subsurface corpus and the discipline of reading the memorise-to-generalise gap instead of the flattering in-generator score. The generator is the teacher, and we spent our effort making it show the model a wide enough world that the real one would not surprise it.

Limitations

This is one engagement's calibration, not a benchmark, and it should be read that way. The instance counts, the pixel spans of 3,200 to 12,800 in width and 480 to 640 in height, and the 80/20 split are the real archive numbers. The two performance frontiers the instrument draws between those anchors are illustrative geometry chosen to show the direction of the argument, not a logged benchmark curve, and the exact position where the gap closes will differ for any other operator's logs. The method also cannot cover axes it does not include: a real-scan failure mode absent from the generator stays invisible to both the training and the validation splits, so the unseen gap we rely on is only as trustworthy as the unseen set is genuinely representative. Finally, this note treats generator design in isolation; whether a model that generalises across our variety axes actually produces a usable digitised curve, and whether the segmentation masks reconstruct to the right depths, are downstream questions this note does not settle.

References

[1] Tremblay, J. et al. Training Deep Networks with Synthetic Data: Bridging the Reality Gap by Domain Randomization. CVPR Workshops (2018). The case that widening a synthetic generator's variance, rather than making it more photorealistic, is what forces transfer to real data. https://openaccess.thecvf.com/content_cvpr_2018_workshops/w14/html/Tremblay_Training_Deep_Networks_CVPR_2018_paper.html

[2] Tobin, J. et al. Domain Randomization for Transferring Deep Neural Networks from Simulation to the Real World. IEEE/RSJ IROS (2017). With enough simulator variability, the real world becomes just another variation, so no real samples are needed at train time. https://ieeexplore.ieee.org/document/8202133

[3] Arpit, D. et al. A Closer Look at Memorization in Deep Networks. ICML (2017). Networks fit real structure before they memorise, and data diversity moves where that transition happens. https://proceedings.mlr.press/v70/arpit17a.html

[4] Shorten, C., and Khoshgoftaar, T. M. A Survey on Image Data Augmentation for Deep Learning. Journal of Big Data 6, 60 (2019). Augmentation as a way to expand effective training diversity and shrink the train-to-validation gap. https://journalofbigdata.springeropen.com/articles/10.1186/s40537-019-0197-0
