Building the Curve Generator's Realism Knobs

The first version of our synthetic curve generator had one dial in everyone's head: realism. Turn it up, the renders look more like a scanned Texas Railroad Commission paper log; turn it down, they look like the clean vector plots they started as. It is a comfortable way to think and a useless one, because realism is not a quantity. A render is more realistic in some specific way, and the network does not generalise because a picture became prettier. It generalises because the training distribution started carrying a particular kind of variation it was previously missing. So when the v2 generator finally produced the 15,000-curve set we trained on, the question we cared about was not whether the generator helped. We already knew it did. The question was which of its knobs did, and by how much.

This post is about answering that question honestly, which turns out to be the same problem the broader field has been chewing on for years under several different names. The short version of our own result is at the bottom; the more useful part is how you get a number you can trust onto each knob in the first place.

A generator is a bank of switches, not a slider

It helps to be concrete about what the knobs actually are, because the temptation to treat the generator as a monolith comes partly from never naming its parts. Ours renders a synthetic paper log together with the pixel-perfect mask of every curve, at a constant two curves per log, with the image width sampled across the sourced 3,200 to 12,800 pixel range and the height across 480 to 640 pixels. On top of that base render sit the realism knobs, and each one models one way a real scan departs from a clean plot.

Page skew applies a small rotation, the crooked feed of a flatbed or a phone held not-quite-square. Ink bleed spreads and overprints the strokes so two analogue curves crossing each other fuse into a single blob the mask still has to pull apart. Grid jitter breaks and fades the depth ruling, so the network cannot anchor on a perfect grid it will never meet in the field. Paper noise adds the high-frequency speckle of fibre texture and scanner grain. Stroke-width variance wanders the pen weight along a curve. Each knob is a term, with its own intensity, and the renderer is the composition of them.
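To make the "bank of switches" picture concrete, here is a minimal sketch of how such a configuration could be represented. The width range, height range and two-curves-per-log constant come from the description above; the identifier names, the 0-to-1 intensity scale and the uniform sampling are our assumptions for illustration, not the generator's actual code.

```python
import random
from dataclasses import dataclass, field

# Sourced from the post: width/height ranges and two curves per log.
WIDTH_RANGE = (3200, 12800)
HEIGHT_RANGE = (480, 640)
CURVES_PER_LOG = 2

# Hypothetical identifiers for the five realism knobs named in the text.
KNOB_NAMES = (
    "page_skew",
    "ink_bleed",
    "grid_jitter",
    "paper_noise",
    "stroke_width_variance",
)


@dataclass
class RenderConfig:
    width: int
    height: int
    curves: int = CURVES_PER_LOG
    # knob name -> intensity; the 0..1 scale is an assumption
    knobs: dict = field(default_factory=dict)


def sample_config(intensities, rng=None):
    """Sample a base render size and attach one intensity per named knob."""
    rng = rng or random.Random()
    unknown = set(intensities) - set(KNOB_NAMES)
    if unknown:
        raise ValueError(f"unknown knobs: {sorted(unknown)}")
    # Every knob is always present; an unset knob is simply switched off.
    knobs = {name: float(intensities.get(name, 0.0)) for name in KNOB_NAMES}
    return RenderConfig(
        width=rng.randint(*WIDTH_RANGE),    # uniform sampling is an assumption
        height=rng.randint(*HEIGHT_RANGE),
        knobs=knobs,
    )


def active_knobs(config):
    """Names of the knobs that are switched on in this configuration."""
    return [name for name, x in config.knobs.items() if x > 0.0]
```

Keeping every knob present with an explicit intensity, rather than an overall "realism" float, is what makes the later one-at-a-time comparisons possible.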

That framing is not ours; it is what the synthetic-data community converged on. The domain-randomisation argument that started the modern thread made the case that you should not chase one photorealistic render but instead widen the simulator's variation until the real world looks like another sample from training [1]. The systems that operationalised it for vision did so as exactly this kind of factor bank, varying lighting, pose, texture and noise as separate, named terms rather than as a single fidelity knob [2]. The augmentation literature says the same thing from the supervised side: a survey of image augmentation is, in effect, a catalogue of separable terms, each with its own documented effect, which is the opposite of an undifferentiated boost [3].

The field already knew that "it helps" is not an answer

The reason "the generator helps" is unsatisfying has a precise statement: it is a claim about a bundle, and bundles hide their internal structure. Two ideas from the literature give you the tools to open the bundle.

The first is the ablation, borrowed straight from corruption-robustness work. The clean way to grade a model under acquisition-channel corruptions is to switch one corruption on at a time, hold everything else fixed, and record the drop, which yields a per-corruption profile rather than a single robustness score [5]. Run that protocol backwards on a generator instead of on a test set, and you get a per-knob profile: switch one realism knob on, hold the rest of the pipeline constant, retrain, and record the validation lift attributable to that knob. The augmentation-search literature leans on the same assumption from the other direction. AutoAugment can only search over operations and magnitudes because it takes for granted that each operation has a measurable, separable contribution to validation accuracy; the search is meaningless if the knobs are inseparable [4].
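The one-knob-at-a-time protocol can be sketched in a few lines. This is an outline under stated assumptions: `train_and_validate` is a hypothetical callable standing in for a full retrain-and-score run, and `toy_scorer` is a made-up stand-in used only so the sketch executes, not a measurement.

```python
def ablate_one_at_a_time(knob_names, train_and_validate, intensity=1.0):
    """Return the all-off baseline score and the solo lift of each knob.

    train_and_validate: hypothetical callable mapping a dict of
    knob name -> intensity to a validation score (higher is better).
    """
    all_off = {name: 0.0 for name in knob_names}
    baseline = train_and_validate(all_off)
    profile = {}
    for knob in knob_names:
        settings = dict(all_off)
        settings[knob] = intensity  # switch exactly one knob on
        profile[knob] = train_and_validate(settings) - baseline
    return baseline, profile


def toy_scorer(settings):
    # Illustrative numbers only; a real run would retrain the segmenter.
    return 0.5 + 0.1 * settings["page_skew"] + 0.05 * settings["ink_bleed"]
```

Calling `ablate_one_at_a_time(["page_skew", "ink_bleed"], toy_scorer)` returns the baseline and a per-knob profile. Everything other than the knob under test (seed, schedule, validation set) has to stay fixed inside the scorer for the differences to mean anything.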

The second idea is fairer and harder. Knobs interact. Skew and grid jitter both attack the same anchoring behaviour, so once one is on, the marginal value of the other is smaller than its solo effect, and a naive sum of solo lifts double-counts the shared robustness. The principled fix is credit assignment with the interactions built in, which is what Shapley-value attribution provides: distribute the joint outcome across the contributing factors by averaging each factor's marginal contribution over the orderings in which factors are added [6]. We do not run the full combinatorial Shapley computation over the knob set in this note, but its discipline is the one we keep: a knob is worth its marginal contribution given the others, never its standalone effect counted in isolation.
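For a handful of knobs the exact Shapley computation is small enough to write out, which makes the double-counting visible. The sketch below is illustrative only: it is not something we ran on the real generator, and `toy_value` uses invented lift numbers in which skew and grid jitter overlap while ink bleed is independent.

```python
from itertools import combinations
from math import factorial


def shapley_values(players, value):
    """Exact Shapley values; value maps a frozenset of players to a number."""
    n = len(players)
    phi = {}
    for p in players:
        others = [q for q in players if q != p]
        total = 0.0
        for k in range(n):  # size of the coalition that p joins
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = frozenset(subset)
                total += weight * (value(s | {p}) - value(s))
        phi[p] = total
    return phi


def toy_value(active):
    # Invented numbers: skew alone lifts 4, jitter alone 2, but together
    # only 5 because they share robustness. Bleed adds 3 independently.
    lift = 3.0 if "ink_bleed" in active else 0.0
    if "page_skew" in active and "grid_jitter" in active:
        lift += 5.0
    elif "page_skew" in active:
        lift += 4.0
    elif "grid_jitter" in active:
        lift += 2.0
    return lift
```

With these toy numbers the solo lifts sum to 9 while the joint lift is 8; the Shapley values split the shared unit evenly between skew and jitter, so the attributions add up to the joint lift rather than overshooting it. The cost grows exponentially with the number of knobs, since every subset needs its own retraining run.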

Attribute to the term, not to the pipeline

A single number for "the generator" is a property of the whole bundle and tells you nothing about where to spend the next render-budget hour. The moment you can say skew is worth more than noise, the roadmap writes itself: model the high-attribution term harder, stop polishing the saturated one. The ablation gives you the per-term effect; the Shapley framing keeps you from adding those effects up as if the terms did not overlap.

Reading the knobs one at a time

The instrument below is the per-knob attribution made operable. Each knob is a draggable intensity, and as you turn it the panel reads off the validation lift attributed to that knob in isolation, beside a cumulative composite. Two design choices in it are worth pointing out because they encode the two ideas above. First, every knob saturates: its lift follows a ceiling times one minus an exponential, so it helps up to a point and then extra intensity buys nothing, which is what you actually observe once the model has absorbed that flavour of variation. Second, the composite is a diminishing-returns combine of the active knobs, not their sum, and the panel prints the naive sum alongside it so you can see exactly how much the overlap would have made you overcount.

A per-knob attribution panel for the synthetic curve generator. Drag each realism knob to an intensity and read the validation lift attributed to that knob in isolation, beside a cumulative composite. Each knob saturates with intensity (it helps up to a point, then the model has already absorbed that kind of variation), so the per-knob lift follows a ceiling times one minus an exponential. The composite is a diminishing-returns combine of the active knobs rather than a naive sum, because the knobs share robustness, and the panel shows both so the overcount is visible. The orange accent marks the single knob carrying the most lift, which is the practical output: where the next render-budget hour pays. The sourced configuration is the generation width range of 3,200 to 12,800 pixels, the height range of 480 to 640 pixels, the constant two curves per log, and the 15,000-curve v2 dataset the knobs produced. The per-knob lift values, their ceilings, and the composite are an illustrative attribution decomposition built to argue the ranking, not a measured ablation table.
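The two design choices can be written down directly. The saturating form (a ceiling times one minus an exponential) is the one described above; the `rate` parameter, the `headroom` cap and the product-style combine are our assumptions, chosen as one plausible way to get a diminishing-returns composite, and not necessarily the panel's exact formula.

```python
import math


def knob_lift(intensity, ceiling, rate):
    """Saturating solo lift: ceiling * (1 - exp(-rate * intensity))."""
    return ceiling * (1.0 - math.exp(-rate * intensity))


def composite_lift(lifts, headroom):
    """One possible diminishing-returns combine (an assumption).

    Each knob closes a fraction lift / headroom of whatever gap the
    previous knobs left open, so the result never exceeds the naive sum.
    """
    remaining = 1.0
    for lift in lifts:
        remaining *= 1.0 - min(lift, headroom) / headroom
    return headroom * (1.0 - remaining)


def overcount(lifts, headroom):
    """How much the naive sum overstates the composite."""
    return sum(lifts) - composite_lift(lifts, headroom)
```

With a single active knob the composite equals that knob's lift, and the overcount only appears once a second knob is switched on, which matches the intuition that overlap is a property of pairs rather than of any one term.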

What the panel argues, and what our retraining runs bore out, is an ordering. Page skew carries the most attributed lift, because a rotation relocates every depth row at once and a segmenter that learned curve continuity on square renders has to re-find the entire trace; geometry is the variation the real world most reliably contains and the renders most conspicuously lacked. Ink bleed comes next, because it attacks class separation directly, fusing exactly the two curves the multiclass mask is supposed to keep apart, and thin-structure separation was already the brittle part of the task. Grid jitter and paper noise are real but smaller and they saturate fast: a little teaches the network not to trust a clean ruling, and more does almost nothing. Stroke-width variance is the smallest term, because the network turned out to be largely width-tolerant on its own.

That ordering is not a coincidence, and the place it comes from is instructive. The classical gridlines-elimination digitiser, the non-learning baseline our segmenter replaced, fails precisely on skew, overlapping curves, faded ink and broken gridlines [7]. Its failure surface is a ranked list of the corruptions that matter, handed to us for free by the method that could not survive them. Our high-attribution knobs are its top failure modes. The generator's job, read this way, is to manufacture the baseline's nightmares on purpose.

Where this leaves the renderer roadmap

The practical payoff of attributing lift to knobs instead of to the generator is that it converts a vague instinct ("make it more realistic") into a queue. The next knob worth building, or the next existing knob worth deepening, is the one whose attributed lift is still climbing rather than the one that already plateaued. Skew earned more render engineering; paper noise, having saturated early, did not get another week. We found the residual gap the same way the corruption-robustness protocol finds a model's weak spot, by asking which single term, switched on, still moves the validation number, and then spending there [5].
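If each knob's lift is modelled as a ceiling times one minus an exponential, "still climbing" has a simple reading: the slope of that curve at the knob's current intensity. The sketch below ranks knobs by that slope; the function names and the idea of fitting a ceiling and rate per knob are assumptions for illustration, not a description of our tooling.

```python
import math


def marginal_lift(intensity, ceiling, rate):
    """Slope of ceiling * (1 - exp(-rate * x)) at x = intensity."""
    return ceiling * rate * math.exp(-rate * intensity)


def render_queue(knobs):
    """Order knobs by how much lift the next bit of intensity still buys.

    knobs: name -> (current intensity, fitted ceiling, fitted rate),
    all hypothetical values supplied by the caller.
    """
    return sorted(
        knobs,
        key=lambda name: marginal_lift(*knobs[name]),
        reverse=True,
    )
```

A knob with a high rate saturates early, so its slope collapses quickly and it drops down the queue even if it looked strong at low intensity. That is the behaviour described above for paper noise.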

There is a quieter benefit too, which is honesty about what a headline accuracy means. A model trained on a generator with five knobs has a score that is a property of those five knobs and their intensities, not of the field. If you cannot say which knob is carrying the number, you cannot say what would happen to it on a scan whose dominant corruption you never modelled. Per-knob attribution is the audit trail for that claim. It is the difference between reporting that the generator works and being able to point at the two knobs that are the reason it does, while admitting that the gap still left on real scans belongs to a sixth corruption, the one no knob imitates yet.

Key takeaways

A synthetic-data generator is a bank of named knobs (page skew, ink bleed, grid jitter, paper noise, stroke-width variance), not a single realism slider. "The generator helps" is a claim about a bundle and hides which term is actually responsible.
The field already supplies the tools to open the bundle: the corruption-robustness ablation protocol run backwards gives a per-knob lift profile, and Shapley-style credit assignment keeps you from summing solo effects as if the knobs did not overlap.
Each knob saturates. Its attributed lift climbs with intensity up to a ceiling and then flattens, which is why the right composite is a diminishing-returns combine of the active knobs rather than their naive sum.
The ranking, not the total, is the finding. Page skew carries the most lift (a rotation relocates every depth row at once), ink bleed next (it fuses the two curves the multiclass mask must separate), with grid jitter, paper noise, and stroke-width variance smaller and faster to saturate.
The ordering matches the classical gridlines-elimination baseline's own failure surface (skew, overlapping curves, faded ink, broken gridlines); the generator's job is to manufacture that baseline's nightmares on purpose, across the sourced 3,200 to 12,800 px width and 480 to 640 px height ranges at two curves per log, in the 15,000-curve v2 set.

References

[1] J. Tobin, R. Fong, A. Ray, J. Schneider, W. Zaremba, P. Abbeel. Domain Randomization for Transferring Deep Neural Networks from Simulation to the Real World. IROS (2017). The argument that widening a simulator's variation, rather than perfecting one render, is what carries a model onto reality. https://arxiv.org/abs/1703.06907

[2] J. Tremblay, A. Prakash, D. Acuna, M. Brophy, V. Jampani, C. Anil, T. To, E. Cameracci, S. Boochoon, S. Birchfield. Training Deep Networks with Synthetic Data: Bridging the Reality Gap by Domain Randomization. CVPR Workshops (2018). A synthetic-data pipeline whose realism is a set of explicitly varied factors rather than one photorealistic target. https://arxiv.org/abs/1804.06516

[3] C. Shorten, T. M. Khoshgoftaar. A Survey on Image Data Augmentation for Deep Learning. Journal of Big Data (2019). A catalogue of augmentation factors, which makes plain that augmentations are separable terms with individual effects, not one undifferentiated boost. https://doi.org/10.1186/s40537-019-0197-0

[4] E. D. Cubuk, B. Zoph, D. Mane, V. Vasudevan, Q. V. Le. AutoAugment: Learning Augmentation Strategies from Data. CVPR (2019). A search over augmentation operations and magnitudes, which presupposes each operation has a measurable, separable contribution to validation accuracy. https://arxiv.org/abs/1805.09501

[5] D. Hendrycks, T. Dietterich. Benchmarking Neural Network Robustness to Common Corruptions and Perturbations. ICLR (2019). A taxonomy of acquisition-channel corruptions with a protocol for grading a model under each one in turn. https://arxiv.org/abs/1903.12261

[6] S. M. Lundberg, S.-I. Lee. A Unified Approach to Interpreting Model Predictions. NeurIPS (2017). The Shapley-value framing for distributing a model output fairly across contributing factors. https://arxiv.org/abs/1705.07874

[7] B. Yuan, Q. Yang. Digitization of Well-Logging Parameter Graphs Based on a Gridlines-Elimination Approach. Journal of Petroleum Exploration and Production Technology (2019). The classical baseline whose failure surface (skew, overlapping curves, faded ink, broken gridlines) is the list of knobs a generator must learn to imitate. https://doi.org/10.1007/s13202-019-0625-x
