Sim-to-Real Demystified: Training on Fakes

The odd thing about the model behind VeerNet, the encoder-decoder EarthScan uses to lift curves off scanned paper well logs, is that it never trained on a paper well log. Every example it learned from was drawn by a generator: synthetic images of curves on a grid, produced to order, with the answer known because we placed the curves ourselves. Then we pointed the trained model at real scans and it worked well enough to be useful. To a lot of people that sounds like it should not work at all. You trained on fakes and tested on reality. This note is the intuition companion to our VeerNet pipeline write-up, which covers how the generator and model are built; here we stay on why the whole idea holds together, where it stops working, and how to reason about the gap before you trust it.

The reason to train on synthetic data is not laziness. Labelling real scanned logs needs a specialist to trace each curve pixel by pixel down a page that can be twelve thousand pixels tall. A generator hands you the label for free, because it knew the answer before it drew the picture. That let us build a set of 15,000 synthetic multiclass logs, scaled up from an initial 2,000-log binary set, with no manual annotation. The trade is plain: you buy scale and clean labels, and you pay by training on pictures that are not the pictures you care about. Whether it pays off turns on how close the fakes are to the real work.

What the model actually learns from a fake

The intuition that makes synthetic training reasonable is that a segmentation model does not memorise pictures. It learns a function from local image evidence to a per-pixel label: this dark continuous trace is a curve, that faint ruled line is a graticule, this smear is background. If the fakes present the same kinds of local evidence a real scan presents, the function the model learns on fakes is close to the one it needs on reality, even though it never saw a real page. Tobin and colleagues framed this as domain randomization: vary the synthetic renderer enough and the real world stops looking special, becoming one more sample from a distribution the model has already seen wide variation across [1]. The model is not fooled into thinking a fake is real. It simply never learned to depend on the things that separate them.
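As a minimal sketch of the mechanism, assuming a toy renderer rather than the VeerNet generator itself: every page is drawn from freshly sampled parameters, and the mask comes for free because the code placed the curve. The names `RenderParams`, `sample_params`, and `render`, the parameter ranges, and the two-class labelling are all hypothetical choices made for illustration.

```python
import random
from dataclasses import dataclass


@dataclass
class RenderParams:
    """Hypothetical knobs for one synthetic page (illustrative names and ranges)."""
    width: int
    height: int
    curve_width: int
    grid_spacing: int
    noise: float  # probability that a pixel is overwritten by scan speckle


def sample_params(rng):
    # Domain randomization: draw a fresh set of rendering parameters per page.
    return RenderParams(
        width=rng.randint(40, 80),
        height=rng.randint(60, 120),
        curve_width=rng.randint(1, 3),
        grid_spacing=rng.choice([8, 10, 16]),
        noise=rng.uniform(0.0, 0.05),
    )


def render(params, rng):
    """Return (image, mask). image holds ink intensity in [0, 1];
    mask holds class ids: 0 = background (including grid), 1 = curve."""
    w, h = params.width, params.height
    image = [[0.0] * w for _ in range(h)]
    mask = [[0] * w for _ in range(h)]

    # Faint ruled grid: visible in the image, labelled background in the mask.
    for y in range(h):
        for x in range(w):
            if x % params.grid_spacing == 0 or y % params.grid_spacing == 0:
                image[y][x] = 0.3

    # One curve as a bounded random walk down the page. The label is known
    # by construction, so no manual annotation is involved.
    x = rng.randint(params.curve_width, w - params.curve_width - 1)
    for y in range(h):
        x = min(max(x + rng.choice([-1, 0, 1]), 0), w - params.curve_width)
        for dx in range(params.curve_width):
            image[y][x + dx] = 1.0
            mask[y][x + dx] = 1

    # Scan speckle corrupts the image only; the mask stays exact.
    for y in range(h):
        for x in range(w):
            if rng.random() < params.noise:
                image[y][x] = rng.random()
    return image, mask


rng = random.Random(0)
params = sample_params(rng)
image, mask = render(params, rng)
```

Because speckle touches the image and never the mask, the label stays correct even where the picture is degraded. That is the property the synthetic approach relies on.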

That is why photorealism is not the goal people assume. The model is not graded on realism; it is graded on whether the decision boundary it needs was covered by something in training. Peng and colleagues showed this early for detectors trained on 3D renderings: adding variety helped transfer more than making any single rendering more photorealistic [3]. For our logs the generator earned its keep by spanning what actually varies between real scans (curve widths, spacing, grid styles, scan artefacts, and page dimensions from roughly three thousand to twelve thousand pixels wide), not by making one page look like a photograph of paper. Variety covers cases. Polish on a narrow set of cases covers nothing new.

The reality gap, stated plainly

So much for why it works. Now the part that keeps you honest. However wide you make the generator, there will be something in the real archive it did not draw: a smudge, a hand annotation, a curve that runs off the edge, a scan so faded the trace is a suggestion. On that thing the model has no learned behaviour to fall back on, only whatever its nearest training experience extrapolates to. This is the reality gap. It is not a bug in a particular generator; it is a permanent property of training on a distribution that is not the one you deploy on. The formal version is the bound of Ben-David and colleagues: your target-domain error is bounded by your source-domain error plus a term measuring how far apart the two distributions are, plus a further term that is small when a single model can do well on both domains [2]. Drive the source error to zero and the divergence term remains. That term is the reality gap, written as an inequality.
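For reference, the population form of the bound in [2] can be written as below. The notation is introduced here for illustration: S is the source (synthetic) distribution, T is the target (real) distribution, h is the trained model, and H is the family of models it was chosen from.

```latex
\epsilon_T(h) \;\le\; \epsilon_S(h)
  \;+\; \tfrac{1}{2}\, d_{\mathcal{H}\Delta\mathcal{H}}(\mathcal{D}_S, \mathcal{D}_T)
  \;+\; \lambda,
\qquad
\lambda \;=\; \min_{h' \in \mathcal{H}} \bigl[\, \epsilon_S(h') + \epsilon_T(h') \,\bigr]
```

The first term is the synthetic error, the second is the divergence between the two distributions, and the third is the combined error of the best single model on both domains. Setting the first term to zero leaves the other two untouched.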

The bound tells you which lever does what. Making the model better on synthetic data pushes down the first term. Making the generator cover more of what real scans contain pushes down the divergence term, and Tremblay and colleagues showed exactly this, that widening a generator's variety measurably narrows the reality gap on a real task [4]. Neither lever touches the other. You can drive synthetic accuracy to a mirror finish and still fail on real scans if the divergence is large, and a mediocre synthetic score can transfer fine because the generator covered the real variety. Confusing the two numbers is the most common way a synthetic pipeline gets oversold.

The exhibit below is the whole argument in one shape. The teal curve is how the sim-trained model scores as the generator is made more realistic and varied, climbing toward the real-data ceiling it actually reached on the recovered curve. The orange bracket is the residual reality gap that stays open at the top of the sweep. Drag the realism lever right and watch the gap narrow and refuse to close.

Exhibit: Why a model trained on fake scans can score well on real ones, and why it never fully catches up. The teal curve is the sim-trained model's held-out quality as the synthetic generator is made more realistic and varied, climbing from a plain generator toward the real-data ceiling. That ceiling is the sourced anchor: the peak R-squared of 0.9891 the model actually reached on the recovered curve after training on the 15,000-log multiclass corpus for 50 epochs. Drag the realism lever right and the teal marker walks up the climb, but the orange bracket at the right edge, the residual reality gap, narrows without ever shutting. The reason it stays open is visible on the right rail: real classes land unevenly under one model, background F1 0.97 against curve F1 of 0.37 and 0.32, so polishing the generator raises the floor but cannot manufacture the parts of the real distribution the fakes never contained. The ceiling value, the 15,000 and 2,000 log corpus sizes, the 50-epoch 550-minute budget, and the class F1 scores are sourced from the engagement archive. The generator-realism axis and the shape of the climb are illustrative reasoning geometry, not measured points.

Why the gap on our task never went to zero

Our own numbers show the gap rather than hide it. On the recovered curve the model reached a peak coefficient of determination of 0.9891, the real-data ceiling in the exhibit and a strong result for a model that trained on nothing but fakes. But that headline hides the unevenness underneath. Class by class, the background mask is nearly perfect at an F1 of 0.97, while the two curve masks land at 0.37 and 0.32. The easy, abundant part of the picture transferred almost completely; the thin, high-value part, the curves that are the entire point, transferred far less well. That spread is the reality gap made concrete: the fakes taught background well because background is easy to generate faithfully, and taught curves less well because the specific ways real curves degrade, overlap, and fade are exactly the variety a generator struggles to anticipate.
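The pattern of a near-perfect background score beside weak curve scores is easy to reproduce on a toy mask. The sketch below computes per-class F1 from pixel labels. The function `per_class_f1` and the tiny example masks are illustrative; they are not the evaluation code or data behind the figures above.

```python
def per_class_f1(true_mask, pred_mask, num_classes):
    """Per-class F1 over all pixels of two equally sized label grids."""
    tp = [0] * num_classes
    fp = [0] * num_classes
    fn = [0] * num_classes
    for true_row, pred_row in zip(true_mask, pred_mask):
        for t, p in zip(true_row, pred_row):
            if t == p:
                tp[t] += 1
            else:
                fp[p] += 1  # predicted class p where it was not present
                fn[t] += 1  # missed the true class t
    scores = []
    for c in range(num_classes):
        denom = 2 * tp[c] + fp[c] + fn[c]
        scores.append(2 * tp[c] / denom if denom else 0.0)
    return scores


# Toy page: 4 rows by 10 columns, a one-pixel curve in column 3.
truth = [[1 if x == 3 else 0 for x in range(10)] for _ in range(4)]
# Prediction: correct in the top two rows, one pixel off in the bottom two.
pred = [[1 if x == (3 if y < 2 else 4) else 0 for x in range(10)]
        for y in range(4)]

scores = per_class_f1(truth, pred, num_classes=2)
correct = sum(t == p for tr, pr in zip(truth, pred) for t, p in zip(tr, pr))
accuracy = correct / 40
print(scores, accuracy)
```

On this toy page the background F1 is about 0.94 and pixel accuracy is 0.9, while the curve F1 is only 0.5. A thin class can be half wrong and barely move the aggregate, which is why the per-class breakdown matters more than any single headline number.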

This is the reasoning that generalises past our task. When a sim-to-real model underperforms, the useful question is not just how good it is but which parts of the real distribution the generator failed to cover, because that is where the errors will be. For us the answer was the thin structures, which told us where to spend the next round of generator effort and where the model needed a human check. The residual gap did not close, but it stopped being mysterious. It had an address.

How to reason about a synthetic pipeline before you trust it

If you take one habit from this, ask two separate questions and never let one answer stand in for the other. First, how well does the model do on held-out synthetic data? That tells you whether it learned the task at all. Second, and independently, how much of the real distribution does the generator actually cover? That tells you whether the learning will transfer. A high synthetic score with poor coverage is a model that looks excellent in a report and fails in the field. A modest synthetic score with honest coverage can be the more trustworthy pipeline, because it is not hiding its reality gap behind a flattering in-distribution number.
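One way to keep the two answers apart is to report them as two numbers and refuse to merge them. The sketch below is an assumption-heavy illustration, not something the pipeline is known to do: `coverage`, `report`, the attribute names, the ranges, and both thresholds are hypothetical. Coverage here is a crude proxy, the share of real pages whose measured attributes all fall inside the ranges the generator samples from.

```python
def coverage(real_pages, generator_ranges):
    """Fraction of real pages whose every listed attribute lies inside
    the generator's sampled range. A crude proxy, not a distance measure."""
    if not real_pages:
        return 0.0
    covered = 0
    for page in real_pages:
        if all(lo <= page[name] <= hi
               for name, (lo, hi) in generator_ranges.items()):
            covered += 1
    return covered / len(real_pages)


def report(synthetic_score, cov, score_floor=0.9, coverage_floor=0.8):
    """Keep the two questions separate; thresholds are illustrative."""
    learned = synthetic_score >= score_floor
    covers = cov >= coverage_floor
    if learned and covers:
        return "both checks pass: still validate on real scans"
    if learned:
        return "high synthetic score, poor coverage: widen the generator"
    if covers:
        return "coverage is adequate: improve the model on synthetic data"
    return "both the model and the generator need work"


# Hypothetical generator ranges and hand-measured real pages.
generator_ranges = {"curve_width": (1, 3), "page_size_px": (3000, 12000)}
real_pages = [
    {"curve_width": 2, "page_size_px": 8000},
    {"curve_width": 5, "page_size_px": 9000},   # curve thicker than any fake
    {"curve_width": 1, "page_size_px": 14000},  # page larger than any fake
    {"curve_width": 3, "page_size_px": 3000},
]

cov = coverage(real_pages, generator_ranges)
print(cov, report(0.97, cov))
```

A range check like this only sees the attributes someone thought to measure, so it cannot notice smudges, hand annotations, or other things the generator never drew. It gives a lower bar for coverage, not a guarantee of transfer.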

The corollary is that the generator, not the model, is where most of the real work lives. Once the architecture is competent, further gains come from the generator drawing more of what reality contains, the failure modes and degradations and odd cases, rather than from another point of synthetic accuracy. A synthetic-data project is less a modelling problem than the problem of reproducing the variety of the real world, done well enough that the model never depends on the things that separate a fake from the real thing.

Limitations

This is a primer, and the numbers should be read as such. The real-data ceiling of R-squared 0.9891 and the per-class F1 figures of 0.97, 0.37, and 0.32 are measured archive results on our curve-segmentation runs, and the corpus sizes of 15,000 and 2,000 logs and the training budget are logged. The generator-realism axis in the exhibit, and the shape of the climb toward the ceiling, are illustrative reasoning geometry rather than a measured sim-versus-real sweep: we did not run the model at graded, quantified realism levels and plot the points, so that axis carries the intuition, not evidence. The residual gap is real, but its exact width at a given generator setting is not a number we measured. The bound we lean on describes the existence and structure of the gap, not its size on any task, and the coverage question is answered qualitatively here rather than by a formal distribution-distance measurement. Whether a checkpoint transfers to a specific operator's archive remains a question only that operator's real scans can settle.

References

[1] Tobin, J., Fong, R., Ray, A., Schneider, J., Zaremba, W., and Abbeel, P. Domain Randomization for Transferring Deep Neural Networks from Simulation to the Real World. IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) 2017. The mechanism: enough rendering variety makes real data look like one more sample from the training distribution. https://arxiv.org/abs/1703.06907

[2] Ben-David, S., Blitzer, J., Crammer, K., Kulesza, A., Pereira, F., and Vaughan, J. W. A Theory of Learning from Different Domains. Machine Learning 79 (2010), pp. 151-175. The bound tying target-domain error to source error plus a divergence between the distributions. https://link.springer.com/article/10.1007/s10994-009-5152-4

[3] Peng, X., Sun, B., Ali, K., and Saenko, K. Learning Deep Object Detectors from 3D Models. IEEE International Conference on Computer Vision (ICCV) 2015. Evidence that rendering variety matters more than photorealism for synthetic-to-real transfer. https://arxiv.org/abs/1412.7122

[4] Tremblay, J., Prakash, A., Acuna, D., Brophy, M., Jampani, V., Anil, C., To, T., Cameracci, E., Boochoon, S., and Birchfield, S. Training Deep Networks with Synthetic Data: Bridging the Reality Gap by Domain Randomization. CVPR Workshops 2018. A direct demonstration that widening generator variety narrows the reality gap on a real task. https://arxiv.org/abs/1804.06516
