Synthetic-to-Real Transfer in Industrial Inspection: Evidence Across Five Domains

Abstract

Training a model on cheaply generated synthetic data and deploying it on scarce, expensive real data is one of the oldest bargains in applied computer vision, and industrial inspection is where the bargain is most tempting, because real labelled defects, real scanned instruments, and real annotated tiles are all costly to gather at scale. The bargain does not always pay. This survey reads the published record on synthetic-to-real transfer across five industrial-inspection domains and asks a single question of each: does the reported success of the transfer track how faithfully the synthetic generator reproduced the target's actual degradation? We find a consistent answer. Domains whose generators model the real corruption directly, notably scene and document text, where blur, perspective warp, and compression noise are simulated on purpose, report the strongest transfer. Domains whose generators reproduce geometry but leave the sensor and the atmosphere unmodelled, notably aerial and remote-sensing tiles, report the weakest. Between them sit robotic flight from computer-aided-design renders and manufacturing surface-defect inspection, exactly where a partial degradation match predicts. We anchor the reading on the one domain in our own archive we can pin to measured numbers, a raster well-log digitiser in which 15,000 synthetic multiclass instances transfer toward 136,771 real scanned logs at a peak intersection over union of 0.51 and a best single-curve goodness-of-fit of R-squared 0.9891. The claim is modest and specific: across these five domains, transfer succeeds in proportion to how well the generator's degradation model matches the target, and the corollary for a practitioner is that effort spent modelling the target's corruption returns more than effort spent on generic appearance realism.

The idea that a network trained on synthetic images can be deployed on real ones is older than the phrase now attached to it. The clearest early statement of the appearance-gap version came from Tobin and colleagues, who randomised the textures, lighting, and camera parameters of a simulator so aggressively that the real world became, to the trained network, one more variation it had already seen [1]. The recipe works when the thing that must transfer is invariant to appearance, and it is the reason the domain-randomisation literature reads as a study of what survives randomisation and what does not.

Two domains established the strong end of the transfer record early, both in text. Jaderberg and colleagues built a synthetic word generator whose fonts, colour, blur, and background noise were tuned to look like the recogniser's real inputs, and trained a scene-text recogniser on synthetic words alone that worked on real photographs [3]. Gupta and colleagues extended the recipe to detection by rendering text into real background scenes with blending, perspective, and noise chosen to mimic how text actually sits in a photograph, and again the synthetic-trained detector transferred [2]. The common thread is not that the images looked photorealistic in an absolute sense; it is that the specific corruptions a real text image carries, the blur, the warp, the compression artifacts, were modelled deliberately.

The robotics literature marks a different point on the same axis. Sadeghi and Levine trained a flight policy entirely in randomised computer-aided-design environments and flew it in the real world without a single real training image [4]. What transferred there was geometry and obstacle layout, the parts of the task that a CAD render gets right; appearance was randomised over rather than matched, because the policy did not need to read fine texture. Tremblay and colleagues then quantified the dependence for object detection, measuring how much appearance randomisation a synthetic-trained detector needs before it closes the reality gap on real photographs [5], which is the same question asked with a dial rather than a switch.

The weak end of the record is instructive precisely because the generators there are good in the ways that do not help. Ros and colleagues released a large synthetic urban dataset for semantic segmentation, and the subsequent literature on transferring from it to real street scenes documents a stubborn residual gap that appearance realism alone does not close [6], because the sensor response, the atmosphere, and the label statistics of a real camera differ from the render in ways the render did not set out to reproduce. Surface-defect inspection sits between the extremes: Bosnar and colleagues built a physics-aware synthesis pipeline for surface inspection in which transfer depends on modelling the optical signature of a defect, its interaction with light, and not merely its shape [7]. The dense-prediction backbone underlying much of this work, and our own well-log anchor, is the U-Net of Ronneberger and colleagues [8], and the well-log domain itself has a prior digitisation lineage that leaned on real scanned inputs [9], which is exactly why a faithful synthetic scan generator is the interesting variable there.

Method

This is a structured reading of the published transfer record, not a new benchmark, and the procedure was kept narrow so the claim stays defensible. We fixed one explanatory variable, the fidelity with which a synthetic generator reproduces the target domain's real degradation, and one outcome variable, the reported strength of synthetic-to-real transfer, and we placed five industrial-inspection domains on those two axes. The five are scene and document text [2] [3], robotic flight from CAD renders [4] [5], manufacturing surface-defect inspection [7], aerial and remote-sensing style urban segmentation [6], and raster well-log curve digitisation [8] [9]. For each we extracted the same two things from the literature and, where it is ours, from the engagement archive: what the generator actually modelled about the target's corruption, and how strong the reported transfer was.

Two honesty constraints shape everything that follows. First, the degradation-match coordinate is a survey reading, not a measured scalar; there is no universal unit in which the atmosphere gap of a satellite tile and the blur gap of a document scan are commensurable, so we place the five domains on a common ordinal band and argue the ordering, not exact positions. Second, of the five domains only one is ours to measure. The raster well-log case is drawn from our archive and carries real numbers; the other four carry the qualitative verdicts of their published sources. The interactive exhibit below is built on the same footing, with the well-log anchor flagged as sourced and every other placement flagged as an illustrative survey reading, so no reader mistakes the ordinal band for a re-measured metric.

The axis that sorts the evidence

The variable that sorts the five domains is not photorealism and it is not simulator sophistication. It is a narrower thing: whether the generator modelled the particular way the real target is corrupted between the ideal signal and the pixels the network sees. A document scanner blurs, warps, compresses, and speckles; a text generator that reproduces those four corruptions transfers well [2] [3]. A satellite delivers a specific atmospheric scattering, a specific sensor noise floor, and a specific ground-sample distance; an urban synthetic dataset that renders geometry beautifully but does not reproduce those sensor and atmosphere effects leaves a gap that realism does not close [6]. The distinction matters because it redirects effort. The intuition that a more photorealistic render is always a better render is wrong in a useful direction: past the point where the target's actual corruption is captured, extra realism buys little, and before that point, no amount of generic prettiness substitutes for modelling the corruption that is there.

Surface-defect inspection is the cleanest demonstration that the axis is corruption fidelity rather than realism, because a defect is defined by its optical behaviour. A scratch is not a shape; it is a shape with a characteristic way of catching light, and a synthesis pipeline that renders the shape without the optical signature produces training data that teaches the wrong invariance [7]. The physics-aware pipelines that model the light interaction transfer; the geometry-only ones do not, even when the geometry is exact.

Where the well-log case falls

The raster well-log domain is the one we can measure, and it lands where the axis predicts. A well log arrives as a scanned raster image of curves drawn on gridded paper, and the corruption between the ideal curve and the scanned pixel is specific and enumerable: paper texture, scan blur, gridline interference, compression, and the wild variation in image width as logs run from a few thousand to over twelve thousand pixels wide. A synthetic generator for this domain succeeds to the degree that it reproduces that scan degradation directly rather than drawing clean curves. In our engagement the synthetic set grew to 15,000 multiclass instances built to carry exactly that degradation, and the transfer target was the real archive of 136,771 scanned log images. The measured outcome of that transfer, on a fifty-epoch multiclass segmentation, was a peak intersection over union of 0.51 and, on the cleanest curve example, a best single-curve goodness-of-fit of R-squared 0.9891.
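To make the enumerated corruptions concrete, the following is a minimal sketch of a generator that draws a curve on gridded paper and then degrades it. It is an illustration under stated assumptions, not the engagement's generator: the function name `render_degraded_log`, the toy raster size, the 3-tap box blur standing in for scan blur, and the 16-level quantisation standing in for compression are all hypothetical choices made for readability.

```python
import math
import random


def render_degraded_log(width=120, height=80, grid_step=10, seed=0):
    """Return (image, mask) for one toy synthetic log.

    image: rows of ints in 0..255 (high values = pale paper), after degradation.
    mask:  rows of 0/1 marking the clean, undegraded curve pixels.
    All parameter values are illustrative assumptions.
    """
    rng = random.Random(seed)

    # Paper texture: a near-white background with per-pixel noise.
    image = [[235 + rng.randint(-15, 15) for _ in range(width)]
             for _ in range(height)]
    mask = [[0] * width for _ in range(height)]

    # Gridline interference: grey lines every grid_step pixels, both directions.
    for r in range(height):
        for c in range(width):
            if r % grid_step == 0 or c % grid_step == 0:
                image[r][c] = min(image[r][c], 150)

    # The curve: one dark pixel per depth row, also recorded in the clean mask.
    for r in range(height):
        c = round(width / 2 + width / 4 * math.sin(2 * math.pi * r / 40))
        image[r][c] = 20
        mask[r][c] = 1

    # Scan blur: 3-tap horizontal box filter with edge replication.
    blurred = []
    for row in image:
        padded = [row[0]] + row + [row[-1]]
        blurred.append([(padded[i] + padded[i + 1] + padded[i + 2]) // 3
                        for i in range(width)])

    # Compression stand-in: coarse quantisation to multiples of 16.
    degraded = [[(v // 16) * 16 for v in row] for row in blurred]
    return degraded, mask
```

Width variation, the fifth corruption on the list, would be covered in this sketch by sampling `width` per instance instead of fixing it. The network trains on the degraded image while the label stays the clean mask, so the degradation is something it has to see through, never something it is asked to reproduce.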

Five industrial-inspection domains arranged by one variable: how closely the synthetic generator's degradation model matches the real target. Read left to right, the reported strength of synthetic-to-real transfer rises with that match, so the domains fall into a rising band rather than scattering, which is the survey's whole claim. The orange marker is the only measured element: the raster well-log case, the single domain we can pin to measured numbers rather than to the literature's qualitative reports, sitting where a strong degradation match predicts it should, with 15,000 synthetic multiclass instances transferring toward 136,771 real TIF scans at a peak IoU of 0.51 and a best curve-fit R-squared of 0.9891. Drag the match-threshold read-head along the bottom to cut the same five domains into those whose generator matches the target above the line and those below it; the read-out shows the average reported transfer on each side and the gap between them. The cut line only moves where you slice the band; it never moves where a domain sits. The well-log numbers are sourced from the engagement archive; the degradation-match coordinate for every domain and the transfer readings for the four non-well-log rows are illustrative survey placements, not re-measured metrics.

The two well-log numbers say different things, and reading them honestly is the point of anchoring here. The peak intersection over union of 0.51 is the pixel-overlap metric on thin, sparse curve masks, and it is a hard metric by construction, because a curve one or two pixels wide punishes any spatial slip heavily; 0.51 is a respectable score for that class of target, not a failure. The best single-curve R-squared of 0.9891 is the downstream metric that the consumer of a digitiser actually cares about, the goodness-of-fit between the recovered curve values and the truth, and it is high because the corruption the generator modelled matched the corruption the real scans carried. The gap between a hard pixel metric and a strong value-recovery metric is itself the signature of a good degradation match: the network learned the real corruption well enough to recover the underlying signal even where the pixel-exact mask is unforgiving.
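The divergence between the two metrics can be reproduced on a toy curve. The sketch below is an illustration only, with hypothetical helper names and a made-up sinusoidal curve; it does not reproduce the engagement's 0.51 or 0.9891. It draws a curve two pixels wide, slips the prediction sideways by one pixel, and compares mask overlap with value recovery.

```python
import math


def curve_columns(n_rows=200, centre=100, amplitude=40, period=50):
    """Column index of a toy curve at each depth row (illustrative values)."""
    return [round(centre + amplitude * math.sin(2 * math.pi * r / period))
            for r in range(n_rows)]


def curve_mask(columns, width=2, shift=0):
    """Set of (row, col) pixels for a curve drawn `width` pixels wide."""
    return {(r, c + shift + k)
            for r, c in enumerate(columns)
            for k in range(width)}


def iou(a, b):
    """Intersection over union of two pixel sets."""
    return len(a & b) / len(a | b)


def r_squared(truth, recovered):
    """Coefficient of determination of recovered values against truth."""
    mean = sum(truth) / len(truth)
    ss_res = sum((t - p) ** 2 for t, p in zip(truth, recovered))
    ss_tot = sum((t - mean) ** 2 for t in truth)
    return 1 - ss_res / ss_tot


truth = curve_columns()
recovered = [c + 1 for c in truth]  # a one-pixel horizontal slip everywhere

pixel_iou = iou(curve_mask(truth), curve_mask(truth, shift=1))
value_fit = r_squared(truth, recovered)
```

In each row the true and predicted masks share one pixel out of three, so `pixel_iou` is exactly one third, while `value_fit` stays above 0.99 because a one-pixel error is small against the curve's swing. A thin mask turns a slight spatial slip into a large overlap penalty even when the recovered values are nearly right.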

Reading the band, not the points

The exhibit above arranges the five domains on the degradation-match axis and lets the reader drag a threshold that splits them. The shape of the argument is the rising band: as the generator's degradation match improves from the atmosphere-blind aerial case through the geometry-only flight case, the partly modelled defect case, the closely mimicked text case, and finally the directly modelled well-log case, the reported transfer strength climbs with it. The well-log anchor is drawn in the exhibit's single accent colour because it is the one point we can defend with measured numbers rather than a field verdict; everything else on the band is an ordinal survey placement, flagged as such on the canvas. Dragging the threshold does not move any domain; it only moves where the same five domains are cut, and on every cut that leaves domains on both sides, those above the line report stronger transfer than those below it. That is the whole claim rendered as a lever: the ordering is the finding, and the cut point is the reader's to choose.

The reason to render it as a band rather than a table is that the individual positions are contestable and the ordering is not. One could argue whether robotic flight sits slightly above or below manufacturing defects, because the degradation-match coordinate is a reading. One cannot plausibly argue that the atmosphere-blind aerial case transfers better than the corruption-matched text case; the literature is consistent on the endpoints even where it is fuzzy in the middle [2] [3] [6]. The band survives the fuzziness in a way that precise coordinates would not.
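The read-out logic and the robustness claim can be sketched in a few lines. Everything numeric below is a hypothetical ordinal placeholder: the ranks 1 to 5 simply encode the ordering stated in the text (aerial, flight, defect, text, well-log) and are neither the exhibit's values nor measurements. The names `cut_readout` and `ordering_holds` are illustrative as well.

```python
def cut_readout(domains, threshold):
    """Split (name, match_rank, transfer_rank) rows at a match threshold.

    Returns (mean transfer above, mean transfer below, gap).
    Assumes the threshold leaves at least one domain on each side.
    """
    above = [t for _, m, t in domains if m > threshold]
    below = [t for _, m, t in domains if m <= threshold]
    mean_above = sum(above) / len(above)
    mean_below = sum(below) / len(below)
    return mean_above, mean_below, mean_above - mean_below


# Hypothetical ordinal ranks (1 = weakest, 5 = strongest), NOT measured values.
BAND = [
    ("aerial and remote-sensing tiles", 1, 1),
    ("robotic flight from CAD renders", 2, 2),
    ("surface-defect inspection", 3, 3),
    ("scene and document text", 4, 4),
    ("raster well-log digitisation", 5, 5),
]

# The same rows with the two contestable interior match positions swapped.
SWAPPED = [(name, {2: 3, 3: 2}.get(m, m), t) for name, m, t in BAND]


def ordering_holds(domains):
    """True if every interior cut leaves stronger mean transfer above the line."""
    return all(cut_readout(domains, k + 0.5)[2] > 0 for k in range(1, 5))
```

Under these placeholder ranks, `ordering_holds` is true for both `BAND` and `SWAPPED`. Swapping robotic flight and surface-defect inspection on the match axis changes the size of the gap at some cuts but never its sign.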

Discussion

The practical reading of this survey is a reallocation of effort. A team standing up a synthetic pipeline for an inspection task has a fixed budget of engineering attention, and the default temptation is to spend it on realism, on better textures, better lighting, better renders. The five-domain record says the higher-return spend is on the target's specific corruption: characterise how the real signal is degraded before it reaches the network, then build a generator that reproduces that degradation, even at the cost of a render that looks less pretty in isolation. The text domains earned their strong transfer by modelling blur and warp and compression, not by photorealism [2] [3]; the well-log domain earned its strong value-recovery by modelling scan degradation, not by drawing beautiful curves [8] [9]; the aerial domain paid for its weak transfer by getting geometry right and the sensor wrong [6].

It is worth marking the line between this survey and our applied writing, because the well-log case appears in both. Here it is one anchor among five, used only to pin a single point of the cross-domain band to measured numbers rather than to a field verdict, and the survey makes no claim about the digitiser as a product. The general lesson is the one the five domains share: synthetic-to-real transfer is not a property of how real the synthetic data looks, but of how well it carries the specific corruption the real target will impose, and a practitioner who internalises that will spend a synthetic-data budget very differently from one who chases realism for its own sake.

Limitations

This is a survey and carries a survey's limits. It does not re-implement or re-measure the four non-well-log domains; their transfer verdicts are the qualitative reports of their published sources, placed on a common ordinal band, and the degradation-match coordinate for every domain, including the well-log case, is a survey reading rather than a measured scalar in a universal unit. There is no unit in which the atmosphere gap of a satellite tile and the blur gap of a document scan are directly commensurable, so the exhibit argues an ordering and not exact positions. The interior positions, particularly the relative order of robotic flight and surface-defect inspection, are contestable; the endpoints are not. Only the well-log domain carries measured numbers: 15,000 synthetic multiclass instances, 136,771 real scanned logs, a peak intersection over union of 0.51, and a best single-curve R-squared of 0.9891, from a single multiclass run on one architecture in one engagement, used as a worked anchor and not as a fresh benchmark against the other four. The survey scopes itself to five inspection domains and to work published on or before its own quarter, so later refinements in generative modelling of sensor degradation are out of frame. A reader should take this as a decision rule for where to spend a synthetic-data budget (model the target's corruption before chasing realism), not as a substitute for measuring transfer on their own domain and metric.

What to carry from the survey

Across five industrial-inspection domains, reported synthetic-to-real transfer tracks one variable: how faithfully the synthetic generator reproduced the target's real degradation, not how photorealistic the render looked.
The strong end is scene and document text, where blur, perspective warp, and compression noise are modelled on purpose; the weak end is aerial and remote-sensing tiles, where geometry is rendered well but sensor and atmosphere effects are left unmodelled.
Surface-defect inspection is the cleanest proof the axis is corruption fidelity, not realism: a defect is a shape with an optical signature, and geometry-only renders teach the wrong invariance while physics-aware ones transfer.
The measured anchor is the raster well-log case: 15,000 synthetic multiclass instances transferring toward 136,771 real scanned logs, at a peak intersection over union of 0.51 and a best single-curve R-squared of 0.9891. The hard pixel metric alongside the strong value-recovery metric is itself the signature of a good degradation match.
The practical rule is a reallocation of effort: characterise how the real signal is corrupted before it reaches the network, then build a generator that reproduces that corruption, even at the cost of a render that looks less pretty in isolation.

The smallest habit this survey would install is a question to ask before a single synthetic image is rendered: what, exactly, corrupts the real signal between the instrument and the network, and does the generator reproduce that corruption? The five-domain record says the answer to that question predicts transfer better than any judgment of how real the synthetic data looks.

References

[1] Tobin, J., Fong, R., Ray, A., Schneider, J., Zaremba, W., and Abbeel, P. Domain Randomization for Transferring Deep Neural Networks from Simulation to the Real World. IROS (2017). Randomises simulator appearance so the real world reads as one more variation. https://arxiv.org/abs/1703.06907

[2] Gupta, A., Vedaldi, A., and Zisserman, A. Synthetic Data for Text Localisation in Natural Images. CVPR (2016). Renders text into real scenes with blending, perspective, and noise that mimic the target corruption. https://arxiv.org/abs/1604.06646

[3] Jaderberg, M., Simonyan, K., Vedaldi, A., and Zisserman, A. Synthetic Data and Artificial Neural Networks for Natural Scene Text Recognition. NeurIPS Deep Learning Workshop (2014). A synthetic word generator whose fonts, blur, and background noise model the recogniser's real inputs closely. https://arxiv.org/abs/1406.2227

[4] Sadeghi, F., and Levine, S. CAD2RL: Real Single-Image Flight without a Single Real Image. RSS (2017). Trains a flight policy in randomised CAD environments and flies it in the real world, transferring geometry while appearance is randomised over. https://arxiv.org/abs/1611.04201

[5] Tremblay, J., Prakash, A., Acuna, D., Brophy, M., Jampani, V., Anil, C., To, T., Cameracci, E., Boochoon, S., and Birchfield, S. Training Deep Networks with Synthetic Data: Bridging the Reality Gap by Domain Randomization. CVPR Workshops (2018). Quantifies how much appearance randomisation closes the reality gap on real photographs. https://arxiv.org/abs/1804.06516

[6] Ros, G., Sellart, L., Materzynska, J., Vazquez, D., and Lopez, A. M. The SYNTHIA Dataset: A Large Collection of Synthetic Images for Semantic Segmentation of Urban Scenes. CVPR (2016). Exposes the residual real-scene gap that appearance realism alone does not close. https://openaccess.thecvf.com/content_cvpr_2016/html/Ros_The_SYNTHIA_Dataset_CVPR_2016_paper.html

[7] Bosnar, L., Saric, D., Dutta, S., Weibel, T., Rauhut, M., Hagen, H., and Gospodnetic, P. Image Synthesis Pipeline for Surface Inspection. LEVIA (2020). A physics-aware rendering pipeline where transfer depends on modelling a defect's optical signature, not just its shape. https://publica.fraunhofer.de/handle/publica/410149

[8] Ronneberger, O., Fischer, P., and Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation. MICCAI (2015). The dense-prediction backbone that recurs across inspection domains and underlies the well-log anchor. https://arxiv.org/abs/1505.04597

[9] Yuan, B., and Yang, Q. Digitization of Well-Logging Parameter Graphs Based on a Gridlines-Elimination Approach. Journal of Petroleum Exploration and Production Technology (2019). A prior raster-log digitisation approach whose reliance on real scanned inputs frames why a faithful synthetic scan generator matters. https://link.springer.com/article/10.1007/s13202-019-0640-y
