Closing the Sim-to-Real Gap with Targeted Degradation Models

Abstract

A model that scores R-squared 0.9891 on its own held-out renders has not been shown to read a single real scan. We trained VeerNet, our encoder-decoder segmenter with a transformer refinement stage, entirely on 15,000 procedurally generated well-log curves and then asked it to recover curves from scanned Texas Railroad Commission rasters it had never seen. The clean-synthetic peak was excellent and almost meaningless: it measures how well the network reads renders that look exactly like its training set. The number that pays for the engagement is the real-scan one, and the distance between the two is the sim-to-real gap. This note pairs that result with the public literature on degradation modelling, domain randomisation and corruption robustness, and makes one claim precise. Generic augmentation, the kind that rotates and jitters an image to inflate a dataset, does not close the gap. Modelling the specific degradations a real scan carries does: perspective tilt, paper noise, ink bleed and skew, each rendered the way the acquisition channel actually produces it. The literature established which corruptions matter and why widening the simulator beats perfecting one render. Our contribution is the price tag on each term for one concrete document-segmentation task.

Background

The idea that you can train on rendered data and deploy on real data is old, and the document and scientific-imaging communities settled it early because their inputs are synthetic to begin with. A printed well log is a deterministic rendering of curve values through a known drawing process; own the process and the pixel-perfect masks come for free. Scene-text recognition was trained this way before it was trained any other way, on engines that render text through randomised fonts and backgrounds and then apply an acquisition-noise model (Jaderberg et al., 2014), and the localisation work that followed made the same point sharper: realism comes from modelling how the text sits in a real scene, not from raw render count (Gupta et al., 2016). The U-Net paper had already argued the labelling half of this: with only a handful of annotated images, elastic-deformation augmentation manufactures the variation a small labelled set cannot supply (Ronneberger et al., 2015).

What the robotics literature added is the part that matters here: a result about which kind of synthetic variation transfers. Domain randomisation made the counter-intuitive case that you should not try to make one render photorealistic; you should randomise the simulator so aggressively that the real world looks like just another sample from the training distribution (Tobin et al., 2017). The complementary line of work narrows the gap from the other side, adapting synthetic imagery toward the real-image distribution at the pixel level rather than only in feature space (Bousmalis et al., 2017). Both agree on the diagnosis. A model fails on real data when the real data carries structure the renders never did, and the fix is to put that structure into the renders.

For well logs that structure can be listed by name, and the classical baseline supplies the list. A gridlines-elimination pipeline that digitises well-logging parameter graphs with morphology alone works on clean scans and degrades precisely on skew, overlapping curves, faded ink and broken gridlines (Yuan and Yang, 2019). That failure surface is not a coincidence; it is the inventory of real-scan degradations, handed to us by the method that could not survive them. The corruption-robustness literature gives the same inventory a measurement frame: a taxonomy of common image corruptions, grouped into noise, blur, weather and digital families, with a protocol for grading a model under each one in turn (Hendrycks and Dietterich, 2019). The targeted-degradation ablation in this note is that protocol applied to a renderer rather than to a test set.

Method

The renderer emits a synthetic paper log together with the exact pixel mask of every curve, then passes the raster through a degradation stack before the network ever sees it. The clean configuration draws exactly two curves per log against a ruled background at print resolution, with three output classes; it is the distribution the network is trained on and first validated on. Four degradation terms then model the corruptions a real Texas RRC scan carries.

Perspective tilt applies a small homography: a flatbed or phone capture is never square to the page, so the grid the network anchors on is warped before it is read. Paper noise adds the high-frequency speckle of fibre texture, foxing and scanner grain that a clean render simply does not have. Ink bleed spreads and overprints the strokes, so that two analogue curves crossing each other fuse into one blob the mask must still pull apart, which is the hard case classical pipelines choke on. Skew applies a rotational shear, the crooked page feed, which moves every depth row and is, in our hands, the single largest real-scan term. These are not the generic augmentations of a training loop; each is a model of one acquisition-channel corruption, in the spirit of the corruption taxonomy (Hendrycks and Dietterich, 2019) and the AugMix observation that mixing augmentation chains widens support rather than deepening it (Hendrycks et al., 2020).
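The four terms can be sketched as toggleable functions over a toy raster. This is a minimal, dependency-free illustration and not the renderer used in the engagement: the 0/1 list-of-rows raster, the function names, the parameter values and the order of the stack are all assumptions, and the keystone warp is a simplified stand-in for a full homography.

```python
import random

# Hypothetical toy format: a raster is a list of rows of 0/1 ink values.


def perspective_tilt(img, k):
    """Keystone warp: row width scales linearly from top to bottom.
    Simplified stand-in for a homography; k should stay small (|k| < 2)."""
    h, w = len(img), len(img[0])
    cx = (w - 1) / 2
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        scale = 1.0 + k * (y / max(h - 1, 1) - 0.5)
        for x in range(w):
            sx = round(cx + (x - cx) / scale)  # inverse map, nearest pixel
            if 0 <= sx < w:
                out[y][x] = img[y][sx]
    return out


def paper_noise(img, p, rng):
    """Speckle: flip each pixel independently with probability p."""
    return [[1 - v if rng.random() < p else v for v in row] for row in img]


def ink_bleed(img, radius):
    """Spread every inked pixel over a square neighbourhood (dilation)."""
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if img[y][x]:
                for yy in range(max(0, y - radius), min(h, y + radius + 1)):
                    for xx in range(max(0, x - radius), min(w, x + radius + 1)):
                        out[yy][xx] = 1
    return out


def skew(img, shear):
    """Shear: shift each row sideways in proportion to its distance
    from the middle row, so every depth row moves."""
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        dx = round(shear * (y - h // 2))
        for x in range(w):
            if 0 <= x - dx < w:
                out[y][x] = img[y][x - dx]
    return out


STACK = ("perspective_tilt", "paper_noise", "ink_bleed", "skew")


def degrade(img, enabled, seed=0):
    """Apply the enabled terms in STACK order (the order is an assumption)."""
    unknown = set(enabled) - set(STACK)
    if unknown:
        raise ValueError(f"unknown degradation terms: {sorted(unknown)}")
    rng = random.Random(seed)
    steps = {  # placeholder strengths, not calibrated to any real scan
        "perspective_tilt": lambda im: perspective_tilt(im, k=0.2),
        "paper_noise": lambda im: paper_noise(im, p=0.02, rng=rng),
        "ink_bleed": lambda im: ink_bleed(im, radius=1),
        "skew": lambda im: skew(im, shear=0.1),
    }
    for name in STACK:
        if name in enabled:
            img = steps[name](img)
    return img
```

Calling `degrade(raster, {"skew"})` switches on a single term, which is the shape the one-at-a-time ablation needs; in a real pipeline the curve masks would have to pass through the same geometric terms as the raster so that labels stay aligned.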

We measure on an 80/20 train/validation split of the 15,000-curve corpus, and report the per-curve coefficient of determination R-squared between the predicted numeric trace and ground truth. All figures here are under Tversky loss in the multiclass setting (Salehi et al., 2017); the choice of loss is its own study and is held fixed throughout this one. The ablation is read the way the corruption-robustness protocol reads a test set: switch one degradation on, hold the rest of the pipeline constant, and record how far the validation R-squared falls off the clean peak.
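For reference, the metric and the loss named above can be written out in a few lines. This is a sketch of the standard definitions, not the engagement's training code: the function names, the flat per-pixel data layout, the smoothing constant and the default weights `alpha = beta = 0.5` are assumptions, since this note does not state the weights that were used.

```python
from typing import Sequence


def r_squared(truth: Sequence[float], pred: Sequence[float]) -> float:
    """Coefficient of determination between one ground-truth trace
    and the trace recovered from the predicted mask."""
    if len(truth) == 0 or len(truth) != len(pred):
        raise ValueError("traces must be non-empty and of equal length")
    mean = sum(truth) / len(truth)
    ss_res = sum((t - p) ** 2 for t, p in zip(truth, pred))
    ss_tot = sum((t - mean) ** 2 for t in truth)
    if ss_tot == 0:
        raise ValueError("R-squared is undefined for a constant trace")
    return 1.0 - ss_res / ss_tot


def tversky_loss(
    probs: Sequence[Sequence[float]],  # per pixel: one probability per class
    labels: Sequence[int],             # per pixel: ground-truth class id
    n_classes: int,
    alpha: float = 0.5,                # weight on false positives (assumed)
    beta: float = 0.5,                 # weight on false negatives (assumed)
    eps: float = 1e-7,                 # smoothing to avoid division by zero
) -> float:
    """Multiclass Tversky loss: one minus the mean per-class Tversky index."""
    indices = []
    for c in range(n_classes):
        tp = fp = fn = 0.0
        for p, y in zip(probs, labels):
            g = 1.0 if y == c else 0.0
            tp += p[c] * g           # soft true positives
            fp += p[c] * (1.0 - g)   # soft false positives
            fn += (1.0 - p[c]) * g   # soft false negatives
        indices.append((tp + eps) / (tp + alpha * fp + beta * fn + eps))
    return 1.0 - sum(indices) / n_classes
```

With `alpha = beta = 0.5` the Tversky index reduces to the Dice coefficient; raising `beta` above `alpha` penalises missed curve pixels more than spurious ones, which is the usual reason to prefer it for thin structures. How per-curve R-squared values are aggregated into a single validation figure is not specified here, so no aggregation is shown.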

The instrument below is that ablation made operable. Each toggle is one degradation term; the live R-squared and the gap it opens update as you compose the stack.

Generic augmentation is not enough. A network trained on 15,000 clean synthetic curves peaks at R-squared 0.9891 on its own renders, but that is the clean-synthetic ceiling, not the real-scan score. Switch on each targeted degradation a real Texas RRC field scan carries (perspective tilt, paper noise, ink bleed, skew) and the validation R-squared steps down off the peak toward the realistic mid-range (0.8126, then 0.5461); each term shows how much of the sim-to-real gap it accounts for. The clean peak of 0.9891, the realistic mid-range values of 0.8126 and 0.5461, the 15,000-curve corpus and the 80/20 train/validation split are the engagement's own recorded numbers (Tversky loss, multiclass); the per-term split of the gap and the schematic log render are illustrative.
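The one-term-at-a-time protocol described in this section can be sketched as a small harness. It is a hedged illustration only: `evaluate` is a hypothetical callable that would render the validation set with the given terms enabled and return the validation R-squared, and the term names are placeholders for whatever the renderer exposes.

```python
from typing import Callable, Dict, FrozenSet, Sequence, Tuple

TERMS = ("perspective_tilt", "paper_noise", "ink_bleed", "skew")


def one_at_a_time(
    evaluate: Callable[[FrozenSet[str]], float],
    terms: Sequence[str] = TERMS,
) -> Tuple[float, Dict[str, float]]:
    """Return the clean score and each term's drop, largest drop first.

    Everything except the enabled term is held constant, so each drop is
    the fall in validation R-squared attributable to that term alone.
    """
    clean = evaluate(frozenset())  # nothing switched on: the clean peak
    drops = {t: clean - evaluate(frozenset([t])) for t in terms}
    ranked = dict(sorted(drops.items(), key=lambda kv: kv[1], reverse=True))
    return clean, ranked
```

Because each term is scored in isolation, the drops need not sum to the drop under the full stack; comparing the two is one way to see how strongly the terms interact.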

Results

The clean-synthetic peak is R-squared 0.9891. That is the network reading renders drawn from the same process it trained on, and it is the right ceiling to quote and the wrong number to ship. The moment a realistic degradation enters, the validation R-squared steps down: a representative mid-range example lands at 0.8126, and a harder case, two curves that both overprint and bleed, falls to 0.5461. The endpoints are the measured anchors; the strip between them is where the engagement lives.

Read across the four terms, the ordering is the useful finding, with the caveat that the per-term split is an illustrative decomposition rather than a controlled ablation. Skew costs the most, because a rotational shear moves the location of every curve sample at once, and a segmenter that learned curve continuity on square renders has to re-find the whole trace. Ink bleed costs next, because it attacks class separation directly: the multiclass mask must keep two curves apart exactly where the degradation has fused them, and thin-structure separation was already the brittle part of the task. Perspective tilt and paper noise cost less individually but stack. That is consistent with the corruption literature: geometric distortion and noise are different families of corruption, and the model has no shared invariance to amortise across them (Hendrycks and Dietterich, 2019).

The control that matters is the negative one. Generic augmentation, rotating and jittering the clean render to inflate the dataset, raises the clean-synthetic score and does close to nothing for the real-scan score, because it deepens a distribution the model already covers instead of widening it toward the real one. That is the domain-randomisation result restated for a document task: variation that the real world does not contain is wasted, and variation modelled on the real acquisition channel is what transfers (Tobin et al., 2017).

Key takeaways

The clean-synthetic peak (R-squared 0.9891) measures the model reading its own training distribution; the real-scan score is lower and is the only number that pays. The distance between them is the sim-to-real gap.
Generic augmentation deepens a distribution the model already covers and barely moves the real-scan score; degradation modelled on the real acquisition channel is what transfers, exactly the domain-randomisation result restated for a document task.
Each targeted degradation has a price. Skew costs the most (a rotational shear relocates every curve sample at once), ink bleed next (it fuses the two curves the multiclass mask must keep apart), with perspective tilt and paper noise smaller individually but additive.
The realistic mid-range is where deployment lives: a representative case at R-squared 0.8126 and a harder overprint-and-bleed case at 0.5461, both measured on an 80/20 split of the 15,000-curve corpus under Tversky loss.
The classical gridlines-elimination baseline already named the corruptions that matter (skew, overlap, faded ink, broken gridlines); modelling them in the renderer turns its failure surface into our training signal.

Discussion

The honest reading is that two numbers describe one model and they answer different questions. The 0.9891 answers whether the network can trace a curve at all, and the answer is yes, decisively. The real-scan figures answer whether it can trace the curve a scanner actually handed us, and the answer is conditional on which degradations the renderer modelled. This is why a single headline accuracy is misleading for any sim-to-real system: it is almost always the clean-distribution score, and the clean-distribution score is a property of the renderer, not of the field.

The per-term ablation also tells you where to spend. If skew is the largest term, then geometric normalisation, deskew at ingest, is worth more than another round of noise augmentation, and the ablation says so quantitatively rather than by intuition. The same logic governs the renderer roadmap: the next degradation worth modelling is the one whose absence opens the largest residual gap, and the strip is how you find it. This is the corruption-robustness protocol used as a development tool, not just a report card (Hendrycks and Dietterich, 2019).
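To make "deskew at ingest" concrete, here is a minimal sketch of one common approach, projection-profile search: try a set of candidate shears, undo each one, and keep the candidate whose column-sum profile is sharpest, since ruled gridlines and curves concentrate into fewer columns when the page is straight. This is a generic technique offered as an assumption, not the engagement's ingest code; the toy 0/1 raster, the function names and the candidate grid are hypothetical, and a row shear is used as a small-angle stand-in for a true rotation.

```python
def shear_rows(img, shear):
    """Shift each row sideways in proportion to its distance from the middle row."""
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        dx = round(shear * (y - h // 2))
        for x in range(w):
            if 0 <= x - dx < w:
                out[y][x] = img[y][x - dx]
    return out


def profile_sharpness(img):
    """Sum of squared column totals: largest when ink stacks into few columns."""
    return sum(sum(col) ** 2 for col in zip(*img))


def estimate_shear(img, candidates):
    """Return the candidate shear whose removal gives the sharpest profile."""
    return max(candidates, key=lambda c: profile_sharpness(shear_rows(img, -c)))


def deskew(img, candidates):
    """Undo the estimated shear before the raster reaches the segmenter."""
    return shear_rows(img, -estimate_shear(img, candidates))
```

For example, `deskew(scan, [i / 10 for i in range(-3, 4)])` searches shears from -0.3 to 0.3 in steps of 0.1. A finer grid costs more evaluations, and rasters with few vertical structures give a flatter profile and a less reliable estimate.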

Limitations

The per-term gap attribution in the instrument is an illustrative decomposition: only the endpoints (R-squared 0.9891 clean, and the realistic 0.8126 and 0.5461 cases) are measured, and the split of the drop across the four degradation terms is for explanation, not a controlled four-cell ablation table. Every figure comes from procedurally generated logs and a fixed degradation model; a render only covers the corruptions we wrote down. Real scans carry artefacts no model in this stack imitates, such as coffee stains, torn margins and hand annotations across the curve, and these sit outside the synthetic support no matter how many logs we generate. The results are for a constant two-curve multiclass setting; three or more overprinted curves are harder and untested at this fidelity.

Conclusion

The public literature was right twice. It was right that you can render the supervision when you own the drawing process, and it was right that the variation which transfers is the variation modelled on the real acquisition channel rather than raw render volume. Our sim-to-real result on raster well-log digitisation makes both concrete for one task: a network trained on 15,000 clean synthetic curves peaks at R-squared 0.9891 on its own renders, and the path onto real Texas RRC scans is walked one targeted degradation at a time (perspective tilt, paper noise, ink bleed and skew), down through the realistic 0.8126 and 0.5461 cases the model has to survive. Generic augmentation does not close that gap. Modelling the specific degradations does, and the ablation tells you, term by term, how much each one is worth.
