Sim-to-Real Failure Modes, and How to Tell Which One You Have

Abstract

A segmenter trained only on 15,000 clean synthetic well-log curves peaks at R-squared 0.9891 on its own held-out renders and then drops on real Texas Railroad Commission scans, but "it drops" is not a finding, it is the start of one. We treat VeerNet's recorded numbers as a diagnostic rather than a scoreboard and ask the operational question: when the model fails on a real scan, which of several distinct failures is it, and how do we tell them apart before choosing a fix. Reading the public literature on dataset shift, domain transfer, corruption robustness and out-of-distribution detection, we name four sim-to-real failure modes that recur in document segmentation and give each a signature in the metrics we already collect. Covariate shift leaves recall high while overlap sags. Class confusion holds the background mask at IoU 0.94 while curve1 and curve2 collapse to 0.26 and 0.21. Localisation drift preserves shape but moves the trace in depth, so R-squared falls while recall does not. The out-of-support artefact produces a confident, high-precision mask over something that is not a curve at all, which no render in the corpus contains. The contribution is the triage table: a way to route observed symptoms to the responsible failure class, because each class has its own fix and the fixes do not substitute for one another.

Where this sits, and what it is not

There is already a companion note on closing the gap one targeted degradation at a time, which measures what each modelled corruption is worth once you have decided that covariate shift is your problem. This note is the step before that one. It does not assume you know which corruption to add; it assumes you are staring at a bad real-scan result and need to decide what kind of bad it is. The distinction matters because the most expensive sim-to-real mistake is not a low score, it is a confidently misdiagnosed one: pouring renderer effort into degradation modelling when the real issue is a rotational skew the model never anchored on, or chasing class weights when the mask is firing on a coffee stain.

The framing is borrowed wholesale from the error-analysis tradition, which has argued for a decade that a single aggregate number hides more than it reveals, and that per-instance inspection is how you learn what a model is keying on when it errs (Ribeiro et al., 2016). We apply that habit to the sim-to-real setting specifically.

Background

The vocabulary for "a model trained on one distribution and tested on another" predates deep learning and is worth keeping precise. The dataset-shift literature separates covariate shift, where the inputs move but the labelling rule is stable, from prior-probability shift and concept shift, which are different problems with different remedies (Quinonero-Candela et al., 2009). A synthetic-to-real pipeline is a deliberate covariate shift by construction: the labelling rule is exact because we drew the curves, but the input distribution of a real scan is not the input distribution of a clean render. Domain-adaptation theory then tells you when that shift is survivable, bounding target-domain error by source error plus a divergence between the two input distributions (Ben-David et al., 2010). The bound is the reason the clean-synthetic peak does not transfer for free: if the divergence is large, a high source score buys you little.
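
For reference, the bound invoked above is usually quoted in the following form from Ben-David et al. (2010). The symbols are the standard ones from that paper rather than anything defined in this note: h is a hypothesis from class H, epsilon_S and epsilon_T are source and target error, and lambda is the error of the best single hypothesis on both domains combined.

```latex
\epsilon_T(h) \;\le\; \epsilon_S(h)
  \;+\; \tfrac{1}{2}\, d_{\mathcal{H}\Delta\mathcal{H}}(\mathcal{D}_S, \mathcal{D}_T)
  \;+\; \lambda,
\qquad
\lambda \;=\; \min_{h' \in \mathcal{H}} \bigl[\, \epsilon_S(h') + \epsilon_T(h') \,\bigr]
```

Read against this note, the first term is what the clean-synthetic peak measures, and the divergence term is the part that a high source score cannot reduce.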

What the robotics line of work added is a prescription, not just a diagnosis. Domain randomisation argued that you should stop trying to perfect one render and instead randomise the simulator until the real world reads as one more sample from training (Tobin et al., 2017). The corruption-robustness literature supplied the inventory of what to randomise, a taxonomy of acquisition-channel corruptions with a protocol for grading a model under each in turn (Hendrycks and Dietterich, 2019). Both, though, presume the failure is covariate shift. They have nothing to say about the case where the model is confident about something that is not in any distribution it should be modelling. That case is the province of out-of-distribution detection, where a maximum-softmax-probability baseline already established that the question "is this input even one I should be answering" is separable from "what is my answer" (Hendrycks and Gimpel, 2017).

The well-log domain hands us its own period-correct enumeration. A classical gridlines-elimination digitiser fails precisely on skew, overlapping curves and faded ink (Yuan and Yang, 2019). Those are not three names for one problem; skew is a geometry failure, overlap is a class-separation failure, and faded ink is an input-corruption failure, and a learned segmenter inherits all three as distinct modes rather than one.

Method

We do not run a new experiment here; we re-read the engagement's existing held-out metrics as evidence for or against each failure class. The model is VeerNet, an encoder-decoder segmenter with a transformer refinement stage, trained on the 15,000-curve multiclass corpus with a constant two curves per log against a ruled background, three output classes, on an 80/20 train/validation split. The clean-synthetic peak is R-squared 0.9891. The multiclass per-class numbers under Dice loss are the diagnostic surface: background IoU 0.94 against curve1 IoU 0.26 and curve2 IoU 0.21; background F1 0.97 against curve1 F1 0.37 and curve2 F1 0.32; recall 0.97 for background against 0.37 and 0.32 for the two curves.
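
As a minimal sketch of how per-class figures of this kind are computed, the function below derives IoU, F1, precision and recall from a predicted and a ground-truth label map. The function name, the integer class encoding (0 for background, 1 and 2 for the curves) and the NaN convention for empty classes are assumptions made for illustration; this is not the engagement's evaluation code.

```python
import numpy as np

def _safe_div(num, den):
    """Return num / den, or NaN when the denominator is zero."""
    return float(num) / float(den) if den else float("nan")

def per_class_metrics(pred, truth, num_classes=3):
    """Per-class IoU, F1, precision and recall from integer label maps.

    pred, truth: arrays of the same shape holding class ids.
    Class ids (0 = background, 1 = curve1, 2 = curve2) are an assumed encoding.
    """
    pred = np.asarray(pred)
    truth = np.asarray(truth)
    out = {}
    for c in range(num_classes):
        p = pred == c
        t = truth == c
        tp = int(np.logical_and(p, t).sum())    # predicted c, truly c
        fp = int(np.logical_and(p, ~t).sum())   # predicted c, truly other
        fn = int(np.logical_and(~p, t).sum())   # missed pixels of class c
        out[c] = {
            "iou": _safe_div(tp, tp + fp + fn),
            "f1": _safe_div(2 * tp, 2 * tp + fp + fn),
            "precision": _safe_div(tp, tp + fp),
            "recall": _safe_div(tp, tp + fn),
        }
    return out
```

Keeping recall and IoU as separate fields, rather than averaging across classes, is what lets the signatures defined next be read off directly.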

From these we define four failure classes and the signature each leaves.

Covariate shift is the renderer omitting a corruption the real scan carries. Its signature is that the curve is still found but the trace is degraded: recall stays comparatively high while IoU and R-squared sag off the clean peak, exactly the pattern the corruption taxonomy predicts for noise and blur terms (Hendrycks and Dietterich, 2019). The fix is to add the missing term to the renderer.

Class confusion is two curves fusing where they overprint. Its signature is the sharpest one in our numbers: the background mask holds at IoU 0.94 while the thin classes collapse to 0.26 and 0.21, because the network can tell curve pixels from background but cannot keep curve1 and curve2 apart. The fix is on the loss and the renderer's bleed model, where a tunable precision-recall objective is the lever (Salehi et al., 2017), not more clean renders.
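
To make the "tunable precision-recall objective" concrete, here is a minimal NumPy sketch of a single-class Tversky loss in the form given by Salehi et al. (2017). The function name, the default weights and the smoothing constant are illustrative assumptions, and a training pipeline would use a differentiable framework rather than NumPy.

```python
import numpy as np

def tversky_loss(prob, target, alpha=0.5, beta=0.5, eps=1e-7):
    """1 - Tversky index for one class.

    prob:   predicted foreground probabilities in [0, 1]
    target: binary ground-truth mask of the same shape
    alpha:  weight on false positives
    beta:   weight on false negatives
    With alpha = beta = 0.5 the index equals the Dice coefficient.
    eps is a small smoothing constant (an assumption) to avoid 0/0.
    """
    prob = np.asarray(prob, dtype=float).ravel()
    target = np.asarray(target, dtype=float).ravel()
    tp = np.sum(prob * target)            # soft true positives
    fp = np.sum(prob * (1.0 - target))    # soft false positives
    fn = np.sum((1.0 - prob) * target)    # soft false negatives
    index = (tp + eps) / (tp + alpha * fp + beta * fn + eps)
    return 1.0 - index
```

Raising beta above alpha penalises missed curve pixels more heavily than spurious ones, which is the direction one would try first when the thin classes have low recall; the right setting for this corpus is not something the recorded numbers determine.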

Localisation drift is geometry the model never anchored on, a rotational skew or perspective tilt. Its signature is that shape is preserved but the whole trace is displaced in depth, so per-curve R-squared falls while recall is intact. The fix is geometric, deskew at ingest or randomise the homography in the renderer, which is the domain-randomisation prescription applied to the geometry axis specifically (Tobin et al., 2017).
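
One way to check for this signature, sketched under the assumption that both the predicted and reference traces are available as one-dimensional arrays sampled on the same depth grid, is to ask whether R-squared recovers when the prediction is slid along the depth axis. The function names and the integer-shift search are illustrative choices, not part of VeerNet or the engagement's tooling, and a pure rotation would need a richer model than a constant shift.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination of y_pred against y_true."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

def best_depth_shift(y_true, y_pred, max_shift=20):
    """Search integer shifts s so that y_pred[i] is compared with y_true[i + s].

    Returns (shift, r2) for the shift that maximises R-squared on the
    overlapping samples. A large gain over the unshifted R-squared suggests
    the shape is right and the trace is displaced in depth.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    n = len(y_true)
    best = (0, r_squared(y_true, y_pred))
    for s in range(-max_shift, max_shift + 1):
        if s > 0:
            a, b = y_true[s:], y_pred[: n - s]
        elif s < 0:
            a, b = y_true[: n + s], y_pred[-s:]
        else:
            continue  # zero shift is already the starting candidate
        r2 = r_squared(a, b)
        if r2 > best[1]:
            best = (s, r2)
    return best
```

If the best shifted R-squared is close to the clean peak while the unshifted one is not, the evidence points to geometry rather than input corruption; if shifting does not help, drift is the less likely explanation.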

The out-of-support artefact is the one no render fixes. Its signature is a confident, high-precision mask over a region that is not a curve at all, a stain, a fold, a stamp, which sits outside the synthetic support no matter how many logs we generate. The right frame is detection, not segmentation: flag the inputs on which the model's confidence ought to be low, whatever confidence it reports, and route them to review (Hendrycks and Gimpel, 2017).
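
A minimal sketch of the maximum-softmax-probability baseline, adapted from image classification to a segmentation output, is given below. The (classes, height, width) logit layout, the choice to average over predicted non-background pixels, and the 0.6 threshold are all assumptions made for illustration. The baseline also only helps when the artefact lowers the softmax confidence; a mask that is both wrong and highly confident will pass this gate, which is why a calibration view is needed in addition.

```python
import numpy as np

def softmax(logits, axis=0):
    """Numerically stable softmax along the given axis."""
    z = logits - logits.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def msp_review_flag(logits, threshold=0.6, background_class=0):
    """Flag a scan for human review using mean max-softmax probability.

    logits: array of shape (C, H, W), assumed layout.
    threshold: hypothetical cut-off; it would need tuning on held-out data.
    Returns (flag, score). score is the mean max-softmax probability over
    pixels predicted as non-background, or NaN if there are none.
    """
    probs = softmax(np.asarray(logits, dtype=float), axis=0)
    msp = probs.max(axis=0)            # per-pixel confidence
    pred = probs.argmax(axis=0)        # per-pixel predicted class
    foreground = pred != background_class
    if not foreground.any():
        return False, float("nan")
    score = float(msp[foreground].mean())
    return score < threshold, score
```

The gate returns a routing decision and leaves the mask itself untouched, which keeps the question of whether to answer separate from the question of what the answer is.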

The instrument below is the triage made operable. Tick the symptoms you see on a real scan and it routes to the most likely class, names the metric signature that confirms it, and gives the matching fix.

A diagnostic, not a score. When a segmenter trained on 15,000 clean synthetic curves (clean-synthetic peak R-squared 0.9891) misreads a real Texas RRC scan, the question is which of four failures you have: covariate shift (the renderer never modelled this corruption), class confusion (two curves fuse where they overprint), localisation drift (skew or tilt the model never anchored on) or an out-of-support artefact (a stain or fold no render contains). Tick the symptoms you observe and the triage routes to the single most likely class, names the held-out metric signature that confirms it, and prescribes the fix. The confirming signatures are the engagement's own recorded multiclass numbers (background / curve1 / curve2 IoU 0.94 / 0.26 / 0.21 and F1 0.97 / 0.37 / 0.32 under Dice loss, plus the 0.9891 clean peak); the four-class taxonomy, the routing logic and the symptom weights are an illustrative diagnostic, and the schematic is illustrative geometry.
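
The routing described above can be sketched as a small weighted lookup. Everything in the block is hypothetical: the symptom names, the weights and the tie-breaking rule are stand-ins chosen for illustration, not the instrument's actual values, and like the instrument they encode priors rather than a fitted mapping. The signature and fix strings paraphrase the four class definitions in this section.

```python
# Hypothetical symptom weights: which observed symptom implicates which class.
SYMPTOM_WEIGHTS = {
    "recall_holds_overlap_sags": {"covariate_shift": 2.0},
    "r2_falls_recall_holds": {"covariate_shift": 1.0, "localisation_drift": 1.0},
    "shape_preserved_trace_displaced": {"localisation_drift": 2.0},
    "background_fine_thin_classes_collapse": {"class_confusion": 2.0},
    "confident_mask_on_non_curve": {"out_of_support": 2.0},
}

# Confirming signature and matching fix per class, paraphrased from the text.
PLAYBOOK = {
    "covariate_shift": (
        "recall comparatively high; IoU and R-squared sag off the clean peak",
        "add the missing corruption term to the renderer",
    ),
    "class_confusion": (
        "background IoU holds while curve1 and curve2 IoU collapse",
        "change the loss and the renderer's bleed model, not the render count",
    ),
    "localisation_drift": (
        "shape preserved, trace displaced in depth; R-squared falls, recall intact",
        "deskew at ingest or randomise the homography in the renderer",
    ),
    "out_of_support": (
        "confident, high-precision mask over a region that is not a curve",
        "treat as detection: gate on confidence and route to human review",
    ),
}

def triage(symptoms):
    """Route a set of observed symptom names to the most likely failure class.

    Unknown symptom names are ignored. Ties resolve to the class listed
    first in PLAYBOOK. Returns None when no known symptom is present.
    """
    scores = {name: 0.0 for name in PLAYBOOK}
    for symptom in symptoms:
        for failure_class, weight in SYMPTOM_WEIGHTS.get(symptom, {}).items():
            scores[failure_class] += weight
    best = max(scores, key=scores.get)
    if scores[best] == 0.0:
        return None
    signature, fix = PLAYBOOK[best]
    return {"failure_class": best, "signature": signature,
            "fix": fix, "scores": scores}
```

For example, `triage({"r2_falls_recall_holds", "shape_preserved_trace_displaced"})` routes to localisation drift, because the displaced-trace symptom breaks the tie that a falling R-squared alone leaves between drift and covariate shift.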

Results

Read as a diagnostic table, the engagement's own numbers separate cleanly into the four classes, and that separability is itself the result.

The class-confusion signature is unambiguous in the recorded metrics. A background IoU of 0.94 sitting next to curve IoUs of 0.26 and 0.21 cannot be a uniform covariate shift, because a uniform input corruption would drag all three classes down together. The fact that the easy class is untouched while the two thin classes collapse is the fingerprint of a separation failure: the model knows where curve-pixels are and does not know which curve they belong to. The matching F1 figures, 0.97 against 0.37 and 0.32, say the same thing from the precision-and-recall side. This is the dominant failure mode in the multiclass setting, and it is a loss-and-renderer problem, not a volume problem.

The covariate-shift and localisation-drift classes are distinguished by what survives. When recall holds but overlap falls, the curve is being found and mis-traced, which is corruption in the input channel. When recall holds but R-squared falls while shape is preserved, the curve is being found, traced, and placed at the wrong depth, which is geometry. The two look similar on a single aggregate score and look nothing alike once you read recall and R-squared as separate axes, which is the entire argument for triaging before fixing.

The out-of-support artefact is the class the metrics cannot fully measure, and saying so is part of the result. It does not show up as a low IoU on a curve; it shows up as a high-confidence prediction on a non-curve, which an overlap metric computed against a curve ground truth will not even register. Detecting it requires a confidence-calibration view layered on top of the segmentation metrics, and treating it as a segmentation regression is the misdiagnosis that wastes the most effort.

What the diagnosis buys you

A sim-to-real drop is not one failure but four, and they do not share a fix. Reading the clean peak (R-squared 0.9891) against the per-class numbers tells you which one you have.
Class confusion is the dominant multiclass mode and has the sharpest signature: background IoU holds at 0.94 while curve1 and curve2 collapse to 0.26 and 0.21. The model finds curve-pixels but cannot keep the two curves apart; the fix is loss-and-bleed, not more renders.
Covariate shift and localisation drift look identical on a single score and are easy to tell apart once you split the axes: covariate shift keeps recall high while overlap sags (input corruption); drift keeps recall and shape while R-squared falls (wrong depth, a geometry problem).
The out-of-support artefact (a stain or fold segmented as a curve) is invisible to an IoU computed against a curve ground truth. It is a detection problem; routing it to review beats any amount of additional rendering.
The cost of skipping triage is a correct-looking fix aimed at the wrong failure. The classical gridlines baseline already named the three structural modes (skew, overlap, faded ink); the discipline is keeping them apart instead of averaging them into one number.

Discussion

The deeper point is that aggregate accuracy is structurally unable to support the decision you actually have to make. A single real-scan score answers "is it good enough" and refuses to answer "what is wrong", and on a sim-to-real system the second question is the one that allocates the next sprint. The dataset-shift vocabulary is useful here precisely because it forces the question to be specific: covariate shift, prior shift and concept shift have different cures, and collapsing them into "the model got worse on real data" guarantees you will sometimes apply the cure for a disease you do not have (Quinonero-Candela et al., 2009).

There is also an order-of-operations claim. Triage should gate the targeted-degradation work, not run alongside it, because degradation modelling only pays once you have confirmed covariate shift is the binding constraint. If the binding constraint is class confusion, every additional realistic render makes the dataset more real and the curves no easier to separate; if it is localisation drift, a deskew step at ingest is worth more than a month of renderer features; if it is an out-of-support artefact, the entire render-more reflex is the wrong instinct and a confidence gate is the right one. The transfer bound makes the same point in theory: you cannot shrink the source-target divergence on an axis the divergence does not live on (Ben-David et al., 2010).

Where our work sits in the field, then, is at the join between two literatures that rarely cite each other. Domain randomisation and corruption robustness tell you how to fix covariate shift; out-of-distribution detection tells you how to recognise the case that is not covariate shift at all. A working sim-to-real pipeline needs both, and the triage table is the small piece of glue that decides which one to reach for.

Limitations

The four-class taxonomy is a model of the failure space, not a complete enumeration of it, and real scans can present mixtures: a skewed page with faded ink fails as drift and covariate shift at once, and the triage routes to the dominant symptom rather than decomposing the blend. The per-class metric signatures we lean on are the engagement's recorded multiclass numbers under Dice loss (background, curve1, curve2 IoU of 0.94, 0.26, 0.21 and the matching F1 and recall figures) on a constant two-curve setting; three or more overprinted curves are harder and the confusion signature there is untested at this fidelity. The routing weights inside the instrument are an illustrative diagnostic, not a fitted classifier, and they encode our priors about which symptom implicates which class rather than a learned mapping. Crucially, the out-of-support artefact class is the one our standard metrics cannot score, because an overlap metric against a curve ground truth is silent on a confident mask over a non-curve; quantifying that mode needs a calibration study we have not run. Every figure traces to procedurally generated training logs and a held-out validation split, so the signatures describe how this model fails on this corpus, and a different backbone or loss could shift which mode dominates.

A practitioner's order of operations

If there is a way to use this on the next bad real scan, it is to resist the first instinct, which is almost always to render more. Read recall and R-squared as separate axes before you read the headline. If recall held and overlap fell, you have an input-corruption problem and the degradation note is where you go next. If recall held and the trace landed at the wrong depth, deskew before you do anything else. If the background mask is fine and only the thin classes collapsed, the renderer is not your problem and the loss is. And if the model is confident about something that was never a curve, stop reaching for the simulator entirely and build the gate that hands that scan to a person. The number tells you that you failed; only the fingerprint tells you how, and the how is the only part that decides what to do on Monday.

References

[1] J. Quinonero-Candela, M. Sugiyama, A. Schwaighofer, N. D. Lawrence (eds.). Dataset Shift in Machine Learning. MIT Press, 2009. The reference vocabulary that separates covariate shift, prior probability shift and concept shift, so a sim-to-real gap can be named rather than lumped into one accuracy drop. https://mitpress.mit.edu/9780262170055/dataset-shift-in-machine-learning/

[2] S. Ben-David, J. Blitzer, K. Crammer, A. Kulesza, F. Pereira, J. W. Vaughan. A Theory of Learning from Different Domains. Machine Learning, 2010. The H-divergence framework that formalises when a model trained on a source distribution can be expected to transfer to a target, and when it cannot. https://link.springer.com/article/10.1007/s10994-009-5152-4

[3] J. Tobin, R. Fong, A. Ray, J. Schneider, W. Zaremba, P. Abbeel. Domain Randomization for Transferring Deep Neural Networks from Simulation to the Real World. IROS 2017. The argument that widening simulator variation, rather than perfecting one render, is what carries a model onto reality; the prescription for the localisation-drift class. https://arxiv.org/abs/1703.06907

[4] D. Hendrycks, T. Dietterich. Benchmarking Neural Network Robustness to Common Corruptions and Perturbations (ImageNet-C). ICLR 2019. A taxonomy of acquisition-channel corruptions and a protocol for grading a model under each, which supplies the symptom list for the covariate-shift class. https://arxiv.org/abs/1903.12261

[5] D. Hendrycks, K. Gimpel. A Baseline for Detecting Misclassified and Out-of-Distribution Examples in Neural Networks. ICLR 2017. The maximum-softmax-probability baseline that frames the out-of-support artefact as a detection problem, not a segmentation one. https://arxiv.org/abs/1610.02136

[6] S. S. M. Salehi, D. Erdogmus, A. Gholipour. Tversky Loss Function for Image Segmentation Using 3D Fully Convolutional Deep Networks. MLMI Workshop, MICCAI 2017. The tunable precision-recall loss whose class-imbalance handling is the lever for the class-confusion failure mode. https://arxiv.org/abs/1706.05721

[7] B. Yuan, Q. Yang. Digitization of Well-Logging Parameter Graphs Based on a Gridlines-Elimination Approach. Journal of Software, 2019. The classical baseline whose own failure surface (skew, overlap, faded ink) is the period-correct enumeration of the failure modes a learned segmenter inherits. https://www.jsoftware.us/show-409-JSW15423.html

[8] M. T. Ribeiro, S. Singh, C. Guestrin. Why Should I Trust You? Explaining the Predictions of Any Classifier. KDD 2016. The case that per-instance inspection, not a single aggregate metric, is how you discover what a model is actually keying on when it errs. https://arxiv.org/abs/1602.04938