10x More Training Signal: Turning 236 Patches Into 4,212 So a Small-Data DETR Could Converge

The hard part of applying a Detection Transformer to borehole geology was never the transformer. It was that the one reservoir interval we started from gave us 32 labelled sinusoids. Across 236 image-log patches. Of which only 19 contained a sinusoid at all. No set-prediction model, and no model of any kind, learns a depth-dip-azimuth regression from that. This is the story of the data-engineering move that made the rest of the project possible: a geometry-preserving augmentation pipeline that grew the corpus more than tenfold, and the ablation that proved it was load-bearing rather than cosmetic. In our work with a mid-sized Middle East carbonate operator, this single decision separated a model that learned nothing from one that converged.

The data regime, stated honestly

When you build a model for a fractured carbonate play, the labels are expensive and the geology is sparse where it matters. The seed reservoir interval we used to bootstrap GeoBFDT spanned roughly sixty metres of a single well. Tiled into 100x360-pixel patches, it produced 236 patches. Only 19 of those carried a sinusoid, and the total sinusoid count was 32. The full programme would eventually span 14 vertical wells logged with two different microresistivity imaging tools — but the architecture had to prove itself on this seed interval first, and at this size the class imbalance is not a tuning nuisance, it is the entire problem.

Why this is a degenerate regime for set prediction

A DETR matches a fixed bank of learned queries one-to-one against the ground-truth set via Hungarian matching. With 32 positives spread over 236 patches, the overwhelming majority of every batch is the “no-object” class. The model can drive its loss down to a plausible-looking floor by predicting “no sinusoid” everywhere — and on this corpus, untreated, that is exactly what it did.
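To make the one-to-one matching concrete, here is a minimal sketch in plain Python. It uses a brute-force search over assignments as a stand-in for the Hungarian algorithm, which is only practical for a handful of queries. The function name, the cost values, and the four-query bank are illustrative assumptions, not GeoBFDT's implementation.

```python
from itertools import permutations

NO_OBJECT = None  # label given to queries that are matched to nothing


def match_queries(cost):
    """Return, per query, the index of its matched target or NO_OBJECT.

    cost[q][t] is the cost of pairing query q with ground-truth target t.
    Brute force over assignments: a stand-in for the Hungarian algorithm.
    """
    n_queries = len(cost)
    n_targets = len(cost[0]) if cost else 0
    best_total, best_assignment = float("inf"), ()
    # Each permutation picks a distinct query for every target.
    for queries in permutations(range(n_queries), n_targets):
        total = sum(cost[q][t] for t, q in enumerate(queries))
        if total < best_total:
            best_total, best_assignment = total, queries
    # Every query not chosen for a target is supervised as "no-object".
    labels = [NO_OBJECT] * n_queries
    for t, q in enumerate(best_assignment):
        labels[q] = t
    return labels


# Hypothetical patch with two sinusoids and a bank of four queries.
positive_patch = [[0.9, 0.8], [0.1, 0.7], [0.6, 0.2], [0.5, 0.5]]
# Hypothetical patch with no sinusoid: every query has zero targets.
negative_patch = [[], [], [], []]
```

In the positive patch two of the four queries receive a target and the other two are trained as no-object. In the negative patch all four are. On the seed interval 217 of the 236 patches are of the second kind, which is why the no-object prediction dominates the loss.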

The naive fixes do not work here. Class-weighting the loss is real and we use it — Focal loss is precisely the right classification objective for this imbalance — but a focal term cannot manufacture geometric diversity that the 32 examples do not contain. Holding out a validation split from 32 positives leaves you estimating accuracy off a handful of sinusoids. And a heavier backbone, the instinct when a model underfits, does the opposite of help: with this little signal, more capacity means more overfitting. The lever that actually moves the regime is the training data itself.
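For reference, a minimal sketch of the binary focal loss from Lin et al. (2017) for a single prediction. The defaults for gamma and alpha are the values reported in that paper; the article does not state which values GeoBFDT used, so treat them as placeholders.

```python
import math


def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Binary focal loss for one prediction.

    p: predicted probability of the positive class (0 < p < 1).
    y: ground-truth label, 1 for positive and 0 for negative.
    gamma, alpha: paper defaults, not confirmed GeoBFDT settings.
    """
    p_t = p if y == 1 else 1.0 - p              # probability of the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha  # class-balancing weight
    # (1 - p_t) ** gamma shrinks the loss of well-classified examples.
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


easy_negative = focal_loss(0.01, 0)  # confident, correct "no-object"
hard_positive = focal_loss(0.10, 1)  # a sinusoid the model nearly missed
```

The easy negative contributes almost nothing while the hard positive keeps a sizeable loss, which is how the focal term stops the abundant no-object examples from swamping the gradient. It reweights the examples that exist; it does not add new ones.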

The augmentation pipeline as a data-engineering build

We treated augmentation as a production data pipeline, not a transforms.Compose afterthought. The unit of augmentation is the sinusoid-bearing patch. We re-tiled the interval into 100x270-pixel patches with a vertical stride of 20 pixels — heavily overlapping, so that no sinusoid is clipped at a patch boundary and every trace appears in several spatial contexts. Then, on each patch that contained a sinusoid, we applied 10 randomised augmentations drawn from six families: ColorJitter, GaussianBlur, Sharpen, Gaussian noise, Emboss, and MedianBlur.
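The effect of the overlapping re-tile can be sketched with two small helpers. The patch height of 100 and stride of 20 come from the text; the 500-row image and the trace spanning rows 180 to 220 are hypothetical values chosen for illustration.

```python
def tile_starts(image_height, patch_height=100, stride=20):
    """Top-row offsets of every full patch that fits in the image."""
    last = image_height - patch_height
    return list(range(0, last + 1, stride)) if last >= 0 else []


def windows_containing(top, bottom, starts, patch_height=100):
    """Starts of the patches that hold rows top..bottom (inclusive) unclipped."""
    return [s for s in starts if s <= top and bottom < s + patch_height]


overlapping = tile_starts(500)                 # stride 20
non_overlapping = tile_starts(500, stride=100)

# A hypothetical trace that straddles the row-200 tile boundary.
whole_in_overlapping = windows_containing(180, 220, overlapping)
whole_in_non_overlapping = windows_containing(180, 220, non_overlapping)
```

With non-overlapping tiles this trace is cut by the boundary at row 200 and appears whole in no patch. With a stride of 20 it appears whole in three patches, each placing it at a different vertical position.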

The selection of those six is the whole point, and it is a domain decision rather than a default. Every transform in that set perturbs photometry — contrast, sharpness, texture, noise — and leaves geometry untouched. That constraint is non-negotiable on an image log: the label is the geometry. A sinusoid's depth, dip and azimuth are encoded in its position and curvature on the unwrapped borehole image. Rotate, shear, or elastically warp that patch and you have silently corrupted the very target the regression head must predict. So no affine, no flip, no crop that changes the trace. We vary how the rock looks — bright resistive against dark conductive, sharp against blurred, clean against noisy — never where the sinusoid sits. The augmentations stay on the training split only; validation and test see real, untransformed logs, so the reported accuracy is not inflated by synthetic samples leaking across the split.
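The augmentation contract (change pixel values, never pixel positions) can be sketched with NumPy. This is an illustration only: it implements a contrast and brightness jitter in the spirit of ColorJitter plus additive Gaussian noise, two of the six families, with made-up parameter ranges. The label dictionary and its field names are hypothetical.

```python
import numpy as np


def photometric_augment(patch, rng):
    """Return a photometrically perturbed copy of a 2-D patch in [0, 1].

    Only pixel values change; the array shape, and so the position of
    every sinusoid in it, is left exactly as it was.
    """
    contrast = rng.uniform(0.8, 1.2)      # illustrative range
    brightness = rng.uniform(-0.1, 0.1)   # illustrative range
    noise = rng.normal(0.0, 0.02, size=patch.shape)
    mean = patch.mean()
    out = (patch - mean) * contrast + mean + brightness + noise
    return np.clip(out, 0.0, 1.0)


def augment_sample(patch, sinusoids, rng, n_aug=10):
    """Pair each augmented patch with the untouched geometric labels."""
    return [(photometric_augment(patch, rng), sinusoids) for _ in range(n_aug)]


rng = np.random.default_rng(0)
patch = rng.random((100, 270))  # stand-in for a real image-log patch
labels = [{"depth": 41.0, "dip": 30.0, "azimuth": 120.0}]  # hypothetical
augmented = augment_sample(patch, labels, rng)
```

Each of the ten outputs has the same shape as the input and carries the same label object, so the regression target cannot drift. In a real pipeline this function would be called on training patches only.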

The conceptual reason this works is the one the augmentation-robustness literature formalises: a model should make the same prediction for a clean sample and a photometrically corrupted version of it. That consistency, not the raw count of extra images, is what augmentation buys you. The interactive below makes the mechanism concrete: diversity comes from blending augmentation chains, but the prediction-consistency tie is what keeps the model robust.

Interactive figure: AugMix (Hendrycks et al., ICLR 2020, arXiv 1912.02781) in three moves. Several augmentation chains of varying severity are applied to the same sample; their outputs are mixed with random convex weights into a single sample; and a Jensen-Shannon-divergence consistency loss ties the model's prediction on the original to its prediction on the mix. However the mixing weights move, the consistency tie holds the two predictions together, which is what makes AugMix robust where naive augmentation bakes in distortion. The number of chains is not fixed by the source, so the three chains shown (A/B/C), their severities, and the weight split are schematic.
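The mixing and consistency steps can be sketched in plain Python. This is a simplified two-distribution version that follows the description above; the AugMix paper itself takes the Jensen-Shannon divergence over three predictions (the original and two mixed samples). Function names and the example values are assumptions made for illustration.

```python
import math
import random


def convex_weights(k, rng):
    """Sample k non-negative weights that sum to 1 (a flat Dirichlet draw)."""
    draws = [rng.gammavariate(1.0, 1.0) for _ in range(k)]
    total = sum(draws)
    return [d / total for d in draws]


def mix(chain_outputs, weights):
    """Convex combination of equally sized, flattened chain outputs."""
    return [sum(w * out[i] for w, out in zip(weights, chain_outputs))
            for i in range(len(chain_outputs[0]))]


def kl(p, q):
    """Kullback-Leibler divergence KL(p || q) for discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence between two discrete distributions."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


weights = convex_weights(3, random.Random(0))          # one weight per chain
blended = mix([[0.0, 0.0], [1.0, 1.0]], [0.25, 0.75])  # toy two-pixel outputs
```

The divergence is zero when the prediction on the original and on the mix agree, and it is bounded above by ln 2 when they are disjoint. Used as a loss term, it penalises the model whenever a change in appearance changes the prediction.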

The pipeline output is blunt about its own scale. From 236 patches it produced 4,212. From 19 sinusoid-bearing patches, 2,046. From 32 individual sinusoids, 3,565 — a greater-than-tenfold increase in usable training signal, generated entirely from re-tiling and photometric perturbation, with not one additional well or interpreter-hour spent.

Training-corpus growth from the augmentation pipeline

Before: 236 / 19 / 32 (total patches / sinusoid-bearing patches / individual sinusoids). Raw seed interval, degenerate for set prediction.

After: 4,212 / 2,046 / 3,565 (same three counts). Overlapping re-tile (100x270, stride 20) plus 10 photometric augmentations per sinusoid patch (6 families).

Result: greater than 10x training signal, with geometry preserved and photometry varied.
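The growth factors implied by these counts are easy to check. The snippet below only does arithmetic on the six figures quoted above; the dictionary keys are our own labels.

```python
before = {"patches": 236, "sinusoid_patches": 19, "sinusoids": 32}
after = {"patches": 4212, "sinusoid_patches": 2046, "sinusoids": 3565}

# Multiplicative growth of each count.
growth = {key: after[key] / before[key] for key in before}

# Share of patches that contain at least one sinusoid.
positive_share_before = before["sinusoid_patches"] / before["patches"]
positive_share_after = after["sinusoid_patches"] / after["patches"]

for key, factor in growth.items():
    print(f"{key}: {factor:.1f}x")
print(f"positive share: {positive_share_before:.1%} -> {positive_share_after:.1%}")
```

Total patches grow about 17.8x, sinusoid-bearing patches about 107.7x, and individual sinusoids about 111.4x. Because only sinusoid-bearing patches are augmented, the share of positive patches rises from about 8.1% to about 48.6%, so the pipeline rebalances the corpus as well as enlarging it.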

The ablation that made it non-negotiable

Augmentation is the kind of decision teams rationalise after the fact. We did the opposite: we ran the model with augmentation switched off and on, holding everything else fixed, and let the numbers adjudicate. The result is one of the cleanest ablations in the whole programme.

Interactive figure: VeerNet loss-function ablation. Loss-function choice decides whether the network learns curve continuity. VeerNet tested five losses under identical conditions; only Lovász-Softmax, whose gradient aligns with the IoU/F1 metric, wins, and the shipped answer is a two-loss schedule (SCE warmup, then Lovász fine-tune) that reaches the same accuracy in half the wall-clock time. The five candidates, their verdicts, the F1 35% / IoU 30% figures, and the two-loss schedule are the whitepaper's own; the podium bar heights are ordinal and the prediction thumbnails are schematic.

Without augmentation, the model's classification error was 100% — it learned nothing usable, collapsing to the majority “no-object” prediction exactly as the imbalance predicts. With augmentation, classification error fell to 2.618%. The two regression-and-matching loss terms moved in lockstep: the Hungarian matching loss dropped from 0.174 to 0.0135, and the parameter (depth/dip/azimuth) loss from 0.575 to 0.062. This is not a marginal gain you argue about over coffee. It is the difference between a model that does not function and one that does.
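A switch-it-off ablation is only clean if the two runs differ in exactly one setting. The sketch below shows one way to enforce that and then expresses the reported numbers as fold reductions. The config class, its fields, and the seed are hypothetical; only the metric values come from the text.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class RunConfig:
    backbone: str = "resnet10"
    augment: bool = True
    seed: int = 0  # hypothetical; the article does not mention seeds


def differing_keys(a, b):
    """Names of the config fields whose values differ between two runs."""
    da, db = asdict(a), asdict(b)
    return sorted(k for k in da if da[k] != db[k])


baseline = RunConfig(augment=False)
treated = RunConfig(augment=True)
changed = differing_keys(baseline, treated)  # should be exactly ["augment"]

# Metrics as reported above: (augmentation off, augmentation on).
reported = {
    "class_error_pct": (100.0, 2.618),
    "hungarian_loss": (0.174, 0.0135),
    "parameter_loss": (0.575, 0.062),
}
fold_reduction = {name: off / on for name, (off, on) in reported.items()}
```

On the reported figures the class error falls roughly 38-fold, the Hungarian loss roughly 13-fold, and the parameter loss roughly 9-fold.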

That single comparison reframes how to read the rest of the model's design choices, because every other ablation we ran sits downstream of having a corpus the model can actually learn from:

Backbone. On this small-data regime a deliberately light backbone won decisively — ResNet-10 reached a class error of 0.499 where ResNet-34 collapsed to 26.759. More parameters overfit; fewer generalised. Augmentation is what gives a small backbone enough varied signal to generalise from.
Dynamic over static logs. Training on dynamically normalised images rather than static ones cut class error from 63.45 to 2.536 — the dynamic image carries the local contrast the attention layers key on, and augmentation amplifies exactly that contrast variation.
Geological diversity compounds. Across the well-count ablation (3 to 14 wells), class error fell from 93.115 to the low single digits. More wells add real geological variety; augmentation adds photometric variety. They are complementary levers, and the project needed both.

The training recipe around these choices was conventional once the data problem was solved: ResNet-10 trained from scratch, a 4-layer encoder and 4-layer decoder, feedforward dimension 1,024, dropout 0.2, AdamW at a learning rate of 0.0004, batch size 128, with Focal loss (classification weight 5) and L1 loss (parameter weight 1), inference thresholded at probability 0.5, and early stopping after 40 epochs without improvement. None of those knobs would have mattered at all without the corpus the augmentation pipeline produced.
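The recipe above can be written down as a single config object, sketched here as a dataclass. The values are those stated in the text; the field names are our own and do not correspond to any particular library's API. The early-stopping helper assumes the monitored quantity is a validation loss where lower is better, which the article does not specify.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    backbone: str = "resnet10"       # trained from scratch
    encoder_layers: int = 4
    decoder_layers: int = 4
    dim_feedforward: int = 1024
    dropout: float = 0.2
    optimizer: str = "adamw"
    learning_rate: float = 4e-4
    batch_size: int = 128
    focal_loss_weight: float = 5.0   # classification term
    l1_loss_weight: float = 1.0      # depth/dip/azimuth term
    score_threshold: float = 0.5     # applied at inference
    patience: int = 40               # epochs without improvement


class EarlyStopping:
    """Signal a stop once the metric has not improved for `patience` epochs."""

    def __init__(self, patience):
        self.patience = patience
        self.best = float("inf")
        self.stale_epochs = 0

    def step(self, val_loss):
        """Record one epoch's validation loss; return True when it is time to stop."""
        if val_loss < self.best:
            self.best, self.stale_epochs = val_loss, 0
        else:
            self.stale_epochs += 1
        return self.stale_epochs >= self.patience


config = TrainConfig()
stopper = EarlyStopping(config.patience)
```

Keeping every knob in one frozen object makes each ablation a one-field change that can be logged and compared.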

What an ML engineer should take from this

The transferable lesson is one of sequencing, and it is the engineering discipline, not the geoscience, that we bring to small-data subsurface engagements for national and independent operators across the Middle East and the United States. In a small-data, high-imbalance vision problem, the data pipeline is the model. The temptation is to spend the first sprint on architecture: a fancier transformer, a pretrained backbone, a cleverer loss. On 32 positives, every one of those moves is premature. The first sprint belongs to the corpus: tile with overlap so nothing is clipped, augment only the dimension that does not carry the label, and prove the augmentation earns its place with a switch-it-off ablation before you trust a single accuracy figure. That is an applied-ML build (data engineering, a CV-aware augmentation contract, and an MLOps-grade ablation harness), not a geology exercise wearing a model.

The honest caveats travel with the method. Photometric augmentation expands the appearance manifold; it cannot invent geological structures the wells never sampled, which is why the well-count ablation, not the augmentation factor, names the next real lever — more diverse wells. Tenfold synthetic growth from one seed interval also raises the correlation between training samples, so the validation and test splits must stay strictly real and strictly held-out, as they were here, or the headline accuracy is a mirage. And the six-family photometric set is tuned to high-resolution borehole image-log statistics; ported to a different log type or a seismic raster, the augmentation families themselves would need re-deriving from first principles. The pipeline is the contribution — not any one transform in it.
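One way to keep the held-out splits strictly real is to split by source before anything is augmented, then check the result. The sketch below is an assumption about how such a guard could look, not the project's actual code. The source identifier is hypothetical and would need to be coarse enough (a well or a depth interval, say) that overlapping tiles never straddle the split.

```python
def split_by_source(samples, held_out_sources):
    """Split samples so every variant of a source lands on one side only.

    samples: iterable of (source_id, is_augmented) pairs.
    Augmented variants are dropped from the held-out side, so evaluation
    only ever sees real patches.
    """
    train, held_out = [], []
    for source_id, is_augmented in samples:
        if source_id in held_out_sources:
            if not is_augmented:
                held_out.append((source_id, is_augmented))
        else:
            train.append((source_id, is_augmented))
    return train, held_out


def assert_no_leakage(train, held_out):
    """Raise if a source appears on both sides or held-out data is synthetic."""
    shared = {s for s, _ in train} & {s for s, _ in held_out}
    if shared:
        raise ValueError(f"sources on both sides of the split: {sorted(shared)}")
    if any(is_augmented for _, is_augmented in held_out):
        raise ValueError("augmented samples found in the held-out split")


# Hypothetical corpus: two source intervals, each with one augmented variant.
samples = [("a", False), ("a", True), ("b", False), ("b", True)]
train, held_out = split_by_source(samples, held_out_sources={"b"})
assert_no_leakage(train, held_out)
```

Running the check in the data pipeline itself, rather than trusting the split by convention, turns a silent leak into a hard failure.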

Engineering a small-data DETR to convergence

On a degenerate corpus — 32 sinusoids across 236 patches, only 19 of them positive — augmentation, not architecture, was the difference: an overlapping re-tile (100x270, stride 20) plus 10 photometric augmentations per sinusoid patch grew the data to 4,212 patches / 2,046 sinusoid-patches / 3,565 sinusoids, a greater-than-tenfold increase.
Augment only the dimension that does not carry the label: the six families used (ColorJitter, GaussianBlur, Sharpen, Gaussian noise, Emboss, MedianBlur) all vary photometry and leave the sinusoid geometry — the regression target — untouched. Augmentation runs on the training split only.
The switch-it-off ablation made the decision non-negotiable: class error fell from 100% (no augmentation, model learns nothing) to 2.618%, with the Hungarian loss dropping 0.174 to 0.0135 and the parameter loss 0.575 to 0.062. In small-data vision, the data pipeline is the model.

References

GeoBFDT augmentation and ablation figures derived from internal validation on a 14-well Middle East carbonate dataset acquired with two different microresistivity imaging tools; data and code withheld under operator confidentiality.
Carion et al. (2020). End-to-End Object Detection with Transformers (DETR). ECCV 2020. https://arxiv.org/abs/2005.12872
Hendrycks et al. (2020). AugMix: A Simple Data Processing Method to Improve Robustness and Uncertainty. ICLR 2020. https://arxiv.org/abs/1912.02781
Lin et al. (2017). Focal Loss for Dense Object Detection. ICCV 2017. https://arxiv.org/abs/1708.02002
