Overfit, Then Fixed: The ~25-Item Backlog That Tamed Our First DETR Model

The first version of almost every supervised model overfits, and the temptation is always the same: reach for a bigger, cleverer architecture. During a roughly twenty-month engagement with a mid-sized Middle East carbonate operator, we trained a Detection Transformer (DETR) to pick fractures and bedding planes from borehole image logs acquired with two different microresistivity imaging tools. Each pick is a sinusoid parameterised by depth, dip, and azimuth. The first two training runs failed in the most dispiriting way a set-prediction model can fail: they did not produce wrong picks, they produced a constant output on the validation set, the same handful of queries regardless of the borehole imagery they were shown. The train/val/test class distributions were matched, the loss was the right one, and the network had memorised the training set wholesale. This is the story of what actually fixed it: a backlog of roughly twenty-five small engineering changes, none of them a new architecture, and the single change that mattered most.

The diagnosis: a constant prediction is a memorisation tell

When a DETR's decoder emits the same set of queries for every input, it has stopped looking at the image. The classification head has found a degenerate optimum — predict the majority outcome, ignore the encoded features — and the bipartite matcher, which needs a varied cost surface to assign predictions to ground truth, has nothing to chew on. We confirmed it the unglamorous way: predictions on the validation set were near-identical across patches even though the underlying geology was not, and the validation loss flat-lined while training loss kept falling. That divergence is the textbook overfitting signature, and on a dataset this small it was overdetermined.
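The constant-output check described above can be automated. The following is a minimal sketch, not the project's actual tooling: the function names, the nested-list layout of the predictions, and the tolerance are all assumptions made for illustration. It measures how much each query's output varies across validation patches and flags the collapsed case.

```python
from statistics import pstdev


def prediction_spread(preds_per_patch):
    """Per-query, per-parameter standard deviation across patches.

    preds_per_patch: list of patches; each patch is a list of queries;
    each query is a tuple of floats (for example depth, dip, azimuth).
    The layout is an assumption for this sketch.
    """
    n_queries = len(preds_per_patch[0])
    n_params = len(preds_per_patch[0][0])
    spread = []
    for q in range(n_queries):
        row = []
        for p in range(n_params):
            values = [patch[q][p] for patch in preds_per_patch]
            row.append(pstdev(values))
        spread.append(row)
    return spread


def looks_constant(preds_per_patch, tol=1e-3):
    """True when every query emits nearly the same values on every patch."""
    spread = prediction_spread(preds_per_patch)
    return all(s <= tol for row in spread for s in row)
```

A spread near zero for every query, on patches whose geology visibly differs, is the signal that the decoder has stopped conditioning on the image. The tolerance would need tuning to the scale of the normalised outputs.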

The data appetite of set prediction is the root cause. A DETR learns the number of objects per patch on its own and matches a fixed query budget to a variable ground-truth set; with too few wells and too little within-patch variety, the easiest thing for it to learn is a constant. Our earliest supervised dataset was tiny by deep-learning standards — one reservoir interval held only 32 sinusoids spread across 236 patches, and just 19 of those 236 patches contained any sinusoid at all. You cannot regularise your way out of that with weight decay alone. You have to manufacture variety and shrink the model until the two meet.
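To make the matching step concrete, here is a small sketch of one-to-one assignment between a fixed query budget and a variable ground-truth set. It is an illustration under stated assumptions, not the model's matcher: it uses an exhaustive search rather than the Hungarian algorithm, a plain L1 cost on the three sinusoid parameters with no classification term, and hypothetical function names. For real workloads, `scipy.optimize.linear_sum_assignment` solves the same problem efficiently.

```python
from itertools import permutations


def l1_cost(pred, target):
    """Sum of absolute differences between two parameter tuples."""
    return sum(abs(p - t) for p, t in zip(pred, target))


def match_queries(preds, targets):
    """Exhaustive minimum-cost one-to-one assignment of targets to queries.

    Returns (total_cost, assignment), where assignment[j] is the index of
    the query matched to targets[j]. Queries left unmatched play the role
    of "no object" slots. Only practical for tiny sets.
    """
    if len(preds) < len(targets):
        raise ValueError("query budget is smaller than the ground-truth set")
    best_cost, best_assign = float("inf"), ()
    for assign in permutations(range(len(preds)), len(targets)):
        cost = sum(l1_cost(preds[q], targets[j]) for j, q in enumerate(assign))
        if cost < best_cost:
            best_cost, best_assign = cost, assign
    return best_cost, best_assign
```

Note what happens when every query emits the same values: every assignment has the same total cost, so the matching carries no information about which query should own which sinusoid. An empty patch (no targets) returns a cost of zero and an empty assignment.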

The backlog, not the architecture

Instead of swapping the network, we wrote down everything cheap that might help and worked the list. The full improvement backlog ran to roughly twenty-five items across the data layer, the model, and the training loop. The representative ones:

Augment only the sinusoid-bearing patches, and heavily — because the empty patches were already over-represented, augmenting them would have deepened the imbalance.
Shrink the backbone below ResNet-18, and shrink the encoder/decoder depth, on the principle that a small dataset cannot support a large feature extractor.
Switch the input to a single channel and use the full 360-degree borehole width rather than a cropped window.
Pad patches consistently and tune the patch height to the physical span of a typical fracture.
Change the activation and add dropout for explicit regularisation.
Drop the pathological >70-sinusoid patches that were poisoning the query budget.
Train a fractures-only model separately, rather than forcing one network to share capacity across bedding planes and fractures.
Reconsider the sinusoid-parameter loss (it started as a plain L1) and fold the well angle into the geometry so dip and azimuth could be compared across wells.
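For readers less familiar with the geometry behind the last item, the sketch below generates the trace that a planar feature leaves on an unrolled borehole image. It is a textbook relationship rather than the project's code: the amplitude of the trace is the borehole radius times the tangent of the dip, measured relative to the borehole axis. The function name, the depth-positive-down sign convention, and the column-to-angle mapping are assumptions, and the correction for well deviation that the backlog item refers to is not shown.

```python
import math


def sinusoid_trace(depth_m, dip_deg, azimuth_deg, radius_m, n_cols=360):
    """Depth of a planar feature's trace at each column of an unrolled image.

    Column k corresponds to the angle 360 * k / n_cols around the borehole.
    Depth is positive downward, so the deepest point of the trace sits at
    the dip azimuth (an assumed convention). Dip is relative to the
    borehole axis, not corrected for well deviation.
    """
    amplitude = radius_m * math.tan(math.radians(dip_deg))
    azimuth = math.radians(azimuth_deg)
    trace = []
    for k in range(n_cols):
        phi = 2.0 * math.pi * k / n_cols
        trace.append(depth_m + amplitude * math.cos(phi - azimuth))
    return trace
```

Because the amplitude depends on dip relative to the borehole axis, the same geological plane produces different sinusoids in wells with different deviations, which is one reason to fold the well angle into the parameters before comparing picks across wells.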

None of these is glamorous. Together they are the entire difference between a model that memorises and a model that generalises. The discipline here is ordinary applied-ML engineering — a versioned dataset under DVC, source under Git, a reproducible patch-generation step, and a hyperparameter sweep that could be re-run as the dataset grew — and it is exactly the discipline that lets you attribute a gain to the one change that caused it. That attribution is the whole point of an ablation, and the ablations told a very clear story about which lever moved the needle.
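The attribution discipline can be expressed as a small harness that changes exactly one setting at a time against a fixed baseline. This is a hedged sketch: `one_factor_ablation` and the `evaluate` callable are hypothetical names, and in practice `evaluate` would wrap a full, reproducible training and validation run.

```python
def one_factor_ablation(baseline, variants, evaluate):
    """Evaluate the baseline and each single-key variant of it.

    baseline: dict of settings.
    variants: list of (key, value) pairs; each changes one baseline key.
    evaluate: callable taking a config dict, returning a validation error.
    Returns (label, error, delta_vs_baseline) rows sorted by error.
    """
    base_err = evaluate(dict(baseline))
    rows = [("baseline", base_err, 0.0)]
    for key, value in variants:
        if key not in baseline:
            raise KeyError(f"unknown setting: {key}")
        config = dict(baseline)  # copy, so variants never accumulate
        config[key] = value
        err = evaluate(config)
        rows.append((f"{key}={value}", err, err - base_err))
    return sorted(rows, key=lambda row: row[1])
```

Copying the baseline for every variant is the point of the design: each row differs from the baseline in one key only, so any change in error can be attributed to that key.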

[Interactive figure: VeerNet loss-function ablation. Loss-function choice decides whether the network learns curve continuity. VeerNet tested five losses under identical conditions; only the one whose gradient aligns with the IoU/F1 metric (Lovász-Softmax) wins, and the shipped answer is a two-loss schedule, SCE warm-up followed by Lovász fine-tuning, which reaches the same accuracy in half the wall-clock time. The five candidates, their verdicts, the F1 35% / IoU 30% figures, and the two-loss schedule come from the whitepaper; the bar heights are ordinal and the prediction thumbnails are schematic.]

Augmentation was not a nicety — it was the fix

The single largest lever was data augmentation, and the numbers are not subtle. With augmentation switched off, the fractures-only model's classification error was pinned at 100%: it learned nothing usable, the constant-prediction failure mode in its purest form. Switched on, the same model's classification error fell to 2.618%, with the Hungarian matching loss dropping from 0.174 to 0.0135 and the parameter loss from 0.575 to 0.062. One change cut classification error roughly 38-fold and both loss terms by about an order of magnitude. Augmentation alone is the line between a dead model and a deployable one.

The mechanism is mundane, and that is precisely why it is easy to under-rate. We applied six photometric transforms (colour jitter, Gaussian blur, sharpen, Gaussian noise, emboss, and median blur) and generated 10 augmented variants per original patch, up from the 7 we started with. Applied selectively to the sinusoid-bearing patches, this grew the working dataset by well over tenfold: patches went from 236 to 4,212 (about 18x), sinusoid-bearing patches from 19 to 2,046 (about 108x), and individual labelled sinusoids from 32 to 3,565 (about 111x). The matcher now had thousands of distinct cost surfaces to learn from instead of nineteen, and the degenerate constant-prediction optimum stopped being the easy answer. Augmentation here is not regularisation cosmetics; it is the supply of variety that a data-hungry set-prediction objective needs before it can converge at all.
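The selective part of the pipeline is simple to sketch. The code below is an illustration under stated assumptions, not the production pipeline: patches are toy 2D lists of floats in [0, 1], only two stand-in transforms are implemented (a brightness shift and Gaussian noise, not the full set of six), and all function names are hypothetical. The key behaviours are that empty patches pass through once, labelled patches gain extra variants, and labels are reused unchanged because these transforms alter pixel values without moving the sinusoid.

```python
import random


def jitter_brightness(patch, rng, max_shift=0.1):
    """Add one random brightness offset to the whole patch, clipped to [0, 1]."""
    shift = rng.uniform(-max_shift, max_shift)
    return [[min(1.0, max(0.0, px + shift)) for px in row] for row in patch]


def add_gaussian_noise(patch, rng, sigma=0.02):
    """Add independent Gaussian noise to every pixel, clipped to [0, 1]."""
    return [[min(1.0, max(0.0, px + rng.gauss(0.0, sigma))) for px in row]
            for row in patch]


def augment_selectively(dataset, n_variants, seed=0):
    """Augment only the patches that carry at least one labelled sinusoid.

    dataset: list of (patch, labels) pairs; labels is a list of sinusoids.
    Returns the originals plus n_variants copies of each labelled patch.
    """
    rng = random.Random(seed)  # seeded so the step is reproducible
    transforms = [jitter_brightness, add_gaussian_noise]
    out = []
    for patch, labels in dataset:
        out.append((patch, labels))
        if not labels:
            continue  # empty patches are already over-represented
        for _ in range(n_variants):
            transform = rng.choice(transforms)
            out.append((transform(patch, rng), labels))
    return out
```

Seeding the generator keeps the patch-generation step reproducible, which matters when the augmented set is versioned alongside the raw data.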

It is worth being precise about what augmentation could not fix, because the depth floor is physical, not statistical. The binary wireline log file's image resolution is 1 pixel = 3 cm, so an inherent ±3 cm depth error is baked into every pick no matter how much you augment. We chose a patch height of 800 pixels — about 2.2 m of borehole — because over 95% of fracture and bedding sinusoids fit within that window; the rare giant fracture reaches roughly 9 m (3,200 px). Augmentation multiplies what you have; it does not raise the sensor's resolution.
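The patch-height choice can be framed as a coverage question: what is the smallest window that contains a target share of the labelled sinusoids? The sketch below shows that calculation. The function names and candidate heights are assumptions, and it works purely in pixels so that it does not depend on any particular depth sampling.

```python
def coverage(heights_px, patch_height_px):
    """Fraction of sinusoids whose vertical extent fits inside one patch."""
    if not heights_px:
        raise ValueError("no sinusoid heights supplied")
    fitting = sum(1 for h in heights_px if h <= patch_height_px)
    return fitting / len(heights_px)


def smallest_patch_height(heights_px, candidates_px, target=0.95):
    """Smallest candidate height covering at least `target` of the sinusoids.

    Returns None when no candidate reaches the target.
    """
    for height in sorted(candidates_px):
        if coverage(heights_px, height) >= target:
            return height
    return None
```

Choosing the smallest window that meets the target, rather than one large enough for the rare giant fracture, keeps patches small and numerous at the cost of excluding a few outsized features.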

The rest of the backlog, quantified

Augmentation was the headliner, but two other backlog items earned their place with ablations of their own.

Backbone size. We swept four ResNet depths under identical conditions. A from-scratch ResNet-10 won outright with a 0.499% classification error; ResNet-34 was a disaster at 26.759%. On fourteen wells, a heavier feature extractor does not help — it overfits before the set-prediction objective can do its job. "Shrink the model" was not a hunch; it was the measured answer to a small-data problem.

Well count. Sweeping the number of training wells exposed exactly how data-hungry the objective is. At 3 wells the classification error was 93.115%, essentially the constant-prediction failure again. At 6 wells it dropped to 18.370% and at 9 wells to 1.055%; the full 14-well fractures-only model landed at 2.536%. The curve is steep and non-linear: nearly all of the gain arrives between 3 and 9 wells, after which the error stays in the low single digits rather than continuing to fall. The gap between a useless model and a deployable one was a handful of wells and a disciplined augmentation pipeline, not a redesign.

And the structural items — dropping the >70-sinusoid patches (over 95% of patches carried a modest count, only ~5% were that crowded), padding consistently, and training a fractures-only model so the network did not have to split capacity across bedding planes and fractures — each removed a specific way the fixed query budget could be overwhelmed. They do not photograph as well as an architecture diagram, but they are why the picks finally became stable.

We have since carried the same instinct into engagements with operators across the Middle East and the United States: when a first supervised model overfits, audit the data and the training loop before you touch the network. The backbone-size and augmentation results above are specific to this Middle East carbonate dataset and should not be read as universal constants — but the order of operations travels well.

The takeaway for the practitioner

Overfitting is rarely a verdict on your architecture. It is usually a verdict on your dataset size, your label distribution, and the size of the model you pointed at them. Our first DETR memorised because it was too large for the data and the data was too uniform for the matcher. The fix was a backlog of cheap, attributable changes — and the biggest single line item, augmentation, moved classification error from 100% to 2.618% on its own. Before you reach for a new backbone, plot your validation predictions, check whether they are constant, and work the data layer first.

Key takeaways

The first supervised DETR overfit hard — it emitted a near-constant prediction on validation despite a matched class split. A constant decoder output is a memorisation tell: the matcher has no varied cost surface to learn from.
The fix was a ~25-item engineering backlog (augment only sinusoid patches, shrink the backbone and encoder/decoder, single-channel input, full 360° width, padding, patch-height tuning, dropout/activation changes, drop >70-sinusoid patches, frac-only training, well-angle geometry) — not a new architecture.
Data augmentation alone moved fractures-only classification error from 100% to 2.618% (Hungarian loss 0.174→0.0135, parameter loss 0.575→0.062). Six photometric transforms, 10 variants per patch, grew the working set from 236→4,212 patches, 19→2,046 sinusoid-bearing patches, and 32→3,565 sinusoids — a >10x increase that gave the matcher variety to learn from.
Smaller is better on small data: a from-scratch ResNet-10 won (0.499% class error) against ResNet-34 (26.759%). Well count is the other lever: class error went from 93.115% (3 wells) to 18.370% (6) and 1.055% (9), with the full 14-well model at 2.536%.
Augmentation multiplies variety but cannot beat physics: the binary wireline log file's floor of 1 px = 3 cm fixes a ±3 cm depth error, and an 800-px (~2.2 m) patch height was chosen because >95% of sinusoids fit it. When a first model overfits, audit the data and training loop before touching the network.
