Why We Trained a Detection Transformer From Scratch on Just 14 Wells

A reviewer asked us to justify a claim that reads as heresy in modern computer vision: we used no pretrained models. GeoBFDT's ResNet, transformer encoder-decoder, and losses were assembled and trained from scratch on a 236-patch, 14-well confidential dataset. This is the technique argument for that decision, and for the three coordinated disciplines it forced: no pretrained backbone, a deliberately light ResNet-10 to resist overfitting, and greater-than-tenfold augmentation, which turned out to be the single largest lever on class error.

By Narendra Patwardhan, Tarry Singh · 10 min read
EarthScan insight

A reviewer of our fractures paper pushed on a sentence that, in 2025, reads like heresy. We had written that GeoBFDT, our Detection Transformer for picking fractures and beddings on borehole image logs, used no pretrained models: the ResNet feature extractor, the transformer encoder and decoder, and the matching losses were integrated and trained from scratch, together, with no ImageNet weights anywhere in the stack. The reviewer wanted us to justify that. It is a fair challenge. The entire modern reflex in vision is to start from a pretrained backbone, because ImageNet-scale pretraining is what lets a heavy network behave on a small target set. We did the opposite, on one of the smallest datasets you will ever see used to train a set-prediction transformer: fourteen vertical wells from a mid-sized Middle East carbonate operator, and a base training set of 236 patches. This piece is the answer we gave, generalised into a technique argument. Training from scratch was not stubbornness. It was a decision that forced three coordinated disciplines, and those disciplines are what actually made the model work.

The claim that needed defending

The pretraining reflex rests on a transfer assumption: the features a backbone learns on millions of natural photographs (edges, textures, object parts) are a useful starting point for whatever you point it at next. For borehole imagery that assumption is weak. The signal is a sinusoid traced across an unrolled cylinder of resistivity readings: greyscale, single-channel, with no object-part hierarchy for an ImageNet backbone to lend you. A pretrained feature extractor arrives carrying priors tuned for cats and cars, and on this data those priors are closer to noise than to a head start. So the escape hatch that normally rescues a big backbone on small data, borrowed features, does not open here.
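To make the geometry concrete: a plane cutting a cylindrical borehole traces one period of a sinusoid on the unrolled image, with amplitude set by dip and phase set by azimuth. The sketch below is illustrative only. The borehole radius, the one-column-per-degree sampling, and the sign convention are assumptions, not details of the GeoBFDT pipeline.

```python
import math

def sinusoid_trace(depth_m, dip_deg, azimuth_deg, radius_m=0.108, n_cols=360):
    """Depth (m) at which a dipping plane crosses each column of the unrolled image.

    Assumptions (hypothetical, not from the source): a borehole radius of 0.108 m,
    one image column per degree, depth increasing downward, and the deepest point
    of the trace lying at the dip azimuth.
    """
    # A plane dipping at angle d across a cylinder of radius r rises and falls by r * tan(d).
    amplitude = radius_m * math.tan(math.radians(dip_deg))
    azimuth = math.radians(azimuth_deg)
    trace = []
    for col in range(n_cols):
        theta = 2.0 * math.pi * col / n_cols
        trace.append(depth_m + amplitude * math.cos(theta - azimuth))
    return trace

trace = sinusoid_trace(depth_m=1500.0, dip_deg=45.0, azimuth_deg=90.0)
peak_to_trough_m = max(trace) - min(trace)
deepest_column = trace.index(max(trace))
```

Three numbers (depth, dip, azimuth) fully determine the curve, which is why there is no object-part hierarchy here for natural-image features to latch onto.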

That leaves you with a stark position. If you cannot lean on pretraining, every parameter in the feature extractor has to be learned from those fourteen wells. That makes the model far more prone to overfitting, and it means the usual "just fine-tune a big pretrained backbone" playbook is off the table. The honest response is not to pretend the constraint away. It is to build the whole system around it: pick components small enough to be trainable from scratch on scarce data, manufacture enough signal to constrain them, and measure the result on a metric that cannot be gamed. Those are the three disciplines, and they are coupled. None of them works alone.

Discipline one: assemble a light stack you can train from scratch

The first move is architectural restraint. Because there is no pretraining to absorb the risk of a heavy feature extractor, capacity itself becomes the enemy. We swept four ResNet backbones under otherwise identical conditions and the smallest, ResNet-10 with basic blocks, won decisively, posting a class error of 0.499 against 26.759 for ResNet-34. The deeper networks did not learn more; they memorised the fourteen wells faster than the matching loss could teach them the geology. We treat the full ablation and the mechanism of that cliff in a companion piece, Why a Smaller Backbone Won: ResNet-10 Beat ResNet-34 by 50x on Class Error, and will not re-derive it here. The point for this argument is narrower: the from-scratch decision and the light-backbone decision are the same decision. Once you refuse pretraining, a small ResNet stops being a compromise and becomes the correct choice, because it is the only feature extractor whose parameter count the available signal can actually pin down.
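For a rough sense of the capacity gap, the sketch below counts convolution weights in basic-block ResNets. It assumes the conventional layout (a 7x7 stem, stage widths of 64/128/256/512, block counts of 1-1-1-1 for ResNet-10 and 3-4-6-3 for ResNet-34) and a single-channel input. GeoBFDT's actual layer widths are not given here, so treat the totals as indicative rather than as the model's real parameter counts.

```python
def basic_block_params(c_in, c_out):
    """Conv weights in one basic block: two 3x3 convs, plus a 1x1 shortcut when widths differ."""
    params = 9 * c_in * c_out + 9 * c_out * c_out
    if c_in != c_out:
        params += c_in * c_out  # 1x1 projection on the shortcut path
    return params

def resnet_conv_params(blocks_per_stage, in_channels=1, widths=(64, 128, 256, 512)):
    """Total conv weights for a basic-block ResNet (batch norm and any head excluded)."""
    total = 7 * 7 * in_channels * widths[0]  # 7x7 stem convolution
    c_in = widths[0]
    for n_blocks, c_out in zip(blocks_per_stage, widths):
        for _ in range(n_blocks):
            total += basic_block_params(c_in, c_out)
            c_in = c_out
    return total

resnet10 = resnet_conv_params((1, 1, 1, 1))  # assumed ResNet-10 layout
resnet34 = resnet_conv_params((3, 4, 6, 3))  # standard ResNet-34 layout
ratio = resnet34 / resnet10

print(f"ResNet-10: {resnet10:,} conv weights")
print(f"ResNet-34: {resnet34:,} conv weights ({ratio:.1f}x)")
```

Under these assumptions the deeper backbone carries a little over four times as many weights, all of which would have to be constrained by the same handful of wells.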

The rest of the stack is sized to match. GeoBFDT uses 4 transformer encoder layers and 4 decoder layers with a feedforward dimension of 1024 (chosen over 512 and 2048), a set of learned queries that each regress one sinusoid's depth, dip, and azimuth, and a Hungarian bipartite matching loss with a focal classification term and an L1 parameter term. We train it with AdamW, a batch size of 128, a learning rate of 0.0004, and early stopping after 40 epochs with no improvement. Every one of those numbers is a small-data choice. A larger batch, a hotter learning rate, or a patient training schedule with no early stop would all, on 236 base patches, walk the model straight into the overfitting regime the light backbone was chosen to avoid.
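The matching step can be illustrated with a toy example. This is a minimal sketch, not the GeoBFDT loss: it uses only an L1 cost on normalised (depth, dip, azimuth) tuples, leaves out the focal classification term, ignores azimuth wrap-around, and finds the best assignment by brute force. A real implementation would use a Hungarian solver such as `scipy.optimize.linear_sum_assignment`.

```python
from itertools import permutations

def l1_cost(pred, target):
    """L1 distance between two (depth, dip, azimuth) tuples, each normalised to [0, 1]."""
    return sum(abs(p - t) for p, t in zip(pred, target))

def match_queries(preds, targets):
    """Return (total_cost, assignment), where assignment[i] is the query matched to target i.

    Brute force over permutations, so only suitable for a handful of queries.
    """
    best_cost, best_assignment = float("inf"), None
    for perm in permutations(range(len(preds)), len(targets)):
        cost = sum(l1_cost(preds[q], targets[t]) for t, q in enumerate(perm))
        if cost < best_cost:
            best_cost, best_assignment = cost, perm
    return best_cost, best_assignment

# Three query outputs, two labelled sinusoids (all values are made up for illustration).
preds = [(0.80, 0.30, 0.10), (0.21, 0.52, 0.68), (0.50, 0.50, 0.50)]
targets = [(0.20, 0.50, 0.70), (0.82, 0.28, 0.12)]

cost, assignment = match_queries(preds, targets)
unmatched = sorted(set(range(len(preds))) - set(assignment))
```

In a DETR-style loss, the queries left in `unmatched` are the ones trained toward the no-object class, which is where the class imbalance discussed in the next section bites.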

Discipline two: manufacture the signal that pretraining would have supplied

Pretraining is, in effect, borrowed data. When you refuse it, you have to make your own. On a single labelled well of 32 sinusoids and 236 patches, of which only 19 contained any sinusoid at all, there is nothing for a set-prediction transformer to learn from. The imbalance alone would collapse it to predicting "nothing here" everywhere. The build log of how we expanded that cold start into a trainable corpus, the transform menu chosen to model real image-log variability, and the overlapping-window sampling that grew it further, lives in From One Well and 32 Sinusoids to a Production Fracture Detector; the case for treating the transform choice as a design decision rather than a default is in Augmentation Is a Design Decision, Not a Default. Rather than re-tell those, hold onto the one number that carries this argument.

Augmentation, applied only to the sinusoid-bearing patches so that the class balance improved as the corpus grew, expanded the set by more than tenfold: 236 patches became 4,212, the 19 sinusoid-bearing patches became 2,046, and the 32 individual sinusoids became 3,565. That is not a tuning knob. In the augmentation ablation, with augmentation switched off, the model's class error stayed pinned at 100%: it learned nothing usable. Switched on, it fell to 2.618%. That gap, from 100 to 2.618, is the difference between a non-model and a model, and it is the largest single lever we found in the entire programme. It is larger than the backbone choice, larger than the optimiser, larger than any hyperparameter. When you train from scratch, the augmentation pipeline is not decoration on top of the network. It is the substitute for the pretraining you gave up, and it does more work than the network itself.
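The expansion factors and the shift in class balance follow directly from the counts above. A short calculation, using only the figures quoted in this section:

```python
before = {"patches": 236, "sinusoid_patches": 19, "sinusoids": 32}
after = {"patches": 4212, "sinusoid_patches": 2046, "sinusoids": 3565}

# How many times larger each count became after augmentation.
expansion = {key: after[key] / before[key] for key in before}

# Share of patches that contain at least one sinusoid, before and after.
positive_share_before = before["sinusoid_patches"] / before["patches"]
positive_share_after = after["sinusoid_patches"] / after["patches"]

for key, factor in expansion.items():
    print(f"{key}: {before[key]} -> {after[key]} ({factor:.1f}x)")
print(f"sinusoid-bearing share: {positive_share_before:.1%} -> {positive_share_after:.1%}")
```

The patch count grows by roughly 18x, and the share of patches containing a sinusoid rises from about 8% to about 49%, which is the class-balance improvement the ablation depends on.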

[Interactive figure: From scratch on 14 wells, no ImageNet, no pretrained backbone. Panels: the doctrine (each move and the class error it bought); what augmentation built from the confidential set; augmentation lever (class error, no aug vs with aug); well-count ablation (class error vs training wells). Settings shown: batch size 128, learning rate 0.0004, AdamW, 4 encoder / 4 decoder layers, FFN 1024, early stop at 40 epochs.]
The from-scratch discipline that let a detection transformer converge without ImageNet-scale pretraining, drawn from the engagement's ablation tables. The left column states the doctrine as three coordinated moves and the class error each bought: no pretrained backbone (ResNet, transformer, and losses assembled and trained from scratch), a deliberately light ResNet-10 (0.499 class error) chosen over ResNet-34 (26.759) to resist overfitting on a tiny set, and greater-than-tenfold augmentation, the single largest lever. Drag the augmentation lever from no-aug to with-aug and the class error collapses from 100.000 to 2.618 while the corpus grows from 236 to 4,212 patches, 19 to 2,046 sinusoid patches, and 32 to 3,565 sinusoids, because only the sinusoid-bearing patches were multiplied. The orange element is the only one that argues: the augmentation error bar that collapses as you drag. The well-count curve below plots class error against training wells honestly, including the non-monotonic reading where 11 wells (0.817) sits below 14 wells (2.536). Every number is sourced from the engagement archive; nothing here is illustrative.

The instrument above is the argument in one frame. Drag the augmentation lever and watch the class error collapse from 100 to 2.618 as the corpus grows; that collapse is the from-scratch decision paying for itself. The well-count curve underneath it is where the honesty lives. Class error falls off a cliff as wells accumulate: 93.115% at 3 wells, 18.370% at 6, 1.055% at 9, and 0.817% at 11. Then it ticks back up to 2.536% at 14. That last reading is not a mistake. Our reading is that the final wells brought in harder or noisier geology that the model, trained from scratch with no external prior to fall back on, did not fully absorb. We report it as measured rather than smoothing it into a tidy monotone, because a small-data result that only ever improves is usually a result that has been massaged.
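The same check can be run mechanically on the reported readings. A small sketch that finds the best well count and flags any step where adding wells made class error worse:

```python
# Class error (%) by number of training wells, as reported in the ablation.
class_error_by_wells = {3: 93.115, 6: 18.370, 9: 1.055, 11: 0.817, 14: 2.536}

wells = sorted(class_error_by_wells)
best_wells = min(wells, key=class_error_by_wells.get)

# Steps where adding wells increased class error rather than reducing it.
regressions = [
    (prev, curr)
    for prev, curr in zip(wells, wells[1:])
    if class_error_by_wells[curr] > class_error_by_wells[prev]
]

print(f"lowest class error at {best_wells} wells")
print(f"non-monotone steps: {regressions}")
```

A check like this is cheap to keep next to any ablation table, so that a reversal is reported rather than quietly dropped.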

Discipline three: a metric that cannot flatter the model

The third discipline is evaluation, and it is the one the pretraining reflex tends to skip. On a set this small, with an overwhelming no-object class, almost any scoring rule can be gamed by a model that learns to say "nothing here." A single pixel of the unrolled image corresponds to about 3 cm of depth, so a feature cannot be localised more precisely than one raster cell, and a bounding-box overlap metric is meaningless for a sinusoid that has no box. So we did not score the model on box overlap or on raw accuracy. We built a novel evaluation strategy around depth-matching in physical bands, calibrated to the pixel budget, and we measured it on a held-back, non-overlapping corpus the augmentation pipeline had never touched: 2,291 images across 14 wells for the fractures-only model. Train on overlapping, augmented patches; evaluate on clean ones. Conflating the two is the most common way an image-log model flatters itself, and the from-scratch setting makes that trap more dangerous, not less, because there is no pretraining baseline to sanity-check against. The novel metric is the third leg of the same stool. Without it, the 2.618% would be a number you could not trust.
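One way a band-based depth score could look is sketched below. This is not the published metric: the band width of two pixels, the greedy one-to-one matching, and the precision/recall summary are all assumptions made for illustration. Only the roughly 3 cm per pixel figure comes from the text.

```python
PIXEL_CM = 3.0  # approximate depth per pixel, from the text

def depth_band_score(pred_depths_cm, true_depths_cm, band_px=2):
    """Greedy one-to-one matching of predicted to true depths within +/- band_px pixels.

    band_px is a hypothetical tolerance; the source does not state the band widths.
    Returns (precision, recall).
    """
    tolerance_cm = band_px * PIXEL_CM
    # Consider the closest candidate pairs first so each match is the tightest available.
    pairs = sorted(
        (abs(p - t), i, j)
        for i, p in enumerate(pred_depths_cm)
        for j, t in enumerate(true_depths_cm)
        if abs(p - t) <= tolerance_cm
    )
    used_pred, used_true = set(), set()
    for _, i, j in pairs:
        if i not in used_pred and j not in used_true:
            used_pred.add(i)
            used_true.add(j)
    hits = len(used_pred)
    precision = hits / len(pred_depths_cm) if pred_depths_cm else 0.0
    recall = hits / len(true_depths_cm) if true_depths_cm else 0.0
    return precision, recall

precision, recall = depth_band_score(
    pred_depths_cm=[100.0, 205.0, 400.0],
    true_depths_cm=[103.0, 200.0, 300.0, 410.0],
)
silent_model = depth_band_score([], [103.0, 200.0])
```

A model that predicts nothing scores zero recall under a rule like this, so the "nothing here" shortcut that inflates raw accuracy on an imbalanced set earns no credit.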

When a from-scratch transformer is the right call

The reviewer's question has a general answer. Refusing pretraining is defensible exactly when the transfer assumption fails, when your imagery is far enough from natural photographs that borrowed features are noise rather than a head start, and it is only defensible if you accept the three obligations it creates. Size the whole stack down until it is trainable from scratch on the data you actually have. Manufacture enough signal, through augmentation and windowing, to constrain the parameters you chose to learn from nothing. And evaluate on a metric calibrated to your physics and measured on data the training pipeline never saw. GeoBFDT works without ImageNet because those three moves were taken together, and because the largest of them, augmentation, was treated as the load-bearing component it is rather than as a finishing touch. The novelty was never a single clever architecture. It was the discipline of doing small-data training honestly.

Key takeaways

  1. GeoBFDT's ResNet, transformer encoder-decoder, and losses were assembled and trained from scratch, with no pretrained weights, because ImageNet-style transfer is weak on greyscale borehole imagery where the signal is a sinusoid on an unrolled cylinder, not an object-part hierarchy.
  2. Refusing pretraining forces three coupled disciplines: a stack small enough to train from scratch, enough manufactured signal to constrain it, and a metric that cannot be gamed. None works alone.
  3. The light-backbone and from-scratch decisions are the same decision. A ResNet-10 posted 0.499 class error against ResNet-34's 26.759; heavier networks memorise 14 wells before the matching loss can teach them the geology.
  4. Augmentation was the single largest lever. Applied only to sinusoid-bearing patches, it grew 236 patches to 4,212, 19 sinusoid patches to 2,046, and 32 sinusoids to 3,565, and it moved class error from 100% with no augmentation to 2.618% with it: the difference between a non-model and a model.
  5. The well-count ablation is reported honestly, including the non-monotone reading where 11 wells (0.817%) sits below 14 wells (2.536%), and the model is scored on a depth-band metric calibrated to the ~3 cm pixel budget, measured on a held-back non-overlapping corpus of 2,291 images across 14 wells.

Limitations

This is one engagement's evidence, not a general proof that pretraining should be refused. The result holds for a specific regime: greyscale borehole image logs where natural-image transfer is genuinely weak, and a dataset small enough that a heavy from-scratch backbone overfits. On imagery closer to natural photographs, or at larger data scale, a pretrained backbone would very likely win, and nothing here argues otherwise. The ablation numbers are single-configuration readings from the engagement archive rather than averaged over many seeds, so the exact figures carry run-to-run variance; the ordering is stable, the third decimal is not. The non-monotone well-count reading at 14 wells is our reading of harder incoming geology and is not a controlled isolation of that cause. And the depth-band evaluation is calibrated to this tool's raster resolution, so the specific bands do not transfer unchanged to a dataset with a different pixel-to-depth ratio.

© 2026 Copyright. Earthscan