A reviewer of our fractures paper pushed on a sentence that, in 2025, reads like heresy. We had written that GeoBFDT, our Detection Transformer for picking fractures and beddings on borehole image logs, used no pretrained models: the ResNet feature extractor, the transformer encoder and decoder, and the matching losses were integrated and trained from scratch, together, with no ImageNet weights anywhere in the stack. The reviewer wanted us to justify that. It is a fair challenge. The entire modern reflex in vision is to start from a pretrained backbone, because ImageNet-scale pretraining is what lets a heavy network behave on a small target set. We did the opposite, on one of the smallest datasets you will ever see used to train a set-prediction transformer: fourteen vertical wells from a mid-sized Middle East carbonate operator, and a base training set of 236 patches. This piece is the answer we gave, generalised into a technique argument. Training from scratch was not stubbornness. It was a decision that forced three coordinated disciplines, and those disciplines are what actually made the model work.
The claim that needed defending
The pretraining reflex rests on a transfer assumption: the features a backbone learns on millions of natural photographs (edges, textures, object parts) are a useful starting point for whatever you point it at next. For borehole imagery that assumption is weak. The signal is a sinusoid traced across an unrolled cylinder of resistivity: greyscale, one channel, with no object-part hierarchy for an ImageNet backbone's features to map onto. A pretrained feature extractor arrives carrying priors tuned for cats and cars, and on this data those priors are closer to noise than to a head start. So the escape hatch that normally rescues a big backbone on small data, borrowed features, does not open here.
That leaves you with a stark position. If you cannot lean on pretraining, every parameter in the feature extractor has to be learned from those fourteen wells. That makes the model far more prone to overfitting, and it means the usual "just fine-tune a big pretrained backbone" playbook is off the table. The honest response is not to pretend the constraint away. It is to build the whole system around it: pick components small enough to be trainable from scratch on scarce data, manufacture enough signal to constrain them, and measure the result on a metric that cannot be gamed. Those are the three disciplines, and they are coupled. None of them works alone.
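For readers who have not worked with image logs, the "sinusoid traced across an unrolled cylinder" described above follows from simple geometry: a planar feature cutting a cylindrical borehole unrolls to one period of a cosine whose amplitude is the borehole radius times the tangent of the dip. The sketch below illustrates that relationship only. The function name, the example radius, and the sign convention (depth increasing downward, with the trace deepest at the dip azimuth) are assumptions made for illustration, not GeoBFDT's actual parameterisation.

```python
import math

def plane_trace_depth(theta_deg, centre_depth_m, dip_deg, azimuth_deg, radius_m):
    """Depth (m, increasing downward) at which a dipping plane meets the
    borehole wall at angular position theta_deg on the unrolled image.

    Assumed convention: the trace is deepest at the plane's dip azimuth.
    """
    amplitude = radius_m * math.tan(math.radians(dip_deg))
    phase = math.radians(theta_deg - azimuth_deg)
    return centre_depth_m + amplitude * math.cos(phase)

# Hypothetical feature: 45 degree dip toward azimuth 90 in a 0.1 m radius hole,
# sampled at four angular positions around the borehole wall.
trace = [plane_trace_depth(t, 1000.0, 45.0, 90.0, 0.1) for t in range(0, 360, 90)]
```

A flat bedding (dip of zero) collapses to a horizontal line, and a steeper dip stretches the sinusoid vertically, which is why three numbers (depth, dip, azimuth) are enough to describe each feature.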
Discipline one: assemble a light stack you can train from scratch
The first move is architectural restraint. Because there is no pretraining to absorb the risk of a heavy feature extractor, capacity itself becomes the enemy. We swept four ResNet backbones under otherwise identical conditions and the smallest, ResNet-10 with basic blocks, won decisively, posting a class error of 0.499 against 26.759 for ResNet-34. The deeper networks did not learn more; they memorised the fourteen wells faster than the matching loss could teach them the geology. We treat the full ablation and the mechanism of that cliff in a companion piece, Why a Smaller Backbone Won: ResNet-10 Beat ResNet-34 by 50x on Class Error, and will not re-derive it here. The point for this argument is narrower: the from-scratch decision and the light-backbone decision are the same decision. Once you refuse pretraining, a small ResNet stops being a compromise and becomes the correct choice, because it is the only feature extractor whose parameter count the available signal can actually pin down.
The rest of the stack is sized to match. GeoBFDT uses 4 transformer encoder layers and 4 decoder layers with a feedforward dimension of 1024 (chosen over 512 and 2048), a set of learned queries that each regress one sinusoid's depth, dip, and azimuth, and a Hungarian bipartite matching loss with a focal classification term and an L1 parameter term. We train it with AdamW, a batch size of 128, a learning rate of 0.0004, and early stopping after 40 epochs with no improvement. Every one of those numbers is a small-data choice. A larger batch, a hotter learning rate, or a patient training schedule with no early stop would all, on 236 base patches, walk the model straight into the overfitting regime the light backbone was chosen to avoid.
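The matching step is the least familiar part of that stack, so a minimal sketch may help. It assumes the cost form common in DETR-style implementations: a focal classification cost plus an L1 parameter cost, solved as an assignment problem with SciPy's `linear_sum_assignment`. The function name, the cost weights, the `alpha` and `gamma` values, and the normalised parameter layout are illustrative assumptions rather than GeoBFDT's actual code.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_queries(pred_prob, pred_params, gt_params,
                  alpha=0.25, gamma=2.0, w_class=1.0, w_l1=1.0):
    """One-to-one assignment of queries to ground-truth sinusoids.

    pred_prob:   (Q,)   predicted probability that each query is a sinusoid
    pred_params: (Q, 3) predicted (depth, dip, azimuth), normalised to [0, 1]
    gt_params:   (G, 3) ground-truth parameters in the same normalisation
    Returns (query_idx, gt_idx) index arrays of length min(Q, G).
    """
    p = np.clip(pred_prob, 1e-8, 1 - 1e-8)
    # Focal-style classification cost: confident queries are cheap to match.
    pos = alpha * (1 - p) ** gamma * -np.log(p)
    neg = (1 - alpha) * p ** gamma * -np.log(1 - p)
    class_cost = (pos - neg)[:, None]                       # shape (Q, 1)
    # L1 distance between every query and every ground-truth sinusoid.
    # Note: plain L1 ignores azimuth wrap-around; a real system may handle it.
    l1_cost = np.abs(pred_params[:, None, :] - gt_params[None, :, :]).sum(-1)
    cost = w_class * class_cost + w_l1 * l1_cost            # shape (Q, G)
    return linear_sum_assignment(cost)

# Hypothetical example: three queries, two ground-truth sinusoids.
pred_prob = np.array([0.9, 0.1, 0.8])
pred_params = np.array([[0.20, 0.5, 0.5],
                        [0.50, 0.5, 0.5],
                        [0.81, 0.3, 0.1]])
gt_params = np.array([[0.80, 0.3, 0.1],
                      [0.22, 0.5, 0.5]])
query_idx, gt_idx = match_queries(pred_prob, pred_params, gt_params)
```

In this toy case query 0 pairs with ground truth 1 and query 2 with ground truth 0; the unmatched query 1 would be supervised toward the no-object class. The losses are then computed only over the matched pairs, which is what lets a fixed set of queries predict a variable number of sinusoids.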
Discipline two: manufacture the signal that pretraining would have supplied
Pretraining is, in effect, borrowed data. When you refuse it, you have to make your own. On a single labelled well of 32 sinusoids and 236 patches, of which only 19 contained any sinusoid at all, there is nothing for a set-prediction transformer to learn from. The imbalance alone would collapse it to predicting "nothing here" everywhere. The build log of how we expanded that cold start into a trainable corpus, the transform menu chosen to model real image-log variability, and the overlapping-window sampling that grew it further, lives in From One Well and 32 Sinusoids to a Production Fracture Detector; the case for treating the transform choice as a design decision rather than a default is in Augmentation Is a Design Decision, Not a Default. Rather than re-tell those, hold onto the one number that carries this argument.
Augmentation, applied only to the sinusoid-bearing patches so that the class balance improved as the corpus grew, expanded the set roughly eighteenfold: 236 patches became 4,212, the 19 sinusoid-bearing patches became 2,046, and the 32 individual sinusoids became 3,565. That is not a tuning knob. In the augmentation ablation, with augmentation switched off, the model's classification error stayed pinned at 100%: it learned nothing usable. Switched on, it fell to 2.618%. That gap, from 100 to 2.618, is the difference between a non-model and a model, and it is the largest single lever we found in the entire programme. It is larger than the backbone choice, larger than the optimiser, larger than any hyperparameter. When you train from scratch, the augmentation pipeline is not decoration on top of the network. It is the substitute for the pretraining you gave up, and it does more work than the network itself.
The instrument above is the argument in one frame. Drag the augmentation lever and watch the class error collapse from 100 to 2.618 as the corpus grows; that collapse is the from-scratch decision paying for itself. The well-count curve underneath it is where the honesty lives. Class error falls off a cliff as wells accumulate: 93.115% at 3 wells, 18.370% at 6, 1.055% at 9, and 0.817% at 11. Then it ticks back up to 2.536% at 14. That last reading is not a mistake. Adding the final wells brought in harder or noisier geology that the model, trained from scratch with no external prior to fall back on, did not fully absorb. We report it as measured rather than smoothing it into a tidy monotone, because a small-data result that only ever improves is usually a result that has been massaged.
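The figures quoted in this section can be checked with a few lines of arithmetic. The sketch below uses only the counts and class-error readings reported in the text; the variable names are ours, and nothing here reproduces the training pipeline itself.

```python
# Counts reported in the text: (before augmentation, after augmentation).
counts = {
    "patches": (236, 4212),
    "sinusoid-bearing patches": (19, 2046),
    "sinusoids": (32, 3565),
}
growth = {name: after / before for name, (before, after) in counts.items()}

# Share of patches that contain at least one sinusoid, before and after.
positive_share_before = 19 / 236
positive_share_after = 2046 / 4212

# Class error (%) by number of training wells, as reported.
class_error_by_wells = {3: 93.115, 6: 18.370, 9: 1.055, 11: 0.817, 14: 2.536}
wells = sorted(class_error_by_wells)
# Steps where adding wells made the error worse instead of better.
regressions = [(a, b) for a, b in zip(wells, wells[1:])
               if class_error_by_wells[b] > class_error_by_wells[a]]

for name, factor in growth.items():
    print(f"{name}: x{factor:.1f}")
print(f"positive share: {positive_share_before:.1%} -> {positive_share_after:.1%}")
print("non-monotone steps:", regressions)
```

Two things stand out when the numbers are laid side by side. The sinusoid-bearing patches grow by two orders of magnitude while the corpus as a whole grows by about eighteen times, so the share of patches containing a sinusoid rises from about 8% to nearly half. And the only step in the well-count curve where error rises is the last one, from 11 to 14 wells.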
Discipline three: a metric that cannot flatter the model
The third discipline is evaluation, and it is the one the pretraining reflex tends to skip. On a set this small, with an overwhelming no-object class, almost any scoring rule can be gamed by a model that learns to say "nothing here." A single pixel of the unrolled image corresponds to about 3 cm of depth, so a feature cannot be localised more precisely than one raster cell, and a bounding-box overlap metric is meaningless for a sinusoid that has no box. So we did not score the model on box overlap or on raw accuracy. We built a novel evaluation strategy around depth-matching in physical bands, calibrated to the pixel budget, and we measured it on a held-back, non-overlapping corpus the augmentation pipeline had never touched: 2,291 images across 14 wells for the fractures-only model. Train on overlapping, augmented patches; evaluate on clean ones. Conflating the two is the most common way an image-log model flatters itself, and the from-scratch setting makes that trap more dangerous, not less, because there is no pretraining baseline to sanity-check against. The novel metric is the third leg of the same stool. Without it, the 2.618% would be a number you could not trust.
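To show what scoring by depth band can look like, here is a minimal sketch. It is not the metric we built: the band width (`band_pixels`), the greedy nearest-first matching, and the function name are all hypothetical, and only the figure of about 3 cm per pixel comes from the text. The idea it illustrates is that a prediction counts only if it lands within a physical depth tolerance of a ground-truth feature that has not been claimed already.

```python
PIXEL_DEPTH_M = 0.03  # about 3 cm of depth per pixel, from the text

def depth_band_scores(pred_depths_m, true_depths_m, band_pixels=2):
    """Precision and recall with one-to-one matching inside a depth band.

    band_pixels is a hypothetical tolerance; the bands actually used for
    GeoBFDT are not given in this article. Empty inputs score 0.0 here
    by convention.
    """
    tolerance = band_pixels * PIXEL_DEPTH_M
    unmatched = sorted(true_depths_m)
    hits = 0
    for pred in sorted(pred_depths_m):
        # Closest still-unmatched ground-truth depth, if any remains.
        best = min(unmatched, key=lambda t: abs(t - pred), default=None)
        if best is not None and abs(best - pred) <= tolerance:
            unmatched.remove(best)  # each true feature can be claimed once
            hits += 1
    precision = hits / len(pred_depths_m) if pred_depths_m else 0.0
    recall = hits / len(true_depths_m) if true_depths_m else 0.0
    return precision, recall

# Hypothetical depths in metres: two predictions fall inside a 3-pixel band.
true_depths = [1000.00, 1000.50, 1002.00]
pred_depths = [1000.02, 1000.58, 1003.00]
precision, recall = depth_band_scores(pred_depths, true_depths, band_pixels=3)

# A model that predicts "nothing here" earns no credit on a patch with features.
silent_scores = depth_band_scores([], true_depths)
```

Under a rule like this, the "nothing here" strategy scores zero recall wherever features exist, which is the property raw accuracy lacks on a corpus dominated by empty patches. A production version would likely use an optimal rather than greedy assignment and would also compare dip and azimuth.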
When a from-scratch transformer is the right call
The reviewer's question has a general answer. Refusing pretraining is defensible exactly when the transfer assumption fails, that is, when your imagery is far enough from natural photographs that borrowed features are noise rather than a head start. Even then, it is defensible only if you accept the three obligations it creates. Size the whole stack down until it is trainable from scratch on the data you actually have. Manufacture enough signal, through augmentation and windowing, to constrain the parameters you chose to learn from nothing. And evaluate on a metric calibrated to your physics and measured on data the training pipeline never saw. GeoBFDT works without ImageNet because those three moves were taken together, and because the largest of them, augmentation, was treated as the load-bearing component it is rather than as a finishing touch. The novelty was never a single clever architecture. It was the discipline of doing small-data training honestly.
Key takeaways
- GeoBFDT's ResNet, transformer encoder-decoder, and losses were assembled and trained from scratch, with no pretrained weights, because ImageNet-style transfer is weak on greyscale borehole imagery where the signal is a sinusoid on an unrolled cylinder, not an object-part hierarchy.
- Refusing pretraining forces three coupled disciplines: a stack small enough to train from scratch, enough manufactured signal to constrain it, and a metric that cannot be gamed. None works alone.
- The light-backbone and from-scratch decisions are the same decision. A ResNet-10 posted 0.499 class error against ResNet-34's 26.759; heavier networks memorise 14 wells before the matching loss can teach them the geology.
- Augmentation was the single largest lever. Applied only to sinusoid-bearing patches, it grew 236 patches to 4,212, 19 sinusoid patches to 2,046, and 32 sinusoids to 3,565, and it moved class error from 100% with no augmentation to 2.618% with it: the difference between a non-model and a model.
- The well-count ablation is reported honestly, including the non-monotone reading where 11 wells (0.817%) sits below 14 wells (2.536%), and the model is scored on a depth-band metric calibrated to the ~3 cm pixel budget, measured on a held-back non-overlapping corpus of 2,291 images across 14 wells.
Limitations
This is one engagement's evidence, not a general proof that pretraining should be refused. The result holds for a specific regime: greyscale borehole image logs where natural-image transfer is genuinely weak, and a dataset small enough that a heavy from-scratch backbone overfits. On imagery closer to natural photographs, or at larger data scale, a pretrained backbone would very likely win, and nothing here argues otherwise. The ablation numbers are single-configuration readings from the engagement archive rather than averaged over many seeds, so the exact figures carry run-to-run variance; the ordering is stable, the third decimal is not. The non-monotone well-count reading at 14 wells is our reading of harder incoming geology and is not a controlled isolation of that cause. And the depth-band evaluation is calibrated to this tool's raster resolution, so the specific bands do not transfer unchanged to a dataset with a different pixel-to-depth ratio.