Every team that ships a detector eventually has the same argument, and it is almost never about the architecture. It is about the loss. In a roughly twenty-month engagement with a mid-sized Middle East carbonate operator we partnered with, the model that picks bedding and fracture sinusoids off two different microresistivity imaging tools — a customised Detection Transformer, internally GeoBFDT — converged on a focal classification loss plus an L1 parameter loss, run with a 0.5 confidence threshold and a class-loss weight five times the parameter-loss weight. None of those choices were defaults. Each one was a response to a property of the data, and the most important property of this data is that the classes are violently imbalanced. This piece is about why class-weighted cross-entropy lost that argument, and why the loss you choose is really a decision about the shape of your label distribution — the kind of engineering call that decides whether a borehole-image detector ever leaves the notebook.
The imbalance is the problem, and it is structural
Start with the geology, because the loss has to answer to it. The classifier in this pipeline has three labels per query: bedding, fracture, and no-sinusoid. The first two are the geological objects a petrophysicist actually wants picked; the third is the empty class a fixed-size set predictor needs so that surplus queries can say "nothing here." In a real reservoir interval, the empty class swamps everything else — most of a borehole wall, most of the time, contains no pickable sinusoid at the resolution you are training on.
How extreme is it? The first well we trained on — a vertical well — made the point with brutal clarity. Across a roughly sixty-metre reservoir section, the entire interval yielded only 32 sinusoids. Tiling that section into 236 image patches produced just 19 patches that contained any sinusoid at all. The other 217 were, as far as the classifier was concerned, empty. That is the regime you are designing a loss for: a positive class that appears in well under one patch in ten, and a negative class that is everywhere. A naive objective trained on that distribution learns the laziest possible hypothesis — predict "no-sinusoid" and be right almost every time — and a geologist gets back a blank log.
That structural fact, not a benchmark leaderboard, is why loss design here is a data-imbalance decision first and a modelling decision second.
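The arithmetic behind that imbalance is worth writing down once. The sketch below uses only the patch counts quoted above for the first well; it is a patch-level view, and the per-query imbalance the classifier actually sees is not stated here, so treat it as an illustration rather than the training statistics.

```python
# Patch counts quoted above for the first vertical well.
total_patches = 236
positive_patches = 19  # patches containing at least one sinusoid

empty_patches = total_patches - positive_patches
positive_fraction = positive_patches / total_patches

# A degenerate patch-level classifier that always answers "empty" is right
# on every empty patch and wrong on every patch that holds a sinusoid.
always_empty_accuracy = empty_patches / total_patches
always_empty_recall = 0.0  # it never finds a single positive patch

print(f"empty patches:         {empty_patches}")
print(f"positive fraction:     {positive_fraction:.1%}")
print(f"always-empty accuracy: {always_empty_accuracy:.1%}")
print(f"always-empty recall:   {always_empty_recall:.1%}")
```

Roughly 8% of patches are positive, so the lazy hypothesis scores about 92% accuracy at the patch level while finding nothing. That is why accuracy alone is a misleading target here.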
Cross-entropy with class weights: the honest first attempt
The textbook fix for imbalance is class-weighted cross-entropy: scale up the loss contribution of the rare classes so the gradient stops ignoring them. We tried exactly that — cross-entropy with per-class weights across the bedding / fracture / no-sinusoid split — and it is a perfectly reasonable starting point. It is stable, it is well understood, and the class weights do pull the rare positives back into the gradient.
But two things hold it back on this kind of distribution. First, weighting is a blunt instrument: a fixed per-class multiplier treats every example in a class identically, so the thousands of easy empty patches — flat, featureless wall the model already classifies correctly with high confidence — keep contributing gradient long after they have stopped teaching the model anything. They drown out the handful of genuinely hard examples near a fracture edge. Second, to get acceptable precision out of the cross-entropy model we had to run it at a high confidence threshold — 0.9 — accepting a query as a real sinusoid only when the network was very sure. That buys precision, but on a recall-critical task it is exactly the wrong trade: a missed fracture is a missed fracture, and a 0.9 gate quietly discards the marginal picks that matter most in a sparsely fractured zone.
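The bluntness of a fixed per-class weight is easy to see in a few lines. This is a minimal pure-Python sketch, not the project's training code: the weight values and the logits are hypothetical, chosen only to show that an easy and a hard empty query receive exactly the same multiplier.

```python
import math

CLASSES = ("bedding", "fracture", "no_sinusoid")

# Hypothetical weights for illustration only: empty class scaled down.
CLASS_WEIGHTS = {"bedding": 1.0, "fracture": 1.0, "no_sinusoid": 0.1}


def softmax(logits):
    """Numerically stable softmax over a list of raw scores."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_cross_entropy(logits, target):
    """Cross-entropy for one query, scaled by the weight of its true class."""
    probs = softmax(logits)
    p_true = probs[CLASSES.index(target)]
    return -CLASS_WEIGHTS[target] * math.log(p_true)


# Hypothetical logits: one empty query the model finds easy, one it finds hard,
# and a fracture query with the same uncertain logits as the hard empty one.
easy_empty = weighted_cross_entropy([-4.0, -4.0, 4.0], "no_sinusoid")
hard_empty = weighted_cross_entropy([0.5, 0.0, 0.2], "no_sinusoid")
rare_fracture = weighted_cross_entropy([0.5, 0.0, 0.2], "fracture")

print(f"easy empty:    {easy_empty:.6f}")
print(f"hard empty:    {hard_empty:.6f}")
print(f"rare fracture: {rare_fracture:.6f}")
```

The weight depends only on the class, so every empty query is scaled by the same 0.1 whether the model has mastered it or not. Each easy example contributes little on its own, but nothing in the weighting stops a very large number of them from adding up.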
Focal loss: down-weight the easy, keep the threshold low
Focal loss attacks the first problem at its root. Instead of a fixed per-class weight, it applies a per-example modulating factor that shrinks the loss on examples the model already classifies confidently and leaves the loss on hard, uncertain examples almost untouched. The easy empty patches — the 217-out-of-236 majority — stop dominating the gradient automatically, without anyone hand-tuning a class weight to compensate. This is the same foreground-background imbalance that focal loss was originally designed for in dense object detection; a borehole image log is just a particularly lopsided instance of it.
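The modulating factor can be sketched directly. In the usual formulation of focal loss, the cross-entropy on an example is multiplied by (1 - p_true) ** gamma, where p_true is the probability the model assigns to the correct class. The value gamma = 2 below is a commonly used default and an assumption here; the value used in GeoBFDT is not stated, and any class-balancing alpha term is left out for brevity.

```python
import math


def cross_entropy(p_true):
    """Plain cross-entropy given the probability assigned to the true class."""
    return -math.log(p_true)


def focal_loss(p_true, gamma=2.0):
    """Cross-entropy scaled by the modulating factor (1 - p_true) ** gamma."""
    return (1.0 - p_true) ** gamma * cross_entropy(p_true)


# Compare an easy example, a middling one, and a hard one.
for p_true in (0.99, 0.60, 0.10):
    kept = focal_loss(p_true) / cross_entropy(p_true)
    print(f"p_true={p_true:.2f}  fraction of cross-entropy kept: {kept:.4f}")
```

With gamma = 2 an example classified at 0.99 keeps about 0.0001 of its cross-entropy loss, one at 0.60 keeps 0.16, and one at 0.10 keeps 0.81. Setting gamma to 0 recovers plain cross-entropy.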
Because focal loss has already corrected the imbalance inside the gradient, it lets you drop the confidence threshold and chase recall. The shipped model runs at a recall-focused 0.5 confidence threshold, rather than the 0.9 the cross-entropy variant needed to stay precise. That single difference reframes the whole detector: the cross-entropy configuration optimised for precision and paid for it with a conservative gate that costs recall; the focal configuration optimises for recall and trusts the modulating factor to keep precision honest. For a geoscience workflow where the cost of a false negative (a fracture you never see) dwarfs the cost of a false positive (a pick a human reviewer waves away in seconds), recall-focused is the correct posture.
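What the threshold does to recall can be shown on a toy example. The scores and labels below are entirely made up for illustration and are not results from the engagement; `score` is assumed to be the highest non-empty class probability for a query, and `is_true` marks whether that query matches a real sinusoid.

```python
# Hypothetical query outputs, invented for illustration only.
toy_predictions = [
    {"score": 0.95, "is_true": True},
    {"score": 0.72, "is_true": True},
    {"score": 0.55, "is_true": True},
    {"score": 0.60, "is_true": False},
    {"score": 0.30, "is_true": False},
]


def accept(predictions, threshold):
    """Keep queries whose score meets or exceeds the confidence threshold."""
    return [p for p in predictions if p["score"] >= threshold]


def recall(predictions, threshold):
    """Share of true sinusoids that survive the threshold."""
    true_total = sum(1 for p in predictions if p["is_true"])
    kept_true = sum(1 for p in accept(predictions, threshold) if p["is_true"])
    return kept_true / true_total


def precision(predictions, threshold):
    """Share of accepted picks that are true sinusoids."""
    kept = accept(predictions, threshold)
    return sum(1 for p in kept if p["is_true"]) / len(kept)


for threshold in (0.9, 0.5):
    print(
        f"threshold={threshold}: "
        f"recall={recall(toy_predictions, threshold):.2f}, "
        f"precision={precision(toy_predictions, threshold):.2f}"
    )
```

On this toy set the 0.9 gate keeps one pick out of three true sinusoids, while the 0.5 gate keeps all three at the price of one false positive for a reviewer to dismiss. The marginal picks between 0.5 and 0.9 are the ones the stricter gate throws away.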
The general lesson is that the loss function, not the backbone, decides what the network actually learns, and it is worth seeing laid out as an ablation. A sibling EarthScan model for borehole-image segmentation, VeerNet, ran exactly this experiment under controlled conditions, pitting focal loss against four alternatives.
The verdict there is domain-specific — for dense per-pixel segmentation, focal loss was actually unstable, its loss spiking through the first twenty epochs with no recovery, and a metric-aligned loss won — and that is precisely the point. There is no universally best loss. The right loss is the one whose gradient matches the structure of your task: for VeerNet's segmentation, a metric-aligned objective; for our sparse, imbalanced set-prediction classifier, focal loss. The discipline is the same in both cases — interrogate the loss against the data, do not inherit it from a tutorial.
The other half of the loss: L1 on the parameters
Classification is only half the job. Each query that survives also has to regress the physical parameters of its sinusoid — depth, dip, and azimuth — and that term needs its own loss. We use an L1 loss (mean absolute error) on those parameters rather than an L2 / MSE loss. The reason is robustness: image-log picks carry occasional large errors — a mislabelled trace, a partially imaged sinusoid, a noisy image-log section — and L2 squares those residuals, letting a few bad outliers dominate the gradient and drag the regression off the well-behaved majority. L1 weights every residual linearly, so an outlier pick is just one more example, not a gradient bomb. For parameters a geologist will read in physical units, the median-seeking behaviour of L1 is the safer default.
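The outlier argument comes down to gradients. For a residual r, the gradient of |r| has magnitude 1 regardless of size, while the gradient of r squared is 2r and grows with the error. The residuals below are hypothetical, with one deliberately bad pick, and the sketch compares how much of the total gradient magnitude that single pick claims under each loss.

```python
def l1_grad(residual):
    """Gradient of |r| with respect to r: the sign of r (0 at r == 0)."""
    return (residual > 0) - (residual < 0)


def l2_grad(residual):
    """Gradient of r ** 2 with respect to r."""
    return 2.0 * residual


def outlier_share(grad_fn, residuals):
    """Fraction of total gradient magnitude owned by the largest contributor."""
    magnitudes = [abs(grad_fn(r)) for r in residuals]
    return max(magnitudes) / sum(magnitudes)


# Hypothetical residuals: three well-behaved picks and one mislabelled trace.
residuals = [0.5, -0.3, 0.2, 20.0]

print(f"L1 outlier share: {outlier_share(l1_grad, residuals):.2f}")
print(f"L2 outlier share: {outlier_share(l2_grad, residuals):.2f}")
```

Under L1 the bad pick owns a quarter of the gradient magnitude, the same as any other example in a batch of four. Under L2 it owns about 95%, which is the "gradient bomb" behaviour described above.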
So the full objective is two losses stitched together: focal for the three-way class decision, L1 for the depth/dip/azimuth regression, with the depth, dip, and azimuth each normalised to a common scale so no single parameter monopolises the regression gradient. On this engagement that pairing — focal + L1, optimised with AdamW — beat every alternative we tried; cross-entropy plus L2 was the runner-up, not the winner.
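The effect of putting the three parameters on a common scale can be sketched as follows. The ranges, the min-max scheme, and the example values are all assumptions for illustration; the actual normalisation used in the pipeline is not described here. The sketch also ignores azimuth wrap-around at 0/360 degrees, which a real implementation would need to handle.

```python
# Hypothetical parameter ranges, chosen for illustration only.
PARAM_RANGES = {
    "depth_m": (0.0, 1.0),       # depth within a patch
    "dip_deg": (0.0, 90.0),
    "azimuth_deg": (0.0, 360.0),
}


def normalise(params):
    """Min-max scale each parameter to [0, 1] using its assumed range."""
    return {
        name: (params[name] - low) / (high - low)
        for name, (low, high) in PARAM_RANGES.items()
    }


def l1_param_loss(pred, target):
    """Mean absolute error over the normalised depth, dip, and azimuth."""
    p, t = normalise(pred), normalise(target)
    return sum(abs(p[name] - t[name]) for name in PARAM_RANGES) / len(PARAM_RANGES)


# Hypothetical prediction and label for a single sinusoid.
pred = {"depth_m": 0.52, "dip_deg": 45.0, "azimuth_deg": 180.0}
target = {"depth_m": 0.50, "dip_deg": 36.0, "azimuth_deg": 144.0}

raw_errors = {name: abs(pred[name] - target[name]) for name in PARAM_RANGES}
print("raw absolute errors:", raw_errors)
print(f"normalised L1 loss: {l1_param_loss(pred, target):.4f}")
```

In raw units the azimuth error of 36 degrees dwarfs the 0.02 m depth error, so azimuth would set the gradient almost alone. After scaling, the dip and azimuth errors are each 0.1 of their range and the depth error is 0.02, so all three contribute on comparable terms.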
Encoding the priority: a 5-to-1 weight
A two-term loss forces one more decision that teams routinely treat as an afterthought and shouldn't: how to weight the terms against each other. Here the classification loss carries a weight of 5 and the parameter loss a weight of 1. That five-to-one ratio is not a tuning artefact; it is a statement of priority encoded directly into the gradient.
The logic follows the imbalance one more time. Getting the existence and class of a sinusoid right is the load-bearing decision — if the model fails to fire on a fracture at all, no amount of precise depth regression rescues it, because there is nothing to regress. Refining a dip estimate by half a degree is valuable but strictly secondary to not missing the feature in the first place. The 5-to-1 weight tells the optimiser, in the only language it understands, that a classification error costs five times what an equivalent parameter error costs. It also stabilises early training: before the model can localise anything well, the parameter loss is large and noisy, and an unweighted sum would let that noise swamp the classification signal that has to converge first.
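The weighting itself is one line, but its effect on the balance of the gradient is worth seeing. The 5 and 1 below are the weights stated above; the individual loss values are hypothetical, picked to mimic early training where the parameter loss is still large.

```python
CLASS_WEIGHT = 5.0  # weight on the classification term, as stated above
PARAM_WEIGHT = 1.0  # weight on the depth/dip/azimuth term


def total_loss(class_loss, param_loss):
    """Weighted sum of the classification and parameter losses."""
    return CLASS_WEIGHT * class_loss + PARAM_WEIGHT * param_loss


def class_share(class_loss, param_loss, class_weight, param_weight):
    """Fraction of the combined loss contributed by the classification term."""
    weighted_class = class_weight * class_loss
    weighted_param = param_weight * param_loss
    return weighted_class / (weighted_class + weighted_param)


# Hypothetical early-training values: parameter loss large and noisy.
class_loss, param_loss = 0.4, 1.0

print(f"total loss (5:1):        {total_loss(class_loss, param_loss):.2f}")
print(f"class share, unweighted: {class_share(class_loss, param_loss, 1.0, 1.0):.2f}")
print(f"class share, 5:1:        {class_share(class_loss, param_loss, CLASS_WEIGHT, PARAM_WEIGHT):.2f}")
```

With these illustrative values the classification term supplies about 29% of an unweighted sum and about 67% of the 5-to-1 sum, which is the shift in priority the ratio is meant to encode.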
Put the three knobs together and they tell one coherent story. Focal loss handles the within-class easy/hard imbalance. The 0.5 threshold turns the recovered recall headroom into actual recall. The 5-to-1 weight encodes that detection dominates regression. Every one of those is downstream of the same fact: the empty class is everywhere and the geology is rare.
How much does the loss actually buy you?
It is fair to ask whether any of this matters next to the bigger levers — more wells, more augmentation, a better backbone. The honest answer is that the loss is a precondition for those levers paying off, not a substitute for them. Two ablation results frame it. With data augmentation switched off entirely, the classifier's error pinned at 100% — it learned nothing usable, because the raw imbalance was too severe for any loss to overcome; switch augmentation on and the error collapsed to 2.62%. Separately, on this small, imbalanced dataset a from-scratch ResNet-10 backbone posted a 0.5% classification error while a much deeper ResNet-34 blew up to 26.76%, overfitting before it could generalise.
Read those numbers the right way. They say that on imbalanced small data, restraint wins — a light backbone, heavy augmentation to manufacture the missing positives, and a loss whose gradient is engineered around the imbalance rather than fighting it. A focal+L1 loss does not paper over a lack of data; it makes the data you do have count, by refusing to let the easy empty majority set the agenda. Choose the loss for the distribution you actually have, threshold it for the error you actually fear, and weight its terms for the decision that actually dominates — and the rest of the pipeline has something to build on.
Key takeaways
- Loss design for sinusoid picking is a class-imbalance decision first: the empty no-sinusoid class swamps bedding and fracture labels — one early well held just 32 sinusoids across 236 patches, only 19 of which contained any sinusoid at all.
- Class-weighted cross-entropy is the honest baseline but blunt: a fixed per-class weight keeps easy empty patches in the gradient, and reaching acceptable precision forced a high 0.9 confidence threshold that sacrifices recall — the wrong trade when a missed fracture is the costly error.
- Focal loss down-weights confidently classified easy examples one example at a time, correcting the imbalance inside the gradient and letting the model run at a recall-focused 0.5 threshold instead of 0.9.
- The parameter term is L1, not L2: absolute-error regression is robust to the occasional large pick error that MSE would let dominate the depth/dip/azimuth gradient. Focal + L1 (with AdamW) beat cross-entropy + L2 on this engagement.
- The two terms are weighted 5 (classification) to 1 (parameters), encoding that detecting and classifying a sinusoid dominates refining its geometry — and there is no universal best loss: the right one is whichever gradient matches your task's structure, validated by ablation.