The Sinusoid-Count Skew: Why a Long-Tailed Patch-Label Distribution Wrecked Early Training

A long tail in the per-patch label distribution — not the model — stalled early training of our borehole fracture detector. Over 95% of patches carried a similar, modest sinusoid count and a thin ~5% carried far more, and that skew quietly broke the fixed query budget of a Detection Transformer until we diagnosed it as a data problem and fixed it in the data layer.

By Quamer Nasim · 10 min read
EarthScan insight

Most of the time a model that will not converge is telling you something about your model. Occasionally it is lying, and the real story is in the histogram of your labels. During a roughly twenty-month engagement with a mid-sized carbonate operator, a Middle East national oil company (NOC), we hit exactly that case while training a fracture picker based on a Detection Transformer (DETR) on borehole image logs, which are high-resolution resistivity images of the well wall. The architecture was sound, the loss was the right one, and the picks were still unstable for weeks. The culprit turned out to be a long tail in the per-patch label distribution: a tiny fraction of image patches carried far more sinusoids than the rest, and that skew was silently fighting the one part of a DETR you are least likely to suspect, the fixed query budget. This post covers why a label-distribution problem masquerades as a model problem, how the fixed-set design of a Detection Transformer makes you uniquely exposed to it, and what the fix actually was.

How borehole imagery becomes a set-prediction dataset

Start with the data engineering, because the skew is manufactured there. A single image log is large — on the order of 1.5 GB per well — and a fracture or bedding plane on it is a sinusoid: unroll the cylindrical borehole wall into a flat strip and a planar feature traces a sine wave whose amplitude and phase encode dip and azimuth. You cannot feed a kilometres-long strip to a transformer, so the geomatics pipeline tiles the strip into overlapping patches. We settled on a patch height of 800 pixels — about 2.2 m of borehole — because over 95% of fracture and bedding sinusoids fit inside that window; the rare giant fracture runs up to roughly 9 m (3,200 px) and a handful spill across patch boundaries, but 2.2 m captures the overwhelming majority cleanly. Overlapping the patches with a short vertical stride means the same sinusoid can appear, fully or partially, in several adjacent patches.
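A minimal sketch of the geometry and the tiling described above. The scale comes from the post (800 px is about 2.2 m); the borehole radius, the stride, and the function names are assumptions made for illustration, and azimuth and sign conventions differ between logging tools.

```python
import math

PX_PER_M = 800 / 2.2  # scale implied by the post: 800 px is about 2.2 m


def sinusoid_depth(z0_m, dip_deg, azimuth_deg, phi_deg, radius_m=0.108):
    """Depth (m, increasing downward) where a plane crosses the wall at angle phi.

    radius_m is an assumed borehole radius, not a figure from the post.
    The deepest point of the trace sits at the dip azimuth.
    """
    amplitude = radius_m * math.tan(math.radians(dip_deg))
    return z0_m + amplitude * math.cos(math.radians(phi_deg - azimuth_deg))


def tile_starts(total_px, patch_px=800, stride_px=200):
    """Top-row pixel index of each overlapping patch (stride_px is assumed)."""
    if total_px <= patch_px:
        return [0]
    starts = list(range(0, total_px - patch_px + 1, stride_px))
    if starts[-1] != total_px - patch_px:
        starts.append(total_px - patch_px)  # final patch flush with the bottom
    return starts


giant_m = 3200 / PX_PER_M          # the 3,200 px giant fracture, in metres
starts = tile_starts(2100, patch_px=800, stride_px=400)
```

With this scale, 3,200 px works out to 8.8 m, consistent with the "roughly 9 m" quoted for the rare giant fracture. A stride smaller than the patch height is what lets one sinusoid land in several adjacent patches.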

Each patch then becomes one training example whose label is a set of sinusoids — zero, one, a few, occasionally many — each parameterised by depth, dip, and azimuth. That is the set-prediction framing a DETR is built for, and it is also where the trap is laid. The number of objects per example is not a fixed quantity you control; it is an emergent property of how fractured the rock is at that depth and how your tiling interacts with it. You are, in effect, sampling a count distribution every time you cut a patch.
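To make the "label is a set" framing concrete, here is a hedged sketch of how one patch might collect its sinusoids. The `Sinusoid` record and `label_patch` helper are hypothetical names; `amplitude_px` is an added field used only to decide whether a trace sits fully inside the patch or is clipped by its boundary.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sinusoid:
    depth_px: float      # centre line of the trace, in strip pixels
    amplitude_px: float  # half the vertical extent of the trace (assumed field)
    dip_deg: float
    azimuth_deg: float


def label_patch(sinusoids, top_px, patch_px=800):
    """Split the sinusoids that touch one patch into (full, partial) lists."""
    bottom_px = top_px + patch_px
    full, partial = [], []
    for s in sinusoids:
        lo = s.depth_px - s.amplitude_px
        hi = s.depth_px + s.amplitude_px
        if hi < top_px or lo > bottom_px:
            continue  # no overlap with this patch
        inside = lo >= top_px and hi <= bottom_px
        (full if inside else partial).append(s)
    return full, partial


strip = [
    Sinusoid(400.0, 100.0, 30.0, 90.0),
    Sinusoid(790.0, 50.0, 55.0, 200.0),
    Sinusoid(2000.0, 10.0, 5.0, 10.0),
]
full_a, partial_a = label_patch(strip, top_px=0)
full_b, partial_b = label_patch(strip, top_px=600)
```

The second sinusoid is clipped by the patch starting at 0 and whole in the patch starting at 600, which is how the same feature ends up counted differently depending on where the cut falls.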

The skew, in numbers

When we plotted the count of sinusoids per patch across the dataset, the distribution was sharply unimodal with a long right tail. Over 95% of patches sat at roughly the same modest sinusoid count, and only about 5% carried meaningfully more — the densely fractured intervals where conjugate sets, breakouts, and crossing features pile up. In one reservoir interval we examined closely the imbalance was even starker at the other end of the scale: the section held only 32 sinusoids spread across 236 patches, and just 19 of those 236 patches contained any sinusoid at all. So the dataset had two compounding skews at once — most patches were empty or near-empty, and among the non-empty ones a thin minority were extraordinarily crowded.
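A small profiling sketch for the per-patch count distribution. The totals match the reservoir interval above (236 patches, 19 non-empty, 32 sinusoids), but the way the 32 sinusoids are split across the 19 patches is invented for illustration, and it assumes each sinusoid is counted in one patch only.

```python
from collections import Counter


def count_profile(counts):
    """Summarise per-patch sinusoid counts: empties, mean, maximum, histogram."""
    n = len(counts)
    non_empty = sum(1 for c in counts if c > 0)
    return {
        "patches": n,
        "non_empty": non_empty,
        "non_empty_frac": non_empty / n,
        "mean": sum(counts) / n,
        "max": max(counts),
        "histogram": dict(sorted(Counter(counts).items())),
    }


# Hypothetical split: 217 empty patches, 13 with one, 4 with two, 1 with three, 1 with eight.
counts = [0] * 217 + [1] * 13 + [2] * 4 + [3] + [8]
profile = count_profile(counts)
print(profile)
```

The mean here is about 0.14 sinusoids per patch, which says almost nothing about the one patch carrying eight. The histogram and the maximum are the parts worth reading.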

The instinct is to treat this as ordinary foreground-background class imbalance, the kind focal loss already handles inside the matching cost. It is not. The dangerous skew here is not the ratio of object pixels to background; it is the distribution of object counts per example, and that distribution interacts with a structural constant of the architecture that nothing in your loss function protects.

Why a fixed query budget is uniquely exposed

A Detection Transformer does not predict a variable number of objects. Its decoder is handed a fixed number of learned object queries — call it the query budget — and it always emits exactly that many candidate detections per patch, regardless of how many sinusoids are actually present. Bipartite Hungarian matching then assigns each ground-truth sinusoid to one query and trains the surplus queries to confidently predict "no object." That fixed-set design is precisely what lets a DETR drop anchors and non-maximum suppression, and it is wonderful — until the count distribution of your labels has a long tail.
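The matching step can be sketched on a toy patch. This is not the production matcher: it brute-forces the assignment with `itertools.permutations` as a stand-in for a Hungarian solver (a real pipeline would typically call something like `scipy.optimize.linear_sum_assignment`), and the cost is a simplified L1 distance over (depth, dip, azimuth) that ignores class probability and azimuth wrap-around.

```python
from itertools import permutations


def l1_cost(pred, target):
    """Simplified matching cost: L1 distance over (depth, dip, azimuth)."""
    return sum(abs(a - b) for a, b in zip(pred, target))


def match(preds, targets):
    """Minimum-cost one-to-one matching by brute force (tiny inputs only).

    Returns (pairs, no_object_queries, unmatched_targets); pairs holds
    (query_index, target_index) tuples.
    """
    q, t = len(preds), len(targets)
    best, best_cost = [], float("inf")
    if q >= t:  # usual case: more queries than targets
        candidates = ([(qi, ti) for ti, qi in enumerate(p)]
                      for p in permutations(range(q), t))
    else:       # tail case: more targets than queries
        candidates = ([(qi, ti) for qi, ti in enumerate(p)]
                      for p in permutations(range(t), q))
    for pairs in candidates:
        cost = sum(l1_cost(preds[qi], targets[ti]) for qi, ti in pairs)
        if cost < best_cost:
            best, best_cost = pairs, cost
    used_q = {qi for qi, _ in best}
    used_t = {ti for _, ti in best}
    return (sorted(best),
            [i for i in range(q) if i not in used_q],
            [j for j in range(t) if j not in used_t])


queries = [(10.0, 30.0, 90.0), (50.0, 10.0, 200.0), (80.0, 60.0, 20.0), (0.0, 0.0, 0.0)]
truth = [(52.0, 12.0, 190.0), (9.0, 31.0, 95.0)]
pairs, no_object, unmatched = match(queries, truth)
```

With four queries and two ground-truth sinusoids, two queries are matched and the other two are supervised as "no object". Calling `match(queries[:1], truth)` shows the opposite case: one query, two targets, and one real sinusoid left with no query to claim it.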

Here is the failure mode in mechanical terms. You size the query budget for the bulk of your data — comfortably above the typical patch's sinusoid count, so the great majority of patches have far more queries than objects and the no-object term dominates. Then the tail arrives. On a densely fractured patch the true sinusoid count approaches or exceeds the budget, and three bad things happen at once. First, the matcher has too few queries to claim every real sinusoid, so genuine fractures in exactly the most geologically interesting intervals go unmatched and unlearned — a ceiling no amount of training removes. Second, those rare high-count patches produce enormous, jagged gradients because nearly every query is "on the hook" for a real target, which is the opposite of the sparse-supervision regime the rest of the dataset trains in. Third, the optimiser sees these two regimes alternate batch to batch — a long run of near-empty patches teaching queries to say nothing, then a tail patch demanding they all fire — and the no-object classifier thrashes between the two. The visible symptom was unstable picks and a classification signal that would not settle. The actual cause was a label-count distribution whose tail the fixed query budget could not absorb.
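The alternation between the two regimes is easy to see with a little arithmetic. The sketch below assumes a query budget of 20 and a made-up sequence of per-patch counts; neither number comes from the engagement.

```python
def query_regime(n_targets, budget):
    """How one patch splits a fixed query budget between targets and no-object."""
    matched = min(n_targets, budget)
    return {
        "matched": matched,
        "no_object": budget - matched,
        "unmatched_targets": max(0, n_targets - budget),
        "supervised_frac": matched / budget,
    }


BUDGET = 20                      # hypothetical query budget
sequence = [0, 1, 0, 2, 0, 23]   # hypothetical counts: a run of sparse patches, then a tail patch
regimes = [query_regime(c, BUDGET) for c in sequence]
for count, r in zip(sequence, regimes):
    print(count, r)
```

For the first five patches at most 10% of queries carry a real target. On the last patch every query does, and three sinusoids still go unmatched, which is the ceiling described above.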

This is why the problem disguises itself. Every knob you reach for first — learning rate, the class-versus-parameter loss weighting, the focal-loss gamma, the backbone — is a model knob, and none of them addresses a count distribution. You can tune them for a week and move nothing, because the model is faithfully fitting a dataset whose tail is structurally incompatible with the prediction set you gave it.

A benchmark dominated by the common case hides the case that breaks you

There is a more general lesson here that is worth pulling out, because it recurs across applied-AI work far beyond borehole logs. When over 95% of your examples look the same, every aggregate you compute is a report on the common case. Average loss, mean count, overall accuracy — all of them are dominated by the bulk and all of them stay quiet while the thin tail does the damage. The instrument below makes the dynamic concrete in a related setting: a benchmark whose score is dominated by the common, easy category looks healthy right up until you re-weight it toward what production actually depends on, at which point the hidden failure surfaces. Read it as an analogy for the sinusoid-count skew — weight your evaluation by the dense, high-count patches that production geoscience cares about most, not by the empty ones that pad your averages, and the problem stops being invisible.

[Interactive figure: "Weight the benchmark by what your job needs". Benchmark weighting, consumer fluency → production trust. Weighted by consumer fluency, WellBot scores 94% and Copilot 34%, a +59 pp WellBot lead.]
Per-category scores (WellBot · Copilot) and current weight:
  Conceptual knowledge: 9/10 · 5/10, weight 63%
  Data-grounded queries: 4/4 · 0/4, weight 13%
  Hallucination resistance: 6/6 · 0/6, weight 13%
  Safety & audit trail: 4/4 · 1/4, weight 13%
As-tested (equal points): 95.8% vs 25.0%. Weight by production trust and Copilot falls further. Per-category scores and the as-tested 25.0% / 95.8% are per the whitepaper; the consumer→production weighting is an illustrative lens.
Why 25% is generous. The 24-task benchmark has four categories; a general LLM scores mainly on conceptual knowledge (5/10), picks up a single point on safety and audit trail (1/4), and posts zero on data-grounded queries (0/4) and hallucination resistance (0/6), the categories production cares about most. Slide the weighting from consumer fluency to production trust and Copilot's weighted score collapses (~34% → ~10%) while the domain-native WellBot holds ~94% → ~99%. The as-tested equal-points benchmark sits in between at 95.8% vs 25.0%. Per-category scores are the whitepaper's own; the weighting is an illustrative lens.
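The arithmetic behind those headline numbers can be checked in a few lines. The per-category scores are the ones quoted above; the exact weights (62.5% and 12.5%) are an assumption, chosen because they round to the displayed 63% and 13% and sum to 100%. The production-trust weights are not given, so they are not reproduced here.

```python
# category: (WellBot correct, Copilot correct, tasks in category)
SCORES = {
    "Conceptual knowledge": (9, 5, 10),
    "Data-grounded queries": (4, 0, 4),
    "Hallucination resistance": (6, 0, 6),
    "Safety & audit trail": (4, 1, 4),
}
WELLBOT, COPILOT, TASKS = 0, 1, 2

# Assumed exact weights behind the rounded 63% / 13% / 13% / 13% display.
CONSUMER_WEIGHTS = {
    "Conceptual knowledge": 0.625,
    "Data-grounded queries": 0.125,
    "Hallucination resistance": 0.125,
    "Safety & audit trail": 0.125,
}


def equal_points(system):
    """Score when every task counts the same, regardless of category."""
    total_tasks = sum(row[TASKS] for row in SCORES.values())
    return sum(row[system] for row in SCORES.values()) / total_tasks


def weighted(system, weights):
    """Score when each category's pass rate is scaled by its weight."""
    return sum(weights[cat] * row[system] / row[TASKS] for cat, row in SCORES.items())


print(f"equal points: {equal_points(WELLBOT):.1%} vs {equal_points(COPILOT):.1%}")
print(f"consumer weighting: {weighted(WELLBOT, CONSUMER_WEIGHTS):.1%} "
      f"vs {weighted(COPILOT, CONSUMER_WEIGHTS):.1%}")
```

Under these assumptions the equal-points scores come out at 23/24 and 6/24, and the consumer-fluency weighting gives roughly 94% against 34%. Moving weight off the one category where the weaker system scores is what drives its number down.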

The discipline this teaches is to stop looking at the mean and start looking at the distribution — and specifically at the tail's relationship to your architecture's hard limits. For a fixed-set predictor that limit is the query budget, and the right diagnostic is not "what is the average sinusoid count" but "what fraction of patches approach or exceed the budget, and what is the model's behaviour on exactly those."
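That diagnostic is a one-function job. The sketch below is one possible form of it; the `headroom` threshold of 0.8, the budget of 20, and the sample counts are all illustrative assumptions.

```python
def tail_exposure(counts, budget, headroom=0.8):
    """Share of patches near or over the query budget, and targets lost to the cap.

    A patch is "near" when headroom * budget <= count <= budget.
    """
    n = len(counts)
    near = sum(1 for c in counts if headroom * budget <= c <= budget)
    over = sum(1 for c in counts if c > budget)
    lost = sum(c - budget for c in counts if c > budget)
    total = sum(counts)
    return {
        "near_budget_frac": near / n,
        "over_budget_frac": over / n,
        "lost_target_frac": lost / total if total else 0.0,
    }


# Hypothetical dataset: 95 ordinary patches and a five-patch tail.
counts = [2] * 95 + [18, 19, 20, 25, 30]
report = tail_exposure(counts, budget=20)
print(report)
```

In this toy set the mean count is about 3, comfortably under the budget, yet 5% of patches sit at or beyond the edge of it and about 5% of all sinusoids can never be matched. Evaluating the model on the near and over patches separately is the natural next step.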

The fix lived in the data layer, not the model

Once we named it as a per-patch label-distribution problem, the remedies were data-engineering moves, not architecture changes. Two were decisive.

The first was to handle the tail explicitly rather than let it poison every batch. Because the overlapping-tile generator was producing the extreme high-count patches partly as an artefact of how a few giant or boundary-crossing sinusoids stacked up, we curated the patch set so the count distribution the model trained on stayed inside the budget the architecture could honour — dropping or re-cutting the pathological high-count patches instead of forcing the matcher to fail on them. An ablation on partial-versus-full sinusoids confirmed the principle: cleaning up the patches that carried only fragmentary, boundary-clipped sinusoids cut the Hungarian matching loss from 0.021 to 0.014 and the classification error from 3.63% to 2.77%. The signal was the same each time — the matcher converges when the per-patch label set is well-formed and degrades when it is not.
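A curation rule in this spirit might look like the sketch below. The exact rule used on the project is not spelled out above, so this is one plausible reading: reject a patch when its sinusoid count exceeds the query budget, or when the only sinusoids it carries are boundary-clipped fragments. The dictionary keys and the budget value are hypothetical.

```python
def curate(patches, budget):
    """Split patches into (kept, rejected) under a simple, assumed curation rule.

    Each patch is a dict with hypothetical keys "full" and "partial", the
    counts of whole and boundary-clipped sinusoids. Rejected patches are
    returned so they can be re-cut rather than silently discarded.
    """
    kept, rejected = [], []
    for p in patches:
        over_budget = p["full"] + p["partial"] > budget
        fragments_only = p["partial"] > 0 and p["full"] == 0
        (rejected if over_budget or fragments_only else kept).append(p)
    return kept, rejected


patches = [
    {"id": 0, "full": 2, "partial": 0},
    {"id": 1, "full": 0, "partial": 1},   # fragments only
    {"id": 2, "full": 25, "partial": 1},  # over a budget of 20
    {"id": 3, "full": 0, "partial": 0},   # empty, well-formed
    {"id": 4, "full": 3, "partial": 1},
]
kept, rejected = curate(patches, budget=20)
```

Returning the rejected list keeps the decision auditable: the densely fractured intervals are the geologically interesting ones, so re-cutting them at a different offset is preferable to dropping them when that is possible.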

The second move attacked the other half of the skew: the overwhelming majority of empty and near-empty patches that gave the model almost nothing to match against. We augmented the sinusoid-bearing patches aggressively, re-cutting at a 100×270 patch size with a vertical stride of 20 and generating ten augmentations per sinusoid patch using a stack of image transforms (ColorJitter, Gaussian blur, sharpening, Gaussian noise, emboss, median blur). On the imbalanced reservoir interval this turned 236 patches into 4,212, the 19 sinusoid-bearing patches into 2,046, and the 32 individual sinusoids into roughly 3,565. That is about 18 times as many patches and more than 100 times as many sinusoid-bearing patches and sinusoid instances, which concentrated the training signal on exactly the examples the matcher needed and rebalanced the count distribution toward the body the query budget was sized for. Note the engineering discipline: augment the minority (sinusoid-bearing) patches, not the empties, so you raise the floor of useful supervision without inflating the no-object regime that was already overrepresented.
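The policy itself, augment only the sinusoid-bearing patches, can be sketched without any imaging library. The two transforms below are crude stand-ins written for this example, not the ColorJitter, blur, or emboss operations named above, and the patch representation (a flat list of pixel values in [0, 1]) is an assumption. Because these transforms change pixel values and not geometry, the sketch reuses each patch's labels unchanged.

```python
import random


def gaussian_noise(pixels, rng, sigma=0.02):
    """Add zero-mean Gaussian noise and clamp to [0, 1]."""
    return [min(1.0, max(0.0, v + rng.gauss(0.0, sigma))) for v in pixels]


def brightness_shift(pixels, rng, max_shift=0.1):
    """Shift every pixel by one random offset (a rough brightness jitter)."""
    shift = rng.uniform(-max_shift, max_shift)
    return [min(1.0, max(0.0, v + shift)) for v in pixels]


def augment_minority(patches, n_aug, transforms, seed=0):
    """Return the patches plus n_aug augmented copies of each labelled patch.

    patches: list of (pixels, labels). Empty patches pass through once and
    are never augmented, so the no-object regime is not inflated.
    """
    rng = random.Random(seed)
    out = []
    for pixels, labels in patches:
        out.append((pixels, labels))
        if not labels:
            continue
        for _ in range(n_aug):
            transform = rng.choice(transforms)
            out.append((transform(pixels, rng), labels))
    return out


patches = [([0.5, 0.5, 0.5, 0.5], ["sinusoid_a"]), ([0.2, 0.2, 0.2, 0.2], [])]
augmented = augment_minority(patches, n_aug=10, transforms=[gaussian_noise, brightness_shift])
```

With ten augmentations, each sinusoid-bearing patch contributes eleven training examples while each empty patch still contributes one, which shifts the balance toward useful supervision.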

Neither fix touched the decoder, the query count, the loss, or the backbone. The model had been right all along. The thing that was wrong was the shape of the label distribution we were asking a fixed-set predictor to fit, and the place to repair it was upstream in the data pipeline — the patch generator, the curation rule, and the augmentation policy — which is exactly where a count-distribution problem should be solved.

What to take from this

The portable lesson is a habit, not a trick. When a set-prediction model stalls, profile the per-example object-count distribution before you touch a single hyperparameter, and check the tail against your architecture's fixed prediction budget. If the tail breaches the budget, no model knob will save you; you have a data-layer problem wearing a model-layer costume. Across the engagements our team has run — carbonate operators in the Middle East and the United States among them — the cleanest accuracy gains have repeatedly come not from a fancier decoder but from the data engineering: making the label distribution match what the architecture can actually represent, which is where that work belongs.

Key takeaways

  1. A borehole image log is tiled into ~800 px (2.2 m) patches; >95% of fracture and bedding sinusoids fit, but each patch's label is a variable-count SET of sinusoids — so the per-patch object-count distribution becomes a dataset property you do not directly control.
  2. The skew was a long tail: over 95% of patches held a modest, similar sinusoid count and only ~5% held far more. One reservoir interval was starker still — 32 sinusoids across 236 patches, with just 19 patches containing any sinusoid at all.
  3. A Detection Transformer predicts a FIXED number of object queries per patch. When the tail's sinusoid count approaches that query budget, the matcher runs out of queries on exactly the densely fractured intervals, gradients spike, and the no-object classifier thrashes — producing unstable picks that look like a model bug but are a label-distribution bug.
  4. Averages hide the tail. Diagnose by the fraction of patches that approach or exceed the query budget, not by the mean count — weight your evaluation toward the dense patches production cares about, the way a re-weighted benchmark exposes a hidden failure.
  5. The fix lived in the data layer: curate/drop pathological high-count patches (partial-vs-full ablation cut Hungarian loss 0.021→0.014 and class error 3.63%→2.77%) and augment the minority sinusoid-bearing patches (236→4,212 patches, 19→2,046 sinusoid-patches, 32→~3,565 sinusoids; >10x). No decoder, query-count, loss, or backbone change was needed.

© 2026 Copyright. Earthscan