Anchor-Free, Mask-Free: Adapting DETR Object Queries to Sinusoid Picking

Most computer-vision teams that meet a borehole image log for the first time reach instinctively for a segmentation network. The fracture you want to pick is a thin sinusoidal trace running across a high-resolution image strip, whichever microresistivity imaging tool acquired it, so the obvious move is to label every pixel of that trace, train a U-Net, and post-process the predicted mask back into a sine wave. It is the wrong move, and one of the more useful design decisions in our work with a mid-sized Middle East carbonate operator was to throw the mask out entirely. What replaced it was a Detection Transformer whose object queries were re-pointed from their original job, regressing bounding boxes on COCO photographs, to one they were never designed for: regressing the depth, dip, and azimuth of a geological sinusoid. This piece is about that substitution: what an object query actually is, why it transfers cleanly from boxes to sinusoids, and what you have to change in the prediction head, the loss, and the data pipeline to make the transfer land.

Why masks are the wrong representation

A segmentation mask is a per-pixel commitment. To use one for fracture picking you have to label the sinusoid trace pixel by pixel, train the network to reproduce that binary stencil, and then run a separate curve-fitting stage — path opening, Hough-style voting, or a least-squares sine fit — to recover the three numbers a petrophysicist actually reports: where the fracture sits in depth, how steeply it dips, and which way it faces. Every one of those numbers is reconstructed after the network, from a noisy mask, by hand-tuned geometry code. The deep-learning model never optimises the quantities you care about; it optimises a pixel stencil and hopes the downstream fit survives.

That indirection costs you in three concrete ways. The mask is expensive to annotate. The curve-fit is brittle exactly where sinusoids cross or fade, which is to say in the highly fractured intervals that matter most. And the metric you train against (pixel IoU) has almost nothing to do with the metric you are judged on (depth/dip/azimuth accuracy in physical units). We wanted the network to emit the geology directly, as three regressed parameters per fracture, with no mask and no post-hoc fitting in the loop at all.
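To make the three-number description concrete, here is a minimal Python sketch of how a location, dip, and azimuth could define the trace on an unrolled image. It rests on assumptions the article does not state: the patch width spans the full 360 degrees of the borehole wall, pixels are square, row numbers grow with depth, and the deepest point of the sinusoid sits at the dip azimuth. The function name `sinusoid_trace` and its arguments are hypothetical.

```python
import math


def sinusoid_trace(location_px, dip_deg, azimuth_deg, width_px=270):
    """Row position of a planar fracture trace at every image column.

    Assumed convention (not from the article): columns map linearly onto
    0..360 degrees, so the borehole radius in pixels is width / (2*pi),
    and the trace amplitude is radius * tan(dip).
    """
    radius_px = width_px / (2.0 * math.pi)
    amplitude = radius_px * math.tan(math.radians(dip_deg))
    azimuth = math.radians(azimuth_deg)
    rows = []
    for col in range(width_px):
        theta = 2.0 * math.pi * col / width_px
        # Deepest point (largest row) falls where theta equals the azimuth.
        rows.append(location_px + amplitude * math.cos(theta - azimuth))
    return rows


flat = sinusoid_trace(50.0, 0.0, 0.0)        # zero dip: a horizontal line
steep = sinusoid_trace(50.0, 45.0, 180.0)    # trough at the 180-degree column
```

Under these assumptions the whole curve is fixed by three scalars, which is why a head that regresses them directly needs no curve-fitting stage afterwards.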

What an object query really is

The Detection Transformer (DETR) gives you exactly the right primitive for that, once you stop thinking of it as a box detector. Strip away the COCO-specific framing and DETR is a general set-prediction machine: a convolutional backbone and a transformer encoder turn the image into a field of context-rich features, and a transformer decoder is handed a fixed number of learned object queries that each attend over the whole encoded image and come out the other side as one candidate detection. Nothing in that machinery is intrinsically about boxes. A query is a slot that says "go find me one object and describe it." What "describe it" means is defined entirely by the prediction head bolted onto the decoder output — and the head is the only part that knows about boxes.

So the adaptation is structural, not cosmetic. The convolutional backbone (we swept ResNet-18, ResNet-50, and EfficientNet-B2 as candidates), the transformer encoder, the decoder, the object queries, the positional encodings: all of that keeps the structure DETR gave it, and only the sizes were tuned. The published deployable configuration uses four encoder layers and four decoder layers, a feed-forward dimension of 1024, sinusoidal positional embeddings, and multi-head attention, which is shallower and narrower than the six-and-six, FFN-2048 defaults of the original DETR paper. The transferable backbone is the load-bearing idea here: the heavy, expensive, geology-agnostic feature stack is reused wholesale, and only the small head at the very end is re-specified for the new task.

Swapping the head: from boxes to depth, dip, azimuth

In stock DETR each decoder output passes through two tiny heads: a linear classifier (object class versus the "no-object" null) and a three-layer MLP that regresses four box coordinates. We kept the classifier — a fracture query still has to declare fracture versus nothing here — and we replaced the box MLP with a parameter head that emits three numbers per query: location, dip, and azimuth. In the production model the regression head is two linear layers and the classification head is one; deliberately small, because the representation work is already done by the shared encoder.

The substitution only works if the three target numbers are conditioned for a neural network, and this is the detail that quietly carries the whole approach. Raw dip lives on [0, 90] degrees, azimuth on [0, 360], and location on the patch's pixel height. Feeding a regression head three quantities on wildly different scales lets the largest-range parameter dominate the gradient. The fix is a clean per-channel normalisation: location, dip, and azimuth are each scaled to the unit interval by dividing by the patch height (100), 90, and 360 respectively, so all three regression targets live on [0, 1] and contribute comparably to the L1 parameter loss. At inference the head's outputs are simply multiplied back out to physical units. There is no anchor grid to define, no aspect ratios to tune, no mask to threshold — the model reports the geology, normalised in and de-normalised out.
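The scaling described above is small enough to write out in full. The sketch below uses the divisors given in the text (patch height 100, 90, and 360); the function names and the tuple layout are illustrative assumptions, not the project's actual code.

```python
PATCH_HEIGHT_PX = 100.0
# Divisors for (location, dip, azimuth), as stated in the text.
SCALES = (PATCH_HEIGHT_PX, 90.0, 360.0)


def normalise(location_px, dip_deg, azimuth_deg):
    """Map physical targets onto the unit interval for the L1 loss."""
    raw = (location_px, dip_deg, azimuth_deg)
    return tuple(value / scale for value, scale in zip(raw, SCALES))


def denormalise(prediction):
    """Map a head output in [0, 1] back to pixels and degrees."""
    return tuple(value * scale for value, scale in zip(prediction, SCALES))


target = normalise(50.0, 45.0, 180.0)    # (0.5, 0.5, 0.5)
physical = denormalise((0.25, 0.5, 0.75))  # (25.0, 45.0, 270.0)
```

Because every channel shares the same [0, 1] range, an error of 0.1 means a tenth of the full span in each case: 10 pixels, 9 degrees of dip, or 36 degrees of azimuth.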

Because the prediction set is fixed-size but the number of fractures per patch is not — three in a shattered interval, zero in a clean one — the classification head also has to learn when to predict nothing. That is the role of the no-object class and the bipartite matching loss behind it: before any gradient is computed, the Hungarian algorithm finds the optimal one-to-one pairing between the model's queries and the real sinusoids, matched queries are graded on classification plus L1 parameter loss, and every surplus query is trained to confidently predict no-object. In the production loss the classification term is a focal loss carrying a weight of 5 against the parameter weight of 1 — the empty class overwhelmingly dominates, which is precisely the foreground-background imbalance focal loss exists to handle — and at inference we keep only the queries whose object probability clears 0.5. That one-to-one guarantee is what lets us delete NMS outright: each fracture is claimed by exactly one query, so there are no duplicates to suppress.
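A toy version of the matching step and the focal classification term can be sketched in plain Python. Several things here are assumptions rather than the production code: the search is brute force over permutations instead of the Hungarian algorithm (same optimum, only practical for a handful of queries; larger problems typically use a solver such as `scipy.optimize.linear_sum_assignment`), the matching cost reuses the 5:1 loss weights, the focal parameters `gamma=2.0` and `alpha=0.25` are common defaults not given in the text, and there are at least as many queries as targets.

```python
import math
from itertools import permutations

CLASS_WEIGHT, PARAM_WEIGHT = 5.0, 1.0  # loss weights from the text


def pair_cost(query, target):
    """Cost of assigning one query (prob, params) to one target params."""
    prob, params = query
    l1 = sum(abs(p - t) for p, t in zip(params, target))
    return -CLASS_WEIGHT * prob + PARAM_WEIGHT * l1


def match(queries, targets):
    """Optimal one-to-one assignment by exhaustive search (toy sizes only)."""
    best_cost, best = float("inf"), ()
    for chosen in permutations(range(len(queries)), len(targets)):
        cost = sum(pair_cost(queries[q], targets[t]) for t, q in enumerate(chosen))
        if cost < best_cost:
            best_cost, best = cost, chosen
    matched = {q: t for t, q in enumerate(best)}       # query index -> target index
    unmatched = [q for q in range(len(queries)) if q not in matched]
    return matched, unmatched


def focal_loss(prob_fracture, is_fracture, gamma=2.0, alpha=0.25):
    """Binary focal loss; gamma and alpha are assumed defaults."""
    p_t = prob_fracture if is_fracture else 1.0 - prob_fracture
    alpha_t = alpha if is_fracture else 1.0 - alpha
    p_t = min(max(p_t, 1e-7), 1.0)                      # guard against log(0)
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


# Three queries (object probability, normalised params), two real sinusoids.
queries = [(0.9, (0.2, 0.5, 0.5)), (0.1, (0.9, 0.9, 0.9)), (0.8, (0.7, 0.1, 0.25))]
targets = [(0.7, 0.1, 0.25), (0.2, 0.5, 0.5)]
matched, unmatched = match(queries, targets)
kept = [q for q, (prob, _) in enumerate(queries) if prob > 0.5]  # inference filter
```

In this example queries 0 and 2 each claim one sinusoid and query 1 is left over, so it is the one pushed towards the no-object class. The 0.5 filter at inference keeps the same two queries, with nothing left for NMS to remove.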

The data and training engineering underneath

Re-pointing the head is the elegant part; making it converge on real borehole data was the engineering part. The input pipeline tiles each well's image log into overlapping patches — in our configuration a patch is 100 pixels tall by 270 wide, generated with vertical and horizontal strides so a sinusoid is never cut at a patch edge without appearing whole in a neighbour. Each patch ships with a compact JSON ground-truth record carrying the location, dip, azimuth, depth, and type of every sinusoid it contains, which is the supervision the matcher consumes.
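As a rough illustration of the tiling and the per-patch record, the sketch below computes overlapping window origins and serialises one ground-truth entry. The 100 by 270 patch size comes from the text; the stride values, the helper names, the JSON field layout, and every number inside the example record are made up for illustration.

```python
import json


def tile_starts(length, patch, stride):
    """Start offsets of overlapping windows that cover [0, length)."""
    if length <= patch:
        return [0]
    starts = list(range(0, length - patch + 1, stride))
    if starts[-1] != length - patch:
        starts.append(length - patch)   # final window sits flush with the edge
    return starts


def patch_origins(image_h, image_w, patch_h=100, patch_w=270,
                  stride_v=50, stride_h=135):
    """(top, left) corner of every patch; strides here are assumed values."""
    return [(top, left)
            for top in tile_starts(image_h, patch_h, stride_v)
            for left in tile_starts(image_w, patch_w, stride_h)]


origins = patch_origins(image_h=230, image_w=360)

# Hypothetical record: field names follow the quantities listed in the text.
record = {
    "patch": {"top": 50, "left": 0, "height": 100, "width": 270},
    "sinusoids": [
        {"location": 42, "dip": 35.0, "azimuth": 120.0,
         "depth": 1523.4, "type": "fracture"},
    ],
}
encoded = json.dumps(record)
```

With a stride of half the patch size, any sinusoid clipped by one window edge lies further from the edge in the neighbouring window, which is the property the overlap is there to provide.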

The first training runs failed in the most diagnostic way possible: the network overfit instantly and emitted a constant output regardless of input. The fix list was a clinic in small-data set-prediction hygiene — collapse to a single-channel input, shrink the encoder/decoder and the backbone rather than enlarge them, augment only the patches that actually contain sinusoids so the matcher sees varied positive examples instead of an ocean of empty crops, and re-derive dip and azimuth to be independent of well deviation before mixing data from multiple wells. That targeted-augmentation step is visible directly in the dataset ledger: an early iteration that augmented only the sinusoidal sections produced 4,212 training patches, of which 2,046 contained sinusoids, for 3,564 labelled sinusoids in total — a deliberately fracture-dense set rather than a representative-but-starved one, because a matching loss can only learn to assign what it is repeatedly shown.
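The selective-augmentation rule can be sketched as a filter over the patch list. The article does not say which transforms were used, so the intensity jitter below is only a stand-in that leaves labels untouched; the dictionary layout, function names, and the `copies` parameter are likewise assumptions.

```python
import random


def jitter_intensity(pixels, rng, max_shift=0.1):
    """Stand-in augmentation: shift all pixel values by one random offset."""
    shift = rng.uniform(-max_shift, max_shift)
    return [[min(1.0, max(0.0, value + shift)) for value in row] for row in pixels]


def augment_positive_patches(patches, copies=2, seed=0):
    """Keep every patch, but add augmented copies only where sinusoids exist."""
    rng = random.Random(seed)
    out = []
    for patch in patches:
        out.append(patch)
        if patch["sinusoids"]:              # empty crops are never multiplied
            for _ in range(copies):
                out.append({
                    "pixels": jitter_intensity(patch["pixels"], rng),
                    "sinusoids": patch["sinusoids"],
                })
    return out


patches = [
    {"pixels": [[0.2, 0.4], [0.6, 0.8]], "sinusoids": [(0.5, 0.4, 0.3)]},
    {"pixels": [[0.1, 0.1], [0.1, 0.1]], "sinusoids": []},
]
augmented = augment_positive_patches(patches)
```

The effect is to raise the share of positive patches without inventing new empty ones, which is the fracture-dense skew the dataset figures above reflect.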

On top of that sat a proper MLOps and hyperparameter-search layer rather than a notebook and a prayer. Data and model artifacts were versioned with DVC against a GitHub source tree (a standard datasets/, models/ layout with detr.py, matcher.py, transformer.py, backbones.py, and position_encoding.py). The architecture was tuned with a sweep that sampled the continuous ranges log-uniformly (learning rate over 1e-4 to 1e-1, backbone learning rate over 1e-6 to 1e-4) and chose among discrete options for the rest: encoder/decoder depth from three to six layers, feed-forward dimension across 512/1024/2048, hidden dimension across 128/256/512, attention heads across 4/8/16, and the positional embedding toggled between sine and learned. Every run was optimised with AdamW and tracked end-to-end in Weights & Biases. The search converged on the four-and-four, FFN-1024 configuration above; the published deployable model trained with AdamW at a learning rate around 4e-4. None of those numbers was guessed at a whiteboard: they fell out of an instrumented sweep, which is the only way a small-data set-prediction model gets tuned honestly.
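For readers unfamiliar with log-uniform sampling, the sketch below draws one candidate configuration from the search space listed above using only the standard library. The ranges and option lists come from the text; the dictionary keys, the independent sampling of encoder and decoder depth, and the use of `random` in place of the actual sweep tooling are assumptions.

```python
import math
import random


def log_uniform(rng, low, high):
    """Sample so that each decade between low and high is equally likely."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


def sample_config(rng):
    """One draw from the search space described in the text."""
    return {
        "lr": log_uniform(rng, 1e-4, 1e-1),
        "lr_backbone": log_uniform(rng, 1e-6, 1e-4),
        "enc_layers": rng.randint(3, 6),     # inclusive on both ends
        "dec_layers": rng.randint(3, 6),     # assumed independent of enc_layers
        "dim_feedforward": rng.choice([512, 1024, 2048]),
        "hidden_dim": rng.choice([128, 256, 512]),
        "nheads": rng.choice([4, 8, 16]),
        "position_embedding": rng.choice(["sine", "learned"]),
    }


rng = random.Random(0)
configs = [sample_config(rng) for _ in range(200)]
```

Sampling the exponent rather than the value matters for learning rates: a plain uniform draw over 1e-4 to 1e-1 would put roughly 90 percent of trials above 1e-2 and barely explore the low end where the final rate of about 4e-4 was found.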

Why this generalises

This transfer is worth writing down because it is not specific to fractures. Object queries are a representation-agnostic slot mechanism; the box head is the only part of DETR that assumes you are detecting rectangles. Replace that head with a regressor for whatever low-dimensional vector describes your object — a sinusoid's depth/dip/azimuth here, equally a planar fault's strike and throw or a bed boundary's depth and apparent dip — normalise the targets, re-derive the metric in physical units, and you inherit DETR's entire anchor-free, NMS-free, mask-free pipeline. For subsurface computer vision, where the objects of interest are parametric geological features rather than photographic boxes, that is a far better starting point than a segmentation map you then have to fit a curve through.

Key takeaways

A borehole fracture is a sinusoid described by three numbers — location, dip, azimuth — not a pixel stencil. Segmentation forces a brittle, hand-tuned curve-fit after the network and optimises pixel IoU instead of the physical quantities a geoscientist reports; we dropped masks entirely.
DETR's object queries are a representation-agnostic set-prediction primitive. The convolutional backbone, transformer encoder/decoder, and queries transfer unchanged from boxes to sinusoids — only the small prediction head is task-specific.
The adaptation swaps DETR's 4-coordinate box MLP for a parameter head emitting location/dip/azimuth, keeping the no-object classifier. Targets are normalised to [0,1] by dividing by patch height (100), 90, and 360 so no parameter dominates the L1 regression gradient.
No anchors, no NMS, no masks: the Hungarian bipartite matching loss (focal class weight 5, L1 parameter weight 1, 0.5 inference threshold) guarantees one query per fracture, so there are no duplicates to suppress and surplus queries learn the empty class.
Making it converge was data and MLOps engineering: 100x270 overlapping patches, JSON sinusoid ground truth, sinusoid-only augmentation (one iteration: 4,212 patches / 3,564 sinusoids), DVC + GitHub versioning, and a log-uniform W&B hyperparameter sweep that settled on 4 encoder + 4 decoder layers, FFN 1024, AdamW.

References

[1] Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., and Zagoruyko, S. End-to-End Object Detection with Transformers (DETR). ECCV (2020). The original set-prediction formulation with learned object queries and the Hungarian bipartite matching loss. https://arxiv.org/abs/2005.12872
