How Detection Transformers Work — and Why Object Queries Replace Anchors

A first-principles primer on the Detection Transformer — how it throws out anchor boxes and non-maximum suppression and replaces them with a fixed set of learned object queries matched to ground truth by Hungarian assignment. That set-prediction framing is exactly why the same architecture that detects 80 COCO classes can be re-pointed at borehole sinusoids.

by Tannistha Maiti · 12 min read
EarthScan insight

For most of a decade, "object detection" and "anchor boxes" were almost synonymous. Faster R-CNN, RetinaNet, YOLO, SSD — different in a hundred details, identical in their skeleton: paper the image with a dense grid of candidate boxes, predict an objectness score and an offset for each, threshold, and run non-maximum suppression to collapse the survivors. Most of those parts are hand-tuned heuristics, not learned components. The Detection Transformer — DETR, introduced by Carion and colleagues at FAIR in 2020 — removed nearly all of them in one stroke. No anchors. No region-proposal network. No non-maximum suppression. In their place: a fixed set of learned vectors called object queries, and a single loss that matches those queries to ground-truth objects directly.

This post is about how that substitution works, and why it matters far beyond the COCO benchmark it was born on. We came to DETR through field work — building a borehole-image fracture detector for a mid-sized Middle East carbonate operator we partnered with — and the reason we reached for it over a classical detector is the same reason it is worth understanding in the abstract: the set-prediction framing generalises. The architecture that detects 80 object classes in a street scene is, structurally, the same one we re-pointed at the geometry of a sine wave on a rock-face image. Understanding why requires understanding what object queries are and what they replace.

The anchor pipeline, and what it costs you

Start with the thing DETR deletes. A classical detector reasons about locations. It lays down anchors — thousands of pre-sized boxes tiled across the feature map at fixed scales and aspect ratios — and asks of each: is there an object centred near here, and if so, how should I nudge this box to fit it? Because dozens of nearby anchors all fire on the same object, you then need non-maximum suppression (NMS) to keep the best box per object and throw away the rest.
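To make the tiling concrete, here is a minimal sketch of anchor generation. The grid size, stride, scales, and aspect ratios are illustrative values chosen for this example, not settings taken from any particular detector, and `tile_anchors` is a hypothetical helper.

```python
from itertools import product


def tile_anchors(grid_h, grid_w, stride, scales, aspect_ratios):
    """Return one (cx, cy, w, h) anchor per grid cell, scale, and aspect ratio."""
    anchors = []
    for row, col in product(range(grid_h), range(grid_w)):
        # The anchor centre sits in the middle of the feature-map cell.
        cx = (col + 0.5) * stride
        cy = (row + 0.5) * stride
        for scale, ratio in product(scales, aspect_ratios):
            # ratio is width / height; the anchor area stays at scale ** 2.
            w = scale * ratio ** 0.5
            h = scale / ratio ** 0.5
            anchors.append((cx, cy, w, h))
    return anchors


# Illustrative configuration (assumed values): 3 scales x 3 ratios per cell.
anchors = tile_anchors(
    grid_h=25, grid_w=34, stride=32,
    scales=(128, 256, 512), aspect_ratios=(0.5, 1.0, 2.0),
)
print(len(anchors))  # 25 * 34 * 9 = 7650 candidate boxes
```

Even this small configuration yields thousands of candidates for an image that may contain a handful of objects, which is why a suppression step has to follow.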

The hidden cost is the configuration surface. Anchor scales, aspect ratios, the IoU thresholds that decide which anchors count as positive during training, the separate IoU threshold inside NMS at inference — none of these is learned. They are dials, set by hand, that encode strong priors about the size, shape, and density of the objects you expect. Fine for pedestrians and cars whose statistics are well studied; a liability the moment your objects do not look like boxes, overlap heavily, or vary in count from zero to dozens within a single crop. NMS in particular assumes overlapping detections are duplicates to be deleted — precisely wrong when objects genuinely overlap.
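The failure mode on overlapping objects is easy to reproduce. The sketch below is a plain greedy NMS written for illustration (not the implementation of any specific library), with hypothetical boxes: index 1 is a true duplicate of index 0, while index 2 stands in for a distinct object that happens to overlap it heavily.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def greedy_nms(boxes, scores, iou_threshold):
    """Return indices of kept boxes, visiting them from highest score down."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap any already-kept box too much.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


boxes = [(0, 0, 100, 100), (2, 2, 102, 102), (20, 0, 120, 100)]
scores = [0.95, 0.90, 0.85]

print(greedy_nms(boxes, scores, iou_threshold=0.5))  # [0]
print(greedy_nms(boxes, scores, iou_threshold=0.7))  # [0, 2]
```

At a threshold of 0.5 the distinct object (IoU of about 0.67 with the first box) is deleted along with the duplicate. Raising the threshold to 0.7 rescues it, but only because the dial was moved by hand for this particular geometry.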

DETR's wager is that this machinery is a workaround for a problem we should solve directly. Detection is not about scoring locations; it is about predicting a set. An image holds some unordered collection of objects, and the model should emit an unordered collection of predictions and be graded against the truth as a set. Frame it that way and anchors and NMS simply evaporate.

Object queries: the part that does the work

Here is the mechanism. DETR runs the image through a convolutional backbone (a ResNet-50 in the reference model) to produce a coarse grid of feature vectors. With the standard downsampling factor of 32, an image of roughly 800 by 1,088 pixels collapses to a feature map of 2048 channels over a 25-by-34 spatial grid. A transformer encoder then lets every cell attend to every other cell, so each location is contextualised by the whole image. So far this is a fairly ordinary feature-extraction stack.
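The grid arithmetic can be checked in a few lines. This is a sketch under stated assumptions: the 1,088-pixel width is chosen here so that the grid comes out at 25 by 34, rounding up is assumed for sizes that are not multiples of the stride, and `feature_grid` is a hypothetical helper rather than part of any library.

```python
import math


def feature_grid(height, width, stride=32, channels=2048):
    """Spatial grid and token count after downsampling an image by `stride`."""
    grid_h = math.ceil(height / stride)  # assumed rounding convention
    grid_w = math.ceil(width / stride)
    tokens = grid_h * grid_w
    return {
        "grid": (grid_h, grid_w),
        "channels": channels,
        "tokens": tokens,
        # Full self-attention compares every cell with every other cell.
        "attention_pairs": tokens * tokens,
    }


info = feature_grid(height=800, width=1088)
print(info["grid"], info["tokens"], info["attention_pairs"])
# (25, 34) 850 722500
```

The quadratic `attention_pairs` figure is the price of letting every cell see every other cell, and it is one reason the backbone downsamples so aggressively before the encoder runs.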

The novelty is the decoder. Instead of reading predictions out of spatial locations, DETR hands the decoder a fixed, learned set of vectors — the object queries. The reference model uses 100. Each query attends through cross-attention over the encoded image and through self-attention over the other queries (so they coordinate and avoid all claiming the same object), and emerges as exactly one candidate detection: a class label and box coordinates, or in our adaptation, the regressed parameters of one geological feature. The model always emits 100 candidates, regardless of how many objects are present. A query is best understood not as a location but as a learned question: "is there an object of roughly this character, and where?" Over training the queries specialise, each gravitating toward a region and scale, collectively tiling the space of plausible objects with nobody hand-specifying anchors.
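The "fixed set in, fixed set out" property can be seen in a stripped-down sketch of the cross-attention step. It is deliberately incomplete: single head, no learned projections, no self-attention among queries, no feed-forward layers, and random vectors standing in for the learned queries and the encoded image. All names and the tiny embedding width are assumptions made for illustration.

```python
import math
import random


def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def cross_attend(queries, memory):
    """Single-head scaled dot-product attention: each query reads the memory."""
    dim = len(queries[0])
    outputs = []
    for q in queries:
        weights = softmax([dot(q, m) / math.sqrt(dim) for m in memory])
        # The output is the attention-weighted average of the memory vectors.
        outputs.append(
            [sum(w * m[k] for w, m in zip(weights, memory)) for k in range(dim)]
        )
    return outputs


random.seed(0)
DIM, NUM_QUERIES = 8, 100
# Stand-in for the learned object queries (random here, trained in a real model).
object_queries = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_QUERIES)]

for num_cells in (4, 850):
    # Stand-in for the encoder output: one vector per feature-map cell.
    memory = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(num_cells)]
    decoded = cross_attend(object_queries, memory)
    print(num_cells, len(decoded))  # the output count is 100 either way
```

Whatever the image size or object count, the decoder hands back one vector per query. The prediction heads then turn each vector into a class distribution and a parameter vector.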

[Interactive figure: EarthAdaptNet on Roosevelt Geothermal. A reused Residual-Block encoder (RB 1 to RB 4) and 1×1 conv bottleneck, joined by U-Net skips at every depth to a swappable head: a Transposed-Residual-Block decoder (TRB 1 to TRB 4) for per-pixel segmentation, or a fully-connected classifier.]
EarthAdaptNet's load-bearing claim is structural, not a metric: 'the architecture is the novel part — not the training recipe.' The Residual-Block encoder plus the 1×1 bottleneck is a reusable backbone; the decoder is the only task-specific part. Keep the Transposed-Residual decoder and you get per-pixel facies segmentation (the 8-facies output, ~80% pixel accuracy on Roosevelt held-out sections); replace it with fully-connected layers and you get a per-image facies classifier for free, with the same encoder weights, the same bottleneck, and no retraining. In the figure, orange marks the swappable head, the part the article argues is the only thing you ever need to replace, and teal marks the reused backbone, with a U-Net skip at every depth. The per-class accuracies (76/54/79/91/45/70/88/97), ~80% pixel accuracy, 8 facies, and the encoder-decoder, skip-at-every-depth, and free-classifier reuse mechanism are the article's own; the drawn block count at each depth is schematic (it shows topology, not the exact layer tally).

This split — a reusable backbone-plus-encoder that produces a contextualised representation, and a task-specific decoder head that reads it out — is the structural heart of the design, and it is exactly what makes DETR portable. The expensive, general part (learning to see) is decoupled from the cheap, specific part (learning what to report). Swap the decoder's output heads and loss for a different object parameterisation and the same backbone carries over.

The matching problem, and the Hungarian loss that solves it

Emitting a fixed set of 100 predictions creates an immediate scoring problem. The ground truth for an image is an unordered set of, say, four objects. The model emits 100 candidates in no particular order. Which prediction should be graded against which label? Query number seven has no privileged relationship to the seventh object — there may not even be a seventh object. Pair them up wrong and the loss punishes a perfectly good prediction for being matched to the wrong target, and the gradient drags the model in nonsensical directions.

This is a textbook assignment problem, and DETR's central trick is to make the assignment part of the loss. Before computing any gradient, it builds a cost matrix between every prediction and every ground-truth object — blending classification confidence with how well the predicted geometry fits — and runs the Hungarian algorithm to find the single lowest-cost one-to-one pairing. Only then does it compute the actual losses: a classification term and a regression term on the matched pairs, and a "no-object" classification term on every unmatched query, training the surplus predictions to confidently say nothing is here.
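A toy version of the matching step looks like this. It is a sketch, not DETR's matcher: the brute-force search over permutations is a stand-in that finds the same optimum as the Hungarian algorithm on tiny inputs (in practice one would normally reach for a dedicated solver such as SciPy's `linear_sum_assignment`), and the cost formula, its weight, and the data are assumptions chosen for illustration.

```python
from itertools import permutations


def pair_cost(pred, target, geometry_weight=1.0):
    """Lower is better: reward confidence in the target class, penalise geometry error."""
    class_cost = -pred["probs"][target["label"]]
    geometry_cost = sum(abs(p - t) for p, t in zip(pred["params"], target["params"]))
    return class_cost + geometry_weight * geometry_cost


def match(preds, targets):
    """Return {target_index: pred_index} with the lowest total cost, one-to-one.

    Brute force over permutations: fine for a demo, far too slow for 100 queries.
    Assumes there are at least as many predictions as targets.
    """
    best_cost, best = float("inf"), ()
    for chosen in permutations(range(len(preds)), len(targets)):
        cost = sum(pair_cost(preds[p], targets[t]) for t, p in enumerate(chosen))
        if cost < best_cost:
            best_cost, best = cost, chosen
    return dict(enumerate(best))


# Class indices 0 and 1 are real classes; index 2 is "no object" (assumed layout).
preds = [
    {"probs": [0.1, 0.1, 0.8], "params": (0.50, 0.50)},
    {"probs": [0.7, 0.2, 0.1], "params": (0.20, 0.30)},
    {"probs": [0.2, 0.7, 0.1], "params": (0.80, 0.75)},
    {"probs": [0.6, 0.3, 0.1], "params": (0.78, 0.70)},
]
targets = [
    {"label": 1, "params": (0.80, 0.70)},
    {"label": 0, "params": (0.20, 0.30)},
]

matching = match(preds, targets)
print(matching)  # {0: 2, 1: 1}
```

Query 3 sits almost exactly on the first target's geometry but is confident in the wrong class, so the blended cost hands that target to query 2 instead. Queries 0 and 3 are left unmatched and fall through to the no-object term.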

That no-object loss is the quiet hero of the whole scheme. Because the prediction count is fixed at 100 but the true object count is not, most queries on most images should predict empty — and the no-object term is what teaches them to do so cleanly. The one-to-one constraint is what kills NMS: each object is claimed by exactly one query, so there are no duplicate detections left to suppress. The model learns the deduplication that NMS used to do by hand, and learns it end-to-end.
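The bookkeeping behind the no-object term is short enough to sketch. Everything here is illustrative: the three-way class layout, the helper names, and the 0.1 weight on empty queries are assumptions for the example, and the loss is a plain weighted cross-entropy rather than any specific published recipe.

```python
import math

NO_OBJECT = 2  # assumed layout: classes 0 and 1 are real, index 2 means "nothing here"


def class_targets(num_queries, matching, target_labels):
    """Every query defaults to NO_OBJECT; matched queries take their target's label."""
    targets = [NO_OBJECT] * num_queries
    for target_idx, query_idx in matching.items():
        targets[query_idx] = target_labels[target_idx]
    return targets


def weighted_cross_entropy(probs_per_query, targets, no_object_weight=0.1):
    """Cross-entropy over all queries, with empty queries down-weighted."""
    total, weight_sum = 0.0, 0.0
    for probs, target in zip(probs_per_query, targets):
        weight = no_object_weight if target == NO_OBJECT else 1.0
        total += -weight * math.log(probs[target])
        weight_sum += weight
    return total / weight_sum


# Four queries, two ground-truth objects; matching maps target index -> query index.
matching = {0: 2, 1: 1}
targets = class_targets(num_queries=4, matching=matching, target_labels=[1, 0])
print(targets)  # [2, 0, 1, 2]

confident = [[0.05, 0.05, 0.9], [0.9, 0.05, 0.05], [0.05, 0.9, 0.05], [0.05, 0.05, 0.9]]
hesitant = [[0.4, 0.3, 0.3], [0.4, 0.3, 0.3], [0.3, 0.4, 0.3], [0.3, 0.4, 0.3]]
print(weighted_cross_entropy(confident, targets) < weighted_cross_entropy(hesitant, targets))
```

The down-weighting matters because empty queries vastly outnumber matched ones on a typical image. Without it, the easiest way to lower the loss is to predict "nothing" everywhere.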

Why this generalises — from COCO classes to sinusoids

Now the payoff. A classical anchor detector is welded to the assumption that objects are boxes: its anchors encode box scales, its NMS reasons about box overlap (IoU), its evaluation is average precision over boxes. Repoint it at something that is not a box and you fight the architecture at every layer.

DETR's set-prediction framing has no such commitment. Strip it down and it says only: emit a fixed set of candidates, regress each candidate's parameters, and match the set to ground truth by Hungarian assignment. The word "box" never appears. The class set is arbitrary — 80 COCO categories or two geological ones. The "parameters" of an object are whatever your regression head outputs and your matching cost scores. That is the latitude we exploited in the field.

A fracture on a borehole image log is not a box; it is a sinusoid. Unroll the cylindrical wall of a well into a flat strip and a planar feature crossing that cylinder traces a sine wave whose amplitude encodes dip and whose phase encodes azimuth. To turn DETR into a fracture detector — internally, GeoBFDT — we changed almost nothing structural. We kept the transformer encoder-decoder and the object-query mechanism intact, swapped the box-coordinate head for one regressing a sinusoid's depth, dip, and azimuth, and replaced IoU-and-average-precision evaluation with depth-thresholded matching, because a sine wave has no box to overlap. The Hungarian matcher, the no-object loss, the fixed-query set — all untouched. The COCO detector and the fracture detector are, in their load-bearing parts, the same model.
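The geometry of the sinusoid, and the depth-based match that replaces IoU, can be sketched as follows. The sign and angle conventions are assumptions (depth increasing downward, deepest point of the trace at the dip azimuth, dip measured relative to the plane perpendicular to the borehole axis), and the radius and tolerance values are hypothetical; this is not the GeoBFDT code.

```python
import math


def sinusoid_trace(depth, dip_deg, azimuth_deg, borehole_radius, num_samples=360):
    """Depth of a planar feature at each angle around the unrolled borehole wall.

    Assumed convention: depth increases downward and the deepest point of the
    trace lies at the dip azimuth.
    """
    # A plane cutting a cylinder of radius r at dip angle d rises and falls by r * tan(d).
    amplitude = borehole_radius * math.tan(math.radians(dip_deg))
    azimuth = math.radians(azimuth_deg)
    trace = []
    for i in range(num_samples):
        theta = 2 * math.pi * i / num_samples
        trace.append(depth + amplitude * math.cos(theta - azimuth))
    return trace


def depth_match(pred_depth, true_depth, tolerance):
    """A prediction counts as a hit if its depth lies within `tolerance` of the truth."""
    return abs(pred_depth - true_depth) <= tolerance


# Hypothetical fracture: 1000 m depth, 45 degree dip, azimuth 90 degrees, 0.1 m radius.
trace = sinusoid_trace(depth=1000.0, dip_deg=45.0, azimuth_deg=90.0, borehole_radius=0.1)
deepest = trace.index(max(trace))
print(deepest, round(max(trace) - min(trace), 3))  # 90 0.2

print(depth_match(1000.02, 1000.0, tolerance=0.05))  # True
print(depth_match(1000.20, 1000.0, tolerance=0.05))  # False
```

Three numbers (depth, dip, azimuth) fully determine the curve, which is why they make a natural regression target. Two fractures that cross on the image still have distinct parameter triples, so nothing forces one of them to be suppressed.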

The pieces that did change mark which choices are task-specific. Where COCO DETR uses 100 queries, a heavy ResNet-50, and six encoder/decoder layers, our small-data borehole problem wanted lighter: a from-scratch ResNet-10 backbone with four encoder and four decoder layers, focal classification loss with the no-object case down-weighted, and a class-loss weight of 5 against a parameter-loss weight of 1. Confidence conventions differ too — COCO keeps detections above 0.9; our recall-sensitive pick used 0.5. Recipe knobs, not architecture. The set-prediction skeleton was the constant.
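One way to keep the "recipe versus architecture" distinction honest is to hold the recipe in a plain config object. The sketch below records only the values stated in this post; fields the post does not give are left as `None` rather than guessed, and the class, field, and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DetectorRecipe:
    backbone: str
    from_scratch: Optional[bool]        # None where this post does not say
    encoder_layers: int
    decoder_layers: int
    num_queries: Optional[int]          # None where this post gives no figure
    class_loss_weight: Optional[float]
    param_loss_weight: Optional[float]
    confidence_threshold: float


COCO_DETR = DetectorRecipe(
    backbone="resnet50", from_scratch=None,
    encoder_layers=6, decoder_layers=6, num_queries=100,
    class_loss_weight=None, param_loss_weight=None,
    confidence_threshold=0.9,
)

BOREHOLE = DetectorRecipe(
    backbone="resnet10", from_scratch=True,
    encoder_layers=4, decoder_layers=4, num_queries=None,
    class_loss_weight=5.0, param_loss_weight=1.0,
    confidence_threshold=0.5,
)


def keep_confident(scores, threshold):
    """Indices of predictions whose confidence exceeds the threshold."""
    return [i for i, score in enumerate(scores) if score > threshold]


scores = [0.95, 0.60, 0.30]  # hypothetical per-query confidences
print(keep_confident(scores, COCO_DETR.confidence_threshold))  # [0]
print(keep_confident(scores, BOREHOLE.confidence_threshold))   # [0, 1]
```

Everything that differs between the two detectors fits in this object. The matcher, the no-object term, and the query mechanism do not appear in it because they did not change.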

What the choice bought us, concretely

The clearest evidence that the architecture, not the tuning, is doing the work is the backbone sweep. On this small geological dataset a from-scratch ResNet-10 posted a classification error of roughly 0.5%, while a heavier ResNet-34 — more parameters, more capacity — blew up to 26.8%. The set-prediction objective converged cleanly on a light backbone and overfit on a heavy one; a bigger feature extractor only helped it memorise. The data appetite of set prediction shows in a second sweep: classification error fell from 93.1% with three training wells to 1.06% at nine, then settled near 2.5% on the full multi-well dataset. Steep and non-linear — the gap between useless and deployable was a handful of wells — but the same convergence behaviour DETR shows on COCO, scaled to a domain where labelled data is measured in wells, not millions of images.

None of that is reachable from an anchor pipeline without a fight: you would be inventing sinusoid-shaped anchors, redefining IoU for sine waves, and re-tuning NMS to stop deleting genuinely crossing fractures. DETR let us skip the entire argument by working at the level of sets and parameters rather than boxes and locations — which is the whole point of object queries. They are not a cleverer anchor. They are what you reach for when the concept of an anchor no longer fits your objects.

The takeaway for practitioners

If there is one idea to carry out of DETR, it is that the loss is the inductive bias. The Hungarian matching loss does not assume your objects are boxes, occur at fixed scales, or come in predictable counts. It assumes only that they form a set you can pair to ground truth by cost. Whenever your detection targets are unordered, variable in count, parameterised by continuous quantities, and prone to genuine overlap, that is a better fit than anchors and NMS — and the architecture transfers across domains as cleanly as the assumption holds. We learned this not on a benchmark but on kilometres of carbonate image log, and the lesson has held everywhere we have applied it since, across operators in the Middle East and the United States: pick the framing whose assumptions match your physics, and the rest of the pipeline follows.

Key takeaways

  1. Classical detectors (Faster R-CNN, YOLO, RetinaNet) rely on dense anchor grids plus non-maximum suppression — a stack of hand-tuned heuristics (anchor scales, aspect ratios, IoU thresholds) that encode a strong 'objects are boxes' prior.
  2. DETR replaces all of it with a fixed set of learned object queries (100 in the reference model). Each query attends over a ResNet-50 backbone's encoded feature grid and emerges as exactly one candidate detection — no anchors, no region proposals, no NMS.
  3. A Hungarian bipartite matching loss makes assignment part of the objective: it finds the optimal one-to-one pairing between the 100 predictions and the ground-truth objects before computing any gradient. The one-to-one constraint is what eliminates duplicates, so NMS is unnecessary.
  4. The no-object loss trains surplus queries to predict 'empty,' which is how a fixed-size prediction set adapts to a variable object count (zero to dozens per image).
  5. Because the framing is set-prediction over arbitrary parameters — never boxes — the same architecture re-points from 80 COCO classes to borehole sinusoids by swapping only the regression head and evaluation metric. On that geological task a light from-scratch ResNet-10 (~0.5% class error) beat a heavier ResNet-34 (~26.8%), and error fell from 93.1% (3 wells) to ~2.5% (full dataset) as data accumulated.

References

[1] Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., and Zagoruyko, S. End-to-End Object Detection with Transformers (DETR). ECCV (2020). The original set-prediction formulation: learned object queries, the transformer encoder-decoder, and the Hungarian bipartite matching loss. https://arxiv.org/abs/2005.12872

© 2026 Copyright. Earthscan