The Bipartite Matching Loss: How Transformers Learn to Pick Fractures

A fracture on a borehole image log is a sinusoid. Unroll the cylindrical wall of a well into a flat strip and a planar feature cutting that cylinder traces a clean sine wave whose amplitude encodes dip and whose phase encodes azimuth. A geologist picks these by eye, one trace at a time, across kilometres of imagery from two different microresistivity imaging tools. The question for a machine-learning team is deceptively simple: how do you teach a model to output a set of sinusoids — not a fixed grid, not a heatmap, but an unordered set of geological objects, each with its own depth, dip, and azimuth — and score it against a ground-truth set that may contain three sinusoids in one patch and zero in the next? The answer that worked, in a roughly twenty-month engagement with a mid-sized Middle East carbonate operator, was the bipartite matching loss at the heart of the Detection Transformer. This piece is about why that loss is the right inductive bias for borehole fractures, and what it buys you over the anchor-and-threshold machinery that dominates object detection.
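
The geometry in that opening claim is compact enough to write down. The sketch below assumes a perfectly cylindrical borehole of known radius and one common sign convention (depth increasing downward, with the trace deepest at the dip azimuth). The function name `fracture_trace` and all numeric values are illustrative, not taken from the project.

```python
import math

def fracture_trace(depth_m, dip_deg, azimuth_deg, radius_m, phi_deg):
    """Depth (m) at which a planar fracture meets the borehole wall at angle phi.

    Assumes a cylindrical hole and depth increasing downward, so the trace is
    deepest in the direction the plane dips toward (the dip azimuth).
    """
    amplitude = radius_m * math.tan(math.radians(dip_deg))  # amplitude encodes dip
    phase = math.radians(phi_deg - azimuth_deg)              # phase encodes azimuth
    return depth_m + amplitude * math.cos(phase)

# Hypothetical fracture: centre depth 1500 m, dipping 45 degrees toward 090,
# in a hole of 0.1 m radius, sampled every 90 degrees around the wall.
trace = [fracture_trace(1500.0, 45.0, 90.0, 0.1, phi) for phi in range(0, 360, 90)]
print(trace)
```

Under these assumptions three numbers (depth, dip, azimuth) fully determine the curve, which is why the model can regress them directly instead of tracing pixels.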

The problem with the obvious approach

The instinct of most engineers asked to "detect sinusoids in an image" is to reach for the standard detection toolkit: tile the image with anchor boxes, predict an objectness score and an offset per anchor, threshold the scores, and run non-maximum suppression to collapse the duplicates. It works for pedestrians and traffic cones. It is a poor fit for fractures.

The reason is that a fracture is not a box. Its salient parameters are depth, dip, and azimuth — a position and the geometry of a sine wave — not a width and a height in pixels. Forcing that into an axis-aligned bounding box throws away the structure you actually care about and reintroduces it as a regression afterthought. Worse, anchors and NMS bring a thicket of hand-tuned thresholds — anchor scales, aspect ratios, IoU cutoffs, suppression radii — none of which has a natural meaning for a sine wave. In a highly fractured carbonate play, sinusoids genuinely overlap and cross. NMS, which exists precisely to delete overlapping detections, will happily delete the real ones.
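
To make the suppression failure concrete, here is a toy depth-based suppression step. It assumes each candidate is reduced to a depth and a score and that neighbours inside a fixed depth radius are suppressed. The function, the radius, and the numbers are hypothetical; real NMS variants differ in detail but share the failure mode.

```python
def depth_nms(candidates, radius_m):
    """Greedy suppression: keep the best-scoring pick, drop neighbours within radius_m."""
    kept = []
    for cand in sorted(candidates, key=lambda c: c["score"], reverse=True):
        if all(abs(cand["depth"] - k["depth"]) > radius_m for k in kept):
            kept.append(cand)
    return kept

# Two real, crossing fractures 2 cm apart with different dips, plus one isolated fracture.
candidates = [
    {"depth": 1500.00, "dip": 60.0, "score": 0.95},
    {"depth": 1500.02, "dip": 25.0, "score": 0.90},  # real, but close to the first
    {"depth": 1503.40, "dip": 40.0, "score": 0.88},
]
survivors = depth_nms(candidates, radius_m=0.05)
print([c["dip"] for c in survivors])  # the 25-degree fracture has been suppressed
```

The suppressed pick was a correct detection of a distinct plane; nothing in the suppression rule can tell a duplicate from a crossing fracture.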

What you want instead is a model that emits a small, fixed-size set of predictions, learns by itself how many of them correspond to real fractures, and is scored on the set as a whole. That is exactly what the Detection Transformer (DETR) does, and it is why the team built its fracture model on the DETR formulation rather than on a region-proposal network.


Set prediction, and the assignment problem it creates

In the DETR formulation the decoder is handed a fixed number of learned object queries. Each query attends over the encoded image and comes out the other side as one candidate detection: a class label plus, in our case, the regressed depth, dip, and azimuth of a single sinusoid. The model always emits the same number of candidates — say, a few dozen — regardless of how many fractures are actually present. Patches with three sinusoids and patches with none go through identical machinery.

That design immediately raises a scoring problem. If the network outputs an unordered set of candidates and the ground truth is an unordered set of fractures, which prediction should be graded against which label? There is no canonical order. You cannot simply zip the two lists together, because query number seven has no privileged relationship to the seventh fracture — there may not even be a seventh fracture. Get the assignment wrong and the loss punishes a perfectly good prediction for being matched to the wrong target, and the gradient pushes the model in nonsensical directions.

This is a classic assignment problem, and it has a clean, exact solution.


The Hungarian matching loss

The trick is to make assignment part of the loss, not a heuristic bolted on afterward. Before computing any gradient, you find the single best one-to-one pairing between the model's predictions and the ground-truth fractures — the pairing that minimizes total matching cost — and only then compute the regression and classification loss on those matched pairs. The optimal pairing is found with the Hungarian algorithm, which solves the bipartite assignment exactly and in polynomial time.

Mechanically it runs in two stages per training patch:

Match. Build the cost matrix between every prediction and every ground-truth sinusoid. The cost blends how confident the prediction is in the right class with how close its regressed depth, dip, and azimuth land to the target. Run the Hungarian algorithm to pick the lowest-cost one-to-one assignment. Predictions that win a real fracture are "on the hook" for it; the rest are matched to a no-object label.
Compute loss on the optimal pairing. For matched pairs, apply a classification loss plus an L1 regression loss on the sinusoid parameters. For the unmatched predictions, apply only the classification loss, training them to confidently say "no fracture here."
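
The match stage can be sketched in a few lines. This is a minimal illustration, not the production code: the cost form (negative fracture probability plus L1 distance on normalised parameters), the equal cost weights, and every name here are assumptions. It also swaps the Hungarian algorithm for an exhaustive search, which returns the same optimum on a toy problem; in practice a library solver such as SciPy's `linear_sum_assignment` would replace it.

```python
from itertools import permutations

def match_cost(pred, target, class_weight=1.0, param_weight=1.0):
    """Cost of pairing one prediction with one ground-truth sinusoid (lower is better)."""
    l1 = sum(abs(p - t) for p, t in zip(pred["params"], target))
    return -class_weight * pred["p_fracture"] + param_weight * l1

def optimal_assignment(preds, targets):
    """Return the minimum-cost one-to-one pairing as {target_index: query_index}.

    Exhaustive search, fine for a toy example with more queries than targets;
    the Hungarian algorithm finds the same optimum in polynomial time.
    """
    best_cost, best_pairing = float("inf"), {}
    for chosen in permutations(range(len(preds)), len(targets)):
        cost = sum(match_cost(preds[q], targets[t]) for t, q in enumerate(chosen))
        if cost < best_cost:
            best_cost, best_pairing = cost, dict(enumerate(chosen))
    return best_pairing

# Four queries, two ground-truth sinusoids; params are (depth, dip, azimuth) scaled to [0, 1].
preds = [
    {"p_fracture": 0.10, "params": (0.50, 0.50, 0.50)},
    {"p_fracture": 0.90, "params": (0.71, 0.32, 0.18)},
    {"p_fracture": 0.20, "params": (0.05, 0.90, 0.60)},
    {"p_fracture": 0.80, "params": (0.22, 0.55, 0.81)},
]
targets = [(0.20, 0.60, 0.80), (0.70, 0.30, 0.20)]

pairing = optimal_assignment(preds, targets)
unmatched = [q for q in range(len(preds)) if q not in pairing.values()]
print(pairing, unmatched)
```

Here query 3 claims the first fracture and query 1 the second, regardless of their position in the list; queries 0 and 2 fall through to the no-object label.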

In the production model the classification term is a focal loss and the parameter term is an L1 loss on depth, dip, and azimuth. The no-object case overwhelmingly dominates, which is exactly the foreground-background imbalance focal loss was designed for. The two terms are weighted: a class-loss weight of 5 against a parameter-loss weight of 1, with depth, dip, and azimuth each normalised to the unit interval so no single parameter dominates the regression gradient. At inference the model simply keeps the queries whose object probability clears a 0.5 threshold. There is no NMS, no anchor grid, no IoU bookkeeping: one-to-one matching trains each fracture to be claimed by a single query, so a well-trained model leaves no duplicates to suppress in the first place.
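
A hedged sketch of how those pieces might combine for one patch follows. The weights 5 and 1 and the 0.5 threshold come from the text; the focusing parameter `GAMMA = 2.0`, the averaging scheme, and all function names are assumptions made for illustration.

```python
import math

CLASS_WEIGHT, PARAM_WEIGHT = 5.0, 1.0   # weights quoted in the text
GAMMA = 2.0                              # assumed focusing parameter, not from the source

def focal_loss(p_fracture, is_fracture, gamma=GAMMA):
    """Binary focal loss: cross-entropy down-weighted for easy, confident examples."""
    p_t = p_fracture if is_fracture else 1.0 - p_fracture
    p_t = min(max(p_t, 1e-7), 1.0)       # guard against log(0)
    return -((1.0 - p_t) ** gamma) * math.log(p_t)

def patch_loss(preds, targets, pairing):
    """Weighted loss for one patch, given pairing {target_index: query_index}."""
    matched_queries = set(pairing.values())
    # Every query gets a classification term; unmatched ones are graded as no-object.
    cls = sum(focal_loss(p["p_fracture"], q in matched_queries) for q, p in enumerate(preds))
    # Only matched queries get an L1 term on (depth, dip, azimuth).
    l1 = sum(
        abs(a - b)
        for t, q in pairing.items()
        for a, b in zip(preds[q]["params"], targets[t])
    )
    return CLASS_WEIGHT * cls / len(preds) + PARAM_WEIGHT * l1 / max(len(pairing), 1)

def keep_confident(preds, threshold=0.5):
    """Inference: keep queries whose fracture probability clears the threshold."""
    return [p for p in preds if p["p_fracture"] > threshold]

preds = [
    {"p_fracture": 0.90, "params": (0.71, 0.32, 0.18)},
    {"p_fracture": 0.10, "params": (0.50, 0.50, 0.50)},
    {"p_fracture": 0.05, "params": (0.05, 0.90, 0.60)},
]
targets = [(0.70, 0.30, 0.20)]
loss = patch_loss(preds, targets, pairing={0: 0})
picks = keep_confident(preds)
print(round(loss, 4), len(picks))
```

The same `patch_loss` call handles an empty patch: with no targets and an empty pairing, every query is graded against no-object and the L1 term vanishes.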

No-object loss is the piece that makes the whole scheme adapt to a variable number of fractures. Because the prediction count is fixed but the fracture count is not, most queries on most patches should learn to say nothing — and the no-object term is what trains them to do so cleanly.

One more substitution matters for geoscience. Standard detectors score themselves with IoU and average precision, both defined on overlapping boxes. A sinusoid has no box to overlap. So the team replaced IoU-based evaluation with depth thresholding: a prediction counts as a true positive if its picked depth falls within a tolerance of a real fracture — evaluated at 3 cm, 6 cm, and 9 cm bands — after which dip and azimuth accuracy are measured only on those matched true positives. This keeps the entire pipeline, from matching cost to final metric, expressed in the physical units a petrophysicist actually trusts.
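
Depth-thresholded scoring is simple to prototype. The sketch below assumes a greedy nearest-depth rule in which each real fracture can be claimed only once; the source does not specify the evaluation matching rule, and the function name and depths are hypothetical.

```python
def depth_match(pred_depths_m, true_depths_m, tolerance_m):
    """Greedy one-to-one matching of picks to fractures within a depth tolerance.

    Returns (true_positives, false_positives, false_negatives).
    """
    unclaimed = sorted(true_depths_m)
    tp = 0
    for depth in sorted(pred_depths_m):
        nearest = min(unclaimed, key=lambda d: abs(d - depth), default=None)
        if nearest is not None and abs(nearest - depth) <= tolerance_m:
            unclaimed.remove(nearest)    # each real fracture can be claimed once
            tp += 1
    return tp, len(pred_depths_m) - tp, len(unclaimed)

true_depths = [1500.00, 1501.20, 1503.40]
pred_depths = [1500.02, 1501.25, 1503.48, 1507.00]

for tol_cm in (3, 6, 9):
    tp, fp, fn = depth_match(pred_depths, true_depths, tol_cm / 100.0)
    print(f"{tol_cm} cm: TP={tp} FP={fp} FN={fn}")
```

Dip and azimuth errors would then be computed only over the pairs counted as true positives at the chosen band.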

Figure: An unwrapped borehole image log. Azimuth runs 0–360° across, depth runs down, and every fracture plane intersecting the wellbore traces a sinusoid (amplitude tracks dip, phase tracks azimuth). The DETR model predicts each sinusoid's depth; a pick is a hit only if it falls inside the interpreter's depth-match window. Loosening the window recalls more fractures but admits more false flags, trading sensitivity against precision. At the 8 cm interpreter window the model recalls ~85%. Sensitivity, corpus and geometry per the case study; the per-pick errors and precision are illustrative.

Why this is the right inductive bias for sinusoids

A loss function is an inductive bias in disguise. The bipartite matching loss bakes in three assumptions that happen to be exactly true of fractures on image logs, and that is why it works.

First, fractures are a set, not a grid. They have no natural ordering and no fixed count. Set prediction respects that directly; an anchor grid imposes a spatial quantisation that a sine wave does not honour.

Second, overlap is signal, not noise. Crossing and conjugate fractures are geologically meaningful, and a one-to-one matcher lets two near-coincident sinusoids each keep their own query. NMS would erase one of them by construction.

Third, the parameters are physical and continuous. Depth, dip, and azimuth are regressed end-to-end against the matched target, so the model optimises the quantities a geologist reports, rather than a box from which those quantities must later be reverse-engineered.

The cost of this elegance is that set prediction is data-hungry. The matching is global, the supervision per query is sparse, and the model has to discover the right number of objects on its own — which means it needs to see enough geology to generalise. That dependence shows up vividly in the ablations.

What the ablations actually showed

The cleanest evidence that the matching loss is doing real work — and that it is hungry — comes from sweeping the number of training wells and watching the classification error fall off a cliff as data accumulates. With only 3 wells the model is hopeless: classification error sits at 93.1%, essentially guessing. At 6 wells it drops to 18.4%. By 9 wells it is 1.06%, and the full 14-well dataset lands the fractures-only model at a 2.54% classification error with a Hungarian loss of 0.015 and an L1 parameter loss of 0.059. The curve is steep and non-linear: the gap between a useless model and a deployable one is a handful of wells, and the set-prediction objective only converges once there is enough geology to anchor the assignment.


Two other ablations reinforce the point that the loss rewards real structure. Training on dynamic rather than static borehole imagery cut classification error from 63.5% to 2.54%, because dynamic normalisation preserves the sinusoid contrast the matcher keys on. And augmentation was not a marginal nicety: with augmentation switched off, classification error was pinned at 100% (the model learned nothing usable); switched on, it fell to 2.62%. A matching loss can only assign what it can see, and both results say the same thing: give the matcher clean, varied sinusoids and it converges; starve it and it fails completely.

Finally, the backbone sweep is a useful caution against over-parameterising a small-data problem. A from-scratch ResNet-10 beat every deeper variant, posting a 0.499 classification error against 26.76 for ResNet-34. With 14 wells, a heavier feature extractor does not help the matcher; it just overfits before the set-prediction objective can do its job.

Takeaways for the practitioner

If you are porting object detection to a geoscience problem, the lesson is not "use a transformer." It is "pick the loss whose assumptions match your physics." For features that are unordered, variable in count, parameterised by continuous physical quantities, and prone to genuine overlap, the bipartite matching loss is a near-perfect fit — and it eliminates two of the most error-prone heuristics in classical detection, anchors and NMS, in one move. The price is data appetite. Budget for it: in this engagement the difference between a 93% error and a 2.5% error was the difference between 3 wells and 14.

Key takeaways

Fractures on image logs are sinusoids parameterised by depth, dip, and azimuth — an unordered, variable-count set, not boxes on a grid. Anchor-and-NMS detection is the wrong tool because it quantises space and deletes genuinely overlapping features.
DETR-style set prediction emits a fixed number of learned queries; the Hungarian bipartite matching loss finds the optimal one-to-one pairing between predictions and ground-truth fractures before computing any gradient, so there are no duplicates to suppress.
The production loss is a focal classification term (weight 5) plus an L1 parameter term (weight 1) on normalised depth/dip/azimuth; a no-object loss trains surplus queries to predict empty, which is how a fixed prediction set adapts to a variable fracture count.
IoU and average precision are replaced by depth-thresholded evaluation (3/6/9 cm) so the whole pipeline stays in the physical units a geoscientist trusts.
Set prediction is data-hungry: classification error fell from 93.1% (3 wells) to 18.4% (6 wells) and 1.06% (9 wells), with the full 14-well fractures-only model at 2.54%. Dynamic-vs-static imagery (63.5%→2.5%) and augmentation (100%→2.6%) mattered enormously; a from-scratch ResNet-10 beat ResNet-34 (0.50 vs 26.76 class error) on this small dataset.

References

[1] Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., and Zagoruyko, S. End-to-End Object Detection with Transformers (DETR). ECCV (2020). The original set-prediction formulation with the Hungarian bipartite matching loss. https://arxiv.org/abs/2005.12872
