There is a tidy way to lose an argument about model architectures, and it is to argue about a single number. Ask which detector is better for borehole image logs and the literature hands you a leaderboard: a Mask R-CNN reporting a mean IoU of 81.2% for fracture segmentation, another Mask R-CNN reporting 96% precision and 92% recall, a Swin-based W-shaped network reporting an mIoU of 0.689. Those are real, well-earned numbers. They are also the wrong thing to compare, because the question they answer ("how well does this model draw a mask around the thing it found?") is not the question a fracture interpreter is asking. In a roughly twenty-month engagement with a mid-sized Middle East carbonate operator, the model that won was not the one with the highest segmentation score. It was the one whose output set matched what the geology requires: a mask-free Detection Transformer that retrieves intersecting and multiple fractures, regresses dip and azimuth end-to-end, and carries no post-processing stack at all. This piece is about why that comparison is a capability matrix, not a metric, and what each row of the matrix costs the mask-based approach.
What Mask R-CNN actually optimises
Mask R-CNN is an excellent instance-segmentation architecture, and it is worth being precise about what it does well before saying where it does not fit. It runs a region-proposal network over a backbone feature map, classifies and box-regresses each proposal, then predicts a per-instance binary mask inside the winning boxes. For a feature whose ground truth is a region of pixels — a vug outline, a breakout patch, a lithofacies band — that is a clean formulation, and the published borehole results bear it out. The mIoU and precision/recall figures above are the scores of a pipeline doing exactly what it was built to do: deciding, pixel by pixel, which parts of the image belong to the feature.
The trouble starts when the feature you care about is not a region but a parameterised geometric object. A fracture on an unrolled borehole image is a sinusoid — a sine wave whose amplitude encodes dip and whose phase encodes azimuth. The geologist does not want a mask of the bright pixels along the trace; they want three numbers: depth, dip, azimuth. A segmentation model can produce a mask of those pixels, but the dip and azimuth then have to be recovered from the mask by a separate curve-fitting step — a post-processing stage with its own thresholds, its own failure modes, and no gradient connecting it to the loss that trained the network. The headline accuracy number is computed before that step ever runs. It tells you nothing about how well the dip came out.
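To make that separate curve-fitting step concrete, here is a minimal sketch of what a mask-to-parameter stage might look like. It is not the pipeline used in any of the cited baselines. It assumes a vertical well, depth increasing downward, trace samples already extracted from the mask as (angle, depth) pairs, and a hypothetical borehole radius of 0.108 m. The function names are illustrative.

```python
import math

def solve3(m, v):
    """Solve a 3x3 linear system by Gaussian elimination with partial pivoting."""
    aug = [m[i][:] + [v[i]] for i in range(3)]
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, 3):
            f = aug[r][col] / aug[col][col]
            for c in range(col, 4):
                aug[r][c] -= f * aug[col][c]
    x = [0.0, 0.0, 0.0]
    for i in (2, 1, 0):
        s = aug[i][3] - sum(aug[i][j] * x[j] for j in range(i + 1, 3))
        x[i] = s / aug[i][i]
    return tuple(x)

def fit_sinusoid(points):
    """Least-squares fit of depth = z0 + a*cos(theta) + b*sin(theta).

    points: (theta_rad, depth_m) samples along one trace (hypothetical input,
    e.g. one mask centroid per image column). Needs at least 3 distinct angles.
    """
    ata = [[0.0] * 3 for _ in range(3)]
    aty = [0.0] * 3
    for theta, depth in points:
        row = (1.0, math.cos(theta), math.sin(theta))
        for i in range(3):
            aty[i] += row[i] * depth
            for j in range(3):
                ata[i][j] += row[i] * row[j]
    return solve3(ata, aty)

def to_dip_azimuth(a, b, radius_m):
    """Convert fitted coefficients to dip and dip azimuth in degrees."""
    amplitude = math.hypot(a, b)              # equals radius * tan(dip)
    dip = math.degrees(math.atan2(amplitude, radius_m))
    # With depth increasing downward, the deepest point marks the down-dip direction.
    azimuth = math.degrees(math.atan2(b, a)) % 360.0
    return dip, azimuth

# Synthetic trace: 60 degree dip toward azimuth 135, centred at 1500 m (made-up values).
radius = 0.108
amp = radius * math.tan(math.radians(60.0))
samples = [(math.radians(t), 1500.0 + amp * math.cos(math.radians(t - 135.0)))
           for t in range(0, 360, 10)]
z_fit, a_fit, b_fit = fit_sinusoid(samples)
dip_fit, az_fit = to_dip_azimuth(a_fit, b_fit, radius)
print(round(z_fit, 3), round(dip_fit, 2), round(az_fit, 2))
```

On clean synthetic samples the fit recovers the parameters it was given. On a real mask, the samples depend on a binarisation threshold and on how crossing traces were separated, and none of that is visible to the segmentation loss.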
The capability matrix
So instead of asking whose mIoU is higher, line the approaches up against the capabilities a borehole feature picker has to deliver, and mark which architecture supplies each one natively — without a bolt-on heuristic.
Seven rows decide the comparison, and the mask-based baselines miss most of them by construction rather than by tuning:
- Intersecting fractures. Conjugate and crossing fractures are geologically routine and diagnostic of stress history. Instance segmentation with non-maximum suppression is built to delete overlapping detections — that is what NMS is for. When two sinusoids genuinely cross, the suppression step erases one of them. A set-prediction model with one-to-one Hungarian matching lets each crossing fracture keep its own query, so overlap is preserved as the signal it is.
- Multiple fractures per patch. A region-proposal pipeline handles this in principle, but its recall is throttled by anchor density and NMS radius — tuned parameters that trade missed fractures against duplicate boxes. The transformer emits a fixed set of learned queries and learns, from data, how many correspond to real fractures.
- Bedding planes. Beds are also sinusoids, often low-amplitude and low-contrast. Handling them in the same forward pass — rather than as a second model — is a property of the unified detection head, not the segmentation mask.
- Mask-free operation. No per-pixel mask means no mask-to-parameter curve fit downstream. The geometry is the output.
- Dip / azimuth retrieval. The transformer regresses dip and azimuth as part of the prediction, supervised end-to-end against the labelled values. Mask R-CNN has to reconstruct them after the fact.
- Dip / azimuth validation. Because the parameters are first-class outputs, they can be validated directly against expert tadpole picks at a chosen depth tolerance — there is a clean number to check.
- Post-processing-free. No anchor grid, no IoU threshold sweep, no NMS, no curve fitting. The one-to-one matching used in training teaches the model to claim each fracture with exactly one query, so there is nothing to suppress and nothing to fit afterward.
A mask-based baseline can be made to approximate rows 1, 2, and 5 with enough engineering — a more aggressive proposal head here, a curve-fit module there — but every one of those is an added stage with its own thresholds, and each stage is a place the dip can go wrong without the training loss ever noticing. The Detection Transformer supplies all seven from a single end-to-end objective. That is the substance of the comparison, and no segmentation mIoU can adjudicate it because mIoU does not measure any of these seven things.
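The intersecting-fractures row can be illustrated with a toy example. The sketch below is a simplified greedy box NMS, not the exact suppression used in any particular Mask R-CNN implementation. It encloses each full-circle sinusoid in an axis-aligned box and assumes a hypothetical borehole radius of 0.108 m and an IoU threshold of 0.5. The scores and dips are made up.

```python
import math

def sinusoid_box(depth_m, dip_deg, radius_m=0.108):
    """Box (theta_min, z_min, theta_max, z_max) enclosing a full-circle trace.

    Azimuth does not appear: it shifts the sinusoid sideways but the trace
    still spans the whole image width, so the box is unchanged.
    """
    amp = radius_m * math.tan(math.radians(dip_deg))
    return (0.0, depth_m - amp, 360.0, depth_m + amp)

def box_iou(p, q):
    """Intersection over union of two axis-aligned boxes."""
    iw = max(0.0, min(p[2], q[2]) - max(p[0], q[0]))
    ih = max(0.0, min(p[3], q[3]) - max(p[1], q[1]))
    inter = iw * ih
    union = (p[2] - p[0]) * (p[3] - p[1]) + (q[2] - q[0]) * (q[3] - q[1]) - inter
    return inter / union if union > 0 else 0.0

def greedy_nms(dets, iou_thresh=0.5):
    """dets: list of (score, box). Keep the best box, drop any that overlap a kept one."""
    kept = []
    for score, box in sorted(dets, key=lambda d: d[0], reverse=True):
        if all(box_iou(box, k[1]) <= iou_thresh for k in kept):
            kept.append((score, box))
    return kept

# Two conjugate fractures crossing at the same depth, dipping in opposite directions.
box_a = sinusoid_box(1500.0, 60.0)   # e.g. dipping toward azimuth 45
box_b = sinusoid_box(1500.0, 55.0)   # e.g. dipping toward azimuth 225
overlap = box_iou(box_a, box_b)
kept = greedy_nms([(0.92, box_a), (0.88, box_b)])
print(overlap > 0.5, len(kept))
```

Because azimuth does not enter the box, two fractures that cross at the same depth produce nearly identical boxes, and the lower-scoring one is suppressed even though both detections are correct.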
Why "post-processing-free" is the load-bearing row
Of the seven, the one that does the most quiet work is the last. Classical detection — and Mask R-CNN inherits this — is a pipeline of heuristics wrapped around a learned core: anchor generation, objectness thresholding, non-maximum suppression, and, for our problem, a final mask-to-sinusoid fit. Each heuristic has parameters that are tuned on a validation set and silently assumed to transfer to new wells. In a fractured, vuggy carbonate where image-log coverage ranged from 45% to 85% across the 14 vertical wells in the study, that assumption is fragile: the anchor scale that works in a densely fractured interval over-detects in a sparse one, and the NMS radius that cleans up duplicates in one well deletes real crossings in another.
The DETR formulation collapses that pipeline into the loss. Assignment is performed by Hungarian bipartite matching during training, so there is no inference-time suppression to tune; the model outputs its final set of fractures directly. The engineering consequence is that there is one model and one objective to maintain, not a model plus a drawer of thresholds — which is also why the same network generalised across wells with very different fracture densities without per-well retuning. For a production team, "post-processing-free" is not an elegance argument. It is a maintenance-surface argument: every heuristic you remove is a configuration that can no longer drift out of calibration between wells.
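A minimal sketch of the one-to-one assignment idea follows. It assumes each query predicts (depth in metres, dip, azimuth in degrees) and uses hypothetical cost weights. For readability it searches all assignments by brute force instead of running the Hungarian algorithm, which is workable only for a handful of queries. A full DETR matching cost also includes a class-probability term, which is omitted here.

```python
from itertools import permutations

def pair_cost(pred, truth, w_depth=1.0, w_dip=0.01, w_az=0.01):
    """Weighted L1 cost between one prediction and one label (weights are assumptions)."""
    d_depth = abs(pred[0] - truth[0])
    d_dip = abs(pred[1] - truth[1])
    d_az = abs(pred[2] - truth[2]) % 360.0
    d_az = min(d_az, 360.0 - d_az)        # azimuth wraps around at 360 degrees
    return w_depth * d_depth + w_dip * d_dip + w_az * d_az

def match_queries(preds, truths):
    """Return {query_index: truth_index} minimising total cost.

    Brute-force stand-in for Hungarian matching; needs len(preds) >= len(truths).
    """
    best_cost, best = float("inf"), ()
    for perm in permutations(range(len(preds)), len(truths)):
        cost = sum(pair_cost(preds[q], truths[t]) for t, q in enumerate(perm))
        if cost < best_cost:
            best_cost, best = cost, perm
    return {q: t for t, q in enumerate(best)}

# Four queries, two labelled fractures crossing at the same depth (made-up values).
preds = [(1500.02, 58.0, 47.0), (1500.01, 61.0, 223.0),
         (1502.50, 10.0, 90.0), (1490.00, 30.0, 300.0)]
truths = [(1500.00, 60.0, 45.0), (1500.00, 60.0, 225.0)]
assignment = match_queries(preds, truths)
unmatched = [q for q in range(len(preds)) if q not in assignment]
print(assignment, unmatched)
```

Each crossing fracture keeps its own query, and the unmatched queries are the ones a DETR-style loss would train toward the "no object" class. At realistic query counts one would swap the brute-force search for a proper solver such as `scipy.optimize.linear_sum_assignment`.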
Does the transformer give up accuracy for these capabilities?
The reasonable objection is that a more flexible architecture might trade away raw accuracy. It did not. On the held-out evaluation the end-to-end model reached roughly 90% recall within a 10 cm depth offset. That figure is measured in the physical units an interpreter trusts: true positives were scored by depth thresholding at 3 cm, 6 cm, and 9 cm bands rather than by IoU, and dip and azimuth accuracy were then checked only on the matched picks. The capabilities were not bought with an accuracy penalty; the set-prediction objective recovered the geological parameters and the recall at once.
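Depth-thresholded scoring can be sketched in a few lines. The matcher below is an illustration, not the evaluation code from the study: it pairs each expert pick with the nearest unused prediction inside the tolerance, a greedy rule that is not guaranteed to be optimal when picks are densely spaced. All depths are made up.

```python
def match_by_depth(pred_depths, true_depths, tol_m):
    """Greedily pair each expert pick with the nearest unused prediction within tol_m."""
    unused = sorted(pred_depths)
    matches = []
    for t in sorted(true_depths):
        candidates = [p for p in unused if abs(p - t) <= tol_m]
        if candidates:
            best = min(candidates, key=lambda p: abs(p - t))
            unused.remove(best)           # one prediction can satisfy only one pick
            matches.append((t, best))
    return matches

def recall_at(pred_depths, true_depths, tol_m):
    """Fraction of expert picks matched by a prediction within tol_m metres."""
    if not true_depths:
        return 0.0
    return len(match_by_depth(pred_depths, true_depths, tol_m)) / len(true_depths)

# Hypothetical expert picks and model predictions, in metres.
expert = [1500.00, 1500.50, 1501.20, 1502.00]
model = [1500.02, 1500.55, 1501.28, 1503.00]
recalls = {cm: recall_at(model, expert, cm / 100.0) for cm in (3, 6, 9)}
print(recalls)
```

Dip and azimuth errors would then be computed only over the pairs returned by `match_by_depth`, so a missed fracture lowers recall without distorting the orientation statistics.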
Two engineering choices made that possible on a dataset this size, and both cut against intuition. First, backbone capacity should be small. A from-scratch ResNet-10 was the best feature extractor in our ablations, posting a classification error of 0.499 against 26.76 for ResNet-34; the heavier backbone simply overfit before the set-prediction objective could converge. Second, the set-prediction objective needs a minimum amount of data rather than a bigger network: it needed enough geology to learn the matching, and once it had that, a deliberately lean backbone generalised better than a deep one. This is the opposite of the "bigger backbone, higher mIoU" reflex that segmentation leaderboards encourage, and it is a direct consequence of optimising for the set of geological objects rather than for pixel overlap.
The lesson for practitioners
The mistake the leaderboard invites is to pick the architecture with the best headline score and then spend six months bolting on the capabilities it lacks — a crossing-fracture exception here, a dip-recovery module there — until you have rebuilt, badly, the thing set prediction gives you for free. Across our work with subsurface operators in the Middle East and the United States, the pattern that holds is the same one this comparison makes explicit: choose the architecture whose native output matches the structure of the feature. Fractures are an unordered, variable-count set of parameterised sinusoids that genuinely overlap. A mask-free, post-processing-free Detection Transformer outputs exactly that. Mask R-CNN outputs masks, and masks are the wrong currency — however high the mIoU on the mask happens to be.
Key takeaways
- Borehole feature detection should be compared on a capability matrix, not a single accuracy number. Mask-based baselines post strong segmentation scores (Mask R-CNN mIoU 81.2%; Mask R-CNN precision 96%/recall 92%; Swin-based network mIoU 0.689) yet still cannot natively deliver the capabilities a fracture interpreter needs.
- A fracture is a parameterised sinusoid (depth, dip, azimuth), not a region of pixels. A segmentation mask must be curve-fit afterward to recover dip/azimuth — a post-processing stage with its own thresholds that the training loss never sees. The headline mIoU is computed before that step runs.
- Seven capabilities decide it: intersecting fractures, multiple fractures, bedding planes, mask-free operation, dip/azimuth retrieval, dip/azimuth validation, and post-processing-free inference. A mask-free DETR supplies all seven from one end-to-end objective; mask-based baselines miss most by construction (NMS deletes the crossings it is asked to find).
- Post-processing-free is the load-bearing row: removing anchors, NMS, and mask-to-curve fitting shrinks the maintenance surface and lets one model generalise across 14 wells of very different fracture density (image-log coverage 45%-85%) without per-well threshold retuning.
- The capabilities came with no accuracy penalty: ~90% recall within a 10 cm offset, with true positives depth-thresholded at 3/6/9 cm rather than by IoU. A lean from-scratch ResNet-10 backbone (0.499 classification error) beat ResNet-34 (26.76), the opposite of the 'bigger backbone, higher mIoU' reflex segmentation leaderboards reward.
References
[1] Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., and Zagoruyko, S. End-to-End Object Detection with Transformers (DETR). ECCV (2020). The set-prediction formulation with Hungarian bipartite matching that this comparison builds on. https://arxiv.org/abs/2005.12872
[2] He, K., Gkioxari, G., Dollár, P., and Girshick, R. Mask R-CNN. ICCV (2017). The instance-segmentation baseline architecture the borehole literature adapts. https://arxiv.org/abs/1703.06870