Why We Refused to Benchmark Our Model Against YOLO and Mask R-CNN (and Were Right To)

A reviewer asked us to run a head-to-head accuracy comparison of our end-to-end fracture transformer against YOLO and Mask R-CNN. We declined, and the rebuttal we wrote turned on a distinction worth generalising: mask detectors segment fractures and then need a human to pick dip and azimuth, while our model regresses depth, dip, and azimuth directly. The two have no shared output to score, so the requested benchmark would have measured nothing. This is a short field guide to telling a meaningful benchmark from a category error.

By Narendra Patwardhan and Tannistha Maiti · 8 min read
EarthScan insight

A peer reviewer, reading our fracture-detection manuscript, made a request that sounds unarguable: run a head-to-head accuracy comparison against YOLO and Mask R-CNN so readers can see how the new model stacks up against the standard detectors. We wrote back and declined. Not because we feared the numbers, and not because the comparison was inconvenient to produce. We declined because the benchmark, as posed, would have measured nothing, and printing a table that appears to compare two methods while actually comparing two different tasks is worse than printing no table at all. This piece is the reasoning we sent back, generalised past our own well data, because the same trap catches a lot of applied-ML papers.

What the reviewer was really asking for

The instinct behind the request is sound in most settings. If you claim a new detector is good, show it beside the detectors everyone already trusts, on the same data, scored the same way. That is how the object-detection literature works, and how it should work when the models in the table produce the same kind of thing.

The problem is that our model and the mask detectors do not. GeoBFDT takes an unrolled borehole-image patch and regresses three numbers per feature: depth, dip, and azimuth. It is trained from scratch, end to end, with no non-maximum suppression and no masks anywhere in the pipeline. A mask detector, whether Mask R-CNN or a segmentation network, outputs a region of pixels it believes belong to the fracture. Those are different targets. One is a geometry; the other is a pixel set. A benchmark can only compare two methods where they emit the same quantity, and on the quantity we care about, only one of the two emits anything at all.

Segmentation is step one of four, not the whole job

The argument lives in the workflow, not in a slogan. To get dip and azimuth out of a mask detector on borehole images, you run four steps. First, the network segments the fracture into a mask. Second, you post-process that mask, with the thresholds and cleanup that any segmentation output needs. Third, a human picks the dip from the segmented trace. Fourth, a human picks the azimuth. Steps three and four are not automation; they are a person with interpretation software doing exactly the manual labour the model was supposed to remove. Creating the ground-truth masks to train such a detector in the first place is itself prohibitively time-consuming on this kind of data.

GeoBFDT runs three steps and none of them is manual: patch in, transformer, depth/dip/azimuth out. The dip and azimuth are supervised directly against the labelled values, so they arrive as part of the prediction rather than being recovered afterward by someone with a mouse. When a reviewer asks us to benchmark accuracy on dip and azimuth against a mask detector, they are asking us to score a number the mask detector never computes. The mask detector's dip and azimuth come from the human in steps three and four. Benchmarking our dip against a human's dip, and calling the human a baseline model, is not a fair fight; it is a mislabelled one.

[Interactive figure: "A benchmark compares only where both methods emit the same output." Two step tracks are shown side by side: the four-step mask-based pipeline (1. segment mask; 2. post-process mask; 3. manual dip picking; 4. manual azimuth picking) and the three-step, end-to-end GeoBFDT (1. patch; 2. transformer; 3. regress depth/dip/azimuth). A toggle sets the comparison axis to either segmentation quality or depth/dip/azimuth.]
Why a head-to-head accuracy benchmark of GeoBFDT against mask-based detectors is a category error rather than an evasion. The two workflows are laid out as ordered step tracks: the mask-based pipeline runs four steps (segment mask, post-process mask, then manual dip picking and manual azimuth picking), and its native output is a segmentation only; GeoBFDT runs three steps end to end (patch, transformer, direct depth/dip/azimuth regression) with no NMS and no masks, and its native output is cm-level depth and degree-level dip and azimuth. The toggle sets what the reviewer proposed to compare on. On segmentation quality both methods emit a mask, the bridge closes, and a head-to-head is fair. On depth/dip/azimuth the mask track has no machine output at all, because those numbers arrive from a human afterward: the orange break marks where the comparison loses its common quantity. The named prior work is mask-based, the Swin dual encoder-decoder at DOI 10.30632/PJV64N1-2023a3, while CrackDiffusion at DOI 10.1088/1361-665X/acc624 does not address borehole images, and GeoBFDT never claimed to outperform either. Every step, output, and DOI shown is sourced from the review correspondence; nothing here is illustrative.

The instrument above is the whole argument in one view. Toggle the comparison axis. On segmentation quality, both tracks emit a mask, the bridge closes, and a head-to-head is legitimate. On depth, dip, and azimuth, the mask track simply has no machine output to place on the axis, and the bridge breaks. That break is the category error. It is not that our model wins the comparison; it is that on this axis the comparison has no second competitor.

Why this is a category error, not an evasion

The distinction that matters is between declining a benchmark you would lose and declining a benchmark that does not exist. We were doing the second. A category error, in the ordinary sense, is attributing to something a property its category cannot hold. Asking for the regressed-azimuth accuracy of a segmentation model is like asking for the colour of a sound. You can bolt a curve-fitting stage onto the mask and manufacture an azimuth, but then you are benchmarking the curve-fitter and the human who tuned it, not the detector, and the detector's training loss never saw that stage.
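To make the bolted-on curve fitter concrete, here is a minimal sketch of what such a stage might look like. It assumes the standard geometry in which a planar feature cutting a cylindrical borehole traces a single sinusoid on the unrolled image, with depth measured positive downward, so the amplitude gives the dip and the deepest point gives the dip azimuth. The function names, the small pure-Python solver, and the idea that the trace points come from a thinned mask are illustrative assumptions, not part of GeoBFDT or of any cited method.

```python
import math


def solve3(matrix, rhs):
    """Solve a 3x3 linear system by Gaussian elimination with partial pivoting."""
    aug = [row[:] + [value] for row, value in zip(matrix, rhs)]
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("trace points do not constrain a sinusoid")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, 3):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, 4):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0, 0.0, 0.0]
    for r in reversed(range(3)):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, 3))
        x[r] = (aug[r][3] - tail) / aug[r][r]
    return x


def fit_plane_from_trace(theta_deg, depth_m, borehole_radius_m):
    """Least-squares fit of depth = z0 + a*cos(theta) + b*sin(theta).

    theta_deg: angular position of each trace point around the wall (degrees).
    depth_m:   depth of each trace point, positive downward (metres).
    Returns (centre_depth_m, dip_deg, dip_azimuth_deg).
    """
    rows = [
        (1.0, math.cos(math.radians(t)), math.sin(math.radians(t)))
        for t in theta_deg
    ]
    # Normal equations (A^T A) x = A^T y for the unknowns z0, a, b.
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    aty = [sum(r[i] * y for r, y in zip(rows, depth_m)) for i in range(3)]
    z0, a, b = solve3(ata, aty)

    amplitude = math.hypot(a, b)
    # tan(dip) = sinusoid amplitude / borehole radius.
    dip_deg = math.degrees(math.atan2(amplitude, borehole_radius_m))
    # With depth positive downward, the deepest point lies at the dip azimuth.
    azimuth_deg = math.degrees(math.atan2(b, a)) % 360.0
    return z0, dip_deg, azimuth_deg


# Synthetic trace: 45 degree dip toward 120 degrees in a 0.1 m radius hole.
thetas = [float(t) for t in range(0, 360, 10)]
depths = [1000.0 + 0.1 * math.cos(math.radians(t - 120.0)) for t in thetas]
centre, dip, azimuth = fit_plane_from_trace(thetas, depths, 0.1)
```

Every choice in a stage like this (which mask pixels become trace points, how outliers are rejected, what borehole radius is assumed) shifts the resulting dip and azimuth, and none of it is seen by the detector's training loss. A score computed this way measures the fitter and its tuning, not the detector.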

There is a real capability-and-cost comparison to be made between mask-based picking and end-to-end regression, and we made it in detail in a separate piece, Mask-Free DETR vs Mask R-CNN for Borehole Feature Picking, which lays the two approaches out on a capability matrix rather than a single score. That is the honest form: which capabilities each architecture supplies natively, and what the missing ones cost you downstream. What we refused was the dishonest form, a single accuracy row on an axis only one method can stand on.

We never claimed to have beaten anyone

A quieter point sat under our refusal, and it defused the reviewer's worry once we said it plainly: we never claimed GeoBFDT outperforms prior detectors. The manuscript's contribution is a formulation that removes the mask and the manual picking, not a leaderboard win. Once the claim is "this method regresses the geometry directly and end to end," the relevant evidence is that it does so and that the regressed values hold up against expert picks at a stated depth tolerance, which we reported. A YOLO or Mask R-CNN accuracy column would not support or refute that claim. It would answer a question we were not asking and imply a rivalry we did not assert.
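As a rough illustration of what validation against expert picks at a stated depth tolerance can involve, the sketch below pairs each predicted feature with the nearest unmatched expert pick within a tolerance and then reports dip and azimuth errors on the matched pairs. The greedy matching rule, the `Pick` record, the tolerance value, and the summary fields are all assumptions made for this example; the manuscript's actual protocol is not reproduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pick:
    depth_m: float
    dip_deg: float
    azimuth_deg: float


def azimuth_error_deg(a_deg, b_deg):
    """Smallest angular difference between two azimuths, in [0, 180]."""
    diff = abs(a_deg - b_deg) % 360.0
    return min(diff, 360.0 - diff)


def match_by_depth(predicted, expert, depth_tol_m):
    """Greedy one-to-one matching: nearest unmatched expert pick within tolerance."""
    remaining = list(expert)
    pairs = []
    for pred in sorted(predicted, key=lambda p: p.depth_m):
        if not remaining:
            break
        nearest = min(remaining, key=lambda e: abs(e.depth_m - pred.depth_m))
        if abs(nearest.depth_m - pred.depth_m) <= depth_tol_m:
            pairs.append((pred, nearest))
            remaining.remove(nearest)
    return pairs, remaining


def mean_or_none(values):
    """Mean of a list, or None when there is nothing to average."""
    return sum(values) / len(values) if values else None


def summarise(predicted, expert, depth_tol_m):
    """Counts and mean errors for predictions scored against expert picks."""
    pairs, missed = match_by_depth(predicted, expert, depth_tol_m)
    return {
        "matched": len(pairs),
        "missed_expert_picks": len(missed),
        "unmatched_predictions": len(predicted) - len(pairs),
        "mean_abs_dip_error_deg": mean_or_none(
            [abs(p.dip_deg - e.dip_deg) for p, e in pairs]
        ),
        "mean_azimuth_error_deg": mean_or_none(
            [azimuth_error_deg(p.azimuth_deg, e.azimuth_deg) for p, e in pairs]
        ),
    }


# Made-up values purely to exercise the functions; not results from the paper.
predicted = [Pick(100.02, 30.0, 358.0), Pick(105.0, 60.0, 90.0)]
expert = [Pick(100.0, 32.0, 2.0), Pick(110.0, 50.0, 80.0)]
report = summarise(predicted, expert, depth_tol_m=0.05)
```

Azimuth is compared on the circle, so 358 degrees and 2 degrees differ by 4 degrees rather than 356. Note that both sides of this comparison are the same quantity: a model's regressed geometry and an expert's picked geometry.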

The prior work we cite is itself mask-based

It helped to be concrete about the neighbourhood our method sits in. The closest borehole-image segmentation work we cite, a Swin dual encoder-decoder network, is mask-based by construction, and everything about the manual-picking tail applies to it as much as to Mask R-CNN. Another cited method, CrackDiffusion, is a crack-segmentation approach that does not address borehole images at all. Neither is a drop-in accuracy comparator for a model that regresses dip and azimuth, and pointing to their actual outputs made the category distinction concrete. The literature the reviewer wanted us to benchmark against was, on inspection, the very category we argue our method leaves behind.

A field guide to the distinction

The generalisable lesson is a two-question test you can run before agreeing to any head-to-head benchmark. First: do the two methods emit the same quantity? If method A outputs a mask and method B outputs regressed parameters, they do not, and any single-number comparison is smuggling a conversion step into the table without pricing it. Second: is the "baseline's" output on your axis actually produced by the model, or by a human downstream of it? If a person supplies the numbers you are scoring, the baseline is not a model and the benchmark is theatre.
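The two questions can be written down as a small check. This is only a sketch of the reasoning: the `Method` record, the `Producer` labels, and the way each method's outputs are declared are hypothetical conventions invented for the example, and the entries simply restate the native outputs described above.

```python
from dataclasses import dataclass
from enum import Enum


class Producer(Enum):
    MODEL = "model"
    HUMAN = "human"


@dataclass(frozen=True)
class Method:
    name: str
    outputs: dict  # quantity name -> who actually produces it


def can_benchmark(a, b, quantity):
    """Return (ok, reasons) for a head-to-head on one quantity.

    Question 1: do both methods emit the quantity at all?
    Question 2: is it produced by the model rather than a human downstream?
    """
    reasons = []
    for method in (a, b):
        producer = method.outputs.get(quantity)
        if producer is None:
            reasons.append(f"{method.name}: does not emit {quantity}")
        elif producer is Producer.HUMAN:
            reasons.append(
                f"{method.name}: {quantity} is supplied by a human, not the model"
            )
    return (not reasons, reasons)


# Illustrative declarations based on the workflows described in this piece.
mask_pipeline = Method(
    "mask pipeline",
    {"mask": Producer.MODEL, "dip": Producer.HUMAN, "azimuth": Producer.HUMAN},
)
other_mask_detector = Method("other mask detector", {"mask": Producer.MODEL})
geobfdt = Method(
    "GeoBFDT",
    {"depth": Producer.MODEL, "dip": Producer.MODEL, "azimuth": Producer.MODEL},
)

dip_ok, dip_reasons = can_benchmark(mask_pipeline, geobfdt, "dip")
mask_ok, mask_reasons = can_benchmark(mask_pipeline, other_mask_detector, "mask")
```

Here the dip comparison is rejected because the mask pipeline's dip comes from a person, while two mask detectors compared on their masks pass both questions.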

When both answers are clean, benchmark aggressively; the field is right to demand it. When they are not, the professional move is to decline the comparison, explain the category mismatch, and offer the comparison that is actually meaningful instead, which for us was a capability matrix and a validation against expert picks. Reviewers accept this when you show the mechanism rather than assert the conclusion. Ours did.

Limitations

This argument is about comparison design, not about GeoBFDT being universally preferable. Where the deliverable genuinely is a region of pixels (a vug outline, a breakout patch, a lithofacies band), a mask detector is the right tool and a segmentation benchmark against it is entirely fair; nothing here says otherwise. Our reasoning also rests on the specific mask-based workflow for borehole dip and azimuth, in which the parameters are picked manually from the segmented trace; a pipeline that automated that curve fit end to end would narrow the gap and change what a fair benchmark looks like. Finally, we compared native outputs and workflow steps, not runtimes or annotation budgets, both of which matter to a production choice and neither of which this note quantifies.

References

[1] Swin dual encoder-decoder network for borehole image segmentation. Petrophysics (2023). Cited as a mask-based borehole-image method whose output is a segmentation, subject to the same manual dip and azimuth picking tail. DOI: 10.30632/PJV64N1-2023a3. https://doi.org/10.30632/PJV64N1-2023a3

[2] CrackDiffusion: crack segmentation approach. Smart Materials and Structures (2023). Cited as a segmentation method that does not address borehole images, and therefore not a drop-in accuracy comparator for end-to-end dip and azimuth regression. DOI: 10.1088/1361-665X/acc624. https://doi.org/10.1088/1361-665X/acc624

© 2026 Copyright. Earthscan