Designing for the Geoscientist Who Distrusts the Black Box

The first time we put the curve-segmentation model in front of a working geoscientist, the reaction was not about the metric. It was a question: how do I know which parts of this are wrong? That is the whole design problem, and it is different from the one the model solves. The model turns a scanned well log into a mask that marks where each curve runs down the page. The product has to convince a person who reads logs for a living to stake a decision on that mask, and no headline score does that on its own. A geoscientist adopts a tool when they can see where it fails and fix those places faster than starting from scratch. This note is about how we designed for that person around a model we knew, and told them, was weak.

We should be precise about how weak, because the design only makes sense once the weakness is on the table. The pipeline behind VeerNet, the encoder-decoder EarthScan uses to lift curves off raster logs, peaked at an IoU of 0.51 and an F1 of 0.55 on the curve masks. To keep the numbers honest, everything that follows is read off one run, the binary segmentation model trained with a class-weighted loss, so recall, precision and F1 all describe the same masks rather than getting spliced across regimes. On that run the per-mask picture is lopsided in a way an expert notices immediately: precision on the two curve masks is only about 0.23 and 0.38, against F1 scores of 0.37 and 0.55. A mask that flags far more curve pixels than it gets right is exactly the mask a geoscientist will refuse to trust at face value, and they would be right to.

What kind of wrong the model is

The instinct when a model scores 0.51 is to keep it out of the user's hands until it scores higher. We took the opposite view, because of one property of the errors: their kind, not their quantity. On the same binary run, recall on the curve masks reaches 0.96 and 0.97 while precision sits at roughly 0.23 and 0.38, the values implied by pairing that recall with the run's logged F1 of 0.37 and 0.55. That gap says the model rarely misses a real curve pixel and instead over-marks: most of its mistakes are false positives, pixels it flagged as curve that are not. For a reviewer, that failure mode is the difference between an easy job and an impossible one.
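The derived precision values can be checked by rearranging the F1 definition, F1 = 2PR / (P + R), to solve for P. The sketch below does only that arithmetic; the function name is ours, chosen for illustration, and the inputs are the recall and F1 figures quoted above.

```python
def precision_from_recall_f1(recall: float, f1: float) -> float:
    """Solve F1 = 2PR / (P + R) for precision P, given recall R and F1."""
    denominator = 2.0 * recall - f1
    if denominator <= 0:
        raise ValueError("F1 must be less than 2 * recall")
    return f1 * recall / denominator

# (recall, F1) for the two curve masks, as quoted in the text.
curve_masks = [(0.96, 0.37), (0.97, 0.55)]
derived = [round(precision_from_recall_f1(r, f), 2) for r, f in curve_masks]
print(derived)  # [0.23, 0.38]
```

Both results match the rounded precision figures quoted for the run.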

A low-recall model hides its errors. If it drops real curve pixels, a reviewer has to notice an absence, scan the whole log for the segment quietly left out, and reconstruct it, which is the same labour as digitising by hand and destroys the reason to use the tool. A high-recall, low-precision model does the opposite. It shows you too much, and everything it got wrong is visible on the page as a mark that should not be there. Vetoing a mark you can see is fast; finding a mark that is missing is slow. The model's weakness, stated as recall and precision rather than as a single IoU, is one a human can clear with a cursor.

Put the failure where the eye already is

That property is what let us design the interface as a verification surface instead of a result screen. The dashboard does not present the mask as an answer. It presents it as a claim to be checked, and it puts the checking exactly where the model is weakest: the reviewer works per scan, zooms into the curve traces rather than the background the model already handles, and overrides the false positives directly on the image. We spent the affordance budget on the two curve masks and almost none on the empty page the model already reads without help.

This is the inversion that made the tool trusted. The model is most confident on the class that matters least, the blank page, and it over-marks on the two curves that matter most, so a naive clean-overlay-and-approve interface would look most persuasive exactly where it is least reliable. We designed against that, drawing the eye to the curve traces, making the over-marking legible rather than smoothed over, and giving the reviewer a one-gesture veto at the pixel scale. Transparency about the weakness and override at the weakness are the same design move, and together they convert a 0.51 mask into an output a geoscientist will sign.

Figure: Why an expert geoscientist adopts a model whose peak IoU is only 0.51. Every metric here is read off one run, the binary segmentation model trained with a class-weighted loss, so recall, precision and F1 describe the same masks. Panel A shows the black box unflattered: per-mask precision on the two curves the model exists to trace is only about 0.23 and 0.38, against logged F1 of 0.37 and 0.55 and pipeline peaks of 0.51 IoU and 0.55 F1. Panel B shows the one operating fact that makes the weakness reviewable: recall on the curve masks is 0.96 and 0.97 while precision is 0.23 and 0.38, so the error is dominated by false positives the model over-flagged, not misses a human has to hunt for. The dashed orange gap between the recall and precision bars is the review surface, the false positives a reviewer can strike out. The review-coverage lever sets the share of those flags the expert inspects and overrides; the trusted output climbs from the raw 0.51 toward the recall ceiling near 0.965 as coverage rises, validated over 8 real scans with zoom-and-override. Recall, F1 and the scan count are sourced from the engagement archive; precision is derived within that run by solving the F1 definition (the harmonic mean of precision and recall) for precision, given the logged recall and F1; the trusted-output path between the raw floor and the recall ceiling is illustrative geometry, and its two end anchors are the sourced numbers.

Trust is a workflow property, not a model property

The console above pairs the black box shown unflattered, the low per-mask precision an expert reads as a red flag, with the recall-precision asymmetry from that same binary run that makes the flag survivable: the gap between what the model flags and what it gets right is precisely the false positives a reviewer strikes out. Drag the review-coverage lever and the trusted output climbs off the raw 0.51 toward the recall ceiling, not because the model improved but because the human removed the errors it could not. Reported as a bare model score, 0.51 reads as not ready; reported as the floor of a review process a geoscientist drives to a checkpoint they trust, the same number reads as a starting point, which is what it is.
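One way to see why recall acts as a ceiling: if the output is scored as IoU and the reviewer only ever removes false positives, IoU rises toward recall as more of them are struck, and equals it when none remain. The sketch below is a toy illustration under exactly those assumptions. The pixel counts are made up, not archive numbers, and the uniform strike rate is a simplification, not the review yield we observed (which we did not instrument).

```python
def reviewed_scores(tp: int, fp: int, fn: int, coverage: float) -> dict:
    """Scores after a reviewer strikes a share `coverage` of the false positives.

    Assumption (hypothetical): every strike is correct, so the reviewer removes
    only false positives and never adds or deletes a true curve pixel.
    """
    if not 0.0 <= coverage <= 1.0:
        raise ValueError("coverage must be between 0 and 1")
    remaining_fp = fp * (1.0 - coverage)
    recall = tp / (tp + fn)  # review never touches true positives or misses
    precision = tp / (tp + remaining_fp)
    iou = tp / (tp + remaining_fp + fn)
    return {"recall": recall, "precision": precision, "iou": iou}

# Toy pixel counts for illustration only; not taken from the engagement archive.
TP, FP, FN = 90, 210, 10
for coverage in (0.0, 0.5, 1.0):
    s = reviewed_scores(TP, FP, FN, coverage)
    print(coverage, round(s["precision"], 3), round(s["iou"], 3))
```

At full coverage the false-positive term vanishes and IoU reduces to TP / (TP + FN), which is recall. The human cannot lift the output above what the model found, only clear away what it over-marked.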

What eight scans taught us about the affordances

We validated the workflow over 8 real scans through the dashboard, and the small number is the point. These were real scanned logs, not synthetic ones where we controlled the failure modes, put through the actual zoom-and-override interface by someone checking whether the tool saved time against digitising by hand. What we watched was not the model's score, which we knew, but whether the override gestures landed where the errors were and whether trust survived contact with a log the model had never seen.

The affordances that mattered were unglamorous. Zoom that snapped to the curve traces rather than the whole page, because the reviewer's attention belongs on the ink. Override at the granularity the errors occurred at, so a single over-marked segment could be struck without redrawing the curve. And a per-scan boundary, so trust was granted one log at a time, which is how an expert actually thinks: not this model is good but this scan is now correct. None of those are model improvements. They are the product standing next to the user at exactly the points where the model is weak.
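To make the override affordance concrete, here is a minimal sketch of a veto at segment granularity: the reviewer draws a box around an over-marked patch and only the flags inside it are cleared. The names `veto_region`, `Mask` and `Box` are hypothetical and this is not the dashboard's actual code. It only illustrates the idea that a strike is local and leaves the rest of the curve untouched.

```python
from typing import List, Tuple

Mask = List[List[int]]            # 1 = flagged as curve, 0 = background
Box = Tuple[int, int, int, int]   # (top, left, bottom, right), end-exclusive

def veto_region(mask: Mask, box: Box) -> Mask:
    """Return a copy of `mask` with every flagged pixel inside `box` cleared."""
    top, left, bottom, right = box
    return [
        [0 if (top <= r < bottom and left <= c < right) else value
         for c, value in enumerate(row)]
        for r, row in enumerate(mask)
    ]

# Toy prediction: a real curve in column 1 plus an over-marked blob on the right.
predicted = [
    [0, 1, 0, 0, 0],
    [0, 1, 0, 1, 1],
    [0, 1, 0, 1, 1],
    [0, 1, 0, 0, 0],
]
corrected = veto_region(predicted, (1, 3, 3, 5))
for row in corrected:
    print(row)
```

The function returns a new mask and does not mutate the prediction, so the reviewer's correction and the model's original claim can both be kept for the scan.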

The design principle, stated plainly

The principle we now apply before looking at any headline score is to ask what kind of wrong the model is, not just how wrong. A model that fails by omission and one that fails by over-marking need opposite interfaces, and the same IoU can describe either. If the failure is visible and vetoable, surface it and put the override there; if it is invisible, no interface polish will earn an expert's trust. Expert users do not distrust models because models are imperfect. They distrust interfaces that pretend otherwise. The geoscientist who asked how do I know which parts are wrong wanted a tool that would show them and let them fix it. A 0.51-IoU pipeline can be that tool. It just has to be designed as one.
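The claim that one IoU can describe opposite failure modes is easy to check with toy counts. In the sketch below, the counts are invented for illustration: one mask over-marks and the other omits, yet both score the same IoU.

```python
def scores(tp: int, fp: int, fn: int) -> dict:
    """IoU, precision and recall from true-positive, false-positive and miss counts."""
    return {
        "iou": tp / (tp + fp + fn),
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
    }

# Toy counts, chosen so that both masks have the same IoU.
over_marking = scores(tp=10, fp=10, fn=0)   # finds everything, flags too much
omission = scores(tp=10, fp=0, fn=10)       # flags only real curve, misses half

print(over_marking)  # {'iou': 0.5, 'precision': 0.5, 'recall': 1.0}
print(omission)      # {'iou': 0.5, 'precision': 1.0, 'recall': 0.5}
```

The first mask can be cleaned with vetoes alone; the second requires the reviewer to find and redraw what is missing, which is why the two call for different interfaces.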

Limitations

This is a design account grounded in one engagement. The recall and F1 figures, the pipeline peaks, and the count of 8 validated scans are real archive numbers from the binary segmentation run trained with a class-weighted loss; the per-mask precision values of about 0.23 and 0.38 are derived within that same run by solving the F1 definition for precision from its logged recall and F1, not separately measured, and we quote them rounded. The trusted-output curve the console draws between the raw 0.51 floor and the recall ceiling is illustrative geometry; only its two end anchors are sourced, and the shape between them stands in for a review yield per unit of coverage we did not instrument. Eight scans is enough to see whether the affordances land and not enough to quantify how much time the loop saves at scale, which would need a controlled study we did not run. The argument that high recall makes errors reviewable holds for this task, where false positives are visible marks on an image; it does not transfer where a false positive is silent. And a workflow can earn trust and still be wrong, since a confident reviewer can miss a subtle model error, so human-in-the-loop raises the floor without guaranteeing the ceiling. The modelling choices behind the scores are the subject of the VeerNet whitepaper and are out of scope here.
