

Consistent, Auditable Interpretation: Why AI Picks Beat Inter-Interpreter Variance

Two senior geologists rarely pick the same fractures and vugs on the same image log. We argue that the real prize of a computer-vision interpretation pipeline is not raw speed but determinism — the same input yields the same picks, every parameter is logged, and the model even surfaces sinusoids and vugs the expert ground truth missed. That turns interpretation from an interpreter-dependent art into a reproducible, auditable process, which is a governance win as much as a throughput one.

By Tannistha Maiti · 10 min read
EarthScan insight

Hand the same two-metre patch of a high-resolution borehole image log to two experienced geologists and you will get back two different interpretations. Not wildly different — both will catch the obvious open fractures — but the marginal picks diverge: a faint conductive sinusoid one analyst calls a fracture and the other calls noise, a vug that one outlines and the other folds into a bedding plane, a dip that lands a degree or two apart because the eye fit the sine wave slightly differently. Run the same patch past the same geologist on a different afternoon and the answer shifts again. This is inter-interpreter variance, and on a fractured carbonate play it is not a rounding error — it propagates straight into fracture-density maps, net-pay calls, and well-to-well correlation. The usual framing of machine learning in subsurface work is that it makes interpretation faster. In a roughly twenty-month engagement with a mid-sized Middle East carbonate operator we partnered with, the more durable result was that it made interpretation consistent and auditable — and that, for a chief geologist or a QA/QC lead, is the larger prize.

Variance is the hidden tax on manual interpretation

A manual pick is a function of the image and the interpreter: their training, their fatigue, the threshold they happen to be carrying in their head that day. None of those variables is recorded anywhere. When a fracture-density log disagrees with the offset well, you cannot diff the two interpretations to find out why, because the decision process lived in two people's heads and was never written down. You re-pick, you argue, you average, and you move on — and the variance quietly survives into every downstream product.

The cost compounds with scale. The study scope here ran to 80-plus processed and interpreted image logs, with correlation envisaged across a 2-to-80-well range. At that breadth, "ask a senior geologist to re-pick it" is not a control — it is the bottleneck and the source of variance at once. What a QA/QC function actually needs is a pick that is reproducible by construction: same input, same output, with every threshold that produced it written to disk.

Determinism is an engineering property, not a wish

That reproducibility is something you build, and it is worth being precise about how. The interpretation pipeline we deployed has two halves, and both are deterministic by design.

The fracture and bedding picker is a Detection Transformer (a DETR-style set-prediction model) with a ResNet-10 backbone trained from scratch. It ingests a normalised dynamic image-log patch and emits a fixed set of candidate sinusoids, each carrying a regressed depth, dip, and azimuth. Crucially, once the weights are frozen, the forward pass is a pure function: the same patch produces the same picks on every run, bit-for-bit as long as the hardware and software stack are pinned along with the weights. There is no analyst in the loop to drift. The vug pipeline is a classical computer-vision chain: top-k mode subtraction, Gaussian-modulated adaptive thresholding, Suzuki–Abe contour extraction, then area-and-circularity refinement with centroid-plus-IoU aggregation and Laplacian-variance filtering. Every one of those stages is governed by an explicit, logged constant rather than a judgement call.
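
To make the regressed depth, dip, and azimuth concrete, here is a minimal sketch of the standard geometry that links a planar feature to its sinusoidal trace on an unrolled borehole image. It assumes a vertical, cylindrical borehole, depth positive downward, and dip measured from the plane normal to the borehole axis. The function names and the 0.2 m diameter are illustrative and are not taken from the deployed model.

```python
import math


def sinusoid_depth(theta_deg, centre_depth_m, dip_deg, dip_azimuth_deg, diameter_m):
    """Depth (m, positive down) of a planar feature's trace at wall azimuth theta."""
    radius = diameter_m / 2.0
    amplitude = radius * math.tan(math.radians(dip_deg))
    # The trace is deepest where the wall azimuth equals the dip azimuth.
    return centre_depth_m + amplitude * math.cos(math.radians(theta_deg - dip_azimuth_deg))


def dip_from_amplitude(amplitude_m, diameter_m):
    """Invert the relation: amplitude is half the peak-to-trough height."""
    return math.degrees(math.atan2(2.0 * amplitude_m, diameter_m))


# Illustrative numbers only: a 30 degree feature in a 0.2 m diameter hole.
DIAMETER_M = 0.2
amplitude_30 = (DIAMETER_M / 2.0) * math.tan(math.radians(30.0))

# Sensitivity: refit the same feature with the amplitude 5 mm larger.
dip_shift_deg = dip_from_amplitude(amplitude_30 + 0.005, DIAMETER_M) - 30.0
```

Under these assumptions `dip_shift_deg` comes out at a couple of degrees, which is one way to see why two hand-fitted sine waves on the same feature can yield different dips.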

Figure: Interpretation load per survey versus analyst ceiling, by acquisition era (early 2000s, late 2000s, 2010s, today; offsets growing from under 4.5 km to 18–24 km). Load rises from 40-fold, which is feasible by hand, to 4,000-fold today, about 33× an editorial analyst ceiling of 120; the gap above the ceiling is labelled the automation imperative. The 40-fold and 4,000-fold anchors come from a 2024 Viridien interview, intermediate values are log-linear interpolation, and the ceiling is an editorial throughput proxy.

The MLOps consequence is the part QA leads should care about most. Because every pick traces to a recorded parameter set, an interpretation is no longer an opinion — it is an artefact you can version, diff, and re-run. Change a threshold, regenerate, and git-diff the two pick sets; the provenance is total. That is the difference between "the model said so" and an auditable pipeline, and it is the property that lets you put a number on agreement and defend it.
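
A minimal sketch of what "version, diff, and re-run" can look like in practice, assuming picks are stored as (depth, dip, azimuth) tuples and parameters as a flat dictionary. The helper names, parameter keys, and pick values are hypothetical; this is not the pipeline's actual storage format.

```python
import hashlib
import json


def fingerprint(params):
    """Return a stable SHA-256 hex digest of a parameter set."""
    # Sorting keys makes the digest independent of dictionary ordering.
    canonical = json.dumps(params, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def diff_picks(old, new):
    """Compare two pick sets; each pick is a (depth_m, dip_deg, azimuth_deg) tuple."""
    old_set, new_set = set(old), set(new)
    return {
        "removed": sorted(old_set - new_set),
        "added": sorted(new_set - old_set),
        "unchanged": len(old_set & new_set),
    }


# Hypothetical runs: the same log interpreted under two threshold settings.
params_v1 = {"adaptive_block_px": 31, "circularity_min": 0.3}
params_v2 = {"adaptive_block_px": 31, "circularity_min": 0.4}
picks_v1 = [(1502.10, 34.0, 120.0), (1503.45, 12.5, 95.0)]
picks_v2 = [(1502.10, 34.0, 120.0)]

report = {
    "from": fingerprint(params_v1)[:12],
    "to": fingerprint(params_v2)[:12],
    "changes": diff_picks(picks_v1, picks_v2),
}
```

Exact tuple equality is only a sensible diff key because the pipeline is deterministic; for manual picks the same comparison would need a tolerance.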

The evidence: one parameter set, three very different wells

The strongest test of consistency is whether a single configuration holds across geology it has never seen. We validated the vug pipeline on three wells chosen to be deliberately diverse: a vertical well, a horizontal well logged with a compact microresistivity tool roughly 10 km away, and a third vertical well about 12 km from the first. Different tools, different inclinations, different parts of the field.

One global parameter set carried all three. The structural constants — top-k modes at 5, an IoU aggregation threshold of 20%, a default 31-pixel adaptive-threshold block, a circularity gate of 0.3–1.0 — were held fixed across every well. Only two per-well knobs, the Laplacian-variance and mean-intensity filter thresholds, were re-tuned to the local image character. That is the opposite of what manual interpretation gives you: instead of an unwritten threshold per interpreter per shift, you have a documented global recipe plus two logged numbers per well. Reproducibility is the default, and the small amount of tuning that remains is itself recorded.
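
One way to encode "a documented global recipe plus two logged numbers per well" is sketched below. The five global values are the ones quoted above; the class names, field names, well identifiers, and per-well threshold values are placeholders invented for illustration.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class GlobalVugParams:
    """Structural constants held fixed across every well (values from the article)."""
    top_k_modes: int = 5
    iou_aggregation_threshold: float = 0.20
    adaptive_block_px: int = 31
    circularity_min: float = 0.3
    circularity_max: float = 1.0


@dataclass(frozen=True)
class WellTuning:
    """The two per-well knobs; the numbers used below are placeholders."""
    well_id: str
    laplacian_variance_threshold: float
    mean_intensity_threshold: float


def run_record(global_params, tuning):
    """Flatten one run's full configuration into a loggable dictionary."""
    record = asdict(global_params)
    record.update(asdict(tuning))
    return record


GLOBAL = GlobalVugParams()
# Hypothetical well names and threshold values, for illustration only.
records = [
    run_record(GLOBAL, WellTuning("well_A_vertical", 100.0, 0.50)),
    run_record(GLOBAL, WellTuning("well_B_horizontal", 80.0, 0.45)),
]
```

Freezing the dataclasses means a global constant cannot be changed silently for one well; any change has to be a new, visible parameter object.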

Figure: Carbonate vug detection, circularity filter. Contours are plotted by circularity (0.28 elongated to 0.85 round) and vug area (1–12 cm²): rejected contours in grey, catalogued vugs in teal, expert-missed vugs in orange (3 of 3 in the catalog).
The pipeline's single geometric decision, illustrated. Each contour the detector traces sits on a circularity axis running from 0.28 (elongated, dissolution-aligned) to 0.85 (near-circular). A circularity gate of 0.3–1.0 rejects linear fractures and wellbore-parallel artifacts while keeping true vugs: contours below the gate fall out as rejected (grey), contours above it enter the vug catalog (teal), and the three orange contours are vugs that meet every geometric criterion the expert applied yet were missed by manual picking across two of the validation wells. The gate bounds, circularity span, 1–12 cm² area range, and three recovered intervals are the article's own; each contour's exact coordinate is schematic (a plausible population, not a published catalog).
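
The gate itself is a few lines of arithmetic. The sketch below assumes the common definition of circularity, 4πA/P² (1.0 for a perfect circle, approaching 0 for a thin sliver); the article does not state which formula the pipeline uses, and the contours here are made-up polygons rather than real detections.

```python
import math


def polygon_area_perimeter(points):
    """Shoelace area and perimeter of a closed polygon given as (x, y) vertices."""
    twice_area = 0.0
    perimeter = 0.0
    n = len(points)
    for i in range(n):
        x0, y0 = points[i]
        x1, y1 = points[(i + 1) % n]  # wrap around to close the contour
        twice_area += x0 * y1 - x1 * y0
        perimeter += math.hypot(x1 - x0, y1 - y0)
    return abs(twice_area) / 2.0, perimeter


def circularity(points):
    """Isoperimetric quotient 4*pi*A / P**2."""
    area, perimeter = polygon_area_perimeter(points)
    return 4.0 * math.pi * area / (perimeter * perimeter) if perimeter > 0 else 0.0


def passes_gate(points, lower=0.3, upper=1.0):
    """Apply the 0.3 to 1.0 circularity gate quoted in the article."""
    return lower <= circularity(points) <= upper


def regular_polygon(n, radius=1.0):
    """Vertices of a regular n-gon, used here as a stand-in for a round vug."""
    return [(radius * math.cos(2 * math.pi * i / n),
             radius * math.sin(2 * math.pi * i / n)) for i in range(n)]


square = [(0, 0), (1, 0), (1, 1), (0, 1)]       # compact shape
sliver = [(0, 0), (10, 0), (10, 0.2), (0, 0.2)]  # long, thin, fracture-like shape
round_vug = regular_polygon(64)
```

With these shapes the square and the 64-gon pass the gate and the sliver is rejected, which is the behaviour the caption describes for fractures and wellbore-parallel artifacts.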

On the fracture side the same story holds geometrically. The bedding-and-fracture model produced tightly clustered picks: on the validation set, roughly 87% of true-positive picks landed within 2 degrees of the reference dip and about 93% within 20 degrees of azimuth, with fracture sensitivity around 85% within an 8 cm depth offset. That performance repeated on a held-out 12 m continuous blind zone the model had never trained on. A human re-picking the same interval would not reliably reproduce their own dips to within 2 degrees, let alone match a colleague's. The point is not that the model is superhuman on any single pick; it is that its picks are the same every time, which is exactly what a control function needs.
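
For readers who want to reproduce this style of metric on their own picks, a minimal sketch follows. It assumes predicted and reference picks have already been paired as true positives; the pairs below are invented, and the only subtlety is that azimuth differences must wrap around 360 degrees.

```python
def dip_difference_deg(predicted, reference):
    """Absolute dip difference in degrees."""
    return abs(predicted - reference)


def azimuth_difference_deg(predicted, reference):
    """Smallest absolute angle between two azimuths, in the range 0 to 180."""
    d = abs(predicted - reference) % 360.0
    return min(d, 360.0 - d)


def fraction_within(pairs, tolerance, difference):
    """Share of (predicted, reference) pairs whose difference is within tolerance."""
    if not pairs:
        raise ValueError("no matched pairs to score")
    hits = sum(1 for predicted, reference in pairs
               if difference(predicted, reference) <= tolerance)
    return hits / len(pairs)


# Invented matched pairs, for illustration only.
dip_pairs = [(30.5, 31.0), (12.0, 15.5), (48.0, 47.2), (60.0, 60.9)]
azimuth_pairs = [(350.0, 5.0), (120.0, 150.0), (10.0, 355.0), (200.0, 210.0)]

dip_score = fraction_within(dip_pairs, 2.0, dip_difference_deg)
azimuth_score = fraction_within(azimuth_pairs, 20.0, azimuth_difference_deg)
```

Without the wraparound, a pick at 350 degrees and a reference at 5 degrees would be scored as 345 degrees apart rather than 15.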

It also catches what the expert missed

Consistency would be a hollow virtue if the deterministic pipeline were merely a faithful copy of the human's blind spots. It is not. Against the expert's interpretation-software ground truth, the vug pipeline matched or exceeded the manual interpretation. In several intervals it flagged vugs the expert ground truth had missed entirely, surfacing dissolution porosity at depths the manual pass had skipped, while excluding bedding and fracture features that can masquerade as vugs. Overall agreement with the expert sat at roughly 85%, with a per-interval vug-area mean absolute error of about 1.21 cm² against the expert pick: close enough to trust, divergent in exactly the places worth a second look.
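
The per-interval error quoted here is a plain mean absolute error. A minimal sketch, assuming each interval carries one model area and one expert area in cm²; the interval depths and areas are invented and do not reproduce the 1.21 cm² figure.

```python
def mean_absolute_error(intervals):
    """Mean of |model area - expert area| over (depth, model, expert) intervals."""
    if not intervals:
        raise ValueError("no intervals to score")
    return sum(abs(model - expert) for _, model, expert in intervals) / len(intervals)


def expert_missed(intervals):
    """Depths where the model reports vug area and the expert reports none."""
    return [depth for depth, model, expert in intervals if model > 0 and expert == 0]


# Invented (top_depth_m, model_area_cm2, expert_area_cm2) triples.
intervals = [
    (2100.0, 2.0, 3.0),
    (2100.5, 5.5, 5.0),
    (2101.0, 1.5, 0.0),  # the model sees a vug the expert did not pick
]

mae_cm2 = mean_absolute_error(intervals)
missed_depths = expert_missed(intervals)
```

Note that an interval the expert skipped contributes to the error in the same way as a genuine disagreement about size, so it is worth reporting the two separately.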

The fracture model showed the same behaviour from the other direction. In a highly fractured, complex interval where overlapping sinusoids defeat the eye, the path-opening front end recovered three distinct sinusoids in a section a manual pass would compress into an indistinct smear. This is the quiet inversion at the heart of the argument: the model is not just a faster human, it is a different and complementary observer. When the deterministic pipeline disagrees with the ground truth, that disagreement is itself a high-value QA signal — a short, ranked list of intervals where the expert and the machine see the rock differently, which is precisely where a senior geologist's scarce attention should go.
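
A ranked disagreement list can be built with very little machinery. The sketch below assumes picks are matched greedily by depth within the 8 cm offset used for the sensitivity figure, and ranks one-metre intervals by how many unmatched picks they contain. The matching rule, the ranking rule, and the depths are assumptions for illustration, not the deployed logic.

```python
import math
from collections import Counter


def match_by_depth(model_depths, expert_depths, tolerance_m=0.08):
    """Greedy one-to-one matching of picks by depth, shallowest model pick first."""
    unmatched_expert = sorted(expert_depths)
    matched, model_only = [], []
    for depth in sorted(model_depths):
        nearest = min(unmatched_expert, key=lambda e: abs(e - depth), default=None)
        if nearest is not None and abs(nearest - depth) <= tolerance_m:
            matched.append((depth, nearest))
            unmatched_expert.remove(nearest)  # each expert pick is used once
        else:
            model_only.append(depth)
    return matched, model_only, unmatched_expert


def ranked_worklist(model_only, expert_only, interval_m=1.0):
    """Rank depth intervals by disagreement count, then by depth."""
    counts = Counter(
        math.floor(depth / interval_m) * interval_m
        for depth in model_only + expert_only
    )
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


# Invented pick depths in metres.
model_picks = [100.00, 100.50, 101.20]
expert_picks = [100.03, 101.60]

matched, model_only, expert_only = match_by_depth(model_picks, expert_picks)
worklist = ranked_worklist(model_only, expert_only)
```

Here the interval starting at 101 m heads the list because it holds both a model-only and an expert-only pick, which is the kind of place a senior geologist would want to look first.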

This reframes the model's output. It is not a replacement for the interpreter; it is an auditor of the interpretation — a second, tireless, perfectly consistent observer whose every disagreement is logged and reviewable.

Why this is a governance win, not just a speed win

The throughput numbers are real and they matter: the well-to-well system delivered interpretation roughly 5x faster at about 30 seconds per metre, with the operator's own assessment putting the productivity gain near 60% and interpretation accuracy up around 75%, against a target picking precision of 95%. We have worked these problems across operators in the Middle East and the United States, and the speed figures are always the first thing a sponsor asks about.

But speed is the line item that wins the pilot; consistency is the line item that survives the audit. Three properties make the difference for a subsurface manager. First, reproducibility: the same log re-run next quarter yields the same picks, so a fracture-density map is a stable artefact rather than a snapshot of who was on shift. Second, traceability: every pick carries its parameters, so a disputed interpretation can be reconstructed and explained rather than re-litigated. Third, defensibility: when a reserves estimate or a completion decision rests on these picks, "here is the exact, versioned configuration that produced them" is an answer that holds up in a technical review in a way that "our senior geologist's judgement" never quite does.

There is a determinism caveat worth stating plainly, because it is also a governance control. A set-prediction model is only as reliable as its training data is representative. The well-count ablation makes this vivid: trained on 3 wells, the fracture model's classification error was a useless 93.1%; at 9 wells it fell to 1.06%, and across the full 14-well dataset it settled near 2.54%. A model starved of geology will be consistently wrong: reproducibility is not the same as correctness. The governance implication is that the training corpus, the augmentation regime, and the validation wells belong under the same version control as the picks themselves. Auditability has to run all the way back to the data, or it is theatre.

The takeaway for QA/QC and chief geologists

Stop asking only "how much faster?" and start asking "how reproducible, how traceable, how defensible?" A deterministic computer-vision pipeline turns interpretation from an interpreter-dependent art into a versioned, diffable, auditable process — one that holds a single documented recipe across vertical and horizontal wells and across two different imaging tools, agrees with the expert about 85% of the time, and earns its keep precisely in the 15% where it disagrees, surfacing fractures and vugs the manual pass missed. The speed is the headline. The consistency is the institution you can actually build on.

Key takeaways

  1. Inter-interpreter variance is an unrecorded tax on manual interpretation: the same patch yields different picks across analysts and across days, and that variance propagates into fracture-density maps, net-pay, and well-to-well correlation with no audit trail.
  2. Determinism is an engineering property you build. A frozen DETR-style fracture model and a constant-driven classical-CV vug pipeline are pure functions of the input, so every pick traces to a logged parameter set — versionable and diffable, unlike a judgement call in someone's head.
  3. One global parameter set generalised across three deliberately diverse wells — a vertical, a horizontal logged with a compact microresistivity tool ~10 km away, and a vertical ~12 km away — with only two per-well filter thresholds re-tuned and recorded. Fracture geometry repeated tightly: ~87% of picks within 2° dip, ~93% within 20° azimuth, ~85% sensitivity within 8 cm, confirmed on a 12 m continuous blind zone.
  4. The pipeline catches what the expert missed: it flagged vugs absent from the manual ground truth and recovered three sinusoids in a complex fractured zone the eye blurs together — ~85% agreement and ~1.21 cm² vug-area MAE vs expert, with the disagreements forming a ranked QA worklist rather than noise.
  5. Speed wins the pilot (~5x faster, ~30 s/m, ~60% productivity, ~75% accuracy uplift); consistency wins the audit. Reproducibility, traceability, and defensibility are the durable prizes — provided the training corpus is versioned alongside the picks, since a data-starved model is consistently wrong (93.1% class error at 3 wells vs 2.54% at 14).

© 2026 Copyright. Earthscan