Validation Said 51%, the Blind Well Said 4%: The Error Ledger That Kept Us Honest

Every model we shipped in this carbonate borehole-image engagement carried two numbers, not one. The first was the validation score, the one a slide would quote. The second was the score on a well the model had never seen during training, a continuous held-out zone we called the blind set. We kept both in a single spreadsheet, one config per row, validation and blind side by side at several depth thresholds. That table is the most honest artefact the project produced, because the two columns disagree so badly that quoting either one alone would be a lie by omission.

Take the row we tuned hardest: the fracture-only model trained on the full well set with a high-overlap patch split. On validation, F1 at a 2 cm depth match read 50.87%. On the blind zone, the same config at the same threshold read 3.69%. Not a 10% haircut. A collapse by a factor of roughly fourteen. Loosen the match to 4 cm and it is 70.51% against 5.53%; loosen to 6 cm and it is 77.37% against 7.83%. The blind curve climbs with the threshold, as it must, but it never gets within shouting distance of validation. If we had reported the validation column, we would have claimed a working fracture detector. The well that never trained the model said otherwise.
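To make "F1 at a 2 cm depth match" concrete, here is a minimal sketch of one way such a score can be computed. It assumes a greedy one-to-one matching of predicted picks to labelled depths within a tolerance; the function name and the example depths are hypothetical, and the engagement's actual matching rule may differ.

```python
def f1_at_tolerance(pred_depths_cm, true_depths_cm, tol_cm):
    """F1 where a pick counts as a hit if it lies within tol_cm of an unmatched label.

    Assumption: greedy one-to-one matching over depth-sorted picks and labels.
    """
    preds = sorted(pred_depths_cm)
    trues = sorted(true_depths_cm)
    i = j = hits = 0
    while i < len(preds) and j < len(trues):
        diff = preds[i] - trues[j]
        if abs(diff) <= tol_cm:
            hits += 1          # matched pair: consume both
            i += 1
            j += 1
        elif diff < 0:
            i += 1             # pick is too shallow for every remaining label
        else:
            j += 1             # label is too shallow for every remaining pick
    if hits == 0:
        return 0.0
    precision = hits / len(preds)
    recall = hits / len(trues)
    return 2 * precision * recall / (precision + recall)


# Hypothetical depths in centimetres, purely for illustration.
labels_cm = [10000, 10050, 10120]
picks_cm = [10001, 10053, 10140, 10200]

for tol_cm in (2, 4, 6):
    print(tol_cm, "cm:", round(f1_at_tolerance(picks_cm, labels_cm, tol_cm), 3))
```

Under this definition a looser tolerance can only keep or add hits, which is why every column in the ledger rises from 2 cm to 6 cm.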

Why a patch-level split flatters you

The mechanism is not exotic, and we have written the general leakage argument up separately in "Splitting by Well, Not by Row". The one-line version for this piece: we had a tiny number of real wells, so to get enough training samples we cut each well into many overlapping image patches. When you then split those patches randomly into train and validation, patches that overlap by most of their area land on both sides of the line. The validation patch is a near-duplicate of a training patch a few centimetres up the borehole. The model has, in all but name, already seen it. So validation grades the model on samples it memorised, and the number comes out high and meaningless.
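The leak is easy to reproduce on a toy well. The sketch below cuts a single synthetic borehole into sliding-window patches, splits them at random, and counts how many validation patches share at least half their length with some training patch. The well length, patch size, strides, and the 50% cut-off are all hypothetical, chosen only to contrast a dense stride with a non-overlapping one.

```python
import random


def make_patches(well_length_cm, patch_cm, stride_cm):
    """Return (top, bottom) depth intervals of sliding-window patches."""
    return [(top, top + patch_cm)
            for top in range(0, well_length_cm - patch_cm + 1, stride_cm)]


def overlap_fraction(a, b):
    """Fraction of patch a's length that is shared with patch b."""
    shared = min(a[1], b[1]) - max(a[0], b[0])
    return max(0, shared) / (a[1] - a[0])


def leaked_share(patches, val_fraction, min_overlap, seed=0):
    """Share of validation patches overlapping some training patch by >= min_overlap."""
    rng = random.Random(seed)
    shuffled = patches[:]
    rng.shuffle(shuffled)                      # random patch-level split
    n_val = int(len(shuffled) * val_fraction)
    val, train = shuffled[:n_val], shuffled[n_val:]
    leaked = sum(
        any(overlap_fraction(v, t) >= min_overlap for t in train)
        for v in val
    )
    return leaked / len(val)


# Hypothetical geometry: 100 m of well, 1 m patches.
high = make_patches(well_length_cm=10_000, patch_cm=100, stride_cm=10)   # neighbours share 90%
low = make_patches(well_length_cm=10_000, patch_cm=100, stride_cm=100)   # neighbours share nothing

print("high-overlap leaked share:", leaked_share(high, val_fraction=0.2, min_overlap=0.5))
print("low-overlap leaked share: ", leaked_share(low, val_fraction=0.2, min_overlap=0.5))
```

With the dense stride nearly every validation patch has a close twin in training; with the non-overlapping stride none does. That is the toy version of the contrast the ledger records.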

The blind zone breaks that loop by construction. It is a contiguous stretch of a well held out whole, with no patch from it anywhere in training. No overlap can leak across that boundary because the boundary is a well, not a random row index. That is why the blind column is the one we believed.
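As a sketch of what "held out whole" means in practice, the function below sends patches that lie entirely inside the blind interval to the blind set, keeps patches that do not touch it for training, and discards any patch that straddles the boundary. The record fields (`well`, `top_cm`, `bottom_cm`), the well names, and the interval are hypothetical; this is one reasonable way to enforce the boundary, not a description of the engagement's pipeline.

```python
def split_blind_zone(patches, blind_well, zone_top_cm, zone_bottom_cm):
    """Partition patch records around one contiguous blind interval.

    Returns (train, blind, dropped). Patches that straddle the zone boundary
    are dropped so that no blind-zone pixels reach training.
    """
    train, blind, dropped = [], [], []
    for p in patches:
        outside = (
            p["well"] != blind_well
            or p["bottom_cm"] <= zone_top_cm
            or p["top_cm"] >= zone_bottom_cm
        )
        inside = (
            p["well"] == blind_well
            and p["top_cm"] >= zone_top_cm
            and p["bottom_cm"] <= zone_bottom_cm
        )
        if outside:
            train.append(p)
        elif inside:
            blind.append(p)
        else:
            dropped.append(p)   # overlaps the boundary: usable by neither side
    return train, blind, dropped


# Hypothetical wells "A" and "B", 1 m patches at a 50 cm stride.
patches = [
    {"well": w, "top_cm": top, "bottom_cm": top + 100}
    for w in ("A", "B")
    for top in range(0, 401, 50)
]
train, blind, dropped = split_blind_zone(patches, "A", zone_top_cm=200, zone_bottom_cm=400)
print(len(train), len(blind), len(dropped))
```

Dropping the straddling patches costs a few samples, but it is what makes the boundary a depth interval rather than a row index.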

The row that proved the point

The cleanest evidence sits in one comparison inside the same table. We built a second fracture configuration on the same wells but engineered the split to have low overlap between train and validation, deliberately trading sample count for less bias. Its validation F1 at 2 cm dropped from 50.87% down to 8.77%. That looks like a worse model. It is the same model. What changed is that we stopped letting validation cheat, and the validation number fell most of the way to honesty.

The blind zone for that config sat at 12.05%, so the low-overlap validation number (8.77%) and the blind number finally agree to within a few points. Every high-overlap fracture row shows validation and blind separated by a factor of roughly ten or more. The one low-overlap row shows them converging. The gap between a config's two columns is a direct readout of how much overlap bias its split carries. That is the finding: a low-overlap validation split is the difference between believing 50.87% and knowing the truth is nearer 12%.

The combined beddings-plus-fractures model on the smaller 11-well set tells the same story with a slightly softer edge: validation F1 at 2 cm of 52.44% against a blind 14.29%. Still a collapse, still the validation column running three to four times hot.
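The gap arithmetic is simple enough to keep next to the table. The sketch below takes the five fracture F1 pairs quoted in this piece and computes the validation-to-blind ratio and the difference in percentage points for each; the row labels and the `gap_rows` helper are ours for illustration, and only the numbers come from the ledger.

```python
# (config label, depth threshold in cm, validation F1 %, blind F1 %)
LEDGER = [
    ("fractures, high overlap", 2, 50.87, 3.69),
    ("fractures, high overlap", 4, 70.51, 5.53),
    ("fractures, high overlap", 6, 77.37, 7.83),
    ("fractures, low overlap", 2, 8.77, 12.05),
    ("beddings + fractures, 11 wells", 2, 52.44, 14.29),
]


def gap_rows(ledger):
    """Attach the val/blind ratio and the point difference to each ledger row."""
    rows = []
    for config, threshold_cm, val, blind in ledger:
        rows.append({
            "config": config,
            "threshold_cm": threshold_cm,
            "ratio": val / blind,          # > 1 means validation reads hot
            "diff_points": val - blind,    # percentage points, signed
        })
    return rows


for row in gap_rows(LEDGER):
    print(f"{row['config']:<32} {row['threshold_cm']} cm  "
          f"val/blind = {row['ratio']:.1f}x  ({row['diff_points']:+.1f} points)")
```

The high-overlap fracture rows come out at roughly 14, 13 and 10 times hot, the combined model at a little under 4, and the low-overlap row below 1, which is the only place validation is the more pessimistic column.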

Figure: the internal model-selection error ledger, drawn as it argues. Left: fracture F1 on the validation split (teal) against the held-out blind zone (orange), for three of the ledger configurations at a selectable depth-match threshold. The high-overlap 14-well config reads a validation F1 of 50.87 and a blind F1 of 3.69 at 2 cm; loosening the threshold to 4 or 6 cm lifts both bars, but validation still reads roughly ten to thirteen times the blind score, so the collapse is not a threshold you can tune away. The low-overlap config, split deliberately for less overlap-induced bias, is the only row whose validation number (8.77) sits near its blind number (12.05). Right: beddings, which fail in the opposite direction: blind dip accuracy at 1 degree (71) exceeds validation (49), while blind azimuth accuracy at 5 degrees craters to 7, at a mean error near 75 degrees. The orange ink is the only element that argues: the blind reality the headline validation number hides. All figures are sourced from the engagement's master error table at the exact thresholds tabulated; where a configuration does not tabulate a given depth, the bar is drawn as a dash rather than interpolated. This is an internal error ledger, not a published benchmark.

Beddings fail in the opposite direction

The ledger earns its keep on beddings, because the fracture story does not transfer to them. It inverts.

For beddings, blind dip accuracy is better than validation. At a 1 degree dip tolerance, validation accuracy was 49% while the blind zone hit 71%, with mean absolute dip error of 0.81 degrees on blind against 1.38 on validation. The held-out well was easier for dip, not harder. Beddings are numerous, laterally continuous, and gently dipping, so a fresh well full of them is a soft target for the dip head, and the overlap bias that inflated fracture validation was not the dominant effect here. At the looser 3 degree band both columns are strong, 93% to 99%, and the distinction stops mattering.

Then azimuth for beddings falls off a cliff. At a 5 degree azimuth tolerance, blind accuracy is 7%, with mean absolute error around 75 degrees, which is close to a coin toss on where the plane points. At 10 degrees, validation holds at 76% to 77% while blind sits at 15% to 17%. So within one feature type the model can generalise its depth and dip picks to a new well and almost entirely fail to generalise its azimuth. A single headline accuracy would have hidden all of this.
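For readers who want to reproduce this kind of azimuth readout, here is a minimal sketch. It assumes azimuth error is measured as the smallest angle between predicted and true direction, wrapping at 360 degrees, so the error lies between 0 and 180. The ledger's exact definition is not restated in this piece, so treat the wraparound rule, the function names, and the sample values as assumptions.

```python
def azimuth_error_deg(pred_deg, true_deg):
    """Smallest angle between two azimuths, in [0, 180] degrees."""
    diff = abs(pred_deg - true_deg) % 360.0
    return min(diff, 360.0 - diff)


def accuracy_within(preds_deg, trues_deg, tol_deg):
    """Fraction of picks whose azimuth error is at most tol_deg."""
    errors = [azimuth_error_deg(p, t) for p, t in zip(preds_deg, trues_deg)]
    return sum(e <= tol_deg for e in errors) / len(errors)


def mean_abs_error(preds_deg, trues_deg):
    """Mean of the wrapped azimuth errors, in degrees."""
    errors = [azimuth_error_deg(p, t) for p, t in zip(preds_deg, trues_deg)]
    return sum(errors) / len(errors)


# Hypothetical picks: one near-miss across north, one moderate miss, one flipped plane.
preds = [358.0, 10.0, 200.0]
trues = [2.0, 355.0, 20.0]

print("errors:", [azimuth_error_deg(p, t) for p, t in zip(preds, trues)])
print("accuracy at 5 deg:", accuracy_within(preds, trues, 5.0))
print("mean abs error:", mean_abs_error(preds, trues))
```

Under this definition a uniformly random azimuth guess averages 90 degrees of error, which gives the blind figure of around 75 degrees its context: better than chance, but not by much.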

That is the argument for a ledger over a metric. Fractures and beddings do not just differ in difficulty; they fail in different directions. Fracture validation lies high because of overlap. Bedding dip generalises fine. Bedding azimuth does not generalise at all past a tight tolerance. No one summary statistic survives contact with those three facts at once.

What the ledger changed about how we worked

Three habits came out of keeping the table.

We stopped comparing configs on validation. Two fracture models with validation F1 of 50.87% and 8.77% are not a strong model and a weak one; they are one model under a dishonest split and an honest one. Config selection moved to the blind column, full stop.

We started reading the val-to-blind gap as a diagnostic, not a nuisance. A wide gap with validation on top means the split is leaking. A narrow gap, like the low-overlap fracture row or bedding dip at the 3 degree band, means the number is trustworthy. The gap is information about your evaluation, not noise to be averaged away.

And we reported per feature and per parameter, never pooled. Depth, dip, and azimuth generalise differently, and fractures and beddings generalise differently again. Anything coarser would have let the strong bedding-dip result paper over the broken bedding-azimuth one.

Limitations

These numbers come from one engagement's internal error table, on a scarce, company-provided well set, for a specific detector family on carbonate borehole-image logs. They are an internal model-selection ledger, not a published benchmark, and the absolute values will not transfer to another operator, another tool, or another rock. The blind zone is a single held-out interval, so it estimates generalisation to one unseen well rather than a distribution of wells, and a single zone can be lucky or unlucky. Where a configuration did not tabulate a given depth threshold we left it out rather than interpolate, so the rows are not uniformly populated across all thresholds. The mechanism we lean on, overlap-induced validation bias under patch-level splits, is well documented in the wider literature; what this data adds is the size of the gap and the opposite failure directions of fractures and beddings, both of which are properties of this well set and should be re-measured, not assumed, on any new one.

References

[1] Kapoor, S. and Narayanan, A. (2023). Leakage and the reproducibility crisis in machine-learning-based science. Patterns, 4(9), 100804. https://doi.org/10.1016/j.patter.2023.100804

[2] Roberts, D. R. et al. (2017). Cross-validation strategies for data with temporal, spatial, hierarchical, or phylogenetic structure. Ecography, 40(8), 913-929. https://doi.org/10.1111/ecog.02881
