Splitting by Well, Not by Row: Leakage-Safe Evaluation for Petrophysical ML

Abstract

A model's validation score is a promise that the number was earned on data the model had never effectively seen, and on depth-indexed well-log data the easiest way to break that promise is a uniformly random train-validation split. Measurements logged a few centimetres apart down the same borehole are nearly identical, so a random split sends near-duplicate samples to both sides of the boundary and the model is graded, in part, on points it already memorised. This paper surveys the public literature that defined and quantified this failure and then applies that literature to our own raster well-log digitisation work. We credit the canonical formulation of leakage as information about the target that would not be available at prediction time [1], the cross-validation strategies built for spatially and hierarchically structured data where random folds overestimate performance [2], the recent cross-disciplinary accounting of how train-test leakage inflated results across machine-learning-based science [3], the demonstration that selecting on the same folds you report on is itself a leak [4], and the tooling that makes a group-aware split a one-line change [5]. We ground the survey on a public multi-well petrophysical dataset of 118 Norwegian Sea wells with 22 measurement columns [6] and on our own pipeline, where the unit that must never be split is the well and synthetic-log identity is a second grouping key the split has to honour. The finding is that the corrective is structural, not statistical: grouping by well so every sample from a borehole stays on one side of the 80/20 split is what makes the score mean what it claims, and neither a larger validation set nor a different metric substitutes for it, because both leave the leakage channel open.
The leakage taxonomy, the grouped-cross-validation methods, and the splitters are the field's contributions; the application to well-grouped and synthetic-identity-aware digitisation evaluation is ours, and VeerNet, the digitiser whose score we are protecting, is ours.

What leakage is, and why depth-sorted samples invite it

Leakage is the contamination of an evaluation with information that the model would not legitimately have at prediction time. The canonical formulation by Kaufman and colleagues states it precisely: a feature, a sample, or a preparation step leaks when it carries information about the target that, in the real deployment, would only become available after the prediction is made [1]. Their contribution that bears most directly on us is the second of their two leakage types, leakage from improper sampling, where the way the data is partitioned for evaluation lets the test instances borrow strength from training instances they are statistically entangled with. The framing we adopt from them is that leakage is a property of the protocol, not of the model: the same architecture trained on the same data can be honest under one split and dishonest under another, and the split is where the damage is done or avoided.

Why depth-sorted well-log samples invite exactly this failure follows from how the data is built. A digitised log is a curve sampled along depth, and our validation logs are interpolated to 300 depth points each, with two curves per synthetic log. Two samples a few depth steps apart are, for most petrophysical signals, almost the same vector: porosity, density, and resistivity vary smoothly with depth except across bed boundaries, so consecutive samples are autocorrelated by physics rather than by accident. A uniformly random train-validation split treats those samples as exchangeable and independent, which they are not. It can put depth point 149 of a borehole in training and depth point 150 in validation, and now the validation point has a near-twin in the training set. The model does not need to generalise to score well on it; it needs only to have memorised its neighbour. This is the improper-sampling leak of Kaufman and colleagues realised concretely on depth-indexed data [1].
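
The mechanism can be made concrete with a small counting sketch. It is not our evaluation harness: the five wells, the seed, and the function names are assumptions chosen for illustration, and depth adjacency stands in for "near-duplicate". Only the 300 depth points and the 80/20 fraction come from the text. The sketch counts depth-adjacent pairs from the same well that end up on opposite sides of the boundary under each splitting rule.

```python
import random

N_WELLS = 5            # schematic well count (assumption, for illustration only)
N_DEPTH = 300          # depth points per log, as stated in the text
TRAIN_FRACTION = 0.8   # the 80/20 split


def random_row_split(rng):
    """Place every (well, depth) sample in training independently of its neighbours."""
    return {(w, d): rng.random() < TRAIN_FRACTION
            for w in range(N_WELLS) for d in range(N_DEPTH)}


def whole_well_split(rng):
    """Place whole wells in training, so all depth points of a well move together."""
    wells = list(range(N_WELLS))
    rng.shuffle(wells)
    train_wells = set(wells[:round(TRAIN_FRACTION * N_WELLS)])
    return {(w, d): w in train_wells
            for w in range(N_WELLS) for d in range(N_DEPTH)}


def straddling_pairs(in_train):
    """Count depth-adjacent pairs from one well that sit on opposite sides of the split."""
    return sum(in_train[(w, d)] != in_train[(w, d + 1)]
               for w in range(N_WELLS) for d in range(N_DEPTH - 1))


rng = random.Random(0)
leaky_pairs = straddling_pairs(random_row_split(rng))
grouped_pairs = straddling_pairs(whole_well_split(rng))
print(f"random-row split: {leaky_pairs} straddling pairs")
print(f"whole-well split: {grouped_pairs} straddling pairs")
```

Under the whole-well rule the count is zero by construction, because no well spans the boundary; under the random-row rule it is large at any train fraction strictly between 0 and 1.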

The ecology and spatial-statistics literature reached the same conclusion from a different direction and gave us the corrective vocabulary. Roberts and colleagues survey cross-validation for data with temporal, spatial, hierarchical, or phylogenetic structure and show that when samples are autocorrelated or grouped, random k-fold cross-validation systematically overestimates predictive performance, sometimes dramatically, because the held-out fold is never truly independent of the training folds [2]. Their proposed remedy is blocking: partition by the unit across which dependence operates, so that an entire block of correlated samples is held out together rather than interleaved. A borehole is a textbook block. Every sample down a well shares that well's tool calibration, its mud and hole conditions, its operator, and its local geology, so the well is the natural unit of dependence and the natural unit of the split. The contribution we take from this work is the principle that the partition key must match the dependence structure, and for well logs that key is the well.

That principle has been validated at the scale of a whole research enterprise. Kapoor and Narayanan audited machine-learning-based science across many disciplines and found train-test leakage in a large number of published studies, organising the failures into a taxonomy whose entries include splitting after preprocessing and, centrally for us, failing to keep dependent samples on the same side of the split [3]. Their finding reframes leakage from an individual mistake into a systemic one: it is common, it is easy to commit, and it tends to inflate reported performance in a consistent direction. Their proposed discipline, evaluation protocols documented well enough to be audited, is the standard we hold our own split to in this paper. We are not claiming to have discovered the problem; we are applying a diagnosis the field has already documented to the specific shape of petrophysical data.

One more form of leakage is worth separating out because it hides inside an otherwise careful split. Varma and Simon show that if you select a model or tune its hyperparameters using the same cross-validation folds on which you then report the final score, the reported number is optimistically biased, because the selection has quietly fit the evaluation set [4]. The remedy is nesting: an inner loop for selection and an outer, untouched loop for reporting. This is a different channel from the depth-adjacency leak, but it compounds with it, and a protocol that fixes the grouping while leaving selection entangled with reporting has only closed one of two doors.

Method: what the split has to respect

Our method here is a protocol description rather than a benchmark, so we state exactly what the honest split requires on our data and why each requirement follows from the literature above. The setting is the evaluation of VeerNet, our encoder-decoder for raster well-log digitisation, and the public reference dataset whose structure we mirror is the 118-well Norwegian Sea collection with 22 measurement columns documented by McDonald from the FORCE 2020 lithology contest [6]. That public dataset is useful precisely because it makes the grouping unit legible: it is unmistakably a collection of wells, not a bag of independent rows, and any honest split of it is a split over wells.

The first requirement is that the well is the grouping unit. We partition at the level of the borehole, assigning whole wells to the training side or the validation side so that no well contributes samples to both. With an 80/20 train-validation split, that means roughly eighty percent of the wells, not eighty percent of the depth points, go to training, and every depth point of a validation well is held out together. This is the blocking prescription of Roberts and colleagues applied to boreholes [2], and it is what severs the depth-adjacency channel: if depth point 149 is in validation, so is depth point 150, because both belong to the same withheld well, and neither has a near-twin on the training side.
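
A minimal sketch of a well-level 80/20 split, assuming scikit-learn's GroupShuffleSplit as the splitter. The ten wells, the random feature matrix, and the variable names are placeholders rather than our data. With group labels supplied, test_size=0.2 is a proportion of groups, so it withholds two of the ten wells, not twenty percent of the rows.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

n_wells, n_depth = 10, 300                        # placeholder sizes (assumption)
well_id = np.repeat(np.arange(n_wells), n_depth)  # one group label per depth sample
X = np.random.default_rng(0).normal(size=(n_wells * n_depth, 2))  # stand-in features

# test_size is interpreted as a fraction of groups (wells), not of rows.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, val_idx = next(splitter.split(X, groups=well_id))

train_wells = set(well_id[train_idx].tolist())
val_wells = set(well_id[val_idx].tolist())
print("training wells:  ", sorted(train_wells))
print("validation wells:", sorted(val_wells))
```

Every depth point of a withheld well lands in val_idx together, which is the property the depth-adjacency argument needs.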

The second requirement is specific to our synthetic pipeline and is the part that is ours rather than the field's. Our training corpus is generated, and a single synthetic log can be rendered into more than one raster sample through augmentation and tiling. If those derived samples were split independently, two renders of the same underlying synthetic log could land on opposite sides of the boundary, which is the depth-adjacency leak wearing a different costume: the validation render would have a near-identical sibling in training. So synthetic-log identity becomes a second grouping key. Two samples that descend from the same synthetic log are kept together exactly as two depth points from the same real well are kept together. The grouping logic is identical; only the identifier changes, from well number to synthetic-log identifier.
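
One way to carry the second key through the pipeline is to stamp every derived sample with the identity of its source and split on that stamp. The sketch below is an assumption about how this could look, not a description of our code: RasterSample, group_key, and grouped_split are hypothetical names, and the ten synthetic logs with three renders each are invented for the example.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class RasterSample:
    sample_id: str     # identifier of this particular render or tile
    source_kind: str   # "well" for real data, "synthetic" for generated data
    source_id: str     # well number or synthetic-log identifier


def group_key(sample):
    """Key shared by all samples that descend from the same well or synthetic log."""
    return f"{sample.source_kind}:{sample.source_id}"


def grouped_split(samples, train_fraction=0.8, seed=0):
    """Assign whole groups, never individual samples, to the training or validation side."""
    keys = sorted({group_key(s) for s in samples})
    random.Random(seed).shuffle(keys)
    train_keys = set(keys[:round(train_fraction * len(keys))])
    train = [s for s in samples if group_key(s) in train_keys]
    val = [s for s in samples if group_key(s) not in train_keys]
    return train, val


# Ten hypothetical synthetic logs, each rendered three times through augmentation.
samples = [RasterSample(f"syn{log}-render{r}", "synthetic", str(log))
           for log in range(10) for r in range(3)]
train, val = grouped_split(samples)
print(len(train), "training renders,", len(val), "validation renders")
```

Prefixing the key with the source kind keeps well number 7 and synthetic log 7 from being mistaken for the same group when real and generated data are mixed.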

The third requirement is that grouping is enforced by construction, not by intention, and here we lean on the tooling. The grouped k-fold splitters in scikit-learn take an explicit array of group labels and guarantee that the same group never appears in both the training and the held-out fold [5]. Implemented this way, the grouping is a property the harness cannot violate even if someone forgets the reason for it, because the splitter is handed the well identifier or the synthetic-log identifier as the group and refuses to place a group on both sides. The contribution we draw from the tooling literature is that the honest protocol costs one argument: passing the group key turns a leaky random split into a grouped one without a custom evaluation loop.
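
The difference the group argument makes can be checked directly. The following sketch compares scikit-learn's KFold with shuffling against GroupKFold on placeholder data and counts, per fold, how many wells appear on both sides. The well count and the variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import GroupKFold, KFold

n_wells, n_depth = 10, 300                        # placeholder sizes (assumption)
well_id = np.repeat(np.arange(n_wells), n_depth)  # group label for every sample
X = np.zeros((len(well_id), 1))                   # features are irrelevant to the split


def shared_wells(train_idx, test_idx):
    """Number of wells that contribute samples to both sides of one fold."""
    return len(set(well_id[train_idx].tolist()) & set(well_id[test_idx].tolist()))


random_folds = KFold(n_splits=5, shuffle=True, random_state=0).split(X)
grouped_folds = GroupKFold(n_splits=5).split(X, groups=well_id)

random_overlap = [shared_wells(tr, te) for tr, te in random_folds]
grouped_overlap = [shared_wells(tr, te) for tr, te in grouped_folds]
print("wells on both sides, random folds: ", random_overlap)
print("wells on both sides, grouped folds:", grouped_overlap)
```

The same group labels can also be handed to helpers such as cross_val_score through their groups parameter, so the rest of an evaluation loop does not need to change.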

The fourth requirement is to keep selection separate from reporting, following Varma and Simon [4]. Any choice we make by looking at validation scores, such as which loss to prefer or where to set a confidence threshold, is a selection step, and if the number we publish comes from the same partition those choices were made on, it is biased upward. We therefore treat the grouped validation set used during development as a selection instrument and reserve an untouched grouped split for the figure we report, so that the grouping discipline and the selection discipline are both satisfied rather than one being used as an alibi for the other.
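
A minimal sketch of keeping the two disciplines apart, under loudly stated assumptions: the data is random, a Ridge regressor stands in for the model (it is not VeerNet), and the alpha grid stands in for whatever is being selected. An outer grouped split reserves reporting wells; selection runs only on the remaining development wells with grouped folds.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GroupKFold, GroupShuffleSplit, cross_val_score

rng = np.random.default_rng(0)
n_wells, n_depth = 20, 50                          # placeholder sizes (assumption)
well_id = np.repeat(np.arange(n_wells), n_depth)
X = rng.normal(size=(n_wells * n_depth, 3))        # stand-in features
y = X @ np.array([1.0, -0.5, 0.25]) + rng.normal(scale=0.1, size=len(X))

# Outer split: reserve whole wells for reporting and do not touch them during selection.
outer = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
dev_idx, report_idx = next(outer.split(X, y, groups=well_id))

# Inner loop: choose the hyperparameter on development wells only, with grouped folds.
alphas = [0.01, 1.0, 100.0]
inner = GroupKFold(n_splits=4)
mean_scores = {
    a: cross_val_score(Ridge(alpha=a), X[dev_idx], y[dev_idx],
                       groups=well_id[dev_idx], cv=inner).mean()
    for a in alphas
}
best_alpha = max(mean_scores, key=mean_scores.get)

# Reporting: refit on all development wells, score once on the reserved wells.
final_model = Ridge(alpha=best_alpha).fit(X[dev_idx], y[dev_idx])
report_r2 = final_model.score(X[report_idx], y[report_idx])
print(f"selected alpha={best_alpha}, reported R-squared={report_r2:.3f}")
```

The reported figure is computed once, on wells that played no part in choosing alpha, so neither the depth-adjacency channel nor the selection channel is open.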

Results

The instrument below makes the mechanism visible on a small pool of depth-sorted samples drawn from a handful of schematic wells. It partitions the same pool two ways and shows what each partition does to the reported score.

The same pool of depth-sorted petrophysical samples, partitioned two ways. Under a random-row split each sample is placed independently, so depth-adjacent neighbours from one well land on both sides of the train-validation boundary; the orange threads mark those straddling pairs, the path through which a near-duplicate leaks the answer into validation. Switch to a whole-well split and every sample from a well moves together to one side, closing that path. The paired panel on the right shows the validation R-squared the leaky split would report against the honest score the grouped split reports, and the orange figure between them is the inflation. Drag the lever to change the train fraction; the dashed pin marks the engagement's 80/20 split. Public figures are real (7,781 LAS files, 118 Norwegian Sea wells from the FORCE 2020 benchmark, 22 measurement columns, 300 depth points, 2 curves per log); the before/after validation pair is an illustrative demonstration of the inflation direction and rough size, and the five wells drawn are a schematic stand-in for the well population.

Three readings come off it. The first is the leakage channel itself. Under the random-row rule, samples are assigned to training and validation independently, and the orange threads light up wherever two depth-adjacent samples from the same well end up on opposite sides of the boundary. Each thread is a near-duplicate pair, a validation sample whose answer is sitting in the training set under a different index. As the train fraction moves, the threads persist, because the problem is not the size of the validation set but the rule that built it; a leaky split stays leaky at 70/30 and at 90/10. The second reading is the cure. Switching to the whole-well rule moves every sample from a well to one side together, and the threads vanish: there are no straddling neighbours left to draw, because no well spans the boundary. This is the blocking remedy of Roberts and colleagues rendered as an animation [2], and the visual point is that the fix is structural. Nothing about the samples changed; only the partitioning unit did, from the row to the well.

The third reading is the one the paper exists for, and it is in the paired panel on the right. The leaky random-row split reports a higher validation R-squared than the honest whole-well split, and the orange figure between the two bars is the inflation: the optimism the leaky protocol would publish. In the instrument that gap shrinks to zero the moment the split becomes grouped, because with the leakage channel closed the reported number and the honest number are the same number. We flag plainly that the two validation values in that panel are an illustrative before-and-after pair chosen to show the direction and rough magnitude of leakage inflation documented in the grouped-cross-validation literature [2][3]; they are not a measured benchmark result for our model. The figures that are real are the structural ones the audit is built on, the 80/20 split, the 7,781 LAS files in the archive, the 118 public Norwegian Sea wells with 22 measurement columns [6], the 300 interpolated depth points, and the two curves per synthetic log. What the instrument argues with those numbers is qualitative and robust: a random split over depth-indexed samples reports a score that a grouped split does not, and the difference is leakage, not skill.

Discussion

The honest summary is that for petrophysical machine learning the most consequential evaluation decision is made before any metric is computed, and it is the choice of partitioning unit. Everything the public literature says points the same way. Leakage is a property of the protocol [1]; on autocorrelated, grouped data random cross-validation overestimates performance and the remedy is to block by the unit of dependence [2]; the failure is common enough across published science to be treated as a default risk rather than an edge case [3]; and the tooling makes the grouped protocol a one-argument change [5]. Against that backdrop our contribution is narrow and specific. We did not invent grouped cross-validation; we identified the two grouping keys that a synthetic-data digitisation pipeline has to respect, the physical well for real data and the synthetic-log identity for generated data, and we showed that both are instances of the same blocking principle applied to two different sources of near-duplication.

Where our position differs from a naive reading of the problem is in what we explicitly reject as a fix. The intuitive responses to an inflated validation score, enlarging the validation set or switching to a sterner metric, do not work here, and the reason is worth stating because it is easy to get wrong. A larger validation set built by the same random rule simply contains more leaked pairs, so it estimates the same optimistic quantity with smaller variance, which is worse, not better, because it lends false confidence to a biased number. A sterner metric measures a different thing about the same contaminated comparison; if the validation samples have twins in training, no choice of metric un-sees them. Only changing the partitioning unit removes the contamination, which is why we frame the corrective as structural throughout. The score becomes trustworthy when the protocol changes, and not before.
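
The first of those claims can be illustrated with a small simulation, offered as a sketch under assumptions rather than a measurement. It treats "has a depth-adjacent neighbour in training" as a proxy for "has a near-twin in training", assigns rows independently at the 80/20 fraction, and uses invented well counts. If each row is placed independently with train fraction p, an interior validation sample has both neighbours in validation with probability (1 - p)^2, so about 1 - 0.2^2 = 0.96 of validation samples are contaminated at p = 0.8, whatever the size of the pool.

```python
import random

N_DEPTH = 300          # depth points per log, as stated in the text
TRAIN_FRACTION = 0.8   # the 80/20 split


def contaminated_fraction(n_wells, train_fraction=TRAIN_FRACTION, seed=0):
    """Random-row split: size of the validation set and the share of it that has
    at least one depth-adjacent neighbour from the same well in training."""
    rng = random.Random(seed)
    n_val = 0
    n_contaminated = 0
    for _ in range(n_wells):
        in_train = [rng.random() < train_fraction for _ in range(N_DEPTH)]
        for d, is_train in enumerate(in_train):
            if is_train:
                continue
            n_val += 1
            above = d > 0 and in_train[d - 1]
            below = d < N_DEPTH - 1 and in_train[d + 1]
            if above or below:
                n_contaminated += 1
    return n_val, n_contaminated / n_val


for n_wells in (10, 100):   # hypothetical pool sizes
    n_val, share = contaminated_fraction(n_wells)
    print(f"{n_wells:>4} wells: {n_val:>6} validation samples, "
          f"{share:.2%} with a training neighbour")
```

Growing the pool tenfold grows the validation set tenfold and leaves the contaminated share essentially where it was, which is the sense in which a bigger leaky validation set only measures the same biased quantity more precisely.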

The placement we want to leave is about where the well-grouped view sits in the broader leakage conversation. The field's accounting of leakage spans tabular data mining [1], ecological and spatial modelling [2], and a cross-disciplinary reproducibility audit [3], and well-log digitisation is a clean special case of all three: the samples are spatially structured along depth, the grouping unit is concrete and recorded, and the synthetic-data twist adds a second grouping key that the general literature does not name but that its principle covers exactly. A petrophysical model reported on a random split should be read with the same suspicion the literature now attaches to any grouped-data model evaluated without blocking, and the way to earn back that trust is mechanical: group by well, group by synthetic-log identity, keep selection off the reporting split, and let the splitter enforce all of it.

Limitations

This is a methodological survey and protocol description, not a benchmark, and its edges should be read that way. The before-and-after validation pair shown in the instrument is illustrative: it is chosen to demonstrate the direction and approximate size of the inflation that the grouped-cross-validation literature documents [2][3], and it is not a measured leaky-versus-grouped score for our model on a named dataset, so it should not be cited as a quantitative result. The five wells and the sample scatter in the instrument are a schematic stand-in for a well population, drawn to make the mechanism legible rather than to depict any specific boreholes. The real figures we rely on, the 80/20 split, the 7,781 LAS files, the 118 Norwegian Sea wells with 22 measurement columns, the 300 interpolated depth points, and the two curves per synthetic log, are structural facts about the data and the protocol, not effect sizes, and we make no claim here about how large the inflation would be on any particular corpus, which depends on how strongly samples within a well are correlated and on how many wells the split has to work with. The argument that grouping is the only adequate fix is a claim about the leakage channel, not a proof that a grouped score is unbiased in every other respect; a grouped split can still mislead if selection is run on the reporting partition, which is why we treat the nesting requirement of Varma and Simon as a separate, compounding obligation rather than a corollary of grouping [4]. Finally, our second grouping key, synthetic-log identity, is specific to a generated training corpus; a pipeline trained only on real logs has the well as its single grouping unit, and a pipeline that mixes sources would need both keys reconciled, which we describe in principle but do not evaluate quantitatively here.

References

[1] Kaufman, S., Rosset, S., Perlich, C., and Stitelman, O. Leakage in data mining: formulation, detection, and avoidance. ACM Transactions on Knowledge Discovery from Data, 2012. The canonical treatment of leakage, defining it as the introduction of information about the prediction target that would not legitimately be available at prediction time, cataloguing how it enters through data preparation and improper sampling, and proposing a methodology for detecting and avoiding it. https://dl.acm.org/doi/10.1145/2382577.2382579

[2] Roberts, D. R., Bahn, V., Ciuti, S., Boyce, M. S., Elith, J., Guillera-Arroita, G., Hauenstein, S., Lahoz-Monfort, J. J., Schroder, B., Thuiller, W., Warton, D. I., Wintle, B. A., Hartig, F., and Dormann, C. F. Cross-validation strategies for data with temporal, spatial, hierarchical, or phylogenetic structure. Ecography, 2017. A treatment of why uniformly random cross-validation overestimates predictive performance when samples are autocorrelated or grouped, and a survey of blocked and grouped cross-validation schemes that keep dependent units together to recover an honest estimate. https://onlinelibrary.wiley.com/doi/10.1111/ecog.02881

[3] Kapoor, S., and Narayanan, A. Leakage and the reproducibility crisis in machine-learning-based science. Patterns, 2023. A cross-disciplinary survey finding train-test leakage in a large number of published studies across many fields, organising the failure into a taxonomy that includes splitting after preprocessing and not separating dependent samples, and proposing model-info sheets to make evaluation protocols auditable. https://arxiv.org/abs/2207.07048

[4] Varma, S., and Simon, R. Bias in error estimation when using cross-validation for model selection. BMC Bioinformatics, 2006. A demonstration that selecting a model or tuning hyperparameters on the same cross-validation folds used to report the final score produces an optimistically biased estimate, and that a nested protocol with an outer held-out loop is needed to remove that bias. https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-7-91

[5] Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., and Duchesnay, E. Scikit-learn: machine learning in Python. Journal of Machine Learning Research, 2011. The library that provides the grouped and stratified cross-validation splitters, including a group-aware k-fold that guarantees the same group never appears in both the training and the test fold, making the grouped protocol a one-line change rather than a custom harness. https://jmlr.org/papers/v12/pedregosa11a.html

[6] McDonald, A. Using the missingno Python library to identify and visualise missing data prior to machine learning. Towards Data Science, 2021. A practitioner walkthrough on a public petrophysical dataset of 118 Norwegian Sea wells with 22 measurement columns drawn from the FORCE 2020 lithology contest, used here as the period-correct public reference for the structure and scale of multi-well log data. https://towardsdatascience.com/using-the-missingno-python-library-to-identify-and-visualise-missing-data-prior-to-machine-learning-34c8c5b5f009
