1D Interpolation vs KNN Imputation: A 20x+ Speedup Benchmark for Well Logs

Every machine-learning pipeline that ingests borehole image logs starts with a deletion problem. An unwrapped high-resolution borehole image covers only about 80% of the borehole wall: the pads carrying the microresistivity buttons leave wedge-shaped strips of nothing between them, and the tool writes a sentinel into every unmeasured pixel, which the pipeline reads as NaN. Before any model sees the image, those NaNs have to go. The obvious instinct, especially for an ML engineer who has just read the scikit-learn docs, is to reach for an off-the-shelf imputer: KNNImputer, fill each missing value from its nearest neighbours, move on. While building the preprocessing layer for an applied-AI fracture-detection stack, during a roughly twenty-month engagement with a mid-sized Middle East carbonate operator, we benchmarked exactly that against a humbler classical baseline, one-dimensional periodic interpolation, and the result was lopsided enough to be worth a whole post. On a four-metre interval the 1D method ran in 0.115 s against KNN's 2.625 s, roughly a 23x gap; scaled to a whole well, 1D finished a pass in ~11 s while KNN never finished at all. This piece is about why that gap exists, why the faster method is still not automatically the right one, and the engineering lesson hiding inside it.

The setup: a NaN-removal bake-off, not a modelling experiment

Be precise about what was measured, because the framing is everything here. This is a preprocessing benchmark, not a model comparison. The job is to take a column of pixel intensities down a borehole — a 1D signal sampled at roughly centimetre depth resolution, with stretches of NaN where the pads had no contact — and return a dense, finite array a downstream detector can consume. Both methods produce a quality-acceptable fill on a short interval. The question on the table was purely operational: which one can we afford to run across dozens of wells, repeatedly, every time we re-process?

Two candidates, both off-the-shelf:

1D periodic interpolation. Treat each track as a 1D function of depth, drop the NaNs, and interpolate the missing samples from the surrounding finite values along the depth axis. Periodic because the unrolled image wraps around 360 degrees of azimuth, so the left and right edges of the strip are neighbours. The cost is dominated by a single vectorised pass over the array.
KNN imputation (k=5). Treat the problem as multivariate: for each row with missing entries, find the five most similar rows that do have a value in the missing column, using a NaN-aware distance over the columns both rows share, and fill the hole with their (optionally distance-weighted) average. Conceptually richer, since it borrows from all the columns at once instead of interpolating within one, but, as we will see, it pays for that richness in a way that does not survive contact with well-scale data.
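
As a rough illustration of the two candidates, the sketch below fills a small synthetic image both ways. It is not the benchmarked implementation. The array layout (rows are depth samples, columns are azimuth bins), the helper name fill_rows_periodic, the synthetic data and the choice to interpolate across azimuth within each depth row are all assumptions made for the example. Only np.interp with its period argument and scikit-learn's KNNImputer are real library calls.

```python
import numpy as np
from sklearn.impute import KNNImputer


def fill_rows_periodic(image, period_deg=360.0):
    """Fill NaNs in each depth row by linear interpolation that wraps in azimuth."""
    n_depth, n_az = image.shape
    azimuth = np.arange(n_az) * (period_deg / n_az)  # bin positions in degrees
    filled = image.copy()
    for i in range(n_depth):
        known = np.isfinite(image[i])
        if known.all() or not known.any():
            continue  # nothing to fill, or nothing to fill from
        # period= makes the first and last azimuth bins neighbours
        filled[i, ~known] = np.interp(
            azimuth[~known], azimuth[known], image[i, known], period=period_deg
        )
    return filled


# Synthetic stand-in for an unrolled image: 40 depth samples x 36 azimuth bins.
rng = np.random.default_rng(0)
n_depth, n_az = 40, 36
theta = np.deg2rad(np.arange(n_az) * 10.0)
image = np.sin(theta)[None, :] + 0.05 * rng.standard_normal((n_depth, n_az))

# Hypothetical pad gap: a 4-bin null band that drifts with depth,
# so no azimuth column is empty at every depth.
for i in range(n_depth):
    image[i, (i + np.arange(4)) % n_az] = np.nan

filled_1d = fill_rows_periodic(image)
filled_knn = KNNImputer(n_neighbors=5).fit_transform(image)
```

The per-row loop is there for readability; a production version would vectorise it. Note that KNNImputer drops any column that is NaN in every row by default, which is why the synthetic gap is made to drift rather than sit in fixed columns.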

1D periodic interpolation, per 4 m interval: ~0.115 s (~23x faster)
KNN imputer (k=5), same interval: ~2.625 s (same quality at this scale)
1D, whole-well pass: ~11 s (KNN never finished one)

The headline 0.115 s vs 2.625 s was measured on a single four-metre window. That ratio alone would be enough to prefer 1D in a tight loop. But the number that actually decided the pipeline is the one that isn't a ratio: scaled from four metres to a full reservoir section, 1D's cost grew roughly linearly to about eleven seconds, while KNN's cost grew badly enough that we killed the whole-well run before it returned. A 23x gap on a small interval became an unbounded gap on the real workload. That is the tell that the two methods do not live in the same complexity class.

Why the gap is algorithmic, not implementation noise

It is tempting to read a 23x difference as "one is written in C and the other in Python," or "we forgot to warm a cache." It is neither. The gap is in the asymptotics, and once you see it you can predict the whole-well blow-up without running it.

1D interpolation is essentially linear in the number of samples. Sort the known points (already sorted, since they are indexed by depth), then for each missing sample do a cheap lookup-and-blend between its bracketing neighbours. That is an O(n) pass over the column, and the constant is small because the inner loop is a vectorised NumPy operation, not a Python for loop. Doubling the well roughly doubles the time. Eleven seconds for a whole well is exactly what a linear method should cost.
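
The lookup-and-blend step can be written out explicitly. This is a minimal sketch, assuming a single depth-indexed column with NaN gaps; the function name fill_linear and the sample data are hypothetical, and the result is meant to match what np.interp returns for the same inputs.

```python
import numpy as np


def fill_linear(depth, values):
    """Fill NaNs by blending the two finite samples that bracket each gap sample."""
    known = np.isfinite(values)
    xk, yk = depth[known], values[known]
    xq = depth[~known]

    # Index of the right-hand bracketing neighbour for every missing sample.
    hi = np.clip(np.searchsorted(xk, xq), 1, len(xk) - 1)
    lo = hi - 1

    # Blend weight; clipping holds the end value outside the known range.
    w = np.clip((xq - xk[lo]) / (xk[hi] - xk[lo]), 0.0, 1.0)

    out = values.copy()
    out[~known] = (1.0 - w) * yk[lo] + w * yk[hi]
    return out


depth = np.arange(10) * 0.01            # ~1 cm sampling, in metres
column = np.sin(depth * 50.0)
column[[0, 3, 4, 9]] = np.nan           # gaps at the ends and in the middle
filled = fill_linear(depth, column)
```

Strictly, np.searchsorted is a binary search, so this version is O(n log n) in the worst case; because both arrays are already in depth order, the lookup could be done in a single merge-style walk, and in practice the cost tracks the sample count closely.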

KNN imputation is super-linear, and the dominant term is the neighbour search. For every row that has a missing value, the imputer has to find its k nearest donor rows, which in the naive dense case means computing a distance from that row to every candidate row. With m rows that are mutual candidates and d columns per row, you are looking at on the order of O(m² d) distance work to fill the gaps, a pairwise computation that quietly explodes as the well lengthens. Tree-based accelerations exist in principle, but they degrade in the high-dimensional, mostly-dense regime of an unrolled image strip, and scikit-learn's KNNImputer does not offer one: it always takes the brute-force pairwise path. A borehole with tens of thousands of depth samples turns that quadratic term into the wall the process hit.
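
To put rough numbers on the quadratic term, here is a small cost model. The column count and the whole-well row count are hypothetical placeholders, not figures from the benchmark; only the ~400 rows for a 4 m interval follows from the roughly centimetre sampling mentioned earlier.

```python
def pairwise_distance_work(m_rows, d_cols):
    """Rough count of per-coordinate operations for brute-force distances among m rows."""
    return m_rows * m_rows * d_cols


def dense_distance_matrix_gib(m_rows, bytes_per_value=8):
    """Memory for a full m x m float64 distance matrix, in GiB, if held all at once."""
    return m_rows * m_rows * bytes_per_value / 2**30


D_COLS = 192          # hypothetical number of azimuth columns
SHORT_ROWS = 400      # ~4 m at ~1 cm sampling
WELL_ROWS = 40_000    # hypothetical whole-well row count

row_growth = WELL_ROWS // SHORT_ROWS
work_growth = (
    pairwise_distance_work(WELL_ROWS, D_COLS)
    // pairwise_distance_work(SHORT_ROWS, D_COLS)
)
well_matrix_gib = dense_distance_matrix_gib(WELL_ROWS)

print(f"rows grow {row_growth}x, distance work grows {work_growth}x")
print(f"full distance matrix at well scale: {well_matrix_gib:.1f} GiB")
```

Under these assumptions a hundredfold increase in rows means a ten-thousandfold increase in distance work. A chunked implementation avoids holding the whole matrix in memory, but it still has to do the work.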

So the four-metre benchmark and the whole-well failure are the same fact seen at two scales. On a tiny window, m is small, the quadratic term is cheap, and KNN is merely 23x slower than a linear pass. Extend m to a whole well and the quadratic term dominates everything — which is why one method finished in eleven seconds and the other never returned. The benchmark didn't get more lopsided as we scaled; it revealed the complexity class it had been hiding the whole time.
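
The same reasoning can be turned into a back-of-envelope projection from the three timings reported above. This is an estimate, not a measurement: it assumes the 1D method scales exactly linearly (so its timing ratio gives the size ratio) and that KNN scales purely quadratically, ignoring memory pressure and constant factors.

```python
# Timings reported in the article.
T_1D_SHORT = 0.115    # seconds, 4 m interval
T_KNN_SHORT = 2.625   # seconds, same interval
T_1D_WELL = 11.0      # seconds, whole-well pass

# If 1D is linear, its slowdown equals the growth in sample count.
size_ratio = T_1D_WELL / T_1D_SHORT

# If KNN is quadratic, its slowdown is the square of that growth.
knn_well_estimate_s = T_KNN_SHORT * size_ratio**2
knn_well_estimate_h = knn_well_estimate_s / 3600.0

print(f"inferred size ratio: ~{size_ratio:.0f}x")
print(f"projected KNN whole-well time: ~{knn_well_estimate_h:.1f} h")
```

Under those assumptions the projection lands in hours rather than seconds, which is consistent with a run that had to be killed.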

The trap in "same quality"

Here is the part that catches good engineers. On the four-metre interval, both fills were acceptable. If you had stopped at a short-window quality check — overlay the recovered track on the original, eyeball the continuity, declare victory — you would have learned nothing that distinguishes a 0.1-second method from a method that cannot run your actual job. Quality-equal at the benchmark scale told you nothing about cost-equal at the production scale. The two axes are independent, and the only way to see it was to push both methods to the workload they would actually face.

This is the recurring failure mode of preprocessing benchmarks: people profile quality on a toy slice and assume cost. The discipline we took from this engagement is the inverse — profile cost on the real workload first, and only spend quality-comparison effort on the methods that survive. A method that is 23x slower on a slice and asymptotically worse on the full set has already disqualified itself before the quality question is even interesting.
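
One way to apply the cost-first discipline is a small scaling harness that times a method on progressively larger inputs and drops it as soon as it blows a budget. The sketch below is illustrative only: profile_scaling, its parameters and the toy workload are hypothetical, not the harness used in the engagement.

```python
import time


def profile_scaling(fn, make_input, sizes, budget_s):
    """Time fn on inputs of growing size; stop once a run exceeds budget_s."""
    results = []
    for n in sizes:
        data = make_input(n)
        start = time.perf_counter()
        fn(data)
        elapsed = time.perf_counter() - start
        results.append((n, elapsed))
        if elapsed > budget_s:
            break  # larger sizes would only be slower; the method is out
    return results


# Toy workload standing in for an imputer: sort a reversed list.
timings = profile_scaling(
    fn=sorted,
    make_input=lambda n: list(range(n, 0, -1)),
    sizes=[1_000, 10_000, 100_000],
    budget_s=5.0,
)
for n, seconds in timings:
    print(f"n={n:>7}: {seconds:.4f} s")
```

This only checks the budget after each run returns, so it cannot interrupt a run already in progress. For a method that may never finish at the largest size, the call would need to run in a separate process with a hard timeout.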

But, and this is the twist that keeps the story honest, cheapest did not win the pipeline either. On a four-metre window both fills looked fine; the difference only appears when the gap is wide and the feature crossing it is a sinusoid. A fracture or bedding plane projects to a sine wave across the unrolled wall, and the imputation question is really a continuity question: does the fill keep a sine wave a sine wave across the cut? 1D interpolation, fast as it is, tends to flatten a wide gap toward a chord and leave seam artifacts at the pad edges. KNN, by borrowing from rows that actually trace the same curve, stays continuous. The interactive figure that accompanies this section lets you switch fills and watch the recovered sinusoid redraw across the null band; its caption follows.

Figure: A pad gap in a high-resolution borehole image log, the dead strip left between adjacent pads of a microresistivity imaging tool, cuts a vertical null band through the unrolled borehole image, and whatever fills it becomes part of the sinusoid a detector traces. So the imputation question is well-posed: does the fill keep a sine wave a sine wave across the cut? The figure compares four fills across the gap: KNN imputation (n_neighbors=5) interpolates along the curve and stays continuous (teal); the GAN inpaints locally realistic texture that breaks the curve and exits at the wrong phase (the orange discontinuity is the argument); 1D-linear flattens it to a chord and leaves vertical-line artifacts; the iterative imputer stays continuous but is too slow for per-well runs. KNN won. The method ranking and compute markers (1D ~0.115 s vs KNN ~2.625 s on a 4 m interval; 1D ~11 s whole-well; KNN never finished a whole-well pass) are the article's own; the borehole image texture and the recovered-sinusoid curves are schematic.
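
The chord-flattening effect is easy to reproduce on an idealised trace. The sketch below is a deliberately simplified stand-in: it interpolates the sinusoidal trace itself rather than image intensities, places the gap symmetrically over the crest, and uses a hypothetical helper name, chord_error. It shows how the worst-case error of a linear fill grows with gap width.

```python
import numpy as np


def chord_error(half_width_deg, amplitude=1.0):
    """Max error when a gap centred on the crest of a sine is filled linearly."""
    theta = np.arange(360.0)                          # azimuth in degrees
    trace = amplitude * np.sin(np.deg2rad(theta))     # crest at 90 degrees
    gap = np.abs(theta - 90.0) < half_width_deg       # samples lost to the pad gap

    filled = trace.copy()
    filled[gap] = np.interp(theta[gap], theta[~gap], trace[~gap])
    return float(np.max(np.abs(filled - trace)))


narrow = chord_error(5)    # 10-degree gap
wide = chord_error(20)     # 40-degree gap
print(f"narrow gap error: {narrow:.4f}, wide gap error: {wide:.4f}")
```

For a gap of half-width h centred on the crest, the chord sits at A·cos(h) while the true crest is at A, so the error is A·(1 − cos h), which grows roughly with the square of the gap width for small gaps. That is why a narrow gap is harmless and a wide one visibly cuts the top off the curve.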

So the full decision had two axes, and they pointed in different directions. On cost, 1D wins by an order of magnitude and KNN doesn't even finish. On feature continuity through wide gaps, KNN is the safer fill. The honest engineering answer is not "always use the fast one" — it is "know which axis your data is going to punish you on, and benchmark that axis on the real workload." For narrow gaps where continuity is never at risk, the 23x-cheaper method is free quality; for the wide pad gaps that sit on top of the features a detector must trace, the extra compute buys correctness you can't interpolate your way to.

What this means for the preprocessing layer

The wider lesson is about where to spend optimisation effort in an ML pipeline. The imputation stage is upstream of everything — it runs on every well, every re-process, every time you tweak a patch size or an augmentation and have to regenerate inputs. A method that is 23x slower there is not a 23x tax on one experiment; it is a 23x tax on the iteration loop, multiplied by every well and every re-run. That is precisely the kind of cost that is invisible in a notebook and crippling in a production cadence. Profiling the preprocessing layer against the real well count, not a convenient slice, is what turns "it ran fine on my four metres" into a number you can actually plan a re-processing schedule around — and it is the same discipline we carry into every small-data subsurface engagement, across operators in the Middle East and the United States.

And the meta-point for anyone porting scikit-learn idioms into a geoscience pipeline: the friendliest API is not the cheapest algorithm. KNNImputer is two lines and feels principled. Underneath those two lines is a pairwise distance computation whose cost is quadratic in well length, and on a real well that asymptotic term is the difference between an eleven-second pass and a job you have to kill. Read the complexity, not the docstring — and benchmark on the workload you actually have to run, not the one that fits in a cell.

Key takeaways

Filling NaN gaps in high-resolution borehole image logs is a preprocessing benchmark, not a modelling experiment: both 1D periodic interpolation and a KNN imputer (k=5) produced acceptable fills on a short interval, so the deciding question was operational cost, not quality.
On a 4 m interval, 1D interpolation ran in ~0.115 s vs KNN's ~2.625 s — a ~23x gap. Scaled to a whole well, 1D finished a pass in ~11 s while KNN never finished at all.
The gap is algorithmic, not implementation noise: 1D interpolation is O(n) and vectorised; KNN imputation is dominated by a pairwise nearest-neighbour search that is super-linear (≈ quadratic in well length) on dense, mostly-complete image rows, so the same fact looks like 23x on a slice and an unbounded gap at well scale.
The trap is 'same quality': quality-equal at benchmark scale says nothing about cost-equal at production scale. Profile cost on the real workload first, then spend quality-comparison effort only on the methods that survive.
Cheapest still didn't win outright — through wide pad gaps the imputation question is whether a fracture's sinusoid stays continuous across the cut. 1D flattens wide gaps toward a chord; KNN borrows from rows on the same curve and stays continuous. Know which axis your data punishes, and benchmark that axis on the real well count.
