Every supervised model that ever read a borehole image log inherited a decision nobody wrote a paper about: what to put in the holes. An unwrapped high-resolution borehole image log covers only about 80% of the borehole wall — the pads and flaps that carry the microresistivity buttons leave wedge-shaped strips of nothing between them — and the tool writes a sentinel of -9999 into every unmeasured pixel, which the pipeline reads as NaN. In our work with a mid-sized Middle East carbonate operator across 14 vertical wells imaged with two different microresistivity imaging tools, those nulls were not a rounding-error nuisance. They sat directly on top of the features we were trying to detect, and whatever we chose to fill them with became part of the sinusoid a downstream detector would trace. This is the story of how we picked the fill — and why the fastest method lost.
At a glance
Three numbers frame the imputation bake-off, and the headline one is a trap.
- 1D interpolation, per 4 m interval: ~0.115 s
- KNN imputer, same interval: ~2.625 s
- 1D, whole-well pass: ~11 s
The hole problem is a data-engineering problem first
It is tempting to treat NaN-filling as a one-line fillna() and move on. On image logs that instinct quietly poisons everything downstream, because the gaps are not random missingness — they are structured, vertical, and aligned with the exact geometry the model has to learn.
Start with the physical layer. The high-resolution imaging tool reads at roughly a 30-inch depth of investigation; the compact microresistivity tool, the slimmer sibling used in tighter or horizontal sections, sees only about 0.90 inches. Both leave inter-pad strips unmeasured, and both code the absence as -9999. Unwrap the cylinder into a 2D image (azimuth on the horizontal axis, depth on the vertical) and the missing strips become vertical null bands that run the full height of the patch. A planar geological feature (a fracture, a bedding surface) crosses the borehole at an angle and therefore projects onto that unwrapped image as a sinusoid. The null band cuts straight through it.
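The geometry in the paragraph above can be written down directly. As a minimal sketch, assume a cylindrical borehole of radius r and a plane with dip δ and dip azimuth α; with depth positive downward, the plane meets the wall at z(θ) = z0 + r·tan(δ)·cos(θ − α). The function names, the pixel units, and the four evenly spaced 18-degree gaps (chosen only to reproduce roughly 80% coverage) are illustrative assumptions, not the real pad layout of either tool.

```python
import numpy as np

SENTINEL = -9999.0  # null code described in the text

def plane_trace_depth(theta_deg, z0, radius, dip_deg, dip_az_deg):
    """Depth at which a dipping plane crosses the borehole wall at each azimuth.

    With depth positive downward the plane is deepest in its dip direction,
    so the trace is a sinusoid of amplitude radius * tan(dip).
    """
    theta = np.radians(theta_deg)
    amplitude = radius * np.tan(np.radians(dip_deg))
    return z0 + amplitude * np.cos(theta - np.radians(dip_az_deg))

def synthetic_unwrapped_image(n_depth=200, n_az=360, gap_width=18, n_gaps=4):
    """Toy unwrapped image: one dark sinusoid plus sentinel-filled null bands."""
    az = np.arange(n_az, dtype=float)          # one column per degree of azimuth
    rows = np.arange(n_depth)[:, None]         # row index stands in for depth
    trace = plane_trace_depth(az, z0=100.0, radius=40.0, dip_deg=45.0, dip_az_deg=90.0)
    # Dark Gaussian ridge centred on the trace, on a background of 1.0.
    image = 1.0 - np.exp(-0.5 * ((rows - trace[None, :]) / 2.0) ** 2)
    # Hypothetical pad layout: evenly spaced, full-height gaps.
    for start in np.arange(n_gaps) * (n_az // n_gaps) + 30:
        image[:, start:start + gap_width] = SENTINEL
    return az, image, trace

az, raw, trace = synthetic_unwrapped_image()
image = np.where(raw == SENTINEL, np.nan, raw)   # map the sentinel to NaN
coverage = 1.0 - np.isnan(image).mean()          # fraction of the wall measured
```

On this toy, every null band is a block of all-NaN columns, and the sinusoid has to cross each of them.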
So the imputation question is unusually well-posed. We are not asking "what is a plausible pixel value here?" We are asking a sharper, testable thing: does the fill keep a sine wave a sine wave across the cut? A method that produces locally realistic texture but breaks the curve is worse than useless — it manufactures a false discontinuity that a detector will faithfully learn as a feature.
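One way to make "keeps a sine wave a sine wave" testable is sketched below. It is an illustrative scoring scheme, not the scoring used in the original work: it assumes the feature is the darkest pixel in each column, fits a + b·cos θ + c·sin θ to the traced depths in measured columns only, and reports how far the traced depths inside the gap fall from that fitted curve. All names are hypothetical.

```python
import numpy as np

def fit_sinusoid(theta_deg, depth):
    """Least-squares fit of depth = a + b*cos(theta) + c*sin(theta)."""
    t = np.radians(theta_deg)
    design = np.column_stack([np.ones_like(t), np.cos(t), np.sin(t)])
    coeffs, *_ = np.linalg.lstsq(design, depth, rcond=None)
    return coeffs

def continuity_rms(filled, gap_cols, theta_deg):
    """RMS distance (in rows) between the feature traced inside the gap and
    the sinusoid fitted to the measured columns only. Lower is better."""
    traced = np.argmin(filled, axis=0).astype(float)   # darkest row per column
    a, b, c = fit_sinusoid(theta_deg[~gap_cols], traced[~gap_cols])
    t = np.radians(theta_deg[gap_cols])
    predicted = a + b * np.cos(t) + c * np.sin(t)
    return float(np.sqrt(np.mean((traced[gap_cols] - predicted) ** 2)))

# Toy check: a perfect fill versus a fill that smears the left edge across the gap.
theta = np.arange(360.0)
rows = np.arange(200)[:, None]
true_trace = 100 + 40 * np.cos(np.radians(theta - 90))
truth = 1.0 - np.exp(-0.5 * ((rows - true_trace[None, :]) / 2.0) ** 2)

gap = np.zeros(360, dtype=bool)
gap[170:200] = True                      # band placed where the curve is steepest

flat = truth.copy()
flat[:, gap] = truth[:, [169]]           # feature held at one depth across the gap

good_score = continuity_rms(truth, gap, theta)
bad_score = continuity_rms(flat, gap, theta)
```

A fill that holds the feature at one depth across the band scores many rows worse than the true curve, even though every pixel it writes is a value that occurs in the real image.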
Why a generic imputer is the wrong default here
General-purpose tabular imputers optimise for marginal plausibility — each filled value should look like its column. An image-log gap needs the opposite: the fill must respect a global shape (the sinusoid) that spans both sides of the null band. Continuity, not plausibility, is the objective.
Before any of that, there was an ingestion layer to build. The operator's logs arrived in the industry's binary container format for wireline data (Digital Log Interchange Standard, API RP66), and the first engineering deliverable was a reader that pulls the dynamic and static image channels out of any such log, maps the -9999 sentinel to NaN, and emits a clean array the rest of the pipeline can consume. We deliberately wrote that ingestion as format-general boilerplate: it runs on any well log in that format, not just the wells in this engagement, which is what later let the same gap-filling backbone run unchanged on new wells as they were delivered. Even the QC lived in that layer: automated checks flagged sampling-interval gaps of roughly 7, 10, and 27 metres at three depth bands where the logging tool had simply skipped intervals, the kind of defect that silently corrupts a training set if a human has to catch it by eye.
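The two ingestion-side checks described above can be sketched in a few lines. This assumes the image channel and its depth index have already been read into NumPy arrays; the file reader itself is not shown, and the function names, the 0.5 m toy sampling step, and the `factor` threshold are illustrative assumptions.

```python
import numpy as np

SENTINEL = -9999.0  # null code described in the text

def clean_image_channel(channel):
    """Return a float copy of an image channel with the null sentinel as NaN."""
    out = np.asarray(channel, dtype=float).copy()
    out[out == SENTINEL] = np.nan
    return out

def flag_sampling_gaps(depth_m, factor=5.0):
    """Flag places where the depth index jumps by more than `factor` times
    the median sampling step.

    Returns a list of (top_depth, bottom_depth, gap_m) tuples.
    """
    depth_m = np.asarray(depth_m, dtype=float)
    steps = np.diff(depth_m)
    nominal = np.median(steps)
    jumps = np.flatnonzero(steps > factor * nominal)
    return [(float(depth_m[i]), float(depth_m[i + 1]), float(steps[i])) for i in jumps]

# Toy depth index with one skipped interval of 7 m.
depth = np.concatenate([np.arange(1000.0, 1010.0, 0.5), np.arange(1016.5, 1020.0, 0.5)])
gaps = flag_sampling_gaps(depth)

channel = clean_image_channel([[0.2, -9999.0], [0.4, 0.6]])
```

Using the median step as the nominal interval keeps the check independent of any one tool's sampling rate, which fits the format-general intent of the layer.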
The bake-off: four ways to fill a sinusoid
We evaluated four candidate fills, and treated continuity-across-the-gap as the pass/fail criterion rather than any pixel-wise error metric.
- 1D linear interpolation. For each row of the unwrapped image, draw a straight chord across the null band between the last real pixel on the left and the first on the right. Trivially fast. But a chord is a straight line, and the true curve through the gap is not straight — so on wide gaps it flattens the sinusoid and leaves tell-tale vertical-line artifacts at the band edges. It introduced exactly the stretching and edge artifacts we were trying to avoid.
- GAN inpainting (a GAIN-style generator). The fashionable choice, and the one that failed most instructively. The adversarial generator paints texture that looks locally like real borehole rock — but the adversarial loss never ties the two sides of the gap together, so the recovered curve exits the band at the wrong phase. It hallucinated plausible pixels and broke the feature. Continuity not preserved; rejected.
- Iterative imputation. Cycle a regressor over the missing entries until convergence. Continuity was acceptable. The problem was wall-clock: it cycles to convergence on every interval, which is a non-starter for a per-well batch pipeline.
- KNN imputation (k = 5). Fill each missing pixel from its five nearest neighbours in feature space, applied to both the dynamic and static image channels. Because the neighbours lie along the sinusoid, the fill interpolates along the curve rather than across it. Continuous, and cheap enough to run at scale.
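The first candidate in the list is simple enough to show in full. A minimal sketch of row-wise linear interpolation, assuming the image is a 2D NumPy array with NaN in the null bands (the function name is illustrative):

```python
import numpy as np

def fill_rows_linear(image):
    """Fill NaNs in each row by linear interpolation between the nearest
    measured pixels on either side. np.interp holds the edge value constant
    for gaps that touch the left or right image border."""
    filled = image.copy()
    cols = np.arange(image.shape[1])
    for r in range(image.shape[0]):
        missing = np.isnan(image[r])
        if missing.any() and not missing.all():
            filled[r, missing] = np.interp(cols[missing], cols[~missing], image[r, ~missing])
    return filled

toy = np.array([[0.0, np.nan, np.nan, 3.0],
                [5.0, np.nan, np.nan, 5.0]])
filled_toy = fill_rows_linear(toy)
```

Each row is filled independently, so nothing in the fill knows about the rows above or below it. That is what makes it fast, and also why it cannot follow a curve that changes depth as it crosses the band.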
Put the recovered fills side by side across the null band and the whole argument fits in one frame. KNN tracks the true curve; the GAN exits at the wrong phase, and that single discontinuity is the entire case against it; 1D flattens to a chord. KNN preserved feature continuity at the lowest compute of the continuous methods, and that, not raw speed, is why it won the classical pipeline.
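For readers unfamiliar with KNN imputation, here is a minimal NumPy sketch of the generic technique with k = 5. The text does not specify the feature space used in the original work, so this is one possible reading, stated as an assumption: depth rows are samples, azimuth columns are features, and the distance between two rows is the mean squared difference over the columns both rows have measured.

```python
import numpy as np

def knn_impute(X, k=5):
    """NaN-aware KNN imputation: rows are samples, columns are features.

    Each missing entry is filled with the mean of that column over the
    k nearest rows that measured it.
    """
    X = np.asarray(X, dtype=float)
    observed = ~np.isnan(X)
    Xz = np.where(observed, X, 0.0)
    filled = X.copy()
    for i in np.flatnonzero(~observed.all(axis=1)):      # rows with any NaN
        shared = observed & observed[i]                  # columns co-observed with row i
        n_shared = shared.sum(axis=1)
        sq = ((Xz - Xz[i]) ** 2 * shared).sum(axis=1)
        dist = np.where(n_shared > 0, sq / np.maximum(n_shared, 1), np.inf)
        dist[i] = np.inf                                 # a row is not its own donor
        order = np.argsort(dist)
        for j in np.flatnonzero(~observed[i]):
            donors = [d for d in order if observed[d, j] and np.isfinite(dist[d])][:k]
            if donors:                                   # stays NaN if no row measured column j
                filled[i, j] = X[donors, j].mean()
    return filled

X = np.array([[1.0, 2.0, 3.0],
              [1.0, 2.0, np.nan],
              [1.1, 2.1, 3.2],
              [10.0, 20.0, 30.0]])
filled = knn_impute(X, k=2)
```

Two properties of this naive form are worth noting. The all-pairs distance makes the cost grow quadratically with the number of rows. And under this reading a column that is null in every row has no donors and stays NaN, so the patch must include rows where that azimuth was measured.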
The number that looks decisive and isn't
Now the trap. On a representative 4 m interval, 1D interpolation filled the gaps in about 0.115 seconds; the KNN imputer took about 2.625 seconds on the same interval — roughly 23x slower. Scale that up and the gap widens brutally: 1D interpolation imputed a whole well in about 11 seconds, while a whole-well KNN pass never finished at all in our runs.
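The ratio is easy to check from the two per-interval figures, and a per-interval timing of this kind can be taken with a small harness. The helper below is a generic sketch, not the harness used for the reported numbers.

```python
import time

def time_call(fn, *args, repeats=3):
    """Best wall-clock time of fn(*args) over a few repeats, in seconds."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)
        best = min(best, time.perf_counter() - start)
    return best

# Per-interval figures reported in the text for one 4 m interval.
t_linear, t_knn = 0.115, 2.625
slowdown = t_knn / t_linear   # about 22.8, which rounds to the quoted 23x
```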
- Before (1D linear interpolation): ~0.115 s per interval, ~11 s whole-well. Fast, and it finishes a well, but it flattens the sinusoid and leaves vertical-line artifacts.
- After (KNN, k=5): ~2.625 s per interval, whole-well pass never finished. About 23x slower, but it keeps the sinusoid continuous.
KNN won anyway: continuity beat speed.
Read naively, this is an open-and-shut win for 1D. It is not, and the reason is the discipline at the centre of this whole programme: the fill is not the product — the detected geology is. A method that runs a well in 11 seconds but flattens every sinusoid crossing a gap hands the downstream detector corrupted training signal at exactly the depths that matter. KNN's 2.625 seconds buys a continuous curve. So the engineering response to "KNN can't finish a whole well" was not to fall back to the fast-but-wrong method — it was to make KNN tractable: run it on overlapping patches around real features rather than the full log, cache the fills, and parallelise across wells. The slow method becomes the operational one once you stop asking it to impute pixels nobody will ever train on.
The twist nobody expected: sometimes the cleanest fill is no fill
Here is the finding that reframed the entire effort. Once we had a clean, continuous KNN fill and could finally train supervised detectors on it, we ran the obvious control — train on the KNN-imputed image versus train on a non-imputed input where the nulls were simply set to zero, left as an honest "no data here" marker rather than a guessed pixel. The zero-filled, non-imputed input won for the supervised detector.
That is not a contradiction; it is the deeper lesson. Imputation was always solving a preprocessing problem (give classical curve-fitting and clustering a hole-free image to chew on), and for those methods continuity was decisive. But a modern detector with enough geological diversity in its training set can learn to read around an explicit zero band as easily as it learns to read a fracture — and feeding it a guessed fill, however continuous, adds a synthetic signal it then has to learn to distrust. Note the order of operations that makes the zero honest: the -9999 sentinel still has to be mapped out first, because a raw -9999 left in place is a valid float that wrecks normalisation. The choice was never "skip preprocessing" — it was "map the sentinel, then leave the gap as a clean zero instead of a fabricated curve." The imputation pipeline was not wasted. It was the scaffold that got us to a clean, well-scale dataset and a sharp, falsifiable question — and the answer told us where in the stack the holes actually needed filling, and where they did not.
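The order of operations matters enough to show on a toy array. In this sketch the sentinel is mapped to NaN, statistics are computed on measured pixels only, and the gap is then written as zero. The text does not say whether normalisation happened before or after zeroing in the original pipeline, so normalising first is an assumption here, as are the function names.

```python
import numpy as np

SENTINEL = -9999.0  # null code described in the text

def zero_fill_input(raw):
    """Map the sentinel out, normalise on measured pixels only, then mark gaps with 0."""
    image = np.asarray(raw, dtype=float)
    image = np.where(image == SENTINEL, np.nan, image)
    mean, std = np.nanmean(image), np.nanstd(image)
    normalised = (image - mean) / std
    return np.nan_to_num(normalised, nan=0.0)

raw = np.array([[1.0, 2.0, SENTINEL],
                [3.0, 4.0, SENTINEL]])

naive_mean, naive_std = raw.mean(), raw.std()   # sentinel left in place
clean = np.where(raw == SENTINEL, np.nan, raw)
clean_mean = np.nanmean(clean)                  # 2.5, from measured pixels only
model_input = zero_fill_input(raw)
```

With the sentinel left in place the mean of this toy is about -3331 and the standard deviation is in the thousands, so every real pixel collapses to nearly the same normalised value. Mapped out first, the measured pixels keep their contrast and the gap column is a clean zero.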
Why this transfers
The transferable asset here is not a magic imputer. It is a posture toward missing data on structured signals: define the objective the fill has to serve before you pick the fill. On image logs the objective is geometric continuity, which immediately disqualifies the locally-plausible generative methods that win on natural images and demotes the fast linear chord to a fallback. The same format-general ingestion, sentinel-mapping, and continuity-scored bake-off run on any operator's image logs. We have carried versions of this preprocessing backbone across subsurface engagements in the Middle East and the United States, because pad-based microresistivity image logs share the same structure: pads, inter-pad gaps, and a null sentinel that has to be mapped out.
The honest caveats: these results are from a single confidential carbonate reservoir, and the "zero-fill beats imputed" result is a supervised-learning finding that depends on having enough wells to teach a model to read around the gaps — on a two-well cold start, the continuous fill still earns its keep. The compute markers are from our runs on this operator's logs; they are method-level, not a benchmark you should quote against a different tool stack.
What the imputation bake-off actually taught us
- On image logs the imputation objective is geometric continuity, not pixel plausibility — a gap cuts a sinusoid, and the fill becomes part of the curve a detector traces, so a continuity-preserving KNN fill (k=5) beat a ~23x-faster 1D interpolation and a locally-realistic-but-curve-breaking GAN.
- Speed is a decoy: 1D ran a whole well in ~11 s and KNN never finished one, but the right engineering move was to make the continuous method tractable (patch-level, cached, parallel), not to ship the fast method that flattens features.
- Build the missingness handling into a format-general ingestion + QC layer that maps the -9999 sentinel, flags sampling gaps, and runs on any well log in the digital log format — then test the fill empirically, because for a well-trained supervised detector a clean zero-fill (sentinel mapped out, gap left as an honest zero) beat every guessed imputation.
References
- NaN-imputation method comparison (1D interpolation, KNN, iterative, GAN/GAIN), continuity-vs-compute findings, and the ~0.115 s / ~2.625 s / ~11 s figures derived from internal preprocessing experiments on a 14-well Middle East carbonate dataset acquired with two different microresistivity imaging tools; data and code withheld under operator confidentiality.
- The binary wireline log format: Digital Log Interchange Standard (API RP66), the binary container format for wireline log data, including microresistivity image channels.