A high-resolution borehole image log is a picture with holes cut in it before anyone runs a model. The tool presses microresistivity buttons against the rock on pads and flaps, and those pads reach only part of the way around the wall. In an 8-inch hole the array reads about 80% of the circumference, so roughly 20% of every unrolled image is inter-pad blank strip where no electrode ever touched. That missing fraction is not a defect to apologise for. It is the shape of the data, fixed by tool geometry, and it happens to be the shape a whole family of self-supervised methods spends effort manufacturing on purpose. We saw the connection in December 2021, and it has organised how we think about pretraining on well logs since. This is the argument for treating the image log as what it already is: a masked image.
The tool hands you a hole, not a picture
A classic image-log tool carries 192 button electrodes on its pads and flaps, imaging the rock at about 0.2-inch (5-mm) nominal resolution with a 0.1-inch sampling step and a depth of investigation near 30 inches. Unroll the cylindrical borehole wall into a flat strip and you get an azimuth-by-depth raster of resistivity. But because the pads do not wrap the full circumference, the strip is not solid: vertical bands run down it where the tool measured nothing, coded with a sentinel value that preprocessing converts to NaN. Across the wells we worked with, image-log coverage ranged from roughly 45% to 85%, so on some intervals more than half the raster is blank.
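A minimal sketch of how coverage can be read straight off the preprocessed raster, assuming it is a 2-D NumPy array with depth along rows, azimuth along columns, and NaN wherever no electrode touched. The function names and the toy array are illustrative, not taken from our pipeline.

```python
import numpy as np

def coverage_fraction(raster):
    """Fraction of raster cells that hold a real measurement (not NaN)."""
    return float(np.isfinite(raster).mean())

def coverage_by_depth(raster):
    """Coverage per depth row; rows are depth samples, columns azimuth bins."""
    return np.isfinite(raster).mean(axis=1)

# Toy raster (assumed layout): 6 depth rows x 10 azimuth bins,
# with two inter-pad strips coded as NaN.
raster = np.ones((6, 10))
raster[:, 2] = np.nan
raster[:, 7] = np.nan

print(coverage_fraction(raster))   # 0.8
print(coverage_by_depth(raster))   # one value per depth row
```

Tracking coverage per depth row rather than per well is what exposes the intervals where more than half the raster is blank.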
Every team that trains a network on these images has to decide what to do with those blanks, because a convolutional or transformer network has no built-in notion of "not measured" and will fit a sentinel if you feed it one. The usual move is to fill the gaps and move on. We ran a bake-off on exactly that question in the same period, comparing 1-D interpolation along depth against a KNN imputer and a GAN-based filler; the practical verdict was that cheap interpolation was the only method that finished across a whole well while the heavier methods stalled. That work treated the blank strips as a nuisance to paper over so the supervised model could see a clean image.
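For concreteness, here is a hedged sketch of what 1-D interpolation along depth can look like. It is not the bake-off code: it assumes a NaN-coded raster with depth along axis 0, and the function name is hypothetical. Each azimuth column is filled independently by linear interpolation between its measured samples.

```python
import numpy as np

def fill_along_depth(raster):
    """Linearly interpolate NaN cells down each azimuth column (axis 0 = depth)."""
    filled = raster.copy()
    depth_idx = np.arange(raster.shape[0])
    for col in range(raster.shape[1]):
        values = raster[:, col]
        known = np.isfinite(values)
        if known.sum() < 2:
            # Fewer than two measurements in this column: leave it untouched.
            continue
        # np.interp holds the first/last known value constant beyond the ends.
        filled[:, col] = np.interp(depth_idx, depth_idx[known], values[known])
    return filled

toy = np.array([[1.0, np.nan],
                [np.nan, np.nan],
                [3.0, np.nan]])
print(fill_along_depth(toy))
```

The cost is one pass per column, which is consistent with it being the cheap option, and the sketch only helps where a column has measurements above and below the gap; a column that is blank over the whole interval stays blank.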
The forward-looking read is different. The blanks are not noise to be removed before learning. They are the learning signal.
What masked autoencoders make on purpose
In November 2021 a paper landed titled Masked Autoencoders Are Scalable Vision Learners, and its v2 went up on 19 December 2021 [1]. The recipe is short. Cut an image into patches. Hide most of them; the paper masks 75% of the patches at random. Feed only the visible 25% to an encoder, then hand the encoder's output plus placeholder tokens for the missing patches to a small decoder whose whole job is to reconstruct the pixels that were removed. Train the network to predict the masked patches from the visible ones. The encoder that falls out of this has learned a representation good enough that, once fine-tuned, a ViT-Huge reaches 87.8% top-1 on ImageNet-1K. Because the encoder only ever processes the visible quarter of each image, pretraining runs more than 3x faster than working on the full image.
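The masking step of that recipe is small enough to sketch in NumPy. This is an illustration of random patch masking only, not the paper's implementation: the encoder and decoder are omitted, and the helper names, patch size, and image size are our assumptions.

```python
import numpy as np

def patchify(image, patch):
    """Split an (H, W) image into non-overlapping patch x patch tiles, flattened."""
    h, w = image.shape
    assert h % patch == 0 and w % patch == 0
    tiles = image.reshape(h // patch, patch, w // patch, patch).swapaxes(1, 2)
    return tiles.reshape(-1, patch * patch)

def random_mask(n_patches, mask_ratio, rng):
    """Return (keep_idx, hidden); hidden[i] is True for patches that are masked."""
    n_keep = int(round(n_patches * (1.0 - mask_ratio)))
    order = rng.permutation(n_patches)
    keep_idx = np.sort(order[:n_keep])
    hidden = np.ones(n_patches, dtype=bool)
    hidden[keep_idx] = False
    return keep_idx, hidden

rng = np.random.default_rng(0)
image = rng.normal(size=(32, 32))            # stand-in for a real image
patches = patchify(image, patch=8)           # 16 patches of 64 pixels each
keep_idx, hidden = random_mask(len(patches), 0.75, rng)

encoder_input = patches[keep_idx]            # only the visible quarter
targets = patches[hidden]                    # pixels the decoder must predict
print(encoder_input.shape, targets.shape)    # (4, 64) (12, 64)
```

The speed argument is visible in the shapes: the encoder receives 4 of 16 patches, so its cost scales with the visible quarter rather than the whole image.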
Two design decisions in that recipe sit next to a borehole image. The mask ratio is high because images are spatially redundant and an easy hole is filled by copying a neighbour; hiding 75% forces the network to learn scene structure rather than local texture. And the encoder never sees the masked positions at all, which is what makes it cheap. Now look at the image log again. It arrives with a fixed fraction of its patches already missing, carved out by tool geometry rather than a random-masking function. The reconstruction target the MAE builds by throwing patches away is the target the log ships with.
The ratios line up, and that is the point
Put the two mask fractions on one axis and the comparison is direct. The tool leaves about 20% of the wall blank; the MAE default hides 75% of patches. Those are not the same number, and the honest version of this argument does not pretend they are. What matters is that both live in the same regime: a large, structured fraction of the image is absent, and the task is to reason about the whole from the visible part. Across the coverage range we saw, from 45% to 85%, a borehole image is missing anywhere from 15% to 55% of its raster, which sweeps most of the band that masked-image modelling deliberately targets.
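One hedged way to see how a single tool can produce a range of missing fractions is a constant-arc model. The assumption, which is ours and not a tool specification, is that the pads cover a fixed arc length of wall, anchored at the one quoted figure of 80% in an 8-inch hole, so coverage falls as the hole gets wider. Real coverage also depends on things this ignores, such as pad contact and hole shape, and we are not claiming hole size explains the range in our wells.

```python
def coverage_at(diameter_in, ref_diameter_in=8.0, ref_coverage=0.80):
    """Coverage under the constant-covered-arc assumption.

    covered arc = ref_coverage * pi * ref_diameter_in, and
    coverage    = covered arc / (pi * diameter_in); the pi cancels.
    Capped at 1.0 because the pads cannot cover more than the full wall.
    """
    return min(1.0, ref_coverage * ref_diameter_in / diameter_in)

def missing_fraction(diameter_in):
    """Share of the unrolled image that is inter-pad blank strip."""
    return 1.0 - coverage_at(diameter_in)

MAE_MASK_RATIO = 0.75  # default mask ratio from the MAE paper

for d in (8.0, 10.0, 12.25):   # illustrative hole diameters in inches
    print(f"{d:5.2f} in  missing {missing_fraction(d):.2f}  vs MAE {MAE_MASK_RATIO:.2f}")
```

Under this model the physical mask ratio is a function of the hole rather than a constant, which is one more reason to treat the two numbers as sitting in the same regime rather than as equal.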
That resemblance is not cosmetic. The pretext task for well-log pretraining is not something you have to invent; the geometry already sets it up. Ask a network to reconstruct the inter-pad strips from the electrode-covered ones and you are asking it to learn the spatial structure of carbonate texture, bedding continuity, and fracture geometry that a downstream detector needs. No geologist labels anything for that step, which is the whole appeal, because labels are the scarce resource here.
Why the scarcity changes the default
The supervised default in subsurface machine learning is label-hungry by habit. You collect wells, pay a petrophysicist to pick sinusoids and tag features, and train a detector on the picks. That worked for us in the end, but the ceiling is set by how many labelled wells exist, and in a single field that number is small. Our own ablations showed a detector's classification error collapsing as labelled wells were added, from the 90%-plus range at three wells toward a couple of percent by nine to eleven wells. Labels, not images, are the constraint: a field has kilometres of unlabelled image log in an archive and a handful of intervals a human has actually interpreted.
Self-supervised reconstruction inverts which resource you spend. Pretraining on the inter-pad masking task consumes unlabelled well-kilometres, of which there are many, and produces an encoder that a small labelled set then fine-tunes. You trade a resource you have in bulk for one you are short of. The MAE result that a strong representation transfers from a masked-reconstruction pretext is the evidence this trade is available in vision, and the geometry of the image log is the argument it is available here without the machinery you would normally build to create the mask.
We read the MAE paper within weeks of its release, while the imputation bake-off was still running, and the two threads pointed in opposite directions: one working to erase the blanks so a supervised model could ignore them, the other suggesting the blanks were the free supervision we had been stepping over. Our own records from that period already list a masked-reconstruction contender inside the imputation comparison, which is how close the idea sat to the surface at the time.
Where the analogy stops
The match is structural, not exact, and treating it as exact would be a mistake. Three differences bite. First, the masking is not random. MAE hides patches uniformly at random and draws a fresh mask each time it sees an image, which is part of why the encoder generalises, while the image-log blanks are fixed vertical strips at tool-set azimuths, correlated down the whole well. A network pretrained on that fixed mask alone could learn the strip layout instead of the rock, so you add random masking on top of the physical one, and the model rarely sees the same pattern of holes twice.
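A sketch of what "random masking on top of the physical one" could look like at the patch level. It assumes a boolean array marking which patches the tool actually measured; the function name and the 50% hide ratio are illustrative choices, not settings we have tuned.

```python
import numpy as np

def sample_training_mask(measured, hide_ratio, rng):
    """Draw a fresh random mask restricted to measured patches.

    measured: bool array, True where the tool recorded data.
    Returns (visible, hidden), both boolean and both subsets of `measured`.
    """
    candidates = np.flatnonzero(measured)
    n_hide = int(round(len(candidates) * hide_ratio))
    hide_idx = rng.choice(candidates, size=n_hide, replace=False)
    hidden = np.zeros(measured.shape, dtype=bool)
    hidden[hide_idx] = True
    visible = measured & ~hidden
    return visible, hidden

rng = np.random.default_rng(0)
# Toy row of 10 patches: the last two fall in an inter-pad strip.
measured = np.array([True] * 8 + [False] * 2)

for step in range(3):                      # a new mask on every draw
    visible, hidden = sample_training_mask(measured, 0.5, rng)
    print(step, hidden.astype(int))
```

Because the random draw changes on every call while the physical strips do not, the only stable pattern left for the network to exploit is the rock itself.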
Second, the blank strips are genuinely unmeasured, not merely hidden. In MAE the masked pixels exist and are the reconstruction target; the inter-pad rock was never measured, so there is no ground truth there. Pretraining therefore has to hide visible, measured patches and reconstruct those; the physical blanks stay out of the loss and serve only as a reminder that partial coverage is the native state. Third, resolution caps what any of this recovers: at a 0.1-inch sampling step a raster cell spans a few millimetres of depth, so the pretext task can teach carbonate texture and continuity but cannot conjure detail finer than the buttons resolved.
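The second difference translates into a constraint on the loss, sketched below under our own assumptions: predictions and targets are arrays of flattened patches, and a boolean array marks the patches that were measured and then deliberately hidden. Physically blank patches never enter the average, so their NaN targets cannot contaminate it.

```python
import numpy as np

def masked_mse(pred, target, hidden):
    """Mean squared error over deliberately hidden, measured patches only.

    pred, target: (n_patches, patch_dim) arrays.
    hidden: (n_patches,) bool, True where a measured patch was masked out.
    """
    if not hidden.any():
        return 0.0
    diff = pred[hidden] - target[hidden]
    return float(np.mean(diff ** 2))

# Toy case: patch 0 visible, patch 1 hidden, patch 2 an inter-pad blank.
target = np.array([[1.0, 1.0],
                   [2.0, 2.0],
                   [np.nan, np.nan]])
pred = np.zeros_like(target)
hidden = np.array([False, True, False])

print(masked_mse(pred, target, hidden))   # 4.0
```

Scoring only the hidden patches mirrors the MAE setup, where the loss is computed on masked patches; the extra step here is that the hidden set is drawn from measured rock alone.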
Limitations
This piece argues an alignment, not a benchmarked result. We have not published a controlled comparison here of MAE-style pretraining against the supervised detector on the same held-out wells, and the two mask-ratio numbers set side by side come from different sources: tool coverage from the image-log brochure, 75% from the public MAE paper. The MAE transfer numbers, the ViT-Huge 87.8% and the 3x-plus speedup, are ImageNet-scale photographic results, not measured on borehole imagery; whether they carry over to a small, single-field, resistivity-imaging regime is the open question, and the coverage-versus-mask alignment is a reason to test it, not evidence that it works. The coverage range and imputation timings come from one engagement with a mid-sized Middle East carbonate operator and are not universal tool behaviour.
References
[1] K. He, X. Chen, S. Xie, Y. Li, P. Dollár, R. Girshick. Masked Autoencoders Are Scalable Vision Learners. arXiv:2111.06377 (v2), 19 December 2021. https://arxiv.org/abs/2111.06377