Unsupervised segmentation is seductive because it asks for no labels, and brutal because it gives you no control. Point a Kim-style differentiable feature-clustering CNN at a borehole image log and it will, eventually, partition the image into regions of self-similar texture — but "eventually" can mean five hundred backpropagation iterations, and the partition it settles on is just as likely to carve the rock into thirty meaningless colour blobs as into the handful of classes a petrophysicist would recognise. In a roughly twenty-month engagement with a mid-sized Middle East carbonate operator, the early Phase-1 pipeline for image logs from two different microresistivity imaging tools leaned on exactly this family of methods, and exactly this pair of problems — slow convergence and runaway over-segmentation — stalled it. The fix was almost embarrassingly classical: stop drawing scribbles by hand and let a Hough transform draw them for you. Auto-scribbles cut convergence from 500 iterations to 50 — a clean 10x — and the boundaries came out better than a human annotator's. This is a piece about why, and about a recurring lesson in applied computer vision: the cheapest, oldest algorithm in your toolbox is often the right way to inject the prior a modern network cannot discover on its own.
The method, and where it fragments
The backbone here is the unsupervised segmentation-by-backpropagation formulation of Kim et al. (2020). A small CNN maps every pixel to a feature vector, an argmax over channels assigns each pixel a cluster label, and the network is trained (on a single image, with no ground truth) to satisfy two competing losses backpropagated together. A feature-similarity term pushes look-alike pixels toward the same label; a spatial-continuity term, scaled by a weighting factor written μ, penalises label changes between neighbouring pixels, pressuring the segmentation toward large, contiguous regions rather than salt-and-pepper noise.
The class count is not fixed in advance; it emerges from the balance of those two losses. That is the method's elegance and its liability. Crank μ up and the continuity term forces neighbours to agree, collapsing the class count; crank it down and feature similarity wins, fine texture survives, and the image shatters into dozens of labels. There is no a-priori "correct" μ, and on a textured high-resolution borehole image log the wrong choice is catastrophic.
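To make the tug-of-war concrete, here is a minimal sketch of the two losses on a tiny response map, in plain Python. It assumes the similarity term is a cross-entropy against each pixel's own argmax label and the continuity term is an L1 difference between neighbouring responses, which is how we read Kim et al. The function name `clustering_losses`, the nested-list layout, and the per-pixel averaging are illustrative assumptions, not the pipeline's actual code.

```python
import math


def clustering_losses(response, mu):
    """Return (total, similarity, continuity, labels) for an H x W x C response map.

    `response` is a nested list of per-pixel channel responses (hypothetical layout).
    """
    h, w = len(response), len(response[0])

    # Feature similarity: cross-entropy of each pixel against its own argmax label.
    sim = 0.0
    labels = []
    for row in response:
        label_row = []
        for vec in row:
            m = max(vec)
            exps = [math.exp(v - m) for v in vec]
            c = max(range(len(vec)), key=vec.__getitem__)
            label_row.append(c)
            sim -= math.log(exps[c] / sum(exps))
        labels.append(label_row)
    sim /= h * w

    # Spatial continuity: L1 difference to the right-hand and lower neighbour.
    con = 0.0
    for y in range(h):
        for x in range(w):
            if x + 1 < w:
                con += sum(abs(a - b) for a, b in zip(response[y][x + 1], response[y][x]))
            if y + 1 < h:
                con += sum(abs(a - b) for a, b in zip(response[y + 1][x], response[y][x]))
    con /= h * w

    # mu is the global dial: it scales how much neighbour disagreement costs.
    return sim + mu * con, sim, con, labels
```

On a map where every pixel has the same response, the continuity term is zero and μ has no effect. On a checkerboard of alternating responses, the continuity term is positive and raising μ raises the total, which is the pressure that collapses the class count.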
We measured this on a public benchmark well — the Utah FORGE deep geothermal site, roughly 350 km south of Salt Lake City and 16 km north-northeast of Milford, where crystalline bedrock sits near 500 m depth — precisely so the experiment could be reported without touching client data. Sweeping μ across 0.1, 0.5, 1.0, 2.0, 3.0 on one input image walked the class count down a jagged staircase — 31, 31, 20, 12, 19 labels — while a second image traced its own descent, 39, 28, 19, 8, 7. μ is a blunt global dial: no single setting reliably lands on the two-or-three classes (vugs, sinusoids, background lithology) the geology actually contains.
A smaller engineering choice compounded the problem: grayscale versus raw input. Comparing the two on the same images (20 versus 18 class labels on one, 19 versus 12 on the other), grayscale clustered more cleanly in each case. The colourmap that makes an image log legible to a human injects channel variance the feature-similarity loss happily over-fits; strip it to a single luminance channel and the network has less spurious structure to chase. It is the kind of pre-processing decision that never appears in a headline result but decides whether the pipeline ships.
Scribbles: the right idea, the wrong ergonomics
The standard escape hatch is weak supervision via scribbles: the user paints a few strokes of label onto the image and a scribble loss nails those pixels to their assigned labels while the unsupervised losses fill in the rest. It is the lightest-touch supervision imaginable, and it works — with a few hand-drawn scribbles and an aggressive continuity weight of μ=10, the segmentation converged in 50 iterations instead of 500, collapsing to a sane 11 and 11 labels on the first image and 10 and 7 on the second.
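The scribble term itself is small. The sketch below assumes it is a cross-entropy evaluated only on the pixels a stroke touches, with every other pixel left to the unsupervised losses. The name `scribble_loss` and the use of `None` to mark unscribbled pixels are illustrative conventions, not taken from the original code.

```python
import math


def scribble_loss(response, scribbles):
    """Mean cross-entropy over scribbled pixels only.

    response:  H x W x C nested lists of per-pixel channel responses.
    scribbles: H x W nested lists holding an int label, or None where unscribbled.
    """
    total, count = 0.0, 0
    for resp_row, scr_row in zip(response, scribbles):
        for vec, target in zip(resp_row, scr_row):
            if target is None:
                continue  # unscribbled pixels contribute nothing to this term
            m = max(vec)
            log_norm = m + math.log(sum(math.exp(v - m) for v in vec))
            total += log_norm - vec[target]  # -log softmax(vec)[target]
            count += 1
    return total / count if count else 0.0
```

A pixel whose response already favours its scribbled label costs almost nothing, while one that disagrees with the stroke is penalised heavily. That is how a handful of strokes can steer the whole partition from the first iteration.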
So the scribble loss already buys the 10x. Why not stop there?
Because hand-drawn scribbles do not survive contact with production. A scribble is a per-image manual annotation. Across 80-plus wells and kilometres of image log, asking a geoscientist to paint strokes onto every two-metre interval is precisely the interpretation bottleneck the programme exists to remove — it puts the human back in the loop at the worst place, scales linearly with footage, and bakes one annotator's hand into the result. And there is a subtler failure: a human scribbles where the labels are obvious, not where the boundaries are hard. The continuity loss most needs a hint along the faint, low-contrast edge of a dipping bed — the place a tired eye skips — so hand scribbles guide the network toward easy interiors and leave the diagnostic boundaries under-constrained.
What we needed was a scribble generator that was automatic, deterministic, and biased toward the geological structure that matters. On a borehole image, that structure has a name and a shape.
Why Hough is the natural scribble generator
Every planar feature that cuts a borehole — a bed, a fracture, a fault — unrolls into a sinusoid on the image log, with the curve's amplitude encoding dip and its phase encoding azimuth. That is the entire geometric content of the image. A scribble that traces those sinusoidal edges is a scribble placed exactly where the continuity loss is starved. And detecting straight or near-straight bright/dark edges in an image is the oldest trick in computer vision.
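The geometry can be written down in a few lines. This sketch assumes a cylindrical borehole of known radius, depth measured positive downward, and the convention that the deepest point of the trace lies at the dip azimuth. The function name `plane_trace` and its parameters are hypothetical.

```python
import math


def plane_trace(depth_centre, dip_deg, dip_azimuth_deg, radius, n_cols=360):
    """Depth of a dipping plane at each azimuthal column of the unrolled image.

    Returns (amplitude, trace). Peak-to-trough height is 2 * amplitude,
    which equals the borehole diameter times tan(dip).
    """
    amplitude = radius * math.tan(math.radians(dip_deg))
    azimuth = math.radians(dip_azimuth_deg)
    trace = []
    for col in range(n_cols):
        phi = 2 * math.pi * col / n_cols  # azimuth of this image column
        trace.append(depth_centre + amplitude * math.cos(phi - azimuth))
    return amplitude, trace
```

For example, `plane_trace(1000.0, 45.0, 90.0, 0.1)` gives an amplitude of about 0.1 m with the deepest point in column 90. A horizontal bed (zero dip) degenerates to a straight line across the image.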
The auto-scribble pipeline is three classical stages, all in OpenCV, no learning required:
- Canny edge detection to reduce the grayscale image to a binary edge map — run at thresholds 200/255 with an aperture size of 7, after a median blur of kernel size 5 to suppress the pad-array speckle that would otherwise fire spurious edges.
- Probabilistic Hough line transform (HoughLinesP) to vote those edge pixels into line segments: distance resolution ρ=1, angular resolution θ=π/180, an accumulator threshold of 30 votes, a minimum line length of 30 pixels, and a maximum gap of 5 pixels to bridge breaks along a single feature.
- Rasterise the surviving segments back onto the image as scribble strokes, which then drive the scribble loss in exactly the same way a human's strokes would.
The Hough transform is a voting scheme that maps each edge pixel into a parameter space (here, the (rho, theta) of candidate lines) and accumulates evidence; peaks in the accumulator correspond to lines supported by many collinear edge pixels. The probabilistic variant (HoughLinesP) returns finite line segments with explicit endpoints rather than infinite lines, and is robust to gaps and noise. It is the right primitive precisely because it is a global voting scheme: it does not care that a sinusoid is locally curved or broken by a vug or a washout. As long as enough collinear edge pixels vote for a segment, the line survives, and a low-curvature stretch of a dipping bed reads, locally, as a line. The result is a set of strokes laid along the real structural edges, generated in milliseconds, identically every run.
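The voting idea fits in a dozen lines of plain Python. This is a toy version of the standard (non-probabilistic) accumulator, written only to show why a gap in the edge does not break detection. It is not how HoughLinesP is implemented, and `hough_peak` is a hypothetical name.

```python
import math
from collections import Counter


def hough_peak(points, theta_step_deg=1):
    """Vote each (x, y) edge pixel into a (rho, theta) accumulator.

    Returns (rho, theta_deg, votes) for the strongest line, using the
    normal form rho = x*cos(theta) + y*sin(theta).
    """
    votes = Counter()
    for x, y in points:
        for theta_deg in range(0, 180, theta_step_deg):
            t = math.radians(theta_deg)
            rho = round(x * math.cos(t) + y * math.sin(t))
            votes[(rho, theta_deg)] += 1  # one vote per candidate line through the pixel
    (rho, theta_deg), count = votes.most_common(1)[0]
    return rho, theta_deg, count
```

Feed it a horizontal edge at y=5 with a ten-pixel hole in the middle and two stray noise pixels, and the peak still lands on rho=5, theta=90 with every surviving edge pixel voting for it. The hole only lowers the vote count; it does not move the peak.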
What the auto-scribbles bought
Swapping hand-drawn strokes for Hough-generated ones, and dropping the continuity weight back to a gentle μ=1, the network still converged in 50 iterations — the 10x holds — but now produced 17 and 11 labels on the first image and 18 and 12 on the second. On a naive reading those counts are higher than the hand-scribble run; it looks like a regression. It is the opposite. The hand-scribble result needed μ=10 — a continuity weight so aggressive it bulldozes genuine boundaries to force the class count down. The Hough result stabilises at μ=1, where boundaries are preserved rather than smeared, and the extra labels are real distinctions the over-smoothed run had erased. The bake-off finding was unambiguous: the auto-scribble boundaries were cleaner than the hand-made ones. A classical edge detector, placing strokes along the structure the loss most needed, out-annotated the human — and removed the human from the loop entirely.
Why is this a 10x and not merely a nice trick? Left alone, the unsupervised losses spend hundreds of iterations groping toward a decomposition the scribbles hand them on iteration one. Seeding the optimisation with a structurally-correct prior does not just speed the same answer up; it changes which basin of the loss landscape the network falls into. The continuity term stops fighting the feature term across the whole image and instead refines a partition that is already roughly right along the boundaries that matter. Convergence and quality improve together because they were never independent.
The engineering lesson
The instinct in a deep-learning shop is to fix a model problem with more model: a heavier backbone, a learned edge detector, an attention module to find the boundaries. Here the leverage ran the other way — a decades-old line-voting algorithm and a median blur, wired in as a prior the network could not be trusted to discover for itself on a single unlabelled image. The generator is deterministic, has nothing to train, costs nothing at inference, and scales to every well in the archive without a human touching it. That is what turns a clever unsupervised demo into something you can run across 80-plus wells of an operator's data.
This early image-log work, done for a Middle East carbonate operator we partnered with, sat upstream of the supervised fracture-detection models the engagement ultimately shipped: the unsupervised pass taught us cheaply where the structure lived before a single label was drawn. The through-line, from this Phase-1 pipeline to the production detector, is constant: respect the physics of the image, and reach for the cheapest algorithm that encodes it.
Key takeaways
- Kim-style differentiable feature-clustering CNNs segment image logs with no labels, but their class count is an emergent product of a feature-similarity loss versus a spatial-continuity loss weighted by mu — and the wrong mu over-fragments (a mu sweep of 0.1-3.0 walked one FORGE-well image from 31 down to 12 labels and back to 19).
- Pre-processing matters more than it looks: grayscale input clustered more cleanly than raw RGB (20 vs 18; 19 vs 12 labels) because the borehole-image-log colourmap injects channel variance the similarity loss over-fits.
- Hand-drawn scribbles already convert 500 iterations to 50 (a 10x), but they are a per-image manual annotation that does not scale across 80+ wells and tends to guide the network toward easy interiors rather than the faint boundaries the continuity loss actually needs.
- Hough-transform auto-scribbles — Canny edges (200/255, aperture 7, after a median blur of 5) fed to a probabilistic Hough line detector (rho 1, theta pi/180, threshold 30, min length 30, max gap 5) — place strokes along the sinusoidal structural edges automatically, deterministically, and at zero inference cost.
- The auto-scribble run held the 10x speed-up at a gentle mu=1 and produced cleaner boundaries than the hand-made strokes (which needed an aggressive mu=10 that smears real edges) — a classical CV primitive out-annotating a human and removing the human from the loop.
References
[1] Kim, W., Kanezaki, A., and Tanaka, M. Unsupervised Learning of Image Segmentation Based on Differentiable Feature Clustering. IEEE Transactions on Image Processing (2020). The segmentation-by-backpropagation formulation with the spatial-continuity loss and scribble supervision used here. https://arxiv.org/abs/2007.09990
[2] Canny, J. A Computational Approach to Edge Detection. IEEE Transactions on Pattern Analysis and Machine Intelligence (1986). The edge detector that feeds the Hough stage. https://ieeexplore.ieee.org/document/4767851