The Unsupervised Track We Ran in Parallel, and Why We Retired It

Most write-ups of an ML engagement narrate a straight line: pick the method, tune it, ship it. Ours was not straight. Through the second phase of work with a mid-sized Middle East carbonate operator, we ran two methods at the same time, on the same borehole image logs, toward the same target of picking fractures and beddings off the borehole wall. One was unsupervised. One was a supervised transformer. We paid for both, in engineer-hours and in GPU rent, and then we killed one of them. This is the account of why we funded two hypotheses at once, how we decided which one to retire, and what running both actually cost.

Why two tracks and not one

At the start of Phase 2 the honest state of the problem was that we did not yet have enough trustworthy labels to bet the whole engagement on a supervised model. Early well deliveries carried raw sinusoid picks without the tool-angle and well-deviation corrections that turn an apparent dip into a true dip, so the supervised head had nothing dependable to regress against on those wells. Betting everything on supervision would have stalled the moment label quality slipped. Betting everything on the unsupervised path would have capped us at whatever geometry-first clustering could recover with no ground truth at all.

So we split the effort. The unsupervised track leaned on a stack of classical and self-supervised methods: Hough auto-scribbles to seed candidate lines, DBSCAN with hierarchical clustering to group them, SCOPS self-supervised co-part segmentation to try to learn structure without labels, and path-opening morphology that swept roughly 100 threshold values at about two minutes per two-metre interval to recover sinusoids where line-fitting saw only noise. The supervised track was a DETR-style detection transformer that reads a patch and emits fracture and bedding parameters directly. The mechanics of each of those methods have their own write-ups; for the unsupervised clustering specifically, see our piece on why DBSCAN and hierarchical clustering could not cleanly separate sinusoids. What this piece is about is the portfolio decision that sat on top of them.

Running two tracks is not indecision. It is buying an option. As long as we were unsure which hypothesis would hold, the cheapest insurance was to keep both alive until the evidence forced a choice, then stop paying for the one that lost.

What forced the choice

The evidence that forced it was label maturity, and it showed up as a recall curve.

Once corrected dip and azimuth labels started arriving and the labelled-well count grew, the supervised transformer stopped guessing and started learning. Fracture recall at a five-centimetre depth threshold climbed from about 10 percent when only three wells were labelled, to about 40 percent at eight wells, to about 75 percent at sixteen wells. That is not a tuning gain. It is the signature of a model that was starved of data and then fed. The three-well model got essentially every sinusoid wrong; the sixteen-well model was a usable detector.

The unsupervised track had no equivalent lever. It never trained on labels, so adding labelled wells did nothing for it. It sat where its geometry and its thresholds put it, and no amount of maturing ground truth moved that ceiling. The two tracks were on different curves: one that climbed with data and one that did not.

A fork diagram of one program decision: a shared Phase-1 trunk splits at Phase 2 into two lanes run in parallel, an unsupervised path (Hough auto-scribbles, DBSCAN plus hierarchical clustering, SCOPS self-supervised, and path-opening morphology sweeping 100 threshold values at about 2 minutes per 2 m interval) and a supervised DETR transformer. Drag the labelled-well lever from 3 to 16 and the supervised recall staircase climbs 10 to 40 to 75 percent at 5 cm, sourced, while the unsupervised lane holds a flat plateau that no tuning lifts because it never trained on labels. Once the supervised curve clears the plateau, the unsupervised lane is deliberately retired, the single orange kill marker being the only element that argues. Funding both tracks through Phase 2 is priced on the right: 1,200 GPU/MLOps hours budgeted against 2,600 actual, a 2.2x overrun, plus a USD 53,300 extra-infrastructure ask. Sourced: the unsupervised method stack and its threshold sweep, the 10/40/75 percent recall at 3/8/16 labelled wells, the 1200 vs 2600 hour Phase-2 compute, and the USD 53,300 ask; illustrative: the flat height drawn for the retired lane, since the archive records no single recall figure for that stack, so it is a visual anchor below the crossover, not a plotted measurement.

The fork above is the whole decision in one frame. Drag the labelled-well lever and the supervised staircase climbs the sourced 10, 40, 75 while the unsupervised lane holds a flat plateau. The retirement is not a failure event. It is the moment the climbing curve clears the flat one, and the orange marker is the instruction that follows: stop funding the track that cannot climb.

The retirement was a decision, not a collapse

We want to be precise about what "retired" means here, because it is easy to read it as "the unsupervised methods failed." They did not fail. On other targets in the same programme the unsupervised path was the one that won. Vug and bedding work leaned on the classical and morphology methods precisely because they did not need labels the client could not yet provide. The retirement was scoped to the fracture-and-bedding supervised objective, where labels eventually stabilized and the transformer overtook everything the label-free stack could offer.

Retiring a track means you stop spending on it: no more threshold sweeps, no more clustering experiments, no more engineer-weeks tuning path-opening for a job the transformer now does better. Those hours go to the winning track instead. The discipline is in doing this deliberately and on evidence, rather than letting a dead track quietly absorb budget because nobody wanted to call it. We called it when the recall curve crossed, and we wrote down why.

What running both actually cost

Optionality is not free, and pretending it was would make this a dishonest case study.

The clearest number sits in the Phase 2 compute ledger. Phase 2 was budgeted at 1200 GPU and MLOps hours. It consumed about 2600, a 2.2x overrun. Phase 1, which ran a single unsupervised line of work, landed exactly on its 1200-hour budget. The overrun appeared in Phase 2, and the parallel supervised-plus-unsupervised split is what drove it. Two tracks means two sets of experiments, two hyperparameter searches, two sets of failed runs to learn from.

That overrun did not stay hidden in a spreadsheet. It surfaced as a formal request to the client for an additional USD 53,300 in AI infrastructure, an ask we had pre-flagged in the original proposal precisely because the dual-track strategy would raise compute. We did not absorb the cost silently and we did not disguise it as scope creep. We named the strategy, named the number, and let the client decide whether the option was worth its price.

\text{overrun} = \frac{2600\ \text{h}}{1200\ \text{h}} \approx 2.2\times

We paid roughly a doubling of Phase 2 compute to hold both hypotheses open until the data told us which one to keep. Whether that was a good trade depends on the counterfactual. Had we bet single-track on supervision at the three-well stage, where recall was 10 percent, and been wrong about labels ever maturing, we would have had nothing. Had we bet single-track on the unsupervised path, we would have capped the fracture detector below what the transformer eventually reached. The 2.2x bought us the right to be wrong about either bet without losing the engagement.

What we would carry forward

The transferable lesson is not "always run two tracks." Two tracks is expensive, and the compute ledger proves it. The lesson is about when the option is worth buying and how to close it cleanly.

Fund parallel hypotheses when the deciding variable is outside your control and you cannot yet predict it. Here the variable was label maturity: whether corrected dip and azimuth would arrive in enough wells to feed supervision. We could not force that, so we hedged. Then define, in advance, the evidence that will retire the loser. Ours was a recall crossover, a single measurable event. When it fired, we stopped, reassigned the hours, and documented the kill so the decision could be audited later rather than remembered as a vibe.

The part that is easy to skip and worth insisting on is the accounting. Optionality that never appears in the ledger looks like waste to whoever pays the bill. Named, priced at 2.2x compute and USD 53,300, and closed on stated evidence, it is what it actually was: a managed research bet with a clear exit.

Limitations

The recall figures of 10, 40, and 75 percent are the supervised fracture numbers at a five-centimetre depth threshold across three, eight, and sixteen labelled wells; they are single-programme results on one confidential carbonate dataset, not a benchmark to quote against a different tool stack. The flat plateau drawn for the unsupervised lane in the instrument is illustrative: the archive records no single recall figure for the retired stack on the fracture objective, so the plateau is a visual anchor placed below the crossover, not a measured value. The 1200-hour budget, 2600-hour actual, 2.2x overrun, and USD 53,300 infrastructure ask are the Phase 2 compute-ledger figures for this engagement; they attribute the overrun to the dual-track strategy but are not a controlled decomposition of exactly how many hours each track consumed. The claim that the unsupervised path won on vug and bedding targets is scoped to those objectives and is not contradicted by its retirement on the supervised fracture objective.

References

Weekly running log of Phase-2 method development, including the unsupervised method stack (Hough auto-scribbles, DBSCAN with hierarchical clustering, SCOPS self-supervised segmentation, path-opening morphology sweeping ~100 threshold values at ~2 minutes per 2 m interval) and the transition to a supervised detection transformer once labels stabilized. Internal engagement archive, withheld under operator confidentiality.
Fracture recall evolution 10 -> 40 -> 75 percent at 5 cm across 3 -> 8 -> 16 labelled wells; programme narrative and executive decks. Internal engagement archive, withheld under operator confidentiality.
Phase-2 compute ledger: 1200-hour budget vs 2600-hour actual (2.2x), and the USD 53,300 additional AI-infrastructure request attributed to the parallel supervised-plus-unsupervised strategy. Additional-budget request memorandum, internal engagement archive, withheld under operator confidentiality.

The Unsupervised Track We Ran in Parallel, and Why We Retired It

Why two tracks and not one

What forced the choice

The retirement was a decision, not a collapse

What running both actually cost

What we would carry forward

Limitations

References

Continuous AI for explorers

About Earthscan

Products

Legal

Follow us on