From 82% on Paper to 63% Blind: An Honest Accounting of the Vertical-to-Horizontal Generalization Gap

Abstract

A fracture-detection paper reports a headline metric, and a headline metric is measured where the model is strongest: on the well geometry it was trained on, at the depth tolerance that flatters it most. For our combined bedding-and-fracture model on a mid-sized Middle East carbonate operator's borehole-image logs, that headline is a fracture F1 of about 82 percent at a 9 cm depth tolerance. This piece publishes the number that sits underneath it. When the same model family is run blind on a horizontal well it never trained near, the fracture F1 is 60.87 percent at a 6 cm tolerance and plateaus near 63 percent at 11 to 15 cm. We treat the roughly 19-point gap between 82 and 63 not as an embarrassment to bury but as the honest measure of the model, and we account for it in the terms that produced it: the physics of how fractures cross a horizontal borehole, the design changes that physics forced, and the metrics that survive the change. The argument is that a gap plotted alongside its mitigation ladder tells an operator more about deployment risk than a single leaderboard-style figure, because it says exactly where the model is trustworthy and where it is not.

The number a paper prints, and the number it does not

Every fracture-detection result our team has published leads with a test F1 measured on vertical wells. The combined model reaches about 82 percent at a 9 cm depth tolerance and falls to about 65 percent at 3 cm, and the fracture-only variant tracks it closely. Those numbers are real, and they come from borehole-image logs from fourteen vertical wells, scored the way we score everything: a predicted sinusoid counts as a true positive only if it lands within the stated depth tolerance of the interpreter's pick, and dip and azimuth errors are computed on matched hits alone. The mechanics of that depth-tolerance confusion matrix are their own subject, treated in "Depth Tolerance and the Confusion Matrix," and we do not re-derive them here.
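
Without re-deriving the confusion matrix, the matching rule itself is short enough to sketch. The following is a minimal illustration, not the engagement's scoring code: it assumes a greedy one-to-one match by smallest depth difference, and the function names and the toy depths are hypothetical.

```python
def match_within_tolerance(pred_depths_cm, pick_depths_cm, tol_cm):
    """Greedy one-to-one matching of predictions to interpreter picks.

    Assumption: the closest pairs are matched first, and each prediction
    and each pick can be used at most once.
    Returns (tp, fp, fn, pairs) with pairs as (pred_index, pick_index).
    """
    candidates = sorted(
        (abs(p - t), i, j)
        for i, p in enumerate(pred_depths_cm)
        for j, t in enumerate(pick_depths_cm)
        if abs(p - t) <= tol_cm
    )
    used_pred, used_pick, pairs = set(), set(), []
    for _, i, j in candidates:
        if i not in used_pred and j not in used_pick:
            used_pred.add(i)
            used_pick.add(j)
            pairs.append((i, j))
    tp = len(pairs)
    fp = len(pred_depths_cm) - tp   # predictions with no pick in range
    fn = len(pick_depths_cm) - tp   # picks with no prediction in range
    return tp, fp, fn, pairs


def precision_recall_f1(tp, fp, fn):
    """Standard precision, recall and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy depths in cm (hypothetical, not from the dataset).
picks = [100.0, 150.0, 300.0]
preds = [102.0, 158.0, 400.0]
tight = match_within_tolerance(preds, picks, tol_cm=3.0)
loose = match_within_tolerance(preds, picks, tol_cm=9.0)
print(tight[:3], loose[:3])
```

In the toy case the prediction at 158 cm misses its pick at 3 cm and counts as a hit at 9 cm, which is the mechanism by which F1 rises as the tolerance loosens.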

What a paper rarely prints is the blind horizontal number. Horizontal wells are a different problem, and only five of them were ever in the dataset against fourteen vertical. When we froze a fracture model and ran it blind on a horizontal well it had not trained near, the result was materially lower than the headline, and it was lower in a way that is legible rather than mysterious. The run is self-documenting. Its experiment string encodes a learning rate of 0.0007736, 200 epochs, a ResNet-18 backbone, an L1-plus-focal loss, and a July-2023 timestamp, and it was scored at a 0.55 probability threshold rather than the 0.5 used on the vertical model. At the stated 6 cm operating point it scored precision 56.25 percent, recall 66.32 percent, and F1 60.87 percent. Push the tolerance out and precision, recall, and F1 climb and then flatten near 58, 68, and 63 percent at 11 to 15 cm. That plateau near 63 is the number nobody publishes.
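
The quoted F1 values can be checked directly from the quoted precision and recall, since F1 is their harmonic mean. A short sketch; the function name is ours, and the plateau inputs are the rounded "near 58 and 68" figures from the text rather than exact measurements.

```python
def f1_from_precision_recall(precision, recall):
    """Harmonic mean of precision and recall (same units in, same units out)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# 6 cm operating point, values as quoted in the text (percent).
f1_6cm = f1_from_precision_recall(56.25, 66.32)
# Plateau at 11 to 15 cm, using the rounded values quoted in the text.
f1_plateau = f1_from_precision_recall(58.0, 68.0)

print(round(f1_6cm, 2))      # 60.87
print(round(f1_plateau, 1))  # 62.6
```

The 6 cm figure reproduces exactly, and the rounded plateau inputs land at about 62.6, consistent with a plateau "near 63."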

The honest measure of a fracture-detection model, drawn as the distance between two curves. The solid teal curve is the vertical-well headline fracture F1 of the combined model, rising from about 65 percent at a 3 cm depth tolerance to about 82 percent at 9 cm. The dashed teal curve is the same model family run blind on a horizontal well it never trained near (the horizontal-well blind run, learning rate 0.0007736, 200 epochs, ResNet-18 backbone, L1 plus focal loss, probability threshold 0.55), whose fracture F1 is 60.87 percent at 6 cm and plateaus near 63 percent at 11 to 15 cm. Drag the depth-tolerance lever and the orange span between the two curves is the whole argument: the gap is real, it is largest at loose tolerances, and it does not close. The right rail is the mitigation ladder, four sourced reasons the gap is expected physics rather than failure: horizontal fracture sinusoids average about 15 cm of amplitude against roughly 75 cm on vertical wells; the patch was shrunk from 800 to 200 pixels to suit that smaller amplitude; the production run used a ResNet-18 backbone rather than the vertical paper's ResNet-10; and the blind azimuth MAE of 65.41 degrees means azimuth is near-uninformative on horizontal wells, so depth carries the score. Every plotted F1, precision, recall, and MAE is sourced from the blind-evaluation figure set; the smooth interpolation drawn between the sourced anchor points at 3, 6, 9, 11, and 15 cm is the only illustrative element.

Why the gap is physics, not failure

The instinct on seeing 82 drop to 63 is to read it as a model that broke. It is closer to a model behaving exactly as the geometry predicts. A fracture crossing a near-vertical borehole cuts it at a shallow angle to the borehole axis and traces a tall sinusoid on the unrolled image; the average fracture sinusoid on our vertical wells stands roughly 75 cm high. The same fracture crossing a near-horizontal borehole cuts it almost perpendicular to the axis and traces a short, flat sinusoid, averaging about 15 cm, with 99 percent under 50 cm. A model tuned to find tall periodic curves in an 800-pixel patch is being asked, on a horizontal well, to find features a fifth of that height in the same field of view.

That single fact cascades into every design change on the blind run and into every metric it produced. The patch height was cut from 800 pixels to 200 so the smaller amplitude filled a usable fraction of the frame. The production backbone became a ResNet-18 rather than the ResNet-10 that the vertical paper's ablation had selected, because the horizontal task's feature statistics differ enough that the vertical backbone choice does not transfer unexamined. And the orientation metrics degrade in a specific, physical direction: the blind depth MAE held at a workable 2.11 cm, but the dip MAE rose to 11.12 deg and the azimuth MAE to 65.41 deg. On the vertical model, azimuth accuracy reaches about 92 percent at a 15 deg tolerance; on the blind horizontal well, azimuth accuracy reaches only about 78 percent near a 90 deg tolerance, which is another way of saying azimuth is close to uninformative there. A flat sinusoid carries far less orientation signal than a tall one, so the model that reads it well on depth reads it poorly on azimuth. None of this is the model failing. It is the model reporting, faithfully, that the horizontal problem gives it less to work with.
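
The matched-hits-only convention behind those MAE figures can be sketched as follows. This is an illustration under stated assumptions, not the engagement's code: the record layout and function names are hypothetical, and the azimuth error is assumed to wrap at 360 degrees so that 350 and 10 degrees count as 20 degrees apart.

```python
def circular_diff_deg(a_deg, b_deg):
    """Smallest absolute angle between two azimuths, in [0, 180] degrees.

    Assumption: azimuth is a full-circle quantity that wraps at 360.
    """
    d = abs(a_deg - b_deg) % 360.0
    return min(d, 360.0 - d)


def matched_mae(pairs, preds, picks):
    """Mean absolute depth, dip and azimuth error over matched pairs only.

    pairs: (pred_index, pick_index) tuples from the depth-tolerance match.
    preds, picks: lists of dicts with 'depth_cm', 'dip_deg', 'azimuth_deg'.
    Unmatched predictions and missed picks do not enter these averages.
    """
    if not pairs:
        return None
    n = len(pairs)
    depth = sum(abs(preds[i]["depth_cm"] - picks[j]["depth_cm"]) for i, j in pairs) / n
    dip = sum(abs(preds[i]["dip_deg"] - picks[j]["dip_deg"]) for i, j in pairs) / n
    azim = sum(
        circular_diff_deg(preds[i]["azimuth_deg"], picks[j]["azimuth_deg"])
        for i, j in pairs
    ) / n
    return {"depth_cm": depth, "dip_deg": dip, "azimuth_deg": azim}


# Toy records (hypothetical values).
preds = [
    {"depth_cm": 102.0, "dip_deg": 40.0, "azimuth_deg": 350.0},
    {"depth_cm": 158.0, "dip_deg": 20.0, "azimuth_deg": 90.0},
]
picks = [
    {"depth_cm": 100.0, "dip_deg": 30.0, "azimuth_deg": 10.0},
    {"depth_cm": 150.0, "dip_deg": 25.0, "azimuth_deg": 100.0},
]
mae = matched_mae([(0, 0), (1, 1)], preds, picks)
print(mae)
```

Under this wrap-around convention a uniformly random azimuth guess would average 90 degrees of error, which gives the 65.41 deg blind figure some scale.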

The mitigation ladder, and why it beats a single figure

Publishing the gap is only half of an honest account. The other half is showing the handles that shrink it, because a gap with known handles is a managed risk and a gap without them is a warning. The mitigation ladder on the right of the instrument is those handles, each grounded in the physics above: the amplitude ratio that set the patch size, the patch shrink from 800 to 200 pixels, the backbone swap to ResNet-18, and the azimuth MAE that tells a deployment to weight depth over orientation on horizontal wells. Every rung is a decision an operator can inspect and a future run can tighten. More horizontal wells would help most; five is a thin dataset, and our own well-count ablations on the vertical side showed error falling steeply as wells were added before flattening. The choice to regress dip and azimuth directly rather than through a keypoint head was a related trade, and we treated that comparison separately in "Keypoints versus Direct Regression," so we only point to it here.

A single headline F1 hides all of this. It tells an operator the model is good without telling them where, and the first horizontal well in production would then arrive as an unpleasant surprise rather than a costed expectation. The gap plotted against its ladder does the opposite. It says: on vertical wells, expect about 82 percent at a loose tolerance; on horizontal wells run blind, expect about 63; trust the depth localisation on both, distrust the azimuth on horizontal; and here are the four levers that move the horizontal number. An operator planning an infill or a horizontal drilling campaign can price that. The relationship the score ultimately rests on is the geometric one between a fracture's dip, its azimuth, and the height of the curve it draws on the image,

h_{\text{sinusoid}} \;\propto\; \tan(\mathrm{Dip}) \cdot \sin(\theta + \mathrm{Azimuth})

where θ is the angular position around the borehole wall and the dip is taken relative to the borehole. A near-horizontal borehole drives the dip term small and the amplitude with it, which is the whole reason the blind F1 sits where it does across the swept 3-to-15 cm tolerance band.
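
The proportionality can be made concrete under the usual plane-cuts-cylinder geometry, in which the constant of proportionality is the borehole radius and the peak-to-peak height is the diameter times the tangent of the dip relative to the borehole. The sketch below is an illustration only: the 21.6 cm (8.5 inch) diameter is an assumed value that does not come from the engagement, and the function names are ours.

```python
import math


def sinusoid_trace_cm(theta_deg, dip_deg, azimuth_deg, radius_cm):
    """Depth offset of the fracture trace at angle theta around the wall.

    Follows the form in the text: radius * tan(dip) * sin(theta + azimuth),
    with dip measured relative to the borehole.
    """
    return (
        radius_cm
        * math.tan(math.radians(dip_deg))
        * math.sin(math.radians(theta_deg + azimuth_deg))
    )


def peak_to_peak_cm(dip_deg, diameter_cm):
    """Full height of the sinusoid on the unrolled image."""
    return diameter_cm * math.tan(math.radians(dip_deg))


def dip_for_height_deg(height_cm, diameter_cm):
    """Dip relative to the borehole that yields a given sinusoid height."""
    return math.degrees(math.atan(height_cm / diameter_cm))


DIAMETER_CM = 21.6  # assumed borehole diameter (8.5 inch), not from the source

tall = dip_for_height_deg(75.0, DIAMETER_CM)  # vertical-well average height
flat = dip_for_height_deg(15.0, DIAMETER_CM)  # horizontal-well average height
print(round(tall), round(flat))
```

With that assumed diameter, the 75 cm and 15 cm average heights correspond to dips relative to the borehole of roughly 74 and 35 degrees, which shows how much of the orientation signal disappears when the same fracture population is logged along a horizontal well.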

What the honest number is for

We think the 63 is more useful than the 82, not less. The 82 is a capability claim; the 63 is a deployment fact. A capability claim invites an operator to assume the number holds everywhere, and the assumption is wrong the moment the borehole tilts. A deployment fact, published with the run configuration that produced it and the physics that explains it, lets an operator decide where the model earns its place in a workflow and where a human interpreter still has to carry the horizontal wells. On this engagement that is precisely how it was used: the model cleared the vertical backlog at high F1 and flagged, rather than resolved, the harder horizontal fractures. The gap is not the failure of that arrangement. It is the design of it.

Limitations

The two F1 curves in the instrument are drawn through sourced anchor points at 3, 6, 9, 11, and 15 cm and interpolated smoothly between them; the interpolation is a reading aid, not a measurement, and the true curves are step-like around the confusion-matrix boundaries. The blind horizontal figures come from a single blind run on one horizontal well with a five-well training set, so they are a single-sample estimate that we expect to be conservative rather than a population estimate; a larger horizontal cohort would almost certainly move the plateau, and our expectation is upward. The vertical headline of about 82 percent is the combined model's fracture F1 at 9 cm and should not be read as a claim about beddings, dip, or azimuth, each of which has its own separate accounting. The depth, dip, and azimuth MAE values are computed on matched true positives only, so they describe the quality of hits and say nothing about missed fractures, which the recall term carries instead. Finally, this is a report on one operator's carbonate borehole-image logs; the amplitude ratio that drives the gap is general to well geometry, but the specific numbers are not a benchmark and should not be transported to another field without re-measurement.

References

No external works are cited in this piece. All figures are drawn from the engagement's own blind-evaluation figure set and progress records, anonymised per house style, and cross-referenced in the two companion pieces named in the text.
