Abstract
Promptable segmentation changed what a segmentation model is for. A user points at a pixel or draws a box and the model returns a mask, and the same model answers for classes it never saw. That interface is a real advance, and the Segment Anything line and its efficient and high-quality descendants have earned the attention they get [1] [2] [3]. This piece reads that family against a task it was not built for and does not do well: recovering the one-pixel-wide curves in a scanned well-log. We set the promptable lineage next to a purpose-built, region-trained curve segmenter from our own raster-log work, where the background class is nearly solved at an intersection over union of 0.94 while the two curve classes sit at 0.26 and 0.21 under a Dice objective, with a peak intersection over union across the run of 0.51 [7]. The region-trained model is not good in an absolute sense, and we will not pretend it is. The claim is narrower and sturdier: on a single-channel grayscale image 3,200 to 12,800 pixels wide in which foreground is under two percent of the pixels, the region-trained model recovers real foreground and, we argue, a prompt-driven blob model recovers almost none, because the two were rewarded for different things. Promptability is an interface property; usefulness on scientific line data is a training-distribution property. Conflating them is how a strong foundation model ends up returning an empty mask on a log.
Two model families, one substrate
The task is fixed for this reading: take a scanned raster well-log, a single-channel grayscale image between 3,200 and 12,800 pixels wide, and separate three classes, the paper background and two thin ink curves. Onto that one substrate we place two model families that could in principle be asked to do it.
The first is the promptable foundation family. Segment Anything trained a promptable mask predictor on a very large corpus of masks and offered it as a generic, zero-shot segmenter a user prompts rather than trains [1]. Its descendants sharpen and shrink it without changing its premise: the high-quality variant adds an output token and fuses early features to recover finer boundaries [2], and the mobile variant distils the encoder for on-device speed [3]. All three inherit the same object-region prior, because that is what the pretraining corpus contained: masks of things, boxes and blobs and objects photographed in natural scenes.
The second is the region-trained thin-structure family, models built and supervised specifically to trace elongated, filamentary foreground. Channel-and-spatial-attention networks for curvilinear structures were designed for vessels and neurites, the thinnest foreground computer vision routinely handles [4], and connectivity-aware losses were introduced precisely because a region-overlap objective under-rewards a one-pixel tube whose every pixel matters to whether the curve stays connected [5]. Our own raster-log segmenter sits in this second family: it was trained on the log substrate, with an overlap loss chosen for sparse foreground [6], to do exactly this task and nothing else [7].
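The point about connectivity can be made concrete with a toy case. The sketch below is illustrative only: the mask, the gap positions, and the helper `pieces_along_columns` are hypothetical, and the column-run count is a crude stand-in for connectivity that is valid only for a trace with one pixel per column. It is not the loss from [5].

```python
import numpy as np

def dice(pred: np.ndarray, target: np.ndarray) -> float:
    """Binary Dice overlap between two boolean masks."""
    tp = np.logical_and(pred, target).sum()
    denom = pred.sum() + target.sum()
    return float(2 * tp / denom) if denom else 1.0

def pieces_along_columns(mask: np.ndarray) -> int:
    """Count runs of consecutive columns that contain any foreground.

    Assumption: the trace has one pixel per column, so a column gap is a
    break in the curve. This is not a general connected-component count.
    """
    occupied = mask.any(axis=0).astype(int)
    starts = np.diff(np.concatenate(([0], occupied))) == 1
    return int(starts.sum())

# Hypothetical toy: a one-pixel-wide horizontal trace, 300 columns long.
target = np.zeros((50, 300), dtype=bool)
target[25, :] = True

# A prediction that misses three isolated pixels of the trace.
pred = target.copy()
pred[25, [70, 150, 230]] = False

score = dice(pred, target)
n_pieces = pieces_along_columns(pred)
print(f"Dice = {score:.3f}, pieces = {n_pieces}")
```

Three missing pixels out of three hundred leave the overlap score close to one while cutting the trace into four pieces, which is the sense in which a region-overlap objective under-rewards each pixel of a thin curve.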
The interesting fact is not that the second family beats the first. It is why the first family, which is larger, more general, and more capable on almost every natural-image benchmark, is the wrong tool here, and why its own authors say so.
What the promptable objective actually optimises
A promptable segmenter is trained to answer a spatial question: given this point or this box, what is the object here? The reward signal is agreement with a human-drawn mask of a thing, and things in the pretraining distribution are regions with area, texture, and a boundary that encloses a filled interior. The model learns, correctly for its data, that a mask is a compact region you can fill.
A one-pixel-wide log curve is the opposite object. It has essentially no interior to fill; it is almost all boundary. It is grayscale, so the colour and texture cues the pretraining leaned on are absent. It threads across a gridded, cream-toned page whose lines look more like the target than the target does. And it occupies under two percent of the pixels, so on any score dominated by the background the cheapest way to do well is to predict very little foreground at all. Every inductive bias the promptable objective installed points the model away from the curve. The high-quality variant's paper is candid about the symptom even inside its own domain: the base model degrades on thin and intricate structures, which is exactly the failure their extra output token is trying to patch [2]. Distilling the encoder for speed does not touch the prior [3]; it makes the same wrong answer cheaper.
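One way to make "almost all boundary" measurable is to count how many foreground pixels are fully surrounded by foreground. The sketch below is a minimal illustration, not part of the original pipeline: `interior_fraction` is a hypothetical helper, the 4-neighbour definition of "interior" is an assumption, and both masks are synthetic.

```python
import numpy as np

def interior_fraction(mask: np.ndarray) -> float:
    """Share of foreground pixels whose four neighbours are all foreground."""
    m = np.pad(mask.astype(bool), 1, constant_values=False)
    core = m[1:-1, 1:-1]
    interior = (
        core
        & m[:-2, 1:-1]   # neighbour above
        & m[2:, 1:-1]    # neighbour below
        & m[1:-1, :-2]   # neighbour to the left
        & m[1:-1, 2:]    # neighbour to the right
    )
    total = core.sum()
    return float(interior.sum() / total) if total else 0.0

h, w = 200, 200
yy, xx = np.mgrid[:h, :w]

# Synthetic object-like mask: a filled disk of radius 40.
blob = (yy - 100) ** 2 + (xx - 100) ** 2 <= 40 ** 2

# Synthetic curve-like mask: exactly one pixel per column along a sine trace.
curve = np.zeros((h, w), dtype=bool)
cols = np.arange(w)
rows = np.round(100 + 30 * np.sin(cols / 15.0)).astype(int)
curve[rows, cols] = True

print("blob interior fraction: ", interior_fraction(blob))
print("curve interior fraction:", interior_fraction(curve))
print("curve foreground share: ", curve.mean())
```

The disk is mostly interior, while the one-pixel trace has none under this definition, so a prior that expects a fillable region has nothing to hold on to.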
The region-trained family optimises a different question, and that is the whole difference. It is not asked what object is at a point. It is asked, for every pixel of this specific substrate, is this pixel curve or is it background, and it is scored with a loss chosen so that the tiny foreground is not drowned by the background it is trained against [5] [6]. It has seen thousands of these logs and nothing else. Its ceiling is low, but its floor is real.
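A minimal sketch of why an overlap loss resists being drowned by the background, assuming the squared-denominator soft Dice form from [6] averaged over classes. The function name, the toy label map, and the class averaging are assumptions made for illustration; the source does not spell out the exact loss configuration used in [7].

```python
import numpy as np

def soft_dice_per_class(probs: np.ndarray, onehot: np.ndarray,
                        eps: float = 1e-6) -> np.ndarray:
    """Soft Dice per class for arrays shaped (classes, height, width)."""
    inter = (probs * onehot).sum(axis=(1, 2))
    denom = (probs ** 2).sum(axis=(1, 2)) + (onehot ** 2).sum(axis=(1, 2))
    return (2.0 * inter + eps) / (denom + eps)

# Hypothetical toy label map: background (0) and two one-pixel curves (1, 2).
h, w = 128, 256
labels = np.zeros((h, w), dtype=int)
labels[40, :] = 1
labels[90, :] = 2
onehot = np.stack([labels == c for c in range(3)]).astype(float)

# A lazy prediction: probability one on background everywhere.
probs = np.zeros_like(onehot)
probs[0] = 1.0

dice = soft_dice_per_class(probs, onehot)
dice_loss = 1.0 - dice.mean()
pixel_accuracy = (probs.argmax(axis=0) == labels).mean()

print("foreground share:", (labels > 0).mean())
print("pixel accuracy:  ", pixel_accuracy)
print("per-class Dice:  ", dice)
print("mean Dice loss:  ", dice_loss)
```

On this toy the lazy prediction is right on more than 98 percent of pixels yet still pays roughly two thirds of the maximum class-averaged Dice loss, because each curve class counts as much as the background regardless of its area.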
Reading the measured numbers honestly
The region-trained numbers are not flattering and we will not flatter them. Under a Dice objective, the multiclass run reaches an intersection over union of 0.94 on the background class and an F1 of 0.97, which says the easy, area-rich class is essentially solved. The two curve classes are the hard mass: intersection over union of 0.26 and 0.21, F1 of 0.37 and 0.32, and a peak intersection over union across the run of 0.51 [7]. Read in isolation, a curve intersection over union of 0.21 is a poor result, and any honest account has to say so.
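For readers who want to reproduce this kind of per-class table on their own masks, the following is a hedged sketch of the two metrics. It assumes integer label maps with classes 0, 1, 2 and counts pooled over a single image; how [7] aggregated across images is not stated here, so averaged figures need not match a pooled computation. The function name and the toy data are hypothetical.

```python
import numpy as np

def per_class_iou_f1(pred: np.ndarray, target: np.ndarray,
                     num_classes: int = 3) -> dict:
    """Return {class: (IoU, F1)} from two integer label maps."""
    scores = {}
    for c in range(num_classes):
        p, t = pred == c, target == c
        tp = int(np.logical_and(p, t).sum())
        fp = int(np.logical_and(p, ~t).sum())
        fn = int(np.logical_and(~p, t).sum())
        union = tp + fp + fn
        iou = tp / union if union else float("nan")
        f1 = 2 * tp / (2 * tp + fp + fn) if union else float("nan")
        scores[c] = (iou, f1)
    return scores

# Hypothetical toy: two one-pixel curves on a 32 x 100 page.
target = np.zeros((32, 100), dtype=int)
target[10, :] = 1
target[20, :] = 2

# Prediction: curve 2 is exact; curve 1 is one row off for half its length.
pred = target.copy()
pred[10, 50:] = 0
pred[11, 50:] = 1

scores = per_class_iou_f1(pred, target)
for c, (iou, f1) in scores.items():
    print(f"class {c}: IoU = {iou:.3f}, F1 = {f1:.3f}")
```

A one-row offset over half the trace is enough to pull the curve class down to an IoU of one third while the background stays above 0.96, which is the same shape of gap the measured run shows.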
Read comparatively, those same numbers are the argument. An intersection over union of 0.21 on a one-pixel curve at under two percent foreground is small, but it is real recovered ink: roughly a fifth of the combined predicted and true curve pixels coincide, where a blob-prior model is expected to return close to none. The background result is the tell. With foreground under two percent of the pixels, a lazy predict-nothing strategy would score above 0.98 on the background and exactly zero on both curves. The measured run sits at 0.94 on the background and at 0.26 and 0.21 on the curves, so the score is not being carried by that strategy; the network is spending its capacity on the thin class and getting a fraction of it, which is what training on the substrate buys you. A prompt-driven model brought to this image has no such incentive. Prompted anywhere near a curve, its object prior pulls toward the nearest fillable region, which is the page or a grid cell, not the hairline. Its effective foreground overlap should collapse toward zero in the regime where foreground is under two percent, recovering only as the traced line is thickened into something object-shaped, which is to say only when the input stops being the scientific line image the task actually contains.
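The predict-nothing baseline is simple arithmetic, and a short check makes it explicit. The sketch assumes a synthetic 200 by 3,200 label map with two one-pixel curves, which gives a one percent foreground share; the real logs and their exact foreground fraction are not reproduced here, and `class_iou` is a hypothetical helper.

```python
import numpy as np

def class_iou(pred: np.ndarray, target: np.ndarray, cls: int) -> float:
    """Intersection over union for one class of two integer label maps."""
    p, t = pred == cls, target == cls
    union = np.logical_or(p, t).sum()
    return float(np.logical_and(p, t).sum() / union) if union else float("nan")

# Synthetic page: 3,200 columns wide, two curves with one pixel per column.
h, w = 200, 3200
cols = np.arange(w)
target = np.zeros((h, w), dtype=np.uint8)
target[np.round(60 + 20 * np.sin(cols / 40.0)).astype(int), cols] = 1
target[np.round(140 + 20 * np.cos(cols / 55.0)).astype(int), cols] = 2

# The lazy strategy: call every pixel background.
lazy = np.zeros_like(target)

fg_share = float((target > 0).mean())
baseline = {c: class_iou(lazy, target, c) for c in range(3)}

print("foreground share:", fg_share)
print("predict-nothing IoU per class:", baseline)
```

With foreground share f, the all-background prediction scores 1 - f on the background and zero on every curve class. A run that reports a lower background score together with non-zero curve scores is therefore doing something other than predicting nothing.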
The exhibit puts the two readings side by side. The left panel is the sourced per-class overlap of the region-trained segmenter, background nearly solved and the two curves low but non-zero, with the peak intersection over union of 0.51 marked. The right panel sweeps the fraction of the grayscale scan that is curve ink and reads effective foreground overlap for each family: the region-trained model holds a modest plateau anchored to its measured curve band, and the promptable blob model dives to a near-zero floor once foreground drops under the sourced two-percent threshold. The sourced facts are the overlap numbers, the peak intersection over union, the sub-two-percent foreground fact, the three-class grayscale substrate, and the pixel widths; the specific shape of the decay the lever sweeps is illustrative and is flagged as such on the canvas, anchored only at the sub-two-percent collapse that the promptable objective makes near-certain.
Where promptability is the right tool, and where it is not
None of this is a case against promptable segmentation as an architecture. On natural images, on medical volumes with region-scale targets, on any task where the foreground is an object with an interior, the promptable interface is a genuine step change, and the efficient and high-quality descendants make it cheaper and sharper to deploy [1] [2] [3]. The interface, point-and-get-a-mask across an open label set, is worth wanting.
The line to hold is that the interface is orthogonal to the training distribution. A model can be superbly promptable and still have learned nothing about a single-channel scanned log with a hairline target, because promptability is about how you query the model and the training distribution is about what the model knows. Scientific line images (seismic horizons, fracture traces, contour lines on a map, vessels in a fundus image) sit in a corner of input space that the object-region corpora do not cover, and the thin-structure literature exists precisely because that corner needed its own models and its own losses [4] [5]. When the target is one pixel wide and under two percent of the frame, the right move is not to prompt a general segmenter and hope; it is to train on the substrate, with a loss that respects the imbalance, and accept a low but real ceiling rather than a convenient interface over an empty mask.
Discussion
The architecture story of the last few years reads as a march toward generality, and for object-scale segmentation that march is real. This task is a reminder that generality has a distribution attached to it. The promptable foundation family generalises across classes within the distribution it was trained on, and a one-pixel grayscale curve is outside that distribution in three ways at once: the wrong topology (all boundary, no interior), the wrong channel count (one, not three), and the wrong foreground fraction (under two percent, not object-scale). No amount of prompt cleverness moves a model across those three gaps, because none of them is an interface problem. The region-trained family closes them by construction, at the cost of doing only this one task. That trade, a narrow model that recovers real thin foreground versus a general model that returns almost none of it, is the trade the numbers describe, and it is the trade a practitioner faces the moment they point a foundation segmenter at a log and get back the page.
Where our own work sits relative to this is worth marking. The measured curve numbers are ours and are used as one honest anchor, a floor a substrate-trained model reaches on this exact input, not a benchmark of any promptable model, which we did not run on the logs. The comparison the exhibit draws between the two families on the right panel is a mechanism argument grounded in what each objective optimises, sourced only at the point where the promptable collapse is a near-certainty, and flagged illustrative elsewhere.
Limitations
This is a structured reading, not a head-to-head benchmark, and it inherits those limits. We did not run Segment Anything or its descendants on our raster logs and report a number; the promptable family's collapse on this substrate is argued from what its training objective optimises and from its own authors' documented degradation on thin structures [2], not from an ablation we recorded. The region-trained figures are the real metrics of a single multiclass run under a Dice objective from one architecture and one engagement, a background intersection over union of 0.94 and curve values of 0.26 and 0.21 with a peak of 0.51 [7], used as a comparative anchor rather than as a claim that these are the best achievable on the task; a connectivity-aware loss or a topology-aware objective could raise the thin-class numbers, and the citation to that literature is a pointer, not a result we ran [5]. The exhibit's right-panel decay curves are an illustrative model of effective foreground overlap as the traced line thins, anchored only at the sourced sub-two-percent collapse and the measured curve band, and are flagged as such on the canvas; we did not measure overlap across a swept ink fraction for either family. The reading also scopes itself to the promptable and thin-structure model families as they stood in the quarter it was written, so later promptable variants and later thin-structure losses are out of frame. A reader should take this as a map of when a promptable foundation model is the wrong tool for scientific line data, not as a measured tournament between the two families on their own task.
What to carry from this reading
- Promptability is an interface property, not a usefulness property. A model can be excellent at point-and-get-a-mask across an open label set and still know nothing about a single-channel scanned log with a one-pixel curve, because how you query a model is orthogonal to what its training distribution contains.
- A one-pixel log curve breaks the object-region prior in three ways at once: the wrong topology (all boundary, no interior to fill), the wrong channel count (one grayscale channel, not three colour ones), and the wrong foreground fraction (under two percent, not object-scale). None of the three is an interface problem, so prompt cleverness does not close them.
- The region-trained numbers are low but real: background intersection over union 0.94 and F1 0.97 versus curve intersection over union 0.26 and 0.21, F1 0.37 and 0.32, peak intersection over union 0.51 under a Dice loss. The gap between the near-solved background and the recovered curves is the tell that the score is not a lazy predict-nothing artefact.
- The high-quality promptable variant's own paper documents that the base model degrades on thin and intricate structures, which is the failure this reading predicts from first principles; distilling the encoder for speed makes the same wrong answer cheaper without touching the object-region prior.
- The right move for scientific line images (log curves, seismic horizons, fracture traces, vessels, contour lines) is to train on the substrate with a loss that respects the class imbalance and accept a low but real ceiling, rather than prompt a general segmenter and receive a convenient interface over an almost-empty mask.
The smallest habit this reading would install is a question to ask before reaching for a foundation segmenter on scientific imagery: is the target an object with an interior, or is it a line? If it is a line, the promptable interface is answering a question the input does not pose, and the cheaper honesty is a narrow model trained on the substrate that recovers a real fraction of the ink rather than a general one that returns the page.
References
[1] Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A. C., Lo, W.-Y., Dollár, P., and Girshick, R. Segment Anything. ICCV (2023). Introduces a promptable segmentation foundation model trained on a very large mask corpus and proposes it as a generic zero-shot mask predictor prompted rather than trained. https://arxiv.org/abs/2304.02643
[2] Ke, L., Ye, M., Danelljan, M., Liu, Y., Tai, Y.-W., Tang, C.-K., and Yu, F. Segment Anything in High Quality. NeurIPS (2023). Adds a high-quality output token and fuses early and final features to sharpen boundaries, documenting that the base model degrades on thin and intricate structures. https://arxiv.org/abs/2306.01567
[3] Zhang, C., Han, D., Qiao, Y., Kim, J. U., Bae, S.-H., Lee, S., and Hong, C. S. Faster Segment Anything: Towards Lightweight SAM for Mobile Applications. arXiv (2023). Distils the image encoder for on-device use, inheriting the base model's object-region prior. https://arxiv.org/abs/2306.14289
[4] Mou, L., Zhao, Y., Chen, L., Cheng, J., Gu, Z., Hao, H., Qi, H., Zheng, Y., Frangi, A., and Liu, J. CS-Net: Channel and Spatial Attention Network for Curvilinear Structure Segmentation. MICCAI (2019). A region-trained network built specifically for thin, elongated curvilinear structures such as vessels and neurites. https://doi.org/10.1007/978-3-030-32239-7_80
[5] Shit, S., Paetzold, J. C., Sekuboyina, A., Ezhov, I., Unger, A., Zhylka, A., Pluim, J. P. W., Bauer, U., and Menze, B. H. clDice: A Novel Topology-Preserving Loss Function for Tubular Structure Segmentation. CVPR (2021). A connectivity-aware loss that targets the thin, tubular foreground that region-overlap losses under-reward. https://arxiv.org/abs/2003.07311
[6] Milletari, F., Navab, N., and Ahmadi, S.-A. V-Net: Fully Convolutional Neural Networks for Volumetric Medical Image Segmentation. 3DV (2016). Introduces the Dice objective used here, an overlap loss designed for the case where foreground is a small fraction of the image. https://arxiv.org/abs/1606.04797
[7] Maiti, T., Nasim, M. Q., Patwardhan, N., and Singh, T. VeerNet: A Deep Neural Network Architecture for Raster Well-Log Digitization. Journal of Imaging, 9(7), 136 (2023). The purpose-built raster-log segmenter whose region-trained curve masks anchor this reading. https://www.mdpi.com/2313-433X/9/7/136