Instance or Semantic: Choosing How to Slice an Image

A practitioner's decision guide for the first fork in any pixel-level vision task: do you want instance segmentation, which finds and separates each object as its own thing, or semantic segmentation, which labels every pixel with a class from a set you fixed in advance? The two are not competing accuracy tricks; they answer different questions, and the question the raster-log problem actually asks is the semantic one. A scanned well log carries two constant, named curves, so the output classes are known before a single pixel is seen: background, curve1, curve2, three in all. When the class set is that small and that fixed, you already know what is in the picture, and per-pixel labelling is the natural fit. This note walks through the distinction plainly, shows where each method earns its keep, and grounds the choice in the numbers from our own runs: a three-class model peaking at F1 0.55 and IoU 0.51, against the fragmented alternative of separate binary masks at F1 0.37, 0.26, and 0.55. It complements the VeerNet whitepaper rather than repeating it: this is the primer, not the research comparison.

By Tarry Singh · 8 min read
EarthScan insight

Before you pick a loss, a backbone, or an augmentation policy, there is an earlier fork that quietly decides most of them: are you doing instance segmentation or semantic segmentation? The two get filed under one heading because both produce masks, but they answer different questions, and choosing the wrong one wastes months building machinery for a problem you do not have. This is a plain guide to that fork for anyone who has to slice an image into parts, written from the raster-log-digitisation work behind VeerNet, the encoder-decoder EarthScan uses to lift curves off scanned paper logs. It is deliberately the primer, not the published research comparison; the point here is the decision, and why for our task it is not close.

The two questions, stated plainly

Semantic segmentation labels every pixel with a class drawn from a set you fix before training. The output is one map the same size as the input, and each pixel says which of your named classes it belongs to. There is no notion of separate objects inside a class: every pixel that is road is simply road, whether it is one road or five. The framing that made this the standard shape of the task is the fully convolutional network [1], and the encoder-decoder that produces a full-resolution map for a small, known class set is the U-Net family [3].
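
To make "one map the same size as the input" concrete, here is a minimal sketch of the last step of a semantic model: turning per-pixel class scores into a label map by taking the highest-scoring class at each pixel. The class names come from this note; the `label_map` helper and the toy scores are assumptions made up for illustration, not VeerNet's code or outputs.

```python
# Class vocabulary fixed before training (names taken from this note).
CLASSES = ["background", "curve1", "curve2"]


def label_map(scores):
    """Turn per-pixel class scores into a per-pixel label map.

    scores[r][c] is a list with one score per class, in CLASSES order.
    Returns a grid of the same height and width holding class indices.
    """
    return [
        [max(range(len(CLASSES)), key=lambda k: pixel[k]) for pixel in row]
        for row in scores
    ]


# Toy 2x3 "image" of made-up scores, for illustration only.
toy_scores = [
    [[0.9, 0.05, 0.05], [0.2, 0.7, 0.1], [0.8, 0.1, 0.1]],
    [[0.1, 0.2, 0.7], [0.6, 0.3, 0.1], [0.3, 0.3, 0.4]],
]

labels = label_map(toy_scores)
names = [[CLASSES[k] for k in row] for row in labels]
print(names)
```

Every pixel ends up with exactly one of the three names, and nothing in the output says how many separate curve1 objects there are. That is the point of the framing.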

Instance segmentation asks a different thing: find each individual object and give it its own mask, even when several objects share a class. Two cars are two instances, each with a separate outline, and the method has to detect them as distinct before it masks them. The reference approach detects each object as a region first, then predicts a mask inside that detection [2]. The unifying vocabulary that names this split cleanly comes from panoptic segmentation, which distinguishes "stuff", the uncountable regions semantic labelling handles, from "things", the countable instances that instance methods exist for [4].
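
The difference in output shape can be sketched with a toy example. Below, two hypothetical detections of the same class each carry their own binary mask, which is roughly the kind of result an instance method returns. Collapsing them into a single per-pixel class map, as a semantic output would be, keeps the class but loses the count. The data structure and the `to_semantic` helper are illustrative assumptions, not the output format of any particular library.

```python
# Two hypothetical detections of the same class, each with its own mask.
instances = [
    {"cls": "car", "mask": [[1, 1, 0, 0, 0],
                            [1, 1, 0, 0, 0]]},
    {"cls": "car", "mask": [[0, 0, 0, 1, 1],
                            [0, 0, 0, 1, 1]]},
]


def to_semantic(instances, height, width, background="background"):
    """Collapse per-object masks into one per-pixel class map."""
    grid = [[background] * width for _ in range(height)]
    for inst in instances:
        for r in range(height):
            for c in range(width):
                if inst["mask"][r][c]:
                    grid[r][c] = inst["cls"]
    return grid


semantic = to_semantic(instances, height=2, width=5)

# The instance view knows there are two cars.
n_instances = len(instances)
# The semantic view only knows which classes are present.
classes_present = {value for row in semantic for value in row}
print(n_instances, sorted(classes_present))
```

The semantic grid still marks every car pixel as car, but the fact that there were two of them survives only in the instance list.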

So the fork is not about which model is more accurate. It is a question about your data: are the parts you care about a fixed set of named categories, or an open, countable population of separable objects? Answer that, and the method follows.
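
Written as code, the data question is a short rule. The function below is a hypothetical sketch of that rule only; its name, its two flags, and the wording of the fallback are assumptions for illustration, and real projects will have mixed cases it does not cover.

```python
def choose_framing(class_set_fixed: bool, must_separate_same_class_objects: bool) -> str:
    """Illustrative decision rule for the instance-or-semantic fork.

    class_set_fixed: the classes are named and known before training.
    must_separate_same_class_objects: several objects can share a class and
        the task needs each one outlined or counted on its own.
    """
    if must_separate_same_class_objects:
        return "instance"
    if class_set_fixed:
        return "semantic"
    # Neither condition holds: the task definition needs more work first.
    return "revisit the class set"


# The raster-log case described in this note: fixed classes, nothing to count.
print(choose_framing(class_set_fixed=True, must_separate_same_class_objects=False))
```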

What the raster log actually contains

A scanned well log, in the form our multiclass model was trained on, contains two constant curves per image. They are not anonymous objects to be discovered and counted. They are the same two named quantities on every log, and the output vocabulary is fixed before we see a single pixel: three classes, background, curve1, and curve2. Nothing about the task changes that count from image to image. There is no image where a third curve of the same kind might appear and need to be told apart from the first two as a separate instance. The population is closed, named, and small.

That is the exact profile semantic segmentation is built for. We already know what is in the picture; the only question left is which pixels belong to which of the three known classes. Reaching for instance segmentation here would be answering a question the data never asks, "how many separate curve-objects are there and where does each begin", when the honest answer is always the same: two, these two, every time. The exhibit below turns that reasoning into a fork you can drive. Drag the lever for how many distinct objects you actually need to tell apart, and watch where the verdict lands. For a small, fixed, named class set, it lands on semantic, and it stays there until the count climbs into the range where unlabelled duplicates are the real problem.

[Interactive exhibit: "How to slice a log: instance or semantic". The fork asks one question: do you know the class set ahead of time, and is it small and fixed? Yes leads to the semantic branch, one 3-class mask labelling every pixel background, curve1, or curve2 at once (peak F1 0.55, peak IoU 0.51). No leads to the instance branch, three disjoint binary masks stitched afterward with no shared class notion (F1 0.37, 0.26, 0.55). A decision lever sets the number of distinct objects to tell apart; a log has 2, and they are the same named curves.]
The one question that decides how to slice a raster well log, made into a fork. A log carries two constant, named curves, so the class set is known ahead of time and small: background, curve1, and curve2, three classes in all. Drag the lever for the number of distinct objects you have to tell apart. When that count is small and the objects are the same named things every image, the orange verdict arrow lands on the semantic branch, where one three-class per-pixel model peaks at F1 0.55 and IoU 0.51. Push the count up into the range of many unlabelled duplicates and the fork tips to the instance branch, which for this task means training separate binary masks and stitching them back with no shared class notion, at F1 0.37, 0.26, and 0.55 per mask. The two curves per log, the three classes, the multiclass F1 and IoU, and the per-mask binary F1s are sourced from the engagement archive; the object-count axis on the lever is an illustrative control for the decision rule, not a measured series. The orange arrow is the only element that argues: for a known, fixed class set it points at semantic.

The numbers behind the fork

The choice is not only conceptual; we have the measurements. Trained as one three-class per-pixel model, the semantic setup peaks at F1 0.55 and IoU 0.51 across the classes it labels jointly. Those are honest numbers for thin, sparse structures on noisy scans, and the whitepaper covers why thin-structure overlap is hard, but the shape that matters here is that a single model carries a shared understanding of the whole picture: it knows that a pixel is curve1 partly because it knows the pixel next to it is background and the one below is curve2. The classes are decided together, in one pass, against one another.
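
For readers who want the two metrics pinned down, here is a sketch of per-class F1 and IoU computed from a predicted and a true label map. It assumes the usual pixel-count definitions, F1 = 2TP / (2TP + FP + FN) and IoU = TP / (TP + FP + FN). This note does not say how the archive figures were averaged across classes or epochs, so treat this as the general recipe rather than our exact evaluation script. The toy grids are made up.

```python
def class_counts(pred, truth, cls):
    """Count true positives, false positives and false negatives for one class."""
    tp = fp = fn = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            if p == cls and t == cls:
                tp += 1
            elif p == cls:
                fp += 1
            elif t == cls:
                fn += 1
    return tp, fp, fn


def f1_and_iou(pred, truth, cls):
    """Per-class F1 and IoU from pixel counts; 0.0 when the class never appears."""
    tp, fp, fn = class_counts(pred, truth, cls)
    f1 = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0
    iou = tp / (tp + fp + fn) if (tp + fp + fn) else 0.0
    return f1, iou


# Toy label maps: 0 = background, 1 = curve1, 2 = curve2 (made-up values).
truth = [[0, 1, 1, 0],
         [0, 2, 2, 0]]
pred = [[0, 1, 0, 0],
        [0, 2, 2, 2]]

f1_c1, iou_c1 = f1_and_iou(pred, truth, cls=1)
f1_c2, iou_c2 = f1_and_iou(pred, truth, cls=2)
print(round(f1_c1, 3), round(iou_c1, 3), round(f1_c2, 3), round(iou_c2, 3))
```

On structures only a pixel or two wide, a single missed or extra pixel moves both scores a long way, which is part of why thin curves are hard to score well on.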

Contrast the instance-flavoured alternative for this task, which is to train separate binary masks, one per curve, and stitch them back together afterward. That path gives three disjoint masks scoring F1 0.37, 0.26, and 0.55, and the deeper problem is not just the lower average. It is that the masks have no shared notion of the scene. Each is solved in isolation, so nothing in the setup prevents two of them from claiming the same pixel, or all of them going quiet in the same gap, because none of them was ever told the others exist. The joint three-class model gets that mutual exclusivity for free, by construction. That, more than the headline metric, is the practical reason the semantic framing wins for a fixed class set.
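
The overlap-and-gap failure is easy to reproduce on a toy strip of four pixels. In the sketch below, each class is thresholded independently, as separate binary models would be, and then the same scores are read as if they came from one joint head and resolved by taking the highest-scoring class per pixel. The scores, the 0.5 threshold, and both helper names are assumptions for illustration, not values from our runs.

```python
# Made-up per-pixel scores for a 1x4 strip; each inner list is
# [background, curve1, curve2]. Independent models need not sum to 1.
scores = [
    [0.2, 0.6, 0.7],
    [0.4, 0.3, 0.3],
    [0.1, 0.8, 0.1],
    [0.9, 0.05, 0.05],
]


def independent_masks(scores, threshold=0.5):
    """One binary mask per class, each thresholded with no view of the others."""
    n_classes = len(scores[0])
    return [[int(px[k] >= threshold) for px in scores] for k in range(n_classes)]


def joint_labels(scores):
    """One label per pixel: the classes compete and exactly one wins."""
    return [max(range(len(px)), key=lambda k: px[k]) for px in scores]


masks = independent_masks(scores)
claims = [sum(mask[i] for mask in masks) for i in range(len(scores))]
overlaps = [i for i, n in enumerate(claims) if n > 1]  # claimed by several masks
gaps = [i for i, n in enumerate(claims) if n == 0]     # claimed by no mask
labels = joint_labels(scores)
print(overlaps, gaps, labels)
```

Pixel 0 is claimed by both curve masks and pixel 1 by none, so the stitching step has to invent a tie-break rule. The joint labelling never reaches that state, because each pixel receives exactly one class.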

When the other branch is right

None of this makes instance segmentation the weaker tool. It makes it the tool for a different job. If the task were "count and separate every vug in a carbonate image" or "outline each of an unknown number of grains", the objects would be countable, duplicated, and unlabelled in advance, and semantic labelling alone would collapse them into one undifferentiated region. That is precisely the "things" case the panoptic framing names [4], and it is where detecting each instance first [2] earns its cost. The lever in the exhibit is built to show this: push the object count up, and the fork tips over to the instance branch, because at that point you genuinely do not know how many separate objects there are, and telling them apart is the whole task.
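
A toy version of the counting case shows what semantic labelling alone leaves out. The sketch below takes a binary "grain" mask, the kind of single undifferentiated region a semantic model would emit, and separates it into objects with a plain 4-connected component pass. This is a deliberately simple stand-in, not the detect-then-mask approach of [2]: it only separates objects that do not touch, which is exactly where learned instance methods become necessary. The mask and the function name are illustrative assumptions.

```python
from collections import deque


def label_components(mask):
    """4-connected component labelling of a binary grid.

    Returns (ids, count): ids has the same shape as mask, with 0 for
    background and 1..count for each separate object.
    """
    height, width = len(mask), len(mask[0])
    ids = [[0] * width for _ in range(height)]
    count = 0
    for r in range(height):
        for c in range(width):
            if mask[r][c] and not ids[r][c]:
                count += 1
                ids[r][c] = count
                queue = deque([(r, c)])
                while queue:
                    y, x = queue.popleft()
                    for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                        inside = 0 <= ny < height and 0 <= nx < width
                        if inside and mask[ny][nx] and not ids[ny][nx]:
                            ids[ny][nx] = count
                            queue.append((ny, nx))
    return ids, count


# Made-up foreground mask: every 1 is "grain", with no notion of which grain.
grain_mask = [
    [1, 1, 0, 0, 1],
    [1, 0, 0, 0, 1],
    [0, 0, 1, 0, 0],
    [0, 0, 1, 1, 0],
]

ids, n_objects = label_components(grain_mask)
print(n_objects)
```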

The discipline is to answer the data question honestly rather than defaulting to whichever method is fashionable. For raster logs the answer is unambiguous. Two named curves, three fixed classes, the same set on every image: that is a semantic problem, and the machinery of instance detection would be effort spent solving a counting question that our data never poses.

Limitations

This is a decision guide grounded in one engagement, not a benchmark, and it should be read as such. The F1 and IoU figures are the real archive numbers for our specific curves, scans, and class set, so they characterise the difficulty of this task, not segmentation in general; a different operator's logs with more curves, crossing traces, or a genuinely variable curve count could shift the fork or even move it to the instance branch. The object-count lever in the exhibit is an illustrative control that dramatises the decision rule, not a measured series, and the split point on it is a rule of thumb rather than a learned threshold. The framing also assumes the class set really is known and stable ahead of time; where that assumption fails, where new kinds of curve can appear without warning, the tidy semantic case weakens and the honest answer becomes more mixed. And picking the right branch only settles the framing. It does not decide the backbone, the loss under foreground scarcity, or whether a mask that scores well actually reconstructs a usable curve, which are separate questions this note does not try to answer.

References

[1] Long, J., Shelhamer, E., and Darrell, T. Fully Convolutional Networks for Semantic Segmentation. IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2015), pp. 3431-3440. The paper that fixed per-pixel dense labelling as the standard shape of semantic segmentation. https://openaccess.thecvf.com/content_cvpr_2015/html/Long_Fully_Convolutional_Networks_2015_CVPR_paper.html

[2] He, K., Gkioxari, G., Dollár, P., and Girshick, R. Mask R-CNN. IEEE International Conference on Computer Vision (ICCV 2017), pp. 2961-2969. The reference instance-segmentation method: detect each object, then mask inside each detection. https://openaccess.thecvf.com/content_iccv_2017/html/He_Mask_R-CNN_ICCV_2017_paper.html

[3] Ronneberger, O., Fischer, P., and Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation. MICCAI 2015, Lecture Notes in Computer Science 9351, Springer, pp. 234-241. The encoder-decoder that produces a full-resolution per-pixel map for a small, known class set. https://link.springer.com/chapter/10.1007/978-3-319-24574-4_28

[4] Kirillov, A., He, K., Girshick, R., Rother, C., and Dollár, P. Panoptic Segmentation. IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2019), pp. 9404-9413. The framing that separates uncountable "stuff" from countable "things", the distinction that decides the fork. https://openaccess.thecvf.com/content_CVPR_2019/html/Kirillov_Panoptic_Segmentation_CVPR_2019_paper.html

© 2026 Copyright. Earthscan