If you have only heard semantic segmentation described from the outside, it sounds like the machine is drawing over the picture, inking in each curve the way a person would with a fine pen. That mental picture is wrong in a way that matters, and getting it right is the point of this primer. The network does not trace anything. It does not start at one end of a curve and walk along it. What it does is more mechanical and, once you see it, more robust: it goes to every pixel and answers a single small question about that pixel alone: which class does it belong to? Do that everywhere and you have a class map, a label for every position on the sheet. Turning that map back into a curve is a separate, later step. The segmentation itself is a pile of independent per-pixel votes.
We work on scanned well logs, sheets on which measurement curves were printed decades ago and filed away, and the task is to recover the numbers those curves encode. It is a good home for the idea, because a log makes every feature of segmentation vivid: the curves are thin, they tangle, and the page around them is mostly empty. Everything below is about that setting, but the mechanism is the general one.
The unit of the decision is a pixel, not a line
The cleanest way to feel the difference is to name the alternative. The intuitive method, the one that matches how a human traces, is edge-following: find where a curve starts, then step along it, at each move choosing the neighbour that best continues the line. That is a sequential procedure with memory: its decision at any point depends on where it just came from. Semantic segmentation throws that structure away. The formulation that made it work, Long, Shelhamer, and Darrell's fully convolutional network, reframed dense labelling as classification applied densely: the same classifier that would say "this whole image is a cat" is run at every location to say "this pixel is class k" [1]. The output is not a traced path. It is a grid of class assignments, each made without reference to the labels assigned to its neighbours. Nearby pixels influence a decision only through the shared features the network learned.
On our task the class set is deliberately tiny. There are three output classes: background, curve one, and curve two. The input is one channel, a greyscale image, because a scanned log is a grey picture and there is no colour information to lean on. In the final multiclass dataset each log carries two constant curves, which is why two of the three classes are curves and the third is everything else. So the model's entire job, stated honestly, is to look at a single grey value in its learned context and sort it into one of three bins. That is it. No line, no walking, no memory of the last pixel.
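As a minimal sketch of that sorting step, assume the network has already produced three scores for every pixel. The score layout, the class ordering, and the helper name `class_map` are illustrative assumptions, not part of our pipeline. Under those assumptions the class map is simply the highest-scoring class at each position:

```python
# Assumed class ordering for illustration: index 0 is background.
CLASS_NAMES = ("background", "curve_1", "curve_2")


def class_map(scores):
    """Return the index of the highest-scoring class at every pixel.

    scores[r][c] is a sequence of three class scores for pixel (r, c).
    """
    return [
        [max(range(len(pixel)), key=pixel.__getitem__) for pixel in row]
        for row in scores
    ]


# A 2x2 toy patch with hand-written scores (not real network output).
toy_scores = [
    [(2.0, 0.1, 0.3), (0.2, 1.5, 0.1)],
    [(0.1, 0.2, 1.9), (3.0, 0.0, 0.0)],
]
labels = class_map(toy_scores)
named = [[CLASS_NAMES[k] for k in row] for row in labels]
```

Each entry is computed from that pixel's scores alone. Whatever the neighbourhood contributes has already been folded into the scores by the network before this step.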
Why per-pixel labelling wins exactly where tracing loses
The reason this framing is not just a technicality shows up at the crossing. Two log curves will, sooner or later, pass over each other on the page. At that spot an edge-follower is in trouble, because its whole method is local continuation and at a crossing there are two plausible continuations and no local rule to choose between them. It can, and does, hop from one strand onto the other and carry that mistake onward, because it has no notion of which line it was ever on. It only ever knew "keep going straight-ish."
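The hop can be reproduced on a toy grid. The sketch below is a deliberately naive follower, assumed for illustration and not any specific tracer: it steps down one row at a time and accepts any inked pixel within one column of where it stands. The grid, the function names, and the leftmost-first tie-break are all assumptions.

```python
def make_crossing(size=7):
    """Ink columns per row for two straight strands that cross once."""
    # Strand A runs top-left to bottom-right, strand B top-right to bottom-left.
    return [{r, size - 1 - r} for r in range(size)]


def follow(ink, start_col):
    """Greedy edge-follower.

    Returns (path, ambiguous_rows), where ambiguous_rows lists every row
    at which more than one continuation was available.
    """
    path = [start_col]
    ambiguous_rows = []
    for r in range(1, len(ink)):
        here = path[-1]
        candidates = [c for c in (here - 1, here, here + 1) if c in ink[r]]
        if not candidates:
            break  # the strand ended or the follower lost it
        if len(candidates) > 1:
            ambiguous_rows.append(r)
        path.append(candidates[0])  # arbitrary tie-break: leftmost candidate
    return path, ambiguous_rows


ink = make_crossing()
path, ambiguous_rows = follow(ink, start_col=0)  # start on strand A
strand_a = list(range(len(ink)))
```

Started on strand A, this follower reaches the row just past the crossing with two equally valid continuations, takes the left one, and finishes on strand B. A better tie-break, such as preferring the previous direction, would rescue straight strands like these, but it remains a local guess about which strand the follower is on.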
A per-pixel classifier has no such failure mode, because it was never following a line to begin with. At the crossing it still asks the only question it ever asks, which class this pixel belongs to, and it answers from everything the pixel's neighbourhood tells it: curvature on the way in, expected slope, the faint difference in how the two curves were drawn. The crossing is not a special case for it. It is just more pixels to label. The encoder-decoder shape that does this well, U-Net's contracting path for context and expanding path with skip connections for precise placement, was built precisely so that a thin structure keeps its fine location while the network still sees enough around it to know what the structure is [2]. That combination, wide context and sharp localisation, is what lets the two strands stay separate through an overlap that defeats sequential tracing.
The exhibit makes the argument concrete. Drag the scan line across a small patch holding two curves that cross once. Away from the crossing an edge-follower and a per-pixel classifier agree, so nothing is at stake. Slide into the crossing and the edge-follower's verdict turns ambiguous, because it genuinely has no local rule for which strand it is on, while the classifier's per-pixel labels stay clean. The one element of the exhibit that carries the argument, the orange one, is the edge-follower's verdict, which turns uncertain at exactly the place where the pixel-labeller's does not.
The lopsidedness is a feature of the framing, not a bug
There is a second reason to prefer per-pixel classification, and it is about arithmetic. A thin curve on a mostly empty page means the classes are wildly unequal in size. On our logs the background is about 97 percent of the pixels, and the two curves together are under 2 percent. If you were tracing lines, that imbalance would be invisible, because you would only ever be on a line. Cast as per-pixel classification, it becomes an explicit property of three classes: two tiny, one enormous.
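To see how such a split arises, here is a small sketch that counts class membership in a label map. The 100-by-100 map with two one-pixel-wide diagonal strands is a synthetic stand-in, not a measured log, and the names `toy_label_map` and `class_fractions` are illustrative.

```python
from collections import Counter

BACKGROUND, CURVE_1, CURVE_2 = 0, 1, 2


def toy_label_map(size=100):
    """Square label map with two one-pixel-wide diagonal strands."""
    labels = [[BACKGROUND] * size for _ in range(size)]
    for r in range(size):
        labels[r][r] = CURVE_1
        # For an even size the two diagonals never share a pixel.
        labels[r][size - 1 - r] = CURVE_2
    return labels


def class_fractions(labels):
    """Fraction of pixels belonging to each of the three classes."""
    counts = Counter(value for row in labels for value in row)
    total = sum(counts.values())
    return {k: counts.get(k, 0) / total for k in (BACKGROUND, CURVE_1, CURVE_2)}


fractions = class_fractions(toy_label_map())
```

On this toy map the background holds 98 percent of the pixels and each curve 1 percent. Thicker strokes or a different scan resolution shift the numbers, but a thin curve stays a small fraction of the page.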
That visibility is what makes the imbalance manageable rather than fatal. Because the problem is now "classify pixels into three classes of very different frequency," it lands in a well-understood regime, the one Lin and colleagues addressed for dense object detection, where a naively trained classifier will label everything background and score about 97 percent accuracy while getting every curve pixel wrong [3]. The fix lives in the loss, which can be weighted or reshaped so the rare curve pixels count for as much as the common background ones. An edge-following formulation has nowhere to put that fix, because it never represented the background as a class at all. Per-pixel classification gives the imbalance a home, and once it has a home it has a handle.
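One way to express that fix, sketched here under assumptions, combines two ideas: inverse-frequency class weights applied to cross-entropy, and the focal term from [3], which shrinks the loss on pixels the model already classifies confidently. The paper defines focal loss for binary classification in object detection, so the softmax variant below, the illustrative class fractions, and all function names are assumptions rather than our training configuration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over one pixel's class scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def inverse_frequency_weights(fractions):
    """Weights w_k = 1 / (K * f_k), so every class contributes equally in expectation."""
    k = len(fractions)
    return [1.0 / (k * f) for f in fractions]


def weighted_cross_entropy(scores, target, weights):
    """Per-pixel loss: -w_target * log(p_target)."""
    return -weights[target] * math.log(softmax(scores)[target])


def focal_loss(scores, target, gamma=2.0):
    """Per-pixel focal term: -(1 - p_target)^gamma * log(p_target)."""
    p_t = softmax(scores)[target]
    return -((1.0 - p_t) ** gamma) * math.log(p_t)


fractions = [0.98, 0.01, 0.01]     # illustrative: background, curve 1, curve 2
weights = inverse_frequency_weights(fractions)
easy_background = [4.0, 0.0, 0.0]  # confident, correct background pixel
hard_curve = [1.0, 0.5, 0.0]       # curve-1 pixel the model leans away from
```

With these weights a curve pixel counts roughly a hundred times as much as a background pixel. With `gamma=0` the focal term reduces to plain cross-entropy, and larger values push the loss on easy pixels such as `easy_background` toward zero.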
What the primer is really claiming
Put together, the claim of this note is narrow and, we think, correct: for recovering curves from a scanned log, deciding a class for every pixel is a better formulation than following an edge, for two connected reasons. It survives the crossings that break sequential tracing, because it never depended on staying on a line. And it turns the brutal 97-percent-to-under-2-percent class split into an explicit three-class fact the loss can act on, instead of a nuisance the method cannot see. Everything more sophisticated we do downstream is built on that one reframing. Start from pixels, not lines, and the hard parts of curve tracing become the parts the method was designed for.
Limitations
This is a primer on a formulation, not a report of results. The class count of three, the single greyscale channel, and the two curves per log are real properties of our task, and the roughly 97 percent background with under 2 percent foreground is a fair characterisation of how thin curves divide the pixels, but the exact split varies from log to log with curve thickness, scan resolution, and how much of the sheet is annotation. The patch and the two curves in the exhibit are illustrative geometry drawn to make per-pixel labelling legible, not a measured log, and the edge-follower there stands in for the family of sequential methods rather than any specific tracer. Per-pixel classification has its own failure modes too: a class map with the strands correctly separated still has to become a continuous, depth-indexed curve, and a locally clean label field can carry gaps or speckle the reconstruction step must handle. This note argues only that the per-pixel framing is the right starting point, not that it settles everything after it.
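A minimal sketch of what that reconstruction step might look like, assuming depth runs down the rows of the class map and the curve value is read from the column position. The helper names, the per-row mean, and the linear gap fill are illustrative choices, not a description of our pipeline.

```python
def curve_from_class_map(labels, curve_class):
    """One value per row: mean column of the pixels labelled curve_class.

    Rows with no such pixel yield None, marking a gap.
    """
    trace = []
    for row in labels:
        cols = [c for c, value in enumerate(row) if value == curve_class]
        trace.append(sum(cols) / len(cols) if cols else None)
    return trace


def fill_gaps(trace):
    """Linearly interpolate interior gaps; leading and trailing gaps stay None."""
    filled = list(trace)
    known = [i for i, value in enumerate(filled) if value is not None]
    for left, right in zip(known, known[1:]):
        span = right - left
        for i in range(left + 1, right):
            t = (i - left) / span
            filled[i] = filled[left] + t * (filled[right] - filled[left])
    return filled


toy_labels = [
    [0, 1, 0, 0],
    [0, 0, 0, 0],  # gap: no pixel labelled as curve 1 in this row
    [0, 0, 0, 1],
]
trace = curve_from_class_map(toy_labels, curve_class=1)
filled = fill_gaps(trace)
```

A per-row mean is sensitive to speckle, since one stray mislabelled pixel far from the strand pulls the average toward it. A median or a connected-component filter may be a more robust choice.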
References
[1] Long, J., Shelhamer, E., and Darrell, T. Fully Convolutional Networks for Semantic Segmentation. CVPR (2015). The recasting of dense labelling as per-pixel classification, where the network outputs a class map rather than one label for the whole image. https://arxiv.org/abs/1411.4038
[2] Ronneberger, O., Fischer, P., and Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation. MICCAI (2015). The encoder-decoder with skip connections that pairs wide context with sharp localisation, the reason thin structures keep their place while still being classified correctly. https://arxiv.org/abs/1505.04597
[3] Lin, T.-Y., Goyal, P., Girshick, R., He, K., and Dollár, P. Focal Loss for Dense Object Detection. ICCV (2017). The treatment of extreme foreground-background imbalance in dense object detection, the same regime a thin curve on an empty page occupies. https://arxiv.org/abs/1708.02002