How Maps Get Turned Into Data: A Look at Cartographic Vectorization

A paper map is a picture of the world, and a picture is a poor thing to ask questions of. You cannot ask a scanned topographic sheet how long a coastline is, or which contour crosses a road, because the sheet does not know it has a coastline or a contour on it. It knows it has dark pixels and light pixels. Turning that grid of pixels into geometry a computer can actually query (a coastline that is a list of points, a contour that carries an elevation, a road that knows it is a road) is the raster-to-vector problem, and cartographers have been working on it for as long as there have been scanners to feed. This note is a primer on that problem, and it makes one argument along the way: the pipeline we now use to lift a curve off a scanned well log is not a new invention. It is the cartographic raster-to-vector problem, narrowed to a very thin strip of paper.

We came at this from the subsurface side, digitising scanned logs, and it took us longer than it should have to notice that we were re-deriving steps a mapmaker would have recognised on sight. So the honest way to explain what we build is to explain the parent problem first, on maps, where the intuition is cleanest, and only then point at the log and say: same staircase, one step at a time.

A map is not data until someone reduces it

The reason a scan is not data is worth stating plainly, because it is the whole motivation for everything that follows. A scan is a raster: a rectangular grid of samples, each a brightness value, with no notion of what any group of them means together. The information a mapmaker cares about, that this run of dark pixels is one continuous shoreline and that other run is a river, is not written anywhere in the grid. It is implied by the arrangement, and implication is not something a database can index. Vectorization is the act of making that implication explicit: replacing "here are some dark pixels that happen to line up" with "here is a polyline, these are its vertices in order, and here is what it represents."

Herbert Freeman set this out as a computing problem as early as 1974, when the field was still called line-drawing image processing rather than vectorization [1]. His framing already contained the whole staircase in miniature: reduce the image to strokes, thin those strokes to their skeletons, follow the skeletons to recover ordered chains of points, and encode the result. Fifty years of methods have changed the tools at every step, from hand-tuned morphology to learned segmentation, but the shape of the problem is stubbornly the same, and that stability is exactly why the same four steps show up when the picture happens to be a well log rather than a nautical chart.

Step one: binarise, or decide what counts as ink

The first step is the most humble and the most consequential. A scanned map arrives in shades of grey, or in colour, and almost none of that variation carries meaning. The paper has aged unevenly, the scanner lamp was brighter in the middle, someone spilled coffee on the corner in 1987. Binarisation is the decision, pixel by pixel, of what is ink and what is background. Get it wrong and every later step inherits the mistake: a threshold that is too strict about what counts as ink erases faint contour lines, and one that is too permissive welds neighbouring lines into a single blob that no tracer can separate.
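As a minimal sketch of how the ink decision can be made automatically, the block below uses Otsu's method, which picks the grey level that best splits the histogram into a dark class and a light class. Otsu's method is an assumption chosen for illustration, not something the pipeline described here is known to use, and it is a single global threshold, so it does nothing about the uneven lamp or the coffee stain. The convention assumed is 0 for black and 255 for white, with ink being any pixel at or below the threshold; the function names are hypothetical.

```python
def otsu_threshold(pixels):
    """Return the grey level t (0-255) that maximises between-class variance.

    Pixels at or below t form the dark class, pixels above t the light class.
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    n_dark = 0     # pixels at or below the candidate level
    sum_dark = 0   # sum of their grey values
    for t in range(256):
        n_dark += hist[t]
        sum_dark += t * hist[t]
        n_light = total - n_dark
        if n_dark == 0:
            continue           # no dark class yet
        if n_light == 0:
            break              # no light class left
        mean_dark = sum_dark / n_dark
        mean_light = (sum_all - sum_dark) / n_light
        var_between = n_dark * n_light * (mean_dark - mean_light) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


def binarise(image):
    """image is a list of rows of 0-255 grey values; returns (ink_mask, threshold)."""
    flat = [p for row in image for p in row]
    t = otsu_threshold(flat)
    mask = [[1 if p <= t else 0 for p in row] for row in image]
    return mask, t


# A tiny made-up scan: a dark stroke on light paper.
page = [
    [210, 205, 30, 200],
    [215, 25, 20, 220],
    [208, 212, 28, 203],
]
ink, threshold = binarise(page)
```

On unevenly lit sheets a locally adaptive threshold is usually the safer choice; the global version is shown only because it fits in a few lines.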

On a map with many overlapping layers, drawn in different colours, this step is often really a layer-separation step: pull the blue hydrography apart from the black culture apart from the brown relief, then binarise each layer on its own terms. Chiang, Leyk, and Knoblock's survey of digital map processing treats this colour and layer separation as the front of the whole pipeline, precisely because everything downstream depends on cleanly deciding what belongs to which feature before you try to trace anything [4]. The lesson transfers directly: whatever the picture, step one is always the same question, which marks on this sheet are signal and which are the accidents of paper and light.
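One simple way to picture layer separation is nearest-reference-colour classification: each pixel goes to whichever ink colour it is closest to, and each ink gets its own binary mask. The sketch below assumes this approach and uses made-up RGB reference values; a real sheet would need its inks sampled from the scan, and the surveyed methods are considerably more robust than a plain squared distance in RGB.

```python
# Hypothetical reference colours (RGB). These values are illustrative only.
LAYER_COLOURS = {
    "paper": (240, 235, 220),
    "hydrography": (40, 90, 200),   # blue
    "culture": (20, 20, 20),        # black
    "relief": (150, 90, 40),        # brown
}


def nearest_layer(rgb):
    """Return the name of the reference colour closest to rgb (squared RGB distance)."""
    def dist2(ref):
        return sum((a - b) ** 2 for a, b in zip(rgb, ref))
    return min(LAYER_COLOURS, key=lambda name: dist2(LAYER_COLOURS[name]))


def separate_layers(image):
    """image is a list of rows of (r, g, b) tuples; returns {layer: binary mask}."""
    masks = {
        name: [[0] * len(row) for row in image]
        for name in LAYER_COLOURS
        if name != "paper"
    }
    for y, row in enumerate(image):
        for x, rgb in enumerate(row):
            layer = nearest_layer(rgb)
            if layer != "paper":
                masks[layer][y][x] = 1
    return masks


# Four made-up pixels: paper, blue ink, black ink, brown ink.
sheet = [
    [(238, 230, 215), (45, 95, 190)],
    [(25, 18, 22), (140, 85, 50)],
]
layers = separate_layers(sheet)
```

Each mask in `layers` can then be cleaned and traced on its own terms, which is the point of separating before binarising.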

Step two: trace, or turn ink into strokes you can follow

Once you know what is ink, you have to decide which ink belongs together. Tracing is the step that groups the surviving pixels into distinct strokes: this cluster is one contour, that cluster is a different contour, and the two that touch at a saddle are still two lines that happen to meet. Classically this was connected-component analysis followed by thinning, reducing each thick inked stroke down to a one-pixel-wide skeleton that a follower can walk without ambiguity. Lam, Lee, and Suen's survey catalogues the thinning methods that make this possible and, just as usefully, the ways they fail: spurious branches at junctions, eroded line ends, the small betrayals that a downstream vectoriser will faithfully turn into geometry [2].
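The grouping half of the classical trace step can be sketched as connected-component labelling: every ink pixel gets the number of the stroke it belongs to. The version below is a plain breadth-first flood fill with 8-connectivity, which is an assumption (4-connectivity is the other common choice), and it leaves thinning out entirely. The function name is hypothetical.

```python
from collections import deque


def label_components(mask):
    """Label 8-connected groups of 1-pixels in a binary mask.

    Returns (labels, count): labels has the same shape as mask, with 0 for
    background and 1..count for each connected stroke.
    """
    h, w = len(mask), len(mask[0])
    labels = [[0] * w for _ in range(h)]
    count = 0
    for y0 in range(h):
        for x0 in range(w):
            if mask[y0][x0] != 1 or labels[y0][x0] != 0:
                continue  # background, or already assigned to a stroke
            count += 1
            labels[y0][x0] = count
            queue = deque([(y0, x0)])
            while queue:
                y, x = queue.popleft()
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < h and 0 <= nx < w
                                and mask[ny][nx] == 1 and labels[ny][nx] == 0):
                            labels[ny][nx] = count
                            queue.append((ny, nx))
    return labels, count


# Three separate strokes: a horizontal pair, a diagonal pair, a lone pixel.
ink = [
    [1, 1, 0, 0, 0],
    [0, 0, 0, 0, 1],
    [0, 0, 0, 1, 0],
    [1, 0, 0, 0, 0],
]
labels, n_strokes = label_components(ink)
```

Note that two lines which touch anywhere come out as one component under this scheme, which is why junction handling needs more than labelling alone.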

In modern practice the trace step is often where learning enters, as per-pixel classification: instead of a single ink-or-not decision, each pixel is assigned to a class, background or line-of-type-A or line-of-type-B. That is the same job Freeman described, done with a segmentation model instead of a morphological operator, and it is the step where the well-log pipeline and the map pipeline become visibly identical, because both are asking a network the same question about every pixel.
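Whatever model produces them, per-pixel class scores become a label map by taking the highest-scoring class at each pixel. The sketch below assumes the scores already exist as one tuple per pixel and uses hypothetical class names; it shows only the final assignment, not the segmentation model itself.

```python
# Hypothetical class names for a three-class labelling.
CLASSES = ("background", "line_a", "line_b")


def label_pixels(scores):
    """scores[y][x] is a tuple with one score per class.

    Returns a map of class indices, choosing the highest score at each pixel
    (the first class wins on an exact tie).
    """
    return [
        [max(range(len(px)), key=px.__getitem__) for px in row]
        for row in scores
    ]


def class_mask(label_map, class_index):
    """Binary mask of the pixels assigned to one class."""
    return [[1 if v == class_index else 0 for v in row] for row in label_map]


# Made-up scores for a 2x2 patch.
scores = [
    [(0.90, 0.05, 0.05), (0.20, 0.70, 0.10)],
    [(0.10, 0.20, 0.70), (0.60, 0.30, 0.10)],
]
label_map = label_pixels(scores)
line_a_mask = class_mask(label_map, CLASSES.index("line_a"))
```

Each class mask then plays the role that a separated, binarised layer plays in the classical pipeline.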

Figure: the raster-to-vector staircase, drawn once and read twice. Binarise, trace, vectorise, attribute: four steps that each throw pixels away and keep meaning, taking a scanned picture down to an ordered symbol. The worked example is given for two domains: cartography, the parent problem, where a scanned sheet becomes a named coastline polygon, and the well log, one instance of it, where a scanned strip becomes a depth-indexed curve. The steps and the direction of reduction do not change, because the sameness is the argument. The single orange mark is the reduction arrow running down the staircase: every step sheds pixels and keeps meaning. The well-log column plots archive numbers only: 136,771 raster TIF sheets in, 7,781 vector LAS strips out, 3-class per-pixel labelling at the trace step, 300 interpolated depth points at the vectorise step, written as CSV or LAS. The cartographic column names the analogous stages and is illustrative, not a count.

Step three: vectorise, or reduce a path to points

A traced skeleton is still a picture. It is a thin picture, but it is a list of pixels, not a list of vertices, and a coastline stored as ten thousand adjacent pixels is both wasteful and useless for measurement. The vectorise step follows each skeleton and reduces it to an ordered, sparse set of points that reproduces the shape to a chosen tolerance. This is where one of the most cited ideas in the field lives: the Douglas-Peucker algorithm. It keeps the two endpoints of a path, finds the intermediate point farthest from the straight line between them, and, if that point lies beyond the tolerance, keeps it and repeats the test on each half; once no remaining point strays beyond the tolerance, everything in between is thrown away, leaving only the vertices that actually carry the shape [3]. A dense pixel chain becomes a handful of points, and for the first time the thing is data: you can measure its length, test whether it crosses another line, store it in a fraction of the space.
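The reduction is short enough to sketch in full. The block below is a recursive Douglas-Peucker simplifier in plain Python, written for illustration rather than taken from any pipeline described here. It measures distance to the segment between the two endpoints rather than to the infinite line, a common variant, and because it recurses it would need an iterative rewrite for very long chains. The function names are hypothetical.

```python
import math


def point_segment_distance(p, a, b):
    """Shortest distance from point p to the segment from a to b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    if dx == 0 and dy == 0:
        return math.hypot(px - ax, py - ay)   # degenerate segment
    # Parameter of the projection of p onto the segment, clamped to [0, 1].
    t = ((px - ax) * dx + (py - ay) * dy) / (dx * dx + dy * dy)
    t = max(0.0, min(1.0, t))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


def douglas_peucker(points, tolerance):
    """Simplify an ordered list of (x, y) points to within the given tolerance."""
    if len(points) < 3:
        return list(points)
    first, last = points[0], points[-1]
    index, dmax = 0, 0.0
    for i in range(1, len(points) - 1):
        d = point_segment_distance(points[i], first, last)
        if d > dmax:
            index, dmax = i, d
    if dmax <= tolerance:
        return [first, last]                  # nothing in between matters
    left = douglas_peucker(points[: index + 1], tolerance)
    right = douglas_peucker(points[index:], tolerance)
    return left[:-1] + right                  # drop the duplicated split point


def polyline_length(points):
    """Total length of a polyline given as ordered (x, y) points."""
    return sum(
        math.hypot(x1 - x0, y1 - y0)
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )


# An L-shaped pixel chain: eight points in, three vertices out.
chain = [(0, 0), (1, 0), (2, 0), (3, 0), (4, 0), (4, 1), (4, 2), (4, 3)]
simplified = douglas_peucker(chain, tolerance=0.5)
# simplified == [(0, 0), (4, 0), (4, 3)], with length 7.0
```

The corner vertex survives because it is the point farthest from the chord between the endpoints; the collinear pixels on either side of it carry no shape and are dropped.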

The choice of tolerance is a genuine one, and it is the same choice on a log as on a coastline. Too coarse and you lose the wiggles that mattered; too fine and you have paid for a thousand points to describe a straight edge. On the logs we worked with, the vectorise step lands each curve on a fixed grid of 300 interpolated depth points, which is the log's version of picking a tolerance: enough points to hold the shape of the trace against depth, few enough that the output is a table rather than another image.
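Landing a traced curve on a fixed number of depth points can be sketched as resampling. The block below assumes the grid is evenly spaced between the shallowest and deepest traced sample and that interpolation is linear; the source only says the points are interpolated, so both are assumptions, as are the function name and the sample values.

```python
def resample_curve(depths, values, n_points=300):
    """Linearly interpolate (depth, value) samples onto n_points evenly spaced depths.

    depths must be strictly increasing and the same length as values.
    Returns (new_depths, new_values).
    """
    if len(depths) != len(values) or len(depths) < 2:
        raise ValueError("need at least two matching depth/value samples")
    if n_points < 2:
        raise ValueError("n_points must be at least 2")
    top, base = depths[0], depths[-1]
    step = (base - top) / (n_points - 1)
    new_depths, new_values = [], []
    j = 0  # index of the source segment [depths[j], depths[j + 1]] in use
    for i in range(n_points):
        # Pin the last point to the base depth to avoid float drift.
        d = base if i == n_points - 1 else top + i * step
        while j < len(depths) - 2 and depths[j + 1] < d:
            j += 1
        frac = (d - depths[j]) / (depths[j + 1] - depths[j])
        new_depths.append(d)
        new_values.append(values[j] + frac * (values[j + 1] - values[j]))
    return new_depths, new_values


# Three made-up traced samples, resampled onto the default 300-point grid.
traced_depths = [1000.0, 1010.0, 1030.0]
traced_values = [50.0, 70.0, 30.0]
grid_depths, grid_values = resample_curve(traced_depths, traced_values)
```

A fixed grid is a different trade from Douglas-Peucker: it spends points uniformly along depth rather than concentrating them where the curve bends, which suits a tabular output with one row per depth.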

Step four: attribute, or hang meaning off the geometry

Geometry alone is still half-mute. A polyline that knows its vertices but not that it is the 40-metre contour, or that it is the resistivity curve rather than the gamma-ray curve, cannot answer the questions people actually ask. The attribute step attaches the semantics: elevation to a contour, a name to a coastline, units and a curve identity to a log trace. On maps this is where recognised text gets matched to the lines it labels, and where a layer becomes a queryable feature class. It is the least glamorous step and often the one that decides whether the whole exercise was worth doing, because an unattributed vector layer is a drawer full of anonymous shapes.
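In data-structure terms, attribution means pairing each geometry with a record of what it is. The sketch below uses a hypothetical `Feature` class and made-up attribute keys and values to show why the pairing matters: once the attributes exist, a layer can be queried by meaning rather than by shape.

```python
from dataclasses import dataclass, field


@dataclass
class Feature:
    """A vectorised line plus the semantics attached to it (hypothetical schema)."""
    vertices: list                              # ordered (x, y) tuples
    attributes: dict = field(default_factory=dict)


def select(features, **wanted):
    """Return the features whose attributes match every requested key and value."""
    return [
        f for f in features
        if all(f.attributes.get(k) == v for k, v in wanted.items())
    ]


# A made-up layer: two contours and one coastline.
layer = [
    Feature([(0, 0), (5, 1), (9, 0)], {"kind": "contour", "elevation_m": 40}),
    Feature([(0, 2), (5, 3), (9, 2)], {"kind": "contour", "elevation_m": 50}),
    Feature([(0, 9), (4, 8), (9, 9)], {"kind": "coastline", "name": "Example Bay"}),
]

forty_metre = select(layer, kind="contour", elevation_m=40)
contours = select(layer, kind="contour")
```

Strip the `attributes` dictionaries from the same layer and neither query can be answered, which is the drawer of anonymous shapes in miniature.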

For the log, attribution is what turns a reduced curve into a file another tool will accept: the depth points get their units, the trace gets its curve identity, and the result is written out as CSV or LAS, the plain formats the rest of the subsurface stack already reads. The output of the whole staircase, on the archive we worked with, is a set of these: from 136,771 scanned raster TIF sheets on the input side, the vector side of the same archive holds 7,781 LAS strips, each the end product of exactly these four steps run on one piece of paper.
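For the CSV case, the attribute step can be as small as writing a header that names the curve and its units above the depth-indexed values. The sketch below assumes a hypothetical `MNEMONIC.UNIT` column-naming convention and made-up curve names, units, and number formats. It does not attempt LAS, whose header sections are defined by the format's own specification.

```python
import csv
import io


def write_curve_csv(stream, depths, values, curve_name, depth_unit, value_unit):
    """Write one depth-indexed curve as CSV with units carried in the header.

    The header layout is an illustrative convention, not a standard.
    """
    writer = csv.writer(stream, lineterminator="\n")
    writer.writerow([f"DEPT.{depth_unit}", f"{curve_name}.{value_unit}"])
    for depth, value in zip(depths, values):
        writer.writerow([f"{depth:.2f}", f"{value:.3f}"])


# Made-up curve: three depth points of a gamma-ray trace.
buffer = io.StringIO()
write_curve_csv(
    buffer,
    depths=[1000.0, 1000.1, 1000.2],
    values=[50.0, 52.5, 51.25],
    curve_name="GR",
    depth_unit="M",
    value_unit="GAPI",
)
csv_text = buffer.getvalue()
```

The same function writes to a real file if `stream` is opened with `open(path, "w", newline="")`.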

Why it matters that it is one problem

The reason to insist on the shared staircase is not tidiness. It is that treating log digitisation as its own exotic problem leads teams to reinvent, badly, machinery that cartography settled decades ago, from line simplification to the discipline of separating layers before tracing. Recognising the log as a special case of the map means the map literature is available as prior art at every step, and the map tooling is available as a sanity check. The three-class per-pixel labelling we use at the trace step is a segmentation choice a map processor would find familiar; the 300-point reduction is a tolerance choice of the kind a fifty-year-old line-simplification algorithm was written to make; the CSV or LAS output is the attribute step, no different in kind from writing a shapefile.

It also sets expectations honestly. Maps taught the field long ago that binarisation errors dominate the final quality, that junctions are where tracing breaks, and that attribution is where projects quietly stall. None of those lessons stop applying because the picture is tall and thin and shows depth instead of latitude. The well log is a map of one very narrow column of rock, and every hard part of digitising it is a hard part cartographers already named.

Limitations

This is a primer, not a benchmark, and it should be read as one. The four-step framing is a teaching decomposition; real pipelines interleave the steps, iterate between them, and sometimes fold trace and vectorise into a single learned model rather than running them in sequence, so the clean staircase is a simplification of messier practice. The well-log figures are real archive counts (136,771 raster TIF sheets, 7,781 vector LAS strips, three output classes, and 300 interpolated depth points), but they describe the shape of one archive we worked with, not a universal ratio; a different collection would binarise, trace, and reduce to different numbers. The cartographic side of the comparison is described at the level of the standard pipeline and its founding methods rather than measured on a specific map series, and it is meant to carry the intuition, not to stand in as data. Finally, this note deliberately stops at the pipeline's shape and does not evaluate any particular model's accuracy at any step; whether a given trace is good enough to trust is a separate question that only ground truth can answer.

The map underneath the log

The habit worth keeping from all of this is to reach for the map first. When a step in a log-digitisation pipeline is confusing, the fastest way to understand it is usually to ask what the same step does to a coastline, because the coastline version is older, better studied, and easier to picture. Binarise, trace, vectorise, attribute: the order is the same, the failure modes are the same, and the output is the same kind of thing, a picture reduced to ordered geometry with meaning attached. The log is not a new problem. It is cartography, pointed straight down.

References

[1] Freeman, H. Computer Processing of Line-Drawing Images. ACM Computing Surveys 6(1), 1974, pp. 57-97. The early survey that framed line-drawing conversion as thresholding, thinning, tracing, and encoding, the raster-to-vector staircase before it had the name. https://dl.acm.org/doi/10.1145/356625.356627

[2] Lam, L., Lee, S.-W., and Suen, C. Y. Thinning Methodologies: A Comprehensive Survey. IEEE Transactions on Pattern Analysis and Machine Intelligence 14(9), 1992, pp. 869-885. The reference survey of skeleton and centreline extraction and its failure modes. https://ieeexplore.ieee.org/document/161346

[3] Douglas, D. H., and Peucker, T. K. Algorithms for the Reduction of the Number of Points Required to Represent a Digitized Line or its Caricature. Cartographica 10(2), 1973, pp. 112-122. The classic line-simplification method behind the vectorise step. https://utpjournals.press/doi/10.3138/FM57-6770-U75U-7727

[4] Chiang, Y.-Y., Leyk, S., and Knoblock, C. A. A Survey of Digital Map Processing Techniques. ACM Computing Surveys 47(1), 2014, article 1. A modern survey of extracting lines, text, and semantics from scanned maps, the same pipeline with contemporary methods. https://dl.acm.org/doi/10.1145/2557423
