Read the Header, Not the Curves: Why OCR Still Earns Its Keep

Look at a scanned well log the way a model has to, and you will notice it is not one image. It is two. There is the part that was drawn: a gamma-ray curve wandering down a track, a resistivity trace looping past its decade gridlines, a porosity pair crossing and recrossing each other in the third column. And there is the part that was typed: the track labels at the top, the depth numerals running down the margin, the units, the small printed legend naming what each curve measures. Both are ink on the same sheet, scanned at the same time, and most digitisation pipelines reach for the same hammer to recover both. That is the mistake this piece is about.

The two halves are different vision problems with different right answers, and the cheapest way to see why is to ask a single question of each thing on the page: how reliably can an ordinary text recognizer read it? Ask that, and the sheet sorts itself into two piles. The printed header is a recognition problem, and the venerable, unglamorous tool for recognition, optical character recognition, still wins it on cost and reliability. The plotted curves are a segmentation problem, and there OCR is helpless and a learned pixel model is the only honest tool. The skill is not picking one technique for the whole scan. It is routing each part of the scan to the technique that is actually good at it.

Two problems wearing one coat

The reason the header and the curves feel like one task is that they arrive together, as a single raster. But they differ on the axis that matters for tooling, which is the size of the set of things the model has to recognise.

The header is finite-vocabulary, printed text. A track is labelled "Track 1" or "GR" or "RESISTIVITY"; the depth axis is a column of numerals; the units are drawn from a tiny closed set, API and ohm.m and grams per cubic centimetre and feet. Every one of those is a glyph string from a small, near-fixed alphabet, set in type, with crisp edges and predictable layout. That is precisely the shape of problem optical character recognition was built for, and an off-the-shelf engine like Tesseract, with its line-finding, per-character classification, and dictionary-driven decoding, reads it with the kind of accuracy a hand-tuned segmenter would have to work hard to match [1].
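One way to see what a closed vocabulary buys is to snap a noisy OCR string onto the short list of things a header can say. The sketch below uses only Python's standard-library `difflib`; the `HEADER_VOCAB` list, the function name `snap_to_vocab`, and the 0.6 similarity cutoff are illustrative assumptions, not part of any OCR engine or of the pipeline described here.

```python
import difflib

# Hypothetical closed vocabulary for a log header. The tokens are examples
# drawn from this article's text, not a complete vendor list.
HEADER_VOCAB = [
    "GR", "SP", "CALI", "NPHI", "RHOB", "RESISTIVITY",
    "API", "ohm.m", "g/cm3", "ft",
    "Track 1", "Track 2", "Track 3",
]

def snap_to_vocab(raw, vocab=HEADER_VOCAB, cutoff=0.6):
    """Return the vocabulary entry closest to a noisy OCR string, or None.

    The cutoff is an assumed similarity threshold (0..1); strings that
    resemble nothing in the vocabulary are rejected rather than guessed.
    """
    lowered = {v.lower(): v for v in vocab}  # case-insensitive lookup table
    hits = difflib.get_close_matches(
        raw.strip().lower(), list(lowered), n=1, cutoff=cutoff
    )
    return lowered[hits[0]] if hits else None

if __name__ == "__main__":
    for noisy in ["RESISTIV1TY", "ohm,m", "~~~~"]:
        print(repr(noisy), "->", snap_to_vocab(noisy))
```

Very short tokens such as "GR" leave little room for error under a similarity ratio, so a real header reader would likely pair this with position on the page rather than rely on string distance alone.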

The curves are the opposite of finite. A plotted log trace is a free-form analogue line, hand-drawn or pen-plotted decades ago, with no vocabulary at all. It is not a character; it is a continuous signal rendered as pixels, and the thing you want back is the value at every depth. No recognizer has a class for "the specific squiggle this gamma-ray curve makes between 5,200 and 5,260 feet," because there is no such class. Reading that line is a dense, per-pixel labelling task, which is segmentation, and the modern answer to it is a learned encoder-decoder in the U-Net lineage that labels each pixel as curve or background and lets you trace the trace [3].
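To make "the value at every depth" concrete, here is a minimal sketch of the step that follows segmentation: turning a binary curve mask into one reading per pixel row. It assumes the segmenter has already isolated a single curve in a single track, that the track scale is linear (a resistivity track on a logarithmic scale would need a different mapping), and the names `trace_curve`, `left_value`, and `right_value` are hypothetical.

```python
def trace_curve(mask, left_value, right_value):
    """Turn a binary curve mask into one reading per pixel row.

    mask: list of rows, each a list of 0/1 flags (1 = curve ink), as a
          segmenter might emit for one curve in one track.
    left_value, right_value: the scale printed at the track's left and
          right edges. A linear scale is assumed.
    Rows with no ink yield None instead of a guessed value.
    """
    width = len(mask[0])
    if width < 2:
        raise ValueError("track must be at least two pixels wide")
    readings = []
    for row in mask:
        cols = [c for c, on in enumerate(row) if on]
        if not cols:
            readings.append(None)       # gap in the ink: leave it a gap
            continue
        centre = sum(cols) / len(cols)  # sub-pixel column centroid
        frac = centre / (width - 1)     # 0.0 at left edge, 1.0 at right edge
        readings.append(left_value + frac * (right_value - left_value))
    return readings

if __name__ == "__main__":
    toy_mask = [
        [1, 0, 0, 0, 0],
        [0, 0, 1, 0, 0],
        [0, 0, 0, 0, 0],
        [0, 0, 0, 1, 1],
    ]
    # Example scale only: a track running 0 to 150 units from left to right.
    print(trace_curve(toy_mask, 0.0, 150.0))
```

Nothing here is recognition: there is no class to pick, only ink positions to measure, which is why the mask has to come from a pixel model rather than a character classifier.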

So the coat is shared and the bodies underneath are not. One is a closed-set recognition task; the other is an open-ended regression-through-segmentation task. The instinct to run a single model over the whole image quietly forces the easy half to pay the price of the hard half.

Why OCR wins the header and loses the curves

It helps to be concrete about why the recognizer wins where it wins, because the same property that makes OCR strong on the header is exactly what makes it useless on the curves.

OCR's whole machinery assumes its input is symbols from a known alphabet arranged on lines. Tesseract finds text lines, slices them into character cells, classifies each cell against a learned font model, and then uses a language model to prefer real tokens over garbage [1]. Every step of that pipeline is leaning on the same assumption: there is a small, enumerable set of things this ink could be, and the job is to pick the right one. On a header that assumption is true, and the payoff is enormous. The recognizer is cheap to run, needs no labelled training data of your own, is robust to scan noise because the language model fills small gaps, and returns a typed string you can route on immediately: "Track 3", "NPHI", "ohm.m".

Point that same machinery at a plotted curve and every assumption breaks at once. There is no alphabet. There are no character cells. There is no language model that will tell you a gamma-ray excursion at 5,240 feet is more "plausible" than one at 5,260. The recognizer has nothing to recognise, and it will either return nothing or hallucinate symbols out of noise. The curve does not need a model that picks from a menu; it needs a model that draws, pixel by pixel, where the ink is. That is the segmenter's job, and it is a job that requires labelled examples, a learned feature extractor, and real training compute, none of which the header needs.

This is the inversion worth internalising. The header is cheap because it is constrained. The curves are expensive because they are not. Spending segmentation-grade effort on the header is waste, and spending OCR-grade effort on the curves is failure. The same scan rewards opposite tools.

A ladder, not a switch

In practice the decision is not a clean two-way switch, because a real header has a gradient of legibility. Crisp printed units sit at the very top of a recognizer's confidence. The depth numerals are nearly as easy. The measurement-column names are a touch harder because their vocabulary is larger and abbreviations vary by vendor. Track labels are easy when present, but they are absent on plenty of old sheets. And then there is a cliff, below which sit the plotted curves, where confidence falls off entirely.

The clearest way to think about that gradient is to borrow a real, public vocabulary for the header text. The FORCE 2020 release of Norwegian-Sea wells, widely taught through the tutorial slice of 118 wells, fixed a naming convention for its measurement channels, and that slice carries 22 electrical-measurement columns: the gamma-ray, calliper, spontaneous-potential, the shallow, medium, and deep resistivities, the neutron and density porosities, and so on, each a short printed token [5][6]. Those 22 column names are exactly the kind of finite, recognisable header text OCR eats for breakfast. They are also a useful public stand-in for the printed legend on any scanned log, because they enumerate the small alphabet of things a log header actually says.

The instrument below puts that gradient on a single axis. It ranks the field types of a scanned log by how reliably classical OCR reads each, from the printed units and the depth scale and those 22 measurement-column names down to the plotted curves themselves, and lets you drag a "read-as-OCR" cutoff up and down the ladder. Above the cutoff, route to cheap OCR; below it, hand the work to the heavy segmenter. Drag the cutoff too low and you can watch a plotted curve fall into the OCR bucket, where no recognizer can follow it. The plotted-curve rungs carry the real track taxonomy of the system we built for raster-log digitisation, which we call VeerNet: Track 1 and Track 2 each plot three curves (gamma-ray, spontaneous-potential, and calliper; shallow, medium, and deep resistivity), and Track 3 plots a porosity pair (neutron and density), so eight drawn curves in all that the segmenter must own no matter where you set the lever.

A confidence ladder ranking the field types on a single scanned log by how reliably classical OCR reads each one. Printed, finite-vocabulary text sits high: units and scale ticks, the depth scale, the 22 electrical-measurement column names from the public Xeek / FORCE 2020 tutorial set across 118 Norwegian-Sea wells, and the track labels. Hand-plotted analogue curves sit at the bottom, where no recognizer wins: the eight plotted curves of the engagement's 3-track taxonomy (Track 1 and Track 2 each carry three, GR/SP/CALI and shallow/medium/deep resistivity; Track 3 carries two, NPHI and RHOB). Drag the orange READ-AS-OCR cutoff up and down the ladder. Field types above it route to cheap OCR; below it, to the heavy pixel segmenter. The right panel tallies the split and flags a misroute when the cutoff drops low enough to push a drawn curve into the OCR bucket, where a recognizer cannot follow it. The 3+3+2 taxonomy and the 22-column, 118-well figures are sourced; the per-rung confidence values are an illustrative ordering, not measured OCR accuracies.

The point the ladder argues is that there is a correct place to put the cutoff, and it is not at either extreme. Set it so high that even the track labels go to the segmenter and you have thrown away the recognizer's cheapest, most reliable wins. Set it so low that a curve gets routed to OCR and the pipeline silently breaks on the part that mattered most. The header text clusters near the top, the drawn curves cluster at the bottom, and the honest cutoff lives in the gap between them.
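The routing rule the ladder describes fits in a few lines. The sketch below is a hypothetical rendering of it: the rung names follow the ladder, but the scores are placeholders that only preserve the ordering (they are not measured OCR accuracies), and `LADDER` and `route` are names invented for illustration.

```python
# Illustrative ordering only: the scores are placeholders, not measured
# OCR accuracies. Each rung is (field type, assumed score, kind).
LADDER = [
    ("units and scale ticks", 0.95, "text"),
    ("depth scale", 0.90, "text"),
    ("measurement-column names", 0.80, "text"),
    ("track labels", 0.70, "text"),
    ("Track 1 curves (GR/SP/CALI)", 0.05, "curve"),
    ("Track 2 curves (shallow/medium/deep resistivity)", 0.05, "curve"),
    ("Track 3 curves (NPHI/RHOB)", 0.05, "curve"),
]

def route(ladder, cutoff):
    """Split field types at the read-as-OCR cutoff and flag misroutes.

    Rungs scoring at or above the cutoff go to OCR; the rest go to the
    segmenter. A drawn curve that lands in the OCR bucket is a misroute.
    """
    plan = {"ocr": [], "segmenter": [], "misrouted": []}
    for name, score, kind in ladder:
        target = "ocr" if score >= cutoff else "segmenter"
        plan[target].append(name)
        if kind == "curve" and target == "ocr":
            plan["misrouted"].append(name)
    return plan

if __name__ == "__main__":
    for cutoff in (0.75, 0.5, 0.01):
        plan = route(LADDER, cutoff)
        print(cutoff, "ocr:", len(plan["ocr"]),
              "segmenter:", len(plan["segmenter"]),
              "misrouted:", plan["misrouted"])
```

With these placeholder scores, any cutoff between 0.05 and 0.70 gives the same split, which is the code's way of saying the honest cutoff lives in a wide gap rather than at a finely tuned point.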

What the header buys the segmenter

There is a second reason to read the header first that goes beyond saving effort: the header tells the segmenter what it is looking at.

A pixel segmenter, left to its own devices, sees an undifferentiated field of ink and has to guess how many distinct curves live in a track and which is which. The header collapses that ambiguity before the expensive model runs. Once OCR has read "Track 2: shallow, medium, deep resistivity", the segmenter knows that column is a three-curve problem, not a two- or four-curve one, and it knows the canonical identity to attach to each recovered trace. Once OCR has read the depth scale, the recovered curves can be anchored to true depth rather than to pixel rows. The cheap recognition step is not just cheaper than segmentation; it conditions the segmentation, turning an open guess into a constrained one.
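Both kinds of conditioning can be sketched briefly. Assuming OCR has returned a few depth numerals together with the pixel rows they sit on, a least-squares line maps any pixel row to true depth; and a lookup keyed on the track label tells the segmenter how many curves to expect. The function names, the tick positions, and the `EXPECTED_CURVES` table layout are illustrative assumptions; the curve lists themselves follow the 3+3+2 taxonomy above.

```python
def fit_depth_axis(ticks):
    """Least-squares fit of depth = slope * row + intercept.

    ticks: list of (pixel_row, depth) pairs, e.g. depth numerals read by
           OCR and the rows where they were found. A linear depth axis
           is assumed.
    """
    n = len(ticks)
    if n < 2:
        raise ValueError("need at least two depth ticks")
    mean_r = sum(r for r, _ in ticks) / n
    mean_d = sum(d for _, d in ticks) / n
    sxx = sum((r - mean_r) ** 2 for r, _ in ticks)
    if sxx == 0:
        raise ValueError("ticks must sit on distinct rows")
    sxy = sum((r - mean_r) * (d - mean_d) for r, d in ticks)
    slope = sxy / sxx
    return slope, mean_d - slope * mean_r

def row_to_depth(row, slope, intercept):
    """Convert a pixel row to depth using the fitted axis."""
    return slope * row + intercept

# Curves per track, following the article's taxonomy; the dict itself is
# a hypothetical way to hand that knowledge to a segmenter.
EXPECTED_CURVES = {
    "Track 1": ["GR", "SP", "CALI"],
    "Track 2": ["shallow resistivity", "medium resistivity", "deep resistivity"],
    "Track 3": ["NPHI", "RHOB"],
}

if __name__ == "__main__":
    # Made-up tick positions: depth labels in feet found at three pixel rows.
    slope, intercept = fit_depth_axis([(100, 5200.0), (400, 5220.0), (700, 5240.0)])
    print(row_to_depth(1000, slope, intercept))
    print(len(EXPECTED_CURVES["Track 2"]), "curves expected in Track 2")
```

Fitting across several ticks rather than interpolating between two means a single misread numeral shifts the axis only slightly, though a real pipeline would probably also reject ticks that fall far from the fitted line.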

This is the same composition principle that classical and learned methods have always used together in document-image work. Deterministic geometry handles the rigid, enumerable structure (the Hough transform finding and subtracting gridlines, morphology cleaning strokes) because for rigid structure a closed-form method beats a learned one on speed and reliability [2]. A whole line of well-log digitisation work, in fact, recovered curves from scanned parameter graphs with morphology alone and no learned component at all, which is a useful reminder that the classical toolkit is not a relic; it is the right answer for the parts of the page that are regular [4]. The learned segmenter then does the one thing no rule survives: tracing a free-form curve through faded ink, crossing tracks, and scanner noise. OCR belongs in that same family of deterministic-first steps. It is the recognizer you run on the part of the page that is, structurally, just text.

The reflex to build in

So the working habit is not "use deep learning for scanned logs" or "use OCR for scanned logs". It is to look at any scan and immediately ask which regions are finite-vocabulary printed text and which are free-form drawn signal, and to send each to the tool whose assumptions actually hold there. The header is text; recognise it. The curves are signal; segment them. The recognizer is the cheaper tool, and on the half of the page it was designed for it is also the more reliable one, which is a rare and welcome combination in applied vision. Reach for the segmenter only where the recognizer has nothing to recognise, because that is the only place its cost is justified, and on a well log that place is the curves and nothing else.

Key takeaways

A scanned well log is two vision problems on one sheet: the printed header (track labels, depth scale, units, measurement-column names) is finite-vocabulary text, and the plotted curves are free-form analogue signal. They have different right tools, and treating them as one problem makes the easy half pay the price of the hard half.
Classical OCR (e.g. Tesseract) wins the header because the header is constrained: a small alphabet, set in type, on lines. OCR is cheap, needs no labelled data of your own, is robust to scan noise via its language model, and returns a typed string you can route on. The same machinery is useless on a drawn curve, which has no alphabet to recognise.
The plotted curves are a dense per-pixel labelling task. A learned encoder-decoder in the U-Net lineage is the only honest tool there, and it is the expensive one: it needs labelled examples, a learned feature extractor, and real training compute that the header never requires.
The decision is a ladder, not a switch. Field types sort by OCR confidence, from crisp units and depth numerals and the 22 electrical-measurement column names of the public FORCE 2020 / Xeek 118-well slice down to the eight plotted curves of the 3-track taxonomy (GR/SP/CALI; shallow/medium/deep resistivity; NPHI/RHOB). The honest cutoff sits in the gap between the header text and the drawn curves.
Reading the header first does more than save effort: it conditions the segmenter. OCR'ing the legend tells the segmenter how many curves a track holds and their canonical identities, and OCR'ing the depth scale anchors recovered curves to true depth. Cheap recognition turns an open segmentation guess into a constrained one, the same way Hough geometry and morphology scaffold the learned step elsewhere in the pipeline.
