Case Study

Adding an OCR Header Reader to Auto-Detect Tracks and Depth Scales

OCR earns its place on the header, not the curves. We added a header reader to VeerNet that automatically resolves the 3-track, 2-curve taxonomy and the depth scale, so the segmenter knows what each track is before it runs a single inference.

Most of the attention in raster-log digitization goes to the curves. That is where the hard segmentation problem lives, and it is where we spent the bulk of our modelling effort with VeerNet, the network we built to read scanned well-logs. But a segmenter that does not know what it is looking at is guessing. Before it runs a single inference, it needs to know which track carries which curve and what depth each pixel maps to. That information is not in the curves. It is printed at the top of the log, in the header. So we added an OCR header reader to VeerNet, and it changed where the model starts: not at the first curve pixel, but at the field that tells it what every curve pixel means.

At a glance

Three numbers describe the header the reader resolves and the grid it anchors.

  • 3 + 3 + 2: curves per log across three tracks (GR/SP/CALI · resistivity triple · NPHI/RHOB).
  • 3: segmentation classes the header routes per track (background plus two curves; Track 3 is the 2-curve case).
  • 300: depth points the validation grid anchors on, set by the depth scale the header reads.

The header is the routing layer, not decoration

A printed well-log is a stack of tracks, each a vertical strip of the page carrying one or more curves against a depth axis. The curves themselves are anonymous ink. A gamma-ray trace and a resistivity trace are both thin wandering lines; nothing in the pixels of the curve says which is which. What disambiguates them is the header: the block of printed text and scale markings at the top of each track that names the curves, gives their units, and fixes the depth and value axes.

For a human interpreter this is automatic. You glance at the header, read "Gamma Ray, API," and from then on you know the leftmost trace in Track 1 is GR. For a segmentation model trained on rendered logs, that step does not happen unless you build it. The model can learn to separate ink from background and one trace from another, but it has no way to attach a name to a mask. It will hand you a per-pixel mask for "curve 1" and "curve 2" in a track and leave the identification to a downstream human, which puts the interpreter right back in the loop the digitization was meant to remove.

The existing manual tooling makes the cost of skipping this step explicit. The standard commercial workflow, NeuraLog, requires the user to calibrate every raster by hand before any tracing happens: draw a rectangle around the header and scale area, place a set of points down the depth track, set the left and right axis values for each curve, and declare whether each scale is linear or logarithmic. That is the header being read, manually, on every single log. It is slow, and it does not scale across an archive of tens of thousands of scans. The header is not optional context. It is the routing layer, and someone or something has to read it.
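To make concrete what that calibration encodes, here is a minimal sketch of the value-axis mapping implied by "left and right axis values" plus a linear-or-logarithmic flag. The function name and parameters are illustrative assumptions, not NeuraLog's or VeerNet's API.

```python
def column_to_value(column, left_px, right_px, left_value, right_value, scale="linear"):
    """Map a pixel column inside a track to a curve value (hypothetical helper)."""
    if right_px == left_px:
        raise ValueError("track has zero width")
    # Fractional position across the track: 0.0 at the left edge, 1.0 at the right.
    t = (column - left_px) / (right_px - left_px)
    if scale == "linear":
        return left_value + t * (right_value - left_value)
    if scale == "log":
        if left_value <= 0 or right_value <= 0:
            raise ValueError("logarithmic axes need positive end values")
        # Equal pixel steps correspond to equal ratios on a logarithmic axis.
        return left_value * (right_value / left_value) ** t
    raise ValueError(f"unknown scale: {scale!r}")
```

A trace sitting halfway across a linear 0 to 150 track reads 75; halfway across a logarithmic 0.2 to 2000 track it reads about 20, which is why the scale type has to be declared rather than guessed.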

Why OCR belongs on the header, not the curves

There is a tempting wrong turn here, which is to point OCR at the whole problem. Optical character recognition is built to turn printed glyphs into text, and a well-log page is covered in text: header labels, depth ticks, value-axis numbers, handwritten annotations. It is reasonable to ask whether OCR could read the curves too, or at least the scale numbers along a curve. It cannot read the curves, because a curve is not text, and pointing a character recognizer at a wandering analog trace produces nothing usable. The curves are a segmentation problem and stay one.

But the header is text. It is exactly what OCR is good at: short printed strings in a constrained vocabulary, laid out in a predictable region of the page. Modern OCR engines are mature and robust at this [1], and the broader line of document-understanding work has shown that jointly modelling text content and its layout position is what makes structured reading of scanned documents reliable [2]. A well-log header is a small, well-behaved instance of that problem. The labels come from a fixed petrophysical vocabulary (gamma ray, spontaneous potential, caliper, resistivity, neutron porosity, bulk density), the units are standard, and the spatial arrangement into tracks is consistent enough to exploit.

So we drew a clean line. OCR earns its place on the header, where the signal is genuinely text. Segmentation owns the curves, where the signal is genuinely ink. Splitting the page this way means each component does the job it is actually suited to, and the header reader feeds the segmenter the one thing the segmenter cannot recover on its own: identity.

This is a different approach from the classical gridline-elimination line of work, which treats the whole graph as an image-processing target and strips the grid to isolate traces [3]. That approach never reads the header at all; it produces curves with no names and no axis calibration, which is why it breaks on exactly the multi-track, multi-curve scans a real archive is full of. Reading the header first is what lets the rest of the pipeline stay honest about what each output curve actually is.

What the reader resolves: the 3-track, 2-curve taxonomy

The header reader's job is to take the printed labels in each track and resolve them into the canonical curve taxonomy the segmenter is trained against. For the logs in this engagement that taxonomy is fixed and small, which is what makes the routing tractable.

  • Track 1 carries three curves: GR, SP, and CALI. Gamma ray, spontaneous potential, and caliper. This is the lithology-and-borehole-shape track, and the reader has to separate three labels that share a single strip of page.
  • Track 2 carries three curves: shallow, medium, and deep resistivity. The resistivity triple. Here the labels are near-identical strings differing only by depth-of-investigation, so the OCR output has to be parsed for the qualifier (shallow, medium, deep), not just the base word.
  • Track 3 carries two curves: NPHI and RHOB. Neutron porosity and bulk density. This is the porosity pair, and it is the case where a track holds two curves rather than three, which the reader has to detect rather than assume.

That last point is the reason header parsing cannot be hard-coded. If the pipeline assumed every track held three curves, it would hallucinate a third curve in Track 3 and force the segmenter to find ink that is not there. The reader has to count the curves a header actually declares, per track, and pass that count downstream. The taxonomy is regular but not uniform: three, three, two.
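The per-track resolution and counting described above can be sketched as follows. This is a minimal illustration, not VeerNet's implementation: the alias patterns, the `RES_*` names for the resistivity triple, and the function names are all assumptions, and the input is taken to be label strings an OCR engine has already returned.

```python
import re

# Illustrative alias patterns (assumed); a production vocabulary would be larger.
_ALIASES = [
    (re.compile(r"^(GAMMA RAY|GR)$"), "GR"),
    (re.compile(r"^(SPONTANEOUS POTENTIAL|SP)$"), "SP"),
    (re.compile(r"^(CALIPER|CALI)$"), "CALI"),
    (re.compile(r"^(NEUTRON POROSITY|NEUT POR|NPHI)$"), "NPHI"),
    (re.compile(r"^(BULK DENSITY|BULK DENS|RHOB)$"), "RHOB"),
]
# Resistivity labels share a base word; only the qualifier tells them apart.
_RES_QUALIFIER = re.compile(r"^RES\w*\s+(SHAL|MED|DEEP)\w*$")
_RES_NAMES = {"SHAL": "RES_SHALLOW", "MED": "RES_MEDIUM", "DEEP": "RES_DEEP"}


def normalize(raw):
    """Uppercase, drop punctuation, collapse whitespace: 'S.P.' becomes 'SP'."""
    text = re.sub(r"[^A-Z0-9 ]", "", raw.upper())
    return re.sub(r"\s+", " ", text).strip()


def resolve_label(raw):
    """Return the canonical curve name for one OCR string, or None if unknown."""
    text = normalize(raw)
    match = _RES_QUALIFIER.match(text)
    if match:
        return _RES_NAMES[match.group(1)]
    for pattern, name in _ALIASES:
        if pattern.match(text):
            return name
    return None


def route_tracks(header_fields):
    """Resolve {track: [raw labels]} into (resolved, unresolved) dictionaries."""
    resolved, unresolved = {}, {}
    for track, labels in header_fields.items():
        resolved[track] = []
        for label in labels:
            name = resolve_label(label)
            if name is None:
                # Outside the fixed taxonomy: keep the raw text for human review.
                unresolved.setdefault(track, []).append(label)
            else:
                resolved[track].append(name)
    return resolved, unresolved
```

The curve count per track is simply `len(resolved[track])`, so it is read off the header rather than assumed. Labels that match nothing are kept aside rather than dropped, which leaves out-of-vocabulary headers visible for review.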

[Interactive figure: "OCR header reader · routes before it segments". Schematic log header with a depth gutter labelled DEPTH (FT), ticks from 5000 to 5500, and three tracks: Track 1 (GAMMA RAY, S.P., CALIPER), Track 2 (RES SHAL, RES MED, RES DEEP), and Track 3 (NEUT POR, BULK DENS).]
The OCR header reader resolves each detected header field into VeerNet's real track taxonomy before the segmenter runs: Track 1 and Track 2 each carry three curves (GR/SP/CALI and shallow/medium/deep resistivity), Track 3 carries two (NPHI, RHOB), and the depth scale anchors the 300-point validation grid. Hover or click a field to see the OCR overlay resolve raw glyphs into the canonical curve label and light up the schematic track; toggle routing to see the per-track class budget the header hands the segmenter. The 3+3+2 curve-per-track taxonomy, the curve set, and the 300 interpolated validation points are the engagement's own; the header glyphs and tick spacing are schematic.

The interactive exhibit above is the reader's job in one frame. Each detected header field resolves from the raw glyphs OCR reads off the page into the canonical curve label the router emits, and the schematic track lights up to show where that curve lives. The depth gutter on the left is the one field that is not a curve at all, and it is the field everything else hangs from.

The depth scale anchors the validation grid

Naming the curves is half the job. The other half is depth. A digitized curve is worthless as data unless every point on it carries a real depth, and depth comes from the scale printed down the side of the log, not from the pixels of the curve. The header reader resolves the depth scale the same way it resolves the curve labels: it reads the printed depth markings, establishes the mapping from pixel-row to measured depth, and fixes whether the increment is regular.
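A minimal sketch of that pixel-row-to-depth mapping, assuming the OCR step has already produced pairs of (pixel row, printed depth) for the depth ticks. The least-squares fit, the 1% regularity tolerance, and the function names are illustrative assumptions rather than the reader's actual method.

```python
def fit_depth_scale(tick_rows, tick_depths, tolerance=0.01):
    """Fit depth = slope * row + intercept through the ticks read off the page.

    Returns (slope, intercept, is_regular). is_regular is True when every
    consecutive tick pair implies the same depth-per-pixel within `tolerance`.
    """
    n = len(tick_rows)
    if n != len(tick_depths) or n < 2:
        raise ValueError("need at least two (row, depth) tick pairs")
    if len(set(tick_rows)) != n:
        raise ValueError("tick rows must be distinct")
    mean_row = sum(tick_rows) / n
    mean_depth = sum(tick_depths) / n
    var = sum((r - mean_row) ** 2 for r in tick_rows)
    cov = sum((r - mean_row) * (d - mean_depth)
              for r, d in zip(tick_rows, tick_depths))
    slope = cov / var
    intercept = mean_depth - slope * mean_row
    # Depth per pixel row between each pair of neighbouring ticks.
    pairs = sorted(zip(tick_rows, tick_depths))
    steps = [(d2 - d1) / (r2 - r1)
             for (r1, d1), (r2, d2) in zip(pairs, pairs[1:])]
    is_regular = all(abs(s - slope) <= tolerance * abs(slope) for s in steps)
    return slope, intercept, is_regular


def row_to_depth(row, slope, intercept):
    """Convert a pixel row to measured depth using the fitted scale."""
    return slope * row + intercept
```

With ticks at rows 100, 300, 500, 700 reading 5000, 5100, 5200, 5300 ft, the fit gives 0.5 ft per pixel row and flags the increment as regular. A misread tick shows up as an irregular scale instead of quietly shifting every depth.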

This is the field that turns a segmentation mask into a log. Once the depth scale is known, a per-pixel curve mask becomes a depth-indexed series: at this depth, the curve reads this value. Without it, the segmenter's output is a shape in image coordinates with no physical meaning. The depth scale is also what makes validation possible. When we check a digitized curve against its LAS ground truth, we resample both onto a common depth grid and compare value-by-value; in the validation notebooks that grid is 300 interpolated depth points across the interval. Those 300 points exist in depth, not in pixels, and they only exist because the header reader recovered the depth scale that maps one to the other.
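The resample-and-compare step can be sketched as below. Only the 300-point common depth grid comes from the text; the mean-absolute-error metric, the use of the overlapping interval, and all function names are assumptions made for illustration.

```python
from bisect import bisect_right


def depth_grid(top, bottom, n_points=300):
    """Evenly spaced depths from top to bottom inclusive."""
    return [top + (bottom - top) * i / (n_points - 1) for i in range(n_points)]


def interp(x, xs, ys):
    """Linear interpolation on strictly increasing xs (at least two points).

    Values outside [xs[0], xs[-1]] take the end value, as numpy.interp does.
    """
    if x <= xs[0]:
        return ys[0]
    if x >= xs[-1]:
        return ys[-1]
    i = bisect_right(xs, x)
    x0, x1 = xs[i - 1], xs[i]
    t = (x - x0) / (x1 - x0)
    return ys[i - 1] + t * (ys[i] - ys[i - 1])


def compare_on_grid(digitized, truth, n_points=300):
    """Mean absolute difference of two (depths, values) curves on a common grid."""
    top = max(digitized[0][0], truth[0][0])
    bottom = min(digitized[0][-1], truth[0][-1])
    if bottom <= top:
        raise ValueError("curves share no depth interval")
    grid = depth_grid(top, bottom, n_points)
    diffs = [abs(interp(z, *digitized) - interp(z, *truth)) for z in grid]
    return sum(diffs) / n_points
```

Both curves are indexed by depth before they ever meet, which is why a wrong depth scale corrupts the score as well as the output: the digitized series would be resampled at the wrong places.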

So the depth scale is doing double duty. It anchors every digitized value to a real measured depth at inference time, and it anchors the 300-point grid we score the model on at validation time. Get the depth scale wrong and both the output and the metric drift together, which is the worst kind of failure because it hides itself. That is why the depth field is treated as the load-bearing one: it is read first, and the rest of the routing is built on top of it.

How the header changes what the segmenter does

With the header read, the segmenter starts from a different place. Instead of being handed a track and asked to find some unknown number of anonymous curves, it is told: this is Track 1, expect three curves named GR, SP, CALI; this is Track 3, expect two curves named NPHI, RHOB; the depth scale is this. The multiclass segmentation problem we run is a background-plus-two-curves, three-class task per track, and the header is what sets that class budget correctly track by track rather than guessing it from the ink.

That changes the failure modes. A segmenter running without header routing has to infer curve count from the image, which means it can split one curve into two or merge two into one, and it has no name to attach either way. A segmenter running with header routing knows the count and the identity up front, so a mask that does not match the declared taxonomy is a flagged disagreement, not a silent mislabel. The header turns an open-ended "what is in this track" into a checked "find the curves the header says are here." It does not make the curve segmentation easier in the pixels; the thin, faded, overlapping traces are still hard. It makes the segmentation answerable, because the question is now well-posed.
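One way to picture the "flagged disagreement" is a per-track comparison between the curves the header declares and the number of curve masks the segmenter returns. The sketch below is hypothetical: `TrackCheck`, `check_against_header`, and the wording of the notes are illustrative, not part of VeerNet.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrackCheck:
    track: int
    declared: int  # curves the header names for this track
    found: int     # distinct curve masks the segmenter produced

    @property
    def agrees(self):
        return self.declared == self.found

    @property
    def note(self):
        if self.found > self.declared:
            return "more masks than declared curves: possible split trace"
        if self.found < self.declared:
            return "fewer masks than declared curves: possible merged or missed trace"
        return "matches header"


def check_against_header(declared_curves, mask_counts):
    """Compare {track: [curve names]} from the header with {track: mask count}."""
    return [
        TrackCheck(track, len(names), mask_counts.get(track, 0))
        for track, names in sorted(declared_curves.items())
    ]
```

A Track 3 result with three masks against a header that declares NPHI and RHOB is reported as a disagreement instead of being passed downstream as an unnamed third curve.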

Why the header reader earns its place

  1. OCR belongs on the header, not the curves: header labels are short printed strings in a fixed petrophysical vocabulary, which is exactly what OCR handles, while the curves stay a segmentation problem because a trace is ink, not text.
  2. The reader resolves the real 3-track, 2-curve taxonomy: Track 1 and Track 2 carry three curves each (GR/SP/CALI and shallow/medium/deep resistivity) and Track 3 carries two (NPHI, RHOB), so the curve count is detected per track rather than assumed.
  3. The depth scale is the load-bearing field: it turns a per-pixel mask into a depth-indexed series at inference and anchors the 300-point grid the model is validated on, so reading it first is what makes both the output and the metric mean something.

What reading the header first made possible

The header reader is a small component next to the segmentation network, but it sits at the point where the whole pipeline either stays automatic or falls back to a human. Without it, every digitized curve arrives anonymous and uncalibrated, and an interpreter has to name the curves and set the depth axis by hand, which is the NeuraLog calibration step we were trying to remove. With it, the curves arrive named and depth-anchored, and the interpreter's time moves to where it is actually worth spending: judging whether a hard, faded trace was read correctly, not re-typing what the header already said.

It also makes the segmenter's job well-posed before it starts, which is the quieter win. Routing the taxonomy and depth scale ahead of segmentation means the model is solving a defined problem (find these specific curves on this specific depth grid) rather than an open one. That is what lets the per-track metrics be read honestly, because a mismatch between the segmenter's output and the header's declared taxonomy is now a real, checkable disagreement.

The honest limit is that the header reader is only as good as the header it is given. A scan where the header itself is torn, stamped over, or missing pushes the problem back onto defaults and human review, and a curve vocabulary outside the fixed taxonomy needs the routing extended before the reader can resolve it. The reader does not solve digitization. It does the one thing that has to happen before digitization can be trusted: it tells the model what it is looking at.

References

  1. Smith, R. (2007). An Overview of the Tesseract OCR Engine. Ninth International Conference on Document Analysis and Recognition (ICDAR 2007), Curitiba, 629-633. https://www.semanticscholar.org/paper/An-Overview-of-the-Tesseract-OCR-Engine-Smith/89d9aae7e0c8b6edd56d0d79b277c07b7ab66fda

  2. Xu, Y. et al. (2020). LayoutLM: Pre-training of Text and Layout for Document Image Understanding. KDD 2020. https://arxiv.org/abs/1912.13318

  3. Yuan, B. and Yang, Q. (2019). Digitization of Well-Logging Parameter Graphs Based on Gridlines-Elimination Approach. Journal of Petroleum Exploration and Production Technology. http://www.jsoftware.us/show-409-JSW15423.html

© 2026 Copyright. Earthscan