Optical Music Recognition as a Raster-to-Symbol Problem

Set a scanned page of sheet music beside a scanned paper well log and squint until the subject matter blurs. What is left is the same picture twice: a set of straight horizontal rules printed as a reference frame, and threaded through them a much thinner, wandering foreground that is the thing you actually want. On the score the rules are the staff and the foreground is the notes. On the log the rules are the track borders and depth grid and the foreground is the logging curve. In both cases a computer that reads the raw raster sees dark strokes on light paper and nothing more, and in both cases the hard part is not classifying a big obvious object but recovering a hair-thin one that shares its ink with the furniture around it. This is why the field that taught machines to read music has more to say about reading well logs than its subject matter suggests, and why we treat optical music recognition as a sibling discipline rather than a curiosity.

The point of this survey is not to retell what we built. It is to place a well-log curve extractor inside a much older lineage and to show that the lineage already diagnosed the problems we hit. Optical music recognition has been a named research problem for decades, with its own benchmarks and a pipeline the community converged on well before segmentation networks were common [1]. That pipeline, read stage by stage, is a curve extractor with different labels on the boxes.

The shared skeleton of a raster-to-symbol pipeline

The canonical optical-music-recognition pipeline has four stages, and the survey literature states them plainly: preprocess the image, detect and handle the staff, recognise the musical symbols, and reconstruct the semantics [1]. Strip the domain-specific nouns and you have the general recipe for turning a ruled raster into structured symbols: clean the page, deal with the ruling, find and label the thin marks that carry meaning, then assemble those marks into something a downstream system can consume, a note sequence in one case and a depth-indexed curve in the other.

Curve extraction from scanned logs walks the identical four stages. We clean the scan, we contend with the printed grid, we segment the curve pixels away from the background, and we reconstruct a depth-value series a petrophysicist or a database can use. The correspondence is not loose analogy; it is stage-for-stage. The exhibit below lays the two pipelines out as mirror columns so the structural claim is visible rather than merely asserted, and it hangs our own measured numbers on the stage where they were recorded.
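The last of those four stages, reconstruction, can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the extractor's actual code: the function name `mask_to_series`, the calibration parameters, and the linear track scale are all hypothetical, and the mask is assumed to hold a single curve with one row per depth sample.

```python
from typing import List, Optional, Tuple


def mask_to_series(
    mask: List[List[int]],
    depth_top: float,
    depth_step: float,
    value_left: float,
    value_right: float,
) -> List[Tuple[float, Optional[float]]]:
    """Convert a binary curve mask into (depth, value) pairs.

    Rows are depth samples and columns span the track from left to right.
    Assumes a linear track scale and a mask at least two pixels wide.
    """
    width = len(mask[0])
    series: List[Tuple[float, Optional[float]]] = []
    for row_index, row in enumerate(mask):
        depth = depth_top + row_index * depth_step
        columns = [x for x, pixel in enumerate(row) if pixel]
        if not columns:
            # No curve pixels on this row: record a gap rather than guess.
            series.append((depth, None))
            continue
        # Use the mean column of the curve pixels as the curve position.
        centre = sum(columns) / len(columns)
        fraction = centre / (width - 1)
        series.append((depth, value_left + fraction * (value_right - value_left)))
    return series
```

Recording a gap as `None` instead of interpolating keeps the segmentation's misses visible to whatever consumes the series.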

Optical music recognition and well-log digitisation read as one problem: dense thin symbols drawn on structured, ruled paper. The two columns are the two domains, staff lines with note glyphs on the left and track rulings with logging curves on the right, and the rows are the shared pipeline stages, laid out identically because they are identical: a ruled substrate, a dense thin foreground, the removal of the ruling, and the tracing and class assignment of what remains. Drag the orange transfer beam down the shared stack and each optical-music stage lands on its well-log twin. When the beam reaches the class-assignment row it reads out the one panel of sourced numbers from our own well-log run: a 3-class segmentation over 15000 synthetic instances, with the thin foreground about 3% of the pixels, reaching a peak intersection over union of 0.51, a peak F1 of 0.55, and a peak recall of 0.97. Those metrics are sourced from the engagement archive; the two column sketches of glyphs and curves are illustrative renderings of the shared structure, not measured output.

The ruling is furniture, and both fields learned to remove it

The stage where the two disciplines rhyme most exactly is the treatment of the ruled substrate. In optical music recognition, staff-line detection and removal is not a side note; it is a first-class subproblem with its own ground-truth datasets and comparative benchmarks. The CVC-MUSCIMA corpus exists specifically to score staff removal as a standalone task, with pixel-level labels for what is staff and what is not [2], and a full comparative study of staff-removal algorithms exists because removing the lines cleanly changes what the recogniser downstream can do [3]. The community made the ruling a problem in its own right after discovering, the hard way, that a symbol recogniser trained on pages that still carried the staff spent its capacity re-learning to ignore lines it should never have been shown.

Well-log digitisation reaches the same conclusion from the other direction. The printed graticule of track borders and depth grid is, at the pixel level, indistinguishable from the curve: both are dark ink on light paper. A segmenter fed a scan that still carries the grid will confidently trace the grid, because a long straight dark stroke is easier to fit than a faint wandering one. What is genuinely shared here is not a specific algorithm but a design conviction: on a ruled raster, the ruling is the first thing you subtract, and treating it as furniture rather than signal is the decision that makes everything after it tractable.
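To make "subtract the ruling" concrete, here is a toy sketch of one simple heuristic: find rows that are almost entirely ink, then erase each ruling pixel unless ink continues directly above or below it, which is taken as a sign that the foreground crosses the rule there. It is not one of the algorithms benchmarked in [3] and not the method used on our logs. It assumes a binarised image with perfectly horizontal rules, and the name `remove_horizontal_ruling` and the `min_fill` parameter are invented for the example.

```python
from typing import List

Image = List[List[int]]  # 1 = ink, 0 = paper


def remove_horizontal_ruling(image: Image, min_fill: float = 0.8) -> Image:
    """Erase near-full-width horizontal rules, keeping pixels where the
    foreground appears to cross the rule."""
    height, width = len(image), len(image[0])
    # A row counts as ruling when at least min_fill of its pixels are ink.
    is_rule = [sum(row) >= min_fill * width for row in image]
    cleaned = [row[:] for row in image]
    y = 0
    while y < height:
        if not is_rule[y]:
            y += 1
            continue
        top = y
        while y < height and is_rule[y]:
            y += 1
        bottom = y - 1  # the rule occupies rows top..bottom inclusive
        for x in range(width):
            ink_above = top > 0 and image[top - 1][x] == 1
            ink_below = bottom < height - 1 and image[bottom + 1][x] == 1
            if not (ink_above or ink_below):
                for r in range(top, bottom + 1):
                    cleaned[r][x] = 0
    return cleaned
```

A curve that runs along a rule for some distance defeats this test, which is one reason both fields ended up treating ruling removal as a problem in its own right.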

Once it is a per-pixel problem, both fields converge

The other convergence is in formulation. When optical music recognition moved from hand-crafted heuristics to learning, it recast music-document analysis as pixelwise classification, assigning every pixel to a class such as staff, symbol, or background [4]. That is exactly semantic segmentation, the fully-convolutional dense-prediction formulation that treats an image as a grid of per-pixel labels [5], typically served by an encoder-decoder with skip connections so that the thin structures a downsampling network would blur back into the background survive to the output [6]. Our curve extractor is the same object: a per-pixel classifier that assigns each pixel to background or to one of the logging curves, built on an encoder-decoder for exactly the skip-connection reason.
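The per-pixel formulation itself is small enough to show directly. The sketch below assumes a network has already produced one score per class per pixel and simply takes the highest-scoring class at each location. The class ids and the `scores[class][row][column]` layout are assumptions for the example, and the network that would produce the scores is out of scope.

```python
from typing import List

# Hypothetical class ids for a 3-class log segmentation.
BACKGROUND, CURVE_A, CURVE_B = 0, 1, 2


def label_map(scores: List[List[List[float]]]) -> List[List[int]]:
    """Assign each pixel the class with the highest score.

    scores[c][y][x] is the score of class c at pixel (y, x).
    """
    n_classes = len(scores)
    height, width = len(scores[0]), len(scores[0][0])
    return [
        [max(range(n_classes), key=lambda c: scores[c][y][x]) for x in range(width)]
        for y in range(height)
    ]
```

The same function would serve a staff, symbol, and background labelling unchanged, which is the sense in which the two formulations are one.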

The moment both problems are stated as per-pixel segmentation, they inherit the same pathology, and this is the heart of the parallel. The foreground is tiny. A note stem is a few pixels wide against a large white page; a logging curve is a hair against a whole track. On our own well-log run the thin foreground is about 3% of the pixels, which means a model can reach about 97% pixel accuracy by predicting background everywhere and learning nothing. This is why the thin-symbol segmentation literature reaches for overlap-based objectives such as the Dice loss, which scores the intersection of prediction and truth on the foreground directly rather than counting correct background pixels [7]. Optical music recognition and log digitisation both live or die on the same small-foreground class imbalance, and both are correctly scored not by accuracy but by overlap and by recall.
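The arithmetic behind that claim is easy to check on a toy case. The sketch below builds a 100-pixel image with 3 foreground pixels, matching the roughly 3% figure, and compares pixel accuracy with the Dice coefficient for two invented predictions: one that predicts background everywhere, and one that finds every foreground pixel at the cost of three false positives. The helper names and the pixel layouts are illustrative only.

```python
from typing import List


def pixel_accuracy(pred: List[int], truth: List[int]) -> float:
    """Fraction of pixels whose predicted label matches the truth."""
    return sum(1 for p, t in zip(pred, truth) if p == t) / len(truth)


def dice(pred: List[int], truth: List[int]) -> float:
    """Dice coefficient on the foreground class: 2|P ∩ T| / (|P| + |T|)."""
    intersection = sum(1 for p, t in zip(pred, truth) if p == 1 and t == 1)
    total = sum(pred) + sum(truth)
    # Two empty masks agree perfectly by convention.
    return 2 * intersection / total if total else 1.0


truth = [1] * 3 + [0] * 97          # 3% foreground
all_background = [0] * 100          # learns nothing
over_marking = [1] * 6 + [0] * 94   # finds all 3, adds 3 false positives

for name, pred in [("all background", all_background), ("over-marking", over_marking)]:
    print(name, pixel_accuracy(pred, truth), round(dice(pred, truth), 3))
```

Both predictions score 0.97 on accuracy, yet the first has a Dice of 0 and the second about 0.667. Accuracy cannot tell a model that found nothing from one that found everything, and the overlap score can.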

What our numbers say, and why the shape matters

The numbers we can put on the table come from an adjacent well-log run, and their shape is the argument. On a 3-class segmentation, background and two curves, trained over 15000 synthetic instances, the multiclass model reaches a peak intersection over union of 0.51, a peak F1 of 0.55, and a peak recall of 0.97. Read those three together rather than one at a time. Recall of 0.97 says the model finds almost all of the true foreground pixels; it rarely misses a stroke. F1 of 0.55 and intersection over union of 0.51 say that it pays for that completeness with false positives, pixels it claims as curve that are not curve, which is precisely what happens when a thin foreground sits on a busy ruled page and the model would rather over-mark than drop a stroke.
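For readers who want the definitions behind that reading, the sketch below computes precision, recall, F1, and intersection over union from raw foreground pixel counts. The counts are invented to show how near-complete recall combined with heavy over-marking yields only a moderate F1; they are not taken from the run. The run reports peak values, which need not all come from one checkpoint or one confusion matrix, so the toy counts are not meant to reproduce the three reported figures at once.

```python
from typing import Dict


def overlap_metrics(tp: int, fp: int, fn: int) -> Dict[str, float]:
    """Foreground metrics from true-positive, false-positive and
    false-negative pixel counts."""
    return {
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
        "f1": 2 * tp / (2 * tp + fp + fn),
        "iou": tp / (tp + fp + fn),
    }


# Invented counts: 100 true foreground pixels, 97 found, 150 spurious marks.
m = overlap_metrics(tp=97, fp=150, fn=3)
for name, value in m.items():
    print(f"{name}: {value:.3f}")
```

On a single confusion matrix the two overlap scores are tied by F1 = 2 * IoU / (1 + IoU), so they move together. Precision is the quantity that separates a recall of 0.97 from an F1 in the mid 0.5s.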

That recall-heavy, precision-light signature is not a well-log peculiarity. It is the characteristic fingerprint of the whole raster-to-symbol family, and optical music recognition documented it long before we measured it on logs. A staff-line remover that leaves a few stray line fragments is annoying; one that erases a note stem is catastrophic, so the field learned to bias toward keeping foreground and cleaning up afterward. A symbol detector on a dense score faces the same asymmetry: missing a symbol loses information that cannot be recovered downstream, while a spurious mark can often be filtered by the reconstruction stage that knows what a legal score looks like [8]. The completeness-first bias that produces high recall and moderate overlap is the rational response to a raster-to-symbol task, and seeing the same numeric shape in music and in logs is the strongest evidence that they are one problem.
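One common way a completeness-first bias shows up in practice is a lowered decision threshold on the foreground probability. The source does not say how the bias arose in our run (threshold, loss weighting, or something else), so the sketch below is only an illustration of the mechanism, with invented probabilities and a hypothetical helper name.

```python
from typing import List, Tuple


def precision_recall_at(
    probs: List[float], truth: List[int], threshold: float
) -> Tuple[float, float]:
    """Precision and recall when pixels with probability >= threshold
    are marked as foreground."""
    pred = [1 if p >= threshold else 0 for p in probs]
    tp = sum(1 for p, t in zip(pred, truth) if p == 1 and t == 1)
    fp = sum(1 for p, t in zip(pred, truth) if p == 1 and t == 0)
    fn = sum(1 for p, t in zip(pred, truth) if p == 0 and t == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Invented foreground probabilities for eight pixels, three of them true curve.
probs = [0.9, 0.8, 0.6, 0.4, 0.35, 0.2, 0.1, 0.05]
truth = [1, 1, 0, 1, 0, 0, 0, 0]

for threshold in (0.5, 0.3):
    print(threshold, precision_recall_at(probs, truth, threshold))
```

Dropping the threshold from 0.5 to 0.3 recovers the faint true pixel and lifts recall from two thirds to 1.0, while precision falls from two thirds to 0.6. That is the trade the paragraph above describes.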

What actually transfers, and what does not

Being honest about a cross-domain parallel means marking its edges. What transfers cleanly is the front of the pipeline: the treatment of the ruling as a removable substrate, the per-pixel segmentation formulation, the overlap-and-recall scoring discipline, and the completeness-first bias the small foreground forces on both fields. Those are structural, and they are why the optical-music-recognition literature is a legitimate source of engineering priors for a curve extractor rather than a loose metaphor.

What does not transfer is the reconstruction stage. Music has a hard grammar: a recognised page must resolve to notes on a scale, durations that sum, and clefs that constrain pitch, and that grammar lets the reconstruction stage reject impossibilities the pixel classifier proposed [8]. A well log has far weaker downstream constraints, since a curve can take almost any physically plausible value at any depth, so a log digitiser cannot lean on syntax to clean up its segmentation the way a music system can. That asymmetry is exactly why the precision-side numbers in our run are only moderate and cannot be rescued for free: the domain that borrowed the front of the pipeline did not inherit the back of it. The lesson is not that music recognition solves log digitisation, but that the first three stages are shared enough that ignoring the older field leaves decades of diagnosed failure modes on the table.
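The asymmetry can be made concrete with two toy checks. The first tests one piece of musical grammar, that the durations in a bar sum to the time signature, which is enough to flag a bar containing a spurious or missing symbol. The second is about the strongest generic check a log value admits, a plausibility range. Both functions and their parameters are hypothetical, and real music reconstruction handles much more (voices, ties, tuplets, pickup bars).

```python
from fractions import Fraction
from typing import List


def bar_is_legal(durations: List[Fraction], beats: int, beat_unit: int) -> bool:
    """True when the note and rest durations fill the bar exactly.

    Durations are fractions of a whole note, so a quarter note is 1/4.
    """
    return sum(durations, Fraction(0)) == Fraction(beats, beat_unit)


def value_is_plausible(value: float, low: float, high: float) -> bool:
    """A range check: it rejects outliers but says nothing about whether
    a value inside the range was traced from the curve or from the grid."""
    return low <= value <= high


quarter, half = Fraction(1, 4), Fraction(1, 2)
print(bar_is_legal([quarter, quarter, half], 4, 4))     # bar is full
print(bar_is_legal([quarter, quarter, quarter], 4, 4))  # a symbol is missing
```

The bar check fails on a single wrong symbol. The range check passes any false positive that happens to land inside the track, which is most of them.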

Limitations

This is a survey with a worked example, not a benchmark. It reads the optical-music-recognition literature for its structure and its diagnosed failure modes and argues, on that basis, that curve extraction from scanned logs belongs to the same raster-to-symbol family; it does not re-run any music-recognition system or evaluate a model on a music corpus. The numbers used to ground the parallel, a peak intersection over union of 0.51, a peak F1 of 0.55, a peak recall of 0.97, three classes, a thin foreground of about 3% of pixels, and 15000 synthetic instances, are the real metrics of one multiclass well-log run from an adjacent engagement, used as an illustration of the family's numeric signature rather than as a comparison against any published music-recognition result. Because we did not evaluate the two domains on a common corpus, the claim that they share a pathology is an argument from shared formulation and matching metric shape, not a measured head-to-head, and a reader who needs that head-to-head should treat it as an open experiment. The two column sketches in the exhibit are illustrative renderings of the shared structure; only the metric panel carries measured numbers, and it is flagged as such on the canvas.

References

[1] Rebelo, A., Fujinaga, I., Paszkiewicz, F., Marcal, A. R. S., Guedes, C., and Cardoso, J. S. Optical music recognition: state-of-the-art and open issues. International Journal of Multimedia Information Retrieval, 1(3), 173 to 190 (2012). The survey that frames OMR as a pipeline of preprocessing, staff detection, symbol recognition, and reconstruction. https://doi.org/10.1007/s13735-012-0004-6

[2] Fornes, A., Dutta, A., Gordo, A., and Llados, J. CVC-MUSCIMA: a ground truth of handwritten music score images for writer identification and staff removal. International Journal on Document Analysis and Recognition, 15(3), 243 to 251 (2012). Defines staff-line removal as a standalone, ground-truthed task on scanned scores. https://doi.org/10.1007/s10032-011-0168-2

[3] Dalitz, C., Droettboom, M., Pranzas, B., and Fujinaga, I. A comparative study of staff removal algorithms. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(5), 753 to 766 (2008). Benchmarks classical staff-line removal and shows how removing the ruling changes what the recogniser sees. https://ieeexplore.ieee.org/document/4429387

[4] Calvo-Zaragoza, J., Vigliensoni, G., and Fujinaga, I. Pixelwise classification for music document analysis. IEEE International Conference on Image Processing Theory, Tools and Applications, 1 to 6 (2017). Recasts music-document analysis as per-pixel semantic segmentation, the same formulation a curve extractor uses. https://ieeexplore.ieee.org/document/8310134

[5] Long, J., Shelhamer, E., and Darrell, T. Fully Convolutional Networks for Semantic Segmentation. CVPR (2015). The dense per-pixel prediction formulation both music-document segmentation and log-curve segmentation inherit. https://arxiv.org/abs/1411.4038

[6] Ronneberger, O., Fischer, P., and Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation. MICCAI (2015). The encoder-decoder backbone with skip connections that recovers thin structures downsampling would otherwise erase. https://arxiv.org/abs/1505.04597

[7] Milletari, F., Navab, N., and Ahmadi, S.-A. V-Net: Fully Convolutional Neural Networks for Volumetric Medical Image Segmentation. 3D Vision (2016). Introduces the Dice objective that targets overlap on a small foreground, the metric a thin-symbol task is scored on. https://arxiv.org/abs/1606.04797

[8] Pacha, A., Hajic, J., and Calvo-Zaragoza, J. A Baseline for General Music Object Detection with Deep Learning. Applied Sciences, 8(9), 1488 (2018). Treats music symbols as objects to detect and classify, the discrete-symbol end of the same raster-to-symbol pipeline. https://doi.org/10.3390/app8091488
