Building a Reusable Synthetic-Data Generator for Chart Digitisation

We did not set out to build a general-purpose chart engine. We set out to read curves off scanned well logs, and we hit the wall every supervised-learning project on legacy documents hits first: there was no labelled corpus to train on. A raster well log is a photograph or a scan of a paper plot, sometimes decades old, and nobody had ever sat down and traced the curve pixels of enough of them to teach a segmentation network what a curve looks like against a ruled grid. So before we could digitise anything, we had to manufacture the training data, and that meant writing a generator that draws synthetic logs from scratch with a pixel-exact mask for free, because the renderer knows where it put every line. The engine fed our raster log digitiser, VeerNet, and on its largest configuration it produced a run of 20,000 two-curve multiclass logs across a width range of 3,200 to 12,800 pixels and a height range of 480 to 640 pixels.
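The "mask for free" point can be made concrete with a small sketch. This is not the engine's actual code: the function name `render_log`, the grey levels, the grid step and the sinusoidal stand-in for a log response are all assumptions chosen for illustration. What it shows is the structure: the renderer writes every curve pixel to the image and to the label mask in the same step, while gridlines go to the image only.

```python
import math
import random


def render_log(width, height, n_curves=2, grid_step=40, seed=0):
    """Draw gridlines and curves; return (image, mask) as nested lists.

    image: height x width grey levels (255 = paper, 180 = grid, 0 = ink).
    mask:  height x width class ids (0 = background, k = curve number k).
    All visual choices here are illustrative, not the article's settings.
    """
    rng = random.Random(seed)
    image = [[255] * width for _ in range(height)]
    mask = [[0] * width for _ in range(height)]

    # Gridlines are chart furniture: they appear in the image, never in the mask.
    for y in range(0, height, grid_step):
        image[y] = [180] * width
    for x in range(0, width, grid_step):
        for y in range(height):
            image[y][x] = 180

    # Each curve is a random sinusoid standing in for a real trace.
    for k in range(1, n_curves + 1):
        phase = rng.uniform(0, 2 * math.pi)
        period = rng.uniform(width / 8, width / 2)
        centre = rng.uniform(0.3, 0.7) * height
        amp = rng.uniform(0.1, 0.25) * height
        prev = None
        for x in range(width):
            y = round(centre + amp * math.sin(2 * math.pi * x / period + phase))
            y = min(max(y, 0), height - 1)
            # Fill the vertical gap to the previous column so the trace stays connected.
            lo, hi = (y, y) if prev is None else (min(prev, y), max(prev, y))
            for yy in range(lo, hi + 1):
                image[yy][x] = 0   # ink in the picture
                mask[yy][x] = k    # label in the mask, written at the same moment
            prev = y
    return image, mask


image, mask = render_log(width=320, height=48, n_curves=2, grid_step=16)
```

Because the label is written wherever ink is written, the mask cannot drift from the image, which is the property hand-tracing would struggle to match.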

The interesting part is what happened to that generator after the petroleum problem was solved. The first version was full of well-log vocabulary: track widths, depth grids, the specific look of a gamma-ray trace. The second version had almost none of it. Somewhere in the refactor we noticed that nothing in the drawing loop actually cared that it was a well log. A curve is a curve, an axis is an axis, and a gridline is a gridline, whether the page is a sonic log or a heart-rate strip. Once we pushed the petroleum specifics out of the code and into a configuration the engine reads at run time, the same generator could draw a chart from any domain that puts a continuous trace against ruled axes. This piece is about why that generalisation was sitting there all along, where the public literature had already drawn the same line, and what it cost us to make it real.

A scanned chart is a scanned chart

The reason a well-log generator transfers so cleanly is that chart digitisation is not many problems; it is one problem photographed in many costumes. The task is always the same: take a raster image of a plot and recover the numeric series that produced it, which means separating the data marks from the chart furniture and then mapping pixel coordinates back into the chart's own units. The earliest fully automated work on this framed it exactly that way. Scatteract treated the extraction of data from scatter plots as a pipeline: detect the visual marks, read the axis tick labels with optical character recognition, and fit a robust regression from pixel space into data space. It pointedly did not assume anything about what the scatter plot was measuring (Cliche et al., 2017).
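The last step of that pipeline, mapping pixels back into data units, is easy to sketch. The example below is an illustration under stated assumptions, not Scatteract's implementation: it swaps in a Theil-Sen estimator (median of pairwise slopes) as a simple robust regression, and the tick positions, tick values and function names are hypothetical. One tick label is deliberately misread, as OCR sometimes does, to show why a robust fit is preferred over ordinary least squares.

```python
from itertools import combinations
from statistics import median


def fit_axis(pixels, values):
    """Theil-Sen fit of value = slope * pixel + intercept.

    Taking medians means a minority of misread tick labels cannot drag the fit.
    """
    pairs = list(zip(pixels, values))
    slopes = [
        (v2 - v1) / (p2 - p1)
        for (p1, v1), (p2, v2) in combinations(pairs, 2)
        if p2 != p1
    ]
    slope = median(slopes)
    intercept = median(v - slope * p for p, v in pairs)
    return slope, intercept


def to_data(pixel, fit):
    """Convert a pixel coordinate into chart units using a fitted axis."""
    slope, intercept = fit
    return slope * pixel + intercept


# Hypothetical tick readings: pixel positions and OCR'd labels.
tick_px = [100, 200, 300, 400, 500]
tick_val = [0.0, 50.0, 100.0, 15.0, 200.0]  # the tick at 400 px should read 150

fit = fit_axis(tick_px, tick_val)
print(fit)                # (0.5, -50.0)
print(to_data(400, fit))  # 150.0, despite the misread label
```

Nothing in the fit knows what the axis measures, which is the domain-agnostic property the paragraph above describes.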

That domain-agnostic framing scaled up as the field matured. ChartOCR generalised the idea across chart types with a hybrid framework that detects the structural elements of a chart, the axes, the legend, the plotted points, and reconstructs the underlying table from them, and it leaned on a large synthetic chart dataset to train the deep components because, once again, hand-labelled charts at that volume did not exist (Luo et al., 2021). The pattern is consistent across this lineage: the structural primitives of a chart are shared, the labelled data is scarce, and synthetic generation is how the field has repeatedly filled the gap. A well log is just a member of that family with a tall aspect ratio and a particular naming convention for its curves.

Geometry is a parameter, not a domain

The move that turned our one-off generator into a reusable one was to identify the small set of knobs that actually describe a chart and refuse to hard-code any of them. There are only a handful. The axis range sets the units the curve lives in. The gridline density sets how busy the background is, which matters enormously to a segmentation network that has to learn the difference between a faint ruled line and a thin data trace. The curve count sets how many series share the frame and therefore how much occlusion the model must reason through. And the image width and height set the canvas the whole thing is rendered onto, which for us spanned 3,200 to 12,800 pixels wide and 480 to 640 pixels high because real archive scans vary that widely. None of those five quantities is petroleum. They are properties of charts in general.
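One way to picture "refuse to hard-code any of them" is a single immutable record holding the five quantities. The sketch below is an assumption about shape, not the engine's real configuration format: the class name `ChartSpec`, the field names, and the example axis range and gridline density are hypothetical. The curve count and the pixel dimensions are taken from the figures quoted in this article.

```python
from dataclasses import dataclass, fields
from typing import Tuple


@dataclass(frozen=True)
class ChartSpec:
    """The five chart-level knobs; no field name mentions a domain."""

    axis_range: Tuple[float, float]  # units the curve lives in (low, high)
    gridline_density: int            # gridlines per axis
    curve_count: int                 # series sharing the frame
    width_px: int                    # canvas width
    height_px: int                   # canvas height

    def __post_init__(self):
        low, high = self.axis_range
        if low >= high:
            raise ValueError("axis_range must be (low, high) with low < high")
        if min(self.gridline_density, self.curve_count,
               self.width_px, self.height_px) < 1:
            raise ValueError("counts and pixel dimensions must be positive")


# Two curves and the smallest quoted canvas; axis range and density are made up.
well_log = ChartSpec(axis_range=(0.0, 150.0), gridline_density=10,
                     curve_count=2, width_px=3200, height_px=480)

KNOBS = [f.name for f in fields(ChartSpec)]
print(KNOBS)
```

Validating in `__post_init__` keeps bad configurations from reaching the renderer, and freezing the record means a run's parameters cannot be mutated halfway through a batch.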

This is the same insight domain randomisation arrived at from the robotics side: rather than match a simulator carefully to one real environment, you parameterise the nuisance variation and sample across it widely, so that the real world looks like just another draw from the training distribution (Tobin et al., 2017). We are doing a narrower, more literal version of it. We are not randomising a 3D scene, we are randomising the parameters of a 2D chart, but the principle that the things which vary should be inputs rather than constants is identical. Once gridline spacing and axis range are sampled rather than fixed, a model trained on the output stops overfitting to the visual habits of one document family, and the generator stops being able to tell, or care, what discipline it is drawing for.
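In code, "sampled rather than fixed" amounts to drawing each knob from a range on every render. The following is a minimal sketch, assuming uniform sampling and a plain dictionary as the spec. The width and height ranges and the two-curve count come from this article; the gridline step range, the axis range bounds and the name `sample_spec` are hypothetical.

```python
import random

WIDTH_PX = (3200, 12800)  # width range quoted in the article
HEIGHT_PX = (480, 640)    # height range quoted in the article
CURVE_COUNT = 2           # two curves per log in the final dataset
GRID_STEP_PX = (20, 80)   # hypothetical gridline spacing range


def sample_spec(rng):
    """Draw one chart configuration; every nuisance variable is an input."""
    axis_low = rng.uniform(0.0, 50.0)     # hypothetical bounds
    axis_span = rng.uniform(50.0, 200.0)  # hypothetical bounds
    return {
        "width_px": rng.randint(*WIDTH_PX),
        "height_px": rng.randint(*HEIGHT_PX),
        "curve_count": CURVE_COUNT,
        "grid_step_px": rng.randint(*GRID_STEP_PX),
        "axis_range": (axis_low, axis_low + axis_span),
    }


# A seeded generator makes a corpus reproducible from its seed alone.
rng = random.Random(42)
specs = [sample_spec(rng) for _ in range(100)]
```

Passing the random generator in, rather than using the module-level one, keeps each corpus reproducible and lets separate runs use separate seeds.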

What retargeting looks like

The parameter board below makes the reuse concrete. It exposes the generator's reusable knobs (gridline density, curves per frame, image width and image height), seeded with the real values from our work: two curves per log, the 3,200 to 12,800 pixel width range, the 480 to 640 pixel height range. The target picker swaps the engine between its origin domain, the two-curve well log, and two non-petroleum charts. Switch the target and the same parametric draw re-renders under different axis labels and units, while the knobs keep meaning exactly what they meant before. The right-hand portability ledger keeps the score that matters: of the five knobs listed in the previous section, the count that carries a petroleum-specific assumption is zero.
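The same idea can be sketched as a single draw function that takes the knobs and a set of labels. Everything here is illustrative: the knob values other than the curve count and canvas size, the label strings, the target names and the fixed sinusoid are assumptions, and the output is a bare SVG string rather than the raster the real engine produces. The point is only that switching targets changes the text and nothing else.

```python
import math

# Curve count and canvas size follow the article; gridline count is hypothetical.
KNOBS = {"gridlines": 10, "curves_per_frame": 2, "width_px": 3200, "height_px": 480}

# Hypothetical target presets: labels only, no drawing logic.
TARGETS = {
    "well_log": {"x_label": "Depth", "y_label": "Curve value"},
    "heart_rate_strip": {"x_label": "Time (s)", "y_label": "Heart rate (bpm)"},
}


def draw_svg(knobs, labels):
    """Render gridlines, curves and axis labels as an SVG string."""
    w, h = knobs["width_px"], knobs["height_px"]
    parts = [f'<svg xmlns="http://www.w3.org/2000/svg" width="{w}" height="{h}">']
    for i in range(1, knobs["gridlines"]):
        x = i * w / knobs["gridlines"]
        parts.append(f'<line x1="{x:.0f}" y1="0" x2="{x:.0f}" y2="{h}" stroke="#bbb"/>')
    for k in range(knobs["curves_per_frame"]):
        pts = " ".join(
            f"{x},{h / 2 + (h / 4) * math.sin(x / 200 + k):.1f}"
            for x in range(0, w, 20)
        )
        parts.append(f'<polyline points="{pts}" fill="none" stroke="black"/>')
    parts.append(f'<text x="10" y="20">{labels["y_label"]}</text>')
    parts.append(f'<text x="10" y="{h - 10}">{labels["x_label"]}</text>')
    parts.append("</svg>")
    return "\n".join(parts)


log_svg = draw_svg(KNOBS, TARGETS["well_log"])
strip_svg = draw_svg(KNOBS, TARGETS["heart_rate_strip"])
```

Comparing the two strings line by line, only the `<text>` elements differ; the gridlines and polylines are identical.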

The curve generator we built for raster well logs is itself domain-agnostic: geometry, axes, gridlines and image dimensions are all generator parameters rather than petroleum constants, so the same engine retargets to any ruled line chart by changing numbers, not code. The left column exposes the reusable knobs (gridline density, curves per frame, image width 3,200 to 12,800 px, image height 480 to 640 px) seeded with the engagement's real values, the target picker swaps the engine between its origin domain (a two-curve well log) and two non-petroleum charts, and the centre tile re-renders the identical parametric draw under the new axis labels and units. The right column reports the 20,000-log run that produced the original two-curve multiclass corpus and a portability ledger that counts how many of the five knobs carry a petroleum-specific assumption: zero. Sourced figures: two curves per log in the final multiclass dataset, the 3,200 to 12,800 px width range, the 480 to 640 px height range, and the 20,000-log run. The non-petroleum target presets and the live preview geometry are illustrative depictions of how the same knobs retarget the engine; both are flagged on the canvas.

Two honesties about the exhibit. The well-log target, the two-curve count, the pixel dimension ranges and the 20,000-log run are real figures from the engagement. The two non-petroleum target presets, their axis ranges and units, and the live preview geometry are illustrative, flagged on the canvas and in the method line, and are there to show how a single set of knobs retargets the engine, not to claim we shipped a heart-rate digitiser. The argument the instrument is making is structural, not a benchmark: the same controls, the same draw call, a different chart.

Where the ceiling is

It would overstate the result to say a log generator is a universal chart engine, and the recent literature is exactly where you see the gap. Benchmarks like ChartQA pushed the task past extraction into reasoning over the recovered values, which demands an understanding of legends, categorical axes and chart semantics that a continuous-trace generator simply does not model (Masry et al., 2022). The frontier moved toward general visual-language pretraining, where a single image-to-text model learns to parse arbitrary screenshots rather than relying on a bespoke generator per chart family (Lee et al., 2023), and plot-to-table translation showed that the extraction step itself can be folded into a learned model that emits a structured table directly from the image (Liu et al., 2023).

Our generator does not compete with that frontier and was never trying to. It does one thing those approaches still depend on: it manufactures clean, perfectly labelled training imagery cheaply, for the narrow but common case of continuous curves on ruled axes. That is a useful thing to own. The reusability is not a grand claim about general chart intelligence; it is the unglamorous observation that we wrote less petroleum into the code than we thought we had, and the leftover was a tool that draws line charts for whoever asks. The engagement asked for well logs. The engine never knew that was special, and the version that finally shipped is the one that agreed with it.

References

[1] M. Cliche, D. Rosenberg, D. Madeka, C. Yee. Scatteract: Automated Extraction of Data from Scatter Plots. ECML PKDD 2017. https://arxiv.org/abs/1704.06687

[2] J. Luo, Z. Li, J. Wang, C.-Y. Lin. ChartOCR: Data Extraction from Charts Images via a Deep Hybrid Framework. WACV 2021. https://openaccess.thecvf.com/content/WACV2021/html/Luo_ChartOCR_Data_Extraction_From_Charts_Images_via_a_Deep_Hybrid_WACV_2021_paper.html

[3] J. Tobin, R. Fong, A. Ray, J. Schneider, W. Zaremba, P. Abbeel. Domain Randomization for Transferring Deep Neural Networks from Simulation to the Real World. IROS 2017. https://arxiv.org/abs/1703.06907

[4] A. Masry, X. L. Do, J. Q. Tan, S. Joty, E. Hoque. ChartQA: A Benchmark for Question Answering about Charts with Visual and Logical Reasoning. Findings of ACL 2022. https://arxiv.org/abs/2203.10244

[5] K. Lee, M. Joshi, I. Turc, H. Hu, F. Liu, J. Eisenschlos, U. Khandelwal, P. Shaw, M.-W. Chang, K. Toutanova. Pix2Struct: Screenshot Parsing as Pretraining for Visual Language Understanding. ICML 2023. https://arxiv.org/abs/2210.03347

[6] F. Liu, J. Eisenschlos, F. Piccinno, S. Krichene, C. Pang, K. Lee, M. Joshi, W. Chen, N. Collier, Y. Altun. DePlot: One-shot Visual Language Reasoning by Plot-to-Table Translation. Findings of ACL 2023. https://arxiv.org/abs/2212.10505
