The Quiet Work of Standardising NaN Across Messy Log Files

Ask a room of people what a missing number looks like and you will get a confident answer that is wrong in a different way from everyone else's. To one person it is a blank cell; to another it is the word NaN typed into a spreadsheet; to a third it is the value minus nine hundred ninety-nine point two five, which is what the instrument writes when the tool was off depth or the reading saturated. They are all describing the same thing, the absence of a real measurement, and none of their descriptions agree. That disagreement is harmless right up until a model tries to learn from the column, at which point it becomes the most expensive kind of bug there is: the silent one. The work of making every absence look the same is the least glamorous task in the whole pipeline, it never appears in a results table, and skipping it is the surest way to ship a model that was confidently trained on fiction.

Absence wears a lot of costumes

A well log corpus is an unusually honest place to see this, because the data passed through so many hands before it reached yours. A reading might have started life on paper, been scanned, been transcribed, been exported to the Log ASCII Standard, been opened in a spreadsheet by someone who saved it back out, and been concatenated with a thousand of its neighbours. At every one of those steps, the convention for what to write when there is nothing to write could change. The LAS standard itself is disciplined about this: it reserves a per-file NULL value, conventionally minus nine hundred ninety-nine point two five, and declares in the header that any cell equal to it is absent rather than measured [6]. That is a good design. The problem is that the discipline lives in the header, and by the time the numbers have been flattened into a wide table for training, the header has usually been thrown away. What survives is a column full of a very specific, very real-looking number that means nothing, sitting next to columns where absence was written as an empty string, or as the text NaN, or as a stray N/A that some export tool emitted, or, most dangerously, as a plain zero.
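One way to keep the header's discipline from being lost is to read the declared NULL before flattening anything. The following is a minimal sketch, assuming a simplified LAS 2.0-style layout in which the NULL line sits in the well section and the data follows a section starting with ~A; the sample text, the function names, and the fallback default are illustrative assumptions, not part of any real file or of our pipeline.

```python
import math

# Hypothetical, heavily simplified LAS-style text for illustration only.
LAS_TEXT = """\
~Well Information
 STRT.M      1000.0 : START DEPTH
 NULL.      -999.25 : NULL VALUE
~Curve Information
 DEPT.M             : DEPTH
 GR  .GAPI          : GAMMA RAY
~ASCII
 1000.0   85.2
 1000.5   -999.25
 1001.0   90.1
"""

def read_null_value(text, default=-999.25):
    """Return the NULL value declared in the header, or a conventional default."""
    for line in text.splitlines():
        stripped = line.strip()
        if stripped.upper().startswith("~A"):
            break  # data section reached without finding a NULL line
        if stripped.upper().startswith("NULL"):
            after_dot = stripped.split(".", 1)[1]
            return float(after_dot.split(":", 1)[0].strip())
    return default

def read_rows(text, null_value):
    """Parse the data section, replacing the declared NULL with NaN."""
    rows, in_data = [], False
    for line in text.splitlines():
        stripped = line.strip()
        if stripped.upper().startswith("~A"):
            in_data = True
            continue
        if in_data and stripped:
            values = [float(token) for token in stripped.split()]
            rows.append(
                [math.nan if math.isclose(v, null_value) else v for v in values]
            )
    return rows

null_value = read_null_value(LAS_TEXT)
rows = read_rows(LAS_TEXT, null_value)
```

The point of the sketch is the ordering: the sentinel is resolved while the header is still attached to the numbers, so the table that comes out the other side no longer depends on anyone remembering what the file declared.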

Each of those costumes fools a different reader. A numeric parser is happy to read minus nine hundred ninety-nine point two five as a perfectly good float and will cheerfully include it in a mean, a standard deviation, or a min-max scaler. A program that only knows to skip cells already typed as a floating-point NaN will pass straight over an empty string and the word "N/A" alike, counting both as present. And the zero is the cruellest of all, because a zero is sometimes a genuine reading and sometimes a placeholder for a missing one, and from the value alone you cannot tell which. The corpus is not lying on purpose. It is just that absence was never recorded in one language, and a model does not get to ask follow-up questions.
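A small sketch makes the two readers concrete. The cell values and the two function names below are hypothetical, chosen only to show which costumes each kind of check lets through.

```python
import math

# Illustrative raw cells as they might arrive from a flattened export.
raw_cells = ["85.2", "-999.25", "", "N/A", "0", "90.1"]

def numeric_reader(cell):
    """Return True if a plain float parser would accept the cell as a number."""
    try:
        float(cell)
        return True
    except ValueError:
        return False

def nan_only_reader(cell):
    """Return True if a check that only skips typed NaN counts the cell as present."""
    return not (isinstance(cell, float) and math.isnan(cell))

# The numeric parser accepts the sentinel and the zero as ordinary numbers.
accepted_as_numbers = [c for c in raw_cells if numeric_reader(c)]

# The NaN-only check counts every cell as present, including "" and "N/A".
counted_as_present = [c for c in raw_cells if nan_only_reader(c)]
```

Neither reader is buggy. Each is answering a narrower question than "was something measured here", which is why the costumes have to be removed before either is trusted.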

A problem the data-quality literature named long ago

None of this is new, and it is worth being honest that the people who thought hardest about it were not building log digitisers. The statistics literature has a precise vocabulary for the gaps themselves, separating data that is missing completely at random from data whose very absence carries information, and it makes the uncomfortable point that the mechanism behind a gap dictates what you are even allowed to do about it [4]. That distinction is upstream of any cleaning code, but it has a hard prerequisite that the same literature is blunt about: before you can reason about why a cell is empty, you have to know reliably that it is empty, and a corpus that spells absence five ways defeats that on the first line [5]. The practical-tooling tradition arrived at the same place from the other direction. The argument for tidy data was, in the end, an argument that most of the effort in an analysis is spent reshaping inputs into one consistent structure so that everything after it can be simple [2], and the library that most of us reach for to do that reshaping made a deliberate, load-bearing decision to represent missingness as a single explicit marker, NaN, with dtype-aware rules for how it propagates [3]. The whole point of that decision is that there should be exactly one thing a missing value is.

The well-log world has its own version of the lesson, taught gently. A widely read tutorial on the public FORCE 2020 and Xeek dataset walks through exactly this slice, 118 Norwegian-Sea wells across 22 measurement columns, and the first thing it does, before any modelling, is visualise where the data is missing so that the gaps become a thing you can see rather than a thing that silently distorts the fit [1]. That is the right instinct and a useful public mirror, because the same shapes show up in any digitised archive: some columns nearly complete, a few riddled with holes, the missingness clustered by well and by tool rather than sprinkled evenly. But you only get to see those shapes truthfully if every disguise of absence has already been collapsed to the one marker the visualiser understands. Run the same audit on a raw corpus and it will report a tidy, reassuring, and completely false picture, because half the absent cells are still wearing a costume the audit reads as present.
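The clustering by well and by tool is easy to check in text form once the nulls are canonical. This is a minimal sketch that assumes every absence has already been collapsed to NaN; the well identifiers, curve names, and readings are invented for illustration and are not taken from the public dataset.

```python
import math
from collections import defaultdict

# Hypothetical rows, one per depth sample, with absence already stored as NaN.
rows = [
    {"well": "W1", "GR": 85.2, "DTS": math.nan},
    {"well": "W1", "GR": 90.1, "DTS": math.nan},
    {"well": "W2", "GR": math.nan, "DTS": 140.0},
    {"well": "W2", "GR": 70.3, "DTS": 150.0},
]
CURVES = ["GR", "DTS"]

def missing_fraction_by_well(rows, curves):
    """Return {well: {curve: fraction of samples that are NaN}}."""
    counts = defaultdict(lambda: {c: [0, 0] for c in curves})
    for row in rows:
        for curve in curves:
            tally = counts[row["well"]][curve]
            tally[0] += math.isnan(row[curve])  # missing samples
            tally[1] += 1                       # all samples
    return {
        well: {c: missing / total for c, (missing, total) in per_curve.items()}
        for well, per_curve in counts.items()
    }

fractions = missing_fraction_by_well(rows, CURVES)
for well, per_curve in fractions.items():
    print(well, {c: f"{share:.0%}" for c, share in per_curve.items()})
```

A table like this is the plain-text cousin of a missingness plot: a curve that is entirely absent in one well and complete in another shows up as a block, not as evenly sprinkled holes.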

What this looked like on our corpus

We will be precise about which part of this is ours, because the technique is not. Our VeerNet pipeline for digitising raster logs sits on top of a corpus assembled from the Texas Railroad Commission public archive, 7,781 LAS files alongside the scanned images, and the housekeeping we are describing here is the unremarkable thing we had to do before any of the interesting modelling could start. We did not invent NULL handling; we inherited a corpus where it had been done inconsistently by everyone who touched the files before us, and we paid the tax of making it consistent. The public Norwegian-Sea slice is not our data, but it is close enough in shape and openly inspectable, so we use its 118 wells and 22 electrical-measurement columns below as a stand-in to show the mechanism without exposing anything operator-specific.

The shape of the work is a funnel. You take the corpus, enumerate every distinct way a cell could be standing in for absence, and map each one, deliberately and one family at a time, onto the single canonical null. The numeric sentinel goes first because it is the most common and the most insidious. Then the empty and whitespace-only strings, which a string parser reads as present text. Then the literal text spellings, NaN and N/A and their cousins, which a numeric coercion will turn into a real NaN only if you ask it to and will otherwise leave as objects. The zero is left for last and handled with the most care, because collapsing every zero to null would destroy real measurements; that one requires a per-column judgement about whether zero is a physically possible reading for that curve, and it is the step that most rewards talking to a petrophysicist rather than guessing.
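The funnel can be sketched as a single cell-level function plus a per-column zero policy. This is an illustration under stated assumptions, not our production code: the set of text spellings, the sentinel constant, the curve names, and the zero decisions in ZERO_IS_NULL are all hypothetical placeholders, and the zero decisions in particular stand in for judgements that should come from a petrophysicist.

```python
import math

SENTINEL = -999.25                                   # conventional LAS NULL
TEXT_NULLS = {"nan", "n/a", "na", "null", "none"}    # assumed spellings

def normalise_cell(cell, zero_is_null=False):
    """Map every recognised costume of absence to NaN; return real readings as float."""
    if cell is None:
        return math.nan
    if isinstance(cell, str):
        text = cell.strip()
        # Empty and whitespace-only strings, then the literal text spellings.
        if text == "" or text.lower() in TEXT_NULLS:
            return math.nan
        # Unrecognised text raises ValueError on purpose, so a new costume
        # is noticed instead of being silently turned into a null.
        value = float(text)
    else:
        value = float(cell)
    # The numeric sentinel (and any value already typed as NaN).
    if math.isnan(value) or math.isclose(value, SENTINEL):
        return math.nan
    # Zero is collapsed only where the column's policy says it cannot be real.
    if zero_is_null and value == 0.0:
        return math.nan
    return value

# Hypothetical per-column decisions about whether zero is a possible reading.
ZERO_IS_NULL = {"GR": True, "SP": False}

columns = {
    "GR": ["85.2", "0", "-999.25"],
    "SP": ["0", " ", "N/A"],
}
cleaned = {
    name: [normalise_cell(c, ZERO_IS_NULL[name]) for c in cells]
    for name, cells in columns.items()
}
```

Keeping the zero policy as explicit data instead of burying it in a conditional is a deliberate choice in this sketch: it is the part of the funnel that is a judgement, so it should be the part that is easiest to read and to review.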

Figure: interactive missingness meter. A scanned-log corpus arrives with absence written several ways: the classic LAS numeric NULL sentinel (-999.25), empty and whitespace strings, the text spellings NaN and N/A, and a zero pressed into service as a stand-in. Until each of those disguises collapses to one machine-readable null, a missingness audit lies twice: it counts text sentinels as present (a false floor) and it counts real readings that equal the sentinel as absent (a false alarm). As each token family is recognised in turn, the audited share of the 22 electrical-measurement columns climbs from its raw, pre-normalisation floor toward the true post-normalisation reading; the orange gap is the undercount still hiding at each step. The corpus scale (7,781 LAS files), the 22 columns, and the 118 Norwegian-Sea wells are sourced figures; the per-family column contributions and the 18-column truth are illustrative of the mechanism, not a measured per-token audit.

The meter above is the funnel made interactive on the public slice. It starts at the raw read, where only the cells already typed as NaN are counted, and the audit reports a small, comfortable fraction of the 22 columns as carrying any missing values. Drag the lever to recognise each token family in turn and the audited share climbs, not because the data changed but because you stopped mistaking disguised absence for measurement. The orange gap at every intermediate step is the part of the truth still hiding: columns that genuinely have holes but are still being counted as complete because their nulls are wearing a costume the parser respects. Only when every family has been mapped to the one null does the audited bar meet the true reading, and only then is anything you compute downstream, a mean, a correlation, a missingness heatmap, describing the corpus rather than a flattering fiction of it. The corpus scale, the column count, and the well count are real; the per-family contributions and the final affected-column figure are drawn to argue the mechanism, not to report a measured per-token tally of that public set.
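The same climb can be reproduced on a toy table. The sketch below recognises one token family at a time and recomputes the share of columns reported as having any missing value; the five columns, their contents, and the predicate names are invented for illustration, and the zero family is left out because it needs the per-column judgement described earlier.

```python
import math

# Toy table: each column hides its one absent cell in a different costume.
table = {
    "A": [1.0, math.nan, 3.0],     # already typed as NaN
    "B": [2.0, -999.25, 4.0],      # numeric sentinel
    "C": ["5.1", "", "6.2"],       # empty string
    "D": ["7.0", "N/A", "8.0"],    # text spelling
    "E": [9.0, 10.0, 11.0],        # genuinely complete
}

def is_typed_nan(cell):
    return isinstance(cell, float) and math.isnan(cell)

def is_sentinel(cell):
    try:
        return math.isclose(float(cell), -999.25)
    except (TypeError, ValueError):
        return False

def is_blank(cell):
    return isinstance(cell, str) and cell.strip() == ""

def is_text_null(cell):
    return isinstance(cell, str) and cell.strip().lower() in {"nan", "n/a"}

FAMILIES = [
    ("typed NaN", is_typed_nan),
    ("numeric sentinel", is_sentinel),
    ("blank string", is_blank),
    ("text spelling", is_text_null),
]

def audited_share(table, predicates):
    """Fraction of columns with at least one cell recognised as missing."""
    affected = sum(
        any(any(p(cell) for p in predicates) for cell in column)
        for column in table.values()
    )
    return affected / len(table)

steps, active = [], []
for name, predicate in FAMILIES:
    active.append(predicate)          # recognise one more family
    steps.append((name, audited_share(table, active)))
```

The data never changes between steps; only the audit's vocabulary does. On this toy table the first step sees one column in five and the last sees four, which is the whole mechanism in miniature.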

Why the boring step is the load-bearing one

It is tempting to treat normalisation as a checkbox you tick on the way to the real work, and that framing is exactly how the cost gets deferred rather than paid. The reframing toward data-centric practice was, at bottom, the observation that consistency in the dataset is often where the next dependable gain actually lives, more than in another architecture [7], and null standardisation is the most elementary instance of that consistency there is. A scaler fit on a column still carrying minus nine hundred ninety-nine point two five learns a mean and a spread that no real reading has; every value it later normalises is shifted by a quantity that came from counting an absence as a number. A model that imputes the gaps cannot impute the ones it never identified, so the costumed nulls sail through untouched and become training targets the network treats as ground truth. And a missingness analysis run before normalisation gives you a map of the wrong territory, which is worse than no map, because you will trust it.
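The scaler effect is simple arithmetic and worth seeing once. This sketch uses five invented readings and a hand-rolled z-score in place of any particular scaler implementation; the values and the helper name are assumptions for illustration only.

```python
from statistics import mean, pstdev

# Five illustrative real readings, and the same column with one unmasked sentinel.
readings = [80.0, 90.0, 100.0, 110.0, 120.0]
with_sentinel = readings + [-999.25]

def zscore(value, data):
    """Standardise a value using the mean and population spread of data."""
    return (value - mean(data)) / pstdev(data)

clean_mean = mean(readings)         # 100.0, the centre of the real readings
dirty_mean = mean(with_sentinel)    # dragged far below any real reading

# The central reading standardises to zero on the clean column, but to a
# positive value once the sentinel has been counted as a measurement.
clean_z = zscore(100.0, readings)
dirty_z = zscore(100.0, with_sentinel)
```

One sentinel in six cells is enough to move the learned mean below every real reading in the column, and every value scaled afterwards inherits that shift.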

The reason this is hard to take seriously is precisely that it produces nothing to show. There is no curve to plot, no metric that ticks up, no figure for a deck. What it produces is the absence of a category of error that, left in place, would have surfaced much later as a model that behaves strangely on exactly the wells where a tool was off depth, and that would have cost a week of someone's confusion to trace back to a column that was never as complete as it looked. We did the funnel once, carefully, wrote down the per-column decisions about zero, and then never thought about it again, which is the whole reward. The unglamorous step everyone is tempted to skip is the one that lets every glamorous step afterward be about the model instead of about the mess underneath it. Get the meaning of a missing number settled first, in one language, and the rest of the pipeline finally gets to assume what it always quietly assumed anyway: that a number it can see is a number that was really there.

References

[1] McDonald, A. Using the missingno Python library to identify and visualise missing data prior to machine learning. Towards Data Science (2021). A tutorial on the FORCE 2020 / Xeek slice of 118 Norwegian Sea wells with 22 measurement columns, surfacing missingness before training. https://towardsdatascience.com/using-the-missingno-python-library-to-identify-and-visualise-missing-data-prior-to-machine-learning-34c8c5b5f009

[2] Wickham, H. Tidy Data. Journal of Statistical Software 59(10) (2014). Argues most wrangling effort is spent reshaping inputs into one consistent structure, and names the conventions that make the rest tractable. https://www.jstatsoft.org/article/view/v059i10

[3] McKinney, W. Data Structures for Statistical Computing in Python. SciPy (2010). Introduced pandas and its explicit, dtype-aware treatment of NaN as the missing-value marker. https://conference.scipy.org/proceedings/scipy2010/mckinney.html

[4] Little, R. J. A., and Rubin, D. B. Statistical Analysis with Missing Data, 3rd ed. Wiley (2019). The standard reference for the missingness taxonomy and why the mechanism behind a gap dictates what you may do about it. https://onlinelibrary.wiley.com/doi/book/10.1002/9781119482260

[5] van Buuren, S. Flexible Imputation of Missing Data, 2nd ed. CRC Press (2018). A practical treatment of multiple imputation that insists on correctly identifying which cells are missing before any model fills them. https://stefvanbuuren.name/fimd/

[6] Canadian Well Logging Society. LAS (Log ASCII Standard) format specification. Defines a per-file NULL value (commonly -999.25) for absent readings, which downstream tools must honour rather than treat as data. https://www.cwls.org/products/#products-las

[7] Ng, A. A Chat with Andrew on MLOps: From Model-centric to Data-centric AI. DeepLearning.AI (2021). Reframed the next reliable gain as a property of the dataset and its consistency rather than the architecture. https://www.deeplearning.ai/the-batch/a-chat-with-andrew-on-mlops-from-model-centric-to-data-centric-ai/
