Geoscience Language Models: A Survey of LLMs for Reports, Logs, and Well Metadata

Abstract

A well produces two kinds of signal, and a language model can only read one of them. The first is text: the completion report and its prose, the header block stamped on every log, and the tabular metadata that describes each curve, its units, its datum, and the measurement that produced it. A language model reads all of that, and the emerging family of geoscience language models is built to read it well. The second is geometry: the shape of a resistivity or porosity trace as it was drawn on a scanned paper log, a thin line across a raster image. That is not text about a well; it is a picture of one, and no language model recovers it. This survey reads the geoscience-LLM field and the public well-data tooling around it against that single distinction, with one concrete contrast to make the line sharp: our raster digitiser, VeerNet, recovers curve geometry at a best single-curve R-squared of 0.9891 and a lowest mean absolute error of 0.0132, a task that stays a vision problem no matter how good the language model gets. We ground the value question against the applied-upstream survey that reports a roughly 20 percent drilling cost-savings figure for machine learning [1], and anchor the metadata modality in the public Norwegian Sea tutorial data of 118 wells and 22 electrical-measurement feature columns [2]. The finding is that a geoscience language model complements the vision digitiser rather than replacing it.

Background: what a geoscience language model is being asked to do

The subsurface has always been a text-heavy discipline dressed as a quantitative one. Behind every digital log curve sits a paper trail: the drilling report, the mud log narrative, the completion summary, the header block that records what tool ran, in what units, referenced to which datum. For decades this text was searchable only by the person who had read it. A geoscience language model makes that corpus queryable, letting an interpreter ask a plain question of a thousand reports and get a grounded answer with the header facts attached.

The promise is real and it is bounded. A language model operates on tokens, and the subsurface signal it can tokenise is exactly the part written in words or laid out in a table. The report is words. The header is words and numbers in a fixed schema. The well metadata, the description of each curve and its provenance, is a table, and the 22 electrical-measurement feature columns across the 118 wells in the public tutorial set are precisely the kind of object a language model can read, summarise, and reason over [2]. None of that touches the curve itself.

Method: reading the field against one distinction

This is a structured reading of the emerging geoscience-LLM work and the public well-data tooling around it, not a new model or benchmark. We fixed one distinction at the start and sorted every task in the subsurface stack against it: is the signal text a model can read, or geometry that must be seen? Report prose is text. The log header block is text in a schema. Well metadata is a table, anchored to the public FORCE tutorial data because those 118 wells and their 22 measurement columns are the concrete, inspectable form of the object [2]. Against those three text modalities we placed one geometry modality, the two-curve raster segmentation our own digitiser performs, and quoted its real metrics, a best single-curve R-squared of 0.9891 and a lowest mean absolute error of 0.0132, as the worked contrast rather than a claim about any language model.
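For readers who want the two quoted metrics pinned down, the sketch below computes R-squared and mean absolute error using their conventional definitions. It is illustrative only: the article does not state the digitiser's exact evaluation protocol, the function names are hypothetical, and the sample values are toy numbers, not VeerNet output.

```python
from statistics import fmean


def mean_absolute_error(truth, predicted):
    """Average absolute gap between reference and digitised values."""
    return fmean([abs(t - p) for t, p in zip(truth, predicted)])


def r_squared(truth, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_truth = fmean(truth)
    ss_res = sum((t - p) ** 2 for t, p in zip(truth, predicted))
    ss_tot = sum((t - mean_truth) ** 2 for t in truth)
    return 1.0 - ss_res / ss_tot


# Toy values only (hypothetical, normalised curve readings), not VeerNet output.
reference = [0.10, 0.20, 0.40, 0.30, 0.50]
digitised = [0.11, 0.19, 0.41, 0.31, 0.48]

mae = mean_absolute_error(reference, digitised)
r2 = r_squared(reference, digitised)
print(f"MAE={mae:.4f}  R2={r2:.4f}")
```

Both metrics compare a recovered curve to a reference curve sample by sample, which is why they describe the geometry modality and have no direct analogue for a model that reads text.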

For the value question we did not invent a number. The applied-upstream survey gathers reported machine-learning outcomes across drilling, reservoir engineering, and production, and its roughly 20 percent drilling cost-savings figure is the honest ceiling to ground both modalities against [1]. A language model that summarises reports and a vision model that recovers curves are both machine-learning contributions to the same upstream workflow, and neither gets to claim value above what that workflow has been shown to return.

The two modalities, side by side

The text side is well served by a language model, and it is worth being specific about why. A report is unstructured, but it is unstructured language, and language is what these models are for. Ask which wells reported losses in a particular formation, or which headers record a non-standard datum that would corrupt a depth merge, and a model with the reports and headers in context can answer, because the answer is written somewhere in the text. The metadata table sharpens this: with the 22 feature columns and their descriptions in hand, a model can explain what each curve measures, flag which wells are missing a channel, and translate a plain question into a filter over the table [2]. This is the modality where a geoscience LLM earns its place.
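The two example questions above, missing channels and non-standard datums, both reduce to simple filters over a metadata table. The sketch below shows what such filters might look like as the target a language model would translate a plain question into. The well names, channel names (GR, RHOB, NPHI), the datum field, and the assumed standard datum are all hypothetical and are not taken from the tutorial dataset.

```python
# Hypothetical metadata rows: one dict per well, None where a channel is absent.
# Channel names and the datum field are illustrative assumptions.
wells = [
    {"well": "A-1", "datum": "KB", "GR": 75.2, "RHOB": 2.41, "NPHI": 0.21},
    {"well": "A-2", "datum": "MSL", "GR": 68.0, "RHOB": None, "NPHI": 0.18},
    {"well": "B-7", "datum": "KB", "GR": None, "RHOB": None, "NPHI": 0.25},
]
CHANNELS = ("GR", "RHOB", "NPHI")
EXPECTED_DATUM = "KB"  # assumed project standard, for illustration only


def missing_channels(rows):
    """Map each well to the channels it lacks; complete wells are omitted."""
    gaps = {}
    for row in rows:
        absent = [c for c in CHANNELS if row.get(c) is None]
        if absent:
            gaps[row["well"]] = absent
    return gaps


def nonstandard_datum(rows, expected=EXPECTED_DATUM):
    """List wells whose header datum differs from the expected reference."""
    return [row["well"] for row in rows if row["datum"] != expected]


print(missing_channels(wells))
print(nonstandard_datum(wells))
```

Nothing here needs a model to execute; the language model's contribution in this picture is mapping "which wells are missing density?" onto the right filter and explaining the result.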

The geometry side is where the wall is, and the wall is not a matter of model quality. A scanned log is a raster image, and the curve on it is a thin line whose position, pixel by pixel, encodes the measured value. Recovering that line is a dense-prediction vision task: segment the two curves out of the background, trace each centreline, and read its position back into a depth-indexed value. Our digitiser does this at a best single-curve R-squared of 0.9891 and a lowest mean absolute error of 0.0132, numbers that describe a vision model reading pixel geometry, not a language model reading text. Handed the same scan, a language model can read any typeset words on it, the header, the track labels, the scale annotations, but not the curve, because the curve is not written; it is drawn. This is the distinction the exhibit below is built to argue.
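To make the last two steps concrete, here is a minimal sketch of centreline tracing and scale read-back for a single curve. It assumes the segmentation has already produced a binary mask, that each pixel row is one depth sample, and that the track has a linear scale; the function name and toy mask are hypothetical, and this is not the digitiser's actual implementation.

```python
def trace_centreline(mask, value_min, value_max):
    """Read one curve from a binary mask, one value per pixel row.

    mask: list of rows, each a list of 0/1 flags (1 = curve pixel).
    Rows run down the depth axis; columns span the track's linear scale
    from value_min (left edge) to value_max (right edge).
    Rows with no curve pixels yield None rather than a guessed value.
    """
    width = len(mask[0])
    values = []
    for row in mask:
        columns = [x for x, flag in enumerate(row) if flag]
        if not columns:
            values.append(None)
            continue
        centre = sum(columns) / len(columns)  # sub-pixel centreline position
        fraction = centre / (width - 1)
        values.append(value_min + fraction * (value_max - value_min))
    return values


# Toy 4-row, 5-column mask for a single already-segmented curve.
mask = [
    [0, 1, 0, 0, 0],
    [0, 1, 1, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 1],
]
curve = trace_centreline(mask, value_min=0.0, value_max=100.0)
print(curve)
```

The input is pixel positions and the output is numbers; at no point is there a token to read, which is the point the paragraph above makes in prose.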

Exhibit: a coverage matrix that reads the subsurface task stack one modality at a time and sorts each into what a language model can or cannot address. The three text bands (well and completion reports, the log header block, and the well metadata columns, here the 22 electrical-measurement feature columns across the 118 Norwegian Sea tutorial wells [2]) are LLM-addressable: they are words a model can read. The fourth row, the two-curve trace geometry that our vision digitiser recovers at a best single-curve R-squared of 0.9891 and a lowest mean absolute error of 0.0132, is the one orange cell no language model fills, because it is the shape of a line on a scan and not text about it. The toggle grounds each row's value against the reported 20 percent drilling cost-savings ceiling for upstream machine learning (Koroteev and Tekic [1]) rather than against an open-ended promise. The public archive footing shows where the two signals physically live: 136,771 raster scans that carry the words and 7,781 digital curve files. The 22 columns, 118 wells, R-squared 0.9891, MAE 0.0132, and 20 percent ceiling are sourced; the addressable / not-addressable reading and the value shares are an editorial allocation, flagged as such. The argument the orange row carries is the article's spine: a language model complements, and does not replace, the vision digitiser in the subsurface stack.

The matrix sorts the subsurface stack into the two bands and marks the single cell no language model fills. Three of the four task modalities, the report, the header, and the metadata table, are text a model can address; they sit in the teal band. The fourth, the two-curve trace geometry, sits alone in the orange band with its sourced R-squared and mean-absolute-error anchor, because it is the shape of a line and not text about it. The value-ceiling toggle grounds every row against the reported 20 percent drilling cost-savings figure [1] rather than against an open-ended promise, so the matrix reads as an allocation of a bounded return, not a hype board. The archive footing shows where the two signals physically live: a public raster archive holds 136,771 scanned images, which is where the words around a well sit as annotations, alongside 7,781 digital curve files, which is where the geometry has already been recovered. The words and the curves are different objects even inside the same archive.

Why "complement" is the honest verb

It is tempting, when a language model reads reports and headers and metadata so fluently, to expect it to eventually read the curve too, and to treat the vision digitiser as a stopgap a better LLM will retire. The survey reading argues the opposite, and the argument is not about current capability but about modality. The curve on a scan carries no text to tokenise; its information lives in the sub-pixel position of a line, a quantity a vision model measures and a language model has no channel for. A multimodal model that accepts an image can be trained to do the vision task, but at that point it is doing dense-prediction segmentation, the digitiser's job, under a larger and more expensive roof; it has not read the curve as language, it has learned to see it. The clean division of labour is the cheaper and more honest one: let the language model address the words around the well, let the vision model address the trace, and measure each on the modality it consumes.

The value picture reinforces this. Grounded against the roughly 20 percent drilling cost-savings ceiling [1], neither modality is a silver bullet, and both are contributions to the same bounded return. The report-reading language model saves interpreter hours by making a text corpus queryable; the curve-reading vision model saves them by turning a scanned log back into numbers. They add value in series, on different signals, and a stack with one but not the other has a gap exactly the width of the missing modality. That is the practical content of "complement": the two models do not compete for the same task, they cover different halves of the same well.

Discussion

Read as a family, the geoscience language models arriving now are best understood by what they consume rather than by how large they are. Their native input is the text of the subsurface: the reports, the headers, the metadata tables. On that input they are strong, and the public tooling around datasets like the Norwegian Sea tutorial set shows how naturally a well's metadata falls into the tabular, model-readable shape they thrive on [2]. Where the family stops is at the boundary of text, and the boundary is not soft. The curve geometry on a scanned log is on the far side of it, recovered by a vision digitiser at a best single-curve R-squared of 0.9891 and a lowest mean absolute error of 0.0132, and it stays there.

The recommendation that falls out is a stack, not a single model: a geoscience language model over the reports, headers, and metadata, and a vision digitiser over the scans, each measured on its own modality and grounded against the same bounded value ceiling. VeerNet is the vision half of that division of labour, the model that reads the modality a language model cannot. Reaching for the LLM to also read the curve is the move to avoid, not because the model is weak but because the curve was never text.
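The recommended stack can be pictured as a small routing table. The sketch below is one hypothetical way to encode it: the modality labels follow the four rows of the matrix, while the enum, dictionary, and function names are illustrative and do not describe any shipped interface.

```python
from enum import Enum


class Model(Enum):
    LANGUAGE = "geoscience language model"
    VISION = "vision digitiser"


# Modality labels follow the article's four rows; the keys are illustrative.
ROUTES = {
    "report": Model.LANGUAGE,
    "header": Model.LANGUAGE,
    "metadata": Model.LANGUAGE,
    "curve_geometry": Model.VISION,
}


def route(modality):
    """Return which model should handle a modality; reject unknown ones."""
    try:
        return ROUTES[modality]
    except KeyError:
        raise ValueError(f"unknown modality: {modality!r}") from None


for name in ROUTES:
    print(f"{name:>15} -> {route(name).value}")
```

Keeping the routing explicit makes the gap visible: remove either model and one or more rows have nowhere to go.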

Limitations

This is a survey and inherits a survey's limits. It synthesises how the emerging geoscience-language-model family and the public well-data tooling treat the subsurface, and it does not train, fine-tune, or benchmark any language model of its own. The model metrics it quotes, a best single-curve R-squared of 0.9891 and a lowest mean absolute error of 0.0132, are measured numbers from a single raster-digitisation run, one engagement and one architecture; they serve as a worked contrast for the geometry modality, not as a head-to-head measurement against any language model. The 22 feature columns and 118 wells are the public FORCE tutorial data as reported [2], and they stand in for the metadata modality generally; a production metadata schema will differ. The roughly 20 percent drilling cost-savings figure is one survey's reported ceiling for a specific upstream task [1] and is used to ground the value conversation, not as a guaranteed return for any deployment. The matrix's addressable and not-addressable reading of each modality, and the value shares it draws when the ceiling overlay is on, are an editorial allocation and are flagged as illustrative on the canvas. Every number plotted (the columns, the wells, the R-squared, the mean absolute error, the ceiling, and the archive file counts) is sourced. A reader should take this as a map of which subsurface signals a language model can and cannot read, not as a substitute for evaluating a specific geoscience LLM on their own reports and their own metadata.

What to carry from the survey

A well produces two kinds of signal, and a language model reads only one. Text (the report prose, the log header block, and the well metadata table of 22 electrical-measurement feature columns across 118 Norwegian Sea tutorial wells) is LLM-addressable. Curve geometry (the shape of a trace on a scanned log) is not, because it is a line drawn on an image, not text about a well.
The geometry modality stays a vision problem regardless of language-model quality. Our raster digitiser recovers curve position at a best single-curve R-squared of 0.9891 and a lowest mean absolute error of 0.0132; those numbers describe a vision model reading pixels, a channel a language model does not have.
A geoscience LLM complements the vision digitiser rather than replacing it. The two consume different modalities of the same well, address different halves of the workflow, and a stack with one but not the other has a gap exactly the width of the missing modality.
Ground the value honestly. Both modalities are machine-learning contributions to the same bounded upstream return, capped near the reported 20 percent drilling cost-savings figure [1]; neither is a silver bullet, and the matrix reads as an allocation of a bounded return, not an open-ended promise.
Even inside one public archive the signals are different objects: 136,771 raster scans carry the words around a well as annotations, while 7,781 digital curve files carry the geometry that has already been recovered. Reading reports and reading curves are different jobs for different models.

The smallest habit this survey would install is a question to ask before pointing a language model at a scanned log: is the thing I need written on this image, or drawn on it? If it is written, the header, the labels, the report attached to it, a geoscience language model is the right tool. If it is drawn, the curve, the trace, the geometry, the language model has no channel for it, and the vision digitiser is the model that reads that modality.

References

[1] Koroteev, D., and Tekic, Z. Artificial intelligence in oil and gas upstream: Trends, challenges, and scenarios for the future. Energy and AI, 3, 100041 (2021). Surveys applied machine learning across upstream and reports the value ranges we ground against, including a roughly 20 percent drilling-optimisation cost saving. https://doi.org/10.1016/j.egyai.2020.100041

[2] McDonald, A. Using the missingno Python library to Identify and Visualise Missing Data Prior to Machine Learning. Towards Data Science (2021). A tutorial on the Xeek/FORCE 2020 Norwegian Sea well-log dataset, 118 wells with the electrical-measurement feature columns that make well metadata a tabular, model-readable object. https://towardsdatascience.com/using-the-missingno-python-library-to-identify-and-visualise-missing-data-prior-to-machine-learning-ec2b9a1e5f5c
