Publishing Peer-Reviewed AI Research Under a National-Oil-Company NDA

There is a folk theorem in machine learning that says peer review and open data are inseparable — that a result you cannot reproduce from a public repository is a result the community should not trust. It is a fine principle for benchmark papers. It is unworkable for industrial subsurface AI, where the training data is a national oil company's most closely guarded asset and the contract you signed to touch it forbids you from releasing a single depth digit. The interesting question is not whether you can publish under those constraints. It is how — and what has to substitute for the open-data trust mechanism when open data is off the table.

We faced this directly across a roughly twenty-month engagement with a mid-sized Middle East carbonate operator we partnered with, building a Detection-Transformer model to pick fractures and bedding planes on image logs from two different microresistivity imaging tools. The work was, first and last, an applied machine-learning build — a custom set-prediction architecture, a from-scratch backbone, a bespoke evaluation harness — and it produced two manuscripts bound for a major petroleum-engineering journal, one on the fracture-and-bedding detector, one on a classical computer-vision vug-quantification pipeline, plus a peer-reviewed abstract presented at a European geoscience conference. Every one of them shipped while the well data, the model checkpoints, and the source code stayed behind a tripartite confidentiality agreement. This piece is about the engineering and editorial discipline that made that possible — because the constraints are not incidental to the science; they shape the experimental design itself.

The constraint is upstream of the model

Start with what the agreement actually locks down. The data belongs to the operator; it is a proprietary corpus of processed image logs accumulated over decades of drilling onshore acreage, and it does not leave the operator's governance. That single fact reorders the entire build-versus-buy calculus for a national oil company, and it is exactly why the AI is built with a partner under NDA rather than bought off a shelf: the corpus is the moat, and the moat cannot be exported to a vendor's cloud or a public benchmark.

In 2026 the AI build-vs-buy split in oil & gas is sorted by operator tier, and the deciding variable is the depth of the proprietary subsurface corpus an operator owns. Pick a tier — NOCs (Build), Western IOCs (Partner), mid-tier independents (Buy) — and the panel reconfigures to that tier's posture, named operators and the article's own commitments. The orange ladder is the single argument: the deeper the owned corpus (the sourced NOC band runs from ADNOC's 50+ years to Aramco's 90 years), the further toward BUILD a tier sits. Drag the corpus-depth marker — or step tiers with the chips / arrow keys — and the recommended posture snaps to the band the depth lands in. Named operators, the NOC corpus depths, model sizes ($340M / 28 fields, 250B / 70B params), the 70% / 75% gains and the $7.6B→$25B market are the article's own; the corpus-depth axis, the gate thresholds and the IOC / independent marker positions are illustrative.

For the research team, that posture has three concrete consequences, and each one lands on a different layer of the machine-learning pipeline.

The first is the data layer. We never controlled how much geology we would see, and we saw less than a benchmark paper would demand — fourteen vertical wells in total, with eleven carrying consistent bedding picks. That is a small dataset for set-prediction transformers, and far too small to do the textbook thing and hold out an entire well as an untouched test set: sacrifice one well to testing and you have removed seven percent of your geology from training, on a problem where the gap between three wells and nine wells is the gap between a useless model and a deployable one.

The second is the disclosure layer. The model architecture, the loss, the ablations, the metrics — all publishable. The depths at which any of it was measured — not publishable. So in every figure, every table, and every worked example in the manuscripts, the depth points were masked by stripping the leading two digits. A blind-well prediction reported at a masked depth of XX16.68 metres is scientifically complete — the offset, the dip error, the azimuth error are all there — while the absolute position that would fingerprint the well and the field is gone.

The third is the artifact layer. No code release, no data release, no model weights. One reviewer asked for exactly that, and it was the single comment across the whole review cycle we could not satisfy — not because we were unwilling, but because the confidentiality agreement forbids it. That refusal is the crux of the whole problem: in a culture that increasingly treats a GitHub link as a precondition for belief, how do you make a withheld-everything result credible?

Substituting evaluation rigour for open data

The answer is that the trust mechanism moves. When a reader cannot rerun your code, the burden shifts onto the transparency and severity of your evaluation — and that is something you can disclose in full without leaking a single confidential byte. We leaned on three moves, all of them ordinary good ML engineering, made load-bearing by the constraint.

Partition by patch, and say so loudly. Because fourteen wells cannot spare a hold-out well, we split train, validation, and test at the level of image patches drawn across the wells, and we stated this explicitly in the methodology and defended it line-by-line in the reviewer response. This is the kind of choice that looks like a shortcut if you hide it and like sound engineering if you expose it. Patch-level partitioning on a fourteen-well corpus is a defensible answer to a hard data-scarcity constraint; pretending you had well-level isolation you did not have would have been the actual integrity failure. The honest disclosure is what makes the limitation a methodological choice rather than a buried flaw.

Report the ablations that show the model is learning geology, not memorising it. The most persuasive evidence under an NDA is not a public dataset — it is a sensitivity analysis that no overfit model could fake. Sweeping the number of training wells, classification error falls from 93.1% at three wells to 1.06% at nine, and lands at 2.54% on the full fourteen-well fractures-only model. That curve is the argument: a model that had merely memorised patches would not collapse its error by nearly two orders of magnitude as unrelated geology accumulates. Pair it with a static-versus-dynamic-imagery ablation — where training on statically scaled logs leaves classification error at 63.5% against 2.54% on dynamic logs — and a backbone sweep in which a from-scratch ResNet-10, deliberately light to resist overfitting on small data, beats every deeper variant, and a careful reader can audit the model's behaviour without ever seeing the wells.

Loss-function choice decides whether the network learns curve continuity. VeerNet tested five losses under identical conditions; only the one whose gradient aligns with the IoU/F1 metric (Lovász-Softmax) wins, and the shipped answer is a two-loss SCE-warmup → Lovász-finetune schedule. Pick a loss to see its ablation verdict, the reason, and a schematic ground-truth-vs-prediction trace; toggle the two-loss schedule (same accuracy, half the wall-clock). The five candidates, verdicts, F1 35%/IoU 30% and the two-loss schedule are the whitepaper's own; the podium bar heights are ordinal (rank sourced) and the prediction thumbnails are schematic.

Validate out of distribution, on the operator's terms. When the harshest reviewer recommended the paper as unpublishable on generalisation grounds, the reply was not a public benchmark — it was more confidential evidence, disclosed at the same rigour. We added a continuous 12 m blind-zone prediction on a held-back well and brought in five horizontal wells purely to test inclination generalisation, reporting fracture-detection performance around 85% within an 8 cm depth tolerance and roughly 65% at the tighter 3 cm band. Every number traceable, every depth masked, no artifact released. The reviewer's legitimate worry — does this thing work off the training distribution? — got a legitimate answer that the confidentiality agreement could live with.

The editorial mechanics matter as much as the science

None of this survives contact with a journal unless the correspondence layer is handled with the same care as the model. Three things were load-bearing.

The cover letters did the disclosure proactively. Rather than waiting to be caught, the revised submission stated up front that one reviewer request — to share code and data — could not be met, and named the confidentiality agreement as the reason. An editor who learns about a hard constraint from the authors reads it as a boundary condition; an editor who discovers it later reads it as evasion.

The reviewer responses treated every limitation as a design rationale, not an apology. Across three reviewers and an associate editor, the recurring questions — why patch-level splitting, why no comparison against mask-based detectors, why such a small backbone — each had a real engineering answer rooted in the data constraint, and we wrote them as answers, not excuses. The roughly eleven-thousand-word journal limit even forced some of the extra out-of-distribution analysis into supplementary material, which is its own small lesson: rigour has to be budgeted, not just produced.

And the work was staged across venues so credibility compounded. The conference abstract went out first, earned its own peer scrutiny in front of a geoscience audience, and the manuscripts followed — each carrying the formal publication permissions the partnership required. None of it leaked the corpus; all of it accrued external validation to a body of work that, by contract, can never be open-sourced.

What actually transfers

The takeaway for any team doing confidential industry AI is that "we can't open-source it" is not a death sentence for peer review — it is a design constraint you absorb upstream, in the experiment, not downstream, in an apology. Partition honestly and disclose the partition. Replace the reproducibility you cannot offer with evaluation severity you can — ablations that expose the model's learning curve, out-of-distribution tests at full reported rigour, metrics in the physical units the domain trusts. Mask what the contract requires and nothing more, so the science stays complete even when the coordinates are gone. Handle the editorial correspondence as a first-class engineering surface. Do that, and a result no one can rerun can still be a result the community has every reason to believe. It is the same discipline we carry across the operators we have worked with, in the Middle East and beyond.

Key takeaways

Confidential industry AI cannot meet the open-data norm of ML peer review — the training corpus is the operator's core asset and the NDA forbids releasing code, weights, or even raw depths. The trust mechanism has to move from reproducibility to evaluation severity.
Data scarcity is upstream of the model: with only 14 wells (11 with consistent bedding picks), holding out a whole well is impossible, so train/val/test was partitioned by image patch — a defensible choice precisely because it was disclosed and defended, not hidden.
Disclosure was surgical: depth points were masked by stripping the leading two digits, so every figure and worked example stays scientifically complete while the absolute coordinates that would fingerprint the well and field are removed.
Ablations substitute for open data. Classification error falling 93.1%→1.06%→2.54% as training wells grow (3→9→14), plus static-vs-dynamic (63.5% vs 2.54%) and backbone sweeps, let a reader audit that the model learns geology rather than memorising patches — no dataset access required.
Out-of-distribution validation answered the generalisation challenge with more confidential evidence at full rigour: a continuous 12 m blind-zone prediction and 5 added horizontal wells, reporting ~85% fracture-detection performance at 8 cm (~65% at 3 cm), every depth masked.
Editorial mechanics are first-class engineering: proactively name the confidentiality constraint in the cover letter, write every limitation as a design rationale in the reviewer response, and stage conference and journal venues so external validation compounds around work that can never be open-sourced.

Publishing Peer-Reviewed AI Research Under a National-Oil-Company NDA

The constraint is upstream of the model

Substituting evaluation rigour for open data

The editorial mechanics matter as much as the science

What actually transfers

Continuous AI for explorers

About Earthscan

Products

Legal

Follow us on