Reproducible by Design: DVC, Weights & Biases and CI/CD on a DGX A100 Stack for Subsurface ML

A model that beats the benchmark in a notebook is not a capability an operator can run a field on. The gap between the two is almost entirely engineering: can you reproduce last quarter's result bit-for-bit, can you tell which dataset version and which hyperparameters produced the checkpoint now serving interpreters, and can you tell — before a geoscientist does — that the live model has gone stale. For a mid-sized Middle East carbonate operator we partnered with, the borehole-geology model that picks fractures and bedding planes from two different microresistivity imaging tools had already been validated. The work of this phase was different in kind: we had to harden that research code into an operable platform, so every result behind it became rerunnable and auditable rather than a one-off in someone's working directory.

At a glance

The deliverable of this phase was not a better model. It was an MLOps spine — data versioning with DVC, experiment tracking with Weights & Biases, and CI/CD with drift monitoring — running on an on-prem NVIDIA DGX A100 stack with a self-hosted data store and container-orchestration layer. Concretely, it moved the programme from a research artefact to a deployable one along three axes engineers will recognise.

Phase 3: from research artefact to operable platform

Before: research code, untracked runs. Hand-managed datasets in working directories; checkpoints without a provenance trail; results reproducible only by the person who ran them.

After: versioned data, tracked experiments, CI/CD. DVC-versioned, UUID-named datasets; every run logged to Weights & Biases; CI/CD and drift monitoring on a DGX A100 stack. Every borehole-geology result rerunnable and auditable; model rollback and production serving in place.

Why a benchmark-beating model is not yet a platform

The earlier phases produced a real result. A DETR-derived set-prediction model — the SUB-DETR / borehole-geology lineage — learned to emit depth, dip, and azimuth for overlapping fracture and bedding sinusoids in a single forward pass, and the ablations that justified its small ResNet-10 backbone were decisive: as the training corpus grew across wells, the Hungarian matching loss fell from 0.801 to 0.015 and classification error from 93.115 to a low single digit. That is a model worth operating.
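To make the set-prediction idea concrete, here is a minimal sketch of the matching step that a Hungarian loss rests on: each ground-truth sinusoid is paired with exactly one prediction so that the total cost is lowest. It brute-forces the assignment over permutations, which is only workable for a handful of picks; a real pipeline would use the Hungarian algorithm proper (for example `scipy.optimize.linear_sum_assignment`). The cost function, its weights, and the sample picks are assumptions for illustration, not the model's actual loss.

```python
from itertools import permutations
from math import inf

# Each pick is (depth_m, dip_deg, azimuth_deg); all values below are illustrative.

def azimuth_gap(a, b):
    """Smallest angular difference in degrees, respecting the 0/360 wraparound."""
    d = abs(a - b) % 360.0
    return min(d, 360.0 - d)

def pair_cost(pred, truth):
    """Hypothetical L1-style cost; the scaling is an assumption, not the model's."""
    return (abs(pred[0] - truth[0])                    # depth error in metres
            + abs(pred[1] - truth[1]) / 90.0           # dip error scaled to [0, 1]
            + azimuth_gap(pred[2], truth[2]) / 180.0)  # azimuth error scaled to [0, 1]

def match(preds, truths):
    """Return (assignment, cost): assignment[t] is the prediction index for truth t.

    Requires len(preds) >= len(truths); surplus predictions stay unmatched.
    """
    best, best_cost = (), inf
    for perm in permutations(range(len(preds)), len(truths)):
        cost = sum(pair_cost(preds[p], truths[t]) for t, p in enumerate(perm))
        if cost < best_cost:
            best, best_cost = perm, cost
    return best, best_cost

truths = [(1000.2, 30.0, 350.0), (1000.5, 10.0, 90.0)]
preds = [(1000.5, 12.0, 95.0), (1003.0, 45.0, 200.0), (1000.2, 28.0, 5.0)]
assignment, total_cost = match(preds, truths)
print(assignment, round(total_cost, 3))
```

The azimuth term wraps around 360 degrees, so a pick at 5 degrees is treated as close to a truth at 350 degrees rather than far from it.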

But the way it had been produced was the way most research code is produced. Datasets lived as folders. Augmentation expanded an original corpus of roughly 900 image-and-ground-truth pairs to more than 55,000 — a 65-fold increase via overlap and geometry-preserving augmentation — and that expansion was a script someone ran, not a tracked transformation. Two engineers running “the same” experiment could not be certain they had the same data underneath. A checkpoint serving interpreters carried no machine-readable answer to the only questions that matter in production: which data version, which code commit, which hyperparameters, and is it still fit for the wells coming in now.
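One way to read "a tracked transformation" is that every augmented sample carries the parameters that produced it. The sketch below assumes two plausible forms of overlap and geometry-preserving augmentation: overlapping depth windows, and a roll of the unrolled borehole image about the borehole axis with the azimuth label shifted to match. Both the choice of transforms and the function names are assumptions for illustration; the programme's actual augmentation script is not described in detail here.

```python
def overlapping_windows(n_rows, window, stride):
    """Start rows of depth windows; windows overlap whenever stride < window."""
    return list(range(0, n_rows - window + 1, stride))

def roll_azimuth(image, azimuth_deg, shift_cols):
    """Rotate an unrolled image about the borehole axis.

    Columns shift right by shift_cols and the azimuth label moves by the same
    angle; depth and dip are unchanged, so the sinusoid's geometry is preserved.
    """
    n_cols = len(image[0])
    k = shift_cols % n_cols
    rolled = [row[-k:] + row[:-k] if k else list(row) for row in image]
    return rolled, (azimuth_deg + 360.0 * k / n_cols) % 360.0

# Toy 2 x 4 "image" with one pick at azimuth 350 degrees (illustrative values).
image = [[0, 1, 2, 3], [4, 5, 6, 7]]
params = {"window": 4, "stride": 2, "shift_cols": 1}   # recorded alongside the output

starts = overlapping_windows(10, params["window"], params["stride"])
rolled, new_azimuth = roll_azimuth(image, 350.0, params["shift_cols"])
print(starts, rolled, new_azimuth)
```

Storing `params` next to the output is what turns the expansion from a script run into something a later run can replay.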

The three questions a research repo cannot answer

Can you reproduce a result from six months ago, byte-for-byte? Can you trace the model now in service back to its exact data version and config? And can the system tell you it has drifted before a human notices the picks degrading? Phase 3 existed to make all three answerable by construction, not by recollection.

The MLOps spine: DVC, Weights & Biases, CI/CD

We built the platform around three load-bearing layers, each chosen to close one of those questions.

Data versioning with DVC. Every dataset became an addressable, content-versioned artefact rather than a directory. Datasets were named with hexadecimal UUIDs and tracked so that a given training run pinned an exact data revision — the specific train/validation/test split and the specific augmentation expansion behind it. This is the layer that makes the 65-fold augmentation reproducible instead of incidental: the 55,000-pair corpus is no longer “whatever the script produced that day” but a named, immutable version any run can reference. DVC sits over the Seafile data server — a 1 TB network backbone with 4 TB of redundant SSD — so large image-log datasets are versioned without bloating the Git history.
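The idea of a content-versioned, UUID-named dataset can be sketched with the standard library alone. This is not how DVC is implemented (DVC does its own hashing and caching, and in practice `dvc add` writes a small pointer file that Git tracks); it only illustrates why a content digest makes "the same data" checkable. SHA-256, the manifest fields, and the file names are assumptions chosen for the example.

```python
import hashlib
import tempfile
import uuid
from pathlib import Path

def dataset_digest(root):
    """Hash every file's relative path and bytes into one hex digest."""
    h = hashlib.sha256()
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        h.update(path.relative_to(root).as_posix().encode())
        h.update(b"\0")
        h.update(path.read_bytes())
        h.update(b"\0")
    return h.hexdigest()

def register(root):
    """Build a manifest entry: a hexadecimal UUID name plus the content digest."""
    return {"name": uuid.uuid4().hex, "digest": dataset_digest(root)}

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "train").mkdir()
    sample = root / "train" / "img_0001.txt"      # stand-in for an image-log file
    sample.write_text("pixels")
    v1 = register(root)
    unchanged = dataset_digest(root)              # same bytes give the same digest
    sample.write_text("pixels, augmented")
    v2 = register(root)                           # changed bytes give a new digest

print(v1["name"], v1["digest"] == unchanged, v1["digest"] != v2["digest"])
```

A training run that records the digest alongside its config has pinned its data: two engineers can compare digests instead of trusting that their folders match.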

Experiment tracking with Weights & Biases. Every run logged its config, metrics, and artefacts to W&B, turning a wall of checkpoint filenames into a queryable record. The tracking grew with the data: the experiment ledger evolved across mid-year milestones as the well count climbed from the original three wells to eight, each new cohort a comparable, dated set of runs rather than an overwrite. The practical payoff is the one engineers feel daily — you can answer “which configuration produced the model we are serving” from the dashboard, and a regression in a metric is attributable to a specific change in data or hyperparameters rather than a mystery.
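What a queryable run record buys can be shown with a small in-memory stand-in. In Weights & Biases the equivalent information would be passed through calls such as `wandb.init(config=...)` and `wandb.log(...)`; the `Run` and `Ledger` classes below are hypothetical, and the run IDs, commit hashes, and metric values are placeholders rather than programme results.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Run:
    run_id: str
    data_version: str            # dataset name or digest pinned for this run
    code_commit: str
    config: dict
    metrics: dict = field(default_factory=dict)

@dataclass
class Ledger:
    runs: dict = field(default_factory=dict)
    serving: Optional[str] = None    # run_id of the model currently in service

    def log(self, run):
        self.runs[run.run_id] = run

    def provenance(self):
        """Answer 'which data, code, and config produced the serving model?'."""
        run = self.runs[self.serving]
        return {"data_version": run.data_version,
                "code_commit": run.code_commit,
                "config": run.config}

    def config_diff(self, a, b):
        """Config keys whose values differ between two runs."""
        ca, cb = self.runs[a].config, self.runs[b].config
        keys = sorted(set(ca) | set(cb))
        return {k: (ca.get(k), cb.get(k)) for k in keys if ca.get(k) != cb.get(k)}

ledger = Ledger()
ledger.log(Run("run-a", "ds-3wells", "abc123",
               {"backbone": "resnet10", "lr": 1e-4}, {"val_loss": 0.5}))
ledger.log(Run("run-b", "ds-8wells", "def456",
               {"backbone": "resnet10", "lr": 5e-5}, {"val_loss": 0.1}))
ledger.serving = "run-b"

print(ledger.provenance())
print(ledger.config_diff("run-a", "run-b"))
```

The diff is the attribution step: when a metric moves between two runs, the changed keys and the changed data version are the candidate causes.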

CI/CD with drift monitoring. The lifecycle closed into a loop: data exploration, feature engineering, experimentation, training, evaluation, a production-ready model, CI/CD to deploy it, monitoring for drift, and a feedback path back into the next dataset version. This is what defines the phase on the programme's maturity ladder — Phase 3 is explicitly the model-rollback-and-serving and production-CI/CD rung, the point at which a model can be shipped, watched, and reverted rather than merely trained. Inference was packaged for serving as a containerised Streamlit application exposed via Docker on port 8501, so interpreters consumed predictions through an app, not a notebook.
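The "shipped, watched, and reverted" rung can be sketched as a promotion gate plus a rollback. The `Registry` class, the held-out loss as the gating metric, and the version labels are assumptions for illustration; the platform's actual CI/CD rules are not spelled out here.

```python
from dataclasses import dataclass, field

@dataclass
class Registry:
    # (version, held-out loss) pairs for every promoted model, oldest first.
    history: list = field(default_factory=list)

    @property
    def live(self):
        """Version currently serving, or None before the first promotion."""
        return self.history[-1][0] if self.history else None

    def promote(self, version, loss, tolerance=0.0):
        """CI gate: ship only if the candidate is no worse than the live model."""
        if self.history and loss > self.history[-1][1] + tolerance:
            return False
        self.history.append((version, loss))
        return True

    def rollback(self):
        """Revert to the previously promoted version."""
        if len(self.history) < 2:
            raise RuntimeError("no earlier version to roll back to")
        self.history.pop()
        return self.live

reg = Registry()
results = [reg.promote("v1", 0.20), reg.promote("v2", 0.15), reg.promote("v3", 0.30)]
live_before = reg.live       # v3 was rejected by the gate, so v2 is serving
restored = reg.rollback()    # an operator reverts to v1
print(results, live_before, restored)
```

Keeping the promotion history is what makes rollback a one-step operation rather than a search through checkpoint folders.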

The drift layer is the one that changes how an asset team experiences the model. Without it, a model degrades silently as new wells arrive from a slightly different distribution, and the staleness is discovered only when the picks stop matching the rock. With monitoring in the loop, the gap between “the model has drifted” and “the model is fresh again” becomes a managed, observable interval rather than an unbounded blind spot.
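A drift check of this kind can be as simple as comparing the distribution of an input feature between the training corpus and incoming wells. The sketch below uses the Population Stability Index, one common choice and not necessarily the monitor used on this platform. The feature (imagine a per-window image statistic), the bin count, and the 0.2 alert threshold (a widely quoted rule of thumb) are all assumptions.

```python
import math
import random

def psi(reference, live, bins=10, eps=1e-6):
    """Population Stability Index of `live` against `reference`.

    Bins are equal-width over the reference range; live values outside that
    range are clamped into the edge bins.
    """
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / bins or 1.0

    def fractions(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1
        # eps keeps empty bins from producing log(0) or division by zero.
        return [max(c / len(values), eps) for c in counts]

    ref, cur = fractions(reference), fractions(live)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref, cur))

rng = random.Random(0)
training = [rng.gauss(0.0, 1.0) for _ in range(5000)]     # reference wells
similar = [rng.gauss(0.0, 1.0) for _ in range(5000)]      # new wells, same distribution
shifted = [rng.gauss(1.0, 1.0) for _ in range(5000)]      # new wells, shifted distribution

ALERT = 0.2   # assumed threshold
print(psi(training, similar) > ALERT, psi(training, shifted) > ALERT)
```

A check like this flags that the inputs have moved; it says nothing about whether the picks are still right, which still needs labelled data.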

Figure: model staleness over a year of operating life, manual retrain queue versus agentic loop. The case study's real argument isn't faster retrains; it's that the gap between ‘model is stale’ and ‘model is fresh again’ is where decisions ran on quietly degrading predictions. Under the manual queue, drift hid for months and a retrain took six-plus weeks, so the model's staleness sawtooth grows long teeth; under the agentic loop, drift surfaces in days and a retrain runs overnight, so the teeth collapse. The orange band is the silent-drift exposure the loop removes: the window in which five of 18 requeued models had drifted into a combined $4.2M of misallocated infill capital. The retrain cycle times (6 weeks → overnight, ~40×), the drift-detection cut (months → days), +22% accuracy, the $4.2M figure and ~40 assets are the case study's own; the week-by-week staleness curve shape, the year-long retrain cadence and the vertical weeks-stale scale are schematic, drawn to argue the gap rather than chart a measured series.

The stack underneath: an on-prem DGX A100

Image-log models are trained from scratch on large augmented corpora, and the operator's constraint was that subsurface data largely stays on-premise. The platform therefore ran on a dedicated on-prem HPC tier rather than rented cloud. The core is an NVIDIA DGX A100 node — 4 to 8 A100 GPUs, up to 640 GB of total GPU memory across 6 NVSwitches, delivering 2.5 to 5 petaFLOPS of AI compute and 5 to 10 petaOPS of INT8 — with a documented scale-out path to a SuperPod of 5 to 10 DGX A100 nodes (25 to 50 PFLOPS AI, 3 to 6 TB of GPU memory, 200 Gb HDR InfiniBand) for when the well count and model count grow.
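The scale-out figures follow from the single-node numbers by simple multiplication, which the short check below makes explicit. It assumes 80 GB A100s (implied by 640 GB across 8 GPUs) and linear scaling of the quoted per-node peak figures across nodes; real throughput depends on interconnect and workload.

```python
GPU_MEM_GB = 80         # assumption: 80 GB A100s, i.e. 640 GB / 8 GPUs
NODE_GPUS = 8           # a fully populated node
NODE_AI_PFLOPS = 5.0    # upper per-node figure quoted above

node_mem_gb = NODE_GPUS * GPU_MEM_GB

def superpod(nodes):
    """Return (AI petaFLOPS, GPU memory in TB) for a pod of full nodes."""
    return nodes * NODE_AI_PFLOPS, nodes * node_mem_gb / 1000

for nodes in (5, 10):
    pflops, mem_tb = superpod(nodes)
    print(f"{nodes} nodes: {pflops:.0f} PFLOPS AI, {mem_tb:.1f} TB GPU memory")
```

The results, 3.2 TB and 6.4 TB, are the unrounded values behind the "3 to 6 TB" quoted for the SuperPod.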

Around the trainer sat the supporting iron: a DGX server with 7.7 TB SSD, 512 GB RAM, and 320 GB GPU RAM; a 4 TB SSD / 128 GB RAM data-management server; and custom development nodes (2 TB SSD / 64 GB RAM / 11 GB GPU and 2 TB SSD / 128 GB RAM / 64 GB GPU). The contrast with where the work started is itself the story of the phase — the legacy development environment was a stack of 1080Ti machines at 8 GB of GPU memory each, adequate for a research prototype and nowhere near adequate for an auditable, retrainable production model on 55,000 image pairs.

The MLOps layer ran on container orchestration over the self-hosted data store, with deployment offered across the spectrum operators actually ask for: containers, bare metal, or VMs. That flexibility mapped directly to a hosting decision the operator had to make, which we costed three ways: operator-only, operator-plus-our-team, and fully managed by us. The economics were concrete, with investment tiers laid out at roughly USD 250–350K, 650–800K, and 1.5–4M depending on the depth of managed service. The operational difference was just as concrete: under an as-is managed model a retrain runs in minutes to hours, an off-prem managed alternative measured 2 to 3 weeks, and a go-it-alone path landed in between. The platform existed to make the first of those the default.

What “operable” bought the interpreters

The point of the engineering was never the engineering. It was that the productised tools riding on this spine — the fracture and vug interpreters, and the Well-to-Well correlation tool targeting 80 wells — could be trusted in a working interpretation loop. With the platform in place, the tools moved interpretation roughly 5x faster, and the correlation tool was specified against a +60% interpreter-productivity and +75% interpretation-accuracy uplift, with target operating points of 95% precision and 90% stratigraphic-correlation success. None of those numbers survive contact with reality unless the model behind them is reproducible, trackable, and monitored — which is exactly what this phase delivered.

The reproducibility also has a human dividend that is easy to miss. A versioned, tracked platform is teachable in a way a pile of notebooks is not, and the programme trained a cohort of 55 young professionals — 15 of them part of an in-country capability-building cohort — on the stack, building local capability rather than a dependency. An auditable pipeline is a transferable one.

What this generalises to

The architecture here is not specific to carbonate image logs. In our work across subsurface AI engagements — with operators in the Middle East and the United States — the same three failures recur: data that cannot be versioned, experiments that cannot be compared, and models that drift undetected in service. The fix is the same spine. DVC (or an equivalent content-versioning layer) makes the data an addressable artefact; an experiment tracker makes runs comparable instead of anecdotal; and CI/CD with drift monitoring turns a model from a frozen deliverable into a maintained one. The hardware tier scales with the corpus, but the discipline is invariant.

The honest limit is the one every on-prem MLOps build hits: the platform is only as auditable as the data flowing into it, and a drift monitor surfaces distribution shift but does not, by itself, supply the new labels to retrain on. The next lever is the same one the model ablations already pointed at — more wells, more geological diversity — now fed through a pipeline that can version, track, and serve them without losing the thread.

From research artefact to operable subsurface-AI platform

Reproducibility is an engineering deliverable, not a property of a good model: DVC-versioned, UUID-named datasets pinned to each run made the 65x-augmented 55,000-pair corpus rerunnable, and Weights & Biases turned a wall of checkpoints into a queryable provenance trail.
CI/CD with drift monitoring closes the loop — Phase 3 is explicitly the model-rollback, production-serving, and drift-aware rung — so a deployed model can be shipped, watched, and reverted rather than silently degrading as new wells arrive.
The on-prem DGX A100 stack (4-8x A100, up to 640 GB GPU memory, 2.5-5 PFLOPS) plus a self-hosted data and orchestration layer made retrains a minutes-to-hours operation and let productised interpreters move ~5x faster on a platform a local cohort of 55 was trained to run.

References

Phase-3 ICT/infrastructure transition documentation and monthly steering decks for a confidential Middle East carbonate engagement; hardware, MLOps-stack, dataset-growth, and tool-uplift figures derived from internal programme records, withheld under operator confidentiality.
NVIDIA DGX A100 system specifications (4–8 A100 GPUs, up to 640 GB GPU memory, 6 NVSwitches; SuperPod scale-out over 200 Gb HDR InfiniBand) — vendor reference for the on-prem HPC tier.
