DVC + Weights & Biases: Making a Subsurface Model Reproducible

A field-tested account of how versioned data and experiment tracking turned a research-grade borehole-geology model into a reproducible one — and why, without that lineage, a 14-well ablation table is just a set of numbers nobody can defend.

By Tarry Singh · 10 min read
EarthScan insight

There is a moment, on every applied-AI engagement that lasts longer than a quarter, when someone asks the question that quietly decides whether the work was science or theatre: can you reproduce the number on slide 14? In our work with a mid-sized Middle East NOC, a carbonate operator, that number was a 14-well ablation table. The programme ran for roughly twenty months and set out to detect fractures and beddings on borehole image logs with a Detection-Transformer-derived model. The table said classification error fell from 93.1% on three wells to 2.5% on fourteen, that dynamic borehole imagery crushed static, and that a from-scratch ResNet-10 beat every deeper backbone. Those numbers carried the entire scientific argument. And they were only worth anything because, by the time we ran them, every dataset and every training run that produced them was versioned, named, and recoverable. This piece is about the boring infrastructure that made that possible: data versioning with DVC and experiment tracking with Weights & Biases. It is the least glamorous part of a subsurface-AI build, and it is the part that decides whether anyone can trust the rest.

The failure mode you don't see coming

The trap is not that researchers are careless. It is that the natural unit of a research codebase — a folder of data and a script that trains on it — has no memory. Six months into a programme you have a model checkpoint that performs well, a directory of preprocessed image patches, and a results figure. Then a reviewer asks a sharp question, you re-run the experiment to answer it, and the number comes out different. Now you are debugging a ghost. Was it a different split? A different augmentation seed? Did the patch-extraction stride change? Did someone re-export the ground-truth picks? You cannot tell, because none of those three things — the data, the code, the run — was pinned to the others.

This is acute in subsurface work for a reason that has nothing to do with deep learning. The data itself is enormous, slow-moving, and constantly re-derived. A single borehole-image log is around 1.5 GB. Raw downhole-log files are converted to image strips, normalised, sliced into overlapping patches, and matched against expert dip picks exported from interpretation software. Every one of those steps is a transformation with parameters, and every parameter is a place where two "identical" datasets can silently diverge. When we grew the working set from roughly 900 image–ground-truth pairs to over 55,000 — a 65× expansion driven by overlapping patches and augmentation — the dataset stopped being something a human could eyeball and started being something only a content hash could verify.
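To make "only a content hash could verify" concrete, here is a minimal sketch of a dataset fingerprint. It is an illustration only: the helper names, the toy three-byte "patches", and the choice of SHA-256 are assumptions made for the example, not the programme's pipeline or DVC's internal scheme.

```python
import hashlib
import tempfile
from pathlib import Path


def file_digest(path: Path, chunk_size: int = 1 << 20) -> bytes:
    """SHA-256 of one file, read in chunks so large logs need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.digest()


def dataset_fingerprint(root: Path) -> str:
    """One hex digest covering every file under root (relative paths and bytes)."""
    files = [p for p in root.rglob("*") if p.is_file()]
    # Sort by relative path so the result does not depend on listing order.
    files.sort(key=lambda p: p.relative_to(root).as_posix())
    digest = hashlib.sha256()
    for path in files:
        digest.update(path.relative_to(root).as_posix().encode("utf-8"))
        digest.update(b"\0")
        digest.update(file_digest(path))
    return digest.hexdigest()


def write_toy_dataset(root: Path, last_pixel: int) -> None:
    """Two tiny stand-in patch files; only the final byte is configurable."""
    root.mkdir(parents=True)
    (root / "patch_0000.bin").write_bytes(bytes([10, 20, 30]))
    (root / "patch_0001.bin").write_bytes(bytes([10, 20, last_pixel]))


with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp)
    write_toy_dataset(base / "a", last_pixel=30)
    write_toy_dataset(base / "b", last_pixel=30)
    write_toy_dataset(base / "c", last_pixel=31)  # a single byte differs
    fp_a = dataset_fingerprint(base / "a")
    fp_b = dataset_fingerprint(base / "b")
    fp_c = dataset_fingerprint(base / "c")

print(fp_a == fp_b)  # identical content gives an identical fingerprint
print(fp_a == fp_c)  # one changed byte gives a different fingerprint
```

Two exports that look identical in a file browser either share a fingerprint or they do not, which is the check a human cannot perform by eye across tens of thousands of patches.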

Versioning the data, not just the code

Git is excellent at versioning code and useless at versioning a 55,000-image dataset. DVC (Data Version Control) exists to close exactly that gap. It keeps a small text pointer in Git, a .dvc file that records a content hash and the path of the tracked data, while the heavy bytes live in object storage configured as a DVC remote. Checking out an old commit checks out the pointer; dvc pull then materialises the exact dataset that commit was trained on. The effect is that a dataset acquires a commit history the same way code does, and a training run can be tied to a precise data state rather than to "whatever was in the folder that week."
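The pointer mechanism can be sketched as a toy content-addressed store. This is a model of the idea, not DVC's implementation: the ToyRemote class, the JSON pointer layout, the SHA-256 hashing, and the commit ids are all assumptions made for the example.

```python
import hashlib
import json


class ToyRemote:
    """Stand-in for object storage: bytes are addressed by their own hash."""

    def __init__(self) -> None:
        self._objects = {}

    def push(self, data: bytes) -> str:
        key = hashlib.sha256(data).hexdigest()
        self._objects[key] = data
        return key

    def pull(self, key: str) -> bytes:
        data = self._objects[key]
        # Re-hash on the way out so a corrupted object cannot pass silently.
        if hashlib.sha256(data).hexdigest() != key:
            raise ValueError("stored object does not match its hash")
        return data


def make_pointer(remote: ToyRemote, path: str, data: bytes) -> str:
    """Push the heavy bytes and return the small text pointer kept in Git."""
    return json.dumps({"path": path, "sha256": remote.push(data)}, sort_keys=True)


def materialise(remote: ToyRemote, pointer_text: str) -> bytes:
    """Resolve a pointer back to the exact bytes it was created from."""
    pointer = json.loads(pointer_text)
    return remote.pull(pointer["sha256"])


remote = ToyRemote()
history = {}  # hypothetical commit id -> pointer text stored in that commit
history["commit-1"] = make_pointer(remote, "data/patches", b"patches v1")
history["commit-2"] = make_pointer(remote, "data/patches", b"patches v2")

old = materialise(remote, history["commit-1"])
new = materialise(remote, history["commit-2"])
print(old, new)
```

Both versions stay retrievable because the store never overwrites an object: a new dataset state gets a new hash, and the old pointer keeps resolving to the old bytes.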

That binding is the whole point. Reproducibility is not a property of data or of code in isolation; it is a property of the pair. DVC makes the pair addressable: one Git SHA now resolves to one code state and one data state, so "re-run experiment X" stops being an act of faith.

The discipline it enforced on us was as valuable as the tooling. Early on, datasets were named the way research output usually is: by whatever was salient that day, sometimes by an opaque hash. That works until you have a dozen of them and need to explain to a client geologist which dataset produced which curve. We ended up running a data-management surface through twelve numbered iterations over the programme, and the naming convention mattered as much as the storage. A dataset name had to encode its provenance (which wells, dynamic or static imagery, what patch stride), because a name that cannot be decoded is a name that will eventually be mistrusted. When we changed the patch stride from 40 to 80 pixels across a roughly 92,000-patch overlapped-and-augmented set, that was a new dataset version, not an edit to the old one. The old one still existed, still pulled, still reproduced its own results.
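A decodable dataset name might look like the sketch below. The name layout (target, well count, imagery type, stride) and the DatasetSpec class are hypothetical, chosen only to show the round trip between a name and its provenance; the actual convention used on the programme is not reproduced here.

```python
import re
from dataclasses import dataclass, replace

# Hypothetical layout: <target>_<N>w_<dynamic|static>_s<stride in px>
_NAME = re.compile(
    r"^(?P<target>[a-z]+)_(?P<wells>\d+)w_(?P<imagery>dynamic|static)_s(?P<stride>\d+)$"
)


@dataclass(frozen=True)
class DatasetSpec:
    target: str      # what is being detected, e.g. "fractures"
    wells: int       # number of wells in the set
    imagery: str     # "dynamic" or "static"
    stride_px: int   # patch-extraction stride in pixels

    def name(self) -> str:
        return f"{self.target}_{self.wells}w_{self.imagery}_s{self.stride_px}"

    @classmethod
    def parse(cls, name: str) -> "DatasetSpec":
        match = _NAME.match(name)
        if match is None:
            raise ValueError(f"dataset name cannot be decoded: {name!r}")
        return cls(
            target=match["target"],
            wells=int(match["wells"]),
            imagery=match["imagery"],
            stride_px=int(match["stride"]),
        )


v1 = DatasetSpec(target="fractures", wells=14, imagery="dynamic", stride_px=40)
v2 = replace(v1, stride_px=80)  # a stride change yields a new spec and a new name

print(v1.name())
print(v2.name())
print(DatasetSpec.parse(v2.name()) == v2)
```

Because the spec is frozen, changing the stride cannot mutate an existing dataset description. It can only produce a second one with its own name, which mirrors the "new version, not an edit" rule.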

Tracking the experiments, not just the best model

Versioned data answers what did I train on. Experiment tracking answers what happened when I did. Weights & Biases was the system of record for the second half. Every training run logged its hyperparameters, its loss curves, its evaluation metrics, and — critically — a pointer back to the data version and the resulting checkpoint. A run was no longer a transient event in someone's terminal; it was a durable, queryable artifact.
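The shape of such a run record can be sketched with nothing more than a local JSON Lines file. Weights & Biases provides this as a hosted service with its own API, which is not shown here; the field names, run ids, dataset labels, the "resnet18" variant, and the zeroed metric values below are placeholders invented for the example, not results from the programme.

```python
import json
import tempfile
from pathlib import Path


def log_run(log_path: Path, run_id: str, config: dict, data_version: str,
            checkpoint: str, metrics: dict) -> None:
    """Append one run as a single JSON line; earlier runs are never overwritten."""
    record = {
        "run_id": run_id,
        "config": config,
        "data_version": data_version,  # pointer back to the pinned dataset
        "checkpoint": checkpoint,      # pointer forward to the saved weights
        "metrics": metrics,
    }
    with log_path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record, sort_keys=True) + "\n")


def load_runs(log_path: Path) -> list:
    with log_path.open(encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


def runs_on(log_path: Path, data_version: str) -> list:
    """Ids of every run trained on the given data version."""
    return [r["run_id"] for r in load_runs(log_path) if r["data_version"] == data_version]


# Hyperparameter values quoted in this piece; everything else is a placeholder.
config = {"lr": 0.0004, "epochs": 250, "backbone": "resnet10"}
placeholder_metrics = {"val_loss": 0.0}  # placeholder, not a measured result

with tempfile.TemporaryDirectory() as tmp:
    log_path = Path(tmp) / "runs.jsonl"
    log_run(log_path, "run-a", config, "dataset-v1", "ckpt-a", placeholder_metrics)
    log_run(log_path, "run-b", config, "dataset-v2", "ckpt-b", placeholder_metrics)
    log_run(log_path, "run-c", {**config, "backbone": "resnet18"},
            "dataset-v2", "ckpt-c", placeholder_metrics)
    on_v2 = runs_on(log_path, "dataset-v2")
    all_runs = load_runs(log_path)

print(on_v2)
```

The useful property is the query at the end: given a data version, every run that touched it can be listed, including the ones that did not win.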

The convention that made this legible was baking the experiment's identity into the checkpoint name itself. A best-model file in this programme looked like 0.0004_250_resnet10_l1_focal_<timestamp> — learning rate 0.0004, 250 epochs, a ResNet-10 backbone, an L1 regression loss on depth/dip/azimuth, and a focal classification loss, with a Unix timestamp pinning the exact run. Read that filename and you have reconstructed the experiment without opening a single config. It is a small thing. It is also the difference between a checkpoint you can defend in a review and one you found in a folder and hope is the right one. Pair it with the W&B run that produced it — same hyperparameters, same logged data version — and the artifact is fully self-describing.
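Reading the filename back into an experiment description can be sketched as follows. The field order comes from the example name in the text; the CheckpointName class and the timestamp value 1700000000 are assumptions for illustration, and the parser assumes no field contains an underscore of its own.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class CheckpointName:
    learning_rate: float
    epochs: int
    backbone: str
    regression_loss: str
    classification_loss: str
    timestamp: int  # Unix seconds pinning the exact run

    @classmethod
    def parse(cls, name: str) -> "CheckpointName":
        parts = name.split("_")
        if len(parts) != 6:
            raise ValueError(f"expected 6 underscore-separated fields, got {len(parts)}")
        lr, epochs, backbone, reg_loss, cls_loss, stamp = parts
        return cls(float(lr), int(epochs), backbone, reg_loss, cls_loss, int(stamp))

    def started_at(self) -> datetime:
        """The run timestamp as a timezone-aware UTC datetime."""
        return datetime.fromtimestamp(self.timestamp, tz=timezone.utc)


# The timestamp here is a made-up stand-in for the elided <timestamp> field.
ckpt = CheckpointName.parse("0.0004_250_resnet10_l1_focal_1700000000")
print(ckpt)
print(ckpt.started_at().isoformat())
```

A name that fails to parse is itself a useful signal: it marks a checkpoint that was saved outside the convention and therefore cannot be tied to a run without further digging.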

[Interactive figure: global supermajor, ~40 producing assets. Retrain cycle time and silent-drift window over a schematic 52-week horizon, comparing a manual queue (drifts for weeks) with an agentic loop (caught in days). Headline figures: ~40× faster model retrain cycle (6 weeks → overnight); +22% production-forecast accuracy across ~40 assets; five of 18 requeued models had drifted into a $4.2M misallocated-capital exposure. The staleness curve and vertical scale are schematic.]
The case study's real argument isn't faster retrains. It is that the gap between ‘model is stale’ and ‘model is fresh again’ is where decisions ran on quietly degrading predictions. Over a year of operating life, under the manual queue, drift hid for months and a retrain took six-plus weeks, so the model's staleness sawtooth grows long teeth; under the agentic loop, drift surfaces in days and a retrain runs overnight, so the teeth collapse. The orange band is the silent-drift exposure the loop removes: the window in which five of 18 requeued models had drifted into a combined $4.2M of misallocated infill capital. The retrain cycle times (6 weeks → overnight, ~40×), the drift-detection cut (months → days), +22% accuracy, the $4.2M figure and ~40 assets are the case study's own; the week-by-week staleness curve shape, the year-long retrain cadence and the vertical weeks-stale scale are schematic, drawn to argue the gap rather than chart a measured series.

The instrument above is from a different, later engagement, but the mechanism it dramatises is the one at stake here. Its argument is that the expensive thing is rarely the retrain — it is the window in which nobody could tell a model had gone stale, because there was no lineage to compare against. Versioned data and tracked experiments are what collapse that window: when every run is pinned to a data state, you can see, the moment a new well lands, exactly what changed and whether the model genuinely improved or merely moved. Without the lineage, "the model is better now" is an assertion. With it, it is a diff.

What lineage let us actually prove

The payoff was not abstract. The model's quality came in observable steps, and we could attribute each step to a specific change because the data and runs were versioned. The W&B record traced a clean evolution across three milestones — call them the July, September, and November checkpoints — as the training set moved from an early three-well dynamic-imagery dataset to an eight-well one, and the classification, depth, dip, and azimuth errors fell in lockstep. Each milestone was a tracked run against a pinned dataset, so the improvement was a comparison between two known states, not a vibe.

That lineage is also what let us trust the ablations that carry the scientific claim. The well-count sweep is the cleanest example: classification error of 93.1% at three wells, 18.4% at six, 1.06% at nine, 0.82% at eleven, and 2.54% at the full fourteen-well fractures dataset. A curve like that — steep, non-monotonic at the tail — is exactly the kind of result a sceptical reviewer probes. The only honest way to defend it is to be able to re-run any single point on demand, with the precise data and code that produced it. We could, because each point was a tracked run over a versioned dataset. A subtler win came from a quieter comparison: moving from eight wells to eleven improved the mean absolute error on depth, dip, and azimuth by only about 0.007 in normalised units. That is a small enough delta that, without pinned data states on both sides, you would have no way to know whether it was a real gain or measurement noise. Lineage is what turns a 0.007 into a defensible finding instead of a rounding error.
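The "steep, non-monotonic at the tail" reading of the well-count sweep can be checked mechanically. The sketch below uses only the five error values quoted above; the variable names are illustrative, and it says nothing about why the error rises at fourteen wells.

```python
# Classification error (%) by number of training wells, as quoted in the text.
sweep = {3: 93.1, 6: 18.4, 9: 1.06, 11: 0.82, 14: 2.54}

points = sorted(sweep.items())

# Consecutive pairs of well counts where the error went up instead of down.
rises = [
    (wells_before, wells_after)
    for (wells_before, err_before), (wells_after, err_after) in zip(points, points[1:])
    if err_after > err_before
]

# Well count with the lowest reported error.
best_wells = min(sweep, key=sweep.get)

print(rises)
print(best_wells)
```

The only step that goes the wrong way is the last one, from eleven to fourteen wells, which is the point a reviewer would ask to see re-run from its pinned data and code.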

The engineering reading

It is tempting to file all of this under "good hygiene" and move on. That undersells it. For an applied-AI programme — and this was a deep-learning, computer-vision build with a from-scratch ResNet-10 feature extractor feeding a transformer detector, trained on bespoke geomatics-grade image pipelines — data versioning and experiment tracking are not hygiene; they are the substrate the science runs on. The architecture, the augmentation policy, the loss design, the backbone sweep: every one of those decisions was validated by a comparison, and a comparison is only valid if both sides of it are recoverable. DVC made the data side recoverable. Weights & Biases made the run side recoverable. Together they made the comparison — which is to say, the result — trustworthy.

The practical advice is unglamorous and load-bearing. Version your data with the same seriousness you version your code, and bind the two so one commit resolves to both. Track every run, not just the winner, and make the artifact self-describing — encode the experiment in the checkpoint name and pin it to its data version. Adopt a dataset-naming convention that survives a year and a dozen iterations. Do this from the first week, not the week before the model goes to production, because the lineage you did not capture is gone, and the number on slide 14 is only as good as your ability to reproduce it on demand. We have built these pipelines for operators across the Middle East and the United States; the data, the geology, and the model change from one to the next, but this rule does not.

Key takeaways

  1. Reproducibility is a property of the (code, data) pair, not of either alone. Git versions code and is useless on a 55,000-image, 65×-augmented dataset; DVC pins the heavy data to a content hash so one commit resolves to one exact code-and-data state.
  2. In subsurface work the data is the volatile part: a single borehole-image log is ~1.5 GB, and every preprocessing step (log-to-image conversion, normalisation, patch stride, augmentation) is a place two 'identical' datasets silently diverge. Changing patch stride 40→80 px is a new dataset version, not an edit.
  3. Track every run, not just the best model. Weights & Biases logs hyperparameters, losses, metrics, and a pointer to the data version; baking the experiment into the checkpoint name (e.g. 0.0004_250_resnet10_l1_focal_<timestamp>) makes the artifact self-describing.
  4. Lineage is what makes a comparison defensible. The W&B milestone trail (3-well → 8-well dataset across three checkpoints) and the well-count ablation (93.1% → 18.4% → 1.06% → 0.82% → 2.54% class error from 3 to 14 wells) are only trustworthy because every point can be re-run from its pinned data and code.
  5. Set up versioning and tracking in week one. The 0.007-MAE gain from 8→11 wells is only distinguishable from noise because both data states were pinned. Uncaptured lineage cannot be reconstructed after the fact — and an unreproducible ablation table is just a slide nobody can defend.

© 2026 Copyright. Earthscan