Wiring CI/CD and Code Review Into a Research ML Repository

The model worked. That was exactly the problem. By the time a Texas onshore operator chose the accelerated track for their raster well-log digitization programme, VeerNet, the curve-reading model we had built, could already take a scanned paper log and hand back a depth-indexed curve, and we had a notebook that demonstrated it live. What we did not have was any way to keep six engineers from breaking that notebook by lunchtime. A research repository that one person edits is a private workshop. The same repository with six people pushing to it on a 16-week clock is a shared production line with no quality gates, and a single bad merge can take the one artefact the operator actually sees, the running demo, down with it.

At a glance

The accelerated track put six full-time engineers on the programme for sixteen weeks at a 180,000 EUR contract value. None of that headcount buys anything if the team spends its velocity re-breaking and re-fixing the same demo.

6 FTE · 16 weeks · 180,000 EUR envelope: the accelerated track we had to keep stable

6 gates (review, lint, CI, demo guard, deploy, tag): hygiene controls retrofitted onto a live repo

0 broken demos (the only artefact the operator saw): what the green-demo merge guard protected

The repository we inherited from ourselves

VeerNet started life the way most research models do: as code written to answer a question, not to be operated. Datasets lived in working directories, the demo ran from whatever was checked out, and the only person who could reliably reproduce a result was whoever had produced it. That is the normal and even correct shape for the exploration phase, when the cost of process is paid every day and the payoff is uncertain. The literature on machine-learning systems names this debt precisely: the model code is the small visible part, and the surrounding plumbing of data handling, configuration and reproducibility is where the hidden cost accumulates [2].

For a single researcher that debt is bearable because the blast radius is one person. The accelerated track changed the arithmetic overnight. Six engineers working in parallel meant six chances a day to push something that imported cleanly on one machine and not another, that retrained against a slightly different dataset version, or that quietly changed the demo's behaviour. The exploration-phase repository was about to be asked to behave like a delivery-phase one, and it had none of the controls that make that safe.

What we decided to hold fixed

We set ourselves one non-negotiable and one constraint. The non-negotiable was that the live demo must never be broken by a merge, because the demo was the operator's only direct window onto the work and a black screen on a review call costs more trust than a week of progress earns. The constraint was that we could not stop to rebuild the repository properly first. The clock had started; the science was already in flight. Whatever hygiene we installed had to go in around running work, without a freeze, and had to make the team faster on balance rather than slower.

That ruled out the tempting move of pausing to do it right. The lesson we leaned on is one that Google's engineering practice states plainly: process exists to let a codebase survive being touched by many people over time, and the cheapest moment to install it is before the team is large, not after a failure forces it [1]. We were slightly late by that measure, so the work became a retrofit under load rather than a clean-room setup.

Installing the gates while the line was running

We added the controls in the order that bought the most safety per unit of disruption, not in the order a textbook lists them.

The first gate was the cheapest and the highest-leverage: pull-request review. We turned off direct pushes to the main branch and required every change to arrive as a reviewed pull request. This single rule converted six private workshops back into one shared codebase, because now no change reached main without a second pair of eyes, and the trunk-based discipline of integrating small changes frequently kept those reviews fast rather than turning into week-long merge epics [3]. The second gate was a lint and format check that ran on every push, which sounds cosmetic and is not: it moved the entire category of style argument out of human review, so reviewers spent their attention on whether the logic was right instead of where the commas went.
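The rule these two gates enforce is small enough to write down. In practice it is configured in the hosting platform's branch protection rather than in project code, so the following is only a sketch of the logic, assuming a hypothetical `Change` record and a hypothetical check named `lint`; none of these names come from the VeerNet repository.

```python
from dataclasses import dataclass, field

PROTECTED_BRANCH = "main"


@dataclass
class Change:
    """A proposed change, reduced to the facts the first two gates look at.

    Hypothetical model for illustration only.
    """
    target_branch: str
    via_pull_request: bool
    approvals: int = 0
    checks: dict = field(default_factory=dict)  # check name -> True if green


def merge_blockers(change, required_checks=("lint",), min_approvals=1):
    """Return the reasons a change may not land; an empty list means it may."""
    blockers = []
    # Gate 1a: nothing reaches the protected branch except through a pull request.
    if change.target_branch == PROTECTED_BRANCH and not change.via_pull_request:
        blockers.append("direct pushes to main are disabled")
    # Gate 1b: a second pair of eyes is mandatory.
    if change.approvals < min_approvals:
        blockers.append(
            f"needs {min_approvals} approving review(s), has {change.approvals}"
        )
    # Gate 2: every required automated check must be present and green.
    for name in required_checks:
        if not change.checks.get(name, False):
            blockers.append(f"required check '{name}' is not green")
    return blockers


reviewed = Change("main", via_pull_request=True, approvals=1, checks={"lint": True})
direct = Change("main", via_pull_request=False)
print(merge_blockers(reviewed))  # []
print(merge_blockers(direct))
```

Treating a missing check as a failure is deliberate: a check that never ran should block a merge just as a red one does. Later gates fit the same shape by adding their names to `required_checks`.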

Then came the part that actually protects a research repo: a continuous-integration pipeline that ran the test suite on every pull request and blocked merge on a red result. For a model codebase the most valuable tests are not elaborate; an import-smoke test that simply confirms every module loads catches the broken refactor before it reaches anyone, and a handful of shape-and-range checks on the inference path catch the silent ones.
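As an illustration, the two kinds of test can be sketched with the standard library alone. The function names, the sample count and the value range below are assumptions made for the example, not details of the VeerNet test suite; a real suite would wrap these helpers in whatever test runner the project uses.

```python
import importlib
import math
import pkgutil


def import_failures(package_name):
    """Import a package and every submodule; return [(module_name, error), ...]."""
    try:
        package = importlib.import_module(package_name)
    except Exception as exc:
        return [(package_name, repr(exc))]

    failures = []
    search_path = getattr(package, "__path__", None)
    if search_path is None:
        return failures  # a single-file module has no submodules to walk

    # onerror is supplied so a broken subpackage is recorded here
    # instead of aborting the walk.
    walker = pkgutil.walk_packages(
        search_path, prefix=package.__name__ + ".", onerror=lambda name: None
    )
    for info in walker:
        try:
            importlib.import_module(info.name)
        except Exception as exc:
            failures.append((info.name, repr(exc)))
    return failures


def curve_problems(depths, values, expected_len, value_range):
    """Shape-and-range checks on one inference result; empty list means healthy."""
    lo, hi = value_range
    problems = []
    if len(depths) != expected_len or len(values) != expected_len:
        problems.append(
            f"expected {expected_len} samples, "
            f"got {len(depths)} depths and {len(values)} values"
        )
    if any(not math.isfinite(v) for v in values):
        problems.append("curve contains NaN or infinite values")
    elif any(not lo <= v <= hi for v in values):
        problems.append(f"curve values fall outside [{lo}, {hi}]")
    if any(b <= a for a, b in zip(depths, depths[1:])):
        problems.append("depth index is not strictly increasing")
    return problems
```

A CI test then reduces to asserting that `import_failures("your_package")` is empty and that `curve_problems(...)` is empty for the output of the inference path on a fixed input. Returning the full list of problems, rather than stopping at the first, keeps the red build readable.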

The gate the whole effort was built around came fourth. We made a scripted run of the demo notebook a required check: before any pull request could merge, continuous integration executed the demo end to end against a fixed sample scan and confirmed it still produced a curve. This is the control that let the team push fast. Velocity is only safe when the thing you are protecting is checked automatically, and once the demo proved itself green on every merge, no one had to remember to test it by hand and no reviewer had to take a contributor's word that it still worked.
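A guard of this kind can be sketched as a small script that CI runs as a required check. The sketch assumes the demo can be executed headlessly by some command (for a notebook, `jupyter nbconvert --to notebook --execute` is one common way) and that it writes its curve to a JSON file with `depth` and `value` lists. The file name, the JSON layout and the function name are all hypothetical; the article does not describe how the real check was wired.

```python
import json
import subprocess
from pathlib import Path


def run_demo_guard(command, curve_path, min_points=2, timeout_s=600):
    """Run the demo command and verify it produced a plausible curve.

    Returns (ok, reason) so a CI wrapper can print the reason and exit non-zero.
    The JSON layout {"depth": [...], "value": [...]} is an assumed format.
    """
    path = Path(curve_path)
    # Remove any stale output so an old curve cannot make a broken demo look green.
    path.unlink(missing_ok=True)

    try:
        result = subprocess.run(
            command, capture_output=True, text=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired:
        return False, f"demo did not finish within {timeout_s}s"
    if result.returncode != 0:
        return False, f"demo exited with code {result.returncode}"

    if not path.is_file():
        return False, "demo finished but wrote no curve file"
    try:
        curve = json.loads(path.read_text())
        depths, values = list(curve["depth"]), list(curve["value"])
    except (ValueError, KeyError, TypeError):
        return False, "curve file is not in the expected format"

    if len(depths) < min_points or len(depths) != len(values):
        return False, "curve is empty or depth and value lengths differ"
    if any(b <= a for a, b in zip(depths, depths[1:])):
        return False, "depth index is not strictly increasing"
    return True, "ok"


# Example wiring (hypothetical notebook and output names):
#   ok, reason = run_demo_guard(
#       ["jupyter", "nbconvert", "--to", "notebook", "--execute", "demo.ipynb"],
#       "demo_curve.json",
#   )
#   raise SystemExit(0 if ok else reason)
```

Deleting the previous output before the run matters more than it looks: without it, a demo that silently stops writing its curve would keep passing on the strength of an old file.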

Reading the gates against the burn

The instrument below lays the six gates over the track's real burn profile. The horizontal axis is the sixteen-week calendar; the teal area is the weekly burn, anchored on the two months for which we have hard figures, November at 88,400 EUR and December at 111,220 EUR, the stretch where all six engineers were at full velocity. Each flag is a gate, placed at the point in the track where we installed it. Drag the week marker and you can see which controls were live as the spend climbed, and the reason the sequencing matters: the heaviest, most parallel, most dangerous weeks for the demo are exactly the ones in which the most people were pushing the most code, and those are the weeks the green-demo guard had to already be in place to cover.

A maturity-gate timeline for the 16-week accelerated track (6 FTE, 180,000 EUR). The horizontal axis is the delivery calendar in weeks; the teal area underneath is the team's weekly burn, anchored on the two sourced months (November 88,400 EUR and December 111,220 EUR, the heavy stretch where all six engineers ran at full velocity). On top of the burn, each flag is an engineering-hygiene gate retrofitted onto the research repository: pull-request review, a lint and format gate, a CI test pipeline, the green-demo merge guard (orange), deploy-on-merge, and a tagged handover release. Drag the week marker to see which gates were live by that point and which were still ahead, and select a flag to read what it enforced. The track envelope and the two monthly burn figures are the engagement's own; the gate-install weeks are an ordinal sequencing of the hardening work, and the ramp and tail burn are tapered to the two sourced months.

The shape the instrument makes is the argument. The gates that cost the least to install, review and linting, went in during the low-burn ramp when slowing down was cheap. The demo guard landed before the December peak, so that by the time burn and parallelism were highest the most consequential check was already automatic. Deploy-on-merge followed, so that a successful merge rebuilt and shipped the demo image and the deployed demo could never drift from what was on main. The last gate turned the handover itself into a tagged, reproducible release rather than a snapshot of someone's working copy on the final Friday.

What buying the controls actually returned

The honest accounting is that none of this made the model more accurate. It made the team faster and the operator's experience stable, which on an accelerated track is worth more. Before the gates, a broken demo was a tax paid in the most expensive currency we had: a contributor's afternoon spent bisecting which of six recent merges took it down, plus the credibility cost if the break happened to surface on a review call. After the gates, that failure mode was simply removed from the board, because a change that broke the demo could not merge in the first place.

From research workshop to a repo six people can push to safely

Before: direct pushes, manual demo checks. Six engineers pushing to main; the live demo tested by whoever remembered; a bad merge surfaces as a black screen on a review call and costs an afternoon to bisect.

After: reviewed PRs + CI + green-demo guard + deploy-on-merge. No change reaches main unreviewed or with a red pipeline; the demo is executed automatically before every merge; a merge rebuilds and ships the demo image.

The break-the-demo failure mode removed entirely; review attention freed for logic, not formatting; handover is a tagged reproducible release.

There is a quieter return that only showed up at handover. Because deploy and release were automated and every result behind the demo had passed through a reviewed pull request with a green pipeline, the operator received not just a working model but a repository they could keep pushing to without us. The hygiene we installed for our own six engineers became the thing that let the next team inherit the work safely, which is the difference between handing over an asset and handing over a liability with good intentions attached.

The discipline this engagement taught us

The instinct on an accelerated track is to treat process as the enemy of speed and to defer it until the work is done. We learned the opposite on this repository. On a short clock with a shared codebase, the controls are not what slow you down; the broken demos and the unreproducible results are, and the controls are how you stop paying that tax. The retrofit cost us a few days spread across the ramp; it bought us a demo that never went dark and a handover that did not depend on anyone's memory. We would not run a six-person sprint against a research repo any other way, and the order we found (cheap reviews first, the demo guard before the peak) is the order we now reach for by default.
