The model worked. That was exactly the problem. By the time a Texas onshore operator chose the accelerated track for their raster well-log digitization programme, VeerNet, the curve-reading model we had built, could already take a scanned paper log and hand back a depth-indexed curve, and we had a notebook that demonstrated it live. What we did not have was any way to keep six engineers from breaking that notebook by lunchtime. A research repository that one person edits is a private workshop. The same repository with six people pushing to it on a 16-week clock is a shared production line with no quality gates, and a single bad merge can take the one artefact the operator actually sees, the running demo, down with it.
At a glance
The accelerated track put six full-time engineers on the programme for sixteen weeks at a 180,000 EUR contract value. None of that headcount buys anything if the team spends its velocity re-breaking and re-fixing the same demo.
Accelerated track we had to keep stable
Hygiene controls retrofitted onto a live repo
What the green-demo merge guard protected
The repository we inherited from ourselves
VeerNet started life the way most research models do: as code written to answer a question, not to be operated. Datasets lived in working directories, the demo ran from whatever was checked out, and the only person who could reliably reproduce a result was whoever had produced it. That is the normal and even correct shape for the exploration phase, when the cost of process is paid every day and the payoff is uncertain. The literature on machine-learning systems names this debt precisely: the model code is the small visible part, and the surrounding plumbing of data handling, configuration and reproducibility is where the hidden cost accumulates [2].
For a single researcher that debt is bearable because the blast radius is one person. The accelerated track changed the arithmetic overnight. Six engineers working in parallel meant six chances a day to push something that imported cleanly on one machine and not another, that retrained against a slightly different dataset version, or that quietly changed the demo's behaviour. The exploration-phase repository was about to be asked to behave like a delivery-phase one, and it had none of the controls that makes that safe.
What we decided to hold fixed
We set ourselves one non-negotiable and one constraint. The non-negotiable was that the live demo must never be broken by a merge, because the demo was the operator's only direct window onto the work and a black screen on a review call costs more trust than a week of progress earns. The constraint was that we could not stop to rebuild the repository properly first. The clock had started; the science was already in flight. Whatever hygiene we installed had to go in around running work, without a freeze, and had to make the team faster on balance rather than slower.
That ruled out the tempting move of pausing to do it right. The lesson we leaned on is the one the Google engineering practice states plainly: process exists to let a codebase survive being touched by many people over time, and the cheapest moment to install it is before the team is large, not after a failure forces it [1]. We were slightly late by that measure, so the work became a retrofit under load rather than a clean-room setup.
Installing the gates while the line was running
We added the controls in the order that bought the most safety per unit of disruption, not in the order a textbook lists them.
The first gate was the cheapest and the highest-leverage: pull-request review. We turned off direct pushes to the main branch and required every change to arrive as a reviewed pull request. This single rule converted six private workshops back into one shared codebase, because now no change reached main without a second pair of eyes, and the trunk-based discipline of integrating small changes frequently kept those reviews fast rather than turning into week-long merge epics [3]. The second gate was a lint and format check that ran on every push, which sounds cosmetic and is not: it moved the entire category of style argument out of human review, so reviewers spent their attention on whether the logic was right instead of where the commas went.
Then came the part that actually protects a research repo: a continuous-integration pipeline that ran the test suite on every pull request and blocked merge on a red result. For a model codebase the most valuable tests are not elaborate; an import-smoke test that simply confirms every module loads catches the broken refactor before it reaches anyone, and a handful of shape-and-range checks on the inference path catch the silent ones.
The gate the whole effort was built around came fourth. We made a scripted run of the demo notebook a required check: before any pull request could merge, continuous integration executed the demo end to end against a fixed sample scan and confirmed it still produced a curve. This is the control that let the team push fast. Velocity is only safe when the thing you are protecting is checked automatically, and once the demo proved itself green on every merge, no one had to remember to test it by hand and no reviewer had to take a contributor's word that it still worked.
Reading the gates against the burn
The instrument below lays the six gates over the track's real burn profile. The horizontal axis is the sixteen-week calendar; the teal area is the weekly burn, anchored on the two months for which we have hard figures, November at 88,400 EUR and December at 111,220 EUR, the stretch where all six engineers were at full velocity. Each flag is a gate, placed at the point in the track where we installed it. Drag the week marker and you can see which controls were live as the spend climbed, and the reason the sequencing matters: the heaviest, most parallel, most dangerous weeks for the demo are exactly the ones in which the most people were pushing the most code, and those are the weeks the green-demo guard had to already be in place to cover.
The shape the instrument makes is the argument. The gates that cost the least to install, review and linting, went in during the low-burn ramp when slowing down was cheap. The demo guard landed before the December peak, so that by the time burn and parallelism were highest the most consequential check was already automatic. Deploy-on-merge followed, so that a successful merge rebuilt and shipped the demo image and the deployed demo could never drift from what was on main. The last gate turned the handover itself into a tagged, reproducible release rather than a snapshot of someone's working copy on the final Friday.
What buying the controls actually returned
The honest accounting is that none of this made the model more accurate. It made the team faster and the operator's experience stable, which on an accelerated track is worth more. Before the gates, a broken demo was a tax paid in the most expensive currency we had: a contributor's afternoon spent bisecting which of six recent merges took it down, plus the credibility cost if the break happened to surface on a review call. After the gates, that failure mode was simply removed from the board, because a change that broke the demo could not merge in the first place.
Before
Direct pushes, manual demo checks
Six engineers pushing to main; the live demo tested by whoever remembered; a bad merge surfaces as a black screen on a review call and costs an afternoon to bisect
After
Reviewed PRs + CI + green-demo guard + deploy-on-merge
No change reaches main unreviewed or with a red pipeline; the demo is executed automatically before every merge; a merge rebuilds and ships the demo image
The break-the-demo failure mode removed entirely; review attention freed for logic, not formatting; handover is a tagged reproducible release
There is a quieter return that only showed up at handover. Because deploy and release were automated and every result behind the demo had passed through a reviewed pull request with a green pipeline, the operator received not just a working model but a repository they could keep pushing to without us. The hygiene we installed for our own six engineers became the thing that let the next team inherit the work safely, which is the difference between handing over an asset and handing over a liability with good intentions attached.
The discipline this engagement taught us
The instinct on an accelerated track is to treat process as the enemy of speed, and to defer it until the work is done. We learned the opposite on this repository: on a short clock with a shared codebase, the controls are not what slows you down, the broken demos and the un-reproducible results are, and the controls are how you stop paying that tax. The retrofit cost us a few days spread across the ramp; it bought us a demo that never went dark and a handover that did not depend on anyone's memory. We would not run a six-person sprint against a research repo any other way again, and the order we found, cheap reviews first, the demo guard before the peak, is the order we now reach for by default.