Reproducibility and Experiment Tracking in a Small Research-ML Team

Abstract

A machine-learning result is only as trustworthy as the team's ability to regenerate it, and the field has known for years that this ability is rarer than it looks. This paper surveys the public literature and toolchain of reproducibility, then reports how a small research team applied it in practice. We credit the audit work that quantified the gap [1], the conference-scale program that turned reproducibility into a reviewed checklist [2], the experiment-tracking infrastructure that captures a run's full context [3], the variance accounting that shows why one seeded run can mislead [4], and the systems analysis that named the hidden maintenance costs a research repository accrues [5]. Against that backdrop we describe our own discipline on a raster well-log digitisation project: a five-loss-function ablation matrix (Dice, Focal, Lovasz, Soft CE, Tversky) run on a single shared corpus version, with seeds, the 15,000-instance multiclass dataset, the 80/20 train-validation split, the 50-epoch budget, and the per-run provenance all pinned so that a logged curve-1 R-squared of 0.9891 could be reproduced from the record alone months later. The finding is that the public reproducibility playbook scales down cleanly to a team of a few engineers, but only if provenance is captured at the moment of the run rather than reconstructed afterward, and that for a comparison across five losses the single most valuable artifact is the ledger that pins each run to its loss, its data version, its budget, and its headline metric. The practice and its analysis are what we contribute; the cited works and tools are their authors', and VeerNet, our digitiser, is ours.

The reproducibility problem and how the field framed it

Reproducibility is the property that a result can be obtained again from the same conditions. For a learning system that means more than re-running a script. It means knowing the exact data the model saw, the exact code that trained it, the random seeds that fixed the otherwise stochastic path through the loss surface, the hyperparameters and epoch budget, and the metric definition the headline number was computed under. Drop any one of these and the number floats free of the experiment that produced it. We approach this as practitioners reading a field that other groups built, and we credit each contribution to its authors rather than claiming the framing as ours.

The audit that put numbers on the gap is Gundersen and Kjensmo's survey of empirical artificial-intelligence papers at major venues [1]. They scored papers against a set of reproducibility factors, the kind of documentation a reader would need to regenerate the work, and found that very few papers recorded enough to clear even a modest bar. Their contribution to our reading is the graded definition: reproducibility is not one binary property but a ladder, from being able to repeat an experiment with the original code and data, through reproducing it from a description, to replicating the finding under independent conditions. A small team almost never reaches the top of that ladder, but it can and should reach the bottom rungs reliably, and the survey is what tells us which rungs those are.

The community then turned the diagnosis into procedure. The NeurIPS 2019 reproducibility program, reported by Pineau and colleagues [2], built a machine-learning reproducibility checklist into the submission and review process, paired it with a code-submission policy, and ran a parallel reproducibility challenge in which volunteers tried to regenerate accepted papers. The lasting artifact for a practitioner is the checklist itself: a concrete enumeration of the things a paper must state, from dataset splits and the number of training-and-evaluation runs to the exact compute and the measure of central tendency reported. We treat that checklist as a specification a research repository should satisfy by construction rather than a box ticked at submission time.

Where the checklist says what to record, the tooling literature says how. Sacred, by Greff and colleagues [3], is the infrastructure example we lean on conceptually: it captures a run's configuration, its source, its randomness, and its output metrics into a stored record keyed to the run, so that any logged result carries its own provenance and can be queried later. The design lesson is that provenance has to be captured at run time, by the harness, not reconstructed from memory or from a half-remembered command line weeks later. A run that was not recorded as it happened is, for practical purposes, not reproducible, however good the intentions around it.
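The run-time capture principle can be sketched without any particular tracking library. The block below is a minimal illustration under stated assumptions, not Sacred's API: `tracked_run`, the record layout, and the file naming are hypothetical names introduced here, and the harness writes the configuration, a hash of the training source, and the returned metrics in the same call that runs the training function.

```python
import hashlib
import json
import time
from pathlib import Path


def tracked_run(train_fn, config: dict, source_file: str, log_dir: str) -> dict:
    """Run train_fn(config) and write its provenance record in the same call.

    Hypothetical harness for illustration; this is not Sacred's interface.
    """
    started_at = time.time()
    source_bytes = Path(source_file).read_bytes()
    record = {
        "config": config,
        # Hash of the training script, so the record is tied to the code.
        "source_sha256": hashlib.sha256(source_bytes).hexdigest(),
        "started_at": started_at,
    }
    # The harness, not the engineer, attaches the metrics to the record.
    record["metrics"] = train_fn(config)

    out_dir = Path(log_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    config_json = json.dumps(config, sort_keys=True).encode("utf-8")
    config_hash = hashlib.sha256(config_json).hexdigest()[:12]
    out_path = out_dir / f"run_{config_hash}_{int(started_at)}.json"
    out_path.write_text(json.dumps(record, indent=2, sort_keys=True))
    return record
```

The design choice that matters is that nothing in the record depends on someone remembering to write it down afterward: if the training function ran through this harness, the record exists.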

Two further works frame why the discipline matters even when nothing is obviously broken. Bouthillier and colleagues show that a benchmark result carries real variance from data sampling, initialisation, and hyperparameter choice, and that a single seeded run can therefore misrepresent a method's true standing [4]. This bears directly on a loss-function ablation: if five losses are compared on one seed each, the ranking partly reflects seed luck, so the seeds must at minimum be fixed and logged, and the comparison should be read as indicative rather than as a significance test unless several seeds per loss are run. Sculley and colleagues, in turn, name the slow tax that an unmaintained research codebase pays: the entanglement, the undeclared data dependencies, and the configuration debt that make last quarter's result unregenerable not through any single mistake but through accumulated drift [5]. Reproducibility discipline is, in their terms, the interest payment that keeps that debt from compounding.
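To make the seed-luck concern concrete, the sketch below summarises per-seed scores for two methods and asks whether the gap between their means is larger than the seed-to-seed spread. It is a crude illustrative heuristic, not the procedure of Bouthillier and colleagues and not a significance test; the function names and the multiplier `k` are assumptions introduced here.

```python
import statistics


def summarise(scores: list[float]) -> tuple[float, float]:
    """Mean and sample standard deviation of one method's per-seed scores."""
    if len(scores) < 2:
        # One seed gives a number but no estimate of its variance.
        raise ValueError("need at least two seeds to estimate variance")
    return statistics.mean(scores), statistics.stdev(scores)


def gap_exceeds_noise(scores_a: list[float], scores_b: list[float],
                      k: float = 2.0) -> bool:
    """True if the mean gap is larger than k times the larger seed spread.

    A rough screen only; k is an arbitrary illustrative threshold.
    """
    mean_a, std_a = summarise(scores_a)
    mean_b, std_b = summarise(scores_b)
    return abs(mean_a - mean_b) > k * max(std_a, std_b)
```

With one seed per method, `summarise` refuses to answer, which is the point: a single run yields a ranking but no way to tell how much of it is noise.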

How we recorded our own runs

Our method is a practice report rather than a controlled study, so we state precisely what we did and which numbers are the engagement's recorded figures. The setting is the training of VeerNet, our encoder-decoder for raster well-log digitisation. The scientific question for this paper is narrow and concrete: across five segmentation loss functions, which one best recovers the plotted curves, and can the winning run be regenerated from its record months after it was logged?

The experiment was a single-variable ablation by design. We held the corpus fixed at one synthetic dataset version, 15,000 instances for the multiclass setting with three output classes (background plus two curves), and held the data partition fixed at an 80/20 train-validation split, so the same training and validation examples were seen by every run. We held the optimisation budget fixed at 50 epochs. The only thing we varied across the five runs was the loss function: Dice, Focal, Lovasz, Soft CE, and Tversky. Fixing everything but the loss is what makes any difference in the headline metric attributable to the loss rather than to a confound, and it is the cheapest possible nod to the variance concern Bouthillier and colleagues raise [4], because at least the data-sampling and split sources of variance are removed by construction even though we did not run multiple seeds per loss.
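The single-variable design described above can be sketched as a base configuration from which only the loss is varied, plus a seeded split that every run shares. This is a minimal illustration rather than our training code: `RunConfig`, the lower-case loss identifiers, and the default seed value are hypothetical, and the source states only that the seed was fixed, not what it was.

```python
import random
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class RunConfig:
    loss: str
    dataset_version: str       # hypothetical label; the version name is not given
    num_instances: int = 15_000
    train_percent: int = 80    # 80/20 train-validation split
    epochs: int = 50
    seed: int = 0              # hypothetical value; only "fixed" is stated


LOSSES = ("dice", "focal", "lovasz", "soft_ce", "tversky")


def ablation_matrix(base: RunConfig) -> list[RunConfig]:
    """Five configurations that differ from the base in the loss and nothing else."""
    return [replace(base, loss=loss) for loss in LOSSES]


def fixed_split(cfg: RunConfig) -> tuple[list[int], list[int]]:
    """Deterministic train/validation index split, independent of the loss."""
    indices = list(range(cfg.num_instances))
    random.Random(cfg.seed).shuffle(indices)
    cut = cfg.num_instances * cfg.train_percent // 100
    return indices[:cut], indices[cut:]
```

Because the split depends only on the seed and the instance count, all five runs see the same 12,000 training and 3,000 validation indices.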

For each run we recorded a provenance entry with four keys: the loss function, the dataset version, the epoch budget, and the headline validation metric, alongside the fixed seed and the split. The headline metric we tracked is the per-curve coefficient of determination, the R-squared between the extracted trace and ground truth at the validated depth points, because for a digitiser the question that matters is how faithfully the recovered curve tracks the real one. The recorded peak was a curve-1 R-squared of 0.9891 under the Tversky loss, with a peak validation F1 of 0.55 on the recall-critical curve masks. The provenance entry is the artifact that lets that 0.9891 be regenerated: it names the exact loss, the exact corpus version and split, and the exact budget, so the run can be relaunched and the number recovered without anyone having to remember how it was produced. This is the Sacred design principle [3] applied at the scale of a handful of engineers rather than a lab: capture the context with the run, not after it.
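Because the headline number is only meaningful under a stated metric definition, it helps to pin that definition in code. The sketch below assumes the common form of the coefficient of determination, one minus the ratio of residual to total sum of squares, evaluated on paired values at the same depth points; the text does not state which R-squared variant was used, so this choice is an assumption, as is the function name.

```python
def r_squared(truth: list[float], extracted: list[float]) -> float:
    """Coefficient of determination between ground truth and an extracted trace.

    Assumes R^2 = 1 - SS_res / SS_tot, with both sequences sampled at the
    same depth points. Other definitions (for example squared correlation)
    can give different values on the same data.
    """
    if len(truth) != len(extracted):
        raise ValueError("truth and extracted must have the same length")
    if len(truth) < 2:
        raise ValueError("need at least two depth points")
    mean_truth = sum(truth) / len(truth)
    ss_res = sum((t - e) ** 2 for t, e in zip(truth, extracted))
    ss_tot = sum((t - mean_truth) ** 2 for t in truth)
    if ss_tot == 0:
        raise ValueError("R-squared is undefined for a constant ground truth")
    return 1.0 - ss_res / ss_tot
```

A perfect trace scores 1.0, and a trace that does no better than the ground-truth mean scores 0.0, which is why the definition belongs in the provenance alongside the number.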

Results

The instrument below is the ledger we kept, made interactive. Each row is one of the five logged runs, pinned to its loss function, its dataset version, its epoch budget, and its headline R-squared. Selecting a row opens its reproduction card, and the epoch lever shows where the recorded 50-epoch budget pins the checkpoint the metric was reported from.

A provenance ledger that links each logged training run to the four facts you need to regenerate it: its loss function, its dataset version, its epoch budget, and its headline metric. Five loss functions were evaluated on one shared multiclass corpus (Dice, Focal, Lovasz, Soft CE, Tversky), each at the standard 50-epoch budget on the 15,000-instance dataset with an 80/20 train-validation split. Click a row to open its reproduction card, and drag the epoch lever to see where the recorded budget pins the checkpoint. The orange row is the run that produced the headline curve-1 R-squared of 0.9891 (Tversky). The loss-function set, the epoch budget, the instance counts, the split, and the headline R-squared are the engagement's own recorded figures; the per-run metric ordering across the four non-Tversky rows is an illustrative validation ranking used only to order the ledger.

Two readings come off the ledger. The first is the ablation result itself. Because the corpus, the split, and the budget are identical across all five rows, the metric column is a like-for-like comparison of the loss functions, and the Tversky run carries the headline curve-1 R-squared of 0.9891, the orange row the whole table exists to make regenerable. Tversky can be weighted to penalise missed curve pixels more heavily than false alarms, which makes it a recall-leaning loss, and on a problem where the two curve classes are scarce against an overwhelming background, leaning toward recall is exactly what keeps the faint traces from being washed out, which is consistent with it also logging the peak validation F1 of 0.55 on the curve masks. The four alternatives, the region-based Dice and Lovasz and the cross-entropy-based Focal and Soft CE, trail it on the headline regression metric. The single-variable design is what licenses that sentence: nothing but the loss changed, so under these fixed conditions and this one seed the gap is the loss's doing.
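The recall-leaning behaviour can be seen in the Tversky index itself. The sketch below works on pixel counts rather than soft probabilities and uses the convention in which `alpha` weights false positives and `beta` weights false negatives; the weights actually used in our runs are not stated in this report, so the values shown in the usage note are illustrative assumptions.

```python
def tversky_index(tp: float, fp: float, fn: float,
                  alpha: float, beta: float) -> float:
    """Tversky index TP / (TP + alpha * FP + beta * FN).

    Convention assumed here: alpha weights false positives, beta weights
    false negatives. Some libraries swap the two, so check before reuse.
    """
    denominator = tp + alpha * fp + beta * fn
    if denominator == 0:
        raise ValueError("index is undefined when tp, fp and fn are all zero")
    return tp / denominator


def tversky_loss(tp: float, fp: float, fn: float,
                 alpha: float, beta: float) -> float:
    """Loss form: one minus the Tversky index."""
    return 1.0 - tversky_index(tp, fp, fn, alpha, beta)
```

With `alpha = beta = 0.5` the index reduces to the Dice coefficient. With `beta` above `alpha`, for example 0.7 against 0.3, four missed curve pixels cost more than four spurious ones, which is the sense in which the loss leans toward recall.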

The second reading is the reproducibility one, and it is the point of the paper. The value of the ledger is not that it shows Tversky won, which a single results table would also show. It is that every row carries enough provenance to relaunch the run and recover its number from the record alone. Months after the runs were logged, the question of whether 0.9891 was real and how it was obtained has a mechanical answer rather than a recollected one: the entry names the loss as Tversky, the dataset as the 15,000-instance multiclass version under an 80/20 split, the budget as 50 epochs, and the seed as fixed, and re-running that configuration regenerates the result. The epoch lever makes the budget's role explicit: below the recorded 50-epoch budget the headline checkpoint has not been reached, so a reproduction that stops early is not reproducing the reported number but a different, earlier one. The budget is part of the provenance, not an incidental detail, and the ledger pins it.
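The mechanical answer described above can be sketched as a ledger entry plus a check that a rerun matches it. The block is a minimal illustration: `LedgerEntry`, `check_reproduction`, the dictionary keys, and the tolerance are hypothetical, and the tolerance in particular is an assumption, since the report does not say how closely a regenerated number must match.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LedgerEntry:
    loss: str
    dataset_version: str
    epoch_budget: int
    headline_metric: float
    seed: int
    split: str


def check_reproduction(entry: LedgerEntry, rerun_config: dict,
                       epochs_completed: int, rerun_metric: float,
                       tol: float = 1e-4) -> list[str]:
    """List the ways a rerun departs from its ledger entry (empty if none)."""
    problems = []
    for key in ("loss", "dataset_version", "seed", "split"):
        if rerun_config.get(key) != getattr(entry, key):
            problems.append(f"{key} differs from the ledger")
    # A run stopped early reports an earlier checkpoint, not the headline one.
    if epochs_completed < entry.epoch_budget:
        problems.append("stopped before the recorded epoch budget")
    if abs(rerun_metric - entry.headline_metric) > tol:
        problems.append("metric differs from the recorded headline value")
    return problems
```

An empty list means the rerun matched the record on every pinned fact; a rerun halted at, say, 30 of 50 epochs is flagged even if every other field agrees.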

Discussion

The honest summary is that the public reproducibility playbook scales down to a small team almost without modification, and that the scaling-down is mostly a matter of discipline rather than infrastructure. We did not need a conference-grade program to make a five-loss ablation reproducible; we needed to satisfy, by construction, the same things the NeurIPS checklist asks a paper to state [2], and to capture them at run time in the spirit of Sacred [3] rather than reconstruct them later. The dataset version and split, the seed, the epoch budget, the number of runs, and the exact metric definition are the checklist items that turned a results table into a regenerable record. None of them is expensive to log; all of them are expensive to recover once lost, which is the asymmetry that the technical-debt framing of Sculley and colleagues captures precisely [5].
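Satisfying the checklist by construction can be as simple as refusing to log a run whose record leaves one of those items unstated. The sketch below is one hedged way to do that; the field names are hypothetical, chosen to mirror the items listed above rather than taken from the NeurIPS checklist or from any tool.

```python
# Hypothetical field names mirroring the checklist items named in the text.
REQUIRED_FIELDS = (
    "dataset_version",
    "split",
    "seed",
    "epoch_budget",
    "num_runs",
    "metric_definition",
)


def missing_fields(record: dict) -> list[str]:
    """Return the required fields that are absent or empty in a run record."""
    return [name for name in REQUIRED_FIELDS if record.get(name) in (None, "")]


def assert_loggable(record: dict) -> None:
    """Refuse a run record that leaves a checklist item unstated."""
    missing = missing_fields(record)
    if missing:
        raise ValueError(f"run record is missing: {', '.join(missing)}")
```

Calling `assert_loggable` before a run starts moves the cost of a missing field to the moment it is cheapest to fix.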

Where our practice falls short of the literature's ideal is exactly where a small team's budget bites, and we name it rather than paper over it. We ran one seed per loss, not the multiple seeds that Bouthillier and colleagues show are needed to estimate variance and to claim one method beats another with confidence [4]. That means our ablation ranking should be read as a logged, regenerable comparison under fixed conditions, not as a statistically significant ordering of the five losses. The reproducibility we achieved is the lower-rung kind in Gundersen and Kjensmo's ladder [1]: the runs can be repeated from their records and the numbers regenerated, which is the rung that protects a team from losing its own results, but it is not the higher rung of demonstrating that the finding holds under independent variation. For a research engagement whose job was to ship a working digitiser rather than to publish a benchmark, that is the right rung to have nailed, and being explicit about which rung we reached is itself part of the discipline.

The position we want to stake out is modest and specific. The reproducibility field was built by groups auditing the gap, designing the checklists, and writing the tooling, and a small team building a model should read that work first and adopt its specification wholesale. What a small team adds is not method but proof that the specification is satisfiable on a shoestring: a four-key provenance ledger, captured at run time, was enough to make a five-loss ablation and its 0.9891 headline regenerable a quarter later. The ledger is the artifact we would hand to the next engineer who asks whether a number from last spring can be trusted, and the answer it gives is that they do not have to trust it because they can rebuild it.

Limitations

This is a practice report with explicit edges. The reproducibility we document is repeatability from our own records under fixed conditions, the lowest rung of the graded definition in the literature [1], not replication under independent variation; a third party regenerating these numbers from an independent implementation would test a stronger claim than we make. We ran a single fixed seed per loss function, so the five-way ranking does not account for the run-to-run variance that a proper comparison requires [4], and the ordering of the four non-Tversky losses in particular should be treated as indicative rather than as a significance test. The headline figures we report, the curve-1 R-squared of 0.9891, the peak validation F1 of 0.55, the 15,000-instance multiclass corpus, the 80/20 split, and the 50-epoch budget, are the engagement's own recorded values for our synthetic-corpus runs, not population statistics over physical logs, which are scanned at other settings and would carry their own variance. The instrument's per-run metric ordering across the four non-Tversky rows is an illustrative validation ranking used to order the ledger, while the headline R-squared, the loss set, the budget, the instance counts, and the split are the recorded figures. Finally, our claim that provenance captured at run time is the decisive practice rests on our own experience regenerating these runs rather than on a controlled comparison against a team that reconstructed provenance after the fact; a study contrasting the two regimes would test it more rigorously than this report does.

References

[1] Gundersen, O. E., and Kjensmo, S. State of the art: reproducibility in artificial intelligence. Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), 2018. An audit of empirical AI papers at top venues against a set of documented reproducibility factors, finding that very few documented the experiment well enough to be regenerated and proposing a graded definition of reproducibility. https://ojs.aaai.org/index.php/AAAI/article/view/11503

[2] Pineau, J., Vincent-Lamarre, P., Sinha, K., Lariviere, V., Beygelzimer, A., d'Alche-Buc, F., Fox, E., and Larochelle, H. Improving reproducibility in machine learning research (a report from the NeurIPS 2019 reproducibility program). Journal of Machine Learning Research, 2021. A report on the conference-scale reproducibility program built around a machine-learning reproducibility checklist, a code-submission policy, and a community reproducibility challenge. https://arxiv.org/abs/2003.12206

[3] Greff, K., Klein, A., Chovanec, M., Hutter, F., and Schmidhuber, J. The Sacred infrastructure for computational research. Proceedings of the 16th Python in Science Conference (SciPy), 2017. An open-source framework for running computational experiments that captures configuration, randomness, source code, and run metadata into a queryable record so a result can be traced back to the conditions that produced it. https://proceedings.scipy.org/articles/shinma-7f4c6e7-008

[4] Bouthillier, X., Delaunay, P., Bronzi, M., Trofimov, A., Nichyporuk, B., et al. Accounting for variance in machine learning benchmarks. Proceedings of Machine Learning and Systems (MLSys), 2021. A study of the sources of variance in benchmark results, including data sampling, initialisation, and hyperparameters, showing that a single seeded run can mislead and that variance must be estimated to compare methods honestly. https://arxiv.org/abs/2103.03098

[5] Sculley, D., Holt, G., Golovin, D., Davydov, E., Phillips, T., Ebner, D., Chaudhary, V., Young, M., Crespo, J.-F., and Dennison, D. Hidden technical debt in machine learning systems. Advances in Neural Information Processing Systems (NeurIPS), 2015. The paper that named the maintenance costs specific to machine-learning systems, including entanglement, undeclared data dependencies, and configuration debt, all of which a research repository accrues quietly. https://papers.nips.cc/paper/5656-hidden-technical-debt-in-machine-learning-systems
