The first real decision on VeerNet, our raster-log digitization model, was not about architecture or data. It was a question every team faces and almost nobody writes down honestly: with one GPU box and a deadline, which optimiser do you train with? The literature has a long, careful argument about Adam versus stochastic gradient descent, and most of it is about where each one lands after a generous training run on hardware you can spare. We had neither generosity nor spare hardware. We had a single memory-constrained machine that the whole team shared, a synthetic dataset we were still growing, and a client expecting a curve out of a scanned log. So we resolved the optimiser question the way the box let us resolve it: as a race against the clock.
A single GPU box and a question theory could not answer
The textbook framing of the Adam-versus-SGD debate is about the quality of the solution. Adam, the adaptive-moment method, adapts a per-parameter learning rate and tends to make fast early progress [1]. SGD with momentum, the older and simpler workhorse, often reaches a final solution that generalizes slightly better if you tune its schedule and let it run [2]. There is even a well-known result arguing that adaptive methods can reach solutions that generalize worse than those of plain SGD on some problems [3]. If you have a cluster and weeks, that argument is worth having, because the prize is the last fraction of a percent of accuracy at a minimum you can afford to chase.
We could not afford to chase it. Our binary segmentation stage trained on 2,000 synthetic instances, and because those synthetic logs vary in size from one example to the next, we were stuck at a batch size of 1 until we wrote the padding machinery that came later. One pass of 50 epochs over that binary set took 110 minutes on our box. The multiclass stage, on 15,000 instances, took 550 minutes for the same 50 epochs. Those are not numbers you iterate on casually. At batch size 1 on a shared machine, a training run costs anywhere from just under two hours to more than nine, and a wrong optimiser choice does not cost you a worse minimum. It costs you the day. The question was never which optimiser finds the prettier basin. It was which optimiser hands us a curve we can actually use before the box has to go to someone else.
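The per-epoch and per-instance cost behind those budgets can be worked out directly. The sketch below uses only the figures quoted above; the function name is illustrative, and it assumes the quoted minutes cover the whole run, so the per-instance figure is an upper bound on pure training-step time.

```python
# Budgets quoted in the text: (instances, epochs, total minutes).
BUDGETS = {
    "binary": (2_000, 50, 110),
    "multiclass": (15_000, 50, 550),
}


def per_unit_cost(instances, epochs, total_minutes):
    """Return (minutes per epoch, seconds per training instance)."""
    minutes_per_epoch = total_minutes / epochs
    # At batch size 1, one instance is one optimiser step.
    seconds_per_instance = minutes_per_epoch * 60 / instances
    return minutes_per_epoch, seconds_per_instance


for stage, budget in BUDGETS.items():
    per_epoch, per_instance = per_unit_cost(*budget)
    hours_per_run = budget[2] / 60
    print(
        f"{stage}: {per_epoch:.1f} min/epoch, "
        f"{per_instance:.3f} s/instance, {hours_per_run:.1f} h/run"
    )
```

That works out to 2.2 minutes per epoch for the binary stage and 11 minutes per epoch for the multiclass stage, so even a single extra epoch is a visible cost on a shared box.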
The only metric we let decide: minutes to a usable curve
So we wrote the decision rule down before we ran anything, because a rule you invent after seeing the numbers is not a rule. We would race Adam against SGD on identical settings: the same model, the same 50-epoch schedule, the same batch size of 1, the same box. We would score them on one thing only: wall-clock minutes to first reach a usable validation R-squared. Not minutes to convergence. Not the final R-squared at epoch 50. Minutes to usable, where usable meant a regression fit on the recovered curve good enough to put in front of a petrophysicist without embarrassment.
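Written as code, a rule like this is only a few lines. The following is a minimal sketch, not the project's actual tooling: the function names, the log format of `(elapsed_minutes, val_r2)` pairs, and the placeholder values in the demo are all assumptions made for illustration.

```python
def minutes_to_usable(log, threshold):
    """Return the wall-clock minute at which validation R^2 first reaches
    `threshold`, or None if it never does within the logged run.

    `log` is a list of (elapsed_minutes, val_r2) pairs in time order.
    """
    for elapsed_minutes, val_r2 in log:
        if val_r2 >= threshold:
            return elapsed_minutes
    return None


def pick_winner(logs, threshold):
    """Fewest minutes to usable wins; final R^2 is deliberately ignored.

    `logs` maps an optimiser name to its log. Returns (winner, times),
    where winner is None if no optimiser reaches the threshold.
    """
    times = {name: minutes_to_usable(log, threshold) for name, log in logs.items()}
    reached = {name: t for name, t in times.items() if t is not None}
    if not reached:
        return None, times
    return min(reached, key=reached.get), times


if __name__ == "__main__":
    # Placeholder logs that only exercise the functions; not measurements.
    demo_logs = {
        "opt_a": [(10, 0.50), (20, 0.90), (30, 0.92)],
        "opt_b": [(10, 0.20), (20, 0.60), (30, 0.95)],
    }
    print(pick_winner(demo_logs, threshold=0.90))
```

Note that in the placeholder demo the optimiser with the higher final value loses, which is exactly what the rule is meant to allow: it rewards reaching the bar first, not finishing highest.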
The reason for that metric is operational, not academic. A digitized curve does not have to be perfect to be worth shipping; it has to be close enough that correcting it by hand is faster than tracing it from scratch. On our scale that threshold sat high but reachable, and the run that eventually defined our finish line peaked at an R-squared of 0.9891 with a mean absolute error of 0.0132. That pair of numbers became the anchor for everything: the finish line was the best curve the run ever produced, and the race was about who got near it first in box-time, not who edged it out at the end of a schedule we could rarely afford to run twice.
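For reference, the two numbers can be computed as below. This is a sketch under an assumption: the text does not spell out how R-squared and mean absolute error were calculated, so the code uses the standard coefficient of determination and the standard MAE between a reference curve and a recovered curve. The function names are illustrative.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and the same length")
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined for a constant reference curve")
    return 1.0 - ss_res / ss_tot


def mean_absolute_error(y_true, y_pred):
    """Average absolute difference, in the units of the curve values."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and the same length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)
```

The two metrics answer different questions: R-squared says how much of the reference curve's variation the recovered curve explains, while MAE says how far off a typical point is in the curve's own units, which is closer to what a person correcting the curve by hand would feel.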
Wiring the race onto one memory-bound box
Setting up a fair race on constrained hardware takes more discipline than it sounds. The two optimisers see the same model and the same data, but they do not respond to the same learning rate, so a naive head-to-head with one shared learning rate would just be a rigged fight. We gave each optimiser the learning rate it wanted: a small, steady rate for Adam, where the adaptive moments do the per-parameter scaling [1], and a larger rate with momentum for SGD, the regime where it is supposed to shine [2]. We left the weight-decay handling plain rather than reaching for the decoupled variant that later became the standard Adam recipe [4], because at this stage the question was the optimiser, not its regulariser. We held everything else identical. Same 50 epochs, same batch size of 1 forced by the variable image dimensions, same validation split, same box.
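Why one shared learning rate would rig the fight is easiest to see in the update rules themselves. The sketch below implements one step of each in plain Python: Adam as described in [1], and SGD with classical momentum in the form used in [2]. It is an illustration rather than the training code, and the learning rates and gradient values in the demo are hypothetical, since the text does not give the ones that were used.

```python
import math


def sgd_momentum_step(params, grads, velocity, lr, momentum=0.9):
    """One step of SGD with classical momentum: step size scales with the gradient."""
    new_velocity = [momentum * v - lr * g for v, g in zip(velocity, grads)]
    new_params = [p + v for p, v in zip(params, new_velocity)]
    return new_params, new_velocity


def adam_step(params, grads, m, v, t, lr, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step; `t` is the 1-based step count used for bias correction."""
    m = [beta1 * mi + (1 - beta1) * g for mi, g in zip(m, grads)]
    v = [beta2 * vi + (1 - beta2) * g * g for vi, g in zip(v, grads)]
    m_hat = [mi / (1 - beta1 ** t) for mi in m]
    v_hat = [vi / (1 - beta2 ** t) for vi in v]
    # Dividing by sqrt(v_hat) normalises each parameter's step individually.
    new_params = [
        p - lr * mh / (math.sqrt(vh) + eps)
        for p, mh, vh in zip(params, m_hat, v_hat)
    ]
    return new_params, m, v


if __name__ == "__main__":
    # Hypothetical gradients four orders of magnitude apart.
    grads = [100.0, 0.01]
    zeros = [0.0, 0.0]
    sgd_params, _ = sgd_momentum_step(zeros, grads, zeros, lr=0.01)
    adam_params, _, _ = adam_step(zeros, grads, zeros, zeros, t=1, lr=0.001)
    print("SGD first step: ", sgd_params)
    print("Adam first step:", adam_params)
```

On the first step Adam moves both parameters by roughly its learning rate regardless of gradient scale, while SGD's step is proportional to each gradient. That is why the two optimisers need differently sized learning rates to be compared fairly.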
We also instrumented the thing we actually cared about. Instead of only logging the end-of-run metrics, we logged validation R-squared as wall-clock time accumulated, so we could read each optimiser's accuracy at the 20-minute mark, the 60-minute mark, the 100-minute mark, rather than only at epoch 50. That reframing, from epochs to minutes, is the whole point. Epochs are free in theory and ruinously expensive on one shared machine. Minutes are the currency the project was actually spending. Once the axis was minutes, the comparison stopped being a research plot and became a scheduling decision.
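Instrumentation of this kind needs very little code. Here is a minimal sketch of a wall-clock logger, assuming validation R-squared is computed somewhere in the training loop and passed in; the class and method names are hypothetical, and the injectable `clock` argument exists only so the logger can be exercised without waiting.

```python
import time


class WallClockLog:
    """Record (elapsed_minutes, value) pairs against a monotonic clock."""

    def __init__(self, clock=time.perf_counter):
        self._clock = clock
        self._start = clock()  # the race starts when the logger is created
        self.records = []

    def log(self, val_r2):
        """Store a validation score with the minutes elapsed so far."""
        elapsed_minutes = (self._clock() - self._start) / 60.0
        self.records.append((elapsed_minutes, val_r2))
        return elapsed_minutes

    def value_at(self, minute):
        """Latest logged value at or before `minute`, or None if none yet."""
        latest = None
        for elapsed, value in self.records:
            if elapsed > minute:
                break
            latest = value
        return latest
```

In use, the logger would be created just before training starts and `log(...)` called after each validation pass, so the elapsed time includes data loading, training steps and validation alike. `value_at(20)`, `value_at(60)` and `value_at(100)` then give the readings at the 20-, 60- and 100-minute marks.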
Watching SGD lose the morning, then the afternoon
The binary stage settled it fast. Adam climbed hard in the first stretch of the 110-minute budget, reaching a curve we would call usable while SGD was still grinding up the early, near-flat part of its trajectory. SGD was not broken. With the larger learning rate and momentum it was doing exactly what the theory says it does, descending steadily and promising a competitive minimum if we let the full schedule finish. But "if we let it finish" was the luxury we did not have. By the time SGD reached a usable fit, Adam had been usable for a long while, and on a box someone else needed by afternoon, that gap was the entire decision.
The multiclass stage made the same point at five times the run time and seven and a half times the data, which is what convinced us it was not a fluke of the small dataset. At 550 minutes per run, the cost of betting on the slower climber was not minutes; it was most of a day per experiment, and we were running many experiments. Adam's early lead, expressed in minutes rather than epochs, compounded into real calendar time across the dozens of training runs the segmentation work needed. The honest qualification is that we never ran SGD to a fully tuned, schedule-perfect finish on this hardware, precisely because doing so would have cost the box-time the whole exercise was trying to protect. We were not measuring which optimiser is better in the limit. We were measuring which optimiser is better to live with on one machine and a deadline, and that is a different, more practical question.
What the wall clock said
The exhibit below is the race as we read it. It puts wall-clock minutes on the horizontal axis, anchored to the two real budgets, 110 minutes over 2,000 binary instances and 550 minutes over 15,000 multiclass instances, both at 50 epochs and batch size 1. The orange line is the finish line, the run's real peak R-squared of 0.9891. You can flip between the binary and multiclass budgets, drag a playhead across box-time to read each optimiser's accuracy at any minute, and move the usable threshold to set the bar each optimiser has to clear. The verdict it reports is the one number we cared about: how many minutes of box-time Adam saved over SGD in reaching a usable curve.
Read the race with the threshold parked where we parked it, high but shy of the finish line, and the story is plain. Adam crosses into usable territory early, banks the rest of the budget, and the curve flattens as it approaches the peak. SGD reaches the same neighbourhood, but it does so late in the same schedule, and on a budget measured in hundreds of minutes per run, late is expensive. Slide the threshold down and the gap narrows, because almost anything is reachable quickly if you do not ask for much. Slide it up toward 0.9891 and the gap is moot, because near the finish line the two are converging anyway. The decision lives in the middle, at the bar where a curve becomes shippable, and across that whole middle band Adam reaches it first in box-time. That, and nothing about asymptotic minima, is why Adam won.
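The threshold sweep described here can be reproduced from two logged curves. The sketch below is one possible way to do it, assuming each curve is a time-ordered list of `(elapsed_minutes, val_r2)` pairs and that linear interpolation between logged points is acceptable. The function names are illustrative, and the two curves in the demo are placeholder shapes chosen only to exercise the code, not measurements from VeerNet.

```python
def crossing_minute(log, threshold):
    """First minute at which the curve reaches `threshold`, linearly
    interpolated between logged points; None if it is never reached."""
    prev = None
    for minute, r2 in log:
        if r2 >= threshold:
            if prev is None:
                return minute
            prev_minute, prev_r2 = prev
            # prev_r2 < threshold <= r2 here, so the denominator is positive.
            frac = (threshold - prev_r2) / (r2 - prev_r2)
            return prev_minute + frac * (minute - prev_minute)
        prev = (minute, r2)
    return None


def gap_by_threshold(fast_log, slow_log, thresholds):
    """Map each threshold to minutes saved (slow minus fast), or None
    when either curve never reaches that threshold."""
    gaps = {}
    for threshold in thresholds:
        fast = crossing_minute(fast_log, threshold)
        slow = crossing_minute(slow_log, threshold)
        gaps[threshold] = None if fast is None or slow is None else slow - fast
    return gaps


if __name__ == "__main__":
    # Placeholder curves on a 110-minute axis; illustrative shapes only.
    curve_a = [(0, 0.0), (10, 0.8), (50, 0.95), (110, 0.99)]
    curve_b = [(0, 0.0), (20, 0.4), (80, 0.9), (110, 0.99)]
    for threshold, gap in gap_by_threshold(curve_a, curve_b, [0.4, 0.9, 0.99]).items():
        print(f"threshold {threshold}: minutes saved = {gap}")
```

Returning `None` when a curve never reaches the bar keeps "no answer within the budget" distinct from "no gap", which matters when the threshold is pushed above what either run achieved.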
Why we still re-run this race on every new box
The thing we kept from these runs is a habit, not a verdict. We did not conclude "Adam beats SGD" as a law, because we know the answer is hardware-dependent and metric-dependent, and the moment we got a machine that let us train at a real batch size the calculus could shift. What we concluded is that an optimiser choice on constrained hardware is a wall-clock question wearing a convergence-theory costume, and you should undress it before you answer. The right axis is minutes of the box-time you actually have, the right finish line is the accuracy at which your output becomes useful to the person downstream, and the right winner is whoever reaches that line first under those constraints.
So we re-run this exact race whenever the hardware underneath VeerNet changes. New box, new memory budget, possibly a new batch size, and the first thing we do is plot minutes against usable R-squared for both optimisers before committing the next month of experiments to one of them. The earliest curve-segmentation runs taught us to treat the optimiser not as a setting you inherit from a paper but as a bet you re-price every time the machine changes, because on a single shared GPU the cheapest path to a usable curve is worth more than the prettiest path to a perfect one.
References
[1] Kingma, D. P., and Ba, J. (2015). Adam: A Method for Stochastic Optimization. ICLR 2015. https://arxiv.org/abs/1412.6980
[2] Sutskever, I., Martens, J., Dahl, G., and Hinton, G. (2013). On the Importance of Initialization and Momentum in Deep Learning. ICML 2013. https://proceedings.mlr.press/v28/sutskever13.html
[3] Wilson, A. C., Roelofs, R., Stern, M., Srebro, N., and Recht, B. (2017). The Marginal Value of Adaptive Gradient Methods in Machine Learning. NeurIPS 2017. https://arxiv.org/abs/1705.08292
[4] Loshchilov, I., and Hutter, F. (2019). Decoupled Weight Decay Regularization. ICLR 2019. https://arxiv.org/abs/1711.05101