
Case Study

Cutting Multiclass Training to 10 Hours on a 15,000-Log Set

The multiclass segmenter had the loss and the labels it needed. What it did not have was a training loop that finished in a workday. On a 15,000-instance set of variable-width log images the first honest measurement was not a number we could plan around, because the run was still stuck in a per-instance crawl at batch size one, the only batch that fit. This is our account of the throughput work that pulled the wall-clock down to about ten hours: a custom collate_fn that padded variable image dimensions into one batch tensor, a batch size of sixteen, mixed precision, gradient checkpointing, and a dataloader that kept the card fed. We changed how the run was fed, not what it learned.


We had settled the hard modelling questions before this run and were left with a plainer one that turned out to matter just as much: could we train the multiclass segmenter fast enough to iterate on it. The dataset was fifteen thousand synthetic log instances, the target was three classes, and the schedule was fifty epochs. None of that was in question here. The problem was that the first version of the training loop treated the fifteen thousand instances more or less one at a time, and on images this large that is not a training run, it is a vigil. This piece is only about the speed. Not the loss, which was decided elsewhere, and not the choice to move from binary masks to a three-class target, which was its own decision with its own reasons. Just the wall-clock, and the handful of levers that pulled it down to about ten hours.

Why one image at a time was the whole problem

The images are the reason the naive loop was slow, and they are worth describing precisely because they dictate everything that follows. A synthetic log instance is far wider than it is tall and, crucially, not a fixed size. Widths ran from 3,200 to 12,800 pixels and heights from 480 to 640, so two images drawn for the same batch rarely shared a shape. That variability is fatal to the ordinary way you accelerate training. The standard speedup is to stack many samples into one batch tensor and let the GPU work on all of them in a single pass, but you cannot stack tensors of different widths and heights into one rectangular array. The path of least resistance is to give up on batching and feed the network one image per step, and that is exactly where the binary run had lived: batch size one, pinned there not by choice but by the fact that variable image sizes left no obvious way to group them.

Batch size one is the slowest place a training loop can be. Every one of the fifteen thousand instances becomes its own optimiser step, so a single epoch is fifteen thousand steps, and fifty epochs is three quarters of a million of them. Each step carries fixed overhead that has nothing to do with the arithmetic of the forward and backward pass (the launch of the kernels, the synchronisation, the gradient update), and at batch one you pay that overhead fifteen thousand times per epoch instead of amortising it across a group. The GPU spends much of its time waiting rather than computing. We had a concrete reference point for how this felt: the earlier binary run, on two thousand instances at batch one for the same fifty epochs, took 110 minutes. Scale that per-instance cost up to fifteen thousand instances and the multiclass run was heading somewhere we could not plan a week around.

The lever that mattered was the batch, and the collate_fn is what unlocked it

The single relationship that governs this whole story is simple. Total wall-clock is roughly the number of optimiser steps times the cost per step, and the number of steps per epoch is the instance count divided by the batch size. Every fixed-overhead cost you pay per step gets divided by the batch. Take the batch from one to sixteen and you have cut the number of steps per epoch by a factor of sixteen, from fifteen thousand down to under a thousand, and you have handed the GPU sixteen images to work on in parallel where before it had one. That is the entire mechanism. Everything else we did was in service of making a batch of sixteen possible at all on images this awkward.
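The relationship is small enough to write down. The sketch below is an illustration, not code from the engagement: it assumes wall-clock is strictly proportional to the number of optimiser steps, calibrates the cost per step on the one measured multiclass point (550 minutes at batch 16), and divides the full 15,000 instances by the batch. The function names are hypothetical.

```python
import math

INSTANCES = 15_000        # multiclass set size, from the text
EPOCHS = 50               # schedule, from the text
MEASURED_BATCH = 16       # batch size of the measured run
MEASURED_MINUTES = 550    # measured wall-clock at that batch size


def steps_per_epoch(instances: int, batch_size: int) -> int:
    """Optimiser steps in one epoch; the last partial batch still costs a step."""
    return math.ceil(instances / batch_size)


def projected_hours(batch_size: int) -> float:
    """Projected wall-clock, ASSUMING a constant cost per optimiser step.

    The per-step cost is calibrated on the single measured point, so this
    reproduces the measurement at batch 16 and is only a model elsewhere.
    """
    measured_steps = steps_per_epoch(INSTANCES, MEASURED_BATCH) * EPOCHS
    minutes_per_step = MEASURED_MINUTES / measured_steps
    total_steps = steps_per_epoch(INSTANCES, batch_size) * EPOCHS
    return total_steps * minutes_per_step / 60


for batch in (1, 4, 8, 12, 16):
    print(batch, steps_per_epoch(INSTANCES, batch), round(projected_hours(batch), 1))
```

Under that assumption batch 16 gives 938 steps per epoch and batch 1 gives 15,000. Only the batch-16 wall-clock is a measurement; every other value the loop prints is the model's projection.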

Making it possible came down to a custom collate_fn. A dataloader's collate function is the step that takes a list of individual samples and assembles them into one batched tensor, and the default one assumes every sample is the same shape. Ours could not assume that, so we wrote a collate_fn that took a group of sixteen variable-dimension images and padded them to a common shape before stacking, recording enough to keep the padding from being mistaken for signal downstream. With that in place the dataloader could actually return batches of sixteen instead of surrendering to batch one, and the throughput relationship above stopped being theoretical.
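A minimal sketch of a padding collate function of this kind, in PyTorch. It is not the collate_fn from the run. The name pad_collate, the (image, mask) sample layout, padding on the bottom and right edges, the value 255 for padded mask pixels, and the returned sizes tensor are all assumptions made for illustration.

```python
import torch


def pad_collate(samples, pad_value=0.0, ignore_index=255):
    """Pad variable-size (image, mask) pairs to a common shape and stack them.

    Assumed layout: image is a float tensor (C, H, W), mask is a long tensor
    (H, W) of class ids. Padded mask pixels are set to ignore_index so a loss
    can skip them; the original sizes are returned for cropping predictions.
    """
    n = len(samples)
    channels = samples[0][0].shape[0]
    max_h = max(img.shape[-2] for img, _ in samples)
    max_w = max(img.shape[-1] for img, _ in samples)

    # Allocate the batch once, pre-filled with the padding values.
    images = samples[0][0].new_full((n, channels, max_h, max_w), pad_value)
    masks = samples[0][1].new_full((n, max_h, max_w), ignore_index)
    sizes = []

    for i, (img, mask) in enumerate(samples):
        h, w = img.shape[-2:]
        images[i, :, :h, :w] = img   # real pixels go in the top-left corner
        masks[i, :h, :w] = mask      # everything outside stays ignore_index
        sizes.append((h, w))

    return images, masks, torch.tensor(sizes)
```

A function like this would be passed to the loader as `DataLoader(dataset, batch_size=16, collate_fn=pad_collate)`, and a loss built with `torch.nn.CrossEntropyLoss(ignore_index=255)` would then leave the padded pixels out of the gradient. Padding every image to the widest in its batch wastes memory when widths differ a lot, which is one reason a sketch like this is a starting point and not a finished design.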

[Interactive figure: Multiclass wall-clock, 15,000 logs, 50 epochs. Headline: 9.2 h projected training wall-clock at batch 16. Panel A, "Levers that let batch 16 fit": custom collate_fn (pads variable dims into one batch), efficient dataloader (keeps the GPU fed), mixed precision (halves activation memory), gradient checkpointing (recompute for memory); batch ceiling 16. Unchanged: workload, widths 3,200-12,800 px, heights 480-640 px, 80/20 split. Panel B, "Wall-clock falls as batch size rises": x-axis batch size (1, 4, 8, 12, 16), y-axis wall-clock (0 h to 147 h); at batch 16, 938 steps per epoch, 9.2 h, 16.0x vs batch 1. Sourced: 550 min at batch 16 and 110 min binary at batch 1 (measured), 15,000 and 2,000 instances, 50 epochs; the batch-to-wall-clock curve between them is the step-count model, illustrative.]
How a small set of throughput levers collapsed the multiclass training wall-clock on the 15,000-log, 50-epoch build, without touching the loss or the labels. Total wall-clock tracks the number of optimiser steps per epoch, which is instances divided by effective batch size, so raising the batch lowers the step count and the clock with it. Lever A toggles the four enablers that let a batch of 16 fit at all on images 3,200 to 12,800 pixels wide: a custom collate_fn that pads variable-dimension images into one batch tensor, an efficient dataloader that keeps the GPU fed, mixed precision that halves the activation footprint, and gradient checkpointing that trades recompute for memory. Drop any of them and the batch that fits falls back toward 1, the memory-bound regime the binary run lived in. Lever B drags the effective batch. The curve plots projected wall-clock against batch size at the multiclass workload, anchored on two measured points from the engagement: the binary crawl at batch 1 and 110 minutes on 2,000 instances, and the multiclass floor at batch 16 and 550 minutes, about 10 hours, on 15,000 instances. The orange marker is the only element that argues: the wall-clock sliding down the curve toward the floor as batch size rises. The two endpoint times, both batch sizes, the instance and epoch counts, the pixel ranges, and the 80/20 split are sourced from the ML Progress Handover; the step-count curve between the two measured points is an illustrative model calibrated to reproduce the measured floor, not a per-batch benchmark.

The three levers that kept a batch of sixteen inside the memory budget

A batch of sixteen large images is a lot of memory, and the reason batch one had been the default was never really the collate logic alone; it was that bigger batches did not fit. So the collate_fn came with company. Mixed precision let us hold activations in half the space by using sixteen-bit floats where full precision was not needed, which roughly halved the activation footprint and, on the right hardware, sped up the math as a bonus rather than a cost. Gradient checkpointing bought more headroom by refusing to keep every intermediate activation from the forward pass in memory, instead recomputing the discarded ones during the backward pass when they are needed, a deliberate trade of a little extra compute for a lot less memory. And the dataloader itself was tuned to prepare the next batch while the current one trained, so the card was not left idle waiting on the CPU to pad and stack the next sixteen images.

None of these three changed the model's output. Mixed precision, checkpointing, and prefetching are all invisible to the loss and the weights: the network sees the same images, computes the same gradients to within numerical tolerance, and arrives at the same place. What mixed precision and checkpointing changed was whether a batch of sixteen fit in the memory budget, and what the tuned dataloader changed was whether the card stayed busy once it did. Together they are the only thing that let the batch lever move. Take either memory lever away and the batch that fits shrinks back toward one; starve the dataloader and the batch still fits but the card idles. Either way the wall-clock climbs back up the curve the instrument above traces.
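As an illustration of how these three pieces sit together in PyTorch, here is a hedged sketch, not the training code from the run. TinySegmenter is a hypothetical stand-in model, and make_loader and train_step are hypothetical names. The sketch assumes bfloat16 autocast, chosen here because it needs no loss scaling; a float16 setup would add a gradient scaler. The worker and prefetch values are placeholders, not tuned settings.

```python
import torch
import torch.nn.functional as F
from torch import nn
from torch.utils.checkpoint import checkpoint
from torch.utils.data import DataLoader


class TinySegmenter(nn.Module):
    """Hypothetical stand-in: two conv blocks and a 3-class head."""

    def __init__(self, num_classes=3):
        super().__init__()
        self.block1 = nn.Sequential(nn.Conv2d(1, 8, 3, padding=1), nn.ReLU())
        self.block2 = nn.Sequential(nn.Conv2d(8, 8, 3, padding=1), nn.ReLU())
        self.head = nn.Conv2d(8, num_classes, 1)

    def forward(self, x):
        # Gradient checkpointing: activations inside each block are not kept
        # after the forward pass; they are recomputed during backward.
        x = checkpoint(self.block1, x, use_reentrant=False)
        x = checkpoint(self.block2, x, use_reentrant=False)
        return self.head(x)


def make_loader(dataset, collate_fn, batch_size=16, num_workers=4):
    """Loader whose worker processes prepare upcoming batches in the background."""
    return DataLoader(
        dataset,
        batch_size=batch_size,
        shuffle=True,
        collate_fn=collate_fn,
        num_workers=num_workers,   # placeholder value
        prefetch_factor=2,         # batches queued ahead per worker
        persistent_workers=True,   # keep workers alive between epochs
        pin_memory=True,           # faster host-to-GPU copies
    )


def train_step(model, optimizer, images, masks, device_type="cuda"):
    """One optimiser step with the forward pass under bfloat16 autocast."""
    optimizer.zero_grad(set_to_none=True)
    with torch.autocast(device_type=device_type, dtype=torch.bfloat16):
        logits = model(images)
    # Compute the loss in float32; 255 marks padded pixels (an assumption).
    loss = F.cross_entropy(logits.float(), masks, ignore_index=255)
    loss.backward()
    optimizer.step()
    return loss.item()
```

The weights and their gradients stay in float32 throughout; only the activations inside the autocast region are held in sixteen bits, which is where the memory saving comes from.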

What the clock actually read

With the collate_fn, batch sixteen, mixed precision, checkpointing, and a fed dataloader all in place, the multiclass run over fifteen thousand instances for fifty epochs took 550 minutes, a little over nine hours, which we rounded to about ten hours when we talked about it as an overnight job. The comparison that made the point internally was against the binary anchor: 110 minutes for two thousand instances at batch one. The multiclass set was seven and a half times larger and the run took five times as long, not the seven and a half or more you would expect if per-instance cost had held constant. The batch had absorbed the difference. We used an 80/20 train-validation split throughout, so the epoch was training on twelve thousand of the fifteen thousand instances, and that is the number the throughput math above is really dividing.
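The comparison can be checked with a few lines of arithmetic. This sketch uses only the two measured runs quoted above, averages cost over every instance in each set, and does not account for any difference between the binary and multiclass models; the function name is hypothetical.

```python
import math

EPOCHS = 50


def seconds_per_instance_epoch(total_minutes, instances, epochs=EPOCHS):
    """Average wall-clock seconds spent per instance per epoch."""
    return total_minutes * 60 / (instances * epochs)


binary = seconds_per_instance_epoch(110, 2_000)       # batch 1 run
multiclass = seconds_per_instance_epoch(550, 15_000)  # batch 16 run

size_ratio = 15_000 / 2_000             # how much larger the multiclass set is
time_ratio = 550 / 110                  # how much longer the run took
constant_cost_minutes = 110 * size_ratio  # wall-clock if per-instance cost had held

# With the 80/20 split, each training epoch covers 12,000 instances.
train_instances = 15_000 * 80 // 100
train_steps_per_epoch = math.ceil(train_instances / 16)

print(f"binary:     {binary:.3f} s per instance-epoch")
print(f"multiclass: {multiclass:.3f} s per instance-epoch")
print(f"set is {size_ratio}x larger, run took {time_ratio}x as long")
print(f"constant per-instance cost would give {constant_cost_minutes:.0f} min")
print(f"training steps per epoch at batch 16: {train_steps_per_epoch}")
```

Per-instance cost falls from 0.066 s to 0.044 s, a factor of 1.5, which is the gap between the 825 minutes a constant per-instance cost would imply and the 550 minutes measured.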

The reason ten hours mattered was not the ten hours. It was that ten hours is one night. A run that finishes overnight is a run you can start at the end of a day, read in the morning, and change something about before the next evening, which turns a training loop into something you can iterate against on a daily cadence. Two days per run does not give you that. It gives you a couple of attempts a week and a strong incentive to guess rather than measure. Pulling the wall-clock under a working night was the difference between a model we tuned by experiment and one we would have tuned by superstition.

Limitations

The specific 550-minute figure is one number from one engagement on one machine, and it should be read as evidence that these levers move wall-clock by roughly this much on a workload like this, not as a benchmark to port. The projection between the two measured points in the exhibit is a step-count model calibrated to reproduce the measured floor, not a per-batch timing; the real curve would bend where memory bandwidth, kernel efficiency, or dataloader throughput become the binding constraint rather than step count, and we did not sweep the intermediate batch sizes to map that bend. Larger batches also change the effective gradient noise, so beyond a point raising the batch is no longer free with respect to what the model learns, even though it stays cheap with respect to the clock; on this run sixteen was the ceiling memory allowed and we did not need to push against the learning question. Finally, everything here rests on the synthetic image dimensions we describe, 3,200 to 12,800 pixels wide and 480 to 640 tall. Field scans with a wider spread of sizes would put more pressure on the padding in the collate_fn, and the memory headroom that made batch sixteen fit is not guaranteed to survive a heavier tail of very wide images.


© 2026 Copyright. Earthscan